Efficiently create sparse pivot tables in pandas?
Question:
I’m working turning a list of records with two columns (A and B) into a matrix representation. I have been using the pivot function within pandas, but the result ends up being fairly large. Does pandas support pivoting into a sparse format? I know I can pivot it and then turn it into some kind of sparse representation, but isn’t as elegant as I would like. My end goal is to use it as the input for a predictive model.
Alternatively, is there some kind of sparse pivot capability outside of pandas?
edit: here is an example of a non-sparse pivot
import pandas as pd
frame=pd.DataFrame()
frame['person']=['me','you','him','you','him','me']
frame['thing']=['a','a','b','c','d','d']
frame['count']=[1,1,1,1,1,1]
frame
person thing count
0 me a 1
1 you a 1
2 him b 1
3 you c 1
4 him d 1
5 me d 1
frame.pivot('person','thing')
count
thing a b c d
person
him NaN 1 NaN 1
me 1 NaN NaN 1
you 1 NaN 1 NaN
This creates a matrix that could contain all possible combinations of persons and things, but it is not sparse.
http://docs.scipy.org/doc/scipy/reference/sparse.html
Sparse matrices take up less space because they can imply things like NaN or 0. If I have a very large data set, this pivoting function can generate a matrix that should be sparse due to the large number of NaNs or 0s. I was hoping that I could save a lot of space/memory by generating something that was sparse right off the bat rather than creating a dense matrix and then converting it to sparse.
Answers:
Here is a method that creates a sparse scipy matrix based on data and indices of person and thing. person_u
and thing_u
are lists representing the unique entries for your rows and columns of pivot you want to create. Note: this assumes that your count column already has the value you want in it.
from scipy.sparse import csr_matrix
person_u = list(sort(frame.person.unique()))
thing_u = list(sort(frame.thing.unique()))
data = frame['count'].tolist()
row = frame.person.astype('category', categories=person_u).cat.codes
col = frame.thing.astype('category', categories=thing_u).cat.codes
sparse_matrix = csr_matrix((data, (row, col)), shape=(len(person_u), len(thing_u)))
>>> sparse_matrix
<3x4 sparse matrix of type '<type 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Row format>
>>> sparse_matrix.todense()
matrix([[0, 1, 0, 1],
[1, 0, 0, 1],
[1, 0, 1, 0]])
Based on your original question, the scipy sparse matrix should be sufficient for your needs, but should you wish to have a sparse dataframe you can do the following:
dfs=pd.SparseDataFrame([ pd.SparseSeries(sparse_matrix[i].toarray().ravel(), fill_value=0)
for i in np.arange(sparse_matrix.shape[0]) ], index=person_u, columns=thing_u, default_fill_value=0)
>>> dfs
a b c d
him 0 1 0 1
me 1 0 0 1
you 1 0 1 0
>>> type(dfs)
pandas.sparse.frame.SparseDataFrame
I had a similar problem and I stumbled over this post. The only difference was that that I had two columns in the DataFrame
that define the “row dimension” (i
) of the output matrix. I thought this might be an interesting generalisation, I used the grouper
:
# function
import pandas as pd
from scipy.sparse import csr_matrix
def df_to_sm(data, vars_i, vars_j):
grpr_i = data.groupby(vars_i).grouper
idx_i = grpr_i.group_info[0]
grpr_j = data.groupby(vars_j).grouper
idx_j = grpr_j.group_info[0]
data_sm = csr_matrix((data['val'].values, (idx_i, idx_j)),
shape=(grpr_i.ngroups, grpr_j.ngroups))
return data_sm, grpr_i, grpr_j
# example
data = pd.DataFrame({'var_i_1' : ['a1', 'a1', 'a1', 'a2', 'a2', 'a3'],
'var_i_2' : ['b2', 'b1', 'b1', 'b1', 'b1', 'b4'],
'var_j_1' : ['c2', 'c3', 'c2', 'c1', 'c2', 'c3'],
'val' : [1, 2, 3, 4, 5, 6]})
data_sm, _, _ = df_to_sm(data, ['var_i_1', 'var_i_2'], ['var_j_1'])
data_sm.todense()
The answer posted previously by @khammel was useful, but unfortunately no longer works due to changes in pandas and Python. The following should produce the same output:
from scipy.sparse import csr_matrix
from pandas.api.types import CategoricalDtype
person_c = CategoricalDtype(sorted(frame.person.unique()), ordered=True)
thing_c = CategoricalDtype(sorted(frame.thing.unique()), ordered=True)
row = frame.person.astype(person_c).cat.codes
col = frame.thing.astype(thing_c).cat.codes
sparse_matrix = csr_matrix((frame["count"], (row, col)),
shape=(person_c.categories.size, thing_c.categories.size))
>>> sparse_matrix
<3x4 sparse matrix of type '<class 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Row format>
>>> sparse_matrix.todense()
matrix([[0, 1, 0, 1],
[1, 0, 0, 1],
[1, 0, 1, 0]], dtype=int64)
dfs = pd.SparseDataFrame(sparse_matrix,
index=person_c.categories,
columns=thing_c.categories,
default_fill_value=0)
>>> dfs
a b c d
him 0 1 0 1
me 1 0 0 1
you 1 0 1 0
The main changes were:
.astype()
no longer accepts “categorical”. You have to create a CategoricalDtype object.
sort()
doesn’t work anymore
Other changes were more superficial:
- using the category sizes instead of a length of the uniqued Series objects, just because I didn’t want to make another object unnecessarily
- the data input for the
csr_matrix
(frame["count"]
) doesn’t need to be a list object
- pandas
SparseDataFrame
accepts a scipy.sparse object directly now
Here is an answer that updates the approach in the answer by @Alnilam to use up-to-date pandas libraries which no longer contain all of the functions in that answer.
from scipy.sparse import csr_matrix
from pandas.api.types import CategoricalDtype
rcLabel, vLabel = ('person', 'thing'), 'count'
rcCat = [CategoricalDtype(sorted(frame[col].unique()), ordered=True) for col in rcLabel]
rc = [frame[column].astype(aType).cat.codes for column, aType in zip(rcLabel, rcCat)]
mat = csr_matrix((frame[vLabel], rc), shape=tuple(cat.categories.size for cat in rcCat))
dfPivot = ( pd.DataFrame.sparse.from_spmatrix(
mat, index=rcCat[0].categories, columns=rcCat[1].categories) )
I’m working turning a list of records with two columns (A and B) into a matrix representation. I have been using the pivot function within pandas, but the result ends up being fairly large. Does pandas support pivoting into a sparse format? I know I can pivot it and then turn it into some kind of sparse representation, but isn’t as elegant as I would like. My end goal is to use it as the input for a predictive model.
Alternatively, is there some kind of sparse pivot capability outside of pandas?
edit: here is an example of a non-sparse pivot
import pandas as pd
frame=pd.DataFrame()
frame['person']=['me','you','him','you','him','me']
frame['thing']=['a','a','b','c','d','d']
frame['count']=[1,1,1,1,1,1]
frame
person thing count
0 me a 1
1 you a 1
2 him b 1
3 you c 1
4 him d 1
5 me d 1
frame.pivot('person','thing')
count
thing a b c d
person
him NaN 1 NaN 1
me 1 NaN NaN 1
you 1 NaN 1 NaN
This creates a matrix that could contain all possible combinations of persons and things, but it is not sparse.
http://docs.scipy.org/doc/scipy/reference/sparse.html
Sparse matrices take up less space because they can imply things like NaN or 0. If I have a very large data set, this pivoting function can generate a matrix that should be sparse due to the large number of NaNs or 0s. I was hoping that I could save a lot of space/memory by generating something that was sparse right off the bat rather than creating a dense matrix and then converting it to sparse.
Here is a method that creates a sparse scipy matrix based on data and indices of person and thing. person_u
and thing_u
are lists representing the unique entries for your rows and columns of pivot you want to create. Note: this assumes that your count column already has the value you want in it.
from scipy.sparse import csr_matrix
person_u = list(sort(frame.person.unique()))
thing_u = list(sort(frame.thing.unique()))
data = frame['count'].tolist()
row = frame.person.astype('category', categories=person_u).cat.codes
col = frame.thing.astype('category', categories=thing_u).cat.codes
sparse_matrix = csr_matrix((data, (row, col)), shape=(len(person_u), len(thing_u)))
>>> sparse_matrix
<3x4 sparse matrix of type '<type 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Row format>
>>> sparse_matrix.todense()
matrix([[0, 1, 0, 1],
[1, 0, 0, 1],
[1, 0, 1, 0]])
Based on your original question, the scipy sparse matrix should be sufficient for your needs, but should you wish to have a sparse dataframe you can do the following:
dfs=pd.SparseDataFrame([ pd.SparseSeries(sparse_matrix[i].toarray().ravel(), fill_value=0)
for i in np.arange(sparse_matrix.shape[0]) ], index=person_u, columns=thing_u, default_fill_value=0)
>>> dfs
a b c d
him 0 1 0 1
me 1 0 0 1
you 1 0 1 0
>>> type(dfs)
pandas.sparse.frame.SparseDataFrame
I had a similar problem and I stumbled over this post. The only difference was that that I had two columns in the DataFrame
that define the “row dimension” (i
) of the output matrix. I thought this might be an interesting generalisation, I used the grouper
:
# function
import pandas as pd
from scipy.sparse import csr_matrix
def df_to_sm(data, vars_i, vars_j):
grpr_i = data.groupby(vars_i).grouper
idx_i = grpr_i.group_info[0]
grpr_j = data.groupby(vars_j).grouper
idx_j = grpr_j.group_info[0]
data_sm = csr_matrix((data['val'].values, (idx_i, idx_j)),
shape=(grpr_i.ngroups, grpr_j.ngroups))
return data_sm, grpr_i, grpr_j
# example
data = pd.DataFrame({'var_i_1' : ['a1', 'a1', 'a1', 'a2', 'a2', 'a3'],
'var_i_2' : ['b2', 'b1', 'b1', 'b1', 'b1', 'b4'],
'var_j_1' : ['c2', 'c3', 'c2', 'c1', 'c2', 'c3'],
'val' : [1, 2, 3, 4, 5, 6]})
data_sm, _, _ = df_to_sm(data, ['var_i_1', 'var_i_2'], ['var_j_1'])
data_sm.todense()
The answer posted previously by @khammel was useful, but unfortunately no longer works due to changes in pandas and Python. The following should produce the same output:
from scipy.sparse import csr_matrix
from pandas.api.types import CategoricalDtype
person_c = CategoricalDtype(sorted(frame.person.unique()), ordered=True)
thing_c = CategoricalDtype(sorted(frame.thing.unique()), ordered=True)
row = frame.person.astype(person_c).cat.codes
col = frame.thing.astype(thing_c).cat.codes
sparse_matrix = csr_matrix((frame["count"], (row, col)),
shape=(person_c.categories.size, thing_c.categories.size))
>>> sparse_matrix
<3x4 sparse matrix of type '<class 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Row format>
>>> sparse_matrix.todense()
matrix([[0, 1, 0, 1],
[1, 0, 0, 1],
[1, 0, 1, 0]], dtype=int64)
dfs = pd.SparseDataFrame(sparse_matrix,
index=person_c.categories,
columns=thing_c.categories,
default_fill_value=0)
>>> dfs
a b c d
him 0 1 0 1
me 1 0 0 1
you 1 0 1 0
The main changes were:
.astype()
no longer accepts “categorical”. You have to create a CategoricalDtype object.sort()
doesn’t work anymore
Other changes were more superficial:
- using the category sizes instead of a length of the uniqued Series objects, just because I didn’t want to make another object unnecessarily
- the data input for the
csr_matrix
(frame["count"]
) doesn’t need to be a list object - pandas
SparseDataFrame
accepts a scipy.sparse object directly now
Here is an answer that updates the approach in the answer by @Alnilam to use up-to-date pandas libraries which no longer contain all of the functions in that answer.
from scipy.sparse import csr_matrix
from pandas.api.types import CategoricalDtype
rcLabel, vLabel = ('person', 'thing'), 'count'
rcCat = [CategoricalDtype(sorted(frame[col].unique()), ordered=True) for col in rcLabel]
rc = [frame[column].astype(aType).cat.codes for column, aType in zip(rcLabel, rcCat)]
mat = csr_matrix((frame[vLabel], rc), shape=tuple(cat.categories.size for cat in rcCat))
dfPivot = ( pd.DataFrame.sparse.from_spmatrix(
mat, index=rcCat[0].categories, columns=rcCat[1].categories) )