Grouping large sparse pandas dataframe with groupby.sum() is very slow
Question:
I have a pandas dataframe of shape (607875, 12294). The data is sparse and looks like:
    ID   BB  CC  DD  ...
0  abc    0   0   1  ...
1  bcd    0   0   0  ...
2  abc    0   0   1  ...
...
I converted it to sparse form by calling
dataframe = dataframe.to_sparse()
Later, I grouped it by ID and summed the row values with
dataframe = dataframe.groupby("ID").sum()
For smaller dataframes this works perfectly well, but at this size it ran for an hour without finishing. Is there a way to speed it up or work around it? Are there any other sparse methods I can use, since the to_sparse method is deprecated?
The output dataframe would have shape (2000, 12294) and look like this (assuming there is no other 1 for ID abc):
    ID   BB  CC  DD  ...
0  abc    0   0   2  ...
1  bcd    0   0   0  ...
...
I have 32 GB of RAM on my PC, so memory should not be the issue.
Answers:
Pandas has its limitations, I'm afraid, and is most efficient with relatively small datasets (roughly 100 MB – 1 GB). If you want to stick with pandas, one workaround is to read the data from the source in chunks, which keeps each dataframe small. Or, if possible, filter out columns you don't need for the transformation.
Otherwise, you should check out frameworks such as PySpark or Hadoop, which are more suitable for transformations on larger datasets.
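A minimal sketch of the chunked approach: compute a partial groupby-sum per chunk, then combine the partials with one more groupby-sum over the ID index. The in-memory CSV below is a toy stand-in; in practice you would pass a real file path (e.g. a hypothetical "data.csv") to pd.read_csv.

```python
import io
import pandas as pd

# Toy stand-in for the real file; the same pattern works with
# pd.read_csv("data.csv", chunksize=100_000)
csv_data = io.StringIO("ID,BB,CC,DD\nabc,0,0,1\nbcd,0,0,0\nabc,0,0,1\n")

partials = []
for chunk in pd.read_csv(csv_data, chunksize=2):  # tiny chunks for illustration
    partials.append(chunk.groupby("ID").sum())

# Combine the per-chunk partial sums; summing partial sums gives the full sum
result = pd.concat(partials).groupby(level=0).sum()
print(result)
```

This works because summation is associative: the sum of per-chunk sums equals the sum over the whole dataset, so the full frame never has to be in memory at once.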
I know it is counterintuitive, but looping over the columns without calling to_sparse is faster. Try the code below.
df1 = df[['id', 'BB']].groupby(by='id').sum()
for i in df.columns[2:]:
    df1[i] = df[['id', i]].groupby(by='id').sum()
# if you want to save space you can drop df columns after they are added to df1
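A runnable version of this loop on a toy frame (column names taken from the question, with lowercase 'id' as in the snippet above; the [i] selection is added so a Series rather than a one-column frame is assigned):

```python
import pandas as pd

# Toy frame with the question's columns
df = pd.DataFrame({"id": ["abc", "bcd", "abc"],
                   "BB": [0, 0, 0], "CC": [0, 0, 0], "DD": [1, 0, 1]})

# Seed the result with the first value column, then add the rest one at a time
df1 = df[["id", "BB"]].groupby(by="id").sum()
for i in df.columns[2:]:
    # select [i] so a Series is assigned to the new column
    df1[i] = df[["id", i]].groupby(by="id").sum()[i]

print(df1)
```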
Inspired by https://stackoverflow.com/a/50991732/8035867, here is a solution that relies on scikit-learn to do a sparse one-hot encoding of the group labels, and then uses SciPy to take the dot product of two sparse row (CSR) matrices.
Edit: switched to OneHotEncoder to cope with the situation where there are only two classes in the group-by column.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

def sparse_groupby_sum(df, groupby):
    ohe = OneHotEncoder(sparse_output=True)
    # Get all other columns we are not grouping by
    other_columns = [col for col in df.columns if col != groupby]
    # Get a 607875 x nDistinctIDs matrix in sparse row format with exactly
    # one nonzero entry per row
    onehot = ohe.fit_transform(df[groupby].values.reshape(-1, 1))
    # Transpose it, then convert from sparse column back to sparse row, as
    # dot products of two sparse row matrices are faster than sparse column
    # with sparse row
    onehot = onehot.T.tocsr()
    # Dot the transposed matrix with the other columns of df, converted to
    # sparse row format, then convert the resulting matrix back into a
    # sparse dataframe with the same column names
    out = pd.DataFrame.sparse.from_spmatrix(
        onehot.dot(df[other_columns].sparse.to_coo().tocsr()),
        columns=other_columns)
    # Add the groupby column back to the resulting dataframe with the
    # proper class labels
    out[groupby] = ohe.categories_[0]
    # This final groupby-sum simply ensures the result is in the format you
    # would expect from a regular pandas groupby and sum, but you can just
    # return out if this turns out to be a performance penalty. Note that in
    # that case the groupby column may have a different index.
    return out.groupby(groupby).sum()

dataframe = sparse_groupby_sum(dataframe, "ID")
Note that for performance you can inline the definition of the onehot variable into the out = line; I've separated it out here for didactic purposes.