Grouping large sparse pandas dataframe with groupby.sum() is very slow

Question:

I have a pandas dataframe of shape (607875, 12294). The data is sparse and looks like this:

     ID BB CC DD ...
0   abc 0  0  1  ...
1   bcd 0  0  0  ...
2   abc 0  0  1  ...
...

I converted it to the sparse form by calling

dataframe = dataframe.to_sparse()

Later, I grouped it by ID and summed the row values with

dataframe = dataframe.groupby("ID").sum()

For smaller dataframes this works perfectly well, but at this size it ran for an hour without finishing.
Is there a way to speed it up or work around it? Are there any other sparse methods I can use, since the to_sparse method is deprecated?

The output dataframe would have shape (2000, 12294) and look like this (assuming there are no other 1s in the abc rows):

     ID BB CC DD ...
0   abc 0  0  2  ...
1   bcd 0  0  0  ...
...

I have 32 GB of RAM on my PC, so that should be enough.

Asked By: Maria


Answers:

Pandas has its limitations, I’m afraid: it is most efficient with relatively small datasets, roughly 100 MB – 1 GB. If you want to work with pandas only, one workaround is to read the data from its source in chunks and aggregate each chunk, which keeps the working dataframe small. If possible, you can also filter out columns you don’t need for the transformation.
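The chunked approach could be sketched like this (a minimal sketch: the inline CSV string stands in for your real file, which you would pass to read_csv as a path instead):

```python
import io
import pandas as pd

# Inline stand-in for the real file; in practice pass a file path to read_csv.
csv_data = "ID,BB,CC,DD\nabc,0,0,1\nbcd,0,0,0\nabc,0,0,1\n"

# Read and aggregate in chunks so the full frame never sits in memory at once.
partial = [chunk.groupby("ID").sum()
           for chunk in pd.read_csv(io.StringIO(csv_data), chunksize=2)]
# Group again, because the same ID can appear in more than one chunk.
result = pd.concat(partial).groupby(level=0).sum()
```

Each chunk is reduced to at most one row per ID before the pieces are combined, so peak memory is bounded by the chunk size plus the (much smaller) partial results.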

Otherwise, you should check out frameworks such as PySpark or Hadoop, which are more suitable for transformations on larger datasets.

Answered By: Freddy Vandalay

I know it is counterintuitive, but looping over the columns without calling to_sparse is faster. Try the code below.

df1 = df[['ID', 'BB']].groupby(by='ID').sum()
for i in df.columns[2:]:
    # select the summed column so a Series, not a DataFrame, is assigned
    df1[i] = df[['ID', i]].groupby(by='ID').sum()[i]
    # if you want to save space you can drop df's columns after they are added to df1
Answered By: yosemite_k

Inspired by https://stackoverflow.com/a/50991732/8035867, here is a solution that uses scikit-learn to build a sparse one-hot encoding of the group labels and then uses SciPy to take the dot product of two sparse row matrices.

Edit: used OneHotEncoder instead, to cope with the situation where there are only two classes in the groupby column.
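The core trick can be seen on a tiny example first (labels and values made up to mirror the question’s sample):

```python
import numpy as np
from scipy import sparse

# Toy labels: rows 0 and 2 belong to group "abc", row 1 to "bcd".
# G is the one-hot grouping matrix, shape (n_groups, n_rows):
G = sparse.csr_matrix(np.array([[1, 0, 1],    # abc
                                [0, 1, 0]]))  # bcd
# X is the data matrix, shape (n_rows, n_cols):
X = sparse.csr_matrix(np.array([[0, 0, 1],
                                [0, 0, 0],
                                [0, 0, 1]]))
# G @ X sums the rows of X within each group: abc -> [0 0 2], bcd -> [0 0 0]
sums = (G @ X).toarray()
```

Because both operands stay sparse, the dot product never materializes the dense (607875, 12294) array, which is what makes this fast. The full function below does exactly this, with the grouping matrix produced by OneHotEncoder.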

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

def sparse_groupby_sum(df, groupby):
    ohe = OneHotEncoder(sparse_output=True)
    # Get all other columns we are not grouping by
    other_columns = [col for col in df.columns if col != groupby]
    # Get a 607875 x nDistinctIDs matrix in sparse row format with exactly 
    # 1 nonzero entry per row
    onehot = ohe.fit_transform(df[groupby].values.reshape(-1, 1))
    # Transpose it, then convert from sparse column back to sparse row format,
    # as dot products of two sparse row matrices are faster than sparse col
    # with sparse row
    onehot = onehot.T.tocsr()
    # Dot the transposed matrix with the other columns of the df, converted to sparse row 
    # format, then convert the resulting matrix back into a sparse 
    # dataframe with the same column names
    out = pd.DataFrame.sparse.from_spmatrix(
        onehot.dot(df[other_columns].sparse.to_coo().tocsr()), 
        columns=other_columns)
    # Add in the groupby column to this resulting dataframe with the proper class labels
    out[groupby] = ohe.categories_[0]
    # This final groupby sum simply ensures the result is in the format you would expect 
    # for a regular pandas groupby and sum, but you can just return out if this is going to be 
    # a performance penalty. Note in that case that the groupby column may have changed index
    return out.groupby(groupby).sum()

dataframe = sparse_groupby_sum(dataframe, "ID")

Note that for performance purposes you can inline the definition of the onehot variable into the out = line; I’ve only separated it out here for didactic purposes.
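On the deprecation question: the sparse columns that .sparse.to_coo() expects can be produced with the current pandas API via SparseDtype, the replacement for the removed to_sparse (a minimal sketch on toy data with the question’s column names):

```python
import pandas as pd

# Toy frame with the column names from the question
df = pd.DataFrame({"ID": ["abc", "bcd", "abc"],
                   "BB": [0, 0, 0],
                   "CC": [0, 0, 0],
                   "DD": [1, 0, 1]})
# to_sparse() is gone in pandas >= 1.0; converting the value columns to a
# SparseDtype (here with 0 as the fill value) is the current equivalent.
value_cols = ["BB", "CC", "DD"]
df[value_cols] = df[value_cols].astype(pd.SparseDtype("int", 0))
```

After this conversion, df can be passed straight into sparse_groupby_sum.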

Answered By: tex94