Attaching a calculated column to an existing dataframe raises TypeError: incompatible index of inserted column with frame index

Question

I am starting to learn Pandas. I was following the question here but could not get the proposed solution to work for me; I get an indexing error. This is what I have:

from pandas import *
import pandas as pd
d = {'L1' : Series(['X','X','Z','X','Z','Y','Z','Y','Y',]),
     'L2' : Series([1,2,1,3,2,1,3,2,3]),
     'L3' : Series([50,100,15,200,10,1,20,10,100])}
df = DataFrame(d)  
df.groupby('L1', as_index=False).apply(lambda x : pd.expanding_sum(x.sort('L3', ascending=False)['L3'])/x['L3'].sum())

which outputs the following (I am using iPython)

L1   
X   3    0.571429
    1    0.857143
    0    1.000000
Y   8    0.900901
    7    0.990991
    5    1.000000
Z   6    0.444444
    2    0.777778
    4    1.000000
dtype: float64

Then, I try to append the cumulative number calculation under the label "new" as suggested in the post

df["new"] = df.groupby("L1", as_index=False).apply(lambda x : pd.expanding_sum(x.sort("L3", ascending=False)["L3"])/x["L3"].sum())

I get this:

   2196                         value = value.reindex(self.index).values
   2197                     except:
-> 2198                         raise TypeError('incompatible index of inserted column '
   2199                                         'with frame index')
   2200 
TypeError: incompatible index of inserted column with frame index

Does anybody know what the problem is? How can I reinsert the calculated value into the dataframe so that it shows the values in order (descending by "new" for each label X, Y, Z)?

Asked By: user2735720

Answer 1

The problem is, as the error message says, that the index of the calculated column you want to insert is incompatible with the index of df.

The index of df is a simple index:

In [8]: df.index
Out[8]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8], dtype='int64')

while the index of the calculated column is a MultiIndex (as you can already see in the output). Supposing we call it new_column:

In [15]: new_column.index
Out[15]: 
MultiIndex
[(u'X', 3), (u'X', 1), (u'X', 0), (u'Y', 8), (u'Y', 7), (u'Y', 5), (u'Z', 6), (u'Z', 2), (u'Z', 4)]

For this reason, you cannot insert it into the frame. However, this is a bug in 0.12, as this does work in 0.13 (for which the answer in the linked question was tested) and the keyword as_index=False should ensure the column L1 is not added to the index.
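The mismatch described above can be checked directly by comparing the number of index levels on each side. This is only a minimal sketch for a recent pandas version, not the 0.12 session shown here: it assumes sort_values and cumsum as stand-ins for the removed x.sort and pd.expanding_sum, and the variable names frame_levels and column_levels are made up for illustration.

```python
import pandas as pd

# Same data as in the question
df = pd.DataFrame({
    "L1": ["X", "X", "Z", "X", "Z", "Y", "Z", "Y", "Y"],
    "L2": [1, 2, 1, 3, 2, 1, 3, 2, 3],
    "L3": [50, 100, 15, 200, 10, 1, 20, 10, 100],
})

# Assumption: modern equivalent of the expanding_sum / sort call in the question
new_column = df.groupby("L1")["L3"].apply(
    lambda x: x.sort_values(ascending=False).cumsum() / x.sum()
)

frame_levels = df.index.nlevels           # flat index on the frame
column_levels = new_column.index.nlevels  # group key plus original row label
print(frame_levels, column_levels)
```

If the two numbers differ, the column cannot be aligned with the frame as it is and one level has to go first.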

SOLUTION for 0.12:
Remove the first level of the MultiIndex, so you get back the original index:

In [13]: new_column = df.groupby('L1', as_index=False).apply(lambda x : pd.expanding_sum(x.sort('L3', ascending=False)['L3'])/x['L3'].sum())
In [14]: df["new"] = new_column.reset_index(level=0, drop=True)

In pandas 0.13 (in development) this is fixed (https://github.com/pydata/pandas/pull/4670). This is why as_index=False is used in the groupby call: the column L1 (by which you group) is then not added to the index (which would create a MultiIndex), so the original index is retained and the result can be appended to the original frame. In 0.12, however, the as_index keyword seems to be ignored when using apply.

Answered By: joris

Answer 2

This problem still exists (as of pandas 1.5.0) if the indices don’t match. A modern version of the groupby.apply in the OP may be written as

df['new'] = df.groupby('L1')['L3'].apply(lambda x: x.sort_values(ascending=False).cumsum()/x.sum())

and it would raise TypeError: incompatible index of inserted column with frame index.

A solution is to drop the index level created by the groupby.

result = df.groupby('L1')['L3'].apply(lambda x: x.sort_values(ascending=False).cumsum()/x.sum())
df['new'] = result.droplevel(0)         # <--- drop the unwanted index level
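Why this works: column assignment aligns on index labels, not on position, so once the group level is gone each value lands on its original row even though result is ordered by group and by descending L3. A small check on the question's data (a sketch only; the name flat is introduced here for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "L1": ["X", "X", "Z", "X", "Z", "Y", "Z", "Y", "Y"],
    "L3": [50, 100, 15, 200, 10, 1, 20, 10, 100],
})

result = df.groupby("L1")["L3"].apply(
    lambda x: x.sort_values(ascending=False).cumsum() / x.sum()
)

# Dropping the group level leaves only the original row labels
flat = result.droplevel(0)

# Assignment matches labels, so the row order of `flat` does not matter
df["new"] = flat

# Row 3 holds the largest L3 in group X (200 out of a group total of 350)
print(df.loc[3, "new"])
```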

In any case, to get a column that is indexed the same as the original dataframe (as is being tried in the OP), the canonical way is to apply the function with groupby.transform (as suggested by @DSM in a comment). The sorting has to be done beforehand.

df['new'] = df.sort_values(by='L3', ascending=False).groupby('L1')['L3'].transform(lambda y: y.cumsum()/y.sum())

Yet another way is to perform the division outside the groupby, ditching the lambda altogether.

g = df.sort_values(by='L3', ascending=False).groupby('L1')['L3']
df['new'] = g.cumsum() / g.transform('sum')
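To see that the last two variants give the same column, and to get the per-label ordering the question asks about, something like the following should do. It is a sketch under the assumption of a recent pandas; by_size, via_transform, via_division, same and ordered are names made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "L1": ["X", "X", "Z", "X", "Z", "Y", "Z", "Y", "Y"],
    "L2": [1, 2, 1, 3, 2, 1, 3, 2, 3],
    "L3": [50, 100, 15, 200, 10, 1, 20, 10, 100],
})

# Sort once up front so the cumulative sum runs from the largest L3 down
by_size = df.sort_values(by="L3", ascending=False)

# Variant 1: transform with a lambda
via_transform = by_size.groupby("L1")["L3"].transform(lambda y: y.cumsum() / y.sum())

# Variant 2: division outside the groupby, no lambda
g = by_size.groupby("L1")["L3"]
via_division = g.cumsum() / g.transform("sum")

# Both keep the original row labels, so they can be compared and assigned directly
same = bool(((via_transform - via_division).abs() < 1e-12).all())
df["new"] = via_division

# View grouped by label with the largest L3 first, as in the question's output
ordered = df.sort_values(["L1", "L3"], ascending=[True, False])
print(same)
print(ordered)
```

Sorting the frame for display is kept separate from computing the column, so df itself stays in its original row order.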

Answered By: cottontail
