Attaching a calculated column to an existing dataframe raises TypeError: incompatible index of inserted column with frame index
Question:
I am starting to learn Pandas. I was following the question here but could not get the proposed solution to work: I get an indexing error. This is what I have:
from pandas import *
import pandas as pd
d = {'L1' : Series(['X','X','Z','X','Z','Y','Z','Y','Y',]),
'L2' : Series([1,2,1,3,2,1,3,2,3]),
'L3' : Series([50,100,15,200,10,1,20,10,100])}
df = DataFrame(d)
df.groupby('L1', as_index=False).apply(lambda x : pd.expanding_sum(x.sort('L3', ascending=False)['L3'])/x['L3'].sum())
which outputs the following (I am using iPython)
L1
X 3 0.571429
1 0.857143
0 1.000000
Y 8 0.900901
7 0.990991
5 1.000000
Z 6 0.444444
2 0.777778
4 1.000000
dtype: float64
Then, I try to append the cumulative number calculation under the label "new" as suggested in the post
df["new"] = df.groupby("L1", as_index=False).apply(lambda x : pd.expanding_sum(x.sort("L3", ascending=False)["L3"])/x["L3"].sum())
I get this:
2196 value = value.reindex(self.index).values
2197 except:
-> 2198 raise TypeError('incompatible index of inserted column '
2199 'with frame index')
2200
TypeError: incompatible index of inserted column with frame index
Does anybody know what the problem is? How can I reinsert the calculated values into the dataframe so that they are shown in order (descending by "new" for each label X, Y, Z)?
Answers:
The problem is, as the error message says, that the index of the calculated column you want to insert is incompatible with the index of df.
The index of df is a simple index:
In [8]: df.index
Out[8]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8], dtype='int64')
while the index of the calculated column (supposing we call it new_column) is a MultiIndex, as you can already see in the output:
In [15]: new_column.index
Out[15]:
MultiIndex
[(u'X', 3), (u'X', 1), (u'X', 0), (u'Y', 8), (u'Y', 7), (u'Y', 5), (u'Z', 6), (u'Z', 2), (u'Z', 4)]
For this reason, you cannot insert it into the frame. However, this is a bug in 0.12, as this does work in 0.13 (for which the answer in the linked question was tested), and the keyword as_index=False should ensure the column L1 is not added to the index.
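The mismatch can also be reproduced on a recent pandas, where pd.expanding_sum and DataFrame.sort no longer exist. The following is a minimal sketch rather than the original 0.12 code: it assumes a modern pandas, selects the L3 column before applying, and passes group_keys=True explicitly (my choice, not part of the question) so that the group label is prepended to the index whatever the installed version's default is.

```python
import pandas as pd

df = pd.DataFrame({
    "L1": ["X", "X", "Z", "X", "Z", "Y", "Z", "Y", "Y"],
    "L2": [1, 2, 1, 3, 2, 1, 3, 2, 3],
    "L3": [50, 100, 15, 200, 10, 1, 20, 10, 100],
})

# Cumulative share of L3 within each L1 group, largest values first.
# group_keys=True (an explicit assumption here) prepends the group label
# to the index of the result.
new_column = df.groupby("L1", group_keys=True)["L3"].apply(
    lambda s: s.sort_values(ascending=False).cumsum() / s.sum()
)

# The frame has a flat index; the result carries two index levels
# (group label, original row label), so the two cannot be aligned directly.
print(df.index.nlevels)
print(new_column.index.nlevels)
print(new_column.index.tolist())
```

The second index level still holds the original row labels, which is why dropping the first level is enough to make the result insertable.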
SOLUTION for 0.12:
Remove the first level of the MultiIndex, so you get back the original index:
In [13]: new_column = df.groupby('L1', as_index=False).apply(lambda x : pd.expanding_sum(x.sort('L3', ascending=False)['L3'])/x['L3'].sum())
In [14]: df["new"] = new_column.reset_index(level=0, drop=True)
In pandas 0.13 (in development) this is fixed (https://github.com/pydata/pandas/pull/4670). It is for this reason that as_index=False is used in the groupby call: the column L1 (by which you group) is not added to the index (which would create a MultiIndex), so the original index is retained and the result can be appended to the original frame. But it seems the as_index keyword is ignored in 0.12 when using apply.
This problem still exists (as of pandas 1.5.0) if the indices don’t match. A modern version of the groupby.apply in the OP may be written as
df['new'] = df.groupby('L1')['L3'].apply(lambda x: x.sort_values(ascending=False).cumsum()/x.sum())
and it would raise TypeError: incompatible index of inserted column with frame index.
A solution is to drop the index level created by the groupby.
result = df.groupby('L1')['L3'].apply(lambda x: x.sort_values(ascending=False).cumsum()/x.sum())
df['new'] = result.droplevel(0) # <--- drop the unwanted index level
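As a self-contained check of the idea, the sketch below rebuilds the question's frame and compares droplevel(0) with the older reset_index(level=0, drop=True) spelling from the 0.12 answer. It assumes a recent pandas; group_keys=True is passed explicitly (an assumption on my part) so that the extra index level is present regardless of version defaults.

```python
import pandas as pd

df = pd.DataFrame({
    "L1": ["X", "X", "Z", "X", "Z", "Y", "Z", "Y", "Y"],
    "L2": [1, 2, 1, 3, 2, 1, 3, 2, 3],
    "L3": [50, 100, 15, 200, 10, 1, 20, 10, 100],
})

# The result is indexed by (group label, original row label).
result = df.groupby("L1", group_keys=True)["L3"].apply(
    lambda s: s.sort_values(ascending=False).cumsum() / s.sum()
)

# Two spellings of the same fix: remove the group-label level so that
# only the original row labels remain.
via_droplevel = result.droplevel(0)
via_reset_index = result.reset_index(level=0, drop=True)

# Assignment aligns on index labels, so the row order of the result
# does not need to match the row order of df.
df["new"] = via_droplevel
print(df)
```

Both spellings should leave a Series whose index contains exactly the labels of df.index, just in a different order.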
In any case, to get a column that is indexed the same as the original dataframe (as is being tried in the OP), the canonical way is to apply the function with groupby.transform (as suggested by @DSM in a comment). The sorting has to be done beforehand.
df['new'] = df.sort_values(by='L3', ascending=False).groupby('L1')['L3'].transform(lambda y: y.cumsum()/y.sum())
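Sorting first does not scramble the assignment, because the transform result keeps the row labels of the sorted frame and the assignment aligns on labels, not positions. A minimal sketch of that, assuming a recent pandas (the intermediate name transformed is mine):

```python
import pandas as pd

df = pd.DataFrame({
    "L1": ["X", "X", "Z", "X", "Z", "Y", "Z", "Y", "Y"],
    "L2": [1, 2, 1, 3, 2, 1, 3, 2, 3],
    "L3": [50, 100, 15, 200, 10, 1, 20, 10, 100],
})

# Sort once up front, then compute the cumulative share within each group.
transformed = (
    df.sort_values(by="L3", ascending=False)
      .groupby("L1")["L3"]
      .transform(lambda y: y.cumsum() / y.sum())
)

# transformed carries the same labels as df (in a different order) and
# no extra index level, so it can be assigned directly.
df["new"] = transformed
print(df)
```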
Yet another way is to perform the division outside the groupby, ditching lambda altogether.
g = df.sort_values(by='L3', ascending=False).groupby('L1')['L3']
df['new'] = g.cumsum() / g.transform('sum')
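The question also asks for the values to be shown in order within each label. Assigning the column does not reorder the frame, so one possible way (a sketch assuming a recent pandas; the name ordered is hypothetical) is to sort a view of the frame by L1 and then by L3 descending after the column has been added:

```python
import pandas as pd

df = pd.DataFrame({
    "L1": ["X", "X", "Z", "X", "Z", "Y", "Z", "Y", "Y"],
    "L2": [1, 2, 1, 3, 2, 1, 3, 2, 3],
    "L3": [50, 100, 15, 200, 10, 1, 20, 10, 100],
})

# Lambda-free computation of the cumulative share within each group.
g = df.sort_values(by="L3", ascending=False).groupby("L1")["L3"]
df["new"] = g.cumsum() / g.transform("sum")

# Display order only: group by label, largest L3 first within each label.
# df itself keeps its original row order.
ordered = df.sort_values(["L1", "L3"], ascending=[True, False])
print(ordered)
```

With this ordering, "new" grows towards 1.0 within each label, matching the layout of the groupby output shown in the question.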