Pandas Lookup to be deprecated – elegant and efficient alternative
Question:
The Pandas lookup function is to be deprecated in a future version. As suggested by the warning, it is recommended to use .melt and .loc as an alternative.
import pandas as pd

df = pd.DataFrame({'B': ['X', 'X', 'Y', 'X', 'Y', 'Y',
                         'X', 'X', 'Y', 'Y', 'X', 'Y'],
                   'group': ["IT", "IT", "IT", "MV", "MV", "MV",
                             "IT", "MV", "MV", "IT", "IT", "MV"]})
# Rolling per-group count of each B value, restored to the original row order
a = (pd.concat([df, df['B'].str.get_dummies()], axis=1)
       .groupby('group').rolling(3, min_periods=1).sum()
       .sort_index(level=1).reset_index(drop=True))
df['count'] = a.lookup(df.index, df['B'])
> Output Warning: <ipython-input-16-e5b517460c82>:7: FutureWarning:
> The 'lookup' method is deprecated and will be removed in a future
> version. You can use DataFrame.melt and DataFrame.loc as a substitute.
However, the alternative appears to be less elegant and slower:
b = pd.melt(a, value_vars=a.columns, var_name='B', ignore_index=False)
b.index.name='index'
df.index.name='index'
df = df.merge(b, on=['index','B'])
Is there a more elegant and more efficient approach to this?
Answers:
One idea is to use DataFrame.stack with DataFrame.join to match on both the index and B:
df1 = df.rename_axis('i').join(a.stack().rename('count'), on=['i','B'])
print (df1)
    B group  count
i
0   X    IT    1.0
1   X    IT    2.0
2   Y    IT    1.0
3   X    MV    1.0
4   Y    MV    1.0
5   Y    MV    2.0
6   X    IT    2.0
7   X    MV    1.0
8   Y    MV    2.0
9   Y    IT    2.0
10  X    IT    2.0
11  Y    MV    2.0
It looks like you can just use the index to assign the new values:
dfn = df.set_index('B', append=True)
dfn['count'] = a.stack()
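A minimal sketch of this index-based assignment on made-up data (the small df and a below are illustrative stand-ins, not the frames from the question). It also shows one way to get back to a flat frame, since dfn keeps B in its index:

```python
import pandas as pd

# Illustrative stand-ins for the question's frames (values are made up)
df = pd.DataFrame({"B": ["X", "Y", "X"], "group": ["IT", "IT", "MV"]})
a = pd.DataFrame({"X": [1.0, 2.0, 3.0], "Y": [10.0, 20.0, 30.0]})

# Move B into the index so each row is keyed by (row label, B value)
dfn = df.set_index("B", append=True)

# a.stack() is keyed by (row label, column name); assignment aligns on that key
dfn["count"] = a.stack()

# Turn B back into an ordinary column
result = dfn.reset_index("B")
print(result)
```

Because the assignment aligns on index labels, it does not depend on the row order of a, only on matching labels.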
You can replace the deprecated lookup with NumPy-style positional indexing:
import numpy as np

idx, cols = pd.factorize(df['B'])
df['count'] = a.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
Output:
    B group  count
0   X    IT    1.0
1   X    IT    2.0
2   Y    IT    1.0
3   X    MV    1.0
4   Y    MV    1.0
5   Y    MV    2.0
6   X    IT    2.0
7   X    MV    1.0
8   Y    MV    2.0
9   Y    IT    2.0
10  X    IT    2.0
11  Y    MV    2.0
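To see why the factorize approach works, here is a small walk-through on made-up data (keys and values are hypothetical names, not from the question). It assumes the rows of the value frame are in the same positional order as the keys, because the row selection is purely positional:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one key per row, and a value frame whose columns
# are deliberately in a different order than the keys first appear
keys = pd.Series(["X", "X", "Y", "X"], name="B")
values = pd.DataFrame({"Y": [0.0, 0.0, 1.0, 1.0], "X": [1.0, 2.0, 2.0, 3.0]})

# factorize maps each key to an integer code and returns the unique keys
idx, cols = pd.factorize(keys)
# idx is [0, 0, 1, 0] and cols is ['X', 'Y'] (order of first appearance)

# Reorder the columns so that column position i corresponds to code i
aligned = values.reindex(cols, axis=1).to_numpy()

# Pick one element per row: row r, column idx[r]
picked = aligned[np.arange(len(keys)), idx]
print(picked)
```

The reindex step is what ties the integer codes to the right columns; without it the codes would index whatever column order the frame happens to have.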
The other answers assume that you want to perform the lookup across all rows of the DataFrame, and your example does in fact do that. However, the original function accepts a list of row labels and a list of column labels that together form (row, column) pairs to be looked up.
The following approach supports that full functionality and runs in about the same time as df.lookup (slightly faster in my tests):
a.to_numpy()[a.index.get_indexer(df.index), a.columns.get_indexer(df['B'])]
Or to put it in code that better matches the old df.lookup API:
df.to_numpy()[df.index.get_indexer(row_labels), df.columns.get_indexer(col_labels)]
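One way to package this is a small helper. The sketch below is only an illustration: the name lookup_pairs and the sample frame are hypothetical. It adds an explicit check because get_indexer returns -1 for a label it cannot find, and NumPy would silently read -1 as the last row or column. It also assumes the index and columns are unique, which get_indexer requires.

```python
import numpy as np
import pandas as pd


def lookup_pairs(frame, row_labels, col_labels):
    """Return frame values at the (row_label, col_label) pairs, as an array."""
    rows = frame.index.get_indexer(row_labels)
    cols = frame.columns.get_indexer(col_labels)
    # get_indexer marks unknown labels with -1; fail loudly instead of
    # letting NumPy interpret -1 as "last position"
    if (rows == -1).any() or (cols == -1).any():
        raise KeyError("one or more row or column labels were not found")
    return frame.to_numpy()[rows, cols]


# Made-up frame with non-default row labels
scores = pd.DataFrame({"X": [1.0, 2.0, 3.0], "Y": [10.0, 20.0, 30.0]},
                      index=["a", "b", "c"])

# Look up only two cells: ("c", "X") and ("a", "Y")
picked = lookup_pairs(scores, ["c", "a"], ["X", "Y"])
print(picked)
```

Note that to_numpy() on a frame with mixed column dtypes produces an object array, so the result dtype depends on the frame.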
I ran both the old lookup function and this new approach 100k times on a very small DataFrame and on a moderately large one (100k x 4). In both cases the new approach ran marginally faster (39 seconds compared to 41.5 seconds).
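A comparison along these lines could be set up roughly as follows. This is a sketch, not the benchmark used above: the random data, the seed, and the repeat count of 20 are assumptions, and the old method is only timed if the installed pandas still provides it.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed test data: 100k rows x 4 columns of random floats
rng = np.random.default_rng(0)
n_rows = 100_000
frame = pd.DataFrame(rng.random((n_rows, 4)), columns=list("ABCD"))
row_labels = frame.index
col_labels = rng.choice(frame.columns.to_numpy(), size=n_rows)


def via_get_indexer():
    # Translate labels to positions, then index the underlying array
    return frame.to_numpy()[frame.index.get_indexer(row_labels),
                            frame.columns.get_indexer(col_labels)]


candidates = {"get_indexer": via_get_indexer}
# DataFrame.lookup is absent from newer pandas releases, so guard the call
if hasattr(pd.DataFrame, "lookup"):
    candidates["lookup"] = lambda: frame.lookup(row_labels, col_labels)

timings = {name: timeit.timeit(fn, number=20) for name, fn in candidates.items()}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s for 20 calls")
```

Absolute numbers will vary with hardware and pandas version, so only the relative ordering on the same machine is meaningful.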