Pandas Lookup to be deprecated – elegant and efficient alternative

Question:

The pandas DataFrame.lookup method is deprecated and will be removed in a future version. The warning recommends using .melt and .loc as an alternative.

import pandas as pd

df = pd.DataFrame({'B': ['X', 'X', 'Y', 'X', 'Y', 'Y',
                         'X', 'X', 'Y', 'Y', 'X', 'Y'],
                   'group': ["IT", "IT", "IT", "MV", "MV", "MV",
                             "IT", "MV", "MV", "IT", "IT", "MV"]})

# Rolling count of each B value within each group, restored to the original row order
a = (pd.concat([df, df['B'].str.get_dummies()], axis=1)
     .groupby('group').rolling(3, min_periods=1).sum()
     .sort_index(level=1).reset_index(drop=True))

# For each row, pick the value from the column named in B
df['count'] = a.lookup(df.index, df['B'])

> Output warning: <ipython-input-16-e5b517460c82>:7: FutureWarning:
> The 'lookup' method is deprecated and will be removed in a future
> version. You can use DataFrame.melt and DataFrame.loc as a substitute.

However, the alternative appears to be less elegant and slower:

b = pd.melt(a, value_vars=a.columns, var_name='B', ignore_index=False)
b.index.name='index'
df.index.name='index'
df = df.merge(b, on=['index','B'])

Is there a more elegant and more efficient approach to this?

Asked By: nrcjea001

Answers:

One idea is to use DataFrame.stack with DataFrame.join to match on the index and B:

df1 = df.rename_axis('i').join(a.stack().rename('count'), on=['i','B'])
print(df1)
    B group  count
i                 
0   X    IT    1.0
1   X    IT    2.0
2   Y    IT    1.0
3   X    MV    1.0
4   Y    MV    1.0
5   Y    MV    2.0
6   X    IT    2.0
7   X    MV    1.0
8   Y    MV    2.0
9   Y    IT    2.0
10  X    IT    2.0
11  Y    MV    2.0
Answered By: jezrael

It looks like you can just use the index to assign the new values.

dfn = df.set_index('B', append=True)
dfn['count'] = a.stack()
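
A minimal sketch of why this assignment lines up, using small made-up stand-ins for df and a rather than the data from the question. It also shows one way to move B back to an ordinary column afterwards; the names stacked and result are only illustrative.

```python
import pandas as pd

# Made-up stand-ins for `df` and `a` from the question.
df = pd.DataFrame({"B": ["X", "Y", "X"]})
a = pd.DataFrame({"X": [1.0, 2.0, 3.0], "Y": [10.0, 20.0, 30.0]})

# stack() turns `a` into a Series keyed by (row label, column name).
stacked = a.stack()

# Appending B to the index gives `df` the same kind of key,
# so the assignment aligns on (row label, B).
dfn = df.set_index("B", append=True)
dfn["count"] = stacked

# Move B back to an ordinary column to recover the original layout.
result = dfn.reset_index(level="B")
print(result)
```

Because the match is by label, this relies on df and a sharing the same row labels.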
Answered By: Ferris

You can use NumPy indexing to replace the deprecated lookup:

import numpy as np

idx, cols = pd.factorize(df['B'])

df['count'] = a.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]

Output:

    B group  count
0   X    IT    1.0
1   X    IT    2.0
2   Y    IT    1.0
3   X    MV    1.0
4   Y    MV    1.0
5   Y    MV    2.0
6   X    IT    2.0
7   X    MV    1.0
8   Y    MV    2.0
9   Y    IT    2.0
10  X    IT    2.0
11  Y    MV    2.0
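
To make the steps easier to follow, here is a small sketch of the same idea on made-up data (not the frames from the question), with each intermediate value spelled out. The name values is only illustrative.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for `df` and `a` from the question.
df = pd.DataFrame({"B": ["Y", "X", "Y"]})
a = pd.DataFrame({"X": [1.0, 2.0, 3.0], "Y": [10.0, 20.0, 30.0]})

# factorize returns integer codes plus the unique labels in order of
# first appearance: codes [0, 1, 0] and uniques ['Y', 'X'].
idx, cols = pd.factorize(df["B"])

# Reordering the columns of `a` to match `cols` makes code k point at
# column position k of the resulting array.
values = a.reindex(cols, axis=1).to_numpy()

# Pair each row position with its column code.
df["count"] = values[np.arange(len(df)), idx]
print(df)
```

Note that the rows are matched by position here, so this assumes df and a have the same length and row order.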
Answered By: mozway

Other solutions assume that you want to perform the lookup across all rows of the DataFrame, and your example does in fact do that. However, the original function allows for a list of row indices and a list of column names that form a set of pairs of coordinates to be looked up.

The following approach supports this full functionality and runs in about the same time as df.lookup (slightly faster in my tests):

a.to_numpy()[a.index.get_indexer(df.index), a.columns.get_indexer(df['B'])]

Or to put it in code that better matches the old df.lookup API:

df.to_numpy()[df.index.get_indexer(row_labels), df.columns.get_indexer(col_labels)]

I ran both the old lookup function and this new approach 100k times on a very small DataFrame and on a moderately large one (100k x 4). In both cases the new approach ran marginally faster (39 seconds compared to 41.5 seconds).
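
One caveat worth sketching: Index.get_indexer returns -1 for labels it cannot find, and NumPy reads -1 as "last row or column", so a missing label would silently return a wrong value. The wrapper below is a hedged sketch that rejects such labels; the name lookup_values and the demo frame are made up for illustration, and it assumes the index and columns are unique (get_indexer raises otherwise).

```python
import numpy as np
import pandas as pd


def lookup_values(frame, row_labels, col_labels):
    """Return frame values at the paired (row, column) labels."""
    row_pos = frame.index.get_indexer(row_labels)
    col_pos = frame.columns.get_indexer(col_labels)
    # get_indexer marks unknown labels with -1, which NumPy would treat
    # as the last row/column, so reject them explicitly.
    if (row_pos == -1).any() or (col_pos == -1).any():
        raise KeyError("one or more row or column labels were not found")
    return frame.to_numpy()[row_pos, col_pos]


# Tiny made-up frame, not the data from the question.
demo = pd.DataFrame({"X": [1.0, 2.0, 3.0], "Y": [10.0, 20.0, 30.0]},
                    index=["r0", "r1", "r2"])

# Picks (r0, Y), (r2, X), (r1, Y): the values 10.0, 3.0 and 20.0.
picked = lookup_values(demo, ["r0", "r2", "r1"], ["Y", "X", "Y"])
print(picked)
```

With the frames from the question, the call would be lookup_values(a, df.index, df['B']).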

Answered By: Danny