How to efficiently fill a column of a dataframe based on a dictionary

Question:

I have a dataframe and dictionary like this

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'ignore_me': range(9),
    'fill_me': [np.nan] * 9
})

di = {
    1: ['a', 'b'],
    2: ['c', 'd'],
    3: ['e', 'f', 'g']
}

   A  ignore_me  fill_me
0  1          0      NaN
1  1          1      NaN
2  1          2      NaN
3  2          3      NaN
4  2          4      NaN
5  3          5      NaN
6  3          6      NaN
7  3          7      NaN
8  3          8      NaN

The entries in A of df correspond to the keys in di. I would now like to fill the column fill_me using the values of di, so my desired outcome looks like this:

   A  ignore_me fill_me
0  1          0       a
1  1          1       b
2  1          2     NaN
3  2          3       c
4  2          4       d
5  3          5       e
6  3          6       f
7  3          7       g
8  3          8     NaN

One way of achieving this is as follows:

df_list = []
for key, values in di.items():
    temp_df = df[df['A'] == key].reset_index(drop=True)
    fill_df = pd.DataFrame({'A': [key]* len(values), 'fill_me': values})
    df_list.append(temp_df.combine_first(fill_df))

final_df = pd.concat(df_list, ignore_index=True)

which gives me the desired outcome. However, it requires a loop and a concat, and it creates a new dataframe. Does anyone see a more straightforward way of implementing this? Ideally, I could "just" fill df with a smart use of fillna or map.

Asked By: Cleb


Answers:

You can use this:

def f(x):
    # pop the next unused value for this key; return NaN once the list is
    # exhausted (note that this consumes di, so it can only be run once)
    values = di.get(x, [])
    return values.pop(0) if values else np.nan

df['fill_me'] = df['A'].apply(f)
Answered By: Alireza

You can use cumcount to create a merge key:

# explode the dict into long format: one row per (key, value) pair
s = pd.Series(di).explode().reset_index()
s.columns = ['A', 'fill']

# cumcount numbers the occurrences of each key, which aligns df's rows
# with the exploded rows of s
df['key'] = df.groupby('A').cumcount()
s['key'] = s.groupby('A').cumcount()

df.merge(s, how='left')
Out[463]: 
   A  ignore_me  fill_me  key fill
0  1          0      NaN    0    a
1  1          1      NaN    1    b
2  1          2      NaN    2  NaN
3  2          3      NaN    0    c
4  2          4      NaN    1    d
5  3          5      NaN    0    e
6  3          6      NaN    1    f
7  3          7      NaN    2    g
8  3          8      NaN    3  NaN
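
To finish, the merged fill column still has to be moved into fill_me and the helper columns dropped; a small sketch of that last step (the out variable and the final column handling are not part of the original answer):

out = df.merge(s, how='left')
out['fill_me'] = out['fill']               # move the merged values into fill_me
out = out.drop(columns=['key', 'fill'])    # remove the helper columns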
Answered By: BENY

def fill(x):
    # pop the next unused value for this key; fall back to NaN when the key
    # is missing or its list is exhausted
    try:
        return di[x].pop(0)
    except (KeyError, IndexError):
        return np.nan

df['fill_me'] = df['A'].map(fill)

I compared the running time of your approach and this one: your approach takes about 0.077 seconds, while this one does it in about 0.005 seconds.
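
The measurement itself is not shown; a rough sketch of how one might reproduce it (numbers will vary by machine, and the helper di_backup is introduced here only because fill() empties di, so the dictionary has to be restored before every timed run):

import copy
import timeit

di_backup = copy.deepcopy(di)  # pristine copy, since fill() pops values out of di

def timed_run():
    # restore di before each run (adds a little overhead to the measurement)
    di.clear()
    di.update(copy.deepcopy(di_backup))
    df['fill_me'] = df['A'].map(fill)

print(timeit.timeit(timed_run, number=100))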

Answered By: zousan

One approach using groupby + map:

# create unique keys for each value in A
keys = df.groupby("A").cumcount().astype(str) + df["A"].astype(str)

# un-roll the dictionary, the new keys will match the value of keys
lookup = {f"{i}{k}": v for k, vs in di.items() for i, v in enumerate(vs)}

# use map to update the values
df["fill_me"] = keys.map(lookup)

print(df)

Output

   A  ignore_me fill_me
0  1          0       a
1  1          1       b
2  1          2     NaN
3  2          3       c
4  2          4       d
5  3          5       e
6  3          6       f
7  3          7       g
8  3          8     NaN
Answered By: Dani Mesejo

I always prefer to use apply when possible

# a shallow copy is enough here: the lists are re-bound (sliced), never mutated
di_copy = di.copy()

def f(x):
    l = di_copy.get(x, [])
    if l:
        # hand out the first remaining value and keep the rest for later rows
        di_copy[x] = l[1:]
        return l[0]
    return np.nan

df['fill_me'] = df['A'].apply(f)
Answered By: pwasoutside

I’ve changed the way you’re implementing your di object:

di = {
    1: iter(['a', 'b']),
    2: iter(['c', 'd']),
    3: iter(['e', 'f', 'g'])
}

Thus, assuming x is always in di, you can do a one-liner:

df['A'].apply(lambda x: next(di[x], None))

If x is not guaranteed to be in di, go for:

df['A'].apply(lambda x: next(di.get(x, iter(())), None))
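
To end up with the values in fill_me (and with NaN rather than None for missing keys), one could combine the two variants like this; note that the iterators are consumed, so di has to be rebuilt before re-running:

# iterators are consumed row by row; rebuild di before running this again
df['fill_me'] = df['A'].apply(lambda x: next(di.get(x, iter(())), np.nan))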
Answered By: Bil11

I realized that I did not choose the best example, as in my actual use case fill_me does not only contain NaNs but also actual values that should not be overwritten. The approach from BENY works fine for this case, too:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'ignore_me': range(9),
    'fill_me': ['stuff', np.nan, np.nan, np.nan, np.nan, np.nan, 'more', np.nan, 'stuff2']
})

di = {
    1: ['a', 'b'],
    2: ['c', 'd'],
    3: ['e', 'f', 'g']
}

df['isna'] = np.where(df['fill_me'].isna(), 'yes', 'no')
df['group'] = df.groupby(['A', 'isna']).cumcount()
df = df.set_index(['A', 'isna', 'group'])

s = (
    pd.Series(di).explode()
                 .reset_index()
                 .rename(columns={'index': 'A', 0: 'fill_me'})
)
s['isna'] = 'yes'
s['group'] = s.groupby('A').cumcount()
s = s.set_index(['A', 'isna', 'group'])

and then one can simply do

df = df.fillna(s)

which yields the desired output (once the index is reset and the helper columns are dropped):

              ignore_me fill_me
A isna group                   
1 no   0              0   stuff
  yes  0              1       a
       1              2       b
2 yes  0              3       c
       1              4       d
3 yes  0              5       e
  no   0              6    more
  yes  1              7       f
  no   1              8  stuff2
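
The final reset/drop step mentioned above is not spelled out; it could look like this:

# flatten the index again and drop the helper columns
df = df.reset_index().drop(columns=['isna', 'group'])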
Answered By: Cleb