How to efficiently fill a column of a dataframe based on a dictionary
Question:
I have a dataframe and dictionary like this
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': [1, 1, 1, 2, 2, 3, 3, 3, 3],
'ignore_me': range(9),
'fill_me': [np.nan] * 9
})
di = {
1: ['a', 'b'],
2: ['c', 'd'],
3: ['e', 'f', 'g']
}
A ignore_me fill_me
0 1 0 NaN
1 1 1 NaN
2 1 2 NaN
3 2 3 NaN
4 2 4 NaN
5 3 5 NaN
6 3 6 NaN
7 3 7 NaN
8 3 8 NaN
The entries in A of df correspond to the keys in di. I would now like to fill the column fill_me using the values of di, so my desired outcome looks like this:
A ignore_me fill_me
0 1 0 a
1 1 1 b
2 1 2 NaN
3 2 3 c
4 2 4 d
5 3 5 e
6 3 6 f
7 3 7 g
8 3 8 NaN
One way of achieving this is as follows:
df_list = []
for key, values in di.items():
    temp_df = df[df['A'] == key].reset_index(drop=True)
    fill_df = pd.DataFrame({'A': [key] * len(values), 'fill_me': values})
    df_list.append(temp_df.combine_first(fill_df))
final_df = pd.concat(df_list, ignore_index=True)
which gives me the desired outcome. However, it requires looping, a concat and also creates a new dataframe. Does anyone see a more straightforward way of implementing this? Ideally I could "just" fill df using a smart way of using fillna or map.
Answers:
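Use this:
def f(x):
    # Take the next unused value for this key (this mutates di);
    # return NaN once the list is exhausted or the key is missing.
    values = di.get(x)
    return values.pop(0) if values else np.nan
df['fill_me'] = df['A'].apply(f)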
You can use cumcount to create the key:
s = pd.Series(di).explode().reset_index()
s.columns = ['A','fill']
df['key'] = df.groupby('A').cumcount()
s['key'] = s.groupby('A').cumcount()
df.merge(s,how='left')
Out[463]:
A ignore_me fill_me key fill
0 1 0 NaN 0 a
1 1 1 NaN 1 b
2 1 2 NaN 2 NaN
3 2 3 NaN 0 c
4 2 4 NaN 1 d
5 3 5 NaN 0 e
6 3 6 NaN 1 f
7 3 7 NaN 2 g
8 3 8 NaN 3 NaN
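The merge above leaves the values in a separate fill column next to the helper key. A minimal sketch of how the result could be written back into fill_me without keeping either helper column (the name merged is only illustrative, and this assumes each (A, key) pair occurs at most once in s, so the left merge keeps the row count and order of df):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'ignore_me': range(9),
    'fill_me': [np.nan] * 9,
})
di = {1: ['a', 'b'], 2: ['c', 'd'], 3: ['e', 'f', 'g']}

# One row per (A, position-within-A) pair taken from the dictionary
s = pd.Series(di).explode().reset_index()
s.columns = ['A', 'fill']
s['key'] = s.groupby('A').cumcount()

# Add the same positional key to a temporary copy of df and left-merge on both columns
merged = df.assign(key=df.groupby('A').cumcount()).merge(s, on=['A', 'key'], how='left')

# Assign by position so the result does not depend on df's index labels
df['fill_me'] = merged['fill'].to_numpy()
print(df)
```

Using assign keeps the key column out of df itself, so there is nothing to drop afterwards.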
def fill(x):
    # Pop the next value for this key; NaN if the key is missing or its list is empty
    try:
        return di[x].pop(0)
    except (KeyError, IndexError):
        return np.nan
df['fill_me'] = df['A'].map(fill)
I compared the running time of your approach with this one: yours took 0.077 secs, while this one took 0.005 secs.
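One thing to keep in mind with pop(0) is that it empties the lists inside di, so running the fill a second time would produce only NaN. A possible variant, sketched here with a deque per key (queues and take_next are illustrative names, not from the answer), consumes copies and leaves di untouched:

```python
from collections import deque

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'ignore_me': range(9),
    'fill_me': [np.nan] * 9,
})
di = {1: ['a', 'b'], 2: ['c', 'd'], 3: ['e', 'f', 'g']}

# Work on per-key copies so the original dictionary is not modified
queues = {k: deque(v) for k, v in di.items()}

def take_next(key):
    # Next unused value for this key, or NaN if the key is unknown or used up
    q = queues.get(key)
    return q.popleft() if q else np.nan

df['fill_me'] = df['A'].map(take_next)
print(df)
```

popleft on a deque is also constant time, whereas list.pop(0) shifts the remaining elements, which may matter for long lists.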
One approach using groupby + map:
# create unique keys for each value in A
keys = df.groupby("A").cumcount().astype(str) + df["A"].astype(str)
# un-roll the dictionary, the new keys will match the value of keys
lookup = {f"{i}{k}": v for k, vs in di.items() for i, v in enumerate(vs)}
# use map to update the values
df["fill_me"] = keys.map(lookup)
print(df)
Output
A ignore_me fill_me
0 1 0 a
1 1 1 b
2 1 2 NaN
3 2 3 c
4 2 4 d
5 3 5 e
6 3 6 f
7 3 7 g
8 3 8 NaN
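A caveat on the string keys: concatenating the counter and the value of A can collide once the numbers have more than one digit (counter 1 with A = 11 and counter 11 with A = 1 both give "111"). A sketch of the same idea with tuple keys, which avoids that ambiguity (positions and lookup are illustrative names; this swaps map for a plain dictionary lookup per row):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'ignore_me': range(9),
    'fill_me': [np.nan] * 9,
})
di = {1: ['a', 'b'], 2: ['c', 'd'], 3: ['e', 'f', 'g']}

# Position of each row within its group of equal A values
positions = df.groupby('A').cumcount().tolist()

# Un-roll the dictionary into {(key, position): value}
lookup = {(k, i): v for k, vs in di.items() for i, v in enumerate(vs)}

# Look up each (A, position) pair; pairs without a value become NaN
df['fill_me'] = [lookup.get(pair, np.nan) for pair in zip(df['A'].tolist(), positions)]
print(df)
```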
I always prefer to use apply when possible:
di_copy = di.copy()
def f(x):
    l = di_copy.get(x, [])
    if l:
        di_copy[x] = l[1:]
        return l[0]
    return np.nan
df['fill_me'] = df['A'].apply(f)
I’ve changed the way you’re implementing your di object:
di = {
1: iter(['a', 'b']),
2: iter(['c', 'd']),
3: iter(['e', 'f', 'g'])
}
Thus, assuming x is always in di, you can do a one-liner:
df['A'].apply(lambda x: next(di[x], None))
If x is not guaranteed to be in di, go for:
df['A'].apply(lambda x: next(di.get(x, iter(())), None))
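Since iterators can only be consumed once, it may be convenient to keep di as plain lists and build the iterators right before filling. A small sketch under that assumption (iters is an illustrative name), which also assigns the result to fill_me and uses NaN instead of None as the default:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'ignore_me': range(9),
    'fill_me': [np.nan] * 9,
})
di = {1: ['a', 'b'], 2: ['c', 'd'], 3: ['e', 'f', 'g']}

# Fresh iterators on every run; the lists in di stay intact
iters = {k: iter(v) for k, v in di.items()}

# next(..., default) returns the default once an iterator is exhausted;
# iter(()) is an empty iterator used for keys that are not in di
df['fill_me'] = df['A'].apply(lambda x: next(iters.get(x, iter(())), np.nan))
print(df)
```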
I realized that I did not choose the best example: in my actual use case, fill_me does not only contain NaNs but also actual values that should not be overwritten. But the approach from BENY works fine for this case, too:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': [1, 1, 1, 2, 2, 3, 3, 3, 3],
'ignore_me': range(9),
'fill_me': ['stuff', np.nan, np.nan, np.nan, np.nan, np.nan, 'more', np.nan, 'stuff2']
})
di = {
1: ['a', 'b'],
2: ['c', 'd'],
3: ['e', 'f', 'g']
}
df['isna'] = np.where(df['fill_me'].isna(), 'yes', 'no')
df['group'] = df.groupby(['A', 'isna']).cumcount()
df = df.set_index(['A', 'isna', 'group'])
s = (
pd.Series(di).explode()
.reset_index()
.rename(columns={'index': 'A', 0: 'fill_me'})
)
s['isna'] = 'yes'
s['group'] = s.groupby('A').cumcount()
s = s.set_index(['A', 'isna', 'group'])
and then one can simply do
df = df.fillna(s)
yielding the desired output (after resetting the index and dropping columns)
ignore_me fill_me
A isna group
1 no 0 0 stuff
yes 0 1 a
1 2 b
2 yes 0 3 c
1 4 d
3 yes 0 5 e
no 0 6 more
yes 1 7 f
no 1 8 stuff2
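For comparison, here is a sketch of an alternative that avoids the helper index altogether, so nothing has to be reset or dropped afterwards. It is not the fillna approach above but a mask-based variant of the same cumcount idea (mask, positions and lookup are illustrative names): only the NaN rows are numbered within each A group and then filled from the dictionary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'ignore_me': range(9),
    'fill_me': ['stuff', np.nan, np.nan, np.nan, np.nan, np.nan, 'more', np.nan, 'stuff2'],
})
di = {1: ['a', 'b'], 2: ['c', 'd'], 3: ['e', 'f', 'g']}

# Rows that still need a value
mask = df['fill_me'].isna()

# Number only the missing rows within each A group: 0, 1, 2, ...
positions = df[mask].groupby('A').cumcount().tolist()

# Un-roll the dictionary into {(key, position): value}
lookup = {(k, i): v for k, vs in di.items() for i, v in enumerate(vs)}

# Fill the missing rows in order; existing values are never touched
df.loc[mask, 'fill_me'] = [
    lookup.get(pair, np.nan) for pair in zip(df.loc[mask, 'A'].tolist(), positions)
]
print(df)
```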