Most efficient way to find and replace 'None' cells with value from a similar cell pair in dataframe?
Question:
I’m trying to find the most efficient way to fill in cells with a value of ‘None’ in a dataframe based on a rule.
If column ‘A’ has a specific value, such as ‘Apple’, and column ‘B’ is empty, I want to find all other instances of column ‘A’ with value ‘Apple’ and fill it with whatever is in column ‘B’s if it is not empty.
For example, if the input is this:
Column A
Column B
Apple
None
Apple
None
Orange
Soda
Banana
None
Apple
Pie
Banana
Bread
Orange
None
Then it should output this:
Column A
Column B
Apple
Pie
Apple
Pie
Orange
Soda
Banana
Bread
Apple
Pie
Banana
Bread
Orange
Soda
You can assume that for a particular Column A & B pair, it will always be the same (e.g. for every ‘Apple’ in Column A, there will either be ‘None’ or ‘Pie’ in Column B).
I tried the below, and it seems to work on this small test dataset, but I’m wondering if someone could suggest a more efficient method that I could use on my actual dataset (~100K rows).
for ind in data.index:
if data['Column B'][ind]=='None':
temp_A = data['Column A'][ind]
for ind2 in data.index:
if (data['Column A'][ind2]==temp_A) & (data['Column B'][ind2]!='None'):
data['Column B'][ind] = data['Column B'][ind2]
break
Answers:
Filter rows with no None
values, remove duplicates and convert to Series
, last mapping None
values by Series.map
:
m = df['Column B'].ne('None')
s = df[m].drop_duplicates('Column A').set_index('Column A')['Column B']
df.loc[~m, 'Column B'] = df.loc[~m, 'Column A'].map(s)
print (df)
Column A Column B
0 Apple Pie
1 Apple Pie
2 Orange Soda
3 Banana Bread
4 Apple Pie
5 Banana Bread
6 Orange Soda
I’m trying to find the most efficient way to fill in cells with a value of ‘None’ in a dataframe based on a rule.
If column ‘A’ has a specific value, such as ‘Apple’, and column ‘B’ is empty, I want to find all other instances of column ‘A’ with value ‘Apple’ and fill it with whatever is in column ‘B’s if it is not empty.
For example, if the input is this:
Column A | Column B |
---|---|
Apple | None |
Apple | None |
Orange | Soda |
Banana | None |
Apple | Pie |
Banana | Bread |
Orange | None |
Then it should output this:
Column A | Column B |
---|---|
Apple | Pie |
Apple | Pie |
Orange | Soda |
Banana | Bread |
Apple | Pie |
Banana | Bread |
Orange | Soda |
You can assume that for a particular Column A & B pair, it will always be the same (e.g. for every ‘Apple’ in Column A, there will either be ‘None’ or ‘Pie’ in Column B).
I tried the below, and it seems to work on this small test dataset, but I’m wondering if someone could suggest a more efficient method that I could use on my actual dataset (~100K rows).
for ind in data.index:
if data['Column B'][ind]=='None':
temp_A = data['Column A'][ind]
for ind2 in data.index:
if (data['Column A'][ind2]==temp_A) & (data['Column B'][ind2]!='None'):
data['Column B'][ind] = data['Column B'][ind2]
break
Filter rows with no None
values, remove duplicates and convert to Series
, last mapping None
values by Series.map
:
m = df['Column B'].ne('None')
s = df[m].drop_duplicates('Column A').set_index('Column A')['Column B']
df.loc[~m, 'Column B'] = df.loc[~m, 'Column A'].map(s)
print (df)
Column A Column B
0 Apple Pie
1 Apple Pie
2 Orange Soda
3 Banana Bread
4 Apple Pie
5 Banana Bread
6 Orange Soda