Replace a string if it contains a specific substring, based on its column name and a matching column name within another dataframe
Question:
I am trying to iterate through the columns in "main_frame" and find the values that contain the character/substring "<". For the cells that contain "<" I would like to replace them with the entry in the "Values" column in the names_values dataframe that has a value in the "Names" column that matches the "main_frame" column name.
data = [['Fiona','5'], ['Chris','6'], ['Mason','7'], ['June','8']]
names_values = pd.DataFrame(data, columns=[['Names', 'Values']])
data1 = {'Fiona':['<2', '3','4'],
'Chris': ['<7','12','8'],
'Mason': ['2','<3','11'],
'June': ['1','2','<9']}
main_frame = pd.DataFrame(data1)
For example, I would like my dataframe to look like this one:
data2= {'Fiona':['5', '3','4'],
'Chris': ['6','12','8'],
'Mason': ['2','7','11'],
'June': ['1','2','8']}
end_goal = pd.DataFrame(data2)
I’ve tried using pandas match() in combination with iterrows() and am having no luck.
Answers:
Slightly change the constructor of names_values to avoid the single-level MultiIndex, then use replace and fillna:
import numpy as np
import pandas as pd

data = [['Fiona','5'], ['Chris','6'], ['Mason','7'], ['June','8']]
# don't use a nested list for "columns"
names_values = pd.DataFrame(data, columns=['Names', 'Values'])
# match cells with any "<", replace with NaN, then fillna with a Series
end_goal = (main_frame
            .replace('<', np.nan, regex=True)
            .fillna(names_values.set_index('Names')['Values'])
           )
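To see why the nested list matters, here is a minimal sketch comparing the two constructors. The variable names nested and flat are illustrative only and do not appear in the question.

```python
import pandas as pd

data = [['Fiona', '5'], ['Chris', '6'], ['Mason', '7'], ['June', '8']]

# Nested list: pandas builds a MultiIndex with a single level for the columns
nested = pd.DataFrame(data, columns=[['Names', 'Values']])
# Flat list: a regular Index, so 'Names' and 'Values' are plain column labels
flat = pd.DataFrame(data, columns=['Names', 'Values'])

print(type(nested.columns).__name__)
print(type(flat.columns).__name__)

# With flat columns, this gives a Series of replacement values keyed by name
repl = flat.set_index('Names')['Values']
print(repl['Chris'])
```

With the flat columns, the Series index lines up with the column names of main_frame, which is what fillna uses to pick the fill value for each column.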
Alternative using apply:
repl = names_values.set_index('Names')['Values']
end_goal = main_frame.apply(lambda s: s.mask(s.str.contains('<'), repl[s.name]))
Output:
  Fiona Chris Mason June
0     5     6     2    1
1     3    12     7    2
2     4     8    11    8
Keeping original NaNs: add a mask:
data1 = {'Fiona':['<2', '3','4'],
         'Chris': ['<7','12',np.nan],
         'Mason': ['2','<3','11'],
         'June': ['1','2','<9']}
main_frame = pd.DataFrame(data1)
data = [['Fiona','5'], ['Chris','6'], ['Mason','7'], ['June','8']]
names_values = pd.DataFrame(data, columns=['Names', 'Values'])
# match cells with any "<", replace with NaN, then fillna with a Series
end_goal = (main_frame
            .replace('<', np.nan, regex=True)
            .fillna(names_values.set_index('Names')['Values'])
            .mask(main_frame.isna())
           )
Or:
repl = names_values.set_index('Names')['Values']
end_goal = main_frame.apply(lambda s: s.mask(s.str.contains('<')&s.notna(), repl[s.name]))
Output:
  Fiona Chris Mason June
0     5     6     2    1
1     3    12     7    2
2     4   NaN    11    8
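A possible variant of the apply version, sketched here as an assumption rather than part of the original answer: passing na=False to str.contains makes missing cells evaluate to False directly, so the extra s.notna() term is not needed. The replacement Series is built inline to keep the sketch self-contained.

```python
import numpy as np
import pandas as pd

main_frame = pd.DataFrame({
    'Fiona': ['<2', '3', '4'],
    'Chris': ['<7', '12', np.nan],
    'Mason': ['2', '<3', '11'],
    'June': ['1', '2', '<9'],
})
# Replacement values keyed by column name (same data as names_values)
repl = pd.Series({'Fiona': '5', 'Chris': '6', 'Mason': '7', 'June': '8'})

# na=False: a missing cell counts as "does not contain '<'",
# so the mask is purely boolean and NaN cells are left untouched
end_goal = main_frame.apply(
    lambda s: s.mask(s.str.contains('<', na=False), repl[s.name])
)
print(end_goal)
```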
First things first, there’s no point using a dataframe for the mapping values in names_values; a dictionary makes much more sense.
names_values = {'Fiona': 5, 'Chris': 6, 'Mason': 7, 'June': 8}
Then you can use the .items() method to iterate through the columns of the main_frame dataframe, and use the names_values dictionary and map() to replace the values in the dataframe.
names_values = {'Fiona': 5, 'Chris': 6, 'Mason': 7, 'June': 8}
for col_name, col_vals in main_frame.items():
    # replace any cell containing '<' with the value mapped to this column's name
    updated_col: pd.Series = main_frame[col_name].map(
        lambda x: names_values[col_name] if '<' in x else x
    )
    main_frame[col_name] = updated_col
main_frame
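If the mapping already lives in a dataframe, as in the question, the dictionary can be derived from it instead of being typed out by hand. The following is a sketch under a few assumptions: the names lookup and result are hypothetical, the values stay strings so they match the desired end_goal, and an isinstance check is added so that non-string cells such as NaN pass through unchanged.

```python
import pandas as pd

names_values = pd.DataFrame(
    [['Fiona', '5'], ['Chris', '6'], ['Mason', '7'], ['June', '8']],
    columns=['Names', 'Values'],
)
main_frame = pd.DataFrame({
    'Fiona': ['<2', '3', '4'],
    'Chris': ['<7', '12', '8'],
    'Mason': ['2', '<3', '11'],
    'June': ['1', '2', '<9'],
})

# Build the lookup dict from the two columns; values remain strings
lookup = dict(zip(names_values['Names'], names_values['Values']))

# Work on a copy so the original frame is not modified
result = main_frame.copy()
for col_name in result.columns:
    result[col_name] = result[col_name].map(
        # only strings are tested for '<'; anything else is returned as-is
        lambda x, name=col_name: lookup[name] if isinstance(x, str) and '<' in x else x
    )
print(result)
```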
Also, it’s worth noting there are many ways to solve this issue, so I’m sure this one is not the most efficient (it is nice and understandable, though)! One of the beauties of coding!
Hope this helped!
ss1=names_values.set_index("Names")["Values"]
main_frame.T.apply(lambda ss:np.where(ss.str.contains("<"),ss1,ss)).T
out:
  Fiona Chris Mason June
0     5     6     2    1
1     3    12     7    2
2     4     8    11    8
- The code is creating a new dataframe from an existing dataframe, main_frame, by replacing any values that contain the string "<" with the corresponding values from the names_values dataframe.
- The code first sets the index of the names_values dataframe to "Names" and then extracts the "Values" column from it.
- The code then calls apply() on the transposed main_frame, so the lambda function receives one row of the original dataframe at a time, indexed by column name.
- The lambda function uses np.where to check each cell for the string "<" and, where it is found, takes the value at the same position in ss1. This relies on ss1 being in the same order as the columns of main_frame.
- Finally, the code transposes the dataframe back to its original form.
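The steps above can also be sketched without the double transpose, by letting NumPy broadcast the row of replacement values across the whole frame. This is an illustrative alternative, not the answer's own code; the names repl_row, has_lt and out are hypothetical, and reindex is used to make the column ordering explicit.

```python
import numpy as np
import pandas as pd

main_frame = pd.DataFrame({
    'Fiona': ['<2', '3', '4'],
    'Chris': ['<7', '12', '8'],
    'Mason': ['2', '<3', '11'],
    'June': ['1', '2', '<9'],
})
names_values = pd.DataFrame(
    [['Fiona', '5'], ['Chris', '6'], ['Mason', '7'], ['June', '8']],
    columns=['Names', 'Values'],
)

# One replacement per column, ordered to match main_frame's columns
ss1 = names_values.set_index('Names')['Values']
repl_row = ss1.reindex(main_frame.columns).to_numpy()

# Boolean array with main_frame's shape: True where a cell contains "<"
has_lt = main_frame.apply(lambda s: s.str.contains('<', na=False)).to_numpy(dtype=bool)

# repl_row has shape (4,), so it broadcasts across the rows of the (3, 4) arrays
out = pd.DataFrame(
    np.where(has_lt, repl_row, main_frame.to_numpy()),
    index=main_frame.index,
    columns=main_frame.columns,
)
print(out)
```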
Drop the double brackets for names_values and then iterate through the columns and look for the < with str.contains. Also set Names as the index for names_values.
data1 = {'Fiona':['<2', '3','4'],
'Chris': ['<7','12','8'],
'Mason': ['2','<3','11'],
'June': ['1','2','<9']}
main_frame = pd.DataFrame(data1)
data = [['Fiona','5'], ['Chris','6'], ['Mason','7'], ['June','8']]
names_values = pd.DataFrame(data, columns=['Names', 'Values'])
names_values = names_values.set_index('Names')
for col in main_frame.columns:
    # overwrite, in place, every cell in this column that contains '<'
    main_frame.loc[main_frame[col].str.contains('<'), col] = names_values.at[col, 'Values']