Replace a string if it contains a specific substring, based on its column name and a matching column name within another dataframe

Question

I am trying to iterate through the columns in "main_frame" and find the values that contain the character/substring "<". For the cells that contain "<" I would like to replace them with the entry in the "Values" column in the names_values dataframe that has a value in the "Names" column that matches the "main_frame" column name.

data = [['Fiona','5'], ['Chris','6'], ['Mason','7'], ['June','8']]
names_values = pd.DataFrame(data, columns=[['Names', 'Values']])

data1 = {'Fiona':['<2', '3','4'],
        'Chris': ['<7','12','8'],
        'Mason': ['2','<3','11'],
        'June': ['1','2','<9']}
main_frame = pd.DataFrame(data1)

For example, I would like my dataframe to look like this one:

data2= {'Fiona':['5', '3','4'],
        'Chris': ['6','12','8'],
        'Mason': ['2','7','11'],
        'June': ['1','2','8']}
end_goal = pd.DataFrame(data2)

I’ve tried using pandas match() in combination with iterrows() and am having no luck.

Asked By: brown_squirrel

Answer 1

Slightly change the constructor of names_values to avoid the single-level MultiIndex, then use replace and fillna:

import numpy as np
import pandas as pd

data = [['Fiona','5'], ['Chris','6'], ['Mason','7'], ['June','8']]

# don't use a nested list for "columns"
names_values = pd.DataFrame(data, columns=['Names', 'Values'])

# cells containing "<" become NaN, then each column's NaNs are filled
# with the value from the Series whose index label matches the column name
end_goal = (main_frame
            .replace('<', np.nan, regex=True)
            .fillna(names_values.set_index('Names')['Values'])
           )
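
To see why the nested list matters, here is a minimal sketch comparing the constructor from the question with the one used in this answer. The variable names nested, flat and repl are made up for illustration.

```python
import pandas as pd

data = [['Fiona', '5'], ['Chris', '6'], ['Mason', '7'], ['June', '8']]

# Nested list for "columns": pandas builds a MultiIndex with a single level
nested = pd.DataFrame(data, columns=[['Names', 'Values']])

# Flat list for "columns": pandas builds a regular Index
flat = pd.DataFrame(data, columns=['Names', 'Values'])

print(type(nested.columns).__name__, nested.columns.nlevels)
print(type(flat.columns).__name__)

# With flat columns, the lookup Series used for fillna is straightforward to build
repl = flat.set_index('Names')['Values']
print(repl['Fiona'])
```

With the flat version, set_index('Names')['Values'] gives a Series keyed by name, which is what fillna needs to match against the column names of main_frame.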

Alternative using apply:

repl = names_values.set_index('Names')['Values']

end_goal = main_frame.apply(lambda s: s.mask(s.str.contains('<'), repl[s.name]))

Output:

  Fiona Chris Mason June
0     5     6     2    1
1     3    12     7    2
2     4     8    11    8
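
For a self-contained check, the apply variant can be run against the frames from the question and compared with the expected end_goal. This is only a sketch: the name result is made up here, and the import is added because the snippets above omit it.

```python
import pandas as pd

main_frame = pd.DataFrame({'Fiona': ['<2', '3', '4'],
                           'Chris': ['<7', '12', '8'],
                           'Mason': ['2', '<3', '11'],
                           'June': ['1', '2', '<9']})
end_goal = pd.DataFrame({'Fiona': ['5', '3', '4'],
                         'Chris': ['6', '12', '8'],
                         'Mason': ['2', '7', '11'],
                         'June': ['1', '2', '8']})
data = [['Fiona', '5'], ['Chris', '6'], ['Mason', '7'], ['June', '8']]
names_values = pd.DataFrame(data, columns=['Names', 'Values'])

# Lookup Series: index is the name, value is the replacement
repl = names_values.set_index('Names')['Values']

# For each column, swap cells containing "<" for the value stored under the column's name
result = main_frame.apply(lambda s: s.mask(s.str.contains('<'), repl[s.name]))

# Compare against the frame the question asks for
print(result.equals(end_goal))
```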

Keeping original NaNs:

If main_frame already contains NaNs that should be preserved, add a mask:

data1 = {'Fiona':['<2', '3','4'],
        'Chris': ['<7','12',np.nan],
        'Mason': ['2','<3','11'],
        'June': ['1','2','<9']}
main_frame = pd.DataFrame(data1)

data = [['Fiona','5'], ['Chris','6'], ['Mason','7'], ['June','8']]
names_values = pd.DataFrame(data, columns=['Names', 'Values'])

# match cells with any "<", replace with NaN, fillna with a Series,
# then restore the NaNs that were already present in main_frame
end_goal = (main_frame
            .replace('<', np.nan, regex=True)
            .fillna(names_values.set_index('Names')['Values'])
            .mask(main_frame.isna())
           )

Or:

repl = names_values.set_index('Names')['Values']

end_goal = main_frame.apply(lambda s: s.mask(s.str.contains('<')&s.notna(), repl[s.name]))

Output:

  Fiona Chris Mason June
0     5     6     2    1
1     3    12     7    2
2     4   NaN    11    8

Answered By: mozway

Answer 2

First things first, there’s no point using a dataframe for the mapping values in names_values; a dictionary makes much more sense.

names_values = {'Fiona': 5, 'Chris': 6, 'Mason': 7, 'June': 8}
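
If the mapping already lives in a dataframe, as in the question, it can be converted rather than retyped. A minimal sketch, assuming names_values was built with flat column names (columns=['Names', 'Values']); the name mapping is made up here. Note that this keeps the values as strings, as in the question, whereas the dictionary above uses integers.

```python
import pandas as pd

data = [['Fiona', '5'], ['Chris', '6'], ['Mason', '7'], ['June', '8']]
names_values = pd.DataFrame(data, columns=['Names', 'Values'])

# Pair each entry of "Names" with its entry of "Values" in a plain dictionary
mapping = dict(zip(names_values['Names'], names_values['Values']))
print(mapping)
```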

Then you can use the .items() method to iterate through the main_frame dataframe, and use the names_values dictionary and map() to replace the values in the dataframe.

names_values = {'Fiona': 5, 'Chris': 6, 'Mason': 7, 'June': 8}

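for col_name, col_vals in main_frame.items():
    # replace any string cell containing '<' with this column's mapped value;
    # the isinstance check leaves NaN and other non-string cells untouched
    main_frame[col_name] = col_vals.map(
        lambda x: names_values[col_name] if isinstance(x, str) and '<' in x else x
    )

main_frame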

Also, it’s worth noting there are many ways to solve this, so I’m sure this one is not the most efficient (it is nice and understandable, though)! One of the beauties of coding!

Hope this helped!

Answered By: Maximilian

Answer 3

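# Series of replacement values, indexed by name
ss1 = names_values.set_index("Names")["Values"]
# transpose, replace per original row (values are paired with ss1 by position), transpose back
main_frame.T.apply(lambda ss: np.where(ss.str.contains("<"), ss1, ss)).T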

out:

  Fiona Chris Mason June
0     5     6     2    1
1     3    12     7    2
2     4     8    11    8

The code builds a new dataframe from main_frame by replacing every value that contains the string "<" with the corresponding value from the names_values dataframe.
It first sets the index of names_values to "Names" and extracts the "Values" column as a Series (ss1).
It then transposes main_frame and calls apply(), which by default passes each column of the transposed frame (that is, each row of the original) to the lambda function.
The lambda function uses np.where to keep a value unless it contains "<", in which case it takes the value at the same position in ss1. This relies on the rows of names_values being in the same order as the columns of main_frame.
Finally, the code transposes the result back to the original orientation.
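
The positional pairing can be made explicit. The sketch below is a variation on the same np.where idea, not the code above: it skips the transposes and reindexes the replacement Series by the column names of main_frame, so it does not depend on the order of the rows in names_values. The names repl, has_lt and result are made up for illustration.

```python
import numpy as np
import pandas as pd

main_frame = pd.DataFrame({'Fiona': ['<2', '3', '4'],
                           'Chris': ['<7', '12', '8'],
                           'Mason': ['2', '<3', '11'],
                           'June': ['1', '2', '<9']})
data = [['Fiona', '5'], ['Chris', '6'], ['Mason', '7'], ['June', '8']]
names_values = pd.DataFrame(data, columns=['Names', 'Values'])

# One replacement per column, aligned by label rather than by position
repl = names_values.set_index('Names')['Values'].reindex(main_frame.columns)

# Boolean array marking the cells that contain "<" (missing cells count as False)
has_lt = main_frame.apply(lambda s: s.str.contains('<', na=False)).to_numpy(dtype=bool)

# repl has one entry per column, so NumPy broadcasts it across the rows
result = pd.DataFrame(
    np.where(has_lt, repl.to_numpy(), main_frame.to_numpy()),
    index=main_frame.index,
    columns=main_frame.columns,
)
print(result)
```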

Answered By: G.G

Answer 4

Drop the double brackets when constructing names_values, set "Names" as its index, then iterate through the columns of main_frame and look for the "<" with str.contains.

data1 = {'Fiona':['<2', '3','4'],
        'Chris': ['<7','12','8'],
        'Mason': ['2','<3','11'],
        'June': ['1','2','<9']}
main_frame = pd.DataFrame(data1)

data = [['Fiona','5'], ['Chris','6'], ['Mason','7'], ['June','8']]
names_values = pd.DataFrame(data, columns=['Names', 'Values'])
names_values = names_values.set_index('Names')

for col in main_frame.columns:
    main_frame.loc[main_frame[col].str.contains('<'), col] = names_values.at[col, 'Values']
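
One caveat, in case main_frame contains missing values: str.contains can return a missing value rather than False for those cells, and .loc rejects a boolean mask that contains missing values. A minimal sketch of the same loop with na=False, which treats missing cells as non-matches. The NaN in the Chris column is borrowed from Answer 1 and is not part of the original question, and the names result and has_lt are made up here.

```python
import numpy as np
import pandas as pd

main_frame = pd.DataFrame({'Fiona': ['<2', '3', '4'],
                           'Chris': ['<7', '12', np.nan],
                           'Mason': ['2', '<3', '11'],
                           'June': ['1', '2', '<9']})
data = [['Fiona', '5'], ['Chris', '6'], ['Mason', '7'], ['June', '8']]
names_values = pd.DataFrame(data, columns=['Names', 'Values']).set_index('Names')

result = main_frame.copy()  # work on a copy so the original frame is left untouched
for col in result.columns:
    # na=False makes missing cells count as "does not contain '<'"
    has_lt = result[col].str.contains('<', na=False)
    result.loc[has_lt, col] = names_values.at[col, 'Values']
print(result)
```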

Answered By: Michael Cao
