mark one dataframe if pattern found in another dataframe
Question:
I’ve 2 dataframes and want to mark 2nd dataframes if pattern exists in first.
A: Very large of rows (>10000’s)
date | items
20100605 | apple is red
20110606 | orange is orange
20120607 | apple is green
B: shorter with few hundred rows.
id | color
123 | is Red
234 | not orange
235 | is green
Result would be to flag all columns in B if pattern found in A, possibly adding a column to B like.
B:
id | color | found
123 | is Red | true
234 | not orange | false
235 | is green | true
thinking of something like, dfB[‘found’] = dfB[‘color’].isin(dfA[‘items’]) but dont see any way to ignore case. Also, with this approach it will change true to false. Dont want to change those which are already set true. Also, i believe its in-efficient to loop large df more than once. running through A once and marking B would be better way but not sure how to achieve that using isin(). Any other ways? especially ignoring case sensitivity of pattern.
Answers:
You can use something like this:
df2['check'] = df2['color'].apply(lambda x: True if any(x.casefold() in i.casefold() for i in df['items']) else False)
or you can use str.contains:
df2['check'] = df2['color'].str.contains('|'.join(df['items'].str.split(" ").str[1] + ' ' + df['items'].str.split(" ").str[2]),case=False)
#get second and third words
I’ve 2 dataframes and want to mark 2nd dataframes if pattern exists in first.
A: Very large of rows (>10000’s)
date | items
20100605 | apple is red
20110606 | orange is orange
20120607 | apple is green
B: shorter with few hundred rows.
id | color
123 | is Red
234 | not orange
235 | is green
Result would be to flag all columns in B if pattern found in A, possibly adding a column to B like.
B:
id | color | found
123 | is Red | true
234 | not orange | false
235 | is green | true
thinking of something like, dfB[‘found’] = dfB[‘color’].isin(dfA[‘items’]) but dont see any way to ignore case. Also, with this approach it will change true to false. Dont want to change those which are already set true. Also, i believe its in-efficient to loop large df more than once. running through A once and marking B would be better way but not sure how to achieve that using isin(). Any other ways? especially ignoring case sensitivity of pattern.
You can use something like this:
df2['check'] = df2['color'].apply(lambda x: True if any(x.casefold() in i.casefold() for i in df['items']) else False)
or you can use str.contains:
df2['check'] = df2['color'].str.contains('|'.join(df['items'].str.split(" ").str[1] + ' ' + df['items'].str.split(" ").str[2]),case=False)
#get second and third words