Common words in two different pandas DataFrames and columns
Question:
A

x | disc |
---|---|
a | ‘tall’, ‘short’, ‘medium’ |
b | ‘small’, ‘long’, ‘short’ |

B

y |
---|
‘tall’, ‘short’ |
‘short’, ‘long’ |
‘small’, ‘tall’ |

Expected output:

x | disc | tall short | short long |
---|---|---|---|
a | ‘tall’, ‘short’, ‘medium’ | 1 | 0 |
b | ‘small’, ‘long’, ‘short’ | 0 | 1 |
Answers:
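Convert the values to sets and, for each row of B, add a new column to A that flags whether all of its words occur in A['disc'] (this assumes the values are comma-separated strings):

    for x in B['y']:
        # words that must all be present
        s = set(x.split(', '))
        # superset test for each row of A, stored as 1/0
        A[x] = [int(set(y.split(', ')) >= s) for y in A['disc']]

If necessary, drop the columns that contain only 0:

    out = A.loc[:, A.ne(0).any()]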
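
For reference, here is a minimal end-to-end sketch of the loop approach. The sample frames are rebuilt from the question, and storing the words as comma-separated strings is an assumption. Unlike the snippet above, the all-zero check is restricted to the newly added columns, which is a small variation and not part of the original answer.

```python
import pandas as pd

# Sample data rebuilt from the question (string storage is an assumption)
A = pd.DataFrame({"x": ["a", "b"],
                  "disc": ["tall, short, medium", "small, long, short"]})
B = pd.DataFrame({"y": ["tall, short", "short, long", "small, tall"]})

for words in B["y"]:
    needed = set(words.split(", "))
    # set >= set is a superset test: 1 when the row holds every needed word
    A[words] = [int(set(d.split(", ")) >= needed) for d in A["disc"]]

# Drop only those new flag columns that are 0 in every row
all_zero = [c for c in B["y"] if A[c].eq(0).all()]
out = A.drop(columns=all_zero)
print(out)
```

Note that the new columns are named after the raw strings in B['y'] (for example `tall, short`), so rename them if you need the space-separated headers from the question.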
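You can use set comparison with numpy broadcasting. This assumes the cells of A['disc'] and B['y'] hold lists of words, not strings:

    import pandas as pd

    # (n_A, 1) array of sets compared against (n_B,) array of sets -> (n_A, n_B) matrix
    out = A.join(pd.DataFrame((A['disc'].apply(set).to_numpy()[:, None]
                               >= B['y'].apply(set).to_numpy()).astype(int),
                              columns=B['y'].apply(' '.join), index=A.index)
                 )

Output:

       x                   disc  tall short  short long  small tall
    0  a  [tall, short, medium]           1           0           0
    1  b   [small, long, short]           0           1           0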
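
To see what the broadcasting step does on its own, here is a small sketch using plain NumPy object arrays. The names `rows` and `targets` are illustrative only and do not appear in the answer.

```python
import numpy as np

# Object arrays whose elements are Python sets
rows = np.array([{"tall", "short", "medium"},
                 {"small", "long", "short"}], dtype=object)
targets = np.array([{"tall", "short"},
                    {"short", "long"},
                    {"small", "tall"}], dtype=object)

# rows[:, None] has shape (2, 1); comparing it with shape (3,) broadcasts
# to (2, 3). Each cell is the superset test rows[i] >= targets[j].
matrix = rows[:, None] >= targets
print(matrix.shape)
print(matrix.astype(int))
```

Each row of the matrix corresponds to a row of A and each column to a row of B, which is why it can be wrapped in a DataFrame with `index=A.index` and one column per entry of B['y'].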
If you want only the columns with at least one match, build the boolean matrix first and filter it before joining:

    tmp = pd.DataFrame((A['disc'].apply(set).to_numpy()[:, None]
                        >= B['y'].apply(set).to_numpy()),
                       columns=B['y'].apply(' '.join), index=A.index)
    # keep columns where any row is True, then convert to 1/0
    out = A.join(tmp.loc[:, tmp.any()].astype(int))

Output:

       x                   disc  tall short  short long
    0  a  [tall, short, medium]           1           0
    1  b   [small, long, short]           0           1
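
If your columns hold comma-separated strings instead of lists, one possible bridge is to split them first and then apply the same broadcasting idea. This is a sketch under that assumption; the sample data and the names `disc_sets` and `y_lists` are illustrative, not part of the answer.

```python
import pandas as pd

# Sample data rebuilt from the question (string storage is an assumption)
A = pd.DataFrame({"x": ["a", "b"],
                  "disc": ["tall, short, medium", "small, long, short"]})
B = pd.DataFrame({"y": ["tall, short", "short, long", "small, tall"]})

# Split the strings into lists of words, then into sets for comparison
disc_sets = A["disc"].str.split(", ").apply(set)
y_lists = B["y"].str.split(", ")

# (2, 1) against (3,) broadcasts to a (2, 3) boolean matrix
tmp = pd.DataFrame(
    disc_sets.to_numpy()[:, None] >= y_lists.apply(set).to_numpy(),
    columns=y_lists.apply(" ".join),
    index=A.index,
)

# Keep only columns with at least one match, as 1/0
out = A.join(tmp.loc[:, tmp.any()].astype(int))
print(out)
```

Here A['disc'] is left as the original strings in the result; only the comparison uses the split values.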