How to merge dataframes on semicolon-separated values in Python?
Question:
I have two data frames, Product and Users.
A product can belong to multiple categories, which are separated by semicolons.
A user's interests can also span multiple categories, likewise separated by semicolons.
I need to find all content ids that fall within each user's interest categories.
I tried to split the category columns of both data frames (Product, Users) and match them with isin(), but I get this error:
users['intrestcategory'].str.split(";", n=1, expand=True)
A value is trying to be set on a copy of a slice from a DataFrame
ValueError: Wrong number of items passed 0, placement implies 1
Below is a sample of the data frames:
- product
Categories  contentId
1
12;2        2
3
2           4
3;15        5
15          6
7
20          8
20;2        9
- Users
userid  intrestcategories
2       12;2
3       3
4       15
- Final output
userid  contentId
2       4
2       2
2       9
3       5
4       5
4       6
Answers:
First we use explode (pandas version >= 0.25.0) to turn the semicolon-separated categories in each data frame into one row per category, then merge on the categories and drop duplicates:
import pandas as pd
from numpy import nan

dfp = pd.DataFrame({'contentId': {0: nan, 1: 2.0, 2: nan, 3: 4.0, 4: 5.0, 5: 6.0, 6: nan, 7: 8.0, 8: 9.0}, 'Categories': {0: '1', 1: '12;2', 2: '3', 3: '2', 4: '3;15', 5: '15', 6: '7', 7: '20', 8: '20;2'}})
dfu = pd.DataFrame({'intrestcategories': {0: '12;2', 1: '3', 2: '15'}, 'userid': {0: 2, 1: 3, 2: 4}})
# Split each semicolon-separated string into a list, then give every list item its own row
dfp['Categories'] = dfp['Categories'].str.split(';')
dfp = dfp.explode('Categories')
dfu['intrestcategories'] = dfu['intrestcategories'].str.split(';')
dfu = dfu.explode('intrestcategories')
# Drop products without a contentId, match on category, keep unique (userid, contentId) pairs
result = (dfp.dropna()
          .merge(dfu, left_on='Categories', right_on='intrestcategories')[['userid', 'contentId']]
          .drop_duplicates().astype(int))
print(result)
Result:
userid contentId
0 2 2
2 2 4
3 2 9
4 3 5
5 4 5
6 4 6
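The steps above can also be wrapped in a small reusable function. This is only a sketch under a few assumptions: the function name match_users_to_content, the sep parameter, and the temporary category column are made up for illustration, and the column names are taken from the sample data in the question. It leaves the input data frames unmodified and sorts the output, so the row order does not depend on how a given pandas version orders merge results.

```python
import pandas as pd


def match_users_to_content(products, users, sep=";"):
    """Return unique (userid, contentId) pairs that share at least one category.

    Hypothetical helper: the name and the `sep` parameter are illustrative.
    """
    # One row per (contentId, category); products without a contentId are dropped.
    p = (
        products.dropna(subset=["contentId"])
        .assign(category=lambda d: d["Categories"].str.split(sep))
        .explode("category")
    )
    # One row per (userid, category).
    u = users.assign(
        category=lambda d: d["intrestcategories"].str.split(sep)
    ).explode("category")

    # Inner merge on the shared category, then keep each pair once.
    pairs = p.merge(u, on="category")[["userid", "contentId"]].drop_duplicates()
    return (
        pairs.astype({"contentId": int})
        .sort_values(["userid", "contentId"])
        .reset_index(drop=True)
    )


# Sample data from the question; NaN marks products without a contentId.
nan = float("nan")
products = pd.DataFrame({
    "Categories": ["1", "12;2", "3", "2", "3;15", "15", "7", "20", "20;2"],
    "contentId": [nan, 2.0, nan, 4.0, 5.0, 6.0, nan, 8.0, 9.0],
})
users = pd.DataFrame({
    "userid": [2, 3, 4],
    "intrestcategories": ["12;2", "3", "15"],
})

result = match_users_to_content(products, users)
print(result)
```

On the sample data this returns the six userid/contentId pairs listed in the question's final output, sorted by userid and then contentId.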