Filter and merge a dataframe in Python using Pandas
Question:
I have a dataframe and I need to filter out who is the owner of which books so we can send them notifications. I am having trouble merging the data in the format I need.
Existing dataframe
Book | Owner |
---|---|
The Alchemist | marry |
To Kill a Mockingbird | john |
Lord of the Flies | abel |
Catcher in the Ry | marry |
Alabama | julia;marry |
Invisible Man | john |
I need to create a new dataframe that lists the owners in column A and all the books they own in column B.
Desired output
Owners | Books |
---|---|
marry | The Alchemist, Catcher in the Ry, Alabama |
john | To Kill a Mockingbird, Invisible Man |
abel | Lord of the Flies |
julia | Alabama |
I tried creating two dataframes from the same file and then merging them, but the results are never accurate. Does anyone know a more efficient way to do this?
Current code not working:
from pathlib import Path
import pandas as pd
file1 = Path.cwd() / "./bookgrid.xlsx"
df1 = pd.read_excel(file1)
df2 = pd.read_excel(file1)
## Perform the VLOOKUP-style merge
merge = pd.merge(df1, df2, how="left")
merge.to_excel("./results.xlsx")
Answers:
You need to split, explode, then groupby.agg:
(df.assign(Owner=lambda d: d['Owner'].str.split(';'))
.explode('Owner')
.groupby('Owner', as_index=False, sort=False).agg(', '.join)
)
NB. If you need the plural form in the column headers, add .add_suffix('s') or .rename(columns={'Book': 'Books', 'Owner': 'Owners'}).
Output:
Owner Book
0 marry The Alchemist, Catcher in the Ry, Alabama
1 john To Kill a Mockingbird, Invisible Man
2 abel Lord of the Flies
3 julia Alabama
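For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same split / explode / groupby.agg chain. It assumes the sample data from the question is built inline rather than read from bookgrid.xlsx, and the intermediate names (split_owners, exploded, result) are only there for illustration.

```python
import pandas as pd

# Sample data copied from the question (built inline instead of read from Excel)
df = pd.DataFrame({
    "Book": ["The Alchemist", "To Kill a Mockingbird", "Lord of the Flies",
             "Catcher in the Ry", "Alabama", "Invisible Man"],
    "Owner": ["marry", "john", "abel", "marry", "julia;marry", "john"],
})

# Step 1: turn "julia;marry" into the list ["julia", "marry"]
split_owners = df.assign(Owner=df["Owner"].str.split(";"))

# Step 2: one row per (book, owner) pair
exploded = split_owners.explode("Owner")

# Step 3: join each owner's books into one string, keeping first-seen owner order
result = (
    exploded.groupby("Owner", as_index=False, sort=False)
    .agg(", ".join)
    .rename(columns={"Owner": "Owners", "Book": "Books"})
)

print(result)
```

Printing exploded before the groupby is a quick way to check that shared books such as Alabama show up once per owner.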
Let's try something new
s = df['Owner'].str.get_dummies(';')
(s.T @ df['Book'].add(', ')).str.rstrip(', ')
Result
abel Lord of the Flies
john To Kill a Mockingbird, Invisible Man
julia Alabama
marry The Alchemist, Catcher in the Ry, Alabama
dtype: object
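The matrix product above is compact but a little opaque. As a rough sketch of what it is doing, the version below builds the same 0/1 indicator table with str.get_dummies and then, for each owner column, joins the books on the rows where the indicator is set. The names indicator and result are illustrative, and the explicit loop is a swapped-in alternative to the @ operator, not the answer's exact method.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Book": ["The Alchemist", "To Kill a Mockingbird", "Lord of the Flies",
             "Catcher in the Ry", "Alabama", "Invisible Man"],
    "Owner": ["marry", "john", "abel", "marry", "julia;marry", "john"],
})

# One indicator column per owner, one row per book
indicator = df["Owner"].str.get_dummies(";")

# For each owner, pick the books whose indicator is set and join them
result = pd.Series({
    owner: ", ".join(df.loc[indicator[owner].astype(bool), "Book"])
    for owner in indicator.columns
})

print(result)
```

Owners come out in the column order of the indicator table, which is alphabetical here rather than first-seen order.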
Not the fastest way, but here's an easy-to-follow one.
import pandas as pd
# Set up the example dataframe
data = {'Book':['The Alchemist','To Kill a Mockingbird','Lord of the Flies','Catcher in the Ry','Alabama','Invisible Man'],'Owner':['marry','john','abel','marry','julia;marry','john']}
df2 = pd.DataFrame(data)
# Turn your string of names into a list of names
df2['Owner'] = df2['Owner'].apply(lambda x: x.split(";"))
# get the unique set of owners
unique_owners = {single_owner for owners_list in df2['Owner'] for single_owner in owners_list}
# Gives a set -> {'abel', 'john', 'julia', 'marry'}
# for a single owner (here 'marry'), slice the dataframe down to that owner's rows
df2[['marry' in row for row in df2['Owner']]]
# select only the books, not the names
df2[['marry' in row for row in df2['Owner']]]['Book']
# convert the books to a list
# (alternative: ",".join(df2[['marry' in row for row in df2['Owner']]]['Book']) turns all the books into a single piece of text)
df2[['marry' in row for row in df2['Owner']]]['Book'].to_list()
# set up data storage
names = []
books = []
# iterate through the unique owners set and collect each owner's books
for single_owner in unique_owners:
    names.append(single_owner)
    books.append(df2[[single_owner in row for row in df2['Owner']]]['Book'].to_list())
new_df2 = pd.DataFrame({'Owner':names,'Books':books})
new_df2
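If the step-by-step slicing above feels repetitive, the same idea can be sketched as a single pass over the rows that collects books into a dictionary keyed by owner. This is only an illustrative variant: books_by_owner and new_df are hypothetical names, and the books are joined into one comma-separated string per owner to match the desired output in the question.

```python
from collections import defaultdict

import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Book": ["The Alchemist", "To Kill a Mockingbird", "Lord of the Flies",
             "Catcher in the Ry", "Alabama", "Invisible Man"],
    "Owner": ["marry", "john", "abel", "marry", "julia;marry", "john"],
})

# Map each owner to the list of books they own, in row order
books_by_owner = defaultdict(list)
for book, owners in zip(df["Book"], df["Owner"]):
    for owner in owners.split(";"):
        books_by_owner[owner].append(book)

# Build the result; dict insertion order keeps owners in first-seen order
new_df = pd.DataFrame({
    "Owner": list(books_by_owner),
    "Books": [", ".join(titles) for titles in books_by_owner.values()],
})

print(new_df)
```

Unlike iterating over a set, this keeps the owners in a predictable order, which makes the result easier to compare against the expected table.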