Convert a `dict[str, list[any]]` into a binary `pandas.DataFrame`
Question:
I have the following dictionary:

    d = {
        "anna": ["apple", "strawberry", "banana"],
        "bob": ["strawberry", "banana", "peach"],
        "chris": ["apple", "banana", "peach", "mango"]
    }

and I want to convert it into the following `pandas.DataFrame`:

           apple  banana  mango  peach  strawberry
    anna       1       1      0      0           1
    bob        0       1      0      1           1
    chris      1       1      1      1           0
It is not very complicated to implement in Python (see below), but I was wondering whether pandas already has something to do it automatically (or whether the implementation below can be optimized).
Thanks in advance!
Current Python implementation:

    import numpy as np
    import pandas as pd

    d = {
        "anna": ["apple", "strawberry", "banana"],
        "bob": ["strawberry", "banana", "peach"],
        "chris": ["apple", "banana", "peach", "mango"]
    }

    # Collect every distinct fruit. list(...) is used because newer NumPy
    # versions expect a list or tuple here, not a dict view.
    fruits = sorted(set(np.hstack(list(d.values()))))
    df = pd.DataFrame(columns=fruits)
    for client, client_fruits in d.items():
        # One 0/1 flag per fruit for this client
        s = pd.Series({
            fruit: fruit in client_fruits for fruit in fruits
        }).astype(int)
        df = pd.concat([df, pd.DataFrame({client: s}).T])
    print(df)
Answers:
One option using `str.get_dummies`:

    out = pd.Series({k: '|'.join(v) for k, v in d.items()}).str.get_dummies()
Or `from_dict` and `pandas.get_dummies`:

    # dtype=int keeps the 0/1 output shown below; pandas 2.x returns booleans by default
    out = (pd.get_dummies(pd.DataFrame.from_dict(d, orient='index').stack(), dtype=int)
             .groupby(level=0).max()
          )
Or with a `crosstab`:

    out = pd.crosstab(*zip(*((k, v) for k, l in d.items() for v in l))).clip(upper=1)
Output:

           apple  banana  mango  peach  strawberry
    anna       1       1      0      0           1
    bob        0       1      0      1           1
    chris      1       1      1      1           0
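As a quick sanity check, the three options above can be run side by side to confirm they yield the same 0/1 table. This is only a minimal sketch, assuming a reasonably recent pandas. The names `opt1`, `opt2`, and `opt3` are illustrative, and `dtype=int` is passed to `pd.get_dummies` on the assumption that booleans are not wanted.

```python
import pandas as pd

d = {
    "anna": ["apple", "strawberry", "banana"],
    "bob": ["strawberry", "banana", "peach"],
    "chris": ["apple", "banana", "peach", "mango"],
}

# Option 1: join each list into one string, then split it into indicator columns
opt1 = pd.Series({k: "|".join(v) for k, v in d.items()}).str.get_dummies()

# Option 2: one-hot encode the stacked values, then collapse to one row per key
opt2 = (
    pd.get_dummies(pd.DataFrame.from_dict(d, orient="index").stack(), dtype=int)
    .groupby(level=0)
    .max()
)

# Option 3: count (key, value) pairs and cap the counts at 1
opt3 = pd.crosstab(*zip(*((k, v) for k, l in d.items() for v in l))).clip(upper=1)

# Compare labels and values, ignoring axis names and integer width
for other in (opt2, opt3):
    assert list(other.index) == list(opt1.index)
    assert list(other.columns) == list(opt1.columns)
    assert (other.to_numpy() == opt1.to_numpy()).all()
```

The comparison goes through `to_numpy()` rather than `DataFrame.equals` because the `crosstab` result carries axis names that the other two do not.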
    df1 = pd.concat([pd.DataFrame({k: v}) for k, v in d.items()], axis=1).stack().droplevel(0)
    pd.crosstab(df1.index, df1)

Output:

    col_0  apple  banana  mango  peach  strawberry
    row_0
    anna       1       1      0      0           1
    bob        0       1      0      1           1
    chris      1       1      1      1           0
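The `row_0` and `col_0` labels in this output are the default axis names that `crosstab` assigns to unnamed inputs. If they are unwanted, a minimal sketch like the following clears them with `rename_axis`. The names `out` and `clean` are only illustrative.

```python
import pandas as pd

d = {
    "anna": ["apple", "strawberry", "banana"],
    "bob": ["strawberry", "banana", "peach"],
    "chris": ["apple", "banana", "peach", "mango"],
}

# Same steps as the answer: one (person, fruit) entry per row, then a cross-tabulation
df1 = pd.concat([pd.DataFrame({k: v}) for k, v in d.items()], axis=1).stack().droplevel(0)
out = pd.crosstab(df1.index, df1)

# Drop the auto-generated axis names so only the row and column labels remain
clean = out.rename_axis(index=None, columns=None)
print(clean)
```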
You can use `str.join()` on a Series:

    pd.Series(d).str.join('|').str.get_dummies()

Output:

           apple  banana  mango  peach  strawberry
    anna       1       1      0      0           1
    bob        0       1      0      1           1
    chris      1       1      1      1           0
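To see why this one-liner works, it can help to look at each intermediate step. The sketch below is only illustrative (the names `lists`, `joined`, `out`, and `same` are not from the answer). It also shows the `sep` argument of `str.get_dummies`, which would matter if an item itself contained `|`.

```python
import pandas as pd

d = {
    "anna": ["apple", "strawberry", "banana"],
    "bob": ["strawberry", "banana", "peach"],
    "chris": ["apple", "banana", "peach", "mango"],
}

lists = pd.Series(d)            # one list of fruits per person
joined = lists.str.join("|")    # each list becomes a single "|"-delimited string
out = joined.str.get_dummies()  # splits on "|" (the default sep) into 0/1 columns

# If items could contain "|", any separator absent from the data should work,
# as long as the same one is used for joining and for splitting.
same = lists.str.join(";").str.get_dummies(sep=";")
assert out.equals(same)
```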