Pandas Explode on Multiple columns
Question:
I'm using pandas 0.25.3 and trying to explode several columns at once.
Data looks like:
d1 = {'user':['user1','user2','user3','user4'],
'paid':['Y','Y','N','N'],
'last_active':['11 Jul 2019','23 Sep 2018','08 Dec 2019','03 Mar 2018'],
'col4':'data'}
I loaded this into a DataFrame with df = pd.DataFrame([d1], columns=d1.keys())
which looks like this:
user paid last_active col4
['user1','user2','user3','user4'] ['Y','Y','N','N'] ['11 Jul 2019','23 Sep 2018','08 Dec 2019','03 Mar 2018'] 'data'
There are other columns as well that hold a single value each ({'A':'B'}
type stuff), but I'm not worried about those.
When I call df.explode('user')
it works for that column, and the same goes for each of the other columns individually. But when I try df.explode(column=('user','paid','last_active'))
I get the following error:
KeyError: ('user','paid','last_active')
So what I want to know is: how can I use the explode
function on multiple columns to get the following df:
user paid last_active col4
'user1' 'Y' '11 Jul 2019' 'data'
'user2' 'Y' '23 Sep 2018' NaN
'user3' 'N' '08 Dec 2019' NaN
'user4' 'N' '03 Mar 2018' NaN
Answers:
I guess you need the following (note the difference in the data for col4,
which is padded with None here
rather than the NaN shown in the OP's expected output):
pd.DataFrame([[i] if not isinstance(i,list) else i
for i in d1.values()],index=d1.keys()).T
user paid last_active col4
0 user1 Y 11 Jul 2019 data
1 user2 Y 23 Sep 2018 None
2 user3 N 08 Dec 2019 None
3 user4 N 03 Mar 2018 None
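If the NaN padding from the question's expected output matters, one possible variant is to build a Series per key and let pandas align them. This is only a sketch: it assumes the data is still available as the dict d1, and the name result is chosen here for illustration.

```python
import pandas as pd

d1 = {
    'user': ['user1', 'user2', 'user3', 'user4'],
    'paid': ['Y', 'Y', 'N', 'N'],
    'last_active': ['11 Jul 2019', '23 Sep 2018', '08 Dec 2019', '03 Mar 2018'],
    'col4': 'data',
}

# Wrap scalars in a one-element list, then build one Series per key.
# Aligning Series of different lengths fills the missing rows with NaN.
result = pd.DataFrame({
    key: pd.Series(value if isinstance(value, list) else [value])
    for key, value in d1.items()
})
print(result)
```

Here col4 holds 'data' in the first row and NaN in the remaining three, while the list columns are spread over four rows.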
pandas 0.25.3 does not have a multi-column explode (passing a list of columns to DataFrame.explode is only supported from pandas 1.3.0 onward). There are workarounds. One simple way could be:
import pandas as pd
df = pd.DataFrame(
{
'A': [1, 2],
'B': [['a','b'], ['c','d']],
'C': [['z','y'], ['x','w']]
}
)
print(df)
--------------
A B C
--------------
1 [a, b] [z, y]
2 [c, d] [x, w]
##list_cols are the columns to be exploded
list_cols = ['B', 'C']
##other_cols holds all the remaining column names, in their original order
other_cols = [col for col in df.columns if col not in list_cols]
##explode each list column separately; each result is a Series that
##repeats the original row index once per list element
exploded = [df[col].explode() for col in list_cols]
##zip pairs each column name with its exploded Series, and dict puts the
##pairs in a format from which a dataframe can be created
##(this relies on the lists in each row having the same length)
##print the individual outputs of each command to follow the flow
df2 = pd.DataFrame(dict(zip(list_cols, exploded)))
##now merge the other_cols back in on the index
df2 = df[other_cols].merge(df2, how="right", left_index=True, right_index=True)
##lastly, re-create the original column order
df2 = df2.loc[:, df.columns]
print(df2)
------
A B C
------
1 a z
1 b y
2 c x
2 d w
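For readers on a newer pandas, the workaround may not be needed. The sketch below assumes pandas 1.3.0 or later, where explode accepts a list of column names. It also shows why the call in the question fails: a tuple is looked up as a single column label rather than as several columns. The names tuple_failed and result are illustrative only.

```python
import pandas as pd

d1 = {
    'user': ['user1', 'user2', 'user3', 'user4'],
    'paid': ['Y', 'Y', 'N', 'N'],
    'last_active': ['11 Jul 2019', '23 Sep 2018', '08 Dec 2019', '03 Mar 2018'],
    'col4': 'data',
}
df = pd.DataFrame([d1], columns=list(d1))

# A tuple is treated as one column label, so this lookup raises KeyError.
try:
    df.explode(('user', 'paid', 'last_active'))
    tuple_failed = False
except KeyError:
    tuple_failed = True

# Assumes pandas >= 1.3.0: a list of names explodes the columns together.
# ignore_index=True renumbers the rows 0..3 instead of repeating index 0.
result = df.explode(['user', 'paid', 'last_active'], ignore_index=True)
print(result)
```

Unlike the expected output in the question, this repeats 'data' in col4 on every row, because explode copies the non-exploded columns for each new row.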