Turning column of list of lists (of unequal length) into separate variable columns (python, pandas)

Question:

I’m having trouble turning a column of lists of lists into separate columns. I have a bad solution that works by working on each row independently and then appending them to each other, but this takes far too long for ~500k rows. Wondering if someone has a better solution.

Here is the input:

>>> import pandas as pd 
>>> import numpy as np 
>>> pd.DataFrame({'feat': [[["str1","", 3], ["str3","", 5], ["str4","", 3]],[["str1","", 4], ["str2","", 5]] ]})
feat
0 [[str1, , 3], [str3, , 5], [str4, , 3]]
1 [[str1, , 4], [str2, , 5]]

Desired output:

>>> pd.DataFrame({'str1': [3, 4], 'str2': [np.nan,5] , 'str3': [5,np.nan], 'str4': [3,np.nan]})
str1 str2 str3 str4
0 3 NaN 5 3
1 4 5 NaN NaN

Update: Solved by @ifly6! Fastest solution by far. For 100k rows and 80 total variables, the total time taken was 8.9 seconds for my machine.

Asked By: Justin Young

||

Answers:

Loading your df, create df1 as follows:

df1 = pd.DataFrame.from_records(df.explode('feat').values.flatten()).replace('', np.nan)
df1.index = df.explode('feat').index

Set index on df1 from the original data to preserve row markers (passing index=df.explode('feat').index does not work). (Alternatively, to get to the point where you have separated the lists into columns, you could use df.explode('feat')['feat'].apply(pd.Series). I prefer, however, to avoid apply so use the DataFrame constructor instead.)

Reset index on df1 then set multi-index (cannot set the column 0 index directly because it overwrites the original index):

df1.reset_index().set_index(['index', 0])
# df1.set_index(0, append=True)  # alternatively should work

Then unstack. You can drop columns that are all NaN by appending .dropna(how='all', axis=1), yielding:

>>> df1.reset_index().set_index(['index', 0]).unstack().dropna(how='all', axis=1)
         2               
0     str1 str2 str3 str4
index                    
0      3.0  NaN  5.0  3.0
1      4.0  5.0  NaN  NaN

This solution also largely avoids hard-coding which specific columns to look at or move about.

Answered By: ifly6

here is one way to do it

# explode the list to rows

df=df.explode('feat')

# remove the [] from the list, and split on ","
df[['col1','col3','col2']]=df['feat'].astype('str').replace('[[]]','', regex=True).str.split(',', expand=True)

# use pivot after reindexing
df=df.reset_index()
df.pivot(index='index', columns='col1', values='col2')
df
col1    'str1'  'str2'  'str3'  'str4'
index               
0         3       NaN      5      3
1         4         5    NaN    NaN
Answered By: Naveed

Convert your nested lists to dictionaries that pd.Series can interpret:

df = df.feat.apply(lambda val: pd.Series({y[0]:y[2] for y in val}))
df = df[df.columns.sort_values()]
print(df)

Output:

   str1  str2  str3  str4
0   3.0   NaN   5.0   3.0
1   4.0   5.0   NaN   NaN
Answered By: BeRT2me

My solution is a brute force approach building the new df1 cell by cell using df1.loc[i, col_name].

import pandas as pd

df= pd.DataFrame({'feat': [[["str1","", 3], ["str3","", 5], ["str4","", 3]],[["str1","", 4], ["str2","", 5]] ]})
df1 = pd.DataFrame()
for i in range(df.shape[0]):
    for e in df.loc[i, 'feat']:
        df1.loc[i, e[0]] = e[2]
print(df1)

Output (not in column order):

   str1  str3  str4  str2
0   3.0   5.0   3.0   NaN
1   4.0   NaN   NaN   5.0

And the time taken is

import timeit
timeit.timeit('''
import pandas as pd
df= pd.DataFrame({'feat': [[["str1","", 3], ["str3","", 5], ["str4","", 3]],[["str1","", 4], ["str2","", 5]] ]})
df1 = pd.DataFrame()
for i in range(df.shape[0]):
    for e in df.loc[i, 'feat']:
        df1.loc[i, e[0]] = e[2]
''', number=10000)

19.209370899999996

So it took about 20 seconds for 10K runs. I’m curious to know how the other algorithms perform. Please also run it on your own because time taken varies for different computers. And also varies with different dataset. Here they are:

#Answer from @ifly6

import timeit
timeit.timeit('''
import pandas as pd
import numpy as np
df= pd.DataFrame({'feat': [[["str1","", 3], ["str3","", 5], ["str4","", 3]],[["str1","", 4], ["str2","", 5]] ]})
df1 = pd.DataFrame.from_records(df.explode('feat').values.flatten()).replace('', np.nan)
df1.index = df.explode('feat').index
df1 = df1.reset_index().set_index(['index', 0]).unstack().dropna(how='all', axis=1)
''', number=10000)

48.217678400000295

#Answer from @Naveed

import timeit
timeit.timeit('''
import pandas as pd
df= pd.DataFrame({'feat': [[["str1","", 3], ["str3","", 5], ["str4","", 3]],[["str1","", 4], ["str2","", 5]] ]})
df = df.explode('feat')
df[['col1','col3','col2']] = df['feat'].astype('str').replace('[[]]','', regex=True).str.split(',', expand=True)
df = df.reset_index()
df = df.pivot(index='index', columns='col1', values='col2')
''', number=10000)

34.94540550000056

#Answer from @BeRT2me (it’s even faster without rearranging the columns with df = df[df.columns.sort_values()])

import timeit
timeit.timeit('''
import pandas as pd
df= pd.DataFrame({'feat': [[["str1","", 3], ["str3","", 5], ["str4","", 3]],[["str1","", 4], ["str2","", 5]] ]})
df = df.feat.apply(lambda val: pd.Series({y[0]:y[2] for y in val}))
df = df[df.columns.sort_values()]
''', number=10000)

12.745890199999849
Answered By: perpetualstudent
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.