Expand column containing list of tuples into the current dataframe

Question:

I have a dataframe in the following format:

df = pd.DataFrame({'column_with_tuples': [[('word1', 10), ('word2', 20), ('word3', 30)], [('word4', 40), ('word5', 50), ('word6', 60)]],
                   'category':['category1','category2']})

I want to move the tuples into two separate columns and preserve the category column to be able to easily filter the most common words for each category.

So the final result should look like this:

df_new = pd.DataFrame({'word': ['word1','word2', 'word3','word4','word5','word6'],
                   'frequency': [10, 20, 30, 40, 50, 60],
                   'category':['category1','category1', 'category1', 'category2', 'category2', 'category2']})

I tried with this code but the result is not the one I expect:

df_tuples = pd.concat([pd.DataFrame(x) for x in df['column_with_tuples']], ignore_index=True)

df_tuples.columns = ['word', 'frequency']

df.drop(['column_with_tuples'], axis=1, inplace=True)

df = pd.concat([df, df_tuples], axis=1)

I would appreciate some help here.

Asked By: Yana

||

Answers:

You should use .explode() method to expand the tuples in the column_with_tuples column into separate rows. After that, introduce .rename() method to change the name of the column, then unpack the tuples into separate columns and add the category column using the .apply() method. And finally assign() method to add the category column to the your dataframe.

df_new = df.explode("column_with_tuples")
df_new = df_new.rename(columns={"column_with_tuples": "word"})
df_new[["word", "frequency"]] = df_new["word"].apply(pd.Series)

df_new = df_new.assign(category=df["category"])
df_new = df_new[["word", "frequency", "category"]]
df_new.reset_index(drop=True, inplace=True)
print(df_new)

Simplified version of the above code:

df_new = df.explode("column_with_tuples").rename(columns={"column_with_tuples": "word"})
df_new[["word", "frequency"]] = df_new["word"].apply(pd.Series)
df_new.assign(category=df["category"])

df_new = df_new[["word", "frequency", "category"]].reset_index(drop=True)
print(df_new)

    word  frequency   category
0  word1         10  category1
1  word2         20  category1
2  word3         30  category1
3  word4         40  category2
4  word5         50  category2
5  word6         60  category2
Answered By: Jamiu S.

You can initially explode column_with_tuples into multiple rows and then build a multiindex from a series of tuples (word, freaquency) with pd.MultiIndex.from_tuples:

df2 = df.explode('column_with_tuples')
df2.set_index(pd.MultiIndex.from_tuples(df2['column_with_tuples']))
    .reset_index(names=['word', 'frequency']).drop(columns='column_with_tuples')

   word  frequency   category
0  word1         10  category1
1  word2         20  category1
2  word3         30  category1
3  word4         40  category2
4  word5         50  category2
5  word6         60  category2
Answered By: RomanPerekhrest

One option with the explode method:

(df
.explode('column_with_tuples')
.assign(word = lambda df: df.column_with_tuples.str[0], 
        frequency = lambda df: df.column_with_tuples.str[1])
.drop(columns='column_with_tuples')
) 
    category   word  frequency
0  category1  word1         10
0  category1  word2         20
0  category1  word3         30
1  category2  word4         40
1  category2  word5         50
1  category2  word6         60

Another option, using vanilla python, before creating the final dataframe:

from itertools import product, chain

out = [product([cat], tuples) 
       for cat, tuples 
       in zip(df.category, df.column_with_tuples)]

out = chain.from_iterable(out)

out = [(cat, *tuples) for cat, tuples in out]

pd.DataFrame(out, columns = ['category', 'word', 'frequency'])
    category   word  frequency
0  category1  word1         10
1  category1  word2         20
2  category1  word3         30
3  category2  word4         40
4  category2  word5         50
5  category2  word6         60
Answered By: sammywemmy
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.