Expand column containing list of tuples into the current dataframe
Question:
I have a dataframe in the following format:
df = pd.DataFrame({'column_with_tuples': [[('word1', 10), ('word2', 20), ('word3', 30)], [('word4', 40), ('word5', 50), ('word6', 60)]],
'category':['category1','category2']})
I want to move the tuples into two separate columns and preserve the category column to be able to easily filter the most common words for each category.
So the final result should look like this:
df_new = pd.DataFrame({'word': ['word1','word2', 'word3','word4','word5','word6'],
'frequency': [10, 20, 30, 40, 50, 60],
'category':['category1','category1', 'category1', 'category2', 'category2', 'category2']})
I tried with this code but the result is not the one I expect:
df_tuples = pd.concat([pd.DataFrame(x) for x in df['column_with_tuples']], ignore_index=True)
df_tuples.columns = ['word', 'frequency']
df.drop(['column_with_tuples'], axis=1, inplace=True)
df = pd.concat([df, df_tuples], axis=1)
I would appreciate some help here.
Answers:
Use the .explode() method to expand the lists in column_with_tuples into separate rows, one tuple per row. Then use .rename() to rename that column to word, and unpack each tuple into the word and frequency columns with .apply(pd.Series). The category column is carried along by .explode(), so the .assign() call only restates it. Finally, reorder the columns and reset the index.
df_new = df.explode("column_with_tuples")
df_new = df_new.rename(columns={"column_with_tuples": "word"})
df_new[["word", "frequency"]] = df_new["word"].apply(pd.Series)
df_new = df_new.assign(category=df["category"])
df_new = df_new[["word", "frequency", "category"]]
df_new.reset_index(drop=True, inplace=True)
print(df_new)
Simplified version of the above code:
df_new = df.explode("column_with_tuples").rename(columns={"column_with_tuples": "word"})
df_new[["word", "frequency"]] = df_new["word"].apply(pd.Series)
# No assign() is needed here: explode() already keeps the category column on every row.
df_new = df_new[["word", "frequency", "category"]].reset_index(drop=True)
print(df_new)
    word  frequency   category
0  word1         10  category1
1  word2         20  category1
2  word3         30  category1
3  word4         40  category2
4  word5         50  category2
5  word6         60  category2
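As a possible alternative to .apply(pd.Series), which tends to be slow on larger frames, the tuples can be unpacked by passing them to the pd.DataFrame constructor. This is only a minimal sketch, not part of the answer above; the names exploded and pairs are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "column_with_tuples": [
        [("word1", 10), ("word2", 20), ("word3", 30)],
        [("word4", 40), ("word5", 50), ("word6", 60)],
    ],
    "category": ["category1", "category2"],
})

# One row per tuple, then a fresh 0..n-1 index so the pieces line up.
exploded = df.explode("column_with_tuples").reset_index(drop=True)

# Unpack the tuples in one step via the DataFrame constructor.
pairs = pd.DataFrame(
    exploded["column_with_tuples"].tolist(),
    columns=["word", "frequency"],
)

# Put the unpacked columns next to the category column.
df_new = pd.concat([pairs, exploded["category"]], axis=1)
print(df_new)
```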
You can first explode column_with_tuples into multiple rows and then build a MultiIndex from the series of (word, frequency) tuples with pd.MultiIndex.from_tuples:
df2 = df.explode('column_with_tuples')
# The outer parentheses let the method chain span several lines.
(df2.set_index(pd.MultiIndex.from_tuples(df2['column_with_tuples']))
    .reset_index(names=['word', 'frequency']).drop(columns='column_with_tuples'))
    word  frequency   category
0  word1         10  category1
1  word2         20  category1
2  word3         30  category1
3  word4         40  category2
4  word5         50  category2
5  word6         60  category2
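As far as I know, the names argument of reset_index was only added in pandas 1.5. On older versions, a variant of the same idea is to name the levels when the MultiIndex is built. The sketch below assumes the same df as in the question; idx and out are just illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "column_with_tuples": [
        [("word1", 10), ("word2", 20), ("word3", 30)],
        [("word4", 40), ("word5", 50), ("word6", 60)],
    ],
    "category": ["category1", "category2"],
})

df2 = df.explode("column_with_tuples")

# Name the levels up front, so a plain reset_index() turns them
# into the 'word' and 'frequency' columns.
idx = pd.MultiIndex.from_tuples(
    df2["column_with_tuples"].tolist(),
    names=["word", "frequency"],
)

out = (
    df2.set_index(idx)
       .reset_index()
       .drop(columns="column_with_tuples")
)
print(out)
```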
One option with the explode method:
(df
.explode('column_with_tuples')
.assign(word = lambda df: df.column_with_tuples.str[0],
frequency = lambda df: df.column_with_tuples.str[1])
.drop(columns='column_with_tuples')
)
    category   word  frequency
0  category1  word1         10
0  category1  word2         20
0  category1  word3         30
1  category2  word4         40
1  category2  word5         50
1  category2  word6         60
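Note that the index in the output above repeats (0, 0, 0, 1, 1, 1), because explode keeps the original row labels. If a clean running index is wanted, one way is to pass ignore_index=True to explode (available since pandas 1.1). This is a sketch of the same chain with that one change; the lambda argument is renamed to d only to avoid shadowing df.

```python
import pandas as pd

df = pd.DataFrame({
    "column_with_tuples": [
        [("word1", 10), ("word2", 20), ("word3", 30)],
        [("word4", 40), ("word5", 50), ("word6", 60)],
    ],
    "category": ["category1", "category2"],
})

out = (
    df
    # ignore_index=True relabels the exploded rows 0..n-1.
    .explode("column_with_tuples", ignore_index=True)
    # .str[i] picks the i-th element of each tuple.
    .assign(
        word=lambda d: d["column_with_tuples"].str[0],
        frequency=lambda d: d["column_with_tuples"].str[1],
    )
    .drop(columns="column_with_tuples")
)
print(out)
```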
Another option is to reshape the data in plain Python before creating the final dataframe:
from itertools import product, chain
out = [product([cat], tuples)
for cat, tuples
in zip(df.category, df.column_with_tuples)]
out = chain.from_iterable(out)
out = [(cat, *tuples) for cat, tuples in out]
pd.DataFrame(out, columns = ['category', 'word', 'frequency'])
    category   word  frequency
0  category1  word1         10
1  category1  word2         20
2  category1  word3         30
3  category2  word4         40
4  category2  word5         50
5  category2  word6         60
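Once the data is in the long word / frequency / category shape, the filtering mentioned in the question (most common words per category) could look roughly like this. It is a sketch and not part of any answer above; the cutoff n = 2 is an arbitrary assumption, and df_new is rebuilt here only to keep the example self-contained.

```python
import pandas as pd

# The target layout from the question.
df_new = pd.DataFrame({
    "word": ["word1", "word2", "word3", "word4", "word5", "word6"],
    "frequency": [10, 20, 30, 40, 50, 60],
    "category": ["category1"] * 3 + ["category2"] * 3,
})

n = 2  # hypothetical cutoff: number of words to keep per category

# Sort by frequency, then keep the first n rows of each category.
top = (
    df_new.sort_values("frequency", ascending=False)
          .groupby("category")
          .head(n)
)
print(top)
```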