How to use pd.melt to unpivot a dataframe where columns share a prefix?
Question:
I’m trying to unpivot my data using pd.melt, but with no success so far. Each row is a business, and the data contains information about the business along with multiple reviews. I want every review to be its own row.
My first 150 columns come in 10 groups of 15; the column names in each group share the prefix reviews/n/ for 0 ≤ n ≤ 9 (reviews/0/text, reviews/0/date, … , reviews/9/date).
The next 65 columns in the dataframe hold more data about the business (e.g. business_id, address) and should remain as id variables.
My current data looks like this:
business_id | address | reviews/0/date | reviews/0/text | reviews/1/date | reviews/1/text |
---|---|---|---|---|---|
12345 | 01 street | 1/1/1990 | "abc" | 2/2/1995 | "def" |
and my new dataframe should have every review as a row instead of every business, and look like this:
business_id | address | review_number | review_date | review_text |
---|---|---|---|---|
12345 | 01 street | 0 | 1/1/1990 | "abc" |
12345 | 01 street | 1 | 2/2/1995 | "def" |
I tried using pd.melt but could not write code that produced anything useful.
Answers:
You can use pandas.wide_to_long() to do what you want.
However, you will first need to rename your columns from the pattern reviews/N/COL to something like review_COLN, because wide_to_long() can only unpivot columns that consist of a shared prefix (stub) followed by a suffix, whereas in your column names the varying number sits in the middle.
You could do this manually or, for example, with the re module and an appropriate regex:
df = df.rename(columns=lambda x: re.sub(r'reviews/(\d+)/(.*)', r'review_\2\1', x))
After that, your data should look like this (note the changed colnames):
business_id | address | review_date0 | review_text0 | review_date1 | review_text1 |
---|---|---|---|---|---|
12345 | 01 street | 1/1/1990 | abc | 2/2/1995 | def |
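The rename step can be checked on the column names alone before touching the data. The following is a minimal sketch; the list of names (including the two-digit reviews/10/date) is made up for illustration, and it assumes the pattern uses \d+ so that review numbers with more than one digit are also handled:

```python
import re

# Hypothetical column names; 'reviews/10/date' is included only to show
# that a multi-digit review number is handled by \d+.
columns = ['business_id', 'address',
           'reviews/0/date', 'reviews/0/text', 'reviews/10/date']

pattern = re.compile(r'reviews/(\d+)/(.*)')

# Group 1 is the review number, group 2 the field name; the replacement
# moves the number to the end so it becomes a suffix.
renamed = [pattern.sub(r'review_\2\1', c) for c in columns]
print(renamed)
```

Names that do not match the pattern, such as business_id, pass through unchanged.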
Now you can use pandas.wide_to_long(), with the stubnames parameter specifying the prefixes of the columns that should be grouped when you unpivot.
df = pd.wide_to_long(df,
                     stubnames=['review_date', 'review_text'],
                     i=['business_id', 'address'],
                     j='review_number')
Finally, call .reset_index() to achieve the result you asked for.
Full example:
import re
import pandas as pd

df = pd.DataFrame({'business_id': 12345,
                   'address': '01 street',
                   'reviews/0/date': '1/1/1990',
                   'reviews/0/text': 'abc',
                   'reviews/1/date': '2/2/1995',
                   'reviews/1/text': 'def'}, index=[0])

# Move the review number to the end: reviews/0/date -> review_date0
df = df.rename(columns=lambda x: re.sub(r'reviews/(\d+)/(.*)', r'review_\2\1', x))

# Unpivot on the two stubs; the numeric suffix becomes review_number
df = pd.wide_to_long(df,
                     stubnames=['review_date', 'review_text'],
                     i=['business_id', 'address'],
                     j='review_number').reset_index()
Result:
business_id | address | review_number | review_date | review_text |
---|---|---|---|---|
12345 | 01 street | 0 | 1/1/1990 | abc |
12345 | 01 street | 1 | 2/2/1995 | def |
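With 15 fields per review and 65 business columns, typing out stubnames and every id column by hand gets tedious. The sketch below is one way around that, under two assumptions that are not stated in the question: business_id is unique per row, and the second business (67890) is invented here purely to have more than one row. It derives the stub names from the column names and passes only business_id as i, since wide_to_long() carries the remaining non-stub columns along.

```python
import re
import pandas as pd

# Made-up sample data with two businesses (the second row is hypothetical).
df = pd.DataFrame({
    'business_id': [12345, 67890],
    'address': ['01 street', '02 avenue'],
    'reviews/0/date': ['1/1/1990', '3/3/2000'],
    'reviews/0/text': ['abc', 'ghi'],
    'reviews/1/date': ['2/2/1995', '4/4/2005'],
    'reviews/1/text': ['def', 'jkl'],
})

pattern = re.compile(r'reviews/(\d+)/(.*)')

# Distinct field names (date, text, ...) in order of first appearance.
fields = list(dict.fromkeys(
    m.group(2) for m in map(pattern.fullmatch, df.columns) if m
))
stubnames = [f'review_{field}' for field in fields]

# reviews/0/date -> review_date0, and so on.
df = df.rename(columns=lambda c: pattern.sub(r'review_\2\1', c))

# i must uniquely identify each row; other non-stub columns (address)
# are kept as ordinary columns in the result.
long_df = (
    pd.wide_to_long(df, stubnames=stubnames, i='business_id', j='review_number')
      .reset_index()
      .sort_values(['business_id', 'review_number'], ignore_index=True)
)
print(long_df)
```

If business_id is not unique in the real data, wide_to_long() raises an error, and i would need additional columns (or a fresh row identifier) to make each row unique.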
You can get the names of all the non-review columns.
columns = df.columns[~df.columns.str.match(r'reviews/\d+/')]
>>> columns
Index(['address', 'business_id'], dtype='object')
And use those as the id columns for .melt():
df = df.melt(columns)
df['review_number'] = df['variable'].str.extract(r'reviews/(\d+)')
df['variable'] = df['variable'].str.replace(r'reviews/\d+/', 'review_', regex=True)
>>> df
address business_id variable value review_number
0 street 1 review_date 1990-01-01 00:00:00 0
1 street 1 review_text "abc" 0
2 street 1 review_date 1995-02-02 00:00:00 1
3 street 1 review_text "def" 1
From there you can .pivot()
>>> df.pivot(index=columns.union(['review_number']).to_list(), columns='variable')
value
variable review_date review_text
address business_id review_number
street 1 0 1990-01-01 00:00:00 "abc"
1 1995-02-02 00:00:00 "def"
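The melt and pivot steps above can be put together into one runnable sketch that also flattens the pivoted result back into ordinary columns. It uses the sample row from the question; the variable names (id_cols, long_df, tidy) are chosen here for illustration, and passing a list as index to .pivot() assumes a reasonably recent pandas (1.1 or later).

```python
import pandas as pd

df = pd.DataFrame({'business_id': 12345,
                   'address': '01 street',
                   'reviews/0/date': '1/1/1990',
                   'reviews/0/text': 'abc',
                   'reviews/1/date': '2/2/1995',
                   'reviews/1/text': 'def'}, index=[0])

# Every column that is not a review column is treated as an id column.
id_cols = df.columns[~df.columns.str.match(r'reviews/\d+/')].tolist()

# One row per (business, original review column).
long_df = df.melt(id_vars=id_cols)

# Split 'reviews/0/date' into the review number and the field name.
parts = long_df['variable'].str.extract(r'reviews/(\d+)/(.+)')
long_df['review_number'] = parts[0].astype(int)
long_df['variable'] = 'review_' + parts[1]

# Spread the fields back out so each review is a single row,
# then turn the index levels back into plain columns.
tidy = (
    long_df.pivot(index=id_cols + ['review_number'],
                  columns='variable', values='value')
           .reset_index()
           .rename_axis(columns=None)
)
print(tidy)
```

Passing values='value' avoids the extra column level seen in the output above. Note that .pivot() raises an error if the id columns plus review_number do not uniquely identify each review.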
One option is pivot_longer from pyjanitor. Here the special placeholder .value marks the parts of the column name that should remain as headers, while the rest is collated into a new column:
# pip install pyjanitor
import pandas as pd
import janitor

(df
 .pivot_longer(
     index=['business_id', 'address'],
     names_to=['.value', 'reviewnumber', '.value'],
     names_pattern=r"(review)s/(\d+)/(.+)"
 )
 # reviewnumber -> review_number, reviewdate -> review_date, ...
 .rename(columns=lambda f: f.replace('review', 'review_'))
)
business_id address review_number review_date review_text
0 12345 01 street 0 1/1/1990 abc
1 12345 01 street 1 2/2/1995 def
A regex gives you the flexibility to extract the labels into separate columns. Note that you can use .value as many times as you want, as long as you get the regex right.
Another option is DataFrame.stack, where the column names are split into levels before reshaping. As a general rule, if you can, split the columns before flipping, not after (the larger the data, the more this pays off in performance):
temp = df.set_index(['business_id', 'address'])
temp.columns = temp.columns.str.split("/", expand=True)
temp.columns.names = [None, 'review_numbers', None]
# quick route - the collapse_levels function
# is from pyjanitor
# temp.stack('review_numbers').collapse_levels().reset_index()
temp = temp.stack('review_numbers')
temp.columns = temp.columns.map("_".join)
temp.reset_index()
business_id address review_numbers reviews_date reviews_text
0 12345 01 street 0 1/1/1990 abc
1 12345 01 street 1 2/2/1995 def