How to use pd.melt to unpivot a dataframe where columns share a prefix?

Question:

I'm trying to unpivot my data using pd.melt but have had no success so far. Each row is a business, and the data contains information about the business and multiple reviews. I want my data to have one row per review.

My first 150 columns come in 10 groups of 15. The column names in each group share the pattern reviews/n/ for 0 ≤ n ≤ 9 (reviews/0/text, reviews/0/date, … , reviews/9/date).
The next 65 columns in the dataframe hold more data about the business (e.g. business_id, address) and should remain as id variables.

My current data looks like this:

business_id address reviews/0/date reviews/0/text reviews/1/date reviews/1/text
12345 01 street 1/1/1990 "abc" 2/2/1995 "def"

and my new dataframe should have every review as a row instead of every business, and look like this:

business_id address review_number review_date review_text
12345 01 street 0 1/1/1990 "abc"
12345 01 street 1 2/2/1995 "def"

I tried using pd.melt but could not succeed in making code that produced something valuable to me.

Asked By: raz


Answers:

You can use pandas.wide_to_long() to do what you want.

However, you will need to rename your columns from the pattern reviews/N/COL to reviews/COL/N (or something similar) first, as wide_to_long() can only unpivot based on prefixes, whereas in your column names, you have a prefix and a suffix.

You could do this manually or e.g. using the re module and an appropriate regex:

df = df.rename(columns=lambda x: re.sub(r'reviews/(\d)/(.*)', r'review_\2\1', x))

After that, your data should look like this (note the changed colnames):

business_id address review_date0 review_text0 review_date1 review_text1
12345 01 street 1/1/1990 abc 2/2/1995 def
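
The renaming step can be checked on its own before reshaping. The snippet below is a minimal sketch built on the sample row from the question; it assumes `\d+` instead of `\d` so that review numbers with more than one digit would also match, and the variable names `pattern` and `renamed` are only illustrative.

```python
import re
import pandas as pd

# Sample row from the question
df = pd.DataFrame({
    'business_id': [12345],
    'address': ['01 street'],
    'reviews/0/date': ['1/1/1990'],
    'reviews/0/text': ['abc'],
    'reviews/1/date': ['2/2/1995'],
    'reviews/1/text': ['def'],
})

# Group 1 is the review number, group 2 is the field name.
# The replacement puts the field first and the number last.
pattern = re.compile(r'reviews/(\d+)/(.*)')
renamed = df.rename(columns=lambda name: pattern.sub(r'review_\2\1', name))

print(list(renamed.columns))
```

Columns that do not match the pattern, such as business_id and address, pass through unchanged because `re.sub` returns the input string when there is no match.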

Now you can use pandas.wide_to_long() and use the stubnames parameter to specify the prefix of the columns that should be grouped when you unpivot.

df = pd.wide_to_long(df,
                     stubnames=['review_date','review_text'],
                     i=['business_id', 'address'], 
                     j='review_number')

Finally, call .reset_index() to achieve the result you asked for.

Full example:

import re
import pandas as pd

df = pd.DataFrame({'business_id': 12345, 
                   'address': '01 street', 
                   'reviews/0/date': '1/1/1990', 
                   'reviews/0/text': 'abc', 
                   'reviews/1/date': '2/2/1995', 
                   'reviews/1/text': 'def'}, index = [0])

df = df.rename(columns=lambda x: re.sub(r'reviews/(\d)/(.*)', r'review_\2\1', x))

df = pd.wide_to_long(df,
                     stubnames=['review_date','review_text'],
                     i=['business_id', 'address'], 
                     j='review_number').reset_index()

Result:

business_id address review_number review_date review_text
12345 01 street 0 1/1/1990 abc
12345 01 street 1 2/2/1995 def
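
With 15 fields per review and 65 id columns, as described in the question, typing out stubnames and i by hand gets tedious. The following is a sketch of one way to derive both lists from the column names; the second business row and the helper names (`review_cols`, `id_cols`, `long_df`) are made up for illustration, and it assumes the id columns together identify each row uniquely, which wide_to_long requires.

```python
import re
import pandas as pd

# Two sample businesses; the second row is invented for illustration
df = pd.DataFrame({
    'business_id': [12345, 67890],
    'address': ['01 street', '02 avenue'],
    'reviews/0/date': ['1/1/1990', '3/3/2000'],
    'reviews/0/text': ['abc', 'ghi'],
    'reviews/1/date': ['2/2/1995', '4/4/2005'],
    'reviews/1/text': ['def', 'jkl'],
})

pattern = re.compile(r'reviews/(\d+)/(.*)')

# Split the columns into review columns and id columns
review_cols = [c for c in df.columns if pattern.fullmatch(c)]
id_cols = [c for c in df.columns if c not in review_cols]

# One stubname per distinct field, in first-seen order
stubnames = list(dict.fromkeys(
    'review_' + pattern.fullmatch(c).group(2) for c in review_cols
))

renamed = df.rename(columns=lambda c: pattern.sub(r'review_\2\1', c))
long_df = pd.wide_to_long(
    renamed, stubnames=stubnames, i=id_cols, j='review_number'
).reset_index()

print(long_df.sort_values(['business_id', 'review_number']))
```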
Answered By: buddemat

You can get the names of all the non-review columns.

columns = df.columns[~df.columns.str.match(r'reviews/\d+/')]
>>> columns
Index(['address', 'business_id'], dtype='object')

And use those to .melt()

df = df.melt(columns)

df['review_number'] = df['variable'].str.extract(r'reviews/(\d+)')
df['variable'] = df['variable'].str.replace(r'reviews/\d+/', 'review_', regex=True)
>>> df
  address  business_id     variable                value review_number
0  street            1  review_date  1990-01-01 00:00:00             0
1  street            1  review_text                "abc"             0
2  street            1  review_date  1995-02-02 00:00:00             1
3  street            1  review_text                "def"             1

From there you can .pivot()

>>> df.pivot(index=columns.union(['review_number']).to_list(), columns='variable')
                                                 value            
variable                                   review_date review_text
address business_id review_number                                 
street  1           0              1990-01-01 00:00:00       "abc"
                    1              1995-02-02 00:00:00       "def"
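
The pivoted frame above still has a two-level column header and the id columns in the index. As a rough sketch of how to flatten it into the layout asked for in the question, the steps can be put together as follows; this uses the sample row from the question rather than the data shown above, passes values='value' to avoid the extra header level, and the names `id_cols`, `long_df` and `tidy` are illustrative.

```python
import pandas as pd

# Sample row from the question
df = pd.DataFrame({
    'business_id': [12345],
    'address': ['01 street'],
    'reviews/0/date': ['1/1/1990'],
    'reviews/0/text': ['abc'],
    'reviews/1/date': ['2/2/1995'],
    'reviews/1/text': ['def'],
})

# Everything that is not a review column is kept as an id column
id_cols = df.columns[~df.columns.str.match(r'reviews/\d+/')].to_list()

long_df = df.melt(id_vars=id_cols)
long_df['review_number'] = (
    long_df['variable'].str.extract(r'reviews/(\d+)/', expand=False).astype(int)
)
long_df['variable'] = long_df['variable'].str.replace(
    r'reviews/\d+/', 'review_', regex=True
)

# values='value' yields single-level columns; rename_axis drops the
# leftover 'variable' label before the index is turned back into columns
tidy = (
    long_df
    .pivot(index=id_cols + ['review_number'], columns='variable', values='value')
    .rename_axis(columns=None)
    .reset_index()
)

print(tidy)
```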
Answered By: jqurious

One option is pivot_longer from pyjanitor. In this case we use the special placeholder .value to mark the parts of the column name that should remain as headers, while the rest is collected into a new column:

# pip install pyjanitor
import pandas as pd
import janitor
(df
.pivot_longer(
    index = ['business_id', 'address'], 
    names_to = ['.value', 'reviewnumber', '.value'], 
    names_pattern = r"(review)s/(\d+)/(.+)"
 )
.rename(columns = lambda f: f.replace('review', 'review_'))
) 
   business_id    address review_number review_date review_text
0        12345  01 street             0    1/1/1990         abc
1        12345  01 street             1    2/2/1995         def

A regex gives the flexibility to extract the labels into separate columns. Note that you can use .value as many times as you want, as long as you get the regex right.

Another option is DataFrame.stack, with the column names split into a MultiIndex before the reshape. As a general rule, split the columns before reshaping rather than after if you can (the larger the data, the more this helps performance):

temp = df.set_index(['business_id', 'address'])
temp.columns = temp.columns.str.split("/", expand=True)
temp.columns.names = [None, 'review_numbers', None]
# quick route - the collapse_levels function 
# is from pyjanitor
# temp.stack('review_numbers').collapse_levels().reset_index()
temp = temp.stack('review_numbers')
temp.columns = temp.columns.map("_".join)
temp.reset_index()
   business_id    address review_numbers reviews_date reviews_text
0        12345  01 street              0     1/1/1990          abc
1        12345  01 street              1     2/2/1995          def
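
For completeness, here is a self-contained sketch of the stack route on the sample row from the question. It differs from the snippet above in two assumed details: the level is named review_number, and the result is post-processed (prefix renamed, number cast to int) to match the column names requested in the question. The name `result` is illustrative.

```python
import pandas as pd

# Sample row from the question
df = pd.DataFrame({
    'business_id': [12345],
    'address': ['01 street'],
    'reviews/0/date': ['1/1/1990'],
    'reviews/0/text': ['abc'],
    'reviews/1/date': ['2/2/1995'],
    'reviews/1/text': ['def'],
})

temp = df.set_index(['business_id', 'address'])

# 'reviews/0/date' becomes the tuple ('reviews', '0', 'date')
temp.columns = temp.columns.str.split('/', expand=True)
temp.columns.names = [None, 'review_number', None]

# Move only the review-number level from the columns into the index
temp = temp.stack('review_number')

# Flatten the remaining ('reviews', field) pairs into single names
temp.columns = temp.columns.map('_'.join)

result = temp.reset_index()
result.columns = result.columns.str.replace('reviews_', 'review_', regex=False)
# The number came from a string split, so convert it explicitly
result['review_number'] = result['review_number'].astype(int)

print(result)
```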
Answered By: sammywemmy