How do I melt a pandas dataframe?

Question:

On the pandas tag, I often see users asking questions about melting dataframes in pandas. I am going to attempt a canonical Q&A (self-answer) on this topic.

I am going to clarify:

  1. What is melt?

  2. How do I use melt?

  3. When do I use melt?

I have also seen several popular questions about melt, which is why a canonical Q&A seems worthwhile.



Dataset:

I will base all my answers on this dataset of random grades for random people with random ages (it makes the answers easier to explain :D):

import pandas as pd
df = pd.DataFrame({'Name': ['Bob', 'John', 'Foo', 'Bar', 'Alex', 'Tom'],
                   'Math': ['A+', 'B', 'A', 'F', 'D', 'C'],
                   'English': ['C', 'B', 'B', 'A+', 'F', 'A'],
                   'Age': [13, 16, 16, 15, 15, 13]})
>>> df
   Name Math English  Age
0   Bob   A+       C   13
1  John    B       B   16
2   Foo    A       B   16
3   Bar    F      A+   15
4  Alex    D       F   15
5   Tom    C       A   13
>>>

Problems:

I am going to pose some problems, and they will be solved in my self-answer below.

Problem 1:

How do I melt a dataframe so that the original dataframe becomes the following?

    Name  Age  Subject Grade
0    Bob   13  English     C
1   John   16  English     B
2    Foo   16  English     B
3    Bar   15  English    A+
4   Alex   15  English     F
5    Tom   13  English     A
6    Bob   13     Math    A+
7   John   16     Math     B
8    Foo   16     Math     A
9    Bar   15     Math     F
10  Alex   15     Math     D
11   Tom   13     Math     C

I want to reshape this so that one column holds the subject and another holds the grade, with the students' names and ages repeated for each subject.

Problem 2:

This is similar to Problem 1, but this time I want the Subject column of the output to contain only Math; the English column should be filtered out:

   Name  Age Subject Grades
0   Bob   13    Math     A+
1  John   16    Math      B
2   Foo   16    Math      A
3   Bar   15    Math      F
4  Alex   15    Math      D
5   Tom   13    Math      C

I want the output to be like the above.

Problem 3:

If I were to melt the dataframe and group the students by their grades, how would I get the desired output below?

  value             Name                Subjects
0     A         Foo, Tom           Math, English
1    A+         Bob, Bar           Math, English
2     B  John, John, Foo  Math, English, English
3     C         Tom, Bob           Math, English
4     D             Alex                    Math
5     F        Bar, Alex           Math, English

I need the rows ordered by grade, with the names separated by commas and the subjects separated by commas in the same respective order.

Problem 4:

How would I unmelt a melted dataframe? Let’s say I already melted this dataframe:

print(df.melt(id_vars=['Name', 'Age'], var_name='Subject', value_name='Grades'))

To become:

    Name  Age  Subject Grades
0    Bob   13     Math     A+
1   John   16     Math      B
2    Foo   16     Math      A
3    Bar   15     Math      F
4   Alex   15     Math      D
5    Tom   13     Math      C
6    Bob   13  English      C
7   John   16  English      B
8    Foo   16  English      B
9    Bar   15  English     A+
10  Alex   15  English      F
11   Tom   13  English      A

Then how would I translate this back to the original dataframe, the below?

   Name Math English  Age
0   Bob   A+       C   13
1  John    B       B   16
2   Foo    A       B   16
3   Bar    F      A+   15
4  Alex    D       F   15
5   Tom    C       A   13

How would I go about doing this?

Problem 5:

If I were to group by the names of the students and separate the subjects and grades by commas, how would I do it?

   Name        Subject Grades
0  Alex  Math, English   D, F
1   Bar  Math, English  F, A+
2   Bob  Math, English  A+, C
3   Foo  Math, English   A, B
4  John  Math, English   B, B
5   Tom  Math, English   C, A

I want to have a dataframe like above.

Problem 6:

If I were to completely melt my dataframe, with all columns as values, how would I do it?

     Column Value
0      Name   Bob
1      Name  John
2      Name   Foo
3      Name   Bar
4      Name  Alex
5      Name   Tom
6      Math    A+
7      Math     B
8      Math     A
9      Math     F
10     Math     D
11     Math     C
12  English     C
13  English     B
14  English     B
15  English    A+
16  English     F
17  English     A
18      Age    13
19      Age    16
20      Age    16
21      Age    15
22      Age    15
23      Age    13

I want to have a dataframe like above. All columns as values.

Please check my self-answer below 🙂
Asked By: U13-Forward

||

Answers:

Note for pandas versions < 0.20.0: I will be using df.melt(...) for my examples, but you will need to use pd.melt(df, ...) instead.

Documentation references:

Most of the solutions here use melt, so it helps to know the method. Here is the explanation from the documentation:

Unpivot a DataFrame from wide to long format, optionally leaving
identifiers set.

This function is useful to massage a DataFrame into a format where one
or more columns are identifier variables (id_vars), while all other
columns, considered measured variables (value_vars), are “unpivoted”
to the row axis, leaving just two non-identifier columns, ‘variable’
and ‘value’.

Parameters

  • id_vars : tuple, list, or ndarray, optional

    Column(s) to use as identifier variables.

  • value_vars : tuple, list, or ndarray, optional

    Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.

  • var_name : scalar

    Name to use for the ‘variable’ column. If None it uses frame.columns.name or ‘variable’.

  • value_name : scalar, default ‘value’

    Name to use for the ‘value’ column.

  • col_level : int or str, optional

    If columns are a MultiIndex then use this level to melt.

  • ignore_index : bool, default True

    If True, original index is ignored. If False, the original index is retained. Index labels will be repeated
    as necessary.

    New in version 1.1.0.
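
To make the ignore_index parameter concrete, here is a minimal sketch. The two-row frame named small and its index labels r1 and r2 are made up for illustration, and it assumes pandas 1.1.0 or later.

```python
import pandas as pd

# Hypothetical two-row frame with a non-default index, for illustration only.
small = pd.DataFrame(
    {"Name": ["Bob", "John"], "Math": ["A+", "B"], "English": ["C", "B"]},
    index=["r1", "r2"],
)

# Default (ignore_index=True): the original index is replaced by 0..n-1.
reset = small.melt(id_vars="Name")

# ignore_index=False: the original labels are kept and repeated once per melted column.
kept = small.melt(id_vars="Name", ignore_index=False)

print(reset.index.tolist())
print(kept.index.tolist())
```

The first print shows [0, 1, 2, 3], while the second shows ['r1', 'r2', 'r1', 'r2'], since each original row appears once for Math and once for English.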

Logic to melting:

Melting merges multiple columns into one and converts the dataframe from wide to long. For the solution to Problem 1 (see below), the steps are:

  1. We start with the original dataframe.

  2. melt first merges the Math and English columns into a single column, replicating the rows so that the dataframe becomes longer.

  3. Finally, it adds the Subject column, which records which subject each value in the Grades column came from.

This is the simple logic behind what the melt function does.
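
The steps can also be spelled out by hand. The following is only a sketch of the idea, not how pandas implements melt internally; the names pieces and manual are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Bob', 'John', 'Foo', 'Bar', 'Alex', 'Tom'],
                   'Math': ['A+', 'B', 'A', 'F', 'D', 'C'],
                   'English': ['C', 'B', 'B', 'A+', 'F', 'A'],
                   'Age': [13, 16, 16, 15, 15, 13]})

id_vars = ['Name', 'Age']
value_vars = ['Math', 'English']

# Steps 2 and 3 by hand: one copy of the identifier columns per melted column,
# tagged with the column label (Subject) and carrying that column's values (Grades).
pieces = [
    df[id_vars].assign(Subject=col, Grades=df[col])
    for col in value_vars
]
manual = pd.concat(pieces, ignore_index=True)

# The same reshape with melt.
melted = df.melt(id_vars=id_vars, value_vars=value_vars,
                 var_name='Subject', value_name='Grades')

print(manual.to_dict('records') == melted.to_dict('records'))
```

Stacking one block of rows per melted column is why the long frame has len(df) * len(value_vars) rows.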

Solutions:

I will solve my own questions.

Problem 1:

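Problem 1 can be solved using pd.DataFrame.melt with the following code:

melted = df.melt(id_vars=['Name', 'Age'], var_name='Subject', value_name='Grade')
print(melted.sort_values('Subject', kind='mergesort').reset_index(drop=True))

This code passes ['Name', 'Age'] as the id_vars argument, so value_vars is automatically set to the remaining columns (['Math', 'English']), which are unpivoted into the Subject and Grade columns. The stable sort on Subject (kind='mergesort') then places the English rows before the Math rows without reordering the students.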

You could also solve Problem 1 using stack like the below:

print(
    df.set_index(["Name", "Age"])
    .stack()
    .reset_index(name="Grade")
    .rename(columns={"level_2": "Subject"})
    .sort_values("Subject")
    .reset_index(drop=True)
)

This code sets the Name and Age columns as the index and stacks the remaining columns, Math and English. It then resets the index, assigning Grade as the name of the value column, renames the level_2 column to Subject, sorts by the Subject column, and finally resets the index again.

Both of these solutions output:

    Name  Age  Subject Grade
0    Bob   13  English     C
1   John   16  English     B
2    Foo   16  English     B
3    Bar   15  English    A+
4   Alex   15  English     F
5    Tom   13  English     A
6    Bob   13     Math    A+
7   John   16     Math     B
8    Foo   16     Math     A
9    Bar   15     Math     F
10  Alex   15     Math     D
11   Tom   13     Math     C

Problem 2:

This is similar to my first question, but this time I only want to keep the Math column. Here the value_vars argument comes into use, like below:

print(
    df.melt(
        id_vars=["Name", "Age"],
        value_vars="Math",
        var_name="Subject",
        value_name="Grades",
    )
)

Or we can also use stack with column specification:

print(
    df.set_index(["Name", "Age"])[["Math"]]
    .stack()
    .reset_index(name="Grades")
    .rename(columns={"level_2": "Subject"})
    .sort_values("Subject")
    .reset_index(drop=True)
)

Both of these solutions give:

   Name  Age Subject Grades
0   Bob   13    Math     A+
1  John   16    Math      B
2   Foo   16    Math      A
3   Bar   15    Math      F
4  Alex   15    Math      D
5   Tom   13    Math      C

Problem 3:

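Problem 3 can be solved with melt and groupby, using the agg function with ', '.join, like below:

print(
    df.melt(id_vars="Name", value_vars=["Math", "English"], var_name="Subjects", value_name="Grade")
    .groupby("Grade", as_index=False)
    .agg(", ".join)
)

It melts the dataframe, then groups by the grades and joins the names and subjects in each group with a comma. Age is left out of the melt because ', '.join only works on strings; pandas 2.0 and later raise a TypeError for the integer Age column instead of silently dropping it.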

stack can also be used to solve this problem, combined with groupby like below:

print(
    df.drop(columns="Age")
    .set_index("Name")
    .stack()
    .reset_index()
    .rename(columns={"level_1": "Subjects", 0: "Grade"})
    .groupby("Grade", as_index=False)
    .agg(", ".join)
)

Here stack reshapes the dataframe in a way that is equivalent to melt; the code then resets the index, renames the columns, and groups and aggregates. Age is dropped first because ', '.join cannot join integers.

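The stack solution outputs the following. The melt solution gives the same groups, except that row 3 reads Tom, Bob and Math, English, because melt lists all the Math rows before the English rows: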

  Grade             Name                Subjects
0     A         Foo, Tom           Math, English
1    A+         Bob, Bar           Math, English
2     B  John, John, Foo  Math, English, English
3     C         Bob, Tom           English, Math
4     D             Alex                    Math
5     F        Bar, Alex           Math, English

Problem 4:

We first melt the dataframe to get the input data (stored under a new name so that df stays intact for the later problems):

melted = df.melt(id_vars=['Name', 'Age'], var_name='Subject', value_name='Grades')

Now we can start solving Problem 4.

Problem 4 can be solved with pivot_table. We need to specify the pivot_table arguments values, index and columns, and also aggfunc.

We can solve it with the code below:

print(
    melted.pivot_table(values="Grades", index=["Name", "Age"], columns="Subject", aggfunc="first")
    .reset_index()
    .rename_axis(columns=None)
)

Output:

   Name  Age English Math
0  Alex   15       F    D
1   Bar   15      A+    F
2   Bob   13       C   A+
3   Foo   16       B    A
4  John   16       B    B
5   Tom   13       A    C

The melted dataframe is converted back to the same wide format as the original dataframe, although pivot_table sorts the rows by Name and orders the subject columns alphabetically.

We first pivot the melted dataframe, then reset the index and remove the column axis name.
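
If the exact original row and column order matters, one possible follow-up is sketched below. It assumes the names are unique, and the variable names wide and restored are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Bob', 'John', 'Foo', 'Bar', 'Alex', 'Tom'],
                   'Math': ['A+', 'B', 'A', 'F', 'D', 'C'],
                   'English': ['C', 'B', 'B', 'A+', 'F', 'A'],
                   'Age': [13, 16, 16, 15, 15, 13]})
melted = df.melt(id_vars=['Name', 'Age'], var_name='Subject', value_name='Grades')

# Unmelt: rows come back sorted by Name, subject columns sorted alphabetically.
wide = (
    melted.pivot_table(values='Grades', index=['Name', 'Age'],
                       columns='Subject', aggfunc='first')
    .reset_index()
    .rename_axis(columns=None)
)

# A left merge on the original Name column restores the original row order
# (assumes unique names); selecting df.columns restores the column order.
restored = df[['Name']].merge(wide, on='Name', how='left')[list(df.columns)]
print(restored)
```

The merge is used only as an ordering device here; any approach that reindexes by the original names would serve the same purpose.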

Problem 5:

Problem 5 can be solved with melt and groupby like the following:

print(
    df.melt(id_vars="Name", value_vars=["Math", "English"], var_name="Subject", value_name="Grades")
    .groupby("Name", as_index=False)
    .agg(", ".join)
)

That melts and groups by Name. Age is left out because ', '.join cannot join integers, which raises a TypeError in pandas 2.0 and later.

Or you could stack:

print(
    df.drop(columns="Age")
    .set_index("Name")
    .stack()
    .reset_index()
    .groupby("Name", as_index=False)
    .agg(", ".join)
    .rename({"level_1": "Subject", 0: "Grades"}, axis=1)
)

Both solutions output:

   Name        Subject Grades
0  Alex  Math, English   D, F
1   Bar  Math, English  F, A+
2   Bob  Math, English  A+, C
3   Foo  Math, English   A, B
4  John  Math, English   B, B
5   Tom  Math, English   C, A

Problem 6:

Problem 6 can be solved with melt without specifying any columns; just pass the expected column names:

print(df.melt(var_name='Column', value_name='Value'))

That melts the whole dataframe.

Or you could stack:

print(
    df.stack()
    .reset_index(level=1)
    .sort_values("level_1")
    .reset_index(drop=True)
    .set_axis(["Column", "Value"], axis=1)
)

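The melt solution produces the output shown in the question, in the original column and row order. The stack solution sorts the rows by column name, and because the default sort is not stable, the order within each column is not preserved: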

     Column Value
0       Age    16
1       Age    15
2       Age    15
3       Age    16
4       Age    13
5       Age    13
6   English    A+
7   English     B
8   English     B
9   English     A
10  English     F
11  English     C
12     Math     C
13     Math    A+
14     Math     D
15     Math     B
16     Math     F
17     Math     A
18     Name  Alex
19     Name   Bar
20     Name   Tom
21     Name   Foo
22     Name  John
23     Name   Bob
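
If a sorted result with a predictable order inside each column is wanted, a stable sort can be requested explicitly. This is a minimal sketch built on melt; the name ordered is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Bob', 'John', 'Foo', 'Bar', 'Alex', 'Tom'],
                   'Math': ['A+', 'B', 'A', 'F', 'D', 'C'],
                   'English': ['C', 'B', 'B', 'A+', 'F', 'A'],
                   'Age': [13, 16, 16, 15, 15, 13]})

ordered = (
    df.melt(var_name='Column', value_name='Value')
    # mergesort is stable: rows with equal keys keep their relative order
    .sort_values('Column', kind='mergesort')
    .reset_index(drop=True)
)
print(ordered)
```

With the stable sort, the Age block keeps the values in their original order (13, 16, 16, 15, 15, 13) and the Name block runs from Bob to Tom.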

Conclusion:

melt is a really handy function, and often it’s required. Once you meet these types of problems, don’t forget to try melt. It may well solve your problem.

Answered By: U13-Forward

There is another kind of melt not mentioned in the question: the dataframe’s column headers share common prefixes, and you want to melt the suffixes into column values.

It is kind of the opposite of question 11 in How can I pivot a dataframe?


Say you have the following DataFrame, and you want to melt 1970 and 1980 into column values:

  A1970 A1980  B1970  B1980         X  id
0     a     d    2.5    3.2 -1.085631   0
1     b     e    1.2    1.3  0.997345   1
2     c     f    0.7    0.1  0.282978   2

In this case you can try pandas.wide_to_long

pd.wide_to_long(df, stubnames=["A", "B"], i="id", j="year")
                X  A    B
id year
0  1970 -1.085631  a  2.5
1  1970  0.997345  b  1.2
2  1970  0.282978  c  0.7
0  1980 -1.085631  d  3.2
1  1980  0.997345  e  1.3
2  1980  0.282978  f  0.1
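
For a reproducible version, the frame can be built explicitly. This is a sketch that hard-codes the values printed above; the names long_df and flat are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "A1970": ["a", "b", "c"],
    "A1980": ["d", "e", "f"],
    "B1970": [2.5, 1.2, 0.7],
    "B1980": [3.2, 1.3, 0.1],
    "X": [-1.085631, 0.997345, 0.282978],
    "id": [0, 1, 2],
})

# stubnames are the shared prefixes; the numeric suffix ends up in the "year" level.
long_df = pd.wide_to_long(df, stubnames=["A", "B"], i="id", j="year")

# Turn the (id, year) MultiIndex back into ordinary columns if preferred.
flat = long_df.reset_index()
print(flat)
```

Columns that match no stub (here X) are kept and repeated for each suffix.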
Answered By: Ynjxsjmh

As described by U13-Forward, melting a dataframe primarily means reshaping the data from wide form to long form. More often than not, the new dataframe will have more rows and fewer columns compared to the original dataframe.

There are different scenarios when it comes to melting: all column labels could be melted into a single column, or multiple columns; some parts of the column labels could be retained as headers, while the rest are collated into a column, and so on. This answer shows how to melt a pandas dataframe, using DataFrame.stack, pd.melt, pd.wide_to_long and pivot_longer from pyjanitor (I am a contributor to the pyjanitor library). The examples won’t be exhaustive, but hopefully should point you in the right direction when it comes to reshaping dataframes from wide to long form.

Sample Data

df = pd.DataFrame(
    {'Sepal.Length': [5.1, 5.9],
     'Sepal.Width': [3.5, 3.0],
     'Petal.Length': [1.4, 5.1],
     'Petal.Width': [0.2, 1.8],
     'Species': ['setosa', 'virginica']}
    )
df
   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width    Species
0           5.1          3.5           1.4          0.2     setosa
1           5.9          3.0           5.1          1.8  virginica

Scenario 1 – Melt all columns:

In this case, we wish to convert all the specified column headers into rows. This can be done with pd.melt or DataFrame.stack, and the solutions to Problem 1 already cover this. The reshaping can also be done with pivot_longer:

# pip install pyjanitor
import janitor

df.pivot_longer(index = 'Species')
     Species      variable  value
0     setosa  Sepal.Length    5.1
1  virginica  Sepal.Length    5.9
2     setosa   Sepal.Width    3.5
3  virginica   Sepal.Width    3.0
4     setosa  Petal.Length    1.4
5  virginica  Petal.Length    5.1
6     setosa   Petal.Width    0.2
7  virginica   Petal.Width    1.8

Just like in pd.melt, you can rename the variable and value column, by passing arguments to names_to and values_to parameters:

df.pivot_longer(index = 'Species',
                names_to = 'dimension',
                values_to = 'measurement_in_cm')
     Species     dimension  measurement_in_cm
0     setosa  Sepal.Length                5.1
1  virginica  Sepal.Length                5.9
2     setosa   Sepal.Width                3.5
3  virginica   Sepal.Width                3.0
4     setosa  Petal.Length                1.4
5  virginica  Petal.Length                5.1
6     setosa   Petal.Width                0.2
7  virginica   Petal.Width                1.8

You can also retain the original index, and keep the dataframe based on order of appearance:

df.pivot_longer(index = 'Species',
                names_to = 'dimension',
                values_to = 'measurement_in_cm',
                ignore_index = False,
                sort_by_appearance=True)
     Species     dimension  measurement_in_cm
0     setosa  Sepal.Length                5.1
0     setosa   Sepal.Width                3.5
0     setosa  Petal.Length                1.4
0     setosa   Petal.Width                0.2
1  virginica  Sepal.Length                5.9
1  virginica   Sepal.Width                3.0
1  virginica  Petal.Length                5.1
1  virginica   Petal.Width                1.8

By default, the values in names_to are strings; they can be converted to other data types via the names_transform parameter – this can be helpful/performant for large dataframes, as it is generally more efficient compared to converting the data types after the reshaping.

out = df.pivot_longer(index = 'Species',
                      names_to = 'dimension',
                      values_to = 'measurement_in_cm',
                      ignore_index = False,
                      sort_by_appearance=True,
                      names_transform = 'category')

out.dtypes
Species                object
dimension            category
measurement_in_cm     float64
dtype: object
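
For comparison, with plain pandas the conversion happens after the reshape. This is a minimal sketch of that alternative, not a pyjanitor feature; it reuses the iris sample data.

```python
import pandas as pd

df = pd.DataFrame(
    {'Sepal.Length': [5.1, 5.9],
     'Sepal.Width': [3.5, 3.0],
     'Petal.Length': [1.4, 5.1],
     'Petal.Width': [0.2, 1.8],
     'Species': ['setosa', 'virginica']}
)

out = (
    df.melt(id_vars='Species', var_name='dimension', value_name='measurement_in_cm')
    # convert the melted labels to a categorical column after the reshape
    .astype({'dimension': 'category'})
)
print(out.dtypes)
```

Here the label column is first created at full length and only then converted.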

Scenario 2 – Melt column labels into multiple columns:

So far, we’ve melted our data into single columns, one for the column names and one for the values. However, there might be scenarios where we wish to split the column labels into different columns, or even the values into different columns. Continuing with our sample data, we might prefer to have sepal and petal under a part column, while length and width go into a dimension column:

  • Via pd.melt – The separation is done after the melt:

    out = df.melt(id_vars = 'Species')
    arr = out.variable.str.split('.')
    (out
    .assign(part = arr.str[0],
            dimension = arr.str[1])
    .drop(columns = 'variable')
    )
    
         Species  value   part dimension
    0     setosa    5.1  Sepal    Length
    1  virginica    5.9  Sepal    Length
    2     setosa    3.5  Sepal     Width
    3  virginica    3.0  Sepal     Width
    4     setosa    1.4  Petal    Length
    5  virginica    5.1  Petal    Length
    6     setosa    0.2  Petal     Width
    7  virginica    1.8  Petal     Width
    
  • Via DataFrame.stack – this offers a more efficient way of splitting the columns; the split is done on the column labels, so there are fewer items to process, which is potentially faster as the data size increases:

    out = df.set_index('Species')
    
    # This returns a MultiIndex
    out.columns = out.columns.str.split('.', expand = True)
    new_names = ['part', 'dimension']
    out.columns.names = new_names
    out.stack(new_names).rename('value').reset_index()
    
         Species   part dimension  value
    0     setosa  Petal    Length    1.4
    1     setosa  Petal     Width    0.2
    2     setosa  Sepal    Length    5.1
    3     setosa  Sepal     Width    3.5
    4  virginica  Petal    Length    5.1
    5  virginica  Petal     Width    1.8
    6  virginica  Sepal    Length    5.9
    7  virginica  Sepal     Width    3.0
    
  • Via pivot_longer – The key thing to note about pivot_longer is that it looks for patterns. Here the column labels are separated by a dot (.). Simply pass a list/tuple of new names to names_to, and pass a separator to names_sep (under the hood it just uses pandas' str.split):

    df.pivot_longer(index = 'Species',
                    names_to = ('part', 'dimension'),
                    names_sep='.')
    
         Species   part dimension  value
    0     setosa  Sepal    Length    5.1
    1  virginica  Sepal    Length    5.9
    2     setosa  Sepal     Width    3.5
    3  virginica  Sepal     Width    3.0
    4     setosa  Petal    Length    1.4
    5  virginica  Petal    Length    5.1
    6     setosa  Petal     Width    0.2
    7  virginica  Petal     Width    1.8
    

So far, we’ve seen how melt, stack and pivot_longer can split the column labels into multiple new columns, as long as there is a defined separator. What if there isn’t a clearly defined separator, like in the dataframe below?

# https://github.com/tidyverse/tidyr/blob/main/data-raw/who.csv
who = pd.DataFrame({'id': [1], 'new_sp_m5564': [2], 'newrel_f65': [3]})
who
   id  new_sp_m5564  newrel_f65
0   1             2           3

In the second column, we have multiple _, compared to the third column which has just one _. The goal here is to split the column labels into individual columns (sp & rel to diagnosis column, m & f to gender column, the numbers to age column). One option is to extract the column sub labels via a regex.

  • Via pd.melt – again with pd.melt, the reshaping occurs after the melt:

    out = who.melt('id')
    regex = r"new_?(?P<diagnosis>.+)_(?P<gender>.)(?P<age>\d+)"
    new_df = out.variable.str.extract(regex)
    # pd.concat can be used here instead
    out.drop(columns='variable').assign(**new_df)
    
       id  value diagnosis gender   age
    0   1      2        sp      m  5564
    1   1      3       rel      f    65
    

    Note that the extracted values correspond to the groups in the regex (the parts in parentheses).

  • Via DataFrame.stack – just like in the previous example, the split is done on the columns, offering more in terms of efficiency:

    out = who.set_index('id')
    regex = r"new_?(.+)_(.)(\d+)"
    new_names = ['diagnosis', 'gender', 'age']
    
    # Returns a dataframe
    new_cols = out.columns.str.extract(regex)
    new_cols.columns = new_names
    new_cols = pd.MultiIndex.from_frame(new_cols)
    out.columns = new_cols
    out.stack(new_names).rename('value').reset_index()
    
       id diagnosis gender   age  value
    0   1       rel      f    65    3.0
    1   1        sp      m  5564    2.0
    

    Again, the extracts occur for the regex in groups.

  • Via pivot_longer – again we know the pattern and the new column names, so we simply pass those to the function. This time we use names_pattern, since we are dealing with a regex. The extracted values correspond to the groups in the regular expression (the parts in parentheses):

    regex = r"new_?(.+)_(.)(\d+)"
    new_names = ['diagnosis', 'gender', 'age']
    who.pivot_longer(index = 'id',
                     names_to = new_names,
                     names_pattern = regex)
    
       id diagnosis gender   age  value
    0   1        sp      m  5564      2
    1   1       rel      f    65      3
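
The pattern can be checked in isolation before using it in a reshape. This is a small sketch using Python's re module with named groups; the dictionary name parsed is made up for illustration.

```python
import re

# Optional underscore after "new", then diagnosis, one-letter gender, numeric age.
pattern = re.compile(r"new_?(?P<diagnosis>.+)_(?P<gender>.)(?P<age>\d+)")

labels = ["new_sp_m5564", "newrel_f65"]

# Map each column label to the pieces captured by the groups.
parsed = {label: pattern.fullmatch(label).groupdict() for label in labels}

for label, parts in parsed.items():
    print(label, parts)
```

The first label splits into sp, m and 5564, and the second into rel, f and 65, matching the columns in the outputs above.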
    

Scenario 3 – Melt column labels and values into multiple columns:

What if we wish to split the values into multiple columns as well? Let’s use a fairly popular question on SO:

df = pd.DataFrame({'City': ['Houston', 'Austin', 'Hoover'],
                   'State': ['Texas', 'Texas', 'Alabama'],
                   'Name':['Aria', 'Penelope', 'Niko'],
                   'Mango':[4, 10, 90],
                   'Orange': [10, 8, 14],
                   'Watermelon':[40, 99, 43],
                   'Gin':[16, 200, 34],
                   'Vodka':[20, 33, 18]},
                 columns=['City', 'State', 'Name', 'Mango', 'Orange', 'Watermelon', 'Gin', 'Vodka'])
df
      City    State      Name  Mango  Orange  Watermelon  Gin  Vodka
0  Houston    Texas      Aria      4      10          40   16     20
1   Austin    Texas  Penelope     10       8          99  200     33
2   Hoover  Alabama      Niko     90      14          43   34     18

The goal is to collate Mango, Orange, and Watermelon into a Fruit column and Gin and Vodka into a Drink column, with their values collated into Pounds and Ounces columns respectively.

  • Via pd.melt – I am copying the excellent solution verbatim:

    df1 = df.melt(id_vars=['City', 'State'],
                  value_vars=['Mango', 'Orange', 'Watermelon'],
                  var_name='Fruit', value_name='Pounds')
    df2 = df.melt(id_vars=['City', 'State'],
                  value_vars=['Gin', 'Vodka'],
                  var_name='Drink', value_name='Ounces')
    
    df1 = df1.set_index(['City', 'State', df1.groupby(['City', 'State']).cumcount()])
    df2 = df2.set_index(['City', 'State', df2.groupby(['City', 'State']).cumcount()])
    
    df3 = (pd.concat([df1, df2],axis=1)
             .sort_index(level=2)
             .reset_index(level=2, drop=True)
             .reset_index())
    print (df3)
    
          City    State       Fruit  Pounds  Drink  Ounces
    0   Austin    Texas       Mango      10    Gin   200.0
    1   Hoover  Alabama       Mango      90    Gin    34.0
    2  Houston    Texas       Mango       4    Gin    16.0
    3   Austin    Texas      Orange       8  Vodka    33.0
    4   Hoover  Alabama      Orange      14  Vodka    18.0
    5  Houston    Texas      Orange      10  Vodka    20.0
    6   Austin    Texas  Watermelon      99    NaN     NaN
    7   Hoover  Alabama  Watermelon      43    NaN     NaN
    8  Houston    Texas  Watermelon      40    NaN     NaN
    
  • Via DataFrame.stack – I can’t think of a solution via stack, so I’ll skip it.

  • Via pivot_longer – The reshape can be done efficiently by passing the lists of names to names_to and values_to, and a list of regular expressions to names_pattern. When splitting values into multiple columns, a list of regexes for names_pattern is required:

    df.pivot_longer(
        index=["City", "State"],
        column_names=slice("Mango", "Vodka"),
        names_to=("Fruit", "Drink"),
        values_to=("Pounds", "Ounces"),
        names_pattern=[r"M|O|W", r"G|V"],
    )
    
          City    State       Fruit  Pounds  Drink  Ounces
    0  Houston    Texas       Mango       4    Gin    16.0
    1   Austin    Texas       Mango      10    Gin   200.0
    2   Hoover  Alabama       Mango      90    Gin    34.0
    3  Houston    Texas      Orange      10  Vodka    20.0
    4   Austin    Texas      Orange       8  Vodka    33.0
    5   Hoover  Alabama      Orange      14  Vodka    18.0
    6  Houston    Texas  Watermelon      40   None     NaN
    7   Austin    Texas  Watermelon      99   None     NaN
    8   Hoover  Alabama  Watermelon      43   None     NaN
    

The efficiency gain becomes even larger as the dataframe size increases.

Scenario 4 – Group similar columns together:

Extending the concept of melting into multiple columns, let’s say we wish to group similar columns together. We do not care about retaining the column labels, just combining the values of similar columns into new columns.

df = pd.DataFrame({'x_1_mean': [10],
                   'x_2_mean': [20],
                   'y_1_mean': [30],
                   'y_2_mean': [40],
                   'unit': [50]})

df
   x_1_mean  x_2_mean  y_1_mean  y_2_mean  unit
0        10        20        30        40    50

For the code above, we wish to combine similar columns (columns that start with the same letter) into new unique columns – all x* columns will be lumped under x_mean, while all y* columns will be collated under y_mean. We are not saving the column labels, we are only interested in the values of these columns:

  • Via pd.melt – one possible way via melt is to run it via groupby on the columns:

    out = df.set_index('unit')
    grouped = out.columns.str.split(r'_\d_').str.join('')
    # group on the split (note: groupby(axis=1) is deprecated since pandas 2.1)
    grouped = out.groupby(grouped, axis = 1)
    # iterate, melt individually, and recombine to get a new dataframe
    out = {key : frame.melt(ignore_index = False).value
           for key, frame in grouped}
    pd.DataFrame(out).reset_index()
    
       unit  xmean  ymean
    0    50     10     30
    1    50     20     40
    
  • Via DataFrame.stack – Here we split the columns and build a MultiIndex:

    out = df.set_index('unit')
    split = out.columns.str.split(r'_(\d)_')
    split = [(f"{first}{last}", middle)
              for first, middle, last
              in split]
    out.columns = pd.MultiIndex.from_tuples(split)
    out.stack(-1).droplevel(-1).reset_index()
    
       unit  xmean  ymean
    0    50     10     30
    1    50     20     40
    
  • Via pd.wide_to_long – Here we reorder the sub labels – move the numbers to the end of the columns:

    out = df.set_index('unit')
    out.columns = [f"{first}{last}_{middle}"
                   for first, middle, last
                   in out.columns.str.split(r'_(\d)_')]
    
    (pd
    .wide_to_long(
        out.reset_index(),
        stubnames = ['xmean', 'ymean'],
        i = 'unit',
        j = 'num',
        sep = '_')
    .droplevel(-1)
    .reset_index()
    )
    
       unit  xmean  ymean
    0    50     10     30
    1    50     20     40
    
  • Via pivot_longer – Again, with pivot_longer, it is all about the patterns. Simply pass a list of new column names to names_to, and the corresponding regular expressions to names_pattern:

    df.pivot_longer(index = 'unit',
                    names_to = ['xmean', 'ymean'],
                    names_pattern = ['x', 'y']
                    )
    
       unit  xmean  ymean
    0    50     10     30
    1    50     20     40
    

Note that with this pattern, matching works on a first come, first served basis: if the column order were flipped, pivot_longer would give a different output. Let’s see this in action:

# reorder the columns in a different form:
df = df.loc[:, ['x_1_mean', 'x_2_mean', 'y_2_mean', 'y_1_mean', 'unit']]
df
   x_1_mean  x_2_mean  y_2_mean  y_1_mean  unit
0        10        20        40        30    50

Because the order has changed, x_1_mean will be paired with y_2_mean, because that is the first y column it sees, while x_2_mean gets paired with y_1_mean:

df.pivot_longer(index = 'unit',
                names_to = ['xmean', 'ymean'],
                names_pattern = ['x', 'y']
                )
   unit  xmean  ymean
0    50     10     40
1    50     20     30

Note the difference in the output compared to the previous run. This is something to note when using names_pattern with a sequence. Order matters.
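
If pairing by position is too fragile, one order-independent option with plain pandas is to pull the number out of each label and pair the columns on it explicitly. This is a sketch, not a pyjanitor feature; the group names stub, num and stat and the variable names are made up for illustration.

```python
import pandas as pd

# Same data as above, with the y columns deliberately out of order.
df = pd.DataFrame({'x_1_mean': [10],
                   'x_2_mean': [20],
                   'y_2_mean': [40],
                   'y_1_mean': [30],
                   'unit': [50]})

long_df = df.melt(id_vars='unit')

# Split each label into its letter, number, and statistic.
parts = long_df['variable'].str.extract(r'^(?P<stub>[a-z]+)_(?P<num>\d+)_(?P<stat>\w+)$')
long_df = long_df.assign(name=parts['stub'] + parts['stat'],
                         num=parts['num'].astype(int))

# Pivot on the extracted number, so x_1 is always paired with y_1.
out = (
    long_df.pivot(index=['unit', 'num'], columns='name', values='value')
    .reset_index()
    .rename_axis(columns=None)
    .drop(columns='num')
)
print(out)
```

Because the pairing key is the number in the label, the result does not depend on the order of the columns in the input.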

Scenario 5 – Retain part of the column names as headers:

This is probably one of the biggest use cases when reshaping to long form. We may wish to keep some parts of the column labels as headers, and move the remaining parts into new columns (or even ignore them).

Let’s revisit our iris dataframe:

df = pd.DataFrame(
    {'Sepal.Length': [5.1, 5.9],
     'Sepal.Width': [3.5, 3.0],
     'Petal.Length': [1.4, 5.1],
     'Petal.Width': [0.2, 1.8],
     'Species': ['setosa', 'virginica']}
    )

df
   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width    Species
0           5.1          3.5           1.4          0.2     setosa
1           5.9          3.0           5.1          1.8  virginica

Our goal here is to keep Sepal, Petal as column names, and the rest (Length, Width) are collated into a dimension column:

  • Via pd.melt – A pivot is used after melting into long form:

    out = df.melt(id_vars = 'Species')
    arr = out.variable.str.split('.')
    (out
    .assign(part = arr.str[0],
            dimension = arr.str[1])
    .pivot(index=['Species', 'dimension'], columns='part', values='value')
    .rename_axis(columns = None)
    .reset_index()
    )
    
         Species dimension  Petal  Sepal
    0     setosa    Length    1.4    5.1
    1     setosa     Width    0.2    3.5
    2  virginica    Length    5.1    5.9
    3  virginica     Width    1.8    3.0
    

    This is not as efficient as the other options below, as it involves going from wide to long and then from long to wide; this might perform poorly on a large enough dataframe.

  • Via DataFrame.stack – This offers more efficiency as most of the reshaping is on the columns – less is more.

    out = df.set_index('Species')
    out.columns = out.columns.str.split('.', expand = True)
    out.columns.names = [None, 'dimension']
    out.stack('dimension').reset_index()
    
         Species dimension  Petal  Sepal
    0     setosa    Length    1.4    5.1
    1     setosa     Width    0.2    3.5
    2  virginica    Length    5.1    5.9
    3  virginica     Width    1.8    3.0
    
  • Via pd.wide_to_long – Straightforward – simply pass in the relevant arguments:

    (pd
    .wide_to_long(
        df,
        stubnames=['Sepal', 'Petal'],
        i = 'Species',
        j = 'dimension',
        sep='.',
        suffix='.+')
    .reset_index()
    )
    
         Species dimension  Sepal  Petal
    0     setosa    Length    5.1    1.4
    1  virginica    Length    5.9    5.1
    2     setosa     Width    3.5    0.2
    3  virginica     Width    3.0    1.8
    

    As the data size increases, pd.wide_to_long might not be as efficient.

  • Via pivot_longer: Again, back to patterns. Since we are keeping a part of the column as header, we use .value as a placeholder. The function sees the .value and knows that that sub label has to remain as a header. The split in the columns can either be by names_sep or names_pattern. In this case, it is simpler to use names_sep:

    df.pivot_longer(index = 'Species',
                    names_to = ('.value', 'dimension'),
                    names_sep = '.')
    
         Species dimension  Sepal  Petal
    0     setosa    Length    5.1    1.4
    1  virginica    Length    5.9    5.1
    2     setosa     Width    3.5    0.2
    3  virginica     Width    3.0    1.8
    

    When a column label is split on ., we get, for example, Petal and Length. Paired against ('.value', 'dimension'), Petal is associated with .value, while Length is associated with dimension. Petal therefore stays as a column header, while Length is lumped into the dimension column. We didn’t need to be explicit about the column names; we just used .value and let the function do the heavy lifting. This way, if you have lots of columns, you don’t need to work out which labels should stay as headers, as long as you have the right pattern via names_sep or names_pattern.
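
For readers who want to see the pairing spelled out, here is a minimal pandas-only sketch of the idea. It is not how pivot_longer is implemented; the names pairs, header and value_cols are made up for illustration, and the melt-then-pivot route is used only because it is easy to follow:

```python
import pandas as pd

df = pd.DataFrame(
    {'Sepal.Length': [5.1, 5.9],
     'Sepal.Width': [3.5, 3.0],
     'Petal.Length': [1.4, 5.1],
     'Petal.Width': [0.2, 1.8],
     'Species': ['setosa', 'virginica']}
)

names_to = ('.value', 'dimension')  # same tuple as passed to pivot_longer
id_col = 'Species'
value_cols = [col for col in df.columns if col != id_col]

# Split every label on '.' and pair the pieces with names_to, position by position.
# (pairs is a hypothetical lookup table, one row per original column label.)
pairs = pd.DataFrame(
    [dict(zip(names_to, col.split('.'))) for col in value_cols],
    index=value_cols,
)

# Go long, then attach the two pieces of each label to its rows.
long = df.melt(id_vars=id_col, value_vars=value_cols)
long['header'] = long['variable'].map(pairs['.value'])
long['dimension'] = long['variable'].map(pairs['dimension'])

# The piece paired with '.value' goes back to the header; the other stays as a column.
result = (long
          .pivot(index=[id_col, 'dimension'], columns='header', values='value')
          .rename_axis(columns=None)
          .reset_index())

print(pairs)
print(result)
```

Printing pairs shows which piece of each label landed under .value and which under dimension; swapping the order in names_to swaps the roles of the two pieces.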

    What if we want Length/Width as column names instead, with Petal/Sepal lumped into a part column?

  • Via pd.melt

    out = df.melt(id_vars = 'Species')
    arr = out.variable.str.split('.')
    (out
    .assign(part = arr.str[0],
            dimension = arr.str[1])
    .pivot(index = ['Species', 'part'], columns = 'dimension', values = 'value')
    .rename_axis(columns = None)
    .reset_index()
    )
    
         Species   part  Length  Width
    0     setosa  Petal     1.4    0.2
    1     setosa  Sepal     5.1    3.5
    2  virginica  Petal     5.1    1.8
    3  virginica  Sepal     5.9    3.0
    
  • Via DataFrame.stack:

    out = df.set_index('Species')
    out.columns = out.columns.str.split('.', expand = True)
    out.columns.names = ['part', None]
    out.stack('part').reset_index()
    
         Species   part  Length  Width
    0     setosa  Petal     1.4    0.2
    1     setosa  Sepal     5.1    3.5
    2  virginica  Petal     5.1    1.8
    3  virginica  Sepal     5.9    3.0
    
  • Via pd.wide_to_long – First, we need to reverse the parts of each column label, so that Length/Width come first:

    out = df.set_index('Species')
    out.columns = out.columns.str.split('.').str[::-1].str.join('.')
    (pd
    .wide_to_long(
        out.reset_index(),
        stubnames=['Length', 'Width'],
        i = 'Species',
        j = 'part',
        sep='.',
        suffix='.+')
    .reset_index()
    )
    
         Species   part  Length  Width
    0     setosa  Sepal     5.1    3.5
    1  virginica  Sepal     5.9    3.0
    2     setosa  Petal     1.4    0.2
    3  virginica  Petal     5.1    1.8
    
  • Via pivot_longer:

    df.pivot_longer(index = 'Species',
                    names_to = ('part', '.value'),
                    names_sep = '.')
    
         Species   part  Length  Width
    0     setosa  Sepal     5.1    3.5
    1  virginica  Sepal     5.9    3.0
    2     setosa  Petal     1.4    0.2
    3  virginica  Petal     5.1    1.8
    

    Notice that we did not have to do any column reordering (though there are scenarios where column reordering is unavoidable); the function simply paired .value with whatever the split from names_sep gave and returned the reshaped dataframe. You can even use multiple .value entries where applicable. Let’s revisit an earlier dataframe:

    df = pd.DataFrame({'x_1_mean': [10],
                       'x_2_mean': [20],
                       'y_1_mean': [30],
                       'y_2_mean': [40],
                       'unit': [50]})
    df
    
       x_1_mean  x_2_mean  y_1_mean  y_2_mean  unit
    0        10        20        30        40    50
    
    df.pivot_longer(index = 'unit',
                    names_to = ('.value', '.value'),
                    names_pattern = r"(.).+(mean)")
    
       unit  xmean  ymean
    0    50     10     30
    1    50     20     40
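
If pyjanitor is not available, a rough pandas-only equivalent of the multiple .value example might look like the sketch below. Two assumptions are mine rather than the original answer's: the regex also captures the middle number, and that number is used as an explicit grouping key (the hypothetical group column) to line the rows up. Judging by the output above, the two .value groups are joined into a single header (xmean, ymean), which is what header reproduces here:

```python
import pandas as pd

df = pd.DataFrame({'x_1_mean': [10],
                   'x_2_mean': [20],
                   'y_1_mean': [30],
                   'y_2_mean': [40],
                   'unit': [50]})

long = df.melt(id_vars='unit')

# Capture the leading letter, the number, and the trailing 'mean'.
parts = long['variable'].str.extract(r'(.)_(\d+)_(mean)')

# First and last groups together form the new header; the number groups the rows.
long['header'] = parts[0] + parts[2]
long['group'] = parts[1]

result = (long
          .pivot(index=['unit', 'group'], columns='header', values='value')
          .rename_axis(columns=None)
          .reset_index()
          .drop(columns='group'))

print(result)
```

As with the earlier melt-based options, this goes wide to long and then back to wide, so it is meant as an illustration of the pattern rather than a performant replacement.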
    

It is all about seeing the patterns and taking advantage of them. pivot_longer just offers efficient abstractions over common reshaping scenarios – under the hood it is just Pandas, NumPy, and Python.

Hopefully, the various answers point you in the right direction when you need to reshape from wide to long.

Answered By: sammywemmy