Set column names when stacking pandas DataFrame
Question:
When stacking a pandas DataFrame
, a Series
is returned. Normally after I stack a DataFrame
, I convert it back into a DataFrame
. However, the default names coming from the stacked data make renaming the columns a bit hacky. What I’m looking for is an easier/built-in way to give columns sensible names after stacking.
E.g., for the following DataFrame
:
In [64]: df = pd.DataFrame({'id':[1,2,3],
...: 'date':['2015-09-31']*3,
...: 'value':[100, 95, 42],
...: 'value2':[200, 57, 27]}).set_index(['id','date'])
In [65]: df
Out[65]:
value value2
id date
1 2015-09-31 100 200
2 2015-09-31 95 57
3 2015-09-31 42 27
I stack and convert it back to a DataFrame
like so:
In [68]: df.stack().reset_index()
Out[68]:
id date level_2 0
0 1 2015-09-31 value 100
1 1 2015-09-31 value2 200
2 2 2015-09-31 value 95
3 2 2015-09-31 value2 57
4 3 2015-09-31 value 42
5 3 2015-09-31 value2 27
So in order to name these columns appropriately I would need to do something like this:
In [72]: stacked = df.stack()
In [73]: stacked
Out[73]:
id date
1 2015-09-31 value 100
value2 200
2 2015-09-31 value 95
value2 57
3 2015-09-31 value 42
value2 27
dtype: int64
In [74]: stacked.index.set_names('var_name', level=len(stacked.index.names)-1, inplace=True)
In [88]: stacked.reset_index().rename(columns={0:'value'})
Out[88]:
id date var_name value
0 1 2015-09-31 value 100
1 1 2015-09-31 value2 200
2 2 2015-09-31 value 95
3 2 2015-09-31 value2 57
4 3 2015-09-31 value 42
5 3 2015-09-31 value2 27
Ideally, the solution would look something like this:
df.stack(new_index_name='var_name', new_col_name='value')
But looking at the docs it doesn’t look like stack
takes any such arguments. Is there an easier/built-in way in pandas to deal with this workflow?
Answers:
So here’s one way that you may find a bit cleaner, using the fact that columns
and Series
can also carry names.
In [45]: df
Out[45]:
value value2
id date
1 2015-09-31 100 200
2 2015-09-31 95 57
3 2015-09-31 42 27
In [46]: df.columns.name = 'var_name'
In [47]: s = df.stack()
In [48]: s.name = 'value'
In [49]: s.reset_index()
Out[49]:
id date var_name value
0 1 2015-09-31 value 100
1 1 2015-09-31 value2 200
2 2 2015-09-31 value 95
3 2 2015-09-31 value2 57
4 3 2015-09-31 value 42
5 3 2015-09-31 value2 27
pd.melt
is often useful for converting DataFrames from “wide” to “long” format. You could use pd.melt
here if you convert the id
and date
index levels to columns first:
In [56]: pd.melt(df.reset_index(), id_vars=['id', 'date'], value_vars=['value', 'value2'], var_name='var_name', value_name='value')
Out[56]:
id date var_name value
0 1 2015-09-31 value 100
1 2 2015-09-31 value 95
2 3 2015-09-31 value 42
3 1 2015-09-31 value2 200
4 2 2015-09-31 value2 57
5 3 2015-09-31 value2 27
A pipe-ing friendly alternative to chrisb’s answer:
df.stack().rename_axis(['id', 'date', 'var_name']).rename('value').reset_index()
And if explicit is better than implicit:
(
df
.stack()
.rename_axis(index={'id': 'id', 'date': 'date', None: 'var_name'})
.rename('value')
.reset_index()
)
When using the dict mapper, you can skip the names which should stay the same:
df.stack().rename_axis(index={None: 'var_name'}).rename('value').reset_index()
Why not something like this?
Sometimes melt
is great,
but sometimes you want to keep your index,
and/or you want to have an index on that new column.
This is like @krassowski’s answer,
but it doesn’t require you to know the names of df’s indices in advance.
df.stack().rename_axis([*df.index.names, "var_name"]).rename("value")
To avoid a phantom column name when calling stack
, just rename the column axis beforehand:
df = pd.DataFrame({col: range(3) for col in list("ABC")})
df.rename_axis(columns="lol_goodbye_columns").stack()
lol_goodbye_columns
0 A 0
B 0
C 0
1 A 1
B 1
C 1
2 A 2
B 2
C 2
dtype: int64
When stacking a pandas DataFrame
, a Series
is returned. Normally after I stack a DataFrame
, I convert it back into a DataFrame
. However, the default names coming from the stacked data make renaming the columns a bit hacky. What I’m looking for is an easier/built-in way to give columns sensible names after stacking.
E.g., for the following DataFrame
:
In [64]: df = pd.DataFrame({'id':[1,2,3],
...: 'date':['2015-09-31']*3,
...: 'value':[100, 95, 42],
...: 'value2':[200, 57, 27]}).set_index(['id','date'])
In [65]: df
Out[65]:
value value2
id date
1 2015-09-31 100 200
2 2015-09-31 95 57
3 2015-09-31 42 27
I stack and convert it back to a DataFrame
like so:
In [68]: df.stack().reset_index()
Out[68]:
id date level_2 0
0 1 2015-09-31 value 100
1 1 2015-09-31 value2 200
2 2 2015-09-31 value 95
3 2 2015-09-31 value2 57
4 3 2015-09-31 value 42
5 3 2015-09-31 value2 27
So in order to name these columns appropriately I would need to do something like this:
In [72]: stacked = df.stack()
In [73]: stacked
Out[73]:
id date
1 2015-09-31 value 100
value2 200
2 2015-09-31 value 95
value2 57
3 2015-09-31 value 42
value2 27
dtype: int64
In [74]: stacked.index.set_names('var_name', level=len(stacked.index.names)-1, inplace=True)
In [88]: stacked.reset_index().rename(columns={0:'value'})
Out[88]:
id date var_name value
0 1 2015-09-31 value 100
1 1 2015-09-31 value2 200
2 2 2015-09-31 value 95
3 2 2015-09-31 value2 57
4 3 2015-09-31 value 42
5 3 2015-09-31 value2 27
Ideally, the solution would look something like this:
df.stack(new_index_name='var_name', new_col_name='value')
But looking at the docs it doesn’t look like stack
takes any such arguments. Is there an easier/built-in way in pandas to deal with this workflow?
So here’s one way that you may find a bit cleaner, using the fact that columns
and Series
can also carry names.
In [45]: df
Out[45]:
value value2
id date
1 2015-09-31 100 200
2 2015-09-31 95 57
3 2015-09-31 42 27
In [46]: df.columns.name = 'var_name'
In [47]: s = df.stack()
In [48]: s.name = 'value'
In [49]: s.reset_index()
Out[49]:
id date var_name value
0 1 2015-09-31 value 100
1 1 2015-09-31 value2 200
2 2 2015-09-31 value 95
3 2 2015-09-31 value2 57
4 3 2015-09-31 value 42
5 3 2015-09-31 value2 27
pd.melt
is often useful for converting DataFrames from “wide” to “long” format. You could use pd.melt
here if you convert the id
and date
index levels to columns first:
In [56]: pd.melt(df.reset_index(), id_vars=['id', 'date'], value_vars=['value', 'value2'], var_name='var_name', value_name='value')
Out[56]:
id date var_name value
0 1 2015-09-31 value 100
1 2 2015-09-31 value 95
2 3 2015-09-31 value 42
3 1 2015-09-31 value2 200
4 2 2015-09-31 value2 57
5 3 2015-09-31 value2 27
A pipe-ing friendly alternative to chrisb’s answer:
df.stack().rename_axis(['id', 'date', 'var_name']).rename('value').reset_index()
And if explicit is better than implicit:
(
df
.stack()
.rename_axis(index={'id': 'id', 'date': 'date', None: 'var_name'})
.rename('value')
.reset_index()
)
When using the dict mapper, you can skip the names which should stay the same:
df.stack().rename_axis(index={None: 'var_name'}).rename('value').reset_index()
Why not something like this?
Sometimes melt
is great,
but sometimes you want to keep your index,
and/or you want to have an index on that new column.
This is like @krassowski’s answer,
but it doesn’t require you to know the names of df’s indices in advance.
df.stack().rename_axis([*df.index.names, "var_name"]).rename("value")
To avoid a phantom column name when calling stack
, just rename the column axis beforehand:
df = pd.DataFrame({col: range(3) for col in list("ABC")})
df.rename_axis(columns="lol_goodbye_columns").stack()
lol_goodbye_columns
0 A 0
B 0
C 0
1 A 1
B 1
C 1
2 A 2
B 2
C 2
dtype: int64