Transforming pandas data frame. Sort of melting
Question:
I have this data frame:
df = pd.DataFrame({'day': [1, 1, 2, 2], 'category': ['a', 'b', 'a', 'b'],
'min_feature1': [1, 2, 3, 4], 'max_feature1': [8, 9, 10, 11],
'min_feature2': [2, 3, 4, 5], 'max_feature2': [6, 9, 12, 13]})
The result looks like this:
day | category | min_feature1 | max_feature1 | min_feature2 | max_feature2 |
---|---|---|---|---|---|
1 | a | 1 | 8 | 2 | 6 |
1 | b | 2 | 9 | 3 | 9 |
2 | a | 3 | 10 | 4 | 12 |
2 | b | 4 | 11 | 5 | 13 |
I want to transform this data, so it looks like this:
pd.DataFrame([[1, 'a', 'feature1', 1, 8],
[1, 'a', 'feature2', 2, 6],
[1, 'b', 'feature1', 2, 9],
[1, 'b', 'feature2', 3, 9],
[2, 'a', 'feature1', 3, 10],
[2, 'a', 'feature2', 4, 12],
[2, 'b', 'feature1', 4, 11],
[2, 'b', 'feature2', 5, 13],], columns=['day', 'category', 'feature', 'min', 'max'])
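day | category | feature | min | max |
---|---|---|---|---|
1 | a | feature1 | 1 | 8 |
1 | a | feature2 | 2 | 6 |
1 | b | feature1 | 2 | 9 |
1 | b | feature2 | 3 | 9 |
2 | a | feature1 | 3 | 10 |
2 | a | feature2 | 4 | 12 |
2 | b | feature1 | 4 | 11 |
2 | b | feature2 | 5 | 13 |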
How can I do this?
Answers:
One option is a custom reshape: build a MultiIndex from the column names with str.split, then stack:
(df.set_index(['day', 'category'])
.pipe(lambda d: d.set_axis(d.columns.str.split('_', n=1, expand=True), axis=1))
.rename_axis(columns=(None, 'feature'))
.stack().reset_index()
)
Or with pyjanitor's pivot_longer:
# pip install pyjanitor
import janitor
out = df.pivot_longer(['day', 'category'], sort_by_appearance=True,
names_sep='_', names_to=('.value', 'feature'))
Output:
day category feature max min
0 1 a feature1 8 1
1 1 a feature2 6 2
2 1 b feature1 9 2
3 1 b feature2 9 3
4 2 a feature1 10 3
5 2 a feature2 12 4
6 2 b feature1 11 4
7 2 b feature2 13 5
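The key step in the first snippet is str.split with expand=True, which turns the flat column names into a two-level MultiIndex that stack can then move into the rows. A minimal sketch of just that step; the name `min_feature_1` is a hypothetical example (not from the question) used only to show what `n=1` does:

```python
import pandas as pd

# Column names from the question, without the id columns.
cols = pd.Index(['min_feature1', 'max_feature1', 'min_feature2', 'max_feature2'])

# expand=True on an Index returns a MultiIndex:
# level 0 holds the statistic, level 1 holds the feature name.
mi = cols.str.split('_', n=1, expand=True)
print(list(mi))

# n=1 splits only on the first underscore, so a (hypothetical) feature
# name that itself contains an underscore stays in one piece.
extra = pd.Index(['min_feature_1', 'max_feature_1']).str.split('_', n=1, expand=True)
print(list(extra))
```

Without `n=1`, names with more than one underscore would produce more than two levels, and the later stack would no longer line up with a single feature level.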
Use str.split to build a MultiIndex, then reshape with DataFrame.stack:
df1 = df.set_index(['day','category'])
df1.columns = df1.columns.str.split('_', expand=True)
df1 = df1.rename_axis(columns=(None,'feature')).stack().reset_index()
print(df1)
day category feature max min
0 1 a feature1 8 1
1 1 a feature2 6 2
2 1 b feature1 9 2
3 1 b feature2 9 3
4 2 a feature1 10 3
5 2 a feature2 12 4
6 2 b feature1 11 4
7 2 b feature2 13 5
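The output above has the columns as max, min, while the question asks for min, max. A self-contained sketch of the same stack approach that selects the column order explicitly, so it should not depend on how a given pandas version orders the stacked columns (the names `wide`, `long` and `out` are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'day': [1, 1, 2, 2], 'category': ['a', 'b', 'a', 'b'],
                   'min_feature1': [1, 2, 3, 4], 'max_feature1': [8, 9, 10, 11],
                   'min_feature2': [2, 3, 4, 5], 'max_feature2': [6, 9, 12, 13]})

# Keep the id columns out of the way, then split the remaining names
# into (statistic, feature) pairs.
wide = df.set_index(['day', 'category'])
wide.columns = wide.columns.str.split('_', n=1, expand=True)

# stack() moves the last column level ('feature') into the rows.
long = wide.rename_axis(columns=(None, 'feature')).stack().reset_index()

# Pick the requested column order and make the row order explicit.
out = (long[['day', 'category', 'feature', 'min', 'max']]
       .sort_values(['day', 'category', 'feature'], ignore_index=True))
print(out)
```

Recent pandas releases may emit a FutureWarning from stack about its new implementation; the explicit column selection and sort are there so the result has the same shape either way.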
Another idea with wide_to_long:
df.columns = df.columns.str.replace(r'(\w+)_\s*(\w+)', r'\2_\1', regex=True)
df = (pd.wide_to_long(df,
stubnames=['feature1','feature2'],
i=['day','category'],
j='tmp',
sep='_',
suffix=r'\w+').rename_axis(columns='feature')
.stack()
.unstack(2)
.reset_index()
.rename_axis(columns=None))
print(df)
day category feature max min
0 1 a feature1 8 1
1 1 a feature2 6 2
2 1 b feature1 9 2
3 1 b feature2 9 3
4 2 a feature1 10 3
5 2 a feature2 12 4
6 2 b feature1 11 4
7 2 b feature2 13 5
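Since the original column names already follow a `<statistic>_<feature>` pattern, wide_to_long can probably be applied without swapping the name parts first, by treating min and max as the stubs and the feature name as the suffix. A sketch under that assumption, starting again from the frame in the question (the snippet above overwrites df):

```python
import pandas as pd

df = pd.DataFrame({'day': [1, 1, 2, 2], 'category': ['a', 'b', 'a', 'b'],
                   'min_feature1': [1, 2, 3, 4], 'max_feature1': [8, 9, 10, 11],
                   'min_feature2': [2, 3, 4, 5], 'max_feature2': [6, 9, 12, 13]})

out = (pd.wide_to_long(df,
                       stubnames=['min', 'max'],   # prefixes that become value columns
                       i=['day', 'category'],      # id columns, must be unique together
                       j='feature',                # name for the column holding the suffix
                       sep='_',
                       suffix=r'\w+')              # suffix is text, not the default digits
         .reset_index()
         .sort_values(['day', 'category', 'feature'], ignore_index=True))
print(out)
```

This avoids the stack/unstack round trip, at the cost of relying on every value column starting with one of the listed stubs.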
You can also use melt as an alternative:
out = (df.rename(columns=lambda x: tuple(m) if len(m := x.split('_')) > 1 else x)
.melt(['day', 'category'])
.assign(var1=lambda x: x['variable'].str[1], var2=lambda x: x['variable'].str[0])
.pivot(index=['day', 'category', 'var1'], columns='var2', values='value')
.rename_axis(columns=None).reset_index())
Output:
>>> out
day category var1 max min
0 1 a feature1 8 1
1 1 a feature2 6 2
2 1 b feature1 9 2
3 1 b feature2 9 3
4 2 a feature1 10 3
5 2 a feature2 12 4
6 2 b feature1 11 4
7 2 b feature2 13 5
Step by step, to make the transformation easier to follow:
# Step 1: rename your columns
>>> out = df.rename(columns=lambda x: tuple(m) if len(m := x.split('_')) > 1 else x)
day category (min, feature1) (max, feature1) (min, feature2) (max, feature2)
0 1 a 1 8 2 6
1 1 b 2 9 3 9
2 2 a 3 10 4 12
3 2 b 4 11 5 13
# Step 2: flatten your dataframe
>>> out = out.melt(['day', 'category'])
day category variable value
0 1 a (min, feature1) 1
1 1 b (min, feature1) 2
2 2 a (min, feature1) 3
3 2 b (min, feature1) 4
4 1 a (max, feature1) 8
5 1 b (max, feature1) 9
...
# Step 3: split the variable column into two new columns
>>> out = out.assign(var1=lambda x: x['variable'].str[1], var2=lambda x: x['variable'].str[0])
day category variable value var1 var2
0 1 a (min, feature1) 1 feature1 min
1 1 b (min, feature1) 2 feature1 min
2 2 a (min, feature1) 3 feature1 min
3 2 b (min, feature1) 4 feature1 min
4 1 a (max, feature1) 8 feature1 max
5 1 b (max, feature1) 9 feature1 max
...
# Step 4: reshape your dataframe
>>> out = out.pivot(index=['day', 'category', 'var1'], columns='var2', values='value')
var2                   max  min
day category var1
1   a        feature1    8    1
             feature2    6    2
    b        feature1    9    2
             feature2    9    3
2   a        feature1   10    3
             feature2   12    4
    b        feature1   11    4
             feature2   13    5
# Step 5: final output
>>> out = out.rename_axis(columns=None).reset_index()
day category var1 max min
0 1 a feature1 8 1
1 1 a feature2 6 2
2 1 b feature1 9 2
3 1 b feature2 9 3
4 2 a feature1 10 3
5 2 a feature2 12 4
6 2 b feature1 11 4
7 2 b feature2 13 5
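The same melt-then-pivot idea can also be sketched without tuple column names, by melting first and splitting the variable strings afterwards. This is a variant, not the answer's exact method; the column names `stat` and `feature` are chosen here for illustration, and passing a list as pivot's index assumes pandas 1.1 or later:

```python
import pandas as pd

df = pd.DataFrame({'day': [1, 1, 2, 2], 'category': ['a', 'b', 'a', 'b'],
                   'min_feature1': [1, 2, 3, 4], 'max_feature1': [8, 9, 10, 11],
                   'min_feature2': [2, 3, 4, 5], 'max_feature2': [6, 9, 12, 13]})

# Long format: one row per (day, category, original column name).
long = df.melt(['day', 'category'])

# Split 'min_feature1' into stat='min' and feature='feature1'.
long[['stat', 'feature']] = long['variable'].str.split('_', n=1, expand=True)

# One row per (day, category, feature), one column per statistic.
wide = long.pivot(index=['day', 'category', 'feature'], columns='stat', values='value')

# Put min before max and flatten the index back into columns.
out = wide[['min', 'max']].rename_axis(columns=None).reset_index()
print(out)
```

Naming the split column `feature` directly also yields the column name requested in the question instead of var1.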