Tricky conversion of field names to values while performing row by row de-aggregation (using Pandas)
Question:
I have a dataset where I would like to convert specific field names into values, de-aggregating the values into their own rows (a wide-to-long pivot).
Data
Start     Date      End        Area  Final     Type  Middle Stat  Low Stat  High Stat  Middle Stat1  Low Stat1  High Stat1
8/1/2013  9/1/2013  10/1/2013  NY    3/1/2023  CC    226          20        10         0             0          0
8/1/2013  9/1/2013  10/1/2013  CA    3/1/2023  AA    130          50        0          0             0          0
import pandas as pd
data = {
    "Start": ['8/1/2013', '8/1/2013'],
    "Date": ['9/1/2013', '9/1/2013'],
    "End": ['10/1/2013', '10/1/2013'],
    "Area": ['NY', 'CA'],
    "Final": ['3/1/2023', '3/1/2023'],
    "Type": ['CC', 'AA'],
    "Middle Stat": [226, 130],
    "Low Stat": [20, 50],
    "High Stat": [10, 0],
    "Middle Stat1": [0, 0],
    "Low Stat1": [0, 0],
    "High Stat1": [0, 0]
}
df = pd.DataFrame(data)
Desired
Start     Date      End        Area  Final     Type  Stat  Range   Stat1
8/1/2013  9/1/2013  10/1/2013  NY    3/1/2023  CC    20    Low     0
8/1/2013  9/1/2013  10/1/2013  CA    3/1/2023  AA    50    Low     0
8/1/2013  9/1/2013  10/1/2013  NY    3/1/2023  CC    226   Middle  0
8/1/2013  9/1/2013  10/1/2013  CA    3/1/2023  AA    130   Middle  0
8/1/2013  9/1/2013  10/1/2013  NY    3/1/2023  CC    10    High    0
8/1/2013  9/1/2013  10/1/2013  CA    3/1/2023  AA    0     High    0
Doing
I believe I need some sort of wide-to-long method (an SO member assisted with the attempt below), but I am unsure how to apply it when the targeted column names all share the same suffix.
pd.wide_to_long(df,
stubnames=['Low','Middle','High'],
i=['Start','Date','End','Area','Final'],
j='',
sep=' ',
suffix='(stat)'
).unstack(level=-1, fill_value=0).stack(level=0).reset_index()
Any suggestion is appreciated.
#Original Dataset
import pandas as pd
# create DataFrame
data = {'Start': ['9/1/2013', '10/1/2013', '11/1/2013', '12/1/2013'],
'Date': ['10/1/2016', '11/1/2016', '12/1/2016', '1/1/2017'],
'End': ['11/1/2016', '12/1/2016', '1/1/2017', '2/1/2017'],
'Area': ['NY', 'NY', 'NY', 'NY'],
'Final': ['3/1/2023', '3/1/2023', '3/1/2023', '3/1/2023'],
'Type': ['CC', 'CC', 'CC', 'CC'],
'Low Stat': ['', '', '', ''],
'Low Stat1': ['', '', '', ''],
'Middle Stat': ['0', '0', '0', '0'],
'Middle Stat1': ['0', '0', '0', '0'],
'Re': ['','','',''],
'Set': ['0', '0', '0', '0'],
'Set2': ['0', '0', '0', '0'],
'Set3': ['0', '0', '0', '0'],
'High Stat': ['', '', '', ''],
'High Stat1': ['', '', '', '']}
df = pd.DataFrame(data)
Answers:
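pd.wide_to_long expects the stub name at the start of each column name, so you can try to rename the columns first:
import re
# "Middle Stat" -> "StatMiddle": stub first, range label as the suffix
df = df.rename(columns=lambda x: re.sub(r'(Low|Middle|High) Stat', r'Stat\1', x))
x = pd.wide_to_long(df,
                    stubnames='Stat',
                    i=['Start','Date','End','Area','Final'],
                    j='Range', suffix=r'(?:Low|Middle|High)').reset_index()
print(x)
Prints (with only the Stat columns in the frame):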
      Start      Date        End  Area     Final   Range  Type  Stat
0  8/1/2013  9/1/2013  10/1/2013    NY  3/1/2023  Middle    CC   226
1  8/1/2013  9/1/2013  10/1/2013    NY  3/1/2023     Low    CC    20
2  8/1/2013  9/1/2013  10/1/2013    NY  3/1/2023    High    CC    10
3  8/1/2013  9/1/2013  10/1/2013    CA  3/1/2023  Middle    AA   130
4  8/1/2013  9/1/2013  10/1/2013    CA  3/1/2023     Low    AA    50
5  8/1/2013  9/1/2013  10/1/2013    CA  3/1/2023    High    AA     0
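To see what the rename step does on its own, here is a minimal sketch that uses only `re`. It assumes the replacement is meant to be the backreference `Stat\1` (stub first, range label as the suffix). The column list is copied from the question, and the helper name `to_stub_first` is made up for illustration.

```python
import re

# Column names from the question's sample data (Stat columns only).
columns = ["Start", "Date", "End", "Area", "Final", "Type",
           "Middle Stat", "Low Stat", "High Stat"]

def to_stub_first(name: str) -> str:
    # Hypothetical helper: move the range label behind the "Stat" stub.
    # The ^...$ anchors leave names such as "Middle Stat1" untouched.
    return re.sub(r"^(Low|Middle|High) Stat$", r"Stat\1", name)

renamed = [to_stub_first(c) for c in columns]
print(renamed[6:])  # ['StatMiddle', 'StatLow', 'StatHigh']
```

With the names in this shape, `stubnames='Stat'` plus a suffix pattern of `(?:Low|Middle|High)` is enough for `pd.wide_to_long` to find the three columns.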
EDIT: To have more stub-names:
import re
# "Middle Stat" -> "StatMiddle"
df = df.rename(columns=lambda x: re.sub(r"(Low|Middle|High) Stat$", r"Stat\1", x))
# "Middle Stat1" -> "1StatMiddle" (temporary stub "1Stat", renamed back to "Stat1" below)
df = df.rename(columns=lambda x: re.sub(r"(Low|Middle|High) Stat1$", r"1Stat\1", x))
x = (
    pd.wide_to_long(
        df,
        stubnames=["Stat", "1Stat"],
        i=["Start", "Date", "End", "Area", "Final"],
        j="Range",
        suffix=r"(?:Low|Middle|High)",
    )
    .reset_index()
    .rename(columns={"1Stat": "Stat1"})
)
print(x)
Prints:
      Start      Date        End  Area     Final   Range  Type  Stat  Stat1
0  8/1/2013  9/1/2013  10/1/2013    NY  3/1/2023  Middle    CC   226      0
1  8/1/2013  9/1/2013  10/1/2013    NY  3/1/2023     Low    CC    20      0
2  8/1/2013  9/1/2013  10/1/2013    NY  3/1/2023    High    CC    10      0
3  8/1/2013  9/1/2013  10/1/2013    CA  3/1/2023  Middle    AA   130      0
4  8/1/2013  9/1/2013  10/1/2013    CA  3/1/2023     Low    AA    50      0
5  8/1/2013  9/1/2013  10/1/2013    CA  3/1/2023    High    AA     0      0
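As a variant of the same idea, the sketch below renames with an explicit `_` separator so that `Stat` and `Stat1` can both be used directly as stub names, without a temporary `1Stat` stub. This is a swapped-in variation, not the code from the answer above. It assumes the sample data from the question, and `stub_first` and `long_df` are names invented here. It also puts `Type` into `i` and reorders rows and columns to match the Desired table.

```python
import re
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Start": ["8/1/2013", "8/1/2013"],
    "Date": ["9/1/2013", "9/1/2013"],
    "End": ["10/1/2013", "10/1/2013"],
    "Area": ["NY", "CA"],
    "Final": ["3/1/2023", "3/1/2023"],
    "Type": ["CC", "AA"],
    "Middle Stat": [226, 130], "Low Stat": [20, 50], "High Stat": [10, 0],
    "Middle Stat1": [0, 0], "Low Stat1": [0, 0], "High Stat1": [0, 0],
})

def stub_first(name):
    # Hypothetical helper: "Middle Stat" -> "Stat_Middle",
    # "Middle Stat1" -> "Stat1_Middle"; other names are left alone.
    return re.sub(r"^(Low|Middle|High) (Stat1?)$", r"\2_\1", name)

long_df = pd.wide_to_long(
    df.rename(columns=stub_first),
    stubnames=["Stat", "Stat1"],
    i=["Start", "Date", "End", "Area", "Final", "Type"],
    j="Range",
    sep="_",
    suffix=r"(?:Low|Middle|High)",
).reset_index()

# Order the rows Low, Middle, High and the columns as in the Desired table.
long_df["Range"] = pd.Categorical(
    long_df["Range"], categories=["Low", "Middle", "High"], ordered=True
)
long_df = long_df.sort_values("Range", kind="stable").reset_index(drop=True)
long_df = long_df[["Start", "Date", "End", "Area", "Final", "Type",
                   "Stat", "Range", "Stat1"]]
print(long_df)
```

Because the separator comes right after the stub, the pattern for `Stat` cannot accidentally match the `Stat1_...` columns, which is the collision the temporary `1Stat` name works around.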
Another option is a plain melt, which stacks every non-id column into name/value pairs (shown here with only the Stat columns in the frame):
df.melt(id_vars=df.columns[:6], value_name='Values')
      Start      Date        End  Area     Final  Type     variable  Values
0  8/1/2013  9/1/2013  10/1/2013    NY  3/1/2023    CC  Middle Stat     226
1  8/1/2013  9/1/2013  10/1/2013    CA  3/1/2023    AA  Middle Stat     130
2  8/1/2013  9/1/2013  10/1/2013    NY  3/1/2023    CC     Low Stat      20
3  8/1/2013  9/1/2013  10/1/2013    CA  3/1/2023    AA     Low Stat      50
4  8/1/2013  9/1/2013  10/1/2013    NY  3/1/2023    CC    High Stat      10
5  8/1/2013  9/1/2013  10/1/2013    CA  3/1/2023    AA    High Stat       0
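The melted frame still carries the range and the measure in one `variable` column. One possible way to get from there to separate `Range`, `Stat` and `Stat1` columns is to split that column on the space and pivot the measure back out. This is only a sketch under the assumption that every value column is named "<Range> <measure>"; the names `id_cols`, `measure` and `wide` are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Start": ["8/1/2013", "8/1/2013"],
    "Date": ["9/1/2013", "9/1/2013"],
    "End": ["10/1/2013", "10/1/2013"],
    "Area": ["NY", "CA"],
    "Final": ["3/1/2023", "3/1/2023"],
    "Type": ["CC", "AA"],
    "Middle Stat": [226, 130], "Low Stat": [20, 50], "High Stat": [10, 0],
    "Middle Stat1": [0, 0], "Low Stat1": [0, 0], "High Stat1": [0, 0],
})

id_cols = ["Start", "Date", "End", "Area", "Final", "Type"]
melted = df.melt(id_vars=id_cols, value_name="Values")

# "Middle Stat1" -> Range="Middle", measure="Stat1"
melted[["Range", "measure"]] = melted["variable"].str.split(" ", n=1, expand=True)

# Spread the measure back into columns: one row per id columns + Range.
wide = (
    melted.pivot(index=id_cols + ["Range"], columns="measure", values="Values")
    .reset_index()
    .rename_axis(columns=None)
)
print(wide)
```

Note that `pivot` sorts by its index, so the row order here will generally differ from the Desired table even though the contents agree.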
One option is pivot_longer from pyjanitor. In this case the special placeholder .value identifies the part of each column name that should remain as a header, while the remaining part is collated into a new column:
# pip install pyjanitor
import pandas as pd
import janitor
(df
 .pivot_longer(
     index=slice('Start', 'Type'),
     names_to=("Range", ".value"),
     names_sep=" ")
)
      Start      Date        End  Area     Final  Type   Range  Stat  Stat1
0  8/1/2013  9/1/2013  10/1/2013    NY  3/1/2023    CC  Middle   226      0
1  8/1/2013  9/1/2013  10/1/2013    CA  3/1/2023    AA  Middle   130      0
2  8/1/2013  9/1/2013  10/1/2013    NY  3/1/2023    CC     Low    20      0
3  8/1/2013  9/1/2013  10/1/2013    CA  3/1/2023    AA     Low    50      0
4  8/1/2013  9/1/2013  10/1/2013    NY  3/1/2023    CC    High    10      0
5  8/1/2013  9/1/2013  10/1/2013    CA  3/1/2023    AA    High     0      0
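For readers who prefer not to add a dependency, the idea behind the `.value` split (the part after the separator stays a header, the part before it becomes a value in a new column) can be spelled out by hand in plain pandas. The following is a rough sketch, not pyjanitor's implementation. It assumes the sample data from the question and a fixed list of range labels, and `id_cols`, `parts` and `long_df` are names chosen here.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Start": ["8/1/2013", "8/1/2013"],
    "Date": ["9/1/2013", "9/1/2013"],
    "End": ["10/1/2013", "10/1/2013"],
    "Area": ["NY", "CA"],
    "Final": ["3/1/2023", "3/1/2023"],
    "Type": ["CC", "AA"],
    "Middle Stat": [226, 130], "Low Stat": [20, 50], "High Stat": [10, 0],
    "Middle Stat1": [0, 0], "Low Stat1": [0, 0], "High Stat1": [0, 0],
})

id_cols = ["Start", "Date", "End", "Area", "Final", "Type"]

parts = []
for rng in ["Low", "Middle", "High"]:
    # One block per range label: keep the id columns, then pull the two
    # measures for that label into the shared "Stat" / "Stat1" headers.
    part = df[id_cols].copy()
    part["Stat"] = df[f"{rng} Stat"]
    part["Range"] = rng
    part["Stat1"] = df[f"{rng} Stat1"]
    parts.append(part)

long_df = pd.concat(parts, ignore_index=True)
print(long_df)
```

Since the blocks are concatenated in Low, Middle, High order, the rows and columns come out in the same order as the Desired table in the question.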