Detect immediate parent rows of each row and add their values to a new column
Question:
I have a dataframe that contains multiple hierarchy columns and a value attach to them, something like this:
location hierarhcy
industry hierarchy
value
0
US
All
V1
1
US-California
All
V2
2
US-California
All-HT
V3
3
US-California-LA
All
V4
4
US-California-LA
All-HT
V5
5
US-California-LA
All-HT-SoftWare
V6
6
US-California-LA
All-Manufacturing
V7
Now I want to find the immediate parents of each row and add their value a a new column. My example just shows two parents but it could be more. So, the expected output would be as follows:
location hierarhcy
industry hierarchy
value
value-parent-1
value-parent-2
0
US
All
V1
NA
NA
1
US-California
All
V2
V1
NA
2
US-California
All-HT
V3
V2
NA
3
US-California-LA
All
V4
V2
NA
4
US-California-LA
All-HT
V5
V3
V4
5
US-California-LA
All-HT-SoftWare
V6
V5
NA
6
US-California-LA
All-Manufacturing
V7
V4
NA
My Dataframe could be extremely large, so the speed and memory are important factors. So, row-wise process won’t really work.
Logic:
The immediate parents for a row will be all rows that are one level higher than one of the hierarchies. For example for row 5, the potential immediate parents will be:
–
location hierarhcy
industry hierarchy
0
US-California
All-HT-SoftWare
1
US-California-LA
All-HT
This is the code to create the dataframe if you want to play with it.
import pandas as pd
df = pd.DataFrame({'location hierarchy': ['US', 'US-California', 'US-California', 'US-California-LA', 'US-California-LA', 'US-California-LA', 'US-California-LA'],
'industry hierarchy': ['All', 'All', 'All-HT', 'All', 'All-HT', 'All-HT-SoftWare', 'All-Manufacturing'],
'value': ['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7']})
Answers:
Here you are:
df = pd.DataFrame({'Location Hierarchy': ['US', 'US-California', 'US-California', 'US-California-LA', 'US-California-LA', 'US-California-LA', 'US-California-LA'],
'Industry Hierarchy': ['All', 'All', 'All-HT', 'All', 'All-HT', 'All-HT-SoftWare', 'All-Manufacturing'],
'Value': ['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7']})
hierarchies = ["Location Hierarchy", "Industry Hierarchy"]
for i, hierarchy in enumerate(hierarchies):
df1 = df.copy()
df1[hierarchy] = df1[hierarchy].str.split('-').str[:-1].str.join("-")
df[f"Value_parent_{i+1}"] = df1.merge(df, on=hierarchies, how="left", suffixes=["", "_parent"])["Value_parent"]
I have a dataframe that contains multiple hierarchy columns and a value attach to them, something like this:
location hierarhcy | industry hierarchy | value | |
---|---|---|---|
0 | US | All | V1 |
1 | US-California | All | V2 |
2 | US-California | All-HT | V3 |
3 | US-California-LA | All | V4 |
4 | US-California-LA | All-HT | V5 |
5 | US-California-LA | All-HT-SoftWare | V6 |
6 | US-California-LA | All-Manufacturing | V7 |
Now I want to find the immediate parents of each row and add their value a a new column. My example just shows two parents but it could be more. So, the expected output would be as follows:
location hierarhcy | industry hierarchy | value | value-parent-1 | value-parent-2 | |
---|---|---|---|---|---|
0 | US | All | V1 | NA | NA |
1 | US-California | All | V2 | V1 | NA |
2 | US-California | All-HT | V3 | V2 | NA |
3 | US-California-LA | All | V4 | V2 | NA |
4 | US-California-LA | All-HT | V5 | V3 | V4 |
5 | US-California-LA | All-HT-SoftWare | V6 | V5 | NA |
6 | US-California-LA | All-Manufacturing | V7 | V4 | NA |
My Dataframe could be extremely large, so the speed and memory are important factors. So, row-wise process won’t really work.
Logic:
The immediate parents for a row will be all rows that are one level higher than one of the hierarchies. For example for row 5, the potential immediate parents will be:
– | location hierarhcy | industry hierarchy |
---|---|---|
0 | US-California | All-HT-SoftWare |
1 | US-California-LA | All-HT |
This is the code to create the dataframe if you want to play with it.
import pandas as pd
df = pd.DataFrame({'location hierarchy': ['US', 'US-California', 'US-California', 'US-California-LA', 'US-California-LA', 'US-California-LA', 'US-California-LA'],
'industry hierarchy': ['All', 'All', 'All-HT', 'All', 'All-HT', 'All-HT-SoftWare', 'All-Manufacturing'],
'value': ['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7']})
Here you are:
df = pd.DataFrame({'Location Hierarchy': ['US', 'US-California', 'US-California', 'US-California-LA', 'US-California-LA', 'US-California-LA', 'US-California-LA'],
'Industry Hierarchy': ['All', 'All', 'All-HT', 'All', 'All-HT', 'All-HT-SoftWare', 'All-Manufacturing'],
'Value': ['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7']})
hierarchies = ["Location Hierarchy", "Industry Hierarchy"]
for i, hierarchy in enumerate(hierarchies):
df1 = df.copy()
df1[hierarchy] = df1[hierarchy].str.split('-').str[:-1].str.join("-")
df[f"Value_parent_{i+1}"] = df1.merge(df, on=hierarchies, how="left", suffixes=["", "_parent"])["Value_parent"]