Creating hierarchy using 4 columns in dataframe – pandas
Question:
Dataframe is below
ID ParentID Filter Text
0 98 97 NULL AA
1 99 98 NULL BB
2 100 99 NULL CC
3 107 100 1 DD
4 9999 1231 NULL EE
5 10000 1334 NULL FF
6 10001 850 2 GG
7 850 230 NULL HH
8 230 121 NULL II
9 121 96 NULL JJ
10 96 0 NULL KK
11 97 0 NULL LL
I need to add an additional column hierarchy like this:
ID ParentID Filter Text Hierarchy
0 98 97 NULL AA
1 99 98 NULL BB
2 100 99 NULL CC
3 107 100 1 DD DD/CC/BB/AA/LL
4 9999 1231 NULL EE
5 10000 1334 NULL FF
6 10001 850 2 GG GG/HH/II/JJ/KK
7 850 230 NULL HH
8 230 121 NULL II
9 121 96 NULL JJ
10 96 0 NULL KK
11 97 0 NULL LL
The rules I am looking at are below:
-
Only populate hierarchy column for rows which have filter value populated, the rest of the rows don’t need hierarchy done.
-
When a row is found having filter value not null, lookup its parentID, then search this parentid in ID column. When found reclusively keep going up till, parent id is 0.
-
Trying to do this with itertools but the looping is taking too long as the original dataset is huge
4)Recordset size is ~ 200k
The below solution provided kindly by mozway seems to work but for a recorset of 200k records, it takes a lot of time. Is there a tweak that can be done to this to get to the solution faster ?
Answers:
This is a graph problem, which you can easily solve with networkx
.
- create a directed graph with
nx.from_pandas_edgelist
- find the roots with a list comprehension and
in_degree
- get the paths between the selected nodes and the roots using
all_simple_paths
and a mapping Series
- assign to you DataFrame
import networkx as nx
m = df['Filter'].notna()
nodes = df.loc[m, 'ID']
mapper = df[m].set_index('ID')['Text']
# create graph
G = nx.from_pandas_edgelist(df, source='ParentID', target='ID',
create_using=nx.DiGraph)
# find roots
roots = {n for n, deg in G.in_degree() if deg==0}
# {1231, 1334, 0}
# retrieve hierarchy
df.loc[m, 'Hierarchy'] = [
';'.join(['/'.join([mapper.get(x) for x in p[:0:-1]])
for p in nx.all_simple_paths(G, r, n)])
for n in nodes for r in roots
for p in nx.all_simple_paths(G, r, n)
]
Note that there could be several hierarchies if the graph is branched. In this case, this would return all of them separated by ;
.
Output:
ID ParentID Filter Text Hierarchy
0 98 97 NaN AA NaN
1 99 98 NaN BB NaN
2 100 99 NaN CC NaN
3 107 100 1.0 DD DD/CC/BB/AA/LL
4 9999 1231 NaN EE NaN
5 10000 1334 NaN FF NaN
6 10001 850 2.0 GG GG/HH/II/JJ/KK
7 850 230 NaN HH NaN
8 230 121 NaN II NaN
9 121 96 NaN JJ NaN
10 96 0 NaN KK NaN
11 97 0 NaN LL NaN
Graph:
potential optimization
If the dataset is huge, a potential optimization might be to only iterate over the roots that are part of the connected components. You’d have to try in the real dataset if this improves performance.
import networkx as nx
m = df['Filter'].notna()
nodes = df.loc[m, 'ID']
mapper = df[m].set_index('ID')['Text']
G = nx.from_pandas_edgelist(df, source='ParentID', target='ID', create_using=nx.DiGraph)
roots = {n for n, deg in G.in_degree() if deg==0}
# {1231, 1334, 0}
roots_dict = {n: s&roots for s in nx.weakly_connected_components(G) for n in s}
df.loc[m, 'Hierarchy'] = [
';'.join(['/'.join([mapper.get(x) for x in p[:0:-1]])
for p in nx.all_simple_paths(G, r, n)])
for n in nodes for r in roots_dict[n]
for p in nx.all_simple_paths(G, r, n)
]
Maybe you can try dictionaries. Not sure, but let’s see.
Creating a test dataframe:
import pandas as pd
data = {
'ID': [98, 99, 100, 107, 9999, 10000, 10001, 850, 230, 121, 96, 97],
'ParentID': [97, 98, 99, 100, 1231, 1334, 850, 230, 121, 96, 0, 0],
'Filter Text': [None, None, None, '1', None, None, '2', None, None, None, None, None],
'Text': ['AA', 'BB', 'CC', 'DD', 'EE', 'FF', 'GG', 'HH', 'II', 'JJ', 'KK', 'LL']
}
df = pd.DataFrame(data)
Initialize where you will keep your data:
df = pd.DataFrame(data)
df['Hierarchy'] = ""
parent_child_dict = {}
The main logic to play with dictionaries
for index, row in df.iterrows():
current_id = row['ID']
parent_id = row['ParentID']
parent_child_dict[current_id] = parent_id
for index, row in df.iterrows():
hierarchy = []
current_id = row['ID']
while current_id != 0:
parent_id = parent_child_dict.get(current_id)
if parent_id is None:
break
parent_row = df.loc[df['ID'] == parent_id]
if parent_row.empty:
break
parent_text = parent_row['Text'].values[0]
hierarchy.insert(0, parent_text)
current_id = parent_id
hierarchy.append(row['Text'])
df.at[index, 'Hierarchy'] = '/'.join(hierarchy)
print(df)
Here is a solution involving one pass of dfs for each root node.
The worst time complexity is O(V + FV)
where V
is the number of columns and F
is
the number of columns to populate hierarchy for.
This might be faster than other solutions as it exploits the fact that the
given graph is a tree and hence there is only one path from root to any node.
# this is a recursive dfs code with additional logic to store the hierarchy
# of interesting nodes
def dfs(graph, stack, interesting, hierarchy):
node = stack[-1]
for child in graph[node]:
stack.append(child)
if child in interesting:
hierarchy[child] = stack[:]
dfs(graph, stack, interesting, hierarchy)
stack.pop()
# make 'ID' the index
df = df.set_index("ID")
# find the roots
roots = df[df["ParentID"] == 0].index
# interesting nodes to find the hierarchy for
interesting = set(df[df["Filter"].notna()].index)
# dict to store the id -> hierarchy mapping
hierarchy = {}
# build a directed graph in adjacency list of parent -> children
graph = defaultdict(list)
for node, parent in zip(df.index, df["ParentID"]):
graph[parent].append(node)
# run dfs on each root
for root in roots:
stack = [root]
dfs(graph, stack, interesting, hierarchy)
# populate the hierarchy column
df["Hierarchy"] = ""
for node, path in hierarchy.items():
df.loc[node, "Hierarchy"] = "/".join(df.loc[path, "Text"])
# make 'ID' a column again
df = df.reset_index()
# now we're done!
print(df)
Full code is in https://pastebin.com/6MFaqZQw.
Dataframe is below
ID ParentID Filter Text
0 98 97 NULL AA
1 99 98 NULL BB
2 100 99 NULL CC
3 107 100 1 DD
4 9999 1231 NULL EE
5 10000 1334 NULL FF
6 10001 850 2 GG
7 850 230 NULL HH
8 230 121 NULL II
9 121 96 NULL JJ
10 96 0 NULL KK
11 97 0 NULL LL
I need to add an additional column hierarchy like this:
ID ParentID Filter Text Hierarchy
0 98 97 NULL AA
1 99 98 NULL BB
2 100 99 NULL CC
3 107 100 1 DD DD/CC/BB/AA/LL
4 9999 1231 NULL EE
5 10000 1334 NULL FF
6 10001 850 2 GG GG/HH/II/JJ/KK
7 850 230 NULL HH
8 230 121 NULL II
9 121 96 NULL JJ
10 96 0 NULL KK
11 97 0 NULL LL
The rules I am looking at are below:
-
Only populate hierarchy column for rows which have filter value populated, the rest of the rows don’t need hierarchy done.
-
When a row is found having filter value not null, lookup its parentID, then search this parentid in ID column. When found reclusively keep going up till, parent id is 0.
-
Trying to do this with itertools but the looping is taking too long as the original dataset is huge
4)Recordset size is ~ 200k
The below solution provided kindly by mozway seems to work but for a recorset of 200k records, it takes a lot of time. Is there a tweak that can be done to this to get to the solution faster ?
This is a graph problem, which you can easily solve with networkx
.
- create a directed graph with
nx.from_pandas_edgelist
- find the roots with a list comprehension and
in_degree
- get the paths between the selected nodes and the roots using
all_simple_paths
and a mapping Series - assign to you DataFrame
import networkx as nx
m = df['Filter'].notna()
nodes = df.loc[m, 'ID']
mapper = df[m].set_index('ID')['Text']
# create graph
G = nx.from_pandas_edgelist(df, source='ParentID', target='ID',
create_using=nx.DiGraph)
# find roots
roots = {n for n, deg in G.in_degree() if deg==0}
# {1231, 1334, 0}
# retrieve hierarchy
df.loc[m, 'Hierarchy'] = [
';'.join(['/'.join([mapper.get(x) for x in p[:0:-1]])
for p in nx.all_simple_paths(G, r, n)])
for n in nodes for r in roots
for p in nx.all_simple_paths(G, r, n)
]
Note that there could be several hierarchies if the graph is branched. In this case, this would return all of them separated by ;
.
Output:
ID ParentID Filter Text Hierarchy
0 98 97 NaN AA NaN
1 99 98 NaN BB NaN
2 100 99 NaN CC NaN
3 107 100 1.0 DD DD/CC/BB/AA/LL
4 9999 1231 NaN EE NaN
5 10000 1334 NaN FF NaN
6 10001 850 2.0 GG GG/HH/II/JJ/KK
7 850 230 NaN HH NaN
8 230 121 NaN II NaN
9 121 96 NaN JJ NaN
10 96 0 NaN KK NaN
11 97 0 NaN LL NaN
Graph:
potential optimization
If the dataset is huge, a potential optimization might be to only iterate over the roots that are part of the connected components. You’d have to try in the real dataset if this improves performance.
import networkx as nx
m = df['Filter'].notna()
nodes = df.loc[m, 'ID']
mapper = df[m].set_index('ID')['Text']
G = nx.from_pandas_edgelist(df, source='ParentID', target='ID', create_using=nx.DiGraph)
roots = {n for n, deg in G.in_degree() if deg==0}
# {1231, 1334, 0}
roots_dict = {n: s&roots for s in nx.weakly_connected_components(G) for n in s}
df.loc[m, 'Hierarchy'] = [
';'.join(['/'.join([mapper.get(x) for x in p[:0:-1]])
for p in nx.all_simple_paths(G, r, n)])
for n in nodes for r in roots_dict[n]
for p in nx.all_simple_paths(G, r, n)
]
Maybe you can try dictionaries. Not sure, but let’s see.
Creating a test dataframe:
import pandas as pd
data = {
'ID': [98, 99, 100, 107, 9999, 10000, 10001, 850, 230, 121, 96, 97],
'ParentID': [97, 98, 99, 100, 1231, 1334, 850, 230, 121, 96, 0, 0],
'Filter Text': [None, None, None, '1', None, None, '2', None, None, None, None, None],
'Text': ['AA', 'BB', 'CC', 'DD', 'EE', 'FF', 'GG', 'HH', 'II', 'JJ', 'KK', 'LL']
}
df = pd.DataFrame(data)
Initialize where you will keep your data:
df = pd.DataFrame(data)
df['Hierarchy'] = ""
parent_child_dict = {}
The main logic to play with dictionaries
for index, row in df.iterrows():
current_id = row['ID']
parent_id = row['ParentID']
parent_child_dict[current_id] = parent_id
for index, row in df.iterrows():
hierarchy = []
current_id = row['ID']
while current_id != 0:
parent_id = parent_child_dict.get(current_id)
if parent_id is None:
break
parent_row = df.loc[df['ID'] == parent_id]
if parent_row.empty:
break
parent_text = parent_row['Text'].values[0]
hierarchy.insert(0, parent_text)
current_id = parent_id
hierarchy.append(row['Text'])
df.at[index, 'Hierarchy'] = '/'.join(hierarchy)
print(df)
Here is a solution involving one pass of dfs for each root node.
The worst time complexity is O(V + FV)
where V
is the number of columns and F
is
the number of columns to populate hierarchy for.
This might be faster than other solutions as it exploits the fact that the
given graph is a tree and hence there is only one path from root to any node.
# this is a recursive dfs code with additional logic to store the hierarchy
# of interesting nodes
def dfs(graph, stack, interesting, hierarchy):
node = stack[-1]
for child in graph[node]:
stack.append(child)
if child in interesting:
hierarchy[child] = stack[:]
dfs(graph, stack, interesting, hierarchy)
stack.pop()
# make 'ID' the index
df = df.set_index("ID")
# find the roots
roots = df[df["ParentID"] == 0].index
# interesting nodes to find the hierarchy for
interesting = set(df[df["Filter"].notna()].index)
# dict to store the id -> hierarchy mapping
hierarchy = {}
# build a directed graph in adjacency list of parent -> children
graph = defaultdict(list)
for node, parent in zip(df.index, df["ParentID"]):
graph[parent].append(node)
# run dfs on each root
for root in roots:
stack = [root]
dfs(graph, stack, interesting, hierarchy)
# populate the hierarchy column
df["Hierarchy"] = ""
for node, path in hierarchy.items():
df.loc[node, "Hierarchy"] = "/".join(df.loc[path, "Text"])
# make 'ID' a column again
df = df.reset_index()
# now we're done!
print(df)
Full code is in https://pastebin.com/6MFaqZQw.