Creating hierarchy using 4 columns in dataframe – pandas

Question:

Dataframe is below

    ID     ParentID  Filter  Text
0   98     97        NULL    AA
1   99     98        NULL    BB
2   100    99        NULL    CC
3   107    100       1       DD
4   9999   1231      NULL    EE
5   10000  1334      NULL    FF
6   10001  850       2       GG
7   850    230       NULL    HH
8   230    121       NULL    II
9   121    96        NULL    JJ
10  96     0         NULL    KK
11  97     0         NULL    LL

I need to add an additional column hierarchy like this:

    ID     ParentID  Filter  Text  Hierarchy
0   98     97        NULL    AA
1   99     98        NULL    BB
2   100    99        NULL    CC
3   107    100       1       DD    DD/CC/BB/AA/LL
4   9999   1231      NULL    EE
5   10000  1334      NULL    FF
6   10001  850       2       GG    GG/HH/II/JJ/KK
7   850    230       NULL    HH
8   230    121       NULL    II
9   121    96        NULL    JJ
10  96     0         NULL    KK
11  97     0         NULL    LL

The rules I am looking at are below:

  1. Only populate the Hierarchy column for rows that have a Filter value; the other rows don't need a hierarchy.

  2. When a row has a non-null Filter value, look up its ParentID, then search for that value in the ID column. Keep going up recursively until the ParentID is 0.

  3. I tried doing this with itertools, but the looping takes too long because the original dataset is huge.

  4. The recordset size is ~200k.

The solution below, kindly provided by mozway, seems to work, but for a recordset of 200k records it takes a lot of time. Is there a tweak that would get to the solution faster?

Asked By: misguided

Answers:

This is a graph problem, which you can easily solve with networkx.

import networkx as nx

m = df['Filter'].notna()
nodes = df.loc[m, 'ID']

# map every ID to its Text; paths also pass through rows without a Filter value
mapper = df.set_index('ID')['Text']

# create graph
G = nx.from_pandas_edgelist(df, source='ParentID', target='ID',
                            create_using=nx.DiGraph)

# find roots
roots = {n for n, deg in G.in_degree() if deg==0}
# {1231, 1334, 0}

# retrieve hierarchy
df.loc[m, 'Hierarchy'] = [
    # one entry per filtered node; multiple paths are joined with ';'
    ';'.join('/'.join(mapper.get(x) for x in p[:0:-1])
             for r in roots
             for p in nx.all_simple_paths(G, r, n))
    for n in nodes
]

Note that there could be several hierarchies if the graph is branched. In this case, this would return all of them separated by ;.

Output:

       ID  ParentID  Filter Text       Hierarchy
0      98        97     NaN   AA             NaN
1      99        98     NaN   BB             NaN
2     100        99     NaN   CC             NaN
3     107       100     1.0   DD  DD/CC/BB/AA/LL
4    9999      1231     NaN   EE             NaN
5   10000      1334     NaN   FF             NaN
6   10001       850     2.0   GG  GG/HH/II/JJ/KK
7     850       230     NaN   HH             NaN
8     230       121     NaN   II             NaN
9     121        96     NaN   JJ             NaN
10     96         0     NaN   KK             NaN
11     97         0     NaN   LL             NaN
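
On the note about branched graphs: with this kind of data, a node can only be reached by more than one path if some ID is listed with more than one distinct ParentID. A minimal sketch of a check for that, in plain Python so it runs without networkx; the helper name `ids_with_multiple_parents` and the extra row are made up for illustration:

```python
from collections import defaultdict

def ids_with_multiple_parents(ids, parent_ids):
    """Return the IDs that appear with more than one distinct ParentID."""
    parents = defaultdict(set)
    for node, parent in zip(ids, parent_ids):
        parents[node].add(parent)
    return {node for node, seen in parents.items() if len(seen) > 1}

# Sample data from the question: every ID has exactly one parent.
ids = [98, 99, 100, 107, 9999, 10000, 10001, 850, 230, 121, 96, 97]
parent_ids = [97, 98, 99, 100, 1231, 1334, 850, 230, 121, 96, 0, 0]
tree_check = ids_with_multiple_parents(ids, parent_ids)

# Hypothetical extra row (107 also listed under 850) makes the graph branched.
branched_check = ids_with_multiple_parents(ids + [107], parent_ids + [850])
print(tree_check, branched_check)
```

With the DataFrame from the question this could be called as `ids_with_multiple_parents(df['ID'], df['ParentID'])`. An empty result means each filtered row has at most one hierarchy, so no `;` separators would appear.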

Graph: (figure: graph hierarchy)

Potential optimization

If the dataset is huge, a potential optimization is to iterate, for each node, only over the roots that belong to the same connected component. You would have to test on the real dataset whether this improves performance.

import networkx as nx

m = df['Filter'].notna()
nodes = df.loc[m, 'ID']

# map every ID to its Text; paths also pass through rows without a Filter value
mapper = df.set_index('ID')['Text']

G = nx.from_pandas_edgelist(df, source='ParentID', target='ID', create_using=nx.DiGraph)

roots = {n for n, deg in G.in_degree() if deg==0}
# {1231, 1334, 0}

roots_dict = {n: s&roots for s in nx.weakly_connected_components(G) for n in s}

df.loc[m, 'Hierarchy'] = [
    # one entry per filtered node; multiple paths are joined with ';'
    ';'.join('/'.join(mapper.get(x) for x in p[:0:-1])
             for r in roots_dict[n]
             for p in nx.all_simple_paths(G, r, n))
    for n in nodes
]
Answered By: mozway

Maybe you can try dictionaries. Not sure, but let’s see.
Creating a test dataframe:

import pandas as pd

data = {
    'ID': [98, 99, 100, 107, 9999, 10000, 10001, 850, 230, 121, 96, 97],
    'ParentID': [97, 98, 99, 100, 1231, 1334, 850, 230, 121, 96, 0, 0],
    'Filter': [None, None, None, '1', None, None, '2', None, None, None, None, None],
    'Text': ['AA', 'BB', 'CC', 'DD', 'EE', 'FF', 'GG', 'HH', 'II', 'JJ', 'KK', 'LL']
}

df = pd.DataFrame(data)

Initialize where you will keep your data:

df['Hierarchy'] = ""

parent_child_dict = {}

The main logic, using dictionaries:

for index, row in df.iterrows():
    current_id = row['ID']
    parent_id = row['ParentID']
    
    parent_child_dict[current_id] = parent_id

for index, row in df.iterrows():
    hierarchy = []
    current_id = row['ID']
    
    while current_id != 0:
        parent_id = parent_child_dict.get(current_id)
        
        if parent_id is None:
            break
        
        parent_row = df.loc[df['ID'] == parent_id]
        
        if parent_row.empty:
            break
        
        parent_text = parent_row['Text'].values[0]
        hierarchy.insert(0, parent_text)
        current_id = parent_id
    
    hierarchy.append(row['Text'])
    
    df.at[index, 'Hierarchy'] = '/'.join(hierarchy)
print(df)
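
The loop above fetches each parent's text with `df.loc[df['ID'] == parent_id]`, which scans the whole column on every step, and it builds a hierarchy for every row. A rough sketch of the same dictionary idea that keeps all lookups in dicts and only walks up from the filtered rows; `hierarchy_for` and its arguments are hypothetical names, and it assumes each ID appears only once:

```python
def hierarchy_for(ids, parent_ids, texts, wanted):
    """Return one 'row/.../root' string per row, or None where wanted is False."""
    parent_of = dict(zip(ids, parent_ids))
    text_of = dict(zip(ids, texts))
    result = []
    for node, keep in zip(ids, wanted):
        if not keep:
            result.append(None)
            continue
        parts, seen = [], set()
        # Climb until the parent is not in the ID column (e.g. 0)
        # or a cycle is detected.
        while node in text_of and node not in seen:
            seen.add(node)
            parts.append(text_of[node])
            node = parent_of[node]
        result.append("/".join(parts))
    return result

# Sample data from the question, as plain lists.
ids = [98, 99, 100, 107, 9999, 10000, 10001, 850, 230, 121, 96, 97]
parent_ids = [97, 98, 99, 100, 1231, 1334, 850, 230, 121, 96, 0, 0]
texts = ['AA', 'BB', 'CC', 'DD', 'EE', 'FF', 'GG', 'HH', 'II', 'JJ', 'KK', 'LL']
wanted = [i in (107, 10001) for i in ids]  # rows with a Filter value

result = hierarchy_for(ids, parent_ids, texts, wanted)
print(result[3])  # DD/CC/BB/AA/LL
print(result[6])  # GG/HH/II/JJ/KK
```

With the question's DataFrame this could be used as `df['Hierarchy'] = hierarchy_for(df['ID'], df['ParentID'], df['Text'], df['Filter'].notna())`. The result lists the row's own text first and the root last, which is the order requested in the question.
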
Answered By: Mamed

Here is a solution involving one pass of DFS for each root node.
The worst-case time complexity is O(V + FV), where V is the number of rows (nodes) and F is
the number of rows to populate the hierarchy for.
This might be faster than the other solutions, as it exploits the fact that the
given graph is a tree, so there is only one path from a root to any node.

from collections import defaultdict

# this is a recursive dfs code with additional logic to store the hierarchy
# of interesting nodes
def dfs(graph, stack, interesting, hierarchy):
    node = stack[-1]
    for child in graph[node]:
        stack.append(child)
        if child in interesting:
            hierarchy[child] = stack[:]
        dfs(graph, stack, interesting, hierarchy)
    stack.pop()


# make 'ID' the index
df = df.set_index("ID")

# find the roots
roots = df[df["ParentID"] == 0].index
# interesting nodes to find the hierarchy for
interesting = set(df[df["Filter"].notna()].index)
# dict to store the id -> hierarchy mapping
hierarchy = {}

# build a directed graph in adjacency list of parent -> children
graph = defaultdict(list)
for node, parent in zip(df.index, df["ParentID"]):
    graph[parent].append(node)

# run dfs on each root
for root in roots:
    stack = [root]
    dfs(graph, stack, interesting, hierarchy)

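# populate the hierarchy column; each stored path runs root -> node,
# so reverse it to get the node-first order shown in the question
df["Hierarchy"] = ""
for node, path in hierarchy.items():
    df.loc[node, "Hierarchy"] = "/".join(df.loc[path[::-1], "Text"])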

# make 'ID' a column again
df = df.reset_index()

# now we're done!
print(df)

Full code is in https://pastebin.com/6MFaqZQw.
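
One thing to watch with the recursive dfs above: CPython's default recursion limit is 1000 frames, so a chain of parents deeper than that would raise RecursionError. A sketch of an iterative variant that keeps the same single shared path, assuming the same parent -> children adjacency mapping; `iterative_paths` is a hypothetical name, and unlike the recursive version it also records a root that is itself interesting:

```python
def iterative_paths(graph, roots, interesting):
    """Map each interesting node to its root -> node path, without recursion."""
    done = object()  # sentinel marking an exhausted child iterator
    found = {}
    for root in roots:
        path = [root]                        # current root -> node path
        iters = [iter(graph.get(root, ()))]  # one child iterator per path entry
        if root in interesting:
            found[root] = [root]
        while iters:
            child = next(iters[-1], done)
            if child is done:
                # All children visited: backtrack one level.
                iters.pop()
                path.pop()
                continue
            path.append(child)
            if child in interesting:
                found[child] = path[:]       # copy, since path keeps changing
            iters.append(iter(graph.get(child, ())))
    return found

# Parent -> children adjacency for the question's sample rows.
graph = {0: [96, 97], 97: [98], 98: [99], 99: [100], 100: [107],
         96: [121], 121: [230], 230: [850], 850: [10001],
         1231: [9999], 1334: [10000]}
found = iterative_paths(graph, roots=[96, 97], interesting={107, 10001})
print(found[107])  # [97, 98, 99, 100, 107]

# A 5000-node chain, deeper than the default recursion limit.
deep = {i: [i + 1] for i in range(1, 5000)}
deep_found = iterative_paths(deep, roots=[1], interesting={5000})
print(len(deep_found[5000]))
```

The returned paths run from the root down to the node, like the `hierarchy` dict above, so they can be joined into text in the same way.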

Answered By: Jun