Is there a way to count the length of a daisy chain of internal table references using Pandas / Python?

Question:

We’ve a table that contains an Id, and on the same row, a reference to another Id in the same table. The Id record was infected by the referenced Id record. The referenced Id itself may or may not have a reference to another Id, it may not exist, or it may become a circular reference (linking back upon itself). Put into pandas, the problem looks a bit like this:

import pandas as pd
import numpy as np
# example data frame
inp = [{'Id': 1, 'refId': np.nan},
   {'Id': 2, 'refId': 1},
   {'Id': 3, 'refId': 2},
   {'Id': 4, 'refId': 3}, 
   {'Id': 5, 'refId': np.nan},
   {'Id': 6, 'refId': 7},
   {'Id': 7, 'refId': 20},
   {'Id': 8, 'refId': 9},
   {'Id': 9, 'refId': 8}, 
   {'Id': 10, 'refId': 8}
   ]
df = pd.DataFrame(inp)
print(df.dtypes)

What I am trying to do is count of how far back the references go for each row in the table. The logic would:

  • Starting with Result = 0 for each row:
  • If a Ref-Id is not nan, then add 1,
  • If the referenced-Id exists, and this referenced-Id has a reference, and the referenced-Id reference is not a back-reference, add 1 to the Result, then
    repeat this step until one of the conditions is NOT met, then go to
    Else;
  • Else (no reference-Id, no reference for the referenced-Id, or
    reference loops back to a previous reference), return the Result.

Results from example should look like:

Id  RefId  Result
1     -      0
2     1      1
3     2      2
4     3      3
5     -      0
6     7      2
7     20     1
8     9      1
9     8      1
10    8      2

Every approach I’ve tried ends up needed a new column for each reference to a reference, but the table is quite enourmous, and I’m not sure how long the daisy-chain of internal table references will ultimately be. I’m hoping there might be a better way, that isn’t too difficult for me to learn.

Asked By: DrWhat

||

Answers:

This is a graph problem, so you can use networkx.

Convert your dataframe to directed graph:

import networkx as nx

G = nx.from_pandas_edgelist(df.fillna(-1).astype(int),
                            source='Id', target='refId',   # source -> target
                            create_using=nx.DiGraph()      # directed graph
                            )

# removing the NaN (replaced by "-1" for enabling indexing)
G.remove_node(-1)

This give this graph:

graph

Then simply count the children:

nodes = {n: len(nx.descendants(G,n)) for n in G.nodes}

df['Result'] = df['Id'].map(lambda x: nodes.get(x, 0))

output:

   Id  refId  Result
0   1    NaN       0
1   2    1.0       1
2   3    2.0       2
3   4    3.0       3
4   5    NaN       0
5   6    7.0       2
6   7   20.0       1
7   8    9.0       1
8   9    8.0       1
9  10    8.0       2

NB. the result is a bit different, so maybe I did not fully grasp your logic, but this gives you the general idea. Please elaborate on the logic.

Answered By: mozway