Is there a way in Python to identify all possible relationships back up to a single parent

Question

I’m currently working through a problem within the dataset (stored as a dataframe) I have that I’m not sure how to solve for. One of the tables in the dataset shows all relationships for a given ID. But there are multiple instances where a child of a parent will have other relationships as well as the grandchild having its own set and so on… What I want to achieve is a single table that shows all possible matches for a given ID.

Example of Original Table

Parent	Child
1	2
1	3
2	1
2	4
3	2

Final Result

Parent	Child
1	2
1	3
1	4
2	1
2	3
2	4
3	1
3	2
3	4

Is this the right way to structure the data? the true dataset has 15k rows so I think it will be a lot larger after this operation.

Cheers

I’ve looked into tree structures but not 100% sure if that’s what’s needed.

Asked By: thecoxinitiative

||

Source

Answer 1

Original table is just a regular graph.

graph LR;
    1-->2;
    1-->3;
    2-->1;
    2-->4;
    3-->2;

"shows all possible matches for a given ID": use BFS or DFS starting at ID after generating graph, and exclude starting ID.

Answered By: fejyesynb

Answer 2

With df your dataframe you could try the following:

def descendants(childs, source):
    descendants = set()
    stack = [source]
    while stack:
        descendant = stack.pop()
        if descendant not in childs:
            continue
        for child in childs[descendant]:
            if child != source and child not in descendants:
                stack.append(child)
                descendants.add(child)
    return descendants

childs = {parent: set(childs) for parent, childs in df.groupby("Parent")["Child"]}

df_descendants = pd.DataFrame(
    [(parent, descendants(childs, parent)) for parent in df["Parent"].unique()],
    columns=["Parent", "Descendant"]
).explode("Descendant", ignore_index=True).sort_values(["Parent", "Descendant"])

Or you could just use networkx:

import networkx as nx

G = nx.from_pandas_edgelist(df, source="Parent", target="Child")
df_descendants = pd.DataFrame(
    [(parent, nx.descendants(G, parent)) for parent in df["Parent"].unique()],
    columns=["Parent", "Descendant"]
).explode("Descendant", ignore_index=True).sort_values(["Parent", "Descendant"])

Result for the sample (both alternatives):

   Parent Descendant
0       1          2
1       1          3
2       1          4
3       2          1
4       2          3
5       2          4
6       3          1
7       3          2
8       3          4

If you want to include the distance of the descendants from their ancestors then you have to switch from a DFS to a BFS (by using a deque with .appendleft for stack):

from collections import deque

def descendants(childs, source):
    descendants = {}
    stack = deque([(source, 0)])
    while stack:
        descendant, distance = stack.pop()
        distance += 1
        if descendant not in childs:
            continue
        for child in childs[descendant]:
            if child != source and child not in descendants:
                stack.appendleft((child, distance))
                descendants[child] = distance
    return descendants

childs = {
    parent: set(childs) for parent, childs in df.groupby("Parent")["Child"]
}

df_descendants = pd.DataFrame(
    [
        (parent, descendant, distance)
        for parent in df["Parent"].unique()
        for descendant, distance in descendants(childs, parent).items()
    ],
    columns=["Parent", "Descendant", "Distance"]
).sort_values(["Parent", "Distance", "Descendant"])

Result for the sample:

   Parent  Descendant  Distance
0       1           2         1
1       1           3         1
2       1           4         2
3       2           1         1
4       2           4         1
5       2           3         2
6       3           2         1
7       3           1         2
8       3           4         2

Answered By: Timus

Is there a way in Python to identify all possible relationships back up to a single parent

Question:

Answers: