Find connected components recursively in a data frame

Question:

Consider the following data frame:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "main": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "component": [
            [1, 2],
            [np.nan],
            [3, 8],
            [np.nan],
            [1, 5, 6],
            [np.nan],
            [7],
            [np.nan],
            [9, 10],
            [np.nan],
            [np.nan],
        ],
    }
)

Each value in the column main identifies an approach. Each approach consists of components. A component can itself be an approach, in which case it is called a sub-approach.

I want to find all connected sub-approaches/components for a certain approach.

Suppose, for instance, I want to find all connected sub-approaches/components for the main approach 0.
Then my desired output would look like this:

target = pd.DataFrame({
    "main": [0, 0, 2, 2, 8, 8],
    "component": [1, 2, 3, 8, 9, 10]
})

Ideally, I want to be able to just choose the approach and then get all sub-connections.
I am convinced that there is a smart approach to do so using networkx. Any hint is appreciated.

Ultimately, I want to draw these connections as a graph (for approach 0).

Additional information:

You can explode the data frame and then drop the rows whose component is NaN. Those rows belong to plain components, i.e. approaches that do not have any components of their own, so they no longer appear in the main column.

df_exploded = df.explode(column="component").dropna(subset="component")
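
To make the effect of this step concrete, here is a minimal sketch that rebuilds the frame from the question and inspects the exploded edge list. The name `edges` is my own choice, and passing `subset` as a list is only a stylistic preference.

```python
import numpy as np
import pandas as pd

# Same data as in the question, written more compactly.
df = pd.DataFrame(
    {
        "main": list(range(11)),
        "component": [
            [1, 2], [np.nan], [3, 8], [np.nan], [1, 5, 6], [np.nan],
            [7], [np.nan], [9, 10], [np.nan], [np.nan],
        ],
    }
)

# One row per (main, component) pair; rows whose component is NaN are dropped.
edges = df.explode("component").dropna(subset=["component"])

# Only approaches that actually have components remain in "main".
print(edges["main"].unique().tolist())  # [0, 2, 4, 6, 8]
print(len(edges))                       # 10
```

Each remaining row can be read as one directed edge from an approach to one of its components.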

The graph can be constructed as follows:

import networkx as nx
import graphviz

G = nx.Graph()
G.add_edges_from([(i, j) for i, j in target.values])

graph_attr = dict(rankdir="LR", nodesep="0.2")
g = graphviz.Digraph(graph_attr=graph_attr)

for k, v in G.nodes.items():
    g.node(str(k), shape="box", style="filled", height="0.35")

for n1, n2 in G.edges:
    g.edge(str(n2), str(n1))

g
Asked By: ko3

||

Answers:

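You can use nx.dfs_edges. It yields the edges of a depth-first search from a given source node, so on a directed graph it returns only the edges reachable from the chosen approach: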

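import networkx as nx

# One row per (main, component) pair, without the NaN placeholders
edges = df.explode(column='component').dropna(subset='component')

# Directed graph with one edge per pair: approach -> component
G = nx.from_pandas_edgelist(edges, source='main', target='component', create_using=nx.DiGraph)

# Edges visited by a depth-first search starting at approach 0
target = pd.DataFrame(nx.dfs_edges(G, 0), columns=['main', 'component'])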

Output:

>>> target
   main  component
0     0          1
1     0          2
2     2          3
3     2          8
4     8          9
5     8         10
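
Since the question asks for a way to just choose the approach, the steps above could be wrapped in a small helper. This is only a sketch: `sub_connections` is a hypothetical name, not part of pandas or networkx, and returning an empty frame for an id that is not in the graph is my own assumption about the desired behaviour.

```python
import numpy as np
import pandas as pd
import networkx as nx

# Same data as in the question, written more compactly.
df = pd.DataFrame(
    {
        "main": list(range(11)),
        "component": [
            [1, 2], [np.nan], [3, 8], [np.nan], [1, 5, 6], [np.nan],
            [7], [np.nan], [9, 10], [np.nan], [np.nan],
        ],
    }
)


def sub_connections(df: pd.DataFrame, approach: int) -> pd.DataFrame:
    """Return every (main, component) edge reachable from `approach`."""
    edges = df.explode("component").dropna(subset=["component"])
    G = nx.from_pandas_edgelist(
        edges, source="main", target="component", create_using=nx.DiGraph
    )
    if approach not in G:
        # Assumption: an unknown id simply has no sub-connections.
        return pd.DataFrame(columns=["main", "component"])
    # Depth-first traversal that follows edge direction from `approach`.
    return pd.DataFrame(
        list(nx.dfs_edges(G, approach)), columns=["main", "component"]
    )


print(sub_connections(df, 0))
print(sub_connections(df, 4))
```

For approach 4 this gives the edges (4, 1), (4, 5), (4, 6) and (6, 7). A plain component such as 1 gives an empty frame.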

To extract the corresponding subgraph (a read-only view of G), use:

H = G.edge_subgraph(nx.dfs_edges(G, 0))
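
As a quick check of what that subgraph contains, here is a small sketch. The edge list is typed out by hand to mirror the exploded frame, and `H_copy` is a name of my own. The comparison with nx.descendants is only a sanity check, not something the approach requires.

```python
import networkx as nx

# Edges of the exploded frame from the question, written out by hand.
G = nx.DiGraph(
    [(0, 1), (0, 2), (2, 3), (2, 8), (4, 1), (4, 5), (4, 6), (6, 7), (8, 9), (8, 10)]
)

# Subgraph induced by the edges reachable from approach 0.
H = G.edge_subgraph(nx.dfs_edges(G, 0))

print(sorted(H.nodes))  # [0, 1, 2, 3, 8, 9, 10]
print(sorted(H.edges))  # [(0, 1), (0, 2), (2, 3), (2, 8), (8, 9), (8, 10)]

# Apart from the root, the nodes are exactly the descendants of 0.
print(set(H.nodes) - {0} == nx.descendants(G, 0))  # True

# H is a frozen view of G; copy it to get an independent, editable graph.
H_copy = H.copy()
H_copy.add_edge(10, 11)
```

The nodes and edges of H (or H_copy) can then be passed to the graphviz loop from the question in place of the hand-built target.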
Answered By: Corralien