How to group a pandas dataframe by array intersection

Question:

Say I have a DataFrame like below

  UUID             domains
0  asd   [foo.com, foo.ca]
1  jkl    [foo.ca, foo.fr]
2  xyz            [foo.fr]
3  iek  [bar.com, bar.org]
4  qkr           [bar.org]
5  kij          [buzz.net]

How can I turn it into something like this?

  UUID
0  [asd, jkl, xyz]
1  [iek, qkr]
2  [kij]

I want to group all the UUIDs whose domains lists share at least one domain, directly or through a chain of other rows. For example, rows 0 and 1 both contain foo.ca and rows 1 and 2 both contain foo.fr, so all three should be grouped together.

The size of my data set is millions of rows so I can’t brute force it.

Asked By: Iain

||

Answers:

We can explode the domains column first, then use networkx to find the connected components:

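import networkx as nx
import pandas as pd

s = df.explode('domains')  # one row per (UUID, domain) pair
G = nx.from_pandas_edgelist(s, 'UUID', 'domains')  # link each UUID to its domains
domain_nodes = set(s['domains'])  # built once, so membership tests are O(1)
# each connected component is one group; keep only the UUID nodes
out = pd.Series([[n for n in comp if n not in domain_nodes]
                 for comp in nx.connected_components(G)])

Output (the order within each group may vary):
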
0    [xyz, jkl, asd]
1         [iek, qkr]
2              [kij]
dtype: object
Answered By: BENY

Assuming the following input with domains as lists:

import pandas as pd

df = pd.DataFrame({'UUID': ['asd', 'jkl', 'xyz', 'iek', 'qkr', 'kij'],
                   'domains': [['foo.com', 'foo.ca'], ['foo.ca', 'foo.fr'], ['foo.fr'],
                               ['bar.com', 'bar.org'], ['bar.org'], ['buzz.net']]})

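Your problem is a graph problem. Treat each UUID and each domain as a node, with a directed edge from a UUID to each of its domains. Each group is then one disconnected subgraph, and you want to find its roots (the UUIDs, which have no incoming edges):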

[Image: graph]

This is easily achieved with networkx.

# transform dataframe into graph
import networkx as nx
G = nx.from_pandas_edgelist(df.explode('domains'),
                            source='UUID', target='domains',
                            create_using=nx.DiGraph)

# split into weakly connected subgraphs and keep each one's roots
# (nodes with in-degree 0, i.e. the UUIDs); the result is a lazy generator
groups = ([n for n,g in G.subgraph(c).in_degree if g==0]
          for c in nx.weakly_connected_components(G))

# transform the generator to Series
s = pd.Series(groups)

output:

0    [asd, jkl, xyz]
1         [iek, qkr]
2              [kij]
Answered By: mozway
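
For comparison, the same connected-component grouping can be sketched without networkx, using a union-find (disjoint-set) structure over plain Python lists. This is only an illustrative sketch under the same assumption as above (domains stored as lists): the names `group_by_shared_domain`, `find` and `first_row` are made up for this example, and it has not been benchmarked on millions of rows.

```python
from collections import defaultdict

def group_by_shared_domain(uuids, domain_lists):
    """Group UUIDs linked, directly or transitively, by a shared domain."""
    parent = list(range(len(uuids)))  # each row starts as its own group

    def find(i):
        # walk up to the group's representative, shortening the path as we go
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_row = {}  # domain -> index of the first row that contained it
    for i, domains in enumerate(domain_lists):
        for d in domains:
            if d in first_row:
                # this domain was seen before: merge the two rows' groups
                ri, rj = find(i), find(first_row[d])
                if ri != rj:
                    # keep the smaller index as root so groups stay in row order
                    parent[max(ri, rj)] = min(ri, rj)
            else:
                first_row[d] = i

    groups = defaultdict(list)
    for i, u in enumerate(uuids):
        groups[find(i)].append(u)
    return list(groups.values())

uuids = ['asd', 'jkl', 'xyz', 'iek', 'qkr', 'kij']
domain_lists = [['foo.com', 'foo.ca'], ['foo.ca', 'foo.fr'], ['foo.fr'],
                ['bar.com', 'bar.org'], ['bar.org'], ['buzz.net']]
result = group_by_shared_domain(uuids, domain_lists)
print(result)  # [['asd', 'jkl', 'xyz'], ['iek', 'qkr'], ['kij']]
```

With the DataFrame from the question, the inputs would be `df['UUID'].tolist()` and `df['domains'].tolist()`, and the result can be wrapped in `pd.Series(...)` to match the outputs above.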