Pandas: build combinations of values and check whether any of the combinations is in a list

Question:

I have a df1 with some item_ids and some values for each item (called "nodes"):

df1 = pd.DataFrame({'item_id':['1','1','1','2','2','2','3','3'],'nodes':['a','b','c','d','a','e','f','g']})

and a df2 that is a list of "vectors" where each row is a tuple of nodes (that can be in df1, but some of them aren’t):

df2=pd.DataFrame({'vectors':[('a','b'),('b','c'),('d','f'),('e','b')]})

I need to count the number of distinct item_ids in df1 that have at least one vector in df2, given that a vector can be constructed from any possible combination of the nodes for that item.

For example, item_id = 1 has the nodes [a,b,c], so these vectors can be formed: [(a,b),(a,c),(b,a),(b,c),(c,a),(c,b)]. Since the vectors (a,b) and (b,c) exist in df2, I should count item_id = 1. However, I should not count item_id = 2, since none of the vectors that can be formed from its nodes is in df2.
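As a small illustration of the example above (a sketch, not the asker's code), the ordered pairs for item_id = 1 can be enumerated with itertools.permutations and checked against the tuples listed in df2. The names nodes, vectors, pairs and hits are only illustrative.

```python
from itertools import permutations

# Nodes of item_id '1' and the vectors listed in df2, copied from the question
nodes = ['a', 'b', 'c']
vectors = {('a', 'b'), ('b', 'c'), ('d', 'f'), ('e', 'b')}

# Every ordered pair of two distinct nodes
pairs = list(permutations(nodes, 2))
print(pairs)

# Pairs that also appear in df2
hits = [p for p in pairs if p in vectors]
print(hits)  # [('a', 'b'), ('b', 'c')]
```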

I don’t know how to achieve that. I can obtain a list of all possible combinations of nodes to form the different vectors for the first item_id in df1, using:

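from itertools import product
# nodes of the first item_id
nodes_fa=df1[df1.item_id=="1"].nodes.to_list()
# product also yields self-pairs such as ('a','a')
vectors_fa = pd.DataFrame(product(nodes_fa,nodes_fa),columns=['u','v'],dtype='str')
# combine the two columns into one tuple per row
vectors_fa['vector'] = vectors_fa[["u", "v"]].agg(tuple, axis=1)
vectors_fa = vectors_fa[['vector']]
display(vectors_fa)  # display is available in Jupyter/IPython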

but I don’t know how to extend this to all the item_ids, nor how to check inside a loop whether any value in this list is in df2.

Any help would be much appreciated.

Asked By: ElTitoFranki

||

Answers:

You can use itertools.combinations and groupby.apply, with the help of set/frozenset:

To consider the edges undirected (('a', 'b') == ('b', 'a')):

from itertools import combinations

S = set(frozenset(x) for x in df2['vectors'])

out = (
    df1.groupby('item_id')['nodes']
    .apply(lambda g: any(frozenset(t) in S for t in combinations(g, r=2)))
    .sum()
)

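For directed edges (('a', 'b') != ('b', 'a')), use itertools.permutations so that both orderings of each pair are checked (combinations only yields each pair in row order):

from itertools import permutations

S = set(df2['vectors'])

out = (
    df1.groupby('item_id')['nodes']
    .apply(lambda g: any(t in S for t in permutations(g, r=2)))
    .sum()
)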

Output: 1
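
For reference, here is a minimal end-to-end sketch on the question's sample data. count_items is a hypothetical helper name, not part of pandas. It assumes, as in the question's example, that directed pairs are wanted in both orders, so it uses itertools.permutations for that case and frozensets of itertools.combinations for the undirected case.

```python
from itertools import combinations, permutations

import pandas as pd

df1 = pd.DataFrame({'item_id': ['1', '1', '1', '2', '2', '2', '3', '3'],
                    'nodes': ['a', 'b', 'c', 'd', 'a', 'e', 'f', 'g']})
df2 = pd.DataFrame({'vectors': [('a', 'b'), ('b', 'c'), ('d', 'f'), ('e', 'b')]})


def count_items(df1, vectors, directed=True):
    """Count item_ids with at least one node pair found in `vectors`.

    Hypothetical helper for illustration only.
    """
    if directed:
        # Ordered tuples: ('a', 'b') and ('b', 'a') are different
        wanted = set(vectors)

        def pairs(nodes):
            return permutations(nodes, 2)
    else:
        # Unordered pairs: compare as frozensets
        wanted = {frozenset(v) for v in vectors}

        def pairs(nodes):
            return (frozenset(p) for p in combinations(nodes, 2))

    # One boolean per item_id: does any pair appear in `wanted`?
    hits = df1.groupby('item_id')['nodes'].apply(
        lambda g: any(p in wanted for p in pairs(g))
    )
    return int(hits.sum())


print(count_items(df1, df2['vectors']))                  # 1
print(count_items(df1, df2['vectors'], directed=False))  # 1
```

With a vector such as ('b', 'a'), the directed variant still counts item_id 1, because permutations yields the pair in both orders.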

Alternative with pandas functions (likely less efficient):

s = (df1.groupby('item_id')['nodes']
        .agg(lambda g: list(combinations(g, r=2)))
        .explode()
     )

out = s.isin(df2['vectors']).groupby(level=0).any().sum()
Answered By: mozway

I would like to propose a solution that uses merges instead of relying on apply.
Depending on how many item_id values exist and how many rows each item_id has, it may also perform better.

# self-merge: every ordered pair of nodes within each item_id
merged_df1 = pd.merge(df1, df1, on='item_id')
# drop the (x, y) pairs where x == y; remove this line if that is not the intended behavior
merged_df1 = merged_df1[merged_df1['nodes_x'] != merged_df1['nodes_y']]

# split the tuples of df2 into two columns
df2[['node_1', 'node_2']] = pd.DataFrame(df2.vectors.tolist(), index=df2.index)
valid_id = pd.merge(merged_df1, df2,
                    left_on=['nodes_x', 'nodes_y'],
                    right_on=['node_1', 'node_2']
                    ).item_id.unique()
out = len(valid_id)
Answered By: rorshan
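
A self-contained sketch of the same merge idea on the question's sample data, for anyone who wants to run it directly. It differs from the answer above in one assumed detail: the split tuples go into a separate frame (called edges here, a name chosen only for illustration), so df2 itself is not modified.

```python
import pandas as pd

df1 = pd.DataFrame({'item_id': ['1', '1', '1', '2', '2', '2', '3', '3'],
                    'nodes': ['a', 'b', 'c', 'd', 'a', 'e', 'f', 'g']})
df2 = pd.DataFrame({'vectors': [('a', 'b'), ('b', 'c'), ('d', 'f'), ('e', 'b')]})

# Self-merge on item_id: every ordered pair of nodes within each item
pairs = df1.merge(df1, on='item_id', suffixes=('_x', '_y'))
# Drop pairs of a node with itself
pairs = pairs[pairs['nodes_x'] != pairs['nodes_y']]

# Split the tuples into two columns in a new frame, leaving df2 untouched
edges = pd.DataFrame(df2['vectors'].tolist(), columns=['node_1', 'node_2'])

# Inner merge keeps only the pairs that exist in df2
matched = pairs.merge(edges,
                      left_on=['nodes_x', 'nodes_y'],
                      right_on=['node_1', 'node_2'])
out = matched['item_id'].nunique()
print(out)  # 1
```

Because the self-merge produces both (x, y) and (y, x), this treats the vectors as directed, matching the pairs listed in the question.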