How to find related rows based on column value similarity with pandas

Question

Given a random dataset, I need to find rows related to the first row.

|Row|Foo|Bar|Baz|Qux|
|---|---|---|---|---|
| 0 | A |A  |A | A |
| 1 | B | B | B | B |
| 2 | C | C | C |D |
| 3 | D |A  | D |D |

I should get the related rows which are 0, 2, and 3 because 0['Bar'] == 3['Bar'] and 3['Qux'] == 2['Qux'].

I can just iterate over the columns to get the similarities but that would be slow and inefficient and I would also need to iterate again if there are new similarities.

I hope someone can point me to the right direction like which pandas concept should I be looking at or which functions can help me solve this problem of retrieving intersecting data. Do I even need to use pandas?

Edit:

Providing the solution as suggested by @goodside. This solution will loop until there are no more new matched index found.

table = [
['A', 'A', 'A', 'A'],
['B', 'B', 'B', 'B'],
['C', 'C', 'C', 'D'],
['D', 'A', 'D', 'D']
]

comparators = [0]

while True:
  for idx_row, row in enumerate(table):
    if idx_row in comparators:
      continue

    for idx_col, cell in enumerate(row):
      for comparator in comparators:
        if cell == table[comparator][idx_col]:
          comparators.append(idx_row)
          break
      else:
        continue
      break
    else:
      continue
    break
  else:
    break

for item in comparators:
  print(table[item])

Asked By: JMY

||

Source

Answer 1

This is a graph problem. You can use networkx:

# get the list of connected nodes per column
def get_edges(s):
    return df['Row'].groupby(s).agg(frozenset)

edges = set(df.apply(get_edges).stack())
edges = list(map(set, edges))
# [{2}, {2, 3}, {0, 3}, {3}, {1}, {0}]

from itertools import pairwise, chain
# pairwise is python ≥ 3.10, see the doc for a recipe for older versions

# create the graph
import networkx as nx
G = nx.from_edgelist(chain.from_iterable(pairwise(e) for e in edges))
G.add_nodes_from(set.union(*edges))

# get the connected components
list(nx.connected_components(G))

Output: [{0, 2, 3}, {1}]

NB. You can read more on the logic to create the graph in this question of mine.

Used input:

df = pd.DataFrame({'Row': [0, 1, 2, 3],
                   'Foo': ['A', 'B', 'C', 'D'],
                   'Bar': ['A', 'B', 'C', 'A'],
                   'Baz': ['A', 'B', 'C', 'D'],
                   'Qux': ['A', 'B', 'D', 'D']})

Answered By: mozway

How to find related rows based on column value similarity with pandas

Question:

Answers: