Python algorithm with numpy

Question:

I want to group the pairs (couples) in a 2D array to find the families they form:

import numpy as np

rij = [[11, 2], [15, 6], [7, 8], [3, 6], [9, 2], [2, 3], [2, 3]]
rij = np.sort(rij, axis=1)  # sort within each pair
rij = np.unique(rij, axis=0)  # remove duplicate pairs

After this code I get this:

[[ 2  3]
 [ 2  9]
 [ 2 11]
 [ 3  6]
 [ 6 15]
 [ 7  8]
 [ 7  20]]

This is where I get stuck: I need to loop through the pairs and check whether a number already exists in a group.

Expected result (the family) would be:

[2, 3, 6, 9, 11, 15]
[7, 8, 20]

It would be nice to be able to specify the degree as well. Family in 2nd degree:

[2, 3, 9, 11]
[6, 15]
[7, 8, 20]

Family in 3rd degree:

[2, 3, 6, 9, 11, 15]
[7, 8, 20]

Family in last degree (same as the previous one in this example):

[2, 3, 6, 9, 11, 15]
[7, 8, 20]
Asked By: user1737853


Answers:

We can solve this using scipy's sparse matrix and graph modules. Your rij is an edge list: each pair names two nodes that are connected. From it we can build an adjacency matrix, that is, a matrix that is 1 where two nodes are connected and 0 where they are not. From this, we can compute other properties.
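To make the relation between the list of pairs and an adjacency matrix concrete, here is a minimal sketch using a dense numpy array indexed directly by ID. The names pairs and adjacency are chosen for illustration only, and a dense matrix like this is only reasonable for tiny examples.

```python
import numpy as np

# Pairs from the sorted, de-duplicated example in the question.
pairs = np.array([(2, 3), (2, 9), (2, 11), (3, 6), (6, 15), (7, 8), (7, 20)])

# One row and one column per possible ID, including IDs that never occur.
size = pairs.max() + 1
adjacency = np.zeros((size, size), dtype=bool)
adjacency[pairs[:, 0], pairs[:, 1]] = True
adjacency[pairs[:, 1], pairs[:, 0]] = True  # mirror: the relation is undirected

print(adjacency[2, 3], adjacency[3, 2], adjacency[2, 6])  # True True False
```

Only 14 of the 441 entries are set here, which is why a sparse representation is the better fit.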

Let’s apply this to your problem. We start by cleaning up your input. As @Ali_Sh noted, there is an inconsistency in your example. The first list of rij has different elements than the sorted and unique array below. I ignore the first line and start with the sorted unique version.

import numpy as np

pairings = ((2, 3), (2, 9), (2, 11), (3, 6), (6, 15), (7, 8), (7, 20))
pairings = np.array(pairings)

The IDs are not consecutive. Used directly as matrix indices, they would leave unused rows and columns and waste resources further down, so let's compress the range. The index will be the graph node. The value at the index is the original ID in pairings. We can use this as a lookup table. For the inverse mapping I use a simple dictionary.

node_to_id = np.unique(pairings)  # np.unique flattens its input and returns sorted values
id_to_node = {id_: node for node, id_ in enumerate(node_to_id)}
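As a possible alternative to the dictionary, np.unique can return the inverse mapping itself through return_inverse=True. The sketch below assumes the same pairings array as above; node_pairs is a name introduced here for illustration.

```python
import numpy as np

pairings = np.array([(2, 3), (2, 9), (2, 11), (3, 6), (6, 15), (7, 8), (7, 20)])

# inverse holds, for every element of pairings, its index in node_to_id.
node_to_id, inverse = np.unique(pairings, return_inverse=True)
# Reshape explicitly so the code does not depend on the shape np.unique returns.
node_pairs = inverse.reshape(pairings.shape)

# Round trip: looking the node indices up again restores the original IDs.
assert (node_to_id[node_pairs] == pairings).all()
```

This avoids a Python-level lookup per pair, which may matter for large inputs.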

Now we build a sparse adjacency matrix. A node i is connected to node j if matrix[i, j] is true. Since our "family" relationship is undirected (if i is related to j, then j is always related to i), we build a symmetric matrix.

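The scipy.sparse.csgraph documentation notes that a symmetric matrix represents an undirected graph regardless of the directed setting, and that keeping the default directed=True is generally more efficient in that case. Storing both directions lets us rely on that default.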

The graph algorithms need CSR format (compressed sparse row). We start with DOK format (dictionary of keys) and convert afterwards because it is easier to build. Since our input is sorted, LIL format (list of lists) may be faster but DOK has better worst-case performance in case we don’t sort beforehand.

from scipy import sparse

n_nodes = len(node_to_id)
dok_mat = sparse.dok_matrix((n_nodes, n_nodes), dtype=bool)
for left, right in pairings:
    row, col = id_to_node[left], id_to_node[right]
    dok_mat[row, col] = True
    dok_mat[col, row] = True # undirected graph
csr_mat = dok_mat.tocsr()
del dok_mat
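If the Python loop becomes a bottleneck, the matrix can probably be built in one step from index arrays instead. This is a sketch of that alternative using COO format, not the approach used in the rest of this answer; csr_alt and the int8 dtype are choices made here for illustration.

```python
import numpy as np
from scipy import sparse

pairings = np.array([(2, 3), (2, 9), (2, 11), (3, 6), (6, 15), (7, 8), (7, 20)])
node_to_id, inverse = np.unique(pairings, return_inverse=True)
rows, cols = inverse.reshape(pairings.shape).T

n_nodes = len(node_to_id)
# Each edge is listed twice, once per direction, so the matrix is symmetric.
data = np.ones(2 * len(pairings), dtype=np.int8)
coo = sparse.coo_matrix(
    (data, (np.concatenate([rows, cols]), np.concatenate([cols, rows]))),
    shape=(n_nodes, n_nodes),
)
csr_alt = coo.tocsr()  # duplicate entries, if any, are summed during conversion
```

Because duplicates are summed, this sketch assumes the pairs were de-duplicated beforehand, as they are in this example.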

Connected components gives us our families. For each row in the matrix, we get an integer label that marks its component.

import collections
from scipy.sparse import csgraph

_, components = csgraph.connected_components(csr_mat)
families = collections.defaultdict(list)
for id_, component in zip(node_to_id, components):
    families[component].append(int(id_))  # plain int, so the list prints without numpy scalar wrappers
print("families", list(families.values()))
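As a cross-check that does not need scipy, the same families can be found with a union-find (disjoint set) structure in plain Python. This is a different technique from connected_components, shown only as a sketch; find and group_families are hypothetical helper names.

```python
def find(parent, x):
    # Walk up to the root, shortening the path along the way.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def group_families(pairs):
    parent = {}
    for a, b in pairs:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups
    groups = {}
    for member in sorted(parent):
        groups.setdefault(find(parent, member), []).append(member)
    return sorted(groups.values())


pairs = [(2, 3), (2, 9), (2, 11), (3, 6), (6, 15), (7, 8), (7, 20)]
families_uf = group_families(pairs)
print(families_uf)  # [[2, 3, 6, 9, 11, 15], [7, 8, 20]]
```

It only answers the "which family" question; it does not give distances between members.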

The shortest path gives the number of hops, i.e. the distance in relationship. Unrelated nodes have infinite distance.

shortest_paths = csgraph.shortest_path(csr_mat)
maxdist = 2  # maximum number of hops that still counts as family
for id_, row in zip(node_to_id, shortest_paths):
    immediate_family = node_to_id[row <= maxdist]
    print(id_, immediate_family)

The output will be

families [[2, 3, 6, 9, 11, 15], [7, 8, 20]]
2 [ 2  3  6  9 11]
3 [ 2  3  6  9 11 15]
6 [ 2  3  6 15]
7 [ 7  8 20]
8 [ 7  8 20]
9 [ 2  3  9 11]
11 [ 2  3  9 11]
15 [ 3  6 15]
20 [ 7  8 20]
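Computing all pairwise distances may be more than needed if you only care about one person. The sketch below wraps the steps in a hypothetical helper, relatives_within, that runs csgraph.dijkstra from a single start node with unweighted=True, so every edge counts as one hop. The function name, its parameters, and the vectorized matrix construction are assumptions made for this example.

```python
import numpy as np
from scipy import sparse
from scipy.sparse import csgraph


def relatives_within(pairings, person, max_hops):
    """Return the IDs reachable from person in at most max_hops hops."""
    pairings = np.asarray(pairings)
    node_to_id, inverse = np.unique(pairings, return_inverse=True)
    rows, cols = inverse.reshape(pairings.shape).T
    n_nodes = len(node_to_id)
    # Symmetric matrix: every edge is stored in both directions.
    graph = sparse.coo_matrix(
        (np.ones(2 * len(pairings)),
         (np.concatenate([rows, cols]), np.concatenate([cols, rows]))),
        shape=(n_nodes, n_nodes),
    ).tocsr()
    matches = np.flatnonzero(node_to_id == person)
    if matches.size == 0:
        raise KeyError(person)
    # Distances from the single start node; unreachable nodes get inf.
    hops = csgraph.dijkstra(graph, indices=[int(matches[0])], unweighted=True)[0]
    return node_to_id[hops <= max_hops]


pairings = [(2, 3), (2, 9), (2, 11), (3, 6), (6, 15), (7, 8), (7, 20)]
print(relatives_within(pairings, 2, 2))
```

For person 2 and a maximum of 2 hops this should select the same IDs as the row for 2 in the output above.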
Answered By: Homer512