Fast Pathfinder associative network algorithm (PFNET) in Python

Question:

I’ve been trying to implement the "Fast Pathfinder" network pruning algorithm from https://doi.org/10.1016/j.ipm.2007.09.005 in Python/networkX, and have finally arrived at something that returns results that look more or less right.

I’m not quite competent enough to test whether the results are consistently (or ever, for that matter) correct, though. I have my doubts especially for directed graphs, and I’m unsure whether the original is even intended to work for them. I have not found a Python implementation of any Pathfinder network algorithm yet, but if there is an existing alternative, I would be interested in using it to compare results. I know there is an implementation in R (https://rdrr.io/cran/comato/src/R/pathfinder.r), from which I took some inspiration as well.

Based on my best (read: poor) understanding, the algorithm described in the paper uses a distance matrix of shortest paths generated by the Floyd-Warshall algorithm, and compares those distances to the weighted adjacency matrix, picking only the matches as links. The intuition for the expected result in the undirected case is the union of all edges in all of its possible minimum spanning trees.

That is what I am attempting to emulate with the below function:

def minimal_pathfinder(G, r = float("inf")):
    """ 
    Args:
    -----
    G [networkX graph]:
        Graph to filter links from.
    r [float]:
        "r" parameter as in the paper.

    Returns:
    -----
    PFNET [networkX graph]:
        Graph containing only the PFNET links.
    """
    
    import networkx as nx
    from collections import defaultdict
    
    H = G.copy()
    
    # Initialize adjacency matrix W
    W = defaultdict(lambda: defaultdict(lambda: float("inf")))
    
    # Set diagonal to 0
    for u in H.nodes():
        W[u][u] = 0 
    
    # Get weights and set W values
    for i, j, d in H.edges(data=True):
        W[i][j] = d['weight'] # Add weights to W
        
    # Get shortest path distance matrix D
    dist = nx.floyd_warshall_predecessor_and_distance(H, weight='weight')[1]
    
    # Iterate over all triples to get values for D
    for k in H.nodes():
        for i in H.nodes():
            for j in H.nodes():
                if r == float("inf"): # adapted from the R-comato version which does a similar check
                # Discard non-shortest paths
                    dist[i][j] = min(dist[i][j], (dist[i][k] + dist[k][j]))
                else:
                    dist[i][j] = min(dist[i][j], (((dist[i][k]) ** r) + ((dist[k][j]) ** r )) ** (1/r))
                
    # Check for type; set placeholder for either case
    if not H.is_directed():
        PFNET = nx.Graph()
        PFNET.add_nodes_from(H.nodes(data=True))
    else:
        PFNET = nx.DiGraph()
        PFNET.add_nodes_from(H.nodes(data=True))
        
    # Add links D_ij only if == W_ij
    for i in H.nodes():
        for j in H.nodes():
            if dist[i][j] == W[i][j]: # If shortest path distance equals distance in adjacency
                if dist[i][j] == float("inf"): # Skip infinite path lengths
                    pass
                elif i == j: # Skip the diagonal
                    pass
                else: # Add link to PFNET
                    weight = dist[i][j]
                    PFNET.add_edge(i, j, weight=weight)
                    
    return PFNET

I’ve tested this with a bunch of real networks (both directed and undirected) and randomly generated networks, both cases ranging from 20ish nodes up to around 300 nodes, maximum few thousand edges (e.g. complete graphs, connected caveman graphs). In all cases it returns something, but I have little confidence the results are correct. As I find no other implementations I’m unsure how to verify this is working consistently (I’m not really using any other languages at all).

I am fairly sure there is still something wrong with this but I am unsure of what it might be.

Simple use case:

import networkx as nx
import numpy as np

G = nx.complete_graph(50) # Generate a complete graph

# Add random weights
for (u,v,w) in G.edges(data=True):
    w['weight'] = np.random.randint(1,20)
    
PFNET = minimal_pathfinder(G)

# nx.info() was removed in NetworkX 3.0; printing the graph gives the same summary
print(G)
print(PFNET)

Output:

Graph with 50 nodes and 1225 edges
Graph with 50 nodes and 236 edges

I was wondering about two things:

1. Any idea what might be wrong with the implementation? Should I have confidence in the results?

2. Any idea how this might be converted to work with similarity data instead of distances?

For the second, I considered normalizing the weights to the 0-1 range and converting all the distances to similarities as 1 - distance. But I am unsure whether this is theoretically valid, and was hoping for a second opinion.

EDIT: I possibly discovered a solution to Q2 in the original paper: change float("inf") to float("-inf") and change min to max in the first loop. From the authors’ footnote:

Actually, using similarities or distances has no influence at all in
our proposal. In case of using similarities, we would only need to
replace MIN by MAX, ’>’ by ’<’, and use r = -inf to mimic the MIN
function instead of the MAX function in the Fast Pathfinder algorithm.
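
As a minimal sketch of that footnote for the r = ∞ case with q = n-1: the two legs of a path are combined with MIN (a path is only as strong as its weakest link), and alternatives are compared with MAX. The function name `pfnet_from_similarity` and the convention that 0 means "no edge" are assumptions made for illustration; they do not come from the paper.

```python
import numpy as np

def pfnet_from_similarity(S):
    """Hypothetical helper: prune a symmetric similarity matrix (0 = no edge)."""
    S = np.asarray(S, dtype=float)
    n = S.shape[0]
    # best[i, j]: strongest "weakest link" found so far on any path from i to j
    best = S.copy()
    for k in range(n):
        # A path through k is as strong as its weaker leg (MIN) ...
        via_k = np.minimum(best[:, [k]], best[[k], :])
        # ... and the strongest alternative wins (MAX).
        best = np.maximum(best, via_k)
    # Keep a direct link only if no indirect path is stronger than it.
    keep = (S > 0) & np.isclose(S, best)
    return np.where(keep, S, 0.0)

S = [[0, 1, 4, 2, 2],
     [1, 0, 2, 3, 0],
     [4, 2, 0, 3, 1],
     [2, 3, 3, 0, 3],
     [2, 0, 1, 3, 0]]
print(pfnet_from_similarity(S))
```

Read as similarities, the 5-node matrix used further down should keep only the links 0-2, 1-3, 2-3 and 3-4.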

Any inputs much appreciated, thanks!

EDIT (adding example of how it goes wrong from here) per comment, using the "example from a datafile" section:

Adjacency in starting graph:

matrix([[0, 1, 4, 2, 2],
        [1, 0, 2, 3, 0],
        [4, 2, 0, 3, 1],
        [2, 3, 3, 0, 3],
        [2, 0, 1, 3, 0]], dtype=int32)

And after pruning with the function, converting first into a networkX undirected graph:

matrix([[0, 1, 0, 2, 2],
        [1, 0, 2, 3, 0],
        [0, 2, 0, 3, 1],
        [2, 3, 3, 0, 3],
        [2, 0, 1, 3, 0]], dtype=int32)

It seems to drop only the highest weight overall leaving all other edges. Since the expected result is in an edgelist on the linked example, here’s the edgelist of the result I obtain as well:

source  target  weight
1       2       1
1       4       2
1       5       2
2       3       2
2       4       3 
3       4       3
3       5       1
4       5       3

Asked By: Huug.

||

Answers:

Disclaimer: I am one of the authors of the optimisation papers (Fast PFNET, but there is also a faster version, MST-PFNET). Note that the MST-PFNET version can only be applied to a subset of the original PFNET algorithm's parameters, i.e., it only works with q = n-1 and r = ∞. Sorry for the delay in answering; I only saw this post today.

I will try to address as many questions as possible:

  • First, to avoid any confusion, since I see the two concepts mixed in the post and in the comments below: the Fast PFNET (or Fast Pathfinder) algorithm, an optimisation of the original PFNET algorithm from Schvaneveldt, is based on a shortest-path algorithm. The MST-PFNET version is even faster and is based on Minimum Spanning Trees (MST). Both optimisations work only with (different) subsets of the original algorithm's parameters (see this page to see which ones). Thus they are not compatible.

  • I am not aware of a Python version either. But if you are fluent in C, you can find all the versions of this algorithm on Git here. Those versions should be straightforward to compile (using the Makefile) and to use (input files are in Pajek format, some examples are included, the command line is <executable> <input_filename>, and the output, also in Pajek format, is sent directly to stdout).

  • The original PFNET version from Schvaneveldt is intended to be used with directed and undirected graphs, but the optimised versions are defined only for undirected graphs. You will find a comparison table for all the versions on this page.

  • I am not able to check your Python version now, but on the mentioned page there is a very simple example you can use to test your implementation. The versions on Git are also well tested (with thousands of random graphs against the original slow version; the code to create the random graphs is also on Git), so their output for any random graph can be considered reliable enough.

  • Why do you think your implementation might be wrong? Based only on the statistics, it is perfectly normal that the algorithm prunes edges but not nodes; this is its nominal behaviour.

  • The implementations on Git are meant to work with similarities, not distances, as graph weights. As noted in the paper, switching from similarity to distance and vice versa does not change the algorithm itself; we only need to adapt the comparison operators and a few other instructions.

  • As said in one of the comments, MST-PFNET (or the original PFNET but with the restriction applied to the parameters) applied to a tree returns the exact same tree.

  • If a graph has multiple different MSTs, this means that some edges of these MSTs share the same weight. The result of MST-PFNET is the superposition of those MSTs (i.e., it keeps every edge that appears in at least one of them).

  • I confirm the behaviour for an unweighted graph (or a graph in which all edges have the same weight): the result of MST-PFNET should be the input graph itself.
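
The last three points can be checked with a small sketch for the q = n-1, r = ∞ case on an undirected graph with distance weights. It is not the Fast PFNET or MST-PFNET code from the papers, only a plain Floyd-Warshall-style loop; the name `pfnet_minimax` is made up for illustration, and every edge is assumed to carry a `weight` attribute. Note that the two legs of a path are combined with max rather than a sum, which is what r = ∞ means in the Minkowski metric.

```python
import itertools
import math
import networkx as nx

def pfnet_minimax(G, weight="weight"):
    """Hypothetical helper: keep edges that no path beats on its largest step."""
    nodes = list(G.nodes())
    # d[u][v]: smallest "largest edge weight" over all u-v paths found so far
    d = {u: {v: math.inf for v in nodes} for u in nodes}
    for u, v, w in G.edges(data=weight):
        d[u][v] = d[v][u] = min(d[u][v], w)
    # product() varies the last index fastest, so k is the outermost loop
    for k, i, j in itertools.product(nodes, repeat=3):
        # r = inf: a path costs as much as its largest edge, not the sum
        via_k = max(d[i][k], d[k][j])
        if via_k < d[i][j]:
            d[i][j] = via_k
    P = nx.Graph()
    P.add_nodes_from(G.nodes(data=True))
    P.add_edges_from(
        (u, v, {weight: w})
        for u, v, w in G.edges(data=weight)
        if math.isclose(w, d[u][v])
    )
    return P

# The 5-node matrix from the question, read as distances
G = nx.Graph()
G.add_weighted_edges_from([
    (0, 1, 1), (0, 2, 4), (0, 3, 2), (0, 4, 2), (1, 2, 2),
    (1, 3, 3), (2, 3, 3), (2, 4, 1), (3, 4, 3),
])
print(sorted(pfnet_minimax(G).edges()))
```

On this input the sketch should keep the five edges 0-1, 0-3, 0-4, 1-2 and 2-4, which is the union of the two possible minimum spanning trees, whereas the eight-edge result shown in the question is what a sum-based shortest-path comparison produces. A tree or a graph with uniform weights should come back unchanged.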

Answered By: mountrix

Below is a possible implementation of Fast-Pathfinder in Python using the networkx library. Note:

  • the implementation corresponds to the paper.
  • it is inspired by the C implementation found on GitHub.
  • only the maximum variant is implemented, where the input matrix is a similarity matrix and not a distance matrix (edges with the highest values are kept).

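import numpy as np
import networkx as nx

def fast_pfnet(G, q, r):
    # Assumes an undirected graph whose nodes are labelled 0..s-1 and whose
    # edge 'weight' values are similarities (0 means "no edge").
    # r == 0 is used as a sentinel for r = infinity.
    s = G.number_of_nodes()
    weights_init = np.zeros((s,s))  # original similarities
    weights = np.zeros((s,s))       # negated similarities, lowered when a better path is found
    hops = np.zeros((s,s))          # number of edges on the best path found so far
    pfnet = np.zeros((s,s))         # adjacency matrix of the pruned network

    for i, j, d in G.edges(data=True):
        weights_init[i,j] = d['weight']
        weights_init[j,i] = d['weight']

    # Negating the similarities turns "keep the largest" into "keep the smallest"
    for i in range(s):
        for j in range(s):
            weights[i,j] = -weights_init[i,j]
            if i==j:
                hops[i,j] = 0
            else:
                hops[i,j] = 1

    def update_weight_maximum(i, j, k, wik, wkj, weights, hops, p):
        # Only paths with at most q edges are allowed
        if p<=q:
            if r==0:
                # r == infinity
                dist = max(wik, wkj)
            else:
                dist = (wik**r + wkj**r) ** (1/r)

            if dist < weights[i,j]:
                weights[i,j] = dist
                weights[j,i] = dist
                hops[i,j] = p
                hops[j,i] = p
                
    def is_equal(a, b):
        return abs(a-b)<0.00001

    for k in range(s):
        for i in range(s):
            if i!=k:
                beg = i+1
                for j in range(beg, s):
                    if j!=k:
                        # NOTE: weights_init is not negated while weights is, so for
                        # non-negative similarities this first call never lowers weights[i,j]
                        update_weight_maximum(i, j, k, weights_init[i,k], weights_init[k,j], weights, hops, 2)
                        update_weight_maximum(i, j, k, weights[i,k], weights[k,j], weights, hops, hops[i,k]+hops[k,j])

    # Keep an edge only if no allowed path improved on its direct weight
    for i in range(s):
        for j in range(s): # Possible optimisation: in case of symmetrical matrices, we do not need to go from 0 to s but from i+1 to s
            if not is_equal(weights_init[i,j], 0):
                if is_equal(weights[i,j], -weights_init[i,j]):
                    pfnet[i,j] = weights_init[i,j]
                else:
                    pfnet[i,j] = 0

    # nx.from_numpy_matrix was removed in NetworkX 3.0; from_numpy_array replaces it
    return nx.from_numpy_array(pfnet)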

Usage:

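import numpy as np
import networkx as nx

m = np.array([[0, 1, 4, 2, 2],
              [1, 0, 2, 3, 0],
              [4, 2, 0, 3, 1],
              [2, 3, 3, 0, 3],
              [2, 0, 1, 3, 0]], dtype=np.int32)

# nx.from_numpy_matrix was removed in NetworkX 3.0; from_numpy_array replaces it
G = nx.from_numpy_array(m)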

# Fast-PFNET parameters set to emulate MST-PFNET
# This variant is OK for other parameters (q, r) but for the ones below
# it is faster to implement the MST-PFNET variant instead.
q = G.number_of_nodes()-1
r = 0

P = fast_pfnet(G, q, r)

list(P.edges(data=True))

This should return:

[(0, 2, {'weight': 4.0}),
 (1, 3, {'weight': 3.0}),
 (2, 3, {'weight': 3.0}),
 (3, 4, {'weight': 3.0})]

This is similar to what is shown on the website (see the example in the section "After the application of Pathfinder").
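
One way to cross-check the expected edge list above without any Pathfinder code, offered as a sketch and not as part of the original implementation: for q = n-1 and r = ∞ the result is the union of the spanning trees, and because the weights are similarities here, the relevant tree is a maximum spanning tree. In this particular example that tree is unique, so NetworkX's `maximum_spanning_tree` should give exactly the same four edges.

```python
import networkx as nx
import numpy as np

m = np.array([[0, 1, 4, 2, 2],
              [1, 0, 2, 3, 0],
              [4, 2, 0, 3, 1],
              [2, 3, 3, 0, 3],
              [2, 0, 1, 3, 0]])
G = nx.from_numpy_array(m)  # zero entries are treated as "no edge"

# Similarities: the strongest links survive, so use a *maximum* spanning tree
T = nx.maximum_spanning_tree(G, weight="weight")
mst_edges = sorted(tuple(sorted(e)) for e in T.edges())
print(mst_edges)
# [(0, 2), (1, 3), (2, 3), (3, 4)]
```

This shortcut only works when the spanning tree is unique; with tied weights, the PFNET result can contain more edges than any single tree.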

Answered By: mountrix