Transform a dataframe for network analysis using pandas

Question

I have a data frame of online game matches with two columns of interest: the ID of a match and the ID of each player who participated in that match. For instance:

match_id	player_id
0	1
0	2
0	3
0	4
0	5
1	6
1	1
1	7
1	8
1	2

Here, player_id is a unique identifier of a player, and match_id is the ID of a match played. Each match_id is repeated a fixed number of times (say, 5), since 5 is the maximum number of players that can participate in a match. Each row therefore records that a given player participated in a given match.

As the table above shows, two players can play together more than once (or not at all). That is why I want to transform this initial data frame into an adjacency matrix in which the intersection of a row and a column gives the number of co-played matches. Another option would be to create a data frame like the following:

player_1	player_2	coplays_number
1	2	2
1	3	1
1	4	1
1	10	0
1	5	1
…	…	…

My task is to prepare the data for further analysis of a co-play network using igraph or networkx. I also want the network to be weighted, so that the weight of an edge is the number of matches co-played by its two nodes (players). An edge means that two users have played together at least once: either they participated in the same match once, or they played together in two or more matches (like players 1 and 2 in the example above).

My question is: how can I use pandas and numpy to transform my initial data frame into network data that igraph or networkx functions would take as an argument? Or do I not need any data manipulation at all, because igraph or networkx functions can work with the initial data frame directly?

Thanks in advance for your answers and recommendations!

Asked By: alyx

Answer 1

I think you don’t need networkx if you use permutations from itertools and pd.crosstab:

from itertools import permutations

import pandas as pd

pairs = (df.groupby('match_id')['player_id']
           .apply(lambda x: list(permutations(x, r=2)))
           .explode())
adj = pd.crosstab(pairs.str[0], pairs.str[1],
                  rownames=['Player 1'], colnames=['Player 2'])

Output:

>>> adj
Player 2  1  2  3  4  5  6  7  8
Player 1                        
1         0  2  1  1  1  1  1  1
2         2  0  1  1  1  1  1  1
3         1  1  0  1  1  0  0  0
4         1  1  1  0  1  0  0  0
5         1  1  1  1  0  0  0  0
6         1  1  0  0  0  0  1  1
7         1  1  0  0  0  1  0  1
8         1  1  0  0  0  1  1  0
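Since the question asks for something networkx can consume, here is a minimal sketch of turning that matrix into a weighted graph. It assumes the sample data from the question and uses nx.from_pandas_adjacency, which stores each nonzero cell as the 'weight' attribute of an edge. Resetting the index after explode is my own addition, only to keep row labels unique.

```python
from itertools import permutations

import networkx as nx
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'match_id':  [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    'player_id': [1, 2, 3, 4, 5, 6, 1, 7, 8, 2],
})

# Ordered pairs of players per match, one pair per row
pairs = (df.groupby('match_id')['player_id']
           .apply(lambda x: list(permutations(x, r=2)))
           .explode()
           .reset_index(drop=True))

# Cross-tabulate first vs. second player to get the symmetric adjacency matrix
adj = pd.crosstab(pairs.str[0], pairs.str[1],
                  rownames=['Player 1'], colnames=['Player 2'])

# Nonzero cells become edges; the cell value is stored as the 'weight' attribute
graph = nx.from_pandas_adjacency(adj)

print(graph[1][2]['weight'])   # number of matches players 1 and 2 shared
print(graph.has_edge(3, 6))    # whether players 3 and 6 ever met
```

The resulting graph is undirected because the matrix is symmetric; pairs that never met simply have no edge.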

If you want a flat list (not an adjacency matrix), use combinations:

from itertools import combinations

pairs = (df.groupby('match_id')['player_id']
           .apply(lambda x: frozenset(combinations(x, r=2)))
           .explode().value_counts())

coplays = pd.DataFrame({'Player 1': pairs.index.str[0],
                        'Player 2': pairs.index.str[1],
                        'coplays_number': pairs.tolist()})

Output:

>>> coplays
    Player 1  Player 2  coplays_number
0          1         2               2
1          2         4               1
2          6         2               1
3          8         2               1
4          7         2               1
5          1         7               1
6          6         7               1
7          1         8               1
8          6         8               1
9          6         1               1
10         3         5               1
11         1         3               1
12         2         5               1
13         4         5               1
14         2         3               1
15         1         4               1
16         1         5               1
17         3         4               1
18         7         8               1
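One caveat: combinations keeps the row order of each match, so (6, 2) and (2, 6) would be counted as different pairs if two players appeared in a different order in another match. A possible variant, sketched here on hypothetical data where players 1 and 2 swap order between matches, sorts the IDs inside each match first:

```python
from itertools import combinations

import pandas as pd

# Hypothetical data: players 1 and 2 appear in a different order in each match
df = pd.DataFrame({
    'match_id':  [0, 0, 0, 1, 1, 1],
    'player_id': [1, 2, 3, 2, 1, 4],
})

# Sorting inside each match makes (1, 2) and (2, 1) the same key
pair_lists = (df.groupby('match_id')['player_id']
                .apply(lambda x: list(combinations(sorted(x), r=2))))

# One row per pair, then count how often each pair occurs
counts = pair_lists.explode().value_counts()

coplays = pd.DataFrame({
    'Player 1': [p[0] for p in counts.index],
    'Player 2': [p[1] for p in counts.index],
    'coplays_number': counts.to_numpy(),
})
print(coplays)
```

With this ordering the smaller ID always lands in the 'Player 1' column, so each pair has exactly one row.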

Answered By: Corralien

Answer 2

You could inner merge your initial df with itself on match_id.
Then group by player_id1 and player_id2 and call size() to get a weighted-edges dataframe.

(df.merge(df, how='inner', on='match_id', suffixes=('1', '2'))
   .groupby(['player_id1', 'player_id2'], as_index=False).size())

You’ll also get rows where player_id1 == player_id2: their size is the total number of matches that player played in. Every other pair appears twice, once in each order.
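If you would rather have each pair only once and no self-pairs, one option is to filter the merged frame before grouping. This is a sketch on the small example data used in this answer; the 'weight' column name is my own choice, and the filter assumes the player IDs can be compared with <.

```python
import pandas as pd

df = pd.DataFrame({
    'match_id':  [0, 0, 0, 1, 1, 2],
    'player_id': ['a', 'b', 'c', 'a', 'b', 'c'],
})

# Every ordered pair of players that shared a match (including self-pairs)
pairs = df.merge(df, how='inner', on='match_id', suffixes=('1', '2'))

# Keep each unordered pair once; this also drops player_id1 == player_id2
pairs = pairs[pairs['player_id1'] < pairs['player_id2']]

# Count the shared matches per pair
edges = (pairs.groupby(['player_id1', 'player_id2'], as_index=False)
              .size()
              .rename(columns={'size': 'weight'}))
print(edges)
```

This halves the edge table, which is all an undirected graph needs.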

Example

import pandas as pd
import networkx as nx

a, b, c = 'a', 'b', 'c'

df = pd.DataFrame(
{
    'match_id':  [0, 0, 0, 1, 1, 2],
    'player_id': [a, b, c, a, b, c],
})
print(df)

   match_id player_id
0         0         a
1         0         b
2         0         c
3         1         a
4         1         b
5         2         c

edges = (df.merge(df, on='match_id', how='inner', suffixes=('1', '2'))
           .groupby(['player_id1', 'player_id2'], as_index=False).size())
print(edges)

  player_id1 player_id2  size
0          a          a     2
1          a          b     2
2          a          c     1
3          b          a     2
4          b          b     2
5          b          c     1
6          c          a     1
7          c          b     1
8          c          c     2

graph = nx.from_pandas_edgelist(edges, source='player_id1', target='player_id2',
                                edge_attr='size', create_using=nx.Graph)

pos = nx.spring_layout(graph)
nx.draw_networkx(graph, pos, with_labels=True)
nx.draw_networkx_edge_labels(graph, pos, edge_labels=nx.get_edge_attributes(graph,'size'))

This draws the undirected graph, with each edge labelled by its size.

You can use create_using=nx.DiGraph to get a directed graph instead.

NetworkX doesn’t plot them here, but the self-loops are weighted:

>>> graph['a']['a']
{'size': 2}

Answered By: politinsa

Answer 3

igraph has a function to ingest a pandas.DataFrame containing the edges and weights and make a graph out of it.

To prepare the edge dataframe, you could merge the df with itself as shown in the other answer, that’s legit. However, if the data is big you might run out of memory. Moreover, the graph is not directed so you can record each player contact only once. For the sake of diversity, here’s how I would do it:

from collections import Counter
import pandas as pd
import igraph as ig

# 1. Make a nonredundant counter for player contacts
edge_dict = Counter()
for _, group in df.groupby('match_id'):
    players = group['player_id']
    for i1, p1 in enumerate(players):
        for p2 in players[:i1]:
            edge_dict[(min(p1, p2), max(p1, p2))] += 1

# 2. Convert the counter to an edge DataFrame
edges = pd.Series(edge_dict, name='weight').to_frame().reset_index()

# 3. Ingest the DataFrame into igraph, including weights
g = ig.Graph.DataFrame(
        edges,
        directed=False,  # co-plays are symmetric, so build an undirected graph
        use_vids=False)
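The edge DataFrame built in step 2 ends up with auto-generated column names for the two players. If you prefer explicit names, closer to the table in the question, the same counting can be sketched with itertools.combinations. The names edge_counter, player_1 and player_2 are my own choices, and the data is the sample from the question.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'match_id':  [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    'player_id': [1, 2, 3, 4, 5, 6, 1, 7, 8, 2],
})

edge_counter = Counter()
for _, group in df.groupby('match_id'):
    # Sorted IDs give each pair a single canonical key (smaller ID first)
    edge_counter.update(combinations(sorted(group['player_id']), 2))

# One row per unordered pair, with explicit column names
edges = pd.DataFrame(
    [(p1, p2, w) for (p1, p2), w in edge_counter.items()],
    columns=['player_1', 'player_2', 'weight'],
)
print(edges.sort_values(['player_1', 'player_2']).to_string(index=False))
```

As far as I know, ig.Graph.DataFrame reads the first two columns as the edge endpoints and treats the remaining columns as edge attributes, so this frame should be usable in step 3 as well.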

Answered By: iosonofabio
