Network graph for plotting value counts in pandas df

Question:

I have a huge dataset that I have sliced by year, so I have a separate dataframe for each year.
The value_counts of the column label differ from year to year.

Suppose for df_2020, it is:

label
patch               622
minor               289
major.minor         181
major.patch.minor   175
major               150
patch.minor         144
major.patch         95
pre                 53
dev                 11
minor.pre           11
major.pre           11
patch.minor.pre     10
patch.pre           7
major.minor.pre     6
pre.dev             6
minor.dev           5
minor.pre.dev       4
patch.minor.dev.    3
major.patch.minor.dev 4

I want to create a network graph that splits the labels on . and counts which labels occur together. For example, from the value_counts above, the edge between pre and dev should have the value 6, and likewise for the other labels.

I am fairly new to networkx, so I am not sure how I can achieve this. Any help would be appreciated.

Asked By: Brie MerryWeather

||

Answers:

You can use str.extract to get the last pair of x.y values from each label, then pass the result to networkx.from_pandas_edgelist:

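import networkx as nx
import numpy as np

# s = df_2020['label'].value_counts()
# assumes the index of s is unnamed, so reset_index yields an 'index' column
# (pandas < 2.0); on newer versions s.rename_axis('index') should restore this

tmp = (
 s.reset_index(name='label')
  # '.' is a regex here and matches any character, so this sorts by
  # label length: shorter labels come first and win in drop_duplicates
  .sort_values(by='index', key=lambda s: s.str.count('.'))
  # capture the last two dot-separated parts into columns 0 and 1
  .pipe(lambda d: d.join(d['index'].str.extract(r'(?:([^.]+)\.)?(\b[^.]+)$')))
  .drop_duplicates(subset=[0, 1])
)

G = nx.from_pandas_edgelist(tmp, source=0, target=1,
                            create_using=nx.DiGraph, edge_attr='label')

# remove "nan" as node
G.remove_node(np.nan)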

Resulting graph:

[image: networkx graph]

tmp DataFrame:

          index  label      0      1
7           pre     53    NaN    pre
8           dev     11    NaN    dev
0         patch    622    NaN  patch
1         minor    289    NaN  minor
4         major    150    NaN  major
14      pre.dev      6    pre    dev
15    minor.dev      5  minor    dev
12    patch.pre      7  patch    pre
10    major.pre     11  major    pre
9     minor.pre     11  minor    pre
6   major.patch     95  major  patch
5   patch.minor    144  patch  minor
2   major.minor    181  major  minor
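
To check what the extraction step captures, here is a minimal stdlib-only sketch. It assumes the intended pattern is the one with the dot and the word boundary escaped, and the helper name last_pair is hypothetical, not part of the answer.

```python
import re

# Assumed pattern: optional "x." prefix, then the final dot-free part
PATTERN = re.compile(r'(?:([^.]+)\.)?(\b[^.]+)$')

def last_pair(label):
    """Return (source, target) for the last two dot-separated parts."""
    m = PATTERN.search(label)
    return m.groups() if m else (None, None)

for label in ['patch', 'pre.dev', 'major.patch.minor']:
    print(label, '->', last_pair(label))
```

A single-part label gives None as the source, which is what shows up as NaN in column 0 of tmp. A three-part label gives only its last two parts, which is why it collapses onto the matching two-part row in drop_duplicates.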

Drawing with networkx's built-in plotting, here without the NaN node:

pos = nx.spring_layout(G)
nx.draw(G, pos, with_labels=True, node_size=3000)
nx.draw_networkx_edge_labels(G, pos, edge_labels=nx.get_edge_attributes(G, 'label'))

Output:

[image]

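With self-loops:

import networkx as nx

tmp = (
 s.reset_index(name='label')
  # '.' is a regex here and matches any character, so this sorts by length
  .sort_values(by='index', key=lambda s: s.str.count('.'))
  # bfill(axis=1) copies column 1 into column 0 for single-part labels,
  # turning them into self-loops instead of NaN sources
  .pipe(lambda d: d.join(d['index'].str.extract(r'(?:([^.]+)\.)?(\b[^.]+)$').bfill(axis=1)))
  .drop_duplicates(subset=[0, 1])
)

G = nx.from_pandas_edgelist(tmp, source=0, target=1,
                            create_using=nx.DiGraph, edge_attr='label')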
Graph:

dot version:

[image]

networkx/matplotlib version:

[image]
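
The approach above keeps only the last pair of parts in each label. If the goal is instead to count every pair of parts that appear in the same label, one possible sketch uses itertools.combinations and a Counter. This is a different interpretation from the answer's. The counts dict is a small hypothetical subset of the value counts, and the function name cooccurrence is made up for illustration. Note that the sums can differ from the single-label figures: here pre and dev get 6 + 4 = 10, because minor.pre.dev also contains both.

```python
from collections import Counter
from itertools import combinations

# Hypothetical subset of the value_counts from the question
counts = {
    'pre.dev': 6,
    'minor.dev': 5,
    'minor.pre.dev': 4,
    'patch': 622,
}

def cooccurrence(counts):
    """Sum label counts over every unordered pair of parts in a label."""
    pairs = Counter()
    for label, n in counts.items():
        # drop empty parts (e.g. from a trailing dot) and sort so that
        # (a, b) and (b, a) end up under the same key
        parts = sorted({p for p in label.split('.') if p})
        for a, b in combinations(parts, 2):
            pairs[(a, b)] += n
    return pairs

edges = cooccurrence(counts)
for (a, b), weight in sorted(edges.items()):
    print(a, b, weight)
```

The resulting pairs could then be loaded into an undirected graph, for example with nx.Graph().add_weighted_edges_from((a, b, w) for (a, b), w in edges.items()).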

Answered By: mozway