Pandas: Create Origin-Destination matrix, keeping destination-of-destination on same row

Question:

I am trying to create an Origin-Destination matrix that takes into account destination-of-destination in one row.

The dataset I have is similar to the following (edited both given_dataset and expected_result based on @mozway comment):

origin_id  link_type  applied_id
1          A          2
2          B          3
2          D          3
3          C          4
5          D          6
1          E          4

And the expected result would be:

origin_id  A  B  C  D  E
1          2  3  4  3  4
2             3  4  3
3                4
5                   6

In other words, since 1 is linked to 2 via A, and 2 is linked to 3 via B and D (and so on), I would like to transpose this path back to the row with origin_id = 1, where each link_type becomes a column header.

Notable mention: there is no scenario where 1 goes to both 2 and 3, and 2 goes to 3 via the same link_type.

I am currently using the pivot_table function (df.pivot_table(values='applied_id', index='origin_id', columns='link_type', aggfunc=max)), and although the result is close to what I am trying to achieve, it is not quite right:

origin_id  A  B  C  D
1          2
2             3     3
3                4
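For reference, the pivot_table attempt can be reproduced with the sketch below. It assumes the index column is named origin_id, as in the sample data, and that the IDs are stored as integers (an assumption, since the question does not state the dtypes). Each row only ends up with its direct links, which is why the indirect destinations are missing.

```python
import pandas as pd

# Sample data from the question; integer IDs are an assumption.
df = pd.DataFrame({
    "origin_id": [1, 2, 2, 3, 5, 1],
    "link_type": ["A", "B", "D", "C", "D", "E"],
    "applied_id": [2, 3, 3, 4, 6, 4],
})

# One row per origin, one column per link type.
# Only direct links are filled in; everything else is NaN.
pivot = df.pivot_table(values="applied_id", index="origin_id",
                       columns="link_type", aggfunc="max")
print(pivot)
```

Row 1 has a value under A (its direct link to 2) but nothing under B, because the link of type B starts at 2, not at 1.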

What would be an efficient way to achieve my expected result given my starting dataframe?

EDIT -> More context:

I have a dataset that maps any transaction (applied_id) in our ERP with any other transaction (origin_id) that the former has been generated from.

For instance, an invoice (applied_id) is generated from a sales order (origin_id) via link_type = 'Invoicing'.

Then, the same invoice (origin_id) might have a credit memo (applied_id) applied on it (link_type = 'Credit Memo'), because the customer wanted his money back.

Same for payments applied to invoices.

My goal is to trace back the invoice, the payment and the credit memo to the original sales order row, as well as the credit memo to the invoice row and payment row, and payment to the invoice row.

Hopefully this clarifies the goal here.

EDIT -> Working answer:

import networkx as nx
import pandas as pd

df = pd.DataFrame({'origin_id': ['1', '2', '2', '3', '5', '1'],
                   'link_type': ['A', 'B', 'D', 'C', 'D', 'E'],
                   'applied_id': ['2', '3', '3', '4', '6', '4']})

G = nx.from_pandas_edgelist(df, source='origin_id', target='applied_id', edge_attr='link_type', create_using=nx.MultiDiGraph)
dict_for_df = {}
# Grabbing only link_types I am interested in
link_type_list = ['A', 'B', 'C', 'D']

for n in df['origin_id'].unique():
    # Subgraph made of n itself plus everything reachable from n
    sub = G.subgraph({n} | nx.descendants(G, n))
    # Maps (source, target, edge_key) -> link_type for every edge in the subgraph
    edge_links = nx.get_edge_attributes(sub, 'link_type')
    value_dict = {}
    for value in link_type_list:
        # As I want the "arriving" origin_id for each link_type, I am here grabbing key[1]
        value_dict[value] = list({key[1] for key, val in edge_links.items() if val == value})
    dict_for_df[n] = value_dict

final = pd.DataFrame.from_dict(dict_for_df, orient='index').reset_index().rename(columns={'index':'origin_id'})

# output:
#    origin_id  A   B   C   D
# 0  1         [2]  [3] [4] [3]
# 1  2         []   [3] [4] [3]
# 2  3         []   []  [4] []
# 3  5         []   []  []  [6]
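If single values per cell are preferred over lists, as in the expected result above, a variant along these lines might work. This is only a sketch: it assumes integer IDs, keeps every link type, and relies on the note that a given link_type never leads to two different destinations within one origin's path (the first destination found is kept). The names rows and result are made up for illustration.

```python
import networkx as nx
import pandas as pd

# Sample data from the question; integer IDs are an assumption.
df = pd.DataFrame({
    "origin_id": [1, 2, 2, 3, 5, 1],
    "link_type": ["A", "B", "D", "C", "D", "E"],
    "applied_id": [2, 3, 3, 4, 6, 4],
})

G = nx.from_pandas_edgelist(df, source="origin_id", target="applied_id",
                            edge_attr="link_type", create_using=nx.MultiDiGraph)

rows = {}
for origin in df["origin_id"].unique().tolist():
    # The origin plus every node reachable from it
    sub = G.subgraph({origin} | nx.descendants(G, origin))
    row = {}
    # Each edge is yielded as (source, target, link_type)
    for _, target, link in sub.edges(data="link_type"):
        # Keep the first destination seen for each link type
        row.setdefault(link, target)
    rows[origin] = row

result = (pd.DataFrame.from_dict(rows, orient="index")
          .rename_axis("origin_id")
          .sort_index()
          .sort_index(axis=1))
print(result)
```

Cells with no matching link come out as NaN, so the numeric columns are displayed as floats unless they are converted afterwards.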

Asked By: E. Faslo

||

Answers:

This is a graph problem that can be solved with networkx.

Your (updated) data looks like:

[image: graph of the updated data]

For each origin, you need to find its descendants and collect all the edges among them.

Here I aggregated the values as lists, since there can be several values per cell; see below for your original data.

import networkx as nx
import pandas as pd

G = nx.from_pandas_edgelist(df, source='origin_id', target='applied_id',
                            edge_attr='link_type', create_using=nx.MultiDiGraph)

out = [list(nx.get_edge_attributes(G.subgraph({n}|nx.descendants(G, n)),
                             'link_type').values())
       for n in df['origin_id'].unique()]
# [['A', 'E', 'B', 'D', 'C'], ['B', 'D', 'C'], ['C'], ['D']]
s = pd.Series(out, index=df['origin_id'].unique())

final = (df
 .assign(link=df['origin_id'].map(s)).explode('link')
 .pivot_table(index='origin_id', columns='link', values='applied_id',
              aggfunc=list) # aggregation function can be changed
)

output:

link            A       B       C       D       E
origin_id                                        
1          [2, 4]  [2, 4]  [2, 4]  [2, 4]  [2, 4]
2             NaN  [3, 3]  [3, 3]  [3, 3]     NaN
3             NaN     NaN     [4]     NaN     NaN
5             NaN     NaN     NaN     [6]     NaN
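To inspect the intermediate step on its own, the set of nodes each origin can reach is available directly from nx.descendants. The sketch below rebuilds the graph by hand from the sample rows, with integer IDs as an assumption; the names edges and reach are illustrative only.

```python
import networkx as nx

# (origin_id, applied_id, link_type) rows from the question's sample data
edges = [(1, 2, "A"), (2, 3, "B"), (2, 3, "D"),
         (3, 4, "C"), (5, 6, "D"), (1, 4, "E")]

# MultiDiGraph keeps both parallel edges between 2 and 3 (B and D)
G = nx.MultiDiGraph()
for source, target, link in edges:
    G.add_edge(source, target, link_type=link)

# Every node reachable from each origin, directly or indirectly
reach = {origin: nx.descendants(G, origin) for origin in (1, 2, 3, 5)}
print(reach)
```

Origin 1 reaches 2, 3 and 4, which is why its subgraph contains the edges of all five link types.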

In your original example, there was no duplicate so you can aggregate with aggfunc='first':


output:

link         A    B    C    D
origin_id                    
1          2.0  2.0  2.0  2.0
2          NaN  3.0  3.0  3.0
3          NaN  NaN  4.0  NaN
Answered By: mozway