Pandas: Create Origin-Destination matrix, keeping destination-of-destination on same row
Question:
I am trying to create an Origin-Destination matrix that takes into account destination-of-destination in one row.
The dataset I have is similar to the following (edited both given_dataset and expected_result based on @mozway comment):
origin_id | link_type | applied_id |
---|---|---|
1 | A | 2 |
2 | B | 3 |
2 | D | 3 |
3 | C | 4 |
5 | D | 6 |
1 | E | 4 |
And the expected result would be:
origin_id | A | B | C | D | E |
---|---|---|---|---|---|
1 | 2 | 3 | 4 | 3 | 4 |
2 |  | 3 | 4 | 3 |  |
3 |  |  | 4 |  |  |
5 |  |  |  | 6 |  |
In other words, since 1 is linked to 2 via A, and 2 is linked to 3 via B and D (and so on), I would like to transpose this path back to the row with origin_id = 1, where link_type becomes my new header.
Notable mention: there is no scenario where 1 goes to both 2 and 3, and 2 goes to 3 via the same link_type.
I am currently using the pivot_table function (df.pivot_table(values='applied_id', index='origin_id', columns='link_type', aggfunc=max)), and although the result is close to what I am trying to achieve, it is not quite right:
origin_id | A | B | C | D |
---|---|---|---|---|
1 | 2 |  |  |  |
2 |  | 3 |  | 3 |
3 |  |  | 4 |  |
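A minimal sketch of that attempt on the updated dataset above may make the gap easier to see. Integer ids are an assumption made here for readability, and `direct` is just an illustrative name. The pivot only fills in links that start at each origin_id, so row 1 gets A and E but nothing for B, C or D.

```python
import pandas as pd

# Updated dataset from the question (integer ids assumed for readability)
df = pd.DataFrame({
    "origin_id": [1, 2, 2, 3, 5, 1],
    "link_type": ["A", "B", "D", "C", "D", "E"],
    "applied_id": [2, 3, 3, 4, 6, 4],
})

# Plain pivot: one row per origin_id, one column per link_type.
# Only direct links are captured; nothing is propagated along the chain.
direct = df.pivot_table(values="applied_id", index="origin_id",
                        columns="link_type", aggfunc="max")
print(direct)
```

The missing cells in row 1 are exactly the ones that require following 1 to 2 to 3 to 4.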
What would be an efficient way to achieve my expected result given my starting dataframe?
EDIT -> More context:
I have a dataset that maps any transaction (applied_id) in our ERP to any other transaction (origin_id) that the former has been generated from.
For instance, an invoice (applied_id) is generated by a sales order (origin_id), via link_type = 'Invoicing'.
Then, the same invoice (origin_id) might have a credit memo (applied_id) applied to it (link_type = 'Credit Memo'), because the customer wanted his money back.
Same for payments applied to invoices.
My goal is to trace back the invoice, the payment and the credit memo to the original sales order row, as well as the credit memo to the invoice row and payment row, and payment to the invoice row.
Hopefully this clarifies the goal here.
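To make the ERP scenario concrete, here is a small sketch with made-up transaction ids. `SO-1`, `INV-1`, `CM-1` and `PAY-1` are hypothetical, and the `'Payment'` link type is an assumption, since only 'Invoicing' and 'Credit Memo' are named above. It treats each link as a directed edge and lists everything downstream of each origin.

```python
import networkx as nx
import pandas as pd

# Hypothetical transactions, for illustration only
links = pd.DataFrame({
    "origin_id": ["SO-1", "INV-1", "INV-1"],
    "link_type": ["Invoicing", "Credit Memo", "Payment"],  # 'Payment' is assumed
    "applied_id": ["INV-1", "CM-1", "PAY-1"],
})

# Each row becomes a directed edge origin_id -> applied_id
G = nx.from_pandas_edgelist(links, source="origin_id", target="applied_id",
                            edge_attr="link_type", create_using=nx.MultiDiGraph)

# Everything reachable from each origin, directly or through other transactions
downstream = {origin: nx.descendants(G, origin)
              for origin in links["origin_id"].unique()}
print(downstream)
```

The sales order reaches the invoice, the credit memo and the payment, while the invoice reaches only the last two. That is the "trace back" relationship described above.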
EDIT -> Working answer:
import networkx as nx
import pandas as pd

df = pd.DataFrame({'origin_id': ['1', '2', '2', '3', '5', '1'],
                   'link_type': ['A', 'B', 'D', 'C', 'D', 'E'],
                   'applied_id': ['2', '3', '3', '4', '6', '4']})
G = nx.from_pandas_edgelist(df, source='origin_id', target='applied_id',
                            edge_attr='link_type', create_using=nx.MultiDiGraph)
dict_for_df = {}
# Grabbing only link_types I am interested in
link_type_list = ['A', 'B', 'C', 'D']
for n in df['origin_id'].unique():
    # Edges of the subgraph made of n and everything reachable from n
    edges = nx.get_edge_attributes(G.subgraph({str(n)} | nx.descendants(G, str(n))), 'link_type')
    value_dict = {}
    for value in link_type_list:
        # As I want the "arriving" origin_id for each link_type, I am here grabbing key[1]
        value_dict[value] = list({key[1] for key, val in edges.items() if val == value})
    dict_for_df[n] = value_dict
final = pd.DataFrame.from_dict(dict_for_df, orient='index').reset_index().rename(columns={'index': 'origin_id'})
# output:
#   origin_id    A    B    C    D
# 0         1  [2]  [3]  [4]  [3]
# 1         2   []  [3]  [4]  [3]
# 2         3   []   []  [4]   []
# 3         5   []   []   []  [6]
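If each cell can hold at most one value, which is how I read the "notable mention" above, the lists could be replaced by scalars so the result matches the expected table, column E included. The following is only a sketch under that assumption. `rows` and `wide` are illustrative names, and integer ids are used instead of strings.

```python
import networkx as nx
import pandas as pd

df = pd.DataFrame({
    "origin_id": [1, 2, 2, 3, 5, 1],
    "link_type": ["A", "B", "D", "C", "D", "E"],
    "applied_id": [2, 3, 3, 4, 6, 4],
})

G = nx.from_pandas_edgelist(df, source="origin_id", target="applied_id",
                            edge_attr="link_type", create_using=nx.MultiDiGraph)

rows = {}
for origin in df["origin_id"].unique().tolist():
    # The origin plus everything reachable from it
    sub = G.subgraph({origin} | nx.descendants(G, origin))
    # edges(data="link_type") yields (source, target, link_type) triples.
    # Assumption: a link_type occurs at most once per origin, so nothing is overwritten.
    rows[origin] = {link: target for _, target, link in sub.edges(data="link_type")}

wide = pd.DataFrame.from_dict(rows, orient="index").sort_index(axis=1)
wide.index.name = "origin_id"
print(wide)
```

If the assumption does not hold for some origin, the dict comprehension silently keeps only the last target seen, so the list-based version above is the safer choice in that case.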
Answers:
This is a graph problem that can be solved with networkx.
Your (updated) data looks like a directed graph with the edges 1 → 2 (A), 1 → 4 (E), 2 → 3 (B and D), 3 → 4 (C) and 5 → 6 (D).
For each origin, you need to find its descendants and collect all the edges among them.
Here I aggregated the values as lists because there can be several candidates per cell; see below for your original data.
import networkx as nx
import pandas as pd

G = nx.from_pandas_edgelist(df, source='origin_id', target='applied_id',
                            edge_attr='link_type', create_using=nx.MultiDiGraph)

# for each origin, collect the link types of every edge reachable from it
out = [list(nx.get_edge_attributes(G.subgraph({n} | nx.descendants(G, n)),
                                   'link_type').values())
       for n in df['origin_id'].unique()]
# [['A', 'E', 'B', 'D', 'C'], ['B', 'D', 'C'], ['C'], ['D']]

s = pd.Series(out, index=df['origin_id'].unique())

final = (df
   .assign(link=df['origin_id'].map(s)).explode('link')
   .pivot_table(index='origin_id', columns='link', values='applied_id',
                aggfunc=list)  # aggregation function can be changed
)
output:
link A B C D E
origin_id
1 [2, 4] [2, 4] [2, 4] [2, 4] [2, 4]
2 NaN [3, 3] [3, 3] [3, 3] NaN
3 NaN NaN [4] NaN NaN
5 NaN NaN NaN [6] NaN
In your original example there were no duplicates, so you can aggregate with aggfunc='first':
output:
link A B C D
origin_id
1 2.0 2.0 2.0 2.0
2 NaN 3.0 3.0 3.0
3 NaN NaN 4.0 NaN