How to append all possible pairings as a column in pandas
Question:
I have a dataframe like below:
Class Value1 Value2
A 2 1
B 3 3
C 4 5
I want to generate all possible pairings, so that my output dataframe looks like this:
Class Value1 Value2 A_Value1 A_Value2 B_Value1 B_Value2 C_Value1 C_Value2
A 2 1 2 1 3 3 4 5
B 3 3 2 1 3 3 4 5
C 4 5 2 1 3 3 4 5
Please assume there are nearly 1000 such classes. Is there an efficient way to do this? Ultimately, I want to find the difference between Value1 and Value2 across each pairing.
EDIT:
A_B_Value is computed with the formula:
A_B_Value = absolute(ClassA_Value1 - ClassB_Value1) + absolute(ClassA_Value2 - ClassB_Value2)
Class Value1 Value2 A_B_Value A_C_Value B_C_Value
A 2 1 3 6 3
B 3 3 3 6 3
C 4 5 3 6 3
Thank you
Answers:
If you need to subtract the columns Value1 and Value2 and append the results as new columns, create a dictionary and add it with DataFrame.assign:
d = dict(zip(df['Class'].add('_diff'),
df['Value1'].sub(df['Value2'])))
print (d)
{'A_diff': 1, 'B_diff': 0, 'C_diff': -1}
df = df.assign(**d)
print (df)
Class Value1 Value2 A_diff B_diff C_diff
0 A 2 1 1 0 -1
1 B 3 3 1 0 -1
2 C 4 5 1 0 -1
EDIT: You can create all combinations with itertools.combinations, compute the differences in a dictionary comprehension, and finally create the new columns with DataFrame.assign:
from itertools import combinations
df1 = df.set_index('Class')
cols = list(combinations(df1.index,2))
d = {f'{a}_{b}_Value' : abs(df1.loc[a, 'Value1'] - df1.loc[b, 'Value1']) +
abs(df1.loc[a, 'Value2'] - df1.loc[b, 'Value2']) for a, b in cols}
df = df.assign(**d)
print (df)
Class Value1 Value2 A_B_Value A_C_Value B_C_Value
0 A 2 1 3 6 3
1 B 3 3 3 6 3
2 C 4 5 3 6 3
EDIT1: Because performance is important, here is a vectorized solution using np.tril_indices:
import numpy as np
#convert Class to index
df1 = df.set_index('Class')
#convert DataFrame to 2d array
v = df1.to_numpy()
#get indices of combinations
i, j = np.tril_indices(len(df1.index), -1)
#select array - first column Value1 is 0
out1_1 = v[i, 0]
out1_2 = v[j, 0]
#select array - second column Value2 is 1
out2_1 = v[i, 1]
out2_2 = v[j, 1]
#new columns names by combinations
cols = [f'{a}_{b}_Value' for a, b in zip(df1.index[j], df1.index[i])]
#new values in array
arr = np.abs(out1_1 - out1_2) + np.abs(out2_1 - out2_2)
#appended new columns
df = df.assign(**dict(zip(cols, arr)))
print (df)
Class Value1 Value2 A_B_Value A_C_Value B_C_Value
0 A 2 1 3 6 3
1 B 3 3 3 6 3
2 C 4 5 3 6 3
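Every row of the result carries the same pairwise values, so with nearly 1000 classes (about 500,000 pairs) it may be cheaper to keep them once in a Series instead of broadcasting them into columns. The following is only a sketch under that assumption; the name `pairs` is illustrative, and it uses np.triu_indices so the pairs come out in row-major order.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Class': ['A', 'B', 'C'],
                   'Value1': [2, 3, 4],
                   'Value2': [1, 3, 5]})

v = df[['Value1', 'Value2']].to_numpy()
labels = df['Class'].to_numpy()

# indices of the strict upper triangle: each unordered pair (i < j) once
i, j = np.triu_indices(len(df), k=1)

# |Value1_i - Value1_j| + |Value2_i - Value2_j| for every pair
dist = np.abs(v[i] - v[j]).sum(axis=1)

# one entry per pair instead of one column per pair
pairs = pd.Series(dist, index=[f'{a}_{b}_Value'
                               for a, b in zip(labels[i], labels[j])])
print(pairs)
```

If the wide layout is still required, `pairs` can be expanded to columns later.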
Performance comparison:
import numpy as np
import pandas as pd
from itertools import combinations
#100 rows DataFrame
N = 100
df = pd.DataFrame({'Class':[f'val{i+1}' for i in np.arange(N)],
'Value1':np.random.randint(100, size=N),
'Value2':np.random.randint(100, size=N)})
# print (df)
%%timeit
df1 = df.set_index('Class')
v = df1.to_numpy()
i, j = np.tril_indices(len(df1.index), -1)
#first column Value1 is 0
v1_1 = v[i, 0]
v1_2 = v[j, 0]
#second column Value2 is 1
v2_1 = v[i, 1]
v2_2 = v[j, 1]
cols = [f'{a}_{b}_Value' for a, b in zip(df1.index[j], df1.index[i])]
arr = np.abs(v1_1-v1_2) + np.abs(v2_1-v2_2)
df.assign(**dict(zip(cols, arr)))
303 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
tmp = df.set_index('Class')
cols = list(combinations(tmp.index,2))
idx1, idx2 = map(list, zip(*cols))
v1_1 = tmp.loc[idx1, 'Value1'].to_numpy()
v1_2 = tmp.loc[idx2, 'Value1'].to_numpy()
v2_1 = tmp.loc[idx1, 'Value2'].to_numpy()
v2_2 = tmp.loc[idx2, 'Value2'].to_numpy()
df[[f'{x1}_{x2}_Value' for x1, x2 in cols]
] = np.repeat((abs(v1_1-v1_2)+abs(v2_1-v2_2))[None], len(df), axis=0)
733 ms ± 2.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
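The %%timeit cell magic only works in IPython or Jupyter. Outside a notebook, a similar measurement can be sketched with the standard timeit module. The wrapper function `tril_pairs`, the fixed seed, and the number of runs are assumptions made for illustration, and timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

N = 100
rng = np.random.default_rng(0)  # fixed seed: an assumption, for repeatability
df = pd.DataFrame({'Class': [f'val{i+1}' for i in range(N)],
                   'Value1': rng.integers(100, size=N),
                   'Value2': rng.integers(100, size=N)})

def tril_pairs(df):
    """Hypothetical wrapper around the np.tril_indices approach."""
    df1 = df.set_index('Class')
    v = df1.to_numpy()
    i, j = np.tril_indices(len(df1.index), -1)
    cols = [f'{a}_{b}_Value' for a, b in zip(df1.index[j], df1.index[i])]
    arr = np.abs(v[i, 0] - v[j, 0]) + np.abs(v[i, 1] - v[j, 1])
    return df.assign(**dict(zip(cols, arr)))

result = tril_pairs(df)

# average wall-clock time over a few calls
runs = 3
seconds = timeit.timeit(lambda: tril_pairs(df), number=runs)
print(f'{seconds / runs * 1000:.1f} ms per call')
```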
You can stack and flatten the MultiIndex, then perform a cross merge (how='cross' requires pandas 1.2 or later):
s = df.set_index('Class').stack()
s.index = s.index.map('_'.join)
out = df.merge(s.to_frame().T, how='cross')
Output:
Class Value1 Value2 A_Value1 A_Value2 B_Value1 B_Value2 C_Value1 C_Value2
0 A 2 1 2 1 3 3 4 5
1 B 3 3 2 1 3 3 4 5
2 C 4 5 2 1 3 3 4 5
Vectorized numerical computation:
import numpy as np
from itertools import combinations
tmp = df.set_index('Class')
cols = list(combinations(tmp.index,2))
idx1, idx2 = map(list, zip(*cols))
v1_1 = tmp.loc[idx1, 'Value1'].to_numpy()
v1_2 = tmp.loc[idx2, 'Value1'].to_numpy()
v2_1 = tmp.loc[idx1, 'Value2'].to_numpy()
v2_2 = tmp.loc[idx2, 'Value2'].to_numpy()
df[[f'{x1}_{x2}_Value' for x1, x2 in cols]
] = np.repeat((abs(v1_1-v1_2)+abs(v2_1-v2_2))[None], len(df), axis=0)
print(df)
Output:
Class Value1 Value2 A_B_Value A_C_Value B_C_Value
0 A 2 1 3 6 3
1 B 3 3 3 6 3
2 C 4 5 3 6 3
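The formula in the question is the Manhattan (city block) distance between two rows, so if SciPy is available (an assumption, it is not used elsewhere in this thread) scipy.spatial.distance.pdist can compute all pairs in one call. This is a sketch, not a benchmarked replacement; the names `wide` and `out` are illustrative.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist

df = pd.DataFrame({'Class': ['A', 'B', 'C'],
                   'Value1': [2, 3, 4],
                   'Value2': [1, 3, 5]})

tmp = df.set_index('Class')

# condensed distances, ordered like itertools.combinations(index, 2);
# pdist returns floats, so cast back to int for integer inputs
dist = pdist(tmp.to_numpy(), metric='cityblock').astype(int)

cols = [f'{a}_{b}_Value' for a, b in combinations(tmp.index, 2)]

# repeat the single row of distances for every row of df and
# attach all new columns in one concat
wide = pd.DataFrame(np.tile(dist, (len(df), 1)), columns=cols, index=df.index)
out = pd.concat([df, wide], axis=1)
print(out)
```

Building the new columns as one DataFrame and concatenating once avoids inserting hundreds of thousands of columns one by one.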