Outer product on Pandas DataFrame rows
Question:
I have two DataFrames with identical column labels. The columns are label, data1, data2, ..., dataN.
I need to take the product of the DataFrames, multiplying data1 * data1, data2 * data2, etc., for every possible combination of rows in DataFrame1 with rows in DataFrame2. The resulting DataFrame should therefore keep the label column of both frames in some way.
Example:
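Frame 1:
label | d1 | d2 | d3 |
---|---|---|---|
a | 1 | 2 | 3 |
b | 4 | 5 | 6 |
Frame 2:
label | d1 | d2 | d3 |
---|---|---|---|
c | 7 | 8 | 9 |
d | 10 | 11 | 12 |
Result:
label_1 | label_2 | d1 | d2 | d3 |
---|---|---|---|---|
a | c | 7 | 16 | 27 |
a | d | 10 | 22 | 36 |
b | c | 28 | 40 | 54 |
b | d | 40 | 55 | 72 |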
I feel like there is a nice way to do this, but all I can come up with is gross loops with lots of memory reallocation.
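To try the answers below, the example frames can be rebuilt roughly as follows. This is only a reconstruction typed in from the tables above; the names df1 and df2 are assumed to match what the answers use.

```python
import pandas as pd

# Frame 1 and Frame 2 from the example, entered by hand.
df1 = pd.DataFrame({"label": ["a", "b"], "d1": [1, 4], "d2": [2, 5], "d3": [3, 6]})
df2 = pd.DataFrame({"label": ["c", "d"], "d1": [7, 10], "d2": [8, 11], "d3": [9, 12]})

# The result should hold one row per (row of df1, row of df2) pair.
expected_rows = len(df1) * len(df2)
print(expected_rows)
```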
Answers:
Let's do a cross merge first, then multiply the dn_x columns by the dn_y columns:
out = df1.merge(df2, how='cross')
out = (out.filter(like='label')
          .join(out.filter(regex='d.*_x')
                   .mul(out.filter(regex='d.*_y').values)
                   .rename(columns=lambda col: col.split('_')[0])))
print(out)
  label_x label_y  d1  d2  d3
0       a       c   7  16  27
1       a       d  10  22  36
2       b       c  28  40  54
3       b       d  40  55  72
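The .values call in that answer matters because pandas aligns two DataFrames on their column labels before multiplying. The sketch below, which rebuilds the example frames under the assumed names df1 and df2, compares the label-aligned product with the positional one.

```python
import pandas as pd

# Example frames, rebuilt by hand (names are assumptions).
df1 = pd.DataFrame({"label": ["a", "b"], "d1": [1, 4], "d2": [2, 5], "d3": [3, 6]})
df2 = pd.DataFrame({"label": ["c", "d"], "d1": [7, 10], "d2": [8, 11], "d3": [9, 12]})

out = df1.merge(df2, how="cross")
left = out.filter(regex="d.*_x")   # d1_x, d2_x, d3_x
right = out.filter(regex="d.*_y")  # d1_y, d2_y, d3_y

# DataFrame * DataFrame aligns on column labels first. No "_x" label
# matches a "_y" label, so the aligned product contains only NaN.
aligned = left.mul(right)
all_nan = bool(aligned.isna().all().all())

# A bare NumPy array carries no labels, so the product is positional.
positional = left.mul(right.to_numpy())

print(all_nan)
print(positional)
```

Using .to_numpy() instead of .values is a stylistic choice here; both hand pandas an unlabeled array.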
A first idea uses DataFrame.reindex with a MultiIndex created by MultiIndex.from_product:
mux = pd.MultiIndex.from_product([df1['label'], df2['label']])
df = (df1.set_index('label').reindex(mux, level=0)
         .mul(df2.set_index('label').reindex(mux, level=1))
         .rename_axis(['label1','label2'])
         .reset_index())
print(df)
  label1 label2  d1  d2  d3
0      a      c   7  16  27
1      a      d  10  22  36
2      b      c  28  40  54
3      b      d  40  55  72
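To see what the two reindex calls contribute, it may help to look at each side before the multiplication. This is a small sketch on the example data, with the labels written out as plain lists; it assumes the labels are unique within each frame.

```python
import pandas as pd

# Example frames, rebuilt by hand (names are assumptions).
df1 = pd.DataFrame({"label": ["a", "b"], "d1": [1, 4], "d2": [2, 5], "d3": [3, 6]})
df2 = pd.DataFrame({"label": ["c", "d"], "d1": [7, 10], "d2": [8, 11], "d3": [9, 12]})

# Every (label from df1, label from df2) pair: (a, c), (a, d), (b, c), (b, d).
mux = pd.MultiIndex.from_product([["a", "b"], ["c", "d"]])

# level=0 repeats each df1 row for every pair that shares its first label.
left = df1.set_index("label").reindex(mux, level=0)
# level=1 repeats each df2 row for every pair that shares its second label.
right = df2.set_index("label").reindex(mux, level=1)

print(left)
print(right)

# Both sides now share the same index and columns, so .mul aligns cleanly.
product = left.mul(right)
print(product)
```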
Or a solution with a cross join:
df = (df1.rename(columns={'label':'label1'})
         .merge(df2.rename(columns={'label':'label2'}),
                how='cross',
                suffixes=('_','')))
To multiply, select the columns ending in _ as cols, multiply the same-named columns without _ by them, and finally drop cols:
cols = df.filter(regex='_$').columns
no_ = cols.str.rstrip('_')
df[no_] *= df[cols].to_numpy()
df = df.drop(cols, axis=1)
print(df)
  label1 label2  d1  d2  d3
0      a      c   7  16  27
1      a      d  10  22  36
2      b      c  28  40  54
3      b      d  40  55  72
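The suffixes=('_','') argument is what makes the later column selection work: overlapping columns coming from the left frame get a trailing underscore, and those from the right frame keep their names. A quick check on the example data (frame names assumed):

```python
import pandas as pd

# Example frames, rebuilt by hand (names are assumptions).
df1 = pd.DataFrame({"label": ["a", "b"], "d1": [1, 4], "d2": [2, 5], "d3": [3, 6]})
df2 = pd.DataFrame({"label": ["c", "d"], "d1": [7, 10], "d2": [8, 11], "d3": [9, 12]})

merged = (df1.rename(columns={"label": "label1"})
             .merge(df2.rename(columns={"label": "label2"}),
                    how="cross",
                    suffixes=("_", "")))

# Left-hand data columns end in "_"; right-hand ones are unchanged.
# label1 and label2 do not overlap, so they receive no suffix at all.
merged_columns = list(merged.columns)
print(merged_columns)
```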
One option is with a cross join, using expand_grid from pyjanitor, before computing the products:
# pip install pyjanitor
import pandas as pd
import janitor as jn
others = {'df1':df1, 'df2':df2}
out = jn.expand_grid(others=others)
numbers = out.select_dtypes('number')
numbers = numbers['df1'] * numbers['df2']
labels = out.select_dtypes('object')
labels.columns = ['label_1', 'label_2']
pd.concat([labels, numbers], axis=1)
  label_1 label_2  d1  d2  d3
0       a       c   7  16  27
1       a       d  10  22  36
2       b       c  28  40  54
3       b       d  40  55  72
OP here. Ynjxsjmh’s answer allowed me to write the code to solve my problem, but I just wanted to post a function which is a little more general in the end and includes a little more explanation for anyone who stumbles here in the future.
Hit me with suggestions if you think of anything.
def exhaustive_df_operation(
    df1: pd.DataFrame,
    df2: pd.DataFrame,
    func: callable,
    label_cols: list,
    suffixes: tuple = ("_x", "_y"),
):
    """
    Given DataFrames with multiple rows, executes the given
    function on all row combinations, i.e. in an exhaustive manner.
    DataFrame column names must be the same. Label cols are the
    columns which label the input/output and should not be used in
    the computation.
    Arguments:
        df1: pd.DataFrame
            First DataFrame to act on.
        df2: pd.DataFrame
            Second DataFrame to act on.
        func: callable
            numpy function to call as the operation on the DataFrames.
        label_cols: list
            The column names corresponding to columns that label the
            rows as distinct. Must be common to the DataFrames, but
            several may be passed.
        suffixes: tuple
            The suffixes to use when calculating the cross merge.
    Returns:
        result: pd.DataFrame
            DataFrame that results from product, will have
            len(df1)*len(df2) rows. label_cols will label the DataFrame
            from which the row was sourced.
    e.g.    df1                 df2
            label  a  b         label  a  b
            i      1  2         k      5  6
            j      3  4         l      7  8
            func = np.add
            label_cols = ['label']
            suffixes = ("_x", "_y")
            result =
            label_x  label_y   a   b
            i        k         6   8
            i        l         8  10
            j        k         8  10
            j        l        10  12
    """
    # Creating a merged DataFrame with an exhaustive "cross"
    # product
    merged = df1.merge(df2, how="cross", suffixes=suffixes)
    # The names of the columns that will identify result rows
    label_col_names = [col + suf for col in label_cols for suf in suffixes]
    # The actual identifying columns
    labels = merged[label_col_names]
    # Non label columns ending in suffixes[0]
    data_col_names = [
        col
        for col in merged.columns
        if (col.endswith(suffixes[0]) and col not in label_col_names)
    ]
    data_1 = merged[data_col_names]
    # Will need for rename later - removes suffix from column
    # names with data
    name_fix_dict = {old: old[: -len(suffixes[0])] for old in data_col_names}
    # Non label columns ending in suffixes[1]
    data_col_names = [
        col
        for col in merged.columns
        if (col.endswith(suffixes[1]) and col not in label_col_names)
    ]
    data_2 = merged[data_col_names]
    # Need .values because data_1 and data_2 have different column
    # labels, which pandas would otherwise try to align.
    result = labels.join(func(data_1, data_2.values))
    # Removing suffixes from data columns
    result = result.rename(columns=name_fix_dict)
    return result
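As one possible suggestion, the same exhaustive operation can also be sketched with NumPy broadcasting instead of a cross merge. This is not taken from any of the answers above; the function name pairwise_rows and the label_1/label_2 output names are made up for illustration, and the sketch assumes a single label column and identical data columns in both frames.

```python
import numpy as np
import pandas as pd


def pairwise_rows(df1, df2, func=np.multiply, label="label"):
    """Apply func to every (row of df1, row of df2) pair, column by column."""
    data_cols = [c for c in df1.columns if c != label]
    a = df1[data_cols].to_numpy()[:, None, :]  # shape (n1, 1, k)
    b = df2[data_cols].to_numpy()[None, :, :]  # shape (1, n2, k)
    # Broadcasting yields shape (n1, n2, k); flatten the pair axes row-major.
    values = func(a, b).reshape(-1, len(data_cols))
    labels = pd.DataFrame({
        # repeat: a, a, b, b ...   tile: c, d, c, d ...
        f"{label}_1": np.repeat(df1[label].to_numpy(), len(df2)),
        f"{label}_2": np.tile(df2[label].to_numpy(), len(df1)),
    })
    return labels.join(pd.DataFrame(values, columns=data_cols))


df1 = pd.DataFrame({"label": ["a", "b"], "d1": [1, 4], "d2": [2, 5], "d3": [3, 6]})
df2 = pd.DataFrame({"label": ["c", "d"], "d1": [7, 10], "d2": [8, 11], "d3": [9, 12]})

result = pairwise_rows(df1, df2)
print(result)

# The docstring example from the function above, with np.add.
small1 = pd.DataFrame({"label": ["i", "j"], "a": [1, 3], "b": [2, 4]})
small2 = pd.DataFrame({"label": ["k", "l"], "a": [5, 7], "b": [6, 8]})
added = pairwise_rows(small1, small2, func=np.add)
print(added)
```

The broadcasting version avoids building the wide intermediate frame that the cross merge creates, at the cost of handling the label columns by hand.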