Outer product on Pandas DataFrame rows

Question:

I have two DataFrames with identical column labels: label, data1, data2, ..., dataN.
I need to take the product of the DataFrames, multiplying data1 * data1, data2 * data2, etc., for every possible combination of rows in DataFrame1 with rows in DataFrame2. The resulting DataFrame should keep the label column of both frames in some way.

Example:

Frame 1:

label d1 d2 d3
a 1 2 3
b 4 5 6

Frame 2:

label d1 d2 d3
c 7 8 9
d 10 11 12

Result:

label_1 label_2 d1 d2 d3
a c 7 16 27
a d 10 22 36
b c 28 40 54
b d 40 55 72

I feel like there is a nice way to do this, but all I can come up with is gross loops with lots of memory reallocation.
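
To make the example reproducible, here is a minimal sketch that builds the two frames and computes the expected result with the kind of explicit loop I would like to avoid. The names df1, df2, data_cols and expected are assumptions chosen for illustration; they are not part of any library.

```python
import itertools

import pandas as pd

# The two example frames from the question
df1 = pd.DataFrame({"label": ["a", "b"], "d1": [1, 4], "d2": [2, 5], "d3": [3, 6]})
df2 = pd.DataFrame({"label": ["c", "d"], "d1": [7, 10], "d2": [8, 11], "d3": [9, 12]})

data_cols = ["d1", "d2", "d3"]

# Naive baseline: visit every (row of df1, row of df2) pair explicitly
rows = []
for (_, r1), (_, r2) in itertools.product(df1.iterrows(), df2.iterrows()):
    row = {"label_1": r1["label"], "label_2": r2["label"]}
    for col in data_cols:
        row[col] = r1[col] * r2[col]
    rows.append(row)

expected = pd.DataFrame(rows)
print(expected)
```

This gives len(df1) * len(df2) rows, but the Python-level loop gets slow as the frames grow.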

Asked By: Immot


Answers:

Let's do a cross merge first, then multiply the dn_x columns by the dn_y columns:

out = df1.merge(df2, how='cross')
out = (out.filter(like='label')
       .join(out.filter(regex='d.*_x')
             .mul(out.filter(regex='d.*_y').values)
             .rename(columns=lambda col: col.split('_')[0])))
print(out)

  label_x label_y  d1  d2  d3
0       a       c   7  16  27
1       a       d  10  22  36
2       b       c  28  40  54
3       b       d  40  55  72
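
A note on the .values call above: DataFrame.mul aligns on column labels, and the _x and _y columns share no labels. The following minimal sketch, using two small made-up frames (left and right are illustrative names), shows the difference between label-aligned and positional multiplication.

```python
import pandas as pd

# Two small frames whose column labels do not overlap (illustrative data)
left = pd.DataFrame({"d1_x": [1, 4], "d2_x": [2, 5]})
right = pd.DataFrame({"d1_y": [7, 10], "d2_y": [8, 11]})

# Label-aligned: no column label is shared, so every cell becomes NaN
aligned = left.mul(right)

# Positional: the NumPy array carries no labels, so columns pair up by position
positional = left.mul(right.values)

print(aligned)
print(positional)
```

The positional result keeps the column names of left, which is why the answer strips the _x suffix afterwards.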
Answered By: Ynjxsjmh

A first idea uses DataFrame.reindex with a MultiIndex created by MultiIndex.from_product:

mux = pd.MultiIndex.from_product([df1['label'], df2['label']])

df = (df1.set_index('label').reindex(mux, level=0)
         .mul(df2.set_index('label').reindex(mux, level=1))
         .rename_axis(['label1','label2'])
         .reset_index())
print(df)
  label1 label2  d1  d2  d3
0      a      c   7  16  27
1      a      d  10  22  36
2      b      c  28  40  54
3      b      d  40  55  72

Or a solution with a cross join:

df = (df1.rename(columns={'label':'label1'})
          .merge(df2.rename(columns={'label':'label2'}), 
                 how='cross',
                 suffixes=('_','')))

To multiply, select the columns whose names end in _ (cols), multiply the same-named columns without the _ by them, and finally drop the cols columns:

cols = df.filter(regex='_$').columns
no_ = cols.str.rstrip('_')
df[no_] *= df[cols].to_numpy()
df = df.drop(cols, axis=1)
print(df)
  label1 label2  d1  d2  d3
0      a      c   7  16  27
1      a      d  10  22  36
2      b      c  28  40  54
3      b      d  40  55  72
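
To see why the suffix trick works, it may help to inspect the column names right after the cross join. This is a minimal sketch using the example frames from the question; the variable name merged is an assumption for illustration, and how='cross' assumes pandas 1.2 or later.

```python
import pandas as pd

df1 = pd.DataFrame({"label": ["a", "b"], "d1": [1, 4], "d2": [2, 5], "d3": [3, 6]})
df2 = pd.DataFrame({"label": ["c", "d"], "d1": [7, 10], "d2": [8, 11], "d3": [9, 12]})

# Overlapping columns from the left frame get '_', those from the right stay bare
merged = (df1.rename(columns={"label": "label1"})
             .merge(df2.rename(columns={"label": "label2"}),
                    how="cross",
                    suffixes=("_", "")))
print(list(merged.columns))
# ['label1', 'd1_', 'd2_', 'd3_', 'label2', 'd1', 'd2', 'd3']

cols = merged.filter(regex="_$").columns   # left-hand data columns: d1_, d2_, d3_
no_ = cols.str.rstrip("_")                 # matching right-hand columns: d1, d2, d3
```

Because the right-hand columns already carry the final names, multiplying in place and dropping cols leaves no renaming to do.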
Answered By: jezrael

One option is with a cross join, using expand_grid from pyjanitor, before computing the products:

# pip install pyjanitor
import pandas as pd
import janitor as jn

others = {'df1':df1, 'df2':df2}
out = jn.expand_grid(others=others)
numbers = out.select_dtypes('number')
numbers = numbers['df1'] * numbers['df2']
labels = out.select_dtypes('object')
labels.columns = ['label_1', 'label_2']
pd.concat([labels, numbers], axis=1)

  label_1 label_2  d1  d2  d3
0       a       c   7  16  27
1       a       d  10  22  36
2       b       c  28  40  54
3       b       d  40  55  72
Answered By: sammywemmy

OP here. Ynjxsjmh’s answer allowed me to write the code to solve my problem, but I just wanted to post a function which is a little more general in the end and includes a little more explanation for anyone who stumbles here in the future.

Hit me with suggestions if you think of anything.

import pandas as pd


def exhaustive_df_operation(
    df1: pd.DataFrame,
    df2: pd.DataFrame,
    func: callable,
    label_cols: list,
    suffixes: tuple = ("_x", "_y"),
):
    """
    Given DataFrames with multiple rows, executes the given
    function on all row combinations, i.e. in an exhaustive manner.
    DataFrame column names must be the same. Label cols are the
    columns which label the input/output and should not be used in
    the computation.

    Arguments:
    df1: pd.DataFrame
        First DataFrame to act on.
    df2: pd.DataFrame
        Second DataFrame to act on.
    func: callable
        numpy function to call as the operation on the DataFrames.
    label_cols: list
        The column names corresponding to columns that label the
        rows as distinct. Must be common to the DataFrames, but
        several may be passed.
    suffixes: tuple
        The suffixes to use when calculating the cross merge.

    Returns:
    result: pd.DataFrame
        DataFrame that results from the operation, will have
        len(df1)*len(df2) rows. The suffixed label_cols identify the
        DataFrame from which each row was sourced.

    eg. df1                df2
        label   a   b      label   a   b
            i   1   2          k   5   6
            j   3   4          l   7   8
    func = np.add
    label_cols = ['label']
    suffixes = ("_x","_y")
    result =
        label_x  label_y   a   b
              i        k   6   8
              i        l   8  10
              j        k   8  10
              j        l  10  12
    """

    # Creating a merged DataFrame with an exhaustive "cross"
    # product: every row of df1 paired with every row of df2
    merged = df1.merge(df2, how="cross", suffixes=suffixes)

    # The names of the columns that will identify result rows
    label_col_names = [col + suf for col in label_cols for suf in suffixes]
    # The actual identifying columns
    labels = merged[label_col_names]

    # Non-label columns ending in suffixes[0]. endswith is used so a
    # suffix appearing in the middle of a name is not matched.
    data_1_names = [
        col
        for col in merged.columns
        if (col.endswith(suffixes[0]) and col not in label_col_names)
    ]
    data_1 = merged[data_1_names]
    # Will need for rename later - removes suffix from column
    # names with data
    name_fix_dict = {old: old[: -len(suffixes[0])] for old in data_1_names}

    # Non-label columns ending in suffixes[1]
    data_2_names = [
        col
        for col in merged.columns
        if (col.endswith(suffixes[1]) and col not in label_col_names)
    ]
    data_2 = merged[data_2_names]

    # Need .values because data_1 and data_2 have different column
    # labels; pandas would otherwise align on them and return NaN.
    result = labels.join(func(data_1, data_2.values))

    # Removing suffixes from data columns
    return result.rename(columns=name_fix_dict)
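
For a quick check of the docstring example, here is a condensed, self-contained sketch of the same steps. The helper name cross_apply is hypothetical and is not the function above. It assumes both frames have identical columns, so every column receives a suffix in the cross merge.

```python
import numpy as np
import pandas as pd


def cross_apply(df1, df2, func, label_cols, suffixes=("_x", "_y")):
    """Hypothetical condensed variant: apply func to every row pairing."""
    merged = df1.merge(df2, how="cross", suffixes=suffixes)
    # Data columns are everything that is not a label column
    data_cols = [c for c in df1.columns if c not in label_cols]
    labels = merged[[c + s for c in label_cols for s in suffixes]]
    left = merged[[c + suffixes[0] for c in data_cols]]
    right = merged[[c + suffixes[1] for c in data_cols]]
    # Pass a bare array on the right so pandas does not align on labels
    result = func(left, right.to_numpy())
    result.columns = data_cols
    return labels.join(result)


df1 = pd.DataFrame({"label": ["i", "j"], "a": [1, 3], "b": [2, 4]})
df2 = pd.DataFrame({"label": ["k", "l"], "a": [5, 7], "b": [6, 8]})

out = cross_apply(df1, df2, np.add, ["label"])
print(out)
```

Passing np.multiply instead of np.add gives the row-wise product asked for in the question.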
Answered By: Immot