Merging many to many Dask

Question:

Say I have the following two datasets (suppose they are Dask data frames):

df A =

1
1
2
2
2
2
3
4
5
5
5
5
5
5

df B =

1
2
2
3
3
3
4
5
5
5

I would like to merge the two so that, for each observation, the resulting DataFrame keeps the rows from whichever data frame has more information (for instance, for observation 1 I would like to preserve the info of df A, for observation 3 I would like to preserve the info of df B, and so on).
In other words, the resulting DataFrame should look like this:

df C=

1
1
2
2
2
2
3
3
3
4
5
5
5
5
5
5

Is there a way to do that in Dask?

Thank you

Asked By: Nutarelli Federico


Answers:


The approach below uses pandas dataframes, for which there are several ways to achieve this.

One approach is to build a dataframe that records, for each sample_id, which dataframe has the most rows, and then apply a custom function to pick the rows accordingly. Let's look at that approach in more detail.

We first create a dataframe that stores the number of rows each dataframe has per sample_id (this assumes pandas is imported as pd and that both df_a and df_b have a sample_id column):

df_count = pd.DataFrame({'sample_id': df_a['sample_id'].unique()})

df_count['df_a'] = df_count['sample_id'].map(df_a.groupby('sample_id').size())

df_count['df_b'] = df_count['sample_id'].map(df_b.groupby('sample_id').size())

Next, we add a column df_max that stores the name of the dataframe with more rows for each sample_id. When both have the same number of rows, idxmax returns the first column, so ties go to df_a:

df_count['df_max'] = df_count[['df_a', 'df_b']].idxmax(axis=1)

[Out]:
   sample_id  df_a  df_b df_max
0          1     2     1   df_a
1          2     4     2   df_a
2          3     1     3   df_b
3          4     1     1   df_a
4          5     6     3   df_a
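
For reference, here is a self-contained sketch that reproduces the table above from the data in the question. The answer never shows how df_a and df_b are built, so the constructors below are an assumption based on the question's values. The sketch also aligns the counts on the union of ids from both frames (a small variation on the code above, which only looks at ids present in df_a), so that an id found in only one frame gets a count of 0 rather than being dropped or becoming NaN.

```python
import pandas as pd

# Sample data copied from the question; the column name 'sample_id'
# follows the answer and is an assumption about the real data.
df_a = pd.DataFrame({'sample_id': [1, 1, 2, 2, 2, 2, 3, 4, 5, 5, 5, 5, 5, 5]})
df_b = pd.DataFrame({'sample_id': [1, 2, 2, 3, 3, 3, 4, 5, 5, 5]})

# Count rows per sample_id in each frame and align both counts on sample_id.
# An id missing from one frame ends up with a count of 0.
df_count = (
    pd.concat(
        {
            'df_a': df_a.groupby('sample_id').size(),
            'df_b': df_b.groupby('sample_id').size(),
        },
        axis=1,
    )
    .fillna(0)
    .astype(int)
    .sort_index()
    .rename_axis('sample_id')
)

# idxmax returns the first column holding the maximum, so ties resolve to df_a.
df_count['df_max'] = df_count[['df_a', 'df_b']].idxmax(axis=1)
df_count = df_count.reset_index()

print(df_count)
```

With the question's data this should give the same counts and df_max values as the table above.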

A single chained expression that creates the same df_count would look like the following

df_count = (
    pd.DataFrame({'sample_id': df_a['sample_id'].unique()})
    .assign(
        df_a=lambda x: x['sample_id'].map(df_a.groupby('sample_id').size()),
        df_b=lambda x: x['sample_id'].map(df_b.groupby('sample_id').size()),
        df_max=lambda x: x[['df_a', 'df_b']].idxmax(axis=1),
    )
)

Now, given df_a, df_b, and df_count, one will want a function to merge the dataframes based on a specific condition:

  • If df_max is df_a, then take the rows from df_a.

  • If df_max is df_b, then take the rows from df_b.

One can create a function merge_df that takes df_a, df_b, and df_count and returns the merged dataframe

def merge_df(df_a, df_b, df_count):

    # Create a list to store the dataframes
    df_list = []

    # Iterate over the rows in df_count
    for index, row in df_count.iterrows():

        # If df_max is df_a, then take the rows from df_a
        if row['df_max'] == 'df_a':
            df_list.append(df_a[df_a['sample_id'] == row['sample_id']])

        # If df_max is df_b, then take the rows from df_b
        elif row['df_max'] == 'df_b':
            df_list.append(df_b[df_b['sample_id'] == row['sample_id']])

        # If df_max is neither df_a nor df_b, then use the first dataframe
        else:
            df_list.append(df_a[df_a['sample_id'] == row['sample_id']])

    # Concatenate the dataframes in df_list and return the result. Also, reset the index.
    return pd.concat(df_list).reset_index(drop=True)

Then one can apply the function

df_merged = merge_df(df_a, df_b, df_count)

[Out]:
    sample_id
0           1
1           1
2           2
3           2
4           2
5           2
6           3
7           3
8           3
9           4
10          5
11          5
12          5
13          5
14          5
15          5
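
As an alternative to looping over rows with iterrows, the same selection can be sketched with boolean filtering: decide once which ids should come from df_a, then keep those rows from df_a and the remaining rows from df_b. This is a variation on the approach above, not part of it; the sample data is an assumption rebuilt from the question, and ties go to df_a as before. Since it relies only on groupby sizes, isin, and concat, the idea may carry over to Dask dataframes, but that has not been verified here.

```python
import pandas as pd

# Sample data rebuilt from the question (an assumption about the real frames).
df_a = pd.DataFrame({'sample_id': [1, 1, 2, 2, 2, 2, 3, 4, 5, 5, 5, 5, 5, 5]})
df_b = pd.DataFrame({'sample_id': [1, 2, 2, 3, 3, 3, 4, 5, 5, 5]})

# Rows per sample_id in each frame.
size_a = df_a.groupby('sample_id').size()
size_b = df_b.groupby('sample_id').size()

# Align both counts on the union of ids; ids absent from a frame count as 0.
all_ids = size_a.index.union(size_b.index)
size_a = size_a.reindex(all_ids, fill_value=0)
size_b = size_b.reindex(all_ids, fill_value=0)

# ids for which df_a has at least as many rows as df_b (ties go to df_a).
ids_from_a = all_ids[(size_a >= size_b).to_numpy()]

# Keep those ids from df_a and every other id from df_b.
df_merged = (
    pd.concat([
        df_a[df_a['sample_id'].isin(ids_from_a)],
        df_b[~df_b['sample_id'].isin(ids_from_a)],
    ])
    .sort_values('sample_id', kind='stable')
    .reset_index(drop=True)
)

print(df_merged)
```

This avoids filtering the full frames once per sample_id, which is what the loop in merge_df does, so it should scale better when there are many distinct ids.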
Answered By: Gonçalo Peres