Merge a dictionary of dataframes, add a source column showing where each row came from, and merge duplicates

Question:

I have the following dictionary of dataframes; the actual one is much bigger.

import pandas as pd

data = {
    'src1': pd.DataFrame({
        'x1': ['SNN', 'YH', 'CDD', 'ONT', 'ONT'],
        'x2': ['AAGH', 'KSD', 'CHH', '002274', '301002']
    }),
    'src2': pd.DataFrame({
        'x1': ['HA', 'TRA', 'GHJ', 'AH', 'ONT'],
        'x2': ['NNG', 'ASGH', 'CTT', 'AGF', '002274']
    }),
    'src3': pd.DataFrame({
        'x1': ['AX', 'TG', 'ONT', 'XR', 'ONT'],
        'x2': ['GG61A', 'X3361', '301002', '07512', '002274']
    })
}

I want to merge them into a single dataframe and create a new column called source that shows which key each row came from, so that I can recreate the original dictionary after manipulating the data.

I also don't want duplicates. For instance, for the row ONT 002274, the source could look like ['src1', 'src2', 'src3'].

I tried,

keys = list(data.keys())
df = pd.concat([data[key].assign(Key=key) for key in keys])

But I get,


    x1      x2   Key
0  SNN    AAGH  src1
1   YH     KSD  src1
2  CDD     CHH  src1
3  ONT  002274  src1
4  ONT  301002  src1
0   HA     NNG  src2
1  TRA    ASGH  src2
2  GHJ     CTT  src2
3   AH     AGF  src2
4  ONT  002274  src2
0   AX   GG61A  src3
1   TG   X3361  src3
2  ONT  301002  src3
3   XR   07512  src3
4  ONT  002274  src3

I want,


    x1      x2                 Key
0  SNN    AAGH                src1
1   YH     KSD                src1
2  CDD     CHH                src1
3  ONT  002274  [src1, src2, src3]
4  ONT  301002        [src1, src3]
0   HA     NNG                src2
1  TRA    ASGH                src2
2  GHJ     CTT                src2
3   AH     AGF                src2
0   AX   GG61A                src3
1   TG   X3361                src3
3   XR   07512                src3

Would that be enough to recreate the original dictionary? I plan to do it by iterating over the column and appending each row to the dataframe that its key belongs to.

Is there a better way to recreate my original dictionary?

Asked By: anarchy


Answers:

You can pass a dict comprehension to concat first, then group by x1 and x2 and aggregate Key with a lambda function that returns a list when a row has duplicates:

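# Keep a single source as-is; collect multiple sources into a list.
f = lambda x: list(x) if len(x) > 1 else x
# Tag each frame with its dictionary key, stack them, then merge duplicate rows.
df = (pd.concat({k: v.assign(Key=k) for k, v in data.items()})
        .groupby(['x1','x2'])['Key'].agg(f).reset_index())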

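Another idea is to let concat put the dictionary keys into the index, then turn that index level into the Key column:

f = lambda x: list(x) if len(x) > 1 else x
# Passing the dict directly makes its keys the outer index level.
df = (pd.concat(data)
        .droplevel(-1)
        .rename_axis('Key')
        .reset_index()
        .groupby(['x1','x2'])['Key'].agg(f).reset_index()
        )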


print(df)
     x1      x2                 Key
0    AH     AGF                src2
1    AX   GG61A                src3
2   CDD     CHH                src1
3   GHJ     CTT                src2
4    HA     NNG                src2
5   ONT  002274  [src1, src2, src3]
6   ONT  301002        [src1, src3]
7   SNN    AAGH                src1
8    TG   X3361                src3
9   TRA    ASGH                src2
10   XR   07512                src3
11   YH     KSD                src1
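
Note that groupby sorts the result by x1 and x2, and that the Key column mixes plain strings with lists. If you would rather keep rows in order of first appearance and have Key hold the same type everywhere, a variation might look like the sketch below. It uses a trimmed-down copy of the question's data; the names `stacked` and `merged` are only illustrative.

```python
import pandas as pd

# Trimmed-down sample in the same shape as the question's dictionary.
data = {
    'src1': pd.DataFrame({'x1': ['SNN', 'ONT', 'ONT'],
                          'x2': ['AAGH', '002274', '301002']}),
    'src2': pd.DataFrame({'x1': ['HA', 'ONT'],
                          'x2': ['NNG', '002274']}),
    'src3': pd.DataFrame({'x1': ['ONT', 'ONT'],
                          'x2': ['301002', '002274']}),
}

# Tag each frame with its key, then stack them into one frame.
stacked = pd.concat([v.assign(Key=k) for k, v in data.items()],
                    ignore_index=True)

# sort=False keeps groups in order of first appearance;
# agg(list) always returns a list, so the Key column has one type.
merged = (stacked.groupby(['x1', 'x2'], sort=False)['Key']
                 .agg(list)
                 .reset_index())

print(merged)
```

Storing every source as a list, even a single one, tends to make later steps simpler because no code has to check whether a cell is a string or a list.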

Your solution, with the same groupby aggregation appended:

keys = list(data.keys())

f = lambda x: list(x) if len(x) > 1 else x
df = (pd.concat([data[key].assign(Key=key) for key in keys])
        .groupby(['x1','x2'])['Key'].agg(f).reset_index())


print(df)
     x1      x2                 Key
0    AH     AGF                src2
1    AX   GG61A                src3
2   CDD     CHH                src1
3   GHJ     CTT                src2
4    HA     NNG                src2
5   ONT  002274  [src1, src2, src3]
6   ONT  301002        [src1, src3]
7   SNN    AAGH                src1
8    TG   X3361                src3
9   TRA    ASGH                src2
10   XR   07512                src3
11   YH     KSD                src1
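
As for recreating the original dictionary: the merged frame should carry enough information, as long as you don't need the original row order or index, and no row was duplicated within a single source. Rather than appending row by row, one option is to explode the Key column and split the result by source. A minimal sketch, assuming a merged frame whose Key column holds lists (the frame below and the names `merged`, `long` and `rebuilt` are illustrative, not part of the code above):

```python
import pandas as pd

# Hypothetical merged frame; Key holds a list of source names per row.
merged = pd.DataFrame({
    'x1': ['SNN', 'ONT', 'ONT', 'HA'],
    'x2': ['AAGH', '002274', '301002', 'NNG'],
    'Key': [['src1'], ['src1', 'src2', 'src3'], ['src1', 'src3'], ['src2']],
})

# explode yields one row per (row, source) pair.
long = merged.explode('Key')

# Split by source, dropping the Key column and resetting each index.
rebuilt = {
    key: grp.drop(columns='Key').reset_index(drop=True)
    for key, grp in long.groupby('Key')
}

print(rebuilt['src3'])
```

explode leaves scalar entries unchanged, so the same approach should also work on a Key column that mixes strings and lists.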
Answered By: jezrael