Merge a dictionary of dataframes and create a new column called source to show where it came from, also merge duplicates
Question:
I have the following dictionary of dataframes; the actual one is much bigger.
import pandas as pd

data = {
    'src1': pd.DataFrame({
        'x1': ['SNN', 'YH', 'CDD', 'ONT', 'ONT'],
        'x2': ['AAGH', 'KSD', 'CHH', '002274', '301002']
    }),
    'src2': pd.DataFrame({
        'x1': ['HA', 'TRA', 'GHJ', 'AH', 'ONT'],
        'x2': ['NNG', 'ASGH', 'CTT', 'AGF', '002274']
    }),
    'src3': pd.DataFrame({
        'x1': ['AX', 'TG', 'ONT', 'XR', 'ONT'],
        'x2': ['GG61A', 'X3361', '301002', '07512', '002274']
    })
}
I want to merge it into a single dataframe and create a new column called source that shows which key each row came from, so that I can recreate the original dictionary after manipulating the data.
I also don't want duplicates, so for a row such as ONT 002274, maybe the source would look like ['src2', 'src3'].
I tried,
keys = list(data.keys())
df = pd.concat([data[key].assign(Key=key) for key in keys])
But I get,
    x1      x2   Key
0  SNN    AAGH  src1
1   YH     KSD  src1
2  CDD     CHH  src1
3  ONT  002274  src1
4  ONT  301002  src1
0   HA     NNG  src2
1  TRA    ASGH  src2
2  GHJ     CTT  src2
3   AH     AGF  src2
4  ONT  002274  src2
0   AX   GG61A  src3
1   TG   X3361  src3
2  ONT  301002  src3
3   XR   07512  src3
4  ONT  002274  src3
I want,
    x1      x2                 Key
0  SNN    AAGH                src1
1   YH     KSD                src1
2  CDD     CHH                src1
3  ONT  002274  [src1, src2, src3]
4  ONT  301002        [src1, src3]
0   HA     NNG                src2
1  TRA    ASGH                src2
2  GHJ     CTT                src2
3   AH     AGF                src2
0   AX   GG61A                src3
1   TG   X3361                src3
3   XR   07512                src3
Would that be enough to recreate the original dictionary? I plan to do it by iterating over the column and appending each row to the dataframe its key belongs to.
Is there a better way to recreate my original dictionary?
Answers:
You can use a dict comprehension with concat first, then aggregate the keys into a list in a lambda function when a row has duplicates:
f = lambda x: list(x) if len(x) > 1 else x
df = (pd.concat({k: v.assign(Key=k) for k, v in data.items()})
        .groupby(['x1','x2'])['Key'].agg(f).reset_index())
Another idea:
f = lambda x: list(x) if len(x) > 1 else x
df = (pd.concat({k: v for k, v in data.items()})
        .droplevel(-1)
        .rename_axis('Key')
        .reset_index()
        .groupby(['x1','x2'])['Key'].agg(f).reset_index()
     )
print(df)
     x1      x2                 Key
0    AH     AGF                src2
1    AX   GG61A                src3
2   CDD     CHH                src1
3   GHJ     CTT                src2
4    HA     NNG                src2
5   ONT  002274  [src1, src2, src3]
6   ONT  301002        [src1, src3]
7   SNN    AAGH                src1
8    TG   X3361                src3
9   TRA    ASGH                src2
10   XR   07512                src3
11   YH     KSD                src1
Your solution, with the same aggregation added:
keys = list(data.keys())
f = lambda x: list(x) if len(x) > 1 else x
df = (pd.concat([data[key].assign(Key=key) for key in keys])
        .groupby(['x1','x2'])['Key'].agg(f).reset_index())
print(df)
     x1      x2                 Key
0    AH     AGF                src2
1    AX   GG61A                src3
2   CDD     CHH                src1
3   GHJ     CTT                src2
4    HA     NNG                src2
5   ONT  002274  [src1, src2, src3]
6   ONT  301002        [src1, src3]
7   SNN    AAGH                src1
8    TG   X3361                src3
9   TRA    ASGH                src2
10   XR   07512                src3
11   YH     KSD                src1
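On the second part of the question (recreating the original dictionary), which the code above does not cover: one possible approach, sketched here under a few assumptions, is to explode the Key column back to one row per source and then split by key. This assumes pandas 0.25 or later (for DataFrame.explode). The names merged, exploded and rebuilt are illustrative only, the sample data is shortened, and the sources are always stored as a list here so that every cell has the same type; explode also leaves scalar cells unchanged, so it should work on the mixed column above as well.

```python
import pandas as pd

# Shortened sample data (assumption: same structure as in the question).
data = {
    'src1': pd.DataFrame({'x1': ['SNN', 'ONT', 'ONT'],
                          'x2': ['AAGH', '002274', '301002']}),
    'src2': pd.DataFrame({'x1': ['HA', 'ONT'],
                          'x2': ['NNG', '002274']}),
    'src3': pd.DataFrame({'x1': ['AX', 'ONT', 'ONT'],
                          'x2': ['GG61A', '301002', '002274']}),
}

# Merge: one row per unique (x1, x2), with all source keys collected in a list.
merged = (pd.concat([df.assign(Key=key) for key, df in data.items()])
            .groupby(['x1', 'x2'])['Key'].agg(list).reset_index())

# Rebuild: explode back to one row per (x1, x2, source), then split by source.
exploded = merged.explode('Key')
rebuilt = {key: grp.drop(columns='Key').reset_index(drop=True)
           for key, grp in exploded.groupby('Key')}

print(rebuilt['src2'])
```

Note that the original row order and index are not preserved: after the groupby, each rebuilt dataframe comes back sorted by x1 and x2. If the order matters, it would have to be stored in an extra column before merging.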