Merging two Pandas DataFrames based on the sequential order of two columns

Question:

I know questions related to this one have been asked multiple times, but I can’t find anything specific to this one. I have gone through pandas.pydata.org/docs/user_guide/merging.html but I still can’t find what I need.

I have two very large output files that I need to merge together, based on the timestamp column of each file. I need the timestamp columns to interleave together in sequential order. Here is an example.

df1
x1   y1   z1
25   12   0.71
16   13   0.63
41   13   0.84
3    14   0.55
25   17   0.49

df2
x2   y2   z2
73   11   0.31
105  12   0.57
64   12   0.86
92   13   0.42
92   15   0.63
81   18   0.74

I need these DataFrames merged based on the sequential order of the y1 and y2 columns.

df3
x3   y3   z3
73   11   0.31
25   12   0.71
105  12   0.57
64   12   0.86
41   13   0.84
92   13   0.42
3    14   0.55
92   15   0.63
25   17   0.49
81   18   0.74

So far I have tried using Pandas concat with sort_values.

df3 = pd.concat([df1,df2]).sort_values(by=['y1','y2'], ascending=True)

Unfortunately I keep getting errors this way. I know there’s a way to do this, but I haven’t been able to find it. Can anyone offer advice?

Asked By: cat_herder

||

Answers:

The column names differ, so pd.concat keeps them as separate columns and fills the gaps with NaN. You could rename the columns in one of the DataFrames so they align.

pd.concat([
   df1,
   df2.rename(columns=dict(zip(df2.columns, df1.columns)))
]).sort_values("y1")
    x1  y1    z1
0   73  11  0.31
0   25  12  0.71
1  105  12  0.57
2   64  12  0.86
1   16  13  0.63
2   41  13  0.84
3   92  13  0.42
3    3  14  0.55
4   92  15  0.63
4   25  17  0.49
5   81  18  0.74

You can pass ignore_index=True to pd.concat if you want a fresh row index instead of the repeated ones shown above.
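For reference, here is a minimal, self-contained sketch of the approach above with ignore_index=True, using the sample data from the question. The names df2_aligned, stacked and result are only illustrative, and the kind="mergesort" argument is an addition of this sketch (not part of the answer) so that rows with equal y1 values keep the order they had after stacking.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "x1": [25, 16, 41, 3, 25],
    "y1": [12, 13, 13, 14, 17],
    "z1": [0.71, 0.63, 0.84, 0.55, 0.49],
})
df2 = pd.DataFrame({
    "x2": [73, 105, 64, 92, 92, 81],
    "y2": [11, 12, 12, 13, 15, 18],
    "z2": [0.31, 0.57, 0.86, 0.42, 0.63, 0.74],
})

# Give df2 the same column names as df1 so the rows stack into three columns.
df2_aligned = df2.rename(columns=dict(zip(df2.columns, df1.columns)))

# ignore_index=True numbers the stacked rows 0..10 instead of repeating 0..4 and 0..5.
stacked = pd.concat([df1, df2_aligned], ignore_index=True)

# mergesort is stable: rows that tie on y1 stay in their stacked order (df1 rows first).
result = stacked.sort_values("y1", kind="mergesort").reset_index(drop=True)
print(result)
```

The final reset_index(drop=True) is optional; without it the index shows where each row sat in the stacked frame before sorting.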

Answered By: jqurious

To make it easier to combine (concatenate) the two DataFrames vertically, first give both of them the same column names:

df1.columns = ['x', 'y', 'z']
df2.columns = ['x', 'y', 'z']

Once the columns are renamed, we can sort on column y with sort_values. Use ignore_index=True to generate a new row index.

pd.concat([df1, df2], ignore_index=True).sort_values('y')

Output:

      x   y     z
5    73  11  0.31
0    25  12  0.71
6   105  12  0.57
7    64  12  0.86
1    16  13  0.63
2    41  13  0.84
8    92  13  0.42
3     3  14  0.55
9    92  15  0.63
4    25  17  0.49
10   81  18  0.74
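Assigning to df1.columns and df2.columns changes the original DataFrames in place. If that is not wanted, one possible variation is sketched below. The helper name interleave_by is hypothetical (it is not a pandas function), and the target names x3, y3, z3 are taken from the desired output in the question. It relies on DataFrame.set_axis returning a relabelled copy, and on the ignore_index argument of sort_values, which needs pandas 1.0 or later.

```python
import pandas as pd

def interleave_by(frames, names, key):
    """Hypothetical helper: relabel each frame's columns to `names`,
    stack the frames, and sort the rows by the `key` column."""
    # set_axis returns a relabelled copy, so the input frames keep their names.
    aligned = [frame.set_axis(names, axis=1) for frame in frames]
    stacked = pd.concat(aligned, ignore_index=True)
    # Stable sort: rows that tie on `key` keep the order of `frames`.
    return stacked.sort_values(key, kind="mergesort", ignore_index=True)

# Sample data copied from the question.
df1 = pd.DataFrame({
    "x1": [25, 16, 41, 3, 25],
    "y1": [12, 13, 13, 14, 17],
    "z1": [0.71, 0.63, 0.84, 0.55, 0.49],
})
df2 = pd.DataFrame({
    "x2": [73, 105, 64, 92, 92, 81],
    "y2": [11, 12, 12, 13, 15, 18],
    "z2": [0.31, 0.57, 0.86, 0.42, 0.63, 0.74],
})

df3 = interleave_by([df1, df2], ["x3", "y3", "z3"], "y3")
print(df3)
```

Note that this returns all 11 input rows, whereas the df3 shown in the question lists only 10; the row 16 13 0.63 from df1 appears to have been left out there.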

Answered By: Ugyen Norbu