Merging two files according to multiple column values

Question:

I’m trying to merge two files.

The first file is a CSV file that looks like this:

Hour   Longitude Latitude 
21:30  54.05     23
22:30  54.05     23
23:30  54.05     23

The second file looks like this:

Hour   Longitude Latitude Meteo
21:30  54.05     23       20 degrees
22:30  106.05    67       -5 degrees
23:30  14.05     102      12 degrees

I want to merge the values of the first file with those of the second file, matching only on rows where the hour, longitude and latitude are all equal.

Which would give me this file:

Hour   Longitude Latitude Meteo
21:30  54.05     23       20 degrees
22:30  54.05     23       
23:30  54.05     23

As you can see, the new Meteo column is added to the first file, and it is filled in only for the row where the hour, longitude and latitude match in both files.

Asked By: LiquidSnake

Answers:

The straightforward approach would be a left merge, which by default joins on the columns the two frames have in common:

df1.merge(df2, how='left')
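
A minimal sketch of the same merge with the key columns spelled out, assuming both frames use exactly the column names from the question. The `on=` list is optional here, since `merge` already joins on the shared columns, but naming the keys makes the intent explicit. The sample frames are built inline purely for illustration.

```python
import pandas as pd

# Sample data copied from the question, built inline so the sketch is self-contained.
df1 = pd.DataFrame({
    'Hour': ['21:30', '22:30', '23:30'],
    'Longitude': [54.05, 54.05, 54.05],
    'Latitude': [23, 23, 23],
})
df2 = pd.DataFrame({
    'Hour': ['21:30', '22:30', '23:30'],
    'Longitude': [54.05, 106.05, 14.05],
    'Latitude': [23, 67, 102],
    'Meteo': ['20 degrees', '-5 degrees', '12 degrees'],
})

# Left join: keep every row of df1 and attach Meteo only where all three keys match.
keys = ['Hour', 'Longitude', 'Latitude']
merged = df1.merge(df2, on=keys, how='left')

# Unmatched rows get NaN; replace it with '' to mirror the output asked for.
merged['Meteo'] = merged['Meteo'].fillna('')
print(merged)
```

Note that the join compares Longitude as floats for exact equality, so values that differ only by rounding would not match.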

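But building a lookup map (dictionary-style) might be quicker; in the timings below it was slightly faster on this tiny dataset. Try both with your own data, since the difference may not hold at a larger scale.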

m = df2.set_index(['Hour','Longitude','Latitude'])['Meteo']
df1['meteo'] = [m.get(tuple(i), '') for i in df1.values]
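
The same idea can be sketched with a plain Python dict instead of an indexed Series. This is a variant for illustration, not the code timed in this answer; the name `lookup` is made up here, and the sample frames are again built inline from the question's data.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Hour': ['21:30', '22:30', '23:30'],
    'Longitude': [54.05, 54.05, 54.05],
    'Latitude': [23, 23, 23],
})
df2 = pd.DataFrame({
    'Hour': ['21:30', '22:30', '23:30'],
    'Longitude': [54.05, 106.05, 14.05],
    'Latitude': [23, 67, 102],
    'Meteo': ['20 degrees', '-5 degrees', '12 degrees'],
})

# Map each (Hour, Longitude, Latitude) tuple in df2 to its Meteo value.
df2_keys = zip(df2['Hour'], df2['Longitude'], df2['Latitude'])
lookup = dict(zip(df2_keys, df2['Meteo']))

# Look up each df1 row's key tuple; fall back to '' when there is no match.
df1_keys = zip(df1['Hour'], df1['Longitude'], df1['Latitude'])
df1['Meteo'] = [lookup.get(key, '') for key in df1_keys]
print(df1)
```

If df2 contains the same key more than once, the dict keeps only the last Meteo value for it, whereas `merge` would produce one output row per match.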

Setup

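import pandas as pd
from io import StringIO  # pd.compat.StringIO is not available in current pandas

data1 = '''
Hour   Longitude Latitude
21:30  54.05     23
22:30  54.05     23
23:30  54.05     23'''

data2 = '''
Hour   Longitude Latitude Meteo
21:30  54.05     23       20degrees
22:30  106.05    67       -5degrees
23:30  14.05     102      12degrees'''

# Columns are separated by runs of whitespace, hence the regex separator.
# (Meteo is written without a space so that it parses as a single column.)
df1 = pd.read_csv(StringIO(data1), sep=r'\s+')
df2 = pd.read_csv(StringIO(data2), sep=r'\s+')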

Timing

%timeit df1.merge(df2, how='left').fillna('')
%timeit m = df2.set_index(['Hour','Longitude','Latitude'])['Meteo']; df1['meteo'] = [m.get(i,'') for i in zip(df1['Hour'],df1['Longitude'],df1['Latitude'])]
%timeit m = df2.set_index(['Hour','Longitude','Latitude'])['Meteo']; df1['meteo'] = [m.get(tuple(i), '') for i in df1.values]

Returns

100 loops, best of 3: 3.69 ms per loop
100 loops, best of 3: 3.03 ms per loop
100 loops, best of 3: 2.99 ms per loop
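
`%timeit` is an IPython magic, so outside a notebook the comparison could be sketched with the standard-library `timeit` module instead. The synthetic data, the size `n_rows`, and the helper names `with_merge` and `with_lookup` are assumptions made up for illustration; no particular timings are implied.

```python
import timeit
import pandas as pd

keys = ['Hour', 'Longitude', 'Latitude']

# Hypothetical synthetic data; n_rows is an arbitrary size chosen for illustration.
n_rows = 1000
df1 = pd.DataFrame({
    'Hour': [f'{i % 24:02d}:30' for i in range(n_rows)],
    'Longitude': [float(i) for i in range(n_rows)],
    'Latitude': list(range(n_rows)),
})
# df2 holds every other row of df1, so half of df1's rows stay unmatched.
df2 = df1.iloc[::2].copy()
df2['Meteo'] = [f'{i}degrees' for i in range(len(df2))]

def with_merge():
    # Left merge on the three keys, blanks for unmatched rows.
    return df1.merge(df2, on=keys, how='left')['Meteo'].fillna('').tolist()

def with_lookup():
    # Indexed Series used as a map, as in the second approach above.
    m = df2.set_index(keys)['Meteo']
    return [m.get(k, '') for k in zip(df1['Hour'], df1['Longitude'], df1['Latitude'])]

# Both approaches should agree before their speed is compared.
assert with_merge() == with_lookup()

runs = 10
for fn in (with_merge, with_lookup):
    seconds = timeit.timeit(fn, number=runs)
    print(f'{fn.__name__}: {seconds / runs * 1000:.2f} ms per call')
```

Raising `n_rows` toward the size of the real files gives a better idea of which approach suits the actual data.
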
Answered By: Anton vBR