Merging two files according to multiple column values
Question:
I’m trying to merge two files.
The first file is a CSV file that looks like this:
Hour Longitude Latitude
21:30 54.05 23
22:30 54.05 23
23:30 54.05 23
The second file looks like this:
Hour Longitude Latitude Meteo
21:30 54.05 23 20 degrees
22:30 106.05 67 -5 degrees
23:30 14.05 102 12 degrees
I want to merge the values of the first file with the values of the second file only where the hour, the longitude and the latitude all match.
Which would give me this file:
Hour Longitude Latitude Meteo
21:30 54.05 23 20 degrees
22:30 54.05 23
23:30 54.05 23
As you can see, the hour, the longitude and the latitude match between the two files only in the first row, so the new column Meteo is filled in for that row of the first file and left empty for the others.
Answers:
The straightforward approach would be:
df1.merge(df2, how='left')
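A more explicit variant of that merge, as a minimal sketch: it names the key columns with on= instead of relying on the shared column names, and blanks out the rows that found no match. The variable names keys and merged are only illustrative, and the sample data is the one from the question with the space before "degrees" removed so it can be parsed as whitespace-separated text.

```python
from io import StringIO

import pandas as pd

data1 = """Hour Longitude Latitude
21:30 54.05 23
22:30 54.05 23
23:30 54.05 23"""
data2 = """Hour Longitude Latitude Meteo
21:30 54.05 23 20degrees
22:30 106.05 67 -5degrees
23:30 14.05 102 12degrees"""

df1 = pd.read_csv(StringIO(data1), sep=r"\s+")
df2 = pd.read_csv(StringIO(data2), sep=r"\s+")

# Join only on rows where all three key columns are equal.
keys = ["Hour", "Longitude", "Latitude"]

# how="left" keeps every row of df1; rows without a match get NaN in Meteo.
merged = df1.merge(df2, on=keys, how="left")

# Replace the NaN placeholders with empty strings, as in the desired output.
merged["Meteo"] = merged["Meteo"].fillna("")
print(merged)
```

Without on=, merge joins on every column name the two frames share, which here is the same three columns, so both forms should give the same result for this data.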
But it might be quicker to create a map (dictionary) first. You could try both with your dataset.
m = df2.set_index(['Hour','Longitude','Latitude'])['Meteo']
df1['meteo'] = [m.get(tuple(i), '') for i in df1.values]
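The same idea can be sketched with a plain Python dict instead of an indexed Series. This is an assumption-laden variant rather than the code above: the name lookup is hypothetical, the new column is called Meteo to match the desired output, and the sample data has the space before "degrees" removed so it parses as whitespace-separated text.

```python
from io import StringIO

import pandas as pd

data1 = """Hour Longitude Latitude
21:30 54.05 23
22:30 54.05 23
23:30 54.05 23"""
data2 = """Hour Longitude Latitude Meteo
21:30 54.05 23 20degrees
22:30 106.05 67 -5degrees
23:30 14.05 102 12degrees"""

df1 = pd.read_csv(StringIO(data1), sep=r"\s+")
df2 = pd.read_csv(StringIO(data2), sep=r"\s+")

# Map each (Hour, Longitude, Latitude) tuple in df2 to its Meteo value.
lookup = dict(
    zip(zip(df2["Hour"], df2["Longitude"], df2["Latitude"]), df2["Meteo"])
)

# Look up every row of df1; rows with no matching key get an empty string.
df1["Meteo"] = [
    lookup.get(key, "")
    for key in zip(df1["Hour"], df1["Longitude"], df1["Latitude"])
]
print(df1)
```

If df2 contains the same key more than once, the dict keeps only the last value for it, whereas a merge would produce one output row per match, so the two approaches are only interchangeable when the keys in df2 are unique.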
Setup
from io import StringIO
import pandas as pd
data1 = '''
Hour Longitude Latitude
21:30 54.05 23
22:30 54.05 23
23:30 54.05 23'''
data2 = '''
Hour Longitude Latitude Meteo
21:30 54.05 23 20degrees
22:30 106.05 67 -5degrees
23:30 14.05 102 12degrees'''
# io.StringIO replaces pd.compat.StringIO, which newer pandas versions no longer provide
# sep=r'\s+' splits the columns on runs of whitespace
df1 = pd.read_csv(StringIO(data1), sep=r'\s+')
df2 = pd.read_csv(StringIO(data2), sep=r'\s+')
TIMEIT
%timeit df1.merge(df2, how='left').fillna('')
%timeit m = df2.set_index(['Hour','Longitude','Latitude'])['Meteo']; df1['meteo'] = [m.get(i,'') for i in zip(df1['Hour'],df1['Longitude'],df1['Latitude'])]
%timeit m = df2.set_index(['Hour','Longitude','Latitude'])['Meteo']; df1['meteo'] = [m.get(tuple(i), '') for i in df1.values]
Returns
100 loops, best of 3: 3.69 ms per loop
100 loops, best of 3: 3.03 ms per loop
100 loops, best of 3: 2.99 ms per loop
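%timeit is an IPython magic, so the lines above only run in IPython or Jupyter. As a rough sketch, the comparison could be reproduced in a plain script with the standard-library timeit module. The function names with_merge and with_lookup are hypothetical, and absolute numbers will differ by machine and pandas version; with only three rows they say little about larger datasets.

```python
import timeit
from io import StringIO

import pandas as pd

data1 = """Hour Longitude Latitude
21:30 54.05 23
22:30 54.05 23
23:30 54.05 23"""
data2 = """Hour Longitude Latitude Meteo
21:30 54.05 23 20degrees
22:30 106.05 67 -5degrees
23:30 14.05 102 12degrees"""

df1 = pd.read_csv(StringIO(data1), sep=r"\s+")
df2 = pd.read_csv(StringIO(data2), sep=r"\s+")


def with_merge():
    # Left join on the shared columns, blanks for unmatched rows.
    return df1.merge(df2, how="left").fillna("")


def with_lookup():
    # Build the indexed Series, then look up each row of df1 by its key tuple.
    # Returns a list instead of assigning a column, so df1 is not modified
    # between timing runs.
    m = df2.set_index(["Hour", "Longitude", "Latitude"])["Meteo"]
    return [
        m.get(key, "")
        for key in zip(df1["Hour"], df1["Longitude"], df1["Latitude"])
    ]


# Total seconds for 100 calls of each approach.
t_merge = timeit.timeit(with_merge, number=100)
t_lookup = timeit.timeit(with_lookup, number=100)
print(f"merge:  {t_merge / 100 * 1000:.2f} ms per call")
print(f"lookup: {t_lookup / 100 * 1000:.2f} ms per call")
```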