Joining CSV or Tables
Question:
I have two CSV files with different columns.
Table 1

title  stage  jan     time
darn   3.001  0.421   5/23/2016 13:14
darn   2.054  0.1213  5/24/2016 14:14
ok     2.829  1.036   5/23/2016 14:14
five   1.115  1.146   5/23/2016 17:14
three  2      5       5/23/2016 21:14
Table 2

title  mar    apr    may    jun     date
darn   0.631  1.321  0.951  1.751   5/23/2016 12:14
ok     1.001  0.247  2.456  0.3216  5/24/2016 18:41
three  0.285  1.283  0.924  956     5/25/2016 17:41
I need to join them on title (the primary key), with the additional condition that the date field in table 2 equals the time field in table 1 minus one hour. So the output should look like this:
title  stage  jan    mar    apr    may    jun    date
darn   3.001  0.421  0.631  1.321  0.951  1.751  5/23/2016 13:14
I was wondering whether this can be done with Pandas, or whether an SQL query is the better way forward. I looked it up and saw that Pandas can merge on a unique key:
import pandas as pd
a = pd.read_csv("1.csv")
b = pd.read_csv("2.csv")
merged = a.merge(b, on='title')
merged.to_csv("output.csv", index=False)
This is my program so far; I am struggling with how to express the condition on the date field. Both SQL and Pandas solutions are welcome.
Answers:
Assuming your time and date columns are recognized as datetimes by Pandas, just add:
merged = merged[merged.date == (merged.time - pd.Timedelta('1 hours'))]
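Putting that together with the question's code, a minimal runnable sketch might look like this. The inline data stands in for the two CSV files ("1.csv" and "2.csv" in the question); the key assumption is that both date columns are parsed as datetimes (via parse_dates), otherwise the comparison is done on strings and matches nothing.

```python
import pandas as pd
from io import StringIO

# Inline samples standing in for the question's "1.csv" and "2.csv"
csv1 = StringIO("""title,stage,jan,time
darn,3.001,0.421,5/23/2016 13:14
ok,2.829,1.036,5/23/2016 14:14
""")
csv2 = StringIO("""title,mar,apr,may,jun,date
darn,0.631,1.321,0.951,1.751,5/23/2016 12:14
ok,1.001,0.247,2.456,0.3216,5/24/2016 18:41
""")

# parse_dates makes the comparison below work on datetimes, not strings
a = pd.read_csv(csv1, parse_dates=["time"])
b = pd.read_csv(csv2, parse_dates=["date"])

# Merge on title first, then keep only rows where date == time - 1 hour
merged = a.merge(b, on="title")
merged = merged[merged.date == merged.time - pd.Timedelta("1 hour")]
print(merged)  # only the "darn" row survives the filter
```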
I would create a dummy column in df1 (to match "time" in df):
In [11]: df1["time"] = df1["date"] + pd.offsets.Hour(1)
Now you can merge cleanly:
In [12]: df.merge(df1)
Out[12]:
title stage jan time mar apr may jun date
0 darn 3.001 0.421 2016-05-23 13:14:00 0.631 1.321 0.951 1.751 2016-05-23 12:14:00
In [13]: df.merge(df1, on=["title", "time"]) # potentially less reckless to specify columns
Out[13]:
title stage jan time mar apr may jun date
0 darn 3.001 0.421 2016-05-23 13:14:00 0.631 1.321 0.951 1.751 2016-05-23 12:14:00
Note: this means you don't have to do the full merge (on title alone), which could be very space-inefficient.
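For completeness, a self-contained sketch of this merge-on-two-keys approach (df and df1 are built inline here rather than read from the question's CSV files, which is an assumption about their layout):

```python
import pandas as pd
from io import StringIO

# df plays the role of table 1, df1 of table 2
df = pd.read_csv(StringIO("""title,stage,jan,time
darn,3.001,0.421,5/23/2016 13:14
three,2,5,5/23/2016 21:14
"""), parse_dates=["time"])

df1 = pd.read_csv(StringIO("""title,mar,apr,may,jun,date
darn,0.631,1.321,0.951,1.751,5/23/2016 12:14
three,0.285,1.283,0.924,956,5/25/2016 17:41
"""), parse_dates=["date"])

# Dummy column: shift table 2's date forward one hour so it lines up with "time"
df1["time"] = df1["date"] + pd.offsets.Hour(1)

# Rows merge only when both title and the shifted time agree
result = df.merge(df1, on=["title", "time"])
print(result)  # only "darn" matches; "three" is two days off
```

Because the join key includes the shifted timestamp, non-matching rows are dropped during the merge itself rather than in a separate filtering pass.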