Concatenating Pandas datetime
Question:
I have solutions for this question, two in fact, but I’m not happy with them. The files I’m trying to read have about 12 million rows, and these solutions take a huge amount of time to process them, mainly because they operate row by row.
So, I read the file like this:
In [1]: df = pd.read_csv('C:/Projects/NPMRDS/FHWA_TASK2-4_NJ_09_2013_TT.CSV')
df.head()
Out [1]: TMC DATE EPOCH Travel_TIME_ALL_VEHICLES Travel_TIME_PASSENGER_VEHICLES Travel_TIME_FREIGHT_TRUCKS
0 103N04152 9252013 211 12 12 NaN
1 103N04152 9262013 0 7 7 NaN
2 103N04152 9032013 177 8 8 NaN
3 103N04152 9042013 176 8 9 7
My problem is with the DATE and EPOCH columns. I want to merge them into a single datetime column.
- DATE is in ‘%m%d%Y’ format (with the leading zero missing)
- EPOCH is the 5-minute epoch of the day:
Time EPOCH
00:00:00 => 0
00:05:00 => 1
...
...
12:00:00 => 144
12:05:00 => 145
...
...
23:50:00 => 286
23:55:00 => 287
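The mapping above is simply the epoch index times 5 minutes since midnight. A minimal sketch of that arithmetic, using only the standard library (the helper name `epoch_to_time` is hypothetical, not part of the original post):

```python
import datetime


def epoch_to_time(epoch: int) -> datetime.time:
    """Convert a 5-minute epoch index (0..287) into a time of day."""
    if not 0 <= epoch <= 287:
        raise ValueError("epoch must be between 0 and 287")
    # epoch * 5 is the number of minutes since midnight
    hours, minutes = divmod(epoch * 5, 60)
    return datetime.time(hours, minutes)


# Print the rows of the mapping table shown above
for e in (0, 1, 144, 145, 286, 287):
    print(epoch_to_time(e), "=>", e)
```

This is only meant to make the encoding explicit; it is still a per-value function and not a fix for the speed problem.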
What I want is something like this:
In [2]: df.head()
Out [2]: TMC DATE_TIME DATE EPOCH Travel_TIME_ALL_VEHICLES Travel_TIME_PASSENGER_VEHICLES Travel_TIME_FREIGHT_TRUCKS
0 103N04152 2013-09-25 17:35:00 9252013 211 12 12 NaN
1 103N04152 2013-09-26 00:00:00 9262013 0 7 7 NaN
2 103N04152 2013-09-03 14:45:00 9032013 177 8 8 NaN
3 103N04152 2013-09-04 14:40:00 9042013 176 8 9 7
Now, I can do this row-by-row, as I mentioned earlier, in any of these three ways:
In [3]: df = pd.read_csv('C:/Projects/NPMRDS/FHWA_TASK2-4_NJ_09_2013_TT.CSV',
converters={'DATE': lambda x: datetime.datetime.strptime(x, '%m%d%Y'),
'EPOCH': lambda x: str(datetime.timedelta(minutes = int(x)*5))},
parse_dates = {'date_time': ['DATE', 'EPOCH']},
keep_date_col = True)
df.head()
Out [3]: date_time TMC DATE EPOCH Travel_TIME_ALL_VEHICLES Travel_TIME_PASSENGER_VEHICLES Travel_TIME_FREIGHT_TRUCKS
0 2013-09-25 17:35:00 103N04152 2013-09-25 17:35:00 12 12 NaN
1 2013-09-26 00:00:00 103N04152 2013-09-26 00:00:00 7 7 NaN
2 2013-09-03 14:45:00 103N04152 2013-09-03 14:45:00 8 8 NaN
3 2013-09-04 14:40:00 103N04152 2013-09-04 14:40:00 8 9 7
4 2013-09-05 09:35:00 103N04152 2013-09-05 09:35:00 10 10 NaN
In this method I lose the original formatting of DATE and EPOCH, but that doesn’t really affect further computations on the dataframe. Instead of passing converters, I could have used date_parser. Or, after reading the data as in In [1], I could have done something like this:
In [4]: df = pd.read_csv('C:/Projects/NPMRDS/FHWA_TASK2-4_NJ_09_2013_TT.CSV')
df['DATE_TIME'] = pd.to_datetime([datetime.datetime.strptime(str(df['DATE'][x]), '%m%d%Y') + datetime.timedelta(minutes = int(df['EPOCH'][x]*5)) for x in range(len(df))])
df.head()
Out [4]: TMC DATE EPOCH Travel_TIME_ALL_VEHICLES Travel_TIME_PASSENGER_VEHICLES Travel_TIME_FREIGHT_TRUCKS DATE_TIME
0 103N04152 9252013 211 12 12 NaN 2013-09-25 17:35:00
1 103N04152 9262013 0 7 7 NaN 2013-09-26 00:00:00
2 103N04152 9032013 177 8 8 NaN 2013-09-03 14:45:00
3 103N04152 9042013 176 8 9 7 2013-09-04 14:40:00
4 103N04152 9052013 115 10 10 NaN 2013-09-05 09:35:00
This gives a more desirable result (don’t worry about the column order), but it is still row-by-row and takes a huge amount of time.
Then there are pandas.to_datetime and pandas.to_timedelta, which run much faster than the methods described above. But I cannot merge the results together without resorting to string functions, which are again mainly row-by-row.
Does anyone know a better way to do this?
Answers:
Try this out. It reduced the runtime for me to about 1s (compared to 15s) on 4M rows of test data.
df = pd.read_csv('temp.csv')
df['DATE'] = pd.to_datetime(df['DATE'], format='%m%d%Y')
df['EPOCH'] = pd.to_timedelta((df['EPOCH'].astype(int) * 5).astype('timedelta64[m]'))
df['DATE_TIME'] = df['DATE'] + df['EPOCH']
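For readers on a newer pandas, here is a hedged variant of the same idea. It is a sketch, not the answer's original code: it assumes `pd.to_timedelta` accepts `unit="min"`, and it zero-pads DATE to 8 digits first, because a 7-digit value such as 1012013 can otherwise be read as either January 1 or October 1. The small DataFrame is a stand-in for the real file, using the sample rows from the question.

```python
import pandas as pd

# Stand-in for the CSV: sample rows copied from the question
df = pd.DataFrame({
    "DATE": [9252013, 9262013, 9032013, 9042013],
    "EPOCH": [211, 0, 177, 176],
})

# Restore the missing leading zero so '%m%d%Y' parses unambiguously
date = pd.to_datetime(df["DATE"].astype(str).str.zfill(8), format="%m%d%Y")

# Each epoch is 5 minutes; convert the whole column in one vectorized call
offset = pd.to_timedelta(df["EPOCH"] * 5, unit="min")

# datetime Series + timedelta Series adds element-wise, no row loop needed
df["DATE_TIME"] = date + offset
print(df)
```

Unlike the snippet above, this keeps the original DATE and EPOCH columns untouched and writes only the new DATE_TIME column.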
In addition to chrisb’s answer, I found a way to do it as well. The trick lies in setting the box parameter to False in pandas.to_datetime(), like so:
df['DATE_TIME'] = pd.to_datetime(df['DATE'], format='%m%d%Y', box=False) + pd.to_timedelta(df['EPOCH']*5*60, unit='s')
Setting that to False returns a numpy.datetime64 array instead of a pandas.DatetimeIndex. More information can be found in the pandas.to_datetime() documentation. Also, pandas.to_timedelta() did not work with unit='m' for me, hence the conversion to seconds.
This answer was posted as an edit to the question Concatenating Pandas datetime by the OP Kartik under CC BY-SA 3.0.
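As far as I know, the box argument was deprecated and later removed from pandas.to_datetime() (around pandas 1.0), so the line above may fail on current installs. A minimal sketch of the same addition without it, assuming a reasonably recent pandas; the two-row DataFrame is made up for illustration and uses already zero-padded DATE strings:

```python
import numpy as np
import pandas as pd

# Illustrative data only; DATE is given as zero-padded strings here
df = pd.DataFrame({"DATE": ["09252013", "09262013"], "EPOCH": [211, 0]})

dates = pd.to_datetime(df["DATE"], format="%m%d%Y")
offsets = pd.to_timedelta(df["EPOCH"] * 5 * 60, unit="s")

# Both operands are Series, so the addition is element-wise and vectorized
df["DATE_TIME"] = dates + offsets

# If a raw NumPy datetime64 array is wanted (what box=False used to give),
# it can be taken from the result afterwards
as_numpy = df["DATE_TIME"].to_numpy()
print(as_numpy.dtype)
```

The design choice is the same as in the answer: build a datetime column and a timedelta column separately and add them, rather than concatenating strings.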