pandas scatter plot versus time of day?
Question:
I want a scatter plot of duration(mins) versus start time (a time of day, irrespective of what date it was on), like this:
I have a CSV file, commute.csv, which looks like this:
date, prediction, start, stop, duration, duration(mins), Day of week
14/08/2015, , 08:02:00, 08:22:00, 00:20:00, 20, Fri
25/08/2015, , 18:16:00, 18:27:00, 00:11:00, 11, Tue
26/08/2015, , 08:26:00, 08:46:00, 00:20:00, 20, Wed
26/08/2015, , 18:28:00, 18:46:00, 00:18:00, 18, Wed
The full CSV file is here.
I can import the CSV file like so:
import pandas as pd
times = pd.read_csv('commute.csv', parse_dates=[[0, 2], [0, 3]], dayfirst=True)
times.head()
Out:
           date_start           date_stop  prediction  duration  duration(mins) Day of week
0 2015-08-14 08:02:00 2015-08-14 08:22:00         NaN  00:20:00              20         Fri
1 2015-08-25 18:16:00 2015-08-25 18:27:00         NaN  00:11:00              11         Tue
2 2015-08-26 08:26:00 2015-08-26 08:46:00         NaN  00:20:00              20         Wed
3 2015-08-26 18:28:00 2015-08-26 18:46:00         NaN  00:18:00              18         Wed
4 2015-08-28 08:37:00 2015-08-28 08:52:00         NaN  00:15:00              15         Fri
I am now struggling to plot duration(mins) versus start time (without the date). Please help!
@jezrael has been a great help. One of the comments on issue 8113 proposes using a variant of df.plot(x=x, y=y, style='.'). I tried it:
times.plot(x='start', y='duration(mins)', style='.')
However, this does not produce the plot I intended: the X axis has been stretched so that every data point is the same distance apart in X:
Is there no way to plot against time?
Answers:
I think there is a problem with using time in a scatter graph (see issue 8113).
But you can use the hour instead:
df['hours'] = df.date_start.dt.hour
print(df)
           date_start           date_stop  prediction  duration
0 2015-08-14 08:02:00 2015-08-14 08:22:00         NaN  00:20:00
1 2015-08-25 18:16:00 2015-08-25 18:27:00         NaN  00:11:00
2 2015-08-26 08:26:00 2015-08-26 08:46:00         NaN  00:20:00
3 2015-08-26 18:28:00 2015-08-26 18:46:00         NaN  00:18:00

   duration(mins) Dayofweek  hours
0              20       Fri      8
1              11       Tue     18
2              20       Wed      8
3              18       Wed     18
df.plot.scatter(x='hours', y='duration(mins)')
Another solution is to count the time of day in minutes:
df['time'] = df.date_start.dt.hour * 60 + df.date_start.dt.minute
print(df)
           date_start           date_stop  prediction  duration
0 2015-08-14 08:02:00 2015-08-14 08:22:00         NaN  00:20:00
1 2015-08-25 18:16:00 2015-08-25 18:27:00         NaN  00:11:00
2 2015-08-26 08:26:00 2015-08-26 08:46:00         NaN  00:20:00
3 2015-08-26 18:28:00 2015-08-26 18:46:00         NaN  00:18:00

   duration(mins) Dayofweek  time
0              20       Fri   482
1              11       Tue  1096
2              20       Wed   506
3              18       Wed  1108
df.plot.scatter(x='time', y='duration(mins)')
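For anyone who wants to try the minutes approach without the CSV file, here is a minimal self-contained sketch. It rebuilds only the four sample rows from the question by hand (the explicit format string and the variable names are my own choices, not part of the original answer) and assumes pandas and matplotlib are installed.

```python
import pandas as pd

# The four sample rows from the question, typed in by hand.
df = pd.DataFrame({
    "date_start": pd.to_datetime(
        ["14/08/2015 08:02:00", "25/08/2015 18:16:00",
         "26/08/2015 08:26:00", "26/08/2015 18:28:00"],
        format="%d/%m/%Y %H:%M:%S"),
    "duration(mins)": [20, 11, 20, 18],
})

# Minutes since midnight, ignoring the date part.
df["time"] = df["date_start"].dt.hour * 60 + df["date_start"].dt.minute
print(df["time"].tolist())  # [482, 1096, 506, 1108]

# Numeric x values, so the points are spaced by actual time of day.
ax = df.plot.scatter(x="time", y="duration(mins)")
ax.set_xlabel("start time (minutes since midnight)")
```

Because the x column is a plain number, the scatter plot places points by their real time of day instead of spacing them evenly.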
In the end, I wrote a function to turn hours, minutes and seconds into a floating point number of hours.
def to_hours(dt):
    """Return floating point number of hours through the day in `datetime` dt."""
    return dt.hour + dt.minute / 60 + dt.second / 3600
# Unit test the to_hours() function
import datetime
dt = datetime.datetime(2010, 4, 23) # Dummy date for testing
assert to_hours(dt) == 0
assert to_hours(dt.replace(hour=1)) == 1
assert to_hours(dt.replace(hour=2, minute=30)) == 2.5
assert to_hours(dt.replace(minute=15)) == 0.25
assert to_hours(dt.replace(second=30)) == 30 / 3600
Then create a column of the floating point number of hours:
# Convert start and stop times to hours
commutes['start_hour'] = commutes['start_date'].map(to_hours)
The full example is in my Jupyter notebook.
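As a possible alternative to mapping a Python function over every row, the same fractional-hour value can be computed for a whole column at once with the .dt accessor. This is only a sketch: the helper name to_hours_series is hypothetical, and the sample timestamps are taken from the question rather than from the notebook.

```python
import pandas as pd

def to_hours_series(s):
    """Return fractional hours through the day for a datetime Series."""
    return s.dt.hour + s.dt.minute / 60 + s.dt.second / 3600

# Start times from the first two sample rows in the question.
starts = pd.Series(pd.to_datetime(["2015-08-14 08:02:00",
                                   "2015-08-25 18:16:00"]))
start_hour = to_hours_series(starts)
print(start_hour)
```

The result is a float column that can be passed to df.plot.scatter as the x values in the same way as the mapped column above.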
To follow up, as this question is close to the top of the search results and it is difficult to fit the necessary answer into a comment:
To set the proper time tick labels along the horizontal axis for start time granularity of minutes, you need to set the frequency of the tick labels then convert to datetime.
This code sample has the datetime for the horizontal axis as the index of the DataFrame, although it could equally be a column rather than an index; note that with a DatetimeIndex you access the minute and hour directly rather than through the dt attribute of a datetime column.
This code interprets the tick values (minutes since midnight, converted to seconds) as UTC timestamps using datetime.utcfromtimestamp(); see https://stackoverflow.com/a/44572082/437948 for a subtly different approach.
You could add handling of second granularity according to a similar theme.
from datetime import datetime

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# One week of random values sampled every 10 minutes
df = pd.DataFrame({'value': np.random.randint(0, 11, 6 * 24 * 7)},
                  index=pd.date_range(start='2018-10-03', freq='600s',
                                      periods=6 * 24 * 7))
df['time'] = 60 * df.index.hour + df.index.minute  # minutes since midnight
f, a = plt.subplots(figsize=(20, 10))
df.plot.scatter(x='time', y='value', ax=a)
plt.xticks(np.arange(0, 25 * 60, 60))  # one tick per hour
a.set_xticklabels([datetime.utcfromtimestamp(int(ts) * 60).strftime('%H:%M')
                   for ts in a.get_xticks()])
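A variation on the same idea, offered as a sketch rather than a definitive recipe: instead of converting the tick positions to datetimes, a matplotlib FuncFormatter can turn minutes since midnight into HH:MM labels directly. The helper minutes_to_hhmm and the two-hour tick spacing are my own assumptions; the random data mirrors the example above.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.ticker import FuncFormatter, MultipleLocator

def minutes_to_hhmm(x, pos=None):
    """Format minutes since midnight as HH:MM (hypothetical helper)."""
    total = int(round(x))
    return f"{total // 60 % 24:02d}:{total % 60:02d}"

# One week of random values every 10 minutes, as in the example above.
rng = np.random.default_rng(0)
index = pd.date_range(start="2018-10-03", freq="600s", periods=6 * 24 * 7)
df = pd.DataFrame({"value": rng.integers(0, 11, len(index))}, index=index)
df["time"] = 60 * df.index.hour + df.index.minute  # minutes since midnight

fig, ax = plt.subplots(figsize=(20, 10))
df.plot.scatter(x="time", y="value", ax=ax)
ax.xaxis.set_major_locator(MultipleLocator(120))  # a tick every 2 hours
ax.xaxis.set_major_formatter(FuncFormatter(minutes_to_hhmm))
ax.set_xlim(0, 24 * 60)
```

With a formatter the labels are recomputed whenever the tick positions change (for example after zooming), whereas labels set with set_xticklabels are fixed strings.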