pandas scatter plot versus time of day?

Question:

I want a scatter plot of duration(mins) versus start time, like this (the start time is a time of day, irrespective of the date it fell on):

enter image description here

I have a CSV file commute.csv which looks like this:

date,   prediction, start,  stop,   duration,   duration(mins), Day of week
14/08/2015, ,   08:02:00,   08:22:00,   00:20:00,   20, Fri
25/08/2015, ,   18:16:00,   18:27:00,   00:11:00,   11, Tue
26/08/2015, ,   08:26:00,   08:46:00,   00:20:00,   20, Wed
26/08/2015, ,   18:28:00,   18:46:00,   00:18:00,   18, Wed

The full CSV file is here.

I can import the CSV file like so:

import pandas as pd
times = pd.read_csv('commute.csv', parse_dates=[[0, 2], [0, 3]], dayfirst=True)
times.head()

Out:

    date_start  date_stop   prediction  duration    duration(mins)  Day of week
0   2015-08-14 08:02:00 2015-08-14 08:22:00 NaN 00:20:00    20  Fri
1   2015-08-25 18:16:00 2015-08-25 18:27:00 NaN 00:11:00    11  Tue
2   2015-08-26 08:26:00 2015-08-26 08:46:00 NaN 00:20:00    20  Wed
3   2015-08-26 18:28:00 2015-08-26 18:46:00 NaN 00:18:00    18  Wed
4   2015-08-28 08:37:00 2015-08-28 08:52:00 NaN 00:15:00    15  Fri

I am now struggling to plot duration(mins) versus start time (without the date). Please help!

@jezrael has been a great help… one of the comments on issue 8113 proposes using a variant of df.plot(x=x, y=y, style='.'). I tried it:

times.plot(x='start', y='duration(mins)', style='.')

However, this does not produce my intended plot: the X axis has been stretched so that the data points are evenly spaced along it, regardless of their actual start times:

enter image description here

Is there no way to plot against time?

Asked By: blokeley

||

Answers:

I think there is a problem with using time values in a scatter graph (see issue 8113).

But you can use hour:

df['hours'] = df.date_start.dt.hour
print(df)
           date_start           date_stop  prediction  duration  
0 2015-08-14 08:02:00 2015-08-14 08:22:00         NaN  00:20:00   
1 2015-08-25 18:16:00 2015-08-25 18:27:00         NaN  00:11:00   
2 2015-08-26 08:26:00 2015-08-26 08:46:00         NaN  00:20:00   
3 2015-08-26 18:28:00 2015-08-26 18:46:00         NaN  00:18:00   

   duration(mins) Dayofweek  hours  
0              20       Fri      8  
1              11       Tue     18  
2              20       Wed      8  
3              18       Wed     18  

df.plot.scatter(x='hours', y='duration(mins)')

graph

Another solution is to count the time in minutes since midnight:

df['time'] = df.date_start.dt.hour * 60 + df.date_start.dt.minute
print(df)
           date_start           date_stop  prediction  duration  
0 2015-08-14 08:02:00 2015-08-14 08:22:00         NaN  00:20:00   
1 2015-08-25 18:16:00 2015-08-25 18:27:00         NaN  00:11:00   
2 2015-08-26 08:26:00 2015-08-26 08:46:00         NaN  00:20:00   
3 2015-08-26 18:28:00 2015-08-26 18:46:00         NaN  00:18:00   

   duration(mins) Dayofweek  time  
0              20       Fri   482  
1              11       Tue  1096  
2              20       Wed   506  
3              18       Wed  1108  

df.plot.scatter(x='time', y='duration(mins)')

graph1
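
For readers who want to reproduce the two derived columns without the full CSV file, here is a minimal sketch. It assumes the four sample rows shown in the question, trimmed to the columns needed, and it builds date_start with pd.to_datetime rather than through read_csv; that is one possible way to get the column, not necessarily how the original DataFrame was built.

```python
import io

import pandas as pd

# Four sample rows from the question, cut down to the columns used here.
csv_text = """date,start,duration(mins)
14/08/2015,08:02:00,20
25/08/2015,18:16:00,11
26/08/2015,08:26:00,20
26/08/2015,18:28:00,18
"""
df = pd.read_csv(io.StringIO(csv_text))

# Combine the date and time strings into a single datetime column
# (assumption: day-first dates, as in the question's CSV).
df['date_start'] = pd.to_datetime(df['date'] + ' ' + df['start'],
                                  format='%d/%m/%Y %H:%M:%S')

# Whole hours of the day, and minutes since midnight.
df['hours'] = df['date_start'].dt.hour
df['time'] = df['date_start'].dt.hour * 60 + df['date_start'].dt.minute

print(df[['date_start', 'hours', 'time', 'duration(mins)']])
```

Either derived column can then be passed as x to df.plot.scatter, as in the calls above.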

Answered By: jezrael

In the end, I wrote a function to turn hours, minutes and seconds into a floating point number of hours.

def to_hours(dt):
    """Return floating point number of hours through the day in `datetime` dt."""
    return dt.hour + dt.minute / 60 + dt.second / 3600


# Unit test the to_hours() function
import datetime
dt = datetime.datetime(2010, 4, 23)  # Dummy date for testing
assert to_hours(dt) == 0
assert to_hours(dt.replace(hour=1)) == 1
assert to_hours(dt.replace(hour=2, minute=30)) == 2.5
assert to_hours(dt.replace(minute=15)) == 0.25
assert to_hours(dt.replace(second=30)) == 30 / 3600

Then create a column of the floating point number of hours:

# Convert start times to hours
commutes['start_hour'] = commutes['start_date'].map(to_hours)

The full example is in my Jupyter notebook.
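
The same conversion can also be applied to a whole column at once through the .dt accessor, instead of mapping a Python function over each element. This is a minimal sketch, assuming the column holds pandas datetimes; the name to_hours_vectorized is hypothetical and the timestamps are the four sample start times from the question.

```python
import pandas as pd


def to_hours_vectorized(series):
    """Return fractional hours through the day for a datetime Series.

    Hypothetical helper: a column-wise counterpart of to_hours().
    """
    return series.dt.hour + series.dt.minute / 60 + series.dt.second / 3600


# Sample start times taken from the question's CSV excerpt.
starts = pd.Series(pd.to_datetime([
    '2015-08-14 08:02:00',
    '2015-08-25 18:16:00',
    '2015-08-26 08:26:00',
    '2015-08-26 18:28:00',
]))

start_hour = to_hours_vectorized(starts)
print(start_hour)
```

For a few hundred commutes the difference in speed is unlikely to matter; the main benefit is that no per-row function call is needed.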

Answered By: blokeley

To follow up, since this question is close to the top of the search results and it is difficult to fit the necessary answer into a comment:

To set the proper time tick labels along the horizontal axis for start time granularity of minutes, you need to set the frequency of the tick labels then convert to datetime.

This code sample has the horizontal axis datetime as the index of the DataFrame, although of course that could equally be a column rather than an index; notice that when it is a DatetimeIndex you access the minute & hour directly rather than through the dt attribute of a datetime column.

This code interprets the tick values as UTC datetimes using datetime.utcfromtimestamp(); see https://stackoverflow.com/a/44572082/437948 for a subtly different approach.

You could add handling of second granularity according to a similar theme.

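import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime

# One week of random values sampled every 10 minutes
df = pd.DataFrame({'value': np.random.randint(0, 11, 6 * 24 * 7)},
                  index=pd.date_range(start='2018-10-03', freq='600s',
                                      periods=6 * 24 * 7))
# Minutes since midnight, read directly from the DatetimeIndex
df['time'] = 60 * df.index.hour + df.index.minute
f, a = plt.subplots(figsize=(20, 10))
df.plot.scatter(x='time', y='value', ax=a)
# One tick per hour, then label each tick as HH:MM
plt.xticks(np.arange(0, 25 * 60, 60))
a.set_xticklabels([datetime.utcfromtimestamp(ts * 60).strftime('%H:%M')
                   for ts in a.get_xticks()])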

graph result from the code sample
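
As an alternative to converting tick values through a timestamp, the labels can be computed arithmetically with a matplotlib FuncFormatter. This is a rough sketch, not the approach from the linked answer: minutes_to_label is a hypothetical helper, the two-hour tick spacing is an arbitrary choice, and the data points are the four sample rows from the question expressed as minutes since midnight.

```python
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter, MultipleLocator


def minutes_to_label(minutes, pos=None):
    """Format minutes since midnight as HH:MM (hypothetical helper).

    The unused pos argument is required by FuncFormatter's calling convention.
    """
    hours, mins = divmod(int(minutes), 60)
    return f'{hours:02d}:{mins:02d}'


fig, ax = plt.subplots()
# Start times (minutes since midnight) and durations from the question's sample rows.
ax.scatter([482, 1096, 506, 1108], [20, 11, 20, 18])
ax.set_xlim(0, 24 * 60)
ax.xaxis.set_major_locator(MultipleLocator(120))  # a tick every two hours
ax.xaxis.set_major_formatter(FuncFormatter(minutes_to_label))
ax.set_xlabel('start time')
ax.set_ylabel('duration(mins)')
# Call plt.show() or fig.savefig(...) to view the result.
```

One visible difference from the timestamp-based labels: the final tick reads 24:00 rather than wrapping around to 00:00.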

Answered By: Mark