pandas: count things

Question:

In the following, male_trips is a big pandas data frame and stations is a small pandas data frame. For each station id I’d like to know how many male trips took place. The following does the job, but takes a long time:

mc = [ sum( male_trips['start_station_id'] == id ) for id in stations['id'] ]

How should I go about this instead?


Update! So there were two main approaches: groupby() followed by size(), and the simpler .value_counts(). I did a quick timeit, and the groupby approach wins by quite a large margin! Here is the code:

from timeit import Timer
setup = "import pandas; male_trips=pandas.load('maletrips')"
a  = "male_trips.start_station_id.value_counts()"
b = "male_trips.groupby('start_station_id').size()"
Timer(a,setup).timeit(100)
Timer(b,setup).timeit(100)

and here is the result:

In [4]: Timer(a,setup).timeit(100) # <- this is value_counts
Out[4]: 9.709594964981079

In [5]: Timer(b,setup).timeit(100) # <- this is groupby / size
Out[5]: 1.5574288368225098

Note that at these speeds either approach is fine for exploring data, and value_counts() is marginally quicker to type and easier to remember!
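
For readers without the original pickle file, the comparison can be reproduced on synthetic data. This is only a minimal sketch: the row count, the number of stations and the random IDs are assumptions made for illustration, and timings depend heavily on the pandas version and the machine, so no results are quoted.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for male_trips (assumption: 200,000 rows, 300 stations).
rng = np.random.default_rng(0)
male_trips = pd.DataFrame(
    {"start_station_id": rng.integers(0, 300, size=200_000)}
)


def by_value_counts():
    return male_trips["start_station_id"].value_counts()


def by_groupby_size():
    return male_trips.groupby("start_station_id").size()


# Both approaches should give the same counts; only the ordering differs,
# so sort by station id before comparing.
counts_vc = by_value_counts().sort_index()
counts_gb = by_groupby_size().sort_index()
same = bool((counts_vc.to_numpy() == counts_gb.to_numpy()).all())

print("same counts: ", same)
print("value_counts:", timeit.timeit(by_value_counts, number=10))
print("groupby/size:", timeit.timeit(by_groupby_size, number=10))
```

Passing callables to timeit avoids reloading the data in a setup string, which keeps the measurement focused on the counting step itself.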

Asked By: Mike Dewar


Answers:

How long would this take:

df = male_trips.groupby('start_station_id').sum()
Answered By: vgoklani

Edit: after seeing in the answer above that isin and value_counts exist (value_counts even has its own entry in pandas.core.algorithms, and isin isn’t simply np.in1d), I updated the three methods below:

male_trips.start_station_id[male_trips.start_station_id.isin(stations.id)].value_counts()

You could also do an inner join on stations.id:

pd.merge(male_trips, stations, left_on='start_station_id', right_on='id')

followed by value_counts. Or:

male_trips.set_index('start_station_id', inplace=True)
stations.set_index('id', inplace=True)
male_trips.loc[male_trips.index.intersection(stations.index)].reset_index().start_station_id.value_counts()

If you have the time I’d be interested how this performs differently with a huge DataFrame.
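
The inner-join variant can be sketched on toy data as follows. The two small frames are made up for illustration (station 9 stands for a trip whose station is not listed, station 3 for a listed station with no trips); only pd.merge and value_counts come from the answer itself.

```python
import pandas as pd

# Toy data (hypothetical values, not from the original question).
male_trips = pd.DataFrame({"start_station_id": [1, 1, 2, 9, 2, 1]})
stations = pd.DataFrame({"id": [1, 2, 3]})

# The inner join keeps only trips whose start station appears in stations.id.
joined = pd.merge(male_trips, stations, left_on="start_station_id", right_on="id")

# Count the surviving trips per station.
counts = joined["start_station_id"].value_counts()
print(counts.to_dict())  # {1: 3, 2: 2}
```

Note that station 3 does not appear in the result at all: an inner join drops stations without trips rather than reporting a count of zero.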

Answered By: Arthur G

I’d do as Vishal did, but use size() instead of sum() to get the number of rows in each ‘start_station_id’ group. So:

df = male_trips.groupby('start_station_id').size()
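
The difference between the two calls shows up on a small example. This is a sketch on made-up data; the duration column is a hypothetical extra column added only to make the contrast visible.

```python
import pandas as pd

# Toy data: "duration" is a hypothetical column used only for illustration.
male_trips = pd.DataFrame(
    {
        "start_station_id": [1, 1, 2, 2, 2],
        "duration": [10, 20, 5, 5, 5],
    }
)

# size() counts the rows in each group, which is the number of trips.
sizes = male_trips.groupby("start_station_id").size()
print(sizes.to_dict())  # {1: 2, 2: 3}

# sum() adds up the values of the remaining columns within each group instead.
sums = male_trips.groupby("start_station_id").sum()
print(sums["duration"].to_dict())  # {1: 30, 2: 15}
```
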
Answered By: Dani Arribas-Bel

My answer below works in Pandas 0.7.3. Not sure about the new releases.

This is what the pandas.Series.value_counts method is for:

count_series = male_trips.start_station_id.value_counts()

It should be straightforward to then inspect count_series based on the values in stations['id']. However, if you insist on only considering those values, you could do the following:

count_series = (
                male_trips[male_trips.start_station_id.isin(stations.id.values)]
                    .start_station_id
                    .value_counts()
               )

and this will only give counts for station IDs actually found in stations.id.
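
If a count is wanted for every row of stations, including stations with no trips, one possible way to inspect count_series against stations['id'] is to reindex it. This goes slightly beyond what the answer states and is only a sketch on made-up data; reindex with fill_value is a standard Series method, but using it here is a suggestion rather than part of the original answer.

```python
import pandas as pd

# Toy data (hypothetical values): station 9 is not in stations,
# and station 3 has no trips.
male_trips = pd.DataFrame({"start_station_id": [1, 1, 2, 9]})
stations = pd.DataFrame({"id": [1, 2, 3]})

count_series = male_trips["start_station_id"].value_counts()

# Align the counts to the station list: unknown stations are dropped,
# and stations without trips get a count of 0.
per_station = count_series.reindex(stations["id"], fill_value=0)
print(per_station.tolist())  # [2, 1, 0]
```

The result is in the same order as stations['id'], so it can be assigned straight back as a new column of stations if desired.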

Answered By: ely