Combining rows in pandas
Question:
I have a DataFrame with an index called city_id of cities in the format [city],[state] (e.g., new york,ny), containing integer counts in the columns. The problem is that I have multiple rows for the same city, and I want to collapse the rows sharing a city_id by adding their column values. I looked at groupby(), but it wasn't immediately obvious how to apply it to this problem.
Edit:
An example: I’d like to change this:
city_id     val1  val2  val3
houston,tx     1     2     0
houston,tx     0     0     1
houston,tx     2     1     1
into this:
city_id     val1  val2  val3
houston,tx     3     3     2
There are roughly 10k-20k rows.
Answers:
Starting from
>>> df
              val1  val2  val3
city_id
houston,tx       1     2     0
houston,tx       0     0     1
houston,tx       2     1     1
somewhere,ew     4     3     7
I might do
>>> df.groupby(df.index).sum()
              val1  val2  val3
city_id
houston,tx       3     3     2
somewhere,ew     4     3     7
or
>>> df.reset_index().groupby("city_id").sum()
              val1  val2  val3
city_id
houston,tx       3     3     2
somewhere,ew     4     3     7
The first approach passes the index values (in this case, the city_id values) to groupby and tells it to use those as the group keys; the second resets the index and then selects the city_id column. See this section of the docs for more examples. Note that there are lots of other methods on DataFrameGroupBy objects, too:
>>> df.groupby(df.index)
<pandas.core.groupby.DataFrameGroupBy object at 0x1045a1790>
>>> df.groupby(df.index).max()
              val1  val2  val3
city_id
houston,tx       2     2     1
somewhere,ew     4     3     7
>>> df.groupby(df.index).mean()
              val1  val2      val3
city_id
houston,tx       1     1  0.666667
somewhere,ew     4     3  7.000000
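For completeness, the session above can be reproduced as a self-contained script; the values come straight from the example frame, and grouping with `level=0` is equivalent to `groupby(df.index)` for a single-level index:

```python
import pandas as pd

# Rebuild the example DataFrame, with city_id as the index and
# duplicate index entries for houston,tx.
df = pd.DataFrame(
    {"val1": [1, 0, 2, 4], "val2": [2, 0, 1, 3], "val3": [0, 1, 1, 7]},
    index=pd.Index(
        ["houston,tx", "houston,tx", "houston,tx", "somewhere,ew"],
        name="city_id",
    ),
)

# Collapse rows sharing an index value by summing their columns.
# groupby(level=0) groups on the index itself.
collapsed = df.groupby(level=0).sum()
print(collapsed)
```

After this, `collapsed` has one row per unique city_id, with the houston,tx counts summed to 3, 3, 2.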
Something along the same lines, though not an exact replica:
import pandas

mydata = [{'subid': 'B14-111', 'age': 75, 'fdg': 1.78},
          {'subid': 'B14-112', 'age': 22, 'fdg': 1.56},
          {'subid': 'B14-112', 'age': 40, 'fdg': 2.00}]
df = pandas.DataFrame(mydata)
gg = df.groupby("subid", sort=True).sum()
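Run with the pandas import in place, that snippet yields one summed row per subid; a self-contained sketch with the same data:

```python
import pandas as pd

mydata = [
    {"subid": "B14-111", "age": 75, "fdg": 1.78},
    {"subid": "B14-112", "age": 22, "fdg": 1.56},
    {"subid": "B14-112", "age": 40, "fdg": 2.00},
]
df = pd.DataFrame(mydata)

# "subid" becomes the group key (and the index of the result);
# the remaining numeric columns are summed within each group.
gg = df.groupby("subid", sort=True).sum()
print(gg)
```

The two B14-112 rows collapse into one, with age 62 and fdg 3.56.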