How to aggregate 3 columns in a DataFrame to get the count and distribution of values in separate columns in Python Pandas?

Question:

I have a Pandas DataFrame like the one below:

data types:

  • ID – int

  • TIME – int

  • TG – int

    ID TIME TG
    111 20210101 0
    111 20210201 0
    111 20210301 1
    222 20210101 0
    222 20210201 1
    333 20210201 1

And I need to aggregate the above DataFrame so as to know:

  1. how many IDs are per each value in TIME
  2. how many "1" from TG are per each value in TIME
  3. how many "0" from TG are per each value in TIME

So I need something like the table below:

TIME     | num_ID | num_1 | num_0
---------|--------|-------|--------
20210101 | 2      | 0     | 2
20210201 | 3      | 2     | 1
20210301 | 1      | 1     | 0

How can I do that in Python Pandas?

Asked By: dingaro


Answers:

Use GroupBy.size to count the rows per TIME, joined with crosstab to count the number of 0 and 1 values:

df1 = (df.groupby('TIME').size().to_frame('num_ID')
         .join(pd.crosstab(df['TIME'], df['TG']).add_prefix('num_'))
         .reset_index())
print (df1)
       TIME  num_ID  num_0  num_1
0  20210101       2      2      0
1  20210201       3      1      2
2  20210301       1      0      1
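
If a single call is preferred, pd.crosstab with margins=True can also produce all three counts at once (a sketch; it assumes each ID appears at most once per TIME, as in the sample data, so the "All" margin equals the number of IDs):

df1 = (pd.crosstab(df['TIME'], df['TG'], margins=True)
         .drop('All')                                            # drop the totals row, keep the totals column
         .rename(columns={0: 'num_0', 1: 'num_1', 'All': 'num_ID'})
         .reset_index()
         [['TIME', 'num_ID', 'num_1', 'num_0']])                 # reorder to match the expected output
print (df1)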

Another idea, if you only need to count the 0 and 1 values, is to use GroupBy.agg:

df1 = (df.assign(num_0=df['TG'].eq(0),
                 num_1=df['TG'].eq(1))
         .groupby('TIME').agg(num_ID=('TG', 'size'),
                              num_1=('num_1', 'sum'),
                              num_0=('num_0', 'sum'))
         .reset_index())
print (df1)
       TIME  num_ID  num_1  num_0
0  20210101       2      0      2
1  20210201       3      2      1
2  20210301       1      1      0
Answered By: jezrael
import pandas as pd

# Create the DataFrame
df = pd.DataFrame({
    'ID': [111, 111, 111, 222, 222, 333],
    'TIME': [20210101, 20210201, 20210301, 20210101, 20210201, 20210201],
    'TG': [0, 0, 1, 0, 1, 1]
})

# Group the DataFrame by the 'TIME' column
grouped_df = df.groupby('TIME')

# Aggregate the grouped DataFrame and create a new DataFrame
# that counts the number of IDs, number of 1s and number of 0s
# for each value in the 'TIME' column
result_df = grouped_df.agg({
    'ID': 'nunique',  # Count the number of unique IDs
    'TG':'sum' 
}).rename(columns={'ID': 'num_ID', 'TG': 'num_1'})

# Calculate the number of 0s in the 'TG' column
# by subtracting the number of 1s from the total number of entries
result_df['num_0'] = grouped_df['TG'].count() - result_df['num_1']

# Reorder the columns in the result DataFrame
result_df = result_df[['num_ID', 'num_1', 'num_0']]

# Print the result DataFrame
print(result_df)
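
Note: TIME stays in the index here; if it should be a regular column like in the expected output, add result_df = result_df.reset_index() before printing.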
Answered By: David
# Aggregations per column: unique IDs, then counts of 1s and 0s in TG
dict1 = {'ID': pd.Series.nunique, 'TG': [lambda x: x.eq(1).sum(), lambda x: x.eq(0).sum()]}
col1 = ['num_id', 'num_1', 'num_0']
# The list of functions on TG produces MultiIndex columns, so flatten them with set_axis
df.groupby('TIME').agg(dict1).set_axis(col1, axis=1).reset_index()

result:

    TIME        num_id  num_1   num_0
0   20210101    2       0       2
1   20210201    3       2       1
2   20210301    1       1       0
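
The same result can also be written with named aggregation (a sketch; the num_id/num_1/num_0 names just mirror the columns above), which gives flat column names directly and avoids the set_axis step:

out = (df.groupby('TIME')
         .agg(num_id=('ID', 'nunique'),
              num_1=('TG', lambda x: x.eq(1).sum()),
              num_0=('TG', lambda x: x.eq(0).sum()))
         .reset_index())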
Answered By: Panda Kim