Group and count total of blanks and total of rows in pandas dataframe

Question:

I have the following dataframe

  Country   Name  Code Signed  Index
0      CZ  Paulo     3      x   1
1      AE  Paulo   Yes   None   1
2      AE  Paulo   Yes   None   2
3      AE  Paulo     1    Yes   5
4      CZ  Paulo  None   None   6
5      DK  Paulo   Yes   None   9
6      DK  Paulo  None   None   20
7      PT  Paulo     2    Yes   20
8      PT  Paulo     1    Yes   22

I need three new columns after grouping by country

  1. count the missing values in Code and Signed column
  2. total of rows that have both values filled
  3. total of rows that have the same Country value
  4. point the rows where we have any of those values blank per Country (list or non list format) using the column "Index" as reference

If any of the Countries have all their Code and Signed rows filled, remove it from the dataframe.

In this case, it would return this dataframe:

  Country  Total_Blanks_on_Code  Total_Blanks_on_Signed Total_of_rows_with_both_values_filled  Total_of_rows_of_the_Country  Rows with any blank
0      CZ                     1                       1                                  None                             2                    6
1      AE                     2                       0                                     1                             3                   [1,2]
2      DK                     2                       1                                  None                             2                   [9,20]

Thank you for your help!

Asked By: Paulo Cortez

||

Answers:

Based on the definitions/conditions you gave, the country AE should have a total blanks on Code equal to 0 and not 2.

Anyway, you can use the code below to get the format of output you’re looking for :

out = (df.assign(Total1 = df['Code'].isna(),
                 Total2 = df['Signed?'].isna(),
                 Total3 = ~df['Code'].isna() & ~df['Signed?'].isna())
        .groupby('Country', as_index=False)
        .agg(NumberOfCountries = ('Country','size'),
             Total1 = ('Total1','sum'),
             Total2 = ('Total2','sum'),
             Total3 = ('Total3','sum'))
        ).rename(columns={'Total1': 'Total Blanks on Code', 'Total2': 'Total Blanks on Signed?', 
                    'Total3': 'Total of rows with both values filled', 'NumberOfCountries': 'Total of rows of the Country'})

>>> print(out) :

enter image description here

Answered By: L'Artiste

Here’s a way to do what your question asks:

df['both_filled'] = (df.Code.notna() & df.Signed.notna()).map({True:True, False:None})
df['Rows_with_any_blank'] = df.Index[df['both_filled'].isna()]
gb = df.groupby('Country', sort=False)
df2 = ( gb.count().assign(
    Rows_with_any_blank=gb['Rows_with_any_blank']
    .agg(lambda x: list(x.dropna().astype(int)))) )
df2 = ( df2.assign(
    Total_Blanks_on_Code=df2.Name - df2.Code,
    Total_Blanks_on_Signed=df2.Name - df2.Signed)
    [df2.both_filled < df2.Name]
    [['Total_Blanks_on_Code','Total_Blanks_on_Signed',
        'both_filled','Name','Rows_with_any_blank']]
    .reset_index()
    .rename(columns={
        'Name':'Total_of_rows_of_the_Country', 
        'both_filled':'Total_of_rows_with_both_values_filled'
    }) )

Input:

  Country   Name  Code Signed  Index
0      CZ  Paulo     3      x      1
1      AE  Paulo   Yes   None      1
2      AE  Paulo   Yes   None      2
3      AE  Paulo     1    Yes      5
4      CZ  Paulo  None   None      6
5      DK  Paulo   Yes   None      9
6      DK  Paulo  None   None     20
7      PT  Paulo     2    Yes     20
8      PT  Paulo     1    Yes     22

Output:

  Country  Total_Blanks_on_Code  Total_Blanks_on_Signed  Total_of_rows_with_both_values_filled  Total_of_rows_of_the_Country Rows_with_any_blank
0      CZ                     1                       1                                      1                             2                 [6]
1      AE                     0                       2                                      1                             3              [1, 2]
2      DK                     1                       2                                      0                             2             [9, 20]

Explanation:

  • Create both_filled column which is True if both Code and Signed are non-null and is None otherwise (this allows us to later use count() to effectively sum the number of rows having both columns non-null)
  • Create Rows_with_any_blank column which contains the value in Index for rows where neither of Code and Signed is null
  • Create a groubpy object gb by Country
  • Use count() to get the number of non-null entries per group in each column of gb
  • Use assign() to overwrite the Rows_with_any_blank column to be a list of the non-null Index values for each group
  • Use assign() to create and populate columns Total_Blanks_on_Code and Total_Blanks_on_Signed
  • Keep only rows where the count in both_filled < the count in Name (which is the total number of rows in the original df); this removes any Country for which all Code and Signed rows are filled
  • Select the 5 desired columns in the specified order using [[]]
  • Use reset_index() to switch Country from the index to a column
  • Use rename() to change Name and both_filled to have the specified labels Total_of_rows_of_the_Country and Total_of_rows_with_both_values_filled.
Answered By: constantstranger
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.