Count matching values between different columns of grouped rows

Question:

This dataset consists of 3 households and their members. Columns 3 and 4 indicate whether a member lives with their parents: the value is the identity of the mother or father in the Member column, within the same household. For example, member 3 lives with their mother (2) and father (1).

Household  Member  Lives_with_m  Lives_with_f  New_col
    1        1         NaN          NaN          1
    1        2         NaN          NaN          1
    1        3          2            1           1
    1        4         NaN          NaN          1    
    1        5          3            4          NaN
    2        1          3           NaN         NaN
    2        2         NaN          NaN          1
    2        3          2           NaN          2
    2        4          3           NaN         NaN
    3        1          3            2          NaN
    3        2         NaN          NaN          1
    3        3         NaN          NaN          1

I’ve been trying to create a new column with the total number of kids each member has, which means relating the values of columns 3 and 4 to the Member column, always within the same household.

I’ve tried achieving this with a for loop, starting only with ‘Lives_with_m’, and realized it’s quite slow. More importantly, I couldn’t get the expected results.

for p in set(d['Household']):
    u = d['Member']
    v = d['Lives_with_m']
    if bool(set(u) & set(v)):
        d['New_col'] = 0

I’m not sure if I’m going in the right direction… Any help would be very much appreciated!

import numpy as np
import pandas as pd

d = {'group': [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
     'member': [1, 2, 3, 4, 5, 1, 2, 3, 4, 1, 2, 3],
     'lives_with_m': [np.nan, np.nan, 2, np.nan, 3, 3, np.nan, 2, 3, 3, np.nan, np.nan],
     'lives_with_f': [np.nan, np.nan, 1, np.nan, 4, np.nan, np.nan, np.nan, np.nan, 2, np.nan, np.nan]}
pd.DataFrame(data=d)

Asked By: g herb


Answers:

One way to go would be as follows:

import pandas as pd
import numpy as np

d = {'group': [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
     'member': [1, 2, 3, 4, 5, 1, 2, 3, 4, 1, 2, 3],
     'lives_with_m': [np.nan, np.nan, 2, np.nan, 3, 3, np.nan, 2, 3, 3, np.nan, np.nan],
     'lives_with_f': [np.nan, np.nan, 1, np.nan, 4, np.nan, np.nan, np.nan, np.nan, 2, np.nan, np.nan]}
df = pd.DataFrame(data= d)

value_counts = (
    df.groupby('group')[['lives_with_m', 'lives_with_f']]
      .agg({i: 'value_counts' for i in ['lives_with_m', 'lives_with_f']})
      .apply(lambda x: np.nansum(x), axis=1)
)
value_counts.name = 'new'

df = df.merge(value_counts, how='left', left_on=['group','member'], right_index=True)
df

    group  member  lives_with_m  lives_with_f  new
0       1       1           NaN           NaN  1.0
1       1       2           NaN           NaN  1.0
2       1       3           2.0           1.0  1.0
3       1       4           NaN           NaN  1.0
4       1       5           3.0           4.0  NaN
5       2       1           3.0           NaN  NaN
6       2       2           NaN           NaN  1.0
7       2       3           2.0           NaN  2.0
8       2       4           3.0           NaN  NaN
9       3       1           3.0           2.0  NaN
10      3       2           NaN           NaN  1.0
11      3       3           NaN           NaN  1.0
  • First, we use df.groupby to isolate ['lives_with_m','lives_with_f'] per group.
  • Next, we use df.agg to retrieve value_counts for both columns (a sketch for inspecting this intermediate result follows the printout below).
  • Finally, we apply np.nansum to each row to end up with a pd.Series looking like this:
print(value_counts)
group     
1      2.0    1.0
       3.0    1.0
2      3.0    2.0
       2.0    1.0
3      3.0    1.0
1      1.0    1.0
       4.0    1.0
3      2.0    1.0
Name: new, dtype: float64
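
For reference, here is an illustrative way (not part of the original answer) to inspect the intermediate result of the agg step, before np.nansum collapses the two count columns into one:

# Illustrative: the intermediate per-group value counts. The result is a
# DataFrame indexed by (group, value), with one count column per parent
# column and NaN wherever a value never occurs in that column.
counts = df.groupby('group')[['lives_with_m', 'lives_with_f']].agg(
    {i: 'value_counts' for i in ['lives_with_m', 'lives_with_f']}
)
print(counts)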

At this stage, we just need to merge the original df with our series on the matching columns (left: ['group','member'], right: value_counts.index).
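
If you prefer to avoid value_counts inside agg, here is an alternative sketch (my own, not from the original answer) that melts both parent columns into one, counts occurrences per (group, parent), and merges the counts back on (group, member):

# Stack both parent columns into one long column of parent ids.
stacked = df.melt(id_vars=['group', 'member'],
                  value_vars=['lives_with_m', 'lives_with_f'],
                  value_name='parent').dropna(subset=['parent'])

# Count how often each member is named as a parent within their group,
# i.e. the number of kids per (group, parent) pair.
kid_counts = (stacked.groupby(['group', 'parent'])
                     .size()
                     .rename('new')
                     .reset_index())
kid_counts['parent'] = kid_counts['parent'].astype(int)  # match 'member' dtype

# A member's kid count is the count where they appear as a parent;
# members never named as a parent get NaN, matching the output above.
out = df.merge(kid_counts, how='left',
               left_on=['group', 'member'],
               right_on=['group', 'parent']).drop(columns='parent')

Either way, NaN marks members with no kids; .fillna(0) would turn those into zero counts if preferred.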

Answered By: ouroboros1