Is there a pandas aggregate function that combines features of 'any' and 'unique'?

Question:

I have a large dataset with data similar to this:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(
...     {'A': ['one', 'two', 'two', 'one', 'one', 'three'],
...      'B': ['a', 'b', 'c', 'a', 'a', np.nan]})          
>>> df
       A    B
0    one    a
1    two    b
2    two    c
3    one    a
4    one    a
5  three  NaN

There are two aggregation functions, ‘any’ and ‘unique’:

>>> df.groupby('A')['B'].any()
A
one       True
three    False
two       True
Name: B, dtype: bool

>>> df.groupby('A')['B'].unique()
A
one         [a]
three     [nan]
two      [b, c]
Name: B, dtype: object

but I want to get the following result (or something close to it):

A
one           a
three     False
two        True

I can do it with some complex code, but I would rather find an appropriate function in an existing Python package, or at least the simplest way to solve the problem. I’d be grateful if you could help me with that.

Asked By: Anna


Answers:

You can aggregate Series.nunique into one column and the unique values, with possible missing values removed, into another column:

df1 = df.groupby('A').agg(count=('B','nunique'), 
                          uniq_without_NaNs = ('B', lambda x: x.dropna().unique()))
print (df1)
       count uniq_without_NaNs
A                             
one        1               [a]
three      0                []
two        2            [b, c]

Then build a boolean Series that is True where count is greater than 1, and where count equals 1 replace the value with the first element of uniq_without_NaNs:

out = df1['count'].gt(1).mask(df1['count'].eq(1), df1['uniq_without_NaNs'].str[0])
print (out)
A
one          a
three    False
two       True
Name: count, dtype: object
Answered By: jezrael

>>> g = df.groupby("A")["B"].agg
>>> nun = g("nunique")
>>> pd.Series(np.select([nun > 1, nun == 1],
...                     [True, g("unique").str[0]],
...                     default=False),
...           index=nun.index)

A
one          a
three    False
two       True
dtype: object
  • get a hold of the group aggregator
  • count the number of unique values
    • if > 1, i.e., more than one unique value, put True
    • if == 1, i.e., exactly one unique value, put that value
    • else, i.e., no unique values (all NaN), put False
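
The branching described above can also be written out as an explicit per-group function and applied with groupby.apply; a rough sketch (the helper name summarize is illustrative, not part of the original answer):

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'A': ['one', 'two', 'two', 'one', 'one', 'three'],
     'B': ['a', 'b', 'c', 'a', 'a', np.nan]})

def summarize(s):
    uniques = s.dropna().unique()   # unique non-missing values in the group
    if len(uniques) > 1:            # several distinct values -> True
        return True
    if len(uniques) == 1:           # exactly one distinct value -> that value
        return uniques[0]
    return False                    # only NaN (or empty) -> False

out = df.groupby('A')['B'].apply(summarize)
# one -> 'a', three -> False, two -> True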
Answered By: Mustafa Aydın

You can combine groupby with agg and use a boolean mask to choose the correct output:

# Your code
agg = df.groupby('A')['B'].agg(['any', 'unique'])

# Boolean mask to choose between 'any' and 'unique' column
m = agg['unique'].str.len().eq(1) & agg['unique'].str[0].notna()

# Final output
out = agg['any'].mask(m, other=agg['unique'].str[0])

Output:

>>> out
A
one          a
three    False
two       True

>>> agg
         any  unique
A                   
one     True     [a]
three  False   [nan]
two     True  [b, c]

>>> m
A
one       True  # choose 'unique' column
three    False  # choose 'any' column
two      False  # choose 'any' column
Answered By: Corralien

new_df = df.groupby('A')['B'].apply(lambda x: x.notna().any())
new_df = new_df.reset_index()
new_df.columns = ['A', 'B']

This will give you:

       A      B
0    one   True
1  three  False
2    two   True

Now, if we want to find the values, we can do:

df.groupby('A')['B'].apply(lambda x: x[x.notna()].unique()[0] if x.notna().any() else np.nan)

which gives:

A
one        a
three    NaN
two        b
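
To get the single output asked for in the question, the two results above could be combined with a uniqueness check; a possible sketch, reusing df from the question (the nunique step is an addition on top of this answer):

import numpy as np

flags = df.groupby('A')['B'].apply(lambda x: x.notna().any())           # True/False per group
firsts = df.groupby('A')['B'].apply(
    lambda x: x[x.notna()].unique()[0] if x.notna().any() else np.nan)  # first non-missing value
nuniq = df.groupby('A')['B'].nunique()                                  # count of unique non-missing values

# keep the single value where exactly one unique entry exists,
# otherwise fall back to the True/False flag
out = flags.mask(nuniq.eq(1), firsts)
# one -> 'a', three -> False, two -> True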
Answered By: Wiliam

The expression

series = df.groupby('A')['B'].agg(lambda x: pd.Series(x.unique()))

will give the following result:

one           a
three       NaN
two      [b, c]

where a simple value can be identified by its type:

series[series.apply(type) == str]

I think it is easy enough to use often, but it is probably not the optimal solution.
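
If the mixed-type series is acceptable as an intermediate result, one more pass could collapse it into the a / False / True form from the question; a sketch assuming single values come back as plain strings, the all-NaN group as a scalar NaN, and multi-value groups as something list-like (as in the output shown above):

import pandas as pd

out = series.apply(
    lambda v: True if pd.api.types.is_list_like(v)   # several unique values -> True
    else (v if isinstance(v, str) else False))       # one value -> that value, all-NaN -> False
# one -> 'a', three -> False, two -> True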

Answered By: Anna