Is there a pandas aggregate function that combines features of 'any' and 'unique'?

Question:

I have a large dataset with data similar to this:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(
...     {'A': ['one', 'two', 'two', 'one', 'one', 'three'],
...      'B': ['a', 'b', 'c', 'a', 'a', np.nan]})          
>>> df
       A    B
0    one    a
1    two    b
2    two    c
3    one    a
4    one    a
5  three  NaN

There are two aggregation functions, ‘any’ and ‘unique’:

>>> df.groupby('A')['B'].any()
A
one       True
three    False
two       True
Name: B, dtype: bool

>>> df.groupby('A')['B'].unique()
A
one         [a]
three     [nan]
two      [b, c]
Name: B, dtype: object

but I want to get the following result (or something close to it):

A
one           a
three     False
two        True

I can do it with some complex code, but I would rather find an appropriate function in an existing Python package, or at least the simplest way to solve the problem. I’d be grateful if you could help me with that.

Asked By: Anna


Answers:

You can aggregate Series.nunique into one column and the unique values, with possible missing values removed, into another column:

df1 = df.groupby('A').agg(count=('B','nunique'), 
                          uniq_without_NaNs = ('B', lambda x: x.dropna().unique()))
print (df1)
       count uniq_without_NaNs
A                             
one        1               [a]
three      0                []
two        2            [b, c]

Then build a boolean Series that is True where count is greater than 1, and where count equals 1 replace the value with the first element of uniq_without_NaNs:

out = df1['count'].gt(1).mask(df1['count'].eq(1), df1['uniq_without_NaNs'].str[0])
print (out)
A
one          a
three    False
two       True
Name: count, dtype: object
Answered By: jezrael

>>> g = df.groupby("A")["B"].agg
>>> nun = g("nunique")
>>> pd.Series(np.select([nun > 1, nun == 1],
...                     [True, g("unique").str[0]],
...                     default=False),
...           index=nun.index)

A
one          a
three    False
two       True
dtype: object
  • get a hold of the group aggregator
  • count the number of unique values
    • if > 1, i.e., more than one unique value, put True
    • if == 1, i.e., exactly one unique value, put that value
    • else, i.e., no unique values (all NaN), put False
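
The branching described above can also be written out as an explicit per-group function and applied with groupby.apply; a rough sketch (the helper name summarize is illustrative, not part of the original answer):

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'A': ['one', 'two', 'two', 'one', 'one', 'three'],
     'B': ['a', 'b', 'c', 'a', 'a', np.nan]})

def summarize(s):
    uniques = s.dropna().unique()   # unique non-missing values in the group
    if len(uniques) > 1:            # several distinct values -> True
        return True
    if len(uniques) == 1:           # exactly one distinct value -> that value
        return uniques[0]
    return False                    # only NaN (or empty) -> False

out = df.groupby('A')['B'].apply(summarize)
# one -> 'a', three -> False, two -> True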
Answered By: Mustafa Aydın

You can combine groupby with agg and use a boolean mask to choose the correct output:

# Your code
agg = df.groupby('A')['B'].agg(['any', 'unique'])

# Boolean mask to choose between 'any' and 'unique' column
m = agg['unique'].str.len().eq(1) & agg['unique'].str[0].notna()

# Final output
out = agg['any'].mask(m, other=agg['unique'].str[0])

Output:

>>> out
A
one          a
three    False
two       True

>>> agg
         any  unique
A                   
one     True     [a]
three  False   [nan]
two     True  [b, c]

>>> m
A
one       True  # choose 'unique' column
three    False  # choose 'any' column
two      False  # choose 'any' column
Answered By: Corralien

new_df = df.groupby('A')['B'].apply(lambda x: x.notna().any())
new_df = new_df.reset_index()
new_df.columns = ['A', 'B']

This will give you:

       A      B
0    one   True
1  three  False
2    two   True

Now, if we want to find the values, we can do:

df.groupby('A')['B'].apply(lambda x: x[x.notna()].unique()[0] if x.notna().any() else np.nan)

which gives:

A
one        a
three    NaN
two        b
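
To get the single output asked for in the question, the two results above could be combined with a uniqueness check; a possible sketch, reusing df from the question (the nunique step is an addition on top of this answer):

import numpy as np

flags = df.groupby('A')['B'].apply(lambda x: x.notna().any())           # True/False per group
firsts = df.groupby('A')['B'].apply(
    lambda x: x[x.notna()].unique()[0] if x.notna().any() else np.nan)  # first non-missing value
nuniq = df.groupby('A')['B'].nunique()                                  # count of unique non-missing values

# keep the single value where exactly one unique entry exists,
# otherwise fall back to the True/False flag
out = flags.mask(nuniq.eq(1), firsts)
# one -> 'a', three -> False, two -> True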
Answered By: Wiliam

The expression

series = df.groupby('A')['B'].agg(lambda x: pd.Series(x.unique()))

will give the following result:

one           a
three       NaN
two      [b, c]

where a simple value can be identified by its type:

series[series.apply(type) == str]

I think it is easy enough to use often, but it is probably not the optimal solution.
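
If the mixed-type series is acceptable as an intermediate result, one more pass could collapse it into the a / False / True form from the question; a sketch assuming single values come back as plain strings, the all-NaN group as a scalar NaN, and multi-value groups as something list-like (as in the output shown above):

import pandas as pd

out = series.apply(
    lambda v: True if pd.api.types.is_list_like(v)   # several unique values -> True
    else (v if isinstance(v, str) else False))       # one value -> that value, all-NaN -> False
# one -> 'a', three -> False, two -> True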

Answered By: Anna