Finding the longest streak of numbers, summing the values of that group and creating a new dataframe

Question:

This is an extension to this post.

My dataframe is:

import pandas as pd

df = pd.DataFrame(
    {
        'a': [
            'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a',
            'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b',
        ],
        'b': [
            -20, 20, 20, 20,-70, -70, 10, -1000, -10, 100, 100,
            -11, -100, -1, -1, -100, 100, 1, 90, -1, -2, 1000, 900
        ],
        'c': [
            'f', 'f', 'f', 'f', 'f', 'x', 'x', 'x', 'y', 'y', 'y', 'a',
            'k', 'k', 'k', 'k', 'k', 't', 't', 't', 't', 's', 'e',
        ],
    }
)

And this is the output that I want. I want a dataframe with six columns:

a  direction  length    sum  start  end
a         -1       2  -1010      x    y
a          1       3     60      f    f
b         -1       4   -202      k    k
b          1       3    191      k    t

For each group in column a, I want to find the longest positive streak and the longest negative streak in column b, and then sum the values of b within each of those streaks. That part has already been solved here. The post mentioned at the top explains the problem in more detail.

Now what I want to add is: after finding the sums of the longest negative and positive streaks in b, I need the values of column c at the start and end of those streaks.

In this image I highlighted the groups that have the longest streak:

[Image: the dataframe with the longest streaks highlighted]

What I have tried is:

import numpy as np

df['sign'] = np.sign(df.b)
group = df['sign'].ne(df['sign'].shift()).cumsum()

out = (df
   .assign(direction=np.sign(df['b']))
   .groupby(['a', 'direction', group], as_index=False)
   .agg(length=('b', 'count'),
        sum=('b', 'sum'))
   .sort_values(by='sum', key=abs, ascending=False)
   .loc[lambda d: d.groupby(['a', 'direction'])['length'].idxmax(),
        ['a','direction', 'length', 'sum']]
)

df['streak'] = df['sign'].ne(df['sign'].shift()).cumsum()
df['length'] = df.groupby('streak')['b'].transform('size')
df['sum'] = df.groupby('streak')['b'].transform('sum')
dfm = df.merge(out, on=['a', 'length', 'sum'], how='inner')

It is getting close, but it feels like this is not the right way to do it.

Asked By: x_Amir_x

||

Answers:

Add extra aggregations in agg with first/last:

out = (df
   .assign(direction=np.sign(df['b']))
   .groupby(['a', 'direction', group], as_index=False)
   .agg(length=('b', 'count'),
        sum=('b', 'sum'),
        start=('c', 'first'),
        end=('c', 'last'))
   .sort_values(by='sum', key=abs, ascending=False)
   .loc[lambda d: d.groupby(['a', 'direction'])['length'].idxmax()]
)

Output:

   a  direction  length   sum start end
2  a         -1       2 -1010     x   y
4  a          1       3    60     f   f
7  b         -1       4  -202     k   k
9  b          1       3   191     k   t
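
The snippet above relies on np, df and group from the question. For anyone who wants to run it in isolation, here is a self-contained sketch of the same idea. The helper name longest_streaks and the run column are made up for illustration; the run label is stored as a regular column so that the groupby keys are all column names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'a': ['a'] * 12 + ['b'] * 11,
        'b': [
            -20, 20, 20, 20, -70, -70, 10, -1000, -10, 100, 100,
            -11, -100, -1, -1, -100, 100, 1, 90, -1, -2, 1000, 900,
        ],
        'c': [
            'f', 'f', 'f', 'f', 'f', 'x', 'x', 'x', 'y', 'y', 'y', 'a',
            'k', 'k', 'k', 'k', 'k', 't', 't', 't', 't', 's', 'e',
        ],
    }
)


def longest_streaks(frame):
    """Longest positive and negative run of b per group in a (hypothetical helper)."""
    direction = np.sign(frame['b'])
    # A new run label starts whenever the sign differs from the previous row.
    run = direction.ne(direction.shift()).cumsum()
    runs = (frame
        .assign(direction=direction, run=run)
        .groupby(['a', 'direction', 'run'], as_index=False)
        .agg(length=('b', 'count'),
             sum=('b', 'sum'),
             start=('c', 'first'),
             end=('c', 'last'))
        # Largest absolute sum first, so ties on length resolve that way.
        .sort_values(by='sum', key=abs, ascending=False)
    )
    # idxmax returns the first row with the maximal length in each group.
    best = runs.groupby(['a', 'direction'])['length'].idxmax()
    return runs.loc[best, ['a', 'direction', 'length', 'sum', 'start', 'end']]


result = longest_streaks(df)
print(result)
```

On the question's data this should give the same four rows as the output shown above.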
Answered By: mozway

Relative to your last question, you just need to aggregate the first and last values of c, add a to the groupby keys, and call drop_duplicates on a and direction:

group = np.sign(df['b']).ne(np.sign(df['b']).shift()).cumsum()

out = (df
    .assign(direction=np.where(df['b'] >= 0, 'long', 'short'))
    .groupby(['a', 'direction', group], as_index=False)
    .agg(length=('b','size'),sum=('b','sum'),start=('c','first'),end=('c','last'))
    .sort_values(['length', 'sum'], key=lambda s:s.abs(), ascending=False)
    .drop_duplicates(['a', 'direction'])
)

Output:

   a direction  length   sum start end
9  b     short       4  -202     k   k
7  b      long       3   191     k   t
0  a      long       3    60     f   f
5  a     short       2 -1010     x   y
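
Why a needs to be among the groupby keys: the run labels are computed over the whole column and do not restart when a changes, so a run that continues across the boundary between two groups keeps a single label. A minimal sketch on a small made-up frame (the data and the names toy and run_id are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

# Made-up data: the negative run -1, -2, -3, -4 straddles the a/b boundary.
toy = pd.DataFrame({
    'a': ['a', 'a', 'a', 'b', 'b', 'b'],
    'b': [5, -1, -2, -3, -4, 6],
})

sign = np.sign(toy['b'])
# Increment the label each time the sign differs from the previous row.
run_id = sign.ne(sign.shift()).cumsum().rename('run')

# Grouping by the label alone treats the four negatives as one run.
by_label = toy.groupby(run_id)['b'].agg(['size', 'sum'])

# Adding 'a' as a key splits that run at the group boundary.
by_a_and_label = toy.groupby(['a', run_id])['b'].agg(['size', 'sum'])

print(by_label)
print(by_a_and_label)
```

In the question's data the same thing happens between the last row of group a (-11) and the first rows of group b, which is one reason a merge on lengths computed from the streak label alone can fail to match.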
Answered By: Nick