Find the longest streak of numbers, sum the values of that streak, and create a new dataframe
Question:
This is an extension to this post.
My dataframe is:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'a': [
            'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a',
            'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b',
        ],
        'b': [
            -20, 20, 20, 20, -70, -70, 10, -1000, -10, 100, 100,
            -11, -100, -1, -1, -100, 100, 1, 90, -1, -2, 1000, 900
        ],
        'c': [
            'f', 'f', 'f', 'f', 'f', 'x', 'x', 'x', 'y', 'y', 'y', 'a',
            'k', 'k', 'k', 'k', 'k', 't', 't', 't', 't', 's', 'e',
        ],
    }
)
This is the output that I want, a dataframe with six columns:
a  direction  length   sum  start  end
a         -1       2 -1010      x    y
a          1       3    60      f    f
b         -1       4  -202      k    k
b          1       3   191      k    t
I want to get the longest positive and the longest negative streak in column `b` for each group in column `a`, and then sum the values of column `b` over each of those streaks. This part has already been solved here. In the post noted at the top I explained the problem in more detail.
What I want to add now: after finding the sum of the longest negative and positive streaks in `b`, I need the start and end values of column `c` for those streaks.
In this image I highlighted the groups that have the longest streak:
What I have tried is:
# Sign of each value (-1, 0 or 1)
df['sign'] = np.sign(df.b)
# A new streak id starts whenever the sign changes
group = df['sign'].ne(df['sign'].shift()).cumsum()

out = (df
    .assign(direction=np.sign(df['b']))
    .groupby(['a', 'direction', group], as_index=False)
    .agg(length=('b', 'count'),
         sum=('b', 'sum'))
    .sort_values(by='sum', key=abs, ascending=False)
    .loc[lambda d: d.groupby(['a', 'direction'])['length'].idxmax(),
         ['a', 'direction', 'length', 'sum']]
)

# Attempt to recover the rows of each winning streak by merging back
df['streak'] = df['sign'].ne(df['sign'].shift()).cumsum()
df['length'] = df.groupby('streak')['b'].transform('size')
df['sum'] = df.groupby('streak', as_index=False)['b'].transform('sum')
dfm = df.merge(out, on=['a', 'length', 'sum'], how='inner')
This gets close, but it feels like it is not the right way to do it.
Answers:
Add extra aggregations in `agg` with `first`/`last`:
out = (df
    .assign(direction=np.sign(df['b']))
    .groupby(['a', 'direction', group], as_index=False)
    .agg(length=('b', 'count'),
         sum=('b', 'sum'),
         start=('c', 'first'),
         end=('c', 'last'))
    .sort_values(by='sum', key=abs, ascending=False)
    .loc[lambda d: d.groupby(['a', 'direction'])['length'].idxmax()]
)
Output:
   a  direction  length   sum start end
2  a         -1       2 -1010     x   y
4  a          1       3    60     f   f
7  b         -1       4  -202     k   k
9  b          1       3   191     k   t
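A self-contained sketch of this approach, for readers who want to run it without the earlier snippets. It makes two assumptions that are not in the answer: the streak ids are stored in a helper column called `streak` (a hypothetical name) instead of being passed as an external Series, and `size` is used for the length. Sorting by absolute sum first matters because `idxmax` returns the first row holding the maximum length, so a tie (the two length-2 negative streaks in group `a`) resolves to the streak with the larger absolute sum.

```python
import numpy as np
import pandas as pd

# Same data as in the question, written compactly
df = pd.DataFrame({
    'a': ['a'] * 12 + ['b'] * 11,
    'b': [-20, 20, 20, 20, -70, -70, 10, -1000, -10, 100, 100, -11,
          -100, -1, -1, -100, 100, 1, 90, -1, -2, 1000, 900],
    'c': list('fffffxxxyyya') + list('kkkkktttt') + ['s', 'e'],
})

sign = np.sign(df['b'])
tmp = df.assign(
    direction=sign,
    # 'streak' is a hypothetical helper column: the id increases by one
    # every time the sign differs from the previous row
    streak=sign.ne(sign.shift()).cumsum(),
)

# One row per streak, with its length, sum, and first/last value of 'c'
streaks = (
    tmp.groupby(['a', 'direction', 'streak'], as_index=False)
       .agg(length=('b', 'size'),
            sum=('b', 'sum'),
            start=('c', 'first'),
            end=('c', 'last'))
)

# Put larger |sum| first so that idxmax breaks length ties in its favour
ranked = streaks.sort_values('sum', key=abs, ascending=False)
best = (
    ranked.loc[ranked.groupby(['a', 'direction'])['length'].idxmax()]
          .drop(columns='streak')
)
print(best)
```

Keeping `streak` as a real column makes the intermediate `streaks` frame easy to inspect when a result looks wrong.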
Relative to your last question, you just need to aggregate the `first` and `last` values of `c`, add `a` to the groupby, and use `drop_duplicates`:
group = np.sign(df['b']).ne(np.sign(df['b']).shift()).cumsum()
out = (df
    .assign(direction=np.where(df['b'] >= 0, 'long', 'short'))
    .groupby(['a', 'direction', group], as_index=False)
    .agg(length=('b', 'size'),
         sum=('b', 'sum'),
         start=('c', 'first'),
         end=('c', 'last'))
    .sort_values(['length', 'sum'], key=lambda s: s.abs(), ascending=False)
    .drop_duplicates(['a', 'direction'])
)
Output:
   a direction  length   sum start end
9  b     short       4  -202     k   k
7  b      long       3   191     k   t
0  a      long       3    60     f   f
5  a     short       2 -1010     x   y
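A small sketch of why `a` belongs in the groupby. The streak ids are computed over the whole frame, so a streak can run across the boundary between two groups when the sign does not change there (in the question's data, the last value of group `a`, -11, and the first value of group `b`, -100, are both negative). The toy frame and the names `streak`, `without_a` and `with_a` below are made up for illustration only.

```python
import numpy as np
import pandas as pd

# Rows 0-1 belong to group 'a', rows 2-4 to group 'b'.
# The sign stays negative across the boundary between rows 1 and 2.
toy = pd.DataFrame({
    'a': ['a', 'a', 'b', 'b', 'b'],
    'b': [5, -1, -2, -3, 4],
})

sign = np.sign(toy['b'])
# Streak id increases whenever the sign changes: 1, 2, 2, 2, 3
streak = sign.ne(sign.shift()).cumsum().rename('streak')

# Grouping by the streak id alone merges rows from both groups
without_a = toy.groupby(streak)['b'].agg(['count', 'sum'])

# Adding 'a' splits streak 2 into an 'a' part and a 'b' part
with_a = toy.groupby(['a', streak])['b'].agg(['count', 'sum'])

print(without_a)
print(with_a)
```

Without `a`, streak 2 is reported with three rows even though only two of them belong to group `b`; with `a` in the keys, each group gets its own part of the streak.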