Delete all columns for which a value repeats consecutively more than 3 times
Question:
I have a df that looks like this:
date | stock1 | stock2 | stock3 | stock4 | stock5 | stock6 | stock7 | stock8 | stock9 | stock10 |
---|---|---|---|---|---|---|---|---|---|---|
10/20 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 0.9 |
11/20 | 0.1 | 0.9 | 0.3 | 0.4 | 0.3 | 0.5 | 0.3 | 0.2 | 0.4 | 0.1 |
12/20 | 0.1 | 0.6 | 0.9 | 0.5 | 0.6 | 0.7 | 0.8 | 0.7 | 0.9 | 0.1 |
10/20 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 0.9 |
11/20 | 0.8 | 0.9 | 0.3 | 0.4 | 0.3 | 0.5 | 0.3 | 0.2 | 0.9 | 0.1 |
12/20 | 0.3 | 0.6 | 0.9 | 0.5 | 0.6 | 0.7 | 0.8 | 0.7 | 0.9 | 0.1 |
10/20 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.7 | 0.7 | 0.8 | 0.9 | 0.9 |
11/20 | 0.8 | 0.9 | 0.3 | 0.4 | 0.3 | 0.7 | 0.3 | 0.2 | 0.4 | 0.1 |
12/20 | 0.3 | 0.6 | 0.9 | 0.5 | 0.6 | 0.7 | 0.8 | 0.7 | 0.9 | 0.1 |
I want to delete all columns in which the same value repeats, consecutively, more than 3 times. In this example, the columns "stock1", "stock6" and "stock9" should be deleted. The other columns also contain values that repeat more than 3 times, but not one after the other.
I think I can adapt the code from this question: Removing values that repeat more than 5 times in Pandas DataFrame, but I have not managed to do that yet.
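For reproducibility, the frame can be constructed directly; the values below are copied from the table:

```python
import pandas as pd

# The example frame, values copied from the table above.
df = pd.DataFrame({
    "date":    ["10/20", "11/20", "12/20"] * 3,
    "stock1":  [0.1, 0.1, 0.1, 0.1, 0.8, 0.3, 0.1, 0.8, 0.3],
    "stock2":  [0.2, 0.9, 0.6] * 3,
    "stock3":  [0.3, 0.3, 0.9] * 3,
    "stock4":  [0.4, 0.4, 0.5] * 3,
    "stock5":  [0.5, 0.3, 0.6] * 3,
    "stock6":  [0.6, 0.5, 0.7, 0.6, 0.5, 0.7, 0.7, 0.7, 0.7],
    "stock7":  [0.7, 0.3, 0.8] * 3,
    "stock8":  [0.8, 0.2, 0.7] * 3,
    "stock9":  [0.9, 0.4, 0.9, 0.9, 0.9, 0.9, 0.9, 0.4, 0.9],
    "stock10": [0.9, 0.1, 0.1] * 3,
})
```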
Answers:
You might want to avoid apply here; set "date" as the index first so the shifted comparison and the column mask both line up:
N = 3
df2 = df.set_index('date')
df2.loc[:, df2.ne(df2.shift()).cumsum()
              .stack()
              .groupby(level=1)
              .value_counts()
              .groupby(level=0).max()
              .le(N)]
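A minimal, runnable sketch of this run-length mask on a toy frame (the column names "a" and "b" are made up for illustration):

```python
import pandas as pd

# Toy frame: "a" has a run of four equal values, "b" never
# repeats consecutively.
df = pd.DataFrame({
    "date": ["10/20", "11/20", "12/20", "01/21", "02/21"],
    "a":    [0.1, 0.1, 0.1, 0.1, 0.2],
    "b":    [0.1, 0.2, 0.1, 0.2, 0.1],
})

N = 3
d = df.set_index("date")
runs = d.ne(d.shift()).cumsum()          # per-column run ids
longest = (runs.stack()
               .groupby(level=1)         # group by column name
               .value_counts()           # length of each run
               .groupby(level=0).max())  # longest run per column
out = d.loc[:, longest.le(N)]            # keep columns with runs of <= 3
print(out.columns.tolist())              # -> ['b']
```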
You can set "date" aside as the index, then check whether each row differs from the previous one and take the cumulative sum of that comparison to build group keys for groupby + cumcount.
Then compute the max count per column; if it is greater than N-1, drop the column:
df2 = df.set_index('date')
N = 3
df2.loc[:, df2.apply(lambda c: c.groupby(c.ne(c.shift()).cumsum()).cumcount())
              .max().le(N-1)]
output:
stock2 stock3 stock4 stock5 stock7 stock8 stock10
date
10/20 0.2 0.3 0.4 0.5 0.7 0.8 0.9
11/20 0.9 0.3 0.4 0.3 0.3 0.2 0.1
12/20 0.6 0.9 0.5 0.6 0.8 0.7 0.1
10/20 0.2 0.3 0.4 0.5 0.7 0.8 0.9
11/20 0.9 0.3 0.4 0.3 0.3 0.2 0.1
12/20 0.6 0.9 0.5 0.6 0.8 0.7 0.1
10/20 0.2 0.3 0.4 0.5 0.7 0.8 0.9
11/20 0.9 0.3 0.4 0.3 0.3 0.2 0.1
12/20 0.6 0.9 0.5 0.6 0.8 0.7 0.1
intermediate count of successive values:
>>> df2.apply(lambda c: c.groupby(c.ne(c.shift()).cumsum()).cumcount())
stock1 stock2 stock3 stock4 stock5 stock6 stock7 stock8 stock9 stock10
date
10/20 0 0 0 0 0 0 0 0 0 0
11/20 1 0 1 1 0 0 0 0 0 0
12/20 2 0 0 0 0 0 0 0 0 1
10/20 3 0 0 0 0 0 0 0 1 0
11/20 0 0 1 1 0 0 0 0 2 0
12/20 0 0 0 0 0 0 0 0 3 1
10/20 0 0 0 0 0 1 0 0 4 0
11/20 0 0 1 1 0 2 0 0 0 0
12/20 0 0 0 0 0 3 0 0 0 1
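The building block here is ne(shift()).cumsum(), which assigns a new id each time the value changes; on a single Series (toy data) it looks like this:

```python
import pandas as pd

s = pd.Series([0.1, 0.1, 0.1, 0.2, 0.1])

# A new run id starts whenever the value differs from the previous one.
run_id = s.ne(s.shift()).cumsum()
print(run_id.tolist())              # -> [1, 1, 1, 2, 3]

# cumcount restarts at 0 within each run, so its max is run length - 1.
pos = s.groupby(run_id).cumcount()
print(pos.tolist())                 # -> [0, 1, 2, 0, 0]
```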
Another option: build run ids with diff/cumsum, count each run's length with value_counts, and keep only the columns whose longest run is at most 3:
df1[['date']].join(
    df1.iloc[:, 1:].loc[:, df1.iloc[:, 1:].apply(lambda ss: (ss.diff() != 0).cumsum().value_counts().max()) <= 3]
)
date stock2 stock3 stock4 stock5 stock7 stock8 stock10
0 10/20 0.2 0.3 0.4 0.5 0.7 0.8 0.9
1 11/20 0.9 0.3 0.4 0.3 0.3 0.2 0.1
2 12/20 0.6 0.9 0.5 0.6 0.8 0.7 0.1
3 10/20 0.2 0.3 0.4 0.5 0.7 0.8 0.9
4 11/20 0.9 0.3 0.4 0.3 0.3 0.2 0.1
5 12/20 0.6 0.9 0.5 0.6 0.8 0.7 0.1
6 10/20 0.2 0.3 0.4 0.5 0.7 0.8 0.9
7 11/20 0.9 0.3 0.4 0.3 0.3 0.2 0.1
8 12/20 0.6 0.9 0.5 0.6 0.8 0.7 0.1
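As a sanity check, the same longest-run criterion can be written in plain Python with itertools.groupby (a hypothetical helper, not taken from any of the answers above):

```python
from itertools import groupby

def longest_run(values):
    """Length of the longest stretch of consecutive equal values."""
    return max(sum(1 for _ in grp) for _, grp in groupby(values))

# Columns with longest_run(...) > 3 are the ones to drop.
print(longest_run([0.1, 0.1, 0.1, 0.1, 0.8, 0.3]))  # -> 4 (drop)
print(longest_run([0.2, 0.9, 0.6, 0.2, 0.9, 0.6]))  # -> 1 (keep)
```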