Pandas – get first n-rows based on percentage
Question:
I have a dataframe and want to take a certain number of records from it, but I want to pass a percentage value instead of a row count.
For example,
df.head(n=10)
This returns the first 10 records of the data set. I want a small change: instead of 10 records, I want the first 5% of the records in my data set.
How can I do this in pandas?
I’m looking for code like this:
df.head(frac=0.05)
Is there any simple way to get this?
Answers:
I want to pop first 5% of record
There is no built-in method, but you can do this:
Multiply the total number of rows by your percentage and use the result as the parameter for the head method.
n = 5
df.head(int(len(df)*(n/100)))
So if your dataframe contains 1000 rows and n = 5 (that is, 5%), you will get the first 50 rows.
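The calculation above can be wrapped in a small helper. This is only a minimal sketch: head_frac is a hypothetical name (pandas has no such method), the sample data is invented, and the fraction is assumed to be given in the range 0 to 1.

```python
import pandas as pd


def head_frac(df: pd.DataFrame, frac: float) -> pd.DataFrame:
    """Return the first `frac` share of rows (hypothetical helper, not part of pandas)."""
    if not 0 <= frac <= 1:
        raise ValueError("frac must be between 0 and 1")
    # int() truncates, so the row count is rounded down
    n_rows = int(len(df) * frac)
    return df.head(n_rows)


# Invented example data: 1000 rows
df = pd.DataFrame({"value": range(1000)})
first_5_percent = head_frac(df, 0.05)
print(len(first_5_percent))
```

Because the count is rounded down, very small frames can yield an empty result (for example, 5% of 10 rows is 0 rows).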
I’ve extended Mihai’s answer for my usage and it may be useful to people out there.
The purpose is automated top-n records selection for time series sampling, so you’re sure you’re taking old records for training and recent records for testing.
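# assuming
# import pandas as pd
# df = pd.DataFrame...
def sample_first_prows(data, perc=0.7):
    # first perc share of the rows; int() rounds the count down
    return data.head(int(len(data) * perc))

train = sample_first_prows(df)
# start right after the last training row so train and test do not overlap
test = df.iloc[len(train):]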
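For a time-ordered split, slicing by position keeps the two parts disjoint without relying on index labels. A minimal sketch, assuming the rows are already sorted from oldest to newest; the data and the column name are made up for illustration.

```python
import pandas as pd

# Invented time-ordered data: row 0 is the oldest, row 99 the most recent
df = pd.DataFrame({"value": range(100)})

perc = 0.7
split_at = int(len(df) * perc)  # number of rows that go into the training part

train = df.iloc[:split_at]  # oldest rows
test = df.iloc[split_at:]   # most recent rows

print(len(train), len(test))
```

Using the row count as the split position means the split also works when the index is not a default 0..n-1 range, such as a DatetimeIndex.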
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 2))
print(df)
          0         1
0  0.375727 -1.297127
1 -0.676528  0.301175
2 -2.236334  0.154765
3 -0.127439  0.415495
4  1.399427 -1.244539
5 -0.884309 -0.108502
6 -0.884931  2.089305
7  0.075599  0.404521
8  1.836577 -0.762597
9  0.294883  0.540444
# 70% of the DataFrame, sampled at random (not the first 70% of the rows)
part_70 = df.sample(frac=0.7, random_state=10)
print(part_70)
          0         1
8  1.836577 -0.762597
2 -2.236334  0.154765
5 -0.884309 -0.108502
6 -0.884931  2.089305
3 -0.127439  0.415495
1 -0.676528  0.301175
0  0.375727 -1.297127
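Note that sample picks rows at random, so it answers a slightly different question from "the first n%". A short sketch contrasting the two on invented data (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 2))

# Random 70% of the rows: order and membership depend on random_state
random_part = df.sample(frac=0.7, random_state=10)

# First 70% of the rows: always the leading rows in their original order
first_part = df.head(int(len(df) * 0.7))

print(random_part.index.tolist())
print(first_part.index.tolist())
```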
Maybe this will help (it takes the first 5% of the rows within each id group):
tt = tmp.groupby('id').apply(lambda x: x.head(int(len(x)*0.05))).reset_index(drop=True)
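The same per-group selection can also be written without apply. This is an alternative formulation, not the answerer's code: a sketch using cumcount and transform("size"), with invented data in which group "a" has 40 rows and group "b" has 20.

```python
import pandas as pd

# Invented data: 40 rows for id "a", 20 rows for id "b"
tmp = pd.DataFrame({"id": ["a"] * 40 + ["b"] * 20, "value": range(60)})

frac = 0.05
grouped = tmp.groupby("id")

position = grouped.cumcount()                  # 0-based position of each row within its group
group_size = grouped["id"].transform("size")   # size of the group each row belongs to

# Keep a row if its position is below the rounded-down per-group row count
keep = position < (group_size * frac).astype(int)
tt = tmp[keep].reset_index(drop=True)

print(tt)
```

Here group "a" contributes its first 2 rows and group "b" its first row.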
I had the same problem and @mihai’s solution was useful. For my case I rewrote it as:
percentage_to_take = 5/100
rows = int(df.shape[0]*percentage_to_take)
df.head(rows)
For the last percentage of rows, df.tail(rows) should work. Note that df.head(-rows) is not equivalent: it returns every row except the last rows.
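The behaviour of tail(rows) and head(-rows) is easy to check on a small frame. A minimal sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({"value": range(100)})
rows = int(df.shape[0] * 5 / 100)  # 5% of 100 rows -> 5

last_rows = df.tail(rows)      # the last 5 rows
all_but_last = df.head(-rows)  # every row except the last 5

print(last_rows.index.tolist())  # [95, 96, 97, 98, 99]
print(len(all_but_last))         # 95
```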