Using pandas, I'm trying to keep only 100 rows in my DataFrame for each value of the "neighborhood" column
Question:
I have a very large dataset that I'm trying to shrink.
My idea is to keep 100 rows per neighborhood.
Here's an overview of my data:
index | name | neighborhood |
---|---|---|
0 | name 1 | neighborhood A |
1 | name 2 | neighborhood A |
2 | name 3 | neighborhood B |
3 | name 4 | neighborhood B |
4 | name 5 | neighborhood C |
5 | name 6 | neighborhood C |
6 | name 7 | neighborhood D |
7 | name 8 | neighborhood D |
8 | name 9 | neighborhood E |
9 | name 10 | neighborhood E |
What is the most efficient way to do this?
Thanks in advance
I'm expecting to end up with something that looks like this (shown here with one row per neighborhood):
index | name | neighborhood |
---|---|---|
0 | name 1 | neighborhood A |
1 | name 3 | neighborhood B |
2 | name 5 | neighborhood C |
3 | name 7 | neighborhood D |
4 | name 9 | neighborhood E |
Answers:
I think you can use groupby and nth (the slice form nth[:100] needs pandas 1.4 or later):
dfx = df.groupby('neighborhood').nth[:100]
It depends on how you want to select the rows.
First n rows with groupby.head:
n = 100
out = df.groupby('neighborhood').head(n)
Random n rows with groupby.sample:
n = 100
out = df.groupby('neighborhood').sample(n=n)
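For what it's worth, here is a minimal runnable sketch of both options on the sample data from the question. The DataFrame construction, the variable names `first_n` and `random_n`, and the seed `random_state=0` are my own assumptions for illustration. One caveat: as far as I know, `groupby.sample(n=n)` raises a `ValueError` when a group has fewer than `n` rows (unless `replace=True`), so the random variant below shuffles the whole frame first and then takes the head of each group, which simply keeps small groups whole.

```python
import pandas as pd

# Sample data mirroring the question's table (two rows per neighborhood).
df = pd.DataFrame({
    "name": [f"name {i}" for i in range(1, 11)],
    "neighborhood": [f"neighborhood {c}" for c in "AABBCCDDEE"],
})

n = 1  # the question wants 100; 1 reproduces the small expected table

# First n rows of each neighborhood, kept in their original order.
first_n = df.groupby("neighborhood").head(n).reset_index(drop=True)

# Random rows, at most n per neighborhood: shuffle everything,
# then take the first n rows of each group of the shuffled frame.
# random_state=0 is an arbitrary seed, used only for reproducibility.
random_n = (
    df.sample(frac=1, random_state=0)
      .groupby("neighborhood")
      .head(n)
      .sort_index()  # optional: restore the original row order
)

print(first_n)
print(random_n)
```

With `n = 100` on this tiny frame, both variants return all 10 rows, because no neighborhood has more than 100.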