How to remove some rows from a Pandas dataframe to balance it

Question:

I have a csv file and after reading it with pandas it has the following structure:

file_path, label
   -        -

The labels are only zeros and ones, and the frequency count is as follows:

data["labels"].value_counts()

0    197664
1     78444

I would like to remove a number of rows that have the value 0, let's say 20k for example, so that the frequency counts become:

data["labels"].value_counts()

0    177664
1     78444
Asked By: Omar

||

Answers:

You can drop the last 20K rows that match a condition using pandas drop.

df.drop(df[df.labels == 0].index[-20000:], inplace=True)
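
Dropping the last matching rows keeps whatever ordering the file happens to have. If a random subset is preferable, one possible variant is to sample the rows to drop instead. This is a minimal sketch, assuming the column is named labels as in the question; the toy frame, n_drop and the seed are invented for illustration.

```python
import pandas as pd

# Toy data standing in for the real CSV (assumption: column is named "labels").
df = pd.DataFrame({
    "file_path": [f"img_{i}.png" for i in range(10)],
    "labels": [0, 0, 0, 0, 0, 0, 1, 1, 1, 0],
})

n_drop = 3  # hypothetical; the question uses 20000

# Randomly pick n_drop rows whose label is 0, then drop them by index.
to_drop = df[df["labels"] == 0].sample(n=n_drop, random_state=42).index
balanced = df.drop(to_drop)

print(balanced["labels"].value_counts())
```

Fixing random_state makes the selection reproducible; omit it to get a different subset on each run.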
Answered By: Himanshuman

I usually split the frame, filter one part, then concat:

df1 = df.iloc[:20000]
df2 = df.drop(df1.index)
new = pd.concat([df1[df1['labels'] != 0], df2])
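
To see what this split-and-concat does, here is a small sketch on toy data, with a head size of 4 standing in for 20000 (the data and the name head_size are invented for illustration). It removes the zeros found in the first block of rows, so the number of rows removed depends on how many zeros that block contains.

```python
import pandas as pd

# Toy data (assumption: column is named "labels" as in the question).
df = pd.DataFrame({
    "file_path": list("abcdefgh"),
    "labels": [0, 1, 0, 0, 1, 0, 1, 0],
})

head_size = 4  # hypothetical stand-in for 20000

df1 = df.iloc[:head_size]   # first block of rows
df2 = df.drop(df1.index)    # everything after that block
# Keep only the non-zero rows of the first block, then re-attach the rest.
new = pd.concat([df1[df1["labels"] != 0], df2])

removed = len(df) - len(new)
print(removed)
print(new["labels"].value_counts())
```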
Answered By: BENY

import pandas as pd

mydict = {
  "file_path": ["a", "b", "c", "d", "e", "f", "g"],
  "label": [0, 1, 0, 1, 1, 1, 0]
}
df = pd.DataFrame(mydict)
  file_path  label
0         a      0
1         b      1
2         c      0
3         d      1
4         e      1
5         f      1
6         g      0

If your labels are 1 or 0 and you want to get only the rows labeled 1, you can group the dataset by the "label" column and then use get_group():

get_1 = df.groupby("label").get_group(1)
get_1
  file_path  label
1         b      1
3         d      1
4         e      1
5         f      1
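
The same groupby can also be used to downsample instead of selecting a single class. The sketch below assumes pandas 1.1 or later (where GroupBy.sample is available) and assumes that fully equalizing both classes is acceptable, which goes further than the question asks. The names n_min and balanced are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "file_path": ["a", "b", "c", "d", "e", "f", "g"],
    "label": [0, 1, 0, 1, 1, 1, 0],
})

# Size of the smallest class; every class is sampled down to this size.
n_min = df["label"].value_counts().min()

# Draw n_min random rows from each label group.
balanced = df.groupby("label").sample(n=n_min, random_state=0)

print(balanced["label"].value_counts())
```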
Answered By: Galaxy