How to remove some rows from a Pandas dataframe to balance it
Question:
I have a CSV file, and after reading it with pandas it has the following columns:
file_path, label
The labels are only zeros and ones, and the frequency count is as follows:
data["labels"].value_counts()
0 197664
1 78444
I would like to remove a number of rows that have the value 0, let's say 20k for example, so that the frequency counts become:
data["labels"].value_counts()
0 177664
1 78444
Answers:
You can drop the last 20K rows that match a condition using pandas drop:
df.drop(df[df.labels == 0].index[-20000:], inplace=True)
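Dropping the last 20K matching rows is deterministic, but if the row order carries meaning it may bias what remains. A minimal sketch of a random variant follows, assuming the same "labels" column; the toy frame, `n_to_drop`, and `random_state` are illustrative values, not taken from the question.

```python
import pandas as pd

# Toy frame standing in for the real data (hypothetical values).
df = pd.DataFrame({
    "file_path": list("abcdefgh"),
    "labels": [0, 1, 0, 0, 1, 0, 0, 1],
})

n_to_drop = 2  # hypothetical; this would be 20000 in the question

# Randomly pick n_to_drop rows among those labelled 0, then drop them by index.
# random_state is fixed only so the sketch is reproducible.
drop_idx = df[df["labels"] == 0].sample(n=n_to_drop, random_state=0).index
df_reduced = df.drop(drop_idx)

print(df_reduced["labels"].value_counts())
```

Like the `drop` call above, this relies on the index being unique, since rows are removed by index label.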
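I usually split and then concat. Note that this removes only the zero-labelled rows found among the first 20,000 rows, so it may drop fewer than 20,000 rows:
df1 = df.iloc[:20000]  # first 20,000 rows
df2 = df.drop(df1.index)  # remaining rows (assumes a unique index)
new = pd.concat([df1[df1['labels'] != 0], df2])  # keep only non-zero labels from the first chunk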
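If exactly a given number of zero-labelled rows should be removed, wherever they sit in the frame, a cumulative count over a boolean mask is one option. This is a sketch under assumptions: the toy data and `n_to_drop` are hypothetical, and the column is named "labels" as in the question.

```python
import pandas as pd

# Toy frame standing in for the real data (hypothetical values).
df = pd.DataFrame({
    "file_path": list("abcdefgh"),
    "labels": [0, 1, 0, 0, 1, 0, 0, 1],
})

n_to_drop = 2  # hypothetical; this would be 20000 in the question

is_zero = df["labels"] == 0
# cumsum() numbers the zero-labelled rows 1, 2, 3, ... in row order,
# so this mask is True only for the first n_to_drop zero-labelled rows.
first_n_zeros = is_zero & (is_zero.cumsum() <= n_to_drop)
new = df[~first_n_zeros]

print(new)
```

Because the rows are selected with a mask rather than by index label, this does not depend on the index being unique.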
import pandas as pd

# Small example frame with a binary "label" column
mydict = {
    "file_path": ["a", "b", "c", "d", "e", "f", "g"],
    "label": [0, 1, 0, 1, 1, 1, 0],
}
df = pd.DataFrame(mydict)
| | file_path | label |
|---|---|---|
| 0 | a | 0 |
| 1 | b | 1 |
| 2 | c | 0 |
| 3 | d | 1 |
| 4 | e | 1 |
| 5 | f | 1 |
| 6 | g | 0 |
If your labels are only 0 and 1 and you want to keep only the rows labelled 1, you can group the dataset by the "label" column and then use get_group():
get_1 = df.groupby("label").get_group(1)
get_1
| | file_path | label |
|---|---|---|
| 1 | b | 1 |
| 3 | d | 1 |
| 4 | e | 1 |
| 5 | f | 1 |
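Selecting a single group drops the other class entirely. If the aim is a balanced frame instead, one possible sketch is to downsample every class to the size of the smallest one. The name `n_min` and the fixed `random_state` are illustrative choices, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "file_path": ["a", "b", "c", "d", "e", "f", "g"],
    "label": [0, 1, 0, 1, 1, 1, 0],
})

# Size of the smallest class; every class is downsampled to this many rows.
n_min = df["label"].value_counts().min()

# Sample n_min rows from each label group, then restore the original row order.
balanced = pd.concat(
    [group.sample(n=n_min, random_state=0) for _, group in df.groupby("label")]
).sort_index()

print(balanced["label"].value_counts())
```

In this toy frame the smaller class (label 0) has three rows, so the result keeps all three zeros and a random three of the four ones.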