Pandas: balancing data
Question:
Note: This question is not the same as the one answered here: “Pandas: sample each group after groupby”
I am trying to figure out how to use pandas.DataFrame.sample
or any other function to balance this data:
df['class'].value_counts()
c1 9170
c2 5266
c3 4523
c4 2193
c5 1956
c6 1896
c7 1580
c8 1407
c9 1324
I need a random sample of each class (c1, c2, ..., c9), where the sample size equals the size of the smallest class. In this example the sample size should be the size of class c9, i.e. 1324.
Is there a simple way to do this with pandas?
Update
To clarify my question, in the table above:
c1 9170
c2 5266
c3 4523
...
The numbers are counts of instances of the c1, c2, c3, … classes, so the actual data looks like this:
c1 'foo'
c2 'bar'
c1 'foo-2'
c1 'foo-145'
c1 'xxx-07'
c2 'zzz'
...
etc.
Update 2
To clarify more:
d = {'class':['c1','c2','c1','c1','c2','c1','c1','c2','c3','c3'],
'val': [1,2,1,1,2,1,1,2,3,3]
}
df = pd.DataFrame(d)
class val
0 c1 1
1 c2 2
2 c1 1
3 c1 1
4 c2 2
5 c1 1
6 c1 1
7 c2 2
8 c3 3
9 c3 3
df['class'].value_counts()
c1 5
c2 3
c3 2
Name: class, dtype: int64
g = df.groupby('class')
g.apply(lambda x: x.sample(g.size().min()))
class val
class
c1 6 c1 1
5 c1 1
c2 4 c2 2
1 c2 2
c3 9 c3 3
8 c3 3
Looks like this works. Main questions:
How does g.apply(lambda x: x.sample(g.size().min())) work? I know what lambda is, but:
- What is passed to lambda in x in this case?
- What is g in g.size()?
- Why does the output contain the numbers 6, 5, 4, 1, 8, 9? What do they mean?
Answers:
g = df.groupby('class')
g.apply(lambda x: x.sample(g.size().min())).reset_index(drop=True)
class val
0 c1 1
1 c1 1
2 c2 2
3 c2 2
4 c3 3
5 c3 3
Answers to your follow-up questions
- The x in the lambda ends up being a dataframe that is the subset of df represented by the group. Each of these dataframes, one for each group, gets passed through this lambda.
- g is the groupby object. I placed it in a named variable because I planned on using it twice. df.groupby('class').size() is an alternative way to do df['class'].value_counts(), but since I was going to groupby anyway, I might as well reuse the same groupby and use size to get the value counts, which saves time.
- Those numbers are the index values from df that go with the sampling. I added reset_index(drop=True) to get rid of them.
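The three points above can be checked by hand. The following is a minimal sketch on the toy frame from the question; the names sizes, n, pieces and sampled are illustrative, and random_state is passed only to make the run repeatable. It iterates over the groupby directly instead of calling apply, so that each sub-dataframe is visible.

```python
import pandas as pd

df = pd.DataFrame({
    "class": ["c1", "c2", "c1", "c1", "c2", "c1", "c1", "c2", "c3", "c3"],
    "val": [1, 2, 1, 1, 2, 1, 1, 2, 3, 3],
})

g = df.groupby("class")

# g.size() is a Series of row counts per group; its minimum is the sample size.
sizes = g.size()
n = int(sizes.min())

# Iterating over the groupby yields (group name, sub-DataFrame) pairs.
# Each sub-DataFrame plays the role of x in the lambda.
pieces = {}
for name, x in g:
    pieces[name] = x.sample(n, random_state=0)

# The sampled rows keep their original index labels from df.
sampled = pd.concat(pieces.values())
original_labels = sorted(sampled.index)

# reset_index(drop=True) replaces those labels with 0..len-1.
balanced = sampled.reset_index(drop=True)

print(sizes)
print(original_labels)
print(balanced)
```

Printing original_labels shows row labels taken from df, which is where numbers such as 6, 5, 4, 1, 9, 8 in the question's output come from.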
The above answer is correct, but I would like to point out that the g above is not a pandas DataFrame
object, which is what the user most likely wants. It is a pandas.core.groupby.groupby.DataFrameGroupBy
object. apply does not modify anything in place; it returns a new dataframe. To see this, try calling head
on g; the result is shown below.
import pandas as pd
d = {'class':['c1','c2','c1','c1','c2','c1','c1','c2','c3','c3'],
'val': [1,2,1,1,2,1,1,2,3,3]
}
d = pd.DataFrame(d)
g = d.groupby('class')
g.apply(lambda x: x.sample(g.size().min()).reset_index(drop=True))
g.head()
>>> class val
0 c1 1
1 c2 2
2 c1 1
3 c1 1
4 c2 2
5 c1 1
6 c1 1
7 c2 2
8 c3 3
9 c3 3
To fix this, assign the result of the apply to a variable (either a new one or g itself, as shown below) so that you get a pandas DataFrame:
g = d.groupby('class')
g = g.apply(lambda x: x.sample(g.size().min())).reset_index(drop=True)
Calling head now yields the first rows of the balanced sample:
g.head()
>>> class val
0 c1 1
1 c1 1
2 c2 2
3 c2 2
4 c3 3
Which is most likely what the user wants.
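This method randomly picks k elements of each class. A class with fewer than k rows is returned whole.
def sampling_k_elements(group, k=3):
    # Groups smaller than k cannot be sampled without replacement, so keep them as-is.
    if len(group) < k:
        return group
    return group.sample(k)
balanced = df.groupby('class').apply(sampling_k_elements).reset_index(drop=True)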
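With the default k=3, a class that has fewer than three rows is kept whole, so the result is only balanced when k does not exceed the smallest class size. The sketch below illustrates this on the toy frame from the question. The random_state parameter and the use of pd.concat over the groups (instead of apply) are assumptions added here for repeatability and are not part of the answer above.

```python
import pandas as pd

def sampling_k_elements(group, k=3, random_state=None):
    """Return k random rows of group, or the whole group if it has fewer than k rows."""
    if len(group) < k:
        return group
    return group.sample(k, random_state=random_state)

df = pd.DataFrame({
    "class": ["c1", "c2", "c1", "c1", "c2", "c1", "c1", "c2", "c3", "c3"],
    "val": [1, 2, 1, 1, 2, 1, 1, 2, 3, 3],
})

# Default k=3: c3 has only 2 rows, so it is returned whole and the result is uneven.
uneven = pd.concat(
    [sampling_k_elements(group, random_state=0) for _, group in df.groupby("class")],
    ignore_index=True,
)

# Setting k to the smallest class size gives the same number of rows per class.
k_min = int(df["class"].value_counts().min())
balanced = pd.concat(
    [sampling_k_elements(group, k=k_min, random_state=0) for _, group in df.groupby("class")],
    ignore_index=True,
)

print(uneven["class"].value_counts())
print(balanced["class"].value_counts())
```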
The following code undersamples the unbalanced classes. It is verbose, sorry about that, but try it. The same approach also works for upsampling. Good luck!
Import required sampling libraries
import pandas as pd
from sklearn.utils import resample
Define the majority and minority class
df_minority9 = df[df['class']=='c9']
df_majority1 = df[df['class']=='c1']
df_majority2 = df[df['class']=='c2']
df_majority3 = df[df['class']=='c3']
df_majority4 = df[df['class']=='c4']
df_majority5 = df[df['class']=='c5']
df_majority6 = df[df['class']=='c6']
df_majority7 = df[df['class']=='c7']
df_majority8 = df[df['class']=='c8']
Undersample majority class
maj_class1 = resample(df_majority1,
                      replace=False,   # sample without replacement when undersampling
                      n_samples=1324,  # size of the smallest class (c9)
                      random_state=123)
maj_class2 = resample(df_majority2,
                      replace=False,
                      n_samples=1324,
                      random_state=123)
maj_class3 = resample(df_majority3,
                      replace=False,
                      n_samples=1324,
                      random_state=123)
maj_class4 = resample(df_majority4,
                      replace=False,
                      n_samples=1324,
                      random_state=123)
maj_class5 = resample(df_majority5,
                      replace=False,
                      n_samples=1324,
                      random_state=123)
maj_class6 = resample(df_majority6,
                      replace=False,
                      n_samples=1324,
                      random_state=123)
maj_class7 = resample(df_majority7,
                      replace=False,
                      n_samples=1324,
                      random_state=123)
maj_class8 = resample(df_majority8,
                      replace=False,
                      n_samples=1324,
                      random_state=123)
Combine minority class with undersampled majority class
df = pd.concat([df_minority9, maj_class1, maj_class2, maj_class3, maj_class4, maj_class5, maj_class6, maj_class7, maj_class8])
Display new balanced class counts
df['class'].value_counts()
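The per-class blocks above repeat the same call, so they could be collapsed into a loop. This is a minimal sketch on the toy frame from the question, not the answer's own code. It assumes scikit-learn is installed, derives the target size from the data instead of hard-coding 1324, and uses replace=False so that no row is drawn twice.

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({
    "class": ["c1", "c2", "c1", "c1", "c2", "c1", "c1", "c2", "c3", "c3"],
    "val": [1, 2, 1, 1, 2, 1, 1, 2, 3, 3],
})

# Size of the smallest class; every class is cut down to this many rows.
n_min = int(df["class"].value_counts().min())

# Resample each class separately, without replacement.
parts = [
    resample(group, replace=False, n_samples=n_min, random_state=123)
    for _, group in df.groupby("class")
]

balanced = pd.concat(parts, ignore_index=True)
print(balanced["class"].value_counts())
```

For upsampling, the same loop would need replace=True and n_samples set to the size of the largest class.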
I know this question is old, but I stumbled across it and wasn’t really happy with the solutions here and in other threads. I made a quick solution using a list comprehension that works for me. Maybe it is useful to someone else:
df_for_training_grouped = df_for_training.groupby("sentiment")
min_size = df_for_training_grouped.size().min()
frames_of_groups = [x.sample(min_size) for _, x in df_for_training_grouped]
new_df = pd.concat(frames_of_groups)
The result is a dataframe that contains the same number of rows for each group. The number of rows per group is set to the size of the smallest group.
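Newer pandas versions also offer a sample method on the groupby object itself, which makes the list comprehension unnecessary. The following is a minimal sketch that assumes pandas 1.1 or later (where DataFrameGroupBy.sample is available) and reuses the toy frame from the question; random_state is passed only for repeatability.

```python
import pandas as pd

df = pd.DataFrame({
    "class": ["c1", "c2", "c1", "c1", "c2", "c1", "c1", "c2", "c3", "c3"],
    "val": [1, 2, 1, 1, 2, 1, 1, 2, 3, 3],
})

grouped = df.groupby("class")

# Size of the smallest group.
n_min = int(grouped.size().min())

# Draw n_min rows from every group, then discard the original row labels.
balanced = grouped.sample(n=n_min, random_state=42).reset_index(drop=True)

print(balanced["class"].value_counts())
```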