sklearn stratified sampling based on a column
Question:
I have a fairly large CSV file containing Amazon review data, which I read into a pandas DataFrame. I want to split the data 80-20 (train-test), but while doing so I want to ensure that both splits proportionally represent the values of one column (Categories), i.e. every review category is present in both the train and the test data in the same proportion.
The data looks like this:
| **ReviewerID** | **ReviewText** | **Categories** | **ProductId** |
|---|---|---|---|
| 1212 | good product | Mobile | 14444425 |
| 1233 | will buy again | drugs | 324532 |
| 5432 | not recomended | dvd | 789654123 |
I'm using the following code to do so:
import pandas as pd
Meta = pd.read_csv('C:\Users\xyz\Desktop\WM Project\Joined.csv')
import numpy as np
from sklearn.cross_validation import train_test_split
train, test = train_test_split(Meta.categories, test_size = 0.2, stratify=y)
It gives the following error:
NameError: name 'y' is not defined
As I'm relatively new to Python, I can't figure out what I'm doing wrong, or whether this code will stratify based on the Categories column. It seems to work fine when I remove the stratify option as well as the categories column from the train-test split.
Any help will be appreciated.
Answers:
sklearn.model_selection.train_test_split
stratify : array-like or None (default is None)
If not None, data is split in a stratified fashion, using this as the class labels.
Following the API docs, I think you have to try something like X_train, X_test, y_train, y_test = train_test_split(Meta_X, Meta_Y, test_size=0.2, stratify=Meta_Y). Meta_X and Meta_Y should be assigned properly by you (I think Meta_Y should be Meta.categories based on your code).
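As a minimal sketch of the stratify parameter described above, the whole DataFrame can be passed as the data and the column as the labels. The toy data, the 60/30/10 category mix, and the random_state value are assumptions made up for illustration; they are not from the question's CSV.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 reviews, 60% Mobile, 30% dvd, 10% drugs.
df = pd.DataFrame({
    "ReviewerID": range(100),
    "Categories": ["Mobile"] * 60 + ["dvd"] * 30 + ["drugs"] * 10,
})

# Split the whole frame 80/20, using the Categories column as the strata.
train, test = train_test_split(
    df, test_size=0.2, stratify=df["Categories"], random_state=42
)

# Per-category row counts in each split, to compare against the full data.
train_counts = train["Categories"].value_counts().to_dict()
test_counts = test["Categories"].value_counts().to_dict()
print(train_counts)
print(test_counts)
```

Comparing the two dictionaries against the category mix of the full frame is a quick way to check that the split kept the proportions.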
>>> import pandas as pd
>>> Meta = pd.read_csv(r'C:\Users\*****\Downloads\so\Book1.csv')  # raw string, so the backslashes are not read as escapes
>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> y = Meta.pop('Categories')
>>> Meta
ReviewerID ReviewText ProductId
0 1212 good product 14444425
1 1233 will buy again 324532
2 5432 not recomended 789654123
>>> y
0 Mobile
1 drugs
2 dvd
Name: Categories, dtype: object
>>> X = Meta
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)
>>> X_test
ReviewerID ReviewText ProductId
0 1212 good product 14444425
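One caveat when trying this on a tiny sample: stratification needs at least two rows per category, so with one row per category (as in the three-row table from the question) recent scikit-learn versions are expected to raise a ValueError rather than return a split. The sketch below only demonstrates that behaviour; the variable name error is introduced here for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# The three sample rows from the question: every category occurs exactly once.
df = pd.DataFrame({
    "ReviewerID": [1212, 1233, 5432],
    "ReviewText": ["good product", "will buy again", "not recomended"],
    "Categories": ["Mobile", "drugs", "dvd"],
    "ProductId": [14444425, 324532, 789654123],
})

try:
    train_test_split(df, test_size=0.33, stratify=df["Categories"], random_state=42)
    error = None
except ValueError as exc:
    # A category with a single row cannot appear in both train and test.
    error = str(exc)

print(error)
```

On the real data this only matters for very rare categories, which may need to be merged or dropped before a stratified split.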
I am not sure why nobody has mentioned StratifiedShuffleSplit:
from sklearn.model_selection import StratifiedShuffleSplit
# One stratified 80/20 split; with n_splits > 1 the loop would keep only the last split.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(df, df['Categories']):
    # split() yields positional indices, so select rows with iloc.
    strat_train_set = df.iloc[train_index]
    strat_test_set = df.iloc[test_index]
For documentation, refer to StratifiedShuffleSplit.
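StratifiedShuffleSplit.split() returns positional row numbers rather than index labels, which matters when the DataFrame does not have the default 0..n-1 index. The following sketch uses a deliberately shifted index to show this; the toy data, the 50/50 split, and the variable names are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical data whose index labels (100..109) differ from row positions (0..9).
df = pd.DataFrame(
    {"Categories": ["Mobile"] * 6 + ["dvd"] * 4},
    index=range(100, 110),
)

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)

# Take the single split; the arrays hold positions, so iloc is the safe selector.
train_pos, test_pos = next(splitter.split(df, df["Categories"]))
strat_train = df.iloc[train_pos]
strat_test = df.iloc[test_pos]

print(strat_test["Categories"].value_counts().to_dict())
```

With this index, df.loc[train_pos] would fail with a KeyError because labels 0..9 do not exist, which is why the sketch selects by position.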
You don't need to use sklearn; use DataFrame.groupby with DataFrame.sample instead:
df.groupby([cols]).apply(lambda f: f.sample(frac=ratio))
Note: you might also need to reset_index(drop=True) afterwards.
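The groupby approach above draws one stratified sample, so the other half of the split has to be built from the remaining rows. A possible sketch follows; it swaps the apply/lambda form for DataFrameGroupBy.sample, which assumes pandas 1.1 or newer, and the toy data and the 20% fraction are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: 30 Mobile reviews and 20 dvd reviews.
df = pd.DataFrame({
    "ReviewerID": range(50),
    "Categories": ["Mobile"] * 30 + ["dvd"] * 20,
})

# Sample 20% of the rows within each category to form the test set.
test = df.groupby("Categories").sample(frac=0.2, random_state=0)

# Everything not sampled becomes the training set (the index labels are unique here).
train = df.drop(test.index)

print(test["Categories"].value_counts().to_dict())
print(len(train), len(test))
```

Because the sampled rows keep their original index labels, drop(test.index) is enough to get the complement; this relies on the index being unique.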