How can I partition a data set (CSV file) with the systematic sampling method in Python?
Question:
Here are the requirements:
- Partitioning data set into train data set and test data set.
- Systematic sampling should be used when partitioning data.
- The train data set should be about 80% of all data points and the test data set should be 20% of them.
I have tried some code:

    import numpy as np

    def systematic_sampling(df, step):
        indexes = np.arange(0, len(df), step=step)
        systematic_sample = df.iloc[indexes]
        return systematic_sample

and
    from sklearn.model_selection import train_test_split

    df_train, df_test = train_test_split(df, test_size=0.2)

The code either does systematic sampling or data partitioning, but I'm not sure how to satisfy both conditions at the same time.
Answers:
Systematic sampling:
It is a sampling technique in which the first element is selected at random and the rest are selected at a fixed sampling interval. For instance, consider a population of size 20 (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20).
Suppose the first selected element is 3 and we want a sample of size 5. The sampling interval is 20/5 = 4, so the next selection is 3 + 4 = 7, then 11, 15, and 19.
So you want to apply this technique and also partition your data into two separate sets, one with 80% and the other with 20% of the original data.
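The worked example above can be sketched with NumPy alone (the population, start index, and interval are taken directly from the example):

```python
import numpy as np

population = np.arange(1, 21)    # population of 20 elements: 1..20
start = 2                        # 0-based index of the element "3"
step = len(population) // 5      # sampling interval = 20 / 5 = 4
sample = population[start::step] # sample is [3, 7, 11, 15, 19]
```

Slicing with `start::step` is exactly the "first element plus fixed interval" rule: it takes the element at `start` and every `step`-th element after it.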
You may use the following. Since an integer step cannot select 80% of the rows directly, it is easier to systematically sample the 20% test set (every 5th row, because len(df) / (0.2 * len(df)) = 5) and keep the remaining rows as the 80% train set:

    import numpy as np
    import pandas as pd

    def systematic_sampling(df, step):
        indexes = np.arange(0, len(df), step=step)
        return df.iloc[indexes]

    # The test set is 20% of the data, so the sampling interval is
    # 1 / 0.2 = 5: every 5th row goes into the test set.
    step = int(len(df) / (0.2 * len(df)))  # = 5
    test_df = systematic_sampling(df, step)

    # The train set is "df - test_df": drop the sampled rows by index.
    # (Dropping by index is safer than concat + drop_duplicates, which
    # fails when df contains duplicate rows.)
    train_df = df.drop(test_df.index)

Now train_df holds 80% of the original data and test_df holds the 20% test data.
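As a sanity check, here is a self-contained sketch on a hypothetical 100-row frame (the column name "x" is made up), sampling the 20% test set systematically and taking the remaining rows as the train set:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(100)})  # hypothetical 100-row data set

def systematic_sampling(df, step):
    indexes = np.arange(0, len(df), step=step)
    return df.iloc[indexes]

test_df = systematic_sampling(df, step=5)  # every 5th row -> 20 rows
train_df = df.drop(test_df.index)          # remaining 80 rows

# The two sets are disjoint and together cover the whole frame.
print(len(train_df), len(test_df))  # 80 20
```

Because every 5th index (0, 5, 10, ...) lands in the test set, the split is deterministic: rerunning it on the same frame always yields the same partition, unlike a shuffled train_test_split.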
For others reading this, here is a good reference on this question: Read Me!