How can I partition a data set (CSV file) using systematic sampling? (Python)

Question:

Here are the requirements:

  • Partition the data set into a train data set and a test data set.
  • Systematic sampling should be used when partitioning data.
  • The train data set should be about 80% of all data points and the test data set should be 20% of them.

I have tried the following code:

import numpy as np

def systematic_sampling(df, step):
    # Take every `step`-th row, starting from the first one.
    indexes = np.arange(0, len(df), step=step)
    systematic_sample = df.iloc[indexes]
    return systematic_sample

and

from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.2)

The first snippet does systematic sampling and the second partitions the data, but I'm not sure how to satisfy both requirements at the same time.

Asked By: Caspar


Answers:

Systematic sampling:

It is a sampling technique in which the first element is selected at random and the others are selected at a fixed sampling interval. For instance, consider a population of size 20 (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20).

Suppose the randomly chosen first element is number 3 and we want a sample of size 5. The sampling interval is 20/5 = 4, so the next selection is 3 + 4 = 7, and the full sample is 3, 7, 11, 15 and 19.
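The arithmetic in this example can be checked with a few lines of NumPy. This is only a sketch: the variable names are made up for illustration, and the starting element 3 is hard-coded here rather than drawn at random.

```python
import numpy as np

population = np.arange(1, 21)               # elements 1..20
sample_size = 5
interval = len(population) // sample_size   # 20 / 5 = 4
start = 3                                   # the first selected element

# Positions are zero-based, so element 3 sits at position 2.
sample = population[start - 1::interval]
print(sample.tolist())  # [3, 7, 11, 15, 19]
```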

So, you want this kind of sampling and you also want to partition your data into two separate sets, one with 80% and the other with 20% of the original data.

You may use the following:

import numpy as np
import pandas as pd

def systematic_sampling(df, step):
    # Take every `step`-th row, starting from the first one.
    indexes = np.arange(0, len(df), step=step)
    systematic_sample = df.iloc[indexes]
    return systematic_sample


test_fraction = 0.2
# An 80% sample has no integer step (1 / 0.8 = 1.25), so sample the
# 20% test set systematically instead: every 5th row (1 / 0.2 = 5).
step = round(1 / test_fraction)
test_df = systematic_sampling(df, step)
# Everything that was not picked becomes the training set, i.e. "df - test_df".
# This assumes the index labels of df are unique.
train_df = df.drop(test_df.index)

Now test_df holds about 20% of the original data, taken at a fixed interval, and train_df holds the remaining 80%.
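If you also want the random starting point from the definition above, one possible variant is sketched below. The helper name systematic_train_test_split and its parameters are hypothetical, not part of pandas or scikit-learn. It uses a positional boolean mask, so it does not rely on unique index labels or on the rows being free of duplicates.

```python
import numpy as np
import pandas as pd

def systematic_train_test_split(df, test_fraction=0.2, seed=None):
    """Hypothetical helper: every k-th row goes to the test set."""
    step = round(1 / test_fraction)        # 0.2 -> every 5th row
    rng = np.random.default_rng(seed)
    start = int(rng.integers(0, step))     # random start in [0, step)
    is_test = np.zeros(len(df), dtype=bool)
    is_test[start::step] = True            # mark rows start, start+step, ...
    return df[~is_test], df[is_test]

# Small made-up data frame, just to show the resulting sizes.
demo = pd.DataFrame({"x": range(100), "y": [i % 3 for i in range(100)]})
train_part, test_part = systematic_train_test_split(demo, test_fraction=0.2, seed=0)
print(len(train_part), len(test_part))  # 80 20
```

When the number of rows is not a multiple of the step, the test share is only approximately 20%, which matches the "about 80%" wording of the requirement.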

For others reading this, the following is a good reference on this question: Read Me!

Answered By: Amirhossein Sefati