How to split data into trainset and testset randomly?
Question:
I have a large dataset and want to split it into a training set (50%) and a testing set (50%).
Say I have 100 examples stored in the input file, one example per line. I need to choose 50 lines as the training set and the other 50 lines as the testing set.
My idea is to first generate a random permutation of the numbers 1 to 100, then use the first 50 elements as the line numbers of the 50 training examples and the remaining 50 as the line numbers of the testing examples.
This could be achieved easily in Matlab:
fid=fopen(datafile);
C = textscan(fid, '%s','delimiter', '\n');
plist=randperm(100);
for i=1:50
    trainstring = C{plist(i)};
    fprintf(train_file,trainstring);
end
for i=51:100
    teststring = C{plist(i)};
    fprintf(test_file,teststring);
end
But how can I do this in Python? I’m new to Python and don’t know whether I can read the whole file into an array and then choose certain lines.
Answers:
First of all, Python’s built-in sequence type is the list rather than the array, and that does make a difference. I suggest you use NumPy, a pretty good library for Python that adds a lot of Matlab-like functionality. You can get started with the guide “NumPy for Matlab users”.
This can be done similarly in Python using lists (note that the whole list is shuffled in place):
import random
# Read the file in text mode and split it into lines
with open("datafile.txt") as f:
    data = f.read().splitlines()
random.shuffle(data)
train_data = data[:50]
test_data = data[50:]
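To mirror the Matlab snippet in the question, which writes the two halves to separate files, here is a minimal sketch. The function name split_file, its parameters, and the file names are hypothetical, and it assumes the whole file fits in memory.

```python
import random

def split_file(in_path, train_path, test_path, train_fraction=0.5, seed=None):
    """Shuffle the lines of in_path and write them to two output files."""
    with open(in_path) as f:
        lines = f.read().splitlines()
    # A private Random instance leaves the global random state untouched.
    random.Random(seed).shuffle(lines)
    cut = int(len(lines) * train_fraction)
    with open(train_path, "w") as f:
        f.writelines(line + "\n" for line in lines[:cut])
    with open(test_path, "w") as f:
        f.writelines(line + "\n" for line in lines[cut:])
    # Report how many lines went to each file.
    return cut, len(lines) - cut
```

Calling split_file("datafile.txt", "train.txt", "test.txt") would then give the 50-50 split from the question; passing a seed makes the split repeatable.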
The following produces more general k-fold cross-validation splits. Your 50-50 partitioning corresponds to k=2; all you would have to do is pick one of the two partitions produced. Note: I haven’t tested the code, but I’m pretty sure it should work.
import random, math

def k_fold(myfile, myseed=11109, k=3):
    # Load data, one example per line
    with open(myfile) as f:
        data = f.readlines()
    # Shuffle input reproducibly (random.seed is a function and must be called)
    random.seed(myseed)
    random.shuffle(data)
    # Compute partition size given input k
    len_part = int(math.ceil(len(data) / float(k)))
    # Create one partition per fold
    train = {}
    test = {}
    for ii in range(k):
        start, stop = ii * len_part, (ii + 1) * len_part
        test[ii] = data[start:stop]
        # Slice around the test fold so that duplicate lines are not dropped
        train[ii] = data[:start] + data[stop:]
    return train, test
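As a sanity check on the k-fold idea, the sketch below builds the folds from an in-memory list instead of a file. The name k_fold_lists and its signature are made up for this illustration; it is one possible way to write it, not a library function.

```python
import math
import random

def k_fold_lists(items, k=3, seed=11109):
    """Return (train, test): two lists holding one fold each per index."""
    data = list(items)
    random.Random(seed).shuffle(data)
    len_part = math.ceil(len(data) / k)
    train, test = [], []
    for i in range(k):
        start, stop = i * len_part, (i + 1) * len_part
        test.append(data[start:stop])
        # Everything outside the test slice is training data for this fold.
        train.append(data[:start] + data[stop:])
    return train, test

# With k=2, either fold is a 50-50 split of the 100 examples.
train, test = k_fold_lists(range(100), k=2)
print(len(train[0]), len(test[0]))
```

Every example appears in exactly one test fold, and within a fold the training and test parts do not overlap.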
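You could also use NumPy, when your data is stored in a numpy.ndarray:
import numpy as np
from random import sample
# data is assumed to be an existing numpy.ndarray with one example per row
l = len(data)  # length of data (100 in the question)
f = 50  # number of elements you need
indices = sample(range(l), f)
train_data = data[indices]
test_data = np.delete(data, indices, axis=0)  # axis=0 keeps the rows of a 2-D array intact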
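An alternative that stays entirely inside NumPy is to shuffle the row indices once and slice the permutation, which is close to what randperm does in the Matlab code. This is a sketch under assumptions: the seed 0 is arbitrary and the array built with np.arange only stands in for real data.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed, for repeatable splits
data = np.arange(200).reshape(100, 2)   # stand-in data: 100 rows, 2 columns

perm = rng.permutation(len(data))       # shuffled row indices 0..99
cut = len(data) // 2
train_data = data[perm[:cut]]           # rows selected by the first half of the permutation
test_data = data[perm[cut:]]            # rows selected by the second half
```

Because both halves come from the same permutation, no row can end up in both sets or in neither.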
You can try this approach (for older scikit-learn releases that still include sklearn.cross_validation):
import pandas
from sklearn.cross_validation import train_test_split
csv = pandas.read_csv('data.csv')
train, test = train_test_split(csv, train_size=0.5)
UPDATE: train_test_split was moved to model_selection, so the current way (scikit-learn 0.22.2) to do it is this:
import pandas
from sklearn.model_selection import train_test_split
csv = pandas.read_csv('data.csv')
train, test = train_test_split(csv, train_size=0.5)
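If the split needs to be repeatable, train_test_split accepts a random_state argument. The sketch below uses a plain list as a stand-in for the lines of a file; the value 42 is an arbitrary choice.

```python
from sklearn.model_selection import train_test_split

# Stand-in for the 100 lines of the input file.
examples = [f"example {i}" for i in range(100)]

# random_state fixes the shuffle, so repeated calls give the same split.
train, test = train_test_split(examples, test_size=0.5, random_state=42)
train_again, test_again = train_test_split(examples, test_size=0.5, random_state=42)
```

Leaving random_state unset gives a different split on each run, which is what the snippets above do.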
from sklearn.model_selection import train_test_split
import numpy
# Read the file in text mode and split it into lines
with open("datafile.txt") as f:
    data = f.read().splitlines()
data = numpy.array(data)  # convert the list to a NumPy array
x_train, x_test = train_test_split(data, test_size=0.5)  # test_size=0.5 holds out half of the data
To answer @desmond.carros’s question, I modified the best answer as follows:
import random
delimiter = ","  # placeholder: replace with your preferred delimiter
data = []
with open("datafile.txt", "r") as file:
    for line in file:
        data.append(line.rstrip("\n").split(delimiter))
random.shuffle(data)
split = int((len(data) + 1) * .80)
train_data = data[:split]  # first 80% of the shuffled data goes to the training set
test_data = data[split:]  # remaining 20% goes to the test set
The code splits the entire dataset into 80% training data and 20% test data.
sklearn.cross_validation has been deprecated since version 0.18; you should use sklearn.model_selection instead, as shown below:
from sklearn.model_selection import train_test_split
import numpy
# Read the file in text mode and split it into lines
with open("datafile.txt") as f:
    data = f.read().splitlines()
data = numpy.array(data)  # convert the list to a NumPy array
x_train, x_test = train_test_split(data, test_size=0.5)  # test_size=0.5 holds out half of the data
A quick note on the answer from @subin sahayam:
import random
delimiter = ","  # placeholder: replace with your preferred delimiter
data = []
with open("datafile.txt", "r") as file:
    for line in file:
        data.append(line.rstrip("\n").split(delimiter))
random.shuffle(data)
train_data = data[:int((len(data)+1)*.80)]  # first 80% goes to the training set
test_data = data[int(len(data)*.80+1):]  # remaining data goes to the test set
Be careful with the + 1 in the last line (repeated below). Whenever the two computed indices differ, the test slice starts one position after the training slice ends, so one example lands in neither set: with 100 lines, the training set is data[:80] and the test set is data[81:], and data[80] is dropped. Check the size of the list first and then decide whether you need to add the 1.
test_data = data[int(len(data)*.80+1):]
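One way to sidestep the question of when to add 1 is to compute the cut index once and use it for both slices. The helper below is a sketch; the name split_by_fraction and its parameters are invented for this example.

```python
import random

def split_by_fraction(items, train_fraction=0.8, seed=None):
    """Shuffle a copy of items and cut it once into (train, test)."""
    data = list(items)
    random.Random(seed).shuffle(data)
    cut = int(len(data) * train_fraction)
    # The same index ends the training slice and starts the test slice,
    # so no element can be skipped or duplicated.
    return data[:cut], data[cut:]

for n in (9, 10, 100):
    train, test = split_by_fraction(range(n))
    assert len(train) + len(test) == n
```

Since int() truncates, any leftover example from a size that does not divide evenly goes to the test set.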