How to split a CSV file into two files: one containing 40% of the original data, the other 60%. The data should be shuffled first
Question:
I have a csv file. The columns are [‘A’ ‘B’ ‘C’], and there are 1000 rows of original data.
A B C
1 0 1
-1 2 0
.
.
.
1 0 0.
So I need 40% of this data in one CSV file and 60% in the other. But first, the rows must be shuffled randomly, ideally using the pandas module in Python.
I tried
import numpy as np
import pandas as pd
df=pd.read_csv('filename.csv')
np.random.permutation(df)
df[0:400].to_csv('filename1.csv')
df[401:].to_csv('filename2.csv')
but np.random.permutation(df) returns only arrays.
Answers:
Try it this way, with shuffling before saving (complete snippet):
import numpy as np
import pandas as pd
df = pd.read_csv('filename.csv')
per = 40  # percentage of rows for the first file
# shuffle all rows first, then split at the 40% mark
shuffled = df.iloc[np.random.permutation(len(df))]
split = int(len(shuffled) * per / 100)
perdf = shuffled.iloc[:split]
perdf.to_csv('40perdf.csv', index=False)
perdf60 = shuffled.iloc[split:]
perdf60.to_csv('60perdf.csv', index=False)
Note: not tested. Please test it and let me know.
Use pandas.DataFrame.sample to get a shuffled 40% without replacement, then drop those rows from the main table to get the 60%.
df_40 = df.sample(frac=0.4)
df_60 = df.drop(df_40.index)
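A slightly fuller sketch of the same idea, in case it helps. The function name split_shuffled, the seed, and the in-memory demo frame are assumptions for illustration, not part of the answer above. Two things to keep in mind: df.drop(df_40.index) relies on the index being unique (true for the default index that read_csv creates), and the remaining 60% keeps its original row order unless it is shuffled too, which is what the extra sample(frac=1) does here.

```python
import pandas as pd

def split_shuffled(df, frac=0.4, seed=None):
    """Return (part_a, part_b): a random `frac` of the rows and the shuffled rest.

    Hypothetical helper; assumes df has a unique index.
    """
    part_a = df.sample(frac=frac, random_state=seed)
    # drop the sampled rows, then shuffle what is left
    part_b = df.drop(part_a.index).sample(frac=1, random_state=seed)
    return part_a, part_b

# demo frame standing in for pd.read_csv('filename.csv')
df = pd.DataFrame({"A": list(range(1000)), "B": [0] * 1000, "C": [1] * 1000})
df_40, df_60 = split_shuffled(df, frac=0.4, seed=0)
# to write the files:
# df_40.to_csv('filename1.csv', index=False)
# df_60.to_csv('filename2.csv', index=False)
```

Passing a seed makes the split reproducible; leave it as None for a different shuffle on every run.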
The problem is that you don't use the result of the permutation: np.random.permutation returns a new, shuffled array and leaves df itself unchanged.
import pandas as pd
import numpy as np
df = pd.read_csv(r"C:\temp\test1.csv", sep=',')
# source file like this
# A,B,C
# 0,1,1
# 0,0,0
# 1,1,0
# 0,0,0
# 0,0,1
# 2,0,0
# permutation returns a plain NumPy array, so wrap it in a DataFrame and restore the column names
df = pd.DataFrame(np.random.permutation(df), columns=df.columns)
split_place = int(df.shape[0]*0.4)
df[0:split_place].to_csv(r'c:\temp\filename1.csv', index=False, columns=None, sep=',')
# the file then contains something like
# A,B,C
# 0,0,1
# 0,0,0
df[split_place:].to_csv(r'c:\temp\filename2.csv', index=False, sep=',')
# if you don't need the header, use header=False
More information about saving to CSV is in the pandas documentation.
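One caveat with passing the whole DataFrame to np.random.permutation: the conversion to a NumPy array drops the column names and, when the columns have different types, changes their dtypes. A minimal sketch that permutes row positions instead and keeps the DataFrame intact; the demo data, the seed, and the names first and second are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# small demo frame with a text column, standing in for the CSV contents
df = pd.DataFrame({
    "A": [1, -1, 1, 0, 2],
    "B": [0, 2, 0, 1, 0],
    "C": ["x", "y", "z", "w", "v"],
})

rng = np.random.default_rng(0)                 # seeded only for reproducibility
shuffled = df.iloc[rng.permutation(len(df))]   # shuffle row positions, not values

split_place = int(len(shuffled) * 0.4)
first = shuffled.iloc[:split_place]            # 40% of the rows
second = shuffled.iloc[split_place:]           # remaining 60%
```

Because only the row order changes, the column names and the integer dtype of A and B survive, and both parts can be written with to_csv(..., index=False) as above.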