How to split a CSV file into two files: one containing 40% of the original data, the other 60%. The data should be shuffled first

Question

I have a CSV file. The columns are ['A', 'B', 'C'], and there are 1000 rows of original data.
A B C
1 0 1
-1 2 0
.
.
.
1 0 0
So I need 40% of this data in one CSV file and 60% in the other. But first, the rows must be shuffled randomly, preferably using the pandas module in Python.

I tried

import pandas as pd
import numpy as np
df=pd.read_csv('filename.csv')
np.random.permutation(df)
df[0:400].to_csv('filename1.csv')
df[401:].to_csv('filename2.csv')

but np.random.permutation(df) returns only arrays.

Asked By: Apy_dum

Answer 1

Try it this way, shuffling before saving. Here is a complete snippet:

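import numpy as np
import pandas as pd

df = pd.read_csv('filename.csv')

per = 40
# Shuffle all rows first, so that the split itself is random
df = df.iloc[np.random.permutation(len(df))]
# Number of rows that go into the 40% file
split = int(len(df) * per / 100)

# First 40% of the shuffled rows
perdf = df.iloc[:split]
perdf.to_csv('40perdf.csv', index=False)

# Remaining 60% of the shuffled rows
perdf60 = df.iloc[split:]
perdf60.to_csv('60perdf.csv', index=False)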

Note: not tested. Please test it and let me know.
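
If the split needs to be repeatable, the same shuffle-then-cut idea can be wrapped in a small function with a seed. This is only a sketch: the name shuffle_split, its frac and seed parameters, and the stand-in data are assumptions made for illustration, not part of the answer above.

```python
import numpy as np
import pandas as pd


def shuffle_split(df, frac=0.4, seed=None):
    """Shuffle the rows of df, then cut them into a frac part and the rest."""
    rng = np.random.default_rng(seed)   # seed=None gives a different shuffle each run
    order = rng.permutation(len(df))    # shuffled row positions 0..len(df)-1
    cut = int(len(df) * frac)           # number of rows in the first part
    return df.iloc[order[:cut]], df.iloc[order[cut:]]


# Stand-in data (hypothetical); in practice this would be pd.read_csv('filename.csv')
df = pd.DataFrame({'A': range(1000), 'B': 0, 'C': 1})
part_40, part_60 = shuffle_split(df, frac=0.4, seed=42)
print(len(part_40), len(part_60))  # 400 600
```

Passing the same seed again reproduces the same two parts, which can help when the files need to be regenerated later.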

Answered By: Bhargav

Answer 2

Use pandas.DataFrame.sample to draw a shuffled 40% of the rows without replacement, then drop those rows from the original DataFrame to get the remaining 60%.

df_40 = df.sample(frac=0.4)
df_60 = df.drop(df_40.index)
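
A fuller sketch of this approach, end to end. It assumes the row index is unique (the default after read_csv), because drop removes rows by index label. The generated stand-in data, the random_state values, and the extra sample(frac=1) that shuffles the 60% part are assumptions added here; the output file names are borrowed from the question.

```python
import numpy as np
import pandas as pd

# Stand-in for pd.read_csv('filename.csv'): 1000 rows with columns A, B, C
df = pd.DataFrame(np.random.randint(-1, 3, size=(1000, 3)), columns=['A', 'B', 'C'])

# Random 40% of the rows, drawn without replacement (random_state is optional)
df_40 = df.sample(frac=0.4, random_state=0)

# The rows that were not drawn form the other 60%; sample(frac=1) shuffles them too
df_60 = df.drop(df_40.index).sample(frac=1, random_state=0)

# index=False keeps the old row labels out of the output files
df_40.to_csv('filename1.csv', index=False)
df_60.to_csv('filename2.csv', index=False)
```

Without the sample(frac=1) step, df_60 would hold the right rows but still in their original file order.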

Answered By: EM77

Answer 3

The problem is that you don't keep the result of the permutation: np.random.permutation returns a new, shuffled array and leaves df unchanged.

import pandas as pd
import numpy as np

df = pd.read_csv(r"C:temptest1.csv", sep=',')
# source file like this
# A,B,C
# 0,1,1
# 0,0,0
# 1,1,0
# 0,0,0
# 0,0,1
# 2,0,0

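# np.random.permutation returns a shuffled NumPy array, so rebuild the
# DataFrame from it and restore the original column names
df = pd.DataFrame(np.random.permutation(df), columns=df.columns)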

split_place = int(df.shape[0]*0.4)
df[0:split_place].to_csv(r'c:tempfilename1.csv', index=False, columns=None, sep=',')
# the file will contain something like
# A,B,C
# 0,0,1
# 0,0,0

df[split_place:].to_csv(r'c:tempfilename2.csv',index=False,  sep=',')
# if you don't need the header row, pass header=False

More information about saving to CSV can be found in the pandas documentation.
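
To see what np.random.permutation actually hands back, here is a small sketch on made-up data (the column values are invented for illustration). It also shows a side effect of rebuilding the DataFrame from the array: when the columns have mixed numeric types, they all come back with one common dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical data: A and B are integer columns, C is a float column
df = pd.DataFrame({'A': [0, 0, 1, 0], 'B': [1, 0, 1, 0], 'C': [1.5, 0.0, 0.0, 1.0]})

shuffled = np.random.permutation(df)
# shuffled is a new NumPy array of rows; df itself is left untouched
print(type(shuffled))
print(df.index.tolist())  # [0, 1, 2, 3]

# Rebuilding a DataFrame from the array restores the column names, but the
# array has a single dtype, so the integer columns come back as floats here
rebuilt = pd.DataFrame(shuffled, columns=df.columns)
print(rebuilt.dtypes)
```

If the original dtypes matter, shuffling row positions with df.iloc or using df.sample avoids the round trip through a NumPy array.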

Answered By: Deiv_vieD
