Python + Pandas split CSV file by rows with headers/columns in every file

Question:

I'm pretty new to Python but already having some success.

There is just one small detail I cannot figure out.

As the title says, I'm splitting huge CSV files with weather data (almost a million rows).
The splitting works well, but the header row ends up in the first file only.

The data looks like this:

year;month;day;date;Id;Po;P;;T;st;sn;Tx;sn;Tn;e;R1;Rd;nr;S1;ps;mp;mT;mTx;mTn;me;mR;mS
2003;3;1;01.03.2003;1001;10047;10059;1;27;46;0;1;1;52;45;56;3;13;;;0;0;0;0;0;0;
2003;3;1;01.03.2003;1008;9995;10031;1;173;45;1;142;1;211;13;18;3;7;;;0;0;0;0;0;0;
2003;3;1;01.03.2003;1025;10058;10068;1;2;27;0;22;1;25;50;182;6;21;;;0;0;0;0;0;0;
2003;3;1;01.03.2003;1026;9924;10067;1;6;26;0;18;1;28;49;183;6;22;53;47;0;0;0;0;0;0;0
2003;3;1;01.03.2003;1028;9991;10011;1;84;57;1;36;1;128;33;47;5;15;;;0;0;0;0;0;0;
2003;3;1;01.03.2003;1098;10006;10024;1;18;29;0;10;1;46;43;58;5;15;;;0;0;0;0;0;0;
2003;3;1;01.03.2003;1152;10092;10108;0;18;26;0;42;1;2;57;110;5;21;60;53;0;0;0;0;0;0;0
2003;3;1;01.03.2003;1212;10148;10166;0;53;13;0;69;0;38;71;;;;;;0;0;0;0;0;;
2003;3;1;01.03.2003;1238;9030;10192;1;29;42;0;6;1;58;37;5;1;2;;;0;0;0;0;0;0;
2003;3;1;01.03.2003;1241;10148;10159;0;44;24;0;68;0;23;65;55;3;12;;;0;0;0;0;0;0;
2003;3;1;01.03.2003;1271;10143;10167;0;33;29;0;65;0;2;59;39;3;9;;;0;0;0;0;0;0;
2003;3;1;01.03.2003;1317;10152;10197;0;48;13;0;80;0;21;72;95;2;12;;;0;0;0;0;0;0;
2003;3;1;01.03.2003;1384;9955;10208;0;3;37;0;52;1;35;50;21;2;4;;;0;0;0;0;0;0;
2003;3;1;01.03.2003;1389;;;1;6;39;0;57;1;55;50;18;2;3;;;;0;0;0;0;0;
...
(the dots just indicate that there is more data below ;) )

I'd like to keep the header in every CSV file written, not just in the first one.

The code so far (with some tkinter fields):

def splitFiles():
    file_path = str(CVS_file_source.get())
    new_filename = str(new_filename_entry.get())
    path_destination = str(folder_path_destination.get())
    file_destination = os.path.join(path_destination, new_filename)
    dattype = field_dattype.get()

    # csv file name to be read in
    in_csv = file_path

    # get the number of lines of the csv file to be read
    number_lines = sum(1 for _ in open(in_csv))

    # number of rows of data to write to each csv;
    # change the row size according to your needs
    rowsize = rows.get()

    # loop through the data, writing a new file for each set of rows
    for i in range(0, number_lines, rowsize):
        df = pd.read_csv(in_csv,
                         nrows=rowsize,   # number of rows to read at each loop
                         skiprows=i)      # skip rows that have been read

        # csv to write data to a new file with indexed name: input_0.csv etc.
        out_csv = file_destination + '_' + str(i) + dattype

        df.to_csv(out_csv,
                  index=False,
                  header=True,
                  mode='a',            # append data to the csv file
                  chunksize=rowsize)   # size of data to append for each loop

(I adapted the code from this question:
How to split csv file keeping its header in each smaller files in Python? )

What is the code missing? It does not work for me the way that question suggests.

Any help will be great!

Asked By: lilaaffe


Answers:

If you want to save the file in split form, you don't really need a huge function. Say you'd like to save every 100k rows:

for i in range(round(len(df) / 10**5) + 1):
    df.iloc[i*10**5:(i+1)*10**5, :].to_csv('path_to_save_file_' + str(i*10**5) + '_' + str((i+1)*10**5) + '.csv')
    print("Saving file with rows from: ", i*10**5, "to", (i+1)*10**5)

Or you can do it in a single line with a list comprehension:

[df.iloc[i*10**5:(i+1)*10**5, :].to_csv('path_to_save_file_' + str(i*10**5) + '_' + str((i+1)*10**5) + '.csv') for i in range(round(len(df)/10**5)+1)]

This will essentially write a csv with rows 0 to 100000, then 100000 to 200000, and so on, with the row numbers in each file name so you can easily identify the files. Returning:

Saving file with rows from:  0 to 100000
Saving file with rows from:  100000 to 200000
Saving file with rows from:  200000 to 300000
Saving file with rows from:  300000 to 400000
Saving file with rows from:  400000 to 500000
Saving file with rows from:  500000 to 600000
Saving file with rows from:  600000 to 700000
Saving file with rows from:  700000 to 800000
Saving file with rows from:  800000 to 900000
Saving file with rows from:  900000 to 1000000
Saving file with rows from:  1000000 to 1100000
Answered By: Celius Stingher
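
A variant of the slicing approach above steps over row positions directly with range(0, len(df), step); this avoids the empty trailing file that round(len(df)/10**5)+1 can produce when the row count is just over a multiple of the step. A minimal sketch (the DataFrame and file names here are illustrative, not from the question):

```python
import pandas as pd

# small illustrative DataFrame standing in for the loaded weather data
df = pd.DataFrame({"Id": range(250), "T": range(250)})

step = 100  # rows per output file
for start in range(0, len(df), step):
    end = min(start + step, len(df))
    chunk = df.iloc[start:end]
    # to_csv writes the header by default, so every file keeps the columns
    chunk.to_csv(f"weather_{start}_{end}.csv", index=False)
    print("Saving file with rows from:", start, "to", end)
```

With 250 rows and step 100 this writes three files of 100, 100, and 50 rows, each with the header line.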

The read_csv function is very handy; it can do many tricks, including reading in chunks, so you can do something like this:

i = 1

for df in pd.read_csv('file.csv', chunksize=1000):
                                  #^^^^^^^^^^^^^^ number of rows to read per chunk
    df.to_csv(f'file (chunk {i}).csv', index=False)
    i += 1
Answered By: SergFSM
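
One detail worth noting for the data in the question: it is semicolon-delimited, so the read needs sep=';' (and, if the output should stay semicolon-separated, so does the write). A minimal sketch combining this with the chunked read above (file names and the sample data are illustrative):

```python
import pandas as pd

# build a small semicolon-delimited file shaped like the weather data
with open("weather.csv", "w") as f:
    f.write("year;month;day;Id;T\n")
    for n in range(10):
        f.write(f"2003;3;1;{1001 + n};{27 + n}\n")

# read in chunks; each chunk is a DataFrame with the header attached,
# so every output file keeps the column names
for i, chunk in enumerate(pd.read_csv("weather.csv", sep=";", chunksize=4), start=1):
    chunk.to_csv(f"weather_part_{i}.csv", sep=";", index=False)
```

With 10 data rows and chunksize=4 this produces three files of 4, 4, and 2 rows, each starting with the header line.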