Split a csv file into multiple files based on a pattern

Question:

I have a csv file with the following structure:

time,magnitude
0,13517
292.5669,370
620.8469,528
0,377
832.3269,50187
5633.9419,3088
20795.0950,2922
21395.6879,2498
21768.2139,647
21881.2049,194
0,3566
292.5669,370
504.1510,712
1639.4800,287
46709.1749,365
46803.4400,500

I’d like to split this csv file into separate csv files, like the following:

File 1:

time,magnitude
0,13517
292.5669,370
620.8469,528

File 2:

time,magnitude
0,377
832.3269,50187
5633.9419,3088
20795.0950,2922
21395.6879,2498

and so on..

I’ve read several similar posts (e.g., this, this, or this one), but they all search for specific values in a column and save each group of values into a separate file. In my case, however, the values in the time column are not the same from group to group. I’d like to split based on a condition: if time = 0, save that row and all subsequent rows in a new file until the next row where time = 0.

Can someone please let me know how to do this?

Asked By: mOna


Answers:

I took the liberty of creating some additional data similar to what you provided in order to test my solution. I also used a DataFrame instead of an input csv file. Here is my solution:

import pandas as pd

# Create a sample DataFrame modeled on the data in the question

data = {
   'time': [0, 292.5669, 620.8469, 0, 832.3269, 5633.9419, 20795.0950, 21395.6879, 0, 230.5678, 456.8468, 0, 784.3265, 5445.9452, 20345.0980, 21095.6898],
   'magnitude': [13517, 370, 528, 377, 50187, 3088, 2922, 2498, 13000, 369, 527, 376, 50100, 3087, 2921, 2497]
}

df = pd.DataFrame(data)

# Function to split a DataFrame based on a pattern

def split_dataframe_by_pattern(df, output_prefix):
    file_count = 1
    current_group = pd.DataFrame(columns=df.columns)  # Initialize the current group

    for index, row in df.iterrows():
        if row['time'] == 0 and not current_group.empty:  # If time = 0 and the current group is not empty, create a new file
            output_file = f'{output_prefix}_{file_count}.csv'

            # Save the current group to the new file

            current_group.to_csv(output_file, index=False)
            current_group = pd.DataFrame(columns=df.columns)  # Reset the current group
            file_count += 1

        # Use pandas.concat to append the row to the current group
        current_group = pd.concat([current_group, row.to_frame().T], ignore_index=True)

    # Save the last group to a file

    current_group.to_csv(f'{output_prefix}_{file_count}.csv', index=False)

# Example usage:
output_prefix = 'output_file'
split_dataframe_by_pattern(df, output_prefix)

My output is four csv files:

output_file_1.csv

time,magnitude
0.0,13517.0
292.5669,370.0
620.8469,528.0

output_file_2.csv

time,magnitude
0.0,377.0
832.3269,50187.0
5633.9419,3088.0
20795.095,2922.0
21395.6879,2498.0

output_file_3.csv

time,magnitude
0.0,13000.0
230.5678,369.0
456.8468,527.0

output_file_4.csv

time,magnitude
0.0,376.0
784.3265,50100.0
5445.9452,3087.0
20345.098,2921.0
21095.6898,2497.0
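
A note on the `.0` suffixes in the files above: `iterrows()` yields each row as a single Series, so a row that mixes the float `time` with the integer `magnitude` is upcast to float. The snippet below is only a small illustration of that behaviour on a two-row frame made up for this purpose; slicing the original DataFrame instead keeps the integer column intact.

```python
import pandas as pd

# Tiny illustrative frame: float "time" column, integer "magnitude" column
df = pd.DataFrame({"time": [0, 292.5669], "magnitude": [13517, 370]})

# iterrows() returns (index, Series); the Series has one dtype for the whole row
_, first_row = next(df.iterrows())
print(first_row.dtype)

# A positional slice keeps each column's own dtype
first_slice = df.iloc[0:1]
print(first_slice["magnitude"].dtype)
```

If integer magnitudes matter in the output files, writing slices or groups of the original DataFrame avoids the conversion.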
Answered By: cconsta1

You can do that with pandas very easily like this:

import pandas as pd
df = pd.read_csv("mydata.csv")
last_idx = 0
file_idx = 0
for i, time in enumerate(df.time):
    # A zero after the first row marks the start of the next group
    if time == 0 and i != 0:
        df.iloc[last_idx:i].to_csv(f"mydata_{file_idx}.csv", index=False)
        file_idx += 1
        last_idx = i
# Write the final group
df.iloc[last_idx:].to_csv(f"mydata_{file_idx}.csv", index=False)

With pandas, you can use groupby and a boolean mask:

#pip install pandas
import pandas as pd

df = pd.read_csv("input_file.csv", sep=",") # <- change the sep if needed

for n, g in df.groupby(df["time"].eq(0).cumsum()):
    g.to_csv(f"file_{n}.csv", index=False, sep=",")

Output :

    time  magnitude   # <- file_1.csv
  0.0000      13517
292.5669        370
620.8469        528

      time  magnitude # <- file_2.csv
    0.0000        377
  832.3269      50187
 5633.9419       3088
20795.0950       2922
21395.6879       2498
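
To see why this grouping key works, it can help to print it. The sketch below uses a shortened copy of the question's data built inline (an assumption, instead of reading `input_file.csv`); the names `is_start`, `group_id` and `groups` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "time": [0, 292.5669, 620.8469, 0, 832.3269, 5633.9419],
    "magnitude": [13517, 370, 528, 377, 50187, 3088],
})

# True on every row that starts a new block (time == 0)
is_start = df["time"].eq(0)

# The running count of True values gives each block its own integer label
group_id = is_start.cumsum()
print(group_id.tolist())  # [1, 1, 1, 2, 2, 2]

# groupby on that label yields one sub-DataFrame per block
groups = {n: g for n, g in df.groupby(group_id)}
print(groups[2])
```

Each zero increments the counter, so every row up to the next zero shares the same label and lands in the same file.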
Answered By: Timeless

datasplit.awk

#!/usr/bin/awk -f

BEGIN {
    filename = "output_file_"
    fileext = ".csv"
    FS = ","

    c = 0
    file = filename c fileext
    getline
    header = $0
}
{
    if ($1 == 0){
        c = c + 1
        file = filename c fileext
        print header > file
        print $0 >> file
    } else {
        print >> file
    }
}

Make the file executable:

chmod +x datasplit.awk

Run the script from the folder where the output files should be written (call it by its path, e.g. ./datasplit.awk, if it is not on your PATH):

datasplit.awk datafile
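
For readers without awk, the same streaming idea can be sketched in Python with only the standard library. This is a rough equivalent under stated assumptions, not a translation of the script above: the function name `split_csv_on_zero` and its parameters are hypothetical, the first column is assumed to be numeric, and rows that appear before the first zero are written to file 1 with a header.

```python
import csv
from pathlib import Path


def split_csv_on_zero(src, out_dir, prefix="output_file_"):
    """Stream src and start a new csv file whenever the first column equals 0.

    Returns the list of paths that were written.
    """
    out_dir = Path(out_dir)
    written = []
    out = None
    try:
        with open(src, newline="") as f:
            reader = csv.reader(f)
            header = next(reader, None)
            if header is None:  # empty input: nothing to split
                return written
            for row in reader:
                if not row:  # skip blank lines
                    continue
                # Open a new output file on time == 0 (or for the very first row)
                if out is None or float(row[0]) == 0:
                    if out is not None:
                        out.close()
                    path = out_dir / f"{prefix}{len(written) + 1}.csv"
                    out = open(path, "w", newline="")
                    writer = csv.writer(out)
                    writer.writerow(header)
                    written.append(path)
                writer.writerow(row)
    finally:
        if out is not None:
            out.close()
    return written
```

Calling `split_csv_on_zero("datafile", ".")` would write `output_file_1.csv`, `output_file_2.csv`, and so on into the current folder. Only one output file is open at a time, so the input never has to fit in memory.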
Answered By: dodrg