Split CSV into multiple files based on column value

Question

I have a poorly structured CSV file named file.csv, and I want to split it into multiple CSV files using Python.

|A|B|C|
|Continent||1|
|Family|44950|file1|
|Species|44950|12|
|Habitat||4|
|Species|44950|22|
|Condition|Tue Jan 24 00:00:00 UTC 2023|4|
|Family|Fish|file2|
|Species|Bass|8|
|Species|Trout|2|
|Habitat|River|3|

Each new file should contain the header rows plus everything from one Family row up to the next, for example:

file1.csv

|A|B|C|
|Continent||1|
|Family|44950|file1|
|Species|44950|12|
|Habitat||4|
|Species|44950|22|
|Condition|Tue Jan 24 00:00:00 UTC 2023|4|

file2.csv

|A|B|C|
|Continent||1|
|Family|Fish|file2|
|Species|Bass|8|
|Species|Trout|2|
|Habitat|River|3|

What’s the best way to achieve this when the number of rows between Family rows is not consistent?

Asked By: MSD

Answer 1

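import pandas as pd

# Keep the parsed frame; the leading and trailing "|" add empty edge columns, so select A, B, C
df = pd.read_csv('file.csv', delimiter='|')[['A', 'B', 'C']]
# "Family" is a value in column A, not a column name: number the blocks by counting Family rows
blocks = df['A'].eq('Family').cumsum()
for number, group in df.groupby(blocks):
    if number > 0:  # block 0 holds only the rows before the first Family row
        pd.concat([df[blocks.eq(0)], group]).to_csv(f'file{number}.csv', index=False)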
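
As a rough check of how pandas sees this layout, the sketch below parses two rows copied from the question (held in an in-memory string, which is an assumption made for illustration). It suggests that Family shows up as a value in column A rather than as a column name, which is why grouping by a column called Family cannot work here.

```python
import io
import pandas as pd

# Two data rows copied from the question, borders included
sample = "|A|B|C|\n|Family|Fish|file2|\n|Species|Bass|8|\n"
df = pd.read_csv(io.StringIO(sample), delimiter="|")

print(list(df.columns))        # the empty edge fields become extra columns
print("Family" in df.columns)  # False
print(df["A"].tolist())        # ['Family', 'Species']
```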

Answered By: Amir194

Answer 2

Here is a working pure-Python method:

# Read file
with open('file.csv', 'r') as file:
    text = file.read()

# Split using |Family|
splitted_text = text.split("|Family|")

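# Keep the content before the first |Family| (the header rows) so it can be repeated in every file
header, splitted_text = splitted_text[0], splitted_text[1:]

# Add the header and |Family| back to each part
splitted_text = [header + '|Family|' + item for item in splitted_text]

# Write files, numbering from 1 so the names match file1.csv and file2.csv
for i, content in enumerate(splitted_text, start=1):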
    with open('file{}.csv'.format(i), 'w') as file:
        file.write(content)

Answered By: farshad

Answer 3

If your file really looks like that 😉 then you could use groupby from the standard library module itertools:

from itertools import groupby

def key(line): return line.startswith("|Family|")

family_line, file_no = None, 0
with open("file.csv", "r") as fin:
    for is_family_line, lines in groupby(fin, key=key):
        if is_family_line:
            family_line = list(lines).pop()
        elif family_line is None:
            header = "".join(lines)
        else:
            file_no += 1
            with open(f"file{file_no}.csv", "w") as fout:
                fout.write(header + family_line)
                for line in lines:
                    fout.write(line)
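
To see why this works, here is a small sketch of how groupby clusters consecutive lines by the key. The lines are copied from the question and kept in a list instead of a file, and the helper name is_family is only illustrative.

```python
from itertools import groupby

# Sample lines copied from the question's table
lines = [
    "|A|B|C|\n",
    "|Continent||1|\n",
    "|Family|44950|file1|\n",
    "|Species|44950|12|\n",
    "|Habitat||4|\n",
    "|Family|Fish|file2|\n",
    "|Species|Bass|8|\n",
]

def is_family(line):
    return line.startswith("|Family|")

# groupby starts a new run every time the key value changes
runs = [(flag, len(list(run))) for flag, run in groupby(lines, key=is_family)]
print(runs)
# [(False, 2), (True, 1), (False, 2), (True, 1), (False, 1)]
```

The first False run is the header, each True run is a Family line, and each later False run is the body that belongs to the Family line just before it.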

A Pandas solution would be:

import pandas as pd

df = pd.read_csv("file.csv", header=None, delimiter="|").fillna("")
blocks = df.iloc[:, 1].eq("Family").cumsum()
header_df = df[blocks.eq(0)]
for no, sdf in df.groupby(blocks):
    if no > 0:
        sdf = pd.concat([header_df, sdf])
        sdf.to_csv(f"file{no}.csv", index=False, header=False, sep="|")
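
The block numbering is the key step, so here is a minimal sketch of it in isolation. The Series holds the column A values from the question in file order, and the name labels is only illustrative.

```python
import pandas as pd

# Column A values from the question, in file order (first entry is the header cell)
labels = pd.Series(["A", "Continent", "Family", "Species", "Habitat",
                    "Species", "Condition", "Family", "Species", "Species", "Habitat"])

# True counts as 1, so the running total goes up by one at every Family row
blocks = labels.eq("Family").cumsum()
print(blocks.tolist())
# [0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2]
```

Block 0 is the shared header, and every other block number corresponds to one output file.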

Answered By: Timus
