Concatenating multiple csv files into a single csv with the same header

Question:

I am currently using the below code to import 6,000 csv files (with headers) and export them into a single csv file (with a single header row).

import glob
import pandas as pd

#import csv files from folder
path =r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
stockstats_data = pd.DataFrame()
list_ = []

for file_ in allFiles:
    df = pd.read_csv(file_,index_col=None,)
    list_.append(df)
    stockstats_data = pd.concat(list_)
    print(file_ + " has been imported.")

This code works fine, but it is slow. It can take up to 2 days to process.

I was given a single line script for Terminal command line that does the same (but with no headers). This script takes 20 seconds.

 for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> merged.csv; done 

Does anyone know how I can speed up the first Python script? To cut the time down, I have thought about not importing it into a DataFrame and just concatenating the CSVs, but I cannot figure it out.

Thanks.

Asked By: mattblack

Answers:

Are you required to do this in Python? If you are open to doing this entirely in shell, all you’d need to do is first cat the header row from a randomly selected input .csv file into merged.csv before running your one-liner:

cat a-randomly-selected-csv-file.csv | head -n1 > merged.csv
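# merged.csv also matches *.csv, so skip it to avoid reading the file being appended to
for f in *.csv; do [ "$f" = merged.csv ] || tail -n +2 "$f" >> merged.csv; done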
Answered By: Peter Leimbigler

You don’t need pandas for this, just the simple csv module would work fine.

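import csv
import glob

# Same input folder as in the question
path = r'data/US/market/merged_data'
allFiles = sorted(glob.glob(path + "/*.csv"))

df_out_filename = 'df_out.csv'
write_headers = True
# On Python 3, open in text mode with newline='' so the csv module controls line endings
with open(df_out_filename, 'w', newline='') as fout:
    writer = csv.writer(fout)
    for filename in allFiles:
        with open(filename, newline='') as fin:
            reader = csv.reader(fin)
            headers = next(reader)  # First row of each file is its header.
            if write_headers:
                write_headers = False  # Only write headers once.
                writer.writerow(headers)
            writer.writerows(reader)  # Write all remaining rows.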
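
If the inputs are not guaranteed to share a header, the same loop can check that as it goes. A minimal sketch, assuming Python 3; the function name merge_with_header_check and the choice to raise ValueError on a mismatch are illustrative, not part of the answer above:

```python
import csv


def merge_with_header_check(filenames, out_filename):
    """Merge CSV files, writing the header once and rejecting mismatched headers."""
    expected = None
    with open(out_filename, 'w', newline='') as fout:
        writer = csv.writer(fout)
        for filename in filenames:
            with open(filename, newline='') as fin:
                reader = csv.reader(fin)
                headers = next(reader, None)
                if headers is None:
                    continue  # Empty file: nothing to copy.
                if expected is None:
                    expected = headers
                    writer.writerow(headers)  # Only written for the first file.
                elif headers != expected:
                    raise ValueError(f'{filename}: unexpected header {headers!r}')
                writer.writerows(reader)  # Write all remaining rows.
```

Failing loudly on the first odd file is usually cheaper than discovering misaligned columns in a 6,000-file merge afterwards.
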
Answered By: Alexander

If you don’t need the CSV in memory and are just copying from input to output, it’s a lot cheaper to avoid parsing at all and to copy without building anything up in memory:

import shutil
import glob


#import csv files from folder
path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
allFiles.sort()  # glob lacks reliable ordering, so impose your own if output order matters
with open('someoutputfile.csv', 'wb') as outfile:
    for i, fname in enumerate(allFiles):
        with open(fname, 'rb') as infile:
            if i != 0:
                infile.readline()  # Throw away header on all but first file
            # Block copy rest of file from input to output without parsing
            shutil.copyfileobj(infile, outfile)
            print(fname + " has been imported.")

That’s it; shutil.copyfileobj handles efficiently copying the data, dramatically reducing the Python level work to parse and reserialize. Don’t omit the allFiles.sort()!†
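
One thing the snippet above silently relies on is that every input file ends with a newline; if one doesn’t, its last row and the next file’s first row get glued together into a single line. A minimal sketch of a guard, assuming \n or \r\n line endings (the inserted separator is a plain \n); the function name concat_csv_bytes is made up for illustration:

```python
import os
import shutil


def concat_csv_bytes(filenames, out_filename):
    """Copy CSV files into one output, keeping only the first file's header."""
    last_byte = b'\n'  # An empty output counts as "ends with a newline"
    with open(out_filename, 'wb') as outfile:
        for i, fname in enumerate(filenames):
            with open(fname, 'rb') as infile:
                if i != 0:
                    infile.readline()  # Throw away header on all but first file
                first = infile.read(1)  # Peek: is there anything left to copy?
                if not first:
                    continue  # Header-only or empty file contributes nothing
                if last_byte != b'\n':
                    outfile.write(b'\n')  # Previous file lacked a final newline
                outfile.write(first)
                # Block copy rest of file from input to output without parsing
                shutil.copyfileobj(infile, outfile)
                # Remember how this file ended, for the next iteration
                infile.seek(-1, os.SEEK_END)
                last_byte = infile.read(1)
```

Peeking one byte with read(1) is what lets the separator be written only when a file actually has rows to contribute; the bulk of the copy still goes through shutil.copyfileobj, so the approach itself is unchanged.
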

This assumes all the CSV files have the same format, encoding, line endings, etc., and that the header doesn’t contain embedded newlines. It also assumes that, in the encoding used, an encoded newline contains a single byte equal to ASCII \n and that this byte is the last byte of the character (so ASCII and all ASCII superset encodings work, as do UTF-16-BE and UTF-32-BE, but not UTF-16-LE and UTF-32-LE). If all of that holds, it’s a lot faster than the alternatives.
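
To see why byte order matters here, this small sketch runs a binary readline() over the same two-line text in three encodings and checks whether what is left over is still a cleanly encoded second line. The helper name remainder_intact is only for illustration:

```python
import io


def remainder_intact(encoding, text='a,b\n1,2\n'):
    """Return True if a binary readline() leaves the rest of the data aligned."""
    raw = io.BytesIO(text.encode(encoding))
    raw.readline()  # Binary readline stops right after the first 0x0A byte
    body = text.partition('\n')[2]  # Everything after the header line
    return raw.read() == body.encode(encoding)


for encoding in ('utf-8', 'utf-16-be', 'utf-16-le'):
    print(encoding, remainder_intact(encoding))
# utf-8 True
# utf-16-be True
# utf-16-le False
```

In UTF-16-LE the newline is encoded as 0A 00, so readline() stops between the two bytes and the stray 00 is copied to the output, shifting everything after it by one byte.
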

For the cases where the encoding’s version of a newline doesn’t look enough like an ASCII newline, or where the input files are in one encoding and the output file should be in a different encoding, you can add the work of decoding and encoding without adding CSV parsing/serializing work. On Python 2, add from io import open to get Python 3-like efficient encoding-aware file objects. Define known_input_encoding as a string naming the known encoding of the input files (e.g. known_input_encoding = 'utf-16-le') and output_encoding as the encoding for the output file (optionally a different one), then use:

# Other imports and setup code prior to first with unchanged from before

# Perform encoding to chosen output encoding, disabling line-ending 
# translation to avoid conflicting with CSV dialect, matching raw binary behavior
with open('someoutputfile.csv', 'w', encoding=output_encoding, newline='') as outfile:
    for i, fname in enumerate(allFiles):
        # Decode with known encoding, disabling line-ending translation
        # for same reasons as above
        with open(fname, encoding=known_input_encoding, newline='') as infile:
            if i != 0:
                infile.readline()  # Throw away header on all but first file
            # Block copy rest of file from input to output without parsing
            # just letting the file object decode from input and encode to output
            shutil.copyfileobj(infile, outfile)
            print(fname + " has been imported.")

This is still much faster than involving the csv module, especially in modern Python (where the io module has undergone greater and greater optimization, to the point where the cost of decoding and reencoding is pretty minor, especially next to the cost of performing I/O in the first place). It’s also a good validity check for self-checking encodings (e.g. the UTF family) even if the encoding is not supposed to change; if the data doesn’t match the assumed self-checking encoding, it’s highly unlikely to decode validly, so you’ll get an exception rather than silent misbehavior.
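
A quick way to see the self-checking behavior, as a sketch: the bytes below are valid Latin-1 but not valid UTF-8, so reading them as UTF-8 fails loudly, while Latin-1 (which is not self-checking) accepts any byte sequence. The helper name decodes_cleanly is made up for this example:

```python
import io

# 0xE9 is 'é' in Latin-1, but in UTF-8 it starts a multi-byte sequence
# that the following newline byte cannot complete.
latin1_bytes = 'name\ncafé\n'.encode('latin-1')


def decodes_cleanly(data, encoding):
    """Return True if data can be read end to end as text in the given encoding."""
    with io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline='') as text:
        try:
            text.read()
        except UnicodeDecodeError:
            return False
    return True


print(decodes_cleanly(latin1_bytes, 'utf-8'))    # False
print(decodes_cleanly(latin1_bytes, 'latin-1'))  # True
```
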


Because some of the duplicates linked here are looking for an even faster solution than copyfileobj, some options:

  1. The only succinct, reasonably portable option is to continue using copyfileobj and explicitly pass a non-default length parameter, e.g. shutil.copyfileobj(infile, outfile, 1 << 20) (1 << 20 is 1 MiB, a number which shutil has switched to for plain shutil.copyfile calls on Windows due to superior performance).

  2. Still portable, but only works for binary files and not succinct, would be to copy the underlying code copyfile uses on Windows, which uses a reusable bytearray buffer with a larger size than copyfileobj’s default (1 MiB, rather than 64 KiB), removing some allocation overhead that copyfileobj can’t fully avoid for large buffers. You’d replace shutil.copyfileobj(infile, outfile) with (3.8+’s walrus operator, :=, used for brevity) the following code adapted from CPython 3.10’s implementation of shutil._copyfileobj_readinto (which you could always use directly if you don’t mind using non-public APIs):

    buf_length = 1 << 20  # 1 MiB buffer; tweak to preference
    # Using a memoryview gets zero copy performance when short reads occur
    with memoryview(bytearray(buf_length)) as mv:  
        while n := infile.readinto(mv):
            if n < buf_length:
                with mv[:n] as smv:
                    outfile.write(smv)
            else:
                outfile.write(mv)
    
  3. Non-portably, if you can (in any way you feel like) determine the precise length of the header, and you know it will not change by even a byte in any other file, you can write the header directly, then use OS-specific calls similar to what shutil.copyfile uses under the hood to copy the non-header portion of each file, using OS-specific APIs that can do the work with a single system call (regardless of file size) and avoid extra data copies (by pushing all the work to in-kernel or even within file-system operations, removing copies to and from user space) e.g.:

    a. On Linux kernel 2.6.33 and higher (and any other OS that allows the sendfile(2) system call to work between open files), you can replace the .readline() and copyfileobj calls with:

    filesize = os.fstat(infile.fileno()).st_size  # Get underlying file's size
    os.sendfile(outfile.fileno(), infile.fileno(), header_len_bytes, filesize - header_len_bytes)
    

    To make it signal resilient, it may be necessary to check the return value from sendfile, and track the number of bytes sent + skipped and the number remaining, looping until you’ve copied them all (these are low level system calls, they can be interrupted).

    b. On any system with Python 3.8+ built with glibc >= 2.27 (or on Linux kernel 4.5+), where the files are all on the same filesystem, you can replace sendfile with copy_file_range:

    filesize = os.fstat(infile.fileno()).st_size  # Get underlying file's size
    os.copy_file_range(infile.fileno(), outfile.fileno(), filesize - header_len_bytes, header_len_bytes)
    

    With similar caveats about checking for copying fewer bytes than expected and retrying.

    c. On OSX/macOS, you can use the completely undocumented, and therefore even less portable/stable API shutil.copyfile uses, posix._fcopyfile for a similar purpose, with something like (completely untested, and really, don’t do this; it’s likely to break across even minor Python releases):

    infile.seek(header_len_bytes)  # Skip past header
    posix._fcopyfile(infile.fileno(), outfile.fileno(), posix._COPYFILE_DATA)
    

    which assumes fcopyfile pays attention to the seek position (docs aren’t 100% on this) and, as noted, is not only macOS-specific, but uses undocumented CPython internals that could change in any release.
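
To make the retry advice in option 3a concrete, here is a minimal Linux-only sketch of a loop around os.sendfile. The helper names sendfile_all and append_without_header are hypothetical, and header_len_bytes is assumed to be known and identical for every input file, as described above:

```python
import os


def sendfile_all(out_fd, in_fd, offset, count):
    """Call os.sendfile until count bytes starting at offset have been copied."""
    while count > 0:
        sent = os.sendfile(out_fd, in_fd, offset, count)
        if sent == 0:
            break  # Input ended early; the file was shorter than expected
        offset += sent
        count -= sent
    return count  # Bytes left uncopied; 0 on success


def append_without_header(infile, outfile, header_len_bytes):
    """Append everything after the header of infile to outfile."""
    # Push anything Python has buffered (e.g. the header written earlier)
    # to the OS before writing through the raw file descriptor.
    outfile.flush()
    filesize = os.fstat(infile.fileno()).st_size  # Get underlying file's size
    return sendfile_all(outfile.fileno(), infile.fileno(),
                        header_len_bytes, filesize - header_len_bytes)
```

Both files are expected to be opened in binary mode, with the output opened for plain writing ('wb') rather than appending. The flush matters because sendfile bypasses Python’s write buffer; without it, the header could land after the copied data.
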


† An aside on sorting the results of glob: That allFiles.sort() call should not be omitted; glob imposes no ordering on the results, and for reproducible results, you’ll want to impose some ordering (it wouldn’t be great if the same files, with the same names and data, produced an output file in a different order simply because in-between runs, a file got moved out of the directory, then back in, and changed the native iteration order). Without the sort call, this code (and all other Python+glob module answers) will not reliably read from a directory containing a.csv and b.csv in alphabetical (or any other useful) order; it’ll vary by OS, file system, and often the entire history of file creation/deletion in the directory in question. This has broken stuff before in the real world, see details at A Code Glitch May Have Caused Errors In More Than 100 Published Studies.
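
A small sketch of what sorting buys you, using throwaway files in a temporary folder (the file names are invented for the demonstration). Note that a plain sort is lexicographic, so file10.csv lands before file2.csv; that is still a stable, reproducible order, which is the point here:

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Create files in an order unrelated to their names.
    for name in ('b.csv', 'file10.csv', 'a.csv', 'file2.csv'):
        open(os.path.join(folder, name), 'w').close()

    # sorted() imposes the same order on every run, whatever glob returns
    all_files = sorted(glob.glob(os.path.join(folder, '*.csv')))
    names = [os.path.basename(p) for p in all_files]

print(names)  # ['a.csv', 'b.csv', 'file10.csv', 'file2.csv']
```
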

Answered By: ShadowRanger

Here’s a simpler approach using pandas (though I am not sure how it will help with RAM usage):

import pandas as pd
import glob

path = r'data/US/market/merged_data'
allFiles = sorted(glob.glob(path + "/*.csv"))

# Read every file first, then concatenate once; calling pd.concat inside the
# loop re-copies all previously loaded rows on every iteration.
frames = [pd.read_csv(file_) for file_ in allFiles]
stockstats_data = pd.concat(frames, axis=0)
Answered By: markroxor