I am currently using the below code to import 6,000 csv files (with headers) and export them into a single csv file (with a single header row).
```python
import glob
import pandas as pd

# import csv files from folder
path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
stockstats_data = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None)
    list_.append(df)
    stockstats_data = pd.concat(list_)
    print(file_ + " has been imported.")
```
This code works fine, but it is slow. It can take up to 2 days to process.
I was given a one-line shell script that does the same thing (but with no headers). It takes 20 seconds:
```shell
for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> merged.csv; done
```
Does anyone know how I can speed up the first Python script? To cut the time down, I have thought about not importing it into a DataFrame and just concatenating the CSVs, but I cannot figure it out.
Are you required to do this in Python? If you are open to doing this entirely in shell, all you’d need to do is first
cat the header row from a randomly selected input .csv file into
merged.csv before running your one-liner:
```shell
cat a-randomly-selected-csv-file.csv | head -n1 > merged.csv
# Caution: merged.csv now exists and matches *.csv, so write it to another
# directory (or use a different extension) to keep the loop from re-reading it.
for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> merged.csv; done
```
You don’t need pandas for this; the simple
csv module will work fine.
```python
import csv
import glob

allFiles = glob.glob(r'data/US/market/merged_data/*.csv')
df_out_filename = 'df_out.csv'
write_headers = True
with open(df_out_filename, 'w', newline='') as fout:
    writer = csv.writer(fout)
    for filename in allFiles:
        with open(filename, newline='') as fin:
            reader = csv.reader(fin)
            headers = next(reader)
            if write_headers:
                write_headers = False  # Only write headers once.
                writer.writerow(headers)
            writer.writerows(reader)  # Write all remaining rows.
```
If you don’t need the CSV in memory, and are just copying from input to output, it’ll be a lot cheaper to avoid parsing at all and copy without building anything up in memory:
```python
import glob
import shutil

# import csv files from folder
path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
allFiles.sort()  # glob lacks reliable ordering, so impose your own if output order matters
with open('someoutputfile.csv', 'wb') as outfile:
    for i, fname in enumerate(allFiles):
        with open(fname, 'rb') as infile:
            if i != 0:
                infile.readline()  # Throw away header on all but first file
            # Block copy rest of file from input to output without parsing
            shutil.copyfileobj(infile, outfile)
        print(fname + " has been imported.")
```
shutil.copyfileobj handles copying the data efficiently, dramatically reducing the Python-level work to parse and reserialize. Don’t omit the allFiles.sort() call.†
This assumes all the CSV files have the same format, encoding, line endings, etc.; that the encoding represents newlines with a single byte equivalent to ASCII \n that is the last byte in each character (so ASCII and all ASCII-superset encodings work, as do UTF-16-BE and UTF-32-BE, but not UTF-16-LE and UTF-32-LE); and that the header contains no embedded newlines. If all of that holds, it’s a lot faster than the alternatives.
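The byte-order caveat is easy to see by encoding a newline directly (a small illustration, not part of the original answer). Binary-mode readline() splits on the single byte b'\n' (0x0A), so a byte-oriented header skip only works when 0x0A ends the encoded character:

```python
print('\n'.encode('utf-8'))      # b'\n'      -> single byte: works
print('\n'.encode('utf-16-be'))  # b'\x00\n'  -> 0x0A is the last byte: works
print('\n'.encode('utf-16-le'))  # b'\n\x00'  -> 0x0A comes first: readline() splits mid-character
```
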
For the cases where the encoding’s version of a newline doesn’t look enough like an ASCII newline, or where the input files are in one encoding and the output file should be in a different one, you can add the work of encoding and decoding without adding CSV parsing/serializing work. Add a from io import open if on Python 2 to get Python 3-like efficient encoding-aware file objects, define known_input_encoding as some string naming the known encoding of the input files, e.g. known_input_encoding = 'utf-16-le', and optionally choose a different encoding for the output file:
```python
# Other imports and setup code prior to the first with unchanged from before

# Perform encoding to chosen output encoding, disabling line-ending
# translation to avoid conflicting with CSV dialect, matching raw binary behavior
with open('someoutputfile.csv', 'w', encoding=output_encoding, newline='') as outfile:
    for i, fname in enumerate(allFiles):
        # Decode with known encoding, disabling line-ending translation
        # for same reasons as above
        with open(fname, encoding=known_input_encoding, newline='') as infile:
            if i != 0:
                infile.readline()  # Throw away header on all but first file
            # Block copy rest of file from input to output without parsing,
            # just letting the file object decode from input and encode to output
            shutil.copyfileobj(infile, outfile)
        print(fname + " has been imported.")
```
This is still much faster than involving the
csv module, especially in modern Python (where the
io module has undergone greater and greater optimization, to the point where the cost of decoding and reencoding is pretty minor, especially next to the cost of performing I/O in the first place). It’s also a good validity check for self-checking encodings (e.g. the UTF family) even if the encoding is not supposed to change; if the data doesn’t match the assumed self-checking encoding, it’s highly unlikely to decode validly, so you’ll get an exception rather than silent misbehavior.
Because some of the duplicates linked here are looking for an even faster solution than
copyfileobj, some options:
The only succinct, reasonably portable option is to continue using
copyfileobj and explicitly pass a non-default
length parameter, e.g.
shutil.copyfileobj(infile, outfile, 1 << 20) (
1 << 20 is 1 MiB, a number which
shutil has switched to for plain
shutil.copyfile calls on Windows due to superior performance).
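Dropped into the earlier loop, that’s just a third argument to copyfileobj. A self-contained sketch (the two tiny input files here are stand-ins for the real 6,000):

```python
import shutil

# Create two hypothetical input files so the sketch runs on its own.
for name, rows in [('a.csv', 'h1,h2\n1,2\n'), ('b.csv', 'h1,h2\n3,4\n')]:
    with open(name, 'w') as f:
        f.write(rows)

buf_size = 1 << 20  # 1 MiB chunks instead of copyfileobj's 64 KiB default
with open('merged.csv', 'wb') as outfile:
    for i, fname in enumerate(['a.csv', 'b.csv']):
        with open(fname, 'rb') as infile:
            if i != 0:
                infile.readline()  # skip header on all but the first file
            shutil.copyfileobj(infile, outfile, buf_size)

print(open('merged.csv').read())  # h1,h2 / 1,2 / 3,4
```
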
Still portable, but limited to binary-mode files and less succinct, would be to copy the underlying code
copyfile uses on Windows, which uses a reusable
bytearray buffer with a larger size than
copyfileobj’s default (1 MiB, rather than 64 KiB), removing some allocation overhead that
copyfileobj can’t fully avoid for large buffers. You’d replace
shutil.copyfileobj(infile, outfile) with (3.8+’s walrus operator,
:=, used for brevity) the following code adapted from CPython 3.10’s implementation of
shutil._copyfileobj_readinto (which you could always use directly if you don’t mind using non-public APIs):
```python
buf_length = 1 << 20  # 1 MiB buffer; tweak to preference
# Using a memoryview gets zero-copy performance when short reads occur
with memoryview(bytearray(buf_length)) as mv:
    while n := infile.readinto(mv):
        if n < buf_length:
            with mv[:n] as smv:
                outfile.write(smv)
        else:
            outfile.write(mv)
```
Non-portably, if you can (in any way you feel like) determine the precise length of the header, and you know it will not change by even a byte in any other file, you can write the header directly, then use OS-specific calls similar to what
shutil.copyfile uses under the hood to copy the non-header portion of each file, using OS-specific APIs that can do the work with a single system call (regardless of file size) and avoid extra data copies (by pushing all the work to in-kernel or even within file-system operations, removing copies to and from user space) e.g.:
a. On Linux kernel 2.6.33 and higher (and any other OS that allows the
sendfile(2) system call to work between open files), you can replace the
copyfileobj calls with:
```python
filesize = os.fstat(infile.fileno()).st_size  # Get underlying file's size
os.sendfile(outfile.fileno(), infile.fileno(), header_len_bytes,
            filesize - header_len_bytes)
```
To make it signal resilient, it may be necessary to check the return value from
sendfile, and track the number of bytes sent + skipped and the number remaining, looping until you’ve copied them all (these are low level system calls, they can be interrupted).
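That retry loop might look like the following sketch (sendfile_all is a made-up helper name, and file-to-file sendfile requires Linux 2.6.33+):

```python
import os

def sendfile_all(out_fd, in_fd, offset, count):
    """Copy count bytes from in_fd, starting at offset, into out_fd,
    retrying whenever sendfile copies fewer bytes than requested
    (e.g. because the low-level call was interrupted)."""
    while count > 0:
        sent = os.sendfile(out_fd, in_fd, offset, count)
        if sent == 0:
            raise EOFError("input ended before expected byte count was copied")
        offset += sent
        count -= sent
```
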
b. On any system with Python 3.8+ built against glibc >= 2.27 (or on Linux kernel 4.5+), where the files are all on the same filesystem, you can replace the copyfileobj calls with:
```python
filesize = os.fstat(infile.fileno()).st_size  # Get underlying file's size
os.copy_file_range(infile.fileno(), outfile.fileno(),
                   filesize - header_len_bytes, header_len_bytes)
```
With similar caveats about checking for copying fewer bytes than expected and retrying.
c. On OSX/macOS, you can use the completely undocumented, and therefore even less portable/stable API
posix._fcopyfile for a similar purpose, with something like (completely untested, and really, don’t do this; it’s likely to break across even minor Python releases):
```python
infile.seek(header_len_bytes)  # Skip past header
posix._fcopyfile(infile.fileno(), outfile.fileno(), posix._COPYFILE_DATA)
```
This assumes fcopyfile pays attention to the seek position (the docs aren’t 100% clear on this) and, as noted, it is not only macOS-specific, but uses undocumented CPython internals that could change in any release.
† An aside on sorting the results of glob: the allFiles.sort() call should not be omitted. glob imposes no ordering on its results, and for reproducible results you’ll want to impose some ordering; it wouldn’t be great if the same files, with the same names and data, produced an output file in a different order simply because, in between runs, a file got moved out of the directory and back in, changing the native iteration order. Without the sort call, this code (and all other Python+glob module answers) will not reliably read a directory containing a.csv and b.csv in alphabetical (or any other useful) order; the order will vary by OS, file system, and often the entire history of file creation/deletion in the directory in question. This has broken things in the real world before; see A Code Glitch May Have Caused Errors In More Than 100 Published Studies.
Here’s a simpler approach using pandas (though I am not sure how it will help with RAM usage):
```python
import glob
import pandas as pd

path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
stockstats_data = pd.DataFrame()
for file_ in allFiles:
    df = pd.read_csv(file_)
    stockstats_data = pd.concat((df, stockstats_data), axis=0)
```
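As with the question’s original loop, concatenating inside the loop recopies all previously accumulated rows on every iteration, which is quadratic in the number of files. If you do want pandas and speed, the usual pattern is to read everything first and concatenate once at the end. A sketch (the demo directory and two tiny files are made up so this runs on its own):

```python
import glob
import os
import pandas as pd

# Hypothetical stand-ins for the 6,000 real input files.
os.makedirs('demo_csvs', exist_ok=True)
pd.DataFrame({'ticker': ['AAPL'], 'close': [105.35]}).to_csv('demo_csvs/a.csv', index=False)
pd.DataFrame({'ticker': ['MSFT'], 'close': [54.80]}).to_csv('demo_csvs/b.csv', index=False)

# Read each file exactly once, then concatenate a single time.
frames = [pd.read_csv(f) for f in sorted(glob.glob('demo_csvs/*.csv'))]
stockstats_data = pd.concat(frames, ignore_index=True)
print(len(stockstats_data))  # 2
```
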