What is the fastest way to combine 100 CSV files with headers into one?
Question:
What is the fastest way to combine 100 CSV files with headers into one with the following setup:
- The total size of the files is 200 MB. (The size is kept small to make the computation time visible.)
- The files are located on an SSD with a maximum speed of 240 MB/s.
- The CPU has 4 cores, so multi-threading and multiple processes are allowed.
- There is only one node (important for Spark).
- The available memory is 15 GB, so the files easily fit into memory.
- The OS is Linux (Debian Jessie).
- The computer is actually an n1-standard-4 instance in Google Cloud.
(The detailed setup was included to make the scope of the question more specific. The changes were made according to the feedback received here.)
File 1.csv:
a,b
1,2
File 2.csv:
a,b
3,4
Final out.csv:
a,b
1,2
3,4
According to my benchmarks, the fastest of all the proposed methods is pure Python. Is there any faster method?
Benchmarks (Updated with the methods from comments and posts):
Method                        Time
pure python                   0.298s
sed                           1.9s
awk                           2.5s
R data.table                  4.4s
R data.table with colClasses  4.4s
Spark 2                       40.2s
python pandas                 1min 11.0s
Versions of tools:
sed 4.2.2
awk: mawk 1.3.3 Nov 1996
Python 3.6.1
Pandas 0.20.1
R 3.4.0
data.table 1.10.4
Spark 2.1.1
Code in Jupyter notebooks:
sed:
%%time
# head -n 1 copies just the header line; GNU sed's -s option restarts
# line numbering per input file, so 1d drops the header of every file.
!head -n 1 temp/in/1.csv > temp/merged_sed.csv
!sed -s 1d temp/in/*.csv >> temp/merged_sed.csv
Pure Python, all binary read/write, relying on the undocumented behavior of next() on binary file objects:
%%time
with open("temp/merged_pure_python2.csv","wb") as fout:
# first file:
with open("temp/in/1.csv", "rb") as f:
fout.write(f.read())
# now the rest:
for num in range(2,101):
with open("temp/in/"+str(num)+".csv", "rb") as f:
next(f) # skip the header
fout.write(f.read())
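A documented alternative (not benchmarked above) is to skip each header with readline() instead of next(); a minimal sketch of the same approach:

with open("temp/merged_readline.csv", "wb") as fout:
    with open("temp/in/1.csv", "rb") as f:
        fout.write(f.read())  # first file: keep the header
    for num in range(2, 101):
        with open("temp/in/" + str(num) + ".csv", "rb") as f:
            f.readline()  # skip the header via the documented readline()
            fout.write(f.read())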
awk:
%%time
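# NR==1 prints the very first input line (the header of the first file);
# FNR==1{next} then skips the first line of every file, and the final 1
# prints all remaining lines. Braces are doubled to escape IPython's {}
# expansion inside ! shell commands.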
!awk 'NR==1; FNR==1{{next}} 1' temp/in/*.csv > temp/merged_awk.csv
R data.table:
%%time
%%R
library(data.table)
filenames <- paste0("temp/in/", list.files(path="temp/in/", pattern="\\.csv$"))
files <- lapply(filenames, fread)
merged_data <- rbindlist(files, use.names=FALSE)
fwrite(merged_data, file="temp/merged_R_fwrite.csv", row.names=FALSE)
R data.table with colClasses:
%%time
%%R
library(data.table)
filenames <- paste0("temp/in/", list.files(path="temp/in/", pattern="\\.csv$"))
files <- lapply(filenames, fread, colClasses=c(
    V1="integer", V2="integer", V3="integer", V4="integer", V5="integer",
    V6="integer", V7="integer", V8="integer", V9="integer", V10="integer"))
merged_data <- rbindlist(files, use.names=FALSE)
fwrite(merged_data, file="temp/merged_R_fwrite.csv", row.names=FALSE)
Spark (pyspark):
%%time
df = spark.read.format("csv").option("header", "true").load("temp/in/*.csv")
df.coalesce(1).write.option("header", "true").csv("temp/merged_pyspark.csv")
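Note that Spark treats temp/merged_pyspark.csv as an output directory and writes a part file (plus a _SUCCESS marker) into it; coalesce(1) guarantees a single part file. A sketch for moving it out to a plain file (the target name is illustrative):

import glob
import shutil

# With coalesce(1) there is exactly one part-*.csv in the output directory.
part = glob.glob("temp/merged_pyspark.csv/part-*.csv")[0]
shutil.move(part, "temp/merged_pyspark_single.csv")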
Python pandas:
%%time
import glob
import pandas as pd

interesting_files = glob.glob("temp/in/*.csv")
df_list = []
for filename in sorted(interesting_files):
    df_list.append(pd.read_csv(filename))
full_df = pd.concat(df_list)
full_df.to_csv("temp/merged_pandas.csv", index=False)
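pandas is slow here mainly because it parses every value into a DataFrame and then re-serializes it, whereas the byte-copy methods never parse anything. For comparison, a constant-memory, parse-free variant (a sketch, not one of the benchmarked methods):

%%time
import shutil

# Stream each file into the output in 1 MB chunks; nothing is parsed,
# and memory use stays constant even for files that don't fit in RAM.
with open("temp/merged_copyfileobj.csv", "wb") as fout:
    for num in range(1, 101):
        with open("temp/in/" + str(num) + ".csv", "rb") as f:
            if num > 1:
                f.readline()  # skip the header of every file but the first
            shutil.copyfileobj(f, fout, 1024 * 1024)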
Data was generated by:
%%R
library(data.table)
df <- data.table(replicate(10, sample(0:9, 100000, rep=TRUE)))
for (i in 1:100) {
    write.csv(df, paste0("temp/in/", i, ".csv"), row.names=FALSE)
}
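The R snippet writes the same 100,000 x 10 table of digits to 100 files. For reference, a roughly equivalent generator in Python (a sketch using numpy and pandas, not the code used for the benchmarks):

import numpy as np
import pandas as pd

# Same shape as the R data: 100,000 rows, 10 columns of digits 0-9,
# with column names V1..V10, written as 100 identical CSV files.
df = pd.DataFrame(np.random.randint(0, 10, size=(100000, 10)),
                  columns=["V" + str(i) for i in range(1, 11)])
for i in range(1, 101):
    df.to_csv("temp/in/" + str(i) + ".csv", index=False)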
Answers:
sed is probably the fastest. I would also propose an awk alternative:

awk 'NR==1; FNR==1{next} 1' file* > output

It prints the first line of the first file (the header), then skips the first line of every remaining file.
Timings:
I tried 100 files of 10,000 lines each, around 200 MB in total (not sure). Here is the worst timing on my server:
real 0m0.429s
user 0m0.360s
sys 0m0.068s
Server specs (a little monster):
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 1
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
Stepping: 2
CPU MHz: 2394.345
BogoMIPS: 4789.86
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 15360K
NUMA node0 CPU(s): 0-11
According to the benchmarks in the question, the fastest method is pure Python using the undocumented behavior of next() on binary files. The method was proposed by Stefan Pochmann.

Benchmarks (updated with the methods from comments and posts):
Method                        Time
pure python                   0.298s
sed                           1.9s
awk                           2.5s
R data.table                  4.4s
R data.table with colClasses  4.4s
Spark 2                       40.2s
python pandas                 1min 11.0s
Versions of tools:
sed 4.2.2
awk: mawk 1.3.3 Nov 1996
Python 3.6.1
Pandas 0.20.1
R 3.4.0
data.table 1.10.4
Spark 2.1.1
Pure Python code:
with open("temp/merged_pure_python2.csv","wb") as fout:
# first file:
with open("temp/in/1.csv", "rb") as f:
fout.write(f.read())
# now the rest:
for num in range(2,101):
with open("temp/in/"+str(num)+".csv", "rb") as f:
next(f) # skip the header
fout.write(f.read())
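To sanity-check that two variants produce byte-identical output (an illustrative check, not part of the original benchmarks; note that the shell-glob methods visit files in lexicographic order, 1, 10, 100, 11, ..., so their row order can differ from the numeric Python loop even when the content is equivalent):

import filecmp

# Compare two merged files byte for byte (shallow=False forces a content
# comparison rather than a stat-metadata check).
print(filecmp.cmp("temp/merged_pure_python2.csv",
                  "temp/merged_readline.csv", shallow=False))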