What is the fastest way to combine 100 CSV files with headers into one?
Question:
What is the fastest way to combine 100 CSV files with headers into one with the following setup:
- The total size of the files is 200 MB. (The size is kept small to make the computation time visible.)
- The files are located on an SSD with a maximum speed of 240 MB/s.
- The CPU has 4 cores, so multi-threading and multiple processes are allowed.
- There is only one node (important for Spark).
- The available memory is 15 GB, so the files easily fit into memory.
- The OS is Linux (Debian Jessie).
- The computer is actually an n1-standard-4 instance in Google Cloud.
(The detailed setup was included to make the scope of the question more specific. The changes were made according to the feedback received here.)
File 1.csv:
a,b
1,2
File 2.csv:
a,b
3,4
Final out.csv:
a,b
1,2
3,4
According to my benchmarks, the fastest of all the proposed methods is pure Python. Is there any faster method?
Benchmarks (Updated with the methods from comments and posts):
Method                        Time
pure python                   0.298s
sed                           1.9s
awk                           2.5s
R data.table                  4.4s
R data.table with colClasses  4.4s
Spark 2                       40.2s
python pandas                 1min 11.0s
Versions of tools:
sed 4.2.2
awk: mawk 1.3.3 Nov 1996
Python 3.6.1
Pandas 0.20.1
R 3.4.0
data.table 1.10.4
Spark 2.1.1
Code in Jupyter notebooks:
sed:
%%time
# head -n 1 copies just the header line; GNU sed's -s option restarts
# line numbering per input file, so 1d drops the header of every file.
!head -n 1 temp/in/1.csv > temp/merged_sed.csv
!sed -s 1d temp/in/*.csv >> temp/merged_sed.csv
Pure Python, all binary read/write, relying on the undocumented behavior of next() on binary file objects:
%%time
with open("temp/merged_pure_python2.csv","wb") as fout:
# first file:
with open("temp/in/1.csv", "rb") as f:
fout.write(f.read())
# now the rest:
for num in range(2,101):
with open("temp/in/"+str(num)+".csv", "rb") as f:
next(f) # skip the header
fout.write(f.read())
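A documented alternative (not benchmarked above) is to skip each header with readline() instead of next(); a minimal sketch of the same approach:

with open("temp/merged_readline.csv", "wb") as fout:
    with open("temp/in/1.csv", "rb") as f:
        fout.write(f.read())  # first file: keep the header
    for num in range(2, 101):
        with open("temp/in/" + str(num) + ".csv", "rb") as f:
            f.readline()  # skip the header via the documented readline()
            fout.write(f.read())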
awk:
%%time
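# NR==1 prints the very first input line (the header of the first file);
# FNR==1{next} then skips the first line of every file, and the final 1
# prints all remaining lines. Braces are doubled to escape IPython's {}
# expansion inside ! shell commands.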
!awk 'NR==1; FNR==1{{next}} 1' temp/in/*.csv > temp/merged_awk.csv
R data.table:
%%time
%%R
library(data.table)
filenames <- paste0("temp/in/", list.files(path="temp/in/", pattern="\\.csv$"))
files <- lapply(filenames, fread)
merged_data <- rbindlist(files, use.names=FALSE)
fwrite(merged_data, file="temp/merged_R_fwrite.csv", row.names=FALSE)
R data.table with colClasses:
%%time
%%R
library(data.table)
filenames <- paste0("temp/in/", list.files(path="temp/in/", pattern="\\.csv$"))
files <- lapply(filenames, fread, colClasses=c(
    V1="integer", V2="integer", V3="integer", V4="integer", V5="integer",
    V6="integer", V7="integer", V8="integer", V9="integer", V10="integer"))
merged_data <- rbindlist(files, use.names=FALSE)
fwrite(merged_data, file="temp/merged_R_fwrite.csv", row.names=FALSE)
Spark (pyspark):
%%time
df = spark.read.format("csv").option("header", "true").load("temp/in/*.csv")
df.coalesce(1).write.option("header", "true").csv("temp/merged_pyspark.csv")
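Note that Spark treats temp/merged_pyspark.csv as an output directory and writes a part file (plus a _SUCCESS marker) into it; coalesce(1) guarantees a single part file. A sketch for moving it out to a plain file (the target name is illustrative):

import glob
import shutil

# With coalesce(1) there is exactly one part-*.csv in the output directory.
part = glob.glob("temp/merged_pyspark.csv/part-*.csv")[0]
shutil.move(part, "temp/merged_pyspark_single.csv")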
Python pandas:
%%time
import glob
import pandas as pd

interesting_files = glob.glob("temp/in/*.csv")
df_list = []
for filename in sorted(interesting_files):
    df_list.append(pd.read_csv(filename))
full_df = pd.concat(df_list)
full_df.to_csv("temp/merged_pandas.csv", index=False)
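pandas is slow here mainly because it parses every value into a DataFrame and then re-serializes it, whereas the byte-copy methods never parse anything. For comparison, a constant-memory, parse-free variant (a sketch, not one of the benchmarked methods):

%%time
import shutil

# Stream each file into the output in 1 MB chunks; nothing is parsed,
# and memory use stays constant even for files that don't fit in RAM.
with open("temp/merged_copyfileobj.csv", "wb") as fout:
    for num in range(1, 101):
        with open("temp/in/" + str(num) + ".csv", "rb") as f:
            if num > 1:
                f.readline()  # skip the header of every file but the first
            shutil.copyfileobj(f, fout, 1024 * 1024)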
Data was generated by:
%%R
library(data.table)
df <- data.table(replicate(10, sample(0:9, 100000, rep=TRUE)))
for (i in 1:100) {
    write.csv(df, paste0("temp/in/", i, ".csv"), row.names=FALSE)
}
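The R snippet writes the same 100,000 x 10 table of digits to 100 files. For reference, a roughly equivalent generator in Python (a sketch using numpy and pandas, not the code used for the benchmarks):

import numpy as np
import pandas as pd

# Same shape as the R data: 100,000 rows, 10 columns of digits 0-9,
# with column names V1..V10, written as 100 identical CSV files.
df = pd.DataFrame(np.random.randint(0, 10, size=(100000, 10)),
                  columns=["V" + str(i) for i in range(1, 11)])
for i in range(1, 101):
    df.to_csv("temp/in/" + str(i) + ".csv", index=False)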
Answers:
sed is probably the fastest. I would also propose an awk alternative:

awk 'NR==1; FNR==1{next} 1' file* > output

It prints the first line of the first file (the header), then skips the first line of every remaining file.
Timings:
I tried 100 files of 10,000 lines each, around 200 MB in total (not sure). Here is the worst timing on my server:
real 0m0.429s
user 0m0.360s
sys 0m0.068s
Server specs (a little monster):
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 1
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
Stepping: 2
CPU MHz: 2394.345
BogoMIPS: 4789.86
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 15360K
NUMA node0 CPU(s): 0-11
According to the benchmarks in the question, the fastest method is pure Python using the undocumented behavior of next() on binary files. The method was proposed by Stefan Pochmann.

Benchmarks (updated with the methods from comments and posts):
Method                        Time
pure python                   0.298s
sed                           1.9s
awk                           2.5s
R data.table                  4.4s
R data.table with colClasses  4.4s
Spark 2                       40.2s
python pandas                 1min 11.0s
Versions of tools:
sed 4.2.2
awk: mawk 1.3.3 Nov 1996
Python 3.6.1
Pandas 0.20.1
R 3.4.0
data.table 1.10.4
Spark 2.1.1
Pure Python code:
with open("temp/merged_pure_python2.csv","wb") as fout:
# first file:
with open("temp/in/1.csv", "rb") as f:
fout.write(f.read())
# now the rest:
for num in range(2,101):
with open("temp/in/"+str(num)+".csv", "rb") as f:
next(f) # skip the header
fout.write(f.read())
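To sanity-check that two variants produce byte-identical output (an illustrative check, not part of the original benchmarks; note that the shell-glob methods visit files in lexicographic order, 1, 10, 100, 11, ..., so their row order can differ from the numeric Python loop even when the content is equivalent):

import filecmp

# Compare two merged files byte for byte (shallow=False forces a content
# comparison rather than a stat-metadata check).
print(filecmp.cmp("temp/merged_pure_python2.csv",
                  "temp/merged_readline.csv", shallow=False))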