Merge multiple csv files into one
Question:
I have roughly 20 csv files (all with headers) that I would like to merge into one csv file.
Looking online, one way I found was to use the terminal command:
cat *.csv > file.csv
This worked just fine, but the problem is that since every csv file comes with a header, all of those headers also end up in the merged file.
Is there a terminal command or Python script with which I can merge all those csv files into one and keep only one header?
Thank you so much
Answers:
You can do this with awk:
awk '(NR == 1) || (FNR > 1)' *.csv > file.csv
FNR refers to the record number (typically the line number) in the current file, and NR refers to the total record number across all files. So the first line of the first file is accepted, and the first line of each subsequent file is ignored.
This does assume that all your csv files have the same number of columns in the same order.
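If you'd rather stay in Python, the standard library's fileinput module tracks the same two counters the awk one-liner relies on. Here is a minimal sketch of the same logic; the sample files and their contents are invented for illustration:

```python
import fileinput

# Two tiny sample files so the sketch runs standalone
with open("a.csv", "w") as f:
    f.write("id,name\n1,alice\n")
with open("b.csv", "w") as f:
    f.write("id,name\n2,bob\n")

# fileinput mirrors awk's bookkeeping: isfirstline() is FNR == 1 and
# lineno() is NR, so this keeps only the very first header line
with fileinput.input(["a.csv", "b.csv"]) as fin, open("file.csv", "w") as out:
    for line in fin:
        if fin.lineno() == 1 or not fin.isfirstline():
            out.write(line)

print(open("file.csv").read())
# id,name
# 1,alice
# 2,bob
```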
This command should work for you:
tail -qn +2 *.csv > file.csv
Although, do note, each file needs to end with a newline; otherwise the last row of one file and the first row of the next get concatenated onto a single line, e.g. you end up with 1, 12, 2 on one row instead of 1, 1 in row 1 and 2, 2 in row 2.
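Since a missing trailing newline is easy to overlook, here is a small stdlib-only sketch you could run first to spot offending files before concatenating:

```python
import glob
import os

# Flag csv files whose last byte isn't a newline; plain concatenation
# would glue their last row onto the next file's first row
for name in glob.glob("*.csv"):
    if os.path.getsize(name) == 0:
        continue  # empty files can't be seeked from the end
    with open(name, "rb") as f:
        f.seek(-1, os.SEEK_END)
        if f.read(1) != b"\n":
            print(name, "has no trailing newline")
```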
My vote goes to the Awk solution, but since this question explicitly asks about Python, here is a solution for that.
import csv
import sys

writer = csv.writer(sys.stdout)
firstfile = True
for file in sys.argv[1:]:
    with open(file, 'r', newline='') as rawfile:
        reader = csv.reader(rawfile)
        for idx, row in enumerate(reader):
            # enumerate() is zero-based by default; row 0 is the header
            if idx == 0 and not firstfile:
                continue
            writer.writerow(row)
    firstfile = False
Usage: python script.py first.csv second.csv etc.csv >final.csv
This simple script doesn't really benefit from any Python features, but if you need to count the number of fields in non-trivial CSV files (i.e. with quoted fields which might contain a comma that isn't a separator), that's hard in Awk and trivial in Python, because the csv library already knows exactly how to handle that.
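For instance, here is the kind of row that trips up a naive comma split but parses cleanly with the csv module (the sample data is invented for illustration):

```python
import csv
import io

line = '"Smith, John",42,"123 Main St, Apt 4"'
naive = line.split(",")                        # 5 pieces: the quoted commas break it
parsed = next(csv.reader(io.StringIO(line)))   # 3 fields, as intended
print(len(naive), len(parsed))  # 5 3
print(parsed[0])                # Smith, John
```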
The code below is what worked for me.
import csv
import glob
from datetime import datetime

timestamp = datetime.now().strftime("%Y%B%d_%H%M")

inputFiles = glob.glob("*.csv")  # every csv in the current directory
print(inputFiles)

outputFile = "combined" + timestamp + ".csv"
# 'x' mode raises FileExistsError instead of silently overwriting,
# and newline='' is what the csv module expects for output files
with open(outputFile, "x", newline="") as g:
    writer = csv.writer(g)
    for i, file in enumerate(inputFiles):
        with open(file, "r", newline="") as f:
            reader = csv.reader(f)
            for idx, row in enumerate(reader):
                # keep the header only from the first input file
                if idx == 0 and i > 0:
                    continue
                writer.writerow(row)
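None of the answers above mention it, but if pandas is available, a short script handles this too. Unlike the cat/awk approaches, concat aligns columns by header name, so files with the same columns in a different order still merge correctly. This is a hedged sketch, not from the original answers, and the file names are assumptions:

```python
import glob

import pandas as pd


def combine(pattern="*.csv", out="combined.csv"):
    # concat aligns on column names, so files whose columns appear in a
    # different order still line up under the right headers
    frames = [pd.read_csv(name) for name in sorted(glob.glob(pattern))]
    if frames:  # skip quietly when no files match
        pd.concat(frames, ignore_index=True).to_csv(out, index=False)


combine()
```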