Python CSVkit compare CSV files
Question:
I have two CSV files that look like this..
CSV 1
reference | name | house
----------------------------
2348A | john | 37
5648R | bill | 3
RT48 | kate | 88
76A | harry | 433
CSV2
reference
---------
2348A
76A
Using Python and CSVkit I am trying to create an output CSV of the rows in CSV1 by comparing it to CSV2. Does anybody have an example they can point me in the direction of?
Answers:
I would recommended to use pandas
to achieve what you are looking for:
And here is how simple it would be using pandas, consider your two csv files are like this:
CSV1
reference,name,house
2348A,john,37
5648R,bill,3
RT48,kate,88
76A,harry ,433
CSV2
reference
2348A
76A
Code
import pandas as pd
df1 = pd.read_csv(r'd:tempdata1.csv')
df2 = pd.read_csv(r'd:tempdata2.csv')
df3 = pd.merge(df1,df2, on= 'reference', how='inner')
df3.to_csv('outpt.csv')
output.csv
,reference,name,house
0,2348A,john,37
1,76A,harry ,433
I’d recommend using a tool like;
- csvjoin from csvkit or
- Pandas
Solution 1:
pip install csvkit
$ csvjoin --help
usage: csvjoin [-h] [-d DELIMITER] [-t] [-q QUOTECHAR] [-u {0,1,2,3}] [-b]
[-p ESCAPECHAR] [-z MAXFIELDSIZE] [-e ENCODING] [-S] [-v] [-l]
[--zero] [-c COLUMNS] [--outer] [--left] [--right]
[FILE [FILE ...]]
Example:
Left Join on column of [reference]
csvjoin --columns "reference" --left CSV1.csv CSV2.csv
Solution 2:
You can also use the pandas library which is easier to manipulate Dataframe and filter by column.
import pandas as pd
df1 = pd.read_csv('CSV1.csv')
df2 = pd.read_csv('CSV2.csv')
df = df1[df1['reference'].isin(df2['reference'])]
df.to_csv('output.csv', index=False)
I have two CSV files that look like this..
CSV 1
reference | name | house
----------------------------
2348A | john | 37
5648R | bill | 3
RT48 | kate | 88
76A | harry | 433
CSV2
reference
---------
2348A
76A
Using Python and CSVkit I am trying to create an output CSV of the rows in CSV1 by comparing it to CSV2. Does anybody have an example they can point me in the direction of?
I would recommended to use pandas
to achieve what you are looking for:
And here is how simple it would be using pandas, consider your two csv files are like this:
CSV1
reference,name,house
2348A,john,37
5648R,bill,3
RT48,kate,88
76A,harry ,433
CSV2
reference
2348A
76A
Code
import pandas as pd
df1 = pd.read_csv(r'd:tempdata1.csv')
df2 = pd.read_csv(r'd:tempdata2.csv')
df3 = pd.merge(df1,df2, on= 'reference', how='inner')
df3.to_csv('outpt.csv')
output.csv
,reference,name,house
0,2348A,john,37
1,76A,harry ,433
I’d recommend using a tool like;
- csvjoin from csvkit or
- Pandas
Solution 1:
pip install csvkit
$ csvjoin --help
usage: csvjoin [-h] [-d DELIMITER] [-t] [-q QUOTECHAR] [-u {0,1,2,3}] [-b]
[-p ESCAPECHAR] [-z MAXFIELDSIZE] [-e ENCODING] [-S] [-v] [-l]
[--zero] [-c COLUMNS] [--outer] [--left] [--right]
[FILE [FILE ...]]
Example:
Left Join on column of [reference]
csvjoin --columns "reference" --left CSV1.csv CSV2.csv
Solution 2:
You can also use the pandas library which is easier to manipulate Dataframe and filter by column.
import pandas as pd
df1 = pd.read_csv('CSV1.csv')
df2 = pd.read_csv('CSV2.csv')
df = df1[df1['reference'].isin(df2['reference'])]
df.to_csv('output.csv', index=False)