Removing duplicate records from CSV file using Python Pandas
Question:
I would like to remove duplicate records from a CSV file using Python pandas.
The CSV contains records with three attributes: scale, minzoom, and maxzoom. I want a resulting dataframe with only minzoom and maxzoom, and the remaining records unique,
i.e.
Input CSV file (lookup_scales.csv)
Scale, minzoom, maxzoom
2000, 0, 15
3000, 0, 15
10000, 8, 15
20000, 8, 15
200000, 15, 18
250000, 15, 18
Required distinct_lookup_scales.csv (Without scale column)
minzoom, maxzoom
0,15
8,15
15,18
My code so far is
lookup_scales_df = pd.read_csv('C:/Marine/lookup/lookup_scales.csv', names = ['minzoom','maxzoom'])
lookup_scales_df = lookup_scales_df.set_index([2, 3])
file_name = "C:/Marine/lookup/distinct_lookup_scales.csv"
lookup_scales_df.groupby('minzoom', 'maxzoom').to_csv(file_name, sep=',')
Very grateful for any help. I am new to pandas and working with dataframes.
Answers:
You can use pd.read_csv(), DataFrame.to_csv(), and drop_duplicates():
import pandas as pd
df = pd.read_csv('test.csv', sep=', ', engine='python')
new_df = df[['minzoom','maxzoom']].drop_duplicates()
new_df.to_csv('out.csv', index=False)
Outputs to out.csv
:
minzoom,maxzoom
0,15
8,15
15,18
Note the sep=', ' when reading test.csv; otherwise, with the default sep=',', your column names will contain a leading space.
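In case it helps, here is a self-contained version of that answer, with the sample data inlined via io.StringIO (an assumption, so the sketch runs without a file on disk):

```python
import io
import pandas as pd

# Same data as the question's lookup_scales.csv, inlined for a standalone run
csv_text = """Scale, minzoom, maxzoom
2000, 0, 15
3000, 0, 15
10000, 8, 15
20000, 8, 15
200000, 15, 18
250000, 15, 18
"""

# sep=', ' (with engine='python') strips the space after each comma,
# so the column names come out clean: Scale, minzoom, maxzoom
df = pd.read_csv(io.StringIO(csv_text), sep=', ', engine='python')

# Keep only the two zoom columns, then drop duplicate rows
new_df = df[['minzoom', 'maxzoom']].drop_duplicates()
print(new_df.to_csv(index=False))
```

As an alternative to the multi-character separator, skipinitialspace=True with the default sep=',' strips the same leading spaces while keeping the faster C parsing engine.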
You don’t need numpy or anything else; you can deduplicate in one line while importing the CSV with pandas:
import pandas as pd
df = pd.read_csv('lookup_scales.csv', skipinitialspace=True, usecols=['minzoom', 'maxzoom']).drop_duplicates(keep='first').reset_index(drop=True)
output:
minzoom maxzoom
0 0 15
1 8 15
2 15 18
Then to write it out to CSV:
df.to_csv(file_name, index=False)  # no need to set sep; to_csv is comma-delimited by default
So the whole code:
import pandas as pd
df = pd.read_csv('lookup_scales.csv', skipinitialspace=True, usecols=['minzoom', 'maxzoom']).drop_duplicates(keep='first').reset_index(drop=True)
file_name = "C:/Marine/lookup/distinct_lookup_scales.csv"
df.to_csv(file_name, index=False)  # no need to set sep; to_csv is comma-delimited by default
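A runnable sketch of that one-liner, with the data inlined (and written without spaces after the commas, an assumption that sidesteps the leading-space issue). Note reset_index(drop=True): plain reset_index() would carry the old row labels into an extra 'index' column, which the shown output does not have:

```python
import io
import pandas as pd

# Sample data from the question, inlined without spaces after commas
csv_text = """Scale,minzoom,maxzoom
2000,0,15
3000,0,15
10000,8,15
20000,8,15
200000,15,18
250000,15,18
"""

# usecols reads only the two wanted columns; drop_duplicates removes
# repeated rows; reset_index(drop=True) renumbers 0..n-1 without
# adding the old index as a new column
df = (pd.read_csv(io.StringIO(csv_text), usecols=['minzoom', 'maxzoom'])
        .drop_duplicates(keep='first')
        .reset_index(drop=True))
print(df)
```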
The answer provided by d_kennetz discards every column except the deduplicated ones. The correct way of doing this while keeping the other columns intact is to pass the key column name to drop_duplicates:
df = pd.read_csv('yourcsvfilehere.csv').drop_duplicates('columnnamehere', keep='first')
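To make that concrete with the question's data (inlined here so the sketch is self-contained): deduplicating on the two zoom columns keeps one Scale value per unique (minzoom, maxzoom) pair, rather than dropping the Scale column entirely:

```python
import io
import pandas as pd

csv_text = """Scale,minzoom,maxzoom
2000,0,15
3000,0,15
10000,8,15
20000,8,15
200000,15,18
250000,15,18
"""

# Deduplicate on the zoom columns only; all three columns survive.
# keep='first' retains the first Scale seen for each (minzoom, maxzoom) pair.
df = pd.read_csv(io.StringIO(csv_text)).drop_duplicates(['minzoom', 'maxzoom'], keep='first')
print(df)
```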
Here’s a simple Python script to do that, using pandas, a powerful data-manipulation library.
import pandas as pd
# read CSV file
data = pd.read_csv('input.csv')
# remove duplicates based on 'email' column
cleaned_data = data.drop_duplicates(subset='email')
# save the cleaned data into a new CSV file
cleaned_data.to_csv('cleaned.csv', index=False)
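A runnable sketch of the same idea, with a made-up contact table (the names and addresses are hypothetical). The subset argument picks which column defines "duplicate", and keep controls which of the duplicated rows survives:

```python
import pandas as pd

# Hypothetical contact list with one duplicated email address
data = pd.DataFrame({
    'name':  ['Ann', 'Ann B.', 'Bob'],
    'email': ['ann@example.com', 'ann@example.com', 'bob@example.com'],
})

# keep='first' (the default) keeps the first row per email;
# keep='last' would keep the last occurrence instead
cleaned_data = data.drop_duplicates(subset='email', keep='first')
print(cleaned_data)
```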