how do I remove rows with duplicate values of columns in pandas data frame?
Question:
I have a pandas data frame which looks like this.
Column1 Column2 Column3
0 cat 1 C
1 dog 1 A
2 cat 1 B
I want to identify that cat and bat are same values which have been repeated and hence want to remove one record and preserve only the first record. The resulting data frame should only have.
Column1 Column2 Column3
0 cat 1 C
1 dog 1 A
Answers:
import pandas as pd
df = pd.DataFrame({"Column1":["cat", "dog", "cat"],
"Column2":[1,1,1],
"Column3":["C","A","B"]})
df = df.drop_duplicates(subset=['Column1'], keep='first')
print(df)
Using drop_duplicates
with subset
with list of columns to check for duplicates on and keep='first'
to keep first of duplicates.
If dataframe
is:
df = pd.DataFrame({'Column1': ["'cat'", "'toy'", "'cat'"],
'Column2': ["'bat'", "'flower'", "'bat'"],
'Column3': ["'xyz'", "'abc'", "'lmn'"]})
print(df)
Result:
Column1 Column2 Column3
0 'cat' 'bat' 'xyz'
1 'toy' 'flower' 'abc'
2 'cat' 'bat' 'lmn'
Then:
result_df = df.drop_duplicates(subset=['Column1', 'Column2'], keep='first')
print(result_df)
Result:
Column1 Column2 Column3
0 'cat' 'bat' 'xyz'
1 'toy' 'flower' 'abc'
Inside the drop_duplicates()
method of Dataframe
you can provide a series of column names to eliminate duplicate records from your data.
The following “Tested” code does the same :
import pandas as pd
df = pd.DataFrame()
df.insert(loc=0,column='Column1',value=['cat', 'toy', 'cat'])
df.insert(loc=1,column='Column2',value=['bat', 'flower', 'bat'])
df.insert(loc=2,column='Column3',value=['xyz', 'abc', 'lmn'])
df = df.drop_duplicates(subset=['Column1','Column2'],keep='first')
print(df)
Inside of the subset parameter, you can insert other column names as well and by default it will consider all the columns of your data and you can provide keep value as :-
- first : Drop duplicates except for the first occurrence.
- last : Drop duplicates except for the last occurrence.
- False : Drop all duplicates.
Use drop_duplicates() by using column name
import pandas as pd
data = pd.read_excel('your_excel_path_goes_here.xlsx')
#print(data)
data.drop_duplicates(subset=["Column1"], keep="first")
keep=first to instruct Python to keep the first value and remove other columns duplicate values.
keep=last to instruct Python to keep the last value and remove other columns duplicate values.
Suppose we want to remove all duplicate values in the excel sheet. We can specify keep=False
I have a pandas data frame which looks like this.
Column1 Column2 Column3
0 cat 1 C
1 dog 1 A
2 cat 1 B
I want to identify that cat and bat are same values which have been repeated and hence want to remove one record and preserve only the first record. The resulting data frame should only have.
Column1 Column2 Column3
0 cat 1 C
1 dog 1 A
import pandas as pd
df = pd.DataFrame({"Column1":["cat", "dog", "cat"],
"Column2":[1,1,1],
"Column3":["C","A","B"]})
df = df.drop_duplicates(subset=['Column1'], keep='first')
print(df)
Using drop_duplicates
with subset
with list of columns to check for duplicates on and keep='first'
to keep first of duplicates.
If dataframe
is:
df = pd.DataFrame({'Column1': ["'cat'", "'toy'", "'cat'"],
'Column2': ["'bat'", "'flower'", "'bat'"],
'Column3': ["'xyz'", "'abc'", "'lmn'"]})
print(df)
Result:
Column1 Column2 Column3
0 'cat' 'bat' 'xyz'
1 'toy' 'flower' 'abc'
2 'cat' 'bat' 'lmn'
Then:
result_df = df.drop_duplicates(subset=['Column1', 'Column2'], keep='first')
print(result_df)
Result:
Column1 Column2 Column3
0 'cat' 'bat' 'xyz'
1 'toy' 'flower' 'abc'
Inside the drop_duplicates()
method of Dataframe
you can provide a series of column names to eliminate duplicate records from your data.
The following “Tested” code does the same :
import pandas as pd
df = pd.DataFrame()
df.insert(loc=0,column='Column1',value=['cat', 'toy', 'cat'])
df.insert(loc=1,column='Column2',value=['bat', 'flower', 'bat'])
df.insert(loc=2,column='Column3',value=['xyz', 'abc', 'lmn'])
df = df.drop_duplicates(subset=['Column1','Column2'],keep='first')
print(df)
Inside of the subset parameter, you can insert other column names as well and by default it will consider all the columns of your data and you can provide keep value as :-
- first : Drop duplicates except for the first occurrence.
- last : Drop duplicates except for the last occurrence.
- False : Drop all duplicates.
Use drop_duplicates() by using column name
import pandas as pd
data = pd.read_excel('your_excel_path_goes_here.xlsx')
#print(data)
data.drop_duplicates(subset=["Column1"], keep="first")
keep=first to instruct Python to keep the first value and remove other columns duplicate values.
keep=last to instruct Python to keep the last value and remove other columns duplicate values.
Suppose we want to remove all duplicate values in the excel sheet. We can specify keep=False