Why duplicates aren't being removed in Pandas

Question

Very new to Python and Pandas…but the issue is that my final output file isn’t excluding any duplicates on the ‘Customer Number’. Any suggestions on why this would be happening would be appreciated!

import pandas as pd
import numpy as np #numpy is the module which can replace errors from huge datasets 
from openpyxl import load_workbook
from openpyxl.styles import Font

df_1 = pd.read_excel('PRT Tracings 2020.xlsx', sheet_name='Export') #this is reading the Excel document shifts and looks at sheet
df_2 = pd.read_excel('PRT Tracings 2021.xlsx', sheet_name='Export') #this reads the same Excel document but looks at a different sheet
df_3 = pd.read_excel('PRT Tracings YTD 2022.xlsx', sheet_name='Export') #this reads a different Excel file, and only has one sheet so no need to have it read a sheet

df_all = pd.concat([df_1, df_2, df_3], sort=False) #this combines the sheets from 1,2,3 and the sort function as false so our columns stay in the same order

to_excel = df_all.to_excel('Combined_PRT_Tracings.xlsx', index=None) #this Excel file combines all three sheets into one spreadsheet

df_all = df_all.replace(np.nan, 'N/A', regex=True) #replaces errors with N/A

remove = ['ORDERNUMBER', 'ORDER_TYPE', 'ORDERDATE', 'Major Code Description', 'Product_Number_And_Desc', 'Qty', 'Order_$', 'Order_List_$'] #this will remove all unwanted columns
df_all.drop(columns=remove, inplace=True)

df_all.drop_duplicates(subset=['Customer Number'], keep=False) #this will remove all duplicates from the tracing number syntax with pandas module

to_excel = df_all.to_excel('Combined_PRT_Tracings.xlsx', index=None) #this Excel file combines all three sheets into one spreadsheet

wb = load_workbook('Combined_PRT_Tracings.xlsx') #we are using this to have openpyxl read the data, from the spreadsheet already created
ws = wb.active #this workbook is active

wb.save('Combined_PRT_Tracings.xlsx')

Asked By: qtsweeney3535

||

Source

Answer 1

You should assign the return value of df_all.drop_duplicates to a variable or set inplace=True to have the DataFrame contents overwritten. This is to prevent undesired changes to the original data.

Try:

df_all = df_all.drop_duplicates(subset='Customer Number', keep=False)

Or the equivalent:

df_all.drop_duplicates(subset='Customer Number', keep=False, inplace=True)

That will remove all duplicate rows from the DataFrame. If you want to keep the first or last row which contains a duplicate, change keep to first or last.

Answered By: Jose Vega

Why duplicates aren't being removed in Pandas

Question:

Answers: