Pandas DataFrame: .replace() and .strip() methods returning NaN values

Question:

I read a pdf file into a DataFrame using tabula and used .concat() to combine it all into one DataFrame by doing the following:

import pandas as pd
import tabula

df = tabula.read_pdf('card_details.pdf', pages='all')
df = pd.concat(df, ignore_index=True)

I want to clean some of this data: a column containing card numbers also has some non-numeric characters (question marks) in it. I tried using .replace() and .strip() to remove these in a DataFrame I made myself, and it worked:

df['card_number'] = df['card_number'].str.strip('?')

or

df['card_number'] = df['card_number'].str.replace(r'\D+', '', regex=True)

However, when I use it on this specific DataFrame that I read from the pdf, it returns NaN values for most of the data. Here are some screenshots of the DataFrame before and after.

DataFrame before cleaning

DataFrame after cleaning

Out of 15309 rows, only 2400 are not NaN, yet only around 50 rows actually contain non-numeric values. So I really don't understand what's happening here, as even the card numbers without any non-numeric characters are becoming null. Any ideas on what I may be doing wrong?
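(Editor's note: a quick way to confirm what is going on is to inspect the Python types actually stored in the column. The DataFrame below is a hypothetical reconstruction, not the asker's real data; tabula can parse cleanly numeric pages as ints/floats and other pages as strings, leaving a mixed object column.)

```python
import pandas as pd

# Hypothetical mixed-type column like tabula/concat can produce:
# ints from cleanly parsed pages, strings from pages with '?' noise.
df = pd.DataFrame({'card_number': [4929123456789012, '4929??9876543210']})

# Count the Python types stored in the column; more than one type here
# explains why .str methods return NaN for the non-string rows.
print(df['card_number'].map(type).value_counts())
```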

Asked By: Adam Idris


Answers:

This happens when the column contains actual numeric data (ints or floats) rather than strings: the .str accessor returns NaN for any value that is not a string. You can cast all the data to string first:

df['card_number'] = df['card_number'].astype(str).str.replace(r'\D+', '', regex=True)
Answered By: Klops
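(Editor's note: a minimal sketch reproducing the symptom and the fix, using made-up card numbers rather than the asker's data.)

```python
import pandas as pd

# Object-dtype Series mixing an int and a string, as tabula can produce.
s = pd.Series([4929123456789012, '4929??9876543210'])

# .str methods only apply to actual str objects; the int row becomes NaN.
broken = s.str.replace(r'\D+', '', regex=True)
print(broken.isna().sum())   # the non-string row is now NaN

# Casting everything to str first preserves every row.
fixed = s.astype(str).str.replace(r'\D+', '', regex=True)
print(fixed.tolist())
```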