Pandas DataFrame: .replace() and .strip() methods returning NaN values

Question:

I read a pdf file into a DataFrame using tabula and used .concat() to combine it all into one DataFrame by doing the following:

import pandas as pd
import tabula

df = tabula.read_pdf('card_details.pdf', pages='all')
df = pd.concat(df, ignore_index=True)

I want to clean some of this data: a column containing card numbers also has some non-numeric characters (question marks) in it. I tried using .replace() and .strip() to remove these in a DataFrame I made myself, and it worked:

df['card_number'] = df['card_number'].str.strip('?')

or

df['card_number'] = df['card_number'].str.replace(r'\D+', '', regex=True)

However, when I use it on this specific DataFrame that I read from the pdf, it returns NaN values for most of the data. Here are some screenshots of the DataFrame before and after.

DataFrame before cleaning

DataFrame after cleaning

Out of 15309 rows, only 2400 are not NaN, yet only around 50 rows actually contain non-numeric values. So I really don't understand what's happening here, as even the card numbers without any non-numeric characters are becoming null. Any ideas on what I may be doing wrong?
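(Editor's note: a quick way to confirm what is going on is to inspect the Python types actually stored in the column. The DataFrame below is a hypothetical reconstruction, not the asker's real data; tabula can parse cleanly numeric pages as ints/floats and other pages as strings, leaving a mixed object column.)

```python
import pandas as pd

# Hypothetical mixed-type column like tabula/concat can produce:
# ints from cleanly parsed pages, strings from pages with '?' noise.
df = pd.DataFrame({'card_number': [4929123456789012, '4929??9876543210']})

# Count the Python types stored in the column; more than one type here
# explains why .str methods return NaN for the non-string rows.
print(df['card_number'].map(type).value_counts())
```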

Asked By: Adam Idris


Answers:

This happens when the column contains actual numeric data (ints or floats) rather than strings: the .str accessor returns NaN for any value that is not a string. You can cast all the data to string first:

df['card_number'] = df['card_number'].astype(str).str.replace(r'\D+', '', regex=True)
Answered By: Klops
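(Editor's note: a minimal sketch reproducing the symptom and the fix, using made-up card numbers rather than the asker's data.)

```python
import pandas as pd

# Object-dtype Series mixing an int and a string, as tabula can produce.
s = pd.Series([4929123456789012, '4929??9876543210'])

# .str methods only apply to actual str objects; the int row becomes NaN.
broken = s.str.replace(r'\D+', '', regex=True)
print(broken.isna().sum())   # the non-string row is now NaN

# Casting everything to str first preserves every row.
fixed = s.astype(str).str.replace(r'\D+', '', regex=True)
print(fixed.tolist())
```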