Trying to remove commas and dollars signs with Pandas in Python

Question:

Tring to remove the commas and dollars signs from the columns. But when I do, the table prints them out and still has them in there. Is there a different way to remove the commans and dollars signs using a pandas function. I was unuable to find anything in the API Docs or maybe i was looking in the wrong place

 import pandas as pd
    import pandas_datareader.data as web

players = pd.read_html('http://www.usatoday.com/sports/mlb/salaries/2013/player/p/')


df1 = pd.DataFrame(players[0])


df1.drop(df1.columns[[0,3,4, 5, 6]], axis=1, inplace=True)
df1.columns = ['Player', 'Team', 'Avg_Annual']
df1['Avg_Annual'] = df1['Avg_Annual'].replace(',', '')

print (df1.head(10))
Asked By: Mark

||

Answers:

You have to access the str attribute per http://pandas.pydata.org/pandas-docs/stable/text.html

df1['Avg_Annual'] = df1['Avg_Annual'].str.replace(',', '')
df1['Avg_Annual'] = df1['Avg_Annual'].str.replace('$', '')
df1['Avg_Annual'] = df1['Avg_Annual'].astype(int)

alternately;

df1['Avg_Annual'] = df1['Avg_Annual'].str.replace(',', '').str.replace('$', '').astype(int)

if you want to prioritize time spent typing over readability.

Answered By: mechanical_meat

Shamelessly stolen from this answerbut, that answer is only about changing one character and doesn’t complete the coolness: since it takes a dictionary, you can replace any number of characters at once, as well as in any number of columns.

# if you want to operate on multiple columns, put them in a list like so:
cols = ['col1', 'col2', ..., 'colN']

# pass them to df.replace(), specifying each char and it's replacement:
df[cols] = df[cols].replace({'$': '', ',': ''}, regex=True)

@shivsn caught that you need to use regex=True; you already knew about replace (but also didn’t show trying to use it on multiple columns or both the dollar sign and comma simultaneously).

This answer is simply spelling out the details I found from others in one place for those like me (e.g. noobs to python an pandas). Hope it’s helpful.

Answered By: Hendy

@bernie’s answer is spot on for your problem. Here’s my take on the general problem of loading numerical data in pandas.

Often the source of the data is reports generated for direct consumption. Hence the presence of extra formatting like %, thousand’s separator, currency symbols etc. All of these are useful for reading but causes problems for the default parser. My solution is to typecast the column to string, replace these symbols one by one then cast it back to appropriate numerical formats. Having a boilerplate function which retains only [0-9.] is tempting but causes problems where the thousand’s separator and decimal gets swapped, also in case of scientific notation. Here’s my code which I wrap into a function and apply as needed.

df[col] = df[col].astype(str)  # cast to string

# all the string surgery goes in here
df[col] = df[col].replace('$', '')
df[col] = df[col].replace(',', '')  # assuming ',' is the thousand's separator in your locale
df[col] = df[col].replace('%', '')

df[col] = df[col].astype(float)  # cast back to appropriate type
Answered By: BiGYaN

I used this logic

df.col = df.col.apply(lambda x:x.replace('$','').replace(',',''))
Answered By: Tushar Gupta

When I got to this problem, this was how I got out of it.

df['Salary'] = df['Salary'].str.replace("$",'').astype(float)
Answered By: kojo justine

This worked for me. Adding "|" means or :

df['Salary'].str.replace('$|,','', regex=True)
Answered By: Mnl
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.