When to use apply(pd.to_numeric) and when to use astype(np.float64) in Python?

Question:

I have a pandas DataFrame object named xiv which has a column of int64 Volume measurements.

In[]: xiv['Volume'].head(5)
Out[]: 

0    252000
1    484000
2     62000
3    168000
4    232000
Name: Volume, dtype: int64

I have read other posts (like this and this) that suggest the following solutions. But when I use either approach, it doesn’t appear to change the dtype of the underlying data:

In[]: xiv['Volume'] = pd.to_numeric(xiv['Volume'])

In[]: xiv['Volume'].dtypes
Out[]: 
dtype('int64')

Or…

In[]: xiv['Volume'] = pd.to_numeric(xiv['Volume'])
Out[]: ###omitted for brevity###

In[]: xiv['Volume'].dtypes
Out[]: 
dtype('int64')

In[]: xiv['Volume'] = xiv['Volume'].apply(pd.to_numeric)

In[]: xiv['Volume'].dtypes
Out[]: 
dtype('int64')

I’ve also tried making a separate pandas Series, applying the methods listed above to that Series, and reassigning the result to xiv['Volume'], which is a pandas.core.series.Series object.

I have, however, found a solution to this problem using the numpy package’s float64 type – this works but I don’t know why it’s different.

In[]: xiv['Volume'] = xiv['Volume'].astype(np.float64)

In[]: xiv['Volume'].dtypes
Out[]: 
dtype('float64') 

Can someone explain how to accomplish with the pandas library what the numpy library seems to do easily with its float64 class, that is, convert the column in the xiv DataFrame to float64 in place?

Asked By: d8aninja


Answers:

If you already have numeric dtypes (int8|16|32|64, float64, boolean), you can convert them to another numeric dtype using the pandas .astype() method.

Demo:

In [90]: df = pd.DataFrame(np.random.randint(10**5,10**7,(5,3)),columns=list('abc'), dtype=np.int64)

In [91]: df
Out[91]:
         a        b        c
0  9059440  9590567  2076918
1  5861102  4566089  1947323
2  6636568   162770  2487991
3  6794572  5236903  5628779
4   470121  4044395  4546794

In [92]: df.dtypes
Out[92]:
a    int64
b    int64
c    int64
dtype: object

In [93]: df['a'] = df['a'].astype(float)

In [94]: df.dtypes
Out[94]:
a    float64
b      int64
c      int64
dtype: object

It won’t work for object (string) columns that contain values that can’t be converted to numbers:

In [95]: df.loc[1, 'b'] = 'XXXXXX'

In [96]: df
Out[96]:
           a        b        c
0  9059440.0  9590567  2076918
1  5861102.0   XXXXXX  1947323
2  6636568.0   162770  2487991
3  6794572.0  5236903  5628779
4   470121.0  4044395  4546794

In [97]: df.dtypes
Out[97]:
a    float64
b     object
c      int64
dtype: object

In [98]: df['b'].astype(float)
...
skipped
...
ValueError: could not convert string to float: 'XXXXXX'

In that case, use pd.to_numeric() with errors='coerce', which turns unparseable values into NaN:

In [99]: df['b'] = pd.to_numeric(df['b'], errors='coerce')

In [100]: df
Out[100]:
           a          b        c
0  9059440.0  9590567.0  2076918
1  5861102.0        NaN  1947323
2  6636568.0   162770.0  2487991
3  6794572.0  5236903.0  5628779
4   470121.0  4044395.0  4546794

In [101]: df.dtypes
Out[101]:
a    float64
b    float64
c      int64
dtype: object
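
The int64 column from the question makes a useful contrast to the demo above: pd.to_numeric() infers a numeric dtype and leaves data that is already numeric as it is, while astype() casts to whatever dtype you name. A minimal sketch, using a made-up `volume` Series as a stand-in for xiv['Volume']:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for xiv['Volume'] from the question.
volume = pd.Series([252000, 484000, 62000], name="Volume", dtype="int64")

# to_numeric only parses and infers; an int64 column is already numeric,
# so the result is still int64.
inferred = pd.to_numeric(volume)

# astype casts to the requested dtype.
as_float = volume.astype(np.float64)

print(inferred.dtype)  # int64
print(as_float.dtype)  # float64
```

So the reassignment in the question did run; there was simply nothing for to_numeric() to convert.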

I don’t have a technical explanation for this, but I have noticed that pd.to_numeric() raises the following error when converting the string 'nan':

In [10]: df = pd.DataFrame({'value': 'nan'}, index=[0])

In [11]: pd.to_numeric(df.value)

Traceback (most recent call last):

  File "<ipython-input-11-98729d13e45c>", line 1, in <module>
    pd.to_numeric(df.value)

  File "C:\Users\joshua.lee\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\tools\numeric.py", line 133, in to_numeric
    coerce_numeric=coerce_numeric)

  File "pandas/_libs/src\inference.pyx", line 1185, in pandas._libs.lib.maybe_convert_numeric

ValueError: Unable to parse string "nan" at position 0

whereas astype(float) does not:

df.value.astype(float)
Out[12]: 
0   NaN
Name: value, dtype: float64

Answered By: reevesnmortimer

You can use this:

pd.to_numeric(df.value, errors='coerce').fillna(0, downcast='infer')  

It will use zero in place of NaN.
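
In newer pandas releases the downcast argument of fillna() appears to be deprecated, so an explicit cast may be the safer way to write the same idea. A sketch under that assumption; the `raw` Series is invented for illustration:

```python
import pandas as pd

# Hypothetical column with one unparseable entry.
raw = pd.Series(["10", "oops", "30"])

# Unparseable strings become NaN, NaN becomes 0, then cast explicitly to int64.
cleaned = pd.to_numeric(raw, errors="coerce").fillna(0).astype("int64")

print(cleaned.tolist())  # [10, 0, 30]
```

Filling with 0 only makes sense when 0 cannot be mistaken for a real measurement; otherwise dropping the rows may be the better choice.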

Answered By: Mohd Waseem

I found that I could convert an object (str) column to float first and then from float to int64.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10**5, 10**7, (5, 3)), columns=list('abc'),
                  dtype=np.int64)
df['a'] = df['a'].astype('str')    # column 'a' now holds strings
df.dtypes

df['a'] = df['a'].astype('float')  # str -> float64
df['a'] = df['a'].astype('int64')  # float64 -> int64

Worked fine.
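
A small sketch of where the intermediate float step can matter: strings that contain a decimal point, such as '12.0', generally cannot be cast straight to an integer dtype, but they do parse as floats. (In the example above the strings are plain integers, so the float step is not strictly needed there.) The sample values below are made up:

```python
import pandas as pd

# Hypothetical string column whose values look like floats.
s = pd.Series(["12.0", "7.0", "3.0"], dtype="object")

# Step 1: parse the strings as floats. Step 2: cast the floats to int64.
as_int = s.astype("float").astype("int64")

print(as_int.tolist())  # [12, 7, 3]
```

Note that the float-to-int cast truncates any fractional part, so this route is only safe when the values are known to be whole numbers.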

Answered By: Shobhit Sharma

I think I have an explanation that supports what the others gave. In summary, and as I will show below, pd.to_numeric(arg, errors='coerce') can handle values that cannot be converted to numbers, such as '50a', by converting them to NaN. You can then drop the null values. DataFrame.astype() does not have that ability.

In practice, I use pd.to_numeric(arg, errors='coerce') first, especially when the DataFrame column or Series may hold values that cannot be converted to numbers, since it converts those values to NaN. I then drop the NaN values if desired and use DataFrame.astype() to convert to the exact numeric dtype I want, such as float64, int32, or int64.

See examples below:

>>> bio = {'Age': [56, 57, '50a'], 'Name': ['YOU', 'ME', 'HIM']}
>>> df = pd.DataFrame(bio)
>>> df
   Age Name
0   56  YOU
1   57   ME
2  50a  HIM
>>> df['Age'] = df['Age'].astype(int)
.......
.......
ValueError: invalid literal for int() with base 10: '50a'

# Even when errors='ignore' is passed, the conversion is not made
>>> df['Age'] = df['Age'].astype(int, errors='ignore')
>>> df
   Age Name
0   56  YOU
1   57   ME
2  50a  HIM

Observe what will happen when I use pd.to_numeric(arg, errors='coerce')

>>> df['Age'] = pd.to_numeric(df['Age']) #Used without the coerce
........
........
ValueError: Unable to parse string "50a" at position 2

# When used with errors='coerce', invalid values are changed to NaN.
# Note that astype(int) raises an error on NaN and astype(float) keeps it as NaN,
# so fill or drop the NaN values before casting to int.
>>> df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
>>> df
    Age Name
0  56.0  YOU
1  57.0   ME
2   NaN  HIM

# You can then drop nulls if you desire
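
Putting the steps together, one possible sketch of the workflow described above: coerce, drop the rows that failed to parse, then cast to the exact dtype. It reuses the bio example; the nullable 'Int64' line is an additional option not mentioned above, for keeping the unparseable row instead of dropping it.

```python
import pandas as pd

bio = {"Age": [56, 57, "50a"], "Name": ["YOU", "ME", "HIM"]}
df = pd.DataFrame(bio)

# 1. Coerce: '50a' cannot be parsed, so it becomes NaN (the column is now float64).
df["Age"] = pd.to_numeric(df["Age"], errors="coerce")

# 2. Drop the rows whose Age failed to parse, then
# 3. cast the remaining values to the exact dtype wanted.
clean = df.dropna(subset=["Age"]).astype({"Age": "int64"})

# Alternative: keep the row and store the gap as <NA> in a nullable integer column.
nullable = df["Age"].astype("Int64")

print(clean)
print(nullable)
```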

In summary, the two work hand in hand, especially when handling nulls.

Answered By: Jeff Erhabor