convert entire pandas dataframe to integers in pandas (0.17.0)
Question:
My question is very similar to this one, but I need to convert my entire dataframe instead of just a series. The to_numeric function only works on one series at a time and is not a good replacement for the deprecated convert_objects command. Is there a way to get similar results to the convert_objects(convert_numeric=True) command in the new pandas release?
Thank you Mike Müller for your example. df.apply(pd.to_numeric) works very well if the values can all be converted to integers. What if my dataframe had strings that could not be converted into integers?
Example:
df = pd.DataFrame({'ints': ['3', '5'], 'Words': ['Kobe', 'Bryant']})
df.dtypes
Out[59]:
Words object
ints object
dtype: object
Then I could run the deprecated function and get:
df = df.convert_objects(convert_numeric=True)
df.dtypes
Out[60]:
Words object
ints int64
dtype: object
Running the apply command gives me errors, even with try and except handling.
Answers:
All columns convertible
You can apply the function to all columns:
df.apply(pd.to_numeric)
Example:
>>> df = pd.DataFrame({'a': ['1', '2'],
...                    'b': ['45.8', '73.9'],
...                    'c': [10.5, 3.7]})
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 3 columns):
a 2 non-null object
b 2 non-null object
c 2 non-null float64
dtypes: float64(1), object(2)
memory usage: 64.0+ bytes
>>> df.apply(pd.to_numeric).info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 3 columns):
a 2 non-null int64
b 2 non-null float64
c 2 non-null float64
dtypes: float64(2), int64(1)
memory usage: 64.0 bytes
Not all columns convertible
pd.to_numeric has the keyword argument errors:
Signature: pd.to_numeric(arg, errors='raise')
Docstring:
Convert argument to a numeric type.
Parameters
----------
arg : list, tuple or array of objects, or Series
errors : {'ignore', 'raise', 'coerce'}, default 'raise'
- If 'raise', then invalid parsing will raise an exception
- If 'coerce', then invalid parsing will be set as NaN
- If 'ignore', then invalid parsing will return the input
Setting it to ignore will return the column unchanged if it cannot be converted into a numeric type.
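To make the difference between the other two settings concrete, here is a minimal sketch on a single Series. The sample data and variable names are made up for illustration; only pd.to_numeric and its errors argument come from the docstring above.

```python
import pandas as pd

# Hypothetical sample column mixing numeric strings with a word.
mixed = pd.Series(['3', '5', 'Kobe'])

# errors='coerce': entries that cannot be parsed become NaN,
# so the resulting column is floating point.
coerced = pd.to_numeric(mixed, errors='coerce')

# errors='raise' (the default): an unparseable entry raises an exception.
try:
    pd.to_numeric(mixed)
    raised = False
except (ValueError, TypeError):
    raised = True

print(coerced)
print("default raised an error:", raised)
```

With 'coerce' the word is lost (it turns into NaN), which is why 'ignore' is the closer match to what the question asks for.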
As pointed out by Anton Protopopov, the most elegant way is to supply ignore as keyword argument to apply():
>>> df = pd.DataFrame({'ints': ['3', '5'], 'Words': ['Kobe', 'Bryant']})
>>> df.apply(pd.to_numeric, errors='ignore').info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 2 columns):
Words 2 non-null object
ints 2 non-null int64
dtypes: int64(1), object(1)
memory usage: 48.0+ bytes
My previously suggested way, using partial from the module functools, is more verbose:
>>> from functools import partial
>>> df = pd.DataFrame({'ints': ['3', '5'],
...                    'Words': ['Kobe', 'Bryant']})
>>> df.apply(partial(pd.to_numeric, errors='ignore')).info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 2 columns):
Words 2 non-null object
ints 2 non-null int64
dtypes: int64(1), object(1)
memory usage: 48.0+ bytes
Apply pd.to_numeric with errors='ignore' and assign the result back to the DataFrame. apply() returns a new DataFrame and does not modify df in place:
import pandas as pd

df = pd.DataFrame({'ints': ['3', '5'], 'Words': ['Kobe', 'Bryant']})
print("Orig: \n", df.dtypes)
df.apply(pd.to_numeric, errors='ignore')  # result is discarded, df stays unchanged
print("\nto_numeric: \n", df.dtypes)
df = df.apply(pd.to_numeric, errors='ignore')  # result is assigned back to df
print("\nto_numeric with assign: \n", df.dtypes)
Output:
Orig:
ints object
Words object
dtype: object
to_numeric:
ints object
Words object
dtype: object
to_numeric with assign:
ints int64
Words object
dtype: object
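As far as I know, newer pandas releases (2.2 and later) deprecate errors='ignore', so here is a sketch of the same idea that does not rely on it. The helper name to_numeric_or_keep is made up for this example and is not part of pandas. The important detail is that the try/except sits inside the function applied to each column, not around the apply() call as a whole.

```python
import pandas as pd


def to_numeric_or_keep(column):
    """Return the column converted to numbers, or unchanged if that fails."""
    try:
        return pd.to_numeric(column)
    except (ValueError, TypeError):
        # At least one value could not be parsed: keep the original column.
        return column


df = pd.DataFrame({'ints': ['3', '5'], 'Words': ['Kobe', 'Bryant']})

# apply() calls the helper once per column and builds a new DataFrame.
converted = df.apply(to_numeric_or_keep)
print(converted.dtypes)
```

As with the snippet above, the result has to be assigned (here to converted) because apply() does not change df itself.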
You can use astype() to convert a series to the desired datatype.
For example:
my_str_df = pd.DataFrame({'column_name': ['20', '30', '40']})
then:
my_int_df = my_str_df['column_name'].astype(int)  # this will be the int type
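Since the question is about a whole dataframe, it may help to see that astype() also works on the DataFrame itself. This is only a sketch with made-up column names (x, y); it assumes every value in the converted columns is a clean integer string.

```python
import pandas as pd

# Hypothetical frame in which every value is an integer stored as a string.
str_df = pd.DataFrame({'x': ['20', '30'], 'y': ['40', '50']})

# Convert every column at once.
all_ints = str_df.astype(int)

# Convert only selected columns by passing a {column: dtype} mapping.
some_ints = str_df.astype({'x': int})

print(all_ints.dtypes)
print(some_ints.dtypes)
```

Unlike pd.to_numeric with errors='ignore', astype(int) raises an error as soon as a column contains a value such as 'Kobe', so it fits best when all columns are known to be convertible.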
The accepted answer with pd.to_numeric() converts a column to float as soon as any value in it requires it. Reading the question in detail, it is about converting any numeric column to integer.
That is why the accepted answer would need a loop over all columns to convert the numbers to int in the end.
Just for completeness, this is even possible without pd.to_numeric(); of course, this is not recommended:
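import pandas as pd

df = pd.DataFrame({'a': ['1', '2'],
                   'b': ['45.8', '73.9'],
                   'c': [10.5, 3.7]})
for i in df.columns:
    try:
        # Go through float first so strings like '45.8' parse, then truncate to int.
        df[i] = df[i].astype(float).astype(int)
    except ValueError:
        # Leave columns that cannot be parsed as numbers unchanged.
        pass
print(df.dtypes)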
Out:
a int32
b int32
c int32
dtype: object
EDITED:
Mind that this not-recommended solution is unnecessarily complicated; pd.to_numeric() accepts the keyword argument downcast='integer' to get integer output where the values allow it (columns with fractional values such as 45.8 stay float), thank you for the comment. This is still missing from the accepted answer, though.
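A short sketch of how downcast='integer' behaves on the example frame, as far as I understand it: a column is downcast to an integer dtype only when every value is a whole number, so getting integers out of fractional data still needs an explicit astype(int), which truncates. The variable names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': ['1', '2'],
                   'b': ['45.8', '73.9'],
                   'c': [10.5, 3.7]})

# Extra keyword arguments to apply() are passed through to pd.to_numeric.
downcast = df.apply(pd.to_numeric, downcast='integer')
print(downcast.dtypes)  # 'a' becomes an integer dtype; 'b' and 'c' stay float

# Forcing integers everywhere: parse first, then truncate the fractions.
truncated = df.apply(pd.to_numeric).astype(int)
print(truncated)
```

Note that the truncating variant silently drops the fractional part (45.8 becomes 45), which may or may not be what is wanted.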