What are all the dtypes that pandas recognizes?
Question:
For pandas, does anyone know whether every data type apart from
(i) float64, int64 (and other variants of np.number such as float32, int8, etc.),
(ii) bool,
(iii) datetime64, timedelta64
(string columns, for example) always has a dtype of object?
Alternatively, I want to know whether there is any data type, apart from (i), (ii) and (iii) in the list above, for which pandas does not make its dtype an object.
Answers:
EDIT Feb 2020 following pandas 1.0.0 release
Pandas mostly uses NumPy arrays and dtypes for each Series (a DataFrame is a collection of Series, each of which can have its own dtype). NumPy’s documentation further explains dtype, data types, and data type objects. In addition, the answer provided by @lcameron05 gives an excellent description of the NumPy dtypes, and the pandas docs on dtypes have a lot of additional information.
The main types stored in pandas objects are float, int, bool,
datetime64[ns], timedelta64[ns], and object. In addition these dtypes
have item sizes, e.g. int64 and int32.
By default integer types are int64 and float types are float64,
REGARDLESS of platform (32-bit or 64-bit). The following will all
result in int64 dtypes.
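As a minimal sketch of that claim (the variable names are illustrative and not taken from the pandas docs), each construction below starts from plain Python integers:

```python
import pandas as pd

# Three ways of building pandas objects from Python ints.
from_list = pd.Series([1, 2, 3])
from_dict = pd.DataFrame({"a": [1, 2]})
from_scalar = pd.DataFrame({"a": 1}, index=[0, 1])

# Print the resulting dtypes of each object.
print(from_list.dtype, from_dict["a"].dtype, from_scalar["a"].dtype)
```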
NumPy, however, will choose platform-dependent types when creating
arrays. The following WILL result in int32 on 32-bit platform.
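A small sketch of the NumPy side, assuming nothing beyond `numpy` itself. The width of the default integer depends on the platform and NumPy version, so no output is shown for the first array; passing an explicit dtype removes the ambiguity.

```python
import numpy as np

# Default integer width is chosen by NumPy for the current platform.
arr = np.array([1, 2])
print(arr.dtype)

# Requesting the width explicitly gives the same result everywhere.
explicit = np.array([1, 2], dtype=np.int64)
print(explicit.dtype)
```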
One of the major changes in version 1.0.0 of pandas is the introduction of pd.NA to represent scalar missing values (rather than the previous values of np.nan, pd.NaT or None, depending on usage).
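A brief sketch of what this means in practice, assuming pandas 1.0 or later (the variable names are illustrative). Without a nullable dtype, a missing value forces an integer column to float; with the nullable "Int64" dtype the column stays integer and the gap is pd.NA.

```python
import pandas as pd

# Default behaviour: None becomes NaN and the column becomes floating-point.
legacy = pd.Series([1, None])

# Nullable integer dtype: the column stays integer, the gap is pd.NA.
nullable = pd.Series([1, None], dtype="Int64")

print(legacy.dtype)
print(nullable.dtype)
print(nullable[1] is pd.NA)
```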
Pandas extends NumPy’s type system and also allows users to write their own extension types. The following lists all of pandas’ extension types.
Kind of data: tz-aware datetime (note that NumPy does not support timezone-aware datetimes).
Data type: DatetimeTZDtype
Scalar: Timestamp
Array: arrays.DatetimeArray
String Aliases: ‘datetime64[ns, <tz>]’
Kind of data: Categorical
Data type: CategoricalDtype
Scalar: (none)
Array: Categorical
String Aliases: ‘category’
Kind of data: period (time spans)
Data type: PeriodDtype
Scalar: Period
Array: arrays.PeriodArray
String Aliases: ‘period[<freq>]’, ‘Period[<freq>]’
Kind of data: sparse
Data type: SparseDtype
Scalar: (none)
Array: arrays.SparseArray
String Aliases: ‘Sparse’, ‘Sparse[int]’, ‘Sparse[float]’
Kind of data: intervals
Data type: IntervalDtype
Scalar: Interval
Array: arrays.IntervalArray
String Aliases: ‘interval’, ‘Interval’, ‘Interval[<numpy_dtype>]’, ‘Interval[datetime64[ns, <tz>]]’, ‘Interval[timedelta64[<freq>]]’
Kind of data: nullable integer
Data type: Int64Dtype, …
Scalar: (none)
Array: arrays.IntegerArray
String Aliases: ‘Int8’, ‘Int16’, ‘Int32’, ‘Int64’, ‘UInt8’, ‘UInt16’, ‘UInt32’, ‘UInt64’
Kind of data: Strings
Data type: StringDtype
Scalar: str
Array: arrays.StringArray
String Aliases: ‘string’
Kind of data: Boolean (with NA)
Data type: BooleanDtype
Scalar: bool
Array: arrays.BooleanArray
String Aliases: ‘boolean’
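The string aliases above can be passed wherever pandas takes a dtype argument. The following is a minimal sketch, assuming pandas 1.0 or later; the variable names are made up for illustration. It builds a few Series with extension dtypes and shows that their dtype objects are pandas classes rather than numpy.dtype instances.

```python
import numpy as np
import pandas as pd

s_cat = pd.Series(["a", "b", "a"], dtype="category")
s_int = pd.Series([1, None, 3], dtype="Int64")
s_str = pd.Series(["x", None], dtype="string")
s_bool = pd.Series([True, None], dtype="boolean")
# tz-aware datetimes get a DatetimeTZDtype automatically.
s_tz = pd.Series(pd.date_range("2020-01-01", periods=2, tz="UTC"))

for s in (s_cat, s_int, s_str, s_bool, s_tz):
    # Extension dtypes are not numpy.dtype instances.
    print(type(s.dtype).__name__, isinstance(s.dtype, np.dtype))
```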
pandas borrows its dtypes from numpy. The following demonstrates this:
>>> import pandas as pd
>>> df = pd.DataFrame({'A': [1, 'C', 2.]})
>>> df['A'].dtype
dtype('O')
>>> type(df['A'].dtype)
<class 'numpy.dtype'>
You can find the list of valid numpy.dtype type codes in the NumPy documentation:
‘?’ boolean
‘b’ (signed) byte
‘B’ unsigned byte
‘i’ (signed) integer
‘u’ unsigned integer
‘f’ floating-point
‘c’ complex floating-point
‘m’ timedelta
‘M’ datetime
‘O’ (Python) objects
‘S’, ‘a’ zero-terminated bytes (not recommended)
‘U’ Unicode string
‘V’ raw data (void)
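To see how these type codes resolve, here is a minimal sketch that passes a handful of them to numpy.dtype. The particular codes and sizes chosen are only examples; the names `codes` and `resolved` are illustrative.

```python
import numpy as np

# A sample of type codes, some with an explicit item size or unit.
codes = ["?", "b", "B", "i4", "u2", "f8", "c16",
         "m8[ns]", "M8[ns]", "O", "S5", "U5"]

# Map each code to the dtype object NumPy builds from it.
resolved = {code: np.dtype(code) for code in codes}

for code, dt in resolved.items():
    print(f"{code!r:>10} -> {dt}")
```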
pandas should support these types. Calling the astype method of a pandas.Series with any of the above codes as the argument makes pandas try to convert the Series to that type (or, at the very least, fall back to object type). 'u' is the only one that I see pandas not understanding at all:
>>> df['A'].astype('u')
TypeError: data type "u" not understood
This is a numpy error: the 'u' must be followed by a number giving the number of bytes per item, and that number must be a valid size:
>>> import numpy as np
>>> np.dtype('u')
TypeError: data type "u" not understood
>>> np.dtype('u1')
dtype('uint8')
>>> np.dtype('u2')
dtype('uint16')
>>> np.dtype('u4')
dtype('uint32')
>>> np.dtype('u8')
dtype('uint64')
>>> # testing another invalid argument
>>> np.dtype('u3')
TypeError: data type "u3" not understood
To summarise, the astype methods of pandas objects will try to do something sensible with any argument that is valid for numpy.dtype. Note that numpy.dtype('f') is the same as numpy.dtype('float32'), numpy.dtype('f8') is the same as numpy.dtype('float64'), and so on. The same goes for passing these arguments to pandas astype methods.
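A short sketch of that equivalence on the pandas side (the Series and variable names are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])

# Short NumPy type codes are accepted by astype just like the long names.
as_f = s.astype("f")     # same as "float32"
as_f8 = s.astype("f8")   # same as "float64"
as_u1 = s.astype("u1")   # same as "uint8"

print(as_f.dtype, as_f8.dtype, as_u1.dtype)
```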
To locate the respective data type classes in NumPy, the pandas docs recommend this:
import numpy as np

def subdtypes(dtype):
    # Recursively collect the subclasses of a NumPy scalar type.
    subs = dtype.__subclasses__()
    if not subs:
        return dtype
    return [dtype, [subdtypes(dt) for dt in subs]]

subdtypes(np.generic)
Output:
[numpy.generic,
[[numpy.number,
[[numpy.integer,
[[numpy.signedinteger,
[numpy.int8,
numpy.int16,
numpy.int32,
numpy.int64,
numpy.int64,
numpy.timedelta64]],
[numpy.unsignedinteger,
[numpy.uint8,
numpy.uint16,
numpy.uint32,
numpy.uint64,
numpy.uint64]]]],
[numpy.inexact,
[[numpy.floating,
[numpy.float16, numpy.float32, numpy.float64, numpy.float128]],
[numpy.complexfloating,
[numpy.complex64, numpy.complex128, numpy.complex256]]]]]],
[numpy.flexible,
[[numpy.character, [numpy.bytes_, numpy.str_]],
[numpy.void, [numpy.record]]]],
numpy.bool_,
numpy.datetime64,
numpy.object_]]
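Pandas accepts these classes as valid types. For example, dtype={'A': np.float64} (older code often wrote np.float, an alias for the built-in float that recent NumPy releases no longer provide).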
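As a hedged illustration of passing such a class as a dtype, the sketch below reads a tiny in-memory CSV with pandas.read_csv. The CSV contents and the column names A and B are invented for the example.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical two-column CSV held in memory instead of a file.
csv_text = "A,B\n1,x\n2,y\n"

# Force column A to float64 even though its values look like integers.
df = pd.read_csv(io.StringIO(csv_text), dtype={"A": np.float64})
print(df.dtypes)
```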
The NumPy docs contain more details and a chart of this type hierarchy.
Building on other answers, pandas also includes a number of its own dtypes.
Pandas and third-party libraries extend NumPy’s type system in a few
places. This section describes the extensions pandas has made
internally. See Extension types for how to write your own extension
that works with pandas. See Extension data types for a list of
third-party libraries that have implemented an extension.
The table listing all of pandas’ extension types is in the pandas documentation (updated link):
https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#basics-dtypes
Also, as of version 1.0, pandas has its own string dtype and nullable dtypes.