pandas distinction between str and object types
Question:
NumPy seems to make a distinction between str and object types. For instance, I can do::
>>> import pandas as pd
>>> import numpy as np
>>> np.dtype(str)
dtype('S')
>>> np.dtype(object)
dtype('O')
where dtype('S') and dtype('O') correspond to str and object, respectively.
However, pandas seems to lack that distinction and coerces str to object::
>>> df = pd.DataFrame({'a': np.arange(5)})
>>> df.a.dtype
dtype('int64')
>>> df.a.astype(str).dtype
dtype('O')
>>> df.a.astype(object).dtype
dtype('O')
Forcing the type to dtype('S') does not help either::
>>> df.a.astype(np.dtype(str)).dtype
dtype('O')
>>> df.a.astype(np.dtype('S')).dtype
dtype('O')
Is there any explanation for this behavior?
Answers:
NumPy's string dtypes aren't Python strings.
Therefore, pandas deliberately uses native Python strings, which require an object dtype.
First off, let me demonstrate a bit of what I mean by numpy’s strings being different:
In [1]: import numpy as np
In [2]: x = np.array(['Testing', 'a', 'string'], dtype='|S7')
In [3]: y = np.array(['Testing', 'a', 'string'], dtype=object)
Now, x is a NumPy string dtype (fixed-width, C-like string) and y is an array of native Python strings.
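The fixed-width layout can be inspected directly. This is a minimal sketch, assuming Python 3 and a recent NumPy, where an S array built from ASCII text holds bytes (so elements compare equal to b'...' literals). The names x and y mirror the session above; the other variable names are only illustrative.

```python
import numpy as np

x = np.array(['Testing', 'a', 'string'], dtype='|S7')
y = np.array(['Testing', 'a', 'string'], dtype=object)

# Every element of x occupies exactly 7 bytes, whatever its real length.
width = x.dtype.itemsize      # 7
total = x.nbytes              # 3 elements * 7 bytes = 21

# Shorter values are null-padded in the buffer; the padding is dropped on access.
short_item = x[1]             # b'a'
raw_slot = x.tobytes()[7:14]  # the second 7-byte slot as stored

# y stores references to ordinary Python str objects instead.
kinds = {type(item) for item in y}
```

The object array pays for its flexibility with one pointer per element plus a separate Python object for each string.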
If we try to go beyond 7 characters, we’ll see an immediate difference. The string dtype versions will be truncated:
In [4]: x[1] = 'a really really really long'
In [5]: x
Out[5]:
array(['Testing', 'a reall', 'string'],
      dtype='|S7')
While the object dtype versions can be arbitrary length:
In [6]: y[1] = 'a really really really long'
In [7]: y
Out[7]: array(['Testing', 'a really really really long', 'string'], dtype=object)
Next, the |S dtype strings can't hold unicode properly, though there is a unicode fixed-length string dtype as well. I'll skip an example for the moment.
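For readers who want the example skipped here, the following is a rough sketch assuming Python 3 and a recent NumPy. It uses the U dtype code for the fixed-width unicode type; the sample words and variable names are made up for illustration.

```python
import numpy as np

# 'U5' is the fixed-width unicode dtype: up to 5 code points per element.
u = np.array(['café', 'naïve'], dtype='U5')
bytes_per_item = u.dtype.itemsize    # 4 bytes per code point * 5 = 20

# Like the S dtype, it silently truncates values that are too long.
u[0] = 'résumé'                      # stored as 'résum'

# The byte-string dtype only accepts ASCII when handed Python text.
try:
    np.array(['café'], dtype='S5')
    ascii_only = False
except UnicodeError:
    ascii_only = True
```

So the unicode dtype fixes the encoding problem but keeps the fixed width, and with it the truncation behavior.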
Finally, numpy’s strings are actually mutable, while Python strings are not. For example:
In [8]: z = x.view(np.uint8)
In [9]: z += 1
In [10]: x
Out[10]:
array(['Uftujoh', 'b!sfbmm', 'tusjoh\x01'],
      dtype='|S7')
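A smaller variation of the same trick, set against a plain Python string, may make the contrast clearer. This is only a sketch, assuming Python 3 (where S elements come back as bytes); the array contents and variable names are illustrative.

```python
import numpy as np

x = np.array(['abc', 'de'], dtype='S3')

# View the same buffer as raw bytes and overwrite the first one in place.
raw = x.view(np.uint8)
raw[0] = ord('X')
changed = x[0]                # b'Xbc': the "string" was modified in place

# A native Python string rejects item assignment.
s = 'abc'
try:
    s[0] = 'X'
    str_is_mutable = True
except TypeError:
    str_is_mutable = False
```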
For all of these reasons, pandas chose not to ever allow C-like, fixed-length strings as a datatype. As you noticed, attempting to coerce a Python string into a fixed-width NumPy string won't work in pandas. Instead, it always uses native Python strings, which behave in a more intuitive way for most users.
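The practical upshot can be checked on a pandas Series. This is a minimal sketch; dtype=object is passed explicitly as an assumption, so that the result does not depend on how a particular pandas version handles strings by default.

```python
import pandas as pd

# Ask for object storage explicitly, then assign a much longer value.
s = pd.Series(['Testing', 'a', 'string'], dtype=object)
s[1] = 'a really really really long'

stored = s[1]                 # the full string, nothing truncated
element_type = type(s[0])     # plain Python str
column_dtype = s.dtype        # object
```

Unlike the |S7 array above, nothing is cut off, because each element is an ordinary Python string held by reference.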