pandas DF from numpy structured array: can't get unicode or string type for column (only object)
Question:
I pull data out of a software system, which gives me a numpy structured array. I convert this to a pandas DataFrame to do work, and then need to convert it back to a structured array so I can push it back into the original system. String/text data shows up in the array as a unicode column and gets described as an object in the DF. I've been trying to figure out how to get it back to either unicode or string, either in the DF or even the ending array, but I'm having trouble. In the interest of asking a single question: how do I get a DF column dtype to be unicode/string?
Here’s what I’m trying, the column ‘region’ is the one I’m focusing on:
import pandas as pd
import numpy as np
arr = np.array([(1, u'01', 7733855, 0), (2, u'01', 7733919, 1244),
(3, u'01', 7732571, 1236), (4, u'01', 7732387, 1234),
(5, u'01', 7733327, 1239), (6, u'01', 7733755, 1241),
(7, u'01', 7732571, 1236), (8, u'01', 7733923, 0),
(9, u'01', 7733327, 1239), (10, u'01', 7733755, 1241)],
dtype=[('hru_id_nat', '<i4'), ('region', '<U255'), ('POI_ID', '<i4'), ('hru_segment', '<i4')])
Then I can make this into a DataFrame:
df = pd.DataFrame(arr)
df.dtypes
shows that ‘region’ has an object dtype:
hru_id_nat int32
region object
POI_ID int32
hru_segment int32
dtype: object
I try to specify the dtypes when converting to DF, but not quite getting it:
n = list(arr.dtype.names)
t = [i[0].name for i in arr.dtype.fields.values()]
dt = [(i, j) for i, j in zip(n, t)]
dt
gets:
[('hru_id_nat', 'int32'),
('region', 'unicode8160'),
('POI_ID', 'int32'),
('hru_segment', 'int32')]
This throws an error when I try to use the dt specification to create the DF:
df = pd.DataFrame(arr, dt)
It doesn’t help if I try these:
dt[1] = ('region', 'unicode')
dt[1] = ('region', 'str')
dt[1] = ('region', np.str)
I’ve also tried convert_type() (based on this post) and df['region'] = df['region'].astype(np.str) (based on this post), but neither seems to change the dtype reported by the DF.
Thanks much for any input.
Answers:
Check out the documentation here.
Here is the code I used to test it:
import pandas as pd
import numpy as np
arr = pd.DataFrame(data=[(1, u'01', 7733855, 0), (2, u'01', 7733919, 1244),
(3, u'01', 7732571, 1236), (4, u'01', 7732387, 1234),
(5, u'01', 7733327, 1239), (6, u'01', 7733755, 1241),
(7, u'01', 7732571, 1236), (8, u'01', 7733923, 0),
(9, u'01', 7733327, 1239), (10, u'01', 7733755, 1241)],)
print arr, '\n', arr.dtypes
arr = arr.astype('string')
arr = arr.astype('int')
print arr.values, '\n', arr.dtypes
The output confirmed that astype worked for me. My versions are Python 2.7.6, pandas 0.13.1, and numpy 1.8.2.
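The answer above is Python 2 code. A rough equivalent of the same test in Python 3 and a current pandas (a sketch, not the answerer's exact code) might look like this; note that astype(str) leaves the column's reported dtype as object, because plain Python strings are stored as objects:

```python
import pandas as pd

# Build a DataFrame from plain tuples; the string column is inferred as object
df = pd.DataFrame(data=[(1, '01', 7733855, 0), (2, '01', 7733919, 1244)])
print(df.dtypes)

# astype(str) guarantees every value is a Python str,
# but the column dtype still reports as object
df[1] = df[1].astype(str)
print(df.dtypes)
```

This illustrates the point made in the next answer: an object dtype is what pandas normally reports for string data.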
Unless I misunderstand (which is entirely possible), I think you have an XY problem here… the pandas DataFrame will never tell you that it has anything with a dtype of ‘unicode’. But your unicode data are perfectly safe stored as ‘object’. All string data are stored as an ‘object’ dtype¹.
The problem of getting back the unicode dtype after converting from the DataFrame shouldn’t be hard. When I take your DataFrame and convert it using the to_records method, I get your string data (‘region’) as type 'O', which is what you probably did:
>>> a = df.to_records()
>>> a
rec.array([(0L, 1, u'01', 7733855, 0), (1L, 2, u'01', 7733919, 1244),
(2L, 3, u'01', 7732571, 1236), (3L, 4, u'01', 7732387, 1234),
(4L, 5, u'01', 7733327, 1239), (5L, 6, u'01', 7733755, 1241),
(6L, 7, u'01', 7732571, 1236), (7L, 8, u'01', 7733923, 0),
(8L, 9, u'01', 7733327, 1239), (9L, 10, u'01', 7733755, 1241)],
dtype=[('index', '<i8'), ('hru_id_nat', '<i4'), ('region', 'O'), ('POI_ID', '<i4'), ('hru_segment', '<i4')])
But getting it back to unicode was as simple as re-using your original datatype object.
>>> dt = {'names':('hru_id_nat', 'region', 'POI_ID', 'hru_segment'),
'formats':('<i4', '<U255', '<i4', '<i4')}
>>> b = a.astype(dt)
>>> b
rec.array([(1, u'01', 7733855, 0), (2, u'01', 7733919, 1244),
(3, u'01', 7732571, 1236), (4, u'01', 7732387, 1234),
(5, u'01', 7733327, 1239), (6, u'01', 7733755, 1241),
(7, u'01', 7732571, 1236), (8, u'01', 7733923, 0),
(9, u'01', 7733327, 1239), (10, u'01', 7733755, 1241)],
dtype=[(u'hru_id_nat', '<i4'), (u'region', '<U255'), (u'POI_ID', '<i4'), (u'hru_segment', '<i4')])
You might need to be wary of the index, so include an index=False keyword in the call to to_records if you don’t want it.
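Putting the pieces of this answer together, the full round trip can be sketched as follows (a minimal example with two rows, assuming the original structured dtype from the question):

```python
import numpy as np
import pandas as pd

# Original structured array with a '<U255' unicode field
arr = np.array([(1, '01', 7733855, 0), (2, '01', 7733919, 1244)],
               dtype=[('hru_id_nat', '<i4'), ('region', '<U255'),
                      ('POI_ID', '<i4'), ('hru_segment', '<i4')])

df = pd.DataFrame(arr)  # 'region' is reported as object in the DataFrame

# Back to a structured array: drop the index and restore the original dtype
out = df.to_records(index=False).astype(arr.dtype)
print(out.dtype)  # 'region' is '<U255' again
```

Reusing arr.dtype directly avoids retyping the field names and formats by hand.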
¹ Prior to pandas 1.0.0, in which StringDtype was introduced. Use of the explicit type in modern versions of pandas is encouraged – see Text data types.
You can use StringDtype, which was introduced in pandas 1.0.0 in January 2020:
import pandas as pd
arr = ([(1, '01', 7733855, 0), (2, '01', 7733919, 1244),
(3, '01', 7732571, 1236), (4, '01', 7732387, 1234),
(5, '01', 7733327, 1239), (6, '01', 7733755, 1241),
(7, '01', 7732571, 1236), (8, '01', 7733923, 0),
(9, '01', 7733327, 1239), (10, '01', 7733755, 1241)])
df = pd.DataFrame(arr, columns=["hru_id_nat", "region", "POI_ID", "hru_segment"])
df["region"] = df["region"].astype(pd.StringDtype())
Now we can use the .str accessor to do string operations:
In [11]: df["region"].str[1]
Out[11]:
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 1
Name: region, dtype: string
Note that as of pandas 1.5.2, the API for StringDtype() is still marked as experimental and subject to change, so use at your own risk in production code.
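One caveat worth noting, since the question is ultimately about round-tripping back to a structured array: even with the "string" dtype, to_records still emits the column as object, so you still need an explicit NumPy unicode dtype on the way out. A small sketch (the '<U255' width is an assumption matching the question's original dtype):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"region": ["01", "02"]})
df["region"] = df["region"].astype(pd.StringDtype())

# to_records converts the extension dtype back to object,
# so cast explicitly to get a fixed-width unicode field
rec = df.to_records(index=False).astype([("region", "<U255")])
print(rec.dtype)
```

So the StringDtype is useful inside pandas, but the astype-to-structured-dtype step from the previous answer is still needed for the export.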