pandas DF from numpy structured array: can't get unicode or string type for column (only object)

Question:

I pull data out of a software system, which gives me a numpy structured array. I convert this to a pandas DataFrame to do some work, and then need to convert it back to a structured array so I can push it back into the original system. String/text data shows up in the array as a unicode column but gets described as object in the DF. I've been trying to figure out how to get it back to either unicode or string, either in the DF or in the final array, but I'm having trouble. In the interest of asking a single question: how do I get a DF column dtype to be unicode/string?

Here’s what I’m trying, the column ‘region’ is the one I’m focusing on:

import pandas as pd
import numpy as np
arr = np.array([(1, u'01', 7733855, 0), (2, u'01', 7733919, 1244),
       (3, u'01', 7732571, 1236), (4, u'01', 7732387, 1234),
       (5, u'01', 7733327, 1239), (6, u'01', 7733755, 1241),
       (7, u'01', 7732571, 1236), (8, u'01', 7733923, 0),
       (9, u'01', 7733327, 1239), (10, u'01', 7733755, 1241)], 
      dtype=[('hru_id_nat', '<i4'), ('region', '<U255'), ('POI_ID', '<i4'), ('hru_segment', '<i4')])

Then I can make into a DF:

df = pd.DataFrame(arr)
df.dtypes

shows that ‘region’ has an object dtype:

hru_id_nat      int32
region         object
POI_ID          int32
hru_segment     int32
dtype: object

I try to specify the dtypes when converting to the DF, but I'm not quite getting it:

n = list(arr.dtype.names)
t = [i[0].name for i in arr.dtype.fields.values()]
dt = [(i, j) for i, j in zip(n, t)]
dt

gets:

[('hru_id_nat', 'int32'),
 ('region', 'unicode8160'),
 ('POI_ID', 'int32'),
 ('hru_segment', 'int32')]

This throws an error when I try to use the dt specification to create the DF:

df = pd.DataFrame(arr, dt)

It doesn't help if I try any of these:

dt[1] = ('region', 'unicode')
dt[1] = ('region', 'str')
dt[1] = ('region', np.str)

I've also tried convert_type() (based on this post) and df['region'] = df['region'].astype(np.str) (based on this post), but neither seems to change the dtype reported by the DF.
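[Editor's note: a minimal sketch, with made-up data, of the behavior described above — the astype(str) call itself succeeds, but pandas still reports the column as object, because all Python strings are held in object columns:]

```python
import pandas as pd

s = pd.Series([u'01', u'02'])

# The cast runs without error, yet the reported dtype stays 'object'
print(s.astype(str).dtype)  # object
```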

Thanks much for any input.

Asked By: Roland


Answers:

Check out the documentation here.

Here is the code I used to test it:

import pandas as pd
import numpy as np
arr = pd.DataFrame(data=[(1, u'01', 7733855, 0), (2, u'01', 7733919, 1244),
       (3, u'01', 7732571, 1236), (4, u'01', 7732387, 1234),
       (5, u'01', 7733327, 1239), (6, u'01', 7733755, 1241),
       (7, u'01', 7732571, 1236), (8, u'01', 7733923, 0),
       (9, u'01', 7733327, 1239), (10, u'01', 7733755, 1241)],) 

print arr, '\n', arr.dtypes
arr = arr.astype('string')
arr = arr.astype('int')
print arr.values, '\n', arr.dtypes

The output was posted as a screenshot in the original answer.

astype worked for me. My versions are Python 2.7.6, pandas 0.13.1, and numpy 1.8.2.

Answered By: Yojimbo

Unless I misunderstand (which is entirely possible), I think you have an XY problem here… the pandas DataFrame will never tell you that it has anything with a dtype of ‘unicode’. But your unicode data are perfectly safe stored as ‘object’. All string data are stored as an ‘object’ dtype.¹
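[Editor's note: a minimal sketch, with made-up data, illustrating the point above — pandas reports the column as object, but each element is still an ordinary Python string:]

```python
import numpy as np
import pandas as pd

arr = np.array([(1, u'01'), (2, u'02')],
               dtype=[('id', '<i4'), ('region', '<U255')])
df = pd.DataFrame(arr)

# The dtype is 'object', but the values themselves are unchanged strings
print(df['region'].dtype)          # object
print(type(df['region'].iloc[0]))  # str (unicode on Python 2)
```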

The problem of getting back the unicode dtype after converting from the DataFrame shouldn’t be hard. When I take your DataFrame and convert it using the to_records method, I get your string data (‘region’) as type 'O', which is what you probably did:

>>> a = df.to_records()
>>> a
rec.array([(0L, 1, u'01', 7733855, 0), (1L, 2, u'01', 7733919, 1244),
       (2L, 3, u'01', 7732571, 1236), (3L, 4, u'01', 7732387, 1234),
       (4L, 5, u'01', 7733327, 1239), (5L, 6, u'01', 7733755, 1241),
       (6L, 7, u'01', 7732571, 1236), (7L, 8, u'01', 7733923, 0),
       (8L, 9, u'01', 7733327, 1239), (9L, 10, u'01', 7733755, 1241)], 
      dtype=[('index', '<i8'), ('hru_id_nat', '<i4'), ('region', 'O'), ('POI_ID', '<i4'), ('hru_segment', '<i4')])

But getting it back to unicode was as simple as re-using your original datatype object.

>>> dt = {'names':('hru_id_nat', 'region', 'POI_ID', 'hru_segment'),
      'formats':('<i4', '<U255', '<i4', '<i4')}
>>> b = a.astype(dt)
>>> b
rec.array([(1, u'01', 7733855, 0), (2, u'01', 7733919, 1244),
       (3, u'01', 7732571, 1236), (4, u'01', 7732387, 1234),
       (5, u'01', 7733327, 1239), (6, u'01', 7733755, 1241),
       (7, u'01', 7732571, 1236), (8, u'01', 7733923, 0),
       (9, u'01', 7733327, 1239), (10, u'01', 7733755, 1241)], 
      dtype=[(u'hru_id_nat', '<i4'), (u'region', '<U255'), (u'POI_ID', '<i4'), (u'hru_segment', '<i4')])

You might need to be wary of the index, so include an index=False keyword in the call to to_records if you don’t want it.
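[Editor's note: a short sketch of that round trip, with made-up data and shortened column names, using index=False so no extra 'index' field appears in the record array:]

```python
import numpy as np
import pandas as pd

arr = np.array([(1, u'01'), (2, u'02')],
               dtype=[('id', '<i4'), ('region', '<U255')])
df = pd.DataFrame(arr)

# index=False keeps the DataFrame index out of the record array,
# so the field layout matches the original structured array
rec = df.to_records(index=False)
out = rec.astype(arr.dtype)
print(out.dtype)  # [('id', '<i4'), ('region', '<U255')]
```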


¹ Prior to version 1.0.0, in which StringDtype was introduced. Use of the explicit type in modern versions of pandas is encouraged – see Text data types.

Answered By: Ajean

You can use StringDtype, which was introduced in pandas 1.0.0 in January 2020:

import pandas as pd
arr = [(1, '01', 7733855, 0), (2, '01', 7733919, 1244),
       (3, '01', 7732571, 1236), (4, '01', 7732387, 1234),
       (5, '01', 7733327, 1239), (6, '01', 7733755, 1241),
       (7, '01', 7732571, 1236), (8, '01', 7733923, 0),
       (9, '01', 7733327, 1239), (10, '01', 7733755, 1241)]
df = pd.DataFrame(arr, columns=["hru_id_nat", "region", "POI_ID", "hru_segment"])
df["region"] = df["region"].astype(pd.StringDtype())

Now we can use the .str accessor to do string operations:

In [11]: df["region"].str[1]
Out[11]:
0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    1
8    1
9    1
Name: region, dtype: string

Note that as of pandas 1.5.2, the API for StringDtype() is still marked as experimental and subject to change, so use it at your own risk in production code.
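[Editor's note: as a side note, the "string" alias can be used instead of instantiating pd.StringDtype() directly, assuming pandas >= 1.0:]

```python
import pandas as pd

s = pd.Series(['01', '02'])
s = s.astype('string')  # equivalent to s.astype(pd.StringDtype())
print(s.dtype)          # string
```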

Answered By: gerrit