How to set dtypes by column in pandas DataFrame

Question:

I want to bring some data into a pandas DataFrame and I want to assign dtypes for each column on import. I want to be able to do this for larger datasets with many different columns, but, as an example:

myarray = np.random.randint(0,5,size=(2,2))
mydf = pd.DataFrame(myarray,columns=['a','b'], dtype=[float,int])
mydf.dtypes

results in:

TypeError: data type not understood

I tried a few other methods such as:

mydf = pd.DataFrame(myarray,columns=['a','b'], dtype={'a': int})

TypeError: object of type 'type' has no len()

If I put dtype=(float,int) it applies a float format to both columns.

In the end I would like to just be able to pass it a list of datatypes the same way I can pass it a list of column names.

Asked By: Chris


Answers:

I just ran into this, and the pandas issue is still open, so I’m posting my workaround. Assuming df is my DataFrame and dtype is a dict mapping column names to types:

for k, v in dtype.items():
    df[k] = df[k].astype(v)

(note: use dtype.iteritems() in python 2)
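
The loop above can likely be collapsed into a single call, because DataFrame.astype also accepts a dict mapping column names to dtypes. A minimal sketch, assuming a small made-up frame (the column names and values are illustrative, not from the question):

```python
import pandas as pd

# Illustrative data; the column names and values are assumptions.
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Mapping of column name -> target dtype, as in the loop above.
dtype = {'a': 'float64', 'b': 'int64'}

# astype with a dict converts only the listed columns
# and returns a new DataFrame rather than modifying df in place.
df = df.astype(dtype)
print(df.dtypes)
```

Columns not named in the dict keep their existing dtype, so the mapping only needs to list the columns you want to change.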


Answered By: mattexx

You may want to try passing in a dictionary of Series objects to the DataFrame constructor – it will give you much more specific control over the creation, and should hopefully be clearer what’s going on. A template version (data1 can be an array etc.):

df = pd.DataFrame({'column1':pd.Series(data1, dtype='type1'),
                   'column2':pd.Series(data2, dtype='type2')})

An example with data:

df = pd.DataFrame({'A':pd.Series([1,2,3], dtype='int'),
                   'B':pd.Series([7,8,9], dtype='float')})

print (df)
   A  B
0  1  7.0
1  2  8.0
2  3  9.0

print (df.dtypes)
A     int32
B    float64
dtype: object
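
To get closer to what the question asks for (a list of dtypes passed alongside a list of column names), the dict-of-Series idea can be wrapped in a small helper. This is only a sketch: frame_from_array is a hypothetical name, not a pandas function, and it assumes the input is a 2-D array with one dtype per column.

```python
import numpy as np
import pandas as pd

def frame_from_array(data, columns, dtypes):
    """Hypothetical helper: build a DataFrame from a 2-D array,
    casting column i to dtypes[i]."""
    data = np.asarray(data)
    if not (data.shape[1] == len(columns) == len(dtypes)):
        raise ValueError("need one name and one dtype per column")
    # One Series per column, each created with its own dtype.
    return pd.DataFrame({
        name: pd.Series(data[:, i], dtype=dt)
        for i, (name, dt) in enumerate(zip(columns, dtypes))
    })

myarray = np.random.randint(0, 5, size=(2, 2))
mydf = frame_from_array(myarray, ['a', 'b'], ['float64', 'int64'])
print(mydf.dtypes)
```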
Answered By: DBCerigo

When working with data types, pass them as strings.

For example, the second method you tried should be changed to

mydf = pd.DataFrame(myarray,columns=['a','b'], dtype={'a': 'int'})

instead of

mydf = pd.DataFrame(myarray,columns=['a','b'], dtype={'a': int})

The dtype (int, float etc.) should be given as strings.

Or, as an alternative (if you don't want to pass strings), import numpy as np and use a NumPy type:
mydf = pd.DataFrame(myarray,columns=['a','b'], dtype={'a': np.int64})

(np.int, an alias of the builtin int, was removed in NumPy 1.24; use int or np.int64 instead.)
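
The documentation quoted elsewhere in this thread says the constructor's dtype argument allows only a single dtype, so a per-column dict may not be accepted there, depending on the pandas version. DataFrame.astype does take such a dict, and it accepts the dtype either as a string or as a type object. A short sketch under that assumption, reusing the question's array:

```python
import numpy as np
import pandas as pd

myarray = np.random.randint(0, 5, size=(2, 2))
mydf = pd.DataFrame(myarray, columns=['a', 'b'])

# A dtype given by name ...
by_name = mydf.astype({'a': 'float64'})
# ... or as a NumPy type object should give the same column dtype.
by_type = mydf.astype({'a': np.float64})

print(by_name['a'].dtype == by_type['a'].dtype)
```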

Answered By: user10983117

As of pandas version 0.24.2 (the current stable release at the time of writing), it is not possible to pass an explicit list of datatypes to the DataFrame constructor, as the docs state:

dtype : dtype, default None

    Data type to force. Only a single dtype is allowed. If None, infer

However, the DataFrame class does have a class method, from_records, that converts a NumPy structured array to a DataFrame, so you can do:

>>> myarray = np.random.randint(0,5,size=(2,2))
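>>> # list(...) is needed on Python 3, where map returns an iterator;
>>> # np.float and np.int were removed in NumPy 1.24, so use explicit types
>>> record = np.array(list(map(tuple,myarray)),dtype=[('a',np.float64),('b',np.int64)])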
>>> mydf = pd.DataFrame.from_records(record)
>>> mydf.dtypes
a    float64
b      int64
dtype: object
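
The structured-array route can also be driven by two parallel lists, one of column names and one of dtypes, which is the shape of input the question asks for. A sketch, assuming the names and dtypes lists below (they are illustrative) and filling the structured array field by field instead of through tuples:

```python
import numpy as np
import pandas as pd

myarray = np.random.randint(0, 5, size=(2, 2))
names = ['a', 'b']                # illustrative column names
dtypes = [np.float64, np.int64]   # one dtype per column

# Allocate a structured array with one named, typed field per column.
record = np.empty(len(myarray), dtype=list(zip(names, dtypes)))

# Copy each column of the 2-D array into its field; NumPy casts on assignment.
for i, name in enumerate(names):
    record[name] = myarray[:, i]

mydf = pd.DataFrame.from_records(record)
print(mydf.dtypes)
```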
Answered By: user545424