Python pandas: how to specify data types when reading an Excel file?

Question:

I am importing an Excel file into a pandas dataframe with the pandas.read_excel() function.

One of the columns is the primary key of the table: it’s all numbers, but it’s stored as text (the little green triangle in the top left of the Excel cells confirms this).

However, when I import the file into a pandas dataframe, the column gets imported as a float. This means that, for example, ‘0614’ becomes 614.

Is there a way to specify the datatype when importing a column? I understand this is possible when importing CSV files but couldn’t find anything in the syntax of read_excel().

The only solution I can think of is to add an arbitrary letter at the beginning of the text (converting ‘0614’ into ‘A0614’) in Excel, to make sure the column is imported as text, and then chopping off the ‘A’ in Python so I can match it to other tables I am importing from SQL.
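
(For illustration only: if I did go that route, chopping the prefix back off in pandas would look something like the line below, with key standing in for the real column name.)

df['key'] = df['key'].str[1:]   # drop the leading 'A' that was added in Excel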

Answers:

You just specify converters. I created an Excel spreadsheet with the following structure:

names   ages
bob     05
tom     4
suzy    3

Where the “ages” column is formatted as strings. To load:

import pandas as pd

df = pd.read_excel('Book1.xlsx', sheet_name='Sheet1', header=0, converters={'names': str, 'ages': str})
>>> df
  names ages
0   bob   05
1   tom    4
2  suzy    3
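
To double-check that the ages really came in as text, you can inspect the dtypes; with the str converters above, both columns should show up as object:

>>> df.dtypes
names    object
ages     object
dtype: object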
Answered By: tnknepp

The read_excel() function has a converters argument, where you can apply functions to the input in certain columns. You can use this to keep the values as strings.
Documentation:

Dict of functions for converting values in certain columns. Keys can either be integers or column labels, values are functions that take one input argument, the Excel cell content, and return the transformed content.

Example code:

pandas.read_excel(my_file, converters={my_str_column: str})
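
The converter does not have to be str itself; any one-argument function will do. For example, you could pad the key back to a fixed width while reading (pad_key is a made-up helper, my_file and my_str_column as above):

# hypothetical converter that keeps the value as text and restores a 4-digit width
def pad_key(cell):
    return str(cell).zfill(4)

pandas.read_excel(my_file, converters={my_str_column: pad_key})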
Answered By: Nix G-D

Starting with v0.20.0, the dtype keyword argument of read_excel() can be used to specify the data types to apply to the columns, just as it does for read_csv().

Using the converters and dtype arguments together on the same column name leads to the latter being shadowed and the former taking preference.
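
A quick way to see that precedence (the column name 'a' here is arbitrary): pass both for the same column and check the resulting dtype; pandas applies the converter and may warn that only the converter will be used.

# both a converter and a dtype for column 'a'; per the note above, the converter wins
df = pd.read_excel('file_name.xlsx', dtype={'a': int}, converters={'a': str})
df['a'].dtype   # object, because the str converter was applied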


1) To stop pandas from inferring dtypes and instead pass the column contents through exactly as they appear in the file, set this argument to str or object so the data isn’t mangled (one such case is leading zeros in numbers, which would be lost otherwise); see the sketch after this list.

pd.read_excel('file_name.xlsx', dtype=str)            # (or) dtype=object

2) It also accepts a dict mapping, where the keys are column names and the values are the data types to set; this is handy when you only want to alter the dtype for a subset of the columns.

import numpy as np
# Assuming the data types for the `a` and `b` columns are to be altered
pd.read_excel('file_name.xlsx', dtype={'a': np.float64, 'b': np.int32})
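
If you take the dtype=str route from 1) to protect leading zeros, you can still cast the genuinely numeric columns back afterwards; a minimal sketch, assuming a hypothetical column named 'amount' that should be numeric:

df = pd.read_excel('file_name.xlsx', dtype=str)
df['amount'] = pd.to_numeric(df['amount'])   # convert selected columns back after the string read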
Answered By: Nickil Maveli

If you don’t know the number and names of the columns in the dataframe, this method can be handy:

# read the header once to collect the column names
column_list = pd.read_excel(file_name, 'Sheet1').columns.tolist()
converter = {col: str for col in column_list}
df_actual = pd.read_excel(file_name, converters=converter)

where column_list is the list of your column names.

Answered By: Tango

If your key has a fixed number of digits, you should probably store it as text rather than as numeric data. You can use the converters argument of read_excel for this.
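
A minimal sketch of that, assuming a file name and a key column literally called key:

df = pd.read_excel('my_file.xlsx', converters={'key': str})   # hypothetical file and column names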

Or, if this does not work, just manipulate your data once it’s read into your dataframe:

df['key_zfill'] = df['key'].astype(str).str.zfill(4)

  names   key key_zfill
0   abc     5      0005
1   def  4962      4962
2   ghi   300      0300
3   jkl    14      0014
4   mno    20      0020
Answered By: jpp

If you don’t know the column names and want to apply the str data type to all columns:

table = pd.read_excel("path_to_filename")
cols = table.columns
conv = dict(zip(cols, [str] * len(cols)))
table = pd.read_excel("path_to_filename", converters=conv)
Answered By: HazimoRa3d

If you are able to read the Excel file correctly and only the integer values are not showing up as expected, you can specify the dtype like this:

df = pd.read_excel('my.xlsx', sheet_name='Sheet1', engine='openpyxl', dtype=str)

This reads the integer values in as strings, so they show up in the dataframe exactly as stored.

Answered By: Rajat Tyagi

converters or dtype won’t always help. Especially for date/time and duration (ideally a mix of both…), post-processing is necessary. In such cases, reading the Excel file’s content into a built-in type and creating the DataFrame from that can be an option.

Here’s an example file. The "duration" column contains duration values in HH:MM:SS and invalid values "-".

(screenshot of the example Excel file)

If the hour is less than 24, Excel formats the entry as a time, otherwise as a duration. We want dtype timedelta for the whole column in the DataFrame. But pandas messes up the import:

import pandas as pd
df = pd.read_excel("path-to-file")

df.duration
# 0               12:30:00
# 1    1900-01-01 00:30:00
# 2                      -
# 3    1900-01-01 23:55:01
# Name: duration, dtype: object

[type(i) for i in df.duration]
# [datetime.time, datetime.datetime, str, datetime.datetime]

Now we have a mix of datetime.datetime and datetime.time objects, and it’s difficult to get back to durations (timedelta)! You could do it directly with a converter, but that does not make it much less difficult.
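
For completeness, such a converter could look roughly like this; it is only a sketch based on the sample values above (entries of 24 hours or more arrive as datetimes counted from 1899-12-31), not a general solution:

import datetime as dt

def as_timedelta(cell):
    # sketch of a converter for the 'duration' column shown above
    if isinstance(cell, dt.time):
        return dt.timedelta(hours=cell.hour, minutes=cell.minute, seconds=cell.second)
    if isinstance(cell, dt.datetime):
        # durations of 24h or more come in as datetimes relative to 1899-12-31
        return cell - dt.datetime(1899, 12, 31)
    return pd.NaT   # e.g. the invalid "-" entries

df = pd.read_excel("path-to-file", converters={"duration": as_timedelta})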

Here, I found it actually easier to use the Excel loader engine (openpyxl) directly:

from openpyxl import load_workbook
wb = load_workbook('path-to-file')

sheet = wb['Tests'] # adjust sheet name, this is for the demo file
data = list(sheet.values) # a list of tuples, one tuple for each row
df = pd.DataFrame(data[1:], columns=data[0]) # first tuple is column names

df['duration']
# 0           12:30:00
# 1     1 day, 0:30:00
# 2                  -
# 3    1 day, 23:55:01
# Name: duration, dtype: object

[type(i) for i in df['duration']]
# [datetime.time, datetime.timedelta, str, datetime.timedelta]

So now we already have some timedelta objects! The conversion of the others to timedelta can be done as simply as

df['duration'] = pd.to_timedelta(df.duration.astype(str), errors='coerce')
df['duration']
# 0   0 days 12:30:00
# 1   1 days 00:30:00
# 2               NaT
# 3   1 days 23:55:01
# Name: duration, dtype: timedelta64[ns]
Answered By: FObersteiner