How to read UTF-8 files with Pandas?

Question:

I have a UTF-8 file with Twitter data and I am trying to read it into a pandas DataFrame, but the columns only come back with dtype ‘object’ instead of unicode strings:

# file 1459966468_324.csv
# 1459966468_324.csv: UTF-8 Unicode English text
df = pd.read_csv('1459966468_324.csv', dtype={'text': unicode})
df.dtypes
text               object
Airline            object
name               object
retweet_count     float64
sentiment          object
tweet_location     object
dtype: object

What is the right way of reading and coercing UTF-8 data into unicode with Pandas?

This does not solve the problem:

df = pd.read_csv('1459966468_324.csv', encoding = 'utf8')
df.apply(lambda x: pd.lib.infer_dtype(x.values))

Text file is here:
https://raw.githubusercontent.com/l1x/nlp/master/1459966468_324.csv

Asked By: Istvan


Answers:

Use the encoding keyword with the appropriate parameter:

df = pd.read_csv('1459966468_324.csv', encoding='utf8')
Answered By: Stefan

As the other poster mentioned, you might try:

df = pd.read_csv('1459966468_324.csv', encoding='utf8')

However, this could still leave you looking at ‘object’ when you print the dtypes, because pandas stores strings in object columns. To confirm that the values are unicode strings, try this line after reading the CSV:

df.apply(lambda x: pd.lib.infer_dtype(x.values))

Example output:

args            unicode
date         datetime64
host            unicode
kwargs          unicode
operation       unicode
Answered By: Sam

Pandas stores strings in columns of dtype object. In Python 3, all strings are Unicode by default, so if you use Python 3 your data is already Unicode (don’t be misled by the object dtype).

If you are on Python 2, use df = pd.read_csv('your_file', encoding = 'utf8'). Then try, for example, pd.lib.infer_dtype(df.iloc[0,0]) (I am assuming the first column consists of strings).
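
A minimal sketch of the Python 3 case, in case it helps. The CSV content here is a made-up in-memory stand-in (not the asker's file), and the column names are only borrowed from the question for illustration. It reads UTF-8 bytes and checks the Python type of an individual value rather than the column dtype:

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a UTF-8 encoded CSV file.
raw = "text,retweet_count\nhéllo wörld,1.0\n¡adiós!,2.0\n".encode("utf-8")

# Decode the bytes as UTF-8 while parsing.
df = pd.read_csv(io.BytesIO(raw), encoding="utf-8")

# The column dtype is a container-level label; look at a single value
# to see what is actually stored.
first_value = df["text"].iloc[0]
print(type(first_value))
print(first_value)
```

On Python 3 the individual values should be plain str objects, which are Unicode, whatever label the column dtype shows.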

Answered By: ptrj

It looks like infer_dtype has moved out of pd.lib. This worked for me on pandas 1.0.1:

df.apply(lambda x: pd.api.types.infer_dtype(x.values))
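
For anyone who wants to try this without the original file, here is a small self-contained sketch. The data is invented for illustration, and it assumes a pandas version where pd.api.types.infer_dtype is available, as in the line above. It collects the inferred type of each column into a dict:

```python
import io

import pandas as pd

# Hypothetical sample data; not the file from the question.
raw = "text,retweet_count\ncafé,1.0\nnaïve,2.0\n".encode("utf-8")
df = pd.read_csv(io.BytesIO(raw), encoding="utf-8")

# infer_dtype inspects the values themselves instead of the column dtype.
inferred = {name: pd.api.types.infer_dtype(df[name]) for name in df.columns}
print(inferred)
```

With this sample, the text column should be reported as "string" and the numeric column as "floating".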
Answered By: cefect

Perhaps the appropriate parameter for the encoding keyword is:

df = pd.read_csv('1459966468_324.csv', encoding='latin1')
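
One caveat worth keeping in mind: latin1 maps every possible byte to a character, so it never raises a decoding error, even on a file that is really UTF-8 (you just get garbled accented characters). A rough sketch of trying UTF-8 first and only then falling back is below. read_csv_with_fallback is a hypothetical helper written for this example, not a pandas function, and it takes raw bytes to keep the example self-contained:

```python
import io

import pandas as pd


def read_csv_with_fallback(source_bytes, encodings=("utf-8", "latin1")):
    """Hypothetical helper: try each encoding in order, return (df, encoding)."""
    for enc in encodings:
        try:
            # A fresh buffer per attempt, since a failed read consumes it.
            return pd.read_csv(io.BytesIO(source_bytes), encoding=enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Made-up sample bytes, one encoded as latin1 and one as UTF-8.
latin_df, latin_enc = read_csv_with_fallback("text\ncafé\n".encode("latin1"))
utf8_df, utf8_enc = read_csv_with_fallback("text\ncafé\n".encode("utf-8"))
print(latin_enc, utf8_enc)
```

The order matters: since latin1 accepts anything, it has to come last in the list.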
Answered By: Colibri

I had the same problem, so I downgraded pandas to version 1.2.4 and now it’s working.

Answered By: user21591625