Unicode error when reading a csv file with pandas

Question:

Why is pandas unable to read this CSV file, returning a 'UnicodeEncodeError' instead? I tried many solutions from Stack Overflow (local download, different encodings, changing the engine…), but none of them worked. How can I fix it?

import pandas as pd
url = 'http://é.com'

pd.read_csv(url,encoding='utf-8')
Asked By: SciPy


Answers:

TL;DR

Your URL contains a non-ASCII character, as the error message says.

Just replace:

url = 'http://é.com'

with:

url = 'http://%C3%A9.com'

And the problem is fixed.

Solutions

Automatic URL escaping

Reading the traceback closely shows that, when the request for the resource behind the URL is executed, read_csv expects the URL to be ASCII-encoded, which is not the case for this specific resource.

This call, which read_csv makes internally, is the one that fails:

import urllib.request
urllib.request.urlopen(url)
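
To make the failure mode concrete without touching the network, here is a minimal sketch. It assumes, as the traceback suggests, that the HTTP client encodes the request line as ASCII before sending it. The path `/données.csv` is a hypothetical stand-in, not the original resource.

```python
from urllib.parse import quote

# Hypothetical path containing an accented character (not the original URL).
path = "/données.csv"

# Encoding the raw request line as ASCII fails on the accented character.
try:
    f"GET {path} HTTP/1.1".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Percent-escaping the path first yields a pure-ASCII request line.
escaped = quote(path)
request_line = f"GET {escaped} HTTP/1.1".encode("ascii")

print(ascii_ok)      # False
print(escaped)       # /donn%C3%A9es.csv
print(request_line)  # b'GET /donn%C3%A9es.csv HTTP/1.1'
```

The `%C3%A9` sequence is simply the UTF-8 encoding of é written as percent-escaped bytes.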

The problem is the accented character é, which must be percent-escaped to prevent urlopen from failing. Here is a clean way to enforce this requirement:

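import urllib.parse

# Split the URL into its components so that only the path gets percent-encoded
result = urllib.parse.urlparse(url)
replaced = result._replace(path=urllib.parse.quote(result.path))
# Reassemble the URL with the escaped path
url = urllib.parse.urlunparse(replaced)

pd.read_csv(url)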
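
If this has to be done for several URLs, the escaping can be wrapped in a small helper. The sketch below is only one possible variant: the name `escape_url` and the example URLs are hypothetical, it also escapes the query string, and it keeps `%` as a safe character so that URLs which are already escaped are not escaped twice. It does not handle non-ASCII host names.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def escape_url(url: str) -> str:
    """Percent-encode non-ASCII characters in the path and query of a URL.

    Hypothetical helper: the host name is left untouched.
    """
    parts = urlsplit(url)
    # '%' stays safe so existing escapes such as %C3%A9 are preserved.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit(parts._replace(path=path, query=query))


# Hypothetical URLs, used only to illustrate the behaviour.
print(escape_url("http://example.com/données.csv"))
# http://example.com/donn%C3%A9es.csv
print(escape_url("http://example.com/donn%C3%A9es.csv"))
# http://example.com/donn%C3%A9es.csv (unchanged)
```

The returned string can then be passed to pd.read_csv as in the snippet above.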

Handling dataflow by yourself

Alternatively, you can bypass this limitation by handling the complete flow yourself. The following snippet does the trick:

import io
import gzip
import pandas as pd
import requests

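url = 'http://é.com'
# Download the raw (gzip-compressed) bytes
response = requests.get(url)
# Wrap the bytes in an in-memory file-like object
file = io.BytesIO(response.content)
# Decompress on the fly and let pandas parse the CSV
with gzip.open(file, 'rb') as handler:
    df = pd.read_csv(handler)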

The key is to fetch the HTTP resource, decompress it, and then expose the content as a file-like object, because read_csv does not read CSV strings directly: it expects a path, a URL, or a file-like object.
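
The decompress-and-wrap step can be tried offline. In the sketch below, which uses only the standard library, the payload is compressed in memory as a stand-in for response.content, and the helper name `open_gzipped_csv` and the sample data are hypothetical.

```python
import csv
import gzip
import io


def open_gzipped_csv(payload: bytes):
    """Return a text file-like object over gzip-compressed CSV bytes."""
    # gzip.open accepts an existing file object in place of a filename.
    return gzip.open(io.BytesIO(payload), mode="rt", encoding="utf-8", newline="")


# Stand-in for response.content: a tiny CSV compressed in memory.
payload = gzip.compress("name,value\né,1\n".encode("utf-8"))

with open_gzipped_csv(payload) as handler:
    rows = list(csv.reader(handler))

print(rows)  # [['name', 'value'], ['é', '1']]
```

The csv module is used here only to keep the sketch dependency-free. The same handler is a file-like object, so it can be given to pd.read_csv instead.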

Answered By: jlandercy