Decoding UTF-8 strings in Python

Question:

I’m writing a web crawler in Python, and it involves taking headlines from websites.

One of the headlines should’ve read: And the Hip’s coming, too

But instead it said: And the Hipâ€™s coming, too

What’s going wrong here?

Asked By: user1624005

Answers:

You need to properly decode the source text. Most likely the source text is in UTF-8 format, not ASCII.

Because you do not provide any context or code in your question, it is not possible to give a more direct answer.

I suggest you study how Unicode and character encodings work in Python:

http://docs.python.org/2/howto/unicode.html
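
To illustrate the point about decoding, here is a minimal sketch of how the mangled headline can arise. It assumes Python 3 and builds the bytes in place rather than fetching a real page; the variable names are hypothetical, and the assumption that the wrong codec was windows-1252 is only a guess based on the symptoms.

```python
# The headline as the site presumably sent it: text encoded to UTF-8 bytes.
headline = "And the Hip\u2019s coming, too"
raw = headline.encode("utf-8")

# Decoding those bytes with the wrong codec yields the mangled text.
wrong = raw.decode("windows-1252")

# Decoding with the codec the bytes were written in recovers the headline.
right = raw.decode("utf-8")

print(wrong)
print(right)
```

In a real crawler the codec should come from the response's declared charset where one is given, instead of being hard-coded.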

Answered By: Mikko Ohtamaa

It’s an encoding error: UTF-8 bytes have been decoded as windows-1252. So if it’s a unicode string (Python 2), this ought to fix it:

text.encode("windows-1252").decode("utf-8")

If it’s a plain byte string (a Python 2 str holding the mangled text as UTF-8), you’ll need an extra step:

text.decode("utf-8").encode("windows-1252").decode("utf-8")

Both of these will give you a unicode string.
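
For Python 3, where every str is already Unicode, the same round trip might look like the sketch below. The helper name fix_mojibake is hypothetical, and falling back to the unchanged input when the round trip fails is my own choice, not part of the one-liners above.

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were mistakenly decoded as windows-1252."""
    try:
        # Re-encode to recover the original bytes, then decode them properly.
        return text.encode("windows-1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not mangled this way; return it unchanged.
        return text


mangled = "And the Hip\u00e2\u20ac\u2122s coming, too"
print(fix_mojibake(mangled))
```

This only repairs the symptom. Decoding the page with the right codec in the first place is the better fix.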

By the way – to discover how a piece of text like this has been mangled due to encoding issues, you can use chardet:

>>> import chardet
>>> chardet.detect(u"And the Hipâ€™s coming, too")
{'confidence': 0.5, 'encoding': 'windows-1252'}

Answered By: Zero Piraeus