I’m using BeautifulSoup to get the text of some HTML nodes. Those nodes contain Unicode characters, which end up as escape sequences in the resulting string.
For example, an HTML element containing:

    50 €

is retrieved by BeautifulSoup via soup.find("h2").text as the string:

    50\u20ac

which is readable in the Python console, but becomes unreadable when written to a JSON file.
Note: I save to JSON using this code:

    with open('file.json', 'w') as fp:
        json.dump(fileToSave, fp)
How can I convert those Unicode characters back to UTF-8 or whatever makes them readable again?
Try encoding the string to UTF-8:

    utf8string = <unicodestring>.encode("utf-8")
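Note that in Python 3, .encode("utf-8") returns a bytes object, which json.dump cannot serialize directly; a quick sketch of what this call actually produces:

```python
# Sketch: what .encode("utf-8") does in Python 3.
s = "50 \u20ac"                   # the str BeautifulSoup returns ("50 €")
utf8bytes = s.encode("utf-8")     # a bytes object, not a str
print(utf8bytes)                  # b'50 \xe2\x82\xac'
print(utf8bytes.decode("utf-8"))  # 50 €

# json.dump cannot serialize bytes, so encoding alone does not help
# when writing JSON; decode back to str (or use ensure_ascii=False).
```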
For Python 2.7, I think you can use
json.dump(obj, fp, ensure_ascii=False). Example:
    import codecs
    import json

    with codecs.open(filename, 'w', encoding='utf-8') as fp:
        # obj is a 'unicode' which contains "50 €"
        json.dump(obj, fp, ensure_ascii=False)
Small demo using Python 3. If you don’t dump to JSON using
ensure_ascii=False, non-ASCII will be written to JSON with Unicode escape codes. That doesn’t affect the ability to load the JSON, but it is less readable in the .json file itself.
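The difference is easy to see with json.dumps (a small sketch using the "50 €" string from the question):

```python
import json

# Default (ensure_ascii=True): non-ASCII characters are escaped.
escaped = json.dumps("50 €")
print(escaped)   # "50 \u20ac"

# ensure_ascii=False: characters are written as literal UTF-8.
readable = json.dumps("50 €", ensure_ascii=False)
print(readable)  # "50 €"

# Either form loads back to the same Python string.
assert json.loads(escaped) == json.loads(readable) == "50 €"
```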
    Python 3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 18:41:36) [MSC v.1900 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from bs4 import BeautifulSoup
    >>> html = '<element>50\u20ac</element>'
    >>> html
    '<element>50€</element>'
    >>> soup = BeautifulSoup(html, 'html')
    >>> soup.find('element').text
    '50€'
    >>> import json
    >>> with open('out.json', 'w', encoding='utf8') as f:
    ...     json.dump(soup.find('element').text, f, ensure_ascii=False)
    ...
    >>> ^Z
Content of out.json (UTF-8-encoded):

    "50€"
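Reading the file back works the same either way; a minimal round-trip sketch (using the file name out.json from the demo):

```python
import json

# Write the string with literal UTF-8 characters...
with open('out.json', 'w', encoding='utf8') as f:
    json.dump('50€', f, ensure_ascii=False)

# ...and load it back; json.load returns the original str.
with open('out.json', encoding='utf8') as f:
    assert json.load(f) == '50€'
```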