Unicode Decoding error when trying to generate pdf with non-ascii characters
Question:
I am working with some software that is generating an error when trying to create a pdf from html that contains non-ascii characters. I have created a much simpler program to reproduce the problem and help me understand what is going on.
#!/usr/bin/python
#coding=utf8
from __future__ import unicode_literals
import pdfkit
from pyPdf import PdfFileWriter, PdfFileReader
f = open('test.html','r')
html = f.read()
print html
pdfkit.from_string(html, 'gen.pdf')
f.close()
Running this program results in:
<html>
<body>
<h1>ر</h1>
</body>
</html>
Traceback (most recent call last):
File "./testerror.py", line 10, in <module>
pdfkit.from_string(html, 'gen.pdf')
File "/usr/local/lib/python2.7/dist-packages/pdfkit/api.py", line 72, in from_string
return r.to_pdf(output_path)
File "/usr/local/lib/python2.7/dist-packages/pdfkit/pdfkit.py", line 136, in to_pdf
input = self.source.to_s().encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd8 in position 18: ordinal not in range(128)
I tried adding a replace statement to strip the problem character, but that also resulted in an error:
Traceback (most recent call last):
File "./testerror.py", line 9, in <module>
html = html.replace('ر','-')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd8 in position 18: ordinal not in range(128)
I am afraid I don’t understand ascii / utf-8 encoding very well. If anyone could help me understand what is going on here, that would be great! I am not sure if this is a problem in the pdf library, or if this is a result of my ignorance of encodings 🙂
Answers:
Reading the pdfkit source code, it appears that pdfkit.from_string expects its first argument to be unicode, not str, so it’s up to you to properly decode html. To do so you must know what encoding your test.html file uses. Once you know that, you just have to proceed:
with open('test.html') as f:
    # f.read() returns a byte string (str) on Python 2; decode it to unicode
    html = f.read().decode('<your-encoding-name-here>')
pdfkit.from_string(html, 'gen.pdf')
Note that str.decode(<encoding>) will return a unicode string and unicode.encode(<encoding>) will return a byte string. In other words, you decode from byte string to unicode and you encode from unicode to byte string.
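To make the decode/encode direction concrete, here is a small sketch using the bytes of the question's test.html. It assumes the file is UTF-8, which is consistent with the 0xd8 byte at position 18 in the traceback, and it is written to run on Python 3 as well as Python 2.7. The variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
# Raw bytes as they would come back from reading a UTF-8 encoded file.
# Assumption: test.html is UTF-8, so the Arabic letter is the pair 0xd8 0xb1.
raw = b"<html>\n<body>\n<h1>\xd8\xb1</h1>\n</body>\n</html>\n"

# decode: byte string -> unicode text
text = raw.decode("utf-8")

# encode: unicode text -> byte string (the round trip gives the same bytes back)
encoded_again = text.encode("utf-8")

# Decoding the same bytes as ASCII fails at the first non-ASCII byte.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = (exc.encoding, exc.start)
```

On Python 2, calling .encode('utf-8') on a str makes the interpreter first decode it implicitly with the ascii codec, which is why the traceback in the question reports a UnicodeDecodeError coming from an encode call.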
In your case you can also use codecs.open(path, mode, encoding) instead of the built-in open() plus explicit decoding, i.e.:
import codecs
with codecs.open('test.html', encoding='<your-encoding-name-here>') as f:
    html = f.read()  # `codecs` will do the decoding behind the scenes
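A self-contained sketch of the same idea follows. It first writes a small temporary file so that it can be run anywhere. The file name and the UTF-8 encoding are assumptions standing in for your real test.html. It also shows io.open, which takes the same encoding argument and exists on both Python 2.7 and Python 3.

```python
import codecs
import io
import os
import tempfile

# Create a small UTF-8 file so the example is self-contained.
# Assumption: the real test.html is UTF-8; substitute its actual encoding.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "test.html")
with open(path, "wb") as f:
    f.write(b"<h1>\xd8\xb1</h1>")

# codecs.open decodes while reading, so read() returns unicode text.
with codecs.open(path, encoding="utf-8") as f:
    html_codecs = f.read()

# io.open accepts the same encoding argument and also returns unicode text.
with io.open(path, encoding="utf-8") as f:
    html_io = f.read()

# Clean up the temporary file and directory.
os.remove(path)
os.rmdir(tmp_dir)
```

Either result can then be passed to pdfkit.from_string in place of the undecoded byte string.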
As a side note:
- read (read binary for codecs, but that’s an implementation detail) is the default mode when opening a file, so there is no need to specify it at all
- using files as context managers (with open(path) as f: ...) makes sure the file will be properly closed. While CPython will usually close open files when the file objects get collected, this is an implementation detail and is not guaranteed by the language, so do not rely on it.
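Both side notes can be checked with a short sketch. It uses a throwaway temporary file and a simulated error, both of which are illustrative assumptions and not part of the original program.

```python
import os
import tempfile

# Create an empty temporary file to open.
fd, path = tempfile.mkstemp(suffix=".html")
os.close(fd)

try:
    with open(path) as f:  # no mode given: the file is opened for reading
        default_mode = f.mode
        raise RuntimeError("simulated failure while processing the file")
except RuntimeError:
    pass

# The with block closed the file even though an exception was raised inside it.
closed_after_error = f.closed
os.remove(path)
```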
Also, the HTML should declare its charset:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
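If the HTML is assembled in Python before being handed to pdfkit, the charset declaration can be added at that point. The helper below is a hypothetical illustration, not part of pdfkit, and it assumes the document will be encoded as UTF-8.

```python
# -*- coding: utf-8 -*-
def wrap_with_charset(body_html, charset="utf-8"):
    """Return a complete HTML document that declares its charset.

    Hypothetical helper for illustration; the name is not from any library.
    """
    return (
        u"<html>\n"
        u"<head>\n"
        u'<meta http-equiv="Content-Type" content="text/html; charset={0}">\n'
        u"</head>\n"
        u"<body>\n{1}\n</body>\n"
        u"</html>\n"
    ).format(charset, body_html)


# Wrap the heading from the question in a full document.
document = wrap_with_charset(u"<h1>\u0631</h1>")
```

The resulting unicode string is what would be passed as the first argument to pdfkit.from_string.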
The question seems to be Python 2 specific. However, I had a similar issue with Python 3 in a Flask + Apache/mod_wsgi environment on Ubuntu 22.04 when passing a non-ASCII string to the header or footer via the from_string options (e.g. document = pdfkit.from_string(html, False, options={"header-left": "é"})). I then got the error UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128). The problem was the missing locale setting for WSGIDaemonProcess in the Apache/VirtualHost configuration. I solved it by passing locale=C.UTF-8:
WSGIDaemonProcess myapp user=myuser group=mygroup threads=5 locale=C.UTF-8 python-home=/path/to/myapp/venv
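To check whether a process is affected by this kind of locale problem, it can help to log the encodings Python has picked up. The sketch below uses only the standard library. The function name is made up for illustration, and where you call it (for example from a temporary Flask route) is up to you.

```python
import locale
import sys


def describe_encodings():
    """Collect the default encodings of the current process (illustrative helper)."""
    return {
        "preferred": locale.getpreferredencoding(False),
        "filesystem": sys.getfilesystemencoding(),
        "stdout": getattr(sys.stdout, "encoding", None),
    }


encodings = describe_encodings()
for name in sorted(encodings):
    print("{0}: {1}".format(name, encodings[name]))
```

Values such as ascii or ANSI_X3.4-1968 would be consistent with the error above. After setting locale=C.UTF-8 on WSGIDaemonProcess, one would expect to see a UTF-8 encoding reported instead.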