u'\ufeff' in Python string

Question:

I got an error with the following exception message:

UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in
position 155: ordinal not in range(128)

Not sure what u'\ufeff' is, it shows up when I’m web scraping. How can I remedy the situation? The .replace() string method doesn’t work on it.

Asked By: James Hallen

||

Answers:

That character is the BOM or “Byte Order Mark”. It is usually received as the first few bytes of a file, telling you how to interpret the encoding of the rest of the data. You can simply remove the character to continue. Although, since the error says you were trying to convert to ‘ascii’, you should probably pick another encoding for whatever you were trying to do.
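
As a minimal sketch of that advice in Python 3 (the sample string `scraped` is made up for illustration), the BOM can be dropped from the start of the text and the result encoded with UTF-8 instead of ASCII:

```python
# Hypothetical scraped text that begins with a BOM (U+FEFF).
scraped = '\ufeffHello, world'

# Encoding to ASCII fails because U+FEFF is outside the ASCII range.
try:
    scraped.encode('ascii')
except UnicodeEncodeError as exc:
    print('ascii failed:', exc.reason)

# Strip the BOM only if it is the very first character.
cleaned = scraped[1:] if scraped.startswith('\ufeff') else scraped

# UTF-8 can represent every Unicode character, so this encode succeeds.
encoded = cleaned.encode('utf-8')

print(repr(cleaned))
print(repr(encoded))
```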

Answered By: swstephe

The content you’re scraping is Unicode rather than ASCII text, and you’re getting a character that doesn’t convert to ASCII. The right ‘translation’ depends on the encoding the original web page used. Python’s unicode page gives the background on how it works.

Are you trying to print the result or stick it in a file? The error suggests it’s writing the data that’s causing the problem, not reading it. This question is a good place to look for the fixes.
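
If the failure does happen while writing, one possible fix in Python 3 is to name the output encoding explicitly. This is only a sketch: the sample text and the temporary file path are assumptions for illustration.

```python
import os
import tempfile

# Hypothetical scraped text with a leading BOM and a non-ASCII letter.
text = '\ufeffcaf\u00e9'

# Throwaway output location, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'scraped.txt')

# Name the encoding explicitly instead of relying on the platform default.
with open(path, 'w', encoding='utf-8') as f:
    f.write(text.lstrip('\ufeff'))

# Read the file back with the same encoding to confirm the round trip.
with open(path, 'r', encoding='utf-8') as f:
    round_trip = f.read()

print(repr(round_trip))
```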

Answered By: theodox

The Unicode character U+FEFF is the byte order mark, or BOM, and is used to tell the difference between big- and little-endian UTF-16 encoding. If you decode the web page using the right codec, Python will remove it for you. Examples:

#!python2
#coding: utf8
u = u'ABC'
e8 = u.encode('utf-8')        # encode without BOM
e8s = u.encode('utf-8-sig')   # encode with BOM
e16 = u.encode('utf-16')      # encode with BOM
e16le = u.encode('utf-16le')  # encode without BOM
e16be = u.encode('utf-16be')  # encode without BOM
print 'utf-8     %r' % e8
print 'utf-8-sig %r' % e8s
print 'utf-16    %r' % e16
print 'utf-16le  %r' % e16le
print 'utf-16be  %r' % e16be
print
print 'utf-8  w/ BOM decoded with utf-8     %r' % e8s.decode('utf-8')
print 'utf-8  w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig')
print 'utf-16 w/ BOM decoded with utf-16    %r' % e16.decode('utf-16')
print 'utf-16 w/ BOM decoded with utf-16le  %r' % e16.decode('utf-16le')

Note that EF BB BF is a UTF-8-encoded BOM. It is not required for UTF-8, but serves only as a signature (usually on Windows).

Output:

utf-8     'ABC'
utf-8-sig '\xef\xbb\xbfABC'
utf-16    '\xff\xfeA\x00B\x00C\x00'    # Adds BOM and encodes using native processor endian-ness.
utf-16le  'A\x00B\x00C\x00'
utf-16be  '\x00A\x00B\x00C'

utf-8  w/ BOM decoded with utf-8     u'\ufeffABC'    # doesn't remove BOM if present.
utf-8  w/ BOM decoded with utf-8-sig u'ABC'          # removes BOM if present.
utf-16 w/ BOM decoded with utf-16    u'ABC'          # *requires* BOM to be present.
utf-16 w/ BOM decoded with utf-16le  u'\ufeffABC'    # doesn't remove BOM if present.

Note that the utf-16 codec relies on a BOM to tell whether the data is big- or little-endian; when decoding a byte string that lacks one, it can only assume the platform’s native byte order.
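
When you only have raw bytes and don't know which codec applies, one option is to look at the leading bytes yourself. The sketch below uses the BOM constants from the standard `codecs` module; the helper name `decode_with_bom` and the UTF-8 fallback are assumptions for illustration. It ignores UTF-32, whose little-endian BOM begins with the same two bytes as the UTF-16 one.

```python
import codecs

# BOM byte sequences from the standard library, each paired with a codec
# that consumes the BOM while decoding.
_BOMS = [
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]


def decode_with_bom(raw, default='utf-8'):
    """Decode bytes, letting a leading BOM (if any) pick the codec."""
    for bom, codec in _BOMS:
        if raw.startswith(bom):
            return raw.decode(codec)
    # No BOM found: fall back to the assumed default encoding.
    return raw.decode(default)


print(repr(decode_with_bom(b'\xef\xbb\xbfABC')))           # UTF-8 with BOM
print(repr(decode_with_bom(b'\xff\xfeA\x00B\x00C\x00')))   # UTF-16 LE with BOM
print(repr(decode_with_bom(b'\xfe\xff\x00A\x00B\x00C')))   # UTF-16 BE with BOM
print(repr(decode_with_bom(b'ABC')))                       # no BOM
```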

Answered By: Mark Tolonen

This problem basically arises when you save your Python code in a UTF-8 or UTF-16 encoding, because a special character is added automatically at the beginning of the file (text editors don’t show it) to identify the encoding format. When you then try to execute the code, you get a syntax error on line 1, i.e. at the start of the code, because the Python compiler expects ASCII encoding.
When you view the contents of the file using the read() function, you can see ‘\ufeff’ at the beginning of the returned text.
The simplest solution to this problem is to change the encoding back to ASCII (for this you can copy your code into Notepad and save it, remembering to choose the ASCII encoding).
Hope this helps.

Answered By: Jagdish Chauhan

I ran into this on Python 3 and found this question (and solution).
When opening a file, Python 3 supports the encoding keyword to automatically handle the encoding.

Without it, the BOM is included in the read result:

>>> f = open('file', mode='r')
>>> f.read()
'\ufefftest'

Giving the correct encoding, the BOM is omitted in the result:

>>> f = open('file', mode='r', encoding='utf-8-sig')
>>> f.read()
'test'
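
To reproduce the difference without an existing file, here is a small self-contained sketch. The temporary file name is an assumption, and the first read passes `encoding='utf-8'` explicitly so the result doesn't depend on the platform default.

```python
import os
import tempfile

# Create a throwaway file that starts with a UTF-8 BOM.
path = os.path.join(tempfile.mkdtemp(), 'bom.txt')
with open(path, 'w', encoding='utf-8-sig') as f:
    f.write('test')

# Plain utf-8 keeps the BOM as the first character of the text.
with open(path, 'r', encoding='utf-8') as f:
    plain = f.read()

# utf-8-sig strips the BOM while decoding.
with open(path, 'r', encoding='utf-8-sig') as f:
    stripped = f.read()

print(repr(plain))
print(repr(stripped))
```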

Just my 2 cents.

Answered By: siebz0r

This is based on the answer from Mark Tolonen. The string includes the word ‘test’ in different languages, separated by ‘|’, so you can see the difference.

u = u'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
e8 = u.encode('utf-8')        # encode without BOM
e8s = u.encode('utf-8-sig')   # encode with BOM
e16 = u.encode('utf-16')      # encode with BOM
e16le = u.encode('utf-16le')  # encode without BOM
e16be = u.encode('utf-16be')  # encode without BOM
print('utf-8     %r' % e8)
print('utf-8-sig %r' % e8s)
print('utf-16    %r' % e16)
print('utf-16le  %r' % e16le)
print('utf-16be  %r' % e16be)
print()
print('utf-8  w/ BOM decoded with utf-8     %r' % e8s.decode('utf-8'))
print('utf-8  w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig'))
print('utf-16 w/ BOM decoded with utf-16    %r' % e16.decode('utf-16'))
print('utf-16 w/ BOM decoded with utf-16le  %r' % e16.decode('utf-16le'))

Here is a test run:

>>> u = u'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
>>> e8 = u.encode('utf-8')        # encode without BOM
>>> e8s = u.encode('utf-8-sig')   # encode with BOM
>>> e16 = u.encode('utf-16')      # encode with BOM
>>> e16le = u.encode('utf-16le')  # encode without BOM
>>> e16be = u.encode('utf-16be')  # encode without BOM
>>> print('utf-8     %r' % e8)
utf-8     b'ABCtest\xce\xb2\xe8\xb2\x9d\xe5\xa1\x94\xec\x9c\x84m\xc3\xa1sb\xc3\xaata|test|\xd8\xa7\xd8\xae\xd8\xaa\xd8\xa8\xd8\xa7\xd8\xb1|\xe6\xb5\x8b\xe8\xaf\x95|\xe6\xb8\xac\xe8\xa9\xa6|\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88|\xe0\xa4\xaa\xe0\xa4\xb0\xe0\xa5\x80\xe0\xa4\x95\xe0\xa5\x8d\xe0\xa4\xb7\xe0\xa4\xbe|\xe0\xb4\xaa\xe0\xb4\xb0\xe0\xb4\xbf\xe0\xb4\xb6\xe0\xb5\x8b\xe0\xb4\xa7\xe0\xb4\xa8|\xd7\xa4\xd6\xbc\xd7\xa8\xd7\x95\xd7\x91\xd7\x99\xd7\xa8\xd7\x9f|ki\xe1\xbb\x83m tra|\xc3\x96l\xc3\xa7ek|'
>>> print('utf-8-sig %r' % e8s)
utf-8-sig b'\xef\xbb\xbfABCtest\xce\xb2\xe8\xb2\x9d\xe5\xa1\x94\xec\x9c\x84m\xc3\xa1sb\xc3\xaata|test|\xd8\xa7\xd8\xae\xd8\xaa\xd8\xa8\xd8\xa7\xd8\xb1|\xe6\xb5\x8b\xe8\xaf\x95|\xe6\xb8\xac\xe8\xa9\xa6|\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88|\xe0\xa4\xaa\xe0\xa4\xb0\xe0\xa5\x80\xe0\xa4\x95\xe0\xa5\x8d\xe0\xa4\xb7\xe0\xa4\xbe|\xe0\xb4\xaa\xe0\xb4\xb0\xe0\xb4\xbf\xe0\xb4\xb6\xe0\xb5\x8b\xe0\xb4\xa7\xe0\xb4\xa8|\xd7\xa4\xd6\xbc\xd7\xa8\xd7\x95\xd7\x91\xd7\x99\xd7\xa8\xd7\x9f|ki\xe1\xbb\x83m tra|\xc3\x96l\xc3\xa7ek|'
>>> print('utf-16    %r' % e16)
utf-16    b"\xff\xfeA\x00B\x00C\x00t\x00e\x00s\x00t\x00\xb2\x03\x9d\x8cTX\x04\xc7m\x00\xe1\x00s\x00b\x00\xea\x00t\x00a\x00|\x00t\x00e\x00s\x00t\x00|\x00'\x06.\x06*\x06(\x06'\x061\x06|\x00Km\xd5\x8b|\x00,nf\x8a|\x00\xc60\xb90\xc80|\x00*\t0\t@\t\x15\tM\t7\t>\t|\x00*\r0\r?\r6\rK\r'\r(\r|\x00\xe4\x05\xbc\x05\xe8\x05\xd5\x05\xd1\x05\xd9\x05\xe8\x05\xdf\x05|\x00k\x00i\x00\xc3\x1em\x00 \x00t\x00r\x00a\x00|\x00\xd6\x00l\x00\xe7\x00e\x00k\x00|\x00"
>>> print('utf-16le  %r' % e16le)
utf-16le  b"A\x00B\x00C\x00t\x00e\x00s\x00t\x00\xb2\x03\x9d\x8cTX\x04\xc7m\x00\xe1\x00s\x00b\x00\xea\x00t\x00a\x00|\x00t\x00e\x00s\x00t\x00|\x00'\x06.\x06*\x06(\x06'\x061\x06|\x00Km\xd5\x8b|\x00,nf\x8a|\x00\xc60\xb90\xc80|\x00*\t0\t@\t\x15\tM\t7\t>\t|\x00*\r0\r?\r6\rK\r'\r(\r|\x00\xe4\x05\xbc\x05\xe8\x05\xd5\x05\xd1\x05\xd9\x05\xe8\x05\xdf\x05|\x00k\x00i\x00\xc3\x1em\x00 \x00t\x00r\x00a\x00|\x00\xd6\x00l\x00\xe7\x00e\x00k\x00|\x00"
>>> print('utf-16be  %r' % e16be)
utf-16be  b"\x00A\x00B\x00C\x00t\x00e\x00s\x00t\x03\xb2\x8c\x9dXT\xc7\x04\x00m\x00\xe1\x00s\x00b\x00\xea\x00t\x00a\x00|\x00t\x00e\x00s\x00t\x00|\x06'\x06.\x06*\x06(\x06'\x061\x00|mK\x8b\xd5\x00|n,\x8af\x00|0\xc60\xb90\xc8\x00|\t*\t0\t@\t\x15\tM\t7\t>\x00|\r*\r0\r?\r6\rK\r'\r(\x00|\x05\xe4\x05\xbc\x05\xe8\x05\xd5\x05\xd1\x05\xd9\x05\xe8\x05\xdf\x00|\x00k\x00i\x1e\xc3\x00m\x00 \x00t\x00r\x00a\x00|\x00\xd6\x00l\x00\xe7\x00e\x00k\x00|"
>>> print()

>>> print('utf-8  w/ BOM decoded with utf-8     %r' % e8s.decode('utf-8'))
utf-8  w/ BOM decoded with utf-8     '\ufeffABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
>>> print('utf-8  w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig'))
utf-8  w/ BOM decoded with utf-8-sig 'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
>>> print('utf-16 w/ BOM decoded with utf-16    %r' % e16.decode('utf-16'))
utf-16 w/ BOM decoded with utf-16    'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
>>> print('utf-16 w/ BOM decoded with utf-16le  %r' % e16.decode('utf-16le'))
utf-16 w/ BOM decoded with utf-16le  '\ufeffABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'

It’s worth knowing that when the encoded data starts with a BOM, only the utf-8-sig and utf-16 codecs give back the original string on decoding.

Answered By: caot