UnicodeDecodeError for invalid sequence beyond requested read

Question:

Suppose I write a UTF-8 encoded string to a file, followed by a byte sequence that is invalid in UTF-8:

with open('/tmp/foo.txt', 'wb') as f:
    f.write('αβγ'.encode('utf-8'))
    f.write(b'\x80')  # invalid in UTF-8; will cause a decode error if read

If I read exactly 3 UTF-8-encoded characters from this file, it should be fine, right?

with open('/tmp/foo.txt', 'r', encoding='utf-8') as f:
    print(f.read(3))

But actually it raises an error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 6: invalid start byte

Why is it trying to decode more than it needs to?

Asked By: rgov


Answers:

TL;DR: This is because Python decodes a chunk of the file in advance, even if you never end up reading that far into it.
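As a side note, if you only need the first few characters and want to guarantee the decoder never touches the bytes past them, one workaround (a sketch, not part of the original question) is to open the file in binary mode and feed an incremental decoder one byte at a time:

```python
import codecs

# Recreate the file from the question: 3 valid characters, then an invalid byte.
with open('/tmp/foo.txt', 'wb') as f:
    f.write('αβγ'.encode('utf-8'))
    f.write(b'\x80')  # invalid in UTF-8

# Feed the decoder byte by byte and stop as soon as we have 3 characters,
# so the invalid trailing byte is never decoded.
decoder = codecs.getincrementaldecoder('utf-8')()
chars = []
with open('/tmp/foo.txt', 'rb') as f:  # binary mode: no eager text decoding
    while len(chars) < 3:
        byte = f.read(1)
        if not byte:
            break
        chars.extend(decoder.decode(byte))

result = ''.join(chars)
print(result)  # αβγ
```

This trades the convenience of text mode for full control over how far decoding proceeds.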

Explanation

I tried adding more valid characters before the invalid byte to see how Python would handle it, and sure enough, adding enough valid characters before the invalid byte silenced the error. I then looked for the magic number of padding characters needed, and it turned out to be 8192, i.e. 2^13. Here's a script that might give you a better idea of what's going on:

chunk_size = 2 ** 13
print(f'chunk size: {chunk_size}')

padding_sizes = [chunk_size-1, chunk_size, chunk_size+1]

for padding_size in padding_sizes:
    with open(f'/tmp/{padding_size}.txt', 'wb') as f:
        f.write(('a'*padding_size).encode('utf-8'))
        f.write(b'\x80')  # invalid in UTF-8; will cause a decode error if read

for padding_size in padding_sizes:
    print()
    print(f'number of padding bytes: {padding_size}')
    for num_bytes in [1, padding_size, padding_size+1]:
        print(f'reading {num_bytes:4} bytes: ', end='')
        try:
            with open(f'/tmp/{padding_size}.txt', 'r', encoding='utf-8') as f:
                f.read(num_bytes)
                print('PASS')
        except UnicodeDecodeError as e:
            print('FAIL')

And the output:

chunk size: 8192

number of padding bytes: 8191
reading    1 bytes: FAIL
reading 8191 bytes: FAIL
reading 8192 bytes: FAIL

number of padding bytes: 8192
reading    1 bytes: PASS
reading 8192 bytes: PASS
reading 8193 bytes: FAIL

number of padding bytes: 8193
reading    1 bytes: PASS
reading 8193 bytes: PASS
reading 8194 bytes: FAIL

Even reading a single character of a file whose first chunk contains the invalid byte results in an error, whereas reading an entire chunk of a file in which the invalid byte is the very next byte raises no error. And, as expected, attempting to read the invalid byte itself always raises an error.

But why? Probably because, most of the time, you'll read thousands of bytes from a file. And while decoding bytes is a fairly cheap operation, the overhead of repeatedly going back to the file, decoding a small number of bytes, and storing the results could become significant if done thousands of times.

Bonus

Interestingly, when using chunk_size + 1 padding bytes, reading chunk_size + 1 bytes raises no error. Out of curiosity, I added an extra test case:

print(f'number of padding bytes: {chunk_size+1}')
print(f'reading {chunk_size} bytes, then 1 byte: ', end='')
try:
    with open(f'/tmp/{chunk_size+1}.txt', 'r', encoding='utf-8') as f:
        f.read(chunk_size)
        f.read(1)
        print('PASS')
except UnicodeDecodeError as e:
    print('FAIL')

And the output:

number of padding bytes: 8193
reading 8192 bytes, then 1 byte: FAIL

This seems to suggest that each call to f.read() decodes at least a chunk of bytes from the file (if those bytes haven't already been decoded by a previous call to f.read()), but it can also decode more, and not just in whole multiples of the chunk size.
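One way to observe that read-ahead directly (again leaning on implementation behavior rather than anything documented) is to check how far the underlying binary buffer has advanced after reading just one character:

```python
# A file of 100,000 valid single-byte characters.
with open('/tmp/big.txt', 'wb') as f:
    f.write(b'a' * 100_000)

with open('/tmp/big.txt', 'r', encoding='utf-8') as f:
    f.read(1)              # ask for one character...
    pos = f.buffer.tell()  # ...but a whole chunk has been pulled and decoded

print(pos)  # 8192 on current CPython
```

`f.buffer` is the documented underlying `BufferedReader` of a text-mode file; its position shows that a single-character read consumed an entire 8192-byte chunk from the binary layer.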

Answered By: rpm