Python seek() and read() count file positions differently

Question:

I’m making a script that displays the text at a specific position in a file. However, there is a discrepancy in how seek() and read() count positions. It goes like this.

My text file is:

1
%
2
%
―
%
4
%
5
%
6

The ‘―’ in the 5th row is a horizontal bar (Unicode U+2015), not a dash. The ‘%’ works as a divider.

The following data works as the file index:

0 2 ['1\n']
4 2 ['2\n']
8 4 ['―\n']
14 2 ['4\n']
18 2 ['5\n']
22 2 ['6\n']

The 1st column is the position of the string in the file, the 2nd is its length, and the 3rd is the text to display (the contents of rows 1, 3, 5, 7, 9, and 11 of the text file).

I’m trying to read the file at specific positions as follows:

f = open('myfile.txt', 'r', encoding='utf-8')
f.seek(start)
text = f.read(length)
f.close()

where ‘start’ and ‘length’ are the first and second columns of the index file, and ‘text’ is the text to display. This works for every row in the index file except the one with the horizontal bar (the 5th row of the text file). seek() interprets the length of the horizontal bar as 3, giving a total length of 4 in the index file (3 for the horizontal bar and 1 for the ‘\n’), while read() interprets the length of the horizontal bar as only 1, producing the following output:

―
%
(blank space)

That is, it includes the horizontal bar, its ‘\n’, the divider, and the divider’s ‘\n’ (four characters). The effect is cumulative: each additional horizontal bar, or any other character that takes more than one byte in UTF-8, increases the number of lines wrongly displayed.

Any idea on how to fix this?

Asked By: Daniel


Answers:

seek is always in terms of bytes,* not characters, even for files opened in text mode.

There’s no way it could work remotely efficiently otherwise: the millionth character in a UTF-8 text file could be at byte 1,000,000 or at byte 2,739,184, and the only way to find out is to go back to the start and decode the first 999,999 characters.**

But read is only reading bytes if you’re in binary mode; in text mode, those bytes are decoded to Unicode strings on the fly. (Since you’re reading the file sequentially, this isn’t usually a performance issue—but when it is, you’ve always got binary mode.)
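
To see the mismatch concretely, here is a minimal sketch that recreates the question's file in a temporary directory and reads 4 units from offset 8 in both modes. The path and variable names are my own, not from the question. It assumes CPython, where seeking a UTF-8 text file to an arbitrary offset happens to land on that byte; the documentation does not guarantee this.

```python
import os
import tempfile

# The file from the question: values separated by '%' divider lines.
content = "1\n%\n2\n%\n\u2015\n%\n4\n%\n5\n%\n6\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "myfile.txt")  # hypothetical location
    # newline="" keeps '\n' as a single byte on every platform.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(content)

    # Text mode: seek() is given a byte offset, read() counts characters.
    with open(path, "r", encoding="utf-8", newline="") as f:
        f.seek(8)
        text_mode = f.read(4)

    # Binary mode: seek() and read() both count bytes.
    with open(path, "rb") as f:
        f.seek(8)
        binary_mode = f.read(4).decode("utf-8")

print(repr(text_mode))    # '―\n%\n'  (4 characters, 6 bytes)
print(repr(binary_mode))  # '―\n'     (4 bytes, 2 characters)
```

The text-mode result is exactly the over-long output described in the question.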

If you have a known position you want to be able to return to, you can “mark” it by calling tell and then seek back to it later, but otherwise, seeking isn’t very useful in text files, except to the start or end of the file of course.
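
A rough sketch of that "mark and return" pattern, assuming the same '%'-divided layout as the question (the file name and the `marks` list are made up for illustration). Note that it uses readline() rather than a for loop, because CPython disables tell() while a text file is being iterated.

```python
import os
import tempfile

content = "1\n%\n2\n%\n\u2015\n%\n4\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "marks.txt")  # hypothetical file name
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(content)

    marks = []  # (cookie from tell(), line text) for every non-divider line
    with open(path, "r", encoding="utf-8", newline="") as f:
        while True:
            cookie = f.tell()      # opaque value; only meaningful to seek()
            line = f.readline()
            if not line:           # empty string means end of file
                break
            if line != "%\n":
                marks.append((cookie, line))

        # Jump back to the third record (the horizontal bar) and re-read it.
        cookie, expected = marks[2]
        f.seek(cookie)
        again = f.readline()

print(repr(again))  # '―\n'
```

The cookies are treated as opaque here: they are stored and handed back to seek(), never computed.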


* In fact, it’s not even documented to be bytes for text files; anything other than 0 or “an opaque number” as returned by tell produces “undefined behavior”. I believe it always will seek to the exact specified byte position—but because of the way the decoder pipeline works, this can cause mojibake even if you don’t seek into the middle of a character, particularly with encodings that use shift codes. To handle these cases, tell makes special snapshots that can be restored on a later seek, but of course there is no snapshot for some random point in the file.

** That’s not quite true—you could build up a table of offsets as you read along, or whenever you try to seek, and maybe even by reading ahead. But this definitely isn’t something you’d want Python doing on every file just for the rare cases where you want to seek by character index; it’s something you’d want to tune specifically to the uncommon case you care about. The linecache module—which is in the standard library because the debugger needs it—does roughly equivalent work, and comes with pretty readable source as long as you ignore the bits about the tokenizer, so if you want to build a character indexer yourself, it may be good sample code to get started.
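
For what it's worth, a table of offsets like that could look roughly like the sketch below. This is not how linecache does it; the function names are invented for illustration. It assumes UTF-8 (so line-break bytes never occur inside a multi-byte character), non-empty data, and an in-range character index.

```python
import bisect

def build_line_table(data, encoding="utf-8"):
    """Return parallel lists: the character offset and the byte offset
    at which each line of `data` (a bytes object) starts."""
    char_starts, byte_starts = [], []
    n_chars = n_bytes = 0
    for raw_line in data.splitlines(keepends=True):
        char_starts.append(n_chars)
        byte_starts.append(n_bytes)
        n_chars += len(raw_line.decode(encoding))
        n_bytes += len(raw_line)
    return char_starts, byte_starts

def char_to_byte(data, char_index, char_starts, byte_starts, encoding="utf-8"):
    """Translate a character index into a byte offset, decoding only
    the one line that contains it."""
    line_no = bisect.bisect_right(char_starts, char_index) - 1
    start = byte_starts[line_no]
    end = byte_starts[line_no + 1] if line_no + 1 < len(byte_starts) else len(data)
    line = data[start:end].decode(encoding)
    prefix = line[: char_index - char_starts[line_no]]
    return start + len(prefix.encode(encoding))

data = "1\n%\n2\n%\n\u2015\n%\n4\n%\n5\n%\n6\n".encode("utf-8")
char_starts, byte_starts = build_line_table(data)

print(char_to_byte(data, 8, char_starts, byte_starts))   # 8: the horizontal bar
print(char_to_byte(data, 10, char_starts, byte_starts))  # 12: the '%' after it
print(char_to_byte(data, 12, char_starts, byte_starts))  # 14: the '4'
```

Building the table costs one pass over the file; after that each lookup is a binary search plus decoding a single line.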

Answered By: abarnert

In Python 3, when you open a file in text mode, e.g., "r", there is a decoder between you and the raw file. In this case, it’s the UTF-8 decoder. "File position" doesn’t really make sense because a character index at the text level is different from a byte index in the file. In addition, Python buffers data in the background to help with decoding.

The solution is to read in binary mode and decode afterwards:

# Open in binary mode so that seek() and read() both count bytes.
with open('myfile.txt', 'rb') as f:
    f.seek(start)
    text = f.read(length).decode('utf-8')
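
If the index itself is generated by a script, it may help to build it in binary mode as well, so that both columns are byte counts by construction. The following is only a sketch under that assumption; `build_index` and `read_record` are hypothetical helpers, not something from the question.

```python
import os
import tempfile

def build_index(path, divider=b"%"):
    """Return (start, length) pairs, in bytes, for every non-divider line."""
    index = []
    offset = 0
    with open(path, "rb") as f:
        for raw_line in f:
            if raw_line.strip() != divider:
                index.append((offset, len(raw_line)))
            offset += len(raw_line)
    return index

def read_record(path, start, length, encoding="utf-8"):
    """Read `length` bytes at byte offset `start` and decode them."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(length).decode(encoding)

content = "1\n%\n2\n%\n\u2015\n%\n4\n%\n5\n%\n6\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "myfile.txt")
    with open(path, "wb") as f:
        f.write(content.encode("utf-8"))

    index = build_index(path)
    records = [read_record(path, start, length) for start, length in index]

print(index)    # [(0, 2), (4, 2), (8, 4), (14, 2), (18, 2), (22, 2)]
print(records)  # ['1\n', '2\n', '―\n', '4\n', '5\n', '6\n']
```

The printed index matches the one in the question, and every record comes back intact, including the one with the horizontal bar.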
Answered By: tdelaney