Learning Python – len() returns 2n+2

Question:

I’m sorry if this is a duplicate post but search seemed to yield no useful results…or maybe I’m such a noob that I’m not understanding what is being said in the answers.

I wrote this small script for practice (following "Learn Python the Hard Way"). I tried to make a shorter version of some code that was already given to me.

from sys import argv

script, from_file, to_file = argv

# here is the part where I tried to simplify the commands and see if I still get the same result,
# Turns out it's the same 2n+2
trial = open(from_file)
trial_data = trial.read()
print(len(trial_data))
trial.close()

# actual code after defining the argument variables
in_file = open(from_file).read()

input(f"Transferring {len(in_file)} characters from {from_file} to {to_file}, hit RETURN to continue, CTRL-C to abort.")
# in_data = in_file.read()

out_file = open(to_file, 'w').write(in_file)

When using len(), it always seems to return 2n+2 instead of n, where n is the actual number of characters in the text file. I also made sure there are no extra lines in the text file.

Can someone kindly explain?

TIA

I was expecting the exact number of characters found in the txt file to be returned. Turns out it’s too much to ask.

Edit: since so many are asking for a practical example….here it goes:

The poem 
dedicated to Puxijn
The Chonk one

What I get is:

ÿþT h e   p o e m

 d e d i c a t e d   t o   P u x i j n

 T h e   C h o n k   o n e

I think it is an encoding problem. I’m using the latest Python, if that is of any help.

Asked By: Mister Mace


Answers:

Possibly the extra characters are newline characters or some other character that your text editor does not display?

Try making a simple test file with only one character, e.g. run

echo "a" > test_file

There is also a dedicated command-line tool for counting such things: wc -m counts characters, whereas wc -c counts bytes.

wc -m
Answered By: partizanos

The observed behaviour is consistent with the file being encoded in UTF-16 with a BOM and its contents being counted byte by byte, as happens when a file is opened in binary mode.

If you call len on contents read that way, it counts the bytes in the file rather than the characters.
The number of bytes depends on the specific encoding.

That explains both the 2n, because UTF-16 uses 2 bytes for each of these characters, and the +2, which is the 2-byte BOM at the start of the file.
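
The 2n+2 arithmetic can be checked directly. This is a minimal sketch, assuming an ASCII-only sample string (the string is only an illustration, not the asker's file):

```python
text = "The poem"
n = len(text)

# Python's "utf-16" codec prepends a 2-byte byte-order mark (BOM) when encoding.
raw = text.encode("utf-16")

print(n)         # number of characters
print(len(raw))  # number of bytes: 2 per character, plus 2 for the BOM
```

Characters outside the Basic Multilingual Plane take 4 bytes in UTF-16, so the 2n part holds only for text like the sample above.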

Answered By: cafce25

Based on your updated question, you’re definitely reading UTF-16 encoded text files using the locale default encoding, probably latin-1 or cp1252, both of which decode the UTF-16 BOM to ÿþ. Windows often uses cp1252 as the default, and latin-1, while largely eclipsed by UTF-8 in the present day, was a popular locale on older UNIX-likes for a long time. These encodings will decode almost any bytes without error, even if the encoding is wrong: latin-1 maps all 256 byte values one to one onto 256 characters, and cp1252 does the same for all but a handful of unassigned values. The result is gibberish for bytes outside the ASCII range, and weird gaps for the null byte that accompanies each ASCII character in UTF-16.
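
The ÿþ prefix and the gaps can be reproduced without any file. This is a small sketch, assuming the file was saved as little-endian UTF-16 with a BOM (which the ÿþ in the question suggests); the sample string is made up:

```python
import codecs

# Bytes as an editor would save "The" in little-endian UTF-16 with a BOM.
raw = codecs.BOM_UTF16_LE + "The".encode("utf-16-le")

wrong = raw.decode("latin-1")  # one character per byte, never raises
right = raw.decode("utf-16")   # BOM is consumed, 2 bytes per character

print(repr(wrong))  # BOM shows up as two characters, plus a NUL after each letter
print(repr(right))
print(len(wrong), len(right))
```

Decoding the same bytes with cp1252 gives the same string here, since the two encodings agree on every byte value that occurs in this sample.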

Change all your open calls to add an extra argument, encoding='utf-16', e.g.:

trial = open(from_file, encoding='utf-16')

and Python will use the correct text encoding to decode the raw bytes to a str, and all your lengths will match up.
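
To see the difference end to end, here is a hedged sketch that writes a throwaway UTF-16 file and reads it back both ways. The file name and temporary directory are assumptions for the demo, and latin-1 stands in for whatever single-byte locale default is in effect:

```python
import os
import tempfile

text = "The poem\ndedicated to Puxijn\nThe Chonk one"

# Hypothetical sample file, written as UTF-16 with a BOM.
# newline="" disables newline translation so the byte count stays predictable.
path = os.path.join(tempfile.mkdtemp(), "poem.txt")
with open(path, "w", encoding="utf-16", newline="") as f:
    f.write(text)

# Wrong: a single-byte encoding yields one character per byte.
with open(path, encoding="latin-1", newline="") as f:
    wrong_len = len(f.read())

# Right: declare the real encoding; the BOM is stripped while decoding.
with open(path, encoding="utf-16") as f:
    right = f.read()

print(len(text), wrong_len, len(right))
```

The middle number comes out as 2n+2 and the last one as n, matching what the question describes.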

Alternatively, when saving the files in your editor, choose an encoding that Python will use by default. In modern Python (3.7 and later), you can force UTF-8 mode regardless of locale settings, with the -X utf8 command-line option or the PYTHONUTF8=1 environment variable. UTF-8 is probably the most popular portable encoding, in part because for pure ASCII text it’s identical to ASCII, wasting no disk space.
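
If re-saving from an editor is inconvenient, the conversion can also be done in Python. This is a sketch only: convert_to_utf8 is a hypothetical helper name, and the file names are placeholders in a temporary directory:

```python
import os
import tempfile


def convert_to_utf8(src, dst, src_encoding="utf-16"):
    """Re-encode a text file as UTF-8; return the number of characters copied.

    Hypothetical helper for illustration, not part of any library.
    """
    with open(src, encoding=src_encoding) as f:
        data = f.read()
    with open(dst, "w", encoding="utf-8") as f:
        f.write(data)
    return len(data)


# Demonstration with throwaway files in a temporary directory.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "poem_utf16.txt")
dst = os.path.join(workdir, "poem_utf8.txt")

with open(src, "w", encoding="utf-16") as f:
    f.write("The Chonk one")

copied = convert_to_utf8(src, dst)
print(copied)                # characters copied
print(os.path.getsize(src))  # bytes on disk as UTF-16 (with BOM)
print(os.path.getsize(dst))  # bytes on disk as UTF-8
```

Passing the encoding explicitly on both open calls keeps the result independent of the locale default.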

Answered By: ShadowRanger