Python str vs unicode on Windows, Python 2.7: why does 'á' become '\xa0'?

Question:

Background

I’m using a Windows machine. I know Python 2.* is not supported anymore, but I’m still learning Python 2.7.16. I also have Python 3.7.1. I know that in Python 3.* "unicode was renamed to str".

I use Git Bash as my main shell.

I read this question. I feel like I understand the difference between Unicode (code points) and encodings (different encoding systems; bytes).

Question

But I don’t get expected results.
When running C:\Python27\python.exe… in Git Bash:

Python 2.7.16 (v2.7.16:413a49145e, Mar  4 2019, 01:37:19) [MSC v.1500 64 bit (AMD64)] on win32

>>> 'á'
'\xa0'
# '\xc3\xa1' expected

>>> len('á')
1
# 2 expected

# one more for reference:
>>> 'à'
'\x85'
# '\xc3\xa0' expected

Can you help me understand why I get the output shown above?

Specifically, why does 'á' become '\xa0'?

What I tried

I can use unicode object to get the results I expect:

>>> u'á'.encode('utf-8')
'\xc3\xa1'
>>> len(u'á'.encode('utf-8'))
2

I can open IDLE and I get different results — not expected results, but at least I understand these results.

Python 2.7.16 (v2.7.16:413a49145e, Mar  4 2019, 01:37:19) [MSC v.1500 64 bit (AMD64)] on win32
>>> 'á'
'\xe1'
>>> len('á')
1
>>> 'à'
'\xe0'

The IDLE results are unexpected, but I still understand them; Martijn Pieters explains why 'á' becomes '\xe1' in the Latin-1 encoding.

So why does IDLE give different results from running my Python 2.7.16 executable directly in Git Bash? In other words, if IDLE is using Latin-1 to encode my input, what encoding is used when I run the Python 2.7.16 executable from Git Bash, such that 'á' becomes '\xa0'?

What I’m wondering

Is my default encoding the problem? I’m too scared to change the default encoding.

>>> import sys; sys.getdefaultencoding()
'ascii'

I feel like it’s my terminal’s encoding that’s the problem (I use Git Bash). Should I try to change the PYTHONIOENCODING environment variable?

I checked the Git Bash locale; the result is:

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=

Also, I’m using interactive Python. Should I try a file instead, using this?

# -*- coding: utf-8 -*- sets the source file's encoding, not the output encoding.

I know upgrading to Python 3 is a solution, but I’m still curious about why my Python 2.7.16 behaves differently.

Asked By: Nate Anderson

||

Answers:

Thanks @dan04, @MarkTolonen and @ (see the comments to the question above). As @MarkTolonen says:

"command prompt uses the default OEM code page (cp437 for US Windows ….)"

This seems clear from checking code page 437 for the values I’m trying to encode:

>>> 'á' #-> '\xa0' expected in code page 437
>>> 'à' #-> '\x85' expected in code page 437

I highlight those values in the screenshot below.
screenshot of code page 437 from https://en.wikipedia.org/wiki/Code_page_437 highlighting the characters à (mapping to byte \x85) and á (mapping to byte \xa0)
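The same lookup can be reproduced in code instead of being read off the table. This is a minimal sketch written for Python 3, where `str.encode` returns the bytes that a Python 2 `str` literal would hold. The helper name `encoded_bytes` is hypothetical, and `latin-1` is included only because the question attributes it to IDLE.

```python
# -*- coding: utf-8 -*-
def encoded_bytes(char, encodings=("cp437", "latin-1", "utf-8")):
    """Map each encoding name to the bytes it produces for `char`."""
    return {name: char.encode(name) for name in encodings}


# á and à, written as escapes so the example does not depend on
# how this snippet itself is saved.
for ch in (u"\xe1", u"\xe0"):
    for name, raw in sorted(encoded_bytes(ch).items()):
        # Show the raw bytes and how many of them there are.
        print(name, repr(raw), len(raw))
```

For these two characters, only the `utf-8` entries are two bytes long; `cp437` and `latin-1` each use a single byte, but a different one.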

I used @MarkTolonen’s suggestion of running the chcp command to get or set the encoding used by my shell/terminal. chcp is short for "change code page". If you’re using Git Bash, use chcp.com instead. Sure enough, when I run chcp, the output is Active code page: 437:

a screenshot of two terminals/shells. On the left, Git Bash, with the command chcp, which returns "bash chcp: command not found". Then the command chcp.com, which returns "Active code page: 437". On the right, cmd (Windows command line), with the command chcp, which returns "Active code page: 437". Then the command where chcp, which returns "C:\Windows\System32\chcp.com"

Then I tried @juanpa.arrivillaga’s suggestion of using a file. First I tried a file that explicitly used the 437 code page.

  1. I added the "magic comment" to specify encoding 437: # -*- coding: cp437 -*-, but that’s not enough to encode the file. The magic comment explains to Python how to decode the file.
  2. I also had to change the encoding of the file (tell my editor, VS Code, how to encode in CP437).

Once I do both those things with a Python file (encode and decode with CP437), I get the same "unexpected" results as my OP, which confirms that CP437 is indeed the encoding used by my terminal/shell.

screenshot of a text file on the left, edited in Visual Studio Code, saved using "CP437" encoding, and prints the value of 'á' and length of 'á'. On the right the output of running the file in Python 2, which shows the "unexpected" results in the OP, confirming that encoding 437 is the reason for those results.
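The two roles described above (the editor encodes the file, the magic comment tells Python how to decode it) can be sketched in memory without a real file. This assumes Python 3, where `compile()` accepts raw bytes and reads the coding line from them; the names `source_text`, `source_bytes` and `namespace` are made up for the illustration.

```python
# The editor's job: turn the text of the script into bytes using CP437.
source_text = u"# -*- coding: cp437 -*-\nvalue = '\xe1'\n"  # \xe1 is á
source_bytes = source_text.encode("cp437")

# In the encoded "file", á is stored as the single byte 0xa0.
print(b"\xa0" in source_bytes)

# The magic comment's job: tell Python how to decode those bytes again.
namespace = {}
exec(compile(source_bytes, "<cp437 demo>", "exec"), namespace)

# Python 3 decodes the literal back to text; Python 2 would instead keep
# the raw byte in a plain str, which is the result seen in the question.
print(namespace["value"] == u"\xe1")
```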

In general you must both encode and include the "decode magic comment", and make sure your shell uses the same encoding!

  • If I include the cp437 "magic comment" without encoding in CP437 (VS Code's default encoding is UTF-8), the length of 'á' is 2, as in UTF-8! (Note the results are printed in my CP437 shell so they look strange; I see the character ├, which is \xc3 in CP437!)
  • If I encode in CP437 but I don’t include the magic comment, I get an error: (SyntaxError: Non-ASCII character '\xa0' in file 437_encoding.py on line 4)

screenshot showing the results of encoding in cp437 without the magic comment, and encoding in utf-8 with a cp437 magic comment
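Both mismatches can be imitated with plain encode/decode calls. The sketch below is for Python 3, and the helper name `mojibake` is hypothetical. The second case uses `ascii` as the reading encoding because Python 2 assumes ASCII source when there is no magic comment; it raises a `UnicodeDecodeError` here, which is analogous to the `SyntaxError` above, not the same error.

```python
def mojibake(text, written_as, read_as):
    """Encode `text` one way and decode the resulting bytes another way."""
    return text.encode(written_as).decode(read_as)


a_acute = u"\xe1"  # á

# Saved as UTF-8 but interpreted as CP437: two bytes, so two characters.
utf8_bytes = a_acute.encode("utf-8")
shown = mojibake(a_acute, "utf-8", "cp437")
print(len(utf8_bytes), len(shown))
# Print escapes rather than the characters, so the output does not
# depend on the terminal's own code page.
print(shown.encode("unicode_escape"))

# Saved as CP437 but read as ASCII: the byte 0xa0 cannot be decoded.
try:
    mojibake(a_acute, "cp437", "ascii")
except UnicodeDecodeError as exc:
    print("cannot decode:", exc.reason)
```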

If I encode in utf-8, and I include the "magic comment" for utf-8, and I change my shell to use utf-8 (chcp.com 65001), then I get the results I expect!

screenshot showing a utf8 encoded file with utf8 magic comment in a shell that is changed to use utf8 encoding (code page 65001), then I get the original results I expected in OP

Finally, if I follow @MarkTolonen’s suggestion and check sys.stdout.encoding, it reports 'cp437'!

  • Please note sys.stdout.encoding (which for me had the value cp437)…
  • is not the same as sys.getdefaultencoding() (which for me had the value ascii)

screenshot showing a shell, which uses chcp.com 437 to change the shell's encoding to cp437, then runs the Python27/python.exe interpreter and prints sys.stdout.encoding (which had cp437 as expected) and sys.getdefaultencoding() (which had ascii, because sys.getdefaultencoding() is not the same as sys.stdout.encoding)
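To see the two settings side by side, a small sketch like the following can be run in any interpreter. The helper name `report_encodings` is hypothetical, and the values depend on the Python version and the shell: the answer's Python 2.7.16 session showed cp437 and ascii, while Python 3 reports utf-8 as the default encoding.

```python
import sys


def report_encodings():
    """Collect the two settings that are easy to confuse."""
    return {
        # Encoding of the output stream, taken from the terminal/shell.
        # It can be None when output is redirected under Python 2.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding used for implicit str/unicode conversions in Python 2.
        "default": sys.getdefaultencoding(),
    }


for name, value in sorted(report_encodings().items()):
    print(name, value)
```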

And if I check sys.stdout.encoding after using chcp.com to change the code page to UTF-8 (value 65001), I get the error LookupError: unknown encoding: cp65001, which is described in more detail here.

A screenshot of a git bash terminal, use chcp.com 65001 to change the shell encoding to UTF-8, then run C:/Python27/python.exe, import sys, and get an error "LookupError: unknown encoding: cp65001"
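Whether a given interpreter knows a code page name can be checked up front with `codecs.lookup`. This is a minimal sketch; the helper name `codec_known` is hypothetical, and the answer for cp65001 depends on the Python version and platform (the Python 2.7.16 session above raised LookupError for it).

```python
import codecs


def codec_known(name):
    """Return True if this interpreter has a codec registered under `name`."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


for name in ("cp437", "utf-8", "cp65001"):
    print(name, codec_known(name))
```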

Answered By: Nate Anderson