Python str vs unicode on Windows, Python 2.7, why does 'á' become '\xa0'
Question:
Background
I’m using a Windows machine. I know Python 2.* is not supported anymore, but I’m still learning Python 2.7.16. I also have Python 3.7.1. I know that in Python 3 unicode was renamed to str.
I use Git Bash as my main shell.
I read this question. I feel like I understand the difference between Unicode (code points) and encodings (different encoding systems; bytes).
Question
- When I evaluate 'á', I expect to get '\xc3\xa1', as shown in this answer.
- When I evaluate len('á'), I expect to get 2, as shown in this answer.
But I don’t get expected results.
When running C:\Python27\python.exe from Git Bash…:
Python 2.7.16 (v2.7.16:413a49145e, Mar 4 2019, 01:37:19) [MSC v.1500 64 bit (AMD64)] on win32
>>> 'á'
'\xa0'
# '\xc3\xa1' expected
>>> len('á')
1
# 2 expected
# one more for reference:
>>> 'à'
'\x85'
# '\xc3\xa0' expected
Can you help me understand why I get the output shown above?
Specifically, why does 'á' become '\xa0'?
What I tried
I can use a unicode object to get the results I expect:
>>> u'á'.encode('utf-8')
'\xc3\xa1'
>>> len(u'á'.encode('utf-8'))
2
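In Python 3, where str is Unicode by default, the analogue of this experiment looks like the following sketch (Python 3 syntax, not the 2.7 session above):

```python
# -*- coding: utf-8 -*-
# In Python 3, 'á' is a one-character str; encoding it yields bytes.
s = u'á'  # the u prefix is optional in Python 3

utf8_bytes = s.encode('utf-8')
print(repr(utf8_bytes))   # b'\xc3\xa1' -- two bytes in UTF-8
print(len(s))             # 1 -- one code point
print(len(utf8_bytes))    # 2 -- two bytes
```

So the "2" above is the length of the encoded bytes, not of the text itself.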
When I open IDLE I get different results: still not the expected ones, but at least I understand them.
Python 2.7.16 (v2.7.16:413a49145e, Mar 4 2019, 01:37:19) [MSC v.1500 64 bit (AMD64)] on win32
>>> 'á'
'\xe1'
>>> len('á')
1
>>> 'à'
'\xe0'
The IDLE results are unexpected, but I still understand them; Martijn Pieters explains why 'á' becomes '\xe1' in the Latin-1 encoding.
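As a quick check (a Python 3 sketch, since the encoding names are the same across versions), Latin-1 really does map these characters to those single bytes:

```python
# Latin-1 (ISO 8859-1) maps U+00E1 ('á') to the single byte 0xE1,
# and U+00E0 ('à') to 0xE0 -- matching the IDLE session above.
print('á'.encode('latin-1'))  # b'\xe1'
print('à'.encode('latin-1'))  # b'\xe0'
```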
So why does IDLE give different results from running my Python 2.7.16 executable directly in Git Bash? In other words, if IDLE is using Latin-1 to encode my input, what encoding is my Git Bash Python 2.7.16 executable using, such that 'á' becomes '\xa0'?
What I’m wondering
Is my default encoding the problem? I’m too scared to change the default encoding.
>>> import sys; sys.getdefaultencoding()
'ascii'
Or is it my terminal’s encoding that’s the problem? (I use Git Bash.) Should I try changing the PYTHONIOENCODING environment variable?
When I check the Git Bash locale, the result is:
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=
Also, I’m using the interactive interpreter; should I try a file instead? (Note that # -*- coding: utf-8 -*- sets the source file's encoding, not the output encoding.)
I know upgrading to Python 3 is a solution, but I’m still curious about why my Python 2.7.16 behaves this way.
Answers:
Thanks @dan04, @MarkTolonen and @ (see the comments to the question above). As @MarkTolonen says:
"the command prompt uses the default OEM code page (cp437 for US Windows …)"
This seems clear from checking code page 437 for the values I’m trying to encode:
>>> 'á' # -> '\xa0' expected in code page 437
>>> 'à' # -> '\x85' expected in code page 437
I highlight those values in the screenshot below.
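Python ships a cp437 codec, so the table lookup can be verified directly without the screenshot (a Python 3 sketch):

```python
# CP437 (the US OEM code page) maps 'á' to 0xA0 and 'à' to 0x85,
# matching the "unexpected" bytes from the Git Bash session.
assert 'á'.encode('cp437') == b'\xa0'
assert 'à'.encode('cp437') == b'\x85'

# Going the other way: the raw byte 0xA0, decoded as CP437, is 'á'.
assert b'\xa0'.decode('cp437') == 'á'
print('cp437 mapping confirmed')
```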
I used @MarkTolonen’s suggestion of running the chcp command to get or set the code page used by my shell/terminal. chcp is short for "change code page". If you’re using Git Bash, use chcp.com instead. Sure enough, when I run chcp, the output is Active code page: 437:
Then I tried @juanpa.arrivillaga’s suggestion of using a file. First I tried a file that explicitly used the 437 code page.
- I added the "magic comment" to declare encoding cp437: # -*- coding: cp437 -*-. But that alone is not enough: the magic comment only tells Python how to decode the file.
- I also had to change the encoding of the file itself (tell my editor, VS Code, to save the file in CP437).
Once I do both of those things (save the file in CP437 and declare CP437 in the magic comment), I get the same "unexpected" results as in my original post, which confirms that CP437 is indeed the encoding used by my terminal/shell.
In general, the file's actual encoding and its "magic comment" must match, and your shell must use the same encoding to display the output correctly!
- If I include the cp437 "magic comment" but the file is saved in VS Code's default encoding (UTF-8), the length of 'á' is 2, as in UTF-8! (The results are printed in my CP437 shell, so they look strange; I see the character ├, which is \xc3 in CP437.)
- If I save the file in CP437 but don't include the magic comment, I get an error: SyntaxError: Non-ASCII character '\xa0' in file 437_encoding.py on line 4.
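That ├ character is classic mojibake: UTF-8 bytes rendered through the CP437 table. A small Python 3 sketch of the mismatch:

```python
# 'á' in UTF-8 is the two bytes 0xC3 0xA1. A CP437 terminal renders
# 0xC3 as '├' and 0xA1 as 'í', so the two-byte string shows as '├í'.
utf8_bytes = 'á'.encode('utf-8')       # b'\xc3\xa1'
as_cp437 = utf8_bytes.decode('cp437')
print(as_cp437)                        # ├í
```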
If I save the file in UTF-8, include the UTF-8 "magic comment", and change my shell to UTF-8 (chcp.com 65001), then I get the results I expect!
Finally, @MarkTolonen’s suggestion to check sys.stdout.encoding confirms the result: it tells me 'cp437'!
- Please note that sys.stdout.encoding (which for me had the value cp437) is not the same as sys.getdefaultencoding() (which for me had the value ascii).
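These are genuinely different settings: sys.getdefaultencoding() is the interpreter-wide default used for implicit str/bytes conversions (always 'ascii' in Python 2, 'utf-8' in Python 3), while sys.stdout.encoding reflects the terminal the process is attached to. A Python 3 sketch:

```python
import sys

# Interpreter-wide default: 'ascii' in Python 2, 'utf-8' in Python 3.
print(sys.getdefaultencoding())

# Encoding of the stream attached to stdout; depends on the terminal
# (e.g. 'cp437' in a US Windows console, 'utf-8' after chcp.com 65001,
# or None when stdout is redirected to a pipe in Python 2).
print(sys.stdout.encoding)
```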
And if I check sys.stdout.encoding after using chcp.com to change the code page to UTF-8 (65001), I get an error, LookupError: unknown encoding: cp65001, which is described in more detail here.