Python: write bytes to a file by redirecting print output
Question:
Using Perl,
$ perl -e 'print "\xca"' > out
and then
$ xxd out
gives
00000000: ca
But with Python, I tried
$ python3 -c 'print("\xca", end="")' > out
$ xxd out
and what I got is
00000000: c38a
I’m not sure what is going on.
Answers:
In Python, "\xca" is a str containing a single Unicode code point, U+00CA. When print writes it to stdout, it is encoded with stdout's encoding, which here is UTF-8, so two bytes, c3 8a, are stored in the file.
In Perl, "\xca" is a single character with the value 0xca, and because no encoding layer is applied to STDOUT by default, it is written to the file as the single byte ca.
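A minimal sketch of that distinction, for illustration only: the string itself has length one, and the byte count depends on which encoding is applied. The variable names here are made up for the example.

```python
# A str holds code points; bytes only appear once an encoding is applied.
s = "\xca"  # one code point, U+00CA

utf8_bytes = s.encode("utf-8")      # encoded with UTF-8
latin1_bytes = s.encode("latin-1")  # encoded with Latin-1

print(len(s), hex(ord(s)))  # 1 0xca
print(utf8_bytes.hex())     # c38a
print(latin1_bytes.hex())   # ca
```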
So in Python, a str object is a series of Unicode code points. Which bytes end up on stdout when it is printed depends on the encoding of your sys.stdout. This is picked based on your locale (environment variables such as PYTHONIOENCODING can also affect it, but by default it is your locale). So yours must be set to UTF-8. That’s my default too:
(py311) Juans-MBP:~ juan$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
(py311) Juans-MBP:~ juan$ python -c "print('\xca', end='')" | xxd
00000000: c38a
However, if I override my locale and tell it to use en_US.ISO8859-1 (latin-1), a single-byte encoding, we get what you expect:
(py311) Juans-MBP:~ juan$ LC_ALL="en_US.ISO8859-1" python -c "print('\xca', end='')" | xxd
00000000: ca
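The same effect can be reproduced without touching the locale. The sketch below wraps an in-memory byte buffer in a text stream with an explicit encoding, which stands in for sys.stdout. The helper name printed_bytes is hypothetical and not part of any library.

```python
import io

def printed_bytes(text, encoding):
    """Return the bytes print() produces on a text stream with the given encoding."""
    raw = io.BytesIO()                                  # stands in for the underlying file
    stream = io.TextIOWrapper(raw, encoding=encoding)   # stands in for sys.stdout
    print(text, end="", file=stream)
    stream.flush()                                      # push encoded bytes into raw
    return raw.getvalue()

utf8_out = printed_bytes("\xca", "utf-8")
latin1_out = printed_bytes("\xca", "latin-1")
print(utf8_out.hex(), latin1_out.hex())  # c38a ca
```

To check which encoding your real stdout uses, you can inspect sys.stdout.encoding.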
The solution is to work with raw bytes if you want raw bytes. The way to do that in Python source code is to use a bytes literal (or a string literal that you then .encode). We can write to the raw buffer at sys.stdout.buffer:
(py311) Juans-MBP:~ juan$ python -c "import sys; sys.stdout.buffer.write(b'\xca')" | xxd
00000000: ca
Or by encoding a string to a bytes object:
(py311) Juans-MBP:~ juan$ python -c "import sys; sys.stdout.buffer.write('\xca'.encode('latin'))" | xxd
00000000: ca
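If the goal is simply to get the byte into a file, another option is to skip the shell redirect and open the file in binary mode, so no text encoding is involved at all. This is a sketch under that assumption; the temporary directory and the file name out are placeholders chosen for the example.

```python
import os
import tempfile

# Placeholder output path inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "out")

# Binary mode ("wb") writes the bytes object as-is, with no encoding step.
with open(path, "wb") as f:
    f.write(b"\xca")

# Read the file back in binary mode to confirm what was stored.
with open(path, "rb") as f:
    data = f.read()

print(data.hex())  # ca
```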