Reading and decoding bytes after modification by external process raises UnicodeDecodeError

Question:

I’m trying to encrypt a string with ansible-vault. To do this, I open two temporary files, one to hold the value to be encrypted and one to hold the password to encrypt with.

Reproducible example:

import os
import subprocess
import tempfile

value = "foo"
password = "bar"

with tempfile.NamedTemporaryFile() as value_file:
    value_file.write(value.encode('utf-8'))
    value_file.flush()
    with tempfile.NamedTemporaryFile() as pass_file:
        pass_file.write(password.encode('utf-8'))
        pass_file.flush()
        subprocess.run(["ansible-vault", "encrypt", value_file.name, "--vault-password-file", pass_file.name], capture_output=True)
    os.system("cat " + value_file.name)
    os.system("xxd " + value_file.name)
    print(value_file.read().decode("utf-8"))

This results in the following output:

$ python src/test.py 
$ANSIBLE_VAULT;1.1;AES256
62653133373465363632343862623335363563666364366465396361633733643238623463343539
6231376131343538346662666133653932306137323131350a383531383561356261366639336461
37306461663030383164633638346566353662356461333163356633613664313762653933313363
3837356531646330660a363033363063396563326562653339633731656666656531353831623065
3539
00000000: 2441 4e53 4942 4c45 5f56 4155 4c54 3b31  $ANSIBLE_VAULT;1
00000010: 2e31 3b41 4553 3235 360a 3632 3635 3331  .1;AES256.626531
00000020: 3333 3337 3334 3635 3336 3336 3332 3334  3337346536363234
00000030: 3338 3632 3632 3333 3335 3336 3335 3633  3862623335363563
00000040: 3636 3633 3634 3336 3634 3635 3339 3633  6663643664653963
00000050: 3631 3633 3337 3333 3634 3332 3338 3632  6163373364323862
00000060: 3334 3633 3334 3335 3339 0a36 3233 3133  3463343539.62313
00000070: 3736 3133 3133 3433 3533 3833 3436 3636  7613134353834666
00000080: 3236 3636 3133 3336 3533 3933 3233 3036  2666133653932306
00000090: 3133 3733 3233 3133 3133 3530 6133 3833  137323131350a383
000000a0: 3533 3133 3833 3536 3133 3536 3236 3133  5313835613562613
000000b0: 3636 3633 3933 3336 3436 310a 3337 3330  66639336461.3730
000000c0: 3634 3631 3636 3330 3330 3338 3331 3634  6461663030383164
000000d0: 3633 3336 3338 3334 3635 3636 3335 3336  6336383465663536
000000e0: 3632 3335 3634 3631 3333 3331 3633 3335  6235646133316335
000000f0: 3636 3333 3631 3336 3634 3331 3337 3632  6633613664313762
00000100: 3635 3339 3333 3331 3333 3633 0a33 3833  653933313363.383
00000110: 3733 3536 3533 3136 3436 3333 3036 3630  7356531646330660
00000120: 6133 3633 3033 3333 3633 3036 3333 3936  a363033363063396
00000130: 3536 3333 3236 3536 3236 3533 3333 3936  5633265626533396
00000140: 3333 3733 3136 3536 3636 3636 3536 3533  3373165666665653
00000150: 3133 3533 3833 3136 3233 3036 350a 3335  1353831623065.35
00000160: 3339 0a                                  39.
Traceback (most recent call last):
  File "src/test.py", line 18, in <module>
    print(value_file.read().decode("utf-8"))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9a in position 3: invalid start byte

I run os.system("cat " + value_file.name) to verify that the file gets correctly encrypted, and os.system("xxd " + value_file.name) to see the individual bytes I’m expecting to read back. The exception doesn’t line up with this output: the byte at position 3 in the hex dump is 0x53 ("S"), not 0x9a. Why would that be?

I assumed it might have had something to do with the file remaining open while ansible-vault modifies it, but replacing the subprocess.run(...) call with os.system("echo baz>" + value_file.name) seemed to work just fine.

I also tried different encodings; the only one that didn’t raise an exception was iso-8859-1, which dumped out complete gibberish.

Seeking to the start of the file using value_file.seek(0) before reading also made no difference.

Asked By: Tadej Gašparovič


Answers:

Having an external utility modify a file that you have open is simply not a good idea. ansible-vault is almost certainly going to create a new file of the same name when it writes its output; your script (on Unix-ish systems at least) will be left with a file handle to the original, now-deleted, file.
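
A minimal sketch of that effect on a Unix-like system, with os.replace standing in for whatever the external tool does. The file names are purely illustrative, and this is not ansible-vault's actual code:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "value")

    with open(path, "w+b") as handle:
        handle.write(b"foo")
        handle.flush()

        # Stand-in for the external tool: write a brand-new file,
        # then rename it over the original name.
        new_path = path + ".new"
        with open(new_path, "wb") as replacement:
            replacement.write(b"encrypted")
        os.replace(new_path, path)

        # Our handle still refers to the original (now unlinked) file.
        handle.seek(0)
        via_old_handle = handle.read()
        old_inode = os.fstat(handle.fileno()).st_ino
        new_inode = os.stat(path).st_ino

    # Opening the name again resolves to the replacement file.
    with open(path, "rb") as fresh:
        via_new_handle = fresh.read()

print(via_old_handle, via_new_handle, old_inode != new_inode)
```

The old handle still returns b"foo", the fresh open() returns b"encrypted", and the two inode numbers differ, which is why cat and xxd (which open the file by name) see the encrypted output while value_file.read() does not.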

So normally, you wouldn’t expect to be able to read anything at all from value_file, not even the "foo" that you wrote to it, because the current file position will be just after that. But ansible-vault actually calls the shred utility on the file before deleting it, in the (possibly incorrect) belief that this would make it harder to recover the unencrypted data. shred works by overwriting the file contents with random bytes, and its default behavior is to round the size of the random data up to a multiple of the disk block size. The old file therefore now extends well past your file position, and read() returns random bytes from it. That mysterious 0x9a is just one of those random bytes: the first one that happened not to be valid UTF-8. It also explains why iso-8859-1 was the only encoding that didn’t raise: it maps every possible byte value to a character, so random data decodes to gibberish instead of failing.
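
One way to avoid the problem, sketched under the assumption that the tool is free to replace the file: close your own handle before the tool runs, keep the file in a private temporary directory, and open the path afresh afterwards. The helper names run_tool_and_read_back and fake_tool_command are made up for this illustration; fake_tool_command merely imitates a tool that writes a new file and renames it over the old one.

```python
import os
import subprocess
import sys
import tempfile


def run_tool_and_read_back(value, build_command):
    """Write `value` (bytes) to a file, run an external tool on it, re-read it by path."""
    with tempfile.TemporaryDirectory() as workdir:
        value_path = os.path.join(workdir, "value")
        with open(value_path, "wb") as f:
            f.write(value)
        # Our handle is closed here, before the tool touches the file.
        subprocess.run(build_command(value_path), check=True, capture_output=True)
        # A fresh open() resolves the name again, so we get whatever file
        # the tool left at that path, not the old (possibly shredded) one.
        with open(value_path, "rb") as f:
            return f.read()


def fake_tool_command(path):
    """Hypothetical stand-in for the real tool: replaces `path` with a new file."""
    script = (
        "import os, sys\n"
        "p = sys.argv[1]\n"
        "with open(p + '.new', 'wb') as f:\n"
        "    f.write(b'replaced')\n"
        "os.replace(p + '.new', p)\n"
    )
    return [sys.executable, "-c", script, path]


result = run_tool_and_read_back(b"foo", fake_tool_command)
print(result)
```

For the case in the question, build_command would instead return ["ansible-vault", "encrypt", path, "--vault-password-file", pass_path], with the password file written and closed in the same directory beforehand; the result can then be decoded as UTF-8.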

Answered By: jasonharper