What is the encoding of CompletedProcess.stdout coming from Powershell/Windows in Python?
Question:
I am getting this output from a ping request started from Python with subprocess.run():
>>> process.stdout
b"\r\nEnvoi d'une requ\x88te 'ping' sur www.google.fr [142.250.179.195] avec 32 octets de donn\x82es\xff:\r\nR\x82ponse de 142.250.179.195\xff: octets=32 temps=39 ms TTL=110\r\nR\x82ponse de 142.250.179.195\xff: octets=32 temps=46 ms TTL=110\r\nR\x82ponse de 142.250.179.195\xff: octets=32 temps=37 ms TTL=110\r\n\r\nStatistiques Ping pour 142.250.179.195:\r\n Paquets\xff: envoy\x82s = 3, re\x87us = 3, perdus = 0 (perte 0%),\r\nDur\x82e approximative des boucles en millisecondes :\r\n Minimum = 37ms, Maximum = 46ms, Moyenne = 40ms\r\n"
I run this script from PyCharm, which runs PowerShell, on a French-language Windows 10 21H2 system, so I expect the encoding to be Windows-1252. That is also what chardet guesses:
>>> chardet.detect(process.stdout)
{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}
However, decoding this as Windows-1252 does not look right:
>>> process.stdout.decode("windows-1252")
"\r\nEnvoi d'une requˆte 'ping' sur www.google.fr [142.250.179.195] avec 32 octets de donn‚esÿ:\r\nR‚ponse de 142.250.179.195ÿ: octets=32 temps=39 ms TTL=110\r\nR‚ponse de 142.250.179.195ÿ: octets=32 temps=46 ms TTL=110\r\nR‚ponse de 142.250.179.195ÿ: octets=32 temps=37 ms TTL=110\r\n\r\nStatistiques Ping pour 142.250.179.195:\r\n Paquetsÿ: envoy‚s = 3, re‡us = 3, perdus = 0 (perte 0%),\r\nDur‚e approximative des boucles en millisecondes :\r\n Minimum = 37ms, Maximum = 46ms, Moyenne = 40ms\r\n"
0x88 should be ê and 0x82 should be è.
Answers:
Console applications on Windows typically encode their output using the console’s active code page, which by default is the system’s legacy OEM code page (e.g., CP437 on US-English systems, or CP850 on French and many other Western European systems), not the legacy ANSI code page used by GUI applications (e.g., Windows-1252).
You can use the following code to determine the console’s active code page and decode based on it:
import ctypes
import subprocess
# Get the console's active code page, as an integer.
oemCP = ctypes.windll.kernel32.GetConsoleOutputCP()
process = subprocess.run('ping.exe', capture_output=True)
# Decode based on the console's active code page.
print(process.stdout.decode("cp" + str(oemCP)))
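One caveat with the code above: GetConsoleOutputCP() returns 0 when it fails, which can happen if the Python process has no console attached (possibly the case under some IDE run configurations). The following is only a sketch of a more defensive variant, assuming that falling back to GetOEMCP() is acceptable in that situation. The helper names console_encoding and decode_console_output are made up for illustration, and the non-Windows branch exists only so the code can be imported elsewhere.

```python
import ctypes
import locale
import sys
from typing import Optional


def console_encoding() -> str:
    """Best-effort name of the encoding console programs are likely to use."""
    if sys.platform == "win32":
        kernel32 = ctypes.windll.kernel32
        # GetConsoleOutputCP() returns 0 on failure (e.g. no console attached);
        # in that case fall back to the system's OEM code page.
        code_page = kernel32.GetConsoleOutputCP() or kernel32.GetOEMCP()
        # Code page 65001 is UTF-8; other code pages map to Python's "cpNNN" codecs.
        return "utf-8" if code_page == 65001 else f"cp{code_page}"
    # Assumption: outside Windows, the locale's preferred encoding is good enough.
    return locale.getpreferredencoding(False)


def decode_console_output(raw: bytes, encoding: Optional[str] = None) -> str:
    """Decode captured output, replacing undecodable bytes instead of raising."""
    return raw.decode(encoding or console_encoding(), errors="replace")
```

With the question's setup, decode_console_output(process.stdout) would then pick the code page automatically, while an explicit call such as decode_console_output(process.stdout, "cp850") overrides it.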
A note on detecting the encoding:
- The prevalent single-byte code pages that are used as the OEM and ANSI code pages do not use BOMs, and virtually every byte value maps to a valid character.
- This ultimately makes any attempt to detect an unknown encoding guesswork, though the probability of guessing right can be improved with sophisticated linguistic analysis.
- I don’t know what approach chardet.detect() uses, but in this case it guessed incorrectly; that it was a guess can be inferred from the presence of a confidence value.
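To illustrate why a successful decode proves nothing, here is a small sketch that counts how many of the 256 possible byte values each codec rejects. The function name undecodable_bytes is made up for this example, and the result reflects Python's codec tables rather than Windows' own conversion routines.

```python
def undecodable_bytes(encoding: str) -> list:
    """Return the byte values (0-255) that the given codec refuses to decode."""
    rejected = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            rejected.append(value)
    return rejected


for encoding in ("cp437", "cp850", "latin-1", "cp1252"):
    # The OEM code pages accept every byte; Python's cp1252 codec leaves
    # only a handful of byte values undefined.
    print(encoding, len(undecodable_bytes(encoding)))
```

Since the OEM code pages accept every byte value, decoding with the wrong one never raises an error; it silently produces different characters.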
Regarding "0x88 should be ê and 0x82 should be è": this mapping (with é rather than è for 0x82) actually applies to CP437, not to Windows-1252, as the following PowerShell code demonstrates. CP850 encodes these two bytes identically.
PS> [System.Text.Encoding]::GetEncoding(437).GetString([byte[]] (0x88, 0x82))
êé
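The same check can be done from Python. This sketch decodes the four non-ASCII byte values that occur in the question's output (0x88, 0x82, 0x87, 0xff) under both OEM code pages and under Windows-1252; the variable name raw is arbitrary.

```python
# The four non-ASCII byte values that appear in the captured ping output.
raw = bytes([0x88, 0x82, 0x87, 0xFF])

for encoding in ("cp437", "cp850", "cp1252"):
    # repr() makes the no-break space (U+00A0) visible as '\xa0'.
    print(encoding, repr(raw.decode(encoding)))
```

Both OEM code pages give 'êéç' followed by a no-break space, which matches the French text (requête, données, reçus, and the space before each colon), whereas Windows-1252 gives the 'ˆ‚‡ÿ' seen in the question.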
If anyone lands here, this is a small working snippet based on mklement0’s answer.
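import locale
import ctypes
import subprocess

# Get the console's active code page, as an integer.
oemCP = ctypes.windll.kernel32.GetConsoleOutputCP()
# For comparison: the ANSI-based encoding Python would use by default.
print(f'{locale.getpreferredencoding()=}')
encoding = "cp" + str(oemCP)
print(f"{encoding=}")
# Let subprocess decode the output itself; the backslash must be escaped.
p = subprocess.run('dir c:\\',
                   shell=True,
                   stdout=subprocess.PIPE,
                   universal_newlines=True,
                   encoding=encoding,
                   stderr=subprocess.STDOUT)
print(p.stdout)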