How can I make sense of a badly encoded message?

Question

---------------------------
ƒGƒ‰[
---------------------------
ƒfƒBƒXƒvƒŒƒCƒ‚[ƒh‚ªÝ’è‚Å‚«‚Ü‚¹‚ñ.
---------------------------
OK   
---------------------------

I get this clear error message out of Shooter’s Solitude system 4, after I feed it this version of d3drm.dll (sigh.)

Here’s a hexdump for your convenience:

00000000  c6 92 66 c6 92 42 c6 92  58 c6 92 76 c6 92 c5 92  |..f..B..X..v....|
00000010  c6 92 43 c6 92 e2 80 9a  c2 81 5b c6 92 68 e2 80  |..C.......[..h..|
00000020  9a c2 aa c2 90 c3 9d e2  80 99 c3 a8 e2 80 9a c3  |................|
00000030  85 e2 80 9a c2 ab e2 80  9a c3 9c e2 80 9a c2 b9  |................|
00000040  e2 80 9a c3 b1 2e 0a                              |.......|
00000047

How would you turn this into a coherent error message? That is, how would you go about finding the correct encoding/decoding pair for this error message?

Here’s what I tried.

I guess the issue is the developer used the wrong encoding settings for this message (given the age of the game, developed for WinXP, this is unsurprising). By looking at it, one’d guess the message was encoded in some sort of multibyte encoding ( ƒf ƒB ƒX ƒv ƒŒ.)

However, each group seems to be made up of three bytes (or is the length variable?). This rules out the usual suspects:

>>> wat = "ƒfƒBƒXƒvƒŒƒCƒ‚[ƒh‚ªÝ’è‚Å‚«‚Ü‚¹‚ñ. "
>>> wat.encode("UTF-8").decode("UTF-32")
UnicodeDecodeError: 'utf32' codec cannot decode bytes in position 0-3:
codepoint not in range(0x110000)
>>> wat.encode("UTF-8").decode("UTF-16")
UnicodeDecodeError: 'utf16' codec cannot decode bytes in position 70-70:
truncated data
>>> wat.encode("UTF-8")[:-1].decode("UTF-16")
'鋆왦䊒鋆왘皒鋆鋅鋆왃\ue292骀臂왛梒胢슚슪쎐\ue29d馀ꣃ胢쎚\ue285骀ꯂ胢쎚\ue29c骀맂胢쎚⺱'
#meaningless according to Google Translate.

I chose UTF-8 as the starting encoding because ASCII didn’t work (UnicodeEncodeError: 'ascii' codec can't encode character '\u0192' in position 0: ordinal not in range(128)) and UTF-8 should be the default encoding for Windows 7 anyway (the OS I tried to use).

Not quite there.

Kabie may be onto something, but that’s not the full story. First off, I can’t reproduce his result:

>>> print (wat.encode("UTF-8").decode("Shift-JIS"))
UnicodeDecodeError: 'shift_jis' codec cannot decode bytes in position 22-23: illegal multibyte sequence
>>> print (wat.encode("UTF-8")[:22].decode("Shift-JIS"))
ﾆ断ﾆ達ﾆ湛ﾆ致ﾆ椎槌辰ﾆ停

Wikipedia says there’s a very similar encoding out there: cp932.

>>> print(wat.encode("UTF-8").decode("932"))
UnicodeDecodeError: 'cp932' codec cannot decode bytes in position 44-45: illegal multibyte sequence
>>> print(wat.encode("UTF-8")[:44].decode("932"))
ﾆ断ﾆ達ﾆ湛ﾆ致ﾆ椎槌辰ﾆ停喙ﾆ檀窶堋ｪﾃ昶凖ｨ窶堙

Again, very different from what he pasted. Let’s see it, however:

>>> print("ディスプレイモ\x81[ドが\x90ﾝ定できません.\n")
ディスプレイモ[ドがﾝ定できません.

This is garbage for Google Translate, however. I then tried to remove some bits and pieces. Given that ディスプレイ means "display", if I removed "garbage" around the bits that can’t be decoded I get:

  ディスプレイモ\x81[ドが\x90ﾝ定できません.
→ ディスプレイ      ドが    ﾝ定できません.
→ The display mode is not specified.

However, this being a question on SO, that is not the full story. What is with those bytes that couldn’t be decoded? How would you get these bytes to begin with?

Asked By: badp

Answer 1

Obviously, since it is a Japanese game:

‘ディスプレイモ\x81[ドが\x90ﾝ定できません.\n’

‘Disupureimo \x81 [the de \x90 applications can not be fixed. \N’

Because I pasted the string, some characters are missing.

The encoding is called Shift-JIS. I actually used Opera to display the characters.

EDIT:
Sadly, none of my browsers can add comments on SO. I guess it’s a network problem, so I have to update here.

You should probably set your display mode to 256 colors. That’s what many Japanese games need.

EDIT2:
Interesting story.

As for how I got the string, the funniest part is that I DIDN’T decode the original bytes directly. If you try that, as you may have, you only get this:

ﾆ断ﾆ達ﾆ湛ﾆ致ﾆ椎槌辰ﾆ停�堋ーﾆ檀窶堋ｪﾂ静昶�凖ｨ窶堙��堋ｫ窶堙懌�堋ｹ窶堙ｱ.

Instead, I pasted the string into another web page as source, then used Opera to change the page encoding to Shift-JIS.

Opera has a feature that lets you modify the source code of a web page and display the result. So I wrote a page like:

<!DOCTYPE html>
<head>
<title>test</title>
</head>
<body>
'ƒfƒBƒXƒvƒŒƒCƒ‚ƒh‚ªÝ’è‚Å‚«‚Ü‚¹‚ñ.
</body>
</html>

and that’s what I got:

‘ディスプレイモドがﾝ定できません.

Which is even more meaningless. And have you tried changing the color mode to 256 colors?

Answered By: Kabie

Answer 2

Maybe this will help:

from binascii import unhexlify

data = '''
c6 92 66 c6 92 42 c6 92 58 c6 92 76 c6 92 c5 92
c6 92 43 c6 92 e2 80 9a c2 81 5b c6 92 68 e2 80
9a c2 aa c2 90 c3 9d e2 80 99 c3 a8 e2 80 9a c3
85 e2 80 9a c2 ab e2 80 9a c3 9c e2 80 9a c2 b9
e2 80 9a c3 b1 2e 0a
'''

data = unhexlify(data.replace(' ','').replace('\n',''))
print data.decode('utf8').encode('windows-1252','xmlcharrefreplace').decode('shift-jis')

Output

ディスプレイモ&#129;[ドが&#144;ﾝ定できません.

The hex data you provided was Shift_JIS decoded as windows-1252 and then re-encoded as UTF-8.
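
One way to arrive at such a pair without guessing is to undo the UTF-8 layer and then try a handful of candidate codecs in both roles. The following is a minimal Python 3 sketch, not part of the original answer: the candidate lists `WRONG` and `RIGHT` and the helpers `to_bytes` and `candidates` are made up for illustration, and a human still has to judge which of the surviving decodings reads as real text.

```python
import itertools

HEX_DUMP = """
c6 92 66 c6 92 42 c6 92 58 c6 92 76 c6 92 c5 92
c6 92 43 c6 92 e2 80 9a c2 81 5b c6 92 68 e2 80
9a c2 aa c2 90 c3 9d e2 80 99 c3 a8 e2 80 9a c3
85 e2 80 9a c2 ab e2 80 9a c3 9c e2 80 9a c2 b9
e2 80 9a c3 b1 2e 0a
"""

# Hypothetical candidate lists; extend them for other languages.
WRONG = ["cp1252", "latin-1", "cp437"]            # codec the bytes were mis-decoded with
RIGHT = ["cp932", "euc_jp", "gbk", "big5", "euc_kr"]  # codec the bytes were really in


def to_bytes(text, codec):
    """Encode text with codec; characters the codec cannot encode but that
    are below U+0100 are passed through as the byte of the same value."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(codec)
        except UnicodeEncodeError:
            if ord(ch) > 0xFF:
                raise  # this codec cannot have produced the character
            out.append(ord(ch))
    return bytes(out)


def candidates(text):
    """Yield (wrong, right, decoded) for every pair that round-trips cleanly."""
    for wrong, right in itertools.product(WRONG, RIGHT):
        try:
            yield wrong, right, to_bytes(text, wrong).decode(right)
        except UnicodeError:
            continue  # either the re-encode or the decode failed


mojibake = bytes.fromhex("".join(HEX_DUMP.split())).decode("utf-8")
for wrong, right, text in candidates(mojibake):
    print(wrong, "->", right, ":", repr(text))
```

More than one pair may survive, because a permissive double-byte codec can accept the same bytes and yield text in the wrong language, so the list is a shortlist rather than an answer.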

Edit

Building on John Machin’s answer:

from binascii import unhexlify
import re

data = '''
c6 92 66 c6 92 42 c6 92 58 c6 92 76 c6 92 c5 92
c6 92 43 c6 92 e2 80 9a c2 81 5b c6 92 68 e2 80
9a c2 aa c2 90 c3 9d e2 80 99 c3 a8 e2 80 9a c3
85 e2 80 9a c2 ab e2 80 9a c3 9c e2 80 9a c2 b9
e2 80 9a c3 b1 2e 0a
'''

data = unhexlify(data.replace(' ','').replace('\n',''))
data = data.decode('utf8').encode('windows-1252','xmlcharrefreplace')
# convert the XML entities that windows-1252 couldn't encode back into bytes
data = re.sub(r'&#(\d+);',lambda x: chr(int(x.group(1))),data)
print data.decode('shift-jis')

Output

ディスプレイモードが設定できません.
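
As a cross-check, the chain can also be run forwards: start from the decoded message, mis-decode its Shift_JIS bytes as windows-1252, re-encode as UTF-8, and compare with the hex dump from the question. This is a minimal Python 3 sketch rather than part of the original answer. It assumes the bytes that Python's `cp1252` codec leaves undefined become the control characters of the same value, and `decode_like_windows` is a hypothetical helper, not a library function.

```python
EXPECTED_HEX = """
c6 92 66 c6 92 42 c6 92 58 c6 92 76 c6 92 c5 92
c6 92 43 c6 92 e2 80 9a c2 81 5b c6 92 68 e2 80
9a c2 aa c2 90 c3 9d e2 80 99 c3 a8 e2 80 9a c3
85 e2 80 9a c2 ab e2 80 9a c3 9c e2 80 9a c2 b9
e2 80 9a c3 b1 2e 0a
"""


def decode_like_windows(raw):
    """Decode bytes as cp1252, mapping bytes the codec leaves undefined
    to the code point of the same value (assumed behaviour, see lead-in)."""
    chars = []
    for byte in raw:
        try:
            chars.append(bytes([byte]).decode("cp1252"))
        except UnicodeDecodeError:
            chars.append(chr(byte))
    return "".join(chars)


original = "ディスプレイモードが設定できません.\n"
# Shift_JIS bytes -> wrongly read as windows-1252 -> stored as UTF-8
mangled = decode_like_windows(original.encode("cp932")).encode("utf-8")

matches = mangled.hex() == "".join(EXPECTED_HEX.split())
print("reproduces the hex dump:", matches)
```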

Answered By: Mark Tolonen

Answer 3

=== file disupure.py ===

# start with the OP's hex dump:
hexbytes = """
c6 92 66 c6 92 42 c6 92  58 c6 92 76 c6 92 c5 92
c6 92 43 c6 92 e2 80 9a  c2 81 5b c6 92 68 e2 80
9a c2 aa c2 90 c3 9d e2  80 99 c3 a8 e2 80 9a c3
85 e2 80 9a c2 ab e2 80  9a c3 9c e2 80 9a c2 b9
e2 80 9a c3 b1 2e 0a
"""
strg = ''.join(
    chr(int(hexbyte, 16))
    for hexbyte in hexbytes.split()
    )
uc = strg.decode('utf8') # decodes OK but result is gibberish
uc_hex = ' '.join("%04X" % ord(x) for x in uc)
print uc_hex
# but it's stuffed ... U+0192??? oh yeah, 0x83
badenc = 'cp1252' # sort of, things like 0x81 have to be allowed for
fix_bad = {}
for i in xrange(256):
    b = chr(i)
    try:
        fix_bad[ord(b.decode(badenc))] = i
    except UnicodeDecodeError:
        fix_bad[i] = i

recoded = uc.translate(fix_bad).encode('latin1')
better_uc = recoded.decode('cp932')
# It's on Windows; cp932 is what would have been used
# but 'sjis' gives the same answer
better_uc_hex = ' '.join("%04X" % ord(x) for x in better_uc)
print better_uc_hex
print repr(better_uc)
print better_uc

Result of running this in IDLE (blank lines added for clarity):

0192 0066 0192 0042 0192 0058 0192 0076 0192 0152 0192 0043 0192 201A 0081 005B 0192 0068 201A 00AA 0090 00DD 2019 00E8 201A 00C5 201A 00AB 201A 00DC 201A 00B9 201A 00F1 002E 000A

30C7 30A3 30B9 30D7 30EC 30A4 30E2 30FC 30C9 304C 8A2D 5B9A 3067 304D 307E 305B 3093 002E 000A

u'\u30c7\u30a3\u30b9\u30d7\u30ec\u30a4\u30e2\u30fc\u30c9\u304c\u8a2d\u5b9a\u3067\u304d\u307e\u305b\u3093.\n'

ディスプレイモードが設定できません.

Google Translate: You can set the display mode.

Microsoft (Bing) Translate: Display mode is not set.

Update: A bit more explanation on why the translation table is needed, and why it maps \x81 etc. to U+0081, from the Wikipedia article on cp1252:

According to the information on
Microsoft’s and the Unicode
Consortium’s websites, positions 81,
8D, 8F, 90, and 9D are unused. However
the Windows API call for converting
from code pages to Unicode maps these
to the corresponding C1 control codes.
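
The five unused positions can be checked against Python's own `cp1252` codec, which refuses to decode them instead of mapping them to C1 controls. The short Python 3 sketch below is an illustration and not part of the original answer. It lists those positions, then shows which of the matching C1 characters occur in the message from the question.

```python
HEX_DUMP = """
c6 92 66 c6 92 42 c6 92 58 c6 92 76 c6 92 c5 92
c6 92 43 c6 92 e2 80 9a c2 81 5b c6 92 68 e2 80
9a c2 aa c2 90 c3 9d e2 80 99 c3 a8 e2 80 9a c3
85 e2 80 9a c2 ab e2 80 9a c3 9c e2 80 9a c2 b9
e2 80 9a c3 b1 2e 0a
"""

# Byte values that Python's cp1252 codec cannot decode at all.
undefined = []
for i in range(256):
    try:
        bytes([i]).decode("cp1252")
    except UnicodeDecodeError:
        undefined.append(i)
print([hex(i) for i in undefined])

# Which of those show up (as C1 control characters) in the garbled text?
mojibake = bytes.fromhex("".join(HEX_DUMP.split())).decode("utf-8")
c1_in_message = sorted({ord(ch) for ch in mojibake if ord(ch) in undefined})
print([hex(cp) for cp in c1_in_message])
```

In this message the two that occur are 0x81 and 0x90, the lead bytes of ー (81 5B) and 設 (90 DD) in cp932. Those are exactly the two spots where encoding back to cp1252 with a stock codec fails, which is what the translation table works around.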

Answered By: John Machin
