How to tell Python that a string is actually a bytes object? Not converting

Question:

I have a txt file which contains a line:

 '        6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"'

The content in the double quotes is actually a sequence of octal escapes, but with two escape characters (backslashes) instead of one.

After the line has been read in, I used regex to extract the contents in the double quotes.

c = re.search(r': "(.+)"', line).group(1)

After that, I have two problems:

First, I need to replace each pair of escape characters with a single one.

Second, I need to tell Python that the str object c is actually a bytes object.

I have not managed to do either.

I have tried:

re.sub('\\', '\', line)
re.sub(r'\\', '\', line)
re.sub(r'\\', r'\', line)

All failed.

A bytes object can easily be defined with the b prefix:

c = b'\351\231\220\346\227\266\345\205\215\350\264\271'

How do I change the type of a variable from str to bytes? I don't think this is an encode-and-decode thing.

I googled a lot but found no answers. Maybe I used the wrong keywords.

Does anyone know how to do this, or another way to get what I want?

Asked By: Cloud

Answers:

This is always a little confusing. I assume your bytes object should represent a string like:

b = b'\351\231\220\346\227\266\345\205\215\350\264\271'
b.decode()
# '限时免费'

To get that from your escaped string, you could use the codecs module and try:

import re
import codecs

line =  '        6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"'
c = re.search(r': "(.+)"', line).group(1)

codecs.escape_decode(bytes(c, "utf-8"))[0].decode("utf-8")
# '限时免费'

giving the same result.
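
As a minimal sketch, the same idea can be wrapped in a small helper. The function name decode_octal_line is made up for illustration, and the sketch assumes the quoted text is pure ASCII and that the decoded bytes are UTF-8. Note that codecs.escape_decode is not part of the documented codecs API, so it may be safer to treat it as a CPython implementation detail.

```python
import codecs
import re


def decode_octal_line(line):
    """Hypothetical helper: decode literal octal escapes found between double quotes."""
    match = re.search(r': "(.+)"', line)
    if match is None:
        return None
    # The captured text holds literal backslashes followed by digits (assumed ASCII).
    raw = match.group(1).encode("ascii")
    # escape_decode returns a (bytes, consumed_length) tuple.
    decoded_bytes, _consumed = codecs.escape_decode(raw)
    # Assumption: the resulting bytes are UTF-8 encoded text.
    return decoded_bytes.decode("utf-8")


# A raw string keeps the backslashes literal, as they would be when read from a file.
line = r'        6: "\351\231\220\346\227\266\345\205\215\350\264\271"'
print(decode_octal_line(line))
```

Lines that do not match the pattern return None instead of raising, which may or may not be what you want for a real file.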

Answered By: Mark

In Python, a string is a sequence of Unicode characters. If you are writing a literal that should be a sequence of bytes, you can make it a bytes object by prefixing the literal with b. This applies only to literals in source code, not to a str you already have in a variable.

For example:

byte_string = b'This is a byte string.'
print(type(byte_string)) # Output: <class 'bytes'>

In this example, the b prefix indicates that the literal is a sequence of bytes, and the type of byte_string is bytes. Note that escape sequences such as \n or the octal \351 are still interpreted inside a bytes literal, and each one produces a single byte. They stay as literal text only when the backslash itself is part of the data (written \\n, or inside a raw literal), which is the situation in the question.

It’s important to note that a bytes object is different from a str (string) object in terms of encoding, representation and processing. So, marking a string as a bytes object is important when you are dealing with binary data, such as reading from a binary file or sending data over a network connection.
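
A short sketch of the difference, using the text from the question as sample data. The variable names are illustrative only.

```python
text = "限时免费"              # str: a sequence of Unicode characters
data = text.encode("utf-8")   # bytes: the UTF-8 encoding of that text

print(type(text), len(text))  # 4 characters
print(type(data), len(data))  # 12 bytes

# Octal and hex escapes in a bytes literal name the same byte values.
same = b"\351\231\220" == b"\xe9\x99\x90"

# Escapes are interpreted in a normal bytes literal, but not in a raw one.
newline_len = len(b"\n")      # one byte
raw_len = len(rb"\n")         # two bytes: a backslash and the letter n

round_trip = data.decode("utf-8") == text
```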

Answered By: Ashfaque Ahamed

The string contains literal text for escape codes. You cannot just replace the literal backslashes with a single backslash as escape codes are used in source code to indicate a single character. Decoding is needed to change literal escape codes to the actual character, but only byte strings can be decoded.

Encoding a Unicode string to a byte string with the Latin-1 codec translates Unicode code points 1:1 to the corresponding byte, so it is the common way to directly convert a "byte-string-like" Unicode string to an actual byte string.
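
To illustrate the 1:1 property, here is a small sketch that round-trips every possible byte value through Latin-1 and compares one character against UTF-8. The names are illustrative only.

```python
# Every byte value 0..255 maps to the code point with the same number.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin1")

code_points = [ord(ch) for ch in as_text]
round_trip = as_text.encode("latin1")

# By contrast, UTF-8 needs two bytes for the same character.
latin1_e_acute = "é".encode("latin1")  # b'\xe9'
utf8_e_acute = "é".encode("utf-8")     # b'\xc3\xa9'
```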

Step-by-Step:

>>> s = "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"
>>> print(s)  # Actual text of the string
\351\231\220\346\227\266\345\205\215\350\264\271
>>> s.encode('latin1')  # Convert to byte string
b'\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271'
>>> # decode the escape codes...result is latin-1 characters in Unicode
>>> s.encode('latin1').decode('unicode-escape')
'é\x99\x90æ\x97¶å\x85\x8dè´¹'
>>> # convert back to byte string
>>> s.encode('latin1').decode('unicode-escape').encode('latin1')
b'\xe9\x99\x90\xe6\x97\xb6\xe5\x85\x8d\xe8\xb4\xb9'
>>> # data is UTF-8-encoded text so decode it correctly now
>>> s.encode('latin1').decode('unicode-escape').encode('latin1').decode('utf8')
'限时免费'
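
The chain above could be bundled into one function, roughly as follows. The name unescape_utf8 is hypothetical, and the sketch assumes the input contains only characters up to U+00FF and that the escaped bytes are UTF-8; anything else raises UnicodeEncodeError or UnicodeDecodeError.

```python
def unescape_utf8(escaped):
    """Hypothetical helper: turn literal escape codes into the UTF-8 text they encode."""
    # 1. str -> bytes, one byte per character (fails above U+00FF).
    as_bytes = escaped.encode("latin1")
    # 2. Interpret the escape codes; the result is a str of Latin-1 range characters.
    unescaped = as_bytes.decode("unicode-escape")
    # 3. Back to the real byte values, then 4. decode them as UTF-8 (assumed encoding).
    return unescaped.encode("latin1").decode("utf8")


# Raw string: the backslashes are literal characters, as in the text file.
s = r"\351\231\220\346\227\266\345\205\215\350\264\271"
print(unescape_utf8(s))
```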

Your text example looks like part of a Python dictionary. You may be able to save some steps by using the ast module’s literal_eval function to turn the dictionary directly into a Python object, and then just fix this line of code:

>>> # Python dictionary-like text
>>> d = '{6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"}'
>>> import ast
>>> ast.literal_eval(d)  # returns Python dictionary with value already decoded
{6: 'é\x99\x90æ\x97¶å\x85\x8dè´¹'}
>>> ast.literal_eval(d)[6]  # but decoded incorrectly as Latin-1 text.
'é\x99\x90æ\x97¶å\x85\x8dè´¹'
>>> ast.literal_eval(d)[6].encode('latin1').decode('utf8')  # undo Latin1, decode as UTF-8
'限时免费'
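
If the rest of the file is not valid dictionary syntax, a similar sketch can be applied to just the quoted part of a single line. This assumes each line looks like the one in the question and that the value is UTF-8 text; the variable names are illustrative.

```python
import ast
import re

# Raw string stands in for a line read from the text file.
line = r'        6: "\351\231\220\346\227\266\345\205\215\350\264\271"'

# Capture the value including its double quotes so it is a valid string literal.
quoted = re.search(r': (".+")', line).group(1)

# literal_eval interprets the octal escapes, yielding Latin-1 range characters.
mis_decoded = ast.literal_eval(quoted)

# Undo the Latin-1 interpretation and decode the bytes as UTF-8 (assumed encoding).
text = mis_decoded.encode("latin1").decode("utf8")
print(text)
```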
Answered By: Mark Tolonen