How to tell python that a string is actually bytes-object? Not converting
Question:
I have a txt file which contains a line:
' 6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"'
The content inside the double quotes is actually octal-escaped bytes, but with two escape characters (backslashes) instead of one.
After the line has been read in, I used regex to extract the contents in the double quotes.
c = re.search(r': "(.+)"', line).group(1)
After that, I have two problems:
First, I need to replace the two escape characters with one.
Second, I need to tell Python that the str object c is actually a bytes object.
Neither has worked.
I have tried:
re.sub('\\', '\', line)
re.sub(r'\\', '\', line)
re.sub(r'\\', r'\', line)
All failed.
A bytes object can easily be defined with the 'b' prefix:
c = b'\351\231\220\346\227\266\345\205\215\350\264\271'
How do I change the type of a string variable to bytes? I think this is not an encode-and-decode thing.
I googled a lot but found no answers. Maybe I used the wrong keywords.
Does anyone know how to do this, or another way to get what I want?
Answers:
This is always a little confusing. I assume your bytes object should represent a string like:
b = b'\351\231\220\346\227\266\345\205\215\350\264\271'
b.decode()
# '限时免费'
To get that with your escaped string, you could use the codecs library and try:
import re
import codecs
line = ' 6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"'
c = re.search(r': "(.+)"', line).group(1)
codecs.escape_decode(bytes(c, "utf-8"))[0].decode("utf-8")
# '限时免费'
giving the same result.
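As far as I know, codecs.escape_decode is not listed in the documented codecs API, so it may be worth having a fallback that uses only documented calls. Here is a minimal sketch using re.sub on bytes. The helper name decode_octal_escapes is made up for illustration, and it assumes every escape is exactly three octal digits in the range 000 to 377 and that the rest of the text is Latin-1 encodable.

```python
import re

def decode_octal_escapes(text: str, encoding: str = "utf-8") -> str:
    """Turn literal octal escapes such as \\351 into bytes, then decode them.

    Hypothetical helper: assumes three-digit octal escapes (000 to 377) only.
    """
    raw = text.encode("latin-1")  # str -> bytes, one byte per character
    # Replace each backslash + three octal digits with the single byte it names.
    unescaped = re.sub(
        rb"\\([0-3][0-7]{2})",
        lambda m: bytes([int(m.group(1), 8)]),
        raw,
    )
    return unescaped.decode(encoding)

line = ' 6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"'
c = re.search(r': "(.+)"', line).group(1)
print(decode_octal_escapes(c))  # 限时免费
```

Because the replacement is a function rather than a string, re.sub does not try to interpret backslashes in it, which is what tripped up the attempts in the question.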
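In Python, a str is a sequence of Unicode characters, while a bytes object is a sequence of raw byte values. A string literal in source code can be written as a bytes object by prefixing it with b. The prefix only applies to literals; a str that already exists in a variable has to be converted, typically with encode().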
For example:
byte_string = b'This is a byte string.'
print(type(byte_string)) # Output: <class 'bytes'>
In this example, the b prefix marks the literal as a sequence of bytes, so the type of byte_string is bytes. Note that escape sequences such as \n are still processed inside a bytes literal; the restriction is that only ASCII characters may appear literally, and any other byte value has to be written as an escape such as \xe9.
It’s important to note that a bytes object is different from a str (string) object in terms of encoding, representation and processing. So, marking a string as a bytes object is important when you are dealing with binary data, such as reading from a binary file or sending data over a network connection.
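To make the difference between a bytes literal and an existing str concrete, here is a small sketch. The variable names are only for illustration.

```python
text = "abc"       # an ordinary str
literal = b"abc"   # a bytes literal; the b prefix is source-code syntax

# A str held in a variable cannot be "re-labelled" as bytes; it has to be encoded.
encoded = text.encode("ascii")
print(type(text).__name__, type(literal).__name__, encoded == literal)  # str bytes True

# Escape sequences are still processed inside a bytes literal,
# unless the literal is also marked raw with r.
print(len(b"\n"), len(rb"\n"))  # 1 2
```

So for a str read from a file there is no way around some form of encode(); the question is only which codec gives the bytes you want.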
The string contains literal text for escape codes. You cannot just replace the literal backslashes with a single backslash as escape codes are used in source code to indicate a single character. Decoding is needed to change literal escape codes to the actual character, but only byte strings can be decoded.
Encoding a Unicode string to a byte string with the Latin-1 codec translates Unicode code points 1:1 to the corresponding byte, so it is the common way to directly convert a "byte-string-like" Unicode string to an actual byte string.
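The 1:1 property of Latin-1 is easy to check for yourself. This is just a quick sanity check, not part of the solution:

```python
# Every code point 0..255 maps to the byte with the same value under Latin-1.
all_chars = "".join(chr(i) for i in range(256))
all_bytes = all_chars.encode("latin-1")
print(all_bytes == bytes(range(256)))            # True
print(all_bytes.decode("latin-1") == all_chars)  # True

# Code points above 255 have no Latin-1 byte, so encoding them fails.
try:
    "限".encode("latin-1")
except UnicodeEncodeError as exc:
    print("cannot encode:", exc.reason)
```

This is why the trick only works on "byte-string-like" text, where every character is below U+0100.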
Step-by-Step:
>>> s = "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"
>>> print(s) # Actual text of the string
\351\231\220\346\227\266\345\205\215\350\264\271
>>> s.encode('latin1') # Convert to byte string
b'\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271'
>>> # decode the escape codes...result is latin-1 characters in Unicode
>>> s.encode('latin1').decode('unicode-escape')
'é\x99\x90æ\x97¶å\x85\x8dè´¹'
>>> # convert back to byte string
>>> s.encode('latin1').decode('unicode-escape').encode('latin1')
b'\xe9\x99\x90\xe6\x97\xb6\xe5\x85\x8d\xe8\xb4\xb9'
>>> # data is UTF-8-encoded text so decode it correctly now
>>> s.encode('latin1').decode('unicode-escape').encode('latin1').decode('utf8')
'限时免费'
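The chain above can be folded into one function. This is a sketch; the name unescape_utf8 is made up, and it assumes the text holds only Latin-1 characters and byte-valued escapes (a \uXXXX escape above U+00FF would make the second encode fail).

```python
def unescape_utf8(text: str) -> str:
    """Process literal backslash escapes in text, then reinterpret the bytes as UTF-8."""
    as_bytes = text.encode("latin-1")                    # str -> bytes, 1:1
    unescaped = as_bytes.decode("unicode_escape")        # turns \351 into U+00E9, etc.
    return unescaped.encode("latin-1").decode("utf-8")   # back to bytes, then UTF-8

s = "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"
print(unescape_utf8(s))  # 限时免费
```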
Your text example looks like part of a Python dictionary. You may be able to save some steps by using the ast module’s literal_eval function to turn the dictionary directly into a Python object, and then just fix this line of code:
>>> # Python dictionary-like text
>>> d = '{6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"}'
>>> import ast
>>> ast.literal_eval(d) # returns Python dictionary with value already decoded
{6: 'é\x99\x90æ\x97¶å\x85\x8dè´¹'}
>>> ast.literal_eval(d)[6] # but decoded incorrectly as Latin-1 text.
'é\x99\x90æ\x97¶å\x85\x8dè´¹'
>>> ast.literal_eval(d)[6].encode('latin1').decode('utf8') # undo Latin1, decode as UTF-8
'限时免费'
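If the file really is one key/value pair per line, as in the question, the same idea can be applied line by line by wrapping each line in braces. This is a sketch under that assumption; parse_entry is a hypothetical helper, and it assumes every value is a string whose escapes spell out UTF-8 bytes.

```python
import ast

def parse_entry(line: str) -> dict:
    """Parse one 'key: "value"' line as a dict literal and repair the UTF-8 value."""
    parsed = ast.literal_eval("{" + line.strip() + "}")
    # literal_eval leaves each byte as a Latin-1 character; undo that, then decode.
    return {
        key: value.encode("latin-1").decode("utf-8")
        for key, value in parsed.items()
    }

line = ' 6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"'
print(parse_entry(line))  # {6: '限时免费'}
```

ast.literal_eval only accepts Python literals, so unlike eval it will not run arbitrary code from the file, but it will raise on lines that are not valid dictionary entries.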