Regular expression parsing a binary file?
Question:
I have a file which mixes binary data and text data. I want to parse it through a regular expression, but I get this error:
TypeError: can't use a string pattern on a bytes-like object
I’m guessing that message means that Python doesn’t want to parse binary files.
I’m opening the file with the "rb" flags.
How can I parse binary files with regular expressions in Python?
EDIT: I’m using Python 3.2.0
Answers:
This works for me with Python 2.6:
>>> import re
>>> r = re.compile(".*(ELF).*")
>>> f = open("/bin/ls")
>>> x = f.readline()
>>> r.match(x).groups()
('ELF',)
In your re.compile call you need to use a bytes object, signified by an initial b:
r = re.compile(b"(This)")
This is Python 3 being picky about the difference between strings and bytes.
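A minimal sketch of that fix, assuming some made-up sample data in place of a real file (the `data` value is purely illustrative):

```python
import re

# Hypothetical sample: binary bytes surrounding some ASCII text.
data = b"\x00\x01\xffThis is text\x00\x02"

# A bytes pattern (note the b prefix) can be matched against bytes data.
pattern = re.compile(b"(This)")
match = pattern.search(data)

# Groups of a bytes pattern are bytes objects, not str.
found = match.group(1)
print(found)
```

Note that the match results are bytes as well, so they need decoding if you want text.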
I think you are using Python 3.
1. Opening a file in binary mode is simple but subtle. The only difference
from opening it in text mode is that
the mode parameter contains a ‘b’
character.
……..
4. Here’s one difference, though: a binary stream object has no encoding
attribute. That makes sense, right?
You’re reading (or writing) bytes, not
strings, so there’s no conversion for
Python to do.
So in Python 3, because a file opened in binary mode yields a stream of bytes, a regex used to analyse that stream must itself be defined with a sequence of bytes, not a sequence of characters.
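As a rough sketch of what this looks like with a real file: the example below writes a small temporary file (its contents and the key=value layout are assumptions for illustration, not the asker's format), reads it back in "rb" mode, and searches it with a bytes pattern.

```python
import os
import re
import tempfile

# Write a small hypothetical file mixing binary bytes and ASCII text.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x7fELF\x02\x01\x00name=demo\x00\xfe\xffversion=12\x00")
    path = tmp.name

# In binary mode, read() returns bytes, so the pattern must be bytes too.
with open(path, "rb") as f:
    content = f.read()

os.remove(path)

# Find ASCII key=value pairs embedded in the binary data.
pairs = re.findall(br"([a-z]+)=([a-z0-9]+)", content)
print(pairs)
```

The `br` prefix combines a bytes literal with raw-string escaping, which keeps regex backslashes intact.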
In Python 2, a string was an array of
bytes whose character encoding was
tracked separately. If you wanted
Python 2 to keep track of the
character encoding, you had to use a
Unicode string (u'') instead. But in
Python 3, a string is always what
Python 2 called a Unicode string —
that is, an array of Unicode
characters (of possibly varying byte
lengths).
http://www.diveintopython3.net/case-study-porting-chardet-to-python-3.html
and
In Python 3, all strings are sequences
of Unicode characters. There is no
such thing as a Python string encoded
in UTF-8, or a Python string encoded
as CP-1252. “Is this string UTF-8?” is
an invalid question. UTF-8 is a way of
encoding characters as a sequence of
bytes. If you want to take a string
and turn it into a sequence of bytes
in a particular character encoding,
Python 3 can help you with that.
and
4.6. Strings vs. Bytes
Bytes are bytes; characters are an abstraction.
An immutable sequence of Unicode
characters is called a string. An
immutable sequence of
numbers-between-0-and-255 is called a
bytes object.
….
1. To define a bytes object, use the b'' “byte literal” syntax. Each byte
within the byte literal can be an
ASCII character or an encoded
hexadecimal number from \x00 to \xff
(0–255).
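The quoted points about bytes literals and encodings can be illustrated with a short sketch (the sample values are made up for the example):

```python
# A bytes literal may mix ASCII characters and \x escapes.
magic = b"\x7fELF"

# Indexing a bytes object yields an int between 0 and 255.
first = magic[0]

# 0x41 is the ASCII code for 'A', so both literals are the same bytes.
same = b"\x41" == b"A"

# A str is a sequence of Unicode characters; encoding turns it into bytes.
text = "caf\u00e9"                 # hypothetical sample string
encoded = text.encode("utf-8")     # the accented character becomes two bytes
decoded = encoded.decode("utf-8")  # decoding restores the original str
```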
So you will define your regex as follows:
pat = re.compile(br'[a-f]+\d+')
and not as
pat = re.compile(r'[a-f]+\d+')
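To make the contrast concrete, here is a small sketch that tries both kinds of pattern against the same bytes (the `data` value is a hypothetical sample):

```python
import re

data = b"\x00abc123\xff"  # hypothetical sample bytes

# A str pattern applied to bytes raises TypeError.
try:
    re.search(r"[a-f]+\d+", data)
    error = None
except TypeError as exc:
    error = str(exc)

# A bytes pattern works; decode the match if a str is needed.
m = re.search(br"[a-f]+\d+", data)
matched_text = m.group(0).decode("ascii")
```

Decoding with "ascii" is safe here only because the pattern can match nothing but ASCII letters and digits.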
More explanations here: