Regular expression parsing a binary file?

Question:

I have a file which mixes binary data and text data. I want to parse it through a regular expression, but I get this error:

TypeError: can't use a string pattern on a bytes-like object

I’m guessing that message means that Python doesn’t want to parse binary files.
I’m opening the file with the "rb" flags.

How can I parse binary files with regular expressions in Python?

EDIT: I’m using Python 3.2.0

Asked By: DonkeyMaster

Answers:

This works for me with Python 2.6:

>>> import re
>>> r = re.compile(".*(ELF).*")
>>> f = open("/bin/ls")
>>> x = f.readline()
>>> r.match(x).groups()
('ELF',)
Answered By: Rumple Stiltskin

In your re.compile you need to use a bytes object, signified by an initial b:

r = re.compile(b"(This)")

This is Python 3 being picky about the difference between strings and bytes.
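
As a minimal sketch of that idea: the sample bytes below are made up for illustration and stand in for whatever a file opened with "rb" would return. Note that the groups of a bytes pattern come back as bytes as well.

```python
import re

# Hypothetical sample standing in for the contents of a file opened with "rb".
data = b"\x00\x01 header This is text \xff\xfe"

# Bytes pattern: note the b prefix on the literal.
pattern = re.compile(b"(This)")
match = pattern.search(data)

# A bytes pattern yields bytes groups, not str.
found = match.group(1)
print(found)  # b'This'
```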

Answered By: Scott Griffiths

I think you are using Python 3.

1. Opening a file in binary mode is simple but subtle. The only difference
from opening it in text mode is that
the mode parameter contains a ‘b’
character.

……..

4. Here’s one difference, though: a binary stream object has no encoding
attribute. That makes sense, right?
You’re reading (or writing) bytes, not
strings, so there’s no conversion for
Python to do.

http://www.diveintopython3.net/files.html#read

So, in Python 3, a file opened in binary mode gives you bytes, and a regex applied to those bytes must itself be defined as a sequence of bytes, not as a sequence of characters.
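
A small sketch of the mismatch, assuming some made-up bytes in place of real file contents. The exact wording of the error message differs between Python versions, so it is only captured here, not compared.

```python
import re

# Hypothetical bytes, as they might come from f.read() on a file opened with "rb".
data = b"\x7fELF\x02\x01\x01"

try:
    re.search("ELF", data)      # str pattern on bytes data: raises TypeError
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Same search with a bytes pattern: this one works.
match = re.search(b"ELF", data)

print(error_message)
print(match.group(0))
```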

In Python 2, a string was an array of
bytes whose character encoding was
tracked separately. If you wanted
Python 2 to keep track of the
character encoding, you had to use a
Unicode string (u'') instead. But in
Python 3, a string is always what
Python 2 called a Unicode string —
that is, an array of Unicode
characters (of possibly varying byte
lengths).

http://www.diveintopython3.net/case-study-porting-chardet-to-python-3.html

and

In Python 3, all strings are sequences
of Unicode characters. There is no
such thing as a Python string encoded
in UTF-8, or a Python string encoded
as CP-1252. “Is this string UTF-8?” is
an invalid question. UTF-8 is a way of
encoding characters as a sequence of
bytes. If you want to take a string
and turn it into a sequence of bytes
in a particular character encoding,
Python 3 can help you with that.

http://www.diveintopython3.net/strings.html#boring-stuff

and

4.6. Strings vs. Bytes

Bytes are bytes; characters are an abstraction.
An immutable sequence of Unicode
characters is called a string. An
immutable sequence of
numbers-between-0-and-255 is called a
bytes object.

….

1. To define a bytes object, use the b'' “byte literal” syntax. Each byte
within the byte literal can be an
ASCII character or an encoded
hexadecimal number from \x00 to \xff
(0–255).

http://www.diveintopython3.net/strings.html#boring-stuff
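
To make the quoted description concrete, here is a short sketch with an arbitrary bytes literal (the value itself is just an example): indexing gives integers, every element is in the range 0 to 255, and the object cannot be modified in place.

```python
# A bytes literal mixing ASCII characters and \x hexadecimal escapes.
by = b"abc\x00\xff"

first = by[0]        # indexing a bytes object yields an int (97 is ord("a"))
length = len(by)     # five bytes: a, b, c, 0x00, 0xff
all_in_range = all(0 <= n <= 255 for n in by)

# bytes objects are immutable, so item assignment raises TypeError.
try:
    by[0] = 102
    is_mutable = True
except TypeError:
    is_mutable = False

print(first, length, all_in_range, is_mutable)
```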

So you should define your regex as follows:

pat = re.compile(b'[a-f]+\d+')

and not as:

pat = re.compile('[a-f]+\d+')
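
Putting it together, here is a rough sketch for a file that mixes binary and text data. It assumes the pattern is meant to be letters a to f followed by digits; the helper name find_tokens and the sample bytes are made up for the example. The br prefix makes the literal a raw bytes literal, so the backslash in \d reaches the regex engine untouched.

```python
import os
import re
import tempfile

# Raw bytes pattern: one or more of a-f followed by one or more ASCII digits.
pat = re.compile(br"[a-f]+\d+")

def find_tokens(path):
    """Hypothetical helper: return every match of pat in the file, decoded as ASCII."""
    with open(path, "rb") as f:      # binary mode, so read() returns bytes
        data = f.read()
    # findall on bytes returns bytes; decode each match to get str.
    return [m.decode("ascii") for m in pat.findall(data)]

# Made-up sample mixing binary bytes and text, written to a temporary file.
sample = b"\x00\x01abc123\xff\xfe plain text fed42\x00"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)

tokens = find_tokens(tmp.name)
os.remove(tmp.name)
print(tokens)  # ['abc123', 'fed42']
```

Decoding only the matched pieces avoids having to decode the whole file, which would fail on the bytes that are not valid text.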

More explanations here:

15.6.4. Can’t use a string pattern on a bytes-like object

Answered By: eyquem