How can I find list comprehensions in Python code?

Question:

I am trying to refactor some Python modules which contain complex list comprehensions that can be single-line or multiline. An example of such a list comprehension is:

some_list = [y(x) for x in some_complex_expression if x != 2]

I attempted to use the following regex pattern in PyCharm but this matches simple lists as well:

\[.+\]

Is there a way to not match simple lists and perhaps also match list comprehensions that are multiline? I am okay with solutions other than regex as well.

Asked By: Riya

||

Answers:

To match the above example without matching simple lists you can use:

\[.+ for .+ in .+\]

Thanks, JvdV! (this answer is based on his tips)

Answered By: Riya

Regex is not designed to handle structured syntax. However carefully you write the pattern, there will almost always be corner cases it cannot handle, as suggested by the comments above.
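To make those corner cases concrete, here is a small sketch that runs the pattern from the earlier answer (with the brackets escaped) against four inputs. The sample strings and labels are made up for illustration; only the pattern comes from this thread.

```python
import re

# The pattern from the earlier answer, with the literal brackets escaped.
pattern = re.compile(r"\[.+ for .+ in .+\]")

# Hypothetical sample inputs chosen to show where the pattern goes wrong.
samples = {
    "real comprehension": "a = [(i + 1) * 2 for i in range(3)]",
    "string literal": "b = '[(i + 1) * 2 for i in range(3)]'",
    "comment": "# d = [(i + 1) * 2 for i in range(3)]",
    "multiline comprehension": "c = [\n    i * 2\n    for i in range(3)\n]",
}

# True means the regex reports a match for that input.
results = {label: bool(pattern.search(text)) for label, text in samples.items()}
for label, matched in results.items():
    print(f"{label}: {matched}")
```

The string literal and the comment are both reported as matches even though neither is a list comprehension, and the multiline comprehension is missed because `.` does not match a newline by default.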

A proper Python parser should be used instead to identify list comprehensions according to the language specification. Fortunately, Python's standard library includes a comprehensive set of modules that help parse and navigate Python code in various ways.

In your case, you can use the ast module to parse the code into an abstract syntax tree, walk through the AST with ast.walk, identify list comprehensions by the ListComp nodes, and output the lines of those nodes along with their line numbers.
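As a minimal sketch of that idea, the snippet below parses a short source string, collects every ListComp node, and recovers its text with ast.get_source_segment. The sample source is made up for illustration, and both end_lineno and ast.get_source_segment assume Python 3.8 or later.

```python
import ast

# Hypothetical sample source: one real comprehension, one inside a string
# literal, and one inside a comment.
source = """\
a = [(i + 1) * 2 for i in range(3)]
b = '[(i + 1) * 2 for i in range(3)]'
# d = [(i + 1) * 2 for i in range(3)]
"""

tree = ast.parse(source)

# Each entry is (first line, last line, source text of the comprehension).
found = [
    (node.lineno, node.end_lineno, ast.get_source_segment(source, node))
    for node in ast.walk(tree)
    if isinstance(node, ast.ListComp)
]
print(found)
```

Only the assignment to a is found: the parser sees the right-hand side of b as a string constant and never sees the comment at all.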

Since list comprehensions can be nested, you’d want to avoid outputting the inner list comprehensions when the outer ones are already printed. This can be done by keeping track of the last line number sent to the output and only printing line numbers greater than the last line number.

For example, with the following code:

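import ast

# Read the file as a list of lines so they can be printed by line number.
with open('file.py') as file:
    lines = file.readlines()

last_lineno = 0  # highest line number printed so far
for node in ast.walk(ast.parse(''.join(lines))):
    if isinstance(node, ast.ListComp):
        # end_lineno requires Python 3.8 or later.
        for lineno in range(node.lineno, node.end_lineno + 1):
            # Skip lines already printed as part of an enclosing comprehension.
            if lineno > last_lineno:
                print(lineno, lines[lineno - 1], sep='\t', end='')
                last_lineno = lineno
        print()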

and the following content of file.py:

a = [(i + 1) * 2 for i in range(3)]
b = '[(i + 1) * 2 for i in range(3)]'
c = [
    i * 2
    for i in range(3)
    if i
]
# d = [(i + 1) * 2 for i in range(3)]
e = [
    [(i + 1) * 2 for i in range(j)]
    for j in range(3)
]

the code would output:

1   a = [(i + 1) * 2 for i in range(3)]

3   c = [
4       i * 2
5       for i in range(3)
6       if i
7   ]

9   e = [
10      [(i + 1) * 2 for i in range(j)]
11      for j in range(3)
12  ]

Lines 2 and 8 are not reported because b is assigned a string, and the assignment of d is commented out.
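One caveat about the script above: the documentation for ast.walk says nodes are yielded in no specified order, so the last_lineno check can in principle skip a comprehension that sits deeper in the tree but earlier in the file (for example, one inside a function body that precedes a module-level comprehension). A possible alternative, sketched here with hypothetical names (OutermostListComps, find_list_comps), is an ast.NodeVisitor that records each list comprehension and does not descend into it, so nested ones are never reported regardless of traversal order.

```python
import ast


class OutermostListComps(ast.NodeVisitor):
    """Collect (first_line, last_line) spans of list comprehensions
    that are not nested inside another list comprehension."""

    def __init__(self):
        self.spans = []

    def visit_ListComp(self, node):
        # Record the span and deliberately do not call generic_visit(),
        # so comprehensions nested inside this one are never visited.
        self.spans.append((node.lineno, node.end_lineno))


def find_list_comps(source):
    """Return sorted line spans of the outermost list comprehensions."""
    visitor = OutermostListComps()
    visitor.visit(ast.parse(source))
    return sorted(visitor.spans)


# Hypothetical sample: a comprehension inside a function body, followed by
# a module-level comprehension that contains a nested one.
source = """\
def f():
    return [x for x in range(3)]

e = [
    [(i + 1) * 2 for i in range(j)]
    for j in range(3)
]
"""
spans = find_list_comps(source)
print(spans)
```

The spans can then be used to print the corresponding lines, as in the original script. Note that this sketch only suppresses list comprehensions nested in other list comprehensions; one nested inside a dict or set comprehension would still be reported.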

Demo: https://replit.com/@blhsing/StimulatingCrimsonProgramminglanguage#main.py

Answered By: blhsing