Python AST with preserved comments

Question:

I can get an AST without comments using:

import ast
module = ast.parse(open('/path/to/module.py').read())

Could you show an example of getting an AST with preserved comments (and whitespace)?

Asked By: Andrei


Answers:

The ast module doesn’t include comments. The tokenize module can give you comments, but doesn’t provide other program structure.
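As a minimal sketch of the tokenize half of this answer, the snippet below collects every comment together with its position. The helper name extract_comments is made up for illustration; only the standard tokenize and io modules are used.

```python
import io
import tokenize


def extract_comments(source):
    """Return (line, column, text) for every comment token in source."""
    readline = io.StringIO(source).readline
    return [
        (tok.start[0], tok.start[1], tok.string)
        for tok in tokenize.generate_tokens(readline)
        if tok.type == tokenize.COMMENT
    ]


source = "x = 1  # set x\n# standalone\ny = 2\n"
for line, col, text in extract_comments(source):
    print(line, col, text)
```

The positions are what make it possible to relate a comment back to an ast node later, since ast nodes carry lineno and col_offset attributes.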

Answered By: Ned Batchelder

Other experts seem to think the Python AST module strips comments, so that means that route simply won’t work for you.

Our DMS Software Reengineering Toolkit with its Python front end will parse Python and build ASTs that capture all the comments (see this SO example). The Python front end includes a prettyprinter that can regenerate Python code (with the comments!) directly from the AST. DMS itself provides the low-level parsing machinery, and a source-to-source transformation capability that operates on patterns written using the target language (e.g., Python) surface syntax.

Answered By: Ira Baxter

An AST that keeps information about formatting, comments, etc. is called a Full Syntax Tree.

redbaron is able to do this. Install with pip install redbaron and try the following code.

import redbaron

with open("/path/to/module.py", "r") as source_code:
    red = redbaron.RedBaron(source_code.read())

print(red.fst())
Answered By: azmeuk

This question naturally arises when writing any kind of Python code beautifier, PEP 8 checker, etc. In such cases you are doing a source-to-source transformation: you expect the input to be written by a human, and you not only want the output to be human-readable but also expect it to:

  1. include all comments, exactly where they appear in the original.
  2. output the exact spelling of strings, including docstrings, as in the original.

This is far from easy to do with the ast module. You could call it a hole in the API, but there seems to be no easy way to extend the API to do 1 and 2.

Andrei’s suggestion to use both ast and tokenize together is a brilliant workaround. The same idea occurred to me when writing a Python-to-CoffeeScript converter, but the code is far from trivial.
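To give a feel for the idea, here is a much-simplified sketch of pairing ast nodes with tokenize output by line number. It is not the py2cs.py code: the class CommentIndex and its method names are hypothetical, and it ignores corner cases such as blank lines inside multi-line strings.

```python
import ast
import io
import tokenize


class CommentIndex:
    """Hypothetical helper: look up comments by the line numbers of ast nodes."""

    def __init__(self, source):
        self.lines = source.splitlines()
        self.comments = {}  # line number -> (column, comment text)
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.COMMENT:
                self.comments[tok.start[0]] = (tok.start[1], tok.string)

    def trailing_comment(self, node):
        """Return the comment sharing the node's first line, or ''."""
        col, text = self.comments.get(node.lineno, (None, ""))
        # A comment that starts its line is a leading line, not a trailing one.
        if text and self.lines[node.lineno - 1][:col].strip():
            return text
        return ""

    def leading_lines(self, node):
        """Return the comment-only and blank lines directly above the node."""
        result = []
        n = node.lineno - 1  # 1-based number of the line above the node
        while n >= 1:
            stripped = self.lines[n - 1].strip()
            is_comment_line = n in self.comments and stripped.startswith("#")
            if stripped and not is_comment_line:
                break  # reached a line of code
            result.append(self.lines[n - 1])
            n -= 1
        return result[::-1]


source = "# configure\n\nx = 1  # trailing\ny = 2\n"
index = CommentIndex(source)
for stmt in ast.parse(source).body:
    print(index.leading_lines(stmt), repr(index.trailing_comment(stmt)))
```

A visitor can call these two lookups for each statement node it emits. Everything beyond that (strings, 'else' clauses, multi-line statements) is where the real complexity lies.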

The TokenSync (ts) class starting at line 1305 in py2cs.py coordinates communication between the token-based data and the ast traversal. Given the source string s, the TokenSync class tokenizes s and inits internal data structures that support several interface methods:

ts.leading_lines(node): Return a list of the preceding comment and blank lines.

ts.trailing_comment(node): Return a string containing the trailing comment for the node, if any.

ts.sync_string(node): Return the spelling of the string at the given node.

It is straightforward, but just a bit clumsy, for the ast visitors to use these methods. Here are some examples from the CoffeeScriptTraverser (cst) class in py2cs.py:

def do_Str(self, node):
    '''A string constant, including docstrings.'''
    if hasattr(node, 'lineno'):
        return self.sync_string(node)

This works provided that ast.Str nodes are visited in the order they appear in the sources. This happens naturally in most traversals.
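On Python 3.8 or later, the standard library offers another way to recover the original spelling of a string: ast.get_source_segment, which uses the end positions recorded on each node. The sketch below is an alternative to the token-sync approach, not part of py2cs.py, and the function name string_spellings is made up. It checks for ast.Constant, which is how newer Python versions represent string literals in place of ast.Str.

```python
import ast


def string_spellings(source):
    """Return the exact source text of every string literal, in source order."""
    tree = ast.parse(source)
    nodes = [
        node for node in ast.walk(tree)
        if isinstance(node, ast.Constant) and isinstance(node.value, str)
    ]
    # ast.walk is breadth-first, so sort by position to get source order.
    nodes.sort(key=lambda node: (node.lineno, node.col_offset))
    return [ast.get_source_segment(source, node) for node in nodes]


source = "a = 'single'\nb = \"double\"\n"
print(string_spellings(source))
```

Because the lookup is driven by each node's own position, it does not depend on the order in which nodes are visited.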

Here is the ast.If visitor. It shows how to use ts.leading_lines and ts.trailing_comment:

def do_If(self, node):

    result = self.leading_lines(node)
    tail = self.trailing_comment(node)
    s = 'if %s:%s' % (self.visit(node.test), tail)
    result.append(self.indent(s))
    for z in node.body:
        self.level += 1
        result.append(self.visit(z))
        self.level -= 1
    if node.orelse:
        tail = self.tail_after_body(node.body, node.orelse, result)
        result.append(self.indent('else:' + tail))
        for z in node.orelse:
            self.level += 1
            result.append(self.visit(z))
            self.level -= 1
    return ''.join(result)

The ts.tail_after_body method compensates for the fact that there are no ast nodes representing ‘else’ clauses. It’s not rocket science, but it isn’t pretty:

def tail_after_body(self, body, aList, result):
    '''
    Return the tail of the 'else' or 'finally' statement following the given body.
    aList is the node.orelse or node.finalbody list.
    '''
    node = self.last_node(body)
    if node:
        max_n = node.lineno
        leading = self.leading_lines(aList[0])
        if leading:
            result.extend(leading)
            max_n += len(leading)
        tail = self.trailing_comment_at_lineno(max_n + 1)
    else:
        tail = '\n'
    return tail

Note that cst.tail_after_body just calls ts.tail_after_body.
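The "missing else node" problem can be illustrated with a small standalone sketch. It assumes Python 3.8 or later (for the end_lineno attribute) and is not the TokenSync implementation; the function name else_comment is hypothetical. It relies on the observation that the first name token after the last line of the 'if' body is either 'else' or 'elif'.

```python
import ast
import io
import tokenize


def else_comment(source, if_node):
    """Return the trailing comment on the 'else:' line of if_node, or ''."""
    if not if_node.orelse:
        return ""
    body_end = if_node.body[-1].end_lineno  # last line of the 'if' body
    else_line = None
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        line = tok.start[0]
        if line <= body_end:
            continue
        if else_line is None:
            if tok.type == tokenize.NAME:
                if tok.string != "else":
                    return ""  # an 'elif' is represented as a nested ast.If
                else_line = line
        elif line > else_line:
            break  # left the 'else:' line without seeing a comment
        elif tok.type == tokenize.COMMENT:
            return tok.string
    return ""


source = "if a:\n    x = 1\nelse:  # fallback\n    x = 2 if b else 3  # inner\n"
print(else_comment(source, ast.parse(source).body[0]))
```

The same scan would need small changes to cover 'finally' and the 'else' clauses of loops and try statements.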

Summary

The TokenSync class encapsulates most of the complexities involved in making token-oriented data available to ast traversal code. Using the TokenSync class is straightforward, but the ast visitors for all Python statements (and ast.Str) must include calls to ts.leading_lines, ts.trailing_comment and ts.sync_string. Furthermore, the ts.tail_after_body hack is needed to handle “missing” ast nodes.

In short, the code works well, but is just a bit clumsy.

@Andrei: your short answer might suggest that you know of a more elegant way. If so, I would love to see it.

Edward K. Ream

Answered By: Edward K. Ream

A few people have already mentioned lib2to3 but I wanted to create a more complete answer, because this tool is an under-appreciated gem. Don’t bother with redbaron.

lib2to3 consists of a few parts:

  • the parser: tokens, grammar, etc.
  • fixers: a library of transformations
  • refactor tools: apply fixers to a parsed AST
  • the command line: choose fixes to apply and run them in parallel using multiprocessing

Below is a brief introduction to using lib2to3 for transformations and scraping data (i.e. extraction).

Transformations

If you’d like to transform Python files (i.e. complex find/replace), the CLI provided by lib2to3 is fully featured, and can transform files in parallel.

To use it, create a Python package where each sub-module within it contains a single sub-class of lib2to3.fixer_base.BaseFix. See lib2to3.fixes for lots of examples.

Then create your executable script (replacing “myfixes” with the name of your package):

import sys
import lib2to3.main

def main(args=None):
    sys.exit(lib2to3.main.main("myfixes", args=args))

if __name__ == '__main__':
    main()

Run yourscript -h to see the options.

Scraping

If your goal is to gather data, but not transform it, then you need to do a little more work. Here’s a recipe I whipped up to use lib2to3 for data scraping:

# file: basescraper.py
from __future__ import absolute_import, print_function

from lib2to3.pgen2 import token
from lib2to3.pgen2.parse import ParseError
from lib2to3.pygram import python_grammar
from lib2to3.refactor import RefactoringTool
from lib2to3 import fixer_base


def symbol_name(number):
    """
    Get a human-friendly name from a token or symbol

    Very handy for debugging.
    """
    try:
        return token.tok_name[number]
    except KeyError:
        return python_grammar.number2symbol[number]


class SimpleRefactoringTool(RefactoringTool):
    def __init__(self, scraper_classes, options=None, explicit=None):
        self.fixers = None
        self.scraper_classes = scraper_classes
        # first argument is a list of fixer paths, as strings. we override
        # get_fixers, so we don't need it.
        super(SimpleRefactoringTool, self).__init__(None, options, explicit)

    def get_fixers(self):
        """
        Override base method to get fixers from passed fixers classes instead
        of via dotted-module-paths.
        """
        self.fixers = [cls(self.options, self.fixer_log)
                       for cls in self.scraper_classes]
        return (self.fixers, [])

    def get_results(self):
        """
        Get the scraped results returned from `scraper_classes`
        """
        return {type(fixer): fixer.results for fixer in self.fixers}


class BaseScraper(fixer_base.BaseFix):
    """
    Base class for a fixer that stores results.

    lib2to3 was designed with transformation in mind, but if you just want
    to scrape results, you need a way to pass data back to the caller.
    """
    BM_compatible = True

    def __init__(self, options, log):
        self.results = []
        super(BaseScraper, self).__init__(options, log)

    def scrape(self, node, match):
        raise NotImplementedError

    def transform(self, node, match):
        result = self.scrape(node, match)
        if result is not None:
            self.results.append(result)


def scrape(code, scraper):
    """
    Simple interface when you have a single scraper class.
    """
    tool = SimpleRefactoringTool([scraper])
    tool.refactor_string(code, '<test.py>')
    return tool.get_results()[scraper]

And here’s a simple scraper that finds the first comment after a function def:

# file: commentscraper.py
from lib2to3.pgen2 import token  # needed for token.INDENT below

from basescraper import scrape, BaseScraper, ParseError

class FindComments(BaseScraper):

    PATTERN = """
    funcdef< 'def' name=any parameters< '(' [any] ')' >
           ['->' any] ':' suite=any+ >
    """

    def scrape(self, node, results):
        suite = results["suite"]
        name = results["name"]

        if suite[0].children[1].type == token.INDENT:
            indent_node = suite[0].children[1]
            return (str(name), indent_node.prefix.strip())
        else:
            # e.g. "def foo(...): x = 5; y = 7"
            # nothing to save
            return

# example usage:

code = '''

@decorator
def foobar():
    # type: comment goes here
    """
    docstring
    """
    pass

'''
comments = scrape(code, FindComments)
assert comments == [('foobar', '# type: comment goes here')]
Answered By: chadrik

If you’re using Python 3, you can use bowler, which is based on lib2to3, but provides a much nicer API and CLI for creating transformation scripts.

https://pybowler.io/

Answered By: chadrik

LibCST provides a Concrete Syntax Tree for Python that looks and feels like an AST. Most node types are the same as in the AST, while formatting information (comments, whitespace, commas, etc.) is also available.
https://github.com/Instagram/LibCST/

In [1]: import libcst as cst

In [2]: cst.parse_statement("fn(1, 2)  # a comment")
Out[2]:
SimpleStatementLine(
    body=[
        Expr(
            value=Call(
                func=Name(
                    value='fn',
                    lpar=[],
                    rpar=[],
                ),
                args=[
                    Arg(
                        value=Integer(
                            value='1',
                            lpar=[],
                            rpar=[],
                        ),
                        keyword=None,
                        equal=MaybeSentinel.DEFAULT,
                        comma=Comma(        # <--- a comma
                            whitespace_before=SimpleWhitespace(
                                value='',
                            ),
                            whitespace_after=SimpleWhitespace(
                                value=' ',  # <--- a white space
                            ),
                        ),
                        star='',
                        whitespace_after_star=SimpleWhitespace(
                            value='',
                        ),
                        whitespace_after_arg=SimpleWhitespace(
                            value='',
                        ),
                    ),
                    Arg(
                        value=Integer(
                            value='2',
                            lpar=[],
                            rpar=[],
                        ),
                        keyword=None,
                        equal=MaybeSentinel.DEFAULT,
                        comma=MaybeSentinel.DEFAULT,
                        star='',
                        whitespace_after_star=SimpleWhitespace(
                            value='',
                        ),
                        whitespace_after_arg=SimpleWhitespace(
                            value='',
                        ),
                    ),
                ],
                lpar=[],
                rpar=[],
                whitespace_after_func=SimpleWhitespace(
                    value='',
                ),
                whitespace_before_args=SimpleWhitespace(
                    value='',
                ),
            ),
            semicolon=MaybeSentinel.DEFAULT,
        ),
    ],
    leading_lines=[],
    trailing_whitespace=TrailingWhitespace(
        whitespace=SimpleWhitespace(
            value='  ',
        ),
        comment=Comment(
            value='# a comment',  # <--- comment
        ),
        newline=Newline(
            value=None,
        ),
    ),
)
Answered By: Lai Jimmy