Using regex to remove comments from source files

Question:

I’m making a program to automate the writing of some C code (it parses strings into enumerations with the same name).
C’s handling of strings is not that great,
so some people have been nagging me to try Python.

I made a function that is supposed to remove C-style /* COMMENT */ and //COMMENT
from a string:
Here is the code:

import re

def removeComments(string):
    re.sub(re.compile("/\*.*?\*/",re.DOTALL ) ,"" ,string) # remove all occurrences of streamed comments (/*COMMENT */) from string
    re.sub(re.compile("//.*?\n" ) ,"" ,string) # remove all occurrences of single-line comments (//COMMENT\n ) from string

So I tried this code out.

str="/* spam * spam */ eggs"
removeComments(str)
print str

And it apparently did nothing.

Any suggestions as to what I’ve done wrong?

There’s a saying I’ve heard a couple of times:

If you have a problem and you try to solve it with Regex you end up with two problems.


EDIT:
Looking back at this years later (after a fair bit more parsing experience):

I think regex may have been the right solution,
and the simple regex used here was "good enough".
I may not have emphasized this enough in the question:
this was for a single specific file that had no tricky situations.
I think it would be a lot less maintenance to keep the file being parsed simple enough for the regex (e.g. require that the file only use // single-line comments) than to complicate the regex into an unreadable symbol soup.

Answers:

You are doing it wrong.

Regex is for Regular Languages, which C isn’t.

Answered By: Otto Allmendinger

I would suggest using a REAL parser like SimpleParse or PyParsing. SimpleParse requires that you actually know EBNF, but is very fast. PyParsing has its own EBNF-like syntax but that is adapted for Python and makes it a breeze to build powerfully accurate parsers.

Edit:

Here is an example of how easy it is to use PyParsing in this context:

>>> test = '/* spam * spam */ eggs'
>>> import pyparsing
>>> comment = pyparsing.nestedExpr("/*", "*/").suppress()
>>> print comment.transformString(test)         
' eggs'

Here is a more complex example using single and multi-line comments.

Before:

/*
 * multiline comments
 * abc 2323jklj
 * this is the worst C code ever!!
*/
void
do_stuff ( int shoe, short foot ) {
    /* this is a comment
     * multiline again! 
     */
    exciting_function(whee);
} /* extraneous comment */

After:

>>> print comment.transformString(code)   

void
do_stuff ( int shoe, short foot ) {

     exciting_function(whee);
} 

It leaves an extra newline wherever it stripped comments, but that could be addressed.
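
One possible way to address the leftover lines is a small post-processing pass. The sketch below is an assumption on my part, not part of the pyparsing answer: `drop_blank_lines` is a hypothetical helper that removes every whitespace-only line, so it would also remove blank lines that were in the file on purpose.

```python
def drop_blank_lines(text):
    # Keep only lines that contain something other than whitespace.
    return "\n".join(line for line in text.splitlines() if line.strip())

# Roughly the shape of the "After" output shown above.
stripped = "\nvoid\ndo_stuff ( int shoe, short foot ) {\n    \n    exciting_function(whee);\n} \n"
print(drop_blank_lines(stripped))
```

If intentional blank lines matter, a gentler variant would be to collapse only runs of several blank lines into one.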

Answered By: jathanism

re.sub returns a string, so changing your code to the following will give results:

def removeComments(string):
    string = re.sub(re.compile("/\*.*?\*/",re.DOTALL ) ,"" ,string) # remove all occurrences of streamed comments (/*COMMENT */) from string
    string = re.sub(re.compile("//.*?\n" ) ,"" ,string) # remove all occurrences of single-line comments (//COMMENT\n ) from string
    return string
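
For completeness, here is a sketch of how the corrected function might be called. The names `BLOCK_COMMENT`, `LINE_COMMENT` and `remove_comments` are my own; the patterns are the same ones as above, written as raw strings and compiled once.

```python
import re

# Raw strings keep the backslashes in the patterns intact.
BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
LINE_COMMENT = re.compile(r"//.*?\n")

def remove_comments(text):
    # Each sub() call returns a new string, so the result is reassigned.
    text = BLOCK_COMMENT.sub("", text)
    text = LINE_COMMENT.sub("", text)
    return text

source = "/* spam * spam */ eggs"
cleaned = remove_comments(source)  # the caller must keep the return value
print(repr(cleaned))
```

Note that the single-line pattern also consumes the trailing newline, so the line after a // comment is joined onto the line that held the comment.
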
Answered By: msanders

I would recommend you read this page, which has a quite detailed analysis of the problem and gives a good understanding of why your approach doesn’t work: http://ostermiller.org/findcomment.html

Short version: The regex you are looking for is this:

(/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/)|(//.*)

This should match both types of comment blocks. If you have trouble following it, read the page I linked.
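
As a rough illustration of how that pattern could be used from Python (the `COMMENT_RE` name and the sample input are made up for this sketch):

```python
import re

# The pattern from above, written as a raw string so the backslashes survive.
COMMENT_RE = re.compile(r"(/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/)|(//.*)")

sample = "int a; /* one ** two */ int b; // tail\nint c; /* multi\n line */"
result = COMMENT_RE.sub("", sample)
print(repr(result))
```

Without re.DOTALL the `//.*` branch stops at the end of the line, so the newline after a single-line comment is kept. The pattern still knows nothing about string literals, so a comment delimiter inside quotes would be treated as a comment.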

Answered By: MatsT

I see several things you might want to revise.

First, Python passes references to objects into functions, and some object types are immutable. Strings and integers are among these immutable types. So if you pass a string to a function, nothing you do within the function can change the string you passed in; the most you can do is rebind the local name to a new string. You should return a string instead. Furthermore, within the removeComments() function, you need to assign the value returned by re.sub() to a variable: like any function that takes a string as an argument, re.sub() will not modify the string.
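
A small sketch of the difference, using two hypothetical helpers (`strip_in_place` and `strip_and_return` are names invented for this example):

```python
import re

def strip_in_place(text):
    # re.sub builds and returns a new string; here the result is thrown away.
    re.sub(r"/\*.*?\*/", "", text, flags=re.DOTALL)

def strip_and_return(text):
    # Returning the new string is the only way to hand it back to the caller.
    return re.sub(r"/\*.*?\*/", "", text, flags=re.DOTALL)

original = "/* spam * spam */ eggs"
strip_in_place(original)
print(repr(original))   # still the untouched input

result = strip_and_return(original)
print(repr(result))     # the comment is gone in the returned copy
```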

Second, I would echo what others have said about parsing C code. Regular expressions are not the best way to go here.

Answered By: jhoon

As noted in one of my other comments, comment nesting isn’t really the problem (in C, comments don’t nest, though a few compilers do support nested comments anyway). The problem is with things like string literals, which can contain the exact same character sequence as a comment delimiter without actually being one.
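
To make the string-literal problem concrete, here is a minimal sketch with a naive pattern (the `naive` name and the example URL are invented for illustration):

```python
import re

# A naive pattern that knows nothing about string literals.
naive = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)

line = 'const char *url = "http://example.com"; /* real comment */'
mangled = naive.sub("", line)
print(mangled)
```

The `//` inside the URL is taken as the start of a single-line comment, so everything after `http:` is removed, including the closing quote.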

As Mike Graham said, the right tool for the job is a lexer. A parser is unnecessary and would be overkill, but a lexer is exactly the right thing. As it happens, I posted a (partial) lexer for C (and C++) earlier this morning. It doesn’t attempt to correctly identify all lexical elements (i.e. all keywords and operators) but it’s entirely sufficient for stripping comments. It won’t do any good on the “using Python” front though, as it’s written entirely in C (it predates my using C++ for much more than experimental code).
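
For the Python side, a lexer-style scan can be sketched in a few lines. This is not the C lexer mentioned above, only an illustration of the idea under simplifying assumptions: `strip_c_comments` is a hypothetical name, and the sketch ignores trigraphs, backslash line continuations and C++ raw string literals.

```python
def strip_c_comments(source):
    """Remove // and /* */ comments, leaving string and char literals alone."""
    out = []
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        if ch in "\"'":
            # Copy a string or character literal verbatim, honoring escapes.
            quote = ch
            j = i + 1
            while j < n and source[j] != quote:
                j += 2 if source[j] == "\\" else 1
            j = min(j + 1, n)  # include the closing quote if there is one
            out.append(source[i:j])
            i = j
        elif source.startswith("//", i):
            # Line comment: skip up to, but not including, the newline.
            j = source.find("\n", i)
            i = n if j == -1 else j
        elif source.startswith("/*", i):
            # Block comment: skip through the closing delimiter.
            j = source.find("*/", i + 2)
            i = n if j == -1 else j + 2
            out.append(" ")  # a comment counts as whitespace in C
        else:
            out.append(ch)
            i += 1
    return "".join(out)

print(strip_c_comments('u = "http://h"; // c\nint z; /* gone */'))
```

Because the scanner consumes a whole literal before it ever looks for comment delimiters, delimiters inside quotes are never mistaken for comments.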

Answered By: Jerry Coffin
mystring="""
blah1 /* comments with
multiline */

blah2
blah3
// double slashes comments
blah4 // some junk comments

"""
for s in mystring.split("*/"):
    s=s[:s.find("/*")]
    print s[:s.find("//")]

output

$ ./python.py

blah1


blah2
blah3
Answered By: ghostdog74

This program removes comments with // and /* */ from the given file:

#! /usr/bin/python3
import sys
import re
if len(sys.argv)!=2:
     exit("Syntax:python3 exe18.py inputfile.cc ")
else:
     print ('The following files are given by you:',sys.argv[0],sys.argv[1])
with open(sys.argv[1],'r') as ifile:
    newstring=re.sub(r'/\*.*?\*/',' ',ifile.read(),flags=re.S)
with open(sys.argv[1],'w') as ifile:
    ifile.write(newstring)
print('/* */ have been removed from the inputfile')
with open(sys.argv[1],'r') as ifile:
      newstring1=re.sub(r'//.*',' ',ifile.read())
with open(sys.argv[1],'w') as ifile:
      ifile.write(newstring1)
print('// have been removed from the inputfile')
Answered By: harishli2020

What about "//comment-like strings inside quotes"?

OP is asking how to do it using regular expressions; so:

import re

def remove_comments(string):
    pattern = r"(\".*?\"|\'.*?\')|(/\*.*?\*/|//[^\r\n]*$)"
    # first group captures quoted strings (double or single)
    # second group captures comments (//single-line or /* multi-line */)
    regex = re.compile(pattern, re.MULTILINE|re.DOTALL)
    def _replacer(match):
        # if the 2nd group (capturing comments) is not None,
        # it means we have captured a non-quoted (real) comment string.
        if match.group(2) is not None:
            return "" # so we will return empty to remove the comment
        else: # otherwise, we will return the 1st group
            return match.group(1) # captured quoted-string
    return regex.sub(_replacer, string)

This WILL remove:

  • /* multi-line comments */
  • // single-line comments

Will NOT remove:

  • String var1 = "this is /* not a comment. */";
  • char *var2 = "this is // not a comment, either.";
  • url = 'http://not.comment.com';

Note: This will also work for Javascript source.
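
A quick check of the cases listed above could look like the sketch below. The function body is a condensed copy of the one above so that the snippet runs on its own; the sample source text is made up.

```python
import re

def remove_comments(string):
    # Group 1: quoted strings (kept). Group 2: comments (dropped).
    pattern = r"(\".*?\"|\'.*?\')|(/\*.*?\*/|//[^\r\n]*$)"
    regex = re.compile(pattern, re.MULTILINE | re.DOTALL)
    def _replacer(match):
        return "" if match.group(2) is not None else match.group(1)
    return regex.sub(_replacer, string)

source = '''String var1 = "this is /* not a comment. */"; /* gone */
char *var2 = "this is // not a comment, either."; // gone too
url = 'http://not.comment.com';
'''
print(remove_comments(source))
```

One limitation to keep in mind: the quoted-string branch does not understand escaped quotes, so a literal such as "a \" // b" ends early at the escaped quote and the rest of the line is then treated as a comment.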

Answered By: Onur Yıldırım

I just want to add another regex, for the case where we have to remove anything between * and ; in Python:

data = re.sub(re.compile(r"\*.*?;", re.DOTALL), ' ', data)

The backslash before * escapes the metacharacter.
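
A short sketch of that substitution on made-up input, to show what it does and does not consume:

```python
import re

# Sample text invented for this example; the span runs across a newline.
data = "keep this * drop\nall of this; keep that"
cleaned = re.sub(r"\*.*?;", " ", data, flags=re.DOTALL)
print(repr(cleaned))
```

The lazy `.*?` stops at the first semicolon, and re.DOTALL lets the match cross line breaks. Both the `*` and the `;` are part of the match, so they are replaced along with the text between them.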

Answered By: Saurabh Kukreti

Following jathanism’s answer, I found another solution with pyparsing.

import pyparsing

test = """
/* Code my code
xx to remove comments in C++
or C or python */

include <iostream> // Some comment

int main (){
    cout << "hello world" << std::endl; // comment
}
"""
commentFilter = pyparsing.cppStyleComment.suppress()
# To filter python style comment, use
# commentFilter = pyparsing.pythonStyleComment.suppress()
# To filter C style comment, use
# commentFilter = pyparsing.cStyleComment.suppress()

newtest = commentFilter.transformString(test)
print(newtest)

Produces the following output:

include <iostream> 

int main (){
    cout << "hello world" << std::endl; 
}

You can also use pythonStyleComment, javaStyleComment or cStyleComment in place of cppStyleComment. I found it pretty useful.

Answered By: Gokul