
Remove C and C++ comments using Python?


This handles C++-style comments, C-style comments, strings and simple nesting thereof.

import re

def comment_remover(text):
    def replacer(match):
        s = match.group(0)
        if s.startswith('/'):
            return " "  # note: a space and not an empty string
        else:
            return s
    pattern = re.compile(
        r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
        re.DOTALL | re.MULTILINE
    )
    return re.sub(pattern, replacer, text)

Strings need to be included in the pattern, because comment markers inside them do not start a comment.

Edit: re.sub didn't take any flags (the flags argument was only added in Python 2.7), so I had to compile the pattern first.

Edit2: Added character literals, since they could contain quotes that would otherwise be recognized as string delimiters.

Edit3: Fixed the case where a legal expression int/**/x=5; would become intx=5;, which would not compile, by replacing the comment with a space rather than an empty string.
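To make the three edits concrete, here is a small check of the function above on a made-up input. The sample string is my own invention, not from the original question, and the function is repeated only so the snippet runs on its own.

```python
import re

def comment_remover(text):
    def replacer(match):
        s = match.group(0)
        # Comments start with '/'; string and character literals do not.
        return " " if s.startswith('/') else s

    pattern = re.compile(
        r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
        re.DOTALL | re.MULTILINE,
    )
    return re.sub(pattern, replacer, text)

# Hypothetical sample: a comment between tokens, a trailing // comment,
# a string that merely looks like a comment, and a quote character literal.
sample = (
    'int/**/x=5; // trailing\n'
    'char *s = "/* not a comment */"; char q = \'"\';\n'
)
print(comment_remover(sample))
```

The string and the character literal should come back unchanged, while both comments collapse to a single space each.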


C (and C++) comments cannot be nested. Regular expressions work well:

//.*?\n|/\*.*?\*/

This requires the “Single line” flag (re.S, also spelled re.DOTALL) because a C comment can span multiple lines.

import re

def stripcomments(text):
    return re.sub(r'//.*?\n|/\*.*?\*/', '', text, flags=re.S)

This code should work.

/EDIT: Notice that my code above actually makes an assumption about line endings: it expects every line to end in \n. It won't work on a classic Mac text file, where lines end in a bare \r. However, this can be amended relatively easily:

//.*?(\r\n?|\n)|/\*.*?\*/

This regular expression should work on all text files, regardless of their line endings (covers Windows, Unix and Mac line endings).
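As a rough sketch, this is what the function might look like with the amended pattern dropped in, run against the three line-ending styles. The name PATTERN and the sample input are mine.

```python
import re

# Same idea as above, but a line comment may end with \r\n, \r, or \n.
PATTERN = re.compile(r'//.*?(\r\n?|\n)|/\*.*?\*/', re.S)

def stripcomments(text):
    return PATTERN.sub('', text)

for eol in ('\n', '\r\n', '\r'):
    source = 'a = 1; // first' + eol + 'b = 2; /* second */' + eol
    print(repr(stripcomments(source)))
```

Note that the pattern consumes the line terminator together with the // comment, so the following line gets joined onto the current one. A // comment on the very last line of a file with no trailing newline is not matched at all.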

/EDIT: MizardX and Brian (in the comments) made a valid remark about the handling of strings. I completely forgot about that because the above regex is plucked from a parsing module that has additional handling for strings. MizardX's solution should work very well but it only handles double-quoted strings.
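For anyone wondering what the string problem looks like in practice, here is a minimal illustration using the string-unaware regex from this answer. The input line is a made-up example.

```python
import re

def stripcomments(text):
    # The string-unaware regex from this answer.
    return re.sub(r'//.*?(\r\n?|\n)|/\*.*?\*/', '', text, flags=re.S)

# Hypothetical input: the // inside the string literal is not a comment.
line = 'const char *url = "http://example.com"; /* real comment */\n'
print(repr(stripcomments(line)))  # 'const char *url = "http:'
```

Everything from the // inside the string to the end of the line is thrown away, which is why the string alternatives in the pattern matter.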


Don't forget that in C, backslash-newline is eliminated before comments are processed, and trigraphs are processed before that (because ??/ is the trigraph for backslash). I have a C program called SCC (strip C/C++ comments), and here is part of the test code...

" */ /* SCC has been trained to know about strings /* */ */"!
"\"Double quotes embedded in strings, \\\" too\'!"
"And \
newlines in them"

"And escaped double quotes at the end of a string\""

aa '\\
n' OK
aa "\""
aa "\
\n"

This is followed by C++/C99 comment number 1.
// C++/C99 comment with \
continuation character \
on three source lines (this should not be seen with the -C fla
The C++/C99 comment number 1 has finished.

This is followed by C++/C99 comment number 2.
/\
/\
C++/C99 comment (this should not be seen with the -C flag)
The C++/C99 comment number 2 has finished.

This is followed by regular C comment number 1.
/\
*\
Regular
comment
*\
/
The regular C comment number 1 has finished.

/\
\/ This is not a C++/C99 comment!

This is followed by C++/C99 comment number 3.
/\
\
\
/ But this is a C++/C99 comment!
The C++/C99 comment number 3 has finished.

/\
\* This is not a C or C++  comment!

This is followed by regular C comment number 2.
/\
*/ This is a regular C comment *\
but this is just a routine continuation *\
and that was not the end either - but this is *\
\
/
The regular C comment number 2 has finished.

This is followed by regular C comment number 3.
/\
\
\
\
* C comment */

This does not illustrate trigraphs. Note that a line can end with multiple backslashes; line splicing does not care how many there are, but the subsequent processing might. Writing a single regex to handle all these cases will be non-trivial (but that is different from impossible).
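One way to avoid the single giant regex is to mimic the translation order described above: undo the ??/ trigraph, splice the lines, and only then strip comments. Below is a minimal sketch under that assumption. splice_lines is a hypothetical helper of mine, it handles only the ??/ trigraph, and the comment stripper is the regex approach from the first answer, not SCC.

```python
import re

def splice_lines(text):
    # Hypothetical helper: replace the ??/ trigraph with a backslash first,
    # then delete every backslash that is immediately followed by a newline.
    text = text.replace('??/', '\\')
    return re.sub(r'\\(?:\r\n?|\n)', '', text)

def comment_remover(text):
    # Regex approach from the first answer, repeated so this runs on its own.
    def replacer(match):
        s = match.group(0)
        return " " if s.startswith('/') else s

    pattern = re.compile(
        r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
        re.DOTALL | re.MULTILINE,
    )
    return re.sub(pattern, replacer, text)

# Made-up input with comment markers split across spliced lines.
source = (
    '/\\\n/ spliced C++ comment\n'
    'int x; /\\\n* spliced C comment *\\\n/ int y;\n'
)
print(repr(comment_remover(splice_lines(source))))
```

The trade-off is that splicing changes the text outside comments too: continued lines (including continued string literals) are joined, so line numbers in the output no longer match the input.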