Remove C and C++ comments using Python?
This handles C++-style comments, C-style comments, strings and simple nesting thereof.
import re

def comment_remover(text):
    def replacer(match):
        s = match.group(0)
        if s.startswith('/'):
            return " "  # note: a space and not an empty string
        else:
            return s
    pattern = re.compile(
        r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
        re.DOTALL | re.MULTILINE
    )
    return re.sub(pattern, replacer, text)

Strings need to be included, because comment markers inside them do not start a comment.
Edit: re.sub didn't take any flags, so had to compile the pattern first.
Edit2: Added character literals, since they could contain quotes that would otherwise be recognized as string delimiters.
Edit3: Fixed the case where a legal expression such as int/**/x=5; would become intx=5; which would not compile, by replacing the comment with a space rather than an empty string.
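To make the role of the literal alternatives concrete, here is a small self-contained check. It uses the same pattern as the answer; the sample inputs (including the example.com URL) are made up for illustration and are not from the original post.

```python
import re

# Same alternatives as in the answer: comments first, then char and string literals.
PATTERN = re.compile(
    r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
    re.DOTALL | re.MULTILINE,
)

def comment_remover(text):
    def replacer(match):
        s = match.group(0)
        # Only comments start with '/'; literals are handed back unchanged.
        return " " if s.startswith('/') else s
    return PATTERN.sub(replacer, text)

# Hypothetical sample inputs, chosen to exercise each alternative.
samples = [
    'int/**/x=5;',
    'char *url = "http://example.com"; // trailing comment',
    "char c = '\"'; /* quote inside a character literal */",
]
for src in samples:
    print(repr(comment_remover(src)))
```

Because the regex engine consumes a whole string or character literal as one match, the `//` inside the URL and the `"` inside the character literal never get a chance to start a comment or a string.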
C (and C++) comments cannot be nested. Regular expressions work well:
//.*?\n|/\*.*?\*/
This requires the “Single line” flag (re.S, also spelled re.DOTALL) because a C comment can span multiple lines.

import re

def stripcomments(text):
    return re.sub(r'//.*?\n|/\*.*?\*/', '', text, flags=re.S)
This code should work.
/EDIT: Note that the code above makes an assumption about line endings! It won't work on a classic Mac text file, where lines end in \r alone. However, this can be amended relatively easily:
//.*?(\r\n?|\n)|/\*.*?\*/
This regular expression should work on all text files, regardless of their line endings (covers Windows, Unix and Mac line endings).
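A quick way to try the pattern against all three line-ending styles is sketched below. One assumption that goes beyond the answer: instead of replacing every match with an empty string, this variation puts the captured line ending back, so that code on the following line is not joined onto the current one.

```python
import re

# The line-ending-agnostic pattern; re.S lets '.' cross newlines so that
# /* ... */ comments may span several lines.
COMMENT = re.compile(r'//.*?(\r\n?|\n)|/\*.*?\*/', re.S)

def stripcomments(text):
    # group(1) is the line ending captured after a // comment (None for /* */).
    # Re-inserting it is a variation on the answer, which deletes it as well.
    return COMMENT.sub(lambda m: m.group(1) or '', text)

for eol in ('\n', '\r\n', '\r'):
    src = 'a = 1; // first' + eol + 'b = 2; /* second */' + eol
    print(repr(stripcomments(src)))
```

A `//` comment on the very last line of a file with no trailing line ending is not matched by this pattern, since it insists on seeing one.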
/EDIT: MizardX and Brian (in the comments) made a valid remark about the handling of strings. I completely forgot about that because the above regex is plucked from a parsing module that has additional handling for strings. MizardX's solution should work very well but it only handles double-quoted strings.
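The string problem is easy to reproduce. The snippet below runs the plain regex from this answer over a made-up input (the URL and the puts call are illustrative only) in which comment markers appear inside string literals:

```python
import re

def stripcomments(text):
    # The plain pattern: it has no knowledge of string or character literals.
    return re.sub(r'//.*?(\r\n?|\n)|/\*.*?\*/', '', text, flags=re.S)

# Hypothetical C source in which both comment markers occur inside strings.
src = (
    'const char *url = "http://example.com";\n'
    'puts("/* not a comment */");\n'
)
print(repr(stripcomments(src)))
```

Everything from the `//` in the URL to the end of that line is removed, and so is the text between `/*` and `*/` inside the second string, which leaves broken C behind.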
Don't forget that in C, backslash-newline is eliminated before comments are processed, and trigraphs are processed before that (because ??/ is the trigraph for backslash). I have a C program called SCC (strip C/C++ comments), and here is part of the test code...
" */ /* SCC has been trained to know about strings /* */ */"!
"\"Double quotes embedded in strings, \\\" too\'!"
"And \
newlines in them"
"And escaped double quotes at the end of a string\""
aa '\\
n' OK
aa "\""
aa "\
\n"

This is followed by C++/C99 comment number 1.
// C++/C99 comment with \
continuation character \
on three source lines (this should not be seen with the -C fla
The C++/C99 comment number 1 has finished.

This is followed by C++/C99 comment number 2.
/\
/\
C++/C99 comment (this should not be seen with the -C flag)
The C++/C99 comment number 2 has finished.

This is followed by regular C comment number 1.
/\
*\
Regular
comment
*\
/
The regular C comment number 1 has finished.

/\
\/ This is not a C++/C99 comment!

This is followed by C++/C99 comment number 3.
/\
\
\
/ But this is a C++/C99 comment!
The C++/C99 comment number 3 has finished.

/\
\* This is not a C or C++ comment!

This is followed by regular C comment number 2.
/\
*/ This is a regular C comment *\
but this is just a routine continuation *\
and that was not the end either - but this is *\
\
/
The regular C comment number 2 has finished.

This is followed by regular C comment number 3.
/\
\
\
\
* C comment */
This does not illustrate trigraphs. Note that you can have multiple backslashes at the end of a line; line splicing doesn't care how many there are, but the subsequent processing might. And so on. Writing a single regex to handle all these cases will be non-trivial (but that is different from impossible).
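One way to approach this in Python, rather than with a single regex, is to mimic the early translation phases first and strip comments afterwards. The sketch below is an assumption-laden illustration, not how SCC works: the function names are hypothetical, only the ??/ trigraph is translated, and the comment pattern is borrowed from the first answer.

```python
import re

def splice_lines(text):
    # Partial phase 1: translate only the ??/ trigraph, which stands for a backslash.
    text = text.replace('??/', '\\')
    # Phase 2: delete every backslash that is immediately followed by a line ending.
    return re.sub(r'\\(?:\r\n?|\n)', '', text)

# Comment/literal pattern taken from the first answer in this thread.
COMMENT_OR_LITERAL = re.compile(
    r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
    re.DOTALL | re.MULTILINE,
)

def strip_comments_after_splicing(text):
    spliced = splice_lines(text)
    # Comments become a single space; string and character literals are kept.
    return COMMENT_OR_LITERAL.sub(
        lambda m: ' ' if m.group(0).startswith('/') else m.group(0), spliced)

# A // comment whose two slashes are split by a splice and which continues
# onto the next line through a trailing backslash.
src = 'before\n/\\\n/ split comment \\\nstill comment\nafter\n'
print(repr(strip_comments_after_splicing(src)))
```

The price of this approach is that the output no longer contains the continuation lines anywhere, including in ordinary code, so line numbers shift relative to the input.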