Example Regexes to Match Common Programming Language Constructs

Regular expressions are very useful for manipulating source code in a text editor or in a regex-based text processing tool. Most programming languages use similar constructs such as keywords, comments, and strings, but subtle differences often make it tricky to choose the correct regex. When picking a regex from the examples below, read the description that accompanies it to make sure it fits your case.

Unless otherwise indicated, all examples below assume that the dot does not match newlines and that the caret and dollar do match at embedded line breaks. In many programming languages, this means that single-line mode must be off, and multi-line mode must be on.

When used by themselves, these regular expressions may not have the intended result. If a comment appears inside a string, the comment regex will consider the text inside the string as a comment. The string regex will also match strings inside comments. The solution is to use more than one regular expression and to combine those into a simple parser, like in this pseudo-code:

GlobalStartPosition := 0;
while GlobalStartPosition < LengthOfText do
  GlobalMatchPosition := LengthOfText;
  MatchedRegEx := NULL;
  // Find the regex whose next match starts earliest
  foreach RegEx in RegExList do
    RegEx.StartPosition := GlobalStartPosition;
    if RegEx.Match and RegEx.MatchPosition < GlobalMatchPosition then
      MatchedRegEx := RegEx;
      GlobalMatchPosition := RegEx.MatchPosition;
    endif
  endforeach
  if MatchedRegEx <> NULL then
    // At this point, MatchedRegEx indicates which regex matched
    // and you can do whatever processing you want depending on
    // which regex actually matched.
    // Continue after the end of the match so the loop makes progress
    // (MatchLength is assumed to be at least 1).
    GlobalStartPosition := GlobalMatchPosition + MatchedRegEx.MatchLength;
  else
    // No regex matches in the remainder of the text
    GlobalStartPosition := LengthOfText;
  endif
endwhile

If you put a regex matching a comment and a regex matching a string in RegExList, then you can be sure that the comment regex will not match comments inside strings, and vice versa. Inside the loop you can then process the match according to whether it is a comment or a string.
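As a rough illustration, the loop can be sketched in Python. The names REGEX_LIST and scan, the sample text, and the choice of a #-style comment regex and a backslash-escaped string regex are assumptions made for this example only.

```python
import re

# Hypothetical regex list: one comment regex and one string regex.
REGEX_LIST = [
    ("comment", re.compile(r"#.*$", re.MULTILINE)),
    ("string", re.compile(r'"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*"')),
]

def scan(text):
    """Return (kind, matched_text) pairs in order of appearance."""
    tokens = []
    start = 0
    while start < len(text):
        best_kind, best_match = None, None
        for kind, regex in REGEX_LIST:
            m = regex.search(text, start)
            # Keep the match that begins earliest; ties go to the first regex.
            if m and (best_match is None or m.start() < best_match.start()):
                best_kind, best_match = kind, m
        if best_match is None:
            break  # no regex matches in the rest of the text
        tokens.append((best_kind, best_match.group()))
        # Resume after the match, moving at least one character forward.
        start = max(best_match.end(), start + 1)
    return tokens

sample = 'x = "not # a comment"  # real comment\ny = 1  # "not a string"\n'
for kind, value in scan(sample):
    print(kind, value)
```

The # inside the string is reported as part of a string, and the quoted text on the second line is reported as part of a comment.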

An alternative solution is to combine regexes: (comment)|(string). The alternation has the same effect as the code snippet above. Iterate over all the matches of this regex. Inside the loop, check which capturing group found the regex match. If group 1 matched, you have a comment. If group 2 matched, you have a string. Then process the match according to that.
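A minimal Python sketch of the combined regex might look like this. The name classify and the particular comment and string regexes are assumptions chosen for illustration.

```python
import re

# Group 1 holds the comment regex, group 2 the string regex.
COMBINED = re.compile(
    r'(#.*$)|("[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*")', re.MULTILINE
)

def classify(text):
    """Label each match according to the capturing group that matched."""
    result = []
    for m in COMBINED.finditer(text):
        if m.group(1) is not None:
            result.append(("comment", m.group(1)))
        else:
            result.append(("string", m.group(2)))
    return result

print(classify('a = "#1"  # two "2"'))
# [('string', '"#1"'), ('comment', '# two "2"')]
```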

You can use this technique to build a full parser. Add regular expressions for all lexical elements in the language or file format you want to parse. Inside the loop, keep track of what was matched so that the following matches can be processed according to their context. For example, if curly braces need to be balanced, increment a counter when an opening brace is matched, and decrement it when a closing brace is matched. Raise an error if the counter goes negative at any point or if it is nonzero when the end of the file is reached.
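The brace-counting idea can be sketched as follows, assuming //-style comments and double-quoted strings with backslash escapes. The function name braces_balanced is hypothetical, and it returns False where a real parser would likely raise an error.

```python
import re

# Comments and strings are listed first so braces inside them are skipped.
TOKEN = re.compile(
    r'(//.*$)|("[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*")|(\{)|(\})', re.MULTILINE
)

def braces_balanced(text):
    """Return True if curly braces outside comments and strings balance."""
    depth = 0
    for m in TOKEN.finditer(text):
        if m.group(3):        # opening brace
            depth += 1
        elif m.group(4):      # closing brace
            depth -= 1
            if depth < 0:     # closed more braces than were opened
                return False
    return depth == 0         # nonzero means an unclosed brace remains

print(braces_balanced('f() { s = "}"; } // {'))  # True
print(braces_balanced('} {'))                    # False
```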

Comments

#.*$ matches a single-line comment starting with a # and continuing until the end of the line. Similarly, //.*$ matches a single-line comment starting with //.

If the comment must appear at the start of the line, use ^#.*$. If only whitespace is allowed between the start of the line and the comment, use ^\s*#.*$. Compiler directives or pragmas in C can be matched this way. Note that in this last example, any leading whitespace will be part of the regex match. Use capturing parentheses to separate the whitespace and the comment.
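For instance, in Python the whitespace and the comment can be separated with two capturing groups. This is only a sketch; the name DIRECTIVE and the sample text are made up for the example.

```python
import re

# Group 1: leading whitespace. Group 2: the comment or directive itself.
DIRECTIVE = re.compile(r"^(\s*)(#.*)$", re.MULTILINE)

sample = "#include <stdio.h>\n  #define N 10\nint x; # not at line start\n"
found = DIRECTIVE.findall(sample)
print(found)
# [('', '#include <stdio.h>'), ('  ', '#define N 10')]
```

The third line is not matched because text other than whitespace precedes the #.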

/\*.*?\*/ matches a C-style multi-line comment if you turn on the option for the dot to match newlines. The general syntax is begin.*?end. C-style comments do not allow nesting. If the "begin" part appears inside the comment, it is ignored. As soon as the "end" part is found, the comment is closed.
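In Python, the option for the dot to match newlines is re.DOTALL. The short sketch below, with a made-up sample string, shows that a second /* inside a comment is ignored and that the first */ closes the comment.

```python
import re

BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)

code = "a = 1; /* first\n /* still first */ b = 2; /* second */"
comments = BLOCK_COMMENT.findall(code)
print(comments)
# ['/* first\n /* still first */', '/* second */']
```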

If your programming language allows nested comments then you can only match them properly using a single regular expression if your regular expression flavor supports recursion. The general syntax is begin(?:middle|(?R))*end. It is important that middle cannot match begin or end to avoid matching incorrectly nested comments. Swift, for example, uses the same syntax as C for comments but allows block comments to be nested. /\*(?:[^*/]|/(?!\*)|\*(?!/)|(?R))*?\*/ matches a Swift block comment, including any and all nested comments. The middle portion [^*/]|/(?!\*)|\*(?!/) is a bit complicated because we need it to match * and / but not /* or */.

If you are building a parser then it is much more efficient to let the parser handle comments. The regex with the lexical elements would have just /\* to match the start of a comment. When this is matched the parser can call a subroutine that tracks the comment nesting level, starting at 1. The routine would iterate over the matches of the regex (/\*)|(\*/), incrementing the nesting level when the first group matches and decrementing it when the second group matches. When the nesting level reaches 0 the subroutine exits, having found a properly nested comment. If the subroutine runs out of text before the nesting level reaches zero then the comment is unclosed.
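The subroutine described above could be sketched in Python like this. The name find_comment_end and its convention of returning -1 for an unclosed comment are assumptions of this example.

```python
import re

# Group 1 matches a comment opener, group 2 a comment closer.
DELIMITER = re.compile(r"(/\*)|(\*/)")

def find_comment_end(text, start):
    """Scan from just after an opening /* and return the index right after
    the matching */, or -1 if the comment is unclosed."""
    depth = 1
    for m in DELIMITER.finditer(text, start):
        if m.group(1):
            depth += 1          # nested comment opens
        else:
            depth -= 1          # a comment closes
            if depth == 0:
                return m.end()
    return -1                   # ran out of text before depth reached 0

text = "/* a /* b */ c */ d"
end = find_comment_end(text, 2)   # 2 is the position just after the first /*
print(text[:end])                 # /* a /* b */ c */
```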

Strings

"[^"\r\n]*" matches a single-line string that does not allow the quote character to appear inside the string. Using the negated character class is more efficient than using a lazy dot. "[^"]*" allows the string to span across multiple lines.

"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*" matches a single-line string in which the quote character can appear if it is escaped by a backslash. Though this regular expression may seem more complicated than it needs to be, it is much faster than simpler solutions which can cause a whole lot of backtracking in case a double quote appears somewhere all by itself rather than part of a string. "[^"\\]*(?:\\.[^"\\]*)*" allows the string to span multiple lines.

You can adapt the above regexes to match any sequence delimited by two (possibly different) characters. If we use b for the starting character, e for the ending character, and x for the escape character, the version without escape becomes b[^e\r\n]*e, and the version with escape becomes b[^ex\r\n]*(?:x.[^ex\r\n]*)*e.
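The two templates can be turned into a small helper. The sketch below is one possible way to do so in Python; the function name delimited and its parameters are hypothetical, and it assumes each delimiter is a single character.

```python
import re

def delimited(begin, end, escape=None, multiline=False):
    """Build a regex for text between single-character delimiters."""
    b, e = re.escape(begin), re.escape(end)
    # Leave \r\n out of the negated classes to allow line breaks.
    newlines = "" if multiline else r"\r\n"
    if escape is None:
        return f"{b}[^{e}{newlines}]*{e}"
    x = re.escape(escape)
    return f"{b}[^{e}{x}{newlines}]*(?:{x}.[^{e}{x}{newlines}]*)*{e}"

quoted = re.compile(delimited('"', '"', escape="\\"))
angled = re.compile(delimited("<", ">"))

print(quoted.findall(r'print("say \"hi\"", "ok")'))
print(angled.findall("a <b> c <d\ne> <f>"))   # ['<b>', '<f>']
```

With double quotes and a backslash escape, the helper produces a regex that behaves like the escaped-string regex shown earlier: the first call finds "say \"hi\"" and "ok" as two separate strings.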

You can use \v instead of \r\n if your regex flavor supports \v to match any vertical whitespace and your programming language also disallows any vertical whitespace in strings. Make sure to test this because some regex flavors match only the vertical tab with \v. Or, omit the \r\n from the negated character classes to allow strings to span across lines.

Triple-quoted strings typically allow any characters, including quotes, as part of their content. Then '''.*?''' with a lazy dot is an appropriate solution. Turn on "dot matches line breaks" or "single line" mode as in (?s)'''.*?''' to allow the string to span across lines. These regexes allow one or two consecutive quotes as part of their content. But the match ends at the first sequence of three quotes.
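A brief Python check of this behavior, using a made-up sample string:

```python
import re

TRIPLE = re.compile(r"(?s)'''.*?'''")

src = "a = '''it's a\n''quoted'' text''' + '''x'''"
strings = TRIPLE.findall(src)
print(strings)
# ["'''it's a\n''quoted'' text'''", "'''x'''"]
```

The single and double quotes inside the first string do not end it; only the first run of three quotes does.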

'''[^'\\]*+(?:(?s:\\.|'{1,2}(?!'))[^'\\]*+)*+''' allows the string to contain three consecutive quotes as long as the first one is escaped with a backslash.

Numbers

\b[0-9]+\b matches an unsigned integer number. Use [0-9] instead of \d to match only ASCII digits. Do not forget the word boundaries! [-+]?\b[0-9]+\b allows for a sign. \b[0-9][0-9_]*+\b allows the digits to be grouped arbitrarily using underscores, requiring only that the number starts with a digit. \b[0-9]+(?:_+[0-9]+)*\b additionally requires the number to end with a digit.
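The following Python sketch tries the last regex on a made-up sample and also shows why [0-9] and \d differ in a flavor where \d matches Unicode digits, as it does for Python 3 string patterns. The variable names are assumptions of this example.

```python
import re

GROUPED_INT = re.compile(r"\b[0-9]+(?:_+[0-9]+)*\b")

# x42 is part of an identifier and 12_ ends with an underscore,
# so neither is matched.
integers = GROUPED_INT.findall("1_000_000 x42 7 12_ 3__4")
print(integers)   # ['1_000_000', '7', '3__4']

# U+0663 is ARABIC-INDIC DIGIT THREE: matched by \d but not by [0-9].
unicode_digits = re.findall(r"\b\d+\b", "\u0663")
ascii_digits = re.findall(r"\b[0-9]+\b", "\u0663")
print(len(unicode_digits), len(ascii_digits))   # 1 0
```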

\b0[xX][0-9a-fA-F]+\b matches a C-style hexadecimal number. \b0[xX][_0-9a-fA-F]+\b allows grouping with underscores after the 0x.
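A quick Python comparison of the two hexadecimal regexes on a made-up sample:

```python
import re

HEX = re.compile(r"\b0[xX][0-9a-fA-F]+\b")
HEX_GROUPED = re.compile(r"\b0[xX][_0-9a-fA-F]+\b")

sample = "0xFF 0x1G 0Xab 0xFF_FF"
plain = HEX.findall(sample)
grouped = HEX_GROUPED.findall(sample)
print(plain)     # ['0xFF', '0Xab']
print(grouped)   # ['0xFF', '0Xab', '0xFF_FF']
```

0x1G is rejected by both because the word boundary cannot match between 1 and G.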

((\b[0-9]+)?\.)?\b[0-9]+\b matches an integer number as well as a floating point number with optional integer part. The word boundary before the final [0-9]+ keeps the regex from matching digits at the end of an identifier such as abc123. (\b[0-9]+\.([0-9]+\b)?|\.[0-9]+\b) matches a floating point number with optional integer as well as optional fractional part, but does not match an integer number.

((\b[0-9]+)?\.)?\b[0-9]+([eE][-+]?[0-9]+)?\b matches a number in scientific notation. The mantissa can be an integer or floating point number with optional integer part. The exponent is optional.

\b[0-9]+(\.[0-9]+)?(e[+-]?[0-9]+)?\b also matches a number in scientific notation. The difference with the previous example is that if the mantissa is a floating point number then the integer part is mandatory.
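To see the difference between the two scientific notation regexes, here is a small Python comparison. The names OPTIONAL_INT, MANDATORY_INT, and all_matches, as well as the sample text, are made up for this sketch.

```python
import re

# Mantissa with optional integer part.
OPTIONAL_INT = r"((\b[0-9]+)?\.)?\b[0-9]+([eE][-+]?[0-9]+)?\b"
# Mantissa with mandatory integer part.
MANDATORY_INT = r"\b[0-9]+(\.[0-9]+)?(e[+-]?[0-9]+)?\b"

def all_matches(pattern, text):
    """Return the full text of every match (groups are ignored)."""
    return [m.group() for m in re.finditer(pattern, text)]

sample = "1.5e3 .25 42 x9"
print(all_matches(OPTIONAL_INT, sample))    # ['1.5e3', '.25', '42']
print(all_matches(MANDATORY_INT, sample))   # ['1.5e3', '25', '42']
```

The second regex finds only 25 in .25 because it cannot start a match at the dot.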

If you read through the floating point number example then you will notice that the above regexes are different from what is used there. The above regexes are more stringent. They use word boundaries to exclude numbers that are part of other things like identifiers. You can prepend [-+]? to all of the above regexes to include an optional sign in the regex. I did not do so above because in programming languages, the + and - are usually considered operators rather than signs.

Reserved Words or Keywords

Matching reserved words is easy. Simply use alternation to string them together: \b(first|second|third|etc)\b. Again, do not forget the word boundaries.
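One way to build such a regex from a word list in Python is sketched below. The helper name keyword_regex and the sample keywords are assumptions of this example.

```python
import re

def keyword_regex(words):
    """Join the words with alternation and wrap them in word boundaries."""
    alternatives = "|".join(re.escape(w) for w in words)
    return re.compile(r"\b(" + alternatives + r")\b")

KEYWORDS = keyword_regex(["if", "else", "while"])

found = KEYWORDS.findall("if x: elif y; while_loop else while")
print(found)   # ['if', 'else', 'while']
```

Thanks to the word boundaries, elif and while_loop are not mistaken for keywords.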
