Specifying Modes Inside The Regular Expression

Normally, matching modes are specified outside the regular expression. In a programming language, you pass them as a flag to the regex constructor or append them to the regex literal. In an application, you’d toggle the appropriate buttons or checkboxes. You can find the specifics in the tools and languages section of this website.

Sometimes, the tool or language does not provide the ability to specify matching options. The handy String.matches() method in Java does not take a parameter for matching options like Pattern.compile() does. Or, the regex flavor may support matching modes that aren’t exposed as external flags. The regex functions in R have ignore.case as their only option, even though the underlying PCRE library has more matching modes than any other discussed in this tutorial.

In those situations, you can add the following mode modifiers to the start of the regex. To specify multiple modes, simply put them together as in (?ismx). All regex flavors discussed in this tutorial support mode modifiers at the start of the regex, except for JavaScript, RE2, std::regex, Oracle, XML Schema, and XPath. POSIX flavors do not support mode modifiers at all. Boost supports mode modifiers when you select its default ECMAScript grammar but not with its other grammars.
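As a minimal sketch in Python, suppose a helper accepts only a pattern string and exposes no flags parameter, much like String.matches() in Java. The name find_first is hypothetical and exists only for this illustration. Mode modifiers at the start of the regex are then the only way to change the matching modes:

```python
import re


def find_first(pattern, text):
    """Hypothetical helper that exposes no flags parameter."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


text = "Hello\nWORLD"

# No modifiers: matching is case sensitive and the dot stops at the line break.
plain = find_first(r"hello.world", text)

# (?i) alone fixes the case mismatch, but the dot still cannot cross the line break.
caseless = find_first(r"(?i)hello.world", text)

# (?is) combines case insensitivity with "dot matches line breaks".
combined = find_first(r"(?is)hello.world", text)

print(plain, caseless, repr(combined))
```

Only the last call finds a match, because this subject needs both modes at once.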

The following modifier letters are supported by the regex flavors discussed in this tutorial. Some like (?i) are universal. Others are only supported by specific flavors. A few letters are listed more than once because they have completely different functions in different regex flavors. The list organizes the letters by related functions, with increasing specialty as you go down the list.

All these modes can be turned off by preceding them with a hyphen. (?ix-sm) turns on case insensitivity and free-spacing and turns off the other two options. All flavors that support mode modifiers at the start of the regex support turning off options, except for Tcl and Python.

  • (?i) makes the regex case insensitive.
  • (?c) makes the regex case sensitive. Only supported by Tcl.
  • (?x) turns on free-spacing mode.
  • (?t) turns off free-spacing mode. Only supported by Tcl.
  • (?xx) turns on free-spacing mode, also in character classes. Supported by Perl 5.26, PCRE2 10.30, PHP 7.3.0, and R 4.0.0 and later versions.
  • (?s) for "single line mode" makes the dot match all characters, including line breaks. Not supported by Ruby. In Tcl, (?s) also makes the caret and dollar match at the start and end of the string only.
  • (?m) for "multi-line mode" makes the caret and dollar match at the start and end of each line in the subject string. In Ruby, (?m) makes the dot match all characters, without affecting the caret and dollar which always match at the start and end of each line in Ruby. In Tcl, (?m) also prevents the dot from matching line breaks.
  • (?n) in Tcl is the same as (?m).
  • (?p) in Tcl makes the caret and dollar match at the start and the end of each line, and makes the dot match line breaks.
  • (?w) in Tcl makes the caret and dollar match only at the start and the end of the subject string, and prevents the dot from matching line breaks.
  • (?d) corresponds with UNIX_LINES in Java, which makes the dot, caret, and dollar treat only the newline character \n as a line break, instead of recognizing all line break characters from the Unicode standard. Whether they match or don’t match (at) line breaks depends on (?s) and (?m).
  • (?n) turns all unnamed groups into non-capturing groups. Supported by .NET, the JGsoft flavor, Perl 5.22, PCRE2 10.30, PHP 7.3.0, R 4.0.0, and later versions of these.
  • (?w) in ICU makes \b match UAX #29 word boundaries instead of traditional word boundaries.
  • (?J) allows duplicate group names. Only supported by PCRE and PCRE2 and languages that use those such as Delphi, PHP, and R.
  • (?U) turns on "ungreedy mode", which switches the syntax for greedy and lazy quantifiers. So (?U)a* is lazy and (?U)a*? is greedy. Only supported by PCRE and PCRE2 and applications that use them. Its use is strongly discouraged because it confuses the meaning of the standard quantifier syntax.
  • (?X) makes escaping letters with a backslash an error if that combination is not a valid regex token. Only supported by PCRE and applications that use it. PCRE2 does not support this mode modifier as it treats needlessly escaped letters as an error by default.
  • (?b) makes Tcl interpret the regex as a POSIX BRE.
  • (?e) makes Tcl interpret the regex as a POSIX ERE.
  • (?q) makes Tcl interpret the regex as a literal string (minus the (?q) characters).
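
The four most widely supported letters from the list can be tried out with a short Python sketch. It covers only (?i), (?s), (?m), and (?x) as Python's re module implements them. The subject strings are made up for the illustration, and other flavors may differ in the details described above.

```python
import re

subject = "first line\nSecond line"

# (?i): "second" matches "Second".
caseless = re.search(r"(?i)second", subject)

# (?s): the dot also matches the line break, so the greedy .* spans both lines.
dot_all = re.search(r"(?s)first.*line", subject).group(0)
# Without (?s) the same regex stays on the first line.
dot_default = re.search(r"first.*line", subject).group(0)

# (?m): the caret matches at the start of each line, not just the string.
line_starts = re.findall(r"(?m)^\w+", subject)
string_start = re.findall(r"^\w+", subject)

# (?x): whitespace and #-comments in the regex are ignored.
date = re.fullmatch(r"""(?x)
    (\d{4}) -   # year
    (\d{2}) -   # month
    (\d{2})     # day
""", "2024-05-17")

print(caseless.group(0), line_starts, string_start, date.groups())
```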

The following mode modifiers can only be turned on unless a variant with a hyphen is explicitly mentioned in the list. Python only allows them at the very start of the regex. You cannot use both (?a) and (?u) in the same modifier. (?ua) and (?au) are errors in Perl and Python. In Ruby only the last letter takes effect.

  • (?a) makes shorthand and POSIX classes match only ASCII characters in Python 3.0, Perl, PCRE2 10.43, and Ruby 2.0, and later versions of these. In Python it additionally changes case insensitivity to affect only ASCII characters.
  • (?aa) stops Perl from allowing ASCII characters in the regex to match certain non-ASCII characters in the subject or vice versa when the regex is case insensitive. This affects the ASCII letter K and the Kelvin sign K (U+212A), for example. Implies (?a).
  • (?r) does in PCRE2 10.43 and R 4.4.0 and later what (?aa) does in Perl, without implying (?a).
  • (?u) makes shorthand and POSIX classes match Unicode characters in Python, Perl, and Ruby 2.0 and later. (?u) makes Python and Java use Unicode case folding for case insensitivity. So this modifier has two functions in Python, but only one function in each of the other flavors that recognize it. In Java 5 (this version only) it also turns on case insensitivity. In all other flavors that is toggled separately with (?i).
  • (?-u) makes case insensitivity affect only ASCII characters in Java. Does not turn off case insensitivity in Java 5.
  • (?U) makes shorthand and POSIX classes match Unicode characters in Java 7 and later. (?U) implies (?u).
  • (?-U) makes shorthand and POSIX classes match only ASCII characters in Java 9 and later. (?-U) does not imply (?-u).
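
A small Python 3 sketch of the (?a) behavior described above. It assumes str patterns, for which Unicode matching is the default. The helper name compiles is hypothetical.

```python
import re

# By default \w is Unicode-aware for str patterns in Python 3.
unicode_word = re.fullmatch(r"\w+", "caf\u00e9")

# (?a) restricts the shorthand class to ASCII, so the accented letter fails.
ascii_word = re.fullmatch(r"(?a)\w+", "caf\u00e9")

KELVIN = "\u212a"  # Kelvin sign, which case-folds to the ASCII letter k

# Unicode case insensitivity lets the ASCII letter k match the Kelvin sign.
unicode_fold = re.fullmatch(r"(?i)k", KELVIN)

# With (?a), case insensitivity only affects ASCII characters.
ascii_fold = re.fullmatch(r"(?ai)k", KELVIN)


def compiles(pattern):
    """Hypothetical helper: report whether Python accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# (?a) and (?u) cannot be combined in the same modifier.
both = compiles(r"(?au)\w")

print(bool(unicode_word), bool(ascii_word), bool(unicode_fold), bool(ascii_fold), both)
```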

Turning Modes On and Off for Only Part of The Regular Expression

If you insert a mode modifier such as (?ism) in the middle of the regex then the modifier only applies to the part of the regex to the right of the modifier. If you use alternation after the modifier then it applies to all following alternatives as well. If you place the modifier inside a group then it only affects the remainder of that group, including any nested groups that follow the modifier, and any following alternatives within the same group.

Tcl and Python are the only flavors discussed in this tutorial that do not support mode modifiers in the middle of the regex. They are an error in Tcl and in Python 3.11 and later. Python 3.10 and prior allowed modifiers in the middle of the regex, with a deprecation warning as of Python 3.6, but still applied the modifier to the whole regex as if you had placed it at the very start of the regex.
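
A rough sketch of how this plays out in Python. Depending on the interpreter version, a modifier in the middle of the pattern is either rejected outright or flagged with a deprecation warning, so the hypothetical helper compile_strict below treats both outcomes as a failure.

```python
import re
import warnings


def compile_strict(pattern):
    """Hypothetical helper: return the compiled regex, or None if Python
    rejects the pattern or flags it with a deprecation warning."""
    with warnings.catch_warnings():
        # Turn the deprecation warning of older versions into an exception.
        warnings.simplefilter("error", DeprecationWarning)
        try:
            return re.compile(pattern)
        except (re.error, DeprecationWarning):
            return None


# A modifier in the middle of the regex is not accepted.
mid = compile_strict(r"abc(?i)def")

# At the very start it is fine, and it applies to the whole regex.
start = compile_strict(r"(?i)abcdef")

print(mid, start.fullmatch("ABCdef") is not None)
```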

Modifiers in the middle of the regex break alternation that follows the modifier in Ruby. Use a modifier span if you need an option to apply to part of a regex that uses alternation in Ruby.

Modifier Spans

Instead of using two modifiers, one to turn an option on and one to turn it off, you can use a modifier span. (?i)caseless(?-i)cased(?i)caseless is equivalent to (?i)caseless(?-i:cased)caseless. This syntax resembles that of the non-capturing group (?:group). You could think of a non-capturing group as a modifier span that does not change any modifiers. Like a non-capturing group, the modifier span does not create a backreference.

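All flavors that support mode modifiers in the middle of the regex also support modifier spans. Tcl does not support modifier spans even though it supports non-capturing groups. Python supports modifier spans as of version 3.6, even though it does not support plain modifiers in the middle of the regex.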
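
The span syntax can be sketched in Python, whose re module accepts it from version 3.6 on. The subject strings are made up for the illustration.

```python
import re

# Case insensitive overall, except for the part inside the (?-i:...) span.
pattern = re.compile(r"(?i)caseless(?-i:cased)caseless")

ok = pattern.fullmatch("CASELESScasedCaseLess")   # "cased" is in exact case
bad = pattern.fullmatch("caselessCASEDcaseless")  # "CASED" violates the span

# The reverse: case sensitive overall, with one case insensitive span.
partial = re.compile(r"te(?i:st)")

inside = partial.fullmatch("teST")   # only "st" may vary in case
outside = partial.fullmatch("TEst")  # "te" must match exactly

print(bool(ok), bool(bad), bool(inside), bool(outside))
```

Neither span creates a capturing group, so both compiled patterns report zero groups.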

JavaScript makes a surprise appearance here. Originally, JavaScript did not support mode modifiers at all. But ECMAScript 2025 added mode modifier spans with the options (?i:case insensitive), (?s:single-line), and (?m:multi-line). As with other flavors, you can combine the letters in a single modifier span and use the hyphen to turn off certain options.
