The regexp2 module provides a fast, mostly Perl-compatible regular expression matcher. It handles the Unicode character set (UCS-2). The functions are named by symbols in the excl package: compile-re, match-re, split-re, and replace-re. (There is also an older regexp module, which is maintained for backward compatibility but is no longer discussed in this document.)
You load the regexp2 module with the form
(require :regexp2)
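As a minimal sketch of basic usage, the forms below load the module, compile a pattern once, and reuse it. The variable name *word-re* is hypothetical, and the sketch assumes that match-re returns a success flag followed by the whole match as a string.

```lisp
(eval-when (:compile-toplevel :load-toplevel :execute)
  (require :regexp2))

;; Compile the pattern once so it can be reused for many matches.
;; *word-re* is a hypothetical name chosen for this illustration.
(defparameter *word-re* (compile-re "[a-z]+"))

;; Assumption: the first value is the success flag and the second
;; value is the whole match as a string.
(multiple-value-bind (matched-p whole)
    (match-re *word-re* "hello")
  (format t "matched: ~a, whole match: ~s~%" matched-p whole))

;; split-re breaks the input at every match of the delimiter regexp.
(format t "~s~%" (split-re "," "a,b,c"))
```

Compiling up front is mainly useful when the same pattern is applied many times; for one-off matches a regexp string can be passed directly.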
As in Perl, four 'mode switches' affect the meaning of regular expressions. A switch can be specified by a keyword argument passed to the regexp API functions, or by embedding 'flags' in the regular expression itself. The valid mode switches are listed below.
:case-fold
Case-insensitive match (when true). Currently, case folding is only
effective for ASCII characters.
:multiple-lines
Treats the input string as multiple lines. If turned on, "^" and "$"
match the start and end of any line instead of the beginning and end
of the whole string.
:single-line
Treats the input string as a single line. If turned on, "." matches
any character, even a newline. Normally "." matches any character
except a newline.
:ignore-whitespace
Extended syntax. Whitespace in the regular expression is ignored, and
comments can be inserted, to increase legibility of the regular
expression.
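The keyword form of each switch can be sketched as follows. The inputs and the variable *two-lines* are made up for this illustration, and the comments describe what the definitions above imply rather than verified output.

```lisp
(eval-when (:compile-toplevel :load-toplevel :execute)
  (require :regexp2))

;; A two-line input string; *two-lines* is a hypothetical name.
(defparameter *two-lines* (format nil "a~%b"))

;; :case-fold makes the match case-insensitive.
(match-re "hello" "HELLO")                      ; no match expected
(match-re "hello" "HELLO" :case-fold t)         ; match expected

;; :single-line lets "." match a newline as well.
(match-re "a.b" *two-lines*)                    ; no match expected
(match-re "a.b" *two-lines* :single-line t)     ; match expected

;; :multiple-lines lets "$" match at the end of the first line.
(match-re "^a$" *two-lines*)                    ; no match expected
(match-re "^a$" *two-lines* :multiple-lines t)  ; match expected

;; :ignore-whitespace drops the spaces inside the pattern itself.
(match-re "a b c" "abc" :ignore-whitespace t)   ; match expected
```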
Within a regular expression, a mode switch can be turned on or off locally by the (?:...) construct, with the flag letters placed between the "?" and the ":". For example, (?i:foo) makes foo match case-insensitively, and (?-i:foo) makes foo match case-sensitively. You can combine multiple flags, as in (?im-sx:foo). A construct like (?i) changes the mode switch until the end of the current grouping.
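The embedded form can be sketched in the same way. The patterns and inputs below are illustrative only; the comments state what the description above implies.

```lisp
(eval-when (:compile-toplevel :load-toplevel :execute)
  (require :regexp2))

;; Only "foo" is case-insensitive; "bar" must still match exactly.
(match-re "(?i:foo)bar" "FOObar")        ; match expected
(match-re "(?i:foo)bar" "FOOBAR")        ; no match expected

;; (?i) switches the mode for the rest of the current grouping,
;; and (?-i:...) switches it back off for "bar" only.
(match-re "(?i)foo(?-i:bar)" "FOObar")   ; match expected
(match-re "(?i)foo(?-i:bar)" "FOOBAR")   ; no match expected
```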
The backslash character is treated by the Lisp reader as an escape character: it tells the reader to treat the next character as a literal rather than as some sort of control character. Thus, suppose you want to make a string of
ab"cd
You want the double quote character to be part of the string, but the reader will misunderstand
"ab"cd"
so you specify it as
"ab\"cd"
When those 8 characters are read, a 5 character string will be stored, with no backslash and the double quote as the third character:
cl-user(4): (setq str "ab\"cd")
"ab\"cd"
cl-user(5): (length str)
5
cl-user(6): (char str 2)
#\" ;; note " not \
cl-user(7):
Backslashes are used extensively in regular expressions. To specify a backslash in a string, you enter two backslashes: the first is read as an escape and the second as the character (a backslash) that you want to include. So
cl-user(7): "\|"
"|"
cl-user(8): "\\|"
"\\|"
cl-user(9): (length *)
2
cl-user(10): (split-re "\\|" "this|is|a|string")
("this" "is" "a" "string")
cl-user(11):
;; "\|" matches the empty string so the result is the letters as strings:
cl-user(11): (split-re "\|" "this|is|a|string")
("t" "h" "i" "s" "|" "i" "s" "|" "a" "|" "s" "t" "r" "i" "n" "g")
;; More examples with tabs (\t):
cl-user(12): (split-re "\\t" "this is a string with tabs")
("this" "is" "a" "string" "with" "tabs")
cl-user(13): (split-re "\\x09" "this is a string with tabs")
("this" "is" "a" "string" "with" "tabs")
cl-user(14):
See Capturing and back reference below for the operation of capturing and reference.
Capturing submatches (X) and (?<name>X) are numbered in the order of their opening parentheses, from left to right. A named submatch is counted the same as an unnamed submatch, and can be back-referenced by either name or number.
When the input string matches X, the matching portion of the input string is saved. It can be referenced within the regular expression by the back reference construct, or returned to the user program as a submatch.
If the capturing construct matches more than once, it saves the last match.
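A short sketch of these rules, using numbered submatches only. It assumes that match-re returns the submatches as additional values after the success flag and the whole match; the inputs are made up for illustration.

```lisp
(eval-when (:compile-toplevel :load-toplevel :execute)
  (require :regexp2))

;; Two capturing groups, numbered 1 and 2 from left to right.
;; Assumption: values are flag, whole match, then each submatch.
(multiple-value-bind (matched-p whole user host)
    (match-re "(\\w+)@(\\w+)" "user@example")
  (format t "~a ~s ~s ~s~%" matched-p whole user host))

;; \1 refers back to whatever group 1 captured.
(match-re "(a|b)\\1" "bb")   ; match expected: group 1 is "b", then "b" again
(match-re "(a|b)\\1" "ab")   ; no match expected

;; A group inside a repetition keeps only its last match.
(multiple-value-bind (matched-p whole last-digit)
    (match-re "(?:(\\d),?)+" "1,2,3")
  (format t "~a ~s ~s~%" matched-p whole last-digit))
```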
Most functions that accept a regexp string, such as match-re and compile-re, also accept a regexp tree. A regexp tree is an s-expression with the syntax described below. The syntax was defined by CL-PPCRE and is intended to be compatible with it. This documentation was taken from the CL-PPCRE documentation, available at http://www.weitz.de/cl-ppcre/ .
Programmers are usually most familiar with regexp string syntax, and it suffices for many normal regexp applications. However, string syntax does not scale very well: complex regexps are hard to write and even harder to parse. The frequent need for backslash escapes is a further complication. In such cases, programmers may find tree syntax easier to code in Lisp source code editors, with their automatic indentation and parenthesis matching.
Further, any application that generates regular expressions on the fly will find it easier to generate s-expression trees than to perform an extra level of encoding into string syntax, only to force the regexp system to parse the string back immediately.
Tree syntax is as follows:
Every string and character is a parse tree and is treated literally as a part of the regular expression, i.e. parentheses, brackets, asterisks and such aren't special.
The symbol :void is equivalent to the empty string.
The symbol :everything is equivalent to Perl's dot, i.e. it matches everything (except maybe a newline character, depending on the mode).
The symbols :word-boundary and :non-word-boundary are equivalent to Perl's "\b" and "\B".
The symbols :digit-class, :non-digit-class, :word-char-class, :non-word-char-class, :whitespace-char-class, and :non-whitespace-char-class are equivalent to Perl's special character classes "\d", "\D", "\w", "\W", "\s", and "\S" respectively.
The symbols :start-anchor, :end-anchor, :modeless-start-anchor, :modeless-end-anchor, and :modeless-end-anchor-no-newline are equivalent to Perl's "^", "$", "\A", "\Z", and "\z" respectively.
The symbols :case-insensitive-p, :case-sensitive-p, :multi-line-mode-p, :not-multi-line-mode-p, :single-line-mode-p, and :not-single-line-mode-p are equivalent to Perl's embedded modifiers "(?i)", "(?-i)", "(?m)", "(?-m)", "(?s)", and "(?-s)". As usual, changes applied to modes are kept local to the innermost enclosing grouping or clustering construct.
(:flags {<modifier>}*) where <modifier> is one of the modifier symbols listed above; it groups modifier symbols.
(:sequence {<parse-tree>}*) means a sequence of parse trees, i.e. the parse trees must match one after another. Example: (:sequence #\f #\o #\o) is equivalent to the parse tree "foo".
(:group {<parse-tree>}*) is like :sequence, but changes applied to modifier flags (see above) are kept local to the parse trees enclosed by this construct. Think of it as the S-expression variant of Perl's "(?:<pattern>)" construct.
(:alternation {<parse-tree>}*) means an alternation of parse trees, i.e. one of the parse trees must match. Example: (:alternation #\b #\a #\z) is equivalent to the Perl regex string "b|a|z".
(:branch <test> <parse-tree>) is for conditional regular expressions, which have the string syntax (?(condition)yes-pattern) or (?(condition)yes-pattern|no-pattern).
(:positive-lookahead|:negative-lookahead|:positive-lookbehind|:negative-lookbehind <parse-tree>) are the lookaround assertions, equivalent to Perl's "(?=...)", "(?!...)", "(?<=...)", and "(?<!...)" respectively.
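(:greedy-repetition|:non-greedy-repetition <min> <max> <parse-tree>) where <min> is a non-negative integer and <max> is either a non-negative integer not smaller than <min> or nil will result in a regular expression which tries to match <parse-tree> at least <min> times and at most <max> times (or as often as possible if <max> is nil). So, e.g., (:non-greedy-repetition 0 1 "ab") is equivalent to the Perl regex string "(?:ab)??".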
(:standalone <parse-tree>) is an "independent" subexpression, i.e. (:standalone "bar") is equivalent to the Perl regex string "(?>bar)".
(:register <parse-tree>) is a capturing register group. As usual, registers are counted from left to right beginning with 1.
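(:back-reference <number>) where <number> is a positive integer is a back-reference to a register group.
(:char-class|:inverted-char-class {<item>}*) where <item> is either a character, a character range, or a symbol for a special character class (see above) will be translated into a (one character wide) character class. A character range looks like (:range <char1> <char2>) where <char1> and <char2> are characters such that (char<= <char1> <char2>) is true. Example: (:inverted-char-class #\a (:range #\D #\G) :digit-class) is equivalent to the Perl regex string "[^aD-G\d]".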
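To illustrate how the constructs above compose, here is a small sketch of a tree meant to correspond to the string regexp "(\\d+)-(\\d+)". The variable name *range-tree* and the sample input are hypothetical, and the sketch assumes that match-re returns register contents as values after the whole match.

```lisp
(eval-when (:compile-toplevel :load-toplevel :execute)
  (require :regexp2))

;; Two registers, each holding one or more digits, separated by a
;; literal hyphen. A <max> of nil means "as often as possible".
(defparameter *range-tree*
  '(:sequence
    (:register (:greedy-repetition 1 nil :digit-class))
    #\-
    (:register (:greedy-repetition 1 nil :digit-class))))

;; Assumption: values are flag, whole match, register 1, register 2.
(multiple-value-bind (matched-p whole low high)
    (match-re *range-tree* "12-345")
  (format t "~a ~s ~s ~s~%" matched-p whole low high))
```

Because the tree is ordinary list data, it can be built with backquote or list operations when a pattern has to be generated at run time.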
There is a small region of ambiguity between string syntax and tree syntax. Although a single string is a valid parse tree, it will be interpreted as a Perl regexp string when given to compile-re and friends. To circumvent this you can use the equivalent parse tree (:group <string>) instead.
If you want to find out how parse trees are related to Perl regex strings, you should play around with parse-re, a function which converts a Perl regexp string to a parse tree.
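The difference between a bare string and a string wrapped in :group can be sketched as follows. The inputs are made up, and calling parse-re with a single string argument is an assumption based on the description above; its printed result is not shown because the exact tree it returns is not specified here.

```lisp
(eval-when (:compile-toplevel :load-toplevel :execute)
  (require :regexp2))

;; A bare string is read as a regexp, so "." matches any character.
(match-re "a.b" "axb")              ; match expected

;; Wrapped in :group, the same string is a literal parse tree.
(match-re '(:group "a.b") "axb")    ; no match expected
(match-re '(:group "a.b") "a.b")    ; match expected

;; Inspect the tree that corresponds to a string regexp.
;; Assumption: parse-re accepts the regexp string as its only argument.
(print (parse-re "(a|b)+c"))
```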
There are seven functions (listed first) and three macros in the API:
Traditionally, '{' and '}' characters that do not form a valid repetition syntax are taken literally. That is, the regular expression "x{1,3,4}" matches the string "x{1,3,4}", and the regular expression "a{" matches the string "a{". Currently, these regular expressions raise a syntax error in our regexp library.
Embedded Perl expressions like (?{ $a = 3+$b }) are not supported because Lisp cannot execute Perl code.
The following escape sequences are not supported. Strictly speaking, they are a feature of Perl's literal string syntax and not a part of regular expression syntax.
\N{name}   named char
\lx        lowercase x
\ux        uppercase x
\Lx..\E    lowercase x..
\Ux..\E    uppercase x..
\Qx..\E    quote non-alphanumeric chars in x.
The character properties (\p{property} and \P{property}) and the extended Unicode combining sequence \X are not supported.
POSIX character class syntax [:class:] within character classes is not supported.
Capturing and alternation order may be inconsistent with Perl. This is due to a bug in Perl, and it appears only when a very tricky situation is combined with optimization. See http://www.weitz.de/cl-ppcre/ for the details.
In principle, ACL's regexp library uses the same mechanism as Perl and CL-PPCRE: a nondeterministic finite automaton (NFA). When there are multiple possibilities for matching, it "remembers" the current state and tries each possibility one at a time. If a trial fails, it backs up to the last saved state and tries the next possibility; that is called a "backtrack". It is possible to compose a very short regular expression that does a huge number of backtracks; if you have nested repetitions and alternations, the number of required backtracks grows exponentially.
The regexp optimizer tries to reduce the number of backtracks, but it is not always possible. Here are some tips to improve the performance of the matcher.
If applicable, use a standalone group (?>X) in the inner loop. It effectively turns the nested repetition into an unnested repetition from the viewpoint of the NFA. Lots of optimizations can be done for an unnested repetition, but not much can be done for a nested one.
If the input string is very long, it is a good idea to split it up with a non-backtracking regexp first, then apply a more complicated regexp to the small chunks. A non-backtracking NFA runs in constant space and in time linear in the input length. A single-nested backtracking NFA runs in linear space and in time quadratic in the input length.
Note that an unnested greedy repetition followed by a character or a character set that is disjoint from the characters that can begin the repetition becomes a non-backtracking regexp; for example, the regular expression (\w+)\s+ runs without backtracking.
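The first tip can be sketched as follows. The patterns, variable names, and input lengths are illustrative assumptions, and no timings are claimed; the point is only that the standalone form accepts and rejects the same inputs here while leaving the inner loop nothing to backtrack into.

```lisp
(eval-when (:compile-toplevel :load-toplevel :execute)
  (require :regexp2))

;; Nested repetition: on a failing input the matcher may retry many
;; ways of dividing the run of a's between the inner and outer loops.
(defparameter *nested* "(?:a+)+b")

;; The same pattern with the inner loop made standalone.
(defparameter *standalone* "(?>a+)+b")

;; A hypothetical failing input: thirty a's followed by c.
(defparameter *no-b*
  (concatenate 'string (make-string 30 :initial-element #\a) "c"))

(match-re *standalone* "aaaab")   ; match expected
(match-re *standalone* *no-b*)    ; no match expected

;; The non-backtracking form mentioned in the note above.
(match-re "(\\w+)\\s+" "hello   ")
```

The *nested* pattern is defined only for comparison and is deliberately not run against *no-b*.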
Copyright (c) Franz Inc. Lafayette, CA., USA. All rights reserved.