New Regular Expression Features in Tcl 8.1
Tcl 8.1 now handles advanced regular expressions (REs). Previous regular expression handling is almost unchanged, except that the clumsy handling of escapes like \n has been much improved, and a few escapes that were previously legal (but useless) now won't work.
Note that a few advanced features aren't useful yet but are ready for future Tcl releases. That's because Tcl 8.1 (apart from the regular expression engine) implements only the Unicode locale (where all characters sort in Unicode order, there are no multi-character collating elements and no equivalence classes).
This document gives an overview of the new regular expression features. For exact semantics and more details, see the new re_syntax(n) reference page. (The re_syntax(n) page was split from the 8.1 regexp(n) reference page, which used to cover RE syntax for all Tcl commands.) This howto document covers:
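1. Regular Expression Overview
2. Regular Expressions in Tcl 8.1
   Backslash Escapes (\xxx)
   Bounds ({})
   Character Classes ([: :])
   Collating Elements ([. .])
   Equivalence Classes ([= =])
   Noncapturing Subpatterns ((?:re))
   Lookahead Assertions ((?=re) and (?!re))
   Embedded Options ((?xyz)), Directors (***)
3. Summary: Regular Expression changes in Tcl 8.1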
Part One describes regular expressions (REs), explains REs from Tcl 8.0 and before, and describes the Tcl regexp and regsub commands. Part Two describes the new Tcl 8.1 REs.
A regular expression uses metacharacters (characters that assume special meaning for matching other characters) such as *, [], $, and the period (.). For example, the RE [Hh]ello!* would match Hello and hello and Hello! (and hello!!!!!). The RE [Hh](ello|i)!* would match Hello and Hi and Hi! (and so on). A backslash (\) disables the special meaning of the following character, so you could match the string [Hello] with the RE \[Hello\].
Regular expressions in Tcl 8.0 and before had the following metacharacters:
.    Matches any single character (m.d matches mad, mod, m3d, etc.)
[]   Bracket expression: matches any one of the enclosed characters ([a-z0-9_] matches a lowercase ASCII letter, a digit, or an underscore)
^    Start-of-string anchor (^hi matches hi and his but not this)
$    End-of-string anchor (hi$ matches hi and chi but not this)
*    Zero-or-more quantifier (M.*D matches MD, MAD, MooD, M.D, etc.)
?    Zero-or-one quantifier (hi!? matches hi or hi!)
+    One-or-more quantifier (hi!+ matches hi! or hi!! or hi!!! or ...)
|    Alternation (this|that matches this or that)
()   Subpattern: groups part of an RE (([0-9A-F][0-9A-F])+ matches groups of two hexadecimal digits: A9 or AB03 or 8A6E00, but not A or A2C). "Eat (this|that)!" matches "Eat this!" or "Eat that!".
\    Escape: disables the special meaning of the metacharacter after it (a\.* matches a or a. or a.. or etc.). Note that \ also has special meaning to the Tcl interpreter (and to applications, such as C compilers).
The syntax above is supported in Tcl 8.1. Tcl 8.1 also supports advanced regular expressions (AREs). These powerful expressions are introduced in more detail in Part Two. Briefly, though, AREs support backreferences, lookahead, non-greedy matching, many escapes, features that are useful for internationalization (handling collation elements, equivalence classes and character classes), and much more.
The Tcl 8.1 regular expression engine almost always interprets 8.0-style REs correctly. In the few cases that it doesn't, and when the problem is too difficult to fix, the 8.1 engine has an option to select 8.0 ("ERE") interpretation.
The regexp command compares a string to an RE. For example, to match the string in $line against the RE [Hh]ello!*, you would write:
regexp {[Hh]ello!*} $line match
If part or all of the line variable matches the RE, regexp stores the matching part in the match variable and returns a value of 1.
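As a small illustration of the return value and the match variable, here is a sketch that runs the same RE against a made-up input line. The sample strings and the variable name span are assumptions chosen for this example; the -indices switch asks regexp for positions instead of text.

```tcl
set line {Well, hello!!! there}

# regexp returns 1 on a match and fills in the variable named "match".
if {[regexp {[Hh]ello!*} $line match]} {
    puts "matched text: $match"              ;# matched text: hello!!!
}

# With -indices, the variable receives the start and end positions instead.
regexp -indices {[Hh]ello!*} $line span
puts "character range: $span"                ;# character range: 6 13

# No match: regexp returns 0 and leaves the variable untouched.
puts [regexp {[Hh]ello!*} {Goodbye} match]   ;# 0
```

Because regexp is not anchored, it finds hello!!! in the middle of the line; add ^ and $ to the RE if the whole string has to match.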
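The regsub command substitutes text that matches an RE. For example, the following command edits the string in $in_line to replace each run of space or tab characters with a single space character; the edited line is stored in the out_line variable: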
regsub -all {[ \t]+} $in_line { } out_line
Please also read the following section about backslash processing.
The regular expression engine in Tcl 8.1 treats the \t in the previous example as a character-entry escape that stands for a tab character. We actually used the 8.1 syntax above; the example wouldn't have worked under 8.0! In Tcl 8.0 and before, you had to surround the regular expression with double quotes so the Tcl backslash processor could convert the \t to a literal tab character. The square brackets then had to be protected from Tcl's command substitution by adding backslashes before them, which made code harder to read and possibly more error-prone.
Here's the previous example rewritten for Tcl 8.0 and before:
regsub -all "\[ \t\]+" $in_line { } out_line
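To compare the two quoting styles side by side, the sketch below runs both forms on the same input under Tcl 8.1 or later. The input string and the variable names out_new and out_old are assumptions made up for this example.

```tcl
# A space, a tab and a space between a and b; two tabs between b and c.
set in_line "a \t b\t\tc"

# Tcl 8.1 style: braces, so the RE engine itself interprets \t.
regsub -all {[ \t]+} $in_line { } out_new

# Tcl 8.0 style: double quotes, so Tcl converts \t to a tab first, and
# the brackets need backslashes to avoid command substitution.
regsub -all "\[ \t\]+" $in_line { } out_old

puts $out_new                             ;# a b c
puts [string compare $out_new $out_old]   ;# 0 (the results are identical)
```

Both commands hand the engine an RE that means "one or more spaces or tabs"; only the route the tab takes to get there differs.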
For more about the simplified 8.1 syntax, see the section Backslash Escapes.
Tcl 8.1 regular expressions are basically a superset of 8.0 REs. This howto document has an overview of the new features. Please see the re_syntax(n) reference page for exact semantics and more details.
A quantifier says how many times the item before it may match. For example, the quantifier * in the RE z* matches zero or more zs. By default, regular expression quantifiers are greedy: they match as much text as they can. Tcl 8.1 REs also have non-greedy quantifiers, which match the least text they can. To make a non-greedy quantifier, add a question mark (?) at the end.
Let's start by storing some HTML text in a variable, then using two regexp commands to match it. The first RE is greedy, and the second is non-greedy:
% set x {<EM>He</EM> sits, but <EM>she</EM> stands.}
<EM>He</EM> sits, but <EM>she</EM> stands.
% regexp {<EM>.*</EM>} $x match; set match
<EM>He</EM> sits, but <EM>she</EM>
% regexp {<EM>.*?</EM>} $x match; set match
<EM>He</EM>
The first RE <EM>.*</EM> is "greedy." It matches from the first <EM> to the last </EM>. The second RE <EM>.*?</EM>, with a question mark (?) after the * quantifier, is non-greedy: it matches as little text as possible after the first <EM>.
Could you write a greedy RE that works like the non-greedy version? It isn't easy! A greedy RE like <EM>[^<]*</EM> would do it in this case -- but it wouldn't work if there were other HTML tags (with a < character) between the pair of <EM> tags in the $x string.
Here are a new string and another pair of REs to match it:
% set y {123zzz456}
123zzz456
% regexp {3z*} $y match; set match
3zzz
% regexp {3z*?} $y match; set match
3
The greedy RE 3z* matches all the zs it can (three) under its "zero or more" rule. The non-greedy RE 3z*? matches just 3 because it matches the fewest zs it can under its "zero or more" rule.
To review, the greedy quantifiers from Tcl 8.0 are: *, +, and ?. So the non-greedy quantifiers (added in Tcl 8.1) are: *?, +?, and ??. Tcl 8.1 also has the new quantifiers {m}, {m,}, and {m,n}, as well as the non-greedy versions {m}?, {m,}?, and {m,n}?. The section on bounds explains them and has more examples of non-greedy matching.
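The other two non-greedy quantifiers behave the same way. As a quick sketch, here are + and ? next to +? and ??, reusing the string 123zzz456 from the earlier example:

```tcl
set y {123zzz456}

regexp {3z+}  $y match; puts $match   ;# 3zzz  (greedy: one or more, takes all three)
regexp {3z+?} $y match; puts $match   ;# 3z    (non-greedy: one is enough)
regexp {3z?}  $y match; puts $match   ;# 3z    (greedy: zero or one, takes one)
regexp {3z??} $y match; puts $match   ;# 3     (non-greedy: zero is enough)
```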
A backslash (\) disables the metacharacter after it. For example, a\* matches the character a followed by a literal asterisk (*) character. In Tcl 8.0 and before, it was legal to put a backslash before a non-metacharacter -- for instance, regexp {\p} matched the character p. (Note that regexp {\n} matched the character n, which was a source of confusion. To get a newline character into an RE before version 8.1, you had to write regexp "\n" so Tcl processing inside double quotes would convert the \n to a newline.)
The Tcl 8.1 regular expression engine interprets backslash escapes itself. So now regexp {\n} matches a newline, not the character n. REs are simpler to write in 8.1 because of this. (You can still write regexp "\n" -- and let Tcl conversion happen inside the double quotes -- so most old code will still work.)
One of the most important changes in 8.1 is that a backslash inside a bracket expression is treated as the start of an escape. In 8.0 and before, a backslash inside brackets was treated as a literal backslash character. For example, in 8.0 and before, regexp {[a\n]} would match the characters a, \, or n. But in 8.1, regexp {[a\n]} would match the characters a or newline (because \n is the backslash escape for "newline").
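The 8.1 behavior of the bracket expression [a\n] can be checked with a few one-liners. This is a sketch for Tcl 8.1 or later; the last line shows one way to still match a literal backslash inside brackets, by doubling it.

```tcl
# In Tcl 8.1 the bracket expression [a\n] means "a or newline".
puts [regexp {[a\n]} "\n"]    ;# 1: a newline matches
puts [regexp {[a\n]} {n}]     ;# 0: the letter n does not
puts [regexp {[a\n]} "\\"]    ;# 0: neither does a backslash

# To match a literal backslash inside brackets, write it twice.
puts [regexp {[a\\]} "\\"]    ;# 1
```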
Tcl 8.1 has also added many new backslash escapes. For instance, \d matches a digit. Some of these are listed below, and the re_syntax(n) reference page has the whole list.
In Tcl 8.1 regular expressions (but not in other parts of the language), it's illegal to use a backslash before a non-metacharacter unless it makes a valid escape. So regexp {\p} is now an error. If you have code that (for some bizarre reason) has regular expressions with a backslash before a non-metacharacter, like regexp {\p}, you'll need to fix it.
As explained above, the Tcl 8.1 regular expression engine now interprets backslash sequences like \n to mean "newline". It also has four new kinds of escapes: character entry escapes, class shorthand escapes, constraint escapes, and back references. Here's an introduction. (The re_syntax(n) page has full details.)
Character entry escapes: \n represents a newline character. \uwxyz (where wxyz is hexadecimal) represents the Unicode character U+wxyz.
Class shorthand escapes: \d stands for [[:digit:]], which means "any single digit."
Constraint escapes: \m matches only at the start of a word -- so the RE \mhi will match the third word in the string he said hi but won't match he said thigh.
Back references: (X.*Y)\1 matches any doubled string that starts with X and ends with Y, such as XYXY, XabcYXabcY, X--YX--Y, etc.
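Here is one short sketch for each of the four kinds of escape. The test strings (room 101, mañana) and the anchors ^ and $ around the back-reference RE are assumptions added for this example, so that only an exact doubled string passes.

```tcl
# Character entry escape: \u00f1 is the Unicode character U+00F1 (ñ).
puts [regexp {^ma\u00f1ana$} "ma\u00f1ana"]   ;# 1

# Class shorthand escape: \d is short for [[:digit:]].
regexp {\d+} {room 101} digits
puts $digits                                  ;# 101

# Constraint escape: \m matches only at the start of a word.
puts [regexp {\mhi} {he said hi}]             ;# 1
puts [regexp {\mhi} {he said thigh}]          ;# 0

# Back reference: \1 must repeat whatever the first group matched.
puts [regexp {^(X.*Y)\1$} {XabcYXabcY}]       ;# 1
puts [regexp {^(X.*Y)\1$} {XabcYXdefY}]       ;# 0
```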
Finally, remember that (as in Tcl 8.0 and before) some applications, such as C compilers, interpret these backslash sequences themselves before the regular expression engine sees them. You may need to double (or quadruple, etc.) the number of backslashes for these applications. Still, in straight Tcl 8.1 code, writing backslash escapes is now both simpler and more powerful than in 8.0 and before.
Tcl 8.0 had three quantifiers: *, +, and ?. They specify "how many" (respectively, zero or more, one or more, and zero or one). Tcl 8.1 added new quantifiers that let you choose exactly how many matches: the bounds operators, {}. These operators come in three greedy forms: {m}, {m,}, and {m,n}. The corresponding non-greedy forms are {m}?, {m,}?, and {m,n}?.
The {m} quantifier matches exactly m occurrences. So does {m}?. For example, either #{70} or #{70}? matches a string of exactly 70 # characters.
The {m,} quantifier matches at least m occurrences.
Here's a demo of the greedy and non-greedy versions:
% set x {a##b#######c}
a##b#######c
% regexp {#{4,}} $x match; set match
#######
% regexp {#{4,}?} $x match; set match
####
Notice that the first two number signs (##) in the string are never matched because there aren't at least four of them.
The {m,n} quantifier matches at least m but no more than n occurrences. For example, the RE http://([^/]+/?){1,3} would match Web URLs that have 3 components (like http://xyz.fr/euro/billets.htm), 2 components (like http://xyz.fr/euro/), or just 1 component (like http://xyz.fr). The RE matches a final slash (/) if there is one. As always, a greedy match will match as long a string as possible: it would try for 3 matches. A non-greedy quantifier would try to match the least (1 match). But be careful: http://([^/]+/?){1,3}? won't match all the way to a possible slash because it matches the fewest characters possible! (With input http://xyz.fr/, that RE would match just http://x.)
This brings up one of the many subtleties in these advanced regular expressions: the outer non-greedy quantifier overrides the inner greedy quantifiers and makes all quantifiers non-greedy! There's an explanation in the Matching section of the re_syntax(n) reference page.
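As a further sketch of bounds, the hexadecimal-digit RE from Part One can be tightened with {2} and {1,3}. The anchors, the sample strings and the digit string below are assumptions chosen for this example.

```tcl
# {2} demands exactly two hex digits per group; {1,3} allows one to
# three such groups. The anchors make the whole string take part.
set re {^([0-9A-F]{2}){1,3}$}
foreach s {A9 AB03 8A6E00 A2C 8A6E0011} {
    puts "$s -> [regexp $re $s]"
}
# Prints 1 for A9, AB03 and 8A6E00; 0 for A2C (odd length) and
# 8A6E0011 (four groups).

# Greedy and non-greedy {m,n} on the same input.
set digits {1234567}
regexp {\d{2,4}}  $digits match; puts $match   ;# 1234
regexp {\d{2,4}?} $digits match; puts $match   ;# 12
```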
A character class is a name for a set of characters. For example, the class named punct stands for the "punctuation" characters. A character class is always written as part of a bracket expression, which is a list of characters enclosed in []. For instance, the character class named digit stands for any of the digits 0-9 (zero through nine). The character class is written with the class name inside a set of brackets and colons, like this: [[:digit:]]. The old familiar expression for digits is written as a range: [0-9]. When you compare the new character class to the old range version, you can see that the outer square brackets are the same in both. So a character class is written [:classname:].
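The Tcl 8.1 character classes are alpha, upper, lower, digit, xdigit, alnum, print, blank, space, punct, graph, and cntrl; the re_syntax(n) reference page describes each one.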
You can use more than one character class in a bracket expression. You can also mix character classes with ranges and single characters. For instance, [[:digit:]a-cx-z] would match a digit (0-9), a, b, c, x, y, or z -- and [^[:digit:]a-cx-z] would match any character except those. This syntax can take some time to get familiar with! The key is to look for the character class (here, [:digit:]) inside the bracket expression.
The advantage of character classes (like [:alpha:]) over explicit ranges in brackets (like [a-z]) is that character classes include characters that aren't easy to type on ASCII keyboards. For example, the Spanish language includes the character ñ. It doesn't fall into the range [a-z], but it is in the Tcl 8.1 character class [:alpha:]. In the same way, the Spanish punctuation character ¡ isn't in a list of punctuation characters like [.!?,], but it is part of [:punct:].
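The difference between ranges and character classes shows up as soon as the input leaves ASCII. In this sketch the word mañana and the variable name word are assumptions chosen for the example; \u00f1 and \u00a1 are the Tcl escapes for ñ and ¡.

```tcl
set word "ma\u00f1ana"     ;# mañana

# [a-z] stops at the ñ; [[:alpha:]] does not.
regexp {^[a-z]+} $word match
puts [string length $match]            ;# 2
regexp {^[[:alpha:]]+} $word match
puts [string length $match]            ;# 6

# ¡ (U+00A1) is punctuation, but it is not in the list [.!?,].
puts [regexp {[.!?,]} "\u00a1"]        ;# 0
puts [regexp {[[:punct:]]} "\u00a1"]   ;# 1
```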
Tcl 8.1 has a standard set of character classes that are defined in the source code file generic/regc_locale.c. Tcl 8.1 has one locale defined: the Unicode locale. It may support other locales (and other character classes) in the future.
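A collating symbol names a collating element. For example, the collating symbol for the character # is [.number-sign.]. Collating symbols must be written in a bracket expression (inside []). So [[.number-sign.]] will match the character #, as you can see here: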
% regexp {[[.number-sign.]]+} {123###456} match
1
% set match
###
Tcl 8.1 has a standard set of collating symbols that are defined in the source code file generic/regc_locale.c.
Note: Tcl 8.1 does not implement multi-character collating elements like ch (which is the fourth character in the Spanish alphabet a, b, c, ch, d, e, f, g, h, i...). So the examples below are not supported in Tcl 8.1, but are here for completeness. (Future versions of Tcl may have multi-character collating elements.)
Suppose ch and c sort next to each other in your dialect, and ch is treated as an atomic character. The example bracket expression below uses two collating symbols. It matches one or more of ch and c. But it doesn't match an h standing alone:
% set input "cchchh"
cchchh
% regexp {[[.ch.][.c.]]+} $input match; set match
cchch
Here's one tricky and surprising thing about collating symbols. A caret at the start of a bracket expression ([^...) means that, in a locale with multi-character collating elements, the bracket expression can match more than one character. For instance, the RE in the example below matches any collating element other than c, followed by the character b. So the expression matches all of chb:
% set input chb
% regexp {[^[.c.]]b} $input match; set match
chb
Again, the two previous examples are not supported in Tcl 8.1,
but are here for completeness.
An equivalence class is written as part of a bracket expression, like [[=c=]]. It's any collating element that has the same relative order in the collating sequence as c.
Note: Tcl 8.1 only implements the Unicode locale. It doesn't define any equivalence classes. So, although the Tcl regular expression engine supports equivalence classes, the examples below are not supported in Tcl 8.1. (Future versions of Tcl may define equivalence classes.)
Let's imagine that both of the characters A and a fall at the same place in the collating sequence; they belong to the same equivalence class. In that case, both of the bracket expressions [[=A=]b] and [[=a=]b] are equivalent to writing [Aab]. As another example, if o and ô are members of an equivalence class, then all of the bracket expressions [[=o=]], [[=ô=]], and [oô] match those same two characters.
Parentheses group part of an RE so that quantifiers (like * or +) apply to the parenthesized part. For instance, the RE Oh,( no!)+ would match Oh, no! as well as Oh, no! no! and so on. The other reason to use parentheses is that they capture the matched text. Captured text is used in back references, in "matching" variables in the regexp command, as well as in the regsub command.
If you don't want parentheses to capture text, add ?: after the opening parenthesis. For instance, in the example below, the subexpression (?:http|ftp) matches either http or ftp but doesn't capture it. So the back reference \1 will hold the end of the URL (from the second set of parentheses):
% set x http://www.activestate.com
http://www.activestate.com
% regsub {(?:http|ftp)://(.*)} $x {The hostname is \1} answer
1
% set answer
The hostname is www.activestate.com
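The effect on the regexp "matching" variables can be seen by running the capturing and noncapturing forms on the same URL. This is a sketch; the variable names scheme and host are assumptions chosen for the example.

```tcl
set x http://www.activestate.com

# Both groups capture: scheme gets group 1 and host gets group 2.
regexp {(http|ftp)://(.*)} $x match scheme host
puts "$scheme | $host"        ;# http | www.activestate.com

# (?:...) groups without capturing, so the host is now group 1.
regexp {(?:http|ftp)://(.*)} $x match host
puts $host                    ;# www.activestate.com
```

Noncapturing groups keep the numbering of the groups you do care about stable when you add alternation or quantifiers around other parts of the RE.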
A positive lookahead has the form (?=re). It matches at any point where a substring matching re begins. A negative lookahead has the form (?!re). It matches at any point where no substring matching re begins. In both cases the text that the lookahead examines does not become part of the match.
Let's see some examples:
% set x http://www.activestate.com
http://www.activestate.com
% regexp {^[^:]+(?=.*\.com$)} $x match
1
% set match
http
% regexp {^[^:]+(?=.*\.edu$)} $x match
0
% regexp {^[^:]*(?!.*\.edu$)} $x match
1
% set match
http
The regular expressions above may seem complicated, but they're really not bad! Find the lookahead expression in the first regexp command above; it starts with (?= and ends at the corresponding parenthesis. The "guts" of this lookahead expression is .*\.com$, which stands for "a string that ends with .com". So the first regexp command above matches the non-colon (:) characters at the start of the string, as long as the rest of the string ends with .com. The second regexp is similar but looks for a string ending with .edu. Because regexp returns 0, you can see that this doesn't match. The third regexp looks for a string not ending with .edu. It matches because $x ends with .com.
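Lookahead is also handy in the middle of a string. The sketch below picks numbers out of a made-up style string according to the unit that follows them; the string and the variable name css are assumptions for this example.

```tcl
set css {width: 120px; height: 80em}

# Positive lookahead: digits that are followed by "px". The lookahead
# checks the text ahead but does not add it to the match.
regexp {\d+(?=px)} $css match
puts $match        ;# 120

# The same idea with a different unit finds the other number.
regexp {\d+(?=em)} $css match
puts $match        ;# 80

# Negative lookahead at the start: succeed only if the string does
# not end in .edu.
puts [regexp {^(?!.*\.edu$)} {http://www.activestate.com}]   ;# 1
puts [regexp {^(?!.*\.edu$)} {http://www.example.edu}]       ;# 0
```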
Tcl 8.1 lets you document complex regular expressions by embedding comments. See the next section.
Complex REs can be difficult to document. The -expanded switch sets expanded syntax, which lets you add comments within a regular expression. Comments start with a # character; whitespace is ignored. This is mostly for scripting -- but you can also use it on a command line, as we'll do in the example below. Let's look at the same RE twice: first in the standard compact syntax, and second in expanded syntax:
% set x http://www.activestate.com
http://www.activestate.com
% regexp {^[^:]+(?=.*\.com$)} $x match
1
% set match
http
% regexp -expanded {
^ # beginning of string
[^:]+ # all characters to the first colon
(?= # begin positive lookahead
.*\.com$ # for a trailing .com
) # end positive lookahead
} $x match
1
% set match
http
In expanded syntax, you can use space and tab characters to indent and make your code clear. To match actual space and tab characters in your RE, use the escapes \s (which matches any white-space character) and \t (which matches a tab).
The other important new switch we'll cover here is -line. It enables newline-sensitive matching. By default (without -line), Tcl regular expressions have always treated a newline as an ordinary character. For example, if a string contains several lines (separated by newline characters), the end-of-string anchor $ wouldn't match at any of the embedded newlines. To write code that matched line-by-line, you had to read input lines one by one and do separate matches against each line.
With the -line switch, the metacharacters ^, $, ., and [] treat a newline as the end of a "line." So, for example, the regular expression ^San Jose matches the second line of input below:
% set x {Dolores Sanchez
San Jose, CA}
Dolores Sanchez
San Jose, CA
% regexp {^San Jose} $x match
0
% regexp -line {^San Jose} $x match
1
% set match
San Jose
The -line switch actually enables two other switches. You can get part of the -line behavior by choosing one of these switches instead:
The -lineanchor switch makes ^ and $ match at the beginning and end of a line.
The -linestop switch makes . and [] stop matching at a newline character.
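To see the two halves of -line separately, the sketch below applies each switch on its own to the two-line address used above (written here with \n inside double quotes).

```tcl
set x "Dolores Sanchez\nSan Jose, CA"

# -lineanchor alone: ^ can match just after the embedded newline.
puts [regexp -lineanchor {^San Jose} $x]   ;# 1

# Without -linestop, .* runs across the newline to the end of the string.
regexp {Dolores.*} $x match
puts [string length $match]                ;# 28

# With -linestop, . stops at the newline.
regexp -linestop {Dolores.*} $x match
puts $match                                ;# Dolores Sanchez
```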
An 8.1 RE can start with embedded options. These look like (?xyz), where xyz are one or more option letters. For instance, (?i)ouch matches OUCH because i is the "case-insensitive matching" option. Other options include (?e), which marks the rest of the regular expression as an 8.0-style RE -- to let you avoid confusion with the new 8.1 syntax.
An RE can also start with three asterisks, which is a director. For example, ***= is the director that says the rest of the regular expression is literal text. So the RE ***=(?i)ouch matches exactly (?i)ouch; the (?i) isn't treated as an option.
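The difference between an embedded option and the ***= director can be checked directly. This is a sketch; the test strings are assumptions made up for the example.

```tcl
# (?i) at the start of the RE turns on case-insensitive matching.
puts [regexp {(?i)ouch} {OUCH!}]              ;# 1
puts [regexp {ouch} {OUCH!}]                  ;# 0

# ***= makes the rest of the RE literal text, so (?i) loses its meaning.
puts [regexp {***=(?i)ouch} {OUCH!}]          ;# 0
puts [regexp {***=(?i)ouch} {say (?i)ouch}]   ;# 1

# The same literal treatment applies to other metacharacters.
puts [regexp {***=a.b} {axb}]                 ;# 0
puts [regexp {***=a.b} {a.b}]                 ;# 1
```

The ***= director is a convenient way to search for a string that comes from user input without escaping its metacharacters by hand.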
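The table below summarizes the new syntax:
{m}  {m,}  {m,n}       Bounds: exactly m, at least m, or from m to n occurrences
*?  +?  ??  {m,n}?     Non-greedy quantifiers
\xxx                   Backslash escapes (such as \n, \uwxyz, \d, \m, and the back reference \1)
[: :]                  Character class (such as [:digit:])
[. .]                  Collating symbol (such as [.number-sign.])
[= =]                  Equivalence class (such as [=c=])
(?:re)                 Noncapturing subpattern
(?=re)  (?!re)         Positive and negative lookahead
(?xyz)                 Embedded options (such as (?i))
***                    Director (such as ***=)
Some of the new switches for regexp and regsub are:
-expanded              Enables expanded syntax, which allows comments.
-line                  Enables newline-sensitive matching.
-linestop              [] and . stop at newlines.
-lineanchor            ^ and $ match the start and end of a line.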