Using Regular Expressions in Java

Java 4 (JDK 1.4) and later have comprehensive support for regular expressions through the standard java.util.regex package. Java 5 fixes some bugs and adds support for Unicode blocks. Java 6 fixes a few more bugs but doesn’t add any features. Java 7 adds named capture, Unicode scripts, and some binary Unicode properties. Java 8 changes \v to be a shorthand matching all vertical whitespace instead of just the vertical tab, and also adds \h and \R. Java 9 adds graphemes. Java 13 allows infinite quantifiers inside lookbehind. Java 19 makes word boundaries ASCII-only unless Pattern.UNICODE_CHARACTER_CLASS is set, for consistency with \w.

Quick Regex Methods of The String Class

The Java String class has several methods that allow you to perform an operation using a regular expression on that string in a minimal amount of code. The downside is that you cannot specify options such as "case insensitive" or "dot matches newline". For performance reasons, you should also not use these methods if you will be using the same regular expression often.

myString.matches("regex") returns true or false depending whether the string can be matched entirely by the regular expression. It is important to remember that String.matches() only returns true if the entire string can be matched. In other words: "regex" is applied as if you had written "^regex$" with start and end of string anchors. This is different from most other regex libraries, where the "quick match test" method returns true if the regex can be matched anywhere in the string. If myString is abc then myString.matches("bc") returns false. bc matches abc, but ^bc$ (which is really being used here) does not.

myString.replaceAll("regex", "replacement") replaces all regex matches inside the string with the replacement string you specified. No surprises here. All parts of the string that match the regex are replaced. You can use the contents of capturing parentheses in the replacement text via 1,ドル 2,ドル 3,ドル etc. 0ドル (dollar zero) inserts the entire regex match. 12ドル is replaced with the 12th backreference if it exists, or with the 1st backreference followed by the literal "2" if there are less than 12 backreferences. If there are 12 or more backreferences, it is not possible to insert the first backreference immediately followed by the literal "2" in the replacement text.

In the replacement text, a dollar sign not followed by a digit causes an IllegalArgumentException to be thrown. If there are less than 9 backreferences then a dollar sign followed by a digit greater than the number of backreferences throws an IndexOutOfBoundsException. So be careful if the replacement string is a user-specified string. To insert a dollar sign as literal text, use \$ in the replacement text. When coding the replacement text as a literal string in your source code, remember that the backslash itself must be escaped too: "\\$".

myString.split("regex") splits the string at each regex match. The method returns an array of strings where each element is a part of the original string between two regex matches. The matches themselves are not included in the array. Use myString.split("regex", n) to get an array containing at most n items. The result is that the string is split at most n-1 times. The last item in the string is the unsplit remainder of the original string.

Using The Pattern Class

In Java, you compile a regular expression by using the Pattern.compile() class factory. This factory returns an object of type Pattern. E.g.: Pattern myPattern = Pattern.compile("regex"); You can specify certain options as an optional second parameter. Pattern.compile("regex", Pattern.CASE_INSENSITIVE | Pattern.DOTALL | Pattern.MULTILINE) makes the regex case insensitive for US ASCII characters, causes the dot to match line breaks and causes the start and end of string anchors to match at embedded line breaks as well. When working with Unicode strings, specify Pattern.UNICODE_CASE if you want to make the regex case insensitive for all characters in all languages. You should always specify Pattern.CANON_EQ to ignore differences in Unicode encodings, unless you are sure your strings contain only US ASCII characters and you want to increase performance.

If you will be using the same regular expression often in your source code then you should create a Pattern object to increase performance. Creating a Pattern object also allows you to pass matching options as a second parameter to the Pattern.compile() class factory. If you use one of the String methods above then the only way to specify options is to embed mode modifier into the regex. Java supports the following modes:

Modifier Field Explanation
(?i)Pattern.CASE_INSENSITIVEASCII-only case folding without UNICODE_CASE.
(?m)Pattern.MULTILINE Anchors match at line breaks.
(?s)Pattern.DOTALL Dot matches line breaks.
(?d)Pattern.UNIX_LINES Anchors and dot treat only the line feed as a line break.
(?u)Pattern.UNICODE_CASE Simple Unicode case folding if the regex is case insensitive.
(?U)Pattern.UNICODE_CHARACTER_CLASS Shorthand and POSIX-style classes include Unicode characters. Also affects \b in Java 19 and later.
n/a Pattern.CANON_EQ Canonical equivalence.

Use myPattern.split("subject") to split the subject string using the compiled regular expression. This call has exactly the same results as myString.split("regex"). The difference is that the former is faster since the regex was already compiled.

Using The Matcher Class

Except for splitting a string (see previous paragraph), you need to create a Matcher object from the Pattern object. The Matcher will do the actual work. The advantage of having two separate classes is that you can create many Matcher objects from a single Pattern object, and thus apply the regular expression to many subject strings simultaneously.

To create a Matcher object, simply call Pattern.matcher() like this: myMatcher = Pattern.matcher("subject"). If you already created a Matcher object from the same pattern, call myMatcher.reset("newsubject") instead of creating a new matcher object, for reduced garbage and increased performance. Either way, myMatcher is now ready for duty.

To find the first match of the regex in the subject string, call myMatcher.find(). To find the next match, call myMatcher.find() again. When myMatcher.find() returns false, indicating there are no further matches, the next call to myMatcher.find() will find the first match again. The Matcher is automatically reset to the start of the string when find() fails.

The Matcher object holds the results of the last match. Call its methods start(), end() and group() to get details about the entire regex match and the matches between capturing parentheses. Each of these methods accepts a single int parameter indicating the number of the backreference. Omit the parameter to get information about the entire regex match. start() is the index of the first character in the match. end() is the index of the first character after the match. Both are relative to the start of the subject string. So the length of the match is end() - start(). group() returns the string matched by the regular expression or pair of capturing parentheses.

myMatcher.replaceAll("replacement") has exactly the same results as myString.replaceAll("regex", "replacement"). Again, the difference is speed.

The Matcher class allows you to do a search-and-replace and compute the replacement text for each regex match in your own code. You can do this with the appendReplacement() and appendTail() Here is how:

StringBuffer myStringBuffer = new StringBuffer();
myMatcher = myPattern.matcher("subject");
while (myMatcher.find()) {
 if (checkIfThisMatchShouldBeReplaced()) {
 myMatcher.appendReplacement(myStringBuffer, computeReplacementString());
 }
}
myMatcher.appendTail(myStringBuffer);

Obviously, checkIfThisMatchShouldBeReplaced() and computeReplacementString() are placeholders for methods that you supply. The first returns true or false indicating if a replacement should be made at all. Note that skipping replacements is way faster than replacing a match with exactly the same text as was matched. computeReplacementString() returns the actual replacement string.

Regular Expressions, Literal Strings and Backslashes

In literal Java strings the backslash is an escape character. The literal string "\\" is a single backslash. In regular expressions, the backslash is also an escape character. The regular expression \\ matches a single backslash. This regular expression as a Java string, becomes "\\\\". That’s right: 4 backslashes to match a single one.

The regex \w matches a word character. As a Java string, this is written as "\\w".

The same backslash-mess occurs when providing replacement strings for methods like String.replaceAll() as literal Java strings in your Java code. In the replacement text, a dollar sign must be encoded as \$ and a backslash as \\ when you want to replace the regex match with an actual dollar sign or backslash. However, backslashes must also be escaped in literal Java strings. So a single dollar sign in the replacement text becomes "\\$" when written as a literal Java string. The single backslash becomes "\\\\". Right again: 4 backslashes to insert a single one.

| Quick Start | Tutorial | Search & Replace | Tools & Languages | Examples | Reference |

| grep | PowerGREP | RegexBuddy | RegexMagic |

| EditPad Lite | EditPad Pro | Google Docs | Google Sheets | LibreOffice | Notepad++ |

| Boost | C# | Delphi | F# | GNU (Linux) | Groovy | ICU (Unicode) | Java | JavaScript | .NET | PCRE (C/C++) | PCRE2 (C/C++) | Perl | PHP | POSIX | PowerShell | Python | Python.NET and IronPython | R | RE2 | Ruby | std::regex | Tcl | TypeScript | VBScript | Visual Basic 6 | Visual Basic (.NET) | wxWidgets | XML Schema | XQuery & XPath | Xojo | XRegExp |

| Google BigQuery | MySQL | Oracle | PostgreSQL |

AltStyle によって変換されたページ (->オリジナル) /