Using Regular Expressions with PHP

PHP is an open source language for producing dynamic web pages. It has a long history. Its regular expression support has evolved over time.

When writing new PHP code that needs to use regular expressions, you should use the regex functions that have names starting with preg. When the regex tutorial and regex reference on this website talk about PHP specifically they assume you’re using the preg functions. They have been included with PHP since PHP 4.2.0 was released in April 2002. The tutorial and reference cover PHP 5.0.0 (July 2004) and later.

If you inherited PHP code then you may encounter ereg and mb_ereg function calls. The ereg functions were deprecated in PHP 5.3.0 and removed in 7.0.0. The mb_ereg functions are still available in PHP 8. But there is really no point in using them. They’re designed to work with legacy multi-byte code pages. Modern PHP code should use UTF-8 and the preg functions. We discuss the ereg and mb_ereg functions at the bottom of this page so you can understand old code and migrate it to the preg functions.

Formatting Regexes as PHP preg Strings

All of the preg functions require you to specify the regular expression as a string using Perl syntax. In Perl, /regex/ defines a regular expression. In PHP, this becomes preg_match('/regex/', $subject). When forward slashes are used as the regex delimiter, any forward slashes in the regular expression have to be escaped with a backslash. So https://www\.regexp\.info/ becomes '/https:\/\/www\.regexp\.info\//'. Just like Perl, the preg functions allow any non-alphanumeric character, other than whitespace and the backslash, as the regex delimiter. The URL regex would be more readable as '%https://www\.regexp\.info/%' using percent signs as the regex delimiters, since then you don’t need to escape the forward slashes. You would have to escape percent signs if the regex contained any.
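As a minimal sketch of the delimiter choice, the snippet below runs the same URL regex with both delimiters. The subject string and the variable names are made up for this illustration.

```php
<?php
// Illustrative subject string; not taken from the text above.
$subject = 'Visit https://www.regexp.info/ for the tutorial.';

// Forward slashes as delimiters: every slash inside the regex is escaped.
$withSlashes = '/https:\/\/www\.regexp\.info\//';

// Percent signs as delimiters: the slashes need no escaping.
$withPercent = '%https://www\.regexp\.info/%';

$a = preg_match($withSlashes, $subject);
$b = preg_match($withPercent, $subject);
var_dump($a, $b); // both calls find the URL
```

Both patterns describe the same regex. Only the amount of escaping differs.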

Unlike programming languages like C# or Java, PHP does not require all backslashes in strings to be escaped. If you want to include a backslash as a literal character in a PHP string then you only need to escape it if it is followed by another character that needs to be escaped. In single-quoted strings, only the single quote and the backslash itself need to be escaped. That is why in the above regex, I didn’t have to double the backslashes in front of the literal dots. The regex \\ to match a single backslash would become '/\\\\/' as a PHP preg string. Unless you want to use variable interpolation in your regular expression, you should always use single-quoted strings for regular expressions in PHP, to avoid messy duplication of backslashes.
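The short sketch below shows what the regex engine actually receives from a single-quoted string. The subject string is an assumption chosen for the demonstration.

```php
<?php
// The regex \\ (match one literal backslash) as a single-quoted PHP string.
$pattern = '/\\\\/';

// PHP collapses each \\ pair in the literal, so the string holds 4 characters: / \ \ /
$length = strlen($pattern);

// A backslash in front of a dot does not need doubling: both literals are the same string.
$sameString = ('/\./' === '/\\./');

// Illustrative subject: the 7-character string C:\temp
$subject = 'C:\\temp';
$count = preg_match_all($pattern, $subject);

var_dump($length, $sameString, $count);
```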

Regex matching options such as case insensitivity are specified in the same way as in Perl. '/regex/i' applies the regex case insensitively. '/regex/s' makes the dot match all characters, including line breaks. '/regex/m' makes the start and end of line anchors match at embedded newlines in the subject string. '/regex/x' turns on free-spacing mode. You can specify multiple letters to turn on several options. '/regex/misx' turns on all four options.
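To make the four options concrete, here is a small sketch that runs the same two-line subject against patterns with and without each flag. The subject string and patterns are invented for this example.

```php
<?php
// Illustrative two-line subject.
$subject = "First line\nsecond LINE";

// /i: "line" also matches "LINE".
$i = preg_match('/line$/i', $subject);
$noI = preg_match('/line$/', $subject);

// /m: ^ also matches right after the embedded newline.
$m = preg_match('/^second/m', $subject);
$noM = preg_match('/^second/', $subject);

// /s: the dot also matches the newline between the two lines.
$s = preg_match('/line.second/s', $subject);
$noS = preg_match('/line.second/', $subject);

// /x: whitespace and # comments inside the pattern are ignored.
$x = preg_match('/ first \s line   # two words /xi', $subject);

var_dump($i, $noI, $m, $noM, $s, $noS, $x);
```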

A special option is /u, which turns on the Unicode matching mode instead of the default 8-bit matching mode. You should specify /u for regular expressions that use \x{FFFF}, \X or \p{L} to match Unicode characters, graphemes, or properties. With /u, PHP treats both the regex and the subject string as UTF-8 rather than as strings of 8-bit characters.
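The difference between 8-bit mode and /u can be sketched as follows. The subject is written with explicit byte escapes so the example does not depend on how the source file is saved. The variable names are illustrative.

```php
<?php
// "café" encoded as UTF-8: the é is the two bytes C3 A9.
$subject = "caf\xC3\xA9";

// 8-bit mode: the dot matches a single byte.
preg_match('/caf(.)/', $subject, $bytes);

// Unicode mode: the dot matches a whole code point.
preg_match('/caf(.)/u', $subject, $chars);

// Unicode properties such as \p{L} should be used together with /u.
$allLetters = preg_match('/^\p{L}+$/u', $subject);

var_dump(strlen($bytes[1]), strlen($chars[1]), $allLetters);
```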

The preg Function Set

Though the preg functions pretend to be like Perl with their requirement to format the regex like a Perl operator, these functions are actually based on PCRE or PCRE2, which has no such requirement. The preg functions convert your pretend-Perl-operator back into a regular string which is then fed into the underlying PCRE library. PHP 4.2 to PHP 7.2 were based on various versions of PCRE. PHP 7.3 switched to PCRE2. This only affects the regular expression syntax. The tutorial and reference mention PHP versions specifically, so you don’t need to worry about which version of PCRE or PCRE2 you’re using. You only need to know your version of PHP.

int preg_match (string pattern, string subject [, array groups]) returns 1 if the regular expression pattern matches the subject string or part of the subject string, and 0 if it does not. Since 1 evaluates to True and 0 to False, you can use the return value directly in an if statement. If you specify the third parameter then preg_match stores the regex match in $groups[0], the substring matched by the first capturing group in $groups[1], the second in $groups[2], and so on. If the regex pattern uses named capture then you can access the groups by name with $groups['name'].
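A minimal sketch of preg_match with numbered and named groups follows. The date subject and the group name "year" are assumptions made for this example.

```php
<?php
// Illustrative subject string.
$subject = 'Released 2004-07-13';

// One named group (year) followed by two numbered groups.
$found = preg_match('/(?<year>\d{4})-(\d{2})-(\d{2})/', $subject, $groups);

if ($found) {
    echo $groups[0], "\n";      // the overall match
    echo $groups['year'], "\n"; // the named group, also available as $groups[1]
    echo $groups[2], "\n";      // the second capturing group
}
```

A named group still occupies its numbered slot, so the array holds the same text under both keys.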

int preg_match_all (string pattern, string subject, array matches, int flags) fills the array "matches" with all the matches of the regular expression pattern in the subject string. If you specify PREG_SET_ORDER as the flag, then $matches[0] is an array containing the match and backreferences of the first match, just like the $groups array filled by preg_match. $matches[1] holds the results for the second match, and so on. If you specify PREG_PATTERN_ORDER, then $matches[0] is an array with full consecutive regex matches, $matches[1] an array with the first backreference of all matches, $matches[2] an array with the second backreference of each match, etc.
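The two array layouts are easier to compare side by side. This is a small sketch with an invented subject and regex.

```php
<?php
// Illustrative subject and regex with two capturing groups.
$subject = 'a=1, b=2';
$regex = '/(\w)=(\d)/';

// PREG_SET_ORDER: one sub-array per match.
$total = preg_match_all($regex, $subject, $bySet, PREG_SET_ORDER);
// $bySet[0] is ['a=1', 'a', '1'] and $bySet[1] is ['b=2', 'b', '2']

// PREG_PATTERN_ORDER: one sub-array per group.
preg_match_all($regex, $subject, $byPattern, PREG_PATTERN_ORDER);
// $byPattern[0] is ['a=1', 'b=2'], $byPattern[1] is ['a', 'b'], $byPattern[2] is ['1', '2']

print_r($bySet);
print_r($byPattern);
```

PREG_SET_ORDER tends to be convenient when you loop over matches one at a time. PREG_PATTERN_ORDER is handy when you only need one group from every match.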

array preg_grep (string pattern, array subjects) returns an array that contains all the strings in the array "subjects" in which the regular expression pattern can find a match.
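As a quick sketch, preg_grep can filter a list of strings down to those containing a digit. The sample array is made up for this illustration.

```php
<?php
// Illustrative list of subject strings.
$subjects = ['apple', 'banana42', 'cherry', 'date7'];

// Keep only the strings in which the regex finds a match.
$withDigits = preg_grep('/\d/', $subjects);

// The returned array keeps the keys of the input array.
print_r($withDigits);

// Use array_values() if you want the result reindexed from zero.
$reindexed = array_values($withDigits);
```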

mixed preg_replace (mixed pattern, mixed replacement, mixed subject [, int limit]) returns a string with all matches of the regex pattern in the subject string replaced with the replacement string. At most limit replacements are made. All parameters, except limit, can be arrays instead of strings. In that case, preg_replace does its job multiple times, iterating over the elements in the arrays simultaneously. You can also use strings for some parameters, and arrays for others. Then the function will iterate over the arrays, and use the same strings for each iteration. Using arrays for the pattern and replacement allows you to perform a sequence of search-and-replace operations on a single subject string. Using an array for the subject allows you to perform the same search-and-replace operation on many subject strings.
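The string and array forms can be sketched as follows. The patterns and subject strings are invented for this example.

```php
<?php
// Arrays for pattern and replacement: the operations run in sequence on one subject.
// Step 1 collapses runs of whitespace; step 2 removes a leading or trailing space.
$messy = '  too   many   spaces  ';
$clean = preg_replace(['/\s+/', '/^ | $/'], [' ', ''], $messy);

// Array for the subject: the same operation runs on every element.
$masked = preg_replace('/\d/', '#', ['a1', 'b22']);

// The limit parameter caps the number of replacements per subject string.
$once = preg_replace('/\d/', '#', 'b22', 1);

var_dump($clean, $masked, $once);
```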

mixed preg_replace_callback (mixed pattern, callback replacement, mixed subject [, int limit]) works just like preg_replace, except that the second parameter takes a callback instead of a string or an array of strings. The callback function will be called for each match. The callback should accept a single parameter. This parameter will be an array of strings, with element 0 holding the overall regex match, and the other elements the text matched by capturing groups. This is the same array you’d get from preg_match. The callback function should return the text that the match should be replaced with. Return an empty string to delete the match. Return $groups[0] to leave the match unchanged.

Callbacks allow you to do powerful search-and-replace operations that you cannot do with regular expressions alone. E.g. if you search for the regex (\d+)\+(\d+), you can replace 2+3 with 5 using the callback:

function regexadd($groups) {
    return $groups[1] + $groups[2];
}
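To show the callback at work, the sketch below repeats the function so the snippet runs on its own and passes it to preg_replace_callback by name. The explicit casts and the subject string are additions for this illustration.

```php
<?php
// Same idea as regexadd above; the casts make the conversions explicit.
function regexadd($groups) {
    return (string) ((int) $groups[1] + (int) $groups[2]);
}

// Illustrative subject containing two sums.
$subject = 'Compute 2+3 and 10+20.';

// The callback is called once for each match of the regex.
$result = preg_replace_callback('/(\d+)\+(\d+)/', 'regexadd', $subject);

echo $result, "\n";
```

An anonymous function would work just as well as the second argument in current PHP versions.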

array preg_split (string pattern, string subject [, int limit]) works just like split, which is described with the ereg functions below, except that it uses the Perl syntax for the regex pattern.
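A brief sketch of preg_split, with and without a limit, using a made-up comma-separated subject:

```php
<?php
// Illustrative subject: comma-separated items with inconsistent spacing.
$subject = 'one, two ,three';

// Split at each comma, swallowing the whitespace around it.
$parts = preg_split('/\s*,\s*/', $subject);

// With a limit of 2, only the first separator is used.
// The last element holds the unsplit remainder.
$limited = preg_split('/\s*,\s*/', $subject, 2);

print_r($parts);
print_r($limited);
```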

Replacement Text Syntax

PCRE does not have any search-and-replace features. PHP invented its own replacement string syntax for the preg_replace() function. PCRE2 does have search-and-replace features. But PHP continues to use its own replacement string syntax after migrating to PCRE2. This syntax has never changed. It does not even support named backreferences because at the time of PHP 4.2.0, PCRE did not support named capturing groups.

PHP’s replacement string syntax is very simple. \0 and $0 insert the whole regex match. \1 through \99 and $1 through $99 insert the text matched by capturing groups 1 through 99. Backreferences to non-existent and non-participating groups are replaced with nothing.

It’s best to use the syntax with dollar signs if you add the replacement as a string literal to your code. In a double-quoted string, \1 is an octal escape. So the string "\1" passes the control character 0x01 as the replacement text. Use the string "\\1" to pass the backreference \1. You can avoid this mess by using single-quoted strings or the replacement syntax with dollar signs. The strings '\1', '$1', and "$1" are all interpreted as a backreference to the first capturing group. Single-quoted strings do not support escapes other than the single quote and the backslash itself. Variable interpolation does not affect dollar signs followed by numbers. $1 is not a magic variable in PHP as it is in Perl.

If you want to insert \1 or $1 as literal text in your replacement then you can escape the backslash or dollar sign with a backslash as in \\1 or \$1. Any backslash not followed by a digit, dollar sign, or other backslash is a literal. A dollar sign not followed by a digit is a literal.

PHP string literals also require backslashes to be escaped. So \\1 needs to be coded as '\\\\1' or "\\\\1" for preg_replace() to see the escaped backslash. \$1 can be coded as '\$1' because the backslash does not escape the dollar in single-quoted strings. But it needs to be coded as "\\$1" to make sure preg_replace() sees the backslash if you prefer double-quoted strings.
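The quoting rules above can be checked with a short sketch. The subject, regex, and variable names are invented for this example.

```php
<?php
// Illustrative subject and regex with two capturing groups.
$subject = 'John Smith';
$regex = '/(\w+) (\w+)/';

// Backreferences with dollar signs and with backslashes.
$dollar = preg_replace($regex, '$2, $1', $subject);
$single = preg_replace($regex, '\2, \1', $subject);   // single quotes keep the backslashes
$double = preg_replace($regex, "\\2, \\1", $subject); // double quotes need them doubled

// In a double-quoted string, \1 is an octal escape for the control character 0x01.
$isControlChar = ("\1" === chr(1));

// The whole match via $0.
$whole = preg_replace($regex, '[$0]', $subject);

// Literal text instead of backreferences.
$literalDollar = preg_replace($regex, '\$1', $subject);      // inserts a dollar sign and a 1
$literalBackslash = preg_replace($regex, '\\\\1', $subject); // inserts a backslash and a 1

var_dump($dollar, $single, $double, $whole, $literalDollar, $literalBackslash);
```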

The ereg Function Set

If you inherit an old website still running on PHP 5 or prior then it may still be using regex functions starting with ereg. Those implemented POSIX Extended Regular Expressions, like the traditional UNIX egrep command. They were maintained mainly for backward compatibility with PHP 3 after the preg functions were added. Because they were easy to use, many developers continued to use them for simple regex needs even after they were deprecated in PHP 5.3.0. Many of the more modern regex features such as lazy quantifiers, lookaround and Unicode are not supported by the ereg functions. Don’t let the "extended" moniker fool you. The POSIX standard was defined in 1986. Regular expressions have come a long way since then. PHP 7.0.0 removed the ereg functions. The ereg functions treat every byte as a single character. They do not work correctly on strings that use UTF-8 or a multi-byte encoding.

The ereg functions require you to specify the regular expression as a string, without any Perl-style decorations. You cannot add any flags to specify matching modes. ereg('regex', "subject") checks if regex matches subject. You should use single quotes when passing a regular expression as a literal string to minimize the number of backslashes that need to be escaped.

int ereg (string pattern, string subject [, array groups]) returns the length of the match if the regular expression pattern matches the subject string or part of the subject string, or zero otherwise. Since zero evaluates to False and non-zero evaluates to True, you can use ereg in an if statement to test for a match. If you specify the third parameter then ereg stores the substring matched by the part of the regular expression between the first pair of parentheses in $groups[1], the second pair in $groups[2], and so on. ereg is case sensitive. eregi is the case insensitive equivalent.

string ereg_replace (string pattern, string replacement, string subject) replaces all matches of the regex pattern in the subject string with the replacement string. You can use backreferences in the replacement string. \\0 is the entire regex match, \\1 is the first backreference, \\2 the second, etc. The highest possible backreference is \\9. ereg_replace is case sensitive. eregi_replace is the case insensitive equivalent.

array split (string pattern, string subject [, int limit]) splits the subject string into an array of strings using the regular expression pattern. The array will contain the substrings between the regular expression matches. The text actually matched is discarded. If you specify a limit then the resulting array will contain at most that many substrings. The subject string will be split at most limit-1 times, and the last item in the array will contain the unsplit remainder of the subject string. split is case sensitive. spliti is the case insensitive equivalent.
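When migrating old code, simple ereg calls usually map onto preg calls by adding delimiters, and /i for the case insensitive variants. The sketch below shows possible preg equivalents for three hypothetical legacy calls, which appear only as comments because the ereg functions no longer exist in PHP 7 and later. It assumes patterns simple enough that the POSIX ERE and PCRE flavors treat them the same way. More complex patterns should be checked individually.

```php
<?php
// Illustrative subject string.
$subject = 'Order 66; order 7, ORDER 42';

// Legacy: eregi('^order', $subject)
$matched = preg_match('/^order/i', $subject);

// Legacy: ereg_replace('([0-9]+)', '<\\1>', $subject)
$tagged = preg_replace('/([0-9]+)/', '<$1>', $subject);

// Legacy: split('[;,] *', $subject)
$parts = preg_split('/[;,] */', $subject);

var_dump($matched, $tagged, $parts);
```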

The mb_ereg Function Set

The last set is a variant of the ereg set, prefixing mb_ for "multibyte" to the function names. While ereg treats the regex and subject string as a series of 8-bit characters, mb_ereg can work with multi-byte characters from various code pages. If you want your regex to treat Far East characters as individual characters, you’ll either need to use the mb_ereg functions, or the preg functions with the /u modifier. mb_ereg is available in PHP 4.2.0 and later. It has not been deprecated or removed like the ereg functions. It uses the same POSIX ERE flavor.

The mb_ereg functions work exactly the same as the ereg functions, with one key difference: while ereg treats the regex and subject string as a series of 8-bit characters, mb_ereg can work with multi-byte characters from various code pages. When using Windows code page 936 (Simplified Chinese), for example, the word 中国 ("China") consists of four bytes: D6D0B9FA. Using the ereg function with the regular expression . on this string would yield the first byte D6 as the result. The dot matched exactly one byte, as the ereg functions are byte-oriented. Using the mb_ereg function after calling mb_regex_encoding("CP936") would yield the bytes D6D0 or the first character as the result.

To make sure your regular expression uses the correct code page, call mb_regex_encoding() to set the code page. If you don’t then the code page returned by or set by mb_internal_encoding() is used instead.

If your PHP script uses UTF-8 then you should use the preg functions with the /u modifier to match multi-byte UTF-8 characters instead of individual bytes. The preg functions do not support any other code pages.
