Unicode Categories

Each Unicode character belongs to a certain category. Unicode categories, or "general categories" as they’re called by the Unicode standard, are the most fundamental Unicode property. Every regex flavor that supports Unicode properties at all supports Unicode categories. That includes .NET, Java, ICU, JavaScript with /u, Ruby, JGsoft, Perl, PCRE, and PCRE2. What is said below about PCRE and PCRE2 also applies to PHP, R, and Delphi.

All these flavors support the \p{Property} syntax, with the property being one or two letters representing the category. You can match a single character belonging to the "letter" category with \p{L}. You can match a single character not belonging to that category with \P{L}. \p{Ll} matches a lowercase letter while \P{Ll} matches any character that is not a lowercase letter.
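Python's built-in re module is not among the flavors listed above and does not support \p{...}, so the following is only a minimal sketch that emulates the matching rule with the standard unicodedata module. The helper name matches_category and the sample text are made up for illustration.

```python
import unicodedata

def matches_category(ch: str, cat: str) -> bool:
    # Emulates \p{cat} for one code point. A one-letter category such as
    # "L" matches every two-letter category starting with that letter;
    # a two-letter category such as "Ll" has to match exactly.
    return unicodedata.category(ch).startswith(cat)

text = "A\u00f1o 2024!"  # "Año 2024!" with a precomposed ñ (U+00F1)

letters = [ch for ch in text if matches_category(ch, "L")]          # like \p{L}
non_letters = [ch for ch in text if not matches_category(ch, "L")]  # like \P{L}
lowercase = [ch for ch in text if matches_category(ch, "Ll")]       # like \p{Ll}
```

Negating the predicate plays the role of the uppercase \P: everything that is not a letter ends up in non_letters, including the space, the digits, and the exclamation mark.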

Again, "character" really means "Unicode code point". \p{L} matches a single code point in the category "letter". If your input string is à encoded as U+0061 U+0300 then it matches a without the accent. If the input is à encoded as U+00E0 then it matches à with the accent. The reason is that both the code points U+0061 (a) and U+00E0 (à) are in the category "letter", while U+0300 is in the category "mark".
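The two encodings of à can be inspected code point by code point. This is a small sketch using Python's standard unicodedata module rather than a regex engine; the variable and function names are chosen for illustration only.

```python
import unicodedata

decomposed = "a\u0300"  # U+0061 followed by U+0300 COMBINING GRAVE ACCENT
precomposed = "\u00e0"  # U+00E0, a with grave as a single code point

def categories(s: str) -> list:
    # General category of every code point in the string, in order.
    return [unicodedata.category(ch) for ch in s]

decomposed_cats = categories(decomposed)    # letter, then mark
precomposed_cats = categories(precomposed)  # a single letter
```

Here decomposed_cats is ['Ll', 'Mn'] and precomposed_cats is ['Ll'], which is why \p{L} stops after the plain a in the first string but consumes the whole accented character in the second.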

ICU, Perl, Ruby, JavaScript, and the JGsoft applications allow you to spell out the full category names, such as \p{Letter} or \p{Lowercase_Letter}.

PCRE and .NET are case sensitive for the category letters. \p{Zs} will match any kind of space character, while \p{zs} will throw an error. The first letter needs to be uppercase and the second letter, if used, needs to be lowercase. PCRE2 was case sensitive originally, but became case insensitive with version 10.40. PHP became case insensitive with version 8.2.0 and R with version 4.2.2. All other regex engines described in this tutorial ignore the case of the category between the curly braces. \p{zs}, \p{ZS}, and \p{zS} all match a single space separator. Still, it’s best to stick with the capitalization required by the case sensitive flavors. It is how the category letters are defined in Unicode, and it makes your regular expressions work with all Unicode regex engines.
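If patterns are assembled by a program, the capitalization can be normalized before the token is written. The sketch below is one possible way to do that; canonical_category and category_token are hypothetical helper names, and the code only handles the one- and two-letter abbreviations, not long names or \p{L&}.

```python
def canonical_category(name: str) -> str:
    # Uppercase first letter, lowercase second letter: the spelling
    # that the case sensitive flavors require.
    if not (1 <= len(name) <= 2 and name.isascii() and name.isalpha()):
        raise ValueError(f"not a one- or two-letter category: {name!r}")
    return name[0].upper() + name[1:].lower()

def category_token(name: str, negated: bool = False) -> str:
    # Builds the regex token text, e.g. \p{Zs}, or \P{Zs} when negated.
    prefix = "\\P" if negated else "\\p"
    return prefix + "{" + canonical_category(name) + "}"

space_token = category_token("zS")                # \p{Zs}
not_letter_token = category_token("l", negated=True)  # \P{L}
```

The resulting strings are plain pattern text; they still have to be passed to a regex engine that supports Unicode categories.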

Java, ICU, Perl, and JavaScript allow you to use the full property set syntax for categories. \p{gc=Lu} matches an uppercase letter just like \p{Lu}. Except for Java, these flavors also support the long form \p{General_Category=Uppercase_Letter}.

In addition to the standard notation with curly braces, \p{L}, Java, Perl, PCRE, PCRE2, and the JGsoft engine allow you to use the shorthand \pL without curly braces. The shorthand only works with single-letter Unicode categories. \pLl is not the equivalent of \p{Ll}. It is the equivalent of \p{L}l which matches Al or àl or any Unicode letter followed by a literal l.

The list at the end of this page covers all the general categories defined in Unicode. Every code point is part of exactly one two-letter category. Single-letter categories include all the characters of all two-letter categories that start with the same letter.

The categories themselves are the same in every version of Unicode. But code points can be moved between categories with each new version of Unicode. Code points for new characters are always moved from the Unassigned category to another category. But previously assigned code points can also be moved. The Georgian letters U+10D0–U+10FA, for example, were originally in the lowercase letter category. Unicode 3.0.0 moved them to the "other letter" category because they didn’t have uppercase equivalents. Unicode 11.0.0 added uppercase equivalents of these letters and moved the original letters back to the lowercase letter category.
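The effect of the Unicode version can be observed with Python's standard unicodedata module, which ships both the current database and a frozen Unicode 3.2.0 snapshot (unicodedata.ucd_3_2_0). This is only an illustrative sketch; the result of the second lookup depends on the Unicode version bundled with the Python build that runs it.

```python
import unicodedata

georgian_an = "\u10d0"  # GEORGIAN LETTER AN, one of the letters discussed above

# Category according to the frozen Unicode 3.2.0 snapshot.
old_category = unicodedata.ucd_3_2_0.category(georgian_an)

# Category according to the database bundled with this Python build.
current_category = unicodedata.category(georgian_an)

# Version of the bundled database as a tuple of integers, e.g. (15, 0, 0).
current_version = tuple(int(part) for part in unicodedata.unidata_version.split("."))
```

With the 3.2.0 snapshot the letter is reported as Lo. A Python build whose bundled database is Unicode 11.0.0 or later should report Ll instead. A pattern such as \p{Ll} can therefore match different text depending on which Unicode version the regex engine was built against.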

  • \p{L} or \p{Letter}: any kind of letter from any language.
    • \p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
    • \p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
    • \p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
    • \p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
    • \p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
    • \p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
  • \p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
    • \p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
    • \p{Mc} or \p{Spacing_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
    • \p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
  • \p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
    • \p{Zs} or \p{Space_Separator}: a whitespace character that is invisible, but does take up space.
    • \p{Zl} or \p{Line_Separator}: line separator character U+2028.
    • \p{Zp} or \p{Paragraph_Separator}: paragraph separator character U+2029.
  • \p{S} or \p{Symbol}: math symbols, currency signs, dingbats, box-drawing characters, etc.
    • \p{Sm} or \p{Math_Symbol}: any mathematical symbol.
    • \p{Sc} or \p{Currency_Symbol}: any currency sign.
    • \p{Sk} or \p{Modifier_Symbol}: a combining character (mark) as a full character on its own.
    • \p{So} or \p{Other_Symbol}: various symbols that are not math symbols, currency signs, or combining characters.
  • \p{N} or \p{Number}: any kind of numeric character in any script.
    • \p{Nd} or \p{Decimal_Number}: a digit zero through nine in any script except ideographic scripts.
    • \p{Nl} or \p{Letter_Number}: a number that looks like a letter, such as a Roman numeral.
    • \p{No} or \p{Other_Number}: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
  • \p{P} or \p{Punctuation}: any kind of punctuation character.
    • \p{Pd} or \p{Dash_Punctuation}: any kind of hyphen or dash.
    • \p{Ps} or \p{Open_Punctuation}: any kind of opening bracket.
    • \p{Pe} or \p{Close_Punctuation}: any kind of closing bracket.
    • \p{Pi} or \p{Initial_Punctuation}: any kind of opening quote.
    • \p{Pf} or \p{Final_Punctuation}: any kind of closing quote.
    • \p{Pc} or \p{Connector_Punctuation}: a punctuation character such as an underscore that connects words.
    • \p{Po} or \p{Other_Punctuation}: any kind of punctuation character that is not a dash, bracket, quote or connector.
  • \p{C} or \p{Other}: invisible control characters and unused code points.
    • \p{Cc} or \p{Control}: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
    • \p{Cf} or \p{Format}: invisible formatting indicator.
    • \p{Co} or \p{Private_Use}: any code point reserved for private use.
    • \p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16 encoding.
    • \p{Cn} or \p{Unassigned}: any code point to which no character has been assigned.
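
To make the list above more concrete, the following sketch pairs one sample code point with each two-letter category and checks it against Python's standard unicodedata module. The sample characters are my own picks, not taken from the Unicode standard's documentation, and U+FFFF is a noncharacter that the database reports as Cn.

```python
import unicodedata

# One sample code point per two-letter category (illustrative choices).
SAMPLES = {
    "a": "Ll", "A": "Lu", "\u01c5": "Lt", "\u02b0": "Lm", "\u05d0": "Lo",
    "\u0301": "Mn", "\u0903": "Mc", "\u20dd": "Me",
    " ": "Zs", "\u2028": "Zl", "\u2029": "Zp",
    "+": "Sm", "$": "Sc", "^": "Sk", "\u00a9": "So",
    "7": "Nd", "\u2163": "Nl", "\u00b2": "No",
    "-": "Pd", "(": "Ps", ")": "Pe", "\u00ab": "Pi", "\u00bb": "Pf",
    "_": "Pc", "!": "Po",
    "\t": "Cc", "\u200b": "Cf", "\ue000": "Co", "\ud800": "Cs", "\uffff": "Cn",
}

# Collect samples whose actual category differs from the expected one.
# Code points are reported as U+XXXX because some samples are unprintable.
mismatches = [
    (f"U+{ord(ch):04X}", expected, unicodedata.category(ch))
    for ch, expected in SAMPLES.items()
    if unicodedata.category(ch) != expected
]

# The single-letter categories are the first letters of the two-letter ones.
major_categories = sorted({cat[0] for cat in SAMPLES.values()})
```

An empty mismatches list means every sample landed in the category the list describes, and major_categories collects the seven single-letter groups: C, L, M, N, P, S, and Z.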
