Each Unicode character belongs to a certain category. Unicode categories, or "general categories" as they’re called by the Unicode standard, are the most fundamental Unicode property. Every regex flavor that supports Unicode properties at all supports Unicode categories. That includes .NET, Java, ICU, JavaScript with /u, Ruby, JGsoft, Perl, PCRE, and PCRE2. What is said below about PCRE and PCRE2 also applies to PHP, R, and Delphi.
All these flavors support the \p{Property} syntax, with the property being one or two letters that represent the category. You can match a single character belonging to the "letter" category with \p{L}. You can match a single character not belonging to that category with \P{L}. \p{Ll} matches a lowercase letter while \P{Ll} matches any character that is not a lowercase letter.
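Python's standard re module is not one of the flavors discussed here, so the following is only a rough sketch of what \p{L}, \P{L}, \p{Ll}, and \P{Ll} select. It assumes Python's unicodedata module as the source of category data. The helper names in_category and find_all are made up for this illustration and are not part of any regex engine.

```python
import unicodedata


def in_category(ch: str, category: str) -> bool:
    # Emulates \p{category} for one code point. A one-letter category such
    # as "L" accepts every two-letter category that starts with that letter.
    return unicodedata.category(ch).startswith(category)


def find_all(text: str, category: str, negate: bool = False) -> list:
    # negate=False emulates \p{category}, negate=True emulates \P{category}.
    # Iterating over a Python str yields one code point at a time.
    return [ch for ch in text if in_category(ch, category) != negate]


sample = "aB1 "
print(find_all(sample, "L"))                 # like \p{L}
print(find_all(sample, "L", negate=True))    # like \P{L}
print(find_all(sample, "Ll"))                # like \p{Ll}
print(find_all(sample, "Ll", negate=True))   # like \P{Ll}
```

In a flavor that does support the syntax, the regex engine performs this lookup for you, one code point per \p{...} token.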
Again, "character" really means "Unicode code point". \p{L} matches a single code point in the category "letter". If your input string is à encoded as U+0061 U+0300 then it matches a without the accent. If the input is à encoded as U+00E0 then it matches à with the accent. The reason is that both the code points U+0061 (a) and U+00E0 (à) are in the category "letter", while U+0300 is in the category "mark".
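The difference between the two encodings of à can be checked outside of any regex engine. This is a minimal sketch that assumes Python's unicodedata module. The function name categories is hypothetical. It lists the general category of each code point in a string.

```python
import unicodedata

decomposed = "a\u0300"   # à encoded as U+0061 followed by U+0300
precomposed = "\u00e0"   # à encoded as the single code point U+00E0


def categories(text: str) -> list:
    # Pair each code point (formatted as U+XXXX) with its general category.
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in text]


print(categories(decomposed))
print(categories(precomposed))

# NFD normalization splits the precomposed form into base letter plus mark.
print(unicodedata.normalize("NFD", precomposed) == decomposed)
```

U+0061 and U+00E0 both report Ll, a subcategory of "letter", while U+0300 reports Mn, a subcategory of "mark". That is why \p{L} stops after the a in the decomposed string.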
ICU, Perl, Ruby, JavaScript, and the JGsoft applications allow you to spell out the full category names, such as \p{Letter} or \p{Lowercase_Letter}.
PCRE and .NET are case sensitive for the category letters. \p{Zs} will match any kind of space character, while \p{zs} will throw an error. The first letter needs to be uppercase and the second letter, if used, needs to be lowercase. PCRE2 was case sensitive originally, but became case insensitive with version 10.40. PHP became case insensitive with version 8.2.0 and R with version 4.2.2. All other regex engines described in this tutorial ignore the case of the category between the curly braces. \p{zs}, \p{ZS}, and \p{zS} all match a single space separator. Still, it’s best to stick with the capitalization required by the case sensitive flavors. It is how the category letters are defined in Unicode, and it makes your regular expressions work with all Unicode regex engines.
Java, ICU, Perl, and JavaScript allow you to use the full property set syntax for categories. \p{gc=Lu} matches an uppercase letter just like \p{Lu}. Except for Java, these flavors also support the long form \p{General_Category=Uppercase_Letter}.
In addition to the standard notation with curly braces, \p{L}, Java, Perl, PCRE, PCRE2, and the JGsoft engine allow you to use the shorthand \pL without curly braces. The shorthand only works with single-letter Unicode categories. \pLl is not the equivalent of \p{Ll}. It is the equivalent of \p{L}l which matches Al or àl or any Unicode letter followed by a literal l.
These are all the general categories defined in Unicode. Every code point is part of exactly one two-letter category. Single-letter categories include all the characters of all two-letter categories that start with the same letter.
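The relationship between one-letter and two-letter categories can be illustrated with a short sketch. It assumes Python's unicodedata module, which reports exactly one two-letter category per code point. The function name group_by_major is hypothetical.

```python
import unicodedata
from collections import defaultdict


def group_by_major(text: str) -> dict:
    # Map each one-letter category to the two-letter categories found in text.
    groups = defaultdict(set)
    for ch in text:
        minor = unicodedata.category(ch)   # e.g. "Lu" for "A"
        groups[minor[0]].add(minor)        # its first letter is the one-letter category
    return {major: sorted(minors) for major, minors in groups.items()}


print(group_by_major("aA1 _$+"))
```

For this sample, the letters a and A end up together under L even though one is Ll and the other is Lu, which is the sense in which \p{L} covers everything that \p{Ll} and \p{Lu} match.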
The categories themselves are the same in every version of Unicode. But code points can be moved between categories with each new version of Unicode. Code points for new characters are always moved from the Unassigned category to another category. But previously assigned code points can also be moved. The Georgian letters U+10D0–U+10FA, for example, were originally in the lowercase letter category. Unicode 3.0.0 moved them to the "other letter" category because they didn’t have uppercase equivalents. Unicode 11.0.0 added uppercase equivalents of these letters and moved the original letters back to the lowercase letter category.
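Python happens to ship two versions of the Unicode database, which makes this kind of reclassification easy to observe. The sketch below assumes the unicodedata module: unicodedata.ucd_3_2_0 exposes the Unicode 3.2.0 data, and the module-level functions use whatever version the running interpreter bundles. The result of the second lookup therefore depends on your Python version.

```python
import unicodedata

GEORGIAN_AN = "\u10d0"   # GEORGIAN LETTER AN, first letter of the range

# Category according to the frozen Unicode 3.2.0 database.
old = unicodedata.ucd_3_2_0.category(GEORGIAN_AN)

# Category according to the database bundled with this interpreter.
current = unicodedata.category(GEORGIAN_AN)

print("Unicode 3.2.0:", old)
print("Unicode " + unicodedata.unidata_version + ":", current)
```

On an interpreter whose bundled database is Unicode 11.0.0 or later, the two lookups should disagree. The same regex can behave differently across versions for the same reason: a regex engine built on older Unicode data does not match the same code points with \p{Ll} as one built on newer data.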
Page URL: https://www.regular-expressions.info/unicodecategory.html
Page last updated: 15 September 2025
Site last updated: 29 October 2025
Copyright © 2003-2025 Jan Goyvaerts. All rights reserved.