33

What are the (full) valid / allowed characters for CSS identifiers id and class?

Is there a regular expression that I can use to validate against? Is it browser agnostic?

BalusC
asked May 11, 2010 at 15:39
3

3 Answers 3

51

The charset doesn't matter; the allowed characters matter more. Check the CSS specification. Here's the relevant quote:

In CSS, identifiers (including element names, classes, and IDs in selectors) can contain only the characters [a-zA-Z0-9] and ISO 10646 characters U+00A0 and higher, plus the hyphen (-) and the underscore (_); they cannot start with a digit, two hyphens, or a hyphen followed by a digit. Identifiers can also contain escaped characters and any ISO 10646 character as a numeric code (see next item). For instance, the identifier "B&W?" may be written as "B\&W\?" or "B\26 W\3F".
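
To make the escaping rule in that quote concrete, here is a minimal Java sketch that decodes both escape forms back to plain text. The class name CssEscapeDemo and the method unescape are hypothetical names chosen for illustration, and the U+FFFD fallback for out-of-range code points is my own assumption, not something the quoted passage specifies.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CssEscapeDemo {
    // Group 1: backslash + 1-6 hex digits, optionally terminated by one whitespace.
    // Group 2: backslash + any single character that is not a newline or hex digit.
    private static final Pattern ESCAPE = Pattern.compile(
        "\\\\([0-9a-fA-F]{1,6})(?:\\r\\n|[ \\t\\r\\n\\f])?"
        + "|\\\\([^\\r\\n\\f0-9a-fA-F])");

    public static String unescape(String ident) {
        Matcher m = ESCAPE.matcher(ident);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(ident, last, m.start());
            if (m.group(1) != null) {
                // Numeric escape: interpret the hex digits as a code point.
                int cp = Integer.parseInt(m.group(1), 16);
                // Assumption: substitute U+FFFD for values outside the Unicode range.
                out.appendCodePoint(Character.isValidCodePoint(cp) ? cp : 0xFFFD);
            } else {
                // Simple escape: keep the escaped character itself.
                out.append(m.group(2));
            }
            last = m.end();
        }
        out.append(ident, last, ident.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("B\\&W\\?"));    // prints B&W?
        System.out.println(unescape("B\\26 W\\3F")); // prints B&W?
    }
}
```

Note that the space after \26 is consumed as the terminator of the numeric escape, which is why it does not appear in the decoded result.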

Update: As to the regex question, you can find the grammar here:

ident -?{nmstart}{nmchar}*

It consists of the following parts:

nmstart [_a-z]|{nonascii}|{escape}
nmchar [_a-z0-9-]|{nonascii}|{escape}
nonascii [\240-\377]
escape {unicode}|\\[^\r\n\f0-9a-f]
unicode \\{h}{1,6}(\r\n|[ \t\r\n\f])?
h [0-9a-f]

This can be translated to a Java regex as follows (I only added parentheses to parts containing the OR and escaped the backslashes):

String h = "[0-9a-f]";
String unicode = "\\\\{h}{1,6}(\\r\\n|[ \\t\\r\\n\\f])?".replace("{h}", h);
String escape = "({unicode}|\\\\[^\\r\\n\\f0-9a-f])".replace("{unicode}", unicode);
String nonascii = "[\\0240-\\0377]"; // Octal escapes; Java regex requires a leading 0 (\0mnn).
String nmchar = "([_a-z0-9-]|{nonascii}|{escape})".replace("{nonascii}", nonascii).replace("{escape}", escape);
String nmstart = "([_a-z]|{nonascii}|{escape})".replace("{nonascii}", nonascii).replace("{escape}", escape);
String ident = "-?{nmstart}{nmchar}*".replace("{nmstart}", nmstart).replace("{nmchar}", nmchar);
System.out.println(ident); // The full regex.
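
When the pattern is used for validation, it needs to match the whole string and ignore case; otherwise a plain find() will locate a valid identifier inside an invalid one. A minimal sketch of that, assuming a hypothetical class CssIdentValidator with a method isValidIdent (neither is part of the CSS specification or any library), and assuming nonascii is widened to "U+00A0 and higher" as in the quoted prose rather than the grammar's [\240-\377]:

```java
import java.util.regex.Pattern;

public class CssIdentValidator {
    private static final String H = "[0-9a-f]";
    private static final String UNICODE =
        "\\\\" + H + "{1,6}(?:\\r\\n|[ \\t\\r\\n\\f])?";
    private static final String ESCAPE =
        "(?:" + UNICODE + "|\\\\[^\\r\\n\\f0-9a-f])";
    // Assumption: every code point from U+00A0 upwards counts as nonascii.
    private static final String NONASCII = "[\\x{A0}-\\x{10FFFF}]";
    private static final String NMSTART =
        "(?:[_a-z]|" + NONASCII + "|" + ESCAPE + ")";
    private static final String NMCHAR =
        "(?:[_a-z0-9-]|" + NONASCII + "|" + ESCAPE + ")";
    private static final String IDENT = "-?" + NMSTART + NMCHAR + "*";

    // CASE_INSENSITIVE lets [a-z] and [0-9a-f] cover upper-case input too.
    private static final Pattern IDENT_PATTERN =
        Pattern.compile(IDENT, Pattern.CASE_INSENSITIVE);

    public static boolean isValidIdent(String s) {
        // matches() requires the whole input to match, unlike find().
        return IDENT_PATTERN.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidIdent("foo-bar_1"));       // true
        System.out.println(isValidIdent("-Foo"));            // true
        System.out.println(isValidIdent("B\\26 W\\3F"));     // true
        System.out.println(isValidIdent("2thisshouldfail")); // false
        System.out.println(isValidIdent("--foo"));           // false
        System.out.println(isValidIdent("-1a"));             // false
    }
}
```

Using matches() is equivalent to wrapping the expression in ^...$ here; the grammar itself is unchanged.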

Update 2: Oh, you're more of a PHP developer. I think you can figure out how and where to use str_replace.

typo
answered May 11, 2010 at 15:41

7 Comments

"the identifier "B&W?" may be written as "B\&W\?" or "B\26 W\3F"" - But nobody does that, and I'm glad they don't. :-)
THANK YOU! That's just awesome! :D I thought it was very limited, but I didn't know I could use `\` as an escape character. Has anyone ever built a regex to validate the allowed chars?
That's perfect, and yes I can figure it out. =) Thanks again!
You're welcome. Don't forget to make it case insensitive or to lowercase the identifier beforehand.
If I evaluate your Java, I get the following regex pattern: -?([_a-z]|[\x200-\x377]|(\\[0-9a-f]{1,6}(\r\n|[ \t\r\n\f])?|\\[^\r\n\f0-9a-f]))([_a-z0-9-]|[\x200-\x377]|(\\[0-9a-f]{1,6}(\r\n|[ \t\r\n\f])?|\\[^\r\n\f0-9a-f]) )* Yet that matches the string "2thisshouldfail", which is not a valid CSS identifier.
4

For anyone looking for something a little more turn-key, the full expression from @BalusC's answer, with all substitutions applied, is:

/-?([_a-z]|[\240-\377]|([0-9a-f]{1,6}(\r\n|[ \t\r\n\f])?|[^\r\n\f0-9a-f]))([_a-z0-9-]|[\240-\377]|([0-9a-f]{1,6}(\r\n|[ \t\r\n\f])?|[^\r\n\f0-9a-f]))*/

And using DEFINE, which I find a little more readable:

/(?(DEFINE)
 (?P<h> [0-9a-f] )
 (?P<unicode> (?&h){1,6}(\r\n|[ \t\r\n\f])? )
 (?P<escape> ((?&unicode)|[^\r\n\f0-9a-f])* )
 (?P<nonascii> [\240-\377] )
 (?P<nmchar> ([_a-z0-9-]|(?&nonascii)|(?&escape)) )
 (?P<nmstart> ([_a-z]|(?&nonascii)|(?&escape)) )
 (?P<ident> -?(?&nmstart)(?&nmchar)* )
) (?:
 (?&ident)
)/x

Incidentally, the original regular expression (and @human's contribution) had a few rogue escape characters that allow [ in the name.

It should also be noted that the raw regex, without DEFINE, runs about 2x as fast as the DEFINE expression, taking only ~23 steps to identify a single unicode character, while the latter takes ~40.

answered Dec 23, 2016 at 12:51

Comments

2

This is merely a contribution to @BalusC's answer. It is the PHP version of the Java code he provided; I converted it and thought someone else might find it helpful.

// Single-quoted strings pass the regex escapes (\r, \n, \f, \240) through to PCRE untouched;
// '\\\\' yields the two characters \\, which match one literal backslash.
$h = '[0-9a-f]';
$unicode = str_replace( '{h}', $h, '\\\\{h}{1,6}(\r\n|[ \t\r\n\f])?' );
$escape = str_replace( '{unicode}', $unicode, '({unicode}|\\\\[^\r\n\f0-9a-f])' );
$nonascii = '[\240-\377]';
$nmchar = str_replace( array( '{nonascii}', '{escape}' ), array( $nonascii, $escape ), '([_a-z0-9-]|{nonascii}|{escape})' );
$nmstart = str_replace( array( '{nonascii}', '{escape}' ), array( $nonascii, $escape ), '([_a-z]|{nonascii}|{escape})' );
$ident = str_replace( array( '{nmstart}', '{nmchar}' ), array( $nmstart, $nmchar ), '-?{nmstart}{nmchar}*' );
echo $ident; // The full regex.
answered Apr 19, 2016 at 6:54

Comments
