It happens to be that UTF-16 code units are 1:1 with USV (Unicode scalar value) code points within two ranges, U+0 to U+D7FF and U+E000 to U+FFFF inclusive. Most cased characters fall into those two ranges, but not all of them.
function capitalizeFirstLetter(str) {
if (!str) return '';
const firstCPfirstCodePoint = str.codePointAt(0);
const index = firstCPfirstCodePoint > 0xFFFF ? 2 : 1;
return String.fromCodePoint(firstCPfirstCodePoint).toUpperCase() + str.slice(index);
}
capitalizeFirstLetter("𐐶𐐲𐑌𐐼𐐲𐑉") // "𐐎𐐲𐑌𐐼𐐲𐑉"
The \p{CWU} or Changes_When_Uppercased character property matches all code points which change when uppercased in the generic case where specific locale data is absent. There are other interesting case-related Unicode character properties that you may wish to play around with. It’s a cool zone to explore but we’d go on all day if we enumerated em all here. Here’s something to get your curiosity going if you’re unfamiliar, though: \p{Lower} is a larger group than \p{LowercaseLetter} (aka \p{Ll}) — conveniently illustrated by the default character set comparison in this tool provided by Unicode. (NB: not everything you can reference there is also available in ES regular expressions, but most of the stuff you’re likely to want is).
In Chromium at the time of writing, both the English and Dutch lines come out as Ijsselmeer — so it does no better than JS. But try it in current Firefox! The element that we told the browser contains Dutch will be correctly rendered as IJsselmeerIJsselmeer there.
It happens to be that UTF-16 code units are 1:1 with USV code points within two ranges, U+0 to U+D7FF and U+E000 to U+FFFF inclusive. Most cased characters fall into those two ranges, but not all of them.
function capitalizeFirstLetter(str) {
if (!str) return '';
const firstCP = str.codePointAt(0);
const index = firstCP > 0xFFFF ? 2 : 1;
return String.fromCodePoint(firstCP).toUpperCase() + str.slice(index);
}
capitalizeFirstLetter("𐐶𐐲𐑌𐐼𐐲𐑉") // "𐐎𐐲𐑌𐐼𐐲𐑉"
The CWU or Changes_When_Uppercased character property matches all code points which change when uppercased in the generic case where specific locale data is absent. There are other interesting case-related Unicode character properties that you may wish to play around with. It’s a cool zone to explore but we’d go on all day if we enumerated em all here. Here’s something to get your curiosity going if you’re unfamiliar, though: \p{Lower} is a larger group than \p{LowercaseLetter} (aka \p{Ll}) — conveniently illustrated by the default character set comparison in this tool provided by Unicode. (NB: not everything you can reference there is also available in ES regular expressions, but most of the stuff you’re likely to want is).
In Chromium at the time of writing, both the English and Dutch lines come out as Ijsselmeer — so it does no better than JS. But try it in current Firefox! The element that we told the browser contains Dutch will be correctly rendered as IJsselmeer there.
It happens to be that UTF-16 code units are 1:1 with USV (Unicode scalar value) code points within two ranges, U+0 to U+D7FF and U+E000 to U+FFFF inclusive. Most cased characters fall into those two ranges, but not all of them.
function capitalizeFirstLetter(str) {
if (!str) return '';
const firstCodePoint = str.codePointAt(0);
const index = firstCodePoint > 0xFFFF ? 2 : 1;
return String.fromCodePoint(firstCodePoint).toUpperCase() + str.slice(index);
}
capitalizeFirstLetter("𐐶𐐲𐑌𐐼𐐲𐑉") // "𐐎𐐲𐑌𐐼𐐲𐑉"
The \p{CWU} or Changes_When_Uppercased character property matches all code points which change when uppercased in the generic case where specific locale data is absent. There are other interesting case-related Unicode character properties that you may wish to play around with. It’s a cool zone to explore but we’d go on all day if we enumerated em all here. Here’s something to get your curiosity going if you’re unfamiliar, though: \p{Lower} is a larger group than \p{LowercaseLetter} (aka \p{Ll}) — conveniently illustrated by the default character set comparison in this tool provided by Unicode. (NB: not everything you can reference there is also available in ES regular expressions, but most of the stuff you’re likely to want is).
In Chromium at the time of writing, both the English and Dutch lines come out as Ijsselmeer — so it does no better than JS. But try it in current Firefox! The element that we told the browser contains Dutch will be correctly rendered as IJsselmeer there.
If digraphs with unique locale/language/orthography capitalization rules happen to have a single-codepoint "composed" representation in Unicode, these might be used to make one’s capitalization expectations explicit even in the absence of locale data. For example, as withwe could prefer the composed U+133 Ii-Jj digraph associated with Dutch, ij even in the absence of locale data/ U+133, associated with Dutch, to ensure a case-mapping to uppercase IJ / U+132:
If digraphs with unique locale/language/orthography capitalization rules happen to have a single-codepoint "composed" representation in Unicode, these might be used to make one’s capitalization expectations explicit, as with the composed U+133 I-J digraph associated with Dutch ij even in the absence of locale data:
If digraphs with unique locale/language/orthography capitalization rules happen to have a single-codepoint "composed" representation in Unicode, these might be used to make one’s capitalization expectations explicit even in the absence of locale data. For example, we could prefer the composed i-j digraph, ij / U+133, associated with Dutch, to ensure a case-mapping to uppercase IJ / U+132:
In Chromium at the time of writing, both the English and Dutch lines come out as Ijsselmeer — so it does no better than JS. But try it in current Firefox! The element that was explicitly annotated containingwe told the browser contains Dutch content renderswill be correctly rendered as IJsselmeer correctly now!
This solution is purpose-specific (it’s not gonna help you in Node, anyway) but it was silly of me not to draw attention to it previouslythere.
ThanksThis solution is purpose-specific (it’s not gonna help you in Node, anyway) but it was silly of me not to draw attention to it previously given some folks might not realize they’re googling the wrong question. Thanks @paul23 for clarifying more about the nature of the IJ digraph in practice and prompting further investigation!
In Chromium at the time of writing, both the English and Dutch lines come out as Ijsselmeer — it does no better than JS. But try it in Firefox! The element that was explicitly annotated containing Dutch content renders IJsselmeer correctly now!
This solution is purpose-specific (it’s not gonna help you in Node, anyway) but it was silly of me not to draw attention to it previously.
Thanks @paul23 for clarifying more about the nature of the IJ digraph in practice and prompting further investigation!
In Chromium at the time of writing, both the English and Dutch lines come out as Ijsselmeer — so it does no better than JS. But try it in current Firefox! The element that we told the browser contains Dutch will be correctly rendered as IJsselmeer there.
This solution is purpose-specific (it’s not gonna help you in Node, anyway) but it was silly of me not to draw attention to it previously given some folks might not realize they’re googling the wrong question. Thanks @paul23 for clarifying more about the nature of the IJ digraph in practice and prompting further investigation!
- 7.6k
- 2
- 32
- 42
- 7.6k
- 2
- 32
- 42