edited Jul 19, 2024 at 17:07


UTF-16 code units happen to map 1:1 to Unicode scalar values (USVs) within two ranges, U+0000 to U+D7FF and U+E000 to U+FFFF inclusive. Most cased characters fall into those two ranges, but not all of them.

function capitalizeFirstLetter(str) {
 if (!str) return '';
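 // Read the whole first code point, not just the first UTF-16 code unit.
 const firstCodePoint = str.codePointAt(0);
 // Code points above U+FFFF occupy two code units (a surrogate pair).
 const index = firstCodePoint > 0xFFFF ? 2 : 1;
 return String.fromCodePoint(firstCodePoint).toUpperCase() + str.slice(index);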
}
capitalizeFirstLetter("𐐶𐐲𐑌𐐼𐐲𐑉") // "𐐎𐐲𐑌𐐼𐐲𐑉"
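To see why reading a full code point matters, here is a small sketch comparing code units with code points. The `naiveCapitalize` helper is hypothetical and exists only to illustrate the failure mode of a `charAt(0)`-based approach.

```javascript
// U+10436 (DESERET SMALL LETTER WU) lies outside U+0000..U+FFFF.
const astral = "\u{10436}";

// One code point, but two UTF-16 code units (a surrogate pair).
const codeUnitCount = astral.length;
const codePointCount = [...astral].length;

// Hypothetical naive version: uppercases only the first code unit.
function naiveCapitalize(str) {
  return str.charAt(0).toUpperCase() + str.slice(1);
}

// The lone lead surrogate has no case mapping, so nothing changes.
const naiveResult = naiveCapitalize(astral + "abc");

// Uppercasing the complete code point does work.
const wholeResult = astral.toUpperCase();

console.log(codeUnitCount, codePointCount);
console.log(naiveResult === astral + "abc");
console.log(wholeResult === "\u{1040E}");
```

The naive version silently returns the input unchanged for such strings, which is easy to miss in testing if all your sample data sits in the two 1:1 ranges.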

The \p{CWU} or Changes_When_Uppercased character property matches all code points which change when uppercased in the generic case where specific locale data is absent. There are other interesting case-related Unicode character properties that you may wish to play around with. It’s a cool zone to explore, but we’d go on all day if we enumerated them all here. Here’s something to get your curiosity going if you’re unfamiliar, though: \p{Lower} is a larger group than \p{Lowercase_Letter} (aka \p{Ll}), which is conveniently illustrated by the default character set comparison in this tool provided by Unicode. (NB: not everything you can reference there is also available in ES regular expressions, but most of the stuff you’re likely to want is.)
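As a rough illustration of these properties in ES regular expressions, here is a minimal sketch. The helper names are made up for this example; property escapes require the `u` (or `v`) flag.

```javascript
// Hypothetical helpers wrapping Unicode property escapes.
function changesWhenUppercased(ch) {
  return /^\p{CWU}$/u.test(ch);
}
function isLower(ch) {
  return /^\p{Lower}$/u.test(ch);
}
function isLowercaseLetter(ch) {
  return /^\p{Ll}$/u.test(ch);
}

console.log(changesWhenUppercased("a"));          // lowercase letter with an uppercase form
console.log(changesWhenUppercased("A"));          // already uppercase
console.log(changesWhenUppercased("\u{10436}"));  // works beyond U+FFFF too

// U+2170 (SMALL ROMAN NUMERAL ONE) has general category Nl, not Ll,
// yet it still carries the Lowercase property.
const romanOne = "\u2170";
console.log(isLower(romanOne), isLowercaseLetter(romanOne));
```

The last line is one example of \p{Lower} covering more than \p{Ll}: the character is a letter-like number rather than a lowercase letter, but it still counts as lowercase.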

In Chromium at the time of writing, both the English and Dutch lines come out as Ijsselmeer, so it does no better than JS. But try it in current Firefox! There, the element that we told the browser contains Dutch is correctly rendered as IJsselmeer.
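If you need this behavior in JS itself, you would have to encode the locale rule by hand. The following is only a sketch under that assumption: `capitalizeFirstLetterForLocale` is a hypothetical helper, not a built-in, and it knows exactly one special case (a leading Dutch "ij" digraph).

```javascript
// Hypothetical helper: a single hand-written locale rule, not a standard API.
function capitalizeFirstLetterForLocale(str, locale) {
  if (!str) return '';
  // Assumption: in Dutch, a word-initial "ij" is capitalized as a unit.
  if (locale === 'nl' && /^ij/i.test(str)) {
    return 'IJ' + str.slice(2);
  }
  const firstCodePoint = str.codePointAt(0);
  const index = firstCodePoint > 0xFFFF ? 2 : 1;
  return String.fromCodePoint(firstCodePoint).toLocaleUpperCase(locale) + str.slice(index);
}

console.log(capitalizeFirstLetterForLocale("ijsselmeer", "en"));
console.log(capitalizeFirstLetterForLocale("ijsselmeer", "nl"));
```

A real solution would need a proper source of locale data; one hard-coded rule like this does not generalize to other languages.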



edited Aug 24, 2022 at 19:18
