Return to Revisions

7 of 12

added 204 characters in body

edited Feb 11, 2020 at 23:33

7.6k
2
32
42

I didn’t see any mention in the existing answers of issues related to ~~(削除) astral plane code points or (削除ここまで)~~ internationalization. "Uppercase" doesn’t mean the same thing in every language using a given script.

Initially I didn’t see any answers addressing issues related to astral plane code points. There is one, but it’s a bit buried (like this one will be, I guess!)

Most of the proposed functions look like this:

function capitalizeFirstLetter(str) {
 return str[0].toUpperCase() + str.slice(1);
}

However, some cased characters fall outside the BMP (basic multilingual plane, code points U+0 to U+FFFF). For example take this Deseret text:

capitalizeFirstLetter("𐐶𐐲𐑌𐐼𐐲𐑉"); // "𐐶𐐲𐑌𐐼𐐲𐑉"

The first character here fails to capitalize because the array-indexed properties of strings don’t access "characters" or code points*. They access UTF-16 code units. This is true also when slicing — the index values point at code units.

It happens to be that UTF-16 code units are 1:1 with USV code points within two ranges, U+0 to U+D7FF and U+E000 to U+FFFF inclusive. Most cased characters fall into those two ranges, but not all of them.

From ES2015 on, dealing with this became a bit easier. String.prototype[@@iterator] yields strings corresponding to code points**. So for example, we can do this:

function capitalizeFirstLetter([ first, ...rest ]) {
 return [ first.toUpperCase(), ...rest ].join('');
}
capitalizeFirstLetter("𐐶𐐲𐑌𐐼𐐲𐑉") // "𐐎𐐲𐑌𐐼𐐲𐑉"

For longer strings, this is probably not terribly efficient*** — we don’t really need to iterate the remainder. We could use String.prototype.codePointAt to get at that first (possible) letter, but we’d still need to determine where the slice should begin. One way to avoid iterating the remainder would be to test whether the first codepoint is outside the BMP; if it isn’t, the slice begins at 1, and if it is, the slice begins at 2.

function capitalizeFirstLetter(str) {
 const firstCP = str.codePointAt(0);
 const index = firstCP > 0xFFFF ? 2 : 1;
 return String.fromCodePoint(firstCP).toUpperCase() + str.slice(index);
}
capitalizeFirstLetter("𐐶𐐲𐑌𐐼𐐲𐑉") // "𐐎𐐲𐑌𐐼𐐲𐑉"

You could use bitwise math instead of > 0xFFFF there, but it’s probably easier to understand this way and either would achieve the same thing.

We can also make this work in ES5 and below by taking that logic a bit further if necessary. There are no intrinsic methods in ES5 for working with codepoints, so we have to manually test whether the first code unit is a surrogate****:

function capitalizeFirstLetter(str) {
 var firstCodeUnit = str[0];
 if (firstCodeUnit < '\uD800' || firstCodeUnit > '\uDFFF') {
 return str[0].toUpperCase() + str.slice(1);
 }
 return str.slice(0, 2).toUpperCase() + str.slice(2);
}
capitalizeFirstLetter("𐐶𐐲𐑌𐐼𐐲𐑉") // "𐐎𐐲𐑌𐐼𐐲𐑉"

At the start I also mentioned internationalization considerations. Some of these are very difficult to account for because they require knowledge not only of what language is being used, but also may require specific knowledge of the words in the language. For example, the Irish digraph "mb" capitalizes as "mB" at the start of a word. Another example, the German eszett, never begins a word (afaik), but still helps illustrate the problem. The lowercase eszett ("ß") capitalizes to "SS," but "SS" could lowercase to either "ß" or "ss" — you require out-of-band knowledge of the German language to know which is correct!

The most famous example of these kinds of issues, probably, is Turkish. In Turkish Latin, the capital form of i is İ, while the lowercase form of I is ı — they’re two different letters. Fortunately we do have a way to account for this:

function capitalizeFirstLetter([ first, ...rest ], locale) {
 return [ first.toLocaleUpperCase(locale), ...rest ].join('');
}
capitalizeFirstLetter("italy", "en") // "Italy"
capitalizeFirstLetter("italya", "tr") // "İtalya"

In a browser, the user’s most-preferred language tag is indicated by navigator.language, a list in order of preference is found at navigator.languages, and a given DOM element’s language can be obtained (usually) with Object(element.closest('[lang]')).lang || YOUR_DEFAULT_HERE in multilanguage documents.

In agents which support Unicode property character classes in RegExp, which were introduced in ES2018, we can clean stuff up further by directly expressing what characters we’re interested in:

function capitalizeFirstLetter(str, locale=navigator.language) {
 return str.replace(/^\p{CWU}/u, char => char.toLocaleUpperCase(locale));
}

This could be tweaked a bit to also handle capitalizing multiple words in a string with fairly good accuracy. The CWU or Changes_When_Uppercased character property matches all code points which, well, change when uppercased. We can try this out with a titlecased digraph characters like the Dutch ij for example:

capitalizeFirstLetter('ijsselmeer'); // "IJsselmeer"

At the time of writing (Feb 2020), Firefox/Spidermonkey has not yet implemented any of the RegExp features introduced in the last two years*****. You can check the current status of this feature at the Kangax compat table. Babel is able to compile RegExp literals with property references to equivalent patterns without them, but be aware that the resulting code may be enormous.

In all likelihood, people asking this question will not be concerned with Deseret capitalization or internationalization. But it’s good to be aware of these issues because there’s a good chance you’ll encounter them eventually even if they aren’t concerns presently. They’re not "edge" cases, or rather, they’re not by-definition edge cases — there’s a whole country where most people speak Turkish, anyway, and conflating code units with codepoints is a fairly common source of bugs (especially with regard to emoji). Both strings and language are pretty complicated!

* The code units of UTF-16 / UCS2 are also Unicode code points in the sense that e.g. U+D800 is technically a code point, but that’s not what it "means" here ... sort of ... though it gets pretty fuzzy. What the surrogates definitely are not, though, is USVs (Unicode scalar values).

** Though if a surrogate code unit is "orphaned" — i.e., not part of a logical pair — you could still get surrogates here, too.

*** maybe. I haven’t tested it. Unless you have determined capitalization is a meaningful bottleneck, I probably wouldn’t sweat it — choose whatever you believe is most clear and readable.

**** such a function might wish to test both the first and second code units instead of just the first, since it’s possible that the first unit is an orphaned surrogate. For example the input "\uD800x" would capitalize the X as-is, which may or may not be expected.

***** Here’s the Bugzilla issue if you want to follow the progress more directly.

answered Dec 26, 2018 at 10:33

Semicolon

7.6k
2
32
42

CollectivesTM on Stack Overflow

Return to Revisions