
answered Dec 26, 2018 at 10:33


I didn’t see any mention in the existing answers of issues related to internationalization. "Uppercase" doesn’t mean the same thing in every language using a given script.

Initially I didn’t see any answers addressing issues related to astral plane codepoints. There is one, but it’s a bit buried (like this one will be, I guess!)

Most of the proposed functions look like this:

function capitalizeFirstLetter(str) {
  return str[0].toUpperCase() + str.slice(1);
}

However, some cased characters fall outside the BMP (Basic Multilingual Plane, codepoints U+0000 to U+FFFF). For example, take this Deseret text:

capitalizeFirstLetter("𐐶𐐲𐑌𐐼𐐲𐑉"); // "𐐶𐐲𐑌𐐼𐐲𐑉"

The first character here fails to capitalize because the array-indexed properties of strings do not access characters or codepoints. They access UTF-16 code units. The same is true when slicing: the index values point at code units.

As it happens, UTF-16 code units map 1:1 to codepoints in two ranges, U+0000 to U+D7FF and U+E000 to U+FFFF. Most cased characters fall into those two ranges, but not all of them. Codepoints above U+FFFF are encoded as a pair of surrogate code units instead.
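To make the distinction concrete, here is a small sketch that inspects the first letter of the Deseret example (U+10436). The variable name is only for illustration.

```javascript
// U+10436, the first letter of the Deseret example, written as an escape.
const wu = "\u{10436}";

// One codepoint, but two UTF-16 code units (a surrogate pair).
console.log(wu.length);                       // 2
console.log(wu.charCodeAt(0).toString(16));   // "d801" (high surrogate)
console.log(wu.charCodeAt(1).toString(16));   // "dc36" (low surrogate)
console.log(wu.codePointAt(0).toString(16));  // "10436"

// Indexing yields a lone surrogate, which has no uppercase mapping,
// so toUpperCase() returns it unchanged.
console.log(wu[0] === "\uD801");              // true
console.log(wu[0].toUpperCase() === wu[0]);   // true

// Uppercasing the whole pair works as expected.
console.log(wu.toUpperCase() === "\u{1040E}"); // true
```

This is why `str[0].toUpperCase()` silently does nothing for such input: it is handed half of a character.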

From ES2015 on, dealing with this became a bit easier. String.prototype[@@iterator] yields strings corresponding to codepoints*. So for example, we can do this:

// The default value keeps an empty string from throwing on first.toUpperCase().
function capitalizeFirstLetter([ first = '', ...rest ]) {
  return [ first.toUpperCase(), ...rest ].join('');
}
capitalizeFirstLetter("𐐶𐐲𐑌𐐼𐐲𐑉") // "𐐎𐐲𐑌𐐼𐐲𐑉"

For longer strings, this is probably not terribly efficient** — we don’t really need to iterate the remainder. We could use String.prototype.codePointAt to get at that first (possible) letter, but we’d still need to determine where the slice should begin. One way to avoid iterating the remainder would be to test whether the first codepoint is outside the BMP; if it isn’t, the slice begins at 1, and if it is, the slice begins at 2.

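function capitalizeFirstLetter(str) {
  if (str === '') return str; // codePointAt(0) would be undefined here
  const firstCP = str.codePointAt(0);
  // Codepoints above U+FFFF occupy two code units, so the rest starts at index 2.
  const index = firstCP > 0xFFFF ? 2 : 1;
  return String.fromCodePoint(firstCP).toUpperCase() + str.slice(index);
}
capitalizeFirstLetter("𐐶𐐲𐑌𐐼𐐲𐑉") // "𐐎𐐲𐑌𐐼𐐲𐑉"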

We can also make this work in ES5 and below by taking that logic a bit further if necessary. There are no intrinsic methods in ES5 for working with codepoints, so we have to manually test whether the first code unit is a surrogate***:

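function capitalizeFirstLetter(str) {
  // charAt works in pre-ES5 engines too, where str[0] is not guaranteed.
  var firstCodeUnit = str.charAt(0);
  // Not a surrogate: the first character is a single code unit.
  if (firstCodeUnit < '\uD800' || firstCodeUnit > '\uDFFF') {
    return firstCodeUnit.toUpperCase() + str.slice(1);
  }
  // Surrogate: treat the first two code units as one character.
  return str.slice(0, 2).toUpperCase() + str.slice(2);
}
capitalizeFirstLetter("𐐶𐐲𐑌𐐼𐐲𐑉") // "𐐎𐐲𐑌𐐼𐐲𐑉"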

At the start I also mentioned internationalization considerations. Some of these are very difficult to account for because they require knowledge not only of which language is being used, but sometimes also of specific words in that language. For example, the Irish digraph "mb" capitalizes as "mB" at the start of a word. And while the German eszett never begins a word (afaik), it shows the same problem in the other direction: lowercasing "SS" in German requires additional knowledge (it could be "ss" or it could be "ß", depending on the word).
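The German case is easy to reproduce with the built-in case methods. A minimal sketch (the word and variable names are just an example):

```javascript
// Uppercasing expands U+00DF (ß) to the two letters "SS"...
const upper = "straße".toUpperCase();
console.log(upper);                   // "STRASSE"

// ...and lowercasing has no way to know that "SS" came from "ß".
const roundTrip = upper.toLowerCase();
console.log(roundTrip);               // "strasse"
console.log(roundTrip === "straße");  // false
```

The round trip loses information, and no locale argument can restore it, because the right answer depends on the word itself.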

The most famous example of this issue, probably, is Turkish. In Turkish Latin, the capital form of i is İ, while the lowercase form of I is ı — they’re two different letters. Fortunately we do have a way to account for this:

// The default value keeps an empty string from throwing on first.toLocaleUpperCase().
function capitalizeFirstLetter([ first = '', ...rest ], locale) {
  return [ first.toLocaleUpperCase(locale), ...rest ].join('');
}
capitalizeFirstLetter("italya", "en") // "Italya"
capitalizeFirstLetter("italya", "tr") // "İtalya"

In a browser, the user’s most-preferred language tag is indicated by navigator.language, a list in order of preference is found at navigator.languages, and a given DOM element’s language can be obtained with Object(element.closest('[lang]')).lang || YOUR_DEFAULT_HERE.
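As a rough sketch of how the element lookup might be wrapped up, here is a hypothetical `resolveLang` helper (the name and the `"en"` default are my own assumptions, not part of any API) feeding the locale-aware function:

```javascript
// Hypothetical helper: find the language tag that applies to an element,
// or fall back to a caller-supplied default.
function resolveLang(element, fallback = "en") {
  // closest() returns null when neither the element nor any ancestor has [lang].
  const owner = element.closest("[lang]");
  return (owner && owner.lang) || fallback;
}

function capitalizeFirstLetter([ first = "", ...rest ], locale) {
  return [ first.toLocaleUpperCase(locale), ...rest ].join("");
}

// Browser usage (heading is assumed to be a DOM element):
// capitalizeFirstLetter(heading.textContent, resolveLang(heading));
```

Whether the fallback should be a fixed default or `navigator.language` depends on whether the text is your content or the user's, so treat that choice as application-specific.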

In all likelihood, people asking this question will not be concerned with Deseret capitalization or internationalization. But it’s good to be aware of these issues because there’s a good chance you’ll encounter them eventually even if they aren’t concerns presently. They’re not "edge" cases, or rather, they’re not by-definition edge cases — there’s a whole country where most people speak Turkish, anyway, and conflating code units with codepoints is a fairly common source of bugs (especially with regard to emoji). Both strings and language are pretty complicated!

* or surrogate code units, if orphaned

** maybe. I haven’t tested it. Unless you have determined capitalization is a meaningful bottleneck, I probably wouldn’t sweat it — choose whatever you believe is most clear and readable.

*** such a function might wish to test both the first and second code units instead of just the first, since it’s possible that the first unit is an orphaned surrogate. For example, given the input "\uD800x", the ES5 version would also capitalize the "x", which may or may not be expected.
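One possible shape for that stricter check, as an ES5-compatible sketch: the first two code units are treated as a single character only when they form a well-formed surrogate pair.

```javascript
function capitalizeFirstLetter(str) {
  var first = str.charCodeAt(0);   // NaN for an empty string
  var second = str.charCodeAt(1);  // NaN when there is no second code unit
  var isHigh = first >= 0xD800 && first <= 0xDBFF;
  var isLow = second >= 0xDC00 && second <= 0xDFFF;
  // Comparisons against NaN are false, so short strings fall through to length 1.
  var length = isHigh && isLow ? 2 : 1;
  return str.slice(0, length).toUpperCase() + str.slice(length);
}

capitalizeFirstLetter("\uD801\uDC36x"); // "\uD801\uDC0Ex"
capitalizeFirstLetter("\uD800x");       // "\uD800x" (the x is left alone)
capitalizeFirstLetter("");              // ""
```

With this version an orphaned surrogate is passed through untouched and nothing after it is capitalized.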


Semicolon
