Given a language, how to get its alphabet letters

Question 1

Is there a programmatic way (or an open-source repository) that, given a language (say, as a two-letter ISO 639-1 code), returns the letters of that language's alphabet?

For example:

console.log(getAlphabet('en'));

outputs:

a b c d ...

and

console.log(getAlphabet('he'));

outputs:

א ב ג ד ...

Question 2

And what about languages without alphabets?

Question 3

I think that there is no pre-built library that has this feature.

Question 4

Although the following question was asked for Python, it may help you: stackoverflow.com/q/61182560/42659

Question 5

Are you going to support diacritics? AÁÀÂÄÃEÉÈÊËeéèêë?

Question 6

I don't think that a language always has a well-defined alphabet associated with it. But in the Unicode CLDR data, the //ldml/characters/exemplarCharacters elements seem to contain a "representative section" of the letters typically used in a given language. The data comes in an open-source repository, see here for Hebrew, for example.

Using an XML parser library, you can write a function that loads the file based on the language code (in the example above, https://raw.githubusercontent.com/unicode-org/cldr/HEAD/common/main/he.xml for language code he) and locates the //ldml/characters/exemplarCharacters element in it.

Below is an example in client-side JavaScript. It uses a regular expression with the Unicode flag (u) to split the exemplarCharacters into individual letters, even if a letter is represented by more than one JavaScript character (UTF-16 code unit).

fetch("https://raw.githubusercontent.com/unicode-org/cldr/HEAD/common/main/he.xml")
  .then(r => r.text())
  .then(function(xml) {
    var dom = new DOMParser().parseFromString(xml, "text/xml");
    console.log(
      dom.evaluate("/ldml/characters/exemplarCharacters[1]", dom, undefined, XPathResult.STRING_TYPE)
        .stringValue
        .match(/[^ \[\]]/gu)
    );
  });

Alternatively, you could evaluate /ldml/characters/exemplarCharacters[@type='index'].
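
The splitting step can be tried on its own, without fetching anything. The following is only a sketch: the helper name splitExemplars is hypothetical and the sample strings are hard-coded for illustration rather than taken from CLDR. It shows that, with the u flag, the regular expression matches whole code points, so a letter outside the Basic Multilingual Plane stays a single array element. Note that the same expression would break a {}-enclosed entry such as {ch} into separate characters.

```javascript
// Hypothetical helper: split an exemplarCharacters string into code points.
function splitExemplars(exemplarString) {
  // With the "u" flag each match is one full code point,
  // not one UTF-16 code unit. match() returns null when nothing matches.
  return exemplarString.match(/[^ \[\]]/gu) ?? [];
}

// Illustrative sample: four Hebrew letters.
console.log(splitExemplars("[א ב ג ד]"));

// Phoenician letters (U+10900, U+10901) take two UTF-16 code units each,
// but each one is still returned as a single element.
console.log(splitExemplars("[𐤀 𐤁]").length);
```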

Question 7

The answer doesn't work for alphabets whose exemplarCharacters include letters that are more than one JS character long, or where several diacritic variants are listed together; for example, Czech has {ch}, as well as ií and uúů: github.com/unicode-org/cldr/blob/…. So I would suggest using <exemplarCharacters type="index"> instead, and parsing it as a space-separated array whose entries may be enclosed in {}.

Question 8

Indeed, the .sort() removes the diacritics, I have taken that out. And thanks for mentioning the type="index", I have included that.

Question 9

I think your suggestion deserves to be written down as a separate answer.

Question 10

Yep, I can write that down in an answer. Comments are limiting without line breaks or full code blocks, so it ends up very hard to read.

Question 11

Expanding on the other answer, which suggests using Unicode CLDR data, while addressing some shortcomings:

Some languages have alphabets with "letters" that are more than one JS character long, and some entries in the exemplar characters are sets of several diacritic variants (for example, Czech has {ch}, as well as ií and uúů). It is therefore more convenient to use the "index" type exemplar characters, which include only the characters used for indexing/searching (note that they are also capitalized). In most cases the index set lists a diacritic variant separately when the language treats it as a different letter. Where it doesn't (e.g. ЕЁ in Russian), it's best to use only the first variant. Some non-alphabetic languages, like Chinese, have a very large set of exemplar characters but are indexed using a much smaller set, which is likely the best one to use.
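
The parsing rule described above can be sketched in isolation from any network access. The helper name parseIndexExemplars is hypothetical, and the sample strings are made up for illustration. The sketch assumes the entries are separated by whitespace and contain no ranges or escape sequences.

```javascript
// Hypothetical helper: turn an index exemplar string such as
// "[A B C {CH} IÍ]" into an array with one entry per letter.
function parseIndexExemplars(indexString) {
  return indexString
    .replace(/^\[|\]$/g, "")   // drop the enclosing brackets
    .split(/\s+/)              // entries are assumed to be whitespace-separated
    .filter(Boolean)           // ignore empty pieces (e.g. for an empty set)
    .map(token =>
      token.startsWith("{") && token.endsWith("}")
        ? token.slice(1, -1)     // multi-character letter: {CH} -> CH
        : Array.from(token)[0]   // variant set: keep only the first code point
    );
}

console.log(parseIndexExemplars("[A B C {CH} IÍ]"));
// [ 'A', 'B', 'C', 'CH', 'I' ]
```

Array.from is used instead of substring so that a letter stored as two UTF-16 code units is not cut in half.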

With all that in mind, here is a function that fetches and parses them. It takes an IETF language tag as input and uses the JSON version of the CLDR data for simplicity:

async function getCharacters(languageTag) {
  const req = await fetch(`https://raw.githubusercontent.com/unicode-org/cldr-json/refs/heads/main/cldr-json/cldr-misc-full/main/${languageTag}/characters.json`);
  if (!req.ok) {
    return null;
  }
  try {
    const data = await req.json();
    // comes in the form of "[A B C {CH} IÍ]"
    const indexCharactersString = data.main?.[languageTag]?.characters?.index;
    if (!indexCharactersString) {
      return null;
    }
    const alphabetArray = indexCharactersString
      // removes []
      .substring(1, indexCharactersString.length - 1)
      // split by space, after we have either single-character letters, multi-character letters in {}, or diacritic sets like ЕЁ
      .split(" ")
      // for {}-encased letters, return everything inside {}, for non-{} letters, it's either a character or a diacritic set, so we just take the first character
      .map(char => char.startsWith("{") ? char.substring(1, char.length - 1) : char.substring(0, 1));
    return alphabetArray;
  }
  catch (e) {
    return null;
  }
}
getCharacters("cs").then(arr => console.log(arr));

For any language tag that exists in the CLDR data, the function returns the array of index characters; for tags that don't exist, it returns null.
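
Because the tag is used verbatim in the URL and as a key in the JSON, its spelling and casing matter. One possible way to make lookups more forgiving is to canonicalize the tag and then try progressively shorter tags. This is only a sketch: the helper name candidateTags is hypothetical, and whether CLDR actually has data for each candidate is an assumption the caller would have to check, for example by calling getCharacters on each candidate in turn.

```javascript
// Hypothetical helper: list the tags to try, from most to least specific.
function candidateTags(languageTag) {
  let canonical;
  try {
    // Intl.getCanonicalLocales fixes casing and throws a RangeError
    // for structurally invalid tags.
    [canonical] = Intl.getCanonicalLocales(languageTag);
  } catch (e) {
    return [];
  }
  if (!canonical) {
    return [];
  }
  const parts = canonical.split("-");
  const candidates = [];
  // Drop one trailing subtag at a time: zh-Hant-TW, zh-Hant, zh.
  while (parts.length > 0) {
    candidates.push(parts.join("-"));
    parts.pop();
  }
  return candidates;
}

console.log(candidateTags("ZH-hant-tw")); // [ 'zh-Hant-TW', 'zh-Hant', 'zh' ]
console.log(candidateTags("not a tag"));  // []
```

Rejecting invalid tags up front also keeps arbitrary strings out of the request URL.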

Heiko Theißen · Accepted Answer · 2024-03-14 16:56:44Z

