Is there a programmatic way (or an open-source repository) that, given a language (say, as a two-letter ISO code), returns the letters of that language's alphabet?

For example:

console.log(getAlphabet('en'));

outputs:

a b c d ... 

and

console.log(getAlphabet('he'));

outputs:

א ב ג ד ... 
Heiko Theißen
asked Mar 14, 2024 at 16:01
  • And what about languages without alphabets? Commented Mar 14, 2024 at 16:13
  • I think that there is no pre-built library that has this feature. Commented Mar 14, 2024 at 16:13
  • Although the following question was asked for Python, it may help you: stackoverflow.com/q/61182560/42659 Commented Mar 14, 2024 at 16:18
  • Are you going to support diacritics? AÁÀÂÄÃEÉÈÊËeéèêë? Commented Mar 14, 2024 at 17:29

2 Answers

I don't think that a language always has a well-defined alphabet associated with it. But in the Unicode CLDR (Common Locale Data Repository), the //ldml/characters/exemplarCharacters elements seem to contain a "representative section" of the letters typically used in a given language. The data comes in an open-source repository (unicode-org/cldr on GitHub) with one XML file per locale, for example he.xml for Hebrew.

Using an XML parser library, you can write a function that loads the file based on the language code (in the example above, https://raw.githubusercontent.com/unicode-org/cldr/HEAD/common/main/he.xml for language code he) and locates the //ldml/characters/exemplarCharacters element in it.

Below is an example in client-side JavaScript. It uses a regular expression with the Unicode flag (u) to split the exemplarCharacters into individual letters, so that a letter stays in one piece even if it is represented by more than one JavaScript character (UTF-16 code unit).

fetch("https://raw.githubusercontent.com/unicode-org/cldr/HEAD/common/main/he.xml")
  .then(r => r.text())
  .then(function(xml) {
    var dom = new DOMParser().parseFromString(xml, "text/xml");
    // Read the first exemplarCharacters element as a plain string
    var exemplars = dom.evaluate("/ldml/characters/exemplarCharacters[1]", dom, null, XPathResult.STRING_TYPE, null).stringValue;
    // Keep every code point that is not a space or a square bracket
    console.log(exemplars.match(/[^ \[\]]/gu));
  });

Alternatively, you could evaluate /ldml/characters/exemplarCharacters[@type='index'].
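
The regular expression above treats every code point separately, so a multi-character exemplar such as Czech {ch} would come out as {, c, h, }. A minimal sketch of a splitter that keeps {}-groups together could look like the following. Note that splitExemplars is a hypothetical name, and the sketch assumes the string contains no ranges (such as a-z) and no backslash escapes, both of which the exemplar syntax can also contain:

```javascript
// Hypothetical helper (not part of CLDR or any library): split an
// exemplarCharacters string such as "[a b c {ch} d]" into letters.
// Assumption: no ranges ("a-z") and no backslash escapes in the input.
function splitExemplars(exemplars) {
  const letters = [];
  // Match either a {...} group (captured without the braces) or a single
  // code point that is not whitespace, a square bracket, or a brace.
  const pattern = /\{([^}]+)\}|[^\s\[\]{}]/gu;
  for (const match of exemplars.matchAll(pattern)) {
    // match[1] is only defined when the {...} alternative matched
    letters.push(match[1] ?? match[0]);
  }
  return letters;
}

console.log(splitExemplars("[a b c {ch} d]"));
```

The same function can be applied to the string returned by either XPath expression.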

answered Mar 14, 2024 at 16:56

4 Comments

The answer doesn't work for alphabets whose exemplarCharacters contain entries that are more than one JS character long, or where several diacritic variants are listed together. For example, Czech has {ch} as well as uúů: github.com/unicode-org/cldr/blob/…. So I would suggest using <exemplarCharacters type="index"> instead, and parsing it as a space-separated array whose entries may be enclosed in {}.
Indeed, the .sort() removes the diacritics, I have taken that out. And thanks for mentioning the type="index", I have included that.
I think your suggestion deserves to be written down as a separate answer.
Yep, I can write that down in an answer. It's limiting not to have line breaks or full code blocks in comments, so it ends up very hard to read.

Expanding on the other answer, which suggests using Unicode CLDR data, while addressing some shortcomings:

Some languages have alphabets in which a "letter" is more than one JS character long, and the standard exemplar characters group some letters into sets of diacritic variants (for example, Czech has {ch} as well as uúů). It is therefore more convenient to use the "index" type exemplar characters, which include only the characters used for indexing/searching (note that they are capitalized). In most cases the index lists a diacritic variant separately when the language treats it as a distinct letter. Where it doesn't (e.g. её in Russian), it's best to use only the first variant. Some languages with non-alphabetic scripts, like Chinese, have a very large set of exemplar characters but are indexed using a much smaller set, which is likely the best one to use.

With all that in mind, here is a function that fetches and parses them. It takes an IETF language tag as input and uses the JSON version of the CLDR data for simplicity:

async function getCharacters(languageTag) {
  const url = `https://raw.githubusercontent.com/unicode-org/cldr-json/refs/heads/main/cldr-json/cldr-misc-full/main/${languageTag}/characters.json`;
  try {
    // fetch is inside the try block so that network failures also yield null
    const res = await fetch(url);
    if (!res.ok) {
      return null;
    }
    const data = await res.json();
    // comes in the form of "[A B C {CH} IÍ]"
    const indexCharactersString = data.main?.[languageTag]?.characters?.index;
    if (!indexCharactersString) {
      return null;
    }
    const alphabetArray = indexCharactersString
      // remove the surrounding []
      .slice(1, -1)
      // split on spaces; each entry is then a single-character letter,
      // a multi-character letter in {}, or a diacritic set like ЕЁ
      .split(" ")
      // drop empty entries in case of repeated spaces
      .filter(Boolean)
      // for {}-enclosed letters, return everything inside the {};
      // otherwise take only the first code point (the first variant of a set)
      .map(entry => entry.startsWith("{") ? entry.slice(1, -1) : Array.from(entry)[0]);
    return alphabetArray;
  } catch (e) {
    return null;
  }
}
getCharacters("cs").then(arr => console.log(arr));

For any language tag that exists in the CLDR data and has index characters, the function resolves to the alphabet as an array; for non-existing tags it resolves to null.
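
Because the lookup yields null for tags without data, a caller may want to retry with a shorter tag. Below is a rough sketch of that idea. Both fallbackTags and firstAvailable are hypothetical names, and simply dropping trailing subtags is an assumption made for illustration; it is not CLDR's own locale inheritance.

```javascript
// Hypothetical helper: list a tag followed by progressively shorter
// versions, e.g. "zh-Hant-TW" -> ["zh-Hant-TW", "zh-Hant", "zh"].
// Assumption: plain subtag truncation, not CLDR's real inheritance rules.
function fallbackTags(languageTag) {
  const subtags = languageTag.split("-");
  const candidates = [];
  for (let n = subtags.length; n >= 1; n--) {
    candidates.push(subtags.slice(0, n).join("-"));
  }
  return candidates;
}

// lookup is any async function (tag) => array | null, such as getCharacters.
async function firstAvailable(languageTag, lookup) {
  for (const tag of fallbackTags(languageTag)) {
    const result = await lookup(tag);
    if (result !== null) {
      return result;
    }
  }
  return null;
}
```

It could be called as firstAvailable("cs-CZ", getCharacters), which tries "cs-CZ" first and then "cs".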

answered Feb 5, 2025 at 10:32
