The Unicode Blog: CLDR

Friday, October 25, 2024

ICU 76 Released

Unicode® ICU 76 has just been released. ICU is the premier library for software internationalization, used by a wide array of companies and organizations to support the world's languages, implementing both the latest version of the Unicode Standard and of the Unicode locale data (CLDR).

ICU 76 updates to Unicode 16 (blog), including new characters and scripts, emoji, collation & IDNA changes, and corresponding APIs and implementations. It also updates to CLDR 46 (beta blog) locale data with new locales, significant updates to existing locales, and various additions and corrections. For example, the CLDR and Unicode default sort orders are now very nearly the same.

Most of the java.time (Temporal) types can now be formatted directly using the existing ICU4J date/time formatting classes.

There are some new APIs to make ICU easier to use with modern C++ and Java patterns. Most of the C/C++ APIs added for this purpose are implemented as C++ header-only APIs, and usable on top of binary stable C APIs, which is a first for ICU.

The Java and C++ technology preview implementations of the (also in tech preview) CLDR MessageFormat 2.0 specification have been updated to match recent changes.

ICU 76 and CLDR 46 are major releases, including a new version of Unicode and major locale data improvements.

For details, please see
https://unicode-org.github.io/icu/download/76.html.

Adopt a Character and Support Unicode’s Mission

Looking to give that special someone a special something?
Or maybe something to treat yourself?
🕉️💗🏎️🐨🔥🚀爱₿♜🍀

Adopt a character or emoji to give it the attention it deserves, while also supporting Unicode’s mission to ensure everyone can communicate in their own languages across all devices.

Each adoption includes a digital badge and certificate that you can proudly display!

Have fun and support a good cause

You can also donate funds or gift stock

As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

Posted by Unicode, Inc. at 1:31 PM

Labels: CLDR, CLDR 46, ICU, ICU 76, Unicode 16.0

Unicode CLDR 46 available

Unicode CLDR 46 is now available and has been integrated into version 76 of ICU.

The most significant data changes in this release were:

Updated to Unicode 16.0 (including major changes to collation)
Substantial additions and modifications of Emoji search keyword data
‘Upleveling’ the locale coverage (see below)

The most significant changes in the specification were:

Updates to Message Format in tech preview
Updates to conformance
New tech preview section on semantic skeletons

CLDR provides key building blocks for software to support the world's languages (dates, times, numbers, sort-order, etc.). For example, all major browsers and all modern mobile phones use CLDR for language support. (See Who uses CLDR?)

Via the Survey Tool, contributors supply data for their languages — data that is widely used to support much of the world’s software. This data is also a factor in determining which languages are supported on mobile phones and computer operating systems.

In version 46, the following levels were reached:

New / Upleveled Locales (new level: locales)

📈 Modern: Nigerian Pidgin, Tigrinya
📈 Moderate: Akan, Baluchi (Latin), Kangri, Tajik, Tatar, Wolof
📈 Basic: Ewe, Ga, Kinyarwanda, Konkani (Latin), Northern Sotho, Oromo, Sichuan Yi, Southern Sotho, Tswana
📉 Basic*: Chuvash, Anii

We are currently planning for CLDR 47 to be a closed release with no data submission period. The focus will be on improving the Survey Tool used for data submission, making necessary infrastructure changes, and some high priority data quality fixes.

For more information

See the CLDR 46 release page, which has information on accessing the data, reviewing charts of the changes, and, importantly, Migration issues.

Posted by Unicode, Inc. at 12:03 PM

Labels: CLDR, CLDR 46

Monday, May 20, 2024

Unicode CLDR Version 46 Submission Open

The Unicode CLDR Survey Tool is open for submission for version 46. CLDR provides key building blocks for software to support the world’s languages (dates, times, numbers, sort-order, etc.). All major browsers and all modern mobile phones use CLDR for language support. (See Who uses CLDR?)

Via the online Survey Tool, contributors supply data for their languages — data that is widely used to support much of the world’s software. This data is also a factor in determining which languages are supported on mobile phones and computer operating systems.

Version 46 is focusing on:

Unicode 16 additions: new emoji, script names, collation data (Chinese & Japanese), …
Emoji search keywords: Expanding keyword coverage to make it easier for users to find the right emoji
New Languages targeting Basic:
- Ewe (ee)
- Ga (gaa)
- Kinyarwanda (rw)
- Northern Sotho (nso)
- Oromo (om)
- Sesotho (st)
- Setswana (tn)
Up-leveling: Akan (ak)

Submission of new data opened recently, and is slated to finish on June 11. The new data then enters a vetting phase, where contributors work out which of the supplied data for each field is best. That vetting phase is slated to finish on July 1. A public alpha makes the draft data available around August 28, and the final release targets October 16.

Each new locale starts with a small set of Core data, such as a list of characters used in the language. Submitters of those locales need to bring the coverage up to Basic level (very basic dates, times, numbers, and endonyms) during the next submission cycle.

Once a language reaches Basic coverage, it has the minimum support for use in language selection, such as on mobile devices. In the next submission cycle, the name for that language is also added for translation for all languages at Modern coverage.

If you would like to contribute missing data for your language, see Survey Tool Accounts. For more information on contributing to CLDR, see the CLDR Information Hub.

Posted by Unicode, Inc. at 3:29 PM

Labels: CLDR, CLDR 46, Unicode

Thursday, April 18, 2024

Unicode CLDR v45 released

The Unicode CLDR v45 is now available and has been integrated into version 75 of ICU. The CLDR v45 release page has information on accessing the data, reviewing charts of the changes, and, importantly, Migration issues.

CLDR provides key building blocks for software to support the world's languages (dates, times, numbers, sort-order, etc.). For example, all major browsers and all modern mobile phones use CLDR for language support. (See Who uses CLDR?)

CLDR 45 did not have a Survey Tool submission phase, and focused on tooling and just a few functional areas:

MessageFormat 2.0 Tech Preview

Software needs to construct messages that incorporate various pieces of information. The complexities of the world's languages make this challenging. The goal for MessageFormat 2.0 is to allow developers and translators to create natural-sounding, grammatically correct user interfaces that can appear in any language and support the needs of various cultures.

The new MessageFormat defines the data model, syntax, processing, and conformance requirements for the next generation of dynamic messages. It is intended for adoption by programming languages, software libraries, and software localization tooling. It enables the integration of internationalization APIs (such as date or number formats), and grammatical matching (such as plurals or genders). It is extensible, allowing software developers to create formatting or message selection logic that adds on to the core capabilities. Its data model provides the means of representing existing syntaxes, thus enabling gradual adoption by users of older formatting systems.
See also:

UTW { } MessageFormat v2 (November 7, 2023)
Message Format Virtual Open House (February 20, 2024)
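
The "grammatical matching (such as plurals)" idea described above can be illustrated with a minimal sketch. This is not MessageFormat 2.0 syntax and not the ICU API: the message table, the "*" fallback key, and the function names are all hypothetical, and the plural rule is a simplified English-only rule for whole numbers.

```python
from typing import Callable, Mapping


def english_plural_category(n: int) -> str:
    """Simplified English cardinal rule for integers: 'one' for 1, else 'other'."""
    return "one" if n == 1 else "other"


# Hypothetical message with one variant per plural category.
# "*" is used here as a catch-all fallback key (an assumption of this sketch).
NOTIFICATIONS: Mapping[str, str] = {
    "one": "You have {count} new notification.",
    "*": "You have {count} new notifications.",
}


def format_message(
    variants: Mapping[str, str],
    selector: Callable[[int], str],
    count: int,
) -> str:
    """Pick the variant matching the plural category, then fill in the value."""
    category = selector(count)
    pattern = variants.get(category, variants["*"])
    return pattern.format(count=count)


if __name__ == "__main__":
    for n in (0, 1, 5):
        print(format_message(NOTIFICATIONS, english_plural_category, n))
```

A translator for a language with more plural categories would supply more variants and a different selector. The calling code stays the same, which is the separation the specification aims for.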

Keyboard 3.0 stable version

Keyboard support for digitally disadvantaged languages (DDLs) is often lacking or inconsistent between platforms. The updated LDML Keyboard 3.0 format specifies an interchange format for keyboard data. This will allow keyboard authors to create a single mapping file for their language, which implementations can use to provide that language’s keyboard mapping on their own platform. This format allows both physical and virtual (that is, on-screen or touch) keyboard layouts for a language to be defined in a single file.

See also:

CLDR, Beyond Locale Data (June 22, 2023)

Tooling changes

Many tooling changes are difficult to accommodate in a data-submission release, including performance work and UI improvements. The changes in v45 provide faster turn-around for linguists and higher data quality. They are targeted at the v46 submission period, starting in May, 2024.

Posted by Unicode, Inc. at 10:08 AM

Labels: CLDR, CLDR 45, LDML Keyboard, message format 2

Wednesday, April 17, 2024

ICU 75 Released

Unicode® ICU 75 has just been released. ICU is the premier library for software internationalization, used by a wide array of companies and organizations to support the world's languages, implementing both the latest version of the Unicode Standard and of the Unicode locale data (CLDR). ICU 75 updates to CLDR 45 (beta blog) locale data with new locales and various additions and corrections. C++ code now requires C++17 (C code now requires C11) and is being made more robust.

The CLDR MessageFormat 2.0 specification is now in technology preview, together with a corresponding update of the ICU4J (Java) tech preview and a new ICU4C (C++) tech preview.

For details, please see https://icu.unicode.org/download/75.

Posted by Unicode, Inc. at 1:11 PM

Labels: CLDR, CLDR 45, ICU, ICU 75

Tuesday, March 5, 2024

Unicode CLDR v45 Alpha available for testing

The Unicode CLDR v45 Alpha is now available for integration testing.

Via the online Survey Tool, contributors supply data for their languages — data that is widely used to support much of the world’s software. This data is also a factor in determining which languages are supported on mobile phones and computer operating systems.

The alpha has already been integrated into the development version of ICU. We would especially appreciate feedback from non-ICU consumers of CLDR data and on Migration issues. Feedback can be filed at CLDR Tickets.

CLDR 45 is a closed release with no submission period, focusing on just a few areas:

MessageFormat 2.0 Tech Preview

The new MessageFormat defines the data model, syntax, processing, and conformance requirements for the next generation of dynamic messages. It is intended for adoption by programming languages, software libraries, and software localization tooling. It enables the integration of internationalization APIs (such as date or number formats), and grammatical matching (such as plurals or genders). It is extensible, allowing software developers to create formatting or message selection logic that adds on to the core capabilities. Its data model provides a means of representing existing syntaxes, thus enabling gradual adoption by users of older formatting systems.

Keyboard 3.0 stable version

Keyboard support for digitally disadvantaged languages is often lacking or inconsistent between platforms. The updated LDML Keyboard 3.0 format specifies an interchange format for keyboard data. This will allow keyboard authors to create a single mapping file for their language, which implementations can use to provide that language’s keyboard mapping on their own platform. This format allows both physical and virtual (that is, on-screen or touch) keyboard layouts for a language to be defined in a single file.

Tooling changes

For more information

See the draft CLDR v45 release page, which has information on accessing the data, reviewing charts of the changes, and, importantly, Migration issues.

Posted by Unicode, Inc. at 7:14 AM

Labels: alpha, CLDR, CLDR 45

Tuesday, October 31, 2023

ICU 74 Released

Unicode® ICU 74 has just been released. ICU is the premier library for software internationalization, used by a wide array of companies and organizations to support the world's languages, implementing both the latest version of the Unicode Standard and of the Unicode locale data (CLDR). ICU 74 updates to Unicode 15.1, and to CLDR 44 locale data with various additions and corrections.

ICU 74 and CLDR 44 are major releases, including a new version of Unicode and major locale data improvements. They subsume the changes for the ICU 73.2 and CLDR 43.1 maintenance releases.

Unicode 15.1 adds source code security mechanisms, improves line breaking for southeast Asian scripts, and adds important CJK unified ideographs.

CLDR 44 has added or improved data for a number of languages that have been newly added to ICU, and has improved measurement unit handling, conversion, and formatting.

ICU 74 implements these improvements, adds new C APIs for locale handling, adds a plug-in API for word segmentation, and switches the Java build system to Maven.

For details, please see https://icu.unicode.org/download/74.

Support Unicode
To support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider adopting a character, making a gift of stock, or making a donation. As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

Posted by Unicode, Inc. at 2:16 PM

Labels: CLDR, CLDR 44, ICU, ICU 74

Unicode CLDR v44 available

Unicode CLDR version 44 is now available and has been integrated into version 74 of ICU. In CLDR 44, the focus is on:

Formatting Person Names. Added further enhancements (data and structure) for formatting people's names. For more information on why this feature is being added and what it does, see Background.
Emoji 15.1 Support. Added short names, keywords, and sort-order for the new Unicode 15.1 emoji.
Unicode 15.1 additions. Made the regular additions and changes for a new release of Unicode, including names for new scripts, collation data for Han characters, etc.
Digitally disadvantaged language coverage. Work began to improve DDL coverage, with the following DDL locales now having higher coverage levels:
1. Modern: Cherokee, Lower Sorbian, Upper Sorbian
2. Moderate: Anii, Interlingua, Kurdish, Māori, Venetian
3. Basic: Esperanto, Interlingue, Kangri, Kuvi, Kuvi (Devanagari), Kuvi (Odia), Kuvi (Telugu), Ligurian, Lombard, Low German, Luxembourgish, Makhuwa, Maltese, N’Ko, Occitan, Prussian, Silesian, Swampy Cree, Syriac, Toki Pona, Uyghur, Western Frisian, Yakut, Zhuang

CLDR provides key building blocks for software to support the world's languages (dates, times, numbers, sort-order, etc.). For example, all major browsers and all modern mobile phones use CLDR for language support. (See Who uses CLDR?)

Via the online Survey Tool, contributors supply data for their languages — data that is widely used to support much of the world’s software. This data is also a factor in determining which languages are supported on mobile phones and computer operating systems.

There are many other changes: to find out more, see the CLDR v44 release page, which has information on accessing the data, reviewing charts of the changes, and, importantly, Migration issues.

In version 44, the following levels were reached:

v44 coverage levels (level, number of languages, usage, example languages):

Modern (95): Suitable for full UI internationalization
čeština, Deutsch, français, Kiswahili, Magyar, O‘zbek, Română, Tiếng Việt, Ελληνικά, Беларуская, ᏣᎳᎩ, Ქართული, Հայերեն, עברית, اردو, አማርኛ, नेपाली, অসমীয়া, বাংলা, ਪੰਜਾਬੀ, ગુજરાતી, ଓଡ଼ିଆ, தமிழ், తెలుగు, ಕನ್ನಡ, മലയാളം, සිංහල, ไทย, ລາວ, မြန်မာ, ខ្មែរ, 한국어, 中文, 日本語, …

Moderate (13): Suitable for “document content” internationalization, e.g. in a spreadsheet
brezhoneg, føroyskt, IsiXhosa, sardu, чӑваш, …

Basic (50): Suitable for locale selection, e.g. choice of language on a mobile phone
asturianu, Rumantsch, Māori, Wolof, тоҷикӣ, کٲشُر, ትግርኛ, कॉशुर, মৈতৈলোন্, ᱥᱟᱱᱛᱟᱲᱤ, …

We are currently planning for CLDR version 45 to be a closed release with no submission period. The focus will be on improving the Survey Tool used for data submission, making necessary infrastructure changes, and some high priority data quality fixes.

Posted by Unicode, Inc. at 10:59 AM

Labels: CLDR, CLDR 44, DDL

Thursday, September 14, 2023

Unicode CLDR v44 Alpha available for testing

The Unicode CLDR v44 Alpha is now available for integration testing.

CLDR provides key building blocks for software to support the world's languages (dates, times, numbers, sort-order, etc.). For example, all major browsers and all modern mobile phones use CLDR for language support. (See Who uses CLDR?)

Via the online Survey Tool, contributors supply data for their languages — data that is widely used to support much of the world’s software. This data is also a factor in determining which languages are supported on mobile phones and computer operating systems.

The alpha has already been integrated into the development version of ICU. We would especially appreciate feedback from non-ICU consumers of CLDR data and on Migration issues. Feedback can be filed at CLDR Tickets.

Alpha means that the main data and charts are available for review, but the specification, JSON data, and other components are not yet ready for review. Some data may change if showstopper bugs are found. The planned schedule is:

Sep 27 — Beta (data)
Oct 04 — Beta2 (spec)
Nov 01 — Release

In CLDR 44, the focus is on:

Formatting Person Names. Added further enhancements (data and structure) for formatting people's names. For more information on why this feature is being added and what it does, see Background.
Emoji 15.1 Support. Added short names, keywords, and sort-order for the new Unicode 15.1 emoji.
Unicode 15.1 additions. Made the regular additions and changes for a new release of Unicode, including names for new scripts, collation data for Han characters, etc.
Digitally disadvantaged language coverage. Work began to improve DDL coverage, with the following DDL locales now having higher coverage levels:
1. Modern: Cherokee, Lower Sorbian, Upper Sorbian
2. Moderate: Anii, Interlingua, Kurdish, Māori, Venetian
3. Basic: Esperanto, Interlingue, Kangri, Kuvi, Kuvi (Devanagari), Kuvi (Odia), Kuvi (Telugu), Ligurian, Lombard, Low German, Luxembourgish, Makhuwa, Maltese, N’Ko, Occitan, Prussian, Silesian, Swampy Cree, Syriac, Toki Pona, Uyghur, Western Frisian, Yakut, Zhuang

There are many other changes: to find out more, see the draft CLDR v44 release page, which has information on accessing the data, reviewing charts of the changes, and, importantly, Migration issues.

In version 44, the following levels were reached:

v44 coverage levels (level, number of languages, usage, example languages):

Modern (95): Suitable for full UI internationalization

Moderate (13): Suitable for “document content” internationalization, e.g. in a spreadsheet
brezhoneg, føroyskt, IsiXhosa, sardu, чӑваш, …

Basic (50): Suitable for locale selection, e.g. choice of language on a mobile phone
asturianu, Rumantsch, Māori, Wolof, тоҷикӣ, کٲشُر, ትግርኛ, कॉशुर, মৈতৈলোন্, ᱥᱟᱱᱛᱟᱲᱤ, …

Posted by Unicode, Inc. at 10:13 AM

Labels: CLDR, CLDR 44

Thursday, June 15, 2023

ICU 73.2 & CLDR 43.1 released: GB18030 compliance updates & compatibility fixes

Unicode® ICU 73.2 and CLDR 43.1 have just been released.

ICU is the premier library for software internationalization, used by a wide array of companies and organizations to support the world's languages, implementing both the latest version of the Unicode Standard and of the Unicode locale data (CLDR).
CLDR provides key building blocks for software to support the world's languages (dates, times, numbers, sort-order, etc.). All major browsers and all modern mobile phones use CLDR for language support. (See Who uses CLDR?)

There are significant changes for GB18030-2022 compliance support:

CLDR extends the support for “short” Chinese sort orders to cover some additional, required characters for Level 2. This is carried over into ICU collation.
ICU has a modified character conversion table, mapping some GB18030 characters to Unicode characters that were encoded after GB18030-2005.

There are also changes for compatibility:

There are optional variants of time formats with AM/PM (only for English) that use ASCII spaces in CLDR; these can also be used in ICU via custom data generation. This is intended to help certain implementers transition to the improved patterns, which have used a narrow no-break space between the time and AM/PM since CLDR 42.
- For how to generate ICU data with this option, look for alt="ascii" in tools/cldr/cldr-to-icu/README.md
The changes to the word segmentation behavior of the @ sign that were in CLDR 42 (ICU 72) have been reverted. These caused problems for certain parsers that did not expect @ to join to letters.
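
To illustrate why the narrow no-break space (U+202F) can trip up existing code, here is a minimal sketch. The naive parser and the helper names are hypothetical and are not part of CLDR or ICU. The sketch only shows that a parser splitting on an ASCII space fails on the newer pattern output unless the text is normalized first.

```python
NNBSP = "\u202f"  # NARROW NO-BREAK SPACE, used between the time and AM/PM


def to_ascii_spaces(formatted: str) -> str:
    """Replace narrow no-break spaces with plain ASCII spaces."""
    return formatted.replace(NNBSP, " ")


def split_time_and_period(text: str) -> tuple[str, str]:
    """Hypothetical legacy parser that expects an ASCII space before AM/PM."""
    time_part, separator, period = text.partition(" ")
    if not separator:
        raise ValueError(f"no ASCII space found in {text!r}")
    return time_part, period


if __name__ == "__main__":
    formatted = f"3:45{NNBSP}PM"  # shape of output using the newer patterns
    try:
        split_time_and_period(formatted)
    except ValueError as error:
        print("legacy parser failed:", error)
    print(split_time_and_period(to_ascii_spaces(formatted)))
```

Normalizing at the parsing boundary is only a stopgap. Where possible, avoid parsing formatted display strings at all.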

ICU 73.2 updates to CLDR 43.1 locale data. These are maintenance releases for ICU 73 and CLDR 43, with limited sets of bug fixes and no API or structural changes. ICU 73.2 and CLDR 43.1 include several other bug fixes, including fixes to person name formatting and Cyrillic transforms.

For details, please see:

ICU 73.2 Release Note: ICU 73.2 maintenance release
CLDR 43.1 Release Note: Version 43.1 Changes

Posted by Unicode, Inc. at 1:31 PM

Labels: CLDR, cldr 43, ICU, ICU 73

Thursday, June 1, 2023

Unlocking the Power of CLDR Person Name Formatting: A Solution for Formatting Names in a Globalized World

By Mike McKenna, Chair of CLDR Person Names Subcommittee

CLDR Person Names has moved from “tech preview” to “draft” status and is available for initial testing by implementors through ICU4J.

How a person’s name is displayed and used can convey respect, familiarity, or even be interpreted as rude if used improperly. That’s why it’s important to format names correctly, especially because naming practices vary across the globe. In many cultures, names can indicate gender, status, birthplace, nationality, ethnicity, religion, and more.

Until now, there have been no good standards for how to format people’s names in various contexts. A number of Unicode members wanted to address this problem and provide a mechanism that anyone could use to format people’s names in a wide variety of applications, such as contact lists, air travel, billing applications, CRMs, social media, and any other application that asks for user information and presents it back to the user or others.

The Unicode® Person Name Formats specification defines patterns used to take a person’s name and format it correctly in a given language or locale depending on a chosen context. With the Unicode Common Locale Data Repository (CLDR), locale codes and name sequences can be selected to create a specific pattern for formatting a person’s name, including preferences for formal, informal, or abbreviated versions. As a result, designers and developers can correctly display names according to the user’s native locale and culture, which is especially important when integrating names in different character scripts, such as Japanese, Chinese, or Russian.
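
The idea of selecting a pattern by locale and context can be sketched as follows. The pattern table below is invented for illustration and is not actual CLDR data. The field names and function are hypothetical and much simpler than the real specification, and this is not the ICU PersonNameFormatter API.

```python
class _BlankIfMissing(dict):
    """Mapping that yields an empty string for name fields that are absent."""

    def __missing__(self, key: str) -> str:
        return ""


# Illustrative patterns only (an assumption of this sketch, not CLDR data).
# "hu" is included because Hungarian conventionally places the surname first.
NAME_PATTERNS = {
    ("en", "formal"): "{title} {given} {surname}",
    ("en", "informal"): "{given}",
    ("hu", "formal"): "{title} {surname} {given}",
    ("hu", "informal"): "{given}",
}


def format_name(name: dict, locale: str, formality: str) -> str:
    """Fill the pattern for (locale, formality) and tidy up extra spaces."""
    pattern = NAME_PATTERNS[(locale, formality)]
    text = pattern.format_map(_BlankIfMissing(name))
    return " ".join(text.split())  # collapse gaps left by empty fields


if __name__ == "__main__":
    person = {"title": "Dr.", "given": "Mary", "surname": "Smith"}
    print(format_name(person, "en", "formal"))
    print(format_name(person, "hu", "formal"))
    print(format_name(person, "en", "informal"))
```

Keeping the patterns in data rather than in code is what lets each locale supply its own field order and formality conventions.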

The Unicode Consortium added Person Name formatting to CLDR in v42, and it has been refined and enhanced for v43, which was released in April. In CLDR v43, with the help of linguists from around the world, we completed data for formatting people’s names for CLDR locales at modern coverage. The specification's formal name is "Unicode Technical Standard #35 Unicode Locale Data Markup Language (LDML); Part 8: Person Names". ICU has added the PersonNameFormatter class, which is available in ICU 73.

To learn more, and get an idea of the implications for user experience and application design, see the following paper, which provides an illustration of the many contexts in which names can be formatted through CLDR Person Names.

LDML (UTS#35) Part 8: Person Names - a story teller’s case study

Posted by Unicode, Inc. at 11:26 AM

Labels: CLDR, cldr 43, CLDR 44, ICU, person names

Thursday, April 13, 2023

ICU 73 Released

Unicode® ICU 73 has just been released. ICU is the premier library for software internationalization, used by a wide array of companies and organizations to support the world's languages, implementing both the latest version of the Unicode Standard and of the Unicode locale data (CLDR). ICU 73 updates to CLDR 43 locale data with various additions and corrections.

ICU 73 improves Japanese and Korean short-text line breaking, reduces C++ memory use in date formatting, and promotes the Java person name formatter from tech preview to draft.

ICU 73 and CLDR 43 are minor releases, mostly focused on bug fixes and small enhancements. (The fall CLDR/ICU releases will update to Unicode 15.1 which is planned for September.)

ICU 73 updates to the time zone data version 2023c (March 2023). Note that pre-1970 data for a number of time zones has been removed, as has been the case in the upstream tzdata release since 2021b.

For details, please see https://icu.unicode.org/download/73.

Posted by Unicode, Inc. at 3:44 PM

Labels: CLDR, cldr 43, ICU, ICU 73

Wednesday, April 12, 2023

Unicode CLDR v43 released

Formatting Person Names
- Completing the data for formatting person names, allowing it to advance out of “tech preview”. For more information on the benefits of this feature, see Background.
Locales
- Adding substantially to the LikelySubtags data: This is used to find the likely writing system and country for a given language, used in normalizing locale identifiers and inheritance. The data has been contributed by SIL.
- Inheritance: Adding components to parentLocales, and documenting the different inheritance for rgScope data, which inherits primarily by region.
Other data updates
- In English, Türkiye is now the primary country name for the country code TR, and Turkey is available as an alternate. Other locales have been reviewed to see whether similar changes would be appropriate.
- Name for the new timezone Ciudad Juárez.
Structure
- Adding some structure and data needed for ICU4X & JavaScript, for calendar eras and parentLocales.
Collation & Searching
- Treat various quote marks as equivalent at a Primary strength, also including Geresh and Gershayim.
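
As a rough illustration of how LikelySubtags data is used to find the likely writing system and country for a language, here is a simplified sketch. The table is a tiny hand-picked excerpt for illustration only, and the lookup order is a simplification of the algorithm in the LDML specification. The function name is hypothetical.

```python
from typing import Optional

# Tiny illustrative excerpt (an assumption of this sketch); the real CLDR
# data set is far larger.
LIKELY_SUBTAGS = {
    "en": "en-Latn-US",
    "zh": "zh-Hans-CN",
    "zh-TW": "zh-Hant-TW",
    "sr": "sr-Cyrl-RS",
}


def add_likely_subtags(
    language: str,
    script: Optional[str] = None,
    region: Optional[str] = None,
) -> Optional[str]:
    """Fill in a missing script and/or region from the most specific match."""
    keys = []
    if script and region:
        keys.append(f"{language}-{script}-{region}")
    if region:
        keys.append(f"{language}-{region}")
    if script:
        keys.append(f"{language}-{script}")
    keys.append(language)

    for key in keys:
        if key in LIKELY_SUBTAGS:
            _, likely_script, likely_region = LIKELY_SUBTAGS[key].split("-")
            # Subtags supplied by the caller always win over the likely ones.
            return "-".join((language, script or likely_script, region or likely_region))
    return None  # no data for this language in the excerpt


if __name__ == "__main__":
    print(add_likely_subtags("zh"))
    print(add_likely_subtags("zh", region="TW"))
    print(add_likely_subtags("sr", script="Latn"))
```

The second call shows why the region matters: Chinese as used in Taiwan is most likely written in Traditional Han script, so the more specific key takes precedence over the bare language.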

To find out more about these and other changes, see the CLDR v43 release page.

Posted by Unicode, Inc. at 3:39 PM

Labels: CLDR, cldr 43

Thursday, March 30, 2023

The Unicode CLDR v43 Beta is now available for integration testing

CLDR provides key building blocks for software to support the world's languages (dates, times, numbers, sort-order, etc.). For example, all major browsers and all modern mobile phones use CLDR for language support. (See Who uses CLDR?)

Via the online Survey Tool, contributors supply data for their languages — data that is widely used to support much of the world’s software. This data is also a factor in determining which languages are supported on mobile phones and computer operating systems.

It is important to review the Migration section for changes that might require action by implementations using CLDR directly or indirectly (e.g., via ICU), and the Specification changes, since those are new since the Alpha.

We appreciate feedback from both ICU and non-ICU consumers of CLDR data. (The Beta has already been integrated into the development version of ICU.) Feedback can be filed at CLDR Tickets. Any tickets should be filed as soon as possible, because the target release date is 2023 Apr 12, Wed.

CLDR 43 is a limited-submission release, focusing on just a few areas:

Formatting Person Names
- Completing the data for formatting person names, allowing it to advance out of “tech preview”. For more information on the benefits of this feature, see Background.
Locales
- Adding substantially to the LikelySubtags data: This is used to find the likely writing system and country for a given language, used in normalizing locale identifiers and inheritance. The data has been contributed by SIL
- Inheritance: Adding components to parentLocales, and documenting the different inheritance for rgScope data, which inherits primarily by region
Other data updates
- Alternate names for Turkey / Türkiye
- Name for the new timezone Ciudad Juárez
Structure
- Adding some structure and data needed for ICU4X & JavaScript, for calendar eras and parentLocales.
Collation & Searching
- Treat various quote marks as equivalent at a Primary strength, also including Geresh and Gershayim.
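
To make the LikelySubtags item above concrete, here is a minimal sketch of how such data can fill in a missing script and region. The table is a tiny hand-copied excerpt for illustration, and `LIKELY_SUBTAGS` and `add_likely_subtags` are hypothetical names; the full lookup procedure in the CLDR specification covers more cases than this simplified version.

```python
# Tiny illustrative excerpt; the real CLDR likely-subtags data is far larger.
LIKELY_SUBTAGS = {
    "en": "en-Latn-US",
    "ja": "ja-Jpan-JP",
    "sr": "sr-Cyrl-RS",
    "zh": "zh-Hans-CN",
    "zh-TW": "zh-Hant-TW",
    "und": "en-Latn-US",
}


def add_likely_subtags(tag: str) -> str:
    """Fill in a missing script and/or region for a simple locale tag."""
    parts = tag.split("-")
    language = parts[0]
    # Simplified parsing: a 4-letter subtag is a script, a 2- or 3-character one a region.
    script = next((p for p in parts[1:] if len(p) == 4), "")
    region = next((p for p in parts[1:] if len(p) in (2, 3)), "")

    # Try the most specific key first, then progressively less specific ones.
    candidates = [
        "-".join(p for p in (language, script, region) if p),
        "-".join(p for p in (language, region) if p),
        "-".join(p for p in (language, script) if p),
        language,
    ]
    for key in candidates:
        match = LIKELY_SUBTAGS.get(key)
        if match:
            m_lang, m_script, m_region = match.split("-")
            # Keep what the caller supplied; only fill in what is missing.
            out_lang = m_lang if language == "und" else language
            return "-".join((out_lang, script or m_script, region or m_region))
    return tag  # Unknown language: leave the tag unchanged.


print(add_likely_subtags("zh"))       # zh-Hans-CN
print(add_likely_subtags("zh-TW"))    # zh-Hant-TW
print(add_likely_subtags("sr-Latn"))  # sr-Latn-RS
```

Note that the caller's own subtags always win: `sr-Latn` keeps its Latin script and only gains a region.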

To find out more about these and other changes, see the draft CLDR v43 release page, which has information on accessing the data, reviewing charts of the changes, and, importantly, Migration issues.

Posted by Unicode, Inc. at 9:50 AM


Labels: beta, CLDR, cldr 43

Thursday, February 23, 2023

The Unicode CLDR v43 Alpha is now available for integration testing

CLDR provides key building blocks for software to support the world's languages (dates, times, numbers, sort-order, etc.). For example, all major browsers and all modern mobile phones use CLDR for language support. (See Who uses CLDR?)

Via the online Survey Tool, contributors supply data for their languages — data that is widely used to support much of the world’s software. This data is also a factor in determining which languages are supported on mobile phones and computer operating systems.

The Alpha has already been integrated into the development version of ICU. We would especially appreciate feedback from non-ICU consumers of CLDR data and on Migration issues. Feedback can be filed at CLDR Tickets.

Alpha means that the main data and charts are available for review, but the specification, JSON data, and other components are not yet ready for review. Data may change if release-blocking bugs are found. The planned schedule is:

2023 Mar 15, Wed — public Beta (data)
2023 Mar 29, Wed — public Beta2 (data & spec)
2023 Apr 12, Wed — Release

CLDR 43 is a limited-submission release, focusing on just a few areas:

Formatting Person Names
- Completing the data for formatting person names, allowing it to advance out of “tech preview”. For more information on the benefits of this feature, see Background.
Adding substantially to the LikelySubtags data
- This is used to find the likely writing system and country for a given language, used in normalizing locale identifiers and inheritance.
- The data has been contributed by SIL.
Other data updates
- Alternate names for Turkey / Türkiye
- Name for the new timezone Ciudad Juárez
Structure
- Adding some structure and data needed for ICU4X & JavaScript, for calendar eras and parentLocales.
- Cleanup of the inheritance structure in CLDR
Collation & Searching
- Treat various quote marks as equivalent at a Primary strength, also including Geresh and Gershayim.

Posted by Unicode, Inc. at 1:33 PM


Labels: alpha, CLDR, cldr 43

Wednesday, December 21, 2022

Unicode in 2022


Hello Everyone!

As we go into the New Year, the Unicode team thought we’d share some highlights from this past year. From source-code spoofing to preserving indigenous languages, the Unicode team has had another full year, including expanding the number of characters that appear on billions of devices around the world.

Nearly 150,000 characters!

On the character side, we reached a total of just shy of 150,000 characters (149,186 to be exact). Of the 4,489 characters added in the 15.0 release, the biggest set was 4,192 ideographs for use in Chinese, Japanese, and Korean. There are also two new scripts, Nag Mundari and Kawi. Nag Mundari is a script used to write the Mundari language of India, a language with 1.1 million speakers. Kawi is an important historic script of insular Southeast Asia, found in inscriptions and on artifacts in several languages dating from the 8th to the 16th centuries — and is undergoing a revival today amongst enthusiasts.

And we can’t forget the 20 new emoji characters. We’re looking forward to seeing which are the most popular: shaking face? Goose? Maracas? Pink heart? If you’re involved in implementing emoji, you’ll also want to look at the latest changes in UTS #51 Unicode Emoji.

See the Unicode 15.0.0 page for more details. We’re also changing how we do releases; for more, see 2023 Release Planning.

The Launch of ICU4X

ICU is used in every major device and operating system; it’s how you see a date or number on your phone, for example. This new project, ICU4X, was created to solve the needs of clients who wish to provide client-side internationalization for their products in resource-constrained environments and across many programming languages. After 2½ years of work by Google, Mozilla, Amazon, and community partners, the Unicode Consortium has published ICU4X 1.0, its first stable release. Built from the ground up to be lightweight, portable, and secure, ICU4X learns from decades of experience to bring localized date formatting, number formatting, collation, text segmentation, and more to devices that, until now, did not have a suitable solution. For details, see Announcing ICU4X 1.0.

When does i ≠ і?

Can you tell the difference between i and і? Yeah, most people can’t. The first set of changes to help counter source-code spoofing were included in the 15.0 versions of the UAX #9 Unicode Bidirectional Algorithm, UAX #31 Unicode Identifier and Pattern Syntax, and UTS #39 Unicode Security Mechanisms.
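
A quick way to see the difference is to look at the code points rather than the glyphs. The sketch below assumes the second letter in the heading is Cyrillic U+0456; the `rough_scripts` helper is hypothetical and uses the first word of each character name as a crude stand-in for the script property, so it is not the mixed-script detection defined in UTS #39.

```python
import unicodedata

latin_i = "\u0069"     # LATIN SMALL LETTER I
cyrillic_i = "\u0456"  # assumed to be the look-alike used in the heading

for ch in (latin_i, cyrillic_i):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")


def rough_scripts(text: str) -> set:
    """Crude heuristic: collect the first word of each letter's Unicode name."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


# An identifier that looks like "pay" but contains Cyrillic U+0430 in the middle.
spoofed = "p\u0430y"
print(rough_scripts("pay"))    # only LATIN
print(rough_scripts(spoofed))  # LATIN and CYRILLIC
```

Tooling that flags identifiers mixing scripts in this way is one of the mitigations the specifications above are concerned with.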

For 2023, there is a new draft UTS #55 Unicode Source Code Handling, providing guidance for programming language designers and tooling developers, and specifying mechanisms to avoid usability and security issues arising from improper handling of Unicode. More changes are on their way for UAX #9, UAX #31, and UTS #39 as well.

Åge Møller, Πέτρος Νικόλαος Καρατζής, ராஜேந்திர சோழன்

We’re making great progress on internationalized formatting of people’s names. What does that mean? Software needs to be able to format people's names, such as John Smith or 宮崎駿. The formatting can be surprisingly complicated: for example, people may have a different number of names, depending on their culture — they might have only one name (“Zendaya”), only two (“Albert Einstein”), or three or more. So the software needs to handle missing or extra name fields gracefully.

There are many more complexities — for more details, see Formatting people’s names.
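
As a rough illustration of handling missing or extra name fields gracefully, the sketch below joins whichever fields are present in a chosen order. `PersonName` and `format_name` are hypothetical names invented for this example; they are not the CLDR data format or the ICU API, which handle many more fields, lengths, and formality levels.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PersonName:
    given: Optional[str] = None
    given2: Optional[str] = None
    surname: Optional[str] = None


def format_name(name: PersonName, order: str = "givenFirst", separator: str = " ") -> str:
    """Join the fields that are present, skipping any that are missing."""
    if order == "surnameFirst":
        fields = (name.surname, name.given, name.given2)
    else:
        fields = (name.given, name.given2, name.surname)
    return separator.join(field for field in fields if field)


print(format_name(PersonName(given="Zendaya")))
print(format_name(PersonName(given="Albert", surname="Einstein")))
# Surname-first order with no separator, as is conventional for Japanese names.
print(format_name(PersonName(given="駿", surname="宮崎"), order="surnameFirst", separator=""))
```

Because empty fields are filtered out before joining, a mononym never produces stray separators.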

You have 2 unread messages.

Or, you have 3 items in your cart. Whenever a computer needs to construct a sentence using “placeholders” such as 3, it is formatting a message. The current industry standard is ICU’s message formatting. A project started about 3 years ago aims to improve on it with a more robust and extensible mechanism. There is now a Tech Preview in ICU, and we’d urge developers to try it out!

See message-format-wg for details on the syntax and message2/package-summary.html for the API (note that ICU’s convention for tech previews is to mark them as Deprecated), and the test code in MessageFormat2Test.java for examples of usage.
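
To show why a placeholder alone is not enough, here is a minimal English-only sketch in which the sentence variant is chosen by the plural category of the number. The names `plural_category_en`, `MESSAGES`, and `format_message` are hypothetical; this is neither ICU's message formatting nor the tech-preview syntax, and other languages need their own plural rules.

```python
def plural_category_en(count: int) -> str:
    """English cardinal rule for whole numbers: 1 is 'one', anything else is 'other'."""
    return "one" if count == 1 else "other"


# Each message carries one pattern per plural category.
MESSAGES = {
    "unread": {
        "one": "You have {count} unread message.",
        "other": "You have {count} unread messages.",
    },
    "cart": {
        "one": "You have {count} item in your cart.",
        "other": "You have {count} items in your cart.",
    },
}


def format_message(key: str, count: int) -> str:
    """Pick the variant for this count, then substitute the placeholder."""
    pattern = MESSAGES[key][plural_category_en(count)]
    return pattern.format(count=count)


print(format_message("unread", 2))  # You have 2 unread messages.
print(format_message("cart", 1))    # You have 1 item in your cart.
```

Keeping the whole sentence in each variant, rather than gluing fragments together, is what lets translators reorder words freely.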

(There are of course other fixes, upgrades and new features in ICU: see ICU 72 and ICU 71 for more details.)

Māori, ‎Wolof, тоҷикӣ, ‎‎کٲشُر, ‎ትግርኛ, कॉशुर‎, ‎মৈতৈলোন্, ‎ᱥᱟᱱᱛᱟᱲᱤ

In CLDR, we now have 95 languages at the Modern level (suitable for full UI internationalization), 6 at the Moderate level (suitable for “document content” internationalization), and 29 at the Basic level (suitable for locale selection). We added a tech preview of formatting for person names, plus additions for Unicode 15.0 (emoji names and search keywords), names for new scripts, new CJK collation, and so on. For more information, see CLDR v42.

Revitalization and Preservation of Indigenous Languages

The Nattilik language community was unable to use their language reliably for even simple, everyday digital text exchanges such as email or text messaging. The Typotheque Syllabics Project, an initiative based out of Toronto and The Hague, Netherlands, undertook research with language keepers across various Syllabics-using Indigenous communities in Canada. By collaborating with Nattilik language keepers and elders in the community, key issues the Nattilik community of Western Nunavut faced were identified, and it was discovered that there were 12 missing syllabic characters from the Unicode Standard. The Consortium worked with the Typotheque Syllabics Project to add 16 characters to the script to support Nattilik and other languages in Unicode version 14.0, and improved the glyphs in Unicode version 15.0. See this blog post from June.

The Past and Future of Flag Emoji

Despite being the largest emoji category with a strong association tied to identity, flags are by far the least used. Flag emoji have always been subject to special criteria due to their open-ended nature, infrequent use, and burden on implementations. The addition of other flags and thousands of valid sequences into the Unicode Standard has not resulted in wider adoption. Flags don’t stand still: they are constantly evolving, and because they are open-ended, adding one creates exclusivity at the expense of others. Curious to learn more? Read more about the Past and Future of Flag Emoji.

Available Now! New YouTube Playlist and Technical Quick Start Guide

On September 28th, Unicode held a webinar on the “Overview of Internationalization and Unicode Projects” for Unicode enthusiasts. Unicode technical leadership and other experts shared background on our core projects with participants from more than 30 countries. If you missed the webinar, no worries! The recorded sessions are available on this YouTube playlist. And if you are new to Unicode and internationalization or simply want a refresh, you can also check out our Technical Quick Start Guide. This handy guide explains what Unicode is, including answering the question, “What is Internationalization and Why it Matters.” There are also useful links to more detailed information and how you can get involved. Read more here.

Support Unicode 💞💕💌💯✨🌟🤠🛟🎁

Finally, if you are already a contributor to, or member of, Unicode (or your company or organization is!), thank you, Danke, Děkuju, धन्यवाद, merci, 谢谢你, grazie, நன்றி, and gracias! What we have accomplished is only possible because of supporters like you.

And if you want to support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider adopting a character, making a gift of stock, or making a donation. As Unicode is a US-based non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

Posted by Unicode, Inc. at 10:05 AM


Labels: CLDR, emoji, highlights, ICU4X, spoofing

Tuesday, November 8, 2022

Available Now! New YouTube Playlist and Technical Quick Start Guide


By Elango Cheran

On September 28th, Unicode held a webinar on the “Overview of Internationalization and Unicode Projects” for Unicode enthusiasts. More than 180 people across 30 countries joined us for this online event.

The Consortium is pleased to now make available the videos from this event. If you are new to Unicode and internationalization or want an overview of the most recent projects, check out our new YouTube playlist and Technical Quick Start Guide.

Our Technical Leadership and other experts provide a handy overview on such topics as:

Introduction to Internationalization - Addison Phillips, Internationalization Engineer
Unicode Consortium: Past, Present, and Future - Mark Davis, Cofounder and President
Scripts and Character Encoding - Deborah Anderson, Chair of the Script Ad Hoc Committee
Unicode CLDR (Common Locale Data Repository) - Mark Davis and Annemarie Apple, Chair and Vice Chair of the CLDR Committee
Unicode ICU (International Components for Unicode) - Markus Scherer, Chair of ICU Committee
Unicode ICU4X 2022 - Shane Carr, Chair of ICU4X Subcommittee

Also included in the playlist is the audience Q&A with Elango Cheran, the webinar’s emcee, and Mark Davis, Cofounder and President of Unicode.

The Unicode Technical Quick Start Guide is also now available. The guide explains what Unicode is, including answering the question, “What is Internationalization and Why it Matters.” There is also an overview of the technical committees and useful links to more detailed information and how you can get involved.

Learn more about how you can support the Unicode Consortium and our mission, including information on our Adopt-a-Character program, here!

Posted by Unicode, Inc. at 3:12 PM


Labels: CLDR, guide, I18n, ICU, playlist, quick start, scripts, webinar, Youtube

Friday, October 21, 2022

ICU 72 Released

Unicode® ICU 72 has just been released. ICU is the premier library for software internationalization, used by a wide array of companies and organizations to support the world's languages, implementing both the latest version of the Unicode Standard and of the Unicode locale data (CLDR). ICU 72 updates to Unicode 15, and to CLDR 42 locale data with various additions and corrections.

ICU 72 and CLDR 42 are major releases, including a new version of Unicode and major locale data improvements.

ICU 72 adds two technology preview implementations based on draft Unicode specifications:

Formatting of people’s names in multiple languages (CLDR background on why this feature is being added and what it does)
An enhanced version of message formatting

This release also updates to the time zone data version 2022e (2022-oct). Note that pre-1970 data for a number of time zones has been removed, as has been the case in the upstream tzdata release since 2021b.

For details, please see https://icu.unicode.org/download/72.

Over 144,000 characters are available for adoption to help the Unicode Consortium’s work on digitally disadvantaged languages


Posted by Unicode, Inc. at 8:24 AM


Labels: CLDR, cldr 42, ICU, ICU 72, Unicode 15
