Friday, October 25, 2024

ICU 76 Released

Unicode® ICU 76 has just been released. ICU is the premier library for software internationalization, used by a wide array of companies and organizations to support the world's languages, implementing both the latest version of the Unicode Standard and of the Unicode locale data (CLDR).

ICU 76 updates to Unicode 16 (blog), including new characters and scripts, emoji, collation & IDNA changes, and corresponding APIs and implementations. It also updates to CLDR 46 (beta blog) locale data with new locales, significant updates to existing locales, and various additions and corrections. For example, the CLDR and Unicode default sort orders are now very nearly the same.

Most of the java.time (Temporal) types can now be formatted directly using the existing ICU4J date/time formatting classes.

There are some new APIs to make ICU easier to use with modern C++ and Java patterns. Most of the C/C++ APIs added for this purpose are implemented as C++ header-only APIs, and usable on top of binary stable C APIs, which is a first for ICU.

The Java and C++ technology preview implementations of the (also in tech preview) CLDR MessageFormat 2.0 specification have been updated to match recent changes.

ICU 76 and CLDR 46 are major releases, including a new version of Unicode and major locale data improvements.

For details, please see
https://unicode-org.github.io/icu/download/76.html.

Adopt a Character and Support Unicode’s Mission

Looking to give that special someone a special something?
Or maybe something to treat yourself?
🕉️💗🏎️🐨🔥🚀爱₿♜🍀

Adopt a character or emoji to give it the attention it deserves, while also supporting Unicode’s mission to ensure everyone can communicate in their own languages across all devices.

Each adoption includes a digital badge and certificate that you can proudly display!

Have fun and support a good cause

You can also donate funds or gift stock

As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

Posted by Unicode, Inc. at 1:31 PM

Labels: CLDR, CLDR 46, ICU, ICU 76, Unicode 16.0

Wednesday, April 17, 2024

ICU 75 Released

Unicode® ICU 75 has just been released. ICU is the premier library for software internationalization, used by a wide array of companies and organizations to support the world's languages, implementing both the latest version of the Unicode Standard and of the Unicode locale data (CLDR). ICU 75 updates to CLDR 45 (beta blog) locale data with new locales and various additions and corrections. C++ code now requires C++17 (C code now requires C11) and is being made more robust.

The CLDR MessageFormat 2.0 specification is now in technology preview, together with a corresponding update of the ICU4J (Java) tech preview and a new ICU4C (C++) tech preview.

For details, please see https://icu.unicode.org/download/75.

Posted by Unicode, Inc. at 1:11 PM

Labels: CLDR, CLDR 45, ICU, ICU 75

Tuesday, October 31, 2023

ICU 74 Released

Unicode® ICU 74 has just been released. ICU is the premier library for software internationalization, used by a wide array of companies and organizations to support the world's languages, implementing both the latest version of the Unicode Standard and of the Unicode locale data (CLDR). ICU 74 updates to Unicode 15.1, and to CLDR 44 locale data with various additions and corrections.

ICU 74 and CLDR 44 are major releases, including a new version of Unicode and major locale data improvements. They subsume the changes for the ICU 73.2 and CLDR 43.1 maintenance releases.

Unicode 15.1 adds source code security mechanisms, improves line breaking for southeast Asian scripts, and adds important CJK unified ideographs.

CLDR 44 has added or improved data for a number of languages that have been newly added to ICU, and has improved measurement unit handling, conversion, and formatting.

ICU 74 implements these improvements, adds new C APIs for locale handling, adds a plug-in API for word segmentation, and switches the Java build system to Maven.

For details, please see https://icu.unicode.org/download/74.

Support Unicode
To support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider adopting a character, making a gift of stock, or making a donation. As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

Posted by Unicode, Inc. at 2:16 PM

Labels: CLDR, CLDR 44, ICU, ICU 74

Friday, October 6, 2023

ICU4X 1.3: Now With Built-In Data, Case Mapping, Additional Calendar Systems, And More

By Robert Bastian, ICU4X Technical Committee

Across the globe, people are coming online with smaller and more varied devices including smartphones, smart watches, and gadgets. An offshoot of the International Components for Unicode (ICU) Committee, the ICU4X Committee is responsible for enabling these next-generation devices to communicate with their users in thousands of languages. Written in Rust, ICU4X brings lightweight, modular, and secure internationalization libraries to low-resource devices and many programming languages.

Since our last release in April 2023, the ICU4X team has been busy building additional features and improving the usability of the library. Today we're happy to announce the 1.3 release, including built-in data, a new datagen API, the first stable release of the case mapping component, support for more calendar systems, a technology preview of rule-based transliteration, and more.

We have heard feedback that ICU4X's data pipeline, while allowing powerful customization, has a significant learning curve. In ICU4X 1.3 we are therefore introducing a new feature called "compiled data", where we ship data generated from the latest CLDR and ICU versions in the library. This means that every ICU4X type gains a new constructor that does not take a data provider argument, but instead uses the compiled data. This data uses our existing "baked data" format, which, being plain Rust code, allows the compiler to perform optimizations and granularly exclude unnecessary data. In fact, programs that do not use any of the new constructors will not see a binary size difference even with the compiled_data Cargo feature enabled (it is enabled by default).

In addition to adding compiled data, we have also revamped our data generation API icu_datagen. The new API is more ergonomic, allows for more flexible data generation, such as choosing which segmentation models to include, and also better optimizes the size of the generated data. For example, with the new "fallback mode" flag, data can be generated under the assumption that locale fallback is going to be used at runtime. This way, data for e.g. en-CA does not have to be included if it matches the data for en, because at runtime en will be tried if en-CA doesn't exist. This mode of data deduplication is already used for compiled data, which comes with built-in fallback.
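The deduplication idea can be illustrated with a small Python sketch. This is not the icu_datagen API: the function names, the simple truncation-based fallback chain, and the placeholder data values are all assumptions made for illustration, and real locale fallback is more involved than dropping the last subtag.

```python
def parent(locale):
    """Next locale in a simplified truncation fallback chain, or None at the root."""
    if locale == "und":
        return None
    head, sep, _ = locale.rpartition("-")
    return head if sep else "und"


def lookup(data, locale):
    """Walk the fallback chain until some locale has an entry."""
    while locale is not None:
        if locale in data:
            return data[locale]
        locale = parent(locale)
    raise KeyError("no data, even at the root")


def inherited_value(kept, locale):
    """Value that fallback would produce if `locale` itself had no entry."""
    ancestor = parent(locale)
    while ancestor is not None:
        if ancestor in kept:
            return kept[ancestor]
        ancestor = parent(ancestor)
    return None


def deduplicate(data):
    """Drop entries that runtime fallback would resolve to the same value anyway."""
    kept = {}
    # Visit shorter locales first so ancestors are settled before descendants.
    for locale in sorted(data, key=lambda loc: loc.count("-")):
        if data[locale] != inherited_value(kept, locale):
            kept[locale] = data[locale]
    return kept


# Placeholder values, not real CLDR data.
full = {"und": "root", "en": "pattern-A", "en-CA": "pattern-A", "en-GB": "pattern-B"}
slim = deduplicate(full)
```

In this sketch `slim` no longer contains an entry for en-CA, yet `lookup(slim, "en-CA")` still returns the same value as before, because the lookup falls back to en.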

ICU4X 1.3 also stabilizes a new component: casemapping. Many scripts are bicameral, meaning they have an upper and lower case. Casemapping allows for converting between upper, lower, and title case, and the related casefolding operation allows for performing case-insensitive string matching. These operations can be rather nuanced and locale-dependent: for example, the letter “i” capitalizes to “İ” in Turkish, and modern Greek removes accents and adds diæreses when uppercasing.
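To see why case mapping needs a locale, here is a minimal Python sketch. It is not the ICU4X casemapping API: `to_upper`, `same_ignoring_case`, and the tailoring table are hypothetical names, and only the Turkish and Azerbaijani dotted-i rule is modeled (the Greek rules mentioned above are context-sensitive and are not covered).

```python
# Per-language overrides applied before the default, locale-independent mapping.
# Turkish and Azerbaijani uppercase "i" to "İ" (U+0130), not to "I".
UPPER_TAILORING = {
    "tr": {"i": "\u0130"},
    "az": {"i": "\u0130"},
}


def to_upper(text, language="und"):
    """Uppercase `text`, applying the language's overrides character by character."""
    overrides = UPPER_TAILORING.get(language, {})
    return "".join(overrides.get(ch, ch.upper()) for ch in text)


def same_ignoring_case(a, b):
    """Case-insensitive comparison via case folding rather than lowercasing."""
    return a.casefold() == b.casefold()


default_result = to_upper("istanbul")        # untailored mapping
turkish_result = to_upper("istanbul", "tr")  # Turkish tailoring
```

Case folding is the operation intended for matching: in this sketch `same_ignoring_case("Straße", "STRASSE")` is true, because "ß" folds to "ss".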

This release also completes the set of calendars to include all CLDR calendars. In addition to the Gregorian, Thai Solar Buddhist, Coptic, Ethiopian, Indian National (Śaka), and Japanese calendars that have been supported since 1.0, ICU4X now also supports the Chinese, Korean (Dangi), Hebrew, Persian (Solar Hijri), R.O.C., and four variants of the Islamic calendar (civil, observational, tabular, and Umm al-Qura). This support includes formatting, though formatting for Chinese and Korean is currently in a preview state.

We're also launching a transliteration API as a technical preview. Transliteration is the conversion between scripts, such as from Arabic to Latin, preserving pronunciation as far as possible. CLDR supports many transliterations, and this release brings these CLDR transliterations to ICU4X. While data generation is not yet available, users can runtime-construct transliterators to convert between any scripts supported by CLDR.

Finally, ICU4X 1.3 brings a number of smaller features to other components. The experimental display names component now supports formatting language identifiers, in addition to language, script, and region display names; there are performance improvements across the board; and some APIs such as LocaleFallbacker have been moved to better locations.

Read the full ICU4X 1.3 release notes and then the ICU4X tutorial to start using ICU4X in your project.

Posted by Unicode, Inc. at 2:03 PM

Labels: ICU, ICU4X, ICU4X 1.3

Thursday, June 15, 2023

ICU 73.2 & CLDR 43.1 released: GB18030 compliance updates & compatibility fixes

Unicode® ICU 73.2 and CLDR 43.1 have just been released.

ICU is the premier library for software internationalization, used by a wide array of companies and organizations to support the world's languages, implementing both the latest version of the Unicode Standard and of the Unicode locale data (CLDR).
CLDR provides key building blocks for software to support the world's languages (dates, times, numbers, sort-order, etc.). All major browsers and all modern mobile phones use CLDR for language support. (See Who uses CLDR?)

There are significant changes for GB18030-2022 compliance support:

CLDR extends the support for “short” Chinese sort orders to cover some additional, required characters for Level 2. This is carried over into ICU collation.
ICU has a modified character conversion table, mapping some GB18030 characters to Unicode characters that were encoded after GB18030-2005.

There are also changes for compatibility:

There are optional variants of time formats with AM/PM (only for English) using ASCII spaces in CLDR that can also be used in ICU via custom data generation. This is intended to help certain implementers transition to the improved patterns, which have used a narrow no-break space between the time and AM/PM since CLDR 42.
- For how to generate ICU data with this option, look for alt="ascii" on tools/cldr/cldr-to-icu/README.md
The changes to the word segmentation behavior of @ sign that were in CLDR 42 (ICU 72) have been reverted. These caused problems for certain parsers that did not expect @ to join to letters.
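The time-format compatibility issue above can be made concrete with a short Python sketch. It shows a consumer-side workaround, not the alt="ascii" data generation option itself; the function names and the strict parser are hypothetical and stand in for code that assumed an ASCII space before AM/PM.

```python
NARROW_NO_BREAK_SPACE = "\u202f"


def use_ascii_spaces(formatted_time):
    """Replace U+202F with an ASCII space before handing text to a legacy parser."""
    return formatted_time.replace(NARROW_NO_BREAK_SPACE, " ")


def split_time_and_period(formatted_time):
    """A deliberately strict parser of the kind that breaks on U+202F."""
    clock, period = formatted_time.split(" ")  # splits on the ASCII space only
    return clock, period


newer_output = "3:30\u202fPM"  # shape of the newer English pattern
parsed = split_time_and_period(use_ascii_spaces(newer_output))
```

Calling `split_time_and_period(newer_output)` directly raises a `ValueError` in this sketch, because the string contains no ASCII space to split on.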

ICU 73.2 updates to CLDR 43.1 locale data. These are maintenance releases for ICU 73 and CLDR 43, with limited sets of bug fixes and no API or structural changes. ICU 73.2 and CLDR 43.1 also include several other bug fixes, including fixes for person name formatting and Cyrillic transforms.

For details, please see:

ICU 73.2 Release Note: ICU 73.2 maintenance release
CLDR 43.1 Release Note: Version 43.1 Changes

Posted by Unicode, Inc. at 1:31 PM

Labels: CLDR, cldr 43, ICU, ICU 73

Thursday, June 1, 2023

Unlocking the Power of CLDR Person Name Formatting: A Solution for Formatting Names in a Globalized World

By Mike McKenna, Chair of CLDR Person Names Subcommittee

CLDR Person Names has moved from “tech preview” to “draft” status and is available for initial testing by implementors through ICU4J.

How a person’s name is displayed and used can convey respect, familiarity, or even be interpreted as rude if used improperly. That’s why it’s important to format names correctly, especially because naming practices vary across the globe. In many cultures, names can indicate gender, status, birthplace, nationality, ethnicity, religion, and more.

Until now, there have been no good standards for how to format people’s names in various contexts. A number of Unicode members wanted to address this problem and provide a mechanism that anyone could use to format people’s names in a wide variety of applications, such as contact lists, air travel, billing applications, CRMs, social media, and any other application that asks for user information and presents it back to the user or others.

The Unicode® Person Name Formats specification defines patterns used to take a person’s name and format it correctly in a given language or locale depending on a chosen context. With the Unicode Common Locale Data Repository (CLDR), locale codes and name sequences can be selected to create a specific pattern for formatting a person’s name, including preferences for formal, informal, or abbreviated versions. As a result, designers and developers can correctly display names according to the user’s native locale and culture, which is especially important when integrating names in different character scripts, such as Japanese, Chinese, or Russian.

The Unicode Consortium added Person Name formatting to CLDR in v42 and refined and enhanced it for v43, which was released in April. In CLDR v43, with the help of linguists from around the world, we completed data for formatting people’s names for CLDR locales at modern coverage. The specification’s formal name is "Unicode Technical Standard #35 Unicode Locale Data Markup Language (LDML); Part 8: Person Names". ICU has added the PersonNameFormatter class, which is available in ICU 73.

To learn more, and get an idea of the implications for user experience and application design, see the following paper, which provides an illustration of the many contexts in which names can be formatted through CLDR Person Names.

LDML (UTS#35) Part 8: Person Names - a story teller’s case study

Posted by Unicode, Inc. at 11:26 AM

Labels: CLDR, cldr 43, CLDR 44, ICU, person names

Thursday, April 13, 2023

ICU 73 Released

Unicode® ICU 73 has just been released. ICU is the premier library for software internationalization, used by a wide array of companies and organizations to support the world's languages, implementing both the latest version of the Unicode Standard and of the Unicode locale data (CLDR). ICU 73 updates to CLDR 43 locale data with various additions and corrections.

ICU 73 improves Japanese and Korean short-text line breaking, reduces C++ memory use in date formatting, and promotes the Java person name formatter from tech preview to draft.

ICU 73 and CLDR 43 are minor releases, mostly focused on bug fixes and small enhancements. (The fall CLDR/ICU releases will update to Unicode 15.1 which is planned for September.)

ICU 73 updates to the time zone data version 2023c (March 2023). Note that pre-1970 data for a number of time zones has been removed, as has been the case in the upstream tzdata release since 2021b.

For details, please see https://icu.unicode.org/download/73.

Posted by Unicode, Inc. at 3:44 PM

Labels: CLDR, cldr 43, ICU, ICU 73

Tuesday, November 8, 2022

Available Now! New YouTube Playlist and Technical Quick Start Guide

By Elango Cheran

On September 28th, Unicode held a webinar on the “Overview of Internationalization and Unicode Projects” for Unicode enthusiasts. More than 180 people across 30 countries joined us for this online event.

The Consortium is pleased to now make available the videos from this event. If you are new to Unicode and internationalization or want an overview of the most recent projects, check out our new YouTube playlist and Technical Quick Start Guide.

Our Technical Leadership and other experts provide a handy overview on such topics as:

Introduction to Internationalization - Addison Phillips, Internationalization Engineer
Unicode Consortium: Past, Present, and Future - Mark Davis, Cofounder and President
Scripts and Character Encoding - Deborah Anderson, Chair of the Script Ad Hoc Committee
Unicode CLDR (Common Locale Data Repository) - Mark Davis and Annemarie Apple, Chair and Vice Chair of the CLDR Committee
Unicode ICU (International Components for Unicode) - Markus Scherer, Chair of ICU Committee
Unicode ICU4X 2022 - Shane Carr, Chair of ICU4X Subcommittee

Also included in the playlist is the audience Q&A with Elango Cheran, the webinar’s emcee, and Mark Davis, Cofounder and President of Unicode.

The Unicode Technical Quick Start Guide is also now available. The guide explains what Unicode is, including answering the question, “What is Internationalization and Why it Matters.” There is also an overview of the technical committees and useful links to more detailed information and how you can get involved.

Learn more about how you can support the Unicode Consortium and our mission, including information on our Adopt-a-Character program, here!

Posted by Unicode, Inc. at 3:12 PM

Labels: CLDR, guide, I18n, ICU, playlist, quick start, scripts, webinar, Youtube

Friday, October 21, 2022

ICU 72 Released

Unicode® ICU 72 has just been released. ICU is the premier library for software internationalization, used by a wide array of companies and organizations to support the world's languages, implementing both the latest version of the Unicode Standard and of the Unicode locale data (CLDR). ICU 72 updates to Unicode 15, and to CLDR 42 locale data with various additions and corrections.

ICU 72 and CLDR 42 are major releases, including a new version of Unicode and major locale data improvements.

ICU 72 adds two technology preview implementations based on draft Unicode specifications:

Formatting of people’s names in multiple languages (CLDR background on why this feature is being added and what it does)
An enhanced version of message formatting

This release also updates to the time zone data version 2022e (2022-oct). Note that pre-1970 data for a number of time zones has been removed, as has been the case in the upstream tzdata release since 2021b.

For details, please see https://icu.unicode.org/download/72.

Over 144,000 characters are available for adoption to help the Unicode Consortium’s work on digitally disadvantaged languages

Posted by Unicode, Inc. at 8:24 AM

Labels: CLDR, cldr 42, ICU, ICU 72, Unicode 15

Thursday, October 6, 2022

ICU 72 Release Candidate Available

We are pleased to announce the release candidate for Unicode® ICU 72. It updates to Unicode 15, and to CLDR 42 locale data with various additions and corrections.

ICU 72 adds technology preview implementations for person name formatting, as well as for a new version of message formatting based on a proposed draft Unicode specification.

ICU 72 and CLDR 42 are major releases, including a new version of Unicode and major locale data improvements.

ICU 72 updates to the time zone data version 2022b (2022-Aug) which is effectively the same as 2022c. Note that pre-1970 data for a number of time zones has been removed, as has been the case in the upstream tzdata release since 2021b.

For details, please see https://icu.unicode.org/download/72.

Please test this release candidate on your platforms and report bugs and regressions by Tuesday, 2022-Oct-18, via the icu-support mailing list, and/or please find/submit error reports.

Please do not use this release candidate in production.

The preliminary API reference documents are published on unicode-org.github.io/icu-docs/ – follow the “Dev” links there.

Posted by Unicode, Inc. at 3:57 PM

Labels: CLDR, cldr 42, ICU, ICU 72

Thursday, September 29, 2022

Announcing ICU4X 1.0

I. Introduction

Hello! Ndeewo! Molweni! Салам! Across the world, people are coming online with smartphones, smart watches, and other small, low-resource devices. The technology industry needs an internationalization solution for these environments that scales to dozens of programming languages and thousands of human languages.

Enter ICU4X. As the name suggests, ICU4X is an offshoot of the industry-standard i18n library published by the Unicode Consortium, ICU (International Components for Unicode), which is embedded in every major device and operating system.

This week, after 2½ years of work by Google, Mozilla, Amazon, and community partners, the Unicode Consortium has published ICU4X 1.0, its first stable release. Built from the ground up to be lightweight, portable, and secure, ICU4X learns from decades of experience to bring localized date formatting, number formatting, collation, text segmentation, and more to devices that, until now, did not have a suitable solution.

Lightweight: ICU4X is Unicode's first library to support static data slicing and dynamic data loading. With ICU4X, clients can inspect their compiled code to easily build small, optimized locale data packs and then load those data packs on the fly, enabling applications to scale to more languages than ever before. Even when platform i18n is available, ICU4X is suitable as a polyfill to add additional features or languages. It does this while using very little RAM and CPU, helping extend devices' battery life.

Portable: ICU4X supports multiple programming languages out of the box. ICU4X can be used in the Rust programming language natively, with official wrappers in C++ via the foreign function interface (FFI) and JavaScript via WebAssembly. More programming languages can be added by writing plugins, without needing to touch core i18n logic. ICU4X also allows data files to be updated independently of code, making it easier to roll out Unicode updates.

Secure: Rust's type system and ownership model guarantee memory-safety and thread-safety, preventing large classes of bugs and vulnerabilities.

How does ICU4X achieve these goals, and why did the team choose to write ICU4X over any number of alternatives?

II. Why ICU4X?

You may still be wondering, what led the Unicode Consortium to choose a new Rust-based library as the solution to these problems?

II.A. Why a new library?

The Unicode Consortium also publishes ICU4C and ICU4J, i18n libraries written for C/C++ and Java. Why write a new library from scratch? Wouldn’t that increase the ongoing maintenance burden? Why not focus our efforts on improving ICU4C and/or ICU4J instead?

ICU4X solves a different problem for different types of clients. ICU4X does not seek to replace ICU4C or ICU4J; rather, it seeks to replace the large number of non-Unicode, often-unmaintained, often-incomplete i18n libraries that have been written to bring i18n to new programming languages and resource-constrained environments. ICU4X is a product that has long been missing from Unicode's portfolio.

Early on, the team evaluated whether ICU4X's goals could have been achieved by refactoring ICU4C or ICU4J. We found that:

ICU4C has already gone through a period of optimization for tree shaking and data size. Despite these efforts, we continue to have stakeholders saying that ICU4C is too large for their resource-constrained environment. Getting further improvements in ICU4C would amount to rewrites of much of ICU4C's code base, which would need to be done in a way that preserves backwards compatibility. This would be a large engineering effort with an uncertain final result. Furthermore, writing a new library allows us to additionally optimize for modern UTF-8-native environments.
Except for JavaScript via j2cl, Java is not a suitable source language for portability to low-resource environments like wearables. Further, ICU4J has many interdependent parts that would require a large amount of effort to bring to a state where it could be a viable j2cl source.
Some of our stakeholders (Firefox and Fuchsia) are drawn to Rust's memory safety. Like most complex C++ projects, ICU4C has had its share of CVEs, mostly relating to memory safety. Although C++ diagnostic tools are improving, Rust has very strong guarantees that are impossible in other software stacks.

For all these reasons, we decided that a Rust-based library was the best long-term choice.

II.B. Why use ICU4X when there is i18n in the platform?

Many of the same people who work on ICU4X also work to make i18n available in the platform (browser, mobile OS, etc.) through APIs such as the ECMAScript Intl object, android.icu, and other smartphone native libraries. ICU4X complements the platform-based solutions as the ideal polyfill:

Some platform i18n features take 5 or more years to gain wide enough availability to be used in client-side applications. ICU4X can bridge the gap.
ICU4X can enable clients to add more locales than those available in the platform.
Some clients prefer identical behavior of their app across multiple devices. ICU4X can give them this level of consistency.
Eventually, we hope that ICU4X will back platform implementations in ECMAScript and elsewhere, providing a maximal amount of consistency when ICU4X is also used as a polyfill.

II.C. Why pluggable data?

One of the most visible departures that ICU4X makes from ICU4C and ICU4J is an explicit data provider argument on most constructor functions. The ICU4X data provider supports the following use cases:

Data files that are readable by both older and newer versions of the code; for more detail on how this works, see ICU4X Data Versioning Design
Data files that can be swapped in and out at runtime, making it easy to upgrade Unicode, CLDR, or time zone database versions. Swapping in new data can be done at runtime without needing to restart the application or clear internal caches.
Multiple data sources. For example, some data may be baked into the app, some may come from the operating system, and some may come from an HTTP service.
Customizable data caches. We recognize that there is no "one size fits all" approach to caching, so we allow the client to configure their data pipeline with the appropriate type of cache.
Fully configurable data fallbacks and overlays. Individual fields of ICU4X data can be selectively overridden at runtime.
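The explicit data provider pattern described in this list can be sketched in Python as follows. None of these names are ICU4X APIs: `DictProvider`, `ChainedProvider`, `DecimalFormatter`, and the `"decimal/symbols"` key are hypothetical and only illustrate how a constructor that takes its data source as an argument makes multiple sources and overlays straightforward.

```python
class DictProvider:
    """Serves data from an in-memory mapping, e.g. data baked into the app."""

    def __init__(self, entries):
        self.entries = entries

    def load(self, key, locale):
        return self.entries.get((key, locale))


class ChainedProvider:
    """Tries each source in order and returns the first hit (an overlay)."""

    def __init__(self, *sources):
        self.sources = sources

    def load(self, key, locale):
        for source in self.sources:
            payload = source.load(key, locale)
            if payload is not None:
                return payload
        raise LookupError(f"no data for {key!r} in {locale!r}")


class DecimalFormatter:
    """A component that receives its data source explicitly at construction."""

    def __init__(self, provider, locale):
        self.symbols = provider.load("decimal/symbols", locale)

    def format(self, integer_part, fraction_part):
        return f"{integer_part}{self.symbols['decimal']}{fraction_part}"


baked = DictProvider({("decimal/symbols", "en"): {"decimal": "."}})
downloaded = DictProvider({("decimal/symbols", "de"): {"decimal": ","}})
provider = ChainedProvider(downloaded, baked)

german = DecimalFormatter(provider, "de").format("3", "14")
english = DecimalFormatter(provider, "en").format("3", "14")
```

Because the formatter only sees the `load` method, a caching layer or a network-backed source could be slotted into the chain without changing the formatter itself.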

III. How We Made ICU4X Lightweight

There are three factors that combine to make code lightweight: small binary size, low memory usage, and deliberate performance optimizations. For all three, we have metrics that are continuously measured on GitHub Actions continuous integration (CI).

III.A. Small Binary Size

Internationalization involves a large number of components with many interdependencies. To combat this problem, ICU4X optimizes for "tree shaking" (dead code elimination) by:

Minimizing the number of dependencies of each individual component.
Using static types in ways that scope functions to the pieces of data they need.
Splitting functions and classes that pull in more data than they need into multiple, smaller pieces.

Developers can statically link ICU4X and run a tree-shaking tool like LLVM link-time optimization (LTO) to produce a very small amount of compiled code, and then they can run our static analysis tool to build an optimally small data file for it.

In addition to static analysis, ICU4X supports dynamic data loading out of the box. This is the ultimate solution for supporting hundreds of languages, because new locale data can be downloaded on the fly only when they are needed, similar to message bundles for UI strings.

III.B. Low Memory Usage

At its core, internationalization transforms inputs to human-readable outputs, using locale-specific data. ICU4X introduces novel strategies for runtime loading of data involving zero memory allocations:

Supports Postcard-format resource files for dynamically loaded, zero-copy deserialized data across all architectures.
Supports compile-time linking of required data without deserialization overhead via DataBake.
Data schema is designed so that individual components can use the immutable locale data directly with minimal post-processing, greatly reducing the need for internal caches.
Explicit "data provider" argument to each function that requires data, making it very clear when data is required.
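The idea behind zero-copy deserialization can be shown with a rough Python analogy. This is neither the Postcard format nor ICU4X's implementation: the length-prefixed layout and the function names are assumptions chosen to keep the sketch short. The point is that the "deserialized" fields are views into the original buffer rather than freshly allocated copies.

```python
import struct


def pack_strings(strings):
    """Serialize strings as: count, then (length, UTF-8 bytes) records."""
    out = bytearray(struct.pack("<I", len(strings)))
    for s in strings:
        raw = s.encode("utf-8")
        out += struct.pack("<I", len(raw)) + raw
    return bytes(out)


def borrow_strings(buffer):
    """Return views into `buffer` instead of copying each record out."""
    view = memoryview(buffer)
    (count,) = struct.unpack_from("<I", view, 0)
    offset = 4
    fields = []
    for _ in range(count):
        (length,) = struct.unpack_from("<I", view, offset)
        offset += 4
        # Slicing a memoryview creates another view; the bytes are not copied.
        fields.append(view[offset:offset + length])
        offset += length
    return fields


blob = pack_strings(["janvier", "février"])
fields = borrow_strings(blob)
```

Each element of `fields` still refers to `blob` as its backing object; bytes are only copied if and when a caller decodes a field into a new string.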

ICU4X team member Manish Goregaokar wrote a blog post series detailing how the zero-copy deserialization works under the covers.

III.C. Deliberate Performance Optimizations

Reducing CPU usage improves latency and battery life, important to most clients. ICU4X achieves low CPU usage by:

Writing in Rust, a high-performance language.
Utilizing zero-copy deserialization.
Measuring every change against performance benchmarks.

The ICU4X team uses a benchmark-driven approach to achieve highly competitive performance numbers: newly added components should have benchmarks, and future changes to those components should avoid regressing on those benchmarks.

Although we always seek to improve performance, we do so deliberately. There are often space/time tradeoffs, and the team takes a balanced approach. For example, if improving performance requires increasing or duplicating the data requirements, we tend to favor smaller data, like we've done in the normalizer and collator components. In the segmenter components, we offer two modes: a machine learning LSTM segmenter with lower data size but heavier CPU usage, and a dictionary-based segmenter with larger data size but lower CPU usage. (There is ongoing work to make the LSTM segmenter require fewer CPU resources.)

IV. How We Made ICU4X Portable

The software ecosystem continually evolves with new programming languages. The "X" in ICU4X is a nod to the second main design goal: portability to many different environments.

ICU4X is Unicode's first internationalization library to have official wrappers in more than one target language. We do this with a tool we designed called Diplomat, which generates idiomatic bindings in many programming languages that encourage i18n best practices. Thanks to Diplomat, these bindings are easy to maintain, and new programming languages can be added without needing i18n expertise.

Under the covers, ICU4X is written in no_std Rust (no system dependencies) wrapped in a stable ABI that Diplomat bindings invoke across foreign function interface (FFI) or WebAssembly (WASM). We have some basic tutorials for using ICU4X from C++ and JavaScript/TypeScript.

V. What’s next?

ICU4X represents an exciting new step in bringing internationalized software to more devices, use cases, and programming languages. A Unicode working group is hard at work on expanding ICU4X’s feature set over time so that it becomes more useful and performant; we are eager to learn about new use cases and have more people contribute to the project.

Have questions? You can contact us on the ICU4X discussion forum!

Want to try it out? See our tutorials, especially our Intro tutorial!

Interested in getting involved? See our Contribution Guide.

Want to stay posted on future ICU4X updates? Sign up for our low-traffic announcements list, icu4x-announce@unicode.org!

Over 144,000 characters are available for adoption to help the Unicode Consortium’s work on digitally disadvantaged languages

Posted by Unicode, Inc. at 7:15 AM

Labels: ICU, ICU4X, Rust

Wednesday, September 21, 2022

New Online Event – Overview of Internationalization and Unicode Projects

The Unicode Consortium is excited to invite you to our upcoming online event, “Overview of Internationalization and Unicode Projects.”

During this ~2-hour event, hear pre-recorded sessions from some of the experts working to ensure that everyone can fully communicate and collaborate in their languages across all software and services. Unicode representatives will be available for live Q&A for the last 30-40 minutes and our emcee throughout will be Elango Cheran of Google.

Topics and speakers include:

An Introduction to Internationalization (i18n) - Addison Phillips, Internationalization Engineer
Overview of the Unicode Consortium: History and Future - Mark Davis, Cofounder and President
Scripts and Character Encoding - Deborah Anderson, Chair of the Script Ad Hoc Committee
The Common Locale Data Repository (CLDR) - Mark Davis and Annemarie Apple, Chair and Vice Chair of the CLDR Committee
International Components for Unicode (ICU) - Markus Scherer, Chair of ICU Committee
Bringing Internationalization to More Programming Languages and Resource-Constrained Environments (ICU4X) - Shane Carr, Chair of ICU4X Subcommittee

Date: Wednesday, September 28th, 2022

Time: 9:30am (California)/12:30pm (New York)/16:30 (UTC)/17:30 (London)

Location and Cost: Online, free to attend

Registration: Register here. Please freely share this link with colleagues and anyone else who may be interested. Registration will also ensure you will receive updates for future Unicode events.

The recording and a playlist will be available on YouTube later this year for anyone who is unable to attend or if attendees want to share the information with others. Depending on community interest, Unicode project leaders will also be available in November and December for virtual “Office Hours” to talk more in depth and answer specific questions.

The link to share with your networks is: https://us06web.zoom.us/webinar/register/WN_ViDf3YFyS7WiAXnHYp88kw

Thanks and hope to see many of you on the 28th!

Posted by Unicode, Inc. at 12:56 PM

Labels: CLDR, event, I18n, ICU, ICU4X, internationalization, scripts

Friday, April 8, 2022

ICU 71 Released

Unicode® ICU 71 has just been released. ICU is the premier library for software internationalization, used by a wide array of companies and organizations to support the world's languages, implementing both the latest version of the Unicode Standard and of the Unicode locale data (CLDR). ICU 71 updates to CLDR 41 locale data with various additions and corrections.

ICU 71 adds phrase-based line breaking for Japanese. Existing line breaking methods follow standards and conventions for body text but do not work well for short Japanese text, such as in titles and headings. This new feature is optimized for these use cases.

ICU 71 adds support for Hindi written in Latin letters (hi_Latn). The CLDR data for this increasingly popular locale has been significantly revised and expanded. Note that based on user expectations, hi_Latn incorporates a large amount of English, and can also be referred to as “Hinglish”.

ICU 71 and CLDR 41 are minor releases, mostly focused on bug fixes and small enhancements. (The fall CLDR/ICU releases will update to Unicode 15 which is planned for September.) We are also working to re-establish continuous performance testing for ICU, and on development towards future versions.

ICU 71 updates to the time zone data version 2022a. Note that pre-1970 data for a number of time zones has been removed, as has been the case in the upstream tzdata release since 2021b.

For details, please see https://icu.unicode.org/download/71.

Posted by Unicode, Inc. at 7:32 AM

Labels: CLDR, cldr 41, ICU, ICU 71, Unicode 14

Wednesday, November 10, 2021

ICU4X 0.4 Released

Unicode® ICU4X 0.4 has just been released. This revision brings an implementation of Unicode Properties, major performance and memory improvements for DateTimeFormat, and extends the data provider data loading models with BlobDataProvider.

ICU4X 0.4 also adds initial time zone support in DateTimeFormat, week of month/year, iteration APIs in Segmenter, and an experimental ListFormatter.

The ICU4X team is shifting to work on the 0.5 release in accordance with the roadmap and a product requirements document setting sights on a stable 1.0 release in Q2 2022.

ICU4X aims to develop a highly modular set of internationalization components for resource-constrained environments, portable across programming languages.

Multiple early adopters use ICU4X in pre-release software in Rust, C, C++, and WebAssembly. The team is ready to onboard additional early adopters to refine the APIs, build processes, and feature sets before the 1.0 release. The team is also looking for contributors to write code generation for additional target programming languages. For more information, please open a discussion on the ICU4X GitHub.

For details, please see the changelog.

Posted by Unicode, Inc. at 10:54 AM

Labels: FFI, ICU, ICU4X, Rust, Unicode

Thursday, October 28, 2021

ICU 70 Released

Unicode® ICU 70 has just been released. ICU 70 incorporates updates to Unicode 14, including new characters, scripts, emoji, and corresponding API constants. ICU 70 adds support for emoji properties of strings. It also updates to CLDR 40 locale data with many additions and corrections. ICU 70 also includes many other bug fixes and enhancements, especially for measurement unit formatting, and it can now be built and used with C++20 compilers.

ICU is a software library widely used by products and other libraries to support the world's languages, implementing both the latest version of the Unicode Standard and of the Unicode locale data (CLDR).

For details, please see https://icu.unicode.org/download/70.

Note: Our website has moved. Please adjust your bookmarks.

Posted by Unicode, Inc. at 12:31 PM

Labels: CLDR, cldr 40, ICU, ICU 70, Unicode 14

Thursday, May 6, 2021

ICU4X 0.2 Released

Unicode® ICU4X 0.2 has just been released. This revision improves completeness of the components in ICU4X 0.1 and introduces a number of lower-level utilities.

ICU4X 0.2 adds minimal decimal formatting, time zone formatting, datetime skeleton resolution, and locale canonicalization.

This release comes with new low-level utilities for fixed decimal operations, ICU patterns, and foundational components allowing use of ICU4X from other ecosystems via Foreign Function Interfaces.

Additionally, the ICU4X team released a roadmap and a product requirements document setting sights on a stable 1.0 release.

ICU4X aims to develop a highly modular set of internationalization components for resource-constrained environments.

For details, please see the changelog.

Over 140,000 characters are available for adoption to help the Unicode Consortium’s work on digitally disadvantaged languages

Posted by Unicode, Inc. at 10:30 AM

Labels: FFI, ICU, ICU4X, Rust, Unicode

Friday, April 9, 2021

ICU 69 Released

Unicode® ICU 69 has just been released. ICU 69 incorporates updates to CLDR 39 locale data with its many additions and corrections. ICU 69 also includes significant improvements to formatting for measurement units and numbers, as well as many other bug fixes and enhancements.

ICU is a software library widely used by products and other libraries to support the world's languages, implementing both the latest version of the Unicode Standard and of the Unicode locale data (CLDR).

For details, please see http://site.icu-project.org/download/69.

Posted by Unicode, Inc. at 9:44 AM

Labels: CLDR, CLDR 39, ICU, ICU 69

Thursday, October 29, 2020

ICU 68 Released

Unicode® ICU 68 has just been released. ICU 68 updates to CLDR 38 locale data with many additions and corrections. ICU 68 brings support for locale-dependent smart unit preferences (road distance, temperature, etc.), implements locale ID canonicalization conformant with CLDR, and includes many other bug fixes and enhancements.

ICU is a software library widely used by products and other libraries to support the world's languages, implementing both the latest version of the Unicode Standard and of the Unicode locale data (CLDR).

For details, please see http://site.icu-project.org/download/68.

Posted by Unicode, Inc. at 10:33 AM

Labels: CLDR, cldr 38, ICU, ICU 68

Friday, October 23, 2020

Announcing ICU4X 0.1

We are thrilled to announce the first pre-release version of the ICU4X internationalization components. ICU4X aims to provide high quality internationalization components with a focus on:

Modularity
Flexible data management
Performance, memory, safety and size
Universal access from programming languages and ecosystems (FFI)

ICU4X draws from the experience of projects such as ICU4C, ICU4J, ECMA-402, CLDR, and Unicode.

Target

ICU4X is initially focusing on a subset of internationalization APIs standardized in ECMA-402 in order to cover the needs of client-side ecosystems and thin clients.

ICU4X targets a wide range of programming languages and environments, aiming to expose its APIs to languages such as JavaScript, WebAssembly, Dart, C++, Python, PHP, and others.

With our focus on client-side ecosystems, a lot of effort will be placed on minimizing the size, memory, and CPU utilization, and allowing for asynchronous data management.

More information on the design can be found in the project’s Announcement article.

Status

This first pre-release 0.1 version is written in Rust and introduces a small subset of APIs and scaffolding for flexible data management.

We would like to invite everyone to try it out. Take a look at the documentation and provide feedback on the API design. We’re also looking for feedback on the algorithms and data structures we use, especially from contributors with experience in Rust and ICU algorithms.

More information on the release can be found in the Release Notes.

Roadmap

The next version, 0.2, will focus on validating the ability to expose ICU4X APIs to other programming environments and extending the data management system to be asynchronous.

The project is fully open source and invites all interested parties to join the effort of designing and developing a modular internationalization components system in Rust.

To learn more on how to contribute to the project, visit the CONTRIBUTE document in the project’s repository.

Posted by Unicode, Inc. at 12:41 PM

Labels: FFI, ICU, Rust, Unicode

Friday, April 24, 2020

ICU 67 Released

Unicode® ICU 67 has just been released. ICU 67 updates to CLDR 37 locale data with many additions and corrections. This release also includes the updates to Unicode 13, subsuming the special CLDR 36.1 and ICU 66 releases. ICU 67 includes many bug fixes for date and number formatting, including enhanced support for user preferences in the locale identifier. The LocaleMatcher code and data are improved, and number skeletons have a new “concise” form that can be used in MessageFormat strings.

ICU is a software library widely used by products and other libraries to support the world's languages, implementing both the latest version of the Unicode Standard and of the Unicode locale data (CLDR).

For details, please see http://site.icu-project.org/download/67.

Posted by Unicode, Inc. at 2:48 PM

Labels: CLDR, CLDR 37, ICU, ICU 67, Unicode 13
