Wednesday, February 23, 2022
Unicode CLDR v41 Alpha available for testing
The Unicode CLDR v41 Alpha is now available for testing. The alpha has already been integrated into the development version of ICU. We would especially appreciate feedback from non-ICU consumers of CLDR data. Feedback can be filed at CLDR Tickets.
Alpha means that the main data and charts are available for review, but the specification, JSON data, and other components are not yet ready for review. Some data may change if showstopper bugs are found. The planned schedule is:
- Mar 09 — Beta (data)
- Mar 23 — Beta2 (spec)
- Apr 06 — Release
The tooling changes are targeted at the v42 general submission release. They include a number of features and improvements such as progress meter widgets in the Survey Tool.
Finally, the Basic level has been modified to make it easier to onboard new languages, and easier for implementations to filter locale data based on coverage levels.
The following table shows the number of Languages/Locales in this version. (See the v41 Locale Coverage table for more information.)
| Level | Languages | Locales | Notes |
|---|---|---|---|
| Modern | 89 | 361 | Suitable for full UI internationalization |
| Moderate | 13 | 32 | Suitable for full “document content” internationalization, such as formats in a spreadsheet. |
| Basic | 22 | 21 | Suitable for locale selection, such as choice of language in mobile phone settings. |
| Total | 124 | 414 | Total of all languages/locales with ≥ Basic coverage. |
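As a rough illustration of filtering by coverage level, the sketch below sums the counts from the table for every level at or above a chosen minimum. The names `LEVELS`, `COUNTS`, and `totals_at_or_above` are hypothetical and are not part of CLDR or ICU. The ordering Basic < Moderate < Modern is taken from the table.

```python
# Coverage levels ordered from lowest to highest (ordering as implied by the table).
LEVELS = ["Basic", "Moderate", "Modern"]

# (languages, locales) per level, copied from the v41 alpha table.
COUNTS = {
    "Modern": (89, 361),
    "Moderate": (13, 32),
    "Basic": (22, 21),
}


def totals_at_or_above(min_level):
    """Return (languages, locales) summed over all levels >= min_level.

    Hypothetical helper for illustration only; not a CLDR or ICU API.
    """
    selected = LEVELS[LEVELS.index(min_level):]
    languages = sum(COUNTS[level][0] for level in selected)
    locales = sum(COUNTS[level][1] for level in selected)
    return languages, locales


if __name__ == "__main__":
    for level in LEVELS:
        print(level, totals_at_or_above(level))
```

With "Basic" as the minimum, the helper reproduces the Total row of the table. An implementation that only wants Modern-level data would pass "Modern" instead.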
Beyond the member organizations of the Unicode Consortium, many dedicated communities and individuals regularly contribute to updating their locales, including:
- Modern: Cherokee, Cantonese, Scottish Gaelic, Sorbian (Lower), Sorbian (Upper)
- Moderate: Asturian [nearly Modern], Breton, Faroese, Fulah (Adlam), Kaingang, Nheengatu, Quechua, Sardinian
- Basic: Bosnian (Cyrillic), Interlingua, Kabuverdianu, Māori, Romansh, Tajik, Tatar, Tongan, Uzbek (Cyrillic), Wolof
Friday, February 11, 2022
Unicode 15.0 Alpha Review
The repertoire for Unicode 15.0 is now open for early review and comment. During alpha review the repertoire is reasonably mature and stable, but is not yet completely locked down. Discussion regarding whether certain characters should be removed from the repertoire for publication is welcome. Character names and code point assignments are reasonably firm, but suggestions for improvement may still be entertained.
This early review is provided so that reviewers may consider character repertoire issues before the start of beta review (currently scheduled for late May 2022). Once beta review begins, the repertoire, code points, and character names will all be locked down and will no longer be subject to change.
Feedback for the alpha review should be reported under PRI #442 using the Unicode contact form by April 5, 2022.
Wednesday, February 9, 2022
Enhancements to Unicode Regular Expressions
A new revision of UTS #18, Unicode Regular Expressions, is now available.
Regular expressions are a key tool in software development. Back in 2000, few regular expression engines supported Unicode, even at a basic level. UTS #18 set out to raise the bar, describing how regular expression engines could be adapted to deal with Unicode correctly and completely. Since that time, major programming languages and libraries have adopted level 1 features (supporting all Unicode literals, basic character properties, subtraction, intersection, ...), and some also adopted some level 2 features (full character properties, grapheme clusters, ...).
The main focus of this release is handling the complement of properties of strings. The revision distinguishes code point complement from full complement, explicitly defines the complement operator [^...] as code point complement, and gives the reasons for doing so in an annex. It also outlines the important difference between [A--B] and [A&&[^B]], setting out why the latter is insufficient to represent set difference.
For the EBNF in general, and for character classes with strings in particular, examples were added and the text clarified. A new annex provides examples for how character classes can be parsed.
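The difference between [A--B] and [A&&[^B]] can be sketched with ordinary Python sets. This is a toy model, not an implementation of UTS #18: the three-letter `UNIVERSE` stands in for the full code point range, and the helper names are hypothetical. The key assumption, taken from the text above, is that [^...] is a code point complement, so its result never contains multi-code-point strings.

```python
# Toy universe of code points; a real engine would cover U+0000..U+10FFFF.
UNIVERSE = frozenset("abc")


def code_point_complement(members):
    """Model of [^...]: single code points not in `members`; never any strings."""
    return frozenset(cp for cp in UNIVERSE if cp not in members)


def difference(a, b):
    """Model of [A--B]: plain set difference, strings included."""
    return frozenset(a) - frozenset(b)


def intersection(a, b):
    """Model of [A&&B]: plain set intersection."""
    return frozenset(a) & frozenset(b)


# A contains two single code points and one two-code-point string.
A = frozenset({"a", "b", "ab"})
B = frozenset({"b"})

true_difference = difference(A, B)
via_complement = intersection(A, code_point_complement(B))

if __name__ == "__main__":
    print(sorted(true_difference))  # keeps the string "ab"
    print(sorted(via_complement))   # the string "ab" is lost
```

In this model the string "ab" survives [A--B] but drops out of [A&&[^B]], because the code point complement of B holds only single code points and so cannot intersect with a string member of A.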