Unicode-enabling Checklist for Progress Applications

This checklist enumerates the tasks you should consider and perform to enable Progress applications to support Unicode and become multilingual, multinational applications. The benefits of Unicode-based applications include:

Simplified deployment: Using Unicode makes it easier to deploy applications around the world, since the applications, databases, and other files do not need to be modified to be compatible with the local (native) code pages.
Shared access: Unicode is compatible with all of the code pages used around the world, so providing shared access to clients around the world is simplified.
Multilingual documents and user interface: Unicode supports all of the languages of the world, so applications can create multinational and multilingual documents, reports, invoices, mailing labels, etc.
Future-proof: All modern technologies (Java, XML, .NET, etc.) either support or are based on Unicode. The Unicode Consortium continues to extend and advance Unicode in ways that are compatible with existing applications while meeting the requirements of new technologies and advanced linguistic programming.

For more information on Unicode see the Benefits of Unicode and the XenCraft Resources page. Note that XenCraft can assess your applications and provide you with a detailed roadmap of the changes that need to be made and an estimate of the effort.

Tasks for Estimating and Enabling Unicode

Review SUBSTRING, OVERLAY, LENGTH units

Ensure each statement uses the correct units for counting: characters, bytes, or columns. (Also see the tools section below.)

This table shows example characters and their character, byte, and column counts. Some of the values may surprise you. The ligature is one character, not two, and uses only one column. The euro sign takes 3 bytes to store. The single ideograph character takes 2 columns to display. The accented letter takes two bytes to store.
[Image: list of example characters and their character, byte, and column counts]
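As a rough illustration of the three units, the following Python sketch counts characters, UTF-8 bytes, and display columns. Python is used here only as a neutral way to show the arithmetic; it is not Progress 4GL. The sample characters and the helper name unit_counts are assumptions chosen for illustration, since the original table is an image. The column rule (East Asian wide or fullwidth characters take two columns, everything else one) is a simplification.

```python
import unicodedata

def unit_counts(text):
    """Return (characters, UTF-8 bytes, display columns) for text."""
    characters = len(text)                     # Unicode code points
    utf8_bytes = len(text.encode("utf-8"))     # storage size in UTF-8
    # Simplified column rule: wide (W) and fullwidth (F) characters use two columns.
    columns = sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )
    return characters, utf8_bytes, columns

# Hypothetical sample characters, one per row type described in the text.
samples = {
    "ligature": "\u0153",          # LATIN SMALL LIGATURE OE
    "euro sign": "\u20ac",
    "ideograph": "\u4e2d",
    "accented letter": "\u00e9",   # precomposed e with acute
}

for label, ch in samples.items():
    print(label, unit_counts(ch))
```

The same string can therefore give three different answers depending on the unit, which is why each SUBSTRING, OVERLAY, and LENGTH call needs to be checked individually.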

Ensure FORMAT specifications have the right width. Each "X" or other format character represents one column (as opposed to a character count or byte count). To display the ideograph in the table above, the format must provide at least 2 columns, for example FORMAT "X(2)".

Database conversion

Converting databases to Unicode is easy. You can first dump your existing database schema and data and then load them into the empty Unicode database from DLC/prolang/utf/empty.db. (Choose the one with the appropriate block size.) Make sure to assign the Unicode word-break table to the database. Compile the word-break table DLC/prolang/convmap/utf8-bas.wbt, assigning it a rule number:
proutil -C wbreak-compiler utf8-bas.wbt <rule-#>
Then map the word rule to the Unicode database:
proutil <database> -C word-rules <rule-#>
Finally, rebuild the word indexes.

Alternatively, you can use PROUTIL to convert an existing database to Unicode using the command:
PROUTIL <database> -C Convchar convert UTF-8
Then load the Progress Unicode collation table DLC/prolang/utf/_tran.df, and compile and map the word-break rule table as in the first method.
Finally, rebuild all indexes.

Evaluate binary vs. linguistic sorting

Progress supports only binary collation for Unicode. Collation affects not only the order of query results but also which records they contain. For example, the query FOR EACH CUSTOMER WHERE NAME <= "Z" will no longer return names that begin with accented characters when binary sorting is used. (Accented characters come after the letters A-Z in Unicode.)
A workaround is to use the COMPARE function or the COLLATE phrase to sort records on the client. (For more information, see Demonstration of the Progress 4GL COLLATE Phrase.)
A solution is to use the XenCraft XenPUC product to provide linguistic sorting of indexes and comparisons, so that queries behave consistently with the expectations of natural language.
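The effect of binary collation on a range query can be sketched outside Progress. The Python example below (an illustration only, not Progress 4GL) compares strings by code point, which is what binary collation does. The customer names and the helper base_letters are hypothetical, and the accent-stripping comparison is only a crude stand-in for linguistic collation, not what COMPARE, COLLATE, or XenPUC actually do.

```python
import unicodedata

# Hypothetical customer names; the third begins with U+00C9 (E with acute).
names = ["Adams", "Miller", "\u00c9douard", "Zimmer"]

# Binary collation: strings compare by code point, so U+00C9 sorts after "Z" (U+005A).
binary_order = sorted(names)
binary_match = [n for n in names if n <= "Z"]

def base_letters(text):
    """Crude stand-in for linguistic comparison: drop combining accents."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Comparing on the accent-free key brings the accented name back into range.
accent_blind_match = [n for n in names if base_letters(n) <= "Z"]

print(binary_order)
print(binary_match)
print(accent_blind_match)
```

With code point comparison the accented name sorts last and drops out of the `<= "Z"` range, while the accent-free key keeps it in.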

CHR, ASC

Switching to Unicode requires changing every use of CHR with a hardcoded value greater than 127. This is because the code point (the value assigned to the character) is different once the code page changes from its current value to Unicode. A solution is to take advantage of the code page conversion capabilities of CHR. For example, if you specified the euro character in a client using Windows code page 1252 as CHR(128), then after changing to Unicode you can instead use CHR(128,"UTF-8","1252"). This expression creates the Unicode UTF-8 character for the character that has the value 128 in Windows 1252. You can use ASC(CHR(128,"UTF-8","1252"),"UTF-8","UTF-8") to see the value of the euro's 3-byte UTF-8 encoding: 14,844,588 (0xE282AC).
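The arithmetic behind the value 14,844,588 can be checked with a short Python sketch (an illustration of the code page conversion only, not Progress 4GL). It assumes that the integer reported for a multi-byte character is simply its UTF-8 bytes read as one big-endian number, which is consistent with the figure quoted above.

```python
# Byte 128 (0x80) in Windows code page 1252 is the euro sign.
euro = bytes([128]).decode("cp1252")

# The same character stored as UTF-8 takes three bytes: E2 82 AC.
utf8_bytes = euro.encode("utf-8")

# Reading those three bytes as one big-endian integer gives the value quoted above.
as_integer = int.from_bytes(utf8_bytes, "big")

print(hex(ord(euro)), utf8_bytes.hex(), as_integer)
```

Note that the Unicode code point of the euro (U+20AC) differs from both the code page 1252 value (128) and the integer form of its UTF-8 bytes, which is why hardcoded values above 127 cannot simply be carried over.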

Backslash, Yen, Won, other currency symbols

This issue is not specific to Unicode, but it is exacerbated in Unicode or multilingual environments. Character 0x5C (92 decimal) is the backslash (reverse solidus) in ASCII. However, in many countries this code is used to represent the local currency symbol. In Japan, it is the Yen (¥). In Korea, it is the Won (₩). This causes a problem when converting to Unicode: should 0x5C be converted to the Unicode character for backslash, Yen, Won, or something else? The answer depends on whether the character is being used as a file separator or as a currency symbol. For most Progress applications the right thing is to keep the character as backslash. But be aware that there may be situations where you need to handle this character differently. You may also need to watch for code page conversions that Progress performs for you, and use a different conversion for this character.

China market, GB18030

Any new application sold in China after September 2001 is supposed, by law, to support the Chinese code page GB18030. This code page covers the same character repertoire as Unicode but organizes the codes differently. Most applications support it by converting text from GB18030 to Unicode when reading and from Unicode to GB18030 when writing. Progress does not support this at the moment. If you are going to sell in the Chinese market, contact XenCraft to inquire about solutions.
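The convert-at-the-boundary approach described above can be sketched in Python, whose standard library includes a gb18030 codec. This is an illustration of the general pattern, not a Progress solution; the function names read_gb18030 and write_gb18030 are hypothetical.

```python
def read_gb18030(data: bytes) -> str:
    """Convert GB18030 bytes to Unicode text when reading."""
    return data.decode("gb18030")

def write_gb18030(text: str) -> bytes:
    """Convert Unicode text to GB18030 bytes when writing."""
    return text.encode("gb18030")

# Two ideographs, a space, and the euro sign.
text = "\u4e2d\u6587 \u20ac"
encoded = write_gb18030(text)

# The round trip is lossless, but the bytes differ from the UTF-8 form.
print(read_gb18030(encoded) == text)
print(encoded.hex(), text.encode("utf-8").hex())
```

Internally the application keeps working in Unicode; only the import and export paths need to know about GB18030.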

Font

You will need a font (or fonts) that can support all of the languages you intend to use in your application. Progress 4GL character-based reporting requires fixed-width fonts. Tools that take advantage of the GUI features of Progress can use variable-width (proportional) fonts. In addition, Progress character-based reporting requires Japanese, Chinese, and Korean characters (ideographs) to be exactly twice the width of Latin characters. Fonts with this property are sometimes called duospace. Many Unicode fonts are proportional rather than fixed width, so character widths vary. Also, many "fixed width" fonts make Asian ideographs either the same width as the Latin characters or 1.5 times the Latin width, instead of twice the width. A font that XenCraft and Progress recommend for Unicode reports is the "Andale Duospaced WT" WorldType font by:
Agfa Monotype Corporation
Phone (in U.S.A.): +1 800-666-6897
Email: sales@agfamonotype.com
Tell 'em Tex sent ya!

Font vendors that support Unicode with a duospace font are welcome to contact XenCraft about supporting the Progress community.

Font height

If you have not supported Chinese, Japanese, and Korean characters before, you may find that you need to increase the font size to keep the text readable, with enough pixels per character to distinguish one character from another. In general, font sizes should be increased by 2 points to support Asian ideographs where only Latin characters were supported before. This can require changing reports and the user interface to accommodate the increased height of each row.

Multilingual data entry

Consider whether you need to provide any tools to assist users in entering multilingual data. In practice, users of most business applications do not enter data in a wide variety of languages, especially languages that require distinctly different keyboards or input methods. Also, recent versions of operating systems make it possible, and even easy, to enter text in any language (perhaps after some special installation and configuration). Windows 2000 and XP, for example, can do this.

You should consider how your application will be deployed and the relevant usage scenarios. Determine if you need to provide information to your users about configuring their system for additional languages, and whether the operating system data entry capabilities are sufficient or if you should provide tools to simplify or make more efficient the entry of text in different languages.

Note also that some applications are designed to use hot keys or certain characters (keyboard fingerings) to make operation very efficient. When the application is deployed in new regional markets, or supports multilingual data entry, support for new keyboards or input methods can conflict with the original design, and new hot keys and key fingerings may need to be designed to maintain data entry productivity.

Byte Order Mark (BOM)

When importing or exporting Unicode plain text, you should decide whether to filter out BOMs on import and whether to prepend them on export. You will need to identify the places where you import or export Unicode text, whether it is plain text or a higher-level protocol (e.g. XML), and how you will decide whether to remove or prepend a BOM. You should take into account the conventions for UTF-16BE and UTF-16LE. For internal processing, plain text should never contain BOMs, since otherwise it is very difficult to perform string operations while knowing precisely whether BOMs should be added or removed.

See Progress 4GL and the Unicode Byte Order Mark (BOM) for more information.
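The import/export policy described above can be sketched in Python (an illustration of the pattern, not Progress 4GL). The function names import_text and export_text and the with_bom flag are hypothetical; the sketch handles UTF-8 only and assumes the caller decides per destination whether a BOM is wanted.

```python
import codecs

def import_text(data: bytes) -> str:
    """Decode UTF-8 input, dropping a leading BOM so internal text never carries one."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")

def export_text(text: str, with_bom: bool) -> bytes:
    """Encode text as UTF-8, prepending a BOM only when the destination expects one."""
    prefix = codecs.BOM_UTF8 if with_bom else b""
    return prefix + text.encode("utf-8")

# The BOM is removed at the boundary and restored only on request.
internal = import_text(codecs.BOM_UTF8 + b"invoice")
print(internal, export_text(internal, with_bom=True), export_text(internal, with_bom=False))
```

Keeping BOM handling in these two boundary functions means string operations in between never have to ask whether a BOM is present.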

Normalization

Unicode has more than one way to represent many characters. For example, most accented Latin characters can be represented either as a single precomposed character or as a base character followed by a separate combining accent character, which together make the accented version of the base character. For example, Å (Letter A with ring above) can be written as the single Unicode character U+00C5. It can also be written as the base letter A (U+0041) followed by the separate character ˚ (Combining ring above), U+030A.

There are often other reasons for having characters or symbols duplicated within Unicode. Sometimes there are slight differences in meaning for each character or for certain applications there is significance in the difference. For other applications, there is no significance, and the duplicated characters should be treated the same. For example, there is another character Å called "Angstrom" (U+212B), which looks identical to A with ring above.

Consider what this means for searches in your database. Every search for "A with ring above" has to look for all 3 variations (U+00C5, U+212B, and the 2-character string U+0041, U+030A). To simplify searching and comparing, the Unicode standard defines normalization rules (Unicode Standard Annex #15, Unicode Normalization Forms), which map the duplicated characters and the different encoding schemes to a reduced set of characters with a preferred encoding scheme.
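The three variations and their normalized forms can be checked with Python's standard unicodedata module (an illustration only; it says nothing about what Progress provides). Under Normalization Form C all three collapse to the single character U+00C5, and under Form D all three become the two-character sequence U+0041, U+030A.

```python
import unicodedata

# The three representations discussed above.
variants = [
    "\u00c5",     # LATIN CAPITAL LETTER A WITH RING ABOVE
    "\u212b",     # ANGSTROM SIGN
    "A\u030a",    # base letter A followed by COMBINING RING ABOVE
]

# Without normalization the three strings are all different.
distinct_raw = len(set(variants))

# After normalization each form leaves exactly one representation.
nfc_forms = {unicodedata.normalize("NFC", v) for v in variants}
nfd_forms = {unicodedata.normalize("NFD", v) for v in variants}

print(distinct_raw, nfc_forms, nfd_forms)
```

Normalizing data as it enters the system means a search only has to match one representation instead of three.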

Whether you are likely to encounter character variations depends on the sources of your data. The greater the variety of both the input methods and applications integrated into your environment, the greater the likelihood of having differences. To reduce or eliminate the problems due to these variations, you should consider applying normalization algorithms to the data as it comes in. The alternative is to change the search algorithms to make the variations equivalent, but that would entail a performance cost, and in any event you are not free to change how Progress compares, searches, and indexes text.

The World Wide Web Consortium is defining "Normalization Form C" (NFC) as the standard for the web in its Character Model document. Progress users should consider using "Normalization Form C" (NFC) for the majority of business applications.
If you need help deciding on normalization rules, or implementing them, XenCraft can help you.

Progress components

Not all Progress components are Unicode-enabled. Currently, the DataServers and the GUI, Character, and Web clients do not support Unicode. The Batch client, WebSpeed Agent, AppServer, Database Server, SQL-92 Server, and Database do support Unicode. Using the non-Unicode clients with Unicode databases requires applying some special restrictions on record processing. (Either ensure read-only access or restrict updatable records to characters that the client supports.)
Workarounds include using Java or ActiveX clients (which are inherently Unicode).
If you need Unicode client support, XenCraft can help you. For example, we can provide you with a Unicode-based Progress GUI client.

Third party components

You will need to review your third-party components (OCXs, DLLs, reporting tools, etc.) for Unicode support, or at least for support of the languages and cultures you will be marketing to. Products claiming Unicode support often do not support all regional markets, so this needs to be investigated carefully.

Unicode version

The latest release of Unicode (3.2) supports 95,000+ characters. Progress does not yet support the latest release, but it supports most of the characters that Progress business applications will need. However, there is potential for integration problems with technologies that support later versions of Unicode, or if the application is to be used in certain regional markets with advanced language and character requirements. XenCraft can help you with version discrepancies.

Project Estimation

The majority of the changes needed to enable an application to support Unicode involve reviewing statements or functions that use SUBSTRING, OVERLAY, or LENGTH and ensuring that the correct units are used for counting. It is a very crude estimate, but you might guess that for every 1000 lines of code there will be 2 of these statements. Of course, a number of the statements may not need changing if the default unit, CHARACTER, is correct for the way the statements are used. However, if you are moving into new regional markets, there can be many other aspects of internationalization to consider beyond Unicode-enabling. XenCraft can assess your applications and provide you with a detailed roadmap of the changes that need to be made and an estimate of the effort.

Unicode-enabling Tools

-checkdbe

Optionally, specify -checkdbe on the client command line. When -checkdbe is specified, compile listings highlight all uses of SUBSTRING, OVERLAY, and LENGTH that do not explicitly specify RAW, COLUMN, CHARACTER, or FIXED.

Prolint, Proparse

Globalshared's Prolint is a free tool for automated source code review. It examines Progress source files for bad programming practices and violations of coding standards. Prolint is built on top of Joanju's Proparse, a parser for Progress source. Prolint can detect and report on SUBSTRING, OVERLAY, and LENGTH statements that do not explicitly specify RAW, COLUMN, CHARACTER, or FIXED.
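For a first, very rough count before running the tools above, a simple text scan can approximate the same check. The Python sketch below is an assumption-laden illustration and is not how -checkdbe or Prolint work: it uses a regular expression rather than a real parser, ignores comments, abbreviated keywords, and parentheses inside string literals, and treats a call as having an explicit unit only when its last argument is one of the four quoted unit names. The function names are hypothetical.

```python
import re

# A call to one of the three functions, not preceded by a word character or hyphen.
CALL = re.compile(r"(?<![\w-])(SUBSTRING|OVERLAY|LENGTH)\s*\(", re.IGNORECASE)
# A quoted unit name as the final argument of the call.
UNIT_ARG = re.compile(r"""["'](?:RAW|COLUMN|CHARACTER|FIXED)["']\s*$""", re.IGNORECASE)

def call_arguments(source, open_paren):
    """Return the text between the parenthesis at open_paren and its match."""
    depth = 0
    for i in range(open_paren, len(source)):
        if source[i] == "(":
            depth += 1
        elif source[i] == ")":
            depth -= 1
            if depth == 0:
                return source[open_paren + 1:i]
    return source[open_paren + 1:]  # unbalanced input: take the rest

def calls_without_unit(source):
    """List (line number, function name) for calls lacking an explicit unit."""
    findings = []
    for match in CALL.finditer(source):
        args = call_arguments(source, match.end() - 1)
        if not UNIT_ARG.search(args.strip()):
            line = source.count("\n", 0, match.start()) + 1
            findings.append((line, match.group(1).upper()))
    return findings

# Hypothetical 4GL fragment used only to exercise the scan.
sample = (
    'x = LENGTH(name).\n'
    'y = SUBSTRING(name, 1, 3, "RAW").\n'
    'z = LENGTH(SUBSTRING(name, 1, 2, "character")).\n'
)
print(calls_without_unit(sample))
```

Each reported line still needs a human decision about whether the default unit, CHARACTER, is the right one for that statement.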