XenCraft: Making e-business work around the world!
Unicode-enabling Checklist for Progress Applications
This checklist enumerates the tasks you should
consider and perform to enable Progress applications to support
Unicode and become multilingual, multinational applications.
The benefits of Unicode-based applications include:
- Simplified deployment: Unicode makes it easier to deploy
applications around the world, since the applications, databases,
and other files do not need to be modified to be compatible with
the local (native) code pages.
- Shared access: Unicode is compatible with all of the code pages
used around the world, so providing shared access to clients
around the world is simplified.
- Multilingual documents and user interface: Unicode supports all
of the languages around the world, so applications can create
multinational and multilingual documents, reports, invoices,
mailing labels, etc.
- Future-proof: All modern technologies (Java, XML, .Net,
etc.) either support or are based on Unicode. The Unicode
Consortium continues to extend and advance Unicode in ways that
are compatible with existing applications while meeting the
requirements of new technologies and advanced linguistic
programming.
For more information on Unicode see the Benefits of Unicode
and the XenCraft Resources page. Note that XenCraft can
assess your applications and provide you with a detailed roadmap of the changes that need to be made
and an estimate of the effort.
Tasks for Estimating and Enabling Unicode
Review
SUBSTRING, OVERLAY, LENGTH units
Ensure each statement uses the correct units for counting: characters, bytes, or columns.
(Also see the tools section below.)
This table shows example characters with their character, byte, and column counts. Some of the values may surprise you. The ligature is
one character, not two, and uses only one column. The euro sign takes 3 bytes to store in UTF-8. The single ideograph character takes 2 columns to
display. The accented letter takes two bytes to store in UTF-8.
[Image: list of example characters and their character, byte, and column counts]
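The three kinds of counts can be reproduced outside Progress. The following is a minimal Python sketch, not Progress 4GL; the sample characters are assumptions chosen to match the descriptions above (the original table image is not available), and the column rule (East Asian wide or fullwidth characters take 2 columns) only approximates how a character terminal or Progress counts columns.

```python
import unicodedata


def columns(text):
    """Approximate display columns: wide/fullwidth characters count 2,
    combining marks count 0, everything else counts 1."""
    total = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue
        total += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return total


def counts(text):
    """Return (characters, UTF-8 bytes, columns) for text."""
    return len(text), len(text.encode("utf-8")), columns(text)


# Hypothetical sample characters, one per row of the table described above.
SAMPLES = {
    "ligature": "\u0153",         # LATIN SMALL LIGATURE OE
    "euro sign": "\u20ac",
    "ideograph": "\u8a9e",        # a CJK unified ideograph
    "accented letter": "\u00e9",  # LATIN SMALL LETTER E WITH ACUTE
}

if __name__ == "__main__":
    for label, ch in SAMPLES.items():
        chars, nbytes, cols = counts(ch)
        print(f"{label}: characters={chars} bytes={nbytes} columns={cols}")
```

The same distinction is what the CHARACTER, RAW, and COLUMN units select in SUBSTRING, OVERLAY, and LENGTH.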
Ensure FORMAT statements have the right width.
Each "X" or other format character represents one column (not one character or one byte).
To display the ideograph in the table above, the FORMAT statement must provide at least 2 columns:
FORMAT "X(2)"
Database conversion
Converting databases to Unicode is straightforward. One method is to dump your existing database schema and data and then load
them into the empty Unicode database from
DLC/prolang/utf/empty.db. (Choose the one with the appropriate block size.)
Make sure to assign the Unicode word break table to the database.
Compile the word break table
DLC/prolang/convmap/utf8-bas.wbt, assigning it a rule number.
proutil -C wbreak-compiler utf8-bas.wbt <rule-#>
Then map the word rule to the Unicode database.
proutil <database> -C word-rules <rule-#>
Finally, rebuild the word indexes.
Alternatively, you can use PROUTIL to convert an existing database to Unicode using the command:
PROUTIL <database> -C Convchar convert UTF-8
Then load the Progress Unicode collation table, DLC/prolang/utf/_tran.df,
and compile and map the word break table as in the first method.
Finally, rebuild all indexes.
Evaluate binary vs. linguistic sorting.
Progress supports only binary collation for Unicode. Collation affects not only the order of results but also
which records a query returns. For example, the query
FOR EACH CUSTOMER WHERE NAME <= "Z" will no longer return names that begin with accented characters when binary sorting is used.
(Accented characters come after the letters A-Z in Unicode.)
A workaround is to use the COMPARE function or the COLLATE phrase to
sort records on the client. (For more information, see
Demonstration of the Progress 4GL COLLATE Phrase.)
A solution is to use the XenCraft XenPUC product to
provide linguistic sorting of indexes and comparisons, so that queries behave consistently with the expectations of natural language.
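The effect of binary collation on a range query can be seen in a few lines. This is an illustrative Python sketch, not Progress 4GL: Python compares strings by code point, which matches binary order for UTF-8. The customer names are made up, and the accent-folding key is only a crude stand-in for linguistic collation; it is not how COLLATE or XenPUC work.

```python
import unicodedata


def fold_accents(text):
    """Strip combining accents so accented letters compare like their base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical customer names.
names = ["Adams", "Young", "\u00c9mile", "\u00c5ngstr\u00f6m"]

# Binary comparison: accented capitals sort after "Z", so they drop out of the range.
binary_matches = [n for n in names if n <= "Z"]

# Crude linguistic approximation: compare on the accent-folded form instead.
folded_matches = [n for n in names if fold_accents(n) <= "Z"]

if __name__ == "__main__":
    print("binary order:   ", sorted(names))
    print("folded order:   ", sorted(names, key=fold_accents))
    print("binary <= 'Z':  ", binary_matches)
    print("folded <= 'Z':  ", folded_matches)
```

With binary comparison only the unaccented names satisfy the range test; with the folded key all four do.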
CHR, ASC
Switching to Unicode requires changing all uses of
CHR with hardcoded values greater than 127.
This is because the code point (the value assigned to the
character) will be different when the code page changes from its
current value to Unicode. A solution is to take advantage of the
code page conversion capabilities of CHR.
For example, if you had specified the euro character (128) in
a client using Windows code page 1252 as
CHR(128), then when you change to Unicode, you can instead
use: CHR(128,"UTF-8","1252"). This
expression creates the Unicode UTF-8 character for the character
that has the value 128 in Windows 1252. You can use the expression
ASC(CHR(128,"UTF-8","1252"),"UTF-8","UTF-8")
to see the 3-byte UTF-8 value for the euro:
14,844,588 (hexadecimal E282AC).
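The arithmetic behind that value can be checked independently of Progress. The sketch below is Python, not 4GL; the helper names are hypothetical and simply mirror what the CHR and ASC conversions above are described as doing.

```python
def chr_from_code_page(value, source_code_page="cp1252"):
    """Interpret a single byte value in a legacy code page and return the Unicode character."""
    return bytes([value]).decode(source_code_page)


def asc_utf8(ch):
    """Return the UTF-8 byte sequence of a character read as one big-endian integer."""
    return int.from_bytes(ch.encode("utf-8"), "big")


euro = chr_from_code_page(128)  # byte 0x80 in Windows-1252 is the euro sign

if __name__ == "__main__":
    print(repr(euro), euro.encode("utf-8").hex(), asc_utf8(euro))
```

Byte 128 in Windows-1252 decodes to U+20AC, whose UTF-8 encoding is E2 82 AC; read as a single integer that is 14,844,588.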
Backslash, Yen, Won, other currency symbols
This issue is not specific to Unicode, but it
is exacerbated in Unicode or multilingual environments. Character
0x5C (92 decimal) is the backslash (reverse solidus) character in
ASCII. However, in many countries this character is used to
represent the local currency symbol. In Japan, it is the Yen
(¥). In Korea, it is the Won (₩). This causes a
problem when converting to Unicode. Should the backslash be
converted to the Unicode character for backslash, Yen, Won, or
something else? The answer depends on whether the character is being
used as a file separator or as one of the currency symbols. For most
Progress applications the right thing is to keep the character as
backslash. But be aware that there may be situations where you need
to handle this character differently. You may also need to watch
for code page conversions that Progress performs for you, and use
a different conversion for this character.
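One way to make the decision explicit is to treat 0x5C per field rather than per code page. The following Python sketch is an assumption about how such a policy might look, not a description of what Progress does: the function name and the idea of passing a currency symbol for currency fields are hypothetical. It relies on the cp932 and cp949 codecs, which map byte 0x5C to the Unicode backslash.

```python
YEN_SIGN = "\u00a5"
WON_SIGN = "\u20a9"


def decode_field(raw, code_page, currency_symbol=None):
    """Decode legacy bytes to Unicode.

    By default 0x5C stays a backslash (right for file paths). For fields known
    to hold currency amounts, pass the symbol that 0x5C stood for in that locale.
    """
    text = raw.decode(code_page)
    if currency_symbol is not None:
        text = text.replace("\\", currency_symbol)
    return text


if __name__ == "__main__":
    print(decode_field(b"C:\\temp\\report.txt", "cp932"))   # path: keep backslash
    print(decode_field(b"\\1,000", "cp932", YEN_SIGN))      # Japanese price field
    print(decode_field(b"\\5,000", "cp949", WON_SIGN))      # Korean price field
```

The point of the design is that the choice is driven by what the field means, which a code page conversion alone cannot know.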
China market, GB18030
Any new applications sold in China after Sept. 2001
are supposed by law to support the Chinese code page
GB18030.
This code page covers the same characters as Unicode but organizes
them differently. Most applications support this code page by
converting text from GB18030 to Unicode when reading and from
Unicode to GB18030 when writing. Progress does
not support this at the
moment.
If you are going to sell in the Chinese market,
contact XenCraft to inquire about solutions.
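The convert-at-the-boundary approach described above can be sketched as follows. This is Python using its built-in gb18030 codec, shown only to illustrate the idea; it says nothing about how a Progress-based solution would be built, and the function names are hypothetical.

```python
def read_gb18030(raw):
    """Convert GB18030 bytes from a file or network source into Unicode text."""
    return raw.decode("gb18030")


def write_gb18030(text):
    """Convert Unicode text into GB18030 bytes for output."""
    return text.encode("gb18030")


# Sample text: two common ideographs, the euro sign, and a supplementary-plane ideograph.
sample = "\u4e2d\u6587 \u20ac \U00020000"

if __name__ == "__main__":
    encoded = write_gb18030(sample)
    print(encoded.hex())
    print(read_gb18030(encoded) == sample)
```

Common ideographs take two bytes in GB18030, while characters outside the older GBK range, such as supplementary-plane ideographs, take four.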
Font
You will need a font (or fonts) that can support
all of the languages you intend to use in your application.
Progress 4GL character-based reporting requires fixed-width fonts.
Tools that take advantage of the GUI features of Progress can use variable-width (proportional) fonts.
In addition, Progress character-based reporting requires
that Japanese, Chinese, and Korean characters
(ideographs) be exactly twice the width of Latin
characters. Fonts with this property are sometimes called
duospace. Many Unicode fonts are proportional rather than
fixed width, so character widths vary. Also, many "fixed width"
fonts make Asian ideographs either the same width as
the Latin characters or 1.5 times the Latin width, instead of twice the width.
A font that XenCraft and Progress recommend for Unicode reports is the
"Andale Duospaced WT" WorldType font by Agfa Monotype Corporation.
Phone (in U.S.A.): +1 800-666-6897
Email: sales@agfamonotype.com
Tell 'em Tex sent ya!
Font vendors that support Unicode with a duospace font are welcome to contact XenCraft
about supporting the Progress community.
Font height
If you have not supported Chinese, Japanese, and
Korean characters before, you may find that you need to increase the font size to
make the text readable, with enough pixels per character to
distinguish one character from another. In general, font sizes
should be increased by 2 points to support Asian ideographs where
only Latin characters were supported before. This can require
changing reports and the user interface to accommodate the increased
height of each row.
Multilingual data entry
Consider whether you need to provide any tools to assist users in
entering multilingual data. In practice, users of most business
applications do not enter data in a wide variety of languages,
especially languages that require distinctly different
keyboards or input methods. Also, recent versions of operating
systems make it possible, and even easy, to enter text in any
language (perhaps after some special installation and
configuration). Windows 2000 and XP, for example, can do this.
You should consider how your application will be deployed and the
relevant usage scenarios. Determine whether you need to provide
information to your users about configuring their systems for
additional languages, and whether the operating system's data entry capabilities are sufficient or you should provide tools to
simplify the entry of text in different languages or make it more efficient.
Note also that some applications are designed
to use hot keys or certain characters (keyboard fingerings) to
make operation very efficient. When the application is deployed
in new regional markets, or supports multilingual data entry,
new keyboards or input methods can conflict with the
original design, and new hot keys and key fingerings may need to
be designed in to maintain data entry productivity.
Byte Order Mark (BOM)
When importing or exporting Unicode plain text,
you should decide whether you need to filter out BOMs on import or
prepend them on export. You will need to identify the places where you
import or export Unicode text, whether it is plain text or a higher-level
protocol (e.g. XML), and how you will decide whether
to remove or prepend a BOM. You should take into account the
conventions for UTF-16BE and UTF-16LE. For internal processing,
plain text should never contain BOMs, since it is then very
difficult to perform string operations while knowing precisely
whether BOMs should be added or removed.
See Progress 4GL and the Unicode Byte Order Mark (BOM) for more information.
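The import and export policy can be kept in two small boundary functions so that no BOM ever reaches internal strings. The following is a minimal Python sketch under stated assumptions: the function names are hypothetical, only UTF-8 and UTF-16 BOMs are recognized (UTF-32 is deliberately not handled), and input without a BOM is assumed to be UTF-8 unless the caller says otherwise.

```python
import codecs

# BOM byte sequences this sketch recognizes, keyed by the encoding they announce.
BOMS = {
    "utf-8": codecs.BOM_UTF8,
    "utf-16-be": codecs.BOM_UTF16_BE,
    "utf-16-le": codecs.BOM_UTF16_LE,
}


def import_text(raw, default_encoding="utf-8"):
    """Decode incoming bytes, using and then discarding a leading BOM if present."""
    for encoding, bom in BOMS.items():
        if raw.startswith(bom):
            return raw[len(bom):].decode(encoding)
    return raw.decode(default_encoding)


def export_text(text, encoding="utf-8", with_bom=False):
    """Encode outgoing text, prepending a BOM only when the consumer expects one."""
    payload = text.encode(encoding)
    return BOMS[encoding] + payload if with_bom else payload


if __name__ == "__main__":
    data = export_text("\u20ac100", "utf-16-le", with_bom=True)
    print(data.hex())
    print(import_text(data))
```

Whether `with_bom` should be true depends on the receiving protocol or tool, which is the decision the checklist item asks you to make for each import and export point.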
Normalization
Unicode has more than one way to represent many
characters. For example, most of the accented Latin characters
can be represented either as a single composed character or as a base
character plus a separate combining accent character, which
together make the accented version of the base
character. For example, Å (Letter A with ring
above) can be written with the single Unicode character U+00C5.
It can also be written as the base letter A (U+0041) plus the
separate combining character ˚ (Combining ring above, U+030A).
There are often other reasons for having characters or symbols
duplicated within Unicode. Sometimes there are slight differences
in meaning for each character or for certain applications there
is significance in the difference. For other applications, there
is no significance, and the duplicated characters should be
treated the same. For example, there is another character
Å called "Angstrom" (U+212B), which looks identical to A
with ring above.
Consider what this means for searches in your
database. Every search for "A with ring above" has to
look for all 3 variations (U+00C5, U+212B, and the 2-character
string U+0041, U+030A). To simplify searching and comparing, the
Unicode standard defines normalization rules
(Unicode Standard Annex #15, Unicode Normalization Forms),
which map the duplicated characters and the different encoding
schemes to a reduced set of characters with a preferred encoding scheme.
Whether you are likely to encounter character variations depends
on the sources of your data. The greater the variety of
input methods and applications integrated into your environment,
the greater the likelihood of differences.
To reduce or eliminate the problems due to these variations
you should consider employing normalization algorithms across the
data as it comes in.
The alternative is to change the search algorithms to
make the variations equivalent, but that would entail a
performance cost and in any event you are not free to change how
Progress compares, searches and indexes text.
The World Wide
Web Consortium is defining "Normalization Form C" (NFC) as the
standard for the web in its Character Model
document. Progress users should likewise consider using
"Normalization Form C" (NFC) for the majority of business
applications.
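The three spellings of "A with ring above" discussed earlier can be used to see what normalizing incoming data does. This is a Python sketch using the standard unicodedata module, shown for illustration only; the function name is hypothetical, and NFC is used by default because it is the form the W3C Character Model is described as adopting.

```python
import unicodedata

# The three representations from the text: composed letter, Angstrom sign,
# and base letter plus combining ring above.
variants = ["\u00c5", "\u212b", "A\u030a"]


def normalize_incoming(text, form="NFC"):
    """Normalize text as it enters the system so stored data uses one representation."""
    return unicodedata.normalize(form, text)


normalized = [normalize_incoming(v) for v in variants]

if __name__ == "__main__":
    for before, after in zip(variants, normalized):
        codes_before = " ".join(f"U+{ord(c):04X}" for c in before)
        codes_after = " ".join(f"U+{ord(c):04X}" for c in after)
        print(f"{codes_before:>14} -> {codes_after}")
```

After normalization all three inputs are the single code point U+00C5, so one stored form and one search key are enough.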
If you need help deciding on normalization
rules, or implementing them, XenCraft can
help you.
Progress components
Not all Progress components are Unicode-enabled.
Currently, the dataservers and the GUI, Character, and Web clients
do not support Unicode. The Batch client, WebSpeed Agent,
AppServer, Database Server, SQL-92 Server, and Database do support
Unicode. Using the non-Unicode clients with Unicode databases
requires applying some special restrictions on record processing.
(Either ensure read-only access or restrict updatable records to
characters that the client supports.)
Workarounds include using Java or ActiveX clients (which are inherently Unicode).
If you need Unicode client support,
XenCraft can help you.
For example, we can provide you with a Unicode-based Progress GUI client.
Third party components
You will need to review your third party
components (OCXs, DLLs, reporting tools, etc.) for support for
Unicode or at least support for the languages and cultures you
will be marketing to. It is often the case that products claiming
Unicode support do not support all regional markets and this
needs to be investigated carefully.
Unicode version
The latest release of Unicode (3.2) supports
95,000+ characters. Progress does not support the latest
release yet, but supports most of the characters that Progress
business applications will need. However, there is potential for
integration problems with technologies that support later
versions of Unicode or if the application is to be used in
certain regional markets with advanced language and character
requirements. XenCraft can help you with
version discrepancies.
Project Estimation
The majority of the changes needed to enable an
application to support Unicode involve reviewing statements
or functions that use SUBSTRING, OVERLAY, or LENGTH
and ensuring the correct units are used for counting. It
is a very crude estimate, but you might guess that for every 1000
lines of code there will be 2 of these statements. Of course, a
number of the statements may not need changing if the default
unit, CHARACTER, is correct for the
way the statements are used. However, if you are moving into new
regional markets, there can be many other aspects of
internationalization to consider beyond Unicode-enabling. XenCraft can assess your applications and
provide you with a detailed roadmap of the changes that need to
be made and an estimate of the effort.
-checkdbe
Optionally, specify -checkdbe on the client command line. When -checkdbe is specified,
compile listings will highlight all uses of
SUBSTRING, OVERLAY, and LENGTH that do not explicitly specify
RAW, COLUMN, CHARACTER, or FIXED.
Prolint, Proparse
Globalshared's Prolint is a free tool for automated source code review.
It examines Progress source files for bad programming practices and violations of coding standards. Prolint is built on top of
Joanju's Proparse, a parser for Progress source. Prolint
can detect and report on SUBSTRING, OVERLAY, and LENGTH
statements that do not explicitly specify
RAW, COLUMN, CHARACTER, or FIXED.
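For a rough first count before running either tool, a text scan can approximate the same check. The Python sketch below is an assumption-laden illustration, not a substitute for -checkdbe or Prolint: it does not parse 4GL, it does not skip comments or string literals, it misses abbreviated keywords, and it only recognizes a unit given as a quoted literal in the last argument.

```python
import re

UNIT_TYPES = {"CHARACTER", "RAW", "COLUMN", "FIXED"}
CALL_START = re.compile(r"\b(SUBSTRING|OVERLAY|LENGTH)\s*\(", re.IGNORECASE)


def split_arguments(source, open_paren):
    """Return the top-level arguments of the call whose '(' is at open_paren,
    or None if the parentheses never balance."""
    args, current, depth, quote = [], [], 0, None
    for ch in source[open_paren:]:
        if quote:                      # inside a quoted string: copy until it closes
            current.append(ch)
            if ch == quote:
                quote = None
        elif ch in "\"'":
            quote = ch
            current.append(ch)
        elif ch == "(":
            depth += 1
            if depth > 1:              # keep nested parentheses as part of the argument
                current.append(ch)
        elif ch == ")":
            depth -= 1
            if depth == 0:
                args.append("".join(current).strip())
                return args
            current.append(ch)
        elif ch == "," and depth == 1:
            args.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    return None


def find_unqualified_calls(source):
    """Return (line number, keyword) for each call lacking an explicit unit literal."""
    findings = []
    for match in CALL_START.finditer(source):
        args = split_arguments(source, match.end() - 1)
        if not args:
            continue
        last = args[-1].strip("\"'").upper()
        if last not in UNIT_TYPES:
            line = source.count("\n", 0, match.start()) + 1
            findings.append((line, match.group(1).upper()))
    return findings


if __name__ == "__main__":
    sample = 'x = SUBSTRING(name, 1, 3).\ny = LENGTH(name, "RAW").\n'
    print(find_unqualified_calls(sample))
```

Running it over a source tree gives a count to compare against the crude estimate in the Project Estimation section; each flagged line still needs a human decision about which unit is correct.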
Copyright © 2002
XenCraft. All rights reserved.