DEV-10: Supporting Multiple Languages In Your Application

Salvador Viñals, Consultant Product Manager


© 2006 Progress Software Corporation

Agenda

International support with OpenEdge® 10

OpenEdge internationalization update

• GB18030

• Sorting and Collations

• Unicode Normalization

• Default word-break tables and double-byte

For more information, go to…

Summary

This presentation includes annotations with additional, complementary information


Code-Pages and Unicode

Code-pages
• Many code-pages
• Max 255 characters each
• Each with regionally-limited repertoire of characters

Unicode
• Uni code = One
• Uni code = Universal
• Virtually all the world's characters
• Distinguishes characters by script, but not by language

UTF-8, UTF-16, UTF-32
• Unicode binary representations (8, 16, 32 bits)
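To make the "8, 16, 32 bits" point concrete, here is a minimal sketch showing how many bytes the same character occupies in each Unicode encoding. It is written in Python purely for illustration (it is not OpenEdge ABL), and the sample characters and the helper name encoded_lengths are assumptions chosen for this example.

```python
# Illustrative only: byte length of one character in each Unicode encoding.
# The "-le" codec variants are used so that no byte-order mark is counted.

def encoded_lengths(ch):
    """Return the number of bytes needed for ch in UTF-8, UTF-16 and UTF-32."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

# Sample characters (assumed for this example): ASCII, accented Latin,
# a CJK ideograph, and a character outside the Basic Multilingual Plane.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

UTF-8 and UTF-16 are variable-width (1 to 4 bytes, and 2 or 4 bytes, respectively), while UTF-32 always uses 4 bytes per character.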


OpenEdge Products

OpenEdge 10 products support UTF-8 (Unicode)
• Database (Personal, Workgroup, Enterprise)
• Application Servers [AppServer, WebSpeed] (Basic, Enterprise)
• GUI Clients (Client Networking, WebClient) and Batch Client

Exceptions
• Character Client and DataServers: Use code-pages instead

Code-pages and Unicode can interoperate

International readiness


Configurations

[Diagram: encodings supported by each component]

• OpenEdge Application Servers (AppServer™, WebSpeed®): UTF-8 or code-pages
• OE Batch Client: UTF-8 or code-pages
• OpenEdge RDBMS: UTF-8 or code-pages
• OpenEdge DataServers (Oracle, MS SQL, ODBC): code-pages
• Web Service Client: UTF-8
• GUI client: UTF-8 or code-pages
• Character client: code-pages
• SQL Clients: UTF-8


Translation Products

Translation Manager (TranMan)

Visual Translator (VisTran)

Products life cycle
• Progress V9 – Functionally Stable
• OpenEdge 10 – Active

TranMan and VisTran run on Windows only, but they can be used to manage translations of both ChUI and GUI applications.









Support for GB18030 Code Page

Chinese code page

Required for all new software sold in mainland China



Why is this code page unique?
• Does not fit into lead-byte / trail-byte model
• It has 1, 2, and 4 byte characters
• Cannot tell from lead-byte if there are 2 or 4 bytes in the character



Supported by making conversions of the GB18030 code page to and from UTF-8
• Requires cpinternal to be UTF-8
  – No cpinternal for GB18030
• Reading and writing a file in GB18030
  – Converts to/from UTF-8
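The lead-byte ambiguity and the convert-through-UTF-8 approach can be sketched as follows. This is an illustration in Python, not OpenEdge code; the helper names gb18030_char_length and split_gb18030 are hypothetical, the sample characters are assumptions, and the sketch assumes well-formed GB18030 input.

```python
# Illustrative only: GB18030 characters are 1, 2 or 4 bytes long, and the
# length of a multi-byte character is decided by its SECOND byte, not its lead byte.

def gb18030_char_length(data, i):
    """Length in bytes of the GB18030 character starting at data[i] (valid input assumed)."""
    if data[i] <= 0x7F:
        return 1                      # single-byte (ASCII) character
    # Second byte 0x30-0x39 (an ASCII digit) marks a 4-byte character;
    # any other valid second byte means a 2-byte character.
    return 4 if 0x30 <= data[i + 1] <= 0x39 else 2

def split_gb18030(data):
    """Split a GB18030 byte string into one bytes object per character."""
    chars, i = [], 0
    while i < len(data):
        n = gb18030_char_length(data, i)
        chars.append(data[i:i + n])
        i += n
    return chars

# One character of each length (assumed samples): "A", U+4E02 and U+0080.
text = "A\u4e02\u0080"
encoded = text.encode("gb18030")
pieces = split_gb18030(encoded)
print([p.hex() for p in pieces])

# The 2-byte and the 4-byte character share the same lead byte.
print(pieces[1][0] == pieces[2][0])

# Reading GB18030 data and holding it internally as UTF-8, then writing it back.
as_utf8 = encoded.decode("gb18030").encode("utf-8")
round_trip = as_utf8.decode("utf-8").encode("gb18030")
print(round_trip == encoded)
```

Because the lead byte alone does not determine the character length, software built around a simple lead-byte / trail-byte table cannot process GB18030 directly, which is consistent with handling it by conversion to and from UTF-8.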


Linguistic Sorting

Unicode sorting for UTF-8

Language-sensitive collations

Tailor app to expectations of locale
• Language
• Location (country, region, etc.)

Easy to use
• Functions just like any other collation for ABL, and OpenEdge Database or SQL users
• Prior to 10.0B UTF-8 collation was binary sort

The goal …


Some collation examples: Latin alphabet

Catalan, català (ca, cat)
-- Catalan alphabet:
--
-- Aa (Àà), Bb, Cc (Çç), Dd,
-- Ee (Éé, Èè), Ff, Gg, Hh,
-- Ii (Íí, Ïï), Jj, [Kk], Ll, Mm, Nn,
-- Oo (Óó, Òò), Pp, Qq, Rr, Ss, Tt,
-- Uu (Úú, Üü), Vv, [Ww], Xx, [Yy], Zz
--
-- L·L is ordered as L+L.
--
& LL << l·l <<< L·l <<< L·L

Finnish, suomi (fi, fin)
-- Finnish alphabet:
--
-- Aa, Bb, [Cc], Dd, Ee, Ff, Gg, Hh,
-- Ii, Jj, Kk, Ll, Mm, Nn, Oo, Pp,
-- [Qq], Rr, Ss (Šš), Tt, Uu, Vv [Ww],
-- [Xx], Yy [Üü], Zz (Žž), [Åå], Ää
-- [Ææ], Öö [Øø]
--
& V << w <<< W
& Y << ü <<< Ü
& Z < å <<< Å < ä <<< Ä << æ <<< Æ < ö <<< Ö << ø <<< Ø

French, français (fr, fra)
-- French alphabet:
--
-- Aa (Àà, Ââ), (Ææ), Bb, Cc (Çç), Dd,
-- Ee (Éé, Èè, Êê, Ëë), Ff, Gg, Hh,
-- Ii (Îî, Ïï), Jj, [Kk], Ll, Mm,
-- Nn (Ññ), Oo (Ôô), (Œœ), Pp, Qq, Rr,
-- Ss, Tt, Uu (Ùù, Ûû), Vv, [Ww], Xx,
-- Yy (Ÿÿ), Zz
--
-- The ligatures Æ and Œ are ordered
-- as A+E and O+E respectively.
--
[accentorder backward]

Unicode 4.1 Default Collation Order, ISO/IEC 14651
-- Unicode default latin alphabet:
--
-- Aa, Bb, Cc, Dd, Ee, Əə, Ff, Gg, Hh,
-- Ii, ı, Jj, Kk, Ll, Mm, Nn, Ŋŋ, Oo,
-- Pp, Qq, ĸ, Rr, Ss, Tt, Ŧŧ, Uu, Vv,
-- Ww, Xx, Yy, Zz, Þþ
--
-- Unicode default greek alphabet:
--
-- Αα, Ββ, Γγ, Δδ, Εε, Ζζ, Ηη, Θθ, Ιι,
-- Κκ, Λλ, Μμ, Νν, Ξξ, Οο, Ππ, Ρρ, Σσς,
-- Ττ, Υυ, Φφ, Χχ, Ψψ, Ωω
--
-- Unicode default cyrillic alphabet:
--
-- Аа, Әә, Бб, Вв, Гг, Ғғ, Дд, Ђђ, Ѓѓ,
-- Ее, Єє, Жж, Җҗ, Зз, Ѕѕ, Ии, Іі, Її,
-- Йй, Јј, Кк, Ққ, Ҝҝ, Лл, Љљ, Мм, Нн,
-- Ңң, Њњ, Оо, Өө, Пп, Рр, Сс, Тт, Ћћ,
-- Ќќ, Уу, Ўў, Үү, Ұұ, Фф, Хх, Ҳҳ, Һһ,
-- Цц, Чч, Ҹҹ, Џџ, Шш, Щщ, Ъъ, Ыы, Ьь,
-- Ээ, Юю, Яя
--


Linguistic Sorting

OpenEdge Database meta-schema
• Table _DB-collate
  – Already used for single-byte sort weights
  – New functionality used for summary information
• Table _Collation
  – Added in 10.0A in preparation
  – Can hold any amount of collation data

Internals


Linguistic Sorting

ABL Usage
• Reference collation by name
  – For example "ICU-fr" for French

Specify using
• -cpcoll <table name>
  – Identifies collation table to use with code page in memory at session startup
  – <table name> is the collation table in convmap.cp or the name of the ICU collation
• ABL Statements
  – COMPARE
  – COLLATE


Linguistic Sorting

COMPARE and COLLATE new strengths supported
• 10.0A strengths: CASE-INSENSITIVE, CASE-SENSITIVE, CAPS and RAW

Added strengths
• PRIMARY
• SECONDARY = CASE-INSENSITIVE
• TERTIARY = CASE-SENSITIVE
• QUATERNARY


Linguistic Sorting

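/* French collation: accents are compared from the end of the word */
/* COMPARE returns a LOGICAL, so STRING() is needed before concatenating */
DISPLAY "ICU-fr = " +
    STRING(COMPARE("côte", "<", "coté", "case-insensitive", "ICU-fr"))
    FORMAT "x(20)".

/* Spanish collation: accents are compared from the start of the word */
DISPLAY "ICU-es = " +
    STRING(COMPARE("côte", "<", "coté", "case-insensitive", "ICU-es"))
    FORMAT "x(20)".

Output of above statements

ICU-fr = yes
ICU-es = no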

Sort order depends on selected collation
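Why the French and Spanish results differ can be sketched with a deliberately simplified model of accent ordering. This Python sketch is not the ICU algorithm and not OpenEdge code: the functions sort_key and less_than are hypothetical, and the model only compares base letters first and accents second, scanning accents either forward (as for Spanish) or backward (the French "accentorder backward" rule).

```python
import unicodedata

def sort_key(text, backward_accents=False):
    """Simplified two-level key: base letters first, then accents per letter."""
    base = []
    accents = []
    # NFD splits each accented letter into a base letter plus combining marks.
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            if accents:
                accents[-1] += (ord(ch),)   # attach the mark to the previous base letter
        else:
            base.append(ch.casefold())      # case is ignored in this sketch
            accents.append(())
    if backward_accents:
        accents.reverse()                   # French rule: the last accent difference decides
    return ("".join(base), accents)

def less_than(a, b, backward_accents=False):
    """True if a sorts before b under the simplified model."""
    return sort_key(a, backward_accents) < sort_key(b, backward_accents)

print("backward (French-like):", less_than("c\u00f4te", "cot\u00e9", backward_accents=True))
print("forward (Spanish-like):", less_than("c\u00f4te", "cot\u00e9", backward_accents=False))
```

Both words share the base letters "cote", so the accents decide. Scanning forward, the unaccented "o" of "coté" comes before the "ô" of "côte"; scanning backward, the unaccented final "e" of "côte" comes before the "é" of "coté". That reproduces the yes/no pattern shown above.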


Linguistic Sorting

OpenEdge uses collations for
• The -cpcoll startup parameter
• The database collation
• The collation of a database CLOB column
• An argument to the COMPARE function or COLLATE option of the BY phrase


Linguistic Sorting

Once a collation is specified for the database in the _Collation table, it cannot be modified

Once the collation is written to the _Collation table, it is the only collation with that name that can be used by that database

It is strongly recommended that you back up a database before using an ICU collation with it

Rules


Linguistic Sorting

The following examples assume
• UTF-8 database with "basic" collation
• Names:
  – beet, carrot, çedilla, entry, école, trust, zoom

FOR EACH words WHERE name < "t":
    DISPLAY name.
END.

Output result

beet
carrot
entry

Example 1 of 4


Linguistic Sorting

FOR EACH words WHERE name >= "t":
    DISPLAY name.
END.

Output result

trust
zoom
école
çedilla

Example 2 of 4


Linguistic Sorting

/* COMPARE takes the operator as its own quoted argument */
FOR EACH words WHERE COMPARE(name, "<", "t", "case-insensitive", "ICU-en"):
    DISPLAY name.
END.

Output result

beet
carrot
entry
école
çedilla

Before, without COMPARE

beet
carrot
entry

Example 3 of 4


Linguistic Sorting

/* COMPARE selects the rows; BY COLLATE sorts them with the same collation */
FOR EACH words WHERE COMPARE(name, "<", "t", "case-insensitive", "ICU-en")
    BY COLLATE(name, "case-insensitive", "ICU-en"):
    DISPLAY name.
END.

Output result

beet
carrot
çedilla
école
entry

Before, without BY COLLATE

beet
carrot
entry
école
çedilla

Example 4 of 4


Linguistic Sorting

OpenEdge supports ICU collations in the icui18n library for supported OpenEdge languages

ICU-ja__HQ = Japanese Hiragana Quaternary

One additional collation is supported: Japanese Hiragana Quaternary as case-sensitive
• Uses the QUATERNARY strength as the CASE-SENSITIVE strength

Supported Collations


Linguistic Sorting

ICU Collations Available

ICU-UCA               UCA (default Unicode Collation Algorithm)
ICU-ar                Arabic
ICU-be                Belarusian
ICU-bg                Bulgarian
ICU-ca                Catalan
ICU-cs                Czech
ICU-da                Danish
ICU-de__PHONEBOOK     German phonebook
ICU-el                Greek
ICU-en_BE             English Belgium
ICU-eo                Esperanto
ICU-es                Spanish
ICU-es__TRADITIONAL   Spanish traditional
ICU-et                Estonian
ICU-fa                Persian
ICU-fi                Finnish
ICU-fr                French
ICU-gu                Gujarati



ICU-he                Hebrew
ICU-hi                Hindi
ICU-hi__DIRECT        Hindi direct
ICU-hr                Croatian
ICU-hu                Hungarian
ICU-is                Icelandic
ICU-ja                Japanese
ICU-ko                Korean
ICU-kn                Kannada
ICU-lt                Lithuanian
ICU-lv                Latvian
ICU-mk                Macedonian
ICU-mr                Marathi
ICU-mt                Maltese
ICU-nb                Norwegian Bokmål
ICU-nn                Norwegian Nynorsk
ICU-pl                Polish
ICU-ro                Romanian



ICU-ru                Russian
ICU-sh                Serbo-Croatian
ICU-sk                Slovak
ICU-sl                Slovenian
ICU-sq                Albanian
ICU-sr                Serbian
ICU-sv                Swedish
ICU-ta                Tamil
ICU-te                Telugu
ICU-th                Thai
ICU-tr                Turkish
ICU-uk                Ukrainian
ICU-vi                Vietnamese
ICU-zh                Chinese
ICU-zh__PINYIN        Chinese Pinyin
ICU-zh_HK             Chinese Hong Kong
ICU-zh_MO             Chinese Macau
ICU-zh_TW             Chinese Taiwan


Collations Gotchas

If Database, Clients and Servers use different collations (-cpcoll), indexed and non-indexed queries may return different results

If a client needs a different collation than the database, you can use COMPARE and COLLATE on the client
• Performance impact with large result sets


Configuration Gotchas

Database code-page is 1252 on Windows server

OpenEdge install startup.pf setting is:
• -cpinternal 1252 -cpstream 1252

French Windows Client with
• a default Windows code page of 1252, and
• a DOS system code page of ibm850

DOS Character Client starts without specifying -cpinternal and -cpstream
• so it uses 1252 from startup.pf

Typical character client configuration, 1/2


Configuration Gotchas

User enters “è” (hex 8A in ibm850)

Since the session is started with -cpinternal 1252, OpenEdge doesn’t convert when writing to the database
• The entered value is written to the database as 8A, when it should be E8 (1252)

Fix: start the Character Client with -cpinternal and -cpstream set to ibm850

Typical character client configuration, 2/2
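The byte values in this scenario can be checked with a short sketch. It uses Python's built-in cp850 and cp1252 codecs as stand-ins for ibm850 and 1252; the variable names are assumptions for this example, and the code only models the byte-level effect, not OpenEdge itself.

```python
# Illustrative only: what happens to "è" typed in an ibm850 console when the
# session believes its data is already in code page 1252.

typed = "\u00e8"                          # the character è

# The DOS console delivers the ibm850 byte for è.
keyboard_bytes = typed.encode("cp850")
print(keyboard_bytes.hex())               # the byte the user actually sent

# Misconfigured session: the byte is stored unchanged in the 1252 database,
# so anything that reads it back as 1252 sees a different character.
stored_wrong = keyboard_bytes
read_back = stored_wrong.decode("cp1252")
print(read_back)

# Correctly configured session: input is interpreted as ibm850 and
# converted to 1252 on its way to the database.
stored_right = keyboard_bytes.decode("cp850").encode("cp1252")
print(stored_right.hex())
```

With the wrong setting the database holds 8A, which in 1252 is "Š"; with the session declared as ibm850 the conversion yields E8, the 1252 byte for "è".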


Unicode Normalization

Unicode has different ways of expressing the same characters

Decomposed
• Á = (U+0041, Latin Capital Letter A) + (U+0301, Combining Acute Accent ´)

Composed
• Á = (U+00C1, Latin Capital Letter A with Acute)

What is Normalization?



XML (and other W3C entities) expects data in “NFC” form

Best way to convert from Unicode to other code pages

Useful when doing tasks such as making comparisons

Why Normalization?

NFC = Canonical Decomposition, followed by Canonical Composition



NORMALIZE
• Returns either CHAR or LONGCHAR
  – Matches the source string
• CHAR variable must be UTF-8
• LONGCHAR variable can be any form of Unicode
  – UTF-8, UTF-16, UTF-32

result-string = NORMALIZE(source-string, normalization-mode)

NORMALIZE Language Function


Normalization Modes Supported

NFD: Canonical Decomposition

NFC: Canonical Decomposition, followed by Canonical Composition (default)

NFKD: Compatibility Decomposition

NFKC: Compatibility Decomposition, followed by Canonical Composition

None: No change to source string. Turns off normalization when normalization-mode is a variable

Normalization modes from ICU library
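The four normalization forms can be tried out with Python's standard unicodedata module, which implements the same Unicode forms. This is an illustration of the forms themselves, not of the ABL NORMALIZE function; the helper code_points and the sample strings are assumptions for this example.

```python
import unicodedata

def code_points(text):
    """List the code points of a string, e.g. ['U+0041', 'U+0301']."""
    return [f"U+{ord(ch):04X}" for ch in text]

decomposed = "A\u0301"    # Latin Capital Letter A + Combining Acute Accent
composed = "\u00c1"       # Latin Capital Letter A with Acute

# The two strings render the same but are not equal until normalized.
print(decomposed == composed)

nfc = unicodedata.normalize("NFC", decomposed)   # decompose, then recompose
nfd = unicodedata.normalize("NFD", composed)     # decompose only
print(code_points(nfc), code_points(nfd))

# Compatibility forms also replace compatibility characters, such as the
# "fi" ligature U+FB01, with their plain equivalents; canonical forms do not.
ligature = "\ufb01"
print(code_points(unicodedata.normalize("NFC", ligature)))
print(code_points(unicodedata.normalize("NFKC", ligature)))
```

Normalizing both sides of a comparison to the same form (for example NFC) is what makes the composed and decomposed spellings of "Á" compare as equal.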



Unicode Normalization Forms
• Recommended for understanding normalization forms used with NORMALIZE function
• http://www.unicode.org/unicode/reports/tr15/

International Components for Unicode (ICU) libraries & globalization, in-depth information
• http://icu.sourceforge.net/userguide/intro.html

Additional information


Default Word-Break Tables

Prior to 10.1A
• User had to configure word-break tables for use with double-byte and UTF-8 databases



Default Word-Break Tables added for:
• Double-byte
• UTF-8 Databases

These are available ‘out of the box’
• Either in product or for download

Simplifies accessing non-single-byte databases

10.1A simplifies implementing double-byte databases



10.1A provides 10 compiled files
• See list on next slide
• Ranging from proword.245 to proword.254

Located in subdirectory with corresponding empty databases
• Subdirectory prolang/<language>



Default Word-Break Tables
Compiled, available out of the box

Available as part of the Supplemental PROMSGS package

Available for download
• Japanese SHIFT-JIS               proword.253
• Japanese EUCJIS                  proword.250
• Korean CP949                     proword.248
• Korean KSC5601                   proword.252
• Chinese (simplified) CP936       proword.247
• Chinese (simplified) GB2312      proword.251
• Chinese (traditional) CP950      proword.249
• Chinese (traditional) BIG-5      proword.246
• Chinese (traditional) CP950-HKSCS proword.245
• UTF-8                            proword.254




What if you are already using your own proword file in the range of 245 – 254?
• Copy the file to proword.<nnn>
  – Where <nnn> is less than 240
• Apply word rule to the database
  – No index-build is required for this change

Remember, apply the change in all tiers (Client, Server, Database) to prevent corruption!









For More Information, go to…

Expand to New Countries Business Empowerment Program
• Contact your Account Manager

Product documentation
• OpenEdge Development: Internationalizing Applications
• OpenEdge Development: Visual Translator
• OpenEdge Development: Translation Manager

Visit PSDN for white papers and presentations, for example:
• “Understanding Internationalization” web seminar

Training and Professional Services – www.progress.com









In Summary

Use UTF-8

GB18030

Linguistic Sorting and Collations
• Use ICU-*

Unicode Normalization

Default word-break tables and double-byte

Expand to New Countries Business Empowerment Program


Questions?


Thank you for your time

