Software Internationalization and Localization: Basic Concepts
Doug Kunz
Outline
Introduction
Localization Examples
Design and development impact
Why does internationalization matter?
Web’s global reach – potential global user base
Support foreign language speakers within our borders
Ever-increasing numbers of international business transactions
Definitions
Internationalization (i18n): The practice of writing software that can easily be extended to support users from multiple cultural and linguistic backgrounds
Localization (L10n): The process of taking internationalized software and actually producing a version tailored to users from a particular culture and language background
Language Tags – IETF BCP 47
A “language tag” or “locale” describes a common language + culture shared by a group of users, often at a national level.
Documented by IETF “Best Current Practice” 47
  http://www.ietf.org/rfc/bcp/bcp47.txt
  Refers to underlying RFCs (these can change over time, but the BCP number does not)
Typically represented by an identifier describing a combination of:
  2-3 letter language code (ISO 639, parts 1, 2 or 3)
  2 letter country code (ISO 3166)
  Optional subtags for dialect and writing system (a script subtag, when present, sits between the language and the country code)
Examples:
  en              English
  en-US           US English
  en-GB           UK English
  es-US           US Spanish
  zh              Chinese (macrolanguage)
  zh-cmn          Mandarin Chinese
  zh-cmn-TW       Mandarin Chinese as spoken in Taiwan
  zh-cmn-Hans-CN  Mandarin Chinese written with the Simplified script, as used in China
ISO = International Organization for Standardization; IETF = Internet Engineering Task Force
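As a rough illustration of the tag structure described above, the following Python sketch splits a tag into its language, extended-language, script and region subtags. The function name `parse_language_tag` and the regular expression are assumptions made for this example. They cover only the simple shapes shown in the examples, not the full BCP 47 grammar (variants, extensions and private-use subtags are ignored).

```python
import re

# Simplified pattern: language[-extlang][-script][-region].
# This is NOT the full BCP 47 grammar, only enough for the slide's examples.
_TAG_RE = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<extlang>[A-Za-z]{3}))?"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
)


def parse_language_tag(tag):
    """Return a dict of the subtags present in `tag`, in conventional casing."""
    match = _TAG_RE.fullmatch(tag)
    if match is None:
        raise ValueError(f"unsupported or malformed language tag: {tag!r}")
    parts = {k: v for k, v in match.groupdict().items() if v is not None}
    parts["language"] = parts["language"].lower()
    if "extlang" in parts:
        parts["extlang"] = parts["extlang"].lower()
    if "script" in parts:
        parts["script"] = parts["script"].title()   # e.g. "Hans"
    if "region" in parts:
        parts["region"] = parts["region"].upper()   # e.g. "CN"
    return parts


print(parse_language_tag("en-US"))
print(parse_language_tag("zh-cmn-Hans-CN"))
```

A production system would normally rely on a maintained library and the IANA subtag registry rather than a hand-written pattern like this one.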
What to localize (a non-exhaustive listing)
Writing System
Direction of scan (Left-to-Right vs. tfeL-ot-thgiR)
Character set (various alphabets, syllabaries and logographies)
Display captions
Regional variations within language
  Spelling variations, e.g. US “color” vs. UK “colour”
  Terminology variations (“lift” vs. “elevator”, “Español” vs. “Castellano”)
Language variations (“Login” vs. “Conéctese” vs. “Anmelden” vs. “Connessione”)
Display layouts
US English
Caption 1 nnnnn Caption 2 nnnnn
German
BigGermanTranslationOfCaption1 nnnnn
BigGermanTranslationOfCaption2 nnnnn
Arabic
nnnnn 2noitpaC nnnnn 1noitpaC
Print layouts
US Letter paper (8 ½ by 11 inches) vs. A4 paper (210×297 mm)
Units of Measure
“British Engineering” (Imperial) System – U.S.A., Liberia and Myanmar
  Feet/inches/miles
  Pounds, stone or slugs
  Fahrenheit
SI (Système International) – Rest of world
  Meters/centimeters/kilometers
  Kilograms
  Celsius or Kelvin
Formats: Numbers
Decimal separator – character varies
  1,000 (US) “one thousand”
  1,000 (Most of Europe) “one”
Readability delimiters – placement and character vary
  1,000,000 (US)
  10,00,000 (“10 lakh” India/Pakistan/Sri Lanka)
  1.000.000 (Germany)
  1 000 000 (France)
  100,0000 (China)
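The variations above can be sketched as a small table-driven formatter. Everything here is illustrative: the `FORMATS` table, the locale keys and the function names are assumptions for this example, with separator choices copied from the slide rather than from real locale data (which, for French, typically uses a no-break space variant rather than a plain space).

```python
# Hypothetical per-locale rules: (group sizes from the right, group separator,
# decimal separator). The last group size repeats for the remaining digits.
FORMATS = {
    "en-US": ((3,), ",", "."),
    "de-DE": ((3,), ".", ","),
    "fr-FR": ((3,), " ", ","),
    "en-IN": ((3, 2), ",", "."),   # lakh/crore grouping: 3 digits, then 2s
}


def group_digits(digits, sizes, sep):
    """Group a string of digits from the right using the given group sizes."""
    groups = []
    index = 0
    while digits:
        size = sizes[min(index, len(sizes) - 1)]
        groups.append(digits[-size:])
        digits = digits[:-size]
        index += 1
    return sep.join(reversed(groups))


def format_number(value, tag, decimals=2):
    """Format `value` with the separators registered for `tag`."""
    sizes, group_sep, decimal_sep = FORMATS[tag]
    sign = "-" if value < 0 else ""
    whole, _, fraction = f"{abs(value):.{decimals}f}".partition(".")
    text = group_digits(whole, sizes, group_sep)
    if fraction:
        text += decimal_sep + fraction
    return sign + text


for tag in FORMATS:
    print(tag, format_number(1000000, tag, decimals=0))
```

Keeping the rules in data rather than in code means a new locale only needs a new table entry. In practice this table would come from a locale database instead of being maintained by hand.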
Formats: Contact Info
Phone numbers
  (415) 644-3912 within US
  +1 415 6443912 outside US
Postal Codes (a few examples)
  US Zip Codes: 99999 or 99999-9999
  Canadian Postal Codes: A9A 9A9
  UK Postal Codes (generally): A9 9AA, A99 9AA, A9A 9AA, AA9 9AA, AA99 9AA, AA9A 9AA
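The postal code shapes listed above translate fairly directly into per-country patterns. The sketch below is an assumption-laden illustration: the country keys, `POSTCODE_PATTERNS` and `matches_postcode_format` are names invented for this example, and the patterns check only the general shape (9 = digit, A = letter), not whether a code actually exists.

```python
import re

# Shape-only patterns transcribed from the examples above.
POSTCODE_PATTERNS = {
    "US": re.compile(r"[0-9]{5}(?:-[0-9]{4})?"),
    "CA": re.compile(r"[A-Z][0-9][A-Z] [0-9][A-Z][0-9]"),
    "GB": re.compile(r"[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][A-Z]{2}"),
}


def matches_postcode_format(country, code):
    """Return True if `code` has a plausible shape for `country`.

    Countries without a registered pattern are accepted as-is, so that
    unknown formats are never rejected.
    """
    pattern = POSTCODE_PATTERNS.get(country)
    if pattern is None:
        return True
    return pattern.fullmatch(code.strip().upper()) is not None


print(matches_postcode_format("US", "99999-9999"))
print(matches_postcode_format("GB", "sw1a 1aa"))
print(matches_postcode_format("CA", "12345"))
```

Accepting unknown countries by default is a deliberate choice in this sketch: rejecting a valid foreign postal code is usually worse than storing an unvalidated one.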
Formats: Contact Info
Address layout examples

Line1
Line2 etc.
City PostCode
Country

Line1
Line2 etc.
PostCode City
Country

Line1
Line2 etc.
City Region PostCode
Country
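One way to handle the three layouts above is to store the address fields separately and apply a layout template only at display time. The layout names and the `format_address` helper below are hypothetical, chosen for this sketch. The slide does not say which country uses which layout, so no such mapping is assumed here.

```python
# Hypothetical templates for the locality line of each layout shown above.
ADDRESS_LAYOUTS = {
    "city_postcode": "{city} {post_code}",
    "postcode_city": "{post_code} {city}",
    "city_region_postcode": "{city} {region} {post_code}",
}


def format_address(lines, city, post_code, country, layout, region=""):
    """Render an address block using one of the named layouts."""
    locality = ADDRESS_LAYOUTS[layout].format(
        city=city, region=region, post_code=post_code
    )
    return "\n".join([*lines, locality, country])


print(format_address(
    ["Line1", "Line2"], city="City", post_code="PostCode",
    country="Country", layout="postcode_city",
))
```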
Formats: Dates and Times
Dates
  Commonly, formats differ within calendar systems: does 01/06/2006 mean “January 6, 2006” or “June 1, 2006”?
  Less commonly, across calendar systems
    22 May 2006 – Gregorian
    9 May 2006 – Julian
    24 Iyyar 5766 (before sunset) – Hebrew
    23 or 24 Rabi`-ul-Akhir 1427 (before sunset) – Islamic
Times – 5:00pm vs. 17.00
Time Zones – 22 May 2006 12:00pm (UTC+14) = 21 May 2006 10:00am (UTC-12)
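Both the date ambiguity and the time zone equivalence above can be checked with Python's standard `datetime` module. This is only a small demonstration using fixed UTC offsets. The variable names are chosen for this example.

```python
from datetime import datetime, timedelta, timezone

# The same text parsed month-first (US convention) and day-first.
text = "01/06/2006"
month_first = datetime.strptime(text, "%m/%d/%Y").date()
day_first = datetime.strptime(text, "%d/%m/%Y").date()
print(month_first.isoformat())  # 2006-01-06
print(day_first.isoformat())    # 2006-06-01

# The same instant expressed in the two most distant UTC offsets.
noon_plus_14 = datetime(2006, 5, 22, 12, 0, tzinfo=timezone(timedelta(hours=14)))
ten_minus_12 = datetime(2006, 5, 21, 10, 0, tzinfo=timezone(timedelta(hours=-12)))
print(noon_plus_14 == ten_minus_12)  # True: aware datetimes compare by instant
print(noon_plus_14.astimezone(timezone.utc).isoformat())  # 2006-05-21T22:00:00+00:00
```

The text alone cannot say which reading was intended, which is why the format (or the user's locale) has to be known before parsing.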
Design/Development Impacts and Techniques
Know your user
Collect information in user profile, such as:
  “Preferred language”: store as a language tag containing the least possible amount of information (subtags) needed to localize the experience for that particular user (e.g. “en-US” is better than “en-Latn-US”)
  Time zone
  Preferred units-of-measure
  Preferred currency
User Interface vs. Data Locales
User Interface locale
  The captioning, formats and layout needed to present data to the current user
Data locale
  Locale to which a business object belongs; may be distinct from current user’s locale. Example: purchase order has comment text written in French, although current user is English-speaking
  Typically the locale of the user who created the object
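The distinction can be made concrete with two small record types: a user profile carrying the user interface locale, and a business object carrying its own data locale. The class and field names below are hypothetical, invented for this sketch. The profile fields follow the "Know your user" list (language tag, time zone, units, currency).

```python
from dataclasses import dataclass


@dataclass
class UserProfile:
    user_id: str
    language_tag: str   # user interface locale, e.g. "en-US"
    time_zone: str      # e.g. a zone name such as "America/Los_Angeles"
    units: str          # e.g. "metric" or "imperial"
    currency: str       # e.g. "USD"


@dataclass
class PurchaseOrder:
    order_id: str
    comment: str
    data_locale: str    # locale of the comment text, set when the order is created


def comment_is_foreign(order, viewer):
    """True if the order's comment language differs from the viewer's language."""
    order_language = order.data_locale.split("-")[0].lower()
    viewer_language = viewer.language_tag.split("-")[0].lower()
    return order_language != viewer_language


viewer = UserProfile("u1", "en-US", "America/Los_Angeles", "imperial", "USD")
order = PurchaseOrder("po-1", "Livraison avant midi", data_locale="fr-FR")
print(comment_is_foreign(order, viewer))  # True
```

Storing the data locale on the object lets the interface keep its captions in the viewer's language while still labeling, or offering to translate, the French comment.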
Resource Extraction
A “resource” is a screen artifact (text, image, etc.) that contains localized information. For example, a field caption written in US English would be a resource.
Place text captions in a separate file for translation
Images
  Where possible, implement buttons as text with a background image, to avoid producing locale-specific images
  When text *must* be included in an image:
    “ALT” text should be placed in a separate file, and should match image text (if any) for ease of translation
    Image “path” should be locale-specific, e.g. medem.com/images/en_us/next_button.gif
Sometimes screen shots help translation services by providing context
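A minimal sketch of caption lookup with fallback follows. The in-memory `CAPTIONS` dictionary stands in for the separate per-locale caption files, and all names here (`CAPTIONS`, `fallback_chain`, `caption`, `image_path`) are assumptions made for this example. The image path convention mirrors the `en_us` directory style shown above.

```python
# Stand-in for per-locale caption files; a locale only lists what differs.
CAPTIONS = {
    "en": {"login": "Login", "color": "Color"},
    "en-GB": {"color": "Colour"},
    "de": {"login": "Anmelden"},
}
DEFAULT_LOCALE = "en"


def fallback_chain(tag):
    """'en-GB' -> ['en-GB', 'en']; the default locale is always tried last."""
    parts = tag.split("-")
    chain = ["-".join(parts[:i]) for i in range(len(parts), 0, -1)]
    if DEFAULT_LOCALE not in chain:
        chain.append(DEFAULT_LOCALE)
    return chain


def caption(key, tag):
    """Look up a caption, falling back to less specific locales."""
    for candidate in fallback_chain(tag):
        text = CAPTIONS.get(candidate, {}).get(key)
        if text is not None:
            return text
    return key  # last resort: show the key so the gap is visible


def image_path(name, tag):
    """Build a locale-specific image path such as images/en_us/next_button.gif."""
    return f"images/{tag.lower().replace('-', '_')}/{name}"


print(caption("color", "en-GB"))   # Colour
print(caption("login", "en-GB"))   # Login
print(image_path("next_button.gif", "en-US"))  # images/en_us/next_button.gif
```

With fallback, a regional file such as `en-GB` only needs the entries that differ from the base language, which keeps translation files small.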
Layouts
Technique 1: Produce general layout that will work for most languages
  Where needed, make language-specific “override layouts”
Technique 2: “Least common denominator” layouts that will always work
Example: restrict print layouts to 210mm by 279mm – works on US Letter and A4
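The 210mm by 279mm figure is simply the smaller width and the smaller height of the two paper sizes. A short check, with the helper name `common_print_area` chosen for this sketch:

```python
import math

MM_PER_INCH = 25.4

US_LETTER_MM = (8.5 * MM_PER_INCH, 11 * MM_PER_INCH)   # about 215.9 x 279.4
A4_MM = (210.0, 297.0)


def common_print_area(*papers):
    """Largest whole-millimeter area that fits on every given paper size."""
    width = min(paper[0] for paper in papers)
    height = min(paper[1] for paper in papers)
    return math.floor(width), math.floor(height)


print(common_print_area(US_LETTER_MM, A4_MM))  # (210, 279)
```

A4 limits the width and US Letter limits the height, so the common area is narrower than Letter and shorter than A4.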
“Store globally, display locally”
Pick a reasonable standard format for storage in your database (e.g. ISO 8601 “2006-05-24T18:15:00Z”)
Translate for display based on user’s locale (5/24/06 11:15am Pacific Daylight Time)
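A minimal sketch of the idea, assuming the stored value is always UTC in the exact ISO 8601 form shown above. The fixed UTC-7 offset stands in for Pacific Daylight Time, and `parse_stored` and `display_us` are hypothetical helper names. A real system would look up the user's time zone and formats from the profile.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset standing in for Pacific Daylight Time (UTC-7).
PDT = timezone(timedelta(hours=-7), "PDT")


def parse_stored(text):
    """Parse the storage format 'YYYY-MM-DDTHH:MM:SSZ' as an aware UTC datetime."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def display_us(moment, zone):
    """Render a US-style short date and 12-hour time in the given zone."""
    local = moment.astimezone(zone)
    hour12 = local.hour % 12 or 12
    suffix = "am" if local.hour < 12 else "pm"
    return (
        f"{local.month}/{local.day}/{local.year % 100:02d} "
        f"{hour12}:{local.minute:02d}{suffix}"
    )


stored = "2006-05-24T18:15:00Z"
print(display_us(parse_stored(stored), PDT))  # 5/24/06 11:15am
```

Because only the display step knows about the locale, the same stored value can be rendered differently for every user without touching the database.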
Flexible storage design
Explicit rate/unit storage
  Bad: Column “Height”
  Bad: Column “Height_inches”
  Good: Column “Height” and Column “Height_Units”
  Good: Column “Price” and Column “Currency”
Globally appropriate data type
  Bad: Column “ZipCode” Integer(5)
  Good: Column “PostCode” Varchar2 (10)
Globally appropriate name
  Bad: Column “State”
  Better: Column “Region”
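The "good" column choices above can be tried out with Python's built-in `sqlite3` module. The table name `sample` and the sample rows are invented for this sketch, and SQLite's `TEXT` type stands in for the slide's `Varchar2`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sample (
        height        REAL,
        height_units  TEXT,  -- unit stored next to the value
        price         TEXT,  -- decimal string, avoids float rounding
        currency      TEXT,  -- currency stored next to the amount
        post_code     TEXT,  -- text, not integer
        region        TEXT   -- state, province, county, ...
    )
    """
)
conn.executemany(
    "INSERT INTO sample VALUES (?, ?, ?, ?, ?, ?)",
    [
        (70.0, "in", "19.99", "USD", "02134", "MA"),
        (178.0, "cm", "17.50", "GBP", "SW1A 1AA", "London"),
    ],
)

post_codes = [
    row[0] for row in conn.execute("SELECT post_code FROM sample ORDER BY rowid")
]
print(post_codes)    # ['02134', 'SW1A 1AA']
print(int("02134"))  # 2134: an integer column would drop the leading zero
```

The second row could not be stored at all in an integer "ZipCode" column, and the first would come back as 2134.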
Appropriate character encoding
US-ASCII (American Standard Code for Information Interchange)
  7 bits / character
  English only: diacritics not supported (ü, è, ç, etc.)
ISO-8859-1 (“Latin 1”)
  1 byte (8 bits) / character
  Superset of US-ASCII
  Western European languages
  Historical default encoding for “text/*” MIME types in HTTP/1.1
  Basis of the set of characters allowed in HTML 3.2 documents
UTF-8
  1 to 4 bytes/character (1 to 3 bytes for characters in the Basic Multilingual Plane)
  Backward compatible with US-ASCII; ISO-8859-1 text must be converted, although the first 256 Unicode code points match ISO-8859-1
  Encodes Unicode (all character sets, including extinct languages)
  Unicode is the basis of the set of characters allowed in HTML 4.0 documents
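The differences between the three encodings are easy to observe from Python, whose built-in codecs include all of them. The helper names below are chosen for this sketch, and the sample characters are written as escapes so the snippet itself stays pure ASCII.

```python
def utf8_length(ch):
    """Number of bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


def can_encode(text, encoding):
    """True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


u_umlaut = "\u00fc"         # u with diaeresis
cjk = "\u4e2d"              # a Chinese character
emoji = "\U0001F600"        # a character outside the Basic Multilingual Plane

print([utf8_length(ch) for ch in ("A", u_umlaut, cjk, emoji)])  # [1, 2, 3, 4]

print(can_encode(u_umlaut, "ascii"))        # False: no diacritics in US-ASCII
print(can_encode(u_umlaut, "iso-8859-1"))   # True
print(can_encode(cjk, "iso-8859-1"))        # False: Western European only

# The same character is a different byte sequence in each encoding.
latin1_bytes = u_umlaut.encode("iso-8859-1")   # b'\xfc'
utf8_bytes = u_umlaut.encode("utf-8")          # b'\xc3\xbc'
try:
    latin1_bytes.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False
print(latin1_is_valid_utf8)  # False: Latin-1 bytes are not valid UTF-8 here
```

ASCII text is byte-for-byte identical in all three encodings, which is why only the non-ASCII characters expose a wrong encoding choice.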
For More Information
International Telecommunication Union (ITU) http://www.itu.int/
Universal Postal Union (UPU) http://www.upu.int/
International Organization for Standardization http://www.iso.org/
UTF-8 http://en.wikipedia.org/wiki/UTF-8