CRICOS No. 00213J
XLIFF in the Localisation of
Open Source Software
One step forward, two steps back?
Asgeir Frimannsson
Overview
• Motivation
• Background
– GNU Gettext and XLIFF
• One step forward…
– XLIFF in Open Source Localisation
• …And two steps back?
– Tool Support
– Challenges in Open Source Localisation
– Where do we go from here?
• Questions?
Motivation
Open Source Software
Open Source Localisation
• Open Source What?
– Localisation of Open Source Software?
– Localisation using Open Source Tools?
                      Open Source Tools   Proprietary Tools
Proprietary Project   ☺                   Ask someone else
Open Source Project   ☺                   Not Interested
Gnome 2.18 (Latest Release)
• User Interface
– 36 000 Translation Units
– 46 Languages >85% Translated
– 62 Languages >50% Translated
– 170 Language Teams
• Documentation
– 23 000 Translation Units
– 2 Languages >85% Translated
– 4 Languages >50% Translated
KDE 3.x (Latest Release)
• User Interface
– 107 000 Translation Units
– 27 Languages >85% Translated
– 47 Languages >50% Translated
– 107 Language Teams
• Documentation
– 68 000 Translation Units
– 8 Languages >85% Translated
– 13 Languages >50% Translated
Not just statistics…
• Some teams focus on specific applications or
software distributions
• E.g. the KhmerOS initiative has been
targeting the OpenSUSE Linux distribution:
[ From: http://i18n.opensuse.org/stats/ ]
Language Initiatives
Achinese | Afar | Afrikaans | Akan | Albanian | Amharic | Arabic | Aragonese | Armenian | Assamese |
Asturian | Azerbaijani | Basque | Belarusian | Bengali | Berber (Other) | Blin; Bilin | Bosnian | Brazilian
Portuguese | Breton | Bulgarian | Buriat | Burmese | Catalan | Cebuano | Chinese (Hong Kong) |
Cornish | Corsican | Croatian | Czech | Danish | Divehi | Dutch | Dzongkha | English | English
(Australia) | English (Canada) | English (United Kingdom) | Esperanto | Estonian | Faroese | Filipino |
Finnish | French | Frisian | Friulian | Gaelic; Scottish | Galician | Ganda | Georgian | German |
German, Low | Greek | Greenlandic (Kalaallisut) | Guarani | Gujarati | Haitian; Haitian Creole | Hausa
| Hawaiian | Hebrew | Hiligaynon | Hindi | Hungarian | Icelandic | Indonesian | Interlingua | Inuktitut
| Irish | Italian | Japanese | Javanese | Kabyle | Kannada | Kashubian | Kazakh | Khmer | Kinyarwanda
| Kirghiz | Klingon; tlhIngan-Hol | Konkani | Korean | Kurdish | Kurdish (Sorani) | Lao | Latin | Latvian |
Limburgian | Lingala | Lithuanian | Lojban | Lower Sorbian | Luxembourgish | Macedonian | Malagasy
| Malay | Malayalam | Maltese | Manx | Maori | Marathi | Mongolian | Navaho | Ndebele, South |
Neapolitan | Nepali | Northern Sami | Norwegian Bokmal | Norwegian Nynorsk | Occitan (post 1500) |
Oriya | Oromo | Pampanga | Papiamento | Persian | Polish | Portuguese | Punjabi | Pushto | Quechua
| Raeto-Romance | Romanian | Romany | Russian | Sanskrit | Sardinian | Scots | Serbian | Sidamo |
Simplified Chinese | Sindhi | Sinhalese | Slovak | Slovenian | Somali | Sotho, Northern | Sotho,
Southern | Spanish | Swahili | Swati | Swedish | Syriac | Tagalog | Tajik | Tamashek | Tamil | Tatar |
Telugu | Tetum | Thai | Tibetan | Tigre | Tigrinya | Traditional Chinese | Tsonga | Tswana | Turkish |
Turkmen | Uighur | Ukrainian | Urdu | Uzbek | Venda | Vietnamese | Walamo | Walloon | Welsh |
Wolof | Xhosa | Yiddish | Yoruba | Zulu
[ From: https://translations.launchpad.net/ubuntu/feisty/+translations ]
Motivating Factors
• Community-driven Translation
• Enabling end-users to contribute to the
localisation process
– Domain experts
– Knowledge of Language and Culture
• “Crowdsourcing” of translations
• Not strictly confined to open source software
– Google, Microsoft
Motivating Factors
• Allowing Users to embrace software without letting
go of their language and cultural identity
– Translation Initiatives driven by native language speakers
– E.g. KhmerOS, Translate.org.za
• Technology enabling community-driven Localisation
of Software and E-content
– Localisation Tools
– Processes
– Enabling Technologies
Background
GNU Gettext
• De facto standard for i18n support in GNU-based
open source applications
• Based around two file formats:
– Portable Object (PO) – Bi-lingual String Table
containing original extracted (English US) strings
and translations
– Machine Object (MO) – Binary representation of
String table for retrieving strings at run-time.
Gettext Overview
[ Figure: Gettext workflow; steps 3–5 recoverable: Translate, Convert to Machine Object (MO) file, Retrieve Messages in the Application (runtime) ]
Gettext PO – Overview
# SOME DESCRIPTIVE TITLE.
# Copyright (C) YEAR THE PACKAGE'S COPYRIGHT HOLDER
msgid ""
msgstr ""
"Project-Id-Version: Project Name and Version\n"
"PO-Revision-Date: YYYY-DD-MM HH:MM-SSSS\n"
"POT-Creation-Date: YYYY-DD-MM HH:MM-SSSS\n"
"Language-Team: Language Team <email@addr>\n"
"Last-Translator: Translator Name <email@addr>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"

# translator-comments
#. extracted-comments
#: filename:linenumber
#, flag...
msgid untranslated-string
msgstr translated-string

[ Callouts: Comments; Header; White-space; Translation Unit(s); Segment meta-data ]
Gettext PO – Header
# SOME DESCRIPTIVE TITLE.
# Copyright (C) YEAR THE PACKAGE'S COPYRIGHT HOLDER
msgid ""
msgstr ""
"Project-Id-Version: Project Name and Version\n"
"Report-Msgid-Bugs-To: Name <email@addr>\n"
"Language-Team: Language Team <email@addr>\n"
"Last-Translator: Translator Name <email@addr>\n"
"PO-Revision-Date: YYYY-DD-MM HH:MM-SSSS\n"
"POT-Creation-Date: YYYY-DD-MM HH:MM-SSSS\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=2; plural=(n != 1)\n"
"X-User-Defined-Variable: value\n"

[ Callouts: Comments; Informative meta-data; Technical meta-data; Custom fields ]
Gettext PO – Translation Units
# translator-comments
#. extracted-comments
#: filename:linenumber
#, flag...
msgid untranslated-string
msgstr translated-string

# Not sure if 'Katalog' is the right word to use
#. Menu entry, as in File->open..
#: example.c:23
#, fuzzy
msgid "Open Catalog.."
msgstr "Åpne Katalog.."

In example.c:
22 /* Menu entry, as in File->open.. */
23 gui_set_text(menuitem, gettext("Open Catalog.."));
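A minimal sketch of how a tool might read the fields of the translation unit shown on this slide. The helper name `parse_po_entry` is hypothetical, and the code only handles single-line `msgid`/`msgstr` values; it is not a full PO parser.

```python
import re

PO_ENTRY = '''\
# Not sure if 'Katalog' is the right word to use
#. Menu entry, as in File->open..
#: example.c:23
#, fuzzy
msgid "Open Catalog.."
msgstr "Åpne Katalog.."
'''

def parse_po_entry(text):
    """Parse one single-line PO entry into a dict (hypothetical helper)."""
    entry = {"translator_comments": [], "extracted_comments": [],
             "references": [], "flags": [], "msgid": None, "msgstr": None}
    for line in text.splitlines():
        if line.startswith("#."):      # comment extracted from the source code
            entry["extracted_comments"].append(line[2:].strip())
        elif line.startswith("#:"):    # source file reference(s)
            entry["references"].extend(line[2:].split())
        elif line.startswith("#,"):    # flags such as fuzzy or c-format
            entry["flags"].extend(f.strip() for f in line[2:].split(","))
        elif line.startswith("#"):     # free-form translator comment
            entry["translator_comments"].append(line[1:].strip())
        else:
            match = re.match(r'(msgid|msgstr)\s+"(.*)"$', line)
            if match:
                entry[match.group(1)] = match.group(2)
    return entry

entry = parse_po_entry(PO_ENTRY)
```

The order of the `startswith` checks matters: the two-character markers must be tested before the bare `#` of a translator comment.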
Gettext PO – Plural Forms
# translator-comments
#. extracted-comments
#: filename:linenumber
#, flag...
msgid untranslated-string
msgid_plural untranslated-string-plural
msgstr[0] translated-string-case-0
msgstr[1] translated-string-case-1
...
msgstr[n] translated-string-case-n

Polish Example:
msgid "%d file"
msgid_plural "%d files"
msgstr[0] "%d plik"
msgstr[1] "%d pliki"
msgstr[2] "%d plików"
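Which `msgstr[n]` is used at run-time depends on the plural rule declared in the `Plural-Forms` header. As a rough sketch, the rule commonly used for Polish (three forms) can be written in Python as below. The function names are hypothetical and the rule is transcribed by hand from the usual C expression rather than read from a PO header.

```python
def polish_plural_index(n):
    """Index into msgstr[] for Polish, mirroring the usual C expression:
    n==1 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2
    """
    if n == 1:
        return 0
    if 2 <= n % 10 <= 4 and (n % 100 < 10 or n % 100 >= 20):
        return 1
    return 2

# The three Polish forms from the example on this slide.
MSGSTR = ["%d plik", "%d pliki", "%d plików"]

def translate_file_count(n):
    """Pick the plural form for n and substitute the number."""
    return MSGSTR[polish_plural_index(n)] % n
```

For instance, 2 and 22 select `msgstr[1]`, while 5 and 12 select `msgstr[2]`.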
Recent Updates to the PO format
• Preserve changes in the source string
• Disambiguate by adding context.
# translator-comments
#. extracted-comments
#: filename:linenumber
#, flag...
#prev_msgid previous-msgid
#prev_msgid_plural previous-msgid-plural
#prev_msgctxt
msgctxt message-context
msgid untranslated-string
msgid_plural untranslated-string-plural
msgstr[0] translated-string-case-0
msgstr[1] translated-string-case-1
...
msgstr[n] translated-string-case-n
Limitations of PO
• Limited support for meta-data
• Very limited pre-translation support
• No support for binary content such as images and icons
• No segmentation & alignment support
• PO is a simple string table format, and not fit for paragraph-based text and inline elements
– PO is exploited for translation of XML-based formats such as Docbook and SVG
• So we need a replacement…
XLIFF Overview
• XML Localisation Interchange File Format
“…A specification for the lossless interchange of
localizable data and its related information,
which is tool-neutral, has been formalized as
an XML vocabulary, and features an
extensibility mechanism.” [ XLIFF FAQ ]
XLIFF Overview
• Extract localisable content to a common file format
• Extract – Localise – Merge
[ Figure: Original Material → Extract (convert) → Localised Data (Translation Units) ]
XLIFF Overview
<xliff version='1.2'>
  <file original='example.txt' source-language='en' target-language='nb-NO'>
    <header>
      Meta-data on file and localisation process
    </header>
    <body>
      <trans-unit id='#1'>
        <source>Hello World!</source>
        <target>Hei Verden!</target>
        <alt-trans>
          Translation suggestions from TM, MT...
        </alt-trans>
      </trans-unit>
      <group>
        <trans-unit> ... </trans-unit>
      </group>
    </body>
  </file>
</xliff>
[ Callouts: Header; Body ]
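A small sketch of reading translation units from a document shaped like the one on this slide, using Python's standard `xml.etree.ElementTree`. The function name `read_units` is hypothetical. The sample omits the XML namespace, as the slide does; real XLIFF 1.2 files normally declare `urn:oasis:names:tc:xliff:document:1.2`, in which case the tag names below would need to be namespace-qualified.

```python
import xml.etree.ElementTree as ET

XLIFF = """\
<xliff version='1.2'>
  <file original='example.txt' source-language='en' target-language='nb-NO'>
    <body>
      <trans-unit id='#1'>
        <source>Hello World!</source>
        <target>Hei Verden!</target>
      </trans-unit>
    </body>
  </file>
</xliff>
"""

def read_units(xliff_text):
    """Return (id, source, target) for every trans-unit in the document."""
    root = ET.fromstring(xliff_text)
    units = []
    for unit in root.iter("trans-unit"):
        target = unit.find("target")   # absent when the unit is untranslated
        units.append((unit.get("id"),
                      unit.findtext("source"),
                      target.text if target is not None else None))
    return units

units = read_units(XLIFF)
```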
XLIFF Overview
• Support for features PO is lacking
– abstraction of inline codes and markup
– advanced context information
• Through <context-group> elements
– Workflows
• Through <phase> elements
– Pre-translation and Translation suggestions
• Through <alt-trans> elements
– Other meta-data
One step forward…
From PO to XLIFF
• A deliverable for XLIFF 1.2 was a set of
Representation Guides for describing how
common file formats could be presented in
XLIFF
• Our Goal: To create a standard XLIFF
representation of the PO file format
• The XLIFF Representation Guide for Gettext
PO is part of XLIFF 1.2
The XLIFF Tools Project
• Aimed to develop tools to support XLIFF in Open Source Localisation
– Guide for representing PO in XLIFF
– Input from key people in various Open Source Communities
• Started in January 2005
• Hosted on freedesktop.org
• >200 Messages between January and July 2005
• The XLIFF Representation Guide for PO was transferred to the XLIFF TC in July 2005
PO to XLIFF
• A PO file maps to an XLIFF <file> element
• PO Header stored in skeleton or treated as a
translation unit
• A PO translation unit maps to an XLIFF <trans-unit> element
– Each plural form also maps to a <trans-unit> element, but contained within a <group> element
• Inline codes such as parameterized strings
abstracted when possible
PO Translation Units
<trans-unit id='messages_1' approved='no'>
  <source>untranslated-string</source>
  <target>translated-string</target>
  <note from='po-file'>translator-comments</note>
  <context-group name='po-reference#1' purpose='location'>
    <context context-type='sourcefile'>sourcefile</context>
    <context context-type='linenumber'>linenumber</context>
  </context-group>
  <context-group name='po-entry-header' purpose='information'>
    <context context-type='x-po-autocomment'>extracted-comments</context>
  </context-group>
</trans-unit>
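The mapping on this slide could be sketched in Python roughly as follows. The function `po_entry_to_trans_unit` and its parameters are hypothetical, only the translator comment and one source reference are mapped, and the output is not validated against the XLIFF schema.

```python
import xml.etree.ElementTree as ET

def po_entry_to_trans_unit(unit_id, msgid, msgstr, comment=None, reference=None):
    """Build a <trans-unit> from one PO entry (illustrative mapping only)."""
    unit = ET.Element("trans-unit", {"id": unit_id, "approved": "no"})
    ET.SubElement(unit, "source").text = msgid
    ET.SubElement(unit, "target").text = msgstr
    if comment:
        # Translator comments become a <note> attributed to the PO file.
        ET.SubElement(unit, "note", {"from": "po-file"}).text = comment
    if reference:
        # A "#: file:line" reference becomes a location context group.
        sourcefile, linenumber = reference.rsplit(":", 1)
        group = ET.SubElement(unit, "context-group",
                              {"name": "po-reference", "purpose": "location"})
        ET.SubElement(group, "context",
                      {"context-type": "sourcefile"}).text = sourcefile
        ET.SubElement(group, "context",
                      {"context-type": "linenumber"}).text = linenumber
    return unit

unit = po_entry_to_trans_unit(
    "messages_1", "Open Catalog..", "Åpne Katalog..",
    comment="Not sure if 'Katalog' is the right word to use",
    reference="example.c:23")
xml_text = ET.tostring(unit, encoding="unicode")
```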
Plural Forms
<group restype='x-gettext-plurals'>
  <trans-unit id='messages_1[0]'>
    <source>untranslated-string-singular</source>
    <target>translated-string-form-0</target>
  </trans-unit>
  <trans-unit id='messages_1[1]'>
    <source>untranslated-string-plural</source>
    <target>translated-string-form-1</target>
  </trans-unit>
  ...
  <trans-unit id='messages_1[n]'>
    <source>untranslated-string-plural</source>
    <target>translated-string-form-n</target>
  </trans-unit>
  ...additional context information...
</group>
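As a sketch under the same assumptions as the structure on this slide, a plural PO entry could be turned into a `<group>` like this. `plural_entry_to_group` is a hypothetical name, and the Polish strings are reused from the earlier PO example.

```python
import xml.etree.ElementTree as ET

def plural_entry_to_group(base_id, msgid, msgid_plural, msgstrs):
    """Wrap each plural form of one PO entry in its own <trans-unit>."""
    group = ET.Element("group", {"restype": "x-gettext-plurals"})
    for index, msgstr in enumerate(msgstrs):
        unit = ET.SubElement(group, "trans-unit",
                             {"id": f"{base_id}[{index}]"})
        # Form 0 carries the singular source; all other forms carry the plural.
        source_text = msgid if index == 0 else msgid_plural
        ET.SubElement(unit, "source").text = source_text
        ET.SubElement(unit, "target").text = msgstr
    return group

group = plural_entry_to_group("messages_1", "%d file", "%d files",
                              ["%d plik", "%d pliki", "%d plików"])
```

Note that the number of `<trans-unit>` elements follows the target language (three for Polish), not the two forms of the English source.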
PO Header
• PO header stored in skeleton in workflows where
variables are not modified by translators
• Otherwise the header is treated as a translation unit:
<trans-unit id='messages_1' approved='yes'>
  <source>
    Project-Id-Version: Project Name and Version
    ...
    POT-Creation-Date: YYYY-DD-MM HH:MM-SSSS
  </source>
  <target>
    Project-Id-Version: Project Name and Version
    ...
    POT-Creation-Date: YYYY-DD-MM HH:MM-SSSS
  </target>
  <note from='po-file'>
    SOME DESCRIPTIVE TITLE.
    Copyright (C) YEAR THE PACKAGE'S COPYRIGHT HOLDER
  </note>
</trans-unit>
Abstraction of inline codes
PO:
#, c-format
msgid "My name is %s"
msgstr ""
becomes
<trans-unit id='messages_1' approved='no'>
  <source>My name is <ph id='#1' type='x-c-param'>%s</ph></source>
</trans-unit>
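One way such an abstraction step might look in Python is sketched below. The regular expression is a deliberate simplification that only recognises a few common C conversions (it ignores `%%`, positional arguments and length modifiers), and `abstract_c_format` is a hypothetical name. The `<ph>` attributes follow the slide as written.

```python
import re
import xml.etree.ElementTree as ET

# Simplified pattern: matches common C conversions such as %s, %d, %5.2f.
C_FORMAT = re.compile(r"%[-+ #0]*\d*(?:\.\d+)?[sdiufxc]")

def abstract_c_format(msgid):
    """Return a <source> element with each format specifier wrapped in <ph>."""
    source = ET.Element("source")
    last = None   # most recently added <ph>, if any
    pos = 0
    for number, match in enumerate(C_FORMAT.finditer(msgid), start=1):
        text = msgid[pos:match.start()]
        if last is None:
            source.text = text      # text before the first placeholder
        else:
            last.tail = text        # text between two placeholders
        last = ET.SubElement(source, "ph",
                             {"id": f"#{number}", "type": "x-c-param"})
        last.text = match.group(0)
        pos = match.end()
    rest = msgid[pos:]
    if last is None:
        source.text = rest
    else:
        last.tail = rest
    return source

xml_text = ET.tostring(abstract_c_format("My name is %s"), encoding="unicode")
```

An XLIFF editor can then treat each `<ph>` as a protected token that translators may move but not alter.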
XLIFF in Open Source Localisation
Workflows
How do we handle XLIFF-based
localisation within present build
systems and development processes?
Common PO-based Workflow
• Uses the PO file format throughout the localisation process
– PO files stored in Version Control System
• Uses PO Compendiums as Translation Memories
• Translators work with PO editors (like KBabel) or text editors
• Other formats converted to PO for localisation and follow a similar process
1) Optional XLIFF-based Workflow
• PO files optionally converted to XLIFF for
translation in XLIFF-based editors
• Translators can choose to use PO or XLIFF
• PO files still used as persistent file format
• XLIFF meta-data lost on back-conversion to PO
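To illustrate the last point, here is a rough sketch of a back-conversion that keeps only what a PO entry can hold and reports what it drops. The set `PO_REPRESENTABLE` and the function `trans_unit_to_po` are hypothetical, and quoting or escaping of the strings is ignored for brevity.

```python
import xml.etree.ElementTree as ET

UNIT = """\
<trans-unit id='messages_1'>
  <source>Hello World!</source>
  <target>Hei Verden!</target>
  <alt-trans><target>Hallo Verden!</target></alt-trans>
</trans-unit>
"""

# Child elements this sketch knows how to express in PO (an assumption).
PO_REPRESENTABLE = {"source", "target", "note"}

def trans_unit_to_po(unit_xml):
    """Return (po_text, dropped_tags) for one <trans-unit>."""
    unit = ET.fromstring(unit_xml)
    source = unit.findtext("source")
    target = unit.findtext("target", default="")
    lines = [f"# {note.text}" for note in unit.findall("note")]
    lines.append(f'msgid "{source}"')
    lines.append(f'msgstr "{target}"')
    # Anything else, such as <alt-trans> suggestions, has no place in PO.
    dropped = [child.tag for child in unit
               if child.tag not in PO_REPRESENTABLE]
    return "\n".join(lines), dropped

po_text, dropped = trans_unit_to_po(UNIT)
```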
2) Native XLIFF-based Workflow
• Eliminate the need for PO in the localisation process
• Convert to XLIFF in the build system
• Store XLIFF files, not PO files, in the repository
• Uses PO in the build systems as an intermediate format
• Needs additional gettext-like tools to merge and initialize XLIFF files
3) Gettext-integrated XLIFF Workflow
• Same as previous, but:
• XLIFF support implemented within the Gettext
toolkit
• Extract resources directly to XLIFF, eliminating
need for PO
• Only works for GNU Gettext based processes
– What about 3rd party tools?
…and two steps back again?
Status Quo
• Two years have passed since these discussions,
and there is little or no uptake of XLIFF in
Open Source Localisation processes
• Why is there so little interest in XLIFF?
Status Quo
“I don't expect XLIFF based open-source translation editors in the next 3 years: It took ca. 2 years until KBabel was built, which is so far the only good open-source translation editor. (….) …An editor which not only has to accommodate a hundred of different elements and attributes, but also a configurable GUI around it, is not going to be seen in the open-source world soon.”
Bruno Haible (GNU Gettext maintainer)
xliff-tools mailing list, Feb 2005
Tool Support
SUN’s Open Language Tools
• XLIFF 1.0 Editor
• Integrated “Mini-TM”
• XLIFF converters for
– HTML
– Docbook SGML
– JSP
– XML
– OpenOffice.org
– Plaintext
– Software Messages
The Wordforge Foundation
• Translate.org.za & KhmerOS Collaboration
• The Translate Toolkit
– Converters for e.g. Mozilla and OpenOffice.org formats to PO and XLIFF
– Uses XLIFF and PO as common Resource Containers
– QA Tools
• Pootle
– Web-based Translation Environment
• Pootling
– Rich Client Translation Environment
Wordforge: Pootle
• Web based Translation and Project
Management
Wordforge: Pootling
• XLIFF and PO editor
• Supports TBX glossaries
• Integrated TM
• Uses the Translate Toolkit Internally
• Still in very early development
KBabel & Kaider
• KBabel has been the most advanced Translation Editor for PO
• Part of the KDE project
• No longer maintained
• New tool on the block: Kaider
• TM and TBX support
• XLIFF support on the TODO-list
Okapi & OmegaT
• Use the Okapi Framework for processing files for translation
• Translate using OmegaT
• .NET and Java combination
XLIFF Tool Support
• Tools only implement the basic features of
XLIFF
• In reality, there is not even a single mature
open source XLIFF (1.1 or 1.2) translation
environment for the GNU/Linux platform
Vertical Solutions
• Project-specific Localisation Tools
– KDE: KBabel/Kaider (PO based)
– Mozilla: Mozilla Translator
– GNU/Gnome: GTranslator (PO based)
– Eclipse: Eclipse Babel (planned!)
• Significant challenges in creating a cross-project
solution based around XLIFF
Current Challenges
• Little separation between Developer and
Translator
– Need to know Source Control systems like CVS
– No abstraction
• Need to know source formats like C and C++ format
strings
• No protection of inline markup
• Only ad-hoc glossary management
• Only ad-hoc translation reuse
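Because translators see raw format strings, a damaged or dropped specifier can slip through unnoticed. A minimal QA check, sketched in Python with a simplified pattern and a hypothetical function name, could compare the specifiers on both sides. It ignores ordering, which is itself a simplification, since C format strings without positional arguments must keep their order.

```python
import re
from collections import Counter

# Simplified pattern: matches common C conversions such as %s, %d, %5.2f.
C_FORMAT = re.compile(r"%[-+ #0]*\d*(?:\.\d+)?[sdiufxc]")

def format_specifiers_match(msgid, msgstr):
    """True when source and translation use the same set of specifiers."""
    return (Counter(C_FORMAT.findall(msgid))
            == Counter(C_FORMAT.findall(msgstr)))
```

With this check, `"%d files"` against `"%d plików"` passes, while a translation of `"My name is %s"` that omits `%s` is flagged.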
• Wiki-based glossaries at best
• Language-based, not project based
• New Tools starting to consider Terminology
Management
– Kaider and Pootling support TBX based glossaries
– Still only retro-fitted Terminology Management
Where to from here?
What is XLIFF support?
• XLIFF is only the resource container…
• There is a pressing need to build an eco-system of
tools around XLIFF, similar to the rich set of tools
currently existing for the PO format
– Merging, word-count, QA checks…
– XLIFF without the tools support is really two steps back…
• Perhaps a case for an XLIFF ‘reference
implementation’?
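As an example of the kind of small utility meant here, a word count over XLIFF source segments might be sketched as follows. `count_source_words` is a hypothetical name, words are counted naively by splitting on white-space, and the sample document omits the XLIFF namespace for brevity.

```python
import xml.etree.ElementTree as ET

SAMPLE = """\
<xliff version='1.2'>
  <file original='example.txt' source-language='en'>
    <body>
      <trans-unit id='1'><source>Hello World!</source></trans-unit>
      <trans-unit id='2'>
        <source>My name is <ph id='1'>%s</ph></source>
      </trans-unit>
    </body>
  </file>
</xliff>
"""

def count_source_words(xliff_text):
    """Count white-space separated words in all <source> elements."""
    root = ET.fromstring(xliff_text)
    total = 0
    for source in root.iter("source"):
        # itertext() also includes text inside inline elements such as <ph>.
        total += len("".join(source.itertext()).split())
    return total

word_count = count_source_words(SAMPLE)
```

A production tool would need to decide whether placeholders count as words and whether `<source>` elements inside `<alt-trans>` should be skipped.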
Towards an XLIFF Standard
• The XLIFF 1.2 Specification is a significant
improvement over previous versions
– Less ambiguity
– More consistent
– Support for segmentation
– Representation Guides for HTML, Java, PO
Towards an XLIFF Standard
• Some challenges with the current standard
– “Poor” white-space handling
– Canonicalisation
• XML representation
• Resource representation
• Tool processing
– Complexity and Separation of Concerns
• XLIFF 2.0 should address these issues
What we are doing at QUT with XLIFF
• Using XLIFF extensively in our research
• Java-based API for manipulating XLIFF
– Thin layer upon the XOM (XML object model) library
– Every element is an object (File, Group, TransUnit …)
– XPath, XSLT support
– Property change support
• Converter for PO
• Will be made available as open source “at some
point, or sooner if you bug me”™
Thank You!
• James M. Hogan (QUT)
• XLIFF Tools Project Contributors
• Red Hat team, Brisbane
• Dwayne Bailey (translate.org.za & Wordforge)
Questions / Discussion