Interoperability of XLIFF 2.0 Glossary Module and TBX-Basic · 2015-04-09 · Localisation Focus, The International Journal of Localisation, Vol.14 Issue 1
Localisation Focus Vol.14 Issue 1, The International Journal of Localisation

Interoperability of XLIFF 2.0 Glossary Module and TBX-Basic

James Hayes [1], Sue Ellen Wright [2], David Filip [3], Alan Melby [4], and Detlef Reineke [5]

[1] BYU Translation Research Group
[2] Kent State University, Kent, Ohio, USA
[3] University of Limerick, Limerick, Ireland
[4] LTAC Global
[5] Universidad de Las Palmas de Gran Canaria

[email protected], [email protected], [email protected], [email protected], [email protected]

Abstract

This article describes a bidirectional mapping between the XLIFF 2.0 Glossary Module and the TermBase eXchange format (TBX), in particular the TBX-Basic dialect. This mapping is slated to be endorsed by the OASIS XLIFF TC as a Committee Note, thus providing the canonical model for interoperability between the two complementary standards. The article recounts the history of the TBX format’s evolution from SGML to XML, beginning with its development through TEI to ISO, LISA, ETSI, and TerminOrgs. It presents the core structure of the TBX term entry and explains how the XLIFF Glossary entry easily fits inside this model, facilitating interchange. The structure and data categories of the two models are discussed, followed by a mapping demonstrating the logical conversion path between the two approaches. The role played by the ISOcat Data Category Registry is also introduced. Appendices provide a more detailed view of overall structures and data category assignments.

Keywords: XLIFF, TBX, Interoperability, Standards, Translation, Terminology, Terminology Exchange, Terminology Management

1. Introduction

The XML Localisation Interchange File Format (XLIFF) standard from OASIS is intended to function as a file format for the interchange of localisable data in bitext form that are passed between tools during a localisation/translation process, with the primary goal of lossless information transfer (Comerford, T., Filip, D., Raya, R.M., Savourel, Y., Eds. 2014; Savourel, Y., 2014). XLIFF 2.0 allows for terms in segments of the bitext to be linked to simple entries in an optional glossary module intended to store important term-related information as fully as possible without overstepping the scope of the overall file. This specific intended purpose also allows the XLIFF glossary module to maintain compatibility with larger glossary formats which are specialized for the task of terminology management, such as the Basic dialect of TermBase eXchange (TerminOrgs, 2014).

TBX-Basic is a dialect of the XML-based ISO 30042, TermBase eXchange (TBX) format (ISO 30042:2009, referred to in this article as the TBX Standard) and is intended to be, as its name suggests, simpler than its older and more powerful cousin TBX-Default, which comprises the full scope of the standard. TBX-Basic is not a standard per se (although it is sometimes referred to as a de facto standard) and is considered a guideline for use in localisation environments. Where TBX-Default has more than 120 data categories and many of those can be used with multiple type values, TBX-Basic is a fully contained subset of TBX-Default that features 28 data categories (DCs) and substantially reduces the number of permissible instances assigned to various DCs. Nevertheless, TBX-Basic is still capable of storing a large amount of terminological information, is fully compatible with the core TBX standard, and adheres to the constraints of a terminological markup language (TML) as defined by ISO 16642, Terminological Markup Framework (TMF) (2003). Figure 1 illustrates the structure of a TMF/TBX data record, which comprises a concept-oriented container called a <termEntry>. In addition to conceptual information (possibly including a definition for the concept) pertaining to the entire entry, the term entry has embedded in it at least one <langSet> containing all terms for the concept and all related information pertaining to a given language. Included in each langSet are one or more <tig> elements, each containing a single term in that language and associated with the concept, along with related information, including (optionally) one or more contexts where the term is used in text (the complete TBX core structure is illustrated in Appendix I).

Terminological data categories are generally instantiated as a value of an attribute associated with a meta data-category (<descrip>, <admin>, <xref>, etc.). A small number of data categories are instantiated directly as element names (<term>, <date>, <note>) or as attributes (id, xml:lang). Meta data-categories can also be used to group information, as shown in the following example:

    <tig>
        <term>fish</term>
        <descripGrp>
            <descrip type="context">This is a sample fish context</descrip>
            <admin type="source">New York Times</admin>
        </descripGrp>
    </tig>
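To make the pattern concrete, the following minimal Python sketch parses the <tig> example and lists each data category together with the meta data-category element that carries it. It relies only on the standard library; the function name read_tig is a hypothetical helper introduced here for illustration and is not part of any TBX tooling.

```python
import xml.etree.ElementTree as ET

# The <tig> example from the text, written with straight quotes so it parses.
TIG_XML = """
<tig>
  <term>fish</term>
  <descripGrp>
    <descrip type="context">This is a sample fish context</descrip>
    <admin type="source">New York Times</admin>
  </descripGrp>
</tig>
"""

def read_tig(xml_text):
    """Return the term and its data categories as (meta element, type, value)."""
    tig = ET.fromstring(xml_text)
    term = tig.findtext("term")
    # A data category is carried by the type attribute of a meta element.
    categories = [
        (el.tag, el.get("type"), (el.text or "").strip())
        for el in tig.iter()
        if el.get("type") is not None
    ]
    return term, categories

term, categories = read_tig(TIG_XML)
print(term)
for meta, data_category, value in categories:
    print(meta, data_category, value)
```

The output pairs the context and source data categories with the <descrip> and <admin> elements respectively, which is the grouping that <descripGrp> expresses.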

TBX-Basic is intended to be a structurally compliant member of the TBX family of formats that is populated by a selected set of the most common data categories used in fairly uncomplicated terminology databases. Whether by serendipity or by design, TBX-Basic can be treated as structurally compatible with the XLIFF glossary module because the elements in the XLIFF model map easily to a subset of the elements in the TBX-Basic set.

2. TBX Development

The TBX standard has deep roots. It began as Chapter 13 of the Text Encoding Initiative’s P3 iteration of TEI’s original SGML-based text markup environment (Text Encoding Initiative, 1994/1999; Cover, R., 2002). Under the guidance of Alan Melby (Brigham Young University), Klaus-Dirk Schmitz (University of Applied Sciences, Cologne), Sue Ellen Wright (Kent State University), and Gerhard Budin (University of Vienna), it was introduced to ISO and eventually became ISO 12200:1999, MARTIF. Here lies the origin of the enigmatic <martif> root element, which has been maintained in keeping with a commitment to backward compatibility.

With the general move from SGML to XML as the primary vehicle for encoding textual data, an XML serialization of the MARTIF model was developed through the so-called SALT project under the aegis of the European research 5th framework known as the Human Language Technologies (HLT) project (SALT, 1998-2002). As the format evolved, the Localization Industry Standards Association (LISA) OSCAR (Open Standards for Container/content Allowing Re-use) Special Interest Group (SIG) picked up the project under the leadership of Kara Warburton, publishing the new industry standard openly on the web (Lommel, A., 2007). It eventually came back to ISO in the form of a jointly published standard, ISO 30042:2008.

Figure 1: Structural model of a terminological entry in TBX

Unfortunately, the LISA organization experienced financial difficulties and ceased operations in February 2011, which led to the transfer of LISA’s intellectual property, including the TBX standard, to ETSI in May 2011 (see e.g. Cuddihy, K. 2011). The then chief executive of the organization, recognizing that in its function as a standards body LISA had produced a number of viable industry standards, TMX (Translation Memory eXchange), SRX (Segmentation Rules eXchange) and TBX in particular, chose to transfer the intellectual property rights for the standards to ETSI, the European Telecommunications Standards Institute ISG (Industry Specification Group) (Guillemin, 2011). ETSI now shares further development of the standard with the TerminOrgs (Terminology for Large Organizations) component of LTAC Global (TerminOrgs, LTAC 2014), which enjoys significant joint membership with the old LISA/OSCAR group and with ISO’s Technical Committee 37, Sub-Committee 3, Systems to manage terminology, knowledge and content.

Both the TBX-Basic and the parent TBX standard are available on the TerminOrgs site (TerminOrgs, 2014). Another important source of TBX and LISA-related information is the GALA/CRISP (Globalization and Localization Association/Collaborative Research, Innovation, and Standards Program), whose mission is to provide a clearing house for information on language industry standards, including the latest (last) versions of LISA’s TMX, TBX, and SRX (GALA, 2015). The standards are also available from ttt.org, along with a significant collection of utilities and sample files (TerminOrgs (previously the LISA terminology SIG) at ttt.org, 2015).

A further source of information under development is the TBXinfo website at www.tbxinfo.net, which is slated to provide a full range of support materials concerning the TBX standard (ISO 30042) and its various forms and dialects (TBX-Default, TBX-Basic, TBX-min). For referencing TBX in web-related XML documents, the namespace is http://iso.org/ns/tbx. Many of the TBX data categories (DCs) are already available via persistent identifiers (PIDs) of the form:

    https://www.isocat.org/datcat/DC-[xxxx],

where [xxxx] represents the unique ID of a given DC in the Data Category Registry (see below).

Anyone wishing to examine a sample TBX database may download the IATE (InterActive Terminology for Europe) termbase, which contains 8 million terms in 24 European languages. The intention of this massive download is to enable users to integrate IATE data into local terminology management systems and Translation Environment Tools (TEnTs).
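The PID pattern given above for data categories can be expressed as a one-line function. The sketch below is illustrative only: the function name isocat_pid is hypothetical, and the numeric IDs are the part-of-speech values listed for TBX-Basic in Table 2. Whether a given PID still resolves depends on the current state of the ISOcat repository.

```python
# Part-of-speech values and their data category IDs, as listed in Table 2.
PART_OF_SPEECH_DC = {
    "noun": 1333,
    "verb": 1424,
    "adjective": 1230,
    "adverb": 1232,
    "properNoun": 384,
    "other": 4336,
}

def isocat_pid(dc_id):
    """Build a persistent identifier following the DC-[xxxx] pattern."""
    return f"https://www.isocat.org/datcat/DC-{dc_id}"

for value, dc_id in PART_OF_SPEECH_DC.items():
    print(value, isocat_pid(dc_id))
```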

Work is ongoing to issue an updated version of the full TBX standard, with the goal of introducing enhancements while at the same time maintaining reverse compatibility in order to protect legacy data.

In parallel with the development of TBX, ISO TC 37 has also developed a Data Category Registry (DCR) designed as a dynamic repository of data category specifications, which houses not only TBX-related data categories originally listed in ISO 12620:1999, but several thousand data categories used in a wide range of language resources (ISO 12620:1999; ISOcat, 2015). Originally sponsored by The Max Planck Institute for Psycholinguistics in Nijmegen, The Netherlands, the ISOcat resource has recently changed venue, but remains accessible as a static repository at http://www.isocat.org. Plans are under way (at the time of writing, early 2015) for its resurrection as an active data resource residing in the TermWeb environment.

The rather confusing collection of different organizations reflects the need to bring together the essential experts in the field in openly available forums, as some industry organizations are closed to non-paying members, and some industry experts have not affiliated with official standards bodies. ISO standards are desirable on the one hand because they are required in some official venues, but the ISO model contrasts with the policy of free and open standards that prevails in the Internet and World Wide Web environment. As a consequence, the standard has been repositioned several times in order to ensure both the international weight and validity of an ISO standard and the free availability of most components, particularly all processable components, of the standard.

2.1 Scenarios for XLIFF<->TBX Interoperability

The following scenarios describe some of the possible use cases in which conversion between TBX and XLIFF Glossary data, in either direction, will contribute to improved localisation productivity and/or quality.

• At the beginning of a project, the described mapping will enable agents and users to populate the XLIFF glossary module with data from an existing TBX-Basic compatible termbase, using a conversion utility or web service designed for that purpose.
• An XLIFF-compatible Translation Environment Tool (TEnT) or Computer Assisted Translation (CAT) tool that does not feature an interactive interface with a companion termbase can allow translators to mark terms while translating and automatically store them in the XLIFF glossary module. After completion of the translation, the glossary module data can be harvested for new terms to add to any terminological database using the same TBX mapping in the opposite direction. This procedure can comprise an approval workflow for updating obsolete entries and adding new entries in an existing TBX-compliant termbase. In both cases, contextual examples can come directly from the bitext; that is, the segments or units in which the terms were used can be featured as such examples.
• In the event that one or more TEnTs in the localisation production chain do not have an interactive termbase available to the translator, the XLIFF Glossary can be used for terminological support of translators and editors working with the XLIFF file.
• Even when interactive termbases are available to translators, the glossary module can be used to provide just the locally relevant terminology and to serve as a working space for that terminology within the project.
• Terminology suggestions collected via the glossary module can be used as seed terminology in the target or even the source language to jump-start terminology management and termbase setup efforts where none previously existed, a scenario more common in the industry than professional terminologists are willing to believe.

All of the above and many more possible scenarios make use of at least two of the four possible facilitated interactions:

1. Termbase data and metadata enrich XLIFF using the mapping.
2. Translation agents (human and machine) are informed by the seeded data, which helps them make better decisions.
3. Human or text analysis agents enter or update data and metadata in the module during the translation process.
4. The wider terminology management process consumes data and metadata introduced or curated through the module using the mapping.

3. XLIFF Glossary Module

As defined in XLIFF Version 2.0 (Comerford, T., Filip, D., Raya, R.M., Savourel, Y., Eds. 2014; Savourel, Y., 2014), the XLIFF Glossary Module is a namespace-based extension optionally embedded in an XLIFF 2.0 file. The <glossary> element is the root element of the module and is only mandatory upon inclusion of the module in an XLIFF file. The module allows the inclusion of simple glossaries and in its current form comprises the following elements: <glossary>, <glossEntry>, <term> (the term occurring in a given context in the source text), <translation> (one or more possible target language equivalents) and <definition>.

A glossary node can contain one or more <glossEntry> elements, and each <glossEntry> must contain exactly one <term> element. The <term> is accompanied by all relevant information pertaining to this single term as used in a specific translatable text context, including an optional definition, a reference to the usage within the translatable text at hand, and possibly multiple translations. Since it only contains information on a single term in the given context, an XLIFF <glossEntry> complies with the TMF/TBX requirement that a <termEntry> treat a single concept.
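The structural constraints on a glossary entry can be checked mechanically. The following Python sketch is a minimal illustration under stated assumptions: it tests the "exactly one <term>" rule described above together with the rule, noted in the Conclusion, that a valid entry needs either a definition or a translation. The function name check_gloss_entry and the sample data are hypothetical, and the sketch is no substitute for validation against the XLIFF 2.0 schemas.

```python
import xml.etree.ElementTree as ET

GLS = "urn:oasis:names:tc:xliff:glossary:2.0"  # Glossary Module namespace

def check_gloss_entry(entry):
    """Return a list of problems found in one <gls:glossEntry> element."""
    problems = []
    terms = entry.findall(f"{{{GLS}}}term")
    if len(terms) != 1:
        problems.append(f"expected exactly one <term>, found {len(terms)}")
    has_translation = entry.find(f"{{{GLS}}}translation") is not None
    has_definition = entry.find(f"{{{GLS}}}definition") is not None
    if not (has_translation or has_definition):
        problems.append("needs a <translation> or a <definition>")
    return problems

# Hypothetical sample: the second entry lacks both translation and definition.
SAMPLE = f"""
<gls:glossary xmlns:gls="{GLS}">
  <gls:glossEntry>
    <gls:term>TAB key</gls:term>
    <gls:translation>Tabstopptaste</gls:translation>
  </gls:glossEntry>
  <gls:glossEntry>
    <gls:term>fish</gls:term>
  </gls:glossEntry>
</gls:glossary>
"""

glossary = ET.fromstring(SAMPLE)
results = [check_gloss_entry(entry) for entry in glossary]
print(results)
```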

Obviously, if users wish to document multiple locally relevant terms, there can be multiple <glossEntry> elements in each glossary node. It is interesting to note that, in contrast to term entries in termbases, there is only one term, reflecting a single given instance of the term in a specific context. There can, however, be multiple equivalents in the case of multiple existing or proposed translations. Should a situation occur in which multiple source terms represent a single concept, there are a few ways to convey this within the Glossary Module:

1. While each <glossEntry> can only point to a single source occurrence of the term within the same <unit>, XLIFF core term annotation can be used to reference a <glossEntry> the other way round, that is, from the source text, for instance in the following cases:
   a) The same term has been used more than once in the same <unit> element.
   b) Different lemmas of the same term have been used in the same <unit> element.
   c) Different synonymous terms have been used throughout the <unit>, <file> or the entire XLIFF file.
2. Use identical definitions in the <definition> elements of synonymous term entries, possibly mentioning the other synonymous terms in the definition.
3. Use an external termbase concept identifier (ideally a dereferenceable URL) to link synonyms. This information can be sent through a dedicated extended attribute or included in the module’s own source attribute, which is free text and does not have any prescribed semantics.
4. Introduce an extended element to express the synonymy relationship without an external reference. Such an element could carry a list of fragment identifiers that point to synonymous terms within the same <unit>, <file> or <xliff> element.

Method 1, possibly combined with method 2, will ensure maximum interoperability along the bitext roundtrip. Information conveyed via methods 3 or 4 would not be interoperable during the XLIFF roundtrip without a pre-agreed handshake mechanism, but may nevertheless be critical for terminology post-processing in the termbase environment. If method 3 or 4 has to be used for the sake of automated terminology management outside of the XLIFF-based bitext roundtrip, interoperability during the XLIFF roundtrip should still be ensured using methods 1 and/or 2.

Element/Attribute Name | Description
<glossary> | Glossary Module container element at <unit> level that can contain an arbitrary number of locally relevant glossary entries.
<glossEntry> | Single glossary entry element wrapping a single source term and all related data and metadata. It is extensible by elements and attributes from other namespaces.
<term> | Contains one term and only one term, a single word or a multiword expression.
<translation> | Contains a translation of the sibling <term> element content in the XLIFF file’s target language; multiple translations can be proposed as variants or synonyms within the same entry.
<definition> | Contains a definition of the concept represented by the term.
ref | IRI that identifies the term as a text span within source or target translatable text of the same <unit> element. May be used on <glossEntry> or <translation>.
source | Free text indicating the origin of the <term>, <translation>, or <definition> content.

Table 1: XLIFF Glossary Elements and Attributes

4. TBX-Basic

As noted above, the root element of TBX-Basic is <martif>, as it is based on the original SGML MARTIF standard. A <martif> element contains a <martifHeader> element and a <text> element. As the XLIFF Glossary Module does not contain any data categories that would map to the <martifHeader>, only the <text> element will be discussed; see the TBX-Basic guidelines on the TerminOrgs site for detailed information on <martifHeader>. The <text> element contains a <body> element, which contains the terminological entries of the TBX file and is organized as illustrated in Figure 1 and Appendix I.

As it is language specific, the <langSet> element must include an xml:lang attribute representing the language or locale to which it refers in compliance with IETF BCP 47 (2009). At its simplest, such a language code may comprise just the two-letter ISO 639-1 code (e.g., “en” for English). It is also commonly combined with ISO 3166 country codes to provide more specific regional information (e.g., “fr-CA” for Canadian French), or combined with other information such as script codes. Importantly for the mapping, XLIFF also uses BCP 47 as the norm to indicate its source and target languages at the <xliff> element level by setting the srcLang and trgLang attributes. The srcLang attribute determines the language of the Glossary Module <term> element, while the trgLang attribute determines the language of the <translation> elements.
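Because both formats rely on BCP 47, the language codes can be carried over unchanged. The Python sketch below illustrates this under simple assumptions: it reads srcLang and trgLang from an <xliff> root element and creates one empty <langSet> for each. The function name lang_sets_for is hypothetical, and the sample document is a bare root element constructed for the illustration.

```python
import xml.etree.ElementTree as ET

XLIFF = "urn:oasis:names:tc:xliff:document:2.0"  # XLIFF 2.0 core namespace
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def lang_sets_for(xliff_root):
    """Create one empty <langSet> per XLIFF language, keyed by role."""
    lang_sets = {}
    for role, attribute in (("source", "srcLang"), ("target", "trgLang")):
        code = xliff_root.get(attribute)
        if code:  # skip a language attribute that is not set
            lang_set = ET.Element("langSet")
            lang_set.set(XML_LANG, code)  # BCP 47 tag copied as-is
            lang_sets[role] = lang_set
    return lang_sets

root = ET.fromstring(
    f'<xliff xmlns="{XLIFF}" version="2.0" srcLang="en" trgLang="fr-CA"/>'
)
sets = lang_sets_for(root)
print(ET.tostring(sets["source"], encoding="unicode"))
print(ET.tostring(sets["target"], encoding="unicode"))
```

Terms taken from <term> elements would then be placed in the source <langSet>, and terms taken from <translation> elements in the target <langSet>.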

Each <langSet> element contains at least one <tig> (term information group) element. The <tig> element provides all of the specific information on a term such as contextual examples, part of speech, and so forth. In TBX entries, definitions are often anchored either at the <termEntry> or <langSet> levels because they generally pertain to the whole entry or to a specific language.

The <tig> element must include at least one <term>, which contains a plain text representation of a term associated with the <termEntry> concept. Aside from <term>, there are several other elements that may be used to provide additional information, but none are mandatory. Nevertheless, Part of Speech (<termNote type="partOfSpeech">) is highly recommended, and TerminOrgs maintains that it should be used in all cases to optimize repurposability of the termbase (see Appendix I for more specific information).

5. The Mapping

Table 3 below maps the data categories available in the XLIFF Glossary Module to those in TBX-Basic based on their respective semantics. This mapping has been proposed as a way to enable interoperability between the two formats by laying down a foundation upon which file conversion applications and web services could be based.

Because XLIFF <glossEntry> is extensible by attributes and elements from other namespaces, a maximalist one-to-one mapping is obviously possible that could roundtrip all TBX data categories in the TBX namespace elements and attributes. However, such an endeavour is neither necessary nor advisable. A full mapping of this kind would clutter the minimalistic glossary module with unnecessary information, which would undermine the benefit of providing just the locally relevant terminology with the necessary minimum of metadata.

Moreover, default XLIFF and Glossary Module features are expressive enough to roundtrip all mandatory TBX-Basic data categories. Thus this mapping does not consider the extensibility allowed in the XLIFF Glossary Module and focuses only on the default elements and attributes specified in the XLIFF standard (Comerford, T., Filip, D., Raya, R.M., Savourel, Y., Eds. 2014; Savourel, Y., 2014). A conversion routine between the two files is under development and will be made available at http://www.tbxinfo.net/tbx-downloads/. A simple example of XLIFF module data that has been converted to TBX-Basic using this mapping may be found in Appendix III.

The ref attribute actually points to the exact marker-delimited span of text that contains just the term within a <segment>; so typically the whole enclosing <segment> content will be used as the context content in TBX.

Occasionally a term may span more than one <segment> element. If this happens, something has gone wrong:

1. either the term is not really a term, or
2. wrong segmentation has been applied, or
3. authors have erroneously used structural implements for an ad hoc line break, which caused a correct segmentation rule to break a term erroneously.

Nevertheless, such situations do happen and the mapping needs to have a way to handle them. Thus, when converting an XLIFF term that spans more than one <segment> element, the concatenation of all spanned <segment> elements will be needed as context for TBX.
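The context rule described above can be sketched as follows. This Python example is a simplified illustration, not the conversion routine mentioned in the text: it assumes that ref has the form "#" followed by the id of a marker in the source text, that a term inside one segment is marked with <mrk>, and that a term crossing segments is delimited by an <sm>/<em> pair. The function name context_for, the sample unit, and the single space used to join segments are all assumptions made for the sketch.

```python
import xml.etree.ElementTree as ET

XLIFF = "urn:oasis:names:tc:xliff:document:2.0"
NS = {"x": XLIFF}
MRK, SM, EM = (f"{{{XLIFF}}}{name}" for name in ("mrk", "sm", "em"))

def context_for(unit, ref):
    """Concatenate the source text of every segment the marked term touches."""
    marker_id = ref.lstrip("#")
    parts, open_span = [], False
    for source in unit.findall("x:segment/x:source", NS):
        # Tags of the markers in this segment that belong to the referenced term.
        tags = {
            el.tag for el in source.iter()
            if el.get("id") == marker_id or el.get("startRef") == marker_id
        }
        if SM in tags:
            open_span = True   # the term starts here and may run on
        if MRK in tags or open_span:
            parts.append("".join(source.itertext()))
        if EM in tags:
            open_span = False  # the term ends in this segment
    return " ".join(parts)

# Hypothetical unit: m1 sits inside one segment, m2 spans two segments.
UNIT = f"""
<unit xmlns="{XLIFF}" id="1">
  <segment><source>Press the <mrk id="m1" type="term">TAB key</mrk>.</source></segment>
  <segment><source>Open the <sm id="m2" type="term"/>control</source></segment>
  <segment><source>panel<em startRef="m2"/> now.</source></segment>
</unit>
"""

unit = ET.fromstring(UNIT)
print(context_for(unit, "#m1"))
print(context_for(unit, "#m2"))
```

For m1 the context is the single enclosing segment; for m2 it is the concatenation of the two segments that the term spans.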

Data category | Representation | Description
Context | <descrip type="context"> | comprises a sample sentence to show contextual usage of the term
Created by | <transac type="transactionType">creation</transac> | appears in a <transacGrp> and accompanied by a <transacNote> specifying the creator’s name and date
Creation date | <date> | appears in the <transacGrp> containing <transac type="transactionType">creation</transac>
Cross Reference | <ref type="crossReference" target="element_id"> | points to another entry or term within the same TBX-Basic file
Customer | <admin type="customerSubset"> | identifies terms that may be required for specific customers
Definition | <descrip type="definition"> | defines the concept represented by the terms in the term entry
External cross-reference | <xref type="externalCrossReference" target="external_id"> | points to external reference or explanatory text such as a website link
Figure | <xref type="xGraphic" target="file_location">description of graphic</xref> | reference (URI, URL, or local file path) external to the TBX file; the reference is the target value and the description is the element value
Gender | <termNote type="grammaticalGender"> | indicates grammatical relationships between words in sentences; permissible values: masculine, feminine, neuter, other
Geographical Usage | <termNote type="geographicalUsage"> | indicates geographical area of usage (best implemented as a picklist); should either use ISO 3166 country codes or IETF BCP 47
Last modified by | <transac type="transactionType">modification</transac> | appears in a <transacGrp> and accompanied by a <transacNote> specifying the modifier’s name and date

Table 2 part 1: TBX-Basic data categories and their representations

Data category | Representation | Description
Last modification author | <transacNote type="responsibility" target="person_id">[creator name]</transacNote> | appears in the <transacGrp> containing <transac type="transactionType">modification</transac>; person_id refers to the specific ID given a person in the backmatter
Last modified date | <date> | appears in the <transacGrp> containing <transac type="transactionType">modification</transac>
Note | <note> | any kind of note
Part of Speech | <termNote type="partOfSpeech"> | associated with a category assigned to a word based on its grammatical and semantic properties; permissible values: noun (www.isocat.org/datcat/DC-1333), verb (www.isocat.org/datcat/DC-1424), adjective (www.isocat.org/datcat/DC-1230), adverb (www.isocat.org/datcat/DC-1232), properNoun (www.isocat.org/datcat/DC-384), other (www.isocat.org/datcat/DC-4336)
Project | <admin type="projectSubset"> | identifies terms which may be required for specific jobs/projects
Source of Context | <admin type="source"> | indicates the source of the context sample; should be found in the <descripGrp> containing context
Source of Definition | <admin type="source"> | describes the source of the definition; appears in the <descripGrp> containing definition

Table 2 part 2: TBX-Basic data categories and their representations


Table 2 part 3: TBX-Basic data categories and their representations

Source of Term | <admin type="source"> | indicates the source of the term; appears in the <descripGrp> containing definition

Term Location | <termNote type="termLocation"> | records the location in a user interface where the term occurs, such as <list item> or <button label>

Term Type | <termNote type="termType"> | attribute assigned to a term indicating its form; permissible values: fullForm, acronym, abbreviation, shortForm, variant, phrase

Usage Status | <termNote type="administrativeStatus"> | indicates whether a term is approved for use or not. Permissible values (note they are simplified in TBX-Basic): preferred (www.isocat.org/datcat/DC-72); admitted (www.isocat.org/datcat/DC-73); notRecommended (www.isocat.org/datcat/DC-74); obsolete (www.isocat.org/datcat/DC-75)
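The three "last modified" rows of Table 2 describe elements that sit together in one <transacGrp>. A minimal Python sketch of how such a group might be assembled is given here for illustration. The function name, the person ID "p1", the modifier name, and the date value are hypothetical and do not come from the TBX-Basic specification.

```python
import xml.etree.ElementTree as ET


def modification_group(person_id, modifier_name, iso_date):
    """Build a <transacGrp> recording who last modified an entry and when."""
    group = ET.Element("transacGrp")
    # <transac type="transactionType">modification</transac>
    ET.SubElement(group, "transac", {"type": "transactionType"}).text = "modification"
    # <transacNote type="responsibility" target="..."> points at a person in the back matter
    note = ET.SubElement(
        group, "transacNote", {"type": "responsibility", "target": person_id}
    )
    note.text = modifier_name
    # <date> holds the modification date
    ET.SubElement(group, "date").text = iso_date
    return group


# Hypothetical values, for illustration only.
example_group = modification_group("p1", "Jane Doe", "2015-01-18")
print(ET.tostring(example_group, encoding="unicode"))
```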

XLIFF Elements and Attributes | TBX-Basic | Comment

<glossEntry> | <termEntry> |

<term> | <term> |

<translation> | <term> | <term> belonging to the target language’s <langSet>

<definition> | <descrip type="definition"> |

ref | <descrip type="context"> | see Page 28 Column 2 Paragraph 3

source | <admin type="source"> |

Table 3: XLIFF-TBX Mapping
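The element-level rows of Table 3 can be sketched in Python as follows. This is an illustration under stated assumptions rather than a reference converter. The function names are hypothetical, the placement of the definition inside the source-language <tig> simply follows the sample in Appendix III, and the ref row (context extraction) is left out.

```python
import xml.etree.ElementTree as ET

GLS = "{urn:oasis:names:tc:xliff:glossary:2.0}"  # Glossary Module namespace
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def _add_tig(lang_set, xliff_el):
    """<gls:term>/<gls:translation> -> <tig><term>; source attribute -> <admin type="source">."""
    tig = ET.SubElement(lang_set, "tig")
    ET.SubElement(tig, "term").text = xliff_el.text
    if xliff_el.get("source") is not None:
        ET.SubElement(tig, "admin", {"type": "source"}).text = xliff_el.get("source")
    return tig


def gloss_entry_to_term_entry(gloss_entry, src_lang, trg_lang):
    """Map one <gls:glossEntry> to a <termEntry> along the lines of Table 3."""
    term_entry = ET.Element("termEntry")
    src_set = ET.SubElement(term_entry, "langSet", {XML_LANG: src_lang})
    src_tig = _add_tig(src_set, gloss_entry.find(GLS + "term"))

    # <gls:definition> -> <descrip type="definition"> (plus its source, if any)
    definition = gloss_entry.find(GLS + "definition")
    if definition is not None:
        grp = ET.SubElement(src_tig, "descripGrp")
        ET.SubElement(grp, "descrip", {"type": "definition"}).text = definition.text
        if definition.get("source") is not None:
            ET.SubElement(grp, "admin", {"type": "source"}).text = definition.get("source")

    # Each <gls:translation> -> a <term> in the target language's <langSet>
    translations = gloss_entry.findall(GLS + "translation")
    if translations:
        trg_set = ET.SubElement(term_entry, "langSet", {XML_LANG: trg_lang})
        for translation in translations:
            _add_tig(trg_set, translation)
    return term_entry


sample_entry = ET.fromstring(
    '<gls:glossEntry xmlns:gls="urn:oasis:names:tc:xliff:glossary:2.0" ref="#m1">'
    '<gls:term source="publicTermbase">TAB key</gls:term>'
    '<gls:translation source="myTermbase">TAB-TASTE</gls:translation>'
    '<gls:definition source="publicTermbase">A keyboard key.</gls:definition>'
    '</gls:glossEntry>'
)
print(ET.tostring(gloss_entry_to_term_entry(sample_entry, "en", "de"), encoding="unicode"))
```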


6. Conclusion

When mapping TBX-Basic mandatory data categories to XLIFF core and the Glossary Module, we meet with a fairly straightforward match. There is perhaps one subtlety worth noting. While TBX-Basic requires definition or context information for a concept entry to be valid, the XLIFF Glossary Module requires either a definition or a translation for a valid glossary entry. Hence, when an XLIFF glossary entry is not provided with a definition (as it need not be), the valid TBX-Basic entry needs to extract context information instead. That information, however, is always present in the underlying XLIFF core bitext. Thus the bidirectional mapping is feature complete. If a particular process needs to make use of optional TBX categories, these can always be roundtripped using XLIFF core and Glossary Module extension points. This aspect has not been discussed here, however, except for a brief mention as an option for handling source synonymy.
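The context fallback described above can be sketched as follows. The sketch assumes the short "#id" form of the ref attribute used in the Appendix III sample and only looks at <mrk> annotations. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

XLF = "{urn:oasis:names:tc:xliff:document:2.0}"  # XLIFF 2.0 core namespace


def context_for_ref(unit, ref):
    """Return the text of the <source> or <target> holding the <mrk> that `ref`
    points to, or None if no such marker exists in the unit.

    Assumption: `ref` uses the short "#id" form, as in the Appendix III sample.
    """
    marker_id = ref.lstrip("#")
    for segment in unit.iter(XLF + "segment"):
        for side in (segment.find(XLF + "source"), segment.find(XLF + "target")):
            if side is None:
                continue
            if any(mrk.get("id") == marker_id for mrk in side.iter(XLF + "mrk")):
                # Flatten inline markup; the plain sentence serves as the context sample
                return "".join(side.itertext())
    return None
```

Called with the unit from the Appendix III sample and the ref value "#m1", such a helper would return the source sentence containing the term, which can then populate a <descrip type="context"> element.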

Terminologists and lexicographers have long distinguished between lexicographical resources and terminological resources, asserting that lexicographical entries are word-centred with potentially many associated senses, while terminological entries are concept-centred with potentially many terms (synonyms) and target language equivalents. In contrast to these traditional models, the XLIFF Glossary Module entry documents a single term embedded in the context of the source language component of a bitext and provides the option to link that term to one or more potential target language equivalents (Figure 2). This paper demonstrates that this model is mappable to the TBX interchange model (specifically TBX-Basic) because a single term in a single context comprises one feature complete facet of a concept-oriented terminological entry. This mapping, together with the appropriate utilities, will enable users working in a variety of technical writing and localisation environments to utilize context-grounded terminological information across applications and platforms.

Figure 2: Lexical, Terminological, and XLIFF Structural Model

References

Comerford, T., Filip, D., Raya, R.M., Savourel, Y. (Eds.) (2014) XLIFF Version 2.0 [online], OASIS Standard. Available: http://docs.oasis-open.org/xliff/xliff-core/v2.0/os/xliff-core-v2.0-os.html [accessed 22 Aug 2014].

Cover, R. (2002) Technology Reports: Text Encoding Initiative (TEI). Available at: http://xml.coverpages.org/tei.html [accessed 18 Jan 2015]

Cuddihy, K. (2011) LISA Intellectual Property Turned Over to ETSI. STC Notebook. Available at: http://notebook.stc.org/lisa-intellectual-property-turned-over-to-etsi/ [accessed 17 Jan 2015]

Ethnologue. (2015) Ethnologue Languages of the World: Browse by Language Name, Language Code, Language Family, Map Title. Available at: http://www.ethnologue.com/browse [accessed 20 Jan 2015] Note: all ISO 639 codes are available in the entries for each language.

GALA (2015) LISA OSCAR Standards. Available at: http://www.gala-global.org/lisa-oscar-standards [accessed 18 Jan 2015]

Guillemin, P. (ETSI Secretariat) (2011) Correspondence: In the beginning WhP and ETSI Technical Committee Human Factors... Available at: http://docbox.etsi.org/ISG/Open/ISGLIS/LocWorld-SantaClara/Patrick%20GUILLEMIN%20TEXT%20%20in%20C1%20v2.pdf [accessed 18 Jan 2015]

IATE. (2014) InterActive Terminology for Europe. Available at: http://iate.europa.eu/; http://iate.europa.eu/tbxPageDownload.do [accessed 18 Jan 2015]

IETF BCP 47. (2009) Tags for Identifying Languages. Available at: https://tools.ietf.org/html/bcp47 [accessed 19 Jan 2015]

ISO 639. (Varies) Family of Language Code standards: see Ethnologue.

ISO 3166. (2015) Country Codes Online Browsing Platform (OBP). Available at: https://www.iso.org/obp/ui/#search [accessed 19 Jan 2015]

ISO 12200:1999 Computer applications in terminology – Machine-readable terminology interchange format (MARTIF) – Negotiated interchange. Withdrawn. Geneva: ISO.

ISO 12620. (1999) Computer Applications in Terminology – Data Categories. Geneva: ISO. Withdrawn.

ISO 12620. (2009) Terminology and other language and content resources – Specification of data categories and management of a Data Category Registry for language resources. Geneva: ISO.

ISO 16642. (2003) Computer applications in terminology – Terminological markup framework (TMF). Geneva: ISO.

ISO 30042. (2008) Systems to manage terminology, knowledge and content – TermBase eXchange (TBX). Geneva: ISO.

ISO. (2015) ISOcat Data Category Registry. Available at: http://www.ISOcat.org [accessed 18 Jan 2015]

Kemps-Snijders, M.; Windhouwer, M.; and Wright, S.E. ISOcat: An ISO 12620:2009 Data Category Registry. Available at: http://www.slideshare.net/mwindhouwer/isocat-an-iso-126202009-data-category-registry [accessed 18 Jan 2015]

Lommel, A. (2007) “OSCAR Standards for Localization and Globalization Environments” Available at: http://www.ttt.org/TC37/ISO%20Conference%202007_files/Arle_LISA%20standards.pdf [accessed 17 Jan 2015]

LTAC Global (Language Technology and Authoring Consortium) Available at: http://www.ltacglobal.org/ [accessed 18 Jan 2015]

OASIS. (2014) XLIFF Version 2.0: OASIS Standard. See Comerford et al. above.

SALT. (1998-2002) Standards-based Access to multilingual Lexicons and Terminologies. Available at: https://web.archive.org/web/20090319040215/http://www.loria.fr/projets/SALT/saltsite.html [accessed 17 Jan 2015 via Wayback Machine]

Savourel, Y. (2014) An Introduction to XLIFF 2.0. Multilingual, pp 42-47. Available at: http://dig.multilingual.com/201406/8B19207B6B20FA6ADBAB2612383D9EEF/201406.pdf [accessed 20 Jan 2015]

TBXinfo. Available at: www.tbxinfo.net [accessed 18 Jan 2015]


TBX-Basic. (2015) [See: TerminOrgs, ETSI, tbxinfo.net]

Text Encoding Initiative. (1994) Part 3: Base Tag Sets, 13: Terminological Databases. In: Sperberg-McQueen, C. M., and Burnard, L., Eds. Guidelines for Electronic Text Encoding and Interchange. P3 Revised reprint, Oxford, May 1999. Available at: http://quod.lib.umich.edu/cgi/t/tei/tei-idx?type=HTML&rgn=DIV1&byte=1158058 [accessed 16 Jan 2015]

TerminOrgs (Terminology for Large Organizations) (2014) TBX-Basic Version 3.1; Termbase eXchange (TBX). Available at: http://www.terminorgs.net/downloads/TBX_Basic_Version_3.1.pdf [accessed 18 Jan 2015]

Terminorgs/LISA. (2014) An Archive of Oscar Standards: Termbase eXchange (TBX). Available at: http://www.ttt.org/oscarStandards/tbx/ [accessed 18 Jan 2015]

TermWeb. (2015) http://www.interverbumtech.com/ (See also ISOcat.org above)


Appendix I - TBX-Basic Implementation Guide

This Appendix describes the elements required to create a valid TBX-Basic file. TBX-Basic can be validated using the TBX Checker with the TBX-Basic DTD file (TBXBasiccoreStructV02.dtd) and XCS file (TBXBasicXCSV02.xcs). Each of these items can be found in the TBX-Basic Package at the website: http://www.tbxinfo.net/tbx-downloads/

The prescribed file structure is shown in figure 3:

Other constraints:

• The <back> element is required if internal references in <body> (such as in creator or modifier) point to the ID of a person listed in the back matter. The auxInfo box represents the meta data-category representations such as <descrip>, <descripGrp>, <admin>, <adminGrp>, <xref>, etc.

• One of definition or context is required.

Figure 3: TBX-Basic structure
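The second constraint lends itself to a simple programmatic check. The Python sketch below is an illustration only and is not part of the TBX Checker. It treats the constraint as applying per <termEntry>, and its function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(element):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return element.tag.rsplit("}", 1)[-1]


def has_definition_or_context(term_entry):
    """True if the entry holds at least one <descrip> of type definition or context."""
    return any(
        isinstance(el.tag, str)  # skip comments and processing instructions
        and local_name(el) == "descrip"
        and el.get("type") in ("definition", "context")
        for el in term_entry.iter()
    )


valid_entry = ET.fromstring(
    "<termEntry><langSet><tig><term>TAB key</term>"
    '<descripGrp><descrip type="context">Press the TAB key.</descrip></descripGrp>'
    "</tig></langSet></termEntry>"
)
print(has_definition_or_context(valid_entry))
```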


Appendix II - XLIFF Core + Glossary Module tree and Constraints (see http://docs.oasis-open.org/xliff/xliff-core/v2.0/os/xliff-core-v2.0-os.html)


Figure 4: XLIFF Core structure


Figure 5 - XLIFF Glossary Module structure

Source terms appear within the <source> children of <segment> elements; translated terms appear within their <target> siblings.

In order to reference terms in context from the Glossary Module, the term spans need to be delimited using the XLIFF core term annotation, making use of <mrk> elements or <sm/>/<em/> pairs.
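Both delimiting styles can be read with one small tree walk. The Python sketch below is an illustration under stated assumptions: it relies on the type="term" annotation and on the startRef attribute of <em/> defined by XLIFF 2.0 core, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def term_spans(side):
    """Collect the text of each term annotation in a <source> or <target>.

    Handles both <mrk type="term">...</mrk> and <sm type="term"/>...<em/> pairs.
    Returns a dict mapping marker id to the annotated text.
    """
    spans = {}        # marker id -> annotated text
    open_ids = set()  # ids of term annotations currently open

    def add(text):
        if text:
            for marker_id in open_ids:
                spans[marker_id] += text

    def walk(element):
        add(element.text)
        for child in element:
            name = child.tag.rsplit("}", 1)[-1]
            is_term = child.get("type") == "term"
            if name == "mrk" and is_term:
                spans[child.get("id")] = ""
                open_ids.add(child.get("id"))
                walk(child)
                open_ids.discard(child.get("id"))
            elif name == "sm" and is_term:
                spans[child.get("id")] = ""
                open_ids.add(child.get("id"))
            elif name == "em":
                open_ids.discard(child.get("startRef"))
            else:
                walk(child)
            add(child.tail)

    walk(side)
    return spans


NS = 'xmlns="urn:oasis:names:tc:xliff:document:2.0"'
with_mrk = ET.fromstring(
    f'<source {NS}>Press the <mrk id="m1" type="term">TAB key</mrk>.</source>'
)
with_sm_em = ET.fromstring(
    f'<source {NS}>Press the <sm id="m1" type="term"/>TAB key<em startRef="m1"/>.</source>'
)
print(term_spans(with_mrk), term_spans(with_sm_em))
```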

The <glossary> wrapper is allowed at each <unit> element. The Glossary Module structure is shown in Figure 5.

Constraints

• A <glossEntry> element MUST contain a <translation> or a <definition> element to be valid.
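This constraint can be checked mechanically. The following Python sketch is a minimal illustration with a hypothetical function name. It checks only this one constraint and does not replace validation against the XLIFF 2.0 schemas.

```python
import xml.etree.ElementTree as ET

GLS = "{urn:oasis:names:tc:xliff:glossary:2.0}"  # Glossary Module namespace


def is_valid_gloss_entry(gloss_entry):
    """True if the <gls:glossEntry> has a <gls:translation> or a <gls:definition> child."""
    return (
        gloss_entry.find(GLS + "translation") is not None
        or gloss_entry.find(GLS + "definition") is not None
    )


term_only = ET.fromstring(
    '<gls:glossEntry xmlns:gls="urn:oasis:names:tc:xliff:glossary:2.0">'
    "<gls:term>TAB key</gls:term></gls:glossEntry>"
)
print(is_valid_gloss_entry(term_only))
```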


Appendix III - XLIFF Glossary Module to TBX-Basic sample conversion

These files can be downloaded at: http://www.tbxinfo.net/tbx-downloads/

XLIFF File

<?xml version="1.0" encoding="UTF-8"?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="en" trgLang="de"
       xmlns:gls="urn:oasis:names:tc:xliff:glossary:2.0">
  <file id="f1">
    <unit id="1">
      <gls:glossary>
        <gls:glossEntry ref="#m1">
          <gls:term source="publicTermbase">TAB key</gls:term>
          <gls:translation id="1" source="myTermbase">Tabstopptaste</gls:translation>
          <gls:translation ref="#m2" source="myTermbase">TAB-TASTE</gls:translation>
          <gls:definition source="publicTermbase">A keyboard key that is traditionally used to insert tab characters into a document.</gls:definition>
        </gls:glossEntry>
      </gls:glossary>
      <segment>
        <source>Press the <mrk id="m1" type="term">TAB key</mrk>.</source>
        <target>Drücken Sie die <mrk id="m2" type="term">TAB-TASTE</mrk>.</target>
      </segment>
    </unit>
  </file>
</xliff>


TBX File

<?xml version='1.0'?>
<!DOCTYPE martif SYSTEM "TBXBasiccoreStructV02.dtd">
<!-- THIS FILE MAKES USE OF THE TBX NAMESPACE -->
<martif type="TBX-Basic" xml:lang="en-US" xmlns="iso.org/ns/tbx/2016">
  <martifHeader>
    <fileDesc>
      <titleStmt>
        <title>XLIFF 2.0 Glossary Module to TBX-Basic Demonstration</title>
      </titleStmt>
      <sourceDesc>
        <p>This is a demonstration of a potential mapping from the glossary module of XLIFF 2.0 to TBX-Basic.</p>
      </sourceDesc>
    </fileDesc>
    <encodingDesc>
      <p type="XCSURI">TBXBasicXCSV02.xcs</p>
    </encodingDesc>
  </martifHeader>
  <text>
    <body>
      <termEntry>
        <langSet xml:lang="en">
          <tig>
            <term>TAB key</term>
            <admin type='source'>publicTermbase</admin>
            <descripGrp>
              <descrip type='definition'>A keyboard key that is traditionally used to insert tab characters into a document.</descrip>
              <admin type='source'>publicTermbase</admin>
            </descripGrp>
            <descripGrp>
              <!-- Here the segments were pulled from <segment> and used as data for an 'example' -->
              <descrip type='context'>Press the TAB key.</descrip>
            </descripGrp>
          </tig>
        </langSet>
        <langSet xml:lang="de">
          <tig>
            <term>Tabstopptaste</term>
            <admin type='source'>myTermbase</admin>
          </tig>
          <tig>
            <term>TAB-TASTE</term>
            <admin type='source'>myTermbase</admin>
            <descripGrp>
              <!-- Here the segments were pulled from <segment> and used as data for an 'example' -->
              <descrip type='context'>Drücken Sie die TAB-TASTE.</descrip>
            </descripGrp>
          </tig>
        </langSet>
      </termEntry>
    </body>
  </text>
</martif>