
Localisation Focus, The International Journal of Localisation, Vol. 8 Issue 1

Micro Crowdsourcing: A new Model for Software Localisation

Chris Exton[1], Asanka Wasala[2], Jim Buckley[1], Reinhard Schäler[1,2]

[1] Centre for Next Generation Localisation, [2] Localisation Research Centre,

University of Limerick
www.cngl.ie

[email protected]; [email protected]; [email protected]; [email protected]

Abstract

One obvious flaw in the concept of the knowledge society is our collective failure to date to provide equal access to information and knowledge across languages. We are a long way away from the ideal world, where, as Muhammad Yunus, winner of the 2006 Nobel Peace Prize said, there would only be one language in the information technology (IT) world - your own (Yunus 2007). While the US$16b mainstream localisation industry likes to see itself as the vehicle that is removing this barrier to universal access to digital knowledge and information (i.e. language), in reality it is making limited impact on the widening gap between the information rich and the information poor.

Crowdsourcing has been described as an approach to address the shortcomings of current mainstream localisation, allowing the localisation decision to be shifted from large corporations to service users, thus making IT available in more languages. This paper proposes and describes a new model of crowdsourcing which may provide a platform by which the "equal access to information and knowledge" might be achieved.

Keywords: localisation, digital divide, micro crowdsourcing, real time localisation

Introduction

Current software localisation efforts are largely driven by the economic imperative of short-term return on investment. The localisation decision, i.e. the decision on the languages and locales to be covered by the mainstream localisation effort, is almost exclusively determined by the size of the market a language or a locale represents. Therefore, software is translated for the approximately five million speakers of Danish, but not for the 27 million speakers of Amharic, the national language of Ethiopia. However, there is also a long-term return on investment issue to be considered for commercial organizations. Digital publishers have recognised that, without pursuing deployment in large but currently unviable locales, their product will have limited exposure that may jeopardise future, larger market gains when those locales develop economically.

In addition, the social effects of this short-term economic imperative are grave. Access to information technology is restricted to those speaking the languages of the global north while excluding those speaking the languages of the global south. The majority of people living on this planet cannot share their knowledge in the digital world, nor do they have access to existing knowledge. Organisations engaged in localisation activities for social, cultural, political or developmental rather than primarily commercial reasons see crowdsourcing as a mechanism to connect with their communities at reduced cost.

The concept of crowdsourcing was first described by Jeff Howe in his now famous article in Wired Magazine in 2006 (Howe 2006). He described it as the harnessing of a community/group of people to perform a task traditionally undertaken by employees. The term has been taken up enthusiastically and now returns almost 8.2 million hits in Google's search engine. It has featured as a major topic at recent, seminal localisation-industry events, such as Localization World 2009 and LRC XIII. This is because both commercial and altruistic organizations see it as a mechanism to lower the cost of localisation, enabling them to enter currently inaccessible locales (Rickard 2009). In addition, Losse (2008), in her keynote at the LRC conference, stated that organizations like Facebook pursued crowdsourcing because it also produced higher quality translations.

Consequently, attempts by companies such as Microsoft, Facebook and Google to create crowdsourcing frameworks that allow volunteer translators and localisers to translate digital content into marginally commercial languages are seen by the industry as having delivered very promising results.

While commentators seem to agree that the main issues around crowdsourcing in the localisation space are control, quality and motivation, there is still a lack of comprehensive studies on any of these issues. In addition, another central issue of crowdsourced localisation, i.e. the need for the localisation decision to be shifted from multinational corporations to the user (Howe 2006), has only been raised on occasion and not as a central pre-requisite for the success of any crowdsourced localisation effort (Schäler 2008).

Based on these issues, this paper considers how the practices and experience from the open source and Web 2.0 communities could provide a path for software localisation to make Muhammad Yunus' dream a reality.

1.1 The Cathedral and the Bazaar

"The Cathedral and the Bazaar" is an essay by Eric S. Raymond on software engineering methods (Raymond 1999), based on his observations of the Linux kernel development process and his experiences managing an open source project. In it he describes how the traditional software development paradigm could be viewed as hierarchical and tightly planned; Raymond likens this view to a Cathedral which is monolithic and obviously architected by some controlling authority. The open source development paradigm however, he continues, could better be likened to a market or bazaar, where it is obvious that industry of some kind is occurring but there seems to be no or little central authority or control.

Traditionally, mainstream software localisation has been tightly controlled by multinational corporations. They strictly managed everything from the localisation decision itself to the selection of an appropriate localisation process, the use of certain terminology and translation memories, and the deployment of adequate tools and technologies.

This model is not unlike the Cathedral model described by Raymond, with its central control and tight management through a number of levels of activity and quality control, to ensure a suitable and tested final product. Indeed, the type of software often used to support and control the activity of localisation (essentially customised project-management systems such as Idiom Worldserver) provides strong evidence of the approach chosen by mainstream localisation. However, such a 'Cathedral' model brings with it the implicit need for large coordination efforts and subsequently high costs.

The 'Bazaar' model, in turn, is associated with the open source community, and requires less controlling authority. Work is somehow carried out in an almost chaotic, community-driven manner. This has provided the business and technical communities with a suite of software, including Linux (Raymond 2001), Apache (Mockus et al. 2000) and OpenOffice (Feller and Fitzgerald 2002), upon which many companies rely heavily today. These systems are proof that a community-driven, open source model can also deliver quality software systems.

Indeed, the open source community model may provide a paradigm to address some of the problems faced by the localisation community in relation to their desire to expand into underserved markets and to break down the digital divide. However, many initiatives addressing underserved markets today, such as the ones initiated by Facebook, Microsoft or Symantec, still have a central authority driving the localisation effort, rather than being bottom-up and community-driven.

This paper proposes an alternative approach to the idea of crowdsourcing in relation to the translation of software systems. In this model, individual users translate elements of a system and its documentation as they use them in return for free access to these artifacts. Periodically, the elements of the system and documentation translated by the individual translators are gathered centrally and aggregated into an integral translation of all, or parts of, the system.




2. Approach to Micro-Crowdsourcing

Consider a software package such as Open Office that has been developed for a purely English speaking audience. Even if this product were designed to facilitate its easy adaptation into other languages it would still require the effort of either a number of altruistic individuals or the coordinated effort of expensive professional localisers to make this product available in another language.

Where there is no/limited immediate economic imperative for the digital content publishers, such as in the case of open source software or voluntary organisations aiming to bridge the digital divide, one solution might be the automatic translation of content into non-commercially viable languages. Although this option might be preferable to simply ignoring these languages, automatic translation is not yet at a stage where such a product could be released with confidence.

Imagine, however, developing a software application such as Open Office that allowed a community of users to update the user interface in situ, either directly from the original English version or perhaps working from a less than perfect machine translation. The update could be enabled via a simple popup micro localisation editor that would allow them to change UI text in situ simply by ctrl-clicking on any text that is displayed.

This editor may have to enforce constraints on the translation, such as restricting string length, and could perhaps include appropriate translation memories and standards to assist in the translation. As a ctrl-click could be applied to any displayable text area, error messages and help information messages could also be included as translatable material. Indeed, the editor may even go as far as allowing graphical replacement of certain artefacts.

The result would be a set of textual (and possibly graphical) updates for each user. Then suppose that each update-set could be automatically gathered in a central repository that would, in turn, push update events back to the community of users, periodically or on-demand. This would update their product with the latest translations. Imagine that these users could, in turn, quality assure the updates and re-instigate the cycle, in the same way that Web 2.0 communities like Wikipedia reach consensus by iterative refinement.

This would represent a radically novel approach to localisation, requiring a novel architecture, as illustrated in figure 1. On the far left, we see a central server that receives and sends updates to and from individual deployments of a software package called 'Writer', one of which (to the left of the figure) is substantially expanded.

Figure 1: The UpLoD-Based Architecture for Micro Crowd-Sourced System Translation

As can be seen from the 'expanded Writer', the three tiers of the application are augmented by an 'Update-Log-Daemon' (UpLoD) module. This UpLoD module allows the user to update the user interfaces as they use the system and logs the changes in a local audit file. The records in this local audit file contain unique identifiers for the GUI elements that have been changed. Each identifier is associated with the pre-translation and post-translation text of its element. Periodically, a Daemon trawls the audit log and, on finding new records, passes updates to the central repository on the server.
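The update, log and daemon steps described above can be sketched roughly as follows. This is a minimal illustration, not the prototype's code: the `AuditRecord` fields, the `sent` flag, the JSON-lines file layout and the `send` callback (standing in for the transfer to the central repository) are all assumptions made for the example.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import Callable


@dataclass
class AuditRecord:
    guid: str           # unique identifier of the changed GUI element
    source: str         # pre-translation text
    target: str         # post-translation text
    sent: bool = False  # hypothetical flag: already forwarded to the server?


def log_change(log_path: Path, record: AuditRecord) -> None:
    """Append one UI change to the local audit file (one JSON object per line)."""
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(record), ensure_ascii=False) + "\n")


def trawl_once(log_path: Path, send: Callable[[AuditRecord], None]) -> int:
    """One daemon pass: forward every unsent record, then mark it as sent."""
    if not log_path.exists():
        return 0
    lines = log_path.read_text(encoding="utf-8").splitlines()
    records = [AuditRecord(**json.loads(line)) for line in lines if line.strip()]
    forwarded = 0
    for rec in records:
        if not rec.sent:
            send(rec)  # assumed to upload the record to the central repository
            rec.sent = True
            forwarded += 1
    # Rewrite the log so that the next pass skips the forwarded records.
    log_path.write_text(
        "".join(json.dumps(asdict(r), ensure_ascii=False) + "\n" for r in records),
        encoding="utf-8",
    )
    return forwarded
```

A real daemon would call `trawl_once` on a timer and would need to cope with failed uploads; both are left out here to keep the sketch short.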

These updates can be handled in a number of ways. For example, in publicly edited wikis, revision control enables a human editor to revert a change to its previous version. For a "Micro Crowdsourcing" system it is possible to consider that there might be a limited number of trusted (self-moderating) editors for a specific language group to tidy up the localisation in this fashion. A version control system would then enable editors to build a release package on a periodic basis based on the influx of micro changes from standard users. This would serve to drastically decrease the number of changes and updates to the UI and avoid updates with new translations on an ongoing basis.

For a more automated approach, the server might periodically analyse the update sets of all users and, based on an aggregate consensus, recommend the changes to be made to the other versions of 'Writer'. These changes are captured by the UpLoD module, which updates the GUI correspondingly.

Another open source development concept which could be adapted to suit the "Micro Crowdsourcing" model is that of distributed revision control. Distributed revision control is built on a peer-to-peer approach, unlike the centralised client-server approach classically used by software versioning systems such as CVS. In a distributed revision control system each peer maintains a complete working copy of the codebase. Synchronization is conducted by exchanging patches (change-sets) from peer to peer; a more in-depth discussion of the process is given by Gift and Shand (2009).

Regardless of the version control system that will be used, the translation is carried out in an incremental, ad-hoc manner by a community of (not necessarily experienced) "translators", each of whom would double as a proof reader for each other's work.

Once we allow all registered end users to become translators or localisers, we spread the workload over a large user base. The limiting factors would be the number of bilingual speakers with access to computers and internet connectivity. However, even this limiting factor could be overcome by offering monoglot users, familiar with the software, access to suitable translation aids including machine translation, translation memories and terminology databases.

To a large degree, a similar model already exists in the Wikipedia community where content may be added and amended by any registered user. Quality and precision, issues discussed as highly problematic in the context of crowdsourced localisation, are in this case simply promoted by the fact that any reader of Wikipedia can register and thereby correct or update any particular entry. This phenomenon can be likened to the "many eyes" principle associated with open source. This principle, attributed to Linus Torvalds (Raymond 1999), states: "Given a large enough beta-tester and co-developer base, almost every problem will be characterized quickly and the fix will be obvious to someone." It simply describes the notion that, since open source code can be viewed and potentially changed by anyone who cares to look at it, the number of bugs that are caught and fixed increases dramatically compared to proprietary-driven development. Likewise, it is envisaged that this "many-eyes" characteristic of the UpLoD architecture will promote an increasingly stable, high quality, and locale-specific application over time as users are empowered to become part of the localisation process.

In addition, the software is translated for free by volunteers, provided that the digital publisher is willing to deploy its un-translated or automatically pre-translated version to registered volunteers in each locale. In open source scenarios and scenarios where the aim is to bridge the digital divide, this will not be a concern. However, even in commercial scenarios the corporation may consider the exposure gained from having a localised version to be worth the loss of potential license revenue from registered volunteers. This would be particularly true in the case of emerging markets where the possibility of sales might be low at the moment, but where early exposure to localised systems could lead to commercial opportunities later.




3. Proof of Concept Implementation

A proof of concept prototype of this architecture was created to validate and refine this approach. The prototype consists of two components: the central server component and a simple RTL (Real Time Logging) Notepad application which imitates the "Writer" of Figure 1. The UpLoD module was implemented and integrated in the RTL Notepad application in addition to its generic text editing functions. Due to its simplicity and portability the Portable Object (PO) file format was chosen as the format for the local audit file.
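To make the choice of the PO format concrete, the following sketch shows how one audit record might be serialised as a PO entry. The paper does not say how the prototype lays out its entries; storing the GUI element identifier in the standard `msgctxt` field is an assumption of this example, as are the function names.

```python
def po_escape(text: str) -> str:
    """Escape backslashes, double quotes and newlines for a PO string literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def po_entry(guid: str, source: str, target: str) -> str:
    """Render one audit record as a PO entry.

    Assumption: the element identifier goes into msgctxt, the
    pre-translation text into msgid and the translation into msgstr.
    """
    return (
        f'msgctxt "{po_escape(guid)}"\n'
        f'msgid "{po_escape(source)}"\n'
        f'msgstr "{po_escape(target)}"\n'
    )


print(po_entry("menu.file.open", "Open", "Öffnen"))
```

Because PO is a widely supported format, a file built from such entries could in principle be opened in existing translation tools, which may be part of the portability the authors refer to.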

In the RTL Notepad, simply right-clicking on text inside any UI element brings up a context menu where users can enter into a 'localisation dialog' (see figure 2). From this window, a user can translate the selected UI strings. The changes are reflected in the UI in real-time. Options have been provided to users in the localisation dialog for the online transmission of their changes to the central server or for the offline saving of the translation to the local audit file for later batch transmission to the server.

In this prototype, we propose an automated translation voting mechanism to ensure the quality of the translation. For this purpose, the server maintains a database containing translations and the number of votes for each translation (i.e. the number of users who have suggested that translation), for each language.

In the following sections, the main phases of the localisation process associated with this architecture are explained in more detail. For illustration purposes, screenshots of the RTL Notepad in English and its translated version in Sinhala are given in Figure 3 and Figure 4.

3.1 Initialisation

The RTL Notepad application can be configured to update its UI by connecting to the central server or by reading from its local audit file, i.e. in line with the changes made by its immediate user. If the Notepad is configured to connect to the central server, it will retrieve the translations from the server and will update its user interface accordingly. This may overwrite customisations already made by the user, who should therefore be alerted to this possibility before agreeing to the update, or be alerted on a (UI) string-by-string basis.

Figure 2: The Localisation Dialog of the RTL Notepad application

In the first scenario, i.e. when the RTL Notepad is configured to update its UI using the information obtained from the server, the RTL Notepad will send an HTTP request to the central server stating the user's language, as configured in the RTL Notepad application. Then, the server will send an XML response which contains a list of source-target translation units for all the UI elements. This process is illustrated in figure 5. The server will generate the XML by choosing the translation with the highest number of votes, for each UI string. The RTL Notepad will process the XML response and update its UI. An illustrative XML response is given in figure 6 (where the GUID tag uniquely identifies the UI element).
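The server-side step of picking the top-voted translation for each UI string and wrapping it in an XML response might look like the sketch below. Only the `GUID` tag is taken from the paper; the other element names (`translations`, `unit`, `source`, `target`), the `lang` attribute and the in-memory vote table are hypothetical choices for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# votes[guid][translation] = number of users who suggested that translation
Votes = Dict[str, Dict[str, int]]


def build_response(language: str, sources: Dict[str, str], votes: Votes) -> str:
    """Build an XML list of source-target units using the top-voted target."""
    root = ET.Element("translations", {"lang": language})
    for guid, source in sources.items():
        candidates = votes.get(guid, {})
        if not candidates:
            continue  # no suggestion yet: the client keeps the source string
        # Highest vote count wins; ties fall to the first candidate stored.
        best = max(candidates, key=candidates.get)
        unit = ET.SubElement(root, "unit")
        ET.SubElement(unit, "GUID").text = guid
        ET.SubElement(unit, "source").text = source
        ET.SubElement(unit, "target").text = best
    return ET.tostring(root, encoding="unicode")


print(build_response(
    "de",
    {"menu.file.open": "Open"},
    {"menu.file.open": {"Öffnen": 3, "Aufmachen": 1}},
))
```

Note that `max` silently picks one candidate when votes are equal, so a production server would need explicit handling of ties.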

There will be situations where different translations for the same UI string have the same number of votes. To handle such potentially 'thrashing-like' scenarios, the RTL Notepad will show a special dialog, allowing users to choose their preferred translation during start-up. In order to minimise such translation conflicts in the future, the user preferences are sent back to the server so that the server will increase the votes for the relevant translations. They will also be stored locally so that the user does not have to repeat his/her choice.
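A client-side sketch of this tie handling is given below, under stated assumptions: `ask_user` stands in for the special start-up dialog, `local_prefs` for the locally stored choices, and sending the choice back to the server is left to the caller. None of these names come from the prototype.

```python
from typing import Callable, Dict, List, Optional


def tied_leaders(candidates: Dict[str, int]) -> List[str]:
    """Return every translation that shares the highest vote count."""
    if not candidates:
        return []
    top = max(candidates.values())
    return sorted(text for text, votes in candidates.items() if votes == top)


def resolve(
    guid: str,
    candidates: Dict[str, int],
    local_prefs: Dict[str, str],
    ask_user: Callable[[str, List[str]], str],
) -> Optional[str]:
    """Pick the translation to display for one UI string."""
    leaders = tied_leaders(candidates)
    if not leaders:
        return None
    if len(leaders) == 1:
        return leaders[0]  # clear winner, no dialog needed
    remembered = local_prefs.get(guid)
    if remembered in leaders:
        return remembered  # the user has already settled this tie
    choice = ask_user(guid, leaders)  # hypothetical stand-in for the dialog
    local_prefs[guid] = choice  # stored locally so the question is not repeated
    return choice
```

After `resolve` returns a freshly chosen translation, the caller would submit it to the server as an extra vote, which is what eventually breaks the tie for other users.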


Figure 3: The RTL Notepad application in English

Figure 4: The RTL Notepad application in Sinhala



Figure 5: UI String Translations Request and Response Process

Figure 6: Typical Server XML Response

3.2 Translation Submission Process

Users can submit translations of UI strings to the central server through the localisation dialog of the RTL Notepad. User submissions will be directed to the central server as HTTP requests. Upon submission, the central server will query its database to see whether the submitted translation already exists. If so, the server will increase its votes. Otherwise, the translation will be added to the relevant language table, initialising its number of votes to one.
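The look-up-then-vote logic of the submission process can be sketched with an in-memory SQLite database. The schema is an assumption: for brevity it uses a single `translations` table with a `language` column rather than one table per language as described in the text, and the HTTP layer is omitted entirely.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory store; table and column names are hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE translations ("
        " language TEXT, guid TEXT, target TEXT, votes INTEGER,"
        " PRIMARY KEY (language, guid, target))"
    )
    return conn


def submit(conn: sqlite3.Connection, language: str, guid: str, target: str) -> int:
    """Record one submission and return the translation's new vote count."""
    key = (language, guid, target)
    row = conn.execute(
        "SELECT votes FROM translations"
        " WHERE language = ? AND guid = ? AND target = ?",
        key,
    ).fetchone()
    if row is None:
        votes = 1  # first time this translation is seen
        conn.execute("INSERT INTO translations VALUES (?, ?, ?, ?)", key + (votes,))
    else:
        votes = row[0] + 1  # existing translation: one more vote
        conn.execute(
            "UPDATE translations SET votes = ?"
            " WHERE language = ? AND guid = ? AND target = ?",
            (votes,) + key,
        )
    conn.commit()
    return votes
```

For example, two users submitting the same string for the same element would leave one row with two votes, while a differing suggestion would create a second row with one vote.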

3.3 Community Suggestions Retrieval Process

In the localisation dialog of RTL Notepad, an option is given to retrieve the suggestions of the user community. Once the 'View other suggestions by community' option is selected in the RTL Notepad, it will send an HTTP request to the server asking for community translations for the selected UI string. The server will send an XML response to the client RTL Notepad application containing all the suggestions for the given UI string for a given language. The UpLoD module in RTL Notepad will process this XML and list these suggestions in its localisation dialog, in the order of number of votes received, as illustrated in figure 2. The users then can choose their preferred suggestion to be used in the GUI of that version of RTL Notepad. Once a user selects a suggestion, the suggestion will be sent back to the central server as an HTTP request. The server will then increase the votes of the given suggestion.
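On the client side, processing the suggestions response and ordering it by votes could look like the following sketch. The XML layout shown (a `suggestions` root with `suggestion` elements carrying a `votes` attribute) is invented for the example; the paper does not specify the format of this response.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def ranked_suggestions(xml_text: str) -> List[Tuple[str, int]]:
    """Parse a suggestions response and order it by votes, highest first."""
    root = ET.fromstring(xml_text)
    pairs = [
        (el.text or "", int(el.get("votes", "0")))
        for el in root.iter("suggestion")
    ]
    # Sort by descending votes; equal counts are ordered alphabetically.
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical response for one UI string.
sample = (
    '<suggestions guid="menu.file.open" lang="de">'
    '<suggestion votes="1">Aufmachen</suggestion>'
    '<suggestion votes="3">Öffnen</suggestion>'
    "</suggestions>"
)
for text, votes in ranked_suggestions(sample):
    print(votes, text)
```

Sorting on the client keeps the sketch independent of whatever order the server happens to return.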

4. Outstanding Challenges

Of course, localising software is not as simple as portrayed in this prototype. Not only does text have to be changed: holding boxes have to be resized, and images may have to be replaced, for example. However, if tools that facilitate localisation were incorporated into an UpLoD-type architecture, it would not be unreasonable to expect that the need for these changes could be covered satisfactorily. After all, such changes are currently performed in existing localisation efforts. The model proposed here suggests piggy-backing such functionality into the "Update" component of the system deployed.

There is also the possibility that volunteer translators would focus their efforts on only a small proportion of the user interface. This proposition is based on Pareto's Principle (Bookstein 1990) which, to paraphrase for this context, suggests that most users of large applications will only use a small proportion of its functionality. If translators choose to translate as they use, or choose to do the translations that others will see, rather than translating holistically, it is likely that translation coverage will be patchy and will result in a 'pidgin' system made up of translated 'frequently-used' facilities and untranslated 'infrequently-used' facilities. This may prove sufficient for the majority of users, but runs the risk of frustrating users who have more demanding requirements. However, frustration can become a motivating factor if the user is empowered to subsequently change the associated UI strings.

Another potential challenge is that the voting mechanism proposed may prove insufficient and ineffective; specifically, there is the possibility of 'thrashing', where two individual translators, or groups of translators, have very strong and conflicting ideas about the translation required for specific GUI elements. In such instances, the 'Analysis Engine' of the central server would need to intervene, analysing the central logs, deriving the appropriate translation, possibly with human intervention, and locking future changes.

Indeed, we see this 'thrashing' problem as being one of the main issues with this approach. Imagine, as a user, you customise the interface and then send your changes to the server. Imagine then, retrieving the server-side customisations and finding that very few of your changes had survived. This is a micro-form of thrashing that would probably be prevalent, particularly if there were a large number of users customising the interface - a measure of the approach's success. Such negative feedback might discourage the user from making further changes to the interface and result in a fall off in localisation activity over time. Indeed, it might discourage them from using the application itself, as the interface they strove to create has been destroyed by the server-side customisations. Hence, as mentioned in section 3.1, we see a strong role for 'change alerts' and the option to opt out of server-side customisations as core for the users of this approach.

The voting mechanism currently implemented in the prototype takes no account of user quality, an attribute that could easily be calculated from the available data (a simple measure could be the percentage of each user's suggestions that equate to the customisations with the highest vote). This additional information could be used to resolve ties, where equal numbers of votes were obtained for two or more different translations, to resolve thrashing or, more generally, as a weighting on the votes.

It may be that user submissions might have to be reviewed by human experts (preferably by a pool of linguists) prior to committing to the server's database. This additional step would ensure the quality of the translations to be used in UI elements in terms of criteria such as relevancy, accuracy, suitability, and consistency. However, larger scale deployments, where a bigger community is involved, may well counteract this potential issue.
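The user-quality measure suggested above, and its use as a weighting on votes, can be sketched as follows. This is one possible reading rather than a specified algorithm: quality is expressed as a fraction between 0 and 1, strings whose top vote is itself tied are excluded from the quality calculation (an assumption of this sketch), and all function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Submission = Tuple[str, str, str]  # (user, guid, translation)


def clear_winners(subs: List[Submission]) -> Dict[str, str]:
    """Map each UI string to its top-voted translation, skipping tied strings."""
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for _, guid, text in subs:
        counts[guid][text] += 1
    result = {}
    for guid, per_text in counts.items():
        top = max(per_text.values())
        leaders = [t for t, v in per_text.items() if v == top]
        if len(leaders) == 1:
            result[guid] = leaders[0]
    return result


def user_quality(subs: List[Submission]) -> Dict[str, float]:
    """Fraction of each user's suggestions that match the top-voted one."""
    best = clear_winners(subs)
    hits: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for user, guid, text in subs:
        if guid in best:  # tied strings carry no quality signal here
            total[user] += 1
            hits[user] += int(text == best[guid])
    return {user: hits[user] / total[user] for user in total}


def weighted_votes(subs: List[Submission], guid: str) -> Dict[str, float]:
    """Score each candidate for one string by summing its voters' quality."""
    quality = user_quality(subs)
    scores: Dict[str, float] = defaultdict(float)
    for user, g, text in subs:
        if g == guid:
            scores[text] += quality.get(user, 0.0)  # unknown users weigh 0
    return dict(scores)
```

With such a weighting, a one-to-one tie between two users would be resolved in favour of the user whose earlier suggestions more often matched the community consensus.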

It is worth mentioning the programming difficulties that may be encountered when developing UpLoD-architecture-based applications. The development of the prototype revealed that some UI widgets and built-in UI components, such as the file open dialog and printer dialog, provided by several programming languages are tightly integrated with the underlying platform and hence cannot be modified to incorporate real-time localisable features.

Notwithstanding, UpLoD-architecture-based applications are easy to develop using programming languages that support object oriented programming (OOP), especially if their UI development components are loosely coupled with the operating systems and the rest of the system's components. Indeed, ideally, the "real-time localisability" should find support within the operating systems and the programming languages themselves.

However, as long as this is not the case, it has to be acknowledged that there is an overhead associated with the UpLoD architecture that adds to the expense of the initial development. Ideally, therefore, this architecture should be as pluggable-and-playable in nature as possible, and this is seen as an area of future research for the group.

This overhead may be increased in a large-scale UpLoD system. For example, if the deployment were wide enough, server farms might have to be designed for load balancing as well as efficient processing of client requests.

It would also be interesting for future research to investigate the possibility of using the UpLoD architecture for the localisation of existing applications. One possibility is to develop a daemon or Windows service that would facilitate this. The daemon or the service could display translations as tooltips whenever the mouse is hovered over the UI strings of the existing application.

Future work will include the development of a suitable light-weight localisation model that includes an appropriate container that could facilitate a new and ongoing micro versioning capability. To accompany this, a micro versioning workflow model would have to be developed that could facilitate and address many of the features described throughout this paper, for example a 24-hour micro update capability that could cover 100+ languages.

Appropriate techniques for the development and maintenance of an associated translation history would also be a major objective. It is envisaged that these issues will be worked through, by the development of a series of prototypes for a selected sample open source application and associated user trials. This iterative design approach will then serve to inform on the overheads required to implement the UpLoD architecture and drive development of the associated tools and facilities required to optimise this approach.

References

Bookstein, A. (1990). Informetric distributions, part I: Unified overview. Journal of the American Society for Information Science 41: 368-375.

Feller, J. and Fitzgerald, B. (2002). Understanding Open Source Software Development. Addison-Wesley Longman Publishing Co., Inc.

Gift, Noah and Shand, Adam (2009). Introduction to distributed version control systems. IBM DeveloperWorks, 07 Apr 2009. https://www.ibm.com/developerworks/aix/library/au-dist_ver_control/

Howe, Jeff (2006). The Rise of Crowdsourcing. Wired Magazine, June 2006. http://www.wired.com/wired/archive/14.06/crowds.html. Retrieved 22 November 2009.

Localization World Berlin 2009. http://www.localizationworld.com/lwber2009/about.php (last accessed 21 November 2009).

Losse, Kate (2008). Keynote at the 2008 LRC XIII Conference. Localisation4All. Dublin, Ireland, 02-03 October 2008. http://www.localisation.ie/resources/conferences/2008/keynote.htm#kate (last accessed 22 November 2009).

LRC XIV. Localisation in the Cloud. 2009 LRC Conference, Limerick, Ireland, 24-25 September 2009. http://www.localisation.ie/resources/conferences/2009/programme.htm (last accessed 21 November 2009).

Mockus, A., Fielding, R. T., and Herbsleb, J. (2000). A case study of open source software development: the Apache server. In Proceedings of the 22nd International Conference on Software Engineering (Limerick, Ireland, June 04 - 11, 2000). ICSE '00. ACM, New York, NY, 263-272. DOI=http://doi.acm.org/10.1145/337180.337209

Raymond, E. S. (1999). The Cathedral & the Bazaar. O'Reilly. ISBN 1-56592-724-9. http://www.catb.org/~esr/writings/cathedral-bazaar/cathedral-bazaar/.

Raymond, E. S. (2001). The Cathedral and the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary. O'Reilly & Associates, Inc.

Rickard, Jason (2009). Translation in the Community. 2009 LRC XIV Conference. Localisation in the Cloud, Limerick, Ireland, 24-25 September 2009. http://www.localisation.ie/resources/conferences/2009/presentations/LRC_L10N_in_the_Cloud.pdf (last accessed 21 November 2009).

Schäler, Reinhard (2008). Localisation4all: Shifting the Mainstream Localization Paradigm. Localization World Conference, Berlin, 2008. http://www.localizationworld.com/lwber2009/programDescription.php#P6 (last accessed 21 November 2009).

Yunus, Muhammad and Weber, Karl (2007). Creating a world without poverty: social business and the future of capitalism. PublicAffairs, 2007.


