CANOLFAN BEDWYR E-Welsh Unit Tender for Machine Translation to the Welsh Language Board 31.05.05


CANOLFAN BEDWYR

E-Welsh Unit

Tender for Machine Translation to the Welsh Language Board

31.05.05


Table of Contents

Introduction
    Background
    Partners
    Welsh Language Resources at Canolfan Bedwyr
    Method of Working

Implementation of Requirements
    1. Building a One Million Word Parallel Text Corpus
    2. Preparing a Customized Bilingual Dictionary for the MT Project
    3. Statistics Based Machine Translation (SMT)
    4. Example Based Machine Translation (EBMT)
    5. Rule Based Machine Translation (RBMT)
    6. A Multi-Engine Machine Translation System and Third-Party Applications
    7. Speech-to-Speech Translation
    8. Future Proofing

CVs of Team Members
Project Budget
Project Plan
Authorisation
Appendix 1 – Deployment of a Shared Repository for Project Artefacts and Other On-line Collaboration Tools within the WISPR Project
Appendix 2 – Internal Quality Audit Canolfan Bedwyr


Introduction

Background

This tender document details the application led by Canolfan Bedwyr, University of Wales, Bangor for the project proposed by the Welsh Language Board for English-Welsh and Welsh-English machine translation.

This tender is submitted in cooperation with partners from leading institutions and companies in the field of machine translation.

The tender states not only how the partners will fulfil the requirements of the immediate two year project, but also their strategies for firmly establishing an international network of excellence. Within that network a centre of competence in Wales can play its role, and so be best placed to continue research and development for improved machine translation and for any other multilingual computing for the Welsh language.

Partners

The partners involved in the tender document are:

 - University of Wales, Bangor, Wales
 - University of Yamaguchi, Japan
 - France Telecom, France
 - Carnegie Mellon University, USA

Together the partners have wide-ranging, world-leading and complementary expertise in various aspects of language technology, including machine translation.

The project will be directed by the e-Welsh Terminology and Language Engineering Unit at Canolfan Bedwyr within the University of Wales, Bangor.

Each partner has leading expertise in one of the types of machine translation described in later sections of this document.

Welsh Language Resources at Canolfan Bedwyr

Canolfan Bedwyr is the only institution in the world that conducts Language Engineering to develop software components and related resources that support the linguistic characteristics of the Welsh language and the requirements of the Welsh speaking community.

The identification, development and maintenance of such linguistic software components is regarded as a crucial activity by Canolfan Bedwyr. The latest techniques from the international Natural Language Processing research community are closely followed and are used, where applicable, to help construct a library of building blocks for use in as wide a range of applications as possible. (E.g. Hicks, W.J.: Welsh Proofing Tools – Making a Little NLP Go a Long Way, Workshop on International Proofing Tools and Language Technologies, Greece, 2004.)

To date this library represents the result of many man months of development, and of improvement through integration into many applications, and is thus at a high level of maturity and capability. Applications and projects that have benefited include:

 - Cysill 3.0 (Welsh Spelling and Grammar Checker)
 - Cysgeir (English-Welsh CD based electronic dictionaries)
 - Cysgliad's Pop-up dictionary facility (a merging of Cysill and Cysgeir functionalities)
 - BBC LearnWelsh on-line Welsh-English dictionary
 - Microsoft Word's Spelling checker
 - Star/OpenOffice.org spelling checker

Welsh-English machine translation would also be an end application requiring Canolfan Bedwyr's software building blocks either as components in the eventual translation engines or as tools during the development process.

Library components likely to be of greatest significance to machine translation are:

Welsh Monolingual and Bilingual Dictionaries

The library employs a dictionary of 45 000 entries, each of which contains all the grammatical and morphological information for a basic Welsh lexeme.

The dictionary also contains the English translations of the Welsh entries.

Lemmatiser

With the aid of the dictionary, this component provides all the morphological information for a given Welsh word: its mutation and its list of possible parts of speech. For verbs it also provides the verb ending, tense and person. Thus it can 'lemmatise' a word and return its base lexeme; e.g. 'ellir' is recognised as the mutated impersonal present tense of the verb 'gallu'. (N.B. This capability is key to providing a pop-up dictionary, where nearly all words in a given Welsh text can be recognised and an English translation offered from the library's bilingual dictionary component.)

This component is also capable of generating all possible morphological forms of a word.
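The lemmatisation step described above can be illustrated with a minimal sketch. It is not Canolfan Bedwyr's implementation: the two-entry form table, the reverse soft-mutation table and the function name `lemmatise` are all assumptions made for this example, and only soft mutation is handled.

```python
# Toy data: both tables are illustrative assumptions, not the library's lexicon.
# Maps an unmutated inflected form to (lemma, grammatical description).
FORMS = {
    "gellir": ("gallu", "verb, impersonal present"),
    "gallu": ("gallu", "verb-noun"),
}

# Soft mutation reversed: mutated initial -> possible original initials.
# The empty key models the loss of an initial "g" under soft mutation.
UNDO_SOFT = {
    "": ["g"],
    "g": ["c"],
    "b": ["p"],
    "d": ["t"],
    "f": ["b", "m"],
    "dd": ["d"],
    "l": ["ll"],
    "r": ["rh"],
}


def lemmatise(word):
    """Return a list of (lemma, description, mutation) analyses for a word."""
    word = word.lower()
    analyses = []
    # First try the word as it stands (no mutation).
    if word in FORMS:
        lemma, desc = FORMS[word]
        analyses.append((lemma, desc, "none"))
    # Then try undoing a soft mutation on the initial consonant.
    for mutated, originals in UNDO_SOFT.items():
        if word.startswith(mutated):
            rest = word[len(mutated):]
            for original in originals:
                candidate = original + rest
                if candidate in FORMS:
                    lemma, desc = FORMS[candidate]
                    analyses.append((lemma, desc, "soft"))
    return analyses


print(lemmatise("ellir"))
```

For 'ellir' the sketch restores the lost initial 'g', finds 'gellir' in the form table and reports the lemma 'gallu' with a soft mutation. A real lemmatiser would also cover nasal and aspirate mutations and derive forms from morphological rules rather than a lookup table.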

Language Detector

Since Welsh is often used in a bilingual environment, where parts of a text may contain English words, this component can be called upon to determine the language of a word. Consideration is given to whether the word may be a misspelling of a Welsh word.

Part of Speech Tagger

This component tags the words in a given text with their parts of speech. Such information is valuable for parsing or for pattern matching across texts.

A rule-based tagger using a constraint grammar with hand-written rules has been developed; it is loosely based on the ENGCG formalism developed for tagging English text. For more information see ENGCG, Information on the English Constraint Grammar (http://www.ling.helsinki.fi/~avoutila/cg/index.html) and Voutilainen, A. 2003, Part-of-Speech Tagging, in Mitkov, R. (ed.), Oxford Handbook of Computational Linguistics.

The rules have been written and improved incrementally as needed.

Grammar Checker and Rule Base

The grammar checker component primarily checks for three main kinds of error in a given text: mutation errors, incorrect syntax and incorrect word choice (mostly incorrect literal translations from English).

A collection of Welsh grammar rules is defined separately from the checker, again using a simple constraint grammar. At the moment it holds 200 mutation rules and 300 non-mutation rules.

Code for the realisation of the grammar rules in the grammar checker is automatically generated by other library components from the rule base.
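As an illustration of keeping rules separate from the checker, the sketch below holds two mutation rules as data and applies them with a generic function. Everything here is an assumption for the example: the rule format, the tiny lexicon and the function names are hypothetical, the rules are textbook cases ('dy' triggers soft mutation, 'fy' triggers nasal mutation) rather than entries from the Cysill rule base, and the rules are interpreted at run time rather than compiled into generated code as the text describes.

```python
# Hypothetical rule base: trigger word -> mutation required on the next word.
MUTATION_RULES = {"dy": "soft", "fy": "nasal"}

# Radical (unmutated) initial -> mutated initial, per mutation type.
SOFT = {"p": "b", "t": "d", "c": "g", "b": "f", "d": "dd",
        "g": "", "ll": "l", "m": "f", "rh": "r"}
NASAL = {"p": "mh", "t": "nh", "c": "ngh", "b": "m", "d": "n", "g": "ng"}
TABLES = {"soft": SOFT, "nasal": NASAL}

# Initial digraphs that must not be treated as a mutable single letter.
UNAFFECTED = ("ch", "dd", "ff", "ng", "ph", "th")

# Toy lexicon of radical forms; a real checker would use the full dictionary.
LEXICON = {"cath", "ci", "tad", "pen"}


def mutate(word, kind):
    """Apply the given mutation to a radical form."""
    if word.startswith(UNAFFECTED):
        return word
    table = TABLES[kind]
    # Try longer initials first so that "ll" wins over a shorter prefix.
    for initial in sorted(table, key=len, reverse=True):
        if word.startswith(initial):
            return table[initial] + word[len(initial):]
    return word


def check(text):
    """Return (trigger, found, expected) for each missing mutation."""
    words = text.lower().split()
    errors = []
    for trigger, following in zip(words, words[1:]):
        kind = MUTATION_RULES.get(trigger)
        # Only a known radical form can be flagged as "not mutated".
        if kind is None or following not in LEXICON:
            continue
        expected = mutate(following, kind)
        if expected != following:
            errors.append((trigger, following, expected))
    return errors


print(check("dy cath"))
```

The sketch only detects a missing mutation after a trigger word; it does not detect a wrong mutation, and it ignores the syntactic context that a constraint grammar rule would normally test.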

Canolfan Bedwyr aims to strengthen its expertise and funding base in language technology in order to provide a long-term future for the discipline in Wales. This was acknowledged in a recent internal audit at the University of Wales, Bangor, which recommended that UWB consider making some of the e-Welsh posts permanent with this in mind (UWB Internal Quality Audit 2005 4.2.3.b.1). The full report is included as Appendix 2.

Method of Working

Canolfan Bedwyr has in recent years implemented within such projects as WISPR (Welsh and Irish Speech Processing Resources) remote working methods and tools to allow it to manage and conduct work and collaboration in a distributed manner.  

A progress report on developing distributed methods of working in the WISPR project has been included with this tender (Appendix 1).

Thus it is realistic for an international network of partners, coordinated from Canolfan Bedwyr, to fulfil the requirements of this MT project.

The project will have Canolfan Bedwyr as its central point for project management and workshops and visits that will exchange knowledge between all partners and further secure the competencies in computational linguistics at Bangor. 

Dewi B Jones from Canolfan Bedwyr is proposed as project manager, and academic supervisors from the partner organisations will be designated to guide development of each separate strand of MT.

Individual team members will be assigned lead roles in the various strands, but it is understood that other team members and additional support will be called upon as necessary during the life of the project. This is illustrated in the attached organisational chart.


Implementation of Requirements

1. Building a One Million Word Parallel Text Corpus

We already have a parallel text corpus of around 800 000 words from the Bible, prepared by Prof. John Phillips (see section 3 below).

We will also use 100 000 word parallel text (minimum) from UWB’s own bilingual corpus developed for use by its own in-house translation team in its translation memory system. This is a high-quality corpus of translated texts in the fields of the sciences and humanities at higher education level, together with administrative documents and web pages covering the broad extent of topics which a higher level education establishment in Wales has encountered in the last three years.

A further 100 000 words of parallel text will come from translating into Welsh the 100 000 word English corpus which forms the basis of the Avenue EBMT system (see section 4 below), and including the result in the parallel text corpus.
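The three sources add up to the one million word target. The short sketch below simply tallies the figures quoted above; the labels are descriptive only, and the Bible figure is the approximate one given in the text.

```python
# Word counts as quoted in this section (the Bible figure is approximate).
sources = {
    "Bible parallel text (Prof. Phillips)": 800_000,
    "UWB translation memory corpus (minimum)": 100_000,
    "Avenue corpus translated into Welsh": 100_000,
}

total = sum(sources.values())

for name, words in sources.items():
    print(f"{name}: {words:,} words ({words / total:.0%})")
print(f"Total: {total:,} words")
```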

2. Preparing a Customized Bilingual Dictionary for the MT Project

Although not specified as a task in the tender document, a comprehensive bilingual dictionary is needed to facilitate building a Welsh/English MT system. Canolfan Bedwyr holds extensive databases of general language and specialist terminology bilingual English/Welsh dictionaries, containing 150 000 words and phrases, along with disambiguators, part of speech and additional grammatical information and morphology. Subject to copyright restrictions, these will be revised and adapted for use with this project.

Initially, a ‘rough and ready’ dictionary will be collated, so as to provide a language tool for the early weeks of the project. This will then be refined to answer specific needs of the project and include the words aligned from the corpora so that grammatical information and subject tagging may be optimized. Further improvements to the dictionaries will be ongoing throughout the life of the project as needed.

3. Statistics Based Machine Translation (SMT)

This strand of the project will have Prof. John D Phillips from the Department of Linguistics, Yamaguchi University, Japan as partner.  

Prof. Phillips has (as stated in H. Somers' report to the Welsh Language Board, 2004) a stochastic machine translation system built using a parallel text corpus from the NIV English translation of the Bible and Y Beibl Cymraeg Newydd (the 1988 translation of the Bible), together with the Collins Welsh/English Dictionary. (Ref.: The Bible as a basis for machine translation, John D Phillips, Pacific Association for Computational Linguistics.)

We would aim to draw heavily on the excellent work already carried out by Prof. Phillips in order to fulfil the SMT requirement, whilst incorporating further resources such as parallel texts and dictionaries from the University of Wales, Bangor.

Prof. Phillips has further refined this system by adding automatic identification of rules based on statistical methods to the initial purely statistical system. This is intended to be an adaptable system, able to incorporate other languages to the initial language pair. German has already been included into his system on this basis. However, our proposed project will concentrate on refining the system for English/Welsh only, carrying it towards integration with other systems in a composite Windows and web-based environment.
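To give a feel for how word correspondences can be estimated from a parallel corpus, the sketch below scores Welsh-English word pairs by co-occurrence using the Dice coefficient. This is a deliberately simple stand-in technique chosen for illustration, not Prof. Phillips's system; the four sentence pairs and the function name `dice_lexicon` are assumptions made for the example.

```python
from collections import Counter
from itertools import product

# Toy sentence-aligned corpus (Welsh, English); illustrative only.
PARALLEL = [
    ("y ci", "the dog"),
    ("y tŷ", "the house"),
    ("ci mawr", "big dog"),
    ("tŷ mawr", "big house"),
]


def dice_lexicon(pairs):
    """Map each Welsh word to its best-scoring English word and Dice score."""
    cy_count, en_count, joint = Counter(), Counter(), Counter()
    for cy, en in pairs:
        cy_words, en_words = set(cy.split()), set(en.split())
        cy_count.update(cy_words)
        en_count.update(en_words)
        # Count every Welsh/English word pair seen in the same sentence pair.
        joint.update(product(cy_words, en_words))
    lexicon = {}
    for (cy_word, en_word), both in joint.items():
        # Dice coefficient: 2 * co-occurrences / (count of each word).
        score = 2 * both / (cy_count[cy_word] + en_count[en_word])
        if score > lexicon.get(cy_word, ("", 0.0))[1]:
            lexicon[cy_word] = (en_word, score)
    return lexicon


for cy_word, (en_word, score) in sorted(dice_lexicon(PARALLEL).items()):
    print(f"{cy_word} -> {en_word} ({score:.2f})")
```

Even on this tiny corpus the scores separate 'ci' from 'the' and 'big' because 'ci' and 'dog' always appear together. Real SMT systems use probabilistic alignment models trained on far more data.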

The first iteration will concentrate on refining the Welsh>English SMT system. This will then be evaluated and revised. An attempt will then be made to adapt it to an English>Welsh version. Due to the need for higher quality English>Welsh output, this is envisaged primarily as a contribution to the multi-engine composite system.

Academic Supervisor: Prof. John D. Phillips (Department of Linguistics, University of Yamaguchi). Canolfan Bedwyr Contacts: Dr Ivan Uemlianin, Ambrose Choy

4. Example Based Machine Translation (EBMT)

This strand of the project will have the Language Technologies Institute at Carnegie Mellon University as its partner. 

The Language Technologies Institute was founded in 1996 as an extension of the Center for Machine Translation. It forms part of the School of Computer Science at Carnegie Mellon University (CMU), Pittsburgh, USA. According to US News & World Report magazine, CMU annually ranks among the USA's top universities, while in 2002 CMU's School of Computer Science was ranked first above all USA university computer science departments.

The work carried out to develop EBMT for Welsh will be based on the transfer based system developed within Carnegie Mellon's AVENUE program.

AVENUE is concerned with the design and development of machine translation for languages which have scarce resources or are minority languages. To date, its specific focus has been on Native and Latin American languages; however, CMU have indicated a strong interest in being involved in English-Welsh machine translation.

Although Welsh is in the fortunate position of having more electronic linguistic tools than many other minority languages, AVENUE has been designed to minimize development cost and time and is therefore suitable for use in this two year project.

In the AVENUE MT system, translation rules are learnt automatically from human-translated parallel texts and word-aligned data. Machine learning techniques are used to generalize transfer rules from specific translated examples, and these are combined with decoding techniques from SMT to produce the best translation from a lattice of translation segments.
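Choosing the best translation from a lattice of segments can be pictured as a best-path search. The sketch below is a generic dynamic-programming illustration, not AVENUE's decoder: the edge format, the scores (treated as log-probabilities, higher is better) and the example segments are all assumptions made for this example.

```python
def best_path(n_words, edges):
    """Find the highest-scoring sequence of segments covering the source.

    edges: list of (start, end, target_text, score) with start < end,
    where start and end are source word positions.
    Returns (total_score, [target_text, ...]) or None if no path exists.
    """
    best = {0: (0.0, [])}
    # Positions are visited left to right; every edge points forward,
    # so best[position] is final by the time it is expanded.
    for position in range(n_words):
        if position not in best:
            continue
        score_so_far, output = best[position]
        for start, end, text, score in edges:
            if start != position:
                continue
            candidate = (score_so_far + score, output + [text])
            if end not in best or candidate[0] > best[end][0]:
                best[end] = candidate
    return best.get(n_words)


# Source sentence: "mae'r ci yn cysgu" (4 words); segments are hypothetical.
EDGES = [
    (0, 2, "the dog is", -0.5),
    (0, 1, "is the", -0.7),
    (1, 2, "dog", -0.2),
    (2, 4, "sleeping", -0.4),
    (2, 3, "in", -1.5),
    (3, 4, "sleep", -0.6),
]

score, segments = best_path(4, EDGES)
print(" ".join(segments))
```

Here the two longer segments win over the word-by-word path because their combined score is higher. A real decoder would also apply a target language model and allow reordering.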

Its most active work is on designing a typologically comprehensive elicitation corpus, advanced techniques of automatic rule learning, improved decoding, and rule refinement via user interaction.

Academic Supervisor: Dr Alon Lavie (Language Technologies Institute, CMU) aided by members of the Avenue team. Canolfan Bedwyr Contacts: Dr Briony Williams, Dr Rhys Jones

5. Rule Based Machine Translation (RBMT)

This strand of the project will involve Canolfan Bedwyr taking advantage of its extensive array of language tools to develop RBMT, aided by the Research and Development Laboratory at France Télécom.

France Télécom's Research and Development division states that "the overarching objective is customer-focused innovation, one of our fundamental competitive advantages". Hence its emphasis is on customer-led and needs-led research rather than unfocused research. It is also extensive: France Télécom R&D (formerly CNET) is Europe's top telecommunications research organization, home to 90 percent of the company's 3,400 researchers. It spans 12 facilities, four of them outside France (in San Francisco, Boston, London and Tokyo), plus the research labs of TP S.A. in Poland and teams in Beijing and New Delhi. Research is structured according to fifteen "world class" areas determined by a jury of independent experts. In 2002, the parent company (France Télécom) launched its Internet subsidiary (Wanadoo) on the Paris stock market, and also acquired the British mobile telephone operator Orange, thus becoming the second largest European player in the mobile phone industry. Its research division features among the top European leaders in telecom R&D.

France Télécom are currently working on RBMT for English, French and German.

The work will involve constructing Dependency Grammar Rules for Welsh. Components from Cysgliad such as bilingual dictionaries, Welsh morphological analyser and POS tagger will be used in this work. Grammar rules already present in Cysill spelling and grammar checking software will be used as an input to rules required within RBMT.  This will include a syntactic parser that returns base lemmas and compound structures, and produces part of speech classes, inflectional tags, noun phrase markers and any syntactic dependencies. Syntactical dependencies show functional relations between words and phrases in sentences.

The system will be developed initially for Welsh>English. Elements from RBMT will be integrated with the SMT and EBMT systems directly as needed, as well as with the composite engine described in section 6 below.

Academic Supervisor: Dr Johannes Heinke (France Télécom Recherche & Développement, TECH/EASY). Canolfan Bedwyr Contact: Ivan Uemlianin.

6. A Multi-Engine Machine Translation System and Third-Party Applications

A requirement of the project is for the three MT approaches to be integrated into a composite multi-engine system for use in office software such as Microsoft Office and others.

Integration of the three systems will be achieved by each system’s eventual use of a shared common infrastructure. As noted in the section titled ‘Welsh language Resources at Canolfan Bedwyr’ such a shared common infrastructure for an integrated multi engine system would benefit immensely from the re-use of Canolfan Bedwyr’s own library of components.

In this bid, Canolfan Bedwyr proposes to overlap the work for a shared common infrastructure with that of developing an RBMT system for Welsh-English. This is so that as the first pass of RBMT is available for evaluation, so too are the linguistic components to be used later in the project for integrating SMT and EBMT into one system.

Canolfan Bedwyr’s other strength lies in it being able to deliver its linguistic building blocks to every day software tools, such as its Cysgliad application and its integration with Microsoft Office, Star/OpenOffice.org and translation memory systems.

Canolfan Bedwyr intends to develop as early as possible the Application Programming Interface (API) to the composite engine. This API would be the interface for adding Welsh-English machine translation functionality into any third-party application such as Microsoft Office, Star/OpenOffice.org and any website service.

Providing an API however is not dependent on the completion of the composite engine. Dummy translations can be used to develop and verify correct interfacing with third party applications. Therefore flexibility is achieved by the early availability of an interface to develop third-party application integration against. Having MT functionality within such third-party applications earlier in the project may be of benefit in the development and improvement of the underlying multiengine system and shared infrastructure. 
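The idea of developing third-party integrations against dummy translations can be sketched as follows. The interface and all names (`TranslationEngine`, `DummyEngine`, `translate_document`, the language codes) are hypothetical and chosen only for this illustration; the tender does not specify the API.

```python
from typing import List, Protocol


class TranslationEngine(Protocol):
    """Hypothetical interface a third-party application would code against."""

    def translate(self, text: str, source: str, target: str) -> str:
        ...


class DummyEngine:
    """Stand-in engine: returns a tagged copy of the input, no translation."""

    def translate(self, text: str, source: str, target: str) -> str:
        return f"[{source}->{target}] {text}"


def translate_document(
    engine: TranslationEngine,
    paragraphs: List[str],
    source: str = "cy",
    target: str = "en",
) -> List[str]:
    """Example client code: depends only on the interface, not the engine."""
    return [engine.translate(p, source, target) for p in paragraphs]


print(translate_document(DummyEngine(), ["Bore da", "Diolch yn fawr"]))
```

Because the client function depends only on the interface, the dummy engine could later be replaced by the composite engine without changing the application code.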

7. Speech-to-Speech Translation

Certain components of speech-to-speech translation for Welsh already exist (at least in embryonic form), as follows:

 - The WISPR project at Canolfan Bedwyr has been developing text-to-speech synthesis for Welsh, partly re-using resources developed ten years ago for a less sophisticated text-to-speech system for Welsh.

 - As part of the WISPR project, some preliminary work has been carried out in developing the "acoustic models" that might be needed in a speech recognition system for Welsh. Some initial tools and expertise have already been developed, and the way has been smoothed for any future project in automatic speech recognition for Welsh. A further project called SpeechBridge proposes developing limited domain speaker-specific speech recognition for Welsh. Subject to funding approval, this is scheduled to run from January 2006 to July 2007.

 - A speech-to-speech translation system would also require speech synthesis and recognition for English.  Expertise has been built up at Canolfan Bedwyr in developing new synthetic voices, which would enable an English voice to be developed fairly quickly in any future project.

 - Speech recognition for English would require a new project, as this task is more complex than speech synthesis.  The task could be made more tractable by limiting the domain, and hence the vocabulary to be used.

A full "dialogue system" (utilising speech-to-speech translation) is still a little way in the future, but Canolfan Bedwyr has already laid much of the groundwork for the speech components of this task.  The proposed machine translation project would form the core of any such system, while the existing WISPR project would contribute essential tools and resources to it. 

8. Future Proofing

This tender calls for a two year MT project between English and Welsh, concentrating on a system which can be used in web-based applications and office suite environments. However we are eager to design an open-ended system which will be infinitely adaptable. All the systems mentioned above are open to adaptation to other languages. This means that it may be further developed in terms of accuracy and sophistication, and to accommodate new technological advances such as MT in telephony and hand-held devices. MT for Welsh may also be developed to integrate with Translation Memory systems, e.g. Wordfast and Trados. It may be further developed in limited domains to provide a high-quality bilingual service in areas such as weather forecasting or flood warnings.

The small sum of money assigned in this project for future proofing will be used to write bids for large-scale European funding where the current Welsh Language Board’s financial contribution can be match-funded with outside monies. This will take the work forward in a network of institutions in Europe and beyond, of which the current partners will form the core.


CVs of Team Members

Dewi B. Jones: Project Manager

Dewi Jones worked as a software engineer for two years at Cambridge, England and four years at Nokia, Helsinki, Finland. During his time at Nokia Helsinki Dewi worked as a Software Development Project Manager for the last three of those years before deciding to return to Wales to live and work.

In Wales, Dewi joined the then small Bangor based software consultancy company, Draig Technology where he led on a number of projects for clients in the Welsh public sector.

In pursuit of his ambition to improve the provision of Welsh I.T. he joined Canolfan Bedwyr in 2002 and is now head of the software development team in the e-Welsh Unit. He has worked on the WISPR (speech technology project), Cysgliad, the national database of Welsh terminology (currently being undertaken for the Welsh Language Board), and numerous other on-line databases.

Dr Briony Williams: Research Officer

Briony Williams is currently Lead voice researcher for Canolfan Bedwyr on the WISPR project ("Welsh and Irish Speech Processing Resources"), developing high-quality text-to-speech synthesis for Welsh. She is project manager for the Voice Development sub-team. She was a post-doctoral researcher in speech technology 1983-2000 and again from January 2004, mainly in text-to-speech synthesis but also with some speech recognition. She has an extensive and varied publication record: books, book chapters, refereed journals and international conferences. She has a wide circle of international contacts including European networking experience (knowledge of German and French).

She was a founder of the ISCA Special Interest Group in "Speech and Language Technology for Minority Languages", Oct. 1999, and instigator and chief organiser of the half-day workshop on "Language Resources for European Minority Languages", Granada, Spain, May 27th 1998 (a satellite workshop of the First International Conference on Language Resources and Evaluation, Granada, Spain, May 28-30 1998). She is the founder and co-owner of the worldwide electronic discussion list WELSH-L (set up November 1992). She organised two IOA Speech Group 1-day meetings in Edinburgh: 'Speech systems for the handicapped' (June 1995) and 'Human interaction with computers' (June 1993). She was secretary of the Institute of Acoustics' Speech Group, 1990-1996, and has been a member of the Speech Group Committee from 1998 to the present. She is also a former co-ordinator of the UK's SALT (Speech and Language Technology) electronic discussion list.

Dr Ivan Uemlianin: Research Officer

Ivan Uemlianin has an academic and professional background in speech technology. He is experienced in the development and use of speech recognition software. With a background in Artificial Intelligence, he has a good understanding of the foundations of, and current developments in, the Semantic Web, for example RDF, DAML+OIL and Topic Maps. He is also experienced in higher-order logics, unification-based formalisms and statistics, and in the implementation of these theories in computer applications including logic programming, object-oriented programming and neural networks.

Currently working part-time at Canolfan Bedwyr, he is also a self-employed software consultant, specialising in natural language processing and knowledge management. He is a founding director of the Open-Source Speech Recognition Initiative (OSSRI), leading the Technical Evaluation Taskforce.

Dr Rhys James Jones: Research Officer

Rhys Jones is currently employed full-time at Canolfan Bedwyr on the WISPR speech technology project. His PhD thesis was "First investigations and experiments in speech recognition for Welsh", which led to the SpeechDat Welsh database for the fixed telephone network.

He has since worked as a software engineer, adding good linguistic skills to his knowledge of computer programming.

Ambrose Choy: Research Officer

Ambrose Choy has been a software engineer at Canolfan Bedwyr since March 2005, where he has contributed significantly to a number of Canolfan Bedwyr projects.

Previously he worked as a software engineer with Sony UK Marketing – TIMMS, where he completed the Sony Eurograd Graduate Scheme. During this time he worked on developing software for hand-held devices for the Chinese market.

Delyth Prys: Internal Adviser

Delyth Prys is team leader of the e-Welsh Unit at Canolfan Bedwyr. An experienced terminologist and lexicographer, she will serve as linguistic adviser to the project. She will also advise on further grant capture to fund further developments beyond the present remit of the project.

Canolfan Bedwyr Staff


Canolfan Bedwyr employs a number of other academic staff and researchers. This includes experienced translators and other linguists who may be called upon to help in this project on an occasional basis.

Prof. John Phillips: SMT Academic Adviser

John Phillips worked at Edinburgh University's Department of Artificial Intelligence in the 1980's. He worked on the design of computational grammar frameworks, particularly for the groundbreaking Alvey broad-coverage English grammar project, and later for integrating grammatical analysis with speech recognition. Subsequently he was at Tübingen University, and at UMIST, where he worked on several machine translation projects, including Eurotra. Since 1996 he has been associate professor of Yamaguchi University where, as well as writing a book on the Manx language, he has developed methods for automatic induction of machine translation systems.

Dr Alon Lavie

Dr Alon Lavie is Associate Research Professor at the Language Technologies Institute at Carnegie Mellon University, USA. He directs research projects on parsing and translation of text and spoken language. He is Principal Co-investigator of the AVENUE MT Project for Minority Languages and of the BABYLON Mobile Speech-to-Speech Translation Project.

Dr Johannes Heinke: RBMT Academic Adviser

Johannes Heinke is a researcher with the France Telecom team developing MT at Lannion in Brittany. He is fluent in Welsh as well as German, French and English, and spent a year in the Linguistics Department at UWB. He has previously advised Canolfan Bedwyr on its IT Terminology project for the Welsh Language Board.


Project Budget

Description of activity      Price excluding VAT

Location of 1m words of suitable parallel text, conversion if necessary to text form, initial sentence alignment and verification      2,000

Word and sentence alignment work for Welsh–English SMT system      1,500

Additional translation work for SMT system evaluation      5,000

Refinement of Welsh–English SMT system and adaptation to English–Welsh SMT system      7,000

Word and sentence alignment work for English–Welsh EBMT system      2,000

Sub-sentential alignment for EBMT      1,000

Adaptation of recombination algorithm for EBMT      5,000

Evaluation of EBMT system      1,000

Adaptation to Welsh–English EBMT system in year 2      3,000

Development of RBMT system      60,000

Project management (one point of contact)      12,000

Project startup workshop      5,000

End of project workshop      5,000

Written bimonthly progress reports to the Board and quarterly meetings with the Board      2,000

Administration, secretarial and travel costs for SMT developer      6,000

Administration, secretarial and travel costs for EBMT developer      6,000

Administration, secretarial and travel costs for RBMT developer      3,000

Ensure that the composite three engine machine translation system will work in Microsoft Office 2003 and later versions (background information below)      20,000

Ensure other Office suites can use the composite translation engine      10,000

Ensure that the composite engine is available as a web service on a public website, branded in the Board’s and the Welsh Assembly Government’s style guide      2,500

Ensure that the engine is future-proofed for technological developments      2,100

Total excluding VAT      161,100

Will you charge VAT?      No
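As a quick cross-check, the line items in the budget can be summed to confirm the stated total. The following is a minimal sketch in Python: the amounts are copied from the budget above, while the shortened labels and the names `BUDGET_ITEMS`, `budget_total` and `STATED_TOTAL` are illustrative only and are not part of the tender.

```python
# Line items from the Project Budget (prices excluding VAT).
# Labels are abbreviated for illustration; amounts are as stated in the budget.
BUDGET_ITEMS = [
    ("Parallel text location, conversion, alignment and verification", 2000),
    ("Word and sentence alignment, Welsh-English SMT", 1500),
    ("Additional translation work for SMT evaluation", 5000),
    ("SMT refinement and adaptation to English-Welsh", 7000),
    ("Word and sentence alignment, English-Welsh EBMT", 2000),
    ("Sub-sentential alignment for EBMT", 1000),
    ("Adaptation of recombination algorithm for EBMT", 5000),
    ("Evaluation of EBMT system", 1000),
    ("Adaptation to Welsh-English EBMT in year 2", 3000),
    ("Development of RBMT system", 60000),
    ("Project management", 12000),
    ("Project startup workshop", 5000),
    ("End of project workshop", 5000),
    ("Progress reports and meetings with the Board", 2000),
    ("Admin, secretarial and travel: SMT developer", 6000),
    ("Admin, secretarial and travel: EBMT developer", 6000),
    ("Admin, secretarial and travel: RBMT developer", 3000),
    ("Microsoft Office 2003 and later integration", 20000),
    ("Support for other Office suites", 10000),
    ("Public web service", 2500),
    ("Future-proofing", 2100),
]

# Total as printed in the budget.
STATED_TOTAL = 161_100


def budget_total(items):
    """Return the sum of all line-item amounts."""
    return sum(amount for _, amount in items)


if __name__ == "__main__":
    total = budget_total(BUDGET_ITEMS)
    print(f"Computed total: {total:,}")
    print("Matches stated total:", total == STATED_TOTAL)
```

The same list could be grouped by engine (SMT, EBMT, RBMT) if subtotals were wanted, but the tender itself does not present the budget that way.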


Project Plan


Authorisation

I certify that the information contained in this tender is correct and confirm that this project will be carried out as described.

Signed ………………………………………………………………..

Name (in capitals) …………………………………………………….

Position in Organisation ……………………………………………..

Date …………………………..

Signed ………………………………………………………………..

Name (in capitals) …………………………………………………….

Position in Organisation ……………………………………………..

Date …………………………..


Appendix 1 – Deployment of a Shared Repository for Project Artefacts and Other On-line Collaboration Tools within the WISPR Project

Introduction

Early in the formation of the WISPR project it was recognised that the Welsh and Irish partners needed to work jointly on common project artefacts such as documents, speech corpora databases and source code. Canolfan Bedwyr, as the lead partner, therefore proposed the following as a milestone for WISPR:

The Welsh partner will develop and manage a shared repository of code and documents for the project using an open source CVS (concurrent versioning system) to facilitate joint working between partners. [Note: this is a method of managing information exchange using open source software which will be a major contribution to developing information management and flow systems in this area, improving cross border movement of information].

[Milestone 13: CVS joint project completed, December 2005]

This document gives a brief interim report on the progress of this subproject as of May 2005 and on the work remaining before December 2005.

Version Control System

Traditionally, multi-site, multi-developer software development projects employ a version control system as their central repository for the storage and management of all significant project files.

A number of version control systems exist, from commercial offerings from Microsoft (e.g. Visual SourceSafe) and IBM (e.g. Rational ClearCase) to open source and free offerings, namely CVS (Concurrent Versioning System, http://www.cvshome.org) and SubVersion (http://subversion.tigris.org/). To keep cost outlay to a minimum and to remain compatible with the open source nature of the WISPR project, only open source and free version control systems were considered for WISPR.

CVS has been in use for a number of years and has been critical to the success of a number of open source projects such as Mozilla and OpenOffice. It is thus proven to be a stable environment for shared resources. However, its architecture and mode of usage have become quite dated compared to more recent versioning systems such as SubVersion.

After careful consideration the WISPR team in Wales decided upon the SubVersion version control system, described at its official website as “a compelling replacement for CVS”.

The reasons for the WISPR team in Wales to employ SubVersion were:

Ease of Use. The repository and its contents may be accessed simply via a conventional web browser, whereas CVS would need special client software and configuration.

Client programs offering enhanced client-side functionality also exist for SubVersion. These improve ease of use even further, to the point where less technically experienced users can work with the repository comfortably.

Excellent support of branches and tagging. These are important features for efficient and well coordinated management of shared resources.

Multiple versions of a given code/file base can be developed independently of each other in separate branches, by different developers or for different functionalities, and, when complete, merged into release candidates.

Tagging permits the project to record snapshots of the code/file base at a certain moment in the project development.

Access rights and security to the repository are easily configurable.
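To illustrate the branching and tagging point above: in SubVersion both operations are performed with the `svn copy` command, which creates a cheap server-side copy of a directory. The sketch below only builds the command lines and does not run them. The repository URL, the branch and tag names, and the conventional trunk/branches/tags directory layout are all assumptions made for illustration; the report does not say how the WISPR repositories are laid out.

```python
import shlex

# Hypothetical repository root, used for illustration only.
REPO_URL = "http://svn.example.org/project"


def branch_command(repo_url, branch_name, message):
    """Build the svn command that copies trunk into branches/<branch_name>.

    Assumes the conventional trunk/branches/tags layout.
    """
    return [
        "svn", "copy",
        f"{repo_url}/trunk",
        f"{repo_url}/branches/{branch_name}",
        "-m", message,
    ]


def tag_command(repo_url, tag_name, message):
    """Build the svn command that snapshots trunk into tags/<tag_name>."""
    return [
        "svn", "copy",
        f"{repo_url}/trunk",
        f"{repo_url}/tags/{tag_name}",
        "-m", message,
    ]


def as_shell(command):
    """Render an argument list as a single shell-safe command line."""
    return " ".join(shlex.quote(part) for part in command)


if __name__ == "__main__":
    # Hypothetical branch and tag names.
    print(as_shell(branch_command(REPO_URL, "feature-x", "Start feature branch")))
    print(as_shell(tag_command(REPO_URL, "milestone-13", "Snapshot for milestone 13")))
```

A branch and a tag are technically the same kind of copy; the difference is one of convention, in that work continues on a branch while a tag is left untouched as a snapshot.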

Realisation of Version Control System

The WISPR team at Canolfan Bedwyr consulted and co-ordinated with the Information Services (IS) department of the University of Wales, Bangor to set up a collection of SubVersion repositories hosted on a Linux-based server.

Important considerations where IS were best able to help WISPR were:

private and public connectivity from the world wide web to the repositories

backup strategies and solutions

With assistance from IS, the WISPR team in Wales have so far put the following in place:

A dedicated Linux server built especially for hosting SubVersion repositories for WISPR.

Three available SubVersion repositories – pub, dev and repos (described further later)

A VPN (Virtual Private Network) for remote and secure access to the entire University of Wales, Bangor (UWB) intranet.

Nightly incremental and off-site backups of the dedicated Linux server's hard disks.

SubVersion Repositories

The Welsh WISPR team has deployed three repositories within its SubVersion server. All three are at the moment accessible only within the UWB intranet; however, access will gradually be extended to the Irish partners and the wider public.

When logged in to the UWB network via VPN, the URL for each repository is as follows:

http://bedwyr-redhat.bangor.ac.uk/svn/pub

A public repository, intended as the location for hosting all releases and resources usable by end users, e.g. installation programs for Welsh / Irish MSAPI, documentation and speech corpora.

Source code to be shared with the wider open source community would be hosted here as well.


The entire repository is read only.

At the moment this repository is not yet viewable from outside the UWB network.

http://bedwyr-redhat.bangor.ac.uk/svn/dev

A private repository for the developers within the WISPR project, where all work in progress is kept in an evolving directory hierarchy.

Read and write access is permitted to those with suitable login accounts.

At the moment, this repository is not yet viewable from outside the UWB network.

http://bedwyr-redhat.bangor.ac.uk/svn/repos

Again a private repository within the UWB network, used for testing and training on repository-related issues (see below).
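The three repositories described above can be summarised as a small data model. This is a sketch for illustration only: the URLs and the read-only or read-write status of pub and dev come from the descriptions above, but the write status of repos is not stated in this report and is marked as an assumption, as are the names `Repository`, `REPOSITORIES` and `checkout_command` and the local directory name used in the example.

```python
from dataclasses import dataclass

# Base URL as given in this report (reachable only from the UWB network or VPN).
BASE_URL = "http://bedwyr-redhat.bangor.ac.uk/svn"


@dataclass(frozen=True)
class Repository:
    name: str
    purpose: str
    writable: bool

    @property
    def url(self):
        """Full repository URL, built from the shared base URL."""
        return f"{BASE_URL}/{self.name}"


REPOSITORIES = {
    "pub": Repository("pub", "releases and resources for end users", writable=False),
    "dev": Repository("dev", "work in progress for WISPR developers", writable=True),
    # Assumption: the report does not state the access mode of 'repos'.
    "repos": Repository("repos", "testing and training", writable=True),
}


def checkout_command(name, target_dir):
    """Build the svn command that checks out a repository into target_dir."""
    return ["svn", "checkout", REPOSITORIES[name].url, target_dir]


if __name__ == "__main__":
    for repo in REPOSITORIES.values():
        mode = "read-write" if repo.writable else "read-only"
        print(f"{repo.url}  ({mode}; {repo.purpose})")
    # Hypothetical local directory name.
    print(" ".join(checkout_command("dev", "wispr-dev")))
```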

VPN – Virtual Private Network

Since a number of the Welsh WISPR team work remotely for the project from other parts of the U.K., a VPN has been put in place for the team members in Canolfan Bedwyr with help from the University's IS department.

It is planned for the team in Ireland also to have VPN connectivity.

This facilitates connection to UWB's internal network, where computing resources such as server machines would otherwise have been guarded and hidden from the outside world by the University's firewall. A VPN connection makes such resources as accessible as if they were local.

A number of applications have proved very useful for remote workers within the WISPR Wales team and will be studied and developed further, namely:

Microsoft NetMeeting – meetings are conducted online between a number of people. Video may also be used. NetMeeting is included as a basic component in Windows XP.

Remote Desktop / Terminal server – a desktop PC located within UWB may be controlled and used remotely by anyone sitting at another PC within the intranet or VPN. Some remote project members use this remote desktop as their main means of working within the project because:

the client PC may be relatively ‘thin’ and does not need to be as powerful;

the controlled PC is within the UWB premises and physically connected to the UWB high-speed network, giving the remote worker the use of a machine with much higher bandwidth connectivity than dial-up or even domestic broadband.

Date: 4 March 2005


Appendix 2 – Internal Quality Audit: Canolfan Bedwyr

Team: Dr. Gerwyn Williams (Chair), Mr. David Allsup (School of Nursing, Midwifery and Health Studies), Mr. Sim Barbaresi (Information Services), Mrs Ruth Goggin (Assistant Registrar)

External Assessor: Professor M. Wynn Thomas, University of Wales, Swansea

1. Introduction

Canolfan Bedwyr was established in 1996 following the merger between the Coleg Normal and the University of Wales, Bangor. The Ganolfan has now expanded to offer a range of services to the University and external customers. The Ganolfan has six sub-units, including e-Welsh, Cymraeg Clir (‘Plain Welsh’), Gloywi Iaith (Language Refresher) and Use of Cysgliad Courses, and the Translation Unit.

Canolfan Bedwyr’s mission statement reads as follows: “To co-ordinate the strategy of promoting and fostering the commitment of the University of Wales, Bangor to excellence in all aspects of the situation of the Welsh language in higher education. Ensure that the expertise and services of the University of Wales, Bangor are available to other bodies in Wales”.

2. Documents

[1] Centre self-evaluation report

[2] Canolfan Bedwyr Management Committee Minutes from 22 September 2003 – 19 January 2005

[3] Canolfan Bedwyr Employment Structure

[4] Cymraeg Clir self-evaluation report

[5] Translation Unit self-evaluation report

[6] Teaching and Gloywi Iaith Unit self-evaluation report

[7] e-Welsh Unit self-evaluation report

[8] Language Scheme Monitoring and Development self-evaluation report

[9] Research and Staff Development self-evaluation report

3. Staff members interviewed during the Audit

Dr. Cen Williams (Director), Professor Gareth Roberts (Pro Vice-Chancellor), Ms Meg Elis (Translation Unit), Mrs. Delyth Prys (e-Welsh Unit), Mrs Eleri Jones (Cymraeg Clir Unit), Mrs Judith Hughes (Administrator), Ms Menna Morgan (Teaching and Gloywi Iaith Unit), Mr. Dawi Griffith (Translation Unit), Ms Helen Smith (Translation Unit), Mrs Nia Roberts (Translation Unit), Mr Dewi Jones (e-Welsh Unit), Mr Ambrose Choy (e-Welsh Unit), Mr. Gruff Prys (e-Welsh Unit), Mr. Owain Davies (e-Welsh Unit), Mr Eilir Evans (e-Welsh Unit), Miss Hunydd Andrews (e-Welsh Unit), Dr. Briony Williams (e-Welsh Unit).


Written Evidence

Mrs Stella Fuller (Occupational Health and Safety Unit)

Ms Gwawr Griffiths (Departmental Secretary, Information Services)

Dr. Gwenno Ffrancon (Lecturer in Film, Department of Media)

Mr. James F. Goodman (Business and Development Manager, Management Development Centre)

Dr. Eddie Williams (Department of Linguistics)

Mrs Steph Barbaresi (Assistant Registrar, Student Services)

Marika Fusser (Research Assistant, Department of Linguistics)

Mr. Dyfed Wyn Roberts (Library Assistant at the Normal Site Library)

Mrs T. Johnstone (Secretary, School of Biological Sciences)

Ms Eluned Jones (Head of the Centre for Careers and Opportunities)

Mrs Gwenan Owen (University Records Manager)

Dr Patsy Thomas (Assistant Registrar, Academic Registry)

Mr Nic Ross (Head of Department, Communications and Media)

4. Results of Quality Audit

The Panel investigated the quality of service at Canolfan Bedwyr by collating various types of evidence: reading the self-evaluation documents and other relevant documentation listed above, meeting Centre staff, and investigating customer and consumer perceptions. The Panel considered processes and operations, human input and work environment and material resources at the Ganolfan.

4.1 Strengths identified:

Canolfan Bedwyr is to be commended for the following:

[1] Canolfan Bedwyr’s provision makes a valuable and essential contribution to the success of the mission of the University of Wales, Bangor, and particularly its contribution to the “commitment to promote the language, culture, health and economy of Wales, in partnership with the local community.”

[2] The Canolfan is recognized as a leader in its range of work and contribution to the Welsh language. The increase in breadth of provision and developments in new technologies to support Welsh language use at the University and other public bodies is applauded.

[3] A competent leadership within the Canolfan successfully manages the integration of the units and has been successful in creating effective teams and systems of operation. The regular meetings of the Canolfan’s Management Committee provide direction and guidance that is both innovative and responsive to change. The ‘open door’ policy of the Director is warmly welcomed by all staff.

[4] The staff are congratulated for their considerable dedication, professionalism and devotion to their work. Their sense of engagement and enthusiasm for their work, coupled with conscientious effort and team spirit, ensures the success of the service.

[5] Appropriate managerial systems have been developed to support the recent growth in provision and increase in staff. The systems are welcomed by staff and are considered to be relevant and sympathetic to the ethos and nature of the work.

[6] The Canolfan has provided guidance and leadership in developing the University’s Welsh Language Scheme. The enthusiasm and commitment in implementing and monitoring the scheme’s requirements and leading the University in submitting a revised edition this year is worthy of high praise.

[7] The e-Gymraeg Unit is congratulated for its willingness to respond to new initiatives and projects and for its instrumental contribution to the development of electronic Welsh-medium work within new, exciting and emergent fields.

[8] The ‘Uned Ddysgu a Gloywi Iaith’ is congratulated for its continued work in supporting the increasing use of the Welsh Language by both staff and students in the University. The quality of the teaching has secured high praise from amongst both staff and students.

[9] The Translation Unit is congratulated for its continued specialist and dedicated response to the ever increasing range of work requiring translation. The speed of response and professionalism in all aspects of the translation work has resulted in much positive feedback from customers.

[10] The ‘Cymraeg Clir’ Unit and the recent development of a ‘Cynllun Partneriaid i Cymraeg Clir’ are applauded. The Unit provides an important function and service to outside bodies.

[11] The recent initiative of introducing a Staff Performance Review system which meets with the requirements of the new University system for support staff, is a welcome addition to the management systems within the Canolfan.

[12] Canolfan Bedwyr is to be praised for the quality of the self-evaluation and supporting documentation prepared for the internal quality audit.

[13] The range of questionnaires and feedback forms used by the Canolfan and submitted for the audit demonstrates that customer service is an important feature of the services under review. The feedback received from customers is uniformly positive and appreciative, and suggests high standards of service.

4.2 Recommendations

[a] For the Ganolfan


The recent re-location of Canolfan Bedwyr to its current accommodation has proved a positive feature in cementing the ethos and team spirit evident within the department. Throughout the audit visit the panel recognized that the appointment process for temporary staff had significant implications for the future development of the Ganolfan’s activities.

Many of the points listed below are areas that Canolfan Bedwyr is currently addressing. We wish to encourage the Ganolfan to engage in further discussion and action planning on the following:

4.2.1 Processes and Operations

[1] The e-Welsh Unit has been highly successful in creating Welsh-medium software. In discussing the current work with staff during the audit, the importance of recording and documenting software development and procedures was highlighted.

Recommendation: The e-Welsh Unit is encouraged to continue to document software development and procedures. Consideration should also be given to introducing systems to ensure that all new software procedures are documented following the successful completion of a product.

[2] Historically the Academic Translator has been responsible for providing secretarial support to the School of Welsh-medium Studies. Due to the increasing workload within the Translation Unit other methods of fulfilling this function should be considered.

Recommendation: The Director in consultation with the Registrar is invited to consider whether the current designated secretarial support to the School of Welsh-medium Studies and the Welsh-medium Studies Task Group should be transferred from Canolfan Bedwyr to elsewhere in the University.

[3] Recent examples of departments appointing their own translators as a result of the increase in bilingual provision, coupled with the occasional use made by the Translation Unit of external translators, offer new opportunities for the Translation Unit to formalise working relationships. The need to monitor the consistency and level of translation work for quality assurance purposes was highlighted. Closer working relationships could also result in increased dialogue in sharing good practice.

Recommendation: The Translation Unit is invited to consider ways of formalising working relationships with translators within and outside the University. Arrangements should encourage the monitoring of the levels and consistency of work but also offer opportunities for networking and sharing good practice.


4.2.2 Human Inputs

[4] Positive working relationships between the Director and staff members were clearly evident to the Panel. Due to the individual nature of the workload within the ‘Uned Ddysgu a Gloywi Iaith’, further consideration should be given to strengthening levels of support.

Recommendation: The Director is invited to consider ways by which the staff member in the Unit may be offered further support and guidance for both personal and academic development.

[5] The opportunities for the staff within the Translation Unit to increase their expertise in using the ‘Wordfast’ system and sharing personal translation dictionaries have been limited.

Recommendation: Training opportunities for Translation Unit staff in using ‘Wordfast’ should be explored. Such training would facilitate greater use of the system by individuals and allow for the sharing of good practice amongst colleagues.

4.2.3 Contexts/Working Environments and Material Resources

[6] Issues relating to the level of ‘externality’ used by the Canolfan within current managerial systems were considered briefly. The Panel concluded that the Ganolfan itself should determine whether such support would be of benefit.

Recommendation: Canolfan Bedwyr is invited to consider the level of externality currently provided and whether additional expertise in supporting and developing the work of the Ganolfan in the future would be of any benefit.

(Note: After one year from the Audit, the Registrar’s Office will request a report from the Ganolfan as to how it has addressed each of the above issues).

[b] For the University:

The University is recommended to consider the following in enhancing its operations and quality assurance systems.

[1] Due to the nature of the work of the e-Welsh Unit, significant difficulties have been experienced over the years in trying to secure permanent contracts for staff members. The staff within this unit are highly skilled and expert in new technological advances. Staff are constantly working under a fear of unemployment. Securing permanent contracts for a number of staff would benefit the University in the long term, as this would enable future planning and development and possibly support the University’s RAE submission.

Recommendation: The Human Resources Department is encouraged to engage in discussions with the Director of Canolfan Bedwyr to try to secure as soon as possible permanent contracts for staff within the e-Welsh Unit.

[2] The implementation and monitoring of the Welsh Language Scheme currently represents a third of the Director’s workload. The recent report from the Welsh Language Board on the University’s Welsh Language Scheme highlighted concerns as to the future levels of support provided for the implementation of the scheme. During the audit discussions the Panel considered in detail arguments for and against the appointment of a Language Officer and concluded that such an appointment was necessary to build on and to ensure future success.

Recommendation: The Panel wishes to encourage the Executive to agree to the appointment of a Language Officer who would be responsible for implementing and monitoring the University’s Language Scheme.

[3] The importance of designating the Welsh-medium portfolio to a Pro-Vice-Chancellor was clearly evident to the Panel. The significant contribution made by Canolfan Bedwyr to Welsh-medium activities at the University requires clear and direct lines of communication into the Executive. With the announcement of the retirement of the current Pro-Vice-Chancellor (Welsh-medium), the Panel was anxious to ensure that future arrangements build on the current foundations, so as to secure strong representation of Welsh-medium activities within the decision-making processes of the University.

Recommendation: The Registrar is invited to secure the future arrangements of the Welsh-medium Portfolio within the Executive.

[4] The innovative and expert work of the ‘e-Welsh Unit’ should be explored in detail to clarify whether it can be included in the University’s RAE submission.

Recommendation: The RAE Task Group is invited to consider whether the current work of the e-Welsh Unit is of sufficient standard to be included in the University’s RAE submission.

[5] The difficulties of promoting the relevance of the Gloywi Iaith courses to students at the University were explored during the audit. The possibility of including within the role of a ‘Language Officer’ the need to market these courses, and to develop good working relationships with academic departments in order to promote them, should be investigated further.

Recommendation: The Welsh-medium Studies Task Group is invited to consider ways of further promoting and marketing the purpose and importance of Gloywi Iaith courses for students studying through the medium of Welsh.

Signed:………………………………… Date:…………………

(External Assessor)

Signed:………………………………. Date:…………………..

(Panel Chair)