
D6.1.1: First report on scientific evaluations

EML, XEROX, UPVLC, RWTH

Distribution: Public

transLectures: Transcription and Translation of Video Lectures

ICT Project 287755 Deliverable D6.1.1

October 31, 2012

Project funded by the European Community under the Seventh Framework Programme for Research and Technological Development.

Page 2: EML, XEROX, UPVLC, RWTH Distribution: Public · 2017-04-21 · D6.1.1: First report on scienti c evaluations EML, XEROX, UPVLC, RWTH Distribution: Public trans Lectures Transcription

Project ref no.: ICT-287755
Project acronym: transLectures
Project full title: Transcription and Translation of Video Lectures
Instrument: STREP
Thematic Priority: ICT-2011.4.2 Language Technologies
Start date / duration: 01 November 2011 / 36 Months

Distribution: Public
Contractual date of delivery: October 31, 2012
Actual date of delivery: November 18, 2012
Date of last update: October 31, 2012
Deliverable number: D6.1.1
Deliverable title: First report on scientific evaluations
Type: Report
Status & version: Final
Number of pages: 26
Contributing WP(s): WP6
WP / Task responsible: EML
Other contributors: XEROX, UPVLC, RWTH
Internal reviewer: Jorge Civera and Alfons Juan
Author(s): EML, XEROX, UPVLC, RWTH
EC project officer: Susan Fraser

The partners in transLectures are:

Universitat Politecnica de Valencia (UPVLC)
XEROX Research Center Europe (XRCE)
Josef Stefan Institute (JSI)
Knowledge for All Foundation (K4A)
RWTH Aachen University (RWTH)
European Media Laboratory GmbH (EML)
Deluxe Digital Studios Limited (DDS)

For copies of reports, updates on project activities and other transLectures-related information, contact:

The transLectures Project Coordinator
Alfons Juan, Universitat Politecnica de Valencia
Camí de Vera s/n, 46018 Valencia, Spain
[email protected]
Phone: +34 699-307-095, Fax: +34 963-877-359

Copies of reports and other material can also be accessed via the project's homepage: http://www.translectures.eu

© 2012, The Individual Authors

No part of this document may be reproduced or transmitted in any form, or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission from the copyright owner.


Executive Summary

This deliverable contains a summary of the scientific evaluations, in terms of automatic evaluation metrics such as WER, BLEU and TER, on the transcription and translation development and test sets defined for conventional experimental evaluation.

Contents

1 Introduction

2 Experimental setup
  2.1 Definition of transcription experiments
    2.1.1 English transcriptions
    2.1.2 Slovenian transcriptions
    2.1.3 Spanish transcriptions
  2.2 Definition of translation experiments
    2.2.1 English↔Spanish translation
    2.2.2 English into French translation
    2.2.3 English into German translation
    2.2.4 English↔Slovenian translation

3 VideoLectures.NET
  3.1 Transcription quality
    3.1.1 English (RWTH)
    3.1.2 English (EML)
    3.1.3 Slovenian (RWTH)
    3.1.4 Slovenian (EML)
  3.2 Translation quality
    3.2.1 English into Spanish (UPVLC)
    3.2.2 English into French (XEROX)
    3.2.3 English into German (RWTH)
    3.2.4 English↔Slovenian (RWTH)

4 poliMedia
  4.1 Transcription quality
    4.1.1 Spanish (UPVLC)
    4.1.2 Spanish (RWTH)
    4.1.3 Spanish (EML)
  4.2 Translation quality (UPVLC)

5 Conclusions

A Acronyms


1 Introduction

This deliverable constitutes the first report on scientific evaluations related to Task 6.1. The aim of this task is to measure transLectures progress in terms of current transcription and translation quality. Standard evaluation metrics such as WER, BLEU and TER are reported on the development and test sets defined for conventional experimental evaluation. More details about models and techniques can be found in deliverable D3.1.1.
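For reference, these metrics follow their standard definitions (a recap of the usual formulations, not restated from the deliverable itself):

\[
\mathrm{WER} = \frac{S + D + I}{N},
\qquad
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\Bigl(\sum_{n=1}^{4} \tfrac{1}{4}\log p_n\Bigr),
\qquad
\mathrm{TER} = \frac{\#\,\text{edits (incl. shifts)}}{\#\,\text{reference words}},
\]

where \(S\), \(D\) and \(I\) are the substitutions, deletions and insertions against a reference of \(N\) words, \(p_n\) are modified n-gram precisions, and BP is the brevity penalty [16, 18]. Lower WER and TER, and higher BLEU, indicate better quality.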

This deliverable is structured as follows. First, development and test sets are defined for transcription and translation on VideoLectures.NET and poliMedia in Section 2. Then, scientific evaluations are presented on VideoLectures.NET and poliMedia in Sections 3 and 4, respectively. Finally, conclusions are drawn in Section 5.


2 Experimental setup

This section describes how development and test sets were selected for both tasks, transcription and translation, and for both repositories, poliMedia and VideoLectures.NET. Training materials include those planned in the DoW and those additionally incorporated to improve ASR and MT system performance.

This selection allows the consortium to carry out evaluation experiments according to transLectures goals. Furthermore, in order to guarantee that evaluation measures are comparable among partners, common preprocessed reference transcription and translation files for development and test were defined.

2.1 Definition of transcription experiments

2.1.1 English transcriptions

The evaluation of English transcriptions was carried out on a subset of the 20 hours manually transcribed from VideoLectures.NET in Task 2.2 of WP2 (see details in deliverable D2.1).

The proposed experiments follow the usual training-development-test scheme. A subset of categories from the available complete set has been considered for development and testing purposes. Video lectures for development and test were selected according to the following criteria:

1. Availability of human translations.

2. Availability of presentation slides.

3. Availability of enough material for speaker adaptation.

Lectures selected for the development set. Table 1 shows a summary of the selected lectures. Further information about the author, the title and category-related material in the training set follows:

1. id: 6789. Frederic Barbaresco. Applications of Information Geometry to Radar Signal Processing.

Training material in the same category: 2 speakers (1 male, 1 female), duration: 0:50:14.

2. id: 3804. Darko Markovec. Structural properties of magnetic nanoparticles.

Training material in the same category: 1 speaker (female), duration: 0:16:02.

3. id: 9977. Eva Jablonka. Evolution in Four Dimensions.

Training material in the same category: 1 speaker (female), duration: 0:33:14.

4. id: 3731. Samer A. Abdallah. Information Dynamics and the Perception of Temporal Structure in Music.

Training material in the same category: 1 speaker (male), duration: 0:11:45.


Table 1: Summary of English development lectures for transcription experiments.

id    Duration   Translated into    Category    Gender  Non-native
                 fr  ge  sp  sl                         speaker

6789  0:47:30    ✓   ✓   ✓   ✓      Physics     Male    ✓
3804  0:24:54    ✓   ✓   ✓   ✓      Technology  Male    ✓
9977  1:03:41    ✓   ✓   ✓   ✓      Biology     Female
3731  0:55:09    ✓   ✓   ✓          Arts        Male

Lectures selected for the test set. Table 2 shows a summary of the selected lectures. Further information about the author, the title and category-related material in the training set follows:

1. id: 5123. Yoav Andrew Leitersdorf. Early-stage venture capital investing; shaping successful high tech startup companies towards accelerated exits.

Training material in the same category: 3 speakers (1 male, 2 female), duration: 1:19:19.

2. id: 3698. Eric Moulines. MCMC, SMC, ... What next?

Training material in the same category: 2 speakers (2 male), duration: 0:38:19.

3. id: 2302. Blazenka Divjak. Gender issue in ICT: Dealing with educational obstacles in mathematical education for ICT.

Training material in the same category: 1 speaker (male), duration: 0:22:23.

4. id: 4948. Mike Culver. Amazon Web Services.

Training material in the same category: 1 speaker (male), duration: 0:53:15.

Table 2: Summary of English test lectures for transcription experiments.

Id    Duration   Translated into    Category     Gender  Non-native
                 fr  ge  sp  sl                          speaker

5123  0:42:55    ✓   ✓   ✓   ✓      Business     Male    ✓
3698  0:37:32    ✓   ✓   ✓   ✓      Mathematics  Male    ✓
2302  0:58:23    ✓   ✓   ✓   ✓      Society      Female
4948  1:06:40    ✓   ✓   ✓   ✓      Computers    Male

Training material as planned in the DoW.

a) Acoustic models: VideoLectures.NET lectures not included in the development and test sets (13.4 hours) and EPPS (92 hours).

b) Language models: the textual material of the previous item and the following additional corpora: the Scientext corpus, the EuroParl corpus, the JRC-Acquis corpus, and the OPUS-OpenSubs corpus.


Additional training material employed in the experiments.

a) Acoustic models: broadcast news and broadcast conversational dataset 1 (BNBC1, 100 hours), broadcast news and broadcast conversational dataset 2 (BNBC2, 170 hours), and the HUB4 and TDT4 corpora (400 hours).

b) Language models: the textual material of the previous item and various data sources accounting for 3.5G running words.

2.1.2 Slovenian transcriptions

The evaluation of Slovenian transcriptions was carried out on a subset of the 35 hours manually transcribed from VideoLectures.NET in Task 2.2 of WP2 (see details in deliverable D2.1). A subset of categories was also selected for development and test data. Slovenian video lectures were selected according to the following criteria:

1. Availability of presentation slides.

2. Availability of enough material for speaker adaptation.

Lectures selected for the development set. Table 3 shows a summary of the selected lectures. Further information about the author, the title and category-related material in the training set follows:

1. id: 11063. Matevz Leskovsek. Nevrolingvisticno programiranje, oziroma kaj nas sodobna znanost lahko nauci o denarju in ljudeh (Neuro-linguistic programming, or what modern science can teach us about money and people).

Training material in the same category: 4 speakers (4 male), duration: 4:10:32.

2. id: 3319. Janez Kos. Odkrivanje planetov izven nasega osoncja (Discovering planets outside our solar system).

Training material in the same category: 1 speaker (male), duration: 0:53:18.

3. id: 12208. Spela Stres. Pregled procesa komercializacije tehnologij in obveznosti raziskovalca do intelektualne lastnine (An overview of the technology commercialisation process and the researcher's obligations regarding intellectual property).

Training material in the same category: 4 speakers (3 male, 1 female), duration: 1:54:33.

4. id: 10021. Joze Mecinger. Zasvojenost z gospodarsko rastjo na planetu z omejenimi naravnimi viri / Addiction to Economic Growth on a Planet with Limited Natural Resources.

Training material in the same category: 1 speaker (male), duration: 0:46:28.

Table 3: Summary of Slovenian development lectures for transcription experiments.

Id Duration Category Gender

11063  1:09:58  CS         Male
3319   0:56:04  Physics    Male
12208  0:26:57  Law        Female
10021  0:46:28  Economics  Male


Lectures selected for the test set. Table 4 shows a summary of the selected lectures. Further information about the author, the title and category-related material in the training set follows:

1. id: 12616. Andrej Gogala. Biotska raznovrstnost Krasu (Biodiversity of the Karst). Category: Biology. Duration: 1:02:39. Male speaker.

Training material in the same category: 2 speakers (2 male), duration: 2:07:50.

2. id: 7495. Polona Gantar. Leksikalna baza: vse, kar ste vedno zeleli vedeti o jeziku (The lexical database: everything you always wanted to know about language).

Training material in the same category: 4 speakers (2 male, 2 female), duration: 1:48:20.

3. id: 2354. Mario Orasche. Gamma Capital Partners.

Training material in the same category: 1 speaker (male), duration: 1:44:03.

4. id: 15191. Ciril Horjak. Stripovski algoritmi (Comic-strip algorithms).

Training material in the same category: 4 speakers (3 male, 1 female), duration: 2:54:03.

Table 4: Summary of Slovenian test lectures for transcription experiments.

Id Duration Category Gender

12616  1:02:39  Biology     Male
7495   0:41:34  Humanities  Female
2354   0:50:36  Business    Male
15191  0:47:44  Arts        Male

Training material as planned in the DoW.

a) Acoustic models: the lectures not included in the development and test sets (28.3 hours) and unprocessed EPPS (700 hours).

b) Language models: the textual material of the previous item and the following additional corpora: the JRC-Acquis corpus, the OPUS-OpenSubs corpus and the IJS-ELAN corpus.

2.1.3 Spanish transcriptions

The evaluation of Spanish transcriptions was carried out on a subset of the 106 hours manually transcribed from poliMedia (see details in deliverable D2.1). The proposed experiments follow the usual training-development-test scheme. In contrast to VideoLectures.NET, poliMedia lectures are classified into courses. Thus, lectures for development and test were selected taking into account this course-based organisation:

• Videos were selected so that a complete course was included.

• Speakers are not present in the training set.

• Gender was not taken into account.

Lectures selected for the development set. Table 5 shows a summary of the lectures selected in poliMedia.


Table 5: Summary of Spanish development lectures for transcription experiments.

Course Duration Category Gender

00501-Profesores POLIMEDIA II/M46  1:24:00  Law          Male
Profesores POLIMEDIA/M74           0:48:00  Statistics   Female
Profesores POLIMEDIA/M26           0:42:00  Graphics     Female
00501-Profesores POLIMEDIA II/M54  0:36:00  Geolocation  Male
Profesores POLIMEDIA I/M40         0:18:00  Botanics     Male

Table 6: Summary of Spanish test lectures for transcription experiments.

Course Duration Category Gender

30000-Quimica de productos naturales/M01  1:12:00  Chemistry      Female
Profesores POLIMEDIA/M62                  0:48:00  Marketing      Male
00505-Profesores Alcoy/M03                0:42:00  Environment    Male
Profesores POLIMEDIA/M67                  0:30:00  Comp. Science  Male
Profesores POLIMEDIA/M47                  0:18:00  Urbanism       Female

Lectures selected for the test set. Table 6 shows a summary of the lectures selected in poliMedia.

Training material as planned in the DoW.

a) Acoustic models: the courses not included in the development and test sets (99 hours) and EPPS (62 hours).

b) Language models: the textual material of the previous item and the following additional corpora: the EuroParl corpus, the JRC-Acquis corpus, and the OPUS-OpenSubs corpus.

Additional training material employed in the experiments.

a) Acoustic models: QUAERO (170 hours, news and podcasts) and Hub4 (30 hours, American Spanish).

b) Language models: the textual material of the previous item, the Spanish Gigaword corpus (1G running words), QUAERO (600M running words), Spanish Google n-grams [13] and various data sources accounting for 654M running words.

2.2 Definition of translation experiments

Test and development sets for translation experiments were the same as those defined for transcription experiments. In the case of VideoLectures.NET, these experiments involved translating from English into Spanish, German, French and Slovenian, and from Slovenian into English. In the case of poliMedia, the task was to translate from Spanish into English.

Language models are those described in Section 2.1. Below we describe the training material employed to train the different translation models.


2.2.1 English↔Spanish translation

Training material as planned in the DoW.

a) VideoLectures.NET (4.5K bilingual sentences)

b) poliMedia (4.0K bilingual sentences)

c) Europarl (1.7M bilingual sentences)

Additional training material employed in the experiments.

a) TED corpus (144K bilingual sentences)

b) United Nations (11M bilingual sentences)

2.2.2 English into French translation

Training material as planned in the DoW.

a) VideoLectures.NET (3.2K bilingual sentences)

b) Europarl (1.7M bilingual sentences)

c) JRC-Acquis (1.3M bilingual sentences)

Additional training material employed in the experiments.

a) WIT-3 (144K bilingual sentences)

b) WMT-Newstext (136K bilingual sentences)

c) COSMAT (55K bilingual sentences)

d) Conference lectures (3.2K bilingual sentences)

2.2.3 English into German translation

Training material as planned in the DoW.

a) VideoLectures.NET (4.0K bilingual sentences)

Additional training material employed in the experiments.

a) TED corpus (133.8K bilingual sentences)

2.2.4 English↔Slovenian translation

Training material as planned in the DoW.

a) VideoLectures.NET (14.6K bilingual sentences)

b) JRC-Acquis (1.1M bilingual sentences)

Additional training material employed in the experiments.

a) TED corpus (9.7K bilingual sentences)


3 VideoLectures.NET

3.1 Transcription quality

3.1.1 English (RWTH)

For the English language, RWTH started from a baseline system which was trained on broadcast news and broadcast conversational data comprising 100 hours. In addition, 100 hours of EPPS data were used. The baseline language model was trained on the corpus described in Table 7 using Kneser-Ney smoothing. This system achieved the word error rate (WER) shown in Table 8.

Table 7: Statistics of the baseline English text corpus to train language models in the RWTH ASR system.

Sentences Words Vocabulary Dev OOV Dev PPL Test OOV Test PPL

217.8M 4.5G 150K 2.8% 175 1.0% 141

Table 8: WER results for the baseline English RWTH ASR system on the development and test sets.

                 Dev   Test

Baseline (VTLN)  59.5  55.5
+CMLLR           44.8  36.1
+LM rescoring    43.0  35.9

While the first-pass results include Vocal Tract Length Normalization (VTLN), the second pass also applies Constrained Maximum Likelihood Linear Regression (CMLLR) as a more advanced form of speaker adaptation.
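For the reader's convenience, CMLLR estimates a single affine transform of the feature vectors per speaker (or speaker-like cluster), chosen to maximise the likelihood of the adaptation data under the acoustic model; in its usual formulation (standard in the literature, not spelled out in this report):

\[
\hat{\mathbf{x}}_t = \mathbf{A}\mathbf{x}_t + \mathbf{b},
\qquad
(\mathbf{A},\mathbf{b}) = \operatorname*{argmax}_{(\mathbf{A},\mathbf{b})} \sum_{t=1}^{T} \log p\bigl(\mathbf{A}\mathbf{x}_t + \mathbf{b} \mid \lambda\bigr) + T\log\lvert\det\mathbf{A}\rvert,
\]

where \(\lambda\) denotes the acoustic model and \(T\) the number of adaptation frames.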

In the first two decoding passes, only a pruned language model is used. Therefore, in a third decoding step, the word lattices from the previous decoding stage are rescored with the language model probabilities obtained from the full, unpruned model (LM rescoring). Finally, confusion network decoding was applied to the rescored lattices.
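As an illustration of this rescoring step, here is a minimal sketch over an n-best list rather than full lattices; `full_lm_logprob` and the hypothesis/score tuples are hypothetical stand-ins, and the LM scale is a tunable assumption:

```python
from typing import Callable, List, Tuple

def rescore_nbest(
    nbest: List[Tuple[str, float]],           # (hypothesis, score with pruned-LM term removed)
    full_lm_logprob: Callable[[str], float],  # log-probability under the full, unpruned LM
    lm_scale: float = 12.0,                   # LM scale, tuned on the development set
) -> str:
    """Re-rank hypotheses after swapping in the full language model."""
    best_hyp, best_score = "", float("-inf")
    for hyp, am_score in nbest:
        score = am_score + lm_scale * full_lm_logprob(hyp)
        if score > best_score:
            best_hyp, best_score = hyp, score
    return best_hyp
```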

RWTH obtained large improvements in WER by including additional acoustic training data from broadcast news and broadcast conversational data as well as the HUB4 and TDT4 corpora (see statistics in Table 9). In addition, the English transcriptions of the transLectures project were added to the LM training data as a source of in-domain data. Results for the improved system can be seen in Table 10.

Table 9: Statistics of the complete English speech corpus to train acoustic models in the RWTH ASR system.

Duration Sentences Words Vocabulary

763h 958K 10.3M 71K

3.1.2 English (EML)

EML has to carefully balance the use of computing resources (CPU time and memory) against recognition accuracy in its scalable transcription platform (EML Transcription Platform), so as to provide speech recognition technology for 7x24 services to its customers.


Table 10: WER results for the English RWTH ASR system trained on the complete English speech corpus.

                 Dev   Test

Baseline (VTLN)  44.5  34.0
+CMLLR           39.4  28.4
+LM rescoring    38.2  27.9

In the course of the transLectures project, EML has to provide transcripts of several thousand hours of video lectures. This fact imposes a significant computational load and determines some of the design principles behind EML's (acoustic and language) model training and recognition strategy, amongst which are:

• The use of plain MFCC features and fairly low-dimensional feature vectors for fast feature extraction and labelling.

• Audio-streaming based on a coarse, model-based speech/silence segmentation.

• A simple two-pass recognition strategy without, for example, an additional rescoring with a huge and powerful language model.

• Parameter settings that limit the search efforts during recognition.

While EML has integrated all the advanced modelling and recognition techniques needed to host the project partners' more sophisticated models, the investigation of methods for a significant reduction of resource consumption remains an ongoing topic in EML's work.

EML developed an ASR system which integrated acoustic models trained on the material planned in the DoW plus 330 hours from an in-house English speech database. The total text corpus size is approximately 350M running words, with a vocabulary of 235K words.

Table 11 shows WER results on the development and test sets for a two-pass decoding strategy with online CMLLR adaptation, obtained for an operating point of approximately 7xRT (i.e. the system needs 7 minutes to transcribe 1 minute of audio). This system was employed to automatically transcribe 8,000 lectures from the VideoLectures.NET repository.

Table 11: WER results for the English EML ASR system.

Dev Test

Baseline + CMLLR (2nd pass) 57.2 44.0

From unsupervised adaptation of the acoustic model with lectures in the development and test sets, we observe a 15% relative improvement in WER. Furthermore, by adapting the language model with data from the accompanying slides, we obtain a relative WER improvement of more than 10% on the test set. More details on these results can be found in deliverable D3.1.1.

3.1.3 Slovenian (RWTH)

For the speech recognition experiments RWTH used its own toolkit, RASR. The phoneme set contains 39 speech and 3 non-speech phonemes. The acoustic training was performed on 34 lectures with a total duration of approx. 27 hours of speech. The underlying Hidden Markov Model (HMM) has a three-state left-to-right topology. The number of allophone states was reduced to 4,500 generalized triphone states by tying with a Classification and Regression Tree (CART), and an LDA transform was applied to a sliding window of 9 frames. The final acoustic model consists of approximately 440K Gaussians.

The evaluation was performed on the development and test sets. The speech corpus statistics are given in Table 12.

Table 12: Statistics of the Slovenian speech corpus to train acoustic models in the RWTH ASR system.

Duration Sentences Words Vocabulary

27.0h 11.8K 208K 26K

A second system was trained in the same fashion as the baseline system on MFCC features warped for VTLN.

A third system was trained by means of Speaker Adaptive Training (SAT) with CMLLR. The feature vectors were clustered in an unsupervised manner to obtain "speaker-like" clusters; no speaker information from the provided manual transcriptions was used.
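A minimal sketch of such unsupervised "speaker-like" clustering, assuming scikit-learn; representing segments by their mean feature vector and the number of clusters are illustrative choices, not taken from the report:

```python
import numpy as np
from sklearn.cluster import KMeans

def speaker_like_clusters(segment_features, n_clusters=8, seed=0):
    """segment_features: list of (frames, dim) arrays, one per audio segment."""
    # Represent each segment by the mean of its frame-level feature vectors.
    means = np.stack([seg.mean(axis=0) for seg in segment_features])
    # Cluster segments; CMLLR transforms are then estimated per cluster.
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(means)
    return labels
```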

During recognition, a 4-gram language model (LM), described in Table 13, was used in all systems.

Table 13: Statistics of the Slovenian text corpus to train language models in the RWTH ASR system.

Sentences Words Vocabulary Dev OOV Dev PPL Test OOV Test PPL

1.6M 74.7M 400K 2.4% 468 4.0% 692

The transcription obtained from the first-pass recognition with the VTLN system was used to perform an unsupervised adaptation with CMLLR. The results of the second-pass recognition performed with the SAT system on the CMLLR-adapted features are given in Table 14.

Table 14: WER results for the Slovenian RWTH ASR system.

                                     Dev   Test

Baseline (MFCC)                      38.8  59.2
VTLN                                 38.6  57.6
SAT (CMLLR in training)              37.8  57.2
SAT (CMLLR in training+recognition)  35.3  47.3

While on the development data the relative improvement from CMLLR is in the usual range of about 10%, the gain is much larger on the test data. This indicates a mismatch between the acoustic conditions of the training and test data, which is normalized by the CMLLR transform.

Such a mismatch also exists at the level of the language model, partly because of the wide spectrum of topics covered in the lectures. As a result, the perplexity on the test data (692) is much larger than that on the development data (468).
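For reference, the perplexity (PPL) quoted here is the standard per-word measure for an n-gram model (not restated in the report):

\[
\mathrm{PPL} = \exp\Bigl(-\frac{1}{N}\sum_{i=1}^{N}\log p\bigl(w_i \mid w_{i-n+1},\ldots,w_{i-1}\bigr)\Bigr),
\]

so a test-set value of 692 means the model is, on average, as uncertain as a uniform choice among 692 words at each position.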

Moreover, the lecturers speak freely ("spontaneous speech" as opposed to "prepared" or "scripted" speech), while the language model is trained on written texts like books and articles. Lastly, Slovene is characterized by a large number of dialects, which makes the automatic transcription of lectures with many different speakers a difficult task.


3.1.4 Slovenian (EML)

EML is also responsible for the automatic transcription of VideoLectures.NET lectures in Slovenian. To this purpose, EML is hosting the Slovenian acoustic and language models developed by RWTH and balancing computing resources against recognition accuracy. Table 15 presents WER results on the development and test sets for Slovenian in the EML Transcription Platform using RWTH's models.

Table 15: Results on the development and test sets for RWTH's Slovenian models in the EML Transcription Platform.

Dev Test

Baseline + CMLLR (2nd pass) 34.0 46.0

3.2 Translation quality

3.2.1 English into Spanish (UPVLC)

Translation experiments from English into Spanish were performed with the well-known phrase-based SMT (PBSMT) approach as implemented in the Moses toolkit [6], using out-of-domain and in-domain corpora. In-domain corpora included the poliMedia corpus and the TED corpus from the IWSLT 2012 evaluation. The main statistics of the training, development and test sets of the in-domain corpora are shown in Table 16. It should be noticed that the poliMedia development and test sets are only included as training material when evaluating on VideoLectures.NET; the same happens with the VideoLectures.NET development and test sets when evaluating on poliMedia. Out-of-domain corpora involved the Europarl and UN corpora (see statistics in Table 17). These large out-of-domain corpora were filtered by applying sentence selection [4] before performing the translation process.
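As an illustration of sentence selection, the sketch below implements a cross-entropy-difference filter in the spirit of this step; note that the exact criterion of [4] differs, and `in_domain_xent` / `general_xent` are hypothetical per-word cross-entropy scorers:

```python
def select_sentences(bitext, in_domain_xent, general_xent, threshold=0.0):
    """Keep bilingual pairs whose source side looks in-domain."""
    kept = []
    for src, tgt in bitext:
        # More negative difference => source sentence is closer to the in-domain LM.
        score = in_domain_xent(src) - general_xent(src)
        if score < threshold:
            kept.append((src, tgt))
    return kept
```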

Table 16: Basic statistics of in-domain English→Spanish corpora.

Corpus                   Sentences  Words         Vocabulary
                                    En     Es     En     Es

TED                      144K       2.6M   2.45M  46.4K  67.9K
VideoLectures.NET Train  2142       58K    54.2K  3.7K   5.0K
VideoLectures.NET Dev    1013       28.6K  27.4K  6.2K   2.7K
VideoLectures.NET Test   1360       36.1K  33.5K  6.2K   3.3K
poliMedia Train          1529       40.2K  40.3K  3.86K  4.71K
poliMedia Dev            1401       38.7K  37.8K  3.66K  3.51K
poliMedia Test           1139       32.1K  32.1K  3.26K  4.08K

Table 17: Basic statistics of out-of-domain English→Spanish corpora.

Corpus    Genre                       Sentences  Words         Vocabulary
                                                 En     Es     En     Es

Europarl  Parliament proceedings      1.73M      40.6M  42.3M  98.4K  149K
UN        United Nations resolutions  11M        298M   340M   55.6K  57.1K

A Spanish language model was trained on diverse bilingual and monolingual external corpora, Google counts [13] and the in-domain corpora. Table 18 summarises the most important statistics of the external corpora. Individual 4-gram language models with modified Kneser-Ney discount [8] and interpolation smoothing were computed for each corpus using the SRILM toolkit [21]. Individual models were linearly interpolated to minimize the perplexity of the in-domain development set [7].
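A minimal sketch of how such interpolation weights can be estimated with EM on held-out data (the criterion of [7]); `lms` is a hypothetical list of functions returning \(p(w \mid h)\):

```python
def em_mixture_weights(lms, heldout, iters=20):
    """heldout: iterable of (word, history) events from the development set."""
    k = len(lms)
    w = [1.0 / k] * k
    for _ in range(iters):
        counts = [0.0] * k
        for word, hist in heldout:
            joint = [w[i] * lms[i](word, hist) for i in range(k)]
            z = sum(joint)
            for i in range(k):
                counts[i] += joint[i] / z   # posterior responsibility of model i
        total = sum(counts)
        w = [c / total for c in counts]     # re-estimated mixture weights
    return w
```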

Table 18: Basic statistics of the external corpora involved in the generation of the Spanish language model.

Corpus Genre Sentences Words Vocabulary

EPPS             Parliament speeches         0.1M   0.9M    27K
News Commentary  News                        0.2M   4.6M    174K
TED              Science for laymen          0.3M   2.3M    133K
Europarl v7      Parliament proceedings      2.1M   55M     439K
El Periodico     News                        2.7M   45M     916K
News (07-11)     News                        8.6M   217M    2.9M
United Nations   United Nations resolutions  9.5M   253M    1.6M
UnDoc            EU sources                  10.0M  318M    1.9M

Total                                        33.5M  895.8M  148.3K

Table 19 summarises adaptation results in terms of BLEU and TER for English into Spanish translations. Translation and language models for the baseline system were trained on the in-domain training set and the Europarl corpus; however, the baseline language model is a 5-gram model smoothed with modified Kneser-Ney and absolute interpolation. As shown, baseline figures were slightly improved by using the linearly interpolated language model instead of the baseline language model. A significant improvement is obtained by bilingual sentence selection techniques; this improvement is even more notable when using the linearly interpolated language model. However, the best results were obtained by a log-linear combination of language models, the linearly interpolated model and the model resulting from sentence selection. In this latter case, 33.5 BLEU points were achieved.

Table 19: BLEU and TER results on the English→Spanish VideoLectures.NET development and test sets.

                           Dev          Test
                           BLEU  TER    BLEU  TER

Baseline                   33.4  46.9   30.9  48.7
linear interpolation       33.5  46.8   31.2  48.7
Selection                  35.7  44.4   32.4  47.2
+linear interpolation      37.4  43.5   33.3  46.4
+log-linear interpolation  37.6  43.1   33.5  46.1

3.2.2 English into French (XEROX)

XEROX was tasked with automatically translating English transcriptions of video lectures into French. As in Section 3.2.1, the Moses SMT toolkit was employed.

There is a large domain distance between the language found in lecture transcriptions and that found in the large, most frequently used training corpora for SMT, such as the Europarl corpus [10]. In order to train a PBSMT translation model which would perform well on the English transcription data, two directions were followed.


We first compiled a large, composite training collection, made up of an array of different training data corpora. Some of these corpora, like the Europarl, JRC-Acquis [19] and WMT-Newstext [2] corpora, were selected due to their large size, which allows us to train a model offering large coverage of the source-language sentences. Others, like the COSMAT corpus [11], the WIT3 corpus [3] and the transLectures training data manually created by our human translators, are smaller in size but contain data that are closer in style and genre to the transcriptions of scientific lectures. Tables 20 and 21 summarise the corpora that were employed in training the translation and language models, respectively.

Table 20: Basic information and statistics of English→French corpora.

Corpus             Genre                    Sentences  Words         Vocabulary
                                                       En     Fr     En      Fr

Europarl           Parliament proceedings   1M         26.4M  29.4M  63.5K   87.8K
JRC-Acquis         Legal                    1.2M       23.6M  25.7M  183.6K  191.9K
WIT-3              Science for laymen       145K       2.8M   3.0M   43.0K   58.8K
WMT-Newstext       News                     137K       3.4M   4.0M   44.1K   58.9K
COSMAT             Science for specialists  56K        1.4M   1.6M   31.4K   39.0K
VideoLectures.NET  Science for specialists  3.2K       77.1K  79.2K  5.2K    6.3K

Table 21: Basic statistics of the external corpora involved in the generation of the French language model.

Corpus             Sentences  Words  Vocabulary

Europarl           1M         31.7M  90.2K
JRC-Acquis         1.2M       41.0M  230.0K
WIT-3              145K       3.1M   59.8K
WMT-Newstext       137K       4.1M   59.8K
COSMAT             1.5M       38.2M  366.7K
VideoLectures.NET  3.2K       82.8K  6.4K

Table 22: BLEU and TER results on the English→French VideoLectures.NET development and test sets.

                      Dev          Test
                      BLEU  TER    BLEU  TER

Baseline              35.2  49.3   31.2  54.6
Domain-adapted model  37.2  47.6   32.2  53.5

Training on a combination of corpora, including close- or in-domain data, already delivers a very competitive baseline. However, when training on such a diverse collection of data, the phrasal translation options that are specific to the target domain, and which can be extracted from the close- or in-domain training collections, can be lost among the multitude of irrelevant options extracted from the out-of-domain corpora. In order to further increase performance, we also applied a domain-adaptation method that we developed as part of the transLectures project.

Our method focuses on tracking the relative contribution, in terms of source words translated, of each distinct training corpus in formulating candidate translation hypotheses. Furthermore, we register how closely each complete translation hypothesis matches the linguistic domain of each training set by means of multiple language models, each trained on target-language data from a different domain. We then use development data to train our system to use this information to arrive at translation hypotheses which better target the scientific lecture transcription domain.

We tuned and tested both the baseline and the domain-adapted model on the designated development and test sets, respectively. Table 22 presents the results of the automatic evaluation of the translations produced by the two models, using the BLEU [16] and TER [18] MT metrics. As measured by both metrics, our domain-adaptation method delivers better translations for the transLectures data. Additionally, using the significance test of [9], the BLEU score improvement offered by our method was shown to be statistically significant at the 1% level.
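For reference, the test of [9] is paired bootstrap resampling; a minimal sketch follows, where `metric` is a hypothetical corpus-level scorer (e.g. BLEU over hypothesis/reference lists):

```python
import random

def paired_bootstrap(hyps_a, hyps_b, refs, metric, samples=1000, seed=0):
    """Estimate the p-value for 'system A is not better than system B'."""
    rng = random.Random(seed)
    n, wins_a = len(refs), 0
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]   # resample sentences with replacement
        score_a = metric([hyps_a[i] for i in idx], [refs[i] for i in idx])
        score_b = metric([hyps_b[i] for i in idx], [refs[i] for i in idx])
        if score_a > score_b:
            wins_a += 1
    return 1.0 - wins_a / samples
```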

Further information on our domain-adaptation method and the evaluation methodology, as well as an in-depth comparison of the performance of variations of our method and results for a further language domain beyond lecture transcriptions, can be found in deliverable D3.1.1.

3.2.3 English into German (RWTH)

The English→German SMT system is based on the hierarchical phrase-based translation decoder which is part of the open-source toolkit Jane [23]. RWTH used GIZA++ [15] to word-align the bilingual data, from which the phrase table is extracted.

Table 23: BLEU and TER results on the English→German VideoLectures.NET development and test sets.

            Dev          Test
            BLEU  TER    BLEU  TER

Baseline    20.8  60.6   18.9  65.4
DWL         20.9  60.3   19.2  64.9
RNN slides  20.7  60.4   19.0  65.2

RWTH used two sources of training data. The baseline system is trained and optimized on the TED corpus from the IWSLT 2012 evaluation, since VideoLectures.NET comes from the same domain. In addition to this data, RWTH used the 4,019 transcribed sentences from the transLectures audio lectures. RWTH generated a language model from the target side of the bilingual data. The results for the development and test sets are shown in Table 23. In addition, the baseline system was refined by training a discriminative word lexicon (DWL) as in [12], using the slide information as the source side of the model. Thus, the translation model from English to German is adapted based on the content of the slide shown at each moment. First experiments showed some improvement over the baseline system. The third setup extended the baseline by reranking the 1,000 best translations using an additional score calculated by an artificial neural network. The network was trained using the aligned slide words, the aligned source words, and the previous target word.

Due to the tight schedule of the quality control evaluation, the English→German translations given to the human evaluators were produced with a translation system inferior to the one described here; this improved system was developed after the evaluation took place.

3.2.4 English↔Slovenian (RWTH)

The Slovenian→English and English→Slovenian SMT systems are based on the phrase-based translation decoder of the Jane toolkit. The language models are 4-gram LMs trained with the SRILM toolkit [22]. We use the standard set of models with phrase translation probabilities and lexical smoothing in both directions, word and phrase penalty, a distance-based reordering model, an n-gram target language model and three binary count features. The parameter weights are optimized with MERT [14].
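For reference, such a phrase-based system combines these models log-linearly; in the standard formulation (not spelled out in the report), the decoder searches for

\[
\hat{e} = \operatorname*{argmax}_{e} \sum_{m=1}^{M} \lambda_m\, h_m(e, f),
\]

where the \(h_m\) are the feature models listed above and MERT [14] tunes the weights \(\lambda_m\) to optimise an automatic error metric (typically BLEU) on the development set.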

RWTH used three sources of training data. The baseline system is trained and optimized on the JRC-Acquis corpus [19]. This corpus is not related to the lecture domain, but it was chosen due to its considerable size (about 27 million running words in English). In the next step, we ran MERT on the in-domain development set of VideoLectures.NET. This leads to some improvement in translation quality, as the pure out-of-domain system is optimized towards the lecture translation task. Adding training data collected in VideoLectures.NET and TED talks available from the IWSLT 2012 shared evaluation task yields substantial improvements in translation quality. Although the additional data is significantly smaller than the JRC-Acquis corpus (about 0.5 million running words in English), it has a strong impact due to its in-domain characteristics. For the Slovenian→English task, we finally added monolingual TED talk data to train the language model (about 3 million running words), which again slightly improved the results. The results are shown in Tables 24 and 25.

Table 24: BLEU and TER results on the Slovenian→English VideoLectures.NET development and test sets.

                       Dev          Test
                       BLEU  TER    BLEU  TER

Baseline (JRC-Acquis)  12.5  67.0   9.1   73.1
+videoLectures MERT    15.1  69.4   10.2  74.7
+videoLectures & TED   20.5  61.8   15.0  67.8
+TED monolingual       21.6  59.5   15.7  65.2

Table 25: BLEU and TER results on the English→Slovenian VideoLectures.NET development and test sets.

                       Dev          Test
                       BLEU  TER    BLEU  TER

Baseline (JRC-Acquis)  7.7   79.9   6.2   95.0
+videoLectures MERT    8.8   75.9   6.4   89.5
+videoLectures & TED   13.4  68.5   12.0  77.4

Unfortunately, due to the tight schedule of the quality control evaluation, the English→Slovenian translations given to the human evaluators were produced with the baseline system. The improvements described here were developed afterwards.


4 poliMedia

4.1 Transcription quality

4.1.1 Spanish (UPVLC)

The UPVLC Spanish automatic speech recognition (ASR) system is based on the AK toolkit [1] to train acoustic models and the SRILM toolkit [21] to deploy n-gram language models. The AK toolkit is released under a free software license and provides features similar to other HMM toolkits such as HTK [25] or RASR [17].

Acoustic models for Spanish were trained on the poliMedia corpus (see Table 26). Training speech segments were first converted to 16 kHz, and then parameterized into sequences of 39-dimensional real-valued feature vectors. Regarding the transcriptions, 23 basic phonemes were considered, plus a silence model. The extracted features were used to train tied triphone HMMs with Gaussian mixture models for emission probabilities. The training configuration was tuned in preliminary experiments using the development set.
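A minimal sketch of a 39-dimensional front end of this kind (13 MFCCs plus first and second derivatives), assuming librosa; the exact parameterization used with the AK toolkit is not specified in the report, so this is illustrative only:

```python
import librosa
import numpy as np

def mfcc_39(wav_path):
    y, sr = librosa.load(wav_path, sr=16000)            # resample to 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # 13 static coefficients
    d1 = librosa.feature.delta(mfcc)                    # first derivatives
    d2 = librosa.feature.delta(mfcc, order=2)           # second derivatives
    return np.vstack([mfcc, d1, d2]).T                  # shape: (frames, 39)
```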

Table 26: Basic statistics of the poliMedia speech corpus.

Duration Sentences Words Vocabulary

96h 41.5K 96.8K 28K

Two approaches to language modelling were considered to initially evaluate our baseline ASR system. The first is a simple n-gram model trained on the poliMedia corpus. The second computes a linear mixture of language models trained on the poliMedia transcriptions along with the external resources previously presented in Table 18. A summary of the corpus statistics employed in these two approaches is shown in Table 27.

Table 27: Statistics of corpora employed to train the baseline and external approaches in language modelling.

Sentences Words Vocabulary Dev OOV Dev PPL Test OOV Test PPL

Baseline  41.5K  96.8M   28K    4.5%  293  5.2%  313
External  33.5M  895.8M  58.1K  2.2%  180  3.0%  240

Table 28 reports WER results on the development and test sets for both approaches using 4-gram language models. As observed, the employment of external resources produces a large improvement of about 6 WER points.

Table 28: WER results for the Spanish UPVLC ASR system on the development and test sets with external resources.

Dev Test

Baseline 36.3 36.0

+External resources 28.1 30.3

Two speaker adaptation techniques were implemented in order to improve our baseline ASR system: fast vocal tract length normalization (VTLN) [24], and constrained MLLR (CMLLR) features [20, 5].


Table 29: WER (%) using VTLN and CMLLR on the baseline Spanish UPVLC ASR system.

Dev Test

Baseline 28.1 30.3

VTLN 26.2 28.8

CMLLR (HMM-Full 1st Pass)   22.3  24.8
+CMLLR (HMM-Full 2nd Pass)  22.2  24.6

Table 29 reports WER figures for VTLN and two passes of CMLLR, achieving 24.6% WER on the test set.

Moreover, several experiments were carried out in order to assess the improvements that can be obtained by taking advantage of the text extracted from the slides. In all experiments, an additional language model computed from the extracted text was introduced into the linear mixture described above. Two basic scenarios were considered, depending on how the text was extracted from the slides: manually or automatically. In the first case, a human annotator synchronised and transcribed the text of the slides in the development and test sets. In the second case, an error-prone OCR system carried out this slide transcription process; this OCR system provides automatic transcriptions with 64% WER. A third scenario (sync), a refinement of the first one, adapts the language model to the slide being shown during the time segment being transcribed.

WER figures are summarised in Table 30. The first row depicts the results obtained with the baseline system without slide information.

Table 30: WER (%) incorporating slide information on the Spanish UPVLC ASR system.

                     Dev                Test
                     OOV  PPL    WER    OOV  PPL    WER

Baseline             2.4  179.4  22.3   3.1  238.6  24.8
Human slides         1.6  108.4  21.0   1.8  125.9  21.4
OCR slides           2.0  128.7  22.4   2.2  143.5  23.8
Human slides + sync  2.4  95.0   21.1   3.1  113.1  21.9

4.1.2 Spanish (RWTH)

The RWTH Spanish speech recognition system uses the transLectures acoustic training data (100 hours) along with QUAERO (170 hours, news and podcasts), EPPS (90 hours, parliament speeches) and Hub4 (30 hours, American Spanish). In all transcriptions, numbers are expanded to their word form, and for alternate terms (i.e. of the form /phrase 1/phrase 2//) the first term is retained. The silence and noise markers are removed from the dev and eval transcriptions.
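A minimal sketch of the alternation handling just described; the /phrase 1/phrase 2// marker syntax is taken from the report, while the regular expression itself is an assumption:

```python
import re

def keep_first_alternative(text: str) -> str:
    # Replace '/first/second//' by 'first'.
    return re.sub(r"/([^/]*)/[^/]*//", r"\1", text)

# Example: keep_first_alternative("we use /ten/10// samples") -> "we use ten samples"
```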

Acoustic features are created using a three-layer multilayer perceptron (MLP), which in turn has been trained on a concatenation of MFCC, PLP and Gammatone features. The MLP outputs are then concatenated with MFCC features. A Gaussian mixture set is calculated by maximum likelihood training. These mixtures are then converted to log-linear form, and Minimum Phone Error (MPE) discriminative training is performed.

The language model is a combination of several LMs, interpolated to minimize the perplexity on the transLectures development corpus (see statistics in Table 32).


Table 31: Statistics of the Spanish speech corpus to train acoustic models in the RWTH ASR system.

Corpus Duration Sentences Words Vocabulary

poliMedia  96h   41.5K   96.8K   28K
EPPS       91h   101.6K  823K    27.5K
Quaero     173h  41.1K   1.9M    58.2K
Hub4       30h   29.6K   309.3K  24.1K

Table 32: Statistics of the Spanish text corpus to train language models in the RWTH ASR system.

Corpus Genre Sentences Words Vocabulary

Gigaword        EU sources       119.9M  1.2G    271.4K
Quaero          News             19.9M   667.8M  324.0K
Transcriptions  Diverse sources  213.7M  4.0M    91.1K

The recognition acoustic model is a 6-state Hidden Markov Model (HMM). After a first recognition pass, speaker adaptive training (MLLR and CMLLR) is done twice, using the last stage's hypothesis as the reference. Looking at the output of the speech recognizer, one sees that there are few errors due to acoustic quality, because the audio recordings are clean. One source of errors is words mispronounced by the lecturer, along with hesitations. Also, mathematical formulas are sometimes recognized incorrectly because they may not have equivalents in the LM training data. The lectures on political science and law have small WERs, while the computer science lectures have the highest error rates, principally because they use a lot of English terms and little-known abbreviations, which may even be pronounced according to their English orthography. The WERs are given in Table 33.

Table 33: WER (%) using MLLR and CMLLR on the Spanish RWTH ASR system.

            Dev              Test
            OOV  PPL  WER    OOV  PPL  WER

Baseline    1.6  195  20.3   2.5  225  21.4
MLLR+CMLLR            15.7             17.1

4.1.3 Spanish (EML)

EML is also responsible for the automatic transcription of 6,000 poliMedia lectures in Spanish. To this purpose, EML is hosting the Spanish acoustic and language models developed by RWTH and balancing computing resources against recognition accuracy, for an operating point of approximately 4xRT (i.e. the system needs 4 minutes to transcribe 1 minute of audio). Table 34 presents WER results on the development and test sets for Spanish in the EML Transcription Platform using RWTH's models.

Table 34: Results on the development and test sets for RWTH's Spanish models in the EML Transcription Platform.

                             Dev   Test

Baseline + CMLLR (2nd pass)  20.4  22.3

4.2 Translation quality (UPVLC)

Translation experiments from Spanish into English were performed with the well-known phrase-based SMT approach as implemented in the Moses toolkit [6], using out-of-domain and in-domain corpora. In-domain corpora included the poliMedia corpus and the TED corpus from the IWSLT 2012 evaluation. The main statistics of the training, development and test sets of the in-domain corpora are shown in Table 16. Out-of-domain corpora involved the Europarl and UN corpora (see statistics in Table 17). These large out-of-domain corpora were filtered by applying sentence selection [4] before performing the translation process.

An English language model was also trained on diverse bilingual and monolingual external corpora, Google counts [13] and the in-domain corpora. Table 35 summarises the most important statistics of the external corpora. Individual 4-gram language models were trained for each corpus using the SRILM toolkit [21], applying modified Kneser-Ney discount and interpolation smoothing [8]. These models were linearly interpolated to minimize the perplexity of the in-domain development set [7].

Table 35: Basic statistics of the external corpora involved in the generation of the English language model.

Corpus Genre Sentences Words Vocabulary

TED              Science for laymen          0.1M   2.3M    0.1M
News-Commentary  News                        0.2M   4.5M    0.2M
Europarl-v7      Parliament proceedings      2.2M   54.1M   0.3M
United Nations   United Nations resolutions  10.6M  286M    1.8M
10^9             Canadian and EU sources     19.8M  504.8M  5.8M
News (07-11)     News                        48.9M  986M    6.2M

Table 36 summarises adaptation results obtained for Spanish into English translations. Translation and language models for the baseline system were trained on the in-domain training set and the Europarl corpus; however, the baseline language model is a 5-gram model smoothed with the modified Kneser-Ney absolute interpolation method [8]. Baseline figures were slightly improved by using the linearly interpolated language model instead of the baseline language model. A significant improvement is obtained by bilingual sentence selection techniques; this improvement is even larger with the linearly interpolated language model. However, the best results were obtained by a log-linear combination of both language models: the large linear mixture and the language model resulting from sentence selection. In this latter case, 26.0 BLEU points were achieved.

Table 36: BLEU and TER results on the poliMedia corpus.

                           Dev          Test
                           BLEU  TER    BLEU  TER

Baseline                   27.8  51.2   23.4  56.5
linear interpolation       28.3  51.2   24.0  56.4
Selection                  29.9  49.6   24.9  55.6
+linear interpolation      29.7  49.9   25.3  55.1
+log-linear interpolation  30.9  49.1   26.0  54.6


5 Conclusions

This deliverable has presented a summary of scientific evaluations of automatic transcription and translation on both repositories, VideoLectures.NET and poliMedia.

Regarding VideoLectures.NET, the best system for automatic transcription of English achieves 27.9% WER. This system still has room for improvement, since slide information was not considered in this first report. First results on Slovenian transcriptions were also provided, obtaining 46.0% WER. In this case, more training data is planned to be incorporated to improve acoustic and language models, as well as slide information.

Scientific evaluation of translations out of English delivered BLEU and TER figures that largely depend on the complexity of the target language. The Spanish and French translation systems achieved BLEU scores of 33.5 and 32.2, respectively. However, more complex languages such as German and Slovenian obtained significantly poorer BLEU scores of 19.2 and 12.0, respectively. Finally, the Slovenian into English MT system scored 15.7 BLEU points. As in ASR, we plan to enlarge the training set with additional training material.

As far as poliMedia is concerned, the scientific evaluation of Spanish transcription reported lower WER figures than English transcription, due to better recording conditions and predominantly native speakers. The lowest WER score was 17.1%, and it is expected to improve when slide information is employed. Surprisingly, translation from Spanish into English, with 30.9 BLEU points on the development set, obtained worse results than English into Spanish. This may be explained by the specificity of topics in poliMedia.

EML's ASR systems for English, Slovenian and Spanish provided the transcriptions for the quality control in Task 6.2. Regarding the translations supervised in Task 6.2, they were delivered by UPVLC for the translation directions involving Spanish, by XEROX for English into French, and by RWTH for English into German and the translation directions involving Slovenian.


References

[1] AK toolkit. http://aktoolkit.sourceforge.net/.

[2] Chris Callison-Burch, Philipp Koehn, Christof Monz, and Omar Zaidan. Findings of the 2011 Workshop on Statistical Machine Translation. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 22–64, Edinburgh, Scotland, July 2011. Association for Computational Linguistics.

[3] Mauro Cettolo, Christian Girardi, and Marcello Federico. WIT3: Web Inventory of Transcribed and Translated Talks. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT 2012), pages 261–268, Trento, Italy, May 2012.

[4] Guillem Gasco, Martha-Alicia Rocha, German Sanchis-Trilles, Jesus Andres-Ferrer, and Francisco Casacuberta. Does more data always yield better translations? In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 152–161, Avignon, France, April 2012. Association for Computational Linguistics.

[5] Diego Giuliani, Matteo Gerosa, and Fabio Brugnara. Speaker normalization through constrained MLLR based transforms. In INTERSPEECH. ISCA, 2004.

[6] Hieu Hoang and Philipp Koehn. Design of the Moses Decoder for Statistical Machine Translation. In ACL Workshop on Software Engineering, Testing, and Quality Assurance for NLP, 2008.

[7] Frederick Jelinek and Robert L. Mercer. Interpolated estimation of Markov source parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in Practice, pages 381–397, Amsterdam, The Netherlands: North-Holland, May 1980.

[8] R. Kneser and Hermann Ney. Improved backing-off for M-gram language modeling. In 1995 International Conference on Acoustics, Speech, and Signal Processing (ICASSP-95), pages 181–184, 1995.

[9] Philipp Koehn. Statistical Significance Tests for Machine Translation Evaluation. In Dekang Lin and Dekai Wu, editors, Proceedings of EMNLP 2004, pages 388–395, Barcelona, Spain, July 2004. Association for Computational Linguistics.

[10] Philipp Koehn. Europarl: A Parallel Corpus for Statistical Machine Translation. In MT Summit 2005, 2005.

[11] Patrik Lambert, Jean Senellart, Laurent Romary, Holger Schwenk, Florian Zipser, Patrice Lopez, and Frederic Blain. Collaborative Machine Translation Service for Scientific Texts. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 11–15, Avignon, France, April 2012. Association for Computational Linguistics.

[12] Arne Mauser, Sasa Hasan, and Hermann Ney. Extending statistical machine translation with discriminative and trigger-based lexicon models. In Conference on Empirical Methods in Natural Language Processing, pages 210–217, Singapore, August 2009.

[13] Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Holberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden. Quantitative analysis of culture using millions of digitized books. Science, 2010.

[14] Franz Josef Och. Minimum Error Rate Training in Statistical Machine Translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160–167, Sapporo, Japan, July 2003.

[15] Franz Josef Och and Hermann Ney. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19–51, March 2003.

[16] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Computational Linguistics.

[17] D. Rybach, C. Gollan, G. Heigold, B. Hoffmeister, J. Loof, R. Schluter, and H. Ney. The RWTH Aachen University open source speech recognition system. In 10th Annual Conference of the International Speech Communication Association, pages 2111–2114, 2009.

[18] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and Ralph Weischedel. A Study of Translation Error Rate with Targeted Human Annotation. In Proceedings of the Association for Machine Translation in the Americas (AMTA 2006), 2006.

[19] Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaz Erjavec, Dan Tufis, and Daniel Varga. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In 5th International Conference on Language Resources and Evaluation, Genoa, Italy, May 2006.

[20] G. Stemmer, F. Brugnara, and D. Giuliani. Adaptive training using simple target models. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), volume 1, pages 997–1000, 2005.

[21] Andreas Stolcke. SRILM – an extensible language modeling toolkit. In Proc. of ICSLP, 2002.

[22] Andreas Stolcke. SRILM – An Extensible Language Modeling Toolkit. In Proc. of the Int. Conf. on Speech and Language Processing (ICSLP), volume 2, pages 901–904, Denver, CO, September 2002.

[23] David Vilar, Daniel Stein, Matthias Huck, and Hermann Ney. Jane: Open source hierarchical translation, extended with reordering and lexicon models. In ACL 2010 Joint Fifth Workshop on Statistical Machine Translation and Metrics MATR, pages 262–270, Uppsala, Sweden, July 2010.

[24] L. Welling, S. Kanthak, and H. Ney. Improved methods for vocal tract normalization. In 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 761–764, March 1999.

[25] S. Young et al. The HTK Book. Cambridge University Engineering Department, 1995.


A Acronyms

ASR    Automatic Speech Recognition
BLEU   Bilingual Evaluation Understudy
CMLLR  Constrained Maximum Likelihood Linear Regression
DDS    Deluxe Digital Studios Limited
DoW    Description of Work
DWL    Discriminative Word Lexicon
EML    European Media Laboratory GmbH
HMM    Hidden Markov Model
JSI    Josef Stefan Institute
K4A    Knowledge for All Foundation
LM     Language Model
MERT   Minimum Error Rate Training
MLLR   Maximum Likelihood Linear Regression
MT     Machine Translation
OOV    Out-Of-Vocabulary
PPL    Perplexity
RNN    Recurrent Neural Network
RWTH   RWTH Aachen University
SAT    Speaker Adaptive Training
SMT    Statistical Machine Translation
TER    Translation Error Rate
UPVLC  Universitat Politecnica de Valencia
VTLN   Vocal Tract Length Normalisation
WER    Word Error Rate
XRCE   XEROX Research Center Europe
