
HAL Id: halshs-03030529
https://halshs.archives-ouvertes.fr/halshs-03030529v2

Submitted on 23 Feb 2021


Distributed under a Creative Commons Attribution - NonCommercial - ShareAlike 4.0 International License

User-friendly automatic transcription of low-resource languages: Plugging ESPnet into Elpis

Oliver Adams, Benjamin Galliot, Guillaume Wisniewski, Nicholas Lambourne, Ben Foley, Rahasya Sanders-Dwyer, Janet Wiles, Alexis Michaud, Séverine Guillaume, Laurent Besacier, et al.

To cite this version: Oliver Adams, Benjamin Galliot, Guillaume Wisniewski, Nicholas Lambourne, Ben Foley, et al.. User-friendly automatic transcription of low-resource languages: Plugging ESPnet into Elpis. ComputEL-4: Fourth Workshop on the Use of Computational Methods in the Study of Endangered Languages, Mar 2021, Hawai‘i, United States. halshs-03030529v2


User-Friendly Automatic Transcription of Low-Resource Languages: Plugging ESPnet into Elpis

Oliver Adams a, Benjamin Galliot b, Guillaume Wisniewski c, Nicholas Lambourne d e, Ben Foley d e, Rahasya Sanders-Dwyer d e, Janet Wiles d e, Alexis Michaud b, Séverine Guillaume b, Laurent Besacier f, Christopher Cox g, Katya Aplonova h, Guillaume Jacques i, Nathan Hill j

a Atos zData, United States of America
b Langues et Civilisations à Tradition Orale (LACITO), CNRS-Sorbonne Nouvelle, France
c Université de Paris, Laboratoire de Linguistique Formelle (LLF), CNRS, France
d The University of Queensland, Brisbane, Australia
e ARC Centre of Excellence for the Dynamics of Language (CoEDL), Australia
f Laboratoire d'Informatique de Grenoble (LIG), CNRS-Université Grenoble Alpes, France
g University of Alberta, Canada
h Langage, Langues et Civilisation d'Afrique (LLACAN), CNRS-INALCO, France
i Centre de Recherches Linguistiques sur l'Asie Orientale (CRLAO), CNRS-EHESS, France
j School of Oriental and African Studies, University of London, United Kingdom

[email protected], [email protected], [email protected],
{n.lambourne|b.foley|uqrsand5|j.wiles}@uq.edu.au,
{alexis.michaud|severine.guillaume}@cnrs.fr,
[email protected], [email protected],
{aplooon|rgyalrongskad}@gmail.com, [email protected]

Abstract

This paper reports on progress integrating the speech recognition toolkit ESPnet into Elpis, a web front-end originally designed to provide access to the Kaldi automatic speech recognition toolkit. The goal of this work is to make end-to-end speech recognition models available to language workers via a user-friendly graphical interface. Encouraging results are reported on (i) development of an ESPnet recipe for use in Elpis, with preliminary results on data sets previously used for training acoustic models with the Persephone toolkit along with a new data set that had not previously been used in speech recognition, and (ii) incorporating ESPnet into Elpis along with UI enhancements and a CUDA-supported Dockerfile.

1 Introduction

Transcription of speech is an important part of language documentation, and yet speech recognition technology has not been widely harnessed to aid linguists. Despite revolutionary progress in the performance of speech recognition systems in the past decade (Hinton et al., 2012; Hannun et al., 2014; Zeyer et al., 2018; Hadian et al., 2018; Ravanelli et al., 2019; Zhou et al., 2020), including its application to low-resource languages (Besacier et al., 2014; Blokland et al., 2015; van Esch et al., 2019; Hjortnaes et al., 2020), these advances are yet to play a common role in language documentation workflows. Speech recognition software often requires effective command line skills and a reasonably detailed understanding of the underlying modeling. People involved in language documentation, language description, and language revitalization projects (this includes, but is not limited to, linguists who carry out fieldwork) seldom have such knowledge. Thus, the tools are largely inaccessible to many people who would benefit from their use.

Elpis1 is a tool created to allow language workers with minimal computational experience to build their own speech recognition models and automatically transcribe audio (Foley et al., 2018, 2019). Elpis uses the Kaldi2 automatic speech recognition (ASR) toolkit (Povey et al., 2011) as its backend. Kaldi is a mature, widely used and well-supported speech recognition toolkit which supports a range of hidden Markov model based speech recognition models.

1 https://github.com/CoEDL/elpis
2 https://github.com/kaldi-asr/kaldi


In this paper we report on the ongoing integration of ESPnet3 into Elpis as an alternative to the current Kaldi system. We opted to integrate ESPnet (Watanabe et al., 2018) as it is a widely used and actively developed tool with state-of-the-art end-to-end neural network models. By supporting ESPnet in Elpis, we aim to bring a wider range of advances in speech recognition to a broad group of users, and provide alternative model options that may better suit some data circumstances, such as an absence of a pronunciation lexicon.

In the rest of this paper, we describe changes to the Elpis toolkit to support the new backend, and preliminary experiments applying our ESPnet recipe to several datasets from a language documentation context. Finally, we discuss plans going forward with this project.

2 Related Work

Automatic phonetic/phonemic transcription in language documentation As a subset of speech recognition research, work has been done in applying speech recognition systems to the very low-resource phonemic data scenarios typical in the language documentation context. Encouraging results capitalizing on the advances in speech recognition technology for automatic phonemic transcription in a language documentation context were reported by Adams et al. (2018). Their work used a neural network architecture with connectionist temporal classification (Graves et al., 2006) for phonemic (including tonal) transcription. A command line toolkit called Persephone was released. To assess the reproducibility of the results on other languages, experiments were extended beyond the Chatino, Na and Tsuut'ina data sets to a sample of languages from the Pangloss Collection, an online archive of under-resourced languages (Michailovsky et al., 2014). The results confirmed that end-to-end models for automatic phonemic transcription deliver promising performance, and also suggested that preprocessing tasks can to a large extent be automated, thereby increasing the attractiveness of the tool for language documentation workflows (Wisniewski et al., 2020). Another effort in this space is Allosaurus (Li et al., 2020), which leverages multilingual models for phonetic transcription and jointly models language-independent phones and language-dependent phonemes. This stands as a promising step towards effective universal phonetic recognition, which would be of great value in the language documentation process.

3 https://github.com/espnet/espnet

User-friendly speech recognition interfaces Since such research tools do not have user-friendly interfaces, efforts have been put into making these tools accessible to a wider audience of users. The authors of Allosaurus provide a web interface online.4 To integrate Persephone into the language documentation workflow, a plugin, Persephone-ELAN,5 was developed for ELAN,6 a piece of software that is widely used for annotation in language documentation (Cox, 2019).

Meanwhile, Elpis is a toolkit that provides a user-friendly front-end to the Kaldi speech recognition system. The interface steps the user through the process of preparing language recordings using existing ELAN transcription files, training a model and applying the model to obtain a hypothesis orthographic transcription for untranscribed speech recordings.

3 Bringing ESPnet to Elpis

ESPnet is an end-to-end neural network-based speech recognition toolkit. Developed with PyTorch (Paszke et al., 2019) in a research context, the tool satisfies three desiderata for our purposes: (a) it is easy to modify training recipes, which consist of collections of scripts and configuration files that make it easy to perform training and decoding by calling a wrapper script. These recipes describe a wide range of the hyperparameters and architecture choices of the model; (b) it is actively developed, with frequent integration of the latest advances in end-to-end speech recognition; and (c) it supports Kaldi-style data formatting, which makes it a natural end-to-end counterpart to the Kaldi backend that was already supported in Elpis. These points make it a more appealing candidate backend than Persephone, primarily due to ESPnet's larger developer base.
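For readers unfamiliar with Kaldi-style data formatting, a data directory is built from a few plain-text index files; the sketch below writes the three core ones (wav.scp, text, utt2spk). The utterance IDs, paths and transcriptions are invented for illustration; in practice Elpis derives them from the uploaded ELAN files.

```python
from pathlib import Path

# Hypothetical utterances: (utterance_id, speaker_id, wav_path, transcription)
utterances = [
    ("spk1_0001", "spk1", "recordings/story1_0001.wav", "tʰi˥ ʂɯ˥ ..."),
    ("spk1_0002", "spk1", "recordings/story1_0002.wav", "..."),
]

def write_kaldi_data_dir(utts, out_dir):
    """Write the wav.scp, text and utt2spk files that Kaldi-style
    recipes (including ESPnet's) expect in a data directory."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(out / "wav.scp", "w", encoding="utf-8") as wav_scp, \
         open(out / "text", "w", encoding="utf-8") as text, \
         open(out / "utt2spk", "w", encoding="utf-8") as utt2spk:
        for utt_id, spk_id, wav_path, transcript in sorted(utts):
            wav_scp.write(f"{utt_id} {wav_path}\n")   # utterance -> audio file
            text.write(f"{utt_id} {transcript}\n")    # utterance -> transcription
            utt2spk.write(f"{utt_id} {spk_id}\n")     # utterance -> speaker

write_kaldi_data_dir(utterances, "data/train")
```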

3.1 Development of an ESPnet recipe for Elpis

One goal of the integration is to create a default ESPnet recipe for Elpis to use, that performs well across a variety of languages and with the small amount and type of data typically available in a language documentation context.

4 https://www.dictate.app
5 https://github.com/coxchristopher/persephone-elan
6 https://archive.mpi.nl/tla/elan

Figure 1: Training stages of the Elpis interface. Notice the choice of backend in the upper right-hand corner.

To get a sense of how easily ESPnet could reach performance similar to what was previously attained, we applied it to the single-speaker Na and Chatino datasets used in Adams et al. (2018) (see Table 1, which includes other details of the datasets used, including the amount of training data). We report character error rate (CER) rather than phoneme error rate (PER) because it is general, does not require a subsequent language-specific post-processing step, and also captures characters that a linguist might want transcribed that are not strictly phonemic. Because of minor differences in the training sets, their preprocessing, and the metrics used, these numbers are not intended to be directly comparable with previous work; nevertheless, the performance was good enough to confirm that integrating ESPnet was preferable to Persephone. We do no language-specific preprocessing, though the Elpis interface allows the user to define a character set whose instances will be removed from the text. For the Na data and the Japhug data in §4, the Pangloss XML format is converted to ELAN XML using an XSLT-based tool, Pangloss-Elpis.7

7 gitlab.com/lacito/pangloss-elpis
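For reference, CER is simply the edit (Levenshtein) distance between hypothesis and reference transcriptions, computed over characters and divided by the length of the reference. A minimal self-contained sketch (our own illustration, not the scoring code used in the experiments):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate: edits needed to turn the hypothesis into
    the reference, divided by the number of reference characters."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

print(f"{cer('tʰi˥ ʂɯ˥', 'tʰi˥ ʂu˥') * 100:.1f}%")  # one substitution out of 8 characters: 12.5%
```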

While we did not aggressively tune hyperparameters and architecture details, they do have a substantial impact on performance and computational requirements. Owing to the small datasets and the limited computational resources of many of the machines that Elpis may run on, we used a relatively small neural network. In the future we aim to grow a representative suite of evaluation languages from a language documentation setting for further tuning, to determine what hyperparameters and architecture best suit different scenarios. Though we aim for a recipe that does well across a range of possible language documentation data circumstances, the best architecture and hyperparameters will vary depending on the characteristics of the input dataset. Rather than have the user fiddle with such hyperparameters directly, which would undermine the user-friendliness of the tool, there is potential to automatically adjust the hyperparameters of the model on the basis of the data supplied to the model. For example, the parameters could be automatically set depending on the number of speakers in the ELAN file and the total amount of speech.
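A heuristic of that kind could look like the sketch below; the thresholds and option names are invented for illustration and are not current Elpis defaults.

```python
def suggest_hyperparameters(num_speakers, minutes_of_speech):
    """Hypothetical heuristic: scale the network to the amount of
    annotated speech supplied, instead of asking the user to tune it."""
    if minutes_of_speech < 60:
        encoder_layers, hidden_size = 2, 256
    elif minutes_of_speech < 300:
        encoder_layers, hidden_size = 3, 320
    else:
        encoder_layers, hidden_size = 4, 512
    return {
        "encoder_layers": encoder_layers,
        "hidden_size": hidden_size,
        # With several speakers, a little more regularisation tends to help.
        "dropout": 0.2 if num_speakers > 1 else 0.1,
        "epochs": 20,
    }

print(suggest_hyperparameters(num_speakers=1, minutes_of_speech=170))
```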

The architecture we used for these experiments is a hybrid CTC-attention model (Watanabe et al., 2017b) with a 3-layer BiLSTM encoder and a single-layer decoder. We use a hidden size of 320 and an equal weighting between the CTC and attention objectives. For optimization we use a batch length of 30 and the Adadelta gradient descent algorithm (Zeiler, 2012). For more details, we include a link to the recipe.8

Language | Num. speakers | Type | Train (minutes) | CER (%)
Na | 1 | Spontaneous narratives | 273 | 14.5
Na | 1 | Elicited words & phrases | 188 | 4.7
Chatino | 1 | Read speech | 81 | 23.5
Japhug | 1 | Spontaneous narratives | 170 | 12.8

Table 1: Information on the evaluation datasets used and the character error rate performance of the current recipe.
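As a toy illustration of the interpolated objective (not the ESPnet implementation), the sketch below combines a CTC loss and an attention cross-entropy loss computed from a shared 3-layer BiLSTM encoder with equal weights; the decoder and attention mechanism are deliberately simplified.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyHybrid(nn.Module):
    """Toy hybrid CTC/attention model: a shared BiLSTM encoder feeds a CTC
    head and a single-layer attention decoder; the two losses are
    interpolated with equal weight. Sizes mirror the recipe above
    (3 BiLSTM layers, hidden size 320); everything else is simplified."""

    def __init__(self, feat_dim=80, vocab=60, hidden=320, ctc_weight=0.5):
        super().__init__()
        self.ctc_weight = ctc_weight
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=3,
                               bidirectional=True, batch_first=True)
        enc_dim = 2 * hidden
        self.ctc_out = nn.Linear(enc_dim, vocab)
        self.ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
        self.embed = nn.Embedding(vocab, enc_dim)
        # Decoder input at each step: [previous-symbol embedding ; attention context]
        self.dec_rnn = nn.LSTM(2 * enc_dim, enc_dim, batch_first=True)
        self.dec_out = nn.Linear(enc_dim, vocab)

    def forward(self, feats, feat_lens, targets, target_lens):
        enc, _ = self.encoder(feats)                          # (B, T, 2H)

        # CTC branch: frame-wise label posteriors over the character set.
        ctc_logp = self.ctc_out(enc).log_softmax(dim=-1)      # (B, T, V)
        loss_ctc = self.ctc_loss(ctc_logp.transpose(0, 1),    # CTCLoss wants (T, B, V)
                                 targets, feat_lens, target_lens)

        # Attention branch (teacher forcing). Simplification: attention
        # queries are previous-symbol embeddings rather than decoder states.
        sos = torch.zeros(targets.size(0), 1, dtype=torch.long,
                          device=targets.device)              # index 0 doubles as <sos>
        prev = self.embed(torch.cat([sos, targets[:, :-1]], dim=1))  # (B, L, 2H)
        scores = torch.bmm(prev, enc.transpose(1, 2))         # (B, L, T)
        context = torch.bmm(F.softmax(scores, dim=-1), enc)   # (B, L, 2H)
        dec_h, _ = self.dec_rnn(torch.cat([prev, context], dim=-1))
        att_logits = self.dec_out(dec_h)                      # (B, L, V)
        loss_att = F.cross_entropy(att_logits.reshape(-1, att_logits.size(-1)),
                                   targets.reshape(-1))

        # Equal weighting between the two objectives, as in the recipe.
        return self.ctc_weight * loss_ctc + (1 - self.ctc_weight) * loss_att

# Shapes only; real training feeds filterbank features and character targets.
model = TinyHybrid()
feats = torch.randn(2, 200, 80)                               # 2 utterances, 200 frames
feat_lens = torch.full((2,), 200, dtype=torch.long)
targets = torch.randint(1, 60, (2, 30))                       # 0 is reserved for blank/<sos>
target_lens = torch.full((2,), 30, dtype=torch.long)
loss = model(feats, feat_lens, targets, target_lens)
loss.backward()
```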

3.2 Elpis enhancements

Beyond integration of ESPnet into Elpis, several other noteworthy enhancements have been made to Elpis.

Detailed training feedback Prior to the work reported in this paper, the progress of the training and transcribing stages was shown as a spinning icon with no other feedback. Given the amount of time it takes to train even small speech recognition models, the lack of detailed feedback may leave a user wondering what stage the training is at, or whether a fault has caused the system to fail. During training and transcription, the backend processes' logs are now output to the screen (see Figure 1). Although the information in these logs may be more detailed than what the intended audience of the tool needs to understand, it does give the user feedback on how training is going, and reassures them that it is still running (or notifies them if a process has failed). The logs can also provide useful contextual information when debugging an experiment in collaborations between language workers and software engineers.
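The underlying pattern is to read the training process's output line by line and forward it to the interface as it arrives; a sketch of that pattern is given below (the command and the callback are placeholders, not the actual Elpis internals).

```python
import subprocess

def run_and_stream(cmd, on_line):
    """Run a backend training command and forward each log line to the
    interface as it is produced, instead of waiting for the process to end."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    for line in proc.stdout:
        on_line(line.rstrip())        # e.g. push to the web UI over a websocket
    return proc.wait()                # a non-zero exit code signals a failed stage

# Hypothetical usage:
# exit_code = run_and_stream(["bash", "run_training.sh"], print)
# if exit_code != 0:
#     print("Training failed; see the log above for the failing stage.")
```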

CUDA-supported Docker image The type of Kaldi model which Elpis trains (using a Gaussian mixture model as the acoustic model) was originally selected to be computationally efficient, and able to run on the type of computers commonly used by language researchers, which often don't have a GPU (graphics processing unit). With the addition of ESPnet, the benefit of using a GPU will be felt through vastly reduced training times for the neural network. To this end, Elpis has been adapted to include Compute Unified Device Architecture (CUDA) support, which is essential in order to leverage a GPU when training ESPnet on a machine that has one available.

8 https://github.com/persephone-tools/espnet/commit/1c529eab738cc8e68617aebbae520f7c9c919081
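Inside such a container, the backend can detect whether a GPU has been exposed (for example via `docker run --gpus all`) and fall back to CPU-only training otherwise; a sketch of that kind of check, not the actual Elpis code:

```python
import torch

def select_ngpu():
    """Use the GPU exposed to the container when CUDA is available;
    otherwise fall back to CPU-only training."""
    if torch.cuda.is_available():
        print(f"CUDA device found: {torch.cuda.get_device_name(0)}")
        return 1
    print("No CUDA device visible; training on CPU (much slower).")
    return 0

ngpu = select_ngpu()  # e.g. passed on to the training script as a GPU-count option
```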

4 Application to a new data set: Japhug

The point of this work is to provide a tool that can be used by linguists in their limited-data scenarios. To this end we aim to experiment with diverse datasets that reflect the breadth of language documentation contexts. Going forward, this will be useful in getting a sense of what sort of model performance users can expect given the characteristics of their dataset. In this section we report on further application of the model underpinning the Elpis–ESPnet integration to another data set.

Japhug is a Sino-Tibetan language with a rich system of consonant clusters, as well as flamboyant morphology. In Japhug, syllables can have initial clusters containing at most three consonants, and at most one coda (Jacques, 2019). Japhug does not have lexical tones. The language's phonological profile is thus very different from Na (about which see Michaud, 2017) and Chatino (Cruz, 2011; Cruz and Woodbury, 2014; Cavar et al., 2016).

The data set comprises a total of about 30 hours of transcribed recordings of narratives, time-aligned at the level of the sentence (Macaire, 2020), which is a huge amount in a language documentation context. The recordings were made in the course of field trips from the first years of the century until now, in a quiet environment, and almost all from a single speaker. Our tests on various data sets so far suggest that these settings (one speaker – hence no speaker overlap – and clean audio) are those in which performance is most likely to be good when training an acoustic model from scratch.

The full data set is openly accessible online from the Pangloss Collection, under a Creative Commons license, allowing visitors to browse the texts, and computer scientists to try their hand at the data set.9 The data collector's generous approach to data sharing sets an impressive example, putting into practice some principles which gather increasing support, but which are not yet systematically translated into institutional and editorial policies (Garellek et al., 2020).

The dataset can be downloaded by sending a request to the Cocoon data repository, which hosts the Pangloss Collection. A script, retriever.py,10 retrieves resources with a certain language name. Data sets can then be created in various ways, such as sorting by speaker (tests suggest that single-speaker models are a good way to start) and by genre, e.g. excluding materials such as songs, which are a very different kettle of fish from ordinary speech and complicate model training.
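A sketch of this kind of post-download filtering is given below; the metadata layout (a JSON file with speaker and genre fields) is an assumption made for illustration, not the actual Cocoon or retriever.py output format.

```python
import json
from pathlib import Path

def select_resources(metadata_file, speaker=None, exclude_genres=("song",)):
    """Keep only the resources matching one speaker and drop genres (such
    as songs) that behave very differently from ordinary speech.
    Assumes one JSON record per resource with 'speaker' and 'genre' keys."""
    records = json.loads(Path(metadata_file).read_text(encoding="utf-8"))
    kept = []
    for rec in records:
        if speaker is not None and rec.get("speaker") != speaker:
            continue
        if rec.get("genre") in exclude_genres:
            continue
        kept.append(rec)
    return kept

# Hypothetical usage (assumes a metadata file produced after download):
# subset = select_resources("japhug_resources.json", speaker="Speaker1")
# print(f"{len(subset)} resources selected for training")
```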

Figure 2 shows how the character error rate decreases as the amount of training data increases up to 170 minutes. Tests are currently being conducted to verify whether performance stagnates when the amount of data is increased beyond 170 minutes. As with the other experiments, the recipe described in §3.1 was used. For each amount of training data, the model was trained for 20 epochs, with the smaller training sets always being subsets of all larger sets. Figure 3 shows the training profile for a given training run using 170 minutes of data.
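The nested-subset protocol can be expressed compactly: every smaller training set is a prefix of the same fixed utterance ordering, so score differences reflect the amount of data rather than its composition. A sketch under that assumption (the duration field is hypothetical):

```python
def nested_subsets(utterances, budgets_minutes=(30, 60, 100, 140, 170)):
    """Build nested training sets: each budget keeps a prefix of the same
    fixed utterance order, so every smaller set is contained in the larger ones.
    Each utterance is assumed to carry a 'duration' in seconds."""
    subsets = {}
    for budget in budgets_minutes:
        total, prefix = 0.0, []
        for utt in utterances:            # fixed order shared by all budgets
            if total + utt["duration"] > budget * 60:
                break
            prefix.append(utt)
            total += utt["duration"]
        subsets[budget] = prefix
    return subsets

# Toy example: 2000 clips of 8 seconds each (~4.4 hours in total).
toy = [{"id": f"utt{i}", "duration": 8.0} for i in range(2000)]
print({minutes: len(utts) for minutes, utts in nested_subsets(toy).items()})
# Each subset is then used to train the same recipe for 20 epochs.
```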

Figure 2: Character error rate for Japhug as a function of the amount of training data, using the ESPnet recipe included in Elpis. (Axes: training time in minutes vs. character error rate in %.)

9 Each text has a Digital Object Identifier, allowing for one-click access. Readers are invited to take a look: https://doi.org/10.24397/pangloss-0003360

10 https://gitlab.com/lacito/pangloss-elpis

Figure 3: Character error rate on the training set (blue) and validation set (orange) for Japhug as training progresses (up to 20 epochs), using the ESPnet recipe included in Elpis.

5 Challenges concerning adoption of automatic speech recognition tools in language documentation

Devoting a section to reflections about adoption of automatic speech recognition tools in language documentation may seem superfluous here. The audience of a conference on the use of computational methods in the study of endangered languages is highly knowledgeable about the difficulties and the rewards of interdisciplinary projects, as a matter of course. But it seemed useful to include a few general thoughts on this topic nonetheless, for the attention of the broader readership which we hope will probe into the Proceedings of the ComputEL-4 conference: colleagues who may consider joining international efforts for wider adoption of natural language processing tools in language documentation workflows. We briefly address a few types of doubts and misgivings.

5.1 Is automatic speech recognition software too complex for language workers?

A first concern is that automatic speech recognition software is simply too complex for language workers. But it should be recalled that new technologies that seem inaccessible to language workers can be game-changers in linguistics. For instance, the LaTeX software is the typesetting back-end used by the journal Glossa (Rooryck, 2016) and by the publishing house Language Science Press (Nordhoff, 2018), which publish research in linguistics, offering high-quality open-access venues with no author fees or reader fees. Thus, LaTeX, a piece of software which is notorious for its complexity, is used on a large scale in linguistics publishing: Glossa publishes more than 100 articles a year, and Language Science Press about 30 books a year. Key to this success is an organizational setup whereby linguists receive not only a set of stylesheets and instructions, but also hands-on support from a LaTeX expert all along the typesetting process. Undeniably complex software is only accessible to people with no prior knowledge of it if support is available. Automatic speech recognition software should be equally accessible to language workers, given the right organization and setup. Accordingly, special emphasis is placed on user design in the Elpis project. This aspect of the work falls outside of the scope of the present paper, but we want to reassure potential users that it is clear to Elpis developers that the goal is to make the technology available to people who do not use the command line. If users can operate software such as ELAN, then they are more than equipped with the skills needed to upload ELAN files to Elpis and click the Train button.

5.2 Will the technology deliver on its promise?

A second concern among language workers is whether the technology can deliver on its promise, or whether transcription acceleration projects are a case of "digital innovation fetishism" (Ampuja, 2020). Some language workers have reported that integration of automatic transcription into the language documentation workflow (as described in Michaud et al., 2018) feels out of reach for them. There is no denying that natural language processing tools such as ESPnet and Kaldi are very complex, and that currently, the help of specialists is still needed to make use of this technology in language documentation. However, progress is clearly being made, and a motivated interdisciplinary community is growing at the intersection of language documentation and computer science, comprising linguists who are interested in investing time to learn about natural language processing and computer scientists who want to achieve "great things with small languages" (Thieberger and Nordlinger, 2006). It seems well worth investing in computational methods to assist in the urgent task of documenting the world's languages.

5.3 Keeping up with the state of the art vs. stabilizing the tool

Finally, a concern among linguists is that the state of the art in computer science is evolving so rapidly that the tool cannot be stabilized, and hence cannot be proposed to language workers for enduring integration into the language documentation workflow. In cases where significant, high-frequency updates are required to keep up with changes in speech recognition software, the investment could be too much for the relatively small communities of programmers involved in transcription acceleration projects.

Our optimistic answer is that state-of-the-art code, or code close to the state of the art, need not be difficult to integrate, use or maintain. For example, the developers of Huggingface's Transformers11 do an impressive job of wrapping the latest and greatest in natural language processing into an easy-to-use interface (Wolf et al., 2019). They have shown an ability to integrate new models quickly after their initial publication. Usability and stability of the interface are dictated by the quality of the code that is written by the authors of the backend library. If this is done well, then the state of the art can be integrated with minimal coding effort by users of the library. For this reason, we are not so concerned about the shifting sands of the underlying building blocks, but the choice of a quality backend library does count here. While it is true that there will have to be some modest effort to keep up to date with ESPnet (as would be the case using any other tool), in using ESPnet we are optimistic that the models supported by Elpis can remain up to date with the state of the art without too much hassle.

6 Further improvements

The broader context to the work reported here is a rapidly evolving field in which various initiatives aim to package natural language processing toolkits in intuitive interfaces so as to allow a wider audience to leverage the power of these toolkits. Directions for new developments in Elpis include (i) refining the ESPnet recipe, (ii) refining the user interface through user design processes, (iii) preparing pre-trained models that can be adapted to a small amount of data in a target language, and (iv) providing Elpis as a web service.

11 https://github.com/huggingface/transformers

6.1 Refining the ESPnet recipe

Refinement of the ESPnet recipe that is used in the Elpis pipeline, such that it works as well as possible given the type of data found in language documentation contexts, is a top priority. This work focuses on achieving lower error rates across data sets, starting with refining hyperparameters for model training and extending to other project objectives, including providing pre-trained models (see §6.3). This work is of a more experimental nature and can be done largely independently of the Elpis front-end.

6.2 Refining the interface

In parallel with the technical integration of ESPnet with Elpis, a user-design process has been investigating how users expect to use these new features. In a series of sessions, linguists and language workers discussed their diverse needs with a designer. The feedback from this process informed the building of a prototype interface based on the latest version of Elpis at the time. The test interface was then used in individual testing sessions to discover points of confusion and uncertainty in the interface. Results of the design process will guide an update to the interface and further work on writing supporting documentation and user guides. The details of this process are beyond the scope of this paper and will be reported separately in the future.

6.3 Pre-trained models and transfer learning

Adapting a trained model to a new language has a long history in speech recognition, having been used both for hidden Markov model based systems (Schultz and Waibel, 2001; Le and Besacier, 2005; Stolcke et al., 2006; Tóth et al., 2008; Plahl et al., 2011; Thomas et al., 2012; Imseng et al., 2014; Do et al., 2014; Heigold et al., 2013; Scharenborg et al., 2017) and end-to-end neural systems (Toshniwal et al., 2017; Chiu et al., 2018; Müller et al., 2017; Dalmia et al., 2018; Watanabe et al., 2017a; Inaguma et al., 2018; Yi et al., 2018; Adams et al., 2019). In scenarios where data in the target domain or language is limited, leveraging models trained on a number of speakers in different languages can often result in better performance. The model can learn to cope with acoustic and phonetic characteristics that are common between languages, such as building robustness to channel variability due to different recording conditions, as well as learning common features of phones and sequences of phones between languages.

In recent years, pre-training of models on large amounts of unannotated data has led to breakthrough results in text-based natural language processing, initially gaining widespread popularity with the context-independent embeddings of word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), before the recent contextual word embedding revolution (Peters et al., 2018; Devlin et al., 2019; Liu et al., 2020) that has harnessed the transformer architecture (Vaswani et al., 2017). It is now the case that the best approaches in natural language processing are typically characterized by pre-training of a model on a large amount of unannotated data using the cloze task (a.k.a. masked language model training) before fine-tuning to a target task. Models pre-trained in this way make the best use of available data, since the amount of unannotated data far outweighs annotated data and such pre-training is advantageous to downstream learning, whether a small or large amount of data is available in the target task (Gururangan et al., 2020). Despite the established nature of pre-training in natural language processing, it is less well established in speech recognition, though there has been recent work (Rivière et al., 2020; Baevski et al., 2020).

The language documentation scenario, where annotated data is very limited, is the scenario that we argue stands to gain the most from such pre-training (both supervised and self-supervised, out-of-domain), followed by model adaptation to limited target-language data. One of the features Elpis could provide is to include pre-trained models in its distribution or via an online service. Such models may be pre-trained in a self-supervised manner on lots of untranscribed speech, trained in a supervised manner on transcribed speech in other languages, or use a combination of both pre-training tasks. In cases where the pre-trained model was trained in a supervised manner, there is scope to deploy techniques to reconcile the differences in acoustic realization between phonemes of different languages via methods such as that of Allosaurus (Li et al., 2020), which uses a joint model of language-independent phones and language-dependent phonemes. Providing a variety of pre-trained models would be valuable, since the best seed model for adaptation may vary on the basis of the data in the target language (Adams et al., 2019).

A recognized problem in language documentation is that, owing to the transcription bottleneck, a large amount of unannotated and untranscribed data ends up in data graveyards (Himmelmann, 2006): archived recordings that go unused in linguistic research. It is frequently the case that the vast majority of speech collected by field linguists is untranscribed. Here too, self-supervised pre-training in the target language is likely a promising avenue to pursue, perhaps in tandem with supervised pre-training regimens. For this reason, we are optimistic that automatic transcription will have a role to play in almost all data scenarios found in the language documentation context – even when training data is extremely limited – and is not just reserved for certain single-speaker corpora with consistently high-quality audio and clean alignments with text. In the past one could plausibly argue that the limited amount of transcribed speech as training data is an insurmountable hurdle in a language documentation context, but that will likely not remain the case.

One of the next steps planned for Elpis is to allow for acoustic models to be exported and loaded. Beyond the immediate benefit of saving the trouble of training models anew each time, having a library of acoustic models available in an online repository would facilitate further research on adaptation of acoustic models to (i) more speakers, and (ii) more language varieties. Building universal phone recognition systems is an active area of research (Li et al., 2020); these developments could benefit from the availability of acoustic models for a range of languages. Hosting acoustic models in an online repository, and using them for transfer learning, appear to be promising prospects.
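In PyTorch terms, the export and load mechanics are straightforward; the generic sketch below (not the Elpis or ESPnet serialization format) keeps a pre-trained encoder while re-initialising the output layer for a new language's symbol inventory, which is the usual starting point for fine-tuning on limited target-language data.

```python
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    """Generic encoder plus a linear output layer over the phoneme/character set."""
    def __init__(self, feat_dim=80, hidden=320, vocab_size=60):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=3,
                               bidirectional=True, batch_first=True)
        self.output = nn.Linear(2 * hidden, vocab_size)

# 1. Export a model trained on a source language (or on several of them).
source = AcousticModel(vocab_size=60)
torch.save(source.state_dict(), "pretrained_acoustic_model.pt")

# 2. Adapt it to a target language with a different symbol inventory:
#    keep the encoder weights, re-initialise only the output layer.
target = AcousticModel(vocab_size=45)
state = torch.load("pretrained_acoustic_model.pt")
encoder_state = {k: v for k, v in state.items() if k.startswith("encoder.")}
missing, unexpected = target.load_state_dict(encoder_state, strict=False)
# Fine-tuning on the (small) target-language data set then starts from here.
```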

6.4 Providing Elpis as a web service

Training models requires a lot of computing power. Elpis now supports high-speed parallel processing in situations where the user's operating system has compatible GPUs (see Section 3.2). However, many users don't have this technology in the computers they have ready access to, so we also plan to investigate possibilities for hosting Elpis on a high-capacity server for end-user access. Providing language technologies via web services appears to be a successful method of making tools widely available, with examples including the WebMAUS forced-alignment tool.12 The suite of tools provided by the Bavarian Speech Archive (Kisler et al., 2017) has successfully processed more than ten million media files since its introduction in 2012. For users who want to avoid sending data to a server, there are other possibilities: Kaldi can be compiled to WebAssembly so it can do decoding in a browser (Hu et al., 2020). But for the type of user scenarios considered here, hosting on a server would have major advantages, and transfer over a secure connection is a strong protection against data theft (for those data sets that must not be made public, to follow the consultants' wishes or to protect the data collectors' exclusive access rights to the data, so that they will not be scooped in research and placed at a disadvantage in job applications).

This context suggests that it would be highly desirable to design web hosting for Elpis. It would facilitate conducting broad sets of tests training acoustic models, and would also facilitate the transcription of untranscribed recordings.

7 Conclusion

In this paper we have reported on integrating ESPnet, an end-to-end neural network speech recognition system, into Elpis, the user-friendly speech recognition interface. We described changes that have been made to the front-end, the addition of a CUDA-supported Elpis Dockerfile, and the creation of an ESPnet recipe for Elpis. We reported preliminary results on several languages and articulated plans going forward.

Acknowledgments

Many thanks to the three reviewers for comments and suggestions.

We are grateful for financial support to the Elpis project from the Australian Research Council Centre of Excellence for the Dynamics of Language, the University of Queensland, the Institut des langues rares (ILARA) at École Pratique des Hautes Études, the European Research Council (as part of the project "Beyond boundaries: Religion, region, language and the state" [ERC-609823]) and the Agence Nationale de la Recherche (as part of two projects, "Computational Language Documentation by 2025" [ANR-19-CE38-0015-04] and "Empirical Foundations of Linguistics" [ANR-10-LABX-0083]).

12 https://clarin.phonetik.uni-muenchen.de/BASWebServices/interface

Linguistic resources used in the present study were collected as part of projects funded by the European Research Council ("Discourse reporting in African storytelling" [ERC-758232]) and by the Agence Nationale de la Recherche ("Parallel corpora in languages of the Greater Himalayan area" [ANR-12-CORP-0006]).

References

Oliver Adams, Trevor Cohn, Graham Neubig, Hilaria Cruz, Steven Bird, and Alexis Michaud. 2018. Evaluating phonemic transcription of low-resource tonal languages for language documentation. In Proceedings of LREC 2018 (Language Resources and Evaluation Conference), pages 3356–3365, Miyazaki. https://halshs.archives-ouvertes.fr/halshs-01709648.

Oliver Adams, Matthew Wiesner, Shinji Watanabe, and David Yarowsky. 2019. Massively multilingual adversarial speech recognition. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 96–108, Minneapolis, Minnesota. Association for Computational Linguistics. https://www.aclweb.org/anthology/N19-1009.

Marko Ampuja. 2020. The blind spots of digital innovation fetishism. In Matteo Stocchetti, editor, The digital age and its discontents: Critical reflections in education, pages 31–54. Helsinki University Press, Helsinki. https://doi.org/10.33134/HUP-4-2.

Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477. https://arxiv.org/abs/2006.11477.

Laurent Besacier, Etienne Barnard, Alexey Karpov, and Tanja Schultz. 2014. Automatic speech recognition for under-resourced languages: A survey. Speech Communication, 56:85–100.

Rogier Blokland, Marina Fedina, Ciprian Gerstenberger, Niko Partanen, Michael Rießler, and Joshua Wilbur. 2015. Language documentation meets language technology. In Proceedings of the First International Workshop on Computational Linguistics for Uralic Languages - Septentrio Conference Series, pages 8–18. http://septentrio.uit.no/index.php/SCS/article/view/3457/3386.

Małgorzata Cavar, Damir Cavar, and Hilaria Cruz. 2016. Endangered language documentation: bootstrapping a Chatino speech corpus, forced aligner, ASR. In Proceedings of LREC 2016, Tenth International Conference on Language Resources and Evaluation, pages 4004–4011, Portorož, Slovenia.

Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Katya Gonina, et al. 2018. State-of-the-art speech recognition with sequence-to-sequence models. In ICASSP, pages 4774–4778. https://arxiv.org/abs/1712.01769.

Christopher Cox. 2019. Persephone-ELAN (software). https://github.com/coxchristopher/persephone-elan.

Emiliana Cruz. 2011. Phonology, tone and the functions of tone in San Juan Quiahije Chatino. Ph.D., University of Texas at Austin, Austin. http://hdl.handle.net/2152/ETD-UT-2011-08-4280.

Emiliana Cruz and Tony Woodbury. 2014. Finding a way into a family of tone languages: The story and methods of the Chatino Language Documentation Project. Language Documentation and Conservation, 8:490–524. http://hdl.handle.net/10125/24615.

Siddharth Dalmia, Ramon Sanabria, Florian Metze, and Alan W. Black. 2018. Sequence-based multilingual low resource speech recognition. In ICASSP. https://arxiv.org/abs/1802.07420.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics. https://www.aclweb.org/anthology/N19-1423.

Van Hai Do, Xiong Xiao, Eng Siong Chng, and Haizhou Li. 2014. Cross-lingual phone mapping for large vocabulary speech recognition of under-resourced languages. IEICE Transactions on Information and Systems, E97-D(2):285–295.

Daan van Esch, Ben Foley, and Nay San. 2019. Future directions in technological support for language documentation. In Proceedings of the Workshop on Computational Methods for Endangered Languages, volume 1, Honolulu, Hawai‘i. https://computel-workshop.org/wp-content/uploads/2019/02/CEL3_book_papers_draft.pdf#page=26.

Ben Foley, Josh Arnold, Rolando Coto-Solano, Gautier Durantin, and T. Mark Ellison. 2018. Building speech recognition systems for language documentation: the CoEDL Endangered Language Pipeline and Inference System (ELPIS). In Proceedings of the 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU), 29-31 August 2018, pages 200–204, Gurugram, India. ISCA. https://www.isca-speech.org/archive/SLTU_2018/pdfs/Ben.pdf.

Ben Foley, Alina Rakhi, Nicholas Lambourne, Nicholas Buckeridge, and Janet Wiles. 2019. Elpis, an accessible speech-to-text tool. In Proceedings of Interspeech 2019, pages 306–310, Graz. https://www.isca-speech.org/archive/Interspeech_2019/pdfs/8006.pdf.

Marc Garellek, Matthew Gordon, James Kirby, Wai-Sum Lee, Alexis Michaud, Christine Mooshammer, Oliver Niebuhr, Daniel Recasens, Timo Roettger, Adrian Simpson, and Kristine M. Yu. 2020. Toward open data policies in phonetics: What we can gain and how we can avoid pitfalls. Journal of Speech Science, 9(1). https://halshs.archives-ouvertes.fr/halshs-02894375.

Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. Proceedings of the 23rd International Conference on Machine Learning, pages 369–376. http://www.cs.utoronto.ca/~graves/icml_2006.pdf.

Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don't stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360. Association for Computational Linguistics. https://arxiv.org/abs/2004.10964.

Hossein Hadian, Hossein Sameti, Daniel Povey, and Sanjeev Khudanpur. 2018. End-to-end speech recognition using lattice-free MMI. In Interspeech, pages 12–16. https://danielpovey.com/files/2018_interspeech_end2end.pdf.

Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. 2014. Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567. https://arxiv.org/abs/1412.5567.

Georg Heigold, Vincent Vanhoucke, Alan Senior, Patrick Nguyen, Marc'Aurelio Ranzato, Matthieu Devin, and Jeffrey Dean. 2013. Multilingual acoustic models using distributed deep neural networks. In Proceedings of ICASSP, pages 8619–8623.

Nikolaus Himmelmann. 2006. Language documentation: what is it and what is it good for? In Jost Gippert, Nikolaus Himmelmann, and Ulrike Mosel, editors, Essentials of language documentation, pages 1–30. De Gruyter, Berlin/New York.

Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, and others. 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6):82–97.

Nils Hjortnaes, Niko Partanen, Michael Rießler, and Francis M. Tyers. 2020. Towards a speech recognizer for Komi, an endangered and low-resource Uralic language. In Proceedings of the Sixth International Workshop on Computational Linguistics of Uralic Languages, pages 31–37, Wien. Association for Computational Linguistics. https://www.aclweb.org/anthology/2020.iwclul-1.5/.

Mathieu Hu, Laurent Pierron, Emmanuel Vincent, and Denis Jouvet. 2020. Kaldi-web: An installation-free, on-device speech recognition system. In Proceedings of Interspeech 2020 Show & Tell, Shanghai. https://hal.archives-ouvertes.fr/hal-02910876.

David Imseng, Petr Motlicek, Hervé Bourlard, and Philip N. Garner. 2014. Using out-of-language data to improve an under-resourced speech recognizer. Speech Communication, 56:142–151.

Hirofumi Inaguma, Jaejin Cho, Murali Karthick Baskar, Tatsuya Kawahara, and Shinji Watanabe. 2018. Transfer learning of language-independent end-to-end ASR with language model fusion. arXiv:1811.02134. https://arxiv.org/abs/1811.02134.

Guillaume Jacques. 2019. Japhug. Journal of the International Phonetic Association, 49(3):427–450.

Thomas Kisler, Uwe Reichel, and Florian Schiel. 2017. Multilingual processing of speech via web services. Computer Speech & Language, 45:326–347.

Viet Bac Le and Laurent Besacier. 2005. First steps in fast acoustic modeling for a new target language: application to Vietnamese. In ICASSP.

Xinjian Li, Siddharth Dalmia, Juncheng Li, Matthew Lee, Patrick Littell, Jiali Yao, Antonios Anastasopoulos, David R. Mortensen, Graham Neubig, and Alan W. Black. 2020. Universal phone recognition with a multilingual allophone system. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8249–8253. IEEE. https://arxiv.org/abs/2002.11800.

Qi Liu, Matt J. Kusner, and Phil Blunsom. 2020. A survey on contextual embeddings. arXiv preprint arXiv:2003.07278. https://arxiv.org/abs/2003.07278.

Cécile Macaire. 2020. Alignement temporel entre transcriptions et audio de données de langue japhug. In Actes des Journées scientifiques du Groupement de Recherche "Linguistique informatique, formelle et de terrain" (LIFT), Paris. https://hal.archives-ouvertes.fr/hal-03047146.

Boyd Michailovsky, Martine Mazaudon, Alexis Michaud, Séverine Guillaume, Alexandre François, and Evangelia Adamou. 2014. Documenting and researching endangered languages: the Pangloss Collection. Language Documentation and Conservation, 8:119–135. https://halshs.archives-ouvertes.fr/halshs-01003734.

Alexis Michaud. 2017. Tone in Yongning Na: lexical tones and morphotonology. Number 13 in Studies in Diversity Linguistics. Language Science Press, Berlin. http://langsci-press.org/catalog/book/109.

Alexis Michaud, Oliver Adams, Trevor Cohn, Graham Neubig, and Séverine Guillaume. 2018. Integrating automatic transcription into the language documentation workflow: experiments with Na data and the Persephone toolkit. Language Documentation and Conservation, 12:393–429. http://hdl.handle.net/10125/24793.

Tomáš Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Markus Müller, Sebastian Stüker, and Alex Waibel. 2017. Phonemic and graphemic multilingual CTC based speech recognition. arXiv:1711.04564. https://arxiv.org/abs/1711.04564.

Sebastian Nordhoff. 2018. Language Science Press business model: Evaluated version of the 2015 model. Language Science Press, Berlin. https://doi.org/10.5281/zenodo.1286972.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, and Luca Antiga. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems: Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), pages 8026–8037, Vancouver, Canada. https://papers.nips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: global vectors for word representation. Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014), 12:1532–1543.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365. https://arxiv.org/abs/1802.05365.

Christian Plahl, Ralf Schlüter, and Hermann Ney. 2011. Cross-lingual portability of Chinese and English neural network features for French and German LVCSR. In IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 371–376.

Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely. 2011. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society. https://infoscience.epfl.ch/record/192584/files/Povey_ASRU2011_2011.pdf.

Mirco Ravanelli, Titouan Parcollet, and Yoshua Bengio. 2019. The PyTorch-Kaldi speech recognition toolkit. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6465–6469. IEEE. https://arxiv.org/abs/1811.07453.

Morgane Rivière, Armand Joulin, Pierre-Emmanuel Mazaré, and Emmanuel Dupoux. 2020. Unsupervised pretraining transfers well across languages. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7414–7418. https://arxiv.org/abs/2002.02848.

Johan Rooryck. 2016. Introducing Glossa. Glossa, 1(1):1–3. http://doi.org/10.5334/gjgl.91.

Odette Scharenborg, Francesco Ciannella, Shruti Palaskar, Alan Black, Florian Metze, Lucas Ondel, and Mark Hasegawa-Johnson. 2017. Building an ASR system for a low-resource language through the adaptation of a high-resource language ASR system: preliminary results. In International Conference on Natural Language, Signal and Speech Processing (ICNLSSP).

Tanja Schultz and Alex Waibel. 2001. Experiments on cross-language acoustic modeling. In EUROSPEECH'01, pages 2721–2724.

Andreas Stolcke, Frantisek Grezl, Mei-Yuh Hwang, Xin Lei, Nelson Morgan, and Dimitra Vergyri. 2006. Cross-domain and cross-language portability of acoustic features estimated by multilayer perceptrons. In ICASSP. http://ieeexplore.ieee.org/document/1660022/.

Nick Thieberger and Rachel Nordlinger. 2006. Doing great things with small languages (Australian Research Council grant DP0984419). https://arts.unimelb.edu.au/school-of-languages-and-linguistics/our-research/past-research-projects/great-things-small-languages.

Samuel Thomas, Sriram Ganapathy, and Hynek Hermansky. 2012. Multilingual MLP features for low-resource LVCSR systems. In ICASSP, pages 4269–4272.

Shubham Toshniwal, Tara N. Sainath, Ron J. Weiss, Bo Li, Pedro Moreno, Eugene Weinstein, and Kanishka Rao. 2017. Multilingual speech recognition with a single end-to-end model. In ICASSP. http://arxiv.org/abs/1711.01694.

László Tóth, Joe Frankel, Gábor Gosztolya, and Simon King. 2008. Cross-lingual portability of MLP-based tandem features - a case study for English and Hungarian. In INTERSPEECH.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Shinji Watanabe, Takaaki Hori, and John R. Hershey. 2017a. Language independent end-to-end architecture for joint language identification and speech recognition. In IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 265–271. https://www.merl.com/publications/docs/TR2017-182.pdf.

Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, and Nanxin Chen. 2018. ESPnet: End-to-end speech processing toolkit. arXiv preprint arXiv:1804.00015. https://arxiv.org/abs/1804.00015.

Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R. Hershey, and Tomoki Hayashi. 2017b. Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing, 11(8):1240–1253. https://www.merl.com/publications/docs/TR2017-190.pdf.

Guillaume Wisniewski, Séverine Guillaume, and Alexis Michaud. 2020. Phonemic transcription of low-resource languages: To what extent can preprocessing be automated? In Proceedings of the 1st Joint SLTU (Spoken Language Technologies for Under-resourced languages) and CCURL (Collaboration and Computing for Under-Resourced Languages) Workshop, pages 306–315, Marseille, France. European Language Resources Association (ELRA). https://halshs.archives-ouvertes.fr/hal-02513914.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2019. Huggingface's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771. https://arxiv.org/abs/1910.03771.

Jiangyan Yi, Jianhua Tao, Zhengqi Wen, and Ye Bai. 2018. Adversarial multilingual training for low-resource speech recognition. In ICASSP, pages 4899–4903.

Matthew D. Zeiler. 2012. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701. https://arxiv.org/abs/1212.5701.

Albert Zeyer, Kazuki Irie, Ralf Schlüter, and Hermann Ney. 2018. Improved training of end-to-end attention models for speech recognition. https://arxiv.org/abs/1805.03294.

Wei Zhou, Wilfried Michel, Kazuki Irie, Markus Kitza, Ralf Schlüter, and Hermann Ney. 2020. The RWTH ASR system for TED-LIUM Release 2: Improving Hybrid HMM with SpecAugment. https://arxiv.org/abs/2004.00960.