Building Real-World Finite-State Spell-Checkers With HFST (FSMNLP 2012 tutorial)

  • HELSINGIN YLIOPISTO HELSINGFORS UNIVERSITET UNIVERSITY OF HELSINKI

    Building Real-World Finite-State Spell-Checkers With HFST in FSMNLP 2012 at Donostia

    Tommi A Pirinen <[email protected]>

    August 2, 2012 University of Helsinki Department of Modern Languages

  • Outline

    Introduction

    Language Models

    Error Models

    Misc. practicalities

  • Finite-State Spell-Checking

    A simple task: go through all the words in a text to see whether they belong to the language L; if not, modify them with a relation E to fit them into the language L.

    In the finite-state world the language model L is any (weighted) single-tape finite-state automaton recognising the words of the language.

    The error model E is any (weighted) two-tape automaton that describes spelling errors, i.e. a mapping from a misspelt word to the correct ones.

    In this tutorial we build a simple WFST language model and piece together an error model from different error types stored in FSTs.
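
    In HFST terms, checking and correcting one word boils down to a couple of compositions. A minimal sketch of that pipeline, assuming a language model demo-lm.hfst and an error model errm.hfst along the lines of the ones built later in this tutorial (the file names are illustrative):

    # compile the misspelling into a path automaton, apply the error
    # relation E, restrict the results to the dictionary L, and list
    # the surviving paths, i.e. the correction candidates
    echo "cta" | hfst-strings2fst > word.hfst
    hfst-compose word.hfst errm.hfst |
      hfst-compose - demo-lm.hfst |
      hfst-fst2strings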

  • (Weighted) finite-state spell-checking graphically

    From this misspelled word as an automaton:

    [FST diagram: the misspelling "cta" as a linear three-arc automaton]

    Applying this very specific error (correction) model:

    [FST diagram: a transducer that keeps "c" and rewrites "t:a" and "a:t", i.e. swaps the last two letters]

    We can compose or intersect with a word in this dictionary:

    [FST diagram: a small dictionary automaton containing "cat" and a few related word forms]

  • With HFST Tools (but mostly generic FST algebra)

    The tools we use to build the automata are from the HFST project, http://hfst.sf.net, but most of what I show here is plain finite-state algebra, which your favourite FST toolkit should support as well. The chain of libraries that gets HFST automata spellers into desktop applications is HFST ospell→voikko→enchant (with a few exceptional applications not supporting this); see https://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstUseAsSpellChecker

  • Outline

    Introduction

    Language Models

    Error Models

    Misc. practicalities

  • Easy baseline for language model—a word-list

    The most basic language model for checking whether a word is correctly written is a word-form list: abacus, ..., zygotes, right? A fast way to build such a word list right now: grab an online text collection and split it into words. Doing it like this, we also get probabilities: P(w) = c(w)/CS, where c(w) is the count of the word and CS is the corpus size. In the end we would probably mangle the probabilities into the tropical log probabilities that our WFSTs use by default, i.e. -log P(w).
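
    For example, with CS = 20 tokens a word seen twice gets the weight -log(2/20) ≈ 2.30; a quick check of the arithmetic (plain awk, just for illustration):

    # tropical weight of a word with count 2 in a 20-token corpus
    awk 'BEGIN { printf("-log(2/20) = %f\n", -log(2/20)) }'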

  • Compiling corpus into a spell-checker with HFST

    A simple single command line (no line breaks): tr -d '[:punct:]' < demo-corpus.txt | tr -s '[:space:]' '\n' | sort | uniq -c | awk -f tropicalize-uniq-c.awk --assign=CS=20 | hfst-txt2fst -f openfst -o demo-lm.hfst

    In order: clean punctuation, tokenise, sort, count, convert to log probabilities and compile. Let's try it; you may grab a corpus from a Wikipedia dump or elsewhere, or write one on the fly if you want to test it now.
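
    The tropicalize-uniq-c.awk helper is not reproduced here; a rough equivalent sketch that skips it and feeds word-TAB-weight lines straight to hfst-strings2fst instead (CS=20 and the file names are illustrative):

    # count word forms, turn counts into tropical weights -log(c/CS),
    # and compile the weighted word list into a single automaton
    tr -d '[:punct:]' < demo-corpus.txt | tr -s '[:space:]' '\n' |
      sort | uniq -c |
      awk -v CS=20 '{ printf("%s\t%f\n", $2, -log($1/CS)) }' |
      hfst-strings2fst -j -o demo-lm.hfst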

  • A language model as an FST

    The resulting language model should look like this:

    [FST diagram: the compiled word list as a trie-like weighted automaton, with one path per word form of the demo corpus]

  • Other language models

    A word-form list is only good for very basic demoing, and mostly for isolating languages. Since any given finite-state dictionary is usable, we can use, and have used, e.g.:

    Xerox-style morphologies (lexc, twolc, xfst) compiled with foma or HFST tools

    Apertium dictionaries

    Hunspell dictionaries, with a very kludgey collection of conversion scripts

    From a morphological analyser you may want to project off the analysis tape; we currently won't use it. Weighting language models afterwards from corpora is simple; cf. for example my past publications for HFST scripts that do this.
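
    As an aside, stripping the analysis tape off an analyser could look roughly like this (a sketch; analyser.hfst is an illustrative file name, and which side carries the word forms depends on how the analyser was compiled):

    # keep only the word-form tape and minimise the result into a plain
    # weighted acceptor usable as a spell-checking language model
    hfst-project -p input -i analyser.hfst | hfst-minimize -o analyser-forms.hfst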

  • Outline

    Introduction

    Language Models

    Error Models

    Misc. practicalities

  • Building error models from scratch

    Error models are two-tape automata that map misspelled word-forms into correctly spelled ones. Basic error types / corrections covered here: typing errors (Levenshtein-Damerau), phonemic/orthographic competence errors, and commonly confused word-forms. Other relatively easy corrections not covered here: phonemic folding, errors in context, ...

  • Building error models from scratch

    Since the types of errors that can appear are very different, we now build a separate automaton component for each and then connect them together. This is a good time to look back at the dictionary to see what its weights are, and to scale the error weights accordingly using the Stetson-Harrison algorithm.

  • Basic component 1: Run of correctly spelled characters

    We assume most misspelled words will have most of their letters correct; to account for these parts we need the identity automaton: echo '?*' | hfst-regexp2fst -o anystar.hfst

    [FST diagram: a single state with an identity ? loop]
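
    To check what got compiled, any of these automata can be dumped as AT&T-style text (a quick sketch):

    # print the identity-loop automaton as text, one transition per line
    hfst-fst2txt anystar.hfst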

  • Basic component 2: Typing error

    A single typing error, as in the popular Levenshtein measure, is a) removal, b) addition or c) change of a character over a fixed alphabet, here for example just a and b; let's say every such error is equally likely, with weight 10: echo 'a:0::10 | 0:a::10 | a:b::10 | b:0::10 | 0:b::10 | b:a::10' | hfst-regexp2fst -o edit1.hfst

    [FST diagram: the resulting single-edit transducer, with one weighted arc per insertion, deletion and substitution]
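
    Typing all the pairs by hand gets tedious for a real alphabet; a hypothetical shell loop that writes the same kind of regexp for a larger character set (the alphabet and the weight are illustrative, and this is only a stand-in for the editdist.py helper used later in this tutorial):

    # generate removal, addition and substitution pairs for every
    # character in ALPHA and compile them into the single-edit model
    ALPHA="a b c d e"
    REGEX=""
    for x in $ALPHA; do
      REGEX="$REGEX${REGEX:+ | }$x:0::10 | 0:$x::10"
      for y in $ALPHA; do
        [ "$x" = "$y" ] || REGEX="$REGEX | $x:$y::10"
      done
    done
    echo "$REGEX" | hfst-regexp2fst -o edit1.hfst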

  • Excursion 1: Swap in the Levenshtein-Damerau measure

    The fourth error type in the edit distance algorithm, transposition or swapping of adjacent characters, is useful, but it makes the automaton a bit messy and tedious to write (which is why we have a few scripts to generate them). This is the additional component that must be added for each pair of characters in the alphabet: echo 'a:b::10 (b:a) | b:a::10 (a:b)' | hfst-regexp2fst

  • Edit distance automata

    Combining the correctly written parts and a single edit gets us a nice baseline error model: hfst-concatenate anystar.hfst edit1.hfst | hfst-concatenate - anystar.hfst -o errm1.hfst

    Since it’s just an automaton, all regular tricks of FST algebra just work, particularly: hfst-repeat -f 1 -t 7 -i errm1.hfst -o errm7.hfst

    (A script to automate writing of the combinations of swaps: python editdist.py -d 1 -S -a demo.hfst | hfst-txt2fst -o edit1.hfst)
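
    Larger edit distances blow up the automaton quickly, so it pays to keep an eye on its size; a quick way to compare the two models built above (a sketch):

    # print state and arc counts and other properties of the error models
    hfst-summarize errm1.hfst
    hfst-summarize errm7.hfst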

  • Edit distance automaton graphically

    [FST diagram: the distance-1 error model, i.e. identity loops around a single weighted edit]

  • Common orthography-based competence errors

    English orthography sometimes leads to errors that are not easily solvable by basic edit distance, so we can extend the edit distance errors with a set of commonly confused substrings (N.B. the input to hfst-strings2fst expects a TAB between the string and the weight; weight 5 makes these a bit more likely than the basic edits): hfst-strings2fst -j -o en-orth.hfst
    of:ough 5
    ier:eir 5

    [FST diagram: the compiled confusion-pair transducer for of:ough and ier:eir]
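
    Typing the pairs interactively works, but keeping them in a file is handier; a sketch doing the same from a file (en-orth.strings is an illustrative name, \t stands for a literal TAB):

    # write the confusion pairs, TAB-separated from their weights,
    # and compile them into a single transducer
    printf 'of:ough\t5\nier:eir\t5\n' > en-orth.strings
    hfst-strings2fst -j -o en-orth.hfst < en-orth.strings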

  • Edit Distance Plus Orthographic Errors

    Combine the orthographic errors with the edit1 errors before making the full-word-covering error model: hfst-disjunct edit1.hfst en-orth.hfst -o edit1+en-orth.hfst

    And the rest as before: hfst-concatenate anystar.hfst edit1+en-orth.hfst | hfst-concatenate - anystar.hfst -o errm.edit1+en-orth.hfst

  • Confusion set of words

    A very simple yet very important piece of the error model: common word-specific mistakes, again compiled from a list of misspelling:correction pairs: hfst-strings2fst -j -o en-confusables.hfst
    teh:the 2.5
    mispelled:misspelt 2.5
    conscienchous:conscientious 2.5

    [FST diagram: the compiled confusables transducer, one weighted path per misspelling:correction pair]

  • Finally: hacking all these together

    Adding one more component to the mixture; the full-word errors can simply be disjuncted into any existing full error model: hfst-disjunct en-confus
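
    A sketch of how that disjunction might look with the file names used above (errm.final.hfst is an illustrative output name):

    # fold the full-word confusables into the error model built so far
    hfst-disjunct en-confusables.hfst errm.edit1+en-orth.hfst -o errm.final.hfst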