
Extraction of synonyms and semantically related words from chat logs

Fredrik Norlindh

Uppsala University
Department of Linguistics and Philology
Master’s Programme in Language Technology
Master’s Thesis in Language Technology

November 20, 2012

Supervisors:
Mats Dahllöf, Uppsala University
Sonja Petrović Lundberg, Artificial Solutions


Abstract

This study explores synonym extraction from domain specific chat log collections by means of a tool-kit called JavaSDM. JavaSDM uses random indexing and measures distributional similarities. The focus of this study is to evaluate the effect of different preprocessing operations on training data and of different extraction criteria. Four chat log collections containing approximately 1,000,000 tokens each were compared: one English and one Swedish from a retail company, and one English and one Swedish from a travel company. One gold standard was based on synonym dictionaries and one was a manually extended version of that gold standard. The extended gold standard included antonyms, misspellings and near-related hyponyms/hypernyms/siblings and was about 20% bigger. On average, around two of the extracted synonym candidates per test word were falsely classified as incorrect because they were not included in the dictionary-based gold standard.

Precision, recall and f-score were computed. Test words were either nouns, verbs or adjectives. The f-scores were three to five times higher when using the extended gold standard.

The best f-scores were achieved when the training data had been lemmatized. POS-tagging improved precision but decreased recall and decreased the number of misspellings extracted as synonyms. A cosine similarity score threshold of 0.5 could be used to increase precision and f-score without substantially decreasing recall.


Contents

Acknowledgments

1 Introduction

2 Background
2.1 Synonymy and other lexical sense relations
2.2 Usage of extracted synonyms
2.3 Random Indexing
2.4 Synonym Extraction Methods
2.5 Evaluation Methods

3 Data and Method
3.1 Data
3.2 Preprocessing
3.2.1 Extraction of User Input from Chat Logs
3.2.2 Tokenization
3.2.3 Part-Of-Speech Tagging
3.2.4 Lemmatization
3.2.5 Stop Word Lists
3.3 Synonym Extraction by Random Indexing
3.3.1 Random Indexing Tool-Kit
3.3.2 Extraction Criteria
3.4 Evaluation
3.4.1 Test Words
3.4.2 Gold Standard
3.4.3 Adapted Gold Standard

4 Results
4.1 Tests
4.2 Threshold
4.3 Part-of-speech tagging
4.4 Lemmatization
4.5 The different gold standards
4.6 Precision
4.7 Recall
4.8 F-score
4.9 Comparison Overview

5 Conclusions
5.1 Overview of the results
5.2 Discussion
5.2.1 Future Study suggestions

References


Acknowledgments

I would like to thank Mats Dahllöf for continuous support and feedback throughout this project.

I am greatly appreciative to Artificial Solutions for the opportunity to do this project with a leading language technology company and for the use of their data. I especially want to thank my supervisor Sonja Petrović Lundberg.

I would also like to thank Per Starbäck for the time he took to read my thesis and the advice he gave me regarding formatting aesthetics, and Jörg Tiedemann for helpful comments and feedback.


1 Introduction

This study explores synonym extraction from chat log collections by means of an open source tool-kit called JavaSDM. JavaSDM measures distributional similarities and uses a method called random indexing. The focus of this study is to evaluate the effects of different preprocessing operations on training data and of different extraction criteria. Preprocessing operations are, for example, lemmatization and POS-tagging. An extraction criterion is, for example, a similarity threshold t which says that words with a lower similarity score than t are not to be considered as synonym suggestions. The study is intended to help the automatic part of a semi-automatic synonym extraction process (a process where humans extract synonyms from automatically extracted synonym candidates).

The effects these operations and criteria have on two languages and two domains will be compared. Artificial Solutions provided four different collections of chat logs which contain approximately 1,000,000 tokens each: one Swedish and one English collection from a travel company, and one Swedish and one English collection from a retail company.

Four gold standards were created for each chat log corpus: one dictionary-based and one adapted gold standard for graph words, and one dictionary-based and one adapted gold standard for lemmas. The dictionary-based gold standards were based on existing synonym dictionaries but only included words that occurred in the input data. Inflections were added to the graph word based gold standards. The adapted gold standards were created by manually reading the synonym proposals that were not included in the dictionary-based gold standards; the proposals that actually were correct were added to the adapted gold standard, which also includes all words from the dictionary-based gold standard. Graph word data and lemmatized data are different things, but, unless a combination of both is created and the JavaSDM code is altered, exactly one of them must be chosen for synonym extraction with JavaSDM.


2 Background

Section 2.1 discusses when words can be seen as synonyms. Section 2.2 exemplifies how extracted synonyms can be used. Section 2.3 is about random indexing, which is implemented by JavaSDM. Section 2.4 gives an overview of different synonym extraction approaches that have been used in prior studies. Section 2.5 is about ways synonym extractors have been evaluated.

2.1 Synonymy and other lexical sense relations

A synonym is a word having the same or nearly the same meaning as another word, for example “pal” and “friend” or “jump” and “leap”. Creating an absolute definition of how similar a pair of words need to be to be synonyms is almost impossible or at least rather subjective (Edmonds and Hirst (2002)). Are, for example, “buxom”, “chubby”, “plump”, “obese” and “fat” all synonyms? The lexical database WordNet has chosen to classify their relation as “similar” instead of “synonym”. The difficulty of synonym extraction evaluation is illustrated by the fact that not even professional lexicographers always agreed on whether two words are synonyms when evaluating a synonym extractor (Plas and Bouma (2004); see Section 2.5 about evaluation methods).

Polysemous and homonymous graph words have different senses. This means that if two graph words are synonyms in one situation they are not necessarily synonyms in other situations. For example, “I don’t buy her story” could be replaced with “I don’t believe her story”, but “I will buy a new house” can’t be replaced by “I will believe a new house”.

Hyponyms and hypernyms can be synonymous, especially within specific domains. For example, a lot of companies have a “XXX membership card” and, due to the obvious reference, the company and their customers often only say “membership card”, although there are lots of other membership cards in the world. Whether a hypernym and a hyponym can be used as synonyms depends on how precise they are and on the context. For example “thing”, which is a hypernym of everything, is not to be considered a synonym for everything.

“Marabou” (the biggest chocolate company in Sweden today) is a hyponym often used as a synonym/referent to “choklad” (“chocolate”) in Sweden. Proper nouns, such as “Marabou”, are generally not included in synonym dictionaries, but there are exceptions. For example, “MAC” and “computer” are listed as synonyms in English thesauri.

Conceptual siblings are words that are hyponyms “on the same level” of the same hypernym. For example, “father” and “mother” are siblings that are used to tell that a person is the parent of a child. Another example of conceptual siblings are products that have different names and/or are created in a slightly different way. This is very common at pharmacies in, for example, Sweden, where the pharmacists occasionally say they are out of the medicine you ask for or that they have a similar cheaper medicine that you should pick according to the laws regarding high-cost protection. Siblings can also be antonymous, for example “man” and “woman”. Antonymous word pairs often have paradigmatic relations, which means that it is possible to replace one word, for example “bad”, with its antonym “good”, and still have a syntactically and semantically coherent sentence, and due to irony they can mean the same thing. Combinations of negations and antonyms can be used as a synonym phrase; for example, one can say “Not bad!” instead of “Good!”. Actually, “bad” can be positive in a slang sense, for example “he’s a baaad dude” (Cruse (2002)). Since antonyms often represent opposite polarities on the same “scale”, they are often used in the same contexts, for example “I hate/love you”. This is why they might be suggested as synonym candidates when measuring distributional similarity.

In many situations the positive/negative energy of a word is used to decide whether two words can be used to send the same message. For example, the definitions of “stupid” and “ugly” are not synonyms, but in sentences such as “that dress looks ugly/stupid” they are both used to say that something does not look good because they carry the same negative energy. The affective meaning of words is, for example, studied by Strapparava and Mihalcea (2007).

Another relation that has not been mentioned in earlier synonym extraction projects is misspelling. Because the training data in this project is unedited chat data, misspellings occur. In fact, some words were more frequently misspelled than correctly spelled.

Sometimes users of a dialogue system mix words from different languages. During the Swedish tests in this study, for example, “infant” was extracted as a synonym for “bebis” (“infant”) because they were used in similar contexts. In such cases the translation should be considered a synonym.

2.2 Usage of extracted synonyms

Semi-automatic synonym extraction is probably the best way to find many synonyms of good quality. Muller et al. (2006) state that synonym extractors are supposed to complement existing thesauri or to help targeting a specific domain. Specific domains might use words in new senses or new words that are not used outside the domain. Existing thesauri don’t include all possible synonyms in a language, and new words and new word senses might appear every day. Automatic synonym extraction is a much more effective way to find synonym candidates than letting lexicographers think or analyze corpora by themselves; lexicographers can then weed out bad candidates from a set of proposals. Synonyms are relevant for dialogue systems to understand users and to widen their own vocabulary. Dialogue systems, such as those created by Artificial Solutions, are often domain specific.

Plas and Bouma (2004) exemplify why it might be a good thing if hypernyms, hyponyms and conceptual siblings are found. If a question answering system is asked “What actor is used as Jar Jar Binks voice?”, the system looks for the name of a person if it knows that an actor “IS-A-person”. If a question is “What is the profession of Renzo Piano?” and the system has access to a document saying “Renzo Piano is an architect and an Italian”, then the knowledge that architect is a profession would help answering the question.

The knowledge that product Y is a similar alternative to product X, as in the pharmacy example in Section 2.1, is useful when a customer is looking for something that, for example, is not at hand or is more expensive than a similar product.

2.3 Random Indexing

JavaSDM, which is the tool-kit used for synonym extraction in this study, uses random indexing. Word space models use distributional statistics to generate high-dimensional vector spaces, in which words are represented by context vectors whose relative directions are assumed to indicate semantic similarity. Words with similar meanings tend to occur in similar contexts. Contexts can be either whole documents or context windows. A context window can include x words directly before the focus word and y words directly after.

Random indexing is a method based on an incremental word space model which accumulates context vectors based on the occurrence of words in contexts. Each word in the text is assigned a sparse random index vector with dimensionality usually chosen to be in the range of a couple of hundred up to several thousands, depending on the size and redundancy of the data. A small number of the elements of the vector are randomly distributed +1’s, an equal number of elements are distributed -1’s, and the rest are set to 0. The index vectors are used to construct context vectors for each word in the text (Sahlgren (2005) and Hassel (2007)). This is exemplified by Figure 2.1.

Figure 2.1: A context window focused on the token “merchandise”, taking note of the co-occurring tokens. The row cv represents the continuously updated context vectors, and the row rl represents the static randomly generated index labels. The grey fields are not involved in this update example.
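As a concrete illustration, the following is a minimal Python sketch of this update step, assuming the default-style parameters described in Section 3.3.1 (dimensionality 1000, random degree 8, a symmetric window of four words, and MangesWS weighting). The toy corpus and all names are illustrative, not JavaSDM’s actual API:

import numpy as np

rng = np.random.default_rng(0)
DIM, RANDOM_DEGREE = 1000, 8  # random degree 8 = four +1's and four -1's

def index_vector() -> np.ndarray:
    """Create a sparse random index label (rl) for one word type."""
    v = np.zeros(DIM)
    pos = rng.choice(DIM, RANDOM_DEGREE, replace=False)
    v[pos[: RANDOM_DEGREE // 2]] = 1.0
    v[pos[RANDOM_DEGREE // 2 :]] = -1.0
    return v

tokens = "can i buy the merchandise online to be picked up".split()
index = {t: index_vector() for t in set(tokens)}    # static random labels (rl)
context = {t: np.zeros(DIM) for t in set(tokens)}   # accumulated vectors (cv)

WINDOW = 4  # words considered on each side of the focus word
for i, focus in enumerate(tokens):
    for j in range(max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)):
        if j != i:
            weight = 2.0 ** (1 - abs(i - j))  # MangesWS dampening (Section 3.3.1)
            context[focus] += weight * index[tokens[j]]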

Sahlgren (2005) states that the most well-known word space alternatives to random indexing, such as Latent Semantic Analysis (LSA), first construct a huge co-occurrence matrix and then use a separate time- and memory-consuming dimension reduction phase. He lists four advantages of random indexing compared to other word space methodologies:

• The context vectors can be used for similarity computations even after just a few examples have been encountered. (Ferret (2010) confirmed this by trying different frequency thresholds deciding which words got their own context vectors, i.e. which words could be extracted, and found that such thresholds only lowered results.) By contrast, most other word space models require the entire data to be sampled before similarity computations can be performed.

• Random indexing uses “implicit” dimension reduction, since the fixed dimensionality d is much lower than the number of contexts c in the data. This leads to a significant gain in processing time and memory consumption as compared to word space methods that employ computationally expensive dimension reduction algorithms.

• The dimensionality d of the vectors is a parameter in random indexing. This means that d does not change once it has been set. When encountering new data, the values of the elements of the context vectors increase, but never their dimensionality. In HAL (Hyperspace Analogue to Language) and LSA, matrices are created of size T × T or D × T, where D is the number of indexed documents and T is equal to the number of unique terms found in the data set so far (Hassel (2007)).

• Random indexing can be used with any type of context. Other word space models typically use either documents or words as contexts. Random indexing is not limited to these naive choices, but can be used with basically any type of context.

2.4 Synonym Extraction Methods

There have been plenty of papers written about synonym extraction methods. Three important approaches are:

1. Distributional approaches to monolingual corpora, for example Rosell et al. (2009) and Ferret (2010). This method, which is used by JavaSDM, measures distributional similarities within a corpus. The approach is based on the distributional hypothesis (Harris (1954)) that words that occur in similar contexts also tend to have similar meanings/functions. Systems using this approach provide ranked lists of semantically related words according to their context similarities. Random indexing is a common method for this (Gorman and Curran (2006)).

2. Alignment of bi- or multilingual corpora, for example van der Plas and Tiedemann (2006) and Blondel and Senellart (2002). This method uses two or more parallel corpora and is based on the assumption that two words are similar if they have the same translation. If two words also share translational contexts they are even more similar. This is measured in a way similar to the distributional approaches to monolingual corpora. Automatic word alignment can be used to find the translations in parallel corpora. This approach is known as a way to separate synonyms from other semantic relations; for example, “apple” is typically not translated with a word for “fruit” nor “pear”.

3. Dictionary-based approaches, for example Wang and Hirst (2012) and Muller et al. (2006). This method measures distributional similarities among definitions in dictionaries. It is similar to the distributional approaches to monolingual corpora, but specifically a dictionary is used instead. Hence, relatedness of words is measured using distributional similarity in the same way as in the monolingual case but with a different type of context.

The two latter approaches are not relevant for this study since each chat data collection is a monolingual corpus. A method with a distributional approach assesses the degree of synonymy between words according to their co-occurrence patterns within text corpora, under the assumption that similar words tend to appear in similar contexts (Wang (2009)). For example, van der Plas and Tiedemann (2006) mention suggestions of hyponyms, antonyms and hypernyms as synonyms as a problem encountered with distributional similarities, but as exemplified in Section 2.1 such synonym proposals are not necessarily negative.

Wu and Zhou (2003) used a bilingual method, a monolingual dictionary and a monolingual corpus, and concluded that the corpus based method can get nuanced synonyms or near synonyms which generally cannot be found in manually built thesauri, exemplified by “handspring → handstand”, “audiology → otology”, “roisterer → carouser” and “parmesan → gouda”.

2.5 Evaluation Methods

The validation of a method’s ability to propose synonyms can be done with a test using a gold standard or with an extrinsic evaluation.

Muller et al. (2006) state that “Comparing to already existing thesaurus is a debatable means when automatic construction is supposed to complement an existing one, or when a specific domain is targeted”. They also say that, in order to get an indication of the extraction quality, manual verification of a sample of synonym proposals is a common practice. Time is the reason only a sample of the proposals is verified. Such verification is normally done either by the authors of a study or by independent lexicographers. The larger the sample, the more reliable the conclusions, but the more time is required. Automatic verification can verify more proposals in much shorter time but is less reliable.

The simplest evaluation measure of synonym extractions is a comparison with manually created thesauri (Wu and Zhou (2003)). Precision, recall and f-score are the most common measures. Gold standard words are often taken from one or several thesauri such as WordNet, Roget’s, the Macquarie and Moby. If a word was polysemous, all the synonyms of all its senses were included in the gold standard in prior studies, for example Wu and Zhou (2003), Curran and Moens (2002) and van der Plas and Tiedemann (2006). Wang and Hirst (2012), who evaluated a dictionary method, tried to reduce this “problem” by using a POS-tagger. In their study a synonym proposal was considered correct if it both got the same POS-tag as the test word and was included in the same synset in WordNet. Test data can be created using different techniques. van der Plas and Tiedemann (2006) automatically tagged all words and made a list of all words tagged as nouns with a frequency of 4 or more. Then they randomly used 1,000 of these as test words, and the synonyms found in Dutch EuroWordnet for these words were used as gold standard. A correct synonym extraction, for example “e-mail” (“e-mail”) for “e-postadress” (“e-mail address”), might be considered incorrect because it is not included in the thesauri. They had 10 lexicographers evaluate a sample of 100 incorrect suggestions; 10 of the 100 word pairs were classified as synonyms by all evaluators, and 37 word pairs were classified as synonyms by more than half of the human evaluators.

There have been several other evaluation methods, for example TOEFL (Test of English as a Foreign Language), used by, for example, Ferret (2010) and Wang (2009). TOEFL is originally an obligatory test for non-native speakers of English who intend to study at a university with English as the teaching language. The synonym extractor is supposed to pick the one out of four words that is most similar to a test word (Ferret (2010)).

Another method, clustering, has been used by Rosell et al. (2009). First they assigned every word in their data to one of several clusters based on the similarity measure. A test group had graded the similarities between word pairs from 0 to 5 to build a synonym dictionary. The word pairs with a mean grade of 3.0 to 5.0 were put on a list. They used this list of synonym pairs to see if the words ended up in the same cluster or not. If a word pair graded as synonyms by the users had ended up in the same cluster, it was counted as correct.

van der Plas and Tiedemann (2006) acknowledged that their conclusions about precision when using the monolingual distributional approach might have been a bit unfair, since they always extracted the same number of synonym candidates no matter how low the similarity scores were. Wu and Zhou (2003) tried different similarity thresholds and showed that a higher threshold increases precision but also decreases recall, and vice versa.

Wu and Zhou (2003) describe the evaluation alternatives for synonyms extracted from a corpus as follows: “Comparing the results with the human-built thesauri is not the best way to evaluate synonym extraction because the coverage of the human-built thesaurus is also limited. However, manually evaluating the results is time consuming. And it also cannot get the precise evaluation of the extracted synonyms. Although the human-built thesauri cannot help to precisely evaluate the results, they can still be used to detect the effectiveness of extraction methods.”


3 Data and Method

In this chapter, Section 3.1 discusses the data used in this study. Section 3.2 is about the preprocessing operations and the tools used for them. Section 3.3 describes the random indexing tool-kit and the extraction criteria used in this study. Section 3.4 is about the creation of test data and gold standards.

Figure 3.1 gives an overview of the extraction experiments.

Figure 3.1: A flow chart of the extraction experiments

3.1 Data

Artificial Solutions provided four chat log collections with sizes of approximately 1,000,000 tokens:

• Swedish Retail: the Swedish chat log collection from the retail company

• English Retail: the English chat log collection from the retail company

• Swedish Travel: the Swedish chat log collection from the travel company

• English Travel: the English chat log collection from the travel company

A chat log file could look like this:


<SESSION ID="-H-vkmsKoZ" ADMIN="no" CONTINUED="no">
  <INFO>
    <BEGIN LOAD="7" TIME="1304203914 2011-05-01 00:51:54"/>
    <END LOAD="7" TIME="1304204091 2011-05-01 00:54:51"/>
    <BOT NAME="XXXX"/>
    <LOG VERSION="2"/>
    <KBLOAD TIME="1302168630 2011-04-07 11:30:30"/>
    <TRANSACTIONS COUNT="4"/>
  </INFO>
  <TRANSACTION TIME="1304203914 2011-05-01 00:51:54" DURATION="26">
    <QUESTION></QUESTION>
    <ANSWER ID="50888" EMOTION="Happy.Happy"
        CATEGORY="Library.Dialogue.Initial">Welcome to XXX.
        How can I help you today?</ANSWER>
    <RECOGNITION AUTO_RANK="1" AUTO_SUBRANK="11" CUSTOM_RANK="1"
        CUSTOM_SUBRANK="1"/>
  </TRANSACTION>
  <TRANSACTION TIME="1304203936 2011-05-01 00:52:16" DURATION="457">
    <QUESTION>hi</QUESTION>
    <ANSWER ID="50890" EMOTION="Nodding.Happy"
        CATEGORY="Library.Dialogue.Positive">Welcome.</ANSWER>
    <RECOGNITION AUTO_RANK="3" AUTO_SUBRANK="11" CUSTOM_RANK="17"
        CUSTOM_SUBRANK="11"/>
  </TRANSACTION>
  <TRANSACTION TIME="1304203950 2011-05-01 00:52:30" DURATION="342">
    <QUESTION>i want pepper mill</QUESTION>
    <ANSWER ID="50486" EMOTION="Action.Pointing_left"
        CATEGORY="Library.User.Personal.Feelings">You can find
        information about it on the webpage I've just opened
        for you.</ANSWER>
    <RECOGNITION AUTO_RANK="3" AUTO_SUBRANK="11" CUSTOM_RANK="14"
        CUSTOM_SUBRANK="6"/>
  </TRANSACTION>
  <TRANSACTION TIME="1304204089 2011-05-01 00:54:49" DURATION="385">
    <QUESTION>thanks for your help</QUESTION>
    <ANSWER ID="50464" EMOTION="Happy.Smiling"
        CATEGORY="Library.Dialogue.Positive">You are welcome.</ANSWER>
    <RECOGNITION AUTO_RANK="7" AUTO_SUBRANK="11" CUSTOM_RANK="17"
        CUSTOM_SUBRANK="11"/>
  </TRANSACTION>
</SESSION>

Text categorized as “QUESTION” was written by users and constitutes the training corpora in this study. The actual user input looks like this:

hi
i want pepper mill
thanks for your help

3.2 Preprocessing

3.2.1 Extraction of User Input from Chat Logs

The actual user input was extracted from the chat logs using regular expressions in a Linux terminal.
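The thesis does not give the exact commands; the following Python sketch shows the same idea, pulling everything between the QUESTION tags (the file name is hypothetical):

import re

# Everything between <QUESTION> and </QUESTION> is user input.
question_re = re.compile(r"<QUESTION>(.*?)</QUESTION>", re.DOTALL)

def extract_user_input(chat_log: str) -> list[str]:
    """Return all non-empty user utterances from one chat log file."""
    return [q.strip() for q in question_re.findall(chat_log) if q.strip()]

with open("chatlog.xml", encoding="utf-8") as f:  # hypothetical file name
    for utterance in extract_user_input(f.read()):
        print(utterance)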

3.2.2 Tokenization

The JavaSDM tokenizer was not fully appropriate for noisy data; for example, character sequences like “,candy” and “broken,it” were not split up. I wrote a new tokenizer which splits up tokens that include one or more punctuation marks. It does not split up hyphenated words, and it uses abbreviation lists to avoid splitting up abbreviations. The abbreviation lists were created by extracting Swedish abbreviations from the Stockholm-Umeå Corpus and Talbanken and English abbreviations from the Penn Treebank.
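A simplified sketch of such a tokenizer; the actual implementation is not reproduced in the thesis, and the abbreviation entries here are illustrative only:

import re

# In practice these would be extracted from SUC/Talbanken (Swedish) or
# the Penn Treebank (English); the three entries here are illustrative.
ABBREVIATIONS = {"e.g.", "i.e.", "etc."}

def tokenize(line: str) -> list[str]:
    tokens = []
    for tok in line.split():
        # Keep known abbreviations and hyphenated words intact.
        if tok in ABBREVIATIONS or re.fullmatch(r"\w+(-\w+)+", tok):
            tokens.append(tok)
        else:
            # Split off punctuation glued to words, e.g. ",candy" or "broken,it".
            tokens.extend(re.findall(r"\w+|[^\w\s]", tok))
    return tokens

print(tokenize("it arrived broken,it was sad"))
# ['it', 'arrived', 'broken', ',', 'it', 'was', 'sad']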

3.2.3 Part-Of-Speech Tagging

The open source Hidden Markov Model tagger Hunpos was used to POS-tag the data before training. Swedish models trained by Beáta Megyesi on the Stockholm-Umeå Corpus and English models trained on the Wall Street Journal corpus were downloaded from http://stp.lingfil.uu.se/~bea/resources/hunpos/ and http://code.google.com/p/hunpos/.

The tagging was not more fine-grained than “_a” = adjective, “_n” = noun, “_v” = verb and “_other” = any other word class. POS-tagging introduces a distinction between homonyms of different word classes. Morphological tagging can be useful for separating homonyms, but during pre-tests it gave lower results than the chosen POS-tagging (see Section 5.2).

The test words were either nouns, verbs or adjectives (see Section 3.4.1). After Hunpos tagging, past participle tags were converted to adjective tags because the two are very similar, and in Svenska Akademiens ordlista (SAOL, the dictionary of the Swedish Academy) several words, for example “intresserad” (“interested”), are classified both as adjective and perfect participle. According to Holmer (2009), a past participle has a function somewhere between adjective and verb. If a past participle has been used frequently enough with an adjective rather than a verb function, it is included as an individual lexeme classified as adjective in SAOL. Proper nouns were seen as nouns since both domains in this study have named products. As mentioned in Section 2.1, names can be near-synonyms, for example “MAC” and “computer” or “Marabou” and “chocolate”.

Tests using the dictionary-based gold standards were run to find indications of whether to tag the whole corpus or only verbs, nouns and adjectives, since they are the only classes included in the test data. These tests, which were done before any manual correction, showed results favoring tagging only nouns, verbs and adjectives. This POS-tagging tag set was consequently used in this study. The corpora could look like this:

the
fat_a
lady_n
sings_v

POS-tagged data were required by the lemmatization tools.

3.2.4 Lemmatization

The open source lemmatizer Lempas by Silvia Cinkova was used for Swedish (http://ufal.mff.cuni.cz/~cinkova/). A lemmatizer from the NLTK tool-kit was used for English (http://nltk.org/). Data needed to be POS-tagged before lemmatization could be performed by these lemmatizers. The purpose is to group the inflected forms of a word as a single item (http://www.collinsdictionary.com/dictionary/english/lemmatisation). If, for example, the graph words “run” and “runs” occur 5 times each, their lemma occurs 10 times if the data is lemmatized.
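For English this could look like the sketch below; the thesis names the NLTK tool-kit but not the exact lemmatizer class, so the use of WordNetLemmatizer is an assumption:

# Requires: nltk.download("wordnet")
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# The POS information from the tagging step is needed to lemmatize correctly.
print(lemmatizer.lemmatize("runs", pos="v"))    # run
print(lemmatizer.lemmatize("babies", pos="n"))  # baby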

Cases where it is important to distinguish between different inflections are a disadvantage of lemmatization. For example, the plural form “tillgångar” (assets) often means money, which is not what its lemma form “tillgång” (asset/access) is used for.

3.2.5 Stop Word Lists

Both Swedish and English stop word lists were taken from the NLTK tool-kit (http://nltk.org/). A regular expression for tokens that do not contain any alphabetic character was added to the stop word lists.

A few quick tests were enough to clearly see that the use of stop word lists improved precision; therefore all tests used stop word lists. JavaSDM can read a given list of stop words and remove all those words from the training data before doing any training. Tests using the dictionary-based gold standards indicated, however, that it is better to keep stop words during training and instead use the list of stop words to stop those words from being extracted as synonym proposals (see Section 5.2 for a discussion of why). For that reason, and in order not to double the number of experiments and manual evaluation, that method was used in the experiments.
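A minimal sketch of this filtering step, assuming NLTK’s English stop word list (the Swedish list works the same way):

# Requires: nltk.download("stopwords")
import re
from nltk.corpus import stopwords

STOP = set(stopwords.words("english"))
has_letter = re.compile(r"[A-Za-zÅÄÖåäö]")  # tokens must contain a letter

def filter_proposals(candidates: list[str]) -> list[str]:
    """Drop stop words and non-alphabetic tokens from extracted candidates."""
    return [c for c in candidates if c not in STOP and has_letter.search(c)]

print(filter_proposals(["ugly", "the", "123", "pretty"]))
# ['ugly', 'pretty']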

3.3 Synonym Extraction by Random Indexing

3.3.1 Random Indexing Tool-Kit

The tool-kit called JavaSDM (http://www.csc.kth.se/~xmartin/java/), which was written by Martin Hassel, is the software used to extract synonyms in this study. It is an open source tool-kit which uses random indexing (more about random indexing in Section 2.3). JavaSDM can use one or more documents to train a model for synonym extraction. The model can be given a list of test words and provide a ranked list of synonym candidates for each test word.

Dimensionality, random degree, weighting scheme and context window size are training parameters that can be altered. All these parameters affect synonym extraction, but an examination of them is beyond the scope of this study. For example, 100 models would have to be created for each chat log corpus in order to find the best context window combining 1–10 words before and 1–10 words after the focus word. Each of these models would need to be combined with the other parameters, which means 100 times more models. Each new parameter setting multiplies the number of tests. For this reason, default settings were used in the experiments.

• A lower dimensionality without decreased extraction quality is desired because it means less time and memory consumption. Where that point lies depends on a combination of the size and redundancy of the data. The best dimensionality is usually “between a couple of hundred and several thousands depending on the size and redundancy of the data” (Hassel (2007)). The default dimensionality is 1000.

• Random degree is the number of random +1’s and -1’s in each random vector. The default random degree is 8, which means four +1’s and four -1’s.

• Context window size is the number of words before and after the focus word that will be considered as the context for each word in the text. When scanning through the text, the random index vectors of the neighbors in the sliding window are added to the context vector of the current focus word (Sahlgren (2005) and Hassel (2007)). A brief context window test gave better results for symmetrical windows with 4 or 6 words before and after the focus word than for a window with 5 words before and after the focus word, which indicates that the effect of the window size is not obvious. A symmetric context window with 4 words on each side of the focus word was the default setting.

• The three simplest weighting schemes available in JavaSDM are:

– ConstantWS: Using a constant weight makes the weighting word order independent.

– MangesWS: Calculates the weight based upon the distance to the current label as weight = 2^(1 − distance to focus word).

– MartinsWS: Calculates the weight based upon the distance to the current label as weight = 1/(distance to focus word).

“Manges weighting scheme” is the default weighting scheme. It weights with exponential dampening, 2^(1−d), where d is the distance between the focus word and the neighbor. This weighting scheme has, for example, been used in papers by the creator of JavaSDM (Rosell et al. (2009)). When the context window has four words on each side of the focus word, the Manges, Martins and Constant weighting schemes define weights like this:


          can   i     buy   the   merchandise  online  to    be    pick
Manges    1/8   1/4   1/2   1     Focus Word   1       1/2   1/4   1/8
Martins   1/4   1/3   1/2   1     Focus Word   1       1/2   1/3   1/4
Constant  1     1     1     1     Focus Word   1       1     1     1
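Expressed as functions of the distance d between a context word and the focus word, the three schemes are simply (an illustrative sketch):

def manges_ws(d: int) -> float:
    return 2.0 ** (1 - d)  # exponential dampening: 1, 1/2, 1/4, 1/8, ...

def martins_ws(d: int) -> float:
    return 1.0 / d         # harmonic dampening: 1, 1/2, 1/3, 1/4, ...

def constant_ws(d: int) -> float:
    return 1.0             # word order independent

for d in range(1, 5):
    print(d, manges_ws(d), martins_ws(d), constant_ws(d))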

An extraction example from JavaSDM is given below. JavaSDM measures the similarity between two words solely by the cosine similarity of their corresponding context vectors (the dot product of the normalized vectors). The maximal cosine similarity is 1. Ferret (2010) showed that cosine similarity gave the best results in a comparison with four other semantic similarity measures called ehlert, jaccard, lin and dice.

The test word is “cute” and the other words are synonym proposals followed by their cosine similarity to the test word.

cute
ugly 0.92525595
pretty 0.9251133
sexy 0.915225
weird 0.9052225
funny 0.89562523
thick 0.890025
stupid 0.88813686
beautiful 0.876878
hair 0.8744281
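The scores above are plain cosine similarities between the accumulated context vectors; a minimal sketch of the computation (the vectors here are illustrative):

import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Dot product of the normalized vectors, as JavaSDM computes it."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

cv_cute = np.array([0.0, 2.0, 1.0, 3.0])
cv_ugly = np.array([0.5, 2.0, 1.5, 2.5])
print(cosine(cv_cute, cv_ugly))  # close to 1 for words with similar contexts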

3.3.2 Extraction Criteria

The extraction criteria were:

• Candidate set size limit:

– a limit of at most 15 synonym proposals for models trained on graph word data

– a limit of at most 10 synonym proposals for models trained on lemmatized data

A higher number of proposals is expected to improve recall and lower precision. In preliminary test runs, 5, 10, 15 and 30 were used as the maximal number of proposals per test word. The f-scores for lemma models were highest when using 10, and 15 candidates gave the best f-score for graph word models. Thus the number of synonym proposals was limited to 10 for lemma based models and 15 for graph word based models. The reason graph word based models benefited from more candidates is probably that they have more words in the gold standard. 30 proposals was very memory- and time-consuming.

• Different cosine similarity thresholds:

– 0 (no threshold)

– 0.5


– 0.7

– 0.85

The purpose of a similarity score threshold is to weed out bad synonym proposals and thereby get better precision. With a too high threshold, good synonym proposals might not be given, and thereby recall would be lower. A cosine similarity threshold was not an optional setting in the downloaded JavaSDM code, so new code had to be written. Cosine similarity thresholds of 0, 0.5, 0.7 and 0.85 were used; these have different effects on precision, recall and f-score. The thresholds were chosen after running pre-tests with many different thresholds to see around which cosine similarity scores the results differ the most.

The synonym candidates with a cosine similarity score above the threshold are shown, but never more than the max number of candidates. A max number of candidates is preferable because some words have a really high number of candidates with high similarity, and more suggestions mean more time and memory consumption (with 30 as the max number of candidates, my computer randomly crashed even with a cosine similarity threshold of 0.85, and it was very time consuming). Having 30 candidates as the max number did not clearly decrease f-score in preliminary tests. A user who wants to use this tool to get a list of synonym candidates should set the highest number of candidates he/she wants to read.

Another alternative was, instead of limiting the candidate set size, to use a threshold like the one mentioned above together with a relative threshold compared to the synonym candidate with the highest similarity score. A few attempts were made to examine this idea, but as only negative effects were found, this method was not further explored in this study.

• Extracted words whose POS-tag does not match the test word’s tag are not given as synonym proposals. This criterion is intended to separate the verb “book” from the noun “book”, and to exclude extractions of words from other word classes than the test word’s, since synonyms are supposed to be from the same word class. For example, “hair” would not have been extracted as a synonym candidate to “cute” in the extraction example in Section 3.3.1. This criterion is only used when a model is trained on POS-tagged data.

This criterion would have a negative effect if the POS-tagger has mistagged an extraction that actually is of the correct word class.

• Extracted stop words are not given as synonym proposals. The stop word lists were collected as described in Section 3.2.5, but those tokens were kept during training. Instead, JavaSDM extractions of tokens in those lists were filtered out and thereby not counted as synonym proposals. A sketch combining all of these criteria is given below.
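This is a minimal sketch, assuming JavaSDM returns a ranked (word, score) list; the function and variable names are illustrative, not JavaSDM’s API:

def select_proposals(test_word_tag, ranked_candidates, stop_words,
                     threshold=0.5, max_candidates=10):
    """Apply threshold, set size limit, POS match and stop word filtering."""
    proposals = []
    for word, score in ranked_candidates:      # e.g. ("ugly_a", 0.925)
        if score < threshold:
            break                              # the list is ranked by score
        if word in stop_words:
            continue
        if "_" in word and not word.endswith(test_word_tag):
            continue                           # POS-tag mismatch
        proposals.append((word, score))
        if len(proposals) == max_candidates:
            break
    return proposals

ranked = [("ugly_a", 0.925), ("pretty_a", 0.925), ("hair_n", 0.874)]
print(select_proposals("_a", ranked, stop_words=set()))
# [('ugly_a', 0.925), ('pretty_a', 0.925)]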

3.4 Evaluation

The focus of this study is to evaluate preprocessing operations and extraction criteria. The evaluation is done by computing precision, recall and f-score.
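Per test word these are the standard definitions; a minimal sketch, assuming the proposals and the gold standard for a test word are represented as sets:

def prf(proposals: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Precision, recall and f-score for one test word."""
    correct = len(proposals & gold)
    precision = correct / len(proposals) if proposals else 0.0
    recall = correct / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

print(prf({"ugly", "pretty", "hair"}, {"pretty", "beautiful"}))
# (0.333..., 0.5, 0.4)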


One test word set based on lemmas and one based on graph words were created. For each set, one dictionary-based and one adapted gold standard were created. The dictionary-based gold standards were created from existing synonym dictionaries plus inflections of those synonyms and of the test words themselves. The adapted gold standard is the dictionary-based gold standard extended by manual correction.

3.4.1 Test Words

The test words in this study were selected among adjectives, nouns and verbs. The POS-tagging described in Section 3.2.3 was needed for this. The test words had to be chosen from the training data for JavaSDM to have any possibility of suggesting synonym candidates. Keyword extraction based on a method by Ortuño et al. (2002) was used to extract the top 200 nouns, 100 verbs and 100 adjectives from the chat log collections. Those words also had to have a frequency of at least 30 in the chat log collection. This was done with POS-tagged data. No more than one inflected form of a word was included in the test word set based on graph words, in order to have the same number of test words for lemmas and graph words. The test word set based on lemmas consisted of the lemmas of the words in the test word set based on graph words. If no synonyms were found for a word that had so far qualified for the gold standard, the word was not used as a test word.

3.4.2 Gold Standard

The web pages www.synonymer.se and www.synonymer.org were used to look up base forms of Swedish synonyms. WordNet and www.thesaurus.com were used to find English synonyms. For example, the adjective able got the synonyms bright, capable, easy, good, strong and worthy included in the gold standard. The English synonym sources included a lot more words than the Swedish sources. For a synonym found in a thesaurus to be included in the gold standard, its graph word had to occur at least once with the correct POS-tag in the training data. Otherwise it is impossible to extract that word, and the non-extraction of the word would say nothing about the quality of JavaSDM. For example, if an English thesaurus had “book” as a synonym to “order”, it would have been included in the gold standard because of lines like line 2 and not because of lines like line 1:

Line 1:     Line 2:
I           I
read_v      want_v
a           to
book_n      book_v
            a
            ticket_n

Synonym phrases and compound words (separated by space) were not included in the gold standard because JavaSDM only extracts single tokens. For example, “be good for” was not included in the gold standard as a synonym for “benefit”. With a collocation extractor, a phrase like that could have been transformed into the token “be_good_for”.

If a model is trained on perfectly lemmatized data, it can’t be used to extract inflected forms of a word other than the lemma, which models trained on graph word data often can (during the tests there were occasions when the lemmatizers had failed to lemmatize every form of a word the same way). It is also possible that a form of a word identical to the lemma is never used in the graph word data. Therefore the lemma and graph word models needed different gold standards. First the lemma gold standard was created as described above. Then a triple column version of the training data was created which showed the graph word version of the training data in the first column, the lemmatized version in the second column and the POS-tags in the third column, like this:

dogs dog  NN
are  be   VB
cute cute JJ

Every graph word that had the same lemma and POS-tag as a word in the lemma gold standard was included in the graph word gold standard. Table 3.1 shows the effect of this gold standard creation method for the noun “babe”:

Table 3.1: Lemma and graph word gold standards for the noun “babe”:

Lemma Gold Standard   Graph Word Gold Standard
baby                  babies
child                 baby
newborn               babys
                      children
                      childs
                      newborn

In this example there are twice as many words in the graph word gold standard. Still, one can see that not all inflections are included in the graph word gold standard, for example the plural form of “newborn”. That is because “newborns” never occurred in the training data. On the other hand, the informal or incorrect forms “babys” and “childs” were included because they were used by users, had been lemmatized as “baby” and “child”, and were POS-tagged as nouns.
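A sketch of this construction step, assuming the lemma gold standard is a set of (lemma, POS) pairs and the triple column data is a list of (graph word, lemma, POS) triples; both representations are assumptions:

def expand_gold(lemma_gold: set[tuple[str, str]],
                triples: list[tuple[str, str, str]]) -> set[str]:
    """Collect every graph word whose (lemma, POS) matches a gold entry."""
    return {graph for graph, lemma, pos in triples if (lemma, pos) in lemma_gold}

triples = [("babies", "baby", "NN"), ("babys", "baby", "NN"),
           ("baby", "baby", "NN"), ("newborn", "newborn", "NN")]
print(expand_gold({("baby", "NN"), ("newborn", "NN")}, triples))
# {'babies', 'babys', 'baby', 'newborn'} (set order may vary)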

3.4.3 Adapted Gold Standard

The dictionary-based gold standards are far from including all synonyms. For example, “monteringsanvisning” (“assembly instruction”) could not be counted as a correct synonym proposal to “instruktion” (“instruction”) in tests with the dictionary-based gold standard, since it was not included there. Manual correction was done to avoid that, to get an idea of how useful an addition a synonym extractor can be to existing thesauri, and to get more accurate results.

Manual control was done on synonym proposals, from all models, that were not included in the gold standard when no threshold was used. Tests using thresholds could not possibly find words that were not found without a threshold. Manually found synonyms, antonyms and semantically near-related words such as hypernyms, hyponyms and siblings were added to the extended gold standard (more about synonyms and near-related words in Section 2.1). For example, “trasig” (“broken”) was added to the gold standard for “skadad” (“wounded”), and “kontokort” (“credit card”) was added to the gold standard for “kort” (“card”). Antonyms/siblings such as “mother” and “father” were also added to the extended gold standard. For the Swedish test word “bebis” (“baby”) the non-Swedish word “infant” had been extracted, and since “infant” is an English translation of “bebis”, it was added to the extended gold standard. Generally, the synonym proposals manually found to be correct were added to both the lemma and graph word gold standards. When graph word based models had found more than one inflection of a word applicable for the gold standard, only one inflection of that word was added to the lemma gold standard, although all the inflections were added to the graph word gold standard. If a word was found by a lemma based model but not by any of the graph word based models, there was no chance that the graph word based models would find that word in tests using thresholds, since these models would not find more words than the ones they found using no threshold. Still, it was important to add that word to both the lemma and graph word gold standards to get a good recall and f-score comparison. Examples of differences between dictionary-based and adapted gold standards are shown in Tables 3.2–3.4.

Table 3.2: English Retail Lemma, Test word = crap (noun):

Dictionary-based Gold Standard   Adapted Gold Standard
nonsense                         nonsense
bunk                             bunk
twaddle                          twaddle
                                 rubbish
                                 shit

Misspellings were also added to the extended gold standard. In order to save time, Levenshtein’s algorithm was used to see if an extraction was a misspelling of the test word (Jurafsky and Martin (2009): p. 108). The edit distance had to be 1 or 2 and the test word had to have a length of at least five letters; otherwise too many words could be seen as misspellings. The longer a word is, the greater the risk of longer edit distances, but deciding at which word lengths to increase the allowed edit distance would have needed further research. If a word with edit distance 1 or 2 was already included in the gold standard, for example “monkeys” and “monkey”, then that misspelling was not added to the gold standard. More misspellings were manually found afterwards among the incorrectly classified extractions.
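A sketch of this check with a standard dynamic-programming Levenshtein distance; the thesis follows Jurafsky and Martin (2009), and this particular implementation is illustrative:

def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def is_misspelling(candidate: str, test_word: str) -> bool:
    return len(test_word) >= 5 and 1 <= levenshtein(candidate, test_word) <= 2

print(is_misspelling("telefonnumer", "telefonnummer"))  # True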

On average, 1–3 new tokens per test word were added to the adapted gold standards of the four chat collections respectively. Approximately one fifth were misspellings.

Table 3.5 shows the number of tokens included in each gold standard. The table also shows that there are more keywords with synonyms and semantically related words in the retail data than in the travel data.


Table 3.3: English Retail Graph Word, Test word = noticed (verb):

Dictionary-based Gold Standard   Adapted Gold Standard
observe                          observe
advert                           advert
catch                            catch
mark                             mark
mind                             mind
minded                           minded
note                             note
noted                            noted
recognize                        recognize
recognized                       recognized
refer                            refer
referred                         referred
referring                        referring
regard                           regard
regarding                        regarding
see                              see
seeing                           seeing
spot                             spot
spotted                          spotted
                                 discovered
                                 found
                                 realised
                                 realized
                                 notice (inflection of the test word)

Table 3.4: Swedish Retail Lemma, Test word = nummer (noun):

Dictionary-based Gold Standard   Adapted Gold Standard
exemplar                         exemplar
format                           format
mått                             mått
nr                               nr
personnummer                     personnummer
siffra                           siffra
storlek                          storlek
stycke                           stycke
tal                              tal
telefonnummer                    telefonnummer
                                 telefonnr
                                 telnr
                                 telefonnumer (misspelling)

Table 3.5: The gold standards are identified with the following abbreviations: DB = dictionary-based gold standard, A = adapted gold standard, GW = graph words, L = lemma. The chat collection and number of test words are given like this: R-Swe(244) = Swedish Retail and 244 test words.

        R-Swe(244)  T-Swe(152)  R-Eng(171)  T-Eng(175)
DB-GW   3065        1509        3322        2447
A-GW    3675        1852        3642        2696
DB-L    1271        574         1874        1359
A-L     1849        890         2139        1568


4 Results

The first section shows the tested combinations of models and criteria that will be discussed in this chapter. It also tells what kinds of metrics have been computed in the study. Sections 4.2–4.4 summarize the main effects of thresholds, POS-tagging and lemmatization in the experiments. The dictionary-based and the adapted gold standards are compared in Section 4.5. F-score is the main metric, but Sections 4.6 and 4.7 exemplify how precision and recall were affected by preprocessing operations and extraction criteria. Section 4.8 shows the f-score for all tests using the adapted gold standards. The tables in Section 4.9 give a comparison overview of the evaluated combinations of preprocessing operations and extraction criteria.

4.1 Tests

16 tests were run on each of the four chat log corpora. Table 4.1 shows the following parameters:

Table 4.1: Name = the name of the test, Text Unit = graph word or lemma, Set = the candidate set size limit, Threshold = the cosine similarity threshold and POS = part-of-speech tagged or not:

Name              Text Unit   Set  Threshold  POS
15-Graph-0        Graph word  15   -          -
15-Graph-0p5      Graph word  15   0.5        -
15-Graph-0p7      Graph word  15   0.7        -
15-Graph-0p85     Graph word  15   0.85       -
15-GraphPOS-0     Graph word  15   -          +
15-GraphPOS-0p5   Graph word  15   0.5        +
15-GraphPOS-0p7   Graph word  15   0.7        +
15-GraphPOS-0p85  Graph word  15   0.85       +
10-Lemma-0        Lemma       10   -          -
10-Lemma-0p5      Lemma       10   0.5        -
10-Lemma-0p7      Lemma       10   0.7        -
10-Lemma-0p85     Lemma       10   0.85       -
10-LemPOS-0       Lemma       10   -          +
10-LemPOS-0p5     Lemma       10   0.5        +
10-LemPOS-0p7     Lemma       10   0.7        +
10-LemPOS-0p85    Lemma       10   0.85       +

The following metrics were computed for all the test words and are discussed in this chapter:

• Precision, recall and f-score separately

– Using the dictionary-based and the adapted gold standards separately


– Averages and standard deviations

4.2 Threshold

As expected, a higher precision is reached the higher the threshold. Strict extraction criteria in general benefit precision but not recall. A threshold of at least 0.5 is recommended, since practically no correct synonym proposals were extracted below that level; thus precision and f-score are lower with a lower threshold. Models trained on both POS-tagged and lemmatized data from English Retail and Swedish Travel got better precision and f-score when using a threshold of 0.7, but in all other cases 0.5 gave the best f-score and recall. Figure 4.1 is an example of how precision, recall and f-score were affected by thresholds in this study.

Figure 4.1: A diagram showing the effect of the cosine similarity threshold on precision, recall and f-score when using a model trained on the lemmatized version of Swedish Retail.

4.3 Part-of-speech tagging

POS-tagging improves precision but lowers the number of misspellings found among the synonym proposals. Misspellings are harder to give a correct POS-tag, and due to the word-class criterion a correct tag is necessary. The fact that POS-tagged models got the best f-score for all collections when using the dictionary-based gold standard indicates that POS-tagging is useful for handling “well-known” words (see appendix). When using travel data, regardless of language, POS-tagging improved f-score for both lemma and graph word based models. Models trained on POS-tagged data got more words with a high similarity score.


4.4 Lemmatization

Models trained on lemmatized data gave better precision, recall and f-score than graph word based models for all chat collections. Models trained on lemmatized data got more words with a high cosine similarity score than graph word based models. Worth noting is that models trained on lemmatized data can propose misspellings which in theory could be lemmatized misspellings. For example, “essen” could be a lemmatization of the token “essens”, which probably is a misspelling of “essence”.

4.5 The different gold standards

POS-tagging combined with lemmatization improved f-score when using the dictionary-based gold standards. When using the adapted gold standards, POS-tagging lowered f-score for all chat collections but Swedish Travel. The best cosine similarity thresholds were either equal or lower when using the adapted gold standards. The results for precision, recall and f-score were much higher when using the adapted gold standards.

If misspellings are excluded from the adapted gold standard, there were few differences in the ranking of models and thresholds, except for English Travel. The differences were the cosine similarity thresholds, as described above for the adapted gold standards including misspellings, and the fact that lemmatized English Retail data gave a better f-score without POS-tagging when the adapted gold standard without misspellings was used.

These observations are well exemplified by Tables 4.2–4.4, showing the ranking lists according to average f-score for Swedish Retail using the dictionary-based gold standard, the adapted gold standard, and the adapted gold standard without misspellings.

Table 4.2: Swedish Retail using the dictionary-based gold standard:

Rank  Model             F-score
1.    10-LemPOS-0p7     0.059±0.085
2.    10-LemPOS-0p5     0.059±0.084
3.    10-LemPOS-0       0.059±0.084
4.    10-Lemma-0p7      0.055±0.085
5.    10-LemPOS-0p85    0.055±0.103
6.    10-Lemma-0p5      0.053±0.075
7.    10-Lemma-0        0.053±0.075
8.    15-Graph-0p5      0.05±0.059
9.    15-Graph-0        0.049±0.059
10.   10-Lemma-0p85     0.048±0.104
11.   15-GraphPOS-0p5   0.047±0.059
12.   15-GraphPOS-0     0.047±0.059
13.   15-Graph-0p7      0.043±0.063
14.   15-GraphPOS-0p7   0.043±0.057
15.   15-GraphPOS-0p85  0.03±0.062
16.   15-Graph-0p85     0.027±0.069

Table 4.3: Swedish Retail using the adapted gold standard:

Rank  Model             F-score
1.    10-Lemma-0p5      0.247±0.17
2.    10-Lemma-0        0.246±0.171
3.    10-Lemma-0p7      0.236±0.177
4.    10-LemPOS-0p5     0.231±0.179
5.    10-LemPOS-0       0.231±0.179
6.    10-LemPOS-0p7     0.231±0.179
7.    15-Graph-0p5      0.176±0.129
8.    15-Graph-0        0.175±0.129
9.    15-Graph-0p7      0.156±0.130
10.   10-LemPOS-0p85    0.155±0.177
11.   15-GraphPOS-0p5   0.149±0.126
12.   15-GraphPOS-0     0.149±0.126
13.   15-GraphPOS-0p7   0.142±0.127
14.   10-Lemma-0p85     0.118±0.154
15.   15-GraphPOS-0p85  0.086±0.123
16.   15-Graph-0p85     0.07±0.113


Table 4.4: Swedish Retail using the adapted gold standard without misspellings:

Rank  Model             F-score
1.    10-LemPOS-0p7     0.20861813±0.17
2.    10-LemPOS-0p5     0.20810248±0.17
3.    10-LemPOS-0       0.20809248±0.17
4.    10-Lemma-0p5      0.20524344±0.157
5.    10-Lemma-0        0.2045285±0.157
6.    10-Lemma-0p7      0.20224719±0.168
7.    10-LemPOS-0p85    0.14682704±0.182
8.    15-Graph-0p5      0.14472191±0.114
9.    15-Graph-0        0.14366198±0.113
10.   15-GraphPOS-0p5   0.13288288±0.119
11.   15-GraphPOS-0     0.13287288±0.119
12.   15-Graph-0p7      0.12905452±0.12
13.   15-GraphPOS-0p7   0.12848777±0.122
14.   10-Lemma-0p85     0.10565685±0.163
15.   15-GraphPOS-0p85  0.08052743±0.123
16.   15-Graph-0p85     0.06271605±0.112

4.6 Precision

For all chat log collections, the four models with the highest precision were those with a cosine similarity threshold of 0.85, as in Tables 4.5 and 4.6. POS-tagging and lemmatization had a positive effect on precision. When models are trained on lemmatized data, more words get high similarity scores. Although models trained on graph word data have a limit of 15 synonym proposals instead of 10, the models trained on lemmas extracted almost as many or more words when the similarity threshold was set to 0.85. Table 4.5 exemplifies how prominently POS-tagging improved precision for all chat log collections but Swedish Retail.

Table 4.5: Precision ranking for English Retail using the adapted gold standard:

Rank  Model             Precision    Correct Proposals / Proposals
1     10-LemPOS-0p85    0.418±0.327  89/213
2     15-GraphPOS-0p85  0.357±0.367  91/255
3     10-Lemma-0p85     0.275±0.311  131/476
4     15-Graph-0p85     0.249±0.313  125/502
5     10-LemPOS-0p7     0.247±0.256  166/671
6     10-LemPOS-0p5     0.217±0.246  180/831
7     10-LemPOS-0       0.215±0.243  180/838
8     15-GraphPOS-0p7   0.213±0.238  191/895
9     15-GraphPOS-0p5   0.196±0.221  231/1180
10    15-GraphPOS-0     0.192±0.216  231/1201
11    10-Lemma-0p7      0.185±0.174  245/1324
12    15-Graph-0p7      0.162±0.154  283/1750
13    10-Lemma-0p5      0.159±0.154  262/1644
14    10-Lemma-0        0.159±0.15   262/1650
15    15-Graph-0p5      0.139±0.117  338/2426
16    15-Graph-0        0.136±0.116  338/2494

When the threshold was 0.85, models trained on POS-tagged Swedish data made more synonym proposals than untagged models, but they made far fewer proposals when the threshold was low, see Table 4.6.

Table 4.6: Precision ranking for Swedish Retail using the adapted gold standard:

Rank  Model             Precision    Correct Proposals / Proposals
1     10-Lemma-0p85     0.333±0.384  133/399
2     15-Graph-0p85     0.294±0.380  147/500
3     10-LemPOS-0p85    0.291±0.351  195/671
4     15-GraphPOS-0p85  0.268±0.391  187/697
5     10-LemPOS-0p7     0.247±0.206  402/1625
6     10-Lemma-0p7      0.244±0.218  423/1731
7     10-LemPOS-0p5     0.24±0.195   413/1720
8     10-LemPOS-0       0.24±0.195   413/1720
9     10-Lemma-0p5      0.224±0.171  508/2265
10    10-Lemma-0        0.222±0.171  508/2279
11    15-Graph-0p7      0.206±0.201  462/2246
12    15-GraphPOS-0p7   0.187±0.185  422/2256
13    15-GraphPOS-0p5   0.182±0.165  461/2531
14    15-GraphPOS-0     0.182±0.165  461/2531
15    15-Graph-0p5      0.18±0.133   631/3498
16    15-Graph-0        0.178±0.133  632/3550

A summary of the precision effects of preprocessings and thresholds for the four chat log collections:

1. A high threshold is best: 0.85 gave the highest precision regardless of model or domain.

2. POS-tagging improved precision for both graph word and lemma based models (Swedish Retail was the only exception).

3. The lemma based models got higher precision than graph word based models.

4.7 Recall

For Swedish and English Retail and Swedish Travel, no more correct candidates were found among the synonym proposals below a cosine similarity of 0.5. Only English Travel yielded a few correct extractions below 0.5, as shown in Table 4.7. Graph word based models generally extract more correct candidates, since all inflections of the test words and of other synonyms count. If, for example, proposals with the same lemma as the test word are subtracted, the numbers of correct synonym proposals are fairly close for graph word and lemma models.

A summary of the recall effects of preprocessings and thresholds for the four chat log collections:

1. Low thresholds benefit recall, but practically no more correct synonyms were found with thresholds below 0.5 with the selected candidate set size limits.

2. Lemma based models got higher recall than graph word based models

3. POS-tagged models got lower recall


Table 4.7: Recall ranking for English Travel using the adapted gold standard:

Rank  Model            Recall       Correct Proposals / Words in Gold Standard
1     10-Lemma-0       0.136±0.157  213/1568
2     10-Lemma-0p5     0.135±0.155  212/1568
3     10-LemPOS-0      0.106±0.147  166/1568
4     10-LemPOS-0p5    0.105±0.144  164/1568
5     15-Graph-0       0.104±0.135  280/2696
6     15-Graph-0p5     0.102±0.133  274/2696
7     10-Lemma-0p7     0.098±0.125  153/1568
8     15-GraphPOS-0    0.083±0.116  223/2696
9     15-GraphPOS-0p5  0.082±0.112  220/2696
10    10-LemPOS-0p7    0.075±0.123  118/1568

4.8 F-score

The f-score rankings in Tables 4.8–4.11 were computed with the adapted gold standards including misspellings.

When using the dictionary-based gold standard, POS-tagging gave the best f-scores for all collections. When excluding misspellings from the adapted gold standard, models that were both lemmatized and POS-tagged gave the highest f-score for all collections but English Retail. When including misspellings in the adapted gold standard, models trained on data without POS-tags did better than models trained on POS-tagged data.

Table 4.8: F-score ranking Swedish Retail using the adapted gold standard:

Rank  Model             F-score
1     10-Lemma-0p5      0.247±0.17
2     10-Lemma-0        0.246±0.171
3     10-Lemma-0p7      0.236±0.177
4     10-LemPOS-0p7     0.231±0.179
5     10-LemPOS-0p5     0.231±0.179
6     10-LemPOS-0       0.231±0.179
7     15-Graph-0p5      0.176±0.129
8     15-Graph-0        0.175±0.129
9     15-Graph-0p7      0.156±0.13
10    10-LemPOS-0p85    0.155±0.177
11    15-GraphPOS-0p5   0.149±0.126
12    15-GraphPOS-0     0.149±0.126
13    15-GraphPOS-0p7   0.142±0.127
14    10-Lemma-0p85     0.118±0.154
15    15-GraphPOS-0p85  0.086±0.123
16    15-Graph-0p85     0.07±0.113

Table 4.9: F-score ranking English Retail using the adapted gold standard:

Rank  Model             F-score
1     10-Lemma-0p7      0.141±0.139
2     10-Lemma-0p5      0.139±0.134
3     10-Lemma-0        0.138±0.134
4     10-LemPOS-0p5     0.121±0.143
5     10-LemPOS-0       0.121±0.143
6     10-LemPOS-0p7     0.118±0.143
7     15-Graph-0p5      0.111±0.107
8     15-Graph-0        0.11±0.106
9     15-Graph-0p7      0.105±0.111
10    10-Lemma-0p85     0.1±0.14
11    15-GraphPOS-0p5   0.096±0.114
12    15-GraphPOS-0     0.095±0.113
13    15-GraphPOS-0p7   0.084±0.114
14    10-LemPOS-0p85    0.076±0.165
15    15-Graph-0p85     0.060±0.127
16    15-GraphPOS-0p85  0.047±0.129


Table 4.10: F-score ranking Swedish Travel using the adapted gold standard:

Rank  Model             F-score
1     10-LemPOS-0p7     0.235±0.199
2     10-LemPOS-0p5     0.23±0.195
3     10-LemPOS-0       0.229±0.195
4     10-Lemma-0p5      0.218±0.177
5     10-Lemma-0p7      0.215±0.184
6     10-Lemma-0        0.215±0.174
7     10-LemPOS-0p85    0.19±0.219
8     15-GraphPOS-0p5   0.179±0.148
9     15-GraphPOS-0     0.179±0.148
10    15-GraphPOS-0p7   0.17±0.149
11    15-Graph-0p5      0.163±0.146
12    15-Graph-0        0.160±0.139
13    15-Graph-0p7      0.142±0.149
14    10-Lemma-0p85     0.138±0.2
15    15-GraphPOS-0p85  0.091±0.159
16    15-Graph-0p85     0.071±0.139

Table 4.11: F-score ranking English Travel using the adapted gold standard:

Rank  Model             F-score
1     10-Lemma-0p5      0.136±0.126
2     10-LemPOS-0p5     0.135±0.146
3     10-LemPOS-0       0.135±0.146
4     10-Lemma-0        0.135±0.124
5     10-Lemma-0p7      0.119±0.121
6     15-GraphPOS-0p5   0.112±0.115
7     15-GraphPOS-0     0.112±0.114
8     15-Graph-0p5      0.11±0.102
9     10-LemPOS-0p7     0.109±0.141
10    15-Graph-0        0.109±0.097
11    15-GraphPOS-0p7   0.089±0.115
12    15-Graph-0p7      0.087±0.104
13    10-Lemma-0p85     0.065±0.125
14    10-LemPOS-0p85    0.038±0.124
15    15-Graph-0p85     0.03±0.093
16    15-GraphPOS-0p85  0.028±0.108

4.9 Comparison Overview

Tables 4.12–4.16 compare each tested combination of preprocessing operations and extraction criteria with every other combination. The comparison is done by counting the percentage of tests in which the combination associated with the row gave better results than the combination associated with the column. Both dictionary-based and adapted gold standard tests are counted in this section, and misspellings are included in all gold standard tests presented here. The five tables represent the Swedish corpora, the English corpora, the corpora from the retail company, the corpora from the travel company, and all corpora. In the tables, a dash marks a combination compared with itself. The abbreviations used are the following:

• LU0 = Lemmatized and untagged, threshold = 0.0 (no threshold)

• LU5 = Lemmatized and untagged, threshold = 0.5

• LU7 = Lemmatized and untagged, threshold = 0.7

• LU85 = Lemmatized and untagged, threshold = 0.85

• LP0 = Lemmatized and POS-tagged, threshold = 0.0 (no threshold)

• LP5 = Lemmatized and POS-tagged, threshold = 0.5

• LP7 = Lemmatized and POS-tagged, threshold = 0.7

• LP85 = Lemmatized and POS-tagged, threshold = 0.85

• GP0 = Graph word and POS-tagged, threshold = 0.0 (no threshold)

• GP5 = Graph word and POS-tagged, threshold = 0.5

• GP7 = Graph word and POS-tagged, threshold = 0.7

• GP85 = Graph word and POS-tagged, threshold = 0.85


• GU0 = Graph word and untagged, threshold = 0.0 (no threshold)

• GU5 = Graph word and untagged, threshold = 0.5

• GU7 = Graph word and untagged, threshold = 0.7

• GU85 = Graph word and untagged, threshold = 0.85

These tables show that models trained on both lemmatized and POS-tagged data gave the best f-scores. A threshold of 0.5 is better than no threshold regardless of the preprocessing operations. For models trained on lemmatized Swedish data, a threshold of 0.7 gives a better f-score than 0.5; for models trained on graph words, 0.5 was the threshold giving the best f-score. POS-tagging had a clearly positive impact on travel data, whereas on graph word data from the retail company it lowered the f-score. 0.85 was the threshold giving the lowest f-scores.
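Such a pairwise comparison matrix can be computed mechanically from the per-test f-scores. The following is a minimal sketch with hypothetical input names; ties are counted as a win for neither side here, which may differ from how the thesis tables treat them:

    from itertools import combinations

    def win_matrix(fscores):
        """Percentage of tests in which the row combination beat the column one.
        fscores: {test_name: {combination_name: f_score}}"""
        combos = sorted(next(iter(fscores.values())))
        wins = {a: {b: 0 for b in combos if b != a} for a in combos}
        for per_test in fscores.values():
            for a, b in combinations(combos, 2):
                if per_test[a] > per_test[b]:
                    wins[a][b] += 1
                elif per_test[b] > per_test[a]:
                    wins[b][a] += 1
        n = len(fscores)
        return {a: {b: 100.0 * w / n for b, w in row.items()}
                for a, row in wins.items()}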

Table 4.12: All tests (including both dictionary-based and adapted gold standard tests on Retail and Travel data in Swedish and English):

       LU0   LU5   LU7   LU85  LP0   LP5   LP7   LP85  GP0   GP5   GP7   GP85  GU0   GU5   GU7   GU85
LU0    –     12.5  25    87.5  25    25    50    87.5  75    75    87.5  100   100   100   100   100
LU5    87.5  –     50    100   37.5  37.5  50    75    87.5  75    87.5  100   100   100   100   100
LU7    75    50    –     87.5  25    25    50    87.5  75    75    100   100   87.5  87.5  87.5  100
LU85   12.5  0     12.5  –     0     0     0     37.5  25    25    37.5  100   12.5  12.5  25    100
LP0    75    62.5  75    100   –     0     37.5  87.5  100   100   100   100   100   100   100   100
LP5    75    62.5  75    100   100   –     37.5  87.5  100   100   100   100   100   100   100   100
LP7    50    50    50    100   62.5  62.5  –     87.5  75    75    100   100   87.5  75    100   100
LP85   12.5  25    12.5  62.5  12.5  12.5  12.5  –     62.5  50    62.5  100   37.5  37.5  37.5  100
GP0    25    12.5  25    75    0     0     25    37.5  –     0     100   100   100   100   100   100
GP5    25    25    25    75    0     0     25    50    100   –     100   100   50    50    75    100
GP7    12.5  12.5  0     62.5  0     0     0     37.5  0     0     –     100   25    25    37.5  100
GP85   0     0     0     0     0     0     0     0     0     0     0     –     0     0     0     50
GU0    0     0     12.5  87.5  0     0     12.5  62.5  0     50    75    100   –     0     87.5  100
GU5    0     0     12.5  87.5  0     0     25    62.5  0     50    75    100   100   –     87.5  100
GU7    0     0     12.5  75    0     0     0     62.5  0     25    62.5  100   12.5  12.5  –     100
GU85   0     0     0     0     0     0     0     0     0     0     0     50    0     0     0     –


Table 4.13: Swedish (including both dictionary-based and adapted gold standard tests on Swedish Retail and Travel):

       LU0   LU5   LU7   LU85  LP0   LP5   LP7   LP85  GP0   GP5   GP7   GP85  GU0   GU5   GU7   GU85
LU0    –     0     25    75    25    25    25    50    75    75    75    100   100   100   100   100
LU5    100   –     50    100   25    25    25    75    75    75    75    100   100   100   100   100
LU7    75    50    –     75    25    25    50    75    75    75    100   100   100   100   100   100
LU85   25    0     25    –     0     0     0     0     25    25    25    100   25    25    50    100
LP0    75    75    75    100   –     0     0     75    100   100   100   100   100   100   100   100
LP5    75    75    75    100   100   –     0     75    100   100   100   100   100   100   100   100
LP7    75    75    50    100   100   100   –     75    100   100   100   100   100   100   100   100
LP85   50    25    25    100   25    25    25    –     100   100   100   100   75    75    75    100
GP0    25    25    25    75    0     0     0     0     –     0     100   100   50    50    25    100
GP5    25    25    25    75    0     0     0     0     100   –     100   100   50    50    25    100
GP7    25    25    0     75    0     0     0     0     0     0     –     100   50    50    50    100
GP85   0     0     0     0     0     0     0     0     0     0     0     –     0     0     0     100
GU0    0     0     0     75    0     0     0     25    50    50    50    100   –     0     75    100
GU5    0     0     0     75    0     0     0     25    50    50    50    100   100   –     75    100
GU7    0     0     0     50    0     0     0     25    25    25    50    100   25    25    –     100
GU85   0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     –

Table 4.14: English (including both dictionary-based and adapted gold standard tests on English Retail and Travel):

       LU0   LU5   LU7   LU85  LP0   LP5   LP7   LP85  GP0   GP5   GP7   GP85  GU0   GU5   GU7   GU85
LU0    –     0     50    100   25    25    75    100   75    75    100   100   100   100   100   100
LU5    100   –     50    100   50    50    75    75    100   75    100   100   100   100   100   100
LU7    50    50    –     100   25    25    50    100   75    75    100   100   75    75    75    100
LU85   0     0     0     –     0     0     0     75    25    25    50    100   0     0     0     100
LP0    75    50    75    100   –     0     75    100   100   100   100   100   100   100   100   100
LP5    75    50    75    100   100   –     75    100   100   100   100   100   100   100   100   100
LP7    25    25    50    100   25    25    –     100   50    50    100   100   75    50    100   100
LP85   0     25    0     25    0     0     0     –     25    0     25    100   0     0     0     100
GP0    25    0     25    75    0     0     50    75    –     0     100   100   50    50    75    100
GP5    25    25    25    75    0     0     50    100   100   –     100   100   50    50    75    100
GP7    0     0     0     50    0     0     0     75    0     0     –     100   0     0     25    100
GP85   0     0     0     0     0     0     0     0     0     0     0     –     0     0     0     0
GU0    0     0     25    100   0     0     25    100   50    50    100   100   –     0     100   100
GU5    0     0     25    100   0     0     50    100   50    50    100   100   100   –     100   100
GU7    0     0     25    100   0     0     0     100   25    25    75    100   0     0     –     100
GU85   0     0     0     0     0     0     0     0     0     0     0     100   0     0     0     –


Table 4.15: Retail (including both dictionary-based and adapted gold standard tests on Swedish and English Retail):

       LU0   LU5   LU7   LU85  LP0   LP5   LP7   LP85  GP0   GP5   GP7   GP85  GU0   GU5   GU7   GU85
LU0    –     25    0     100   50    50    50    100   100   100   100   100   100   100   100   100
LU5    75    –     25    100   50    50    50    75    100   100   100   100   100   100   100   100
LU7    100   75    –     75    50    50    75    100   100   100   100   100   100   100   100   100
LU85   0     0     25    –     0     0     0     25    50    50    75    100   0     0     25    100
LP0    50    50    50    100   –     0     25    100   100   100   100   100   100   100   100   100
LP5    50    50    50    100   100   –     25    100   100   100   100   100   100   100   100   100
LP7    50    50    25    100   75    75    –     100   100   100   100   100   100   100   100   100
LP85   0     25    0     75    0     0     0     –     75    50    50    100   25    25    25    100
GP0    0     0     0     50    0     0     0     25    –     0     100   100   0     0     50    100
GP5    0     0     0     50    0     0     0     50    100   –     100   100   0     0     50    100
GP7    0     0     0     25    0     0     0     50    0     0     –     100   0     0     0     100
GP85   0     0     0     0     0     0     0     0     0     0     0     –     0     0     0     100
GU0    0     0     0     100   0     0     0     75    100   100   100   100   –     0     100   100
GU5    0     0     0     100   0     0     0     75    100   100   100   100   100   –     100   100
GU7    0     0     0     75    0     0     0     75    50    50    100   100   0     0     –     100
GU85   0     0     0     0     0     0     0     0     0     0     0     50    0     0     0     –

Table 4.16: Travel (including both dictionary-based and adapted gold standard tests on Swedish and English Travel):

       LU0   LU5   LU7   LU85  LP0   LP5   LP7   LP85  GP0   GP5   GP7   GP85  GU0   GU5   GU7   GU85
LU0    –     0     50    75    0     0     50    75    50    50    75    100   100   100   100   100
LU5    100   –     75    100   25    25    50    75    75    50    75    100   100   100   100   100
LU7    50    25    –     100   0     0     25    75    50    50    100   100   100   100   100   100
LU85   25    0     0     –     0     0     0     50    0     0     0     100   25    25    25    100
LP0    100   75    100   100   –     0     50    75    100   100   100   100   100   100   100   100
LP5    100   75    100   100   100   –     50    75    100   100   100   100   100   100   100   100
LP7    50    50    75    100   50    50    –     75    50    50    100   100   75    50    100   100
LP85   25    25    25    50    25    25    25    –     50    50    75    100   50    50    50    100
GP0    50    25    50    100   0     0     50    50    –     0     100   100   100   100   100   100
GP5    50    50    50    100   0     0     50    50    100   –     100   100   100   100   100   100
GP7    25    25    0     100   0     0     0     25    0     0     –     100   100   100   100   100
GP85   0     0     0     0     0     0     0     0     0     0     0     –     0     0     0     50
GU0    0     0     0     75    0     0     25    50    0     0     0     100   –     0     75    100
GU5    0     0     0     75    0     0     50    50    0     0     0     100   100   –     75    100
GU7    0     0     0     75    0     0     0     50    0     0     0     100   25    25    –     100
GU85   0     0     0     0     0     0     0     0     0     0     0     50    0     0     0     –


5 Conclusions

The purpose of this study was to evaluate the effects of different preprocessing operations on chat log collections and of different extraction criteria when extracting synonyms with JavaSDM. The main preprocessing operations discussed in the experiments and in this chapter are lemmatization and POS-tagging, and the main extraction criterion is the cosine similarity threshold.

Conclusions were drawn when a similar effect appeared on at least two of the chat collections with either the language or the domain in common. What strengthens the conclusions drawn in this chapter is that the rankings of models and thresholds looked very similar no matter which gold standard or chat collection was used. The only obvious difference was that POS-tagging had a more positive effect on travel data.

The standard deviations made all results overlap, since precision and recall often varied over the full 0.0–1.0 range for single test words. Consequently, these deviations are not very informative. Figure 5.1 exemplifies why the standard deviations had such wide ranges.

Figure 5.1: The example shows the percentage of test words that got synonym proposals with a specific f-score. The results in the diagram are from a model trained on the lemmatized chat collection of Swedish Retail, tested with 250 test words. 25 per cent of the test words got an f-score of 0.

Section 5.1 summarizes the conclusions drawn from the results of this study. Section 5.2 discusses interesting observations that fell outside the scope of the thesis, alternative ways things could have been done, things that would have been tested or studied if there had been more time, and suggestions for future studies.

5.1 Overview of the results

The purpose of this study was to compare preprocessing operations and extraction criteria and thereby help users of JavaSDM with synonym extraction. The following is a list of the main conclusions that can be drawn from this study.

• POS-tagging clearly improves precision for all data collections except for Swedish Retail.

• Lemmatized data give better precision, recall and f-score for all collections (this is discussed in Section 5.2).

• As expected, best precision is achieved with higher thresholds. 0.85 was the highest in this study.

• Best recall is achieved with a threshold of 0.5 or lower. 0.5 is preferable, since lower thresholds decrease precision and f-score.

• Best f-score for retail data is achieved without POS-tagging when misspellings are included in the gold standard.

• Best f-score for travel data is achieved with lemmatized and POS-tagged data.

• The best threshold regarding f-score was generally 0.5 for unlemmatized data. With lemmatized data the synonym candidates tend to gain higher similarity scores, and therefore 0.7 is roughly on par with 0.5. For models trained on both lemmatized and POS-tagged Swedish corpora, 0.7 gave a better f-score every time. With a larger corpus, words will probably get even higher similarity scores, since words in lemmatized data have higher frequencies and lemmatized models got higher cosine similarity scores than graph word models.

• If ignoring misspellings, POS-tagging improves f-score for all collections but English Retail.

• The fact that the adapted gold standards give three to four times better f-scores than the dictionary-based ones indicates how useful synonym extraction can be when handling a domain-specific vocabulary or complementing existing thesauri.

5.2 Discussion

As van der Plas and Tiedemann (2006) stated, higher lemmatization and POS-tagging quality would improve the results. For example, when a Swedish gold standard included “smart” the program extracted “smar”, since that was how “smart” had been lemmatized. In such cases the random indexing is successful, and the problem is the lemmatization quality, or in this example rather the lemmatizer’s choice of lemma form for inflections of “smart”. Unedited chat logs often lack


punctuation and capitalization of proper nouns, and often contain misspellings and words from foreign languages. In specific domains there might be named entities that are rarely mentioned outside the domain. These are factors that might make things more difficult for lemmatizers and part-of-speech taggers.

Excluding from the gold standard those dictionary synonyms that do not appear in the training data makes the recall scores more reflective of JavaSDM’s ability to extract synonyms. Still, it is not certain that the ones that did occur in the training data appeared in a sense that is synonymous with the test word. POS-tagging solves part of this problem, but good word-sense or morphology tagging would do it more accurately. Morphology is in theory useful for separating noun homonyms in Swedish. For example, gender separates the Swedish graph word “lag”, which means “law” if it is utrum and “team” if it is neutrum, and species can decide whether the Swedish graph word “banan” means “banana” or “the track”. Morphological tags did not improve results in this study, but the training data was rather small and it could be worth trying morphological tagging on a larger corpus.

Lemmatized Gold Standard    Graph Word Gold Standard
ersätta                     ersätta
skifta                      skifta
ändra                       ändra
                            ersatt
                            ersatte
                            ersatts
                            ersätter
                            skiftar
                            ändrade
                            ändrat

When using the graph word gold standard, all inflections of a synonym that have occurred in the training data are seen as correct synonyms; e.g. “ersatt”, “ersatte”, “ersätta” and “ersatts” count as four correct synonyms for “byta” (see Section 3.4.2). Is it better if, for example, a graph word based model’s proposals include those four correct candidates than if a lemmatized model’s proposals only include two correct candidates, “ersätta” and “skifta”? Precision-wise the answer is yes according to the evaluation method, if one compares 4/15 to 2/10. Recall-wise the answer is no in this example, since the graph word gold standard includes more inflections of the other words as well and thereby gives a recall of 4/10, while the lemma model would get 2/3. The number of found words was counted and compared to get indications of the effects of this risk. If a lemma version makes more correct extractions this issue is no problem, but if a graph word version makes more correct extractions and still has lower recall, the issue could arise.

For a clearer recall comparison between lemma and graph word based models, one could have counted extractions of words with the same lemma form as one suggestion, lemmatized that extraction, and used the lemmatized gold standard for evaluation. That would, for example, have meant a recall of 1/3 instead of 4/10 for a graph word based model that extracted “ersatt”, “ersatte”, “ersätta” and “ersatts” for “byta”, using the gold standard above. The extractions and transformed extractions for “byta” are shown below:

Extractions:             ersatt, ersätta, ersatte, ersatts, ringa, hitta, välja, använda, montera, ställa, skriva, se, hämtar, paxa, läsa
Transformed extractions: ersätta, ringa, hitta, välja, använda, montera, ställa, skriva, se, hämtar, paxa, läsa

How to count precision, and thereby f-score, when using such a method is not as self-evident. One could choose between the following:

• The precision would be 1/15 if you only count a lemma form as correct one time.

• The precision would be 1/11 if you count all extractions with identical lemma form as only one extraction.

• The precision would be 4/15 if you count each inflection individually.

If “ersätta” had been an incorrect extraction, would the four inflections be counted as one or four incorrect extractions? If the users of JavaSDM do not need words in more than one form and want fewer extractions to read, it would be better to lemmatize the extractions and show each recurring lemma only once.
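A minimal sketch of that post-processing step, assuming some lemmatize function is available (hypothetical; this step was not applied in the study):

    def collapse_to_lemmas(extractions, lemmatize):
        """Lemmatize graph word extractions and keep each lemma only once,
        preserving the original ranking order."""
        seen = set()
        collapsed = []
        for word in extractions:
            lemma = lemmatize(word)  # e.g. "ersatt", "ersatte", "ersatts" -> "ersätta"
            if lemma not in seen:
                seen.add(lemma)
                collapsed.append(lemma)
        return collapsed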

In this study, models were trained either on graph words or on lemmas, but one could try to combine graph words and lemmas when training models. For example, each focus word could be kept as a graph word while the surrounding words are lemmatized, or the other way around: “the dogs waved their tails” could become “the dog waved their tail” or “the dogs wave their tails”.

Preliminary tests indicated that it is better to keep stop words during training and instead use the list of stop words to block those words from being extracted as synonym proposals. One explanation could be that stop words often carry syntactic information and are, for example, used by POS-taggers. The syntactic information from stop words is probably not as relevant if the data is POS-tagged.

Antonyms, which prior studies have mentioned as a problem, might be used ironically. With smileys, irony is sometimes more clearly revealed in chat logs. Smileys can carry a lot of information and are therefore something that should be explored further for use in dialogue systems. Sometimes smileys are used


instead of words to communicate positive or negative energy. Like all other tokens without alphabetic characters, smileys could not be extracted in this study.

When training models with JavaSDM, one can set different parameters, such as the type of weighting scheme and the size of the context windows. Default parameters were used in this study, since no report was found stating when to use which combination of parameters. Different settings might be better for graph word data but not for lemmatized data, and vice versa. The corpus sizes might also have affected the conclusions drawn in this study.

There are more things left to try regarding model training on POS-tagged data. There is a weighting scheme that, during training, can give extra weight to content words, for example words tagged as nouns, adjectives and verbs. It is not certain that it would improve the results, since the pre-tests indicated that it can be a good choice to keep stop words during training, but it should definitely be tested.

Misspelled tokens are intended to have the same meaning as the test words and should therefore have distributions suitable for the test word. Graph word data is better for this, since lemmatizers and part-of-speech taggers are not trained to handle unknown spellings. Even if misspellings should be seen as correct synonym suggestions and actually might be very relevant, there are other ways to find misspellings; still, distributional similarity can be used to increase the probability that a suspected misspelling really is a misspelling. JavaSDM found several misspellings with a frequency of 1, which are not as relevant as frequent misspellings. Frequency counting showed that some misspellings were more frequent than the “correct” spelling, and those misspellings are very interesting: it is likely that very frequent misspellings are not misspellings in the users’ heads.
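Such a frequency comparison is straightforward once the corpus tokens are counted. A minimal sketch; the variant mapping is a hypothetical illustration, not data from the study:

    from collections import Counter

    def spelling_report(tokens, variants):
        """Compare corpus frequencies of suspected misspellings with the
        normative spelling. variants: {correct_form: [suspected_misspellings]}."""
        freq = Counter(t.lower() for t in tokens)
        for correct, misspellings in variants.items():
            for m in misspellings:
                marker = "MORE" if freq[m] > freq[correct] else "less"
                print(f"{m}: {freq[m]} ({marker} frequent than {correct}: {freq[correct]})")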

Compound words are not handled automatically by JavaSDM. Spaces mark the start and end of a word, which means that “answering machine”, for example, is read as “answering” and “machine”, and “washing machine” as “washing” and “machine”. Such an example increases the risk of having “washing” suggested as a synonym for “answering”. That problem could be avoided if such phrases were somehow recognized as one word when training JavaSDM models. This is a problem because compound words do not get as much data as they should, and spurious distributional relations may be calculated: e.g. if “asbra” (“cadaver” + “good”, meaning “extremely good”) is written “as bra”, then “asbra” has one occurrence less, and “as” (“cadaver”) becomes associated with a positive word such as “good”, and vice versa. This is extra difficult for verb compounds that can have other words in between, for example “Switch off the light!” and “Switch the light off!”. I wrote a simple program that extracted nouns followed by nouns and then checked whether these coordinate nouns also occurred compounded as one word, with or without a hyphen. It then compared how often each compounding method was used; quite often the coordinate nouns had occurred in all three versions, and even more often the compounding method used was one that would be considered a misspelling in formal Swedish. This should be examined in further studies, because it is an obstruction for synonym extraction and a well-known problem in linguistics. Incorrect disjunctions are the most common problem in Swedish (Melin, 2006).
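The thesis describes this compound-checking program only briefly; the following is a speculative reconstruction sketch, assuming POS-tagged input and the noun tag “NN” (tagset-dependent assumption):

    from collections import Counter

    def compound_report(tagged_tokens):
        """tagged_tokens: [(token, pos_tag), ...]. For each pair of adjacent
        nouns, count how often the pair also occurs closed ("tvättmaskin")
        or hyphenated ("tvätt-maskin") elsewhere in the corpus."""
        words = [token.lower() for token, _ in tagged_tokens]
        freq = Counter(words)
        pairs = Counter(
            (a.lower(), b.lower())
            for (a, tag_a), (b, tag_b) in zip(tagged_tokens, tagged_tokens[1:])
            if tag_a == "NN" and tag_b == "NN"  # adjacent nouns, tagset-dependent
        )
        for (a, b), open_count in pairs.items():
            print(a, b, "| open:", open_count,
                  "closed:", freq[a + b], "hyphenated:", freq[a + "-" + b])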


5.2.1 Future study suggestions

The following is a summary list of the most relevant future study suggestions mentioned in the discussion:

• Lemmatize graph word extractions for a better comparison.

• Join compound words into one token. Try to handle incorrect spellings of Swedish compound words.

• Examine effects of JavaSDM parameters, for example different context window sizes.

• Try weighting words in the context windows differently depending on their POS-tag.


Bibliography

V.D. Blondel and P. Senellart. Automatic extraction of synonyms in a dictionary.Technical report, University of Louvain, 2002.

D. Alan Cruse. Notes on meaning in language. Master’s thesis, Mitthögskolan,2002.

James R. Curran and Marc Moens. Improvements in automatic thesaurus extraction. In Proceedings of the ACL-02 Workshop on Unsupervised Lexical Acquisition - Volume 9, ULA '02, pages 59–66, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics. doi: 10.3115/1118627.1118635. URL http://dx.doi.org/10.3115/1118627.1118635.

Philip Edmonds and Graeme Hirst. Near-synonymy and lexical choice. Computational Linguistics, 28(2), 2002.

Olivier Ferret. Testing semantic similarity measures for extracting synonyms from a corpus. In Proceedings of LREC, 2010.

James Gorman and James R. Curran. Random indexing using statistical weight functions. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, EMNLP '06, pages 457–464, Stroudsburg, PA, USA, 2006. Association for Computational Linguistics. ISBN 1-932432-73-6. URL http://dl.acm.org/citation.cfm?id=1610075.1610139.

Zelig S. Harris. Distributional structure. Word, 10:146–162, 1954.

Martin Hassel. Resource Lean and Portable Automatic Text Summarization. PhDthesis, KTH, 2007.

Louise Holmer. Passiv och perfekt particip i SAOL Plus: en dokumentation av den lexikografiska arbetsprocessen. Master's thesis, University of Gothenburg, 2009.

Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Pearson Education International, 2009.

Lars Melin. Experimentell språkvård. Språkvård 4, 2006.

Philippe Muller, Nabil Hathout, and Bruno Gaume. Synonym extraction using a semantic distance on a dictionary. In Proceedings of the First Workshop on Graph Based Methods for Natural Language Processing, TextGraphs-1, pages 65–72, Stroudsburg, PA, USA, 2006. Association for Computational Linguistics. URL http://dl.acm.org/citation.cfm?id=1654758.1654773.


M. Ortuño, P. Carpena, P. Bernaola-Galván, E. Muñoz, and A.M. Somoza. Keyword detection in natural languages and DNA. EPL (Europhysics Letters), 57(5):759–764, 2002.

Lonneke van der Plas and Gosse Bouma. Syntactic contexts for finding semantically related words. In CLIN, pages 173–186, 2004.

M. Rosell, M. Hassel, and V. Kann. Global evaluation of random indexing through Swedish word clustering compared to the People's Dictionary of Synonyms. 2009.

Magnus Sahlgren. An introduction to random indexing. SICS, Swedish Institute of Computer Science, 2005.

Carlo Strapparava and Rada Mihalcea. SemEval-2007 task 14: Affective text. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-2007), pages 70–74, 2007.

Lonneke van der Plas and Jörg Tiedemann. Finding synonyms using automatic word alignment and measures of distributional similarity. In Proceedings of the COLING/ACL on Main Conference Poster Sessions, COLING-ACL '06, pages 866–873, Stroudsburg, PA, USA, 2006. Association for Computational Linguistics. URL http://dl.acm.org/citation.cfm?id=1273073.1273184.

Tong Wang. Extracting synonyms from dictionary definitions. Master's thesis, University of Toronto, 2009.

Tong Wang and Graeme Hirst. Exploring patterns in dictionary definitions for synonym extraction. Natural Language Engineering, 18:313–342, 2012.

Hua Wu and Ming Zhou. Optimizing synonym extraction using monolingual and bilingual resources. In Proceedings of the Second International Workshop on Paraphrasing - Volume 16, PARAPHRASE '03, pages 72–79, Stroudsburg, PA, USA, 2003. Association for Computational Linguistics. doi: 10.3115/1118984.1118994. URL http://dx.doi.org/10.3115/1118984.1118994.
