Extending WordNet Bahasa with External Resources
Lim Lian Tze (1) and Tang Enya Kong (2)
(1) KDU College Penang, Malaysia (liantze@gmail.com)
(2) Linton University College, Malaysia (enyagkong1@gmail.com)
WordNet Bahasa Hackathon/Workshop
Lim and Tang | WordNet Bahasa Hackathon/Workshop 1/25
Topics
1. How it all started
2. Utilising Interlingual Links in Wikipedia Articles
3. Utilising Wikidata API
4. Possible Next Steps
How it all started
Aligning Bilingual Dictionary to Princeton WordNet
UTMK/USM linguists, Lim and Hussein (2006)
Kamus Inggeris-Melayu Dewan (KIMD):
  dot n. small round spot, titik; (appearing in large numbers on dress, leaf, etc.) bintik
KIMD senses (manually) aligned to WordNet 1.6 senses:
  kimd (dot, n, 1, [small round spot (appearing in large numbers on dress, leaf, etc.)], ⟨titik, bintik⟩)
  wordnet (110025218, 'dot', n, 1, [a very small circular shape])
Malay WordNet synset:
  (titik, bintik, [a very small circular shape])
Malay WordNet Prototype
Nouns: 12429 synsets
  hypernymy/hyponymy
  holonymy/meronymy: part-of, member-of, substance-of
Verbs: 5805 synsets
  hypernymy/troponymy
  cause
  entailment
Screenshots
Utilising Interlingual Links in Wikipedia Articles
Wikipedia Article Dumps
  <page>
    <title>Marikh</title>
    <text>
      {{Infobox Planet ...}}
      [[en:Mars]]
      [[es:Marte (planeta)]]
    </text>
  </page>
  <page>
    <title>Laut Kaspia</title>
    <text>
      [[Kategori:Tasik di Eropah|Kaspia]]
      [[Kategori:Tasik di Rusia|Kaspia]]
      [[Kategori:Tasik di Asia|Kaspia]]
      [[en:Caspian Sea]]
      [[es:Mar Caspio]]
    </text>
  </page>
Categories and Multilingual Translations
[[Kategori:Tasik di Eropah|Kaspia]]
  (Category: European Lakes → Caspia)
[[es:Mar Caspio]]
  Spanish (es) translation = Mar Caspio (multilingual dictionary)
The Spanish Wikipedia article about the Caspian Sea can be accessed at http://es.wikipedia.org/wiki/Mar_Caspio (bilingual/multilingual comparable corpora)
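As a rough illustration of how the category and interlingual links might be pulled out of an article's wikitext, here is a minimal Python sketch. The function name `extract_links` and both regular expressions are assumptions made for this example, not the authors' code; it only handles the simple link forms shown on the slide and assumes the Malay namespace prefix `Kategori:`.

```python
import re

# Matches [[Kategori:Name|sortkey]]; the "|sortkey" part is optional.
CATEGORY_RE = re.compile(r"\[\[Kategori:([^\]|]+)(?:\|([^\]]*))?\]\]")
# Matches [[xx:Title]] where xx is a short lowercase language code.
INTERLANG_RE = re.compile(r"\[\[([a-z]{2,3}):([^\]|]+)\]\]")


def extract_links(wikitext):
    """Return (categories, translations) found in one article's wikitext."""
    categories = [m.group(1).strip() for m in CATEGORY_RE.finditer(wikitext)]
    translations = {m.group(1): m.group(2).strip()
                    for m in INTERLANG_RE.finditer(wikitext)}
    return categories, translations


# Wikitext of the 'Laut Kaspia' page from the dump excerpt.
sample = """[[Kategori:Tasik di Eropah|Kaspia]]
[[Kategori:Tasik di Rusia|Kaspia]]
[[Kategori:Tasik di Asia|Kaspia]]
[[en:Caspian Sea]]
[[es:Mar Caspio]]"""

categories, translations = extract_links(sample)
print(categories)
print(translations)
```

A real dump would need a proper wikitext parser to cope with templates and nested links; the regular expressions are only meant to show the idea.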
Adding Entries to WordNet Bahasa
1. For the title of each Indonesian and Malaysian Wikipedia article, look up its corresponding English title in Princeton WordNet.
2. If only one synset is found, map the Indonesian/Malaysian title to it.
3. If multiple synsets are found, compare the hypernym chain of each synset to the semantic type and categories of the Wikipedia article. The first synset whose hypernym chain contains the semantic type or one of the categories is chosen as the synset to be mapped to.
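The three steps above can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: `choose_synset`, the synset identifiers and the hypernym chains are all made up for the example, and it assumes the article's semantic type and categories have already been rendered as English words so that they can be compared with WordNet lemmas.

```python
def choose_synset(candidates, article_labels):
    """Pick a synset for an article title.

    candidates: list of (synset_id, hypernym_chain) pairs in lookup order,
                where hypernym_chain is a list of English lemmas.
    article_labels: English semantic type and category names of the article.
    Returns a synset_id, or None if no synset can be chosen.
    """
    if not candidates:
        return None
    if len(candidates) == 1:
        # Step 2: only one synset found, so map to it directly.
        return candidates[0][0]
    labels = {label.lower() for label in article_labels}
    for synset_id, chain in candidates:
        # Step 3: first synset whose hypernym chain shares a label wins.
        if labels & {lemma.lower() for lemma in chain}:
            return synset_id
    return None


# Hand-written toy data for the English title 'Mars' (not read from WordNet).
candidates = [
    ("mars-god", ["Roman deity", "deity", "spiritual being"]),
    ("mars-planet", ["superior planet", "planet", "celestial body"]),
]
print(choose_synset(candidates, ["Planet"]))
```

Returning None when nothing matches leaves the title unmapped; whether such cases were skipped or handled some other way is not stated on the slide.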
Adding Entries to WordNet Bahasa (cont'd)
8480 new mappings
3725 new synsets
732 new Malay entries (i.e. used in both Malaysian and Indonesian)
2109 new Malaysian entries
5473 new Indonesian entries
Utilising Wikidata API
The Wikidata Project
Central storage for the structured data of Wikimedia projects, including Wikipedia, Wikivoyage, Wikisource and others
All interlingual links will be moved to Wikidata eventually
Hence some articles in Wikipedia dumps are 'missing' these links
Wikidata HTTP API: http://www.wikidata.org/w/api.php
Example: Retrieving Multilingual Translations
http://www.wikidata.org/w/api.php?action=wbgetentities&sites=mswiki&titles=serangan%20jantung&normalize&format=xml&props=datatype|labels|descriptions|aliases&languages=ms|id|en
XML response (other formats are possible):
  <?xml version="1.0"?>
  <api success="1">
    <normalized>
      <n from="serangan jantung" to="Penginfarkan miokardium" />
    </normalized>
    <entities>
      <entity id="Q12152" type="item">
        <labels>
          <label language="ms" value="Penginfarkan miokardium" />
          <label language="id" value="Serangan jantung" />
          <label language="en" value="heart attack" />
        </labels>
        <descriptions>
          <description language="en" value="interruption of blood supply to a part of the heart" />
        </descriptions>
        <aliases>
          <alias language="en" value="acute myocardial infarction" />
          <alias language="en" value="AMI" />
          <alias language="en" value="MI" />
          <alias language="en" value="myocardial infarction" />
          <alias language="ms" value="Serangan jantung" />
        </aliases>
      </entity>
    </entities>
  </api>
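For illustration, a request like the one above could be assembled and its XML response read with Python's standard library alone. The helper names `build_query` and `parse_names` are invented for this sketch, the sample response is abbreviated from the slide, and the actual HTTP call (e.g. with `urllib.request.urlopen`) is deliberately left out so that the example runs offline.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

API = "http://www.wikidata.org/w/api.php"


def build_query(title, site="mswiki", languages=("ms", "id", "en")):
    """Build a wbgetentities URL for one article title on one wiki."""
    params = {
        "action": "wbgetentities",
        "sites": site,
        "titles": title,
        "normalize": "",  # boolean flag, sent as an empty value
        "format": "xml",
        "props": "datatype|labels|descriptions|aliases",
        "languages": "|".join(languages),
    }
    # urlencode writes spaces as '+' and '|' as '%7C'.
    return API + "?" + urlencode(params)


def parse_names(xml_text):
    """Map entity id -> {language: [label, alias, ...]}."""
    root = ET.fromstring(xml_text)
    result = {}
    for entity in root.iter("entity"):
        names = {}
        for tag in ("label", "alias"):
            for node in entity.iter(tag):
                names.setdefault(node.get("language"), []).append(node.get("value"))
        result[entity.get("id")] = names
    return result


url = build_query("serangan jantung")

# Abbreviated copy of the response shown on the slide.
sample = (
    '<api success="1"><entities><entity id="Q12152" type="item">'
    '<labels><label language="ms" value="Penginfarkan miokardium" />'
    '<label language="id" value="Serangan jantung" />'
    '<label language="en" value="heart attack" /></labels>'
    '<aliases><alias language="ms" value="Serangan jantung" /></aliases>'
    '</entity></entities></api>'
)
names = parse_names(sample)
print(url)
print(names)
```

Collecting labels and aliases together per language gives a list of candidate lexicalisations for each entity, which is one plausible input for mapping to WordNet synsets.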
Mappings to WordNet synsets done, but not yet checked and/or officially added
Cursory glance: lots of identical lexicalisations to English
Possible Next Steps
Extending Specific Hierarchies
Culture-specific concepts, e.g. clothing items, food and dishes, etc.
  E.g. Malay Wikipedia category 'Masakan Malaysia' (Malaysian dishes):
  ayam golek, buah keras, acar timun, ikan bakar, mi rebus, apam
Some items (e.g. 'Ayam penyet') aren't in the Indonesian or Malay Wikipedia, but are listed in the English Wikipedia
How can we tell if the title of a Wikipedia article is a foreign-language word?
Kamus Dewan
The main Malay monolingual dictionary in Malaysia
Published by Dewan Bahasa dan Pustaka
Currently annotating contents with TEI to give structure (as part of another project)
Allows for easier, more targeted searches
(Still in progress: expected completion Nov 2014)
Not open-source, but can be used for research by arrangement
  e.g. can be used for some quick additions to WordNet Bahasa
Lexical Items
Derived words as subentries of root word: very rich
Lots of MWEs, including peribahasa (idioms): usually fixed word order and little morphosyntactic process
No POS
Do something based on definition text
Extending Specific Hierarchies
Search by definitions (start with the simple ones)
Definition = 'sj (masakan | makanan) ...' (A dish ...)
  bamiyah, besengek, caca, dalca, gudeg, ...
Other possibilities: dances, articles of clothing, musical instruments, ...
(KD definition texts themselves cannot be used in WordNet Bahasa: copyright issues)
(Mine from Wikipedia: CC-BY-SA/GFDL)
'Flat' hierarchy for starters
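A search by definition pattern could look roughly like the sketch below. Everything here is an assumption for illustration: the dictionary is modelled as a plain headword-to-definition mapping, the definition strings for the two dishes are invented placeholders (not Kamus Dewan text), and 'sj' is taken to be the dictionary's abbreviation for 'sejenis' (a kind of).

```python
import re

# Definitions that start with 'sj masakan' or 'sj makanan' (a kind of dish/food).
DISH_PATTERN = re.compile(r"^sj\s+(masakan|makanan)\b")


def find_by_definition(entries, pattern):
    """Return the sorted headwords whose definition matches the pattern."""
    return sorted(head for head, definition in entries.items()
                  if pattern.search(definition))


# Placeholder entries; the first two definitions are made up for the example.
entries = {
    "gudeg": "sj masakan (placeholder definition)",
    "dalca": "sj makanan (placeholder definition)",
    "batang": "penjodoh bilangan bagi benda yang panjang-panjang",
}

dishes = find_by_definition(entries, DISH_PATTERN)
print(dishes)
```

The matched headwords could then be attached directly under a single 'dish' synset, which corresponds to the 'flat' hierarchy mentioned on the slide.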
Names of Flora & Fauna
KD contains a huge number of binomial names (Latin names) for flora & fauna
Match up Malay names with English translations via Latin names
Some might not yet be in Princeton WordNet
Many may not even have English equivalent names
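One way to picture the matching step is a join on normalised Latin names, as in this small sketch. The two lookup tables are hand-made toy data rather than extracts from Kamus Dewan or Princeton WordNet, and the English table deliberately omits one species to show how unmatched Malay names would surface; `normalise` and `match_via_latin` are names invented for the example.

```python
def normalise(latin_name):
    """Lowercase a binomial name and collapse runs of whitespace."""
    return " ".join(latin_name.lower().split())


def match_via_latin(malay_by_latin, english_by_latin):
    """Pair Malay and English names that share a Latin name.

    Returns (matched, unmatched): a dict mapping Malay name to English name,
    and a list of Malay names with no English counterpart in the table.
    """
    english = {normalise(k): v for k, v in english_by_latin.items()}
    matched, unmatched = {}, []
    for latin, malay in malay_by_latin.items():
        key = normalise(latin)
        if key in english:
            matched[malay] = english[key]
        else:
            unmatched.append(malay)
    return matched, unmatched


# Toy tables; 'Durio zibethinus' is left out of the English side on purpose.
malay_by_latin = {
    "Cocos nucifera": "kelapa",
    "Oryza  sativa": "padi",
    "Durio zibethinus": "durian",
}
english_by_latin = {
    "cocos nucifera": "coconut",
    "Oryza sativa": "rice",
}

matched, unmatched = match_via_latin(malay_by_latin, english_by_latin)
print(matched)
print(unmatched)
```

Names that end up in the unmatched list are the ones that might need new synsets, or might have no English equivalent at all.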
Penjodoh Bilangan (Classifiers)
KD indicates classifiers
batang: penjodoh bilangan bagi benda yang panjang-panjang (classifier for longish things)
But how to create/project relations between the 'batang' synset and 'long things'? And exceptions?
biji: penjodoh bilangan (bagi benda kecil dll.) (for small things), but 'sebiji meriam' (a cannon)
Named Entities
Collect/extract gazetteer lists of named entities in Malay
  Existing gazetteer lists: location names, Wikipedia categories
  Rule-based: list of prefixes/suffixes
  Corpus-based: news articles
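As a very small sketch of the rule-based idea, a name could be flagged as a likely location when it begins with a known geographic prefix word. The prefix list below (village, river, hill, island, mountain) is an assumption chosen for the example, not a list taken from the slides, and `looks_like_location` is a hypothetical helper.

```python
# Illustrative prefix words: village, river, hill, island, mountain.
LOCATION_PREFIXES = ("Kampung", "Sungai", "Bukit", "Pulau", "Gunung")


def looks_like_location(name, prefixes=LOCATION_PREFIXES):
    """True if a multi-word name starts with one of the prefix words."""
    tokens = name.split()
    return len(tokens) > 1 and tokens[0] in prefixes


names = ["Pulau Pinang", "Sungai Perak", "Kamus Dewan", "Pulau"]
locations = [name for name in names if looks_like_location(name)]
print(locations)
```

Such a rule will both over- and under-generate, so in practice it would probably be combined with the existing gazetteer lists and corpus evidence mentioned above.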
Further Processing
More advanced processing (including discovering relations) may (should) be possible in future
Study the definition text patterns in Kamus Dewan
Greater availability of Malay corpora
Thank You
qatlho'
Danke
谢谢
Grazie
Спасибо
ขอบคุณ
Merci
Gracias
Obrigado
Ευχαριστώ
감사합니다
Terima kasih
Thank you
ありがとう
Tapadh leibh
Go raibh maith agaibh
Xin cảm ơn
| How it all started
Aligning Bilingual Dictionary to Princeton WordNetUTMKUSM linguists Lim and Hussein (2006)
Kamus Inggeris-Melayu Dewan (KIMD)dot n small round spot titik (appearing in large numbers on dress leaf etc)bintik
KIMD senses (manually) aligned to WordNet 16 senseskimd (dot n 1 [small round spot (appearing in large numbers on dressleaf etc)] ⟨titik bintik⟩)wordnet (110025218 lsquodotrsquo n 1 [a very small circular shape] )
Malay WordNet synset(titik bintik [a very small circular shape] )
Lim and Tang | WordNet Bahasa HackathonWorkshop 4 25
| How it all started
Malay WordNet Prototype
Nouns12429 synsets
hypernymyhyponymyholonymymeronymy
part-ofmember-ofsubstance-of
Verbs5805 synsets
hypernymytroponymy
cause
entailment
Lim and Tang | WordNet Bahasa HackathonWorkshop 5 25
| How it all started
Screenshots
Lim and Tang | WordNet Bahasa HackathonWorkshop 6 25
| Utilising Interlingual Links in Wikipedia Articles
Topics
1 How it all started
2 Utilising Interlingual Links in Wikipedia Articles
3 Utilising Wikidata API
4 Possible Next Steps
Lim and Tang | WordNet Bahasa HackathonWorkshop 7 25
| Utilising Interlingual Links in Wikipedia Articles
Wikipedia Article Dumps
ltpagegtlttitlegtMarikhlttitlegtlttextgt
Infobox Planet[[enMars]][[esMarte (planeta)]]
lttextgtltpagegtltpagegt
lttitlegtLaut Kaspialttitlegtlttextgt
[[KategoriTasik di Eropah|Kaspia]][[KategoriTasik di Rusia|Kaspia]][[KategoriTasik di Asia|Kaspia]][[enCaspian Sea]][[esMar Caspio]]
lttextgtltpagegtLim and Tang | WordNet Bahasa HackathonWorkshop 8 25
| Utilising Interlingual Links in Wikipedia Articles
Categories and Multilingual Translations
[[KategoriTasik di Eropah|Kaspia]](Category European Lakes rarr Caspia)
[[esMar Caspio]]Spanish (es) translation = Mar Caspio(Multilingual dictionary)
Spanish Wikipedia article about the Caspian Sea can be accessed athttpeswikipediaorgwikiMarCaspio(Bilingualmultilingual comparable corpora)
Lim and Tang | WordNet Bahasa HackathonWorkshop 9 25
| Utilising Interlingual Links in Wikipedia Articles
Adding Entries to WordNet Bahasa
1 For title for each Indonesian and Malaysian Wikipedia article look upits corresponding English title in Princeton WordNet
2 If only one synset is found map the IndonesianMalaysian title to it
3 If multiple synsets are found compare the hypernyms chain of eachsynset to the semantic type and categories of the Wikipedia articleThe first synset whose hypernym chain contains the semantic type orone of the categories is chosen as the synset to be mapped to
Lim and Tang | WordNet Bahasa HackathonWorkshop 10 25
| Utilising Interlingual Links in Wikipedia Articles
Adding Entries to WordNet Bahasa (contrsquod)
8480 new mappings
3725 new synsets
732 new Malay entries (ie used in both Malaysian and Indonesian)
2109 new Malaysian entries
5473 new Indonesian entries
Lim and Tang | WordNet Bahasa HackathonWorkshop 11 25
| Utilising Wikidata API
Topics
1 How it all started
2 Utilising Interlingual Links in Wikipedia Articles
3 Utilising Wikidata API
4 Possible Next Steps
Lim and Tang | WordNet Bahasa HackathonWorkshop 12 25
| Utilising Wikidata API
The Wikidata Project
Central storage for the structured data of Wikimedia projectsincluding Wikipedia Wikivoyage Wikisource and others
All interlingual links will be moved to Wikidata eventually
Hence some articles Wikipedia dumps are lsquomissingrsquo these links
Wikidata HTTP API httpwwwwikidataorgwapiphp
Lim and Tang | WordNet Bahasa HackathonWorkshop 13 25
| Utilising Wikidata API
Example Retrieving Multilingual Translations
httpwwwwikidataorgwapiphpaction=wbgetentitiesampsites=mswikiamptitles=
serangan20jantungampnormalizeampformat=jsonampprops=datatype|labels|
descriptions|aliasesamplanguages=ms|id|en
XML response (Other formats are possible)
ltxml version=10gtltapi success=1gt
ltnormalizedgtltn from=serangan jantung to=Penginfarkan miokardium gt
ltnormalizedgtltentitiesgt
ltentity id=Q12152 type=itemgtltlabelsgt
ltlabel language=ms value=Penginfarkan miokardium gtltlabel language=id value=Serangan jantung gtltlabel language=en value=heart attack gt
ltlabelsgtltdescriptionsgt
Lim and Tang | WordNet Bahasa HackathonWorkshop 14 25
| Utilising Wikidata API
Example Retrieving Multilingual Translations (contrsquod)
ltdescription language=en value=interruption of bloodsupply to a part of the heart gt
ltdescriptionsgtltaliasesgt
ltalias language=en value=acute myocardial infarction gtltalias language=en value=AMI gtltalias language=en value=MI gtltalias language=en value=myocardial infarction gtltalias language=ms value=Serangan jantung gt
ltaliasesgtltentitygt
ltentitiesgtltapigt
Mappings to WordNet synsets done but not yet checked andoroicially added
Cursory glance lots of identical lexicalisation to English
Lim and Tang | WordNet Bahasa HackathonWorkshop 15 25
| Possible Next Steps
Topics
1 How it all started
2 Utilising Interlingual Links in Wikipedia Articles
3 Utilising Wikidata API
4 Possible Next Steps
Lim and Tang | WordNet Bahasa HackathonWorkshop 16 25
| Possible Next Steps
Extending Specific Hierarchies
Cultural-specific concepts eg clothing items food and dishes etc Eg Malay Wikipedia category lsquoMasakan Malaysiarsquo (Malaysian dishes)
ayam golek buah keras acar timun ikan bakar mi rebus apam
Some items (eg lsquoAyam penyetrsquo) arenrsquot in Indonesian nor MalayWikipedia but are listed in English Wikipedia
How can we tell if the title of an Wikipedia article is a foreign languageword
Lim and Tang | WordNet Bahasa HackathonWorkshop 17 25
| Possible Next Steps
Kamus Dewan
The main Malay monolingual dictionary in Malaysia
Published by Dewan Bahasa dan Pustaka
Currently annotating the contents with TEI to give them structure (as part of another project)
Allows for easier, more targeted searches
(Still in progress; expected completion Nov 2014)
Not open-source, but can be used for research by arrangement, e.g. for some quick additions to WordNet Bahasa
Lexical Items
Derived words as subentries of the root word: very rich
Lots of MWEs, including peribahasa (idioms), usually with fixed word order and little morphosyntactic processing
No POS
Do something based on the definition text
Extending Specific Hierarchies
Search by definitions (start with the simple ones)
Definition = 'sj (masakan | makanan) ...' ('A dish ...'): bamiyah, besengek, caca, dalca, gudeg
Other possibilities: dances, articles of clothing, musical instruments
(KD definition texts themselves cannot be used in WordNet Bahasa: copyright issues)
(Mine from Wikipedia: CC-BY-SA/GFDL)
'Flat' hierarchy for starters
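The definition search above can be sketched with a regular expression over headword/definition pairs. The entries below are invented placeholders, not quotations from Kamus Dewan, and 'sj' is assumed to be the dictionary's abbreviation for 'sejenis' ('a kind of'); `find_by_definition` is a hypothetical helper.

```python
import re

# Invented stand-in entries: headword -> definition text. These are
# placeholders for illustration, not actual Kamus Dewan definitions.
ENTRIES = {
    "gudeg": "sj masakan daripada nangka muda",
    "dalca": "sj makanan berkuah",
    "zapin": "sj tarian",
    "batang": "penjodoh bilangan bagi benda yang panjang-panjang",
}

# 'sj' followed by 'masakan' or 'makanan' at the start of the definition.
DISH_PATTERN = re.compile(r"^sj\s+(masakan|makanan)\b")


def find_by_definition(entries, pattern):
    """Return the headwords whose definition matches the pattern, sorted."""
    return sorted(word for word, text in entries.items() if pattern.search(text))


if __name__ == "__main__":
    # Every match would become a direct hyponym of one 'dish' synset,
    # i.e. the 'flat' hierarchy mentioned on the slide.
    print(find_by_definition(ENTRIES, DISH_PATTERN))
```

Swapping the alternation for another genus word (for example 'tarian' for dances) would give the other hierarchies the slide mentions.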
Names of Flora & Fauna
KD contains a huge number of binomial names (Latin names) for flora & fauna
Match up Malay names with English translations via the Latin names
Some might not yet be in Princeton WordNet
Many may not even have equivalent English names
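Matching via Latin names amounts to a join on the binomial name. The sketch below assumes two hypothetical tables keyed by Latin name, one extracted from the dictionary and one from an English resource; the sample rows are illustrative only, and the English table is deliberately incomplete to show the unmatched case.

```python
def normalise(latin):
    """Lower-case a binomial name and collapse runs of whitespace."""
    return " ".join(latin.lower().split())


def align_by_latin_name(malay_by_latin, english_by_latin):
    """Pair Malay and English common names that share a Latin name.

    Returns (aligned, unmatched): a dict of Malay -> English names, and the
    list of Malay names whose Latin name is absent from the English table.
    """
    english = {normalise(k): v for k, v in english_by_latin.items()}
    aligned, unmatched = {}, []
    for latin, malay in malay_by_latin.items():
        match = english.get(normalise(latin))
        if match is None:
            unmatched.append(malay)
        else:
            aligned[malay] = match
    return aligned, unmatched


# Illustrative sample rows (Latin name -> common name).
MALAY = {
    "Panthera tigris": "harimau",
    "Durio zibethinus": "durian",
    "Nephelium lappaceum": "rambutan",
    "Parkia speciosa": "petai",
}
ENGLISH = {
    "panthera  tigris": "tiger",
    "Durio zibethinus": "durian",
    "Nephelium lappaceum": "rambutan",
}

if __name__ == "__main__":
    print(align_by_latin_name(MALAY, ENGLISH))
```

The unmatched list is where the last two points on the slide apply: those names would need a new synset or a manual check.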
Penjodoh Bilangan (Classifiers)
KD indicates classifiers
batang: penjodoh bilangan bagi benda yang panjang-panjang (classifier for longish things)
But how do we create/project relations between the 'batang' synset and 'long things'?
And exceptions?
biji: penjodoh bilangan (bagi benda kecil dll) (for small things), but 'sebiji meriam' (a cannon)
Named Entities
Collect/extract gazetteer lists of named entities in Malay
Existing gazetteer lists: location names, Wikipedia categories
Rule-based: list of prefixes/suffixes
Corpus-based: news articles
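The rule-based option could start from something as simple as the sketch below. The prefix list is a small illustrative assumption (common Malay place-name heads such as 'Bukit' for hill or 'Pulau' for island), not a list taken from the slides, and the function names are hypothetical.

```python
# Illustrative prefix list; a real one would be curated from gazetteers.
LOCATION_PREFIXES = ("Kampung", "Bukit", "Sungai", "Pulau", "Kuala")


def looks_like_location(phrase, prefixes=LOCATION_PREFIXES):
    """True for a capitalised multi-word phrase that starts with a prefix."""
    words = phrase.split()
    return (
        len(words) >= 2
        and words[0] in prefixes
        and all(word[0].isupper() for word in words)
    )


def extract_candidates(phrases):
    """Keep the phrases that pass the prefix rule, in their original order."""
    return [phrase for phrase in phrases if looks_like_location(phrase)]


if __name__ == "__main__":
    sample = ["Bukit Bendera", "bukit kecil", "Pulau Pinang", "Laut Kaspia", "Sungai"]
    print(extract_candidates(sample))
```

Candidates found this way would still need checking against existing gazetteers or corpus evidence before being added.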
Further Processing
More advanced processing (including discovering relations) may (should) be possible in future
Study the definition text patterns in Kamus Dewan
Greater availability of Malay corpora
Thank You
qatlho'
Danke
谢谢
Grazie
Спасибо
ขอบคุณ
Merci
Gracias
Obrigado
Ευχαριστώ
감사합니다
Terima kasih
Thank you
ありがとう
Tapadh leibh
Go raibh maith agaibh
Xin cảm ơn
Wikipedia Article Dumps
<page><title>Marikh</title><text>
{{Infobox Planet ...}}[[en:Mars]][[es:Marte (planeta)]]
</text></page>
<page><title>Laut Kaspia</title><text>
[[Kategori:Tasik di Eropah|Kaspia]][[Kategori:Tasik di Rusia|Kaspia]][[Kategori:Tasik di Asia|Kaspia]][[en:Caspian Sea]][[es:Mar Caspio]]
</text></page>
Categories and Multilingual Translations
[[Kategori:Tasik di Eropah|Kaspia]]
(Category: European Lakes → Caspia)
[[es:Mar Caspio]]
Spanish (es) translation = Mar Caspio (multilingual dictionary)
The Spanish Wikipedia article about the Caspian Sea can be accessed at http://es.wikipedia.org/wiki/Mar_Caspio (bilingual/multilingual comparable corpora)
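Pulling the categories and interlingual links out of an article's wikitext can be sketched with two regular expressions. This is an illustration under simplifying assumptions: language codes are taken to be two or three lower-case letters, the category namespace is the Malay 'Kategori', and `parse_links` is a hypothetical helper.

```python
import re

# [[xx:Title]] where xx is a short lower-case language code.
INTERLANG = re.compile(r"\[\[([a-z]{2,3}):([^\]|]+)\]\]")
# [[Kategori:Name]] or [[Kategori:Name|sort key]]; only the name is captured.
CATEGORY = re.compile(r"\[\[Kategori:([^\]|]+)(?:\|[^\]]*)?\]\]")


def parse_links(wikitext):
    """Return ({language: title}, [category names]) found in the wikitext."""
    translations = {lang: title.strip() for lang, title in INTERLANG.findall(wikitext)}
    categories = [name.strip() for name in CATEGORY.findall(wikitext)]
    return translations, categories


# Wikitext of the 'Laut Kaspia' article as shown on the slide.
TEXT = (
    "[[Kategori:Tasik di Eropah|Kaspia]][[Kategori:Tasik di Rusia|Kaspia]]"
    "[[Kategori:Tasik di Asia|Kaspia]][[en:Caspian Sea]][[es:Mar Caspio]]"
)

if __name__ == "__main__":
    print(parse_links(TEXT))
```

The `en` entry gives the English title to look up in Princeton WordNet, and the category names give the clues for choosing between synsets.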
Adding Entries to WordNet Bahasa
1 For the title of each Indonesian and Malaysian Wikipedia article, look up its corresponding English title in Princeton WordNet
2 If only one synset is found, map the Indonesian/Malaysian title to it
3 If multiple synsets are found, compare the hypernym chain of each synset to the semantic type and categories of the Wikipedia article. The first synset whose hypernym chain contains the semantic type or one of the categories is chosen as the synset to be mapped to
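The three steps above can be sketched as follows. The toy lookup table stands in for Princeton WordNet, and its identifiers and hypernym chains are simplified inventions. Two further assumptions are not stated on the slide: the semantic type and categories are taken to be already available in English, and `None` is returned when no synset's hypernym chain matches.

```python
# Toy stand-in for Princeton WordNet: lemma -> [(synset id, hypernym chain)].
# Identifiers and chains are simplified for illustration only.
TOY_WORDNET = {
    "mars": [
        ("mars-deity", ["roman deity", "deity", "spiritual being"]),
        ("mars-planet", ["superior planet", "planet", "celestial body"]),
    ],
    "caspian sea": [("caspian-sea-lake", ["lake", "body of water"])],
}


def choose_synset(english_title, semantic_type, categories, wordnet=TOY_WORDNET):
    """Pick the synset to map an article title to, following steps 1 to 3."""
    # Step 1: look up the English title.
    candidates = wordnet.get(english_title.lower(), [])
    if not candidates:
        return None
    # Step 2: a single synset is mapped directly.
    if len(candidates) == 1:
        return candidates[0][0]
    # Step 3: take the first synset whose hypernym chain contains the
    # semantic type or one of the categories.
    clues = {category.lower() for category in categories}
    if semantic_type:
        clues.add(semantic_type.lower())
    for synset_id, chain in candidates:
        if clues.intersection(hypernym.lower() for hypernym in chain):
            return synset_id
    return None  # assumption: leave the title unmapped when nothing matches


if __name__ == "__main__":
    print(choose_synset("Mars", "Planet", []))
    print(choose_synset("Caspian Sea", None, ["lake"]))
```

With real data, the chain for each candidate would come from walking the hypernym relation in Princeton WordNet up to the root.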
Adding Entries to WordNet Bahasa (cont'd)
8480 new mappings
3725 new synsets
732 new Malay entries (i.e. used in both Malaysian and Indonesian)
2109 new Malaysian entries
5473 new Indonesian entries
The Wikidata Project
Central storage for the structured data of Wikimedia projects, including Wikipedia, Wikivoyage, Wikisource and others
All interlingual links will eventually be moved to Wikidata
Hence some articles in the Wikipedia dumps are 'missing' these links
Wikidata HTTP API: http://www.wikidata.org/w/api.php
Example: Retrieving Multilingual Translations
http://www.wikidata.org/w/api.php?action=wbgetentities&sites=mswiki&titles=serangan%20jantung&normalize&format=json&props=datatype|labels|descriptions|aliases&languages=ms|id|en
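A request URL like the one above can be assembled programmatically so that titles are percent-encoded correctly. The sketch below only builds the URL and does not contact the server; the parameter names are the ones shown in the example, while the function name and its defaults are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

API_ENDPOINT = "http://www.wikidata.org/w/api.php"


def build_wbgetentities_url(title, site="mswiki", languages=("ms", "id", "en"), fmt="json"):
    """Build a wbgetentities request URL for one article title."""
    params = {
        "action": "wbgetentities",
        "sites": site,
        "titles": title,
        "normalize": "1",  # flag parameter; sent with a value here
        "format": fmt,
        "props": "datatype|labels|descriptions|aliases",
        "languages": "|".join(languages),
    }
    # urlencode escapes spaces and '|' so the URL is safe to send as is.
    return API_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    url = build_wbgetentities_url("serangan jantung")
    print(url)
    print(parse_qs(urlsplit(url).query)["titles"])
```

Passing `fmt="xml"` would request the XML form of the response instead of JSON.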
| Utilising Interlingual Links in Wikipedia Articles
Categories and Multilingual Translations
[[Kategori:Tasik di Eropah|Kaspia]] (Category: European Lakes → Caspia)
[[es:Mar Caspio]]: Spanish (es) translation = Mar Caspio (multilingual dictionary)
The Spanish Wikipedia article about the Caspian Sea can be accessed at http://es.wikipedia.org/wiki/Mar_Caspio (bilingual/multilingual comparable corpora)
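As a minimal sketch of how such links might be pulled out of an article's wikitext, the Python below separates category links from interlanguage links. The function name `parse_links`, the set `CATEGORY_PREFIXES` and the sample text are assumptions made for illustration, and the language-code test is only a rough heuristic (a real pipeline would check against a list of known Wikipedia language codes).

```python
import re

# Matches [[target]] or [[target|label]].
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")

# Assumed category namespace names (Malay/Indonesian and English).
CATEGORY_PREFIXES = {"kategori", "category"}


def parse_links(wikitext):
    """Return (categories, translations) found in a page's wikitext."""
    categories = []     # list of (category name, sort key or None)
    translations = {}   # language code -> article title in that language
    for target, label in LINK_RE.findall(wikitext):
        prefix, sep, rest = target.partition(":")
        if not sep:
            continue  # ordinary internal link without a prefix
        prefix = prefix.strip().lower()
        if prefix in CATEGORY_PREFIXES:
            categories.append((rest.strip(), label or None))
        elif re.fullmatch(r"[a-z]{2,3}(-[a-z]+)*", prefix):
            # Heuristic: a short lowercase prefix is treated as a language code.
            translations[prefix] = rest.strip()
    return categories, translations


# Illustrative wikitext fragment in the style of the links shown above.
sample = (
    "[[Kategori:Tasik di Eropah|Kaspia]]\n"
    "[[en:Caspian Sea]]\n"
    "[[es:Mar Caspio]]"
)
cats, trans = parse_links(sample)
print(cats)
print(trans)
```

The `translations` dictionary gives one row of a multilingual dictionary per article, while `categories` can later serve as a disambiguation clue.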
Adding Entries to WordNet Bahasa
1 For the title of each Indonesian and Malaysian Wikipedia article, look up its corresponding English title in Princeton WordNet
2 If only one synset is found, map the Indonesian/Malaysian title to it
3 If multiple synsets are found, compare the hypernym chain of each synset to the semantic type and categories of the Wikipedia article. The first synset whose hypernym chain contains the semantic type or one of the categories is chosen as the synset to be mapped to
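The three steps above can be sketched as follows. This is only an illustration under stated assumptions: `TOY_LEMMA_INDEX` and `TOY_HYPERNYMS` are tiny hand-made stand-ins for Princeton WordNet (the ids and chains are not real WordNet data), `choose_synset` is a hypothetical name, and comparing clues by lowercase exact match is a simplification of whatever matching the actual system used.

```python
# Toy stand-in for Princeton WordNet: synset id -> hypernym chain (as lemmas).
# Ids and chains are illustrative only.
TOY_HYPERNYMS = {
    "mars.n.01": ["planet", "celestial body", "natural object"],
    "mars.n.02": ["roman deity", "deity", "spiritual being"],
    "heart_attack.n.01": ["attack", "disorder", "condition"],
}
TOY_LEMMA_INDEX = {
    "mars": ["mars.n.01", "mars.n.02"],
    "heart attack": ["heart_attack.n.01"],
}


def choose_synset(english_title, semantic_type, categories):
    """Return the synset id to map an article to, or None if undecided."""
    # Step 1: look up the English title.
    candidates = TOY_LEMMA_INDEX.get(english_title.lower(), [])
    # Step 2: a single synset is mapped directly.
    if len(candidates) == 1:
        return candidates[0]
    # Step 3: take the first synset whose hypernym chain contains the
    # article's semantic type or one of its categories.
    clues = {c.lower() for c in categories}
    if semantic_type:
        clues.add(semantic_type.lower())
    for synset in candidates:
        if clues & set(TOY_HYPERNYMS[synset]):
            return synset
    return None  # not found, or ambiguous with no matching clue


print(choose_synset("Mars", "Planet", []))
print(choose_synset("heart attack", None, []))
```

Returning `None` rather than guessing keeps unresolved titles out of the mapping so they can be reviewed by hand.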
Adding Entries to WordNet Bahasa (cont’d)
8480 new mappings
3725 new synsets
732 new Malay entries (i.e. used in both Malaysian and Indonesian)
2109 new Malaysian entries
5473 new Indonesian entries
| Utilising Wikidata API
The Wikidata Project
Central storage for the structured data of Wikimedia projects, including Wikipedia, Wikivoyage, Wikisource and others
All interlingual links will be moved to Wikidata eventually
Hence some articles in the Wikipedia dumps are ‘missing’ these links
Wikidata HTTP API: http://www.wikidata.org/w/api.php
Example: Retrieving Multilingual Translations
http://www.wikidata.org/w/api.php?action=wbgetentities&sites=mswiki&titles=serangan%20jantung&normalize&format=json&props=datatype|labels|descriptions|aliases&languages=ms|id|en
XML response, as returned when format=xml is requested (other formats are possible):
<?xml version="1.0"?>
<api success="1">
  <normalized>
    <n from="serangan jantung" to="Penginfarkan miokardium" />
  </normalized>
  <entities>
    <entity id="Q12152" type="item">
      <labels>
        <label language="ms" value="Penginfarkan miokardium" />
        <label language="id" value="Serangan jantung" />
        <label language="en" value="heart attack" />
      </labels>
      <descriptions>
        <description language="en" value="interruption of blood supply to a part of the heart" />
      </descriptions>
      <aliases>
        <alias language="en" value="acute myocardial infarction" />
        <alias language="en" value="AMI" />
        <alias language="en" value="MI" />
        <alias language="en" value="myocardial infarction" />
        <alias language="ms" value="Serangan jantung" />
      </aliases>
    </entity>
  </entities>
</api>
Mappings to WordNet synsets are done, but not yet checked and/or officially added
At a cursory glance: lots of lexicalisations identical to English
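A minimal Python sketch of the request and of reading the response, assuming the response has the shape shown in the example above. The helper names `build_query` and `labels_and_aliases` and the trimmed `SAMPLE` string are illustrative. No network call is made here; fetching the URL is left to the reader.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

API = "http://www.wikidata.org/w/api.php"


def build_query(title, site="mswiki", languages=("ms", "id", "en")):
    """Build a wbgetentities URL for one article title on one wiki."""
    params = {
        "action": "wbgetentities",
        "sites": site,
        "titles": title,
        "normalize": "",   # flag parameter: presence is what matters
        "format": "xml",
        "props": "labels|descriptions|aliases",
        "languages": "|".join(languages),
    }
    return API + "?" + urlencode(params)


def labels_and_aliases(xml_text):
    """Map entity id -> {language: [label, alias, ...]} from an XML response."""
    root = ET.fromstring(xml_text)
    result = {}
    for entity in root.iter("entity"):
        terms = {}
        for node in list(entity.iter("label")) + list(entity.iter("alias")):
            terms.setdefault(node.get("language"), []).append(node.get("value"))
        result[entity.get("id")] = terms
    return result


# Trimmed copy of the example response, used instead of a live request.
SAMPLE = """<api success="1"><entities>
<entity id="Q12152" type="item">
  <labels>
    <label language="ms" value="Penginfarkan miokardium" />
    <label language="id" value="Serangan jantung" />
    <label language="en" value="heart attack" />
  </labels>
  <aliases>
    <alias language="en" value="myocardial infarction" />
    <alias language="ms" value="Serangan jantung" />
  </aliases>
</entity>
</entities></api>"""

url = build_query("serangan jantung")
terms = labels_and_aliases(SAMPLE)
print(url)
print(terms["Q12152"]["ms"])
```

Collecting labels and aliases per language in one list gives candidate lemmas for a synset in each language, with the English terms acting as the bridge to Princeton WordNet.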
| Possible Next Steps
Extending Specific Hierarchies
Culture-specific concepts, e.g. clothing items, food and dishes, etc.
E.g. Malay Wikipedia category ‘Masakan Malaysia’ (Malaysian dishes):
ayam golek, buah keras, acar timun, ikan bakar, mi rebus, apam
Some items (e.g. ‘Ayam penyet’) are in neither the Indonesian nor the Malay Wikipedia, but are listed in the English Wikipedia
How can we tell if the title of a Wikipedia article is a foreign-language word?
Kamus Dewan
The main Malay monolingual dictionary in Malaysia
Published by Dewan Bahasa dan Pustaka
Currently annotating contents with TEI to give structure (as part of another project)
Allows for easier, more targeted searches
(Still in progress – expected completion Nov 2014)
Not open-source, but can be used for research by arrangement
e.g. can be used for some quick additions to WordNet Bahasa
Lexical Items
Derived words as subentries of root word – very rich
Lots of MWEs, including peribahasa (idioms): usually fixed word order and little morphosyntactic process
No POS
Do something based on definition text
Extending Specific Hierarchies
Search by definitions (start with the simple ones)
Definition = ‘sj (masakan | makanan) ...’ (A dish ...)
bamiyah, besengek, caca, dalca, gudeg, ...
Other possibilities: dances, articles of clothing, musical instruments, ...
(KD definition texts themselves cannot be used in WordNet Bahasa – copyright issues)
(Mine from Wikipedia: CC-BY-SA/GFDL)
‘Flat’ hierarchy for starters
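One way the definition search might look in Python is sketched below, assuming dictionary entries are available as (headword, definition) pairs and that ‘sj’ (taken here to abbreviate ‘sejenis’, ‘a kind of’) opens the definition. The `ENTRIES` list uses placeholder definition texts, not real Kamus Dewan text, and `find_hyponyms` is a hypothetical helper name.

```python
import re

# Placeholder (headword, definition) pairs. The definition texts are invented
# stubs; real Kamus Dewan definitions are not reproduced here.
ENTRIES = [
    ("gudeg", "sj masakan ..."),
    ("dalca", "sj makanan ..."),
    ("batang", "penjodoh bilangan ..."),
]

# Definitions that start with 'sj masakan' or 'sj makanan'.
DISH_PATTERN = re.compile(r"^sj\s+(masakan|makanan)\b")


def find_hyponyms(entries, pattern):
    """Return headwords whose definition matches the given pattern."""
    return [head for head, definition in entries
            if pattern.search(definition.strip())]


# 'Flat' hierarchy: every match becomes a direct hyponym of one synset.
flat_hierarchy = {"dish": find_hyponyms(ENTRIES, DISH_PATTERN)}
print(flat_hierarchy)
```

Only the headwords are kept in the result, which fits the constraint that the definition texts themselves cannot be reused.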
Names of Flora & Fauna
KD contains a huge number of binomial names (Latin names) for flora & fauna
Match up Malay names with English translations via Latin names
Some might not yet be in Princeton WordNet
Many may not even have English equivalent names
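The matching idea can be sketched as a join on the normalised Latin name. The records below are a few illustrative pairs typed in by hand, not data extracted from Kamus Dewan or Princeton WordNet, and `align_by_latin_name` is a hypothetical helper.

```python
def normalise(latin):
    """Lowercase a binomial name and collapse runs of whitespace."""
    return " ".join(latin.lower().split())


def align_by_latin_name(malay_entries, english_entries):
    """Pair Malay names with English names that share a Latin name.

    Both inputs are lists of (common name, Latin name) pairs. Returns
    (aligned, unmatched): a dict from Malay name to English names, and the
    Malay names whose Latin name has no English counterpart.
    """
    english_by_latin = {}
    for name, latin in english_entries:
        english_by_latin.setdefault(normalise(latin), []).append(name)

    aligned, unmatched = {}, []
    for malay, latin in malay_entries:
        matches = english_by_latin.get(normalise(latin))
        if matches:
            aligned[malay] = matches
        else:
            unmatched.append(malay)
    return aligned, unmatched


# Illustrative records only.
MALAY = [
    ("durian", "Durio zibethinus"),
    ("bunga raya", "Hibiscus rosa-sinensis"),
    ("petai", "Parkia speciosa"),
]
ENGLISH = [
    ("durian", "Durio zibethinus"),
    ("China rose", "Hibiscus  rosa-sinensis"),
    ("Chinese hibiscus", "hibiscus rosa-sinensis"),
]

aligned, unmatched = align_by_latin_name(MALAY, ENGLISH)
print(aligned)
print(unmatched)
```

The `unmatched` list corresponds to the last two points above: names that are not yet in Princeton WordNet, or that have no English equivalent at all.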
Penjodoh Bilangan (Classifiers)
KD indicates classifiers
batang: penjodoh bilangan bagi benda yang panjang-panjang (classifier for longish things)
But how to create/project relations between the ‘batang’ synset and ‘long things’?
And exceptions:
biji: penjodoh bilangan (bagi benda kecil dll) (for small things), but ‘sebiji meriam’ (a cannon)
Named Entities
Collect/extract gazetteer lists of named entities in Malay
Existing gazetteer lists: location names, Wikipedia categories
Rule-based: list of prefixes/suffixes
Corpus-based: news articles
Further Processing
More advanced processing (including discovering relations) may (should) be possible in future
Study the definition text patterns in Kamus Dewan
Greater availability of Malay corpora
Thank You
qatlho’
Danke
谢谢
Grazie
Спасибо
ขอบคุณ
Merci
Gracias
Obrigado
Ευχαριστώ
감사합니다
Terima kasih
Thank you
ありがとう
Tapadh leibh
Go raibh maith agaibh
Xin cảm ơn
| Utilising Wikidata API
Example Retrieving Multilingual Translations
httpwwwwikidataorgwapiphpaction=wbgetentitiesampsites=mswikiamptitles=
serangan20jantungampnormalizeampformat=jsonampprops=datatype|labels|
descriptions|aliasesamplanguages=ms|id|en
XML response (Other formats are possible)
ltxml version=10gtltapi success=1gt
ltnormalizedgtltn from=serangan jantung to=Penginfarkan miokardium gt
ltnormalizedgtltentitiesgt
ltentity id=Q12152 type=itemgtltlabelsgt
ltlabel language=ms value=Penginfarkan miokardium gtltlabel language=id value=Serangan jantung gtltlabel language=en value=heart attack gt
ltlabelsgtltdescriptionsgt
Lim and Tang | WordNet Bahasa HackathonWorkshop 14 25
| Utilising Wikidata API
Example Retrieving Multilingual Translations (contrsquod)
ltdescription language=en value=interruption of bloodsupply to a part of the heart gt
ltdescriptionsgtltaliasesgt
ltalias language=en value=acute myocardial infarction gtltalias language=en value=AMI gtltalias language=en value=MI gtltalias language=en value=myocardial infarction gtltalias language=ms value=Serangan jantung gt
ltaliasesgtltentitygt
ltentitiesgtltapigt
Mappings to WordNet synsets done but not yet checked andoroicially added
Cursory glance lots of identical lexicalisation to English
Lim and Tang | WordNet Bahasa HackathonWorkshop 15 25
| Possible Next Steps
Topics
1 How it all started
2 Utilising Interlingual Links in Wikipedia Articles
3 Utilising Wikidata API
4 Possible Next Steps
Lim and Tang | WordNet Bahasa HackathonWorkshop 16 25
| Possible Next Steps
Extending Specific Hierarchies
Culture-specific concepts, e.g. clothing items, food and dishes, etc.
E.g. Malay Wikipedia category 'Masakan Malaysia' (Malaysian dishes):
  ayam golek, buah keras, acar timun, ikan bakar, mi rebus, apam
Some items (e.g. 'Ayam penyet') are in neither the Indonesian nor the Malay Wikipedia, but are listed in the English Wikipedia
How can we tell if the title of a Wikipedia article is a foreign-language word?
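One way such a category could be harvested is through the MediaWiki categorymembers query. The sketch below only builds the request URL and parses a response; the helper names are hypothetical, and the sample response (including its page IDs) is made up for illustration rather than taken from a live query.

```python
import json
from urllib.parse import urlencode


def category_members_url(category, lang="ms", limit=50):
    """Build a list=categorymembers request URL (hypothetical helper, no network call)."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": "Category:" + category,  # canonical namespace name
        "cmlimit": str(limit),
        "format": "json",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?" + urlencode(params)


def article_titles(response_text):
    """Keep only main-namespace pages (ns 0), dropping subcategories (ns 14)."""
    data = json.loads(response_text)
    members = data.get("query", {}).get("categorymembers", [])
    return [m["title"] for m in members if m.get("ns") == 0]


# Made-up response for illustration; page IDs are placeholders.
SAMPLE_RESPONSE = json.dumps({
    "query": {
        "categorymembers": [
            {"pageid": 1, "ns": 0, "title": "Ayam golek"},
            {"pageid": 2, "ns": 0, "title": "Ikan bakar"},
            {"pageid": 3, "ns": 14, "title": "Kategori:Contoh"},
        ]
    }
})

if __name__ == "__main__":
    print(category_members_url("Masakan Malaysia"))
    print(article_titles(SAMPLE_RESPONSE))
```

The titles returned this way would still need the foreign-word check raised above before being added as Malay lemmas.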
Kamus Dewan
The main Malay monolingual dictionary in Malaysia
Published by Dewan Bahasa dan Pustaka
Currently annotating contents with TEI to give structure (as part of another project)
  Allows for easier, more targeted searches
  (Still in progress; expected completion Nov 2014)
Not open-source, but can be used for research by arrangement
  e.g. can be used for some quick additions to WordNet Bahasa
Lexical Items
Derived words are listed as subentries of the root word: very rich
Lots of MWEs, including peribahasa (idioms): usually fixed word order and little morphosyntactic process
No POS information
  Do something based on the definition text
Extending Specific Hierarchies
Search by definitions (start with the simple ones)
  Definition = 'sj (masakan | makanan) ...' (A dish ...): bamiyah, besengek, caca, dalca, gudeg
  Other possibilities: dances, articles of clothing, musical instruments
(KD definition texts themselves cannot be used in WordNet Bahasa: copyright issues)
(Mine from Wikipedia? CC-BY-SA/GFDL)
'Flat' hierarchy for starters
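The definition search could be sketched as a simple pattern match over (headword, definition) pairs. This is only an illustration: the function name is hypothetical, and the two dish definitions are placeholders, not actual Kamus Dewan text (only the 'batang' definition is quoted from these slides).

```python
import re

# Matches definitions that open with 'sj masakan' or 'sj makanan'.
GENUS_PATTERN = re.compile(r"^sj\s+(masakan|makanan)\b")

# Placeholder entries; the dish definitions are NOT real dictionary text.
ENTRIES = {
    "gudeg": "sj makanan (placeholder definition)",
    "dalca": "sj masakan (placeholder definition)",
    "batang": "penjodoh bilangan bagi benda yang panjang-panjang",
}


def find_hyponym_candidates(entries, pattern=GENUS_PATTERN):
    """Return {headword: genus term} for definitions that match the pattern."""
    hits = {}
    for headword, definition in entries.items():
        match = pattern.match(definition.strip().lower())
        if match:
            hits[headword] = match.group(1)
    return hits


if __name__ == "__main__":
    # 'Flat' hierarchy: each hit is attached directly under its genus term.
    for headword, genus in find_hyponym_candidates(ENTRIES).items():
        print(f"{headword} -> hyponym of '{genus}'")
```

Only the headword and the matched genus term are kept, so no definition text is carried over into the result.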
Names of Flora & Fauna
KD contains a huge number of binomial (Latin) names for flora & fauna
Match up Malay names with English translations via Latin names
Some might not yet be in Princeton WordNet
Many may not even have English equivalent names
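Matching via Latin names amounts to a join on a normalised binomial. A minimal sketch follows, assuming both sides are available as simple name-to-binomial dictionaries; the function names and the sample data are illustrative and not drawn from Kamus Dewan or Princeton WordNet, and 'Exemplum fictum' is an invented placeholder for a species with no English counterpart.

```python
def normalise_binomial(name):
    """Lower-case and collapse whitespace so spelling variants compare equal."""
    return " ".join(name.lower().split())


def align_by_latin_name(malay, english):
    """Pair Malay and English names that share a Latin binomial.

    Both arguments map a common name to its Latin name.
    Returns (aligned pairs, Malay names with no English match).
    """
    english_index = {normalise_binomial(latin): en for en, latin in english.items()}
    aligned, unmatched = {}, []
    for ms_name, latin in malay.items():
        en_name = english_index.get(normalise_binomial(latin))
        if en_name is None:
            unmatched.append(ms_name)
        else:
            aligned[ms_name] = en_name
    return aligned, unmatched


# Illustrative data only.
MALAY = {
    "harimau": "Panthera tigris",
    "durian": "Durio  zibethinus",
    "pokok contoh": "Exemplum fictum",  # invented placeholder
}
ENGLISH = {"tiger": "panthera tigris", "durian": "Durio zibethinus"}

if __name__ == "__main__":
    aligned, unmatched = align_by_latin_name(MALAY, ENGLISH)
    print(aligned)
    print(unmatched)
```

The unmatched list corresponds to the last two points above: names missing from Princeton WordNet, or with no English equivalent at all.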
Penjodoh Bilangan (Classifiers)
KD indicates classifiers
  batang: penjodoh bilangan bagi benda yang panjang-panjang (classifier for longish things)
But how to create/project relations between the 'batang' synset and 'long things'? And what about exceptions?
  biji: penjodoh bilangan (bagi benda kecil dll) (for small things), but 'sebiji meriam' (a cannon)
Named Entities
Collect/extract gazetteer lists of named entities in Malay
  Existing gazetteer lists: location names, Wikipedia categories
  Rule-based: list of prefixes/suffixes
  Corpus-based: news articles
Further Processing
More advanced processing (including discovering relations) may (should) be possible in future
  Study the definition text patterns in Kamus Dewan
  Greater availability of Malay corpora
Thank You
qatlho'
Danke
谢谢
Grazie
Спасибо
ขอบคุณ
Merci
Gracias
Obrigado
Ευχαριστώ
감사합니다
Terima kasih
Thank you
ありがとう
Tapadh leibh
Go raibh maith agaibh
Xin cảm ơn