thai-language.com Glenn Slayden October 14, 2009


Page 1: Thai-language.com Glenn Slayden October 14, 2009

thai-language.com

Glenn Slayden
October 14, 2009

Page 2: Thai-language.com Glenn Slayden October 14, 2009

Agenda

• Background and history
• Site surface demonstration
• Database ontology
• Database technology
• Data entry demonstration
• Future directions
• Q&A: throughout, please

Page 3: Thai-language.com Glenn Slayden October 14, 2009

Overarching Motivation

• Long-term objectives:
  – Increase linguistic rigor
  – Publish any new work
  – Maintain popular accessibility
  – Build community

Page 4: Thai-language.com Glenn Slayden October 14, 2009

Historical Parchment - 1997

Page 5: Thai-language.com Glenn Slayden October 14, 2009

More Parchment - 2001

Page 6: Thai-language.com Glenn Slayden October 14, 2009

Site Demonstration

Page 7: Thai-language.com Glenn Slayden October 14, 2009

Database? What Database?

• How big is a monolingual dictionary?
• 100,000 words x 300 bytes/entry = 30 MB
• How much memory in a modern server? 32 GB.
• That’s about 1/10th of 1% (0.00094)
• SQL? MySQL? PostgreSQL? Not indicated.
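The estimate on this slide can be checked with a few lines of arithmetic. The sketch below assumes roughly 300 bytes per entry, which is the per-entry size at which 100,000 entries total the slide's 30 MB, and it uses decimal units (1 GB = 10^9 bytes) for simplicity. The constant names are illustrative only.

```python
# Back-of-envelope check of the dictionary-size estimate.
# Assumption: ~300 bytes per entry (the size that gives a 30 MB total).
ENTRIES = 100_000
BYTES_PER_ENTRY = 300            # assumed average entry size
SERVER_RAM_BYTES = 32 * 10**9    # 32 GB, decimal units

dictionary_bytes = ENTRIES * BYTES_PER_ENTRY
fraction = dictionary_bytes / SERVER_RAM_BYTES

print(f"dictionary size: {dictionary_bytes / 10**6:.0f} MB")
print(f"fraction of RAM: {fraction:.5f}")
```

Even if the per-entry figure were off by an order of magnitude, the dictionary would still occupy about 1% of server memory.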

Page 8: Thai-language.com Glenn Slayden October 14, 2009

Case Study

October 13, 2009 – 64-bit web server – 32 GB RAM

Page 9: Thai-language.com Glenn Slayden October 14, 2009

Server Memory Utilization

n.b. this entire pie chart represents 10% of total memory

Page 10: Thai-language.com Glenn Slayden October 14, 2009

In-memory is the way to go

• For performance
• For ease and speed of development
• Easy refactoring
• LINQ (C# “language-integrated query”)
• Have a flexible and powerful object model without worrying about relational mapping
• Completely avoid OR/M (object-relational mapping) “impedance mismatch” issues
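As a rough illustration of the in-memory approach, the sketch below queries plain objects directly, with no SQL and no mapping layer. Python comprehensions stand in here for the C# LINQ named on the slide. The `Entry` class, its fields, and the three sample entries are hypothetical and are not taken from the site's actual object model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical dictionary entry held entirely in memory."""
    headword: str
    pos: str      # part of speech
    gloss: str    # English gloss


# Illustrative sample data only.
ENTRIES = [
    Entry("น้ำ", "noun", "water"),
    Entry("กิน", "verb", "to eat"),
    Entry("บ้าน", "noun", "house"),
]


def nouns_sorted_by_gloss(entries):
    """Filter and sort ordinary objects; the query is just code."""
    return sorted((e for e in entries if e.pos == "noun"),
                  key=lambda e: e.gloss)


result = [e.gloss for e in nouns_sorted_by_gloss(ENTRIES)]
print(result)  # ['house', 'water']
```

Because the query operates on the objects themselves, renaming or restructuring a field is an ordinary refactoring rather than a schema migration.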

Page 11: Thai-language.com Glenn Slayden October 14, 2009

thai-language.com Ontology

• Disclaimer and warning
  – Internal names of programming objects are not (any longer) intended to have any relationship to corresponding linguistic terms. On the following slides, please consider these names to be opaque monikers.

Page 12: Thai-language.com Glenn Slayden October 14, 2009

thai-language.com Ontology

These colors correspond (roughly) to data-entry screen colors in DBEdit

Page 13: Thai-language.com Glenn Slayden October 14, 2009

The most basic

Page 14: Thai-language.com Glenn Slayden October 14, 2009

Lucky Decision

• ...that turned out to be incredibly valuable:
  – Heterogeneous objects are assigned ID numbers within mutually exclusive ranges
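One consequence of mutually exclusive ID ranges is that a bare ID number is enough to tell what kind of object it refers to. The sketch below shows one way that lookup could work. The object kinds, range boundaries, and function name are all hypothetical, since the slides do not give the actual ranges.

```python
from bisect import bisect_right

# Hypothetical ranges: (first_id, kind). Each kind owns the IDs from its
# first_id up to, but not including, the next kind's first_id.
RANGES = [
    (1, "word"),
    (100_000, "sense"),
    (200_000, "phrase"),
    (300_000, "image"),
]
STARTS = [start for start, _ in RANGES]
UPPER_BOUND = 400_000  # assumed exclusive upper limit of the last range


def kind_of(object_id: int) -> str:
    """Return the object kind implied by an ID, using only its range."""
    if not (STARTS[0] <= object_id < UPPER_BOUND):
        raise ValueError(f"ID {object_id} is outside every known range")
    return RANGES[bisect_right(STARTS, object_id) - 1][1]


print(kind_of(42))        # word
print(kind_of(100_000))   # sense
print(kind_of(250_123))   # phrase
```

With a scheme like this, a single reference field can point at any kind of object without a separate type tag.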

Page 15: Thai-language.com Glenn Slayden October 14, 2009

Scary Picture with Clouds In It

Page 16: Thai-language.com Glenn Slayden October 14, 2009

Data Entry Demonstration

Page 17: Thai-language.com Glenn Slayden October 14, 2009

Future directions

• Track provenance of entries and changes
• Separate out meta-information in English senses
• Move towards community curatorship while maintaining asset value
  – Requires reputation-granting authority
• Refine and formalize dictionary statement of purpose (i.e. to prevent hijacking)

Page 18: Thai-language.com Glenn Slayden October 14, 2009

Technology Changes

• In 2009, optimizing a language dictionary database for size is not necessary
• Detailed fields should be generously deployed
• Exception to the in-memory model:
  – Comprehensive change version tracking may warrant database storage
  – This is necessary for community curatorship
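To make the change-tracking exception concrete, here is a minimal sketch of an append-only change log kept in a database. SQLite (via Python's standard `sqlite3` module) is used purely as an assumption for illustration, since the slides do not name a storage engine. The table layout and function names are hypothetical.

```python
import sqlite3
from datetime import datetime, timezone


def open_log(path=":memory:"):
    """Open (or create) a hypothetical append-only change log."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS changes (
               id         INTEGER PRIMARY KEY AUTOINCREMENT,
               entry_id   INTEGER NOT NULL,
               field      TEXT NOT NULL,
               old_value  TEXT,
               new_value  TEXT,
               author     TEXT NOT NULL,
               changed_at TEXT NOT NULL)"""
    )
    return conn


def record_change(conn, entry_id, field, old, new, author):
    """Append one change; earlier rows are never updated or deleted."""
    conn.execute(
        "INSERT INTO changes "
        "(entry_id, field, old_value, new_value, author, changed_at) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (entry_id, field, old, new, author,
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


def history(conn, entry_id):
    """Return the changes to one entry, oldest first."""
    rows = conn.execute(
        "SELECT field, old_value, new_value, author FROM changes "
        "WHERE entry_id = ? ORDER BY id",
        (entry_id,),
    )
    return rows.fetchall()


conn = open_log()
record_change(conn, 42, "gloss", "watter", "water", "alice")
print(history(conn, 42))
```

Recording the author with every row is what would later let a curatorship scheme attribute, review, or revert individual contributions.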

Page 19: Thai-language.com Glenn Slayden October 14, 2009

An integrated DELPH-IN style computational-analytical grammar

• Associate a rigorous HPSG feature structure with each sense

• Display MRS and tree on dictionary page for compounds and sentences.

• Ability to designate gold standard parse trees and attestation provenance

• Live interface for LKB/PET-style parser to provide arbitrary parsing

Page 20: Thai-language.com Glenn Slayden October 14, 2009

Thanks for Coming!