Thai-language.com Glenn Slayden October 14, 2009

Agenda

• Background and history
• Site surface demonstration
• Database ontology
• Database technology
• Data entry demonstration
• Future directions
• Q&A: throughout, please

Overarching Motivation

• Long-term objectives:
– Increase linguistic rigor
– Publish any new work
– Maintain popular accessibility
– Build community

Historical Parchment - 1997

More Parchment - 2001

Site Demonstration

Database? What Database?

• How big is a monolingual dictionary?
• 100,000 words × 300 bytes/entry = 30 MB
• How much memory in a modern server? 32 GB.
• That’s about 1/10th of 1% (0.00094)
• SQL? MySQL? PostgreSQL? Not indicated.
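The back-of-envelope estimate on this slide can be checked in a few lines. This is only a sketch: the figure of 300 bytes per entry is an assumption chosen so that the total matches the 30 MB and the 0.00094 ratio stated on the slide, and decimal units (1 MB = 10^6 bytes) are assumed.

```python
# Back-of-envelope check of dictionary size versus server memory.
# Assumptions (not measured values): 300 bytes per entry, decimal units.
WORDS = 100_000
BYTES_PER_ENTRY = 300            # assumed average entry size
SERVER_RAM_BYTES = 32 * 10**9    # 32 GB server, as on the slide

dictionary_bytes = WORDS * BYTES_PER_ENTRY
fraction = dictionary_bytes / SERVER_RAM_BYTES

print(f"dictionary size: {dictionary_bytes / 10**6:.0f} MB")
print(f"fraction of RAM: {fraction:.5f}")
```

Even if the real entries were several times larger than assumed here, the whole dictionary would still occupy well under 1% of the server's memory.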

Case Study

October 13, 2009 – 64-bit web server – 32 GB RAM

Server Memory Utilization

n.b. this entire pie chart represents 10% of total memory

In-memory is the way to go

• For performance
• For ease and speed of development
• Easy refactoring
• LINQ – C# “language-integrated query”
• Have a flexible and powerful object model without worrying about relational mapping
• Completely avoid OR/M (object-relational mapping) “impedance mismatch” issues
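The slides name C# and LINQ as the site's tools. As a rough Python analogue of the same idea, the sketch below keeps the whole "database" as ordinary objects in memory and queries them with plain language syntax, with no SQL and no OR/M layer in between. The class names, fields, and sample entries are hypothetical and are not the site's actual object model.

```python
from dataclasses import dataclass, field

# Hypothetical object model: names and fields are illustrative only,
# not the site's actual schema.
@dataclass
class Sense:
    gloss: str
    part_of_speech: str

@dataclass
class Entry:
    headword: str
    senses: list[Sense] = field(default_factory=list)

# The whole "database" is an ordinary in-memory list of objects.
entries = [
    Entry("บ้าน", [Sense("house; home", "noun")]),
    Entry("กิน", [Sense("to eat", "verb")]),
    Entry("น้ำ", [Sense("water", "noun"), Sense("liquid", "noun")]),
]

# A query is plain language syntax over the object graph, in the spirit
# of a LINQ "from ... where ... select" expression.
nouns = [
    e.headword
    for e in entries
    if any(s.part_of_speech == "noun" for s in e.senses)
]
print(nouns)
```

Because the objects are nested directly (an entry holds its senses), there is no join to write and no mapping between tables and classes to maintain when the model is refactored.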

thai-language.com Ontology

• Disclaimer and warning
– Internal names of programming objects are not (any longer) intended to have any relationship to corresponding linguistic terms. On the following slides, please consider these names to be opaque monikers.

thai-language.com Ontology

These colors correspond (roughly) to data-entry screen colors in DBEdit

The most basic

Lucky Decision

• ...that turned out to be incredibly valuable:
– Heterogeneous objects are assigned ID numbers within mutually exclusive ranges
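One way to picture this decision: if each kind of object draws its IDs from its own range, the ID alone identifies the kind of object it names. The sketch below is a guess at the general shape only; the range boundaries and kind names are hypothetical and do not come from the slides.

```python
from bisect import bisect_right

# Hypothetical ranges: (first ID in range, object kind). Each kind owns
# the IDs from its start up to the next kind's start, so ranges never
# overlap. The last range is open-ended in this sketch.
ID_RANGES = [
    (1, "entry"),
    (1_000_000, "sense"),
    (2_000_000, "sample_sentence"),
    (3_000_000, "image"),
]
_STARTS = [start for start, _ in ID_RANGES]

def kind_of(object_id: int) -> str:
    """Return the object kind that owns this ID, based only on its range."""
    index = bisect_right(_STARTS, object_id) - 1
    if index < 0:
        raise ValueError(f"ID {object_id} is below every known range")
    return ID_RANGES[index][1]

print(kind_of(42))
print(kind_of(1_000_007))
print(kind_of(2_500_000))
```

Under this scheme a cross-reference or URL parameter can plausibly carry a bare number and still resolve to the right kind of object, without a separate type tag.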

Scary Picture with Clouds In It

Data Entry Demonstration

Future directions

• Track provenance of entries and changes
• Separate out meta-information in English senses
• Move towards community curatorship while maintaining asset value
– Requires reputation-granting authority
• Refine and formalize dictionary statement of purpose (i.e., to prevent hijacking)

Technology Changes

• In 2009, optimizing a language dictionary database for size is not necessary
• Detailed fields should be generously deployed
• Exception to the in-memory model:
– Comprehensive change version tracking may warrant database storage
– This is necessary for community curatorship
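The exception described on this slide, keeping change history in a database while the live dictionary stays in memory, might look something like the following sketch. The slides do not name a storage engine or schema; SQLite is used here only because it ships with Python, and the table, column, and editor names are hypothetical.

```python
import sqlite3

# Hypothetical append-only change log; the live dictionary stays in memory.
conn = sqlite3.connect(":memory:")  # a real deployment would use a file or server
conn.execute(
    """
    CREATE TABLE change_log (
        change_id  INTEGER PRIMARY KEY AUTOINCREMENT,
        object_id  INTEGER NOT NULL,
        field_name TEXT NOT NULL,
        old_value  TEXT,
        new_value  TEXT,
        editor     TEXT NOT NULL,
        changed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    )
    """
)

def record_change(object_id, field_name, old_value, new_value, editor):
    """Append one change; existing rows are never updated or deleted."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO change_log "
            "(object_id, field_name, old_value, new_value, editor) "
            "VALUES (?, ?, ?, ?, ?)",
            (object_id, field_name, old_value, new_value, editor),
        )

def history(object_id):
    """Return the changes to one object, oldest first."""
    rows = conn.execute(
        "SELECT field_name, old_value, new_value, editor FROM change_log "
        "WHERE object_id = ? ORDER BY change_id",
        (object_id,),
    )
    return rows.fetchall()

record_change(1_000_007, "gloss", "home", "house; home", "editor_a")
record_change(1_000_007, "gloss", "house; home", "house; home; dwelling", "editor_b")
print(history(1_000_007))
```

Recording who changed what, and from which previous value, is the kind of information that provenance tracking and community curatorship would need; the history grows without bound, which is one reason it may not belong in the in-memory model.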

An integrated DELPH-IN style computational-analytical grammar

• Associate a rigorous HPSG feature structure with each sense

• Display MRS and tree on dictionary page for compounds and sentences.

• Ability to designate gold standard parse trees and attestation provenance

• Live interface for LKB/PET-style parser to provide arbitrary parsing

Thanks for Coming!
