December 2006 Slide 1COMAD 2006 Tutorial
COMAD Tutorial on Multi-lingual Database Systems
Jayant Haritsa
Indian Institute of Science Bangalore, India
Database Systems Laboratory
December 2006 Slide 2COMAD 2006 Tutorial
Tutorial Contents
• Motivation
• Multilingual Standards and Systems
• Multilingual Performance
• New Multilingual Query Operators
• Multilingual Relational Query Algebra
• Multilingual Implementation
• Conclusions & Future Research
December 2006 Slide 3COMAD 2006 Tutorial
Organization
• Motivation
• Standards and Systems
• Multilingual Performance
• New Multilingual Query Operators
• Multilingual Relational Query Algebra
• Multilingual Implementation
• Conclusions & Future Research
December 2006 Slide 4COMAD 2006 Tutorial
Multilingual Portal
[Figure: an e-commerce portal that tracks history (user history), generates recommendations (user preferences and mined data), generates incentives (user preferences and mined data) and maintains product categories (meta-data).]
In database-speak …
• Data: 1 TB live; 25 TB warehouse
• DBMS: 26
• Systems: 1500 servers
• Deployment: a set of monolingual DBMSs
• Sub-second response time
What is needed in a DBMS to make such a portal multilingual?
December 2006 Slide 5COMAD 2006 Tutorial
Example Multilingual Database
• Books.com

Author | Author_FN | Title | Category | Price | Language
Durant | Will/Ariel | History of Civilization | History | $ 49.95 | English
நேரு | ஜவஹர்லால் | ஆசிய ஜோதி | சரித்திரம் | INR 250 | Tamil
Nero | Bicci | Il Coronation del Virgin | Arti Fini | € 75.00 | Italian
Lebrun | François | L'Histoire De La France | Histoire | € 19.95 | French
 |  |  |  | SAR 95 | Arabic
नेहरू | जवाहरलाल | भारत एक खोज | इतिहास | INR 175 | Hindi
Gilderhus | Mark T. | History and Historians | Historiography | £ 35.00 | English
Σαρρη | Κατερινα | Παιχνίδια στο Πιάνο | Μουσική | € 12.00 | Greek
Nehru | Jawaharlal | Letters to My Daughter | Autobiography | £ 15.00 | English
Descartes | René | Les Méditations Metaphysiques | Philosophie | € 99.95 | French
無門 | 慧開 | 無門關 | 禅 | ¥ 7500 | Japanese
காந்தி | மோகன்தாஸ் | சத்திய சோதனை | சுயசரிதம் | € 99.95 | Tamil
December 2006 Slide 6COMAD 2006 Tutorial
Granularity of Multi-lingualism
• Uni-lingual rows, multi-lingual columns
• Uni-lingual columns, multi-lingual rows
• Multi-lingual rows, multi-lingual columns
• Multi-lingual attribute values
December 2006 Slide 7COMAD 2006 Tutorial
Why Worry about Multilingual Data?
• Growing multilingual on-line users and data
  – By 2010, most web pages will be multilingual
• Today English is down to 35% from 90% in 1995
  – The non-native-English-speaking population has grown rapidly, from about one-third in the mid-90s to about two-thirds in the mid-00s
• E-commerce implications
  – Opens up enormous new markets
  – Users are four times more likely to buy a product or a service if the information is presented in their native language [Aberdeen]
• E-governance implications
  – Opens up communication to native communities
December 2006 Slide 8COMAD 2006 Tutorial
Sample Applications
• Vidyanidhi
  – Portal for all Indian research theses, hosted at Univ. of Mysore, Karnataka, India
  – Contains close to 100,000 records in English, Hindi and Kannada
• Bhoomi
  – Computerized land record system in the State of Karnataka, India
    • storing 20 million records with composite information in Kannada and English
  – to be followed in all other states as well
December 2006 Slide 9COMAD 2006 Tutorial
MLDB Research Questions (1)
• Are today’s database systems equally
  (a) functional
  (b) efficient
  across all human languages?
  – i.e., is the DBMS “natural-language-neutral”?
  – Specifically, is there a preference for Latin-script-based languages (English, French, German, …) as compared to those based on other scripts (Arabic, Cyrillic, CJK, Indic, …)?
December 2006 Slide 10COMAD 2006 Tutorial
MLDB Research Questions (2)
• Are new functionalities desired from a multilingual database system?
  – i.e., multi-lingual SQL operators?
December 2006 Slide 11COMAD 2006 Tutorial
Organization
• Motivation
• Multilingual Standards and Systems
• Multilingual Performance
• New Multilingual Query Operators
• Multilingual Relational Query Algebra
• Multilingual Implementation
• Conclusions & Future Research
December 2006 Slide 12COMAD 2006 Tutorial
Multilingual Functionality
To assess the support offered by current database systems and standards for multilingual data
December 2006 Slide 13COMAD 2006 Tutorial
Background – Character Encoding
• A character is the smallest component of a written language that has a semantic value. The set of all the characters in a language is called a repertoire.
• A character encoding assigns a unique numerical value to each of the characters in a repertoire.
  – Several encodings are available
December 2006 Slide 14COMAD 2006 Tutorial
Multilingual Character Encoding
• ASCII [& ISO 8859] Encoding
  – 7-bit [8-bit] for English [Western European]
• ISCII Encoding
  – 8-bit encoding for Indic languages (Indian national standard IS 13194)
• ISO 10646 – Universal Character Set (UCS-2)
  – Uniform 2-byte encoding for all languages
• Unicode Encoding
  – 2-byte encoding along the lines of UCS-2 (UTF-16); characters outside the Basic Multilingual Plane take 4 bytes
• The default standard for multilingual data storage in DBMS
  – Has a variable-byte encoding (UTF-8) that favors ASCII text: 1 byte per ASCII character, 2 bytes for accented Western European letters, 3 bytes for Indic and CJK characters
December 2006 Slide 15COMAD 2006 Tutorial
Unicode
• Unicode is an encoding standard that allows storage of characters from any known alphabet or ideographic system, irrespective of platform or programming environment. For characters in the Basic Multilingual Plane, which covers the scripts discussed here, its UTF-16 form is a uniform 2-byte encoding.
• Unicode codes are arranged in Character Blocks, which encode contiguously the characters of a given script (usually a single language).
• Unicode has 3 different byte encodings, UTF-8, UTF-16 and UTF-32, which store the same character in 1 to 4 bytes, 2 or 4 bytes, and 4 bytes, respectively.
December 2006 Slide 16COMAD 2006 Tutorial
Sample Encodings
Lang. | Encoding | Multilingual String | Representation (Hexadecimal)
English | ASCII | Narayan | 4E.61.72.61.79.61.6E
English | Unicode (UTF-16) | Narayan | 00.4E.00.61.00.72.00.61.00.79.00.61.00.6E
English | Unicode (UTF-8) | Narayan | 4E.61.72.61.79.61.6E
Tamil | ISCII | நாராயன் | A8.BE.B0.BE.AF.A9.CD
Tamil | Unicode (UTF-16) | நாராயன் | 0B.A8.0B.BE.0B.B0.0B.BE.0B.AF.0B.A9.0B.CD
Tamil | Unicode (UTF-8) | நாராயன் | E0.AE.A8.E0.AE.BE.E0.AE.B0.E0.AE.BE.E0.AE.AF.E0.AE.A9.E0.AF.8D
Kannada | ISCII | ನಾರಾಯಣ್ | A8.BE.B0.BE.AF.A3.CD
Kannada | Unicode (UTF-16) | ನಾರಾಯಣ್ | 0C.A8.0C.BE.0C.B0.0C.BE.0C.AF.0C.A3.0C.CD
Kanji | Unicode (UTF-16) | 寺井正博 | 5B.FA.4E.95.6B.63.53.5A
Kanji | Unicode (UTF-8) | 寺井正博 | E5.AF.BA.E4.BA.95.E6.AD.A3.E5.8D.9A
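The byte counts behind the sample encodings can be checked with a few lines of Python. This is only an illustrative sketch: the helper name `hex_bytes` is made up for this example, and the Tamil string is assembled from the code points listed in the UTF-16 row of the table.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Encode `text` and return its bytes as dot-separated uppercase hex."""
    return ".".join(f"{b:02X}" for b in text.encode(encoding))


english = "Narayan"
# Tamil string built from the code points in the UTF-16 row (U+0BA8, U+0BBE, ...).
tamil = "\u0BA8\u0BBE\u0BB0\u0BBE\u0BAF\u0BA9\u0BCD"

for label, text in (("English", english), ("Tamil", tamil)):
    for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
        size = len(text.encode(encoding))
        print(f"{label:8} {encoding:10} {size:3} bytes  {hex_bytes(text, encoding)}")
```

For the seven-character Tamil string this reports 21 bytes in UTF-8, 14 in UTF-16 and 28 in UTF-32, whereas the English string needs only 7 bytes in UTF-8. That asymmetry is what the performance study later in the tutorial measures.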
December 2006 Slide 17COMAD 2006 Tutorial
Database Systems
Commercial systems: Oracle, Microsoft SQL Server, IBM DB2
Public-domain: MySQL, PostgreSQL
For legal reasons, we will randomly refer to them as Systems A, B, C, D, E
December 2006 Slide 18COMAD 2006 Tutorial
Standards and SystemsSystem
ESystem
D
NoneNoneNoneNo SpecCross-Language Query Support
No Spec
~25
Similar to Character
Any Collations
System Defined &
User Definable
National Char
SQL:1999 Standard
Data + Meta-data
OS Defined
Similar to Character
System Collations
OS Specified (Pre-defined)
UCS-2
SystemB
DataDataSupport Level
~40~50Locales
Similar to Character
Similar to Character
Query Processing
System Collations
System Collations
Indexing
Pre-definedPre-definedCollations
UnicodeUTF-8/16
Unicode UTF-8/16
Storage Format
SystemC
SystemA
Database
None
Data
~30
Similar to Character
Any Collations
User Definable
(source changes)
UnicodeUTF-8
None
Data
~25
Binary
Any Collations
User Definable
(source changes)
Binary
December 2006 Slide 19COMAD 2006 Tutorial
Remarks
Database systems are generally equivalent in their storage and querying of multilingual data, and offer similar SQL querying power ...
However,
- Uniformly, no cross-language support
- Unknown differential performance
December 2006 Slide 20COMAD 2006 Tutorial
10,000’ View …
[Figure: three views of TEXT DATA (proper names, documents, string values; category attributes and other attributes).
Visual (representation): text strings such as “Nehru” and “Amazon” as grapheme images; scripts (English, Hindi, …); encoding (ASCII, Unicode, …); Unicode ⇔ Cuniform.
Aural (phonemic transformation): phoneme strings such as /N/ /ae/ /R/ /oo/; encoding (ITrans, Unicode, …); normalized representation (IPA, Arpabet, …).
Concepts (semantic transformation, semantics): multilingual synsets (WordNet); abstracted synsets (WordNet taxonomies).]
December 2006 Slide 21COMAD 2006 Tutorial
Organization
• Motivation
• Standards and Systems
• Multilingual Performance
• New Multilingual Query Operators
• Multilingual Relational Query Algebra
• Multilingual Implementation
• Conclusions & Future Research
December 2006 Slide 22COMAD 2006 Tutorial
Multilingual Performance
Are the DBMS’s natural-language neutral, wrt performance?
December 2006 Slide 23COMAD 2006 Tutorial
Database Setup
• Generated a 1 GB TPC-H database
• Tables modified to hold both the original CHAR [ASCII] data and the equivalent NCHAR [Unicode] data
• Experiments conducted with
  – separate CHAR and NCHAR tables
  – common tables (to eliminate the impact of disk I/O)
• Example PartSuppCommon table with equivalent Tamil data

Part ID | PartName (ASCII) | PartName_NChar (Unicode) | Supplier ID | SuppName (ASCII) | SuppName_NChar (Unicode)
18 | Part Name #000018 | பாகம் பெயர் #000018 | 2503 | Supplier #2503 | தயாரிப்பவர் #2503
December 2006 Slide 24COMAD 2006 Tutorial
System and DBMS Setup
• Stand-alone Pentium-IV running Windows 2000
  – Cold start ensured before each experiment
• DBMS: Oracle, DB2, SQL Server, Postgres
  – Installed with default configurations
• Display time nullified through aggregate functions in the select clause
December 2006 Slide 25COMAD 2006 Tutorial
Database Operators Measured
• Table-Scan
  – Time for scanning for a specific key
• Index-Create and Index-Scan
  – Time for creating the index
  – Time for retrieving 20% of the search keys in the index
• Sort
  – Time for sorting the attribute
• Join [Nested-Loop, Hash, Sort-Merge]
  – Join types forced by optimizer hints, setting optimization levels, etc.
  – Plan pictures verified the use of the appropriate join type in a query
December 2006 Slide 26COMAD 2006 Tutorial
Sample Queries
Join Operator
    select count(*)
    from PartSuppCommon PS1, PartSuppCommon PS2
    where PS1.SuppName = PS2.SuppName
      and PS1.PartName <> PS2.PartName
  NChar variant of the predicates:
    { PS1.SuppName_NChar = PS2.SuppName_NChar }
    { PS1.PartName_NChar <> PS2.PartName_NChar }

Table-Scan Operator
    select count(*)
    from PartSuppCommon PS1
    where PS1.SuppName = 'Supplier #2503'
  NChar variant of the predicate:
    { PS1.SuppName_NChar = 'தயாரிப்பவர் #2503' }
December 2006 Slide 27COMAD 2006 Tutorial
Performance Metrics
• Operator Performance
  – Measures the operator's differential performance between Char and NChar
  – T_NChar / T_Char (Ideal Value: 1)
• Database Relative Efficiency
  – Measures the database's differential performance between Char and NChar
  – GM_NChar / GM_Char (Ideal Value: 1)
• Optimizer Prediction Equity
  – Measures the optimizer's prediction [in]equity between Char and NChar
  – (O_NChar / O_Char) / (T_NChar / T_Char) (Ideal Value: 1)
December 2006 Slide 28COMAD 2006 Tutorial
Operator Performance(Common Table)
• MRO_Oper Metric (Ideal Value = 1)

Operator | Database System A | Database System B | Database System C
Table-Scan | 2.72 | 1.33 | 1.06
Index-Create | 1.21 | 1.25 | 1.39
Index-Scan | 2.75 | 1.35 | 1.97
Sort | 1.81 | 1.48 | 1.23
Join (Nested-Loop) | 1.03 | 1.03 | 1.59
Join (Hash) | 2.60 | 1.35 | 1.29
Join (Sort-Merge) | 1.92 | 1.55 | 1.34
Database System D: 2.70 1.79 1.01 1.24 1.36 1.12 1.32

• In a Nutshell:
  – Wide variation in system performance
  – Slowdown can be as much as 200%
  – Generally, 30-100% slower for NChar
All operators are slow on multilingual data
December 2006 Slide 29COMAD 2006 Tutorial
Overall DBMS Performance
• ME_DBMS Metric (Ideal Value = 1)

Database | Efficiency
Database System A | 0.57
Database System B | 0.80
Database System C | 0.70
Database System D | 0.69

• Databases are about 50% to 100% inefficient in the multilingual world
  – Note: this is a conservative estimate, since only in-memory differentials are considered because of the common table
  – With separate tables, the inefficiency jumps to several hundred percent (e.g., slowdown was up to 475% for scans and up to 275% for joins)
December 2006 Slide 30COMAD 2006 Tutorial
Query Optimizer Performance
• MPE_Oper Metric (Ideal Value = 1)

Operator | Database System A | Database System B | Database System C
Table-Scan | 0.37 | 0.75 | 0.94
Index-Scan | 0.38 | 1.55 | 0.31
Join (Nested-Loop) | 0.97 | 0.97 | 1.16
Join (Hash) | 1.26 | 0.75 | 1.22
Join (Sort-Merge) | 0.89 | 1.20 | 0.95
Database System D: 0.74 0.55 0.37 0.99 0.89

• In a Nutshell:
  – Generally, 5-100% mis-prediction
  – Could be due to erroneous cost models between Char and NChar
  – Could lead to grossly inefficient plans
December 2006 Slide 31COMAD 2006 Tutorial
Analysis of Performance
• Experiments on DBMS System A
  – System A exhibited the worst differential performance
• Our objective:
  – What are the components of the slowdown?
  – How can these be addressed to improve performance?
December 2006 Slide 32COMAD 2006 Tutorial
Slowdown vs String Size
• How does the slowdown vary with (logical) string length?
• High differential in Scan at small string sizes indicates:
  – high fixed-cost overheads (such as call overhead)
• Increasing differential cost indicates:
  – higher variable-cost overheads (such as string handling for function calls)
December 2006 Slide 33COMAD 2006 Tutorial
Slowdown w.r.t. Data Type
• We created a common table with:
  – Char attributes of size 110; NChar attributes of size 55
  – Attributes have the same physical size, but are of different types
• We ran the same queries
  – Call overheads are the same
  – The only difference is in the datatype-specific code segments in the common operator implementation
• The observed differential performance is ~10-15% of the corresponding operator slowdowns
  – Small, but not insignificant
December 2006 Slide 34COMAD 2006 Tutorial
Slowdown wrt. String Processing
• Created a common table (as before)
  – All NChar attributes replaced with Char attributes of twice the size
• Ran the same queries
  – Disk I/O and call overheads are the same, except that the data passed as parameters to the operator functions is different in size, same in type
  – Measures any differences in in-memory handling of different-sized strings in a common operator implementation
• The observed differential performance is ~80-90% of the corresponding operator slowdowns
  – This component is the primary contributor to the operator slowdown
December 2006 Slide 35COMAD 2006 Tutorial
Overall Performance Analysis
• The slowdown of NChar over Char (including disk I/O) is very large (several hundred percent)
• The slowdown of NChar over Char (considering only in-memory processing) is still large (50 to 100%):
  – Primarily, due to the size of the NChar strings (~85%)
  – Secondarily, due to the type-specific implementation (~10%)
• Hence, to improve performance, the size of the NChar storage must be tackled …
December 2006 Slide 36COMAD 2006 Tutorial
Cuniform Storage Format
December 2006 Slide 37COMAD 2006 Tutorial
Two Observations…
#1: Unicode = Character Block + Offset
  – about half the bits are for the character block

Lang | Encoding | String | Representation (Hexadecimal)
Tamil | ISCII | நாராயன் | A8.BE.B0.BE.AF.A9.CD
Tamil | Unicode | நாராயன் | 0B.A8.0B.BE.0B.B0.0B.BE.0B.AF.0B.A9.0B.CD

#2: An attribute value is likely to be in ONE script.

Author | ISBN | Title | Price
Nehru | 992 | Discovery of India | $12.95
नेहरू | 127 | भारत एक खोज | Rs 975
नेहरू | 590 | ®d¦Qî «dd£dTa (Vol 1) | Rs 175
December 2006 Slide 38COMAD 2006 Tutorial
Cuniform – Compressed UNIcode FORMat
• Store each string data item as an ordered pair
  – Common character block
  – Offset of each character in the common character block
• Original Unicode strings …

String | Unicode (UTF-16)
Narayan | 00.4E.00.61.00.72.00.61.00.79.00.61.00.6E
நாராயன் | 0B.A8.0B.BE.0B.B0.0B.BE.0B.AF.0B.A9.0B.CD
ನಾರಾಯಣ್ | 0C.A8.0C.BE.0C.B0.0C.BE.0C.AF.0C.A3.0C.CD
RKನಾರಾಯಣ್ | 00.52.00.4B.0C.A8.0C.BE.0C.B0.0C.BE.0C.AF.0C.A3.0C.CD

• … after skinning into Cuniform pairs

String | Character Block | Offsets
Narayan | 00 | 4E.61.72.61.79.61.6E
நாராயன் | 0B | A8.BE.B0.BE.AF.A9.CD
ನಾರಾಯಣ್ | 0C | A8.BE.B0.BE.AF.A3.CD
RKನಾರಾಯಣ್ | NULL | 00.52.00.4B.0C.A8.0C.BE.0C.B0.0C.BE.0C.AF.0C.A3.0C.CD
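The pair representation can be sketched in a few lines of Python. This is a minimal illustration, not the tutorial's implementation: the function names `to_cuniform` and `from_cuniform` are hypothetical, and the "character block" is approximated by the high byte of each UTF-16 code unit, as in the 00 / 0B / 0C examples on the slide.

```python
from typing import Optional, Tuple


def to_cuniform(text: str) -> Tuple[Optional[int], bytes]:
    """Split a string into (common block byte, one offset byte per character).

    If the characters do not share one high byte, the block is None (NULL)
    and the payload is the unmodified UTF-16 (big-endian) string.
    """
    data = text.encode("utf-16-be")
    high_bytes = set(data[0::2])          # high byte of every 2-byte code unit
    if len(high_bytes) == 1:
        return data[0], data[1::2]        # keep only the low (offset) bytes
    return None, data


def from_cuniform(block: Optional[int], payload: bytes) -> str:
    """Rebuild the Unicode string from a Cuniform pair."""
    if block is None:
        return payload.decode("utf-16-be")
    return "".join(chr((block << 8) | offset) for offset in payload)


tamil = "\u0BA8\u0BBE\u0BB0\u0BBE\u0BAF\u0BA9\u0BCD"
kannada = "\u0CA8\u0CBE\u0CB0\u0CBE\u0CAF\u0CA3\u0CCD"
for s in ("Narayan", tamil, kannada, "RK" + kannada):
    block, payload = to_cuniform(s)
    print(block, payload.hex("."), from_cuniform(block, payload) == s)
```

A single-script value shrinks to one byte per character plus the shared block, while a mixed-script value such as the last one falls back to plain UTF-16 with a NULL block.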
December 2006 Slide 39COMAD 2006 Tutorial
Implementation of Cuniform
• Transparently remap all the SQL queries to work on the Cuniform Pairs
• For presentation of results, conversion from Cuniform to Unicode is trivial
December 2006 Slide 40COMAD 2006 Tutorial
Operator Performance on Cuniform
• Outside-the-engine implementation
• Space occupancy
  – Only 2% larger than Char; compare this with NChar’s 100% overhead

Operator | Unicode Slow-down | Cuniform Slow-down
TableScan | 2.56 | 1.05
IndexScan | 1.88 | 1.99
Join (Sort-Merge) | 1.99 | 1.15
Join (Hash) | 2.74 | 1.22
Join (Nested-Loop) | 1.04 | 1.03

The differential performance is largely eliminated (except for Index Scan)
December 2006 Slide 41COMAD 2006 Tutorial
Cuniform Performance Summary
• Overall,
  – Generally, better performance
    • The exception is Index Scan: the index tree is built on a pair of attributes, resulting in worse performance
  – Multilingual efficiency rises to 0.81 from 0.57
    • With an inside-the-engine implementation, it can be made even better
  – The performance is made almost natural-language-neutral …
December 2006 Slide 42COMAD 2006 Tutorial
Remarks
There is a performance barrier separating languages in Latin script (e.g., English) from those in other scripts (e.g., Indic languages), but this barrier can be largely broken down with the Cuniform format …
December 2006 Slide 43COMAD 2006 Tutorial
Organization
• Motivation
• Standards and Systems
• Multilingual Performance
• New Multilingual Query Operators
• Multilingual Relational Query Algebra
• Multilingual Implementation
• Conclusions & Future Research
December 2006 Slide 44COMAD 2006 Tutorial
New Multilingual Operators
Objective: To assess the new operators desired from a multilingual database system
December 2006 Slide 45COMAD 2006 Tutorial
MLNameJoin Operator
† Referred to as LexEQUAL in published work
December 2006 Slide 46COMAD 2006 Tutorial
Multilingual Books.com (with Language Information)

Author | Author_FN | Title | Category | Price | Language
Durant | Will/Ariel | History of Civilization | History | $ 49.95 | English
நேரு | ஜவஹர்லால் | ஆசிய ஜோதி | சரித்திரம் | INR 250 | Tamil
Nero | Bicci | Il Coronation del Virgin | Arti Fini | € 75.00 | Italian
Lebrun | François | L'Histoire De La France | Histoire | € 19.95 | French
 |  |  |  | SAR 95 | Arabic
नेहरू | जवाहरलाल | भारत एक खोज | इतिहास | INR 175 | Hindi
Gilderhus | Mark T. | History and Historians | Historiography | £ 35.00 | English
Σαρρη | Κατερινα | Παιχνίδια στο Πιάνο | Μουσική | € 12.00 | Greek
Nehru | Jawaharlal | Letters to My Daughter | Autobiography | £ 15.00 | English
Descartes | René | Les Méditations Metaphysiques | Philosophie | € 99.95 | French
無門 | 慧開 | 無門關 | 禅 | ¥ 7500 | Japanese
காந்தி | மோகன்தாஸ் | சத்திய சோதனை | சுயசரிதம் | € 99.95 | Tamil
December 2006 Slide 47COMAD 2006 Tutorial
Multilingual Selection
Suppose a user wants the books of “Nehru” in English, Tamil and Hindi …

Author | Author_FN | Title | Price | Language
நேரு | ஜவஹர்லால் | ஆசிய ஜோதி | INR 250 | Tamil
Nehru | Jawaharlal | Letters to My Daughter | £ 15.00 | English
नेहरू | जवाहरलाल | भारत एक खोज | INR 175 | Hindi
December 2006 Slide 48COMAD 2006 Tutorial
Current SQL Approach
    Select Author, Title, ... From Books
    Where Author = “Nehru”
       or Author = “நேரு”
       or Author = “नेहरू” ...

• Problems with this approach
  – User needs to be fluent in all the target languages
  – Needs specialized lexical resources (fonts, keyboard mappings, etc.) for input
  – Input is prone to error, due to the lack of directory support
• Further, it CANNOT BE USED TO EXPRESS A JOIN ACROSS MULTILINGUAL COLUMNS
December 2006 Slide 49COMAD 2006 Tutorial
Proposed MLNameJoin Query
    Select Author, Title, ... From Books
    where Author MLNameJoin “Nehru” Threshold 0.2
    InLanguages { English, Tamil, Hindi }

• Input in a convenient language, with multilingual output
• Equivalence based on intuitive phonetic correspondence
  – restricted to proper names (which form about 20 percent of text corpora)
• Customizable fuzzy matching
  – Threshold parameter
• Most importantly, extensible…
  – “Retrieve in All Languages” ( InLanguages { * } )
  – Join ( Author MLNameJoin Faculty )
December 2006 Slide 50COMAD 2006 Tutorial
Matching Strategy
• Store multilexical strings in the database
  – In Unicode (or Cuniform)
• To match, transform to equivalent phonemic strings in the IPA alphabet using standard Text-to-Phoneme (TTP) converters …
• … and compare the transformed strings using approximate matching techniques
  – Needed because of incompatibilities in the phonemes of different languages
MLNameJoin transforms matching from textual space to phonemic space
December 2006 Slide 51COMAD 2006 Tutorial
Example MLNameJoin Operation
Books table (the last column is generated):

Author | Title | Language | Author (Phonemes)
நேரு | ஆசிய ஜோதி | Tamil | næru
Nehru | Discovery of India | English | næhru
नेहरू | भारत एक खोज | Hindi | næhru
Nero | The Coronation of the Virgin | English | nerou
Franklin | Ein Amerikanischer Autobiographie | German | frAŋklın

The query:
    Select Author ... From Books
    where Author MLNameJoin “Nehru” Threshold 0.2
    InLanguages { English, Tamil, Hindi }

will be executed as:
Transform “Nehru” to a phonemic string (using the English TTP) as “næhru”
Retrieve all records whose Language is one of (English, Tamil or Hindi) and whose phoneme strings are within an edit distance of 1 from “næhru”
1 = 0.2 * 5 = Threshold * phoneme length of “næhru”
December 2006 Slide 52COMAD 2006 Tutorial
MLNameJoin Implementation Goals
• Accurate & Efficient Matching across languages
• Minimum changes to the DB Server
December 2006 Slide 53COMAD 2006 Tutorial
State of the Art
• No support for multilexical matching in commercial DBMS
  – Soundex approximations for Latin-based scripts
• Approximate matching algorithms
  – No approximate matching supported in DBMS
    • However, UDFs can solve this problem, partially
• Phonetic matching in the IR & speech processing community
  – [Zobel-SIGIR96] In English, using Soundex-type algorithms
  – Speech processing research in online speech processing
• Proprietary solutions
  – LASA (Look-Alike-Sound-Alike) system for FDA
December 2006 Slide 54COMAD 2006 Tutorial
MLNameJoin Function
Steps:
Convert the input strings to phonemes and find the edit distance between the phonemic equivalents
If (edit-distance <= threshold * size of query string) return TRUE
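The two steps can be sketched in Python as follows. This is a hedged illustration under simplifying assumptions: the text-to-phoneme conversion is skipped (the inputs are already phoneme strings, with one character per phoneme), the function names are invented for this example, and the comparison is inclusive so that the worked example on the earlier slide (edit distance 1 against a bound of 0.2 * 5 = 1) counts as a match.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming (Levenshtein) distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def ml_name_match(query_phonemes: str, candidate_phonemes: str,
                  threshold: float) -> bool:
    """True if the candidate is within threshold * len(query) edits."""
    bound = threshold * len(query_phonemes)
    return edit_distance(query_phonemes, candidate_phonemes) <= bound


# Phoneme strings taken from the example Books table.
for candidate in ("næhru", "næru", "nerou", "frAŋklın"):
    print(candidate, ml_name_match("næhru", candidate, threshold=0.2))
```

With a threshold of 0.2 the first two candidates match and the last two do not; a threshold of 0 accepts only identical phoneme strings.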
December 2006 Slide 55COMAD 2006 Tutorial
MLNameJoin Parameters
• Match Threshold
  – Specifies the level of tolerance for mismatch between the phonemic strings
  – Tunable for matching
    • User-settable (per query), or global (for an application)
  – Threshold varied in [0, 1]
    • 0 => only perfect matches are accepted, 1 => anything can be matched
• Set of Output Languages
  – Those languages of interest to the multilingual user
December 2006 Slide 56COMAD 2006 Tutorial
EditDistance Function
Steps:
Basic dynamic programming algorithm to find the edit distance
InsCost, DelCost and SubCost are parameterized (cost for each operation)
Clusters of phonemes and intra-cluster substitution cost
December 2006 Slide 57COMAD 2006 Tutorial
EditDistance Parameters
• Cost Functions
  – InsCost, DelCost and SubCost may be set
  – Simulates different types of edit distances
• Intra-Cluster Substitution Cost
  – Phonemes may be clustered based on their likeness
    • Clusters to be formed with a linguist’s input
  – Matching phonemes within a cluster may be more acceptable than matching from outside a cluster
  – Cost varied in [0, 1]
    • 0 implies all phonemes are equivalent within a cluster
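A parameterized edit distance along these lines might look like the sketch below. The function name and the cluster table are hypothetical and chosen only for illustration; in practice the clusters would come from a linguist, as noted above.

```python
from typing import Dict


def clustered_edit_distance(a: str, b: str, clusters: Dict[str, str],
                            ins_cost: float = 1.0, del_cost: float = 1.0,
                            sub_cost: float = 1.0,
                            intra_cluster_cost: float = 0.5) -> float:
    """Edit distance where substitutions inside one phoneme cluster are cheaper."""

    def substitution(x: str, y: str) -> float:
        if x == y:
            return 0.0
        cx, cy = clusters.get(x), clusters.get(y)
        if cx is not None and cx == cy:
            return intra_cluster_cost     # like-sounding phonemes
        return sub_cost                   # unrelated phonemes

    prev = [j * ins_cost for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        cur = [i * del_cost]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + del_cost,
                           cur[j - 1] + ins_cost,
                           prev[j - 1] + substitution(x, y)))
        prev = cur
    return prev[-1]


# Hypothetical clusters: each phoneme maps to a cluster label.
CLUSTERS = {"p": "labial-stop", "b": "labial-stop",
            "t": "alveolar-stop", "d": "alveolar-stop",
            "æ": "front-vowel", "e": "front-vowel"}

print(clustered_edit_distance("pat", "bat", CLUSTERS, intra_cluster_cost=0.25))
print(clustered_edit_distance("pat", "mat", CLUSTERS, intra_cluster_cost=0.25))
```

Setting `intra_cluster_cost` to 0 makes all phonemes of a cluster interchangeable, and setting it equal to `sub_cost` reduces the function to the plain edit distance.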
December 2006 Slide 58COMAD 2006 Tutorial
Implementation Architecture
[Figure: implementation architecture. Inputs: query string, match threshold. Output: matched strings. Components: server manager, query processing engine, database (Unicode), TTSq, TTSn, approximate matching, cost function, clusters.]
December 2006 Slide 59COMAD 2006 Tutorial
Performance Experiments
December 2006 Slide 60COMAD 2006 Tutorial
Data Sets
Three Data Sets:

Set | Description | # of Strings
1 | Indian Names (Bangalore Telephone Directory) | 300
2 | Occidental Names (San Francisco Physicians Directory) | 300
3 | Generic Names (Places/Chemicals/Objects) | 198
All | Combined Set | 798

Equivalent phonemic strings were stored in the IPA alphabet, in Unicode format
Each set of matchable strings was hand-tagged with a Group ID
December 2006 Slide 61COMAD 2006 Tutorial
Data Sets (Continued…)
Generated Data Sets:Concatenated each string with every other string in the same language
Each set of matchable Strings have Same Group ID
Generated about 200,000 Strings
December 2006 Slide 62COMAD 2006 Tutorial
Phonetic Transformation
• Used standard text-to-phoneme conversion to the IPA alphabet
• For English:
  – A web-based TTP converter (http://www.foreignword.com)
    • Dictionary: Oxford English Dictionary
• For Indic:
  – Dhvani TTP (http://dhvani.sourceforge.com) after source modifications
    • All Indic languages are phonetic; hence almost a 1-to-1 mapping exists.

Character Name | Language | Phoneme Name (IPA Alphabet)
நாராயன் | Tamil | nɑrɑɪæn
Hydrogen | English | haɪdrədʒən
Amazon | English | æməzɑn
University | English | junəvɜrsɪti
க்ரிஸ்டபர் | Tamil | krɪstəpər
Computer | English | kəmpjutər
December 2006 Slide 63COMAD 2006 Tutorial
Metrics Measured
• Parameters varied
  – User match threshold
  – Intra-cluster substitution cost
• Metrics measured
  – M1: # of correctly reported matches
  – M2: # of reported matches
  – n: number of groups of equivalent names, and n_i: # of elements in the i-th group
  – Run time: as wall-clock time
  – Precision: fraction of the returned results that are correct
      $\text{Precision} = M_1 / M_2$
  – Recall: fraction of the correct results that are returned
      $\text{Recall} = M_1 \big/ \sum_{i=1}^{n} \binom{n_i}{2}$
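As a small worked illustration of the two quality metrics, the sketch below computes them from the counts defined on this slide. The function names and the numbers in the example are made up; the denominator of recall is the number of matchable pairs, i.e. the sum over groups of "n_i choose 2".

```python
from math import comb
from typing import Sequence


def precision(correct_reported: int, reported: int) -> float:
    """M1 / M2: fraction of reported matches that are correct."""
    return correct_reported / reported if reported else 0.0


def recall(correct_reported: int, group_sizes: Sequence[int]) -> float:
    """M1 divided by the total number of matchable pairs across all groups."""
    total_pairs = sum(comb(n_i, 2) for n_i in group_sizes)
    return correct_reported / total_pairs if total_pairs else 0.0


# Hypothetical run: two groups of equivalent names with 3 and 2 members,
# so there are C(3,2) + C(2,2) = 4 true matching pairs.
print(precision(correct_reported=3, reported=5))
print(recall(correct_reported=3, group_sizes=[3, 2]))
```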
December 2006 Slide 64COMAD 2006 Tutorial
Correctness Experiments
Recall & Precision Metrics
• Desired quality of match: (Recall 1, Precision 1)
Analysis of observed results (e is the match threshold):
  – Recall rate is reasonable (≥ 0.90) only with e ≥ 0.20
  – Precision rate is reasonable (≥ 0.90) only with e ≤ 0.30
December 2006 Slide 65COMAD 2006 Tutorial
Correctness Experiments
Tuning the Matching
• Best matching point (empirically, with precision-recall curves)
The best possible point of matching is reached with:
Match Threshold ∈ [0.25, 0.3] and Intra-Cluster Substitution Cost ∈ [0.25, 0.5]
For this dataset, Recall is 95% and Precision is 85%
December 2006 Slide 66COMAD 2006 Tutorial
Performance Experiments
Baseline
• Edit-distance function was implemented as a UDF

Query | Matching Methodology | Time (Sec)
Scan | Approximate (using UDF) | 1418
Join | Approximate (using UDF) | 4004

Reasons
  – UDF incurs large costs and runs in an interpreted environment
  – Edit distance uses an O(mn) algorithm
  – All the strings in the database were compared: O(N)
How to improve the performance?
  – Q-Gram technique
  – Approximate phonetic indexing
December 2006 Slide 67COMAD 2006 Tutorial
Performance Experiments Q-Grams
• Key Idea:
  – Generate and store all q-grams
  – Use q-gram filter properties to generate a candidate set (cheap) and prune false positives using the UDF (expensive)
• Three different filters are used (k is the allowed edit distance)
  – Length Filter: matching strings cannot differ in length by more than k
  – Count Filter: number of matching q-grams ≥ max(|a|, |b|) – 1 – (k – 1) q
  – Position Filter: matching q-grams cannot be more than k positions apart
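The length and count filters can be sketched as below. This is an illustrative sketch, not the tutorial's SQL-based implementation: the function names are invented, strings are padded with `#` and `$` (assumed not to occur in the data) so that a string of length n has n + q – 1 q-grams, and the position filter is left out for brevity.

```python
from collections import Counter


def qgrams(s: str, q: int = 2) -> Counter:
    """Multiset of q-grams of s, padded with q-1 sentinels on each side."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def passes_filters(a: str, b: str, k: int, q: int = 2) -> bool:
    """Cheap test: may a and b be within edit distance k?

    A False result is definitive; a True result only makes (a, b) a
    candidate that still has to be verified with the real edit distance.
    """
    if abs(len(a) - len(b)) > k:                        # length filter
        return False
    shared = sum((qgrams(a, q) & qgrams(b, q)).values())
    return shared >= max(len(a), len(b)) - 1 - (k - 1) * q   # count filter


print(passes_filters("næhru", "næru", k=1))   # survives as a candidate
print(passes_filters("næhru", "nerou", k=1))  # pruned without calling the UDF
```

Only the surviving candidates are passed to the expensive edit-distance UDF, which is where the speed-up comes from.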
December 2006 Slide 68COMAD 2006 Tutorial
Performance Experiments Q-Grams (continued…)
Query | Matching Methodology | Time (Sec)
Scan | Approximate (using q-grams) | 13.5
Join | Approximate (using q-grams) | 856

• From the baseline, improved by two orders of magnitude in Scan and about five times in Join
  – # of calls to the UDF reduced tremendously
• Caveats:
  – No false dismissals, only false positives
  – At the cost of 15X storage space, for storing the q-grams
December 2006 Slide 69COMAD 2006 Tutorial
Performance Experiments Phonetic Index
• Key Idea:
  – Use the “phoneme clusters” to generate a string of integers (like Soundex)
  – Convert to a single number
  – Index the phonemic strings in a standard B+ tree index, on the number
• For searching:
  – Transform the search string to its index string, and search in the index
    • The returned set is a candidate set of approximately matching strings
    • Use the UDF to weed out false positives
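The indexing idea can be illustrated with the following sketch. Everything named here is an assumption for illustration: the cluster numbering is hypothetical, each character is treated as one phoneme, and a Python dictionary stands in for the B+ tree.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Hypothetical phoneme clusters, numbered 1-8; 9 is used for unknown phonemes.
CLUSTERS = {"n": 1, "m": 1, "r": 2, "l": 2, "h": 3,
            "æ": 4, "e": 4, "u": 5, "o": 5}


def phonetic_key(phonemes: str) -> int:
    """Map each phoneme to its cluster digit and read the digits as a number."""
    digits = "".join(str(CLUSTERS.get(p, 9)) for p in phonemes)
    return int(digits or "0")


def build_index(strings: Iterable[str]) -> Dict[int, List[str]]:
    """Group phoneme strings by key (a dict standing in for a B+ tree)."""
    index: Dict[int, List[str]] = defaultdict(list)
    for s in strings:
        index[phonetic_key(s)].append(s)
    return index


def candidates(index: Dict[int, List[str]], query: str) -> List[str]:
    """Candidate set for the query; still to be verified by the edit-distance UDF."""
    return index.get(phonetic_key(query), [])


index = build_index(["næhru", "nehru", "næru", "nerou"])
print(candidates(index, "næhru"))
```

The lookup returns the strings whose phonemes fall in the same clusters position by position. The string "næru", although only one edit away, gets a different key and is missed, which is the kind of false dismissal this approach can introduce.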
December 2006 Slide 70COMAD 2006 Tutorial
Performance Experiments Phonetic Index (Continued…)
Query | Matching Methodology | Time (Sec)
Scan | Approximate (using Phon. Index) | 0.71
Join | Approximate (using Phon. Index) | 15.2

From the baseline, three orders of magnitude improvement
The index speeds up look-up of “like-strings” tremendously
Caveats:
  – False dismissals are possible now
  – Indexing cost is added
December 2006 Slide 71COMAD 2006 Tutorial
Performance Improvements(Three Alternative Techniques)
• Technique #1: Metric Distance Index
  – Pre-compute the edit distance of phonetic strings from a given key string
  – Use properties of edit distances to reduce calls to the UDF
• Technique #2: Q-Grams
  – Generate and store all q-grams of the phonetic strings
  – Use properties of q-grams to reduce calls to the UDF
• Technique #3: Phonemic Indexes
  – Convert phoneme strings to numbers (corresponding to phoneme clusters)
  – Index the resulting numbers, using B+ trees

Matching Methodology | Scan-Time | Join-Time | Remarks
Matching using Metric Distance Index | 356 | 1728 | Not much improvement on baseline
Matching using Q-Grams | 13.5 | 856 | 15X storage overheads for q-grams
Matching using Phonemic Indexes | 0.71 | 15.2 | ~5-8% false dismissals
December 2006 Slide 72COMAD 2006 Tutorial
Remarks
The MLNameJoin operator employing phonetic matching can complement the standard lexicographic operators, for cross-language name searches…
…Further, it may be implemented efficiently on existing systems.
December 2006 Slide 73COMAD 2006 Tutorial
MLSemJoin Operator
† Referred to as SemEQUAL in published work
December 2006 Slide 74COMAD 2006 Tutorial
Multilingual Books.com (with Category Information)

Author | Author_FN | Title | Category | Price | Language
Durant | Will/Ariel | History of Civilization | History | $ 49.95 | English
நேரு | ஜவஹர்லால் | ஆசிய ஜோதி | சரித்திரம் | INR 250 | Tamil
Nero | Bicci | Il Coronation del Virgin | Arti Fini | € 75.00 | Italian
Lebrun | François | L'Histoire De La France | Histoire | € 19.95 | French
 |  |  |  | SAR 95 | Arabic
नेहरू | जवाहरलाल | भारत एक खोज | इतिहास | INR 175 | Hindi
Gilderhus | Mark T. | History and Historians | Historiography | £ 35.00 | English
Σαρρη | Κατερινα | Παιχνίδια στο Πιάνο | Μουσική | € 12.00 | Greek
Nehru | Jawaharlal | Letters to My Daughter | Autobiography | £ 15.00 | English
Descartes | René | Les Méditations Metaphysiques | Philosophie | € 99.95 | French
無門 | 慧開 | 無門關 | 禅 | ¥ 7500 | Japanese
காந்தி | மோகன்தாஸ் | சத்திய சோதனை | சுயசரிதம் | € 99.95 | Tamil
December 2006 Slide 75COMAD 2006 Tutorial
Multilingual Semantic Selection
Suppose a user wants to retrieve all “History” books in English, Tamil and French ...

Author | Author_FN | Title | Category | Price
நேரு | ஜவஹர்லால் | ஆசிய ஜோதி | சரித்திரம் | INR 250
Durant | Will/Ariel | History of Civilization | History | $ 49.95
Lebrun | François | L'Histoire De La France | Histoire | € 19.95

Currently, an equivalent SQL expression is as follows…
    Select Author, Title, Category ... From Books
    where Category = “History” or Category = “சரித்திரம்” or Category = “Histoire” ...

We propose a simpler syntax as follows…
    Select Author, Title, Category ... From Books
    where Category MLSemJoin “History” InLanguages {English, Tamil, French}
December 2006 Slide 76COMAD 2006 Tutorial
Multilingual Semantic Selection(Adding Expressive Power)
Suppose a user wants to retrieve all “History-type” books in English, Tamil and French

    Select Author, Title, ... From Books
    where Category MLSemJoin All “History” InLanguages {English, Tamil, French}

Author | Author_FN | Title | Category | Price
நேரு | ஜவஹர்லால் | ஆசிய ஜோதி | சரித்திரம் | INR 250
Durant | Will/Ariel | History of Civilization | History | $ 49.95
Lebrun | François | L'Histoire De La France | Histoire | € 19.95
Nehru | Jawaharlal | Letters to My Daughter | Autobiography | £ 15.00
Gilderhus | Mark T. | History and Historians | Historiography | £ 35.00

Currently, no equivalent SQL expression is available for this query…
December 2006 Slide 77COMAD 2006 Tutorial
MLSemJoin Features
• Simpler query specification
  – Input in any convenient language, with multilingual output
    • When appropriate linguistic resources are not available
    • Specially suited for PDA and cell phone interfaces
• Robustness of query processing
  – The query string is more robust with respect to meaning, spelling, etc., as the matching does not rely just on lexicographic matching
  – Equivalence based on intuitive semantic correspondence
    • Semantic ⇒ based on a specified ontology
• Restrictions
  – Restricted to specific types of attributes (categorical)
December 2006 Slide 78COMAD 2006 Tutorial
Is MLSemJoin just Syntactic Sugar?
• Extends SQL expressive power
  – “Retrieve in All Languages”
    • To retrieve books irrespective of the language of publication ( InLanguages { * } )
  – Join functionality based on semantics
    • To retrieve “Books published by Publishers in their Specialty” ( Book.Category MLSemJoin Publisher.Specialty )
• Query processing with domain-specific ontologies
  – The same mechanism may be extended to any ontological query processing
    • Use domain-specific ontologies in specific domains
December 2006 Slide 79COMAD 2006 Tutorial
BackgroundWordNet Linguistic Resource
December 2006 Slide 80COMAD 2006 Tutorial
WordNet Basics: Words vs. Meaning
• WordNet is a Psycholinguistic Dictionary
  – It organizes concepts, similar to the human mind
  – CS-speak: Semantic Network
• A Word is an association between a concept and a string
  – Given by a Lexical Matrix, as follows:
[Figure: Lexical Matrix relating the word forms Aero-plane, Auto-mobile, Car, Flight and Leo to the synsets Sy:1 to Sy:4 (with Gloss 1 to Gloss 4); the two axes are labelled Synonymy and Polysemy]
English has ~110K Noun Words and ~75K Noun Synsets, with ~150K Associations between them
Need about 5 MB to store on-line
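The lexical matrix idea can be sketched as a pair of lookup tables. This is only an illustration: the associations below are invented for the example (they are not taken from WordNet), and the helper names `build_matrix`, `synonyms` and `is_polysemous` are hypothetical.

```python
from collections import defaultdict

# Invented <word form, synset> associations, for illustration only.
ASSOCIATIONS = [
    ("automobile", "Sy:2"),
    ("car", "Sy:2"),   # shares a synset with "automobile": synonymy
    ("car", "Sy:3"),   # a second sense of the same string: polysemy
    ("leo", "Sy:4"),
]


def build_matrix(pairs):
    """Index the associations in both directions."""
    word_to_synsets = defaultdict(set)
    synset_to_words = defaultdict(set)
    for word, synset in pairs:
        word_to_synsets[word].add(synset)
        synset_to_words[synset].add(word)
    return word_to_synsets, synset_to_words


def synonyms(word, word_to_synsets, synset_to_words):
    """Other word forms that share at least one synset with `word`."""
    found = set()
    for synset in word_to_synsets.get(word, ()):
        found |= synset_to_words[synset]
    found.discard(word)
    return found


def is_polysemous(word, word_to_synsets):
    """A word form is polysemous when it maps to more than one synset."""
    return len(word_to_synsets.get(word, ())) > 1


w2s, s2w = build_matrix(ASSOCIATIONS)
```

Storing only the non-empty cells as pairs is what keeps the footprint small: the size grows with the number of associations, not with words times synsets.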
December 2006 Slide 81COMAD 2006 Tutorial
WordNet Basics: Noun Hierarchy
• Nouns are Grouped into 25 “Semantic Primes”
• Under each, concepts are arranged in a Taxonomic Hierarchy
  – Can Specialize / Generalize a “Synset”
[Figure: three sample noun hierarchies, whose nodes are synsets corresponding to synonyms. Fauna: Land, Water-Based, Bird, Mammal, Whale, Dolphin, Mouse1. Artifact: Computer Peripherals, Pointing Devices, Mouse2. Knowledge: Philosophy, History, Personal History, Subject History, Biography, Autobiography, Historiography]
December 2006 Slide 82COMAD 2006 Tutorial
WordNet Basics: Interlinked Synsets
• With Multilingual WordNets, the English WN Hierarchy is taken as the base, and modified for Target Languages
  – Interlinking provided between the Synsets
[Figure: the English hierarchy (Knowledge, Philosophy, History, Personal History, Subject History, Biography, Autobiography, Historiography) next to the German hierarchy (Wissen, Philosophie, Geschichte, Persönliche Geschichte, Biographie, Autobiography), connected by Inter-language Semantic Links and Intra-language Is-A Links]
December 2006 Slide 83COMAD 2006 Tutorial
Multilingual WordNet Initiatives
• There are several WordNet initiatives around the globe, coordinated by Global WordNet Organization
  – Euro WordNet: Covers all major European Languages
  – Indo WordNet: Covers ~15 Official Indic Languages
  – CJK WordNet: Between CJK Languages
  – …
• Most of them take English WordNet as the Base
  – Maintain a structural similarity with English WordNet and specialize for their specific languages
  – Provide Inter-lingual Index between “Equivalent” Synsets
December 2006 Slide 84COMAD 2006 Tutorial
Implementation: Derived Operator Approach
December 2006 Slide 85COMAD 2006 Tutorial
Semantic Matching Strategy
• Integrate WordNet Linguistic Ontological Resources into the DBMS
• Map [Multilingual] Words to Canonical Semantic Primitives
  – WordNet provides rich Ontological Hierarchies for nouns
  – Inter-linked WordNets between languages provide cross-lingual mappings
• Match on Semantic Primitives
  – Directly as a not-null intersection of Primitives
  – Or as not-null intersection on Transitive Closures for Matching on Specializations
December 2006 Slide 86COMAD 2006 Tutorial
MLSemJoin Example
• Database:
  – All books are tagged with Category
  – The following Hierarchy is given as a “Resource”
[Figure: three interlinked hierarchies in columns labelled English, German and Tamil. English: Knowledge, Philosophy, History, Personal History, Subject History, Biography, Autobiography, Historiography. German: Wissen, Philosophie, Geschichte, Persönliche Geschichte, Biographie, Autobiography. Tamil: the corresponding Tamil-script categories]
Query: Retrieve all History-Type Books in English, Tamil & German
December 2006 Slide 87COMAD 2006 Tutorial
MLSemJoin Algorithm
Steps:
Convert Query String to a Synset
Find Transitive Closure (TCQ) of Query Synset in Interlinked WordNet Hierarchies
If ( Synset of DataString ∈ TCQ ) then return TRUE else FALSE
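The steps above can be sketched in a few lines of Python. Everything concrete here is an assumption made for illustration: the synset identifiers (such as `en:history`), the toy Is-A edges, the single inter-language link, and the function names are hypothetical and do not come from WordNet or from the tutorial's implementation.

```python
from collections import deque

# Hypothetical word-to-synset mapping, keyed by (string, language).
WORD_TO_SYNSET = {
    ("History", "English"): "en:history",
    ("Autobiography", "English"): "en:autobiography",
    ("Philosophy", "English"): "en:philosophy",
    ("Geschichte", "German"): "de:geschichte",
    ("Biographie", "German"): "de:biographie",
}
# Assumed intra-language Is-A links, stored as parent -> children.
CHILDREN = {
    "en:history": ["en:personal_history", "en:subject_history"],
    "en:personal_history": ["en:biography", "en:autobiography"],
    "en:subject_history": ["en:historiography"],
    "de:geschichte": ["de:persoenliche_geschichte"],
    "de:persoenliche_geschichte": ["de:biographie"],
}
# Assumed inter-language links between equivalent synsets.
INTERLINKS = {"en:history": ["de:geschichte"]}
LANG_CODE = {"English": "en", "German": "de"}


def transitive_closure(query_synset, lang_codes):
    """Breadth-first closure over Is-A and inter-language links (TCQ)."""
    seen = {query_synset}
    queue = deque([query_synset])
    while queue:
        node = queue.popleft()
        for nxt in CHILDREN.get(node, []) + INTERLINKS.get(node, []):
            if nxt.split(":")[0] in lang_codes and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def ml_sem_match(data, query, languages):
    """TRUE when the synset of the data string lies in the query's closure."""
    query_synset = WORD_TO_SYNSET.get(query)
    data_synset = WORD_TO_SYNSET.get(data)
    if query_synset is None or data_synset is None:
        return False
    codes = {LANG_CODE[lang] for lang in languages}
    return data_synset in transitive_closure(query_synset, codes)
```

Recomputing the closure on every call, as this sketch does, is the expensive part; in practice the closure for a query string would be computed once and reused for every data string.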
December 2006 Slide 88COMAD 2006 Tutorial
Implementation Details
• In MLSemJoin Algorithm
  – Computing Recursive Closure (Line#3) in Relational Systems is expensive
    • Takes about 98% of the time of Query
  – Can implement a UDF, but is very expensive
• We took a “Derived-Operator” approach, in an unmodified RDBMS using Recursive SQL feature
  – Transparently re-write the MLSemJoin query into one that uses WITH and IN clauses of SQL:1999
    • Caveat: System should support SQL:1999
  – The query may be optimized using standard Relational Query Optimizer
December 2006 Slide 89COMAD 2006 Tutorial
Derived Operator Implementation of MLSemJoin
• The query is transformed from MLSemJoin query to a standard SQL:1999 query as follows:
Select Author, Title, ... From Books
where Category MLSemJoin ALL “History” InLanguages { English, Tamil, German }
is rewritten as
Select Author, Title, ... From Books
where Category in { ‘History’,‘Biography’,‘Autobiography’, ‘êKˆFó‹’,‘²òêKî‹’,…,‘Geschichte’... }
The data for IN Clause is the Recursive Closure of “History” across target Languages
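A minimal sketch of such a rewritten query, using SQLite's recursive common table expressions through Python's `sqlite3` module as a stand-in for a SQL:1999 system. The `Taxonomy` table, its columns, the toy edges and the modelling of an inter-language link as an ordinary edge are all assumptions for illustration; the language filter of InLanguages is left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Taxonomy (Parent TEXT, Child TEXT);  -- hypothetical <Parent,Child> table
CREATE TABLE Books (Author TEXT, Title TEXT, Category TEXT);
""")
conn.executemany("INSERT INTO Taxonomy VALUES (?, ?)", [
    ("History", "Personal History"),
    ("Personal History", "Biography"),
    ("Personal History", "Autobiography"),
    ("History", "Subject History"),
    ("Subject History", "Historiography"),
    ("History", "Geschichte"),      # assumed inter-language link, stored as an edge
    ("Geschichte", "Biographie"),
])
conn.executemany("INSERT INTO Books VALUES (?, ?, ?)", [
    ("Durant", "History of Civilization", "History"),
    ("Nehru", "Letters to My Daughter", "Autobiography"),
    ("Descartes", "Les Méditations Metaphysiques", "Philosophie"),
])

# The WITH clause computes the recursive closure of the query category;
# the IN clause then filters Books against that closure.
REWRITTEN = """
WITH RECURSIVE Closure(Node) AS (
    SELECT ?
    UNION
    SELECT t.Child FROM Taxonomy t JOIN Closure c ON t.Parent = c.Node
)
SELECT Author, Title FROM Books
WHERE Category IN (SELECT Node FROM Closure)
ORDER BY Author
"""
rows = conn.execute(REWRITTEN, ("History",)).fetchall()
```

Because the result is plain SQL, the host system's optimizer is free to choose how to evaluate the closure and the IN filter.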
December 2006 Slide 90COMAD 2006 Tutorial
MLSemJoin Performance
December 2006 Slide 91COMAD 2006 Tutorial
Data
• WordNet (Ver 1.5) Stored in DBMS
  – ~110,000 Word-Forms and ~80,000 Word-Senses and ~140,000 Relationships between them
• Stored in DBMS in Plain Taxonomy Tables
  – Plain vanilla <Parent,Child> Relationships
  – Occupies 4 MB for ASCII and ~8 MB for Unicode
• Multilingual WordNets were simulated by copies of English WordNet in Unicode
  – Inter-Language-Links created between all pairs (p:0.95)
    • For Performance experiments, this approach gives a good approximation
December 2006 Slide 92COMAD 2006 Tutorial
WordNet Profiles
Structural Characteristics of WordNets
Characteristics                English   French   German   Spanish   Hindi
Word Form (Words)              114,648   32,809   20,453   50,526    22,522
Word Sense (Synsets)           80,000    22,745   15,132   23,378    7,868
Avg. (Synset / Word Form)      2.236     2.176    2.301    2.360     3.889
Avg. (Word Form / Synset)      1.985     1.442    1.352    2.162     2.286
Equivalence Links to English   1.000     0.999    1.080    0.908     NA
Different WordNets are at different stages of development
They are highly correlated
More so in Euro-WordNets than in Indo-WordNets
Confirms their design goal to ... “keep the basic taxonomies as much as possible”
December 2006 Slide 93COMAD 2006 Tutorial
Queries Run
• Three Commercial Database Systems were studied
  – Identified only as Systems A, B and C to protect identities
  – WordNet stored in ASCII (English) and in Unicode (others)
• MLSemJoin queries that compute closures of various sizes
  – Measured the Wall-Clock time for queries
  – In MLSemJoin queries, the TC Computation takes ~98% of time
• A Typical Query requires a Closure size of ~2,000
  – Average of Top-100 Query Nouns from popular Web-Search Engine, on English WordNet
  – Assuming user is interested in 3 languages
December 2006 Slide 94COMAD 2006 Tutorial
Performance: Baseline MLSemJoin
Highlights:
• Runtime proportional to the closure cardinality
• Runtime for a typical query (TC of size ~2,000) is in tens of seconds (no index) and close to a second (with index)!
December 2006 Slide 95COMAD 2006 Tutorial
Optimization #1: Pre-Computed Closures
• Pre-Compute the Closures for all nodes in WordNet W, and store in a WTC Table
  – Closure of x in W can be computed by a scan of WTC
  – Index WTC for performance
• Positives:
  – Expected to have much better performance
  – More importantly, linear scale-up wrt Closure Size
• Negatives:
  – Space overheads are substantial
  – For WordNet, the size goes from 4 MB/language to ~120 MB/Language
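The trade-off can be sketched with SQLite via Python's `sqlite3` module. The table names `W` and `WTC` follow the slide; the column names, the toy edges and the helper functions `build_wtc` and `closure` are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE W (Parent TEXT, Child TEXT)")
conn.executemany("INSERT INTO W VALUES (?, ?)", [
    ("History", "Personal History"),
    ("Personal History", "Biography"),
    ("Personal History", "Autobiography"),
    ("History", "Subject History"),
    ("Subject History", "Historiography"),
])


def build_wtc(conn):
    """Materialize every <ancestor, descendant> pair of W once, up front."""
    pairs = conn.execute("""
        WITH RECURSIVE Reach(Ancestor, Descendant) AS (
            SELECT Parent, Child FROM W
            UNION
            SELECT r.Ancestor, w.Child
            FROM Reach r JOIN W w ON w.Parent = r.Descendant
        )
        SELECT Ancestor, Descendant FROM Reach
    """).fetchall()
    conn.execute("CREATE TABLE WTC (Ancestor TEXT, Descendant TEXT)")
    conn.executemany("INSERT INTO WTC VALUES (?, ?)", pairs)
    conn.execute("CREATE INDEX WTC_idx ON WTC (Ancestor)")


def closure(conn, node):
    """At query time the closure is a single indexed lookup, with no recursion."""
    rows = conn.execute("SELECT Descendant FROM WTC WHERE Ancestor = ?", (node,))
    return {node} | {row[0] for row in rows}


build_wtc(conn)
w_rows = conn.execute("SELECT COUNT(*) FROM W").fetchone()[0]
wtc_rows = conn.execute("SELECT COUNT(*) FROM WTC").fetchone()[0]
```

Even on this five-edge toy taxonomy the closure table is larger than the base table, which is the space overhead noted under Negatives.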
December 2006 Slide 96COMAD 2006 Tutorial
Performance: Pre-Computed Closures
Highlights:
• Runtime near-Constant irrespective of the magnitude of Closure Cardinality
• Runtime for all queries is sub-second (~700 mSec) with Index
December 2006 Slide 97COMAD 2006 Tutorial
Optimization #2: Reversed Traversals
• Traverse the Taxonomic Hierarchy in Reverse
  – Use the Same Taxonomic Table W
  – Instead of checking if (Data ∈ Descendants of Query), check if (Query ∈ Ancestors of Data).
• Positives:
  – Expected Closure Cardinalities are smaller
    • Though WordNet is a DAG, its Average In-degree << Average Out-degree
• Negatives:
  – Computation of Closure for every Data String
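A small sketch of the reversed check, walking upward from the data string instead of downward from the query. The child-to-parents map below is a toy hierarchy assumed for illustration, and the function names are hypothetical.

```python
# Assumed child -> parents map (a DAG in general, so parents is a list).
PARENTS = {
    "Biography": ["Personal History"],
    "Autobiography": ["Personal History"],
    "Personal History": ["History"],
    "Historiography": ["Subject History"],
    "Subject History": ["History"],
    "History": ["Knowledge"],
    "Philosophy": ["Knowledge"],
}


def ancestors(node):
    """All nodes reachable by following parent links upward from `node`."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in PARENTS.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def matches_reversed(data, query):
    """Check (Query ∈ Ancestors of Data), treating a node as matching itself."""
    return data == query or query in ancestors(data)
```

Each upward closure is small because in-degree is low, but one closure is needed per data string, which is why the slide lists this as the drawback.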
December 2006 Slide 98COMAD 2006 Tutorial
Performance: Reversed Traversals
Highlights:
• Clearly much better performance for a single TC computation
• The query is very expensive since Closure needs to be computed for every record
December 2006 Slide 99COMAD 2006 Tutorial
Optimization #3: Reorganized Schema
• Leveraging the Structural Characteristics of the WordNet
  – A large number of nodes have a few children, obeying Power Law
  – Inline up to 16 Children: covers ~90% of them
  – TC Computation is modified to look into both tables
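The slide does not spell out the reorganized layout, so the following is only one plausible reading, stated as an assumption: each node keeps up to 16 children inline with its own record, and any further children spill into a separate overflow structure. The names `reorganize`, `closure` and `INLINE_LIMIT` are hypothetical.

```python
INLINE_LIMIT = 16  # from the slide: inline up to 16 children


def reorganize(edges):
    """Split <parent, child> edges into an inline store and an overflow store."""
    inline, overflow = {}, {}
    for parent, child in edges:
        slots = inline.setdefault(parent, [])
        if len(slots) < INLINE_LIMIT:
            slots.append(child)          # fits in the node's own record
        else:
            overflow.setdefault(parent, []).append(child)  # spills over
    return inline, overflow


def closure(node, inline, overflow):
    """Transitive closure that consults both stores for every visited node."""
    seen = {node}
    stack = [node]
    while stack:
        current = stack.pop()
        for child in inline.get(current, []) + overflow.get(current, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


# Toy data: one wide node with 20 children, one of which has a child itself.
edges = [("root", f"c{i}") for i in range(20)] + [("c0", "grandchild")]
inline, overflow = reorganize(edges)
```

For the large majority of nodes, which have few children, a single record fetch returns all children at once, and the overflow store is touched only for the rare wide nodes.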
December 2006 Slide 100COMAD 2006 Tutorial
Performance: Re-Organized Schema
Highlights:
• Runtime is ~3 orders of magnitude better than Baseline Performance and ~1 order of magnitude better than Pre-computed Closures
• No Space Overhead!
• Runtime for typical query (TC of size ~2,000) is ~25 mSec
December 2006 Slide 101COMAD 2006 Tutorial
Scaling up wrt Languages
Highlights:
• Increased number of [simulated] languages up to 8
• The performance of the Optimized Versions remains efficient
December 2006 Slide 102COMAD 2006 Tutorial
Implementation Architecture
[Figure: implementation architecture. Inputs: Query String and Match Parameters; output: Result Set. Labelled components: Server Manager, Database, Query Proc. Engine, Semantic Equality Function, Unicode, Ontology, RecSQL]
December 2006 Slide 103COMAD 2006 Tutorial
MLSemJoin: Take Away
The MLSemJoin operator employing WordNet-based matching can complement the standard lexicographic operators, for multilingual semantic searches …
… Further, the performance may be tuned to a level acceptable for online user interaction
December 2006 Slide 104COMAD 2006 Tutorial
Organization
• Motivation
• Multilingual Standards and Systems
• Multilingual Performance
• New Multilingual Query Operators
• Multilingual Relational Query Algebra
• Multilingual Implementation
• Conclusions & Future Research
December 2006 Slide 105COMAD 2006 Tutorial
Multilingual Relational Algebra (Mural)
December 2006 Slide 106COMAD 2006 Tutorial
Why a Query Algebra?
• For expressing complex queries declaratively
• For evaluating alternative query execution plans
  – Critical for leveraging the Query Optimizer
  – For a core implementation of the multilingual operators
• What is Needed?
  – Functionality defined as Operators
  – Composition Rules, Cost Models and Selectivity Estimates
December 2006 Slide 107COMAD 2006 Tutorial
Multilingual Datatype and Operators
• Uniform (Unicode Format) Datatype
  – A Representation that is tagged with language
  – E.g.: <“Sample String”, English>, <“àõñ£ù êóñ¢”, Tamil> or <“Corde Témoin”, French>
• Operators on Uniform datatype
  – Composing: <Text, ID> → Uniform
  – Decomposing: <Uniform> → <Text, ID>
  – Uniform Equality: Ξ: <Uniform, Uniform> → <Boolean>
  – MLNameJoin: Ψ: <Uniform, Uniform> → <Uniform, Uniform, Integer>
    • Edit-Distance between the phonemic-equivalents of input Uniform strings
  – MLSemJoin: Φ: <Uniform, Uniform> → <Uniform, Uniform, Boolean>
    • Boolean indicates if LHS is a sub-class of RHS
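The datatype and the first four operators can be sketched as follows. This is an illustration under stated assumptions, not the tutorial's implementation: the class and function names are hypothetical, and `to_phonemes` is a placeholder (simple lowercasing) standing in for a real language-specific text-to-phoneme converter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Uniform:
    """A text value tagged with its language."""
    text: str
    lang: str


def compose(text, lang):
    """Composing: <Text, ID> -> Uniform."""
    return Uniform(text, lang)


def decompose(value):
    """Decomposing: <Uniform> -> <Text, ID>."""
    return (value.text, value.lang)


def uniform_equal(a, b):
    """Uniform Equality: both the text and the language tag must agree."""
    return a == b


def to_phonemes(value):
    # Placeholder assumption: a real system would apply a text-to-phoneme
    # converter chosen by value.lang; lowercasing only stands in for it here.
    return value.text.lower()


def edit_distance(a, b):
    """Classic Levenshtein distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b)  # substitution or match
            ))
        previous = current
    return previous[-1]


def ml_name_join(a, b):
    """MLNameJoin: returns both inputs plus their phonemic edit distance."""
    return (a, b, edit_distance(to_phonemes(a), to_phonemes(b)))
```

A caller would typically compare the returned integer against a match threshold to decide whether the two names are close enough.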
December 2006 Slide 108COMAD 2006 Tutorial
Composition Rules
December 2006 Slide 109COMAD 2006 Tutorial
MLNameJoin Operator
• Simplified version of Earlier Definition:
Ψ is Commutative and Associative with all relational operators
Cost of Ψ Scan of a Table:
  O(R_L l_L k / √Σ) and P_L Disk I/O (Without Index)
  O(R_L l_L k² / √Σ) and A_L Disk I/O (With Index)
Cost of Ψ Join of a pair of Tables:
  O(R_L R_R l_L k / √Σ) and (P_L + P_R) Disk I/O (Without Index)
  O(R_L R_R l_L k² / √Σ) and (A_L + A_R) Disk I/O (With Index)
Selectivity estimates based on End-Serial histograms and relaxation for approximate matching
December 2006 Slide 110COMAD 2006 Tutorial
MLSemJoin Operator
• Simplified version of Earlier Definition:
Φ is not a commutative operator; but, associates with others
Cost of Φ Scan of a Table:
  O(R_L + R_H (h+1)) and P_H (h+1) Disk I/O (Without Index)
  O(R_L l_L k² / √Σ) and A_L Disk I/O (With Index)
Cost of Φ Join of a pair of Tables:
  O(R_L + R_R + U_R R_H (h+1)) and (P_L + P_R) Disk I/O (Without Index)
  O(R_L + R_R + U_R log E_H (h+1)) and 3(P_L + P_R) E_H Disk I/O (With Index)
Selectivity estimates based on structural characteristics of the hierarchy
December 2006 Slide 111COMAD 2006 Tutorial
Relational Completeness of Mural
• Lemma: There exists a mapping scheme ΩSch between a DB in Mural Schema and Standard Relational Schema
  – Sketch of the Proof:
    • Using Composing and Decomposing Operators of Uniform, ΩSch can be defined.
• Theorem: There exists a mapping scheme Ω that maps a relational algebra database D to a Mural database Ω(D) such that, for every query Q on D, there is a corresponding expression Q’ such that Q’(Ω(D)) = Ω(Q(D))
  – Sketch of the Proof:
    • ΩSch is known from Lemma.
    • Since only a mapping of queries from Normal Schema to Mural Schema needs to be derived, we can map queries in Normal Schema to the appropriate component of Uniform, in Mural Schema.
December 2006 Slide 112COMAD 2006 Tutorial
Organization
• Motivation
• Standards and Systems
• Multilingual Performance
• New Multilingual Query Operators
• Multilingual Relational Query Algebra
• Multilingual Implementation
• Conclusions & Future Research
December 2006 Slide 113COMAD 2006 Tutorial
Multilingual Architecture & Implementation Experience
December 2006 Slide 114COMAD 2006 Tutorial
Design Goals
• Relational Systems Oriented
• Attribute Data Oriented
  – Primarily for OLTP Environments
• DB Transparent to Language
  – Linguistic Resources are only “plugged-in”
• Modular, Dynamic & Configurable …
December 2006 Slide 115COMAD 2006 Tutorial
Implementation Architecture
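[Figure: implementation architecture. Inputs: Query String and Parameters; output: Result Set. Labelled components: Server Manager, Query Proc. Engine, Database, and a MURAL layer containing Unicode, Cuniform, MLNameJoin (Approximate Matching, TTPn, Cost, Cluster) and MLSemJoin (Transitive Closure, Cost, Ontology)]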
December 2006 Slide 116COMAD 2006 Tutorial
Outside-the-Server Implementation
• On Commercial Systems
  – Using UDF (in PL/SQL) for MLNameJoin
  – Using Recursive SQL and IN Clause for MLSemJoin
    • Can be packaged as a PL/SQL procedure
• Advantages:
  – Implementation with existing features, though slow
  – Optimization using techniques outlined here
• Disadvantages:
  – Slow Performance
  – No leveraging on Optimizer for Better Plan Selection
December 2006 Slide 117COMAD 2006 Tutorial
Native Implementation†
• Implemented Natively…
  – Uniform Datatype as Derived Datatype
  – MLNameJoin and MLSemJoin as First-Class Operators
    • Added TTP Converters in specific languages
    • Added WordNet (English Only) for Semantic Matching
  – Added Metric M-Tree Index Structure using GiST API
  – Added All Components of Mural Algebra
    • Cost Models, Composition Rules and Selectivity Estimations
  – Optimizer made “aware” of Mural components
† Implemented on PostgreSQL Open-Source Database System
December 2006 Slide 118COMAD 2006 Tutorial
Performance of Native Implementation [MLNameJoin]
• Outside-the-Server UDF Performance
• Native Implementation with MTree
Matching Methodology                       Scan-Time (Outside)   Join-Time (Outside)   Scan-Time (Native)   Join-Time (Native)
MLNameJoin                                 3618                  453                   5.20                 1.96
MLNameJoin (w/ MetricDist or M Tree Idx)   498                   169                   4.24                 1.92
December 2006 Slide 119COMAD 2006 Tutorial
Performance of Native Implementation [MLSemJoin]
• Outside-the-Server Performance using SQL:1999
• Native pinning WordNet in-memory
December 2006 Slide 120COMAD 2006 Tutorial
Optimizer Evaluation Possible…
Optimizer Predictions of Query Executions are accurate
Query: Find the books whose Author name sounds like Publisher’s name
Optimizer Evaluation and Runtimes for two alternatives:
Plan 1: π AuthorID,BookID,PubID (σ(Threshold≤0.25) (Ψ Aname,Pname (A X B)))
Plan 2: π AuthorID,BookID,PubID (B X (σ(Threshold≤0.25) (Ψ Aname,Pname (P,A))))
Optimizer Cost: 2,439,370   Runtime: 82.15s
Optimizer Cost: 7,513,852   Runtime: 2338s
December 2006 Slide 121COMAD 2006 Tutorial
Tutorial - Take Away
This tutorial explored research to make Information Systems Natural Language Neutral in functionality and performance
Cuniform storage format nearly nullifies performance differential
The MLNameJoin and MLSemJoin operators enhance the standard lexicographic operators for cross-lingual querying
The Mural Algebra is critical for a Native implementation of the Functionality in Relational Systems
December 2006 Slide 122COMAD 2006 Tutorial
More Information [http://dsl.serc.iisc.ernet.in/~publications]
• PhD Thesis of A. Kumaran [Microsoft Research India]
• On Database Support for Multilingual Environments. RIDE/MLIM Workshop, part of ICDE ’03, March 2003.
• On the Costs of Multilingualism in Database Systems. VLDB ’03, September 2003.
• Supporting Multilexical Queries in Database Systems. ICDE ’04, March 2004.
• Supporting Multiscript Query Processing in Database Systems. EDBT ’04, March 2004.
• LexEQUAL: Multilexical Operator in SQL. SIGMOD ’04, June 2004.
• MIRA – Multilingual Information-processing on Relational Architecture. Springer LNCS 3268, November 2004.
• On Semantic Matching of Multilingual Attributes in Relational Systems. CIKM ’04, November 2004.
• MLSemJoin: Multilingual Semantic Matching in Relational Systems. DASFAA ’05, April 2005.
• On Pushing Multilingual Query Operators into Database Engines. ICDE ’06, April 2006.
December 2006 Slide 123COMAD 2006 Tutorial
Future Research Avenues
• HomoGlyphic Operator
• Extensions to MLNameJoin Operator
  – Better Index Structures in the Phonetic Domains
  – Automatic Tuning of Optimal Match Parameters Based on a Training Set provided by the User
• Extensions to MLSemJoin Operator
  – Domain-Specific Ontological Matching
• Multilingual Performance Suites based on a Standard Application
  – Multilingual Benchmarks
December 2006 Slide 124COMAD 2006 Tutorial
Thank you!
http://dsl.serc.iisc.ernet.in/~projects/MIRA
Database Systems Laboratory
Indian Institute of Science