Understanding System_T
By Mao Xianling
2009.02.28
Outline
• Introduction to System_T
• Primary tests
• Problem
Installing the Development Environment
• Download System Text from IBM's alphaWorks site; just search for "System Text" at http://www.alphaworks.ibm.com/
• Uncompress the .zip/.tar file onto your computer's hard drive
• Run the startup script: sh SystemText-[version]/bin/startserver.sh
• Start the Development Environment by pointing your web browser at the address http://localhost:8083/aql
Development Environment
One Example for AQL Code
create view PhoneNum as
extract
regex /[0-9]{3}-[0-9]{4}/
on D.text as number
from Document D;
output view PhoneNum;
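To make the effect of this view concrete, here is a rough Python emulation of what the extract statement computes. It is only a sketch: the function name `extract_phone_num` and the tuple layout are assumptions made for illustration, not part of System Text, and the real runtime returns Span objects rather than plain tuples.

```python
import re
from typing import List, Tuple

# Same pattern as the AQL view: three digits, a hyphen, four digits.
PHONE_PATTERN = re.compile(r"[0-9]{3}-[0-9]{4}")


def extract_phone_num(text: str) -> List[Tuple[int, int, str]]:
    """Return one (begin, end, matched_text) row per regex match.

    Hypothetical helper: each row plays the role of one tuple of the
    PhoneNum view, with the 'number' field shown as offsets plus text.
    """
    return [(m.start(), m.end(), m.group()) for m in PHONE_PATTERN.finditer(text)]


if __name__ == "__main__":
    doc = "Call 555-1234 or 555-9876 today."
    for row in extract_phone_num(doc):
        print(row)
```

Each returned row corresponds to one output tuple of PhoneNum; the offsets point back into the document text, which is what lets later views combine or compare matches by position.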
Introduction to AQL
• AQL: a language for building annotators that extract structured information from unstructured or semi-structured text.
• AQL is the primary method of creating new annotators in System Text for Information Extraction.
Introduction to AQL
The syntax of AQL is similar to that of SQL, but with several important differences:
• AQL is case sensitive.
• AQL allows regular expressions to be expressed in Perl syntax, e.g. /regex/ instead of 'regex'.
• AQL currently does not support advanced SQL features like correlated subqueries and recursive queries.
• AQL has a new statement type, extract, which is not present in SQL.
Data Model
• AQL's data model is similar to the standard relational model used by SQL databases like DB2. All data in AQL is stored in tuples, data records of one or more columns, or fields. A collection of tuples forms a relation. All tuples in a relation must have the same schema — the names and types of their fields.
Data Model
The fields of an AQL tuple must belong to one of the language's built-in scalar types:
• Integer: A 32-bit signed integer.
• Text: A Unicode string, with additional metadata to indicate which tuple the string belongs to.
• Span: A contiguous region of characters in a Text object.
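The relationship between Text, Span, and tuples can be sketched in Python as follows. This is an illustrative model only: the class names mirror the AQL type names, but the field names (`value`, `begin`, `end`), the half-open offset convention, and the `PhoneNumTuple` schema are assumptions, and the extra Text metadata mentioned above is omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Text:
    """A Unicode string (metadata about the owning tuple is left out here)."""
    value: str


@dataclass(frozen=True)
class Span:
    """A contiguous region of characters in a Text object."""
    text: Text
    begin: int  # offset of the first character (assumed inclusive)
    end: int    # offset just past the last character (assumed exclusive)

    def covered(self) -> str:
        # The characters of the underlying Text that this span covers.
        return self.text.value[self.begin:self.end]


@dataclass(frozen=True)
class PhoneNumTuple:
    """Hypothetical schema for one tuple of a view with a single Span field."""
    number: Span


if __name__ == "__main__":
    doc = Text("Call 555-1234")
    # A relation is a collection of tuples that all share one schema.
    relation: List[PhoneNumTuple] = [PhoneNumTuple(Span(doc, 5, 13))]
    print(relation[0].number.covered())
```

The point of the model is that a Span stores positions rather than a copy of the string, so two spans can be compared by location in the same document.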
Execution Model
AQL Statement
• The create view Statement
• The extract Statement
  – Extraction Specifications
    • Regular Expressions
    • Dictionaries
    • Splits
• The select Statement
• The create table Statement
• Built-In Functions
  – Predicate Functions
  – Scalar Functions
  – Table Functions
create view PersonFirstOrLastName as
extract
  dictionary 'names.dict' on D.text as name
from Document D
having MatchesRegex(/[A-Z].+/, name);

create view PhoneNumber as
extract
  regexes /(\d{3})-(\d{3}-\d{4})/ and /\(\d{3}\)\s*(\d{3}-\d{4})/
  on D.text as num
from Document D;

create view ExtensionNumbers as
extract
  regex /[Ee]xt\s*[\.\-\:]?\s*(\d{3,5})/
  on D.text
  return group 1 as num and group 0 as completenum
from Document D;

create view PhoneNumberWithExtension as
select CombineSpans(P.num, E.completenum) as num
from PhoneNumber P, ExtensionNumbers E
where FollowsTok(P.num, E.completenum, 0, 1);

create view PhoneNumberAll as
(select P.num as num from PhoneNumber P)
union all
(select E.completenum as num from ExtensionNumbers E)
union all
(select P.num as num from PhoneNumberWithExtension P);

create view PhoneNumberAllConsolidated as
select P.num as num
from PhoneNumberAll P
consolidate on P.num
using 'ContainedWithin';

create view PersonsPhone as
select person.name as person, phone.num as phone,
  CombineSpans(person.name, phone.num) as personphone
from PersonFirstOrLastName person, PhoneNumberAllConsolidated phone
where Follows(person.name, phone.num, 0, 30);

output view PersonsPhone;
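The span operations used in these views can be approximated in Python to show how the pipeline fits together. This is a sketch under stated assumptions, not the System Text implementation: spans are modelled as plain `(begin, end)` character offsets, the function names are hypothetical stand-ins for Follows, CombineSpans, and the 'ContainedWithin' consolidation policy, and FollowsTok (which counts tokens rather than characters) is not modelled.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (begin, end) character offsets, end exclusive


def follows(first: Span, second: Span, min_chars: int, max_chars: int) -> bool:
    """True if `second` starts between min_chars and max_chars after `first` ends."""
    gap = second[0] - first[1]
    return min_chars <= gap <= max_chars


def combine_spans(first: Span, second: Span) -> Span:
    """Smallest span that covers both input spans."""
    return (min(first[0], second[0]), max(first[1], second[1]))


def consolidate_contained_within(spans: List[Span]) -> List[Span]:
    """Drop every span that lies entirely inside another, different span."""
    unique = sorted(set(spans))  # exact duplicates collapse to one span
    kept = []
    for s in unique:
        contained = any(o != s and o[0] <= s[0] and s[1] <= o[1] for o in unique)
        if not contained:
            kept.append(s)
    return kept


if __name__ == "__main__":
    # Offsets into the sample text "Mary 555-1234 ext 123".
    name, phone, ext = (0, 4), (5, 13), (14, 21)
    with_ext = combine_spans(phone, ext)
    all_numbers = consolidate_contained_within([phone, ext, with_ext])
    for num in all_numbers:
        if follows(name, num, 0, 30):
            print(name, num, combine_spans(name, num))
```

In this sample the bare phone number and the bare extension are both contained in the combined span, so consolidation keeps only the combined one, which is then paired with the preceding name.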
Primary Tests
• DataSet
From the TianWang crawler; Chinese; Firstname.dict/Lastname.dict (for Chinese)
• Method
Using AQL to build annotators
Annotator for extracting phone numbers
Annotator for extracting names
Time && Space
Problem
• English vs. Chinese [extract regex /[0-9]{3}/ on 1 token in D.text]
• Time && Space && Network?
• MultiSet?
• The expressive power of regex?
• No source code && MapReduce?
• Zip?
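The slides do not say what differed between English and Chinese for the token-bounded regex. One plausible reading, and it is only an assumption here, is that a match restricted to token boundaries depends on how the tokenizer splits the text. The Python sketch below uses a deliberately naive whitespace tokenizer (not System Text's tokenizer) to show how the same three-digit pattern can succeed on space-delimited English and fail on Chinese text written without spaces.

```python
import re
from typing import List

THREE_DIGITS = re.compile(r"[0-9]{3}")


def whitespace_tokens(text: str) -> List[str]:
    """Naive tokenizer, assumed for illustration: split on whitespace only."""
    return text.split()


def digits_on_one_token(text: str) -> List[str]:
    """Keep only tokens that consist entirely of exactly three digits."""
    return [tok for tok in whitespace_tokens(text) if THREE_DIGITS.fullmatch(tok)]


if __name__ == "__main__":
    print(digits_on_one_token("room 215 now"))  # digits form their own token
    print(digits_on_one_token("房间215号"))      # digits are buried inside one long token
```

Under this assumption, the Chinese input needs a tokenizer that segments words and numbers before a token-bounded pattern can match.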