WP3: FE Architecture Progress Report
CROSSMARC Seventh Meeting, Edinburgh, 6-7 March 2003
University of Rome “Tor Vergata”
WHISK: a brief summary
A wrapper induction algorithm that handles texts ranging from highly structured to free text
Learns rules in the form of regular-expression patterns; rules can extract either single-slot or multi-slot
Every rule is a sequence of the following elements:
- a text token
- the "*" symbol as a wildcard
- a semantic class
- one of the above enclosed in parentheses (extraction delimiters)
WHISK: a brief summary (2)
Input: a set of hand-tagged instances
- every instance is in turn considered as a "seed instance", while the rest of the input is taken as a training set
WHISK induces rules top-down:
- first find the most general rule covering the seed
- then extend the rule by adding terms one at a time
- test the results on the training set
The metric used to select a new term is the Laplacian expected error of the rule:
Laplacian = (e + 1) / (n + 1)
where n is the number of extractions made on the training set and e is the number of errors among those extractions.
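The metric above can be sketched in a few lines of Python (a minimal illustration; the function name is ours, not part of the CROSSMARC implementation). Note how the +1 terms favour a rule with many extractions and one error over a rule with few extractions and no errors:

```python
def laplacian(extractions, errors):
    """Laplacian expected error: (e + 1) / (n + 1), where n is the number of
    extractions made on the training set and e is the number of errors."""
    return (errors + 1) / (extractions + 1)

# A rule with 10 extractions and 1 error scores better (lower) than
# a rule with only 3 extractions and no errors:
assert laplacian(10, 1) < laplacian(3, 0)  # 2/11 < 1/4
```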
WHISK implementation
[Architecture diagram: WHISK_Corpus_Feeder turns the training and testing sets (XHTML pages + FE surrogates) into WHISK train/test sets (tokenized XML text + product descriptions); WHISK_Main (with WHISK_Trainer for training and testing) produces a RuleSet and the WHISK output (product descriptions); WHISK_Evaluator compares the output against the test set to produce the evaluation results.]
WHISK implementation
WHISK_Corpus_Feeder module: prepares the training set for the learning and evaluation modules
- merges the web pages and the FE surrogate files into XML files containing NE + PDemarcator information in the form of additional XML tags
- tokenizes these XML files, then produces Prolog lists of the tokens
- extracts from the FE surrogate files the target structured product descriptions, used to compute the Laplacian during the training process and to evaluate WHISK's output during the evaluation process
WHISK_Main module: core implementation of the WHISK algorithm; performs both the training and testing processes
WHISK_Evaluator module: evaluates the product extractions on the testing set; once the extractions are obtained from the WHISK_Main methods, it reports statistics for the precision and recall metrics
WHISK improvements
Heavy Semanticisation of WHISK rule construction: Base_1 construction
Original Base_1 construction: * (Semantic Class | Token)
Modified Base_1 construction: * (series of [Tokens | Semantic Classes])
Example:
WHISK STANDARD: * (128 Mb)   ['128 Mb' is not a semantic class]
RTV WHISK: * (Number, Capacity_mb)   ['128' and 'Mb' are both specific semantic classes]
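The modified Base_1 construction can be illustrated with a small Python sketch: "*" skips tokens, then a run of tokens is extracted only if each token matches its listed semantic class. The lexicon below is a toy assumption, not CROSSMARC's actual class definitions:

```python
import re

# Hypothetical semantic-class lexicon (illustrative only)
SEMANTIC_CLASSES = {
    "Number": re.compile(r"\d+"),
    "Capacity_mb": re.compile(r"Mb", re.IGNORECASE),
}

def match_base1(tokens, classes):
    """Modified Base_1: '*' skips leading tokens, then extract a run of
    tokens matching the listed semantic classes, one class per token."""
    for start in range(len(tokens) - len(classes) + 1):
        window = tokens[start:start + len(classes)]
        if all(SEMANTIC_CLASSES[c].fullmatch(t) for c, t in zip(classes, window)):
            return " ".join(window)
    return None

tokens = "laptop with 128 Mb RAM".split()
print(match_base1(tokens, ["Number", "Capacity_mb"]))  # 128 Mb
```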
WHISK improvements
Heavy Semanticisation of WHISK rule construction: Base_2 construction
Original Base_2 construction: left_token_delimiter (*) right_token_delimiter
Modified Base_2 construction: Left_semantic_class_delimiter (*) Right_semclass_delimiter
Example:
SENTENCE: <term type='TERM' product_no='product_1'> TFT </term>
WHISK STANDARD Base_2: > ( * ) <
RTV WHISK Base_2: Term_Tag ( * ) Close_Term_Tag
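The difference between the two Base_2 forms can be sketched as regular expressions (an assumption on our part; here the semantic classes Term_Tag and Close_Term_Tag are modelled as patterns over whole tags, which is more selective than the single-token delimiters ">" and "<"):

```python
import re

# Standard Base_2: literal token delimiters around the extraction slot
standard_base2 = re.compile(r">\s*(.*?)\s*<")

# RTV Base_2: semantic-class delimiters modelled as whole-tag patterns
rtv_base2 = re.compile(r"<term[^>]*>\s*(.*?)\s*</term>")

sentence = "<term type='TERM' product_no='product_1'> TFT </term>"
print(standard_base2.search(sentence).group(1))  # TFT
print(rtv_base2.search(sentence).group(1))       # TFT
```

Both match here, but the semantic delimiters only fire around term tags, while "> ( * ) <" fires between any pair of tags.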
WHISK improvements
Two different windows while adding terms
Adding terms one at a time and then testing the rules against the training set every time a term is added is a very cumbersome task
We used a window of a specific number of tokens near the element to be extracted, from which terms are taken
During recognition of semantic classes, the effective dimension of this window (measured in semantic elements rather than simple tokens) may vary dramatically
Dilemma: small window size versus large window size
- many elements of semantic classes contain many tokens (example: HTML tags): <term type="TERM" product_no="product_1,product_2"> = 21 tokens
- small window size: the window may get too small if it contains large semantic classes
- large window size: may require too much processing time
WHISK improvements
Two different windows while adding terms
Solution: use two windows, a Token Window and a Semantic Window
- at first a Token Window of token_window_size tokens is created near the element to be extracted
- the elements (tokens) in the Token Window are converted to semantic elements
- semantic_window_size of these elements are considered when adding terms to the WHISK expression
- this way a more stable window is created for the purpose of rule improvement
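The two-window scheme can be sketched as follows (a simplification under our own assumptions: the classifier is a toy stand-in, and we simply truncate the converted window to semantic_window_size elements):

```python
def semantic_window(tokens, pos, token_window_size, semantic_window_size, classify):
    """Two-window term selection (sketch): take token_window_size raw tokens
    on each side of the slot at `pos`, convert them to semantic elements,
    then keep only semantic_window_size of them as candidate terms."""
    lo = max(0, pos - token_window_size)
    hi = min(len(tokens), pos + token_window_size + 1)
    raw = tokens[lo:pos] + tokens[pos + 1:hi]   # exclude the slot itself
    semantic = [classify(t) for t in raw]       # e.g. '128' -> 'Number'
    return semantic[:semantic_window_size]

# Toy classifier (an assumption, not the CROSSMARC one)
classify = lambda t: "Number" if t.isdigit() else t

tokens = ["<b>", "128", "Mb", "</b>", "RAM"]
print(semantic_window(tokens, 2, 2, 3, classify))  # ['<b>', 'Number', '</b>']
```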
WHISK improvements
WHISK_Corpus_Feeder: multiple product references
If a single named entity refers to multiple product descriptions, we used the following PDemarcator syntax to represent this phenomenon:
<NETAG type="…" product_no="product_1, product_2…">
WHISK handles this form of the attribute and tests against every single product referred to by the tag
We propose agreement on this syntax and its use in new PDemarcator releases
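A consumer of this syntax would split the comma-separated product_no attribute into individual product references. A minimal sketch with Python's standard XML parser (the type value "capacity" and the concrete product list are illustrative, not from the slides):

```python
import xml.etree.ElementTree as ET

# Proposed PDemarcator syntax: one named entity shared by several products
fragment = '<NETAG type="capacity" product_no="product_1, product_2"/>'

tag = ET.fromstring(fragment)
# Split the attribute so the entity is tested against every referenced product
products = [p.strip() for p in tag.get("product_no").split(",")]
print(products)  # ['product_1', 'product_2']
```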
WHISK evaluation
Some Experiments
Experiments have been conducted using a very simple set of base rules (one generic rule for every fact type)
There was no mutual exclusion in the application of the different rules: every rule extracts all the elements it matches, regardless of whether they have already been extracted by other rules
WHISK evaluation
Application of base rules, results:
RuleType               Precision   Recall
manufacturerName       1           0.875433
processorName          1           1
preinstalledOS         0.779923    0.966507
processorSpeed         0.532957    1
price                  1           1
dvdSpeed               0.0772128   0.854167
hdCapacity             0.499006    0.992095
ram                    0.500994    0.992126
screenSize             0.842809    0.936803
warranty               0.824324    0.709302
batteryLife            0.175676    0.866667
preinstalledSoftware   0.220077    1
width                  0.0501672   0.681818
modelName              1           0.972561
modemSpeed             0.301318    0.958084
batteryType            0.270358    0.873684
screenType             0.67101     0.944954
cdromSpeed             0.0885122   1
screenResolution       1           0.666667
weight                 1           0.86
height                 0.0568562   0.586207
depth                  0.0501672   0.681818
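The precision and recall figures reported above can be computed from the extracted and gold product descriptions along these lines (a sketch over sets of (slot, value) pairs; the representation is our assumption, not necessarily WHISK_Evaluator's internal one):

```python
def precision_recall(extracted, gold):
    """Precision/recall over sets of (slot, value) extractions (sketch)."""
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)                          # correct extractions
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

gold = {("ram", "128 Mb"), ("price", "999")}
extracted = {("ram", "128 Mb"), ("ram", "15 inch")}     # one right, one wrong
print(precision_recall(extracted, gold))  # (0.5, 0.5)
```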
WHISK: conclusions
We have recently concluded the WHISK modifications (at least we hope so!) and started experiments
ISSUE: multi-slot extraction requires too many training instances, as different product descriptions present different sequences of slots (in number and/or order)
- at least one new rule is extracted for every different combination of slots, and considering the large amount of variation characterizing every single slot, providing multi-slot extraction would be unfeasible
We decided to work only with single-slot extraction, leaving to rule priorities and mutual exclusion between rule extractions the task of obtaining coherent product descriptions
FUTURE WORK:
- heavy semanticization of the elements of the domain
- pre-semanticization of the entire corpora before training on them