WP3: FE Architecture Progress Report
CROSSMARC Seventh Meeting, Edinburgh, 6-7 March 2003
University of Rome “Tor Vergata”
WHISK: a brief summary
A wrapper induction algorithm that handles texts ranging from highly structured to free text
Learns rules in the form of regular-expression patterns; rules can extract either single-slot or multi-slot
Every rule is a sequence of the following elements:
- a text token
- the "*" symbol as a wildcard
- a semantic class
- one of the above enclosed in parentheses (extraction delimiters)
WHISK: a brief summary (2)
Input: a set of hand-tagged instances
- every instance is in turn considered as a "seed instance", while the rest of the input is taken as a training set
WHISK induces rules top-down:
- first find the most general rule covering the seed
- then extend the rule by adding terms one at a time
- test the results on the training set
The metric used to select a new term is the Laplacian expected error of the rule:
Laplacian = (e + 1) / (n + 1)
where n is the number of extractions made on the training set and e is the number of errors among those extractions.
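The metric above can be sketched in a few lines of Python (a minimal illustration; the function name is ours, not part of the CROSSMARC implementation). Note how the +1 terms favour a rule with many extractions and one error over a rule with few extractions and no errors:

```python
def laplacian(extractions, errors):
    """Laplacian expected error: (e + 1) / (n + 1), where n is the number of
    extractions made on the training set and e is the number of errors."""
    return (errors + 1) / (extractions + 1)

# A rule with 10 extractions and 1 error scores better (lower) than
# a rule with only 3 extractions and no errors:
assert laplacian(10, 1) < laplacian(3, 0)  # 2/11 < 1/4
```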
WHISK implementation
[Architecture diagram: WHISK_Corpus_Feeder turns the training and testing sets (XHTML pages + FE surrogates) into WHISK train/test sets (tokenized XML text + product descriptions); WHISK_Main (with WHISK_Trainer for training and testing) produces a RuleSet and the WHISK output (product descriptions); WHISK_Evaluator compares the output against the test set to produce the evaluation results.]
WHISK implementation
WHISK_Corpus_Feeder module: prepares the training set for the learning and evaluation modules
- merges the web pages and the FE surrogate files into XML files containing NE + PDemarcator information in the form of additional XML tags
- tokenizes these XML files, then produces Prolog lists of the tokens
- extracts from the FE surrogate files the target structured product descriptions, used to compute the Laplacian during the training process and to evaluate WHISK's output during the evaluation process
WHISK_Main module: core implementation of the WHISK algorithm; performs both the training and testing processes
WHISK_Evaluator module: evaluates the product extractions on the testing set; once the extractions are obtained from the WHISK_Main methods, it reports statistics for the precision and recall metrics
WHISK improvements
Heavy Semanticisation of WHISK rule construction: Base_1 construction
Original Base_1 construction: * (Semantic Class | Token)
Modified Base_1 construction: * (series of [Tokens | Semantic Classes])
Example:
WHISK STANDARD: * (128 Mb)   ['128 Mb' is not a semantic class]
RTV WHISK: * (Number, Capacity_mb)   ['128' and 'Mb' are both specific semantic classes]
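The modified Base_1 construction can be illustrated with a small Python sketch: "*" skips tokens, then a run of tokens is extracted only if each token matches its listed semantic class. The lexicon below is a toy assumption, not CROSSMARC's actual class definitions:

```python
import re

# Hypothetical semantic-class lexicon (illustrative only)
SEMANTIC_CLASSES = {
    "Number": re.compile(r"\d+"),
    "Capacity_mb": re.compile(r"Mb", re.IGNORECASE),
}

def match_base1(tokens, classes):
    """Modified Base_1: '*' skips leading tokens, then extract a run of
    tokens matching the listed semantic classes, one class per token."""
    for start in range(len(tokens) - len(classes) + 1):
        window = tokens[start:start + len(classes)]
        if all(SEMANTIC_CLASSES[c].fullmatch(t) for c, t in zip(classes, window)):
            return " ".join(window)
    return None

tokens = "laptop with 128 Mb RAM".split()
print(match_base1(tokens, ["Number", "Capacity_mb"]))  # 128 Mb
```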
WHISK improvements
Heavy Semanticisation of WHISK rule construction: Base_2 construction
Original Base_2 construction: left_token_delimiter (*) right_token_delimiter
Modified Base_2 construction: Left_semantic_class_delimiter (*) Right_semclass_delimiter
Example:
SENTENCE: <term type='TERM' product_no='product_1'> TFT </term>
WHISK STANDARD Base_2: > ( * ) <
RTV WHISK Base_2: Term_Tag ( * ) Close_Term_Tag
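The difference between the two Base_2 forms can be sketched as regular expressions (an assumption on our part; here the semantic classes Term_Tag and Close_Term_Tag are modelled as patterns over whole tags, which is more selective than the single-token delimiters ">" and "<"):

```python
import re

# Standard Base_2: literal token delimiters around the extraction slot
standard_base2 = re.compile(r">\s*(.*?)\s*<")

# RTV Base_2: semantic-class delimiters modelled as whole-tag patterns
rtv_base2 = re.compile(r"<term[^>]*>\s*(.*?)\s*</term>")

sentence = "<term type='TERM' product_no='product_1'> TFT </term>"
print(standard_base2.search(sentence).group(1))  # TFT
print(rtv_base2.search(sentence).group(1))       # TFT
```

Both match here, but the semantic delimiters only fire around term tags, while "> ( * ) <" fires between any pair of tags.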
WHISK improvements
Two different windows while adding terms
Adding terms one at a time and then testing the rules against the training set every time a term is added is a very cumbersome task
We used a window of a specific number of tokens near the element to be extracted, from which terms are taken
During recognition of semantic classes, the effective dimension of this window (measured in semantic elements rather than simple tokens) may vary dramatically
Dilemma: small window size versus large window size
- many elements of semantic classes contain many tokens (example: HTML tags): <term type="TERM" product_no="product_1,product_2"> = 21 tokens
- small window size: the window may get too small if it contains large semantic classes
- large window size: may require too much processing time
WHISK improvements
Two different windows while adding terms
Solution: use two windows, a Token Window and a Semantic Window
- at first a Token Window of token_window_size tokens is created near the element to be extracted
- the elements (tokens) in the Token Window are converted to semantic elements
- semantic_window_size of these elements are considered when adding terms to the WHISK expression
- this way a more stable window is created for the purpose of rule improvement
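The two-window scheme can be sketched as follows (a simplification under our own assumptions: the classifier is a toy stand-in, and we simply truncate the converted window to semantic_window_size elements):

```python
def semantic_window(tokens, pos, token_window_size, semantic_window_size, classify):
    """Two-window term selection (sketch): take token_window_size raw tokens
    on each side of the slot at `pos`, convert them to semantic elements,
    then keep only semantic_window_size of them as candidate terms."""
    lo = max(0, pos - token_window_size)
    hi = min(len(tokens), pos + token_window_size + 1)
    raw = tokens[lo:pos] + tokens[pos + 1:hi]   # exclude the slot itself
    semantic = [classify(t) for t in raw]       # e.g. '128' -> 'Number'
    return semantic[:semantic_window_size]

# Toy classifier (an assumption, not the CROSSMARC one)
classify = lambda t: "Number" if t.isdigit() else t

tokens = ["<b>", "128", "Mb", "</b>", "RAM"]
print(semantic_window(tokens, 2, 2, 3, classify))  # ['<b>', 'Number', '</b>']
```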
WHISK improvements
WHISK_Corpus_Feeder: multiple product references
If a single named entity refers to multiple product descriptions, we used the following PDemarcator syntax to represent this phenomenon:
<NETAG type="…" product_no="product_1, product_2…">
WHISK handles this form of the attribute and tests against every single product referred to by the tag
We propose agreement on this syntax and its use in new PDemarcator releases
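A consumer of this syntax would split the comma-separated product_no attribute into individual product references. A minimal sketch with Python's standard XML parser (the type value "capacity" and the concrete product list are illustrative, not from the slides):

```python
import xml.etree.ElementTree as ET

# Proposed PDemarcator syntax: one named entity shared by several products
fragment = '<NETAG type="capacity" product_no="product_1, product_2"/>'

tag = ET.fromstring(fragment)
# Split the attribute so the entity is tested against every referenced product
products = [p.strip() for p in tag.get("product_no").split(",")]
print(products)  # ['product_1', 'product_2']
```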
WHISK evaluation
Some Experiments
Experiments have been conducted using a very simple set of base rules (one generic rule for every fact type)
There was no mutual exclusion in the application of the different rules: every rule extracts all the elements it matches, regardless of whether they have already been extracted by other rules
WHISK evaluation
Application of base rules, results:
RuleType               Precision   Recall
manufacturerName       1           0.875433
processorName          1           1
preinstalledOS         0.779923    0.966507
processorSpeed         0.532957    1
price                  1           1
dvdSpeed               0.0772128   0.854167
hdCapacity             0.499006    0.992095
ram                    0.500994    0.992126
screenSize             0.842809    0.936803
warranty               0.824324    0.709302
batteryLife            0.175676    0.866667
preinstalledSoftware   0.220077    1
width                  0.0501672   0.681818
modelName              1           0.972561
modemSpeed             0.301318    0.958084
batteryType            0.270358    0.873684
screenType             0.67101     0.944954
cdromSpeed             0.0885122   1
screenResolution       1           0.666667
weight                 1           0.86
height                 0.0568562   0.586207
depth                  0.0501672   0.681818
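The precision and recall figures reported above can be computed from the extracted and gold product descriptions along these lines (a sketch over sets of (slot, value) pairs; the representation is our assumption, not necessarily WHISK_Evaluator's internal one):

```python
def precision_recall(extracted, gold):
    """Precision/recall over sets of (slot, value) extractions (sketch)."""
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)                          # correct extractions
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

gold = {("ram", "128 Mb"), ("price", "999")}
extracted = {("ram", "128 Mb"), ("ram", "15 inch")}     # one right, one wrong
print(precision_recall(extracted, gold))  # (0.5, 0.5)
```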
WHISK: conclusions
We have recently concluded the WHISK modifications (at least we hope so!) and started experiments
ISSUE: multi-slot extraction requires too many training instances, as different product descriptions present different sequences of slots (in number and/or order)
- at least one new rule is extracted for every different combination of slots, and considering the large amount of variation characterizing every single slot, providing multi-slot extraction would be unfeasible
We decided to work only with single-slot extraction, leaving to rule priorities and mutual exclusion between rule extractions the task of obtaining coherent product descriptions
FUTURE WORK:
- heavy semanticization of the elements of the domain
- pre-semanticization of the entire corpora before training on them