Annotation Free Information Extraction

Chia-Hui Chang
Department of Computer Science & Information Engineering
National Central University

10/4/2002

IEPAD: Information Extraction based on Pattern Discovery

C.H. Chang, National Central University. WWW10

Semi-structured Information Extraction

Information Extraction (IE)
  Input: HTML pages
  Output: A set of records

Pattern Discovery based IE
  Motivation: the display of multiple records often forms a repeated pattern.
  The occurrences of the pattern are spaced regularly and adjacently.

Now the problem becomes: find regular and adjacent repeats in a string.

IEPAD Architecture

[Diagram: the Pattern Generator takes an Html Page and produces Patterns; Users choose an Extraction Rule in the Pattern Viewer; the Extractor applies it to Html Pages and returns Extraction Results.]

The Pattern Generator

Components: Translator, PAT tree constructor, Pattern validator, Rule Composer

[Diagram: HTML Page -> Token Translator -> A Token String -> PAT Tree Constructor -> PAT trees and Maximal Repeats -> Validator -> Advanced Patterns -> Rule Composer -> Extraction Rules]

1. Web Page Translation

Encoding of HTML source
  Rule 1: Each tag is encoded as a token.
  Rule 2: Any text between two tags is translated to a special token called TEXT (denoted by an underscore).

HTML Example:

<B>Congo</B><I>242</I><BR>

<B>Egypt</B><I>20</I><BR>

Encoded token string:

T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)

T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
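The two encoding rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `encode_tokens`, the regular expressions, and the sample markup are hypothetical, tag attributes are simply dropped, and whitespace-only text is ignored.

```python
import re

def encode_tokens(html: str) -> list[str]:
    """Translate an HTML fragment into a token list (Rule 1 and Rule 2)."""
    tokens = []
    # Split the source into tags (<...>) and the text runs between them.
    for piece in re.findall(r"<[^>]+>|[^<]+", html):
        if piece.startswith("<"):
            name = re.match(r"</?\s*([A-Za-z0-9]+)", piece)
            if name is None:
                continue  # comments, doctype, etc. are skipped in this sketch
            slash = "/" if piece.startswith("</") else ""
            # Rule 1: each tag becomes one token (attributes are dropped).
            tokens.append(f"T(<{slash}{name.group(1).upper()}>)")
        elif piece.strip():
            # Rule 2: any non-blank text between two tags becomes TEXT.
            tokens.append("T(_)")
    return tokens

# Hypothetical one-record input: a bold name followed by an italic number.
print("".join(encode_tokens("<B>Congo</B><I>242</I><BR>")))
```

A coarser encoding scheme could be obtained by skipping tags outside a chosen tag class before emitting the token.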

Various Encoding Schemes

Block-level tags
  Headings:         H1~H6
  Text containers:  P, PRE, BLOCKQUOTE, ADDRESS
  Lists:            UL, OL, LI, DL, DIR, MENU
  Others:           DIV, CENTER, FORM, HR, TABLE, BR

Text-level tags
  Logical markup:   EM, STRONG, DFN, CODE, SAMP, KBD, VAR, CITE
  Physical markup:  TT, I, B, U, STRIKE, BIG, SMALL, SUB, SUP, FONT
  Special markup:   A, BASEFONT, IMG, APPLET, PARAM, MAP, AREA

Figure 2. Tag classification

2. PAT Tree Construction

PAT tree: a binary suffix tree, i.e. a Patricia tree constructed over all possible suffix strings of a text

Example token codes:

T(<B>) 000   T(</B>) 001   T(<I>) 010   T(</I>) 011   T(<BR>) 100   T(_) 110

Binary encoding:

000110001010110011100000110001010110011100

Token string:

T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)

Indexing position:
suffix 1   000110001010110011100000110001010110011100$
suffix 2   110001010110011100000110001010110011100$
suffix 3   001010110011100000110001010110011100$
suffix 4   010110011100000110001010110011100$
suffix 5   110011100000110001010110011100$
suffix 6   011100000110001010110011100$
suffix 7   100000110001010110011100$
suffix 8   000110001010110011100$
suffix 9   110001010110011100$
suffix 10  001010110011100$
suffix 11  010110011100$
suffix 12  110011100$
suffix 13  011100$
suffix 14  100$

The Constructed PAT Tree

[Figure 3. The PAT tree for the Congo Code]

Definition of Maximal Repeats

Let α occur in S at positions p1, p2, p3, …, pk.
  α is left maximal if there exists at least one (i, j) pair such that S[pi-1] ≠ S[pj-1].
  α is right maximal if there exists at least one (i, j) pair such that S[pi+|α|] ≠ S[pj+|α|].
  α is a maximal repeat if it is both left maximal and right maximal.

Finding Maximal Repeats

Definition: Let us call character S[pi-1] the left character of suffix pi.

A node v is left diverse if at least two leaves in v's subtree have different left characters.

Lemma: The path label α of an internal node v in a PAT tree is a maximal repeat if and only if v is left diverse.
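To make the definition concrete, the following sketch finds maximal repeats by brute force rather than with a PAT tree. It is an assumption-laden illustration: `maximal_repeats` is a hypothetical name, the string boundary is treated as its own distinct character, and the cost is far higher than the tree-based method described on the slides.

```python
from collections import defaultdict

def maximal_repeats(s: str) -> dict[str, list[int]]:
    """Return {repeat: start positions} for every maximal repeat in s."""
    n = len(s)
    occurrences = defaultdict(list)
    # Enumerate every proper substring and record where it starts.
    for length in range(1, n):
        for start in range(n - length + 1):
            occurrences[s[start:start + length]].append(start)

    result = {}
    for sub, pos in occurrences.items():
        if len(pos) < 2:
            continue  # not a repeat
        # Left and right characters; None marks the string boundary.
        left = {s[p - 1] if p > 0 else None for p in pos}
        right = {s[p + len(sub)] if p + len(sub) < n else None for p in pos}
        # Left maximal and right maximal: at least two different neighbours.
        if len(left) > 1 and len(right) > 1:
            result[sub] = pos
    return result

print(maximal_repeats("xabcyabcz"))
```

In the sample string, "ab" repeats but is always followed by "c", so only "abc" is reported.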

3. Pattern Validator

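Suppose the occurrences of a maximal repeat α are ordered by position such that p1 < p2 < p3 < … < pk, where pi denotes the position of each suffix in the encoded token sequence.

Characteristics of a Pattern

Regularity: variance coefficient

\[ V(\alpha) = \frac{\mathrm{StdDev}(\{\, p_{i+1} - p_i \mid 1 \le i < k \,\})}{\mathrm{Mean}(\{\, p_{i+1} - p_i \mid 1 \le i < k \,\})} \]

Adjacency: density

\[ D(\alpha) = \frac{k \cdot |\alpha|}{p_k - p_1 + |\alpha|} \]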

Pattern Validator (Cont.)

Basic screening: for each maximal repeat α, compute V(α) and D(α)

a) check the pattern's variance: V(α) < 0.5

b) check the pattern's density: 0.25 < D(α) < 1.5

[Flowchart: a pattern that fails either check is discarded; a pattern that passes both is kept.]
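The basic screening can be sketched as below. Several points are assumptions of this sketch rather than statements from the slides: the function names are hypothetical, the standard deviation is the population form (`statistics.pstdev`), and density is read as k·|α| / (p_k − p_1 + |α|).

```python
from statistics import mean, pstdev

def variance_coefficient(positions: list[int]) -> float:
    """V: standard deviation of the gaps between occurrences over their mean."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return pstdev(gaps) / mean(gaps)

def density(positions: list[int], pattern_len: int) -> float:
    """D: share of the spanned region that is covered by the k occurrences."""
    k = len(positions)
    return k * pattern_len / (positions[-1] - positions[0] + pattern_len)

def passes_basic_screening(positions: list[int], pattern_len: int) -> bool:
    """Keep a pattern only if V < 0.5 and 0.25 < D < 1.5."""
    if len(positions) < 2:
        return False  # a single occurrence is not a repeat
    v = variance_coefficient(positions)
    d = density(positions, pattern_len)
    return v < 0.5 and 0.25 < d < 1.5

# Occurrences of a 3-token pattern at sorted token positions 0, 6, 11, 17.
print(passes_basic_screening([0, 6, 11, 17], 3))
```

Positions are assumed to be sorted and strictly increasing, so the mean gap is never zero.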

4. Rule Composer

Occurrence partition: flexible variance threshold control

Multiple string alignment: increases the density of a pattern

Occurrence Partition

Problem: some patterns are divided into several blocks
  Ex: Lycos, Excite with large regularity

Solution: clustering of the occurrences of such a pattern

[Flowchart: Clustering -> V(α) < 0.1? No: discard. Yes: check density.]
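The slides do not say how occurrences are clustered, so the following is only one possible sketch: it starts a new cluster whenever a gap exceeds a multiple of the median gap. The names `partition_occurrences` and `regular_clusters`, the `gap_factor` parameter, and the use of the population standard deviation are all assumptions.

```python
from statistics import mean, median, pstdev

def partition_occurrences(positions: list[int], gap_factor: float = 2.0) -> list[list[int]]:
    """Split sorted occurrence positions into blocks at unusually large gaps."""
    if len(positions) < 2:
        return [list(positions)]
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    limit = gap_factor * median(gaps)  # hypothetical cut-off for a block break
    clusters = [[positions[0]]]
    for pos, gap in zip(positions[1:], gaps):
        if gap > limit:
            clusters.append([pos])    # large gap: start a new block
        else:
            clusters[-1].append(pos)
    return clusters

def regular_clusters(positions: list[int], max_v: float = 0.1) -> list[list[int]]:
    """Keep only clusters whose variance coefficient is below max_v."""
    kept = []
    for cluster in partition_occurrences(positions):
        gaps = [b - a for a, b in zip(cluster, cluster[1:])]
        if gaps and pstdev(gaps) / mean(gaps) < max_v:
            kept.append(cluster)
    return kept

print(regular_clusters([0, 10, 20, 30, 200, 210, 220]))
```

Each surviving cluster would then go through the density check as a pattern of its own.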

Multiple String Alignment

Problem: patterns with density less than 1 can extract only part of the information

Solution: align k-1 substrings among the k occurrences

A natural generalization of alignment for two strings, which can be solved in O(n*m) by dynamic programming, where n and m are the string lengths.
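The two-string case mentioned above can be sketched with the classic O(n*m) dynamic program. This is an illustrative version under assumptions: unit costs for substitution, insertion and deletion, `-` as the gap symbol, and the hypothetical function name `align`.

```python
def align(a: str, b: str) -> tuple[str, str]:
    """Globally align a and b with unit edit costs; '-' marks a gap."""
    n, m = len(a), len(b)
    # cost[i][j] = minimum edit cost of aligning a[:i] with b[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitute = cost[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            cost[i][j] = min(substitute, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))

print(align("adcwbd", "adcxb"))
```

Aligning more than two strings exactly is much more expensive, which is why practical systems usually fall back on heuristics such as aligning against a running consensus.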

Multiple String Alignment (Cont.)

Suppose “adc” is the discovered pattern for token string “adcwbdadcxbadcxbdadcb”

If we have the following multiple alignment for strings “adcwbd”, “adcxb” and “adcxbd”:

a d c w b d

a d c x b -

a d c x b d

The extraction pattern can be generalized as “adc[w|x]b[d|-]”
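The generalization step can be sketched as a column-by-column merge of the aligned rows. The function name `generalize` is hypothetical, and the rows are assumed to be already aligned to equal length with `-` as the gap symbol.

```python
def generalize(rows: list[str]) -> str:
    """Merge aligned rows into one pattern; differing columns become [x|y]."""
    parts = []
    for column in zip(*rows):
        symbols = list(dict.fromkeys(column))  # unique symbols, first-seen order
        if len(symbols) == 1:
            parts.append(symbols[0])
        else:
            parts.append("[" + "|".join(symbols) + "]")
    return "".join(parts)

print(generalize(["adcwbd", "adcxb-", "adcxbd"]))  # adc[w|x]b[d|-]
```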

Pattern Viewer

Java-application based GUI

Web based GUI

http://www.csie.ncu.edu.tw/~chia/WebIEPAD/

The Extractor

Matching the pattern against the encoded token string
  Knuth-Morris-Pratt's algorithm
  Boyer-Moore's algorithm

Alternatives in a rule: match the longest pattern

What is extracted? The whole record
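As an illustration of the matching step, here is a sketch of Knuth-Morris-Pratt search over a token sequence. It handles a single fixed pattern only; the alternatives in a generalized rule are not handled, and the function names are hypothetical.

```python
def failure_table(pattern: list[str]) -> list[int]:
    """fail[i] = length of the longest proper prefix of pattern[:i+1] that is also its suffix."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_find_all(tokens: list[str], pattern: list[str]) -> list[int]:
    """Return the start index of every occurrence of pattern in tokens."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    hits, k = [], 0
    for i, tok in enumerate(tokens):
        while k > 0 and tok != pattern[k]:
            k = fail[k - 1]          # fall back without re-reading tokens
        if tok == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)   # full match ends at position i
            k = fail[k - 1]
    return hits

print(kmp_find_all(list("adcwbdadcxbadcxbdadcb"), list("adc")))  # [0, 6, 11, 17]
```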

Experiment Setup

Fourteen sources: search engines

Performance measures
  Number of patterns
  Retrieval rate and accuracy rate

Parameters
  Encoding scheme
  Threshold control

Translation

Table 2. Size of translated sequences and number of patterns

Encoding Scheme   Length of Sequence   No. of Patterns
All Tag           1128                 7.9
No Physical       873                  6.5
No Special        796                  5.7
Block-Level       514                  4.4

Average page length is 22.7 KB

Accuracy and Retrieval Rate

Table 5. The performance of multiple string alignment

Search Engine    Retrieval Rate   Accuracy Rate   Matching Percentage
AltaVista        1.00             1.00            0.91
Cora             1.00             1.00            0.97
Excite           1.00             0.97            1.00
Galaxy           1.00             0.95            0.99
Hotbot           0.97             0.86            0.88
Infoseek         0.98             0.94            0.87
Lycos            0.94             0.63            0.94
Magellan         1.00             1.00            0.76
Metacrawler      0.90             0.96            0.78
NorthernLight    0.95             0.96            0.90
Openfind         0.83             0.90            0.66
Savvysearch      1.00             0.95            0.97
Stpt.com         0.99             1.00            0.95
Webcrawler       0.98             0.98            0.98
Average          0.97             0.94            0.90

Problems

Guarantees a high retrieval rate rather than a high accuracy rate: a generalized rule can extract more than the desired data

Currently applicable only when there are several records in a Web page

ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites

Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo

VLDB2001

Observations

1. Wrapper generators work by using additional information (labeled samples).

2. Wrapper induction systems have some a priori knowledge about the page organization.

3. Finally, systems generate a wrapper by examining one HTML page at a time.

ROADRUNNER: a new perspective

1. Does not rely on any interaction with the user (completely automatic).

2. No a priori knowledge: the HTML schema is inferred along with the wrapper. Can handle any nested structure.

3. Works with two HTML pages at a time (based on the study of similarities and dissimilarities between the pages).

Theoretical Background

Site generation = encoding of database content
Data extraction = decoding
The problem is based on a close correspondence between nested types and union-free regular expressions (UFREs).

Delimiters:
  #PCDATA : maps to a string
  + : maps to a (possibly nested) list, acting as an iterator
  ? : maps to nullable fields (optional patterns)

Finding the schema and extracting the data = finding the minimal UFRE.

Matching Technique

It is based on a matching technique called ACME (Align, Collapse under Mismatch, and Extract).

HTML -> XHTML -> tokens

The matching algorithm works on two objects:
  a list of tokens, called the sample
  a wrapper (one UFRE)

The wrapper is generalized by solving mismatches between the wrapper and the sample.

Mismatches

1. String mismatches:
   May be due only to different values of a database field.
   These mismatches are used to discover fields (#PCDATA).
   Ex: ‘John Smith’ and ‘Paul Jones’ at token 4

2. Tag mismatches:
   Optional patterns
   Iterative patterns
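The handling of string mismatches can be sketched as a position-wise comparison of two token lists. This covers only the simplest case and rests on assumptions: the names `is_tag` and `generalize_strings` are hypothetical, both lists must have the same length, and tag mismatches are merely reported rather than solved. The sample tokens are illustrative.

```python
def is_tag(token: str) -> bool:
    """Treat anything of the form <...> as a tag; everything else is a string."""
    return token.startswith("<") and token.endswith(">")

def generalize_strings(wrapper: list[str], sample: list[str]) -> list[str]:
    """Turn every string mismatch into a #PCDATA field."""
    if len(wrapper) != len(sample):
        raise ValueError("this sketch only handles equal-length token lists")
    out = []
    for pos, (w, s) in enumerate(zip(wrapper, sample), start=1):
        if w == s or (w == "#PCDATA" and not is_tag(s)):
            out.append(w)                 # tokens agree, or field already found
        elif not is_tag(w) and not is_tag(s):
            out.append("#PCDATA")         # string mismatch: a database field
        else:
            raise ValueError(f"tag mismatch at token {pos}: {w} vs {s}")
    return out

wrapper = ["<HTML>", "Books of:", "<B>", "John Smith", "</B>"]
sample = ["<HTML>", "Books of:", "<B>", "Paul Jones", "</B>"]
print(generalize_strings(wrapper, sample))
```

In the example the only mismatch is at token 4, so that position becomes a field while the shared label is kept as template text.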

Discovering Optionals

Strategy: look for repeated patterns as a first step; if this attempt fails, try to identify an optional pattern.

Two steps:
  1. Optional Pattern Location by Cross-Search
     Mismatch at token 6: <UL> and <IMG…/>
     Assume the optional pattern is located on either the wrapper or the sample.
  2. Wrapper Generalization
     ( <IMG src=…/> )?

Discovering Iterators

1. Square Location by Terminal-Tag Search:
  Both the wrapper and the sample contain at least one occurrence of the square.
  Terminal tag = the tag at the position before the mismatch; in this example it is </LI>.
  Test: which is the initial tag of the square?
    </UL> ~ </LI> vs. <LI> ~ </LI>
  Finally, we can infer that the sample contains one candidate occurrence of the square at tokens 20-25.

Discovering Iterators (cont.)

2. Square Matching:
  Try to match the candidate square occurrence (tokens 20-25).
  Backwards: match tokens 25 and 19, then move to 24 and 18, and so on.

3. Wrapper Generalization:
  If we denote the newly found square by s, we replace the repeated pattern by (s)+
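The final generalization step, replacing repeated occurrences of a square s by (s)+, can be sketched as follows. This covers only that last step under simplifying assumptions: the square is already known, occurrences must match it token for token, and the iterator is represented by a hypothetical `("+", square)` tuple.

```python
def collapse_square(tokens: list[str], square: list[str]) -> list:
    """Replace each run of consecutive `square` occurrences by one ('+', square) node."""
    out, i, n = [], 0, len(square)
    while i < len(tokens):
        if n and tokens[i:i + n] == square:
            # Skip over every adjacent repetition of the square.
            while tokens[i:i + n] == square:
                i += n
            out.append(("+", tuple(square)))
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = ["<UL>", "<LI>", "#PCDATA", "</LI>", "<LI>", "#PCDATA", "</LI>", "</UL>"]
print(collapse_square(tokens, ["<LI>", "#PCDATA", "</LI>"]))
```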

More Complex Example

First mismatch at token 15 (external mismatch)

Find iterators:
  Terminal tag = </LI>
  Candidate square is found: <LI> ~ </LI> at tokens 15-28
  Backward match: second mismatch at tokens 23 and 9 (internal mismatch)
  Solve the internal mismatch recursively

Recursively Solve Mismatch

Internal mismatch at tokens 23 and 9
  Solve it in the same way as an external mismatch,
  but instead of comparing one wrapper and one sample, compare two different portions of the same object.

Terminal tag =
Candidate square is ~ token 23-18
Backward match: mismatch at tokens 20 and 26
Tokens 20-22 are found to be an optional pattern.

Matching as an AND-OR tree

Finding one solution to match(w, s) corresponds to finding one visit of the AND-OR tree.
  (i) match(w, s) must solve all external mismatches encountered during the parsing (AND node)
  (ii) a mismatch is solved by introducing either one field, one iterator, or one optional (OR)
  (iii) the search may be done on either the wrapper or the sample (OR)
  (iv) there are various candidate iterators and optionals (OR)
  (v) discovering an iterator may require recursively solving several internal mismatches (AND)

AND-OR tree

Experimental Results

Experimental Results (con’t)

Extracting Structured Data from Web Pages

Arvind Arasu, Hector Garcia-Molina
ACM SIGMOD 2003

Cue

Keywords: schema, template

Web pages belonging to the same site are generated by encoding data of the same schema with a common template
=> pages are produced from a common template by plugging in values

Figuration

Goal and Challenge

Previous IE techniques rely on human heuristics, e.g. wrappers
  Time-consuming and error-prone
  Optional attributes are ignored

Goal: to deduce the template without human input

Challenge:
  There is no obvious way of differentiating which text is template and which is data
  The schema of data in pages is not flat but more complex, with semi-structured attributes

Model, Problem Formulation

Structured Data
Model of Page Creation
Optionals and Disjunctions
Problem Statement
Miscellaneous Terminology, Definition

Structured Data

Token: a token is some basic unit of text
Structured data: any set of data values conforming to a common schema or type

Define “Type”:
  1. Basic type (β): a string of tokens, e.g. <html>, text
  2. Ordered list type: tuple constructor of order n, e.g. <T1, T2, …, Tn>, where T1, T2, …, Tn are types
  3. Set type: set constructor, e.g. {T}, where T is a type

Define term value and example

Define “instance”:
  1. An instance of the basic type β is a string of tokens
  2. An instance of type <T1, T2, …, Tn> is a tuple of the form <i1, i2, …, in>, where the attributes i1, i2, …, in are instances of types T1, T2, …, Tn
  3. An instance of type {T} is any set of elements {e1, e2, …, em} such that each ei is an instance of type T

Instance → Value; String → token

Example:
  Schema S1 = <β, {<β, β>τ3}τ2, β>τ1
  Values: x1 = <t, {<f1, l1>, <f2, l2>}, c>, x2 = <t, {<f0, l0>}, c>

Model of Page Creation

Definition: A template T for a schema S (written TS) is a function that maps each type constructor τ of S into an ordered set of strings T(τ), such that:
  if τ is a tuple constructor of order n, T(τ) is an ordered set of n+1 strings <Cτ1, …, Cτ(n+1)>
  if τ is a set constructor, T(τ) is a single string Sτ

λ(T, x): the encoding of a value x that is an instance of a sub-schema of S

1. If x is of basic type β, then λ(T, x) = x
2. If x is a tuple <x1, x2, …, xn>τt with T(τt) = <C1, …, Cn+1>, then
   λ(T, x) = C1 λ(T, x1) C2 … λ(T, xn) Cn+1
3. If x is a set {e1, e2, …, em}τs with T(τs) = S, then
   λ(T, x) = λ(T, e1) S λ(T, e2) S … S λ(T, em)

Example of Schema S1

S1 = <β, {<β, β>τ3}τ2, β>τ1

T1(τ1) = <A, B, C, D>, T1(τ3) = <E, F, G>, T1(τ2) = H

x1 = <t, {<f1, l1>, <f2, l2>}, c>

(1) λ(T1, <t, {<f1, l1>, <f2, l2>}, c>) with T1(τ1) = <A, B, C, D>
    String: A t B λ(T1, {<f1, l1>, <f2, l2>}) C c D
(2) λ(T1, {<f1, l1>, <f2, l2>}) with T1(τ2) = H
    Substring: λ(T1, <f1, l1>) H λ(T1, <f2, l2>)
(3) λ(T1, <f1, l1>) with T1(τ3) = <E, F, G>
    Substring: E f1 F l1 G

String: A t B E f1 F l1 G H E f2 F l2 G C c D

Regular expression: A * B { E * F * G }H C * D (* stands for a data value; { }H is a repetition separated by H)
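The encoding function λ can be sketched recursively. The data representation here is an assumption made for illustration: a basic value is a Python string, a tuple or set value is a hypothetical `(kind, constructor_name, children)` triple, the template is a dict keyed by constructor name, and set elements are encoded in list order.

```python
def encode_value(template: dict, x) -> str:
    """Encode value x with the template strings attached to its type constructors."""
    if isinstance(x, str):
        return x                                  # basic type: the value itself
    kind, tau, children = x
    parts = [encode_value(template, c) for c in children]
    if kind == "tuple":
        seps = template[tau]                      # n+1 strings C1 .. Cn+1
        if len(seps) != len(parts) + 1:
            raise ValueError(f"{tau}: expected {len(parts) + 1} template strings")
        out = seps[0]
        for part, sep in zip(parts, seps[1:]):
            out += part + sep                     # C1 x1 C2 ... xn Cn+1
        return out
    if kind == "set":
        return template[tau].join(parts)          # e1 S e2 S ... S em
    raise ValueError(f"unknown constructor kind: {kind}")

# Template with strings A..H; constructor names t1, t2, t3 are placeholders.
T1 = {"t1": ["A", "B", "C", "D"], "t2": "H", "t3": ["E", "F", "G"]}
x1 = ("tuple", "t1", [
    "t",
    ("set", "t2", [("tuple", "t3", ["f1", "l1"]), ("tuple", "t3", ["f2", "l2"])]),
    "c",
])
print(encode_value(T1, x1))  # AtBEf1Fl1GHEf2Fl2GCcD
```

The extraction problem is the inverse: recover the template strings and the values from several such encoded strings alone.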

Optionals and Disjunctions

Optional: if T is a type, the optional type (T)? ≡ {T}τ, with |τ| = 0 or 1

Disjunction: if T1 and T2 are types, the disjunction type (T1 | T2) ≡ <{T1}τ1, {T2}τ2>τ, with |τ1| + |τ2| = 1

Problem Statement

EXTRACT problem: given a set of n pages pi = λ(T, xi) (1 ≤ i ≤ n), created from some unknown template T and values {x1, …, xn}, deduce the template T and the values from the set of pages alone

Example of correct solution of EXTRACT (cont.)

Pe = {pe1, pe2, pe3, pe4}

Se = <β, {<β, β, β>τe3}τe2>τe1

pei = λ(TSe, xi)

x1 = <Database, {<John, 7, ...>}>, pe1 = λ(TSe, x1)

x2 = <DataMining, {<Jeff, 2, ...>, <Jane, 6, ...>}>, pe2 = λ(TSe, x2)

Miscellaneous Terminology, Definition

An occurrence of a token in template is called a template-token

An occurrence of a token in value is called a value-token

An occurrence of a token in page is called a page-token

Two page-tokens in Pe have the same role iff they have been generated by the same template-token

Overview Approach - EXALG

(ECGM)

EXALG - ECGM - FINDEQ (step 2)
The module that computes equivalence classes (ε): sets of tokens having the same frequency of occurrence in every page of Pe

Ex. εe1: { <html>, <body>, Book, Reviews, <ol>, </ol>, </body>, </html> }
Ex. εe3: { <li>, Reviewer, Rating, Text, </li> }

EXALG retains only the equivalence classes that are large and frequently occurring (LFEQs)

EXALG - ECGM - HANDINV (step 3)
The module that detects and removes invalid LFEQs, i.e. those not formed by tokens associated with a type constructor

DIFFFORM (step 1) and DIFFEQ (step 4)
The modules that add more tokens to LFEQs by differentiating roles
Ex. Name has multiple roles: one occurs in Book Name and the other in Reviewer Name
Differentiating the multiple roles:
  The occurrences lie on different paths from the root in the HTML parse tree (DIFFFORM)
  The occurrences lie at different positions with respect to LFEQ εe1 (DIFFEQ)

dtoken: ex. Name5 and Name14; regard NameA and NameB as different tokens

Review ECGM

1. Find dtokens from paths in the HTML parse tree
2. Find LFEQs
3. Detect and remove invalid LFEQs
4. Find dtokens from positions in valid LFEQs

Example After ECGM Process

εe1: { <html>, <body>, , Book, Name, , , Reviews, , <ol>, </ol>, </body>, </html> }
  size grows from 8 to 13
εe3: { <li>, , Reviewer, Name, , , Rating, , , Text, , </li> }
  size grows from 5 to 12

Position: empty and non-empty

Construct Schema from ECGM

Construct Schema S’ from εe1

The 1st non-empty position is of basic type β
The 2nd non-empty position contains εe3, whose occurrences are generated by set type constructor τe3

→ T(τe1) = <C11, C12,C13>, S’ = <β,{ S” }τe2 >τe1

→ T(τe2) = S” = < C31, C32,C33,C34 > → T(τe3) = < C31, C32,C33,C34 >, <β,β,β,>τe3

S’ = < β,{ <β,β,β,>τe3 }τe2 >τe1

Equivalence Classes (Cont.)

Pages P = { p1, …, pn }, pi = λ(TS, xi)
TS = { τ1, …, τk }: type constructors

Definition: all tokens of an equivalence class have the same occurrence vector
  ex. εe1: <1,1,1,1>; εe3: <1,2,1,0>

Observation 1: tokens associated with the same type constructor τj in T that have unique roles occur in the same equivalence class (used to decide whether an EQ class is valid)

Support of a token: number of pages that contain it
Size of an EQ class: number of tokens in it
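Grouping tokens by occurrence vector can be sketched as follows. This is a simplified illustration, not the FINDEQ module itself: the name `find_lfeqs` is hypothetical, pages are plain token lists without role differentiation, and the thresholds are applied with `>=`, which is an assumption.

```python
from collections import Counter, defaultdict

def find_lfeqs(pages: list[list[str]], size_thres: int, sup_thres: int) -> dict:
    """Group tokens by occurrence vector and keep the large, frequent classes."""
    counts = [Counter(page) for page in pages]
    vocabulary = set().union(*pages)
    classes = defaultdict(set)
    for token in vocabulary:
        # Occurrence vector: how often the token appears in each page.
        vector = tuple(c[token] for c in counts)
        classes[vector].add(token)

    lfeqs = {}
    for vector, tokens in classes.items():
        support = sum(1 for v in vector if v > 0)   # pages containing the tokens
        if len(tokens) >= size_thres and support >= sup_thres:
            lfeqs[vector] = tokens
    return lfeqs

pages = [
    ["<html>", "<body>", "<li>", "a", "</li>", "</body>", "</html>"],
    ["<html>", "<body>", "<li>", "b", "</li>", "<li>", "c", "</li>", "</body>", "</html>"],
    ["<html>", "<body>", "</body>", "</html>"],
]
print(find_lfeqs(pages, size_thres=2, sup_thres=2))
```

In the sample, the data tokens `b` and `c` share a vector but appear in only one page, so their class is dropped for lack of support.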

Equivalence Classes (Cont.)

Observation 2: for real pages, an equivalence class of large size and support is usually valid

Properties of EQ classes:
  Ordered: the tokens <t1, …, tm> of a class occur in a fixed order
  Nested: the span of all occurrences of εi is within εj at some fixed position p, or the spans do not overlap

Observation 3: a valid equivalence class is ordered, and a pair of two valid equivalence classes is nested

Handling Invalid Equivalence Classes
  Detect the existence of invalid LFEQs using violations of the ordered and nested properties
  If found, discard some of the LFEQs and break others into smaller LFEQs

Differentiating Roles of Tokens
  By path: tokens with different roles lie on different paths of the HTML parse tree
  By position: tokens with different roles are located at different (non-empty) positions

Equivalence Class Generation Module

OUTPUT: a set of LFEQs of dtokens, and the pages represented as strings of dtokens

FINDEQ: two parameters decide which classes count as LFEQs (SIZETHRES, SUPTHRES)

On the running example:
  SIZETHRES = SUPTHRES = 3
  with 2 iterations, εe1 and εe3 are found

Building Template and Extracting Values

Input to this module is {ε1, ε2, …, εm}

The ANALYSIS stage consists of 2 modules: CONSTTEMP and EXVAL

CONSTTEMP, εi = { d1, d2, …, dl }
  Starts from the basic ε1 = { <html>, <body>, …, </body>, </html> }
  Recursively constructs a template Tεi corresponding to εi, and a template Tεi,p corresponding to each non-empty position p of εi
  Checks whether the set of strings PosString(εi, p) corresponding to position p has some recognizable pattern

Example

In the running example, PosString(εe1+, 6) is a string of dtokens for every occurrence of εe1+, which matches Pattern 5 of the table; PosString(εe1+, 10) is always a string of 0 or more occurrences of εe3+, which matches Pattern 1.

εe1: { <html>, <body>, , Book, Name, , , Reviews, , <ol>, </ol>, </body>, </html> }

Assumption

The 4 assumptions:

(A1) A large number of tokens occurring in template have unique roles

(A2) The EQ class derived from a type constructor is recognized as an LFEQ

(A3) Irregularity in encoded data that leads to invalid EQ class

(A4) The separators are around data values. In this model, strings associated with type construction are non-empty position

Evaluation

Leaf attribute Am in schema Sm

Correct: the set of values of Am in the page is equal to the set of extracted values Ae in the page

Partially correct: the set of values of Am in the page is not equal to the set of extracted values Ae, but the Am values appear as part of the values of Ae

Incorrect: neither correct nor partially correct

Result

For 18 (40%) of the input collections, the system correctly extracted all the attributes

Around 80% of the attributes were extracted correctly

Normalized average; Input size <= 10; Parameter = 3

Conclusion

EXALG uses two novel concepts, equivalence classes and differentiating roles, to discover the template

The impact of a failed assumption is limited to a few attributes

Future work:
  Develop techniques for crawling, indexing, and providing querying support for the structured pages on the web
  Develop techniques for automatically annotating the extracted data, possibly using the words that appear in the template

References

C.H. Chang and S.C. Lui. IEPAD: Information Extraction based on Pattern Discovery. WWW2001, pp. 681-688.

Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo. RoadRunner: Towards Automatic Data Extraction from Large Web Sites. VLDB2001, pp. 109-118.

Arvind Arasu, Hector Garcia-Molina. Extracting Structured Data from Web Pages. SIGMOD2003, pp. 337-348.
