Annotation Free Information Extraction
Chia-Hui Chang
Department of Computer Science & Information Engineering
National Central University
[email protected]
10/4/2002
IEPAD: Information Extraction based on Pattern Discovery
C.H. Chang, National Central University
WWW10
Semi-structured Information Extraction
  Information Extraction (IE)
    Input: HTML pages
    Output: A set of records
  Pattern Discovery based IE
    Motivation
      The display of multiple records often forms a repeated pattern
      The occurrences of the pattern are spaced regularly and adjacently
    Now the problem becomes: find regular and adjacent repeats in a string
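IEPAD Architecture
[Figure: IEPAD architecture. Components: Pattern Generator (HTML page in, patterns out), Pattern Viewer (patterns shown to users, extraction rule out), Extractor (extraction rule and HTML pages in, extraction results out).]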
The Pattern Generator
  Token translator
  PAT tree constructor
  Pattern validator
  Rule composer
[Figure: pattern generator pipeline. HTML page -> Token Translator -> a token string -> PAT Tree Constructor -> PAT trees and maximal repeats -> Validator -> advanced patterns -> Rule Composer -> extraction rules.]
1. Web Page Translation
  Encoding of HTML source
    Rule 1: Each tag is encoded as a token
    Rule 2: Any text between two tags is translated to a special token called TEXT (denoted by an underscore)
  HTML example:
    <B>Congo</B><I>242</I><BR>
    <B>Egypt</B><I>20</I><BR>
  Encoded token string:
    T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
    T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
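The two translation rules can be sketched in a few lines of Python. This is only an illustration, not the IEPAD implementation: the function name `translate` is hypothetical, and dropping tag attributes and upper-casing tag names are assumptions made here so that the output matches the slide's token string.

```python
import re

TAG = re.compile(r"<[^>]+>")                      # anything that looks like a tag
TAG_NAME = re.compile(r"<\s*(/?)\s*([A-Za-z0-9]+)")


def translate(html):
    """Encode HTML as a token list: one token per tag, '_' for text between tags."""
    tokens = []
    pos = 0
    for match in TAG.finditer(html):
        # Rule 2: non-blank text between two tags becomes the TEXT token "_".
        if html[pos:match.start()].strip():
            tokens.append("_")
        # Rule 1: each tag becomes a token (attributes dropped: an assumption).
        name = TAG_NAME.match(match.group(0))
        if name:
            tokens.append("<" + name.group(1) + name.group(2).upper() + ">")
        else:
            tokens.append(match.group(0))         # e.g. comments, kept verbatim
        pos = match.end()
    if html[pos:].strip():
        tokens.append("_")
    return tokens


record = translate("<B>Congo</B><I>242</I><BR>")
print("".join("T(%s)" % t for t in record))
```

Both example lines (Congo and Egypt) translate to the same token list, which is what makes the record boundary show up as a repeat.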
Various Encoding Schemes

Block-level tags
  Headings          H1~H6
  Text containers   P, PRE, BLOCKQUOTE, ADDRESS
  Lists             UL, OL, LI, DL, DIR, MENU
  Others            DIV, CENTER, FORM, HR, TABLE, BR
Text-level tags
  Logical markup    EM, STRONG, DFN, CODE, SAMP, KBD, VAR, CITE
  Physical markup   TT, I, B, U, STRIKE, BIG, SMALL, SUB, SUP, FONT
  Special markup    A, BASEFONT, IMG, APPLET, PARAM, MAP, AREA

Figure 2. Tag classification
2. PAT Tree Construction
  PAT tree: binary suffix tree
    A Patricia tree constructed over all possible suffix strings of a text
  Example
    Token codes:
      T(<B>)   000
      T(</B>)  001
      T(<I>)   010
      T(</I>)  011
      T(<BR>)  100
      T(_)     110
    Encoded token string:
      T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
    Binary string:
      000110001010110011100000110001010110011100
    Indexing positions:
      suffix 1   000110001010110011100000110001010110011100$
      suffix 2   110001010110011100000110001010110011100$
      suffix 3   001010110011100000110001010110011100$
      suffix 4   010110011100000110001010110011100$
      suffix 5   110011100000110001010110011100$
      suffix 6   011100000110001010110011100$
      suffix 7   100000110001010110011100$
      suffix 8   000110001010110011100$
      suffix 9   110001010110011100$
      suffix 10  001010110011100$
      suffix 11  010110011100$
      suffix 12  110011100$
      suffix 13  011100$
      suffix 14  100$
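The binary string and the fourteen indexed suffixes can be reproduced with a short sketch. The code table is the one given on the slide; the helper names `encode_bits` and `token_suffixes` are hypothetical, and the sketch only lists the suffixes that a PAT tree would index, it does not build the tree.

```python
# 3-bit token codes taken from the slide.
CODES = {"<B>": "000", "</B>": "001", "<I>": "010",
         "</I>": "011", "<BR>": "100", "_": "110"}
WIDTH = 3  # every token code has the same width


def encode_bits(tokens):
    """Concatenate the fixed-width binary codes of a token list."""
    return "".join(CODES[t] for t in tokens)


def token_suffixes(tokens):
    """Return one '$'-terminated binary suffix per token position.

    Suffixes start only at token boundaries, so suffix k (1-based)
    begins at bit offset WIDTH * (k - 1).
    """
    bits = encode_bits(tokens)
    return [bits[i * WIDTH:] + "$" for i in range(len(tokens))]


record = ["<B>", "_", "</B>", "<I>", "_", "</I>", "<BR>"]
for k, suffix in enumerate(token_suffixes(record * 2), start=1):
    print("suffix %2d  %s" % (k, suffix))
```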
The Constructed PAT Tree
[Figure 3. The PAT tree for the Congo Code: internal nodes labeled a to m, leaves indexed by suffix positions 1 to 14.]
Labels shown with the figure: 0110001010110011100, 1010110011100, 01010110011100, 0110011100, 11100
Definition of Maximal Repeats
  Let α occur in S at positions p1, p2, p3, …, pk
  α is left maximal if there exists at least one (i, j) pair such that S[pi-1] ≠ S[pj-1]
  α is right maximal if there exists at least one (i, j) pair such that S[pi+|α|] ≠ S[pj+|α|]
  α is a maximal repeat if it is both left maximal and right maximal
Finding Maximal Repeats
  Definition:
    Let's call the character S[pi-1] the left character of suffix pi
    A node v is left diverse if at least two leaves in v's subtree have different left characters
  Lemma:
    The path label α of an internal node v in a PAT tree is a maximal repeat if and only if v is left diverse
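To make the definition concrete, the sketch below finds maximal repeats by brute force directly from the definition (a substring occurring at least twice, with at least two different left neighbours and at least two different right neighbours). It deliberately does not use a PAT tree, so it is far slower than the method on the slides. Treating the string boundary as a unique neighbour is an assumption made here, and the name `maximal_repeats` is hypothetical.

```python
from collections import defaultdict


def maximal_repeats(s):
    """Map each maximal repeat (as a tuple of tokens) to its 0-based positions."""
    n = len(s)
    found = {}
    for length in range(1, n):
        occurrences = defaultdict(list)
        for i in range(n - length + 1):
            occurrences[tuple(s[i:i + length])].append(i)
        for pattern, positions in occurrences.items():
            if len(positions) < 2:
                continue  # not a repeat
            # Boundary neighbours get a unique sentinel (assumption).
            left = {s[p - 1] if p > 0 else ("^", p) for p in positions}
            right = {s[p + length] if p + length < n else ("$", p)
                     for p in positions}
            if len(left) > 1 and len(right) > 1:
                found[pattern] = positions
    return found


record = ["<B>", "_", "</B>", "<I>", "_", "</I>", "<BR>"]
for pattern, positions in maximal_repeats(record * 2).items():
    print(positions, "".join(pattern))
```

On the two-record Congo string this reports the whole seven-token record (positions 0 and 7) and the single TEXT token; every shorter piece of the record is always preceded or followed by the same token, so it fails one of the two maximality tests.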
3. Pattern Validator
  Suppose the occurrences of a maximal repeat α are ordered by position such that p1 < p2 < p3 < … < pk, where pi denotes the position of each suffix in the encoded token sequence.
  Characteristics of a pattern
    Regularity: variance coefficient
      \[ V(\alpha) = \frac{\mathrm{StdDev}(\{\, p_{i+1} - p_i \mid 1 \le i < k \,\})}{\mathrm{Mean}(\{\, p_{i+1} - p_i \mid 1 \le i < k \,\})} \]
    Adjacency: density
      \[ D(\alpha) = \frac{(k - 1) \cdot |\alpha|}{p_k - p_1} \]
Pattern Validator (Cont.)
  Basic screening: for each maximal repeat α, compute V(α) and D(α)
    a) check the pattern's variance: V(α) < 0.5
    b) check the pattern's density: 0.25 < D(α) < 1.5
  [Figure: screening flowchart. V(α) < 0.5? No: discard the pattern. Yes: 0.25 < D(α) < 1.5? No: discard the pattern. Yes: keep the pattern.]
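The basic screening can be sketched as follows, taking V as the standard deviation of the gaps between consecutive occurrences divided by their mean, and D as (k - 1) * |α| / (pk - p1). The thresholds 0.5, 0.25 and 1.5 are the ones on the slide. Using the population standard deviation is an assumption (the slide does not say which estimator is meant), and the function names are hypothetical.

```python
from statistics import mean, pstdev


def variance_coefficient(positions):
    """V: spread of the gaps between consecutive occurrences, relative to their mean."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return pstdev(gaps) / mean(gaps)   # population std dev: an assumption


def density(positions, pattern_len):
    """D: fraction of the span p1..pk that is covered by the pattern."""
    k = len(positions)
    return (k - 1) * pattern_len / (positions[-1] - positions[0])


def passes_screening(positions, pattern_len):
    """Keep a pattern only if it is regular (V < 0.5) and adjacent (0.25 < D < 1.5)."""
    if len(positions) < 2:
        return False
    return (variance_coefficient(positions) < 0.5
            and 0.25 < density(positions, pattern_len) < 1.5)


# The 7-token Congo/Egypt record occurs at token positions 1 and 8.
print(variance_coefficient([1, 8]), density([1, 8], 7), passes_screening([1, 8], 7))
```

For the two-record example the only gap is 7, so V is 0 and D is 1: the record pattern is perfectly regular and its occurrences are exactly adjacent.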
4. Rule Composer
  Occurrence partition
    Flexible variance threshold control
  Multiple string alignment
    Increase density of a pattern
Occurrence Partition
  Problem
    Some patterns are divided into several blocks
    Ex: Lycos, Excite with large regularity
  Solution
    Clustering of the occurrences of such a pattern
  [Figure: flowchart. Clustering, then V(α) < 0.1? No: discard. Yes: check density.]
Multiple String Alignment
  Problem
    Patterns with density less than 1 can extract only part of the information
  Solution
    Align the k-1 substrings among the k occurrences
    A natural generalization of the alignment of two strings, which can be solved in O(n*m) time by dynamic programming, where n and m are the string lengths
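The two-string case mentioned above can be sketched with the classic O(n*m) dynamic program. This is pairwise alignment only, not the multiple alignment used by the rule composer; unit costs for gaps and substitutions are an assumption, and the function name `align` is hypothetical.

```python
def align(a, b, gap="-"):
    """Globally align two strings with unit edit costs; returns both padded strings."""
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
                             cost[i - 1][j] + 1,    # gap in b
                             cost[i][j - 1] + 1)    # gap in a
    # Trace back from the bottom-right corner, preferring the diagonal.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            out_a.append(a[i - 1]); out_b.append(gap); i -= 1
        else:
            out_a.append(gap); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


print(align("adcxb", "adcxbd"))
```

Each table cell is filled once from three neighbours, which is where the O(n*m) bound comes from.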
Multiple String Alignment (Cont.)
  Suppose "adc" is the discovered pattern for the token string "adcwbdadcxbadcxbdadcb"
  If we have the following multiple alignment for the strings "adcwbd", "adcxb" and "adcxbd":
    a d c w b d
    a d c x b -
    a d c x b d
  The extraction pattern can be generalized as "adc[w|x]b[d|-]"
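The generalization step in this example can be sketched as a column-by-column merge of rows that are already aligned. The alignment itself is taken as given, and both function names are hypothetical.

```python
def split_occurrences(s, pattern):
    """Cut s at each start of pattern; k occurrences yield k-1 substrings."""
    starts = [i for i in range(len(s)) if s.startswith(pattern, i)]
    return [s[a:b] for a, b in zip(starts, starts[1:])]


def generalize(aligned_rows):
    """Merge aligned rows into one pattern, listing alternatives per column."""
    if len({len(row) for row in aligned_rows}) != 1:
        raise ValueError("rows must already be aligned to the same length")
    parts = []
    for column in zip(*aligned_rows):
        options = list(dict.fromkeys(column))   # distinct symbols, first-seen order
        parts.append(options[0] if len(options) == 1
                     else "[" + "|".join(options) + "]")
    return "".join(parts)


print(split_occurrences("adcwbdadcxbadcxbdadcb", "adc"))
print(generalize(["adcwbd", "adcxb-", "adcxbd"]))
```

The first call recovers the three substrings between the four occurrences of "adc"; the second rebuilds the generalized pattern from the aligned rows shown on the slide.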
Pattern Viewer
  Java-application based GUI
  Web based GUI
    http://www.csie.ncu.edu.tw/~chia/WebIEPAD/
The Extractor
  Matching the pattern against the encoded token string
    Knuth-Morris-Pratt's algorithm
    Boyer-Moore's algorithm
  Alternatives in a rule
    Matching the longest pattern
  What is extracted?
    The whole record
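As an illustration of the matching step, here is a standard Knuth-Morris-Pratt search over a token list. It handles a single fixed pattern only; alternatives such as [w|x] and the longest-match policy are not covered, and the name `kmp_search` is hypothetical.

```python
def kmp_search(text, pattern):
    """Return the 0-based start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    # failure[i] = length of the longest proper prefix of pattern[:i+1]
    # that is also a suffix of it.
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    hits = []
    k = 0
    for i, token in enumerate(text):
        while k > 0 and token != pattern[k]:
            k = failure[k - 1]
        if token == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = failure[k - 1]      # keep going to find later occurrences
    return hits


record = ["<B>", "_", "</B>", "<I>", "_", "</I>", "<BR>"]
print(kmp_search(record * 2, record))
```

Each hit marks the start of one record in the token string; mapping the token span back to the page text yields the extracted record.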
Experiment Setup
  Fourteen sources: search engines
  Performance measures
    Number of patterns
    Retrieval rate and accuracy rate
  Parameters
    Encoding scheme
    Thresholds control
Translation

Table 2. Size of translated sequences and number of patterns

Encoding Scheme   Length of Sequence   No. of Patterns
All Tag           1128                 7.9
No Physical       873                  6.5
No Special        796                  5.7
Block-Level       514                  4.4

Average page length is 22.7KB
Accuracy and Retrieval Rate

Table 5. The performance of multiple string alignment

Search Engine    Retrieval Rate   Accuracy Rate   Matching Percentage
AltaVista        1.00             1.00            0.91
Cora             1.00             1.00            0.97
Excite           1.00             0.97            1.00
Galaxy           1.00             0.95            0.99
Hotbot           0.97             0.86            0.88
Infoseek         0.98             0.94            0.87
Lycos            0.94             0.63            0.94
Magellan         1.00             1.00            0.76
Metacrawler      0.90             0.96            0.78
NorthernLight    0.95             0.96            0.90
Openfind         0.83             0.90            0.66
Savvysearch      1.00             0.95            0.97
Stpt.com         0.99             1.00            0.95
Webcrawler       0.98             0.98            0.98
Average          0.97             0.94            0.90
Problems
  Guarantees a high retrieval rate rather than a high accuracy rate
    A generalized rule can extract more than the desired data
  Currently only applicable when there are several records in a Web page
ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites
Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo
VLDB2001
Observations
  1. Wrapper generators work by using additional information (labeled samples).
  2. Wrapper induction systems have some a priori knowledge about the page organization.
  3. Systems generate a wrapper by examining one HTML page at a time.
ROADRUNNER: a new perspective
  1. Does not rely on any interaction with the user (completely automatic).
  2. No a priori knowledge.
     The HTML schema will be inferred along with the wrapper.
     Can handle any nested structures.
  3. Works with two HTML pages at a time (based on the study of similarities and dissimilarities between the pages).
Theoretical Background
  Site generation = Encoding of database content
  Data extraction = Decoding
  The problem is based on a close correspondence between nested types and union-free regular expressions (UFRE).
  Delimiter
    #PCDATA : maps to a string
    + : maps to lists (possibly nested), acting as the iterator
    ? : maps to nullable fields, optional patterns
  Finding the schema and extracting the data = finding a minimal UFRE.
Matching Technique
  It is based on a matching technique called ACME (Align, Collapse under Mismatch, and Extract).
  HTML -> XHTML -> tokens
  The matching algorithm works on two objects:
    A list of tokens, called the sample
    A wrapper (one UFRE)
  This is done by solving mismatches between the wrapper and the sample.
Mismatches
  1. String mismatches:
    May be due only to different values of a database field.
    These mismatches are used to discover fields (#PCDATA).
    Ex: 'John Smith' and 'Paul Jones' at token 4
  2. Tag mismatches:
    Optional patterns
    Iterative patterns
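Field discovery from string mismatches can be sketched as a parallel walk over the wrapper and the sample. This covers only the simplest case, where the two token lists have the same length and differ only in text tokens; tag mismatches (optionals and iterators) are reported rather than solved. The token lists and the names `is_tag` and `generalize_strings` are hypothetical.

```python
def is_tag(token):
    """Treat anything of the form <...> as a tag token."""
    return token.startswith("<") and token.endswith(">")


def generalize_strings(wrapper, sample):
    """Replace string mismatches between wrapper and sample by #PCDATA fields."""
    if len(wrapper) != len(sample):
        raise ValueError("different lengths: a tag mismatch must be solved first")
    result = []
    for position, (w, s) in enumerate(zip(wrapper, sample), start=1):
        if w == s or (w == "#PCDATA" and not is_tag(s)):
            result.append(w)                 # no mismatch, or field already known
        elif not is_tag(w) and not is_tag(s):
            result.append("#PCDATA")         # string mismatch: a new field
        else:
            raise ValueError("tag mismatch at token %d" % position)
    return result


wrapper = ["<HTML>", "Books of:", "<B>", "John Smith", "</B>", "</HTML>"]
sample = ["<HTML>", "Books of:", "<B>", "Paul Jones", "</B>", "</HTML>"]
print(generalize_strings(wrapper, sample))
```

The mismatch at token 4 turns the author name into a #PCDATA field, while the identical string "Books of:" stays part of the template.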
Discovering Optionals
  Strategy: look for repeated patterns as a first step and then, if this attempt fails, try to identify an optional pattern.
  Two steps:
    1. Optional Pattern Location by Cross-Search
       Mismatch at token 6: <UL> and <IMG…/>
       Assume the optional pattern is located either on the wrapper or on the sample.
    2. Wrapper Generalization
       ( <IMG src=…/> )?
Discovering Iterators
  1. Square Location by Terminal-Tag Search:
    Both the wrapper and the sample contain at least one occurrence of the square.
    Terminal tag = the token at the position before the mismatch
      In this example it is </LI>
    Test which is the square's initial tag:
      </UL> ~ </LI> vs. <LI> ~ </LI>
    Finally, we can infer that the sample contains one candidate occurrence of the square at tokens 20-25.
Discovering Iterators (cont.)
  2. Square Matching:
    Try to match the candidate square occurrence (tokens 20-25).
    Backwards: match tokens 25 and 19, then move to 24 and 18, and so on.
  3. Wrapper Generalization:
    If we denote the newly found square by s, we replace the repeated pattern by (s)+
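The wrapper generalization step for iterators can be sketched as collapsing every run of consecutive occurrences of a square s into a single (s)+ token. Locating and matching the square is assumed to have happened already; the token list and the name `collapse_square` are hypothetical.

```python
def collapse_square(tokens, square):
    """Replace each run of consecutive copies of square by one '(...)+' token."""
    if not square:
        raise ValueError("square must not be empty")
    size = len(square)
    collapsed = []
    i = 0
    while i < len(tokens):
        if tokens[i:i + size] == square:
            while tokens[i:i + size] == square:   # skip the whole run
                i += size
            collapsed.append("(" + " ".join(square) + ")+")
        else:
            collapsed.append(tokens[i])
            i += 1
    return collapsed


page = ["<UL>", "<LI>", "#PCDATA", "</LI>", "<LI>", "#PCDATA", "</LI>", "</UL>"]
print(collapse_square(page, ["<LI>", "#PCDATA", "</LI>"]))
```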
More Complex Example
  First mismatch at token 15 (external mismatch)
  Find iterators:
    Terminal tag = </LI>
    A candidate square is found: <LI> ~ </LI> at tokens 15-28
    Backward match: second mismatch at tokens 23 and 9 (internal mismatch)
    Solve the mismatch recursively
Recursively Solve Mismatch
  Internal mismatch at tokens 23 and 9
    Solve it in the same way as an external mismatch.
    But do not compare one wrapper and one sample; rather, compare two different portions of the same object.
  Terminal tag = <B>
  Candidate square is </B> ~ <B>, tokens 23-18
  Backward match: mismatch at tokens 20 and 26
  Find that tokens 20-22 are an optional pattern.
Matching as an AND-OR tree
  Finding one solution to match(w,s) corresponds to finding one visit of the AND-OR tree.
    (i) match(w,s): solve all external mismatches encountered during the parsing (AND node)
    (ii) solve a mismatch by introducing either one field, or one iterator, or one optional (OR)
    (iii) the search may be done either on the wrapper or on the sample (OR)
    (iv) there are various candidates for iterators and optionals (OR)
    (v) discovering iterators may require recursively solving several internal mismatches (AND)
AND-OR tree
Experimental Results
Experimental Results (con’t)
Extracting Structured Data from Web Pages
Arvind Arasu, Hector Garcia-Molina
ACM SIGMOD 2003
Cue
  Keywords: schema, template
  Web pages belonging to the same site are generated by encoding data of the same schema with a common template
  => pages are produced by plugging values into a common template
Figuration
Goal and Challenge
  Previous IE techniques rely on heuristics supplied by humans, e.g. wrappers
  Goal: to deduce the template without human input
    Time consuming and error-prone
    Optional attributes are ignored
  Challenge:
    There is no obvious way of differentiating which text is template and which is data
    The schema of the data in pages is not flat, but more complex and semi-structured
Model, Problem Formulation
  Structured Data
  Model of Page Creation
  Optionals and Disjunctions
  Problem Statement
  Miscellaneous Terminology, Definition
Structured Data
  Token: a token is some basic unit of text
  Structured data: any set of data values conforming to a common schema or type
  Define "Type":
    1. Basic Type (β): a string of tokens, e.g. <html>, text
    2. Ordered List Type: tuple constructor of order n, e.g. <T1, T2, …, Tn>, where T1, T2, …, Tn are types
    3. Set Type: set constructor, e.g. {T}, where T is a type
Define term value and example
  Define "instance":
    1. an instance of the basic type β is a string of tokens
    2. an instance of type <T1, T2, …, Tn> is a tuple of the form <i1, i2, …, in>, whose attributes i1, i2, …, in are instances of types T1, T2, …, Tn
    3. an instance of type {T} is any set of elements {e1, e2, …, em} such that each ei is an instance of type T
  Instance → Value; String → Token
  Example:
    \[ S_1 = \langle \beta, \{\langle \beta, \beta \rangle_{\tau_3}\}_{\tau_2}, \beta \rangle_{\tau_1} \]
    \[ x_1 = \langle t, \{\langle f_1, l_1 \rangle, \langle f_2, l_2 \rangle\}, c \rangle, \qquad x_2 = \langle t, \{\langle f_0, l_0 \rangle\}, c \rangle \]
Model of Page Creation
  Definition: A template T for a schema S (denoted TS) is defined as a function that maps each type constructor τ of S into an ordered set of strings T(τ), such that
    if τ is a tuple constructor of order n, T(τ) is an ordered set of n+1 strings <Cτ1, …, Cτ(n+1)>
    if τ is a set constructor, T(τ) is a single string Sτ
  Encoding λ(T, x) of a value x that is an instance of a sub-schema of S:
    1. if x is of basic type β, then λ(T, x) = x
    2. if x = <x1, x2, …, xn>τt, then λ(T, x) = C1 λ(T, x1) C2 … λ(T, xn) Cn+1, where T(τt) = <C1, …, Cn+1>
    3. if x = {e1, e2, …, em}τs, then λ(T, x) = λ(T, e1) S λ(T, e2) S … S λ(T, em), where T(τs) = S
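Example of Schema S1
  \[ S_1 = \langle \beta, \{\langle \beta, \beta \rangle_{\tau_3}\}_{\tau_2}, \beta \rangle_{\tau_1} \]
  \[ T_1(\tau_1) = \langle A, B, C, D \rangle, \qquad T_1(\tau_3) = \langle E, F, G \rangle, \qquad T_1(\tau_2) = H \]
  \[ x_1 = \langle t, \{\langle f_1, l_1 \rangle, \langle f_2, l_2 \rangle\}, c \rangle \]
  Step 1, using T_1(\tau_1) = \langle A, B, C, D \rangle:
  \[ \lambda(T_1, x_1) = A\, t\, B\; \lambda(T_1, \{\langle f_1, l_1 \rangle, \langle f_2, l_2 \rangle\})\; C\, c\, D \]
  Step 2, using T_1(\tau_2) = H:
  \[ \lambda(T_1, \{\langle f_1, l_1 \rangle, \langle f_2, l_2 \rangle\}) = \lambda(T_1, \langle f_1, l_1 \rangle)\; H\; \lambda(T_1, \langle f_2, l_2 \rangle) \]
  Step 3, using T_1(\tau_3) = \langle E, F, G \rangle:
  \[ \lambda(T_1, \langle f_1, l_1 \rangle) = E\, f_1\, F\, l_1\, G \]
  Resulting string:
  \[ A\, t\, B\, E\, f_1\, F\, l_1\, G\, H\, E\, f_2\, F\, l_2\, G\, C\, c\, D \]
  Regular expression (a dot stands for a data value; the subscript H is the separator between repetitions):
  \[ A \cdot B\, (E \cdot F \cdot G)^{*}_{H}\, C \cdot D \]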
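The encoding λ(T, x) can be written as a short recursive function. The sketch below assumes a simple value representation chosen here for illustration: a plain string for a basic value, and two small classes, `Tup` and `SetOf`, that carry the name of their type constructor; none of these names come from the paper. The template maps a tuple constructor to its n+1 strings and a set constructor to its separator string.

```python
from dataclasses import dataclass


@dataclass
class Tup:
    ctor: str      # name of the tuple constructor, e.g. "tau1"
    items: list    # the n component values


@dataclass
class SetOf:
    ctor: str      # name of the set constructor, e.g. "tau2"
    items: list    # the elements (kept in a list so the output order is fixed)


def encode(template, x):
    """Compute lambda(T, x): interleave template strings with encoded values."""
    if isinstance(x, str):                       # case 1: basic type
        return x
    if isinstance(x, Tup):                       # case 2: C1 x1 C2 ... xn Cn+1
        strings = template[x.ctor]
        if len(strings) != len(x.items) + 1:
            raise ValueError("a tuple of order n needs n+1 template strings")
        out = strings[0]
        for item, string in zip(x.items, strings[1:]):
            out += encode(template, item) + string
        return out
    if isinstance(x, SetOf):                     # case 3: e1 S e2 S ... S em
        return template[x.ctor].join(encode(template, e) for e in x.items)
    raise TypeError("unsupported value: %r" % (x,))


T1 = {"tau1": ["A", "B", "C", "D"], "tau2": "H", "tau3": ["E", "F", "G"]}
x1 = Tup("tau1", ["t",
                  SetOf("tau2", [Tup("tau3", ["f1", "l1"]),
                                 Tup("tau3", ["f2", "l2"])]),
                  "c"])
print(encode(T1, x1))
```

With the template of the example this produces AtBEf1Fl1GHEf2Fl2GCcD, the same string derived step by step above.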
Optionals and Disjunctions
  Optional: if T is a type, the optional type (T)? ≡ {T}τ
    with |τ| = 0 or 1
  Disjunction: if T1 and T2 are types, the disjunction type (T1 | T2) ≡ < {T1}τ1, {T2}τ2 >τ
    with |τ1| + |τ2| = 1
Problem Statement
  EXTRACT problem: given n pages pi = λ(T, xi) (1 ≤ i ≤ n), created from some unknown template T and values {x1, …, xn}, deduce the template T and the values from the set of pages alone
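Example of correct solution of EXTRACT (cont.)
  \[ P_e = \{ p_{e1}, p_{e2}, p_{e3}, p_{e4} \} \]
  \[ S_e = \langle \beta, \{\langle \beta, \beta, \beta \rangle_{\tau_{e3}}\}_{\tau_{e2}} \rangle_{\tau_{e1}} \]
  \[ x_{e1} = \langle \text{Database}, \{\langle \text{John}, 7, \ldots \rangle\} \rangle, \qquad p_{e1} = \lambda(T_{S_e}, x_{e1}) \]
  \[ x_{e2} = \langle \text{DataMining}, \{\langle \text{Jeff}, 2, \ldots \rangle, \langle \text{Jane}, 6, \ldots \rangle\} \rangle, \qquad p_{e2} = \lambda(T_{S_e}, x_{e2}) \]
  \[ p_{ei} = \lambda(T_{S_e}, x_{ei}) \]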
Miscellaneous Terminology, Definition
An occurrence of a token in template is called a template-token
An occurrence of a token in value is called a value-token
An occurrence of a token in page is called a page-token
Two page-tokens in Pe have the same role iff they have been generated by the same template-token
Overview Approach - EXALG
(ECGM)
EXALG - ECGM - FINDEQ (step 2)
  The module used to compute equivalence classes (ε): sets of tokens having the same frequency of occurrence in every page of Pe
    Ex. εe1: { <html>, <body>, Book, Reviews, <ol>, </ol>, </body>, </html> }
    Ex. εe3: { <li>, Reviewer, Rating, Text, </li> }
  EXALG retains only the equivalence classes that are large and frequently occurring (LFEQs)
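The idea behind FINDEQ can be sketched by grouping tokens on their occurrence vector (the number of times a token appears in each page) and then keeping only the large, well-supported classes. The three toy pages, the function names, and the use of >= when comparing against the thresholds are assumptions made for this illustration; the sketch also ignores dtokens and the later validity checks.

```python
from collections import Counter, defaultdict


def equivalence_classes(pages):
    """Group tokens that have the same occurrence vector across all pages."""
    counts = [Counter(page) for page in pages]
    vocabulary = set().union(*pages)
    classes = defaultdict(set)
    for token in vocabulary:
        vector = tuple(c[token] for c in counts)   # occurrences in each page
        classes[vector].add(token)
    return dict(classes)


def large_frequent(classes, size_thres, sup_thres):
    """Keep classes with enough tokens (size) and enough pages (support)."""
    kept = {}
    for vector, tokens in classes.items():
        support = sum(1 for n in vector if n > 0)  # pages containing the tokens
        if len(tokens) >= size_thres and support >= sup_thres:
            kept[vector] = tokens
    return kept


pages = [
    "<html> <body> Book <ol> <li> Reviewer </li> </ol> </body> </html>".split(),
    "<html> <body> Book <ol> <li> Reviewer </li> <li> Reviewer </li> </ol> </body> </html>".split(),
    "<html> <body> Book <ol> </ol> </body> </html>".split(),
]
for vector, tokens in sorted(equivalence_classes(pages).items()):
    print(vector, sorted(tokens))
```

Tokens of the page-level template share the vector (1, 1, 1), while the tokens generated once per review share (1, 2, 0), so each level of nesting ends up in its own class.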
EXALG - ECGM - HANDINV (step 3)
  The module used to detect and remove invalid LFEQs: those that are not formed by tokens associated with a type constructor
DIFFFORM (step 1) and DIFFEQ (step 4)
  The modules used to add more tokens to LFEQs by "differentiating" roles
    Ex. Name has multiple roles: one occurs in Book Name and the other occurs in Reviewer Name
  Differentiating the multiple roles:
    The tokens occur in different paths from the root in the HTML parse tree (DIFFFORM)
    The tokens occur in different positions with respect to LFEQ εe1 (DIFFEQ)
  dtoken: ex. Name5 and Name14
    regard NameA and NameB as different tokens
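Differentiating roles by path can be sketched with Python's standard `html.parser`: each token is paired with the path of open tags from the root, so two occurrences of the same word under different paths become different dtokens. The sample page and the names `PathTokenizer` and `path_dtokens` are hypothetical, and the sketch assumes well-nested markup (an unclosed tag such as a bare <br> would stay on the stack).

```python
from html.parser import HTMLParser


class PathTokenizer(HTMLParser):
    """Collect (token, path) pairs, where path lists the open tags from the root."""

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open tags
        self.dtokens = []     # (token, path) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.dtokens.append(("<%s>" % tag, "/".join(self.stack)))

    def handle_endtag(self, tag):
        self.dtokens.append(("</%s>" % tag, "/".join(self.stack)))
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        path = "/".join(self.stack)
        for word in data.split():     # one token per word of text
            self.dtokens.append((word, path))


def path_dtokens(html):
    parser = PathTokenizer()
    parser.feed(html)
    parser.close()
    return parser.dtokens


page = ("<html><body><b>Book Name</b>"
        "<ol><li><b>Reviewer Name</b></li></ol></body></html>")
print([d for d in path_dtokens(page) if d[0] == "Name"])
```

The two "Name" tokens come out with the paths html/body/b and html/body/ol/li/b, so they can be treated as different tokens when equivalence classes are computed.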
Review ECGM
  1. Find dtokens from the path in the HTML parse tree
  2. Find LFEQs
  3. Detect and remove invalid LFEQs
  4. Find dtokens from the position in valid LFEQs
Example After ECGM Process
  εe1: { <html>, <body>, <b>, Book, Name, </b>, <b>, Reviews, </b>, <ol>, </ol>, </body>, </html> }
    size: 8 → 13 tokens
  εe3: { <li>, <b>, Reviewer, Name, </b>, <b>, Rating, </b>, <b>, Text, </b>, </li> }
    size: 5 → 12 tokens
  Position: empty and non-empty
Construct Schema from ECGM
  Construct schema S' from εe1
    The 1st non-empty position is of basic type β
    The 2nd non-empty position contains εe3, generated by the set type constructor τe3
    → T(τe1) = <C11, C12, C13>, S' = <β, {S"}τe2>τe1
    → T(τe2) = S" = <C31, C32, C33, C34>
    → T(τe3) = <C31, C32, C33, C34>, <β, β, β>τe3
  S' = <β, { <β, β, β>τe3 }τe2>τe1
Equivalence Classes (Cont.)
  Pages P = { p1, …, pn }, pi = λ(TS, xi)
  TS = { τ1, …, τk }: type constructors
  Definition: all tokens of an equivalence class have the same occurrence vector
    ex. εe1: <1,1,1,1>; εe3: <1,2,1,0>
  Observation 1: tokens associated with the same type constructor τj in T that have unique roles occur in the same equivalence class (used to decide whether an EQ class is valid or not)
  Support of a token: number of pages containing it
  Size of an EQ class: number of tokens in the class
Equivalence Classes (Cont.)
  Observation 2: for real pages, an equivalence class of large size and support is usually valid
  Properties of an EQ class <t1, …, tm>:
    Ordered
    Nested: the span of all occurrences of εi is within some fixed Position_p, or does not overlap
  Observation 3: a valid equivalence class is ordered, and a pair of two valid equivalence classes is nested
Handling Invalid Equivalence Classes
  Detect the existence of invalid LFEQs using violations of the ordered and nested properties
    If found, discard some of the LFEQs and break others into smaller LFEQs
Differentiating Roles of Tokens
  By path: different roles of a token are in different paths of the HTML parse tree
  By position: different roles of a token are located at different (non-empty) positions
Equivalence Class Generation Module
  OUTPUT: a set of LFEQs of dtokens, and the pages represented as strings of dtokens
  FINDEQ: 2 parameters are used to select LFEQs (SIZETHRES, SUPTHRES)
  On the running example:
    SIZETHRES = SUPTHRES = 3
    number of iterations = 2, finding εe1 and εe3
Building Template and Extracting Values
  Input to this module is {ε1, ε2, …, εm}
  The ANALYSIS stage consists of 2 modules: CONSTTEMP and EXVAL
  CONSTTEMP, εi = { d1, d2, …, dl }
    Starts from the basic ε1 = { <html>, <body>, …, </body>, </html> }
    Recursively constructs a template Tεi corresponding to εi, and a template Tεi,p corresponding to each non-empty position p of εi
    Checks whether the set of strings PosString(εi, p) has some recognizable pattern
Example
  In the running example, PosString(εe1+, 6) is a string of dtokens for every occurrence of εe1+, which matches Pattern 5 of the table; PosString(εe1+, 10) is always a string of 0 or more occurrences of εe3+, which matches Pattern 1
  εe1: { <html>, <body>, <b>, Book, Name, </b>, <b>, Reviews, </b>, <ol>, </ol>, </body>, </html> }
Assumption
  The 4 assumptions:
(A1) A large number of tokens occurring in
template have unique roles
(A2) The EQ class derived from a type constructor
is recognized as an LFEQ
(A3) Irregularity in encoded data that leads to
invalid EQ class
(A4) The separators are around data values. In
this model, strings associated with type
construction are non-empty position
Evaluation
  Leaf attribute Am in schema Sm
    Correct: the set of values of Am in the page is equal to the set of extracted values Ae in the page
    Partially correct: the set of values of Am in the page is not equal to the set of extracted values Ae in the page, but is part of the values of Ae
    Incorrect: neither correct nor partially correct
Result
  For 18 (40%) of the input collections, the system correctly extracted all the attributes
  Around 80% of the attributes were extracted correctly
  Normalized average
    Input size <= 10
    Parameter = 3
Conclusion
  EXALG uses 2 novel concepts, equivalence classes and differentiating roles, to discover the template
  The impact of failed assumptions is limited to a few attributes
  Future work:
    Develop techniques for crawling, indexing, and providing querying support for the structured pages on the web
    Develop techniques for automatically annotating the extracted data, possibly using the words that appear in the template
References
C.H. Chang and S.C. Lui. IEPAD: Information Extraction based on Pattern Discovery. WWW2001, pp. 681-688.
Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo. RoadRunner: Towards Automatic Data Extraction from Large Web Sites. VLDB2001, 109-118
Arvind Arasu, Hector Garcia-Molina. Extracting Structured Data from Web Pages. SIGMOD2003, 337-348.