Robust Web Extraction: An Approach Based on a Probabilistic Tree-Edit Model
Nilesh Dalvi, Philip Bohannon, Fei Sha
Presented by Vinay Rambhia
Introduction
Script-generated websites share a common HTML tree structure.
Wrappers are used to extract information from such pages, e.g. an XPath expression that extracts the director information:
w1 = /html/body/div[2]/table/td[2]/text()
w1 also works for other pages with a similar structure.
Page evolution causes wrappers to break, so they are expensive to maintain.
Other wrappers:
w2 = //div[@class='content']/*/td[2]/text()
w3 = //table[@width='80%']/td[2]/text()
w4 = //text()[psib::*[1][text()='director']]
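The brittleness of a positional wrapper such as w1 can be illustrated with a minimal sketch. The two page snippets, the inserted ad block, and the extracted name are hypothetical, and the paths are adapted to the limited XPath subset of Python's `xml.etree.ElementTree` (which has no `text()` step), so they only approximate w1 and w2.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, and the same page after a redesign inserted an ad block.
OLD_PAGE = (
    "<html><body>"
    "<div class='nav'>menu</div>"
    "<div class='content'><table><tr><td>director</td><td>Jane Doe</td></tr></table></div>"
    "</body></html>"
)
NEW_PAGE = (
    "<html><body>"
    "<div class='nav'>menu</div>"
    "<div class='ad'>banner</div>"
    "<div class='content'><table><tr><td>director</td><td>Jane Doe</td></tr></table></div>"
    "</body></html>"
)

# Approximations of w1 (purely positional) and w2 (anchored on a class attribute).
W1 = "./body/div[2]/table/tr/td[2]"
W2 = ".//div[@class='content']/table/tr/td[2]"

def extract(page, wrapper):
    """Return the text of every node the wrapper selects on the page."""
    root = ET.fromstring(page)
    return [node.text for node in root.findall(wrapper)]

old_w1 = extract(OLD_PAGE, W1)  # positional wrapper on the old page
new_w1 = extract(NEW_PAGE, W1)  # positional wrapper after the page changed
new_w2 = extract(NEW_PAGE, W2)  # attribute-anchored wrapper after the change
```

In this toy case the positional wrapper stops matching once a sibling `div` is inserted, while the attribute-anchored one keeps working.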
This paper uses temporal snapshots of web pages to develop a probabilistic tree-edit model, and uses this model to improve wrapper construction.
The method computes the model's probabilities efficiently, in time quadratic in the size of the tree.
When applied to IMDB, the resulting wrappers were 86% robust, whereas traditional wrappers were 40% robust.
The change model is defined in terms of a conditional transducer process π.
When a forest S is given to the π process, it is converted into a forest T.
The π process is composed of two sub-processes, $\pi_{ins}$ and $\pi_{ds}$.
To summarize, the generative process π is characterized by the following parameters:
$$\theta = \left(p_{stop},\ \{p_{del}(l)\},\ \{p_{ins}(l)\},\ \{p_{sub}(l_1, l_2)\}\right)$$
for $l, l_1, l_2 \in \Sigma$, along with the following conditions:
• $0 < p_{stop} < 1$
• $0 \le p_{del}(l) \le 1$
• $p_{ins}(l) \ge 0$, $\sum_{l} p_{ins}(l) = 1$
• $p_{sub}(l_1, l_2) \ge 0$, $\sum_{l_2} p_{sub}(l_1, l_2) = 1$
…….. Eq(A)
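The conditions of Eq(A) can be checked mechanically. The following is a minimal sketch, assuming the parameters are stored in plain dictionaries keyed by label; the `Theta` class, the function name `satisfies_eq_a`, and the two-label alphabet are illustrative choices, not part of the paper.

```python
import math
from dataclasses import dataclass

@dataclass
class Theta:
    p_stop: float
    p_del: dict  # label -> deletion probability
    p_ins: dict  # label -> insertion probability (sums to 1 over labels)
    p_sub: dict  # (l1, l2) -> substitution probability (sums to 1 over l2)

def satisfies_eq_a(theta, labels, tol=1e-9):
    """Check the four constraints of Eq(A) over the label alphabet."""
    if not 0 < theta.p_stop < 1:
        return False
    if any(not 0 <= theta.p_del[l] <= 1 for l in labels):
        return False
    if any(theta.p_ins[l] < 0 for l in labels):
        return False
    if not math.isclose(sum(theta.p_ins[l] for l in labels), 1.0, abs_tol=tol):
        return False
    for l1 in labels:
        row = [theta.p_sub[(l1, l2)] for l2 in labels]
        if any(p < 0 for p in row):
            return False
        if not math.isclose(sum(row), 1.0, abs_tol=tol):
            return False
    return True

# Hypothetical two-label alphabet with uniform insertion/substitution probabilities.
LABELS = ["a", "b"]
theta_ok = Theta(
    p_stop=0.5,
    p_del={"a": 0.2, "b": 0.3},
    p_ins={"a": 0.5, "b": 0.5},
    p_sub={(x, y): 0.5 for x in LABELS for y in LABELS},
)
theta_bad = Theta(
    p_stop=1.0,  # violates 0 < p_stop < 1
    p_del=theta_ok.p_del, p_ins=theta_ok.p_ins, p_sub=theta_ok.p_sub,
)
```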
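Archival data contains pairs (S, T), where S is an old version of a page and T is a newer version.
The model is specified in terms of a set of parameters θ.
We want to find the θ* that maximizes the likelihood of the archival data:
$$\theta^* = \arg\max_{\theta} \prod_{(S,T) \in \text{ArchivalData}} P_\theta(T \mid S)$$
where $P_\theta(T \mid S)$ is the probability that π transforms S into T.
Computing Transformation Probability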
The transducer π performs a sequence of edit operations consisting of insertions, deletions and substitutions to transform a tree S into another tree T.
Dynamic programming is used to compute this probability, because many different edit sequences can transform S into T.
Let $DP_1(F_s, F_t)$ denote the probability that $\pi(F_s) = F_t$, taking both $\pi_{ins}$ and $\pi_{ds}$ into account.
For a node $v$ of $F_t$ there are two cases:
i. The node $v$ was the result of an insertion by the $\pi_{ins}$ operator. Let $p$ be the probability that $\pi_{ins}$ inserts the node $v$ in $F_t - v$ to form $F_t$. Then the probability of this case is $DP_1(F_s, F_t - v) \cdot p$.
ii. The node $v$ was the result of a substitution. The probability of this case is $DP_2(F_s, F_t)$.
Hence, we have
$$DP_1(F_s, F_t) = DP_2(F_s, F_t) + p \cdot DP_1(F_s, F_t - v) \qquad \text{…….. Eq(1)}$$
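Let $DP_2(F_s, F_t)$ denote the probability that $\pi_{ds}(F_s) = F_t$.
For a node $u$ of $F_s$ there are two cases:
i. $v$ was substituted for $u$. In this case, we must have $F_s - [u]$ transform to $F_t - [v]$ and $\lfloor u \rfloor$ transform to $\lfloor v \rfloor$. Denoting $p_{sub}(label(u), label(v))$ with $p_1$, the total probability of this case is $p_1 \cdot DP_1(F_s - [u], F_t - [v]) \cdot DP_1(\lfloor u \rfloor, \lfloor v \rfloor)$.
ii. $v$ was substituted for some node other than $u$, so $u$ was deleted. Denoting $p_{del}(label(u))$ with $p_2$, the probability of this case is $p_2 \cdot DP_2(F_s - u, F_t)$.
Hence, we have
$$DP_2(F_s, F_t) = p_1 \cdot DP_1(F_s - [u], F_t - [v]) \cdot DP_1(\lfloor u \rfloor, \lfloor v \rfloor) + p_2 \cdot DP_2(F_s - u, F_t) \qquad \text{…….. Eq(2)}$$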
Let $T_1$ be the tree with the nodes $a$ and $b$, and let $T_2$ be the tree with the single node $c$. Let us compute the probability that $\pi(T_1) = T_2$, which is denoted by $DP_1(T_1, T_2)$. Applying Eq(1) we get
$$DP_1(T_1, T_2) = DP_2(T_1, T_2) + p_{ins}(c) \cdot DP_1(T_1, \emptyset)$$
Let $T_3$ denote the tree with the single node $b$. Then
$$DP_2(T_1, T_2) = p_{sub}(a, c) \cdot DP_1(\emptyset, \emptyset) \cdot DP_1(T_3, \emptyset) + p_{del}(a) \cdot DP_2(T_3, T_2)$$
To compute $DP_2(T_3, T_2)$, we get
$$DP_2(T_3, T_2) = p_{sub}(b, c) \cdot DP_1(\emptyset, \emptyset) \cdot DP_1(\emptyset, \emptyset) + p_{del}(b) \cdot DP_2(\emptyset, T_2)$$
Total probability:
$$DP_1(T_1, T_2) = p_{sub}(a, c) \cdot p_{del}(b) \cdot p_{stop}^2 + p_{sub}(b, c) \cdot p_{del}(a) \cdot p_{stop}^2 + p_{del}(a) \cdot p_{del}(b) \cdot p_{ins}(c) \cdot p_{stop}$$
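The worked example can be replayed numerically as a sanity check. This is a minimal sketch: all parameter values are made up, and the base cases ($DP_1(\emptyset, \emptyset) = p_{stop}$, $DP_1(T_3, \emptyset) = p_{del}(b) \cdot p_{stop}$, $DP_1(T_1, \emptyset) = p_{del}(a) \cdot p_{del}(b) \cdot p_{stop}$, $DP_2(\emptyset, T_2) = 0$) are not stated on the slides but inferred from the final expression, so they should be read as assumptions.

```python
import math

# Hypothetical parameter values, chosen only for illustration.
p_stop = 0.5
p_del = {"a": 0.2, "b": 0.3}
p_ins = {"c": 0.1}
p_sub = {("a", "c"): 0.4, ("b", "c"): 0.25}

# Base cases (assumptions inferred from the total probability in the example).
dp1_empty_empty = p_stop
dp1_T3_empty = p_del["b"] * p_stop
dp1_T1_empty = p_del["a"] * p_del["b"] * p_stop
dp2_empty_T2 = 0.0

# Eq(2) applied to (T3, T2), then to (T1, T2).
dp2_T3_T2 = (p_sub[("b", "c")] * dp1_empty_empty * dp1_empty_empty
             + p_del["b"] * dp2_empty_T2)
dp2_T1_T2 = (p_sub[("a", "c")] * dp1_empty_empty * dp1_T3_empty
             + p_del["a"] * dp2_T3_T2)

# Eq(1) applied to (T1, T2).
dp1_T1_T2 = dp2_T1_T2 + p_ins["c"] * dp1_T1_empty

# Total probability as written at the end of the example.
closed_form = (p_sub[("a", "c")] * p_del["b"] * p_stop ** 2
               + p_sub[("b", "c")] * p_del["a"] * p_stop ** 2
               + p_del["a"] * p_del["b"] * p_ins["c"] * p_stop)
```

Under these assumptions the step-by-step value `dp1_T1_T2` and `closed_form` agree up to floating-point rounding.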
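$$\theta^* = \arg\max_{\theta} \sum_{n=1}^{N} \log P_\theta(T_n \mid S_n)$$
It is difficult to compute θ* directly, so we compute it by gradient ascent:
$$\theta_{t+1} = \theta_t + \eta\, g(\theta_t) \qquad \text{….. Eq(3)}$$
$$g(\theta) = \frac{\partial \log \ell(\theta)}{\partial \theta} = \sum_{n=1}^{N} \frac{\partial \log P(T_n \mid S_n)}{\partial \theta}$$
θ has to satisfy Eq(A), so we use the variable reparameterization
$$\theta_{ij} = \frac{e^{\alpha_{ij}}}{\sum_{j'=1}^{N} e^{\alpha_{ij'}}}$$
Eq(3) becomes
$$\alpha_{t+1} = \alpha_t + \eta\, g(\alpha_t)$$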
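The effect of the reparameterization can be sketched on a toy problem. The example below does not use the tree-edit likelihood; as a stand-in it maximizes a simple multinomial log-likelihood $\sum_j c_j \log \theta_j$ over one normalized row of θ, with made-up counts, step size, and iteration count. Because the update is applied to the unconstrained α, the normalization constraint holds at every step.

```python
import math

def softmax(alpha):
    """Map unconstrained alpha to a probability vector theta."""
    m = max(alpha)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in alpha]
    total = sum(exps)
    return [e / total for e in exps]

def gradient(alpha, counts):
    """Gradient of sum_j counts[j] * log(theta_j) with respect to alpha."""
    theta = softmax(alpha)
    n = sum(counts)
    return [c - n * t for c, t in zip(counts, theta)]

def gradient_ascent(counts, eta=0.01, steps=5000):
    """Iterate alpha <- alpha + eta * g(alpha), starting from alpha = 0."""
    alpha = [0.0] * len(counts)
    for _ in range(steps):
        g = gradient(alpha, counts)
        alpha = [a + eta * gi for a, gi in zip(alpha, g)]
    return softmax(alpha)

# Hypothetical observation counts for three labels.
COUNTS = [6, 3, 1]
theta = gradient_ascent(COUNTS)
```

For this toy objective the maximizer is the vector of relative frequencies, which gives a simple way to check that the ascent converged.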
We use a bottom-up algorithm that starts from a general XPath and specializes it until it matches only the target node.
w0 = //table/*/td/text()
Example specializations of w0:
//table/tr/td/text()
//table[@bgcolor='red']/*/td/text()
//table/*/td[2]/text()
The algorithm maintains a set P of partial wrappers, which have recall = 1 and precision < 1.
The algorithm applies specialization steps to the XPaths in P to derive new XPaths, until precision becomes 1.
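The specialization loop can be sketched as follows. This is a simplified illustration, not the paper's algorithm: the only specialization step is adding a positional predicate, the page is hypothetical, candidates are not ranked by robustness, and the paths use the XPath subset of Python's `xml.etree.ElementTree`. The names `specialize` and `generate_wrappers` are illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical page; the target is the cell holding the director's name.
PAGE = ET.fromstring(
    "<html><body><table bgcolor='red'>"
    "<tr><td>director</td><td>Jane Doe</td></tr>"
    "<tr><td>year</td><td>1999</td></tr>"
    "</table></body></html>"
)
TARGET = PAGE.find("./body/table/tr[1]/td[2]")

def to_xpath(steps):
    """Render a tuple of (tag, predicate) steps as an ElementTree path."""
    return ".//" + "/".join(tag + pred for tag, pred in steps)

def specialize(steps, max_pos=2):
    """Add one positional predicate to any step that has none yet."""
    out = []
    for i, (tag, pred) in enumerate(steps):
        if pred == "":
            for k in range(1, max_pos + 1):
                out.append(steps[:i] + ((tag, f"[{k}]"),) + steps[i + 1:])
    return out

def generate_wrappers(root, target, w0, max_rounds=3):
    partial = {w0}    # set P: recall = 1, precision < 1
    complete = set()  # wrappers that match only the target
    for _ in range(max_rounds):
        next_partial = set()
        for wrapper in partial:
            for cand in specialize(wrapper):
                matched = root.findall(to_xpath(cand))
                if not any(node is target for node in matched):
                    continue  # recall dropped below 1: discard
                if len(matched) == 1:
                    complete.add(cand)  # precision reached 1
                else:
                    next_partial.add(cand)
        partial = next_partial
    return sorted(to_xpath(w) for w in complete)

W0 = (("table", ""), ("tr", ""), ("td", ""))
wrappers = generate_wrappers(PAGE, TARGET, W0)
```

Every path in `wrappers` selects exactly the target node on this page; choosing among them is where a robustness score would come in.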
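$$\mathrm{Rob}_{X,\theta}(\phi) = \sum_{Y \,:\, \phi \models (X, Y)} P_\theta(Y \mid X)$$
where the sum ranges over the changed versions Y of page X on which the wrapper φ still works.
Algorithm for calculating robustness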
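The robustness sum can be illustrated on a finite toy case. In this sketch the change model is replaced by three hand-picked future versions of a page with made-up probabilities, whereas the definition sums over all Y weighted by $P_\theta(Y \mid X)$; the page fragments, the expected value, and the wrapper paths (written for Python's `xml.etree.ElementTree`) are all hypothetical.

```python
import xml.etree.ElementTree as ET

NAV = "<div class='nav'>menu</div>"
AD = "<div class='ad'>banner</div>"
CONTENT = ("<div class='content'><table><tr>"
           "<td>director</td><td>Jane Doe</td></tr></table></div>")

def page(divs):
    """Build a parsed page from a list of div fragments."""
    return ET.fromstring("<html><body>" + "".join(divs) + "</body></html>")

# Hypothetical future versions Y of page X, with made-up values for P(Y | X).
FUTURE_VERSIONS = [
    (page([NAV, CONTENT]), 0.5),      # page unchanged
    (page([NAV, AD, CONTENT]), 0.3),  # an ad block is inserted
    (page([CONTENT]), 0.2),           # the navigation block is deleted
]

def robustness(wrapper, versions, expected="Jane Doe"):
    """Sum P(Y | X) over the versions on which the wrapper still works."""
    total = 0.0
    for root, prob in versions:
        extracted = [node.text for node in root.findall(wrapper)]
        if extracted == [expected]:
            total += prob
    return total

W1 = "./body/div[2]/table/tr/td[2]"             # positional wrapper
W2 = ".//div[@class='content']/table/tr/td[2]"  # attribute-anchored wrapper

rob_w1 = robustness(W1, FUTURE_VERSIONS)
rob_w2 = robustness(W2, FUTURE_VERSIONS)
```

On this toy distribution the attribute-anchored wrapper scores higher than the positional one, since it survives both the insertion and the deletion.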
Change model
Generating Robust Wrappers
Evaluation of Model Learner