Content Extraction from News Pages Using Particle Swarm Optimization on Linguistic and Structural Features
Cai-Nicolas Ziegler, Michal Skubacz
Siemens AG, Corporate Research & Technologies
2007 IEEE/WIC/ACM International Conference on Web Intelligence
Introduction (1/2)
Actual textual information represents only a minor fraction of a news page
Pure content is enriched with page clutter:
Non-related images
Advertisements
Copyright notices
Problematic for online reputation monitoring platforms and search engines
Introduction (2/2)
Fully automated extraction of content from HTML pages
Extract coherent blocks of text from HTML pages:
DOM parsing → compute linguistic and structural features for each block → learn feature thresholds using PSO → decide whether each block is signal or noise
Related Work
Wrapper induction systems: require human labeling of relevant passages for each site template
Wrapper generators: fully automated, but only for web pages containing numerous recurring entries of like style and format
Domain-oriented extraction: human effort is required for adaptation to novel domains
Structural approaches: make use of the organization of HTML pages and take into account visual cues of Web page rendering
Approach Outline
Input: plain HTML page
Output: text document containing the extracted content
Applicable to any type of news page
No human intervention once feature thresholds have been learned
Merging and Selecting Text Blocks
Build a DOM model of the current document, aware of the semantics of HTML elements
Apply two operations:
Removal: only the element occurrence itself is discarded (with or without whitespace insertion)
Pruning: the respective element and its subtree are discarded
DOM Model of an HTML Fragment
<p>Mr. Bush said in a <a href="press release.html">press release</a> yesterday</p>
[Figure: rendered web page text alongside the DOM tree: root p with text "Mr. Bush said in a ... yesterday" and child anchor a ("press release", href "press release.html")]
After removing and pruning elements, a DOM tree is constructed from the processed document. Tree traversal then selects only the merged text nodes, yielding the text blocks.
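The removal and pruning operations can be sketched on a toy DOM. The Node class, the tag sets, and merge_text below are illustrative stand-ins for a real DOM implementation, not the authors' code: pruned elements disappear with their whole subtree, while every other element is "removed", i.e., its tag vanishes and only its text survives, with whitespace inserted for tags such as br.

```python
class Node:
    def __init__(self, tag, text="", children=None):
        self.tag = tag                  # element name, or "#text"
        self.text = text                # text carried by a "#text" node
        self.children = children or []

PRUNE = {"script", "style", "select"}   # pruning: element and subtree discarded
REMOVE_WS = {"br", "td", "li"}          # removal with whitespace insertion

def merge_text(node):
    """Merge all text below `node`, applying removal and pruning."""
    if node.tag in PRUNE:
        return ""                       # pruning: whole subtree dropped
    text = node.text
    for child in node.children:
        text += merge_text(child)       # removal: the element tag vanishes
    if node.tag in REMOVE_WS:
        text += " "                     # removal with whitespace insertion
    return text
```

Applied to the fragment from the slide, the p element yields the single text block "Mr. Bush said in a press release yesterday".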
Computing Features
Linguistic features: average sentence length ↑, number of sentences ↑, character distribution ↑, stop-word ratio ↑
Structural features: anchor ratio ↓, format tag ratio ↑, list element ratio ↓, text structuring ratio ↑
Thresholds for each unique feature are combined conjunctively: only when all feature thresholds are satisfied is the text block classified as signal.
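A minimal sketch of such a conjunctive classifier, assuming a small feature subset; the stop-word list, the threshold values, and the threshold directions are illustrative placeholders, not the values learned in the paper.

```python
# Illustrative stop-word list (the real system would use a per-language list).
STOP_WORDS = {"the", "a", "in", "of", "and", "is", "to"}

def features(text, n_anchor_chars=0):
    """Compute a few of the named features for one text block."""
    sentences = [s for s in text.split(".") if s.strip()]
    words = text.split()
    return {
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "num_sentences": len(sentences),
        "stop_word_ratio": sum(w.lower() in STOP_WORDS for w in words)
                           / max(len(words), 1),
        "anchor_ratio": n_anchor_chars / max(len(text), 1),
    }

# (threshold, direction): +1 means the feature must reach the threshold (↑),
# -1 means it must stay below it (↓). Values are made up for illustration.
THRESHOLDS = {
    "avg_sentence_len": (5.0, +1),
    "num_sentences": (1, +1),
    "stop_word_ratio": (0.1, +1),
    "anchor_ratio": (0.5, -1),
}

def is_signal(feats):
    """Signal only if *all* thresholds are satisfied (conjunctive)."""
    return all(
        feats[name] >= t if d > 0 else feats[name] <= t
        for name, (t, d) in THRESHOLDS.items()
    )
```

A prose block passes every threshold, while a navigation block consisting entirely of anchor text fails on the anchor ratio alone and is classified as noise.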
Benchmark Composition
610 human-labeled documents in 5 different languages: English (265), German (195), Spanish (50), French (50), Italian (50)
From each source, about 2-4 articles were taken
Hand-labeled by 3 labelers; human labeling would not achieve 100% agreement
Example: the date the article was written → include it or not?
Computing Similarity
sim_σ(i, j): compares two documents d_i^σ and d_j^σ
d_i^σ and d_j^σ represent the textual content extracted from an HTML news article page a by extractor σ
Uses a variant of cosine similarity over document vectors (terms of length > 1)
Similarity 1 = identical, 0 = maximum dissent
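The measure can be sketched as follows, assuming plain cosine similarity over term-frequency vectors of whitespace tokens longer than one character; the exact variant used in the paper is not spelled out on the slide.

```python
import math
from collections import Counter

def term_vector(text):
    """Term-frequency vector, keeping only terms of length > 1."""
    return Counter(t.lower() for t in text.split() if len(t) > 1)

def cosine_sim(a, b):
    """1.0 for identical extractions, 0.0 for maximally dissenting ones."""
    va, vb = term_vector(a), term_vector(b)
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```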
On Particle Swarm Optimization (1/2)
The benchmark set of 610 labeled documents was subdivided into a training set and a test set
This enables the determination of thresholds that allow good discrimination between true article content and page clutter
Determining optimal values for eight pivot parameters → combinatorial explosion!
Hence: Particle Swarm Optimization (PSO)
On Particle Swarm Optimization (2/2)
Particle: a set of thresholds p_i for the 8 features, i.e., a point in 8-dimensional space
Every p_i has a velocity vector w_i and a memory of its own best state ρ_i
In each step, particles change their position in hyperspace, attracted by their own best state ρ_i and the current population P_k's best particle p_b
τ: cognitive component, the degree to which particles move towards their own best state
φ: social component, the extent to which they follow the population's best particle
Both assume values around 2 (constant → 1)
Fitness function: the particle's classification accuracy
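A generic PSO step in the slide's notation can be sketched as below. The fitness function is passed in as a parameter (in the paper it is the particle's classification accuracy on the training set); carrying the previous velocity over with weight 1 and the toy fitness in the usage are assumptions of this sketch.

```python
import random

DIM, TAU, PHI = 8, 2.0, 2.0   # dimensions; cognitive and social weights

def pso_step(particles, velocities, best_own, best_own_fit, fitness):
    """Advance the swarm one step; return the population best and its fitness."""
    # evaluate particles and update each personal best rho_i
    for i, p in enumerate(particles):
        f = fitness(p)
        if f > best_own_fit[i]:
            best_own[i], best_own_fit[i] = p[:], f
    # population best p_b of the current population P_k
    b = max(range(len(particles)), key=lambda i: best_own_fit[i])
    p_b = best_own[b][:]
    # velocity and position update: cognitive pull towards rho_i and
    # social pull towards p_b, each scaled by a fresh random factor;
    # the previous velocity is carried over unchanged (weight 1)
    for i, p in enumerate(particles):
        for d in range(DIM):
            velocities[i][d] += (
                TAU * random.random() * (best_own[i][d] - p[d])
                + PHI * random.random() * (p_b[d] - p[d])
            )
            p[d] += velocities[i][d]
    return p_b, best_own_fit[b]
```

Run on a toy fitness (distance to a fixed target point), the returned best fitness is monotonically non-decreasing over iterations, since personal bests are only ever replaced by better states.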
Training and Empirical Evaluation
Before testing, the training set was used to learn near-optimal thresholds via PSO
70% training set vs. 30% test set split on the 610 documents
Training the classifier: population size n = 100; the optimization procedure stops after 100 populations have been generated
Training the Classifier
When cognition is favored over social behavior, bumps in the fitness curve become more pronounced, due to the increased degree of randomness
Best particle: generation 32, cognition > social behavior, accuracy 0.827
Tuning the population size n showed no significant differences
The procedure converges quickly (after about 10 iterations)
Performance Benchmarking
Upper bound HL: a sample of 70 documents from the full test set was labeled by two persons in parallel; their mutual agreement → 0.909
FE: the automated content extraction system
PR: pruning and removing only, without feature matching (all blocks accepted), operating on the DOM tree model
TB: baseline that simply extracts all text nodes within the original HTML document
Feature Entropy Evaluation
Goal: understand the utility and impact of the various features
Evaluations on the complete test set, testing each feature in isolation
Linguistic features exhibit lower, i.e., better, entropy than structural ones
Best single feature reaches 0.77; combining it with the 2nd- and 3rd-best features: 0.77 → 0.857
List tag ratio and text structuring ratio perform about equally → poor performance
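The slides rank features by entropy without spelling out the measure; one common choice, assumed here, is the weighted class entropy after splitting the blocks at a feature threshold, the impurity measure familiar from decision-tree learning. A feature that cleanly separates signal from noise scores 0; a useless one scores close to 1.

```python
import math

def class_entropy(labels):
    """Shannon entropy of a binary label list (1 = signal, 0 = noise)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def split_entropy(values, labels, threshold):
    """Weighted class entropy after thresholding one feature in isolation."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    return (len(left) / n) * class_entropy(left) \
         + (len(right) / n) * class_entropy(right)
```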
Outlook and Conclusion
Content extraction from news articles on the Web is a non-trivial, indispensable task for many applications
Presented a novel paradigm: fully automated classification of text blocks within HTML pages as signal or noise, by means of linguistic and structural features
Accuracy close to human judgment
Future work: include more features, e.g., part-of-speech information; threshold-defining techniques from machine learning, e.g., C4.5 decision trees
The system will be incorporated as a processing component into a practical reputation monitoring platform used by Siemens Corporate Communications