Semi-automatic Product Attribute Extraction from Store Website
Yan Liu, Carnegie Mellon University
Sep 2, [email protected]
Example from Dick’s Sporting Goods
  Figure: product webpage showing the product name, description, and features (free text and structured data)
Applications
  Direct application
    Product recommendation systems for customers
    Price estimates for auction
    Sales amount prediction
  More general applications
    Document organization
    Email prioritization
    Question answering
    And many more text mining tasks
Relationship with Previous Work
  Information extraction
    Extract from the documents salient facts about prespecified types of events, entities, or relationships
    Different from information retrieval
  Previous work
    Finite state machines
    Sliding windows
    Sequential models, such as HMMs or CRFs
    Association and clustering
  Major challenges
    Little training data
    Unclear attribute definition
Outline
  Introduction
  General framework
  Detailed algorithms
  Experiment results
  Conclusion and discussion
General Framework
  Attribute Identification (semi-supervised learning)
  Name-value Assignment (statistical and grammatical association)
  Feedback (active learning)
  Example
    Input (free text): 9.68-lb total weight (4.4-kg)
    Output (structured data): weight: 9.68-lb, 4.4-kg
    weight: CD-lb, weight: CD-kg, …
Attribute Identification
  Initial label acquisition
    Template matching
    Knowledge database
  Semi-supervised learning
    Yarowsky’s algorithm
    Co-training
    Co-EM
    Co-boosting
    Graph-based methods
  Phrase identification
    Statistical associations between adjacent words
    Heuristic grammatical rules
Attribute Identification (1): Initial Label Acquisition
  Positive labels
    Template matching
      Extracted templates from data with special format
      Noisy data
    Knowledge database
      Measure units: length, weight, volume, etc.
      Material
      Country
      Color
  Negative labels
    Partial stop word list
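A minimal sketch of how the seed labels above might be assigned, assuming a tiny knowledge database and stop word list. The entries in `UNITS`, `COLORS`, and `STOP_WORDS`, the number-plus-unit template, and the function name `seed_labels` are all illustrative assumptions, not the resources used in the talk.

```python
import re

# Hypothetical knowledge database: a few measure units and colors (assumed entries).
UNITS = {"lb": "weight", "kg": "weight", "oz": "weight",
         "inch": "length", "inches": "length", "ft": "length"}
COLORS = {"red", "blue", "black", "white", "green"}
# Hypothetical partial stop word list, used as the source of negative labels.
STOP_WORDS = {"the", "a", "an", "of", "and", "for", "with", "to"}

# Assumed template: a number followed by a unit, e.g. "9.68-lb" or "12 inches".
NUMBER_UNIT = re.compile(r"(\d+(?:\.\d+)?)[- ]([A-Za-z]+)")


def seed_labels(text):
    """Return (positives, negatives) seed labels for one product description."""
    positives = []
    for number, unit in NUMBER_UNIT.findall(text):
        attribute = UNITS.get(unit.lower())
        if attribute is not None:  # keep only units found in the knowledge database
            positives.append((attribute, f"{number}-{unit.lower()}"))
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    positives += [("color", token) for token in tokens if token in COLORS]
    negatives = [token for token in tokens if token in STOP_WORDS]
    return positives, negatives
```

On the running example, `seed_labels("9.68-lb total weight (4.4-kg)")` yields two `weight` values and no negative labels. Real templates extracted from specially formatted data would be noisier than this single pattern.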
Attribute Identification (1): Semi-supervised Learning
  Co-training [Blum & Mitchell, 1998; Collins & Singer, 1999]
    Separation of two views
      Contextual features
      Spelling features
    Two kinds of features
      Stemmed words (Porter stemmer)
      POS tagging (Brill’s tagger)
  Algorithm pseudocode
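As a rough illustration of co-training over two views, the sketch below follows the general Blum and Mitchell style loop: each view trains on the current labeled set and promotes its most confident unlabeled examples. It is not the talk's pseudocode. The count-based voting classifier, the view names `spelling` and `context`, and the parameters `rounds`, `per_round`, and `threshold` are assumptions made for the example.

```python
from collections import Counter, defaultdict


def train(labeled, view):
    """Count how often each feature of one view co-occurs with each label."""
    counts = defaultdict(Counter)
    for example, label in labeled:
        for feature in example[view]:
            counts[feature][label] += 1
    return counts


def predict(counts, example, view):
    """Return (label, confidence), or (None, 0.0) if no feature was seen before."""
    votes = Counter()
    for feature in example[view]:
        votes.update(counts.get(feature, {}))
    total = sum(votes.values())
    if total == 0:
        return None, 0.0
    label, top = votes.most_common(1)[0]
    return label, top / total


def co_train(labeled, unlabeled, views=("spelling", "context"),
             rounds=5, per_round=1, threshold=0.75):
    """Grow the labeled set by letting each view label examples for the other."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        added = False
        for view in views:
            counts = train(labeled, view)
            scored = []
            for example in pool:
                label, confidence = predict(counts, example, view)
                if label is not None and confidence >= threshold:
                    scored.append((confidence, example, label))
            scored.sort(key=lambda item: item[0], reverse=True)
            for _, example, label in scored[:per_round]:
                labeled.append((example, label))
                pool.remove(example)
                added = True
        if not added:  # neither view could label anything new
            break
    return labeled, pool
```

Each example is a dict with one set of features per view, for instance `{"spelling": {"has_digit"}, "context": {"prev=weighs"}}`. An example labeled through its spelling features then teaches the context view a new cue, which is the point of keeping the two views separate.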
Attribute Identification (1): Phrase Identification
  Phrase identification
    Difference from chunking
      Label propagation
      Category dependent
    Examples: "display team colors", "up to 12 inches"
  Statistical association
    Information gain
      $$IG(X, Y) = H(X) - H(X \mid Y)$$
    Mutual information
      $$MI(X; Y) = \frac{P(X, Y)}{P(X)\,P(Y)}$$
    Yule’s statistic
      $$Yule(X, Y) = \frac{C(X, Y)\,C(\bar{X}, \bar{Y}) - C(X, \bar{Y})\,C(\bar{X}, Y)}{C(X, Y)\,C(\bar{X}, \bar{Y}) + C(X, \bar{Y})\,C(\bar{X}, Y)}$$
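The three association measures can be computed from a 2x2 contingency table of counts for an adjacent word pair. The sketch below is illustrative only: the function and argument names are assumptions, the counts are supplied by the caller, and the table is assumed to have nonzero row and column totals.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def association_scores(n_xy, n_x_noty, n_notx_y, n_notx_noty):
    """Score a word pair (X, Y) from its 2x2 co-occurrence counts.

    Assumes X and Y each occur at least once (nonzero marginals).
    """
    n = n_xy + n_x_noty + n_notx_y + n_notx_noty
    p_x = (n_xy + n_x_noty) / n
    p_y = (n_xy + n_notx_y) / n

    # Mutual information in ratio form; its logarithm is the usual pointwise MI.
    mi = (n_xy / n) / (p_x * p_y)

    # Information gain: IG(X, Y) = H(X) - H(X | Y).
    h_x = entropy([p_x, 1 - p_x])
    h_x_given_y = 0.0
    for with_x, without_x in ((n_xy, n_notx_y), (n_x_noty, n_notx_noty)):
        column = with_x + without_x
        if column:
            h_x_given_y += (column / n) * entropy([with_x / column, without_x / column])
    ig = h_x - h_x_given_y

    # Yule's statistic ranges from -1 to 1; 0 means no association.
    ad, bc = n_xy * n_notx_noty, n_x_noty * n_notx_y
    yule = (ad - bc) / (ad + bc) if ad + bc else 0.0
    return {"mi": mi, "ig": ig, "yule": yule}
```

For independent words all three measures sit at their neutral values: a ratio of 1, an information gain of 0, and a Yule's statistic of 0.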
Name-value Assignment
  Combination of three information sources
    Semantic association
      Knowledge database
      Attribute name generation and pair assignment
    Grammatical association
      Parsing tree (Minipar)
      Attribute name/value generation
    Statistical association scores
      Yule’s statistic (category dependent)
      Pair assignment
  Other association sources
    WordNet
User Feedback
  Clustering-based active learning
    Novel attribute identification
    Merging and splitting attributes
    Better use of labeled examples
  Clustering algorithm
    Sparse data problem
    Multiple clustering algorithms
  Cluster selection
    Within-cluster coherence
    Novelty-based measurement
User Feedback: Clustering Algorithm
  Latent semantic indexing (LSI) [Deerwester et al., 1990]
    Singular value decomposition on the term-document matrix
    Mapping the words into hidden semantic concepts
    Similarity measure: cosine similarity
  Clustering algorithms using CLUTO
    K-means
    Bisecting K-means
    Agglomerative algorithm
      Single linkage
      Complete linkage
      Average linkage
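To make the three linkage options concrete, here is a small pure-Python sketch of agglomerative clustering under cosine similarity. It is a stand-in for illustration, not CLUTO, and it omits the LSI step: the input vectors are assumed to be term vectors that may already have been reduced by SVD. The function names are assumptions.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def linkage_similarity(c1, c2, vectors, linkage):
    """Similarity between two clusters, given as lists of vector indices."""
    sims = [cosine(vectors[i], vectors[j]) for i in c1 for j in c2]
    if linkage == "single":       # most similar pair across the clusters
        return max(sims)
    if linkage == "complete":     # least similar pair across the clusters
        return min(sims)
    return sum(sims) / len(sims)  # average linkage


def agglomerative(vectors, k, linkage="average"):
    """Merge the two most similar clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > max(k, 1):
        def similarity(pair):
            return linkage_similarity(clusters[pair[0]], clusters[pair[1]],
                                      vectors, linkage)
        a, b = max(combinations(range(len(clusters)), 2), key=similarity)
        clusters[a] += clusters.pop(b)  # a < b, so index a is still valid
    return clusters
```

This brute-force version recomputes every pairwise similarity at each merge, which is fine for a handful of vectors but far slower than a dedicated toolkit on real data.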
User Feedback: Cluster Selection
  Novel concepts
    Major difference from previous task
    Supervised novelty detection is difficult
  Tradeoff between novelty and relevancy
    Recently studied by the IR community [Carbonell and Goldstein, 1995; Zhang et al., 2003; Zhai et al., 2004]
  Cluster selection criterion using maximal marginal relevance (MMR)
    $$V_{MMR} = \lambda \max_j \mathrm{sim}(C_i, C_j) - (1 - \lambda) \max_j \mathrm{sim}(C_i, D_j)$$
  Similarity measure
    Cosine similarity
    KL-divergence
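A hedged sketch of MMR-style cluster selection follows. It uses the generic MMR form, which rewards similarity to one set of vectors and penalizes similarity to another. The roles given to the two sets here (`relevant` for what the user cares about, `covered` for what has already been labeled) are assumptions, since the slide does not spell them out. Cosine similarity is used, and the names and the default `lam=0.5` are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr_score(candidate, relevant, covered, lam=0.5):
    """lam * (best match in relevant) - (1 - lam) * (best match in covered)."""
    relevance = max((cosine(candidate, r) for r in relevant), default=0.0)
    redundancy = max((cosine(candidate, c) for c in covered), default=0.0)
    return lam * relevance - (1 - lam) * redundancy


def select_cluster(centroids, relevant, covered, lam=0.5):
    """Return the index of the cluster centroid with the highest MMR score."""
    return max(range(len(centroids)),
               key=lambda i: mmr_score(centroids[i], relevant, covered, lam))
```

With `lam` close to 1 the selection favors relevance; with `lam` close to 0 it favors clusters unlike anything already covered, which is the novelty end of the tradeoff.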
Experiment Setup
  Dataset
    Free text extracted from product descriptions on http://www.dickssportinggoods.com
    Subsets from two categories
      Football (largest category): 52339 entries, 194273 words, 2926 predicted feature-value pairs
      Tennis (medium category): 3840 entries, 12533 words, 419 predicted feature-value pairs
  Evaluation measures
    Direct evaluation
      Precision on feature-value pairs
    Indirect evaluation in other applications
      Recommender systems
Experiment Results
  Initial label acquisition
  Semi-supervised learning
  Phrase identification
  Semantic association
  Grammatical association
  Statistical association scores
  Examples by steps
Experiment Results
  Human feedback
    Sample files (link to file)
    Total labeling time of 5 minutes
    Identified concepts
      color, graphics, logo, design, fit, size, pocket, pad, set, adjustment, attachment, construction, strap
  Examples by active learning
Experiment Results
  Precision on most frequent feature-value pairs
    Most frequent 600 pairs
    Assignment of 5 labels
      Fully correct, incorrect names, incorrect values, incorrect associations, nonsense
    Human labeling of approximately 6 hours
      Thanks to Katharin and Marko
  Results
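For reference, precision over the five-way judgments could be computed as below. This sketch assumes that only pairs judged fully correct count as correct, which the slides do not state explicitly. The function name is hypothetical, and any judgments passed in are the caller's own data, not results from the talk.

```python
from collections import Counter

# The five labels assigned by the human judges.
LABELS = ("fully correct", "incorrect names", "incorrect values",
          "incorrect associations", "nonsense")


def precision(judgments):
    """Fraction of judged feature-value pairs labeled 'fully correct'."""
    counts = Counter(judgments)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    return counts["fully correct"] / total if total else 0.0
```

Keeping the per-label counts around, rather than only the final ratio, also shows which kind of error (names, values, or associations) dominates.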
Conclusion
  Product attribute identification is a difficult task
    Little training data
      Making use of labeled and unlabeled data by semi-supervised learning
    Unclear attribute definition
      Novel attribute identification by active learning
  A framework of active learning combined with semi-supervised learning
Text Learning Techniques
  Text processing
    Stemming (Porter stemmer)
    POS tagging (Brill’s tagger)
    Text chunking and parsing (Minipar)
    Word semantics (WordNet, dependency-based thesaurus)
    Latent semantic indexing (SVDPack)
  Machine learning
    Semi-supervised learning (Co-training)
    Active learning (MMR)
    Classification (C4.5 decision tree, FOIL)
    Clustering (K-means, CLUTO)
    Information theory and statistical associations (Information gain, Yule’s statistic)
Future Work
  Associations of product attributes across categories or websites
  More effective active learning algorithms
  Graphical models with application to information extraction
Questions?