Semi-Supervised Learning
Need for an intermediate approach
Two extreme learning paradigms:
• Unsupervised learning: a collection of documents without any labels; easy to collect.
• Supervised learning: each object is tagged with a class; labeling is a laborious job.
Semi-supervised learning:
• Real-life applications are somewhere in between.
Motivation
Document collection D; a subset D_K ⊂ D (with |D_K| ≪ |D|) has known labels.
Goal: to label the rest of the collection, D \ D_K.
Approach:
• Train a supervised learner using D_K, the labeled subset.
• Apply the trained learner on the remaining documents, D \ D_K.
Idea:
• Harness the information in D \ D_K to enable better learning.
The Challenge
The unsupervised portion of the corpus, D \ D_K, adds to:
• Vocabulary
• Knowledge about the joint distribution of terms
• Unsupervised measures of inter-document similarity, e.g., site name, directory path, hyperlinks
Put together multiple sources of evidence of similarity and class membership into a label-learning system:
• Combine different features with partial supervision.
Hard Classification
Train a supervised learner on the available labeled data D_K.
Label all documents in D \ D_K.
Retrain the classifier using the new labels for the documents on which the classifier was most confident.
Continue until the labels do not change any more.
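A minimal sketch of this hard-classification (self-training) loop, assuming integer class labels and a classifier with scikit-learn-style fit/predict_proba methods; all names here are illustrative, not from the source:

```python
import numpy as np

def self_train(clf, X_labeled, y_labeled, X_unlabeled,
               confidence=0.95, max_rounds=20):
    """Hard classification: repeatedly retrain, adopting the classifier's
    most confident labels on the unlabeled pool until labels stop changing."""
    X_u = np.asarray(X_unlabeled)
    y_u = np.full(len(X_u), -1)               # -1 means "no label adopted yet"
    for _ in range(max_rounds):
        mask = y_u != -1
        X_train = np.vstack([X_labeled, X_u[mask]])
        y_train = np.concatenate([y_labeled, y_u[mask]])
        clf.fit(X_train, y_train)
        proba = clf.predict_proba(X_u)
        new_labels = np.where(proba.max(axis=1) >= confidence,
                              clf.classes_[proba.argmax(axis=1)], -1)
        if np.array_equal(new_labels, y_u):   # labels no longer change: stop
            break
        y_u = new_labels
    return clf, y_u
```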
Expectation maximization
A softer variant of the previous algorithm. Steps:
• Set up some fixed number of clusters with arbitrary initial distributions.
• Alternate the following steps, based on the current parameters of the distribution that characterizes each cluster c:
– Re-estimate Pr(c|d) for each cluster c and each document d (E-step).
– Re-estimate the parameters of the distribution for each cluster (M-step).
Experiment: EM
Set up one cluster for each class label.
Estimate a class-conditional distribution which includes information from all of D.
Simultaneously estimate the cluster memberships of the unlabeled documents.
Experiment: EM (contd.)
Example:
• EM procedure + multinomial naive Bayes text classifier
• Laplace's law for parameter smoothing
• For EM, unlabeled documents belong to clusters probabilistically; term counts are weighted by these probabilities
• Likewise, modify the class priors
With only labeled documents, the Laplace-smoothed estimate of the term distribution of class c is
\[
\theta_{c,t} = \frac{1 + \sum_{d \in D_K^c} n(d,t)}{|W| + \sum_{d \in D_K^c} n(d)},
\]
where n(d,t) is the number of occurrences of term t in document d, n(d) = \sum_t n(d,t), W is the vocabulary, and D_K^c is the set of labeled documents in class c. With EM over all of D, the counts are weighted by the cluster membership probabilities:
\[
\theta_{c,t} = \frac{1 + \sum_{d \in D} \Pr(c|d)\, n(d,t)}{|W| + \sum_{d \in D} \Pr(c|d)\, n(d)}.
\]
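A compact sketch of the two alternating steps under these estimates; X is a document-term count matrix, P a soft cluster-membership matrix, and the names are illustrative:

```python
import numpy as np

def m_step(X, P):
    """Re-estimate Laplace-smoothed, probability-weighted term distributions:
    theta[c, t] = (1 + sum_d P[d, c] n(d, t)) / (|W| + sum_d P[d, c] n(d))."""
    weighted_counts = P.T @ X                        # shape (k, |W|)
    vocab_size = X.shape[1]                          # |W|
    theta = (1.0 + weighted_counts) / (
        vocab_size + weighted_counts.sum(axis=1, keepdims=True))
    priors = P.mean(axis=0)                          # class priors, likewise weighted
    return theta, priors

def e_step(X, theta, priors):
    """Re-estimate soft memberships P[d, c] = Pr(c | d) under the
    multinomial naive Bayes model."""
    log_post = np.log(priors) + X @ np.log(theta).T  # shape (n, k)
    log_post -= log_post.max(axis=1, keepdims=True)  # guard against underflow
    P = np.exp(log_post)
    return P / P.sum(axis=1, keepdims=True)
```

For documents in D_K, the corresponding rows of P can be clamped to the known label, as discussed below.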
EM: Issues
For d ∈ D_K, we know the class label c_d.
• Question: how do we use this information?
• This will be dealt with later.
Using Laplace estimates instead of ML estimates:
• Not strictly EM
• Convergence takes place in practice
EM: Experiments
Take a completely labeled corpus D, and randomly select a subset as D_K; also use the set of unlabeled documents D_U = D \ D_K in the EM procedure.
A document is classified correctly when its concealed class label equals the class with the largest estimated probability.
• Accuracy with unlabeled documents > accuracy without unlabeled documents, keeping the labeled set the same size
• EM beats naive Bayes with the same size of labeled document set
• Largest boost for small labeled sets
• Comparable or poorer performance of EM for large labeled sets
Belief in labeled documents
Depending on one's faith in the initial labeling:
• Set, before the first iteration, Pr(c_d|d) = 1 and Pr(c'|d) = 0 for all c' ≠ c_d.
• With each iteration, let the class probabilities of the labeled documents `smear'.
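In terms of the soft-membership matrix P from the earlier sketch, the two policies differ only in when the labeled rows are re-clamped (illustrative):

```python
import numpy as np

def clamp_labels(P, labeled_idx, labels):
    """Force Pr(c_d | d) = 1 and Pr(c' | d) = 0 for c' != c_d
    on the rows of P belonging to labeled documents."""
    P = P.copy()
    P[labeled_idx, :] = 0.0
    P[labeled_idx, labels] = 1.0
    return P

# Full faith in D_K: call clamp_labels after every E-step.
# Letting the labels 'smear': call it once, before the first iteration only.
```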
EM: Reducing belief in unlabeled documents
Problems due to:
• Noise in the term distribution of documents in D_U
• Mistakes in the E-step
Solution:
• Attenuate the contribution from documents in D_U.
• Add a damping factor (written here as α) in the E-step for the contribution from D_U:
\[
\theta_{c,t} = \frac{1 + \sum_{d \in D_K} [c_d = c]\, n(d,t) + \alpha \sum_{d \in D_U} \Pr(c|d)\, n(d,t)}{|W| + \sum_{d \in D_K} [c_d = c]\, n(d) + \alpha \sum_{d \in D_U} \Pr(c|d)\, n(d)}
\]
Increasing D_U while holding D_K fixed also shows the advantage of using large unlabeled sets in the EM-like algorithm.
EM: Reducing belief in unlabeled documents (contd.)
There is no theoretical justification, but:
• Accuracy is indeed influenced by the choice of α.
What value of α should one choose? An intuitive recipe (to be tried):
• For large D_K, choose a small α.
• For small D_K, choose a large α (close to 1).
EM: Modeling labels using many mixture components
There need not be a one-to-one correspondence between EM clusters and class labels.
Mixture modeling of the term distributions of some classes:
• Especially "the negative class". E.g., in the two-class case "football" vs. "not football":
• Documents not about "football" are actually about a variety of other things.
EM: Modeling labels using many mixture components
Experiments: comparison with naive Bayes
• Lower accuracy with one mixture component per label
• Higher accuracy with more mixture components per label
• Overfitting and degradation with too large a number of clusters
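One way to realize many components per label, assuming a many-to-one map from EM clusters to labels (names illustrative):

```python
import numpy as np

def label_posterior(P, cluster_to_label, num_labels):
    """Pr(label | d) = sum of Pr(cluster | d) over the clusters
    mapped to that label (a many-to-one mapping)."""
    Q = np.zeros((P.shape[0], num_labels))
    for cluster, label in enumerate(cluster_to_label):
        Q[:, label] += P[:, cluster]
    return Q

# e.g. "not football" may be covered by several clusters, all mapped to label 1
```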
Allowing more clusters in the EM-like algorithm than there are class labels often helps to capture term distributions for composite or complex topics, and boosts the accuracy of the semi-supervised learner beyond that of a naive Bayes classifier.
Labeling hypertext graphs
More complex features than those exploited by EM:
• The test document is cited directly by a training document, or vice versa.
• There is a short path between the test document and one or more training documents.
• The test document is cited by a named category in a Web directory (the target category system could be somewhat different).
• Some category of a Web directory co-cites one or more training documents along with the test document.
Labeling hypertext graphs: Scenario
A snapshot of the Web graph, G = (V, E), and a set of topics.
A small subset of nodes V_K is labeled.
Use the supervision to label some or all nodes in V \ V_K.
Hypertext models for classification
c = class, t = text, N = neighbors
• Text-only model: Pr[t|c]
• Using neighbors' text to judge my topic: Pr[t, t(N) | c]
• Better model: Pr[t, c(N) | c]
• Non-linear relaxation
Absorbing features from neighboring pages
Page u may have too little text on it to train or apply a text classifier.
u cites some second-level pages; often the second-level pages have usable quantities of text.
Question: how to use these features?
Absorbing features
Indiscriminate absorption of neighborhood text does not help, and at times deteriorates accuracy.
Reason: an implicit assumption that
• the topic of a page u is likely to be the same as the topic of a page cited by u.
This is not always true: the topics may be "related" but not the "same".
The distribution of topics of the cited pages could be quite distorted compared to the totality of contents available from the page itself.
E.g., a university homepage with little textual content:
• Points to "how to get to our campus" or "recent sports prowess".
Absorbing link-derived features
Key insight 1:
• The classes of hyperlinked neighbors are a better representation of hyperlinks.
• E.g., use the fact that u points to a page about athletics to raise our belief that u is a university homepage; learn to systematically reduce the attention we pay to the fact that a page links to the Netscape download site.
Key insight 2:
• Class labels are drawn from an is-a hierarchy.
• Evidence at the detailed topic level may be too noisy; coarsening the topic helps collect more reliable data on the dependence between the class of the homepage and the link-derived feature.
Absorbing link-derived features
Add all prefixes of the class path to the feature pool (see the sketch below).
Do feature selection to get rid of noise features.
Experiment:
• Corpus of US patents
• Two-level topic hierarchy: three first-level classes, each with four children
• Each leaf topic has 800 documents
• Four feature sets compared: Text, Link, Prefix, Text+Prefix
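A sketch of the prefix generation; the class-path syntax is illustrative:

```python
def prefix_features(class_path):
    """All prefixes of a class path become link-derived features, e.g.
    '/patents/chemical/polymers' yields '/patents', '/patents/chemical',
    and '/patents/chemical/polymers'."""
    parts = class_path.strip('/').split('/')
    return ['/' + '/'.join(parts[:i + 1]) for i in range(len(parts))]

# Each linked neighbor's class path contributes all of its prefixes to the
# feature pool of the page being classified; feature selection prunes the rest.
print(prefix_features('/patents/chemical/polymers'))
```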
The prefix trick
A two-level topic hierarchy of US patents. Using prefix-encoded link features in conjunction with text can significantly reduce classification error.
Absorbing link-derived features: Observations
Absorbing text from neighboring pages in an indiscriminate manner does not help classify hyperlinked patent documents any better than a purely text-based naive Bayes classifier.
Absorbing link-derived features: Limitation
|V_K| ≪ |V|: hardly any neighbor of a node to be classified links to a pre-labeled node.
Proposal:
• Start with a labeling of reasonable quality, perhaps using a text classifier.
• Do: refine the labeling using a coupled distribution of the text and labels of neighbors,
• until the labeling stabilizes.
A relaxation labeling algorithm
Given:
• A hypertext graph G = (V, E)
• Each vertex u is associated with text u_T
Desired:
• A labeling f of all (unlabeled) vertices so as to maximize
\[
\Pr(f(V) \mid E, \{u_T : u \in V\}) = \frac{\Pr(f(V))\,\Pr(E, \{u_T : u \in V\} \mid f(V))}{\Pr(E, \{u_T : u \in V\})},
\]
where \Pr(E, \{u_T : u \in V\}) = \sum_{f} \Pr(f)\,\Pr(E, \{u_T : u \in V\} \mid f).
Preferential attachment
Simplifying assumption: the graph is undirected. The Web graph starts with m_0 nodes, and time proceeds in discrete steps. In every step, one new node v is added, and v is attached with m edges to old nodes:
• Suppose old node w has current degree d(w).
• Multinomial distribution: the probability of attachment to w is proportional to d(w).
Old nodes keep accumulating degree: "the rich get richer", or "winner takes all". A minimal simulation is sketched below.
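A minimal simulation of this growth process (parameter names illustrative):

```python
import random

def preferential_attachment(m0=3, m=2, steps=1000, seed=0):
    """Grow an undirected graph: each new node attaches m edges to old
    nodes, each chosen with probability proportional to current degree."""
    random.seed(seed)
    degree = {v: 1 for v in range(m0)}    # seed nodes start with degree 1
    endpoints = list(range(m0))           # node id repeated once per unit of degree
    edges = []
    for v in range(m0, m0 + steps):
        targets = {random.choice(endpoints) for _ in range(m)}  # set drops duplicates
        degree[v] = 0
        for w in targets:
            edges.append((v, w))
            degree[w] += 1
            degree[v] += 1
            endpoints.extend([w, v])      # "rich get richer": w reappears more often
    return edges, degree
```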
Heuristic Assumption
E:
• The event that the edges were generated as per the edge list E.
It is difficult to obtain a known function for \Pr(E, \{u_T : u \in V\} \mid f(V)). Approximate it using heuristic assumptions:
• Term occurrences are independent.
• Link-derived features are independent.
• There is no dependence between a term and a link-derived feature.
• Assumption: the decision boundaries will remain relatively immune to errors in the probability estimates.
Heuristic Assumption (contd.)
Approximate the joint probability of neighboring classes by the product of the marginals; this couples the class probabilities of neighboring nodes.
Optimization concerns:
• Kleinberg and Tardos: global optimization, a unique f for all nodes in V_U.
• Alternatively, greedy labeling followed by iterative correction of neighborhoods:
– Greedy labeling using a text classifier.
– Re-evaluate the class probability of each page using the latest estimates of the class probabilities of its neighbors.
– EM-like soft classification of nodes.
\[
\begin{aligned}
\Pr(f(v) \mid E, V_T, f(V_K)) &= \sum_{f(N_U(v))} \Pr(f(N_U(v)) \mid E, V_T, f(V_K))\;\Pr(f(v) \mid f(N(v)), E, V_T, f(V_K)) \\
&\approx \sum_{f(N_U(v))} \Bigl[\prod_{w \in N_U(v)} \Pr(f(w) \mid E, V_T, f(V_K))\Bigr] \Pr(f(v) \mid f(N(v)), E, V_T, f(V_K)),
\end{aligned}
\]
where N_U(v) denotes the unlabeled neighbors of v.
Inducing a Markov Random field
Induction on the time step r:
• breaks the circular definition.
Converges if the seed values are reasonably accurate.
Further assumptions:
• Limited range of influence.
• The text of nodes other than v contains no information about f(v); this is already accounted for in the graph structure.
\[
\Pr^{(r+1)}(f(v) \mid E, V_T, f(V_K)) = \sum_{f(N_U(v))} \Bigl[\prod_{w \in N_U(v)} \Pr^{(r)}(f(w) \mid E, V_T, f(V_K))\Bigr] \Pr(f(v) \mid f(N(v)), E, V_T, f(V_K))
\]
Overview of the algorithm
Desired: the class (probabilities) of v, given:
• The text v_T on that page
• The classes of the neighbors N(v) of v
Use Bayes rule to invert that goal:
• Build the distributions \Pr(f(N(v)), v_T \mid f(v)).
The algorithm HyperClass:
• Input: test node v
• Construct a suitably large vicinity graph around and containing v
• For each w in the vicinity graph, assign \Pr^{(0)}(f(w) \mid E, V_T, f(V_K)) using a text classifier
• While the label probabilities do not stabilize (r = 1, 2, ...):
– For each node w in the vicinity graph, update \Pr^{(r)}(f(w) \mid E, V_T, f(V_K)) using the equation above
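A sketch of the HyperClass loop; it substitutes the common product-of-marginals shortcut for the full sum over neighbor label configurations, and the coupling matrix here is an illustrative stand-in that would in practice be estimated from labeled linked pairs:

```python
import numpy as np

def hyperclass(adj, text_post, labeled, num_iters=20):
    """Relaxation labeling on a vicinity graph.
    adj[v]    : list of neighbors of node v
    text_post : (n, k) matrix of Pr^(0)(f(v) | v_T) from a text classifier
    labeled   : dict {node: class} for the nodes of V_K in the graph."""
    n, k = text_post.shape
    # Illustrative label-coupling matrix (rows sum to 1);
    # same-label links are assumed twice as likely here.
    compat = np.full((k, k), 1.0 / (k + 1))
    np.fill_diagonal(compat, 2.0 / (k + 1))
    P = text_post.copy()
    for v, c in labeled.items():                  # clamp the known labels
        P[v] = np.eye(k)[c]
    for _ in range(num_iters):
        P_new = P.copy()
        for v in range(n):
            if v in labeled:
                continue
            neigh = np.ones(k)                    # expected neighbor compatibility
            for w in adj[v]:
                neigh *= compat @ P[w]            # product-of-marginals approximation
            scores = text_post[v] * neigh
            P_new[v] = scores / scores.sum()
        P = P_new
    return P
```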
Exploiting link features
9600 patents from 12 classes marked by the USPTO.
Patents have text and cite other patents.
Expand each test patent to include its neighborhood.
"Forget" a fraction of the neighbors' classes.
[Figure: %Error vs. %Neighborhood known, for the Text, Link, and Text+Link classifiers.]
Relaxation labeling: Observations
When the test neighborhood is completely unlabeled:
• "Link" performs better than the text-based classifier.
• Reason: model bias. Pages tend to link to pages with a related class label.
Relaxation labeling:
• An approximate procedure to optimize a global objective function on the hypertext graph being labeled.
• Closely related: a metric graph-labeling problem.
A metric graph-labeling problem
Inference about the topic of page u possibly depends on the entire Web:
• Computationally infeasible.
• Unclear if capturing such dependencies is useful: the phenomenon of "losing one's way" within a few clicks is significant, so clues about a page are expected to lie in a neighborhood of limited radius.
Example:
• A hypertext graph.
• Nodes can belong to exactly one of two topics (red and blue).
• Given: a graph with a small subset of nodes with known colors.
A metric graph-labeling problem (contd.)
Goal: find a labeling f(u) for each unlabeled node u to minimize
\[
Q(f) = \sum_{u} L(u, f(u)) + \sum_{(u,v) \in E} A(f(u), f(v)).
\]
Two terms:
• Affinity A(c_1, c_2): a cost between all pairs of colors.
• L(u, f(u)) = -\Pr(f(u) \mid u): the cost of assigning label f(u) to node u.
Parameters:
• The marginal distribution of topics.
• A 2×2 topic citation matrix: the probability of differently colored nodes linking to each other.
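A direct rendering of Q(f), with illustrative container names:

```python
def labeling_cost(f, edges, node_post, affinity):
    """Q(f) = sum_u L(u, f(u)) + sum_{(u,v) in E} A(f(u), f(v)),
    with L(u, c) = -Pr(c | u) taken from a per-node posterior."""
    assignment = -sum(node_post[u][f[u]] for u in range(len(f)))
    separation = sum(affinity[f[u]][f[v]] for (u, v) in edges)
    return assignment + separation
```

Minimizing Q(f) trades off per-node evidence against the affinity costs incurred along edges.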
Semi-supervised hypertext classification can thus be represented as the problem of completing a partially colored graph subject to a given set of cost constraints.
A metric graph-labeling problem: NP-completeness
The problem is NP-complete [Kleinberg and Tardos]. Approximation algorithms exist:
• within an O(log k log log k) multiplicative factor of the minimal cost,
• where k = the number of distinct class labels.
Problems with approaches so far
Metric or relaxation labeling:
• Representing accurate joint distributions over thousands of terms incurs high space and time complexity.
Naive models:
• Fast: assume class-conditional attribute independence.
• But the dimensionality of the textual sub-problem ≫ the dimensionality of the link sub-problem,
• so Pr(v_T | f(v)) tends to be lower in magnitude than Pr(f(N(v)) | f(v)).
• Hacky workaround: aggressive pruning of textual features.
Co-Training [Blum and Mitchell]
Classifiers with disjoint feature spaces. Co-training of classifiers:
• The scores produced by each classifier are used to train the other.
• Semi-supervised EM-like training with two classifiers.
Assumptions:
• Two sets of features per document, giving two views d_A and d_B (with learners L_A and L_B).
• There must be no instance d for which f_A(d_A) ≠ f_B(d_B).
• Given the label, d_A is conditionally independent of d_B (and vice versa).
Co-training
Divide the features into two class-conditionally independent sets:
\[
\Pr(d_A, d_B \mid c) = \Pr(d_A \mid c)\,\Pr(d_B \mid c).
\]
Use the labeled data to induce two separate classifiers. Repeat (see the sketch below):
• Each classifier is "most confident" about some unlabeled instances.
• These are labeled and added to the training set of the other classifier.
Improvements for text + hyperlinks.
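A sketch of the loop, assuming two scikit-learn-style classifiers over the two views (names illustrative):

```python
import numpy as np

def co_train(clf_a, clf_b, Xa_l, Xb_l, y_l, Xa_u, Xb_u,
             rounds=10, per_round=5):
    """Blum-Mitchell-style co-training over two feature views A and B:
    each classifier labels the unlabeled examples it is most confident
    about, and those examples join the shared training pool."""
    Xa, Xb, y = list(Xa_l), list(Xb_l), list(y_l)
    pool = set(range(len(Xa_u)))                  # indices still unlabeled
    for _ in range(rounds):
        clf_a.fit(np.array(Xa), np.array(y))
        clf_b.fit(np.array(Xb), np.array(y))
        for clf, X_u in ((clf_a, Xa_u), (clf_b, Xb_u)):
            if not pool:
                return clf_a, clf_b
            idx = sorted(pool)
            proba = clf.predict_proba(np.array([X_u[i] for i in idx]))
            for j in np.argsort(proba.max(axis=1))[-per_round:]:
                i = idx[j]                        # most confident instances
                Xa.append(Xa_u[i]); Xb.append(Xb_u[i])
                y.append(clf.classes_[proba[j].argmax()])
                pool.discard(i)
    return clf_a, clf_b
```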
Co-Training: Performance
d_A = bag of words; d_B = bag of anchor texts from HREF tags.
Co-training reduces the error below the levels of both L_A and L_B individually.
At prediction time, pick the class c maximizing Pr(c|d_A) · Pr(c|d_B).
Co-training reduces classification error: the reduction in error is plotted against the number of mutual training rounds.