Seminar: Statistical NLP
Girona, June 2003

Machine Learning for Natural Language Processing

Lluís Màrquez
TALP Research Center
Llenguatges i Sistemes Informàtics
Universitat Politècnica de Catalunya
Outline

• Machine Learning for NLP
• The Classification Problem
• Three ML Algorithms
• Applications to NLP
ML4NLP: Machine Learning

• There are many general-purpose definitions of Machine Learning (or artificial learning):

  Making a computer automatically acquire some kind of knowledge from a concrete data domain

• Learners are computers: we study learning algorithms
• Resources are scarce: time, memory, data, etc.
• It has (almost) nothing to do with cognitive science, neuroscience, the theory of scientific discovery and research, etc.
• Biological plausibility is welcome, but it is not the main goal
ML4NLP: Machine Learning

• Learning... but what for?
  – To perform some particular task
  – To react to environmental inputs
  – Concept learning from data:
    • modelling concepts underlying data
    • predicting unseen observations
    • compacting the knowledge representation
    • knowledge discovery for expert systems

• We will concentrate on:
  – Supervised inductive learning for classification = discriminative learning
ML4NLP: Machine Learning

A more precise definition:

  Obtaining a description of the concept in some representation language that explains observations and helps predict new instances of the same distribution

• What to read? Machine Learning (Mitchell, 1997)
ML4NLP: Empirical NLP

90's: Application of Machine Learning (ML) techniques to NLP problems

• Lexical and structural ambiguity problems = classification problems
  – Word selection (SR, MT)
  – Part-of-speech tagging
  – Semantic ambiguity (polysemy)
  – Prepositional phrase attachment
  – Reference ambiguity (anaphora)
  – etc.

• What to read? Foundations of Statistical Natural Language Processing (Manning & Schütze, 1999)
ML4NLP: NLP "classification" problems

• Ambiguity is a crucial problem for natural language understanding/processing. Ambiguity resolution = classification

  He was shot in the hand as he chased the robbers in the back street
  (The Wall Street Journal Corpus)
ML4NLP: NLP "classification" problems

• Morpho-syntactic ambiguity

  He was shot in the hand as he chased the robbers in the back street
  (The Wall Street Journal Corpus)

  [Figure: competing POS tags (NN/VB, JJ/VB) shown over several of the words]
ML4NLP: NLP "classification" problems

• Morpho-syntactic ambiguity: Part of Speech Tagging

  He was shot in the hand as he chased the robbers in the back street
  (The Wall Street Journal Corpus)
ML4NLP: NLP "classification" problems

• Semantic (lexical) ambiguity

  He was shot in the hand as he chased the robbers in the back street
  (The Wall Street Journal Corpus)

  [Figure: "hand" annotated with the competing senses body-part vs. clock-part]
ML4NLP: NLP "classification" problems

• Semantic (lexical) ambiguity: Word Sense Disambiguation

  He was shot in the hand as he chased the robbers in the back street
  (The Wall Street Journal Corpus)
ML4NLP: NLP "classification" problems

• Structural (syntactic) ambiguity

  He was shot in the hand as he chased the robbers in the back street
  (The Wall Street Journal Corpus)
ML4NLP: NLP "classification" problems

• Structural (syntactic) ambiguity: PP-attachment disambiguation

  He was shot in the hand as he (chased (the robbers)NP (in the back street)PP)
  (The Wall Street Journal Corpus)
Outline

• Machine Learning for NLP
• The Classification Problem
• Three ML Algorithms in detail
• Applications to NLP
Classification: Feature Vector Classification (AI perspective)

• An instance is a vector x = <x1, ..., xn> whose components, called features (or attributes), are discrete or real-valued.
• Let X be the space of all possible instances.
• Let Y = {y1, ..., ym} be the set of categories (or classes).
• The goal is to learn an unknown target function f : X → Y.
• A training example is an instance x belonging to X, labelled with the correct value of f(x), i.e., a pair <x, f(x)>.
• Let D be the set of all training examples.
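To make the notation concrete, here is a minimal sketch in Python (the three discrete features anticipate the SIZE/COLOR/SHAPE toy example used a few slides below; the hypothesis h is just an illustration):

from typing import List, Tuple

Instance = Tuple[str, str, str]        # x = <x1, x2, x3>, discrete features
Y = {"positive", "negative"}           # the set of categories

# D: training examples, i.e. pairs <x, f(x)>
D: List[Tuple[Instance, str]] = [
    (("small", "red", "circle"), "positive"),
    (("big", "blue", "circle"), "negative"),
]

# The learner outputs a hypothesis h: X -> Y that approximates f on D.
def h(x: Instance) -> str:
    return "positive" if x[1] == "red" and x[2] == "circle" else "negative"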
Classification: Feature Vector Classification

• The hypotheses space, H, is the set of functions h: X → Y that the learner can consider as possible definitions.

  The goal is to find a function h belonging to H such that for every pair <x, f(x)> belonging to D, h(x) = f(x).
Classification: An Example

  Example  SIZE   COLOR  SHAPE     CLASS
  1        small  red    circle    positive
  2        big    red    circle    positive
  3        small  red    triangle  negative
  4        big    blue   circle    negative

Rules:
  (COLOR=red) ∧ (SHAPE=circle) → positive
  otherwise → negative

Decision Tree:
  COLOR
  ├─ red → SHAPE
  │        ├─ circle → positive
  │        └─ triangle → negative
  └─ blue → negative
Classification: An Example

  Example  SIZE   COLOR  SHAPE     CLASS
  1        small  red    circle    positive
  2        big    red    circle    positive
  3        small  red    triangle  negative
  4        big    blue   circle    negative

Rules:
  (SIZE=small) ∧ (SHAPE=circle) → positive
  (SIZE=big) ∧ (COLOR=red) → positive
  otherwise → negative

Decision Tree:
  SIZE
  ├─ small → SHAPE
  │          ├─ circle → pos
  │          └─ triang → neg
  └─ big → COLOR
           ├─ red → pos
           └─ blue → neg
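As a side note, a tree of this kind can be induced from the four examples with scikit-learn, assuming the library is available; its DecisionTreeClassifier is a CART-style learner, not exactly the TDIDT variants discussed below:

from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

X = [["small", "red", "circle"],
     ["big", "red", "circle"],
     ["small", "red", "triangle"],
     ["big", "blue", "circle"]]
y = ["positive", "positive", "negative", "negative"]

enc = OneHotEncoder()                          # categorical -> one-hot indicators
Xe = enc.fit_transform(X).toarray()

clf = DecisionTreeClassifier(criterion="entropy").fit(Xe, y)
names = list(enc.get_feature_names_out(["SIZE", "COLOR", "SHAPE"]))
print(export_text(clf, feature_names=names))   # prints the induced tree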
Classification: Some important concepts

• Inductive Bias
  "Any means that a classification learning system uses to choose between two functions that are both consistent with the training data is called inductive bias" (Mooney & Cardie, 99)
  – Language / search bias

  e.g., the previous decision tree:
  COLOR
  ├─ red → SHAPE
  │        ├─ circle → positive
  │        └─ triangle → negative
  └─ blue → negative
Classification: Some important concepts

• Inductive bias
• Training error and generalization error
• Generalization ability and overfitting
• Batch learning vs. on-line learning
• Symbolic vs. statistical learning
• Propositional vs. first-order learning
Classification: Propositional vs. Relational Learning

• Propositional learning
  color(red) ∧ shape(circle) → classA

• Relational learning = ILP (induction of logic programs)
  course(X) ∧ person(Y) ∧ link_to(Y,X) → instructor_of(X,Y)
  research_project(X) ∧ person(Z) ∧ link_to(L1,X,Y) ∧ link_to(L2,Y,Z) ∧ neighbour_word_people(L1) → member_proj(X,Z)
Classification: The Classification Setting (CoLT/SLT perspective)
Class, Point, Example, Data Set, ...

• Input Space: X ⊆ R^n
• (binary) Output Space: Y = {+1, −1}
• A point, pattern or instance: x ∈ X, x = (x1, x2, ..., xn)
• Example: (x, y) with x ∈ X, y ∈ Y
• Training Set: a set of m examples generated i.i.d. according to an unknown distribution P(x,y):
  S = {(x1, y1), ..., (xm, ym)} ∈ (X × Y)^m
Classification: The Classification Setting (Learning, Error, ...)

• The hypotheses space, H, is the set of functions h: X → Y that the learner can consider as possible definitions. In SVMs they are of the form:

  h(x) = Σ_{i=1..n} w_i φ_i(x) + b

• The goal is to find a function h belonging to H such that the expected misclassification error on new examples, also drawn from P(x,y), is minimal (Risk Minimization, RM)
Classification: The Classification Setting (Learning, Error, ...)

• Expected error (risk):
  R(h) = ∫ loss(h(x), y) dP(x, y)

• Problem: P itself is unknown; only the training examples are known → an induction principle is needed

• Empirical Risk Minimization (ERM): find the function h belonging to H for which the training error (empirical risk) is minimal:
  R_emp(h) = (1/m) Σ_{i=1..m} loss(h(x_i), y_i)
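A minimal sketch of the empirical risk with 0-1 loss, in Python (the hypothesis h and sample S below are toy assumptions):

def zero_one_loss(y_pred, y_true):
    return 0.0 if y_pred == y_true else 1.0

def empirical_risk(h, S):
    # R_emp(h) = (1/m) * sum_i loss(h(x_i), y_i) over the m examples in S.
    return sum(zero_one_loss(h(x), y) for x, y in S) / len(S)

h = lambda x: +1                            # a (bad) constant hypothesis
S = [((0.5, 1.2), +1), ((0.1, -0.3), -1)]
print(empirical_risk(h, S))                 # 0.5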
Classification: The Classification Setting (Error, Over/underfitting, ...)

• Low training error ⇒ low true error?
• The overfitting dilemma (Müller et al., 2001):

  [Figure: underfitting vs. overfitting]

• Trade-off between training error and complexity
• Different learning biases can be used
Outline

• Machine Learning for NLP
• The Classification Problem
• Three ML Algorithms
  − Decision Trees
  − AdaBoost
  − Support Vector Machines
• Applications to NLP
Algorithms: Learning Paradigms

• Statistical learning:
  – HMM, Bayesian Networks, ME, CRF, etc.
• Traditional methods from Artificial Intelligence (ML, AI):
  – Decision trees/lists, exemplar-based learning, rule induction, neural networks, etc.
• Methods from Computational Learning Theory (CoLT/SLT):
  – Winnow, AdaBoost, SVMs, etc.
Algorithms: Learning Paradigms

• Classifier combination:
  – Bagging, Boosting, Randomization, ECOC, Stacking, etc.
• Semi-supervised learning: learning from labelled and unlabelled examples
  – Bootstrapping, EM, Transductive learning (SVMs, AdaBoost), Co-Training, etc.
• etc.
Algorithms: Decision Trees

• Decision trees are a way to represent the rules underlying training data, with hierarchical structures that recursively partition the data.
• They have been used by many research communities (Pattern Recognition, Statistics, ML, etc.) for data exploration with some of the following purposes: description, classification, and generalization.
• From a machine-learning perspective: decision trees are n-ary branching trees that represent classification rules for classifying the objects of a certain domain into a set of mutually exclusive classes.
Algorithms: Decision Trees

• Acquisition: Top-Down Induction of Decision Trees (TDIDT)
• Systems: CART (Breiman et al. 84), ID3, C4.5, C5.0 (Quinlan 86, 93, 98), ASSISTANT, ASSISTANT-R (Cestnik et al. 87; Kononenko et al. 95), etc.
Algorithms: An Example

  [Figure: a generic n-ary decision tree with attribute nodes A1, A2, A3, A5, branch values v1-v7, and leaf classes C1, C2, C3]

  A concrete instance, the tree learned before:
  SIZE
  ├─ small → SHAPE
  │          ├─ circle → pos
  │          └─ triang → neg
  └─ big → COLOR
           ├─ red → pos
           └─ blue → neg
Algorithms: Learning Decision Trees

  [Diagram — Training: a Training Set (examples + classes) is fed to TDIDT, which outputs a decision tree (DT). Test: the DT maps each new example to a class.]
Algorithms: General Induction Algorithm

function TDIDT (X: set-of-examples; A: set-of-features)
  var tree1, tree2: decision-tree;
      X': set-of-examples; A': set-of-features
  end-var
  if (stopping_criterion (X)) then
    tree1 := create_leaf_tree (X)
  else
    amax := feature_selection (X, A);
    tree1 := create_tree (X, amax);
    for-all val in values (amax) do
      X' := select_examples (X, amax, val);
      A' := A - {amax};
      tree2 := TDIDT (X', A');
      tree1 := add_branch (tree1, tree2, val)
    end-for
  end-if
  return (tree1)
end-function
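Below is a runnable Python rendering of the pseudocode above — a sketch assuming information gain as the feature_selection criterion and the majority class at the leaves; the tree is represented as nested (feature, {value: subtree}) pairs:

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(X, y, a):
    by_val = {}
    for xi, yi in zip(X, y):
        by_val.setdefault(xi[a], []).append(yi)
    rem = sum(len(ys) / len(y) * entropy(ys) for ys in by_val.values())
    return entropy(y) - rem

def tdidt(X, y, A):
    if len(set(y)) == 1 or not A:                # stopping_criterion(X)
        return Counter(y).most_common(1)[0][0]   # create_leaf_tree(X)
    amax = max(A, key=lambda a: information_gain(X, y, a))  # feature_selection
    tree = {}
    for val in {xi[amax] for xi in X}:           # values(amax)
        sel = [(xi, yi) for xi, yi in zip(X, y) if xi[amax] == val]
        Xs = [xi for xi, _ in sel]
        ys = [yi for _, yi in sel]
        tree[val] = tdidt(Xs, ys, A - {amax})    # TDIDT(X', A') + add_branch
    return (amax, tree)                          # create_tree(X, amax)

# Toy run on the SIZE/COLOR/SHAPE data (features indexed 0, 1, 2):
X = [("small", "red", "circle"), ("big", "red", "circle"),
     ("small", "red", "triangle"), ("big", "blue", "circle")]
y = ["positive", "positive", "negative", "negative"]
print(tdidt(X, y, {0, 1, 2}))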
Algorithms: Feature Selection Criteria

• Functions derived from Information Theory:
  – Information Gain, Gain Ratio (Quinlan 86)
• Functions derived from distance measures:
  – Gini Diversity Index (Breiman et al. 84)
  – RLM (López de Mántaras 91)
• Statistically based:
  – Chi-square test (Sestito & Dillon 94)
  – Symmetrical Tau (Zhou & Dillon 91)
• RELIEFF-IG: variant of RELIEFF (Kononenko 94)
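For concreteness, a tiny sketch of the Gini diversity index named above (the information-gain criterion already appears in the TDIDT code earlier):

from collections import Counter

def gini(labels):
    # Gini diversity index (Breiman et al. 84): 1 - sum_k p_k^2
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["pos", "pos", "neg", "neg"]))   # 0.5  (maximally impure, two classes)
print(gini(["pos", "pos", "pos", "pos"]))   # 0.0  (pure node)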
Algorithms: Extensions of DTs (Murthy 95)

• Pruning (pre/post)
• Minimizing the effect of the greedy approach: lookahead
• Non-linear splits
• Combination of multiple models
• Incremental learning (on-line)
• etc.
Algorithms: Decision Trees and NLP

• Speech processing (Bahl et al. 89; Bakiri & Dietterich 99)
• POS tagging (Cardie 93; Schmid 94b; Magerman 95; Màrquez & Rodríguez 95, 97; Màrquez et al. 00)
• Word sense disambiguation (Brown et al. 91; Cardie 93; Mooney 96)
• Parsing (Magerman 95, 96; Haruno et al. 98, 99)
• Text categorization (Lewis & Ringuette 94; Weiss et al. 99)
• Text summarization (Mani & Bloedorn 98)
• Dialogue act tagging (Samuel et al. 98)
Algorithms: Decision Trees and NLP

• Noun phrase coreference (Aone & Benett 95; McCarthy & Lehnert 95)
• Discourse analysis in information extraction (Soderland & Lehnert 94)
• Cue phrase identification in text and speech (Litman 94; Siegel & McKeown 94)
• Verb classification in Machine Translation (Tanaka 96; Siegel 97)
Algorithms: Decision Trees, pros & cons

• Advantages
  – Acquire symbolic knowledge in an understandable way
  – Very well studied ML algorithms and variants
  – Can be easily translated into rules
  – Availability of software: C4.5, C5.0, etc.
  – Can be easily integrated into an ensemble
Algorithms: Decision Trees, pros & cons

• Drawbacks
  – Computationally expensive when scaling to large natural language domains: training examples, features, etc.
  – Data sparseness and data fragmentation: the problem of small disjuncts => probability estimation
  – DTs are a model with high variance (unstable)
  – Tendency to overfit the training data: pruning is necessary
  – Require quite a big effort in tuning the model
Algorithms: Boosting algorithms

• Idea: "to combine many simple and moderately accurate hypotheses (weak classifiers) into a single, highly accurate classifier"
• AdaBoost (Freund & Schapire 95) has been theoretically and empirically studied extensively
• Many other variants and extensions (1997-2003):
  http://www.lsi.upc.es/~lluism/seminari/ml&nlp.html
Algorithms: AdaBoost, general scheme

  [Diagram — Training: on each round t = 1..T, a weak learner is trained on the training set TSt under the current probability distribution Dt, producing a weak hypothesis ht; the distribution is then updated. Test: the weak hypotheses are combined into a linear combination F(h1, h2, ..., hT).]
Algorithms: AdaBoost algorithm (Freund & Schapire 97)

  [The algorithm pseudocode was displayed as a figure on this slide.]
Algorithms: AdaBoost example

  Weak hypotheses = vertical/horizontal hyperplanes
Algorithms: AdaBoost, rounds 1-3

  [Figures: the reweighted distribution and the chosen weak hypothesis after each of the first three rounds.]
Algorithms: Combined Hypothesis

  [Figure: the final combined hypothesis.]
  www.research.att.com/~yoav/adaboost
Algorithms: AdaBoost and NLP

• POS tagging (Abney et al. 99; Màrquez 99)
• Text and speech categorization (Schapire & Singer 98; Schapire et al. 98; Weiss et al. 99)
• PP-attachment disambiguation (Abney et al. 99)
• Parsing (Haruno et al. 99)
• Word sense disambiguation (Escudero et al. 00, 01)
• Shallow parsing (Carreras & Màrquez 01a, 02)
• Email spam filtering (Carreras & Màrquez 01b)
• Term extraction (Vivaldi et al. 01)
Algorithms: AdaBoost, pros & cons

+ Easy to implement and few parameters to set
+ Time and space grow linearly with the number of examples; ability to manage very large learning problems
+ Does not explicitly constrain the complexity of the learner
+ Naturally combines feature selection with learning
+ Has been successfully applied to many practical problems
Algorithms: AdaBoost, pros & cons

± Seems to be rather robust to overfitting (number of rounds), but sensitive to noise
± Performance is very good when there are relatively few relevant terms (features)
– Can perform poorly when there is insufficient training data relative to the complexity of the base classifiers, or when the training errors of the base classifiers become too large too quickly
Algorithms: SVM, A General Definition

• "Support Vector Machines (SVM) are learning systems that use a hypothesis space of linear functions in a high dimensional feature space, trained with a learning algorithm from optimisation theory that implements a learning bias derived from statistical learning theory." (Cristianini & Shawe-Taylor, 2000)

  [A second slide repeats the definition with its key concepts highlighted.]
Algorithms: Linear Classifiers

• Hyperplanes in R^N.
• Defined by a weight vector (w) and a threshold (b).
• They induce a classification rule:

  h(x) = +1 if Σ_{i=1..N} w_i x_i + b ≥ 0, and −1 otherwise;
  i.e., h(x) = sign(w · x + b)

  [Figure: a hyperplane with weight vector w and threshold b separating + points from − points.]
Algorithms: Optimal Hyperplane: Geometric Intuition
Algorithms: Optimal Hyperplane: Geometric Intuition

  [Figure: the maximal margin hyperplane; the points lying on the margin are the support vectors.]
Algorithms: Linearly separable data

  Maximizing the margin is equivalent to minimizing ||w||² / 2, subject to the constraints:
    y_i (w · x_i + b) ≥ 1   for all i = 1, ..., l
  (the geometric margin is 2 / ||w||)

  ⇒ Quadratic Programming
Algorithms: Non-separable case (soft margin)

  Minimize  ||w||² / 2 + C Σ_{i=1..l} ξ_i
  subject to the constraints:
    y_i (w · x_i + b) ≥ 1 − ξ_i   for all i = 1, ..., l
    ξ_i ≥ 0                       for all i = 1, ..., l

  where ξ_1, ..., ξ_l are positive slack variables introduced to account for misclassification costs.
Algorithms: Non-linear SVMs

• Implicit mapping into feature space via kernel functions

  Non-linear mapping:  Φ : X → F

  Set of hypotheses:   f(x) = Σ_{i=1..n} w_i φ_i(x) + b

  Dual formulation:    f(x) = Σ_{i=1..l} α_i y_i ⟨φ(x_i), φ(x)⟩ + b

  Kernel function:     K(x, z) = ⟨φ(x), φ(z)⟩

  Evaluation:          f(x) = Σ_{i=1..l} α_i y_i K(x_i, x) + b
Algorithms: Non-linear SVMs

• Kernel functions
  – Must be efficiently computable
  – Characterization via Mercer's theorem
  – "One of the curious facts about using a kernel is that we do not need to know the underlying feature map in order to be able to learn in the feature space!" (Cristianini & Shawe-Taylor, 2000)
  – Examples: polynomials, Gaussian radial basis functions, two-layer sigmoidal neural networks, etc.
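A small numpy sketch of the two kernels named above and of the dual evaluation f(x) = Σ α_i y_i K(x_i, x) + b (the coefficients α, labels y, support vectors and bias b are assumed to come from a trained SVM):

import numpy as np

def poly_kernel(x, z, d=3, c=1.0):
    # Polynomial kernel: K(x, z) = (<x, z> + c)^d
    return (np.dot(x, z) + c) ** d

def rbf_kernel(x, z, gamma=0.5):
    # Gaussian RBF kernel: K(x, z) = exp(-gamma * ||x - z||^2)
    diff = np.asarray(x) - np.asarray(z)
    return np.exp(-gamma * np.dot(diff, diff))

def svm_decision(x, sv_x, alpha, y, b, K=rbf_kernel):
    # f(x) = sum_i alpha_i * y_i * K(x_i, x) + b ; the class is sign(f(x)).
    return sum(a * yi * K(xi, x) for a, yi, xi in zip(alpha, y, sv_x)) + b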
Algorithms: Non-linear SVMs

  [Figure: a degree-3 polynomial kernel on a linearly separable and a linearly non-separable data set.]
Algorithms: Toy Examples

• All examples have been run with the 2D graphic interface of LIBSVM (Chang and Lin, National Taiwan University):

  "LIBSVM is an integrated software for support vector classification (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR) and distribution estimation (one-class SVM). It supports multi-class classification. The basic algorithm is a simplification of both SMO by Platt and SVMLight by Joachims. It is also a simplification of the modification 2 of SMO by Keerthi et al. Our goal is to help users from other fields to easily use SVM as a tool. LIBSVM provides a simple interface where users can easily link it with their own programs..."

• Available from: www.csie.ntu.edu.tw/~cjlin/libsvm (it includes an integrated Web demo tool)
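As an illustration of driving LIBSVM from code (an assumption not made in the slides): scikit-learn's SVC wraps LIBSVM, so a C-SVC with an RBF kernel can be sketched as:

from sklearn.svm import SVC   # scikit-learn's SVC is built on LIBSVM

X = [[0, 0], [1, 1], [1, 0], [0, 1]]   # tiny XOR-like toy data
y = [0, 0, 1, 1]

clf = SVC(kernel="rbf", C=10.0, gamma=1.0).fit(X, y)
print(clf.support_)                    # indices of the support vectors
print(clf.predict([[0.9, 0.1]]))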
Algorithms: Toy Examples (I)

  Linearly separable data set; linear SVM; maximal margin hyperplane.
  What happens if we add a blue training example here?
Algorithms: Toy Examples (I)

  (Still) linearly separable data set; linear SVM with a high value of the C parameter; maximal margin hyperplane.
  The example is correctly classified.
Algorithms: Toy Examples (I)

  (Still) linearly separable data set; linear SVM with a low value of the C parameter; trade-off between margin and training error.
  The example is now a bounded SV.
Algorithms: Toy Examples (II)

  [Figures shown on three slides.]

Algorithms: Toy Examples (III)

  [Figure.]
Algorithms: SVM Summary

• SVMs were introduced at COLT'92 (Boser, Guyon & Vapnik, 1992). Great development since then
• Kernel-induced feature spaces: SVMs work efficiently in very high dimensional feature spaces (+)
• Learning bias: maximal margin optimisation. Reduces the danger of overfitting. Generalization bounds for SVMs (+)
• Compact representation of the induced hypothesis. The solution is sparse in terms of SVs (+)
Algorithms: SVM Summary

• Due to Mercer's conditions on the kernels, the optimisation problems are convex. No local minima (+)
• Optimisation theory guides the implementation. Efficient learning (+)
• Mainly for classification, but also for regression, density estimation, clustering, etc.
• Success in many real-world applications: OCR, vision, bioinformatics, speech recognition, NLP (text categorization, POS tagging, chunking, parsing, etc.) (+)
• Parameter tuning (–). Implications on convergence times, sparsity of the solution, etc.
Outline

• Machine Learning for NLP
• The Classification Problem
• Three ML Algorithms
• Applications to NLP
Applications: NLP problems

• Warning! We will not focus on final NLP applications, but on intermediate tasks...
• We will classify the NLP tasks according to their (structural) complexity
Applications: NLP problems, structural complexity

• Decisional problems
  − Text categorization, document filtering, word sense disambiguation, etc.
• Sequence tagging and detection of sequential structures
  − POS tagging, named entity extraction, syntactic chunking, etc.
• Hierarchical structures
  − Clause detection, full parsing, IE of complex concepts, composite named entities, etc.
Applications: POS tagging

• Morpho-syntactic ambiguity: Part of Speech Tagging

  He was shot in the hand as he chased the robbers in the back street
  (The Wall Street Journal Corpus)
Applications: POS tagging

The "preposition-adverb" tree (reconstructed from the slide figure):

  root                       P(IN)=0.81,  P(RB)=0.19
  └─ Word Form = "As"/"as"   P(IN)=0.83,  P(RB)=0.17
     └─ tag(+1) = RB         P(IN)=0.13,  P(RB)=0.87
        └─ tag(+2) = IN      P(IN)=0.013, P(RB)=0.987  (leaf)
  (the "others" branches are elided in the slide)

Probabilistic interpretation:

  P̂(RB | word="A/as" ∧ tag(+1)=RB ∧ tag(+2)=IN) = 0.987
  P̂(IN | word="A/as" ∧ tag(+1)=RB ∧ tag(+2)=IN) = 0.013
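The decision path of this tree can be read as nested conditionals; a toy Python sketch returning the estimated P(RB) (the branches elided in the slide fall back here to the parent node's estimate, a simplification):

def p_rb(word, tag_next1, tag_next2):
    if word not in ("As", "as"):
        return 0.19                  # root: P(RB) = 0.19
    if tag_next1 != "RB":
        return 0.17                  # Word Form = "As"/"as": P(RB) = 0.17
    if tag_next2 != "IN":
        return 0.87                  # tag(+1) = RB: P(RB) = 0.87
    return 0.987                     # leaf, tag(+2) = IN: P(RB) = 0.987

print(p_rb("as", "RB", "IN"))        # 0.987, e.g. the first "as" in "as_RB soon_RB as_IN"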
Applications: POS tagging

Collocations:
  "as_RB much_RB as_IN"
  "as_RB well_RB as_IN"
  "as_RB soon_RB as_IN"

  (all follow the "preposition-adverb" tree branch shown above)
Applications: POS tagging

RTT (Màrquez & Rodríguez 97):

  [Diagram: raw text → morphological analysis → disambiguation loop (classify → update → filter, repeated until a stop condition holds), using a language model → tagged text]

A Sequential Model for Multi-class Classification: NLP/POS Tagging (Even-Zohar & Roth, 01)
Applications: POS tagging

STT (Màrquez & Rodríguez 97):

  [Diagram: raw text → morphological analysis → disambiguation with the Viterbi algorithm, using lexical and contextual probabilities from a language model → tagged text]

The Use of Classifiers in Sequential Inference: Chunking (Punyakanok & Roth, 00)
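A minimal sketch of the Viterbi disambiguation step in the STT scheme (plain Python; the lexical probabilities P(w|t), contextual bigram probabilities P(t|t_prev), and start probabilities are assumed to be supplied by the language model — the toy numbers below are invented):

def viterbi(words, tags, p_start, p_ctx, p_lex):
    # p_start[t] = P(t | start); p_ctx[tp][t] = P(t | tp); p_lex[t][w] = P(w | t).
    # Unknown words get a small floor probability (crude smoothing).
    V = [{t: (p_start[t] * p_lex[t].get(words[0], 1e-9), [t]) for t in tags}]
    for w in words[1:]:
        col = {}
        for t in tags:
            col[t] = max((V[-1][tp][0] * p_ctx[tp][t] * p_lex[t].get(w, 1e-9),
                          V[-1][tp][1] + [t]) for tp in tags)
        V.append(col)
    return max(V[-1].values())[1]            # most probable tag sequence

tags = {"IN", "RB"}
p_start = {"IN": 0.5, "RB": 0.5}
p_ctx = {"IN": {"IN": 0.3, "RB": 0.7}, "RB": {"IN": 0.6, "RB": 0.4}}
p_lex = {"IN": {"as": 0.2, "soon": 0.0001}, "RB": {"as": 0.1, "soon": 0.3}}
print(viterbi(["as", "soon", "as"], tags, p_start, p_ctx, p_lex))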
Applications: Detection of sequential and hierarchical structures

• Named entity recognition
• Clause detection
Conclusions: Summary

• We have briefly outlined:
  − The ML setting: "supervised learning for classification"
  − Three concrete machine learning algorithms
  − How to apply them to solve intermediate NLP tasks
Conclusions: Summary

• Any ML algorithm for NLP should be:
  – Robust to noise and outliers
  – Efficient in large feature/example spaces
  – Adaptive to new/changing domains: portability, tuning, etc.
  – Able to take advantage of unlabelled examples: semi-supervised learning
Conclusions: Summary

• Statistical and ML-based Natural Language Processing is a very active and multidisciplinary area of research
Conclusions: Some current research lines

• An appropriate learning paradigm for all kinds of NLP problems: TiMBL (DBZ99), TBEDL (Brill95), ME (Ratnaparkhi98), SNoW (Roth98), CRF (Pereira & Singer02), etc.
• Definition of an adequate (and task-specific) feature space: mapping from the input space to a high dimensional feature space, kernels, etc.
• Resolution of complex NLP problems: inference with classifiers + constraint satisfaction
• etc.
Conclusions: Bibliography

• You may find additional information at:
  http://www.lsi.upc.es/~lluism/
    tesi.html
    publicacions/pubs.html
    cursos/talks.html
    cursos/MLandNL.html
    cursos/emnlp1.html

• This talk at:
  http://www.lsi.upc.es/~lluism/udg03.ppt.gz
Seminar: Statistical NLP
Girona, June 2003

Machine Learning for Natural Language Processing

Lluís Màrquez
TALP Research Center
Llenguatges i Sistemes Informàtics
Universitat Politècnica de Catalunya