
    Revisiting Dimensionality Reduction Techniques for NLP

Jagadeesh Jagarlamudi (1)    Raghavendra Udupa (2)

(1) University of Maryland, College Park, Maryland, USA    (2) Microsoft Research Lab India Private Limited, Bangalore, India

ABSTRACT

Many natural language processing (NLP) applications represent words and documents as vectors in a very high dimensional space. The inherently high-dimensional nature of these applications leads to sparse vectors, resulting in poor performance of downstream applications. Dimensionality reduction aims to find a lower dimensional subspace (or simply subspace) which captures the essential information required by the downstream applications. Although it received a lot of attention in the beginning, its popularity in NLP, unlike in other fields such as computer vision, has declined over time. This is partly because, traditionally, it was studied in an unsupervised fashion, and hence the learnt subspace may not be optimal for the task at hand. But recent advances in learning low-dimensional representations in the presence of input and output variables enable us to learn task-specific subspaces that are as effective as the state-of-the-art approaches. In this tutorial, we aim to demonstrate the simplicity and effectiveness of these techniques on a diverse set of NLP tasks. By the end of the tutorial, we hope the attendees will be able to decide "whether or not dimensionality reduction can help their task and if so, how?".

The tutorial begins with an introduction to dimensionality reduction and its importance to NLP. As many of the dimensionality reduction techniques discussed in the tutorial make use of Linear Algebra, we discuss some important concepts including Linear Transformations, Positive Definite Matrices, eigenvalues and eigenvectors. Next, we look at some important dimensionality reduction techniques for data with a single view (PCA, SVD, OPCA). We then take up applications of these techniques to some important NLP problems (word-sense discrimination, POS tagging and Information Retrieval). As NLP often involves more than one language, we look at dimensionality reduction of multiview data using Canonical Correlation Analysis. We discuss some interesting applications of multiview dimensionality reduction (bilingual document projections and mining word-level translations). We also discuss some advanced topics in dimensionality reduction such as Non-Linear and Neural techniques and some application-inspired techniques such as Discriminative Reranking, Supervised Semantic Analysis, and Multilingual Hashing.

We do not assume attendees know anything about dimensionality reduction (though the tutorial should be interesting even to those who know some), but we *do* assume some basic knowledge of linear algebra.


Road Map

Introduction
    NLP and Dimensionality Reduction
    Mathematical Background
Data with Single View
    Techniques
    Applications
    Advanced Topics
Data with Multiple Views
    Techniques
    Applications
    Advanced Topics
Summary


Dimensionality Reduction: Motivation

Many applications involve high dimensional (and often sparse) data

High dimensional data poses several challenges
    Computational
    Difficulty of interpretation
    Overfitting

However, data often lies (approximately) on a low dimensional manifold embedded in the high dimensional space

Dimensionality Reduction: Goal

Given high dimensional data, discover the underlying low dimensional structure

[Figure: 2D embedding of 560-dimensional face image data (He et al., Face Recognition Using Laplacianfaces)]


    Dimensionality Reduction:

    Benefits

    Computational Efficiency

    K-Nearest Neighbor Search

    Data Compression

    Less storage; millions of data points in RAM

    Data Visualization

    2D and 3D Scatter Plots

    Latent Structure and Semantics

    Feature Extraction

    Removing distracting variance from data sets

    Dimensionality Reduction:

    Techniques

Projective Methods
    find low dimensional projections that extract useful information from the data, by maximizing a suitable objective function
    PCA, ICA, LDA

Manifold Modeling Methods
    find a low dimensional subspace that best preserves the manifold structure in the data, by modelling the manifold structure
    LLE, Isomap, Laplacian Eigenmaps


    Dimensionality Reduction:

    Relevance to NLP

    High dimensional data in NLP

    Text Documents

    Context Vectors

    How can Dimensionality Reduction help?

Capture the "semantic" similarity of documents

    Correlate semantically related terms

    Crosslingual similarity


Data Centering

Dataset: X = {x_1, ..., x_n} ⊂ R^d
Mean: μ = (1/n) Σ_i x_i
Centering: x_i ← x_i - μ
Mean after centering: (1/n) Σ_i x_i = (1/n) Σ_i (x_i - μ) = μ - μ = 0
Mean after a linear transformation A: (1/n) Σ_i A x_i = A ((1/n) Σ_i x_i) = A·0 = 0

Data Variance

Centered dataset: X = {x_1, ..., x_n} ⊂ R^d, (1/n) Σ_i x_i = 0
Variance: (1/n) Σ_i ||x_i||^2 = (1/n) tr(X X^T) = tr(C), where C = (1/n) X X^T (sample covariance)
Transformed dataset: AX
Variance after transformation: (1/n) Σ_i ||A x_i||^2 = (1/n) tr(A X X^T A^T) = tr(A C A^T)
Centering doesn't change the data variance

(sketch below)
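The centering and variance identities above are easy to verify numerically. Below is a minimal NumPy sketch (ours, not from the tutorial); the matrix X holds one data point per column, and all variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 1000
X = rng.normal(size=(d, n)) + 3.0           # dataset: columns x_1, ..., x_n in R^d

mu = X.mean(axis=1, keepdims=True)          # mean: mu = (1/n) sum_i x_i
Xc = X - mu                                 # centering: x_i <- x_i - mu
print(np.allclose(Xc.mean(axis=1), 0.0))    # mean after centering is 0

C = (Xc @ Xc.T) / n                         # sample covariance C = (1/n) X X^T
var = (Xc ** 2).sum() / n                   # data variance (1/n) sum_i ||x_i||^2
print(np.isclose(var, np.trace(C)))         # variance = tr(C)

A = rng.normal(size=(3, d))                 # an arbitrary linear transformation
var_A = ((A @ Xc) ** 2).sum() / n           # variance after transformation
print(np.isclose(var_A, np.trace(A @ C @ A.T)))   # = tr(A C A^T)
```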


Positive Definite Matrices

Real: M ∈ R^{p×p}
Square: as many rows as columns
Symmetric: M = M^T
Positive: x^T M x > 0 for all x ≠ 0
Examples:
    the identity matrix I
    C and A^T C A, for a covariance matrix C
Cholesky decomposition: M = L L^T

Eigenvalues and Eigenvectors

M ∈ R^{p×p};  M u = λ u, where u is a (nonzero) vector and λ is a scalar
eigenvector u, eigenvalue λ


Eigensystem of Positive-definite Matrices

M ∈ R^{p×p} positive definite
Positive eigenvalues: λ_i > 0
Real valued eigenvectors: u_i ∈ R^p
Orthonormal eigenvectors: u_i^T u_j = 0 for i ≠ j and u_i^T u_i = 1 (U^T U = I)
Full rank: rank(M) = p
Eigen decomposition: M = U D U^T

Data Variance and Eigenvalues

Centered dataset: X = {x_1, ..., x_n} ⊂ R^d
Data variance: (1/n) Σ_i ||x_i||^2 = tr(C)
Eigen decomposition: C = U D U^T
Data variance: (1/n) Σ_i ||x_i||^2 = tr(C) = Σ_i λ_i


Principal Components Analysis (PCA)

Centered dataset: X = {x_1, ..., x_n}, x_i ∈ R^d
Goal: find an orthonormal linear transformation A: R^d → R^k that maximizes the data variance
    x̂_i = A x_i,  A A^T = I,  data variance = tr(A C A^T)
Mathematical formulation:
    A* = argmax_{A A^T = I} tr(A C A^T)
(A is the linear transformation, A A^T = I enforces an orthonormal basis, and tr(A C A^T) is the data variance)


PCA: Solution

Eigen decomposition of C:
    C = U D U^T
    U = [u_1 u_2 ... u_d],  D = diag(λ_1, λ_2, ..., λ_d),  λ_1 ≥ λ_2 ≥ ... ≥ λ_d
    A* = [u_1 u_2 ... u_k]^T
    x̂_i = A* x_i = [u_1 u_2 ... u_k]^T x_i
MATLAB function: princomp()

PCA: Solution (contd.)

Data variance after transformation:
    A* C A*^T = [u_1 ... u_k]^T U D U^T [u_1 ... u_k] = diag(λ_1, ..., λ_k)
    tr(A* C A*^T) = Σ_{i=1}^k λ_i
Contribution of the i-th component to the data variance: λ_i / Σ_j λ_j

(sketch below)
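A minimal NumPy sketch (ours) of the PCA recipe above: eigen-decompose the covariance, keep the top-k eigenvectors as the rows of A, and check that the result decorrelates the data. MATLAB's princomp() (pca() in current releases) computes the same quantities.

```python
import numpy as np

def pca(X, k):
    """PCA on a d x n matrix whose columns are (already centered) data points."""
    n = X.shape[1]
    C = (X @ X.T) / n                      # sample covariance
    lam, U = np.linalg.eigh(C)             # eigenvalues ascending, eigenvectors in columns
    order = np.argsort(lam)[::-1]          # reorder so lambda_1 >= lambda_2 >= ...
    lam, U = lam[order], U[:, order]
    A = U[:, :k].T                         # A* = [u_1 ... u_k]^T, orthonormal rows
    return A, lam

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 500))
X -= X.mean(axis=1, keepdims=True)          # center

A, lam = pca(X, k=3)
Z = A @ X                                   # transformed data x_hat_i = A x_i
C_hat = (Z @ Z.T) / X.shape[1]
print(np.allclose(C_hat, np.diag(lam[:3]), atol=1e-8))  # A* C A*^T = diag(lam_1..lam_k)
print(lam[:3] / lam.sum())                  # contribution of each component to the variance
```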


PCA: Properties

PCA decorrelates the dataset: A* C A*^T = diag(λ_1, ..., λ_k)
PCA gives the rank-k reconstruction with minimum squared error
PCA is sensitive to the scaling of the original features


Singular Value Decomposition (SVD)

Dataset: X = {x_1, ..., x_n}, x_i ∈ R^d, viewed as the d × n matrix X
X = U Σ V^T = Σ_{i=1}^r σ_i u_i v_i^T,  r = rank(X)
    U ∈ R^{d×r} such that U^T U = I (left singular vectors)
    V ∈ R^{n×r} such that V^T V = I (right singular vectors)
    Σ = diag(σ_1, ..., σ_r) ∈ R^{r×r} (singular values)
Low rank approximation:
    X_k = U Σ_k V^T = Σ_{i=1}^k σ_i u_i v_i^T,  Σ_k = diag(σ_1, ..., σ_k, 0, ..., 0),  k ≤ d

SVD and Data Sphering

Centered dataset: X = {x_1, ..., x_n}, x_i ∈ R^d
X = U Σ V^T,  X X^T = U Σ² U^T = Σ_i σ_i² u_i u_i^T
Note that (1/σ_i) u_i^T X X^T u_i (1/σ_i) = 1
Let U = [u_1 ... u_r], V = [v_1 ... v_r], Σ = diag(σ_1, ..., σ_r)
Then Σ^{-1} U^T X X^T U Σ^{-1} = I, i.e. A X X^T A^T = I where A = Σ^{-1} U^T
The linear transformation A = Σ^{-1} U^T decorrelates the data set

(sketch below)
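A small NumPy sketch (ours) of the SVD pieces above: a rank-k approximation and the sphering transform A = Σ^{-1} U^T. The data is random and only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 200))
X -= X.mean(axis=1, keepdims=True)                 # centered dataset

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(s) V^T
k = 3
Xk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]         # rank-k approximation
print(np.linalg.matrix_rank(Xk))                   # -> k

A = np.diag(1.0 / s) @ U.T                         # sphering transform A = Sigma^{-1} U^T
Z = A @ X
print(np.allclose(Z @ Z.T, np.eye(X.shape[0])))    # A X X^T A^T = I (decorrelated, unit scale)
```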


SVD and Eigen Decomposition

Dataset: X = {x_1, ..., x_n}, x_i ∈ R^d
X = U Σ V^T
X X^T = U Σ V^T V Σ U^T = U Σ² U^T (eigen decomposition)
X^T X = V Σ U^T U Σ V^T = V Σ² V^T (eigen decomposition)
SVD and PCA: SVD on the centered X is the same as PCA on X


Oriented Principal Components Analysis (OPCA)

Generalization of PCA
Along with the signal covariance C_s, a noise covariance C_n is available
When C_n = I (white noise), OPCA = PCA
Seeks projections that maximize the ratio of the variance of the projected signal to the variance of the projected noise
Mathematical formulation:
    A* = argmax_{A C_n A^T = I} tr(A C_s A^T)

OPCA: Solution

Generalized eigenvalue problem:
    C_s U = C_n U D
Equivalent eigenvalue problem: C_n^{-1/2} C_s C_n^{-1/2} W = W D, where W = C_n^{1/2} U
U = [u_1 u_2 ...],  D = diag(λ_1, λ_2, ...),  λ_1 ≥ λ_2 ≥ ...
A* = [u_1 ... u_k]^T,  x̂_i = A* x_i
MATLAB function: eig()

(sketch below)
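The generalized eigenvalue problem C_s u = λ C_n u can be solved directly; below is a hedged SciPy sketch (our choice, mirroring the eig() call mentioned above), with toy covariances standing in for real signal and noise estimates.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
S = rng.normal(size=(6, 300))
N = 0.1 * rng.normal(size=(6, 300))
Cs = S @ S.T / 300                       # signal covariance
Cn = N @ N.T / 300 + 1e-6 * np.eye(6)    # noise covariance (regularized to stay positive definite)

# Generalized symmetric eigenproblem  Cs u = lambda Cn u  (eigenvalues returned ascending)
lam, U = eigh(Cs, Cn)
k = 2
A = U[:, ::-1][:, :k].T                  # top-k generalized eigenvectors as the rows of A
print(lam[::-1][:k])                     # largest signal-to-noise ratios

# Sanity check: with white noise (Cn = I) OPCA reduces to ordinary PCA
lam_pca, _ = eigh(Cs, np.eye(6))
print(np.allclose(np.sort(lam_pca), np.sort(np.linalg.eigvalsh(Cs))))
```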


    OPCA: Properties

    Projections remain the same when the noise

    and signal vectors are globally scaled with

    two different scale factors

    Projected data is not necessarily uncorrelated

    Can be extended to multiview data [Platt et

    al, EMNLP 2010]


    Popular Feature Space Models

Vector Space Model
    Document is represented as a bag-of-words
    Features: words
    Feature weight: TF(w, d) or some variant

Word Space Model
    Word is represented in terms of its context words
    Features: words (with or without position)
    Feature weight: Freq(w, w')

    Turney and Pantel 2010


Curse of dimensionality

We have observations x_i ∈ R^d, and d is usually very large

Vector Space Models
    d = vocab size (number of words in a language)

Word Space Models
    d = vocab size (if position is ignored)
    d = vocab size × L, where L is the window length


    Word Sense Discrimination

Aim: Cluster contexts based on meaning

    Steps:

    1. Represent a word as a point in vector space

    Dimensionality Reduction

    2. Represent each context as a point

    3. Cluster the points using a clustering algorithm

Vector Space: use words as the features

    Feature weight is co-occurrence strength

    1. Word Vectors

    2. Context

    Vectors

    3. Sense Vectors


Word Sense Discrimination: Dimensionality Reduction

Reduce the dimensionality of word vectors:
    W = U Σ V^T;  project each word vector onto the top-k left singular vectors [u_1, ..., u_k]

W = word × context-word co-occurrence matrix, e.g.

                legal   clothes   ...
    judge        210       75     ...
    robe          50      250     ...
    law          240       50     ...
    suit         147      157     ...
    dismisses     96      152     ...

(sketch below)
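A toy NumPy sketch (ours) of this reduction step on a small word × context-word count matrix like the one above; with a real vocabulary-sized matrix one would keep k far smaller than the number of context features.

```python
import numpy as np

# Toy word x context-word co-occurrence matrix W
words = ["judge", "robe", "law", "suit", "dismisses"]
W = np.array([[210.,  75.],
              [ 50., 250.],
              [240.,  50.],
              [147., 157.],
              [ 96., 152.]])               # columns: co-occurrence with "legal", "clothes"

U, s, Vt = np.linalg.svd(W, full_matrices=False)
k = 2                                      # toy value; in practice k << number of context words
word_vecs = U[:, :k] * s[:k]               # reduced word vectors (rows of U_k Sigma_k)

# A context vector is the (normalized) sum of the reduced vectors of the words in the context
def context_vector(tokens):
    v = sum(word_vecs[words.index(t)] for t in tokens if t in words)
    return v / np.linalg.norm(v)

c1 = context_vector(["judge", "law", "dismisses"])   # a "legal" context
c2 = context_vector(["robe", "suit"])                # a "clothes" context
print(c1 @ c2)                                       # cosine between the two contexts
```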

Word Sense Discrimination: Results & Discussion

Averaged results on 20 words:

                        Accuracy
    χ², terms               76
    χ², SVD                 90
    Frequency, terms        81
    Frequency, SVD          88

Schütze 1998


    Part-of-Speech (POS) Tagging

Given a sentence, label words with their POS tags

    Unsupervised Approaches

    Attempt to cluster words

    Align each cluster with a POS tag

    Do not assume a dictionary of tags

    I ate an apple .

    NN VB DT NN .

Schütze 1995; Lamar et al. 2010


    Part-of-Speech Tagging

    Steps

    1. Represent words in appropriate vector space

    Dimensionality Reduction

    2. Cluster using your favorite algorithm

    Vector space should capture syntactic properties

Use the most frequent d words as features

    Frequency of a word in the context as feature weight

Part-of-Speech Tagging: Pass 1

Construct left and right context matrices
    L and R, matrices of size V × d
Dimensionality Reduction: get a rank-r_1 approximation
    L = U Σ V^T,  L* = U Σ,  L* ← row-normalized L*
    R = U Σ V^T,  R* = U Σ,  R* ← row-normalized R*
D = [L* R*] is a V × 2r_1 matrix
Run weighted k-means on D with k_1 clusters

(sketch below)
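A rough sketch (ours) of Pass 1 using NumPy and scikit-learn's KMeans with sample weights as a stand-in for the weighted k-means step; the counts and parameter values are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def reduce_and_normalize(M, r):
    """Rank-r SVD of a count matrix; return the row-normalized U_r * Sigma_r."""
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    Z = U[:, :r] * s[:r]
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    return Z / np.maximum(norms, 1e-12)

# Toy inputs: V word types, d most-frequent feature words, counts of each feature word
# appearing immediately to the left / right of each word type.
rng = np.random.default_rng(0)
V, d = 1000, 50
L = rng.poisson(1.0, size=(V, d)).astype(float)   # left-context counts  (V x d)
R = rng.poisson(1.0, size=(V, d)).astype(float)   # right-context counts (V x d)
freq = L.sum(axis=1) + R.sum(axis=1) + 1.0        # word-type frequencies, used as weights

r1, k1 = 10, 17
D = np.hstack([reduce_and_normalize(L, r1),
               reduce_and_normalize(R, r1)])      # V x 2*r1 descriptor matrix
labels = KMeans(n_clusters=k1, n_init=10, random_state=0).fit_predict(D, sample_weight=freq)
# Pass 2 would rebuild L and R using these k1 cluster labels as the context features.
```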


Part-of-Speech Tagging: Pass 2

The clusters from Pass 1 are not optimal because of sparsity
Construct L and R of size V × k_1, using the k_1 cluster labels as the context features
Dimensionality Reduction: get a rank-r_2 approximation
    L = U Σ V^T,  L* = U Σ,  L* ← row-normalized L*
    R = U Σ V^T,  R* = U Σ,  R* ← row-normalized R*
D = [L* R*] is a V × 2r_2 matrix
Run weighted k-means on D

Part-of-Speech Tagging: Results

Penn Treebank (1.1 M tokens, 43K types), 17 and 45 tags, M-to-1 accuracies:

                        PTB17          PTB45
    SVD2                0.730          0.660
    HMM-EM              0.647          0.621
    HMM-VB              0.637          0.605
    HMM-GS              0.674          0.660
    HMM-Sparse(32)      0.702 (2.2)    0.654 (1.0)
    VEM (10^-1, 10^-1)  0.682 (0.8)    0.546 (1.7)

Lamar et al. 2010


Part-of-Speech Tagging: Discussion

Sensitivity to parameters
    Scaling with singular values
    k-means algorithm
Weighted k-means
    Clusters are initialized to the most frequent word types
Non-disambiguating tagger
Very simple algorithm


Information Retrieval

Rank documents d in response to a query q
Vector Space Model
    Query and doc. are represented as bag-of-words
    Features: words    Feature weight: TFIDF
Lexical Gap
    Polysemy and Synonymy

Information Retrieval: Lexical Gap

Term × Document matrix C:

    ship     1 1
    boat     1
    ocean    1 1
    voyage   1 1 1
    trip     1 1

Is TFIDF weighting better?


Information Retrieval: Latent Semantic Analysis

Term × Document matrix C
Steps:
1. Dimensionality reduction of the term × doc. matrix
2. Folding-in queries: q → f(q)
3. Compute semantic similarity, score(q, d)


Information Retrieval: Latent Semantic Analysis

Term × Document matrix C
Steps:
1. Dimensionality Reduction: C = U Σ V^T (keep the top k dimensions)
2. Folding-in queries:
       d = U Σ d̂  ⇒  d̂ = Σ^{-1} U^T d
       q̂ = Σ^{-1} U^T q


Information Retrieval: Latent Semantic Analysis

Term × Document matrix C,  C = U Σ V^T
Steps:
1. Dimensionality Reduction
2. Folding-in queries: q̂ = Σ^{-1} U^T q
3. Semantic similarity:
       score(q, d) = ⟨q̂, d̂⟩ / (||q̂|| ||d̂||),  where ⟨·,·⟩ denotes the dot product

(sketch below)

Deerwester 1988; Dumais 2005
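A compact NumPy sketch (ours) of LSA with query folding-in and cosine scoring, following the formulas above; the term × document matrix is a toy in the spirit of the example earlier, not the slides' exact data.

```python
import numpy as np

# Toy term x document count matrix C; rows: ship, boat, ocean, voyage, trip
C = np.array([[1, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [1, 0, 0, 1, 1, 0],
              [0, 0, 0, 1, 0, 1]], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(C, full_matrices=False)
Uk, sk = U[:, :k], s[:k]

docs = (np.diag(1.0 / sk) @ Uk.T @ C).T          # reduced documents d_hat = Sigma_k^{-1} U_k^T d

def fold_in(q):
    """Fold a bag-of-words query vector into the latent space."""
    return np.diag(1.0 / sk) @ Uk.T @ q

def score(q_hat, d_hat):
    return q_hat @ d_hat / (np.linalg.norm(q_hat) * np.linalg.norm(d_hat) + 1e-12)

q = np.array([0, 1, 0, 0, 0], dtype=float)       # query containing only "boat"
q_hat = fold_in(q)
print([round(score(q_hat, d), 3) for d in docs]) # docs without "boat" can still get nonzero scores
```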

Information Retrieval: Lexical Gap Revisited

Term × Document matrix C and the new document representations

Original counts:
    ship     1 1
    boat     1
    ocean    1 1
    voyage   1 1 1
    trip     1 1

After dimensionality reduction:
    ship    -1.62  -0.60  -0.44  -0.97  -0.70  -0.26
    boat    -0.46  -0.84  -0.30   1.00   0.35   0.65


Information Retrieval: Results & Discussion

Term × Document matrix C
Fold-in new documents as well
    Deviates from the optimal as we add more docs.
Probabilistic Latent Semantic Analysis

                MED    CRAN   CACM   CISI
    Cos+tfidf   49     35.2   21.9   20.2
    LSA         64.6   38.7   23.8   21.9
    PLSI-U      69.5   38.9   25.3   23.3
    PLSI-Q      63.2   38.6   26.6   23.1

Hofmann 1999


Non-linear Dimensionality Reduction

Laplacian Eigenmaps
    Weight matrix W with similarities (local neighbourhood)
    D_ii = Σ_j W_ij and L = D - W
    min_U tr(U^T L U)  s.t.  U^T D U = I   ⇒   L u = λ D u
    Note: u^T L u = (1/2) Σ_ij W_ij (u_i - u_j)²

(sketch below)
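A short NumPy/SciPy sketch (ours) of the Laplacian Eigenmaps recipe above: build a local similarity graph W, form L = D - W, and take the smallest nontrivial generalized eigenvectors of L u = λ D u. All parameters below are arbitrary choices for illustration.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                      # 100 points in R^5

# Heat-kernel weights restricted to a local neighbourhood (10 nearest neighbours)
dist = cdist(X, X)
sigma = np.median(dist)
W = np.exp(-dist**2 / (2 * sigma**2))
knn = np.argsort(dist, axis=1)[:, 1:11]
mask = np.zeros_like(W, dtype=bool)
rows = np.repeat(np.arange(len(X)), knn.shape[1])
mask[rows, knn.ravel()] = True
W = np.where(mask | mask.T, W, 0.0)                # keep a symmetric local graph
np.fill_diagonal(W, 0.0)

D = np.diag(W.sum(axis=1))
L = D - W
lam, U = eigh(L, D)                                # generalized problem L u = lambda D u
Y = U[:, 1:3]                                      # skip the trivial constant vector; 2-D embedding
print(lam[:3])                                     # first eigenvalue is (numerically) zero
```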


    Neural Embeddings

    Word is represented as a vector of size I

    Learning

    Optimize such that log-likelihood is maximized

    Gradient ascent

    Learns parameters and word vectors simultaneously

    Learned word-vectors capture semantics

    Learn to perform multiple tasks simultaneously

    Bengio et al 2003; Collobert and Weston 2008


Canonical Correlation Analysis (CCA)

Centered datasets:
    X = {x_1, ..., x_n} ⊂ R^{d_x},  Y = {y_1, ..., y_n} ⊂ R^{d_y}
Project X and Y along a ∈ R^{d_x} and b ∈ R^{d_y}:
    s = (a^T x_1, ..., a^T x_n),  t = (b^T y_1, ..., b^T y_n)
Data correlation after transformation:
    cos(s, t) = s^T t / (||s|| ||t||) = a^T X Y^T b / sqrt((a^T X X^T a)(b^T Y Y^T b))


CCA (contd.)

Covariance matrices:
    C_xy = X Y^T,  C_xx = X X^T,  C_yy = Y Y^T
Correlation in terms of covariance matrices:
    cos(s, t) = a^T C_xy b / sqrt((a^T C_xx a)(b^T C_yy b))
Directions that maximize data correlation:
    (a*, b*) = argmax_{a,b} a^T C_xy b / sqrt((a^T C_xx a)(b^T C_yy b))

CCA: Formulation

Goal: find linear transformations A, B that maximize the data correlation
Optimization problem:
    (A*, B*) = argmax_{A,B} tr(A C_xy B^T)   s.t.   A C_xx A^T = I,  B C_yy B^T = I


CCA: Solution

Generalized eigenvalue problem:
    C_xy b = λ C_xx a,   C_yx a = λ C_yy b
It can be shown that the two problems share the same eigenvalues, that b ∝ C_yy^{-1} C_yx a, and that
    C_xy C_yy^{-1} C_yx a = λ² C_xx a
MATLAB function: canoncorr()

(sketch below)
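A small NumPy sketch (ours) that solves CCA by whitening both views and taking an SVD of the whitened cross-covariance, which is equivalent to the eigenproblem above; MATLAB's canoncorr() (or scikit-learn's CCA) gives comparable projections. The data below is synthetic.

```python
import numpy as np

def cca(X, Y, k, reg=1e-8):
    """CCA for column-sample matrices X (dx x n) and Y (dy x n), both centered."""
    n = X.shape[1]
    Cxx = X @ X.T / n + reg * np.eye(X.shape[0])
    Cyy = Y @ Y.T / n + reg * np.eye(Y.shape[0])
    Cxy = X @ Y.T / n
    # Whiten both views, then SVD of the cross-covariance of the whitened views
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx))
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy))
    U, corrs, Vt = np.linalg.svd(Wx @ Cxy @ Wy.T)
    A = U[:, :k].T @ Wx                  # rows of A project the X view
    B = Vt[:k, :] @ Wy                   # rows of B project the Y view
    return A, B, corrs[:k]

rng = np.random.default_rng(0)
Z = rng.normal(size=(2, 500))                      # shared latent signal
X = np.vstack([Z, rng.normal(size=(3, 500))])      # view 1: signal + noise dimensions
Y = np.vstack([-Z, rng.normal(size=(4, 500))])     # view 2: correlated with view 1
X -= X.mean(axis=1, keepdims=True)
Y -= Y.mean(axis=1, keepdims=True)

A, B, corrs = cca(X, Y, k=2)
print(corrs)                                       # the top canonical correlations, close to 1
```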


Bilingual Document Projections

Applications:
    Comparable and Parallel Document Retrieval
    Cross-language text categorization
Steps:
1. Represent each document as a vector
    Two different vector spaces, one per language
2. Use CCA to find linear transformations (A, B)
3. Find new aligned documents using A and B

Bilingual Document Projections

Steps:
1. Represent each document as a vector
    Vector Space:
        Features: most frequent 20K content words
        Feature weight: TFIDF weighting
    Training Data:
        x_i ∈ R^{d_E}: bag of English words
        y_i ∈ R^{d_H}: bag of Hindi words
        (x_i, y_i), i = 1 ... n;  X = [x_1 x_2 ... x_n],  Y = [y_1 y_2 ... y_n]


Bilingual Document Projections

Steps:
1. Represent each document as a vector
2. Use CCA to find linear transformations A and B
3. Find new aligned documents using A and B
    Scoring:
        Score(x, y) ∝ ⟨A x, B y⟩  (cosine similarity in the projected space; sketch below)
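With A and B from CCA (for instance the cca() sketch above), scoring a candidate document pair reduces to a cosine in the shared space. The matrices and vectors below are random stand-ins for the TFIDF features and learned projections.

```python
import numpy as np

def score(x, y, A, B):
    """Cosine similarity of a document pair in the shared CCA space."""
    u, v = A @ x, B @ y
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

# Toy stand-ins: in the task above, x and y would be TFIDF vectors over the 20K most
# frequent content words of each language, and A, B would come from CCA on aligned pairs.
rng = np.random.default_rng(0)
A, B = rng.normal(size=(5, 50)), rng.normal(size=(5, 60))
english_docs = rng.random((10, 50))
hindi_docs = rng.random((10, 60))

# Mate retrieval: for each English document pick the highest-scoring Hindi document
for i, x in enumerate(english_docs):
    j = max(range(len(hindi_docs)), key=lambda j: score(x, hindi_docs[j], A, B))
    print(i, "->", j)
```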


    Bilingual Document Projections :

    Results & Discussion

    Accuracy MRR

    OPCA 72.55 77.34

    Word-by-word 70.33 74.67

    CCA 68.94 73.78

    Word-by-word (5000) 67.86 72.36

    CL-LSI 53.02 61.30

    Untranslated 46.92 53.83

    CPLSA 45.79 51.30

    JPLSA 33.22 36.19

    Platt et al 2010


    Mining Word-Level Translations

Training Data: word-level seed translations
Task: Mine translations for new words, e.g. the translation of "…"?
Resources: monolingual comparable corpora

    English Spanish P(s|e)

    state estado 0.5

    state declarar 0.3

    society sociedad 0.4

society compañía 0.35

    company sociedad 0.8

    Mining Word-Level Translations

    Applications:

    Lexicon induction for resource poor languages

    Mining translations for unknown words in MT

    Steps:

    1. W]v]vP}(^}]_

    2. Represent each word as vector

    Two different feature spaces, one per language

    3. Use CCA to find transformations #and $

    4. Use#and $to mine new word translations


    Mining Word-Level Translations

    Steps:

    1. W]v]vP}(^}]_

    2. Represent each word as a vector

    Vector Space

    Features: context words (WSM); Orthography

    Feature Weights: TFIDF weights

    Can be computed using ONLY comparable corpora

(x_i, y_i), i = 1 ... n;  X = [x_1 x_2 ... x_n],  Y = [y_1 y_2 ... y_n]


    Mining Word-Level Translations

    Steps:

    1. W]v]vP}(^}]_

    2. Represent each word as a vector

    3. Use CCA to find transformations #and $

    4. Use#and $to mine new word translations

    Scoring

Score(e, s) ∝ ⟨A x_e, B y_s⟩


    Mining Word-Level Translations :

    Results & Discussion

    Seed lexicon size 100

    Bootstrapping

    Results are lower for other language pairs

    Best-

    EditDist 58.6 62.6 61.1 47.4

    Ortho 76.0 81.3 80.1 52.3 55.0

    Context 91.1 81.3 80.2 65.3 58.0

    Both 87.2 89.7 89.0 89.7 72.0

    Haghighi et al 2008

Mining Word-Level Translations: Results & Discussion

Mining translations for unknown words
OOV words for MT domain adaptation

MT accuracies (BLEU):

                    News    Emea    Subs    PHP
    German
      Baseline      23.00   26.62   10.26   38.67
      +ve change     0.80    1.44    0.13    0.28
    French
      Baseline      27.30   40.46   16.91   28.12
      +ve change     0.36    1.51    0.61    0.68

Daumé and Jagarlamudi 2011


Supervised Semantic Indexing

Task: Learn to rank ads a for a given doc. d

Training Data:
    Pairs (d, a+) of webpages and clicked ads
    Randomly chosen pairs (d, a-)

Steps:
1. Represent an ad a and a doc. d as vectors
2. Learn a scoring function f(a, d)
3. Rank ads for a given document

Bai et al. 2009

    Supervised Semantic Indexing

    Steps :

    1. Represent ads and docs. as vectors

    Vector Space

    Bag-of-word representation

    Features: words

    Feature weights: TFIDF weight

a and d are vectors of size V


Supervised Semantic Indexing

Steps:
1. Represent ads and docs. as vectors
2. Learn a scoring function f(a, d)

Scoring function:
    f(a, d) = d^T W a,  parameters W ∈ R^{V×V}
    W = I: cosine similarity
    W = D (diagonal): reweighting of words
    W = U^T V + I: dimensionality reduction, different treatment for ads and documents
    W = U^T U + I: dimensionality reduction, SAME treatment for ads and documents

Supervised Semantic Indexing: Learn Scoring Function

Max-margin: want f(d, a+) - f(d, a-) ≥ 1
Objective: Σ max(0, 1 - f(d, a+) + f(d, a-))
Sub-gradient descent

(sketch below)
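A condensed sketch (ours) of the low-rank SSI scorer f(a, d) = d^T (U^T V + I) a trained with the margin ranking loss by stochastic sub-gradient descent; dimensions, learning rate and the random triples are invented for illustration.

```python
import numpy as np

class SSI:
    """Low-rank scorer f(a, d) = d^T (U^T V + I) a, trained with a margin ranking loss."""
    def __init__(self, vocab, k, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.U = 0.01 * rng.normal(size=(k, vocab))
        self.V = 0.01 * rng.normal(size=(k, vocab))
        self.lr = lr

    def score(self, d, a):
        return (self.U @ d) @ (self.V @ a) + d @ a      # low-rank part + identity part

    def update(self, d, a_pos, a_neg):
        # Hinge loss max(0, 1 - f(d, a+) + f(d, a-)); take a sub-gradient step when violated
        if 1 - self.score(d, a_pos) + self.score(d, a_neg) > 0:
            Ud = self.U @ d                              # uses the pre-update U for the V step
            self.U += self.lr * np.outer(self.V @ (a_pos - a_neg), d)
            self.V += self.lr * np.outer(Ud, a_pos - a_neg)

# Toy run with random bag-of-words vectors standing in for (webpage, clicked ad, random ad) triples
rng = np.random.default_rng(1)
model = SSI(vocab=1000, k=20)
for _ in range(100):
    d, a_pos, a_neg = rng.random(1000), rng.random(1000), rng.random(1000)
    model.update(d, a_pos, a_neg)
```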


Supervised Semantic Indexing

Steps:
1. Represent ads and docs. as vectors
2. Learn a scoring function f(a, d)
3. Rank ads for a given document
    Ranking ads: compute the score using f(a, d) and rank

Supervised Semantic Indexing: Results & Discussion

1.9 M pairs for training, 100K pairs for testing

                          Parameters    Rank Loss
    TFIDF                                 45.60
    SSI: W = U^T V + I    50 × 10k        25.83
    SSI: W = U^T V + I    50 × 20k        26.68
    SSI: W = U^T V + I    50 × 30k        26.98

Bai et al. 2009


Supervised Semantic Indexing: Results & Discussion

Ranking Wikipedia pages for queries (rank loss); performs better when the training data is big

                          K=5     K=10    K=20
    TFIDF                 21.6    14.0    9.14
    αLSI + (1-α)TFIDF     14.2    9.73    6.36
    SSI: W = U^T U + I    4.80    3.10    1.87
    SSI: W = U^T V + I    4.37    2.91    1.80

Bai et al. 2009


Low-Dimensional Discriminative Reranking

Reranker operates in the outer product space [Szedmak et al., 2006; Wang et al., 2007]
Weight vector is constrained [Bai et al. 2010]

    x = (x_1, x_2, x_3, x_4, ...),  y = (y_1, y_2, y_3, y_4, ...)
    x ⊗ y = (x_1 y_1, x_1 y_2, ..., x_{d_1} y_{d_2}) is a vector of length d_1 × d_2
    w = a ⊗ b


Low-Dimensional Reranking

Find A and B s.t. ⟨A x, B y⟩ ranks the correct candidate y above the other candidates y_1, y_2, y_3, ... for an input x

Idea [Tsochantaridis et al. 04]:
1. Score: score(x, y) = (A x)^T (B y)
2. Add constraints to penalize incorrect candidates:
       score(x, y*) ≥ score(x, y) + 1 - ξ,   ξ ≥ 0


Discriminative Model

    α ← initial values                                    // Initialization
    Repeat
        (A, B) ← Softened-Disc(α)                         // Get the current soln.
        m_i ← ⟨A x_i, B y_i*⟩ - ⟨A x_i, B y_i⟩             // Compute margins
        ξ_i ← 1 - m_i                                      // Potential slack
        ξ_i ← max(0, ξ_i)                                  // Compute slack
        If ξ_i > 0, update the Lagrangian variable α_i     // Update the Lagrangian variables
    Until convergence                                      // Slack doesn't change


POS Tagging

Combine with the Viterbi score
Interpolation parameter is tuned

Training: input sentence and reference tag sequences; candidates, score and loss values

Testing example: "Buyers stepped in to the futures pit ."

    Score
    -7.0514    NNS VBD IN TO DT NNS NN .
    -0.1947    NNS VBD RP TO DT NNS NN .
    -6.8068    NNS VBD RB TO DT NNS NN .
    -7.1408    NNS VBD RP TO DT NNS VB .
    -13.752    NNS VBD RB TO DT NNS VB .

    POS Tagging

    Combine with Viterbi score

    Interpolation parameter is tuned

    Data Statistics

    Results

    English Chinese French Swedish

    Baseline 96.15 92.31 97.41 93.23

    Collins 96.06 92.81 97.35 93.44

    Regularized 96.00 92.88 97.38 93.35

    Oracle 98.39 98.19 99.00 96.48


    POS Tagging

    Combine with Viterbi score

    Interpolation parameter is tuned

    Data Statistics

    Results

    English Chinese French Swedish

    Baseline 96.15 92.31 97.41 93.23

    Collins 96.06 92.81 97.35 93.44

    Regularized 96.00 92.88 97.38 93.35

    Oracle 98.39 98.19 99.00 96.48

Softened-Disc 96.32 92.87 97.53 93.24

Discriminative 96.3 92.91 97.53 93.36

    POS Tagging

    Zo}v]vY

    Interpolation with Viterbi score is crucial

    Softened-Disc

Independent of the no. of training examples
Easy to code and can be solved exactly

English Chinese French Swedish

Softened-Disc +0.17 +0.56 +0.12 +0.01

Discriminative +0.15 +0.6 +0.12 +0.13

Softened-Disc* +0.92 +4.31 +1.12 +0.08

Discriminative* +0.88 +4.77 +0.9 +0.73

Jagarlamudi and Daumé 2012


    Similarity Search: Challenges

    Computing nearest neighbors in high

    dimensions using geometric search

    techniques is very difficult

All methods are about as bad as brute-force linear search, which is expensive

    Approximate techniques such as ANN perform

    efficiently in dimensions as high as 20; in higher

    dimensions, the results are rather spotty

    Need to do search on commodity hardware

    Cross-language search


    What is the advantage?

    Scales easily to very large databases

    Compact language-independent representation

    32 bits per object

    Search is effective and efficient

    Hamming nearest-neighbor search

Few milliseconds per query for searching a million objects (single thread on a single processor)


    What is the challenge?

    Language/Script Independent HashCodes

    Learning Hash Functions from

    Training Data


Learning Hash Functions: Summary

Given a set of parallel names as training data, find the top K projection vectors for each language using Canonical Correlation Analysis.
Each projection vector gives a 1-bit hash function.
The hash code for a name is computed by projecting its feature vector onto the projection vectors, followed by binarization. (sketch below)

    Udupa & Kumar, 2010

Fuzzy Name Search: Experimental Setup

Test Sets:
    DUMBTIONARY: 1231 misspelled names
    INTRANET: 200 misspelled names
Name Directories:
    DUMBTIONARY: 550K names from Wikipedia
    INTRANET: 150K employee names
Training Data: 15K pairs of single-token names in English and Hindi
Baselines: two popular search engines, Double Metaphone, BM25


    Multilingual: Experimental Setup

    Test Sets

    1000 multi-word names each in Russian, Hebrew,

    Kannada, Tamil, Hindi

    Name Directory:

    English Wikipedia Titles

    6 Million Titles, 2 Million Unique Words

    Baseline:

    State-of-the-art Machine Transliteration (NEWS2009)


Dimensionality Reduction

[Chart: number of dimensionality reduction papers per year, 1990-2010, in computer vision vs. NLP]
[Chart: popularity of dimensionality reduction relative to Bayesian approaches, 1990-2010, vision vs. NLP]


Summary

Dimensionality reduction has merits for NLP
    Computational efficiency and feature correlations
Has been explored in an unsupervised fashion
But recent novel developments
    For multi-view data
If you can formulate your problem as a mapping
    Try dimensionality reduction
    Can solve for the global optimum

Summary

Spectral Learning
    Provides a way to learn the global optimum for generative models
Enriching the existing models
    Using word embeddings instead of words
Scalability of the techniques
    Don't depend on the number of samples
    Large scale SVD


References

Jagadeesh Jagarlamudi and Hal Daumé III. Low-Dimensional Discriminative Reranking. In HLT-NAACL 2012.
Shaishav Kumar and Raghavendra Udupa. Learning Hash Functions for Cross-View Similarity Search. In IJCAI 2011.
Raghavendra Udupa and Shaishav Kumar. Hashing-based Approaches to Spelling Correction of Personal Names. In Proceedings of EMNLP 2010.
Raghavendra Udupa and Mitesh Khapra. Transliteration Equivalence using Canonical Correlation Analysis. In ECIR 2010.
Jagadeesh Jagarlamudi and Hal Daumé III. Regularized Interlingual Projections: Evaluation on Multilingual Transliteration. In Proceedings of EMNLP-CoNLL 2012.