
    Revisiting Dimensionality Reduction Techniques for NLP

Jagadeesh Jagarlamudi (1)    Raghavendra Udupa (2)

(1) University of Maryland, College Park, Maryland, USA    (2) Microsoft Research Lab India Private Limited, Bangalore, India

ABSTRACT

Many natural language processing (NLP) applications represent words and documents as vectors in a very high dimensional space. The inherently high-dimensional nature of these applications leads to sparse vectors, resulting in poor performance of downstream applications. Dimensionality reduction aims to find a lower dimensional subspace (or simply subspace) which captures the essential information required by the downstream applications. Although it received a lot of attention in the beginning, its popularity in NLP, unlike in other fields such as computer vision, has declined over time. This is partly because, traditionally, it was studied in an unsupervised fashion, and hence the learnt subspace may not be optimal for the task at hand. But recent advances in learning low-dimensional representations in the presence of input and output variables enable us to learn task-specific subspaces that are as effective as the state-of-the-art approaches. In this tutorial, we aim to demonstrate the simplicity and effectiveness of these techniques on a diverse set of NLP tasks. By the end of the tutorial, we hope the attendees will be able to decide "whether or not dimensionality reduction can help their task and if so, how?".

The tutorial begins with an introduction to dimensionality reduction and its importance to NLP. As many of the dimensionality reduction techniques discussed in the tutorial make use of Linear Algebra, we discuss some important concepts including Linear Transformations, Positive Definite Matrices, eigenvalues and eigenvectors. Next, we look at some important dimensionality reduction techniques for data with a single view (PCA, SVD, OPCA). We then take up applications of these techniques to some important NLP problems (word-sense discrimination, POS tagging and Information Retrieval). As NLP often involves more than one language, we look at dimensionality reduction of multiview data using Canonical Correlation Analysis. We discuss some interesting applications of multiview dimensionality reduction (bilingual document projections and mining word-level translations). We also discuss some advanced topics in dimensionality reduction such as Non-Linear and Neural techniques and some application-inspired techniques such as Discriminative Reranking, Supervised Semantic Analysis, and Multilingual Hashing.

We do not assume attendees know anything about dimensionality reduction (though the tutorial should be interesting even to those who know some), but we *do* assume some basic knowledge of linear algebra.


Road Map

Introduction
    NLP and Dimensionality Reduction
    Mathematical Background
Data with Single View
    Techniques
    Applications
    Advanced Topics
Data with Multiple Views
    Techniques
    Applications
    Advanced Topics
Summary


Dimensionality Reduction: Motivation

Many applications involve high dimensional (and often sparse) data

High dimensional data poses several challenges
    Computational
    Difficulty of interpretation
    Overfitting

However, data often lies (approximately) on a low dimensional manifold embedded in the high dimensional space

Dimensionality Reduction: Goal

Given high dimensional data, discover the underlying low dimensional structure

[Figure: 2D embedding of 560-dimensional face image data (He et al., Face Recognition Using Laplacianfaces)]


    Dimensionality Reduction:

    Benefits

    Computational Efficiency

    K-Nearest Neighbor Search

    Data Compression

    Less storage; millions of data points in RAM

    Data Visualization

    2D and 3D Scatter Plots

    Latent Structure and Semantics

    Feature Extraction

    Removing distracting variance from data sets

    Dimensionality Reduction:

    Techniques

Projective Methods
    find low dimensional projections that extract useful information from the data, by maximizing a suitable objective function
    PCA, ICA, LDA

Manifold Modeling Methods
    find a low dimensional subspace that best preserves the manifold structure in the data, by modelling the manifold structure
    LLE, Isomap, Laplacian Eigenmaps


    Dimensionality Reduction:

    Relevance to NLP

    High dimensional data in NLP

    Text Documents

    Context Vectors

    How can Dimensionality Reduction help?

Capture the "semantic" similarity of documents

    Correlate semantically related terms

    Crosslingual similarity


Data Centering

Dataset: X = {x_1, ..., x_n} ⊂ R^d
Mean: μ = (1/n) Σ_i x_i
Centering: x_i ← x_i - μ
Mean after centering: (1/n) Σ_i x_i = (1/n) Σ_i (x_i - μ) = μ - μ = 0
Mean after a linear transformation A: (1/n) Σ_i A x_i = A ((1/n) Σ_i x_i) = A·0 = 0

Data Variance

Centered dataset: X = {x_1, ..., x_n} ⊂ R^d, (1/n) Σ_i x_i = 0
Variance: (1/n) Σ_i ||x_i||^2 = (1/n) tr(X X^T) = tr(C), where C = (1/n) X X^T (sample covariance)
Transformed dataset: AX
Variance after transformation: (1/n) Σ_i ||A x_i||^2 = (1/n) tr(A X X^T A^T) = tr(A C A^T)
Centering doesn't change the data variance

(sketch below)
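The centering and variance identities above are easy to verify numerically. Below is a minimal NumPy sketch (ours, not from the tutorial); the matrix X holds one data point per column, and all variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 1000
X = rng.normal(size=(d, n)) + 3.0           # dataset: columns x_1, ..., x_n in R^d

mu = X.mean(axis=1, keepdims=True)          # mean: mu = (1/n) sum_i x_i
Xc = X - mu                                 # centering: x_i <- x_i - mu
print(np.allclose(Xc.mean(axis=1), 0.0))    # mean after centering is 0

C = (Xc @ Xc.T) / n                         # sample covariance C = (1/n) X X^T
var = (Xc ** 2).sum() / n                   # data variance (1/n) sum_i ||x_i||^2
print(np.isclose(var, np.trace(C)))         # variance = tr(C)

A = rng.normal(size=(3, d))                 # an arbitrary linear transformation
var_A = ((A @ Xc) ** 2).sum() / n           # variance after transformation
print(np.isclose(var_A, np.trace(A @ C @ A.T)))   # = tr(A C A^T)
```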


Positive Definite Matrices

Real: M ∈ R^{p×p}
Square: as many rows as columns
Symmetric: M = M^T
Positive: x^T M x > 0 for all x ≠ 0
Examples:
    the identity matrix I
    C and A^T C A, for a covariance matrix C
Cholesky decomposition: M = L L^T

Eigenvalues and Eigenvectors

M ∈ R^{p×p};  M u = λ u, where u is a (nonzero) vector and λ is a scalar
eigenvector u, eigenvalue λ


Eigensystem of Positive-definite Matrices

M ∈ R^{p×p} positive definite
Positive eigenvalues: λ_i > 0
Real valued eigenvectors: u_i ∈ R^p
Orthonormal eigenvectors: u_i^T u_j = 0 for i ≠ j and u_i^T u_i = 1 (U^T U = I)
Full rank: rank(M) = p
Eigen decomposition: M = U D U^T

Data Variance and Eigenvalues

Centered dataset: X = {x_1, ..., x_n} ⊂ R^d
Data variance: (1/n) Σ_i ||x_i||^2 = tr(C)
Eigen decomposition: C = U D U^T
Data variance: (1/n) Σ_i ||x_i||^2 = tr(C) = Σ_i λ_i


Principal Components Analysis (PCA)

Centered dataset: X = {x_1, ..., x_n}, x_i ∈ R^d
Goal: find an orthonormal linear transformation A: R^d → R^k that maximizes the data variance
    x̂_i = A x_i,  A A^T = I,  data variance = tr(A C A^T)
Mathematical formulation:
    A* = argmax_{A A^T = I} tr(A C A^T)
(A is the linear transformation, A A^T = I enforces an orthonormal basis, and tr(A C A^T) is the data variance)


PCA: Solution

Eigen decomposition of C:
    C = U D U^T
    U = [u_1 u_2 ... u_d],  D = diag(λ_1, λ_2, ..., λ_d),  λ_1 ≥ λ_2 ≥ ... ≥ λ_d
    A* = [u_1 u_2 ... u_k]^T
    x̂_i = A* x_i = [u_1 u_2 ... u_k]^T x_i
MATLAB function: princomp()

PCA: Solution (contd.)

Data variance after transformation:
    A* C A*^T = [u_1 ... u_k]^T U D U^T [u_1 ... u_k] = diag(λ_1, ..., λ_k)
    tr(A* C A*^T) = Σ_{i=1}^k λ_i
Contribution of the i-th component to the data variance: λ_i / Σ_j λ_j

(sketch below)
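A minimal NumPy sketch (ours) of the PCA recipe above: eigen-decompose the covariance, keep the top-k eigenvectors as the rows of A, and check that the result decorrelates the data. MATLAB's princomp() (pca() in current releases) computes the same quantities.

```python
import numpy as np

def pca(X, k):
    """PCA on a d x n matrix whose columns are (already centered) data points."""
    n = X.shape[1]
    C = (X @ X.T) / n                      # sample covariance
    lam, U = np.linalg.eigh(C)             # eigenvalues ascending, eigenvectors in columns
    order = np.argsort(lam)[::-1]          # reorder so lambda_1 >= lambda_2 >= ...
    lam, U = lam[order], U[:, order]
    A = U[:, :k].T                         # A* = [u_1 ... u_k]^T, orthonormal rows
    return A, lam

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 500))
X -= X.mean(axis=1, keepdims=True)          # center

A, lam = pca(X, k=3)
Z = A @ X                                   # transformed data x_hat_i = A x_i
C_hat = (Z @ Z.T) / X.shape[1]
print(np.allclose(C_hat, np.diag(lam[:3]), atol=1e-8))  # A* C A*^T = diag(lam_1..lam_k)
print(lam[:3] / lam.sum())                  # contribution of each component to the variance
```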


PCA: Properties

PCA decorrelates the dataset: A* C A*^T = diag(λ_1, ..., λ_k)
PCA gives the rank-k reconstruction with minimum squared error
PCA is sensitive to the scaling of the original features


Singular Value Decomposition (SVD)

Dataset: X = {x_1, ..., x_n}, x_i ∈ R^d, viewed as the d × n matrix X
X = U Σ V^T = Σ_{i=1}^r σ_i u_i v_i^T,  r = rank(X)
    U ∈ R^{d×r} such that U^T U = I (left singular vectors)
    V ∈ R^{n×r} such that V^T V = I (right singular vectors)
    Σ = diag(σ_1, ..., σ_r) ∈ R^{r×r} (singular values)
Low rank approximation:
    X_k = U Σ_k V^T = Σ_{i=1}^k σ_i u_i v_i^T,  Σ_k = diag(σ_1, ..., σ_k, 0, ..., 0),  k ≤ d

SVD and Data Sphering

Centered dataset: X = {x_1, ..., x_n}, x_i ∈ R^d
X = U Σ V^T,  X X^T = U Σ² U^T = Σ_i σ_i² u_i u_i^T
Note that (1/σ_i) u_i^T X X^T u_i (1/σ_i) = 1
Let U = [u_1 ... u_r], V = [v_1 ... v_r], Σ = diag(σ_1, ..., σ_r)
Then Σ^{-1} U^T X X^T U Σ^{-1} = I, i.e. A X X^T A^T = I where A = Σ^{-1} U^T
The linear transformation A = Σ^{-1} U^T decorrelates the data set

(sketch below)
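A small NumPy sketch (ours) of the SVD pieces above: a rank-k approximation and the sphering transform A = Σ^{-1} U^T. The data is random and only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 200))
X -= X.mean(axis=1, keepdims=True)                 # centered dataset

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(s) V^T
k = 3
Xk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]         # rank-k approximation
print(np.linalg.matrix_rank(Xk))                   # -> k

A = np.diag(1.0 / s) @ U.T                         # sphering transform A = Sigma^{-1} U^T
Z = A @ X
print(np.allclose(Z @ Z.T, np.eye(X.shape[0])))    # A X X^T A^T = I (decorrelated, unit scale)
```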


SVD and Eigen Decomposition

Dataset: X = {x_1, ..., x_n}, x_i ∈ R^d
X = U Σ V^T
X X^T = U Σ V^T V Σ U^T = U Σ² U^T (eigen decomposition)
X^T X = V Σ U^T U Σ V^T = V Σ² V^T (eigen decomposition)
SVD and PCA: SVD on the centered X is the same as PCA on X


Oriented Principal Components Analysis (OPCA)

Generalization of PCA
Along with the signal covariance C_s, a noise covariance C_n is available
When C_n = I (white noise), OPCA = PCA
Seeks projections that maximize the ratio of the variance of the projected signal to the variance of the projected noise
Mathematical formulation:
    A* = argmax_{A C_n A^T = I} tr(A C_s A^T)

OPCA: Solution

Generalized eigenvalue problem:
    C_s U = C_n U D
Equivalent eigenvalue problem: C_n^{-1/2} C_s C_n^{-1/2} W = W D, where W = C_n^{1/2} U
U = [u_1 u_2 ...],  D = diag(λ_1, λ_2, ...),  λ_1 ≥ λ_2 ≥ ...
A* = [u_1 ... u_k]^T,  x̂_i = A* x_i
MATLAB function: eig()

(sketch below)
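The generalized eigenvalue problem C_s u = λ C_n u can be solved directly; below is a hedged SciPy sketch (our choice, mirroring the eig() call mentioned above), with toy covariances standing in for real signal and noise estimates.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
S = rng.normal(size=(6, 300))
N = 0.1 * rng.normal(size=(6, 300))
Cs = S @ S.T / 300                       # signal covariance
Cn = N @ N.T / 300 + 1e-6 * np.eye(6)    # noise covariance (regularized to stay positive definite)

# Generalized symmetric eigenproblem  Cs u = lambda Cn u  (eigenvalues returned ascending)
lam, U = eigh(Cs, Cn)
k = 2
A = U[:, ::-1][:, :k].T                  # top-k generalized eigenvectors as the rows of A
print(lam[::-1][:k])                     # largest signal-to-noise ratios

# Sanity check: with white noise (Cn = I) OPCA reduces to ordinary PCA
lam_pca, _ = eigh(Cs, np.eye(6))
print(np.allclose(np.sort(lam_pca), np.sort(np.linalg.eigvalsh(Cs))))
```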


    OPCA: Properties

    Projections remain the same when the noise

    and signal vectors are globally scaled with

    two different scale factors

    Projected data is not necessarily uncorrelated

    Can be extended to multiview data [Platt et

    al, EMNLP 2010]


    Popular Feature Space Models

Vector Space Model
    Document is represented as a bag-of-words
    Features: words
    Feature weight: TF(w, d) or some variant

Word Space Model
    Word is represented in terms of its context words
    Features: words (with or without position)
    Feature weight: Freq(w, w')

    Turney and Pantel 2010


Curse of dimensionality

We have observations x_i ∈ R^d, and d is usually very large

Vector Space Models
    d = vocab size (number of words in a language)

Word Space Models
    d = vocab size (if position is ignored)
    d = vocab size × L, where L is the window length


    Word Sense Discrimination

Aim: Cluster contexts based on meaning

    Steps:

    1. Represent a word as a point in vector space

    Dimensionality Reduction

    2. Represent each context as a point

    3. Cluster the points using a clustering algorithm

Vector Space: use words as the features

    Feature weight is co-occurrence strength

    1. Word Vectors

    2. Context

    Vectors

    3. Sense Vectors


Word Sense Discrimination: Dimensionality Reduction

Reduce the dimensionality of word vectors:
    W = U Σ V^T;  project each word vector onto the top-k left singular vectors [u_1, ..., u_k]

W = word × context-word co-occurrence matrix, e.g.

                legal   clothes   ...
    judge        210       75     ...
    robe          50      250     ...
    law          240       50     ...
    suit         147      157     ...
    dismisses     96      152     ...

(sketch below)
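A toy NumPy sketch (ours) of this reduction step on a small word × context-word count matrix like the one above; with a real vocabulary-sized matrix one would keep k far smaller than the number of context features.

```python
import numpy as np

# Toy word x context-word co-occurrence matrix W
words = ["judge", "robe", "law", "suit", "dismisses"]
W = np.array([[210.,  75.],
              [ 50., 250.],
              [240.,  50.],
              [147., 157.],
              [ 96., 152.]])               # columns: co-occurrence with "legal", "clothes"

U, s, Vt = np.linalg.svd(W, full_matrices=False)
k = 2                                      # toy value; in practice k << number of context words
word_vecs = U[:, :k] * s[:k]               # reduced word vectors (rows of U_k Sigma_k)

# A context vector is the (normalized) sum of the reduced vectors of the words in the context
def context_vector(tokens):
    v = sum(word_vecs[words.index(t)] for t in tokens if t in words)
    return v / np.linalg.norm(v)

c1 = context_vector(["judge", "law", "dismisses"])   # a "legal" context
c2 = context_vector(["robe", "suit"])                # a "clothes" context
print(c1 @ c2)                                       # cosine between the two contexts
```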

Word Sense Discrimination: Results & Discussion

Averaged results on 20 words:

                        Accuracy
    χ², terms               76
    χ², SVD                 90
    Frequency, terms        81
    Frequency, SVD          88

Schütze 1998


    Part-of-Speech (POS) Tagging

Given a sentence, label words with their POS tags

    Unsupervised Approaches

    Attempt to cluster words

    Align each cluster with a POS tag

    Do not assume a dictionary of tags

    I ate an apple .

    NN VB DT NN .

Schütze 1995; Lamar et al. 2010


    Part-of-Speech Tagging

    Steps

    1. Represent words in appropriate vector space

    Dimensionality Reduction

    2. Cluster using your favorite algorithm

    Vector space should capture syntactic properties

Use the most frequent d words as features

    Frequency of a word in the context as feature weight

Part-of-Speech Tagging: Pass 1

Construct left and right context matrices
    L and R, matrices of size V × d
Dimensionality Reduction: get a rank-r_1 approximation
    L = U Σ V^T,  L* = U Σ,  L* ← row-normalized L*
    R = U Σ V^T,  R* = U Σ,  R* ← row-normalized R*
D = [L* R*] is a V × 2r_1 matrix
Run weighted k-means on D with k_1 clusters

(sketch below)
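A rough sketch (ours) of Pass 1 using NumPy and scikit-learn's KMeans with sample weights as a stand-in for the weighted k-means step; the counts and parameter values are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def reduce_and_normalize(M, r):
    """Rank-r SVD of a count matrix; return the row-normalized U_r * Sigma_r."""
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    Z = U[:, :r] * s[:r]
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    return Z / np.maximum(norms, 1e-12)

# Toy inputs: V word types, d most-frequent feature words, counts of each feature word
# appearing immediately to the left / right of each word type.
rng = np.random.default_rng(0)
V, d = 1000, 50
L = rng.poisson(1.0, size=(V, d)).astype(float)   # left-context counts  (V x d)
R = rng.poisson(1.0, size=(V, d)).astype(float)   # right-context counts (V x d)
freq = L.sum(axis=1) + R.sum(axis=1) + 1.0        # word-type frequencies, used as weights

r1, k1 = 10, 17
D = np.hstack([reduce_and_normalize(L, r1),
               reduce_and_normalize(R, r1)])      # V x 2*r1 descriptor matrix
labels = KMeans(n_clusters=k1, n_init=10, random_state=0).fit_predict(D, sample_weight=freq)
# Pass 2 would rebuild L and R using these k1 cluster labels as the context features.
```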


Part-of-Speech Tagging: Pass 2

The clusters from Pass 1 are not optimal because of sparsity
Construct L and R of size V × k_1, using the k_1 cluster labels as the context features
Dimensionality Reduction: get a rank-r_2 approximation
    L = U Σ V^T,  L* = U Σ,  L* ← row-normalized L*
    R = U Σ V^T,  R* = U Σ,  R* ← row-normalized R*
D = [L* R*] is a V × 2r_2 matrix
Run weighted k-means on D

Part-of-Speech Tagging: Results

Penn Treebank (1.1 M tokens, 43K types), 17 and 45 tags, M-to-1 accuracies:

                        PTB17          PTB45
    SVD2                0.730          0.660
    HMM-EM              0.647          0.621
    HMM-VB              0.637          0.605
    HMM-GS              0.674          0.660
    HMM-Sparse(32)      0.702 (2.2)    0.654 (1.0)
    VEM (10^-1, 10^-1)  0.682 (0.8)    0.546 (1.7)

Lamar et al. 2010


Part-of-Speech Tagging: Discussion

Sensitivity to parameters
    Scaling with singular values
    k-means algorithm
Weighted k-means
    Clusters are initialized to the most frequent word types
Non-disambiguating tagger
Very simple algorithm


Information Retrieval

Rank documents d in response to a query q
Vector Space Model
    Query and doc. are represented as bag-of-words
    Features: words    Feature weight: TFIDF
Lexical Gap
    Polysemy and Synonymy

Information Retrieval: Lexical Gap

Term × Document matrix C:

    ship     1 1
    boat     1
    ocean    1 1
    voyage   1 1 1
    trip     1 1

Is TFIDF weighting better?


Information Retrieval: Latent Semantic Analysis

Term × Document matrix C
Steps:
1. Dimensionality reduction of the term × doc. matrix
2. Folding-in queries: q → f(q)
3. Compute semantic similarity, score(q, d)


Information Retrieval: Latent Semantic Analysis

Term × Document matrix C
Steps:
1. Dimensionality Reduction: C = U Σ V^T (keep the top k dimensions)
2. Folding-in queries:
       d = U Σ d̂  ⇒  d̂ = Σ^{-1} U^T d
       q̂ = Σ^{-1} U^T q


Information Retrieval: Latent Semantic Analysis

Term × Document matrix C,  C = U Σ V^T
Steps:
1. Dimensionality Reduction
2. Folding-in queries: q̂ = Σ^{-1} U^T q
3. Semantic similarity:
       score(q, d) = ⟨q̂, d̂⟩ / (||q̂|| ||d̂||),  where ⟨·,·⟩ denotes the dot product

(sketch below)

Deerwester 1988; Dumais 2005
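A compact NumPy sketch (ours) of LSA with query folding-in and cosine scoring, following the formulas above; the term × document matrix is a toy in the spirit of the example earlier, not the slides' exact data.

```python
import numpy as np

# Toy term x document count matrix C; rows: ship, boat, ocean, voyage, trip
C = np.array([[1, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [1, 0, 0, 1, 1, 0],
              [0, 0, 0, 1, 0, 1]], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(C, full_matrices=False)
Uk, sk = U[:, :k], s[:k]

docs = (np.diag(1.0 / sk) @ Uk.T @ C).T          # reduced documents d_hat = Sigma_k^{-1} U_k^T d

def fold_in(q):
    """Fold a bag-of-words query vector into the latent space."""
    return np.diag(1.0 / sk) @ Uk.T @ q

def score(q_hat, d_hat):
    return q_hat @ d_hat / (np.linalg.norm(q_hat) * np.linalg.norm(d_hat) + 1e-12)

q = np.array([0, 1, 0, 0, 0], dtype=float)       # query containing only "boat"
q_hat = fold_in(q)
print([round(score(q_hat, d), 3) for d in docs]) # docs without "boat" can still get nonzero scores
```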

Information Retrieval: Lexical Gap Revisited

Term × Document matrix C and the new document representations

Original counts:
    ship     1 1
    boat     1
    ocean    1 1
    voyage   1 1 1
    trip     1 1

After dimensionality reduction:
    ship    -1.62  -0.60  -0.44  -0.97  -0.70  -0.26
    boat    -0.46  -0.84  -0.30   1.00   0.35   0.65


Information Retrieval: Results & Discussion

Term × Document matrix C
Fold-in new documents as well
    Deviates from the optimal as we add more docs.
Probabilistic Latent Semantic Analysis

                MED    CRAN   CACM   CISI
    Cos+tfidf   49     35.2   21.9   20.2
    LSA         64.6   38.7   23.8   21.9
    PLSI-U      69.5   38.9   25.3   23.3
    PLSI-Q      63.2   38.6   26.6   23.1

Hofmann 1999


Non-linear Dimensionality Reduction

Laplacian Eigenmaps
    Weight matrix W with similarities (local neighbourhood)
    D_ii = Σ_j W_ij and L = D - W
    min_U tr(U^T L U)  s.t.  U^T D U = I   ⇒   L u = λ D u
    Note: u^T L u = (1/2) Σ_ij W_ij (u_i - u_j)²

(sketch below)
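A short NumPy/SciPy sketch (ours) of the Laplacian Eigenmaps recipe above: build a local similarity graph W, form L = D - W, and take the smallest nontrivial generalized eigenvectors of L u = λ D u. All parameters below are arbitrary choices for illustration.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                      # 100 points in R^5

# Heat-kernel weights restricted to a local neighbourhood (10 nearest neighbours)
dist = cdist(X, X)
sigma = np.median(dist)
W = np.exp(-dist**2 / (2 * sigma**2))
knn = np.argsort(dist, axis=1)[:, 1:11]
mask = np.zeros_like(W, dtype=bool)
rows = np.repeat(np.arange(len(X)), knn.shape[1])
mask[rows, knn.ravel()] = True
W = np.where(mask | mask.T, W, 0.0)                # keep a symmetric local graph
np.fill_diagonal(W, 0.0)

D = np.diag(W.sum(axis=1))
L = D - W
lam, U = eigh(L, D)                                # generalized problem L u = lambda D u
Y = U[:, 1:3]                                      # skip the trivial constant vector; 2-D embedding
print(lam[:3])                                     # first eigenvalue is (numerically) zero
```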


    Neural Embeddings

    Word is represented as a vector of size I

    Learning

    Optimize such that log-likelihood is maximized

    Gradient ascent

    Learns parameters and word vectors simultaneously

    Learned word-vectors capture semantics

    Learn to perform multiple tasks simultaneously

    Bengio et al 2003; Collobert and Weston 2008


Canonical Correlation Analysis (CCA)

Centered datasets:
    X = {x_1, ..., x_n} ⊂ R^{d_x},  Y = {y_1, ..., y_n} ⊂ R^{d_y}
Project X and Y along a ∈ R^{d_x} and b ∈ R^{d_y}:
    s = (a^T x_1, ..., a^T x_n),  t = (b^T y_1, ..., b^T y_n)
Data correlation after transformation:
    cos(s, t) = s^T t / (||s|| ||t||) = a^T X Y^T b / sqrt((a^T X X^T a)(b^T Y Y^T b))


CCA (contd.)

Covariance matrices:
    C_xy = X Y^T,  C_xx = X X^T,  C_yy = Y Y^T
Correlation in terms of covariance matrices:
    cos(s, t) = a^T C_xy b / sqrt((a^T C_xx a)(b^T C_yy b))
Directions that maximize data correlation:
    (a*, b*) = argmax_{a,b} a^T C_xy b / sqrt((a^T C_xx a)(b^T C_yy b))

CCA: Formulation

Goal: find linear transformations A, B that maximize the data correlation
Optimization problem:
    (A*, B*) = argmax_{A,B} tr(A C_xy B^T)   s.t.   A C_xx A^T = I,  B C_yy B^T = I


CCA: Solution

Generalized eigenvalue problem:
    C_xy b = λ C_xx a,   C_yx a = λ C_yy b
It can be shown that the two problems share the same eigenvalues, that b ∝ C_yy^{-1} C_yx a, and that
    C_xy C_yy^{-1} C_yx a = λ² C_xx a
MATLAB function: canoncorr()

(sketch below)
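A small NumPy sketch (ours) that solves CCA by whitening both views and taking an SVD of the whitened cross-covariance, which is equivalent to the eigenproblem above; MATLAB's canoncorr() (or scikit-learn's CCA) gives comparable projections. The data below is synthetic.

```python
import numpy as np

def cca(X, Y, k, reg=1e-8):
    """CCA for column-sample matrices X (dx x n) and Y (dy x n), both centered."""
    n = X.shape[1]
    Cxx = X @ X.T / n + reg * np.eye(X.shape[0])
    Cyy = Y @ Y.T / n + reg * np.eye(Y.shape[0])
    Cxy = X @ Y.T / n
    # Whiten both views, then SVD of the cross-covariance of the whitened views
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx))
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy))
    U, corrs, Vt = np.linalg.svd(Wx @ Cxy @ Wy.T)
    A = U[:, :k].T @ Wx                  # rows of A project the X view
    B = Vt[:k, :] @ Wy                   # rows of B project the Y view
    return A, B, corrs[:k]

rng = np.random.default_rng(0)
Z = rng.normal(size=(2, 500))                      # shared latent signal
X = np.vstack([Z, rng.normal(size=(3, 500))])      # view 1: signal + noise dimensions
Y = np.vstack([-Z, rng.normal(size=(4, 500))])     # view 2: correlated with view 1
X -= X.mean(axis=1, keepdims=True)
Y -= Y.mean(axis=1, keepdims=True)

A, B, corrs = cca(X, Y, k=2)
print(corrs)                                       # the top canonical correlations, close to 1
```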


Bilingual Document Projections

Applications:
    Comparable and Parallel Document Retrieval
    Cross-language text categorization
Steps:
1. Represent each document as a vector
    Two different vector spaces, one per language
2. Use CCA to find linear transformations (A, B)
3. Find new aligned documents using A and B

Bilingual Document Projections

Steps:
1. Represent each document as a vector
    Vector Space:
        Features: most frequent 20K content words
        Feature weight: TFIDF weighting
    Training Data:
        x_i ∈ R^{d_E}: bag of English words
        y_i ∈ R^{d_H}: bag of Hindi words
        (x_i, y_i), i = 1 ... n;  X = [x_1 x_2 ... x_n],  Y = [y_1 y_2 ... y_n]


Bilingual Document Projections

Steps:
1. Represent each document as a vector
2. Use CCA to find linear transformations A and B
3. Find new aligned documents using A and B
    Scoring:
        Score(x, y) ∝ ⟨A x, B y⟩  (cosine similarity in the projected space; sketch below)
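With A and B from CCA (for instance the cca() sketch above), scoring a candidate document pair reduces to a cosine in the shared space. The matrices and vectors below are random stand-ins for the TFIDF features and learned projections.

```python
import numpy as np

def score(x, y, A, B):
    """Cosine similarity of a document pair in the shared CCA space."""
    u, v = A @ x, B @ y
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

# Toy stand-ins: in the task above, x and y would be TFIDF vectors over the 20K most
# frequent content words of each language, and A, B would come from CCA on aligned pairs.
rng = np.random.default_rng(0)
A, B = rng.normal(size=(5, 50)), rng.normal(size=(5, 60))
english_docs = rng.random((10, 50))
hindi_docs = rng.random((10, 60))

# Mate retrieval: for each English document pick the highest-scoring Hindi document
for i, x in enumerate(english_docs):
    j = max(range(len(hindi_docs)), key=lambda j: score(x, hindi_docs[j], A, B))
    print(i, "->", j)
```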


    Bilingual Document Projections :

    Results & Discussion

    Accuracy MRR

    OPCA 72.55 77.34

    Word-by-word 70.33 74.67

    CCA 68.94 73.78

    Word-by-word (5000) 67.86 72.36

    CL-LSI 53.02 61.30

    Untranslated 46.92 53.83

    CPLSA 45.79 51.30

    JPLSA 33.22 36.19

    Platt et al 2010


    Mining Word-Level Translations

Training Data: word-level seed translations
Task: Mine translations for new words, e.g. the translation of "…"?
Resources: monolingual comparable corpora

    English Spanish P(s|e)

    state estado 0.5

    state declarar 0.3

    society sociedad 0.4

society compañía 0.35

    company sociedad 0.8

    Mining Word-Level Translations

    Applications:

    Lexicon induction for resource poor languages

    Mining translations for unknown words in MT

    Steps:

    1. W]v]vP}(^}]_

    2. Represent each word as vector

    Two different feature spaces, one per language

    3. Use CCA to find transformations #and $

    4. Use#and $to mine new word translations


    Mining Word-Level Translations

    Steps:

    1. W]v]vP}(^}]_

    2. Represent each word as a vector

    Vector Space

    Features: context words (WSM); Orthography

    Feature Weights: TFIDF weights

    Can be computed using ONLY comparable corpora

(x_i, y_i), i = 1 ... n;  X = [x_1 x_2 ... x_n],  Y = [y_1 y_2 ... y_n]


    Mining Word-Level Translations

    Steps:

    1. W]v]vP}(^}]_

    2. Represent each word as a vector

    3. Use CCA to find transformations #and $

    4. Use#and $to mine new word translations

    Scoring

Score(e, s) ∝ ⟨A x_e, B y_s⟩


    Mining Word-Level Translations :

    Results & Discussion

    Seed lexicon size 100

    Bootstrapping

    Results are lower for other language pairs

    Best-

    EditDist 58.6 62.6 61.1 47.4

    Ortho 76.0 81.3 80.1 52.3 55.0

    Context 91.1 81.3 80.2 65.3 58.0

    Both 87.2 89.7 89.0 89.7 72.0

    Haghighi et al 2008

Mining Word-Level Translations: Results & Discussion

Mining translations for unknown words
OOV words for MT domain adaptation

MT accuracies (BLEU):

                    News    Emea    Subs    PHP
    German
      Baseline      23.00   26.62   10.26   38.67
      +ve change     0.80    1.44    0.13    0.28
    French
      Baseline      27.30   40.46   16.91   28.12
      +ve change     0.36    1.51    0.61    0.68

Daumé and Jagarlamudi 2011


Supervised Semantic Indexing

Task: Learn to rank ads a for a given doc. d

Training Data:
    Pairs (d, a+) of webpages and clicked ads
    Randomly chosen pairs (d, a-)

Steps:
1. Represent an ad a and a doc. d as vectors
2. Learn a scoring function f(a, d)
3. Rank ads for a given document

Bai et al. 2009

    Supervised Semantic Indexing

    Steps :

    1. Represent ads and docs. as vectors

    Vector Space

    Bag-of-word representation

    Features: words

    Feature weights: TFIDF weight

a and d are vectors of size V


Supervised Semantic Indexing

Steps:
1. Represent ads and docs. as vectors
2. Learn a scoring function f(a, d)

Scoring function:
    f(a, d) = d^T W a,  parameters W ∈ R^{V×V}
    W = I: cosine similarity
    W = D (diagonal): reweighting of words
    W = U^T V + I: dimensionality reduction, different treatment for ads and documents
    W = U^T U + I: dimensionality reduction, SAME treatment for ads and documents

Supervised Semantic Indexing: Learn Scoring Function

Max-margin: want f(d, a+) - f(d, a-) ≥ 1
Objective: Σ max(0, 1 - f(d, a+) + f(d, a-))
Sub-gradient descent

(sketch below)
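A condensed sketch (ours) of the low-rank SSI scorer f(a, d) = d^T (U^T V + I) a trained with the margin ranking loss by stochastic sub-gradient descent; dimensions, learning rate and the random triples are invented for illustration.

```python
import numpy as np

class SSI:
    """Low-rank scorer f(a, d) = d^T (U^T V + I) a, trained with a margin ranking loss."""
    def __init__(self, vocab, k, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.U = 0.01 * rng.normal(size=(k, vocab))
        self.V = 0.01 * rng.normal(size=(k, vocab))
        self.lr = lr

    def score(self, d, a):
        return (self.U @ d) @ (self.V @ a) + d @ a      # low-rank part + identity part

    def update(self, d, a_pos, a_neg):
        # Hinge loss max(0, 1 - f(d, a+) + f(d, a-)); take a sub-gradient step when violated
        if 1 - self.score(d, a_pos) + self.score(d, a_neg) > 0:
            Ud = self.U @ d                              # uses the pre-update U for the V step
            self.U += self.lr * np.outer(self.V @ (a_pos - a_neg), d)
            self.V += self.lr * np.outer(Ud, a_pos - a_neg)

# Toy run with random bag-of-words vectors standing in for (webpage, clicked ad, random ad) triples
rng = np.random.default_rng(1)
model = SSI(vocab=1000, k=20)
for _ in range(100):
    d, a_pos, a_neg = rng.random(1000), rng.random(1000), rng.random(1000)
    model.update(d, a_pos, a_neg)
```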


Supervised Semantic Indexing

Steps:
1. Represent ads and docs. as vectors
2. Learn a scoring function f(a, d)
3. Rank ads for a given document
    Ranking ads: compute the score using f(a, d) and rank

Supervised Semantic Indexing: Results & Discussion

1.9 M pairs for training, 100K pairs for testing

                          Parameters    Rank Loss
    TFIDF                                 45.60
    SSI: W = U^T V + I    50 × 10k        25.83
    SSI: W = U^T V + I    50 × 20k        26.68
    SSI: W = U^T V + I    50 × 30k        26.98

Bai et al. 2009


Supervised Semantic Indexing: Results & Discussion

Ranking Wikipedia pages for queries (rank loss); performs better when the training data is big

                          K=5     K=10    K=20
    TFIDF                 21.6    14.0    9.14
    αLSI + (1-α)TFIDF     14.2    9.73    6.36
    SSI: W = U^T U + I    4.80    3.10    1.87
    SSI: W = U^T V + I    4.37    2.91    1.80

Bai et al. 2009


Low-Dimensional Discriminative Reranking

Reranker operates in the outer product space [Szedmak et al., 2006; Wang et al., 2007]
Weight vector is constrained [Bai et al. 2010]

    x = (x_1, x_2, x_3, x_4, ...),  y = (y_1, y_2, y_3, y_4, ...)
    x ⊗ y = (x_1 y_1, x_1 y_2, ..., x_{d_1} y_{d_2}) is a vector of length d_1 × d_2
    w = a ⊗ b


Low-Dimensional Reranking

Find A and B s.t. ⟨A x, B y⟩ ranks the correct candidate y above the other candidates y_1, y_2, y_3, ... for an input x

Idea [Tsochantaridis et al. 04]:
1. Score: score(x, y) = (A x)^T (B y)
2. Add constraints to penalize incorrect candidates:
       score(x, y*) ≥ score(x, y) + 1 - ξ,   ξ ≥ 0


Discriminative Model

    α ← initial values                                    // Initialization
    Repeat
        (A, B) ← Softened-Disc(α)                         // Get the current soln.
        m_i ← ⟨A x_i, B y_i*⟩ - ⟨A x_i, B y_i⟩             // Compute margins
        ξ_i ← 1 - m_i                                      // Potential slack
        ξ_i ← max(0, ξ_i)                                  // Compute slack
        If ξ_i > 0, update the Lagrangian variable α_i     // Update the Lagrangian variables
    Until convergence                                      // Slack doesn't change


POS Tagging

Combine with the Viterbi score
Interpolation parameter is tuned

Training: input sentence and reference tag sequences; candidates, score and loss values

Testing example: "Buyers stepped in to the futures pit ."

    Score
    -7.0514    NNS VBD IN TO DT NNS NN .
    -0.1947    NNS VBD RP TO DT NNS NN .
    -6.8068    NNS VBD RB TO DT NNS NN .
    -7.1408    NNS VBD RP TO DT NNS VB .
    -13.752    NNS VBD RB TO DT NNS VB .

    POS Tagging

    Combine with Viterbi score

    Interpolation parameter is tuned

    Data Statistics

    Results

    English Chinese French Swedish

    Baseline 96.15 92.31 97.41 93.23

    Collins 96.06 92.81 97.35 93.44

    Regularized 96.00 92.88 97.38 93.35

    Oracle 98.39 98.19 99.00 96.48


    POS Tagging

    Combine with Viterbi score

    Interpolation parameter is tuned

    Data Statistics

    Results

    English Chinese French Swedish

    Baseline 96.15 92.31 97.41 93.23

    Collins 96.06 92.81 97.35 93.44

    Regularized 96.00 92.88 97.38 93.35

    Oracle 98.39 98.19 99.00 96.48

Softened-Disc 96.32 92.87 97.53 93.24

Discriminative 96.3 92.91 97.53 93.36

    POS Tagging

    Zo}v]vY

    Interpolation with Viterbi score is crucial

    Softened-Disc

Independent of the no. of training examples
Easy to code and can be solved exactly

English Chinese French Swedish

Softened-Disc +0.17 +0.56 +0.12 +0.01

Discriminative +0.15 +0.6 +0.12 +0.13

Softened-Disc* +0.92 +4.31 +1.12 +0.08

Discriminative* +0.88 +4.77 +0.9 +0.73

Jagarlamudi and Daumé 2012


    Similarity Search: Challenges

    Computing nearest neighbors in high

    dimensions using geometric search

    techniques is very difficult

All methods are about as bad as brute-force linear search, which is expensive

    Approximate techniques such as ANN perform

    efficiently in dimensions as high as 20; in higher

    dimensions, the results are rather spotty

    Need to do search on commodity hardware

    Cross-language search


    What is the advantage?

    Scales easily to very large databases

    Compact language-independent representation

    32 bits per object

    Search is effective and efficient

    Hamming nearest-neighbor search

Few milliseconds per query for searching a million objects (single thread on a single processor)


    What is the challenge?

    Language/Script Independent HashCodes

    Learning Hash Functions from

    Training Data


Learning Hash Functions: Summary

Given a set of parallel names as training data, find the top K projection vectors for each language using Canonical Correlation Analysis.
Each projection vector gives a 1-bit hash function.
The hash code for a name is computed by projecting its feature vector onto the projection vectors, followed by binarization. (sketch below)

    Udupa & Kumar, 2010

Fuzzy Name Search: Experimental Setup

Test Sets:
    DUMBTIONARY: 1231 misspelled names
    INTRANET: 200 misspelled names
Name Directories:
    DUMBTIONARY: 550K names from Wikipedia
    INTRANET: 150K employee names
Training Data: 15K pairs of single-token names in English and Hindi
Baselines: two popular search engines, Double Metaphone, BM25


    Multilingual: Experimental Setup

    Test Sets

    1000 multi-word names each in Russian, Hebrew,

    Kannada, Tamil, Hindi

    Name Directory:

    English Wikipedia Titles

    6 Million Titles, 2 Million Unique Words

    Baseline:

    State-of-the-art Machine Transliteration (NEWS2009)


Dimensionality Reduction

[Chart: number of dimensionality reduction papers per year, 1990-2010, in computer vision vs. NLP]
[Chart: popularity of dimensionality reduction relative to Bayesian approaches, 1990-2010, vision vs. NLP]


Summary

Dimensionality reduction has merits for NLP
    Computational efficiency and feature correlations
Has been explored in an unsupervised fashion
But recent novel developments
    For multi-view data
If you can formulate your problem as a mapping
    Try dimensionality reduction
    Can solve for the global optimum

Summary

Spectral Learning
    Provides a way to learn the global optimum for generative models
Enriching the existing models
    Using word embeddings instead of words
Scalability of the techniques
    Don't depend on the number of samples
    Large scale SVD


References

Jagadeesh Jagarlamudi and Hal Daumé III. Low-Dimensional Discriminative Reranking. In HLT-NAACL 2012.
Shaishav Kumar and Raghavendra Udupa. Learning Hash Functions for Cross-View Similarity Search. In IJCAI 2011.
Raghavendra Udupa and Shaishav Kumar. Hashing-based Approaches to Spelling Correction of Personal Names. In Proceedings of EMNLP 2010.
Raghavendra Udupa and Mitesh Khapra. Transliteration Equivalence using Canonical Correlation Analysis. In ECIR 2010.
Jagadeesh Jagarlamudi and Hal Daumé III. Regularized Interlingual Projections: Evaluation on Multilingual Transliteration. In Proceedings of EMNLP-CoNLL 2012.