CHAPTER 3

LATENT SEMANTIC INDEXING FOR HINDI-ENGLISH CLIR IRRESPECTIVE OF CONTEXT SIMILARITY


3.1 INFORMATION RETRIEVAL

Information retrieval is the process of determining the similarity between a query and the documents in a collection. Information Retrieval (IR) deals with representing, storing, organizing, and accessing information [31]. This representation and organization make the information easy for users to access. The main goal of Information Retrieval is to retrieve the information that is relevant to the user's need.

3.1.1. INDEXING ISSUES IN INFORMATION RETRIEVAL

In Information Retrieval (IR), the basic task is to find the subset of a collection of elements that are relevant to a query. In text retrieval, a query is an ordered set of English words and a collection is a set of natural-language English documents. Any text retrieval system must overcome the fundamental difficulty that the presence or absence of a word is insufficient to determine relevance. This is due to two intrinsic problems of natural language, namely synonymy and polysemy.

Synonymy refers to the fact that a single underlying concept or idea can be represented by many different terms or combinations of terms; e.g. “car” and “automobile” often refer to the same class of objects. Polysemy refers to the fact that a single term can refer to more than one underlying concept or idea; e.g. “car” may be an automobile or the head of a LISP cons cell. Because of synonymy, it is difficult to recognize that two documents describe the same topic when they use different vocabulary, leading to relevant documents being rejected. Because of polysemy, it is difficult to recognize that two documents that use some of the same terms describe different topics, leading to the retrieval of unwanted documents.


A variety of approaches have been developed for IR tasks in the face of these

problems. We will focus on the popular Vector Space Model representation for

documents and queries. We will also focus on variations of latent semantic indexing,

one technique that is designed to address synonymy and polysemy in the VSM

framework and similar in flavor to the approach that we will derive.

The demand for multilingual information is becoming more pronounced as the number of internet users throughout the world increases. This demand creates the problem of retrieving documents in one language by specifying the query in another language.

This increasing need for retrieval of multilingual documents gave rise to a new branch called Cross Lingual Information Retrieval (CLIR). Cross Lingual Information Retrieval takes user queries in one language (the source language) and uses them to retrieve documents in another language (the target language). For example, if the user enters a query in Hindi, relevant documents in English will be retrieved. The retrieved documents are semantically equivalent to the query. The main problem in CLIR is the scarcity of resources.

3.1.2. INDEXING TASKS IN INFORMATION RETRIEVAL

The field of Information Retrieval is broad, and researchers have focused their efforts within several subareas. We focus on the task where a system is given a set of documents and, whenever a user specifies a query, those documents are ranked by relevance. This is known as the ad hoc retrieval task. Here, the goal is to find the best ranking method possible by whatever means. Documents are known beforehand and collections usually remain relatively unchanged. In the routing and filtering tasks, a

user has a set of standing information requests. New documents arrive regularly. The

system receives a new document and decides whether it meets the criteria of those

information requests and, if so, presents the document to the user. There are interesting retrieval issues in domains that do not include text at all, such as image retrieval or sound classification. Although the tasks are similar, the structure of the

data and the queries are often quite different. Each domain brings different challenges.

In this work, we are mostly concerned with issues that arise from text retrieval.

Further, we are particularly interested in ad hoc retrieval tasks involving short queries.

3.1.3. COMPUTATIONAL ISSUES IN REAL-WORLD IR

The data sets used in text retrieval are large. Although real-world data sets may contain only 1,000 documents consisting of about 10,000 different words, it is often the case that we are more interested in 100,000 or even 1,000,000 documents consisting of hundreds of thousands of distinct words. Even the smallest data sets are beyond the feasible reach of many machine learning algorithms. There are several engineering challenges that must be addressed. First, simply storing and manipulating such data efficiently is difficult, so faster algorithms are absolutely essential; even algorithms that are polynomial in the size of the data can be infeasible on serial machines. Further, bringing statistical and machine learning techniques to bear introduces more complexity. Such algorithms are usually at least polynomial in the size of the data to be learned, so even the smallest collections are beyond the reach of many of them. Clearly, developing a fully working retrieval system for something as large as the World Wide Web requires a system-level engineering approach.


3.1.4. PERFORMANCE MEASURE

Determining how well a system performs is difficult. In this section we

discuss several standard evaluation metrics and provide some examples of how they

interact.

Many measures of retrieval performance have been proposed. The most commonly used are precision and recall. Precision is the ratio of relevant documents retrieved to the total number of documents retrieved. Recall is the ratio of relevant documents retrieved to the total number of relevant documents contained within the collection [33]. Because systems provide an ordering on all documents for a given query, we can calculate precision and recall for the top n documents, with n ranging over the total number of documents in the collection. Suppose we have a collection made up of ten documents; for a particular query, five are relevant and five are not. If we examine only the first document returned and it is relevant, we have perfect precision (1.0) with recall equal to 0.2. Looking at the first four documents returned, if three are relevant, the precision is 0.75 and the recall is 0.60. Precision and recall can be calculated for a single query, as in this example, or averaged over many queries. One usually wishes to measure performance in terms of both precision and recall. This is commonly done using a precision-recall graph, with precision on the y-axis and recall on the x-axis.


Figure 3.1. Precision and Recall

Precision and recall can also be combined into a single figure of merit, the weighted F-measure:

F_b = (b^2 + 1) P R / (b^2 P + R)

where P is precision, R is recall, and b is the ratio of the importance of recall to precision. If b = 10, recall is ten times more important than precision, but when b = 0.1, recall is only one tenth as important as precision.
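To make these definitions concrete, the following short Python sketch computes precision and recall at a cut-off n and the weighted F-measure for the ten-document example above. The relevance labels in the ranking are illustrative assumptions, not data from an actual system.

def precision_recall_at_n(ranked_relevance, total_relevant, n):
    # ranked_relevance: list of booleans, True if the i-th retrieved document is relevant
    hits = sum(ranked_relevance[:n])
    return hits / n, hits / total_relevant

def f_measure(p, r, b=1.0):
    # Weighted harmonic mean of precision and recall; b > 1 favours recall, b < 1 favours precision
    if p == 0 and r == 0:
        return 0.0
    return (b * b + 1) * p * r / (b * b * p + r)

# Ranked output for one query over a ten-document collection with five relevant documents
ranking = [True, False, True, True, False, False, True, False, True, False]
print(precision_recall_at_n(ranking, total_relevant=5, n=1))   # (1.0, 0.2)
print(precision_recall_at_n(ranking, total_relevant=5, n=4))   # (0.75, 0.6)
print(f_measure(0.75, 0.6, b=1.0))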

3.1.5. VECTOR SPACE MODEL

In the Vector Space Model (VSM), a document is a vector in which each dimension represents the count of occurrences of a different word. Queries are represented in the same way, so there is no difference between queries and documents. A collection of documents is a matrix D, where each column is a document vector d_i; thus D_ij is the weight of word i in document j. Classically, the similarity between a document d and a query q is defined to be the inner product of their vectors, d^T q. This approach may seem simplistic; however, the inner product is just a weighted match between the overlapping terms of two documents. Although expressed as linear algebra [32], it is essentially the same approach used by many search engines, from the library systems


commonly available in universities to the wildly popular AltaVista web search engine.

There are several advantages to this approach beyond its mathematical

simplicity. Above all, it is computationally efficient to compute a histogram and

requires very little space to store it. Notice that although document vectors live in a very high-dimensional space, the document matrix will be sparsely populated, that is, made up mostly of zeroes. This is true because, in general, most documents will not

contain most of the possible words. Thus, algorithms for manipulating the matrix only

require space and time proportional to the average number of different words that

appear in a document, a number likely to be much smaller than the full dimensionality

of the document matrix. Similarly, comparing a query to all the documents in a

collection is efficient. These are key advantages when collections may require

gigabytes to store.
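As an illustration of the model, the sketch below builds a small sparse term-by-document matrix and scores a query by the inner product d^T q. The vocabulary, counts, and query are invented for illustration; a real collection would of course be far larger.

import numpy as np
from scipy.sparse import csc_matrix

vocab = {"car": 0, "automobile": 1, "engine": 2, "topology": 3}

# Term-by-document matrix D: rows are terms, columns are documents; only the
# non-zero counts are actually stored in the sparse format.
D = csc_matrix(np.array([
    [2, 0, 0],   # car
    [0, 1, 0],   # automobile
    [1, 1, 0],   # engine
    [0, 0, 3],   # topology
]))

q = np.zeros(len(vocab))
q[vocab["car"]] = 1
q[vocab["engine"]] = 1

scores = D.T.dot(q)     # one inner product per document
print(scores)           # [3. 1. 0.] -> the first document ranks highest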

3.2 RELATED WORK

Much work has already been done on CLIR systems, and research is presently going on in many countries, including India, Japan, China, and Portugal. Most of the proposed systems are based on indexing techniques such as dictionary-based indexing, inverted file systems, probabilistic latent semantic indexing, ontology indexing and language modeling, which retrieve documents based on index terms. However, index terms alone are often not sufficient to retrieve all the documents that are relevant to the user query.


In the method of automatic cross-language information retrieval using latent semantic indexing [35], the authors tested a language-independent representation of the documents, irrespective of the user query, i.e. whether it is a short or a long query. They used a French-English parallel corpus, collected from Hansard, for training and testing the system: 982 documents for training and 1500 documents for testing, a total of 2482 documents. The English documents contain 2482 paragraphs, and the French documents likewise contain 2482 paragraphs. The success rate in finding the mate documents is 98%.

The Porter stemmer [36] is used for stemming the English documents; it removes suffixes from the words. Stemming is applied to the Cranfield 200 collection, and precision and recall are calculated during stemming. The Porter stemmer algorithm was tested on a vocabulary of 10,000 words, of which 1373 words were reduced and nearly 3650 were not. Overall, the vocabulary size after applying the Porter stemmer is reduced to nearly one third of the original one.
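As a small illustration of suffix stripping, the sketch below runs a handful of invented word forms through NLTK's implementation of the Porter stemmer (assuming NLTK is installed; this is not the exact setup used in [36]) and reports how many distinct stems remain.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
vocabulary = ["connect", "connected", "connecting", "connection", "connections",
              "retrieve", "retrieval", "retrieving", "relate", "relations"]

stems = {w: stemmer.stem(w) for w in vocabulary}
print(stems)                                            # e.g. 'connection' -> 'connect'
print(len(vocabulary), "words ->", len(set(stems.values())), "distinct stems")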

In the method of Turkish-English cross-language information retrieval using LSI [37], the authors created a system that can find the cross-language mate of a given document. The system is trained with bilingual documents. In this phase, they parsed the documents and stemmed the words with the corresponding stemmers. A feature space (term-by-document matrix) was created and normalized using the TF-IDF method. The entire operation takes 10-12 minutes on a computer with an Intel quad-core processor and 4 GB of memory. The normalized feature-space matrix was then decomposed into U, Σ and V matrices; Matlab was used for this step. After training the system, the documents in the Turkish test set


were queried against the system to find their cross-language mates. Cosine similarity is used to measure the similarity of the documents. They also tested the system with different rank approximations (k values). Depending on the k value, this operation takes 8-12 minutes on the computer mentioned above.

The performance of the system is evaluated by checking whether the query document obtains its mate in the retrieval result set. After query submission, retrieval results are ranked according to their similarity to the query vector. They submitted 1801 test documents, one by one, as queries and expected to find the mate of each in the query results. A query is considered successful if the mate of the query document appears in the first 10 ranked retrieval results. The results report the number of successful queries according to the rank order of the mate document. For instance, in the k=500 experiment, they obtain the mate of the query document at the first rank for 416 queries. They also report the CLIR result when a direct match between documents is made, with no LSI or TF-IDF: using TF-IDF and LSI increases the query performance approximately 3 times compared with direct matching. The results show that as the k value increases, the retrieval results get better; however, a greater k value produces larger matrices and needs more computation time.
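The evaluation described above can be sketched as follows: each test document vector is used as a query, the documents are ranked by cosine similarity, and the query counts as a success if its known cross-language mate appears in the top ten results. The document vectors and the mate mapping are assumed to be given by an earlier LSI step.

import numpy as np

def cosine_rank(query_vec, doc_vecs):
    # Indices of documents sorted by decreasing cosine similarity to the query
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return np.argsort(-(d @ q))

def success_at_10(query_vecs, doc_vecs, mate_of):
    # mate_of[i] is the index of the cross-language mate of query i
    hits = sum(mate_of[i] in cosine_rank(q, doc_vecs)[:10]
               for i, q in enumerate(query_vecs))
    return hits / len(query_vecs)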

In this work, they experimented on LSI using Singular Value Decomposition. The parallel corpus was collected from Skylife Magazine's website and contains both Turkish and English articles, translated by interpreters. The corpus contains 1056 Turkish documents and 1056 English documents, where each paragraph is taken as an individual document. Some paragraphs are shared with their cross-language mates, so finally there are 3602 document pairs, and all of them are represented in a single term-by-document matrix. Out of the 3602 documents, 1801 are used for training the system and 1801 for testing it. A longest-match stemming algorithm is used for stemming the Turkish documents, and the Porter stemmer is used for English. A MySQL 5.1.11 database server is used for storing the documents. By using Latent Semantic Indexing, the retrieval rate is 3 times higher than with direct matching. The success rate is 69%.

Reference [38] describes Portuguese-English experiments using LSI. They used the Los Angeles Times collection for the English documents only. Systran was used for translating 20% of the English collection to Portuguese. The total number of documents in the collection is 22,000, and the success rate of the retrieval is nearly 99%. The translation is far from perfect, and many times the incorrect sense of a word was used; e.g. “branch” was translated to “filial”, where the correct sense was “ramo”. When the system did not have a translation for a term, it remained in the original language. Nevertheless, they did not perform any corrections or modifications on the resulting translation. They used the Porter stemmer to stem the English documents and their own stemmer to stem the Portuguese translations. Stop words were also removed.

The next step was to run the SVD on the 22,000 dual-language documents. They used a binary version of LSI provided by Telcordia Technologies. An important aspect is the choice of the number of dimensions that compose the concept space. They chose 700 dimensions, since this is the number which gave the best performance within reasonable indexing time when using the previous year's query topics; it was also the highest number that their system could support. The entries in the term-by-document matrix were the local weight of a term (based on its frequency in a document) multiplied by the global weight of the term (based on its occurrences across the entire collection). The weighting scheme used was "log-entropy", given by the formulas below: a term whose appearance tends to be equally likely among the documents is given a low weight, and a term whose appearance is concentrated in a few documents is given a higher weight. The elements of the matrix are of the form:

a(i, j) = L(i, j) * G(i)

Local weighting: L(i, j) = log(tf_ij + 1)

Global (entropy) weighting: G(i) = 1 + Σ_j (p_ij log p_ij) / log(n), where p_ij = tf_ij / gf_i, gf_i is the total frequency of term i in the collection, and n is the number of documents.
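A minimal sketch of this weighting scheme, under the standard reading of log-entropy weighting, is given below: the local weight log(tf + 1) is multiplied by a global entropy weight that goes to zero for terms spread evenly over the collection and to one for terms concentrated in a single document.

import numpy as np

def log_entropy(tf):
    # tf: term-by-document matrix of raw counts
    tf = np.asarray(tf, dtype=float)
    n_docs = tf.shape[1]
    gf = tf.sum(axis=1, keepdims=True)                    # global frequency of each term
    p = np.divide(tf, gf, out=np.zeros_like(tf), where=gf > 0)
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p), 0.0)
    g = 1.0 + plogp.sum(axis=1) / np.log(n_docs)          # global entropy weight G(i)
    return np.log(tf + 1.0) * g[:, None]                  # L(i, j) * G(i)

A = log_entropy([[3, 0, 1],    # concentrated term: high global weight
                 [1, 1, 1],    # evenly spread term: global weight 0
                 [0, 5, 0]])   # term in a single document: global weight 1
print(np.round(A, 3))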

The next step was to “fold in” the remaining 91,000 English-only documents

in that semantic space, which means that the vector representations were calculated

for those remaining documents. The resulting index had 70,000 unique terms,

covering both languages. They did not index terms which were less than 3 characters

long. They did not use phrases or multiword recognition, syntactic or semantic

parsing, word sense disambiguation, heuristic association, spellchecking or correction,

proper noun identification, a controlled vocabulary, a thesaurus, or any manual

indexing. All the terms and the documents from both languages are represented in the

same conceptual space. Therefore, a query in one language may retrieve documents

in any other language.

This situation benefits from cross-linguistic homonyms, i.e. words that have

the same spelling and meaning in both languages; e.g. “singular” is represented by

one vector only, accounting for both languages. On the other hand, it suffers from

“false friends”, i.e. words that have the same spelling across languages but different

meanings; e.g. “data” in Portuguese means “date” instead of “information”. The

problem in this case is that false friends are wrongly represented by only one point in

space, placed at the average of all meanings. The ideal scenario would be taking advantage of cross-linguistic homonyms while at the same time avoiding false

friends. They are still looking for a way to do that automatically. The similarity

between a term and its translation should be very high.

The Singular Value Decomposition method is tested in Indexing by Latent Semantic Analysis [39]. This work gives details about how to solve the problem of multiple terms referring to the same object, so that relevant documents are characterized and identified properly. In their experiments, a 12-term by 9-document matrix is decomposed using SVD. The results are modestly encouraging: they show the latent semantic indexing method to be superior to simple term matching in one standard case and equal in another. Further, for these two databases, performance with LSI is superior to that obtained with the system described by Voorhees; it performed better than SMART in one case and equally in the other when term selection differences were eliminated. In order to assess the value of the basic representational method, they have so far avoided the addition of refinements that one would consider in a real application, such as discriminative term weighting, stemming, phrase finding, or a method of handling negation or disjunction in the queries.

So far they have tested the method only with queries formulated to be used

against other retrieval methods; the method almost certainly could do better with

queries in some more appropriate format. They have projects in progress to add

standard enhancements and to incorporate them in a fully automatic indexing and

retrieval system. In addition, they are working on methods to incorporate the very low

frequency, but often highly informative words that were filtered out in the trial


analysis procedures. It seems likely that with such improvements LSI will offer a

more effective retrieval method than has previously been available.

The method of Latent Semantic Indexing is described briefly in the Latent Semantic Indexing overview [40]. It describes some advantages of LSI, such as reduced dimensionality and its handling of polysemy, synonymy and term dependence. In the analysis of LSI, a matrix of about 90,000 terms and 70,000 documents is used, so the term-by-document matrix contains only 0.001% - 0.002% non-zero entries; the decomposition took nearly 18 hours of CPU time. In this setting, LSI gave a 16% improvement over the original keyword method.

Latent Semantic Indexing is a technique that projects queries and documents into a space with "latent" semantic dimensions. In the latent semantic space, a query and a document can have high cosine similarity even if they do not share any terms, as long as their terms are semantically similar in a sense to be described later. One can look at LSI as a similarity metric that is an alternative to word-overlap measures like tf-idf. The latent semantic space that queries and documents are projected into has fewer dimensions than the original space; LSI is thus a method for dimensionality reduction. A dimensionality reduction technique takes a set of objects that exist in a high-dimensional space and represents them in a lower-dimensional space, often a two-dimensional or three-dimensional space for the purpose of visualization. Latent semantic indexing is the application of a particular mathematical technique, called Singular Value Decomposition or SVD, to a word-by-document matrix. LSI is a least-squares method: the projection into the latent semantic space is chosen such that the representations in the original space are changed as little as possible, as measured by the sum of the squares of the differences.


SVD takes a matrix A and represents it as Aˆ in a lower dimensional space

such that the “distance” between the two matrices as measured by the 2-norm is

minimized. The 2-norm for matrices is the equivalent of Euclidean distance for

vectors. SVD projects an n-dimensional space onto a k-dimensional space, where n >> k. In our application, n is the number of word types in the collection. Values of k that

are frequently chosen are 100 and 150. The projection transforms a document's vector

in n-dimensional word space into a vector in the k-dimensional reduced space. There

are many different mappings from high dimensional to low-dimensional spaces.

Latent Semantic Indexing chooses the mapping that is optimal in the sense that it

minimizes the distance. This setup has the consequence that the dimensions of the

reduced space correspond to the axes of greatest variation. Reference [41] describes a method for retrieving English-Greek documents using Latent Semantic Indexing for Cross Language Information Retrieval. The English and Greek documents are clustered along the x-axis and y-axis of a two-dimensional vector space. A parsing mechanism is used, and terms must appear more than once in the database. This paper mainly focuses on query matching within the database. Folding-in is a further technique, used when an LSI-generated database already exists: each new document is represented as a weighted sum of its component term vectors and is appended to the existing documents.

Reference [42], Latent Semantic Indexing: A Fast Track Tutorial, describes how Singular Value Decomposition (SVD) is used in Latent Semantic Indexing (LSI) to score documents and queries. Do-it-yourself procedures using an online matrix calculator are described. The tutorial helps readers learn basic LSI models and then move on to advanced models, understand how LSI ranks documents, replicate all the calculations and experiment with their own data, and recognize the many SEO myths and fallacies that surround "LSI-based" search marketing.

Reference [43] describes Singular Value Decomposition (SVD), a mathematical technique used for reducing the dimensions of a matrix. This tutorial describes how documents are decomposed from a single matrix, which gives the relation between correlated and uncorrelated documents. The tutorial illustrates the ideas with two-dimensional data points.

3.3 LATENT SEMANTIC INDEXING

Indexing is fairly simple for a single language, but it becomes quite difficult in the multilingual case. We therefore propose Latent Semantic Indexing (LSI) using Singular Value Decomposition (SVD). This latent semantic indexing is the best approach for mapping each document and query vector into a reduced-dimensional space [47]. It is based on concept matching rather than matching of index terms. Latent Semantic Indexing is a variant of the vector-retrieval method in which the dependencies between terms are explicitly modeled and exploited to improve retrieval. One advantage of the LSI representation is that a query can retrieve a relevant document even if they have no words in common.

Most information-retrieval methods depend on exact matches between words

in users' queries and words in documents. Typically, documents containing one or


more query words are returned to the user. Such methods will, however, fail to

retrieve relevant materials that do not share words with users' queries. One reason for

this is that the standard retrieval models treat words as if they are independent,

although it is quite obvious that they are not. A central theme of LSI is that term-term interrelationships can be automatically modeled and used to improve

retrieval; this has been critical in cross-language retrieval since direct term matching

is of little use.

Latent semantic indexing adds an important step to the document indexing

process. In addition to recording which keywords a document contains [46], the

method examines the document collection as a whole, to see which other documents

contain some of those same words. LSI considers documents that have many words in

common to be semantically close, and ones with few words in common to be

semantically distant. This simple method correlates surprisingly well with how a

human being, looking at content, might classify a document collection. Although the

LSI algorithm doesn't understand anything about what the words mean, the patterns it

notices can make it seem astonishingly intelligent.

When you search an LSI-indexed database, the search engine looks at

similarity values it has calculated for every content word, and returns the documents

that it thinks best fit the query. Because two documents may be semantically very

close even if they do not share a particular keyword, LSI does not require an exact

match to return useful results. Where a plain keyword search fails because there is no exact match, LSI will often return relevant documents that don't contain the keyword at all.


Suppose LSI is used to index a collection of mathematical articles. If the words n-dimensional [49], manifold and topology appear together in enough articles, the search algorithm will notice that the three terms are semantically close. A search for n-dimensional manifolds will therefore return the set of articles containing that phrase (the same result a regular search would give), but also articles that contain just the word topology. The search engine understands nothing about mathematics, but examining a sufficient number of documents teaches it that the three terms are related. It then uses that information to provide an expanded set of results with better recall than a plain keyword search.

LSI examines the similarity of the contexts in which words appear, and creates

a reduced-dimension feature-space representation in which words that occur in similar

contexts are near each other. That is, the method first creates a representation that captures the similarity of usage of terms and then uses this representation for retrieval. The derived feature space reflects these interrelationships. LSI uses a method from

linear algebra, singular value decomposition (SVD), to discover the important

associative relationships. It is not necessary to use any external dictionaries, thesauri

[51], or knowledge bases to determine these word associations because they are

derived from a numerical analysis of existing texts. The learned associations are

specific to the domain of interest, and are derived completely automatically.

The singular value decomposition (SVD) technique is closely related to eigenvector decomposition and factor analysis. For information retrieval and filtering applications, in the first step a large term-document matrix is constructed, in much the same way as in vector or Boolean methods. This term-document matrix is then decomposed into a set of typically k orthogonal factors from which the original matrix can be approximated by linear combination. This analysis reveals the latent structure in the matrix that is obscured by noise or by variability in word usage.

Traditional vector methods represent documents as linear combinations of orthogonal terms, so that the angle between two documents depends on the frequency with which the same terms occur in both, without regard to any correlations among the terms [50]. In such a representation, for example, Doc 3 might contain only Term 2, Doc 1 only Term 1, and Doc 2 both terms. LSI, in contrast, represents terms as continuous values on each of the orthogonal indexing dimensions. Since the number of factors or dimensions, k, is much smaller than the number of unique terms, the terms will not be independent. When two terms are used in similar contexts (documents), they will have similar vectors in the reduced-dimension LSI representation. LSI partially overcomes some of the deficiencies of assuming independence of words, and provides a way of dealing with synonymy automatically without the need for a manually constructed thesaurus. Detailed mathematical descriptions and examples of the underlying LSI/SVD method have been presented in the literature.

The result of the SVD is a set of vectors representing the location of each term and document in the reduced-dimension LSI representation. Retrieval proceeds by using the terms in a query to identify a point in the space; technically, the query is located at the weighted vector sum of its constituent terms. The documents are then ranked by their similarity to the query, typically using a cosine measure of similarity. While the most common retrieval scenario involves returning documents in response to a user query, the LSI representation allows for much more flexible retrieval scenarios. Since both term and document vectors are represented in the same space, similarities between any combination of terms and documents can be easily obtained.


For example, a user can ask for a term's nearest documents, a term's nearest terms, a document's nearest terms, or a document's nearest documents.

New documents can be added to the LSI representation using a procedure

called folding in. This method assumes that the LSI space is a reasonable

characterization of the important underlying dimensions of similarity, and that new

items can be described in terms of the existing dimensions. Any document not used in

the construction of the semantic space is located at the weighted vector sum of its

constituent terms. This is exactly how queries are handled and has the desirable

mathematical property that a document that is already in the space is folded in at the

same location. A new term is located at the vector sum of the documents in which it

occurs. In single-language document retrieval, the LSI method has equaled or

outperformed standard vector methods in almost every case, and was as much as 30%

better in some cases.
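A sketch of folding-in under these definitions is given below; U_k, S_k and V_k are assumed to come from a previously computed truncated SVD of the term-by-document matrix.

import numpy as np

def fold_in_document(doc_term_vector, U_k, S_k):
    # Place a new document (or query) at d^T U_k S_k^{-1}, the scaled sum of its term vectors
    return (doc_term_vector @ U_k) / S_k

def fold_in_term(term_doc_vector, V_k, S_k):
    # Place a new term at t^T V_k S_k^{-1}, the scaled sum of the documents containing it
    return (term_doc_vector @ V_k) / S_k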

3.3.1. ADVANTAGES

A. TRUE DIMENSIONS

The assumption in LSI and similarly for other forms of dimensionality

reduction like principal component analysis is that the new dimensions are a better

representation of documents and queries. The metaphor underlying the term “latent”

is that these new dimensions are the true representation. This true representation was

then obscured by a generation process that expressed a particular dimension with one

set of words in some documents and a different set of words in another document. LSI

analysis recovers the original semantic structure of the space and its original

Page 20: LATENT SEMANTIC INDEXING FOR HINDI- ENGLISH CLIR ...shodhganga.inflibnet.ac.in/bitstream/10603/12226/9/09_chapter3.pdf · inverted file system, probabilistic latent semantic indexing,

85

dimensions [48]. Describe the three major advantages of using the LSI representation

with the following labels are synonymy, polysemy, and the term dependence.

B. SYNONYMY

Synonymy refers to the fact that the same underlying concept can be described

using different terms. Traditional retrieval strategies have trouble discovering

documents on the same topic that use a different vocabulary. In LSI, the concept in question, as well as all documents related to it, is likely to be represented by a similar weighted combination of indexing variables.

C. POLYSEMY

Polysemy describes words that have more than one meaning, which is the

common property of language. Large numbers of polysemous words in the query can

reduce the precision of a search significantly. By using a reduced representation in

LSI, one hopes to remove some "noise" from the data, which could be described as

rare and less important usages of certain terms. This would work only when the real

meaning is close to the average meaning. Since the LSI term vector is just a weighted

average of the different meanings of the term, when the real meaning differs from the

average meaning, LSI may actually reduce the quality of the search.

D. TERM DEPENDENCE

The traditional vector space model assumes term independence and terms

serve as the orthogonal basis vectors of the vector space. Since there are strong

associations between terms in language, this assumption is never satisfied. While term

independence represents the most reasonable first-order approximation, it should be


possible to obtain improved performance by using term associations in the retrieval

process. Adding common phrases as search items is a simple application of this

approach. On the other hand, the LSI factors are orthogonal by definition, and terms

are positioned in the reduced space in a way that reflects the correlations on their use

across documents. It is very difficult to take advantage of term associations without

dramatically increasing the computational requirements of the retrieval problem.

While the LSI solution is difficult to compute for large collections, it need only be

constructed once for the entire collection and performance at retrieval time is not

affected.

3.3.2. DISADVANTAGES

A. STORAGE

One could argue that the SVD representation is more compact: many documents have more than 150 unique terms, so the sparse vector representation would seem to take up more storage space than the compact SVD representation if the dimensions are reduced to 150. In reality, the opposite is actually true. For example, the document-by-term matrix for the Cranfield collection used in Hull's experiments had 90,441 non-zero entries after stemming and stop-word removal. Retaining only 100 of the possible 1399 LSI vectors requires storing 139,900 values for the documents alone, and the term vectors require the storage of roughly 400,000 additional values. In addition, the LSI values are real numbers while the original term frequencies are integers, adding to the storage costs. Using LSI vectors, one can no longer take advantage of the fact that each term occurs in a limited number of documents, which accounts for the sparse nature of the term-by-document matrix.

With recent advances in electronic storage media, the storage requirements of LSI are not a critical problem, but the loss of sparseness has other, more serious implications.

B. EFFICIENCY

One of the most important speedups in vector space search comes from using

an inverted index. As a consequence, only documents that have some terms in

common with the query must be examined during the search. With LSI, however, the

query must be compared to every document in the collection. However, several

factors can reduce or eliminate this drawback. If the query has more terms than its

representation in the LSI vector space, then inner product similarity scores will take

more time to compute in term space. For example, if relevance feedback is conducted

using the full text of the relevant documents, the number of terms in the query is

likely to grow to be many times the number of LSI vectors, leading to a corresponding

increase in search time. In addition, using a data structure such as the k-d tree in

conjunction with LSI would greatly speed the search for nearest neighbors, provided

only a partial ordering of the documents is required. Most of the additional costs come

in the pre-processing stage when the SVD and the k-d tree are computed, and actual

search time should not be significantly degraded. Other query expansion techniques

suffer even more heavily from the difficulties described above, and LSI performs

relatively well for long documents due to the small number of context vectors used to

describe each document. However, implementation of LSI does require an additional

investment of storage and computing time.


C. TOWARD A THEORETICAL FOUNDATION

Although improved performance has been observed empirically, there is very little in the literature in the way of a mathematical theory that predicts this improvement. In this section we briefly describe one paper that attempts to use mathematical techniques to rigorously explain the empirically observed improved performance of LSI. Papadimitriou starts by citing an interesting mathematical fact due to Eckart and Young, often cited as an explanation of the improved performance of LSI, which states, informally, that LSI retains as much as possible the relative positions of the document vectors while projecting them into a lower-dimensional space. This may only provide an explanation of why LSI does not deteriorate too much in performance relative to conventional vector-space methods; it fails to justify the observed improvement in precision and recall.

3.3.3. APPLICATIONS OF LSI

A. INFORMATION RETRIEVAL

The application of Singular Value Decomposition to information retrieval was originally proposed by a group of researchers at Bellcore and called Latent Semantic Indexing in this context. At this point, it should be clear how to use LSI for IR. In performance reports for several information science test collections, the average precision using LSI ranged from comparable to roughly 30% better than that obtained using standard keyword vector methods. The LSI method performs better relative to standard vector methods when the queries and relevant documents do not share many words, and at high levels of recall.


B. RELEVANCE FEEDBACK

Most of the tests of Relevance Feedback using LSI have involved a method in

which the initial query is replaced with the vector sum of the documents the users

have selected as relevant. The use of negative information has not yet been exploited

in LSI; for example, by moving the queries away from documents that the user has

indicated are irrelevant. Replacing the users’ query with the first relevant document

improves performance by an average of 33% and replacing it with the average of the

first three relevant documents improves performance by an average of 67%.

Relevance feedback provides sizable and consistent retrieval advantages. One way of thinking about the success of these methods is that they augment the initial query, which is usually quite impoverished, with many additional words. LSI does some of this kind of query expansion or enhancement even without relevance information, but it can be further augmented with relevance information.
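The feedback variants mentioned above can be sketched as a one-line replacement of the query vector; relevant_doc_vecs is assumed to hold the LSI vectors of the documents the user marked relevant, in the order they were judged.

import numpy as np

def feedback_query(relevant_doc_vecs, k=1):
    # Replace the query with the average of the first k relevant documents' LSI vectors
    return np.mean(np.asarray(relevant_doc_vecs)[:k], axis=0)

# new_query = feedback_query(relevant_doc_vecs, k=1)   # first relevant document
# new_query = feedback_query(relevant_doc_vecs, k=3)   # average of the first three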

C. INFORMATION FILTERING

Applying LSI to information filtering applications is straightforward. An

initial sample of documents is analyzed using standard LSI/SVD tools. A user's interest is represented as one vector in this reduced-dimension LSI space. Each new

document is matched against the vector and if it is similar enough to the interest

vector it is recommended to the user. Learning methods like relevance feedback can

be used to improve the representation of interest vectors over time. Performance studies are encouraging.
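A minimal sketch of this filtering step, with an assumed cosine-similarity threshold, is:

import numpy as np

def recommend(new_doc_vec, interest_vec, threshold=0.5):
    # Recommend the new document if it is close enough to the user's interest vector
    cos = new_doc_vec @ interest_vec / (
        np.linalg.norm(new_doc_vec) * np.linalg.norm(interest_vec))
    return cos >= threshold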


D. TREC

Recently, LSI has been used for both information filtering and information

retrieval in TREC. The queries are very long and have detailed descriptions,

averaging more than 50 words in length. The fact that the TREC queries are quite rich

means that the smallest advantages would be expected for LSI or any other methods

that attempt to enhance user queries. The big challenge in this collection was to

extend the LSI tools to handle collections of this size. The results were quite

encouraging. At the time of the TREC conferences it was not reasonable to compute Aˆ for the complete collection. Instead, a sample of about 70,000 documents and 90,000 terms was used. Such term-by-document matrices (A) are quite sparse, containing only .001 to .002 percent non-zero entries. Computing a rank-200 approximation, i.e. the 200 largest singular values and corresponding singular vectors, required about 18 hours of CPU time on a SUN SPARCstation 10 workstation.

Documents not in the original LSI analysis were folded in. Although it is very difficult to compare across systems in any detail because of large pre-processing, representation and matching differences, LSI performance was quite good. For filtering tasks, using

information about known relevant documents to create a vector for each query was

beneficial. The retrieval advantage of 31% was somewhat smaller than that observed

for other filtering tests and is attributable to the good initial queries in TREC. For

retrieval tasks, LSI showed 16% improvement when compared with the keyword

vector methods. Again, the detailed original queries account for the somewhat smaller

advantages than previously observed.


E. CROSS-LANGUAGE RETRIEVAL

It is important to note that the LSI analysis makes no use of English syntax or

semantics. This means that LSI is applicable to any language. In addition, it can be

used for cross-language retrieval - documents are in several languages and user

queries can match documents in any language. What is required for cross-language

applications is a common space in which words from many languages are represented

and describe one method for creating such an LSI space. The original term-document

matrix is formed using a collection of abstracts that have versions in more than one

language. Each abstract is treated as the combination of its French English versions.

The truncated SVD is computed for this term by combined-abstract matrix A.

The resulting space consists of combined-language abstracts, English words

and French words. English words and French words that occur in similar combined

abstracts will be near each other in the reduced-dimension LSI space. After this

analysis, monolingual abstracts can be folded-in: a French abstract will simply be

located at the vector sum of its constituent words that are already in the LSI space.

Queries in either French or English can be matched to French or English abstracts.

There is no difficult translation involved in retrieval from the multilingual LSI space.

Experiments showed that the completely automatic multilingual space was more

effective than single-language spaces. The retrieval of French documents in response

to English queries was as effective as first translating the queries into French and

searching a French-only database. The method has shown almost as good results for

retrieving English abstracts and Japanese Kanji ideographs, and for multilingual

translations of the Bible.
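The dual-language training setup described above can be sketched as follows: each training item concatenates its English and French versions so that words of both languages share a column, a truncated SVD of that matrix defines the common space, and a monolingual document is later folded in at the scaled sum of its term vectors. The tiny corpus, tokenization, and k below are purely illustrative.

import numpy as np
from collections import Counter

english = ["the car engine", "singular value decomposition"]
french = ["le moteur de la voiture", "decomposition en valeurs singulieres"]
combined = [e + " " + f for e, f in zip(english, french)]   # combined-language "abstracts"

vocab = sorted({w for doc in combined for w in doc.split()})
index = {w: i for i, w in enumerate(vocab)}

A = np.zeros((len(vocab), len(combined)))                   # term-by-combined-abstract matrix
for j, doc in enumerate(combined):
    for w, c in Counter(doc.split()).items():
        A[index[w], j] = c

U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, S_k = U[:, :k], S[:k]

# Fold a French-only document into the shared space
d = np.zeros(len(vocab))
for w in "la voiture".split():
    if w in index:
        d[index[w]] += 1
d_hat = (d @ U_k) / S_k                                     # its location in the common space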


F. MATCHING PEOPLE INSTEAD OF DOCUMENTS

In a couple of applications, LSI has been used to return the best matching

people instead of documents. In these applications, people were represented by articles they had written. In one application, known as the Bellcore Advisor, a system was

developed to find local experts relevant to users' queries. A query was matched to the

nearest documents and project descriptions and the author's organization was returned

as the most relevant internal group. In another application, LSI was used to automate

the assignment of reviewers to the submitted conference papers. Several hundred

reviewers were described by means of texts they had written, and this formed the

basis of the LSI analysis. Hundreds of submitted papers were represented by their

abstracts, and matched to the closest reviewers. These LSI similarities were used to

assign papers to reviewers for a major human-computer interaction conference.

Subsequent analyses suggested that these completely automatic assignments were as

good as those of human experts.

G. NOISY INPUT

Because LSI does not depend on literal keyword matching, it is especially useful when the text input is noisy, as with OCR (Optical Character Recognition), open input, or spelling errors. If there are scanning errors and a word such as Dumais is misspelled as Duniais, many of the other words in the document will still be spelled correctly. If these correctly spelled context words also occur in documents that contain a correctly spelled version of Dumais, then Dumais will probably be near Duniais in the k-dimensional space determined by Aˆ.


H. OTHERS

The authors of [119,120] have used SVD and related dimension-reduction ideas for word sense disambiguation and information retrieval work. LSI/SVD has also been used as a first step in conjunction with statistical classification, for example discriminant analysis; using the LSI-derived dimensions effectively reduces the number of predictor variables for classification. LSI/SVD has likewise been used to reduce the training set dimension for a neural network protein classification system used in human genome research.

I. OPEN COMPUTATIONAL OR STATISTICAL ISSUES

There are a number of computational and statistical improvements that would make LSI even more useful, especially for large collections:

• computing the truncated SVD of extremely large sparse matrices efficiently (see the sketch after this list),

• performing SVD updating in real time for databases that change frequently, and

• efficiently comparing queries to documents to find near neighbors in high-dimensional spaces.
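For the first point, sparse eigensolvers already go a long way: the sketch below (using SciPy, as an illustration rather than the system used in this thesis) computes only the k largest singular triplets of a large random sparse term-by-document matrix instead of a full dense SVD.

import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# A random sparse "term-by-document" matrix standing in for a real collection
A = sparse_random(100000, 70000, density=1e-5, format="csc", random_state=0)

U_k, S_k, Vt_k = svds(A, k=100)          # only the 100 largest singular triplets
order = np.argsort(-S_k)                 # svds returns singular values in ascending order
U_k, S_k, Vt_k = U_k[:, order], S_k[order], Vt_k[order, :]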

3.4 SINGULAR VALUE DECOMPOSITION

The Singular Value Decomposition (SVD) is a widely used technique to

decompose a matrix into several component matrices, exposing many of the useful

and interesting properties of the original matrix. The decomposition of a matrix is

often called a factorization. Ideally, the matrix is decomposed into a set of factors that

are optimized based on some criterion. For example, a criterion might be the

reconstruction of the decomposed matrix. The decomposition of a matrix is also


useful when the matrix is not of full rank. That is, the rows or columns of the matrix

are linearly dependent.

Theoretically, one can use Gaussian elimination to reduce the matrix to row

echelon form and then count the number of nonzero rows to determine the rank.

However, this approach is not practical when working in finite precision arithmetic. A

similar case presents itself when using LU decomposition where L is in lower

triangular form with 1's on the diagonal and U is in upper triangular form. Ideally, a

rank-deficient matrix may be decomposed into a smaller number of factors than the

original matrix and still preserve all of the information in the matrix.

The SVD, in general, represents an expansion of the original data in a

coordinate system where the covariance matrix is diagonal. Using the SVD, one can

determine the dimension of the matrix range or more-often called the rank. The rank

of a matrix is equal to the number of linearly independent rows or columns. This is

often referred to as a minimum spanning set or simply a basis. The SVD can also

quantify the sensitivity of a linear system to numerical error or obtain a matrix

inverse. Additionally, it provides solutions to least-squares problems and handles

situations when matrices are either singular or numerically very close to singular.

The following are the steps involved in constructing the SVD and using it for retrieval:

Step 1: Score term weights and construct the term-document matrix A and the query matrix.

Step 2: Decompose the matrix A into the U, S and V matrices, where A = U S V^T.

Step 3: Implement a rank-k approximation by keeping the first k columns of U and V and the first k rows and columns of S.

Step 4: Find the new document vector coordinates in this reduced k-dimensional space. A row of V holds the coordinates of an individual document vector.

Step 5: Find the new query vector coordinates in the reduced k-dimensional space: q_k = q^T U_k S_k^{-1}.

Step 6: Rank documents in decreasing order of query-document cosine similarities.
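A minimal end-to-end sketch of these six steps in Python/NumPy is given below. The small term-by-document matrix and query use raw counts as term weights and are purely illustrative.

import numpy as np

# Step 1: term-by-document matrix A (terms x documents) and a query vector q
A = np.array([[1., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 1.]])
q = np.array([1., 1., 0., 0.])                 # query containing terms 1 and 2

# Step 2: decompose A into U, S and V^T
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Step 3: rank-k approximation
k = 2
U_k, S_k, V_k = U[:, :k], S[:k], Vt[:k, :].T

# Step 4: document coordinates in the reduced space are the rows of V_k
doc_coords = V_k

# Step 5: query coordinates q_k = q^T U_k S_k^{-1}
q_k = (q @ U_k) / S_k

# Step 6: rank documents by cosine similarity to the query
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

scores = np.array([cosine(q_k, d) for d in doc_coords])
print(np.argsort(-scores), np.round(scores, 3))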

Singular value decomposition (SVD) can be looked at from three mutually compatible points of view. On the one hand, it can be seen as a method for transforming correlated variables into a set of uncorrelated ones that better expose the various relationships among the original data items. At the same time, SVD is a method for identifying and ordering the dimensions along which data points exhibit the most variation. This ties in to the third way of viewing SVD: once we have identified where the most variation is, it is possible to find the best approximation of the original data points using fewer dimensions. Hence, the SVD can be seen as a method for data reduction. As an illustration of these ideas, consider a set of two-dimensional data points.

The regression line running through them shows the best approximation of the original data with a one-dimensional object, a line. It is the best approximation in the sense that it is the line that minimizes the distance between each original point and the line. If a perpendicular is drawn from each point to the regression line, and the intersection of those perpendiculars with the line is taken as the approximation of the original data points, we obtain a reduced representation of the original data that captures as much of the original variation as possible.

A second regression line, perpendicular to the first, captures as much of the variation as possible along the second dimension of the original data set. It does a poorer job of approximating the original data because it corresponds to a dimension exhibiting less variation to begin with. It is possible to use these regression lines to generate a set of uncorrelated data points that will show sub-groupings in the original data not necessarily visible at first glance. These are the basic ideas behind SVD: taking a high-dimensional, highly variable set of data points and reducing it to a lower-dimensional space that exposes the substructure of the original data more clearly and orders it from the most variation to the least. What makes SVD practical for NLP applications is that one can simply ignore variation below a particular threshold to massively reduce the data while being assured that the main relationships of interest have been preserved. The full singular value decomposition can be defined as follows.

SVD is based on a theorem from linear algebra which says that a rectangular matrix A can be broken down into the product of three matrices: an orthogonal matrix U, a diagonal matrix S, and the transpose of an orthogonal matrix V. The theorem is usually presented as

A_mn = U_mm S_mn V^T_nn

where U^T U = I and V^T V = I, the columns of U are orthonormal eigenvectors of A A^T, the columns of V are orthonormal eigenvectors of A^T A, and S is a diagonal matrix containing the square roots of the eigenvalues of A A^T (equivalently, of A^T A) in descending order. The following example merely applies this definition to a small matrix in order to compute its SVD. In the next section, the application of SVD to document classification is interpreted.
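The original worked example is not reproduced here; as a stand-in, the short JAMA sketch below decomposes a small, arbitrarily chosen matrix and checks the defining relations stated above. The matrix values are assumptions made purely for illustration.

import Jama.Matrix;
import Jama.SingularValueDecomposition;

// Decomposes an arbitrary 3x2 matrix and verifies A = U S V^T, U^T U = I and V^T V = I.
public class SvdDemo {
    public static void main(String[] args) {
        Matrix A = new Matrix(new double[][] {{3, 1}, {1, 3}, {1, 1}});

        SingularValueDecomposition svd = A.svd();
        Matrix U = svd.getU();   // columns: orthonormal eigenvectors of A A^T
        Matrix S = svd.getS();   // diagonal: singular values in descending order
        Matrix V = svd.getV();   // columns: orthonormal eigenvectors of A^T A

        // All three Frobenius norms below should be numerically close to zero.
        System.out.println(A.minus(U.times(S).times(V.transpose())).normF());
        System.out.println(U.transpose().times(U).minus(Matrix.identity(2, 2)).normF());
        System.out.println(V.transpose().times(V).minus(Matrix.identity(2, 2)).normF());
    }
}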

The Singular Value Decomposition (SVD) is a widely used technique to

decompose a matrix into several component matrices, exposing many of the useful

and interesting properties of the original matrix. The decomposition of a matrix is

often called a factorization. Ideally, the matrix is decomposed into a set of factors that

are optimized based on some criterion. For example, a criterion might be the

reconstruction of the decomposed matrix. The decomposition of a matrix is also

useful when the matrix is not of full rank. That is, the rows or columns of the matrix

are linearly dependent. Theoretically, one can use Gaussian elimination to reduce the

matrix to row echelon form and then count the number of nonzero rows to determine

the rank. However, this approach is not practical when working in finite precision arithmetic.

A similar case presents itself when using LU decomposition where L is in

lower triangular form with 1's on the diagonal and U is in upper triangular form.

Ideally, a rank-deficient matrix may be decomposed into a smaller number of factors

than the original matrix and still preserve all of the information in the matrix. The

SVD, in general, represents an expansion of the original data in a coordinate system

where the covariance matrix is diagonal. Using the SVD, one can determine the

dimension of the matrix range, more often called the rank. The rank of a matrix is

equal to the number of linearly independent rows or columns. This is often referred to

as a minimum spanning set or simply a basis. The SVD can also quantify the

sensitivity of a linear system to numerical error or obtain a matrix inverse.


Additionally, it provides solutions to least-squares problems and handles situations

when matrices are either singular or numerically very close to singular.

A. TF-IDF

The tf–idf weight (term frequency–inverse document frequency) is a weight

often used in information retrieval and text mining. This weight is a statistical

measure used to evaluate how important a word is to a document in a collection or

corpus. The importance increases proportionally to the number of times a word

appears in the document but is offset by the frequency of the word in the corpus.

Variations of the tf–idf weighting scheme are often used by search engines as a central

tool in scoring and ranking a document's relevance given a user query.

B. TERM FREQUENCY

The number of times a term occurs in a document is called its term frequency. By taking these two factors, term frequency (TF) and inverse document frequency (IDF), into account, it is possible to assign "weights" to search results and therefore to order them statistically. Put another way, a search result's score ("ranking") is the product of TF and IDF:

TF-IDF = TF * IDF, where:

TF = C / T, where C = number of times a given word appears in a document and T = total number of words in the document

IDF = log(D / DF), where D = total number of documents in the corpus, and DF = total number of documents containing the given word


Term frequency alone, however, is not sufficient. In a query such as "the brown cow", the term "the" is so common that raw term counts will tend to incorrectly emphasize documents which happen to use the word "the" more frequently, without giving enough weight to the more meaningful terms "brown" and "cow". The term "the" is not a good keyword for distinguishing relevant from non-relevant documents and terms. On the contrary, the words "brown" and "cow", which occur more rarely, are good keywords for distinguishing relevant documents from non-relevant ones. Hence an inverse document frequency factor is incorporated, which diminishes the weight of terms that occur very frequently in the collection and increases the weight of terms that occur rarely. A high tf-idf weight is reached by a high term frequency in the given document and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. The tf-idf value of a term is greater than zero if and only if the ratio inside the idf's log function is greater than 1. Depending on whether a 1 is added to the denominator, a term occurring in all documents will have either a zero or a negative idf, and if the 1 is added to the denominator, a term that occurs in all but one document will have an idf equal to zero.

C. EXAMPLE

Consider a document containing 100 words wherein the word cow appears 3

times. Following the previously defined formulas, the term frequency (TF) for cow is

then (3 / 100) = 0.03. Now, assume there are 10 million documents and cow appears

in one thousand of these. Then, the inverse document frequency is calculated as log

(10 000 000 / 1 000) = 4. The TF-IDF score is the product of these quantities: 0.03 × 4

= 0.12.
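The formulas and the worked example above can be transcribed directly into code. The following short Java sketch is only an illustration; the method names are not taken from any existing library, and base-10 logarithms are assumed so as to match the example.

// A minimal transcription of the TF-IDF formulas above.
public class TfIdf {

    static double tf(int termCount, int totalTerms) {
        return (double) termCount / totalTerms;                 // TF = C / T
    }

    static double idf(int totalDocs, int docsWithTerm) {
        return Math.log10((double) totalDocs / docsWithTerm);   // IDF = log(D / DF)
    }

    public static void main(String[] args) {
        // The "cow" example: 3 occurrences in a 100-word document,
        // appearing in 1,000 of 10,000,000 documents.
        double score = tf(3, 100) * idf(10_000_000, 1_000);
        System.out.println(score);                              // approximately 0.12
    }
}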


D. APPLICATIONS IN VECTOR SPACE MODEL

The tf-idf weighting scheme is often used in the vector space model together

with cosine similarity to determine the similarity between two documents.

E. JAMA

JAMA is a basic linear algebra package for Java. It provides user-level classes

for constructing and manipulating real, dense matrices. It is meant to provide

sufficient functionality for routine problems, packaged in a way that is natural and

understandable to non-experts.

F. CAPABILITIES

JAMA comprises six Java classes: Matrix, CholeskyDecomposition, LUDecomposition, QRDecomposition, SingularValueDecomposition and EigenvalueDecomposition. The Matrix class provides the fundamental operations of numerical linear algebra. Various constructors create Matrices from two-dimensional arrays of double precision floating point numbers. Various get and set methods provide access to submatrices and matrix elements. The basic arithmetic operations include

matrix addition and multiplication, matrix norms and selected element-by-element

array operations. A convenient matrix print method is also included. Five

fundamental matrix decompositions, which consist of pairs or triples of matrices,

permutation vectors, and the like, produce results in five decomposition classes.

These decompositions are accessed by the Matrix class to compute solutions of

simultaneous linear equations, determinants, inverses and other matrix functions. The

five decompositions are


Cholesky Decomposition of symmetric, positive definite matrices

LU Decomposition (Gaussian elimination) of rectangular matrices

QR Decomposition of rectangular matrices

Eigenvalue Decomposition of both symmetric and nonsymmetric square

matrices

Singular Value Decomposition of rectangular matrices

The current JAMA deals only with real matrices. We expect that future versions

will also address complex matrices. This has been deferred since crucial design

decisions cannot be made until certain issues regarding the implementation of

complex matrices in the Java language are resolved.

The design of JAMA represents a compromise between the need for pure and elegant

object-oriented design and the need to enable high performance implementations.
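As a brief illustration of these capabilities, the short snippet below constructs a small matrix, solves a linear system and reads off a few derived quantities; the values are arbitrary and chosen only for demonstration.

import Jama.Matrix;

public class JamaDemo {
    public static void main(String[] args) {
        Matrix A = new Matrix(new double[][] {{4, 1}, {1, 3}});
        Matrix b = new Matrix(new double[][] {{1}, {2}});

        Matrix x = A.solve(b);                           // solution of A x = b
        System.out.println("determinant = " + A.det());
        System.out.println("rank        = " + A.rank()); // effective rank, obtained from the SVD
        System.out.println("condition   = " + A.cond()); // ratio of largest to smallest singular value
        x.print(8, 4);                                   // convenient matrix print method
    }
}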

Table 3.1: JAMA Package - Summary of JAMA Capabilities

Object Manipulation:   constructors; set elements; get elements; copy; clone

Elementary Operations: addition; subtraction; multiplication; scalar multiplication;
                       element-wise multiplication; element-wise division;
                       unary minus; transpose; norm

Decompositions:        Cholesky; LU; QR; SVD; symmetric eigenvalue;
                       nonsymmetric eigenvalue

Equation Solution:     nonsingular systems; least squares

Derived Quantities:    condition number; determinant; rank; inverse; pseudoinverse

3.5 EXPERIMENTAL SETUP & RESULTS

In this experiment, a parallel corpus has been created. The documents are

retrieved from India.gov web sites that contain both Hindi and English documents

which are semantically equal. Documents in both languages have been divided into

paragraphs. Each paragraph is kept as a separate document. So, these documents are

mapped to the respective translation language paragraphs. The mapping data is stored

in the MySQL database server.

The corpus consists of 180 Hindi and 180 English parallel documents. For the

purpose of testing every paragraph is taken as a single document.


Figure 3.2: System overview

In this work, a system that can search for the cross-language mate of a given document has been created. First, the system is trained with bilingual documents. In this stage, the English documents are stemmed using the Porter stemmer and the Hindi documents are stemmed manually. After stemming the documents with the corresponding stemmers, stop words are removed to increase the retrieval performance.

By counting the frequency of each word in the documents, a term-by-document matrix (the feature space) is created. The feature space is normalized using Term Frequency - Inverse Document Frequency (TF-IDF), because longer documents may otherwise bias the retrieval results. The normalized term-by-document matrix is then decomposed into the U, S and V matrices using singular value decomposition (SVD). For this, the JAMA package is used, which provides the classes needed for decomposing the feature space, as sketched below.
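A rough sketch of this stage is given below. Tokenization, stemming and stop-word removal are assumed to have been completed already; the class and method names are illustrative, and the TF-IDF normalization reuses the formulas given earlier (with base-10 logarithms).

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Builds the term-by-document matrix (feature space) from pre-processed documents
// and normalizes it with TF-IDF, ready to be handed to Jama.Matrix for the SVD step.
public class FeatureSpace {

    public static double[][] build(List<List<String>> docs) {
        // Assign each distinct term a row index.
        Map<String, Integer> vocab = new LinkedHashMap<>();
        for (List<String> doc : docs)
            for (String term : doc)
                vocab.putIfAbsent(term, vocab.size());

        int m = vocab.size(), n = docs.size();
        double[][] counts = new double[m][n];
        for (int d = 0; d < n; d++)
            for (String term : docs.get(d))
                counts[vocab.get(term)][d]++;

        // TF-IDF normalization: tf = count / document length, idf = log10(n / df)
        double[][] weights = new double[m][n];
        for (int t = 0; t < m; t++) {
            int df = 0;
            for (int d = 0; d < n; d++) if (counts[t][d] > 0) df++;
            double idf = Math.log10((double) n / df);
            for (int d = 0; d < n; d++)
                weights[t][d] = (counts[t][d] / docs.get(d).size()) * idf;
        }
        return weights;   // e.g. new Jama.Matrix(weights) for the decomposition
    }
}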


A. LSI

For multilingual processing, indexing is quite difficult. For this reason, Latent Semantic Indexing (LSI) using Singular Value Decomposition (SVD) is proposed. Latent semantic indexing maps each document and query vector into a reduced-dimensional space and is based on concept matching rather than matching of index terms. Latent Semantic Indexing is a variant of the vector-retrieval method in which the dependencies between terms are explicitly modeled and exploited to improve retrieval.

B. SVD

1. Score term weights, construct the term-document matrix A, and the query matrix.

2. Decompose the matrix A and find the U, S and V matrices, where A = U S V^T.

3. Implement a rank-k approximation by keeping the first k columns of U and V and the first k columns and rows of S.

4. Find the new document vector coordinates in this reduced k-dimensional space. A row of V holds the eigenvector values; these are the coordinates of the individual document vectors.

5. Find the new query vector coordinates in the reduced k-dimensional space: q' = q^T U_k S_k^-1.

6. Rank documents in decreasing order of query-document cosine similarities.

C. JAMA

JAMA is a basic linear algebra package for Java. It provides user-level classes

for constructing and manipulating real, dense matrices. It is meant to provide

sufficient functionality for routine problems, packaged in a way that is natural and


understandable to non-experts. JAMA comprises six Java classes: Matrix, CholeskyDecomposition, LUDecomposition, QRDecomposition, SingularValueDecomposition and EigenvalueDecomposition.

The corpus consists of 180 Hindi and 180 English parallel documents. For this purpose, every paragraph is taken as a single document, as shown in Table 3.2.

Table 3.2: Example Document

English Document:
India & the World: India's foreign policy seeks to safeguard the country's enlightened self-interest. The primary objective of India's foreign policy is to promote and maintain a peaceful and stable external environment in which the domestic tasks of inclusive economic development and poverty alleviation can progress rapidly and without obstacles. Given the high priority attached by the Government of India to socio-economic development, India has a vital stake in a supportive external environment both in our region and globally.

Hindi Document:
भारत और विश्व: भारत की विदेश नीति में देश के विवेकपूर्ण स्व-हित की रक्षा करने पर बल दिया जाता है। भारत की विदेश नीति का प्राथमिक उद्देश्य शांतिपूर्ण स्थिर बाहरी परिवेश को बढ़ावा देना और उसे बनाए रखना है, जिसमें समग्र आर्थिक और गरीबी उन्मूलन के घरेलू लक्ष्यों को तेजी से और बाधाओं से मुक्त माहौल में आगे बढ़ाया जा सके। सरकार द्वारा सामाजिक-आर्थिक विकास को उच्च प्राथमिकता दिए जाने को देखते हुए, क्षेत्रीय और वैश्विक दोनों ही स्तरों पर सहयोगपूर्ण बाहरी वातावरण कायम करने में भारत की महत्वपूर्ण भूमिका है।

The Porter stemmer has been used for stemming the English documents. For the Hindi documents, manual stemming is performed. After stemming, the stop word list is as shown in Table 3.3 below.

Table 3.3: Top 20 Stop Word List

English               Hindi
Word      Count       Word      Count
The       969         में        550
Of        577         और        445
And       483         की        378
In        389         को        241
To        337         का        215
A         202         लिए       166
For       161         से        165
With      111         ने        124
Is        108         एक        110
On        105         किया      108
As        102         पर        105
By        100         है        95
Was       73          करने      75
From      63          साथ       72
Has       63          इस        69
Also      57          भी        67
At        56          द्वारा     66
An        43          यह        51

As mentioned earlier, each paragraph is taken as an individual document. The paragraphs have been mapped to their cross-lingual mates in the MySQL database server, so in total 360 document pairs are created. In the document relation, each paragraph is represented as a single document.


The paragraphs which are present in the same document pair are semantically equal. 180 documents are used for training the system and 180 documents for testing it. The document sets are shown in Table 3.4 below.

Table 3.4: Corpus Overview

Set                    Number of Documents
                       English     Hindi
Corpus                 360         360
Training set           180         180
Hindi Test Set         0           180
English Test Set       180         0

After training the system, the documents in the Hindi test set are submitted as queries to the system to find their cross-language mates. Cosine similarity is used to find the similarity among the documents.

D. SIMILARITY OF THE DOCUMENTS

In CLIR systems, the similarity of two documents is measured with similarity metrics. Cosine similarity is the most commonly used similarity metric for finding the similarity of two documents. When the documents are represented in a two-dimensional space, the angle θ between two documents defines the similarity of these documents: as the angle θ decreases, the similarity of X and Y increases.

The formula for computing cosine similarity values is given as

Sim(q, d) = (q · d) / (|q| |d|)

where q = query vector and d = document vector.

Figure 3.3: Cosine similarity

After computing the similarities, the documents for which the cosine value is near 1 are retrieved, because cos(0) = 1; documents which are similar thus have an angle close to 0 between them.
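The cosine formula above translates directly into a few lines of Java. This is only a sketch for dense vectors, with the example values chosen for illustration.

// Cosine similarity between two dense vectors: q.d / (|q| |d|)
public class CosineSimilarity {

    static double sim(double[] q, double[] d) {
        double dot = 0.0, normQ = 0.0, normD = 0.0;
        for (int i = 0; i < q.length; i++) {
            dot   += q[i] * d[i];
            normQ += q[i] * q[i];
            normD += d[i] * d[i];
        }
        return dot / (Math.sqrt(normQ) * Math.sqrt(normD));
    }

    public static void main(String[] args) {
        System.out.println(sim(new double[] {1, 0}, new double[] {1, 0})); // 1.0 (angle 0)
        System.out.println(sim(new double[] {1, 0}, new double[] {0, 1})); // 0.0 (angle 90 degrees)
    }
}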

The system is also tested with different ranks (k values). The results for each k value are shown in Table 3.5.


Table 3.5: Cross Language Mate Retrieval Results

The performance of the system is evaluated by checking whether the mate of the query document is found in the retrieval result. After submission of a query, the retrieval results are ranked according to their similarity to the query document. The 180 test documents are given one by one as queries, and the system is expected to find each one's mate in the query results. A query is counted as successful if the mate of the query document appears in the first 10 ranked retrieval results. The above table shows the number of successful queries according to the rank order of the mate document. For example, in the k = 40 experiment, the mate of the query document is obtained at the first rank for 40 documents. The first row in the table shows the results of CLIR when a direct match between documents is made, i.e. when neither LSI nor TF-IDF is used. The table also shows that using TF-IDF and LSI increases the query performance by approximately three times compared with direct matching. Table 3.5 also shows that as the k value increases the results improve, but the computation time is increased.
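The evaluation loop described above can be sketched as follows. The Retriever interface and the retrieve() method are hypothetical placeholders standing in for the LSI ranking step, which is assumed to return document identifiers ordered by decreasing cosine similarity.

import java.util.List;

public class MateRetrievalEvaluation {

    interface Retriever {
        List<Integer> retrieve(int queryDocId);   // ranked result list for one query document
    }

    // Counts the queries whose cross-language mate appears in the first 10 ranked results.
    static int successfulQueries(Retriever system, int[] queryIds, int[] mateIds) {
        int successes = 0;
        for (int i = 0; i < queryIds.length; i++) {
            List<Integer> ranked = system.retrieve(queryIds[i]);
            int cutoff = Math.min(10, ranked.size());
            if (ranked.subList(0, cutoff).contains(mateIds[i])) {
                successes++;
            }
        }
        return successes;   // out of queryIds.length test queries
    }
}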


3.6 SUMMARY

An index is a critical data structure which allows fast searching over large volumes of data. It holds a set of keywords, and these index terms are also useful for ranking documents using some measure of relevance. Indexing is straightforward for a single language, but for multilingual retrieval it is quite difficult. There are traditional methods such as dictionary-based query translation, which translates the query into the target language and searches the index. However, this method suffers from problems such as polysemy and synonymy, which can be addressed by latent semantic indexing, since it does not need to perform query translation.

This experiment mainly focused on improving Hindi-English cross language information retrieval using latent semantic indexing. For this, a parallel corpus has been created from the India.gov.in web site and singular value decomposition has been performed to build a CLIR system. The tests performed on the proposed system showed that latent semantic indexing improves the results approximately three-fold compared with the direct matching method. It is also observed that as the value of k increases, performance improves, but the complexity and computation time also increase.