Improving VIVO search results through Semantic Ranking. Anup Sawant Deepak Konidena


Page 1: Improving VIVO search through semantic ranking

Improving VIVO search results through Semantic Ranking.

Anup Sawant, Deepak Konidena

Page 2: Improving VIVO search through semantic ranking

VIVO Search till Release 1.2.1

– Lucene keyword-based search.
– Score based on textual relevance.
– Importance of a node was not taken into consideration.
– Additional data that describes a relationship was not being searched.

Page 3: Improving VIVO search through semantic ranking

Adding knowledge from semantic relationships

– VIVO 1.2 Search held only limited information about an individual in the index. This led people to ask questions like:

“Hey, I work for "USDA" and when I search for "USDA", my profile doesn't show up in the search results, and vice versa.”

“Hey, information related to my educational background, awards, the roles I assumed, etc. that appears on my profile doesn't show up in the search results when I search for it individually or when I search for my name.”

Page 4: Improving VIVO search through semantic ranking

Lucene field for an Individual.

And here's why

Page 5: Improving VIVO search through semantic ranking

Intermediate nodes were overlooked.

– Traditionally, semantic relationships of an Individual, such as Roles, Educational Training, Awards, and Authorship, were not stored in the index.

– Individuals were connected to these properties through intermediate nodes called "Context Nodes", and the information hidden beyond these context nodes was not captured.

Page 6: Improving VIVO search through semantic ranking

What does the semantic graph look like when context nodes are present?

Page 7: Improving VIVO search through semantic ranking

VIVO Search in 1.3

– Transition from Lucene to SOLR.
– Provides a base for distributed search capabilities.
– Individuals enriched by descriptions of semantic relationships.
– Score enhanced by Individual connectivity.
– Improved precision and recall of search results.

Page 8: Improving VIVO search through semantic ranking

Enriching Individuals with Semantic Relations.

– For 1.3, individuals are enriched with information from their semantic relations, i.e., the information hidden behind context nodes.

– SPARQL queries are fired at index time to capture this information.

– Result: improvement in the overall placement of search results. Relevant results float up.
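The enrichment step can be illustrated with a minimal sketch. It walks a toy in-memory triple list in plain Python as a stand-in for the SPARQL queries mentioned above; the node names, the predicate names (`label`, `type`, `hasRole`, `roleIn`) and the `ContextNode` marker are all hypothetical and are not taken from the VIVO ontology or code base.

```python
# Toy triple store (subject, predicate, object). All names are hypothetical.
TRIPLES = [
    ("person1", "label", "Jane Doe"),
    ("person1", "hasRole", "role1"),
    ("role1", "type", "ContextNode"),
    ("role1", "label", "Researcher"),
    ("role1", "roleIn", "org1"),
    ("org1", "label", "USDA"),
]


def labels_of(node):
    """Return every label attached directly to a node."""
    return [o for s, p, o in TRIPLES if s == node and p == "label"]


def is_context_node(node):
    """A node counts as a context node if it is typed as one."""
    return (node, "type", "ContextNode") in TRIPLES


def neighbours(node):
    """Nodes reachable from `node` through any non-label, non-type predicate."""
    return [o for s, p, o in TRIPLES if s == node and p not in ("label", "type")]


def enriched_text(individual):
    """Collect the individual's own labels plus the text behind its context nodes."""
    text = labels_of(individual)
    for ctx in neighbours(individual):
        if not is_context_node(ctx):
            continue
        text += labels_of(ctx)           # text on the context node itself
        for far in neighbours(ctx):      # text one hop beyond the context node
            text += labels_of(far)
    return text


print(enriched_text("person1"))
```

With this toy data the text indexed for `person1` now includes "USDA", which is the kind of match the questions on the earlier slide were asking for.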

Page 9: Improving VIVO search through semantic ranking

Influence of PageRank

– Introduced by Larry Page & Sergey Brin.

– Every node relies on every other node for its ranking.

– Intuitive understanding: Node importance is calculated based on incoming connections and contribution of highly ranked important nodes.
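As a reminder of the idea, here is a minimal sketch of PageRank by power iteration on a toy directed graph. The graph, the damping factor of 0.85 and the iteration count are assumptions for illustration; the slides do not say that VIVO runs PageRank itself, only that the ranking parameters are inspired by it.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank. `links` maps a node to the nodes it points to."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # A node shares its rank equally among the nodes it links to.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: a, b and d all point to c; c points back to a.
toy_links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_links)
for node, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(node, round(score, 3))
```

Node `c` ends up on top because it has the most incoming links, and `a` ranks above `b` and `d` because its single incoming link comes from the highly ranked `c`.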

Page 10: Improving VIVO search through semantic ranking

Some parameters based on PageRank

• β
– Number of nodes connected to a particular node.
– Intuition: a node probably deserves a high rank because it is connected to a lot of individuals.
• Φ
– Average over the β values of all the nodes to which a node is connected.
– Intuition: a node probably deserves a high rank because it is connected to some important individuals.
• Γ
– Average strength of uniqueness of the properties through which a node is connected.
– Intuition: a node probably deserves a high rank based on the strength of its connections to other nodes.
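A small sketch of how β and Φ could be computed from the definitions above. The edge list is made up, and treating connections as undirected is an assumption; Γ is left out because the slides do not define how the uniqueness of a property is measured.

```python
from collections import defaultdict

# Hypothetical connections between nodes, treated as undirected.
EDGES = [
    ("alice", "paper1"),
    ("bob", "paper1"),
    ("carol", "paper1"),
    ("alice", "lab"),
]


def build_neighbours(edges):
    """Map every node to the set of nodes it is connected to."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return neighbours


def beta(neighbours, node):
    """β: number of nodes connected to `node`."""
    return len(neighbours[node])


def phi(neighbours, node):
    """Φ: average β over the nodes that `node` is connected to."""
    connected = neighbours[node]
    if not connected:
        return 0.0
    return sum(beta(neighbours, other) for other in connected) / len(connected)


nb = build_neighbours(EDGES)
for name in ("alice", "bob", "paper1"):
    print(name, "beta =", beta(nb, name), "phi =", phi(nb, name))
```

Here `bob` has a low β (one connection) but a high Φ, because his single connection is to the well-connected `paper1`.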

Page 11: Improving VIVO search through semantic ranking

Search Index Architecture:

[Figure: architecture diagram with an indexing phase and a searching phase. Labels: SPARQL; enriching with semantic relations; overall connectivity of an Individual (β); proper boosts; multithreaded; Apache Solr; Dismax query handler; relevant documents.]

Page 12: Improving VIVO search through semantic ranking

Real-time Indexing:

[Figure: the same architecture diagram, triggered by ADD/EDIT/DELETE of an Individual or its properties. Labels: SPARQL; enriching with semantic relations; overall connectivity of an Individual (β); proper boosts; multithreaded; Apache Solr; Dismax query handler; relevant documents.]

The changes occur in real time and propagate beyond intermediate nodes.

Page 13: Improving VIVO search through semantic ranking

Cluster Analysis of Search Results

• Intuition
– Assume search results from Release 1.2.1 and Release 1.3 are two different clusters.
• Expectation
– Results from Release 1.3 should have their mean vector close to the query vector.
• Results
– Text-to-vector conversion using the ‘bag of words’ technique.
– Tanimoto distance measure used.
– Code at: https://github.com/anupsavvy/Cluster_Analysis

Query               Distance from mean vector     Distance from mean vector
                    of Release 1.2.1              of Release 1.3
Scripps             0.27286328362357193           0.004277746256068157
Paulson James       0.009907336493786136          0.004650133621323327
Genome Sequencing   9.185463752863598E-4          8.154498815206635E-4
Kenny Paul          0.007610235640599918          0.003984303949283425
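The measurement can be sketched as follows. This is not the code from the repository linked above; it is a minimal illustration that assumes whitespace tokenisation, raw term counts as weights, and the real-valued form of the Tanimoto coefficient, T(a, b) = a·b / (|a|² + |b|² − a·b), with distance 1 − T. The example documents are invented.

```python
from collections import Counter


def bag_of_words(text):
    """Turn text into a sparse term-count vector."""
    return Counter(text.lower().split())


def mean_vector(vectors):
    """Component-wise mean of a list of sparse vectors."""
    total = Counter()
    for vector in vectors:
        total.update(vector)
    return {term: count / len(vectors) for term, count in total.items()}


def tanimoto_distance(a, b):
    """1 - Tanimoto coefficient for two sparse real-valued vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = sum(weight * weight for weight in a.values())
    norm_b = sum(weight * weight for weight in b.values())
    denominator = norm_a + norm_b - dot
    return 1.0 - dot / denominator if denominator else 0.0


# Hypothetical result sets for the same query from two releases.
query = bag_of_words("scripps institute")
old_results = [bag_of_words(t) for t in ("cornell university", "scripps florida")]
new_results = [bag_of_words(t) for t in ("scripps institute", "scripps research institute")]

distance_old = tanimoto_distance(query, mean_vector(old_results))
distance_new = tanimoto_distance(query, mean_vector(new_results))
print(distance_old, distance_new)
```

A smaller distance means the cluster's mean vector sits closer to the query, which is how the table above is read.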

Page 14: Improving VIVO search through semantic ranking

Understanding how it happens ..

[Figure: search results R1, R2, R3, R4, R5, ... shown alongside their text fields (name, location, description, research, articles).]

Page 15: Improving VIVO search through semantic ranking

Understanding how it happens ..

[Figure: term-by-document matrix. Rows are terms (scripps, loring, jeanne, institute, cornell, florida, ...); columns are the query Q and the results R1, R2, R3, ...]

Page 16: Improving VIVO search through semantic ranking

Understanding how it happens ..

[Figure: vectors V1 and V2 plotted on the term axes institute, cornell, and loring, separated by angle θ, with the Euclidean distance and the cosine distance between them marked.]

Page 17: Improving VIVO search through semantic ranking

Understanding how it happens ..

[Figure: vectors V1 and V2 and the angle θ on the term axes institute, cornell, and loring.]

Euclidean distance increases, cosine distance remains the same.
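A short numeric check of this point, using made-up three-component vectors for the axes institute, cornell, and loring. Stretching one vector along its own direction changes the Euclidean distance but leaves the angle, and therefore the cosine distance, unchanged.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cos(theta), where theta is the angle between the two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical term weights on the axes (institute, cornell, loring).
v1 = (1.0, 2.0, 0.0)
v2 = (2.0, 1.0, 1.0)
v2_longer = tuple(3.0 * x for x in v2)  # same direction as v2, three times as long

print(euclidean_distance(v1, v2), euclidean_distance(v1, v2_longer))
print(cosine_distance(v1, v2), cosine_distance(v1, v2_longer))
```

This is why a direction-based measure suits bag-of-words vectors: a long document that repeats the same terms should not look farther from the query than a short one.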

Page 18: Improving VIVO search through semantic ranking

Query vector distance from Cluster Mean vectors

Page 19: Improving VIVO search through semantic ranking

User testing for Relevance

Page 20: Improving VIVO search through semantic ranking

Precision and Recall

[Figure: two overlapping sets, Total Relevant and Total Retrieved; X is their intersection, i.e. the retrieved results that are relevant.]

Precision = X / (Total Retrieved)

Recall = X / (Total Relevant)
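The two formulas translate directly into set operations. The document ids below are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two sets of document ids."""
    x = len(retrieved & relevant)  # X: retrieved results that are relevant
    precision = x / len(retrieved) if retrieved else 0.0
    recall = x / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ids: 4 results retrieved, 5 relevant documents, 2 in common.
retrieved = {1, 2, 3, 4}
relevant = {2, 4, 5, 6, 7}
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```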

Page 21: Improving VIVO search through semantic ranking

Precision-Recall graphs based on User Analysis.

Page 22: Improving VIVO search through semantic ranking

Cluster Analysis for Relevance

Page 23: Improving VIVO search through semantic ranking

Precision-Recall graphs based on Cluster Analysis

Page 24: Improving VIVO search through semantic ranking

Query vector distance from individual search result vectors

Page 25: Improving VIVO search through semantic ranking

Experiments : SOLR

• Search query expansion can be done using the SOLR synonym analyzer.
• Princeton WordNet (http://wordnet.princeton.edu/) is frequently used with the SOLR synonym analyzer.
• A gist by Bradford on GitHub (https://gist.github.com/562776) was used to convert the WordNet flat file into a SOLR-compatible synonyms file.

– Pros
• High recall.
• Documents can be matched to well-known acronyms and words not present in the SOLR index. For instance, a query that has ‘fl’ as one of its terms would retrieve documents related to ‘Florida’ as well.
– Cons
• Documents matching just the synonym part of the query could be ranked higher.
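The effect of synonym expansion can be sketched outside SOLR. The snippet below is a toy illustration only: the synonym map is hypothetical and hand-written, and it imitates neither the SOLR synonyms file format nor the output of the WordNet conversion.

```python
# Hypothetical synonym map; a real setup would load a generated synonyms file.
SYNONYMS = {
    "fl": ["florida"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the query terms with any known synonyms appended after each term."""
    expanded = []
    for term in query.lower().split():
        expanded.append(term)
        expanded.extend(synonyms.get(term, []))
    return expanded


print(expand_query("Scripps FL"))
```

The expanded query matches documents that mention only "florida", which raises recall; it also shows the drawback listed above, since such a document matches on the synonym alone.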

Page 26: Improving VIVO search through semantic ranking

Experiments : SOLR ( cont. )

• A certain degree of spelling-correction-like behaviour can be achieved through the SOLR Phonetic Analyzer.
• The Phonetic Analyzer uses Apache Commons Codec for its phonetic implementations.

– Pros
• High recall.
• Helps in detecting spelling mistakes in the search query. For instance, a query like ‘scrips’ would be accurately matched to the similar-sounding word ‘scripps’, which is actually present in the index. A misspelled name like ‘Polex Frank’ in the query could be matched to the correct name ‘Polleux Franck’.
– Cons
• The number of results matched purely on phonetics could decrease the precision of the engine.
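To see why these misspellings can match, here is a minimal sketch of American Soundex, one of the phonetic encodings available in Apache Commons Codec. The slides do not say which encoder was configured, so Soundex is an assumption, and this standalone Python version is only an illustration of the idea, not the SOLR filter itself.

```python
# Soundex digit for each consonant; vowels, y, h and w have no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return the four-character American Soundex code of a word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    encoded = [first.upper()]
    previous = CODES.get(first, "")
    for letter in letters[1:]:
        if letter in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(letter, "")
        if code and code != previous:
            encoded.append(code)
        previous = code  # a vowel resets this, so a repeated code is kept
    return ("".join(encoded) + "000")[:4]


for misspelled, correct in (("scrips", "scripps"), ("Polex", "Polleux"), ("Frank", "Franck")):
    print(misspelled, soundex(misspelled), correct, soundex(correct))
```

Each misspelling gets the same code as the correct form, so a phonetic field treats them as the same token. Unrelated words can share a code as well, which is the precision cost noted above.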

Page 27: Improving VIVO search through semantic ranking

Experiments : Ontology provides a good base for Factoid Questioning.

• Properties of Individuals give direct reference to the information.

• Natural language techniques and Machine learning algorithms could help us understand the search query better.

• A query like “What is Brian Lowe’s email id?” should probably return just the email id on top, and a query like “Who are the co-authors of Brian Lowe?” should return just the list of Brian Lowe’s co-authors.

• We can train an algorithm to recognise the type of question or search query that has been fired. The Cognitive Computation Group at the University of Illinois at Urbana-Champaign provides a corpus of tagged questions that can be used as a training set. http://cogcomp.cs.illinois.edu/page/resources/data

Page 28: Improving VIVO search through semantic ranking

Experiments : Ontology provides a good base for Factoid Questioning. ( cont. )

• Once the question type is determined, we could grammatically parse the question using the Stanford Lexparser http://nlp.stanford.edu/software/lex-parser.shtml

• The question type helps us know whether we should look for a datatype property or an object property. Lexparser will help us form a SPARQL query.

[Figure: pipeline diagram. Labels: Search Query; Corpora; Kmeans/SVM; Question type; Stanford Lexparser; Terms; SPARQL Query.]

Page 29: Improving VIVO search through semantic ranking

Summary

• Transition from Lucene to SOLR

• Additional information of semantic relationships and interconnectivity in the index.

• More relevant results and better ranking than VIVO 1.2.1

• Improvements in indexing time due to multithreading.

Page 30: Improving VIVO search through semantic ranking

Team Work…