Search Approach - ES, GraphDB

Search, Mining

Sunita Shrivastava [email protected]

Abstract

This is a comparison of search technologies for Code Lens, CEG, Tech Debt. The initial focus is on understanding the relationship between inverted index search technologies like ES and Graph DB approaches like Titan.


Table of Contents

1. Making the ES deployment available to other VSO Services

2. Beyond ALM Search

2.1 Scenarios

2.1.1 Test Results Search and Reporting

2.1.2 Code Lens

2.1.3 Code Connect

2.1.4 Semantic Search

2.1.5 Code Map

2.1.6 Tech Debt

2.2 Overview of Indexing Technologies for Search

2.2.1 Elastic Search

2.2.2 Graph Databases

2.2.3 Titan and Elastic Search

3. Appendix

ES as a Search Platform

Extensibility:

Leverages Lucene Features

HA and Scale-out

Aggregation

WHAT ES IS NOT DESIGNED FOR

Search Platform Vs Reporting Platform Vs Analytics Platform

ALM Search Platform


1. Making the ES deployment available to other VSO Services

A well-designed search platform will go a long way towards supporting the growing needs around data crunching, quick data searching and, finally, reporting.

Apart from ALM Search, here is what I have come up with so far as candidates that can leverage ES for indexing and analysis.

1) Test Results
2) Perhaps query of test data related to certain releases in Release Management
3) Internal Telemetry
4) Log Analysis [TBD]
5) Code Lens
6) Potentially Load Testing

The following spec describes, at a high level, the plausible options for an overall architecture that encompasses both ALM Search and Test Result indexing. It was deemed that a shared Elastic Search cluster is a must at a minimum.

https://microsoft.sharepoint.com/teams/DPT/_layouts/15/WopiFrame.aspx?sourcedoc={B0D71992-E9C8-428F-BA72-C3840C45D656}&file=Indexing%20for%20VSO%20Services.pptx&action=default

Here is a link to a summary of the proposal to enable the same.

With multiple indexing pipelines, in different services, feeding into the same ES cluster, issues around throttling might arise. More experiments are required to firm up this proposal.

In the context of this analysis, we investigated whether the query and ingestion paths into an ES cluster can be separated. We also investigated how to secure an ES cluster so that multiple VSO services can access it safely.

TBD: Attach the mail/notes with Pradeep and Sean.

TBD: Apart from basic VSO artefact search, what are the other needs for Search in DevDiv? A better understanding of those will help us flesh out the requirements around the Search Platform for DevDiv.

2. Beyond ALM Search

This is an attempt to better understand the relationship between technologies like ES, which is essentially an inverted index, and Titan, which is a graph database.

2.1 Scenarios

We will use the Semantic Search, Code Lens and Code Connect scenarios to explore, analyze and understand this space better.


2.1.1 Test Results Search and Reporting

2.1.2 Code Lens

The Code Lens Service crawls changelists (commits) and work items. Eventually, for each file, it builds a document that provides file-level and function-level information. When a file is opened in a VS client or code browser, the client fetches the file-level information and decorates the file with the relevant metadata. The indexing technology used by Code Lens is essentially SQL.

Interestingly, the same data could have been stored in an Elastic Search Index or be served by a Graph Database, akin to what the CEG team has proposed.
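To make the per-file document concrete, here is a minimal sketch of what such a document might look like. The field names, file path and values are illustrative assumptions, not the actual Code Lens schema; the point is only that file-level and function-level information nest naturally in one JSON-style document.

```python
import json

# Hypothetical per-file document. Every field name and value here is an
# illustrative assumption, not the real Code Lens schema.
file_doc = {
    "path": "src/billing/invoice.cs",
    "last_changeset": 1042,
    "authors": ["alice", "bob"],
    "work_items": [311, 298],
    "functions": [
        {"name": "ComputeTotal", "last_changeset": 1042,
         "authors": ["alice"], "work_items": [311]},
        {"name": "ApplyDiscount", "last_changeset": 987,
         "authors": ["bob"], "work_items": [298]},
    ],
}


def function_info(doc, name):
    """Return the function-level entry a client could use to decorate one function."""
    for fn in doc["functions"]:
        if fn["name"] == name:
            return fn
    return None


# The same structure serializes directly to JSON, which is the document
# format a store like ES accepts.
serialized = json.dumps(file_doc, sort_keys=True)
```

A document shaped like this could be stored as a row in SQL, as a JSON document in an index, or decomposed into vertices and edges in a graph, which is what makes the comparison in this section possible.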

The key questions are:

1) What would be the benefit of storing Code Lens data in ES? Candidates are a flexible document format, schema-less storage and support for change.

2) What are the benefits of storing it in a graph database akin to what the CEG team is proposing to do for semantic search?

Hopefully, this attempt will yield a better understanding of graph databases.

2.1.3 Code Connect

Another use case that we will explore here is Code Connect, a hackathon project with a social focus.

2.1.4 Semantic Search

Currently, the Code Entity Graph project is implementing Semantic Search support. Semantic search at a function level answers queries about references to a function (for example, who the callers of a given function are).

The Code Entity Graph Team’s proposal is to use Titan to build relationship graphs between different entities like code (functions), people, work items etc over a period of time.

Titan is a graph database.
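The "who calls this function" query can be sketched with a small in-memory call graph. This is only an illustration of the idea, not the Code Entity Graph design or Titan's API; the class and function names are hypothetical.

```python
from collections import defaultdict


class CallGraph:
    """Toy call graph that stores each edge in both directions."""

    def __init__(self):
        self._callees = defaultdict(set)  # caller -> functions it calls
        self._callers = defaultdict(set)  # callee -> functions that call it

    def add_call(self, caller, callee):
        # Keeping the reverse edge makes "who calls X" a direct lookup
        # instead of a scan over every function in the code base.
        self._callees[caller].add(callee)
        self._callers[callee].add(caller)

    def callers_of(self, function):
        return sorted(self._callers.get(function, ()))


graph = CallGraph()
graph.add_call("Checkout", "ComputeTotal")
graph.add_call("PrintInvoice", "ComputeTotal")
graph.add_call("ComputeTotal", "ApplyDiscount")
```

Here `graph.callers_of("ComputeTotal")` returns the two functions that reference it. A graph database generalizes the same pattern to other entity types, such as people and work items, and to edges added over time.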

2.1.5 Code Map

2.1.6 Tech Debt

2.2 Overview of Indexing Technologies for Search

The following are the axes along which we would like a better understanding of technologies like SQL, Elastic Search and graph databases.

1. What it does best
2. What it is not designed for
3. Scale characteristics


2.2.1 Elastic Search

An Elastic Search index is an inverted index designed for search. It builds an inverted index for each field of a document, so that every field can be searched efficiently. This is where it differs from SQL, where building an index for every column tends to be fairly expensive. Elastic Search is also schema-less.
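The idea of a per-field inverted index can be shown with a toy implementation. This is a simplified sketch, not how ES or Lucene are implemented (there is no analysis chain, scoring or segment structure); the class name and sample documents are assumptions made for illustration.

```python
from collections import defaultdict


class FieldedInvertedIndex:
    """Toy inverted index: maps (field, term) to the ids of matching documents."""

    def __init__(self):
        self._postings = defaultdict(set)

    def index(self, doc_id, doc):
        # Every field is indexed, and no schema is declared up front:
        # a document may introduce a field that earlier documents lacked.
        for field, value in doc.items():
            for term in str(value).lower().split():
                self._postings[(field, term)].add(doc_id)

    def search(self, field, term):
        return sorted(self._postings.get((field, term.lower()), ()))


index = FieldedInvertedIndex()
index.index(1, {"title": "Fix login bug", "author": "alice"})
index.index(2, {"title": "Login page redesign", "author": "bob", "priority": 2})
```

Searching `title` for `login` finds both documents, while the `priority` field is searchable even though only the second document carries it.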

2.2.2 Graph Databases

By definition, a graph database can discover the set of nodes adjacent to a given node in constant time, that is, without the cost growing with the overall size of the graph.

http://en.wikipedia.org/wiki/Graph_database has an excellent overview of graph databases, different providers and their comparison.

Titan is essentially a graph database. It works with storage backends like HBase and Cassandra. Graph databases usually start from the assumption that the exact kinds of relationships are not fully known upfront.

http://vschart.com/compare/titan-database/vs/elasticsearch

2.2.3 Titan and Elastic Search

Titan recommends using an external index for numeric range, full-text or geo-spatial indexing. An external index like ES can also speed up order-by queries. For exact match retrieval, the standard index suffices.
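The division of labor between the two kinds of index can be illustrated with in-memory stand-ins. This sketch does not use Titan's or ES's APIs; `ExactIndex`, `SortedIndex` and the sample data are hypothetical. It only shows why a hash-style index is enough for exact matches, while range and ordered retrieval want a structure that keeps values sorted.

```python
import bisect


class ExactIndex:
    """Stand-in for a standard index: value -> vertex ids, exact match only."""

    def __init__(self):
        self._by_value = {}

    def add(self, vertex_id, value):
        self._by_value.setdefault(value, []).append(vertex_id)

    def lookup(self, value):
        return list(self._by_value.get(value, []))


class SortedIndex:
    """Stand-in for an external index: keeps (value, vertex id) pairs sorted."""

    def __init__(self):
        self._entries = []

    def add(self, vertex_id, value):
        bisect.insort(self._entries, (value, vertex_id))

    def range(self, low, high):
        """Vertex ids with low <= value <= high, in ascending value order."""
        start = bisect.bisect_left(self._entries, (low,))
        result = []
        for value, vertex_id in self._entries[start:]:
            if value > high:
                break
            result.append(vertex_id)
        return result


# Hypothetical data: number of changes recorded per file.
changes = {"a.cs": 3, "b.cs": 12, "c.cs": 7}
by_name = ExactIndex()
by_changes = SortedIndex()
for name, count in changes.items():
    by_name.add(name, name)
    by_changes.add(name, count)
```

An exact lookup such as `by_name.lookup("b.cs")` needs no ordering at all, whereas `by_changes.range(5, 20)` returns the matching files already ordered by value, which is the kind of query an external index is recommended for.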

https://github.com/thinkaurelius/titan/wiki/Indexing-Backend-Overview

https://github.com/thinkaurelius/titan/wiki/Using-Elastic-Search

http://stackoverflow.com/questions/18191737/how-to-use-elasticsearch-index-in-titan-gremlin-query

3. Appendix

ES as a Search Platform

From the perspective of providing the underpinnings of a search platform, the key features of ES that stand out are the following.

Extensibility:

These are the mechanisms ES provides to content owners to build the kind of index they want, with knobs to control the features they want to use.

Some of the extension points that we have used are:

1) Analyzer Plugin
2) Query Highlighter
3) Query Filter

TBD: Add details.


Leverages Lucene Features

The indexing mechanism is backed by a transaction log (in ES, the translog that sits in front of Lucene commits), which reduces loss of data in the face of failures. It is built on a segment and thread pool model: an index actually comprises multiple smaller index parts (segments), so that each may be searched in parallel. Lucene is highly optimized to do I/O in bulk by using buffering techniques. If desired, it also allows a two-phase commit of indexed documents. This means that, by writing a basic coordinator, you can commit a change to an index in the context of a transaction against a SQL DB.
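A basic coordinator of this kind can be sketched as follows. Lucene's `IndexWriter` exposes `prepareCommit()`, `commit()` and `rollback()`; the `StagedIndex` class below is a hypothetical in-memory stand-in for such a writer, and SQLite stands in for the SQL DB. The table name and helper function are assumptions made for illustration.

```python
import sqlite3


class StagedIndex:
    """Hypothetical stand-in for an index writer with a two-phase commit."""

    def __init__(self):
        self.committed = []   # documents visible to searches
        self._pending = []    # documents added but not yet committed
        self._prepared = False

    def add(self, doc):
        self._pending.append(doc)

    def prepare_commit(self):
        # Phase 1: a real writer would flush pending changes to disk here
        # without making them visible yet.
        self._prepared = True

    def commit(self):
        # Phase 2: publish the prepared changes.
        if not self._prepared:
            raise RuntimeError("prepare_commit() must be called first")
        self.committed.extend(self._pending)
        self._pending = []
        self._prepared = False

    def rollback(self):
        self._pending = []
        self._prepared = False


def index_with_sql(conn, index, doc_id, text):
    """Coordinator: record the document in SQL and in the index, or in neither."""
    try:
        conn.execute("INSERT INTO indexed_docs (id) VALUES (?)", (doc_id,))
        index.add({"id": doc_id, "text": text})
        index.prepare_commit()  # phase 1 on the index side
        conn.commit()           # the SQL transaction decides the outcome
        index.commit()          # phase 2: expected to succeed once prepared
        return True
    except Exception:
        conn.rollback()
        index.rollback()
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE indexed_docs (id INTEGER PRIMARY KEY)")
conn.commit()
staged = StagedIndex()
first = index_with_sql(conn, staged, 1, "hello")
duplicate = index_with_sql(conn, staged, 1, "hello again")  # violates the primary key
```

The second call fails on the SQL side, so neither store changes. A production coordinator would also need to handle a crash between the SQL commit and the final index commit, which this sketch does not attempt.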

HA and Scale-out

ES essentially provides a highly available index with support for scale-out. HA is achieved through replica shards. Each shard is a Lucene index. ES monitors for failures and elects a new primary when necessary. Replica shards also help with read scale-out. Significant among the scale-out features are support for bulk indexing, the ability to move a shard to a differently sized node, and the ability to use an alias to access multiple indexes. When the tenant data grows, another index may be added.

So, say your search query for an account runs against the index alias I.

Initially, I is mapped to I1. Once I1 reaches its limit, based on the number of shards it was allocated, you can index new repositories into I2 and change the alias mapping of I to I1, I2.
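The alias scheme described above can be sketched with a small in-memory table plus the request body that the ES `_aliases` endpoint accepts for adding an index to an alias. The `AliasTable` class is a hypothetical stand-in for the cluster's alias state, and the lowercase names `i`, `i1` and `i2` are used because ES index names must be lowercase.

```python
import json


class AliasTable:
    """Toy stand-in for cluster alias state: alias -> list of backing indexes."""

    def __init__(self):
        self._aliases = {}

    def add(self, alias, index):
        targets = self._aliases.setdefault(alias, [])
        if index not in targets:
            targets.append(index)

    def resolve(self, alias):
        # A search against the alias fans out to every index it maps to.
        return list(self._aliases.get(alias, []))


def add_alias_request(alias, index):
    """Build the JSON body for a POST to the _aliases endpoint."""
    return {"actions": [{"add": {"index": index, "alias": alias}}]}


aliases = AliasTable()
aliases.add("i", "i1")            # initially the alias maps to i1 only
before = aliases.resolve("i")

aliases.add("i", "i2")            # i1 is full, so new repositories go to i2
after = aliases.resolve("i")
body = json.dumps(add_alias_request("i", "i2"))
```

Queries keep targeting the alias throughout, so callers do not need to know that a second index was added behind it.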

The ability to separate out ingestion and query nodes and the capability to do load balancing across those also allows a single ES cluster to handle multiple indexes fairly easily.

These are described in detail in the following spec.

https://microsoft.sharepoint.com/teams/DPT/_layouts/15/WopiFrame.aspx?sourcedoc={7F951956-9215-4221-9E08-BCA0B8A7503B}&file=Index%20Provision%20Schemes%20for%20multi%20tenant.docx&action=default

TBD: Above Spec needs cleanup

Aggregation

It is equally important to understand what ES is not designed for. This helps clarify how other technologies, such as Cosmos, Hadoop, Gremlin or even SQL, can work alongside ES to cover those gaps. [TBD: more thought on this is required]

WHAT ES IS NOT DESIGNED FOR


Search Platform Vs Reporting Platform Vs Analytics Platform

A search engine like ES that supports facets needs fairly powerful aggregation capabilities. This makes it a natural candidate for data analytics, since many analytics workloads are concerned with aggregations over a given axis, be it time, location or category. Kibana is a data visualization tool built on top of Elastic Search.
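Aggregation over an axis amounts to bucketing documents by a field and counting each bucket. The sketch below does this in plain Python over a few made-up test-run records; it imitates the shape of a category (terms) bucket and a per-day time bucket, and does not call ES. The record fields and values are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical test-run records; fields and values are illustrative only.
test_runs = [
    {"timestamp": "2015-03-01T09:15:00", "outcome": "passed"},
    {"timestamp": "2015-03-01T17:40:00", "outcome": "failed"},
    {"timestamp": "2015-03-02T08:05:00", "outcome": "passed"},
    {"timestamp": "2015-03-02T11:30:00", "outcome": "passed"},
]


def terms_aggregation(docs, field):
    """Count documents per distinct value of a field (the category axis)."""
    return dict(Counter(doc[field] for doc in docs))


def daily_histogram(docs, field):
    """Count documents per calendar day of a timestamp field (the time axis)."""
    days = (datetime.fromisoformat(doc[field]).date().isoformat() for doc in docs)
    return dict(Counter(days))


by_outcome = terms_aggregation(test_runs, "outcome")
by_day = daily_histogram(test_runs, "timestamp")
```

A dashboard tool then only has to render such bucket counts as charts, which is why a search engine with aggregation support can double as an analytics backend for this class of workload.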

TBD: Evaluation of Kibana over analytics frameworks used in devdiv, if any.

My read is that we are well behind in evaluating service telemetry for VSO services. It would be interesting to compare the solutions in place today with an ES-based solution built on Logstash and Kibana.
