0 CONFIDENTIAL & COPYRIGHT © 2011-19 INNOPLEXUS AG: GERMANY, INDIA, USA S9359 Real-time connection-based filtering to improve the precision of the search engine in life sciences For: NVIDIA GTC 2019: Deep Learning & AI Conference 20 March 2019 From: Vatsal Agarwal Vice President – Technology & Innovation


S9359 Real-time connection-based filtering to improve the precision of the search engine in life sciences

For: NVIDIA GTC 2019: Deep Learning & AI Conference
20 March 2019

From: Vatsal Agarwal
Vice President – Technology & Innovation


AGENDA

01 Problem
• Business problem
• Why is it important?
• Technical challenge

02 Approach
• Connection filtering
• Graph embedding
• Performance optimization

03 Results
• Scope of tests
• Graph embedding
• Real-time search


01 Problem

What is the problem being discussed? Why is it important?


The world needs machines that enable information discovery rather than keyword search

Keyword search
› Finds documents
› Containing query terms
› Known facts
› Google Search

Information discovery
› Collates information
› From multiple documents
› New signals
› Ontosight search


Information discovery becomes more important and more difficult in research fields such as life sciences

The Importance
▪ Research output is only incremental despite exponential money flowing in
▪ Too much information is available, yet it is scattered and not readily consumable
Example:
▪ Find important targets for a rare disease

The Difficulty
▪ Specialized knowledge: NLP was developed for common language use
▪ Size and complexity: vocabulary size; context disambiguation
Example:
▪ APR can mean multiple things

At stake is human life


We at Innoplexus use ontology-based concept search to enable more precise annotations in life sciences

Using synonym-based concept search

How it improves the search precision

Ontology to be continuous and self-learning

Entity extraction as feedback

What is an ontology?

A representation, formal naming, and definition of the categories, properties, and relations between the concepts, data, and entities that substantiate one, many, or all domains


But concept search still lacks contextual understanding of concepts. In comes connection filtering.

What is connection filtering?
Extending the ontology-based search from synonyms to connected concepts

How to use connection filtering
(a) Embeddings built on the ontology: training the corpus using the ontology and leveraging connections
(b) Leverage it for information discovery: calculate the distance of query tokens against each document


NVIDIA GPUs make real-time connection-based filtering possible for information discovery

Solving for real-time usage was not practical for two reasons:

Incredible size
▪ >30 million terms in the life science vocabulary
▪ Over a trillion connections among them
▪ More than 100 million documents need to be indexed at Innoplexus

Space-time complexity
▪ Index all combinations of the search vocabulary: real-time, but with an extremely large space requirement
▪ Calculate the distance of each term from each query token for every document: too many calculations in real time

Hence, we took an intermediate approach: treating all distance calculations as a few matrix multiplications accelerated by a cluster of 2 NVIDIA V100 GPUs


02 Approach

How can you do connection filtering? How to do it in real time?


We start by building an ontology for the domain of life sciences

1. Decide on classes
   E.g. Disease, medicine, gene
2. Each concept forms a node
   E.g. Fever, Combiflam
3. All synonyms and biological information of each node go in node properties
   E.g. Gene location, medicine dosage
4. Biological relations appear as edges
   E.g. Combiflam and fever would have an edge with relation “treats”

Innoplexus’ ontology has over 15 million nodes and over a trillion connections, built using a combination of 200 biological databases and enriched with self-learning algorithms
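The four building steps can be sketched as a small in-memory structure. This is only an illustration under stated assumptions: the class and method names (`Node`, `Ontology`, `resolve`) are hypothetical, the dosage value is a placeholder, and the production ontology is certainly not stored this way.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One concept (step 2), tagged with its class (step 1)."""
    name: str
    node_class: str
    synonyms: set = field(default_factory=set)      # step 3
    properties: dict = field(default_factory=dict)  # step 3


class Ontology:
    """Hypothetical container: nodes plus labelled edges (step 4)."""

    def __init__(self):
        self.nodes = {}  # concept name -> Node
        self.edges = {}  # concept name -> {neighbour name: relation label}

    def add_node(self, node):
        self.nodes[node.name] = node
        self.edges.setdefault(node.name, {})

    def add_edge(self, source, target, relation):
        # Stored in both directions so hop counting can ignore direction.
        self.edges[source][target] = relation
        self.edges[target][source] = relation

    def resolve(self, term):
        """Map a surface term (name or synonym) to its concept name, or None."""
        term = term.lower()
        for node in self.nodes.values():
            names = {node.name.lower()} | {s.lower() for s in node.synonyms}
            if term in names:
                return node.name
        return None


onto = Ontology()
onto.add_node(Node("Fever", "Disease", synonyms={"pyrexia"}))
onto.add_node(Node("Combiflam", "Medicine", properties={"dosage": "placeholder"}))
onto.add_edge("Combiflam", "Fever", "treats")
```

Resolving synonyms to a single node name is what lets a query for "pyrexia" land on the same concept as "Fever" before any distance is computed.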


The prepared ontology can then be used for calculating the distance of each document from the user query

1. User searches for terms (such as Headache, Fever, Combiflam)
2. For each document in the database:
   • All concepts are tagged using the ontology & synonyms
   • Distance is calculated from each query concept to each document concept as the minimum number of network hops needed to reach the document concept from the query concept
   • All distances are aggregated to calculate the net distance of the document to the query
3. All net distances are sorted to find the nearest documents
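A minimal sketch of this procedure, assuming an undirected graph given as adjacency lists. The slides do not say how pair distances are aggregated, so the mean used here, the `unreachable` penalty, and the toy graph are all assumptions for illustration.

```python
from collections import deque


def hop_distances(graph, source):
    """Breadth-first search: minimum hops from source to every reachable concept."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbour in graph.get(current, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[current] + 1
                queue.append(neighbour)
    return dist


def rank_documents(graph, query_concepts, documents, unreachable=10):
    """Return (document id, net distance) pairs, nearest first.

    Aggregation is the mean over all query/document concept pairs;
    this choice and the `unreachable` penalty are assumptions.
    """
    per_query = {q: hop_distances(graph, q) for q in query_concepts}
    scores = []
    for doc_id, doc_concepts in documents.items():
        pair_distances = [
            per_query[q].get(c, unreachable)
            for q in query_concepts
            for c in doc_concepts
        ]
        scores.append((doc_id, sum(pair_distances) / len(pair_distances)))
    return sorted(scores, key=lambda item: item[1])


# Toy graph; the edges are illustrative only.
graph = {
    "Headache": ["Combiflam"],
    "Fever": ["Combiflam", "Infection"],
    "Combiflam": ["Headache", "Fever"],
    "Infection": ["Fever", "Antibiotic"],
    "Antibiotic": ["Infection"],
}
documents = {
    "doc_a": ["Fever", "Combiflam"],  # concepts already tagged per document
    "doc_b": ["Antibiotic"],
}
ranking = rank_documents(graph, ["Headache", "Fever"], documents)
```

One breadth-first search per query concept is enough, because each search yields the hop count to every document concept at once.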


Deploying the ontology graph in hyperbolic space would make distance estimation many times faster

Finding the distance between C1 & C2

I. Get the coordinates of C1 in N-dimensional hyperbolic space

II. Get the coordinates of C2 in the same hyperbolic space

III. Calculate the distance between the two coordinates

[Figure: Illustration of the ontology embedded in 2-dimensional hyperbolic space]
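Step III can be sketched as follows, assuming the embedding uses the Poincaré ball model of hyperbolic space. The slides do not name the model, so that choice and the example coordinates are assumptions. In this model the distance between points u and v inside the unit ball is arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))).

```python
import math


def poincare_distance(u, v):
    """Distance between two points inside the unit ball (Poincare ball model)."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


# Hypothetical 2-dimensional coordinates for two concepts C1 and C2.
c1 = (0.0, 0.0)
c2 = (0.5, 0.0)
d = poincare_distance(c1, c2)
```

Once coordinates are stored, the distance is a closed-form expression of two lookups, with no graph traversal at query time.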


GPUs further speed up the processing by recasting the problem as matrix multiplications

A linear (or even non-linear) combination of the two matrices, A & B, would give the distances of all documents from the query.

Matrix A
• All documents can be one dimension
• All concepts in each document can be another dimension
• Coordinates of the H-space embeddings can be the third dimension

Matrix B
• All concepts in the query can be one dimension
• Coordinates of the H-space embeddings can be another dimension
• The third dimension could be unity to allow matrix processing between A & B
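A rough sketch of the matrix formulation, written with NumPy on the CPU for portability. It assumes Poincaré-ball coordinates, a fixed number of concepts per document, and mean aggregation, none of which the slides specify. Here B is kept two-dimensional and broadcasting plays the role of the unit third dimension. On a GPU, an array library with a NumPy-like interface would be the natural substitute; that swap is an assumption and is not shown.

```python
import numpy as np


def batched_poincare_distances(A, B):
    """A: (docs, concepts_per_doc, dims); B: (query_concepts, dims).

    Returns distances of shape (docs, concepts_per_doc, query_concepts).
    """
    sq_a = np.sum(A * A, axis=-1)[..., None]  # (D, C, 1)
    sq_b = np.sum(B * B, axis=-1)             # (Q,)
    dots = A @ B.T                            # (D, C, Q): the matrix product
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, clipped against rounding error
    sq_diff = np.maximum(sq_a + sq_b - 2.0 * dots, 0.0)
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_a) * (1.0 - sq_b))
    return np.arccosh(arg)


rng = np.random.default_rng(0)
A = rng.uniform(-0.5, 0.5, size=(4, 3, 2))  # 4 documents, 3 concepts each, 2-D
B = rng.uniform(-0.5, 0.5, size=(2, 2))     # 2 query concepts, 2-D
pairwise = batched_poincare_distances(A, B)
doc_distance = pairwise.mean(axis=(1, 2))   # one net distance per document
nearest_first = np.argsort(doc_distance)
```

The only heavy operation is `A @ B.T`; everything else is element-wise, which is why the workload maps well onto GPU hardware.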


Quasi-cyclicity is the enemy of precision & speed when leveraging hyperbolic space embedding

Quasi-cyclicity, i.e. small cycles within the graph, brings in multiplicative distortion¹
▪ Any cyclicity makes it difficult to place a node in hyperbolic space, adding distortion.

Time complexity also varies exponentially
▪ Each cycle takes extra time to calculate the right coordinates with respect to existing nodes and relations

In big graphs, there are a large number of cycles
▪ High distortion in embeddings
▪ Impractically long embedding time
▪ Only tree-like graphs are feasible

¹ Verbeek, Kevin & Suri, Subhash. (2014). Metric Embedding, Hyperbolic Space, and Social Networks. Proceedings of the Annual Symposium on Computational Geometry. 10.1145/2582112.2582139
² HyperE: Hyperbolic Embeddings for Entities. Beliz Gunel, Fred Sala, Albert Gu, Christopher Ré
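The slides do not say how quasi-cyclicity was measured. As one simple proxy for how far a graph is from a tree, the sketch below counts independent cycles (the cyclomatic number, edges minus nodes plus connected components), which is zero exactly when the graph is a forest. The function name and the choice of metric are assumptions.

```python
def independent_cycle_count(edges):
    """Cyclomatic number |E| - |V| + components of an undirected graph."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Deduplicate (a, b) / (b, a) and drop self-loops.
    unique_edges = {frozenset(e) for e in edges if e[0] != e[1]}
    for a, b in (tuple(e) for e in unique_edges):
        parent[find(a)] = find(b)
    components = len({find(n) for n in list(parent)})
    return len(unique_edges) - len(parent) + components


tree = [("a", "b"), ("a", "c"), ("c", "d")]
with_cycle = tree + [("b", "c")]
tree_cycles = independent_cycle_count(tree)
extra_cycles = independent_cycle_count(with_cycle)
```

A single extra edge in a tree adds exactly one independent cycle, so the count grows quickly in densely connected graphs.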


Breaking the ontology graph into tree-like communities minimizes quasi-cyclicity to improve precision & speed

[Figure: Breast cancer community]

We split the huge ontology graph into many sub-graphs (called communities)
• Each community has a tree-like structure with minimal quasi-cyclicity, if any.
• Concepts & relationships (nodes & edges) might be redundant among various sub-graphs to ensure full coverage

One extreme would be to create sub-graphs, one for each node of the main graph.
• Each sub-graph would have one node as root
• It includes all edges to the nodes below
• But it skips any edges to nodes above or at the same level
• This ensures each node has all connected nodes covered in one of the communities
• All communities have zero quasi-cyclicity
• But the negative is that there would be too many communities to be stored & processed in real time
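The per-root construction can be sketched as below, assuming that a node's "level" is its breadth-first depth from the root (the slides do not define levels) and that the graph is given as undirected adjacency lists. The function name and the optional `max_depth` cut-off are hypothetical. Note that in this sketch a node reachable from two parents on the previous level keeps both edges.

```python
from collections import deque


def build_community(graph, root, max_depth=None):
    """Return the downward edges of the sub-graph rooted at `root`.

    An edge is kept only if it leads from a node to a node exactly one
    level below it; same-level and upward edges are skipped.
    """
    level = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if max_depth is not None and level[node] >= max_depth:
            continue  # do not expand beyond the depth limit
        for neighbour in graph.get(node, ()):
            if neighbour not in level:
                level[neighbour] = level[node] + 1
                queue.append(neighbour)
    return [
        (node, neighbour)
        for node in level
        for neighbour in graph.get(node, ())
        if neighbour in level and level[neighbour] == level[node] + 1
    ]


# Abstract toy graph: R is the root, A and B are one hop away, C hangs off A.
graph = {
    "R": ["A", "B"],
    "A": ["R", "B", "C"],
    "B": ["R", "A"],
    "C": ["A"],
}
community = build_community(graph, "R")
shallow = build_community(graph, "R", max_depth=1)
```

In the toy graph the same-level edge between A and B is dropped, which removes the triangle R, A, B from the community.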


Building communities for only selected broad concepts can control the amount of processing needed

A compromise is to make communities for a few broad concepts, e.g. Breast Cancer
• Positive: Controlled number of communities for real-time processing
• Negative: It misses distantly connected nodes
  o For example, a concept X which was connected to concept Y through 10 hops may not be connected in any community
  o Such connections won’t be key in the ultimate document-level distance calculations
• Trade-off: Minimal quasi-cyclicity is allowed in order to cover maximum connections

[Figure: Relationship between average community size and the number of communities]


03 Results

How did we set it up? What did we achieve?


A small but significant portion of the Innoplexus ontology was used to build the proof-of-concept

To build a proof-of-concept, we sampled a connected sub-graph with:
▪ 350K nodes
▪ 5.2M edges
▪ 4 classes: Diseases, Chemical Molecules, Genes & Others

In this graph, there are roughly a million edges across different entity classes, which could form billions of quasi-cycles.


We built about 65 thousand tree-like communities around various disease areas to reduce quasi-cyclicity

1. We can find the distance of each concept from an average of 46k other concepts (in this setup)
2. 87% of concept pairs are not available in any community
   ▪ This is because we made communities for only broad concepts and skipped the cyclic connections
   ▪ These are only far-off concepts, at least 3 hops away

[Figure: Number of concept pairs]
Average community size: 538
No. of communities: 65.5k
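The two headline numbers on this slide look consistent with each other if the 350K-node proof-of-concept graph is taken as the denominator. That reading is an assumption, not something the slides state; the short check below just does the arithmetic.

```python
total_concepts = 350_000        # nodes in the sampled sub-graph
reachable_per_concept = 46_000  # average concepts sharing a community with a given concept

covered = reachable_per_concept / total_concepts
missing = 1.0 - covered
print(f"covered: {covered:.1%}, missing: {missing:.1%}")
```

Under this reading, roughly 13% of pairs are covered, which leaves about 87% of pairs outside every community.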


H-embeddings were generated for all communities in 110 hours using a cluster of 2 Tesla V100 GPUs

Embedding one community with 1000 epochs:
▪ CPU (4-core): over 29 hours (~103.5 seconds per epoch)
▪ Tesla V100 GPU: less than 8 hours (~28 seconds per epoch)

Embedding the whole graph with 65.5K communities:
▪ Cluster of 50 CPUs (4-core): over 1 year (estimated)
▪ Cluster of 2 GPUs (Tesla V100): 4.5 days


This trained setup can now enable real-time connection-based filtered search in life sciences

A search query of 3 concepts generates ~1 million concept pairs (3 x 350k)

• Detecting the relevant community (per million pairs): 6.12 seconds; parallelized over 48 cores, it completes in ~0.14 seconds

• Distance calculation per million pairs:
  - CPU (4-core): 1.03 seconds
  - GPU (Tesla V100): 0.17 seconds

Note: We are using a PCIe link, which is slow in data loading. Much faster speeds are expected with an NVLink setup.

4x: Scaled to our real ontology, searching for each concept is expected to take 4.6s on GPU vs 17.6s on CPU
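One plausible reading of the 4x figure, assuming community detection and distance calculation run one after the other and that the parallelized detection time applies to both the CPU and the GPU path. The slides do not spell this out, so treat the check below as a sanity calculation on the reported timings only.

```python
concepts_in_query = 3
nodes_in_graph = 350_000
pairs = concepts_in_query * nodes_in_graph  # ~1 million concept pairs

community_lookup = 0.14  # seconds per million pairs, parallelized over 48 cores
cpu_distance = 1.03      # seconds per million pairs, 4-core CPU
gpu_distance = 0.17      # seconds per million pairs, Tesla V100

cpu_total = community_lookup + cpu_distance
gpu_total = community_lookup + gpu_distance
speedup = cpu_total / gpu_total
print(f"pairs={pairs:,} cpu={cpu_total:.2f}s gpu={gpu_total:.2f}s speedup={speedup:.1f}x")
```

The end-to-end ratio comes out a little under 4, close to the ratio of the projected 17.6s and 4.6s per-concept times.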


04 Questions?


Come to the talk by Innoplexus CTO Gaurav Tripathi to learn how else GPUs can accelerate drug development

How To Use GPUs For Faster, Better and Cheaper Drug Development

Tomorrow 10:00 AM - 10:50 AM (Thursday, Mar 21)

SJCC Room 220B (Concourse Level)

By Gaurav Tripathi, CTO, Innoplexus AG

Industry Segments: Healthcare & Life Sciences

Technical Level: Business/Executive level


Participate in Innoplexus Online Hiring Hackathon: Saving lives with AI

We invite you to participate in the Innoplexus Online Hackathon to understand and solve a real-world problem from industry. Imagine some of the world’s largest companies facing this problem, throwing a lot of resources at it and you will realize the scale of the same.

Prizes:
• Nvidia Shield Pro
• Echo Plus (2nd gen) with a built-in smart home hub
• Echo - Smart speaker with Alexa

Date: 22-24 March 2019
https://datahack.analyticsvidhya.com/contest/innoplexus-online-hiring-hackathon-saving-lives-wi/


Innoplexus AG: Aspiring European AI Champion

EXPERTISE

300+ highly skilled employees in multidisciplinary teams

GLOBAL

Offices in Germany, India, USA

CLIENTS

20 global clients including many Big Pharma

SECURITY

Global Standards in Data Protection

FOUNDATION

Innoplexus AG was founded at the end of 2015

DATA VOLUME

100s of TBs of business and scientific data

PATENTS

80 Patent applications, another 23 in pipeline

TECHNOLOGY

State of the art technology to build reliable, secure and compliant applications


Contact

www.innoplexus.com

Frankfurt (Germany):
Innoplexus AG
Frankfurter Strasse 63, 65760 Eschborn

Pune (India):
Innoplexus Consulting Services Pvt. Ltd.
7th Floor, Midas Tower
Hinjewadi Phase 1, Pune 57

New Jersey (USA):
Innoplexus Holdings, Inc.
258 Newark Street, Suite 200, Hoboken, NJ 07030

Innoplexus AG offers Data as a Service and Continuous Analytics as a Service products and solutions, helping organisations move towards continuous decision-making by generating insights from structured and unstructured private and public data, leveraging cutting-edge, proprietary Artificial Intelligence, Machine Learning and Blockchain technologies. More than 80 patent applications make Innoplexus a leading European AI champion. Founded in 2015, Innoplexus AG is headquartered in Eschborn, Germany, with offices in Pune, India, and Hoboken, USA.