CONFIDENTIAL & COPYRIGHT © 2011-19 INNOPLEXUS AG: GERMANY, INDIA, USA
S9359 Real-time connection-based filtering to improve the precision of the search engine in life sciences
For: NVIDIA GTC 2019: Deep Learning & AI Conference, 20 March 2019
From: Vatsal Agarwal, Vice President – Technology & Innovation
AGENDA
01 Problem
• Business problem
• Why is it important?
• Technical challenge
02 Approach
• Connection filtering
• Graph embedding
• Performance optimization
03 Results
• Scope of tests
• Graph embedding
• Real-time search
01 Problem
What is the problem being discussed? Why is it important?
The world needs machines that enable information discovery rather than keyword search
Keyword search:
› Finds documents
› Containing query terms
› Known facts
› Google Search
Information discovery:
› Collates information
› From multiple documents
› New signals
› Ontosight search
Information discovery becomes both more important and more difficult in research fields such as life sciences
The Importance
▪ Research output is only incremental despite exponentially growing funding
▪ Too much information is available, yet it is scattered and not readily consumable
Example:
▪ Find important targets for a rare disease
The Difficulty
▪ Specialized knowledge: NLP was developed for common language use
▪ Size and complexity: vocabulary size; context disambiguation
Example:
▪ APR can mean multiple things
Human lives are at stake
We at Innoplexus use ontology-based concept search to enable more precise annotations in life sciences
Using synonym-based concept search
How it improves the search precision
Ontology to be continuous and self-learning
Entity extraction as feedback
What is an ontology?
A representation, formal naming, and definition of the categories, properties, and relations between the concepts, data, and entities that substantiate one, many, or all domains
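A minimal sketch of synonym-based concept search, to illustrate why it can be more precise than plain keyword matching. The concept IDs, the synonym lists, and the helper names `build_lookup` and `tag_concepts` are hypothetical and only handle single-word synonyms; this is not the Innoplexus implementation.

```python
import re

# Hypothetical mini-ontology: concept ID -> synonyms (illustrative values only).
SYNONYMS = {
    "C_FEVER": ["fever", "pyrexia"],
    "C_HEADACHE": ["headache", "cephalalgia"],
}

def build_lookup(synonyms):
    """Map each lower-cased synonym to its concept ID."""
    return {term.lower(): cid for cid, terms in synonyms.items() for term in terms}

def tag_concepts(text, lookup):
    """Return the set of concept IDs whose synonyms occur as whole words in text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {lookup[tok] for tok in tokens if tok in lookup}

lookup = build_lookup(SYNONYMS)
text = "Patient presented with pyrexia and a mild headache."
doc_concepts = tag_concepts(text, lookup)
query_concepts = tag_concepts("fever", lookup)

keyword_hit = "fever" in text.lower()               # plain keyword search
concept_hit = bool(query_concepts & doc_concepts)   # concept search via synonyms
```

Here the keyword test misses the document because it only mentions "pyrexia", while the concept test matches because both words map to the same concept.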
But concept search still lacks contextual understanding of concepts. This is where connection filtering comes in.
What is connection filtering?
Extending ontology-based search from synonyms to connected concepts
How to use connection filtering
(a) Embeddings built on the ontology: train on the corpus using the ontology and leverage its connections
(b) Leverage it for information discovery: calculate the distance of the query tokens against each document
NVIDIA GPUs make real-time connection-based filtering possible for information discovery
Solving for real-time usage was not practical for two reasons:
Incredible size
▪ More than 30 million terms in the life science vocabulary
▪ Over a trillion connections among them
▪ More than 100 million documents need to be indexed at Innoplexus
Space-time complexity
▪ Index all combinations of the search vocabulary: real-time, but with an extremely large space requirement
▪ Calculate the distance of each term from each query token for every document: too many calculations in real time
Hence, we took an intermediate approach: treating all distance calculations as a few matrix multiplications, accelerated by a cluster of 2 NVIDIA V100 GPUs
02 Approach
How can you do connection filtering? How to do it in real time?
We start by building an ontology for the domain of life sciences
1. Decide on classes
E.g. disease, medicine, gene
2. Each concept forms a node
E.g. Fever, Combiflam
3. All synonyms and biological information of each node go in the node properties
E.g. gene location, medicine dosage
4. Biological relations appear as edges
E.g. Combiflam and fever would have an edge with relation “treats”
Innoplexus’ ontology has over 15 million nodes and over a trillion connections, built using a combination of 200 biological databases and enriched with self-learning algorithms
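The four steps above can be sketched as a small data structure. The class names `Node` and `Ontology`, the field names, and the IDs are assumptions made for illustration; the slides do not say how the ontology is actually stored.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    concept_id: str
    concept_class: str                               # step 1: e.g. "disease", "medicine", "gene"
    name: str                                        # step 2: one node per concept
    synonyms: list = field(default_factory=list)     # step 3: synonyms
    properties: dict = field(default_factory=dict)   # step 3: other biological information

@dataclass
class Ontology:
    nodes: dict = field(default_factory=dict)        # concept_id -> Node
    edges: list = field(default_factory=list)        # step 4: (source_id, relation, target_id)

    def add_node(self, node):
        self.nodes[node.concept_id] = node

    def add_edge(self, source_id, relation, target_id):
        if source_id not in self.nodes or target_id not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.append((source_id, relation, target_id))

    def neighbours(self, concept_id):
        """Concepts one hop away, ignoring edge direction."""
        found = set()
        for source, _, target in self.edges:
            if source == concept_id:
                found.add(target)
            elif target == concept_id:
                found.add(source)
        return found

onto = Ontology()
onto.add_node(Node("M1", "medicine", "Combiflam", properties={"dosage": "placeholder"}))
onto.add_node(Node("D1", "disease", "Fever", synonyms=["pyrexia"]))
onto.add_edge("M1", "treats", "D1")
```

A flat edge list keeps the sketch short; at the scale quoted above, an indexed graph store would be needed instead.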
The prepared ontology can then be used to calculate the distance of each document from the user query
The user searches for terms (such as Headache, Fever, Combiflam)
For each document in the database:
All concepts are tagged using the ontology and synonyms
The distance from each query concept to each document concept is calculated as the minimum number of network hops needed to reach the document concept from the query concept
All distances are aggregated to calculate the net distance of the document to the query
The net distances are then sorted to find the nearest documents
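The procedure above can be sketched with a breadth-first search for the hop counts. The slides do not say how distances are aggregated, so the mean used here is an assumption, as are the toy graph, the `unreachable` penalty, and the function names.

```python
from collections import deque

def hop_distances(graph, source):
    """Breadth-first search: minimum hop count from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def rank_documents(graph, query_concepts, documents, unreachable=99):
    """Sort document IDs by mean hop distance over all query/document concept pairs."""
    per_query = {q: hop_distances(graph, q) for q in query_concepts}
    scores = {}
    for doc_id, concepts in documents.items():
        pairs = [per_query[q].get(c, unreachable) for q in query_concepts for c in concepts]
        # Documents without tagged concepts get the penalty distance (assumption).
        scores[doc_id] = sum(pairs) / len(pairs) if pairs else float(unreachable)
    return sorted(scores, key=scores.get), scores

# Toy undirected ontology as an adjacency list (illustrative only).
graph = {
    "Combiflam": ["Fever", "Headache"],
    "Fever": ["Combiflam", "Infection"],
    "Headache": ["Combiflam"],
    "Infection": ["Fever", "Antibiotic"],
    "Antibiotic": ["Infection"],
}
documents = {"doc_a": ["Combiflam", "Headache"], "doc_b": ["Antibiotic"]}
ranking, scores = rank_documents(graph, ["Fever"], documents)
```

Each query concept needs a full graph traversal here, which is the cost that the embedding approach on the following slides is meant to avoid.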
Deploying the ontology graph in hyperbolic space would make distance estimation many times faster
Finding the distance between C1 & C2
I. Get the coordinates of C1 in N-dimensional hyperbolic space
II. Get the coordinates of C2 in the same hyperbolic space
III. Calculate the distance between the two coordinates
[Figure: Illustration of the ontology embedded in 2-dimensional hyperbolic space]
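Steps I to III can be sketched as a coordinate lookup followed by a closed-form distance. The slides do not name the hyperbolic model; the Poincaré ball model is assumed here, where the distance between points u and v inside the unit ball is arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))). The coordinates are made up for illustration.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))

# Hypothetical 2-dimensional embedding table: concept -> coordinates.
embedding = {"C1": (0.0, 0.0), "C2": (0.5, 0.0)}

# Steps I and II are dictionary lookups; step III is one formula evaluation.
distance = poincare_distance(embedding["C1"], embedding["C2"])
```

Once the coordinates exist, the cost per pair no longer depends on the size of the graph, unlike a hop-count search.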
GPUs further speed up the processing by recasting the problem as matrix multiplications
A linear (or even non-linear) combination of the two matrices, A and B, would give the distances of all documents from the query.
All documents can be one dimension of matrix A
All concepts in each document can be another dimension
Coordinates of the H-space embeddings can be the third dimension
All concepts in the query can be one dimension of matrix B
Coordinates of the H-space embeddings can be another dimension
The third dimension could be unity to allow matrix processing between A and B
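One possible reading of this layout, sketched in dependency-free Python. Since ‖u − v‖² = ‖u‖² + ‖v‖² − 2 u·v, all dot products between document concepts and query concepts come from a single matrix product, which is the part a GPU would accelerate. The Poincaré distance, the mean aggregation, the fixed number of concepts per document, and all names are assumptions, not the method stated on the slide.

```python
import math

def matmul(X, Y):
    """Plain-Python matrix product of X (m x k) and Y (k x n)."""
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]

def all_pairs_poincare(doc_points, query_points):
    """Distances between every document-concept point and every query-concept point."""
    transposed = [list(col) for col in zip(*query_points)]
    cross = matmul(doc_points, transposed)            # all dot products in one product
    doc_sq = [sum(x * x for x in p) for p in doc_points]
    query_sq = [sum(x * x for x in p) for p in query_points]
    result = []
    for i, row in enumerate(cross):
        dist_row = []
        for j, dot in enumerate(row):
            # Squared Euclidean distance recovered from norms and the dot product.
            sq_diff = max(doc_sq[i] + query_sq[j] - 2.0 * dot, 0.0)
            denom = (1.0 - doc_sq[i]) * (1.0 - query_sq[j])
            dist_row.append(math.acosh(1.0 + 2.0 * sq_diff / denom))
        result.append(dist_row)
    return result

# A: documents x concepts per document x coordinates (hypothetical embeddings).
A = [
    [(0.0, 0.0), (0.5, 0.0)],
    [(0.0, 0.5), (0.0, -0.5)],
]
# B: query concepts x coordinates.
B = [(0.0, 0.0)]

flat = [point for doc in A for point in doc]          # reshape A to (docs * concepts) x coords
pairwise = all_pairs_poincare(flat, B)
per_doc = len(A[0])
doc_distance = [
    sum(sum(row) for row in pairwise[d * per_doc:(d + 1) * per_doc]) / (per_doc * len(B))
    for d in range(len(A))
]
```

With a GPU array library, `matmul` and the element-wise step would each become one batched call over the whole corpus.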
Quasi-cyclicity is the enemy of precision and speed when leveraging hyperbolic space embeddings
Quasi-cyclicity, i.e. small cycles within the graph, brings in multiplicative distortion [1]
▪ Any cyclicity makes it difficult to place a node in hyperbolic space, adding distortion.
Time complexity also grows exponentially
▪ Each cycle takes extra time to calculate the right coordinates with respect to existing nodes and relations
In big graphs, there are a large number of cycles
▪ High distortion in embeddings
▪ Impractically long embedding time
▪ Only tree-like graphs are feasible
[1] Verbeek, Kevin & Suri, Subhash. (2014). Metric Embedding, Hyperbolic Space, and Social Networks. Proceedings of the Annual Symposium on Computational Geometry. 10.1145/2582112.2582139
[2] Beliz Gunel, Fred Sala, Albert Gu, Christopher Ré. HyperE: Hyperbolic Embeddings for Entities.
Breaking the ontology graph into tree-like communities minimizes quasi-cyclicity to improve precision and speed
[Figure: Breast cancer community]
We split the huge ontology graph into many sub-graphs (called communities)
• Each community has a tree-like structure with minimal quasi-cyclicity, if any.
• Concepts and relationships (nodes and edges) might be redundant among various sub-graphs to ensure full coverage
One extreme would be to create sub-graphs, one for each node of the main graph.
• Each sub-graph would have one node as root
• It includes all edges to the nodes below
• But it skips any edges to nodes above or at the same level.
• This ensures each node has all connected nodes covered in one of the communities
• All communities have zero quasi-cyclicity.
• The downside is that there would be too many communities to be stored and processed in real time
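A rough sketch of how one rooted, cycle-free community could be extracted with a breadth-first search. Keeping only the first edge that reaches each node (so the result is a strict tree) and the `max_depth` cut-off are assumptions added for the sketch; the slide only specifies that edges to nodes above or at the same level are skipped.

```python
from collections import deque

def build_community(graph, root, max_depth):
    """Return (levels, tree_edges) for a tree-like community rooted at root."""
    levels = {root: 0}
    tree_edges = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if levels[node] == max_depth:
            continue                      # do not expand beyond the depth limit
        for nxt in graph.get(node, ()):
            if nxt in levels:
                continue                  # already placed in the tree: skipping avoids a cycle
            levels[nxt] = levels[node] + 1
            tree_edges.append((node, nxt))
            queue.append(nxt)
    return levels, tree_edges

# Toy graph containing a small cycle A-B-C (illustrative only).
graph = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}
levels, tree_edges = build_community(graph, "A", max_depth=2)
```

The same-level edge B-C is dropped from this community; per the slide, such connections are expected to be covered by a community rooted at another node.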
Building communities for only selected broad concepts can control the amount of processing needed
A compromise is to make communities for a few broad concepts, e.g. Breast Cancer
• Positive: Controlled number of communities for real-time processing
• Negative: It misses distantly connected nodes
o For example, a concept X that was connected to concept Y through 10 hops may not be connected in any community
o Such connections won’t be key in the ultimate document-level distance calculations
• Trade-off: Minimal quasi-cyclicity is allowed in order to cover the maximum number of connections
[Figure: Relationship between average community size and the number of communities]
03 Results
How did we set it up? What did we achieve?
A small but significant portion of the Innoplexus ontology was used to build the proof-of-concept
To build a proof-of-concept, we sampled a connected sub-graph with:
350K nodes
5.2M edges
4 classes: Diseases, Chemical Molecules, Genes & Others
In this graph, there are roughly a million edges across different entity classes, which could form billions of quasi-cycles.
We built about 65 thousand tree-like communities around various disease areas to reduce quasi-cyclicity
1. We can find the distance of each concept from an average of 46k other concepts (in this setup)
2. 87% of concept pairs are not available in any community
▪ This is because we made communities for only broad concepts and skipped the cyclic connections
▪ These are only far-off concepts, at least 3 hops away
[Figure: Number of concept pairs]
Average community size: 538
Number of communities: 65.5k
H-embeddings were generated for all communities in 110 hours using a cluster of 2 Tesla V100 GPUs
A CPU (4-core) takes over 29 hours to embed one community with 1000 epochs (~103.5 seconds per epoch), compared to less than 8 hours for a Tesla V100 GPU (~28 seconds per epoch)
Embedding the whole graph with 65.5K communities:
Cluster of 50 CPUs (4-core): over 1 year (estimated)
Cluster of 2 GPUs (Tesla V100): 4.5 days
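A quick arithmetic check of the per-community figures, using only numbers quoted on this slide. The per-epoch times are stated as approximate, so small differences from the rounded hour and day values are expected; the variable names are ours.

```python
EPOCHS = 1000
CPU_SEC_PER_EPOCH = 103.5    # 4-core CPU, approximate
GPU_SEC_PER_EPOCH = 28.0     # Tesla V100, approximate
TOTAL_GPU_HOURS = 110        # all communities on 2 GPUs

cpu_hours = EPOCHS * CPU_SEC_PER_EPOCH / 3600    # time to embed one community on CPU
gpu_hours = EPOCHS * GPU_SEC_PER_EPOCH / 3600    # time to embed one community on GPU
speedup = CPU_SEC_PER_EPOCH / GPU_SEC_PER_EPOCH  # per-epoch CPU/GPU ratio
total_days = TOTAL_GPU_HOURS / 24                # wall-clock days for the full run

print(f"CPU {cpu_hours:.2f} h, GPU {gpu_hours:.2f} h, "
      f"speed-up {speedup:.1f}x, full run {total_days:.1f} days")
```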
This trained setup can now enable real-time connection-based filtered search in life sciences
A search query of 3 concepts generates ~1 million concept pairs (3 x 350k)
• Detecting the relevant community (per million pairs): 6.12 seconds; parallelized over 48 cores, it completes in ~0.14 seconds
• Distance calculation per million pairs: CPU (4-core) 1.03 seconds; GPU (Tesla V100) 0.17 seconds
Note: We are using a PCIe link, which is slow in data loading. Much faster speeds are expected with an NVLink setup.
Scaling to our real ontology, searching each concept is expected to take 4.6 s on GPU vs 17.6 s on CPU (about 4x faster)
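The ratios implied by these timings can be worked out directly. All inputs are the figures quoted above; the variable names are ours, and no claim is made about what accounts for the difference between the ratios.

```python
CONCEPTS_IN_QUERY = 3
GRAPH_NODES = 350_000

pairs = CONCEPTS_IN_QUERY * GRAPH_NODES    # concept pairs generated by one query

community_speedup = 6.12 / 0.14            # community detection: serial vs. 48 cores
distance_speedup = 1.03 / 0.17             # distance calculation: CPU vs. GPU
projected_speedup = 17.6 / 4.6             # projected per-concept search: CPU vs. GPU

print(f"{pairs:,} pairs; community detection {community_speedup:.1f}x, "
      f"distance calculation {distance_speedup:.1f}x, projected search {projected_speedup:.1f}x")
```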
04 Questions?
Come to the talk by Innoplexus CTO Gaurav Tripathi to learn how else GPUs can accelerate drug development
How To Use GPUs For Faster, Better and Cheaper Drug Development
Tomorrow 10:00 AM - 10:50 AM (Thursday, Mar 21)
SJCC Room 220B (Concourse Level)
By Gaurav Tripathi, CTO, Innoplexus AG
Industry Segments: Healthcare & Life Sciences
Technical Level: Business/Executive level
Participate in Innoplexus Online Hiring Hackathon: Saving lives with AI
We invite you to participate in the Innoplexus Online Hackathon to understand and solve a real-world problem from industry. Imagine some of the world’s largest companies facing this problem and throwing a lot of resources at it, and you will realize its scale.
Prizes:
• Nvidia Shield pro
• Echo Plus (2nd gen) with a built-in smart home hub
• Echo - Smart speaker with Alexa
Date: 22-24 March 2019
https://datahack.analyticsvidhya.com/contest/innoplexus-online-hiring-hackathon-saving-lives-wi/
Innoplexus AG: Aspiring European AI Champion
EXPERTISE
300+ highly skilled employees in multidisciplinary teams
GLOBAL
Offices in Germany, India, USA
CLIENTS
20 global clients including many Big Pharma
SECURITY
Global Standards in Data Protection
FOUNDATION
Innoplexus AG was founded at the end of 2015
DATA VOLUME
100s of TBs of business and scientific data
PATENTS
80 Patent applications, another 23 in pipeline
TECHNOLOGY
State of the art technology to build reliable, secure and compliant applications
www.innoplexus.com
Frankfurt (Germany):
Innoplexus AG
Frankfurter Strasse 63, 65760 Eschborn
Pune (India):
Innoplexus Consulting Services Pvt. Ltd.
7th Floor, Midas Tower
Hinjewadi Phase 1, Pune 57
New Jersey (USA):
Innoplexus Holdings, Inc.
258 Newark Street, Suite 200, Hoboken, NJ 07030
Innoplexus AG offers Data as a Service and Continuous Analytics as a Service products and solutions, helping organisations move towards continuous decision-making by generating insights from structured and unstructured private and public data, leveraging cutting-edge, proprietary Artificial Intelligence, Machine Learning and Blockchain technologies. More than 80 patent applications make Innoplexus a leading European AI champion. Founded in 2015, Innoplexus AG is headquartered in Eschborn, Germany, with offices in Pune, India, and Hoboken, USA.
Contact