Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark


Marc C. Hadfield, Founder, Vital AI
http://vital.ai | marc@vital.ai | 917.463.4776


Quick Overview

agenda

MetaQL Intro

Motivation

Domain Models (Schema)

MetaQL DSL

MetaQL Implementations

Examples

MetaQL

Leverage Domain Model (Schema)

Compose Queries in Code: Typed

Execute Queries on Databases, Interchangeably

Minimize TCO: Separation of Concerns

Developer Efficiency

Query Framework

Executable JVM Code! (Groovy Closure)

MetaQL Origin

Across many data-driven application implementations, a desire for:

Reusable Processes, Tools: Stop re-inventing the wheel.

Tools to manage “schema” across an application & organization.

Tools to combine Semantic Web, NOSQL, and Hadoop/Spark.

Team Collaboration: Human Labor is usually limiting factor.

sample

[Diagram, shown across three slides: an EMail node with hasSender and hasRecipient edges to Person nodes. Query ARCs are overlaid on the graph: type:Email —hasSender→ type:Person (Address: john@example.org), type:Email —hasRecipient→ type:Person, with a notEqual constraint between the two Person nodes.]

sample MetaQL graph query

GRAPH {
  value segments: ["mydata"]
  ARC {
    node_constraint { Email.class }
    constraint { "?person1 != ?person2" }
    ARC_AND {
      ARC {
        edge_constraint { Edge_hasSender.class }
        node_constraint { Person.props().emailAddress.equalTo("john@example.org") }
        node_constraint { Person.class }
        node_provides { "person1 = URI" }
      }
      ARC {
        edge_constraint { Edge_hasRecipient.class }
        node_constraint { Person.class }
        node_provides { "person2 = URI" }
      }
    }
  }
}

Internet of Things

[Photos: an Amazon Echo; a cup of coffee.]

Internet of Things: Batch and Stream Processing

[Architecture diagram: Amazon Echo → Amazon Echo Service → haley-app webservice (Vert.X) → Vital Prime (Database, DataScript) → Hadoop HDFS + Apache Spark (Streaming, MLlib, NLP, GraphX) / Aspen Datawarehouse, divided into an Analytics Layer and a Serving Layer; a Haley Device (Raspberry Pi) and a Voice to Text API handle voice input.]

Cognitive Application

• NLP and Inference to process User request.
• Query Knowledge in DB
• Streaming Prediction Models: “Should I really have more Coffee?”
• External APIs…

Demo Examples

[Diagram: JavaScript WebApp (VitalService-JS) ↔ Vert.X (Vital-Vertx) ↔ Vital Prime (Database, Prediction Models, DataScript).]

https://github.com/vital-ai/vital-examples

Demo Example

https://demos.vital.ai/enron-js-app/index.html
https://github.com/vital-ai/vital-examples/tree/master/enron-js-app

[Screenshots of the Enron demo application: EMail nodes linked by hasRecipient edges to Recipient nodes.]

Cytoscape Plugin

https://github.com/vital-ai/vital-cytoscape
http://cytoscape.org/

[Screenshots of the Cytoscape Plugin, including a view of Wordnet Data for “wine, vino”.]

where are we using MetaQL?

• Financial Services
• Healthcare
• Internet-of-Things
• Start-Ups, Recommendation Apps

motivation for MetaQL

application architecture

[Architecture diagram: a Web / Mobile Application talks to an Application Server backed by a Transactional Database and a Key/Value Cache (Serving Layer); Batch and Stream Processing runs over Hadoop HDFS and Apache Spark (Streaming, MLlib, GraphX) in the Analytics Layer; External API Services are called alongside.]

Multiple Databases + Analytics + External APIs

enterprise application architecture

[Diagram: Dashboard ↔ Application Server ↔ Enterprise Datawarehouse, drawing on many Data Silos… and many, many Data Models.]

volume, velocity, variety

polyglot persistence = multiple database technologies

…but we also have very many data models.

many databases, many data models, changing rapidly.

too many moving parts for a developer to reasonably manage! need fewer APIs to learn!

what happens when changes occur?

Roles

• Infrastructure / DevOps
• Data Scientists
• Business + Domain Experts
• Developers

what changes?

• Data Model Changes
• New Data Sources
• Infrastructure Change
• Switch Databases
• New Prediction Models / Features
• New Service APIs…

Many Interdependencies…

Example: Change in the taxonomy of a categorization service breaks all the logic tied to the old categories.

total cost of ownership

How much code changes when we modify our data model to include new sources?

How to minimize by decoupling dependencies?

When we switch database technologies?

Domain Model as “Contract”

[Diagram: Infrastructure / DevOps, Data Scientists, Business + Domain Experts, and Developers all centered on a shared Domain Model.]

Everyone agrees on (or at least is aware of) the definition of Domain Concepts.

Use semantics to map “views”.

MetaQL Abstraction

[Diagram: the same roles around the Domain Model, with MetaQL as the abstraction layer between Developers and the infrastructure.]

Abstraction to give breathing room to Infrastructure.

Infrastructure / DevOps

Database Types:
• Key/Value
• Document
• RDF Graph
• NOSQL
• Relational
• Timeseries

ACID vs. BASE

Optimizing Query Generation

Tuning Secondary Indices

Update MetaQL DSL for new DB features

CAP Theorem

Domain Model (Schema)

Domain Model Implementation

Combine:
• SQL-style Schema
• Hadoop Data Serialization Schema (Avro, Thrift, Protocol Buffers, Kryo, Parquet)
• add Semantics: the “Meaning” of objects

Not a table “person”, but define the concept of Person to be used throughout an application. The implementation decides how to store “Person” data in its database.

Domain Model Implementation

Domain Model definition resolves:
• RDF vs. Property Graph model
• Object/Relational Impedance Mismatch

Use OWL to capture Domain Model:
• SubClasses
• SubProperties
• Multiple Inheritance

Marginal technology performance gains are hugely outweighed by Human productivity gains, and a wider choice of tools.

Compromise across modeling paradigms.

Domain Model Implementation

Example: Healthcare Application:

URI <Person123> IS_A:
• Patient
• BillableAccount
• InsuredEntity

Same URI across three domain concepts: Diagnostics Records, Billing System, Insurance System.

Implementation Note: We generate code for the JVM using “traits” as a way to implement multiple inheritance (Groovy, Scala, Java8). The trait is used as a semantic marker to link to the Domain Model.
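The trait-as-semantic-marker idea can be sketched with plain Java 8 interfaces. All names here are hypothetical stand-ins; the real system generates Groovy/Scala traits from the OWL domain model:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class TraitSketch {
    // Marker interfaces stand in for generated traits linked to the OWL model.
    interface Patient {}
    interface BillableAccount {}
    interface InsuredEntity {}

    // One entity, one URI, three domain concepts: multiple inheritance
    // via interfaces, since JVM classes allow only single inheritance.
    static class Person123 implements Patient, BillableAccount, InsuredEntity {
        final String uri = "urn:Person123";
    }

    // A consumer can recover the semantic markers at runtime.
    static List<String> domainTypes(Object o) {
        return Arrays.stream(o.getClass().getInterfaces())
                     .map(Class::getSimpleName)
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Person123 p = new Person123();
        System.out.println(p.uri + " IS_A " + domainTypes(p));
    }
}
```

The same object can then flow through diagnostics, billing, and insurance code paths, each dispatching on the marker it cares about.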

Domain Model - Core Classes

Node
Edge
HyperNode
HyperEdge

Properties:
• URI
• Primary Type
• Types

Edges/HyperEdges:
• Source URI
• Destination URI

Edges:
• Peer
• Taxonomy

Class Instances contain Properties.

Protege OWL Editor

VitalSigns: Domain Model Dev Kit

$ vitalsigns generate -o ./domain-ontology/enron-dataset-1.0.0.owl

$ ls domain-groovy-jar
enron-dataset-groovy-1.0.0.jar

$ ls domain-json-schema
enron-dataset-1.0.0.js

OWL can be compiled into JVM code statically (create an artifact for maven), or done dynamically at runtime.

Development with the Domain Model

Code Completion from Domain Model

Development with the Domain Model

VitalSigns vs = VitalSigns.get()

Musician john = new Musician().generateURI("john")
john.name = "John Lennon"
john.birthday = "October 9, 1940"^xsd.xdatetime("MMMM d, yyyy")

MusicGroup thebeatles = new MusicGroup().generateURI("thebeatles")
thebeatles.name = "The Beatles"

// try to assign the wrong property, throws an exception
try {
    thebeatles.birthday = "January 1, 1970"^xsd.xdatetime("MMMM d, yyyy")
} catch(Exception ex) { println ex } // no such property exception

vs.addToCache( thebeatles.addEdge_hasMember(john) )

// use cache to resolve queries
thebeatles.getMembers().each{ println it.name }

// use database to resolve queries
thebeatles.getMembers(ServiceWide).each{ println it.name }

Implicit MetaQL Queries

VitalService API

• Open/Close Endpoint
• Create/Remove Segment
• Create/Read/Update/Delete Object
• Queries (MetaQL as input closure)
• Service Operations (MetaQL as input closure)
• callFunction (DataScript)
• init Transaction/Commit/Rollback

A “Segment” is a Database (container of objects)

MetaQL

VitalSigns: Domain Model Manager
• MetaQL DSL
• Prediction Model DSL
• Pipeline Transformation DSL (ETL) (in development)

A tricky bit is finding the best way to express the DSL within the allowed grammar of the host language (Groovy). It’s an ongoing effort.

Query Types

AGGREGATION

PATH

GRAPH

SELECT

Query Elements

• constraints: node_constraint, edge_constraint, …
• comparators (equalTo, greaterThan, …)
• provides, ?reference
• AND, OR
• OPTIONAL
• Sort Criteria

SELECT query

SELECT {
  value limit: 100
  value offset: 0
  value segments: ["mydata"]

  constraint { Person.class }
  constraint { Person.props().name.equalTo("John") }
}

GRAPH query

GRAPH {
  value segments: ["mydata"]
  ARC {
    node_constraint { Email.class }
    constraint { "?person1 != ?person2" }
    ARC_AND {
      ARC {
        edge_constraint { Edge_hasSender.class }
        node_constraint { Person.props().emailAddress.equalTo("john@example.org") }
        node_constraint { Person.class }
        node_provides { "person1 = URI" }
      }
      ARC {
        edge_constraint { Edge_hasRecipient.class }
        node_constraint { Person.class }
        node_provides { "person2 = URI" }
      }
    }
  }
}

GRAPH query (2)

GRAPH {
  value segments: [VitalSegment.withId('wordnet')]
  value inlineObjects: true    // <— inline objects

  ARC {
    node_bind { "node1" }
    node_constraint { SynsetNode.expandSubclasses(true) }
    node_constraint { SynsetNode.props().name.contains_i("happy") }

    ARC {
      edge_bind { "edge" }
      node_bind { "node2" }
    }
  }
}

Code iterating over Results can use bind names to reference objects in each solution: node1, edge, node2.

PATH query

def forward = true
def reverse = false

PATH {
  value segments: segments
  value maxdepth: 5
  value rootURIs: [URIProperty.withString(inputURI)]

  if( forward ) {
    ARC {
      value direction: 'forward'
      // accept any edge:
      edge_constraint { }
      // accept any node:
      node_constraint { }
    }
  }
  if( reverse ) {
    ARC {
      value direction: 'reverse'
      // accept any edge:
      edge_constraint { }
      // accept any node:
      node_constraint { }
    }
  }
}

AGGREGATION query

SUM Product.props().cost

AVERAGE Person.props().birthday

COUNT_DISTINCT Document.props().active

FIRST { DISTINCT Document.props().title, expandProperty : false, order: Order.ASC }

Part of a SELECT query

Service Operations DSL

Insert

Update

Delete

Service Operations

INSERT {
  value segment: 'testing'

  insert(MusicGroup.class, provides: "thebeatles") { MusicGroup thebeatles ->
    thebeatles.name = "The Beatles"
    thebeatles.URI = "thebeatles"
  }
  insert(Musician.class, provides: "john") { Musician john ->
    john.name = "John"
    john.URI = "john"
  }
  insert(Edge_hasMember) { Edge_hasMember member ->
    member.sourceURI = ref("thebeatles").toString()   // <— using "provides" values
    member.destinationURI = ref("john").toString()
    member.URI = "edge1"
  }
}

Transactions

Implemented at the service level:

def xid = service.startTransaction()
service.save(xid, person123)
service.commitTransaction(xid)
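One way to picture a service-level transaction is an endpoint that buffers writes per transaction id and applies them atomically on commit. This is a minimal hypothetical sketch modeled on the slide's API, not the actual Vital Prime implementation:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: writes are buffered under a transaction id and
// become visible only on commit; rollback discards the buffer.
public class TxSketch {
    private final Map<String, String> store = new HashMap<>();            // committed objects
    private final Map<String, Map<String, String>> pending = new HashMap<>();
    private int counter = 0;

    public String startTransaction() {
        String xid = "tx-" + (++counter);
        pending.put(xid, new LinkedHashMap<>());
        return xid;
    }

    public void save(String xid, String uri, String object) {
        pending.get(xid).put(uri, object);        // buffered, not yet visible
    }

    public void commitTransaction(String xid) {
        store.putAll(pending.remove(xid));        // apply all buffered writes atomically
    }

    public void rollbackTransaction(String xid) {
        pending.remove(xid);                      // discard buffered writes
    }

    public String get(String uri) { return store.get(uri); }
}
```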

MetaQL Implementations

MetaQL

ExecutableQuery

Query Generator

Sparql/RDF Implementation

[Diagram: Quad Store table with G | S | P | O columns.]

Franz AllegroGraph

Sparql/RDF Implementation

VitalGraphQuery q = builder.query {
  GRAPH {
    value segments: ["documents"]
    ARC {
      node_constraint { Person.class }
      node_constraint { Person.props().emailID.equalTo("k.lay@enron.com") }

      ARC {
        node_constraint { EMailMessage.class }
        edge_constraint { Edge_hasEMailMessage.class }
      }
    }
  }
}.toQuery()

println "Query: " + q.toSparql()

Sparql/RDF Implementation

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX vital-core: <http://vital.ai/ontology/vital-core#>
PREFIX p0: <http://vital.ai/ontology/enron-emails#>

SELECT DISTINCT ?s1 ?d2 ?e2
FROM <segment:customer__app__documents>
WHERE {
  {
    ?s1 p0:hasEmailID ?value1 .
    ?s1 rdf:type ?value2 .
    FILTER ( ?value2 = p0:Person && ?value1 = "k.lay@enron.com"^^xsd:string )
    {
      ?d2 rdf:type ?value3 .
      ?e2 rdf:type ?value4 .
      FILTER ( ?value3 = p0:EMailMessage && ?value4 = p0:Edge_hasEMailMessage )
      ?e2 vital-core:hasEdgeSource ?s1 .
      ?e2 vital-core:hasEdgeDestination ?d2 .
    }
  }
}

Spark-SQL / Dataframe

[Diagram: a Segment RDD of (K, V) pairs and a Property RDD of (URI, P, V) rows.]

Experimenting with: the new Dataframe Optimizer (Catalyst), the new Dataframe DSL for query generation, and GraphX for isolated Graph Query cases.

Generate “bad” queries and let the optimizer fix them and Spark partition the RDDs, as long as Spark is aware of the Schema.

Key/Value Implementation

[Diagram: Key/Value table — URI —> Serialized Object.]
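The key/value layout can be sketched as per-segment maps from URI to serialized object bytes. This is a hypothetical illustration using JDK serialization; the slides do not specify the real payload format:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: each segment is a key/value table of URI -> bytes.
public class KvSegmentStore {
    private final Map<String, Map<String, byte[]>> segments = new HashMap<>();

    public void put(String segment, String uri, Serializable object) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(object);               // whole object becomes the value
        }
        segments.computeIfAbsent(segment, s -> new HashMap<>()).put(uri, bos.toByteArray());
    }

    // Only lookup by URI is possible here; richer MetaQL constraints need
    // one of the other implementations (index, quad store, SQL, ...).
    public Object get(String segment, String uri) throws IOException, ClassNotFoundException {
        byte[] bytes = segments.getOrDefault(segment, Collections.emptyMap()).get(uri);
        if (bytes == null) return null;
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }
}
```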

Lucene/SOLR Implementation

DocID  P1  P2  P3  P4
1      V1  V2  V3  V4
2      V1  V2  V3  V4
3      V1  V2  V3  V4

Inverted Index of Property Values…
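The inverted-index idea can be sketched as a map from property=value postings back to document ids, so an equality constraint becomes a single lookup. Hypothetical class names; Lucene/SOLR of course do far more (analysis, ranking, range queries):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the Lucene/SOLR-style implementation:
// (property, value) -> set of document ids containing that pair.
public class InvertedIndexSketch {
    private final Map<String, Set<Integer>> index = new HashMap<>();

    private static String key(String property, String value) {
        return property + "=" + value;
    }

    public void addDocument(int docId, Map<String, String> properties) {
        properties.forEach((p, v) ->
            index.computeIfAbsent(key(p, v), k -> new TreeSet<>()).add(docId));
    }

    // A constraint like Person.props().name.equalTo("John")
    // becomes a postings-list lookup.
    public Set<Integer> search(String property, String value) {
        return index.getOrDefault(key(property, value), Collections.emptySet());
    }
}
```

AND/OR constraints then reduce to intersections and unions of the returned id sets.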

NoSQL BigTable Implementation

DynamoDB (HBase, Cassandra, Accumulo, …)

ROWID  C1     C2     C3     C4
1      K1=V1  K1=V1  K1=V1  K1=V1, K1=V1
2      K1=V1  K1=V1  K1=V1  K1=V1, K1=V1
3      K1=V1  K1=V1  K1=V1  K1=V1, K1=V1

Per Segment object table (ROWID → column/value pairs) + Secondary Indices
Per Segment property table (URI | P | V) + Secondary Indices

SQL Implementation

SQL, Hive-SQL, Redshift, …

[Diagram: per Segment table with G | S | P | O columns, with Partitioning (Hive).]
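A per-segment quad table can be pictured as four columns with constraints compiling to WHERE clauses. In this hypothetical sketch, a stream filter stands in for generated SQL such as `SELECT s FROM segment_mydata WHERE p = 'rdf:type' AND o = 'Person'`:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of the SQL implementation: one (G, S, P, O) table
// per segment; a MetaQL constraint compiles to a WHERE clause over it.
public class QuadTableSketch {
    static final class Quad {
        final String g, s, p, o;
        Quad(String g, String s, String p, String o) {
            this.g = g; this.s = s; this.p = p; this.o = o;
        }
    }

    private final List<Quad> table = new ArrayList<>();  // the per-segment table

    public void insert(String g, String s, String p, String o) {
        table.add(new Quad(g, s, p, o));
    }

    // Equivalent of: SELECT s FROM table WHERE p = ? AND o = ?
    public List<String> subjectsWhere(String p, String o) {
        return table.stream()
                    .filter(q -> q.p.equals(p) && q.o.equals(o))
                    .map(q -> q.s)
                    .collect(Collectors.toList());
    }
}
```

In Hive, partitioning the table (e.g. by segment) keeps such scans bounded.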

implementation

DSL Documentation to be posted: http://www.metaql.org/

VitalSigns, VitalService, MetaQL https://dashboard.vital.ai/

Vital AI github: https://github.com/vital-ai/ Sample Code

Spark Code: Aspen, Aspen-Datawarehouse

Documentation Coming!

closing thoughts

Separation of Concerns yields the Agility needed to keep up with rapidly evolving Data.

“Domain Model as Contract” provides a framework for consistent interpretation of Data across an application.

MetaQL provides a framework for the consistent access and query of Data across an application.

Context: Data-Driven Application / Cognitive Applications:

Thank You!

Marc C. Hadfield, Founder, Vital AI
http://vital.ai | marc@vital.ai | 917.463.4776

Pipeline DSL (ETL)

PIPELINE { // Workflow
  PIPE { // a Workflow Component with dependencies
    TRANSFORM { // Joins across Datasets
      IF (RULE { }) // Boolean, Query, Construct, …
      THEN { RULE { } }
      ELSE { RULE { } }
    }
    PIPE { … } // dependent PIPE
  } // Output Dataset
  PIPE { … }
}

Influenced by Spark Pipeline and Google Dataflow Pipeline

Schema Upgrade/Downgrade

UPGRADE {
  upgrade(oldClass: OLD_Person.class, newClass: NEW_Person.class) {
    person_old, person_new ->
      person_new.newName = person_old.oldName
  }
}

DOWNGRADE {
  downgrade(newClass: NEW_Person.class, oldClass: OLD_Person.class) {
    person_new, person_old ->
      person_old.oldName = person_new.newName
  }
}
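The upgrade/downgrade rules amount to per-class property-copying functions between old and new object versions. A plain-Java sketch with hypothetical classes mirroring OLD_Person/NEW_Person:

```java
import java.util.function.BiConsumer;

// Hypothetical sketch: schema migration as registered per-class
// transformation functions, mirroring the UPGRADE/DOWNGRADE closures.
public class SchemaMigrationSketch {
    static class OldPerson { String oldName; }
    static class NewPerson { String newName; }

    static final BiConsumer<OldPerson, NewPerson> UPGRADE_RULE =
        (oldP, newP) -> newP.newName = oldP.oldName;

    static final BiConsumer<NewPerson, OldPerson> DOWNGRADE_RULE =
        (newP, oldP) -> oldP.oldName = newP.newName;

    public static NewPerson upgrade(OldPerson oldP) {
        NewPerson newP = new NewPerson();
        UPGRADE_RULE.accept(oldP, newP);   // copy properties forward
        return newP;
    }

    public static OldPerson downgrade(NewPerson newP) {
        OldPerson oldP = new OldPerson();
        DOWNGRADE_RULE.accept(newP, oldP); // copy properties backward
        return oldP;
    }
}
```

Having both directions lets old and new application versions read the same data during a rollout.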

Multiple Endpoints

def service1 = VitalService.getService(profile: "kv-users")
def service2 = VitalService.getService(profile: "posts-db")
def service3 = VitalService.getService(profile: "friendgraph-db")

// given user URI: user123@email.org
// get user object from service1
// find friends of user in friendgraph via service3
// find posts of friends in posts-db
// update service1 with cache of user-to-friends-postings
// send postings of friends to user in UI
