Democratizing Data Science in the Cloud


Bill Howe, Ph.D.

Associate Director and Senior Data Science Fellow, eScience Institute

Affiliate Associate Professor, Computer Science & Engineering

11/1/2016 Bill Howe, UW 1


Cloud Data Management is about

sharing resources between tenants

We’re interested in new services powered by sharing

more than infrastructure – schema, data, queries

Why? Example: JBOT* Open Data systems

Google Fusion Tables


Entrepreneurship

1) “Data once guarded for assumed but untested

reasons is now open, and we're seeing benefits.”

-- Nigel Shadbolt, Open Data Institute

2) Need to help “non-specialists within an

organization use data that had been the

realm of programmers and DB admins”

-- Benjamin Romano, Xconomy

“Businesses are now using data the way

scientists always have”

-- Jeff Hammerbacher

Mt. Sinai, formerly Cloudera

*Just a Bunch of Tables

Data, data, data


Kevin Merritt, CEO, Socrata

Deep Dhillon, CTO, Socrata

[Figure: tenant queries (Q Q Q ….) issued against a three-layer stack: Control Plane / Infrastructure; Data Plane / Database sys.; Application / schema, data, query logs]

Benefits: Significantly reduced management overhead

Challenges: security, scheduling, SLAs, isolation

Virtualization


DB-as-a-Service

Benefits: Significantly reduced management overhead

Challenges: security, scheduling, SLAs, isolation


JBOT* Query-as-a-Service Systems

Goal:

smart cross-tenant services,

trained on everyone’s data

• Metadata inference and data curation

• Query recommendation via common idioms

• Data discovery – e.g., “find me things to join with”

• Visualization recommendation

• Semi-automatic integration services


*Just a Bunch of Tables

Example Service: Automated Data Curation


Microarray samples submitted to the Gene Expression Omnibus

Curation is fast becoming the

bottleneck to data sharing

Maxim Grechkin, Hoifung Poon

Goal: Repair metadata for genetic

datasets using the content of the data, the

structure of an associated ontology, the

abstract of the paper, and everything else.

Deep Neural Network

Tissue Type Labels

Innovations in transfer learning,

poor training data, etc.

Paper

Abstract

Iterative co-learning between a text-based classifier and an expression-based classifier: both models improve by training on each other’s results

• SQLShare: Query-as-a-Service

• VizDeck: Visualization recommendation

• Myria: Big Data Ecosystems

VizDeck

Some Cloud Data Systems

1) Upload data “as is”

Cloud-hosted, secure; no

need to install or design a

database; no pre-defined

schema; schema inference;

some integration

2) Write Queries

Right in your browser,

writing views on top of

views on top of views ...

SELECT hit, COUNT(*) AS cnt
FROM tigrfam_surface
GROUP BY hit
ORDER BY cnt DESC

3) Share the results

Make them public, tag them,

share with specific colleagues –

anyone with access can query

http://sqlshare.escience.washington.edu
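The "views on top of views" workflow can be tried locally with a small sketch. It assumes SQLite as a stand-in for the hosted database, and the `gene` column, the sample rows, and the view names `hit_counts` and `frequent_hits` are hypothetical; only the table name `tigrfam_surface` and the `hit` column come from the slide.

```python
import sqlite3

# SQLite is used only so the sketch is self-contained; it is not the
# SQLShare back end. The rows below are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tigrfam_surface (gene TEXT, hit TEXT)")
conn.executemany(
    "INSERT INTO tigrfam_surface VALUES (?, ?)",
    [("g1", "TIGR00001"), ("g2", "TIGR00001"), ("g3", "TIGR00002")],
)

# A view over the uploaded table ...
conn.execute(
    "CREATE VIEW hit_counts AS "
    "SELECT hit, COUNT(*) AS cnt FROM tigrfam_surface GROUP BY hit"
)
# ... and a second view defined on top of the first one.
conn.execute(
    "CREATE VIEW frequent_hits AS SELECT hit, cnt FROM hit_counts WHERE cnt > 1"
)

hit_counts = conn.execute(
    "SELECT hit, cnt FROM hit_counts ORDER BY cnt DESC"
).fetchall()
frequent_hits = conn.execute("SELECT hit, cnt FROM frequent_hits").fetchall()
print(hit_counts)
print(frequent_hits)
```

Each view is just a saved query, so later analysis steps can build on earlier ones without copying data.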


SIGMOD 2016

SELECT x.strain, x.chr, x.region as snp_region, x.start_bp as snp_start_bp
     , x.end_bp as snp_end_bp, w.start_bp as nc_start_bp, w.end_bp as nc_end_bp
     , w.category as nc_category
     , CASE WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
            THEN x.end_bp - x.start_bp + 1
            WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
            THEN x.end_bp - w.start_bp + 1
            WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
            THEN w.end_bp - x.start_bp + 1
       END AS len_overlap
FROM [koesterj@washington.edu].[hotspots_deserts.tab] x
INNER JOIN [koesterj@washington.edu].[table_noncoding_positions.tab] w
ON x.chr = w.chr
WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
   OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
   OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
ORDER BY x.strain, x.chr ASC, x.start_bp ASC

Non-programmers can write very complex queries

(rather than relying on staff programmers)

Example: Computing the overlaps of two sets of blast results

We see thousands of

queries written by

non-programmers
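The quantity that the CASE expression in the query above works out case by case, the overlap length of two base-pair intervals, also has a general closed form. The sketch below is not the query author's formulation; the function name `overlap_len` and the sample coordinates are hypothetical, and intervals are assumed to be inclusive on both ends.

```python
def overlap_len(a_start, a_end, b_start, b_end):
    """Number of positions shared by two inclusive intervals, 0 if disjoint."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


# One interval fully inside the other, a partial overlap, and no overlap.
contained = overlap_len(10, 20, 5, 30)
partial = overlap_len(10, 20, 15, 30)
disjoint = overlap_len(10, 20, 25, 30)
print(contained, partial, disjoint)
```

The same expression can be written in SQL wherever two-argument minimum and maximum functions are available.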

The SQLShare Corpus:

A multi-year log of hand-written analytics queries

Queries 24275

Views 4535

Tables 3891

Users 591

SIGMOD 2016

Shrainik Jain

https://uwescience.github.io/sqlshare


A SQL “learner”

http://uwescience.github.io/sqlshare/

Latent Idioms for Schema-Independent Query Recommendation

Background on Word2Vec, GloVe:

Map each term in a corpus to a vector in a high-dimensional space based on its co-occurrences. Linear relationships between these vectors appear to capture remarkable semantic properties.

:

SELECT COUNT(*) FROM [candrzejowiec@yahoo.com].[table_Firearms.txt]

SELECT COUNT (HiLo) FROM [roula.cardaras@gmail.com].[table_MUK.csv]

SELECT count(*) FROM [leslie@westerncatholic.org].[Depth_combined]

select count(Wave_Height) from [christa.kohnert@gmail.com].[Join]

SELECT count(*) FROM [wenjunh@washington.edu].[ecoli_nogaps_1.csv]

SELECT Count(*) FROM [latcron@gmail.com].[TargetTrackFeatures.csv]

SELECT count(*) FROM [billhowe].[sunrise sunset times 2009 - 2011]

SELECT Count(*) FROM [bifxcore@gmail.com].[table_ec_pdb_genus.csv]

SELECT count(*) FROM [whitead@washington.edu].[ecoli_nogaps_1.csv]

SELECT COUNT(*) FROM [ribalet@washington.edu].[Tokyo_0_merged.csv]

SELECT COUNT(*) FROM [dhalperi@washington.edu].[SPID_GOnumber.txt]

SELECT COUNT (species) FROM [bigbananatopdog@gmail.com].[Orthosia]

SELECT COUNT (species) FROM [bigbananatopdog@gmail.com].[Leucania]

:

Apply the same trick to the SQLShare corpus, cluster the results

A not-very-interesting cluster:

Latent SQL Idioms

:

SELECT COUNT(*) FROM [ajw123@washington.edu].[table_proteins.csv] WHERE species LIKE 'Homo sapiens%'

SELECT count (*) FROM [1029880@apps.nsd.org].[Task 5] where (Hashtags_In_Text) Like '%Phailin%'

SELECT count (*) FROM [1029880@apps.nsd.org].[Task 5] where (Hashtags_In_Text) Like '%Phailin%'

SELECT Count (*) FROM [kzoehayes@gmail.com].[Dated_Join] WHERE Category = 'Warm'

SELECT COUNT (*) FROM [ethanknight08@gmail.com].[table_PopulationV2.txt] WHERE Column1='Country'

SELECT COUNT(*) FROM [missmelupton@gmail.com].[table_pHWaterTemp] WHERE TempCategory='normal'

SELECT COUNT(*) FROM [1004387@apps.nsd.org].[no retweete] WHERE hashtags_in_text LIKE '%#odisha%'

:

Another not-very-interesting cluster:

We see other clusters that seem to capture more basics: “union,”

“group by with one grouping column,” “left outer join,” “string

manipulation,” etc.
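A rough feel for schema-independent similarity can be had without learned embeddings. The sketch below swaps in plain bag-of-tokens count vectors and cosine similarity for Word2Vec or GloVe, so it is a simplification and not the method behind the clusters above. The masking rules, the function names, and the `[user_a].[t1.csv]` table are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def idiom_tokens(sql):
    """Lowercase a query and mask schema-specific parts so only its shape remains."""
    s = sql.lower()
    s = re.sub(r"\[[^\]]*\]\.\[[^\]]*\]", " TABLE ", s)  # [user].[table] references
    s = re.sub(r"'[^']*'", " STR ", s)                    # string literals
    s = re.sub(r"\b\d+(\.\d+)?\b", " NUM ", s)            # numeric literals
    return Counter(re.findall(r"[A-Za-z_]+|[^\sA-Za-z_]", s))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


q1 = "SELECT count(*) FROM [billhowe].[sunrise sunset times 2009 - 2011]"
q2 = "SELECT COUNT(*) FROM [user_a].[t1.csv]"  # hypothetical table
q3 = "SELECT Count (*) FROM [user_a].[t1.csv] WHERE Category = 'Warm'"

sim_same = cosine(idiom_tokens(q1), idiom_tokens(q2))
sim_diff = cosine(idiom_tokens(q1), idiom_tokens(q3))
print(sim_same, sim_diff)
```

Queries that differ only in table names score as identical, while adding a WHERE clause lowers the score; grouping queries by such scores is then an ordinary clustering step, omitted here.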


Latent SQL Idioms

More interesting examples:

Expressions for binning space and time columns:

select floor(latitude/0.7)*0.7 as latbin
     , floor(longitude/0.7)*0.7 as lonbin
     , species
FROM [koenigk92@gmail.com].[All3col]

Parsing a common bioinformatics file format:

select distinct case when patindex('%[0-9]%', [protein]) = 1 -- first char is number
                      and charindex(',', [protein]) = 0 -- and no comma present
                     then [protein]
                     else substring([protein], patindex('%[0-9]%', [protein]),
                                    charindex(',', [protein])-patindex('%[0-9]%', [protein]))
                end as [protein d1124],
       [tot indep spectra] as [tot spectra d1124]
from [emmats@washington.edu].[d1_file124.txt]
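The binning idiom `floor(x/0.7)*0.7` used in the latitude/longitude query above snaps each value down to the start of a fixed-width bin. A minimal sketch of the same arithmetic follows; the function name `bin_floor` and the sample coordinates are hypothetical.

```python
import math


def bin_floor(value, width):
    """Snap value down to the start of its bin, like floor(value/width)*width in SQL."""
    return math.floor(value / width) * width


latbin = bin_floor(47.65, 0.7)
# Two nearby latitudes fall into the same 0.7-degree bin ...
same_bin = bin_floor(47.65, 0.7) == bin_floor(48.0, 0.7)
# ... and floor (unlike truncation) also rounds negative longitudes downward.
lonbin = bin_floor(-122.3, 0.7)
print(latbin, same_bin, lonbin)
```

Grouping by the binned columns then yields one row per grid cell.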

MYRIA: BIG DATA POLYSTORES


Polystore Ecosystems: “Software Defined Databases”

[Figure: the Data Plane / Database sys. layer split into RDBMS, HPC / Linear Algebra, and Graphs systems, beneath the Application layer (schema, data, query logs). A polystore execution plan is shown with steps labeled “move data” and “execute query”.]

[Figure: Myria architecture. Component labels:

Data models: Tables, KeyVal, Arrays, Graphs (Myria Algebra)

Back ends: Spark, Accumulo, CombBLAS, GraphX

RACO (Relational Algebra COmpiler): MyriaL, Logical Algebra, Parallel Algebra, Array Algebra, rewrite rules

Back-end APIs: CombBLAS API, Spark API, Accumulo Graph API

Services: visualization, logging, discovery, history, browsing

Orchestration]

https://github.com/uwescience/raco

https://metanautix.com/tr/01_big_data_techniques_for_media_graphics.pdf


Ollie Lo, Los Alamos National Lab


CurGood = SCAN(public:adhoc:sc_points);
DO
    mean = [FROM CurGood EMIT val=AVG(v)];
    std = [FROM CurGood EMIT val=STDEV(v)];
    -- reject points more than 2 standard deviations from the mean
    NewBad = [FROM CurGood WHERE ABS(CurGood.v - mean) > 2 * std EMIT *];
    CurGood = CurGood - NewBad;
    continue = [FROM NewBad EMIT COUNT(NewBad.v) > 0];
WHILE continue;
DUMP(CurGood);

Sigma-clipping, V0


CurGood = P
sum = [FROM CurGood EMIT SUM(val)];
sumsq = [FROM CurGood EMIT SUM(val*val)];
cnt = [FROM CurGood EMIT COUNT(*)];
NewBad = []
DO
    -- subtract the contribution of the points rejected in the previous pass
    sum = sum - [FROM NewBad EMIT SUM(val)];
    sumsq = sumsq - [FROM NewBad EMIT SUM(val*val)];
    cnt = cnt - [FROM NewBad EMIT COUNT(*)];
    mean = sum / cnt
    std = sqrt(1/(cnt*(cnt-1)) * (cnt * sumsq - sum*sum))
    NewBad = FILTER([ABS(val-mean) > 2 * std], CurGood)
    CurGood = CurGood - NewBad
WHILE NewBad != {}

Sigma-clipping, V1: Incremental


Points = SCAN(public:adhoc:sc_points);
aggs = [FROM Points EMIT _sum=SUM(v), sumsq=SUM(v*v), cnt=COUNT(v)];
newBad = []
bounds = [FROM Points EMIT lower=MIN(v), upper=MAX(v)];
DO
    new_aggs = [FROM newBad EMIT _sum=SUM(v), sumsq=SUM(v*v), cnt=COUNT(v)];
    aggs = [FROM aggs, new_aggs EMIT _sum=aggs._sum - new_aggs._sum,
            sumsq=aggs.sumsq - new_aggs.sumsq, cnt=aggs.cnt - new_aggs.cnt];
    stats = [FROM aggs EMIT mean=_sum/cnt,
             std=SQRT(1.0/(cnt*(cnt-1)) * (cnt * sumsq - _sum * _sum))];
    newBounds = [FROM stats EMIT lower=mean - 2 * std, upper=mean + 2 * std];
    tooLow = [FROM Points, bounds, newBounds WHERE newBounds.lower > v
              AND v >= bounds.lower EMIT v=Points.v];
    tooHigh = [FROM Points, bounds, newBounds WHERE newBounds.upper < v
               AND v <= bounds.upper EMIT v=Points.v];
    newBad = UNIONALL(tooLow, tooHigh);
    bounds = newBounds;
    continue = [FROM newBad EMIT COUNT(v) > 0];
WHILE continue;
output = [FROM Points, bounds WHERE Points.v > bounds.lower AND
          Points.v < bounds.upper EMIT v=Points.v];
DUMP(output);

Sigma-clipping, V2
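The incremental idea behind the later versions, keeping a running sum, sum of squares, and count and subtracting the rejected points rather than rescanning, can be sketched outside MyriaL. This is a single-machine illustration under assumptions: the function name `sigma_clip`, the default threshold `k=2.0`, and the sample data are hypothetical, and nothing here reflects how Myria executes the plan.

```python
import math


def sigma_clip(values, k=2.0):
    """Iteratively drop points more than k sample standard deviations from the mean.

    The aggregates are updated by subtracting the contribution of newly
    rejected points instead of being recomputed from scratch.
    """
    good = list(values)
    total = sum(good)
    total_sq = sum(v * v for v in good)
    count = len(good)
    while count > 1:
        mean = total / count
        # Sample variance from the running aggregates; clamp rounding noise at 0.
        var = max(0.0, (count * total_sq - total * total) / (count * (count - 1)))
        std = math.sqrt(var)
        new_bad = [v for v in good if abs(v - mean) > k * std]
        if not new_bad:
            break
        good = [v for v in good if abs(v - mean) <= k * std]
        total -= sum(new_bad)
        total_sq -= sum(v * v for v in new_bad)
        count -= len(new_bad)
    return good


points = [10.0, 11.0, 9.0, 10.5, 9.5, 10.2, 9.8, 100.0]
clipped = sigma_clip(points)
print(clipped)
```

Only the rejected points are touched when the statistics are refreshed, which is the property the incremental versions exploit.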

Dominik Moritz

EuroVis 15

Empower the end user to do

performance profiling, debugging, etc.

Diagnosing problems

[Figure: Source node vs. Destination node]

Some ongoing work

• “from scratch” polystore optimizer

– Columbia-style, with some ideas from PL community

• Anecdotal Optimization

– Infer optimization decisions based on coarse-grained experimental

results from unreliable sources (blogs, literature)

– “System X is 2X faster than System Y on PageRank”

• Benchmarking Linear Algebra Systems vs. Databases

– HPC community thinks they are 1000X faster; they aren’t

– DB community thinks they are competitive; they aren’t

• Query compilation

– Bridge the gap between MPI and DB

• New query language Kamooks blending arrays and relations


Query compilation for distributed processing

[Figure: two ways to compile a query pipeline.

[Myers ’14]: pipeline as parallel code -> parallel compiler -> machine code

[Crotty ’14, Li ’14, Seo ’14, Murray ’11]: pipeline fragment code -> sequential compiler -> machine code]

RADISH, ICS 16, Brandon Myers


1% selection microbenchmark, 20GB

Avoid long code paths


Q2 SP2Bench, 100M triples, multiple self-joins

Communication optimization


Graph Patterns


• SP2Bench, 100 million triples

• Queries compiled to a PGAS C++ language layer, then

compiled again by a low-level PGAS compiler

• One of Myria’s supported back ends

• Comparison with Shark/Spark, which itself has been shown to

be 100X faster than Hadoop-based systems

• …plus PageRank, Naïve Bayes, and more


ICS 15


TPC-H

Some ongoing work

• “from scratch” polystore optimizer

– Columbia-style, with some ideas from PL community

• Anecdotal Optimization

– Infer optimization decisions based on coarse-grained experimental

results from unreliable sources (blogs, literature)

– “System X is 2X faster than System Y on PageRank”

• Benchmarking Linear Algebra Systems vs. Databases

– HPC community thinks they are 1000X faster; they aren’t

– DB community thinks they are competitive; they aren’t

• Query compilation

– “Software-defined Databases”

– Bridge the gap between MPI and DB

• New query language Kamooks blending arrays and relations


select A.i, B.k, sum(A.val*B.val)
from A, B
where A.j = B.j
group by A.i, B.k

Matrix multiply in RA
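The relational formulation can be run as written on a small example. The sketch below assumes SQLite as the engine and stores each matrix as (row, column, value) triples with zero entries omitted; the two 2x2 matrices are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE A (i INTEGER, j INTEGER, val REAL);
CREATE TABLE B (j INTEGER, k INTEGER, val REAL);
""")
# A = [[1, 2], [0, 3]] and B = [[4, 0], [5, 6]]; zeros are simply not stored.
conn.executemany("INSERT INTO A VALUES (?, ?, ?)",
                 [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)])
conn.executemany("INSERT INTO B VALUES (?, ?, ?)",
                 [(0, 0, 4.0), (1, 0, 5.0), (1, 1, 6.0)])

# Join on the shared index j, then sum the products for each (i, k) cell.
product = conn.execute("""
    SELECT A.i, B.k, SUM(A.val * B.val)
    FROM A, B
    WHERE A.j = B.j
    GROUP BY A.i, B.k
    ORDER BY A.i, B.k
""").fetchall()
print(product)
```

The join pairs every non-zero of A with the non-zeros of B in the matching row, so the work scales with the number of stored entries instead of the full matrix dimensions.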

Complexity of matrix multiply

[Figure: complexity exponent plotted against sparsity exponent (r s.t. m = n^r), where n = number of rows and m = number of non-zeros.

Naïve sparse algorithm: mn

Best known sparse algorithm: m^0.7 n^1.2 + n^2

Best known dense algorithm: n^2.38

Annotation: “lots of room here”]

Slide adapted from Zwick. R. Yuster and U. Zwick, Fast Sparse Matrix Multiplication

BLAS vs. SpBLAS vs. SQL (10k)

Off-the-shelf database: 15X


20k X 20k matrix multiply by sparsity

CombBLAS, MyriaX, Radish


50k X 50k matrix multiply by sparsity

CombBLAS, MyriaX, Radish

Filter to upper left corner of result matrix

select AB.i, C.m, sum(AB.val*C.val)
from
  (select A.i, B.k, sum(A.val*B.val) as val
   from A, B
   where A.j = B.j
   group by A.i, B.k
  ) AB,
  C
where AB.k = C.k
group by AB.i, C.m

A x B x C

select A.i, C.m, sum(A.val*B.val*C.val)
from A, B, C
where A.j = B.j
  and B.k = C.k
group by A.i, C.m

A(i, j, val)

B(j, k, val)

C(k, m, val)

take three sparse

matrices

Now compute

multiway hypercube join:

O (|A|/p + |B|/p^2 + |C|/p)

Group by:

~O (N)
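The two formulations of A x B x C above, one nesting the A x B product and one joining all three tables at once, should return the same cells. A small check under assumptions: SQLite as the engine, made-up sample matrices, and the inner sum given the alias `val` so that the outer query can refer to `AB.val`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE A (i INTEGER, j INTEGER, val REAL);
CREATE TABLE B (j INTEGER, k INTEGER, val REAL);
CREATE TABLE C (k INTEGER, m INTEGER, val REAL);
""")
# Sparse (row, column, value) triples; zero entries are not stored.
conn.executemany("INSERT INTO A VALUES (?, ?, ?)",
                 [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)])
conn.executemany("INSERT INTO B VALUES (?, ?, ?)",
                 [(0, 0, 4.0), (1, 0, 5.0), (1, 1, 6.0)])
conn.executemany("INSERT INTO C VALUES (?, ?, ?)",
                 [(0, 1, 1.0), (1, 0, 2.0)])

# Two binary steps: materialize A x B, then multiply the result by C.
nested = conn.execute("""
    SELECT AB.i, C.m, SUM(AB.val * C.val)
    FROM (SELECT A.i, B.k, SUM(A.val * B.val) AS val
          FROM A, B WHERE A.j = B.j GROUP BY A.i, B.k) AB, C
    WHERE AB.k = C.k
    GROUP BY AB.i, C.m
    ORDER BY AB.i, C.m
""").fetchall()

# One multiway join over all three tables, followed by a single group-by.
flat = conn.execute("""
    SELECT A.i, C.m, SUM(A.val * B.val * C.val)
    FROM A, B, C
    WHERE A.j = B.j AND B.k = C.k
    GROUP BY A.i, C.m
    ORDER BY A.i, C.m
""").fetchall()
print(nested)
print(flat)
```

Because the results agree, an optimizer is free to pick whichever plan shuffles less data, which is where the multiway hypercube join comes in.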

But wait, there’s more…..

Hypercube shuffle: 2 seconds, balanced

Partitioned hash join: 43 seconds, tons of skew

Task: self-multiply with 1M non-zeros

Scalable Graph Clustering

Seung-Hee Bae

Version 1: Parallelize best-known serial algorithm (ICDM 2013)

Version 2: Free 30% improvement for any algorithm (TKDD 2014)

Version 3: Distributed approx. algorithm, 1.5B edges (SC 2015)

http://escience.washington.edu

http://myria.cs.washington.edu

http://uwescience.github.io/sqlshare/

VIZDECK: VISUALIZATION

RECOMMENDATION


“Data Triage” Pipeline

[Figure: data sources (SAS, Excel, XML, CSV, SQL Azure); stages Files -> Tables -> Views -> Visualizations, connected by parse / extract, “relational analysis”, and visual analysis]

SIGMOD 11, SSDBM 13, SIGMOD 16

sqlshare.escience.washington.edu

CHI 12, SIGMOD 12, iConference 13

SSDBM 11, CiSE 13, SSDBM 15


[Figure: Task Completion Rate / Time, All Questions, for Fusion, VizDeck, ManyEyes, and Tableau (y-axis from 0.0 to 1.2)]

CHI 13

Visualization Recommendation

• Model each “vizlet” as a triple

(x_column, y_column, vizlet_type)

• Extract features from each column

(f1x, f2x,…, fNx, f1y, f2y, …, fNy, vizlet_type)

• Interpret each “promotion” as a yes vote and each “discard” as a

no vote

• Train a (simple) model to predict vizlet type from features

• Recommend highest-scoring vizlets

• Add a diversity term to prevent a bunch of similar plots

• Incorporate score modifiers defined by the vizlet designer

– “My bar chart looks best when there are about 5 bars.”

– “My timeseries plot ignores null values”


Example of a Learned Rule (1)

low x-entropy => bad scatter plot


bad scatter plot / good scatter plot

Example of a Learned Rule (2)

low x-entropy => histogram


bad scatter plot good histogram

Example of a Learned Rule (3)


high x-periodicity => timeseries plot

(periodicity = 1 / variance in gap length between successive values)
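The column features named in the learned rules above, entropy and periodicity, are simple to compute. The sketch below is one possible reading and not VizDeck's implementation: entropy is taken over exact value frequencies (a numeric column would likely be binned first), periodicity follows the slide's definition with gaps measured between successive sorted values, and a column needs at least two values.

```python
import math
from collections import Counter


def entropy(column):
    """Shannon entropy, in bits, of the empirical value distribution of a column."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def periodicity(column):
    """1 / variance of the gaps between successive sorted values (inf if all gaps match)."""
    xs = sorted(column)
    gaps = [b - a for a, b in zip(xs, xs[1:])]
    mean_gap = sum(gaps) / len(gaps)
    var = sum((g - mean_gap) ** 2 for g in gaps) / len(gaps)
    return math.inf if var == 0 else 1.0 / var


low_entropy = entropy(["a", "a", "a", "a"])    # a single repeated value
high_entropy = entropy(["a", "b", "c", "d"])   # all values distinct
regular = periodicity([1, 2, 3, 4, 5])         # evenly spaced, like a time axis
irregular = periodicity([1, 2, 4, 8])          # uneven gaps
print(low_entropy, high_entropy, regular, irregular)
```

Features like these, computed for the x and y columns, would form the (f1x, ..., fNy) part of the feature vector that the recommendation model is trained on.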

Voyager


Kanit “Ham” Wongsuphasawat Dominik Moritz

InfoVis 15

Within the first few queries, you’ve

touched all the tables.

SIGMOD 2016

Shrainik Jain

http://uwescience.github.io/sqlshare/
