Democratizing Data Science in the Cloud


Bill Howe, Ph.D.

Associate Director and Senior Data Science Fellow, eScience Institute

Affiliate Associate Professor, Computer Science & Engineering

11/1/2016 Bill Howe, UW 1

Cloud Data Management is about

sharing resources between tenants

We’re interested in new services powered by sharing

more than infrastructure – schema, data, queries

Why? Example: JBOT* Open Data systems

Google Fusion Tables

Entrepreneurship

1) “Data once guarded for assumed but untested

reasons is now open, and we're seeing benefits.”

-- Nigel Shadbolt, Open Data Institute

2) Need to help “non-specialists within an

organization use data that had been the

realm of programmers and DB admins”

-- Benjamin Romano, Xconomy

“Businesses are now using data the way

scientists always have”

-- Jeff Hammerbacher

Mt. Sinai, formerly Cloudera

*Just a Bunch of Tables

Data, data, data

Kevin Merritt

Socrata

Deep Dhillon

Socrata

Control Plane / Infrastructure

Data Plane / Database sys.

Application / schema, data, query logs

Benefits: Significantly reduced management overhead

Challenges: security, scheduling, SLAs, isolation

Virtualization


DB-as-a-Service

Benefits: Significantly reduced management overhead

Challenges: security, scheduling, SLAs, isolation


JBOT* Query-as-a-Service Systems

smart cross-tenant services,

trained on everyone’s data

• Metadata inference and data curation

• Query recommendation via common idioms

• Data discovery – e.g., “find me things to join with”

• Visualization recommendation

• Semi-automatic integration services


*Just a Bunch of Tables

Example Service: Automated Data Curation

Microarray samples submitted to the Gene Expression Omnibus

Curation is fast becoming the

bottleneck to data sharing

Maxim Gretchkin, Hoifung Poon

Example Service: Automated Data Curation

Maxim Gretchkin, Hoifung Poon

Goal: Repair metadata for genetic

datasets using the content of the data, the

structure of an associated ontology, the

abstract of the paper, and everything else.

Deep Neural Network

Tissue Type Labels

Innovations in transfer learning,

poor training data, etc.

Abstract

Example Service: Automated Data Curation

Maxim Gretchkin, Hoifung Poon

Iterative co-learning between a text-based classifier and an expression-based classifier: both models improve by training on each other's results

• SQLShare: Query-as-a-Service

• VizDeck: Visualization recommendation

• Myria: Big Data Ecosystems

VizDeck

Some Cloud Data Systems

1) Upload data “as is”

Cloud-hosted, secure; no

need to install or design a

database; no pre-defined

schema; schema inference;

some integration

2) Write Queries

Right in your browser,

writing views on top of

views on top of views ...

SELECT hit, COUNT(*) AS cnt

FROM tigrfam_surface

GROUP BY hit

ORDER BY cnt DESC

3) Share the results

Make them public, tag them,

share with specific colleagues –

anyone with access can query

http://sqlshare.escience.washington.edu


SIGMOD 2016

SELECT x.strain, x.chr, x.region as snp_region, x.start_bp as snp_start_bp

, x.end_bp as snp_end_bp, w.start_bp as nc_start_bp, w.end_bp as nc_end_bp

, w.category as nc_category

, CASE WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)

THEN x.end_bp - x.start_bp + 1

WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)

THEN x.end_bp - w.start_bp + 1

WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)

THEN w.end_bp - x.start_bp + 1

END AS len_overlap

FROM [koesterj@washington.edu].[hotspots_deserts.tab] x

INNER JOIN [koesterj@washington.edu].[table_noncoding_positions.tab] w

ON x.chr = w.chr

WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)

OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)

OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)

ORDER BY x.strain, x.chr ASC, x.start_bp ASC

Non-programmers can write very complex queries

(rather than relying on staff programmers)

Example: Computing the overlaps of two sets of blast results
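As a cross-check on the CASE expression in the query, the overlap length of two inclusive intervals can be written in closed form. This is a minimal sketch, not the query author's method; the function name `overlap_length` is hypothetical, and it assumes inclusive base-pair coordinates as in the query. Note that the closed form also handles the case where one interval lies strictly inside the other, which the second CASE branch would overestimate.

```python
def overlap_length(a_start, a_end, b_start, b_end):
    """Number of positions shared by two inclusive intervals (0 if disjoint)."""
    # The shared region runs from the later start to the earlier end.
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


if __name__ == "__main__":
    print(overlap_length(10, 20, 15, 30))   # partial overlap
    print(overlap_length(10, 20, 1, 100))   # first interval contained in second
    print(overlap_length(1, 5, 6, 9))       # disjoint intervals
```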

We see thousands of

queries written by

non-programmers

The SQLShare Corpus:

A multi-year log of hand-written analytics queries

Queries 24275

Tables 3891

Users 591

SIGMOD 2016

Shrainik Jain

https://uwescience.github.io/sqlshare

A SQL “learner”

http://uwescience.github.io/sqlshare/

Latent Idioms for Schema-Independent Query Recommendation

Background on Word2Vec, GloVe:

Map each term in a corpus to a vector in a high-dimensional space based on its co-occurrences.

Linear relationships between these vectors appear to capture remarkable semantic properties.
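The co-occurrence idea can be illustrated on SQL tokens with a toy sketch. This is not Word2Vec or GloVe themselves, only raw co-occurrence counting with cosine similarity, and the four-query corpus, the whitespace tokenizer, and the window size are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(queries, window=2):
    """Map each token to a sparse vector of counts of its nearby tokens."""
    vectors = defaultdict(Counter)
    for query in queries:
        tokens = query.lower().split()
        for i, tok in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[tok][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical toy corpus, pre-tokenized with spaces.
corpus = [
    "select count ( * ) from t1",
    "select count ( * ) from t2 where x = 1",
    "select avg ( v ) from t1 group by k",
    "select sum ( v ) from t2 group by k",
]
vecs = cooccurrence_vectors(corpus)
```

In this toy corpus, `avg` and `sum` appear in identical contexts, so their vectors are closer to each other than either is to `where`. Real embeddings learn dense vectors rather than using raw counts.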

SELECT COUNT(*) FROM [candrzejowiec@yahoo.com].[table_Firearms.txt]

SELECT COUNT (HiLo) FROM [roula.cardaras@gmail.com].[table_MUK.csv]

SELECT count(*) FROM [leslie@westerncatholic.org].[Depth_combined]

select count(Wave_Height) from [christa.kohnert@gmail.com].[Join]

SELECT count(*) FROM [wenjunh@washington.edu].[ecoli_nogaps_1.csv]

SELECT Count(*) FROM [latcron@gmail.com].[TargetTrackFeatures.csv]

SELECT count(*) FROM [billhowe].[sunrise sunset times 2009 - 2011]

SELECT Count(*) FROM [bifxcore@gmail.com].[table_ec_pdb_genus.csv]

SELECT count(*) FROM [whitead@washington.edu].[ecoli_nogaps_1.csv]

SELECT COUNT(*) FROM [ribalet@washington.edu].[Tokyo_0_merged.csv]

SELECT COUNT(*) FROM [dhalperi@washington.edu].[SPID_GOnumber.txt]

SELECT COUNT (species) FROM [bigbananatopdog@gmail.com].[Orthosia]

SELECT COUNT (species) FROM [bigbananatopdog@gmail.com].[Leucania]

Apply the same trick to the SQLShare corpus, cluster the results

A not-very-interesting cluster:

Latent SQL Idioms

SELECT COUNT(*) FROM [ajw123@washington.edu].[table_proteins.csv] WHERE species LIKE 'Homo sapiens%'

SELECT count (*) FROM [1029880@apps.nsd.org].[Task 5] where (Hashtags_In_Text) Like '%Phailin%'

SELECT Count (*) FROM [kzoehayes@gmail.com].[Dated_Join] WHERE Category = 'Warm'

SELECT COUNT (*) FROM [ethanknight08@gmail.com].[table_PopulationV2.txt] WHERE Column1='Country'

SELECT COUNT(*) FROM [missmelupton@gmail.com].[table_pHWaterTemp] WHERE TempCategory='normal'

SELECT COUNT(*) FROM [1004387@apps.nsd.org].[no retweete] WHERE hashtags_in_text LIKE '%#odisha%'

Another not-very-interesting cluster:

We see other clusters that seem to capture more basics: “union,”

“group by with one grouping column,” “left outer join,” “string

manipulation,” etc.

Latent SQL Idioms

More interesting examples:

select floor(latitude/0.7)*0.7 as latbin

, floor(longitude/0.7)*0.7 as lonbin

, species

FROM [koenigk92@gmail.com].[All3col]

select distinct case when patindex('%[0-9]%', [protein]) = 1 -- first char is number

and charindex(',', [protein]) = 0 -- and no comma present

then [protein]

else substring([protein], patindex('%[0-9]%', [protein]),

charindex(',', [protein])-patindex('%[0-9]%', [protein]))

end as [protein d1124],

[tot indep spectra] as [tot spectra d1124]

from [emmats@washington.edu].[d1_file124.txt]

Parsing a common

bioinformatics file format

Expressions for binning

space and time columns

MYRIA: BIG DATA POLYSTORES


Polystore Ecosystems: “Software Defined Databases”

Data Plane /

Database sys.

Application /

schema, data,

query logs

RDBMS HPC / Linear Algebra Graphs

Polystore

Execution

execute

Polystore

Execution

Tables KeyVal Arrays Graphs

Myria Algebra

Tables KeyVal Arrays Graphs

Spark Accumulo CombBLAS GraphX

Parallel Algebra

Logical Algebra

RACO: Relational Algebra COmpiler

CombBLAS API

Spark API

Accumulo Graph API

rewrite rules

Array Algebra

MyriaL

Services: visualization, logging, discovery, history, browsing

Orchestration

https://github.com/uwescience/raco

https://metanautix.com/tr/01_big_data_techniques_for_media_graphics.pdf


Ollie Lo, Los Alamos National Lab

CurGood = SCAN(public:adhoc:sc_points);
DO
  mean = [FROM CurGood EMIT val=AVG(v)];
  std = [FROM CurGood EMIT val=STDEV(v)];
  NewBad = [FROM CurGood WHERE ABS(CurGood.v - mean) > 2 * std EMIT *];
  CurGood = CurGood - NewBad;
  continue = [FROM NewBad EMIT COUNT(NewBad.v) > 0];
WHILE continue;
DUMP(CurGood);

Sigma-clipping, V0

CurGood = P
sum = [FROM CurGood EMIT SUM(val)];
sumsq = [FROM CurGood EMIT SUM(val*val)];
cnt = [FROM CurGood EMIT CNT(*)];
NewBad = []
DO
  sum = sum - [FROM NewBad EMIT SUM(val)];
  sumsq = sumsq - [FROM NewBad EMIT SUM(val*val)];
  cnt = cnt - [FROM NewBad EMIT CNT(*)];
  mean = sum / cnt
  std = sqrt(1/(cnt*(cnt-1)) * (cnt * sumsq - sum*sum))
  NewBad = FILTER([ABS(val-mean) > 2 * std], CurGood)
  CurGood = CurGood - NewBad
WHILE NewBad != {}

Sigma-clipping, V1: Incremental
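The incremental idea can be sketched outside MyriaL as follows. This is an illustrative single-machine version, not the Myria implementation; the function name `sigma_clip_incremental` and the default threshold `k=2.0` are assumptions. The running sum, sum of squares, and count are adjusted by subtracting only the newly rejected points, so the statistics are never recomputed from scratch.

```python
import math


def sigma_clip_incremental(values, k=2.0):
    """Iteratively drop points more than k standard deviations from the mean."""
    good = list(values)
    total = sum(good)
    total_sq = sum(v * v for v in good)
    count = len(good)
    while count > 1:
        mean = total / count
        # Sample variance from running aggregates, as on the slide.
        var = (count * total_sq - total * total) / (count * (count - 1))
        std = math.sqrt(max(var, 0.0))
        new_bad = [v for v in good if abs(v - mean) > k * std]
        if not new_bad:
            break
        # Incremental update: subtract only the contribution of rejected points.
        total -= sum(new_bad)
        total_sq -= sum(v * v for v in new_bad)
        count -= len(new_bad)
        good = [v for v in good if abs(v - mean) <= k * std]
    return good


if __name__ == "__main__":
    print(sigma_clip_incremental([10, 11, 9, 10, 12, 10, 11, 9, 10, 1000]))
```

The filter over `good` still scans the surviving points on each pass; V2 reduces that work by tracking only the bounds that changed.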

Points = SCAN(public:adhoc:sc_points);

aggs = [FROM Points EMIT _sum=SUM(v), sumsq=SUM(v*v), cnt=COUNT(v)];

newBad = []

bounds = [FROM Points EMIT lower=MIN(v), upper=MAX(v)];

DO

new_aggs = [FROM newBad EMIT _sum=SUM(v), sumsq=SUM(v*v), cnt=COUNT(v)];

aggs = [FROM aggs, new_aggs EMIT _sum=aggs._sum - new_aggs._sum,

sumsq=aggs.sumsq - new_aggs.sumsq, cnt=aggs.cnt - new_aggs.cnt];

stats = [FROM aggs EMIT mean=_sum/cnt,

std=SQRT(1.0/(cnt*(cnt-1)) * (cnt * sumsq - _sum * _sum))];

newBounds = [FROM stats EMIT lower=mean - 2 * std, upper=mean + 2 * std];

tooLow = [FROM Points, bounds, newBounds WHERE newBounds.lower > v

AND v >= bounds.lower EMIT v=Points.v];

tooHigh = [FROM Points, bounds, newBounds WHERE newBounds.upper < v

AND v <= bounds.upper EMIT v=Points.v];

newBad = UNIONALL(tooLow, tooHigh);

bounds = newBounds;

continue = [FROM newBad EMIT COUNT(v) > 0];

WHILE continue;

output = [FROM Points, bounds WHERE Points.v > bounds.lower AND

Points.v < bounds.upper EMIT v=Points.v];

DUMP(output);

Sigma-clipping, V2

Dominik Moritz

EuroVis 15

Empower the end user to do

performance profiling, debugging, etc.

Diagnosing problems

Destination node

Dominik Moritz

EuroVis 15

Some ongoing work

• “from scratch” polystore optimizer

– Columbia-style, with some ideas from PL community

• Anecdotal Optimization

– Infer optimization decisions based on coarse-grained experimental

results from unreliable sources (blogs, literature)

– “System X is 2X faster than System Y on PageRank”

• Benchmarking Linear Algebra Systems vs. Databases

– HPC community thinks they are 1000X faster; they aren’t

– DB community thinks they are competitive; they aren’t

• Query compilation

– Bridge the gap between MPI and DB

• New query language Kamooks blending arrays and relations


Query compilation for distributed processing

pipeline

parallel

parallel compiler

machine

[Myers ’14]

pipeline

fragment

pipeline

fragment

sequential

compiler

machine

[Crotty ’14, Li ’14, Seo ’14, Murray ’11]

sequential

compiler

RADISH

ICS 16

Brandon


1% selection microbenchmark, 20GB

Avoid long code paths

ICS 16

Brandon


Q2 SP2Bench, 100M triples, multiple self-joins

Communication optimization

ICS 16

Brandon

Graph Patterns

• SP2Bench, 100 million triples

• Queries compiled to a PGAS C++ language layer, then

compiled again by a low-level PGAS compiler

• One of Myria’s supported back ends

• Comparison with Shark/Spark, which itself has been shown to

be 100X faster than Hadoop-based systems

• …plus PageRank, Naïve Bayes, and more

RADISH

ICS 16

Brandon


ICS 15

RADISH

ICS 16

Brandon

Some ongoing work

– “Software-defined Databases”


select A.i, B.k, sum(A.val*B.val)

from A, B

where A.j = B.j

group by A.i, B.k

Matrix multiply in RA
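The query above treats each sparse matrix as a relation of (row, column, value) triples: the join on `A.j = B.j` pairs up entries that contribute to the same dot product, and the group-by sums them. A minimal Python sketch of the same join-then-aggregate plan follows; the function name `sparse_matmul` and the sample matrices are illustrative assumptions.

```python
from collections import defaultdict


def sparse_matmul(A, B):
    """Multiply sparse matrices given as triples: A(i, j, val) and B(j, k, val)."""
    # Build a hash index on B's join column j (the build side of a hash join).
    b_by_j = defaultdict(list)
    for j, k, val in B:
        b_by_j[j].append((k, val))
    # Probe with A, then aggregate products by the output cell (i, k).
    out = defaultdict(float)
    for i, j, a_val in A:
        for k, b_val in b_by_j.get(j, []):
            out[(i, k)] += a_val * b_val
    return dict(out)


# A = [[1, 2], [0, 3]] and B = [[4, 0], [5, 6]], stored as non-zero triples.
A = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]
B = [(0, 0, 4.0), (1, 0, 5.0), (1, 1, 6.0)]
product = sparse_matmul(A, B)
```

Only non-zero entries are stored or touched, which is why the relational plan can be competitive on sparse inputs.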

Matrix multiply

Complexity of matrix multiply

[Plot: complexity exponent vs. sparsity exponent (r s.t. m = n^r)]

n = number of rows

m = number of non-zeros

Curves: naïve sparse algorithm; best known sparse algorithm, m^0.7 n^1.2 + n^2; best known algorithm

lots of room

slide adapted from Zwick

R. Yuster and U. Zwick, Fast Sparse Matrix Multiplication

BLAS vs. SpBLAS vs. SQL (10k)

off the shelf database


20k X 20k matrix multiply by sparsity

CombBLAS, MyriaX, Radish


50k X 50k matrix multiply by sparsity

CombBLAS, MyriaX, Radish

Filter to upper left corner of result matrix

select AB.i, C.m, sum(AB.val*C.val)
from (select A.i, B.k, sum(A.val*B.val) as val
      from A, B
      where A.j = B.j
      group by A.i, B.k) AB, C
where AB.k = C.k
group by AB.i, C.m

A x B x C

select A.i, C.m, sum(A.val*B.val*C.val)

from A, B, C

where A.j = B.j

and B.k = C.k

group by A.i, C.m

A(i, j, val)

B(j, k, val)

C(k, m, val)

take three sparse

matrices

Now compute

multiway hypercube join:

O (|A|/p + |B|/p^2 + |C|/p)

Group by:

~O (N)
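The per-worker load in the hypercube bound can be seen in a small simulation. This is a single-process sketch under stated assumptions, not Myria's shuffle operator: workers are arranged in a hypothetical p x p grid indexed by (hash(j) mod p, hash(k) mod p), and the function name `hypercube_join` is invented for illustration. A tuples know only j, so each is replicated to p workers; C tuples know only k, so likewise; each B tuple goes to exactly one worker.

```python
from collections import defaultdict


def hypercube_join(A, B, C, p):
    """Compute A(i,j) x B(j,k) x C(k,m) with one shuffle over a p x p worker grid."""
    workers = defaultdict(lambda: {"A": [], "B": [], "C": []})
    for i, j, v in A:                      # replicate A along the k dimension
        for hk in range(p):
            workers[(hash(j) % p, hk)]["A"].append((i, j, v))
    for j, k, v in B:                      # B lands on exactly one worker
        workers[(hash(j) % p, hash(k) % p)]["B"].append((j, k, v))
    for k, m, v in C:                      # replicate C along the j dimension
        for hj in range(p):
            workers[(hj, hash(k) % p)]["C"].append((k, m, v))

    result = defaultdict(float)
    for parts in workers.values():         # each worker joins locally
        a_by_j = defaultdict(list)
        for i, j, v in parts["A"]:
            a_by_j[j].append((i, v))
        c_by_k = defaultdict(list)
        for k, m, v in parts["C"]:
            c_by_k[k].append((m, v))
        for j, k, b_val in parts["B"]:
            for i, a_val in a_by_j[j]:
                for m, c_val in c_by_k[k]:
                    result[(i, m)] += a_val * b_val * c_val   # group by (i, m)
    return dict(result)


A = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]
B = [(0, 0, 4.0), (1, 0, 5.0), (1, 1, 6.0)]
C = [(0, 0, 1.0), (1, 0, 2.0), (1, 1, 1.0)]
abc = hypercube_join(A, B, C, p=2)
```

Because every B tuple lives on exactly one worker, each matching (A, B, C) combination is produced exactly once, so no deduplication is needed before the final aggregation.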

But wait, there’s more…..

Hypercube shuffle: 2 seconds, balanced

Partitioned hash join: 43 seconds, tons of skew

Task: self-multiply with 1M non-zeros

Seung-Hee Bae

Scalable Graph Clustering

Version 1: Parallelize best-known serial algorithm (ICDM 2013)

Version 2: Free 30% improvement for any algorithm (TKDD 2014)

Version 3: Distributed approx. algorithm, 1.5B edges (SC 2015)

http://escience.washington.edu

http://myria.cs.washington.edu

VIZDECK: VISUALIZATION RECOMMENDATION


“Data Triage” Pipeline

SQL Azure

Files Tables Views

parse /

extract

“relational

analysis”

visual

analysis

Visualizations

SIGMOD 11, SSDBM 13, SIGMOD 16

sqlshare.escience.washington.edu

CHI 12, SIGMOD 12, iConference 13

SSDBM 11, CiSE 13, SSDBM 15


Fusion VizDeck ManyEyes Tableau

Task Completion Rate / Time - All Questions

CHI 13

Visualization Recommendation

• Model each “vizlet” as a triple

(x_column, y_column, vizlet_type)

• Extract features from each column

(f1x, f2x,…, fNx, f1y, f2y, …, fNy, vizlet_type)

• Interpret each “promotion” as a yes vote and each “discard” as a

no vote

• Train a (simple) model to predict vizlet type from features

• Recommend highest-scoring vizlets

• Add a diversity term to prevent a bunch of similar plots

• Incorporate score modifiers defined by the vizlet designer

– “My bar chart looks best when there are about 5 bars.”

– “My timeseries plot ignores null values”
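The ranking step with a diversity term might look like the following sketch. It assumes model scores have already been computed for each vizlet; the function name `recommend`, the specific penalty (a fixed deduction per already-chosen vizlet of the same type), and the sample scores are assumptions for illustration, not VizDeck's actual formula.

```python
def recommend(vizlets, scores, k=3, diversity_penalty=0.5):
    """Greedily pick k vizlets, penalizing types that were already chosen.

    vizlets: list of (x_column, y_column, vizlet_type) triples.
    scores:  parallel list of model scores, higher is better.
    """
    chosen = []
    remaining = list(range(len(vizlets)))
    while remaining and len(chosen) < k:
        def adjusted(idx):
            same_type = sum(1 for c in chosen if vizlets[c][2] == vizlets[idx][2])
            return scores[idx] - diversity_penalty * same_type
        best = max(remaining, key=adjusted)
        chosen.append(best)
        remaining.remove(best)
    return [vizlets[i] for i in chosen]


# Hypothetical candidates and scores.
vizlets = [
    ("a", "b", "scatter"),
    ("a", "c", "scatter"),
    ("a", "", "histogram"),
    ("t", "b", "timeseries"),
]
scores = [0.9, 0.85, 0.6, 0.5]
top = recommend(vizlets, scores)
```

With the penalty set to zero the two scatter plots would both be recommended; with the penalty, the second scatter plot is displaced by a histogram and a timeseries.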


Example of a Learned Rule (1)

low x-entropy => bad scatter plot


bad scatter plot

good scatter plot

low x-entropy => histogram


bad scatter plot good histogram

high x-periodicity => timeseries plot

(periodicity = 1 / variance in gap length between successive values)
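The two column features named in these rules can be computed as in the sketch below. The function names are hypothetical, entropy is taken as Shannon entropy over distinct values, and "successive values" is assumed to mean values in sorted order; the slides do not specify either choice.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of the distribution of distinct values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def periodicity(values):
    """1 / variance in gap length between successive (sorted) values."""
    xs = sorted(values)
    gaps = [b - a for a, b in zip(xs, xs[1:])]
    if not gaps:
        return 0.0
    mean = sum(gaps) / len(gaps)
    var = sum((g - mean) ** 2 for g in gaps) / len(gaps)
    # Perfectly regular spacing has zero variance, i.e. unbounded periodicity.
    return math.inf if var == 0 else 1.0 / var


if __name__ == "__main__":
    print(entropy([1, 1, 1, 1]), entropy([1, 2, 3, 4]))
    print(periodicity([0, 10, 20, 30]), periodicity([0, 1, 10, 11]))
```

A column with few distinct values has low entropy (a poor scatter-plot axis, a reasonable histogram), while evenly spaced values have high periodicity (a plausible time axis).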

Voyager


Kanit “Ham” Wongsuphasawat Dominik Moritz

InfoVis 15

Within the first few queries, you’ve

touched all the tables.

SIGMOD 2016

Shrainik Jain
