Democratizing Data Science
in the Cloud
Bill Howe, Ph.D.
Associate Director and Senior Data Science Fellow, eScience Institute
Affiliate Associate Professor, Computer Science & Engineering
11/1/2016 Bill Howe, UW 1
Cloud Data Management is about
sharing resources between tenants
We’re interested in new services powered by sharing
more than infrastructure – schema, data, queries
Why? Example: JBOT* Open Data systems
Google Fusion Tables
Entrepreneurship
1) “Data once guarded for assumed but untested
reasons is now open, and we're seeing benefits.”
-- Nigel Shadbolt, Open Data Institute
2) Need to help “non-specialists within an
organization use data that had been the
realm of programmers and DB admins”
-- Benjamin Romano, Xconomy
“Businesses are now using data the way
scientists always have”
-- Jeff Hammerbacher
Mt. Sinai, formerly Cloudera
*Just a Bunch of Tables
Data, data, data
Kevin Merrit
CEO
Socrata
Deep Dhillon
CTO
Socrata
[Figure: layered stack serving tenant queries (Q Q Q ….): Control Plane / Infrastructure; Data Plane / Database sys.; Application / schema, data, query logs]
Benefits: Significantly reduced management overhead
Challenges: security, scheduling, SLAs, isolation
Virtualization
DB-as-a-Service
Benefits: Significantly reduced management overhead
Challenges: security, scheduling, SLAs, isolation
JBOT* Query-as-a-Service Systems
Goal:
smart cross-tenant services,
trained on everyone’s data
• Metadata inference and data curation
• Query recommendation via common idioms
• Data discovery – e.g., “find me things to join with”
• Visualization recommendation
• Semi-automatic integration services
*Just a Bunch of Tables
Example Service: Automated Data Curation
Microarray samples submitted to the Gene Expression Omnibus
Curation is fast becoming the bottleneck to data sharing
Maxim Gretchkin, Hoifung Poon
Example Service: Automated Data Curation
Goal: Repair metadata for genetic
datasets using the content of the data, the
structure of an associated ontology, the
abstract of the paper, and everything else.
Deep Neural Network
Tissue Type Labels
Innovations in transfer learning,
poor training data, etc.
Paper
Abstract
Example Service: Automated Data Curation
Iterative co-learning between a text-based classifier and an expression-based classifier: both models improve by training on each other's results
Some Cloud Data Systems
• SQLShare: Query-as-a-Service
• VizDeck: Visualization recommendation
• Myria: Big Data Ecosystems
1) Upload data “as is”
Cloud-hosted, secure; no
need to install or design a
database; no pre-defined
schema; schema inference;
some integration
2) Write Queries
Right in your browser,
writing views on top of
views on top of views ...
SELECT hit, COUNT(*) AS cnt
FROM tigrfam_surface
GROUP BY hit
ORDER BY cnt DESC
3) Share the results
Make them public, tag them,
share with specific colleagues –
anyone with access can query
http://sqlshare.escience.washington.edu
SIGMOD 2016
SELECT x.strain, x.chr, x.region as snp_region, x.start_bp as snp_start_bp
, x.end_bp as snp_end_bp, w.start_bp as nc_start_bp, w.end_bp as nc_end_bp
, w.category as nc_category
, CASE WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
THEN x.end_bp - x.start_bp + 1
WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
THEN x.end_bp - w.start_bp + 1
WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
THEN w.end_bp - x.start_bp + 1
END AS len_overlap
FROM [[email protected]].[hotspots_deserts.tab] x
INNER JOIN [[email protected]].[table_noncoding_positions.tab] w
ON x.chr = w.chr
WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
ORDER BY x.strain, x.chr ASC, x.start_bp ASC
Non-programmers can write very complex queries
(rather than relying on staff programmers)
Example: Computing the overlaps of two sets of blast results
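The overlap length that the CASE expression in the query enumerates branch by branch can also be written as one min/max formula. A minimal Python sketch, assuming inclusive base-pair coordinates as in the query; the function name `overlap_len` is hypothetical and not part of SQLShare:

```python
def overlap_len(x_start, x_end, w_start, w_end):
    """Length of the overlap of two inclusive intervals, or 0 if they are disjoint."""
    # The overlap runs from the later of the two starts to the earlier of the two ends.
    start = max(x_start, w_start)
    end = min(x_end, w_end)
    return max(0, end - start + 1)


# The three situations distinguished by the CASE branches:
print(overlap_len(10, 20, 5, 30))   # x lies inside w
print(overlap_len(10, 20, 15, 30))  # w starts inside x
print(overlap_len(10, 20, 5, 12))   # w ends inside x
```

The single formula also returns the right length when w lies entirely inside x, a situation the three branches do not single out.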
We see thousands of
queries written by
non-programmers
The SQLShare Corpus:
A multi-year log of hand-written analytics queries
Queries 24275
Views 4535
Tables 3891
Users 591
SIGMOD 2016
Shrainik Jain
https://uwescience.github.io/sqlshare
A SQL “learner”
http://uwescience.github.io/sqlshare/
Latent Idioms for Schema-Independent Query Recommendation
Background on Word2Vec, GloVe: Map each term in a corpus to a vector in a high-dimensional space based on its co-occurrences. Linear relationships between these vectors appear to capture remarkable semantic properties.
:
SELECT COUNT(*) FROM [[email protected]].[table_Firearms.txt]
SELECT COUNT (HiLo) FROM [[email protected]].[table_MUK.csv]
SELECT count(*) FROM [[email protected]].[Depth_combined]
select count(Wave_Height) from [[email protected]].[Join]
SELECT count(*) FROM [[email protected]].[ecoli_nogaps_1.csv]
SELECT Count(*) FROM [[email protected]].[TargetTrackFeatures.csv]
SELECT count(*) FROM [billhowe].[sunrise sunset times 2009 - 2011]
SELECT Count(*) FROM [[email protected]].[table_ec_pdb_genus.csv]
SELECT count(*) FROM [[email protected]].[ecoli_nogaps_1.csv]
SELECT COUNT(*) FROM [[email protected]].[Tokyo_0_merged.csv]
SELECT COUNT(*) FROM [[email protected]].[SPID_GOnumber.txt]
SELECT COUNT (species) FROM [[email protected]].[Orthosia]
SELECT COUNT (species) FROM [[email protected]].[Leucania]
:
Apply the same trick to the SQLShare corpus, cluster the results
A not-very-interesting cluster:
Latent SQL Idioms
:
SELECT COUNT(*) FROM [[email protected]].[table_proteins.csv] WHERE species LIKE 'Homo sapiens%'
SELECT count (*) FROM [[email protected]].[Task 5] where (Hashtags_In_Text) Like '%Phailin%'
SELECT count (*) FROM [[email protected]].[Task 5] where (Hashtags_In_Text) Like '%Phailin%'
SELECT Count (*) FROM [[email protected]].[Dated_Join] WHERE Category = 'Warm'
SELECT COUNT (*) FROM [[email protected]].[table_PopulationV2.txt] WHERE Column1='Country'
SELECT COUNT(*) FROM [[email protected]].[table_pHWaterTemp] WHERE TempCategory='normal'
SELECT COUNT(*) FROM [[email protected]].[no retweete] WHERE hashtags_in_text LIKE '%#odisha%'
:
Another not-very-interesting cluster:
We see other clusters that seem to capture more basics: “union,”
“group by with one grouping column,” “left outer join,” “string
manipulation,” etc.
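To make schema-independent grouping concrete, here is a minimal sketch. It deliberately swaps the Word2Vec-style embedding for something much simpler: each query is reduced to a template by masking table names, literals, and identifiers, and queries that share a template land in the same group. The function names (`template`, `group_by_idiom`), the masking rules, and the user names in the sample queries are assumptions for illustration, not the method applied to the SQLShare corpus.

```python
import re
from collections import defaultdict

# Small, hypothetical keyword list; a real tokenizer would use the full SQL grammar.
KEYWORDS = {"select", "from", "where", "count", "like", "group", "by",
            "order", "and", "or", "as", "distinct"}


def template(query):
    """Reduce a query to a schema-independent template string."""
    q = query.lower()
    q = re.sub(r"\[[^\]]*\]\.\[[^\]]*\]", " <table> ", q)  # [user].[table] names
    q = re.sub(r"'[^']*'", " <str> ", q)                    # string literals
    q = re.sub(r"\b\d+(?:\.\d+)?\b", " <num> ", q)          # numeric literals
    out = []
    for tok in re.findall(r"<\w+>|\w+|[^\s\w]", q):
        # Keep placeholders, keywords, and punctuation; mask everything else.
        if tok.startswith("<") or tok in KEYWORDS or not tok[0].isalnum():
            out.append(tok)
        else:
            out.append("<id>")
    return " ".join(out)


def group_by_idiom(queries):
    """Group queries whose templates are identical."""
    groups = defaultdict(list)
    for q in queries:
        groups[template(q)].append(q)
    return dict(groups)


QUERIES = [
    "SELECT COUNT(*) FROM [alice].[table_Firearms.txt]",
    "select count(Wave_Height) from [bob].[Join]",
    "SELECT count (*) FROM [carol].[Depth_combined]",
    "SELECT COUNT(*) FROM [dave].[t.csv] WHERE Category = 'Warm'",
]

for tpl, members in group_by_idiom(QUERIES).items():
    print(len(members), tpl)
```

Exact template matching only recovers the simplest idioms; an embedding-based approach can also place near-identical queries close together.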
Latent SQL Idioms
More interesting examples:
select floor(latitude/0.7)*0.7 as latbin
, floor(longitude/0.7)*0.7 as lonbin
, species
FROM [[email protected]].[All3col]
select distinct case when patindex('%[0-9]%', [protein]) = 1 -- first char is number
and charindex(',', [protein]) = 0 -- and no comma present
then [protein]
else substring([protein], patindex('%[0-9]%', [protein]),
charindex(',', [protein])-patindex('%[0-9]%', [protein]))
end as [protein d1124],
[tot indep spectra] as [tot spectra d1124]
from [[email protected]].[d1_file124.txt]
Parsing a common
bioinformatics file format
Expressions for binning
space and time columns
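The binning idiom in the first query, floor(x / width) * width, snaps each coordinate to the lower edge of a fixed-width bin so that nearby points share a key. A small Python sketch of the same arithmetic; the helper name `bin_floor` and the sample coordinates are made up for illustration:

```python
import math


def bin_floor(value, width):
    """Snap a value down to the lower edge of its bin, like floor(v / width) * width in SQL."""
    return math.floor(value / width) * width


# Two nearby (hypothetical) points fall into the same 0.7-degree cell.
p1 = (bin_floor(47.65, 0.7), bin_floor(-122.30, 0.7))
p2 = (bin_floor(47.61, 0.7), bin_floor(-122.33, 0.7))
print(p1 == p2)
```

Because floor rounds toward negative infinity, the same expression works for negative longitudes without a special case.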
MYRIA: BIG DATA POLYSTORES
Polystore Ecosystems: “Software Defined Databases”
[Figure: Myria polystore architecture]
Back-end classes: RDBMS, HPC / Linear Algebra, Graphs
Polystore Execution Plan: move data, execute query
Data models: Tables, KeyVal, Arrays, Graphs
Back ends: Spark, Accumulo, CombBLAS, GraphX
Back-end APIs: Spark API, Accumulo Graph API, CombBLAS API
RACO (Relational Algebra COmpiler): MyriaL, Logical Algebra (Myria Algebra, Array Algebra), rewrite rules, Parallel Algebra
Services: visualization, logging, discovery, history, browsing
Orchestration
https://github.com/uwescience/raco
https://metanautix.com/tr/01_big_data_techniques_for_media_graphics.pdf
Ollie Lo, Los Alamos National Lab
CurGood = SCAN(public:adhoc:sc_points);
DO
mean = [FROM CurGood EMIT val=AVG(v)];
std = [FROM CurGood EMIT val=STDEV(v)];
NewBad = [FROM CurGood WHERE ABS(CurGood.v - mean) > 2 * std EMIT *];
CurGood = CurGood - NewBad;
continue = [FROM NewBad EMIT COUNT(NewBad.v) > 0];
WHILE continue;
DUMP(CurGood);
Sigma-clipping, V0
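For readers unfamiliar with MyriaL, the loop above can be mirrored in plain Python: recompute the mean and standard deviation over the surviving points, drop everything more than two standard deviations from the mean, and repeat until nothing is dropped. This is a sketch of the same idea, not Myria code; the function name `sigma_clip` and the sample data are assumptions.

```python
import statistics


def sigma_clip(values, k=2.0):
    """Repeatedly discard points farther than k standard deviations from the mean."""
    good = list(values)
    while len(good) > 1:
        # Full rescan of the surviving points on every iteration, as in V0.
        mean = statistics.mean(good)
        std = statistics.stdev(good)
        bad = [v for v in good if abs(v - mean) > k * std]
        if not bad:
            break
        good = [v for v in good if abs(v - mean) <= k * std]
    return good


# Hypothetical measurements with one obvious outlier.
points = [10.0, 10.1, 9.9, 10.2, 9.8, 10.0, 50.0]
print(sigma_clip(points))
```

Each pass rescans all surviving points twice (once per aggregate), which is the cost the incremental variants try to avoid.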
CurGood = P
-- compute the aggregates once, then maintain them incrementally
sum = [FROM CurGood EMIT SUM(val)];
sumsq = [FROM CurGood EMIT SUM(val*val)];
cnt = [FROM CurGood EMIT CNT(*)];
NewBad = []
DO
-- subtract only the contribution of the points removed in the last round
sum = sum - [FROM NewBad EMIT SUM(val)];
sumsq = sumsq - [FROM NewBad EMIT SUM(val*val)];
cnt = cnt - [FROM NewBad EMIT CNT(*)];
mean = sum / cnt
std = sqrt(1/(cnt*(cnt-1)) * (cnt * sumsq - sum*sum))
NewBad = FILTER([ABS(val-mean) > 2 * std], CurGood)
CurGood = CurGood - NewBad
WHILE NewBad != {}
Sigma-clipping, V1: Incremental
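The incremental variant keeps running totals (sum, sum of squares, count) and subtracts only the contribution of the points removed in the previous round, so the aggregates never require a full rescan. A Python sketch of that bookkeeping, assuming a 2-sigma threshold; the function name `sigma_clip_incremental` and the sample data are hypothetical.

```python
import math


def sigma_clip_incremental(values, k=2.0):
    """Sigma-clipping that maintains sum, sum of squares, and count incrementally."""
    good = list(values)
    total = sum(good)
    total_sq = sum(v * v for v in good)
    count = len(good)
    while count > 1:
        mean = total / count
        # Sample variance from the running aggregates; clamp tiny negative rounding error.
        var = (count * total_sq - total * total) / (count * (count - 1))
        std = math.sqrt(max(var, 0.0))
        bad = [v for v in good if abs(v - mean) > k * std]
        if not bad:
            break
        # Update the aggregates from the removed points only.
        total -= sum(bad)
        total_sq -= sum(v * v for v in bad)
        count -= len(bad)
        good = [v for v in good if abs(v - mean) <= k * std]
    return good


points = [10.0, 10.1, 9.9, 10.2, 9.8, 10.0, 50.0]
print(sigma_clip_incremental(points))
```

The filter step still scans the surviving points; only the aggregate computation becomes proportional to the number of removed points.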
Points = SCAN(public:adhoc:sc_points);
aggs = [FROM Points EMIT _sum=SUM(v), sumsq=SUM(v*v), cnt=COUNT(v)];
newBad = []
bounds = [FROM Points EMIT lower=MIN(v), upper=MAX(v)];
DO
new_aggs = [FROM newBad EMIT _sum=SUM(v), sumsq=SUM(v*v), cnt=COUNT(v)];
aggs = [FROM aggs, new_aggs EMIT _sum=aggs._sum - new_aggs._sum,
sumsq=aggs.sumsq - new_aggs.sumsq, cnt=aggs.cnt - new_aggs.cnt];
stats = [FROM aggs EMIT mean=_sum/cnt,
std=SQRT(1.0/(cnt*(cnt-1)) * (cnt * sumsq - _sum * _sum))];
newBounds = [FROM stats EMIT lower=mean - 2 * std, upper=mean + 2 * std];
tooLow = [FROM Points, bounds, newBounds WHERE newBounds.lower > v
AND v >= bounds.lower EMIT v=Points.v];
tooHigh = [FROM Points, bounds, newBounds WHERE newBounds.upper < v
AND v <= bounds.upper EMIT v=Points.v];
newBad = UNIONALL(tooLow, tooHigh);
bounds = newBounds;
continue = [FROM newBad EMIT COUNT(v) > 0];
WHILE continue;
output = [FROM Points, bounds WHERE Points.v > bounds.lower AND
Points.v < bounds.upper EMIT v=Points.v];
DUMP(output);
Sigma-clipping, V2
Dominik Moritz
EuroVis 15
Empower the end user to do
performance profiling, debugging, etc.
Diagnosing problems
Source node
Destination node
Some ongoing work
• “from scratch” polystore optimizer
– Columbia-style, with some ideas from PL community
• Anecdotal Optimization
– Infer optimization decisions based on coarse-grained experimental
results from unreliable sources (blogs, literature)
– “System X is 2X faster than System Y on PageRank”
• Benchmarking Linear Algebra Systems vs. Databases
– HPC community thinks they are 1000X faster; they aren’t
– DB community thinks they are competitive; they aren’t
• Query compilation
– Bridge the gap between MPI and DB
• New query language Kamooks blending arrays and relations
Query compilation for distributed processing
[Figure: two approaches to query compilation]
Pipeline as parallel code -> parallel compiler -> machine code [Myers ’14]
Pipeline fragment code -> sequential compiler -> machine code [Crotty ’14, Li ’14, Seo ’14, Murray ’11]
RADISH
ICS 16
Brandon
Myers
1% selection microbenchmark, 20GB
Avoid long code paths
Q2 SP2Bench, 100M triples, multiple self-joins
Communication optimization
Graph Patterns
• SP2Bench, 100 million triples
• Queries compiled to a PGAS C++ language layer, then
compiled again by a low-level PGAS compiler
• One of Myria’s supported back ends
• Comparison with Shark/Spark, which itself has been shown to
be 100X faster than Hadoop-based systems
• …plus PageRank, Naïve Bayes, and more
ICS 15
TPC-H
Some ongoing work
• “from scratch” polystore optimizer
– Columbia-style, with some ideas from PL community
• Anecdotal Optimization
– Infer optimization decisions based on coarse-grained experimental
results from unreliable sources (blogs, literature)
– “System X is 2X faster than System Y on PageRank”
• Benchmarking Linear Algebra Systems vs. Databases
– HPC community thinks they are 1000X faster; they aren’t
– DB community thinks they are competitive; they aren’t
• Query compilation
– “Software-defined Databases”
– Bridge the gap between MPI and DB
• New query language Kamooks blending arrays and relations
select A.i, B.k, sum(A.val*B.val)
from A, B
where A.j = B.j
group by A.i, B.k
Matrix multiply in RA
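The query treats each sparse matrix as a table of (row, column, value) triples, so a join on the shared index followed by a grouped sum yields the matrix product. A small sketch that runs the same query with Python's built-in sqlite3 module; the helper name `sparse_matmul` and the 2x2 test matrices are made up for illustration.

```python
import sqlite3

MATMUL_SQL = """
    SELECT A.i, B.k, SUM(A.val * B.val)
    FROM A, B
    WHERE A.j = B.j
    GROUP BY A.i, B.k
    ORDER BY A.i, B.k
"""


def sparse_matmul(a_rows, b_rows):
    """Multiply two sparse matrices stored as (row, col, val) triples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE A (i INTEGER, j INTEGER, val REAL)")
    conn.execute("CREATE TABLE B (j INTEGER, k INTEGER, val REAL)")
    conn.executemany("INSERT INTO A VALUES (?, ?, ?)", a_rows)
    conn.executemany("INSERT INTO B VALUES (?, ?, ?)", b_rows)
    rows = conn.execute(MATMUL_SQL).fetchall()
    conn.close()
    return rows


# A = [[1, 2], [0, 3]] and B = [[4, 0], [5, 6]]; zero entries are simply not stored.
A_ROWS = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]
B_ROWS = [(0, 0, 4.0), (1, 0, 5.0), (1, 1, 6.0)]
print(sparse_matmul(A_ROWS, B_ROWS))
```

Only non-zero entries take part in the join, so the work scales with the number of matching (A.j, B.j) pairs rather than with the full matrix dimensions.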
Complexity of matrix multiply
n = number of rows
m = number of non-zeros
[Figure: complexity exponent plotted against sparsity exponent (r s.t. m = n^r)]
naïve sparse algorithm: mn
best known sparse algorithm: m^0.7 n^1.2 + n^2
best known dense algorithm: n^2.38
lots of room here
slide adapted from Zwick
R. Yuster and U. Zwick, Fast Sparse Matrix Multiplication
BLAS vs. SpBLAS vs. SQL (10k)
off the shelf database
15X
20k X 20k matrix multiply by sparsity
CombBLAS, MyriaX, Radish
50k X 50k matrix multiply by sparsity
CombBLAS, MyriaX, Radish
Filter to upper left corner of result matrix
select AB.i, C.m, sum(AB.val*C.val)
from
(select A.i, B.k, sum(A.val*B.val) as val
from A, B
where A.j = B.j
group by A.i, B.k
) AB,
C
where AB.k = C.k
group by AB.i, C.m
A x B x C
select A.i, C.m, sum(A.val*B.val*C.val)
from A, B, C
where A.j = B.j
and B.k = C.k
group by A.i, C.m
A(i, j, val)
B(j, k, val)
C(k, m, val)
take three sparse
matrices
Now compute
multiway hypercube join:
O (|A|/p + |B|/p^2 + |C|/p)
Group by:
~O (N)
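The nested formulation (materialize A x B, then join with C) and the flat three-way join compute the same product; the flat form is what lets an optimizer pick a multiway join strategy. A sketch using Python's built-in sqlite3 module to check that the two agree on a tiny example; the helper name `chain_product` and the sample matrices are assumptions, and SQLite here only stands in for a generic SQL engine, not for Myria's hypercube join.

```python
import sqlite3

NESTED_SQL = """
    SELECT AB.i, C.m, SUM(AB.val * C.val)
    FROM (SELECT A.i AS i, B.k AS k, SUM(A.val * B.val) AS val
          FROM A, B
          WHERE A.j = B.j
          GROUP BY A.i, B.k) AB,
         C
    WHERE AB.k = C.k
    GROUP BY AB.i, C.m
    ORDER BY AB.i, C.m
"""

FLAT_SQL = """
    SELECT A.i, C.m, SUM(A.val * B.val * C.val)
    FROM A, B, C
    WHERE A.j = B.j AND B.k = C.k
    GROUP BY A.i, C.m
    ORDER BY A.i, C.m
"""


def chain_product(a_rows, b_rows, c_rows):
    """Return (nested, flat) results of A x B x C over (row, col, val) triples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE A (i INTEGER, j INTEGER, val REAL)")
    conn.execute("CREATE TABLE B (j INTEGER, k INTEGER, val REAL)")
    conn.execute("CREATE TABLE C (k INTEGER, m INTEGER, val REAL)")
    conn.executemany("INSERT INTO A VALUES (?, ?, ?)", a_rows)
    conn.executemany("INSERT INTO B VALUES (?, ?, ?)", b_rows)
    conn.executemany("INSERT INTO C VALUES (?, ?, ?)", c_rows)
    nested = conn.execute(NESTED_SQL).fetchall()
    flat = conn.execute(FLAT_SQL).fetchall()
    conn.close()
    return nested, flat


# A = [[1, 2], [0, 3]], B = [[4, 0], [5, 6]], C = [[1, 0], [2, 1]]
A3 = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]
B3 = [(0, 0, 4.0), (1, 0, 5.0), (1, 1, 6.0)]
C3 = [(0, 0, 1.0), (1, 0, 2.0), (1, 1, 1.0)]
nested_result, flat_result = chain_product(A3, B3, C3)
print(nested_result == flat_result)
```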
But wait, there’s more…..
Task: self-multiply with 1M non-zeros
Hypercube shuffle: 2 seconds, balanced
Partitioned hash join: 43 seconds, tons of skew
Scalable Graph Clustering
Seung-Hee Bae
Version 1: Parallelize Best-known Serial Algorithm (ICDM 2013)
Version 2: Free 30% improvement for any algorithm (TKDD 2014)
Version 3: Distributed approx. algorithm, 1.5B edges (SC 2015)
http://escience.washington.edu
http://myria.cs.washington.edu
http://uwescience.github.io/sqlshare/
VIZDECK: VISUALIZATION RECOMMENDATION
“Data Triage” Pipeline
[Figure: pipeline diagram]
Input formats: SAS, Excel, XML, CSV, SQL Azure
Stages: Files, Tables, Views, Visualizations
Steps: parse / extract, “relational analysis”, visual analysis
sqlshare.escience.washington.edu
Publications: SIGMOD 11, SSDBM 13, SIGMOD 16; CHI 12, SIGMOD 12, iConference 13; SSDBM 11, CiSE 13, SSDBM 15
[Figure: Task Completion Rate / Time, All Questions, comparing Fusion, VizDeck, ManyEyes, and Tableau; y-axis from 0.0 to 1.2]
CHI 13
Visualization Recommendation
• Model each “vizlet” as a triple
(x_column, y_column, vizlet_type)
• Extract features from each column
(f1x, f2x,…, fNx, f1y, f2y, …, fNy, vizlet_type)
• Interpret each “promotion” as a yes vote and each “discard” as a
no vote
• Train a (simple) model to predict vizlet type from features
• Recommend highest-scoring vizlets
• Add a diversity term to prevent a bunch of similar plots
• Incorporate score modifiers defined by the vizlet designer
– “My bar chart looks best when there are about 5 bars.”
– “My timeseries plot ignores null values”
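The ranking steps in this list can be sketched as a greedy re-ranker: take the model's score for each candidate vizlet and subtract a penalty for every already-selected vizlet of the same type. The slides do not say how the diversity term is defined, so the fixed per-type penalty, the function name `recommend`, and the candidate scores below are all assumptions for illustration.

```python
def recommend(candidates, k=3, penalty=0.3):
    """Pick k vizlets greedily; candidates are (x_column, y_column, vizlet_type, score)."""
    chosen = []
    remaining = list(candidates)
    while remaining and len(chosen) < k:

        def adjusted(c):
            # Diversity term: penalize types that are already on the dashboard.
            same_type = sum(1 for s in chosen if s[2] == c[2])
            return c[3] - penalty * same_type

        best = max(remaining, key=adjusted)
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Hypothetical model scores for four candidate vizlets.
CANDIDATES = [
    ("height", "weight", "scatter", 0.90),
    ("height", "age", "scatter", 0.85),
    ("species", "species", "histogram", 0.70),
    ("date", "temp", "timeseries", 0.60),
]

for vizlet in recommend(CANDIDATES):
    print(vizlet)
```

With the penalty set to zero the two scatter plots would both be recommended; with it, the second slot goes to a different vizlet type.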
Example of a Learned Rule (1)
low x-entropy => bad scatter plot
bad scatter plot
good scatter plot
Example of a Learned Rule (2)
low x-entropy => histogram
bad scatter plot good histogram
Example of a Learned Rule (3)
high x-periodicity => timeseries plot
(periodicity = 1 / variance in gap length between successive values)
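The two column features named in these rules are easy to compute. A sketch, assuming Shannon entropy over the column's value frequencies and taking "successive values" to mean the values in sorted order; the function names `entropy` and `periodicity` are hypothetical and not VizDeck's API.

```python
import math
from collections import Counter
from statistics import pvariance


def entropy(values):
    """Shannon entropy (bits) of the value distribution in a column."""
    n = len(values)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(values).values())


def periodicity(values):
    """1 / variance of the gaps between successive (sorted) values."""
    ordered = sorted(values)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    var = pvariance(gaps)
    # Perfectly regular spacing has zero gap variance, i.e. unbounded periodicity.
    return math.inf if var == 0 else 1.0 / var


print(entropy(["a", "a", "a", "a"]), entropy(["a", "b", "c", "d"]))
print(periodicity([0.0, 1.0, 2.0, 3.0]), periodicity([0.0, 1.0, 5.0, 6.0]))
```

A column with one repeated value has zero entropy (a poor scatter-plot axis), while evenly spaced timestamps give the highest periodicity (a good timeseries axis).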
Voyager
Kanit “Ham” Wongsuphasawat Dominik Moritz
InfoVis 15
Within the first few queries, you’ve
touched all the tables.
SIGMOD 2016
Shrainik Jain
http://uwescience.github.io/sqlshare/