Enabling Collaborative Research Data Management with SQLShare

04/11/2023 1

Bill Howe, PhD
Director of Research, Scalable Data Analytics
University of Washington eScience Institute


Tom Lewis
Director, Academic & Collaborative Applications
University of Washington Information Technology

Assessing UW Researchers’ Data Management Needs

1. Conversations with Research Leaders (2008)
• First large-scale assessment of researchers’ needs
• 124 interviews with top researchers

2. Faculty Technology Survey (2011)
• Use of teaching and research technologies
• Paired with student and TA surveys
• Reached all disciplines, levels of research
• 689 instructors responded

Data Management Needs

• Expertise: designing new systems, enhancing current practices, analysis

• Storage: large amounts of data for current projects and data archives

• Backup: inconsistent systems among researchers, some unreliable practices

• Security: secure access needed for inter-institutional partners

Types of Data Stored

Quantities/statistics: 71.6%
Text (literature, transcriptions, field notes): 61.2%
Images (photographs, maps): 48.0%
Video recordings: 22.8%
Audio recordings: 20.5%
Multimedia digital objects: 15.2%
Geo-tagged objects / spatial data: 11.7%
Other: 5.2%

Data Storage Location

My computer: 87%
External device (hard drive, thumb drive): 66%
Department-managed server: 41%
Server managed by research team: 27%
External (Non-UW) data center: 12%
Department-managed data center: 6%
Other: 6%
Data managed by research team: 5%
Data center managed by UW-IT: 5%

http://escience.washington.edu


The University of Washington eScience Institute

• Rationale
– The exponential increase in sensors is transitioning all fields of science and engineering from data-poor to data-rich
– Techniques and technologies include sensors and sensor networks, databases, data mining, machine learning, visualization, cluster/cloud computing
– If these techniques and technologies are not widely available and widely practiced, UW will cease to be competitive

• Mission
– Help position the University of Washington at the forefront of research both in modern eScience techniques and technologies, and in the fields that depend upon them

• Strategy
– Bootstrap a cadre of Research Scientists
– Add faculty in key fields
– Build out a “consultancy” of students and non-research staff


eScience Big Data Group

Bill Howe, PhD (databases, cloud, data-intensive scalable computing, visualization)

Staff
– Seung-Hee Bae, PhD (postdoc; scalable machine learning algorithms)
– Dan Halperin, PhD (postdoc; scalable systems)
– Sagar Chitnis, Research Engineer (Azure, databases, web services)
– (alumna) Marianne Shaw, PhD (Hadoop, semantic graph databases)
– (alumna) Alicia Key, Research Engineer (visualization, web applications)

Students
– Scott Moe (2nd yr PhD, Applied Math)
– Daniel Perry (2nd yr PhD, HCDE)

Partners
– CSE DB Faculty: Magda Balazinska, Dan Suciu
– CSE students: Paris Koutris, Prasang Upadhyaya
– UW-IT (web applications, QA/support)
– Cecilia Aragon, PhD, Associate Professor, HCDE (visualization, scientific applications)

Figure: curve with Popularity / Sales on the vertical axis and Products / Results on the horizontal axis; regions labeled Head and Tail

• Power-law distribution
• 80:20 rule

First published May 2007, Wired Magazine article 2004

[src: Carol Goble]

Long Tail of Research Data [src: Carol Goble]

Figure labels: PDB; GenBank; UniProt; Pfam; CATH, SCOP (Protein Structure Classification); ChemSpider

Spreadsheets, Notebooks: Local, Lost

High throughput experimental methods; Industrial scale; Commons based production; Public data sets; Cherry picked results; Preserved


Problem

How much time do you spend “handling data” as opposed to “doing science”?

Mode answer: “90%”


Simple Example

ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome

COGAnnotation_coastal_sample.txt

SELECT * FROM Phaeo_genome p, coastal_sample c WHERE p.COG_hit = c.hit
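The join on the slide can be tried locally. The sketch below is a minimal illustration, not SQLShare itself: it uses Python's built-in `sqlite3` in place of the cloud database, and the columns `orf` and `abundance` plus all row values are hypothetical. Only the table names and the join columns `COG_hit` and `hit` come from the slide.

```python
import sqlite3

# In-memory stand-in for the two uploaded files (hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Phaeo_genome (orf TEXT, COG_hit TEXT);
    CREATE TABLE coastal_sample (hit TEXT, abundance INTEGER);
    INSERT INTO Phaeo_genome VALUES
        ('orf1', 'COG0001'), ('orf2', 'COG0002'), ('orf3', 'COG0099');
    INSERT INTO coastal_sample VALUES
        ('COG0001', 12), ('COG0002', 7), ('COG0500', 3);
""")

# Same join condition as the slide, written with an explicit JOIN.
rows = conn.execute("""
    SELECT p.orf, p.COG_hit, c.abundance
    FROM Phaeo_genome p
    JOIN coastal_sample c ON p.COG_hit = c.hit
    ORDER BY p.orf
""").fetchall()

for row in rows:
    print(row)
```

Only the genome annotations whose COG hit also appears in the coastal sample survive the join, which is the spreadsheet "lookup across two files" expressed in one statement.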


Data management becoming the bottleneck

• Data is captured and manipulated in spreadsheets

• This approach made sense five years ago; the data volumes were manageable

• But now:
– 50k rows in each of 50-100 spreadsheets, per scientist
– Significantly more sharing required: “mega-collabs”
– Continuously producing new results
– The management overhead is beginning to dominate
– Difficult to share data publicly (emailing files is a mess)
– Lots of work to get an answer to any new question


Why not build a database?

• Assumes a pre-defined schema – more on this later

• Requires specialized technical skills and a huge amount of up-front effort

• Not a perfect fit – it’s hard to design a “permanent” database for a fast-moving research target

• Researchers have little interest in operating and maintaining a data system – they just want to organize, manipulate, and share data


Database-as-a-Service for the 99%

Approach: Strip down databases to bare essentials
– Upload -> Query -> Share
– Try to eliminate installation, configuration, schema design, data loading, tuning, app-building

[Diagram labels: “harder”; “easy, thanks to the cloud”; “Query”]


Getting Data In

• Upload data through a browser to a cloud database (so no need to manage a local system)

• No need to “design a database” before you use your data – just upload and get to work

• Write SQL queries, with some automated help and some guided help from experts

• Build on your own results
– Output of one query can be the input to another
– Encourages sharing and reuse
– Avoids everyone on the team running the same data processing steps over and over
– Provides provenance – “how did you get this result?”
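The idea that one query's output feeds the next, and that the chain itself records provenance, can be sketched with ordinary SQL views. This is an illustration only, assuming SQLite in place of the SQLShare backend; the table `raw_samples`, its columns, and the view names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_samples (sample_id TEXT, depth_m REAL, temp_c REAL);
    INSERT INTO raw_samples VALUES
        ('s1', 5.0, 11.2), ('s2', 50.0, 8.1), ('s3', 5.0, NULL);

    -- Step 1: a saved query that drops rows with missing readings.
    CREATE VIEW clean_samples AS
        SELECT * FROM raw_samples WHERE temp_c IS NOT NULL;

    -- Step 2: a second saved query that takes the first one as its input.
    CREATE VIEW mean_temp_by_depth AS
        SELECT depth_m, AVG(temp_c) AS mean_temp_c
        FROM clean_samples
        GROUP BY depth_m;
""")

result = conn.execute(
    "SELECT * FROM mean_temp_by_depth ORDER BY depth_m"
).fetchall()

# Provenance: the stored view definition shows how the result was derived.
definition = conn.execute(
    "SELECT sql FROM sqlite_master "
    "WHERE type = 'view' AND name = 'mean_temp_by_depth'"
).fetchone()[0]

print(result)
print(definition)
```

Nobody on the team has to rerun the cleaning step by hand: anyone who queries `mean_temp_by_depth` gets it applied, and the definition answers "how did you get this result?".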


Select from a list of English descriptions


Save the results, share them with others

Edit a Query


http://sqlshare.escience.washington.edu


[Architecture diagram components: VizDeck; Python Client; R Client; Excel Addin; Spreadsheet Crawler; “Flagship” SQLShare App (Python) on EC2; SQLShare REST API; ASP.NET; OAuth2; WCF]


Deeply nested hierarchies of views

Provenance

Controlled Sharing


2010 Pilot- Outreach and Education-based sampling: Schooner Adventuress

Robin Kodner


NatureMapping Program Wildlife Observations (1902- )

Water Quality Monitoring Sites (2003 - )

Data collection and submission options:

1. Download/upload spreadsheet

2. Online data entry

3. NatureTracker on handheld/GPS

4. Android ODK (Open Data Kit)

Karen Dvornich


“An undergraduate student and I are working with gigabytes of tabular data derived from analysis of protein surfaces.

Previously, we were using huge directory trees and plain text files.

Now we can accomplish a 10 minute 100 line script in 1 line of SQL.”

-- Andrew D White

Andrew White, UW Chemistry


WHY SQL?


A “Needs Hierarchy” of Science Data Management


storage

sharing

full semantic integration

query

analytics

“As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.”

-- Maslow 43


NoSchema (not NoSQL)

• A schema* is a shared consensus about some universe of discourse
• At the frontier of research, this shared consensus does not exist, by definition
• Any schema that does emerge will change frequently, by definition
• Data found “in the wild” will typically not conform to any schema, by definition
• But this doesn’t mean we have to punt on databases and go back to ad hoc scripts and files

* ontology/metadata standard/controlled vocabulary/etc.


Silos

Cloud

storage

sharing

query

full semantic integration

analytics



5/18/10 Garret Cole, eScience Institute

What’s the point?

• Conventional wisdom says “Science data isn’t relational”
– This is utter nonsense
• Conventional wisdom says “Scientists won’t write SQL”
– This is utter nonsense
• So why aren’t databases being used more often?
– They’re a PITA
• We implicate difficulty in
– installation, configuration
– schema design, data loading
– performance tuning
– app-building (NoGUI?)

We ask instead, “What kind of platform can support ad hoc scientific Q&A with SQL?”

SELECT x.strain, x.chr, x.region as snp_region,
       x.start_bp as snp_start_bp, x.end_bp as snp_end_bp,
       w.start_bp as nc_start_bp, w.end_bp as nc_end_bp,
       w.category as nc_category,
       CASE
         WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp) THEN x.end_bp - x.start_bp + 1
         WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp) THEN x.end_bp - w.start_bp + 1
         WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp) THEN w.end_bp - x.start_bp + 1
       END AS len_overlap
FROM [[email protected]].[hotspots_deserts.tab] x
INNER JOIN [[email protected]].[table_noncoding_positions.tab] w
   ON x.chr = w.chr
WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
   OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
   OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
ORDER BY x.strain, x.chr ASC, x.start_bp ASC

Biologists are beginning to write very complex queries (rather than relying on staff programmers)

Example: Computing the overlaps of two sets of blast results

We see thousands of queries written by non-programmers
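For readers who want to experiment with this kind of interval-overlap query, here is a minimal sketch. It is not the researchers' query: it uses a more compact formulation (two closed intervals overlap when each starts before the other ends, and the overlap length is the smaller end minus the larger start, plus one), it runs on SQLite where `MIN` and `MAX` accept several scalar arguments, and the table names and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE snp_regions (chr TEXT, start_bp INTEGER, end_bp INTEGER);
    CREATE TABLE noncoding (chr TEXT, start_bp INTEGER, end_bp INTEGER);
    INSERT INTO snp_regions VALUES
        ('chr1', 100, 200), ('chr1', 500, 600), ('chr2', 100, 200);
    INSERT INTO noncoding VALUES
        ('chr1', 150, 400), ('chr1', 520, 540), ('chr2', 300, 400);
""")

# Two-argument MIN/MAX are SQLite scalar functions, used here as
# "least" and "greatest" of the interval endpoints.
overlaps = conn.execute("""
    SELECT x.chr, x.start_bp, x.end_bp, w.start_bp, w.end_bp,
           MIN(x.end_bp, w.end_bp) - MAX(x.start_bp, w.start_bp) + 1
               AS len_overlap
    FROM snp_regions x
    JOIN noncoding w ON x.chr = w.chr
    WHERE x.start_bp <= w.end_bp AND w.start_bp <= x.end_bp
    ORDER BY x.chr, x.start_bp
""").fetchall()

for row in overlaps:
    print(row)
```

The single overlap predicate covers containment in either direction as well as partial overlap on the left or right, so no `CASE` expression is needed in this variant.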


So why SQL?

• Covers 80% of what we need
– Ex: Sloan Digital Sky Survey
– Ex: Hybrid Hash Join algorithm published in BMC Bioinformatics
• Empower a new class of data-savvy scientist who isn’t forced to be trained as an IT professional
• Algebraic optimization
– Ask me about this if you’re interested
• Views: logical and physical data independence
– Reason about the problem independently of the data representation
– No re-execution of “workflows”
– No file format incompatibilities
– No version mismatches
– Data and code tightly coupled and (logically) centralized


SQLShare as a CS Research Platform

• Automatic “Starter” Queries
– (Bill Howe, Garret Cole, Nodira Khoussainova, Leilani Battle)
• VizDeck: Automatic Mashups and Visualization
– (Bill Howe, Alicia Key, Daniel Perry, Cecilia Aragon)
• Info Extraction from Spreadsheets
– (Mike Cafarella, Dave Maier, Bill Howe, Sagar Chitnis, Abdu Alwani)
• Scalable Analytics-as-a-Service
– (Dan Suciu, Magda Balazinska, Bill Howe)
• Optimizing Iterative Queries for Machine Learning
– (Dan Suciu, Magda Balazinska, Bill Howe)

Venues shown alongside the list: SSDBM 2011, SIGMOD 2011 (demo); SSDBM 2011, CHI 2012, SIGMOD 2012 (demo); VLDB 2010, Datalog 2.0 2012, CIDR 2013


Where we’re headed:

• Local or cloud-hosted deployments
• Multi-institution sharing
• Global users and permissions
• Distributed data and distributed query

done!

We are looking for partners!


Scientific data management reduces to sharing views

• Integrate data from multiple sources?
– joins and unions with views
• Standardize on units, apply naming conventions?
– rename columns, apply functions with views
• Attach metadata?
– add new tables with descriptive names, add new columns with views
• Data cleaning, quality control?
– hide bad values with views
• Maintain provenance?
– inspect view dependencies
• Propagate updates?
– view maintenance
• Protect sensitive data?
– expose subsets with views (assuming views carry permissions)
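Several of these tasks can be combined in a single view. The sketch below is a hedged illustration on SQLite, not SQLShare's implementation: the table `casts`, its generic column names, the `-999.0` sentinel for bad depth readings, and the e-mail column are all hypothetical. The one view renames columns, converts Fahrenheit to Celsius, hides bad values, and leaves the sensitive column out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Raw upload with generic column names (hypothetical data).
    CREATE TABLE casts (col0 TEXT, col1 REAL, col2 REAL, collector_email TEXT);
    INSERT INTO casts VALUES
        ('st01', 32.0, 10.0, 'a@example.org'),
        ('st02', 50.0, -999.0, 'b@example.org'),
        ('st03', 68.0, 25.0, 'c@example.org');

    CREATE VIEW casts_clean AS
        SELECT col0 AS station,                      -- naming convention
               (col1 - 32.0) * 5.0 / 9.0 AS temp_c,  -- unit standardization
               col2 AS depth_m
        FROM casts
        WHERE col2 <> -999.0;                        -- hide bad values
""")

rows = conn.execute(
    "SELECT station, temp_c, depth_m FROM casts_clean ORDER BY station"
).fetchall()

# The e-mail column is not exposed through the view.
columns = [d[0] for d in conn.execute("SELECT * FROM casts_clean").description]

print(rows)
print(columns)
```

The raw table is never modified, so a collaborator who disagrees with the cleaning rule can define a different view over the same upload.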

An observation about “handling data”

• How many plasmids were bombarded in July and have a rescue and expression?


SELECT count(*)
FROM [bombardment_log]
WHERE bomb_date BETWEEN '7/1/2010' AND '7/31/2010'
AND [rescue clone] IS NOT NULL
AND [expression?] = 'yes'


• Which samples have not been cloned?


SELECT * FROM plasmiddb
WHERE NOT (ISDATE(cloned) OR cloned = 'yes')


• How often does each RNA hit appear inside the annotated surface group?


SELECT hit, COUNT(*) as cnt

FROM tigrfamannotation_surface

GROUP BY hit

ORDER BY cnt DESC


• For a given promoter (or protein fusion), how many expressing lines have been generated? (They would all have different strain designations.)


SELECT strain, count(distinct line) FROM glycerol_stocks

GROUP BY strain


Find all TIGRFam ids (proteins) that are missing from at least one of three samples (relations)

SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs]
UNION
SELECT col0 FROM [est_hma_fasta_TGIRfam_refs]
UNION
SELECT col0 FROM [combo_hma_fasta_TGIRfam_refs]
EXCEPT
SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs]
INTERSECT
SELECT col0 FROM [est_hma_fasta_TGIRfam_refs]
INTERSECT
SELECT col0 FROM [combo_hma_fasta_TGIRfam_refs]
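The query above relies on SQL Server evaluating INTERSECT before UNION and EXCEPT. A sketch for trying the same "union minus intersection" idea locally follows. It assumes SQLite, which evaluates all compound operators left to right, so the two halves are wrapped in subqueries to make the grouping explicit. The short table names and the ids are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE refseq (col0 TEXT);
    CREATE TABLE est (col0 TEXT);
    CREATE TABLE combo (col0 TEXT);
    INSERT INTO refseq VALUES ('TIGR00001'), ('TIGR00002'), ('TIGR00003');
    INSERT INTO est VALUES ('TIGR00001'), ('TIGR00002');
    INSERT INTO combo VALUES ('TIGR00001'), ('TIGR00004');
""")

# ids present in any sample, minus ids present in every sample
missing = [r[0] for r in conn.execute("""
    SELECT col0 FROM (
        SELECT col0 FROM refseq
        UNION SELECT col0 FROM est
        UNION SELECT col0 FROM combo
    )
    EXCEPT
    SELECT col0 FROM (
        SELECT col0 FROM refseq
        INTERSECT SELECT col0 FROM est
        INTERSECT SELECT col0 FROM combo
    )
    ORDER BY col0
""")]

print(missing)
```

Only `TIGR00001` appears in all three samples, so every other id is reported as missing from at least one of them.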


On NoSQL


An Observation on NoSQL

• 2004 Dean et al. MapReduce
• 2008 Hadoop 0.17 release
• 2008 Olston et al. Pig: Relational Algebra on Hadoop
• 2008 DryadLINQ: Relational Algebra in a Hadoop-like system
• 2009 Thusoo et al. HIVE: SQL on Hadoop

NoSQL is a misnomer
– NoMySQL?
– NoSchema?
– NoLoading?
– NoLicenseFees!


Digression: Relational Database History

Pre-Relational: if your data changed, your application broke.

Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code.

“Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.”

-- Codd 1970

Key Idea: Programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independently of physical data representation


Key Idea: An Algebra of Tables

Operators shown in the diagram: select, project, join

Other operators: aggregate, union, difference, cross product


Algebraic Optimization

N = ((4*2)+((4*3)+0))/1

Algebraic Laws:
1. (+) identity: x+0 = x
2. (/) identity: x/1 = x
3. (*) distributes: (n*x+n*y) = n*(x+y)
4. (*) commutes: x*y = y*x

Apply rules 1, 3, 4, 2: N = (2+3)*4

Two operations instead of five, no division operator

Same idea works with very large tables / graphs, but the payoff is much higher
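The arithmetic example can be made concrete with a tiny rewriter. This is only a sketch of the general idea, not how a database optimizer is implemented: expressions are hypothetical nested tuples `(op, left, right)`, and only the four laws from the slide are encoded.

```python
import operator

def rewrite(expr):
    """Apply the four laws bottom-up to an expression tree."""
    if not isinstance(expr, tuple):
        return expr  # a plain number: nothing to rewrite
    op, left, right = expr
    left, right = rewrite(left), rewrite(right)
    if op == "+" and right == 0:   # law 1: x + 0 = x
        return left
    if op == "/" and right == 1:   # law 2: x / 1 = x
        return left
    if (op == "+" and isinstance(left, tuple) and isinstance(right, tuple)
            and left[0] == "*" and right[0] == "*" and left[1] == right[1]):
        # law 3: n*x + n*y = n*(x + y); law 4 then writes it as (x + y)*n
        n, x, y = left[1], left[2], right[2]
        return ("*", ("+", x, y), n)
    return (op, left, right)

def count_ops(expr):
    """Number of operators in the tree."""
    if not isinstance(expr, tuple):
        return 0
    return 1 + count_ops(expr[1]) + count_ops(expr[2])

def evaluate(expr):
    """Evaluate the tree with ordinary arithmetic."""
    if not isinstance(expr, tuple):
        return expr
    funcs = {"+": operator.add, "*": operator.mul, "/": operator.truediv}
    return funcs[expr[0]](evaluate(expr[1]), evaluate(expr[2]))

# N = ((4*2) + ((4*3) + 0)) / 1
original = ("/", ("+", ("*", 4, 2), ("+", ("*", 4, 3), 0)), 1)
optimized = rewrite(original)

print(optimized, count_ops(original), count_ops(optimized))
```

Both trees evaluate to the same value, but the rewritten one needs two operations instead of five. A query optimizer applies the same kind of law-driven rewriting to trees of select, project, and join.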

Architecture

[Diagram components: Windows Azure; SQL Azure; SQL Server; IIS; ASP.NET; SQLShare REST API; “Flagship” SQLShare App (Python) on EC2; VizDeck; Python Client; R Client; Excel Addin; Spreadsheet Ingest]


Science is becoming a database query problem

Old model: “Query the world” (Data acquisition coupled to a specific hypothesis)

New model: “Download the world” (Data acquisition supports many hypotheses)
– Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
– Biology: lab automation, high-throughput sequencing
– Oceanography: high-resolution models, cheap sensors, satellites

[Figure annotations: 40TB / 2 nights, 1 device; ~1TB / day, 100s of devices]


Problem
– Research data is captured and manipulated in spreadsheets
– This perhaps made sense five years ago; the data volumes were manageable
– But now: 50k rows, 100s of files, “mega-collabs”

Why not put everything into a database?
– A huge amount of up-front effort
– Hard to design for a moving target
– Running a database system is a huge drain

Approach: SQLShare
– Upload data through your browser: no setup, no install
– Log in to browse science questions in English
– Click a question, see the SQL that answers it
– Edit the SQL to answer an "adjacent" question, even if you wouldn’t know how to write it from scratch

https://sqlshare.escience.washington.edu/


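[Figure: data resources plotted by # of bytes (vertical axis) against # of sources (horizontal axis), with SQLShare marked on the chart.
Astronomy: LSST, SDSS, PanSTARRS.
Oceanography: IOOS, OOI, NDBC.
Biology: Galaxy, BioMart, GEO, LANL HIV, Pathway Commons, PDB, GenBank, UniProt, Pfam, SRA, MG-RAST.]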

[Figure: projects arranged by scale and training, from “Long tail” to “Large scale”, across life sciences, earth sciences, astrophysics, commercial, and citizen science / public outreach. Projects shown: SQLShare: Query-as-a-Service; GridFields; VizDeck; Parallel Iterative Analytics.]

[Needs hierarchy diagram, repeated; levels as labeled: storage; sharing; query; full integration; analytics]

Four Conjectures about Declarative Query for Science

• Most science data manipulation tasks can be expressed in relational algebra
• Most science analytics tasks can be expressed in relational algebra + recursion (Hellerstein 09, Re 12)
• These expressions can be efficiently and scalably executed in the cloud
• Researchers are willing and able to program using relational algebra languages (cf. SDSS)




[Diagram labels: metadata; search results; sequence data; SQL]


The Long Tail

[Figure: data volume (vertical axis) against rank (horizontal axis). Labeled points: CERN (~15PB/year); LSST (~100PB); PanSTARRS (~40PB); SDSS (~100TB); CARMEN (~50TB); Ocean Modelers; Seismologists; Microbiologists; <Spreadsheet users>]

“The future is already here; it’s just not very evenly distributed.”

-- William Gibson

Experimental Engagement Algorithm for the Long Tail

A stripped-down version of Jim Gray’s “20 questions” methodology

1. Get the data
2. Load the data “as is” – no schema design
3. Get ~20 questions (in English)
4. Translate the questions into SQL (when possible)
5. Provide these “starter queries” to the researchers

Q: Can researchers’ questions be expressed in SQL?
Q: Are a few examples sufficient for novices to self-train with SQL?
Q: Can we scale this process up?
Q: If so, will the use of SQL reduce their data handling overhead?
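Steps 2 through 5 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the SQLShare ingest code: the helper `load_as_is`, the sample file contents, and the two starter questions are hypothetical, and SQLite stands in for the cloud database. Every column is loaded as text so that no schema has to be designed up front.

```python
import csv
import io
import sqlite3

def load_as_is(conn, table, csv_text):
    """Load a delimited file with no schema design: every column is TEXT."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    cols = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)

# Hypothetical spreadsheet export; the empty cell means "not cloned yet".
csv_text = "sample,cloned,strain\np1,yes,A\np2,,A\np3,2010-07-04,B\n"

conn = sqlite3.connect(":memory:")
load_as_is(conn, "plasmiddb", csv_text)

# Starter queries: an English question paired with the SQL that answers it.
starter_queries = {
    "Which samples have not been cloned?":
        "SELECT sample FROM plasmiddb WHERE cloned = '' ORDER BY sample",
    "How many samples are there per strain?":
        "SELECT strain, COUNT(*) FROM plasmiddb GROUP BY strain ORDER BY strain",
}

answers = {q: conn.execute(sql).fetchall() for q, sql in starter_queries.items()}
for question, rows in answers.items():
    print(question, rows)
```

A researcher who wants an "adjacent" answer, such as counts per strain for cloned samples only, can edit the second query rather than write one from scratch.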


Details

• Backend is Microsoft’s SQL Azure
• About a year old, essentially zero advertising
• 50+ active users (UW and external)
• 2500+ tables/views (~20% are public)
• Largest table: 1.1M rows; smallest table: 1 row
• Transitioning to support by UW-IT


View Semantics and Features

• “Saved query” = View with attached metadata
• Unify views and tables as “datasets”
– table = “select * from [raw_table]”
• Replacement semantics for name conflicts
– old versions materialized and archived
• Materialize downstream views
– when dependencies deleted
– when dependencies become incompatible
• Permissions
– public, private, ACLs (may work on groups in the future)
• Sharing, social querying, CQMS*
– search, recent queries, friends’ queries, favorites, ratings
– facilitate sharing and recommendations of not just whole queries, but common predicates, join patterns, etc.

* [Khoussainova, CIDR 2009]
