Enabling Collaborative Research Data Management with SQLShare

Bill Howe, PhD, Director of Research, Scalable Data Analytics, University of Washington eScience Institute
Tom Lewis, Director, Academic & Collaborative Applications, University of Washington Information Technology
06/26/2022


DESCRIPTION

Relational databases remain underused in the long tail of science, despite a number of significant success stories and a natural correspondence between scientific inquiry and ad hoc database query. Barriers to adoption have been articulated in the past, but spreadsheets and other file-oriented approaches still dominate. At the University of Washington eScience Institute, we are exploring a new “delivery vector” for selected database features targeting researchers in the long tail: a web-based query-as-a-service system called SQLShare that eschews conventional database design, instead emphasizing a simple Upload-Query-Share workflow and exposing a direct, full-SQL query interface over “raw” tabular data. We augment the basic query interface with services for cleaning and integrating data, recommending and authoring queries, and automatically generating visualizations. We find that even non-programmers are able to create and share SQL views for a variety of tasks, including quality control, integration, basic analysis, and access control. Researchers in oceanography, molecular biology, and ecology report migrating data to our system from spreadsheets, from conventional databases, and from ASCII files. In this paper, we will provide some examples of how the platform has enabled science in other domains, describe our SQLShare system, and propose some emerging research directions in this space for the database community.


Page 1: Enabling Collaborative Research Data Management with SQLShare

04/11/2023 1

Bill Howe, PhD
Director of Research, Scalable Data Analytics
University of Washington eScience Institute

Enabling Collaborative Research Data Management with SQLShare

Tom Lewis
Director, Academic & Collaborative Applications
University of Washington Information Technology

Bill Howe, UW

Page 2: Enabling Collaborative Research Data Management with SQLShare

Assessing UW Researchers’ Data Management Needs

1. Conversations with Research Leaders (2008)
• First large-scale assessment of researchers’ needs
• 124 interviews with top researchers

2. Faculty Technology Survey (2011)
• Use of teaching and research technologies
• Paired with student and TA surveys
• Reached all disciplines, levels of research
• 689 instructors responded

Page 3: Enabling Collaborative Research Data Management with SQLShare

Data Management Needs

• Expertise: designing new systems, enhancing current practices, analysis

• Storage: large amounts of data for current projects and data archives

• Backup: inconsistent systems among researchers, some unreliable practices

• Security: secure access needed for inter-institutional partners

Page 4: Enabling Collaborative Research Data Management with SQLShare

Types of Data Stored

[Bar chart, horizontal axis 0% to 100%. Categories as listed: Quantities/statistics (71.6%); Text (literature, transcriptions, field notes); Images (photographs, maps); Video recordings; Audio recordings; Multimedia digital objects; Geo-tagged objects/spatial data; Other. Remaining bar values as extracted: 61.2%, 48.0%, 22.8%, 20.5%, 15.2%, 11.7%, 5.2%.]

Page 5: Enabling Collaborative Research Data Management with SQLShare

Data Storage Location

[Bar chart. Categories as listed: My computer (87%); External device (hard drive, thumb drive); Department-managed server; Server managed by research team; External (Non-UW) data center; Department-managed data center; Other; Data managed by research team; Data center managed by UW-IT. Remaining bar values as extracted: 66%, 41%, 27%, 12%, 6%, 6%, 5%, 5%.]

Page 6: Enabling Collaborative Research Data Management with SQLShare
Page 7: Enabling Collaborative Research Data Management with SQLShare

http://escience.washington.edu

Page 8: Enabling Collaborative Research Data Management with SQLShare

The University of Washington eScience Institute

• Rationale
  – The exponential increase in sensors is transitioning all fields of science and engineering from data-poor to data-rich
  – Techniques and technologies include
    • Sensors and sensor networks, databases, data mining, machine learning, visualization, cluster/cloud computing
  – If these techniques and technologies are not widely available and widely practiced, UW will cease to be competitive

• Mission
  – Help position the University of Washington at the forefront of research both in modern eScience techniques and technologies, and in the fields that depend upon them

• Strategy
  – Bootstrap a cadre of Research Scientists
  – Add faculty in key fields
  – Build out a “consultancy” of students and non-research staff

Page 9: Enabling Collaborative Research Data Management with SQLShare

eScience Big Data Group

Bill Howe, PhD (databases, cloud, data-intensive scalable computing, visualization)

Staff
– Seung-Hee Bae, PhD (postdoc; scalable machine learning algorithms)
– Dan Halperin, PhD (postdoc; scalable systems)
– Sagar Chitnis, Research Engineer (Azure, databases, web services)
– (alumna) Marianne Shaw, PhD (Hadoop, semantic graph databases)
– (alumna) Alicia Key, Research Engineer (visualization, web applications)

Students
– Scott Moe (2nd yr PhD, Applied Math)
– Daniel Perry (2nd yr PhD, HCDE)

Partners
– CSE DB Faculty: Magda Balazinska, Dan Suciu
– CSE students: Paris Koutris, Prasang Upadhyaya
– UW-IT (web applications, QA/support)
– Cecilia Aragon, PhD, Associate Professor, HCDE (visualization, scientific applications)

Page 10: Enabling Collaborative Research Data Management with SQLShare

[Figure: long-tail curve. Vertical axis: Popularity / Sales. Horizontal axis: Products / Results. Regions labeled Head and Tail.]

• Power distribution
• 80:20 rule

First published May 2007, Wired Magazine article 2004

[src: Carol Goble]

Page 11: Enabling Collaborative Research Data Management with SQLShare

PDB

GenBank

UniProt

Pfam

Spreadsheets, Notebooks
Local, Lost

High throughput experimental methods
Industrial scale
Commons based production
Public data sets
Cherry picked results
Preserved

CATH, SCOP (Protein Structure Classification)

ChemSpider

Long Tail of Research Data
[src: Carol Goble]

Page 12: Enabling Collaborative Research Data Management with SQLShare


Problem

How much time do you spend “handling data” as opposed to “doing science”?

Mode answer: “90%”


Page 13: Enabling Collaborative Research Data Management with SQLShare


Simple Example

ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome

COGAnnotation_coastal_sample.txt

SELECT * FROM Phaeo_genome p, coastal_sample c WHERE p.COG_hit = c.hit
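The join on this slide can be tried locally. The sketch below is a minimal illustration that uses Python's built-in sqlite3 module as a stand-in for the hosted service; the column names `orf` and `annotation` and all rows are invented for the example, and only `COG_hit` and `hit` come from the slide.

```python
import sqlite3

# In-memory database standing in for the hosted service (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Phaeo_genome (orf TEXT, COG_hit TEXT);
    CREATE TABLE coastal_sample (hit TEXT, annotation TEXT);
    INSERT INTO Phaeo_genome VALUES
        ('orf1', 'COG0001'), ('orf2', 'COG0002'), ('orf3', 'COG0099');
    INSERT INTO coastal_sample VALUES
        ('COG0001', 'transport'), ('COG0002', 'metabolism');
""")

# Same join condition as on the slide, written with explicit JOIN ... ON.
rows = conn.execute("""
    SELECT p.orf, p.COG_hit, c.annotation
    FROM Phaeo_genome AS p
    JOIN coastal_sample AS c ON p.COG_hit = c.hit
    ORDER BY p.orf
""").fetchall()

print(rows)
```

Rows without a matching hit (here `orf3`) drop out of the result, which is the behavior of an inner join.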

Page 14: Enabling Collaborative Research Data Management with SQLShare

Data management becoming the bottleneck

• Data is captured and manipulated in spreadsheets
• This approach made sense five years ago; the data volumes were manageable
• But now:
  – 50k rows in each of 50-100 spreadsheets, per scientist
  – Significantly more sharing required: “mega-collabs”
  – Continuously producing new results
  – The management overhead is beginning to dominate
  – Difficult to share data publicly (emailing files is a mess)
  – Lots of work to get an answer to any new question

Page 15: Enabling Collaborative Research Data Management with SQLShare

Why not build a database?

• Assumes a pre-defined schema – more on this later

• Requires specialized technical skills and a huge amount of up-front effort

• Not a perfect fit – it’s hard to design a “permanent” database for a fast-moving research target

• Researchers have little interest in operating and maintaining a data system – they just want to organize, manipulate, and share data

Page 16: Enabling Collaborative Research Data Management with SQLShare

Database-as-a-Service for the 99%

Approach: Strip down databases to bare essentials
– Upload -> Query -> Share
– Try to eliminate installation, configuration, schema design, data loading, tuning, app-building

[Slide annotations: “easy, thanks to the cloud”; “harder”; “Query”]

Page 17: Enabling Collaborative Research Data Management with SQLShare

Getting Data In

• Upload data through a browser to a cloud database (so no need to manage a local system)
• No need to “design a database” before you use your data – just upload and get to work
• Write SQL queries, with some automated help and some guided help from experts
• Build on your own results
  – Output of one query can be the input to another
  – Encourages sharing and reuse
  – Avoids everyone on the team running the same data processing steps over and over
  – Provides provenance – “how did you get this result?”
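The "build on your own results" idea can be sketched with ordinary SQL views. This is an illustration only: it uses Python's sqlite3 module rather than the SQLShare backend, and the table `casts`, the view names, and the data are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE casts (station TEXT, depth_m REAL, temp_c REAL);
    INSERT INTO casts VALUES
        ('A', 5, 11.0), ('A', 8, 9.0), ('A', 50, 4.0), ('B', 5, 8.0);

    -- Step 1: a saved query that keeps only near-surface measurements.
    CREATE VIEW surface_casts AS
        SELECT * FROM casts WHERE depth_m <= 10;

    -- Step 2: a second saved query that takes the first one as its input.
    CREATE VIEW mean_surface_temp AS
        SELECT station, AVG(temp_c) AS mean_temp_c
        FROM surface_casts
        GROUP BY station;
""")

result = conn.execute(
    "SELECT station, mean_temp_c FROM mean_surface_temp ORDER BY station"
).fetchall()

# The stored definition answers "how did you get this result?"
definition = conn.execute(
    "SELECT sql FROM sqlite_master "
    "WHERE type = 'view' AND name = 'mean_surface_temp'"
).fetchone()[0]

print(result)
print(definition)
```

Because the second view is defined over the first, nobody on the team has to rerun the filtering step by hand, and the saved definition doubles as a provenance record.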

Page 18: Enabling Collaborative Research Data Management with SQLShare


Select from a list of English descriptions

Page 19: Enabling Collaborative Research Data Management with SQLShare


Save the results, share them with others

Edit a Query

Page 20: Enabling Collaborative Research Data Management with SQLShare


http://sqlshare.escience.washington.edu

Page 21: Enabling Collaborative Research Data Management with SQLShare


http://sqlshare.escience.washington.edu

Page 22: Enabling Collaborative Research Data Management with SQLShare


http://sqlshare.escience.washington.edu

Page 23: Enabling Collaborative Research Data Management with SQLShare


http://sqlshare.escience.washington.edu

Page 24: Enabling Collaborative Research Data Management with SQLShare

http://sqlshare.escience.washington.edu

Page 25: Enabling Collaborative Research Data Management with SQLShare

VizDeck

Python Client

“Flagship” SQLShare App (Python) on EC2

SQLShare REST API

Excel Addin

R Client

Spreadsheet Crawler

ASP.NET

OAuth2

WCF

Page 26: Enabling Collaborative Research Data Management with SQLShare


Deeply nested hierarchies of views

Provenance

Controlled Sharing

Page 27: Enabling Collaborative Research Data Management with SQLShare

[Figure: numbered markers 4 through 16]

2010 Pilot- Outreach and Education-based sampling: Schooner Adventuress

Robin Kodner


Page 28: Enabling Collaborative Research Data Management with SQLShare


NatureMapping Program Wildlife Observations (1902- )

Water Quality Monitoring Sites (2003 - )

Data collection and submission options:

1. Download/upload spreadsheet

2. Online data entry

3. NatureTracker on handheld/GPS

4. Android ODK (Open Data Kit)

Karen Dvornich


Page 29: Enabling Collaborative Research Data Management with SQLShare


“An undergraduate student and I are working with gigabytes of tabular data derived from analysis of protein surfaces.

Previously, we were using huge directory trees and plain text files.

Now we can accomplish a 10 minute 100 line script in 1 line of SQL.”

-- Andrew D White

Andrew White, UW Chemistry

Page 30: Enabling Collaborative Research Data Management with SQLShare


WHY SQL?


Page 31: Enabling Collaborative Research Data Management with SQLShare


A “Needs Hierarchy” of Science Data Management


storage

sharing

full semantic integration

query

analytics

“As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.”

-- Maslow 43

Page 32: Enabling Collaborative Research Data Management with SQLShare

NoSchema (not NoSQL)
• A schema* is a shared consensus about some universe of discourse
• At the frontier of research, this shared consensus does not exist, by definition
• Any schema that does emerge will change frequently, by definition
• Data found “in the wild” will typically not conform to any schema, by definition
• But this doesn’t mean we have to punt on databases and go back to ad hoc scripts and files

* ontology/metadata standard/controlled vocabulary/etc.


Page 33: Enabling Collaborative Research Data Management with SQLShare


Silos

Cloud

storage

sharing

query

full semantic integration

analytics

A “Needs Hierarchy” of Science Data Management


Page 34: Enabling Collaborative Research Data Management with SQLShare

5/18/10 Garret Cole, eScience Institute

What’s the point?
• Conventional wisdom says “Science data isn’t relational”
  – This is utter nonsense
• Conventional wisdom says “Scientists won’t write SQL”
  – This is utter nonsense
• So why aren’t databases being used more often?
  – They’re a PITA
• We implicate difficulty in
  – installation, configuration
  – schema design, data loading
  – performance tuning
  – app-building (NoGUI?)

We ask instead, “What kind of platform can support ad hoc scientific Q&A with SQL?”

Page 35: Enabling Collaborative Research Data Management with SQLShare

SELECT x.strain, x.chr, x.region AS snp_region,
       x.start_bp AS snp_start_bp, x.end_bp AS snp_end_bp,
       w.start_bp AS nc_start_bp, w.end_bp AS nc_end_bp,
       w.category AS nc_category,
       CASE
         WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp) THEN x.end_bp - x.start_bp + 1
         WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp) THEN x.end_bp - w.start_bp + 1
         WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp) THEN w.end_bp - x.start_bp + 1
       END AS len_overlap
FROM [[email protected]].[hotspots_deserts.tab] x
INNER JOIN [[email protected]].[table_noncoding_positions.tab] w ON x.chr = w.chr
WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
   OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
   OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
ORDER BY x.strain, x.chr ASC, x.start_bp ASC

Biologists are beginning to write very complex queries (rather than relying on staff programmers)

Example: Computing the overlaps of two sets of blast results

We see thousands of queries written by non-programmers
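The overlap length in the query above can also be written without a CASE expression, as the smaller of the two end positions minus the larger of the two start positions, plus one. The sketch below is not the researcher's query: it uses SQLite's two-argument scalar `MIN`/`MAX` functions through Python's sqlite3 module, and the tables `snp` and `noncoding` and their rows are invented. One reason to consider this form is that the second CASE branch above appears to overstate the overlap when the noncoding interval lies strictly inside the SNP region.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE snp (chr TEXT, start_bp INTEGER, end_bp INTEGER);
    CREATE TABLE noncoding (chr TEXT, start_bp INTEGER, end_bp INTEGER);
    INSERT INTO snp VALUES
        ('chr1', 100, 200), ('chr1', 500, 600), ('chr2', 10, 20);
    INSERT INTO noncoding VALUES
        ('chr1', 150, 250), ('chr1', 120, 130), ('chr1', 700, 800), ('chr2', 1, 50);
""")

# Two inclusive intervals overlap when each one starts before the other ends.
overlaps = conn.execute("""
    SELECT x.chr, x.start_bp, x.end_bp, w.start_bp, w.end_bp,
           MIN(x.end_bp, w.end_bp) - MAX(x.start_bp, w.start_bp) + 1 AS len_overlap
    FROM snp AS x
    JOIN noncoding AS w ON x.chr = w.chr
    WHERE x.start_bp <= w.end_bp AND w.start_bp <= x.end_bp
    ORDER BY x.chr, x.start_bp, w.start_bp
""").fetchall()

for row in overlaps:
    print(row)
```

For the interval 120-130 nested inside 100-200 this yields 11 base pairs, whereas the CASE form would report 200 - 120 + 1 = 81.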

Page 36: Enabling Collaborative Research Data Management with SQLShare

So why SQL?
• Covers 80% of what we need
  – Ex: Sloan Digital Sky Survey
  – Ex: Hybrid Hash Join algorithm published in BMC Bioinformatics
• Empower a new class of data-savvy scientist who isn’t forced to be trained as an IT professional
• Algebraic optimization
  – Ask me about this if you’re interested
• Views: Logical and physical data independence
  – Reason about the problem independently of the data representation
  – No re-execution of “workflows”
  – No file format incompatibilities
  – No version mismatches
  – Data and code tightly coupled and (logically) centralized

Page 37: Enabling Collaborative Research Data Management with SQLShare

SQLShare as a CS Research Platform
• Automatic “Starter” Queries
  – (Bill Howe, Garret Cole, Nodira Khoussainova, Leilani Battle)
• VizDeck: Automatic Mashups and Visualization
  – (Bill Howe, Alicia Key, Daniel Perry, Cecilia Aragon)
• Info Extraction from Spreadsheets
  – (Mike Cafarella, Dave Maier, Bill Howe, Sagar Chitnis, Abdu Alwani)
• Scalable Analytics-as-a-Service
  – (Dan Suciu, Magda Balazinska, Bill Howe)
• Optimizing Iterative Queries for Machine Learning
  – (Dan Suciu, Magda Balazinska, Bill Howe)

SSDBM 2011, SIGMOD 2011 (demo)

SSDBM 2011, CHI 2012, SIGMOD 2012 (demo)

VLDB 2010, Datalog2.0 2012, CIDR 2013

Page 38: Enabling Collaborative Research Data Management with SQLShare

Where we’re headed:

• Local or cloud-hosted deployments
• Multi-institution sharing
• Global users and permissions
• Distributed data and distributed query

done!

We are looking for partners!

Page 39: Enabling Collaborative Research Data Management with SQLShare


http://sqlshare.escience.washington.edu

[email protected]

http://escience.washington.edu

Page 40: Enabling Collaborative Research Data Management with SQLShare


Page 41: Enabling Collaborative Research Data Management with SQLShare

Scientific data management reduces to sharing views
• Integrate data from multiple sources?
  – joins and unions with views
• Standardize on units, apply naming conventions?
  – rename columns, apply functions with views
• Attach metadata?
  – add new tables with descriptive names, add new columns with views
• Data cleaning, quality control?
  – hide bad values with views
• Maintain provenance?
  – inspect view dependencies
• Propagate updates?
  – view maintenance
• Protect sensitive data?
  – expose subsets with views (assuming views carry permissions)
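Several of these tasks can be sketched with two small views. The example is illustrative only: it runs on SQLite through Python's sqlite3 module, the table `raw_obs`, its generic `col0`..`col2` headers, the sentinel value -999, and the Fahrenheit-to-Celsius conversion are all assumptions, and a real deployment would enforce who may read each view through permissions rather than by convention.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_obs (col0 TEXT, col1 REAL, col2 TEXT);
    INSERT INTO raw_obs VALUES
        ('s1', 50.0, 'public'), ('s2', -999.0, 'public'), ('s3', 68.0, 'embargoed');

    -- Naming conventions, unit standardization (Fahrenheit to Celsius),
    -- and quality control (rows with the -999 sentinel are hidden).
    CREATE VIEW obs_named AS
        SELECT col0 AS sample_id,
               (col1 - 32.0) * 5.0 / 9.0 AS temp_c,
               col2 AS release_status
        FROM raw_obs
        WHERE col1 <> -999.0;

    -- Protect sensitive data: expose only the public subset.
    CREATE VIEW obs_public AS
        SELECT sample_id, temp_c
        FROM obs_named
        WHERE release_status = 'public';
""")

clean_count = conn.execute("SELECT COUNT(*) FROM obs_named").fetchone()[0]
public_rows = conn.execute("SELECT sample_id, temp_c FROM obs_public").fetchall()

print(clean_count, public_rows)
```

The raw upload is never modified; each cleaning or sharing decision is a named, inspectable query layered on top of it.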

Page 42: Enabling Collaborative Research Data Management with SQLShare

An observation about “handling data”

• How many plasmids were bombarded in July and have a rescue and expression?

SELECT count(*)
FROM [bombardment_log]
WHERE bomb_date BETWEEN '7/1/2010' AND '7/31/2010'
  AND [rescue clone] IS NOT NULL
  AND [expression?] = 'yes'

Page 43: Enabling Collaborative Research Data Management with SQLShare

An observation about “handling data”

• Which samples have not been cloned?

SELECT * FROM plasmiddb
WHERE NOT (ISDATE(cloned) = 1 OR cloned = 'yes')

Page 44: Enabling Collaborative Research Data Management with SQLShare

An observation about “handling data”

• How often does each RNA hit appear inside the annotated surface group?


SELECT hit, COUNT(*) as cnt

FROM tigrfamannotation_surface

GROUP BY hit

ORDER BY cnt DESC

Page 45: Enabling Collaborative Research Data Management with SQLShare

An observation about “handling data”

• For a given promoter (or protein fusion), how many expressing lines have been generated? (They would all have different strain designations.)

SELECT strain, count(distinct line) FROM glycerol_stocks

GROUP BY strain

Page 46: Enabling Collaborative Research Data Management with SQLShare


Find all TIGRFam ids (proteins) that are missing from at least one of three samples (relations)

SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs]
UNION
SELECT col0 FROM [est_hma_fasta_TGIRfam_refs]
UNION
SELECT col0 FROM [combo_hma_fasta_TGIRfam_refs]
EXCEPT
SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs]
INTERSECT
SELECT col0 FROM [est_hma_fasta_TGIRfam_refs]
INTERSECT
SELECT col0 FROM [combo_hma_fasta_TGIRfam_refs]
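The same "union minus intersection" pattern can be reproduced on a small scale. The sketch below uses SQLite via Python's sqlite3 module, with shortened, hypothetical table names and invented ids. SQLite groups compound operators strictly left to right, so the two halves are wrapped in subqueries to make the grouping explicit; SQL Server instead gives INTERSECT higher precedence than UNION and EXCEPT, which is why the flat form above yields the intended grouping there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE refseq (col0 TEXT);
    CREATE TABLE est (col0 TEXT);
    CREATE TABLE combo (col0 TEXT);
    INSERT INTO refseq VALUES ('TIGR00001'), ('TIGR00002'), ('TIGR00003');
    INSERT INTO est    VALUES ('TIGR00001'), ('TIGR00002');
    INSERT INTO combo  VALUES ('TIGR00001'), ('TIGR00004');
""")

# Ids present in at least one sample, minus ids present in all three.
missing_somewhere = conn.execute("""
    SELECT col0 FROM (
        SELECT col0 FROM refseq UNION
        SELECT col0 FROM est UNION
        SELECT col0 FROM combo
    )
    EXCEPT
    SELECT col0 FROM (
        SELECT col0 FROM refseq INTERSECT
        SELECT col0 FROM est INTERSECT
        SELECT col0 FROM combo
    )
    ORDER BY col0
""").fetchall()

print(missing_somewhere)
```

Only `TIGR00001` appears in all three samples, so every other id is reported as missing from at least one of them.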


Page 47: Enabling Collaborative Research Data Management with SQLShare


On NoSQL

Page 48: Enabling Collaborative Research Data Management with SQLShare


An Observation on NoSQL

• 2004 Dean et al. MapReduce
• 2008 Hadoop 0.17 release
• 2008 Olston et al. Pig: Relational Algebra on Hadoop
• 2008 DryadLINQ: Relational Algebra in a Hadoop-like system
• 2009 Thusoo et al. HIVE: SQL on Hadoop

NoSQL is a misnomer
– NoMySQL?
– NoSchema?
– NoLoading?
– NoLicenseFees!

Page 49: Enabling Collaborative Research Data Management with SQLShare

Digression: Relational Database History

Pre-Relational: if your data changed, your application broke.

Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code.

“Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.”
-- Codd 1970

Key Idea: Programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independently of physical data representation

Page 50: Enabling Collaborative Research Data Management with SQLShare


Key Idea: An Algebra of Tables

select

project

join join

Other operators: aggregate, union, difference, cross product

Page 51: Enabling Collaborative Research Data Management with SQLShare

Algebraic Optimization

N = ((4*2)+((4*3)+0))/1

Algebraic Laws:
1. (+) identity: x+0 = x
2. (/) identity: x/1 = x
3. (*) distributes: (n*x+n*y) = n*(x+y)
4. (*) commutes: x*y = y*x

Apply rules 1, 3, 4, 2: N = (2+3)*4

two operations instead of five, no division operator

Same idea works with very large tables / graphs, but the payoff is much higher
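The rewrite on this slide can be spelled out one rule at a time. The intermediate lines below are a reconstruction from the four listed laws, applied in the stated order 1, 3, 4, 2; they are not taken from the slide itself.

```latex
\begin{align*}
N &= \bigl((4 \cdot 2) + ((4 \cdot 3) + 0)\bigr) / 1 \\
  &= \bigl((4 \cdot 2) + (4 \cdot 3)\bigr) / 1 && \text{rule 1: } x + 0 = x \\
  &= \bigl(4 \cdot (2 + 3)\bigr) / 1 && \text{rule 3: } n x + n y = n (x + y) \\
  &= \bigl((2 + 3) \cdot 4\bigr) / 1 && \text{rule 4: } x y = y x \\
  &= (2 + 3) \cdot 4 && \text{rule 2: } x / 1 = x
\end{align*}
```

The starting expression uses five operators (two multiplications, two additions, one division) and the final one uses two, which matches the count given on the slide.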

Page 52: Enabling Collaborative Research Data Management with SQLShare

Windows Azure

Architecture

VizDeck

Python Client

“Flagship” SQLShare App (Python) on EC2

SQLShare REST API

SQL Azure

Excel Addin

R Client

SQL Server

Spreadsheet Ingest

ASP.NET

SQLShare REST API

IIS

Page 53: Enabling Collaborative Research Data Management with SQLShare

Science is becoming a database query problem

Old model: “Query the world” (Data acquisition coupled to a specific hypothesis)
New model: “Download the world” (Data acquisition supports many hypotheses)
– Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
– Biology: lab automation, high-throughput sequencing
– Oceanography: high-resolution models, cheap sensors, satellites

40TB / 2 nights

1 device

~1TB / day

100s of devices

Page 54: Enabling Collaborative Research Data Management with SQLShare

Problem
– Research data is captured and manipulated in spreadsheets
– This perhaps made sense five years ago; the data volumes were manageable
– But now: 50k rows, 100s of files, “mega-collabs”

Why not put everything into a database?
– A huge amount of up-front effort
– Hard to design for a moving target
– Running a database system is a huge drain

Approach: SQLShare
– Upload data through your browser: no setup, no install
– Log in to browse science questions in English
– Click a question, see the SQL that answers it
– Edit the SQL to answer an "adjacent" question, even if you wouldn’t know how to write it from scratch

https://sqlshare.escience.washington.edu/

Page 55: Enabling Collaborative Research Data Management with SQLShare

# of bytes

Biology

Galaxy

BioMart

GEO

LANL HIV Pathway

Commons

Astronomy

LSST

SDSS

PanSTARRS

# of sources

SQLShare


Oceanography

IOOS

OOI

NDBC

PDB GenBank

UniProt Pfam

SRA MG-RAST

Page 56: Enabling Collaborative Research Data Management with SQLShare

Scale, training

life sciences, earth sciences, commercial, astrophysics

citizen science; public outreach

“Long tail”

Large scale

SQLShare: Query-as-a-Service

GridFields

VizDeck

Parallel Iterative Analytics

Page 57: Enabling Collaborative Research Data Management with SQLShare

Scale, training

life sciences, earth sciences, commercial, astrophysics

citizen science; public outreach

“Long tail”

Large scale GridFields

VizDeck

Parallel Iterative Analytics

SQLShare: Query-as-a-Service

Page 58: Enabling Collaborative Research Data Management with SQLShare


A “Needs Hierarchy” of Science Data Management

storage

sharing


query

full integration

analytics

Page 59: Enabling Collaborative Research Data Management with SQLShare


A “Needs Hierarchy” of Science Data Management

storage

sharing


full integration

query

analytics

Page 60: Enabling Collaborative Research Data Management with SQLShare

Four Conjectures about Declarative Query for Science
• Most science data manipulation tasks can be expressed in relational algebra
• Most science analytics tasks can be expressed in relational algebra + recursion (Hellerstein 09, Re 12)
• These expressions can be efficiently and scalably executed in the cloud
• Researchers are willing and able to program using relational algebra languages (c.f. SDSS)
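As a small illustration of "relational algebra + recursion", a recursive common table expression can compute a transitive closure, something plain select/project/join cannot express for chains of unknown length. The example is an assumption-laden sketch, not drawn from the cited works: it uses SQLite through Python's sqlite3 module, and the `derived_from` table and dataset names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE derived_from (child TEXT, parent TEXT);
    INSERT INTO derived_from VALUES
        ('summary', 'clean'), ('clean', 'raw_upload'), ('plot_input', 'summary');
""")

# Find every dataset that a given dataset depends on, directly or indirectly.
ancestors = conn.execute("""
    WITH RECURSIVE lineage(name) AS (
        SELECT parent FROM derived_from WHERE child = ?
        UNION
        SELECT d.parent
        FROM derived_from AS d
        JOIN lineage AS l ON d.child = l.name
    )
    SELECT name FROM lineage ORDER BY name
""", ("plot_input",)).fetchall()

print(ancestors)
```

The base case is an ordinary selection, the recursive step is an ordinary join, and `UNION` removes duplicates so the iteration stops once no new names appear.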


Page 61: Enabling Collaborative Research Data Management with SQLShare


Page 62: Enabling Collaborative Research Data Management with SQLShare


Page 63: Enabling Collaborative Research Data Management with SQLShare


Page 64: Enabling Collaborative Research Data Management with SQLShare

metadata

search results

sequence data

Page 65: Enabling Collaborative Research Data Management with SQLShare


SQL

Page 66: Enabling Collaborative Research Data Management with SQLShare

data volume

rank

CERN (~15PB/year)

LSST (~100PB)

PanSTARRS (~40PB)

Ocean Modelers

<Spreadsheet users>

SDSS (~100TB)

Seismologists

Microbiologists

CARMEN (~50TB)

“The future is already here; it’s just not very evenly distributed.”

-- William Gibson

The Long Tail

Page 67: Enabling Collaborative Research Data Management with SQLShare

A stripped-down version of Jim Gray’s “20 questions” methodology

Experimental Engagement Algorithm for the Long Tail

1. Get the data
2. Load the data “as is” – no schema design
3. Get ~20 questions (in English)
4. Translate the questions into SQL (when possible)
5. Provide these “starter queries” to the researchers

Q: Can researchers’ questions be expressed in SQL?
Q: Are a few examples sufficient for novices to self-train with SQL?
Q: Can we scale this process up?
Q: If so, will the use of SQL reduce their data handling overhead?
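Step 2, loading data "as is", might look like the following sketch. It is not SQLShare's ingest code: `load_as_is` is a hypothetical helper, the CSV content is invented, and the sketch assumes a header row and rows of equal length. Every column is stored as text under its original header, so no schema decisions are needed before the first query.

```python
import csv
import io
import sqlite3


def load_as_is(conn, table, csv_text):
    """Load CSV text into a new table with every column typed as TEXT."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Quote identifiers so headers such as "expression?" are accepted unchanged.
    quoted_table = '"{}"'.format(table.replace('"', '""'))
    cols = ", ".join('"{}" TEXT'.format(h.replace('"', '""')) for h in header)
    conn.execute("CREATE TABLE {} ({})".format(quoted_table, cols))
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(
        "INSERT INTO {} VALUES ({})".format(quoted_table, placeholders), reader
    )
    return header


conn = sqlite3.connect(":memory:")
sample = "strain,bomb_date,expression?\nN2,7/3/2010,yes\nCB4856,7/9/2010,no\n"
header = load_as_is(conn, "bombardment_log", sample)

# A "starter query" over the raw upload.
count = conn.execute(
    "SELECT COUNT(*) FROM bombardment_log WHERE \"expression?\" = 'yes'"
).fetchone()[0]

print(header, count)
```

Type conversion and renaming can then be layered on afterwards with views, rather than being decided at load time.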

Page 68: Enabling Collaborative Research Data Management with SQLShare

• Which samples have not been cloned?

SELECT * FROM plasmiddb
WHERE NOT (ISDATE(cloned) = 1 OR cloned = 'yes')

Page 69: Enabling Collaborative Research Data Management with SQLShare

Details

• Backend is Microsoft’s SQL Azure
• About a year old, essentially zero advertising
• 50+ active users (UW and external)
• 2500+ tables/views (~20% are public)
• Largest table: 1.1M rows; smallest table: 1 row
• Transitioning to support by UW-IT

Page 70: Enabling Collaborative Research Data Management with SQLShare

View Semantics and Features
• “Saved query” = View with attached metadata
• Unify views and tables as “datasets”
  – table = “select * from [raw_table]”
• Replacement semantics for name conflicts
  – old versions materialized and archived
• Materialize downstream views
  – when dependencies deleted
  – when dependencies become incompatible
• Permissions
  – public, private, ACLs (may work on groups in the future)
• Sharing, social querying, CQMS*
  – search, recent queries, friends’ queries, favorites, ratings
  – facilitate sharing and recommendations of not just whole queries, but common predicates, join patterns, etc.

* [Khoussainova, CIDR 2009]
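One way to picture the "datasets" and replacement semantics listed above is the sketch below. It is a guess at the behavior, not the SQLShare implementation: `replace_dataset`, the `_v1` archive naming, and the sample data are hypothetical, and SQLite (via Python's sqlite3) stands in for the real backend.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_table (id INTEGER, value REAL);
    INSERT INTO raw_table VALUES (1, 2.5), (2, 4.0), (3, 9.5);

    -- An uploaded table is exposed as a trivial view, so tables and
    -- saved queries are both just "datasets".
    CREATE VIEW my_dataset AS SELECT * FROM raw_table;
""")


def replace_dataset(conn, name, new_sql, version):
    """Replace view `name`, materializing and archiving the old version first."""
    archive = "{}_v{}".format(name, version)
    conn.execute('CREATE TABLE "{}" AS SELECT * FROM "{}"'.format(archive, name))
    conn.execute('DROP VIEW "{}"'.format(name))
    conn.execute('CREATE VIEW "{}" AS {}'.format(name, new_sql))
    return archive


archived = replace_dataset(
    conn, "my_dataset", "SELECT id, value FROM raw_table WHERE value < 5", version=1
)
current = conn.execute("SELECT id FROM my_dataset ORDER BY id").fetchall()
old_count = conn.execute('SELECT COUNT(*) FROM "{}"'.format(archived)).fetchone()[0]

print(archived, current, old_count)
```

Materializing the old version as a table means its contents survive even though the view definition it came from has been replaced.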

Page 71: Enabling Collaborative Research Data Management with SQLShare


Karen Dvornich

Page 72: Enabling Collaborative Research Data Management with SQLShare
