Upload
bill-howe
View
1.044
Download
3
Tags:
Embed Size (px)
DESCRIPTION
Relational databases remain underused in the long tail of science, despite a number of significant success stories and a natural correspondence between scientific inquiry and ad hoc database query. Barriers to adoption have been articulated in the past, but spreadsheets and other file-oriented ap- proaches still dominate. At the University of Washington eScience Institute, we are exploring a new “delivery vector” for selected database features targeting researchers in the long tail: a web-based query-as-a-service system called SQLShare that eschews conventional database design, instead empha- sizing a simple Upload-Query-Share workflow and exposing a direct, full-SQL query interface over “raw” tabular data. We augment the basic query interface with services for cleaning and integrating data, recommending and authoring queries, and automatically generating visualizations. We find that even non-programmers are able to create and share SQL views for a variety of tasks, including quality control, integration, basic analysis, and access control. Researchers in oceanography, molecular biol- ogy, and ecology report migrating data to our system from spreadsheets, from conventional databases, and from ASCII files. In this paper, we will provide some examples of how the platform has enabled sci- ence in other domains, describe our SQLShare system, and propose some emerging research directions in this space for the database community.
Citation preview
04/11/2023 1
Bill Howe, PhDDirector of Research,
Scalable Data AnalyticsUniversity of Washington
eScience Institute
Enabling Collaborative Research Data Management
with SQLShare
Bill Howe, UW
Tom LewisDirector, Academic &
Collaborative ApplicationsUniversity of WashingtonInformation Technology
Assessing UW Researchers’ Data Management Needs
1. Conversations with Research Leaders (2008)• First large-scale assessment of
researchers’ needs• 124 Interviews with top researchers
2. Faculty Technology Survey (2011)• Use of teaching and research technologies• Paired with student and TA surveys• Reached all disciplines, levels of research• 689 instructors responded
Data Management Needs• Expertise—designing new systems, enhancing
current practices, analysis
• Storage—Large amounts of data for current projects and data archives
• Backup—inconsistent systems among researchers, some unreliable practices
• Security—secure access needed for inter-institutional partners
Types of Data Stored
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
5.20%
11.70%
15.20%
20.50%
22.80%
48.00%
61.20%
71.60%Quantities/statistics
Text (literature, transcriptions, field notes)
Images (photographs, maps)
Video recordings
Audio recordings
Multimedia digital objects
Geo-tagged objects/ spatial data
Other
Data Storage Location
5%
5%
6%
6%
12%
27%
41%
66%
87%My computer
External device (hard drive, thumb drive)
Department-managed server
Server managed by research team
External (Non-UW) data center
Department-managed data center
Other
Data managed by research team
Data center managed by UW-IT
http://escience.washington.edu
04/11/2023 Bill Howe, eScience Institute 8
The University of Washington eScience Institute
• Rationale– The exponential increase in sensors is transitioning all fields of
science and engineering from data-poor to data-rich– Techniques and technologies include
• Sensors and sensor networks, databases, data mining, machine learning, visualization, cluster/cloud computing
– If these techniques and technologies are not widely available and widely practiced, UW will cease to be competitive
• Mission– Help position the University of Washington at the forefront of
research both in modern eScience techniques and technologies, and in the fields that depend upon them
• Strategy– Bootstrap a cadre of Research Scientists– Add faculty in key fields– Build out a “consultancy” of students and non-research staff
Bill Howe, UW 9
eScience Big Data GroupBill Howe, Phd (databases, cloud, data-intensive scalable computing, visualization)
Staff– Seung-Hee Bae, Phd (postdoc, scalable machine learning algorithms)– Dan Halperin, Phd (postdoc; scalable systems)– Sagar Chitnis, Research Engineer (Azure, databases, web services)– (alumna) Marianne Shaw, Phd (hadoop, semantic graph databases)– (alumna) Alicia Key, Research Engineer (visualization, web applications)
Students– Scott Moe (2nd yr Phd, Applied Math)– Daniel Perry (2nd yr Phd, HCDE)
Partners– CSE DB Faculty: Magda Balazinska, Dan Suciu– CSE students: Paris Koutris, Prasang Upadhyaya, – UW-IT (web applications, QA/support) – Cecilia Aragon, Phd, Associate Professor, HCDE (visualization, scientific
applications)
04/11/2023
Pop
ular
ity /
S
ales
Products / Results
Head
Tail
• Power distribution• 80:20 rule
First published May 2007, Wired Magazine article 2004
[src: Carol Goble]
PDB
GenBank
UniProt
Pfam
Spreadsheets, NotebooksLocal, Lost
High throughput experimental methodsIndustrial scaleCommons based productionPublicly data setsCherry picked resultsPreserved
CATH, SCOP(Protein Structure Classification)
ChemSpider
Long Tail of Research Data[src: Carol Goble]
04/11/2023 12
Problem
How much time do you spend “handling data” as opposed to “doing science”?
Mode answer: “90%”
Bill Howe, UW
1304/11/2023 Bill Howe, UW
Simple Example
ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome
COGAnnotation_coastal_sample.txt
SELECT * FROM Phaeo_genome p, coastal_sample c WHERE p.COG_hit = c.hit
04/11/2023 14
Data management becoming the bottleneck
• Data is captured and manipulated in spreadsheets
• This approach made sense five years ago; the data volumes were manageable
• But now: – 50k rows in each of 50-100 spreadsheets, per
scientist– Significantly more sharing required: “mega-
collabs”– Continuously producing new results– The management overhead is beginning to
dominate– Difficult to share data publicly (emailing files is a
mess)– Lots of work to get an answers to any new question
04/11/2023 15
Why not build a database?
• Assumes an pre-defined schema – more on this later
• Requires specialized technical skills and a huge amount of up-front effort
• Not a perfect fit – it’s hard to design a “permanent” database for a fast-moving research target
• Researchers have little interest in operating and maintaining a data system – they just want to organize, manipulate, and share data
04/11/2023 16
Database-as-a-Service for the 99%
Approach: Strip down databases to bare essentials– Upload -> Query -> Share– Try to eliminate installation, configuration,
schema design, data loading, tuning, app-building
Bill Howe, UW
hardereasy, thanks to the cloud
Query
04/11/2023 17
Getting Data In
• Upload data through a browser to a cloud database (so no need to manage a local system)
• No need to “design a database” before you use your data – just upload and get to work
• Write SQL queries, with some automated help and some guided help from experts
• Build on your own results– Output of one query can be the input to another– Encourages sharing and reuse– Avoids everyone on the team running the same
data processing steps over and over– Provides provenance – “how did you get this
result?” Bill Howe, UW
04/11/2023 18
Select from a list of English descriptions
04/11/2023 19
Save the results, share them with others
Edit a Query
04/11/2023 20Bill Howe, UW
http://sqlshare.escience.washington.edu
04/11/2023 21Bill Howe, UW
http://sqlshare.escience.washington.edu
04/11/2023 22Bill Howe, UW
http://sqlshare.escience.washington.edu
04/11/2023 23Bill Howe, UW
http://sqlshare.escience.washington.edu
04/11/2023 24Bill Howe, UWhttp://sqlshare.escience.washington.edu
VizDeck
Python Client
“Flagship” SQLShare App (Python) on EC2
SQLShare REST API
Excel Addin
R Client
Spreadsheet Crawler
ASP.NET
OAuth2
WCF
04/11/2023 26Bill Howe, UW
Deeply nested hierarchies of views
ProvenanceControlled Sharing
04/11/2023 27
4 5
67
8
9
10
11
12
13
14
1516
2010 Pilot- Outreach and Education-based sampling: Schooner Adventuress
Robin Kodner
Bill Howe, UW
04/11/2023 28
NatureMapping Program Wildlife Observations (1902- )
Water Quality Monitoring Sites (2003 - )
Data collection and submission options:
1. Download/upload spreadsheet
2. Online data entry
3. NatureTracker on handheld/GPS
4. Android ODK (Open Data Kit)
Karen Dvornich
Bill Howe, UW
04/11/2023 29Bill Howe, UW
“An undergraduate student and I are working with gigabytes of tabular data derived from analysis of protein surfaces.
Previously, we were using huge directory trees and plain text files.
Now we can accomplish a 10 minute 100 line script in 1 line of SQL.”
-- Andrew D White
Andrew White, UW Chemistry
04/11/2023 30
WHY SQL?
Bill Howe, UW
04/11/2023 31
A “Needs Hierarchy” of Science Data Management
Bill Howe, UW
storage
sharing
full semantic integration
query
analytics
“As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.”
-- Maslow 43
04/11/2023 32
NoSchema (not NoSQL)• A schema* is a shared consensus about
some universe of discourse• At the frontier of research, this shared
consensus does not exist, by definition • Any schema that does emerge will change
frequently, by definition• Data found “in the wild” will typically not conform
to any schema, by definition• But this doesn’t mean we have to punt on
databases and go back to ad hoc scripts and files
* ontology/metadata standard/controlled vocabulary/etc.
Bill Howe, UW
04/11/2023 33
Silos
Cloudstorage
sharing
query
full semantic integration
analytics
A “Needs Hierarchy” of Science Data Management
Bill Howe, UW
5/18/10 Garret Cole, eScience Institute
What’s the point?• Conventional wisdom says “Science data isn’t relational”
– This is utter nonsense
• Conventional wisdom says “Scientists won’t write SQL”– This is utter nonsense
• So why aren’t databases being used more often?– They’re a PITA
• We implicate difficulty in– installation, configuration– schema design, data loading– performance tuning– app-building (NoGUI?)
We ask instead, “What kind of platform can support ad hoc scientific Q&A with SQL?”
SELECT x.strain, x.chr, x.region as snp_region, x.start_bp as snp_start_bp , x.end_bp as snp_end_bp, w.start_bp as nc_start_bp, w.end_bp as nc_end_bp , w.category as nc_category , CASE WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp) THEN x.end_bp - x.start_bp + 1 WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp) THEN x.end_bp - w.start_bp + 1 WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp) THEN w.end_bp - x.start_bp + 1 END AS len_overlap
FROM [[email protected]].[hotspots_deserts.tab] x INNER JOIN [[email protected]].[table_noncoding_positions.tab] w ON x.chr = w.chr WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp) OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp) OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp) ORDER BY x.strain, x.chr ASC, x.start_bp ASC
Biologists are beginning to write very complex queries (rather than relying on staff programmers)
Example: Computing the overlaps of two sets of blast results
We see thousands of queries written by non-programmers
04/11/2023 36
So why SQL?• Covers 80% of what we need
– Ex: Sloan Digital Sky Survey– Ex: Hybrid Hash Join algorithm published in BMC
bioinformatics
• Empower a new class of data-savvy scientist who isn’t forced to be trained as an IT professional
• Algebraic optimization– Ask me about this if you’re interested
• Views: Logical and physical data Independence– Reason about the problem independently of the data
representation– No re-execution of “workflows”– No file format incompatibilities– No version mismatches– Data and code tightly coupled and (logically) centralizeBill Howe, UW
04/11/2023 37
SQLShare as a CS Research Platform• Automatic “Starter” Queries
– (Bill Howe, Garret Cole, Nodira Khoussainova, Leilani Battle)
• VizDeck: Automatic Mashups and Visualization – (Bill Howe, Alicia Key, Daniel Perry, Cecilia Aragon)
• Info Extraction from Spreadsheets– (Mike Cafarella, Dave Maier, Bill Howe, Sagar Chitnis, Abdu Alwani)
• Scalable Analytics-as-a-Service– (Dan Suciu, Magda Balazinska, Bill Howe)
• Optimizing Iterative Queries for Machine Learning– (Dan Suciu, Magda Balazinska, Bill Howe)
SSDBM 2011SIGMOD 2011 (demo)
SSDBM 2011CHI 2012SIGMOD 2012 (demo)
Bill Howe, UW
VLDB 2010Datalog2.0 2012CIDR 2013
04/11/2023 38
Where we’re headed:
Bill Howe, UW
• Local or cloud-hosted deployments• Multi-institution sharing• Global users and permissions• Distributed data and distributed query
done!
We are looking for partners!
04/11/2023 39Bill Howe, UW
http://sqlshare.escience.washington.edu
http://escience.washington.edu
04/11/2023 40Bill Howe, UW
04/11/2023 41
Scientific data management reduces to sharing views• Integrate data from multiple sources?
– joins and unions with views
• Standardize on units, apply naming conventions?– rename columns, apply functions with views
• Attach metadata?– add new tables with descriptive names, add new columns
with views
• Data cleaning, quality control?– hide bad values with views
• Maintain provenance?– inspect view dependencies
• Propagate updates? – view maintenance
• Protect sensitive data?– expose subsets with views (assuming views carry
permissions) Bill Howe, UW
An observation about “handling data”
• How many plasmids were bombarded in July and have a rescue and expression?
5/18/10 Garret Cole, eScience Institute
SELECT count(*)
FROM [bombardment_log]
WHERE bomb_date BETWEEN ’7/1/2010' AND ’7/31/2010'
AND rescue clone IS NOT NULL
AND [expression?] = 'yes'
An observation about “handling data”
• Which samples have not been cloned?
5/18/10 Garret Cole, eScience Institute
SELECT * FROM plasmiddb
WHERE NOT (ISDATE(cloned) OR cloned = ‘yes’)
An observation about “handling data”
• How often does each RNA hit appear inside the annotated surface group?
5/18/10 Garret Cole, eScience Institute
SELECT hit, COUNT(*) as cnt
FROM tigrfamannotation_surface
GROUP BY hit
ORDER BY cnt DESC
An observation about “handling data”
For a given promoter (or protein fusion), how many expressing line have been generated (they would all have different strain designations)
5/18/10 Garret Cole, eScience Institute
SELECT strain, count(distinct line) FROM glycerol_stocks
GROUP BY strain
04/11/2023 46
Find all TIGRFam ids (proteins) that are missing from at least one of three samples (relations)
SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs]
UNIONSELECT col0 FROM [est_hma_fasta_TGIRfam_refs]
UNIONSELECT col0 FROM [combo_hma_fasta_TGIRfam_refs]
EXCEPT
SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs]
INTERSECTSELECT col0 FROM [est_hma_fasta_TGIRfam_refs]
INTERSECTSELECT col0 FROM [combo_hma_fasta_TGIRfam_refs]
Bill Howe, UW
04/11/2023 47Bill Howe, UW
On NoSQL
04/11/2023 48
An Observation on NoSQL
• 2004 Dean et al. MapReduce• 2008 Hadoop 0.17 release• 2008 Olston et al. Pig: Relational Algebra on Hadoop• 2008 DryadLINQ: Relational Algebra in a Hadoop-like system• 2009 Thusoo et al. HIVE: SQL on Hadoop
NoSQL is a misnomer– NoMySQL?– NoSchema?– NoLoading? – NoLicenseFees!
Bill Howe, UW
Pre-Relational: if your data changed, your application broke.
Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code.
“Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.”
Key Ideas: Programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independently of physical data representation
Digression: Relational Database History
-- Codd 1979
04/11/2023 Bill Howe, UW 50
Key Idea: An Algebra of Tables
select
project
join join
Other operators: aggregate, union, difference, cross product
51
Algebraic OptimizationN = ((4*2)+((4*3)+0))/1
Algebraic Laws: 1. (+) identity: x+0 = x2. (/) identity: x/1 = x3. (*) distributes: (n*x+n*y) = n*(x+y)4. (*) commutes: x*y = y*x
Apply rules 1, 3, 4, 2: N = (2+3)*4
two operations instead of five, no division operator
Same idea works with very large tables / graphs, but the payoff is much higher
Windows Azure
Architecture
VizDeck
Python Client
“Flagship” SQLShare App (Python) on EC2
SQLShare REST API
SQL Azure
Excel Addin
R Client
SQL Server
Spreadsheet Ingest
ASP.NET
SQLShare REST API
IIS
04/11/2023 Bill Howe, eScience Institute 53
Science is becoming a database query problem Old model: “Query the world” (Data acquisition coupled to a specific
hypothesis) New model: “Download the world” (Data acquisition supports many
hypotheses)– Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST,
PanSTARRS)– Biology: lab automation, high-throughput sequencing, – Oceanography: high-resolution models, cheap sensors, satellites
40TB / 2 nights
1 device
~1TB / day100s of devices
04/11/2023 54
Problem– Research data is captured and manipulated in
spreadsheets– This perhaps made sense five years ago; the data
volumes were manageable– But now: 50k rows, 100s of files, “mega-collabs”
Why not put everything into a database?– A huge amount of up-front effort– Hard to design for a moving target– Running a database system is huge drain
Approach: SQLShare– Upload data through your browser: no setup, no
install– Login to browse science questions in English– Click a question, see the SQL to answer it question– Edit the SQL to answer an "adjacent" question,
even if you wouldn’t know how to write it from scratchhttps://sqlshare.escience.washington.edu/
04/11/2023 55
# of
byt
es
Biology
Galaxy
BioMart
GEO
LANL HIV Pathway
Commons
Astronomy
LSST
SDSS
PanSTARRS
# of sources
SQLShare
Bill Howe, UW
Oceanography
IOOS
OOI
NDBC
PDB GenBank
UniProt Pfam
SRA MG-RAST
Sca
le,
trai
ning
life sciences earth sciences commercialastrophysics
citizen science; public outreach
“Long tail”
Large scale
SQLShare: Query-as-a-Service
GridFields
VizDeck
Parallel Iterative Analytics
Sca
le,
trai
ning
life sciences earth sciences commercialastrophysics
citizen science; public outreach
“Long tail”
Large scale GridFields
VizDeck
Parallel Iterative Analytics
SQLShare: Query-as-a-Service
04/11/2023 58
A “Needs Hierarchy” of Science Data Management
storage
sharing
Bill Howe, UW
query
full integration
analytics
04/11/2023 59
A “Needs Hierarchy” of Science Data Management
storage
sharing
Bill Howe, UW
full integration
query
analytics
04/11/2023 60
Four Conjectures about Declarative Query for Science• Most science data manipulation tasks can
be expressed in relational algebra• Most science analytics task can be
expressed in relational algebra + recursionHellerstein 09, Re 12
• These expressions can be efficiently and scalably executed in the cloud
• Researchers are willing and able to program using relational algebra languagesc.f. SDSS
Bill Howe, UW
5/18/10 Garret Cole, eScience Institute
5/18/10 Garret Cole, eScience Institute
5/18/10 Garret Cole, eScience Institute
metadata
search results
sequence data
5/18/10 Garret Cole, eScience Institute
SQL
04/11/2023 Bill Howe, eScience Institute 66
data
vol
ume
rank
CERN (~15PB/year)
LSST (~100PB)
PanSTARRS (~40PB)
Ocean Modelers <Spreadsheet
users>
SDSS (~100TB)
Seis-mologists
MicrobiologistsCARMEN (~50TB)
“The future is already here; it’s just not very evenly distributed.”
-- William Gibson
The Long Tail
A stripped-down version of Jim Gray’s “20 questions” methodology
Experimental Engagement Algorithm for the Long Tail
1. Get the data2. Load the data “as is” – no schema design3. Get ~20 questions (in English)4. Translate the questions into SQL (when possible)5. Provide these “starter queries” to the researchers
Q: Can researchers questions be expressed in SQL?Q: Are a few examples sufficient for novices to self-train with SQL?Q: Can we scale this process up?Q: If so, will the use of SQL reduce their data handling overhead?
04/11/2023 68
• Which samples have not been cloned?
SELECT * FROM plasmiddb
WHERE NOT (ISDATE(cloned) OR cloned = ‘yes’)
Bill Howe, UW
04/11/2023 69
Details
• Backend is Microsoft’s SQL Azure• About a year old, essentially zero advertising• 50+ active users (UW and external)• 2500+ tables/views (~20% are public• largest table: 1.1M rows, smallest table: 1 row• transitioning to support by UW-IT
Bill Howe, UW
04/11/2023 70
View Semantics and Features• “Saved query” = View with attached metadata• Unify views and tables as “datasets”
– table = “select * from [raw_table]”• Replacement semantics for name conflicts
– old versions materialized and archived• Materialize downstream views
– when dependencies deleted– when dependencies become incompatible
• Permissions– public, private, ACLs (may work on groups in the future)
• Sharing, social querying, CQMS*– search, recent queries, friends’ queries, favorites, ratings– facilitate sharing and recommendations of not just whole queries, but
common predicates, join patterns, etc.
* [Khoussainova, CIDR 2009]Bill Howe, UW
04/11/2023 71Bill Howe, UW
Karen Dvornich
04/11/2023 72Bill Howe, UW