DESCRIPTION
I gave this talk at the UW Systems, Architecture, & Networking (SANE) retreat in May 2014. I argued that, as a community, big data system-builders may be great at building fast systems, but that these systems DO NOT serve the scientists we work with at the UW eScience Institute. I then offered a few ideas for building services that enable scientists to do their own work, thus "serving themselves".
Designing for self-serve science
Daniel Halperin
How much time “handling data” vs “doing science”?
90%
“I sort both my spreadsheets on Gene ID, then I copy matches into a new one”
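The procedure in that quote is a join on Gene ID. As a minimal sketch, assuming two small hypothetical tables (the table names, columns, and values below are invented for illustration and do not come from the talk), the sort-and-copy steps reduce to one SQL statement, run here through Python's built-in sqlite3 module:

```python
import sqlite3

# Hypothetical stand-ins for the two spreadsheets in the quote.
expression = [("g1", 2.5), ("g2", 0.7), ("g4", 1.1)]
annotation = [("g2", "kinase"), ("g3", "transporter"), ("g4", "ligase")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expression (gene_id TEXT, level REAL)")
conn.execute("CREATE TABLE annotation (gene_id TEXT, role TEXT)")
conn.executemany("INSERT INTO expression VALUES (?, ?)", expression)
conn.executemany("INSERT INTO annotation VALUES (?, ?)", annotation)

# "Sort both on Gene ID, then copy the matches" is a single inner join.
matches = conn.execute(
    "SELECT e.gene_id, e.level, a.role "
    "FROM expression e JOIN annotation a ON e.gene_id = a.gene_id "
    "ORDER BY e.gene_id"
).fetchall()

print(matches)
```

Only genes present in both tables appear in the result, which is what copying matches by hand produces.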
We are the problem
[Figure: two bar charts over Benchmark 1 and Benchmark 2. First chart (axis 0 to 120): Old system, Your system, Our system. Second chart (axis 0 to 10000): Old system, Your system, Our system, What people use.]
[Figure: plot with axes Performance and Complexity, repeated across four slides]
Design for here
What we build What they need
Steve Jurvetson https://www.flickr.com/photos/jurvetson/7408464122
sutton-images.com http://biser3a.com/formula-1/f1-airboxes-all-you-need-to-know/
terms: http://sutton-images.com/terms.asp
Lowering barrier to entry
Developing a new language
• SQL: 3 great features for science
  • THE language of data management!
  • We know how to scale it
  • Scientists can learn it
• MyriaL is better
  • Imperative & declarative: easy to write
  • Iteration & recursion!
  • Lots of practical extensions
Giving users insight
Diagnosing problems
[Figure: plot with axes Source node and Destination node]
Automating the ‘CS parts’
• Do work on the user’s behalf (Ratul Mahajan’s Buffet Principle):
• Infer indexes and constraints!
• Aggressively reuse computation
• Speculatively apply queries to data
• Key enabler: science data is (mostly) read-only
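One way to picture "aggressively reuse computation" on read-only data is a result cache. Because the data never changes, a cached answer cannot go stale. The sketch below is an assumption-laden illustration, not the talk's system: `ReusingExecutor`, the `reads` table, and its values are hypothetical, and the cache matches queries by exact text only.

```python
import sqlite3


class ReusingExecutor:
    """Hypothetical helper: caches results, valid only for read-only data."""

    def __init__(self, conn):
        self.conn = conn
        self.cache = {}  # query text -> rows already computed
        self.hits = 0    # how many times a cached result was reused

    def run(self, sql):
        key = sql.strip()
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        rows = self.conn.execute(sql).fetchall()
        self.cache[key] = rows
        return rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reads (sample TEXT, n INTEGER)")
conn.executemany(
    "INSERT INTO reads VALUES (?, ?)", [("a", 10), ("b", 32), ("a", 5)]
)

executor = ReusingExecutor(conn)
query = "SELECT sample, SUM(n) FROM reads GROUP BY sample ORDER BY sample"
first = executor.run(query)   # computed by the database
second = executor.run(query)  # served from the cache

print(first, executor.hits)
```

A real system would need to recognize equivalent queries that differ in text, and to invalidate entries if the data ever did change. This sketch does neither.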
Enable authoring & sharing
• “Autocomplete for science” - predict query snippets as users work. (Nodira Khoussainova)
• Natural language interface: queries → English questions → queries, e.g. “Compute the fraction of CGs that are methylated in the oyster genome.”
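To make the English-to-query direction concrete, here is a sketch of the kind of query the example question might correspond to. The schema is an assumption: a hypothetical `cg_sites` table with one row per CG site and a 0/1 `methylated` flag, filled with invented values. The real data layout is not given in the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per CG site, methylated is 0 or 1.
conn.execute("CREATE TABLE cg_sites (position INTEGER, methylated INTEGER)")
conn.executemany(
    "INSERT INTO cg_sites VALUES (?, ?)",
    [(101, 1), (245, 1), (390, 0), (512, 1)],
)

# "Compute the fraction of CGs that are methylated":
# the average of a 0/1 flag is the fraction of rows where it is 1.
(fraction,) = conn.execute("SELECT AVG(methylated) FROM cg_sites").fetchone()

print(fraction)
```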
Improve their state of the art
• “You just did in 1 minute what took me a week”
• “Replaced 100 lines of Python with 1 line of SQL”
• “That 5-line MyriaL program was 100x faster than my R cluster, and much simpler”
Trust, but Verify (& Support)