Data-Intensive Computing Symposium: Report Out
Phillip B. Gibbons, Intel Research Pittsburgh
Phillip B. Gibbons, Data-Intensive Computing Symposium2
Data-Intensive Computing Symposium
Held 3/26/08 @Yahoo! in Sunnyvale, CA
Sponsored by:
– Yahoo! Research
– Computing Community Consortium (CCC), which supports the computing research community in creating compelling research visions and the mechanisms to realize these visions (http://www.cra.org/ccc/)
~100 invited attendees, ~12 invited talks
Slides and video to be posted on CCC web site
Blog: http://dita.ncsa.uiuc.edu/xllora (thanks!)
Randy Bryant (CMU)
Data-Intensive Scalable Computing
Local speaker; I’ll skip in interest of time
DISC has been renamed
ChengXiang Zhai (UIUC)
Text Information Management
ChengXiang Zhai (UIUC)
Proposal 1: Maximum Personalization
Dan Reed (Microsoft)
Clouds and ManyCore: The Revolution
Big Data: Should focus more on the user experience
How to manage resources
Cloud computing can help organically orchestrate resources on demand
Initiative to bring academics, business, and users together under the big data problem (PCAST NITRD review)
Jill Mesirov (Broad Institute)
Computational Paradigms for Genomic Medicine
Broad has 4.8K processors, 1.4 PB of storage on site
Big Data Problem: Mining genome expression arrays
– Row: patients; Column: genes; Value: expression values
– Example: classify leukemias based on expression arrays
– Solved by grad student over the weekend using web sources
Challenge: Computation/Analysis/Provenance infrastructure needed
– Developed GenePattern 3.1: Software infrastructure for interoperable informatics
– Usable by biologists
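To make the expression-array layout concrete (rows are patients, columns are genes), here is a minimal sketch of a nearest-centroid classifier over such a matrix. This is an illustration only and is not claimed to be the method used in the leukemia example or in GenePattern; the function names, the class labels "A" and "B", and all expression values are made up.

```python
from statistics import mean

def centroids(rows, labels):
    """Average expression profile (one value per gene) for each class label."""
    by_label = {}
    for row, label in zip(rows, labels):
        by_label.setdefault(label, []).append(row)
    return {label: [mean(col) for col in zip(*group)]
            for label, group in by_label.items()}

def classify(sample, cents):
    """Return the label whose centroid is nearest in squared Euclidean distance."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(cents, key=lambda label: dist2(sample, cents[label]))

# Hypothetical toy matrix: 4 patients (rows) x 3 genes (columns).
train = [
    [5.0, 1.0, 0.5],
    [4.5, 1.2, 0.7],
    [1.0, 4.8, 3.9],
    [0.8, 5.1, 4.2],
]
labels = ["A", "A", "B", "B"]  # made-up class labels

cents = centroids(train, labels)
print(classify([4.8, 0.9, 0.6], cents))
```

Real expression arrays have thousands of gene columns, so a practical pipeline would likely select informative genes first; that step is omitted here.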
Garth Gibson (CMU)
Simplicity and Complexity in Data Systems at Scale
Petascale Data Storage Institute
Understanding disk failures, cfdr.usenix.org
Another local speaker, so I’ll skip in interest of time
Jeff Dean (Google)
Handling Large Datasets at Google
Jeff Dean (Google)
GFS Usage
Jon Kleinberg (Cornell)
Large-Scale Social Network Data
Diffusion in Social Networks
Why is chain letter diffusion so deep & narrow?
Iraq war authorization protest chain letter diffusion (18K nodes)
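One way to quantify "deep & narrow" is to measure a diffusion tree's depth and its widest level. The sketch below assumes the diffusion is recorded as a hypothetical `parent` mapping (each recipient maps to the person the letter came from) and that this mapping forms a tree with no cycles; the toy data is made up and is not taken from the chain-letter dataset.

```python
from collections import Counter

def depth_and_width(parent):
    """Return (depth, max_width) of a diffusion tree.

    `parent` maps each non-root node to the node it received the letter from.
    Assumes the mapping is acyclic, i.e. it really describes a tree.
    """
    def level(node):
        # Walk up to the root, counting hops.
        hops = 0
        while node in parent:
            node = parent[node]
            hops += 1
        return hops

    nodes = set(parent) | set(parent.values())
    per_level = Counter(level(n) for n in nodes)
    return max(per_level), max(per_level.values())

# Hypothetical toy tree: a -> b -> c -> {d, e}
toy = {"b": "a", "c": "b", "d": "c", "e": "c"}
depth, width = depth_and_width(toy)
print(depth, width)
```

A deep and narrow tree has a depth that is large relative to its widest level. The walk-to-root approach is quadratic in the worst case, which is fine for a sketch but would be replaced by a single breadth-first pass at scale.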
Marc Najork (Microsoft Research)
Mining the Web Graph
Scalable Hyperlink Store: used internally within MSR, for web graphs
Query-dependent link-based ranking algorithms (HITS, SALSA)
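For readers unfamiliar with HITS, here is a minimal sketch of its hub/authority iteration on a tiny edge list. It assumes the edges are the link neighborhood of a query's result set (which is what makes the ranking query-dependent); the graph, the function name, and the iteration count are illustrative and say nothing about how the Scalable Hyperlink Store is implemented.

```python
from math import sqrt

def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for a directed edge list."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of a page: sum of hub scores of the pages linking to it.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # Hub score of a page: sum of authority scores of the pages it links to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]
        hub = new_hub
        # Normalize both vectors to unit length so scores stay bounded.
        for scores in (hub, auth):
            norm = sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

# Hypothetical neighborhood graph: a and b both link to c; a also links to d.
hub, auth = hits([("a", "c"), ("b", "c"), ("a", "d")])
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

SALSA uses the same hub/authority split but replaces the mutually reinforcing sums with a random walk; it is not shown here.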
Joe Hellerstein (UC Berkeley)
“What” Goes Around
1. Industrial revolution of data: sensors, logs, cameras
2. Hardware revolution: datacenters/virtualization, many-core
3. Industrial revolution in software? Declarative languages in some domains
Why “What”:
– Rapid prototyping
– Pocket-size code bases
– Independent from the runtime
– Ease of analysis and security
– Allow optimization and adaptability
Joe Hellerstein (UC Berkeley)
Sensor Networks, Mobile Networks, Modular Robotics, computer games, program analysis
Distributive inference (junction trees and loopy belief propagation), graphs upon graphs
Evita Raced: Overlog Metacompiler (compiler is written declaratively)
– matches datalog optimizations (dynamic prog.), cycle tests
Datalog with known extensions and tweaks
Centrality of Rendezvous & graphs
Challenges:
– performance beyond number of messages (e.g., memory hierarchy), availability, real programs, not Turing complete
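As a small illustration of the declarative style, the classic two-rule Datalog program for reachability is `path(X,Y) :- edge(X,Y).` and `path(X,Z) :- path(X,Y), edge(Y,Z).` The sketch below evaluates those two rules in Python with semi-naive iteration, where only newly derived facts are joined in each round. It is a hand-written toy, not Overlog and not the Evita Raced metacompiler; the function name and the sample edges are made up.

```python
def transitive_closure(edges):
    """Semi-naive evaluation of:
         path(X, Y) :- edge(X, Y).
         path(X, Z) :- path(X, Y), edge(Y, Z).
    """
    edges = set(edges)
    path = set(edges)    # all facts derived so far
    delta = set(edges)   # facts that were new in the previous round
    while delta:
        # Join only the new path facts against edge.
        derived = {(x, z)
                   for (x, y) in delta
                   for (y2, z) in edges
                   if y == y2}
        delta = derived - path   # keep only facts not seen before
        path |= delta
    return path

# Hypothetical chain a -> b -> c -> d.
closure = transitive_closure({("a", "b"), ("b", "c"), ("c", "d")})
print(sorted(closure))
```

The loop stops at a fixpoint, when a round derives nothing new, so it also terminates on cyclic graphs.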
Raghu Ramakrishnan (Yahoo! Res.)
Sherpa: Cloud Computing of the Third Kind
Alex Szalay (Johns Hopkins)
Scientific Applications of Large Databases
Phillip Gibbons (Intel Research)
Data-Rich Computing: Where It’s At
Important, interesting, exciting research area
Cluster approach: computing is co-located where the storage is at
Focus of this talk:
– Memory hierarchy issues: where the (intermediate) data are at, over the course of the computation
– Pervasive multimedia sensing: processing & querying must be pushed out of the data center to where the sensors are at
I know where it’s at, man!
Hierarchy-Savvy Parallel Algorithm Design (HI-SPADE) project
Goal: Support a hierarchy-savvy model of computation for parallel algorithm design
Hierarchy-savvy:
– Hide what can be hid
– Expose what must be exposed
– Sweet-spot between ignorant and fully aware
Support:
– Develop the compilers, runtime systems, architectural features, etc. to realize the model
– Important component: fine-grain threading
IrisNet’s Two-Tier Architecture
Two components:
– SAs: sensor feed processing
– OAs: distributed database
[Figure: two-tier diagram. Labels: User; Query; Web Server for the url; OAs, each with an XML database; SAs, each running senselets over attached Sensors; Sensornet.]
Jeannette Wing (CMU/NSF)
NSF Plans for Supporting Data-Intensive Computing
Google/IBM Data Center
– ~2000 processors, large Hadoop cluster
– Allocate in units of rack weeks
– NSF will review proposals for use: Cluster Exploratory (CluE)
– Running Xen; Won’t open up performance monitoring
– Goal: Show applicable outside of computer science
Academic-Industry-Government partnership
Randy Bryant (CMU)
Big Data Computing Study Group
Collection of ~20 people (looking for volunteers)
Goals:
– Fostering educational activities
– Advocacy
– Building community
CCC’s Big Data Computing Study Group seeks to foster collaborations between industry, academia, and the U.S. government to advance the state of the art in the development and application of large-scale computing systems for making intelligent use of the massive amounts of data being generated in science, commerce, and society.