The Pan-STARRS Data Challenge
Jim Heasley
Institute for Astronomy
University of Hawaii
What is Pan-STARRS?
• Pan-STARRS - a new telescope facility
• 4 smallish (1.8m) telescopes, but with an extremely wide field of view
• Can scan the sky rapidly and repeatedly, and can detect very faint objects
  – Unique time-resolution capability
• Project led by the IfA with help from the Air Force, the Maui High Performance Computing Center, and MIT’s Lincoln Lab
• The prototype, PS1, will be operated by an international consortium
Pan-STARRS Overview
• Time domain astronomy
  – Transient objects
  – Moving objects
  – Variable objects
• Static sky science
  – Enabled by stacking repeated scans to form a collection of ultra-deep static sky images
• Pan-STARRS observatory specifications
  – Four 1.8m R-C telescopes + corrector
  – 7 square degree FOV – 1.4 Gpixel cameras
  – Sited in Hawaii
  – AΩ = 50 m²deg² (étendue)
  – R ~ 24 in 30 s integration
  – → 7000 square deg/night
  – All sky + deep field surveys in g,r,i,z,y
The Published Science Products Subsystem
[Architecture diagram: photons from the Telescope reach the Gigapixel Camera; images flow to the Image Processing Pipeline (IPP); detection records flow from the IPP to the Moving Object Processing System (MOPS) and to the Object Data Manager (ODM); MOPS feeds the Solar System Data Manager (SSDM); the SSDM and ODM sit behind the Data Retrieval Layer (DRL) and the Web-Based Interface (WBI), which serve End Users. The SSDM, ODM, DRL, and WBI make up the Published Science Products Subsystem (PSPS).]
Front of the Wave
• Pan-STARRS is only the first of a new generation of astronomical data programs that will generate such large volumes of data:
  – SkyMapper, southern hemisphere optical
  – VISTA, southern hemisphere IR survey
  – LSST, an all-sky survey like Pan-STARRS
• Eventually, these data sets will be useful for data mining.
PS1 Data Products
• Detections—measurements obtained directly from processed image frames
  – Detection catalogs
  – Source catalogs from “stacks” of the sky images
  – Difference catalogs (see the sketch after this list)
    • High significance (> 5σ transient events)
    • Low significance (transients between 3σ and 5σ)
  – Other image stacks (Medium Deep Survey)
• Objects—aggregates derived from detections
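To make the thresholds concrete, here is a minimal Python sketch of the difference-catalog split, assuming significance is simply the flux-to-error ratio; the function name is illustrative, not part of the PS1 pipeline:

```python
def classify_transient(flux, flux_err):
    """Bucket a difference-image detection by its significance in sigma."""
    sigma = flux / flux_err
    if sigma > 5.0:
        return "high-significance"    # catalogued as a transient event
    if sigma >= 3.0:
        return "low-significance"     # kept, between 3 and 5 sigma
    return None                       # below threshold, not catalogued

print(classify_transient(12.0, 2.0))  # -> high-significance
```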
What’s the Challenge?
• At first blush, this looks pretty much like the Sloan Digital Sky Survey…
• BUT
  – Size – over its 3-year mission, PS1 will record over 150 billion detections for approximately 5.5 billion sources
  – Dynamic nature – new data will always be coming into the database system, both for things we’ve seen before and for new discoveries
How to Approach This Challenge
• There are many possible approaches to dealing with this data challenge.
• Shared what?
  – Memory
  – Disk
  – Nothing
• Not all of these approaches are created equal, in either cost or performance (DeWitt & Gray, 1992, “Parallel Database Systems: The Future of High Performance Database Processing”).
Conversation with the Pan-STARRS Project Manager
• Jim: Tom, what are we going to do if the solution proposed by TBJD is more than you can afford?
• Tom: Jim, I’m sure you’ll think of something!
• Not long after that, TBJD did give us a hardware/software plan we couldn’t afford. Soon afterwards, Tom resigned from the project to pursue other activities…
• The Pan-STARRS project teamed up with Alex and his database team at JHU
Building upon the SDSS Heritage
• In teaming up with the group at JHU we hoped to build upon the experience and software developed for the SDSS.
• A key question was how we could scale the system to deal with the volume of data expected from PS1 (> 10X SDSS in the first year alone).
• The second key question was whether the system could keep up with the data flow.
• The heritage is more one of philosophy than of recycled software; to deal with the challenges posed by PS1 we have had to write a great deal of new code.
The Object Data Manager
• The Object Data Manager (ODM) was considered to be the “long pole” in the development of the PS1 PSPS.
• Parallel database systems can provide both data redundancy and the ability to spread very large tables that can’t fit on a single machine across multiple storage volumes.
• For PS1 (and beyond) we need both.
Distributed Architecture
• The bigger tables will be spatially partitioned across servers called Slices
• Using slices improves system scalability
• Tables are sliced into ranges of ObjectID, which correspond to broad declination ranges
• ObjectID boundaries are selected so that each slice has a similar number of objects (sketched below)
• Distributed Partitioned Views “glue” the data together
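A minimal Python sketch of the boundary selection, assuming boundaries are taken as quantiles of the sorted ObjectIDs (names and method are illustrative, not the production code):

```python
import numpy as np

def slice_boundaries(object_ids, n_slices):
    """Pick ObjectID cut points so each slice holds a similar number
    of objects. Because the zone (declination band) occupies the most
    significant part of the ID, the cuts are declination ranges."""
    ids = np.sort(np.asarray(object_ids))
    return [ids[len(ids) * k // n_slices] for k in range(1, n_slices)]

# Demo on a small random sample standing in for the real ID table.
sample = np.random.randint(0, 2**40, size=1_000_000)
print(slice_boundaries(sample, 16))   # 15 cut points -> 16 slices
```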
Design Decisions: ObjID
• Objects have their positional information encoded in their objID
  – fGetPanObjID (ra, dec, zoneH)
  – ZoneID is the most significant part of the ID
  – objID is the Primary Key
• Objects are organized (clustered index) so that objects nearby in the sky are also stored nearby on disk (see the sketch below)
• This gives good search performance, spatial functionality, and scalability
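A sketch of the idea behind fGetPanObjID: pack the zone (a declination band) into the most significant bits so that sorting on the primary key clusters sky neighbors together on disk. The bit layout and zone height here are assumptions for illustration, not the actual PS1 encoding:

```python
def pan_obj_id(ra_deg, dec_deg, zone_height_deg=0.008):
    """Illustrative objID: zoneID in the high bits, RA in the low bits.
    Rows clustered on this key land near their sky neighbors on disk."""
    zone_id = int((dec_deg + 90.0) / zone_height_deg)  # declination band
    ra_part = int(ra_deg / 360.0 * 2**32)              # position within band
    return (zone_id << 32) | ra_part

# Two nearby positions produce nearby keys:
print(pan_obj_id(150.1000, 2.2), pan_obj_id(150.1001, 2.2))
```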
Pan-STARRS Data Flow
[Diagram: the Pan-STARRS Science Cloud. Behind the cloud, data valet workflows move data in one direction (except for error recovery): the Telescope feeds the Image Processing Pipeline (IPP), which emits CSV files; Load workflows ingest them into load DBs; Merge workflows fold the load DBs into the cold and warm slice DBs; Flip workflows promote updated warm slices to hot slice DBs behind the MainDB distributed view. Slice fault-recovery workflows handle error recovery, and validation exceptions raise notifications. On the user-facing side, astronomers (data consumers) run queries and workflows through the CASJobs query service, with MyDB for results. Admin and load-merge machines sit behind the cloud; production machines serve users.]
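The one-directional flow behind the cloud can be summarized in a toy Python sketch: batches load into a load DB, merge into the offline (warm) slice copy, and a flip swaps warm and hot so user queries see the merged data (cold copies, kept for recovery, are omitted here). All names and structures are illustrative:

```python
from collections import deque

def data_valet_cycle(csv_batches, slices):
    """slices: dict slice_id -> {'hot': set, 'warm': set} of detections.
    Data moves one way: load -> merge into warm -> flip warm/hot."""
    load_db = deque(csv_batches)                    # Load workflow
    while load_db:
        batch = load_db.popleft()
        s = slices[batch["slice_id"]]
        s["warm"].update(batch["detections"])       # Merge workflow (offline copy)
    for s in slices.values():                       # Flip workflow: queries now
        s["hot"], s["warm"] = s["warm"], s["hot"]   # see the merged data
        s["warm"].update(s["hot"])                  # re-sync the offline copy

slices = {1: {"hot": set(), "warm": set()}}
data_valet_cycle([{"slice_id": 1, "detections": {101, 102}}], slices)
print(slices[1]["hot"])   # {101, 102}
```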
Pan-STARRS Data Layout
[Diagram: two head nodes serve the main distributed view over 8 slice nodes holding the 16 slice partitions S1–S16; each node stores hot copies of some slices and warm copies of slices whose hot copies live on other nodes. Cold copies (L1 and L2 data) reside on the 6 load-merge nodes, which ingest CSV files from the Image Processing Pipeline.]
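One way to read the layout: hot and warm copies of the same slice must live on different nodes, so losing one machine never removes both online copies. A toy assignment under a simple offset scheme (the real PS1 mapping may differ):

```python
def layout(n_slices=16, n_nodes=8):
    """Map each slice's hot and warm copies to distinct nodes:
    hot copies round-robin, warm copies shifted by one node."""
    placement = {}
    for s in range(n_slices):
        hot = s % n_nodes
        warm = (hot + 1) % n_nodes    # a neighbor holds the replica
        placement[f"S{s + 1}"] = {"hot": f"node{hot + 1}",
                                  "warm": f"node{warm + 1}"}
    return placement

for slice_id, nodes in layout().items():
    print(slice_id, nodes)
```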
The ODM Infrastructure
• Much of our software development has gone into extending the ingest pipeline developed for SDSS.
• Unlike SDSS, we don’t have “campaign” loads but a steady stream of data from the telescope through the Image Processing Pipeline to the ODM.
• We have constructed data workflows to deal with both the regular data flow into the ODM and the anticipated failure modes (lost disks, RAID volumes, and various server nodes); a sketch of the recovery idea follows.
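A minimal sketch of fault recovery for a lost slice copy, assuming a cold snapshot plus a log of merges applied since it was taken (entirely illustrative, not the production workflow):

```python
def recover_slice(slice_id, cold_snapshots, merge_log, live):
    """Rebuild a failed slice replica: restore the cold snapshot,
    then replay merges logged after the snapshot's version."""
    version, data = cold_snapshots[slice_id]       # last good cold copy
    data = set(data)                               # work on a fresh copy
    for v, batch in merge_log:
        if v > version:
            data |= batch                          # replay missed merges
    live[slice_id] = data                          # slice is back online
    return data

cold = {1: (3, {101, 102})}                        # snapshot at version 3
log = [(3, {102}), (4, {103}), (5, {104})]         # merge history by version
live = {}
print(recover_slice(1, cold, log, live))           # {101, 102, 103, 104}
```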
Pan-STARRS Object Data Manager Subsystem
[Diagram: Pan-STARRS cloud services for astronomers, showing data flows and control flows between the components below.]
• System & administration workflows – orchestrate all cluster changes, such as data loading or fault tolerance; surfaced through the System Operation, System Health Monitor, and Query Performance UIs
• Configuration, health & performance monitoring – cluster deployment and operations; internal data flow and state logging; tools for supporting workflow authoring and execution
• Loaded astronomy databases – ~70TB transfer/week
• Deployed astronomy databases – ~70TB storage/year
• Query Manager – science queries, with MyDB for results
• Image Processing Pipeline – extracts objects like stars and galaxies from Pan-STARRS telescope images; ~1TB input/week
What Next?
• Will this approach scale to our needs?
  – PS1 – yes. But we already see the need for better parallel-processing query plans.
  – PS4 – unclear! Even though I’m not from Missouri, “show me!” One year of PS4 produces more data volume than the entire 3-year PS1 mission!
• Cloud computing?
  – How can we test issues like scalability without actually building the system?
  – Does each project really need its own data center?
  – Having these databases “in the cloud” may greatly facilitate data sharing/mining.
Finally, Thanks
• To Alex for stepping in, hosting the development system at JHU, and building up his core team to construct the ODM, especially
  – Maria Nieto-Santisteban
  – Richard Wilton
  – Susan Werner
• And at Microsoft to
  – Michael Thomassy
  – Yogesh Simmhan
  – Catharine van Ingen