Astronomy’s Big Data Challenges Juan de Dios Santander Vela (IAA-CSIC)

VO Course 10: Big data challenges in astronomy


DESCRIPTION

How future astronomy projects will generate enormous amounts of data, and what that means for astronomical data processing. Part of the Virtual Observatory course by Juan de Dios Santander Vela, as taught in the MTAF (Métodos y Técnicas Avanzadas en Física, Advanced Methods and Techniques in Physics) Master's programme at the University of Granada (UGR).


Page 1: VO Course 10: Big data challenges in astronomy

Astronomy’s Big Data Challenges

Juan de Dios Santander Vela (IAA-CSIC)

Page 2: VO Course 10: Big data challenges in astronomy

Overview

What is, exactly, big data?

What are the dimensions of big data?

What are the big data drivers in astronomy?

How can we deal with big data?

VO tools for dealing with big data

Page 3: VO Course 10: Big data challenges in astronomy

What exactly is Big Data?

Data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.

WIKIPEDIA: “BIG DATA”

Page 4: VO Course 10: Big data challenges in astronomy

What exactly is Big Data?

Big Data is data with at least one Big dimension

Bandwidth

Number of individual assets

Size of individual assets

Response speed

Page 5: VO Course 10: Big data challenges in astronomy

Big Data

Size

Storage

Access techniques

Processing techniques

Flow

Real time

Event Processing

Offline

Data mining

Processing level

Raw Data

Processed Data

Statistics

Schemata

Structured

Tagging

Unstructured

Value

Files

Formats

Durability

Parallel Access

Capabilities

Information Extracted

Tech Debt

Page 6: VO Course 10: Big data challenges in astronomy

Next big data projects in astronomy

Page 7: VO Course 10: Big data challenges in astronomy

Large Synoptic Survey Telescope

Page 8: VO Course 10: Big data challenges in astronomy

The Large Synoptic Survey Telescope Camera

Steven M. Kahn

Stanford/SLAC

(for the LSST Consortium)

Page 9: VO Course 10: Big data challenges in astronomy

LSST Data Rates

* 2.3 billion pixels read out in less than 2 sec, every 12 sec

* 1 pixel = 2 Bytes (raw)

* Over 3 GBytes/sec peak raw data from camera

* Real-time processing and transient detection: < 10 sec

* Dynamic range: 4 Bytes / pixel

* > 0.6 GB/sec average in pipeline

* 5000 floating point operations per pixel

* 2 TFlop/s average, 9 TFlop/s peak

* ~ 18 TBytes/night
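The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below is only a back-of-envelope estimate built from the numbers quoted above, assuming 1 GB = 10^9 bytes; the function and constant names are illustrative. The naive per-image estimates land in the same order of magnitude as the quoted rates but do not reproduce them exactly (for instance, about 2.3 GB/s as a lower bound against the quoted "over 3 GBytes/sec" peak), so it should not be read as the derivation used by the LSST project.

```python
# Back-of-envelope check of the LSST figures quoted on the slide.
# All inputs come from the slide; decimal units (1 GB = 1e9 bytes) are an assumption.
PIXELS = 2.3e9                # pixels per readout
RAW_BYTES_PER_PIXEL = 2       # raw sample size
PIPELINE_BYTES_PER_PIXEL = 4  # dynamic range used in the pipeline
READOUT_S = 2.0               # upper bound on the readout time
CADENCE_S = 12.0              # one readout every 12 seconds
FLOP_PER_PIXEL = 5000         # floating point operations per pixel
NIGHTLY_BYTES = 18e12         # ~18 TBytes per night


def lsst_estimates():
    """Return naive rate estimates derived from the slide's numbers."""
    raw_image_bytes = PIXELS * RAW_BYTES_PER_PIXEL
    return {
        # Size of one raw readout
        "raw_image_GB": raw_image_bytes / 1e9,
        # Lower bound on the camera rate, since readout takes less than 2 s
        "readout_rate_GB_s": raw_image_bytes / READOUT_S / 1e9,
        # Average pipeline rate if every readout is expanded to 4 bytes/pixel
        "pipeline_rate_GB_s": PIXELS * PIPELINE_BYTES_PER_PIXEL / CADENCE_S / 1e9,
        # Sustained compute if each pixel costs 5000 operations per cadence
        "sustained_TFlop_s": PIXELS * FLOP_PER_PIXEL / CADENCE_S / 1e12,
        # Number of raw readouts that fit in the nightly volume
        "raw_images_per_night": NIGHTLY_BYTES / raw_image_bytes,
    }


if __name__ == "__main__":
    for name, value in lsst_estimates().items():
        print(f"{name}: {value:.2f}")
```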

Page 10: VO Course 10: Big data challenges in astronomy

Relative Survey Power

Page 11: VO Course 10: Big data challenges in astronomy

Square Kilometre Array

Page 12: VO Course 10: Big data challenges in astronomy

Signal Transport & Processing

Page 13: VO Course 10: Big data challenges in astronomy

Signal Transport & Processing

DESIGN COUNTS!

Page 14: VO Course 10: Big data challenges in astronomy

Massive Data Flow, Storage & Processing

18 PB/YEAR

Antenna & Front End Systems

Correlation

Data Product Generation

Long Term Storage

High Availability Storage / DB

On-Demand Processing

STORAGE? CAN’T STORE IT!

1 DAY STREAM = 150 DAYS GLOBAL INTERNET TRAFFIC

800 PB Temporary Storage

Page 15: VO Course 10: Big data challenges in astronomy

Massive Data Flow, Storage & Processing

PROCESSING NEEDS

10^9 TOP RANGE PCS > 1 EXAFLOP/S

30 PETAFLOP/S

Antenna & Front End Systems

Correlation

Data Product Generation

Long Term Storage

High Availability Storage / DB

On-Demand Processing

Temporary Storage

Page 16: VO Course 10: Big data challenges in astronomy

Massive Data Flow, Storage & Processing

7 PB/S

> 300 GB/S

BANDWIDTH

TYPICAL SURVEY, 5 DAYS READ TIME @ 10 GB/SEC

Antenna & Front End Systems

Correlation

Data Product Generation

Long Term Storage

High Availability Storage / DB

On-Demand Processing

Temporary Storage
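The "5 days read time @ 10 GB/sec" note implies a survey size that is easy to work out. The sketch below is a rough calculation, assuming that "GB" means 10^9 bytes (gigabytes, not gigabits) and that the read is sustained at a constant rate; the helper names are illustrative only.

```python
SECONDS_PER_DAY = 86_400


def volume_read_bytes(days, rate_bytes_per_s):
    """Bytes transferred when reading at a constant rate for a number of days."""
    return days * SECONDS_PER_DAY * rate_bytes_per_s


def read_time_s(volume_bytes, rate_bytes_per_s):
    """Seconds needed to read a volume at a constant rate."""
    return volume_bytes / rate_bytes_per_s


if __name__ == "__main__":
    # 5 days at 10 GB/s, as quoted on the slide
    survey_bytes = volume_read_bytes(5, 10e9)
    print(f"Implied survey size: {survey_bytes / 1e15:.2f} PB")

    # Same volume at the 300 GB/s rate that also appears on the slide
    hours = read_time_s(survey_bytes, 300e9) / 3600
    print(f"Read time at 300 GB/s: {hours:.1f} hours")
```

Under these assumptions the implied survey is a few petabytes, which is why simply reading the data back, before any processing, already dominates the time budget on a 10 GB/s link.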

Page 17: VO Course 10: Big data challenges in astronomy

MASSIVE DATA FLOW, STORAGE & PROCESSING

Antenna & Front End Systems

Correlation

[Bar chart: bandwidth in TB/s for ALMA and LOFAR; axis from 0 to 40]

Page 18: VO Course 10: Big data challenges in astronomy

MASSIVE DATA FLOW, STORAGE & PROCESSING

Antenna & Front End Systems

Correlation

[Bar chart: bandwidth in TB/s for ALMA and LOFAR; axis from 0 to 40]

[Bar chart: bandwidth in TB/s for LOFAR and ASKAP; axis from 0 to 70]

Page 19: VO Course 10: Big data challenges in astronomy

MASSIVE DATA FLOW, STORAGE & PROCESSING

Antenna & Front End Systems

Correlation

[Bar chart: bandwidth in TB/s for ALMA and LOFAR; axis from 0 to 40]

[Bar chart: bandwidth in TB/s for LOFAR and ASKAP; axis from 0 to 70]

Page 20: VO Course 10: Big data challenges in astronomy

MASSIVE DATA FLOW, STORAGE & PROCESSING

Correlation

[Bar chart: processing in TFlop/s for VLA and ALMA; axis from 0 to 0.002]

Page 21: VO Course 10: Big data challenges in astronomy

MASSIVE DATA FLOW, STORAGE & PROCESSING

Correlation

[Bar chart: processing in TFlop/s for VLA and ALMA; axis from 0 to 0.002]

[Bar chart: processing in TFlop/s for ALMA and LOFAR; axis from 0 to 120]

Page 22: VO Course 10: Big data challenges in astronomy

MASSIVE DATA FLOW, STORAGE & PROCESSING

Correlation

[Bar chart: processing in TFlop/s for VLA and ALMA; axis from 0 to 0.002]

[Bar chart: processing in TFlop/s for ALMA and LOFAR; axis from 0 to 120]

[Bar chart: processing in TFlop/s for LOFAR and ASKAP; axis from 0 to 350]

Page 23: VO Course 10: Big data challenges in astronomy

MASSIVE DATA FLOW, STORAGE & PROCESSING

Correlation

[Bar chart: processing in TFlop/s for VLA and ALMA; axis from 0 to 0.002]

[Bar chart: processing in TFlop/s for ALMA and LOFAR; axis from 0 to 120]

[Bar chart: processing in TFlop/s for LOFAR and ASKAP; axis from 0 to 350]

Page 24: VO Course 10: Big data challenges in astronomy

Comparison: LHC

Page 25: VO Course 10: Big data challenges in astronomy

CERN/IT/DB

online system: multi-level trigger

filter out background

reduce data volume from 40 TB/s to 100 MB/s

40 MHz (40 TB/sec) -> level 1 - special hardware

75 kHz (75 GB/sec) -> level 2 - embedded processors

5 kHz (5 GB/sec) -> level 3 - PCs

100 Hz (100 MB/sec) -> data recording & offline analysis
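The trigger cascade is easier to appreciate as a set of reduction factors. The sketch below assumes the reading in which each quoted rate is the flow entering the next stage (40 TB/s into level 1, 75 GB/s into level 2, 5 GB/s into level 3, and 100 MB/s to recording), with decimal units. The stage labels and function names are illustrative, not taken from any CERN software.

```python
# (label, event rate in Hz, data rate in bytes/s), as quoted on the slide.
# Assumption: each rate is the flow between two consecutive trigger levels.
STAGES = [
    ("detector output", 40e6, 40e12),
    ("after level 1", 75e3, 75e9),
    ("after level 2", 5e3, 5e9),
    ("after level 3", 100.0, 100e6),
]


def reduction_factors(stages):
    """Data-rate reduction achieved by each trigger level."""
    return [
        (nxt[0], prev[2] / nxt[2])
        for prev, nxt in zip(stages, stages[1:])
    ]


def total_reduction(stages):
    """Overall reduction from detector output to recorded data."""
    return stages[0][2] / stages[-1][2]


def event_size_bytes(stage):
    """Average event size implied by a stage's data rate and event rate."""
    _, event_rate_hz, data_rate = stage
    return data_rate / event_rate_hz


if __name__ == "__main__":
    for label, factor in reduction_factors(STAGES):
        print(f"{label}: data rate reduced by a factor of {factor:.0f}")
    print(f"total reduction: {total_reduction(STAGES):.0f}")
    print(f"implied event size: {event_size_bytes(STAGES[0]) / 1e6:.1f} MB")
```

Under this reading every stage implies the same event size, so the triggers reduce volume by discarding whole events rather than by compressing them.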

Page 26: VO Course 10: Big data challenges in astronomy

CERN/IT/DB

Event Filter & Reconstruction

(figures are for one experiment)

switch

data from detector - event builder

high speed network

computer farm

tape and disk servers

raw data

summary data

input: 5-100 GB/sec

capacity: 50K SI95 (~4K 1999 PCs)

recording rate: 100 MB/sec (Alice – 1 GB/sec)

+ 1-1.25 PetaByte/year

+ 1-500 TB/year

20,000 Redwood cartridges every year (+ copy)

Page 27: VO Course 10: Big data challenges in astronomy

Dealing with Big Data

Page 28: VO Course 10: Big data challenges in astronomy

Dealing with Big Data

We cannot allow for arbitrary queries

We can have arbitrary processing instead

We cannot allow full data dumps

We can generate data on the fly (see above)

Page 29: VO Course 10: Big data challenges in astronomy

Queries as functions

QUERY = FUNCTION(DATA)

QUERIES NEED TO BE PRECOMPUTED

ARBITRARY QUERIES ONLY POSSIBLE ON THE PRECOMPUTED, SMALLER DATA SETS

Page 30: VO Course 10: Big data challenges in astronomy

Queries as functions

QUERY = FUNCTION(ALL DATA)

QUERIES NEED TO BE PRECOMPUTED

ARBITRARY QUERIES ONLY POSSIBLE ON THE PRECOMPUTED, SMALLER DATA SETS
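The idea that a query is a function over all the data, and that it has to be precomputed, can be illustrated with a toy example. In the sketch below, the detection records, field names, and function names are all hypothetical; it only shows that a full scan and a lookup in a precomputed view answer the same question, while the view is much smaller than the data it summarises.

```python
# Hypothetical master data: one record per detection.
DETECTIONS = [
    {"source": "src-1", "band": "g", "flux": 1.0},
    {"source": "src-1", "band": "g", "flux": 3.0},
    {"source": "src-1", "band": "r", "flux": 2.0},
    {"source": "src-2", "band": "g", "flux": 5.0},
]


def mean_flux_full_scan(all_data, source, band):
    """Query as a function over ALL the data: scans every record."""
    fluxes = [
        rec["flux"]
        for rec in all_data
        if rec["source"] == source and rec["band"] == band
    ]
    return sum(fluxes) / len(fluxes)


def precompute_view(all_data):
    """Precomputation: one pass over all the data yields a small view."""
    view = {}
    for rec in all_data:
        key = (rec["source"], rec["band"])
        count, total = view.get(key, (0, 0.0))
        view[key] = (count + 1, total + rec["flux"])
    return view


def mean_flux_from_view(view, source, band):
    """Cheap query answered from the precomputed view only."""
    count, total = view[(source, band)]
    return total / count


if __name__ == "__main__":
    view = precompute_view(DETECTIONS)
    print(mean_flux_full_scan(DETECTIONS, "src-1", "g"))
    print(mean_flux_from_view(view, "src-1", "g"))
```

Only questions that the view was designed for can be answered this way, which is the trade-off the slide points at: arbitrary queries remain possible, but only on the precomputed, smaller data sets.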

Page 31: VO Course 10: Big data challenges in astronomy

Lambda Architecture

Batch Layer: STORE MASTER DATASET, COMPUTE ARBITRARY VIEWS

Serving Layer: RANDOM ACCESS TO VIEWS, UPDATED BY BATCH LAYER

Speed Layer: FAST, INCREMENTAL ALGOS., QUERIES NOT ON BATCH LAYER, COMPENSATES FOR LATENCY

Page 32: VO Course 10: Big data challenges in astronomy

Batch Layer

Stores master copy of the dataset

Precomputes batch views on that master dataset

IMMUTABLE, CONSTANTLY GROWING

Page 33: VO Course 10: Big data challenges in astronomy

Batch Layer

All Data Batch Layer

View 1

View 2

…

View n

NEW DATA

UPDATED VIEWS

TYPICALLY, MAP/REDUCE
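A minimal sketch of this batch layer follows, assuming an in-memory, append-only list as a stand-in for the master dataset. The class, the two view functions, and the record fields are hypothetical; a real deployment would use a distributed store and a Map/Reduce framework rather than the built-in `map` and `functools.reduce`.

```python
from functools import reduce


class MasterDataset:
    """Append-only store: records are added, never updated or deleted."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(dict(record))  # keep a private copy

    def snapshot(self):
        return tuple(self._records)


def merge_sums(a, b):
    """Reduce step: merge two partial dictionaries by summing their values."""
    merged = dict(a)
    for key, value in b.items():
        merged[key] = merged.get(key, 0) + value
    return merged


def detections_per_night(records):
    """View 1: map each record to {night: 1}, then reduce by summing."""
    return reduce(merge_sums, map(lambda r: {r["night"]: 1}, records), {})


def flux_per_source(records):
    """View 2: map each record to {source: flux}, then reduce by summing."""
    return reduce(merge_sums, map(lambda r: {r["source"]: r["flux"]}, records), {})


VIEW_FUNCTIONS = {
    "detections_per_night": detections_per_night,
    "flux_per_source": flux_per_source,
}


def run_batch(master):
    """Recompute every view from scratch over all the data."""
    records = master.snapshot()
    return {name: fn(records) for name, fn in VIEW_FUNCTIONS.items()}


if __name__ == "__main__":
    master = MasterDataset()
    master.append({"night": 1, "source": "a", "flux": 1.5})
    master.append({"night": 1, "source": "b", "flux": 2.0})
    print(run_batch(master))
    master.append({"night": 2, "source": "a", "flux": 0.5})  # new data
    print(run_batch(master))                                   # updated views
```

Recomputing from scratch looks wasteful, but it keeps each view a pure function of the master dataset, so a buggy view can always be fixed by rerunning the batch.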

Page 34: VO Course 10: Big data challenges in astronomy

Serving Layer

Allows for:

batch writes of view updates

random reads on the views

Does not allow random writes

Page 35: VO Course 10: Big data challenges in astronomy

Speed Layer

Allows for:

incremental writes of view updates

short-term temporal queries on the views

Can be discarded!
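The reason the speed layer can be discarded is that it only covers data the batch layer has not processed yet. The sketch below is a simplified illustration with hypothetical names and a per-source detection count as the only view: a query adds the batch view and the realtime view, and once the batch view has been recomputed the realtime view is thrown away without changing the answer.

```python
def batch_count_view(records):
    """Batch view: detections per source, recomputed over all records."""
    view = {}
    for rec in records:
        view[rec["source"]] = view.get(rec["source"], 0) + 1
    return view


class SpeedLayer:
    """Holds an incremental view of records not yet seen by the batch layer."""

    def __init__(self):
        self.realtime_view = {}

    def update(self, record):
        # Incremental write: touch only the affected key
        key = record["source"]
        self.realtime_view[key] = self.realtime_view.get(key, 0) + 1

    def discard(self):
        # Safe once the batch view has absorbed the same records
        self.realtime_view = {}


def query_count(batch_view, speed_layer, source):
    """Answer a query by combining the batch and realtime views."""
    return batch_view.get(source, 0) + speed_layer.realtime_view.get(source, 0)


if __name__ == "__main__":
    master = [{"source": "a"}, {"source": "b"}]
    batch_view = batch_count_view(master)
    speed = SpeedLayer()

    new_record = {"source": "a"}
    master.append(new_record)   # the master dataset keeps everything
    speed.update(new_record)    # the speed layer covers the gap
    print(query_count(batch_view, speed, "a"))

    batch_view = batch_count_view(master)  # batch run catches up
    speed.discard()
    print(query_count(batch_view, speed, "a"))
```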

Page 36: VO Course 10: Big data challenges in astronomy

Figure 2.1 The master dataset in the Lambda Architecture serves as the source of truth of your Big Data system. Errors at the serving and speed layers can be corrected, but corruption at the master dataset is irreparable.

The master dataset is the only part of the Lambda Architecture that absolutely must be safeguarded from corruption. Overloaded machines, failing disks, and power outages all could cause errors, and human error with dynamic data systems is an intrinsic risk and inevitable eventuality. You must carefully engineer the master dataset to prevent corruption in all these cases, as fault tolerance is essential to the health of a long running data system.

There are two components to the master dataset: the data model to use, and how to physically store it. This chapter is about designing a data model for the master dataset and the properties such a data model should have. You will learn about physically storing a master dataset in the next chapter.

To provide a roadmap for your undertaking, you will

learn the key properties of data

see how these properties are maintained in the fact-based model

examine the advantages of the fact-based model for the master dataset

©Manning Publications Co. Please post comments or corrections to the Author Online forum: http://www.manning-sandbox.com/forum.jspa?forumID=787


Page 37: VO Course 10: Big data challenges in astronomy

Computing over Big Data

Batch layer as a computational engine on data

Need to formally specify

Inputs

Processes

Outputs

THAT LOOKS LIKE A WORKFLOW! OR SQL QUERYING
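One way to picture a formal specification of inputs, processes, and outputs is a small declarative workflow. The sketch below is purely illustrative and is not a Virtual Observatory or workflow-engine API: the step format and `run_workflow` are hypothetical, and the three steps compute a mean and a standard deviation.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def squared_residuals(values, centre):
    return [(centre - v) ** 2 for v in values]


def root_mean(values):
    return sqrt(sum(values) / len(values))


# Each step names its inputs, the process to run, and the output it produces.
WORKFLOW = [
    {"inputs": ["v"], "process": mean, "output": "mean_v"},
    {"inputs": ["v", "mean_v"], "process": squared_residuals, "output": "res2_v"},
    {"inputs": ["res2_v"], "process": root_mean, "output": "stdev"},
]


def run_workflow(steps, inputs):
    """Run the steps in order, resolving each input by name."""
    store = dict(inputs)
    for step in steps:
        args = [store[name] for name in step["inputs"]]
        store[step["output"]] = step["process"](*args)
    return store


if __name__ == "__main__":
    result = run_workflow(WORKFLOW, {"v": [1.0, 2.0, 3.0, 4.0]})
    print(result["mean_v"], result["stdev"])
```

Because the specification is data rather than code, the system that owns the data can inspect it, schedule it close to the storage, and decide which steps to run in parallel.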

Page 38: VO Course 10: Big data challenges in astronomy

Map/Reduce

Page 39: VO Course 10: Big data challenges in astronomy

Map/Reduce

    from functools import reduce  # reduce is not a builtin in Python 3
    from math import sqrt
    from random import normalvariate

    def res2(x):
        # Squared residual of x with respect to the global mean
        return pow(mean_v - x, 2.)

    # Random vector, mean 1, stdev 0.001
    v = [normalvariate(1, 0.001) for x in range(0, 1000000)]
    mean_v = reduce(lambda x, y: x + y, v) / len(v)

    res2_v = map(res2, v)

    stdev = sqrt(reduce(lambda x, y: x + y, res2_v) / len(v))
    print(mean_v, stdev)

PARALLELISABLE!

Page 40: VO Course 10: Big data challenges in astronomy

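Map/Reduce

    from functools import partial, reduce
    from math import sqrt
    from multiprocessing import Pool
    from random import normalvariate

    def res2(mean, x):
        # Squared residual; the mean is passed in so workers need no globals
        return pow(mean - x, 2.)

    if __name__ == "__main__":  # required where worker processes are spawned
        # Random vector, mean 1, stdev 0.001
        v = [normalvariate(1, 0.001) for x in range(0, 1000000)]
        mean_v = reduce(lambda x, y: x + y, v) / len(v)
        pool = Pool(processes=4)
        res2_v = pool.map(partial(res2, mean_v), v)
        pool.close()
        pool.join()
        stdev = sqrt(reduce(lambda x, y: x + y, res2_v) / len(v))
        print(mean_v, stdev)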

ONLY FOR MAP, BUT REDUCE ALSO PARALLELISABLE
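The reduce step can be parallelised too, because addition is associative: each worker sums one chunk, and the partial sums are then combined. The sketch below is one possible way to do it, not the slide author's code. `chunked` and `parallel_sum` are illustrative names, and the `mapper` argument defaults to the sequential built-in `map` so that the example runs anywhere.

```python
from functools import reduce
from operator import add


def chunked(seq, n_chunks):
    """Split seq into at most n_chunks contiguous slices of similar size."""
    size = max(1, -(-len(seq) // n_chunks))  # ceiling division
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def parallel_sum(seq, n_chunks=4, mapper=map):
    """Sum seq by reducing each chunk separately, then reducing the partials.

    mapper is any map-like callable; passing a process pool's map method
    distributes the per-chunk sums across worker processes.
    """
    partials = list(mapper(sum, chunked(seq, n_chunks)))
    return reduce(add, partials, 0.0)


if __name__ == "__main__":
    from multiprocessing import Pool

    data = [float(i) for i in range(1_000_000)]
    with Pool(processes=4) as pool:
        print(parallel_sum(data, n_chunks=4, mapper=pool.map))
```

The same pattern applies to the sum of squared residuals in the standard deviation example: only the per-chunk function changes, while the final combination of partial results stays a plain sum.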

Page 41: VO Course 10: Big data challenges in astronomy

[Line chart: Dependence of execution time on the number of pool processors. Y axis: seconds per million elements, from 0.4 to 0.8. X axis: number of pool processors, from 1 to 8. Series: 20 million, 10 million, 5 million, and 1 million elements.]

Page 42: VO Course 10: Big data challenges in astronomy

Conclusions

Big data needs different approaches

Parallelism & data-side processing

Map/Reduce as parallelism engine

Need for ways to formally specify computations