How future astronomy projects will generate enormous amounts of data, and what that means for astronomical data processing. Part of the Virtual Observatory course by Juan de Dios Santander Vela, as taught for the MTAF (Métodos y Técnicas Avanzadas en Física, Advanced Methods and Techniques in Physics) Master at the University of Granada (UGR).
Astronomy’s Big Data Challenges
Juan de Dios Santander Vela (IAA-CSIC)
Overview
What is, exactly, big data?
What are the dimensions of big data?
What are the big data drivers in astronomy?
How can we deal with big data?
VO tools for dealing with big data
What is Big Data, exactly?
Data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.
WIKIPEDIA: “BIG DATA”
What is Big Data, exactly?
Big Data is data with at least one Big dimension:
Bandwidth
Number of individual assets
Size of individual assets
Response speed
…
Big Data

Size: storage, access techniques, processing techniques
Flow: real time (event processing), offline (data mining)
Processing level: raw data, processed data (statistics), schemata (structured / tagging / unstructured)
Value: files (formats, durability, parallel access), capabilities, information extracted, tech debt
Upcoming big data projects in astronomy
Large Synoptic Survey Telescope
The Large Synoptic Survey Telescope Camera
Steven M. Kahn, Stanford/SLAC (for the LSST Consortium)
LSST Data Rates
* 2.3 billion pixels read out in less than 2 s, every 12 s
* 1 pixel = 2 bytes (raw)
* Over 3 GB/s peak raw data from the camera
* Real-time processing and transient detection: < 10 s
* Dynamic range: 4 bytes/pixel
* > 0.6 GB/s average in the pipeline
* 5,000 floating point operations per pixel
* 2 TFlop/s average, 9 TFlop/s peak
* ~18 TB/night
(These rates are sanity-checked in the sketch below.)
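The rates above can be sanity-checked in a few lines of Python. The constants come from the list itself; the night length is an assumption, and overheads (calibration frames, duty cycle, compression) are ignored, so these simple products only land in the same ballpark as the quoted figures:

pixels = 2.3e9         # pixels per readout
raw_bpp = 2            # bytes per pixel, raw
pipe_bpp = 4           # bytes per pixel at full dynamic range
readout_s = 2          # seconds per readout
cadence_s = 12         # seconds between readouts
flops_px = 5000        # floating point operations per pixel
night_s = 10 * 3600    # assumed ~10 h of observing per night

print(f"peak raw rate : {pixels * raw_bpp / readout_s / 1e9:.1f} GB/s")
print(f"avg pipeline  : {pixels * pipe_bpp / cadence_s / 1e9:.2f} GB/s")
print(f"avg processing: {pixels * flops_px / cadence_s / 1e12:.1f} TFlop/s")
print(f"raw per night : {pixels * raw_bpp / cadence_s * night_s / 1e12:.0f} TB")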
Relative Survey Power
Square Kilometre Array
Signal Transport & Processing
DESIGN COUNTS!
Massive Data Flow, Storage & Processing

Pipeline: Antenna & Front End Systems → Correlation → Temporary Storage → Data Product Generation → Long-Term Storage → High-Availability Storage / DB → On-Demand Processing

STORAGE: 18 PB/year into long-term storage, plus 800 PB of temporary storage. The raw stream can't be stored: 1 day of stream = 150 days of global internet traffic.

PROCESSING: 30 PFlop/s, with total needs beyond 1 EFlop/s (the equivalent of 10^9 top-range PCs).

BANDWIDTH: from 7 PB/s down to > 300 GB/s along the pipeline. For a typical survey, reading the data back takes 5 days at 10 GB/s (see the back-of-envelope sketch below).
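The 5-day read-time figure is easy to reproduce as a back-of-envelope. The survey size below is a hypothetical value chosen to illustrate the quoted number, not an official SKA figure:

survey_bytes = 4.3e15   # assumed survey product size (~4.3 PB), hypothetical
read_rate = 10e9        # 10 GB/s sustained read rate, from the slide
print(f"read time: {survey_bytes / read_rate / 86400:.1f} days")  # ~5 days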
0" 5" 10" 15" 20" 25" 30" 35" 40"
ALMA"
LOFAR"
Bandwidth)in)TB/s)
MASSIVE DATA FLOW, STORAGE & PROCESSING
Antenna & Front End Systems
Correlation
0" 5" 10" 15" 20" 25" 30" 35" 40"
ALMA"
LOFAR"
Bandwidth)in)TB/s)
0" 10" 20" 30" 40" 50" 60" 70"
LOFAR"
ASKAP"
Bandwidth)in)TB/s)
MASSIVE DATA FLOW, STORAGE & PROCESSING
Antenna & Front End Systems
Correlation
0" 5" 10" 15" 20" 25" 30" 35" 40"
ALMA"
LOFAR"
Bandwidth)in)TB/s)
0" 10" 20" 30" 40" 50" 60" 70"
LOFAR"
ASKAP"
Bandwidth)in)TB/s)
MASSIVE DATA FLOW, STORAGE & PROCESSING
Antenna & Front End Systems
Correlation
0" 0,0005" 0,001" 0,0015" 0,002"
VLA"
ALMA"
Processing*TFlops/s*
MASSIVE DATA FLOW, STORAGE & PROCESSING
Correlation
0" 0,0005" 0,001" 0,0015" 0,002"
VLA"
ALMA"
Processing*TFlops/s*
0" 20" 40" 60" 80" 100" 120"
ALMA"
LOFAR"
Processing*TFlops/s*
MASSIVE DATA FLOW, STORAGE & PROCESSING
Correlation
0" 0,0005" 0,001" 0,0015" 0,002"
VLA"
ALMA"
Processing*TFlops/s*
0" 20" 40" 60" 80" 100" 120"
ALMA"
LOFAR"
Processing*TFlops/s*
0" 50" 100" 150" 200" 250" 300" 350"
LOFAR"
ASKAP"
Processing*TFlops/s*
MASSIVE DATA FLOW, STORAGE & PROCESSING
Correlation
0" 0,0005" 0,001" 0,0015" 0,002"
VLA"
ALMA"
Processing*TFlops/s*
0" 20" 40" 60" 80" 100" 120"
ALMA"
LOFAR"
Processing*TFlops/s*
0" 50" 100" 150" 200" 250" 300" 350"
LOFAR"
ASKAP"
Processing*TFlops/s*
MASSIVE DATA FLOW, STORAGE & PROCESSING
Correlation
Comparison: LHC (CERN/IT/DB)

Online system: a multi-level trigger filters out background, reducing the data volume from 40 TB/s to 100 MB/s.

Detector output: 40 MHz (40 TB/s)
After level 1 (special hardware): 75 kHz (75 GB/s)
After level 2 (embedded processors): 5 kHz (5 GB/s)
After level 3 (PCs): 100 Hz (100 MB/s), into data recording & offline analysis
Event Filter & Reconstruction (figures are for one experiment):

Data from the detector → event builder (switch) → high-speed network → computer farm → tape and disk servers (raw data + summary data)

input: 5-100 GB/s
capacity: 50K SI95 (~4,000 PCs of 1999)
recording rate: 100 MB/s (ALICE: 1 GB/s)
raw data: 1-1.25 PB/year; summary data: 1-500 TB/year
20,000 Redwood cartridges every year (plus a copy)
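A quick way to read the trigger cascade is through its reduction factors. This sketch just recomputes them from the rates quoted above:

rates = [
    ("detector output", 40e12),          # 40 MHz -> 40 TB/s
    ("after level 1 (hardware)", 75e9),  # 75 kHz -> 75 GB/s
    ("after level 2 (embedded)", 5e9),   # 5 kHz  -> 5 GB/s
    ("after level 3 (PCs)", 100e6),      # 100 Hz -> 100 MB/s
]
for (stage_in, r_in), (stage_out, r_out) in zip(rates, rates[1:]):
    print(f"{stage_in} -> {stage_out}: x{r_in / r_out:,.0f}")
print(f"overall reduction: x{rates[0][1] / rates[-1][1]:,.0f}")  # x400,000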
Dealing with Big Data
We cannot allow arbitrary queries…
…but we can offer arbitrary processing instead.
We cannot allow full data dumps…
…but we can generate data on the fly (see above).
Queries as functions

QUERY = FUNCTION { ALL DATA }

QUERIES NEED TO BE PRECOMPUTED: ARBITRARY QUERIES ARE ONLY POSSIBLE ON THE PRECOMPUTED, SMALLER DATA SETS
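A minimal sketch of the idea, with toy data and hypothetical names: one expensive batch pass over the full dataset produces a small view, and arbitrary queries then run against the view alone:

# Toy records: (source, magnitude). Names are hypothetical.
all_data = [("M31", 3.4), ("M31", 3.5), ("M42", 4.0)]

# Batch pass over the full dataset: expensive, done once.
view = {}
for source, mag in all_data:
    total, n = view.get(source, (0.0, 0))
    view[source] = (total + mag, n + 1)

# Arbitrary queries now touch only the small precomputed view.
def mean_mag(source):
    total, n = view[source]
    return total / n

print(mean_mag("M31"))  # 3.45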
Lambda Architecture

Batch Layer: stores the master dataset; computes arbitrary views
Serving Layer: random access to views; updated by the batch layer
Speed Layer: fast, incremental algorithms; serves queries not covered by the batch layer; compensates for its latency
Batch Layer

Stores the master copy of the dataset (immutable, constantly growing)
Precomputes batch views on that master dataset
Batch Layer

All data + new data → batch layer → View 1, View 2, …, View n (updated views)
Typically implemented as Map/Reduce
Serving Layer
Allows for:
batch writes of view updates
random reads on the views
Does not allow random writes
Speed Layer
Allows for:
incremental writes of view updates
short-term temporal queries on the views
Can be discarded once the batch layer catches up! (see the sketch below)
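Putting the three layers together, here is a toy sketch (hypothetical names, not a real framework): the batch layer recomputes its view from the immutable master dataset, the speed layer keeps an incremental view for data the batch layer has not absorbed yet, and a query merges both:

master = [("M31", 1), ("M31", 1), ("M42", 1)]  # immutable master dataset
speed_view = {}                                # incremental, discardable

def batch_view(data):
    # Recomputed from scratch on every batch run (slow but simple).
    view = {}
    for key, n in data:
        view[key] = view.get(key, 0) + n
    return view

def new_datum(key, n):
    # Speed layer: update its view incrementally as data arrives.
    speed_view[key] = speed_view.get(key, 0) + n

def query(key, view):
    # Serving layer: merge the batch view with the speed view.
    return view.get(key, 0) + speed_view.get(key, 0)

view = batch_view(master)
new_datum("M31", 1)          # arrived after the last batch run
print(query("M31", view))    # 3 = batch view (2) + speed view (1)

Once a batch run absorbs the recent data, the corresponding speed view can be thrown away, which is why errors in the speed layer are only transient.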
Figure 2.1 The master dataset in the Lambda Architecture serves as the source of truth of your Big Data system. Errors at the serving and speed layers can be corrected, but corruption at the master dataset is irreparable.

The master dataset is the only part of the Lambda Architecture that absolutely must be safeguarded from corruption. Overloaded machines, failing disks, and power outages all could cause errors, and human error with dynamic data systems is an intrinsic risk and inevitable eventuality. You must carefully engineer the master dataset to prevent corruption in all these cases, as fault tolerance is essential to the health of a long-running data system.

There are two components to the master dataset: the data model to use, and how to physically store it. This chapter is about designing a data model for the master dataset and the properties such a data model should have. You will learn about physically storing a master dataset in the next chapter.

To provide a roadmap for your undertaking, you will:
learn the key properties of data
see how these properties are maintained in the fact-based model
examine the advantages of the fact-based model for the master dataset
(Excerpt from Nathan Marz with James Warren, “Big Data”, Manning Publications.)
Computing over Big Data

The batch layer acts as a computational engine over the data. We need to formally specify:
Inputs
Processes
Outputs

THAT LOOKS LIKE A WORKFLOW! (OR SQL QUERYING)
Map/Reduce

from functools import reduce
from math import sqrt
from random import normalvariate

def res2(x):
    return pow(mean_v - x, 2.)

# Random vector, mean 1, stdev 0.001
v = [normalvariate(1, 0.001) for x in range(0, 1000000)]
mean_v = reduce(lambda x, y: x + y, v) / len(v)
res2_v = map(res2, v)
stdev = sqrt(reduce(lambda x, y: x + y, res2_v) / len(v))
print(mean_v, stdev)

PARALLELISABLE!
Map/Reduce

from functools import reduce
from math import sqrt
from multiprocessing import Pool
from random import normalvariate

def res2(x):
    return pow(mean_v - x, 2.)

# Random vector, mean 1, stdev 0.001
v = [normalvariate(1, 0.001) for x in range(0, 1000000)]
mean_v = reduce(lambda x, y: x + y, v) / len(v)
pool = Pool(processes=4)    # workers inherit mean_v via fork (as on Linux)
res2_v = pool.map(res2, v)  # the map step, now run in parallel
pool.close()
stdev = sqrt(reduce(lambda x, y: x + y, res2_v) / len(v))
print(mean_v, stdev)

ONLY FOR MAP, BUT REDUCE ALSO PARALLELISABLE
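One common way to parallelise the reduce step, sketched under the assumption that the operation (here a sum) is associative: split the vector into chunks, reduce each chunk in a worker, then reduce the few partial results serially:

from functools import reduce
from multiprocessing import Pool
from random import normalvariate

def chunk_sum(chunk):
    return sum(chunk)  # reduce one chunk locally

if __name__ == "__main__":
    v = [normalvariate(1, 0.001) for _ in range(1000000)]
    n_chunks = 4
    size = len(v) // n_chunks
    chunks = [v[i * size:(i + 1) * size] for i in range(n_chunks)]
    with Pool(processes=4) as pool:
        partials = pool.map(chunk_sum, chunks)    # parallel partial reduces
    total = reduce(lambda x, y: x + y, partials)  # cheap final reduce
    print(total / len(v))                         # ~1.0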
[Figure: dependence of execution time on the number of pool processes. Y axis: seconds per million elements (0.4 to 0.8); X axis: number of pool processes (1 to 8); one curve each for vectors of 1, 5, 10 and 20 million elements.]
Conclusions
Big data needs different approaches
Parallelism & data-side processing
Map/Reduce as a parallelism engine
We need ways to formally specify computations
References & Links
“The Fourth Paradigm: Data-Intensive Scientific Discovery”, Tony Hey, Stewart Tansley & Kristin Tolle (eds.), Microsoft Research, inspired by Jim Gray's vision of data-intensive science
“MapReduce: Simplified Data Processing on Large Clusters”, Jeffrey Dean & Sanjay Ghemawat, Google (OSDI 2004)
myExperiment, a workflow sharing repository