How future astronomy projects will generate enormous amounts of data, and what that means for astronomical data processing. Part of the Virtual Observatory course by Juan de Dios Santander Vela, as taught for the MTAF (Métodos y Técnicas Avanzadas en Física, Advanced Methods and Techniques in Physics) Master at the University of Granada (UGR).
Astronomy’s Big Data Challenges
Juan de Dios Santander Vela (IAA-CSIC)
Overview
What is, exactly, big data?
What are the dimensions of big data?
What are the big data drivers in astronomy?
How can we deal with big data?
VO tools for dealing with big data
What is Big Data, exactly?
Data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.
WIKIPEDIA: “BIG DATA”
What is Big Data, exactly?
Big Data is data with at least one Big dimension:
Bandwidth
Number of individual assets
Size of individual assets
Response speed
…
Big Data

Size: storage, access techniques, processing techniques
Flow: real time (event processing), offline (data mining)
Processing level: raw data, processed data (statistics), schemata (structured / tagging / unstructured)
Value: files (formats, durability, parallel access), capabilities, information extracted, tech debt
Upcoming big data projects in astronomy
Large Synoptic Survey Telescope
The Large Synoptic Survey Telescope Camera
Steven M. Kahn, Stanford/SLAC (for the LSST Consortium)
LSST Data Rates
* 2.3 billion pixels read out in less than 2 s, every 12 s
* 1 pixel = 2 bytes (raw)
* Over 3 GB/s peak raw data from the camera
* Real-time processing and transient detection: < 10 s
* Dynamic range: 4 bytes/pixel
* > 0.6 GB/s average in the pipeline
* 5,000 floating point operations per pixel
* 2 TFlop/s average, 9 TFlop/s peak
* ~18 TB/night
(These rates are sanity-checked in the sketch below.)
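The rates above can be sanity-checked in a few lines of Python. The constants come from the list itself; the night length is an assumption, and overheads (calibration frames, duty cycle, compression) are ignored, so these simple products only land in the same ballpark as the quoted figures:

pixels = 2.3e9         # pixels per readout
raw_bpp = 2            # bytes per pixel, raw
pipe_bpp = 4           # bytes per pixel at full dynamic range
readout_s = 2          # seconds per readout
cadence_s = 12         # seconds between readouts
flops_px = 5000        # floating point operations per pixel
night_s = 10 * 3600    # assumed ~10 h of observing per night

print(f"peak raw rate : {pixels * raw_bpp / readout_s / 1e9:.1f} GB/s")
print(f"avg pipeline  : {pixels * pipe_bpp / cadence_s / 1e9:.2f} GB/s")
print(f"avg processing: {pixels * flops_px / cadence_s / 1e12:.1f} TFlop/s")
print(f"raw per night : {pixels * raw_bpp / cadence_s * night_s / 1e12:.0f} TB")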
Relative Survey Power
Square Kilometre Array
Signal Transport & Processing
DESIGN COUNTS!
Massive Data Flow, Storage & Processing

Pipeline: Antenna & Front End Systems → Correlation → Temporary Storage → Data Product Generation → Long-Term Storage → High-Availability Storage / DB → On-Demand Processing

STORAGE: 18 PB/year into long-term storage, plus 800 PB of temporary storage. The raw stream can't be stored: 1 day of stream = 150 days of global internet traffic.

PROCESSING: 30 PFlop/s, with total needs beyond 1 EFlop/s (the equivalent of 10^9 top-range PCs).

BANDWIDTH: from 7 PB/s down to > 300 GB/s along the pipeline. For a typical survey, reading the data back takes 5 days at 10 GB/s (see the back-of-envelope sketch below).
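The 5-day read-time figure is easy to reproduce as a back-of-envelope. The survey size below is a hypothetical value chosen to illustrate the quoted number, not an official SKA figure:

survey_bytes = 4.3e15   # assumed survey product size (~4.3 PB), hypothetical
read_rate = 10e9        # 10 GB/s sustained read rate, from the slide
print(f"read time: {survey_bytes / read_rate / 86400:.1f} days")  # ~5 days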
0" 5" 10" 15" 20" 25" 30" 35" 40"
ALMA"
LOFAR"
Bandwidth)in)TB/s)
MASSIVE DATA FLOW, STORAGE & PROCESSING
Antenna & Front End Systems
Correlation
0" 5" 10" 15" 20" 25" 30" 35" 40"
ALMA"
LOFAR"
Bandwidth)in)TB/s)
0" 10" 20" 30" 40" 50" 60" 70"
LOFAR"
ASKAP"
Bandwidth)in)TB/s)
MASSIVE DATA FLOW, STORAGE & PROCESSING
Antenna & Front End Systems
Correlation
0" 5" 10" 15" 20" 25" 30" 35" 40"
ALMA"
LOFAR"
Bandwidth)in)TB/s)
0" 10" 20" 30" 40" 50" 60" 70"
LOFAR"
ASKAP"
Bandwidth)in)TB/s)
MASSIVE DATA FLOW, STORAGE & PROCESSING
Antenna & Front End Systems
Correlation
0" 0,0005" 0,001" 0,0015" 0,002"
VLA"
ALMA"
Processing*TFlops/s*
MASSIVE DATA FLOW, STORAGE & PROCESSING
Correlation
0" 0,0005" 0,001" 0,0015" 0,002"
VLA"
ALMA"
Processing*TFlops/s*
0" 20" 40" 60" 80" 100" 120"
ALMA"
LOFAR"
Processing*TFlops/s*
MASSIVE DATA FLOW, STORAGE & PROCESSING
Correlation
0" 0,0005" 0,001" 0,0015" 0,002"
VLA"
ALMA"
Processing*TFlops/s*
0" 20" 40" 60" 80" 100" 120"
ALMA"
LOFAR"
Processing*TFlops/s*
0" 50" 100" 150" 200" 250" 300" 350"
LOFAR"
ASKAP"
Processing*TFlops/s*
MASSIVE DATA FLOW, STORAGE & PROCESSING
Correlation
0" 0,0005" 0,001" 0,0015" 0,002"
VLA"
ALMA"
Processing*TFlops/s*
0" 20" 40" 60" 80" 100" 120"
ALMA"
LOFAR"
Processing*TFlops/s*
0" 50" 100" 150" 200" 250" 300" 350"
LOFAR"
ASKAP"
Processing*TFlops/s*
MASSIVE DATA FLOW, STORAGE & PROCESSING
Correlation
Comparison: LHC (CERN/IT/DB)

Online system: a multi-level trigger filters out background, reducing the data volume from 40 TB/s to 100 MB/s.

Detector output: 40 MHz (40 TB/s)
After level 1 (special hardware): 75 kHz (75 GB/s)
After level 2 (embedded processors): 5 kHz (5 GB/s)
After level 3 (PCs): 100 Hz (100 MB/s), into data recording & offline analysis
Event Filter & Reconstruction (figures are for one experiment):

Data from the detector → event builder (switch) → high-speed network → computer farm → tape and disk servers (raw data + summary data)

input: 5-100 GB/s
capacity: 50K SI95 (~4,000 PCs of 1999)
recording rate: 100 MB/s (ALICE: 1 GB/s)
raw data: 1-1.25 PB/year; summary data: 1-500 TB/year
20,000 Redwood cartridges every year (plus a copy)
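A quick way to read the trigger cascade is through its reduction factors. This sketch just recomputes them from the rates quoted above:

rates = [
    ("detector output", 40e12),          # 40 MHz -> 40 TB/s
    ("after level 1 (hardware)", 75e9),  # 75 kHz -> 75 GB/s
    ("after level 2 (embedded)", 5e9),   # 5 kHz  -> 5 GB/s
    ("after level 3 (PCs)", 100e6),      # 100 Hz -> 100 MB/s
]
for (stage_in, r_in), (stage_out, r_out) in zip(rates, rates[1:]):
    print(f"{stage_in} -> {stage_out}: x{r_in / r_out:,.0f}")
print(f"overall reduction: x{rates[0][1] / rates[-1][1]:,.0f}")  # x400,000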
Dealing with Big Data
We cannot allow arbitrary queries…
…but we can offer arbitrary processing instead.
We cannot allow full data dumps…
…but we can generate data on the fly (see above).
Queries as functions

QUERY = FUNCTION { ALL DATA }

QUERIES NEED TO BE PRECOMPUTED: ARBITRARY QUERIES ARE ONLY POSSIBLE ON THE PRECOMPUTED, SMALLER DATA SETS
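A minimal sketch of the idea, with toy data and hypothetical names: one expensive batch pass over the full dataset produces a small view, and arbitrary queries then run against the view alone:

# Toy records: (source, magnitude). Names are hypothetical.
all_data = [("M31", 3.4), ("M31", 3.5), ("M42", 4.0)]

# Batch pass over the full dataset: expensive, done once.
view = {}
for source, mag in all_data:
    total, n = view.get(source, (0.0, 0))
    view[source] = (total + mag, n + 1)

# Arbitrary queries now touch only the small precomputed view.
def mean_mag(source):
    total, n = view[source]
    return total / n

print(mean_mag("M31"))  # 3.45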
Lambda Architecture

Batch Layer: stores the master dataset; computes arbitrary views
Serving Layer: random access to views; updated by the batch layer
Speed Layer: fast, incremental algorithms; serves queries not covered by the batch layer; compensates for its latency
Batch Layer

Stores the master copy of the dataset (immutable, constantly growing)
Precomputes batch views on that master dataset
Batch Layer

All data + new data → batch layer → View 1, View 2, …, View n (updated views)
Typically implemented as Map/Reduce
Serving Layer
Allows for:
batch writes of view updates
random reads on the views
Does not allow random writes
Speed Layer
Allows for:
incremental writes of view updates
short-term temporal queries on the views
Can be discarded once the batch layer catches up! (see the sketch below)
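Putting the three layers together, here is a toy sketch (hypothetical names, not a real framework): the batch layer recomputes its view from the immutable master dataset, the speed layer keeps an incremental view for data the batch layer has not absorbed yet, and a query merges both:

master = [("M31", 1), ("M31", 1), ("M42", 1)]  # immutable master dataset
speed_view = {}                                # incremental, discardable

def batch_view(data):
    # Recomputed from scratch on every batch run (slow but simple).
    view = {}
    for key, n in data:
        view[key] = view.get(key, 0) + n
    return view

def new_datum(key, n):
    # Speed layer: update its view incrementally as data arrives.
    speed_view[key] = speed_view.get(key, 0) + n

def query(key, view):
    # Serving layer: merge the batch view with the speed view.
    return view.get(key, 0) + speed_view.get(key, 0)

view = batch_view(master)
new_datum("M31", 1)          # arrived after the last batch run
print(query("M31", view))    # 3 = batch view (2) + speed view (1)

Once a batch run absorbs the recent data, the corresponding speed view can be thrown away, which is why errors in the speed layer are only transient.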
Figure 2.1 The master dataset in the Lambda Architecture serves as the source of truth of your Big Data system. Errors at the serving and speed layers can be corrected, but corruption at the master dataset is irreparable.

The master dataset is the only part of the Lambda Architecture that absolutely must be safeguarded from corruption. Overloaded machines, failing disks, and power outages all could cause errors, and human error with dynamic data systems is an intrinsic risk and inevitable eventuality. You must carefully engineer the master dataset to prevent corruption in all these cases, as fault tolerance is essential to the health of a long-running data system.

There are two components to the master dataset: the data model to use, and how to physically store it. This chapter is about designing a data model for the master dataset and the properties such a data model should have. You will learn about physically storing a master dataset in the next chapter.

To provide a roadmap for your undertaking, you will:
learn the key properties of data
see how these properties are maintained in the fact-based model
examine the advantages of the fact-based model for the master dataset
(Excerpt from Nathan Marz with James Warren, “Big Data”, Manning Publications.)
Computing over Big Data

The batch layer acts as a computational engine over the data. We need to formally specify:
Inputs
Processes
Outputs

THAT LOOKS LIKE A WORKFLOW! (OR SQL QUERYING)
Map/Reduce

from functools import reduce
from math import sqrt
from random import normalvariate

def res2(x):
    return pow(mean_v - x, 2.)

# Random vector, mean 1, stdev 0.001
v = [normalvariate(1, 0.001) for x in range(0, 1000000)]
mean_v = reduce(lambda x, y: x + y, v) / len(v)
res2_v = map(res2, v)
stdev = sqrt(reduce(lambda x, y: x + y, res2_v) / len(v))
print(mean_v, stdev)

PARALLELISABLE!
Map/Reduce

from functools import reduce
from math import sqrt
from multiprocessing import Pool
from random import normalvariate

def res2(x):
    return pow(mean_v - x, 2.)

# Random vector, mean 1, stdev 0.001
v = [normalvariate(1, 0.001) for x in range(0, 1000000)]
mean_v = reduce(lambda x, y: x + y, v) / len(v)
pool = Pool(processes=4)    # workers inherit mean_v via fork (as on Linux)
res2_v = pool.map(res2, v)  # the map step, now run in parallel
pool.close()
stdev = sqrt(reduce(lambda x, y: x + y, res2_v) / len(v))
print(mean_v, stdev)

ONLY FOR MAP, BUT REDUCE ALSO PARALLELISABLE
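One common way to parallelise the reduce step, sketched under the assumption that the operation (here a sum) is associative: split the vector into chunks, reduce each chunk in a worker, then reduce the few partial results serially:

from functools import reduce
from multiprocessing import Pool
from random import normalvariate

def chunk_sum(chunk):
    return sum(chunk)  # reduce one chunk locally

if __name__ == "__main__":
    v = [normalvariate(1, 0.001) for _ in range(1000000)]
    n_chunks = 4
    size = len(v) // n_chunks
    chunks = [v[i * size:(i + 1) * size] for i in range(n_chunks)]
    with Pool(processes=4) as pool:
        partials = pool.map(chunk_sum, chunks)    # parallel partial reduces
    total = reduce(lambda x, y: x + y, partials)  # cheap final reduce
    print(total / len(v))                         # ~1.0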
[Figure: dependence of execution time on the number of pool processes. Y axis: seconds per million elements (0.4 to 0.8); X axis: number of pool processes (1 to 8); one curve each for vectors of 1, 5, 10 and 20 million elements.]
Conclusions
Big data needs different approaches
Parallelism & data-side processing
Map/Reduce as a parallelism engine
We need ways to formally specify computations
References & Links
“The Fourth Paradigm: Data-Intensive Scientific Discovery”, Tony Hey, Stewart Tansley & Kristin Tolle (eds.), Microsoft Research, inspired by Jim Gray's vision of data-intensive science
“MapReduce: Simplified Data Processing on Large Clusters”, Jeffrey Dean & Sanjay Ghemawat, Google (OSDI 2004)
myExperiment, a workflow sharing repository