Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
National Aeronautics and Space Administration
© 2017 California Institute of Technology. Government sponsorship acknowledged.Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not constitute or imply its endorsements by the United States Government or the Jet Propulsion Laboratory, California Institute of Technology.
Big Data Analytics On Cloud Using NEXUSThomas Huang1, Stewart P. Abercrombie1, Carmen Boening1, Kevin Gill1, Frank Greguska1, Joseph Jacob1, Michael Little2, Chris Lynnes2, Nga Quach1, Rob Toaz1, Brian Wilson1
1Jet Propulsion Laboratory, California Institute of Technology, 4800 Oak Grove Drive, Pasadena, CA 91109-8099, U.S.A.2NASA Goddard Space Flight Center, 8800 Greenbelt Rd, Greenbelt, MD 20771, U.S.A.
Almost all of the existing Earth science data analysis solutions are built around large archives of files. When an analysis involves large collection of files, performance suffers due to the large amount of I/O required. Common data access solutions, such as OPeNDAP and THREDDS, provide web service interface to archives of observational data. They also yield poor performance when it comes to large amount of observations, because they are still built around the notion of files. In his famous 2005 paper on Scientific Data Management in the Coming Decade, the late Jim Gray stated, “The scientific file-formats of HDF, NetCDF, and FITS can represent tabular data but they provide minimal tools for searching and analyzing tabular data.” He continued to point out, “Performing this filter-then-analyze, data analysis on large datasets with conventional procedural tools runs slower and slower as data volume increases.”
NEXUS (https://github.com/dataplumber/nexus) is an emerging, open source, data-intensive analysis framework developed with a new approach for handling science data that enables large-scale data analysis. MapReduce is a well-known paradigm for processing large amounts of data in parallel using clustering or Cloud environments. Unfortunately, this paradigm doesn't work well with temporal, geospatial array-based data. One major issue is they are packaged in files in various sizes. The size of each data file can range from tens of megabytes to several gigabytes. Depending on the user input, some analysis operations could involve hundreds to thousands of these files.
NEXUS takes on a different approach in handling file-based observational temporal, geospatial artifacts by fully leveraging the elasticity of Cloud Computing environment. Rather than performing on-the-fly file I/O, NEXUS stores tiled data in Cloud-scaled databases with high-performance spatial lookup service. NEXUS provides the bridge between science data and horizontal-scaling data analysis. This platform simplifies development of big data analysis solutions by bridging the gap between files and MapReduce solutions such as Spark. NEXUS has been integrated into the NASA Sea Level Change Portal (https://sealevel.nasa.gov) as the Big Data analytic backend for its Data Analysis Tool.
National Aeronautics and Space Administration NASA SEA LEVEL CHANGE PORTAL
These activities were carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration. © 2016 California Institute of Technology. Government sponsorship acknowledged.
https://sealevel.nasa.gov“One-Stop” source for current sea level change information and data
interactive tools for accessing and viewing regional datavirtual dashboard of sea level indicators
latest news, quarterly reports, and publications ongoing updates
multimediamobile friendly
Carmen Boening, Pat Brennan, Kevin Gill, Thomas Huang, Randal Jackson, Eric Larour, Nga Quach, Holly Shaftel, Laura Tenenbaum, Victor ZlotnickiJet Propulsion Laboratory, California Institute of Technology, 4800 Oak Grove Drive, Pasadena, CA 91109-8099, U.S.A.
Andra Boeck, Bergen Moore, Justin MooreMoore Boeck
Relevance Multidiscipline
High Quality Latest
Expert Novice Non-Scientist
Virtual Earth System Laboratory
Columbia Glacier Haig Glacier
Eustatic Sea Level Local Sea Level
Greenland Basal Friction Greenland Steady-State-Friction
Data Analysis ToolParameter Visualization | Polar Projection | Dataset Info | Data Sequencer
Suite of Analytic Functions | Download Plots and Results
FeaturedNews
Indicators
Learn
Search&
Tools
News&
Updates
On-the-fly recreated identification of ”The Blob”• The Blob is the name given to a large mass of relatively warm water in the Pacific ocean off
the coast of North America. It was first detected in late 2013 and continued to spread throughout 2014 and 2015.
• SST anomaly = SST – SST Climatology at each location to compare with standard deviation - Dr. Chelle Gentemann, Senior Scientist at Earth & Space Research
NASA AIST - OceanXtremesOceanographic Data-Intensive Anomaly Detection and Analysis Portal
Apache Spark (In-Memory MapReduce) and Two Scalable DatabasesNEXUS: The Deep Data Platform
Workflow Automation Horizontal-Scale Data Analysis Environment
Deep Data Processors
Index and Data
Catalog
Data Access
ETL System
Private Cloud
Analytic Platform
ETL System Deep Data Processors
Spring XD
Index and Data Catalog Analytic Platform
Manager Manager Inventory Security SigEvent Search
ZooKeeper ZooKeeperZooKeeper
Manager
Handler Handler Handler
IngestPool
IngestPool
WorkerPool
WorkerPool
File & ProductServices
Job Tracking Services
Business Logics
Applications
HORIZON Data Management and Workflow Framework
Ingest Ingest Ingest
ProductSubscriber
ProductSubscriber
Staging
NoS
QL
Index
Alg Alg Alg
Alg
Alg
Alg
Alg
WebPortal
TaskScheduler
Task threads
Block manager
RDD Objects DAG Scheduler Task Scheduler Executor
DAG TaskSet Task
NASA Physical Oceanography Distributed Active Archive Center
Comparison of Giovanni and NEXUS run times to compute 16-year area-averaged daily time series of MODIS-Terra Aerosol Optical Depth (AOD) 550 nm dark target at 1 degree resolution for the indicated spatial bounds (Global, Colorado, or Boulder). NEXUS was run on 6 Amazon Web Services (AWS) Cloud instances of type “i2.4xlarge” with Solr, Cassandra, Spark 2.0, and the Mesos scheduler. NEXUS has superior performance for subsetting operations, as indicated by the ~300x speedup for the Colorado subset.
Comparison of NEXUS run times using the YARN and Mesos schedulers for the MODIS-Terra time series calculation on an 8-node cluster computer at JPL running Solr, Cassandra, and Spark 2.0 with the YARN and Mesos schedulers. In our experiments, using Mesos consistently yields a speedup of 2 to 4 times over YARN.
0
1000
2000
3000
4000
Global Colorado Boulder
1778
3010
1228
35 11 10
RunTime(s)
MODIS-TerraArea-AveragedTimeSeries
Giovanni NEXUS16-way
0
50
100
150
Global Colorado Boulder
124
42 372712 12Ru
nTime(s)
MODIS-TerraArea-AveragedTimeSeries
YARN Mesos
Comparison of Giovanni and NEXUS run times to compute global correlation map of 0.25 degree TRMM daily precipitation rate (TRMM_3B42_daily_precipitation_V7) vs. T R M M r e a l - t i m e d a i l y p r e c i p i t a t i o n (TRMM_3B42RT_daily_precipitation_V7) for 1/1/2001 - 12/31/2014. These runs included 5,113 global data granule pairs (between 50 deg. south and 50 deg. north latitude) totaling 40 GB. NEXUS was run on an 8-node cluster computer at JPL running Solr, Cassandra, and Spark 2.0 with the Mesos scheduler.
0
100
200
300
400
500
600
700
800
Giovanni NEXUS16-way NEXUS64-way
768
60 34
RunTime(s)
TRMM_3B42vsTRMM_3B42RTCorrelationMap(Global)
Comparison of Giovanni and NEXUS run times to compute area-averaged time series over the continental United States (Figure above) and global time averaged map (Figure below) of 0.25 degree TRMM daily precipitation rate (TRMM_3B42_daily_precipitation_V7) for 1/1/1998 - 12/31/2015. These runs included 6,574 global data granules (between 50 deg. south and 50 deg. north latitude) totaling 26 GB. NEXUS was run on an 8-node cluster computer at JPL running Solr, Cassandra, and Spark 2.0 with the Mesos scheduler.
050010001500200025003000350040004500
Giovanni NEXUS16-way NEXUS64-way
4120
94 40
RunTime(s)
TRMM_3B42AreaAveragedTimeSeries(CONUS)
020406080
100120140160180
Giovanni NEXUS16-way NEXUS64-way
169
5441Ru
nTime(s)
TRMM_3B42TimeAveragedMap(Global)
Performance and Scaling
NEXUS is an emerging technology developed at JPLOpen Source: https://github.com/dataplumber/nexusA Cloud-based/Cluster-based data platform that performs scalable handling
of observational parameters analysis designed to scale horizontally byLeveraging high-performance indexed, temporal, and geospatial search
solutionBreaks data products into small chunks and stores them in a Cloud data
storeData Volumes Exploding
SWOT mission is comingFile I/O is slow
Scalable Store & Compute is AvailableNoSQL cluster databasesParallel compute, in-memory map-reduceBring Compute to Highly-Accessible Data (using Hybrid Cloud)
Pre-Chunk and Summarize Key VariablesEasy statistics instantly (milliseconds)Harder statistics on-demand (in seconds)Visualize original data (layers) on a map quickly
Using open source technologies• Apache Cassandra• Apache Kafka• Apache Mesos/YARN• Apache Solr• Apache Spark/PySpark• Apache Zookeeper• EDGE• Jupyter• Spring XD• Tornado