PDS Data Movement and Storage Planning (PMWG)
PDS MC F2F, UCLA
Dan Crichton
November 28-29, 2012
Growth of Planetary Data Archived from U.S. Solar System Research
Yes, size matters, but so does complexity…
Big Data Challenges
• Storage
• Computation
• Movement of Data
• Heterogeneity
• Distribution
…can affect how we generate, manage, and analyze science data.
…commodity computing can help, if architected correctly
Big Data Technologies
Architecting PDS Towards a Decoupled Architecture
[Figure: decoupled PDS architecture. Components shown: Data Providers, Ingest, Transform, PDS Data Management, Distribution, Users, with "Core PDS" marked. Big data concerns annotated on the diagram: Data Movement, Computation, Storage, Heterogeneous Data.]
• Preserve and ensure the stability and integrity of PDS data
• Improve user support and usability of the data in the archive
• Improve efficiency and support to deliver high quality science products to PDS
Storage Eye Chart
• Direct Attached Storage (DAS)
  • DAS-based storage (usually disk or tape) is directly attached to a server (point-to-point).
• Network Attached Storage (NAS)
  • A NAS unit or “appliance” is a dedicated storage server connected to an Ethernet network that provides file-based data storage services to other devices on the network. NAS units remove the responsibility of file serving from other servers on the network.
• Storage Area Network (SAN)
  • A SAN is an architecture that connects detached storage devices, such as disk arrays, tape libraries, and optical jukeboxes, to servers in such a way that the devices appear as local resources.
• Redundant Array of Inexpensive Disks (RAID)
  • RAID combines multiple inexpensive disk drives into an array that (usually) performs better than a single disk drive. The array appears as a single drive to the connected server. RAID technology is typically employed in a DAS, NAS, or SAN solution.
• Cloud Storage
  • Cloud storage is capacity accessed through the internet or a wide area network (WAN) and usually purchased on an as-needed basis. Users can expand capacity on the fly. Providers operate a highly scalable storage infrastructure, often in physically dispersed locations.
• Solid State Drive (SSD) Storage
  • SSD technology is evolving to the point where SSDs can, in some cases, start to supplant traditional storage. DRAM-based SSDs (volatile memory) cannot survive a power loss, whereas flash-based SSDs (non-volatile), although slower than DRAM-based SSDs, do not require a battery backup and are therefore acceptable in the enterprise. 1 TB SSDs have recently been announced for industrial applications such as military and medical use. SSD technology is evolving rapidly and is expected to become a major contender in the storage arena in the near future.
Storage Architectural Concepts
Cloud Deployment Models
• Public Cloud:
  • Cloud infrastructure is made available to the general public or a large industry group and is owned by an organization selling cloud services (e.g., Amazon, RackSpace, Nirvanix)
  • Applications are typically “multi-tenant” and physical infrastructure is shared
• Private Cloud:
  • Cloud infrastructure is operated solely for one organization. It may be managed by the organization or on its behalf by a third party, and may exist on premises or at a provider’s site in a hosting center. It may be built with cloud software (e.g., Eucalyptus)
• Hybrid Cloud:
  • The organization provides and manages some resources in-house and has others provided externally
  • Makes it possible to leverage existing and future technologies at minimal cost (e.g., backup/archive data managed externally, operational data managed internally)
Photo credit: AcuteSys
Many Benefits of Cloud Computing
• Broad network access: accessible from anywhere
• Resource pooling: shared pool of configurable computing resources; reliability through replicas, etc.
• Rapid elasticity: scale when needed with storage and services/cores, etc.
• Measured service: utility computing, pay by the drink, rapidly provisioned
Challenges of Cloud Storage
• Data Integrity
• Ownership (local control, etc.)
• Security
• ITAR
• Data movement to/from cloud
• Procurement
• Cost arrangements
The Planetary Cloud Experiment
• Utility to PDS
• How does it fit the PDS4 architecture?
  • APIs
  • Decoupled storage and services
• Data movement challenges?
• Cloud storage tested as a secondary storage option
  • iRODS @ SDSC, Amazon (S3), Nirvanix
IEEE Pro, Sept/Oct 2010
Results of Study
[Figure: Benchmarking (2009) of Nirvanix, iRODS @ SDSC, and Amazon]
• Moving massive amounts of data “online” is a limiting factor (more to come)
• Varying cost scenarios
  • (target < $500/TB/year)
• Proprietary APIs (but some open source cloud implementations are gaining steam)
• But entirely feasible as a decoupled “storage service” in PDS4
• Low-risk option is to explore cloud storage as an operational, secondary copy and access point for planetary data
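One way to picture the decoupled “storage service” idea is a narrow interface that archive services code against, with each backend (local disk, S3, iRODS, and so on) hidden behind it. The sketch below is illustrative only: `StorageService`, `LocalDirStorage`, and `archive_product` are hypothetical names, not PDS4 APIs, and only a local-directory backend is shown. A cloud backend would implement the same two methods.

```python
import hashlib
from abc import ABC, abstractmethod
from pathlib import Path


class StorageService(ABC):
    """Hypothetical minimal storage interface; not part of PDS4."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class LocalDirStorage(StorageService):
    """Backend that stores each object as a file under a root directory."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self.root / key).read_bytes()


def archive_product(primary: StorageService, secondary: StorageService,
                    key: str, data: bytes) -> str:
    """Write the product to the primary store and to a secondary copy.

    Returns the SHA-256 digest so callers can verify either copy later.
    """
    primary.put(key, data)
    secondary.put(key, data)
    return hashlib.sha256(data).hexdigest()
```

Because callers only see `put` and `get`, the secondary copy could be moved from local disk to a cloud provider without touching the services that use it.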
MER Planning on the Cloud
* Credit: Khawaja Shams
MER Planning: Backup to the Cloud*
[Figure: daily Mars data is archived, compressed, and encrypted in memory, then uploaded to S3 in parallel (figure label: 5x)]
• Polyphony schedules backups for each of the last 5 days, daily
* Credit: Khawaja Shams, George Chang
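The backup flow on this slide (archive and compress in memory, then upload in parallel) can be sketched with the Python standard library. This is not the Polyphony implementation: `make_archive`, `split_parts`, and `upload_parallel` are hypothetical helpers, the `upload_part` callable stands in for whatever client actually talks to S3, and the encryption step is omitted because it needs a third-party library.

```python
import io
import tarfile
from concurrent.futures import ThreadPoolExecutor


def make_archive(files: dict[str, bytes]) -> bytes:
    """Build a gzip-compressed tar archive entirely in memory."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, data in sorted(files.items()):
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def split_parts(blob: bytes, part_size: int) -> list[bytes]:
    """Cut the archive into fixed-size parts (the last one may be shorter)."""
    return [blob[i:i + part_size] for i in range(0, len(blob), part_size)]


def upload_parallel(parts, upload_part, workers: int = 4) -> list:
    """Upload parts concurrently.

    upload_part(index, data) is supplied by the caller (an assumption here);
    results are returned in part order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(upload_part, i, p) for i, p in enumerate(parts)]
        return [f.result() for f in futures]
```

Keeping the upload function as a parameter means the same code can be exercised against an in-memory fake before it is pointed at a real storage endpoint.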
MER Planning: Data Integrity on the Cloud
[Figure: backup copies held in S3]
• If a downloaded backup does not match the local data, Polyphony immediately schedules another backup of the inconsistent data
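The integrity check described here amounts to comparing each downloaded backup against the local copy and re-queuing anything that differs. A minimal sketch, assuming SHA-256 digests are an acceptable comparison and that `download` is a caller-supplied function (both are assumptions, not details from the MER system):

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def find_inconsistent(local: dict[str, bytes], download) -> list[str]:
    """Return the keys whose downloaded backup differs from the local copy.

    download(key) returns the backed-up bytes, or None if the backup is
    missing. The returned keys are the ones to schedule for another backup.
    """
    to_reschedule = []
    for key, data in sorted(local.items()):
        remote = download(key)
        if remote is None or sha256_hex(remote) != sha256_hex(data):
            to_reschedule.append(key)
    return to_reschedule
```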
Cloud Computing and Computation
• On-demand computation (scaling to massive number of cores)
• Amazon EC2, one of the most popular
• Commoditizing super-computing
• Again, architecting systems to decouple “processing” and “computation” so they can be executed on the cloud is key. Two examples:
  • LMMP example (to come)
  • Airborne data processing (to come)
• Coupled with computational frameworks (e.g., Apache Hadoop)
  • Open source implementation of Map-Reduce
Lunar Mapping and Modeling Project: Big Data Challenges*
• The image files LMMP manages range from a few gigabytes to hundreds of gigabytes in size, with new data arriving every day
• Lunar surface images are too large to efficiently load and manipulate in memory
• LMMP must make the data readily available in a timely manner for users to view and analyze
• LMMP needs to accommodate large numbers of users with minimal latency
* Credit: Emily Law, George Chang
Cloud Computing Solutions with Map-Reduce
• Slice a large image into many small images, then merge and resize repeatedly until the final merge-and-reduce step yields a reasonably sized image that depicts the entire scene
• Amazon EC2 for computing; S3 for storage
• Installed Hadoop framework on a number of EC2 instances
• Used distributed approach with Elastic Map-Reduce in Hadoop to tile images
• Developed a hybrid solution (multi-tiered data access approach) to serve images to users from cloud storage
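The slice, merge, and resize idea can be illustrated as a small tile pyramid. The sketch below runs in a single process on a nested list of pixel values; it is not the LMMP Hadoop job. It assumes image dimensions are powers of two, and `slice_tiles`, `halve`, and `build_pyramid` are hypothetical names. In a Map-Reduce setting the slicing would be the map step and the 2x2 merge the reduce step.

```python
def slice_tiles(image, tile):
    """Cut a 2-D list of pixels into tile x tile blocks keyed by (row, col)."""
    tiles = {}
    for r in range(0, len(image), tile):
        for c in range(0, len(image[0]), tile):
            tiles[(r // tile, c // tile)] = [
                row[c:c + tile] for row in image[r:r + tile]
            ]
    return tiles


def halve(image):
    """Average each 2x2 block, halving both width and height."""
    return [
        [
            (image[r][c] + image[r][c + 1]
             + image[r + 1][c] + image[r + 1][c + 1]) / 4
            for c in range(0, len(image[0]), 2)
        ]
        for r in range(0, len(image), 2)
    ]


def build_pyramid(image, tile):
    """Tile each level, then shrink, until the whole image fits in one tile."""
    levels = []
    while True:
        levels.append(slice_tiles(image, tile))
        if len(image) <= tile and len(image[0]) <= tile:
            return levels
        image = halve(image)
```

The last level holds a single tile that depicts the entire image at low resolution, while the first level holds the full-resolution tiles a viewer would request when zoomed in.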
LMMP Tiling Test Results (Cloud vs Local)
• Configuration 1
  • 2x Sun Fire 4170
  • Gigabit Network Interconnects
  • 72 GB RAM
  • 64 GB SSD Storage
  • $10K each, plus administration and infrastructure costs
• Configuration 2
  • 20 EC2 Large Instances (4 Compute Units ~ 4x1GHz Xeon)
  • 7.5 GB RAM
  • 850 GB Storage
  • $0.34/instance/hour
• Configuration 3
  • 4 EC2 CC Instances (33.5 Compute Units)
  • Gigabit Interconnects
  • 23 GB RAM
  • 1.69 TB Storage
  • $1.60/instance/hour
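The listed prices allow a rough break-even estimate between buying the local servers and renting the cloud configurations. This is back-of-the-envelope arithmetic only: it ignores the administration and infrastructure costs noted for Configuration 1, any cloud storage or transfer charges, and differences in how fast each configuration completes the tiling job. `break_even_hours` is a hypothetical helper.

```python
# Prices taken from the configuration list above.
LOCAL_PURCHASE_USD = 2 * 10_000      # Configuration 1: two servers at $10K each
CONFIG_2_USD_PER_HOUR = 20 * 0.34    # Configuration 2: 20 EC2 Large instances
CONFIG_3_USD_PER_HOUR = 4 * 1.60     # Configuration 3: 4 EC2 CC instances


def break_even_hours(purchase_usd: float, cloud_usd_per_hour: float) -> float:
    """Hours of cloud use whose cost equals the up-front purchase price."""
    return purchase_usd / cloud_usd_per_hour


hours_config_2 = break_even_hours(LOCAL_PURCHASE_USD, CONFIG_2_USD_PER_HOUR)
hours_config_3 = break_even_hours(LOCAL_PURCHASE_USD, CONFIG_3_USD_PER_HOUR)
```

Under these assumptions the cloud clusters cost about $6.80 and $6.40 per hour, so the $20K hardware purchase corresponds to roughly 2,900 to 3,100 hours of continuous cloud use, which favors the cloud for bursty workloads rather than for clusters that run around the clock.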
Cloud Computing: Addressing Challenges
• Cloud has shown very promising results, but there are challenges
  • Proprietary APIs
  • Support for ITAR-sensitive data
  • Data transfer rates to the commercial cloud
  • Firewall issues
  • Procurement
  • Costs for long term storage
• More work ahead
  • Amazon EC2/S3 reported an “ITAR Region” available
  • Continued benchmarking and optimization has demonstrated increased data transfer rates, particularly using Internet2
  • JPL developing a “Virtual Private Cloud” connection to Amazon, causing EC2 nodes to appear inside the JPL Firewall
  • Improved procurement process to allow JPL projects to use AWS
The Planetary Data Movement Experiment
• Online data movement has been a limiting factor for embracing big data technologies
• Conducted in 2006*, 2009 and 2012
• Evaluate trade-offs for moving data
  • to PDS
  • between Nodes
  • to NSSDC/deep archive
  • to Cloud
* C. Mattmann, S. Kelly, D. Crichton, J. S. Hughes, S. Hardman, R. Joyner and P. Ramirez. A Classification and Evaluation of Data Movement Technologies for the Delivery of Highly Voluminous Scientific Data Products. In Proceedings of the NASA/IEEE Conference on Mass Storage Systems and Technologies (MSST2006), pp. 131-135, College Park, Maryland, May 15-18, 2006
Data Xfer Technologies Evaluated
• FTP uses a single connection for transferring files; it is ubiquitous and, where possible, the simplest way for PDS to transfer data electronically
• bbFTP uses multiple threads/connections to improve data transfer. It works well as long as the number of connections is kept to a reasonable limit
• GridFTP uses multiple threads/connections. It is part of the Globus project and is used by the climate research community to move models. In general, tests have shown that it is more difficult to set up due to the security infrastructure, etc.
• iRODS uses multiple threads/connections to improve data transfer. It works well as long as the number of connections is kept to a reasonable limit
• FDT uses multiple threads/connections to improve data transfer. It works well as long as the number of connections is kept to a reasonable limit
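The common idea behind the multi-connection tools above is to split one file into byte ranges and move the ranges concurrently. The sketch below shows that idea with a local file copy standing in for network streams; it does not reproduce the protocol of bbFTP, GridFTP, iRODS, or FDT, and `byte_ranges`, `copy_range`, and `parallel_copy` are hypothetical names.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def byte_ranges(size: int, streams: int) -> list[tuple[int, int]]:
    """Split [0, size) into at most `streams` contiguous (offset, length) ranges."""
    chunk = max(1, -(-size // streams))  # ceiling division
    return [(off, min(chunk, size - off)) for off in range(0, size, chunk)]


def copy_range(src: str, dst: str, offset: int, length: int) -> int:
    """Copy one byte range; each worker uses its own file handles."""
    with open(src, "rb") as fin, open(dst, "r+b") as fout:
        fin.seek(offset)
        fout.seek(offset)
        fout.write(fin.read(length))
    return length


def parallel_copy(src: str, dst: str, streams: int = 4) -> int:
    """Copy src to dst using several concurrent range copies; returns bytes copied."""
    size = os.path.getsize(src)
    with open(dst, "wb") as f:
        f.truncate(size)  # pre-size the destination so ranges can land anywhere
    with ThreadPoolExecutor(max_workers=streams) as pool:
        done = pool.map(lambda r: copy_range(src, dst, *r), byte_ranges(size, streams))
        return sum(done)
```

The `streams` parameter plays the role of the connection count that the slide says must be kept to a reasonable limit.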
Some of our Findings
• Transfer speeds among the nodes differ greatly; however, the fundamental findings about how best to transfer data for each scenario are consistent
• Parallel transfer mechanisms show improvement over conventional transfer mechanisms (FTP, socket-to-socket) for files larger than ~10MB
• Packaging/bundling small files helps to achieve significantly better transfer performance with parallel data transfer
• Reliability has improved over the past five years in many of the products we have tested
• However, UDP approaches have suffered, largely because more aggressive network infrastructure treats this traffic as a distributed denial of service (DDoS) attack
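The two findings about file size suggest a simple transfer plan: send files above the ~10MB mark individually over a parallel mechanism, and group smaller files into bundles first. A minimal sketch of that planning step follows. The 10 MiB constant approximates the finding, while the bundle target size and the name `plan_transfers` are assumptions made for illustration.

```python
PARALLEL_THRESHOLD = 10 * 1024 * 1024  # ~10MB, taken from the finding above


def plan_transfers(sizes: dict[str, int], bundle_target: int = PARALLEL_THRESHOLD):
    """Split files into large files and bundles of small files.

    sizes maps file name to size in bytes. Files above the threshold are
    returned individually; smaller files are grouped in name order into
    bundles of at least bundle_target bytes (the final bundle may be smaller).
    """
    large = sorted(n for n, s in sizes.items() if s > PARALLEL_THRESHOLD)
    bundles, current, current_size = [], [], 0
    for name in sorted(n for n, s in sizes.items() if s <= PARALLEL_THRESHOLD):
        current.append(name)
        current_size += sizes[name]
        if current_size >= bundle_target:
            bundles.append(current)
            current, current_size = [], 0
    if current:
        bundles.append(current)
    return large, bundles
```

Each bundle would then be packaged (for example as a tar file) and handed to the same parallel transfer mechanism as the large files.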
Data Movement over WAN using TCP/IP
[Figure: transfer rate (Y axis) versus file size (X axis). GridFTP: blue, bbFTP: red, FTP: green]
Data Movement Recommendations (2010)
Technologies compared: FTP, bbFTP, GridFTP, Data Brick, FDT, iRODS

Efficiency
• FTP: High for files < 1 GB
• bbFTP: High
• GridFTP: Slightly lower than bbFTP
• Data Brick: Low
• FDT: Very High
• iRODS: High

Scalability
• FTP: Linear
• bbFTP: Based on number of threads
• GridFTP: Based on number of threads
• Data Brick: Based on available storage sizes
• FDT: Adaptive
• iRODS: Adaptive

Reliability
• FTP: Fault rate dependent on underlying TCP/IP protocol, but 0 faults / 20 hours of testing and 10s of GBs of data
• bbFTP: Good (supports retransmit, issue with > 12 threads)
• GridFTP: High (supports retransmit)
• Data Brick: High
• FDT: Poor
• iRODS: Excellent

Ease of Use
• FTP: Easy
• bbFTP: Easy
• GridFTP: Medium
• Data Brick: Based on brand
• FDT: Medium
• iRODS: Easy

Ease of Deployment
• FTP: Easy (standard component on Linux/UNIX/Mac, and some Windows solutions)
• bbFTP: Easy to deploy on Unix-based systems with /etc/passwd security (can also use Globus GSI security)
• GridFTP: Difficult to deploy; relies on Grid Security Infrastructure and certificate management for hosts, users, services
• Data Brick: Based on brand
• FDT: Medium
• iRODS: Difficult

Cost (Operate & Implement)
• FTP: Low
• bbFTP: Low
• GridFTP: Medium (hard to deploy)
• Data Brick: Based on brand & volume
• FDT: Low
• iRODS: Low
Pilot with DNs (Big Data)
• iRODS has proven to be the most promising for data transfer
• Setting up an iRODS infrastructure for data movement with 3 zones (GEO, USGS, JPL/IMG) as a pilot
  • Run alongside other mechanisms
  • Expand to other nodes if this proves successful
Benchmarks

JPL to Geo
Technology   1 MiB   10 MiB   100 MiB   1 GiB   2 GiB
TCP 1        0.55    0.94     0.93      1.33    0.94
TCP 2        0.55    1.07     2.58      2.68    2.73
TCP 4        0.55    1.19     5.07      5.46    5.45
TCP 8        0.56    1.19     8.95      10.6    10.79
TCP 16       0.56    1.19     12.02     18.45   20.32

Geo to JPL
Technology   1 MiB   10 MiB   100 MiB   1 GiB   2 GiB
TCP 1        0.36    0.61     0.66      0.58    0.68
TCP 2        0.36    0.63     1.31      1.36    1.37
TCP 4        0.39    0.62     2.26      2.69    2.7
TCP 8        0.41    0.62     3.8       5.06    5.2
TCP 16       0.41    0.63     5.72      8.06    8.87

Benchmarks (2)

USGS to JPL
Technology   1 MiB   10 MiB   100 MiB   1 GiB   2 GiB
TCP 1        1.29    2.11     2.59      2.61    1.78
TCP 2        0.93    2.59     3.6       4.01    2.6
TCP 4        0.9     1.87     4.3       4.17    3.22
TCP 8        0.89    2.56     3.95      4.28    3.86
TCP 16       0.89    2.16     4.16      4.19    3.84

JPL to USGS
Technology   1 MiB   10 MiB   100 MiB   1 GiB   2 GiB
TCP 1        0.87    0.89     0.88      0.96    N/A
TCP 2        0.83    1.01     1.71      1.81    N/A
TCP 4        0.77    0.91     2.45      3.03    3.12
TCP 8        0.87    1.02     2.89      3.73    3.76
TCP 16       0.81    0.74     3.55      3.79    4.02
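To compare how well each path scales, the benchmark rates can be normalized to the single-stream row. The sketch below does this for the 1 GiB column, the largest file size with values for every row. It assumes the labels TCP 1 to TCP 16 denote the number of parallel TCP streams, which the slides do not state explicitly, and it keeps the source's unstated units since only ratios are computed. `speedup` is a hypothetical helper.

```python
# 1 GiB column of the four benchmark tables, keyed by the TCP row label.
ONE_GIB = {
    "JPL to Geo": {1: 1.33, 2: 2.68, 4: 5.46, 8: 10.6, 16: 18.45},
    "Geo to JPL": {1: 0.58, 2: 1.36, 4: 2.69, 8: 5.06, 16: 8.06},
    "USGS to JPL": {1: 2.61, 2: 4.01, 4: 4.17, 8: 4.28, 16: 4.19},
    "JPL to USGS": {1: 0.96, 2: 1.81, 4: 3.03, 8: 3.73, 16: 3.79},
}


def speedup(rates: dict[int, float]) -> dict[int, float]:
    """Rate for each row divided by the TCP 1 rate, rounded to two decimals."""
    base = rates[1]
    return {n: round(rate / base, 2) for n, rate in rates.items()}


scaling = {path: speedup(rates) for path, rates in ONE_GIB.items()}
```

On these numbers the JPL/Geo paths keep improving up to TCP 16 (roughly a 14x gain over TCP 1), while USGS to JPL levels off below 2x and JPL to USGS at about 4x, so the benefit of additional streams depends strongly on the path.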
Recommendations
• Data Movement
  • PMWG will update its current data movement recommendations based on these results
  • Run current data movement deployment in parallel to FTP and other mechanisms as a pilot
  • Consider adding another “zone” at NSSDC for electronic data transfers
  • Capture updated benchmarks for Flagstaff after the network upgrade
  • Other DNs can address this when they reach the larger thresholds
• Data Storage
  • We now have quite a bit of experience with cloud computing, etc. to comment
  • Focus on requirements for data storage (e.g., storage service) as other development activities are under control
• Computation
  • The new PDS4 architecture allows us to run computationally intensive services in many different topologies. Explore as needed.