A walk-through of Joyent's Manta platform on SmartOS that explains how the illumos innovations of zones, ZFS and Node.js led to the development of the Manta Object Store. Examples, primary Manta commands and simple use cases are provided to start using Manta to analyze Big Data with any arbitrary Unix/POSIX code without moving the data.
Manta Unleashed
BigDataSg Meetup
2 July 2013
Christopher W. V. Hogue [email protected]
The SlideShare Friendly Version
Big Data in 2002 – NBLAST - Computing 361,249,575,000 Protein Sequence Alignments & storing significant hits
http://www.biomedcentral.com/content/pdf/1471-2105-3-13.pdf
Big Data in 2003 – Distributed Computing, Tiered Architecture for 10 Billion Protein 3D structure samples
Volunteer Computing
Blueprint Data Center
What is Manta?
Manta is a new operating-system level component of Joyent's IaaS platform, released on June 26, 2013.
http://www.joyent.com
Manta is an object store system for big-data that you can compute on without moving your data.
Manta provides map-reduce capability for executing POSIX standard, arbitrary compute jobs directly on cloud storage servers.
What is Manta?
Manta allows map-reduce operations
• formed by any standard UNIX command or application
• in any run-time language
• without moving stored data
• without Hadoop or Java code
• without loading raw data into a database
What Operating System?
• Manta is built on SmartOS, using the illumos kernel, which is open-source UNIX
• SmartOS is not GNU/Linux
• SmartOS is a very lightweight illumos distro for cloud hypervisors with KVM and storage that runs in RAM from PXE/CD/USB boot media
• Derived from Sun Microsystems' OpenSolaris
• Over 10,000 packages supported via pkgsrc system
illumos is the Open Source Unix kernel forked from Solaris
Cloud OS
Server OS
Storage OS
Kernel
DTrace
Crossbow
Zones
ZFS
SMF
MDB
CDDL
Oracle closed its Solaris source…
Aug 2010
Database OS
Jan 2010
and more…
Kernel Innovations
Bugfixes
GCC build
ZFS feature flags
ZFS background delete
ZFS LZ4 compression
KVM Type 1 hypervisor
UNIX System V Release 4
Four years of legal work to open-source Solaris. 2004-2008
1992
Manta – What is SmartOS?
• SmartOS is Joyent’s lightweight illumos kernel based operating system optimized for high-performance cloud computing.
• illumos is an open-source fork of OpenSolaris, supported by Joyent, Nexenta, OmniTI, DEY systems, Delphix and other core committers.
• After Oracle bought Sun Microsystems, many Solaris software engineers, including those who built ZFS, DTrace and other components, left Oracle and joined the illumos effort.
• illumos distros that you can experiment with include SmartOS, OmniOS, OpenIndiana, and NexentaStor.
• Prerequisite for Manta use: your code needs to run on, and be tested on, (x86) illumos!
Started in 2004
IaaS hosting:
– Windows, Linux, FreeBSD KVM images
– LinkedIn, Wanelo, Voxer, Storify, Geeklist, Tripshare …many others
– Singapore’s Reebonz (reebonz.com.sg)
4 Primary Data Centers ->
http://www.joyent.com/products/compute-service/data-centers
• Class-1 DC Operators
• SSAE 16 Certified
• Multi-layered Physical Security
• Highly-Redundant Power
• Early Warning Fire Suppression
• All Tier-1 ISP Connectivity
• 10gb/40gb Fully-Meshed Network
• Full Peering, Fiber Connectivity
May 20 2013 – Dell drops its OpenStack cloud and partners with Joyent for high-performance, high-availability IaaS service provision.
• 3rd Party Smart Data Center Licensees who run Joyent-Powered Clouds, e.g.:
– Telefonica – Spain – http://cloud.telefonica.com/instantservers/
– MiCloud – Taiwan – http://micloud.tw/ch/
– Libero – Italy – http://cloud.libero.it/it/il_nostro_cloud/profilo/
Joyent as an IaaS provider
• Has full development control of the entire operating system stack
• Is the corporate steward of the Node.js JavaScript runtime
• Community friendly - provides SmartOS image downloads, source for free, and support
• You can deploy a private cloud for free with 3rd party management software “Project Fifo”
SmartOS Storage Implementation
• All SmartOS storage is local, on ZFS
– Integrated disk/volume management
– Copy-on-write
– Self-healing
– Protection against silent data corruption
– No hardware RAID dependency
– Striping, RAID-Z with no write hole
– No fsck resilvering
– Built-in filesystem compression options
– Compress a subdirectory
– Snapshots
– zfs send / receive
– Integrated SSD IO caching
– Add drives with one command, while in production
Manta Storage Implementation
• Manta in the Joyent Datacenter is built on ZFS
– no SAN, no NAS head nodes
– no tiered layers – standard commodity Intel servers
– 4 U servers with 73 TiB of user data
– basic SAS HBA technology
– Every object is stored on 2 ZFS pools by policy default, local to the server on which it is accessed
– Architecture leads me to speculate that Manta stands for “Manta is Not Tiered Architecture”
Manta Features
• A multi-datacenter object store
• Fine-grained replication commands
• No object size limits
• Per-object replication policies
• File system-like namespace including directory queries
• Up to 1 million files per directory
• Public folders for CDN data delivery
• Read-after-write consistency
Manta Features
• SnapLinks – a file hard-link (ln) and snapshot mash-up, allowing alternate file naming and versioning in place. Use to mimic data movement.
• REST with JSON API
• Interactive shell access through Node.js driven SDK and commands
• Compute in place with map-reduce processing with arbitrary code and scripts without data movement
• GuardTime keyless data signatures and validation
Manta’s Compute-on-Storage
For a simple grep-style text query in a big-data collection of server logs:
On AWS S3
• Move the “big data” into
– EC2
– Hadoop
• Then orchestrate a method to run the query
• Then clean up additional big data instances
On Manta
• grep in place on the storage servers
• Manta hands back your job output in a new folder
How does Manta work?
• Install the Node.js package, which provides mlogin and the local Manta commands
• The local Node.js environment includes the Manta interactive shell and fast I/O for data and command transfers up to the Manta Data Center.
• Commands transit via REST APIs with JSON encoding. These can be called directly.
End User
How does Manta work?
• Connects to End User
• Distributes and commits data uploads according to replication policy (2 by default)
• Fast consistency, data is ready to use without waiting for synch
• Jobs are launched in SmartOS Zone VM images on the server
• The hashed UID of the Zone that is launched becomes the job number/directory for output data
Data Center
Manta Commands
Installed locally as Joyent Manta Node.js SDK.
mls - Lists directory contents
mput - Uploads data to an object
mget - Downloads an object from the service
mjob - Creates and runs a computational job on the service
mfind - Walks a hierarchy to find names of objects by name, size, or type
mlogin - Interactive session client
mln - Makes SnapLinks between objects
mmkdir - Make directories
mrm - Remove objects or directories
mrmdir - Remove empty directories
msign - Create a signed URL to an object stored in the service
muntar - Create a directory hierarchy from a tar file
Client-Side Utilities
Control interactively via shell-like SDK, OR automate with REST + JSON APIs.
Manta Commands
Additional commands are available to your jobs in the data-side compute environment:
maggr - Performs key-wise aggregation on plain text files.
mcat - Emits the named object as an output for the current task.
mpipe - Output pipe for the current task.
msplit - Split the output stream for the current task to many reducers.
mtee - Capture stdin and write to both stdout and an object.
Data-Side Utilities
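As a rough local illustration of what key-wise aggregation means, the following sketch sums a value column per key using plain awk. It is not maggr and does not use maggr's syntax; keywise_sum is a hypothetical helper name, and the "key value" input layout is an assumption made for the example.

```shell
# Key-wise aggregation sketched with awk (an illustration, not maggr itself).
# keywise_sum is a hypothetical helper: it reads "key value" lines on stdin
# and prints one "key total" line per distinct key, sorted by key.
keywise_sum() {
  awk '{ sum[$1] += $2 } END { for (k in sum) print k, sum[k] }' | LC_ALL=C sort
}

# Sample input: request method and a count, as a map phase might emit them.
result=$(printf 'GET 3\nPUT 1\nGET 4\nHEAD 2\nPUT 5\n' | keywise_sum)
printf '%s\n' "$result"
```

The sort at the end is only there to make the output order predictable, because awk does not guarantee the order in which array keys are visited.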
Manta patterns for job creation
• $ mjob create -m 'command-to-map' -r 'command-to-reduce'
• Big Data Map Reduce version of grep:
– (GNU grep -H prints the name of the file matching the pattern, so you know what file is matched)
• $ mjob create -m 'grep -H --label=$MANTA_INPUT_OBJECT pattern' -r cat
http://apidocs.joyent.com/manta/job-patterns.html
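The shape of the grep job can be tried locally before spending compute time. The sketch below emulates it with ordinary files: one grep per input stands in for the map phase and a single cat stands in for the reduce phase. It does not call Manta; map_phase, reduce_phase and the sample log files are hypothetical, and grep is given the file name as an argument instead of the --label option used in the job.

```shell
# Local emulation of the map-reduce grep pattern; no Manta account needed.
# map_phase and reduce_phase are hypothetical helper names, not Manta commands.

# Map: run grep once per input file, the way a job runs one map task per object.
# -H prefixes each match with the file name. grep exits 1 when nothing matches,
# so that status is ignored to keep the loop going.
map_phase() {
  pattern=$1
  shift
  for f in "$@"; do
    grep -H "$pattern" "$f" || true
  done
}

# Reduce: a single cat that concatenates the output of every map task.
reduce_phase() {
  cat
}

# Two small sample logs standing in for a collection of server logs.
workdir=$(mktemp -d)
printf 'GET /index.html 200\nGET /missing 404\n' > "$workdir/web1.log"
printf 'GET /about.html 200\nGET /gone 404\n' > "$workdir/web2.log"

result=$(cd "$workdir" && map_phase 404 web1.log web2.log | reduce_phase)
printf '%s\n' "$result"
rm -rf "$workdir"
```

Each output line carries the name of the log it came from, which is the reason the job uses -H together with a label.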
Manta Documentation – Total Word Count in text file collection with map-reduce of wc + awk 1-liner
Interactive
REST + JSON API
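The screenshots for this slide are not part of the transcript. As a stand-in, here is a local sketch of how a wc map phase and an awk reduce phase combine into a total count. The exact command shown on the slide may differ; map_wc, reduce_sum and the sample files are hypothetical.

```shell
# Local emulation of a wc + awk word count; no Manta account needed.
# map_wc and reduce_sum are hypothetical helper names, not Manta commands.

# Map: one wc per input, reading stdin the way a map task reads its object.
# With no options, wc prints three columns: lines, words, bytes.
map_wc() {
  for f in "$@"; do
    wc < "$f"
  done
}

# Reduce: add up each of the three columns across all map outputs.
reduce_sum() {
  awk '{ l += $1; w += $2; c += $3 } END { print l, w, c }'
}

workdir=$(mktemp -d)
printf 'the quick brown fox\n' > "$workdir/a.txt"
printf 'jumps over\nthe lazy dog\n' > "$workdir/b.txt"

totals=$(map_wc "$workdir/a.txt" "$workdir/b.txt" | reduce_sum)
printf '%s\n' "$totals"
rm -rf "$workdir"
```

For these two sample files the totals are 3 lines, 9 words and 44 bytes.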
Manta Documentation – Image conversion with ImageMagick “convert”
What software can I run on Manta?
Thousands of ready to use UNIX packages on the VM image:
Python
Perl
R
Node.js
Java
ImageMagick
ffmpeg
OpenSSL
Sqlite
MySQL client
Postgres client
What software can I run on Manta?
Or run custom software that is not on the VM image:
• These are called Assets
• Can be interpretable code or SmartOS compatible binaries
• Upload a SmartOS compatible package (e.g. tarball as tgz or a script file) on Manta
• Use a job script that unpacks the custom asset inside the Manta VM, and executes it.
• Use standard Unix approaches for error logging, output, pipes and tees.
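The unpack-and-execute step can be rehearsed locally. The sketch below builds a small tarball, then unpacks it into a scratch directory and runs the script inside it on stdin. It only illustrates the pattern: how an asset is attached to a job and where it appears inside the Manta VM are not shown, and run_asset, pkg/shout.sh and asset.tgz are hypothetical names.

```shell
# Local rehearsal of the "unpack an asset, then run it" pattern.
# run_asset, pkg/shout.sh and asset.tgz are hypothetical names for illustration.

# Build a tiny asset: a tarball containing one script that upper-cases stdin.
workdir=$(mktemp -d)
mkdir "$workdir/pkg"
printf '#!/bin/sh\ntr "[:lower:]" "[:upper:]"\n' > "$workdir/pkg/shout.sh"
tar -czf "$workdir/asset.tgz" -C "$workdir" pkg

# Job-script stand-in: unpack the asset into a scratch directory, run the
# script on stdin, clean up, and pass the script's exit status back.
run_asset() {
  asset=$1
  scratch=$(mktemp -d)
  tar -xzf "$asset" -C "$scratch"
  sh "$scratch/pkg/shout.sh"
  status=$?
  rm -rf "$scratch"
  return "$status"
}

result=$(printf 'hello manta\n' | run_asset "$workdir/asset.tgz")
printf '%s\n' "$result"
rm -rf "$workdir"
```

Returning the script's exit status keeps failures visible to whatever invoked the job script, in line with the standard Unix error handling the slide recommends.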
Use Cases
• Running a checksum over your data to assure its integrity
• Log processing: clickstream analysis, MapReduce on logs
• Text processing including search
• Image processing: converting formats, generating thumbnails, resizing
• Video processing: transcoding, extracting segments, resizing
• Data Analysis, Mining and Graphing with NumPy, SciPy and R
Use Cases
• Democratization of BIG DATA
– No longer in the hands of a few
• Mass market self-logging devices
– Transportation/Automotive
– E-health monitoring systems
– Sensor networks
• Scientific paper PDF collections
– Federate collections
– Allow large scale text mining
• Genomic Sequence Analysis
– Store Raw Data
– Move compute pipeline to data
– Meta-pipelines in parallel for computing over old data with new knowledge
Manta Pricing http://www.joyent.com/products/manta/pricing
Manta compute charges are by the second: $0.00004 per GB of DRAM per second.
If you run 1000 parallel tasks in 32 GB DRAM instances on 1000 objects and each task takes one second, you have used 32,000 GB-seconds (1000 tasks × 32 GB × 1 s), and the job costs 32,000 × $0.00004 = $1.28.
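The same arithmetic as a small helper, for trying other job sizes. This is a sketch under the assumption that the charge is simply tasks × GB of DRAM × seconds × rate, as in the slide's example; compute_cost is a hypothetical helper, not a Manta command, and the rate is the 2013 figure quoted above.

```shell
# Sketch of the compute charge: tasks x GB of DRAM x seconds x rate.
# compute_cost is a hypothetical helper, not a Manta command.
compute_cost() {
  tasks=$1
  gb=$2
  secs=$3
  rate=${4:-0.00004}   # USD per GB of DRAM per second, from the slide
  LC_ALL=C awk -v t="$tasks" -v g="$gb" -v s="$secs" -v r="$rate" \
    'BEGIN { printf "%.2f\n", t * g * s * r }'
}

# The slide's example: 1000 tasks, 32 GB each, one second each.
compute_cost 1000 32 1
```

With the slide's numbers this prints 1.28, matching the quoted job cost.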
Request Type                   Price per unit of requests
Delete                         Free
POST, PUT, LIST (“GET DIR”)    $0.005/1000 requests
GET, OPTION, HEAD              $0.004/10000 requests
Manta Pricing http://www.joyent.com/products/manta/pricing
Storage charges are slightly less than Amazon S3:
Bandwidth IN is free.
Bandwidth OUT has tiered charges.
Storage Tier       Default (2 copies)   Price per GB (per individual copy)
First 1 TB/mo      $0.086               $0.043
Next 49 TB/mo      $0.072               $0.036
Next 450 TB/mo     $0.064               $0.032
Next 500 TB/mo     $0.058               $0.029
Next 4000 TB/mo    $0.054               $0.027
Next 5000 TB/mo    $0.050               $0.025
Default is 2 copies. When submitting an object to the service, you can specify the number of copies stored, from one (1) to six (6).
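A tiered table like this is applied band by band: the first terabyte is billed at the first rate, the next 49 at the second, and so on. The sketch below works that out for a given stored volume. It rests on assumptions not stated on the slide: 1 TB is taken as 1000 GB, the tiers are applied to the logical data size with the per-copy price multiplied by the copy count, and usage beyond the last listed tier is left unpriced. storage_cost is a hypothetical helper.

```shell
# Sketch of the monthly storage charge from the tier table.
# storage_cost is a hypothetical helper, not a Manta command.
# Usage: storage_cost <stored GB> [copies, default 2]
storage_cost() {
  LC_ALL=C awk -v gb="$1" -v copies="${2:-2}" 'BEGIN {
    # Tier sizes in TB and per-copy prices in USD per GB, from the slide.
    n = split("1 49 450 500 4000 5000", size_tb, " ")
    split("0.043 0.036 0.032 0.029 0.027 0.025", price, " ")
    gb_per_tb = 1000          # assumption: decimal terabytes
    remaining = gb
    total = 0
    # Fill each tier in order until the stored volume is used up.
    # Anything beyond the last listed tier is not priced by this sketch.
    for (i = 1; i <= n && remaining > 0; i++) {
      cap = size_tb[i] * gb_per_tb
      used = (remaining < cap) ? remaining : cap
      total += used * price[i] * copies
      remaining -= used
    }
    printf "%.2f\n", total
  }'
}

# 2000 GB at the default two copies: 1000 GB in tier 1, 1000 GB in tier 2.
storage_cost 2000 2
```

For 2000 GB with two copies the result is 158.00: 1000 GB × $0.086 plus 1000 GB × $0.072.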
Deploy a Fast, Scalable, Free, Open Source Private IaaS Cloud Today.
• SmartOS http://smartos.org/
• Project FiFO http://project-fifo.net
My PXE boot 2-node desktop IaaS Cloud setup
Fifo Web Console managing SmartOS
KVM Type 1 (bare metal) Hypervisor