Manta Unleashed BigDataSg Meetup 2 July 2013 Christopher W. V. Hogue Ph.D. [email protected] The SlideShare Friendly Version



DESCRIPTION

A walk-through of Joyent's Manta platform on SmartOS that explains how the illumos innovations of zones and ZFS, together with Node.js, led to the development of the Manta Object Store. Examples, primary Manta commands and simple use cases are provided to start using Manta to analyze Big Data with arbitrary Unix/POSIX code without moving the data.

Citation preview

Page 1: Manta Unleashed BigDataSG talk 2 July 2013

Manta Unleashed

BigDataSg Meetup, 2 July 2013

Christopher W. V. Hogue [email protected]

The SlideShare Friendly Version

Page 2: Manta Unleashed BigDataSG talk 2 July 2013

Big Data in 2002 – NBLAST – Computing 361,249,575,000 Protein Sequence Alignments & storing significant hits

http://www.biomedcentral.com/content/pdf/1471-2105-3-13.pdf

Page 3: Manta Unleashed BigDataSG talk 2 July 2013

Big Data in 2003 – Distributed Computing, Tiered Architecture for 10 Billion Protein 3D structure samples

Volunteer Computing

Blueprint Data Center

Page 4: Manta Unleashed BigDataSG talk 2 July 2013

What is Manta?

Manta is a new operating-system level component of the IaaS platform of Joyent released June 26 2013.

http://www.joyent.com

Manta is an object store system for big-data that you can compute on without moving your data.

Manta provides map-reduce capability for executing POSIX standard, arbitrary compute jobs directly on cloud storage servers.

Page 5: Manta Unleashed BigDataSG talk 2 July 2013

What is Manta?

Manta allows map-reduce operations

formed by any standard UNIX command or application

in any run-time language

without moving stored data

without Hadoop or Java code

without loading raw data into a database

Page 6: Manta Unleashed BigDataSG talk 2 July 2013

What Operating System?

• Manta is built on SmartOS, using the illumos kernel, which is open-source UNIX

• SmartOS is Not GNU/LINUX

• SmartOS is a very lightweight illumos distro for cloud hypervisors with KVM and storage that runs in RAM from PXE/CD/USB boot media

• Derived from Sun Microsystems' OpenSolaris

• Over 10,000 packages supported via pkgsrc system

Page 7: Manta Unleashed BigDataSG talk 2 July 2013

illumos is the Open Source Unix kernel forked from Solaris

Cloud OS

Server OS

Storage OS

Kernel, DTrace, Crossbow, Zones, ZFS, SMF, MDB

CDDL

Oracle closed its Solaris source…

Aug 2010

Database OS

Jan 2010

and more…

Kernel Innovations: Bugfixes, GCC build, ZFS feature flags, ZFS background delete, ZFS LZ4 compression, KVM Type 1 hypervisor

UNIX System V Release 4

Four years of legal work to open-source Solaris. 2004-2008

1992

Page 8: Manta Unleashed BigDataSG talk 2 July 2013

Manta – What is SmartOS?

• SmartOS is Joyent’s lightweight illumos kernel based operating system optimized for high-performance cloud computing.

• illumos is an open-source fork of OpenSolaris, supported by Joyent, Nexenta, OmniTI, DEY systems, Delphix and other core committers.

• After Oracle bought Sun Microsystems, many Solaris software engineers, including those who built ZFS, DTrace and other components, left Oracle and joined the illumos effort.

• illumos distros that you can experiment with include SmartOS, OmniOS, OpenIndiana, and NexentaStor.

• Prerequisite for Manta use: your code needs to run, and be tested, on (x86) illumos!

Page 9: Manta Unleashed BigDataSG talk 2 July 2013

Started in 2004

IaaS hosting:

– Windows, Linux, FreeBSD KVM images

– LinkedIn, Wanelo, Voxer, Storify, Geeklist, Tripshare …many others

– Singapore’s Reebonz (reebonz.com.sg)

4 Primary Data Centers ->

http://www.joyent.com/products/compute-service/data-centers

• Class-1 DC Operators
• SSAE 16 Certified
• Multi-layered Physical Security
• Highly-Redundant Power
• Early Warning Fire Suppression
• All Tier-1 ISP Connectivity
• 10gb/40gb Fully-Meshed Network
• Full Peering, Fiber Connectivity

May 20 2013 – Dell drops its OpenStack cloud and partners with Joyent for high-performance, high-availability IaaS service provision.

Page 10: Manta Unleashed BigDataSG talk 2 July 2013

• 3rd Party Smart Data Center Licensees who run Joyent-Powered Clouds, e.g.:

– Telefonica – Spain – http://cloud.telefonica.com/instantservers/

– MiCloud – Taiwan – http://micloud.tw/ch/

– Libero – Italy – http://cloud.libero.it/it/il_nostro_cloud/profilo/

Page 11: Manta Unleashed BigDataSG talk 2 July 2013

Joyent as an IaaS provider

• Has full development control of the entire operating system stack

• Is the corporate steward of the Node.js JavaScript runtime

• Community friendly - provides SmartOS image downloads, source for free, and support

• You can deploy a private cloud for free with the 3rd party management software “Project FiFO”

Page 12: Manta Unleashed BigDataSG talk 2 July 2013

SmartOS Storage Implementation

• All SmartOS storage is local, on ZFS

– Integrated disk/volume management
– Copy-on-write
– Self-healing
– Protection against silent data corruption
– No hardware RAID dependency
– Striping, RAID-Z with no write hole
– No fsck resilvering
– Built-in filesystem compression options
– Compress a subdirectory
– Snapshots
– zfs send / receive
– Integrated SSD IO caching
– Add drives with one command, while in production

Page 13: Manta Unleashed BigDataSG talk 2 July 2013

Manta Storage Implementation

• Manta in the Joyent Datacenter is built on ZFS

– no SAN, no NAS head nodes
– no tiered layers
– standard commodity Intel servers
– 4 U servers with 73 TiB of user data
– basic SAS HBA technology

– Every object is stored on 2 ZFS pools by policy default, local to the server on which it is accessed

– Architecture leads me to speculate that Manta stands for

“Manta is Not Tiered Architecture”

Page 14: Manta Unleashed BigDataSG talk 2 July 2013

Manta Features

• A multi-datacenter object store
• Fine-grained replication commands
• No object size limits
• Per-object replication policies
• File system-like namespace including directory queries
• Up to 1 million files per directory
• Public folders for CDN data delivery
• Read-after-write consistency

Page 15: Manta Unleashed BigDataSG talk 2 July 2013

Manta Features

• SnapLinks – a file hard-link (ln) and snapshot mash-up, allowing alternate file naming and versioning in place. Use to mimic data movement.
• REST with JSON API
• Interactive shell access through Node.js driven SDK and commands
• Compute in place with map-reduce processing with arbitrary code and scripts without data movement
• GuardTime keyless data signatures and validation

Page 16: Manta Unleashed BigDataSG talk 2 July 2013

Manta’s Compute-on-Storage

For a simple grep-style text query in a big-data collection of server logs:

On AWS S3

• Move the “big data” into
– EC2
– Hadoop

• Then orchestrate a method to run the query

• Then clean up additional big data instances

On Manta

• grep in place on the storage servers

• Manta hands back your job output in a new folder

Page 17: Manta Unleashed BigDataSG talk 2 July 2013

How does Manta work?

• Install the Node.js package, which provides mlogin and the local Manta commands

• The local Node.js environment includes the Manta interactive shell and fast I/O data and command transfers up to the Manta Data Center.

• Commands transit via REST APIs with JSON encoding. These can be called directly.

End User

Page 18: Manta Unleashed BigDataSG talk 2 July 2013

How does Manta work?

• Connects to End User
• Distributes and commits data uploads according to replication policy (2 by default)
• Fast consistency, data is ready to use without waiting for synch
• Jobs are launched in SmartOS Zone VM images on the server
• The hashed UID of the Zone that is launched becomes the job number/directory for output data

Data Center

Page 19: Manta Unleashed BigDataSG talk 2 July 2013

Manta Commands

Installed locally as Joyent Manta Node.js SDK.

mls - Lists directory contents
mput - Uploads data to an object
mget - Downloads an object from the service
mjob - Creates and runs a computational job on the service
mfind - Walks a hierarchy to find names of objects by name, size, or type
mlogin - Interactive session client
mln - Makes SnapLinks between objects
mmkdir - Make directories
mrm - Remove objects or directories
mrmdir - Remove empty directories
msign - Create a signed URL to an object stored in the service
muntar - Create a directory hierarchy from a tar file

Client-Side Utilities

Control interactively via shell-like SDK, OR automate with REST + JSON APIs.

Page 20: Manta Unleashed BigDataSG talk 2 July 2013

Manta Commands

Additional commands are available to your jobs in the data-side compute environment:

maggr - Performs key-wise aggregation on plain text files.

mcat - Emits the named object as an output for the current task.

mpipe - Output pipe for the current task.

msplit - Split the output stream for the current task to many reducers.

mtee - Capture stdin and write to both stdout and an object.

Data-Side Utilities
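"Key-wise aggregation" in the maggr description means grouping lines by a key column and combining a value column for each key. The sketch below illustrates that idea locally with awk. It is not maggr and does not use maggr's options; the function name sum_by_key and the two-column "key value" input layout are assumptions made for illustration.

```shell
# Hypothetical helper: sum the second column for each distinct key found in
# the first column. This illustrates key-wise aggregation only; it is not maggr.
sum_by_key() {
  awk '{ total[$1] += $2 } END { for (k in total) print k, total[k] }' | sort
}

# Example input: one "key value" pair per line.
printf '%s\n' "GET 3" "PUT 1" "GET 4" | sum_by_key
```

In a reduce phase, the same kind of command would receive the concatenated output of all map tasks on stdin.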

Page 21: Manta Unleashed BigDataSG talk 2 July 2013

Manta patterns for job creation

• $ mjob create -m 'command-to-map' -r 'command-to-reduce'

• Big Data Map Reduce version of grep:
– (GNU grep -H prints the name of the file matching the pattern, so you know which file matched)

• $ mjob create -m 'grep -H --label=$MANTA_INPUT_OBJECT pattern' -r cat
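The grep pattern can be tried out locally before submitting a job. The sketch below is a stand-in for the job, not Manta itself: it runs the map command once per local file with the file on stdin, sets MANTA_INPUT_OBJECT by hand, and uses cat as the reduce step. The function names grep_map and local_grep_job are hypothetical, and GNU grep is assumed because of --label.

```shell
# Map step: one task per input object. The object arrives on stdin, so
# --label supplies the name that -H prints (both are GNU grep options).
grep_map() {
  grep -H --label="$MANTA_INPUT_OBJECT" -e "$1"
}

# Local stand-in for the job: run the map step once per file, then let the
# reduce step (cat) concatenate the per-object results.
local_grep_job() {
  pattern=$1
  shift
  for obj in "$@"; do
    MANTA_INPUT_OBJECT=$obj
    grep_map "$pattern" < "$obj"
  done | cat
}
```

Calling local_grep_job error a.log b.log prints each matching line prefixed with the name of the file it came from, which is the shape of output the reduce phase hands back.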

http://apidocs.joyent.com/manta/job-patterns.html

Page 22: Manta Unleashed BigDataSG talk 2 July 2013

Manta Documentation – Total Word Count in text file collection with map-reduce of wc + awk 1-liner

Interactive

REST + JSON API
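The word-count pattern named in this slide's title (wc in the map phase, an awk one-liner in the reduce phase) can be sketched locally as follows. This is a minimal simulation over local files, not the command shown in the Manta documentation; the function names wc_map, wc_reduce and local_wc_job are hypothetical.

```shell
# Map step: count the words in one object read from stdin.
wc_map() {
  wc -w
}

# Reduce step: add up the per-object counts emitted by the map tasks.
wc_reduce() {
  awk '{ total += $1 } END { print total + 0 }'
}

# Local stand-in for the job: one map task per file, one reducer.
local_wc_job() {
  for obj in "$@"; do
    wc_map < "$obj"
  done | wc_reduce
}
```

Each map task emits a single number, so the reducer only has to sum one column regardless of how large the inputs are.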

Page 23: Manta Unleashed BigDataSG talk 2 July 2013

Manta Documentation – Image conversion with ImageMagick “convert”

Page 24: Manta Unleashed BigDataSG talk 2 July 2013

What software can I run on Manta?

Thousands of ready to use UNIX packages on the VM image:

Python
Perl
R
Node.js
Java
ImageMagick
ffmpeg
OpenSSL
SQLite
MySQL client
Postgres client

Page 25: Manta Unleashed BigDataSG talk 2 July 2013

What software can I run on Manta?

Or run custom software that is not on the VM image:

• These are called Assets

• Can be interpretable code or SmartOS compatible binaries

• Upload a SmartOS compatible package (e.g. tarball as tgz or a script file) on Manta

• Use a job script that unpacks the custom asset inside the Manta VM, and executes it.

• Use standard Unix approaches for error logging, output, pipes and tees.
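The unpack-and-execute step for an asset can be sketched as a small job script. The version below runs locally and makes several assumptions: the asset is a gzip-compressed tarball, its entry point is a shell script, and the function name run_asset is hypothetical. How the asset is attached to a job and where it appears inside the Manta VM are not shown here.

```shell
# Hypothetical job script: unpack an asset tarball into a scratch directory
# and run its entry point, passing the task's stdin through to it.
run_asset() {
  tarball=$1
  entry=$2
  workdir=$(mktemp -d) || return 1
  tar -xzf "$tarball" -C "$workdir" || return 1
  sh "$workdir/$entry"
  status=$?
  rm -rf "$workdir"   # clean up the scratch directory
  return $status
}
```

A call such as run_asset asset.tgz upper.sh reads the task input on stdin and writes the result to stdout, so it composes with ordinary pipes and tees.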

Page 26: Manta Unleashed BigDataSG talk 2 July 2013

Use Cases

• Running a checksum over your data to assure its integrity

• Log processing: clickstream analysis, MapReduce on logs

• Text processing including search

• Image processing: converting formats, generating thumbnails, resizing

• Video processing: transcoding, extracting segments, resizing

• Data Analysis, Mining and Graphing with NumPy, SciPy and R
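The first use case, checksumming stored data, fits the same map-reduce shape as the grep job. The sketch below simulates it over local files: each map task runs the POSIX cksum utility on one object and tags the result with the object name, and cat acts as the reducer. The function names are hypothetical, and MANTA_INPUT_OBJECT is set by hand here rather than by Manta.

```shell
# Map step: checksum one object read from stdin and tag the result with the
# object's name. Output fields: name, CRC, size in bytes.
cksum_map() {
  cksum | awk -v name="$MANTA_INPUT_OBJECT" '{ print name, $1, $2 }'
}

# Local stand-in for the job: one map task per file, cat as the reducer.
local_cksum_job() {
  for obj in "$@"; do
    MANTA_INPUT_OBJECT=$obj
    cksum_map < "$obj"
  done | cat
}
```

Comparing the output of two runs taken at different times shows which objects, if any, have changed.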

Page 27: Manta Unleashed BigDataSG talk 2 July 2013

Use Cases

• Democratization of BIG DATA
– No longer in the hands of a few

• Mass market self-logging devices
– Transportation/Automotive
– E-health monitoring systems
– Sensor networks

• Scientific paper PDF collections
– Federate collections
– Allow large scale text mining

• Genomic Sequence Analysis
– Store Raw Data
– Move compute pipeline to data
– Meta-pipelines in parallel for computing over old data with new knowledge

Page 28: Manta Unleashed BigDataSG talk 2 July 2013

Manta Pricing http://www.joyent.com/products/manta/pricing

Manta compute charges are by the second: $0.00004 per GB of DRAM per second.

If you run 1000 parallel tasks in 32 GB DRAM instances on 1000 objects and each task takes one second, you have used 32,000 GB-seconds, and the job would cost 32,000 × $0.00004 = $1.28.
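The compute charge is tasks × GB of DRAM × seconds × rate. A small awk helper reproduces the arithmetic of the example; the function name compute_cost is hypothetical, and the rate is the $0.00004 figure quoted on this slide.

```shell
# Hypothetical helper: cost in dollars for a job, given the number of tasks,
# the DRAM per task in GB, and the seconds each task runs.
compute_cost() {
  awk -v tasks="$1" -v gb="$2" -v secs="$3" -v rate=0.00004 \
    'BEGIN { printf "%.2f\n", tasks * gb * secs * rate }'
}

compute_cost 1000 32 1   # prints 1.28
```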

Request Type                    Price per unit of requests
Delete                          Free
POST, PUT, LIST (“GET DIR”)     $0.005/1000 requests
GET, OPTION, HEAD               $0.004/10000 requests

Page 29: Manta Unleashed BigDataSG talk 2 July 2013

Manta Pricing http://www.joyent.com/products/manta/pricing

Storage charges are slightly less than Amazon S3:

Bandwidth IN is free.

Bandwidth OUT has tiered charges.

Storage Tier       Default (2 copies)   Price per GB (per individual copy)
First 1 TB/mo      $0.086               $0.043
Next 49 TB/mo      $0.072               $0.036
Next 450 TB/mo     $0.064               $0.032
Next 500 TB/mo     $0.058               $0.029
Next 4000 TB/mo    $0.054               $0.027
Next 5000 TB/mo    $0.050               $0.025

Default is 2 copies. When submitting an object to the service, you can specify the number of copies stored, from one (1) to six (6).
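Tiered pricing means each slice of stored data is charged at the rate of the tier it falls into. The sketch below applies the two-copy prices from this slide to a given number of GB. It rests on assumptions that the slide does not state: that 1 TB is counted as 1000 GB, that the prices are per GB per month, and that nothing is charged beyond the last listed tier. The function name manta_storage_cost is hypothetical.

```shell
# Hypothetical helper: monthly storage cost in dollars for a given number of
# GB at the default of 2 copies. Assumes 1 TB = 1000 GB; usage beyond the
# listed tiers (10,000 TB in total) is not priced by this sketch.
manta_storage_cost() {
  awk -v gb="$1" 'BEGIN {
    n = split("1 49 450 500 4000 5000", size, " ")            # tier sizes, TB
    split("0.086 0.072 0.064 0.058 0.054 0.050", price, " ")  # $ per GB
    cost = 0
    left = gb + 0
    for (i = 1; i <= n && left > 0; i++) {
      cap = size[i] * 1000                  # tier capacity in GB
      used = (left < cap) ? left : cap      # GB charged at this tier
      cost += used * price[i]
      left -= used
    }
    printf "%.2f\n", cost
  }'
}

manta_storage_cost 2000   # 1000 GB at $0.086 + 1000 GB at $0.072
```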

Page 30: Manta Unleashed BigDataSG talk 2 July 2013

Deploy a Fast, Scalable, Free, Open Source Private IaaS Cloud Today.

• SmartOS: http://smartos.org/

• Project FiFO: http://project-fifo.net

My PXE boot 2-node desktop IaaS Cloud setup

FiFO Web Console managing SmartOS

KVM Type 1 (bare metal) Hypervisor