The Hadoop Path: a short presentation on where Hadoop is going, by Subash DSouza

Page 1: The Hadoop Path

The Hadoop Path: A short presentation on where Hadoop is going

By Subash DSouza

Page 2: The Hadoop Path

Hadoop and Google

Hadoop grew out of seminal papers published by Google in the early-to-mid 2000s: GFS (2003), MapReduce (2004), and Bigtable (2006).

To see where Hadoop is moving is to see where Google has gone.

Great keynote talk by M.C. Srivas of MapR next week that addresses this question.

Page 3: The Hadoop Path

Jonathan Hsieh – Keynote talk at Big Data Camp LA 2014

Page 4: The Hadoop Path

Where do I think Hadoop is moving?

Security

Real Time Analytics

Page 5: The Hadoop Path

Security

Hadoop vendors have become serious about security in the past year

Hortonworks’s acquisition of XA Secure

Cloudera’s acquisition of Gazzang

Kerberos has been the basis for authentication for quite some time, but features like audit control and MDM have only been on the horizon.

With these acquisitions, Hadoop vendors have been positioning themselves for a better security play.

Cloudera has Apache Sentry, Hortonworks has Apache Knox.

MapR supports security through authentication and authorization.

Page 6: The Hadoop Path

Real Time Analytics

Real Time Streaming: quickly ingest data as it comes in.

Real Time Reporting: quickly process the ingested data.

Page 7: The Hadoop Path

Real Time Streaming

Storm

Spark Streaming

Samza

Page 8: The Hadoop Path

Apache Storm

One of the first streaming tools built.

Very low latency, typically looking at 10-200 ms.

Started by Nathan Marz at BackType, which was later acquired by Twitter.

Strong support from Hortonworks.

Lower-level APIs than Spark.

Trident is Storm's micro-batching abstraction, which closely resembles Spark Streaming.

Page 9: The Hadoop Path

Spark Streaming

Based on the fact that not all data is required instantaneously.

Uses a micro-batch method.

Latency is approx. 1 sec.

Streaming has single points of failure.

Has scale issues.

Good for machine learning.

Strong support from Databricks, Cloudera, Hortonworks, MapR, Datastax & Pivotal.

Easier to integrate with Spark.
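
The micro-batch idea above can be sketched in plain Python. This is only an illustration of the technique, not Spark Streaming's API: the `micro_batches` function, the `(timestamp, payload)` event shape, and the sample data are all assumptions made for the example. Events are grouped into fixed windows (1 second here, matching the approximate latency quoted above) and each window is handed downstream as one small batch.

```python
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[float, str]  # (arrival time in seconds, payload); hypothetical shape


def micro_batches(events: Iterable[Event], interval: float = 1.0) -> Iterator[List[str]]:
    """Group time-ordered events into consecutive windows of `interval` seconds.

    A window with no events still produces an (empty) batch, so downstream
    code sees one batch per interval between the first and last event.
    """
    batch: List[str] = []
    window_end = None
    for ts, payload in events:
        if window_end is None:
            window_end = ts + interval  # first window starts at the first event
        while ts >= window_end:
            yield batch                 # close the current window
            batch = []
            window_end += interval
        batch.append(payload)
    if batch:
        yield batch                     # flush the final, partially filled window


# Illustrative input: four events arriving over roughly three seconds.
events = [(0.0, "a"), (0.4, "b"), (1.2, "c"), (3.1, "d")]
batches = list(micro_batches(events))
print(batches)
```

The trade-off is visible in the structure: an event waits until its window closes before anything processes it, which is why latency is bounded below by the batch interval rather than by per-event processing time.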

Page 10: The Hadoop Path

Apache Samza (Incubator)

Stream processing API built atop Kafka and YARN.

Support from LinkedIn.

Very similar to Storm.

Currently offers only one level of delivery guarantee (at-least-once) vs. multiple levels of guarantee in Storm.
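
To make "level of guarantee" concrete, here is a minimal sketch of at-least-once delivery in plain Python. It does not use the Samza or Storm APIs; `deliver_at_least_once`, `flaky_ack`, and the message names are hypothetical. A message is redelivered until it is acknowledged, so nothing is lost, but a lost acknowledgement causes a duplicate.

```python
from typing import Callable, Iterable, List


def deliver_at_least_once(
    messages: Iterable[str],
    process: Callable[[str, int], bool],
    max_attempts: int = 3,
) -> List[str]:
    """Deliver each message until `process` acknowledges it.

    Returns the full delivery log, which may contain duplicates: that is the
    cost of at-least-once compared with exactly-once semantics.
    """
    delivered: List[str] = []
    for msg in messages:
        for attempt in range(1, max_attempts + 1):
            delivered.append(msg)      # the consumer has now seen this message
            if process(msg, attempt):  # True means the ack came back
                break
        else:
            raise RuntimeError(f"no ack for {msg!r} after {max_attempts} attempts")
    return delivered


def flaky_ack(msg: str, attempt: int) -> bool:
    """Hypothetical consumer whose first ack for message 'b' is lost."""
    return not (msg == "b" and attempt == 1)


log = deliver_at_least_once(["a", "b", "c"], flaky_ack)
print(log)
```

Consumers under this guarantee generally need to be idempotent or deduplicate on their own, since the second delivery of "b" is indistinguishable from a new message unless it carries an identifier.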

Page 11: The Hadoop Path

Real Time Reporting (or near real time)

Hive on Tez (Stinger)

Impala

Drill

Spark

HAWQ

Page 12: The Hadoop Path

Apache Hive on Apache Tez

Tez is a new application framework built atop YARN.

Workflows are compiled to DAGs on Tez.

Optimized jobs can run up to 5 times faster than standard MapReduce.

Supports in-memory jobs for small datasets.

Supported by Hortonworks & MapR.
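
As a rough illustration of what "compiled to a DAG" means, the sketch below represents a query plan as stages with dependencies and derives a valid execution order using Kahn's algorithm. It is not Tez's API; the `execution_order` function and the stage names in `plan` are invented for the example.

```python
from collections import deque
from typing import Dict, List


def execution_order(deps: Dict[str, List[str]]) -> List[str]:
    """Return stages ordered so that every stage runs after its inputs.

    `deps` maps each stage to the stages it reads from; every referenced
    stage must itself be a key in `deps`.
    """
    remaining = {stage: len(inputs) for stage, inputs in deps.items()}
    consumers: Dict[str, List[str]] = {stage: [] for stage in deps}
    for stage, inputs in deps.items():
        for upstream in inputs:
            consumers[upstream].append(stage)

    # Stages with no inputs can start immediately (sorted for a stable order).
    ready = deque(sorted(s for s, n in remaining.items() if n == 0))
    order: List[str] = []
    while ready:
        stage = ready.popleft()
        order.append(stage)
        for downstream in consumers[stage]:
            remaining[downstream] -= 1
            if remaining[downstream] == 0:
                ready.append(downstream)

    if len(order) != len(deps):
        raise ValueError("workflow contains a cycle, so it is not a DAG")
    return order


# Hypothetical plan: two scans feed a join, which feeds an aggregation.
plan = {
    "scan_orders": [],
    "scan_users": [],
    "join": ["scan_orders", "scan_users"],
    "aggregate": ["join"],
}
order = execution_order(plan)
print(order)
```

Expressing the whole plan as one graph is what lets an engine run it as a single job, instead of a chain of separate MapReduce jobs that each write intermediate results to HDFS.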

Page 13: The Hadoop Path

Cloudera Impala

Massively parallel processing (MPP) architecture for performance, with Hadoop scalability.

Perform interactive analysis on any data stored in HDFS and HBase.

Built with native Hadoop security: integrated with Kerberos for authentication and Apache Sentry for fine-grained, role-based authorization.

ANSI-92 SQL support.

Supports common Hadoop file formats: text, SequenceFiles, Avro, RCFile, LZO and Parquet.

Supported by Cloudera & MapR.

Page 14: The Hadoop Path

Apache Drill (Incubator)

Drill is a clustered, powerful MPP (Massively Parallel Processing) query engine for Hadoop that can process petabytes of data, fast.

Useful for short, interactive ad-hoc queries on large-scale data sets.

Capable of querying nested data in formats like JSON and Parquet and performing dynamic schema discovery.

Does not require a centralized metadata repository.

Apache Drill provides direct queries on self-describing and semi-structured data in files (such as JSON, Parquet) and HBase tables.

Supported by MapR.
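
The idea of dynamic schema discovery on nested, self-describing data can be sketched as follows. This is not how Drill is implemented and uses none of its APIs; `discover_schema` and the sample records are assumptions for illustration. The schema is inferred from the records themselves rather than looked up in a central metadata store.

```python
import json
from typing import Any, Dict, Iterable, Set


def discover_schema(records: Iterable[Any]) -> Dict[str, Set[str]]:
    """Infer dotted field paths and the value types observed at each path."""
    schema: Dict[str, Set[str]] = {}

    def walk(value: Any, path: str) -> None:
        if isinstance(value, dict):
            for key, child in value.items():
                walk(child, f"{path}.{key}" if path else key)
        else:
            # Leaf value: record the Python type name seen at this path.
            schema.setdefault(path, set()).add(type(value).__name__)

    for record in records:
        walk(record, "")
    return schema


# Hypothetical JSON lines whose fields and types differ between records.
lines = [
    '{"user": {"id": 1, "name": "ann"}, "clicks": 3}',
    '{"user": {"id": 2}, "clicks": "n/a", "referrer": "ad"}',
]
schema = discover_schema(json.loads(line) for line in lines)
for path in sorted(schema):
    print(path, sorted(schema[path]))
```

Note that `clicks` ends up with two observed types and `referrer` appears in only one record; a query engine working without a declared schema has to cope with exactly this kind of drift at read time.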

Page 15: The Hadoop Path

Apache Spark

Consists of multiple projects: Spark Streaming, Spark SQL, MLlib and GraphX.

Runs atop YARN, Mesos & EC2.

Uses the concept of RDDs (Resilient Distributed Datasets), where the data is immutable: transformations produce new RDDs rather than modifying existing ones.

Enables in-memory processing when needed.

Supported by Databricks, Cloudera, MapR, Hortonworks, Datastax & Pivotal.

Strong support not just from the Hadoop community but also from data science: Mahout is moving to Spark, and so is Cloudera Oryx.
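
The RDD idea of immutable data under transformation can be illustrated with a toy class in plain Python. `MiniRDD` is a hypothetical stand-in written for this example, not Spark's implementation or API; it only mirrors the shape of `map`, `filter`, and `collect`. Each transformation returns a new object that remembers its parent, and results are computed only when collected.

```python
from typing import Any, Callable, Iterable, Iterator, List, Optional


class MiniRDD:
    """Toy RDD: immutable, lazily evaluated, recomputed from its lineage."""

    def __init__(
        self,
        source: Optional[Iterable[Any]],
        parent: Optional["MiniRDD"] = None,
        transform: Optional[Callable[[Iterator[Any]], Iterator[Any]]] = None,
    ) -> None:
        # Source data is frozen into a tuple so it cannot be mutated later.
        self._source = tuple(source) if source is not None else None
        self._parent = parent
        self._transform = transform

    def map(self, fn: Callable[[Any], Any]) -> "MiniRDD":
        # Returns a new MiniRDD; this one is left untouched.
        return MiniRDD(None, self, lambda rows: (fn(r) for r in rows))

    def filter(self, pred: Callable[[Any], bool]) -> "MiniRDD":
        return MiniRDD(None, self, lambda rows: (r for r in rows if pred(r)))

    def _compute(self) -> Iterator[Any]:
        if self._parent is None:
            return iter(self._source)
        # Replay the parent's computation, then apply this step.
        return self._transform(self._parent._compute())

    def collect(self) -> List[Any]:
        return list(self._compute())


numbers = MiniRDD([1, 2, 3, 4, 5])
evens_squared = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n)
print(evens_squared.collect())
print(numbers.collect())
```

Because every derived dataset can be rebuilt by replaying its chain of parents, a lost partition can be recomputed rather than restored from a replica, which is the "resilient" part of the name.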

Page 16: The Hadoop Path

Pivotal HAWQ

Part of the Pivotal platform.

Full SQL syntax support.

Interoperability with Hive and HBase through the Pivotal Extension Framework (PXF).

Interoperability with Pivotal’s GemFire XD, their in-memory real-time database backed by HDFS.

Proprietary to the Pivotal platform.

Page 17: The Hadoop Path

What to use where?

Dependent on Use cases.

Use the right tool for the job.

Sometimes there are several tools for the same job, especially in the Hadoop ecosystem.

In such scenarios, use whatever is easiest and most scalable for the enterprise.

Page 18: The Hadoop Path

Q&A

@sawjd22
