The Evolution and Future of Hadoop Storage (Hadoop Conference Japan 2016 keynote slides)

© Cloudera, Inc. All rights reserved.

The Evolution and Future of Hadoop Storage
Todd Lipcon | Engineer at Cloudera
Twitter: @tlipcon | [email protected]


Introduction (the evolution and future of me)

Mailing list messages sent by Todd Lipcon

Spoke at HCJ 2011!

- Early user of Hadoop
- Joined Cloudera as Software Engineer
- Work on HDFS, HBase, MR (HA, performance, stability, etc.)
- Became a committer, PMC member, and ASF Member
- Founded the Kudu project within Cloudera
- Secretly developing with a small team for 3 years
- Kudu announced and contributed to the ASF as Apache Kudu (incubating)


Happy birthday! Hadoop: the last 10 years



Parquet Sentry Spark Tez

Impala Kafka Drill Flume Bigtop Oozie MRUnit HCatalog

Hue Sqoop Whirr Avro Hive

Mahout HBase

ZooKeeper Solr Pig

YARN Core Hadoop

Evolution of the Hadoop Platform

[Timeline figure: projects joining the Hadoop stack by year]
2006: Core Hadoop (HDFS, MapReduce)
2007: Solr, Pig
2008: HBase, ZooKeeper
2009: Hive, Mahout
2010: Sqoop, Whirr, Avro
2011: Flume, Bigtop, Oozie, MRUnit, HCatalog
2012: YARN
2013: Spark, Tez
2014-15: Ibis, Flink

The stack is continually evolving and growing!

Basics

- Very basic Hadoop
- Batch processes only
- Not stable, fast, or featureful

Production

- Expanding feature set
- Basic security, HA, stability
- Commercial distributions

Enterprise

- Security
- Performance
- Fast full-featured SQL


Evolution of Storage (Basics / 2006-2007)

• HDFS only
  • Support basic batch workloads. No HA.
  • Performance not important
  • MapReduce is too slow, anyway!
  • Batch only
• Early adopters (Facebook, Yahoo, etc.)


Evolution of Storage (Production / 2008-2011)

• HDFS evolves to add high availability and security
  • Focused on batch workloads
  • Inefficient file formats commonly used (text)
  • Query engines are slow! No need for better performance
• Apache HBase becomes an Apache Top-Level Project (TLP)
  • Introduces fast random access
  • Early adopters experiment with new use cases
  • Deployed at Facebook and other large companies


Evolution of Storage (Enterprise / 2012-2015)

• Reliable core brings new users
  • Enterprise features: access control, disaster recovery, encryption
• Introduction of fast query engines
  • 10-100x faster SQL-on-Hadoop (Impala, Spark, etc.)
  • Pushes HDFS performance improvements: caching, CPU efficiency, columnar file formats (Apache Parquet, ORCFile)
• HBase evolves to 1.0
  • Improved stability, scalability, security
  • Good random access, but not fast for SQL analytics
• Initial support for cloud storage
  • Rising adoption of AWS, Azure, Google Compute, etc.


So what's the next generation? 2016 and beyond


2016-2020 (Next-gen): storage hardware

• Spinning disk -> solid state storage
  • NAND flash: up to 450k read and 250k write IOPS, about 2 GB/sec read and 1.5 GB/sec write throughput, at a price of less than $3/GB and dropping fast
  • 3D XPoint memory (1000x faster than NAND, cheaper than RAM)
• RAM is cheaper and more abundant:
  • 64 -> 128 -> 256 GB over last few years
• HDFS and HBase were not designed for next-generation hardware.
  • Not using full speed of flash or RAM size
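
To put the flash figures above in perspective, here is a rough back-of-the-envelope sketch in Python. The throughput and price constants come from the slide; the 1 TB dataset size and the function names are assumptions chosen purely for illustration, and real devices will vary.

```python
# Figures quoted on the slide for NAND flash.
READ_GB_PER_SEC = 2.0
WRITE_GB_PER_SEC = 1.5
PRICE_USD_PER_GB = 3.0  # the slide says "less than $3/GB", so this is an upper bound


def seconds_to_read(gigabytes, gb_per_sec=READ_GB_PER_SEC):
    """Time to stream `gigabytes` sequentially at the given read throughput."""
    return gigabytes / gb_per_sec


def seconds_to_write(gigabytes, gb_per_sec=WRITE_GB_PER_SEC):
    """Time to write `gigabytes` sequentially at the given write throughput."""
    return gigabytes / gb_per_sec


def max_cost_usd(gigabytes, usd_per_gb=PRICE_USD_PER_GB):
    """Upper bound on the purchase price for `gigabytes` of flash."""
    return gigabytes * usd_per_gb


# Hypothetical dataset size (1 TB in decimal units); not from the slides.
DATASET_GB = 1000

read_seconds = seconds_to_read(DATASET_GB)
write_seconds = seconds_to_write(DATASET_GB)
cost = max_cost_usd(DATASET_GB)

print(f"read:  {read_seconds:.0f} s")
print(f"write: {write_seconds:.0f} s")
print(f"cost:  under ${cost:.0f}")
```

A single device at these rates could stream the hypothetical terabyte in minutes, which is why a storage layer that cannot keep up with the hardware becomes the bottleneck.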


2016-2020 (Next-gen): gaps in capabilities

HDFS good at:
• Batch ingest only (e.g. hourly)
• Efficiently scanning large amounts of data (analytics)

HBase good at:
• Efficiently finding and writing individual rows
• Making data mutable

Gaps exist when these properties are needed simultaneously
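
The gap described above can be illustrated with a toy in-memory sketch. This is not HDFS or HBase code: the two dictionaries, the sample data, and the helper names are all hypothetical, standing in for a batch-loaded columnar copy (good for scans) and a keyed row store (good for point updates).

```python
# Row-oriented store keyed by primary key: cheap point lookups and updates.
rows_by_key = {
    1: {"host": "a", "latency_ms": 10},
    2: {"host": "b", "latency_ms": 20},
    3: {"host": "a", "latency_ms": 30},
}

# Column-oriented copy produced by an earlier batch load: each column is
# stored contiguously, so scanning one column does not touch the others.
columns = {
    "key": [1, 2, 3],
    "host": ["a", "b", "a"],
    "latency_ms": [10, 20, 30],
}


def update_row(key, field, value):
    """Mutate a single row in the keyed store."""
    rows_by_key[key][field] = value


def scan_sum(column_name):
    """Aggregate one column of the batch-loaded columnar copy."""
    return sum(columns[column_name])


# A point update lands in the keyed store immediately...
update_row(2, "latency_ms", 25)

# ...but the analytic scan still sees the data from the last batch load.
total = scan_sum("latency_ms")
print(rows_by_key[2]["latency_ms"], total)
```

The scan result does not reflect the update until the next batch ingest, which is the kind of gap that appears when fast scans and mutable rows are needed at the same time.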


2016-2020 (Next-gen): Apache Kudu (incubating)

• High throughput for big scans
  Goal: within 2x of Parquet
• Low latency for short accesses
  Goal: 1 ms read/write on SSD
• Relational data model
  • SQL queries are easy
  • "NoSQL" style scan/insert/update (Java/C++ client)
• Expands Hadoop use cases
  • Real-time analytics and time series
  • Internet of Things


Kudu: Open source, scalable and fast tabular storage

• Scalable
  • Designed to scale to 1000s of nodes, tens of PBs
• Fast
  • Designed for modern hardware
  • Millions of read/write operations per second across cluster
  • Multiple GB/second read throughput per node
• Tabular
  • Store tables like a normal database (support SQL, Spark, etc.)
  • NoSQL-style access to 100+ billion row tables (Java/C++/Python APIs)


2016-2020 (Next-gen): Predictions

• Kudu will evolve an enterprise feature set and enable simple high-performance real-time architectures
  • Increasing ability to migrate traditional applications
• HDFS and HBase will continue to innovate and adapt to next-generation hardware
  • Steady improvements in performance, efficiency, and scalability (e.g. erasure coding)
• Cloud storage will become increasingly important
  • Hadoop ecosystem will evolve to coexist
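
As a rough illustration of why erasure coding counts as an efficiency improvement, the sketch below compares raw storage consumed under replication with a Reed-Solomon layout. The parameters are assumptions that do not appear in the slides: 3 replicas (the usual HDFS default) and 6 data units plus 3 parity units. The function names are hypothetical.

```python
def raw_bytes_replicated(logical_bytes, replicas=3):
    """Raw capacity consumed when every block is stored `replicas` times."""
    return logical_bytes * replicas


def raw_bytes_erasure_coded(logical_bytes, data_units=6, parity_units=3):
    """Raw capacity consumed when each stripe of `data_units` data cells
    is stored together with `parity_units` parity cells."""
    return logical_bytes * (data_units + parity_units) / data_units


logical_tb = 100  # hypothetical logical dataset size in TB
replicated_tb = raw_bytes_replicated(logical_tb)
erasure_coded_tb = raw_bytes_erasure_coded(logical_tb)

print(f"3x replication:  {replicated_tb} TB raw")
print(f"RS(6,3) striping: {erasure_coded_tb} TB raw")
```

Under these assumptions the erasure-coded layout needs half the raw capacity of triple replication while still tolerating the loss of any three units in a stripe, at the cost of extra CPU and network work on reconstruction.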


Thank you! @tlipcon @ApacheKudu

To learn more about Kudu, please attend my session at 13:45, Conference Room B (7F)
