hadoop-conference-japan
1 © Cloudera, Inc. All rights reserved.
The Evolution and Future of Hadoop Storage
Todd Lipcon | Engineer at Cloudera
Twitter @tlipcon | [email protected]
Introduction (the evolution and future of me)
Mailing list messages sent by Todd Lipcon
- Early user of Hadoop
- Joined Cloudera as Software Engineer
- Work on HDFS, HBase, MR (HA, performance, stability, etc.)
- Became a committer, PMC member, and ASF Member
- Founded the Kudu project within Cloudera
- Secretly developing with a small team for 3 years
- Kudu announced and contributed to the ASF as Apache Kudu (incubating)
Spoke at HCJ 2011!
Happy birthday! Hadoop: the last 10 years
Evolution of the Hadoop Platform
The stack is continually evolving and growing!
Figure: the Hadoop stack by year (each year keeps the earlier components and adds those listed)
2006: Core Hadoop (HDFS, MapReduce)
2007: adds Solr, Pig
2008: adds HBase, ZooKeeper
2009: adds Hive, Mahout
2010: adds Sqoop, Whirr, Avro
2011: adds Flume, Bigtop, Oozie, MRUnit, HCatalog, Hue, YARN
2012: adds Spark, Tez, Impala, Kafka, Drill
2013: adds Parquet, Sentry
2014-15: adds Ibis, Flink
Basics
- Very basic Hadoop
- Batch processes only
- Not stable, fast, or featureful
Production
- Expanding feature set
- Basic security, HA, stability
- Commercial distributions
Enterprise
- Security
- Performance
- Fast full-featured SQL
Evolution of Storage (Basics / 2006-2007)
• HDFS only
  - Support basic batch workloads. No HA.
  - Performance not important
  - MapReduce is too slow, anyway!
  - Batch only
• Early adopters (Facebook, Yahoo, etc.)
Evolution of Storage (Production / 2008-2011)
• HDFS evolves to add high availability and security
  - Focused on batch workloads
  - Inefficient file formats commonly used (text)
  - Query engines are slow! No need for better performance
• Apache HBase becomes an Apache Top-Level Project (TLP)
  - Introduces fast random access
  - Early adopters experiment with new use cases
  - Deployed at Facebook and other large companies
Evolution of Storage (Enterprise / 2012-2015)
• Reliable core brings new users
  - Enterprise features: access control, disaster recovery, encryption
• Introduction of fast query engines
  - 10-100x faster SQL-on-Hadoop (Impala, Spark, etc.)
  - Pushes HDFS performance improvements: caching, CPU efficiency, columnar file formats (Apache Parquet, ORCFile)
• HBase evolves to 1.0
  - Improved stability, scalability, security
  - Good random access, but not fast for SQL analytics.
• Initial support for cloud storage
  - Rising adoption of AWS, Azure, Google Compute, etc.
So what's the next generation? 2016 and beyond
2016-2020 (Next-gen): storage hardware
• Spinning disk -> solid state storage
  - NAND flash: up to 450k read and 250k write IOPS, about 2 GB/sec read and 1.5 GB/sec write throughput, at a price of less than $3/GB and dropping fast
  - 3D XPoint memory (1000x faster than NAND, cheaper than RAM)
• RAM is cheaper and more abundant:
  - 64 -> 128 -> 256 GB over the last few years
• HDFS and HBase were not designed for next-generation hardware.
  - Not using full speed of flash or RAM size
2016-2020 (Next-gen): gaps in capabilities
HDFS good at:
• Batch ingest only (e.g. hourly)
• Efficiently scanning large amounts of data (analytics)
HBase good at:
• Efficiently finding and writing individual rows
• Making data mutable
Gaps exist when these properties are needed simultaneously
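The gap described above can be illustrated with a toy contrast between two storage layouts. This is only a sketch under stated assumptions: the class names `AppendOnlyColumns` and `KeyedRows` are hypothetical, and neither models the actual internals of HDFS, HBase, or Kudu. It shows only why a layout tuned for batch ingest and scans tends to be awkward for single-row access and mutation, and vice versa.

```python
from typing import Dict, Iterable, List, Tuple

Row = Tuple[int, float]  # (key, value); a made-up two-column schema


class AppendOnlyColumns:
    """Toy scan-friendly layout: batch appends, one list per column."""

    def __init__(self) -> None:
        self.keys: List[int] = []
        self.values: List[float] = []

    def ingest_batch(self, rows: Iterable[Row]) -> None:
        # Data arrives in batches and is only ever appended.
        for key, value in rows:
            self.keys.append(key)
            self.values.append(value)

    def scan_sum(self) -> float:
        # Analytics touch a single column in one sequential pass.
        return sum(self.values)

    def lookup(self, key: int) -> float:
        # No index: finding one row means scanning the key column.
        return self.values[self.keys.index(key)]


class KeyedRows:
    """Toy lookup-friendly layout: rows addressed by key, mutable in place."""

    def __init__(self) -> None:
        self.rows: Dict[int, float] = {}

    def upsert(self, key: int, value: float) -> None:
        # Insert or overwrite a single row.
        self.rows[key] = value

    def lookup(self, key: int) -> float:
        # Direct access by key.
        return self.rows[key]

    def scan_sum(self) -> float:
        # Works, but values are not laid out contiguously per column.
        return sum(self.rows.values())


if __name__ == "__main__":
    cols = AppendOnlyColumns()
    cols.ingest_batch([(1, 10.0), (2, 20.0), (3, 30.0)])
    print(cols.scan_sum())

    keyed = KeyedRows()
    keyed.upsert(1, 10.0)
    keyed.upsert(1, 15.0)  # mutate an existing row
    print(keyed.lookup(1))
```

A workload that needs both fast scans and fast keyed updates on the same data fits neither toy class well, which is the gap the slide points at.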
2016-2020 (Next-gen): Apache Kudu (incubating)
• High throughput for big scans
  Goal: within 2x of Parquet
• Low latency for short accesses
  Goal: 1 ms read/write on SSD
• Relational data model
  - SQL queries are easy
  - "NoSQL" style scan/insert/update (Java/C++ client)
• Expands Hadoop use cases
  - Real-time analytics and time series
  - Internet-of-things
Kudu: Open source, scalable and fast tabular storage
• Scalable
  - Designed to scale to 1000s of nodes, tens of PBs
• Fast
  - Designed for modern hardware
  - Millions of read/write operations per second across cluster
  - Multiple GB/second read throughput per node
• Tabular
  - Store tables like a normal database (support SQL, Spark, etc.)
  - NoSQL-style access to 100+ billion row tables (Java/C++/Python APIs)
2016-2020 (Next-gen): Predictions
• Kudu will evolve an enterprise feature set and enable simple high-performance real-time architectures
  - Increasing ability to migrate traditional applications
• HDFS and HBase will continue to innovate and adapt to next-generation hardware
  - Steady improvements in performance, efficiency, and scalability (e.g. erasure coding)
• Cloud storage will become increasingly important
  - Hadoop ecosystem will evolve to coexist
Thank you! @tlipcon @ApacheKudu
To learn more about Kudu, please attend my session at 13:45, Conference Room B (7F)