Big Data Infrastructures Technologies - boncz/bads/03-The Hadoop Data Infrastructures Technologies Hadoop Streaming ... Big Data Infrastructures Technologies ... –Cascading

  • View
    217

  • Download
    4

Embed Size (px)

Text of Big Data Infrastructures Technologies - boncz/bads/03-The Hadoop Data Infrastructures Technologies...

  • www.cwi.nl/~boncz/bads

    Big Data Infrastructures & Technologies Hadoop Streaming Revisit

  • www.cwi.nl/~boncz/bads

    ENRON Mapper

  • www.cwi.nl/~boncz/bads

    ENRON Mapper Output (Excerpt)

    acomnes@enron.com edward.snowden@cia.gov

    blake.walker@enron.com alex.berenson@nyt.com

  • www.cwi.nl/~boncz/bads

    ENRON Reducer

    Optimization?

  • www.cwi.nl/~boncz/bads

    ENRON Final Results (Excerpt)

    acomnes@enron.com 32

    anne.jolibois@enron.com 18

    billy.lemmons@enron.com 11

    brad.guilmino@enron.com 11

  • www.cwi.nl/~boncz/bads

    Big Data Infrastructures & Technologies Frameworks Beyond MapReduce

  • www.cwi.nl/~boncz/bads

    THE HADOOP ECOSYSTEM

  • www.cwi.nl/~boncz/bads

    YARN: Hadoop version 2.0

  • www.cwi.nl/~boncz/bads

    YARN: Hadoop version 2.0

    Hadoop limitations:

    Can only run MapReduce

    What if we want to run other distributed frameworks?

  • www.cwi.nl/~boncz/bads

    YARN: Hadoop version 2.0

    Hadoop limitations:

    Can only run MapReduce

    What if we want to run other distributed frameworks?

    YARN = Yet-Another-Resource-Negotiator

    Provides API to develop any generic distribution application

    Handles scheduling and resource request

    MapReduce (MR2) is one such application in YARN

  • www.cwi.nl/~boncz/bads

    YARN: architecture

  • www.cwi.nl/~boncz/bads

    The Hadoop Ecosystem

    YARN

  • www.cwi.nl/~boncz/bads

    The Hadoop Ecosystem

    YARN

  • www.cwi.nl/~boncz/bads

    The Hadoop Ecosystem

    YARN

  • www.cwi.nl/~boncz/bads

    The Hadoop Ecosystem

    YARN

    HC

    AT

    AL

    OG

  • www.cwi.nl/~boncz/bads

    The Hadoop Ecosystem

    YARN

    HC

    AT

    AL

    OG

    Impala

  • www.cwi.nl/~boncz/bads

    The Hadoop Ecosystem

    YARN

    HC

    AT

    AL

    OG

    Impala

  • www.cwi.nl/~boncz/bads

    The Hadoop Ecosystem

    YARN

    HC

    AT

    AL

    OG

    Impala

  • www.cwi.nl/~boncz/bads

    data querying

    The Hadoop Ecosystem

    YARN

    HC

    AT

    AL

    OG

    Impala

  • www.cwi.nl/~boncz/bads

    data querying

    The Hadoop Ecosystem

    YARN

    HC

    AT

    AL

    OG

    Impala

  • www.cwi.nl/~boncz/bads

    graph

    analysis data querying

    The Hadoop Ecosystem

    YARN

    HC

    AT

    AL

    OG

    Impala

  • www.cwi.nl/~boncz/bads

    graph

    analysis data querying

    The Hadoop Ecosystem

    YARN

    HC

    AT

    AL

    OG

    MLIB

    Impala

  • www.cwi.nl/~boncz/bads

    fast in-memory

    processing

    graph

    analysis data querying

    The Hadoop Ecosystem

    YARN

    HC

    AT

    AL

    OG

    MLIB

    Impala

  • www.cwi.nl/~boncz/bads

    fast in-memory

    processing

    graph

    analysis

    machine learning

    data querying

    The Hadoop Ecosystem

    YARN

    HC

    AT

    AL

    OG

    MLIB

    Impala

  • www.cwi.nl/~boncz/bads

    fast in-memory

    processing

    graph

    analysis

    machine learning

    data querying

    The Hadoop Ecosystem

    YARN

    HC

    AT

    AL

    OG

    MLIB

    Impala

  • www.cwi.nl/~boncz/bads

    fast in-memory

    processing

    graph

    analysis

    machine learning

    data querying

    The Hadoop Ecosystem

    YARN

    HC

    AT

    AL

    OG

    MLIB

    Impala

  • www.cwi.nl/~boncz/bads

    fast in-memory

    processing

    graph

    analysis

    machine learning

    data querying

    The Hadoop Ecosystem

    YARN

    HC

    AT

    AL

    OG

    MLIB

    Impala

    SparkSQL

    graphX

  • www.cwi.nl/~boncz/bads

    The Hadoop Ecosystem

  • www.cwi.nl/~boncz/bads

    The Hadoop Ecosystem Basic services

    HDFS = Open-source GFS clone originally funded by Yahoo

    MapReduce = Open-source MapReduce implementation (Java,Python)

    YARN = Resource manager to share clusters between MapReduce and other tools

    HCATALOG = Meta-data repository for registering datasets available on HDFS (Hive Catalog)

    Cascading = Dataflow tool for creating multi-MapReduce job dataflows (Driven = GUI for it)

    Spark = new in-memory MapReduce++ based on Scala (avoids HDFS writes)

  • www.cwi.nl/~boncz/bads

    The Hadoop Ecosystem Basic services

    HDFS = Open-source GFS clone originally funded by Yahoo

    MapReduce = Open-source MapReduce implementation (Java,Python)

    YARN = Resource manager to share clusters between MapReduce and other tools

    HCATALOG = Meta-data repository for registering datasets available on HDFS (Hive Catalog)

    Cascading = Dataflow tool for creating multi-MapReduce job dataflows (Driven = GUI for it)

    Spark = new in-memory MapReduce++ based on Scala (avoids HDFS writes)

    Data Querying

    Pig = Relational Algebra system that compiles to MapReduce

    Hive = SQL system that compiles to MapReduce (Hortonworks)

    Impala, or, Drill = efficient SQL systems that do *not* use MapReduce (Cloudera,MapR)

    SparkSQL = SQL system running on top of Spark

  • www.cwi.nl/~boncz/bads

    The Hadoop Ecosystem Basic services

    HDFS = Open-source GFS clone originally funded by Yahoo

    MapReduce = Open-source MapReduce implementation (Java,Python)

    YARN = Resource manager to share clusters between MapReduce and other tools

    HCATALOG = Meta-data repository for registering datasets available on HDFS (Hive Catalog)

    Cascading = Dataflow tool for creating multi-MapReduce job dataflows (Driven = GUI for it)

    Spark = new in-memory MapReduce++ based on Scala (avoids HDFS writes)

    Data Querying

    Pig = Relational Algebra system that compiles to MapReduce

    Hive = SQL system that compiles to MapReduce (Hortonworks)

    Impala, or, Drill = efficient SQL systems that do *not* use MapReduce (Cloudera,MapR)

    SparkSQL = SQL system running on top of Spark

    Graph Processing

    Giraph = Pregel clone on Hadoop (Facebook)

    GraphX = graph analysis library of Spark

  • www.cwi.nl/~boncz/bads

    The Hadoop Ecosystem Basic services

    HDFS = Open-source GFS clone originally funded by Yahoo

    MapReduce = Open-source MapReduce implementation (Java,Python)

    YARN = Resource manager to share clusters between MapReduce and other tools

    HCATALOG = Meta-data repository for registering datasets available on HDFS (Hive Catalog)

    Cascading = Dataflow tool for creating multi-MapReduce job dataflows (Driven = GUI for it)

    Spark = new in-memory MapReduce++ based on Scala (avoids HDFS writes)

    Data Querying

    Pig = Relational Algebra system that compiles to MapReduce

    Hive = SQL system that compiles to MapReduce (Hortonworks)

    Impala, or, Drill = efficient SQL systems that do *not* use MapReduce (Cloudera,MapR)

    SparkSQL = SQL system running on top of Spark

    Graph Processing

    Giraph = Pregel clone on Hadoop (Facebook)

    GraphX = graph analysis library of Spark

    Machine Learning

    Okapi = Giraph based library of machine learning algorithms (graph-oriented)

    Mahout = MapReduce-based library of machine learning algorithms

    MLib = Spark based library of machine learning algorithms

  • www.cwi.nl/~boncz/bads

    HIGH-LEVEL WORKFLOWS HIVE & PIG

  • www.cwi.nl/~boncz/bads

    Need for high-level languages

    Hadoop is great for large-data processing!

    But writing Java/Python/ programs for everything is verbose and slow

    Cumbersome to work with multi-step processes

    Data scientist