SQL on Hadoop. Today's agenda: Introduction; Hive: the first SQL approach; Data ingestion and data formats; Impala: MPP SQL


  • Slide 1
  • SQL on Hadoop
  • Slide 2
  • Today's agenda
    • Introduction
    • Hive: the first SQL approach
    • Data ingestion and data formats
    • Impala: MPP SQL
  • Slide 3
  • Why SQL?
    • Data warehousing
    • Structured data: organization of the data, optimized data access
    • Declarative data processing
    • No need to have developer skills, but
    • Portable, universal language
    • We are lazy
    • SQL drivers supported
    • No need for a Hadoop client installation
    • Easier integration with current systems
  • Slide 4
  • Why not SQL?
    • It is not an RDBMS: joins of big tables should be avoided, no indexes by default, no primary keys or constraints
    • Not suited for OLTP: no locks, no transactions, write once, read many
    • Additional data structuring during data shipping (ETL) is needed
    • Not all problems can be solved with SQL
  • Slide 5
  • SQL on Hadoop
    • HDFS: Hadoop Distributed File System
    • HBase: NoSQL columnar store
    • YARN: cluster resource manager
    • MapReduce
    • Hive: SQL
    • Pig: scripting
    • Flume: log data collector
    • Sqoop: data exchange with RDBMS
    • Oozie: workflow manager
    • Mahout: machine learning
    • ZooKeeper: coordination
    • Impala: SQL
    • Spark: large-scale data processing
  • Slide 6
  • SQL on Hadoop [architecture diagram]. Labels: Client; Metadata (tables definition lookup); SQL master node (JDBC/ODBC server, SQL engine); cluster nodes, each with HDFS and an Executor; YARN. Arrows: SQL, Data.
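
The diagram on this slide can be read as a scatter-gather pattern: the master node looks up the table definition, hands the work to an executor on each cluster node, and merges the partial results. The following is a toy sketch of that idea in plain Python, not how Hive or Impala are actually implemented. All names here (`METADATA`, `NODE_BLOCKS`, `executor_count_by`, `master_count_by`) are hypothetical, and the in-memory lists only stand in for HDFS blocks. It computes roughly what `SELECT country, COUNT(*) FROM visits GROUP BY country` would.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical metadata store: table name -> column names
# (stands in for the "tables definition lookup" in the diagram).
METADATA = {"visits": ["user", "country"]}

# Hypothetical data blocks, one list per cluster node, standing in for HDFS blocks.
NODE_BLOCKS = [
    ["alice,CH", "bob,FR"],
    ["carol,CH"],
    ["dave,PL", "erin,CH"],
]


def executor_count_by(block, columns, group_col):
    """Executor on one node: parse the local rows and return a partial count."""
    idx = columns.index(group_col)
    return Counter(line.split(",")[idx] for line in block)


def master_count_by(table, group_col):
    """Master node: look up the table definition, fan out, merge partial counts."""
    columns = METADATA[table]
    total = Counter()
    with ThreadPoolExecutor() as pool:
        partials = pool.map(
            lambda block: executor_count_by(block, columns, group_col),
            NODE_BLOCKS,
        )
        for partial in partials:
            total.update(partial)
    return dict(total)


result = master_count_by("visits", "country")
print(result)
```

Each executor only touches its local block, so the data volume moved to the master is the size of the partial counts rather than the size of the table.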
  • Slide 7
  • There are other exotic animals
    • Purely on Hadoop
    • Stinger.next/Hive on Tez (improved MR execution, ACID, etc.)
    • Presto (graph-based processing, multiple data sources)
    • SparkSQL (Spark based)
    • See Greg Rahn's slides: https://speakerdeck.com/grahn/an-independent-comparison-of-open-source-sql-on-hadoop
  • Slide 8
  • Summary
    • SQL on Hadoop is not for OLTP, but for data warehousing workloads and ad-hoc queries
    • Enforces semi-structuring of the data
    • Does not enforce the use of a particular data format on HDFS
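
One way to read the last two summary points is as schema-on-read: the files stay in whatever format they were written in, and a table definition is applied only when the data is read. The sketch below illustrates this in plain Python under stated assumptions. `RAW`, `SCHEMA`, and `read_table` are hypothetical names, and the comma-delimited text is just one possible format, not something the slides prescribe.

```python
import csv
import io

# Hypothetical raw file content as it might sit on HDFS: plain delimited text.
RAW = "1,alice,3.5\n2,bob,4.0\n"

# Hypothetical table definition applied at read time: (column name, type).
SCHEMA = [("id", int), ("name", str), ("score", float)]


def read_table(raw_text, schema, delimiter=","):
    """Yield rows as dicts, casting each field according to the schema."""
    for fields in csv.reader(io.StringIO(raw_text), delimiter=delimiter):
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


rows = list(read_table(RAW, SCHEMA))
print(rows)
```

The same bytes could be read with a different schema or delimiter without rewriting the file, which is the sense in which the structure lives in the table definition rather than in the storage format.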