SQL on Hadoop
Today's agenda
  Introduction
  Hive: the first SQL approach
  Data ingestion and data formats
  Impala: MPP SQL
Why SQL?
  Data warehousing
  Structured data
    organization of the data
    optimized data access
  Declarative data processing
  No need to have developer skills, but...
  Portable, universal language
  We are lazy
  SQL drivers supported
  No need of Hadoop client installation
  Easier integration with the current systems
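To make the "declarative data processing" point concrete, here is a minimal sketch that computes the same aggregate twice: once with a hand-written loop and once with a SQL query. Python's built-in sqlite3 module is used purely as a local stand-in for a SQL-on-Hadoop engine such as Hive or Impala. The table page_views, its columns, and the sample rows are made up for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (user_name, country, bytes_sent)
ROWS = [
    ("alice", "CH", 120),
    ("bob", "FR", 300),
    ("carol", "CH", 80),
    ("dave", "FR", 50),
    ("erin", "PL", 10),
]


def total_bytes_imperative(rows):
    """Procedural version: the developer spells out HOW to group and sum."""
    totals = defaultdict(int)
    for _user_name, country, bytes_sent in rows:
        totals[country] += bytes_sent
    return sorted(totals.items())


def total_bytes_sql(rows):
    """Declarative version: state WHAT is wanted, the engine decides how."""
    # In-memory SQLite is only a stand-in for a real Hive/Impala connection.
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE page_views "
            "(user_name TEXT, country TEXT, bytes_sent INTEGER)"
        )
        conn.executemany("INSERT INTO page_views VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT country, SUM(bytes_sent) FROM page_views "
            "GROUP BY country ORDER BY country"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(total_bytes_imperative(ROWS))
    print(total_bytes_sql(ROWS))
```

Both functions return the same per-country totals. The SQL version stays the same if the engine underneath changes, which is the portability argument made on this slide.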
Why not SQL?
  It is not an RDBMS!
    big table joins should be avoided
    no indexes by default
    no primary keys and constraints
  Not suited for OLTP
    no locks
    no transactions
    write once, read many
  Additional data structuring during data shipping (ETL) needed
  Not all problems can be solved with SQL
SQL on Hadoop
  HDFS: Hadoop Distributed File System
  HBase: NoSQL columnar store
  YARN: cluster resource manager
  MapReduce
  Hive: SQL
  Pig: scripting
  Flume: log data collector
  Sqoop: data exchange with RDBMS
  Oozie: workflow manager
  Mahout: machine learning
  Zookeeper: coordination
  Impala: SQL
  Spark: large-scale data processing
There are other exotic animals
  Purely on Hadoop
    Stinger.next / Hive on Tez (improved MR executions, ACID, etc.)
    Presto (graph-based processing, multiple data sources)
    SparkSQL (Spark based)
  See Greg Rahn's slides: https://speakerdeck.com/grahn/an-independent-comparison-of-open-source-sql-on-hadoop
Summary
  SQL on Hadoop is not for OLTP!
    but for data warehousing workloads
    ad-hoc queries
  Enforces semi-structuring of the data
  Does not enforce using a particular data format on HDFS
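The last two points can be illustrated with a small sketch: one logical table definition is applied to the same records stored in two different file formats, and the query logic does not care which one it reads. This is plain Python and not how Hive or Impala are implemented. The SCHEMA dictionary, the sample records, and the function names are all hypothetical.

```python
import csv
import io
import json

# Hypothetical logical schema: column name -> Python type used for casting.
SCHEMA = {"id": int, "city": str, "temp": float}

# The same two records, stored as comma-separated text and as JSON lines.
CSV_DATA = "1,Geneva,21.5\n2,Zurich,19.0\n"
JSON_DATA = (
    '{"id": 1, "city": "Geneva", "temp": 21.5}\n'
    '{"id": 2, "city": "Zurich", "temp": 19.0}\n'
)


def read_csv(text, schema):
    """Apply the schema to positional, comma-separated fields."""
    names = list(schema)
    for fields in csv.reader(io.StringIO(text)):
        yield {name: schema[name](value) for name, value in zip(names, fields)}


def read_json_lines(text, schema):
    """Apply the same schema to one JSON object per line."""
    for line in text.splitlines():
        record = json.loads(line)
        yield {name: cast(record[name]) for name, cast in schema.items()}


def average_temp(rows):
    """Query logic that only sees structured rows, never the file format."""
    temps = [row["temp"] for row in rows]
    return sum(temps) / len(temps)


if __name__ == "__main__":
    print(average_temp(read_csv(CSV_DATA, SCHEMA)))
    print(average_temp(read_json_lines(JSON_DATA, SCHEMA)))
```

The structure (columns and types) is imposed on the data, while the storage format stays a free choice behind the reader function.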