
Big Data Analytics Course Guide TOC


2016

Big Data Technologies

Hadoop and Analytics

Course Guide


Venue: Indian Institute of Corporate Affairs (IICA) (Under Ministry of Corporate Affairs) Plot No. 6,7,8 Sector 5 IMT Manesar, Gurgaon Haryana


Big Data Technologies • HADOOP • Analytics IICA

Centre for e-Governance • Indian Institute of Corporate Affairs


Big Data Technologies

Hadoop and Analytics

Hands-on with Big Data Technologies and Analytics

Centre for e-Governance

Indian Institute of Corporate Affairs

(Under Ministry of Corporate Affairs)

Plot No. 6,7,8 Sector 5

IMT Manesar, Gurgaon

Haryana

Website: http://www.iica.in

Updated Dec 2016


Table of Contents

Module 1 - Introduction to Linux

- Linux as a prerequisite for Big Data and Hadoop
- Overview of Linux Operating System
- Understanding the Linux command line
- Linux Commands and Shell Scripts
- Working with Linux GUI
- Exercises

Module 2 - Understanding Big Data

- Introduction to Big Data Technologies
- The 3 Vs of Big Data (Volume, Variety and Velocity)
- Structured and Unstructured Data
- Centralized vs. Distributed computing
- Applications and use cases of Big Data
- Opportunities and challenges of Big Data

Module 3 - Getting started with Hadoop

- What is Hadoop, and why is it popular
- Overview of Apache Bigtop and Hadoop installation
- Hadoop configuration files
- Overview of Hadoop Vendor Distributions
- Distributed File Systems (DFS)
- Various types of DFS
- Getting familiar with Hadoop Virtual Machine Environment
- Hadoop Ecosystem Tools and Components
- Hadoop Command line (CLI) and Graphical interface (GUI)
- Exercises

Module 4 - Understanding the Hadoop Architecture

- Name Node and Data Nodes
- Difference between Hadoop 1.x and 2.x
- Hadoop Distributed File System (HDFS)
- HDFS Overview and Architecture
- HDFS Data Flows (Read and Write)
- HDFS Interfaces: Command Line Interface, File System, Administrative and Web Interface
- Copying data into HDFS, and working with data in HDFS
- Advanced HDFS features, like Data replication, Rack awareness, Fuse-DFS
- Overview of HDFS Federation, High Availability, DistCp and Hadoop Archives
- Exercises


Module 5 - YARN and MapReduce

- Functional Programming paradigms
- What is MapReduce
- Shuffling and Sorting
- YARN Resource Manager UI
- Standalone, Pseudo distributed, and Fully distributed mode
- MapReduce v1 compared to YARN and MapReduce v2
- Examples of MapReduce programs
- Exercises

Module 6 - Data Ingestion in HDFS

- Importing data to HDFS
- Introduction to SQOOP
- SQOOP configuration
- Ingesting data in HDFS using SQOOP
- Exporting data to RDBMS
- Introduction to Flume
- Flume configuration
- Capturing data in real-time using Flume
- Exercises

Module 7 - Working with Hive

- Introduction to Hive and its Architecture
- Different Modes of executing Hive queries
- HiveQL (DDL & DML Operations)
- External vs. Managed Tables
- Hive vs. Impala
- User-Defined Functions (UDFs)
- Exercises

Module 8 - Working with Pig

- Different Modes of executing Pig
- Pig Data Types
- Pig Latin language Constructs (LOAD, STORE, DUMP, SPLIT etc.)
- User-Defined Functions (UDFs)
- Developing and deploying Pig programs
- Exercises

Module 9 - Getting familiar with Apache Hadoop Ecosystem Tools

- Introduction to Oozie workflows, designs and deployments
- Apache Mahout, and Building a Recommender using Mahout
- Introduction to Avro, Kafka, Storm, and ZooKeeper
- Exercises


Module 10 - Introduction to NoSQL Databases

- Review of RDBMS
- Need for NoSQL
- Brewer's CAP Theorem
- ACID vs. BASE
- Schema on Read vs. Schema on Write
- Different levels of consistency
- Different types of NoSQL databases
- Exercises

Module 11 - Working with NoSQL Databases

- Document stores: Couchbase, MongoDB
- Graph databases: Neo4j
- Key-value stores: Riak
- Column Family: Cassandra, HBase
- Overview of Hybrid NoSQL Databases
- Exercises

Module 12 - Working with Apache Spark

- Understanding Spark Architecture
- Comparing Hadoop and Spark
- Introduction to RDD
- Spark SQL
- Sample programs in Spark
- Exercises

Module 13 - Introduction to Data Analytics

- Difference between Data Analysis and Analytics
- Types of Analytics
- Big Data Analytics
- Business Analytics
- Predictive Analytics
- Real-Time Analytics
- Web Analytics
- Customized Analytics Solutions
- Exercises

Module 14 - Big Data Proof of Concepts and Use Cases

- Text Mining
- Traditional case of Watson
- Sentiment Analysis
- Weather Data Analysis
- Trending Topics and Conclusion
- Exercises