2016
Big Data Technologies
Hadoop and Analytics
Course Guide
Venue: Indian Institute of Corporate Affairs (IICA) (Under Ministry of Corporate Affairs) Plot No. 6,7,8 Sector 5 IMT Manesar, Gurgaon Haryana
Big Data Technologies • HADOOP • Analytics IICA
Centre for e-Governance • Indian Institute of Corporate Affairs
Big Data Technologies: Hadoop and Analytics
Hands-on with Big Data Technologies and Analytics
Centre for e-Governance
Indian Institute of Corporate Affairs
(Under Ministry of Corporate Affairs)
Plot No. 6,7,8 Sector 5
IMT Manesar, Gurgaon
Haryana
Website: http://www.iica.in
Updated: Dec 2016
Table of Contents
Module 1 - Introduction to Linux
- Linux as a prerequisite for Big Data and Hadoop
- Overview of Linux Operating System
- Understanding the Linux command line
- Linux Commands and Shell Scripts
- Working with Linux GUI
- Exercises
Module 2 - Understanding Big Data
- Introduction to Big Data Technologies
- The 3 Vs of Big Data (Volume, Variety and Velocity)
- Structured and Unstructured Data
- Centralized vs. Distributed computing
- Applications and use cases of Big Data
- Opportunities and challenges of Big Data
Module 3 - Getting started with Hadoop
- What is Hadoop, and why is it popular
- Overview of Apache BigTop and Hadoop installation
- Hadoop configuration files
- Overview of Hadoop Vendor Distributions
- Distributed File Systems (DFS)
- Various types of DFS
- Getting familiar with Hadoop Virtual Machine Environment
- Hadoop Ecosystem Tools and Components
- Hadoop Command line (CLI) and Graphical interface (GUI)
- Exercises
Module 4 - Understanding the Hadoop Architecture
- Name Node and Data Nodes
- Difference between Hadoop 1.x and 2.x
- Hadoop Distributed File System (HDFS)
- HDFS Overview and Architecture
- HDFS Data Flows (Read and Write)
- HDFS Interfaces: Command Line Interface, File System, Administrative and Web Interface
- Copying data into HDFS, and working with data in HDFS
- Advanced HDFS features, like Data replication, Rack awareness, Fuse-DFS
- Overview of HDFS Federation, High Availability, Distcp and Hadoop Archives
- Exercises
Module 5 - YARN and MapReduce
- Functional Programming paradigms
- What is MapReduce
- Shuffling and Sorting
- YARN Resource Manager UI
- Standalone, Pseudo-distributed, and Fully distributed modes
- MapReduce v1 compared to YARN and MapReduce v2
- Examples of MapReduce programs
- Exercises
Module 6 - Data Ingestion in HDFS
- Importing data to HDFS
- Introduction to SQOOP
- SQOOP configuration
- Ingesting data in HDFS using SQOOP
- Exporting data to RDBMS
- Introduction to Flume
- Flume configuration
- Capturing data in real-time using Flume
- Exercises
Module 7 - Working with Hive
- Introduction to Hive and its Architecture
- Different Modes of executing Hive queries
- HiveQL (DDL & DML Operations)
- External vs. Managed Tables
- Hive vs. Impala
- User-Defined Functions (UDFs)
- Exercises
Module 8 - Working with Pig
- Different Modes of executing Pig
- Pig Data Types
- Pig Latin language Constructs (LOAD, STORE, DUMP, SPLIT etc.)
- User-Defined Functions (UDFs)
- Developing and deploying Pig programs
- Exercises
Module 9 - Getting familiar with Apache Hadoop Ecosystem Tools
- Introduction to Oozie workflows, designs and deployments
- Apache Mahout, and Building a Recommender using Mahout
- Introduction to Avro, Kafka, Storm, and Zookeeper
- Exercises
Module 10 - Introduction to NoSQL Databases
- Review of RDBMS
- Need for NoSQL
- Brewer's CAP Theorem
- ACID vs. BASE
- Schema on Read vs. Schema on Write
- Different levels of consistency
- Different types of NoSQL databases
- Exercises
Module 11 - Working with NoSQL Databases
- Document stores: Couchbase, MongoDB
- Graph databases: Neo4j
- Key-value stores: Riak
- Column Family: Cassandra, HBase
- Overview of Hybrid NoSQL Databases
- Exercises
Module 12 - Working with Apache Spark
- Understanding Spark Architecture
- Comparing Hadoop and Spark
- Introduction to RDD
- Spark SQL
- Sample programs in Spark
- Exercises
Module 13 - Introduction to Data Analytics
- Difference between Data Analysis and Analytics
- Types of Analytics
- Big Data Analytics
- Business Analytics
- Predictive Analytics
- Real-Time Analytics
- Web Analytics
- Customized Analytics Solutions
- Exercises
Module 14 - Big Data Proof of Concepts and Use Cases
- Text Mining
- Traditional case of Watson
- Sentiment Analysis
- Weather Data Analysis
- Trending Topics and Conclusion
- Exercises