
COURSE CURRICULUM

BIG DATA HADOOP

FULL

Prerequisites for the Big Data Hadoop Training Course: There are no prerequisites. Knowledge of Java/Python, SQL and Linux is beneficial, but not mandatory. Ducat provides a crash course covering the prerequisites required to begin Big Data training.

Apache Hadoop on AWS Cloud
This module will help you understand how to configure a Hadoop Cluster on AWS Cloud:

- Introduction to Amazon Elastic MapReduce
- AWS EMR Cluster
- AWS EC2 Instance: Multi Node Cluster Configuration
- AWS EMR Architecture
- Web Interfaces on Amazon EMR
- Amazon S3
- Executing MapReduce Job on EC2 & EMR
- Apache Spark on AWS, EC2 & EMR
- Submitting Spark Job on AWS
- Hive on EMR
- Available Storage Types: S3, RDS & DynamoDB
- Apache Pig on AWS EMR
- Processing NY Taxi Data using Spark on Amazon EMR

Learning Big Data and Hadoop
This module will help you understand Big Data:

- Common Hadoop ecosystem components
- Hadoop Architecture
- HDFS Architecture
- Anatomy of File Write and Read
- How the MapReduce Framework works
- Hadoop high-level Architecture
- MR2 Architecture
- Hadoop YARN
- Hadoop 2.x core components
- Hadoop Distributions
- Hadoop Cluster Formation

Hadoop Architecture and HDFS
This module will help you to understand Hadoop & HDFS Cluster Architecture:

- Configuration files in Hadoop Cluster (FSimage & edit log file)
- Setting up of Single & Multi Node Hadoop Cluster
- HDFS File Permissions
- HDFS Installation & Shell Commands
- Daemons of HDFS
- Node Manager
- Resource Manager
- NameNode
- DataNode
- Secondary NameNode
- YARN Daemons
- HDFS Read & Write Commands
- NameNode & DataNode Architecture
- HDFS Operations
- Hadoop MapReduce Job
- Executing MapReduce Job

Hadoop MapReduce Framework
This module will help you to understand the Hadoop MapReduce framework:

- How MapReduce works on HDFS data sets
- MapReduce Algorithm
- MapReduce Hadoop Implementation
- Hadoop 2.x MapReduce Architecture
- MapReduce Components
- YARN Workflow
- MapReduce Combiners
- MapReduce Partitioners
- MapReduce Hadoop Administration
- MapReduce APIs
- Input Split & String Tokenizer in MapReduce
- MapReduce Use Cases on Data Sets
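The map, shuffle/sort and reduce phases covered in this module can be sketched in plain Python. This is an illustrative simulation of the classic word-count flow, not Hadoop API code; the function names are our own:

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle/sort: group all values by key, as the framework does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)

def word_count(lines):
    pairs = [kv for line in lines for kv in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

counts = word_count(["big data hadoop", "hadoop mapreduce", "big hadoop"])
print(counts["hadoop"])  # "hadoop" appears three times across the lines
```

Each function mirrors one MapReduce stage; in a real job the shuffle is performed by the framework, not by user code.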

Advanced MapReduce Concepts
This module will help you to learn:

- Job Submission & Monitoring
- Counters
- Distributed Cache
- Map & Reduce Join
- Data Compressors
- Job Configuration
- Record Reader
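Two of the ideas above, partitioners and combiners, can be sketched as follows. This is a plain-Python illustration under the assumption of a word-count-style job; Hadoop's actual default partitioner computes key.hashCode() modulo the reducer count in Java:

```python
from collections import Counter

def partition(key, num_reducers=4):
    """Hash partitioner sketch: the same key always maps to the same
    reducer, so every reducer receives complete groups."""
    return hash(key) % num_reducers

def combine(pairs):
    """Combiner sketch: pre-aggregate map output locally so less data
    is shuffled across the network to the reducers."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

map_output = [("spark", 1), ("hadoop", 1), ("spark", 1), ("hive", 1)]
print(combine(map_output))  # the two "spark" pairs collapse into one
```

A combiner is an optimization only: it must be safe to run zero or more times, which is why commutative, associative operations like summing are the typical use.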

Pig
This module will help you to understand Pig concepts:

- Pig Architecture
- Pig Installation
- Pig Grunt Shell
- Pig Running Modes
- Pig Latin Basics
- Pig LOAD & STORE Operators
- Diagnostic Operators
- DESCRIBE Operator
- EXPLAIN Operator
- ILLUSTRATE Operator
- DUMP Operator
- Grouping & Joining
- GROUP Operator
- COGROUP Operator
- JOIN Operator
- CROSS Operator
- Combining & Splitting
- UNION Operator
- SPLIT Operator
- Filtering
- FILTER Operator
- DISTINCT Operator
- FOREACH Operator


- Sorting: ORDER BY Operator
- LIMIT Operator
- Built-in Functions
- EVAL Functions
- LOAD & STORE Functions
- Bag & Tuple Functions
- String Functions
- Date-Time Functions
- MATH Functions
- Pig UDFs (User Defined Functions)
- Pig Scripts in Local Mode
- Pig Scripts in MapReduce Mode
- Analysing XML Data using Pig
- Pig Use Cases (Data Analysis on Social Media Sites, Banking, Stock Market & Others)
- Analysing JSON Data using Pig
- Testing Pig Scripts

Hive
This module will build your concepts in learning:

- Hive Installation
- Hive Data Types
- Hive Architecture & Components
- Hive Metastore
- Hive Tables (Managed Tables and External Tables)
- Hive Partitioning & Bucketing
- Hive Joins & Sub Query
- Running Hive Scripts
- Hive Indexing & View
- Hive Queries (HQL): Order By, Group By, Distribute By, Cluster By, Examples
- Hive Functions: Built-in & UDF (User Defined Functions)
- Hive ETL: Loading JSON, XML, Text Data Examples
- Hive Querying Data
- Hive Use Cases
- Hive Optimization Techniques
- Partitioning (Static & Dynamic Partition) & Bucketing
- Hive Joins: Map + Bucket Map + SMB (Sorted Bucket Map) + Skew
- Hive File Formats (ORC, SEQUENCE, TEXT, AVRO, PARQUET)
- CBO (Cost-Based Optimizer)
- Vectorization
- Indexing (Compact + BitMap)
- Integration with TEZ & Spark
- Hive SerDe (Custom + In-Built)
- Hive Integration with NoSQL (HBase + MongoDB + Cassandra)
- Thrift API (Thrift Server)
- UDF, UDTF & UDAF
- Hive Multiple Delimiters
- Loading XML & JSON Data into Hive
- Aggregation & Windowing Functions in Hive
- Hive Connect with Tableau

Sqoop
This module will help you to learn Sqoop concepts:

- Sqoop Installation
- Loading Data from RDBMS using Sqoop
- Sqoop Import & Import-All-Tables
- Fundamentals & Architecture of Apache Sqoop
- Sqoop Job
- Sqoop Codegen
- Sqoop Incremental Import & Incremental Export


- Sqoop Merge
- Import Data from MySQL to Hive using Sqoop
- Sqoop: Hive Import
- Sqoop Metastore
- Sqoop Use Cases
- Sqoop-HCatalog Integration
- Sqoop Script
- Sqoop Connectors

Flume
This module will help you to learn Flume concepts:

- Flume Introduction
- Flume Architecture
- Flume Data Flow
- Flume Configuration
- Flume Agent Component Types
- Flume Setup
- Flume Interceptors
- Multiplexing (Fan-Out), Fan-In Flow
- Flume Channel Selectors
- Flume Sink Processors
- Fetching Streaming Data using Flume (Social Media Sites: YouTube, LinkedIn, Twitter)
- Flume + Kafka Integration
- Flume Use Cases

Kafka
This module will help you to learn Kafka concepts:

- Kafka Fundamentals
- Kafka Cluster Architecture
- Kafka Workflow
- Kafka Producer & Consumer Architecture
- Integration with Spark
- Kafka Topic Architecture
- ZooKeeper & Kafka
- Kafka Partitions
- Kafka Consumer Groups
- KSQL (SQL Engine for Kafka)
- Kafka Connectors
- Kafka REST Proxy
- Kafka Offsets
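The interplay of partitions and consumer groups listed above can be illustrated with a small sketch. Within a group, each partition is consumed by exactly one member; the following is a simplified plain-Python model of Kafka's range-style assignment (contiguous partition ranges per consumer), not the broker's actual implementation:

```python
def range_assign(partitions, consumers):
    """Assign each partition of one topic to exactly one consumer in the
    group, giving every consumer a contiguous range of partitions."""
    consumers = sorted(consumers)
    base, extra = divmod(len(partitions), len(consumers))
    assignment, start = {}, 0
    for i, consumer in enumerate(consumers):
        # the first `extra` consumers take one additional partition
        size = base + (1 if i < extra else 0)
        assignment[consumer] = partitions[start:start + size]
        start += size
    return assignment

print(range_assign([0, 1, 2, 3, 4], ["c1", "c2"]))
```

If more consumers join than there are partitions, the surplus consumers simply receive empty assignments and sit idle, which is why partition count caps a group's parallelism.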

Oozie
This module will help you to understand Oozie concepts:

- Oozie Introduction
- Oozie Workflow Specification
- Oozie Coordinator Functional Specification
- Oozie HCatalog Integration
- Oozie Bundle Jobs
- Oozie CLI Extensions
- Automate MapReduce, Pig, Hive & Sqoop Jobs using Oozie
- Packaging & Deploying an Oozie Workflow Application

HBase
This module will help you to learn HBase architecture:

- HBase Architecture, Data Flow & Use Cases
- Apache HBase Configuration
- HBase Shell & General Commands
- HBase Schema Design
- HBase Data Model
- HBase Region & Master Server
- HBase & MapReduce


- Bulk Loading in HBase
- Create, Insert, Read Tables in HBase
- HBase Admin APIs
- HBase Security
- HBase vs Hive
- Backup & Restore in HBase
- Apache HBase External APIs (REST, Thrift, Scala)
- HBase & Spark
- Apache HBase Coprocessors
- HBase Case Studies
- HBase Troubleshooting

Data Processing with Apache Spark
This module covers how Spark performs in-memory data processing and why a Spark job runs faster than a Hadoop MapReduce job. It will also help you understand the Spark ecosystem and its related APIs: Spark SQL, Spark Streaming, Spark MLlib, Spark GraphX and Spark Core concepts. You will apply Data Analytics & Machine Learning algorithms to various datasets to process and analyse large amounts of data.

- Spark RDDs
- Spark RDD Actions & Transformations
- Spark SQL: connectivity with various relational sources & converting them into DataFrames using Spark SQL
- Spark Streaming
- Understanding the role of RDDs
- Spark Core concepts: creating RDDs: Parallel RDDs, MappedRDD, HadoopRDD, JdbcRDD
- Spark Architecture & Components
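The transformation/action distinction above can be sketched without a Spark cluster. This is a plain-Python stand-in, not PySpark: generators model lazy transformations, and the final aggregation plays the role of an action that pulls data through the lineage:

```python
# parallelize(range(1, 11)) -- the source "RDD"
data = range(1, 11)

# rdd.map(lambda x: x * x) -- a lazy transformation: nothing runs yet
mapped = (x * x for x in data)

# rdd.filter(lambda x: x % 2 == 0) -- another lazy transformation
filtered = (x for x in mapped if x % 2 == 0)

# rdd.sum() -- an action: only now does data flow through the chain
result = sum(filtered)
print(result)  # 4 + 16 + 36 + 64 + 100 = 220
```

As in Spark, composing transformations costs nothing until an action forces evaluation; that laziness is what lets the engine plan and pipeline the whole computation.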

BIG DATA PROJECTS

Project #1: Working with MapReduce, Pig, Hive & Flume
Problem Statement: Fetch structured & unstructured data sets from various sources, such as social media sites and web servers, and structured sources such as MySQL, Oracle & others; dump them into HDFS; then analyse the same datasets using Pig, HQL queries & MapReduce to gain proficiency in the Hadoop stack & its ecosystem tools.

Data analysis steps:

- Dump XML & JSON datasets into HDFS.
- Convert semi-structured formats (JSON & XML) into structured format using Pig, Hive & MapReduce.
- Push the data sets into the Pig & Hive environments for further analysis.
- Write Hive queries to push the output into a relational database (RDBMS) using Sqoop.
- Render the results as box plots, bar graphs & others using R & Python integration with Hadoop.

Project #2: Analyze Stock Market Data
Industry: Finance
Data: The data set contains stock information such as daily quotes, highest price and opening price on the New York Stock Exchange.
Problem Statement: Calculate the covariance for stock data to solve storage & processing problems related to huge volumes of data.

- Positive covariance: if investment instruments or stocks tend to move up or down during the same time periods, they have positive covariance.
- Negative covariance: if returns move inversely, i.e. one investment tends to be up while the other is down, they have negative covariance.
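The covariance the project computes at scale can be checked on a toy series first. A minimal sketch of the sample covariance of two equal-length return series, with made-up numbers for illustration:

```python
def covariance(xs, ys):
    """Sample covariance: positive when the two series move together,
    negative when they move inversely."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

# two stocks rising together -> positive covariance
up_together = covariance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])

# one rising while the other falls -> negative covariance
inverse = covariance([1.0, 2.0, 3.0], [6.0, 4.0, 2.0])
print(up_together, inverse)
```

In the project the same formula is applied per stock pair over the full exchange data set, where the volume, not the math, is the challenge.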

Project #3: Hive, Pig & MapReduce with New York City Uber Trips
Problem Statement: What was the busiest dispatch base by trips for a particular day or the entire month?

- Which day had the most active vehicles?
- Which day had the most trips, sorted from most to fewest?

Dispatching_Base_Number is the NYC Taxi & Limousine Commission code of the base that dispatched the Uber. active_vehicles shows the number of active Uber vehicles for a particular date & company (base). trips is the number of trips for a particular base & date.
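The busiest-base question is a group-and-sum aggregation; in the project it would run as a Hive or Pig GROUP BY. A plain-Python sketch over hypothetical rows (base numbers and figures below are invented for illustration):

```python
from collections import Counter

# hypothetical rows: (dispatching_base_number, date, active_vehicles, trips)
rows = [
    ("B02512", "2015-01-01", 190, 1132),
    ("B02765", "2015-01-01", 225, 1765),
    ("B02512", "2015-01-02", 205, 1251),
    ("B02765", "2015-01-02", 160,  978),
]

def busiest_base(rows):
    """SUM(trips) GROUP BY base, then take the base with the largest total."""
    totals = Counter()
    for base, _date, _vehicles, trips in rows:
        totals[base] += trips
    return totals.most_common(1)[0]

print(busiest_base(rows))
```

The per-day variants in the problem statement are the same aggregation keyed on date instead of base.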



Centres:

PITAMPURA (DELHI): Plot No. 366, 2nd Floor, Kohat Enclave, Pitampura (Near Kohat Metro Station), Above Allahabad Bank, New Delhi-110034
NOIDA: A-43 & A-52, Sector-16, Noida-201301 (U.P.), INDIA
GHAZIABAD: 1, Anand Industrial Estate, Near ITS College, Mohan Nagar, Ghaziabad (U.P.)
GURGAON: 1808/2, 2nd Floor, Old DLF, Near Honda Showroom, Sec-14, Gurgaon (Haryana)
SOUTH EXTENSION (DELHI): D-27, South Extension-1, New Delhi-110049

Phone: 70-70-90-50-90, +91 99-9999-3213, +91 98-1161-2707
www.facebook.com/ducateducation

Project #4: Analyze Tourism Data
Data: Tourism data comprising: city pair, seniors travelling, children travelling, adults travelling, car booking price & air booking price.
Problem Statement: Analyze the tourism data to find:

- Top 20 destinations tourists frequently travel to: based on the given data, find the most popular destinations, using the number of trips booked for each destination.
- Top 20 high air-revenue destinations, i.e. the 20 cities that generate the highest airline revenues, so that discount offers can be given to attract more bookings for these destinations.
- Top 20 locations from which most trips start, based on booked trip count.
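Each of the three rankings above is a count-and-sort over one column; with real data this would be a Hive GROUP BY ... ORDER BY ... LIMIT query. A plain-Python sketch with invented sample bookings:

```python
from collections import Counter

def top_destinations(bookings, k=20):
    """Rank destinations by how many booked trips name them."""
    return Counter(bookings).most_common(k)

# hypothetical data: destination city of each booked trip
sample = ["Goa", "Paris", "Goa", "Dubai", "Goa", "Paris"]
print(top_destinations(sample, k=2))
```

Swapping the counted column (destination, air revenue, origin) yields the other two rankings with the same pattern.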

Project #5: Airport Flight Data Analysis
Industry: Aviation
Data: Airport Information System data giving information on flight delays, source & destination details, diverted routes & others.
Problem Statement: Analyze the flight data to find:

- Delayed flights.
- Flights with zero stops.
- Active airlines in all countries.
- Source & destination details of flights.
- Reasons why flights get delayed.
- Times in different formats.

Project #6: Analyze Movie Ratings
Industry: Media
Data: Movie data from sites like Rotten Tomatoes, IMDb, etc.
Problem Statement: Analyze the movie ratings by different users to:

- Get the user who has rated the most movies.
- Get the user who has rated the fewest movies.
- Get the count of movies rated by users belonging to a specific occupation.
- Get the number of underage users.

Project #7: Analyze Social Media Channels
Industry: Social Media
Channels: Facebook, Twitter, Instagram & YouTube
Data: Dataset columns: VideoId, Uploader, day of establishment on YouTube & the date of uploading of the video, Category, Length, Rating, Number of comments.
Problem Statement: Identify the top 5 categories in which the most videos are uploaded, the top 10 rated videos, and the top 10 most viewed videos.

Apart from these, there are some twenty more use cases to choose from: Twitter Data Analysis, Market Data Analysis & others.