Total Duration: 80 Hours (10 Days)

Course Description

Big Data is data that is too large and complex for conventional data tools to capture, store and analyze. When put to good use, Big Data allows analysts to spot trends, extract insights and make predictions. This course, developed by our industry experts, will help you build the core competencies expected of a Big Data analyst: effectively mining, manipulating and analyzing Big Data using basic and advanced analytical techniques. It is an industry-relevant Big Data Analytics training that blends analytics and technology, making it well suited to aspirants who want to develop Big Data Analytics skills and get a head start in Big Data.

Course Objective

The objective of the course is to understand big data and how to store, manage and process it using big data technologies such as Hadoop and the Hadoop ecosystem. In this Big Data training, attendees will gain practical skills in Hadoop and its ecosystem, including HDFS, MapReduce, Spark (Core, SQL, MLlib, GraphX), Pig, Hive, Impala, HBase, Sqoop, Flume, Oozie, ZooKeeper and Storm. The course also includes hands-on integration of Hadoop with Spark and an introduction to machine learning. At the end of the program, candidates are awarded the Certified Big Data Analyst certification on successful completion of the projects provided as part of the training. Optionally, candidates can also sit for the Cloudera or Hortonworks Big Data Hadoop certification after this course. The course covers what you need to emerge as an industry-ready professional in the field of Big Data Analytics.

Who should do this course?

Candidates from quantitative backgrounds such as Engineering, Finance, Maths, Statistics or Business Management who want to get a head start on a career in analytics. Professionals in IT/ITES, data analytics, Business Intelligence or databases, and candidates from computer science (or any other circuit branch), who want to move into a Big Data Analytics or Developer role.

Prerequisites

Knowledge of Excel is mandatory and a quantitative background is preferred. Knowledge of any programming language and exposure to data analytics would be an advantage.

Who are the trainers?

Our trainers are highly qualified industry experts and certified instructors with more than 10 years of global analytical experience.

Projects – Case Studies

- Data storage using HDFS: This case study gives practical experience in storing and managing different types of data (structured, semi-structured and unstructured), both compressed and uncompressed.
- Processing data using MapReduce: This case study gives practical experience in understanding and developing MapReduce programs in Java and R, and in running streaming jobs from the terminal and Eclipse.
- Data integration using Sqoop & Flume: This case study gives practical experience in extracting data from Oracle and loading it into HDFS (and vice versa), and in extracting data from Twitter and storing it in HDFS.
- Data analysis using Pig: This case study gives practical experience in complete data analysis using Pig, and in creating and using user-defined functions (UDFs).
- Data analysis using Hive: This case study gives practical experience in complete data analysis using Hive, and in creating and using user-defined functions (UDFs).
- HBase NoSQL database creation: This case study gives practical experience in data table/cluster creation using HBase.
- Final project: Integration of Hadoop components: The final project gives practical experience in how different modules (Pig, Hive, HBase) can be used to solve big data problems.

Who provides the certification?

The certification is provided by Databyte Academy. Upon successful completion of the program, students will be conferred with dual certification:
- Certificate of Completion
- CERTIFIED BIG DATA ANALYTICS*

*In order to be “Certified” as part of the course, students need to complete the assignments and examination. Once all assignments are submitted and evaluated, the certificate will be awarded.

Course Outcome

Ability to understand big data and to use Big Data ecosystem tools to store and process it. Participants also get hands-on exposure to using big data technology to improve performance across functions by storing, managing and processing big data efficiently.

CERTIFIED BIG DATA ANALYTICS (Hadoop - Spark)

Total Duration: 80 Hours (10 Days)

Databyte Academy Sdn Bhd (1176678-V)
No. 18-4, Jalan 13/48A
Sentul Boulevard Shop Office
Jalan Sentul, 51000 Kuala Lumpur, Malaysia

[email protected]

www.databyte.com.my

+603-4045 5000

+603-4045 6000

Course Content

Introduction to Big Data

• Introduction and relevance
• Uses of Big Data analytics in various industries such as Telecom, E-commerce, Finance and Insurance
• Problems with Traditional Large-Scale Systems

Data Integration Using Sqoop & Flume

• Integrating Hadoop into an Existing Enterprise
• Loading Data from an RDBMS into HDFS by Using Sqoop
• Managing Real-Time Data Using Flume
• Accessing HDFS from Legacy Systems

SPARK Introduction

• Introduction to Apache Spark
• Streaming Data vs. In-Memory Data
• MapReduce vs. Spark
• Modes of Spark
• Spark Installation Demo
• Overview of Spark on a Cluster
• Spark Standalone Cluster

SPARK meets HIVE

• Analyze Hive and Spark SQL Architecture
• Analyze Spark SQL
• Context in Spark SQL
• Implement a sample example for Spark SQL
• Integrating Hive and Spark SQL
• Support for JSON and Parquet File Formats
• Implement Data Visualization in Spark
• Loading of Data
• Hive Queries through Spark
• Performance Tuning Tips in Spark
• Shared Variables: Broadcast Variables & Accumulators

SPARK GraphX

• Overview of the GraphX module in Spark
• Creating graphs with GraphX

Final Project

• Consolidate all the learnings
• Working on a Big Data project by integrating various key components

SPARK streaming

• Extract and analyze data from Twitter using Spark Streaming
• Comparison of Spark and Storm - Overview

Implement Machine Learning Using Spark

• Brief introduction to the Machine Learning framework
• Implement some of the ML algorithms using Spark MLlib (ML is not covered in detail in this course; for Machine Learning concepts, please refer to the Advance Big Data Science course or the Machine Learning Specialization course)

SPARK in Practice

• Invoking Spark Shell
• Creating the Spark Context
• Loading a File in Shell
• Performing Some Basic Operations on Files in Spark Shell
• Caching Overview
• Distributed Persistence
• Spark Streaming Overview (Example: Streaming Word Count)
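The "Streaming Word Count" example named in the SPARK in Practice module can be previewed without a cluster. The sketch below is plain Python and does not use any Spark API: it only illustrates the micro-batch idea, where lines arrive in small batches and word counts are updated after each one. The function name `streaming_word_count` and the sample batches are assumptions made for illustration, and keeping a running total across batches is a choice made here; in Spark Streaming that would require a stateful operation rather than the default per-batch count.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def streaming_word_count(batches: Iterable[List[str]]) -> Iterator[Counter]:
    """Yield the running word counts after each micro-batch of lines.

    Hypothetical helper for illustration only; not a Spark API.
    """
    totals: Counter = Counter()
    for batch in batches:
        for line in batch:
            # Split each line on whitespace and add its words to the totals.
            totals.update(line.lower().split())
        # Yield a copy so earlier snapshots are not changed by later batches.
        yield Counter(totals)


if __name__ == "__main__":
    # Two made-up micro-batches standing in for a live stream.
    incoming = [["spark spark"], ["spark storm"]]
    for batch_number, snapshot in enumerate(streaming_word_count(incoming), start=1):
        print(batch_number, dict(snapshot))
```

Each yielded snapshot corresponds to the state of the counts at the end of one batch interval, which is the mental model the hands-on session builds on.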

Data Analysis Using HIVE

• Apache Hive - Hive vs. Pig - Hive Use Cases
• Discuss the Hive data storage principle
• Explain the file formats and record formats supported by the Hive environment
• Perform operations with data in Hive
• HiveQL: Joining Tables, Dynamic Partitioning, Custom Map/Reduce Scripts
• Hive Script, Hive UDF
• Hive Persistence Formats
• Loading Data in Hive - Methods
• Serialization & Deserialization
• Handling Text Data Using Hive
• Integrating External BI Tools with Hadoop Hive

Data Analysis Using IMPALA

• Introduction to Impala & Architecture
• How Impala executes queries and its importance
• Hive vs. Pig vs. Impala
• Extending Impala with User Defined Functions

Introduction to Other Ecosystem Tools

• NoSQL database - HBase Introduction
• Oozie

Data Analysis Using PIG

• Introduction to Data Analysis Tools
• Apache Pig - MapReduce vs. Pig, Pig Use Cases
• Pig's Data Model
• Pig Streaming
• Pig Latin Program & Execution
• Pig Latin: Relational Operators, File Loaders, Group Operator, COGROUP Operator, Joins and COGROUP, Union, Diagnostic Operators, Pig UDF
• Writing Java UDFs
• Embedded Pig in Java
• Pig Macros
• Parameter Substitution
• Use Pig to automate the design and implementation of MapReduce applications
• Use Pig to apply structure to unstructured Big Data
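Of the Pig Latin operators listed in the Data Analysis Using PIG module, COGROUP is often the least familiar. The sketch below is plain Python, not Pig Latin, and is only meant to illustrate the result shape: two relations are grouped by a key, and each key maps to one bag of matching rows per relation, with an empty bag where a relation has no rows for that key. The function name `cogroup` and the sample relations are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

Row = Tuple  # a relation is modelled here as a list of tuples


def cogroup(
    left: List[Row],
    right: List[Row],
    left_key: int = 0,
    right_key: int = 0,
) -> Dict[Hashable, Tuple[List[Row], List[Row]]]:
    """Group two relations by key; each key maps to (left bag, right bag).

    Hypothetical helper that mimics the shape of a COGROUP result.
    """
    # Every new key starts with two fresh, empty bags.
    result: Dict[Hashable, Tuple[List[Row], List[Row]]] = defaultdict(
        lambda: ([], [])
    )
    for row in left:
        result[row[left_key]][0].append(row)
    for row in right:
        result[row[right_key]][1].append(row)
    return dict(result)


if __name__ == "__main__":
    # Made-up sample relations keyed on the first field.
    owners = [("alice", "cat"), ("bob", "dog")]
    visits = [("alice", 3), ("carol", 1)]
    for key, (owner_bag, visit_bag) in cogroup(owners, visits).items():
        print(key, owner_bag, visit_bag)
```

Unlike a join, the rows from each relation stay in separate bags, which is why the module lists "Joins and COGROUP" as a pair to compare.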

Hadoop (Big Data) Ecosystem

• Motivation for Hadoop
• Different types of projects by Apache
• Role of projects in the Hadoop Ecosystem
• Key technology foundations required for Big Data
• Limitations and Solutions of existing Data Analytics Architecture
• Comparison of traditional data management systems with Big Data management systems
• Evaluate key framework requirements for Big Data analytics
• Hadoop Ecosystem & Hadoop 2.x core components
• Explain the relevance of real-time data
• Explain how to use big and real-time data as a business planning tool

Hadoop Core Components - HDFS & MapReduce (YARN)

• HDFS Overview & Data storage in HDFS
• Get data into Hadoop from a local machine, and vice versa (Data Loading Techniques)
• MapReduce Overview (Traditional way vs. MapReduce way)
• Concept of Mapper & Reducer
• Understanding the MapReduce program framework
• Develop a MapReduce program using Java (Basic)
• Develop a MapReduce program with the streaming API (Basic)

Hadoop Cluster - Architecture - Configuration File

• Hadoop Master-Slave Architecture
• The Hadoop Distributed File System - Concept of data storage
• Explain different types of cluster setups (fully distributed, pseudo-distributed, etc.)
• Hadoop cluster setup - Installation
• Hadoop 2.x Cluster Architecture
• A typical enterprise cluster - Hadoop Cluster Modes
• Understanding cluster management tools such as Cloudera Manager and Apache Ambari
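The Mapper and Reducer concept from the Hadoop Core Components module can be previewed on a laptop before any cluster is set up. The sketch below is plain Python that runs in memory and does not call Hadoop: it only shows the three logical steps of a word count, where the map step emits (word, 1) pairs, the shuffle step groups values by key, and the reduce step sums each group. All function names and the sample input are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group emitted values by key, as the framework would."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on an in-memory dataset."""
    mapped = (pair for line in lines for pair in mapper(line))
    return dict(reducer(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    # Made-up sample input standing in for a file stored in HDFS.
    sample = ["big data needs big tools", "hadoop stores big data"]
    print(word_count(sample))
```

On a real cluster the map and reduce steps run in parallel on different nodes and the framework performs the shuffle; only the mapper and reducer logic is written by the developer.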
