1. Apache Hadoop Sheetal Sharma, Intern at IBM Innovation
Centre
2. What Is Apache Hadoop? The Apache Hadoop project develops
open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for
the distributed processing of large data sets across clusters of
computers using simple programming models. It is designed to scale
up from single servers to thousands of machines, each offering
local computation and storage. Rather than rely on hardware to
deliver high-availability, the library itself is designed to detect
and handle failures at the application layer, so delivering a
highly-available service on top of a cluster of computers, each of
which may be prone to failures.
3. The project includes these modules:
- Hadoop Common: The common utilities that support the other Hadoop modules.
- Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
- Hadoop YARN: A framework for job scheduling and cluster resource management.
- Hadoop MapReduce: A YARN-based system for parallel processing of large data sets (see the word-count sketch below).
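To make the MapReduce model concrete, here is essentially the canonical word-count job written against the Hadoop Java API; a minimal sketch, assuming input and output HDFS paths are supplied on the command line:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The framework handles splitting the input, scheduling map and reduce tasks across the cluster, and re-running tasks on failed nodes; the programmer supplies only the two simple functions above.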
4. Other Hadoop-related projects at Apache include:
- Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig, and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
- Avro: A data serialization system.
- Cassandra: A scalable multi-master database with no single points of failure.
- Chukwa: A data collection system for managing large distributed systems.
- HBase: A scalable, distributed database that supports structured data storage for large tables.
5. Other Hadoop-related projects at Apache include:
- Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
- Mahout: A scalable machine learning and data mining library.
- Pig: A high-level data-flow language and execution framework for parallel computation.
- Spark: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
- Tez: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use cases. Tez is being adopted by Hive, Pig, and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g., ETL tools), to replace Hadoop MapReduce as the underlying execution engine.
- ZooKeeper: A high-performance coordination service for distributed applications.
6. Introduction The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs. Ambari enables system administrators to:
- Provision a Hadoop cluster: Ambari provides a step-by-step wizard for installing Hadoop services across any number of hosts, and handles configuration of the Hadoop services for the cluster.
- Manage a Hadoop cluster: Ambari provides central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster.
7. Monitor a Hadoop cluster: Ambari provides a dashboard for monitoring the health and status of the Hadoop cluster. Ambari leverages Ganglia for metrics collection and Nagios for system alerting, and will send emails when your attention is needed (e.g., a node goes down, remaining disk space is low). Ambari also enables application developers and system integrators to easily integrate Hadoop provisioning, management, and monitoring capabilities into their own applications with the Ambari REST APIs, as sketched below.
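As a minimal sketch of that REST integration (the host, port 8080, and admin:admin credentials here are illustrative assumptions; adjust them to your deployment), a client can list clusters with one authenticated GET against the documented /api/v1/clusters endpoint:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

// Sketch: list clusters through the Ambari REST API.
public class AmbariClusters {
  public static void main(String[] args) throws Exception {
    // Hypothetical Ambari server; substitute your own host and port.
    URL url = new URL("http://ambari-host:8080/api/v1/clusters");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();

    // Ambari's API uses HTTP Basic authentication.
    String auth = Base64.getEncoder()
        .encodeToString("admin:admin".getBytes("UTF-8"));
    conn.setRequestProperty("Authorization", "Basic " + auth);

    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line); // JSON description of each cluster
      }
    }
  }
}
```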
8. Getting Started with Ambari Follow the installation guide for Ambari 1.7.0. Note: Ambari currently supports the 64-bit versions of the following operating systems:
- RHEL (Red Hat Enterprise Linux) 5 and 6
- CentOS 5 and 6
- OEL (Oracle Enterprise Linux) 5 and 6
- SLES (SuSE Linux Enterprise Server) 11
- Ubuntu 12
9. Apache Avro Introduction Apache Avro is a data serialization system. Avro provides:
- Rich data structures.
- A compact, fast, binary data format.
- A container file, to store persistent data.
- Remote procedure call (RPC).
- Simple integration with dynamic languages.
Code generation is not required to read or write data files, nor to use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for statically typed languages.
10. Apache Avro Schemas Avro relies on schemas. When Avro data
is read, the schema used when writing it is always present. This
permits each datum to be written with no per-value overheads,
making serialization both fast and small. This also facilitates use
with dynamic, scripting languages, since data, together with its
schema, is fully self-describing. When Avro data is stored in a
file, its schema is stored with it, so that files may be processed
later by any program. If the program reading the data expects a
different schema this can be easily resolved, since both schemas
are present. When Avro is used in RPC, the client and server
exchange schemas in the connection handshake. (This can be
optimized so that, for most calls, no schemas are actually
transmitted.) Since client and server both have the other's
full schema, correspondence between same-named fields, missing
fields, extra fields, etc. can all be easily resolved. Avro schemas
are defined with JSON. This facilitates implementation in
languages that already have JSON libraries.
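To illustrate, here is a minimal sketch using the Avro Java library (the User schema and its fields are invented for this example): a record is written and read back using only a JSON schema string, with no generated classes.

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroNoCodegen {
  public static void main(String[] args) throws Exception {
    // Avro schemas are plain JSON; this one is illustrative.
    String json = "{\"type\":\"record\",\"name\":\"User\","
        + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"age\",\"type\":\"int\"}]}";
    Schema schema = new Schema.Parser().parse(json);

    // Build a record generically -- no generated classes needed.
    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "Ada");
    user.put("age", 36);

    // Serialize: the binary form carries no field names or tags,
    // because the schema is known at read time.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
    encoder.flush();

    // Deserialize with the same (writer's) schema.
    BinaryDecoder decoder =
        DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericRecord copy =
        new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
    System.out.println(copy); // {"name": "Ada", "age": 36}
  }
}
```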
11. Apache Avro Comparison with other systems Avro provides functionality similar to systems such as Thrift, Protocol Buffers, etc. Avro differs from these systems in the following fundamental aspects:
- Dynamic typing: Avro does not require that code be generated. Data is always accompanied by a schema that permits full processing of that data without code generation, static datatypes, etc. This facilitates construction of generic data-processing systems and languages.
- Untagged data: Since the schema is present when data is read, considerably less type information need be encoded with data, resulting in smaller serialization size.
- No manually-assigned field IDs: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.
12. Apache Cassandra The Apache Cassandra database is the right
choice when you need scalability and high availability without
compromising performance. Linear scalability and proven
fault-tolerance on commodity hardware or cloud infrastructure make
it the perfect platform for mission-critical data. Cassandra's
support for replicating across multiple data centers is
best-in-class, providing lower latency for your users and the peace of mind
of knowing that you can survive regional outages. Cassandra's data
model offers the convenience of column indexes with the performance
of log-structured updates, strong support for denormalization and
materialized views, and powerful built-in caching.
13. Apache Cassandra Overview Proven: Cassandra is in use at Constant Contact, CERN, Comcast, eBay, GitHub, GoDaddy, Hulu, Instagram, Intuit, Netflix, Reddit, The Weather Channel, and over 1,500 more companies that have large, active data sets. One of the largest production deployments is Apple's, with over 75,000 nodes storing over 10 PB of data. Other large Cassandra installations include Netflix (2,500 nodes, 420 TB, over 1 trillion requests per day), the Chinese search engine Easou (270 nodes, 300 TB, over 800 million requests per day), and eBay (over 100 nodes, 250 TB). Fault Tolerant: Data is automatically replicated to multiple nodes for fault tolerance. Replication across multiple data centers is supported. Failed nodes can be replaced with no downtime.
14. Apache Cassandra Overview Performance: Cassandra consistently outperforms popular NoSQL alternatives in benchmarks and real applications, primarily because of fundamental architectural choices. Decentralized: There are no single points of failure. There are no network bottlenecks. Every node in the cluster is identical. Durable: Cassandra is suitable for applications that can't afford to lose data, even when an entire data center goes down.
15. Apache Cassandra Overview You're in Control: Choose between synchronous or asynchronous replication for each update. Highly available asynchronous operations are optimized with features like Hinted Handoff and Read Repair. Elastic: Read and write throughput both increase linearly as new machines are added, with no downtime or interruption to applications. Professionally Supported: Cassandra support contracts and services are available from third parties.
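A minimal sketch of that per-update control, assuming the DataStax Java driver (the contact point, keyspace, and tables are invented for illustration):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class TunableWrites {
  public static void main(String[] args) {
    // Hypothetical single-node contact point and "demo" keyspace.
    Cluster cluster = Cluster.builder()
        .addContactPoint("127.0.0.1").build();
    try (Session session = cluster.connect("demo")) {
      // A critical write: block until a quorum of replicas ack
      // (effectively synchronous replication for this update).
      SimpleStatement critical = new SimpleStatement(
          "INSERT INTO users (id, name) VALUES (1, 'Ada')");
      critical.setConsistencyLevel(ConsistencyLevel.QUORUM);
      session.execute(critical);

      // A latency-sensitive write: one replica acks, and the rest
      // catch up asynchronously (hinted handoff covers failures).
      SimpleStatement fast = new SimpleStatement(
          "INSERT INTO events (id, kind) VALUES (2, 'click')");
      fast.setConsistencyLevel(ConsistencyLevel.ONE);
      session.execute(fast);
    } finally {
      cluster.close();
    }
  }
}
```

The key design point is that consistency is chosen per statement, not fixed for the whole database, so one application can mix strongly consistent and fast, eventually consistent operations.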
16. Chukwa Chukwa is an open source data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop's scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring, and analyzing results to make the best use of the collected data.
17. Apache HBase is the Hadoop database, a distributed,
scalable, big data store. Use Apache HBase when you need random,
real time read/write access to your Big Data. This project's goal
is the hosting of very large tables -- billions of rows by millions
of columns -- atop clusters of commodity hardware. Apache HBase is
an open-source, distributed, versioned, non-relational database
modeled after Google's Bigtable: A Distributed Storage System for
Structured Data by Chang et al. Just as Bigtable leverages the
distributed data storage provided by the Google File System, Apache
HBase provides Bigtable-like capabilities on top of Hadoop and
HDFS.
18. Features of Apache HBase
- Linear and modular scalability.
- Strictly consistent reads and writes.
- Automatic and configurable sharding of tables.
- Automatic failover support between RegionServers.
- Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
- Easy-to-use Java API for client access (see the sketch below).
- Block cache and Bloom filters for real-time queries.
- Query predicate push-down via server-side filters.
- Thrift gateway and a RESTful web service that supports XML, Protobuf, and binary data encoding options.
- Extensible JRuby-based (JIRB) shell.
- Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia, or via JMX.
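A minimal sketch of that Java client API (the table, column family, and qualifier names below are invented; they echo the webtable example from the Bigtable paper):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWrite {
  public static void main(String[] args) throws Exception {
    // Reads cluster location from hbase-site.xml on the classpath.
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("webtable"))) {

      // Random real-time write: one cell in column family "contents".
      Put put = new Put(Bytes.toBytes("com.example/index.html"));
      put.addColumn(Bytes.toBytes("contents"), Bytes.toBytes("html"),
          Bytes.toBytes("<html>...</html>"));
      table.put(put);

      // Random real-time read of the same row.
      Result result =
          table.get(new Get(Bytes.toBytes("com.example/index.html")));
      byte[] value = result.getValue(
          Bytes.toBytes("contents"), Bytes.toBytes("html"));
      System.out.println(Bytes.toString(value));
    }
  }
}
```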
19. Apache Hive The Apache Hive data warehouse software
facilitates querying and managing large data sets residing in
distributed storage. Hive provides a mechanism to project structure
onto this data and query the data using a SQL-like language called
HiveQL. At the same time, this language also allows traditional
map/reduce programmers to plug in their custom mappers and reducers
when it is inconvenient or inefficient to express this logic in
HiveQL. Hive is an open source volunteer project under the Apache
Software Foundation. Previously it was a subproject of Apache
Hadoop, but has now graduated to become a top-level project of its
own.
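As a sketch of how a client might submit HiveQL, assuming a HiveServer2 endpoint and Hive's JDBC driver (the host, credentials, and access_log table are illustrative):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC endpoint; host, port, and user are assumptions.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = conn.createStatement()) {

      // HiveQL looks like SQL, but Hive compiles it into distributed
      // jobs that run over data in HDFS.
      ResultSet rs = stmt.executeQuery(
          "SELECT page, COUNT(*) AS hits FROM access_log "
          + "GROUP BY page ORDER BY hits DESC LIMIT 10");
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}
```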
20. Apache Mahout The Apache Mahout project's goal is to build
a scalable machine learning library. By scalable, we mean:
Scalable to large data sets. Our core algorithms for clustering,
classification and collaborative filtering are implemented on top
of scalable, distributed systems. However, contributions that run
on a single machine are welcome as well. Scalable to support your
business case. Mahout is distributed under the commercially friendly
Apache Software License.
21. Apache Mahout Scalable community. The goal of Mahout is to
build a vibrant, responsive, diverse community to facilitate
discussions not only on the project itself but also on potential
use cases. Come to the mailing lists to find out more. Currently, Mahout mainly supports three use cases:
- Recommendation mining takes users' behavior and from that tries to find items users might like (see the sketch below).
- Clustering takes, e.g., text documents and groups them into clusters of topically related documents.
- Classification learns from existing categorized documents what documents of a specific category look like, and is able to assign unlabeled documents to the (hopefully) correct category.
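A minimal recommendation-mining sketch with Mahout's Taste API (ratings.csv is a hypothetical file of userID,itemID,preference lines; the neighborhood size and user ID are likewise illustrative):

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserRecommendations {
  public static void main(String[] args) throws Exception {
    // Each line of ratings.csv: userID,itemID,preference
    DataModel model = new FileDataModel(new File("ratings.csv"));

    // Find users whose past behavior correlates with the target user.
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood =
        new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender =
        new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top 3 items user 42 might like, based on similar users' behavior.
    List<RecommendedItem> items = recommender.recommend(42, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " -> " + item.getValue());
    }
  }
}
```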
22. Apache Pig Apache Pig is a platform for analyzing large
data sets that consists of a high-level language for expressing
data analysis programs, coupled with infrastructure for evaluating
these programs. The salient property of Pig programs is that their
structure is amenable to substantial parallelization, which in
turn enables them to handle very large data sets. At the present
time, Pig's infrastructure layer consists of a compiler that
produces sequences of Map-Reduce programs, for which large-scale
parallel implementations already exist (e.g., the Hadoop
subproject).
23. Apache Pig Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:
- Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
- Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
- Extensibility. Users can create their own functions to do special-purpose processing.
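A minimal sketch of driving Pig Latin from Java via PigServer (the input path, field names, and output directory are invented):

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCount {
  public static void main(String[] args) throws Exception {
    // Run against the cluster; use ExecType.LOCAL to test locally.
    PigServer pig = new PigServer(ExecType.MAPREDUCE);

    // Pig Latin: each statement is one step of an explicit data flow.
    pig.registerQuery(
        "lines = LOAD 'input.txt' AS (line:chararray);");
    pig.registerQuery(
        "words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
    pig.registerQuery("grouped = GROUP words BY word;");
    pig.registerQuery(
        "counts = FOREACH grouped GENERATE group, COUNT(words);");

    // Nothing executes until a STORE (or DUMP) forces the plan to run;
    // the compiler then emits the Map-Reduce job sequence.
    pig.store("counts", "wordcount-out");
  }
}
```

Deferring execution until STORE is what gives the compiler a whole data-flow graph to optimize, rather than one statement at a time.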
24. Apache Spark Apache Spark is a fast and general engine for large-scale data processing. Ease of Use: Write applications quickly in Java, Scala, or Python. Spark offers over 80 high-level operators that make it easy to build parallel apps, and you can use it interactively from the Scala and Python shells. Generality: Combine SQL, streaming, and complex analytics. Spark powers a stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.
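For a taste of those operators, here is a word-count sketch in Spark's Java API (the paths are illustrative, and the flatMap signature shown is the Spark 2.x one, which returns an Iterator):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("word count");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Three chained operators replace an entire hand-written
    // MapReduce job: split lines into words, pair each word with 1,
    // then sum the counts per word.
    JavaRDD<String> lines = sc.textFile("hdfs:///input.txt");
    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey((a, b) -> a + b);

    counts.saveAsTextFile("hdfs:///wordcount-out");
    sc.stop();
  }
}
```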
25. Apache Spark Runs Everywhere: Spark runs on Hadoop, Mesos, standalone, or in the cloud, and can access diverse data sources including HDFS, Cassandra, HBase, and S3. You can run Spark readily using its standalone cluster mode, on EC2, or on Hadoop YARN or Apache Mesos, and it can read from HDFS, HBase, Cassandra, and any other Hadoop data source.
26. Apache Tez Introduction The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop Apache Hadoop YARN. The two main design themes for Tez are:
- Empowering end users by:
  - Expressive data flow definition APIs (see the sketch below)
  - A flexible Input-Processor-Output runtime model
  - Data type agnosticism
- Simplifying deployment
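A minimal sketch of that DAG definition API, modeled on Tez's word-count example (the com.example processor class names are hypothetical placeholders for LogicalIOProcessor implementations you would supply):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.Edge;
import org.apache.tez.dag.api.ProcessorDescriptor;
import org.apache.tez.dag.api.Vertex;
import org.apache.tez.runtime.library.conf.OrderedPartitionedKVEdgeConfig;
import org.apache.tez.runtime.library.partitioner.HashPartitioner;

public class TwoVertexDag {
  public static DAG build() {
    // Each vertex runs a processor; these class names are placeholders.
    Vertex tokenizer = Vertex.create("Tokenizer",
        ProcessorDescriptor.create("com.example.TokenProcessor"));
    Vertex summer = Vertex.create("Summer",
        ProcessorDescriptor.create("com.example.SumProcessor"));

    // A shuffle-style edge: key-value data partitioned by hash and
    // sorted before it reaches the consumer, as between map and reduce.
    OrderedPartitionedKVEdgeConfig edgeConf = OrderedPartitionedKVEdgeConfig
        .newBuilder(Text.class.getName(), IntWritable.class.getName(),
            HashPartitioner.class.getName())
        .build();

    // The whole pipeline is one DAG, submitted as a single Tez job;
    // more vertices and edges can be added without extra jobs.
    return DAG.create("wordcount")
        .addVertex(tokenizer)
        .addVertex(summer)
        .addEdge(Edge.create(tokenizer, summer,
            edgeConf.createDefaultEdgeProperty()));
  }
}
```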
27. Apache Tez Execution Performance
- Performance gains over MapReduce
- Optimal resource management
- Plan reconfiguration at runtime
- Dynamic physical data flow decisions
28. By allowing projects like Apache Hive and Apache Pig to run a complex DAG of tasks, Tez can process data that previously required multiple MapReduce jobs within a single Tez job.
29. Apache ZooKeeper Apache ZooKeeper is an effort to develop
and maintain an open-source server which enables highly reliable
distributed coordination. ZooKeeper is a centralized service for
maintaining configuration information, naming, providing
distributed synchronization, and providing group services. All of
these kinds of services are used in some form or another by
distributed applications. Each time they are implemented there is a
lot of work that goes into fixing the bugs and race conditions that
are inevitable. Because of the difficulty of implementing these
kinds of services, applications initially usually skimp on them,
which makes them brittle in the presence of change and difficult to
manage. Even when done correctly, different implementations of
these services lead to management complexity when the applications
are deployed.
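For example, here is a minimal sketch of the ZooKeeper Java client used for shared configuration (the znode path, connection string, and data are invented):

```java
import java.nio.charset.StandardCharsets;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ConfigZnode {
  public static void main(String[] args) throws Exception {
    // Connect; the watcher lambda ignores session events in this sketch.
    ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});

    // Store a piece of configuration as a znode (illustrative path).
    byte[] data = "max_connections=100".getBytes(StandardCharsets.UTF_8);
    if (zk.exists("/app/config", false) == null) {
      zk.create("/app/config", data,
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    // Any client can read it back; passing true sets a watch so the
    // client is notified when the configuration changes.
    byte[] read = zk.getData("/app/config", true, null);
    System.out.println(new String(read, StandardCharsets.UTF_8));
    zk.close();
  }
}
```

Centralizing this logic in one well-tested service is exactly what spares each application from re-fixing the same bugs and race conditions.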