
Big Data - Hadoop Ecosystem


15 April 2023

Nuria de las Heras


Table of Contents

A. Framework Ecosystem – Hadoop Ecosystem
   1.1. Tools for working with Hadoop
      1.1.1. NoSQL Databases
         1.1.1.1. MongoDB
         1.1.1.2. Cassandra
         1.1.1.3. HBase
         1.1.1.4. ZooKeeper
      1.1.2. MapReduce
         1.1.2.1. Hive
         1.1.2.2. Impala
         1.1.2.3. Pig
         1.1.2.4. Cascading
         1.1.2.5. Flume
         1.1.2.6. Chukwa
         1.1.2.7. Sqoop
         1.1.2.8. Oozie
         1.1.2.9. HCatalog
      1.1.3. Machine learning
         1.1.3.1. WEKA
         1.1.3.2. Mahout
      1.1.4. Visualization
         1.1.4.1. Fusion Tables
         1.1.4.2. Tableau
      1.1.5. Search
         1.1.5.1. Lucene
         1.1.5.2. Solr


List of Tables

Table 1: NoSQL Databases
Table 2: MapReduce
Table 3: Machine learning
Table 4: Visualization
Table 5: Search

List of Figures

Figure 1: Hadoop Ecosystem
Figure 2: NoSQL Databases Ecosystem
Figure 3: MapReduce Ecosystem


Revision History

Date        Version     Description     Author
            0.0                         Nuria de las Heras


A. Framework Ecosystem – Hadoop Ecosystem

The Hadoop platform consists of two key services: a reliable, distributed file system called the Hadoop Distributed File System (HDFS) and a high-performance parallel data processing engine called Hadoop MapReduce. The combination of HDFS and MapReduce provides a software framework for processing vast amounts of data in parallel on large clusters of commodity hardware (potentially scaling to thousands of nodes) in a reliable, fault-tolerant manner. Hadoop is a generic processing framework designed to execute queries and other batch read operations against massive datasets that can scale from tens of terabytes to petabytes in size.

When Hadoop 1.0.0 was released by Apache in 2011, comprising mainly HDFS and MapReduce, it soon became clear that Hadoop was not simply another application or service, but a platform around which an entire ecosystem of capabilities could be built. Since then, dozens of self-standing software projects have sprung up around Hadoop, each addressing a different problem space and meeting different needs.

The so-called "Hadoop ecosystem" is, as befits an ecosystem, complex, evolving, and not easily parceled into neat categories. Simply keeping track of all the project names may seem like a task of its own, and it pales in comparison to tracking the functional and architectural differences between projects. These projects are not all meant to be used together, as parts of a single organism; some may even seek to solve the same problem in different ways. What unites them is that each seeks to tap into the scalability and power of Hadoop, particularly the HDFS component.

Figure 1: Hadoop Ecosystem

1.1. Tools for working with Hadoop


1.1.1. NoSQL Databases

NoSQL databases are next-generation databases that mostly address some combination of the following points: non-relational, distributed, open source, and horizontally scalable. The original intention was modern web-scale databases; the movement began in early 2009 and is growing rapidly. Often further characteristics apply, such as being schema-free, easy replication support, a simple API, eventual consistency / BASE (Basic Availability, Soft state, Eventual consistency, rather than ACID), and handling huge amounts of data. The somewhat misleading term "NoSQL" (which the community now mostly reads as "not only SQL") should therefore be seen as an alias for something like the definition above.

Figure 2: NoSQL Databases Ecosystem

1.1.1.1. MongoDB

MongoDB is a document-oriented system whose records look similar to JSON objects, with the ability to store and query nested attributes (a short usage sketch follows the feature list below). More features:

. MongoDB is written in C++.

. It is document-oriented storage. Documents are assumed to encapsulate and encode data in some standard format or encoding. Encodings in use include XML, YAML and JSON (JavaScript Object Notation), as well as binary forms like BSON, PDF and MS Office documents.

. Documents use BSON syntax. Data is stored and queried in BSON; think of it as binary-serialized, JSON-like data.


. MongoDB uses collections for storing groups of data. Documents exist inside a collection.

. Documents are schema-less. Data in MongoDB has a flexible schema, and collections do not enforce document structure.

. MongoDB supports index on any attribute, which provides high performance read operations for frequently used queries.

. It supports replication and high availability, which means data can be mirrored across LANs and WANs. Replica sets provide redundancy and high availability.

. Auto-sharding. Sharding (the process of storing data records across multiple machines) solves the problem of horizontal scaling: you add more machines to support data growth and the demand of read and write operations.

. Querying supports rich, document-based queries.

. It provides methods to perform update operations.

. Flexible aggregation and data processing. Map-reduce operations can handle complex aggregation tasks.

. It stores files of any size. GridFS is a specification for storing and retrieving files that exceed the BSON-document size limit of 16 MB.
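
As an illustration of the document model and nested-attribute queries described above, here is a minimal sketch using the official MongoDB Java driver (the 3.7+ `MongoClients` API). The database, collection, and field names are purely illustrative, and a mongod instance is assumed to be running locally.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;

public class MongoExample {
    public static void main(String[] args) {
        // Connect to a local mongod (adjust the URI for a replica set or sharded cluster).
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("shop");
            MongoCollection<Document> orders = db.getCollection("orders");

            // Documents are schema-less; nested attributes are simply embedded documents (BSON).
            Document order = new Document("customer", "alice")
                    .append("total", 42.50)
                    .append("address", new Document("city", "Barcelona").append("zip", "08001"));
            orders.insertOne(order);

            // Query on a nested attribute using dot notation.
            Document found = orders.find(eq("address.city", "Barcelona")).first();
            System.out.println(found != null ? found.toJson() : "not found");
        }
    }
}
```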

1.1.1.2. Cassandra

Cassandra is an open source distributed database management system designed to handle large amounts of data across many servers, providing high availability with no single point of failure. It offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low-latency operations for all clients (a short CQL example follows the feature list below). More features:

. Cassandra is written in Java.

. Decentralized. Every node in the cluster has the same role. There is no single point of failure. Data is distributed across the cluster (so each node contains different data), but there is no master as every node can service any request.

. Scalability. Read and write throughput both increase linearly as new machines are added, with no downtime or interruption to applications.

. Fault-tolerant. Data is automatically replicated to multiple nodes for fault-tolerance. Replication across multiple data centers is supported. Failed nodes can be replaced with no downtime.

. Tunable consistency. Cassandra's data model is a partitioned row store with tunable consistency. For any given read or write operation, the client application decides how consistent the requested data should be.

. MapReduce support. Cassandra has Hadoop integration, with MapReduce support. There is support also for Apache Pig and Apache Hive.

. Query language. CQL (Cassandra Query Language) was introduced, a SQL-like alternative to the traditional RPC interface. Language drivers are available for Java (JDBC), Python (DBAPI2) and Node.JS (Helenus).

. Rows are organized into tables; the first component of a table's primary key is the partition key; within a partition, rows are clustered by the remaining columns of the key. Other columns may be indexed separately from the primary key.

. Cassandra is frequently referred to as a “column-oriented” database. Column families contain rows and columns. Each row is uniquely identified by a row key. Each row has multiple columns, each of which has a name, value, and a timestamp. Different rows in the same column family do not have to share the same set of columns, and a column may be added to one or multiple rows at any time.

. It does not support joins or subqueries, except for batch analysis via Hadoop.

. It is not relational; it represents its data structures in sparse, multidimensional hash tables.
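
To make the partitioned row store concrete, here is a minimal sketch issuing CQL through the DataStax Java driver (the 3.x-style `Cluster`/`Session` API). The keyspace, table, and contact point are illustrative assumptions rather than anything from the text.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraExample {
    public static void main(String[] args) {
        // Connect to a single contact point; the driver discovers the rest of the cluster.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build()) {
            Session session = cluster.connect();

            session.execute("CREATE KEYSPACE IF NOT EXISTS demo "
                    + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");

            // user_id is the partition key; event_time is a clustering column,
            // so rows within a partition are stored sorted by it.
            session.execute("CREATE TABLE IF NOT EXISTS demo.events ("
                    + "user_id text, event_time timestamp, action text, "
                    + "PRIMARY KEY (user_id, event_time))");

            session.execute("INSERT INTO demo.events (user_id, event_time, action) "
                    + "VALUES ('alice', toTimestamp(now()), 'login')");

            ResultSet rs = session.execute("SELECT * FROM demo.events WHERE user_id = 'alice'");
            for (Row row : rs) {
                System.out.println(row.getString("action"));
            }
        }
    }
}
```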

1.1.1.3. HBase

HBase is a distributed, column-oriented database built on top of HDFS, providing Bigtable-like capabilities for Hadoop. It has been designed from the ground up with a focus on scale in every direction: tall in numbers of rows (billions) and wide in numbers of columns (millions). HBase is at its best when it is accessed in a distributed fashion by many clients. Using HBase is recommended when you need random, real-time read/write access to Big Data (a short Java example follows the feature list below). More features:

. Written in Java.

. Strongly consistent reads/writes. This makes it very suitable for tasks such as high-speed counter aggregation.

. Automatic sharding. HBase tables are distributed on the cluster via regions, and regions are automatically split and re-distributed as your data grows.

. Automatic Region Server failover.

. In the parlance of CAP theorem, HBase is a CP (consistency and partition tolerance) type system.

. HBase is not relational and does not support SQL.

. It depends on ZooKeeper and by default it manages a ZooKeeper instance as the authority on cluster state.

. MapReduce. HBase supports massively parallelized processing via MapReduce for using HBase as both source and sink.

. Java Client API. HBase supports an easy to use Java API for programmatic access. Tables can serve as the input and output for MapReduce jobs run in Hadoop, and may be accessed through the Java API but also through REST, Avro or Thrift gateway API’s.

. Operational Management. HBase provides built-in web pages for operational insight as well as JMX metrics.

. Block Cache (an LRU cache that contains three levels of block priority) and Bloom Filters (a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set). HBase supports a Block Cache and Bloom Filters for high volume query optimization.
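
A minimal sketch of the Java Client API mentioned above, assuming an existing table named `weblogs` with a column family `stats` and an `hbase-site.xml` on the classpath pointing at the cluster; all names are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath to locate ZooKeeper and the cluster.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("weblogs"))) {

            // Write a cell: row key, column family, qualifier, value.
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("stats"), Bytes.toBytes("pageviews"), Bytes.toBytes("17"));
            table.put(put);

            // Random, real-time read by row key.
            Result result = table.get(new Get(Bytes.toBytes("user42")));
            byte[] value = result.getValue(Bytes.toBytes("stats"), Bytes.toBytes("pageviews"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```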


1.1.1.4. ZooKeeper

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications (a short client example follows the feature list below). More features:

. It allows distributed processes to coordinate with each other through a shared hierarchical namespace which is organized similarly to a standard file system. The namespace consists of data registers - called znodes, in ZooKeeper parlance - and these are similar to files and directories. Unlike a typical file system, which is designed for storage, ZooKeeper data is kept in-memory, which means ZooKeeper can achieve high throughput and low latency.

. The performance aspects of ZooKeeper mean it can be used in large, distributed systems.

. The reliability aspects keep it from being a single point of failure.

. The servers that make up the ZooKeeper service must all know about each other. They maintain an in-memory image of state, along with transaction logs and snapshots in a persistent store. As long as a majority of the servers are available, the ZooKeeper service will be available.

. ZooKeeper stamps each update with a number that reflects the order of all ZooKeeper transactions.

. It is especially fast in "read-dominant" workloads. ZooKeeper applications run on thousands of machines, and it performs best where reads are more common than writes, at ratios of around 10:1.

. It provides sequential consistency. Updates from a client will be applied in the order that they were sent.

. Atomicity. Updates either succeed or fail. No partial results.

. Single system image. A client will see the same view of the service regardless of the server it connects to.

. Reliability. Once an update has been applied, it will persist from that time forward until a client overwrites the update.

. Timeliness. The client’s view of the system is guaranteed to be up-to-date within a certain time bound.

. It provides a very simple programming interface.
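
A minimal sketch of the znode model using the plain ZooKeeper Java client; the connection string, session timeout, and the `/app-config` path are illustrative assumptions. Production code would normally add retry logic or use a higher-level client library.

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // Connect to the ensemble; the watcher is notified of session events.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // znodes form a filesystem-like hierarchy; this one holds a small configuration value.
        String path = "/app-config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "replicas=3".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data));
        zk.close();
    }
}
```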

MongoDB
  Advantages: Open source; easy to “install”; scalable; high performance; schema-free; dynamic queries supported.
  Disadvantages: Higher chance of losing data when adapting content, and it is hard to retrieve it; tops out performance-wise at relatively small data volumes.

Cassandra
  Advantages: Open source; scalable; high-level redundancy, failover and backup-restore capabilities; it has no single point of failure; ability to open and deliver data in near real-time; supports interactive web-based applications.
  Disadvantages: Complex to administer and manage; although it supports indexes, they can get out of sync with the data because of the lack of transactions; it has no joins; it is not suitable for large blobs.

HBase
  Advantages: Open source; scalable; a good solution for large-scale data processing and analysis; strongly consistent reads and writes; high write performance; automatic failover support between Region Servers.
  Disadvantages: Management complexity; needs ZooKeeper; the HDFS NameNode and the HBase Master are single points of failure (SPOF).

ZooKeeper
  Advantages: Open source; high performance; good process synchronization in the cluster; consistency of the configuration in the cluster; reliable messaging in the cluster.
  Disadvantages: Clients need to keep sending heartbeat messages in the absence of activity; ZooKeeper cannot make partial failures go away, since they are intrinsic to distributed systems.

Table 1: NoSQL Databases

1.1.2. MapReduce

MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Every job in MapReduce consists of three main phases: map, shuffle, and reduce.

In the map phase, the application has the opportunity to operate on each record in the input separately. Many maps are started at once, so that while the input may be gigabytes or terabytes in size, given enough machines the map phase can usually be completed in less than one minute. For example, if you were processing web server logs for a website that required users to log in, you might choose the user ID as your key so that you could see everything done by each user on your website.

In the shuffle phase, which happens after the map phase, data is collected together by the key the user has chosen and distributed to different machines for the reduce phase. Every record for a given key will go to the same reducer.

In the reduce phase, the application is presented with each key, together with all of the records containing that key. Again this is done in parallel on many machines. After processing each group, the reducer can write its output. (A minimal sketch of the web-log example is given after the feature list below.)

More features:

. Scale-out Architecture. Adds servers to increase processing power.

. Security & Authentication. Works with HDFS and HBase security to make sure that only approved users can operate against the data in the system.

. Resource Manager. Employs data locality and server resources to determine optimal computing operations.

. Optimized Scheduling. Completes jobs according to prioritization.

. Flexibility. Procedures can be written in virtually any programming language.

. Resiliency & High Availability. Multiple job and task trackers ensure that jobs fail independently and restart automatically.
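
The following is a minimal sketch of the web-log example from the text, written against the Hadoop `org.apache.hadoop.mapreduce` API: the mapper emits (user ID, 1) for each log line and the reducer sums the counts per user. The assumption that the user ID is the first whitespace-separated field of each line is illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RequestsPerUser {

    // Map phase: called once per log line; emits (userId, 1).
    public static class LogMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text userId = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Illustrative assumption: the user ID is the first whitespace-separated field.
            String[] fields = value.toString().split("\\s+");
            if (fields.length > 0 && !fields[0].isEmpty()) {
                userId.set(fields[0]);
                context.write(userId, ONE);
            }
        }
    }

    // Reduce phase: all records for one user arrive together after the shuffle.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "requests per user");
        job.setJarByClass(RequestsPerUser.class);
        job.setMapperClass(LogMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```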


Figure 3: MapReduce Ecosystem

1.1.2.1. Hive

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Because of Hadoop’s focus on large-scale processing, the latency may mean that even simple jobs take minutes to complete, so it is not a substitute for a real-time transactional database (a JDBC example follows the feature list below). More features:

. Scalability. Scale out with more machines added dynamically to the Hadoop cluster.

. It provides tools to enable easy data ETL.

. Indexing to provide acceleration; index types include compaction and bitmap indexes.

. Different storage types such as plain text, RCFile, HBase, ORC, and others.

. Metadata storage in an RDBMS, significantly reducing the time to perform semantic checks during query execution.

. Operation on compressed data stored in the Hadoop ecosystem, with algorithms including gzip, bzip2, Snappy, and others.

. SQL-like queries (HiveQL), which are implicitly converted into MapReduce jobs.


. Built-in user defined functions (UDFs) to manipulate dates, strings, and other data-mining tools. Hive supports extending the UDF set to handle use-cases not supported by built-in functions.

. Hive also provides query execution via MapReduce. It allows map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

. Hive is not designed for OLTP workloads.

. It does not offer real-time queries or row-level updates. It is best used for batch jobs over large sets of append-only data (like web logs).
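
As a sketch of how HiveQL is typically issued from application code, the snippet below connects to HiveServer2 over JDBC and runs a query that Hive compiles into MapReduce jobs; the URL, credentials, and table are illustrative assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Explicitly load the Hive JDBC driver (not always needed with JDBC 4 auto-discovery).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 JDBC URL; adjust host, port and database for your cluster.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {

            // HiveQL looks like SQL but is executed as batch MapReduce jobs behind the scenes.
            stmt.execute("CREATE TABLE IF NOT EXISTS weblogs (user_id STRING, url STRING, ts STRING) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'");

            ResultSet rs = stmt.executeQuery(
                    "SELECT user_id, COUNT(*) AS hits FROM weblogs GROUP BY user_id");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```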

1.1.2.2. Impala

Impala is an open-source, interactive/real-time SQL query system that runs on HDFS. As Impala supports SQL and provides real-time big data processing functionality, it has the potential to be used as a business intelligence (BI) system.

Impala was technically inspired by Google's Dremel paper. Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and a columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data.

The difference between Impala and Hive is whether it is real-time or not. While Hive uses MapReduce for data access, Impala uses its own distributed query engine to minimize response time. This distributed query engine is installed on all data nodes in the cluster. More features:

. Nearly all of Hive’s SQL, including insert, join and subqueries.

. Query results faster than Hive.

. Easy to create and change schemas.

. Tables created with Hive can be queried with Impala.

. Support for a variety of data formats: Hadoop native (Apache Avro, SequenceFile, RCFile with Snappy, GZIP, BZIP, or uncompressed); text (uncompressed or LZO-compressed); and Parquet (Snappy or uncompressed), the new state-of-the-art columnar storage format.

. Connectivity via JDBC, ODBC, Hue GUI, or command-line shell.

1.1.2.3. Pig

Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

The Apache Pig project is a procedural data processing language designed for Hadoop. It provides an engine for executing data flows in parallel on Hadoop (a short embedded example follows the feature list below). More features:


. Pig can operate on data whether it has metadata or not. It can operate on data that is relational, nested, or unstructured. And it can easily be extended to operate on data beyond files, including key/value stores, databases, etc.

. Intended to be a language for parallel data processing. It is not tied to one particular parallel framework. It has been implemented first on Hadoop, but it is not intended to be only on Hadoop. It can also read input from and write output to sources other than HDFS.

. Designed to be easily controlled and modified by its users. It allows integration of user code wherever possible, so it supports user-defined field transformation functions, user-defined aggregates, and user-defined conditionals.

. Pig processes data quickly.

. It includes a language, Pig Latin, for expressing data flows. Pig Latin use cases tend to fall into three separate categories: traditional extract-transform-load (ETL) data pipelines, research on raw data, and iterative processing. Pig Latin includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data.
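
A minimal sketch of a Pig Latin data flow embedded in Java via `PigServer` (local mode, so no cluster is needed for a quick test); the file names and field layout are illustrative assumptions.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigEmbeddedExample {
    public static void main(String[] args) throws Exception {
        // Run Pig Latin from Java; ExecType.LOCAL avoids needing a cluster for a quick test,
        // while ExecType.MAPREDUCE would run the same flow on Hadoop.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // A tiny ETL-style data flow: load logs, group by user, count requests per user.
        pig.registerQuery("logs = LOAD 'weblogs.tsv' AS (user_id:chararray, url:chararray);");
        pig.registerQuery("by_user = GROUP logs BY user_id;");
        pig.registerQuery("counts = FOREACH by_user GENERATE group AS user_id, COUNT(logs) AS hits;");

        // Writes the result; on a cluster this compiles into one or more MapReduce jobs.
        pig.store("counts", "requests_per_user");
        pig.shutdown();
    }
}
```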

1.1.2.4. Cascading

Most real-world Hadoop applications are built of a series of processing steps, and Cascading lets you define that sort of complex workflow as a program. You lay out the logical flow of the data pipeline you need, rather than building it explicitly out of MapReduce steps feeding into one another. To use it, you call a Java API, connecting objects that represent the operations you want to perform into a graph. The system takes that definition, does some checking and planning, and executes it on the Hadoop cluster. Developers use Cascading to create a .jar file that describes the required processes (a sketch follows the feature list below).

There are a lot of built-in objects for common operations like sorting, grouping, and joining, and you can write your own objects to run custom processing code. More features:

. It is simple to build, easy to test, robust in production

. It supports optimized joins.

. Parallel running jobs.

. Creating checkpoints.

. Developers can work in different languages (Java, Ruby, Scala, Clojure).

. Support for tsv, csv, and custom delimited text files.
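
A rough sketch of the Cascading 2.x word-count pattern: taps describe the source and sink, pipes describe the logical flow, and the planner turns the graph into MapReduce jobs. Class names follow the Cascading 2.x API and the paths are illustrative; treat this as an outline under those assumptions rather than a drop-in program.

```java
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class WordCountFlow {
    public static void main(String[] args) {
        // Source and sink taps: where the flow reads from and writes to on HDFS.
        Tap docTap = new Hfs(new TextLine(new Fields("line")), args[0]);
        Tap wcTap = new Hfs(new TextDelimited(true, "\t"), args[1], SinkMode.REPLACE);

        // Logical pipeline: split lines into words, group by word, count each group.
        Pipe wcPipe = new Pipe("wordcount");
        wcPipe = new Each(wcPipe, new Fields("line"),
                new RegexSplitGenerator(new Fields("word"), "\\s+"));
        wcPipe = new GroupBy(wcPipe, new Fields("word"));
        wcPipe = new Every(wcPipe, Fields.ALL, new Count(new Fields("count")), Fields.ALL);

        // The planner checks the graph, turns it into MapReduce jobs, and runs them.
        Properties properties = new Properties();
        AppProps.setApplicationJarClass(properties, WordCountFlow.class);
        FlowConnector flowConnector = new HadoopFlowConnector(properties);
        Flow flow = flowConnector.connect("wordcount", docTap, wcTap, wcPipe);
        flow.complete();
    }
}
```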

1.1.2.5. Flume

Flume is a distributed system for collecting log data from many sources, aggregating it, and writing it to HDFS. It is designed to be reliable and highly available, while providing a simple, flexible, and intuitive programming model based on streaming data flows. Flume maintains a central list of ongoing data flows, stored redundantly in ZooKeeper.


One very common use of Hadoop is taking web server or other logs from a large number of machines and periodically processing them to pull out analytics information. The Flume project is designed to make the data gathering process easy and scalable, by running agents on the source machines that pass the data updates to collectors, which then aggregate them into large chunks that can be efficiently written as HDFS files. It is usually set up using a command-line tool that supports common operations, like tailing a file or listening on a network socket, and has tunable reliability guarantees that let you trade off performance and the potential for data loss. More features:

. Reliability (the ability to continue delivering events in the face of failures without losing data). Flume can guarantee that all data received by an agent node will eventually make it to the collector at the end of its flow as long as the agent node keeps running. That is, data can be reliably delivered to its eventual destination. Flume allows the user to specify, on a per-flow basis, the level of reliability required. There are three supported reliability levels: end-to-end, store on failure, best effort.

. Scalability (the ability to increase system performance linearly by adding more resources to the system). A key performance measure in Flume is the number or size of events entering the system and being delivered. When load increases, it is simple to add more resources to the system in the form of more machines to handle the increased load.

. Manageability (the ability to control data flows, monitor nodes, modify settings, and control outputs of a large system). The Flume Master is the point where global state such as the data flows can be managed. Via the Flume Master, users can monitor flows and reconfigure them on the fly.

. Extensibility (the ability to add new functionality to a system). For example, you can extend Flume by adding connectors to existing storage layers or data platforms. This is made possible by simple interfaces, separation of functional concerns into simple pieces, a flow specification language, and a simple but flexible data model. Flume provides many common input and output connectors.

1.1.2.6. Chukwa

Log processing was one of the original purposes of MapReduce. Unfortunately, Hadoop is hard to use for this purpose. Writing MapReduce jobs to process logs is somewhat tedious, and the batch nature of MapReduce makes it difficult to use with logs that are generated incrementally across many machines. Furthermore, HDFS still does not support appending to existing files. Chukwa is a Hadoop subproject that bridges the gap between log handling and MapReduce. It provides a scalable distributed system for monitoring and analysis of log-based data. Some of the durability features include agent-side replaying of data to recover from errors.

. Collection components of Chukwa: adaptors, agents (that run on each machine and emit data), and collectors (that receive data from the agent and write to a stable storage).

. Chukwa includes Hadoop Infrastructure Care Center (HICC), which is a web interface for visualizing system performance.


. Flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.

. Chukwa’s reliability model supports two levels: end-to-end reliability, and fast-path delivery, which minimizes latencies. After writing data into HDFS, Chukwa runs a MapReduce job to demultiplex the data into separate streams.

1.1.2.7. Sqoop

Sqoop is an open-source tool that allows users to extract data from a relational database into Hadoop for further processing. This processing can be done with MapReduce programs or other higher-level tools such as Hive. (It is even possible to use Sqoop to move data from a relational database into HBase.) When the final results of an analytic pipeline are available, Sqoop can export these results back to the database for consumption by other clients. More features:

. Bulk import. Sqoop can import individual tables or entire databases into HDFS. The data is stored in the native directories and files in the HDFS file system.

. Direct input. Sqoop can import and map SQL (relational) databases directly into Hive and HBase.

. Data interaction. Sqoop can generate Java classes so that you can interact with the data programmatically.

. Data export. Sqoop can export data directly from HDFS into a relational database using a target table definition based on the specifics of the target database.

. It integrates with Oozie.

. It is a command line interpreter.

. It comes complete with connectors to MySQL, PostgreSQL, Oracle, SQL Server and DB2.

1.1.2.8. Oozie

Oozie is a workflow scheduler system to manage Hadoop jobs. An Oozie workflow is a collection of actions (e.g. Hadoop MapReduce jobs, Pig jobs) arranged in a control-dependency DAG (Directed Acyclic Graph), specifying a sequence of actions to execute. This graph is specified in hPDL (an XML Process Definition Language); a client example follows the feature list below. More features:

. Oozie is a scalable, reliable and extensible system.

. Oozie can detect completion of computation/processing tasks by two different means, callbacks and polling.

. Some of the workflows are invoked on demand, but the majority of the time it is necessary to run them based on regular time intervals and/or data availability and/or external events. The Oozie Coordinator system allows the user to define workflow execution schedules based on these parameters.

. It can run jobs sequentially (one after the other) and in parallel (multiple at a time).

. Oozie can also run plain java classes, Pig workflows, and interact with the HDFS.

. Oozie provides major flexibility (start, stop, suspend and re-run jobs). It allows you to restart from a failure (you can tell Oozie to restart a job from a specific node in the graph or to skip specific failed nodes).

. Java Client API / Command Line Interface (launch, control and monitor jobs from your Java Apps).

. Web Service API (you can control jobs from anywhere).

. Receive an email when a job is complete.
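
A minimal sketch of the Java Client API mentioned above, submitting a workflow whose `workflow.xml` (hPDL) already sits in HDFS and polling its status. The server URL, HDFS paths, and the property names other than `OozieClient.APP_PATH` are illustrative assumptions.

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitExample {
    public static void main(String[] args) throws Exception {
        // Point the client at the Oozie server's web service endpoint.
        OozieClient oozie = new OozieClient("http://localhost:11000/oozie");

        // Job properties: where the workflow.xml (hPDL) lives and the values it references.
        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/me/workflows/etl");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        conf.setProperty("jobTracker", "jobtracker:8021");

        // Submit and start the workflow, then poll until it leaves the RUNNING state.
        String jobId = oozie.run(conf);
        while (oozie.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
            Thread.sleep(10_000);
        }
        System.out.println("Workflow " + jobId + " finished: " + oozie.getJobInfo(jobId).getStatus());
    }
}
```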

1.1.2.9. HCatalog

HCatalog is an abstraction for data storage and a metadata service. It provides a set of interfaces that open up access to Hive's metastore for tools inside and outside of the Hadoop grid. More features:

. It presents users with a table abstraction. This frees them from knowing where or how their data are stored.

. It allows data producers to change how they write data while still supporting existing data in the old format so that data consumers do not have to change their processes.

. It provides a shared schema and data model for Pig, Hive, and MapReduce.

. It provides interoperability across data processing tools such as Pig, Map Reduce, and Hive.

. A REST interface to allow language independent access to Hive's metadata.

. HCatalog includes Hive's command line interface so that administrators can create and drop tables, specify table parameters, etc.

. It also provides an API for storage format developers to tell HCatalog how to read and write data stored in different formats.

. It supports RCFile (Record Columnar File), CSV (Comma Separated Values), JSON (JavaScript Object Notation), and SequenceFile formats.

. The data model of HCatalog is similar to HBase’s data model.

Hive
  Advantages: Open source; easy data summarization; ad-hoc queries; provides a Hadoop query language (HiveQL) similar to SQL; metadata store, which makes lookups easy.
  Disadvantages: It is not for OLAP processing; data is required to be loaded from a file.

Impala
  Advantages: Open source; SQL operation on top of Hadoop; useful with HBase, Hive, Pig; query results faster than Hive.
  Disadvantages: Not all Hive SQL is supported; you cannot create or modify a table.

Pig
  Advantages: Open source; very quick for processing large stable datasets such as meteorological trends or web-server logs; perfect for data processing that involves a number of steps (a pipeline of processing); ideal for solving problems that can be carved up, analyzed in pieces in parallel and then put back together (text mining, sentiment trends, recommendation, pattern recognition); Pig makes it simple to build scripts to analyze data, experimenting with approaches to identify the best one; it resides on the user machine, so it is not necessary to install anything in the Hadoop cluster.
  Disadvantages: It is not ideal for real-time or near real-time processing.

Cascading
  Advantages: Open source; there are a lot of pre-built components that can be composed together; very custom operations can be written as plain Java functions; it allows analytics jobs to be written quickly and easily in a familiar language.
  Disadvantages: It is not the best fit for some fine-grained, performance-critical problems.

Flume
  Advantages: Open source; scalable; a solution for data collection of all forms; possible sources for Flume include Avro files and system logs; it has a query processing engine; it allows streaming data to be managed and captured into Hadoop.
  Disadvantages: It does not do real-time analytics.

Chukwa
  Advantages: Open source; scalable; comprehensive toolset for log analysis; it has a rich metadata model; it can collect a variety of system metrics and can receive data via a variety of network protocols, including syslog; it provides a framework for processing the collected data.
  Disadvantages: Chukwa works with an agent-collector setup that works predominantly with a single collector unless a multi-collector setup is specified; it has no support for gzip-compressing the data files before or after storing data in HDFS.

Sqoop
  Advantages: Open source; extensible; a number of third-party companies ship database-specific connectors; connectors register metadata (Sqoop 2); admins set policy for connection use (Sqoop 2); it is compatible with almost any JDBC-enabled database; integration with Hive and HBase.
  Disadvantages: Although Sqoop supports importing to a Hive table/partition, it does not allow exporting from a table or partition.

Oozie
  Advantages: It supports MapReduce (Java, streaming, pipes), Pig, Java, filesystem, ssh, and sub-workflow actions; it supports variables and functions; interval job scheduling is time- and input-data-dependent.
  Disadvantages: All job management happens on the command line, and the default UI is read-only and requires a non-Apache-licensed JavaScript library, which makes it more difficult to use.

HCatalog
  Advantages: It provides a shared schema and data model for Pig, Hive, and MapReduce.
  Disadvantages: None found.

Table 2: MapReduce

1.1.3. Machine learning

Machine learning is a branch of artificial intelligence that concerns the construction and study of systems that can learn from data. For example, a machine learning system could be trained on email messages to learn to distinguish between spam and non-spam messages. After learning, it can then be used to classify new email messages into spam and non-spam folders.

The core of machine learning deals with representation and generalization. Generalization is the ability of a learning machine to perform accurately on new, unseen examples/tasks after having experienced a learning data set. The training examples come from some generally unknown probability distribution (considered representative of the space of occurrences), and the learner has to build a general model about this space that enables it to produce sufficiently accurate predictions in previously unseen cases. Machine learning focuses on prediction, based on known properties learned from the training data.

1.1.3.1. WEKA

WEKA is a Java-based framework and GUI for machine learning algorithms. It provides a plug-in architecture for researchers to add their own techniques, with a command-line and window interface that makes it easy to apply them to your own data. You can use it to do everything from basic clustering to advanced classification, together with a lot of tools for visualizing your results.

It is heavily used as a teaching tool, but it also comes in extremely handy for prototyping and experimenting outside of the classroom. It has a strong set of preprocessing tools that make it easy to load your data in, and then you have a large library of algorithms at your fingertips, so you can quickly try out ideas until you find an approach that works for your problem. The command-line interface allows you to apply exactly the same code in an automated way for production (a short API example follows the feature list below). More features:

. WEKA includes data preprocessing tools.

. Classification/regression algorithms.

. Clustering algorithms.

. Attribute/subset evaluators and search algorithms for feature selection.

. Algorithms for finding association rules.

. Graphical user interfaces: The Explorer (exploratory data analysis), The Experimenter (experimental environment), and The Knowledge Flow (new process model inspired interface).

. WEKA is platform-independent.


. It is easily usable by people who are not data mining specialists.

. Provides flexible facilities for scripting experiments.
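
A minimal sketch of driving WEKA programmatically: load a dataset, cross-validate a J48 decision tree, then train the final model. The `spam.arff` file is an illustrative assumption; any ARFF or CSV dataset with a nominal class attribute would do.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaExample {
    public static void main(String[] args) throws Exception {
        // Load an ARFF (or CSV) dataset; the last attribute is used as the class label.
        Instances data = DataSource.read("spam.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Estimate accuracy of a C4.5-style decision tree with 10-fold cross-validation.
        J48 tree = new J48();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());

        // Build the final model on all the data for later use on unseen messages.
        tree.buildClassifier(data);
        System.out.println(tree);
    }
}
```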

1.1.3.2. Mahout

Mahout is an open source machine learning library from Apache. It primarily covers recommender engines (collaborative filtering), clustering, and classification. Mahout aims to be the machine learning tool of choice when the collection of data to be processed is very large, perhaps far too large for a single machine.

It is a framework of tools intended to be used and adapted by developers. In practical terms, the framework makes it easy to use analysis techniques to implement features such as Amazon’s “People who bought this also bought” recommendation engine on your own site (a recommender example follows the feature list below). More features:

. Mahout is scalable.

. It supports algorithms for recommendation. For example, it takes users’ behavior and from that tries to find items users might like.

. Algorithms for clustering. It takes e.g. text documents and groups them into groups of topically related documents.

. Algorithms for classification. It learns from existing categorized documents what documents of a specific category look like and is able to assign unlabeled documents to the correct category.

. Algorithms for frequent itemset mining. It takes a set of item groups (terms in a query session, shopping cart contents) and identifies which individual items usually appear together.
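
A minimal sketch of a user-based recommender with Mahout's Taste API ("people who behave like you also liked ..."); the `ratings.csv` file of `userID,itemID,preference` lines and the neighborhood size are illustrative assumptions.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutRecommenderExample {
    public static void main(String[] args) throws Exception {
        // ratings.csv lines look like: userID,itemID,preference
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // Compare users by their rating behavior and keep the 10 most similar neighbors.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 5 item recommendations for user 42.
        List<RecommendedItem> items = recommender.recommend(42, 5);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}
```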

WEKA
  Advantages: Free availability under the GNU General Public License; portability, since it is fully implemented in Java; ease of use due to its graphical user interfaces; it provides access to SQL databases using Java Database Connectivity (JDBC) and can process the result returned by a database query.
  Disadvantages: It is not capable of multi-relational data mining; sequence modeling is not covered; in experiments that involve a very large amount of data (millions of instances), processing can take a long time.

Mahout
  Advantages: Open source; scalable; it can process very large data quantities; it has functionality for many of today’s common machine learning tasks.
  Disadvantages: Mahout is merely a library of algorithms, not a product.

Table 3: Machine learning

1.1.4. Visualization

Visualization tools allow you to gain deeper insights from data stored in Hadoop. Including these tools in the analysis reveals patterns and associations that would otherwise be missed.


1.1.4.1. Fusion Tables

Google has created an integrated online system that lets you store large amounts of data in spreadsheet-like tables and gives you tools to process and visualize the information. It is particularly good at turning geographic data into compelling maps, with the ability to upload your own custom KML (an XML notation for expressing geographic annotation and visualization within Internet-based, two-dimensional maps and three-dimensional Earth browsers) outlines for areas like political constituencies. There is also a full set of traditional graphing tools, as well as a wide variety of options to perform calculations on your data.

Fusion Tables is a powerful system, but it is definitely aimed at fairly technical users; the sheer variety of controls can be intimidating at first. If you are looking for a flexible tool to make sense of large amounts of data, it is worth making the effort. More features:

. Fusion Tables is an experimental data visualization web application to gather, visualize, and share larger data tables.

. Fusion Tables lets you visualize bigger table data online: filter and summarize across hundreds of thousands of rows, then try a chart, map, network graph, or custom layout and embed or share it. Merge two or three tables to generate a single visualization.

. Combine it with other data on the web.

. Make a map in minutes.

. Host data online.

1.1.4.2. Tableau

Originally a traditional desktop application for drawing graphs and visualizations, Tableau has been adding a lot of support for online publishing and content creation. Its embedded graphs have become very popular with news organizations on the Web, illustrating a lot of stories. The support for geographic data is not as extensive as Fusion Tables', but Tableau is capable of creating some map styles that Google's product cannot produce. More features:

. With Tableau Public, interactive visuals can be created and published without the help of programmers.

. It offers hundreds of visualization types, such as maps, bar and line charts, lists, and heat maps.

. Tableau Public is automatically touch-optimized for Android and iPad tablets. It supports all browsers without plug-ins.

Fusion Tables
  Advantages: Good at turning geographic data into compelling maps, with the ability to upload your own custom KML; it offers spatial query processing and very thorough Google Maps integration.
  Disadvantages: Access must be authenticated; there is no organization to datasets.

Tableau
  Advantages: It brings data in fast thanks to its in-memory analytical engine; it has native connectors to Cloudera Impala and Cloudera Hadoop, DataStax Enterprise, Hortonworks and the MapR Hadoop Distribution for Hadoop reporting and analysis; it has powerful visualization capabilities that let you create maps, charts and dashboards easily.
  Disadvantages: It is not open source.

Table 4: Visualization

1.1.5. Search

Search is well suited to leverage a lot of different types of information, especially unstructured information. One of the first things any organization is going to want to do once it accumulates a mass of Big Data is search it.

1.1.5.1. Lucene

Lucene is a Java-based search library. It has an architecture that employs best-practice relevancy ranking and querying, as well as state-of-the-art text compression and a partitioned index strategy to optimize both query performance and indexing flexibility (an indexing/search example follows the feature list below). More features:

. Speed — sub-second query performance for most queries.

. Complete query capabilities: keyword, Boolean and +/- queries, proximity operators, wildcards, fielded searching, term/field/document weights, find-similar, spell-checking, multi-lingual search and others.

. Full results processing, including sorting by relevancy, date or any field, dynamic summaries and hit highlighting.

. Portability: runs on any platform supporting Java, and indexes are portable across platforms – you can build an index on Linux and copy it to a Microsoft Windows machine and search it there.

. Scalability — there are production applications in the hundreds of millions and billions of documents/records.

. Low overhead indexes and rapid incremental indexing.
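
A minimal sketch of the index/search cycle with the Lucene Java API (written against Lucene 8.x-style classes; older releases differ slightly): analyze and index two documents, then run a parsed query and print the hits ranked by relevance.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneExample {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory index = new ByteBuffersDirectory();  // in-memory index, just for the demo

        // Index a couple of documents with a full-text "body" field.
        try (IndexWriter writer = new IndexWriter(index, new IndexWriterConfig(analyzer))) {
            for (String text : new String[]{"hadoop distributed file system",
                                            "lucene full text search library"}) {
                Document doc = new Document();
                doc.add(new TextField("body", text, Field.Store.YES));
                writer.addDocument(doc);
            }
        }

        // Parse a user query and retrieve matching documents ranked by relevance.
        Query q = new QueryParser("body", analyzer).parse("search AND library");
        try (DirectoryReader reader = DirectoryReader.open(index)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            for (ScoreDoc hit : searcher.search(q, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("body"));
            }
        }
    }
}
```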

1.1.5.2. Solr

Solr is a standalone enterprise search server with a REST-like API. You put documents in it (called "indexing") via XML, JSON, CSV or binary over HTTP, and you query it via HTTP GET and receive XML, JSON, CSV or binary results. Solr is highly scalable, providing distributed search and index replication (a SolrJ example follows the feature list below). More features:

. Advanced full-text search capabilities.

. Optimized for high volume web traffic.

. Standards based open interfaces - XML, JSON and HTTP.

. Comprehensive HTML administration interfaces.

. Server statistics exposed over JMX for monitoring.

. Linearly scalable, auto index replication, auto failover and recovery.

. Near real-time indexing.

. Flexible and adaptable with XML configuration.


. Extensible plugin architecture.
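
A minimal sketch using SolrJ (the Java client, 6.x-style `HttpSolrClient.Builder`), which wraps the HTTP indexing and query calls described above; the core name `articles` and its `id`/`title` fields are illustrative assumptions about the schema.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class SolrExample {
    public static void main(String[] args) throws Exception {
        // Point the client at a running Solr core/collection named "articles".
        try (HttpSolrClient solr =
                     new HttpSolrClient.Builder("http://localhost:8983/solr/articles").build()) {

            // Index a document over HTTP and make it visible to searches.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            doc.addField("title", "Hadoop ecosystem overview");
            solr.add(doc);
            solr.commit();

            // Query it back; the same search could be issued as a plain HTTP GET returning XML/JSON.
            QueryResponse response = solr.query(new SolrQuery("title:hadoop"));
            for (SolrDocument d : response.getResults()) {
                System.out.println(d.getFieldValue("id") + " : " + d.getFieldValue("title"));
            }
        }
    }
}
```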

Lucene
  Advantages: It is the core search library (a library for indexing and searching text).
  Disadvantages: ACID (or near-ACID) behavior is not guaranteed; a crash while writing to a Lucene index might render it useless.

Solr
  Advantages: It is the logical starting point for developers building search applications; it is good at reads.
  Disadvantages: Documents are updated as a whole rather than individual fields (so when you have a million documents that say "German" and should say "French", you have to reindex a million documents); it takes too long to update and commit.

Table 5: Search