BigData, NoSQL, Hadoop. Part I: What? How? What for?

Kacper Szkudlarek, Openlab fellow
CERN - European Organisation for Nuclear Research
EN-ICE-SCD Industrial Controls & Engineering, SCADA Systems
Email: [email protected]
by: Piotr Golonka, Manuel Gonzalez Berges
What we are going to talk about:
• Today:
– BigData
– NoSQL – Not Only SQL
– Hadoop – what is it all about?
– HDFS/MapR – distributed file systems, the base of everything

• Next ICETea:
– MapReduce – a new paradigm for data processing
– Hadoop ecosystem tools
– Other NoSQL systems
BigData
• A combination of old and new technologies that makes it possible to:
– Manage huge volumes of data
– Achieve the right processing speed
– Work within the right time frame to allow real-time analysis and reactions

• Designed for all types of data:
• Structured: pre-defined schema. Example: a relational database.

• Semi-structured: inconsistent structure, cannot be stored in rows and tables. Example: logs, tweets, sensor feeds.

• Unstructured: full or partial lack of structure. Example: free-form text, reports.
The BigData characteristics

• The so-called 3 "V"s:
– Volume: petabytes and exabytes of data (in a limited number of files)
– Variety: any imaginable type of data
– Velocity: the speed at which data is collected
NoSQL
Not only SQL
What is NoSQL
• Next-generation databases addressing new needs:
– Non-relational
– Distributed
– Open-source
– Horizontally scalable

• Systems providing mechanisms for Big Data processing
• A new approach to storing huge amounts of data
– Not necessarily structured data
– Kept in many formats (e.g. key-value pairs, objects, trees, …)

• Fast processing focused on data analytics
NoSQL examples
Divided by Data Model
1: Key-value
• Hash-map-like data layout, persisted to a distributed file system.
• Examples: Project Voldemort, Riak
Key          Value
12345        Some data
ABCD         Other data
2014.06.19   Yet another data
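The key-value model above can be sketched as a plain hash map; this is a minimal in-memory illustration, not the API of Riak or Voldemort, which add persistence, partitioning and replication on top of the same idea.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the key-value model: opaque keys map to opaque
// values, and the store imposes no schema on either side.
public class KeyValueSketch {
    public static void main(String[] args) {
        Map<String, String> store = new HashMap<>();
        store.put("12345", "Some data");
        store.put("ABCD", "Other data");
        store.put("2014.06.19", "Yet another data");

        // Lookup is by key only; there is no query language over values.
        System.out.println(store.get("ABCD")); // prints "Other data"
    }
}
```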
2: Document
• The database is a storage for a mass of documents.
• Each document is a different data structure
– No set schema

• Examples: MongoDB, CouchDB.

{ _id: 101, type: "fruit", item: "jkl", qty: 10, price: 4.25,
  memos: [
    { memo: "on time", by: "payment" },
    { memo: "delayed", by: "shipping" }
  ] }
3: Column-family
• Stores multiple aggregates
– Identified by row id and column family name
– More complex data model
– Gains on data retrieval

• Examples: Apache HBase, Cassandra.
Row id: 12345
  Column family 1: Name: Kacper; Surname: Szkudlarek
  Column family 2: City: Saint-Genis-Pouilly; Street: Rue du Bordeau; Postal code: 01630
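The addressing scheme above can be sketched as nested maps; the family names ("person", "address") are illustrative, not taken from any real HBase schema.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the column-family model: a cell is addressed by
// (row id, column family, column name).
public class ColumnFamilySketch {
    public static void main(String[] args) {
        Map<String, Map<String, Map<String, String>>> table = new HashMap<>();

        Map<String, String> person = new HashMap<>();
        person.put("Name", "Kacper");
        person.put("Surname", "Szkudlarek");

        Map<String, String> address = new HashMap<>();
        address.put("City", "Saint-Genis-Pouilly");
        address.put("Postal code", "01630");

        Map<String, Map<String, String>> row = new HashMap<>();
        row.put("person", person);   // column family 1
        row.put("address", address); // column family 2
        table.put("12345", row);

        // Related columns live in one family and are fetched together,
        // which is where the gain on data retrieval comes from.
        System.out.println(table.get("12345").get("address").get("City"));
    }
}
```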
4: Graph
• Models relations between the data
– Data decomposition.

• Example: Neo4j
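A graph store makes the relationships first-class: traversal follows edges rather than joining tables. A minimal adjacency-list sketch (node names and relation types are invented for illustration, not Neo4j's API):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the graph model: each node carries a list of
// (relation type, target node) edges.
public class GraphSketch {
    public static void main(String[] args) {
        Map<String, List<String[]>> edges = new HashMap<>();
        edges.computeIfAbsent("Alice", k -> new ArrayList<>())
             .add(new String[]{"KNOWS", "Bob"});
        edges.computeIfAbsent("Bob", k -> new ArrayList<>())
             .add(new String[]{"WORKS_AT", "CERN"});

        // Traversal: walk the outgoing edges of a node directly.
        for (String[] rel : edges.get("Alice")) {
            System.out.println("Alice " + rel[0] + " " + rel[1]);
        }
    }
}
```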
Relaxed data consistency
• No ACID (atomicity, consistency, isolation, durability) in the sense known from relational databases
– Exception: graph DBs, due to data decomposition

• No real need for transactions
– Data is kept aggregated
– An aggregate update is atomic.
Want more information?
• https://www.youtube.com/watch?v=qI_g07C_Q5I
Hadoop = distributed FS + clustering job scheduler + MapReduce
What is Hadoop?
• Apache-licensed software
• Batch processing system for a cluster of nodes
• The underpinning of Big Data processing systems
– Storing huge amounts of data
– Fast local processing, split into chunks

• Can work on any modern desktop PC as a node
– Decent, automatic scalability

• Core and main API written in Java (unfortunately)
Who uses Hadoop? (in one form or another)
The new Hadoop paradigms

• Process data locally
• Reduce dependence on bandwidth
• Expect/accept failure
– Handle failover elegantly

• Duplicate finite blocks of data to small groups of nodes (rather than the entire database)
• Reduce elapsed seek time
• Reduce the cost of data processing
Source: http://bitquill.net/blog/?tag=hadoop
The Hadoop Approach
• Distribute large amounts of data across thousands of commodity hardware nodes
– Process data in parallel
– Replicate data across the cluster for reliability

• Analysis is moved to the data
– Avoids copying the data

• Data is scanned sequentially
– Avoids random seeks
– The easiest way to process it
The Ecosystem of Projects associated with Hadoop
• Data management: HDFS (Hadoop Distributed File System), YARN (NextGen MapReduce)
• Data access: MapReduce (batch), Pig (script), Hive (SQL), HBase (NoSQL), Storm (stream), others
• Integration: Sqoop, Flume, NFS, WebHDFS
• Operations: Zookeeper (monitoring), Oozie (scheduling)
Hadoop and Java
• The core of Hadoop and the base projects are developed in Java
• All APIs (Mapper, Reducer, HDFS and so on) are based on Java interfaces
• Other languages can be used to define certain jobs or parts of jobs
HDFS and other distributed file systems
What is HDFS?
• The standard Hadoop Distributed File System
• A logical file system
• The primary storage system for Hadoop
• Specialized for read access
• Can handle enormous files (> 100 TB)
• Currently deployed only on Linux
HDFS Characteristics

• Persistent
• Replicated
• Linearly scalable
• Applications sequentially stream reads
– Often from very large files

• Optimized for read performance
– Avoids random disk seeks

• Write once, read many times
• Files are append-only
• Data stored in blocks
– Distributed over many nodes
– Block sizes often range from 128 MB to 1 GB
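Block-based storage means a file's size determines how many blocks the cluster must place. A small sketch of that arithmetic (the 1 GB file size is an assumed example; the 128 MB block size follows the slide):

```java
// Sketch: how a file is split into fixed-size blocks, HDFS-style.
public class BlockSplitSketch {
    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // 128 MB block size
        long fileSize  = 1_000_000_000L;       // ~1 GB file (assumed)

        long fullBlocks  = fileSize / blockSize;
        long remainder   = fileSize % blockSize;  // last, partial block
        long totalBlocks = fullBlocks + (remainder > 0 ? 1 : 0);

        // 7 full blocks plus one partial block
        System.out.println(totalBlocks); // prints 8
    }
}
```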
HDFS Architecture
(Diagram:)
• NameNode: keeps the namespace metadata image (checkpoint), the edit journal log, and the namespace block map.
• Secondary NameNode: stores a backup of the checkpoint image and edit journal log.
• DataNodes: hold the replicated data blocks (e.g. BL1 and BL7 on one node; BL1, BL6, BL2 and BL7 on another).
Logical File System
• A file's disk blocks are not physically contiguous
– They are distributed around many DataNodes

• The data is only logically contiguous
• The read/write mechanism is transparent to the user
Data Organization

• Metadata
– Organized into files and directories
– Linux-like permissions prevent accidental deletions

• Files
– Divided into uniform-sized blocks
– Default 64 MB
– Distributed across the cluster
– Rack-aware (HA, minimization of out-of-rack data transfers)

• Checksumming
– Corruption detection
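Corruption detection boils down to: store a checksum per block at write time, recompute on read, compare. A self-contained sketch using CRC32 (HDFS keeps CRC checksums per chunk; the data here is illustrative):

```java
import java.util.zip.CRC32;

// Sketch of block-level corruption detection with a per-block checksum.
public class ChecksumSketch {
    static long checksum(byte[] block) {
        CRC32 crc = new CRC32();
        crc.update(block);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] block = "some block of file data".getBytes();
        long stored = checksum(block);        // computed at write time

        // On read, a mismatch means this replica is corrupt and the
        // block must be re-fetched from another DataNode.
        block[0] ^= 0xFF;                     // simulate bit rot
        System.out.println(stored == checksum(block)); // prints "false"
    }
}
```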
HDFS Cluster (I)
• HDFS runs in Hadoop distributed mode
• 3 main components:
– NameNode (possibly with a secondary NameNode):
• Manages the DataNodes
• Keeps metadata for all nodes and blocks
• No automatic failover (even with a secondary NameNode)
• Backs up the logs
HDFS Cluster (II)
• DataNodes
– Hold the data blocks
– Slaves in the hierarchy
– Manage blocks for HDFS
– If the heartbeat fails:
• The node is removed from the cluster
• Replicated blocks take over

• Client
– Talks directly to the NameNode, then to the DataNodes
(Diagram: the NameNode keeps the fsimage and edit log; the DataNode daemons send heartbeats to the NameNode.)
File Access – RPC

(Diagram: client code such as Pig, Hive, HBase or fsshell runs in a JVM and uses the DistributedFileSystem and FSDataOutputStream classes to talk to the NameNode, which keeps the namespace and block map, and to the DataNodes.)

1. Request (create/open/delete)
• Provides the name of the file or directory
2. Approval
3. Request for a block
4. Block ID and a list of DataNodes
5. Operation on a DataNode
• Read
• Write
• Delete
6. Return

Note:
• The NameNode is not in the data path
• The NameNode only stores metadata
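The six steps can be simulated in a few lines: the client asks the NameNode (metadata only) where a block lives, then reads the bytes directly from a DataNode. All names, block IDs and data below are invented for illustration; this is not the HDFS client API.

```java
import java.util.List;
import java.util.Map;

// Toy simulation of the HDFS file-access flow described above.
public class FileAccessSketch {
    // NameNode side: block map only, no file data
    // (the NameNode is not in the data path).
    static Map<String, List<String>> blockMap = Map.of(
        "blk_1", List.of("datanode-1", "datanode-2", "datanode-3"));

    // DataNode side: the actual block contents.
    static Map<String, byte[]> dataNode1 = Map.of(
        "blk_1", "hello hdfs".getBytes());

    public static void main(String[] args) {
        // Steps 1-4: ask the NameNode which DataNodes hold the block.
        List<String> replicas = blockMap.get("blk_1");

        // Step 5: read the block directly from the first DataNode.
        byte[] data = dataNode1.get("blk_1");

        // Step 6: return the result to the caller.
        System.out.println(replicas.get(0) + ": " + new String(data));
    }
}
```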
MapR

• An alternative to HDFS
• Built for business-critical production applications
– Commercial product
– A free-to-use version is available

• New container architecture, different from HDFS
• Implements normal files, visible in the operating system as soon as they are written; access via NFS
• Solves the synchronization problem with commodity hardware
• Reliable
Container architecture
• Chops the data of each node into thousands of pieces
• Replicates the containers across the cluster
• If a node dies, the others replicate the missing data at higher speed
HDFS vs MapR
Disclaimer: source: http://www.mapr.com/why-hadoop/why-mapr/architecture-matters
MapR advantages
• High-availability cluster
• Better performance than HDFS
– Data from the HDFS NameNode is moved into the cluster
– No file-count limitation
– Lower costs, less hardware in the cluster

• NFS interface for cluster access; behaves like a giant NFS server with full HA
• Replicated, ultra-reliable solution available in the M7 option
• Holder of the TeraSort world record (speed of writing a 1 TB file): 55 seconds (youtube link…)
Other distributed file systems
• GFS – Google File System, a proprietary file system developed for Google's own use.
• GridFS – the distributed file system used by MongoDB
es-hadoop
• A Hadoop extension for working with Elasticsearch data.
• Near real-time responses (think milliseconds).
• Dedicated input/output classes to read data into Hadoop MapReduce.
• Uses the Hadoop paradigm of local data processing:
– Each node works on the shards stored on it.

• Integration with Hadoop tools (Pig, Hive, etc.).
• Horizontal scaling of the cluster
Distributions of Hadoop
• Many different distributions are available:
– Cloudera (under testing @CERN)
• Free VM images / online live service
– Hortonworks
• Free VM images
– MapR
• Many free and paid VM machines
– Spring for Apache Hadoop

• Where to read about them?
– Online training by Hortonworks and Cloudera
To be continued…
• MapReduce – a new paradigm for data processing
• Hive – an SQL-like data access tool
• Pig – a high-level scripting tool for data processing
• HBase – a NoSQL system, a new way of thinking about databases