BigData, NoSQL, Hadoop. Part I: What? How? What for?

Kacper Szkudlarek, Openlab fellow
CERN - European Organisation for Nuclear Research
EN-ICE-SCD Industrial Controls & Engineering, SCADA Systems
Email: [email protected]
by: Piotr Golonka, Manuel Gonzalez Berges
What we are going to talk about:
• Today:
– BigData
– NoSQL – Not Only SQL
– Hadoop – what is it all about?
– HDFS/MapR – distributed file systems, the base of everything

• Next ICETea:
– MapReduce – a new paradigm for data processing
– Hadoop ecosystem tools
– Other NoSQL systems
BigData
• A combination of old and new technologies that makes it possible to:
– Manage huge volumes of data
– Achieve the right processing speed
– Work within the right time frame to allow real-time analysis and reactions

• Designed for all types of data:
• Structured: pre-defined schema. Example: a relational database.

• Semi-structured: inconsistent structure, cannot be stored in rows and tables. Example: logs, tweets, sensor feeds.

• Unstructured: full or partial lack of structure. Example: free-form text, reports.
The BigData characteristics

• The so-called 3 "V"s:
– Volume: petabytes and exabytes of data (in a limited number of files)
– Variety: any imaginable type of data
– Velocity: the speed at which data is collected
NoSQL
Not only SQL
What is NoSQL
• Next-generation databases addressing new needs:
– Non-relational
– Distributed
– Open-source
– Horizontally scalable

• Systems providing mechanisms for Big Data processing
• A new approach to storing huge amounts of data
– Not necessarily structured data
– Kept in many formats (e.g. key-value pairs, objects, trees, …)

• Fast processing focused on data analytics
NoSQL examples
Divided by Data Model
1: Key-value
• Hash-map-like data layout, persisted to a distributed file system.
• Examples: Project Voldemort, Riak
Key          Value
12345        Some data
ABCD         Other data
2014.06.19   Yet another data
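The key-value model above can be sketched as a plain hash map; this is a minimal in-memory illustration, not the API of Riak or Voldemort, which add persistence, partitioning and replication on top of the same idea.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the key-value model: opaque keys map to opaque
// values, and the store imposes no schema on either side.
public class KeyValueSketch {
    public static void main(String[] args) {
        Map<String, String> store = new HashMap<>();
        store.put("12345", "Some data");
        store.put("ABCD", "Other data");
        store.put("2014.06.19", "Yet another data");

        // Lookup is by key only; there is no query language over values.
        System.out.println(store.get("ABCD")); // prints "Other data"
    }
}
```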
2: Document
• The database is a storage for a mass of documents.
• Each document is a different data structure
– No set schema

• Examples: MongoDB, CouchDB.

{ _id: 101, type: "fruit", item: "jkl", qty: 10, price: 4.25,
  memos: [
    { memo: "on time", by: "payment" },
    { memo: "delayed", by: "shipping" }
  ] }
3: Column-family
• Stores multiple aggregates
– Identified by row id and column family name
– More complex data model
– Gains on data retrieval

• Examples: Apache HBase, Cassandra.
Row id: 12345
  Column family 1: Name: Kacper; Surname: Szkudlarek
  Column family 2: City: Saint-Genis-Pouilly; Street: Rue du Bordeau; Postal code: 01630
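The addressing scheme above can be sketched as nested maps; the family names ("person", "address") are illustrative, not taken from any real HBase schema.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the column-family model: a cell is addressed by
// (row id, column family, column name).
public class ColumnFamilySketch {
    public static void main(String[] args) {
        Map<String, Map<String, Map<String, String>>> table = new HashMap<>();

        Map<String, String> person = new HashMap<>();
        person.put("Name", "Kacper");
        person.put("Surname", "Szkudlarek");

        Map<String, String> address = new HashMap<>();
        address.put("City", "Saint-Genis-Pouilly");
        address.put("Postal code", "01630");

        Map<String, Map<String, String>> row = new HashMap<>();
        row.put("person", person);   // column family 1
        row.put("address", address); // column family 2
        table.put("12345", row);

        // Related columns live in one family and are fetched together,
        // which is where the gain on data retrieval comes from.
        System.out.println(table.get("12345").get("address").get("City"));
    }
}
```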
4: Graph
• Models relations between the data
– Data decomposition.

• Example: Neo4j
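A graph store makes the relationships first-class: traversal follows edges rather than joining tables. A minimal adjacency-list sketch (node names and relation types are invented for illustration, not Neo4j's API):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the graph model: each node carries a list of
// (relation type, target node) edges.
public class GraphSketch {
    public static void main(String[] args) {
        Map<String, List<String[]>> edges = new HashMap<>();
        edges.computeIfAbsent("Alice", k -> new ArrayList<>())
             .add(new String[]{"KNOWS", "Bob"});
        edges.computeIfAbsent("Bob", k -> new ArrayList<>())
             .add(new String[]{"WORKS_AT", "CERN"});

        // Traversal: walk the outgoing edges of a node directly.
        for (String[] rel : edges.get("Alice")) {
            System.out.println("Alice " + rel[0] + " " + rel[1]);
        }
    }
}
```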
Relaxed data consistency
• No ACID (atomicity, consistency, isolation, durability) in the sense known from relational databases
– Exception: graph DBs, due to data decomposition

• No real need for transactions
– Data is kept aggregated
– An aggregate update is atomic.
Want more information?
• https://www.youtube.com/watch?v=qI_g07C_Q5I
Hadoop = distributed FS + clustering job scheduler + MapReduce
What is Hadoop?
• Apache-licensed software
• Batch processing system for a cluster of nodes
• The underpinning of Big Data processing systems
– Storing huge amounts of data
– Fast local processing, split into chunks

• Can work on any modern desktop PC as a node
– Decent, automatic scalability

• Core and main API written in Java (unfortunately)
Who uses Hadoop? (in one form or another)
The new Hadoop paradigms

• Process data locally
• Reduce dependence on bandwidth
• Expect/accept failure
– Handle failover elegantly

• Duplicate finite blocks of data to small groups of nodes (rather than the entire database)
• Reduce elapsed seek time
• Reduce the cost of data processing
Source: http://bitquill.net/blog/?tag=hadoop
The Hadoop Approach
• Distribute large amounts of data across thousands of commodity hardware nodes
– Process data in parallel
– Replicate data across the cluster for reliability

• Analysis is moved to the data
– Avoids copying the data

• Data is scanned sequentially
– Avoids random seeks
– The easiest way to process it
The Ecosystem of Projects associated with Hadoop
• Data management: HDFS (Hadoop Distributed File System), YARN (NextGen MapReduce)
• Data access: MapReduce (batch), Pig (script), Hive (SQL), HBase (NoSQL), Storm (stream), others
• Integration: Sqoop, Flume, NFS, WebHDFS
• Operations: Zookeeper (monitoring), Oozie (scheduling)
Hadoop and Java
• The core of Hadoop and the base projects are developed in Java
• All APIs (Mapper, Reducer, HDFS and so on) are based on Java interfaces
• Other languages can be used to define certain jobs or parts of jobs
HDFS and other distributed file systems
What is HDFS?
• The standard Hadoop Distributed File System
• A logical file system
• The primary storage system for Hadoop
• Specialized for read access
• Can handle enormous files (> 100 TB)
• Currently deployed only on Linux
HDFS Characteristics

• Persistent
• Replicated
• Linearly scalable
• Applications sequentially stream reads
– Often from very large files

• Optimized for read performance
– Avoids random disk seeks

• Write once, read many times
• Files are append-only
• Data stored in blocks
– Distributed over many nodes
– Block sizes often range from 128 MB to 1 GB
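Block-based storage means a file's size determines how many blocks the cluster must place. A small sketch of that arithmetic (the 1 GB file size is an assumed example; the 128 MB block size follows the slide):

```java
// Sketch: how a file is split into fixed-size blocks, HDFS-style.
public class BlockSplitSketch {
    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // 128 MB block size
        long fileSize  = 1_000_000_000L;       // ~1 GB file (assumed)

        long fullBlocks  = fileSize / blockSize;
        long remainder   = fileSize % blockSize;  // last, partial block
        long totalBlocks = fullBlocks + (remainder > 0 ? 1 : 0);

        // 7 full blocks plus one partial block
        System.out.println(totalBlocks); // prints 8
    }
}
```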
HDFS Architecture
(Diagram:)
• NameNode: keeps the namespace metadata image (checkpoint), the edit journal log, and the namespace block map.
• Secondary NameNode: stores a backup of the checkpoint image and edit journal log.
• DataNodes: hold the replicated data blocks (e.g. BL1 and BL7 on one node; BL1, BL6, BL2 and BL7 on another).
Logical File System
• A file's disk blocks are not physically contiguous
– They are distributed around many DataNodes

• The data is only logically contiguous
• The read/write mechanism is transparent to the user
Data Organization

• Metadata
– Organized into files and directories
– Linux-like permissions prevent accidental deletions

• Files
– Divided into uniform-sized blocks
– Default 64 MB
– Distributed across the cluster
– Rack-aware (HA, minimization of out-of-rack data transfers)

• Checksumming
– Corruption detection
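Corruption detection boils down to: store a checksum per block at write time, recompute on read, compare. A self-contained sketch using CRC32 (HDFS keeps CRC checksums per chunk; the data here is illustrative):

```java
import java.util.zip.CRC32;

// Sketch of block-level corruption detection with a per-block checksum.
public class ChecksumSketch {
    static long checksum(byte[] block) {
        CRC32 crc = new CRC32();
        crc.update(block);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] block = "some block of file data".getBytes();
        long stored = checksum(block);        // computed at write time

        // On read, a mismatch means this replica is corrupt and the
        // block must be re-fetched from another DataNode.
        block[0] ^= 0xFF;                     // simulate bit rot
        System.out.println(stored == checksum(block)); // prints "false"
    }
}
```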
HDFS Cluster (I)
• HDFS runs in Hadoop distributed mode
• 3 main components:
– NameNode (possibly with a secondary NameNode):
• Manages the DataNodes
• Keeps metadata for all nodes and blocks
• No automatic failover (even with a secondary NameNode)
• Backs up the logs
HDFS Cluster (II)
• DataNodes
– Hold the data blocks
– Slaves in the hierarchy
– Manage blocks for HDFS
– If the heartbeat fails:
• The node is removed from the cluster
• Replicated blocks take over

• Client
– Talks directly to the NameNode, then to the DataNodes
(Diagram: the NameNode keeps the fsimage and edit log; the DataNode daemons send heartbeats to the NameNode.)
File Access – RPC

(Diagram: client code such as Pig, Hive, HBase or fsshell runs in a JVM and uses the DistributedFileSystem and FSDataOutputStream classes to talk to the NameNode, which keeps the namespace and block map, and to the DataNodes.)

1. Request (create/open/delete)
• Provides the name of the file or directory
2. Approval
3. Request for a block
4. Block ID and a list of DataNodes
5. Operation on a DataNode
• Read
• Write
• Delete
6. Return

Note:
• The NameNode is not in the data path
• The NameNode only stores metadata
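The six steps can be simulated in a few lines: the client asks the NameNode (metadata only) where a block lives, then reads the bytes directly from a DataNode. All names, block IDs and data below are invented for illustration; this is not the HDFS client API.

```java
import java.util.List;
import java.util.Map;

// Toy simulation of the HDFS file-access flow described above.
public class FileAccessSketch {
    // NameNode side: block map only, no file data
    // (the NameNode is not in the data path).
    static Map<String, List<String>> blockMap = Map.of(
        "blk_1", List.of("datanode-1", "datanode-2", "datanode-3"));

    // DataNode side: the actual block contents.
    static Map<String, byte[]> dataNode1 = Map.of(
        "blk_1", "hello hdfs".getBytes());

    public static void main(String[] args) {
        // Steps 1-4: ask the NameNode which DataNodes hold the block.
        List<String> replicas = blockMap.get("blk_1");

        // Step 5: read the block directly from the first DataNode.
        byte[] data = dataNode1.get("blk_1");

        // Step 6: return the result to the caller.
        System.out.println(replicas.get(0) + ": " + new String(data));
    }
}
```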
MapR

• An alternative to HDFS
• Built for business-critical production applications
– Commercial product
– A free-to-use version is available

• New container architecture, different from HDFS
• Implements normal files, visible in the operating system as soon as they are written; access via NFS
• Solves the synchronization problem with commodity hardware
• Reliable
Container architecture
• Chops the data of each node into thousands of pieces
• Replicates the containers across the cluster
• If a node dies, the others replicate the missing data at higher speed
HDFS vs MapR
Disclaimer: source: http://www.mapr.com/why-hadoop/why-mapr/architecture-matters
MapR advantages
• High-availability cluster
• Better performance than HDFS
– Data from the HDFS NameNode is moved into the cluster
– No file-count limitation
– Lower costs, less hardware in the cluster

• NFS interface for cluster access; behaves like a giant NFS server with full HA
• Replicated, ultra-reliable solution available in the M7 option
• Holder of the TeraSort world record (speed of writing a 1 TB file): 55 seconds (youtube link…)
Other distributed file systems
• GFS – Google File System, a proprietary file system developed for Google's own use.
• GridFS – the distributed file system used by MongoDB
es-hadoop
• A Hadoop extension for working with Elasticsearch data.
• Near real-time responses (think milliseconds).
• Dedicated input/output classes to read data into Hadoop MapReduce.
• Uses the Hadoop paradigm of local data processing:
– Each node works on the shards stored on it.

• Integration with Hadoop tools (Pig, Hive, etc.).
• Horizontal scaling of the cluster
Distributions of Hadoop
• Many different distributions are available:
– Cloudera (under testing @CERN)
• Free VM images / online live service
– Hortonworks
• Free VM images
– MapR
• Many free and paid VM machines
– Spring for Apache Hadoop

• Where to read about them?
– Online training by Hortonworks and Cloudera
To be continued…
• MapReduce – a new paradigm for data processing
• Hive – an SQL-like data access tool
• Pig – a high-level scripting tool for data processing
• HBase – a NoSQL system, a new way of thinking about databases