
Big data: current technology scope



Page 1: Big data: current technology scope

Roman Nikitchenko, 09.10.2014

BIG.DATA Technology scope

Page 2: Big data: current technology scope


Any real big data is just a DIGITAL LIFE FOOTPRINT

Page 3: Big data: current technology scope


BIG DATA is not about the data. It is about OUR ABILITY TO HANDLE THEM.

Page 4: Big data: current technology scope


Arguments for meetings with management ;-)

But we are always special, aren't we?

What is our stack of big data technologies?

Our stack

Some of our specifics

A couple of buzzwords

Page 5: Big data: current technology scope


YARN

Linear scalability: 2 times more power costs 2 times more money.

No natural keys, so load balancing is perfect.

No 'special' hardware, so staging is closer to production.

Page 6: Big data: current technology scope


HADOOP magic is here!

Page 7: Big data: current technology scope


What is HADOOP?

● Hadoop is an open source framework for big data: both distributed storage and processing.

● Hadoop is reliable and fault tolerant without relying on hardware for these properties.

● Hadoop has unique horizontal scalability: currently from a single computer up to thousands of cluster nodes.

Page 8: Big data: current technology scope


Why Hadoop?

[Slide graphic: something multiplied to the MAX equals "BIG DATA", repeated over and over.]

What is HADOOP INDEED?

Page 9: Big data: current technology scope


SIMPLE BUT RELIABLE

● A really big amount of data stored in a reliable manner.

● Storage is simple, recoverable and cheap (relatively).

● The same applies to processing power.

Page 10: Big data: current technology scope


COMPLEX INSIDE, SIMPLE OUTSIDE

● Complexity is buried inside. Most of the really complex operations are handled by the engine.

● The interface is remote and compatible between versions, so clients are relatively safe against implementation changes.

Page 11: Big data: current technology scope


DECENTRALIZED

● No single point of failure (almost).

● Scalable as close to linear as possible.

● No manual actions needed to recover in case of failures.

Page 12: Big data: current technology scope


Hadoop historical top view

● HDFS serves as the file system layer.

● MapReduce originally served as the distributed processing framework.

● The native client API is Java, but there are lots of alternatives.

● This is only the initial architecture; it is now more complex.

Page 13: Big data: current technology scope


HDFS top view

● Namenode is the 'management' component. It keeps a 'directory' of which file blocks are stored where.

● Actual work is performed by data nodes.

HDFS is... scalable

Page 14: Big data: current technology scope


● Files are stored in large enough blocks. Every block is replicated to several data nodes.

● Replication is tracked by the namenode. Clients only locate blocks using the namenode; the actual load is taken by the datanodes.

● Datanode failure leads to replication recovery. The namenode can be backed by a standby scheme.

HDFS is... reliable
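Not from the slides, just a rough sketch for illustration: reading a file through the HDFS Java client API (Hadoop 2.x style). The namenode address and file path are placeholders.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS points at the namenode; the address is a placeholder.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
        FileSystem fs = FileSystem.get(conf);

        // The client asks the namenode where the blocks are,
        // then streams the data directly from the datanodes.
        Path file = new Path("/data/events/2014-10-09.log");
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}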

Page 15: Big data: current technology scope


NO BACKUPS

Page 16: Big data: current technology scope


● Two-step data processing model: transform (map) and then reduce. Really nice for doing things in a distributed manner.

● A large class of jobs can be adapted, but not all of them.

MapReduce is...
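As a hedged illustration of the two-step model (not part of the original deck): the classic word count job written against the Hadoop MapReduce Java API. The class names are made up.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Transform (map) step: every input line becomes (word, 1) pairs.
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            context.write(word, ONE);
        }
    }
}

// Reduce step: sum the counts emitted for each word.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Jobs that fit this shape distribute well; jobs that need, say, iterative passes over shared state fit it much worse.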

Page 17: Big data: current technology scope


BIG DATA processing: requirements

● Work is to be balanced.

● Work can be shared in accordance with data placement.

● Work is to be balanced to reflect resource balance.

DISTRIBUTION: LOAD HAS TO BE SHARED

Page 18: Big data: current technology scope


DATA LOCALITY: TOOLS ARE TO BE CLOSE TO THE WORK PLACE

● Process data on the same nodes it is stored on with MapReduce.

● Distributed storage means distributed processing.

Page 19: Big data: current technology scope


DISTRIBUTION + LOCALITY: TOGETHER THEY GO!

[Slide diagram: your data is split into partitions, the work on each partition is done locally, and the results are shared and joined.]

Data partitioning drives work sharing. Good partitioning — good scalability.

Page 20: Big data: current technology scope


● A new component (YARN) forms the resource management layer and completes a real distributed data OS.

● MapReduce is from now on only one among other YARN applications.

Now with resource management

Page 21: Big data: current technology scope


● Better resource balance for heterogeneous clusters and multiple applications.

● Dynamic applications over static services.

● Much wider application model than simple MapReduce. Things like Spark or Tez.

Why is YARN SO important?

Page 22: Big data: current technology scope


The world's first ever DATA OS

A 10,000-node computer... Recent technology changes are focused on higher scale: better resource usage and control, lower MTTR, higher security, redundancy, fault tolerance.

Page 23: Big data: current technology scope


Hadoop: don't do it yourself

Page 24: Big data: current technology scope


● HortonWorks are 'barely open source'. Innovative, but 'running too fast'. Most of their key technologies are not so mature yet.

● Cloudera is stable enough but not stale. Hadoop 2.3 with YARN, HBase 0.98.x. Balance. Spark 1.x is a bold move!

● MapR focuses on performance per node, but they are slightly outdated in terms of functionality and their distribution costs money. For cases where node performance is the high priority.

Choose your destiny! We did.

Page 25: Big data: current technology scope


HBase motivation

● Designed for throughput, not for latency.

● HDFS blocks are expected to be large. There is an issue with lots of small files.

● Write once, read many times ideology.

● MapReduce is not so flexible, and neither is any database built on top of it.

● How about realtime?

But Hadoop is...

Page 26: Big data: current technology scope


HBase motivation

BUT WE OFTEN NEED...

LOW LATENCY, SPEED and all the Hadoop properties.

Page 27: Big data: current technology scope


[Slide diagram: high layer applications on top, YARN resource management in the middle, the distributed file system at the bottom.]

Page 28: Big data: current technology scope


[Slide diagram: a table split into regions; each row has a row key and columns grouped into families #1, #2, ...]

Data is placed in tables.

Tables are split into regions based on row key ranges.

Columns are grouped into families.

Every table row is identified by a unique row key.

Every row consists of columns.

Logical data model
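A minimal sketch (not from the slides) of how this logical model looks from the 0.98-era HBase Java client: one Put and one Get addressed by row key, family and qualifier. The table, family and qualifier names are illustrative only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseModelExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "users");   // hypothetical table

        // Every row is addressed by its unique row key;
        // each value lives under family:qualifier.
        Put put = new Put(Bytes.toBytes("user#42"));
        put.add(Bytes.toBytes("profile"), Bytes.toBytes("name"), Bytes.toBytes("Roman"));
        put.add(Bytes.toBytes("activity"), Bytes.toBytes("lastLogin"), Bytes.toBytes("2014-10-09"));
        table.put(put);

        // Reads go by row key as well.
        Result result = table.get(new Get(Bytes.toBytes("user#42")));
        byte[] name = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("name"));
        System.out.println(Bytes.toString(name));

        table.close();
    }
}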

Page 29: Big data: current technology scope



● Data is stored in HFiles.

● Families are stored on disk in separate files.

● Row keys are indexed in memory.

● A column includes key, qualifier, value and timestamp.

● No column limit.

● Storage is block based.

[Slide diagram: one HFile per family (family #1, family #2), each holding entries of (row key, column, value, timestamp).]

● Delete is just another marker record.

● Periodic compaction is required.

Real data model
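To make the last two points concrete, a small hedged sketch with the same 0.98-era client API: a Delete only writes a tombstone, and a major compaction is what finally rewrites the HFiles without the deleted cells. Table and row names are made up.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDeleteCompactExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // The delete is just another marker record in the store files.
        HTable table = new HTable(conf, "users");   // hypothetical table
        table.delete(new Delete(Bytes.toBytes("user#42")));
        table.close();

        // Compaction normally runs periodically on its own;
        // this just triggers a major compaction explicitly.
        HBaseAdmin admin = new HBaseAdmin(conf);
        admin.majorCompact("users");
        admin.close();
    }
}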

Page 30: Big data: current technology scope


[Slide diagram: client, ZooKeeper, Master and region servers (RS), with META and DATA request paths.]

Zookeeper coordinates distributed elements and is primary contact point for client.

Master server keeps metadata and manages data distribution over Region servers.

Region servers manage data table regions.

Clients directly communicate with region server for data.

Clients locate master through ZooKeeper then needed regions through master.

HBase: infrastructure view
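A hedged sketch of what this means for client code: the client is configured with the ZooKeeper quorum only, and the client library resolves and caches region locations before talking to region servers directly. Hostnames and the table name are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;

public class HBaseClientBootstrap {
    public static void main(String[] args) throws Exception {
        // Only the ZooKeeper quorum is needed to bootstrap the client.
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");
        conf.set("hbase.zookeeper.property.clientPort", "2181");

        // Region lookups happen behind this call; data requests then go
        // straight to the responsible region servers.
        HTable table = new HTable(conf, "users");   // hypothetical table
        System.out.println("Connected to table " + table.getName());
        table.close();
    }
}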

Page 31: Big data: current technology scope


[Slide diagram: racks of HDFS data nodes (DN) co-located with HBase region servers (RS), plus the NameNode, Master, ZooKeeper and the client, with META and DATA request paths.]

Zookeeper coordinates distributed elements and is primary contact point for client.

Master server keeps metadata and manages data distribution over Region servers.

Region servers manage data table regions.

Actual data storage service including replication is on HDFS data nodes.

Clients directly communicate with region server for data.

Clients locate master through ZooKeeper then needed regions through master.

Together with HDFS

Page 32: Big data: current technology scope


DATA LAKE

Take as much data about your business processes as you can. The more data you have, the more value you can get from it.

Page 33: Big data: current technology scope


… because coordinating distributed systems is a Zoo

Apache ZooKeeper

Page 34: Big data: current technology scope


Apache ZooKeeper

We use this guy:

● As a part of Hadoop / HBase infrastructure.

● To coordinate MapReduce job tasks.
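For illustration only (the deck does not show code): a tiny ZooKeeper client sketch of the kind of coordination mentioned above. An ephemeral node vanishes automatically if the worker that created it dies. The connect string and znode paths are made up, and the parent znodes are assumed to exist.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class TaskCoordinationExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 30000, new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                System.out.println("ZooKeeper event: " + event);
            }
        });

        // Ephemeral + sequential: the node is removed when the session dies,
        // so stalled tasks can be detected and re-assigned.
        String path = zk.create("/jobs/job-1/task-",
                "payload".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE,
                CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println("Registered task node: " + path);

        zk.close();
    }
}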

Page 35: Big data: current technology scope


Apache Spark

● Better MapReduce, with at least some MapReduce elements able to be reused.

● Dynamic, faster to start up and does not need anything from the cluster.

● New job models. Not only Map and Reduce.

● Results can be passed through memory, including the final one.
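A rough sketch (not from the slides) of the same word count written against the Spark 1.x Java API: the whole job is a chain of transformations and intermediate results stay in memory. The paths and app name are placeholders.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("word-count");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("hdfs:///data/events/2014-10-09.log");

        // Not only map and reduce: free chaining of transformations.
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")))
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);

        counts.saveAsTextFile("hdfs:///data/word-counts");
        sc.stop();
    }
}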

Page 36: Big data: current technology scope


● SOLR indexes documents. What is stored in the SOLR index is not what you index. SOLR is NOT STORAGE, ONLY AN INDEX.

● But it can index ANYTHING. The search result is a document ID.

[Slide diagram: index update and index query flows go into SOLR, search responses come back out.]

Index update requests are analyzed, tokenized, transformed... and the same goes for queries.

SOLR is just about search
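As a hedged illustration of "index only, storage elsewhere" (not from the deck): a SolrJ 4.x-style sketch that indexes a couple of fields and gets document IDs back from a query. The URL, core and field names are made up; a real setup against SOLR cloud would typically use the cloud-aware client instead.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class SolrIndexExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://solr.example.com:8983/solr/users");

        // Only the fields we choose to index go in; what comes back
        // from a search is the document ID, not the original data.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "user#42");
        doc.addField("name", "Roman");
        solr.add(doc);
        solr.commit();

        QueryResponse response = solr.query(new SolrQuery("name:Roman"));
        response.getResults().forEach(d -> System.out.println(d.getFieldValue("id")));

        solr.shutdown();
    }
}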

Page 37: Big data: current technology scope


● HBase handles online user data change requests.

● The NGData Lily indexer handles the stream of changes and transforms it into SOLR index change requests.

● Indexes are built in SOLR, so HBase data is searchable.

Page 38: Big data: current technology scope


ENTERPRISE DATA HUB

Don't ruin your existing data warehouse. Just extend it with new, centralized big data storage through a data migration solution.

Page 39: Big data: current technology scope


HBase: Data and search integration

[Slide diagram: client → HBase cluster (regions) → replication → Lily HBase NRT indexer → SOLR cloud, all coordinated by ZooKeeper, with HDFS underneath.]

● The user just puts (or deletes) data; HBase regions handle the data updates.

● Replication can be set up down to column family level.

● The Lily HBase NRT indexer translates data changes into SOLR index updates.

● SOLR cloud finally provides search and serves search requests (HTTP).

● Apache ZooKeeper does all the coordination.

● HDFS serves as the low level file system.

Page 40: Big data: current technology scope


Questions and discussion