Making Big Data, small


MAKING BIG DATA, SMALL

Using distributed systems for processing, analysing and managing large data sets

Marcin Jedyk

Software Professional’s Network, Cheshire Datasystems Ltd

WARM-UP QUESTIONS

How many of you heard about Big Data before?

How many about NoSQL?

Hadoop?

AGENDA

Intro – motivation, goal and ‘not about…’

What is Big Data?

NoSQL and systems classification

Hadoop & HDFS

MapReduce & live demo

HBase

AGENDA

Pig

Building Hadoop cluster

Conclusions

Q&A

MOTIVATION

Data is everywhere – why not analyse it?

With Hadoop and NoSQL systems, building distributed systems is easier than ever before

Relying on software & cheap hardware rather than on expensive hardware works better!

GOAL

To explain basic ideas behind Big Data

To present different approaches towards Big Data

To show that Big Data systems are easy to build

To show you where to start with such systems

WHAT IS IT NOT ABOUT?

Not a detailed lecture on a single system

Not about advanced techniques in Big Data

Not only about technology – but also about its application

WHAT IS BIG DATA?

Data characterised by 3 Vs:

Volume

Variety

Velocity

The interesting ones: variety & velocity

WHAT IS BIG DATA?

Data of high velocity: cannot store it? Process it on the fly!

Data of high variety: doesn’t fit into a relational schema? Don’t use a schema, use NoSQL!

Data which is impractical to process on a single server

NOSQL

Goes hand in hand with Big Data

NoSQL – an umbrella term for non-relational databases and data storages

It’s not always possible to replace an RDBMS with NoSQL! (the opposite is also true)

NOSQL

NoSQL DBs are built around different principles

Key-value stores: Redis, Riak

Document stores: e.g. MongoDB – each record is a document carrying its own meta-data (JSON-like, stored as BSON)

Table stores: e.g. HBase – data persisted in multiple columns (even millions of them), billions of rows and multiple versions of records
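To make the document-store idea concrete, here is a minimal sketch in plain Python – dicts stand in for MongoDB-style documents, and all field names and values are made up for illustration. The point is that two entries in the same collection can carry completely different fields, since each record’s keys act as its own meta-data:

```python
# Sketch: why "schemaless" document stores suit data of high variety.
# Plain dicts stand in for documents; no shared schema is enforced.

collection = []

def insert(doc):
    """Add a document to the collection; fields may differ per record."""
    collection.append(dict(doc))

def find(field, value):
    """Return the documents that happen to carry the given field/value."""
    return [d for d in collection if d.get(field) == value]

# Two records, two different shapes – both are valid entries:
insert({"user": "alice", "email": "alice@example.com"})
insert({"user": "bob", "tags": ["admin"], "last_login": "2012-06-01"})
```

A relational table would force both rows into one fixed set of columns; here each document simply carries whatever fields it has.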

HADOOP

Existed before the ‘Big Data’ buzzword emerged

A simple idea – MapReduce

A primary purpose – to crunch tera- and petabytes of data

HDFS as underlying distributed file system

HADOOP – ARCHITECTURE BY EXAMPLE

Imagine you need to process 1TB of logs

What would you need?

A server!

HADOOP – ARCHITECTURE BY EXAMPLE

But 1TB is quite a lot of data… we want it quicker!

Ok, what about distributed environment?

HADOOP – ARCHITECTURE BY EXAMPLE

So what about that Hadoop stuff?

Each node can store data & process it (DataNode & TaskTracker)

HADOOP – ARCHITECTURE BY EXAMPLE

How about allocating jobs to slaves? We need a JobTracker!

HADOOP – ARCHITECTURE BY EXAMPLE

How about HDFS – how are data blocks assembled into files?

The NameNode does it.

HADOOP – ARCHITECTURE BY EXAMPLE

NameNode – manages HDFS metadata, doesn’t deal with files directly

JobTracker – schedules, allocates and monitors job execution on slaves – TaskTrackers

TaskTracker – runs MapReduce operations

DataNode – stores blocks of HDFS – default replication level for each block: 3
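The replication level mentioned above is a per-cluster setting. A minimal `hdfs-site.xml` fragment (a sketch assuming a Hadoop 1.x-era setup, not a complete configuration) that sets it explicitly might look like:

```xml
<!-- hdfs-site.xml: per-cluster HDFS settings -->
<configuration>
  <property>
    <!-- how many DataNodes hold a copy of each HDFS block -->
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```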

HADOOP - LIMITATIONS

DataNodes & TaskTrackers are fault tolerant

NameNode & JobTracker are NOT! (a workaround exists for this problem)

HDFS deals nicely with large files, but doesn’t do well with billions of small files

MAPREDUCE

MapReduce – a parallelisation approach

Two main stages:

Map – does an actual bit of work, e.g. extracts info

Reduce – summarises, aggregates or filters the outputs of Map operations

For each job there are multiple Map and Reduce operations – each may run on a different node = parallelism

MAPREDUCE – AN EXAMPLE

Let’s process 1TB of raw logs and extract traffic by host.

After submitting a job, the JobTracker allocates tasks to slaves – 1TB divided into 64MB blocks = 16,384 Map operations!

Map – analyse the logs and return the result as a set of <key,value> pairs

Reduce – merge the outputs of the Map operations

MAPREDUCE – AN EXAMPLE

Take a look at mocked log extract:

[IP – bandwidth]

10.0.0.1 – 1234

10.0.0.1 – 900

10.0.0.2 – 1230

10.0.0.3 – 999

MAPREDUCE – AN EXAMPLE

It’s important to define the key – in this case the IP. A Map operation over the extract above returns:

<10.0.0.1;2134>

<10.0.0.2;1230>

<10.0.0.3;999>

Now, assume another Map operation returned:

<10.0.0.1;1500>

<10.0.0.3;1000>

<10.0.0.4;500>

MAPREDUCE – AN EXAMPLE

Now, Reduce will merge those results:

<10.0.0.1;3634>

<10.0.0.2;1230>

<10.0.0.3;1999>

<10.0.0.4;500>
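The whole example can be simulated in a few lines of single-process Python. This is a sketch of the logic only – a real Hadoop job implements Mapper and Reducer classes (typically in Java), and the function names here are made up:

```python
# Single-process sketch of the traffic-by-host MapReduce job above.
# map_logs plays the Map role (emit <IP, bytes> pairs, pre-summed per
# input split); reduce_traffic plays the Reduce role (merge pair sets).
from collections import defaultdict

def map_logs(lines):
    """Map: parse raw log lines of the form 'IP - bytes' into <IP, bytes> pairs."""
    totals = defaultdict(int)
    for line in lines:
        ip, _, bandwidth = line.split()   # e.g. '10.0.0.1 - 1234'
        totals[ip] += int(bandwidth)
    return dict(totals)

def reduce_traffic(map_outputs):
    """Reduce: merge the <IP, bytes> pairs produced by each Map operation."""
    merged = defaultdict(int)
    for output in map_outputs:
        for ip, total in output.items():
            merged[ip] += total
    return dict(merged)

# The mocked log extract from the slides, as one 64MB-style split:
split_1 = ["10.0.0.1 - 1234", "10.0.0.1 - 900",
           "10.0.0.2 - 1230", "10.0.0.3 - 999"]
out_1 = map_logs(split_1)                       # {'10.0.0.1': 2134, ...}
# Output of another (hypothetical) Map operation on a second split:
out_2 = {"10.0.0.1": 1500, "10.0.0.3": 1000, "10.0.0.4": 500}
merged = reduce_traffic([out_1, out_2])         # total traffic per IP
```

On a cluster, each `map_logs` call would run on a different node against its own split, and the framework would route all pairs sharing a key to the same Reduce operation.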

MAPREDUCE

Selecting a key is important

It’s possible to define a composite key, e.g. IP+date

For more complex tasks, it’s possible to chain MapReduce jobs
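A composite key changes only what the Map stage emits. Sketched in Python, with tuples standing in for Hadoop’s composite keys and a hypothetical (ip, date, bytes) record layout:

```python
# Sketch of a composite key: keying Map output by (IP, date) instead of
# IP alone yields traffic per host per day. Python tuples stand in for
# Hadoop's composite WritableComparable keys.
from collections import defaultdict

def map_by_ip_and_date(records):
    """Map: emit <(IP, date), bytes> pairs from (ip, date, bytes) records."""
    totals = defaultdict(int)
    for ip, date, nbytes in records:
        totals[(ip, date)] += nbytes
    return dict(totals)

records = [("10.0.0.1", "2012-06-01", 1234),
           ("10.0.0.1", "2012-06-02", 900),
           ("10.0.0.1", "2012-06-01", 100)]
daily = map_by_ip_and_date(records)   # one total per (IP, date) pair
```

The Reduce stage stays exactly the same as before – it merges pairs by key, and neither knows nor cares that the key is now a tuple.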

HBASE

Another layer on top of Hadoop/HDFS

A distributed data storage

Not a replacement for RDBMS!

Can be used with MapReduce

Good for unstructured data – no need to worry about the exact schema in advance

PIG – HBASE ENHANCEMENT

HBase – lacks a proper query language

Pig – makes life easier for HBase users

Translates queries into MapReduce jobs

When working with Pig or HBase, forget what you know about SQL – it makes your life easier

BUILDING HADOOP CLUSTER

Post-production servers are OK

Don’t take ‘cheap hardware’ too literally

Good connection between nodes is a must!

>=1Gbps between nodes

>=10Gbps between racks

1 disk per CPU core

More RAM, more caching!

FINAL CONCLUSIONS

Hadoop and NoSQL-like DBs/DSs scale very well

Hadoop is ideal for crunching huge data sets

It does very well in production environments

The cluster of slaves is fault tolerant; the NameNode and JobTracker are not!

EXTERNAL RESOURCES

Trending Topic – built on Wikipedia access logs: http://goo.gl/BWWO1

Building web crawler with Hadoop: http://goo.gl/xPTlJ

Analysing adverse drug events: http://goo.gl/HFXAx

Moving average for large data sets: http://goo.gl/O4oml

EXTERNAL RESOURCES – USEFUL LINKS

http://www.slideshare.net/fullscreen/jpatanooga/la-hug-dec-2011-recommendation-talk/1

https://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation+Guide

http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html

http://hstack.org/hbase-performance-testing/

http://www.theregister.co.uk/2012/06/12/hortonworks_data_platform_one/

http://wiki.apache.org/hadoop/MachineScaling

http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf

http://www.cloudera.com/resource-types/video/

http://hstack.org/why-were-using-hbase-part-2/

QUESTIONS?
