Building the Infrastructure for Big Data
@ The Fifth Elephant, July 27th, 2012
Prashant Kumar, Founder, PromptCloud
© PromptCloud 2012, All rights reserved

Building infrastructure for Big Data

DESCRIPTION

This deck gives an overview of common pain points in building infrastructure for big data, along with solutions to them.

Agenda

About

Context

Machines, Installation & Cloud Automation

Building blocks of a system

Sample application sketch

Components skipped for lack of time

About

Section 0

About PromptCloud

How??

• Large-scale data crawl and extraction

• Hosted indexing

• Custom data analytics

• Working round the clock

We provide data feeds, and feed ourselves on data, since 2009

About Me

• PromptCloud’s Founder

• Yahoo!, 2007-2008

• IIT-Kanpur CS, 2007

Deliverable

Context

Section 0.1

Generic Big Data Systems

• Multiple nodes (incoherent set of coherent ones)

• Compute layer: interdependent processes

• Data storage layer & multiple middleware

• Tools for installation, monitoring & scheduling

* Meta: source control, code reviews, continuous integration

Machines, Installation & Cloud Automation

Section 1

Installation

Create an image and install

• Easy to install

• No maintenance cost

• 1 image for 1 purpose

• Modifications? Difficult to save them back into the image

• Apt, yum, and etckeeper-like systems help, but are difficult to scale

Solutions??

Enter the Magic!

Not a panacea; analgesic though

Virtual Machines

AWS, Xen, KVM,…

Virtual Box Installation

Vagrant

• init

• Shared directory

• Port forwarding

• up, ssh
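
The Vagrant steps above (init, shared directory, port forwarding, up, ssh) map roughly onto a Vagrantfile like the sketch below. It uses the current `Vagrant.configure("2")` syntax, which differs from the Vagrant release available in 2012. The box name, folder paths, and port numbers are placeholders, not values from the talk.

```ruby
# Vagrantfile (illustrative sketch; box, paths, and ports are placeholders)
Vagrant.configure("2") do |config|
  # Base image for the VirtualBox VM.
  config.vm.box = "hashicorp/precise64"

  # Shared directory: host ./src appears inside the VM at /home/vagrant/src.
  config.vm.synced_folder "./src", "/home/vagrant/src"

  # Port forwarding: reach guest port 8080 via localhost:8080 on the host.
  config.vm.network "forwarded_port", guest: 8080, host: 8080
end
```

With a file like this in place, `vagrant up` boots the machine and `vagrant ssh` opens a shell in it. `vagrant init` generates a starter Vagrantfile.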

Code the Installation using Chef

Give the recipe: code what’s to be done

I’m Solo

Roles, Recipes

Templates, Run List

Knife

Chef Server

Data Files
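
As a rough illustration of "coding the installation", a Chef recipe might look like the sketch below. The package, file paths, template name, and variable are hypothetical choices for this example. Only the `package`, `template`, and `service` resources are standard Chef.

```ruby
# recipes/default.rb (illustrative sketch; names and paths are hypothetical)

# Install the package through the platform's package manager.
package "rabbitmq-server"

# Render a config file from an ERB template shipped with the cookbook.
template "/etc/rabbitmq/rabbitmq.config" do
  source "rabbitmq.config.erb"
  owner "root"
  mode "0644"
  variables(node_name: node["hostname"])
  # Restart the service whenever the rendered file changes.
  notifies :restart, "service[rabbitmq-server]"
end

# Make sure the service starts at boot and is running now.
service "rabbitmq-server" do
  action [:enable, :start]
end
```

A recipe like this is added to a node's run list and applied either standalone with chef-solo or through a Chef server.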

Building blocks

Section 2

To keep processes running,

Option 1: Install God to monitor processes and keep them running

Courtesy: BIT Mesra

Option 2 (for atheists): Install Monit

God’s Snippet

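# watcher_name, start_command and stop_command must be defined before this block
God.watch do |w|
  w.name = watcher_name
  w.start = start_command
  # w.restart = restart_command
  w.stop = stop_command
  w.behavior(:clean_pid_file)    # remove a stale PID file before starting
  # w.group = "some group"
  w.log = "/tmp/god_monitoring_#{watcher_name}.log"
  w.keepalive                    # start the process again whenever it is not running
  w.stop_timeout = 10.seconds
end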

Job Scheduling

Resque, Beanstalk, Gearman, Celery, plus cron and queues

Things to remember while making choices:

• Persistence

• Priorities

• Tags

• Option for retry

• Ability to inspect the queue
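
To make the checklist above concrete, here is a toy in-memory queue showing priorities, tags, retries, and queue inspection. It is an illustration only and not the API of Resque, Beanstalk, Gearman, or Celery. The class and method names are invented for this sketch. Persistence is left out on purpose: a real system would back the queue with disk or a broker.

```ruby
# Toy in-memory job queue (hypothetical; not any real library's API).
class ToyJobQueue
  Job = Struct.new(:id, :payload, :priority, :tags, :attempts)

  def initialize(max_attempts: 3)
    @jobs = []
    @max_attempts = max_attempts
    @next_id = 0
  end

  # Lower number means higher priority.
  def enqueue(payload, priority: 5, tags: [])
    @next_id += 1
    @jobs << Job.new(@next_id, payload, priority, tags, 0)
    @next_id
  end

  # Inspect pending jobs without removing them, optionally filtered by tag.
  def pending(tag: nil)
    jobs = tag ? @jobs.select { |j| j.tags.include?(tag) } : @jobs
    jobs.sort_by { |j| [j.priority, j.id] }
  end

  # Run the highest-priority job; put it back on failure until max_attempts.
  def work_one
    job = pending.first
    return nil unless job

    @jobs.delete(job)
    job.attempts += 1
    begin
      yield job.payload
      :done
    rescue StandardError
      if job.attempts < @max_attempts
        @jobs << job
        :retrying
      else
        :failed
      end
    end
  end
end
```

Calling `queue.pending(tag: "crawl")` lists waiting crawl jobs in the order they would run, which is the kind of visibility worth checking for when comparing real queue systems.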

Data Storage Layer

• For large systems, maintenance cost is a primary overhead

• Replication & Availability

• Consistency guarantees

• Full-text search

SQL/NoSQL, key/value, document-based, graph databases

Voldemort

• Distributed key/value store

• Great performance

• Easy to add/remove nodes

• Alternatives: MongoDB, Riak, HBase, Cassandra

Courtesy- harrypotter.wikia.com

Not me!!!!!!!!
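
One reason adding and removing nodes can be cheap in Dynamo-style stores such as Voldemort is consistent hashing: only the keys owned by the affected node move. The sketch below is a toy ring written for illustration and is not Voldemort's code. The class name, the number of virtual nodes, and the use of MD5 are all assumptions.

```ruby
require "digest"

# Toy consistent-hash ring (hypothetical; not Voldemort's implementation).
class ToyHashRing
  def initialize(nodes = [], replicas: 100)
    @replicas = replicas   # virtual nodes per physical node
    @ring = {}             # ring position => node name
    @sorted = []
    nodes.each { |n| add(n) }
  end

  def add(node)
    @replicas.times { |i| @ring[position("#{node}:#{i}")] = node }
    @sorted = @ring.keys.sort
  end

  def remove(node)
    @ring.delete_if { |_, n| n == node }
    @sorted = @ring.keys.sort
  end

  # The first ring position at or after the key's hash owns the key,
  # wrapping around to the start of the ring if needed.
  def node_for(key)
    return nil if @ring.empty?

    pos = position(key)
    owner = @sorted.bsearch { |p| p >= pos } || @sorted.first
    @ring[owner]
  end

  private

  # Map a string to a 32-bit position on the ring.
  def position(str)
    Digest::MD5.hexdigest(str)[0, 8].to_i(16)
  end
end
```

After `ring.remove("c")`, every key that was not on node "c" still maps to the same node as before, so only the departed node's data has to be redistributed.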

Messaging Layer

• RabbitMQ: widely used in high-load production systems

• Implements AMQP

• Robust exchange server

• Multiple kinds of exchanges- direct, topic, fanout

• Options for HA with Pacemaker/DRBD
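
The three exchange types differ only in how they decide which bound queues get a message. The sketch below models that decision in plain Ruby, without a broker. The function names and the bindings hash are invented for illustration. The topic rules follow AMQP conventions: `*` matches exactly one dot-separated word and `#` matches zero or more.

```ruby
# Returns true when an AMQP-style topic pattern matches a routing key.
def topic_match?(pattern, routing_key)
  match_words(pattern.split("."), routing_key.split("."))
end

def match_words(pat, key)
  return key.empty? if pat.empty?

  head, *rest = pat
  case head
  when "#"
    # "#" absorbs zero or more words: try every possible split point.
    (0..key.length).any? { |n| match_words(rest, key.drop(n)) }
  when "*"
    # "*" must consume exactly one word.
    !key.empty? && match_words(rest, key.drop(1))
  else
    key.first == head && match_words(rest, key.drop(1))
  end
end

# Toy routing: bindings maps queue name => binding key.
# Returns the names of the queues that would receive the message.
def route(type, bindings, routing_key)
  matched = bindings.select do |_queue, binding_key|
    case type
    when :fanout then true                                  # every bound queue
    when :direct then binding_key == routing_key            # exact match only
    when :topic  then topic_match?(binding_key, routing_key)
    end
  end
  matched.keys
end
```

For example, with bindings `{ "all" => "#", "news" => "crawl.news.*" }`, a topic exchange delivers a message keyed `crawl.news.done` to both queues, while a direct exchange delivers it to neither.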

Demo

Section 3

Demo Sketch

1. We’ll generate random sentences based on a Markov chain
2. Store these in Voldemort
3. Enqueue corresponding jobs in RabbitMQ
4. Another set of workers will process these sentences
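
The demo code itself is not part of the deck. The following is a self-contained approximation of the same flow, under loud assumptions: a plain Hash stands in for Voldemort, Ruby's `Queue` stands in for RabbitMQ, and "processing" is reduced to counting words. All names here are invented for the sketch.

```ruby
# Toy first-order Markov chain over words.
class MarkovChain
  def initialize(corpus)
    @starts = []
    @transitions = Hash.new { |h, k| h[k] = [] }
    corpus.each do |sentence|
      words = sentence.split
      next if words.empty?

      @starts << words.first
      words.each_cons(2) { |a, b| @transitions[a] << b }
    end
  end

  # Walk the chain from a random start word until a dead end or max_words.
  def generate(max_words: 12, rng: Random.new)
    word = @starts.sample(random: rng)
    sentence = [word]
    while sentence.length < max_words && @transitions.key?(word)
      word = @transitions[word].sample(random: rng)
      sentence << word
    end
    sentence.join(" ")
  end
end

# Steps 1-4 of the sketch, with in-process stand-ins for the real systems.
def run_demo(chain, count: 5, worker_count: 2)
  store = {}             # stand-in for Voldemort: key => sentence
  jobs = Queue.new       # stand-in for a RabbitMQ queue of keys
  results = Queue.new    # thread-safe collector for worker output

  count.times do |i|
    key = "sentence:#{i}"
    store[key] = chain.generate
    jobs << key
  end

  workers = Array.new(worker_count) do
    Thread.new do
      loop do
        key = begin
          jobs.pop(true) # non-blocking pop raises ThreadError when empty
        rescue ThreadError
          break
        end
        results << [key, store[key].split.length]
      end
    end
  end
  workers.each(&:join)

  Array.new(results.size) { results.pop }.to_h
end
```

The workers only ever see keys on the queue and fetch the sentence from the store, which mirrors the split between the messaging layer and the data storage layer described earlier.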

For the lack of time..

Section 4

Sensu & Graphite

• Monitoring router

• “check scripts” on nodes

• “handler scripts” on servers

• Output can be sent to PagerDuty, Graphite, Twitter, or IRC
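
A check script is just a program that prints a line and exits with a Nagios-style status: 0 for OK, 1 for warning, 2 for critical, 3 for unknown. The sketch below shows that logic for a hypothetical queue-depth check. The check name, thresholds, and function name are assumptions made for illustration.

```ruby
# Nagios-style exit statuses, as used by Sensu check scripts.
EXIT_CODES = { ok: 0, warning: 1, critical: 2, unknown: 3 }.freeze

# Hypothetical check: classify a queue depth against thresholds.
# Returns [message, exit_code]; a non-numeric depth maps to "unknown".
def check_queue_depth(depth, warn: 1_000, crit: 10_000)
  status =
    if !depth.is_a?(Numeric)
      :unknown
    elsif depth >= crit
      :critical
    elsif depth >= warn
      :warning
    else
      :ok
    end
  ["QueueDepth #{status.to_s.upcase}: #{depth.inspect} jobs pending", EXIT_CODES[status]]
end
```

A real check script would `puts` the message and call `exit` with the code. A handler on the server side then decides where the resulting event goes.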

Distributed Log Collection

Flume

• Allows multiple topologies

• Agent

• Collector

• Sink

Scribe, Flume, Splunk

Feel free to reach out

Big Data made Small

[email protected]

Appreciate your time

Thanks to Arpan Jha for her help with the slides