
Hadoop 101 for bioinformaticians



Hadoop 101 for bioinformaticians: a 1-hour informal crash course with a theoretical introduction and a practical session. See the GitHub repo: https://github.com/attilacsordas/hadoop_for_bioinformatics


Page 1: Hadoop 101 for bioinformaticians

Hadoop 101 for bioinformaticians

Attila Csordas

Anybody used Hadoop before?

Page 2: Hadoop 101 for bioinformaticians

Aim

1 hour:

theoretical introduction -> be able to assess whether your problem fits MR/Hadoop

practical session -> be able to start developing code

Page 3: Hadoop 101 for bioinformaticians

Hadoop & me

mostly non-coding bioinformatician at PRIDE

Cloudera Certified Hadoop Developer

~1300 Hadoop jobs, ~4000 MR jobs

3+ yrs of operator experience

Page 4: Hadoop 101 for bioinformaticians

Linear scalability

[Charts: database build time = f(# of peptides) for 4 million (quarter nr), 8 million (half nr) and 16 million (nr) proteins; search time = f(job complexity) for 5,000, 100,000 and 200,000 spectra; 6+ days on a 4-core local machine]

Page 5: Hadoop 101 for bioinformaticians

Theoretical introduction

• Big Data

• Data Operating System

• Hadoop 1.0

• MapReduce

• Hadoop 2.0

Page 6: Hadoop 101 for bioinformaticians

Big Data

difficult to process on a single machine

Volume: low TB - low PB

Velocity: generation rate

Variety: tab separated, machine data, documents

biological repositories fit the bill: e.g. PRIDE

Page 7: Hadoop 101 for bioinformaticians

Data Operating System

features: distributed, scalable, fault-tolerant, redundant

components: storage, resource management, processing engine 1, processing engine 2, …

Page 8: Hadoop 101 for bioinformaticians

Hadoop 1.0

storage: Hadoop Distributed File System (aka HDFS) - redundant, reliable, distributed

processing engine / cluster resource management: MapReduce - distributed, fault-tolerant resource management, scheduling & processing engine

single-use data platform

Page 9: Hadoop 101 for bioinformaticians

MapReduce

programming model for large-scale, distributed, fault-tolerant, batch data processing

execution framework, the "runtime"

software implementation

Page 10: Hadoop 101 for bioinformaticians

Stateless algorithms

output depends only on the current input, not on previous inputs

state-dependent counterexample: Fibonacci series, F(k+2) = F(k+1) + F(k)

Page 11: Hadoop 101 for bioinformaticians

Parallelizable problems

easy to separate into parallel tasks

might need to maintain a global, shared state

Page 12: Hadoop 101 for bioinformaticians

Embarrassingly/pleasingly parallel problems

easy to separate into parallel tasks

no dependency or communication between those tasks

Page 13: Hadoop 101 for bioinformaticians

Functional programming roots

higher-order functions accept other functions as arguments

map: applies its argument f() to all elements of a list

fold: takes g() + an initial value -> final value

sum of squares: map with f(x) = x², fold with (+) and 0 as the initial value
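The same map-then-fold pattern as a minimal, Hadoop-free sketch in plain Java streams (SumOfSquares is just an illustrative name):

    import java.util.Arrays;
    import java.util.List;

    public class SumOfSquares {
        public static void main(String[] args) {
            List<Integer> xs = Arrays.asList(1, 2, 3, 4);
            // map: apply f(x) = x * x to every element; fold: reduce with (+) and 0 as the initial value
            int sumOfSquares = xs.stream()
                    .map(x -> x * x)          // higher-order function taking f() as argument
                    .reduce(0, Integer::sum); // fold with initial value 0
            System.out.println(sumOfSquares); // prints 30
        }
    }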

Page 14: Hadoop 101 for bioinformaticians

MapReduce

map(key, value) -> [(key, value)]

shuffle & sort: values grouped by key

reduce(key, iterator<value>) -> [(key, value)]

Page 15: Hadoop 101 for bioinformaticians

Word Count

Map(String docid, String text):
  for each word w in text:
    Emit(w, 1);

Reduce(String term, Iterator<Int> values):
  int sum = 0;
  for each v in values:
    sum += v;
  Emit(term, sum);

source: Jimmy Lin and Chris Dyer: Data-Intensive Text Processing with MapReduce, Morgan & Claypool Publishers, 2010.

Page 16: Hadoop 101 for bioinformaticians

Count amino acids in peptides

Map(byte offset of the line, peptide sequence):
  for each amino acid in peptide:
    Emit(amino acid, 1);

Reduce(amino acid, Iterator<Int> values):
  int sum = 0;
  for each v in values:
    sum += v;
  Emit(amino acid, sum);
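As a rough sketch, the mapper and reducer above might look something like this against the org.apache.hadoop.mapreduce API (class and field names here are illustrative; the reference implementation is AminoAcidCounter.java in the repo):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: key = byte offset of the line, value = one peptide sequence per line
    public class AminoAcidMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text aminoAcid = new Text();

        @Override
        protected void map(LongWritable offset, Text peptide, Context context)
                throws IOException, InterruptedException {
            for (char residue : peptide.toString().trim().toCharArray()) {
                aminoAcid.set(String.valueOf(residue));
                context.write(aminoAcid, ONE);            // Emit(amino acid, 1)
            }
        }
    }

    // Reducer: sums the 1s emitted for each amino acid
    class AminoAcidReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text aminoAcid, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(aminoAcid, new IntWritable(sum)); // Emit(amino acid, sum)
        }
    }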

Page 17: Hadoop 101 for bioinformaticians

[Diagram: Mappers emit intermediate key-value pairs, Partitioners route the intermediates to Reducers, Reducers aggregate them; combiners omitted here. Source: redrawn from a slide by Cloudera, cc-licensed]

source: Jimmy Lin and Chris Dyer: Data-Intensive Text Processing with MapReduce, Morgan & Claypool Publishers, 2010.

Page 18: Hadoop 101 for bioinformaticians

[Diagram: word count dataflow - map emits (key, 1) pairs, combine pre-aggregates them locally, partition assigns keys to reducers, "Shuffle and Sort: aggregate values by keys", and reduce sums the counts per key]

source: Jimmy Lin and Chris Dyer: Data-Intensive Text Processing with MapReduce, Morgan & Claypool Publishers, 2010.

Page 19: Hadoop 101 for bioinformaticians

[Diagram: amino acid counting dataflow - input lines keyed by byte offset (0 AGELTEDEVER, 12 DTNGSQFFITTVK, 26 IEVEKPFAIAKE) are mapped to (amino acid, 1) pairs, shuffle & sort aggregates the values by key, and the reducers emit totals such as E 2, G 2, N 1]

Page 20: Hadoop 101 for bioinformaticians

Hadoop 1.0: single use

Pig: scripting for Hadoop (Yahoo!, 2006)

Hive: SQL queries for Hadoop (Facebook)

hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);

hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';

Page 21: Hadoop 101 for bioinformaticians

Hadoop 2.0

storage: HDFS2 - redundant, reliable, distributed

cluster resource management: YARN - distributed, fault-tolerant resource management & scheduling

processing engines: MapReduce, Tez, HBase, Storm, Giraph, Spark… - batch, interactive, online, streaming, graph…

multi-use data platform

Page 22: Hadoop 101 for bioinformaticians

Practical session

Objective: up & running with Hadoop in 30 mins

Page 23: Hadoop 101 for bioinformaticians

requirements

• Java

• IntelliJ IDEA

• Maven

Page 24: Hadoop 101 for bioinformaticians

Outline

• check out https://github.com/attilacsordas/hadoop_introduction

• elements of a MapReduce app, basic Hadoop API & data types

• bioinformatics toy example: counting amino acids in sequences

• executing a Hadoop job locally

• 2 more examples building on top of AminoAcidCounter

Page 25: Hadoop 101 for bioinformaticians

https://github.com/attilacsordas/hadoop_for_bioinformatics

Page 26: Hadoop 101 for bioinformaticians

MapReduce app components

• driver

• mapper

• reducer

repo: bioinformatics.hadoop.AminoAcidCounter.java
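A rough driver sketch wiring the three components together (class names are illustrative; see the repo's AminoAcidCounter.java for the actual driver):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class AminoAcidCounterDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "amino acid counter");
            job.setJarByClass(AminoAcidCounterDriver.class);

            // the InputFormat is specified in the driver; TextInputFormat is the default anyway
            job.setInputFormatClass(TextInputFormat.class);

            // mapper and reducer from the earlier sketch
            job.setMapperClass(AminoAcidMapper.class);
            job.setReducerClass(AminoAcidReducer.class);

            // output key-value types: (amino acid, count)
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. the input folder with headpeptides.txt
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }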

Page 27: Hadoop 101 for bioinformaticians

InputFormats

specify the data passed to the Mappers

specified in the driver

determine the input splits and the RecordReaders that extract key-value pairs

default formats: TextInputFormat, KeyValueTextInputFormat, SequenceFileInputFormat…

with TextInputFormat, keys are LongWritable byte offsets and values are Text lines

Page 28: Hadoop 101 for bioinformaticians

Data types

everything implements Writable: de/serialization

all keys are WritableComparable: sorting keys

box classes for primitive data types: Text, IntWritable, LongWritable…
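A tiny standalone snippet just to show the box classes outside a job (WritableDemo is a made-up name):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;

    public class WritableDemo {
        public static void main(String[] args) {
            // box classes wrap Java primitives/Strings so the framework can de/serialize them
            Text key = new Text("R");
            IntWritable value = new IntWritable(1);

            // keys are WritableComparable, so the shuffle & sort phase can order them
            int cmp = key.compareTo(new Text("K"));
            System.out.println(key + "\t" + value + "\tcompared to K: " + cmp);
        }
    }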

Page 29: Hadoop 101 for bioinformaticians

bioinformatics toy example

Our task is to prepare data for a statistical model of which physico-chemical properties of the constituent amino acid residues affect the visibility of the high-flyer peptides to a mass spectrometer.

AGELTEDEVER
QSVHLVENEIQASIDQIFSHLER
DTNGSQFFITTVK
IEVEKPFAIAKE

repo: headpeptides.txt in the input folder

Page 30: Hadoop 101 for bioinformaticians

Run the code!

local

pseudo-distributed

cluster

Page 31: Hadoop 101 for bioinformaticians

Debugging

do it locally if possible

print statements

write to the map output

mapred.map.child.log.level=DEBUG
mapred.reduce.child.log.level=DEBUG
-> syslog task logfile

running debuggers remotely is hard

JVM debugging options

task profilers
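One way to apply the log-level settings above from code is on the job Configuration; a sketch only, using the property names from the slide:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class DebugLogLevelExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // make the map and reduce task syslogs verbose
            conf.set("mapred.map.child.log.level", "DEBUG");
            conf.set("mapred.reduce.child.log.level", "DEBUG");

            Job job = Job.getInstance(conf, "amino acid counter, DEBUG task logs");
            // ... configure mapper, reducer, input and output paths as in the driver sketch, then submit ...
        }
    }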

Page 32: Hadoop 101 for bioinformaticians

Counters

track records for statistics, e.g. malformed records

AminoAcidCounterwithMalformedCounter
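A rough sketch of how a malformed-record counter might be wired into the mapper (the enum and the residue check are illustrative; the repo's AminoAcidCounterwithMalformedCounter is the reference):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class AminoAcidMapperWithCounter extends Mapper<LongWritable, Text, Text, IntWritable> {

        // counter values show up in the job statistics next to the built-in counters
        public enum RecordQuality { WELLFORMED, MALFORMED }

        private static final IntWritable ONE = new IntWritable(1);
        private final Text aminoAcid = new Text();

        @Override
        protected void map(LongWritable offset, Text peptide, Context context)
                throws IOException, InterruptedException {
            String sequence = peptide.toString().trim();
            // treat anything with a non-standard residue letter as malformed and skip it
            if (!sequence.matches("[ACDEFGHIKLMNPQRSTVWY]+")) {
                context.getCounter(RecordQuality.MALFORMED).increment(1);
                return;
            }
            context.getCounter(RecordQuality.WELLFORMED).increment(1);
            for (char residue : sequence.toCharArray()) {
                aminoAcid.set(String.valueOf(residue));
                context.write(aminoAcid, ONE);
            }
        }
    }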

Page 33: Hadoop 101 for bioinformaticians

• Task 1: modify the mapper to count positions too: R_2 -> AminoAcidPositionCounter

• Task 2: count positions & normalise to peptide length -> AminoAcidPositionCounterNormalizedToPeptideLength

Page 34: Hadoop 101 for bioinformaticians

Run the jar on the cluster

set up an account on a Hadoop cluster

move the input data into HDFS

set mainClass in pom.xml

mvn clean install

cp the jar from target to the servers

ssh into the Hadoop cluster

run the hadoop command line

Page 35: Hadoop 101 for bioinformaticians

Homework, next steps

count all the 3 outputs for 1 input (k, v) pair

count all amino acid pairs

run FastaAminoAcidCounter & check FastaInputFormat

run the jar on a cluster

https://github.com/lintool/MapReduce-course-2013s/tree/master/slides

n * (trial, error) -> success