Large-scale data processing
at SARA and BiG Grid
with Apache Hadoop
Evert Lammerts
April 10, 2012, SZTAKI
First off...
About me
Consultant for SARA's eScience & Cloud Services
Technical lead for LifeWatch Netherlands
Lead for the Hadoop infrastructure
About you
Who uses large-scale computing as a supporting tool?
For whom is large-scale computing core business?
In this talk
Large-scale data processing?
Large-scale @ SARA & BiG Grid
An introduction to Hadoop & MapReduce
Hadoop @ SARA & BiG Grid
Large-scale data processing?
Large-scale @ SARA & BiG Grid
An introduction to Hadoop & MapReduce
Hadoop @ SARA & BiG Grid
Three observations
I: Data is easier to collect
(Jimmy Lin, University of Maryland / Twitter, 2011)
More business is done on-line
Mobile devices are more sophisticated
Governments collect more data
Sensing devices are becoming a commodity
Technology has advanced: DNA sequencers!
Enormous funding for research infrastructures
And so on...
Lesson: everybody collects data
Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2011–2016
Three observations
II: Data is easier to store
Storage price decreases
http://www.mkomo.com/cost-per-gigabyte
Storage capacity increases
http://en.wikipedia.org/wiki/File:Hard_drive_capacity_over_time.svg
Three observations
III: Quantity beats quality
(IEEE Intelligent Systems, 03/04-2009, vol 24, issue 2, p8-12)
s/knowledge/data/g
Jimmy Lin, University of Maryland / Twitter, 2011
How are these observations addressed?
We collect data, we store data, we have the knowledge to interpret data. What tools do we have that bring these together?
Pioneers: HPC centers, universities, and in recent years, Internet companies. (Lots of knowledge exchange, by the way.)
Some background (bear with me...) 1/3
Amdahl's Law
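The slide names only the law; as a reminder, Amdahl's Law bounds the speedup of a program in which only a fraction p of the work can run in parallel. A minimal sketch (the function name is my own):

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Upper bound on speedup when only parallel_fraction of the work scales."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_workers)

# Even with 95% parallel work, 2000 nodes cannot exceed a 20x speedup:
print(amdahl_speedup(0.95, 2000))
```

The serial 5% dominates long before the node count does, which is why data-parallel designs try to keep the serial fraction near zero.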
Some background (bear with me...) 2/3
(The Datacenter as a Computer, 2009, Luiz André Barroso and Urs Hölzle)
Some background (bear with me...) 3/3
(The Datacenter as a Computer, 2009, Luiz André Barroso and Urs Hölzle)
Node (×2000): 8 GB DRAM, 4 × 1 TB disks
Rack: 40 nodes, 1 Gbps switch
Datacenter: 8 Gbps rack-to-cluster switch connection
(NYT, 14/06/2006)
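The rack numbers above imply a bandwidth hierarchy: 40 nodes at 1 Gbps each share an 8 Gbps uplink, so traffic crossing racks is oversubscribed. A quick back-of-the-envelope check using the slide's figures:

```python
nodes_per_rack = 40
node_link_gbps = 1
rack_uplink_gbps = 8

# Aggregate bandwidth inside a rack vs. what the rack can push to the cluster:
intra_rack = nodes_per_rack * node_link_gbps      # 40 Gbps
oversubscription = intra_rack / rack_uplink_gbps  # 5.0
```

A 5:1 oversubscription means cross-rack transfers are roughly five times more expensive than local ones, which motivates "move processing to the data" later in the talk.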
Large-scale data processing?
Large-scale @ SARA & BiG Grid
An introduction to Hadoop & MapReduce
Hadoop @ SARA & BiG Grid
SARA
the national center for scientific computing
Facilitating Science in The Netherlands with Equipment for and Expertise on Large-Scale Computing, Large-Scale Data Storage, High-Performance Networking, eScience, and Visualization
Large-scale data != new
Compute @ SARA
How do categories in Wikipedia evolve over time? (And how do they relate to internal links?)
2.7 TB raw text, in a single file
A Java application searches for categories in Wiki markup, like [[Category:NAME]]
Executed on the Grid
http://simshelf2.virtualknowledgestudio.nl/activities/biggrid-wikipedia-experiment
Case Study: Virtual Knowledge Studio
Method
Take an article, including its history, as input
Extract categories and links for each revision
Output all links for each category, per revision
Aggregate all links for each category, per revision
Generate graph linking all categories on links, per revision
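The extraction step above can be sketched with two regular expressions over the [[Category:NAME]] markup mentioned earlier; the patterns and function name are illustrative, not the Virtual Knowledge Studio's actual code:

```python
import re

# [[Category:Physics]] -> "Physics"; stop at "]" or a "|" display alias.
CATEGORY_RE = re.compile(r"\[\[Category:([^\]|]+)")
# Internal links, excluding category tags via a negative lookahead.
LINK_RE = re.compile(r"\[\[(?!Category:)([^\]|]+)")

def extract(revision_text):
    """Pull category names and internal links out of one revision's wiki markup."""
    categories = CATEGORY_RE.findall(revision_text)
    links = LINK_RE.findall(revision_text)
    return categories, links

cats, links = extract("Text [[Category:Physics]] see [[Quantum mechanics|QM]]")
```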
1.1) Copy file from local machine to Grid storage
2.1) Stream file from Grid storage to a single machine
2.2) Cut into pieces of 10 GB
2.3) Stream back to Grid storage
3.1) Process all files in parallel: N machines run the Java application, each fetching a 10 GB file as input, processing it, and putting the result back
Case Study: Virtual Knowledge Studio
Large-scale data processing?
Large-scale @ SARA & BiG Grid
An introduction to Hadoop & MapReduce
Hadoop @ SARA & BiG Grid
A bit of history
2002: Nutch*
2004: MapReduce / GFS papers**
2006: Hadoop
* http://nutch.apache.org/
** http://labs.google.com/papers/mapreduce.html
http://labs.google.com/papers/gfs.html
http://wiki.apache.org/hadoop/PoweredBy
2010 - 2012: A Hype in Production
What's different about Hadoop?
No more do-it-yourself parallelism (it's hard!)
But rather linearly scalable data parallelism
Separating the what from the how
(The Datacenter as a Computer, 2009, Luiz André Barroso and Urs Hölzle)
Core principles
Scale out, not up
Move processing to the data
Process data sequentially, avoid random reads
Seamless scalability
(Jimmy Lin, University of Maryland / Twitter, 2011)
A typical data-parallel problem in abstraction
Iterate over a large number of records
Extract something of interest
Create an ordering in intermediate results
Aggregate intermediate results
Generate output
MapReduce: functional abstraction of step 2 & step 4
(Jimmy Lin, University of Maryland / Twitter, 2011)
MapReduce
Programmer specifies two functions:
map(k, v) → (k', v')*
reduce(k', v'*) → (k'', v'')*
All values associated with a single key are sent to the same reducer
The framework handles the rest
The rest?
Scheduling, data distribution, ordering, synchronization, error handling...
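"The rest" can be made concrete with a single-machine simulation of the map/shuffle/reduce contract, using word count as the example. This mimics the model only, not the Hadoop API:

```python
from collections import defaultdict

def map_fn(_, line):                 # map(k, v) -> (k', v')*
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):         # reduce(k', v'*) -> (k'', v'')*
    yield word, sum(counts)

def run_mapreduce(records, mapper, reducer):
    # Shuffle: the framework groups all values for one key onto one reducer.
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)
    out = {}
    for k, vs in sorted(groups.items()):  # ordering is part of "the rest"
        for rk, rv in reducer(k, vs):
            out[rk] = rv
    return out

counts = run_mapreduce(enumerate(["a b a", "b c"]), map_fn, reduce_fn)
```

Everything outside map_fn and reduce_fn is what the framework handles: the programmer writes the two functions, the system owns distribution, grouping, and ordering.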
1) Load file into HDFS
2) Submit code to MapReduce
Case Study: Virtual Knowledge Studio
Automatic distribution of data
Parallelism based on data
Automatic ordering of intermediate results
This is how it would be done with Hadoop
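One hedged way to phrase the Virtual Knowledge Studio pipeline in MapReduce terms; the intermediate key layout ((category, revision) so the shuffle does the per-revision grouping) and all names are illustrative, not the project's actual job:

```python
import re

CATEGORY_RE = re.compile(r"\[\[Category:([^\]|]+)")
LINK_RE = re.compile(r"\[\[(?!Category:)([^\]|]+)")

def map_revision(revision_id, markup):
    """Emit ((category, revision), link) pairs for one revision's markup."""
    links = LINK_RE.findall(markup)
    for category in CATEGORY_RE.findall(markup):
        for link in links:
            yield (category, revision_id), link

def reduce_category(key, links):
    """All links for one category in one revision arrive together."""
    yield key, sorted(set(links))

pairs = list(map_revision(7, "[[Category:Physics]] [[Atom]] [[Atom]] [[Energy]]"))
grouped = list(reduce_category(("Physics", 7), [link for _, link in pairs]))
```

The manual grid steps (splitting the 2.7 TB file, scheduling N machines, collecting output) all disappear into the framework.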
The ecosystem
The Forrester Wave™: Enterprise Hadoop Solutions, Q1 2012
Large-scale data processing?
Large-scale @ SARA & BiG Grid
An introduction to Hadoop & MapReduce
Hadoop @ SARA & BiG Grid
Timeline
2009: Piloting Hadoop on Cloud
2010: Test cluster available for scientists
6 machines × 4 cores / 24 TB storage / 16 GB RAM
Just me!
2011: Funding granted for production service
2012: Production cluster available (~March)
72 machines × 8 cores / 8 TB storage / 64 GB RAM
Integration with Kerberos for secure multi-tenancy
3 devops, team of consultants
Architecture
Components
Hadoop, Hive, Pig, HBase, HCatalog - others?
What are scientists doing?
Information Retrieval
Natural Language Processing
Machine Learning
Econometrics
Bioinformatics
Computational Ecology / Ecoinformatics
Machine learning: Infrawatch, Hollandse Brug
Structural health monitoring
145 sensors × 100 Hz × 60 seconds × 60 minutes × 24 hours × 365 days = large data
(Arno Knobbe, LIACS, 2011, http://infrawatch.liacs.nl)
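The multiplication on the slide works out to roughly 4.6 × 10^11 sensor readings per year:

```python
sensors, hz = 145, 100
seconds_per_year = 60 * 60 * 24 * 365

readings_per_year = sensors * hz * seconds_per_year
print(readings_per_year)  # 457,272,000,000 -- about 457 billion readings
```

Even at a couple of bytes per reading that approaches a terabyte per year from a single bridge.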
And others: NLP & IR
e.g. ClueWeb: a ~13.4 TB webcrawl
e.g. Twitter gardenhose data
e.g. Wikipedia dumps
e.g. del.icio.us & flickr tags
Finding named entities: [person, company, place] names
Creating inverted indexes
Piloting real-time search
Personalization
Semantic web
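One task from the list above, building an inverted index, maps naturally onto this model. A minimal in-memory sketch (term → sorted document ids; names are my own):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

index = build_inverted_index({1: "hadoop at SARA", 2: "hadoop on the grid"})
```

In MapReduce form the mapper would emit (term, doc_id) pairs and the reducer would sort and deduplicate the posting list, exactly the grouping the shuffle provides for free.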
Interest from industry
We're opening up shop.
Experiences: Data Science
DevOps
Programming algorithms
Domain knowledge
Experience: How we embrace Hadoop
Parallelism has never been easy... so we teach!
December 2010: hackathon (~50 participants, full)
April 2011: Workshop for bioinformaticians
November 2011: 2-day PhD course (~60 participants, full)
June 2012: 1-day PhD course
The data scientist is still in school... so we fill the gap!
Devops maintain the system, fix bugs, develop new functionality
Technical consultants learn how to efficiently implement algorithms
Users bring domain knowledge
Methods are developing fast... so we build the community!
Netherlands Hadoop User Group
http://www.nlhug.org/
Final thoughts
Hadoop is the first to provide large-scale computing on commodity hardware
Hadoop is not the only one
Hadoop is probably not the best
Hadoop has momentum
What degree of diversification of infrastructure should we embrace?
MapReduce fits surprisingly well as a programming model for data-parallelism
Where is the data scientist?
Teach. A lot. And work together.