
Hadoop Team: Role of Hadoop in the IDEAL Project

● Jose Cadena
● Chengyuan Wen
● Mengsu Chen

CS5604, Spring 2015
Instructor: Dr. Edward Fox

Big data and Hadoop

Data sets are so large or complex that traditional data processing tools are inadequate

Challenges include:

● analysis
● search
● storage
● transfer

Big data and Hadoop

Hadoop solution (inspired by Google)

● distributed storage: HDFS (see the sketch below)
  ○ a distributed, scalable, and portable file system
  ○ high capacity at very low cost
● distributed processing: MapReduce
  ○ a programming model for processing large data sets with a parallel, distributed algorithm on a cluster
  ○ composed of Map and Reduce procedures
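To make the storage half concrete, here is a minimal sketch of using the HDFS Java FileSystem API, assuming the cluster configuration files are on the classpath; the file and directory names (tweets.avro, /user/cs5604/tweets) are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Copy a local file into HDFS and list the target directory.
public class HdfsPutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path dir = new Path("/user/cs5604/tweets");    // hypothetical HDFS directory
        fs.copyFromLocalFile(new Path("tweets.avro"), new Path(dir, "tweets.avro"));
        for (FileStatus status : fs.listStatus(dir)) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}
```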

Hadoop Cluster for this Class

● Nodes
  ○ 19 Hadoop nodes
  ○ 1 Manager node
  ○ 2 Tweet DB nodes
  ○ 1 HDFS Backup node
● CPU: Intel i5 Haswell quad-core 3.3 GHz; Xeon
● RAM: 660 GB
  ○ 32 GB * 19 (Hadoop nodes) + 4 GB * 1 (manager node)
  ○ 16 GB * 1 (HDFS backup) + 16 GB * 2 (tweet DB nodes)
● HDD: 60 TB + 11.3 TB (backup) + 1.256 TB SSD
● Hadoop distribution: CDH 5.3.1

Data sets of this class

[Table of the seven tweet collections and their sizes: 5.3 GB, 3.0 GB, 9.9 GB, 8.7 GB, 2.2 GB, 9.6 GB, and 0.5 GB]

~87 million tweets in total

MapReduce

● Originally developed for rewriting the indexing system for the Google web search product

● Simplifies large-scale computations

● MapReduce programs are automatically parallelized and executed on a large-scale cluster

● Programmers without any experience with parallel and distributed systems can easily use large distributed resources

Typical problem solved by MapReduce

● Read data as input
● Map: extract something you care about from each record
● Shuffle and Sort
● Reduce: aggregate, summarize, filter, or transform
● Write the results (see the sketch below)
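A minimal sketch of this pattern, assuming tweets are stored one per line as plain text: the Map step extracts hashtags (the "something you care about"), the framework shuffles and sorts by hashtag, and the Reduce step aggregates the counts. All class and path names here are illustrative, not from the project code.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Map: extract hashtags from each tweet line; Reduce: sum the counts.
public class HashtagCount {
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text tag = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.startsWith("#")) {        // the part of each record we care about
                    tag.set(token.toLowerCase());
                    ctx.write(tag, ONE);            // emitted pairs are shuffled and sorted by key
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();  // aggregate per hashtag
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "hashtag count");
        job.setJarByClass(HashtagCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);   // local pre-aggregation before the shuffle
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // read data as input
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // write the results
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```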

MapReduce Process

[Diagram: MapReduce data flow from input through map, shuffle/sort, and reduce to output]

Requirements

● Design a workflow for the IDEAL project using appropriate Hadoop tools

● Coordinate data transfer between the different teams

● Help other teams to use the cluster effectively

[Workflow diagram: seedURLs.txt feeds Nutch, which fetches the original web pages (HTML); original tweets are imported from the tweet SQL databases with Sqoop. The Noise Reduction team turns both into noise-reduced tweets and web pages stored as Avro files. The analysis teams (Clustering, Classifying, NER, Social, LDA) add their analyzed data. Everything resides in HDFS and HBase (tweets and webpages tables), is processed with MapReduce, and is indexed into the Solr cluster through the Lily Indexer.]

Schema Design - HBase

● Separate tables for tweets and web pages
● Both tables have two column families
  ○ original
    ■ tweet / web page content and metadata
  ○ analysis
    ■ results of the analysis of each team
● Row ID of a document
  ○ [collection_name]--[UID]
  ○ allows fast retrieval of the documents of a specific collection (see the sketch below)
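A minimal sketch of this design against the HBase 0.98-era Java client shipped with CDH 5.3: create the tweets table with its two column families, write one row keyed [collection_name]--[UID], and scan one collection by row-key prefix. The collection name egypt and the UID are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

// Create the "tweets" table with two column families, write one row keyed
// "[collection_name]--[UID]", then fetch one collection by row-key prefix.
public class TweetSchemaExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        if (!admin.tableExists("tweets")) {
            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("tweets"));
            desc.addFamily(new HColumnDescriptor("original"));
            desc.addFamily(new HColumnDescriptor("analysis"));
            admin.createTable(desc);
        }
        admin.close();

        HTable table = new HTable(conf, "tweets");
        Put put = new Put(Bytes.toBytes("egypt--000000001")); // hypothetical collection/UID
        put.add(Bytes.toBytes("original"), Bytes.toBytes("text"),
                Bytes.toBytes("example tweet text"));
        table.put(put);

        // All documents of the "egypt" collection share the row-key prefix.
        Scan scan = new Scan();
        scan.setFilter(new PrefixFilter(Bytes.toBytes("egypt--")));
        ResultScanner results = table.getScanner(scan);
        for (Result r : results) System.out.println(Bytes.toString(r.getRow()));
        results.close();
        table.close();
    }
}
```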

Schema Design - HBase

● Why HBase?
  ○ Our datasets are sparse
  ○ Real-time random I/O access to data
  ○ Lily Indexer allows real-time indexing of data into Solr

Schema Design - Avro

● One schema for each team
  ○ No risk of teams overwriting each other's data
  ○ Changes in the schema for one team do not affect others
● Each schema contains the fields to be indexed into Solr (see the sketch below)
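As an illustration, a per-team schema could be assembled with Avro's SchemaBuilder; the record and field names below (NerTweet, ner_people, ner_locations) are hypothetical stand-ins for the fields a team would index into Solr, not the project's actual schemas.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

// One self-contained schema per team, holding only that team's Solr fields.
public class TeamSchemaExample {
    public static void main(String[] args) {
        Schema nerSchema = SchemaBuilder
            .record("NerTweet").namespace("cs5604.tweet")   // hypothetical names
            .fields()
            .requiredString("doc_id")        // "[collection_name]--[UID]"
            .optionalString("ner_people")    // union {null, string}, default null
            .optionalString("ner_locations")
            .endRecord();
        System.out.println(nerSchema.toString(true));       // pretty-print the .avsc JSON
    }
}
```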

Schema Design - Avro

● Why Avro?
  ○ Supports versioning, and a schema can be split into smaller schemas
    ■ We take advantage of these properties for the data upload
  ○ Schemas can be used to generate a Java API
  ○ MapReduce support and libraries for the different programming languages used in this course
  ○ Supports compression formats used in MapReduce (see the sketch below)
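A small usage sketch tying these points together: writing a record to a compressed Avro container file with the generic API. The schema file ner_tweet.avsc (matching the sketch above) and the field values are hypothetical.

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

// Write one record to a compressed Avro container file using the generic API.
public class AvroWriteExample {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(new File("ner_tweet.avsc")); // hypothetical schema file
        GenericRecord rec = new GenericData.Record(schema);
        rec.put("doc_id", "egypt--000000001");   // hypothetical values
        rec.put("ner_people", "Edward Fox");
        rec.put("ner_locations", null);          // optional field, null allowed

        DataFileWriter<GenericRecord> writer =
                new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
        writer.setCodec(CodecFactory.deflateCodec(6));   // a codec MapReduce can read back
        writer.create(schema, new File("ner_tweets.avro"));
        writer.append(rec);
        writer.close();
    }
}
```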

Loading Data Into HBase

● Sequential Java Program
  ○ Good solution for the small collections
  ○ Does not scale for the big collections
    ■ Out-of-memory errors on the master node

Loading Data Into HBase

● MapReduce Program
  ○ Map-only job
  ○ Each map task writes one document to HBase (see the sketch below)
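A sketch of such a map-only job, assuming hypothetical tab-separated input of row ID and tweet text; TableMapReduceUtil wires the job's output to the tweets table, and with zero reduce tasks each map task writes its Puts directly.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Map-only job: each map task turns one input record into one HBase Put.
public class HBaseLoad {
    public static class LoadMapper
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t", 2); // rowid <TAB> tweet text
            Put put = new Put(Bytes.toBytes(parts[0]));
            put.add(Bytes.toBytes("original"), Bytes.toBytes("text"),
                    Bytes.toBytes(parts[1]));
            ctx.write(new ImmutableBytesWritable(put.getRow()), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "hbase load");
        job.setJarByClass(HBaseLoad.class);
        job.setMapperClass(LoadMapper.class);
        TableMapReduceUtil.initTableReducerJob("tweets", null, job); // output goes to "tweets"
        job.setNumReduceTasks(0);                                    // map-only
        FileInputFormat.addInputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```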

Loading Data Into HBase

● Bulk-loading (see the sketch below)
  ○ Use a MapReduce job to generate HFiles
  ○ Write HFiles directly, bypassing the normal HBase write path
  ○ Much faster than our Map-only job, but requires pre-configuration of the HBase table
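A sketch of a bulk-load driver under the same assumptions as the map-only job above: the mapper emits Puts, HFileOutputFormat2 sorts and partitions them to match the pre-created (ideally pre-split) table's regions, and LoadIncrementalHFiles moves the finished HFiles into HBase.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// The MapReduce job writes HFiles instead of live Puts; the completed HFiles
// are then moved into the table, bypassing the normal write path.
public class BulkLoadDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "tweets");        // table must already exist

        Job job = Job.getInstance(conf, "tweets bulk load");
        job.setJarByClass(BulkLoadDriver.class);
        job.setMapperClass(HBaseLoad.LoadMapper.class);   // reuse the Put-emitting mapper
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        Path hfileDir = new Path(args[1]);                // staging directory for HFiles
        FileOutputFormat.setOutputPath(job, hfileDir);

        // Sets the reducer, partitioner, and output format so the generated
        // HFiles line up with the table's region boundaries.
        HFileOutputFormat2.configureIncrementalLoad(job, table);
        if (!job.waitForCompletion(true)) System.exit(1);

        new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, table);
        table.close();
    }
}
```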

HFile

[Diagram: HBase write path, write-ahead log, and HFiles; source: http://www.toadworld.com/platforms/nosql/w/wiki/357.hbase-write-ahead-log.aspx]

Collaboration with other teams

● Helped other teams to interact with Avro files and output data
  ○ Multiple rounds and revisions were needed
  ○ Thank you, everyone!
● Helped with MapReduce programming
  ○ Classification team had to adapt a third-party tool for their task

Acknowledgements

● Dr. Fox
● Mr. Sunshin Lee
● Solr and Noise Reduction teams
● National Science Foundation
  ○ NSF grant IIS-1319578, III: Small: Integrated Digital Event Archiving and Library (IDEAL)

Thank you
