A company of Daimler AG LECTURE @DHBW: DATA WAREHOUSE PART VII: HADOOP ANDREAS BUCKENHOFER, DAIMLER TSS



ABOUT ME

https://de.linkedin.com/in/buckenhofer

https://twitter.com/ABuckenhofer

https://www.doag.org/de/themen/datenbank/in-memory/

http://wwwlehre.dhbw-stuttgart.de/~buckenhofer/

https://www.xing.com/profile/Andreas_Buckenhofer2

Andreas Buckenhofer

Senior DB Professional

[email protected]

Since 2009 at Daimler TSS

Department: Big Data

Business Unit: Analytics


ANDREAS BUCKENHOFER, DAIMLER TSS GMBH

Data Warehouse / DHBW, Daimler TSS

“Forming good abstractions and avoiding complexity is an essential part of a successful data architecture”

Data has always been my main focus during my long career in the area of data integration. I work for Daimler TSS as Database Professional and Data Architect with over 20 years of experience in Data Warehouse projects. I have been working with Hadoop and NoSQL since 2013. I keep my knowledge up to date, and I learn new things, experiment, and program every day.

I share my knowledge in internal presentations or as a speaker at international conferences. I regularly give a full lecture on Data Warehousing and a seminar on modern data architectures at Baden-Wuerttemberg Cooperative State University (DHBW). I also gained international experience through a two-year project in Greater London and several business trips to Asia.

I'm responsible for In-Memory DB Computing at the independent German Oracle User Group (DOAG) and was honored by Oracle as ACE Associate. I hold current certifications such as "Certified Data Vault 2.0 Practitioner (CDVP2)", "Big Data Architect", "Oracle Database 12c Administrator Certified Professional", "IBM InfoSphere Change Data Capture Technical Professional", etc.

Contact/Connect

NOT JUST AVERAGE: OUTSTANDING.

As a 100% Daimler subsidiary, we give 100 percent, always and never less. We love IT and pull out all the stops to aid Daimler's development with our expertise on its journey into the future.

Our objective: We make Daimler the most innovative and digital mobility company.


INTERNAL IT PARTNER FOR DAIMLER

+ Holistic solutions according to the Daimler guidelines

+ IT strategy

+ Security

+ Architecture

+ Developing and securing know-how

+ TSS is a partner who can be trusted with sensitive data

As subsidiary: maximum added value for Daimler

+ Market closeness

+ Independence

+ Flexibility (short decision-making process, ability to react quickly)

LOCATIONS

• Daimler TSS Germany: 7 locations, 1000 employees*
  • Ulm (Headquarters)
  • Stuttgart
  • Berlin
  • Karlsruhe
• Daimler TSS China: Hub Beijing, 10 employees
• Daimler TSS Malaysia: Hub Kuala Lumpur, 42 employees
• Daimler TSS India: Hub Bangalore, 22 employees

* as of August 2017

WHAT YOU WILL LEARN TODAY

After the end of this lecture you will be able to
• Explain Hadoop and its ecosystem

ORIGIN OF HADOOP

Pre-Google search engines (Google was founded in 1998; the underlying research project started in 1996):
• Existing search engines simply indexed on keywords within webpages
• Inadequate, given the sheer number of possible matches for any search term
• The results were primarily weighted by the number of occurrences of the search term within a page, with no account for usefulness or popularity

PageRank
• Relevance of a page is weighted based on the number of links to that page
• Provided a better search outcome than its competitors

PageRank is a great example of a data-driven algorithm that
• leverages the “wisdom of the crowd” (collective intelligence)
• can adapt intelligently as more data is available (machine learning)
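The link-based weighting can be illustrated with a small power-iteration sketch. This is a simplified textbook variant, not Google's production algorithm; the damping factor of 0.85, the iteration count, and the four-page example web are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by incoming links.

    links maps each page to the list of pages it links to.
    damping and iterations are illustrative defaults (assumptions).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small base rank ...
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # ... plus an equal share of the rank of each page linking to it.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical mini web: a, b and d all link to c; c links back to a.
web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
```

Page c collects the most incoming links and ends up with the highest rank; page a benefits from being linked by the highly ranked c.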

WHICH MAIN COMPONENTS ARE PART OF THE ORIGINAL GOOGLE SW STACK? EXPLAIN THE COMPONENTS

• Google File System (GFS): a distributed cluster file system that allows all of the disks within the Google data center to be accessed as one massive, distributed, redundant file system.
  http://research.google.com/archive/gfs.html
• MapReduce: a distributed processing framework for parallelizing algorithms across large numbers of potentially unreliable servers and being capable of dealing with massive datasets.
  http://research.google.com/archive/mapreduce.html
• BigTable: a nonrelational database system that uses the GFS for storage.
  http://research.google.com/archive/bigtable.html

WHAT ARE THE MAIN COMPONENTS IN HADOOP?

Hadoop = Open source framework for distributed computations
• Mainly written in Java
• Apache Top-Level project

Components:
• HDFS (Google: GFS): clustered filesystem (Hadoop Distributed File System)
• MapReduce: parallel processing framework
• HBase (Google: BigTable): wide-columnar NoSQL database

HDFS and MapReduce are considered Core Hadoop, though the original Google SW stack also contained BigTable (the counterpart of HBase) for fast reads

HADOOP PAGERANK
HOW DID GOOGLE USE THE COMPONENTS?

• GFS / HDFS
  • store webpages
• MapReduce
  • process webpages to identify and weigh incoming links
• BigTable / HBase
  • store results (e.g. from MapReduce) for fast access

HADOOP TIMELINE

2003: Paper "Google's File System" http://research.google.com/archive/gfs.html
2004: Paper "Google's MapReduce" http://research.google.com/archive/mapreduce.html
2006: Paper "Google's BigTable" http://research.google.com/archive/bigtable.html
2006: Doug Cutting implements Hadoop 0.1 after reading the papers above
2008: Yahoo! uses Hadoop as it solves their search engine scalability issues
2010: Facebook, LinkedIn, eBay use Hadoop
2011: Hadoop 1.0 released (end of December)
2013: Hadoop 2.2 (aka "Hadoop 2.0") released
2017: Hadoop 3.0 released

WHO HAS THE LARGEST CLUSTER?

[Figure: cluster sizes of several companies; recoverable labels: 300 PB (1100 nodes); 42000 nodes; 5.3 PB (532 nodes); 1 PB/s (short-time)]

GOOGLE MODULAR DATA CENTER

Increase data center capacity by adding 1000 new servers modules at once

Source: https://patents.google.com/patent/US20100251629

Data center: https://www.youtube.com/watch?v=zRwPSFpLX8I

GOOGLE SOFTWARE ARCHITECTURE

SAN / NAS was on the rise in the 2000s, but Google chose local, directly attached disks

Source: Harrison: Next Generation Databases, Apress 2016

HADOOP V1

Source: https://de.hortonworks.com/apache/tez/

HDFS ARCHITECTURE

Source: https://pdfs.semanticscholar.org/presentation/e67d/6df768eb171e1750b8a613884b193bf486e2.pdf

WHAT IS HADOOP ALL ABOUT?

Source: Jason Nolander, Tom Coffing: Tera-Tom Genius Series - Hadoop Architecture and SQL, Coffing Publishing 2016

DATA LAYOUT

Algorithms come to the data and not vice versa

Source: Jason Nolander, Tom Coffing: Tera-Tom Genius Series - Hadoop Architecture and SQL, Coffing Publishing 2016

DATA LAYOUT AND PROTECTION

Source: Jason Nolander, Tom Coffing: Tera-Tom Genius Series - Hadoop Architecture and SQL, Coffing Publishing 2016

HOW HDFS WORKS

• Input file is split into blocks (default block size: 64 MB in Hadoop 1, 128 MB since Hadoop 2)
• HDFS is suitable for large files only
• Splittable compression preferable: bzip2 is splittable and LZO is splittable once indexed; gzip and Snappy are not splittable as plain files, only inside container formats such as Avro, ORC, Parquet or SequenceFiles
• Each block is stored on 3 different DataNodes (default replication factor) for fault tolerance
• Many servers with local disks instead of SAN

[Figure labels: HDFS, NameNode, ingestion]
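The block and replica idea can be sketched in a few lines. This is a simplified model, assuming a 128 MB block size (the Hadoop 2 default) and a naive round-robin placement; real HDFS uses a rack-aware placement policy, and the node names are hypothetical.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=128 * MB):
    """Return the sizes of the blocks a file of file_size bytes is split into."""
    full_blocks, rest = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([rest] if rest else [])


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (naive round robin).

    Assumption: real HDFS placement is rack-aware; this only shows that
    every block ends up on several different machines.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return [
        [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    ]


blocks = split_into_blocks(300 * MB)  # two full blocks plus a 44 MB rest
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
```

If one node fails, every block is still available on two other nodes, which is what makes commodity servers with local disks a workable alternative to a SAN.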

TRANSFERRING DATA INTO HDFS AND BACK

Source: https://pdfs.semanticscholar.org/presentation/e67d/6df768eb171e1750b8a613884b193bf486e2.pdf

SOME MORE HDFS COMMANDS

Source: https://pdfs.semanticscholar.org/presentation/e67d/6df768eb171e1750b8a613884b193bf486e2.pdf

HDFS INTERFACES

• Command line
• Java API
• Web Interface

Source: https://pdfs.semanticscholar.org/presentation/e67d/6df768eb171e1750b8a613884b193bf486e2.pdf

HDFS CHALLENGES

• Optimal for handling millions of large files, rather than billions of small files, because:
  • In pursuit of responsiveness, the NameNode stores all of its file/block information in memory
  • Too many files will cause the NameNode to run out of memory
  • Too many blocks (if the blocks are small) will also cause the NameNode to run out of memory
  • Processing each block requires its own Java Virtual Machine (JVM) and (if you have too many blocks) you begin to see the limits of HDFS scalability

Source: https://pdfs.semanticscholar.org/presentation/e67d/6df768eb171e1750b8a613884b193bf486e2.pdf
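A back-of-the-envelope calculation shows why small files hurt. The sketch below assumes the commonly quoted rule of thumb of roughly 150 bytes of NameNode memory per file or block object and a 128 MB block size; both figures are assumptions for illustration and do not come from the slides.

```python
BYTES_PER_OBJECT = 150        # rule of thumb (assumption, not an exact figure)
BLOCK_SIZE = 128 * 1024**2    # assumed block size of 128 MB


def namenode_memory_bytes(num_files, avg_file_size):
    """Rough NameNode memory estimate: one object per file plus one per block."""
    blocks_per_file = max(1, -(-avg_file_size // BLOCK_SIZE))  # ceiling division
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


# The same 1 TiB of data, stored as large files or as small files.
large_files = namenode_memory_bytes(num_files=1024, avg_file_size=1024**3)
small_files = namenode_memory_bytes(num_files=1024**2, avg_file_size=1024**2)
```

Under these assumptions the small-file layout needs over 200 times more NameNode memory for the same amount of data.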

MAP REDUCE PARALLEL PROCESSING FRAMEWORK

Source: Harrison: Next Generation Databases, Apress 2016

MAP REDUCE SAMPLE CODE
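As a minimal sketch of the programming model (not the lecture's own sample and not the Hadoop Java API), a word count can be written as a map phase, a shuffle that groups values by key, and a reduce phase. Everything runs in one process here; in Hadoop the map and reduce calls would be distributed across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["Sun shine is bright", "Moon light is mild"])
```

Because each map call only sees one line and each reduce call only sees one key, both phases can be parallelized without coordination.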

HBASE – WIDE COLUMNAR NOSQL DATABASE
RELATIONAL DATA MODEL VS WIDE COLUMNAR MODEL

Source: Guy Harrison: Next generation databases, Apress 2015, p.33

EXERCISE: HBASE DATA MODEL

• Create data models for time series data and for a bill of materials

Bill of materials:
• Car1
  • Engine1: A, B
  • ATM1: C
• Car2
  • Engine2: A, X
  • ATM1: C

Time series data:
Sensor1, 17.01.2012 18:00:00, temperature: 15.1°, speed: 3.1km/h
Sensor1, 17.01.2012 18:00:01, temperature: 15.1°
Sensor2, 17.01.2012 18:00:01, temperature: 85.1F, speed: 10.5km/h

HBASE – TIME SERIES DATA, E.G. SENSOR DATA

Rowkey    Timestamp             Temperature   Speed
Sensor1   17.01.2012 18:00:00   15.1°         3.1km/h
Sensor1   17.01.2012 18:00:01   15.1°
Sensor2   17.01.2012 18:00:01   85.1F         10.5km/h

• Or better: split measurement and unit into separate fields
• Can become slow: should be searchable

HBASE – TIME SERIES DATA, PERFORMANCE OPTIMIZED

Rowkey = MetricKey + Basetimestamp
• MetricKey: coded metric (e.g. 01) like Sensor-ID, CPU, etc.
• Basetimestamp: hourly timestamp, e.g. 17.01.2012 18:00:00 = 1326823200

Rowkey                  Time-Offset: Value
Sensor1 || 1326823200   +0: 15.1° (t1), +01: 15.1° (t2)
Sensor2 || 1326823200   +01: 15.1F (t3)
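The row key construction can be sketched as follows: the timestamp is rounded down to the full hour for the row key, and the remaining seconds become the column offset. The function name, the "||" separator without spaces, and the UTC interpretation of the timestamps are assumptions; a plain dictionary stands in for the HBase table.

```python
from datetime import datetime, timezone


def row_key_and_offset(metric_key, ts, bucket_seconds=3600):
    """Return (row key, offset in seconds) for one measurement.

    Assumption: timestamps are UTC and rows are bucketed per hour.
    """
    epoch = int(ts.timestamp())
    base = epoch - epoch % bucket_seconds
    return f"{metric_key}||{base}", epoch - base


# A dictionary of dictionaries stands in for the wide-column table:
# one row per sensor and hour, one column per second offset.
table = {}
measurements = [
    ("Sensor1", datetime(2012, 1, 17, 18, 0, 0, tzinfo=timezone.utc), "15.1°"),
    ("Sensor1", datetime(2012, 1, 17, 18, 0, 1, tzinfo=timezone.utc), "15.1°"),
]
for sensor, ts, value in measurements:
    key, offset = row_key_and_offset(sensor, ts)
    table.setdefault(key, {})[offset] = value
```

Both measurements land in the same row, so reading one hour of data for a sensor is a single selective row lookup instead of a scan over many rows.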

HBASE – BILL OF MATERIALS

Rowkey    Columns
Car1
Car2
Engine1   Car1
Engine2   Car2
ATM1      Car1, Car2
A         Engine1, Engine2
B         Engine1
C         ATM1, ATM1
X         Engine2
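One way to read this layout: every component is a row key, and its columns name the assemblies it is used in. The sketch below derives such rows from the bill of materials of the exercise (Car1 with Engine1 (A, B) and ATM1 (C); Car2 with Engine2 (A, X) and ATM1 (C)). The nested dictionary and the function name are illustrative assumptions, not an HBase API.

```python
from collections import defaultdict

# Bill of materials from the exercise as a nested dictionary (illustrative).
bom = {
    "Car1": {"Engine1": ["A", "B"], "ATM1": ["C"]},
    "Car2": {"Engine2": ["A", "X"], "ATM1": ["C"]},
}


def where_used_rows(bom):
    """Build one row per component; the row lists the parents it is used in."""
    rows = defaultdict(list)
    for car, assemblies in bom.items():
        rows[car]  # top-level products get a row without parent columns
        for assembly, parts in assemblies.items():
            if car not in rows[assembly]:
                rows[assembly].append(car)
            for part in parts:
                rows[part].append(assembly)
    return dict(rows)


rows = where_used_rows(bom)
```

A "where is part A used?" question then becomes a single row lookup on the key A.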

HBASE VS HDFS

HDFS / MapReduce (Hadoop)     HBase based on HDFS
Batch                         Interactive (ms)
Sequential reads and writes   Random reads and writes
Optimized for full scans      Optimized for selective queries or short scans
Append-only                   Inserts, updates and deletes

How can all these features be possible on HDFS?

HBASE ARCHITECTURE

HBASE ARCHITECTURE – DATA DISTRIBUTION

HBASE ARCHITECTURE – REGIONSERVER WRITES

HBASE ARCHITECTURE – REGIONSERVER READS

HBASE COMPACTIONS

Merge data files and sort row keys (server stays online)
• Minor
  • Merge HFiles (>= 2) into a new HFile
• Major
  • additionally: Delete data from delete-operations
  • additionally: Delete expired cells
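The interplay of writes, reads and compactions can be sketched with a toy model: writes go to an in-memory store that is flushed into immutable sorted files, deletes are recorded as markers, and compactions merge the files. The class and its names are hypothetical, cell expiry is left out, and the minor compaction here merges all files for brevity (HBase normally picks a subset).

```python
TOMBSTONE = object()  # marker written by a delete operation


class Region:
    """Toy model of an append-only store with flushes and compactions."""

    def __init__(self, flush_threshold=2):
        self.memstore = {}   # in-memory writes, not yet in a file
        self.hfiles = []     # immutable sorted files, oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def delete(self, key):
        # Files are immutable, so a delete only writes a marker.
        self.put(key, TOMBSTONE)

    def flush(self):
        if self.memstore:
            self.hfiles.append(sorted(self.memstore.items()))
            self.memstore = {}

    def get(self, key):
        value = self.memstore.get(key)
        if key not in self.memstore:
            for hfile in reversed(self.hfiles):  # newest file wins
                cells = dict(hfile)
                if key in cells:
                    value = cells[key]
                    break
        return None if value is TOMBSTONE else value

    def compact(self, major=False):
        merged = {}
        for hfile in self.hfiles:  # later (newer) files overwrite older cells
            merged.update(hfile)
        if major:
            # Only a major compaction physically removes deleted data.
            merged = {k: v for k, v in merged.items() if v is not TOMBSTONE}
        self.hfiles = [sorted(merged.items())] if merged else []


region = Region()
region.put("row1", "a")
region.put("row2", "b")   # second write triggers a flush
region.put("row1", "c")   # update is just a newer cell
region.delete("row2")     # delete marker, triggers another flush
```

This is how updates and deletes become possible on top of append-only files: newer cells shadow older ones on read, and compactions clean up afterwards.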

ENVIRONMENTS FOR DATA ENGINEERING WITH SEPARATE PRODUCTION CLUSTERS

Lars George, Paul Wilkinson, Ian Buss, Jan Kunigk: Architecting Modern Data Platforms, O'Reilly 2018

HADOOP V1 VS HADOOP V2

Source: https://de.hortonworks.com/apache/tez/

HADOOP 1 VS HADOOP 2

• Name Node is no longer a single point of failure
  • Manual switch-over
• YARN (Yet Another Resource Negotiator) improves scalability and flexibility by splitting the roles of the Job Tracker into two processes:
  • Resource Manager controls access to the cluster's resources (memory, CPU, etc.)
  • Application Master (one per application) controls task execution within containers
• YARN allows other engines to be used, not just MapReduce

YARN (YET ANOTHER RESOURCE NEGOTIATOR)

YARN takes over resource management from MapReduce and introduces a layer that serves different engines; MapReduce remains available as one of them

Source: https://de.hortonworks.com/apache/yarn/

YARN ARCHITECTURE

• Resource Manager: accepts job submissions, allocates resources
• Node Manager: a monitoring and reporting agent of the Resource Manager
• Application Master: created for each application to negotiate for resources and work with the Node Manager to execute and monitor tasks
• Container: controlled by Node Managers and assigned the system resources

Source: https://searchdatamanagement.techtarget.com/definition/Apache-Hadoop-YARN-Yet-Another-Resource-Negotiator
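How these roles interact can be sketched with a toy model. The classes below are hypothetical and not the YARN API; memory is the only resource, and the first-fit allocation is an assumption (real YARN uses pluggable schedulers).

```python
class NodeManager:
    """Tracks the free memory and the running containers of one node."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb
        self.containers = []

    def launch(self, app_id, memory_mb):
        self.free_mb -= memory_mb
        self.containers.append((app_id, memory_mb))


class ResourceManager:
    """Grants containers on the first node with enough free memory."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id, memory_mb):
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.launch(app_id, memory_mb)
                return node.name
        return None  # no capacity left; the request would have to wait


class ApplicationMaster:
    """One per application: negotiates a container for each of its tasks."""

    def __init__(self, app_id, resource_manager):
        self.app_id = app_id
        self.resource_manager = resource_manager

    def run(self, task_memory_mb):
        return [self.resource_manager.allocate(self.app_id, mb) for mb in task_memory_mb]


nodes = [NodeManager("node1", 4096), NodeManager("node2", 2048)]
app = ApplicationMaster("app-1", ResourceManager(nodes))
granted = app.run([2048, 2048, 2048, 1024])
```

The Resource Manager knows nothing about MapReduce; any engine that brings its own Application Master can request containers the same way.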

RUNNING AN APPLICATION ON YARN (1) TO (8)

Source: https://pdfs.semanticscholar.org/presentation/e67d/6df768eb171e1750b8a613884b193bf486e2.pdf

HADOOP ECOSYSTEM

Source: https://pdfs.semanticscholar.org/presentation/e67d/6df768eb171e1750b8a613884b193bf486e2.pdf

WHICH TOOLS EXIST IN THE HADOOP ECOSYSTEM AND WHAT ARE THEIR FUNCTIONS?

• Monitoring
• Database management systems
• Streaming
• Machine Learning
• Security
• Workflow Scheduler
• Data Ingestion

WHICH COMMERCIAL DISTRIBUTIONS EXIST?

Source: https://blogs.gartner.com/merv-adrian/2017/12/29/december-2017-tracker-wheres-hadoop/

HIVE: SQL-LIKE ACCESS ON FILES STORED ON HDFS
INITIALLY DEVELOPED BY FACEBOOK (2007/2008)

SQL > SELECT sum( income ) from calculation group by location

[Figure labels: HDFS, Resource manager]

HIVE ARCHITECTURE

Source: https://www.researchgate.net/figure/Apache-hive-architecture-9_fig1_319193375

HIVE SAMPLE WITH JSON DATA
VIEW FILE

[root@sandbox ~]# cat Sample-Json-simple.json
{"username":"abc","tweet":"Sun shine is bright.","timestamp": 1366150681 }
{"username":"xyz","tweet":"Moon light is mild .","timestamp": 1366154481 }
[root@sandbox ~]#

HIVE SAMPLE WITH JSON DATA
LOAD FILE INTO HDFS

[root@sandbox ~]# hadoop fs -mkdir /user/hive-simple-data/
[root@sandbox ~]# hadoop fs -put Sample-Json-simple.json /user/hive-simple-data/

HIVE SAMPLE WITH JSON DATA
CREATE HIVE TABLE

hive> CREATE EXTERNAL TABLE simple_json_table (
  username string,
  tweet string,
  time1 string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/user/hive-simple-data/';
OK
Time taken: 0.433 seconds

HIVE SAMPLE WITH JSON DATA
SELECT DATA FROM HIVE TABLE

hive> select * from simple_json_table ;
OK
abc Sun shine is bright. 1366150681
xyz Moon light is mild . 1366154481
Time taken: 0.146 seconds, Fetched: 2 row(s)
hive>

HIVE – CREATE TABLE EXAMPLES
CSV, JSON, AVRO, PARQUET, ORC, ETC.

CREATE EXTERNAL TABLE my_table STORED AS AVRO LOCATION

'/user/…/my_table_avro/'

TBLPROPERTIES ('avro.schema.url'='hdfs:///user/…/my_table.avsc');

CREATE EXTERNAL TABLE external_parquet

(c1 INT, c2 STRING, c3 TIMESTAMP)

STORED AS PARQUET LOCATION '/user/myDirectory';

CREATE EXTERNAL TABLE IF NOT EXISTS Cars (

Name STRING,

Origin CHAR(1))

ROW FORMAT DELIMITED

FIELDS TERMINATED BY ','

STORED AS TEXTFILE

location '/user/myDirectory';

ADVANTAGES AND DISADVANTAGES OF HIVE

Advantages:
• Higher level query language
• SQL is widely known
• Simplifies working with data
• Gentler learning curve compared to MapReduce or other tools like Pig

Disadvantages:
• High latency / no real-time capability
  • use HBase instead, but HBase is only for very selective queries
• Updates and deletes are slow (but available in recent releases)

OTHER TOOLS

• Sqoop, a utility for exchanging data with relational databases, either by importing relational tables into HDFS files or by exporting HDFS files to relational databases.
• Oozie, a workflow scheduler that allows complex workflows to be constructed from lower level jobs (for instance, running a Sqoop job prior to a MapReduce application).
• Hue / Ambari, graphical user interfaces that simplify Hadoop administrative and development tasks.
• Knox / Ranger / Sentry, tools for secure data access, identity control, security monitoring, etc.

STORAGE OPTIMIZATION - COMPRESSION

Source: White, Tom - Hadoop The Definitive Guide 3rd Edition - OReilly 2012

SERDE – SERIALIZATION AND DESERIALIZATION

• Different storage formats ("schemas")
  • Schema-on-read: JSON, CSV, HTML, …
  • Schema-on-write: AVRO, PARQUET, ORC, THRIFT, PROTOCOL BUFFER, …
    • + structural integrity
    • + guarantees on what can and can't be stored
    • + prevent corruption
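The difference between the two approaches can be sketched in a few lines. The schema below reuses the fields of the tweet sample (username, tweet, timestamp); the function names are hypothetical, and JSON is used as the storage format in both cases only to keep the sketch short (real schema-on-write formats such as Avro are binary).

```python
import json

# Assumed schema for the tweet sample: field name -> expected type.
SCHEMA = {"username": str, "tweet": str, "timestamp": int}


def write_with_schema(record, schema=SCHEMA):
    """Schema-on-write: reject a record before it is stored."""
    if set(record) != set(schema):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for field, expected_type in schema.items():
        if not isinstance(record[field], expected_type):
            raise ValueError(f"{field} must be {expected_type.__name__}")
    return json.dumps(record)


def read_with_schema(line, schema=SCHEMA):
    """Schema-on-read: anything can be stored, the schema is applied when reading."""
    raw = json.loads(line)
    return {field: raw.get(field) for field in schema}


stored = write_with_schema({"username": "abc", "tweet": "Sun shine is bright.", "timestamp": 1366150681})

# A record with a missing field is silently accepted by schema-on-read ...
incomplete = read_with_schema('{"username": "xyz", "tweet": "Moon light is mild ."}')

# ... but rejected by schema-on-write.
try:
    write_with_schema({"username": "xyz", "tweet": "Moon light is mild ."})
    rejected = False
except ValueError:
    rejected = True
```

Schema-on-write moves the failure to the producer at write time, while schema-on-read leaves every reader to cope with missing or malformed fields.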

SERDE – SERIALIZATION AND DESERIALIZATION

File format | Description | Code generation | Schema evolution | Splittable compression | Apache Hive support
AVRO | row storage format | optional | Yes | Yes | Yes
PARQUET | columnar storage format | No | Yes | Yes | Yes
ORCFILE | columnar storage format | No | Yes | Yes | Yes
PROTOCOL BUFFER | originally designed by Google with interface description language to generate code | optional | Yes | No | No
THRIFT | data serialization format designed at Facebook similar to PROTOCOL BUFFER | mandatory | Yes | No | No

STORAGE OPTIMIZATION – SERIALIZATION AND DESERIALIZATION FORMATS

• CSV / JSON / XML
  • Use text-based formats
• Avro
  • lightweight and fast data serialization and deserialization
  • Widely used
• Parquet
  • column-oriented data serialization standard for efficient data analytics
• ORCFile, Protocol Buffers (invented by Google), Sequence Files, etc.

SERDE – COMPARISON FILE SIZE

Owen O'Malley: File format benchmark: Avro, JSON, ORC, and Parquet https://conferences.oreilly.com/strata/strata-ny-2016/public/schedule/detail/51952

SERDE – COMPARISON READ PERFORMANCE

Owen O'Malley: File format benchmark: Avro, JSON, ORC, and Parquet https://conferences.oreilly.com/strata/strata-ny-2016/public/schedule/detail/51952

STORAGE OPTIMIZATION – PERFORMANCE TESTS BY CERN

Source: https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines

SCHEMA-ON-READ

Flexibility
• For whom? Writing the data vs reading the data

Simplicity
• For whom? Writing the data vs reading the data
• Human mistakes while trying to read the data

Agility / Model as you go
• Just copy files into the directory

SCHEMA-ON-READ - WHAT ABOUT SECURITY?

• GDPR – General Data Protection Regulation (Datenschutz-Grundverordnung)
  • Right to be forgotten
  • Data protection by design and by default
  • Data portability
  • Severe penalties of up to 4% of worldwide annual turnover
• How can these requirements be met with schema-on-read?

Hadoop is
• A distributed file storage
• A mainly batch-oriented processing framework for parallelization
• Flexible and scalable
• Suitable for highly diverse data with low information density
• Fault tolerant and robust
• A long-term storage

Hadoop is not
• A relational database
• A self-service BI tool
• Suitable for transactional data
• Suitable for small data (files)
• Easy for development and operations
• Yet mature

THANK YOU

Daimler TSS GmbH
Wilhelm-Runge-Straße 11, 89081 Ulm / Phone +49 731 505-06 / Fax +49 731 505-65 99
[email protected] / Internet: www.daimler-tss.com / Intranet-Portal-Code: @TSS
Domicile and Court of Registry: Ulm / HRB-Nr.: 3844 / Management: Christoph Röger (CEO), Steffen Bäuerle

DATA ENGINEERING / DATA PIPELINE / ETL / ELT

Lars George, Paul Wilkinson, Ian Buss, Jan Kunigk: Architecting Modern Data Platforms, O'Reilly 2018