ISIT312 Big Data Management

Cluster Computing

Dr Guoxin Su and Dr Janusz R. Getta

School of Computing and Information Technology - University of Wollongong


Cluster Computing

Outline

Big Data

Traditional Data Architectures

Meet Hadoop !

ISIT312 Big Data Management, SIM, Session 4, 2021


Big Data

What does Big Data mean, and how big is Big Data?

Big Data is so big that it cannot be stored on the persistent storage devices attached to a single computer system

Big Data may also mean an infinite amount of data


Big Data

What are the sources of Big Data?


Big Data

Big Data is characterized by the so-called 3V features:

- Volume: e.g., billions of rows × millions of columns
- Variety: Complexity of data types and structures
- Velocity: Speed of new data creation and growth

Additional Vs:

- Veracity: Ability to represent and process uncertain and imprecise data
- Value: Data is the driving force of next-generation business
- Viability: Benefits we can potentially have from data analysis

There are many, many other Vs; the largest number of Vs I found on the Web was 42!

- Vagueness: The meaning of found data is often very unclear, regardless of how much data is available
- Validity: Rigor in analysis is essential for valid predictions where data is the driving force of next-generation business
- Vane: Data science can aid decision making by pointing in the correct direction
- ... and many, many others ... :)


Big Data

Examples of Big Data:

- Clickstream data
- Call centre data
- E-mail and instant messaging
- Sensor data
- Unstructured data
- Geographic data
- Satellite data
- Image data
- Temporal data
- and more ...



Traditional Data Architectures

Data warehousing technologies


Traditional Data Architectures

The strengths of traditional data architectures:

- Centralised governance of data repositories
- Lightning-fast queries performed regularly in daily business
- Optimisation for OLTP and OLAP
- Security and access control
- Fault tolerance and backup

The challenges for traditional data architectures:

- New types of data such as unstructured data and semi-structured data
- Increasingly large amounts of data flowing into organisations
- New computational paradigms use non-traditional NoSQL databases to rapidly mine and analyse very large data sets
- Increasing cost of storing and analysing the large amounts of data
- Increasing use of data analytics, which requires significant storage and processing capabilities


Traditional Data Architectures

A sample Data Lake architecture


Traditional Data Architectures

Hardware for Big Data has two scalability dimensions



Meet Hadoop !

Hadoop, according to its developers, is a project that develops open-source software for reliable, scalable, distributed computing

Features of Hadoop

- Capability to handle large data sets, e.g. simple scalability and coordination
- File sizes range from gigabytes to terabytes
- Can store millions of those files
- High fault tolerance
- Supports data replication
- Supports streaming access to data
- Supports batch processing
- Supports interactive, iterative and stream processing
- Implements a write-once-read-many data access model
- Runs on commodity hardware, not high-performance computers
- Inexpensive
- Can be deployed on premises or in the cloud
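To make the link between data replication and fault tolerance concrete, here is a minimal Python sketch. It is a toy model, not how HDFS is implemented: the round-robin placement, the worker names and the function names are invented for illustration, and real HDFS uses a rack-aware placement policy. The 128 MB block size and replication factor of 3 are assumed here as the usual HDFS defaults.

```python
import math

BLOCK_SIZE_MB = 128   # assumption: usual HDFS default block size
REPLICATION = 3       # assumption: usual HDFS default replication factor


def place_blocks(file_size_mb, nodes, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Split a file into fixed-size blocks and put each block on
    `replication` distinct nodes, round-robin (a toy placement policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(n_blocks)
    }


def readable_after_failure(placement, failed_nodes):
    """A file stays readable if every block keeps a replica on a live node."""
    failed = set(failed_nodes)
    return all(
        any(node not in failed for node in replicas)
        for replicas in placement.values()
    )


# Hypothetical four-worker cluster storing one 1 GB file.
nodes = ["worker1", "worker2", "worker3", "worker4"]
placement = place_blocks(file_size_mb=1024, nodes=nodes)

print(len(placement))                                                        # 8 blocks
print(readable_after_failure(placement, ["worker2"]))                        # True
print(readable_after_failure(placement, ["worker1", "worker2", "worker3"]))  # False
```

With three replicas per block, losing any one or two workers leaves the file readable; it becomes unreadable only when all replicas of some block are lost at once.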


Meet Hadoop !

Core components of Hadoop


Hadoop Ecosystem

Hadoop ecosystem


Commercial Hadoop Landscape

Commercial Hadoop landscape


Meet Hadoop !

Master-slave architecture of Hadoop clusters


Meet Hadoop !

A Hadoop cluster can support up to 10,000 servers and achieves near-linear scalability in computing power

A typical Hadoop cluster consists of:

- A set of master nodes (servers) where the daemons supporting key Hadoop frameworks run
- A set of worker nodes that host the storage (HDFS) and computing (YARN) work
- One or more edge servers, which are used for accessing the Hadoop cluster to launch applications
- One or more relational databases such as MySQL for storing the metadata repositories
- Dedicated servers for special frameworks such as Kafka
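Since the worker nodes host the HDFS storage, a rough sizing calculation shows why storage grows by adding workers. The Python sketch below is back-of-the-envelope arithmetic only. The function names and the figures (12 TB of disk per node, a replication factor of 3) are assumptions for illustration, and it ignores space reserved for the operating system, temporary data and future growth.

```python
import math


def usable_capacity_tb(worker_nodes, disk_per_node_tb, replication=3):
    """Raw disk across all workers divided by the number of copies kept of each block."""
    return worker_nodes * disk_per_node_tb / replication


def workers_needed(dataset_tb, disk_per_node_tb, replication=3):
    """Smallest number of worker nodes whose combined disks hold every replica."""
    return math.ceil(dataset_tb * replication / disk_per_node_tb)


# Hypothetical figures: 12 TB of disk per worker, three replicas of each block.
print(usable_capacity_tb(worker_nodes=10, disk_per_node_tb=12))  # 40.0
print(workers_needed(dataset_tb=100, disk_per_node_tb=12))       # 25
```

Under these assumptions, doubling the number of workers doubles the usable capacity, which is the storage side of the near-linear scalability mentioned above.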


Meet Hadoop !

Hadoop also supports the pseudo-distributed mode

- All HDFS and YARN daemons run on a single node
- Closely simulates the full cluster
- Easy for beginners to practise
- Easy for testing and debugging

Our lab setting is the pseudo-distributed mode

- The single node is an Ubuntu 14.04 Virtual Machine (VM)
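One way to check that a pseudo-distributed node is up is to list the running Java processes with the JDK tool `jps` and look for the HDFS and YARN daemons. The Python sketch below assumes that `jps` is on the PATH and that the expected daemons are the five named in the code; the helper names are hypothetical, and the lab VM may be configured differently.

```python
import subprocess

# Assumption: daemons expected on a pseudo-distributed node.
EXPECTED_DAEMONS = {
    "NameNode", "DataNode", "SecondaryNameNode",  # HDFS
    "ResourceManager", "NodeManager",             # YARN
}


def running_java_processes(jps_output):
    """Parse `jps` output, where each line is '<pid> <main class name>'."""
    names = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            names.add(parts[1])
    return names


def missing_daemons(jps_output):
    """Return the expected daemons that do not appear in the `jps` output."""
    return EXPECTED_DAEMONS - running_java_processes(jps_output)


if __name__ == "__main__":
    try:
        result = subprocess.run(["jps"], capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        print("jps not available; is a JDK on the PATH?")
    else:
        missing = missing_daemons(result.stdout)
        print("all daemons running" if not missing else f"missing: {sorted(missing)}")
```

Because everything runs on one node, a single `jps` call covers all of the daemons; on a full cluster the same check would have to be run on each master and worker node.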



Created by Janusz R. Getta, ISIT312 Big Data Management, SIM, Session 4, 2021
