Scaling the Infrastructure for Data Management Frontiers in Massive Data Analysis Chapter 3

Scaling the Infrastructure for Data Management

Frontiers in Massive Data Analysis Chapter 3

Difficult to include data from multiple sources

Each organization develops a unique way of representing the data

Organizations are codeveloping shared metadata structures

Scaling Data Sets

Instead of developing a complicated metadata structure, different organizations share their data with a basic set of operations

More complex tools are developed as they are needed

Dataspaces

Data created from mining confidential information must meet certain legal and corporate privacy requirements

Private data has to be protected from malicious users as well

Sharing Private Data

Raw processing speed is no longer increasing as quickly as it once did, so manufacturers are moving towards more processors instead of faster processors

I/O performance has to increase to meet the requirements of supporting multiple cores simultaneously

Distributed Systems

Hardware elements that can perform specialized tasks quickly

GPUs are often used for fast floating-point computation, but are held back by I/O bottlenecks and a limited set of software tools

Hardware Parallelism

CPUs have become more parallel by increasing both the number of cores per socket and the number of operations that can be executed per clock cycle

More cores running at a lower clock speed can deliver better performance and power efficiency than fewer, faster cores

Multicore CPUs

The DSMS runs queries on (typically real time) input streams

The feeds are analyzed and summarized continuously

Data Stream Management Systems

Can use a structured query language similar to SQL that uses windowing to limit how much data is analyzed

Can also use a “boxes-and-arrows” system that provides a graphical interface. The user selects what tasks execute in a box and connects the boxes with arrows to define how data is analyzed

DSMS Methods
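
The windowing idea can be sketched in a few lines of Python. This is only an illustration, not the query language of any particular DSMS: the function name windowed_average, the (timestamp, value) input format, and the 3-second window are all assumptions made for the example. Each arriving reading triggers a new result computed over the readings still inside the time window.

```python
from collections import deque
from typing import Iterable, Iterator, Tuple


def windowed_average(
    stream: Iterable[Tuple[float, float]], window_seconds: float
) -> Iterator[Tuple[float, float]]:
    """Yield (timestamp, average) over the readings of the last window_seconds.

    Hypothetical helper: it mimics a continuous query with a time-based
    sliding window, so only recent data is ever held in memory.
    """
    window = deque()  # readings currently inside the window
    total = 0.0
    for timestamp, value in stream:
        window.append((timestamp, value))
        total += value
        # Evict readings that have fallen out of the window (t - w, t].
        while window and window[0][0] <= timestamp - window_seconds:
            _, old_value = window.popleft()
            total -= old_value
        yield timestamp, total / len(window)


# Example feed of (timestamp in seconds, sensor value); the data is made up.
readings = [(0, 10.0), (1, 20.0), (2, 30.0), (5, 40.0)]
results = list(windowed_average(readings, window_seconds=3))
```

Because old readings are evicted as new ones arrive, memory use depends on the window size rather than on the total length of the stream.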

A clustered system consists of multiple high performance nodes that execute submitted jobs

Think of the HPC systems on campus

A job manager controls load balancing and queue management

Clustered Batch Systems
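
As a rough sketch of what a job manager does, the Python below takes jobs from a first-in, first-out queue and hands each one to the node with the least work assigned so far. Real batch schedulers use far richer policies (priorities, reservations, fair share); the function name schedule, the node names, and the estimated runtimes are all hypothetical.

```python
import heapq
from collections import deque


def schedule(jobs, node_names):
    """Assign queued (job_name, estimated_runtime) pairs to the least-loaded node.

    Illustrative policy only: jobs are taken in FIFO order, and each goes to
    the node with the smallest total estimated runtime assigned so far.
    """
    queue = deque(jobs)  # queue management: first in, first out
    # Min-heap of (assigned runtime, node name) for load balancing.
    loads = [(0, name) for name in node_names]
    heapq.heapify(loads)
    assignment = {name: [] for name in node_names}
    while queue:
        job_name, runtime = queue.popleft()
        load, node = heapq.heappop(loads)
        assignment[node].append(job_name)
        heapq.heappush(loads, (load + runtime, node))
    return assignment


# Made-up jobs with estimated runtimes in hours.
jobs = [("a", 4), ("b", 2), ("c", 1), ("d", 3)]
plan = schedule(jobs, ["node1", "node2"])
```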

Provides access to distributed file systems stored on different servers

The user is presented with a standard file system that hides the underlying distributed systems

Cluster File Systems

POSIX-compliant systems provide the same interface as a standalone file system

Makes it simple to convert programs to use clustered resources

POSIX Compliant File Systems

Metadata is managed separately by dedicated servers, which forward client requests to the correct file server

Distributed systems run into synchronization issues as the cluster grows large

POSIX Compliant File Systems

These systems were designed to solve the issues that POSIX systems encounter in large clusters

Metadata is still handled by dedicated servers

Non-POSIX Compliant File Systems

Designed to handle distributed analysis tasks

Uses a large block size (64 MB) to minimize metadata requests by clients

Clients are expected to handle inconsistencies in the file system by comparing checksums

Google File System
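
Two of the points above lend themselves to a small Python sketch: how block size affects the number of blocks a client has to ask about, and how a checksum comparison can reject a corrupted copy. This is not the actual GFS protocol. The helper names chunk_count and pick_valid_replica, the 4 KB comparison size, and the use of CRC32 are assumptions made for the example.

```python
import zlib

CHUNK_SIZE = 64 * 1024 * 1024  # the 64 MB block size mentioned above


def chunk_count(file_size_bytes, chunk_size=CHUNK_SIZE):
    """Number of blocks (and so block-location lookups) needed for a file."""
    return (file_size_bytes + chunk_size - 1) // chunk_size  # ceiling division


def pick_valid_replica(replicas, expected_checksum):
    """Return the first replica whose CRC32 matches the expected checksum.

    Hypothetical client-side check: returns None if every copy is corrupted.
    """
    for data in replicas:
        if zlib.crc32(data) == expected_checksum:
            return data
    return None


one_gib = 1024 ** 3
large_blocks = chunk_count(one_gib)        # 64 MB blocks
small_blocks = chunk_count(one_gib, 4096)  # 4 KB blocks, for comparison

good = b"record"
corrupted = b"recorx"
chosen = pick_valid_replica([corrupted, good], zlib.crc32(good))
```

For a 1 GiB file, 64 MB blocks give 16 blocks to look up, compared with 262,144 at a 4 KB block size.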

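The map phase runs on a collection of nodes, each processing its own partition of the data; the shuffle phase then hashes the intermediate records by key so that records with the same key are passed to the same node, where they are reduced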

Simplifies analysis on distributed data

MapReduce
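
A single-process word count in Python can illustrate the map, shuffle, and reduce steps. It is a minimal sketch, not how any production framework is implemented: the function names, the choice of CRC32 as the hash, and the sample partitions are assumptions made for the example, and the "nodes" are just lists and dictionaries in memory.

```python
import zlib
from collections import defaultdict


def map_phase(partition):
    """Map: emit a (word, 1) pair for every word in one partition of lines."""
    for line in partition:
        for word in line.split():
            yield word.lower(), 1


def shuffle(mapped_pairs, num_reducers):
    """Shuffle: hash each key so equal keys land in the same reducer bucket."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in mapped_pairs:
        # CRC32 is a stable hash, so a key always maps to the same bucket.
        index = zlib.crc32(key.encode("utf-8")) % num_reducers
        buckets[index][key].append(value)
    return buckets


def reduce_phase(bucket):
    """Reduce: sum the values collected for each key in one bucket."""
    return {key: sum(values) for key, values in bucket.items()}


def word_count(partitions, num_reducers=2):
    mapped = (pair for partition in partitions for pair in map_phase(partition))
    counts = {}
    for bucket in shuffle(mapped, num_reducers):
        counts.update(reduce_phase(bucket))
    return counts


# Two made-up partitions, standing in for data held on two different nodes.
partitions = [["big data big"], ["data streams"]]
totals = word_count(partitions)
```

The analyst writes only the map and reduce functions; partitioning and shuffling are the parts a framework would handle.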

Resources in a multi-tenant cluster are dynamically allocated as a user’s needs change

Allows users to gain access to large systems without the overhead associated with maintaining a large cluster

Cloud Systems

Databases reliably store and retrieve data and can provide querying over the data sets

Large parallel databases are spread over servers without a cluster file system managing nodes

Distributed Databases

Data can be partitioned either by spreading rows evenly among the nodes or by hashing one or more fields to choose the node for each row

Each node evaluates the query on its local partition, then the results from all nodes are combined

Distributed Databases Organization
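
Both partitioning schemes, and the "evaluate locally, then combine" step, can be sketched in Python with lists standing in for nodes. The function names, the orders table, and the use of CRC32 as the hash are assumptions made for illustration; a real parallel database does this inside its storage and query engine.

```python
import zlib


def round_robin_partition(rows, num_nodes):
    """Spread rows evenly: row i goes to node i mod num_nodes."""
    nodes = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        nodes[i % num_nodes].append(row)
    return nodes


def hash_partition(rows, num_nodes, key):
    """Place each row on the node chosen by hashing one of its fields."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        index = zlib.crc32(str(row[key]).encode("utf-8")) % num_nodes
        nodes[index].append(row)
    return nodes


def distributed_sum(nodes, column, predicate):
    """Each node sums its matching local rows; the partial sums are combined."""
    partials = [
        sum(row[column] for row in node if predicate(row)) for node in nodes
    ]
    return sum(partials)


# Made-up table of orders.
orders = [
    {"customer": "ann", "amount": 10},
    {"customer": "bob", "amount": 25},
    {"customer": "ann", "amount": 5},
    {"customer": "cy", "amount": 40},
]
even_nodes = round_robin_partition(orders, 2)
hashed_nodes = hash_partition(orders, 2, key="customer")
total_even = distributed_sum(even_nodes, "amount", lambda r: r["amount"] >= 10)
total_hashed = distributed_sum(hashed_nodes, "amount", lambda r: r["amount"] >= 10)
```

Both layouts give the same query answer. The difference is that hash partitioning keeps all rows with the same field value on one node, which matters for joins and grouping.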

If certain tables are frequently joined together in queries, store them on the same node

When joining tables from different nodes, transfer the smaller of the two

Improving Distributed Database Efficiency
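
The "transfer the smaller of the two" rule can be illustrated with a small Python sketch in which two lists of rows stand in for tables held on different nodes. The function name distributed_join, the sample tables, and the use of row counts as a proxy for transfer cost are assumptions made for the example.

```python
def distributed_join(left_rows, right_rows, key):
    """Join two tables held on different nodes by shipping the smaller one.

    Returns (joined_rows, shipped_side, rows_sent). Hypothetical helper:
    row count stands in for the number of bytes sent over the network.
    """
    if len(left_rows) <= len(right_rows):
        shipped, stationary, shipped_side = left_rows, right_rows, "left"
    else:
        shipped, stationary, shipped_side = right_rows, left_rows, "right"

    # At the receiving node, index the shipped (smaller) table by join key.
    index = {}
    for row in shipped:
        index.setdefault(row[key], []).append(row)

    # Probe the index with each row of the table that stayed in place.
    joined = []
    for row in stationary:
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined, shipped_side, len(shipped)


# Made-up tables living on two different nodes.
customers = [
    {"customer_id": 1, "name": "Ann"},
    {"customer_id": 2, "name": "Bob"},
]
orders = [
    {"customer_id": 1, "amount": 10},
    {"customer_id": 1, "amount": 5},
    {"customer_id": 2, "amount": 25},
]
joined, shipped_side, rows_sent = distributed_join(customers, orders, "customer_id")
```

Here the two-row customers table is sent to the node holding orders, so two rows cross the network instead of three. If the two tables were already stored on the same node, no transfer would be needed at all.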

Parallel databases are very difficult to tune and populate with data

Parallel programs are very difficult to develop and debug

Parallel Complications