Frontiers in Massive Data Analysis Chapter 3

Scaling the Infrastructure for Data Management

Frontiers in Massive Data Analysis Chapter 3

Difficult to include data from multiple sources

Each organization develops a unique way of representing the data

Organizations are codeveloping shared metadata structures

Scaling Data Sets

Instead of developing a complicated metadata structure, different organizations share their data with a basic set of operations

More complex tools are developed as they are needed

Dataspaces

Data created by mining confidential data must meet certain legal and corporate privacy requirements

Private data has to be protected from malicious users as well

Sharing Private Data

Raw processing speed is no longer increasing as quickly as it once did, so manufacturers are moving towards more processors instead of faster processors

I/O performance has to increase to meet the requirements of supporting multiple cores simultaneously

Distributed Systems

Hardware elements that can perform specialized tasks quickly

GPUs are often used for fast floating-point calculations, but are constrained by I/O bottlenecks and a limited set of software tools

Hardware Parallelism

CPUs have become more parallel by increasing both the number of cores per socket and the number of operations that can be executed per clock cycle

More cores running at a lower clock speed can deliver better overall performance and power efficiency

Multicore CPUs

The DSMS runs queries on (typically real time) input streams

The feeds are analyzed and summarized continuously

Data Stream Management Systems

Can use a structured query language similar to SQL that uses windowing to limit how much data is analyzed

Can also use a “boxes-and-arrows” system that provides a graphical interface. The user selects what tasks execute in a box and connects the boxes with arrows to define how data is analyzed

DSMS Methods
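
To make the windowing idea concrete, here is a minimal sketch in Python (not the query language of any particular DSMS). It assumes a count-based sliding window; the function name `windowed_average` and the sample readings are made up for illustration.

```python
from collections import deque


def windowed_average(stream, window_size):
    """Yield the mean of the most recent `window_size` readings after each arrival."""
    # A bounded deque drops the oldest reading automatically,
    # so only the current window is ever held in memory.
    window = deque(maxlen=window_size)
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)


# Hypothetical sensor feed; a real stream would be unbounded.
readings = [10, 20, 30, 40, 50]
averages = list(windowed_average(readings, window_size=3))
print(averages)
```

Because the window is bounded, the cost per arriving reading stays fixed no matter how long the feed runs, which is the point of limiting how much data is analyzed.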

A clustered system consists of multiple high performance nodes that execute submitted jobs

Think of the HPC systems on campus

A job manager controls load balancing and queue management

Clustered Batch Systems
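
As a rough illustration of what a job manager does, the sketch below drains a first-in, first-out queue and hands each job to the node with the least accumulated load. The policy, the `schedule` function, and the job costs are assumptions made for this example; real batch schedulers use far richer policies.

```python
import heapq
from collections import deque


def schedule(jobs, node_names):
    """Assign each queued (job, cost) pair to the currently least-loaded node."""
    # Min-heap of (accumulated load, node name): the top is the least-loaded node.
    heap = [(0, name) for name in node_names]
    heapq.heapify(heap)
    queue = deque(jobs)  # jobs are served in submission order
    assignment = {}
    while queue:
        job, cost = queue.popleft()
        load, node = heapq.heappop(heap)
        assignment[job] = node
        heapq.heappush(heap, (load + cost, node))
    return assignment


# Hypothetical jobs with estimated costs (arbitrary units).
jobs = [("j1", 4), ("j2", 2), ("j3", 1), ("j4", 3)]
assignment = schedule(jobs, ["n1", "n2"])
print(assignment)
```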

Provides access to distributed file systems stored on different servers

The user is presented with a standard file system that hides the underlying distributed systems

Cluster File Systems

POSIX compliant systems provide the same interface that a standalone file system would provide

Makes it simple to convert programs to use clustered resources

POSIX Compliant File Systems

Metadata is managed separately by dedicated servers which forward client requests to the correct file server

Distributed systems run into synchronization issues as the cluster grows large

POSIX Compliant File Systems

These systems were designed to solve the issues that POSIX systems encounter in large clusters

Metadata is still handled by dedicated servers

Non-POSIX Compliant File Systems

Designed to handle distributed analysis tasks

Uses a large block size (64 MB) to minimize metadata requests by clients

Clients are expected to handle inconsistencies in the file systems by comparing checksums

Google File System
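
The two ideas on this slide can be sketched in a few lines of Python. This is not the Google File System API; it only assumes that a client knows the expected checksum for a block and can try several replicas. `blocks_needed`, `read_verified`, and the sample data are hypothetical names chosen for the example.

```python
import zlib

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB block size, as stated on the slide


def blocks_needed(file_size):
    """Number of blocks (and so block lookups) needed for a file of this size."""
    return -(-file_size // BLOCK_SIZE)  # ceiling division


def read_verified(replicas, expected_checksum):
    """Return the first replica whose CRC32 matches the expected checksum."""
    for data in replicas:
        if zlib.crc32(data) == expected_checksum:
            return data
    raise IOError("all replicas failed checksum verification")


one_gib = 1024 * 1024 * 1024
print(blocks_needed(one_gib))  # far fewer lookups than with small blocks

good = b"record-1,record-2"
corrupt = b"record-1,recXrd-2"  # simulated corrupted replica
result = read_verified([corrupt, good], zlib.crc32(good))
```

The client skips the corrupted copy and falls back to a replica that passes verification, which is one simple way to handle the inconsistencies mentioned above.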

The map step runs on a collection of nodes, each working on its own partition of the data; the shuffle step then hashes the mapped output so that records with a common key are passed to the same node

Simplifies analysis on distributed data

MapReduce
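
A single-process sketch of the map, shuffle, and reduce steps is shown below, using word counting as the example task. It is not the API of any MapReduce framework; the function names, the use of CRC32 as the hash, and the input lines are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict


def map_phase(partition):
    """Emit a (word, 1) pair for every word in one partition of the input."""
    return [(word, 1) for line in partition for word in line.split()]


def shuffle(pairs, num_nodes):
    """Route each pair to a node chosen by hashing its key."""
    nodes = [defaultdict(list) for _ in range(num_nodes)]
    for key, value in pairs:
        # CRC32 is stable across runs, so equal keys always land on the same node.
        node_id = zlib.crc32(key.encode()) % num_nodes
        nodes[node_id][key].append(value)
    return nodes


def reduce_phase(node):
    """Combine all values that share a key on one node."""
    return {key: sum(values) for key, values in node.items()}


# Two input partitions, as if stored on two different nodes.
partitions = [["a b a"], ["b c"]]
mapped = [pair for part in partitions for pair in map_phase(part)]
nodes = shuffle(mapped, num_nodes=2)

counts = {}
for node in nodes:
    counts.update(reduce_phase(node))
print(counts)
```

Since every occurrence of a key reaches the same node, each reducer can finish its keys without consulting any other node.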

Resources in a multi-tenant cluster are dynamically allocated as a user’s needs change

Allows users to gain access to large systems without the overhead associated with maintaining a large cluster

Cloud Systems

Databases reliably store and retrieve data and can provide querying over the data sets

Large parallel databases are spread over multiple servers, without a cluster file system managing the nodes

Distributed Databases

Data can be partitioned either by spreading rows evenly among the nodes or by hashing one or more fields to choose the node

The nodes evaluate queries on local partitions then combine the results from each node

Distributed Databases Organization
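
The sketch below illustrates hash partitioning followed by a query that is evaluated locally and then combined. It simulates the nodes as in-memory lists; `partition_rows`, `local_sum`, and the sample rows are hypothetical, and CRC32 stands in for whatever hash function a real system would use.

```python
import zlib


def node_for(value, num_nodes):
    """Pick a node by hashing the partitioning field's value."""
    return zlib.crc32(str(value).encode()) % num_nodes


def partition_rows(rows, field, num_nodes):
    """Distribute rows across nodes based on a hash of one field."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[node_for(row[field], num_nodes)].append(row)
    return nodes


def local_sum(partition, field):
    """The part of the query each node evaluates on its own partition."""
    return sum(row[field] for row in partition)


rows = [
    {"customer": "ann", "amount": 10},
    {"customer": "bob", "amount": 5},
    {"customer": "ann", "amount": 7},
    {"customer": "cy", "amount": 3},
]
nodes = partition_rows(rows, "customer", num_nodes=3)
partials = [local_sum(part, "amount") for part in nodes]  # one result per node
total = sum(partials)  # combine step
print(total)
```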

If certain tables are frequently joined together in queries, store them on the same node

When joining tables from different nodes, transfer the smaller of the two

Improving Distributed Database Efficiency
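
The rule of transferring the smaller table can be sketched as a simple hash join. This is an illustration only: row count is used as a stand-in for transfer cost, the two tables are plain lists of dictionaries, and `join_shipping_smaller` is a made-up name.

```python
def join_shipping_smaller(table_a, table_b, on):
    """Join two tables that live on different nodes, moving the smaller one.

    Returns the joined rows and the number of rows that had to be moved.
    """
    if len(table_a) <= len(table_b):
        shipped, stationary = table_a, table_b
    else:
        shipped, stationary = table_b, table_a

    # Build a lookup on the shipped table once it "arrives" at the other node.
    index = {}
    for row in shipped:
        index.setdefault(row[on], []).append(row)

    joined = [
        {**match, **row}
        for row in stationary
        for match in index.get(row[on], [])
    ]
    return joined, len(shipped)


customers = [{"cid": 1, "name": "ann"}, {"cid": 2, "name": "bob"}]
orders = [
    {"cid": 1, "item": "disk"},
    {"cid": 1, "item": "cpu"},
    {"cid": 2, "item": "ram"},
]
joined, rows_moved = join_shipping_smaller(orders, customers, on="cid")
print(rows_moved, joined)
```

Here the two-row customer table is moved rather than the three-row order table. If the two tables had been stored on the same node, as the first bullet suggests, nothing would need to be moved at all.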

Parallel databases are very difficult to tune and populate with data

Very difficult to develop and debug parallel programs

Parallel Complications
