gillian-manning
Scaling the Infrastructure for Data Management
Frontiers in Massive Data Analysis Chapter 3
Difficult to include data from multiple sources
Each organization develops a unique way of representing the data
Organizations are co-developing shared metadata structures
Scaling Data Sets
Instead of developing a complicated metadata structure, different organizations share their data with a basic set of operations
More complex tools are developed as they are needed
Dataspaces
Data derived from mining confidential sources must meet certain legal and corporate privacy requirements
Private data has to be protected from malicious users as well
Sharing Private Data
Raw processing speed is no longer increasing as quickly as it once did, so manufacturers are moving toward more processors instead of faster processors
I/O performance has to increase to meet the requirements of supporting multiple cores simultaneously
Distributed Systems
Hardware elements that can perform specialized tasks quickly
GPUs are often used for fast floating-point computation, but are constrained by I/O bottlenecks and a limited set of software tools
Hardware Parallelism
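CPUs have become more parallel by increasing both the number of cores per socket and the number of operations that can be executed per clock cycle
For workloads that can be parallelized, more cores at a slower clock speed can deliver better performance and power efficiency than fewer, faster cores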
Multicore CPUs
The DSMS runs queries on (typically real time) input streams
The feeds are analyzed and summarized continuously
Data Stream Management Systems
Can use a structured query language similar to SQL that uses windowing to limit how much data is analyzed
Can also use a “boxes-and-arrows” system that provides a graphical interface. The user selects what tasks execute in a box and connects the boxes with arrows to define how data is analyzed
DSMS Methods
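The windowing idea above can be sketched in a few lines. This is a minimal illustration, not the query language of any particular DSMS; the function name `sliding_window_average` and the parameter `window_size` are hypothetical, and a count-based window is assumed (real systems often window by time instead).

```python
from collections import deque
from typing import Iterable, Iterator


def sliding_window_average(stream: Iterable[float], window_size: int) -> Iterator[float]:
    """Yield the average of the most recent `window_size` readings."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    window = deque()
    total = 0.0
    for value in stream:
        window.append(value)
        total += value
        # Evict the oldest reading once the window is full, so only a
        # bounded amount of the stream is ever held in memory.
        if len(window) > window_size:
            total -= window.popleft()
        yield total / len(window)


readings = [10.0, 20.0, 30.0, 40.0]
averages = list(sliding_window_average(readings, window_size=2))
print(averages)  # [10.0, 15.0, 25.0, 35.0]
```

Because the function is a generator, it produces a summary value for each arriving reading without waiting for the stream to end, which is the continuous analysis the slides describe.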
A clustered system consists of multiple high performance nodes that execute submitted jobs
Think of the HPC systems on campus
A job manager controls load balancing and queue management
Clustered Batch Systems
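A job manager's two roles, queue management and load balancing, can be sketched as follows. This is a simplified sketch under stated assumptions: jobs are served in first-in, first-out order, each job carries an estimated cost, and each job goes to the currently least-loaded node. The function `schedule` and the node names are hypothetical; production schedulers use far richer policies.

```python
import heapq
from collections import deque


def schedule(jobs, node_names):
    """Assign queued (job_name, estimated_cost) pairs to the least-loaded node."""
    queue = deque(jobs)  # FIFO queue of submitted jobs
    # Min-heap of (current_load, node_name): the least-loaded node is on top.
    loads = [(0, name) for name in node_names]
    heapq.heapify(loads)
    assignment = {name: [] for name in node_names}
    while queue:
        job_name, cost = queue.popleft()
        load, node = heapq.heappop(loads)
        assignment[node].append(job_name)
        heapq.heappush(loads, (load + cost, node))
    return assignment


jobs = [("a", 5), ("b", 3), ("c", 2), ("d", 4)]
plan = schedule(jobs, ["node1", "node2"])
print(plan)  # {'node1': ['a', 'd'], 'node2': ['b', 'c']}
```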
Provides access to distributed file systems stored on different servers
The user is presented with a standard file system that hides the underlying distributed systems
Cluster File Systems
POSIX compliant systems provide the same interface that a standalone file system would provide
Makes it simple to convert programs to use clustered resources
POSIX Compliant File Systems
Metadata is managed separately by dedicated servers which forward client requests to the correct file server
Distributed systems run into synchronization issues as the cluster grows large
POSIX Compliant File Systems
These systems were designed to solve the issues that POSIX systems encounter in large clusters
Metadata is still handled by dedicated servers
Non-POSIX Compliant File Systems
Designed to handle distributed analysis tasks
Uses a large block size (64 MB) to minimize metadata requests by clients
Clients are expected to handle inconsistencies in the file systems by comparing checksums
Google File System
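The two ideas on this slide, large fixed-size blocks and client-side checksum comparison, can be illustrated with a small sketch. It is not the Google File System's actual client code: the helper names `chunk_index` and `pick_valid_replica` are hypothetical, and CRC32 from Python's `zlib` module is assumed as the checksum purely for illustration.

```python
import zlib

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size cited above


def chunk_index(byte_offset: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Return the index of the block that holds the given byte offset."""
    return byte_offset // chunk_size


def pick_valid_replica(replicas, expected_checksum):
    """Return the first replica whose CRC32 matches, or None if none does."""
    for data in replicas:
        if zlib.crc32(data) == expected_checksum:
            return data
    return None


original = b"record-1,record-2"
checksum = zlib.crc32(original)
replicas = [b"record-1,recXrd-2", original]  # the first copy is corrupted

print(chunk_index(200 * 1024 * 1024))                       # 3
print(pick_valid_replica(replicas, checksum) == original)   # True
```

With 64 MB blocks, a client reading at the 200 MB mark needs metadata for only one block (index 3), which is why a large block size keeps metadata traffic low.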
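The map step runs on a collection of nodes, each processing its own partition of the data; the shuffle step then hashes the intermediate records by key so that records with a common key are passed to the same node, where they are reduced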
Simplifies analysis on distributed data
MapReduce
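A single-process word count makes the map, shuffle, and reduce steps concrete. This is a toy sketch, not a real MapReduce framework: the function names are hypothetical, the "nodes" are just Python lists, and CRC32 is assumed as the partitioning hash so that the routing is deterministic across runs.

```python
import zlib
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input partition."""
    return [(word, 1) for word in document.split()]


def shuffle(pairs, num_reducers):
    """Route each pair to a reducer chosen by hashing its key."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        reducer = zlib.crc32(key.encode("utf-8")) % num_reducers
        buckets[reducer][key].append(value)
    return buckets


def reduce_phase(bucket):
    """Sum the values collected for each key on one reducer."""
    return {key: sum(values) for key, values in bucket.items()}


partitions = ["big data big systems", "data systems scale"]
mapped = [pair for doc in partitions for pair in map_phase(doc)]
buckets = shuffle(mapped, num_reducers=2)

counts = {}
for bucket in buckets:
    counts.update(reduce_phase(bucket))
print(counts)
```

Because every occurrence of a word hashes to the same reducer, each reducer can compute its totals independently and the final result is simply the union of the reducers' outputs.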
Resources in a multi-tenant cluster are dynamically allocated as a user’s needs change
Allows users to gain access to large systems without the overhead associated with maintaining a large cluster
Cloud Systems
Databases reliably store and retrieve data and can provide querying over the data sets
Large parallel databases are spread across servers, with no cluster file system managing the nodes
Distributed Databases
Data can be partitioned by spreading rows evenly among the nodes or by assigning rows to nodes based on a hash of some of the fields
The nodes evaluate queries on local partitions then combine the results from each node
Distributed Databases Organization
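Both partitioning strategies, and the evaluate-locally-then-combine query pattern, can be sketched as below. This is an illustrative sketch only: rows are plain dictionaries, nodes are lists, the function names are hypothetical, and CRC32 is assumed as the hash function.

```python
import zlib


def round_robin_partition(rows, num_nodes):
    """Spread rows evenly by dealing them out to the nodes in turn."""
    nodes = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        nodes[i % num_nodes].append(row)
    return nodes


def hash_partition(rows, num_nodes, field):
    """Place each row on the node selected by hashing one of its fields."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        key = str(row[field]).encode("utf-8")
        nodes[zlib.crc32(key) % num_nodes].append(row)
    return nodes


def total_amount(nodes):
    """Each node sums its local partition; the partial sums are then combined."""
    partials = [sum(row["amount"] for row in node) for node in nodes]
    return sum(partials)


rows = [
    {"customer": "ann", "amount": 10},
    {"customer": "bob", "amount": 20},
    {"customer": "ann", "amount": 5},
    {"customer": "cy", "amount": 7},
]
even = round_robin_partition(rows, 2)
hashed = hash_partition(rows, 2, "customer")
print(total_amount(even), total_amount(hashed))  # 42 42
```

Round-robin placement balances row counts, while hash placement guarantees that all rows sharing a field value (here, both of ann's rows) land on the same node.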
If certain tables are frequently joined together in queries, store them on the same node
When joining tables from different nodes, transfer the smaller of the two
Improving Distributed Database Efficiency
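The second rule above, transferring the smaller table, can be sketched as a simple hash join. This is a hedged sketch rather than any database's actual planner: the function `join_across_nodes` is hypothetical, row count is assumed as a stand-in for table size, and the network transfer is simulated by reporting how many rows would be shipped.

```python
def join_across_nodes(table_a, table_b, key):
    """Join two tables held on different nodes by shipping the smaller one."""
    # Ship whichever table has fewer rows; the larger table stays in place.
    if len(table_a) <= len(table_b):
        shipped, local = table_a, table_b
    else:
        shipped, local = table_b, table_a
    # Build a hash index on the shipped table, then probe it with local rows.
    index = {}
    for row in shipped:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in local:
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined, len(shipped)


customers = [{"cust_id": 1, "name": "ann"}, {"cust_id": 2, "name": "bob"}]
orders = [
    {"cust_id": 1, "item": "disk"},
    {"cust_id": 2, "item": "cpu"},
    {"cust_id": 1, "item": "ram"},
]
joined, rows_shipped = join_across_nodes(orders, customers, "cust_id")
print(rows_shipped)  # 2
print(len(joined))   # 3
```

Here only the two customer rows cross the network instead of the three order rows; on real tables the difference can be several orders of magnitude.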
Parallel databases are very difficult to tune and populate with data
Very difficult to develop and debug parallel programs
Parallel Complications