8/14/2019 Chap8 Infrastructure
Infrastructure of Data Warehouse
Ms. Ashwini Rao
Asst. Prof., IT
Infrastructure supporting architecture
Infrastructure
Elements that enable the architecture to be implemented.
Operational (helps to keep the DW going):
People
Procedures
Training
Management software
Physical:
Hardware components
Operating system
Network, network software
Physical Infrastructure
Features of Hardware & OS
Hardware:
Scalability
Vendor support
Vendor stability
Vendor reference
OS:
Scalability
Security
Reliability
Availability
Preemptive multitasking
Memory protection
(Mnemonic for the OS features: RS SPAM)
Possible options of Hardware & OS
Mainframes:
Old hardware
Designed for OLTP
Expensive
Not easily scalable
Open System Servers:
UNIX servers are the most common choice
Robust
Adapted for parallel processing
NT Servers:
Suited to medium-sized data warehouses
Limited parallel processing
Cost effective for small or medium DW
Platform Options
A computing platform is the set of hardware
components, operating system, network &
network software.
Both Online Transaction Processing and
Decision Support Systems need a computing
platform.
Single Platform Option
All functions, from back-end data extraction to front-end query processing, are performed on one platform.
Data flows smoothly, no conversions required
No middleware required
Limitations:
Legacy platform stretched to capacity
Non-availability of tools
Multiple legacy platforms
Company's migration policy
Hybrid Platform Option
Eliminates the drawbacks of the single platform option
Data extraction: Each source is extracted on its own
computing platform
Initial reformatting & merging: The files extracted from
each source are reformatted & merged on their respective
platforms
Preliminary data cleansing: Verify extracted data for
missing values & data types.
Transformation & Consolidation: Performed on the
platform where the staging area resides.
Validation & Final Quality Check
Creation of Load Images
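The preliminary data cleansing step above (verifying extracted data for missing values and data types) could be sketched roughly as follows. This is only an illustration: the column names, the date format, and the helper names `check_record` and `split_clean_and_rejected` are assumptions and do not come from the slides.

```python
from datetime import datetime

# Hypothetical schema (an assumption for illustration):
# column name -> parser that raises ValueError on bad input.
EXPECTED_TYPES = {
    "customer_id": int,
    "order_date": lambda s: datetime.strptime(s, "%Y-%m-%d"),
    "amount": float,
}


def check_record(record):
    """Return the problems found in one extracted record (a dict of strings)."""
    problems = []
    for column, parse in EXPECTED_TYPES.items():
        value = record.get(column)
        # Missing-value check: absent column or blank string.
        if value is None or value.strip() == "":
            problems.append(f"{column}: missing value")
            continue
        # Data-type check: the value must parse as the expected type.
        try:
            parse(value)
        except ValueError:
            problems.append(f"{column}: wrong data type ({value!r})")
    return problems


def split_clean_and_rejected(records):
    """Separate records that pass both checks from those that do not."""
    clean, rejected = [], []
    for record in records:
        problems = check_record(record)
        if problems:
            rejected.append((record, problems))
        else:
            clean.append(record)
    return clean, rejected
```

In a hybrid setup, a check of this kind would run on each source platform before the files move to the staging area, so that only the rejected records need attention there.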
Data Movement Considerations
Shared Disk
Mass Data Transmission
Through ports
Real Time Connection
TCP/IP
Manual Methods
External medium
Data movement options
Client/Server architecture for DW
Considerations on client workstations
Depends on type of users:
Casual user: Web browser and HTML reports
Analyst: more powerful workstation machine
A practically feasible solution is a minimum
configuration on an appropriate platform that would support a standard set of information
delivery tools in the DW.
Platform options as DW matures
Parallel processing
Symmetric multiprocessing
Clusters
Massively parallel processing
Cache-coherent Non-uniform Memory Architecture
Symmetric Multiprocessing
Features:
This is a shared-everything architecture, the simplest parallel
processing machine.
Each processor has full access to the shared memory through a
common bus.
Communication between processors occurs through common
memory.
Benefits:
Provides high concurrency. You can run many concurrent queries.
Balances workload very well.
Gives scalable performance. Simply add more processors to the
system bus.
Being a simple design, you can administer the server easily.
Limitations:
Available memory may be limited.
May be limited by bandwidth for processor-to-processor communication, I/O, and bus
communication.
Availability is limited; like a single computer
with many processors.
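As a loose analogy for the shared-everything model, the sketch below uses Python threads, which all see the same process memory, to scan parts of a table and combine their results through one shared variable. The function name `count_matches` and the worker count are assumptions for illustration. Threads in CPython do not give true multi-processor speed-up, so this only illustrates the communication pattern (workers coordinating through common memory), not SMP performance.

```python
import threading


def count_matches(rows, predicate, workers=4):
    """Count rows satisfying predicate, with workers sharing one total in memory."""
    total = 0
    lock = threading.Lock()  # protects the shared total from concurrent updates

    def scan(chunk):
        nonlocal total
        local = sum(1 for row in chunk if predicate(row))
        # Workers communicate only by updating the common memory location.
        with lock:
            total += local

    # Give each worker every workers-th row (a simple way to split the scan).
    chunks = [rows[i::workers] for i in range(workers)]
    threads = [threading.Thread(target=scan, args=(chunk,)) for chunk in chunks]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return total
```

The lock plays the role of the coordination that a shared bus and shared memory require: as more workers contend for the same location, that contention becomes the limit, which mirrors the bandwidth limitation listed above.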
Clusters
Features:
Each node consists of one or more processors and associated memory.
Memory is not shared among the nodes; it is shared only within each
node.
Communication occurs over a high-speed bus.
Each node has access to the common set of disks.
This architecture is a cluster of nodes.
Benefits:
This architecture provides high availability; all data is accessible even if
one node fails.
Preserves the concept of one database.
This option is good for incremental growth.
Limitations:
Bandwidth of the bus could limit the scalability of the system.
This option comes with a high operating system overhead.
Each node has a data cache; the architecture needs to maintain cache consistency for internode synchronization.
Main memory is like a big file cabinet stretching across the entire room.
Massively Parallel Processing
Features:
This is a shared-nothing architecture.
This architecture is more concerned with disk access than memory access.
Works well with an operating system that supports transparent disk access.
If a database table is located on a particular disk, access to that disk depends
entirely on the processor that owns it.
Internode communication is by processor-to-processor connection.
Benefits:
This architecture is highly scalable.
The option provides fast access between nodes.
Any failure is local to the failed node; improves system availability.
Generally, the cost per node is low.
Limitations:
The architecture requires rigid data partitioning.
Data access is restricted.
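The rigid data partitioning of a shared-nothing design can be illustrated with a small sketch in which every row has exactly one owning node and all access goes through that owner. The classes `Node` and `SharedNothingTable` and the choice of CRC32 hashing are assumptions for illustration; real MPP database products use their own partitioning schemes.

```python
import zlib


class Node:
    """One MPP node with its own private storage (nothing is shared)."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.rows = {}

    def put(self, key, row):
        self.rows[key] = row

    def get(self, key):
        return self.rows.get(key)


class SharedNothingTable:
    """A table hash-partitioned across nodes; each key has exactly one owner."""

    def __init__(self, node_count):
        self.nodes = [Node(i) for i in range(node_count)]

    def owner(self, key):
        # Deterministic hash so the same key always maps to the same node.
        index = zlib.crc32(str(key).encode("utf-8")) % len(self.nodes)
        return self.nodes[index]

    def insert(self, key, row):
        self.owner(key).put(key, row)

    def lookup(self, key):
        # Access depends entirely on the node that owns the partition.
        return self.owner(key).get(key)
```

Because a lookup can only be served by the owning node, losing a node makes its partition unreachable while the other partitions stay available, which matches the "failure is local" benefit and the "data access is restricted" limitation above.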
NUMA
Features:
This is the newest architecture.
The NUMA architecture is like a big SMP broken into smaller SMPs that are easier
to build.
Hardware considers all memory units as one giant memory. The system has a
single real memory address space over the entire machine; memory addresses
begin with 1 on the first node and continue on the following nodes. Each node
contains a directory of memory addresses within that node.
In this architecture, the amount of time needed to retrieve a memory value varies
because the first node may need a value that resides in the memory of the third
node. That is why this architecture is called non-uniform memory access (NUMA)
architecture.
Benefits:
Provides maximum flexibility.
Overcomes the memory limitations of SMP.
Better scalability than SMP.
Limitations:
Programming for the NUMA architecture is more
complex than even for MPP.
Software support for NUMA is fairly limited.
Technology is still maturing.
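The single address space described under the NUMA features can be made concrete with a toy model: addresses start at 1 on the first node and continue across the following nodes, each node's directory records the address range it holds, and an access costs more when the address lives on another node. The class name `NumaMachine` and the cost figures (1 unit local, 3 units remote) are assumptions chosen only for illustration, not measurements.

```python
class NumaMachine:
    """Toy model of one global address space spread over several nodes."""

    def __init__(self, node_sizes, local_cost=1, remote_cost=3):
        # Hypothetical access costs in arbitrary time units.
        self.local_cost = local_cost
        self.remote_cost = remote_cost
        # Per-node directory: (first_address, last_address) held by that node.
        self.directory = []
        start = 1  # addresses begin with 1 on the first node
        for size in node_sizes:
            self.directory.append((start, start + size - 1))
            start += size

    def home_node(self, address):
        """Return the index of the node whose memory holds this address."""
        for node, (first, last) in enumerate(self.directory):
            if first <= address <= last:
                return node
        raise ValueError(f"address {address} is outside the address space")

    def access_cost(self, requesting_node, address):
        """Local access is cheap; access to another node's memory costs more."""
        if self.home_node(address) == requesting_node:
            return self.local_cost
        return self.remote_cost
```

With three nodes of 100 addresses each, address 250 lives on the third node, so the first node pays the remote cost to read it while the third node pays the local cost. That difference is the "non-uniform" part of the name.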
Database Software
Many operations can be parallelized:
mass loading of data
full table scans
queries with exclusion conditions
queries with grouping
selection with distinct values
aggregation
sorting
creation of tables using subqueries
creating and rebuilding indexes
inserting rows into a table from other tables
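To show why operations such as grouping and aggregation parallelize well, here is a minimal sketch that splits a table into chunks, computes partial group totals on each chunk concurrently, and then merges the partial results. The function names and the `(region, amount)` row layout are assumptions for illustration. A thread pool is used to keep the example self-contained; a real database engine distributes this work across processors or nodes internally.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partial_totals(rows):
    """Aggregate one chunk: sum of amount per region."""
    totals = Counter()
    for region, amount in rows:
        totals[region] += amount
    return totals


def parallel_group_sum(rows, workers=4):
    """Compute SUM(amount) GROUP BY region using several workers on one query."""
    # Split the table so each worker scans a disjoint subset of the rows.
    chunks = [rows[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_totals, chunks))
    # Merge step: partial sums per group simply add together.
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    return dict(merged)
```

The approach works because a sum can be computed per chunk and combined afterwards. Operations without that property need a different merge step, for example a final merge pass for sorting.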
Types of parallelization
Software Tools
Summing up
Infrastructure acts as the foundation supporting the data warehouse architecture.
Data warehouse infrastructure consists of operational infrastructure and physical infrastructure.
Hardware and operating systems make up the computing environment for the DW.
Several options exist for the computing platforms needed to implement the various architectural components.
Selecting the server hardware is a key decision. Invariably, the choice is one of the four parallel server architectures.
Current database software products are able to perform interquery and intraquery parallelization.
Software tools are used in the data warehouse for data modeling, data extraction, data transformation, data loading, data quality assurance, queries and reports, and online analytical processing (OLAP).
Tools are also used as middleware, alert systems, and for data warehouse administration.