Architecture and Infrastructure Module 2 G.Anuradha

Architecture and Infrastructure

What is architecture?
  • The structure that brings all the components of a data warehouse together is known as the architecture.
  • Many factors affect the architecture of a DW:
      • Integrated data
      • Data preparation and storing
      • Data delivery
      • Technology
  • Comprehensive blueprint
  • Architecture in 3 major areas:
      • Data acquisition
      • Data storage
      • Information delivery

Distinguishing characteristics of architecture
  • Different objectives and scope
      • To provide strategic information, a DW should have an elaborate architecture
      • Scope depends on the sources used in the data acquisition area
  • Data content
      • Deals with historical, read-only data
  • Complex analysis and quick response
      • Drill down, roll up, slice, dice, what-if scenarios
  • Flexible and dynamic
      • The design should stay open to change even after the initial design is done
  • Metadata-driven
      • Every movement of data is recorded in the metadata

Test your fundas

ACROSS
  Business dimension (5)
  6. Smaller than DW (8)
  7. Combining data from different operational systems (10)
  8. Initial loading (7)
DOWN
  2. Remove useful information from operational data (10)
  3. Monitoring the entire function (10)
  4. Historical (8)
  5. Data about entire warehouse (8)

Solution

Architecture supporting the flow of data
  • Data source (internal and external)
  • Data staging
      • Transformation
      • Cleansing
      • Integration of data
  • Data storage
      • Loading of data from the staging area
      • Storing for information delivery
  • Metadata
      • Storage mechanism for data about data
  • Information delivery
      • Dependent data marts, MDDBs, query and reporting facilities
  • Management and control module
      • Umbrella component with two important functions:
          • Monitor all ongoing operations
          • Problem recovery
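The flow of data through staging, storage and information delivery, with the management and control module watching over every step, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ControlModule`, the three stage functions and the sample rows are hypothetical names invented for this sketch, and the "warehouse" is just an in-memory dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class ControlModule:
    """Hypothetical management and control module: monitors every stage."""
    log: list = field(default_factory=list)

    def run(self, name, stage, data):
        """Run one stage; on failure, log it and hand back the last good data."""
        try:
            result = stage(data)
        except Exception as exc:  # problem recovery: record the error, do not crash
            self.log.append((name, "failed", str(exc)))
            return False, data
        self.log.append((name, "ok", len(result)))
        return True, result


def staging(rows):
    """Cleanse and integrate: trim names, drop rows without a customer id."""
    return [{**r, "name": r["name"].strip()} for r in rows if r.get("id")]


def storage(rows):
    """Load staged rows into the in-memory warehouse, keyed by id."""
    return {r["id"]: r for r in rows}


def delivery(warehouse):
    """Information delivery: a simple report of sales per region."""
    report = {}
    for row in warehouse.values():
        report[row["region"]] = report.get(row["region"], 0) + row["sales"]
    return report


def run_pipeline(source_rows, control):
    """Push data through the stages in order, stopping at the first failure."""
    data = source_rows
    for name, stage in [("staging", staging), ("storage", storage), ("delivery", delivery)]:
        ok, data = control.run(name, stage, data)
        if not ok:
            break
    return data


source = [
    {"id": 1, "name": " Asha ", "region": "East", "sales": 100},
    {"id": None, "name": "unknown", "region": "East", "sales": 5},
    {"id": 2, "name": "Ravi", "region": "West", "sales": 70},
    {"id": 3, "name": "Meena", "region": "East", "sales": 30},
]
control = ControlModule()
report = run_pipeline(source, control)
print(report)       # sales per region
print(control.log)  # one monitoring entry per stage
```

The control module does not transform any data itself; it only wraps each stage, which mirrors its role as an umbrella component.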

List of services and functions: Data Extraction
  • Select data sources and determine the types of filters to be applied to individual sources
  • Generate automatic extract files from operational systems using replication and other techniques
  • Create intermediary files to store selected data to be merged later
  • Transport extracted files from multiple platforms
  • Provide automated job control services for creating extract files
  • Reformat input from outside sources
  • Reformat input from departmental data files, databases, and spreadsheets
  • Generate common application code for data extraction
  • Resolve inconsistencies for common data elements from multiple sources

List of services and functions: Data Transformation
  • Map input data to data for the data warehouse repository
  • Clean data, deduplicate, and merge/purge
  • Denormalize extracted data structures as required by the dimensional model of the data warehouse
  • Convert data types
  • Calculate and derive attribute values
  • Check for referential integrity
  • Aggregate data as needed
  • Resolve missing values
  • Consolidate and integrate data

List of services and functions: Data Staging
  • Provide backup and recovery for staging area repositories
  • Sort and merge files
  • Create files as input to make changes to dimension tables
  • If data staging storage is a relational database, create and populate the database
  • Preserve an audit trail to relate each data item in the data warehouse to its input source
  • Resolve and create primary and foreign keys for load tables
  • Consolidate datasets and create flat files for loading through DBMS utilities
  • If staging area storage is a relational database, extract load files
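Several of the data transformation functions listed above (deduplicate, convert data types, resolve missing values, check referential integrity, derive attribute values, aggregate) can be shown together in one small Python sketch. The field names, the sample orders and the rule "a missing quantity defaults to 1" are assumptions made only for illustration; a real warehouse would take such rules from its own requirements.

```python
from collections import defaultdict


def transform(orders, customers):
    """Apply a handful of transformation functions to raw extracted orders."""
    known_customers = {c["customer_id"] for c in customers}
    seen_ids = set()
    clean, rejected = [], []

    for raw in orders:
        order_id = int(raw["order_id"])           # convert data types
        if order_id in seen_ids:                  # deduplicate
            continue
        seen_ids.add(order_id)

        if raw["customer_id"] not in known_customers:
            rejected.append(order_id)             # referential integrity check failed
            continue

        # Resolve missing values (assumed rule: an empty quantity means 1).
        qty = int(raw["qty"]) if raw.get("qty") else 1
        unit_price = float(raw["unit_price"])
        clean.append({
            "order_id": order_id,
            "customer_id": raw["customer_id"],
            "qty": qty,
            "amount": qty * unit_price,           # calculate a derived attribute
        })

    totals = defaultdict(float)                   # aggregate data as needed
    for row in clean:
        totals[row["customer_id"]] += row["amount"]
    return clean, dict(totals), rejected


customers = [{"customer_id": "C1"}, {"customer_id": "C2"}]
orders = [
    {"order_id": "101", "customer_id": "C1", "qty": "2", "unit_price": "10.0"},
    {"order_id": "101", "customer_id": "C1", "qty": "2", "unit_price": "10.0"},  # duplicate
    {"order_id": "102", "customer_id": "C2", "qty": "", "unit_price": "5.5"},    # missing qty
    {"order_id": "103", "customer_id": "C9", "qty": "1", "unit_price": "3.0"},   # unknown customer
    {"order_id": "104", "customer_id": "C1", "qty": "3", "unit_price": "4.0"},
]

clean, totals, rejected = transform(orders, customers)
print(totals)    # amount per customer
print(rejected)  # order ids that failed the integrity check
```

Rejected rows are returned rather than silently dropped, which keeps the audit trail that the data staging list asks for.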

Data Storage
  • Loading the data from the staging area into the data warehouse repository
  • Before data is loaded into the data warehouse, the metadata repository gets populated
  • In the top-down approach, data may move from the enterprise-wide data warehouse repository to the repositories of the dependent data marts
  • In the bottom-up approach, data movements stop with the appropriate conformed data marts

Information Delivery
  • Information access in a data warehouse is through online queries and interactive analysis sessions
  • The data warehouse also produces regular and ad hoc reports
  • The data warehouse feeds data to proprietary multidimensional databases (MDDBs), where summarized data is kept as multidimensional cubes of information
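To give a feel for what "summarized data kept as multidimensional cubes" means, here is a minimal Python sketch. It is not how any particular MDDB product stores data: the three dimensions, the helper names `build_cube`, `roll_up` and `slice_cube`, and the sample facts are all assumptions for illustration.

```python
from collections import defaultdict

# Assumed dimensions of the cube; each cell holds summed units sold.
DIMENSIONS = ("product", "region", "quarter")


def build_cube(facts):
    """Summarize detailed facts into cells keyed by (product, region, quarter)."""
    cube = defaultdict(int)
    for fact in facts:
        cube[tuple(fact[d] for d in DIMENSIONS)] += fact["units"]
    return dict(cube)


def roll_up(cube, keep):
    """Aggregate away every dimension that is not listed in `keep`."""
    positions = [DIMENSIONS.index(d) for d in keep]
    result = defaultdict(int)
    for key, units in cube.items():
        result[tuple(key[i] for i in positions)] += units
    return dict(result)


def slice_cube(cube, dimension, value):
    """Keep only the cells where one dimension has a fixed value."""
    position = DIMENSIONS.index(dimension)
    return {key: units for key, units in cube.items() if key[position] == value}


facts = [
    {"product": "pen", "region": "East", "quarter": "Q1", "units": 10},
    {"product": "pen", "region": "East", "quarter": "Q1", "units": 5},
    {"product": "pen", "region": "West", "quarter": "Q1", "units": 7},
    {"product": "ink", "region": "East", "quarter": "Q2", "units": 3},
    {"product": "pen", "region": "East", "quarter": "Q2", "units": 4},
]

cube = build_cube(facts)
by_product = roll_up(cube, keep=("product",))
q1_only = slice_cube(cube, "quarter", "Q1")
print(by_product)
print(q1_only)
```

Drill down is the reverse of `roll_up`: going back from the product totals to the finer cells of the cube.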

Data stores for information delivery

Functions and services
  • Provide security to control information access and monitor user access
  • Allow users to browse data warehouse content by hiding internal complexities
  • Automatically reformat queries for optimal execution, from aggregate tables as well
  • Provide self-service report generation for users, consisting of a variety of flexible options to create, schedule, and run reports
  • Store result sets of queries and reports for future use
  • Provide multiple levels of data granularity
  • Provide event triggers to monitor data loading
  • Make provision for the users to perform complex analysis through OLAP
  • Enable data feeds to downstream, specialized decision support systems such as EIS and data mining

Summing up
  • Architecture is the structure that brings all the components together.
  • The architectural components support the functioning of the data warehouse in the three major areas of data acquisition, data storage, and information delivery.

Infrastructure of DW
G. Anuradha

Infrastructure
  • Elements that enable the architecture to be implemented.
  • Operational: helps to keep the DW going
      • People
      • Procedures
      • Training
      • Management software
  • Physical
      • Hardware components
      • Operating system
      • Network, network software

Features of Hardware & OS
  • Hardware
      • Scalability
      • Vendor support
      • Vendor stability
  • OS
      • Scalability
      • Security
      • Reliability
      • Availability
      • Preemptive multitasking
      • Memory protection

Possible options
  • Mainframes
      • Old hardware
      • Designed for OLTP
      • Expensive
      • Not easily scalable
  • Open System Servers
      • UNIX servers are the most common choice
      • Robust
      • Adapted for parallel processing
  • NT Servers
      • Medium-sized data warehouses
      • Limited parallel processing
      • Cost effective for small or medium DW

Platform Options
  • A computing platform is the set of hardware components, operating system, network and network software.
  • Both Online Transaction Processing and Decision Support Systems need a computing platform.

Single Platform Option
  • All functions, from back-end data extraction to front-end query processing, are performed on one platform.
  • Data flows smoothly, no conversions required
  • No middleware required
  • Limitations
      • Legacy platform stretched to capacity
      • Non-availability of tools
      • Multiple legacy platforms
      • Company's migration policy

Hybrid Platform Option
  • Eliminates the drawbacks of the single platform option
  • Data extraction: each source is extracted on its own computing platform
  • Initial reformatting & merging: the extracted file from each source is reformatted and merged on its respective platform
  • Preliminary data cleansing: verify extracted data for missing values and data types
  • Transformation & consolidation: performed on the platform where the staging area resides
  • Validation & final quality check
  • Creation of load images

Options for staging area
  • Legacy platform: when all data sources are on the same platform, the DW can also be created on that platform
  • Data storage platform: the warehouse DBMS runs here, so this platform can be used for staging as well
  • Separate optimal platform: a separate platform for staging data

Server Hardware
  • Server hardware is most important
  • Scalability
  • Query processing

Data movement options
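Returning to the hybrid platform option: its first steps (extract on each source platform, reformat to a common layout, then run a preliminary cleansing pass for missing values and data types before consolidation on the staging platform) can be sketched in Python as below. The two extract formats, the field names and the sample records are assumptions chosen for illustration, not taken from the slides.

```python
import csv
import io

# Assumed extract from a legacy platform: fixed-width id(4) name(10) amount(5).
LEGACY_EXTRACT = "\n".join(
    f"{cust_id:04d}{name:<10}{amount:05d}"
    for cust_id, name, amount in [(1, "Asha", 120), (2, "Ravi", 75)]
)

# Assumed extract from a second platform: comma-separated with a header row.
SERVER_EXTRACT = "cust_id,cust_name,amount\n3,Meena,40\n4,,15\n"


def reformat_fixed_width(text):
    """Initial reformatting of the fixed-width extract into the common layout."""
    return [
        {"id": line[0:4], "name": line[4:14].strip(), "amount": line[14:19]}
        for line in text.splitlines()
    ]


def reformat_csv(text):
    """Initial reformatting of the CSV extract into the common layout."""
    return [
        {"id": row["cust_id"], "name": row["cust_name"], "amount": row["amount"]}
        for row in csv.DictReader(io.StringIO(text))
    ]


def preliminary_cleansing(rows):
    """Verify missing values and data types; convert the rows that pass."""
    valid, rejected = [], []
    for row in rows:
        if row["name"] and row["id"].isdigit() and row["amount"].isdigit():
            valid.append(
                {"id": int(row["id"]), "name": row["name"], "amount": int(row["amount"])}
            )
        else:
            rejected.append(row)
    return valid, rejected


# Merge the reformatted extracts, then cleanse them on the staging platform.
merged = reformat_fixed_width(LEGACY_EXTRACT) + reformat_csv(SERVER_EXTRACT)
valid, rejected = preliminary_cleansing(merged)
load_image = sorted(valid, key=lambda row: row["id"])
print(load_image)
print(rejected)
```

Because every source is reformatted to the same layout first, the cleansing step needs only one set of rules, whatever platform the data came from.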

Client/Server architecture for DW

Considerations on client workstations
  • Depends on the type of users
      • Casual user: Web browser and HTML reports
      • Analyst: more powerful workstation
  • A practically feasible solution is a minimum configuration on an appropriate platform that supports a standard set of information delivery tools in the DW

Platform options as DW matures

Parallel processing
  • Symmetric multiprocessing
  • Clusters
  • Massively parallel processing
  • Cache-coherent Nonuniform Memory Architecture

Symmetric Multiprocessing

Clusters

Massively Parallel Processing

NUMA or ccNUMA

Database Software
  • Many operations can be parallelized:
      • mass loading of data
      • full table scans
      • queries with exclusion conditions
      • queries with grouping
      • selection with distinct values
      • aggregation
      • sorting
      • creation of tables using subqueries
      • creating and rebuilding indexes
      • inserting rows into a table from other tables
      • enabling constraints
      • star transformation

Types of parallelization
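The two types named in these notes, interquery and intraquery parallelization, can be illustrated with a small Python sketch. It only shows the idea: the table `SALES` and all function names are hypothetical, and Python threads stand in for the parallel server processes of a real database engine, so no speed-up should be expected from this toy.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical fact table: (region, amount) rows.
SALES = [("East", 10), ("West", 4), ("East", 6), ("North", 3), ("West", 9), ("East", 1)]


def partial_sum_by_region(partition):
    """Work done by one parallel worker on its share of the rows."""
    totals = {}
    for region, amount in partition:
        totals[region] = totals.get(region, 0) + amount
    return totals


def intraquery_sum_by_region(rows, workers=3):
    """Intraquery: ONE grouping query is split across several workers."""
    partitions = [rows[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum_by_region, partitions))
    merged = {}
    for partial in partials:  # combine the partial results
        for region, amount in partial.items():
            merged[region] = merged.get(region, 0) + amount
    return merged


def interquery(queries, rows):
    """Interquery: SEVERAL independent queries run at the same time."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(query, rows) for name, query in queries.items()}
        return {name: future.result() for name, future in futures.items()}


totals = intraquery_sum_by_region(SALES)
results = interquery(
    {
        "row_count": len,
        "max_amount": lambda rows: max(amount for _, amount in rows),
    },
    SALES,
)
print(totals)
print(results)
```

Grouping and aggregation parallelize well because partial sums from each partition can simply be added together at the end.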

Software Tools

Summing up
  • Infrastructure acts as the foundation supporting the data warehouse architecture.
  • Data warehouse infrastructure consists of operational infrastructure and physical infrastructure.
  • Hardware and operating systems make up the computing environment for the DW.
  • Several options exist for the computing platforms needed to implement the various architectural components.
  • Selecting the server hardware is a key decision. Invariably, the choice is one of the four parallel server architectures.
  • Current database software products are able to perform interquery and intraquery parallelization.
  • Software tools are used in the data warehouse for data modeling, data extraction, data transformation, data loading, data quality assurance, queries and reports, and online analytical processing (OLAP). Tools are also used as middleware, alert systems, and for data warehouse administration.

METADATA
  • Data dictionary or data catalog
  • Contains data about the data in the DW, such as:
      • data structures
      • files and addresses
      • indexes

Types of Metadata
  • Operational
  • Extraction and transformation
  • End-user

Need for Metadata
  • For using the DW
  • For building the DW
  • For administering the DW
  • Automation of the DW

Metadata by functional areas
  • Every DW process occurs in one of these 3 areas:
      • Data acquisition
      • Data storage
      • Information delivery

Data acquisition - metadata

Data storage - metadata

Information Delivery metadata

Types of Metadata
  • Business metadata
      • Portrays the DW from the end user's perspective
      • Shows business names, not actual file names
      • Less structured than technical metadata
      • Used by business analysts and other end users
  • Technical metadata
      • Shows the actual structure and content of the DW
      • Acts as a guide to build, maintain and administer the DW
      • Used by the data warehouse administrator and other IT staff working on the DW

How to provide metadata
  • Metadata requirements
  • Sources
  • Challenges
  • Repository
  • Integration and standards
  • Implementation options
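To make the distinction between business and technical metadata concrete, the Python sketch below keeps both in one tiny repository: end users see only business names, while IT staff can resolve each name to the actual table and column. Every table name, column name, data type and source system here is invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TechnicalMetadata:
    """What the DW administrator needs: the actual structure behind a name."""
    table: str
    column: str
    data_type: str
    source_system: str


# Hypothetical metadata repository: business name -> technical description.
REPOSITORY = {
    "Customer Name": TechnicalMetadata("dim_customer", "cust_nm", "VARCHAR(60)", "CRM"),
    "Net Sales": TechnicalMetadata("fact_sales", "net_sls_amt", "DECIMAL(12,2)", "Order entry"),
}


def business_view():
    """What an end user browses: business names only, no file or column names."""
    return sorted(REPOSITORY)


def resolve(business_name):
    """Translate a business name into its technical location in the DW."""
    if business_name not in REPOSITORY:
        raise KeyError(f"No metadata entry for business name {business_name!r}")
    meta = REPOSITORY[business_name]
    return f"{meta.table}.{meta.column}"


print(business_view())
print(resolve("Net Sales"))
```

A query tool built on such a repository can let analysts pick "Net Sales" from a list and generate the physical column reference itself.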