
BODS Interview


SAP BOBJ Data Services
Data Services is an end-to-end Data Integration, Data Management and Text Analytics software.
Data Integration: any software which provides extraction, integration, transformation and loading functionality is called Data Integration software.
Data Management: software which helps to cleanse master data is called Data Management software. Any software which provides parsing, correcting, standardization, enhancement, match and consolidation functionality is called Data Management software.
Text Analytics: software which converts unstructured text data into structured data is called Text Analytics software.
Advantages of BODS: one product, multiple services; single-window application; supports unstructured sources (big data); tight integration with SAP; supports the in-memory database (HANA); migration tool; content management tool; supports real-time extraction; price; low maintenance cost.
Architecture
1) Designer: a desktop application and client tool; it is the developer tool used to develop BODS jobs; it supports batch jobs and real-time jobs; it is only an application, it does not store data.
2) Repository: a storage location in a database used to store BODS objects. A repository is a group of BODS metadata tables. Databases that support a BODS repository are Oracle, Microsoft SQL Server, MySQL, DB2, Sybase and HANA (4.1 onwards). There are three types of repositories:
A local repository (known in Designer as the Local Object Library) is used by an application designer to store definitions of source and target metadata and Data Services objects.
A central repository (known in Designer as the Central Object Library) is an optional component that can be used to support multi-user development. The Central Object Library provides a shared library that allows developers to check objects in and out for development.
A profiler repository is used to store information that is used to determine the quality of data.
3) BODS services: Job Server, Data Engine, Access Server.
I. Job Server: a service which processes BODS batch jobs. Whenever a batch job is executed or triggered by a user, the assigned Job Server immediately studies the design of the job and estimates the number of threads and processes to be allocated.
II. Data Engines: services which take responsibility for generating threads and processes as per the Job Server's estimation. Threads are involved in extraction and loading, and processes are involved in transformation of data. Collectively the Job Server and Data Engines take responsibility for processing batch jobs.
III. Access Server: also called the message broker. The Access Server is a service which takes responsibility for processing real-time jobs. To manage real-time jobs an extra configuration, called real-time configuration, has to be done on both the sender and receiver side; this configuration sends a message alert whenever data is updated on the sender side. The Access Server runs 24/7 and waits for a message from the sender to trigger the real-time job. Once the real-time job finishes successfully, the Access Server sends a message back to the sender as an acknowledgement.
4) Data Quality components:
I. Directories: also a kind of repository, storing cleansing packages. They are physical folders at OS level, named reference data, under LINK_DIR/dataquality/reference_data, which hold country-specific data. Directories enable the match and consolidate functions as part of data cleansing.
II. Dictionaries: storage locations in a database which hold data classifications (the format of an entity); dictionaries help enable standardization of data.
III. Address Server: a service which processes the data quality transforms.
5) BODS Management Console: an administrator tool; it is a web application; URL: http://<host>:<port>/DataServices; it is a secured application (in 3.x an admin user name and password, from 4.x CMC credentials).
I. Administrator: scheduling, monitoring and executing batch jobs; configuring, starting and stopping real-time services; configuring Job Server, Access Server and repository usage; configuring and managing adapters; managing users; publishing batch jobs and real-time services via web services; reporting on metadata.
II. Auto Documentation: view, analyze, and print graphical representations of all objects as depicted in Data Services Designer, including their relationships, properties, and more.
III. Data Validation: evaluate the reliability of your target data based on the validation rules you create in your Data Services batch jobs to quickly review, assess, and identify potential inconsistencies or errors in source data.
IV. Impact and Lineage Analysis: analyze end-to-end impact and lineage for Data Services tables and columns, and SAP BusinessObjects Business Intelligence platform objects such as universes, business views, and reports.
V. Operational Dashboard: view dashboards of status and performance execution statistics of Data Services jobs for one or more repositories over a given time period.
VI. Data Quality Reports: use data quality reports to view and export SAP Crystal Reports for batch and real-time jobs that include statistics-generating transforms. Report types include job summaries, transform-specific reports, and transform group reports.
6) CMC / IPS (Central Management Console / Information Platform Services): CMC is the BO BI platform (like NetWeaver for SAP); CMC is a web application; it is an administrator application; it is a secured, role-based application. In CMC, with respect to BODS, we can do repository management, user management, security management and license management. URL: http://<host>:<port>/BOE/CMC.
Installation of BODS 4.0: we need a database (e.g. MySQL) and a web application server (Tomcat).
I. Non-BOE customer: IPS -> BODS server -> BODS client.
II. BOE customer: BO CMC -> BODS server -> BODS client.
III. BOE customer with dual platform: BO CMC + IPS -> BODS server -> BODS client.
BODS Tools
1. CMC/IPS tools
I. Central Configuration Manager: helps in managing BO services; desktop tool / server tool / administrator tool; here we can manage the EIM Adaptive Processing Server, which helps in the creation of the RFC connection for SAP extraction.
II. Central Management Console.
III. Lifecycle Management console (LCM): helps in deploying BOE (CMC) content; version management tool; web application; server tool; administrator tool; URL: http://<host>:<port>; from BOE 4.0 SP5 onwards, CMC offers two options instead of LCM: promotion management and version management.
IV. Upgrade Management Tool (UMT): helps in upgrading BOE content and applications from a lower version (4.0) to a higher version (4.1); desktop tool; server tool; administrator tool.
V. Web Deployment Tool (WDT): helps in deploying web applications, i.e. when our web applications get corrupted we can redeploy them using WDT, i.e. deploy .WAR files into the web application server.
2. BODS Server tools
I. Repository Manager: a tool to create repositories; it helps to check the version of a repository and to upgrade a repository (if the BODS version is upgraded, we upgrade the repository). It is an administrator tool; desktop tool; server tool.
II. Server Manager: a tool to create the Job Server and Access Server; assignment of a repository to a Job Server is done here; it is a server tool, administrator tool and desktop tool; we can do e-mail configuration; we can define the pageable-memory path settings; we can do the real-time job configuration on the receiver side.
III. BODS Management Console (see above).
IV. Metadata Integrator: a tool which integrates BODS and BO BI; through it BODS gets all the metadata of universes, reports etc., and hence impact and lineage analysis; it is a desktop tool, server tool and administrator tool; we can schedule the Metadata Integrator to bring the data from the CMS database to the BODS repository at regular intervals.
V. License Manager: we can add, delete and modify licenses; it is a desktop tool, administrator tool and server tool.
3. BODS Client tools
I. Designer: a desktop application and client tool; it is the developer tool used to develop BODS jobs; it supports batch jobs and real-time jobs; it is only an application, it does not store data.
II. Locale Selector: helps you select the default language of Designer; developer tool; client tool; desktop tool.
III. Documentation: the technical manuals of the tool.
BODS Objects
BODS objects are of two types:
I. Reusable objects: a reusable object has a single definition; all calls to the object refer to that definition. Access reusable objects through the local object library. A data flow, work flow, datastore, function etc. are examples of reusable objects.
II. Single-use objects: some objects are defined only within the context of a single job or data flow, for example scripts, projects, loops, try-catch blocks, conditionals etc. are single-use objects.
Projects: a project is a reusable object that allows you to group jobs. A project is the highest level of organization offered by the software. You can use a project to group jobs that have schedules that depend on one another or that you want to monitor together.
Jobs: a job is the only object you can execute. You can manually execute and test jobs in development.
Dataflow: an object used to define an ETL definition in BODS (what to extract, how to extract and where to load).
Workflow: an object which helps in grouping data flows (or grouping a set of ETL operations).
Datastores: datastores represent connection configurations between the software and databases or applications. There are three kinds of datastores:
I. Database datastores: provide a simple way to import metadata directly from an RDBMS.
II. Application datastores: let users easily import metadata from most Enterprise Resource Planning (ERP) systems.
III. Adapter datastores: can provide access to an application's data and metadata, or just metadata. For example, if the data source is SQL-compatible, the adapter might be designed to access metadata, while Data Services extracts data from or loads data directly to the application.
Formats: an object which helps in connecting file-type sources and targets in BODS.
Scripts: an object which enables BODS customization; we write code in the BODS scripting language (similar to SQL). A script is a single-use object that is used to call functions and assign values in a work flow. A script can contain: function calls, if statements, while statements, assignment statements and operators.
Transforms: ready-made logic provided by SAP along with the tool to perform data transformations. There are more than 20 transforms (Data Integrator, Platform, Data Quality).
Template table: a BODS object which helps in creating a table in any database; template tables are used as target tables in BODS; once the job is executed the template table object is created in the target database.
BODS GUI
Project area: contains the current project (and the job(s) and other objects within it) available to you at a given time.
a) Designer tab: we can design and view objects; the objects are shown in hierarchical fashion.
b) Monitor tab: the place where we can see currently running jobs (red = failed, yellow = currently running, green = successful).
c) Log tab: the place where we can see the logs of a job (log history is 120 days by default).
Workspace: the area of the application window in which you define, display and modify objects.
Local object library: provides access to local repository objects including built-in system objects, such as transforms, and the objects you build and save, such as jobs and data flows; all are reusable objects.
Tool palette: buttons on the tool palette enable you to add new objects to the workspace by drag and drop.
BODS Job Design Methodology
1. Identify the source and target system types. Systems are of two types: a) database type, b) file type.
2. Import the source and target system metadata (schema) into BODS; file type -> format, DB type -> datastore.
3. Create a BODS job:
a. Open an existing project / create a project.
b. Define the ETL definition, i.e. create the dataflow (define source, define target, define transformation).
c. Validate and save the job.
d. Execute and monitor the job.
Excel workbook sheet extraction
BODS 3.x supports .xls; from BODS 4.0 onwards .xls and .xlsx. The add-on called Microsoft Access Database Engine should be installed on both the server and client side so that the Designer application can read Microsoft Excel files; this add-on is provided by SAP along with the BODS software and is placed under C:\Program Files\SAP BO\Data Services\ext\microsoft.
Multiple excel file extraction dynamically
Recommendations: the structure should be identical in all files; all files' data should be in named ranges/sheets; all files' data should be in the same sheet name / sheet number; maintain a common prefix or suffix in the file names.
Solutions: 1. wild-card characters (* or ?); or 2. a list of file names with a comma delimiter.
Dynamic excel file selection
Recommendations: the structure in all files should be identical; data should be in a sheet/named range; maintain a unique sheet name or sheet number; the naming convention of the file name should contain the timestamp of when it was distributed.
Solutions: 1. a variable to pass the file name dynamically at run time; 2. a script to define the file name (see the sketch below, which also extends to sheet names).
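As an illustration of the variable-plus-script solution, the sketch below (in the BODS scripting language) builds a dated workbook name in a global variable; the Excel format's file name property would then reference that variable. The variable name, file-name pattern and date format are illustrative assumptions, not part of the original notes; the same pattern, wrapped in a while loop, also drives the multi-sheet extraction described next.

  # build the workbook name for today's run; the Excel format's file name
  # field would then be set to the (hypothetical) global variable $G_FILE_NAME
  $G_FILE_NAME = 'SALES_' || to_char(sysdate(), 'YYYYMMDD') || '.xlsx';
  # [ ] substitutes the variable's value into the string
  print('Excel workbook selected for this run: [$G_FILE_NAME]');

Because the name is computed at run time, the same job can be scheduled daily without editing the format definition.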
Multiple excel workbook sheets extraction dynamically
Recommendations: the structure should be identical in all sheets; an extra sheet in the file should hold the list of sheet names where the data is placed.
Solution: 1. a variable to pass the sheet name dynamically at run time; 2. a script to define the sheet name; 3. a loop to repeat the process and get the data from all the sheets.
Limitations of excel workbooks in BODS: BODS is not capable of writing (distributing) data to an excel workbook; BODS 3.2 does not support the .xlsx format; we do not have a property to terminate extraction if we get n errors in an excel source.
Flat file extraction
A flat file is a structured file; a flat file has no data limitations. Types of flat files:
Delimited: a structured text file with markers to identify rows, columns and text is called a delimited flat file. Column markers: comma, space, tab, semicolon; row markers: Windows new line, Unix new line, new line; text markers: single quotes, double quotes. With a custom row marker, BODS supports special markers as well.
Fixed-width flat file: a structured flat file where each column is identified by a fixed width (size); row markers: Windows new line, Unix new line, new line; text markers: single quotes, double quotes. With a custom row marker, BODS supports special markers as well.
SAP transport flat file format: supports reading a flat file from, and writing a flat file to, the SAP application server.
Unstructured text flat file: used to read data from unstructured text files, e.g. LinkedIn, Facebook; here the data is in the form of text (characters).
Unstructured binary: here the data is in binary form, e.g. .doc, .pdf, XML, HTML etc.
Advantages of flat files: we can process parallel threads; we have the option to terminate the flat file read if it contains a specific number of errors; with the help of the Query transform we can generate a flat file.
XML data extraction
XML is a nested structure; we get the header and the data separately; it supports levels of extraction; it supports real time. We get XML metadata in two fashions: .dtd format, which helps in importing XML metadata if the header file is provided as a DTD, and XML Schema Definition, which helps in importing metadata if the header file is provided as an XML schema definition.
Make Current is available under the Query transform in the output schema; it allows you to select the required columns from a nested schema, but it cannot be applied to multiple nodes at a time. Unnest with sub-schema is available under the Query transform in the output schema; it helps to convert the nested schema structure into a flat structure.
XML_PIPELINE Transform
Functionality: helps in extracting nested structured data; it can be used to extract part of the XML data. Type: Data Integrator; Behaviour: transform; Input: nested input (takes a single input); Output: multiple same outputs; Properties: no create properties, no target properties, no source object editor properties, no target object editor properties; Schema in: source schema (nested structure); Schema out: under user control (we can define the schema, i.e. create columns here); Limitation: does not allow selecting multiple columns from same-level nodes.
COBOL data extraction
COBOL copybook format: we get the header and the file (fixed-width flat file or variable-width flat file) in separate files; it supports nested structures; it supports only one node column extraction at a time. COBOL copybooks always act as sources only; if there are 3 nodes in the COBOL copybook then we get 3 options, of which we can select any one node.
Time Dimension Implementation
The time dimension is a global standard dimension; there is no need to depend on the source system. Reference entity: based on a reference entity we can derive new entities; day is the smallest reference entity in a time dimension. DATE_GENERATION is a transform which provides the smallest reference time attribute, which is day; based on this we can derive new time entities according to our requirement.
DATE_GENERATION Transform
Functionality: this transform generates a column which holds date values based on the start and end dates provided as input to the transform, using the increment provided to it. Type: Data Integrator transform; Behaviour: it acts like a source; Input: no input; Output: multiple same outputs; Schema in: no input schema because it acts as a source; Schema out: under system control; we get a single output column, DI_GENERATED_DATE.
Time-dependent dimension implementation
A dimension in which the attributes change over a period of time, where those changes are captured in the dimension, is called a time-dependent dimension. We use them for master data tracking or history reports; we can define the validity of an attribute (effective-from date and effective-to date).
EFFECTIVE_DATE Transform
Functionality: derives the valid_to column based on the valid_from column and a sequence column. Type: Data Integrator; Behaviour: transform; Input: the input should contain an effective_from column and a sequence column; Output: multiple same outputs; Schema out: input schema plus one additional date column.
Slowly Changing Dimensions
Dimensions that change slowly are called slowly changing dimensions. SCDs are of 3 types: SCD-T0, SCD-T1 and SCD-T2; T3, T4 and T6 (hybrid SCD-T2) are different representations of SCD-T2.
SCD-T0: holds current data only; it does not maintain historical data and it does not capture changes. The functionality to enable on the ETL side is truncate and reload; in BODS it is implemented by the check box Delete data before loading in the database table target editor properties.
SCD-T1: holds current data and also historical data, but it does not capture changes. The functionality to enable on the ETL side is overwrite; in BODS it is implemented with the Table_Comparison transform and Map_Operation (optional).
SCD-T2: holds current data, contains historical data and also captures change data. The functionality to enable on the ETL side is that every change is inserted (we add 4 additional columns: SID, CUR_REC_IND, EFF_FROM, EFF_TO).
TRANSFORMS
TABLE_COMPARISON
Functionality: helps to identify the type of each input record that you extracted. Type: Data Integrator transform; Behaviour: transform; Prerequisites: there should be at least one common key (primary key) between the input table and the comparison table, and the common columns' names and data types should be the same in both (not all columns, only the common ones); Input: single input; Output: multiple single outputs; Schema out: under system control. It compares two data sets and produces the difference between them as a data set with rows flagged as INSERT or UPDATE. The Table_Comparison transform allows you to detect and forward changes that have occurred since the last time a target was updated.

Table Comparison:

Compares two data sets and produces the difference between them as a dataset with rows flagged as INSERT, UPDATE, or DELETE.

Comparison method options:
Row-by-row select: looks up the target table using SQL every time it receives an input row. This option is best if the target table is large.
Cached comparison table: loads the comparison table into memory. This option is best when the table fits into memory and you are comparing the entire target table.
Sorted input: reads the comparison table in the order of the primary key column(s) using a sequential read. This option improves performance because Data Integrator reads the comparison table only once. To use it, add a query between the source and the Table_Comparison transform, then drag the primary key columns from the query's input schema into the Order By box of the query.

Input primary key column(s): the input data set columns that uniquely identify each row. These columns must be present in the comparison table with the same column names and data types.

Other options: Input contains duplicate keys; Generated key column; Detect deleted row(s) from comparison table (Detect all rows, or Detect row with largest generated key value).

If we check the option Input contains duplicate keys, then in the Generated key column we select the key column whose largest generated key value identifies which of the duplicate rows is used.

If we select the option Detect deleted row(s) from comparison table, the transform flags the deleted record(s) with DELETE.
MAP_OPERATION Transform
Functionality: helps in changing the record modes (operation codes) of records in BODS. Type: Platform transform; Behaviour: behaves like a transform; Input: single input; Output: multiple single outputs; Schema out: under system control; the output schema equals the input schema.

HISTORY_PRESERVING Transform
Functionality: helps in defining the current-record indicator flag and the validity of each input record as part of SCD-T2 implementation. Type: Data Integrator; Behaviour: transform; Input: single input (the input should come from a Table_Comparison transform); Output: multiple single outputs; Schema out: under system control; the output schema structure is equal to the Table_Comparison structure.

Allows you to produce a new row in your target rather than updating an existing row. You can indicate in which columns the transform identifies changes to be preserved. If the value of certain columns change, this transform creates a new row for each row flagged as UPDATE in the input data set.

KEY_GENERATION TransformFunctionality: Helps in implementing SID (like sequence generator in database); Type: Data Integrator; Behavior: Transform; Input: single input; Output: Multiple single output. Generates new keys for new rows in a data set. The Key_Generation transform looks up the maximum existing key value from a table and uses it as the starting value to generate new keys.

MAP OPERATION Allows conversions between data manipulation operations. The Map_Operation transform allows you to change operation codes on data sets to produce the desired output. For example, if a row in the input data set has been updated in some previous operation in the data flow, you can use this transform to map the UPDATE operation to an INSERT. The result could be to convert UPDATE rows to INSERT rows to preserve the existing row in the target. Data Integrator can push Map_Operation transforms to the source database.

MERGE Transform
Functionality: enables the data integration functionality in BODS (similar to a SQL UNION ALL). Type: Platform transform; Behaviour: transform; Input: multiple inputs; Output: multiple single outputs; Prerequisites: before passing inputs to the Merge transform all the sources should have an identical structure (number of columns, column names, data types, order of columns); Schema out: under system control; the output schema is equal to the input schema.

Combines incoming data sets, producing a single output data set with the same schema as the input data sets.

ROW_GENERATION Transform
Using this transform we can maintain one special (dummy) record in a dimension table (which handles null records in the DWH). Functionality: helps in generating the requested number of rows. Type: Platform transform; Behaviour: source; Input: no input; Output: multiple single outputs.

Objective: produces a data set with a single column; the column values start from zero and increment by one up to a specified number of rows.
Data inputs: none.
Options:
Row count: a positive integer indicating the number of rows in the output data set. For added flexibility, you can enter a variable for this option.
Join rank: a positive integer indicating the weight of the output data set if the data set is used in a join. Sources in the join are accessed in order based on their join ranks; the highest-ranked source is accessed first to construct the join.
Cache: select this check box to hold the output from the transform in memory for use in subsequent transforms. Select Cache only if the resulting data set is small enough to fit in memory.
Editor: the Row Generation transform editor includes the target schema and the transform options.
Data outputs: the Row_Generation transform produces a data set with a single column and the number of rows specified in the Row count option. The rows contain integer values in sequence starting from zero and incrementing by one in each row.

PIVOT Transform
Functionality: helps in applying pivoting on the input data, i.e. converting columns into rows. Type: Data Integrator; Behaviour: transform; Input: single input; Output: multiple single outputs; Schema out: under system control. Default output schema of a Pivot transform: PIVOT_SEQ (int), PIVOT_HDR (varchar), PIVOT_DATA (varchar); the output schema of Pivot = the non-pivot columns + the default schema; the pivot sequence column should be the primary key.
Pivot sequence column: for each row created from a pivot column, Data Integrator increments and stores a sequence number.
Non-pivot columns: the columns in the source that are to appear in the target without modification.
Pivot set: the number that identifies a pivot set. Each pivot set must have a group of pivot columns, a unique data field column and a header column. Data Integrator automatically saves this information.
Data field column: contains the pivoted data. This column contains all of the pivot columns' values.
Header column: lists the names of the columns where the corresponding data originated.

Creates a new row for each value in a column that you identify as a pivot column. The Pivot transform allows you to change how the relationship between rows is displayed. For each value in each pivot column, DI produces a row in the output data set. You can create pivot sets to specify more than one pivot column.

REVERSE_PIVOT Transform
Functionality: helps in applying the reverse pivoting operation on input data, i.e. converting rows into columns. Type: Data Integrator; Behaviour: transform; Input: only one input of flat structure; Output: multiple single outputs; Schema out: under system control.
Non-pivot columns: the columns in the source table that will appear in the target table without modification.
Pivoted columns: a set of columns will be created for each unique value in the pivot axis column.
Pivot axis column: the column that determines what new columns are needed in the output table. At run time, a new column is created for each pivoted column and each unique value in this column.
Duplicate value: the action taken when a collision occurs. A collision occurs when there is more than one row with the same key and value in the pivot axis column. In this case, you can store either the first row or the last row, or you can abort the transform process.
Axis value: the value of the pivot axis column that represents a particular set of output columns.
Column prefix: text added to the front of the pivoted column names when creating new column names for the rotated data.

Creates one row of data from several existing rows. The Reverse Pivot transform allows you to combine data from several rows into one row by creating new columns. For each unique value in a pivot axis column and each selected pivot column, DI produces a column in the output data set.

CASE Transform
Functionality: helps in distributing or categorizing data. Type: Platform; Behaviour: transform; Input: single flat-structure input; Output: multiple distinct outputs; the output schema structure is equal to the input schema structure.

Hierarchies
A hierarchy is a grouping of master data. In hierarchy flattening, data is stored in two fashions. Horizontal fashion: enables drill-down and drill-up functionality (the mandatory format). Vertical fashion: enables global filtering (a performance tuning option). Horizontal structure: current leaf, level 0, level 1 ... level n, leaf level. Vertical structure: ancestor ID, descendant ID, root flag, leaf flag, depth.
HIERARCHY_FLATTENING Transform
Functionality: helps in handling hierarchy data in BODS; it stores the hierarchy in the two fashions above. Input: structured flat source(s) with 2 or 4 columns, which include the parent and child columns; Output: multiple single outputs.

Constructs a complete hierarchy from parent/child relationships, then produces a description of the hierarchy in vertically or horizontally flattened format. Flattening types: horizontal and vertical.

Horizontal Flattening Each row of the output describes a single node in the hierarchy and the path to that node from the root. This mode requires that you specify the maximum path length through the tree as the Maximum depth. If there is more than one path to a node, all paths are described in the result.

Vertical Flattening Each row of the output describes a single relationship between ancestor and descendent and the number of nodes the relationship includes. There is a row in the output for each node and all of the descendents of that node. Each node is considered its own descendent and therefore is listed one time as both ancestor and descendent.

MAP_CDC_OPERATION Transform
Functionality: helps in handling how data from source-based CDC source systems should be updated in the target, i.e. it simply sets the values (I, U, D) in the DI_OPERATION_TYPE column based on the CDC columns. Type: Data Integrator; Behaviour: transform; Input: only one input, of type Oracle/Microsoft CDC table or SAP extractor; Output: multiple single outputs (with different record modes).

Using its input requirements, it performs three functions: sorts input data based on the values in the Sequencing column box and the Additional grouping columns box; maps output data based on the values in the Row operation column box (source table rows are mapped to INSERT, UPDATE, or DELETE operations before passing them on to the target); resolves missing, separated, or multiple before- and after-images for UPDATE rows. While commonly used to support Oracle or mainframe changed-data capture, this transform supports any data stream as long as its input requirements are met. This transform is typically the last object before the target in a data flow because it produces INSERT, UPDATE and DELETE operation codes; Data Integrator produces a warning if other objects are used.

VALIDATION Transform
Functionality: helps in enabling a validation layer in BODS. Type: Platform; Behaviour: transform; Input: single input with a flat structure (no XML); Output: 3.x gives 2 distinct output sets, 4.x gives 3 distinct output sets; Schema out: the Pass set's output schema is equal to the input schema; the Fail set's schema is equal to the input schema plus 2 columns (DI_ERRORACTION and DI_ERRORCOLUMN; if we enable the check box we also get a DI_ROWID column); the Rule Violation set's schema is DI_ROWID, DI_RULENAME and DI_COLUMNNAME.

One source in a data flow. Qualifies a data set based on rules for input schema columns. Allows one validation rule per column. Filter out or replace data that fails your criteria. Outputs two schemas: Pass and Fail.

DATA_TRANSFER Transform
Functionality: helps in pushing the source data to any connected database, to a BODS Job Server file location, or to cache memory. Type: Data Integrator; Behaviour: transform; Input: only one input of flat type; Output: multiple single outputs; the output schema is equal to the input schema.

SQL Transform
Functionality: a transform which helps to push down select operations to the database level. Type: Platform; Behaviour: source; Input: no input; Output: multiple single outputs; the output schema is equal to the select statement's output schema.

Performs the indicated SQL query operation. Use this transform to perform standard SQL operations for things that cannot be performed using other built-in transforms. The options for the SQL transform include specifying a datastore, join rank, cache, array fetch size, and entering SQL text.
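The SQL transform itself is configured in the dataflow editor. A related scripted way to push a statement down to the database is the built-in sql() function (listed later under the Database functions). A minimal sketch, assuming a datastore DS_STAGE, a table STG_SALES and a global variable $G_ROW_COUNT, all illustrative names:

  # run a statement directly at the database from a script;
  # for a SELECT, sql() returns the first column of the first row
  $G_ROW_COUNT = sql('DS_STAGE', 'SELECT COUNT(*) FROM STG_SALES');
  print('STG_SALES currently holds [$G_ROW_COUNT] rows');
  # DML/DDL can be pushed down the same way
  sql('DS_STAGE', 'TRUNCATE TABLE STG_SALES');

This is handy for pre- and post-load housekeeping, while set-based extraction is still better left to the SQL transform or normal pushdown.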

QUERY TRANSFORM
The Query transform can perform the following operations: choose (filter) the data to extract from sources; join data from multiple sources; map columns from input to output schemas; perform transformations and functions on the data; perform data nesting and unnesting; add new columns, nested schemas, and function results to the output schema; assign primary keys to output columns.

DATA ASSESSMENT using BODS

Data assessment is for analyzing and measuring the data; it is mainly about studying the sources with respect to data. Functions available in BODS: the View Data function, Validation transform statistics, auditing and data profiling. Profiling types: column profiling and relationship profiling. Column profiling is of two types: basic column profiling (min length, max length, medium length, min %, max %, medium %, nulls %, blanks %) and detailed column profiling (distinct %, pattern %, median). Relationship profiling: defines the relationship between two objects; % of common data, % of non-common data. SAP has a separate tool for this: Data Insight / Information Steward.
BODS Variables
Variables enable dynamism in ETL jobs.
Global variable (G_XXX): job dependent; can be used anywhere inside the job (not across jobs); processing options: default value, user entry and through a script; input value: single value.

Global variables are global within a job. Setting parameters is not necessary when you use global variables. However, once you use a name for a global variable in a job, that name becomes reserved for the job. Global variables are exclusive within the context of the job in which they are created.
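A minimal script sketch showing how a global variable is typically defaulted and referenced; the variable name and the default are illustrative assumptions, not from the original notes:

  # give the load date a default only when no value was supplied at execution time;
  # $G_LOAD_DATE is a hypothetical global variable defined on the job
  $G_LOAD_DATE = nvl($G_LOAD_DATE, sysdate());
  # [ ] substitutes the value; { } would substitute it as a quoted string
  print('Job ' || job_name() || ' running for load date [$G_LOAD_DATE]');

Because the variable is job-wide, every workflow and dataflow in the job sees the same value without any parameter mapping.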

Local variable (L_XXX): job dependent and workflow dependent; can be used only in the place where it is defined; processing option: through a script; input value type: single value.

To pass a local variable to another object, define the local variable, then from the calling object create a parameter and map the parameter to the local variable by entering a parameter value.

Parameters (P_XXX): generally parameters are used, like a function's input and output parameters, to pass an input in and take an output out; they are workflow and dataflow dependent; they can be used only in the place where they are defined; processing option: only through local variable assignment; input value type: single value.

Parameters pass their values into and out of work flows, and pass their values into data flows. Each parameter is assigned a type: input, output, or input/output. The value passed by the parameter can be used by any object called by the work flow or data flow.

Substitution parameters ($$_XXX): they hold file path information; they are repository dependent; they can be used anywhere inside the repository; processing options: default value, user entry and through a script; input value type: single value.
LOOKUP()
Lookups always work at column level; a transform works at row level.
Normal lookup(): derives only one column; the lookup source is only a DB table; supports only the equi condition (=), and only one equi condition; performance tuning: pre-load cache, demand-load cache.
lookup_ext(): derives multiple columns from one lookup_ext function; lookup sources are DB tables and flat files; supports all operators (=, <, >, etc.); supports multiple conditions/columns; supports SCD-T2 sources; performance tuning: pre-load cache, demand-load cache and run as a separate process.
lookup_seq(): derives only one column; lookup sources are only DB tables; supports only equi conditions, and exactly two equi conditions; meant for SCD-T2 handling; no performance tuning options.
LOOKUP: retrieves a value in a table or file based on the values in a different source table or file.

LOOKUP EXT :

A graphic editor is available in the function wizard. lookup_ext retrieves a value in a table or file based on the values in a different source table or file, but it also provides extended functionality allowing you to: return multiple columns from a single lookup; choose from more operators to specify a lookup condition; perform multiple (including recursive) lookups; call lookup_ext in scripts and custom functions (which also lets you reuse the lookup(s) packaged inside scripts); define custom SQL, using the SQL_override parameter, to populate the lookup cache, narrowing large quantities of data to only the sections relevant for your lookup(s); use lookup_ext to dynamically execute SQL; call lookup_ext, using the function wizard, in the query output mapping to return multiple columns in a Query transform; design jobs to use lookup_ext without having to hard-code the name of the translation file at design time; use lookup_ext with memory datastore tables.
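As noted above, lookup_ext can be called in a script or in a query output mapping. A hedged sketch of the call shape is below; the datastore, table, column and variable names are illustrative assumptions, and in practice the call is best generated with the function wizard rather than typed by hand:

  # look up the customer name for a customer id held in a global variable;
  # argument groups (as generated by the wizard): [lookup table, cache spec, return policy],
  # [return column(s)], [default(s) if no match], [compare column, operator, expression]
  $G_CUST_NAME = lookup_ext([DS_SRC.DBO.CUSTOMER, 'PRE_LOAD_CACHE', 'MAX'],
                            [CUST_NAME], ['Unknown'],
                            [CUST_ID, '=', $G_CUST_ID]);

The cache specification ('PRE_LOAD_CACHE' versus 'DEMAND_LOAD_CACHE' or 'NO_CACHE') is where the performance tuning options mentioned above are applied.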

LOOKUP SEQ :

Retrieves a value in a table or file based on the values in a different source table or file and a particular sequence value.
NEW FUNCTION CALL:

** Works the same as a normal lookup_ext.

** The only difference is that it can be edited in the wizard form, which is not possible with a normal lookup_ext call.

** No need to use global/local variables for the output ports.
BODS Project Deployment
Repository content: two methods. Direct method (repo-to-repo): good for first-time deployment. Export/import method (.atl/.xml): good for version management.
BODS management console content: the configuration done in the BODS management console is not stored in the repository; it is stored in the files admin.xml, as.xml and sapconnection.xml, which are under LINK_DIR/conf and LINK_DIR/bin.
CMC content: use LCM or promotion management to deploy from the source CMS database to the target.
Substitution parameter configurations and system configuration content: using the export/import option inside BODS Designer we can deploy them to the target system.
CENTRAL REPOSITORY
Limitations of a local repository: no multi-user access, no auto-lock feature, no version management and no security. A central repository enables a multi-user environment and version management in BODS. There are two types of central repository: unsecured central repository and secured central repository.

Exporting a new job to the central repository
1. Right-click the new job that you created; an option named Export to Central Repository appears; click it and the job is exported to the path mentioned in the repository.
2. To make modifications and maintain versions in the central repository, follow the steps below.
Checking out objects from the central repository (to check out an object and its dependent objects):
1. Open the central object library.
2. In the central object library, right-click the job_.
3. Click Check Out > Object and dependents.
Checking in objects to the central repository (to check in a single object):
1. Open the central object library.
2. Right-click the DF_EmpLoc data flow in the central object library and click Check In > Object. A Comment window opens.
3. Type the required comment; the comment labels the version so we can find out what modification was made to the job that is already present.

BODS Performance Tuning Techniques
Environmental performance tuning techniques: OS level: I/O ratio; application level: increase the application-level cache memory; database level: primary indexes, regular DB statistics collection; network level: increase the bandwidth; BODS application side: check boxes in the execution properties such as monitor sample rate, print all trace messages, disable data validation statistics collection, enable auditing.
Multi-threading: partitioning the source table; for flat files, allocating parallel threads.
Multi-processing: transform level: run as a separate process; data flow level: degree of parallelism.
Parallel processing: parallel data flows give parallel processing.
Push down: pushing BODS operations to other systems; BODS automatically pushes operations down if the source and target are in the same schema; if they are in different schemas we can use a DB link, the Data_Transfer transform or the SQL transform.
Bulk loading: enabled only for database-type targets; performed with the help of the SQL*Loader tool from BODS; target object editor properties: check box Bulk load.
Throughput: data flow level: cache type (in-memory cache, pageable cache); transform level: Table_Comparison (cached comparison method); lookup functions: pre-load cache and demand-load cache.
Other: source level: array fetch size (best practice 1000-5000); join rank (normal, ranking and cache algorithms); target level: rows per commit 1000-5000; Advanced -> General -> Number of loaders.
Error handling: target editor property -> Error handling -> Use overflow file.
Source-based performance options: using array fetch size; caching data; join ordering; minimizing extracted data. Target-based performance options: loading method and rows per commit; staging tables to speed up auto-correct loads. Job design performance options: improving throughput; maximizing the number of pushed-down operations; minimizing data type conversion; minimizing locale conversion; improving Informix repository performance.

1. Utilize a database (like Oracle / Sybase / Informix / DB2 etc.) for significant data handling operations (such as sorts, groups, aggregates). In other words, staging tables can be a huge benefit to parallelism of operations. Parallel design, simply defined by mathematics, nearly always cuts your execution time. Staging tables have many benefits; please see the staging table discussion in the methodologies section for full details.
2. Localize. Localize all target tables on to the SAME instance of Oracle (same SID), or the same instance of Sybase. Try not to use synonyms (remote database links) for anything (including lookups, stored procedures, target tables, sources, functions, privileges, etc.). Utilizing remote links will most certainly slow things down. For Sybase users, remote mounting of databases can definitely be a hindrance to performance.
3. If you can, localize all target tables, stored procedures, functions, views and sequences in the SOURCE database. Again, try not to connect across synonyms. Synonyms (remote database tables) could potentially affect performance by as much as a factor of 3 times or more.
4. Remove external registered modules. Perform pre-processing / post-processing utilizing PERL, SED, AWK or GREP instead. The Application Programmer's Interface (API) which calls externals is inherently slow (as of 1/1/2000). Hopefully Informatica will speed this up in the future. The external module which exhibits speed problems is the regular expression module (Unix: Sun Solaris E450, 4 CPUs, 2 GB RAM, Oracle 8i and Informatica). It dropped speed from 1500+ rows per second without the module to 486 rows per second with the module. No other sessions were running. (This was a SPECIFIC case, with a SPECIFIC map; it's not like this for all maps.)
5. Remember that Informatica suggests that each session takes roughly 1 to 1.5 CPUs. In keeping with this, Informatica plays well with RDBMS engines on the same machine, but does NOT get along (performance-wise) with ANY other engine (reporting engine, Java engine, OLAP engine, Java virtual machine, etc.).
6. Remove any database-based sequence generators. This requires a wrapper function / stored procedure call. Utilizing these stored procedures has caused performance to drop by a factor of 3 times. This slowness is not easily debugged; it can only be spotted in the Write Throughput column. Copy the map and replace the stored procedure call with an internal sequence generator for a test run; this is how fast you COULD run your map. If you must use a database-generated sequence number, then follow the instructions for staging table usage. If you're dealing with gigabytes or terabytes of information, this should save you lots of hours of tuning. If you MUST have a shared sequence generator, then build a staging table from the flat file, add a SEQUENCE ID column, and call a POST TARGET LOAD stored procedure to populate that column. Place the post-target-load procedure into the flat-file-to-staging-table load map. A single call inside the database, followed by a batch operation to assign sequences, is the fastest method for utilizing shared sequence generators.
7. TURN OFF VERBOSE LOGGING. The session log has a tremendous impact on the overall performance of the map. Force an override in the session, setting it to NORMAL logging mode. Unfortunately the logging mechanism is not "parallel" in the internal core; it is embedded directly into the operations.
8. Turn off 'collect performance statistics'. This also has an impact, although minimal at times; it writes a series of performance data to the performance log. Removing this operation reduces reliance on the flat file operations. However, it may be necessary to have this turned on DURING your tuning exercise. It can reveal a lot about the speed of the reader and writer threads.
9. If your source is a flat file, utilize a staging table (see the staging table slides in the presentations section of this web site). This way you can also use SQL*Loader, BCP, or some other database bulk-load utility. Place basic logic in the source load map and remove all potential lookups from the code. At this point, if your reader is slow, then check two things: 1) if you have an item in your registry or configuration file which sets the "ThrottleReader" to a specific maximum number of blocks, it will limit your read throughput (this only needs to be set if the sessions have demonstrated problems with constraint-based loads); 2) move the flat file to local internal disk (if at all possible). Try not to read a file across the network, or from a RAID device. Most RAID arrays are fast, but Informatica seems to top out, where internal disk continues to be much faster. Here a link will NOT work to increase speed; it must be the full file itself, stored locally.
10. Try to eliminate the use of non-cached lookups. By issuing a non-cached lookup, your performance will be impacted significantly, particularly if the lookup table is also a "growing" or "updated" target table; this generally means the indexes are changing during operation, and the optimizer loses track of the index statistics. Again, utilize staging tables if possible. In utilizing staging tables, views can be built in the database which join the data together, or Informatica's joiner object can be used to join data together; either one will help dramatically increase speed.
11. Separate complex maps; try to break the maps out into logical threaded sections of processing. Re-arrange the architecture if necessary to allow for parallel processing. There may be more, smaller components doing individual tasks, however the throughput will be proportionate to the degree of parallelism that is applied. A discussion on HOW to perform this task is posted on the methodologies page; please see that discussion for further details.
12. BALANCE. Balance between Informatica and the power of SQL and the database. Try to utilize the DBMS for what it was built for: reading/writing/sorting/grouping/filtering data en masse. Use Informatica for the more complex logic, outer joins, data integration, multiple source feeds, etc. The balancing act is difficult without DBA knowledge. In order to achieve a balance, you must be able to recognize which operations are best in the database, and which ones are best in Informatica. This does not detract from the use of the ETL tool, rather it enhances it; it's a MUST if you are performance tuning for high-volume throughput.
13. TUNE the DATABASE. Don't be afraid to estimate small, medium, large, and extra-large source data set sizes (in terms of numbers of rows and average number of bytes per row), expected throughput for each, turnaround time for load, and whether it is a trickle feed. Give this information to your DBAs and ask them to tune the database for the "worst case". Help them assess which tables are expected to be high read / high write, which operations will sort (order by), etc. Moving disks and assigning the right table to the right disk space could make all the difference. Utilize a PERL script to generate "fake" data for small, medium, large, and extra-large data sets. Run each of these through your mappings; in this manner, the DBA can watch or monitor throughput as a real load size occurs.
14. Be sure there is enough SWAP and TEMP space on your PMSERVER machine. Not having enough disk space could potentially slow down your entire server during processing (in an exponential fashion). Sometimes this means watching the disk space while your session runs; otherwise you may not get a good picture of the space available during operation, particularly if your maps contain aggregates, or lookups that flow to the disk cache directory, or if you have a JOINER object with heterogeneous sources.
15. Place some good server load monitoring tools on your PMServer in development; watch it closely to understand how the resources are being utilized, and where the hot spots are. Try to follow the recommendations; it may mean upgrading the hardware to achieve throughput. Look into EMC's disk storage array: while expensive, it appears to be extremely fast; I've heard (but not verified) that it has improved performance in some cases by up to 50%.
16. SESSION SETTINGS. In the session, there is only so much tuning you can do. Balancing the throughput is important; by turning on "Collect Performance Statistics" you can get a good feel for what needs to be set in the session, or what needs to be changed in the database. Read the performance section carefully in the Informatica manuals. Basically what you should try to achieve is OPTIMAL READ, OPTIMAL THROUGHPUT, OPTIMAL WRITE. Over-tuning one of these three pieces can result in ultimately slowing down your session. For example, your write throughput is governed by your read and transformation speed; likewise, your read throughput is governed by your transformation and write speed. The best method to tune a problematic map is to break it into components for testing:
1) Read throughput: tune for the reader, see what the settings are, and send the write output to a flat file for less contention. Check the "ThrottleReader" setting (which is not configured by default), and increase the Default Buffer Size by a factor of 64k each shot (ignore the warning above 128k). If the reader still appears to increase during the session, then stabilize (after a few thousand rows), then try increasing the Shared Session Memory from 12MB to 24MB. If the reader still stabilizes, then you have a slow source, slow lookups, or your CACHE directory is not on internal disk. If the reader's throughput continues to climb above where it stabilized, make note of the session settings. Check the Performance Statistics to make sure the writer throughput is NOT the bottleneck; you are attempting to tune the reader here, and don't want the writer threads to slow you down. Change the map target back to the database targets and run the session again. This time, make note of how much the reader slows down; its optimal performance was reached with a flat file(s), so this time slow targets are the cause. NOTE: if your reader session to a flat file just doesn't ever "get fast", then you've got some basic map tuning to do. Try to merge expression objects, set your lookups to unconnected (for re-use if possible), and check your Index and Data cache settings if you have aggregation or lookups being performed, etc.
2) Write throughput: if you have a slow writer, change the map to a single target table at a time, see which target is causing the "slowness" and tune it. Make copies of the original map, and break down the copies. Once the "slower" of the N targets is discovered, talk to your DBA about partitioning the table, updating statistics, removing indexes during load, etc. There are many database things you can do here.
17. Remove all other "applications" on the PMServer, except for the database / staging database or data warehouse itself. PMServer plays well with RDBMSs (relational database management systems) but doesn't play well with application servers, particularly Java virtual machines, web servers, security servers, application servers and report servers. All of these items should be broken out to other machines. This is critical to improving performance on the PMServer machine.

BODS Scripting Language
Keywords: begin...end; if...else; while; try...catch. Special characters: # comment; ; end of statement; ( ) to define/call a function; { } to take a variable's value as a quoted text string; [ ] to take the value of a variable or expression; , to separate parameter values; ! not; || concatenation; ' ' string delimiters; \ escape character; $ variable prefix; * multiplication.
BODS Recovery Mechanism
Tool-based recovery mechanism: this is available in BODS but switched off by default; we can enable it when required: job -> Execute -> Enable recovery; then Recover from last failed execution; Recover as a unit (at workflow level). Limitations: not at dataflow level; the job should be sequential.
Custom-based recovery: recovery logic built into the job design so that the job can be rerun when the last run did not complete successfully. Automatically recovering jobs: a DI feature that allows you to run unsuccessful jobs in recovery mode. Manually recovering jobs: a design technique that allows you to rerun jobs without regard to partial results in a previous run.

Manually recovering jobs: a job designed for manual recovery must: be such that it can be run repeatedly; implement special steps to recover data when a step did not complete successfully during a previous run.
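One common manual-recovery pattern is to make each run repeatable by clearing the slice the run is about to load before reloading it. A minimal sketch, assuming a target datastore DS_DW, a table SALES_FACT, a LOAD_DATE column and a global variable $G_LOAD_DATE (all illustrative names), placed in a script before the dataflow:

  # make the run repeatable: remove any rows left behind by a failed previous run;
  # { } substitutes the variable value as a quoted literal inside the SQL text
  sql('DS_DW', 'DELETE FROM SALES_FACT WHERE LOAD_DATE = {$G_LOAD_DATE}');
  print('Cleared SALES_FACT for load date {$G_LOAD_DATE}; reloading');

Because the delete and the reload both key on the same load date, rerunning the job after a failure produces the same result as a clean first run.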

BODS Debugging
Debugging is used to trace a job at the data level; it is used to identify issues record by record; only dataflows can be debugged. Prerequisites: without breakpoints we cannot debug (mandatory); conditions and filters are optional. A breakpoint is the point where interactive debugging starts working. The BODS interactive debugger is available in Designer: Tools -> Options -> Designer -> Environment -> Interactive Debugger.

Interactive Debugger: it allows you to examine and modify data row by row by placing filters and breakpoints on lines in the data flow during a debug-mode job execution. Using the debugger we can see what happens to the data after each transform or object in the flow. Designer displays four additional windows: 1. Call Stack, 2. Trace, 3. Variables, 4. View Data. The left View Data pane shows the data in the source table, and the right pane shows one row at a time (the default when you set a breakpoint) that has passed to the query.
Filters: a debug filter functions as a simple Query transform with a WHERE clause; however, complex expressions are not supported in a debug filter. Use a filter if you want to reduce a data set in a debug job execution.
Breakpoints: we can set a breakpoint between a source and a transform, or between two transforms. A breakpoint is the location where a debug job execution pauses and returns control to you. A breakpoint condition applies to the after-image for UPDATE, NORMAL and INSERT row types and to the before-image for DELETE rows.
Steps to set filters and breakpoints: 1. Open the job you want to debug with filters and breakpoints in the workspace. 2. Open one of its data flows. 3. Right-click the connecting line that you want to examine, and select Set Filter/Breakpoint. 4. In the breakpoint window, under the Filter or Breakpoint columns, select the Set check box.

Complete the Column, Operator, and Value columns accordingly.
TRY/CATCH
A try/catch block is a combination of one try object and one or more catch objects that allow you to specify alternative work flows if errors occur while DI is executing a job. Try/catch blocks catch classes of exceptions thrown by DI, the DBMS, or the operating system; apply the solutions that you provide; and continue execution. Try and catch objects are single-use objects.
Scripts are single-use objects used to call functions and assign values to variables in a work flow. A script can contain function calls, if statements, while statements, assignment statements, and operators.
MIGRATION AND REPOSITORIES
DI supports a number of environments, including large enterprises with many developers working on multiple projects. DI supports multi-site architectures, whether centralized or not. The development process you use to create your ETL application involves three distinct phases: design, test, and production. Each phase may require a different computer in a different environment, and different security settings for each.

Data Integrator provides two migration mechanisms: Export/import migration works best with small to medium-sized projects where a small number of developers work on somewhat independent Data Integrator applications through all phases of development. Multi-user development works best in larger projects where two or more developers or multiple teams are working on interdependent parts of Data Integrator applications through all phases of development.

Exporting Objects to a Database

You can export objects from the current repository to another repository. However, the other repository must be the same version as the current one. The export process allows you to change environment-specific information defined in datastores and file formats to match the new environment

Exporting/Importing Objects to/from a File: You can also export objects to a file. If you choose a file as the export destination, DI does not provide options to change environment-specific information. Importing objects or an entire repository from a file overwrites existing objects with the same names in the destination repository. You must restart DI after the import process completes.

Using built-in functions: DI provides over 60 built-in functions.
Aggregate: avg, count, count_distinct, max, min, sum
Conversion: cast, extract_from_xml, interval_to_char, julian_to_date, load_to_xml, long_to_varchar, num_to_interval, to_char, to_date, to_decimal, to_decimal_ext, varchar_to_long
Database: total_rows, key_generation, sql
Date: week_in_year, week_in_month, sysdate, quarter, month, last_date, julian, isweekend, fiscal_day, day_in_year, day_in_week, day_in_month, date_part, date_diff, concat_date_time, add_months
Environment: get_env, get_error_filename, get_monitor_filename, get_trace_filename, is_set_env, set_env
Lookup: lookup, lookup_ext, lookup_seq
Math: ceil, floor, ln, log, mod, power, rand, rand_ext, round, sqrt, trunc
Miscellaneous: current_configuration, current_system_configuration, dataflow_name, datastore_field_value, db_type, db_version, db_database_name, db_owner, decode, file_exists, gen_row_num_by_group, gen_row_num, get_domain_description, get_file_attribute, greatest, host_name, ifthenelse, is_group_changed, isempty, job_name, least, nvl, previous_row_value, pushdown_sql, raise_exception, raise_exception_ext, repository_name, sleep, system_user_name, table_attribute, truncate_table, wait_for_file, workflow_name

DATA TRANSFER

XML_PIPELINE

VALIDATION TRANSFORM

QUERY TRANSFORM

TABLE COMPARISON

KEY GENERATION

HIERARCHY FLATTENING

Lookup_ext

List of functions

PIVOT TRANSFORM

EXECUTION PROPERTIES

SAP BusinessObjects Data Services 4.0 feature areas:

Documentation; SAP integration; Security; Text Data Processing; Architecture; Transforms; Operational excellence; Functions; Source and target support; Data Quality (Data Cleanse transform, Geocoder transform, Global Address Cleanse transform, Match transform, USA Regulatory Address Cleanse transform).

What are the benefits of Datawarehousing?

Facilitate integration in an environment characterized by unintegrated applications. Integrate enterprise data across a variety of functions. Integrate external as well as internal data. Support strategic and long-term business planning. Support day-to-day tactical decisions. Enable insight into business trends and business opportunities.
What is the difference between a dimension table and a fact table?

A dimension table consists of tuples of attributes of the dimension. A fact table can be thought of as having tuples, one per recorded fact. Each fact contains some measured or observed variables and identifies them with pointers to dimension tables.

Data warehousing helps you store the data, while business intelligence helps you use the data for decision making, forecasting, and so on.
Dimension table examples: Retail -- store name, zip code, product name, product category, day of week. Telecommunications -- call origin, call destination. Banking -- customer name, account number, branch, account officer. Insurance -- policy type, insured party.
Dimension table characteristics: contain textual information that represents the attributes of the business; contain relatively static data; are joined to a fact table through a foreign key reference; are hierarchical in nature and provide the ability to view data at varying levels of detail.
Fact table examples: Retail -- number of units sold, sales amount. Telecommunications -- length of call in minutes, average number of calls. Banking -- average monthly balance. Insurance -- claims amount.
Fact table characteristics: contain numerical metrics of the business; can hold large volumes of data; can grow quickly; are joined to dimension tables through foreign keys that reference primary keys in the dimension tables.
Identifying measures and dimensions: if the attribute is perceived as constant or discrete, it is a dimension (product, location, time, size); if the attribute varies continuously, it is a measure (balance, units sold, cost, sales). A minimal dimension/fact sketch follows below.
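As a hedged illustration of these characteristics, a minimal star-style pair of tables; the names DIM_PRODUCT and FACT_SALES and the column choices are assumptions, not taken from the text.

    -- Dimension: textual, relatively static attributes with a primary key.
    CREATE TABLE DIM_PRODUCT (
        PRODUCT_KEY   INTEGER PRIMARY KEY,
        PRODUCT_NAME  VARCHAR(100),
        CATEGORY      VARCHAR(50)
    );

    -- Fact: numeric measures, joined to the dimension via a foreign key.
    CREATE TABLE FACT_SALES (
        PRODUCT_KEY   INTEGER REFERENCES DIM_PRODUCT (PRODUCT_KEY),
        SALE_DATE     DATE,
        UNITS_SOLD    INTEGER,
        SALES_AMOUNT  DECIMAL(12,2)
    );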

Business intelligence usually refers to the information that is available for the enterprise to make decisions on. A data warehousing (or data mart) system is the back-end, or infrastructural, component for achieving business intelligence. Business intelligence also includes the insight gained from data mining analysis, as well as from unstructured data (hence the need for content management systems). For our purposes here, we discuss business intelligence in the context of using a data warehouse infrastructure.
Business intelligence tools: Excel, reporting tools, OLAP tools, data mining tools.
Business intelligence uses: business operations reporting, forecasting, dashboards, multidimensional analysis, finding correlations among different factors.

When should you use a STAR and when a SNOWFLAKE schema? The star schema is the simplest data warehouse schema. A snowflake schema is similar to the star schema, but it normalizes the dimension tables to save storage space and can be used to represent hierarchies of information.
What is the difference between a data warehouse and a data mart? This is a heavily debated issue. There are inherent similarities between the basic constructs used to design a data warehouse and a data mart. In general, a data warehouse is used at an enterprise level, while data marts are used at a business division/department level. A data mart contains only the subject-specific data required for local analysis.

Data Warehouse: A data warehouse is a central repository for all or significant parts of the data that an enterprise's various business systems collect. Typically, a data warehouse is housed on an enterprise mainframe server. Data from various online transaction processing (OLTP) applications and other sources is selectively extracted and organized in the data warehouse database for use by analytical applications and user queries. Data warehousing emphasizes the capture of data from diverse sources for useful analysis and access.
Data Marts: A data mart is a repository of data gathered from operational data and other sources that is designed to serve a particular community of knowledge workers. In scope, the data may derive from an enterprise-wide database or data warehouse, or be more specialized. The emphasis of a data mart is on meeting the specific demands of a particular group of knowledge users in terms of analysis, content, presentation, and ease of use. Users of a data mart can expect to have data presented in terms that are familiar to them.

We have two schema models that are suitable for data warehousing in most cases. Star schema: a set of tables composed of a single, central fact table surrounded by denormalized dimensions; each dimension is represented in a single table. Snowflake schema: if you normalize the star schema dimensions into separate tables and link them together, you have a snowflake schema. A small sketch follows below.
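A minimal sketch of the difference, using a hypothetical product dimension (table and column names are assumptions): in the star schema the category attribute stays denormalized in the product dimension; in the snowflake schema it is split into its own table and linked.

    -- Star: one denormalized dimension table.
    CREATE TABLE DIM_PRODUCT_STAR (
        PRODUCT_KEY    INTEGER PRIMARY KEY,
        PRODUCT_NAME   VARCHAR(100),
        CATEGORY_NAME  VARCHAR(50)        -- category kept inline
    );

    -- Snowflake: the category is normalized into its own table.
    CREATE TABLE DIM_CATEGORY (
        CATEGORY_KEY   INTEGER PRIMARY KEY,
        CATEGORY_NAME  VARCHAR(50)
    );
    CREATE TABLE DIM_PRODUCT_SNOW (
        PRODUCT_KEY    INTEGER PRIMARY KEY,
        PRODUCT_NAME   VARCHAR(100),
        CATEGORY_KEY   INTEGER REFERENCES DIM_CATEGORY (CATEGORY_KEY)
    );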

Fact Tables - Types of Measures: additive facts, non-additive facts, semi-additive facts.

Additive Facts: additive facts are facts that can be summed up through all of the dimensions in the fact table.

Non-Additive Facts: non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.

Semi-Additive Facts: semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others (for example, an account balance can be summed across accounts but not across time).

Initial Load and Incremental/Refresh Load. Initial load: a single event that populates the database with historical data; involves large volumes of data. Refresh/incremental load: performed according to a business cycle; less data to load than the first-time load.

OLTP:

1. Captures the transactional information necessary to run business operations.
2. Needs performance.
3. More DML operations.
4. OLTP systems often use fully normalized schemas to optimize update/insert/delete performance and data consistency.
5. A typical OLTP operation accesses only a handful of records.
6. OLTP systems usually store data for only a few weeks or months, as needed.

OLAP:

1. Analyzes transaction information at an aggregate level to improve the decision-making process.
2. Needs flexibility and broad scope.
3. Very rare DML operations.
4. Denormalized or partially denormalized (star schema) to optimize query performance.
5. A typical DW query scans thousands or millions of rows.
6. Considerable historical data is maintained for analysis.
A contrast of a typical OLTP operation and a typical OLAP query is sketched below.
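To make the contrast concrete, a hedged sketch; the tables and columns (ACCOUNTS, FACT_SALES, DIM_PRODUCT) are hypothetical.

    -- Typical OLTP operation: touches a handful of rows by key.
    UPDATE ACCOUNTS
       SET BALANCE = BALANCE - 100
     WHERE ACCOUNT_ID = 12345;

    -- Typical OLAP/DW query: scans and aggregates many rows.
    SELECT D.CATEGORY, SUM(F.SALES_AMOUNT) AS TOTAL_SALES
      FROM FACT_SALES F
      JOIN DIM_PRODUCT D ON D.PRODUCT_KEY = F.PRODUCT_KEY
     GROUP BY D.CATEGORY;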

Data mart - a logical subset of the complete data warehouse. A data mart is a complete pie-wedge of the overall data warehouse pie. A data warehouse is made up of the union of all its data marts. A data warehouse is fed from the staging area. Every data mart must be represented by a dimensional model and, within a single data warehouse, all such data marts must be built from conformed dimensions and facts.

Mapping: the definition of the relationship and data flow between source and target objects.
Metadata: data that describes data and other structures, such as objects, business rules, and processes. For example, the schema design of a data warehouse is typically stored in a repository as metadata, which is used to generate scripts used to build and populate the data warehouse. A repository contains metadata.
Staging area: a place where data is processed before entering the warehouse.
Cleansing: the process of resolving inconsistencies and fixing anomalies in source data, typically as part of the ETL process.
Transformation: the process of manipulating data; any manipulation beyond copying is a transformation. Examples include cleansing, aggregating, and integrating data from multiple sources.

What is Entity Relationship (E-R) Modeling? The Entity Relationship (E-R) model was developed to address the following issues of conventional database management systems (DBMS): (i) redundancy of data, (ii) lack of integration, and (iii) lack of flexibility.

This modeling is based on relational theory and abides by the 12 rules (13 counting Rule 0) proposed by E.F. Codd that a DBMS implementation must follow to be qualified as truly relational. The data in the E-R model is presented in the simple form of two-dimensional tables.

Normalization: normalization is the process of decomposing tables to prevent redundancy and insert/update anomalies.

First Normal Form (1NF): a table is said to be in First Normal Form (1NF) if it satisfies the following three conditions: 1) the columns of the table contain only atomic values (single, indivisible values); 2) a primary key is defined for the table; 3) all the columns of the table are defined on the primary key.
Second Normal Form (2NF): a table is said to be in Second Normal Form if it satisfies the following conditions: 1) it satisfies the conditions for First Normal Form (1NF); 2) it does not include any partial dependencies, where a column depends on only a part of a composite primary key.
Third Normal Form (3NF): a table is said to be in Third Normal Form (3NF) if it satisfies the following conditions: 1) it is in 2NF; 2) it does not contain any transitive dependency, which means that a non-key column of the table should not depend on another non-key column.
Denormalization: denormalization can be defined as the process of moving from a higher normal form to a lower normal form in order to speed up database access. (A small decomposition sketch follows below.)
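A hedged sketch of removing a partial dependency to reach 2NF; the order-line example and all names are hypothetical. ORDER_ID + PRODUCT_ID is the composite key, and PRODUCT_NAME depends only on PRODUCT_ID.

    -- Not in 2NF: PRODUCT_NAME depends on only part of the key (PRODUCT_ID).
    CREATE TABLE ORDER_LINE_UNNORMALIZED (
        ORDER_ID      INTEGER,
        PRODUCT_ID    INTEGER,
        PRODUCT_NAME  VARCHAR(100),
        QUANTITY      INTEGER,
        PRIMARY KEY (ORDER_ID, PRODUCT_ID)
    );

    -- 2NF decomposition: the partially dependent column moves to its own table.
    CREATE TABLE PRODUCT (
        PRODUCT_ID    INTEGER PRIMARY KEY,
        PRODUCT_NAME  VARCHAR(100)
    );
    CREATE TABLE ORDER_LINE (
        ORDER_ID      INTEGER,
        PRODUCT_ID    INTEGER REFERENCES PRODUCT (PRODUCT_ID),
        QUANTITY      INTEGER,
        PRIMARY KEY (ORDER_ID, PRODUCT_ID)
    );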

Timestamp: stores a database-wide unique number that is updated every time a row is updated (for example, the SQL Server timestamp/rowversion data type).
SQL Join Types: there are different types of joins available in SQL (an illustrative sketch follows the list below):

INNER JOIN:returns rows when there is a match in both tables.

LEFT JOIN:returns all rows from the left table, even if there are no matches in the right table.

RIGHT JOIN:returns all rows from the right table, even if there are no matches in the left table.

FULL JOIN: returns all rows from both tables, combining rows where there is a match and filling in NULLs where there is not.

SELF JOIN:is used to join a table to itself, as if the table were two tables, temporarily renaming at least one table in the SQL statement.

CARTESIAN JOIN: returns the Cartesian product of the sets of records from the two or more joined tables.
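As an illustrative sketch of the join types above; the EMP and DEPT tables and their columns are hypothetical.

    -- INNER JOIN: only matching rows.
    SELECT E.EMP_NAME, D.DEPT_NAME
      FROM EMP E
      JOIN DEPT D ON D.DEPT_ID = E.DEPT_ID;

    -- LEFT JOIN: all employees, department columns are NULL when there is no match.
    SELECT E.EMP_NAME, D.DEPT_NAME
      FROM EMP E
      LEFT JOIN DEPT D ON D.DEPT_ID = E.DEPT_ID;

    -- SELF JOIN: a table joined to itself (employee to manager).
    SELECT E.EMP_NAME, M.EMP_NAME AS MANAGER_NAME
      FROM EMP E
      JOIN EMP M ON M.EMP_ID = E.MANAGER_ID;

    -- CARTESIAN (CROSS) JOIN: every combination of rows from the two tables.
    SELECT E.EMP_NAME, D.DEPT_NAME
      FROM EMP E
     CROSS JOIN DEPT D;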

TIMESTAMP(expr), TIMESTAMP(expr1, expr2) (MySQL): with a single argument, this function returns the date or datetime expression expr as a datetime value. With two arguments, it adds the time expression expr2 to the date or datetime expression expr1 and returns the result as a datetime value.
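For example (MySQL; the results in the comments reflect the single- and two-argument behavior described above):

    -- Single argument: returned as a datetime value.
    SELECT TIMESTAMP('2003-12-31');                      -- 2003-12-31 00:00:00

    -- Two arguments: the time expression is added to the first argument.
    SELECT TIMESTAMP('2003-12-31 12:00:00', '12:00:00'); -- 2004-01-01 00:00:00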

DELTA LOAD

A delta load, by definition, loads incremental changes to the data. When doing a delta load to a fact table, for example, you perform inserts only, appending the change data to the existing table. (A minimal sketch follows below.)
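A minimal, hedged sketch of a timestamp-driven delta load; the tables SRC_ORDERS and FACT_ORDERS and the ETL_CONTROL high-water-mark table are assumptions for illustration only.

    -- Append only the rows changed since the previous run.
    INSERT INTO FACT_ORDERS (ORDER_ID, AMOUNT, CHANGED_AT)
    SELECT S.ORDER_ID, S.AMOUNT, S.CHANGED_AT
      FROM SRC_ORDERS S
     WHERE S.CHANGED_AT > (SELECT LAST_RUN FROM ETL_CONTROL WHERE JOB_NAME = 'ORDERS');

    -- Record the new high-water mark for the next delta run.
    UPDATE ETL_CONTROL
       SET LAST_RUN = CURRENT_TIMESTAMP
     WHERE JOB_NAME = 'ORDERS';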

Load types:-1)Bulk Load2)Normal Load

Normal load: 1) used in case of less data; 2) log details are available; 3) rollback and commit are possible; 4) session recovery is possible; 5) performance may be lower.

Bulk load: 1) used in case of large data volumes; 2) no log details are available; 3) cannot rollback and commit; 4) session recovery is not possible; 5) performance improves.

Following are commonly used constraints available in SQL: NOT NULL constraint: ensures that a column cannot have a NULL value. DEFAULT constraint: provides a default value for a column when none is specified. UNIQUE constraint: ensures that all values in a column are different. PRIMARY KEY: uniquely identifies each row/record in a database table. FOREIGN KEY: references a row/record in another database table. CHECK constraint: ensures that all values in a column satisfy certain conditions. INDEX: used to create and retrieve data from the database very quickly.

NOT NULL constraint: by default, a column can hold NULL values. If you do not want a column to have a NULL value, define this constraint on the column, specifying that NULL is not allowed for that column. A NULL is not the same as no data; rather, it represents unknown data.
DEFAULT constraint: the DEFAULT constraint provides a default value for a column when the INSERT INTO statement does not provide a specific value.
UNIQUE constraint: the UNIQUE constraint prevents two records from having identical values in a particular column. In the CUSTOMERS table, for example, you might want to prevent two or more people from having an identical age.
PRIMARY KEY: a primary key is a field in a table which uniquely identifies each row/record in the table. Primary keys must contain unique values, and a primary key column cannot have NULL values. A table can have only one primary key, which may consist of a single field or multiple fields. When multiple fields are used as a primary key, they are called a composite key. If a table has a primary key defined on any field(s), then you cannot have two records with the same value in that field(s).
FOREIGN KEY: a foreign key is a key used to link two tables together; it is sometimes called a referencing key. The primary key field from one table is inserted into the other table, where it becomes a foreign key, i.e. a foreign key is a column or a combination of columns whose values match a primary key in a different table. The relationship between two tables matches the primary key in one of the tables with a foreign key in the second table.
CHECK constraint: the CHECK constraint enables a condition to check the value being entered into a record. If the condition evaluates to false, the record violates the constraint and isn't entered into the table.
INDEX: an INDEX is used to create and retrieve data from the database very quickly. An index can be created using a single column or a group of columns in a table. When an index is created, it is assigned a ROWID for each row before it sorts the data. Proper indexes are good for performance in large databases, but you need to be careful while creating indexes; the selection of fields depends on what you use in your SQL queries.
Database Normalization: database normalization is the process of efficiently organizing data in a database. There are two goals of the normalization process: eliminating redundant data (for example, storing the same data in more than one table) and ensuring data dependencies make sense (only storing related data in a table). (A hedged constraint sketch follows below.)
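A hedged, illustrative CREATE TABLE pulling the constraints above together; the CUSTOMERS/ORDERS tables and all columns are assumptions, and the syntax is generic SQL.

    CREATE TABLE CUSTOMERS (
        ID       INTEGER      NOT NULL PRIMARY KEY,   -- NOT NULL + PRIMARY KEY
        NAME     VARCHAR(50)  NOT NULL,
        EMAIL    VARCHAR(100) UNIQUE,                 -- UNIQUE constraint
        AGE      INTEGER      CHECK (AGE >= 18),      -- CHECK constraint
        COUNTRY  VARCHAR(30)  DEFAULT 'INDIA'         -- DEFAULT constraint
    );

    CREATE TABLE ORDERS (
        ORDER_ID     INTEGER PRIMARY KEY,
        CUSTOMER_ID  INTEGER REFERENCES CUSTOMERS (ID) -- FOREIGN KEY
    );

    CREATE INDEX IDX_CUSTOMERS_NAME ON CUSTOMERS (NAME); -- INDEX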

DELETE:
1. DELETE is a DML command.
2. A DELETE statement is executed using a row lock; each row in the table is locked for deletion.
3. We can specify filters in a WHERE clause.
4. It deletes the specified data if a WHERE condition exists.
5. DELETE activates triggers because the operations are logged individually.
6. Slower than TRUNCATE because it keeps logs.
7. Rollback is possible.
TRUNCATE:
1. TRUNCATE is a DDL command.
2. TRUNCATE TABLE always locks the table and pages, but not each row.
3. Cannot use a WHERE condition.
4. It removes all the data.
5. TRUNCATE TABLE cannot activate a trigger because the operation does not log individual row deletions.
6. Faster performance-wise, because it does not keep any logs.
7. Rollback is not possible.
DELETE and TRUNCATE can both be rolled back when used within a TRANSACTION.

Primary Key vs Unique Key:
1) Primary key: will not accept NULL values. Unique key: one and only one NULL value is accepted.
2) Primary key: there can be only one primary key in a table. Unique key: there can be more than one unique key in a table.
3) Primary key: a clustered index is created on the primary key. Unique key: a non-clustered index is created on the unique key.
4) Primary key: allows each row in a table to be uniquely identified and ensures that no duplicate rows exist. Unique key: used to prevent duplication of key values within the rows of a table while allowing null values.

Primary Key: 1) used to avoid duplication of records in the same column; 2) used to uniquely identify each record in the table; 3) only one primary key can be used at a time on a single table; 4) it does not allow any null or duplicate value. Foreign Key: used to reference another table's primary key; a value cannot be inserted or updated in the child table unless it already exists in the referenced (parent) table, so the parent table must be updated first before the change is reflected in the child table.

Primary Key: will not allow NULL values or duplicate values. Foreign Key: will allow NULL values and duplicate values, and it refers to a primary key in another table.

SELECT column1, column2, columnN FROM table_name;

What is a surrogate key? Explain it with an example.
Answer: Data warehouses commonly use a surrogate key to uniquely identify an entity. A surrogate key is not generated by the user but by the system. A primary difference between a primary key and a surrogate key in some databases is that the PK uniquely identifies a record while the SK uniquely identifies an entity. E.g. an employee may be recruited before the year 2000 while another employee with the same name may be recruited after the year 2000. Here, the primary key will uniquely identify the record, while the surrogate key will be generated by the system (say a serial number), since the SK is not derived from the data.
A surrogate key is a unique identifier in the database, either for an entity in the modeled world or an object in the database. Application data is not used to derive the surrogate key; it is an internally generated key, invisible to the user. Advantages of surrogate keys include control over data and reduced fact table size.
22. What is a stored procedure? It is nothing but a set of T-SQL statements combined to perform a single task or several tasks. It is basically like a macro: when you invoke the stored procedure, you actually run a set of statements.

To COMMIT means to make changes to data permanent.

COMMIT - save the work done.
SAVEPOINT - identify a point in a transaction to which you can later roll back.
ROLLBACK - restore the database to its original state since the last COMMIT.
SET TRANSACTION - change transaction options, such as which rollback segment to use.
(A small sketch follows below.)
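A small sketch of these statements working together; the ACCOUNTS table is hypothetical and the syntax follows Oracle-style transaction control.

    UPDATE ACCOUNTS SET BALANCE = BALANCE - 100 WHERE ACCOUNT_ID = 1;
    SAVEPOINT after_debit;                 -- a point we can roll back to

    UPDATE ACCOUNTS SET BALANCE = BALANCE + 100 WHERE ACCOUNT_ID = 2;
    ROLLBACK TO after_debit;               -- undo only the second update

    COMMIT;                                -- make the remaining work permanent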

What is an ODS (operational data store)?
ODS stands for operational data store. It contains the most recent data, typically around 30-60 days of data, and it sits between the source systems and the staging area; reports can also be taken from the ODS. An ODS is an integrated database of operational data. Its sources include legacy systems, and it contains current or near-term data. An ODS may contain 30 to 60 days of information, while a data warehouse typically contains years of data; the ODS comes before staging.
Full load: completely erasing the contents of one or more tables and reloading them with fresh data.
Incremental load: applying ongoing changes to one or more tables based on a predefined schedule.
The ODS (operational data store) is the first point in the data warehouse; it stores the real-time data of daily transactions as the first instance of the data.

The staging area is the later part, which comes after the ODS. Here the data is cleansed and temporarily stored before being loaded into the data warehouse.

Interview Questions

12. TRUNCATE TABLE EMP; DELETE FROM EMP; Will the outputs of the above two commands differ?
DELETE command: 1. It is a DML command. 2. Data can be rolled back. 3. It is slower than TRUNCATE because it logs each row deletion. 4. With the DELETE command a trigger can be fired.
TRUNCATE command: 1. It is a DDL command. 2. Data cannot be rolled back. 3. It is faster than DELETE because it does not log rows. 4. With the TRUNCATE command a trigger cannot be fired.
In both cases only the table data is removed, not the table structure. (A short usage sketch follows below.)
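A short usage sketch of the two commands compared above; the EMP table and DEPT_ID column are hypothetical.

    -- DELETE: DML, can take a WHERE clause, logs each row, can be rolled back.
    DELETE FROM EMP WHERE DEPT_ID = 10;

    -- TRUNCATE: DDL, no WHERE clause, removes all rows with minimal logging.
    TRUNCATE TABLE EMP;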

13. What is the use of the DROP option in the ALTER TABLE command? The DROP option in the ALTER TABLE command is used to drop columns you no longer need from the table. The column may or may not contain data. Using the ALTER TABLE ... DROP COLUMN statement, only one column can be dropped at a time; the table must have at least one column remaining after it is altered; and once a column is dropped, it cannot be recovered. (A small sketch follows below.)
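For example (the table and column names are hypothetical):

    -- Drop a column that is no longer needed; the data in it is lost.
    ALTER TABLE EMP DROP COLUMN TEMP_FLAG;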

20. What is a data warehousing hierarchy? Hierarchies are logical structures that use ordered levels as a means of organizing data. A hierarchy can be used to define data aggregation; for example, in a time dimension, a hierarchy might aggregate data from the month level to the quarter level to the year level. A hierarchy can also be used to define a navigational drill path and to establish a family structure.

Within a hierarchy, each level is logically connected to the levels above and below it. Data values at lower levels aggregate into the data values at higher levels. A dimension can be composed of more than one hierarchy. For example, in the product dimension, there might be two hierarchies--one for product categories and one for product suppliers.

Dimension hierarchies also group levels from general to granular. Query tools use hierarchies to enable you to drill down into your data to view different levels of granularity. This is one of the key benefits of a data warehouse.

When designing hierarchies, you must consider the relationships in business structures. For example, a divisional multilevel sales organization.

Hierarchies impose a family structure on dimension values. For a particular level value, a value at the next higher level is its parent, and values at the next lower level are its children. These familial relationships enable analysts to access data quickly.

Levels: a level represents a position in a hierarchy. For example, a time dimension might have a hierarchy that represents data at the month, quarter, and year levels. Levels range from general to specific, with the root level as the highest or most general level. The levels in a dimension are organized into one or more hierarchies.

Level relationships: level relationships specify the top-to-bottom ordering of levels from most general (the root) to most specific information. They define the parent-child relationship between the levels in a hierarchy.

Hierarchies are also essential components in enabling more complex rewrites. For example, the database can aggregate existing sales revenue on a quarterly basis to a yearly aggregation when the dimensional dependencies between quarter and year are known. (A small aggregation sketch follows below.)
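As a hedged illustration, the month-to-quarter-to-year idea can be expressed as aggregation over a time dimension; the FACT_SALES and DIM_TIME tables and their columns are assumptions.

    -- Roll sales revenue up to the year level using the time hierarchy.
    SELECT T.YEAR_NO, SUM(F.SALES_AMOUNT) AS YEARLY_SALES
      FROM FACT_SALES F
      JOIN DIM_TIME T ON T.TIME_KEY = F.TIME_KEY
     GROUP BY T.YEAR_NO;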

22. What is a surrogate key? Where do we use it? Explain with examples. A surrogate key is a substitution for the natural primary key.

It is just a unique identifier or number for each row that can be used for the primary key to the table. The only requirement for a surrogate primary key is that it is unique for each row in the table.

Data warehouses typically use a surrogate, (also known as artificial or identity key), key for the dimension tables primary keys. They can use Infa sequence generator, or Oracle sequence, or SQL Server Identity values for the surrogate key.

It is useful because the natural primary key (i.e. Customer Number in Customer table) can change and this makes updates more difficult.

Some tables have columns such as AIRPORT_NAME or CITY_NAME which are stated as the primary keys (according to the business users), but not only can these change, indexing on a numerical value is probably better, so you could consider creating a surrogate key called, say, AIRPORT_ID. This would be internal to the system and, as far as the client is concerned, you may display only the AIRPORT_NAME. (A minimal generation sketch follows below.)
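A minimal sketch of generating such a surrogate key two common ways; the DIM_AIRPORT table is hypothetical, and the exact syntax is dialect-specific (identity column in SQL Server style, sequence in Oracle style).

    -- Option 1: identity column (SQL Server style).
    CREATE TABLE DIM_AIRPORT (
        AIRPORT_ID    INT IDENTITY(1,1) PRIMARY KEY,  -- system-generated surrogate key
        AIRPORT_NAME  VARCHAR(100)                    -- natural attribute shown to users
    );

    -- Option 2: Oracle-style sequence feeding a plain numeric key column
    -- (assumes AIRPORT_ID is defined as an ordinary NUMBER column instead of IDENTITY).
    CREATE SEQUENCE SEQ_AIRPORT_ID START WITH 1 INCREMENT BY 1;
    INSERT INTO DIM_AIRPORT (AIRPORT_ID, AIRPORT_NAME)
    VALUES (SEQ_AIRPORT_ID.NEXTVAL, 'Heathrow');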

Another benefit you can get from surrogate keys (SID) is :

Tracking the SCD - Slowly Changing Dimension.

Let me give you a simple, classical example:

On the 1st of January 2002, Employee 'E1' belongs to Business Unit 'BU1' (that is what would be in your Employee Dimension). This employee has a turnover allocated to him on Business Unit 'BU1'. But on the 2nd of June the Employee 'E1' is moved from Business Unit 'BU1' to Business Unit 'BU2'. All the new turnover has to belong to the new Business Unit 'BU2', but the old turnover should belong to Business Unit 'BU1'.

If you used the natural business key 'E1' for your employee within your data warehouse, everything would be allocated to Business Unit 'BU2', even what actually belongs to 'BU1'.

If you use surrogate keys, you could create on the 2nd of June a new record for the Employee 'E1' in your Employee Dimension with a new surrogate key.

This way, in your fact table, you have your old data (before 2nd of June) with the SID of the Employee 'E1' + 'BU1.' All new data (after 2nd of June) would take the SID of the employee 'E1' + 'BU2.'

You could consider a Slowly Changing Dimension as an enlargement of your natural key: the natural key of the employee was Employee Code 'E1', but for you it becomes Employee Code + Business Unit - 'E1' + 'BU1' or 'E1' + 'BU2'. The difference with the natural key enlargement process is that you might not have every part of your new key within your fact table, so you might not be able to join on the new enlarged key; that is why you need another id. (A hedged Type 2 sketch follows below.)
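A hedged sketch of what this looks like in a Type 2 dimension; the DIM_EMPLOYEE table, the surrogate key values, and the validity-date columns are assumptions used only to illustrate the E1/BU1/BU2 example above.

    -- Employee dimension with a surrogate key and validity dates (SCD Type 2):
    -- EMP_SID  EMP_CODE  BUSINESS_UNIT  VALID_FROM   VALID_TO
    --   101    'E1'      'BU1'          2002-01-01   2002-06-01
    --   102    'E1'      'BU2'          2002-06-02   9999-12-31

    -- Close the old record and insert the new one when E1 moves to BU2.
    UPDATE DIM_EMPLOYEE
       SET VALID_TO = DATE '2002-06-01'
     WHERE EMP_CODE = 'E1' AND VALID_TO = DATE '9999-12-31';

    INSERT INTO DIM_EMPLOYEE (EMP_SID, EMP_CODE, BUSINESS_UNIT, VALID_FROM, VALID_TO)
    VALUES (102, 'E1', 'BU2', DATE '2002-06-02', DATE '9999-12-31');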

25. What is the main difference between a schema in an RDBMS and schemas in a data warehouse? RDBMS schema: * used for OLTP systems * traditional and old schema * normalized * difficult to understand and navigate * cannot easily solve extract and complex problems * poorly modeled for analysis. DWH schema: * used for OLAP systems * new-generation schema * denormalized * easy to understand and navigate * extract and complex problems can be easily solved * very good model for analysis.

26. What is Dimensional Modelling?Dimensional Modelling is a design concept used by many data warehouse designers to build their data-warehouse. In this design model all the data is stored in two types of tables - Facts table and Dimension table. Fact table contains the facts/measurements of the business and the dimension table contains the context of measurements i.e., the dimensions on which the facts are calculated.

27. What is real-time data warehousing? In real-time data warehousing, your warehouse contains completely up-to-date data and is synchronized with the source systems that provide the source data. In near-real-time data warehousing, there is a minimal delay between source data being generated and being available in the data warehouse. Therefore, if you want to achieve real-time or near-real-time updates to your data warehouse, you'll need to do three things:

1. Reduce or eliminate the time taken to get new and changed data out of your source systems. 2. Eliminate, or reduce as much as possible, the time required to cleanse, transform and load your data. 3. Reduce as much as possible the time required to update your aggregates. Starting with version 9i, and continuing with the latest 10g release, Oracle has gradually introduced features into the database to support real-time, and near-real-time, data warehousing. These features include:

Change Data Capture; external tables, table functions, pipelining, and the MERGE command; and fast refresh materialized views. (A minimal MERGE sketch follows below.)
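A minimal, generic sketch of the MERGE command mentioned above; the staging and dimension tables are hypothetical and the exact syntax varies slightly by database (Oracle-style shown).

    MERGE INTO DIM_CUSTOMER T
    USING STG_CUSTOMER S
       ON (T.CUSTOMER_NO = S.CUSTOMER_NO)
     WHEN MATCHED THEN
          UPDATE SET T.CUSTOMER_NAME = S.CUSTOMER_NAME   -- refresh changed rows
     WHEN NOT MATCHED THEN
          INSERT (CUSTOMER_NO, CUSTOMER_NAME)            -- insert brand-new rows
          VALUES (S.CUSTOMER_NO, S.CUSTOMER_NAME);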

28. What is a lookup table? When a table is used to check for the presence of some data prior to loading other data (or the same data) into another table, that table is called a LOOKUP table.

40. What are the different methods of loading dimension tables? Conventional load: before loading the data, all table constraints are checked against the data. Direct load (faster loading): all constraints are disabled and the data is loaded directly; later the data is checked against the table constraints, and bad data is not indexed.

Data Warehousing - ETL Project Life Cycle (simple to understand)
Data warehousing projects are categorized into 4 types: 1) development projects; 2) enhancement projects; 3) migration projects; 4) production support projects.
The following are the different phases involved in an ETL project development life cycle:
1) Business Requirement Collection (BRD)
2) System Requirement Collection (SRD)
3) Design phase: a) High Level Design document (HLD); b) Low Level Design document (LLD); c) mapping design
4) Code review
5) Peer review
6) Testing: a) unit testing; b) system integration testing; c) user acceptance testing