Deep Dive into ETL Implementation with SQL Server Integration Services
Anton Rozenson, [email protected]


Page 1: Deep Dive into ETL Implementation with SQL Server Integration Services

Deep Dive into ETL Implementation with SQL Server Integration Services

Anton [email protected]

Page 2

About this training event

• First event with a core focus on Microsoft BI technology
• Help the community by sharing our learning and experience based on real-world scenarios
• Network with peers and learn from you!
• These training events will be held every 2 months

Page 3

Agenda

• Importance and complexity of the ETL process

• ETL Architecture

• Changed Data Capture challenge and options

• Data Flow design and performance considerations

• SSIS project deployment

• Package execution options

• Performance monitoring in SSIS catalog

Page 4

Moving data

• “A Data Warehouse is a system that extracts, cleans, conforms, and delivers source data into a dimensional data store and then supports and implements querying and analysis for the purpose of decision making” – Ralph Kimball and Joe Caserta (2004), The Data Warehouse ETL Toolkit

• An estimated 80% of the work in building a data warehouse solution is related to ETL design and implementation

• A data warehouse is only as good as the data it contains

Page 5

Common ETL Architecture

Sources (RDBMS, Cloud, Flat File) → Staging → EDW (data warehouse)

Staging:
• Changed data
• Reference data
• Artifacts and error tables

EDW:
• Consumption ready
• De-normalized
• Clean data

Page 6

ELT Architecture

Sources (RDBMS, Cloud, Flat File) → EDW (data warehouse)

EDW:
• Changed data
• Reference data
• Artifacts and error tables
• De-normalized
• Retains a traceable business key

Page 7

CDC Challenge

We need to reliably determine modified data for incremental loading:
• New data
• Updated data
• Deleted data

Key questions:
• What changed, and when did it change?
• Are source timestamps reliable?
• Are we unable to modify source systems to include CDC attribution?

Page 8

CDC options

The source system can provide timestamps
• Reliability of the process and completeness of the information?
• Did all of yesterday’s transactions commit?
• Does the source system include workflows that can cause late arrival of data records?
• How do we determine deleted records?

An Operational Data Store (ODS) can help
• Keeps change history
• Clearly defines the state of data records
• Stores metadata

Comparing EDW data to the source system to determine differences
• A very expensive query affecting the source system
• Does the source system incorporate an archival process?

Page 9

CDC in SQL 2014

In SQL Server, Change Data Capture (CDC) offers an effective solution to the challenge of efficiently performing incremental loads from source tables to data marts and data warehouses.

The Change Data Capture process reads transaction information from the SQL Server transaction log and stores it in system tables in the cdc schema.

Data from the CDC tables can be extracted by using the table-valued functions that are generated when CDC is enabled on a table.

Key concepts:
• LSN – a binary timestamp (log sequence number) used to restrict the changed set
• All changes – the changed set includes all DML transactions
• Net changes – the changed set includes only the last DML transaction per row, based on a unique index
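The concepts above can be sketched in T-SQL. The table name dbo.Orders and its capture instance dbo_Orders are hypothetical examples, not from this deck:

```sql
-- Enable CDC at the database level (creates the cdc schema)
EXEC sys.sp_cdc_enable_db;

-- Enable CDC on a table; @supports_net_changes = 1 requires a
-- primary key or a unique index on the source table
EXEC sys.sp_cdc_enable_table
    @source_schema        = N'dbo',
    @source_name          = N'Orders',
    @role_name            = NULL,
    @supports_net_changes = 1;

-- During the incremental load, pull net changes for an LSN range
DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn('dbo_Orders');
DECLARE @to_lsn   BINARY(10) = sys.fn_cdc_get_max_lsn();

SELECT *
FROM cdc.fn_cdc_get_net_changes_dbo_Orders(@from_lsn, @to_lsn, N'all');
```

In a real incremental load, @from_lsn would come from a watermark saved by the previous run rather than from fn_cdc_get_min_lsn.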

Page 10

Demo

Page 11

Data Flow design challenges

How should the Data Flow perform? Consider the following factors when designing an ETL solution:

• The source system’s structure and its ability to execute logic such as sorting and filtering
• Parallel processing should be used with caution: the fastest way to load a table is a Fast Load with a table lock, which prevents loading data in parallel; partition switching can be an answer
• Requirements for data availability in the EDW
• Service Level Agreement (SLA)
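The partition-switching idea can be sketched as follows; the table and partition number are hypothetical, and the pattern assumes the staging table matches the target’s schema, indexes, and filegroup and carries a CHECK constraint bounding it to the target partition:

```sql
-- Load dbo.Sales_Stage with Fast Load and a table lock (it is a
-- standalone table, so the lock blocks no one), then make the data
-- visible with a metadata-only switch into the target partition:
ALTER TABLE dbo.Sales_Stage
    SWITCH TO dbo.Sales PARTITION 3;
```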

Page 12

Data Flow blocking tasks

Data Flow transformations in SSIS use memory buffers in different ways. The way a transformation uses memory can dramatically impact the performance of your package. Transformation buffer usage can be classified into three categories: non-blocking, partially blocking, and (fully) blocking.

Non Blocking transformations: Audit, Character Map, Conditional Split, Copy Column, Data Conversion, Derived Column, Import Column, Lookup, Multicast, Percentage sampling, Row count, Row sampling, Script component

Partially Blocking transformations: Data mining, Merge, Merge Join, Pivot/Unpivot, Term Extraction, Term Lookup, Union All

Blocking transformations: Aggregate, Fuzzy Grouping, Fuzzy Lookup, Sort

Page 13

Work around blocking transformations in Data Flow

It is not always possible to avoid using blocking or partially blocking transformations, but in some cases it is possible.

For example, the Merge transformation requires its inputs to be sorted. While the Sort transformation is expensive, in some cases sorting can be handled in the source query. Make sure to set the IsSorted property of the source output to True and assign the proper SortKeyPosition to the output columns.

Another example is the Aggregate transformation: a Script component can perform the aggregation of the data and return the result to a variable.
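The sorted-source technique can be sketched with a source query (hypothetical table and column names); because the ORDER BY matches the merge key, the Sort transformation can be removed:

```sql
-- Source query for an OLE DB Source feeding a Merge (Join):
-- sort on the join key in the database engine instead of in SSIS.
-- In the source's Advanced Editor, set IsSorted = True on the
-- output and SortKeyPosition = 1 on CustomerID.
SELECT CustomerID, OrderDate, Amount
FROM dbo.Orders
ORDER BY CustomerID;
```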

Page 14

Demo

Page 15

Project Deployment model

The project deployment model provides the following features:

• Parameters can be used in expressions or tasks. Parameters can reference an environment variable. Environment variable values are resolved at the time of package execution.

• An environment is a container of variables that can be referenced by Integration Services projects. Environments allow you to organize the values that you assign to a package. For example, you might have environments named "Dev", "Test", and "Production".

• SSISDB catalog allows you to use folders to organize your projects and environments.

• Catalog stored procedures and views can be used to manage Integration Services objects in the catalog.
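Folders and environments can be created with the catalog stored procedures; the names used here (ETL, Production, SourceServer) are hypothetical:

```sql
-- Run in the SSISDB database
EXEC catalog.create_folder
    @folder_name = N'ETL';

EXEC catalog.create_environment
    @folder_name      = N'ETL',
    @environment_name = N'Production';

-- An environment variable that a project parameter can reference
EXEC catalog.create_environment_variable
    @folder_name      = N'ETL',
    @environment_name = N'Production',
    @variable_name    = N'SourceServer',
    @data_type        = N'String',
    @sensitive        = 0,
    @value            = N'SQLPROD01',
    @description      = N'Source server for the Production environment';
```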

Page 16

Package Execution

An execution is an instance of a package execution.

Package execution can be scheduled via SQL Agent job. SQL Agent provides an easy to use interface for mapping of Project parameters to environment variables.

Packages can also be executed via the Execute Package Task from another SSIS package. This allows the creation of a robust workflow incorporated into a master package.

Page 17

Package Execution

The SSIS catalog allows package execution to be controlled programmatically from T-SQL. A number of stored procedures are provided to manage package execution.

catalog.create_execution creates an instance of package execution and assigns Execution_ID.

catalog.set_execution_parameter_value assigns parameters to the instance of package execution. Execution parameters control Logging Level, Dump settings, Synchronized execution option as well as ability to assign values to Project or Package scoped parameters.

catalog.start_execution starts an instance of execution.
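Putting the three procedures together (the folder, project, and package names are hypothetical):

```sql
DECLARE @execution_id BIGINT;

-- Create an execution instance for a deployed package
EXEC catalog.create_execution
    @folder_name  = N'ETL',
    @project_name = N'WarehouseLoad',
    @package_name = N'Master.dtsx',
    @execution_id = @execution_id OUTPUT;

-- object_type = 50 targets system parameters; LOGGING_LEVEL 1 = Basic
EXEC catalog.set_execution_parameter_value
    @execution_id,
    @object_type     = 50,
    @parameter_name  = N'LOGGING_LEVEL',
    @parameter_value = 1;

-- Start the execution (asynchronous unless SYNCHRONIZED is set)
EXEC catalog.start_execution @execution_id;
```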

Page 18

Execution Monitoring

The catalog provides a set of standard reports that give administrators easy access to execution performance and statistics.

For details about executions, validations, messages that are logged during operations, and contextual information related to errors, query these views.

• catalog.executions – list of executions, including environment data
• catalog.execution_data_statistics – data flow performance information
• catalog.execution_parameter_values – list of run-time parameter values
• catalog.event_messages – messages that were logged during executions
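For example, the most recent failed executions and their error messages can be pulled from these views (a sketch; in catalog.executions, status 4 means failed, and in catalog.event_messages, message_type 120 means error):

```sql
SELECT TOP (10)
    e.execution_id,
    e.package_name,
    e.start_time,
    m.message
FROM catalog.executions AS e
JOIN catalog.event_messages AS m
    ON m.operation_id = e.execution_id
WHERE e.status = 4
  AND m.message_type = 120
ORDER BY e.start_time DESC;
```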

Page 19

Demo

Page 20

Questions

Page 21

Additional Resources

What's New in SQL Server 2014
http://msdn.microsoft.com/en-us/library/bb500435.aspx

SSIS Catalog
http://msdn.microsoft.com/en-us/library/hh479588.aspx

Deployment of Projects and Packages
http://msdn.microsoft.com/en-us/library/hh213290(v=sql.120).aspx

Change Data Capture
http://technet.microsoft.com/en-us/library/bb522489(v=sql.105).aspx

Change Data Capture (SSIS)
http://msdn.microsoft.com/en-us/library/bb895315.aspx

CDC Flow Components
http://msdn.microsoft.com/en-us/library/hh231087(v=sql.120).aspx

Enable and Disable Change Data Capture (SQL Server)
http://msdn.microsoft.com/en-us/library/cc627369.aspx

SQL Server OLE DB Deprecation and Integration Services
http://blogs.msdn.com/b/mattm/archive/2012/01/09/sql-server-ole-db-deprecation-and-integration-services.aspx

OData source setup
http://www.microsoft.com/en-us/download/details.aspx?id=42280

OData samples
http://services.odata.org/

Page 22

Contact Us

www.gnetgroup.com

Neelesh Raheja
VP, Consulting Services
[email protected]
@PracticalBI

Anton Rozenson
BI Solution Architect
[email protected]

blog.gnetgroup.com

linkedin.com/company/143712

facebook.com/gnetgroup

twitter.com/GnetGroup

youtube.com/user/GNetGroup