Introduction to Teradata

Introduction to the Teradata Database

Course Modules

This course consists of:

Module 1: Teradata Database Overview
Module 2: Relational Database Concepts
Module 3: Teradata and the Data Warehouse
Module 4: Components and Architecture
Module 5: Databases and Users
Module 6: Data Distribution and Access
Module 7: Secondary Indexes and Full-Table Scans
Module 8: Fault Tolerance and Data Protection
Module 9: Client Tools and Utilities

Module 1

Teradata Database Overview

What is the Teradata Database?

• Relational Database Management System
• Built on a Parallel Architecture
• Runs on MP-RAS UNIX, Microsoft Windows 2000/2003 Server, and SuSE Linux

Teradata Parallel Architecture

• More warehouse data
• Linear Scalability (10GB to 100+TB)
• Hashing provides for automatic data distribution
• Parallel-Aware Optimizer
• Single, Administrative View
• Ad hoc queries with ANSI-standard SQL

Teradata Database Advantages

• Proven Linear Scalability - increased workload without decreased throughput
• Most Concurrent Users - multiple complex queries
• Unconditional Parallelism - sorts, aggregations and full-table scans are performed in parallel
• Mature Optimizer - robust and parallel aware, handles complex queries, multiple joins per query, ad hoc processing
• Low TCO - ease of setup and maintenance, robust parallel utilities, no re-orgs, automatic data distribution, low disk-to-data ratio, robust expansion utility
• High Availability - no single point of failure, fault-tolerant architecture
• Single View of the Business - single database server for multiple clients

Teradata Database Manageability

Things Teradata Database Administrators never have to do!

• Reorganize data or index space
• Pre-allocate table or index space
• Physically format partitions or disk space
• Pre-prepare data for loading (convert, sort, split, etc.)
• Ensure that queries run in parallel
• Unload/reload data spaces due to expansion

** The Administrator knows that if the data is to be doubled, the system can be easily expanded to accommodate it.
** The amount of work required to create a table which will contain 100 rows is the same as that to create a table which will contain 1,000,000,000 rows.

Teradata Database Features

• Designed to process large quantities of detail data
• Ideal for data warehouse applications
• Parallelism makes easy access to very large tables possible
• Open architecture - uses industry standard components
• Performance increase is linear as components are added
• Runs as a database server to client applications
• Runs on multiple hardware platforms (SMP) and Teradata hardware (MPP)

Module 2

Relational Database Concepts

What is a Database?

A database is a collection of permanently stored data that is:

– Logically related - the data relates to other data.
– Shared - many users may access the data.
– Protected - access to data is controlled.
– Managed - the data has integrity and value.

Logical/Relational Modeling

The Logical Model
• Should be designed without regard to usage
 – Accommodates a wide variety of front end tools
 – Allows the database to be created more quickly
• Should be the same regardless of data volume
• Data is organized according to what it represents (real world business data in table (relational) form)
• Includes all the data definitions within the scope of the application or enterprise
• Is generic - the logical model is the template for physical implementation on any RDBMS platform

Normalization
• Process of reducing a complex data structure into a simple, stable one
• Involves removing redundant attributes, keys, and relationships from the conceptual data model

Relational Databases

The employee table has:
 – Nine columns of data
 – Six rows of data - one per employee
 – Only one row format for the entire table
 – Missing data values represented by nulls
 – Column and row order are arbitrary

Relational Databases are founded on Set Theory and based on the Relational Model.
• A Relational Database consists of a collection of logically related tables.
• A table is a two dimensional representation of data consisting of rows and columns.

Primary Keys

Foreign Keys

Answering Questions with a Relational Database

Relational Advantages

Advantages of a Relational Database compared to other database methodologies include:
• More flexible than other types
• Allows businesses to quickly respond to changing conditions
• Being data-driven vs. application driven
• Modeling the business, not the processes
• Makes applications easier to build because the data does more of the work
• Supporting trend toward end-user computing
• Being easy to understand
• No need to know the access path
• Solidly founded in set theory

Module 3

Teradata and the Data Warehouse

Evolution of Data Processing

A transaction is a logical unit of work

The Advantage of Using Detail Data

Data Warehouse Usage Evolution

Active Data Warehousing

• Performance - response time within seconds
• Scalability
 – large amounts of detailed data
 – mixed workloads (both tactical and strategic queries) for mission critical applications
 – concurrent users
• Availability and Reliability - 7 x 24
• Data Freshness - accurate, up to the minute data, including access to operational data store level information

The Data Warehouse

A central, enterprise-wide database that contains information extracted from operational systems.

• Based on enterprise-wide model
• Can begin small but may grow large rapidly
• Populated by extraction/loading of data from operational systems
• Responds to end-user "what if" queries
• Minimizes data movement/synchronization
• Provides a Single View of the business

Data Marts

A data mart is a special purpose subset of enterprise data for a particular function or application. It may contain detail or summary data or both.

Data mart types:
 – Independent - created directly from operational systems to a separate physical data store
 – Logical - exists as a subset of existing data warehouse via Views
 – Dependent - created from data warehouse to a separate physical data store

Module 4

Components and Architecture

What is a Node?

• Teradata software, gateway software and channel-driver software run as processes
• Parsing Engines (PE) and Access Module Processors (AMP) are Virtual Processors (VPROC) which run under control of Parallel Database Extensions (PDE)
• Each AMP is associated with a Virtual Disk (VDISK)
• A single node is called a Symmetric Multi-Processor (SMP)
• All AMPs and PEs communicate via the BYNET

Major Components of the Teradata Database

The Parsing Engine (PE)

The Parsing Engine is responsible for:
• Managing individual sessions (up to 120 sessions per PE)
• Parsing and optimizing your SQL requests
• Building query plans with the parallel-aware, cost-based, intelligent Optimizer
• Dispatching the optimized plan to the AMPs
• EBCDIC/ASCII input conversion (if necessary)
• Sending the answer set response back to the requesting client

The BYNET

Dual redundant, fault-tolerant, bi-directional interconnect network that enables:
• Automatic load balancing of message traffic
• Automatic reconfiguration after fault detection
• Scalable bandwidth as nodes are added

The BYNET connects and communicates with all the AMPs on the system:
• Between nodes, the BYNET hardware carries broadcast and point-to-point communications
• On a node, BYNET software and PDE together control which AMPs receive a

The Access Module Processor (AMP)

The AMP is responsible for:
• Storing rows to and retrieving rows from its VDISK
• Lock management
• Sorting rows and aggregating columns
• Join processing
• Output conversion and formatting (ASCII, EBCDIC)
• Creating answer sets for clients
• Disk space management and accounting
• Special utility protocols
• Recovery processing

The MPP System

• The BYNET (both software and hardware) connects two or more SMP Nodes to create a Massively Parallel Processing (MPP) system.
• The Teradata Database is linearly expandable by adding nodes.

Teradata Database Software

Channel-Attached Client Software

CLI (Call-Level Interface)
• Request and response control
• Buffer allocation and initialization
• Lowest level interface to the Teradata Database
• Library of routines for blocking/unblocking requests and responses to/from RDBMS
• Performs logon and logoff functions

TDP (Teradata Director Program)
• Manages session traffic between CLI and the Teradata Database
• Session balancing across multiple PEs
• Failure notification (application failure, Teradata Database restart)
• Logging, verification, recovery, restart, security

Network-Attached Client Software

ODBC
• Call-level interface
• Teradata Database ODBC driver is used to connect applications with the Teradata Database

MTDP (Micro Teradata Director Program)
• Performs many TDP functions including session management but not session balancing across PEs

MOSI (Micro Operating System Interface)
• Provides operating system and network protocol independent interface.

Module 5

Databases and Users

Databases and Users Defined

• Databases and Users are the repositories for objects:
 – Tables - require Perm Space
 – Views - do not require Perm Space
 – Macros - do not require Perm Space
 – Triggers - do not require Perm Space
 – Stored Procedures - require Perm Space
• Space limits are specified for each database and for each user:
 – Perm Space - maximum amount of space available for permanent tables
 – Spool Space - maximum amount of work space available for request processing
 – Temp Space - maximum amount of space available for global temporary tables
• A database is created with the CREATE DATABASE command.
• A user is created with the CREATE USER command.
• The only difference between a database and a user is the user has a password and may logon to the system.
• A database or user with no perm space may not contain permanent tables but may contain views and macros.

Teradata Database Space Management

• A new database or user must be created from an existing database or user.
• All Perm Space limits are subtracted from the owner.
• Perm Space is a zero-sum game - the total of all Perm Space limits must equal the total amount of disk space available.
• Perm Space currently not being used is available for Spool Space or Temp Space.
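
The zero-sum rule can be illustrated with a small sketch. This is a toy model, not Teradata's implementation: the Space class, the object names (Root, Sales, Tom) and the byte counts are all hypothetical.

```python
# Toy model of Perm Space ownership. Class and object names are hypothetical.
class Space:
    def __init__(self, name, perm):
        self.name = name
        self.perm = perm          # Perm Space limit still held by this object
        self.children = []

    def create_child(self, name, perm):
        """Create a database or user from this owner."""
        if perm > self.perm:
            raise ValueError(f"{self.name} holds only {self.perm} bytes of Perm Space")
        self.perm -= perm         # the child's limit is subtracted from the owner
        child = Space(name, perm)
        self.children.append(child)
        return child

    def total(self):
        """Sum of Perm Space limits over this object and all its descendants."""
        return self.perm + sum(c.total() for c in self.children)


root = Space("Root", 1_000)               # stand-in for total disk space
sales = root.create_child("Sales", 400)
tom = sales.create_child("Tom", 100)
print(root.perm, sales.perm, tom.perm)    # 600 300 100
print(root.total())                       # 1000
```

However the hierarchy is carved up, the limits always add back to the original total, which is what "zero-sum" means here.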

Module 6

Data Distribution and Access

How Does the Teradata Database Distribute Rows?

 – The Teradata Database uses a hashing algorithm to randomly distribute table rows across the AMPs.
 – The Primary Index choice determines whether the rows of a table will be evenly or unevenly distributed across the AMPs.
 – Evenly distributed table rows result in evenly distributed workloads.
 – Each AMP is responsible for its subset of the rows of each table.
 – The rows are not placed in any particular order.

The benefits of unordered rows include:
 – No maintenance needed to preserve order.
 – The order is independent of any query being submitted.

The benefits of hashed distribution include:
 – The distribution is the same regardless of data volume.
 – The distribution is based on row content, not data demographics.
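
A minimal sketch of hashed distribution follows. It rests on assumptions: MD5 and the modulo step are only stand-ins for Teradata's own row-hash algorithm and hash-to-AMP mapping, which this course does not describe, and the 4-AMP system size and amp_for name are hypothetical.

```python
import hashlib
from collections import Counter

NUM_AMPS = 4  # assumed system size


def amp_for(pi_value, num_amps=NUM_AMPS):
    """Map a Primary Index value to an AMP number.

    MD5 and the modulo are stand-ins, not Teradata's actual algorithm.
    """
    digest = hashlib.md5(str(pi_value).encode("utf-8")).digest()
    row_hash = int.from_bytes(digest[:4], "big")   # 32-bit stand-in row hash
    return row_hash % num_amps


order_numbers = range(1, 10_001)                   # unique PI values
rows_per_amp = Counter(amp_for(n) for n in order_numbers)
for amp in sorted(rows_per_amp):
    print(f"AMP {amp}: {rows_per_amp[amp]} rows")
```

The same value always hashes to the same AMP, so a row can later be located from its Primary Index value alone, and the split stays roughly even whether the table holds ten thousand rows or ten million.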

Primary Key (PK) vs. Primary Index (PI)

• The PK is a relational modeling convention which uniquely identifies each row.
• The PI is a Teradata convention which determines row distribution and access.
• A well designed database will have tables where the PI is the same as the PK as well as tables where the PI is defined on columns different from the PK.
• Join performance and known access paths might dictate a PI that is different from the PK.

Primary Indexes

• The physical mechanism used to assign a row to an AMP
• A table must have a Primary Index
• The Primary Index cannot be changed

UPI
• If the index choice of column(s) is unique, we call this a UPI (Unique Primary Index).
• A UPI choice will result in even distribution of the rows of the table across all AMPs.
• Reasons to choose a UPI: UPIs guarantee even data distribution, eliminate duplicate row checking, and are always a one-AMP operation.

NUPI
• If the index choice of column(s) isn't unique, we call this a NUPI (Non-Unique Primary Index).
• How evenly a NUPI distributes the rows of the table depends on the degree of uniqueness of the index.
• NUPIs can cause skewed data.

Why would you choose an Index that is different from the Primary Key?
1. Join performance
2. Known access paths

Defining the Primary Index

• The Primary Index (PI) is defined at table creation.
• Every table must have one Primary Index.
• The Primary Index may consist of 1 to 64 columns.
• The Primary Index of a table may not be changed.
• The Primary Index is the mechanism used to assign a row to an AMP.
• The Primary Index may be Unique (UPI) or Non-Unique (NUPI).
• Unique Primary Indexes result in even row distribution and eliminate duplicate row checking.
• Non-Unique Primary Indexes distribute rows less evenly as the number of duplicate values grows. This may cause skewed distribution.

Row Distribution via Hashing

Unique Primary Index (UPI) Access

Non-Unique Primary Index (NUPI) Access

UPI Row Distribution

• Order_Number values are unique (UPI).
• The rows will distribute evenly across the AMPs.

NUPI Row Distribution

• Customer_Number values are non-unique (NUPI).
• Rows with the same PI value distribute to the same AMP causing skewed row distribution.

Highly Non-Unique NUPI Row Distribution

• Order_Status values are highly non-unique (NUPI).

• Only two values exist. The rows will be distributed to two AMPs.

• This table will not perform well in parallel operations.

• Highly non-unique columns are poor PI choices.

• The degree of uniqueness is critical to efficiency.
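
The effect of uniqueness on skew can be measured with a short sketch. As before, this is an illustration under assumptions: MD5 modulo the AMP count stands in for the real hashing, and the 8-AMP system, the status codes "O" and "C", and the skew measure (busiest AMP divided by the average) are all hypothetical choices.

```python
import hashlib
from collections import Counter

NUM_AMPS = 8  # assumed system size


def amp_for(pi_value):
    # Stand-in for the real row hash and hash-to-AMP mapping.
    digest = hashlib.md5(str(pi_value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_AMPS


def skew(pi_values):
    """Rows on the busiest AMP divided by the per-AMP average (1.0 is ideal)."""
    counts = Counter(amp_for(v) for v in pi_values)
    return max(counts.values()) / (len(pi_values) / NUM_AMPS)


order_numbers = list(range(1, 1001))                            # unique values
order_status = ["C" if n % 5 == 0 else "O" for n in order_numbers]  # two values only

print("Order_Number skew:", round(skew(order_numbers), 2))
print("Order_Status skew:", round(skew(order_status), 2))
print("AMPs holding Order_Status rows:", len({amp_for(s) for s in order_status}))
```

With only two distinct values, at most two of the eight AMPs hold any rows, so the other AMPs sit idle during a parallel scan of this table.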

Partitioned Primary Index (PPI)

The Orders table defined with a Non-Partitioned Primary Index (NPPI) on Order_Number (O_#)

Partitioned Primary Indexes:
• Improve performance on range constraint queries
• Use partition elimination to reduce the number of rows accessed

The Orders table defined with a Primary Index on Order_Number (O_#) Partitioned By Order_Date (O_Date) (PPI)
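
Partition elimination can be sketched as follows. The example models one AMP's share of a hypothetical Orders table, partitioned by the month of Order_Date; the data, the monthly granularity and the range_query helper are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date

# One AMP's share of a hypothetical Orders table: (order_number, order_date).
rows = [(n, date(2003, 1 + n % 12, 1 + n % 28)) for n in range(1, 1201)]

# PPI: on each AMP, rows are grouped into partitions (here, by month).
partitions = defaultdict(list)
for order_number, order_date in rows:
    partitions[(order_date.year, order_date.month)].append((order_number, order_date))


def range_query(start, end):
    """Return the matching rows and the number of rows actually read."""
    wanted = [key for key in partitions
              if (start.year, start.month) <= key <= (end.year, end.month)]
    hits, rows_read = [], 0
    for key in wanted:                 # partitions outside the range are never read
        for row in partitions[key]:
            rows_read += 1
            if start <= row[1] <= end:
                hits.append(row)
    return hits, rows_read


hits, rows_read = range_query(date(2003, 3, 1), date(2003, 3, 31))
print(len(hits), rows_read, len(rows))   # 100 100 1200
```

A query for March reads 100 rows instead of 1,200. Without partitioning, the same range constraint would need every row on the AMP to be examined.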

Module 7

Secondary Indexes and Full-Table Scans

Secondary Indexes

• A secondary index is an alternate path to the rows of a table.
• A table may have from 0 to 32 secondary indexes.
• A secondary index:
 – does not affect table row distribution.
 – is chosen to improve access performance.
 – may reference from 1 to 64 table columns.
 – may be defined at table creation.
 – may be defined after the table is created.
 – may be dropped at any time.
 – uses a sub-table which utilizes Perm Space.
 – may impact table maintenance performance (row inserts, row updates and/or row deletes).

Defining a Secondary Index

Unique Secondary Index (USI)
• A Unique Secondary Index requires unique column values in each row.
• Access to a referenced value requires 2 AMPs (serial operation) and returns 0 or 1 rows.
• SQL to create:
 CREATE UNIQUE INDEX (social_security) ON Employee;

Non-Unique Secondary Index (NUSI)
• A Non-Unique Secondary Index (NUSI) allows duplicate column values in the rows.
• Access to a referenced value requires all AMPs (parallel operation) and returns 0 to n rows.
• SQL to create:
 CREATE INDEX (last_name) ON Employee;
 CREATE INDEX (last_name, first_name) ON Employee;

Other Types of Secondary Indexes

• Join Index
 – Define a pre-join table on frequently joined columns (with optional aggregation) without denormalizing the database.
 – Create a full or partial replication of a base table with a PI on a FK column to facilitate joins of large tables by hashing their rows to the same AMP.
 – Define a summary table without denormalizing the database.
 – Can be defined on one or several tables.
• Sparse Index
 – Any join index, whether simple or aggregate, multi-table or single-table, can be sparse.
 – Uses a constant expression in the WHERE clause of its definition to narrowly filter its row population.
• Hash Index
 – Used for the same purposes as single-table join indexes.
 – Create a full or partial replication of a base table with a PI on a FK column to facilitate joins of large tables by hashing them to the same AMP.
 – Can be defined on one table only.
• Value-Ordered NUSI
 – Very efficient for range conditions and conditions with an inequality on the secondary index column set.

Primary Index vs. Secondary Index

Full-Table Scans

• Every data block of the table is read once.
• All AMPs scan their portion of the table in parallel.
• The Primary Index choice will affect parallel scan performance (UPI is even; NUPI is potentially skewed).
• Full-table scans typically occur when:
 – the index columns are not used in the query
 – a non-equality or range test is specified for the index columns

Module 8

Fault Tolerance and Data Protection

Locks

4 Types of Locks
• Exclusive - prevents any other type of concurrent access
• Write - prevents other Read, Write, Exclusive locks
• Read - prevents Write and Exclusive locks
• Access - prevents Exclusive locks only

3 Levels of Locks
• Database - applies to all tables/views in the database
• Table/View - applies to all rows in the table/view
• Row Hash - applies to all rows with same row hash

Lock requests are based on the SQL request:
• SELECT - requests a Read lock
• UPDATE - requests a Write lock
• CREATE TABLE - requests an Exclusive lock

Lock requests may be upgraded or downgraded:
• LOCKING TABLE Table1 FOR ACCESS . . .
• LOCKING TABLE Table1 FOR EXCLUSIVE . . .
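
The four lock types above amount to a compatibility table, which can be written down directly. The sketch encodes only what the list states; the BLOCKS, compatible and can_grant names are hypothetical, and the queueing of blocked requests is not modeled.

```python
# For each held lock type, the lock types it prevents (taken from the list above).
BLOCKS = {
    "EXCLUSIVE": {"ACCESS", "READ", "WRITE", "EXCLUSIVE"},
    "WRITE": {"READ", "WRITE", "EXCLUSIVE"},
    "READ": {"WRITE", "EXCLUSIVE"},
    "ACCESS": {"EXCLUSIVE"},
}


def compatible(held, requested):
    """True if a lock of type `requested` can coexist with a held lock."""
    return requested not in BLOCKS[held]


def can_grant(held_locks, requested):
    """True if `requested` is compatible with every lock already held."""
    return all(compatible(h, requested) for h in held_locks)


print(can_grant(["READ", "READ"], "READ"))   # True: readers share
print(can_grant(["WRITE"], "ACCESS"))        # True: an Access lock reads through a Write lock
print(can_grant(["READ"], "WRITE"))          # False: the writer must wait
```

The second case is why LOCKING TABLE ... FOR ACCESS is useful for queries that can tolerate reading data while it is being updated.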

Transient Journal

• Maintains a copy on each AMP of before images of all rows affected.
• Provides rollback of changed rows in the event of TXN failure.
• Activities are automatic and transparent to user.
• Before images are reapplied to table if TXN fails.
• Before images are discarded upon TXN completion.

Successful TXN
BEGIN TRANSACTION
UPDATE Row A - Before image Row A recorded (Add $100 to checking)
UPDATE Row B - Before image Row B recorded (Subtract $100 from savings)
END TRANSACTION - Discard before images

Failed TXN
BEGIN TRANSACTION
UPDATE Row A - Before image Row A recorded
UPDATE Row B - Before image Row B recorded
(Failure occurs)
(Rollback occurs) - Reapply before images
(Terminate TXN) - Discard before images
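
The failed-transaction sequence above can be sketched in a few lines. This is a toy, single-AMP model: the TransientJournal class, the dictionary standing in for a table, and the account balances are hypothetical.

```python
class TransientJournal:
    """Toy before-image journal for one transaction on one AMP."""

    def __init__(self, table):
        self.table = table
        self.before_images = []          # (key, old_value) in order of change

    def update(self, key, new_value):
        self.before_images.append((key, self.table[key]))  # record before image first
        self.table[key] = new_value

    def commit(self):
        self.before_images.clear()       # END TRANSACTION: discard before images

    def rollback(self):
        while self.before_images:        # reapply before images, newest first
            key, old_value = self.before_images.pop()
            self.table[key] = old_value


accounts = {"checking": 500, "savings": 900}
txn = TransientJournal(accounts)
txn.update("checking", accounts["checking"] + 100)   # UPDATE Row A
txn.update("savings", accounts["savings"] - 100)     # UPDATE Row B
txn.rollback()                                       # failure before END TRANSACTION
print(accounts)   # {'checking': 500, 'savings': 900}
```

Had commit() been called instead, the new balances would stand and the journal would simply be emptied.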

RAID Protection

RAID 1 (Mirroring)
• Each physical disk in the array has an exact copy in the same array.
• The array controller can read from either disk and write to both.
• When one disk of the pair fails, there is no change in performance.
• Mirroring reduces available disk space by 50%.
• Array controller reconstructs failed disks quickly.

RAID 5 (Parity)
• Data and parity striped across rank of 4 disks.
• If a disk fails, any missing block may be reconstructed using the other three disks.
• Parity reduces available disk space by 25% in a 4-disk rank.
• Reconstruction of failed disks takes longer than RAID 1.

Summary:
RAID-1 - Good performance with disk failures; higher cost in terms of disk space
RAID-5 - Reduced performance with disk failures; lower cost in terms of disk space
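
Parity reconstruction in a 4-disk rank comes down to XOR arithmetic, as the sketch below shows. The block contents and the xor_blocks helper are hypothetical, and the sketch covers a single stripe only.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data = [b"AAAA", b"BBBB", b"CCCC"]    # three data blocks in one stripe
parity = xor_blocks(data)             # parity block, held on the fourth disk

# The disk holding data[1] fails: rebuild its block from the other three.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])             # True
```

One block in four holds parity rather than data, which is the 25% space cost quoted above. Rebuilding needs a read from every surviving disk, which is why reconstruction is slower than copying from a RAID 1 mirror.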

Fallback

A Fallback table is fully available in the event of an unavailable AMP.
A Fallback row is a copy of a primary row stored on a different AMP in the same CLUSTER of AMPs.

Benefits of Fallback:
• May be specified at the table or database level
• Permits access to table data during AMP off-line period
• Adds a level of data protection beyond disk array RAID 1 & 5
• Highest level of data protection is RAID 1 and Fallback
• Automatically restores data changed during AMP off-line
• Critical for high availability applications

Costs of Fallback:
• Twice the disk space for table storage is needed
• Twice the I/O for INSERTs, UPDATEs and DELETEs is needed
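
A small sketch shows why a fallback table stays fully readable with one AMP down. The 4-AMP cluster, the row IDs, and the choice of fallback AMP (simply the next AMP in the cluster) are assumptions made for illustration. The course states only that the copy lives on a different AMP in the same cluster.

```python
cluster = [0, 1, 2, 3]                      # AMP numbers in one hypothetical cluster


def fallback_amp(primary_amp):
    """Pick a different AMP in the same cluster (next-in-line is an assumption)."""
    i = cluster.index(primary_amp)
    return cluster[(i + 1) % len(cluster)]


# Eight rows, each with a primary AMP and a fallback copy elsewhere in the cluster.
primary_of = {row_id: cluster[row_id % len(cluster)] for row_id in range(8)}
storage = {amp: {"primary": set(), "fallback": set()} for amp in cluster}
for row_id, amp in primary_of.items():
    storage[amp]["primary"].add(row_id)
    storage[fallback_amp(amp)]["fallback"].add(row_id)


def readable_rows(down_amp):
    """Rows still reachable when one AMP in the cluster is off-line."""
    up = [a for a in cluster if a != down_amp]
    return set().union(*(storage[a]["primary"] | storage[a]["fallback"] for a in up))


print(readable_rows(down_amp=2) == set(primary_of))   # True
copies = sum(len(s["primary"]) + len(s["fallback"]) for s in storage.values())
print(copies)                                         # 16 copies for 8 rows
```

The second figure is the cost side of the trade: every row is stored twice, so storage and write I/O both double.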

Recovery Journal for Down AMPs

Recovery Journal is:
• Automatically activated when an AMP is taken off-line
• Maintained by other AMPs in the cluster
• Totally transparent to users of the system

While AMP is off-line:
• Journal is active
• Table updates continue as normal
• Journal logs Row-IDs of changed rows for down-AMP

When AMP is back on-line:
• Restores rows on recovered AMP to current status
• Journal discarded when recovery complete

Cliques

• A clique is a defined set of nodes with failover capability.
• All nodes in a clique are able to access the vdisks of all AMPs in the clique.
• If a node fails, its vprocs will migrate to the remaining nodes in the clique.
• Each node can support 128 Vprocs

Archiving and Recovering Data

Archive Recovery Utility (ARC)
• Runs on IBM, UNIX, Linux and Win2K
• Archives data from RDBMS
• Restores data from archive media
• Permits data recovery to a specified checkpoint

Other Archive Applications
• BakBone NetVault
• Symantec NetBackup

Common uses of ARC
• Dump database objects for backup or disaster recovery.
• Restore non-fallback tables after disk failure.
• Restore tables after corruption from failed batch processes.
• Recover accidentally dropped tables, views, or macros.
• Recover from miscellaneous user errors.
• Copy a table and restore it to another Teradata Database.

Module 9

Client Tools and Utilities

Query Tools - BTEQ

SQL front-end to Teradata Database and other ODBC compliant databases

Query Tools – Teradata SQL Assistant

Fast Load Utility

• Fast batch utility for loading a single empty table
• Automatic checkpoint/restart capability
• Errors reported and collected in error tables
• Supports INMOD routines and Access Modules
• Loads data in two phases

MultiLoad Utility

• Loads/maintains up to five empty or populated tables
• Performs block level operations against target tables
• Affected data blocks are written once
• Multiple operations with one pass of input files
• Uses conditional logic to apply updates
• Supports INSERT, UPDATE, DELETE and UPSERT operations
• Supports INMOD routines and Access Modules
• Errors reported and collected in error tables
• Provides automatic checkpoint/restart capability

FastExport Utility

• Exports large volumes of formatted data from one or more tables on the Teradata Database to a host file or user-written application
• Supports multiple sessions
• Export from multiple tables
• Provides automatic checkpoint/restart capability

TPump Utility

• Allows near real-time updates from transactional systems into the warehouse
• Allows constant loading of data into a table
• Performs INSERT, UPDATE, DELETE, and ATOMIC UPSERT operations, or a combination, to more than 60 tables at a time
• High-volume SQL-based continuous update of multiple tables
• Allows target tables to:
 – Have secondary indexes, referential integrity, constraints and enabled triggers
 – Be MULTISET or SET
 – Be populated or empty
• Allows conditional processing
• Supports automatic restarts
• No session limit - use as many sessions as necessary
• No limit to the number of concurrent instances
• Uses row-hash locks, allowing concurrent updates on the same table
• Can be stopped at any time with work committed with no ill effect
• Designed for highest possible throughput
• Gives users the control over the rate per minute (throttle) at which statements are sent to the database either dynamically or by script.

Teradata Parallel Transporter

• Parallel Extract, Transform and Load (end-to-end parallelism) eliminates sequential bottlenecks
• Data Streams eliminate the overhead of persistent storage
• Single SQL-like scripting language
• Access to various data sources
• Open API enables Third Party and user application integration

Teradata Parallel Transporter Operators

Teradata Manager

Graphical system management tool - Collects, analyzes, and displays:
 – Performance information

Teradata Dynamic Workload Manager

Query workload management tool (formerly Teradata Dynamic Query Manager) that:
• Restricts (i.e. runs, suspends, schedules later or rejects) query based on set thresholds
• Based on analysis control:
 - Too long
 - Too many rows
• Based on object control:
 - User ID
 - Table
 - Day/time
 - Group ID
• Based on environmental factors:
 - CPU
 - Disk utilization
 - Network activity
 - Number of users
• Logs workload performance for analysis

Analyst Tools – Teradata Visual Explain

– Provides the ability to capture and graphically represent the steps of a query plan and perform comparisons of two or more plans
– Stores query plans in a Query Capture Database (QCD)

Analyst Tools – Teradata System Emulation Tool

– Emulates a target system by exporting and importing all information necessary to emulate in a test environment
– Use with Target Level Emulation to generate query plans on a test system as if they were run on the target system
– Verifies queries and reproduces optimizer related issues in a test environment

Analyst Tools – Teradata Index Wizard

Recommends secondary indexes for tables, based on a particular workload

Analyst Tools – Teradata Statistics Wizard

– Recommends and automates the Statistics Collection process
– Recommends Statistics to be re-collected due to table growth

Thank You
