Greenplum SQL & Performance Tuning

Course 3 GP Advanced SQL and Tuning


Page 1: Course 3 GP Advanced SQL and Tuning

Greenplum SQL & Performance Tuning

Page 2: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Course Introduction

Page 3: Course 3 GP Advanced SQL and Tuning


Objectives

• Review Greenplum “Shared Nothing” Architecture & Concepts

• By implementing a Greenplum Data Mart, students will:
  - Create a Greenplum database and schemas
  - Evaluate logical and physical data models
  - Determine appropriate distribution keys and create tables
  - Determine and implement partitioning and indexing strategies
  - Determine and implement optimal load methodologies

• By fulfilling reporting requirements, students will:
  - Be able to “explain the explain plan”
  - Create and use statistics
  - Create detail and summary reports using OLAP functions
  - Create and use temporary tables
  - Write basic PostgreSQL functions and create data types
  - Identify and optimize problem queries

Page 4: Course 3 GP Advanced SQL and Tuning


Prerequisites

• Greenplum Fundamentals Training Course

• Experience with Unix or Linux

• Experience with SQL (PostgreSQL experience is a plus!)

• Experience with vi or Emacs

• Experience with Shell Scripting

Page 5: Course 3 GP Advanced SQL and Tuning


Module 1

Greenplum Architecture Review

Page 6: Course 3 GP Advanced SQL and Tuning


Greenplum Architecture

• MPP = Massively Parallel Processing
  - Multiple units of parallelism working on the same task
  - Parallel database operations
  - Parallel CPU processing
  - Greenplum’s units of parallelism are “segments”

• “Shared Nothing” architecture
  - Segments operate only on their portion of the data
  - In Greenplum, each segment is a separate PostgreSQL database
  - Segments are self-sufficient
  - Dedicated CPU processes
  - Dedicated storage that is accessible only by the segment

Page 7: Course 3 GP Advanced SQL and Tuning


Greenplum Architecture

Page 8: Course 3 GP Advanced SQL and Tuning


Master Host, Segment Hosts, and Segment Instances

Page 9: Course 3 GP Advanced SQL and Tuning


Interconnect

Page 10: Course 3 GP Advanced SQL and Tuning


How does Architecture Relate to Queries?

• Data is spread across the entire system, so …
  - Good queries reduce the amount of data motion between segments
  - Good physical implementation allows queries to easily eliminate rows
  - Good data distribution reduces skew

• The Greenplum database has many “high availability” features
  - High availability may incur a performance cost

• Queries and Data Manipulation (DML) operations must be optimized with the architecture in mind.

Page 11: Course 3 GP Advanced SQL and Tuning


LAB

Module 1

Greenplum Architecture Review

Page 12: Course 3 GP Advanced SQL and Tuning


Module 2

Data Modeling, Databases & Schemas

Page 13: Course 3 GP Advanced SQL and Tuning


Data Models

• Logical Data Model
  - Graphical representation of the business requirements
  - Contains the things of importance in an organization and how they relate to one another
  - Contains business textual definitions and examples

• Enhanced Logical Data Model
  - Gathers additional information regarding objects in the LDM

• Physical Data Model
  - How the Logical Model is implemented

Page 14: Course 3 GP Advanced SQL and Tuning


Data Modeling Terms

• Entity
  - Can exist on its own
  - Can be uniquely identified

• Attribute
  - Defines a property of the entity

• Relationship
  - How entities relate to each other

• Primary Key
  - A single attribute, or combination of attributes, that uniquely identifies the entity.

Page 15: Course 3 GP Advanced SQL and Tuning


Types of Logical Data Models

• Dimensional Model (Star Schema)
  - Easiest for users to understand
  - Single layer of dimensions
  - May have data redundancy

Page 16: Course 3 GP Advanced SQL and Tuning


Dimensional Model – Star Schema

Dimension
  - Known entry point to a Fact
  - A defined attribute of a Fact
  - Simple Primary Key

Fact
  - Measurable
  - Relates to a specific instance
  - May have a compound Primary Key

Page 17: Course 3 GP Advanced SQL and Tuning


Types of Logical Data Models

• Snowflake Schema
  - Greater reporting ability
  - Can break up large dimensions into manageable chunks
  - May have redundant data

Page 18: Course 3 GP Advanced SQL and Tuning


Snowflake Schema

Page 19: Course 3 GP Advanced SQL and Tuning


Types of Logical Data Models

• Third Normal Form
  - Most flexible option
  - No redundant data
  - Most difficult for users to use

Page 20: Course 3 GP Advanced SQL and Tuning


Third Normal Form (3NF) Model: Customer Dimension in 3rd Normal Form

Customer: Customer ID, Customer Name, Address, Zip Code, Zip Code Plus Four, Phone

Zip Code: Zip Code, City, State, Country Code

State: State, Long Name, Capital City

Country: Country Code, Country Name

Page 21: Course 3 GP Advanced SQL and Tuning


Data Warehousing & Models

• Star and Snowflake Schemas are the most common in Data Warehouses
  - Fast retrieval of large numbers of rows
  - Easier to aggregate
  - Intuitive to end users

Page 22: Course 3 GP Advanced SQL and Tuning


Enhanced Logical Data Model

• Provides more detail than the Logical Model
  - Defines the cardinality of the data attributes
  - Defines the anticipated initial data volume (in rows)
  - Defines the anticipated data growth over time

Page 23: Course 3 GP Advanced SQL and Tuning


Enhanced Logical Data Model

Entity: Store
Source: SOLS (Stores On-Line System), stores table
Row Count at Source: 35
Anticipated Annual Growth Rate: 100

Attribute    | Source | Uniqueness | Data Type     | Data Example
Store ID     | Stores | Unique     | Small Integer | 1, 2 … N
Store Name   | Stores | Non-unique | Character     | Tom’s Groceries
Has Pharmacy | Stores | Non-unique | Character     | T, F

Page 24: Course 3 GP Advanced SQL and Tuning


Physical Data Model

• Table
  - The physical manifestation of an entity

• Columns
  - How the attributes of the entity are physically represented

• Constraints
  - Foreign Key (disabled in Greenplum 3.1)
  - Uniqueness
  - Primary Key
  - Null / Not Null
  - Check Values

Page 25: Course 3 GP Advanced SQL and Tuning


Logical to Physical – Logical Data Model

FACT

Transaction: Transaction ID, Terminal ID, Transaction Date, Store ID, Customer ID, Item Count, Sales Amount, Discount Amount, Coupon Amount, Cash Amount, Credit Card Amount, Debit Amount, Other Amount

DIMENSIONS

Store: Store ID, Store Name, Address, City, State, Zip Code, Zip Code Plus Four, Phone, Country Code, Has Pharmacy?, Has Grocery?, Has Deli?, Has Butcher?, Has Bakery?

Customer: Customer ID, Customer Name, Address, City, State, Zip Code, Zip Code Plus Four, Country Code, Phone

Country: Country Code, Country Name

Page 26: Course 3 GP Advanced SQL and Tuning


Logical to Physical – Physical Data Model

FACT

Transaction:
transId BIGINT NOT NULL
terminalId INTEGER NOT NULL
transDate DATE NOT NULL
storeId SMALLINT NOT NULL
customerId INTEGER
itemCnt INTEGER NOT NULL
salesAmt DECIMAL(9,2) NOT NULL
taxAmt DECIMAL(9,2) NOT NULL
discountAmt DECIMAL(9,2)
couponAmt DECIMAL(9,2)
cashAmt DECIMAL(9,2)
checkAmt DECIMAL(9,2)
ccAmt DECIMAL(9,2)
debitAmt DECIMAL(9,2)
otherAmt DECIMAL(9,2)

DIMENSIONS

Store:
storeId SMALLINT NOT NULL
storeName CHAR(20) NOT NULL
address VARCHAR(50) NOT NULL
city VARCHAR(40) NOT NULL
state CHAR(2) NOT NULL
zipcode CHAR(8) NOT NULL
zipPlusFour CHAR(4) NOT NULL
countrycd CHAR(3) NOT NULL
phone CHAR(13)
pharmacy CHAR(1)
grocery CHAR(1)
deli CHAR(1)
butcher CHAR(1)
bakery CHAR(1)

Customer:
customerId INTEGER NOT NULL
custName CHAR(50) NOT NULL
address CHAR(50) NOT NULL
city CHAR(40) NOT NULL
state CHAR(2) NOT NULL
zipcode CHAR(7) NOT NULL
zipPlusFour CHAR(4) NOT NULL
countrycd CHAR(3) NOT NULL
phone CHAR(13)

Country:
countrycd CHAR(3) NOT NULL
countryname VARCHAR(100) NOT NULL

Notes:
  - Entities become tables
  - Attributes become columns
  - Relationships may become foreign key constraints

Page 27: Course 3 GP Advanced SQL and Tuning


Data Segregation

• Database
  - A Greenplum system may have multiple databases
  - Greenplum databases do not share data

• Schema
  - A logical grouping of database objects

• Views
  - All or some table columns may be seen
  - Allows for user-defined columns
  - Instantiated at run time

Page 28: Course 3 GP Advanced SQL and Tuning


About Greenplum Databases

• A Greenplum Database system may have multiple databases
  - Clients connect to one database at a time
  - Default database template: template1
  - Other database templates: template0, postgres

NEVER DROP OR CHANGE TEMPLATE DATABASES!

Page 29: Course 3 GP Advanced SQL and Tuning


Database SQL Commands

To Create: CREATE DATABASE or createdb

To Drop: DROP DATABASE or dropdb

To Modify: ALTER DATABASE
  - Change database name
  - Assign to a new owner
  - Set configuration parameters
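As a sketch of these commands (the database and owner names are illustrative):

```sql
-- Create a database
CREATE DATABASE salesdb;

-- Modify it: rename, reassign ownership, set a configuration parameter
ALTER DATABASE salesdb RENAME TO sales_dw;
ALTER DATABASE sales_dw OWNER TO gpadmin;
ALTER DATABASE sales_dw SET search_path TO myschema, public;

-- Drop it when no longer needed
DROP DATABASE sales_dw;
```

From the shell, the createdb and dropdb wrappers perform the same work as the first and last statements.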

Page 30: Course 3 GP Advanced SQL and Tuning


PSQL Database Tips

The psql prompt shows which database you are in.
EXAMPLE: template1=#  (superuser)
         template1=>  (non-superuser)

To show a list of all databases: \l  (a lowercase L)

To connect to a different database: \c db_name

BEST PRACTICE: Set the PGDATABASE environment variable in the .bashrc shell script to set the default database!
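A minimal sketch of that best practice (the database name is illustrative); the export line would normally go in ~/.bashrc:

```shell
# Default database for psql and other libpq-based clients
export PGDATABASE=salesdb

# Running "psql" with no arguments would now connect to salesdb
echo "$PGDATABASE"
```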

Page 31: Course 3 GP Advanced SQL and Tuning


About Schemas

Schemas are a way to logically organize objects within a database (tables, views, functions, etc.)

Use “fully qualified” names to access objects in a schema
EXAMPLE: myschema.mytable

Every database has a public schema

Other system-level schemas include: pg_catalog, information_schema, pg_toast, pg_bitmapindex

Page 32: Course 3 GP Advanced SQL and Tuning


Schema SQL Commands

To Create: CREATE SCHEMA

To Drop: DROP SCHEMA

To Modify: ALTER SCHEMA
  - Change schema name
  - Assign to a new owner
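A short sketch of these commands (the schema, table, and owner names are illustrative):

```sql
-- Create a schema and an object inside it
CREATE SCHEMA staging;
CREATE TABLE staging.daily_sales (sale_date date, amount numeric(9,2));

-- Modify it: rename and reassign ownership
ALTER SCHEMA staging RENAME TO stage;
ALTER SCHEMA stage OWNER TO gpadmin;

-- Drop the schema and everything in it
DROP SCHEMA stage CASCADE;
```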

Page 33: Course 3 GP Advanced SQL and Tuning


PSQL Schema Tips

To see the current schema: select current_schema();

To see a list of all schemas in the database: \dn

To see the schema search path: show search_path;

Page 34: Course 3 GP Advanced SQL and Tuning


Default Schemas

• public schema
  - Contains no objects at database creation
  - Generally used to store “public” objects

• pg_catalog
  - This is the data dictionary

• information_schema
  - Contains views and tables that make the data dictionary easier to understand, with fewer joins

• pg_bitmapindex
  - Used to store bitmap index objects

Page 35: Course 3 GP Advanced SQL and Tuning


TOAST Schema

• This schema is used specifically to store large attribute objects (> 1 page of data)

TOAST = The Oversized-Attribute Storage Technique

Page 36: Course 3 GP Advanced SQL and Tuning


LAB

Module 2

Data Modeling, Databases & Schemas

Page 37: Course 3 GP Advanced SQL and Tuning


Module 3

Physical Design Decisions

Page 38: Course 3 GP Advanced SQL and Tuning


Key Design Considerations

• Data Distribution
  - Selecting the best distribution key
  - Checking for data skew

• To Partition or Not to Partition

• Data Types
  - Select the smallest type that will store the data

• Constraints
  - Table
  - Column

Page 39: Course 3 GP Advanced SQL and Tuning


Primary Key vs. Distribution Key

A primary key is a logical-model concept that allows each row to be uniquely identified.

A distribution key is a physical Greenplum database concept that determines how a row is physically stored.

A well-designed Greenplum schema may have a large percentage of tables where the physical distribution key differs from the logical primary key. (Tables with common distribution keys provide faster join operations by reducing the number of motion operations.)
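A sketch of the distinction (table and column names are illustrative). Because Greenplum requires a declared primary key to also be the distribution key, the logical primary key here is only documented, while the distribution key is chosen for join performance:

```sql
CREATE TABLE order_line (
    order_id    bigint  NOT NULL,
    line_no     integer NOT NULL,
    customer_id integer NOT NULL
    -- logical primary key (documented in the data model): (order_id, line_no)
) DISTRIBUTED BY (customer_id);  -- chosen so joins on customer_id stay local
```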

Page 40: Course 3 GP Advanced SQL and Tuning


Key Comparison

Primary Key | Distribution Key
Logical concept of data modeling | Physical mechanism for storage
Optional for Greenplum (becomes the distribution key if specified) | Required for every Greenplum table (first eligible column if not specified in DISTRIBUTED BY)
No limit to the number of columns | Limits:
Documented in the data model | Defined in the CREATE TABLE command
Must be unique | Can be non-unique
Uniquely identifies each row | Can uniquely or non-uniquely identify each row
Value cannot be changed | Value cannot be changed
Cannot be NULL | Can be NULL
Does not define an access path (DBMS independent) | Defines a storage path
Chosen for logical correctness | Chosen for distribution and performance

Page 41: Course 3 GP Advanced SQL and Tuning


Data Distribution

CREATE TABLE tablename (
    column_name1 data_type NOT NULL,
    column_name2 data_type NOT NULL,
    column_name3 data_type NOT NULL DEFAULT default_value,
    . . .
)
[DISTRIBUTED BY (column_name)]   -- hash algorithm
OR
[DISTRIBUTED RANDOMLY]           -- round-robin algorithm

Every table in the Greenplum database has a distribution method.
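A concrete sketch of both options (table and column names are illustrative):

```sql
-- Hash distribution: rows are placed by hashing customer_id
CREATE TABLE customer (
    customer_id integer NOT NULL,
    cust_name   varchar(50)
) DISTRIBUTED BY (customer_id);

-- Round-robin distribution: rows are spread evenly with no key affinity
CREATE TABLE audit_log (
    event_time timestamp NOT NULL,
    message    text
) DISTRIBUTED RANDOMLY;
```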

Page 42: Course 3 GP Advanced SQL and Tuning


Distribution Key Considerations

• A distribution key cannot be changed or altered after a table is created

• If a table has a unique constraint, it must be declared as the distribution key

• User-defined data types and geometric data types are not eligible as distribution keys

Page 43: Course 3 GP Advanced SQL and Tuning


Default Distributions (Not Recommended)

• If a table has a primary key and a distribution key is NOT specified, by default the primary key will be used as the distribution key

• If the table does not have a primary key and a distribution key is NOT specified, then the first eligible column of the table will be used as the distribution key
  - User-defined data types and geometric data types are not eligible as distribution keys
  - If a table does not have an eligible column, then a random distribution is used

Page 44: Course 3 GP Advanced SQL and Tuning


Distributions and Performance

Response time is the completion time for ALL segment instances in a Greenplum database.

To optimize performance, use a hash distribution method (DISTRIBUTED BY) that distributes data evenly across all segment instances.

(Figure: segment instances 0–5, each with its own CPU, disk I/O, and network, holding an even share of the data.)

Page 45: Course 3 GP Advanced SQL and Tuning


Hash Distributions and Data Skew

Select a distribution key with unique values and high cardinality.

A low-cardinality key such as Gender (M or F) hashes rows to only two distinct values, so the data lands on only a few segment instances and is skewed.

(Figure: segment instances 0–5 with uneven data volumes.)

Page 46: Course 3 GP Advanced SQL and Tuning


Hash Distributions and Processing Skew

Using a DATE column as the distribution key may distribute rows evenly across segment instances, but may result in processing skew: a query that filters on a date range does its work on only the instances holding those dates (JAN, FEB, MAR, APR, MAY, JUN).

Select a distribution key that will not result in processing skew.

(Figure: segment instances 0–5, each holding one month of data.)

Page 47: Course 3 GP Advanced SQL and Tuning


Use the Same Distribution Key for Commonly Joined Tables

Optimize for local joins! Distribute on the same key used in the join (WHERE clause).

(Figure: on each segment instance, customer (c_customer_id) joins freq_shopper (f_customer_id) locally, with no data motion between segment hosts.)
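The layout in the figure can be sketched in DDL (column lists are illustrative); distributing both tables on the customer id lets the join run locally on each segment:

```sql
CREATE TABLE customer (
    c_customer_id integer NOT NULL,
    c_name        varchar(50)
) DISTRIBUTED BY (c_customer_id);

CREATE TABLE freq_shopper (
    f_customer_id integer NOT NULL,
    f_visits      integer
) DISTRIBUTED BY (f_customer_id);

-- Matching rows live on the same segment instance, so no motion is needed
SELECT c.c_name, f.f_visits
FROM customer c
JOIN freq_shopper f ON c.c_customer_id = f.f_customer_id;
```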

Page 48: Course 3 GP Advanced SQL and Tuning


Avoid Redistribute Motion for Large Tables

WHERE customer.c_customer_id = freq_shopper.f_customer_id

If freq_shopper is distributed on f_trans_number rather than f_customer_id, the freq_shopper table is dynamically redistributed on f_customer_id before the join.

(Figure: freq_shopper rows for customer_id 102 and 745 are moved across segment instances to reach the instances holding the matching customer rows.)

Page 49: Course 3 GP Advanced SQL and Tuning


Avoid Broadcast Motion for Large Tables

WHERE customer.c_statekey = state.s_statekey

The state table is dynamically broadcast to all segment instances.

(Figure: every segment instance receives a full copy of the state table (AK, AL, AZ, CA, …) to join with its local customer rows.)

Page 50: Course 3 GP Advanced SQL and Tuning


Commonly Joined Tables Use the Same Data Type for Distribution Keys

customer (c_customer_id)       745::int
freq_shopper (f_customer_id)   745::varchar(10)

• Values might appear the same, but they are stored differently at the disk level

• Values might appear the same, but they HASH to different values
  - Resulting in like rows being stored on different segment instances
  - Requiring a redistribution before the tables can be joined

Page 51: Course 3 GP Advanced SQL and Tuning


DISTRIBUTED RANDOMLY

• Uses a round-robin algorithm

• Any query that joins to a randomly distributed table will require a redistribute motion or a broadcast motion

• Acceptable for small tables
  - For example, dimension tables

• Not acceptable for large tables
  - For example, fact tables

Page 52: Course 3 GP Advanced SQL and Tuning


Distribution Strategies to Avoid

Do not generate or fabricate columns to create a unique distribution key simply to avoid skew:
  - DISTRIBUTED BY (col1, col2, col3) when (col1, col2, col3) is never used in joins
  - Homegrown serial-key algorithms

Do not distribute on boolean data types.
Do not distribute on floating-point data types.
Do not DISTRIBUTE RANDOMLY just because it is the easy choice!

Page 53: Course 3 GP Advanced SQL and Tuning


Check for Data Skew

• Use gpskew to check for data skew:
  gpskew -t schema.table -a database_name

• For a partitioned table, use the gpskew -h option:
  gpskew -t schema.table -a database_name -h

Page 54: Course 3 GP Advanced SQL and Tuning


gpskew Example

gpadmin@mdw > gpskew -t public.customer

. . .

. . . Number of Distribution Column(s)      = {1}

. . . Distribution Column Name(s)           = cust_id

20080316:20:12:08:gpskew:mdw:gpadmin-[INFO]

. . . Total records (inc child tables)      = 5277811399

. . .

20080316:20:12:08:gpskew:mdw:gpadmin-[INFO]:-Skew result

. . . Maximum Record Count                  = 66004567

. . . Segment instance hostname             = gpdbhost

. . . Segment instance name                 = dw100

. . . Minimum Record Count                  = 65942375

. . . Segment instance hostname             = gpdbhost

. . . Segment instance name                 = dw100

. . . Record count variance                 = 62192

Page 55: Course 3 GP Advanced SQL and Tuning


Redistribute Using CREATE TABLE LIKE

• To change the distribution of a table, use the CREATE TABLE LIKE statement, specifying the new distribution key:

CREATE TABLE customer_tmp (LIKE customer)
DISTRIBUTED BY (customer_id);

INSERT INTO customer_tmp SELECT * FROM customer;

DROP TABLE customer;

ALTER TABLE customer_tmp RENAME TO customer;

Page 56: Course 3 GP Advanced SQL and Tuning


Why Partition a Table?

Provide more efficiency in querying against a subset of large volumes of transactional detail data, and manage this data more effectively.
  - Businesses have recognized the analytic value of detailed transactions and are storing larger and larger volumes of this data.

Increase query efficiency by avoiding full table scans, without the overhead and maintenance costs of an index on date.
  - As the retention volume of detailed transactions increases, the percentage of transactions that the “average” query requires for execution decreases.

Allow “instantaneous” dropping of “old” data and simple addition of “new” data.
  - Supports a “rolling n periods” methodology for transactional data.

Page 57: Course 3 GP Advanced SQL and Tuning


Partitions in Greenplum

A mechanism in Greenplum for use in physical database design.

Increases the available options to improve the performance of a certain class of queries.
  - Only the rows of the partitions (child tables) qualified by a query need to be accessed.

Two types of partitioning:
  - Range partitioning (division of data based on a numerical range, such as date or price)
  - List partitioning (division of data based on a list of values, such as state or region)

Partitioning in Greenplum works using table inheritance and CHECK table constraints. Partitioned tables are also distributed.

Page 58: Course 3 GP Advanced SQL and Tuning


Candidates for Partitioning

Any table with large data volumes where queries commonly select using a date range, or that has roll-off requirements.

• Examples
  - Fact tables with dates
  - Transaction tables with dates
  - Activity tables with dates
  - Statement tables with numeric statement periods (YYYYMM)
  - Financial summary tables with end-of-month dates

Page 59: Course 3 GP Advanced SQL and Tuning


Partitioning Column Candidates

Recommended and simple:

• Dates using date expressions
  - Trade date
  - Order date
  - Billing date
  - End-of-month date

• Numeric periods that are integers
  - Year and month
  - Cycle run date
  - Julian date

Page 60: Course 3 GP Advanced SQL and Tuning


Greenplum Table Partitioning

Greenplum partitions tables in the CREATE TABLE DDL statement.

Partitions may be by range (date, id, etc.)
Partitions may be by value list (“Male”, “Female”, etc.)

Partition operations include:
  - CREATE TABLE … PARTITION BY
  - ALTER TABLE ADD PARTITION
  - DROP PARTITION
  - RENAME PARTITION
  - EXCHANGE PARTITION

Page 61: Course 3 GP Advanced SQL and Tuning


Partitioned Table Example

Page 62: Course 3 GP Advanced SQL and Tuning


DDL to Create Table (prior slide)

CREATE TABLE baby.rank (
    id INT,
    rank INT,
    year INT,
    gender CHAR(1),
    count INT
)
DISTRIBUTED BY ( id, gender, year )
PARTITION BY LIST ( gender )
SUBPARTITION BY RANGE ( year )
SUBPARTITION TEMPLATE
( SUBPARTITION "2001" START ( 2001 ) INCLUSIVE,
  SUBPARTITION "2002" START ( 2002 ) INCLUSIVE,
  SUBPARTITION "2003" START ( 2003 ) INCLUSIVE,
  SUBPARTITION "2004" START ( 2004 ) INCLUSIVE,
  SUBPARTITION "2005" START ( 2005 ) INCLUSIVE END ( 2006 ) EXCLUSIVE )
( PARTITION females VALUES ( 'F' ),
  PARTITION males VALUES ( 'M' ) );

Page 63: Course 3 GP Advanced SQL and Tuning


Table Partitioning Best Practices

• Use table partitioning on large distributed tables to improve query performance
  - Table partitioning is not a substitute for table distribution

• If the table can be divided into roughly equal parts based on a defining criterion
  - For example, range partitioning on date

• And the defining partitioning criterion matches the access pattern used in query predicates
  - WHERE date = '1/30/2008'

• No overlapping ranges or duplicate list values

Page 64: Course 3 GP Advanced SQL and Tuning


Table Partitioning Example

Distributed by c_customer_id, partitioned by week_number.

SELECT . . . FROM customer WHERE week_number = 200702

(Figure: segment instances 0 through 51 each hold the customer table, distributed by c_customer_id, with child partitions for week_number = 200701 through 200752.)

Page 65: Course 3 GP Advanced SQL and Tuning


Partition Elimination

SELECT . . . FROM customer WHERE week_number = 200702

Only the 200702 week partition is scanned. The other 51 partitions on the segment instance are eliminated from the query execution plan.

The primary goal of table partitioning is to achieve partition elimination.

(Figure: segment instance 0 with partitions week_number = 200701 … 200752; only 200702 is scanned.)

Page 66: Course 3 GP Advanced SQL and Tuning


CREATING PARTITIONED TABLES

CREATE TABLE customer (
    c_customer_id INT,
    week_number INT,
    . . .
)
DISTRIBUTED BY (c_customer_id)
PARTITION BY RANGE ( week_number )
(START (200701) END (200752) INCLUSIVE);

Page 67: Course 3 GP Advanced SQL and Tuning


Maintaining Rolling Windows

Use ALTER TABLE to add a new child table (week_number = 200801).
Use ALTER TABLE to drop an old child table (week_number = 200701).

(Figure: segment instances 0 through 51 each now hold child partitions for week_number = 200702 … 200801.)
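A sketch of the rolling-window maintenance, assuming the partitioned customer table above (boundary values are illustrative):

```sql
-- Add the first week of the new year as a new child partition
ALTER TABLE customer
    ADD PARTITION START (200801) INCLUSIVE END (200802) EXCLUSIVE;

-- Drop the oldest week; its data is removed "instantaneously"
ALTER TABLE customer
    DROP PARTITION FOR (200701);
```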

Page 68: Course 3 GP Advanced SQL and Tuning


Loading Partitioned Tables

(Parent table customer with child tables week_number = 200701 … 200752 on each segment instance.)

• Data may be loaded into the parent table
  - The parent table itself contains no data
  - By default, data is not automatically inserted into the correct child tables
    - Use rewrite rules

• Data may be loaded into the child tables
  - For the best load performance, load directly into the child tables, OR
  - Load into a “clone” of the child table and EXCHANGE partitions

Page 69: Course 3 GP Advanced SQL and Tuning


Loading Child Tables

(Parent table customer with child tables week_number = 200701 … 200752.)

• Data must be partitioned to ensure it is loaded into the correct child table
  - Manually partition the data files, OR
  - Use external tables to filter the data:

    INSERT INTO customer_200701
    SELECT * FROM ext_customer
    WHERE week_number = 200701;

Page 70: Course 3 GP Advanced SQL and Tuning


Partitioned Table with No Sub-Partitions (Heterogeneous Example)

CREATE TABLE orders
( order_id     NUMBER(12),
  order_date   TIMESTAMP WITH LOCAL TIME ZONE,
  order_mode   VARCHAR2(8),
  customer_id  NUMBER(6),
  order_status NUMBER(2),
  order_total  NUMBER(8,2),
  sales_rep_id NUMBER(6),
  promotion_id NUMBER(6)
)
DISTRIBUTED BY (customer_id)
PARTITION BY RANGE (order_date)
PARTITIONS 100
(
  starting '2005-12-01'
  ending '2007-12-01' every 3 months,
  ending '2008-12-01' every 4 weeks,
  ending '2008-12-03' every 6 hours,
  ending '2008-12-04' every 2 secs
);

Page 71: Course 3 GP Advanced SQL and Tuning


Multi-level Partition Example

CREATE TABLE orders
( order_id     NUMBER(12),
  order_date   TIMESTAMP WITH LOCAL TIME ZONE,
  order_mode   VARCHAR2(8),
  customer_id  NUMBER(6),
  order_status NUMBER(2),
  order_total  NUMBER(8,2),
  sales_rep_id NUMBER(6),
  promotion_id NUMBER(6)
)
DISTRIBUTED BY (customer_id)
PARTITION BY RANGE (order_date)
SUBPARTITION BY HASH (customer_id)
SUBPARTITIONS 8
(
  partition minny,
  starting '2004-12-01' ending '2006-12-01',
  partition maxy
);

Page 72: Course 3 GP Advanced SQL and Tuning


What is the difference between heterogeneous and multi-level partitions?

• Heterogeneous partitions: all partitions use the same column value, for example date, but the partitions are determined by different expressions (daily, weekly, monthly).

• Multi-level partitions: the partitions are at different levels, based on different columns.

Page 73: Course 3 GP Advanced SQL and Tuning


Data Type Best Practices

• Use data types to constrain your column data
  - Use character types to store strings
  - Use date or timestamp types to store dates
  - Use numeric types to store numbers

• Choose the type that uses the least space
  - Don’t use a BIGINT when an INTEGER will work

• Use identical data types for columns used in join operations
  - Converting data types prior to joining is resource intensive
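A small sketch of these practices (the table and columns are illustrative):

```sql
CREATE TABLE store_visit (
    store_id   smallint NOT NULL,  -- 2 bytes; no need for integer or bigint here
    visit_date date     NOT NULL,  -- a real date type, not char(8)
    visit_cnt  integer  NOT NULL
) DISTRIBUTED BY (store_id);
```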

Page 74: Course 3 GP Advanced SQL and Tuning


Greenplum Data Types

Page 75: Course 3 GP Advanced SQL and Tuning


Greenplum Data Types (continued)

Page 76: Course 3 GP Advanced SQL and Tuning


More Greenplum Data Types

Page 77: Course 3 GP Advanced SQL and Tuning


Table Constraints

Primary Key Constraint *
EXAMPLE: CONSTRAINT mytable_pk PRIMARY KEY (mycolumn)

With OIDs
EXAMPLE: WITH (OIDS=TRUE)
Used when updating or deleting rows via an ODBC connection from another database (Oracle, SQL Server, etc.).

* If a table has a primary key, this column (or group of columns) is chosen as the distribution key for the table. If a table has a primary key, it cannot have other columns with a unique constraint.

Page 78: Course 3 GP Advanced SQL and Tuning


Column Constraints

Check Constraint
EXAMPLE: price_amt DECIMAL(10,2) CHECK (price_amt > 0.00)

Not Null Constraint
EXAMPLE: customer_id INTEGER NOT NULL

Uniqueness Constraint *
EXAMPLE: customer_id INTEGER UNIQUE

* Greenplum Database has some special conditions for unique constraints. A table can have only one unique constraint, and the unique column (or set of columns) must also be declared as the distribution key for the table.
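Putting the table and column constraints together in one sketch (names are illustrative); note that the primary key column also serves as the distribution key:

```sql
CREATE TABLE product (
    product_id integer       NOT NULL,
    name       varchar(60)   NOT NULL,
    price_amt  decimal(10,2) CHECK (price_amt > 0.00),
    CONSTRAINT product_pk PRIMARY KEY (product_id)
) DISTRIBUTED BY (product_id);
```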

Page 79: Course 3 GP Advanced SQL and Tuning


LAB

Module 3

Physical Design Decisions

Page 80: Course 3 GP Advanced SQL and Tuning


Module 4

Data Loading: “ETL” versus “ELT”

Page 81: Course 3 GP Advanced SQL and Tuning


ETL – Extract, Transform & Load

Extract data from one or more source systems.
Transform data using the source system, an ETL server, or an ETL tool (e.g., Informatica or DataStage), as required by business rules.
Load data into the data warehouse.

Page 82: Course 3 GP Advanced SQL and Tuning


The Greenplum Way - ELT

Extract data from one or more source systems.
Load data in a parallel manner, via Greenplum external tables, into stage or temporary tables in the MPP data warehouse.
Transform the data in the data warehouse in a parallel manner and load it into the target tables.

Page 83: Course 3 GP Advanced SQL and Tuning


Loading Data – Greenplum Methods

• External Tables
  - File based
  - Web based
  - gpfdist: massively parallel load tool
  - gpload

• SQL COPY command

• ETL tool
  - How to do ELT with a tool

Page 84: Course 3 GP Advanced SQL and Tuning


Loading with External Tables

• What are external tables?
  - Read-only data that resides outside the database
  - SELECT, JOIN, SORT, etc., just like a regular table

• Advantages:
  - Parallelism
  - ELT flexibility
  - Error handling of badly formatted rows
  - Multiple data sources

• Two types:
  - External tables: file based
  - Web tables: URL or command based

Page 85: Course 3 GP Advanced SQL and Tuning


File-Based External Tables

• Two Access Protocols
  file
  gpfdist

• File Formats
  text
  csv

• Example External Table Definition
CREATE EXTERNAL TABLE ext_expenses
   (name text, date date, amount float4, category text, description text)
LOCATION (
   'file://seghost1/dbfast/external/expenses1.csv',
   'file://seghost1/dbfast/external/expenses2.csv',
   'file://seghost2/dbfast/external/expenses3.csv',
   'file://seghost2/dbfast/external/expenses4.csv',
   'file://seghost3/dbfast/external/expenses5.csv',
   'file://seghost3/dbfast/external/expenses6.csv' )
FORMAT 'CSV' ( HEADER );

Page 86: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Parallel File Distribution Program (gpfdist)

• Ensures full parallelism for the best performance
• Can be run on a server outside the Greenplum array
• C program - uses HTTP protocol
• 200 MB/s data distribution rate per gpfdist
• Configuration parameter: gp_external_max_segs

• Example Start Commands:
gpfdist -d /var/load_files/expenses -p 8081 >> gpfdist.log 2>&1 &
gpfdist -d /var/load_files/expenses -p 8082 >> gpfdist.log 2>&1 &

• Example External Table Definition:
CREATE EXTERNAL TABLE ext_expenses
   ( name text, date date, amount float4, description text )
LOCATION ('gpfdist://etlhost:8081/*','gpfdist://etlhost:8082/*')
FORMAT 'TEXT' (DELIMITER '|')
ENCODING 'UTF-8'
LOG ERRORS INTO ext_expenses_loaderrors SEGMENT REJECT LIMIT 10000 ROWS;

Page 87: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Parallel File Distribution Program (gpfdist)

[Diagram: an ETL Server running two gpfdist instances serves External Table Files to the Master Host and the Segment Hosts over the Interconnect - Gigabit Ethernet Switch]

Page 88: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

About External and Web Tables

• Two Types
  External Tables
    - Greenplum file server (gpfdist)
    - File-based (file://)
  Web Tables
    - URL-based (http://)
    - Output-based (OS commands, scripts)

•Used for ETL and Data Loading

•Data accessed at runtime by segments (in

parallel)

•Read-only

Page 89: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

External Table SQL Syntax

CREATE EXTERNAL TABLE table_name (column_name data_type [, ...])

LOCATION ('file://seghost[:port]/path/file' [, ...]) | ('gpfdist://filehost[:port]/file_pattern' [, ...])

FORMAT 'TEXT | CSV' [( [DELIMITER [AS] 'delimiter'] [NULL [AS] 'null string'] [ESCAPE [AS] 'escape' | 'OFF'] [HEADER] [QUOTE [AS] 'quote'] [FORCE NOT NULL column [, ...] )]

[ ENCODING 'encoding' ]

[ [LOG ERRORS INTO error_table] SEGMENT REJECT LIMIT count [ROWS | PERCENT] ]

DROP EXTERNAL TABLE table_name;

Page 90: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

External Web Tables

• Dynamic data – not rescannable
• Two Forms:
  URLs
  Commands or scripts
    - executed on specified number of segment instances
    - can disable by setting: gp_external_enable_exec=off

• Output Data Formats
  text
  csv

• Example Web Table Definitions
CREATE EXTERNAL WEB TABLE ext_expenses
   (name text, date date, amount float4, description text)
LOCATION (
   'http://intranet.company.com/expenses/sales/expenses.csv',
   'http://intranet.company.com/expenses/finance/expenses.csv',
   'http://intranet.company.com/expenses/ops/expenses.csv' )
FORMAT 'CSV' ( HEADER );

CREATE EXTERNAL WEB TABLE log_output (linenum int, message text) EXECUTE '/var/load_scripts/get_log_data.sh' ON HOST FORMAT 'TEXT' (DELIMITER '|');

CREATE EXTERNAL WEB TABLE du_space (storage text) EXECUTE 'du -sh' ON ALL FORMAT 'TEXT';

Page 91: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

External Table Error Handling

• Load good rows and catch poorly formatted rows, such as:
  rows with missing or extra attributes
  rows with attributes of the wrong data type
  rows with invalid client encoding sequences

• Does not apply to constraint errors: PRIMARY KEY, NOT NULL, CHECK or UNIQUE constraints

• Optional error handling clause for external tables:
  [LOG ERRORS INTO error_table] SEGMENT REJECT LIMIT count [ROWS | PERCENT]
  ( PERCENT based on gp_reject_percent_threshold parameter )

Example:
CREATE EXTERNAL TABLE ext_customer
   (id int, name text, sponsor text)
LOCATION ( 'gpfdist://filehost:8081/*.txt' )
FORMAT 'TEXT' ( DELIMITER '|' NULL ' ')
LOG ERRORS INTO err_customer SEGMENT REJECT LIMIT 5 ROWS;
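Once rows are rejected, they can be inspected in the error table with ordinary SQL; a sketch assuming the err_customer table created above and the standard Greenplum error-table columns (cmdtime, relname, linenum, errmsg, rawdata):

```sql
-- Inspect badly formatted rows captured during the load
SELECT cmdtime, relname, linenum, errmsg, rawdata
FROM err_customer
ORDER BY cmdtime DESC;
```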

Page 92: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

External Tables and Planner Statistics

• Data resides outside the database

• No database statistics for external table data

• Not meant for frequent or ad-hoc access

• Can manually set rough statistics in pg_class:

UPDATE pg_class

SET reltuples=400000, relpages=400

WHERE relname='myexttable';
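To sanity-check the values before and after the update, the planner's current estimates can be read back from pg_class; a sketch using the same hypothetical table name:

```sql
-- Read back the planner's row and page estimates
SELECT relname, reltuples, relpages
FROM pg_class
WHERE relname = 'myexttable';
```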

Page 93: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

gpload

• Runs a load job as defined in a YAML formatted control file

• gpload is a data loading utility that acts as an interface to the external table parallel loading feature. Using a load specification defined in a YAML formatted control file, gpload executes a load by invoking the Greenplum parallel file server (gpfdist), creating an external table definition based on the source data defined, and executing an INSERT, UPDATE or MERGE operation to load the source data into the target table in the database.

• Usage:
gpload -f control_file [-l log_file] [-h hostname] [-p port] [-U username] [-d database] [-W] [-v | -V] [-q] [-D]
gpload -? | --version

• Example:
gpload -f my_load.yml

Page 94: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

gpload - Control File Format

• The gpload control file uses the YAML 1.1 document format and implements its own schema for defining the various steps of a Greenplum Database load operation. The control file must be a valid YAML document.

• Example load control file:

VERSION: 1.0.0.1
DATABASE: ops
USER: gpadmin
HOST: mdw-1
PORT: 5432
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
           - etl1-1
           - etl1-2
           - etl1-3

Page 95: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

gpload - Control File Format (contd.)

           - etl1-4
         PORT: 8081
         FILE: /var/load/data/*
    - COLUMNS:
       - name: text
       - amount: float4
       - category: text
       - desc: text
       - date: date
    - FORMAT: text
    - DELIMITER: '|'
    - ERROR_LIMIT: 25
    - ERROR_TABLE: payables.err_expenses
   OUTPUT:
    - TABLE: payables.expenses
    - MODE: INSERT

Page 96: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

COPY SQL Command

• PostgreSQL command

• Optimized for loading a large number of rows

• Loads all rows in one command (not parallel)

• Loads data from a file or from standard input

• Supports error handling as do external tables

EXAMPLE:COPY mytable FROM '/data/myfile.csv' WITH CSV HEADER;
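Greenplum also accepts the external-table style single-row error handling clause on COPY; a hedged sketch, assuming a hypothetical err_mytable error table:

```sql
-- Single-row error isolation on COPY (hypothetical table names)
COPY mytable FROM '/data/myfile.csv'
WITH CSV HEADER
LOG ERRORS INTO err_mytable SEGMENT REJECT LIMIT 10 ROWS;
```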

Page 97: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

When to use SQL COPY

• When scripting a process that loads small amounts of data (< 10K rows).
• One-time data loads.
• Use COPY to load multiple rows in one command, instead of using a series of INSERT commands.
• The COPY command is optimized for loading large numbers of rows; it is less flexible than INSERT, but incurs significantly less overhead for larger data loads.
• Since COPY is a single command, there is no need to disable autocommit if you use this method to populate a table.

Page 98: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Why use an ETL tool?

• ETL tool may be the corporate standard
• ETL tool may be used to do standard data quality processes
• ETL tool may be used for code standardization (multiple source systems with different values for the same code – i.e. status code)
• ETL tool may be used for householding or new identification keys (unique customer ids, product ids)
• Use ETL tool to provide connectivity options, for example MVS connectivity.

Page 99: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Architecting ELT Solutions with an ETL tool

• Use ETL tool to create files for external tables
• Use the ETL tool to execute SQL commands
• Use ETL tool for job scheduling of external table loads

Page 100: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

System Generated Keys in Greenplum

• Primary Key contains many columns?

• Distribution Key rarely used in queries or joins?

• Solution: A System Generated Key
  Use a Sequence
  Use SQL

Page 101: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

About Sequences

Greenplum Master sequence generator process (seqserver)

Sequence SQL Commands:
  CREATE SEQUENCE
  ALTER SEQUENCE
  DROP SEQUENCE

Sequence Function Limitations:
  lastval and currval not supported
  setval cannot be used in queries that update data
  nextval not allowed in UPDATEs or DELETEs if mirrors are enabled
  nextval may grab a block of values for some queries

PSQL Tips:
  To list all sequences while in psql: \ds
  To see a sequence definition: \d+ sequence_name

Page 102: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Sequence Example

Create a sequence named ‘myseq’:
CREATE SEQUENCE myseq START 101;

Insert a row into a table that gets the next value:
INSERT INTO distributors VALUES (nextval('myseq'), 'acme');

Reset the sequence counter value on the master:
SELECT setval('myseq', 201);

Not allowed in Greenplum DB (set sequence value on segments):
INSERT INTO product VALUES (setval('myseq', 201), 'gizmo');

NOTE: THERE COULD BE GAPS IN THE SEQUENCE NUMBERS!
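A sequence can also supply generated keys automatically by making nextval the column default; a sketch with a hypothetical table, reusing the myseq sequence from above:

```sql
-- Hypothetical table whose key defaults to the next sequence value
CREATE TABLE distributors_log (
    log_id  bigint DEFAULT nextval('myseq'),
    did     integer,
    dname   text
)
DISTRIBUTED BY (log_id);
```

With the default in place, INSERTs that omit log_id are keyed automatically, subject to the same possible-gaps caveat noted above.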

Page 103: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Using SQL

Create a temporary table with the current maximum value:
CREATE TEMPORARY TABLE MAXI AS
   (SELECT COALESCE(MAX(CustomerID),0) AS MaxID
    FROM dimensions.Customer)
DISTRIBUTED BY (MaxID);

Join the single-row table holding the max value to every row in the source table, add the rank, and voila – no gaps!
SELECT (RANK() OVER (ORDER BY phone)) + MAXI.MaxID,
       custName, address, city, state, zipcode,
       zipPlusFour, countrycd, phone
FROM public.customer_external CROSS JOIN MAXI;
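The generated keys would then typically feed an INSERT into the target dimension; a sketch under the same assumed table and column names:

```sql
-- Load the source rows into the dimension with gap-free generated keys
INSERT INTO dimensions.Customer
SELECT (RANK() OVER (ORDER BY phone)) + MAXI.MaxID AS CustomerID,
       custName, address, city, state, zipcode,
       zipPlusFour, countrycd, phone
FROM public.customer_external CROSS JOIN MAXI;
```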

Page 104: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

LAB

Module 4

Data Loading: “ETL” versus “ELT”

Page 105: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Module 5

Explain the Explain Plan
Analyzing Queries

Page 106: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

The Greenplum Explain Utility

• Allows the user to view how the optimizer will handle the query.

• Shows the estimated cost (measured in units of disk page fetches) along with the estimated rows and row width for each plan node.

• EXPLAIN ANALYZE will show the difference between the estimates and the actual run costs.

Page 107: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

EXPLAIN or EXPLAIN ANALYZE

• Use EXPLAIN ANALYZE to run the query and generate query plans and planner estimates
  EXPLAIN ANALYZE <query>;

• Run EXPLAIN ANALYZE on queries to identify opportunities to improve query performance

• EXPLAIN ANALYZE is also an engineering tool used by Greenplum developers
  Focus on the query plan elements that are relevant from an end user perspective to improve query performance

Page 108: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

EXPLAIN Output

• Query plans are a right to left plan tree of nodes read from the bottom up

• Each node feeds its results to the node directly above

• There is one line for each node in the plan tree

• Each node represents a single operation
  Sequential Scan, Hash Join, Hash Aggregation, etc.

• The plan will also include motion nodes responsible for moving rows between segment instances
  Redistribute Motion, Broadcast Motion, Gather Motion

Page 109: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

EXPLAIN Example

 Gather Motion 48:1  (slice1)  (cost=14879102.69..14970283.82 rows=607875 width=548)

   Merge Key: b.run_id, b.pack_id, b.local_time, b.session_id, b."domain"

   ->  Unique  (cost=14879102.69..14970283.82 rows=607875 width=548)

         Group By: b.run_id, b.pack_id, b.local_time, b.session_id, b."domain"

         ->  Sort  (cost=14879102.69..14894299.54 rows=6078742 width=548)

               Sort Key (Distinct): b.run_id, b.pack_id, b.local_time, b.session_id, b."domain"

               ->  Result  (cost=0.00..3188311.04 rows=6078742 width=548)

                     ->  Append  (cost=0.00..3188311.04 rows=6078742 width=548)

                           ->  Seq Scan on display_run b  (cost=0.00..1.02 rows=1 width=548)

                                  Filter: local_time >= '2007-03-01 00:00:00'::timestamp without time zone AND local_time < '2007-04-01 00:00:00'::timestamp without time zone AND (url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text)

                            ->  Seq Scan on display_run_child_2007_03_month b  (cost=0.00..3188310.02 rows=6078741 width=50)

                                  Filter: local_time >= '2007-03-01 00:00:00'::timestamp without time zone AND local_time < '2007-04-01 00:00:00'::timestamp without time zone AND (url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text)

Page 110: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Begin by Examining Rows and Cost

Aggregate (cost=16141.02..16141.03 rows=1 width=0)

-> Gather Motion 2:1 (slice2) (cost=16140.97..16141.00 rows=1 width=0)

-> Aggregate (cost=16140.97..16140.98 rows=1 width=0)

-> Hash Join (cost=2999.46..15647.43 rows=197414 width=0)

Hash Cond: ps.ps_partkey = part.p_partkey

-> Hash Join (cost=240.46..9920.71 rows=200020 width=4)

Hash Cond: ps.ps_suppkey = supplier.s_suppkey

-> Seq Scan on partsupp ps (cost=0.00..6180.00 rows=400000

width=8)

-> Hash (cost=177.97..177.97 rows=4999 width=4)

-> Broadcast Motion 2:2 (slice1) (cost=0.00..177.97 rows=4999

width=4)

-> Seq Scan on supplier (cost=0.00..77.99 rows=4999

width=4)

-> Hash (cost=1509.00..1509.00 rows=100000 width=4)

-> Seq Scan on part (cost=0.00..1509.00 rows=100000 width=4)

Identify plan nodes where the estimated cost is very high and the number of rows is very large (where the majority of time is spent)

Page 111: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Validate Partition Elimination

 Gather Motion 48:1  (slice1)  (cost=174933650.92..176041040.58 rows=7382598 width=548)

   Merge Key: b.run_id, b.pack_id, b.local_time, b.session_id, b."domain"

   ->  Unique  (cost=174933650.92..176041040.58 rows=7382598 width=548)

         Group By: b.run_id, b.pack_id, b.local_time, b.session_id, b."domain"

         ->  Sort  (cost=174933650.92..175118215.86 rows=73825977 width=548)

               Sort Key (Distinct): b.run_id, b.pack_id, b.local_time, b.session_id, b."domain"

               ->  Result  (cost=0.00..31620003.26 rows=73825977 width=548)

                     ->  Append  (cost=0.00..31620003.26 rows=73825977 width=548)

                           ->  Seq Scan on display_run b  (cost=0.00..1.02 rows=1 width=548)

                                  Filter: url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text

                           ->  Seq Scan on display_run_child_2007_03_month b  (cost=0.00..2635000.02 rows=6079950 width=50)

                                  Filter: url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text

                           ->  Seq Scan on display_run_child_2007_04_month b  (cost=0.00..2635000.02 rows=6182099 width=50)

                                  Filter: url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text

                           ->  Seq Scan on display_run_child_2007_05_month b  (cost=0.00..2635000.02 rows=6186057 width=50)

                                  Filter: url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text

                           ->  Seq Scan on display_run_child_2007_06_month b  (cost=0.00..2635000.02 rows=6153744 width=50)

                                  Filter: url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text

                           ->  Seq Scan on display_run_child_2007_07_month b  (cost=0.00..2635000.02 rows=6124871 width=50)

                                  Filter: url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text

                           ->  Seq Scan on display_run_child_2007_08_month b  (cost=0.00..2635000.02 rows=6147722 width=50)

                                  Filter: url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text

                           ->  Seq Scan on display_run_child_2007_09_month b  (cost=0.00..2635000.02 rows=6244890 width=50)

                                  Filter: url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text

                           ->  Seq Scan on display_run_child_2007_10_month b  (cost=0.00..2635000.02 rows=6083077 width=50)

                                  Filter: url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text

                           ->  Seq Scan on display_run_child_2007_11_month b  (cost=0.00..2635000.02 rows=6161248 width=50)

                                  Filter: url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text

                           ->  Seq Scan on display_run_child_2007_12_month b  (cost=0.00..2635000.02 rows=6156930 width=50)

                                  Filter: url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text

All child partitions are scanned

Page 112: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Partition Elimination

 Gather Motion 48:1  (slice1)  (cost=14879102.69..14970283.82 rows=607875 width=548)

   Merge Key: b.run_id, b.pack_id, b.local_time, b.session_id, b."domain"

   ->  Unique  (cost=14879102.69..14970283.82 rows=607875 width=548)

         Group By: b.run_id, b.pack_id, b.local_time, b.session_id, b."domain"

         ->  Sort  (cost=14879102.69..14894299.54 rows=6078742 width=548)

               Sort Key (Distinct): b.run_id, b.pack_id, b.local_time, b.session_id, b."domain"

               ->  Result  (cost=0.00..3188311.04 rows=6078742 width=548)

                     ->  Append  (cost=0.00..3188311.04 rows=6078742 width=548)

                           ->  Seq Scan on display_run b  (cost=0.00..1.02 rows=1 width=548)

                                  Filter: local_time >= '2007-03-01 00:00:00'::timestamp without time zone AND local_time < '2007-04-01 00:00:00'::timestamp without time zone AND (url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text)

                            ->  Seq Scan on display_run_child_2007_03_month b  (cost=0.00..3188310.02 rows=6078741 width=50)

                                  Filter: local_time >= '2007-03-01 00:00:00'::timestamp without time zone AND local_time < '2007-04-01 00:00:00'::timestamp without time zone AND (url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text)

To eliminate partitions, the WHERE clause must be the same as the partition criteria AND must contain an explicit value (no subquery)
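For instance, with the monthly partitions above, a literal date range can be pruned to a single child partition, while hiding the same dates behind a subquery forces every partition to be scanned; a hedged sketch (the load_window table is hypothetical):

```sql
-- Eliminates all but the March 2007 partition (explicit literals)
SELECT * FROM display_run
WHERE local_time >= '2007-03-01' AND local_time < '2007-04-01';

-- Scans every partition (dates hidden inside a subquery)
SELECT * FROM display_run
WHERE local_time >= (SELECT min_dt FROM load_window);
```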

Page 113: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Optimal Plan Heuristics

• Fast Operators
  Sequential Scan
  Hash Join
  Hash Agg
  Redistribute Motion

• Slow Operators
  Nested Loop Join
  Merge Join
  Sort
  Broadcast Motion


Page 114: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Eliminate Nested Loop Joins

Gather Motion 24:1  (slice4)  (cost=4517.05..214553.33 rows=416845 width=371)

   ->  Nested Loop  (cost=4517.05..214553.33 rows=416845 width=371)

         InitPlan  (slice6)

           ->  Gather Motion 24:1  (slice3)  (cost=1.09..10.23 rows=1 width=4)

                 ->  Seq Scan on calendar_dim  (cost=1.09..10.23 rows=1 width=4)

                       Filter: calendar_date = $0

                       InitPlan  (slice5)

                         ->  Gather Motion 24:1  (slice2)  (cost=0.00..1.09 rows=1 width=8)

                               ->  Seq Scan on event_status  (cost=0.00..1.09 rows=1 width=8)

                                     Filter: event_name::text = 'bonusbuy'::text

         ->  Seq Scan on primary_contact_events  (cost=0.00..7866.45 rows=416845 width=109)

         ->  Materialize  (cost=4506.82..4507.06 rows=24 width=262)

               ->  Broadcast Motion 24:24  (slice1)  (cost=0.00..4506.80 rows=24 width=262)

                     ->  Seq Scan on contact_type  (cost=0.00..4506.55 rows=1 width=262)

                           Filter: call_type::text = 'Regional Group'::text AND created_dt::date = $1

Eliminate nested loop joins in favor of hash joins
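One hedged way to see whether the plan improves without the nested loop is to discourage that operator for the session and re-check the plan; enable_nestloop is a standard planner setting inherited from PostgreSQL:

```sql
-- Discourage nested loop joins for this session, then re-check the plan
SET enable_nestloop = off;
EXPLAIN SELECT ...;  -- re-run the original query here
```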

Page 115: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Replace Large Sorts

Redistribute Motion 48:48  (slice2)  (cost=8568413851.73..8683017216.78 rows=83317 width=154)

   ->  GroupAggregate  (cost=8568413851.73..8683017216.78 rows=83317 width=154)

         Group By: b."host", "customer_id"

         ->  Sort  (cost=8568413851.73..8591334233.13 rows=9168152560 width=154)

               Sort Key: b."host", "customer_id"

               ->  Redistribute Motion 48:48  (slice1)  (cost=2754445.61..1748783807.47

rows=9168152560 width=154)

                     Hash Key: b."host", "customer_id"

                     ->  Hash Join  (cost=2754445.61..1565420756.27 rows=9168152560

width=154)

                           Hash Cond: a.pack_id = b.pack_id

                           Join Filter: (a.local_time - b.local_time) >= '-00:10:00'::

Replace large sorts with HashAgg

Page 116: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Re-write Sort Example

• The optimizer dynamically re-writes a single count distinct to replace a sort with a HashAgg by pushing the distinct into a subquery

• For example
  SELECT count( distinct l_quantity ) FROM lineitem;
  SELECT count(*) FROM (SELECT l_quantity FROM lineitem GROUP BY l_quantity) as foo;

• The optimizer cannot rewrite multiple count distincts
  Re-write the query to push the distinct into a subquery that inserts the rows into a temp table
  Then query the temp table that contains the distinct values to obtain the count
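A hedged sketch of that manual rewrite for multiple count distincts, reusing lineitem and assuming a second column l_discount for illustration:

```sql
-- One temp table of distinct values per measure (columns illustrative)
CREATE TEMPORARY TABLE qty_vals AS
   SELECT l_quantity FROM lineitem GROUP BY l_quantity
DISTRIBUTED BY (l_quantity);

CREATE TEMPORARY TABLE disc_vals AS
   SELECT l_discount FROM lineitem GROUP BY l_discount
DISTRIBUTED BY (l_discount);

-- Each count(*) over a temp table replaces one count(distinct ...)
SELECT (SELECT count(*) FROM qty_vals)  AS distinct_quantities,
       (SELECT count(*) FROM disc_vals) AS distinct_discounts;
```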

Page 117: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Eliminate Large Table Broadcast Motion

->  Hash  (cost=18039.92..18039.92 rows=20 width=66)

                           ->  Redistribute Motion 24:24  (slice3)  (cost=0.55..18039.92 rows=20 width=66)

                                 Hash Key: cust_contact_activity.src_system_id::text

                                 ->  Hash Join  (cost=0.55..18039.52 rows=20 width=66)

                                       Hash Cond: cust_contact_activity.contact_id::text =

cust_contact.contact_id::text

                                       ->  Seq Scan on cust_contact_activity  (cost=0.00..15953.63 rows=833663

width=39)

                                       ->  Hash  (cost=0.25..0.25 rows=24 width=84)

                                             ->  Broadcast Motion 24:24 (slice2) (cost=0.00..0.25 rows=24 width=84)

                                                   ->  Seq Scan on cust_contact  (cost=0.00..0.00 rows=1 width=84)

Use gp_segments_for_planner to increase the cost of the motion to favor a redistribute motion

Small table broadcast is acceptable
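A hedged sketch of that nudge: raising gp_segments_for_planner makes a broadcast motion look more expensive relative to a redistribute motion, since broadcast cost scales with the assumed segment count (the value below is illustrative):

```sql
-- Cost motions as if there were more segments, then re-check the plan
SET gp_segments_for_planner = 100;
EXPLAIN SELECT ...;  -- re-run the query and compare the motion nodes
```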

Page 118: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Increase work_mem

 -> HashAggregate (cost=74852.40..84739.94 rows=791003 width=45)

Group By: l_orderkey, l_partkey, l_comment

Rows out: 2999671 rows (seg1) with 13345 ms to first row, 71558 ms to end, start offset by 3.533

ms.

Executor memory: 2645K bytes avg, 5019K bytes max (seg1).

Work_mem used: 2321K bytes avg, 4062K bytes max (seg1).

Work_mem wanted: 237859K bytes avg, 237859K bytes max (seg1) to lessen workfile I/O

affecting 1 workers.

. . .

-> Seq Scan on lineitem (cost=0.00..44855.70 rows=2999670 width=45)

Rows out: 2999671 rows (seg1) with 0.571 ms to first row, 4167 ms to end, start offset by 4.105

Slice statistics:

(slice0) Executor memory: 211K bytes.

(slice1) * Executor memory: 2840K bytes avg x 2 workers, 5209K bytes max (seg1).

Work_mem: 4062K bytes max, 237859K bytes wanted.

Settings: work_mem=4MB

Total runtime: 73326.082 ms

(24 rows)
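The "Work_mem wanted" figure above (237859K, roughly 232 MB, against only 4 MB used) suggests the session setting to try; a hedged sketch:

```sql
-- Raise per-operation memory for this session to reduce spilling
SET work_mem = '256MB';  -- sized from the "wanted" estimate above
EXPLAIN ANALYZE SELECT ...;  -- re-run and compare the spill statistics
```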

Page 119: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Spill Files

• Operations performed in memory are optimal
  Hash Join, Hash Agg, Sort, etc.

• If there is not sufficient memory, rows will be written out to disk as spill files
  There is a certain amount of overhead with any disk I/O operation

• Spill files are located within the pgsql_tmp directory for the database

[jesh:~/gpdb-data/seg1/base/16571/pgsql_tmp] ls -ltrh

total 334464

-rw------- 1 jeshle jeshle 163M Jan 9 23:41 pgsql_tmp_SortTape_Slice1_14022.205

[jesh:~/gpdb-data/seg1/base/16571/pgsql_tmp] ls -ltrh

total 341632

-rw------- 1 jeshle jeshle 167M Jan 9 23:41 pgsql_tmp_SortTape_Slice1_14022.205

[jesh:~/gpdb-data/seg1/base/16571/pgsql_tmp] ls -ltrh

total 341632

-rw------- 1 jeshle jeshle 167M Jan 9 23:41 pgsql_tmp_SortTape_Slice1_14022.205

[jesh:~/gpdb-data/seg1/base/16571/pgsql_tmp]

Page 120: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Identify Respill in Hash Agg Operations . . .

-> HashAggregate (cost=74852.40..84739.94 rows=791003 width=45)

Group By: l_orderkey, l_partkey, l_comment

Rows out: 2999671 rows (seg1) with 13345 ms to first row, 71558 ms to end . . .

Executor memory: 2645K bytes avg, 5019K bytes max (seg1).

Work_mem used: 2321K bytes avg, 4062K bytes max (seg1).

Work_mem wanted: 237859K bytes avg, 237859K bytes max (seg1) to lessen workfile I/O

affecting 1 workers.

(seg1) 2999671 groups total in 5 batches; 64 respill passes; 23343536 respill rows.

(seg1) Initial pass: 44020 groups made from 44020 rows; 2955651 rows spilled to workfile.

(seg1) Hash chain length 5.0 avg, 18 max, using 602986 of 607476 buckets.

-> Seq Scan on lineitem (cost=0.00..44855.70 rows=2999670 width=45)

Rows out: 2999671 rows (seg1) with 0.571 ms to first row, 4167 ms to end, start offset by 4.105

Slice statistics:

(slice0) Executor memory: 211K bytes.

(slice1) * Executor memory: 2840K bytes avg x 2 workers, 5209K bytes max (seg1). Work_mem: 4062K

bytes max, 237859K bytes wanted.

Settings: work_mem=4MB

Page 121: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Eliminate Respill in Hash Agg Operations

 -> HashAggregate (cost=74852.40..84739.94 rows=791003 width=45)

Group By: l_orderkey, l_partkey, l_comment

Rows out: 2999671 rows (seg1) with 9374 ms to first row, 19169 ms to end, start offset by ms.

Executor memory: 5690K bytes avg, 8063K bytes max (seg1).

Work_mem used: 2321K bytes avg, 4062K bytes max (seg1) .

Work_mem wanted: 237859K bytes avg, 237859K bytes max (seg1) to lessen workfile I/O

affecting 1 workers.

(seg1) 2999671 groups total in 100 batches; 0 respill passes; 0 respill rows.

(seg1) Initial pass: 44020 groups made from 44020 rows; 2955651 rows spilled to workfile.

(seg1) Hash chain length 3.5 avg, 16 max, using 850893 of 880400 buckets.

-> Seq Scan on lineitem (cost=0.00..44855.70 rows=2999670 width=45)

Rows out: 2999671 rows (seg1) with 0.480 ms to first row, 4199 ms to end, start offset by 18 ms.

Slice statistics:

(slice0) Executor memory: 211K bytes.

(slice1) * Executor memory: 5884K bytes avg x 2 workers, 8254K bytes max (seg1). Work_mem: 4062K

bytes max, 237859K bytes wanted.

Settings: work_mem=4MB

Total runtime: 20987.615 ms

Use gp_hashagg_spillbatch_min and gp_hashagg_spillbatch_max

Page 122: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Review Join Order

Aggregate (cost=16141.02..16141.03 rows=1 width=0)

-> Gather Motion 2:1 (slice2) (cost=16140.97..16141.00 rows=1 width=0)

-> Aggregate (cost=16140.97..16140.98 rows=1 width=0)

-> Hash Join (cost=2999.46..15647.43 rows=197414 width=0)

Hash Cond: ps.ps_partkey = part.p_partkey

-> Hash Join (cost=240.46..9920.71 rows=200020 width=4)

Hash Cond: ps.ps_suppkey = supplier.s_suppkey

-> Seq Scan on partsupp ps (cost=0.00..6180.00 rows=400000

width=8)

-> Hash (cost=177.97..177.97 rows=4999 width=4)

-> Broadcast Motion 2:2 (slice1) (cost=0.00..177.97 rows=4999

width=4)

-> Seq Scan on supplier (cost=0.00..77.99 rows=4999

width=4)

-> Hash (cost=1509.00..1509.00 rows=100000 width=4)

-> Seq Scan on part (cost=0.00..1509.00 rows=100000 width=4)

Build and probe on the largest table(s) last

Page 123: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Use join_collapse_limit to Specify Join Order

• Use join_collapse_limit to specify join order

• Must be ANSI style join

• For example

SELECT COUNT(*) FROM partsupp ps

JOIN supplier ON ps_suppkey = s_suppkey

JOIN part ON p_partkey = ps_partkey;

• Non-ANSI style join
SELECT COUNT(*) FROM partsupp, supplier, part

WHERE ps_suppkey = s_suppkey

AND p_partkey = ps_partkey;

• Set join_collapse_limit = 1
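Putting the pieces together: with the collapse limit at 1, the planner joins the tables in exactly the order the ANSI JOIN syntax lists them; a hedged sketch (the particular order shown is illustrative, following the earlier heuristic of building and probing on the largest table last):

```sql
-- Force the written join order for this session
SET join_collapse_limit = 1;
SELECT COUNT(*) FROM supplier
JOIN partsupp ps ON ps_suppkey = s_suppkey
JOIN part ON p_partkey = ps_partkey;
```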

Page 124: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Analyzing Query Plans Summary

• Identify plan nodes with a large number of rows and high cost
• Validate partitions are being eliminated
• Eliminate large table broadcast motion
• Replace slow operators to favor fast operators
  Nested Loop Join, Sort Merge Join, Sort, Agg
• Identify spill files and increase memory
  work_mem used versus work_mem wanted
• Eliminate respill in HashAggregate
• Review join order

Page 125: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

LAB

Module 5

Explain the Explain Plan
Analyzing Queries

Page 126: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Module 6

Improve Performance With Statistics

Page 127: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

EXPLAIN ANALYZE Estimated Costs

• EXPLAIN ANALYZE provides cost estimates for the execution of the plan node as follows

• Cost
  Measured in units of disk page fetches

• Rows
  The number of rows output by the plan node

• Width
  Total bytes of all the rows output by the plan node

Greenplum Confidential

ANALYZE and Database Statistics

• Database statistics are used by the optimizer and query planner

• Run ANALYZE after loading data

• Run ANALYZE after large INSERT, UPDATE and DELETE operations

• Run ANALYZE after CREATE INDEX operations

• Run ANALYZE after database restores from backups

ANALYZE is the most important ongoing administrative task for optimal query performance

Page 129: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

ANALYZE and VACUUM ANALYZE

• Use the ANALYZE command to generate database statistics
  May specify table and column names
  ANALYZE [table [ (column [, ...] ) ]]

• Use VACUUM ANALYZE to vacuum the database and generate database statistics
  May specify table and column names
  VACUUM ANALYZE [table [(column [, ...] )]]
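Concrete invocations of the two forms; the table and column names are illustrative only:

```sql
-- Statistics for two specific columns of one table
ANALYZE customer (customer_id, cust_name);

-- Reclaim space and refresh statistics for a whole table
VACUUM ANALYZE sales;
```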

Page 130: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Use SET STATISTICS to Increase Sampling

• Use ALTER TABLE … SET STATISTICS to increase sampling for statistics collected for a given column
  Default statistics value is 25

• Increasing the statistics value may improve query planner estimates
  For columns used in query predicates and JOINs (WHERE clause)

• Example:
  ALTER TABLE customer ALTER customer_id SET STATISTICS 35;

• Larger values increase the time it takes to ANALYZE

Page 131: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

default_statistics_target

• Use the default_statistics_target server configuration parameter to increase sampling for statistics collected for ALL columns

• Increasing the target value may improve query planner estimates

• Default is 25

• Larger values increase the time it takes to ANALYZE

Page 132: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

gp_analyze_relative_error

• The gp_analyze_relative_error server configuration parameter sets the estimated acceptable error in the cardinality of the table

• A value of 0.5 is equivalent to an acceptable error of 50%

• The default value is 0.5

• Decreasing the relative error fraction (accepting less error) will increase the number of rows sampled

gp_analyze_relative_error = .25

Page 133: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

ANALYZE Best Practices

• For large data warehouse environments it may

not be feasible to run ANALYZE on the entire

database or on all columns in a table

• Run ANALYZE for:
  Columns used in a JOIN condition
  Columns used in a WHERE clause
  Columns used in a SORT clause
  Columns used in a GROUP BY or HAVING clause
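Following these guidelines, a targeted ANALYZE for a reporting query that joins on storeid and customerid and filters on transdate might look like (column names assumed from the course schema):

```sql
-- Collect statistics only for the join and predicate columns
ANALYZE facts.transaction (storeid, customerid, transdate);
```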


Page 134: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

LAB

Module 6

Improve PerformanceWith Statistics

Page 135: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Module 7

Indexing Strategies

Page 136: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Indexes

• Most data warehouse environments operate on large volumes of data
  Low selectivity
  Sequential scan is the preferred method to read the data in a Greenplum MPP environment

• For queries with high selectivity, indexes may improve performance
  Avoid indexes on frequently updated columns
  Avoid overlapping indexes
  Use bitmap indexes for columns with very low cardinality
  Drop indexes before loading data and recreate indexes after the load
  Analyze after re-creating indexes

Page 137: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Greenplum Index Types

• B-Tree
  Use for fairly unique columns
  Use for columns that serve single-row lookups

• Bitmap
  Use for low cardinality columns
  Use when the column is often included in predicates

• Hash
  Available, but not recommended

Page 138: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Create Index Syntax

CREATE [UNIQUE] INDEX [CONCURRENTLY] name
ON table
[USING method] ( {column | (expression)} [opclass] [, ...] )
[ WITH ( FILLFACTOR = value ) ]
[ WHERE predicate ]

Page 139: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

B-Tree Index

• Supports single value row lookups

• Can be Unique or Non-Unique

• Can be Single or Multi-Column
  If multi-column, the leading column(s) of the index must be included in the predicate for the index to be used.

EXAMPLE:

CREATE INDEX transid_btridx

ON facts.transaction

USING BTREE (transactionid)

;
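A multi-column variant, sketched against the same fact table (the column choice and index name are illustrative):

```sql
CREATE INDEX trans_store_cust_btridx
ON facts.transaction
USING BTREE (storeid, customerid);
```

Queries should reference the leading column (storeid) in their predicates for this index to be considered.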

Page 140: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Bitmap Index

• Single column index is recommended

• Provides very fast retrieval

• Low cardinality columns
  Gender
  State / Province Code

EXAMPLE:

CREATE INDEX store_pharm_bmidx ON dimensions.store

USING BITMAP (pharmacy);

CREATE INDEX store_grocery_bmidx ON dimensions.store

USING BITMAP (grocery);

CREATE INDEX store_deli_bmidx ON dimensions.store

USING BITMAP (deli);

Page 141: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Index on Expressions

• Only use when the expression appears often in

query predicates.

• Very high overhead maintaining the index

during insert and update operations.

EXAMPLE:

CREATE INDEX lcase_storename_idx

ON store (LOWER(storename));

SUPPORTS:

SELECT * FROM store WHERE LOWER(storename) = 'top foods';

Page 142: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Index with Predicate (Partial Index)

• Pre-selects rows based on predicate

• Use to select small numbers of rows from large

tables

EXAMPLE:

CREATE INDEX canada_stores_idx
ON facts.transaction (storeid)
WHERE storeid IN (8, 32);
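A query whose predicate matches the index condition can then use it; for example:

```sql
-- The WHERE clause implies the partial-index predicate
SELECT COUNT(*)
FROM facts.transaction
WHERE storeid IN (8, 32);
```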

Page 143: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

To Index or Not to Index

• Will the optimizer use the index?
  Test it with production volumes
  No hints in Greenplum!

• Is the column(s) used in query predicates?
  Does the frequency of use justify the overhead?
  Is the space available?

• Should a unique B-Tree index be the primary key?
  Consider a Primary Key constraint instead

Page 144: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Index Costs

• Indexes are not free!

• Indexes take space

• Indexes incur overhead during inserts and

updates

• Indexes incur processing overhead during

creation

Page 145: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Index Best Practices

• Only create indexes when full table scans do not perform well.

• Name the index.

• Use B-Tree indexes for single column lookups

• Use Bitmap Indexes for low cardinality columns

• Use Partial Indexes for queries that frequently

access small subsets of much larger data sets.

Page 146: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

LAB

Module 7

Indexing Strategies

Page 147: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Module 8

Advanced Reporting Using OLAP

Page 148: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

What is OLAP?

• On-Line Analytic Processing
  Quickly provides answers to multi-dimensional queries

• “Window” Functions
  Allow access to multiple rows in a single pass

• OLAP Grouping Sets
  More flexible than standard GROUP BY

Page 149: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

SELECT - OLAP Grouping Extensions

• Standard GROUP BY Example

• ROLLUP

• GROUPING SETS

• CUBE

• grouping(column [, ...]) function

• group_id() function

Page 150: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Standard GROUP BY Example

summarize product sales by vendor…

SELECT pn, vn, sum(prc*qty)

FROM sale

GROUP BY pn, vn

ORDER BY 1,2,3;

pn | vn | sum

-----+----+---------

100 | 20 | 0

100 | 40 | 2640000

200 | 10 | 0

200 | 40 | 0

300 | 30 | 0

400 | 50 | 0

500 | 30 | 120

600 | 30 | 60

700 | 40 | 1

800 | 40 | 1

(10 rows)

Page 151: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Standard GROUP BY Example Continued

now include subtotals and grand total…

SELECT pn, vn, sum(prc*qty)
FROM sale
GROUP BY pn, vn
UNION ALL
SELECT pn, null, sum(prc*qty)
FROM sale
GROUP BY pn
UNION ALL
SELECT null, null, sum(prc*qty)
FROM sale
ORDER BY 1,2,3;

pn | vn | sum

-----+----+---------

100 | 20 | 0

100 | 40 | 2640000

100 | | 2640000

200 | 10 | 0

200 | 40 | 0

200 | | 0

300 | 30 | 0

300 | | 0

400 | 50 | 0

400 | | 0

500 | 30 | 120

500 | | 120

600 | 30 | 60

600 | | 60

700 | 40 | 1

700 | | 1

800 | 40 | 1

800 | | 1

| | 2640182

(19 rows)

Page 152: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

ROLLUP Example

the same query using ROLLUP…

SELECT pn, vn, sum(prc*qty)

FROM sale

GROUP BY ROLLUP(pn, vn)

ORDER BY 1,2,3;

pn | vn | sum

-----+----+---------

100 | 20 | 0

100 | 40 | 2640000

100 | | 2640000

200 | 10 | 0

200 | 40 | 0

200 | | 0

300 | 30 | 0

300 | | 0

400 | 50 | 0

400 | | 0

500 | 30 | 120

500 | | 120

600 | 30 | 60

600 | | 60

700 | 40 | 1

700 | | 1

800 | 40 | 1

800 | | 1

| | 2640182

(19 rows)

Page 153: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

GROUPING SETS Example

the same query using GROUPING SETS…

SELECT pn, vn, sum(prc*qty)

FROM sale

GROUP BY GROUPING SETS

( (pn, vn), (pn), () )

ORDER BY 1,2,3;

pn | vn | sum

-----+----+---------

100 | 20 | 0

100 | 40 | 2640000

100 | | 2640000

200 | 10 | 0

200 | 40 | 0

200 | | 0

300 | 30 | 0

300 | | 0

400 | 50 | 0

400 | | 0

500 | 30 | 120

500 | | 120

600 | 30 | 60

600 | | 60

700 | 40 | 1

700 | | 1

800 | 40 | 1

800 | | 1

| | 2640182

(19 rows)

Page 154: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

CUBE Example

CUBE creates subtotals for all possible combinations of grouping columns, so…

SELECT pn, vn, sum(prc*qty)
FROM sale
GROUP BY CUBE(pn, vn)
ORDER BY 1,2,3;

is the same as…

SELECT pn, vn, sum(prc*qty)
FROM sale
GROUP BY GROUPING SETS
( (pn, vn), (pn), (vn), () )
ORDER BY 1,2,3;

pn | vn | sum

-----+----+---------

100 | 20 | 0

100 | 40 | 2640000

100 | | 2640000

200 | 10 | 0

200 | 40 | 0

200 | | 0

300 | 30 | 0

300 | | 0

400 | 50 | 0

400 | | 0

500 | 30 | 120

500 | | 120

600 | 30 | 60

600 | | 60

700 | 40 | 1

700 | | 1

800 | 40 | 1

800 | | 1

| 10 | 0

| 20 | 0

| 30 | 180

| 40 | 2640002

| 50 | 0

| | 2640182

(24 rows)

Page 155: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

GROUPING Function Example

SELECT * FROM dsales_null;

store | customer | product | price

-------+----------+---------+-------

s2 | c1 | p1 | 90

s2 | c1 | p2 | 50

s2 | | p1 | 44

s1 | c2 | p2 | 70

s1 | c3 | p1 | 40

(5 rows)

SELECT store,customer,product, sum(price),

grouping(customer)

FROM dsales_null

GROUP BY ROLLUP(store,customer,product);

store | customer | product | sum | grouping

-------+----------+---------+-----+----------

s1 | c2 | p2 | 70 | 0

s1 | c2 | | 70 | 0

s1 | c3 | p1 | 40 | 0

s1 | c3 | | 40 | 0

s1 | | | 110 | 1

s2 | c1 | p1 | 90 | 0

s2 | c1 | p2 | 50 | 0

s2 | c1 | | 140 | 0

s2 | | p1 | 44 | 0

s2 | | | 44 | 0

s2 | | | 184 | 1

| | | 294 | 1

(12 rows)

Distinguishes NULL values from summary

markers…

Page 156: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

GROUP_ID Function

• Returns 0 for each output row in a unique grouping set
• Assigns a serial number >0 to each duplicate grouping set found
• Useful when combining grouping extension clauses
• Can be used to filter output rows of duplicate grouping sets:

SELECT a, b, c, sum(p*q), group_id()
FROM sales
GROUP BY ROLLUP(a,b), CUBE(b,c)
HAVING group_id() < 1
ORDER BY a, b, c;

Page 157: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

SELECT – OLAP Windowing Extensions

• About Window Functions

• Constructing a Window Specification
  OVER clause
  WINDOW clause

• Built-in Window Functions

Page 158: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

About Window Functions

New class of function allowed only in the SELECT list

Returns a value per row (unlike aggregate functions)

Results interpreted in terms of the current row and its corresponding window partition or frame

Characterized by the use of the OVER clause
  Defines the window partitions (groups of rows) to apply the function
  Defines ordering of data within a window
  Defines the positional or logical framing of a row in respect to its window

Page 159: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Defining Window Specifications (OVER Clause)

All window functions have an OVER() clause

Specifies the ‘window’ of data to which the function applies

Defines:

Window partitions (PARTITION BY clause)

Ordering within a window partition (ORDER BY clause)

Framing within a window partition (ROWS/RANGE clauses)

Page 160: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

OVER (PARTITION BY…) Example

SELECT *, row_number() OVER()
FROM sale
ORDER BY cn;

SELECT *, row_number() OVER(PARTITION BY cn)
FROM sale
ORDER BY cn;

row_number | cn | vn | pn | dt | qty | prc

------------+----+----+-----+------------+------+------

1 | 1 | 10 | 200 | 1401-03-01 | 1 | 0

2 | 1 | 30 | 300 | 1401-05-02 | 1 | 0

3 | 1 | 50 | 400 | 1401-06-01 | 1 | 0

4 | 1 | 30 | 500 | 1401-06-01 | 12 | 5

5 | 1 | 20 | 100 | 1401-05-01 | 1 | 0

1 | 2 | 50 | 400 | 1401-06-01 | 1 | 0

2 | 2 | 40 | 100 | 1401-01-01 | 1100 | 2400

1 | 3 | 40 | 200 | 1401-04-01 | 1 | 0

(8 rows)

row_number | cn | vn | pn | dt | qty | prc

------------+----+----+-----+------------+------+------

1 | 1 | 10 | 200 | 1401-03-01 | 1 | 0

2 | 1 | 30 | 300 | 1401-05-02 | 1 | 0

3 | 1 | 50 | 400 | 1401-06-01 | 1 | 0

4 | 1 | 30 | 500 | 1401-06-01 | 12 | 5

5 | 1 | 20 | 100 | 1401-05-01 | 1 | 0

6 | 2 | 50 | 400 | 1401-06-01 | 1 | 0

7 | 2 | 40 | 100 | 1401-01-01 | 1100 | 2400

8 | 3 | 40 | 200 | 1401-04-01 | 1 | 0

(8 rows)

Page 161: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

OVER (ORDER BY…) Example

SELECT vn, sum(prc*qty)
FROM sale
GROUP BY vn
ORDER BY 2 DESC;

SELECT vn, sum(prc*qty),
       rank() OVER (ORDER BY sum(prc*qty) DESC)
FROM sale
GROUP BY vn
ORDER BY 2 DESC;

vn | sum | rank

----+---------+------

40 | 2640002 | 1

30 | 180 | 2

50 | 0 | 3

20 | 0 | 3

10 | 0 | 3

(5 rows)

vn | sum

----+---------

40 | 2640002

30 | 180

50 | 0

20 | 0

10 | 0

(5 rows)

Page 162: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential


OVER (…ROWS…) Example

Window Framing: “Box car” Average

SELECT vn, dt,
  AVG(prc*qty) OVER (
    PARTITION BY vn
    ORDER BY dt
    ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING)
FROM sale;

vn | dt | avg

----+------------+---------

10 | 03012008 | 30

20 | 05012008 | 20

30 | 05022008 | 0

30 | 06012008 | 60

30 | 06012008 | 60

30 | 06012008 | 60

40 | 06012008 | 140

40 | 06042008 | 90

40 | 06052008 | 120

40 | 06052008 | 100

50 | 06012008 | 30

50 | 06012008 | 10

(12 rows)


Page 163: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Built-In Window Functions

cume_dist()
dense_rank()
first_value(expr)
lag(expr [,offset] [,default])
last_value(expr)
lead(expr [,offset] [,default])
ntile(expr)
percent_rank()
rank()
row_number()

* Any aggregate function (used with the OVER clause) can also be used as a window function
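As a sketch of one of these, lag() compares each row with an earlier row in the same partition; here, each sale against the customer's previous sale by date (using the sale table from the earlier examples):

```sql
SELECT cn, dt, prc*qty AS amt,
       -- previous sale amount for the same customer; 0 when there is none
       lag(prc*qty, 1, 0) OVER (PARTITION BY cn ORDER BY dt) AS prev_amt
FROM sale;
```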

Page 164: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Designating the “Moving” Window

<function> OVER (
  PARTITION BY column1..n   -- acts like a “group by” for the function
  ORDER BY ...              -- required by some functions
  ROWS BETWEEN ...          -- see below
)

ROWS BETWEEN options:
  UNBOUNDED PRECEDING AND CURRENT ROW
  N PRECEDING AND CURRENT ROW
  CURRENT ROW AND N FOLLOWING
  N PRECEDING AND N FOLLOWING

Page 165: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Moving Window Example

REQUIREMENT: By store, rank all customers for the month of June.

EXAMPLE:

SELECT s.storename

,c.custname

,t.totalsalesamt

,RANK() OVER (PARTITION BY t.storeid

ORDER BY t.totalsalesamt DESC) AS ranking

FROM (SELECT storeid

,customerid

,SUM(salesamt) AS totalsalesamt

FROM transaction

WHERE transdate BETWEEN '2008-06-01' AND '2008-06-30'

GROUP BY 1,2 ) t

INNER JOIN store s ON s.storeid = t.storeid

INNER JOIN customer c ON c.customerid = t.customerid

;

Page 166: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Global Window Specifications (WINDOW clause)

Useful for multiple window function queries
  Define and name a window specification
  Reuse the window specification throughout the query

EXAMPLE:

SELECT RANK() OVER (ORDER BY pn),
       SUM(prc*qty) OVER (ORDER BY pn),
       AVG(prc*qty) OVER (ORDER BY pn)
FROM sale;

SELECT RANK() OVER (w1),
       SUM(prc*qty) OVER (w1),
       AVG(prc*qty) OVER (w1)
FROM sale
WINDOW w1 AS (ORDER BY pn);

Page 167: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

LAB

Module 8

Advanced Reporting Using OLAP

Page 168: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Module 9

Using Temporary Tables

Page 169: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Temporary Tables

• Local to the session
  Data may not be shared between sessions

• Temporary table DDL executed at run time

• Dropped when session ends

Page 170: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Why Use Temporary Tables

• When performing complicated transformations

• When multiple results sets need to be

combined

• To facilitate subsequent join operations
  Move data to get same-segment joins

Page 171: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Features

• Distributed the same as any Greenplum table

• May be indexed

• May be Analyzed

Page 172: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Guidelines

• Always designate DISTRIBUTED BY column(s)
  Will default to the first valid column

• Once data is inserted, ANALYZE the table
  Not available in other RDBMS temporary tables!

• Same CREATE DDL as a permanent table

• May not create a temporary external table

Page 173: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

CREATE Temporary Table Example

CREATE TEMPORARY TABLE monthlytranssummary (

storeid INTEGER,

customerid INTEGER,

transmonth SMALLINT,

salesamttot DECIMAL(10,2)

)

ON COMMIT PRESERVE ROWS

DISTRIBUTED BY (storeid,customerid)

;

ON COMMIT PRESERVE ROWS is the default action when creating

temporary tables.
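Following the guidelines above, the table would then be populated and analyzed; a sketch, assuming the transaction fact table columns used earlier in the course:

```sql
INSERT INTO monthlytranssummary
SELECT storeid,
       customerid,
       EXTRACT(month FROM transdate)::smallint,
       SUM(salesamt)
FROM facts.transaction
GROUP BY 1, 2, 3;

-- Give the optimizer statistics on the freshly loaded temporary table
ANALYZE monthlytranssummary;
```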

Page 174: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

LAB

Module 9

Using Temporary Tables

Page 175: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Module 10

PostgreSQL Functions

Page 176: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Types of Functions

Greenplum supports several function types:
  Query language functions (functions written in SQL)
  Procedural language functions (functions written in, for example, PL/pgSQL or PL/Tcl)
  Internal functions
  C-language functions

This class will only present the first two types of functions:

Query Language and Procedural Language

Page 177: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Query Language Function - Rules

• Executes an arbitrary set of SQL commands

• Returns the results of the last query in the list
  Returns the first row of the last query’s result set
  Can return a “set of” a previously defined data type

• The body of the function must be a list of SQL statements, each terminated by a semicolon

Page 178: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Query Language Function – More Rules

• May contain SELECT statements

• May contain DML statements
  INSERT, UPDATE, or DELETE statements

• May not contain transaction control commands
  ROLLBACK, SAVEPOINT, BEGIN, or COMMIT

• Last statement must be SELECT unless the

return type is void.

Page 179: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Query Language Function – Example

This function has no parameters and no return set:

CREATE FUNCTION public.clean_customer()
RETURNS void AS '
  DELETE FROM dimensions.customer
  WHERE state = ''NA'';
' LANGUAGE SQL;

It is executed as a SELECT statement:

SELECT public.clean_customer();

clean_customer

-----------

(1 row)

Page 180: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Query Language Function - Notes

• Quoted values in the query must use the

doubled single quote syntax.

• One or more parameters may be passed to a

function. The first parameter is referenced as $1 The second parameter is referenced as $2, and so

forth

Page 181: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Query Language Function – With Parameter

CREATE FUNCTION public.clean_specific_customer(which char(2))
RETURNS void AS '
  DELETE FROM dimensions.customer
  WHERE state = $1;
' LANGUAGE SQL;

It is executed as a SELECT statement:

SELECT public.clean_specific_customer('NA');

clean_specific_customer

-----------------------

(1 row)

Page 182: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

SQL Functions on Base Types

Used to return a single value:

CREATE FUNCTION public.customer_cnt(char)
RETURNS bigint AS '
  SELECT COUNT(*)
  FROM dimensions.customer
  WHERE state = $1;
' LANGUAGE SQL;

It is executed as a SELECT statement:

SELECT public.customer_cnt('WA');

customer_cnt

-----------

176

Page 183: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

SQL Functions on Composite Types

Allows you to pass in a composite parameter (such as a table row type):

CREATE FUNCTION facts.viewnewtaxamt(transaction)

RETURNS numeric AS $$

SELECT $1.taxamt * .90;

$$ LANGUAGE SQL;

Note the syntactical difference using the double $ signs instead of

single quotes in the DDL and how the function is called.

SELECT transid, transdate, taxamt,

facts.viewnewtaxamt(transaction)

FROM transaction;

Page 184: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

SQL Functions with Output Parameters

Very similar to previous example, no RETURNS clause.

CREATE FUNCTION public.cust_cnt(IN whichstate char,
                                OUT MyCount bigint)
AS '
  SELECT COUNT(*)
  FROM dimensions.customer
  WHERE state = $1;
' LANGUAGE SQL;

It is executed as a SELECT statement with only the input parameters

specified:

SELECT public.cust_cnt('WA');

 cust_cnt
----------
      176

Page 185: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

SQL Functions as Table Sources

All SQL functions can be used in the FROM clause of a query, but it is

particularly useful for functions returning composite types.

CREATE FUNCTION dimensions.getcust(int)

RETURNS customer AS $$

SELECT *

FROM dimensions.customer WHERE customerid = $1;

$$ LANGUAGE SQL;

Now we can access the columns in the function as if it were a table.

SELECT *,UPPER(custname)

FROM dimensions.getcust(100);

Page 186: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

SQL Functions – Some Notes

• Note that functions operating on tables must be

created in the same schema as the table.

• To DROP the function, the DROP statement must include the parameter types.

EXAMPLE:
DROP FUNCTION dimensions.getcust(int);

Page 187: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

SQL Functions Returning Sets

When an SQL function is declared as returning SETOF sometype, the

function’s final SELECT query is executed to completion, and each

row it outputs is returned as an element of the result set.

CREATE FUNCTION dimensions.getstatecust(char)

RETURNS SETOF customer AS $$

SELECT *

FROM dimensions.customer WHERE state = $1;

$$ LANGUAGE SQL;

SELECT *, UPPER(city)
FROM dimensions.getstatecust('WA');

This function returns all qualifying rows.

Page 188: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Function Overloading

• More than one function can be created with the same name
  Must have different input parameter types or number
  The optimizer selects which version to call based on the input data type and number of arguments

Page 189: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Function Overload - Example

CREATE FUNCTION dimensions.getcust(int)

RETURNS customer AS $$

SELECT *

FROM dimensions.customer WHERE customerid = $1;

$$ LANGUAGE SQL;

CREATE FUNCTION dimensions.getcust(char)

RETURNS customer AS $$

SELECT * FROM dimensions.customer WHERE state = $1;

$$ LANGUAGE SQL;

SELECT *, upper(custname) FROM dimensions.getcust('WA');

SELECT *, upper(custname) FROM dimensions.getcust(1);

Page 190: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Function Volatility Categories

• Volatility is a “promise” to the optimizer about

how the function will behave.

• Every function falls into one of these categories:
  VOLATILE
  STABLE
  IMMUTABLE

Page 191: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Function Volatility - VOLATILE

• A VOLATILE function can do anything,

including modifying the database.

• It can return different results on successive

calls with the same arguments.

• The optimizer makes no assumptions about the

behavior of such functions.

• A query using a volatile function will re-evaluate the function at every row where its value is needed.

Page 192: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Function Volatility - IMMUTABLE

• An IMMUTABLE function cannot modify the database.

• It is guaranteed to return the same results given the

same arguments forever.

• This category allows the optimizer to pre-evaluate the

function when a query calls it with constant arguments.

• For example, a query like SELECT ... WHERE x = 2 + 2; can be

simplified on sight to SELECT ... WHERE x = 4; because the

function underlying the integer addition operator is

marked IMMUTABLE.

Page 193: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Function Volatility - STABLE

• A STABLE function cannot modify the

database.

• It is guaranteed to return the same results

given the same arguments for all rows within a

single statement.

• This category allows the optimizer to optimize

multiple calls of the function to a single call.

• In particular, it is safe to use an expression

containing such a function in an index scan

condition.
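The category is declared when the function is created; if none is given, VOLATILE is assumed. For example, the customer_cnt function from the earlier slides only reads the database, so it could safely be marked STABLE:

```sql
CREATE FUNCTION public.customer_cnt(char)
RETURNS bigint AS '
  SELECT COUNT(*)
  FROM dimensions.customer
  WHERE state = $1;
' LANGUAGE SQL STABLE;  -- reads, never modifies, the database
```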

Page 194: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Procedural Language Functions

• PL/pgSQL – Procedural Language for Greenplum
  Can be used to create functions and procedures
  Adds control structures to the SQL language
  Can perform complex computations
  Inherits all user-defined types, functions, and operators
  Can be defined to be trusted by the server
  Is (relatively) easy to use

Page 195: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Advantages of PL/pgSQL Functions

• There can be considerable savings of client/server communication overhead
  Extra round trips between client and server are eliminated
  Intermediate results that the client does not need do not have to be marshaled or transferred between server and client
  Multiple rounds of query parsing can be avoided
  This can result in a considerable performance increase compared to an application that does not use stored functions

Page 196: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

Structure of PL/pgSQL

• PL/pgSQL is a block-structured language.

CREATE FUNCTION somefunc() RETURNS integer AS $$
<<outerblock>>
DECLARE
  quantity integer := 30;
BEGIN
  RAISE NOTICE 'Quantity here is %', quantity;  -- Prints 30
  quantity := 50;
  -- Create a subblock
  DECLARE
    quantity integer := 80;
  BEGIN
    RAISE NOTICE 'Quantity here is %', quantity;  -- Prints 80
    RAISE NOTICE 'Outer quantity here is %', outerblock.quantity;  -- Prints 50
  END;
  RAISE NOTICE 'Quantity here is %', quantity;  -- Prints 50
  RETURN quantity;
END;
$$ LANGUAGE plpgsql;

Page 197: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

PL/pgSQL Declarations

• All variables used in a block must be declared in the declarations

section of the block.

• SYNTAX:
name [ CONSTANT ] type [ NOT NULL ] [ { DEFAULT | := } expression ];

• EXAMPLES:
user_id integer;

quantity numeric(5);

url varchar;

myrow tablename%ROWTYPE;

myfield tablename.columnname%TYPE;

somerow RECORD;

Page 198: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

PL/pgSQL %Types

• %TYPE provides the data type of a variable or

table column.

• You can use this to declare variables that will

hold database values.

SYNTAX: variable%TYPE

EXAMPLES:

custid customer.customerid%TYPE

tid transaction.transid%TYPE

Page 199: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

PL/pgSQL Row Types

• A row variable can be declared to have the

same type as the rows of an existing table or

view.

• Only the user-defined columns of a table row

are accessible in a row-type variable, not the

OID or other system columns.

SYNTAX: table_name%ROWTYPE

EXAMPLES:

cust customer%ROWTYPE

trans transaction%ROWTYPE

Page 200: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

PL/pgSQL Record Types

• Record variables are similar to row-type

variables, but they have no pre-defined

structure.

• They take on the actual row structure of the row

they are assigned during a SELECT or FOR

command. (It is a place holder.)

• The substructure of a record variable can

change each time it is assigned to.

SYNTAX: name RECORD;

Page 201: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

PL/pgSQL Basic Statements

• Assignment

variable := expression;

• Executing DML (no RETURN)

PERFORM myfunction(myparm1,myparm2);

• Single row results

Use SELECT … INTO mytarget%TYPE;
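A short sketch combining assignment and SELECT … INTO (the function, table, and column names are illustrative, following the course schema):

```sql
CREATE FUNCTION public.store_sales(whichstore int)
RETURNS numeric AS $$
DECLARE
  total numeric;  -- assignment target
BEGIN
  -- Single-row result captured with SELECT ... INTO
  SELECT SUM(salesamt) INTO total
  FROM facts.transaction
  WHERE storeid = whichstore;
  RETURN total;
END;
$$ LANGUAGE plpgsql;
```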

Page 202: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

PL/pgSQL – Executing Dynamic SQL

• Oftentimes you will want to generate dynamic

commands inside your PL/pgSQL functions

• This is for commands that will involve different

tables or different data types each time they are

executed.

• To handle this sort of problem, the EXECUTE

statement is provided: EXECUTE command-string [ INTO [STRICT] target ];

Page 203: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

PL/pgSQL – Dynamic SQL Example

Dynamic values that are to be inserted into the constructed query

require careful handling since they might themselves contain quote

characters.

EXECUTE 'UPDATE tbl SET '
  || quote_ident(colname)
  || ' = '
  || quote_literal(newvalue)
  || ' WHERE key = '
  || quote_literal(keyvalue);

Page 204: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

PL/pgSQL – Getting the Results

• There are several ways to determine the effect

of a command.

• The first method is to use the GET

DIAGNOSTICS command, which has the form:

GET DIAGNOSTICS variable = item [ , ... ];

• This command allows retrieval of system status

indicators.

GET DIAGNOSTICS integer_var = ROW_COUNT;

Page 205: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

PL/pgSQL – Getting the Results (continued)

• The second method to determine the effects of a command is to check the special variable named FOUND, which is of type boolean.
• FOUND starts out false within each PL/pgSQL function call.

Page 206: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

PL/pgSQL – FOUND

• FOUND is set by:
  A SELECT INTO statement sets FOUND true if a row is assigned, false if no row is returned.
  A PERFORM statement sets FOUND true if it produces (and discards) one or more rows, false if no row is produced.
  UPDATE, INSERT, and DELETE statements set FOUND true if at least one row is affected, false if no row is affected.
  A FETCH statement sets FOUND true if it returns a row, false if no row is returned.
  A MOVE statement sets FOUND true if it successfully repositions the cursor, false otherwise.
  A FOR statement sets FOUND true if it iterates one or more times, else false.
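A small sketch using FOUND after a DELETE (the function and table names are illustrative):

```sql
CREATE FUNCTION public.remove_customer(cid int)
RETURNS boolean AS $$
BEGIN
  DELETE FROM dimensions.customer
  WHERE customerid = cid;
  -- FOUND is true if the DELETE affected at least one row
  RETURN FOUND;
END;
$$ LANGUAGE plpgsql;
```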

Page 207: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

PL/pgSQL – Control Structures

• Control structures are probably the most useful

(and important) part of PL/pgSQL.

• With PL/pgSQL’s control structures, you can

manipulate PostgreSQL data in a very flexible

and powerful way.

Page 208: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

PL/pgSQL – Returning from a Function

• There are two commands available that allow you to return data from a function: RETURN and RETURN NEXT.
• RETURN with an expression terminates the function and returns the value of expression to the caller.

• If you declared the function with output parameters, write just

RETURN with no expression. The current values of the output

parameter variables will be returned.

• If you declared the function to return void, a RETURN statement can

be used to exit the function early; but do not write an expression

following RETURN.

• The return value of a function cannot be left undefined.

Page 209: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

PL/pgSQL – Returning from a Function

• RETURN NEXT and RETURN QUERY do not actually return from the function — they simply append zero or more rows to the function’s result set.

EXAMPLE:

CREATE OR REPLACE FUNCTION getAllStores()

RETURNS SETOF store AS

$BODY$

DECLARE r store%rowtype;

BEGIN

FOR r IN SELECT * FROM store WHERE storeid > 0 LOOP

-- can do some processing here

RETURN NEXT r;

-- return current row of SELECT

END LOOP;

RETURN;

END

$BODY$

LANGUAGE plpgsql;

Page 210: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

PL/pgSQL – Conditionals

IF – THEN – ELSE conditionals let you execute commands based on certain conditions.

IF boolean-expression THEN
  statements
[ ELSIF boolean-expression THEN
  statements ]
[ ELSIF boolean-expression THEN
  statements ]
[ ELSE
  statements ]
END IF;

TIP: Put the most commonly occurring condition first!
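A minimal sketch inside a function body (the variable names are illustrative):

```sql
IF qty > 100 THEN
  discount := 0.10;   -- most common case first
ELSIF qty > 50 THEN
  discount := 0.05;
ELSE
  discount := 0;
END IF;
```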

Page 211: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

PL/pgSQL – Simple Loops

• With the LOOP, EXIT, CONTINUE, WHILE, and FOR statements,

you can arrange for your PL/pgSQL function to repeat a series of

commands.

LOOP SYNTAX:

[ <<label>> ]

LOOP

statements

END LOOP [ label ];

Page 212: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

PL/pgSQL – Simple Loops

EXIT EXAMPLES:

LOOP
  -- some computations
  IF count > 0 THEN
    EXIT;  -- exit loop
  END IF;
END LOOP;

LOOP

-- some computations

EXIT WHEN count > 0;

-- same result as previous example

END LOOP;

BEGIN

-- some computations

IF stocks > 100000

THEN EXIT;

-- causes exit from the BEGIN block

END IF;

END;

Page 213: Course 3 GP Advanced SQL and Tuning

PL/pgSQL – Simple Loops

CONTINUE can be used within all loops.

CONTINUE EXAMPLE:

LOOP

-- some computations

EXIT WHEN count > 100;

CONTINUE WHEN count < 50;

-- some computations for count IN [50 .. 100]

END LOOP;

Page 214: Course 3 GP Advanced SQL and Tuning

PL/pgSQL – Simple Loops

The WHILE statement repeats a sequence of statements so long as

the boolean-expression evaluates to true. The expression is checked

just before each entry to the loop body.

WHILE EXAMPLE:

WHILE customerid < 50 LOOP

-- some computations

END LOOP;

Page 215: Course 3 GP Advanced SQL and Tuning

PL/pgSQL – Simple For (integer) Loops

This form of FOR creates a loop that iterates over a range of integer

values.

EXAMPLE:

FOR i IN 1..3 LOOP

-- i will take on the values 1,2,3 within the loop

END LOOP;

FOR i IN REVERSE 3..1 LOOP

-- i will take on the values 3,2,1 within the loop

END LOOP;

FOR i IN REVERSE 8..1 BY 2 LOOP

-- i will take on the values 8,6,4,2 within the loop

END LOOP;

Page 216: Course 3 GP Advanced SQL and Tuning

PL/pgSQL – Simple For (Query) Loops

Using a different type of FOR loop, you can iterate through the results

of a query and manipulate that data accordingly.

EXAMPLE: CREATE FUNCTION nukedimensions() RETURNS integer AS $$

DECLARE mdims RECORD;

BEGIN

FOR mdims IN SELECT c.relname
             FROM pg_class c, pg_namespace n
             WHERE n.oid = c.relnamespace
               AND n.nspname = 'dimensions'
               AND c.relkind = 'r' LOOP   -- only ordinary tables

-- Now "mdims" holds one table name from pg_class

EXECUTE 'TRUNCATE TABLE dimensions.' || quote_ident(mdims.relname);

END LOOP;

RETURN 1;

END;

$$ LANGUAGE plpgsql;

Page 217: Course 3 GP Advanced SQL and Tuning

PL/pgSQL – Trapping Errors

• By default, any error occurring in a PL/pgSQL function aborts

execution of the function, and indeed of the surrounding

transaction as well.

• You can trap errors and recover from them by using a BEGIN block

with an EXCEPTION clause.

BEGIN

statements

EXCEPTION WHEN condition [OR condition ... ] THEN

handler_statements

[ WHEN condition [ OR condition ... ] THEN

handler_statements ... ]

END;
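A minimal sketch of trapping a specific condition (the function name and use case are illustrative):

```sql
-- Recover from division by zero instead of aborting the transaction.
CREATE OR REPLACE FUNCTION safe_ratio(num numeric, den numeric)
RETURNS numeric AS $$
BEGIN
    RETURN num / den;
EXCEPTION WHEN division_by_zero THEN
    RETURN NULL;     -- handler runs; the surrounding transaction survives
END;
$$ LANGUAGE plpgsql;
```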

Page 218: Course 3 GP Advanced SQL and Tuning

PL/pgSQL – Common Exception Conditions

NO_DATA – no data matches the predicates

CONNECTION_FAILURE – the SQL connection failed

DIVISION_BY_ZERO – division by zero was attempted

INTERVAL_FIELD_OVERFLOW – insufficient precision for the interval/timestamp

NOT_NULL_VIOLATION – tried to insert NULL into a NOT NULL column

RAISE_EXCEPTION – an exception raised by the RAISE statement

PLPGSQL_ERROR – PL/pgSQL error encountered at run time

NO_DATA_FOUND – query returned zero rows

TOO_MANY_ROWS – expected a single row, received more than one

Page 219: Course 3 GP Advanced SQL and Tuning

Why Use Functions?

• To perform conditional logic

• To provide common SQL to the user community
   - Ensures a single version of the metrics
   - Prevents poorly designed user queries

• To execute complex report or transformation logic

Note: We specifically did not cover cursors in this

course. Cursors can take a parallel process and make

it nicely serial!

Page 220: Course 3 GP Advanced SQL and Tuning

User Defined Data Type

• Often a function will need to return a data type

other than one based on a table.

CREATE TYPE queryreturn AS (

Customerid BIGINT,

CustomerName CHARACTER VARYING,

TotalSalesAmt DECIMAL(12,2)

);
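A function can then return this type directly. The sketch below assumes a hypothetical sales table with matching columns:

```sql
-- Returns a single row shaped like the user-defined type queryreturn.
CREATE OR REPLACE FUNCTION topcustomer()
RETURNS queryreturn AS $$
    SELECT customerid,
           customername,
           SUM(salesamt)::decimal(12,2)   -- cast to match TotalSalesAmt
    FROM sales                            -- hypothetical table
    GROUP BY 1, 2
    ORDER BY 3 DESC
    LIMIT 1;
$$ LANGUAGE sql;
```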

Page 221: Course 3 GP Advanced SQL and Tuning

LAB

Module 10

PostGREs Functions

Page 222: Course 3 GP Advanced SQL and Tuning

Module 11

Controlling Access

Page 223: Course 3 GP Advanced SQL and Tuning

System Level Security

• Problems
   - Privileged users (gpadmin/root) have access to everything (physical data, binaries, backups)
   - Sharing accounts makes auditing difficult
   - Users are tempted to run additional workloads on the master node, such as ETL
   - Once logged into the master, all other hosts are available since ssh keys are exchanged
   - Network infrastructure isn’t under our control
   - System administration isn’t under our control

Page 224: Course 3 GP Advanced SQL and Tuning

Database Security

• Problems
   - The gpadmin user has access to everything
   - Unless roles and privileges are configured, all users have access to everything
   - Unless resource queues are configured, there are no limits on what users can run
   - The default pg_hba.conf is loosely configured (see Security in GPDB.pdf)
   - Customers are tempted to share user accounts

Page 225: Course 3 GP Advanced SQL and Tuning

Database Security

• Roles
   - Each person should have their own role
   - Roles should be further divided into groups, which are usually roles that cannot log in
   - Privileges should be granted at the group level whenever possible
   - Privileges should be as restrictive as possible
   - Column-level access can be accomplished with views
   - Roles are not related to OS users or groups
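Column-level access through a view can be sketched like this (the table, column, and role names are hypothetical):

```sql
-- Expose only non-sensitive columns of a customer table through a view.
CREATE VIEW sales.v_customer AS
    SELECT customerid, customername, region   -- ssn, creditlimit not exposed
    FROM dimensions.customer;

-- Grant access to the view, not the underlying table.
GRANT SELECT ON sales.v_customer TO sales_group;
```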

Page 226: Course 3 GP Advanced SQL and Tuning

Creating Roles

• The Greenplum Database manages database access

permissions using the concept of roles.

• The concept of roles subsumes the concepts of users

and groups.

• A role can be a database user, a group, or both. Roles

can own database objects (for example, tables) and can

assign privileges on those objects to other roles to

control access to the objects.

• Roles can be members of other roles, thus a member

role can inherit the attributes and privileges of its

parent role.


Page 228: Course 3 GP Advanced SQL and Tuning

Database Architecture

• Isolate users from the physical layer through views
   - Insulate users from changes to tables
   - Control access to columns and/or rows
   - Enforce locking strategy

• Separate functional areas into separate schemas
   - Simplify security
   - Simplify backups/restores

[Diagram: Risk, Sales, and Mktg schemas; users reach the tables only through views]

A KEY COMPONENT OF ACCESS CONTROL

Page 229: Course 3 GP Advanced SQL and Tuning

ROLES AS USERS AND GROUPS

• Every Greenplum Database system contains a set of database roles (users and groups).
   - Those roles are separate from the users and groups managed by the operating system on which the server runs.
   - However, for convenience you may want to maintain a relationship between operating system user names and Greenplum Database role names, since many of the client applications use the current operating system user name as the default.

• A user is a “role” with a password!

Page 230: Course 3 GP Advanced SQL and Tuning

Creating new roles (users)

• A user-level role is considered to be a database

role that can log in to the database and initiate

a database session. Therefore, when you create

a new user-level role using the CREATE ROLE

command, you must specify the LOGIN

privilege.

Example: CREATE ROLE apatel WITH LOGIN;

Page 231: Course 3 GP Advanced SQL and Tuning

Role attributes

• A database role may have a number of attributes that define what tasks that role can perform in the database.
   - These attributes can be set at role creation time or by using the ALTER ROLE syntax.
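A sketch of setting attributes at creation time and afterwards (the role name is illustrative):

```sql
CREATE ROLE apatel WITH LOGIN;              -- user-level role
ALTER ROLE apatel WITH PASSWORD 'changeme'; -- set a password
ALTER ROLE apatel WITH CREATEDB;            -- allow creating databases
ALTER ROLE apatel WITH NOSUPERUSER;         -- explicitly not a superuser
```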

Page 232: Course 3 GP Advanced SQL and Tuning

Common user attributes

Page 233: Course 3 GP Advanced SQL and Tuning

Role membership (groups)

• It is frequently convenient to group users together to ease

management of object privileges: that way, privileges can be granted to, or revoked from, a group as a whole.

• In the Greenplum Database this is done by creating a role that represents the group, and then granting membership in the group role to individual user roles.

• Use the CREATE ROLE SQL command to create a new group role.

Example: CREATE ROLE admin CREATEROLE CREATEDB;

Once the group role exists, you can add and remove members (user roles) using the GRANT and REVOKE commands.

Example: GRANT admin TO john, sally;

REVOKE admin FROM bob;

Page 234: Course 3 GP Advanced SQL and Tuning

Object level privileges

• For managing object privileges, you would then

grant the appropriate permissions to the group-

level role only.

• The member user roles then inherit the object

privileges of the group role.

Examples: GRANT ALL ON TABLE mytable TO admin;

GRANT ALL ON SCHEMA myschema TO admin;

GRANT ALL ON DATABASE mydb TO admin;

Page 235: Course 3 GP Advanced SQL and Tuning

Managing object privileges

• When an object (table, view, sequence, database, function, language, schema, or tablespace) is created, it is assigned an owner.

• The owner is normally the role that executed the creation statement.

• For most kinds of objects, the initial state is that only the owner (or a superuser) can do anything with the object.

• To allow other roles to use it, privileges must be granted.

Page 236: Course 3 GP Advanced SQL and Tuning

Object privileges

Page 237: Course 3 GP Advanced SQL and Tuning

Security examples

CREATE ROLE admin CREATEROLE CREATEDB;

CREATE ROLE batch;

GRANT select, insert, update, delete

ON dimensions.customer TO batch;

CREATE ROLE batchuser LOGIN;

GRANT batch TO batchuser;

Page 238: Course 3 GP Advanced SQL and Tuning

Greenplum Confidential

LAB

Module 11

Controlling Access

Page 239: Course 3 GP Advanced SQL and Tuning

Module 12

Advanced SQL Topics&

Performance Tips

Page 240: Course 3 GP Advanced SQL and Tuning

UPDATE and DELETE Performance

• Use TRUNCATE TABLE to delete all rows from a table
   - Do not DROP and recreate the table

• To perform an UPDATE or DELETE on many rows, insert the rows into a temp table and then perform an UPDATE JOIN or a DELETE JOIN
   - The WHERE condition must specify equality between the distribution columns of the target table and all joined tables
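The join approach can be sketched as follows. The table and column names are hypothetical; both tables are assumed to be distributed on customerid, so the equality predicate on the distribution column keeps the operation local to each segment:

```sql
-- Stage the changes in a temp table distributed on the same key.
CREATE TEMPORARY TABLE cust_updates (customerid bigint, newstatus char(1))
DISTRIBUTED BY (customerid);

-- UPDATE JOIN: join on the distribution columns of both tables.
UPDATE dimensions.customer c
   SET status = u.newstatus
  FROM cust_updates u
 WHERE c.customerid = u.customerid;

-- DELETE JOIN: same equality condition on the distribution columns.
DELETE FROM dimensions.customer c
 USING cust_updates u
 WHERE c.customerid = u.customerid;
```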

Page 241: Course 3 GP Advanced SQL and Tuning

Temporal and Boolean Data Types

• To avoid processing skew do not use temporal

data types (date, time, timestamp/datetime,

interval) for distribution columns

• To avoid data skew never use a boolean data

type (‘t’/’f’) for distribution columns

Page 242: Course 3 GP Advanced SQL and Tuning

Avoid Approximate Numeric Data Types

• Avoid floating point data types when possible
   - Use the absolute minimum precision required

• Do not use approximate numerics (double, float, etc.) for:
   - Distribution columns
   - Join columns
   - Columns that will be used for mathematical operations (SUM, AVG)

Page 243: Course 3 GP Advanced SQL and Tuning

Data Types and Performance

• Use data types that will minimize disk storage requirements, minimize scan time, and maximize performance

• Choose the smallest integer type
   - For example, INTEGER not NUMERIC(11,0)

• Choose integer types for distribution columns

• Choose the same integer types for join columns
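A table definition following these guidelines might look like this (the table and columns are illustrative, not the course schema):

```sql
-- Integer surrogate key as the distribution column; join columns use the
-- same integer type as the matching dimension-table keys.
CREATE TABLE facts.transaction (
    transid     BIGINT,         -- not NUMERIC(18,0)
    customerid  BIGINT,         -- same type as the dimension key it joins to
    transdate   DATE,           -- fine as a column, poor as a distribution key
    amount      DECIMAL(12,2)
) DISTRIBUTED BY (transid);
```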

Page 244: Course 3 GP Advanced SQL and Tuning

Use CASE to Avoid Multiple Passes

• Use the CASE statement instead of DECODE.
   - Put the most commonly occurring condition first
   - Put the least commonly occurring condition last
   - Always include a default (ELSE) condition

EXAMPLE:

CASE

WHEN state = 'CA'

THEN state_population * 1.2

WHEN state = 'WY'

THEN state_population * 10.7

ELSE

0

END

Page 245: Course 3 GP Advanced SQL and Tuning

SET Operations

• SET operations work against queries, not

tables

• SET operations can improve query performance

• Greenplum set operations include:
   UNION – ensures uniqueness in the result set
   UNION ALL – allows duplicates; use with caution!
   INTERSECT
   EXCEPT (same as MINUS)
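The difference can be sketched with two hypothetical tables (UNION must sort/deduplicate the combined result; UNION ALL skips that work but keeps duplicates):

```sql
-- Deduplicated result (extra work):
SELECT customerid FROM current_customers
UNION
SELECT customerid FROM archived_customers;

-- Duplicates preserved (faster, use with caution):
SELECT customerid FROM current_customers
UNION ALL
SELECT customerid FROM archived_customers;
```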

Page 246: Course 3 GP Advanced SQL and Tuning

SET Operation Cautions

• Do not UNION a large number of queries against tables with millions (or billions) of rows.
   - The optimizer will attempt to launch each query in parallel; you could run out of memory and temp space.
   - Use a temporary table to store the results and populate it with separate queries.
   - Use a CASE statement, if possible, to retrieve the data in one pass.

Page 247: Course 3 GP Advanced SQL and Tuning

Avoid Multiple DISTINCT Operations

Original Query:

SELECT customerId,

COUNT(DISTINCT transId),

COUNT(DISTINCT storeId)

FROM transaction

WHERE transDate > '03/20/2008'

AND transDate < '03/25/2008'

GROUP BY 1

ORDER BY 1

LIMIT 100;

Page 248: Course 3 GP Advanced SQL and Tuning

Avoid Multiple DISTINCT Operations

More Efficient Version:

SELECT sub2.customerId,

sub2.cd_transId,

sub4.cd_storeId

FROM (SELECT customerId, COUNT(transId) as cd_transId

FROM ( SELECT customerId, transId

FROM transaction

WHERE transDate > '03/20/2008' AND transDate < '03/25/2008'

GROUP BY 1,2

) sub1

GROUP BY 1

) sub2,

(SELECT customerId, COUNT(storeId) as cd_storeId

FROM ( SELECT customerId, storeId

FROM transaction

WHERE transDate > '03/20/2008' AND transDate < '03/25/2008'

GROUP BY 1,2

) sub3

GROUP BY 1

) sub4

WHERE sub4.customerId = sub2.customerId

ORDER BY 1 LIMIT 100;

Page 249: Course 3 GP Advanced SQL and Tuning

IsDate Function

Not part of the Greenplum suite, this function works like the

Oracle “IsDate” function.

CREATE OR REPLACE FUNCTION public.isdate(text)

RETURNS boolean AS $BODY$

begin

perform $1::date;

return true;

exception when others then

return false;

end

$BODY$

LANGUAGE 'plpgsql' VOLATILE;

Page 250: Course 3 GP Advanced SQL and Tuning

Using Arrays

With Greenplum it is possible to have an ARRAY data type.

• Arrays may not be used as the distribution key for a table

• You may not create indexes on ARRAY data type columns

EXAMPLE:

CREATE TEMPORARY TABLE dimensionsOID AS(

SELECT ARRAY(SELECT c.OID

FROM pg_class c,

pg_namespace n

WHERE n.oid = c.relnamespace

AND n.nspname = 'dimensions')

AS oidArray)

DISTRIBUTED RANDOMLY;

Page 251: Course 3 GP Advanced SQL and Tuning

How to Parse Strings Efficiently

Here are some functions that will help parse a delimited string:

split_part(string, delimiter, occurrence);

EXAMPLE: SELECT split_part('name1=value1&name2=value2','&',2);

----------------

name2=value2

string_to_array(string, delimiter or pattern matching expression)

EXAMPLE: SELECT string_to_array('a,b,c,d,e,f,g',',');

string_to_array

------------------

{a,b,c,d,e,f,g}

Page 252: Course 3 GP Advanced SQL and Tuning

How to get Database Object Sizes

a. select pg_size_pretty(pg_database_size('datamart'));

b. select pg_size_pretty(pg_relation_size('dimension.customer'));

c. select pg_relation_size('facts.transaction');

d. select pg_database_size('datamart');

e. select pg_relation_size(schemaname||'.'||tablename) from pg_tables;

f. gpsizecalc -x dbname -s g

g. gpsizecalc -a dbname -t public.tablename -s g

Page 253: Course 3 GP Advanced SQL and Tuning

Skew Analysis

Finding uneven data distribution & fixing it!

Command: gpskew

The gpskew script can be used to determine if table data is equally distributed across

all of the active segments. gpskew reports the following information:

• The total number of records in the specified table. Note that for parent tables with inherited child tables, you must use the -h option to show child table skew as well.

• The number of records on each segment.

• The variance of records between segments: it takes the segment with the maximum count and the segment with the minimum count, and reports the difference between the two.

• The segment response times (if -r is supplied).

• The distribution key column names (if -c is supplied).

Page 254: Course 3 GP Advanced SQL and Tuning

Skew analysis

aparashar@mdw1 ~ $ > gpskew -t public.t1

20080823:09:43:43:gpskew:mdw1:aparashar-[INFO]:-Spawning parallel processes batch [1], please wait...

........

20080823:09:43:45:gpskew:mdw1:aparashar-[INFO]:-Waiting for parallel processes batch [1], please wait...

20080823:09:43:47:gpskew:mdw1:aparashar-[INFO]:---------------------------------------

20080823:09:43:47:gpskew:mdw1:aparashar-[INFO]:-Parallel process exit status

20080823:09:43:47:gpskew:mdw1:aparashar-[INFO]:---------------------------------------

20080823:09:43:47:gpskew:mdw1:aparashar-[INFO]:---------------------------------------

20080823:09:43:47:gpskew:mdw1:aparashar-[INFO]:-Number of Distribution Column(s) = {1}

20080823:09:43:47:gpskew:mdw1:aparashar-[INFO]:-Distribution Column Name(s) = c1

20080823:09:43:48:gpskew:mdw1:aparashar-[INFO]:---------------------------------------

20080823:09:43:48:gpskew:mdw1:aparashar-[INFO]:-Total records = 2560000

20080823:09:43:48:gpskew:mdw1:aparashar-[INFO]:---------------------------------------

20080823:09:43:48:gpskew:mdw1:aparashar-[INFO]:-Skew calculation target type = Primary Segments

20080823:09:43:48:gpskew:mdw1:aparashar-[INFO]:---------------------------------------

20080823:09:43:48:gpskew:mdw1:aparashar-[INFO]:-Skew result

20080823:09:43:48:gpskew:mdw1:aparashar-[INFO]:-Maximum Record Count = 320512

20080823:09:56:33:gpskew:mdw1:aparashar-[INFO]:-Segment instance hostname = 3114

20080823:09:56:33:gpskew:mdw1:aparashar-[INFO]:-Segment instance name = gp

20080823:09:43:48:gpskew:mdw1:aparashar-[INFO]:-Minimum Record Count = 319488

20080823:09:56:33:gpskew:mdw1:aparashar-[INFO]:-Segment instance hostname = 3114

20080823:09:56:33:gpskew:mdw1:aparashar-[INFO]:-Segment instance name = gp

20080823:09:43:48:gpskew:mdw1:aparashar-[INFO]:-Record count variance = 1024

20080823:09:43:48:gpskew:mdw1:aparashar-[INFO]:---------------------------------------

Page 255: Course 3 GP Advanced SQL and Tuning

Skew analysis

Skew problems can be seen through queries – use EXPLAIN ANALYZE:

ais_3114=# explain analyze SELECT count(*) from (SELECT count(*) from t2 group by c2) as a;

-> Subquery Scan a (cost=75.33..87.84 rows=1 width=0)

Rows out: Avg 1250.0 rows x 8 workers. Max 1251 rows (seg2) with 6.305 ms to first row, 15 ms to

end, start offset by -17 ms.

-> HashAggregate (cost=75.33..87.83 rows=1000 width=4)

Group By: t2.c2

Rows out: Avg 1250.0 rows x 8 workers. Max 1251 rows (seg2) with 6.300 ms to first row, 15

ms to end, start offset by -17 ms.

Executor memory: 1938K bytes avg, 1938K bytes max (seg0).

Work_mem used: 138K bytes avg, 138K bytes max (seg2).

(seg2) Hash chain length 1.0 avg, 1 max, using 1251 of 110160 buckets.

-> Redistribute Motion 8:8 (slice1) (cost=27.83..60.33 rows=1000 width=4)

Hash Key: t2.c2

Rows out: Avg 1250.0 rows x 8 workers at destination. Max 1251 rows (seg2) with 0.498

ms to first row, 2.729 ms to end, start offset by -16 ms.

-> HashAggregate (cost=27.83..40.33 rows=1000 width=4)

Group By: t2.c2

Rows out: Avg 1250.0 rows x 8 workers. Max 1251 rows (seg2) with 3.400 ms to first row, 4.280 ms to end, start offset by -21 ms.

Executor memory: 1938K bytes avg, 1938K bytes max (seg0).

-> Seq Scan on t2 (cost=0.00..19.22 rows=1722 width=4)

Rows out: Avg 1250.0 rows x 8 workers. Max 1251 rows (seg2) with 0.091 ms

to first row, 0.463 ms to end, start offset by -19 ms.

…..

Page 256: Course 3 GP Advanced SQL and Tuning

Skew Analysis

Example: (A Greenplum Customer)

> 75% of the tables - randomly distributed

Several very large tables had completely skewed distribution

Debug process:

Check the database skew

Start with big tables then work towards the smaller ones

Page 257: Course 3 GP Advanced SQL and Tuning

LAB

Module 12

Advanced SQL Topics

Page 258: Course 3 GP Advanced SQL and Tuning

Feedback and Questions

Please fill in the feedback form that you have

been given by your instructor. We value your

input in making this class as meaningful to our

students as possible!

THANK YOU!