Greenplum SQL & Performance Tuning
Greenplum Confidential
Course Introduction
Objectives
• Review the Greenplum “Shared Nothing” architecture and concepts
• By implementing a Greenplum data mart, students will:
- Create a Greenplum database and schemas
- Evaluate logical and physical data models
- Determine appropriate distribution keys and create tables
- Determine and implement partitioning and indexing strategies
- Determine and implement optimal load methodologies
• By fulfilling reporting requirements, students will:
- Be able to “Explain the Explain Plan”
- Create and use statistics
- Create detail and summary reports using OLAP functions
- Create and use temporary tables
- Write basic PostgreSQL functions and create data types
- Identify and optimize problem queries
Prerequisites
• Greenplum Fundamentals training course
• Experience with Unix or Linux
• Experience with SQL (PostgreSQL experience is a plus)
• Experience with vi or Emacs
• Experience with shell scripting
Module 1
Greenplum Architecture Review
Greenplum Architecture
• MPP = Massively Parallel Processing
- Multiple units of parallelism working on the same task
• Parallel database operations
• Parallel CPU processing
• Greenplum’s units of parallelism are “segments”
• “Shared Nothing” architecture
- Segments operate only on their portion of the data
- In Greenplum, each segment is a separate PostgreSQL database
- Segments are self-sufficient: dedicated CPU processes, and dedicated storage accessible only by that segment
Greenplum Architecture
Master Host, Segment Hosts and Segment Instances
Interconnect
How Does the Architecture Relate to Queries?
• Data is spread across the entire system, so:
- Good queries reduce the amount of data motion between segments
- Good physical implementation allows queries to easily eliminate rows
- Good data distribution reduces skew
• The Greenplum database has many “high availability” features
- High availability may incur a performance cost
• Queries and Data Manipulation Language (DML) operations must be optimized with the architecture in mind.
LAB
Module 1
Greenplum Architecture Review
Module 2
Data Modeling, Databases & Schemas
Data Models
• Logical Data Model
- Graphical representation of the business requirements
- Contains the things of importance in an organization and how they relate to one another
- Contains business textual definitions and examples
• Enhanced Logical Data Model
- Gathers additional information regarding objects in the LDM
• Physical Data Model
- How the logical model is implemented
Data Modeling Terms
• Entity
- Can exist on its own
- Can be uniquely identified
• Attribute
- Defines a property of the entity
• Relationship
- How entities relate to each other
• Primary Key
- A single attribute, or combination of attributes, that uniquely identifies the entity
Types of Logical Data Models
• Dimensional Model (Star Schema)
- Easiest for users to understand
- Single layer of dimensions
- May have data redundancy
Dimensional Model – Star Schema
• Dimension
- Known entry point to a fact
- A defined attribute of a fact
- Simple primary key
• Fact
- Measurable
- Relates to a specific instance
- May have a compound primary key
Types of Logical Data Models
• Snowflake Schema
- Greater reporting ability
- Can break up large dimensions into manageable chunks
- May have redundant data
Snowflake Schema
Types of Logical Data Models
• Third Normal Form
- Most flexible option
- No redundant data
- Most difficult for users to use
Third Normal Form (3NF) Model – Customer Dimension in 3rd Normal Form
• Customer: Customer ID, Customer Name, Address, Zip Code, Zip Code Plus Four, Phone
• Zip Code: Zip Code, City, State, Country Code
• State: State, Long Name, Capital City
• Country: Country Code, Country Name
Data Warehousing & Models
• Star and Snowflake schemas are most common in data warehouses
- Fast retrieval of large numbers of rows
- Easier to aggregate
- Intuitive to end users
Enhanced Logical Data Model
• Provides more detail than the Logical Model
- Defines the cardinality of the data attributes
- Defines the anticipated initial data volume (in rows)
- Defines the anticipated data growth over time
Enhanced Logical Data Model

Entity: Store
Source: SOLS (Stores On-Line System), stores table
Row Count at Source: 35
Anticipated Annual Growth Rate: 100

Attribute | Source | Uniqueness | Data Type | Data Example
Store ID | Stores | Unique | Small Integer | 1, 2 … N
Store Name | Stores | Non-unique | Character | Tom’s Groceries
Has Pharmacy | Stores | Non-unique | Character | T, F
Physical Data Model
• Table
- The physical manifestation of an entity
• Columns
- How the attributes of the entity are physically represented
• Constraints
- Foreign Key (disabled in Greenplum 3.1)
- Uniqueness
- Primary Key
- Null / Not Null
- Check Values
Logical to Physical – Logical Data Model
FACT
• Transaction: Transaction ID, Terminal ID, Transaction Date, Store ID, Customer ID, Item Count, Sales Amount, Discount Amount, Coupon Amount, Cash Amount, Credit Card Amount, Debit Amount, Other Amount
DIMENSIONS
• Store: Store ID, Store Name, Address, City, State, Zip Code, Zip Code Plus Four, Phone, Country Code, Has Pharmacy?, Has Grocery?, Has Deli?, Has Butcher?, Has Bakery?
• Customer: Customer ID, Customer Name, Address, City, State, Zip Code, Zip Code Plus Four, Country Code, Phone
• Country: Country Code, Country Name
Logical to Physical – Physical Data Model

FACT – Transaction:
transId BIGINT NOT NULL, terminalId INTEGER NOT NULL, transDate DATE NOT NULL, storeId SMALLINT NOT NULL, customerId INTEGER, itemCnt INTEGER NOT NULL, salesAmt DECIMAL(9,2) NOT NULL, taxAmt DECIMAL(9,2) NOT NULL, discountAmt DECIMAL(9,2), couponAmt DECIMAL(9,2), cashAmt DECIMAL(9,2), checkAmt DECIMAL(9,2), ccAmt DECIMAL(9,2), debitAmt DECIMAL(9,2), otherAmt DECIMAL(9,2)

DIMENSION – Store:
storeId SMALLINT NOT NULL, storeName CHAR(20) NOT NULL, address VARCHAR(50) NOT NULL, city VARCHAR(40) NOT NULL, state CHAR(2) NOT NULL, zipcode CHAR(8) NOT NULL, zipPlusFour CHAR(4) NOT NULL, countrycd CHAR(3) NOT NULL, phone CHAR(13), pharmacy CHAR(1), grocery CHAR(1), deli CHAR(1), butcher CHAR(1), bakery CHAR(1)

DIMENSION – Customer:
customerId INTEGER NOT NULL, custName CHAR(50) NOT NULL, address CHAR(50) NOT NULL, city CHAR(40) NOT NULL, state CHAR(2) NOT NULL, zipcode CHAR(7) NOT NULL, zipPlusFour CHAR(4) NOT NULL, countrycd CHAR(3) NOT NULL, phone CHAR(13)

DIMENSION – Country:
countrycd CHAR(3) NOT NULL, countryname VARCHAR(100) NOT NULL

Notes:
• Entities become tables
• Attributes become columns
• Relationships may become foreign key constraints
Data Segregation
• Database
- A Greenplum system may have multiple databases
- Greenplum databases do not share data
• Schema
- A logical grouping of database objects within a database
• Views
- May expose all or some table columns
- Allow for user-defined columns
- Instantiated at run time
About Greenplum Databases
• A Greenplum Database system may have multiple databases
- Clients connect to one database at a time
- Default database template: template1
- Other database templates: template0, postgres
• NEVER DROP OR CHANGE TEMPLATE DATABASES!
Database SQL Commands
• To create: CREATE DATABASE or createdb
• To drop: DROP DATABASE or dropdb
• To modify: ALTER DATABASE
- Change the database name
- Assign to a new owner
- Set configuration parameters
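The lifecycle above can be sketched in SQL; the database name salesdb, owner gpadmin, and search path are illustrative:

```sql
-- Create a database
CREATE DATABASE salesdb;

-- Rename it, assign a new owner, and set a configuration parameter
ALTER DATABASE salesdb RENAME TO sales_dw;
ALTER DATABASE sales_dw OWNER TO gpadmin;
ALTER DATABASE sales_dw SET search_path TO sales, public;

-- Drop it when no longer needed
DROP DATABASE sales_dw;
```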
PSQL Database Tips
• The psql prompt shows which database you are in. EXAMPLE:
- template1=# (superuser)
- template1=> (non-superuser)
• To show a list of all databases: \l (a lowercase L)
• To connect to a different database: \c db_name
• BEST PRACTICE: Use the PGDATABASE environment variable in the .bashrc shell script to set the default database!
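A minimal sketch of that best practice; the database name salesdb is an assumption for illustration:

```shell
# Persist the default database for future login shells
echo 'export PGDATABASE=salesdb' >> ~/.bashrc

# Set it for the current shell; psql now connects to salesdb
# when no database name is given on the command line
export PGDATABASE=salesdb
```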
About Schemas
• Schemas are a way to logically organize objects within a database (tables, views, functions, etc.)
• Use “fully qualified” names to access objects in a schema
- EXAMPLE: myschema.mytable
• Every database has a public schema
• Other system-level schemas include: pg_catalog, information_schema, pg_toast, pg_bitmapindex
Schema SQL Commands
• To create: CREATE SCHEMA
• To drop: DROP SCHEMA
• To modify: ALTER SCHEMA
- Change the schema name
- Assign to a new owner
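A short sketch of the schema commands; the schema name staging and owner gpadmin are illustrative:

```sql
-- Create a schema and an object inside it
CREATE SCHEMA staging;
CREATE TABLE staging.cust_stage (customer_id INTEGER, cust_name VARCHAR(50));

-- Rename the schema and assign a new owner
ALTER SCHEMA staging RENAME TO stage;
ALTER SCHEMA stage OWNER TO gpadmin;

-- Drop the schema and everything in it
DROP SCHEMA stage CASCADE;
```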
PSQL Schema Tips
• To see the current schema: select current_schema();
• To see a list of all schemas in the database: \dn
• To see the schema search path: show search_path;
Default Schemas
• public
- No objects at database creation
- Generally used to store “public” objects
• pg_catalog
- This is the data dictionary
• information_schema
- Contains views and tables that make the data dictionary easier to understand, with fewer joins
• pg_bitmapindex
- Used to store bitmap index objects
TOAST Schema
• This schema is used specifically to store large attribute objects (> 1 page of data)
• TOAST: The Oversized Attribute Storage Technique
LAB
Module 2
Data Modeling,Databases & Schemas
Module 3
Physical Design Decisions
Key Design Considerations
• Data distribution
- Selecting the best distribution key
• Checking for data skew
• To partition or not to partition
• Data types
- Select the smallest type that will store the data
• Constraints
- Table
- Column
Primary Key vs. Distribution Key
• A primary key is a logical-model concept that allows each row to be uniquely identified
• A distribution key is a physical Greenplum database concept that determines how a row is physically stored
• A well-designed Greenplum schema may have a large percentage of tables where the physical distribution key differs from the logical primary key (tables with common distribution keys provide faster join operations by reducing the number of motion operations)
Key Comparison

Primary Key | Distribution Key
Logical concept of data modeling | Physical mechanism for storage
Optional for Greenplum (becomes the distribution key if specified) | Required for every Greenplum table (first eligible column if not specified in DISTRIBUTED BY)
No limit to number of columns | Limits:
Documented in the data model | Defined in the CREATE TABLE command
Must be unique | Can be non-unique
Uniquely identifies each row | May or may not uniquely identify each row
Value cannot be changed | Value cannot be changed
Cannot be NULL | Can be NULL
Does not define an access path (DBMS independent) | Defines a storage path
Chosen for logical correctness | Chosen for distribution and performance
Data Distribution
Every table in the Greenplum database has a distribution method.

CREATE TABLE tablename (
column_name1 data_type NOT NULL,
column_name2 data_type NOT NULL,
column_name3 data_type NOT NULL DEFAULT default_value,
. . . )
[DISTRIBUTED BY (column_name)] -- hash algorithm
OR
[DISTRIBUTED RANDOMLY] -- round-robin algorithm
Distribution Key Considerations
• A distribution key cannot be changed or altered after a table is created
• If a table has a unique constraint, it must be declared as the distribution key
• User-defined data types and geometric data types are not eligible as distribution keys
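A sketch of the unique-constraint rule, with an illustrative table:

```sql
-- Valid: the unique column is also the distribution key
CREATE TABLE customer (
    customer_id INTEGER UNIQUE,
    cust_name   VARCHAR(50)
) DISTRIBUTED BY (customer_id);

-- Invalid: the unique column and the distribution key differ,
-- so Greenplum rejects this definition
-- CREATE TABLE customer2 (
--     customer_id INTEGER UNIQUE,
--     cust_name   VARCHAR(50)
-- ) DISTRIBUTED BY (cust_name);
```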
Default Distributions Not Recommended
• If a table has a primary key and a distribution key is NOT specified, the primary key will be used as the distribution key by default
• If the table does not have a primary key and a distribution key is NOT specified, the first eligible column of the table will be used as the distribution key
- User-defined data types and geometric data types are not eligible as distribution keys
- If a table does not have an eligible column, a random distribution is used
Distributions and Performance
• Response time is the completion time for ALL segment instances in a Greenplum database
• To optimize performance, use a hash distribution method (DISTRIBUTED BY) that distributes data evenly across all segment instances
[Figure: CPU, disk I/O, and network load spread evenly across segment instances 0–5]
Hash Distributions and Data Skew
• Select a distribution key with unique values and high cardinality
• A low-cardinality key such as Gender (M or F) concentrates data on a few segments
[Figure: CPU, disk I/O, and network load skewed across segment instances 0–5]
Hash Distributions and Processing Skew
• Using a DATE column as the distribution key may distribute rows evenly across segment instances, but may result in processing skew (for example, JAN through JUN each landing on a single segment instance)
• Select a distribution key that will not result in processing skew
[Figure: months JAN–JUN each concentrated on one of segment instances 0–5]
Use the Same Distribution Key for Commonly Joined Tables
• Optimize for local joins: distribute on the same key used in the join (WHERE clause)
• EXAMPLE: customer (c_customer_id) and freq_shopper (f_customer_id) distributed on the customer id join locally on each segment instance
[Figure: matching rows of customer and freq_shopper co-located on each segment instance across segment hosts]
Avoid Redistribute Motion for Large Tables
• WHERE customer.c_customer_id = freq_shopper.f_customer_id
• If freq_shopper is distributed on f_trans_number, the freq_shopper table is dynamically redistributed on f_customer_id before the join can run
[Figure: freq_shopper rows for customer_id 102 and 745 moving to the segment instances that hold the matching customer rows]
Avoid Broadcast Motion for Large Tables
• WHERE customer.c_statekey = state.s_statekey
• The state table is dynamically broadcast to all segment instances
[Figure: a full copy of the state table (AK, AL, AZ, CA, …) placed on every segment instance]
Commonly Joined Tables: Use the Same Data Type for Distribution Keys
• EXAMPLE: customer (c_customer_id) stores 745::int while freq_shopper (f_customer_id) stores 745::varchar(10)
• Values might appear the same, but they are stored differently at the disk level
• Values might appear the same, but they HASH to different values
- Resulting in like rows being stored on different segment instances
- Requiring a redistribution before the tables can be joined
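A sketch of the fix for the hypothetical tables above: declare both join keys with the same type up front.

```sql
-- Both join keys declared INTEGER, so equal values hash identically
CREATE TABLE customer (
    c_customer_id INTEGER NOT NULL,
    c_name        VARCHAR(50)
) DISTRIBUTED BY (c_customer_id);

CREATE TABLE freq_shopper (
    f_customer_id INTEGER NOT NULL,
    f_visits      INTEGER
) DISTRIBUTED BY (f_customer_id);

-- The join can now run segment-locally, with no cast and no redistribution
SELECT c.c_customer_id, f.f_visits
FROM customer c
JOIN freq_shopper f ON c.c_customer_id = f.f_customer_id;
```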
DISTRIBUTED RANDOMLY
• Uses a round-robin algorithm
• Any query that joins to a randomly distributed table will require a redistribute motion or a broadcast motion
• Acceptable for small tables, for example dimension tables
• Not acceptable for large tables, for example fact tables
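For example, a small lookup table where random distribution is acceptable (table and columns are illustrative):

```sql
-- Small dimension table; the motion cost of joining to it is low
CREATE TABLE state_dim (
    s_statekey CHAR(2),
    s_name     VARCHAR(40)
) DISTRIBUTED RANDOMLY;
```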
Distribution Strategies to Avoid
• Do not generate or fabricate columns to create a unique distribution key simply to avoid skew
- DISTRIBUTED BY (col1, col2, col3) if (col1, col2, col3) is never used in joins
- Homegrown serial-key algorithms
• Do not distribute on boolean data types
• Do not distribute on floating-point data types
• Do not DISTRIBUTE RANDOMLY because it is the easy choice!
Check for Data Skew
• Use gpskew to check for data skew:
gpskew -t schema.table -a database_name
• For a partitioned table, use the gpskew -h option:
gpskew -t schema.table -a database_name -h
gpskew Example
gpadmin@mdw > gpskew -t public.customer
. . .
. . . Number of Distribution Column(s) = {1}
. . . Distribution Column Name(s) = cust_id
20080316:20:12:08:gpskew:mdw:gpadmin-[INFO]
. . . Total records (inc child tables) = 5277811399
. . .
20080316:20:12:08:gpskew:mdw:gpadmin-[INFO]:-Skew result
. . . Maximum Record Count = 66004567
. . . Segment instance hostname = gpdbhost
. . . Segment instance name = dw100
. . . Minimum Record Count = 65942375
. . . Segment instance hostname = gpdbhost
. . . Segment instance name = dw100
. . . Record count variance = 62192
Redistribute Using CREATE TABLE LIKE
• To change the distribution of a table, use a CREATE TABLE ... (LIKE ...) statement, specifying the new distribution key:

CREATE TABLE customer_tmp (LIKE customer)
DISTRIBUTED BY (customer_id);

INSERT INTO customer_tmp SELECT * FROM customer;

DROP TABLE customer;

ALTER TABLE customer_tmp RENAME TO customer;
Why Partition a Table?
• Provide more efficient querying against a subset of large volumes of transactional detail data, and manage that data more effectively
- Businesses have recognized the analytic value of detailed transactions and are storing larger and larger volumes of this data
• Increase query efficiency by avoiding full table scans, without the overhead and maintenance costs of an index on date
- As the retention volume of detailed transactions increases, the percentage of transactions that the “average” query requires for execution decreases
• Allow “instantaneous” dropping of “old” data and simple addition of “new” data
- Support a “rolling n periods” methodology for transactional data
Partitions in Greenplum
• A mechanism in Greenplum for use in physical database design
• Increases the available options to improve the performance of a certain class of queries
- Only the rows of the qualified partitions (child tables) in a query need to be accessed
• Two types of partitioning:
- Range partitioning: division of data based on a numerical range, such as date or price
- List partitioning: division of data based on a list of values, such as state or region
• Partitioning in Greenplum works using table inheritance and CHECK table constraints
• Partitioned tables are also distributed
Candidates for Partitioning
• Any table with large data volumes where queries commonly select using a date range, or that has roll-off requirements
• Examples:
- Fact tables with dates
- Transaction tables with dates
- Activity tables with dates
- Statement tables with numeric statement periods (YYYYMM)
- Financial summary tables with end-of-month dates
Partitioning Column Candidates
• Recommended and simple:
- Dates using date expressions: trade date, order date, billing date, end-of-month date
- Numeric periods that are integers: year and month, cycle run date, Julian date
Greenplum Table Partitioning
• Greenplum partitions tables in the CREATE TABLE DDL statement
• Partitions may be by range (date, id, etc.)
• Partitions may be by value list (“Male”, “Female”, etc.)
• Partition operations include:
- CREATE TABLE … PARTITION BY
- ALTER TABLE ADD PARTITION
- DROP PARTITION
- RENAME PARTITION
- EXCHANGE PARTITION
Partitioned Table Example
DDL to Create Table (prior slide)

CREATE TABLE baby.rank ( id INT, rank INT, year INT, gender CHAR(1), count INT )
DISTRIBUTED BY ( id, gender, year )
PARTITION BY LIST ( gender )
SUBPARTITION BY RANGE ( year )
SUBPARTITION TEMPLATE
( SUBPARTITION "2001" START ( 2001 ) INCLUSIVE,
SUBPARTITION "2002" START ( 2002 ) INCLUSIVE,
SUBPARTITION "2003" START ( 2003 ) INCLUSIVE,
SUBPARTITION "2004" START ( 2004 ) INCLUSIVE,
SUBPARTITION "2005" START ( 2005 ) INCLUSIVE END ( 2006 ) EXCLUSIVE )
( PARTITION females VALUES ( 'F' ),
PARTITION males VALUES ( 'M' ) );
Table Partitioning Best Practices
• Use table partitioning on large distributed tables to improve query performance
- Table partitioning is not a substitute for table distribution
• Partition if the table can be divided into fairly equal parts based on a defining criterion
- For example, range partitioning on date
• And the defining partitioning criterion matches the access pattern used in query predicates
- WHERE date = ‘1/30/2008’
• No overlapping ranges or duplicate list values
Table Partitioning Example
• Distributed by c_customer_id, partitioned by week_number
• SELECT . . . FROM customer WHERE week_number = 200702
[Figure: segment instances 0–51 each hold every week_number partition (200701–200752) for their slice of customer]
Partition Elimination
• SELECT . . . FROM customer WHERE week_number = 200702
• Only the 200702 week partition is scanned
• The other 51 partitions on the segment instance are eliminated from the query execution plan
• The primary goal of table partitioning is to achieve partition elimination
Creating Partitioned Tables
CREATE TABLE customer (
c_customer_id INT,
week_number INT,
. . .
) DISTRIBUTED BY (c_customer_id)
PARTITION BY RANGE ( week_number )
(START (200701) END (200752) INCLUSIVE)
;
Maintaining Rolling Windows
• Use ALTER TABLE to add a new child table (week_number = 200801)
• Use ALTER TABLE to drop an old child table (week_number = 200701)
[Figure: segment instances 0–51 each now hold week_number partitions 200702–200801]
Loading Partitioned Tables
• Parent table customer with child tables week_number = 200701 through 200752
• Data may be loaded into the parent table
- The parent table itself contains no data
- By default, data is not automatically inserted into the correct child tables; use rewrite rules
• OR data may be loaded into the child tables
- For the best load performance, load directly into the child tables
- Or load into a “clone” of a child table and EXCHANGE partitions
Loading Child Tables
• Data must be partitioned to ensure it is loaded into the correct child table (parent customer, children week_number = 200701 through 200752)
- Manually partition the data files, OR
- Use external tables to filter the data:

INSERT INTO customer_200701
SELECT * FROM ext_customer
WHERE week_number = 200701;
Partitioned Table with No Sub-Partitions: Heterogeneous Example
CREATE TABLE orders
( order_id NUMBER(12),
order_date TIMESTAMP WITH LOCAL TIME ZONE,
order_mode VARCHAR2(8),
customer_id NUMBER(6),
order_status NUMBER(2),
order_total NUMBER(8,2),
sales_rep_id NUMBER(6),
promotion_id NUMBER(6)
)
DISTRIBUTED BY (customer_id)
PARTITION BY RANGE (order_date)
PARTITIONS 100
(
starting '2005-12-01'
ending '2007-12-01' every 3 months,
ending '2008-12-01' every 4 weeks,
ending '2008-12-03' every 6 hours,
ending '2008-12-04' every 2 secs
);
Multi-level Partition Example
CREATE TABLE orders
( order_id NUMBER(12),
order_date TIMESTAMP WITH LOCAL TIME ZONE,
order_mode VARCHAR2(8),
customer_id NUMBER(6),
order_status NUMBER(2),
order_total NUMBER(8,2),
sales_rep_id NUMBER(6),
promotion_id NUMBER(6)
)
DISTRIBUTED BY (customer_id)
PARTITION BY RANGE (order_date)
SUBPARTITION BY HASH (customer_id)
SUBPARTITIONS 8
(
partition minny,
starting '2004-12-01' ending '2006-12-01',
partition maxy
);
What Is the Difference Between Heterogeneous and Multi-level Partitions?
• Heterogeneous partitions: all partitions use the same column value (for example, a date), but the partitions are determined by different expressions (daily, weekly, monthly)
• Multi-level partitions: the partitions form different levels based on different columns
Data Type Best Practices
• Use data types to constrain your column data
- Use character types to store strings
- Use date or timestamp types to store dates
- Use numeric types to store numbers
• Choose the type that uses the least space
- Don’t use a BIGINT when an INTEGER will work
• Use identical data types for columns used in join operations
- Converting data types prior to joining is resource intensive
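The practices above, sketched with illustrative tables and columns:

```sql
CREATE TABLE store_visit (
    store_id   SMALLINT NOT NULL,   -- smallest integer type that fits
    visit_date DATE     NOT NULL,   -- a real date type, not CHAR(8)
    amount     DECIMAL(9,2)         -- exact numeric for money
) DISTRIBUTED BY (store_id);

-- The join column store_id is SMALLINT in both tables,
-- so no conversion is needed at join time
CREATE TABLE store_dim (
    store_id   SMALLINT NOT NULL,
    store_name VARCHAR(40)
) DISTRIBUTED BY (store_id);
```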
Greenplum Data Types
Greenplum Data Types (continued)
More Greenplum Data Types
Table Constraints
• Primary Key constraint *
- EXAMPLE: CONSTRAINT mytable_pk PRIMARY KEY (mycolumn)
• With OIDs
- EXAMPLE: WITH (OIDS=TRUE)
- Used when updating or deleting rows via an ODBC connection from another database (Oracle, SQL Server, etc.)

* If a table has a primary key, that column (or group of columns) is chosen as the distribution key for the table. If a table has a primary key, it cannot have other columns with a unique constraint.
Column Constraints
• Check constraint
- EXAMPLE: price_amt DECIMAL(10,2) CHECK (price_amt > 0.00)
• Not Null constraint
- EXAMPLE: customer_id INTEGER NOT NULL
• Uniqueness constraint *
- EXAMPLE: customer_id INTEGER UNIQUE

* Greenplum Database has some special conditions for unique constraints: a table can have only one unique constraint, and the unique column (or set of columns) must also be declared as the distribution key for the table.
LAB
Module 3
Physical Design Decisions
Module 4
Data Loading: “ETL” versus “ELT”
ETL – Extract, Transform & Load
• Extract data from one or more source systems
• Transform data using the source system, an ETL server, or an ETL tool (i.e. Informatica or DataStage), as required by business rules
• Load data into the data warehouse
The Greenplum Way - ELT
• Extract data from one or more source systems
• Load data in a parallel manner via Greenplum external tables into stage or temporary tables in the MPP data warehouse
• Transform within the data warehouse in a parallel manner and load into the target tables
Loading Data – Greenplum Methods
• External tables
- File based
- Web based
- gpfdist: massively parallel load tool
- gpload
• SQL COPY command
• ETL tool
- How to do ELT with a tool
Loading with External Tables
• What are external tables?
- Read-only; the data resides outside the database
- SELECT, JOIN, SORT, etc., just like a regular table
• Advantages:
- Parallelism
- ELT flexibility
- Error handling of badly formatted rows
- Multiple data sources
• Two types:
- External tables: file-based
- Web tables: URL- or command-based
File-Based External Tables
• Two access protocols: file, gpfdist
• File formats: text, csv
• Example external table definition:

CREATE EXTERNAL TABLE ext_expenses
(name text, date date, amount float4, category text, description text)
LOCATION (
'file://seghost1/dbfast/external/expenses1.csv',
'file://seghost1/dbfast/external/expenses2.csv',
'file://seghost2/dbfast/external/expenses3.csv',
'file://seghost2/dbfast/external/expenses4.csv',
'file://seghost3/dbfast/external/expenses5.csv',
'file://seghost3/dbfast/external/expenses6.csv' )
FORMAT 'CSV' ( HEADER );
Parallel File Distribution Program (gpfdist)
• Ensures full parallelism for the best performance
• Can be run on a server outside the Greenplum array
• C program; uses the HTTP protocol
• 200 MB/s data distribution rate per gpfdist
• Configuration parameter: gp_external_max_segs
• Example start commands:

gpfdist -d /var/load_files/expenses -p 8081 >> gpfdist.log 2>&1 &
gpfdist -d /var/load_files/expenses -p 8082 >> gpfdist.log 2>&1 &

• Example external table definition:

CREATE EXTERNAL TABLE ext_expenses
( name text, date date, amount float4, description text )
LOCATION ('gpfdist://etlhost:8081/*', 'gpfdist://etlhost:8082/*')
FORMAT 'TEXT' (DELIMITER '|')
ENCODING 'UTF-8'
LOG ERRORS INTO ext_expenses_loaderrors
SEGMENT REJECT LIMIT 10000 ROWS;
Parallel File Distribution Program (gpfdist)
[Figure: an ETL server runs two gpfdist instances serving external table files to the segment hosts across a Gigabit Ethernet interconnect switch; the master host coordinates]
About External and Web Tables
• Two types:
- External tables: Greenplum file server (gpfdist) or file-based (file://)
- Web tables: URL-based (http://) or output-based (OS commands, scripts)
• Used for ETL and data loading
• Data accessed at runtime by the segments (in parallel)
• Read-only
External Table SQL Syntax

CREATE EXTERNAL TABLE table_name (column_name data_type [, ...])
LOCATION ('file://seghost[:port]/path/file' [, ...])
| ('gpfdist://filehost[:port]/file_pattern' [, ...])
FORMAT 'TEXT' | 'CSV'
[( [DELIMITER [AS] 'delimiter']
[NULL [AS] 'null string']
[ESCAPE [AS] 'escape' | 'OFF']
[HEADER]
[QUOTE [AS] 'quote']
[FORCE NOT NULL column [, ...]] )]
[ ENCODING 'encoding' ]
[ [LOG ERRORS INTO error_table] SEGMENT REJECT LIMIT count [ROWS | PERCENT] ]

DROP EXTERNAL TABLE table_name;
External Web Tables
• Dynamic data; not rescannable
• Two forms:
- URLs
- Commands or scripts: executed on a specified number of segment instances; can be disabled by setting gp_external_enable_exec=off
• Output data formats: text, csv
• Example web table definitions:

CREATE EXTERNAL WEB TABLE ext_expenses (name text, date date, amount float4, description text)
LOCATION (
'http://intranet.company.com/expenses/sales/expenses.csv',
'http://intranet.company.com/expenses/finance/expenses.csv',
'http://intranet.company.com/expenses/ops/expenses.csv' )
FORMAT 'CSV' ( HEADER );

CREATE EXTERNAL WEB TABLE log_output (linenum int, message text)
EXECUTE '/var/load_scripts/get_log_data.sh' ON HOST
FORMAT 'TEXT' (DELIMITER '|');

CREATE EXTERNAL WEB TABLE du_space (storage text)
EXECUTE 'du -sh' ON ALL
FORMAT 'TEXT';
External Table Error Handling
• Load good rows and catch poorly formatted rows, such as:
- Rows with missing or extra attributes
- Rows with attributes of the wrong data type
- Rows with invalid client encoding sequences
• Does not apply to constraint errors: PRIMARY KEY, NOT NULL, CHECK, or UNIQUE constraints
• Optional error handling clause for external tables:
[LOG ERRORS INTO error_table] SEGMENT REJECT LIMIT count [ROWS | PERCENT]
(PERCENT is based on the gp_reject_percent_threshold parameter)
• Example:

CREATE EXTERNAL TABLE ext_customer (id int, name text, sponsor text)
LOCATION ( 'gpfdist://filehost:8081/*.txt' )
FORMAT 'TEXT' ( DELIMITER '|' NULL ' ')
LOG ERRORS INTO err_customer SEGMENT REJECT LIMIT 5 ROWS;
External Tables and Planner Statistics
• Data resides outside the database
• There are no database statistics for external table data
• Not meant for frequent or ad-hoc access
• You can manually set rough statistics in pg_class:

UPDATE pg_class
SET reltuples=400000, relpages=400
WHERE relname='myexttable';
gpload
• Runs a load job as defined in a YAML-formatted control file
• gpload is a data loading utility that acts as an interface to the external table parallel loading feature. Using a load specification defined in a YAML-formatted control file, gpload executes a load by invoking the Greenplum parallel file server (gpfdist), creating an external table definition based on the source data defined, and executing an INSERT, UPDATE, or MERGE operation to load the source data into the target table in the database.
• Usage:
gpload -f control_file [-l log_file] [-h hostname] [-p port] [-U username] [-d database] [-W] [-v | -V] [-q] [-D]
gpload -? | --version
• Example:
gpload -f my_load.yml
gpload - Control File Format
• The gpload control file uses the YAML 1.1 document format and
implements its own schema for defining the various steps of
a Greenplum Database load operation. The control file must be a
valid YAML document.

• Example load control file:

VERSION: 1.0.0.1
DATABASE: ops
USER: gpadmin
HOST: mdw-1
PORT: 5432
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
           - etl1-1
           - etl1-2
           - etl1-3
           - etl1-4
         PORT: 8081
         FILE: /var/load/data/*
    - COLUMNS:
           - name: text
           - amount: float4
           - category: text
           - desc: text
           - date: date
    - FORMAT: text
    - DELIMITER: '|'
    - ERROR_LIMIT: 25
    - ERROR_TABLE: payables.err_expenses
   OUTPUT:
    - TABLE: payables.expenses
    - MODE: INSERT
COPY SQL Command
• PostgreSQL command
• Optimized for loading a large number of rows
• Loads all rows in one command (not parallel)
• Loads data from a file or from standard input
• Supports the same error handling as external tables

EXAMPLE:
COPY mytable FROM '/data/myfile.csv' WITH CSV HEADER;
When to use SQL COPY

• When scripting a process that loads small amounts of data (< 10K rows)
• For one-time data loads
• Use COPY to load multiple rows in one command, instead of using a
series of INSERT commands
• The COPY command is optimized for loading large numbers of rows;
it is less flexible than INSERT, but incurs significantly less
overhead for larger data loads
• Since COPY is a single command, there is no need to disable
autocommit when using this method to populate a table
Why use an ETL tool?

• The ETL tool may be the corporate standard
• May be used to run standard data quality processes
• May be used for code standardization (multiple source systems with
different values for the same code, e.g. status code)
• May be used for householding or generating new identification keys
(unique customer IDs, product IDs)
• Provides connectivity options, for example MVS connectivity
Architecting ELT Solutions with an ETL tool

• Use the ETL tool to create files for external tables
• Use the ETL tool to execute SQL commands
• Use the ETL tool for job scheduling of external table loads
System Generated Keys in Greenplum
• Primary Key contains many columns?
• Distribution Key rarely used in queries or joins?
• Solution: a system generated key
   Use a Sequence
   Use SQL
About Sequences

• The Greenplum Master runs a sequence generator process (seqserver)

• Sequence SQL commands:
   CREATE SEQUENCE
   ALTER SEQUENCE
   DROP SEQUENCE

• Sequence function limitations:
   lastval and currval are not supported
   setval cannot be used in queries that update data
   nextval is not allowed in UPDATEs or DELETEs if mirrors are enabled
   nextval may grab a block of values for some queries

• psql tips:
   To list all sequences: \ds
   To see a sequence definition: \d+ sequence_name
Sequence Example

Create a sequence named 'myseq':
CREATE SEQUENCE myseq START 101;

Insert a row into a table that gets the next value:
INSERT INTO distributors VALUES (nextval('myseq'), 'acme');

Reset the sequence counter value on the master:
SELECT setval('myseq', 201);

Not allowed in Greenplum DB (sets the sequence value on the segments):
INSERT INTO product VALUES (setval('myseq', 201), 'gizmo');
NOTE: THERE COULD BE GAPS IN THE SEQUENCE NUMBERS!
Using SQL

Create a temporary table with the current maximum value:

CREATE TEMPORARY TABLE MAXI AS
  (SELECT COALESCE(MAX(CustomerID), 0) AS MaxID
   FROM dimensions.Customer)
DISTRIBUTED BY (MaxID);

Join the single-row table with the max value to every row in the source
table, add the rank and voila, no gaps:

SELECT (RANK() OVER (ORDER BY phone)) + MAXI.MaxID,
       custName, address, city, state, zipcode, zipPlusFour,
       countrycd, phone
FROM public.customer_external CROSS JOIN MAXI;
LAB
Module 4
Data Loading: "ETL" versus "ELT"
Module 5
Explain the Explain Plan: Analyzing Queries
The Greenplum Explain Utility
• Allows the user to view how the optimizer will
handle the query.
• Shows the estimated cost (in units of disk
page fetches) and estimated row counts.
• EXPLAIN ANALYZE will show the difference
between the estimates and the actual run costs.
EXPLAIN or EXPLAIN ANALYZE
• Use EXPLAIN ANALYZE to run the query and
generate query plans and planner estimates
EXPLAIN ANALYZE <query>;
• Run EXPLAIN ANALYZE on queries to identify
opportunities to improve query performance
• EXPLAIN ANALYZE is also an engineering tool
used by Greenplum developers

• Focus on the query plan elements that are relevant
from an end user perspective to improve query
performance
EXPLAIN Output

• Query plans are a right-to-left tree of nodes,
read from the bottom up
• Each node feeds its results to the node directly
above
• There is one line for each node in the plan tree
• Each node represents a single operation
   Sequential Scan, Hash Join, Hash Aggregation, etc.
• The plan will also include motion nodes
responsible for moving rows between segment
instances
   Redistribute Motion, Broadcast Motion, Gather Motion
EXPLAIN Example

Gather Motion 48:1 (slice1) (cost=14879102.69..14970283.82 rows=607875 width=548)
Merge Key: b.run_id, b.pack_id, b.local_time, b.session_id, b."domain"
-> Unique (cost=14879102.69..14970283.82 rows=607875 width=548)
Group By: b.run_id, b.pack_id, b.local_time, b.session_id, b."domain"
-> Sort (cost=14879102.69..14894299.54 rows=6078742 width=548)
Sort Key (Distinct): b.run_id, b.pack_id, b.local_time, b.session_id, b."domain"
-> Result (cost=0.00..3188311.04 rows=6078742 width=548)
-> Append (cost=0.00..3188311.04 rows=6078742 width=548)
-> Seq Scan on display_run b (cost=0.00..1.02 rows=1 width=548)
Filter: local_time >= '2007-03-01 00:00:00'::timestamp without time zone
AND local_time < '2007-04-01 00:00:00'::timestamp without time zone AND (url::text
<> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text)
-> Seq Scan on display_run_child_2007_03_month b
(cost=0.00..3188310.02 rows=6078741 width=50)
Filter: local_time >= '2007-03-01 00:00:00'::timestamp without time zone
AND local_time < '2007-04-01 00:00:00'::timestamp without time zone AND (url::text
<> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text)
Begin by Examining Rows and Cost

Aggregate (cost=16141.02..16141.03 rows=1 width=0)
-> Gather Motion 2:1 (slice2) (cost=16140.97..16141.00 rows=1 width=0)
-> Aggregate (cost=16140.97..16140.98 rows=1 width=0)
-> Hash Join (cost=2999.46..15647.43 rows=197414 width=0)
Hash Cond: ps.ps_partkey = part.p_partkey
-> Hash Join (cost=240.46..9920.71 rows=200020 width=4)
Hash Cond: ps.ps_suppkey = supplier.s_suppkey
-> Seq Scan on partsupp ps (cost=0.00..6180.00 rows=400000
width=8)
-> Hash (cost=177.97..177.97 rows=4999 width=4)
-> Broadcast Motion 2:2 (slice1) (cost=0.00..177.97 rows=4999
width=4)
-> Seq Scan on supplier (cost=0.00..77.99 rows=4999
width=4)
-> Hash (cost=1509.00..1509.00 rows=100000 width=4)
-> Seq Scan on part (cost=0.00..1509.00 rows=100000 width=4)

Identify plan nodes where the estimated cost is very high and the number
of rows is very large (this is where the majority of time is spent)
Validate Partition Elimination

Gather Motion 48:1 (slice1) (cost=174933650.92..176041040.58 rows=7382598 width=548)
Merge Key: b.run_id, b.pack_id, b.local_time, b.session_id, b."domain"
-> Unique (cost=174933650.92..176041040.58 rows=7382598 width=548)
Group By: b.run_id, b.pack_id, b.local_time, b.session_id, b."domain"
-> Sort (cost=174933650.92..175118215.86 rows=73825977 width=548)
Sort Key (Distinct): b.run_id, b.pack_id, b.local_time, b.session_id, b."domain"
-> Result (cost=0.00..31620003.26 rows=73825977 width=548)
-> Append (cost=0.00..31620003.26 rows=73825977 width=548)
-> Seq Scan on display_run b (cost=0.00..1.02 rows=1 width=548)
Filter: url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text
-> Seq Scan on display_run_child_2007_03_month b (cost=0.00..2635000.02 rows=6079950 width=50)
Filter: url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text
-> Seq Scan on display_run_child_2007_04_month b (cost=0.00..2635000.02 rows=6182099 width=50)
Filter: url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text
-> Seq Scan on display_run_child_2007_05_month b (cost=0.00..2635000.02 rows=6186057 width=50)
Filter: url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text
-> Seq Scan on display_run_child_2007_06_month b (cost=0.00..2635000.02 rows=6153744 width=50)
Filter: url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text
-> Seq Scan on display_run_child_2007_07_month b (cost=0.00..2635000.02 rows=6124871 width=50)
Filter: url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text
-> Seq Scan on display_run_child_2007_08_month b (cost=0.00..2635000.02 rows=6147722 width=50)
Filter: url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text
-> Seq Scan on display_run_child_2007_09_month b (cost=0.00..2635000.02 rows=6244890 width=50)
Filter: url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text
-> Seq Scan on display_run_child_2007_10_month b (cost=0.00..2635000.02 rows=6083077 width=50)
Filter: url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text
-> Seq Scan on display_run_child_2007_11_month b (cost=0.00..2635000.02 rows=6161248 width=50)
Filter: url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text
-> Seq Scan on display_run_child_2007_12_month b (cost=0.00..2635000.02 rows=6156930 width=50)
Filter: url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text
All child partitions are scanned
Partition Elimination

Gather Motion 48:1 (slice1) (cost=14879102.69..14970283.82 rows=607875 width=548)
Merge Key: b.run_id, b.pack_id, b.local_time, b.session_id, b."domain"
-> Unique (cost=14879102.69..14970283.82 rows=607875 width=548)
Group By: b.run_id, b.pack_id, b.local_time, b.session_id, b."domain"
-> Sort (cost=14879102.69..14894299.54 rows=6078742 width=548)
Sort Key (Distinct): b.run_id, b.pack_id, b.local_time, b.session_id, b."domain"
-> Result (cost=0.00..3188311.04 rows=6078742 width=548)
-> Append (cost=0.00..3188311.04 rows=6078742 width=548)
-> Seq Scan on display_run b (cost=0.00..1.02 rows=1 width=548)
Filter: local_time >= '2007-03-01 00:00:00'::timestamp without time zone AND local_time < '2007-04-01
00:00:00'::timestamp without time zone AND (url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text)
-> Seq Scan on display_run_child_2007_03_month b
(cost=0.00..3188310.02 rows=6078741 width=50)
Filter: local_time >= '2007-03-01 00:00:00'::timestamp without time zone AND local_time < '2007-04-01
00:00:00'::timestamp without time zone AND (url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text)
To eliminate partitions, the WHERE clause must match the partition criteria
AND must contain an explicit value (no subquery)
Optimal Plan Heuristics
• Fast Operators
   Sequential Scan
   Hash Join
   Hash Agg
   Redistribute Motion

• Slow Operators
   Nested Loop Join
   Merge Join
   Sort
   Broadcast Motion
Eliminate Nested Loop Joins

Gather Motion 24:1 (slice4) (cost=4517.05..214553.33 rows=416845 width=371)
-> Nested Loop (cost=4517.05..214553.33 rows=416845 width=371)
InitPlan (slice6)
-> Gather Motion 24:1 (slice3) (cost=1.09..10.23 rows=1 width=4)
-> Seq Scan on calendar_dim (cost=1.09..10.23 rows=1 width=4)
Filter: calendar_date = $0
InitPlan (slice5)
-> Gather Motion 24:1 (slice2) (cost=0.00..1.09 rows=1 width=8)
-> Seq Scan on event_status (cost=0.00..1.09 rows=1 width=8)
Filter: event_name::text = 'bonusbuy'::text
-> Seq Scan on primary_contact_events (cost=0.00..7866.45 rows=416845 width=109)
-> Materialize (cost=4506.82..4507.06 rows=24 width=262)
-> Broadcast Motion 24:24 (slice1) (cost=0.00..4506.80 rows=24 width=262)
-> Seq Scan on contact_type (cost=0.00..4506.55 rows=1 width=262)
Filter: call_type::text = 'Regional Group'::text AND created_dt::date = $1
Replace nested loop joins with hash joins
Replace Large Sorts

Redistribute Motion 48:48 (slice2) (cost=8568413851.73..8683017216.78 rows=83317
width=154)
-> GroupAggregate (cost=8568413851.73..8683017216.78 rows=83317 width=154)
Group By: b."host", "customer_id"
-> Sort (cost=8568413851.73..8591334233.13 rows=9168152560 width=154)
Sort Key: b."host", "customer_id"
-> Redistribute Motion 48:48 (slice1) (cost=2754445.61..1748783807.47
rows=9168152560 width=154)
Hash Key: b."host", "customer_id"
-> Hash Join (cost=2754445.61..1565420756.27 rows=9168152560
width=154)
Hash Cond: a.pack_id = b.pack_id
Join Filter: (a.local_time - b.local_time) >= '-00:10:00'::
Replace large sorts with HashAgg
Re-write Sort Example

• The optimizer dynamically re-writes a single count
distinct, replacing a sort with a HashAgg by pushing the
distinct into a subquery
• For example
SELECT count(distinct l_quantity) FROM lineitem;

is rewritten as:

SELECT count(*) FROM
  (SELECT l_quantity FROM lineitem GROUP BY l_quantity) AS foo;
• The optimizer cannot rewrite multiple count distincts
   Re-write the query to push each distinct into a
   subquery that inserts the rows into a temp table
   Then query the temp table that contains the distinct
   values to obtain the count
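A sketch of that manual rewrite, using the lineitem table from the earlier example (l_discount is an assumed second column, shown for illustration only):

```sql
-- Goal (not rewritten by the optimizer):
--   SELECT count(DISTINCT l_quantity), count(DISTINCT l_discount) FROM lineitem;

-- Push each DISTINCT into a temp table...
CREATE TEMPORARY TABLE tmp_qty AS
  SELECT l_quantity FROM lineitem GROUP BY l_quantity
  DISTRIBUTED BY (l_quantity);

CREATE TEMPORARY TABLE tmp_disc AS
  SELECT l_discount FROM lineitem GROUP BY l_discount
  DISTRIBUTED BY (l_discount);

-- ...then count the rows of distinct values from the temp tables
SELECT (SELECT count(*) FROM tmp_qty)  AS distinct_quantities,
       (SELECT count(*) FROM tmp_disc) AS distinct_discounts;
```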
Eliminate Large Table Broadcast Motion

-> Hash (cost=18039.92..18039.92 rows=20 width=66)
-> Redistribute Motion 24:24 (slice3) (cost=0.55..18039.92 rows=20 width=66)
Hash Key: cust_contact_activity.src_system_id::text
-> Hash Join (cost=0.55..18039.52 rows=20 width=66)
Hash Cond: cust_contact_activity.contact_id::text =
cust_contact.contact_id::text
-> Seq Scan on cust_contact_activity (cost=0.00..15953.63 rows=833663
width=39)
-> Hash (cost=0.25..0.25 rows=24 width=84)
-> Broadcast Motion 24:24 (slice2) (cost=0.00..0.25 rows=24 width=84)
-> Seq Scan on cust_contact (cost=0.00..0.00 rows=1 width=84)
Use gp_segments_for_planner to increase the cost of the motion to favor
a redistribute motion

A small table broadcast is acceptable
Increase work_mem

-> HashAggregate (cost=74852.40..84739.94 rows=791003 width=45)
Group By: l_orderkey, l_partkey, l_comment
Rows out: 2999671 rows (seg1) with 13345 ms to first row, 71558 ms to end, start offset by 3.533
ms.
Executor memory: 2645K bytes avg, 5019K bytes max (seg1).
Work_mem used: 2321K bytes avg, 4062K bytes max (seg1).
Work_mem wanted: 237859K bytes avg, 237859K bytes max (seg1) to lessen workfile I/O
affecting 1 workers.
. . .
-> Seq Scan on lineitem (cost=0.00..44855.70 rows=2999670 width=45)
Rows out: 2999671 rows (seg1) with 0.571 ms to first row, 4167 ms to end, start offset by 4.105
Slice statistics:
(slice0) Executor memory: 211K bytes.
(slice1) * Executor memory: 2840K bytes avg x 2 workers, 5209K bytes max (seg1).
Work_mem: 4062K bytes max, 237859K bytes wanted.
Settings: work_mem=4MB
Total runtime: 73326.082 ms
(24 rows)
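Based on the "Work_mem wanted" figure in the plan above, one way to address this for a single session might be (the value below is illustrative, not a recommendation):

```sql
-- Plan reported ~238MB wanted per segment; raise work_mem for this session only
SET work_mem = '256MB';

-- Re-run the query, then verify "Work_mem wanted" no longer exceeds "used"
EXPLAIN ANALYZE
SELECT l_orderkey, l_partkey, l_comment
FROM lineitem
GROUP BY l_orderkey, l_partkey, l_comment;

RESET work_mem;  -- return to the server default
```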
Spill Files

• Operations performed in memory are optimal
   Hash Join, Hash Agg, Sort, etc.

• If there is not sufficient memory, rows will be written out
to disk as spill files
   There is a certain amount of overhead with any disk I/O operation

• Spill files are located within the pgsql_tmp directory for
the database
[jesh:~/gpdb-data/seg1/base/16571/pgsql_tmp] ls -ltrh
total 334464
-rw------- 1 jeshle jeshle 163M Jan 9 23:41 pgsql_tmp_SortTape_Slice1_14022.205
[jesh:~/gpdb-data/seg1/base/16571/pgsql_tmp] ls -ltrh
total 341632
-rw------- 1 jeshle jeshle 167M Jan 9 23:41 pgsql_tmp_SortTape_Slice1_14022.205
[jesh:~/gpdb-data/seg1/base/16571/pgsql_tmp] ls -ltrh
total 341632
-rw------- 1 jeshle jeshle 167M Jan 9 23:41 pgsql_tmp_SortTape_Slice1_14022.205
[jesh:~/gpdb-data/seg1/base/16571/pgsql_tmp]
Identify Respill in Hash Agg Operations . . .
-> HashAggregate (cost=74852.40..84739.94 rows=791003 width=45)
Group By: l_orderkey, l_partkey, l_comment
Rows out: 2999671 rows (seg1) with 13345 ms to first row, 71558 ms to end . . .
Executor memory: 2645K bytes avg, 5019K bytes max (seg1).
Work_mem used: 2321K bytes avg, 4062K bytes max (seg1).
Work_mem wanted: 237859K bytes avg, 237859K bytes max (seg1) to lessen workfile I/O
affecting 1 workers.
(seg1) 2999671 groups total in 5 batches; 64 respill passes; 23343536 respill rows.
(seg1) Initial pass: 44020 groups made from 44020 rows; 2955651 rows spilled to workfile.
(seg1) Hash chain length 5.0 avg, 18 max, using 602986 of 607476 buckets.
-> Seq Scan on lineitem (cost=0.00..44855.70 rows=2999670 width=45)
Rows out: 2999671 rows (seg1) with 0.571 ms to first row, 4167 ms to end, start offset by 4.105
Slice statistics:
(slice0) Executor memory: 211K bytes.
(slice1) * Executor memory: 2840K bytes avg x 2 workers, 5209K bytes max (seg1). Work_mem: 4062K
bytes max, 237859K bytes wanted.
Settings: work_mem=4MB
Eliminate Respill in Hash Agg Operations

-> HashAggregate (cost=74852.40..84739.94 rows=791003 width=45)
Group By: l_orderkey, l_partkey, l_comment
Rows out: 2999671 rows (seg1) with 9374 ms to first row, 19169 ms to end, start offset by ms.
Executor memory: 5690K bytes avg, 8063K bytes max (seg1).
Work_mem used: 2321K bytes avg, 4062K bytes max (seg1) .
Work_mem wanted: 237859K bytes avg, 237859K bytes max (seg1) to lessen workfile I/O
affecting 1 workers.
(seg1) 2999671 groups total in 100 batches; 0 respill passes; 0 respill rows.
(seg1) Initial pass: 44020 groups made from 44020 rows; 2955651 rows spilled to workfile.
(seg1) Hash chain length 3.5 avg, 16 max, using 850893 of 880400 buckets.
-> Seq Scan on lineitem (cost=0.00..44855.70 rows=2999670 width=45)
Rows out: 2999671 rows (seg1) with 0.480 ms to first row, 4199 ms to end, start offset by 18 ms.
Slice statistics:
(slice0) Executor memory: 211K bytes.
(slice1) * Executor memory: 5884K bytes avg x 2 workers, 8254K bytes max (seg1). Work_mem: 4062K
bytes max, 237859K bytes wanted.
Settings: work_mem=4MB
Total runtime: 20987.615 ms
Use gp_hashagg_spillbatch_min and gp_hashagg_spillbatch_max
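For example, to force a larger number of spill batches for the session (the values below are illustrative; tune them against the batch and respill counts reported in the plan):

```sql
-- The corrected plan above used 100 batches with 0 respill passes
SET gp_hashagg_spillbatch_min = 100;
SET gp_hashagg_spillbatch_max = 100;
```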
Review Join Order
Aggregate (cost=16141.02..16141.03 rows=1 width=0)
-> Gather Motion 2:1 (slice2) (cost=16140.97..16141.00 rows=1 width=0)
-> Aggregate (cost=16140.97..16140.98 rows=1 width=0)
-> Hash Join (cost=2999.46..15647.43 rows=197414 width=0)
Hash Cond: ps.ps_partkey = part.p_partkey
-> Hash Join (cost=240.46..9920.71 rows=200020 width=4)
Hash Cond: ps.ps_suppkey = supplier.s_suppkey
-> Seq Scan on partsupp ps (cost=0.00..6180.00 rows=400000
width=8)
-> Hash (cost=177.97..177.97 rows=4999 width=4)
-> Broadcast Motion 2:2 (slice1) (cost=0.00..177.97 rows=4999
width=4)
-> Seq Scan on supplier (cost=0.00..77.99 rows=4999
width=4)
-> Hash (cost=1509.00..1509.00 rows=100000 width=4)
-> Seq Scan on part (cost=0.00..1509.00 rows=100000 width=4)
Build and probe on the
largest table(s) last
Use join_collapse_limit to Specify Join Order
• Use join_collapse_limit to specify join order
• Must be ANSI style join
• For example
SELECT COUNT(*) FROM partsupp ps
JOIN supplier ON ps_suppkey = s_suppkey
JOIN part ON p_partkey = ps_partkey;
• Non-ANSI style join:
SELECT COUNT(*) FROM partsupp, supplier, part
WHERE ps_suppkey = s_suppkey
AND p_partkey = ps_partkey;
• Set join_collapse_limit = 1
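A sketch of forcing the written join order for the ANSI-style query above:

```sql
SET join_collapse_limit = 1;

-- With the limit at 1, the planner joins the tables in the order written:
-- partsupp first, then supplier, then part
SELECT COUNT(*) FROM partsupp ps
JOIN supplier ON ps_suppkey = s_suppkey
JOIN part ON p_partkey = ps_partkey;

RESET join_collapse_limit;
```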
Analyzing Query Plans Summary

• Identify plan nodes with a large number of rows and high cost
• Validate partitions are being eliminated
• Eliminate large table broadcast motion
• Replace slow operators with fast operators
   Nested Loop Join, Sort Merge Join, Sort, Agg
• Identify spill files and increase memory
   work_mem used versus work_mem wanted
• Eliminate respill in HashAggregate
• Review join order
LAB
Module 5
Explain the Explain Plan: Analyzing Queries
Module 6
Improve Performance With Statistics
EXPLAIN ANALYZE Estimated Costs
• EXPLAIN ANALYZE provides cost estimates for
the execution of the plan node as follows
• Cost
   Measured in units of disk page fetches
• Rows
   The number of rows output by the plan node
• Width
   Total bytes of all the rows output by the plan node
ANALYZE and Database Statistics
• Database statistics used by the optimizer and
query planner
• Run ANALYZE after loading data
• Run ANALYZE after large INSERT, UPDATE and
DELETE operations
• Run ANALYZE after CREATE INDEX operations
• Run ANALYZE after database restores from backups

ANALYZE is the most important ongoing
administrative task for optimal query performance
ANALYZE and VACUUM ANALYZE
• Use the ANALYZE command to generate
database statistics May specify table and column names
ANALYZE [table [ (column [, ...] ) ]]
• Use VACUUM ANALYZE to vacuum the database
and generate database statistics May specify table and column names
VACUUM ANALYZE [table [(column [, ...] )]]
Use SET STATISTICS to Increase Sampling

• Use ALTER TABLE … SET STATISTICS to increase
sampling for statistics collected for a given column
   The default statistics value is 25

• Increasing the statistics value may improve query
planner estimates
   For columns used in query predicates and JOINs
   (WHERE clause)

• Example:
ALTER TABLE customer ALTER customer_id
SET STATISTICS 35;
• Larger values increase the time it takes to
ANALYZE
default_statistics_target
• Use default_statistics_target server configuration
parameter to increase sampling for statistics
collected for ALL columns
• Increasing the target value may improve query
planner estimates
• Default is 25
• Larger values increase the time it takes to
ANALYZE
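For example, at the session level (100 is an illustrative value; the parameter can also be set server-wide):

```sql
SET default_statistics_target = 100;
ANALYZE lineitem;  -- statistics for all columns now use the larger sample
```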
gp_analyze_relative_error
• The gp_analyze_relative_error server
configuration parameter sets the estimated
acceptable error in the cardinality of the table
• A value of 0.5 is equivalent to an acceptable
error of 50%
• The default value is 0.5
• Decreasing the relative error fraction (accepting
less error) will increase the number of rows
sampled
gp_analyze_relative_error = .25
ANALYZE Best Practices
• For large data warehouse environments it may
not be feasible to run ANALYZE on the entire
database or on all columns in a table
• Run ANALYZE for:
   Columns used in a JOIN condition
   Columns used in a WHERE clause
   Columns used in a SORT clause
   Columns used in a GROUP BY or HAVING clause
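For example, restricting ANALYZE to the join and predicate columns of a large fact table (the table and column names here are illustrative):

```sql
-- Sample only the columns the planner actually uses in this workload
ANALYZE facts.transaction (transactionid, storeid);
```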
LAB
Module 6
Improve Performance With Statistics
Module 7
Indexing Strategies
Indexes

• Most data warehouse environments operate on
large volumes of data
   Low selectivity
   A sequential scan is the preferred method to read the data in a
   Greenplum MPP environment

• For queries with high selectivity, indexes may
improve performance
   Avoid indexes on frequently updated columns
   Avoid overlapping indexes
   Use bitmap indexes for columns with very low cardinality
   Drop indexes before loading data and recreate them after the load
   Analyze after re-creating indexes
Greenplum Index Types
• B-Tree
   Use for fairly unique columns
   Use for columns accessed by single row queries
• Bitmap
   Use for low cardinality columns
   Use when the column is often included in a predicate
• Hash
   Available, but not recommended
Create Index Syntax
CREATE [UNIQUE] INDEX [CONCURRENTLY]
name ON table
[USING method] ( {column | (expression)} [opclass]
[, ...] )
[ WITH ( FILLFACTOR = value ) ]
[ WHERE predicate ]
B-Tree Index
• Supports single value row lookups
• Can be Unique or Non-Unique
• Can be Single or Multi-Column If multi-column, all columns in the index must be
included in the predicate for the index to be used.
EXAMPLE:
CREATE INDEX transid_btridx
ON facts.transaction
USING BTREE (transactionid)
;
Bitmap Index
• Single column index is recommended
• Provides very fast retrieval
• Low cardinality columns
   Gender
   State / Province Code
EXAMPLE:
CREATE INDEX store_pharm_bmidx ON dimensions.store
USING BITMAP (pharmacy);
CREATE INDEX store_grocery_bmidx ON dimensions.store
USING BITMAP (grocery);
CREATE INDEX store_deli_bmidx ON dimensions.store
USING BITMAP (deli);
Index on Expressions
• Only use when the expression appears often in
query predicates.
• Very high overhead maintaining the index
during insert and update operations.
EXAMPLE:
CREATE INDEX lcase_storename_idx
ON store (LOWER(storename));
SUPPORTS:
SELECT * FROM store WHERE LOWER(storename) = 'top foods';
Index with Predicate (Partial Index)
• Pre-selects rows based on predicate
• Use to select small numbers of rows from large
tables
EXAMPLE:

CREATE INDEX canada_stores_idx
ON facts.transaction (storeid)
WHERE storeid IN (8, 32);
To Index or Not to Index
• Will the optimizer use the index?
   Test it with production volumes
   No hints in Greenplum!

• Is the column(s) used in query predicates?
   Does the frequency of use justify the overhead?
   Is the space available?

• Should a unique B-Tree index be the primary key?
   Consider a Primary Key constraint instead
Index Costs
• Indexes are not free!
• Indexes take space
• Indexes incur overhead during inserts and
updates
• Indexes incur processing overhead during
creation
Index Best Practices
• Only create indexes when full table scans do
not perform well.
• Name the index.
• Use B-Tree indexes for single column lookups
• Use Bitmap Indexes for low cardinality columns
• Use Partial Indexes for queries that frequently
access small subsets of much larger data sets.
LAB
Module 7
Indexing Strategies
Module 8
Advanced Reporting Using OLAP
What is OLAP?
• On-Line Analytic Processing
   Quickly provides answers to multi-dimensional queries

• "Window" Functions
   Allow access to multiple rows in a single pass

• OLAP Grouping Sets
   More flexible than a standard GROUP BY
SELECT - OLAP Grouping Extensions
• Standard GROUP BY Example
• ROLLUP
• GROUPING SETS
• CUBE
• grouping(column [, ...]) function
• group_id() function
Standard GROUP BY Example
summarize product sales by vendor…
SELECT pn, vn, sum(prc*qty)
FROM sale
GROUP BY pn, vn
ORDER BY 1,2,3;
pn | vn | sum
-----+----+---------
100 | 20 | 0
100 | 40 | 2640000
200 | 10 | 0
200 | 40 | 0
300 | 30 | 0
400 | 50 | 0
500 | 30 | 120
600 | 30 | 60
700 | 40 | 1
800 | 40 | 1
(10 rows)
Standard GROUP BY Example Continued
now include subtotals and grand total…
SELECT pn, vn, sum(prc*qty)
FROM sale
GROUP BY pn, vn
UNION ALL
SELECT pn, null, sum(prc*qty)
FROM sale
GROUP BY pn
UNION ALL
SELECT null, null, sum(prc*qty)
FROM sale
ORDER BY 1,2,3;
pn | vn | sum
-----+----+---------
100 | 20 | 0
100 | 40 | 2640000
100 | | 2640000
200 | 10 | 0
200 | 40 | 0
200 | | 0
300 | 30 | 0
300 | | 0
400 | 50 | 0
400 | | 0
500 | 30 | 120
500 | | 120
600 | 30 | 60
600 | | 60
700 | 40 | 1
700 | | 1
800 | 40 | 1
800 | | 1
| | 2640182
(19 rows)
ROLLUP Example
the same query using ROLLUP…
SELECT pn, vn, sum(prc*qty)
FROM sale
GROUP BY ROLLUP(pn, vn)
ORDER BY 1,2,3;
pn | vn | sum
-----+----+---------
100 | 20 | 0
100 | 40 | 2640000
100 | | 2640000
200 | 10 | 0
200 | 40 | 0
200 | | 0
300 | 30 | 0
300 | | 0
400 | 50 | 0
400 | | 0
500 | 30 | 120
500 | | 120
600 | 30 | 60
600 | | 60
700 | 40 | 1
700 | | 1
800 | 40 | 1
800 | | 1
| | 2640182
(19 rows)
GROUPING SETS Example
the same query using GROUPING SETS…
SELECT pn, vn, sum(prc*qty)
FROM sale
GROUP BY GROUPING SETS
( (pn, vn), (pn), () )
ORDER BY 1,2,3;
pn | vn | sum
-----+----+---------
100 | 20 | 0
100 | 40 | 2640000
100 | | 2640000
200 | 10 | 0
200 | 40 | 0
200 | | 0
300 | 30 | 0
300 | | 0
400 | 50 | 0
400 | | 0
500 | 30 | 120
500 | | 120
600 | 30 | 60
600 | | 60
700 | 40 | 1
700 | | 1
800 | 40 | 1
800 | | 1
| | 2640182
(19 rows)
CUBE Example

CUBE creates subtotals for all possible
combinations of grouping columns, so…

SELECT pn, vn, sum(prc*qty)
FROM sale
GROUP BY CUBE(pn, vn)
ORDER BY 1,2,3;

is the same as…

SELECT pn, vn, sum(prc*qty)
FROM sale
GROUP BY GROUPING SETS
( (pn, vn), (pn), (vn), () )
ORDER BY 1,2,3;
pn | vn | sum
-----+----+---------
100 | 20 | 0
100 | 40 | 2640000
100 | | 2640000
200 | 10 | 0
200 | 40 | 0
200 | | 0
300 | 30 | 0
300 | | 0
400 | 50 | 0
400 | | 0
500 | 30 | 120
500 | | 120
600 | 30 | 60
600 | | 60
700 | 40 | 1
700 | | 1
800 | 40 | 1
800 | | 1
| 10 | 0
| 20 | 0
| 30 | 180
| 40 | 2640002
| 50 | 0
| | 2640182
(24 rows)
GROUPING Function Example
SELECT * FROM dsales_null;
store | customer | product | price
-------+----------+---------+-------
s2 | c1 | p1 | 90
s2 | c1 | p2 | 50
s2 | | p1 | 44
s1 | c2 | p2 | 70
s1 | c3 | p1 | 40
(5 rows)
SELECT store,customer,product, sum(price),
grouping(customer)
FROM dsales_null
GROUP BY ROLLUP(store,customer,product);
store | customer | product | sum | grouping
-------+----------+---------+-----+----------
s1 | c2 | p2 | 70 | 0
s1 | c2 | | 70 | 0
s1 | c3 | p1 | 40 | 0
s1 | c3 | | 40 | 0
s1 | | | 110 | 1
s2 | c1 | p1 | 90 | 0
s2 | c1 | p2 | 50 | 0
s2 | c1 | | 140 | 0
s2 | | p1 | 44 | 0
s2 | | | 44 | 0
s2 | | | 184 | 1
| | | 294 | 1
(12 rows)
Distinguishes NULL values from summary
markers…
GROUP_ID Function

• Returns 0 for each output row in a unique grouping set
• Assigns a serial number >0 to each duplicate grouping set found
• Useful when combining grouping extension clauses
• Can be used to filter output rows of duplicate grouping sets:
SELECT a, b, c, sum(p*q), group_id()
FROM sales
GROUP BY ROLLUP(a,b), CUBE(b,c)
HAVING group_id() < 1
ORDER BY a, b, c;
SELECT – OLAP Windowing Extensions
• About Window Functions
• Constructing a Window Specification
   OVER clause
   WINDOW clause
• Built-in Window Functions
About Window Functions

• New class of function allowed only in the SELECT list

• Returns a value per row (unlike aggregate functions)

• Results are interpreted in terms of the current row and its
corresponding window partition or frame

• Characterized by the use of the OVER clause
   Defines the window partitions (groups of rows) to apply the function to
   Defines ordering of data within a window
   Defines the positional or logical framing of a row with respect to its window
Defining Window Specifications (OVER Clause)
• All window functions have an OVER() clause
   Specifies the 'window' of data to which the function applies
   Defines:
      Window partitions (PARTITION BY clause)
      Ordering within a window partition (ORDER BY clause)
      Framing within a window partition (ROWS/RANGE clauses)
OVER (PARTITION BY…) Example
SELECT *, row_number() OVER()
FROM sale
ORDER BY cn;

SELECT *, row_number() OVER(PARTITION BY cn)
FROM sale
ORDER BY cn;
row_number | cn | vn | pn | dt | qty | prc
------------+----+----+-----+------------+------+------
1 | 1 | 10 | 200 | 1401-03-01 | 1 | 0
2 | 1 | 30 | 300 | 1401-05-02 | 1 | 0
3 | 1 | 50 | 400 | 1401-06-01 | 1 | 0
4 | 1 | 30 | 500 | 1401-06-01 | 12 | 5
5 | 1 | 20 | 100 | 1401-05-01 | 1 | 0
1 | 2 | 50 | 400 | 1401-06-01 | 1 | 0
2 | 2 | 40 | 100 | 1401-01-01 | 1100 | 2400
1 | 3 | 40 | 200 | 1401-04-01 | 1 | 0
(8 rows)
row_number | cn | vn | pn | dt | qty | prc
------------+----+----+-----+------------+------+------
1 | 1 | 10 | 200 | 1401-03-01 | 1 | 0
2 | 1 | 30 | 300 | 1401-05-02 | 1 | 0
3 | 1 | 50 | 400 | 1401-06-01 | 1 | 0
4 | 1 | 30 | 500 | 1401-06-01 | 12 | 5
5 | 1 | 20 | 100 | 1401-05-01 | 1 | 0
6 | 2 | 50 | 400 | 1401-06-01 | 1 | 0
7 | 2 | 40 | 100 | 1401-01-01 | 1100 | 2400
8 | 3 | 40 | 200 | 1401-04-01 | 1 | 0
(8 rows)
OVER (ORDER BY…) Example

SELECT vn, sum(prc*qty)
FROM sale
GROUP BY vn
ORDER BY 2 DESC;

 vn | sum
----+---------
 40 | 2640002
 30 | 180
 50 | 0
 20 | 0
 10 | 0
(5 rows)

SELECT vn, sum(prc*qty),
       rank() OVER (ORDER BY sum(prc*qty) DESC)
FROM sale
GROUP BY vn
ORDER BY 2 DESC;

 vn | sum | rank
----+---------+------
 40 | 2640002 | 1
 30 | 180 | 2
 50 | 0 | 3
 20 | 0 | 3
 10 | 0 | 3
(5 rows)
OVER (…ROWS…) Example

Window Framing: "Box car" Average

SELECT vn, dt,
       AVG(prc*qty) OVER (
           PARTITION BY vn
           ORDER BY dt
           ROWS BETWEEN
               2 PRECEDING AND
               2 FOLLOWING)
FROM sale;
vn | dt | avg
----+------------+---------
10 | 03012008 | 30
20 | 05012008 | 20
30 | 05022008 | 0
30 | 06012008 | 60
30 | 06012008 | 60
30 | 06012008 | 60
40 | 06012008 | 140
40 | 06042008 | 90
40 | 06052008 | 120
40 | 06052008 | 100
50 | 06012008 | 30
50 | 06012008 | 10
(12 rows)
Built-In Window Functions

cume_dist()
dense_rank()
first_value(expr)
lag(expr [,offset] [,default])
last_value(expr)
lead(expr [,offset] [,default])
ntile(expr)
percent_rank()
rank()
row_number()
* Any aggregate function (used with the OVER clause) can also be used as a window function
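For instance, a framed aggregate and a navigation function can be combined in a single query. A sketch against the sale table used in the earlier examples (running_amt and prev_dt are illustrative aliases):

```sql
-- Per-vendor running revenue total, plus the previous sale date
SELECT vn, dt, prc*qty AS amt,
       SUM(prc*qty) OVER (PARTITION BY vn ORDER BY dt) AS running_amt,
       lag(dt, 1)   OVER (PARTITION BY vn ORDER BY dt) AS prev_dt
FROM sale;
```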
Designating the “Moving” Window
<function> OVER (
    PARTITION BY column1..n   -- acts like a "group by" for the function
    ORDER BY ...              -- required by some functions
    ROWS BETWEEN ...          -- window frame; see below
)

ROWS BETWEEN accepts the following forms:

UNBOUNDED PRECEDING AND CURRENT ROW
N PRECEDING AND CURRENT ROW
CURRENT ROW AND N FOLLOWING
N PRECEDING AND N FOLLOWING
Moving Window Example

REQUIREMENT: By store, rank all customers for the month of June.

EXAMPLE:
SELECT s.storename
,c.custname
,t.totalsalesamt
,RANK() OVER (PARTITION BY t.storeid
ORDER BY t.totalsalesamt DESC) AS ranking
FROM (SELECT storeid
,customerid
,SUM(salesamt) AS totalsalesamt
FROM transaction
WHERE transdate BETWEEN '2008-06-01' AND '2008-06-30'
GROUP BY 1,2 ) t
INNER JOIN store s ON s.storeid = t.storeid
INNER JOIN customer c ON c.customerid = t.customerid
;
Global Window Specifications (WINDOW clause)
Useful for queries with multiple window functions:
  - Define and name a window specification
  - Reuse the window specification throughout the query
EXAMPLE:
SELECT RANK() OVER (ORDER BY pn),
       SUM(prc*qty) OVER (ORDER BY pn),
       AVG(prc*qty) OVER (ORDER BY pn)
FROM sale;

SELECT RANK() OVER (w1),
       SUM(prc*qty) OVER (w1),
       AVG(prc*qty) OVER (w1)
FROM sale
WINDOW w1 AS (ORDER BY pn);
LAB
Module 8
Advanced Reporting Using OLAP
Module 9
Using Temporary Tables
Temporary Tables
• Local to the session: data may not be shared between sessions
• Temporary table DDL executed at run time
• Dropped when session ends
Why Use Temporary Tables
• When performing complicated transformations
• When multiple result sets need to be combined
• To facilitate subsequent join operations: move data to get same-segment joins
Features
• Distributed the same as any Greenplum table
• May be indexed
• May be Analyzed
Guidelines
• Always designate DISTRIBUTED BY column(s); otherwise the table will default to distribution on the first valid column
• Once data is inserted, ANALYZE the table (not available for temporary tables in other RDBMSs!)
• Same CREATE DDL as a permanent table
• May not create a temporary external table
CREATE Temporary Table Example

CREATE TEMPORARY TABLE monthlytranssummary (
storeid INTEGER,
customerid INTEGER,
transmonth SMALLINT,
salesamttot DECIMAL(10,2)
)
ON COMMIT PRESERVE ROWS
DISTRIBUTED BY (storeid,customerid)
;
ON COMMIT PRESERVE ROWS is the default action when creating
temporary tables.
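Following the guidelines above, the table would then be populated and analyzed. A sketch, assuming the transaction table and columns used elsewhere in this course:

```sql
INSERT INTO monthlytranssummary
SELECT storeid,
       customerid,
       EXTRACT(month FROM transdate)::smallint,
       SUM(salesamt)
FROM transaction
GROUP BY 1, 2, 3;

-- Collect statistics so the optimizer can plan against the temp table
ANALYZE monthlytranssummary;
```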
LAB
Module 9
Using Temporary Tables
Module 10
PostgreSQL Functions
Types of Functions
Greenplum supports several function types:
  - Query language functions (written in SQL)
  - Procedural language functions (written in, for example, PL/pgSQL or PL/Tcl)
  - Internal functions
  - C-language functions

This class will only present the first two types of functions: query language and procedural language.
Query Language Function - Rules
• Executes an arbitrary set of SQL commands
• Returns the results of the last query in the list
  - Returns the first row of the last query's result set
  - Can return a "set of" a previously defined data type
• The body of the function must be a list of SQL
statements followed by a semi-colon
Query Language Function – More Rules
• May contain SELECT statements
• May contain DML statements: INSERT, UPDATE, or DELETE
• May not contain: Rollback, Savepoint, Begin or Commit commands
• Last statement must be SELECT unless the
return type is void.
Query Language Function – Example
This function has no parameters and no return set:

CREATE FUNCTION public.clean_customer()
RETURNS void AS '
    DELETE FROM dimensions.customer
    WHERE state = ''NA'';
' LANGUAGE SQL;

It is executed as a SELECT statement:

SELECT public.clean_customer();

 clean_customer
----------------

(1 row)
Query Language Function - Notes
• Quoted values in the query must use the
doubled single quote syntax.
• One or more parameters may be passed to a function
  - The first parameter is referenced as $1
  - The second parameter is referenced as $2, and so forth
Query Language Function – With Parameter
CREATE FUNCTION public.clean_specific_customer(which char(2))
RETURNS void AS '
    DELETE FROM dimensions.customer
    WHERE state = $1;
' LANGUAGE SQL;

It is executed as a SELECT statement:

SELECT public.clean_specific_customer('NA');

 clean_specific_customer
-------------------------

(1 row)
SQL Functions on Base Types
Used to return a single value:

CREATE FUNCTION public.customer_cnt(char)
RETURNS bigint AS '
    SELECT COUNT(*)
    FROM dimensions.customer
    WHERE state = $1;
' LANGUAGE SQL;

It is executed as a SELECT statement:

SELECT public.customer_cnt('WA');

 customer_cnt
--------------
          176
SQL Functions on Composite Types
Allows you to pass in a composite-type parameter (such as a table's row type):

CREATE FUNCTION facts.viewnewtaxamt(transaction)
RETURNS numeric AS $$
    SELECT $1.taxamt * .90;
$$ LANGUAGE SQL;

Note the syntactical difference: the double $ signs (dollar quoting) instead of single quotes in the DDL, and how the function is called.

SELECT transid, transdate, taxamt,
       facts.viewnewtaxamt(transaction)
FROM transaction;
SQL Functions with Output Parameters
Very similar to the previous example, but with no RETURNS clause.

CREATE FUNCTION public.cust_cnt(IN whichstate char,
                                OUT MyCount bigint)
AS '
    SELECT COUNT(*)
    FROM dimensions.customer
    WHERE state = $1;
' LANGUAGE SQL;

It is executed as a SELECT statement with only the input parameters specified:

SELECT public.cust_cnt('WA');

 mycount
---------
     176
SQL Functions as Table Sources
All SQL functions can be used in the FROM clause of a query, but it is
particularly useful for functions returning composite types.
CREATE FUNCTION dimensions.getcust(int)
RETURNS customer AS $$
SELECT *
FROM dimensions.customer WHERE customerid = $1;
$$ LANGUAGE SQL;
Now we can access the columns in the function as if it were a table.
SELECT *,UPPER(custname)
FROM dimensions.getcust(100);
SQL Functions – Some Notes
• Note that functions operating on tables must be
created in the same schema as the table.
• To DROP the function, the DROP statement must include the parameter types.

EXAMPLE:
DROP FUNCTION dimensions.getcust(int);
SQL Functions Returning Sets
When an SQL function is declared as returning SETOF sometype, the
function’s final SELECT query is executed to completion, and each
row it outputs is returned as an element of the result set.
CREATE FUNCTION dimensions.getstatecust(char)
RETURNS SETOF customer AS $$
SELECT *
FROM dimensions.customer WHERE state = $1;
$$ LANGUAGE SQL;
SELECT *, UPPER(city)
FROM dimensions.getstatecust('WA');
This function returns all qualifying rows.
Function Overloading
• More than one function can be created with the same name
  - Must have different input parameter types or number of parameters
  - The optimizer selects which version to call based on the input data types and number of arguments
Function Overload - Example
CREATE FUNCTION dimensions.getcust(int)
RETURNS customer AS $$
SELECT *
FROM dimensions.customer WHERE customerid = $1;
$$ LANGUAGE SQL;
CREATE FUNCTION dimensions.getcust(char)
RETURNS customer AS $$
SELECT * FROM dimensions.customer WHERE state = $1;
$$ LANGUAGE SQL;
select *,upper(custname)from dimensions.getcust('WA') ;
select *,upper(custname) from dimensions.getcust(1);
Function Volatility Categories
• Volatility is a “promise” to the optimizer about
how the function will behave.
• Every function falls into one of these categories: VOLATILE, STABLE, or IMMUTABLE
Function Volatility - VOLATILE
• A VOLATILE function can do anything,
including modifying the database.
• It can return different results on successive
calls with the same arguments.
• The optimizer makes no assumptions about the
behavior of such functions.
• A query using a volatile function will re-
evaluate the function at every row where its
value is needed.
Function Volatility - IMMUTABLE
• An IMMUTABLE function cannot modify the database.
• It is guaranteed to return the same results given the
same arguments forever.
• This category allows the optimizer to pre-evaluate the
function when a query calls it with constant arguments.
• For example, a query like SELECT ... WHERE x = 2 + 2; can be
simplified on sight to SELECT ... WHERE x = 4; because the
function underlying the integer addition operator is
marked IMMUTABLE.
Function Volatility - STABLE
• A STABLE function cannot modify the
database.
• It is guaranteed to return the same results
given the same arguments for all rows within a
single statement.
• This category allows the optimizer to optimize
multiple calls of the function to a single call.
• In particular, it is safe to use an expression
containing such a function in an index scan
condition.
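The category is declared as part of the CREATE FUNCTION statement; if none is given, VOLATILE is assumed. A sketch (add_one and state_cnt are illustrative names):

```sql
-- Pure arithmetic: safe to pre-evaluate with constant arguments
CREATE FUNCTION add_one(int) RETURNS int AS $$
    SELECT $1 + 1;
$$ LANGUAGE SQL IMMUTABLE;

-- Reads but does not modify the database: stable within one statement
CREATE FUNCTION state_cnt(char) RETURNS bigint AS $$
    SELECT COUNT(*) FROM dimensions.customer WHERE state = $1;
$$ LANGUAGE SQL STABLE;
```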
Procedural Language Functions
• PL/pgSQL: Procedural Language for Greenplum
  - Can be used to create functions and procedures
  - Adds control structures to the SQL language
  - Can perform complex computations
  - Inherits all user-defined types, functions, and operators
  - Can be defined to be trusted by the server
  - Is (relatively) easy to use
Advantages of PLpgSQL Functions
• There can be considerable savings of client/server communication overhead:
  - Extra round trips between client and server are eliminated
  - Intermediate results that the client does not need do not have to be marshaled or transferred between server and client
  - Multiple rounds of query parsing can be avoided
• This can result in a considerable performance increase compared to an application that does not use stored functions.
Structure of PL/pgSQL
• PL/pgSQL is a block-structured language.

CREATE FUNCTION somefunc() RETURNS integer AS $$
<<outerblock>>
DECLARE
    quantity integer := 30;
BEGIN
    RAISE NOTICE 'Quantity here is %', quantity;  -- Prints 30
    quantity := 50;
    -- Create a subblock
    DECLARE
        quantity integer := 80;
    BEGIN
        RAISE NOTICE 'Quantity here is %', quantity;  -- Prints 80
        RAISE NOTICE 'Outer quantity here is %', outerblock.quantity;  -- Prints 50
    END;
    RAISE NOTICE 'Quantity here is %', quantity;  -- Prints 50
    RETURN quantity;
END;
$$ LANGUAGE plpgsql;

Note the <<outerblock>> label, which is required to reference outerblock.quantity from inside the subblock.
PL/pgSQL Declarations
• All variables used in a block must be declared in the declarations
section of the block.
• SYNTAX:
  name [ CONSTANT ] type [ NOT NULL ] [ { DEFAULT | := } expression ];

• EXAMPLES:
  user_id integer;
quantity numeric(5);
url varchar;
myrow tablename%ROWTYPE;
myfield tablename.columnname%TYPE;
somerow RECORD;
PL/pgSQL %Types
• %TYPE provides the data type of a variable or
table column.
• You can use this to declare variables that will
hold database values.
SYNTAX: variable%TYPE
EXAMPLES:
custid customer.customerid%TYPE
tid transaction.transid%TYPE
PL/pgSQL Row Types
• A row variable can be declared to have the
same type as the rows of an existing table or
view.
• Only the user-defined columns of a table row
are accessible in a row-type variable, not the
OID or other system columns.
SYNTAX: table_name%ROWTYPE
EXAMPLES:
cust customer%ROWTYPE
trans transaction%ROWTYPE
PL/pgSQL Record Types
• Record variables are similar to row-type
variables, but they have no pre-defined
structure.
• They take on the actual row structure of the row
they are assigned during a SELECT or FOR
command. (It is a place holder.)
• The substructure of a record variable can
change each time it is assigned to.
SYNTAX: name RECORD;
PL/pgSQL Basic Statements
• Assignment
variable := expression;
• Executing a query and discarding the result (no RETURN)
  PERFORM myfunction(myparm1, myparm2);
• Single-row results
  SELECT ... INTO mytarget;
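The statement forms above can be combined in a minimal sketch (get_cust_state is an illustrative name; the customer table follows the earlier examples):

```sql
CREATE FUNCTION get_cust_state(p_id int) RETURNS char AS $$
DECLARE
    v_state customer.state%TYPE;   -- declared with %TYPE, as shown earlier
BEGIN
    SELECT state INTO v_state      -- single-row result into a variable
    FROM dimensions.customer
    WHERE customerid = p_id;
    RETURN v_state;
END;
$$ LANGUAGE plpgsql;
```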
PL/pgSQL – Executing Dynamic SQL
• Oftentimes you will want to generate dynamic
commands inside your PL/pgSQL functions
• This is for commands that will involve different
tables or different data types each time they are
executed.
• To handle this sort of problem, the EXECUTE
statement is provided: EXECUTE command-string [ INTO [STRICT] target ];
PL/pgSQL – Dynamic SQL Example
Dynamic values that are to be inserted into the constructed query
require careful handling since they might themselves contain quote
characters.
EXECUTE 'UPDATE tbl SET '
     || quote_ident(colname)
     || ' = '
     || quote_literal(newvalue)
     || ' WHERE key = '
     || quote_literal(keyvalue);
PL/pgSQL – Getting the Results
• There are several ways to determine the effect
of a command.
• The first method is to use the GET
DIAGNOSTICS command, which has the form:
GET DIAGNOSTICS variable = item [ , ... ];
• This command allows retrieval of system status
indicators.
GET DIAGNOSTICS integer_var = ROW_COUNT;
PL/pgSQL – Getting the Results (continued)
• The second method to determine the effects of a command is to check the special variable named FOUND, which is of type boolean.
• FOUND starts out false within each PL/pgSQL function call.
PL/pgSQL – FOUND
• FOUND is set by:
  - A SELECT INTO statement: true if a row is assigned, false if no row is returned.
  - A PERFORM statement: true if it produces (and discards) one or more rows, false if no row is produced.
  - UPDATE, INSERT, and DELETE statements: true if at least one row is affected, false if no row is affected.
  - A FETCH statement: true if it returns a row, false if no row is returned.
  - A MOVE statement: true if it successfully repositions the cursor, false otherwise.
  - A FOR statement: true if it iterates one or more times, else false.
PL/pgSQL – Control Structures
• Control structures are probably the most useful
(and important) part of PL/pgSQL.
• With PL/pgSQL’s control structures, you can
manipulate PostgreSQL data in a very flexible
and powerful way.
PL/pgSQL – Returning from a Function
• There are two commands available that allow you to return data from a function: RETURN and RETURN NEXT.
• RETURN with an expression terminates the function and returns the value of the expression to the caller.
• If you declared the function with output parameters, write just
RETURN with no expression. The current values of the output
parameter variables will be returned.
• If you declared the function to return void, a RETURN statement can
be used to exit the function early; but do not write an expression
following RETURN.
• The return value of a function cannot be left undefined.
PL/pgSQL – Returning from a Function

• RETURN NEXT and RETURN QUERY do not actually return from the function; they simply append zero or more rows to the function's result set.
EXAMPLE:
CREATE OR REPLACE FUNCTION getAllStores()
RETURNS SETOF store AS
$BODY$
DECLARE r store%rowtype;
BEGIN
FOR r IN SELECT * FROM store WHERE storeid > 0 LOOP
-- can do some processing here
RETURN NEXT r;
-- return current row of SELECT
END LOOP;
RETURN;
END
$BODY$
LANGUAGE plpgsql;
PL/pgSQL – Conditionals
IF-THEN-ELSE conditionals let you execute commands based on certain conditions.

IF boolean-expression THEN
    statements
[ ELSIF boolean-expression THEN
    statements ]
[ ELSIF boolean-expression THEN
    statements ]
[ ELSE
    statements ]
END IF;
TIP: Put the most commonly occurring condition first!
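A concrete sketch following the tip (v_state and v_rate are hypothetical variables):

```sql
IF v_state = 'CA' THEN        -- most commonly occurring condition first
    v_rate := 0.0725;
ELSIF v_state = 'WY' THEN     -- least common condition last
    v_rate := 0.04;
ELSE
    v_rate := 0.0;            -- always provide a default branch
END IF;
```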
PL/pgSQL – Simple Loops
• With the LOOP, EXIT, CONTINUE, WHILE, and FOR statements,
you can arrange for your PL/pgSQL function to repeat a series of
commands.
LOOP SYNTAX:
[ <<label>> ]
LOOP
statements
END LOOP [ label ];
PL/pgSQL – Simple Loops

EXIT EXAMPLES:

LOOP
-- some computations
IF count > 0
THEN EXIT;
-- exit loop
END IF;
END LOOP;
LOOP
-- some computations
EXIT WHEN count > 0;
-- same result as previous example
END LOOP;
BEGIN
-- some computations
IF stocks > 100000
THEN EXIT;
-- causes exit from the BEGIN block
END IF;
END;
PL/pgSQL – Simple Loops
CONTINUE can be used within all loops.
CONTINUE EXAMPLE:
LOOP
-- some computations
EXIT WHEN count > 100;
CONTINUE WHEN count < 50;
-- some computations for count IN [50 .. 100]
END LOOP;
PL/pgSQL – Simple Loops
The WHILE statement repeats a sequence of statements so long as
the boolean-expression evaluates to true. The expression is checked
just before each entry to the loop body.
WHILE EXAMPLE:
WHILE customerid < 50 LOOP
-- some computations
END LOOP;
PL/pgSQL – Simple For (integer) Loops
This form of FOR creates a loop that iterates over a range of integer
values.
EXAMPLE:
FOR i IN 1..3 LOOP
-- i will take on the values 1,2,3 within the loop
END LOOP;
FOR i IN REVERSE 3..1 LOOP
-- i will take on the values 3,2,1 within the loop
END LOOP;
FOR i IN REVERSE 8..1 BY 2 LOOP
-- i will take on the values 8,6,4,2 within the loop
END LOOP;
PL/pgSQL – Simple For (Query) Loops
Using a different type of FOR loop, you can iterate through the results
of a query and manipulate that data accordingly.
EXAMPLE:
CREATE FUNCTION nukedimensions() RETURNS integer AS $$
DECLARE
    mdims RECORD;
BEGIN
    FOR mdims IN SELECT c.relname
                 FROM pg_class c, pg_namespace n
                 WHERE n.oid = c.relnamespace
                   AND n.nspname = 'dimensions' LOOP
        -- Now "mdims" holds one row of the query result
        EXECUTE 'TRUNCATE TABLE dimensions.' || quote_ident(mdims.relname);
    END LOOP;
    RETURN 1;
END;
$$ LANGUAGE plpgsql;
PL/pgSQL – Trapping Errors
• By default, any error occurring in a PL/pgSQL function aborts
execution of the function, and indeed of the surrounding
transaction as well.
• You can trap errors and recover from them by using a BEGIN block
with an EXCEPTION clause.
BEGIN
statements
EXCEPTION WHEN condition [OR condition ... ] THEN
handler_statements
[ WHEN condition [ OR condition ... ] THEN
handler_statements ... ]
END;
PL/pgSQL – Common Exception Conditions
NO_DATA                    No data matches the predicates
CONNECTION_FAILURE         SQL cannot connect
DIVISION_BY_ZERO           Invalid division operation: 0 in the divisor
INTERVAL_FIELD_OVERFLOW    Insufficient precision for a timestamp
NULL_VALUE_VIOLATION       Tried to insert NULL into a NOT NULL column
RAISE_EXCEPTION            Raised by a RAISE EXCEPTION statement
PLPGSQL_ERROR              PL/pgSQL error encountered at run time
NO_DATA_FOUND              Query returned zero rows
TOO_MANY_ROWS              Anticipated a single-row return, received more than one row
Why Use Functions?
• To perform conditional logic
• To provide common SQL to the user community
  - Ensures a single version of the metrics
  - Prevents poorly designed user queries
• To execute complex report or transformation
logic
Note: We specifically did not cover cursors in this
course. Cursors can take a parallel process and make
it nicely serial!
User Defined Data Type
• Often a function will need to return a data type
other than one based on a table.
CREATE TYPE queryreturn AS (
Customerid BIGINT,
CustomerName CHARACTER VARYING,
TotalSalesAmt DECIMAL(12,2)
);
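A function can then return a set of this type (top_customers is an illustrative name; the casts make the output match the type's fields):

```sql
CREATE FUNCTION public.top_customers() RETURNS SETOF queryreturn AS $$
    SELECT c.customerid::bigint,
           c.custname::character varying,
           SUM(t.salesamt)::decimal(12,2)
    FROM dimensions.customer c
    JOIN transaction t ON t.customerid = c.customerid
    GROUP BY 1, 2;
$$ LANGUAGE SQL;

-- Used as a table source:
SELECT * FROM public.top_customers();
```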
LAB
Module 10
PostgreSQL Functions
Module 11
Controlling Access
System Level Security

• Problems
  - Privileged users (gpadmin/root) have access to everything (physical data, binaries, backups)
  - Sharing accounts makes auditing difficult
  - Users are tempted to run additional workloads, such as ETL, on the master node
  - Once logged into the master, all other hosts are available since ssh keys are exchanged
  - Network infrastructure is not under our control
  - System administration is not under our control
Database Security

• Problems
  - The gpadmin user has access to everything
  - Unless roles and privileges are configured, all users have access to everything
  - Unless resource queues are configured, there are no limits on what users can run
  - The default pg_hba.conf is loosely configured (see Security in GPDB.pdf)
  - Customers are tempted to share user accounts
Database Security

• Roles
  - Each person should have their own role
  - Roles should be further divided into groups, which are usually roles that cannot log in
  - Privileges should be granted at the group level whenever possible
  - Privileges should be as restrictive as possible
  - Column-level access can be accomplished with views
  - Roles are not related to OS users or groups
Creating Roles
• The Greenplum Database manages database access
permissions using the concept of roles.
• The concept of roles subsumes the concepts of users
and groups.
• A role can be a database user, a group, or both. Roles
can own database objects (for example, tables) and can
assign privileges on those objects to other roles to
control access to the objects.
• Roles can be members of other roles, thus a member
role can inherit the attributes and privileges of its
parent role.
Database Architecture
• Isolate users from the physical layer through views
  - Insulate users from changes to tables
  - Control access to columns and/or rows
  - Enforce the locking strategy
• Separate functional areas into separate schemas
  - Simplify security
  - Simplify backups/restores
[Diagram: Risk, Sales, and Mktg schemas of base tables, exposed to users only through corresponding Risk, Sales, and Mktg schemas of views]

A KEY COMPONENT OF ACCESS CONTROL
ROLES AS USERS AND GROUPS
• Every Greenplum Database system contains a set of database roles (users and groups).
  - Those roles are separate from the users and groups managed by the operating system on which the server runs.
  - However, for convenience you may want to maintain a relationship between operating system user names and Greenplum Database role names, since many client applications use the current operating system user name as the default.
  - A user is a "role" with a password!
Creating new roles (users)
• A user-level role is considered to be a database
role that can log in to the database and initiate
a database session. Therefore, when you create
a new user-level role using the CREATE ROLE
command, you must specify the LOGIN
privilege.
Example: CREATE ROLE apatel WITH LOGIN;
Role attributes
• A database role may have a number of
attributes that define what sort of tasks that
role can perform in the database.
• These attributes can be set at role creation time or by using the ALTER ROLE syntax.
Common user attributes
Role membership (groups)

• It is frequently convenient to group users together to ease management of object privileges: that way, privileges can be granted to, or revoked from, a group as a whole.
• In the Greenplum Database this is done by creating a role that represents the group, and then granting membership in the group role to individual user roles.
• Use the CREATE ROLE SQL command to create a new group role.
Example: CREATE ROLE admin CREATEROLE CREATEDB;
Once the group role exists, you can add and remove members (user roles) using the GRANT and REVOKE commands.
Example: GRANT admin TO john, sally;
REVOKE admin FROM bob;
Object level privileges
• For managing object privileges, you would then
grant the appropriate permissions to the group-
level role only.
• The member user roles then inherit the object
privileges of the group role.
Examples: GRANT ALL ON TABLE mytable TO admin;
GRANT ALL ON SCHEMA myschema TO admin;
GRANT ALL ON DATABASE mydb TO admin;
Managing object privileges
• When an object (table, view, sequence, database, function, language, schema, or tablespace) is created, it is assigned an owner.
• The owner is normally the role that executed the creation statement.
• For most kinds of objects, the initial state is that only the owner (or a superuser) can do anything with the object.
• To allow other roles to use it, privileges must be granted.
Object privileges
Security examples
CREATE ROLE admin CREATEROLE CREATEDB;
CREATE ROLE batch;
GRANT select, insert, update, delete
ON dimensions.customer TO batch;
CREATE ROLE batchuser LOGIN;
GRANT batch TO batchuser;
LAB
Module 11
Controlling Access
Module 12
Advanced SQL Topics&
Performance Tips
UPDATE and DELETE Performance
• Use TRUNCATE TABLE to delete all rows from a table
  - Do not DROP and recreate the table
• To perform an UPDATE or DELETE on many rows, insert the rows into a temp table and then perform the UPDATE or DELETE as a join
  - The WHERE condition must specify equality between the distribution columns of the target table and all joined tables
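The temp-table pattern can be sketched as follows (assuming transid is the distribution key of the transaction table; adjust to the actual key):

```sql
-- Stage the keys of the rows to delete
CREATE TEMPORARY TABLE trans_to_delete AS
SELECT transid
FROM transaction
WHERE transdate < '2007-01-01'
DISTRIBUTED BY (transid);

-- Equality on the distribution columns keeps the join on each segment
DELETE FROM transaction t
USING trans_to_delete d
WHERE t.transid = d.transid;
```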
Temporal and Boolean Data Types
• To avoid processing skew do not use temporal
data types (date, time, timestamp/datetime,
interval) for distribution columns
• To avoid data skew never use a boolean data
type (‘t’/’f’) for distribution columns
Avoid Approximate Numeric Data Types
• Avoid floating-point data types when possible
  - Use the absolute minimum precision required
• Do not use approximate numerics (double, float, etc.) for:
  - Distribution columns
  - Join columns
  - Columns that will be used for mathematical operations (SUM, AVG)
Data Types and Performance
• Use data types that will minimize disk storage requirements, minimize scan time, and maximize performance
• Choose the smallest integer type
  - For example, INTEGER, not NUMERIC(11,0)
• Choose integer types for distribution columns
• Choose the same integer types for join columns
Use CASE to Avoid Multiple Passes
• Use the CASE statement instead of DECODE
  - Put the most commonly occurring condition first
  - Put the least commonly occurring condition last
  - Always provide a default (ELSE) condition

EXAMPLE:
CASE
    WHEN state = 'CA'
        THEN state_population * 1.2
    ...
    WHEN state = 'WY'
        THEN state_population * 10.7
    ELSE 0
END
SET Operations
• SET operations work against queries, not
tables
• SET operations can improve query performance
• Greenplum set operations include:
  - UNION: ensures uniqueness in the result set
  - UNION ALL: allows duplicates; use with caution!
  - INTERSECT
  - EXCEPT (same as MINUS)
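A sketch of the two UNION forms against the sale table used earlier (the date literal is illustrative):

```sql
-- Vendors appearing in either period, duplicates removed (adds a dedup step)
SELECT vn FROM sale WHERE dt <  '2008-06-01'
UNION
SELECT vn FROM sale WHERE dt >= '2008-06-01';

-- Same rows with duplicates kept: cheaper, no sort/dedup step required
SELECT vn FROM sale WHERE dt <  '2008-06-01'
UNION ALL
SELECT vn FROM sale WHERE dt >= '2008-06-01';
```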
SET Operation Cautions
• Do not UNION a large number of queries against tables with millions (or billions) of rows
  - The optimizer will attempt to launch each query in parallel; you could run out of memory and temp space
  - Use a temporary table to store the results and populate it with separate queries
  - Use a CASE statement, if possible, to retrieve the data in one pass
Avoid Multiple DISTINCT Operations
Original Query:
SELECT customerId,
COUNT(DISTINCT transId),
COUNT(DISTINCT storeId)
FROM transaction
WHERE transDate > '03/20/2008'
AND transDate < '03/25/2008'
GROUP BY 1
ORDER BY 1
LIMIT 100;
Avoid Multiple DISTINCT Operations
More Efficient Version:

SELECT sub2.customerId,
sub2.cd_transId,
sub4.cd_storeId
FROM (SELECT customerId, COUNT(transId) as cd_transId
FROM ( SELECT customerId, transId
FROM transaction
WHERE transDate > '03/20/2008' AND transDate < '03/25/2008'
GROUP BY 1,2
) sub1
GROUP BY 1
) sub2,
(SELECT customerId, COUNT(storeId) as cd_storeId
FROM ( SELECT customerId, storeId
FROM transaction
WHERE transDate > '03/20/2008' AND transDate < '03/25/2008'
GROUP BY 1,2
) sub3
GROUP BY 1
) sub4
WHERE sub4.customerId = sub2.customerId
ORDER BY 1 LIMIT 100;
IsDate Function
Not part of Greenplum suite, this function works like the
Oracle “IsDate” function.
CREATE OR REPLACE FUNCTION public.isdate(text)
RETURNS boolean AS $BODY$
begin
perform $1::date;
return true;
exception when others then
return false;
end
$BODY$
LANGUAGE 'plpgsql' VOLATILE;
Using Arrays
With Greenplum it is possible to have an ARRAY data type.
• Arrays may not be the distribution key for a table
• You may not create indexes on ARRAY data type columns

EXAMPLE:
CREATE TEMPORARY TABLE dimensionsOID AS (
    SELECT ARRAY(SELECT c.OID
                 FROM pg_class c,
                      pg_namespace n
                 WHERE n.oid = c.relnamespace
                 AND n.nspname = 'dimensions')
    AS oidArray)
DISTRIBUTED RANDOMLY;
How to Parse Strings Efficiently
Here are some functions that will help parse a delimited string:

split_part(string, delimiter, occurrence)

EXAMPLE:
SELECT split_part('name1=value1&name2=value2', '&', 2);

   split_part
----------------
 name2=value2

string_to_array(string, delimiter or pattern matching expression)

EXAMPLE:
SELECT string_to_array('a,b,c,d,e,f,g', ',');

 string_to_array
------------------
 {a,b,c,d,e,f,g}
How to get Database Object Sizes
a. SELECT pg_size_pretty(pg_database_size('datamart'));
b. SELECT pg_size_pretty(pg_relation_size('dimension.customer'));
c. SELECT pg_relation_size('facts.transaction');
d. SELECT pg_database_size('datamart');
e. SELECT pg_relation_size(schemaname||'.'||tablename) FROM pg_tables;
f. gpsizecalc -x dbname -s g
g. gpsizecalc -a dbname -t public.tablename -s g
Skew Analysis
Finding uneven data distribution & fixing it!
Command: gpskew
The gpskew script can be used to determine if table data is equally distributed across
all of the active segments. gpskew reports the following information:
• The total number of records in the specified table. Note that for parent tables with inherited child tables, you must use the -h option to show child table skew as well.
• The number of records on each segment.
• The variance of records between segments: gpskew takes the segment with the maximum count and the segment with the minimum count, and reports the difference between the two.
• The segment response times (if -r is supplied).
• The distribution key column names (if -c is supplied).
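When gpskew is not convenient, a quick manual check is to count rows per segment using Greenplum's hidden gp_segment_id column. A sketch, assuming a table named public.t1 (as in the example on the next slide):

```sql
-- Rows per primary segment; a large spread between the
-- maximum and minimum counts indicates data skew
SELECT gp_segment_id, count(*) AS rows_on_segment
FROM public.t1
GROUP BY gp_segment_id
ORDER BY rows_on_segment DESC;
```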
Skew analysis
aparashar@mdw1 ~ $ > gpskew -t public.t1
20080823:09:43:43:gpskew:mdw1:aparashar-[INFO]:-Spawning parallel processes batch [1], please wait...
........
20080823:09:43:45:gpskew:mdw1:aparashar-[INFO]:-Waiting for parallel processes batch [1], please wait...
20080823:09:43:47:gpskew:mdw1:aparashar-[INFO]:---------------------------------------
20080823:09:43:47:gpskew:mdw1:aparashar-[INFO]:-Parallel process exit status
20080823:09:43:47:gpskew:mdw1:aparashar-[INFO]:---------------------------------------
20080823:09:43:47:gpskew:mdw1:aparashar-[INFO]:---------------------------------------
20080823:09:43:47:gpskew:mdw1:aparashar-[INFO]:-Number of Distribution Column(s) = {1}
20080823:09:43:47:gpskew:mdw1:aparashar-[INFO]:-Distribution Column Name(s) = c1
20080823:09:43:48:gpskew:mdw1:aparashar-[INFO]:---------------------------------------
20080823:09:43:48:gpskew:mdw1:aparashar-[INFO]:-Total records = 2560000
20080823:09:43:48:gpskew:mdw1:aparashar-[INFO]:---------------------------------------
20080823:09:43:48:gpskew:mdw1:aparashar-[INFO]:-Skew calculation target type = Primary Segments
20080823:09:43:48:gpskew:mdw1:aparashar-[INFO]:---------------------------------------
20080823:09:43:48:gpskew:mdw1:aparashar-[INFO]:-Skew result
20080823:09:43:48:gpskew:mdw1:aparashar-[INFO]:-Maximum Record Count = 320512
20080823:09:56:33:gpskew:mdw1:aparashar-[INFO]:-Segment instance hostname = 3114
20080823:09:56:33:gpskew:mdw1:aparashar-[INFO]:-Segment instance name = gp
20080823:09:43:48:gpskew:mdw1:aparashar-[INFO]:-Minimum Record Count = 319488
20080823:09:56:33:gpskew:mdw1:aparashar-[INFO]:-Segment instance hostname = 3114
20080823:09:56:33:gpskew:mdw1:aparashar-[INFO]:-Segment instance name = gp
20080823:09:43:48:gpskew:mdw1:aparashar-[INFO]:-Record count variance = 1024
20080823:09:43:48:gpskew:mdw1:aparashar-[INFO]:---------------------------------------
Skew analysis
Skew problems can also be seen in query plans. Use EXPLAIN ANALYZE:
ais_3114=# explain analyze SELECT count(*) from (SELECT count(*) from t2 group by c2) as a;
-> Subquery Scan a (cost=75.33..87.84 rows=1 width=0)
Rows out: Avg 1250.0 rows x 8 workers. Max 1251 rows (seg2) with 6.305 ms to first row, 15 ms to
end, start offset by -17 ms.
-> HashAggregate (cost=75.33..87.83 rows=1000 width=4)
Group By: t2.c2
Rows out: Avg 1250.0 rows x 8 workers. Max 1251 rows (seg2) with 6.300 ms to first row, 15
ms to end, start offset by -17 ms.
Executor memory: 1938K bytes avg, 1938K bytes max (seg0).
Work_mem used: 138K bytes avg, 138K bytes max (seg2).
(seg2) Hash chain length 1.0 avg, 1 max, using 1251 of 110160 buckets.
-> Redistribute Motion 8:8 (slice1) (cost=27.83..60.33 rows=1000 width=4)
Hash Key: t2.c2
Rows out: Avg 1250.0 rows x 8 workers at destination. Max 1251 rows (seg2) with 0.498
ms to first row, 2.729 ms to end, start offset by -16 ms.
-> HashAggregate (cost=27.83..40.33 rows=1000 width=4)
Group By: t2.c2
Rows out: Avg 1250.0 rows x 8 workers. Max 1251 rows (seg2) with 3.400 ms to fi
rst row, 4.280 ms to end, start offset by -21 ms.
Executor memory: 1938K bytes avg, 1938K bytes max (seg0).
-> Seq Scan on t2 (cost=0.00..19.22 rows=1722 width=4)
Rows out: Avg 1250.0 rows x 8 workers. Max 1251 rows (seg2) with 0.091 ms
to first row, 0.463 ms to end, start offset by -19 ms.
…..
Skew Analysis
Example: (A Greenplum Customer)
• More than 75% of the tables were randomly distributed
• Several very large tables had completely skewed distribution
Debug process:
• Check the database skew
• Start with the big tables, then work toward the smaller ones
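To find which tables are randomly distributed (as in this customer's case), the gp_distribution_policy catalog can be queried; a NULL attrnums entry means random distribution. A sketch:

```sql
-- List tables with random distribution (NULL distribution key columns)
SELECT n.nspname AS schemaname, c.relname AS tablename
FROM gp_distribution_policy p
JOIN pg_class c ON c.oid = p.localoid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE p.attrnums IS NULL;
```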
LAB
Module 12
Advanced SQL Topics
Feedback and Questions
Please fill in the feedback form that you have
been given by your instructor. We value your
input in making this class as meaningful to our
students as possible!
THANK YOU!