Chapter 10: Joins and Subqueries

Joins & Subqueries
• Joins
  – Methods to combine data from multiple tables
  – Optimizer information can be limited based on
    • Algorithms used
    • Knowledge of data
• Subqueries
  – Complex by nature
  – Difficult for Optimizer to determine best plan

Types of Joins
• Equi-join (equality condition, i.e. “=”)
• Non-equi or Theta (non-equality, e.g. “<>”, BETWEEN)
• Cross (Cartesian, i.e. no join condition)
• Outer (also returns rows with no match in the other table)
  – Left
  – Right
  – Full
• Self (joining table to itself)
• Hierarchical (type of self-join)
• Anti (rows from one table without a match in the other)
• Semi (rows from one table returned only once, however many matches exist in the other)

Join Methods
• Nested Loops
  – Performs a search of the inner table for each row found in the outer table
  – Optimizer will choose only if index exists on inner table
  – Nested table scan: scan of entire inner table for each outer table row if no index on inner table
  – Generally least effective join method
• Sort-Merge
  – Each table sorted by value of the join columns
  – After sort, data merged
  – Best when
    • Large amount of data needed
    • No index on inner table
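
The two methods above can be sketched in a few lines of Python. This is only an illustration of the algorithms on in-memory lists of dicts, not how Oracle implements them; the function names and the sample `departments` / `employees` data are made up for the example. The nested loops version is the unindexed "nested table scan" case.

```python
# Illustrative sketches only; names and data are hypothetical.

def nested_loops_join(outer, inner, key_outer, key_inner):
    """For each outer row, scan the whole inner table (no index available)."""
    result = []
    for o in outer:
        for i in inner:
            if o[key_outer] == i[key_inner]:
                result.append((o, i))
    return result


def sort_merge_join(left, right, key_left, key_right):
    """Sort both inputs on the join key, then merge them in a single pass."""
    left = sorted(left, key=lambda r: r[key_left])
    right = sorted(right, key=lambda r: r[key_right])
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key_left], right[j][key_right]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Pair the current left row with the whole run of equal right keys.
            run_start = j
            while j < len(right) and right[j][key_right] == lk:
                result.append((left[i], right[j]))
                j += 1
            i += 1
            # If the next left row has the same key, replay the same run.
            if i < len(left) and left[i][key_left] == lk:
                j = run_start
    return result


departments = [
    {"dept_id": 10, "name": "Sales"},
    {"dept_id": 20, "name": "IT"},
    {"dept_id": 30, "name": "HR"},
]
employees = [
    {"emp": "Ann", "dept_id": 20},
    {"emp": "Bob", "dept_id": 10},
    {"emp": "Cy", "dept_id": 20},
    {"emp": "Di", "dept_id": 40},
]

nl = nested_loops_join(departments, employees, "dept_id", "dept_id")
sm = sort_merge_join(departments, employees, "dept_id", "dept_id")


def pairs(rows):
    """Normalize a join result so the two methods can be compared."""
    return sorted((d["name"], e["emp"]) for d, e in rows)
```

Both functions return the same matches; they differ in how much work they do. The nested loops sketch compares every outer row with every inner row, while the sort-merge sketch pays for two sorts up front and then reads each input once (apart from replaying runs of duplicate keys).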

Join Methods (cont.)
• Hash
  – Hash table built for one of the tables
  – Hash table used to find matching rows in other table
  – Also good for large amounts of data
  – Can be similar in performance to sort-merge
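
As a rough sketch of the build and probe phases described above, assuming in-memory lists of dicts (the function name and sample data are hypothetical, and this is not Oracle's implementation):

```python
from collections import defaultdict


def hash_join(build, probe, key_build, key_probe):
    """Build a hash table on one input, then probe it with the other."""
    # Build phase: one pass over the build input.
    table = defaultdict(list)
    for b in build:
        table[b[key_build]].append(b)
    # Probe phase: one hash lookup per probe row; works for equality only.
    result = []
    for p in probe:
        for b in table.get(p[key_probe], []):
            result.append((b, p))
    return result


departments = [
    {"dept_id": 10, "name": "Sales"},
    {"dept_id": 20, "name": "IT"},
]
employees = [
    {"emp": "Ann", "dept_id": 20},
    {"emp": "Bob", "dept_id": 10},
    {"emp": "Di", "dept_id": 40},
]

matches = hash_join(departments, employees, "dept_id", "dept_id")
names = sorted((d["name"], e["emp"]) for d, e in matches)
```

Because the lookup is an exact-match dictionary access, the sketch also shows why a hash join only fits equi-joins: there is no way to probe the table for "less than" or BETWEEN conditions.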

Choosing Join Method
• See Table 10-1 (p. 296)
• Sort-Merge/Hash vs. Nested Loops
  – Nested Loops
    • Better response time
    • Smaller amounts of data
    • Indexes needed
  – Sort-Merge/Hash
    • Better throughput
    • Larger amounts of data
    • More memory needed for sorting or building hash table
    • Better with parallel operations (especially Hash)

Choosing Join Method (cont.)
• Sort-Merge vs. Hash
  – Hash
    • Generally performs better than Sort-Merge
    • Only applicable to equi-joins
    • Only has to process all of one table (creating the hash table)
  – Sort-Merge
    • Applies to more situations than Hash
    • Applies to equi and non-equi joins
    • Both tables processed (sorted)
    • More memory and CPU generally needed
    • Outperforms Hash if data is pre-sorted

Choosing Join Method (cont.)

When Joining A to B        NL          SM/Hash
Both are small             Depends     Yes
Small subset from B        Yes         No
Want first rows quickly    Yes         No
Want all rows quickly      Depends     Depends
FTS of A / parallelism     Yes, if..   Yes
Limited memory             Yes         Maybe not

(NL = Nested Loops, SM = Sort-Merge, FTS = full table scan)

Optimizing Nested Loops Joins
• Nested Loops
  – Ensure index is on inner table
  – Join column should be selective (high cardinality, i.e. many distinct values)
• Sort-Merge & Hash
  – Need enough memory in PGA to perform well
  – Best if entire structure constructed in memory
    • Avoid “multi-pass” operations to disk
  – Sort-Merge is the most resource intensive
    • Two sorted tables
    • Merge operation

Avoiding Joins
• Maintaining denormalized data from one table to another
  – Requires application process to copy data
  – Data integrity needs to be carefully maintained
• Storing tables in index cluster
  – Reduces IO by combining into single segment
  – SIZE parameter must be set appropriately
  – FTS operations still slow
  – Rarely used
• Creating Materialized Views
• Creating bitmap join indexes

Avoiding Joins (cont.)
• Creating Materialized Views
  – Allows transparent query rewrite
  – Keeps transaction data in log tables
  – Avoids join overhead for frequently used queries
• Creating bitmap join indexes
  – Efficient method of matching values between indexes
  – Higher frequency of locking can occur

Join Order
• Optimizer calculates join possibilities
  – Factorial of number of tables being joined
  – Only two tables joined in single operation
  – Temporary result sets created for three or more tables
  – Let optimizer decide join order, but..
    • Ensure statistics are current
    • Create histograms where appropriate
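
To give a feel for the "factorial of number of tables" point, the short sketch below simply enumerates the possible orderings of a hypothetical four-table join and prints how quickly the count grows. The table names are invented, and a real optimizer prunes this search space rather than costing every permutation.

```python
from itertools import permutations
from math import factorial

# Hypothetical table names, used only to enumerate possible join orders.
tables = ["customers", "orders", "order_items", "products"]
join_orders = list(permutations(tables))

for n in (2, 5, 10):
    print(n, "tables:", factorial(n), "possible join orders")
# 2 tables: 2 possible join orders
# 5 tables: 120 possible join orders
# 10 tables: 3628800 possible join orders
```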

Join Order (cont.)
• If you don’t trust the optimizer
  – The driving table (first table in join)
    • Should be most selective
    • Should have most efficient WHERE clause
  – Eliminate rows from final result set as early as possible during join operations
    • Try to process filtering conditions early on in the join
  – For small tables with indexes
    • Use nested loops join
    • Ensure all columns of WHERE clause are indexed

Outer Joins
• Rows returned from one table in a join, even if there are no matching rows in the other table
• Three types
  – Left Outer Join (all rows from the left table, matched or not)
  – Right Outer Join (all rows from the right table, matched or not)
  – Full Outer Join (all rows from both tables, matched or not)
• Optimizer joins table with missing rows last
• Specified with
  – Proprietary Oracle syntax (+)
  – ANSI syntax (e.g. LEFT OUTER JOIN, etc.)
• Inner Join
  – Shows only matching rows from both tables
  – This is the “default”

Star Joins
• Common in the data warehouse
• Star schema consists of
  – Large Fact table containing detailed rows and foreign keys
  – Dimension tables that categorize fact items (e.g. time, product, etc.)
• Oracle’s default approach is to:
  – Query all dimensions to retrieve foreign key values
  – Merge dimension result sets using Cartesian join
  – Use the resulting foreign keys to identify fact table rows
    • Requires many concatenated indexes

Star Transformation
• Cartesian join approach has drawbacks
  – Assumes small dimension tables, which may not be true
  – Concatenated index requirements across all dimension keys may not be practical
• Oracle created “Star Transformation” optimization
  – Uses bitmap indexes on fact table
  – Requires setting parameter
    • STAR_TRANSFORMATION_ENABLED=TRUE
  – Also can use OPT_PARAM hint
  – Can validate star transformation via the execution plan
  – Easier to configure and manage
  – Supports widest range of possible WHERE clause conditions
  – Possible lock overhead with bitmap indexes still applies

Hierarchical Joins
• Special case of self-join
• Column in table points to the primary key of another row in the same table
• Next row points to a further row and so on
• Cascading effect
• Avoid indexes in execution plan

Subqueries
• A subquery is a SELECT statement contained within another SQL statement
• Types include
  – Simple
  – Correlated
  – Anti-join
  – Semi-join

Simple Subqueries
• Inner query makes no reference to parent query
• Example: count the employees who earn the lowest salary

    SELECT COUNT(*)
    FROM employees
    WHERE salary = (SELECT MIN(salary) FROM employees);

• Each query can and should be tuned independently
• Generally use more resources than running queries separately within a program

Correlated Subqueries
• Subquery refers to values in the parent query
• Subquery is logically executed once for each row returned by the parent query
• Usually accomplished via a join method

    SELECT employee_id, first_name, last_name, salary
    FROM employees a
    WHERE salary = (SELECT MIN(salary)
                    FROM employees b
                    WHERE b.department_id = a.department_id);

• Can generate inefficient plans
• Consider rewriting as joins or using analytic functions
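
One way to read the "rewrite as a join" advice is sketched below. It runs against SQLite through Python's standard sqlite3 module purely so the example is self-contained; SQLite is a stand-in here, not Oracle, and the simplified `employees` table and its rows are invented for the illustration. The rewrite joins each employee to a derived table holding the minimum salary per department.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified employees table with made-up rows.
conn.executescript("""
    CREATE TABLE employees (
        employee_id   INTEGER PRIMARY KEY,
        last_name     TEXT,
        department_id INTEGER,
        salary        INTEGER
    );
    INSERT INTO employees VALUES
        (1, 'Abel',  10, 3000),
        (2, 'Baer',  10, 2500),
        (3, 'Chen',  20, 4000),
        (4, 'Diaz',  20, 4000),
        (5, 'Ernst', 30, 5200);
""")

# Correlated form: the subquery refers to the parent row's department.
correlated = conn.execute("""
    SELECT employee_id
    FROM employees a
    WHERE salary = (SELECT MIN(salary)
                    FROM employees b
                    WHERE b.department_id = a.department_id)
    ORDER BY employee_id
""").fetchall()

# Join form: compute each department's minimum once, then join to it.
join_rewrite = conn.execute("""
    SELECT a.employee_id
    FROM employees a
    JOIN (SELECT department_id, MIN(salary) AS min_salary
          FROM employees
          GROUP BY department_id) m
      ON m.department_id = a.department_id
     AND m.min_salary = a.salary
    ORDER BY a.employee_id
""").fetchall()
```

On this data both queries return the same employees. Whether the join form is actually faster depends on the optimizer and the data, so it is worth comparing execution plans before adopting it.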

Anti-join Subqueries
• As the name suggests, the opposite of a join
  – Returns rows in one table that do not match rows from another
  – Expressed with a ‘NOT IN’ or ‘NOT EXISTS’ subquery
  – Example: Google customers who are not Microsoft customers

    SELECT COUNT(*)
    FROM google_customers
    WHERE (cust_first_name, cust_last_name)
          NOT IN (SELECT cust_first_name, cust_last_name
                  FROM microsoft_customers)

• Optimizer generally uses HASH JOIN ANTI method
• May be beneficial to add index to subquery table
• Avoid NOT IN unless join keys are NOT NULL
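
The warning about NOT IN and nullable keys comes from standard SQL three-valued logic: if the subquery returns any NULL, the NOT IN comparison can never evaluate to true, so no rows come back. The sketch below shows this with SQLite via Python's sqlite3 module as a stand-in for Oracle; the single-column tables and their rows are simplified, invented versions of the ones in the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, hypothetical tables; note the NULL key on the Microsoft side.
conn.executescript("""
    CREATE TABLE google_customers (cust_last_name TEXT);
    CREATE TABLE microsoft_customers (cust_last_name TEXT);
    INSERT INTO google_customers VALUES ('Adams'), ('Baker'), ('Clark');
    INSERT INTO microsoft_customers VALUES ('Baker'), (NULL);
""")

# NOT IN: the NULL makes every non-matching comparison "unknown".
not_in = conn.execute("""
    SELECT cust_last_name
    FROM google_customers
    WHERE cust_last_name NOT IN (SELECT cust_last_name
                                 FROM microsoft_customers)
    ORDER BY cust_last_name
""").fetchall()

# NOT EXISTS: the NULL row simply never matches, so it is ignored.
not_exists = conn.execute("""
    SELECT g.cust_last_name
    FROM google_customers g
    WHERE NOT EXISTS (SELECT 1
                      FROM microsoft_customers m
                      WHERE m.cust_last_name = g.cust_last_name)
    ORDER BY g.cust_last_name
""").fetchall()
```

Here `not_in` is empty while `not_exists` contains Adams and Clark, which is usually the answer the query author intended.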

Semi-join Subqueries
• Expressed as a ‘WHERE IN’ or ‘WHERE EXISTS’ subquery

    SELECT COUNT(*)
    FROM google_customers
    WHERE (cust_first_name, cust_last_name)
          IN (SELECT cust_first_name, cust_last_name
              FROM microsoft_customers)

• Returns rows from first table only once
  – Even if there is more than one matching row in the second table
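
The "only once" behavior is the practical difference between a semi-join and an ordinary join. The sketch below counts matches both ways, again using SQLite through Python's sqlite3 module as a stand-in for Oracle and simplified, invented single-column tables in which one Microsoft customer appears twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, hypothetical tables; 'Baker' is duplicated on the Microsoft side.
conn.executescript("""
    CREATE TABLE google_customers (cust_last_name TEXT);
    CREATE TABLE microsoft_customers (cust_last_name TEXT);
    INSERT INTO google_customers VALUES ('Adams'), ('Baker');
    INSERT INTO microsoft_customers VALUES ('Baker'), ('Baker'), ('Clark');
""")

# Semi-join: each Google customer is counted at most once.
semi_count = conn.execute("""
    SELECT COUNT(*)
    FROM google_customers
    WHERE cust_last_name IN (SELECT cust_last_name
                             FROM microsoft_customers)
""").fetchone()[0]

# Ordinary inner join: one result row per matching pair.
join_count = conn.execute("""
    SELECT COUNT(*)
    FROM google_customers g
    JOIN microsoft_customers m
      ON m.cust_last_name = g.cust_last_name
""").fetchone()[0]
```

With this data the semi-join counts Baker once, while the inner join counts Baker twice because of the duplicate row in the second table.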