
Database normalization

In the field of relational database design, normalization is a systematic way of ensuring that a database structure is suitable for general-purpose querying and free of certain undesirable characteristics—insertion, update, and deletion anomalies—that could lead to a loss of data integrity.[1] E.F. Codd, the inventor of the relational model, introduced the concept of normalization and what we now know as the first normal form in 1970.[2] Codd went on to define the second and third normal forms in 1971,[3] and Codd and Raymond F. Boyce defined the Boyce-Codd normal form in 1974.[4] Higher normal forms were defined by other theorists in subsequent years, the most recent being the sixth normal form introduced by Chris Date, Hugh Darwen, and Nikos Lorentzos in 2002.[5]

Informally, a relational database table (the computerized representation of a relation) is often described as "normalized" if it is in the third normal form (3NF).[6] Most 3NF tables are free of insertion, update, and deletion anomalies, i.e. in most cases 3NF tables adhere to BCNF, 4NF, and 5NF (but typically not 6NF).

A standard piece of database design guidance is that the designer should begin by fully normalizing the design, and selectively denormalize only in places where doing so is absolutely necessary to address performance issues.[7] However, some modeling disciplines, such as the dimensional modeling approach to data warehouse design, explicitly recommend non-normalized designs, i.e. designs that in large part do not adhere to 3NF.[8]

Objectives of normalization

A basic objective of the first normal form defined by Codd in 1970 was to permit data to be queried and manipulated using a "universal data sub-language" grounded in first-order logic.[9] (SQL is an example of such a data sub-language, albeit one that Codd regarded as seriously flawed.)[10] Querying and manipulating the data within an unnormalized data structure, such as the following non-1NF representation of customers' credit card transactions, involves more complexity than is really necessary:

Customer   Transactions
           Tr. ID   Date          Amt.
Jones      12890    14-Oct-2003    -87
           12904    15-Oct-2003    -50
Wilkins    12898    14-Oct-2003    -21
Stevens    12907    15-Oct-2003    -18
           14920    20-Nov-2003    -70
           15003    27-Nov-2003    -60

To each customer there corresponds a repeating group of transactions. The automated evaluation of any query relating to customers' transactions therefore would broadly involve two stages:

1. Unpacking one or more customers' groups of transactions, allowing the individual transactions in a group to be examined, and

2. Deriving a query result based on the results of the first stage.

For example, in order to find out the monetary sum of all transactions that occurred in October 2003 for all customers, the system would have to know that it must first unpack the Transactions group of each customer, then sum the Amounts of all transactions thus obtained where the Date of the transaction falls in October 2003.

One of Codd's important insights was that this structural complexity could always be removed completely, leading to much greater power and flexibility in the way queries could be formulated (by users and applications) and evaluated (by the DBMS). The normalized equivalent of the structure above would look like this:

Customer   Tr. ID   Date          Amount
Jones      12890    14-Oct-2003    -87
Jones      12904    15-Oct-2003    -50
Wilkins    12898    14-Oct-2003    -21
Stevens    12907    15-Oct-2003    -18
Stevens    14920    20-Nov-2003    -70
Stevens    15003    27-Nov-2003    -60

Now each row represents an individual credit card transaction, and the DBMS can obtain the answer of interest simply by finding all rows with a Date falling in October 2003 and summing their Amounts. All of the values in the data structure are on an equal footing: they are all exposed to the DBMS directly, and can directly participate in queries, whereas in the previous situation some values were embedded in lower-level structures that had to be handled specially. Accordingly, the normalized design lends itself to general-purpose query processing, whereas the unnormalized design does not.
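In SQL, for instance, the October query over the normalized table becomes a single statement (a minimal sketch; the identifiers Transactions, TrDate and Amount are illustrative, not from the source):

    -- Total amount of all customers' transactions that occurred in October 2003.
    SELECT SUM(Amount) AS OctoberTotal
    FROM   Transactions
    WHERE  TrDate >= DATE '2003-10-01'
      AND  TrDate <  DATE '2003-11-01';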

The objectives of normalization beyond 1NF were stated as follows by Codd:

1. To free the collection of relations from undesirable insertion, update and deletion dependencies;

2. To reduce the need for restructuring the collection of relations as new types of data are introduced, and thus increase the life span of application programs;

3. To make the relational model more informative to users;

4. To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by.


—E.F. Codd, "Further Normalization of the Data Base Relational Model"[11]

The sections below give details of each of these objectives.

Free the database of modification anomalies

(Figure captions from the original article: an update anomaly, where Employee 519 is shown as having different addresses on different records; an insertion anomaly, where the details of the new faculty member, Dr. Newsome, cannot be recorded until he is assigned to teach at least one course; and a deletion anomaly, where all information about Dr. Giddens is lost when he temporarily ceases to be assigned to any courses.)

When an attempt is made to modify (update, insert into, or delete from) a table, undesired side-effects may follow. Not all tables can suffer from these side-effects; rather, the side-effects can only arise in tables that have not been sufficiently normalized. An insufficiently normalized table might have one or more of the following characteristics:

• The same information can be expressed on multiple rows; therefore updates to the table may result in logical inconsistencies. For example, each record in an "Employees' Skills" table might contain an Employee ID, Employee Address, and Skill; thus a change of address for a particular employee will potentially need to be applied to multiple records (one for each of his skills). If the update is not carried through successfully—if, that is, the employee's address is updated on some records but not others—then the table is left in an inconsistent state. Specifically, the table provides conflicting answers to the question of what this particular employee's address is. This phenomenon is known as an update anomaly.

• There are circumstances in which certain facts cannot be recorded at all. For example, each record in a "Faculty and Their Courses" table might contain a Faculty ID, Faculty Name, Faculty Hire Date, and Course Code—thus we can record the details of any faculty member who teaches at least one course, but we cannot record the details of a newly-hired faculty member who has not yet been assigned to teach any courses. This phenomenon is known as an insertion anomaly.

• There are circumstances in which the deletion of data representing certain facts necessitates the deletion of data representing completely different facts. The "Faculty and Their Courses" table described in the previous example suffers from this type of anomaly, for if a faculty member temporarily ceases to be assigned to any courses, we must delete the last of the records on which that faculty member appears. This phenomenon is known as a deletion anomaly.
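These anomalies disappear once each kind of fact is stored in its own table. A hedged SQL sketch of a normalized "Employees' Skills" design (all identifiers are illustrative, not from the source):

    -- Each employee's address is recorded exactly once, so an address change
    -- is a single-row update and no update anomaly can arise.
    CREATE TABLE Employees (
        EmployeeID      INTEGER PRIMARY KEY,
        EmployeeAddress VARCHAR(200)
    );

    -- Skills reference the employee by key; an employee with no skills yet
    -- can still be recorded, avoiding the insertion anomaly.
    CREATE TABLE EmployeeSkills (
        EmployeeID INTEGER NOT NULL REFERENCES Employees (EmployeeID),
        Skill      VARCHAR(100) NOT NULL,
        PRIMARY KEY (EmployeeID, Skill)
    );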

Minimize redesign when extending the database structure

When a fully normalized database structure is extended to allow it to accommodate new types of data, the pre-existing aspects of the database structure can remain largely or entirely unchanged. As a result, applications interacting with the database are minimally affected.

Make the data model more informative to users

Normalized tables, and the relationship between one normalized table and another, mirror real-world concepts and their interrelationships.


Avoid bias towards any particular pattern of querying

Normalized tables are suitable for general-purpose querying. This means any queries against these tables, including future queries whose details cannot be anticipated, are supported. In contrast, tables that are not normalized lend themselves to some types of queries, but not others.

Background to normalization: definitions

Functional dependency: Attribute B has a functional dependency on attribute A (i.e., A → B) if, for each value of attribute A, there is exactly one value of attribute B; equivalently, whenever two tuples agree on A they must also agree on B. In our example, Employee Address has a functional dependency on Employee ID, because a particular Employee ID value corresponds to one and only one Employee Address value. (Note that the reverse need not be true: several employees could live at the same address, and therefore one Employee Address value could correspond to more than one Employee ID. Employee ID is therefore not functionally dependent on Employee Address.) An attribute may be functionally dependent either on a single attribute or on a combination of attributes. It is not possible to determine the extent to which a design is normalized without understanding what functional dependencies apply to the attributes within its tables; understanding this, in turn, requires knowledge of the problem domain. For example, an employer may require certain employees to split their time between two locations, such as New York City and London, and therefore want to allow employees to have more than one Employee Address. In this case, Employee Address would no longer be functionally dependent on Employee ID.

Another way to look at the above is by reviewing basic mathematical functions:

Let F(x) be a mathematical function of one independent variable. The independent variable is analogous to attribute A, and the value of the function is analogous to the dependent attribute; hence the term functional dependency. As we know, a mathematical function can have only one output for a given input; notationally, this relationship is written F(A) = B. There are also functions of more than one independent variable, commonly referred to as multivariable functions; these correspond to an attribute being functionally dependent on a combination of attributes. Hence F(x,y,z) has three independent variables, or independent attributes, and one dependent attribute, namely the value F(x,y,z). In multivariable functions, too, there can be only one output for a given combination of inputs.

Trivial functional dependency: A trivial functional dependency is a functional dependency of an attribute on a superset of itself. {Employee ID, Employee Address} → {Employee Address} is trivial, as is {Employee Address} → {Employee Address}.

Full functional dependency: An attribute is fully functionally dependent on a set of attributes X if it is functionally dependent on X and not functionally dependent on any proper subset of X. {Employee Address} has a functional dependency on {Employee ID, Skill}, but not a full functional dependency, because it is also dependent on {Employee ID}.

Transitive dependency: A transitive dependency is an indirect functional dependency, one in which X → Z holds only by virtue of X → Y and Y → Z.

Multivalued dependency: A multivalued dependency is a constraint according to which the presence of certain rows in a table implies the presence of certain other rows.

Join dependency: A table T is subject to a join dependency if T can always be recreated by joining multiple tables each having a subset of the attributes of T.

Superkey: A superkey is an attribute or set of attributes that uniquely identifies rows within a table; in other words, two distinct rows are always guaranteed to differ in their values for the superkey's attributes. {Employee ID, Employee Address, Skill} would be a superkey for the "Employees' Skills" table; {Employee ID, Skill} would also be a superkey.

Candidate key: A candidate key is a minimal superkey, that is, a superkey of which no proper subset is also a superkey. {Employee ID, Skill} would be a candidate key for the "Employees' Skills" table.

Non-prime attribute: A non-prime attribute is an attribute that does not occur in any candidate key. Employee Address would be a non-prime attribute in the "Employees' Skills" table.

Primary key: Most DBMSs require a table to be defined as having a single unique key, rather than a number of possible unique keys. A primary key is a key which the database designer has designated for this purpose.
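Whether a supposed functional dependency actually holds in stored data can be checked mechanically. A small illustrative SQL query (assuming the single-table "Employees' Skills" design discussed above, with illustrative identifiers):

    -- Employee IDs associated with more than one address: each row returned
    -- is a witness that EmployeeID -> EmployeeAddress fails to hold.
    SELECT   EmployeeID
    FROM     EmployeesSkills
    GROUP BY EmployeeID
    HAVING   COUNT(DISTINCT EmployeeAddress) > 1;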

Normal forms

The normal forms (abbrev. NF) of relational database theory provide criteria for determining a table's degree of vulnerability to logical inconsistencies and anomalies. The higher the normal form applicable to a table, the less vulnerable it is to inconsistencies and anomalies. Each table has a "highest normal form" (HNF): by definition, a table always meets the requirements of its HNF and of all normal forms lower than its HNF; also by definition, a table fails to meet the requirements of any normal form higher than its HNF.

The normal forms are applicable to individual tables; to say that an entire database is in normal form n is to say that all of its tables are in normal form n.

Newcomers to database design sometimes suppose that normalization proceeds in an iterative fashion, i.e. a 1NF design is first normalized to 2NF, then to 3NF, and so on. This is not an accurate description of how normalization typically works. A sensibly designed table is likely to be in 3NF on the first attempt; furthermore, if it is 3NF, it is overwhelmingly likely to have an HNF of 5NF. Achieving the "higher" normal forms (above 3NF) does not usually require an extra expenditure of effort on the part of the designer, because 3NF tables usually need no modification to meet the requirements of these higher normal forms.

The main normal forms are summarized below.

Normal form: First normal form (1NF)
Defined by: Two versions: E.F. Codd (1970), C.J. Date (2003)[12]
Brief definition: Table faithfully represents a relation and has no "repeating groups"

Normal form: Second normal form (2NF)
Defined by: E.F. Codd (1971)[13]
Brief definition: No non-prime attribute in the table is functionally dependent on a part (proper subset) of a candidate key

Normal form: Third normal form (3NF)
Defined by: E.F. Codd (1971)[14]; see also Carlo Zaniolo's equivalent but differently expressed definition (1982)[15]
Brief definition: Every non-prime attribute is non-transitively dependent on every key of the table

Normal form: Boyce-Codd normal form (BCNF)
Defined by: Raymond F. Boyce and E.F. Codd (1974)[16]
Brief definition: Every non-trivial functional dependency in the table is a dependency on a superkey

Normal form: Fourth normal form (4NF)
Defined by: Ronald Fagin (1977)[17]
Brief definition: Every non-trivial multivalued dependency in the table is a dependency on a superkey

Normal form: Fifth normal form (5NF)
Defined by: Ronald Fagin (1979)[18]
Brief definition: Every non-trivial join dependency in the table is implied by the superkeys of the table

Normal form: Domain/key normal form (DKNF)
Defined by: Ronald Fagin (1981)[19]
Brief definition: Every constraint on the table is a logical consequence of the table's domain constraints and key constraints

Normal form: Sixth normal form (6NF)
Defined by: Chris Date, Hugh Darwen, and Nikos Lorentzos (2002)[5]
Brief definition: Table features no non-trivial join dependencies at all (with reference to a generalized join operator)

Denormalization

Main article: Denormalization

Databases intended for Online Transaction Processing (OLTP) are typically more normalized than databases intended for Online Analytical Processing (OLAP). OLTP applications are characterized by a high volume of small transactions, such as updating a sales record at a supermarket checkout counter, and the expectation is that each transaction will leave the database in a consistent state. By contrast, databases intended for OLAP operations are primarily "read mostly" databases: OLAP applications tend to extract historical data that has accumulated over a long period of time. For such databases, redundant or "denormalized" data may facilitate business intelligence applications. Specifically, dimensional tables in a star schema often contain denormalized data; the normalized alternative to the star schema is the snowflake schema. The denormalized or redundant data must be carefully controlled during ETL processing, and users should not be permitted to see the data until it is in a consistent state.

It has never been established whether denormalization itself provides any increase in performance, or whether the performance gain comes from the concurrent removal of data constraints. In many cases the need for denormalization has waned as computers and RDBMS software have become more powerful, but since data volumes have generally increased along with hardware and software performance, OLAP databases often still use denormalized schemas.

Denormalization is also used to improve performance on smaller computers, as in computerized cash registers and mobile devices, since these may use the data for look-up only (e.g. price lookups). Denormalization may also be used when no RDBMS exists for a platform (such as Palm), or when no changes are to be made to the data and a swift response is crucial.

Non-first normal form (NF² or N1NF)

In recognition that denormalization can be deliberate and useful, the non-first normal form is a definition of database designs which do not conform to the first normal form, by allowing "sets and sets of sets to be attribute domains" (Schek 1982). This extension is a (non-optimal) way of implementing hierarchies in relations. Some academics have dubbed this practitioner-developed method "First Ab-normal Form"; Codd defined a relational database as using relations, so any table not in 1NF could not be considered to be relational.

Consider the following table:

Non-First Normal Form

Person   Favorite Colors
Bob      blue, red
Jane     green, yellow, red

Assume a person has several favorite colors. The favorite colors form a set, and that set is modeled directly as a single attribute value in the table above.

To transform this NF² table into 1NF, an "unnest" operator is required, which extends the relational algebra of the higher normal forms (one would allow "colors" to be its own table). The reverse operator is called "nest"; "nest" is not always the mathematical inverse of "unnest", although "unnest" is the mathematical inverse of "nest". A further constraint is that the operators be bijective, which is covered by the Partitioned Normal Form (PNF).
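As a sketch, the unnested 1NF equivalent of the table above could be declared like this in SQL (identifiers are illustrative):

    -- One row per (person, color) pair; the set-valued attribute is gone.
    -- Contents: (Bob, blue), (Bob, red), (Jane, green), (Jane, yellow), (Jane, red).
    CREATE TABLE FavoriteColors (
        Person VARCHAR(50) NOT NULL,
        Color  VARCHAR(30) NOT NULL,
        PRIMARY KEY (Person, Color)
    );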

Relational algebra

Not to be confused with relation algebras, which are motivated by binary relations and are relevant to mathematical logic.

Relational algebra, an offshoot of first-order logic (and of algebra of sets), deals with a set of finitary relations which is closed under certain operators. These operators operate on one or more relations to yield a relation. Relational algebra is a part of computer science.

Introduction

Relational algebras received little attention until the publication of E.F. Codd's relational model of data in 1970. Codd proposed such an algebra as a basis for database query languages. (See "Implementations" below.)

Relational algebra is essentially equivalent in expressive power to relational calculus (and thus first-order logic); this result is known as Codd's theorem. Some care, however, has to be taken to avoid a mismatch that may arise between the two languages, since negation, applied to a formula of the calculus, constructs a formula that may be true on an infinite set of possible tuples, while the difference operator of relational algebra always returns a finite result. To overcome these difficulties, Codd restricted the operands of relational algebra to finite relations only and also proposed restricted support for negation (NOT) and disjunction (OR). Analogous restrictions are found in many other logic-based computer languages. Codd defined the term relational completeness to refer to a language that is complete with respect to first-order predicate calculus apart from the restrictions he proposed. In practice the restrictions have no adverse effect on the applicability of his relational algebra for database purposes.

Primitive operations

As in any algebra, some operators are primitive and the others, being definable in terms of the primitive ones, are derived. It is useful if the choice of primitive operators parallels the usual choice of primitive logical operators. Although it is well known that the usual choice in logic of AND, OR and NOT is somewhat arbitrary, Codd made a similar arbitrary choice for his algebra.

The six primitive operators of Codd's algebra are the selection, the projection, the Cartesian product (also called the cross product or cross join), the set union, the set difference, and the rename. (Actually, Codd omitted the rename, but the compelling case for its inclusion was shown by the inventors of ISBL.) These six operators are fundamental in the sense that none of them can be omitted without losing expressive power. Many other operators have been defined in terms of these six. Among the most important are set intersection, division, and the natural join. In fact ISBL made a compelling case for replacing the Cartesian product by the natural join, of which the Cartesian product is a degenerate case.
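For example, set intersection needs no primitive of its own, since it is definable from set difference alone:

R ∩ S = R - (R - S)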

Altogether, the operators of relational algebra have identical expressive power to that of domain relational calculus or tuple relational calculus. However, for the reasons given in the Introduction above, relational algebra has strictly less expressive power than that of first-order predicate calculus without function symbols. Relational algebra actually corresponds to a subset of first-order logic that is Horn clauses without recursion and negation.

Set operators

Although three of the six basic operators are taken from set theory, there are additional constraints that are present in their relational algebra counterparts: For set union and set difference, the two relations involved must be union-compatible—that is, the two relations must have the same set of attributes. As set intersection can be defined in terms of set difference, the two relations involved in set intersection must also be union-compatible.

The Cartesian product is defined differently from the one defined in set theory in the sense that tuples are considered to be 'shallow' for the purposes of the operation. That is, unlike in set theory, where the Cartesian product of an n-tuple by an m-tuple is a set of 2-tuples, the Cartesian product in relational algebra has the 2-tuple "flattened" into an n+m-tuple. More formally, R × S is defined as follows:

R × S = { (r1, ..., rn, s1, ..., sm) | (r1, ..., rn) ∈ R, (s1, ..., sm) ∈ S }

In addition, for the Cartesian product to be defined, the two relations involved must have disjoint headers — that is, they must not have a common attribute name.

Projection

Main article: Projection (relational algebra)


A projection is a unary operation written as πa1,...,an(R), where a1,...,an is a set of attribute names. The result of such a projection is defined as the set that is obtained when all tuples in R are restricted to the set {a1,...,an}.

Selection

Main article: Selection (relational algebra)

A generalized selection is a unary operation written as σφ(R), where φ is a propositional formula that consists of atoms as allowed in the normal selection and the logical operators ∧ (and), ∨ (or) and ¬ (negation). This selection selects all those tuples in R for which φ holds.

Rename

Main article: Rename (relational algebra)

A rename is a unary operation written as ρa/b(R), where the result is identical to R except that the b field in all tuples is renamed to an a field. This is simply used to rename an attribute of a relation or the relation itself.
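As a rough illustration (SQL operates on bags rather than sets, so the correspondence is approximate; the Employee table used here is the example table introduced below):

    -- Projection  πName,EmpId(Employee):
    SELECT DISTINCT Name, EmpId FROM Employee;

    -- Selection   σDeptName='Finance'(Employee):
    SELECT * FROM Employee WHERE DeptName = 'Finance';

    -- Rename      ρEmpName/Name(Employee): attribute Name becomes EmpName.
    SELECT Name AS EmpName, EmpId, DeptName FROM Employee;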

Joins and join-like operators

Natural join

Natural join is a binary operator that is written as R ⋈ S, where R and S are relations.[1] The result of the natural join is the set of all combinations of tuples in R and S that are equal on their common attribute names. For an example consider the tables Employee and Dept and their natural join:

Employee

Name     EmpId  DeptName
Harry    3415   Finance
Sally    2241   Sales
George   3401   Finance
Harriet  2202   Sales

Dept

DeptName    Manager
Finance     George
Sales       Harriet
Production  Charles

Employee ⋈ Dept

Name     EmpId  DeptName  Manager
Harry    3415   Finance   George
Sally    2241   Sales     Harriet
George   3401   Finance   George
Harriet  2202   Sales     Harriet

This can also be used to define composition of relations. In category theory, the join is precisely the fiber product.

The natural join is arguably one of the most important operators since it is the relational counterpart of logical AND. Note carefully that if the same variable appears in each of two predicates that are connected by AND, then that variable stands for the same thing and both appearances must always be substituted by the same value. In particular, natural join allows the combination of relations that are associated by a foreign key. For example, in the above example a foreign key probably holds from Employee.DeptName to Dept.DeptName, and then the natural join of Employee and Dept combines all employees with their departments. Note that this works because the foreign key holds between attributes with the same name. If this is not the case, such as in the foreign key from Dept.manager to Emp.emp-number, then we have to rename these columns before we take the natural join. Such a join is sometimes also referred to as an equijoin (see θ-join).
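In SQL the natural join of the example tables can be written either with the NATURAL JOIN keyword or, more commonly in practice, with an explicit join condition (a sketch over the Employee and Dept tables above):

    -- Employee ⋈ Dept: pair the rows that agree on the shared DeptName attribute.
    SELECT Name, EmpId, DeptName, Manager
    FROM   Employee NATURAL JOIN Dept;

    -- Equivalent explicit (equijoin) form:
    SELECT e.Name, e.EmpId, e.DeptName, d.Manager
    FROM   Employee e
    JOIN   Dept d ON e.DeptName = d.DeptName;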

More formally the semantics of the natural join is defined as follows:

R ⋈ S = { t ∪ s | t ∈ R ∧ s ∈ S ∧ fun(t ∪ s) }

where fun(r) is a predicate that is true for a binary relation r iff r is a functional binary relation. It is usually required that R and S must have at least one common attribute, but if this constraint is omitted then the natural join becomes exactly the Cartesian product.

The natural join can be simulated with Codd's primitives as follows. Assume that b1,...,bm are the attribute names common to R and S, a1,...,an are the attribute names unique to R and c1,...,ck are the attribute names unique to S. Furthermore assume that the attribute names d1,...,dm are neither in R nor in S. In a first step we can now rename the common attribute names in S:

T := ρd1/b1(ρd2/b2(... ρdm/bm(S) ...))

Then we take the Cartesian product and select the tuples that are to be joined:

U := σb1=d1(σb2=d2(... σbm=dm(R × T) ...))

Finally we take a projection to get rid of the renamed attributes:

R ⋈ S = πa1,...,an,b1,...,bm,c1,...,ck(U)

θ-join and equijoin

Consider tables Car and Boat which list models of cars and boats and their respective prices. Suppose a customer wants to buy a car and a boat, but she doesn't want to spend more money for the boat than for the car. The θ-join on the relation CarPrice ≥ BoatPrice produces a table with all the possible options.

Car

CarModel  CarPrice
CarA      20'000
CarB      30'000
CarC      50'000

Boat

BoatModel  BoatPrice
Boat1      10'000
Boat2      40'000
Boat3      60'000

CarModel  CarPrice  BoatModel  BoatPrice
CarA      20'000    Boat1      10'000
CarB      30'000    Boat1      10'000
CarC      50'000    Boat1      10'000
CarC      50'000    Boat2      40'000

If we want to combine tuples from two relations where the combination condition is not simply the equality of shared attributes then it is convenient to have a more general form of join operator, which is the θ-join (or theta-join). The θ-join is a binary operator that is written as R ⋈a θ b S or R ⋈a θ v S, where a and b are attribute names, θ is a binary relation in the set {<, ≤, =, >, ≥}, v is a value constant, and R and S are relations. The result of this operation consists of all combinations of tuples in R and S that satisfy the relation θ. The result of the θ-join is defined only if the headers of S and R are disjoint, that is, do not contain a common attribute.

The simulation of this operation in the fundamental operations is therefore as follows:

R ⋈φ S = σφ(R × S)

In case the operator θ is the equality operator (=) then this join is also called an equijoin.

Note, however, that a computer language that supports the natural join and rename operators does not need θ-join as well, as this can be achieved by selection from the result of a natural join (which degenerates to Cartesian product when there are no shared attributes).
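In SQL a θ-join is simply a join with a non-equality condition; the Car/Boat example above might read (a sketch, with the prices assumed numeric):

    -- θ-join with θ as >=: every car paired with every boat that costs no more.
    SELECT c.CarModel, c.CarPrice, b.BoatModel, b.BoatPrice
    FROM   Car c
    JOIN   Boat b ON c.CarPrice >= b.BoatPrice;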


Semijoin

The semijoin is a join similar to the natural join, written as R ⋉ S, where R and S are relations. The result of the semijoin is only the set of all tuples in R for which there is a tuple in S that is equal on their common attribute names. For an example consider the tables Employee and Dept and their semijoin:

Employee

Name     EmpId  DeptName
Harry    3415   Finance
Sally    2241   Sales
George   3401   Finance
Harriet  2202   Production

Dept

DeptName    Manager
Sales       Harriet
Production  Charles

Employee ⋉ Dept

Name     EmpId  DeptName
Sally    2241   Sales
Harriet  2202   Production

More formally the semantics of the semijoin is defined as follows:

R ⋉ S = { t : t ∈ R, ∃ s ∈ S : fun(t ∪ s) }

where fun(r) is as in the definition of natural join.

The semijoin can be simulated using the natural join as follows. If a1, ..., an are the attribute names of R, then

R ⋉ S = πa1,...,an(R ⋈ S).

Since we can simulate the natural join with the basic operators it follows that this also holds for the semijoin.
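SQL has no semijoin operator of its own; the conventional rendering uses EXISTS (a sketch over the example tables):

    -- Employee ⋉ Dept: employees whose DeptName occurs in Dept.
    -- Only columns of Employee appear in the result, as the semijoin requires.
    SELECT e.Name, e.EmpId, e.DeptName
    FROM   Employee e
    WHERE  EXISTS (SELECT 1 FROM Dept d WHERE d.DeptName = e.DeptName);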

Antijoin

The antijoin, written as R ▷ S, where R and S are relations, is similar to the natural join, but the result of an antijoin is only those tuples in R for which there is NOT a tuple in S that is equal on their common attribute names.

For an example consider the tables Employee and Dept and their antijoin:

Employee

Name     EmpId  DeptName
Harry    3415   Finance
Sally    2241   Sales
George   3401   Finance
Harriet  2202   Production

Dept

DeptName    Manager
Sales       Harriet
Production  Charles

Employee ▷ Dept

Name    EmpId  DeptName
Harry   3415   Finance
George  3401   Finance


The antijoin is formally defined as follows:

R ▷ S = { t : t ∈ R ∧ ¬∃ s ∈ S : fun(t ∪ s) }

or

R ▷ S = { t : t ∈ R, there is no tuple s of S that satisfies fun(t ∪ s) }

where fun(r) is as in the definition of natural join.

The antijoin can also be defined as the complement of the semijoin, as follows:

R ▷ S = R - (R ⋉ S)

Given this, the antijoin is sometimes called the anti-semijoin, and the antijoin operator is sometimes written as a semijoin symbol with a bar above it.
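The antijoin is likewise rendered in SQL with NOT EXISTS (a sketch over the example tables):

    -- Employee ▷ Dept: employees whose DeptName has no match in Dept.
    SELECT e.Name, e.EmpId, e.DeptName
    FROM   Employee e
    WHERE  NOT EXISTS (SELECT 1 FROM Dept d WHERE d.DeptName = e.DeptName);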

Division

The division is a binary operation that is written as R ÷ S. The result consists of the restrictions of tuples in R to the attribute names unique to R, i.e., in the header of R but not in the header of S, for which it holds that all their combinations with tuples in S are present in R. For an example see the tables Completed, DBProject and their division:

Completed

Student  Task
Fred     Database1
Fred     Database2
Fred     Compiler1
Eugene   Database1
Eugene   Compiler1
Sara     Database1
Sara     Database2

DBProject

Task
Database1
Database2

Completed ÷ DBProject

Student
Fred
Sara

If DBProject contains all the tasks of the Database project then the result of the division above contains exactly all the students that have completed the Database project.

More formally the semantics of the division is defined as follows:

R ÷ S = { t[a1,...,an] : t ∈ R ∧ ∀ s ∈ S ( (t[a1,...,an] ∪ s) ∈ R ) }

where {a1,...,an} is the set of attribute names unique to R and t[a1,...,an] is the restriction of t to this set. It is usually required that the attribute names in the header of S are a subset of those of R because otherwise the result of the operation will always be empty.

The simulation of the division with the basic operations is as follows. We assume that a1,...,an are the attribute names unique to R and b1,...,bm are the attribute names of S. In the first step we project R on its unique attribute names and construct all combinations with tuples in S:

T := πa1,...,an(R) × S

In the prior example, T would represent a table such that every Student (because Student is the unique key / attribute of the Completed table) is combined with every given Task. So Eugene, for instance, would have two rows, Eugene -> Database1 and Eugene -> Database2 in T.

In the next step we subtract R from this relation:

U := T - R

Note that in U we have the possible combinations that "could have" been in R, but weren't. So if we now take the projection on the attribute names unique to R then we have the restrictions of the tuples in R for which not all combinations with tuples in S were present in R:

V := πa1,...,an(U)

So what remains to be done is take the projection of R on its unique attribute names and subtract those in V:

W := πa1,...,an(R) - V
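SQL has no division operator either; the standard encoding is a doubly nested NOT EXISTS, which mirrors the subtraction-based simulation above (a sketch over the Completed and DBProject tables):

    -- Completed ÷ DBProject: students for whom no DBProject task exists
    -- that they have not completed.
    SELECT DISTINCT c.Student
    FROM   Completed c
    WHERE  NOT EXISTS (
               SELECT 1
               FROM   DBProject p
               WHERE  NOT EXISTS (
                          SELECT 1
                          FROM   Completed c2
                          WHERE  c2.Student = c.Student
                            AND  c2.Task    = p.Task));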

Outer joins

Whereas the result of a join (or inner join) consists of tuples formed by combining matching tuples in the two operands, an outer join contains those tuples and additionally some tuples formed by extending an unmatched tuple in one of the operands by "fill" values for each of the attributes of the other operand.

The operators defined in this section assume the existence of a null value, ω, which we do not define, to be used for the fill values. It should not be assumed that this is the NULL defined for the database language SQL, nor should it be assumed that ω is a mark rather than a value, nor should it be assumed that the controversial three-valued logic is introduced by it.

Three outer join operators are defined: left outer join, right outer join, and full outer join. (The word "outer" is sometimes omitted.)

Left outer join

The left outer join is written as R =X S where R and S are relations. The result of the left outer join is the set of all combinations of tuples in R and S that are equal on their common attribute names, in addition (loosely speaking) to tuples in R that have no matching tuples in S.

For an example consider the tables Employee and Dept and their left outer join:


Employee

Name     EmpId  DeptName
Harry    3415   Finance
Sally    2241   Sales
George   3401   Finance
Harriet  2202   Sales
Tim      1123   Executive

Dept

DeptName    Manager
Sales       Harriet
Production  Charles

Employee =X Dept

Name     EmpId  DeptName   Manager
Harry    3415   Finance    ω
Sally    2241   Sales      Harriet
George   3401   Finance    ω
Harriet  2202   Sales      Harriet
Tim      1123   Executive  ω

In the resulting relation, the attributes that come from S carry the fill value ω in those rows where the tuple from R has no matching tuple in S.

Since there are no tuples in Dept with a DeptName of Finance or Executive, ωs occur in the Manager attribute of the resulting relation for the rows whose DeptName is Finance or Executive.

Let r1, r2, ..., rn be the attributes of the relation R and let {(ω, ..., ω)} be the singleton relation on the attributes that are unique to the relation S (those that are not attributes of R). Then the left outer join can be described in terms of the natural join (and hence using basic operators) as follows:

R =X S = (R ⋈ S) ∪ ((R - πr1,...,rn(R ⋈ S)) × {(ω, ..., ω)})
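In SQL the same result is produced by LEFT OUTER JOIN, with NULL playing the role of the fill value ω (a sketch over the example tables):

    -- Employee =X Dept: Manager is NULL for the Finance and Executive rows.
    SELECT e.Name, e.EmpId, e.DeptName, d.Manager
    FROM   Employee e
    LEFT OUTER JOIN Dept d ON e.DeptName = d.DeptName;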

Right outer join

The right outer join behaves almost identically to the left outer join, but the roles of the tables are switched.

The right outer join of relations R and S is written as R X= S. The result of the right outer join is the set of all combinations of tuples in R and S that are equal on their common attribute names, in addition to tuples in S that have no matching tuples in R.

For example consider the tables Employee and Dept and their right outer join:

Employee

Name     EmpId  DeptName
Harry    3415   Finance
Sally    2241   Sales
George   3401   Finance
Harriet  2202   Sales
Tim      1123   Executive

Dept

DeptName    Manager
Sales       Harriet
Production  Charles

Employee X= Dept

Name     EmpId  DeptName    Manager
Sally    2241   Sales       Harriet
Harriet  2202   Sales       Harriet
ω        ω      Production  Charles

In the resulting relation, the attributes that come from R carry the fill value ω in those rows where the tuple from S has no matching tuple in R.

Since there are no tuples in Employee with a DeptName of Production, ωs occur in the Name and EmpId attributes of the resulting relation for the row whose DeptName is Production.


Let s1, s2, ..., sn be the attributes of the relation S and let {(ω, ..., ω)} be the singleton relation on the attributes that are unique to the relation R (those that are not attributes of S). Then, as with the left outer join, the right outer join can be simulated using the natural join as follows:

R X= S = (R ⋈ S) ∪ ({(ω, ..., ω)} × (S - πs1,...,sn(R ⋈ S)))

Outer join

The outer join or full outer join in effect combines the results of the left and right outer joins.

The full outer join is written as R =X= S where R and S are relations. The result of the full outer join is the set of all combinations of tuples in R and S that are equal on their common attribute names, in addition to tuples in S that have no matching tuples in R and tuples in R that have no matching tuples in S in their common attribute names.

For an example consider the tables Employee and Dept and their full outer join (the Dept table and result below are completed from the same example tables used in the preceding sections, since the original cuts off here):

Employee

Name     EmpId  DeptName
Harry    3415   Finance
Sally    2241   Sales
George   3401   Finance
Harriet  2202   Sales
Tim      1123   Executive

Dept

DeptName    Manager
Sales       Harriet
Production  Charles

Employee =X= Dept

Name     EmpId  DeptName    Manager
Harry    3415   Finance     ω
Sally    2241   Sales       Harriet
George   3401   Finance     ω
Harriet  2202   Sales       Harriet
Tim      1123   Executive   ω
ω        ω      Production  Charles

Relational calculus

Relational calculus consists of two calculi, the tuple relational calculus and the domain relational calculus, that are part of the relational model for databases and provide a declarative way to specify database queries. This is in contrast to the relational algebra, which is also part of the relational model but provides a more procedural way for specifying queries.

The relational algebra might suggest these steps to retrieve the phone numbers and names of book stores that supply Some Sample Book:

1. Join books and book stores over the BookstoreID.
2. Restrict the result of that join to tuples for the book Some Sample Book.
3. Project the result of that restriction over StoreName and StorePhone.

The relational calculus would formulate a descriptive, declarative way:


Get StoreName and StorePhone for supplies such that there exists a title BK with the same BookstoreID value and with a BookTitle value of Some Sample Book.

The relational algebra and the relational calculus are essentially logically equivalent: for any algebraic expression, there is an equivalent expression in the calculus, and vice versa. This result is known as Codd's theorem.
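Either formulation might be rendered in SQL roughly as follows (a sketch; the tables Book and Bookstore and their columns are assumptions for illustration):

    -- Phone numbers and names of book stores that supply Some Sample Book.
    SELECT s.StoreName, s.StorePhone
    FROM   Bookstore s
    JOIN   Book b ON b.BookstoreID = s.BookstoreID
    WHERE  b.BookTitle = 'Some Sample Book';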

Tuple relational calculus

The tuple calculus is a calculus that was introduced by Edgar F. Codd as part of the relational model, in order to give a declarative database query language for this data model. It formed the inspiration for the database query languages QUEL and SQL, of which the latter, although far less faithful to the original relational model and calculus, is now used in almost all relational database management systems as the ad-hoc query language. Lacroix and Pirotte proposed the domain calculus, which is closer to first-order logic, and showed that these two calculi (and the relational algebra) are equivalent in expressive power. Subsequently, query languages for the relational model were called relationally complete if they could express at least all of these queries.


Definition of the calculus

Relational database

Since the calculus is a query language for relational databases we first have to define a relational database. The basic relational building block is the domain, or data type. A tuple is an ordered multiset of attributes, which are ordered pairs of domain and value. A relvar (relation variable) is a set of ordered pairs of domain and name, which serves as the header for a relation. A relation is a set of tuples. Although these relational concepts are mathematically defined, those definitions map loosely to traditional database concepts. A table is an accepted visual representation of a relation; a tuple is similar to the concept of row.

We first assume the existence of a set C of column names, examples of which are "name", "author", "address" et cetera. We define headers as finite subsets of C. A relational database schema is defined as a tuple S = (D, R, h) where D is the domain of atomic values (see relational model for more on the notions of domain and atomic value), R is a finite set of relation names, and

h : R → 2^C

a function that associates a header with each relation name in R. (Note that this is a simplification from the full relational model where there is more than one domain and a header is not just a set of column names but also maps these column names to a domain.) Given a domain D we define a tuple over D as a partial function

t : C → D

that maps some column names to an atomic value in D. An example would be (name : "Harry", age : 25).

The set of all tuples over D is denoted as T_D. The subset of C for which a tuple t is defined is called the domain of t (not to be confused with the domain in the schema) and denoted as dom(t). Finally we define a relational database given a schema S = (D, R, h) as a function

db : R → 2^(T_D)

that maps the relation names in R to finite subsets of T_D, such that for every relation name r in R and tuple t in db(r) it holds that

dom(t) = h(r).

The latter requirement simply says that all the tuples in a relation should contain the same column names, namely those defined for it in the schema.

Atoms

For the construction of the formulae we will assume an infinite set V of tuple variables. The formulas are defined given a database schema S = (D, R, h) and a partial function type : V → 2^C that defines a type assignment that assigns headers to some tuple variables. We then define the set of atomic formulas A[S,type] with the following rules:


1. if v and w in V, a in type(v) and b in type(w) then the formula " v.a = w.b " is in A[S,type],
2. if v in V, a in type(v) and k denotes a value in D then the formula " v.a = k " is in A[S,type], and
3. if v in V, r in R and type(v) = h(r) then the formula " r(v) " is in A[S,type].

Examples of atoms are:

• (t.age = s.age): t has an age attribute and s has an age attribute with the same value
• (t.name = "Codd"): tuple t has a name attribute and its value is "Codd"
• Book(t): tuple t is present in relation Book.

The formal semantics of such atoms is defined given a database db over S and a tuple variable binding val : V → T_D that maps tuple variables to tuples over the domain in S:

1. " v.a = w.b " is true if and only if val(v)(a) = val(w)(b)
2. " v.a = k " is true if and only if val(v)(a) = k
3. " r(v) " is true if and only if val(v) is in db(r)

Formulas

The atoms can be combined into formulas, as is usual in first-order logic, with the logical operators ∧ (and), ∨ (or) and ¬ (not), and we can use the existential quantifier (∃) and the universal quantifier (∀) to bind the variables. We define the set of formulas F[S,type] inductively with the following rules:

1. every atom in A[S,type] is also in F[S,type]
2. if f1 and f2 are in F[S,type] then the formula " f1 ∧ f2 " is also in F[S,type]
3. if f1 and f2 are in F[S,type] then the formula " f1 ∨ f2 " is also in F[S,type]
4. if f is in F[S,type] then the formula " ¬ f " is also in F[S,type]
5. if v in V, H a header and f a formula in F[S,type[v->H]] then the formula " ∃ v : H ( f ) " is also in F[S,type], where type[v->H] denotes the function that is equal to type except that it maps v to H,
6. if v in V, H a header and f a formula in F[S,type[v->H]] then the formula " ∀ v : H ( f ) " is also in F[S,type]

Examples of formulas:

• t.name = "C. J. Date" ∨ t.name = "H. Darwen"
• Book(t) ∨ Magazine(t)
• ∀ t : {author, title, subject} ( ¬ ( Book(t) ∧ t.author = "C. J. Date" ∧ ¬ ( t.subject = "relational model" ) ) )

Note that the last formula states that all books that are written by C. J. Date have as their subject the relational model. As usual we omit brackets if this causes no ambiguity about the semantics of the formula.

We will assume that the quantifiers quantify over the universe of all tuples over the domain in the schema. This leads to the following formal semantics for formulas given a database db over S and a tuple variable binding val : V → T_D:

1. " f1 ∧ f2 " is true if and only if " f1 " is true and " f2 " is true,
2. " f1 ∨ f2 " is true if and only if " f1 " is true or " f2 " is true or both are true,
3. " ¬ f " is true if and only if " f " is not true,
4. " ∃ v : H ( f ) " is true if and only if there is a tuple t over D such that dom(t) = H and the formula " f " is true for val[v->t], and
5. " ∀ v : H ( f ) " is true if and only if for all tuples t over D such that dom(t) = H the formula " f " is true for val[v->t].

Queries

Finally we define what a query expression looks like given a schema S = (D, R, h):

{ v : H | f(v) }

where v is a tuple variable, H a header and f(v) a formula in F[S,type] where type = { (v, H) } and with v as its only free variable. The result of such a query for a given database db over S is the set of all tuples t over D with dom(t) = H such that f is true for db and val = { (v, t) }.

Examples of query expressions are:

• { t : {name} | ∃ s : {name, wage} ( Employee(s) ∧ s.wage = 50.000 ∧ t.name = s.name ) }
• { t : {supplier, article} | ∃ s : {s#, sname} ( Supplier(s) ∧ s.sname = t.supplier ∧ ∃ p : {p#, pname} ( Product(p) ∧ p.pname = t.article ∧ ∃ a : {s#, p#} ( Supplies(a) ∧ s.s# = a.s# ∧ a.p# = p.p# ) ) ) }
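For comparison, the first query expression above corresponds to a simple SQL statement (a sketch, assuming an Employee table with name and wage columns and reading 50.000 as the number 50000):

    -- { t : {name} | ∃ s : {name, wage} ( Employee(s) ∧ s.wage = 50.000 ∧ t.name = s.name ) }
    SELECT DISTINCT name
    FROM   Employee
    WHERE  wage = 50000;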

Semantic and syntactic restriction of the calculus

Domain-independent queries

Because the semantics of the quantifiers is such that they quantify over all the tuples over the domain in the schema, it can be that a query may return a different result for a certain database if another schema is presumed. For example, consider the two schemas S1 = ( D1, R, h ) and S2 = ( D2, R, h ) with domains D1 = { 1 }, D2 = { 1, 2 }, relation names R = { "r1" } and headers h = { ("r1", {"a"}) }. Both schemas have a common instance:

db = { ( "r1", { ("a", 1) } ) }

If we consider the following query expression

{ t : {a} | t.a = t.a }

then its result on db is either { (a : 1) } under S1 or { (a : 1), (a : 2) } under S2. It will also be clear that if we take the domain to be an infinite set, then the result of the query will also be infinite. To solve these problems we will restrict our attention to those queries that are domain independent, i.e., the queries that return the same result for a database under all of its schemas.

An interesting property of these queries is that if we assume that the tuple variables range over tuples over the so-called active domain of the database, which is the subset of the domain that occurs in at least one tuple in the database or in the query expression, then the semantics of the query expressions does not change. In fact, in many definitions of the tuple calculus this is how the semantics of the quantifiers is defined, which makes all queries by definition domain independent.


Safe queries

In order to limit the query expressions such that they express only domain-independent queries, a syntactical notion of safe query is usually introduced. To determine whether a query expression is safe we will derive two types of information from a query. The first is whether a variable-column pair t.a is bound to the column of a relation or a constant, and the second is whether two variable-column pairs are directly or indirectly equated (denoted t.v == s.w).

For deriving boundedness we introduce the following reasoning rules:

1. in " v.a = w.b " no variable-column pair is bound,
2. in " v.a = k " the variable-column pair v.a is bound,
3. in " r(v) " all pairs v.a are bound for a in type(v),
4. in " f1 ∧ f2 " all pairs are bound that are bound either in f1 or in f2,
5. in " f1 ∨ f2 " all pairs are bound that are bound both in f1 and in f2,
6. in " ¬ f " no pairs are bound,
7. in " ∃ v : H ( f ) " a pair w.a is bound if it is bound in f and w <> v, and
8. in " ∀ v : H ( f ) " a pair w.a is bound if it is bound in f and w <> v.

For deriving equatedness we introduce the following reasoning rules (next to the usual reasoning rules for equivalence relations: reflexivity, symmetry and transitivity):

1. in " v.a = w.b " it holds that v.a == w.b,
2. in " v.a = k " no pairs are equated,
3. in " r(v) " no pairs are equated,
4. in " f1 ∧ f2 " it holds that v.a == w.b if it holds either in f1 or in f2,
5. in " f1 ∨ f2 " it holds that v.a == w.b if it holds both in f1 and in f2,
6. in " ¬ f " no pairs are equated,
7. in " ∃ v : H ( f ) " it holds that w.a == x.b if it holds in f and w<>v and x<>v, and
8. in " ∀ v : H ( f ) " it holds that w.a == x.b if it holds in f and w<>v and x<>v.

We then say that a query expression { v : H | f(v) } is safe if

• for every column name a in H we can derive that v.a is equated with a bound pair in f,
• for every subexpression of f of the form " ∀ w : G ( g ) " we can derive that, for every column name a in G, w.a is equated with a bound pair in g, and
• for every subexpression of f of the form " ∃ w : G ( g ) " we can derive that, for every column name a in G, w.a is equated with a bound pair in g.

The restriction to safe query expressions does not limit the expressiveness, since all domain-independent queries that could be expressed can also be expressed by a safe query expression. This can be proven by showing that for a schema S = (D, R, h), a given set K of constants in the query expression, a tuple variable v and a header H we can construct a safe formula for every pair v.a with a in H that states that its value is in the active domain. For example, assume that K = {1,2}, R = {"r"} and h = { ("r", {"a", "b"}) }; then the corresponding safe formula for v.b is:

v.b = 1 ∨ v.b = 2 ∨ ∃ w ( r(w) ∧ ( v.b = w.a ∨ v.b = w.b ) )

This formula, then, can be used to rewrite any unsafe query expression to an equivalent safe query expression by adding such a formula for every variable v and column name a in its type where it is used in the expression. Effectively this means that we let all variables range over the active domain, which, as was already explained, does not change the semantics if the expressed query is domain independent.


Domain relational calculus

In computer science, domain relational calculus (DRC) is a calculus that was introduced by Michel Lacroix and Alain Pirotte as a declarative database query language for the relational data model.[1]

In DRC, queries have the form:

{ <X1, X2, ..., Xn> | p(<X1, X2, ..., Xn>) }

where each Xi is either a domain variable or a constant, and p(<X1, X2, ..., Xn>) denotes a DRC formula. The result of the query is the set of all tuples <X1, X2, ..., Xn> that make the DRC formula true.

This language uses the same operators as tuple calculus: the logical connectives ∧ (and), ∨ (or) and ¬ (not). The existential quantifier (∃) and the universal quantifier (∀) can be used to bind the variables. Its computational expressivity is equivalent to that of relational algebra.[2]

Examples

Let A, B, C mean Rank, Name, ID and let D, E, F mean Name, DeptName, ID (writing Departments below for the relation with header D, E, F).

Find all captains of the starship USS Enterprise:

{ <A, B, C> | <A, B, C> ∈ Enterprise ∧ A = "Captain" }

In this example, A, B, C denotes both the result set and a set in the table Enterprise.

Find names of Enterprise crewmembers who are in Stellar Cartography:

{ <B> | ∃ A, C ( <A, B, C> ∈ Enterprise ∧ ∃ D, E, F ( <D, E, F> ∈ Departments ∧ F = C ∧ E = "Stellar Cartography" ) ) }

In this example, we're only looking for the name, and that's B. F = C is a requirement, because we need to find Enterprise crew members AND they are in the Stellar Cartography department.

An alternate representation of the previous example would be:

{ <B> | ∃ A, C ( <A, B, C> ∈ Enterprise ∧ ∃ D ( <D, "Stellar Cartography", C> ∈ Departments ) ) }

In this example, the value of the requested E domain (the department name) is placed directly in the formula, and the C domain variable is re-used in the query for the existence of a department, since it already holds a crew member's id.

Codd's 12 rules

Codd's 12 rules are a set of thirteen rules (numbered zero to twelve) proposed by Edgar F. Codd, a pioneer of the relational model for databases, designed to define what is required from a database management system in order for it to be considered relational, i.e., an RDBMS.[1][2]

Codd produced these rules as part of a personal campaign to prevent his vision of the relational database being diluted, as database vendors scrambled in the early 1980s to repackage existing products with a relational veneer. Rule 12 was particularly designed to counter such a positioning. In fact, the rules are so strict that all popular so-called "relational" DBMSs fail on many of the criteria.

The rules

Rule 0: The system must qualify as relational, as a database, and as a management system.

For a system to qualify as a relational database management system (RDBMS), that system must use its relational facilities (exclusively) to manage the database.

Rule 1: The information rule:

All information in the database is to be represented in one and only one way, namely by values in column positions within rows of tables.

Rule 2: The guaranteed access rule:

All data must be accessible. This rule is essentially a restatement of the fundamental requirement for primary keys. It says that every individual scalar value in the database must be logically addressable by specifying the name of the containing table, the name of the containing column and the primary key value of the containing row.

Rule 3: Systematic treatment of null values:

The DBMS must allow each field to remain null (or empty). Specifically, it must support a representation of "missing information and inapplicable information" that is systematic, distinct from all regular values (for example, "distinct from zero or any other number", in the case of numeric values), and independent of data type. It is also implied that such representations must be manipulated by the DBMS in a systematic way.

Rule 4: Active online catalog based on the relational model:

The system must support an online, inline, relational catalog that is accessible to authorized users by means of their regular query language. That is, users must be able to access the database's structure (catalog) using the same query language that they use to access the database's data.

Rule 5: The comprehensive data sublanguage rule:

The system must support at least one relational language that
1. has a linear syntax,
2. can be used both interactively and within application programs, and
3. supports data definition operations (including view definitions), data manipulation operations (update as well as retrieval), security and integrity constraints, and transaction management operations (begin, commit, and rollback).

Rule 6: The view updating rule:


All views that are theoretically updatable must be updatable by the system.

Rule 7: High-level insert, update, and delete:

The system must support set-at-a-time insert, update, and delete operators. This means that data can be retrieved from a relational database in sets constructed of data from multiple rows and/or multiple tables. This rule states that insert, update, and delete operations should be supported for any retrievable set rather than just for a single row in a single table.
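For example, the single set-level statement below modifies every qualifying row at once, rather than one row at a time (table and column names are assumed for illustration):

UPDATE employees
   SET salary = salary * 1.10
 WHERE dept_id = 42;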

Rule 8: Physical data independence:

Changes to the physical level (how the data is stored, whether in arrays or linked lists etc.) must not require a change to an application based on the structure.

Rule 9: Logical data independence:

Changes to the logical level (tables, columns, rows, and so on) must not require a change to an application based on the structure. Logical data independence is more difficult to achieve than physical data independence.

Rule 10: Integrity independence:

Integrity constraints must be specified separately from application programs and stored in the catalog. It must be possible to change such constraints as and when appropriate without unnecessarily affecting existing applications.

Rule 11: Distribution independence:

The distribution of portions of the database to various locations should be invisible to users of the database. Existing applications should continue to operate successfully:

1. when a distributed version of the DBMS is first introduced; and
2. when existing distributed data are redistributed around the system.

Rule 12: The nonsubversion rule:

If the system provides a low-level (record-at-a-time) interface, then that interface cannot be used to subvert the system, for example, bypassing a relational security or integrity constraint.

Index (database)
From Wikipedia, the free encyclopedia



A database index is a data structure that improves the speed of operations on a database table. Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access to ordered records. The disk space required to store an index is typically less than that required by the table (since indexes usually contain only the key fields by which the table is to be arranged, and exclude all the other details in the table), making it possible to keep indexes in memory for a table whose data is too large to fit there.

In a relational database, an index is a copy of one part of a table. Some databases extend the power of indexing by allowing indexes to be created on functions or expressions. For example, an index could be created on upper(last_name), which would only store the upper case versions of the last_name field in the index. Another option sometimes supported is the use of "filtered" indexes, where index entries are created only for those records that satisfy some conditional expression. A further aspect of flexibility is to permit indexing on user-defined functions, as well as expressions formed from an assortment of built-in functions.
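A sketch of both ideas follows; the exact syntax varies by product (the expression index is Oracle/PostgreSQL style, and the filtered index uses the WHERE form found in SQL Server and PostgreSQL):

-- Index on an expression rather than a bare column
CREATE INDEX idx_people_upper_name ON people (UPPER(last_name));

-- Filtered (partial) index: entries only for rows matching the condition
CREATE INDEX idx_customers_active ON customers (customer_id) WHERE active = 1;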

Indexes may be defined as unique or non-unique. A unique index acts as a constraint on the table by preventing duplicate entries in the index and thus, the backing table.
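For example (the people table and its ssn column are assumed):

-- Inserting a second row with an existing ssn will now fail
CREATE UNIQUE INDEX idx_people_ssn ON people (ssn);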

Index Architecture

Index architectures can be classified as clustered or non-clustered.

Unclustered

An unclustered index is structured in a way that does not correspond to the order of the actual data records. It resembles the index at the back of a book. For this reason, it will perform worse than a clustered index on range searches where the result set is large, since each result could cost an additional I/O operation to fetch the actual data record. One can have any number of these indexes, since all that is required is the space to store the index itself; the actual data is not copied to create a new index.

Clustered

Clustering re-orders the data blocks into the same order as the index, so it is an operation on the data storage blocks as well as on the index. A phone book ordered by last name resembles a clustered index in its structure and purpose. The exact operation of database systems varies, but because storing the row data more than once is usually too expensive to do in practice, the row data can be physically stored in only one order. Therefore, only one clustered index can be created on a given database table. Clustered indexes can greatly increase access speed, but usually only where the data is accessed sequentially in the same or reverse order of the clustered index, or when a range of items is selected.

Since the physical records are in this sort order on disk, the next row item in the sequence is immediately before or after the last one, and so fewer data block reads are required. The primary feature of a clustered index is therefore the ordering of the physical data rows in accordance with the index blocks that point to them. Some databases separate the data and index blocks into separate files, while others intermix the two different types of data blocks within the same physical file(s). Databases that use the latter scheme may be said to store the actual data in the leaf nodes of the index; in fact, there is still a distinction between the index and the data blocks, and the data blocks can be traversed without using the index by way of a linked list that links the data blocks in order.

Index Implementations

Indexes can be implemented using a variety of data structures. Popular indexes include balanced trees, B+ trees and hashes.[1]

In Microsoft SQL Server, the leaf node of the clustered index corresponds to the actual data, not simply a pointer to data that resides elsewhere, as is the case with a non-clustered index.[2] Each relation can have a single clustered index and many unclustered indexes.[3]

Column order

The order in which columns are listed in the index definition is important. It is possible to retrieve a set of row identifiers using only the first indexed column. However, it is not possible or efficient (on most databases) to retrieve the set of row identifiers using only the second or later indexed columns.

For example, imagine a phone book that is organized by city first, then by last name, and then by first name. If you are given the city, you can easily extract the list of all phone numbers for that city. However, in this phone book it would be very tedious to find all the phone numbers for a given last name. You would have to look within each city's section for the entries with that last name. Some databases can do this, others just won’t use the index.
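The phone book analogy corresponds to a composite index along these lines (table and column names are assumed):

CREATE INDEX idx_phonebook ON phonebook (city, last_name, first_name);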

Applications and limitations

Indexes are useful for many applications but come with some limitations. Consider the following SQL statement: SELECT first_name FROM people WHERE last_name = 'Smith';. To process this statement without an index the database software must look at the last_name column on every row in the table (this is known as a full table scan). With an index the database simply follows the B-tree data structure until the Smith entry has been found; this is much less computationally expensive than a full table scan.
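The index assumed by this example could be created along the following lines:

CREATE INDEX idx_people_last_name ON people (last_name);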

Consider this SQL statement: SELECT email_address FROM customers WHERE email_address LIKE '%@yahoo.com';. This query would yield an email address for every customer whose email address ends with "@yahoo.com", but even if the email_address column has been indexed the database still must perform a full table scan. This is because the index is built with the assumption that words go from left to right. With a wildcard at the beginning of the search-term, the database software is unable to use the underlying b-tree data structure. This problem can be solved through the addition of another index created on reverse(email_address) and a SQL query like this: SELECT email_address FROM customers WHERE reverse(email_address) LIKE reverse('%@yahoo.com');. This puts the wild-card at the right-most part of the query (now moc.oohay@%) which the index on reverse(email_address) can satisfy.
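In a system that supports function-based indexes, the supporting index could be created roughly as follows (REVERSE is not an indexable function in every product, so treat this as a sketch):

CREATE INDEX idx_customers_email_rev ON customers (REVERSE(email_address));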

Page 27: Complete Rdbms 1

Types

Bitmap index

Main article: Bitmap index

A bitmap index is a special kind of index that stores the bulk of its data as bit arrays (bitmaps) and answers most queries by performing bitwise logical operations on these bitmaps. The most commonly used indexes, such as B+ trees, are most efficient when the values they index do not repeat, or repeat only a relatively small number of times. In contrast, the bitmap index is designed for cases where the values of a variable repeat very frequently. For example, the gender field in a customer database usually contains two distinct values: male or female. For such variables, the bitmap index can have a significant performance advantage over the commonly used trees.
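In Oracle, for example, such an index is created with the BITMAP keyword (table and column names are assumed):

CREATE BITMAP INDEX idx_customers_gender ON customers (gender);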

Dense index

A dense index in databases is a file with pairs of keys and pointers for every record in the data file. Every key in this file is associated with a particular pointer to a record in the sorted data file. In clustered indexes with duplicate keys, the dense index points to the first record with that key.[4]

Sparse index

A sparse index in databases is a file with pairs of keys and pointers for every block in the data file. Every key in this file is associated with a particular pointer to the block in the sorted data file. In clustered indexes with duplicate keys, the sparse index points to the lowest search key in each block.

Covering Index

In most cases, an index is used to quickly locate the data record(s) from which the required data is read. In other words, the index is only used to locate data records in the table and not to return data.

A covering index is a special case where the index itself contains the required data field(s) and can return the data.

Consider the following table (other fields omitted):

ID   Name   Other Fields
12   Plug   ...
13   Lamp   ...
14   Fuse   ...

To find the Name for ID 13, an index on (ID) will be useful, but the record must still be read to get the Name. However, an index on (ID, Name) contains the required data field and eliminates the need to look up the record.

A covering index can dramatically speed up data retrieval, but the index itself may be relatively large due to the additional keys, which slows down data insertion and updates. To reduce such index size, some systems allow non-key fields to be included in the index. Non-key fields are not themselves part of the index ordering but are included only at the leaf level, allowing for a covering index with a smaller overall index size.
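A sketch of both variants for the example table (here assumed to be named parts; the INCLUDE syntax is the SQL Server / PostgreSQL 11+ form):

-- Covering index: both columns are index keys
CREATE INDEX idx_parts_id_name ON parts (ID, Name);

-- Name stored only at the leaf level, not part of the index ordering
CREATE INDEX idx_parts_id_incl ON parts (ID) INCLUDE (Name);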

Page 28: Complete Rdbms 1

Trigger Basics (Oracle)
Posted by decipherinfosys on March 18, 2008

Triggers are among the most widely used database objects. In Oracle, we can define triggers on tables, views, a schema, or the database. On tables, we can define DML (INSERT, UPDATE, DELETE) triggers; on a schema we can define DDL triggers and logon/logoff event triggers; and on the database we can define system event triggers (startup/shutdown). In general, based on where a trigger is defined, it can respond to DML events, DDL events, or login events. Common uses of triggers are to generate running numbers before inserting new rows, to enforce security restrictions, to populate other columns based on new or old column values, and to audit changes, to name a few.
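For instance, the running-number use case can be sketched as a BEFORE INSERT row trigger that populates the key from a sequence (the emp table, its empno column, and the sequence name are assumed):

CREATE SEQUENCE emp_seq;

CREATE OR REPLACE TRIGGER emp_bi_trg
BEFORE INSERT ON emp
FOR EACH ROW
BEGIN
  -- assign the next running number to the new row's key
  SELECT emp_seq.NEXTVAL INTO :NEW.empno FROM dual;
END;
/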

Following are the different trigger types for DML related triggers:

1) BEFORE and AFTER triggers: Indicate whether the trigger should fire before or after the INSERT, UPDATE, or DELETE operation occurs on the table. These can be used for DDL statements as well, but in that case the trigger must be defined on either the schema or the database.

2) ROW and STATEMENT triggers: Define whether the trigger should fire for each row affected or just once for the triggering statement:
a) BEFORE ROW: for each row, before the triggering statement is executed.
b) AFTER ROW: for each row, after the triggering statement is executed.
c) BEFORE STATEMENT: once, before the triggering statement is executed.
d) AFTER STATEMENT: once, after the triggering statement is executed.

3) INSTEAD OF triggers: These triggers are normally defined on complex views. They allow updates to views that are not directly updatable through DML statements by updating the underlying tables used to create the view (see the sketch below).
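A minimal sketch, assuming a view emp_dept_view built over an emp table:

CREATE OR REPLACE TRIGGER emp_dept_view_trg
INSTEAD OF INSERT ON emp_dept_view
FOR EACH ROW
BEGIN
  -- redirect the insert on the view to the underlying table
  INSERT INTO emp (empno, ename, deptno)
  VALUES (:NEW.empno, :NEW.ename, :NEW.deptno);
END;
/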

In 11g, Oracle introduced COMPOUND triggers, in which more than one type of trigger can be defined in a single trigger. In a compound trigger, we can define a specific action for each of the types mentioned above, but all these actions are composed in a single trigger. We will cover compound triggers in a future blog post.

We can define more than one trigger of different types on the same table. Stretching it further, we can even have multiple triggers of the same type, e.g. two BEFORE INSERT triggers on the same table.

When two or more triggers are defined on a table, they fire in the following order:

• All BEFORE STATEMENT triggers
• All BEFORE ROW triggers
• All AFTER ROW triggers
• All AFTER STATEMENT triggers

If multiple triggers are of the same type, they follow the above order, but there is no guarantee which one will execute first unless they are created with the FOLLOWS clause. This clause was introduced in 11g. To know more about this clause, please visit our previous blog post.

By default, triggers are enabled when they are created. One can disable a trigger using the ALTER TRIGGER command. In Oracle 11g, we can also create a trigger in the DISABLE state. We have covered this in one of our earlier blog posts.

It is advisable to avoid creating triggers to enforce referential integrity constraints. If triggers are not defined correctly, they may result in mutating table errors. We will cover more about this in a future blog post.


CURSOR

For every SQL statement execution, a certain area in memory is allocated. PL/SQL allows you to name this area. This private SQL area is called the context area, or cursor. A cursor acts as a handle or pointer into the context area. A PL/SQL program controls the context area through the cursor. A cursor represents a structure in memory and is different from a cursor variable.

When you declare a cursor, you get a pointer variable, which does not yet point to anything. When the cursor is opened, memory is allocated and the cursor structure is created; the cursor variable now points to the cursor. When the cursor is closed, the memory allocated for the cursor is released.

Cursors allow the programmer to retrieve data from a table and perform actions on that data one row at a time. There are two types of cursors: implicit cursors and explicit cursors.

Implicit cursors

For SQL queries returning a single row, PL/SQL declares implicit cursors. Implicit cursors are simple SELECT statements written in the BEGIN block (executable section) of the PL/SQL program. Implicit cursors are easy to code, and they retrieve exactly one row. PL/SQL implicitly declares cursors for all DML statements. The most commonly raised exceptions here are NO_DATA_FOUND and TOO_MANY_ROWS. Syntax:

SELECT ename, sal INTO ena, esa FROM EMP WHERE EMPNO = 7844;

Note: ename and sal are columns of the table EMP, and ena and esa are the variables used to store the ename and sal values fetched by the query.
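Wrapped in a complete block with declarations and the common exceptions handled, the same implicit cursor might look like this (a minimal sketch):

DECLARE
  ena EMP.ename%TYPE;
  esa EMP.sal%TYPE;
BEGIN
  SELECT ename, sal INTO ena, esa FROM EMP WHERE EMPNO = 7844;
  DBMS_OUTPUT.PUT_LINE(ena || ' earns ' || esa);
EXCEPTION
  WHEN NO_DATA_FOUND THEN
    DBMS_OUTPUT.PUT_LINE('No such employee');
  WHEN TOO_MANY_ROWS THEN
    DBMS_OUTPUT.PUT_LINE('Query returned more than one row');
END;
/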

Explicit Cursors

Explicit cursors are used in queries that return multiple rows. The set of rows fetched by a query is called the active set; it consists of the rows that meet the search criteria in the SELECT statement. An explicit cursor is declared in the DECLARE section of a PL/SQL program.

Syntax:

CURSOR <cursor-name> IS <select-statement>;


Sample Code:

DECLARE
    CURSOR emp_cur IS
        SELECT ename FROM EMP;
BEGIN
    ----
    ---
END;

Processing multiple rows is similar to file processing. To process a file you need to open it, process its records, and then close it. Similarly, a user-defined explicit cursor needs to be opened before its rows are read, and closed afterwards. Just as a file pointer marks the current position during file processing, a cursor marks the current position in the active set.

Opening Cursor

Syntax: OPEN <cursor-name>;

Example : OPEN emp_cur;

When a cursor is opened, the active set is determined: the rows satisfying the WHERE clause in the SELECT statement are added to the active set. A pointer is established that points to the first row in the active set.

Fetching from the cursor: To get the next row from the cursor, we need to use the FETCH statement.

Syntax: FETCH <cursor-name> INTO <variables>;

Example: FETCH emp_cur INTO ena;

The FETCH statement retrieves one row at a time. The BULK COLLECT clause needs to be used to fetch more than one row at a time.

Closing the cursor: After retrieving all the rows from the active set, the cursor should be closed. The resources allocated for the cursor are then freed. Once the cursor is closed, executing a FETCH statement against it will raise an error.

CLOSE <cursor-name>;


Explicit Cursor Attributes

Every cursor defined by the user has four attributes. When appended to the cursor name, these attributes let the user access useful information about the execution of a multirow query.

The attributes are:

1. %NOTFOUND: A Boolean attribute that evaluates to TRUE if the last fetch failed, i.e. when there are no rows left in the cursor to fetch.
2. %FOUND: A Boolean attribute that evaluates to TRUE if the last fetch succeeded.
3. %ROWCOUNT: A numeric attribute that returns the number of rows fetched by the cursor so far.
4. %ISOPEN: A Boolean attribute that evaluates to TRUE if the cursor is open, and FALSE otherwise.
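A minimal sketch of such a multi-fetch block, assuming the EMP table from the earlier examples:

DECLARE
  CURSOR emp_cur IS SELECT ename FROM EMP;
  ena EMP.ename%TYPE;
BEGIN
  OPEN emp_cur;
  FETCH emp_cur INTO ena;  -- first row
  DBMS_OUTPUT.PUT_LINE(ena);
  FETCH emp_cur INTO ena;  -- second row
  DBMS_OUTPUT.PUT_LINE(ena);
  CLOSE emp_cur;
END;
/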

In the above example a separate FETCH is written for each row; a LOOP statement could be used here instead. The following example explains the usage of LOOP.
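A minimal sketch using a basic LOOP, exiting on %NOTFOUND (again assuming the EMP table):

DECLARE
  CURSOR emp_cur IS SELECT ename FROM EMP;
  ena EMP.ename%TYPE;
BEGIN
  OPEN emp_cur;
  LOOP
    FETCH emp_cur INTO ena;
    EXIT WHEN emp_cur%NOTFOUND;  -- stop when the last fetch returned no row
    DBMS_OUTPUT.PUT_LINE(ena);
  END LOOP;
  CLOSE emp_cur;
END;
/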


Using WHILE:

A WHILE loop can be used, as shown in the following example, for accessing the cursor values.

Example:
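A minimal sketch, again assuming the EMP table; FETCH appears twice so that %FOUND is set before the loop condition is first tested:

DECLARE
  CURSOR emp_cur IS SELECT ename FROM EMP;
  ena EMP.ename%TYPE;
BEGIN
  OPEN emp_cur;
  FETCH emp_cur INTO ena;    -- first fetch, before the loop
  WHILE emp_cur%FOUND LOOP
    DBMS_OUTPUT.PUT_LINE(ena);
    FETCH emp_cur INTO ena;  -- next fetch, at the end of each iteration
  END LOOP;
  CLOSE emp_cur;
END;
/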

FETCH is used twice in the above example to make %FOUND available before the loop condition is first tested. See below example.
