A Relational Model of Data for Large Shared Data Banks

Information Retrieval, Phyllis Baxendale


E.F. Codd

June, 1970, Volume 13, Number 6

pp. 377-387

In 1970, Codd proposed a new model for database systems called the relational model. Through its simplicity and mathematical basis, the relational model has provided an intuitively more appealing foundation for database systems than its two major competitors: the hierarchical and network models. The model has had an enormous impact on both the theory and development of database systems. A growing number of commercial database systems are relational. In 1981, the ACM Turing Award was presented to Codd.

-D.E.D.


E. F. Codd, IBM Research Laboratory, San Jose, California

Future users of large data banks must be protected from having to know how the data is organized in the machine [...] unaffected when the internal representation of data is changed and even when some aspects of the external representation [...] traffic and natural growth in the types of stored information.

[...] with tree-structured files or slightly more general network [...] are discussed. A model based on n-ary relations, a normal [...] and applied to the problems of redundancy and consistency [...]

KEY WORDS AND PHRASES: [...]

Communications of the ACM

25th Anniversary Issue, January, 1983, Volume 26, Number 1

CR CATEGORIES: 3.70, 3.73, 3.75, 4.20, 4.22, 4.29

1. Relational Model and Normal Form

This paper is concerned with the application of elementary relation theory to systems which provide shared access to large banks of formatted data. Except for a paper by Childs [1], the principal application of relations to data systems has been to deductive question-answering systems. Levein and Maron [2] provide numerous references to work in this area.

In contrast, the problems treated here are those of data independence (the independence of application programs and terminal activities from growth in data types and changes in data representation) and certain kinds of data inconsistency which are expected to become troublesome even in nondeductive systems.

The relational view (or model) of data described in Section 1 appears to be superior in several respects to the graph or network model [3, 4] presently in vogue for noninferential systems. It provides a means of describing data with its natural structure only, that is, without superimposing any additional structure for machine representation purposes. Accordingly, it provides a basis for a high level data language which will yield maximal independence between programs on the one hand and machine representation and organization of data on the other.

A further advantage of the relational view is that it forms a sound basis for treating derivability, redundancy, and consistency of relations; these are discussed in Section 2. The network model, on the other hand, has spawned a number of confusions, not the least of which is mistaking the derivation of connections for the derivation of relations (see remarks in Section 2 on the "connection trap").

Finally, the relational view permits a clearer evaluation of the scope and logical limitations of present formatted data systems, and also the relative merits (from a logical standpoint) of competing representations of data within a single system. Examples of this clearer perspective are cited in various parts of this paper. Implementations of systems to support the relational model are not discussed.

1.2. DATA DEPENDENCIES IN PRESENT SYSTEMS

The provision of data description tables in recently developed information systems represents a major advance toward the goal of data independence [5, 6, 7]. Such tables facilitate changing certain characteristics of the data representation stored in a data bank. However, the variety of data representation characteristics which can be changed without logically impairing some application programs is still quite limited. Further, the model of data with which users interact is still cluttered with representational properties, particularly in regard to the representation of collections of data (as opposed to individual items). Three of the principal kinds of data dependencies which still need to be removed are: ordering dependence, indexing dependence, and access path dependence. In some systems these dependencies are not clearly separable from one another.

1.2.1. Ordering Dependence. Elements of data in a data bank may be stored in a variety of ways, some involving no concern for ordering, some permitting each element to participate in one ordering only, others permitting each element to participate in several orderings. Let us consider those existing systems which either require or permit data elements to be stored in at least one total ordering which is closely associated with the hardware-determined ordering of addresses. For example, the records of a file concerning parts might be stored in ascending order by part serial number. Such systems normally permit application programs to assume that the order of presentation of records from such a file is identical to (or is a subordering of) the stored ordering. Those application programs which take advantage of the stored ordering of a file are likely to fail to operate correctly if for some reason it becomes necessary to replace that ordering by a different one. Similar remarks hold for a stored ordering implemented by means of pointers.

[...] because all the well-known information systems that are marketed today fail to make a clear distinction between order of presentation on the one hand and stored ordering on the other. Significant implementation problems must be solved to provide this kind of independence.

1.2.2. Indexing Dependence. In the context of formatted [...] performance-oriented component of the data representation. It tends to improve response to queries and updates and, at the same time, slow down response to insertions and deletions. From an informational standpoint, an index [...] system uses indices at all and if it is to perform well in an environment with changing patterns of activity on the data bank, an ability to create and destroy indices from time to time will probably be necessary. The question then arises: Can application programs and terminal activities remain invariant as indices come and go?

Present formatted data systems take widely different approaches to indexing. TDMS [7] unconditionally provides indexing on all attributes. The presently released version of IMS [5] provides the user with a choice for each file: a choice between no indexing at all (the hierarchic sequential organization) or indexing on the primary key only (the hierarchic indexed sequential organization). In neither case is the user's application logic dependent on the existence of the unconditionally provided indices. IDS [8], however, permits the file designers to select attributes to be indexed and to incorporate indices into the file structure by means of additional chains. Application programs taking advantage of the performance benefit of these indexing chains must refer to those chains by name. Such programs do not operate correctly if these chains are later removed.

1.2.3. Access Path Dependence. Many of the existing formatted data systems provide users with tree-structured files or slightly more general network models of the data. Application programs developed to work with these systems tend to be logically impaired if the trees or networks are changed in structure. A simple example follows.

Suppose the data bank contains information about parts and projects. For each part, the part number, part name, part description, quantity-on-hand, and quantity-on-order are recorded. For each project, the project number, project name, project description are recorded. Whenever a project makes use of a certain part, the quantity of that part committed to the given project is also recorded. Suppose that the system requires the user or file designer to declare or define the data in terms of tree structures. Then, any one of the hierarchical structures may be adopted for the information mentioned above (see Structures 1-5).

[Structures 1-5: tree-structured file layouts over the PART and PROJECT fields listed above, including quantity committed]

Structure 1. Projects Subordinate to Parts

Structure 2. Parts Subordinate to Projects

Structure 3. Parts and Projects as Peers, Commitment Relationship Subordinate to Projects

Structure 4. Parts and Projects as Peers, Commitment Relationship Subordinate to Parts

Structure 5. Parts, Projects, and Commitment Relationship as Peers

Now, consider the problem of printing out the part number, part name, and quantity committed for every part used in the project whose project name is "alpha." The following observations may be made regardless of which available tree-oriented information system is selected to tackle this problem. If a program P is developed for this problem assuming one of the five structures above (that is, P makes no test to determine which structure is in effect), then P will fail on at least three of the remaining structures. More specifically, if P succeeds with structure 5, it will fail with all the others; if P succeeds with structure 3 or 4, it will fail with at least 1, 2, and 5; if P succeeds with 1 or 2, it will fail with at least 3, 4, and 5. The reason is simple in each case. In the absence of a test to determine which structure is in effect, P fails because an attempt is made to execute a reference to a nonexistent file (available systems treat this as an error) or no attempt is made to execute a reference to a file containing needed information. The reader who is not convinced should develop sample programs for this simple problem.

Since, in general, it is not practical to develop application programs which test for all tree structurings permitted by the system, these programs fail when a change in structure becomes necessary.

Systems which provide users with a network model of the data run into similar difficulties. In both the tree and network cases, the user (or his program) is required to exploit a collection of user access paths to the data. It does not matter whether these paths are in close correspondence with pointer-defined paths in the stored representation; in IDS the correspondence is extremely simple, in TDMS it is just the opposite. The consequence, regardless of the stored representation, is that terminal activities and programs become dependent on the continued existence of the user access paths.

One solution to this is to adopt the policy that once a user access path is defined it will not be made obsolete until all application programs using that path have become obsolete. Such a policy is not practical, because the number of access paths in the total model for the community of users of a data bank would eventually become excessively large.

1.3. A RELATIONAL VIEW OF DATA

The term relation is used here in its accepted mathematical sense. Given sets $S_1, S_2, \ldots, S_n$ (not necessarily distinct), R is a relation on these n sets if it is a set of n-tuples each of which has its first element from $S_1$, its second element from $S_2$, and so on.¹ We shall refer to $S_j$ as the jth domain of R. As defined above, R is said to have degree n. Relations of degree 1 are often called unary, degree 2 binary, degree 3 ternary, and degree n n-ary.

For expository reasons, we shall frequently make use of an array representation of relations, but it must be remembered that this particular representation is not an essential part of the relational view being expounded. An array which represents an n-ary relation R has the following properties:

(1) Each row represents an n-tuple of R.
(2) The ordering of rows is immaterial.
(3) All rows are distinct.
(4) The ordering of columns is significant: it corresponds to the ordering $S_1, S_2, \ldots, S_n$ of the domains on which R is defined (see, however, remarks below on domain-ordered and domain-unordered relations).
(5) The significance of each column is partially conveyed by labeling it with the name of the corresponding domain.
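As an illustration of properties (1) to (3), a relation can be sketched in Python as a set of equal-length tuples. This is only one possible modeling, not part of the paper: the rows are invented, and the helper name `degree` is an assumption made for the example.

```python
# A relation of degree n modeled as a frozenset of n-tuples.
# The sample rows are invented for illustration only.
supply_rows = [
    ("s1", "p2", "j5", 17),
    ("s1", "p3", "j5", 23),
    ("s1", "p2", "j5", 17),   # duplicate row: collapses inside a set
]
supply = frozenset(supply_rows)


def degree(relation):
    """Return n for a non-empty relation whose tuples all have length n."""
    lengths = {len(t) for t in relation}
    if len(lengths) != 1:
        raise ValueError("not a relation: tuples differ in length")
    return lengths.pop()


# Row order is immaterial: reordering the input gives an equal relation.
same = frozenset(reversed(supply_rows))
```

Using a set makes "all rows are distinct" and "ordering of rows is immaterial" hold by construction, while the tuple positions keep the column ordering significant.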

The example in Figure 1 illustrates a relation of degree 4, called supply, which reflects the shipments-in-progress of parts from specified suppliers to specified projects in specified quantities.

[Figure 1: the relation supply, with columns supplier, part, project, quantity]

One might ask: If the columns are labeled by the name of corresponding domains, why should the ordering of columns matter? As the example in Figure 2 shows, two columns may have identical headings (indicating identical domains) but possess distinct meanings with respect to the relation. The relation depicted is called component. It is a ternary relation, whose first two domains are called part and third domain is called quantity. The meaning of component (x, y, z) is that part x is an immediate component (or subassembly) of part y, and z units of part x are needed to assemble one unit of part y. It is a relation which plays a critical role in the parts explosion problem.

¹ More concisely, R is a subset of the Cartesian product $S_1 \times S_2 \times \cdots \times S_n$.

FIG. 2. A relation with two identical domains

It is a remarkable fact that several existing information systems (chiefly those based on tree-structured files) fail to provide data representations for relations which have two or more identical domains. The present version of IMS/360 [5] is an example of such a system.

The totality of data in a data bank may be viewed as a collection of time-varying relations. These relations are of assorted degrees. As time progresses, each n-ary relation may be subject to insertion of additional n-tuples, deletion of existing ones, and alteration of components of any of its existing n-tuples.


In many commercial, governmental, and scientific data banks, however, some of the relations are of quite high degree (a degree of 30 is not at all uncommon). Users should not normally be burdened with remembering the domain ordering of any relation (for example, the ordering supplier, then part, then project, then quantity in the relation supply). Accordingly, we propose that users deal, not with relations which are domain-ordered, but with relationships which are their domain-unordered counterparts.² To accomplish this, domains must be uniquely identifiable at least within any given relation, without using position. Thus, where there are two or more identical domains, we require in each case that the domain name be qualified by a distinctive role name, which serves to identify the role played by that domain in the given relation. For example, in the relation component of Figure 2, the first domain part might be qualified by the role name sub, and the second by super, so that users could deal with the relationship component and its domains (sub.part, super.part, quantity) without regard to any ordering between these domains.
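The step from a domain-ordered relation to its domain-unordered relationship can be sketched as follows. This is a minimal illustration under stated assumptions: the rows of component are invented, and `to_relationship` is a hypothetical helper, not something defined in the paper.

```python
# Domain-ordered relation: position carries the meaning.
# Invented rows in the order (part, part, quantity).
component = {(1, 5, 9), (2, 5, 7)}

# Role-qualified domain names, as suggested in the text.
NAMES = ("sub.part", "super.part", "quantity")


def to_relationship(relation, names):
    """Convert positional tuples to a set of name-to-value mappings."""
    return {frozenset(zip(names, row)) for row in relation}


rel_a = to_relationship(component, NAMES)

# Permuting the columns (and the names with them) yields the same relationship.
permuted = {(q, sup, sub) for (sub, sup, q) in component}
rel_b = to_relationship(permuted, ("quantity", "super.part", "sub.part"))
```

Because each tuple is keyed by name rather than by position, two differently ordered arrays compare equal, which is the sense in which users need not remember any domain ordering.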

To sum up, it is proposed that most users should interact with a relational model of the data consisting of a collection of time-varying relationships (rather than relations). Each user need not know more about any relationship than its name together with the names of its domains (role qualified whenever necessary). Even this information might be offered in menu style by the system (subject to security and privacy constraints) upon request by the user.

There are usually many alternative ways in which a relational model may be established for a data bank. In order to discuss a preferred way (or normal form), we must first introduce a few additional concepts (active domain, primary key, foreign key, nonsimple domain) and establish some links with terminology currently in use in information systems programming. In the remainder of this paper, we shall not bother to distinguish between relations and relationships except where it appears advantageous to be explicit.

Consider an example of a data bank which includes relations concerning parts, projects, and suppliers. One relation called part is defined on the following domains:

(1) part number
(2) part name
(3) part color
(4) part weight
(5) quantity on hand
(6) quantity on order

and possibly other domains as well. Each of these domains is, in effect, a pool of values, some or all of which may be represented in the data bank at any instant. While it is conceivable that, at some instant, all part colors are present, it is unlikely that all possible part weights, part names, and part numbers are. We shall call the set of values represented at some instant the active domain at that instant.

² In [...]

³ Naturally, as with any data put into and retrieved from a computer system, the user will normally make [...]

Normally, one domain (or combination of domains) of a given relation has values which uniquely identify each element (n-tuple) of that relation. Such a domain (or combination) is called a primary key. In the example above, part number would be a primary key, while part color would not be. A primary key is nonredundant if it is either a simple domain (not a combination) or a combination such that none of the participating simple domains is superfluous in uniquely identifying each element. A relation may possess more than one nonredundant primary key. This would be the case in the example if different parts were always given distinct names. Whenever a relation has two or more nonredundant primary keys, one of them is arbitrarily selected and called the primary key of that relation.
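The two definitions above can be checked mechanically against a single snapshot of a relation, as in the sketch below. The function names and the sample rows are assumptions for illustration. Note also that a key is a property intended to hold at every instant, so a check against one snapshot can only refute a candidate key, never establish it.

```python
from itertools import combinations


def is_key(relation, columns):
    """True if the given column positions uniquely identify each tuple."""
    seen = {tuple(row[c] for c in columns) for row in relation}
    return len(seen) == len(relation)


def is_nonredundant_key(relation, columns):
    """True if columns form a key and no proper subset of them does."""
    if not is_key(relation, columns):
        return False
    return not any(
        is_key(relation, subset)
        for size in range(1, len(columns))
        for subset in combinations(columns, size)
    )


# Invented snapshot of part: (part number, part name, part color).
part = {(1, "bolt", "red"), (2, "nut", "red"), (3, "cam", "blue")}
```

In this snapshot part number (column 0) and part name (column 1) each pass the test on their own, so the combination of the two is a key but a redundant one.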

A common requirement is for elements of a relation to cross-reference other elements of the same relation or elements of a different relation. Keys provide a user-oriented means (but not the only means) of expressing such cross-references. We shall call a domain (or domain combination) of relation R a foreign key if it is not the primary key of R but its elements are values of the primary key of some relation S (the possibility that S and R are identical is not excluded). In the relation supply of Figure 1, the combination of supplier, part, project is the primary key, while each of these three domains taken separately is a foreign key.
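A foreign key, as defined above, implies that every value appearing in the referencing columns also appears as a primary key value in the referenced relation. A minimal sketch of that check follows; the helper `references_hold` and all rows are hypothetical.

```python
def references_hold(r, r_cols, s, s_key_cols):
    """True if every value of r's columns appears as a key value in s."""
    key_values = {tuple(row[c] for c in s_key_cols) for row in s}
    return all(tuple(row[c] for c in r_cols) in key_values for row in r)


# Invented rows. In part, column 0 (part number) is the primary key.
part = {(1, "bolt"), (2, "nut"), (3, "cam")}

# In supply, column 1 holds part numbers and acts as a foreign key.
supply = {("s1", 1, "j5", 17), ("s2", 3, "j7", 9)}

# The same relation with one row that refers to a nonexistent part.
dangling = supply | {("s4", 8, "j1", 12)}
```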

In previous work there has been a strong tendency to treat the data in a data bank as consisting of two parts, one part consisting of entity descriptions (for example, descriptions of suppliers) and the other part consisting of relations between the various entities or types of entities (for example, the supply relation). This distinction is difficult to maintain when one may have foreign keys in any relation whatsoever. In the user's relational model there appears to be no advantage to making such a distinction (there may be some advantage, however, when one applies relational concepts to machine representations of the user's set of relationships).

So far, we have discussed examples of relations which are defined on simple domains, that is, domains whose elements are atomic (nondecomposable) values. Nonatomic values can be discussed within the relational framework. Thus, some domains may have relations as elements. These relations may, in turn, be defined on nonsimple domains, and so on. For example, one of the domains on which the relation employee is defined might be salary history. An element of the salary history domain is a binary relation defined on the domain date and the domain salary. The salary history domain is the set of all such binary relations. At any instant of time there are as many instances of the salary history relation in the data bank as there are employees. In contrast, there is only one instance of the employee relation.

The terms attribute and repeating group in present data base terminology are roughly analogous to simple domain and nonsimple domain, respectively. Much of the confusion in present terminology is due to failure to distinguish between type and instance (as in "record") and between components of a user model of the data on the one hand and their machine representation counterparts on the other hand (again, we cite "record" as an example).

1.4. NORMAL FORM

A relation whose domains are all simple can be represented in storage by a two-dimensional column-homogeneous array of the kind discussed above. Some more complicated data structure is necessary for a relation with one or more nonsimple domains. For this reason (and others to be cited below) the possibility of eliminating nonsimple domains appears worth investigating.⁴ There is, in fact, a very simple elimination procedure, which we shall call normalization.

Consider, for example, the collection of relations exhibited in Figure 3(a). Job history and children are nonsimple domains of the relation employee. Salary history is a nonsimple domain of the relation job history. The tree in Figure 3(a) shows just these interrelationships of the nonsimple domains.

employee (man#, name, birthdate, jobhistory, children)
jobhistory (jobdate, title, salaryhistory)
salaryhistory (salarydate, salary)
children (childname, birthyear)

FIG. 3(a). Unnormalized set

employee' (man#, name, birthdate)
jobhistory' (man#, jobdate, title)
salaryhistory' (man#, jobdate, salarydate, salary)
children' (man#, childname, birthyear)

FIG. 3(b). Normalized set

Normalization proceeds as follows. Starting with the relation at the top of the tree, take its primary key and expand each of the immediately subordinate relations by inserting this primary key domain or domain combination. The primary key of each expanded relation consists of the primary key before expansion augmented by the primary key copied down from the parent relation. Now, strike out from the parent relation all nonsimple domains, remove the top node of the tree, and repeat the same sequence of operations on each remaining subtree.

The result of normalizing the collection of relations in Figure 3(a) is the collection in Figure 3(b). The primary key of each relation is italicized to show how such keys are expanded by the normalization.
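The normalization procedure can be sketched as a short recursive function. This is an illustrative reading of the steps above, not the paper's own algorithm text: nonsimple domains are modeled as list-valued fields of nested dictionaries, the domain names are ASCII stand-ins (for example `man_no` for man#), the `KEYS` table and the employee data are assumptions invented for the example.

```python
# Primary key of each relation before expansion (assumed for the example).
KEYS = {
    "employee": ("man_no",),
    "jobhistory": ("jobdate",),
    "salaryhistory": ("salarydate",),
    "children": ("childname",),
}


def normalize(name, rows, inherited=None, out=None):
    """Flatten nested rows; a list-valued field is a nonsimple domain."""
    inherited = inherited or {}
    out = {} if out is None else out
    for row in rows:
        # Strike out nonsimple domains and insert the key copied from above.
        simple = {k: v for k, v in row.items() if not isinstance(v, list)}
        out.setdefault(name, []).append({**inherited, **simple})
        # Expanded primary key of this row, to be copied down to subtrees.
        key = {**inherited, **{k: row[k] for k in KEYS[name]}}
        for field, value in row.items():
            if isinstance(value, list):
                normalize(field, value, key, out)
    return out


employees = [{
    "man_no": 7, "name": "Ada", "birthdate": 1940,
    "jobhistory": [{
        "jobdate": 1965, "title": "clerk",
        "salaryhistory": [{"salarydate": 1965, "salary": 400},
                          {"salarydate": 1966, "salary": 450}],
    }],
    "children": [{"childname": "Bo", "birthyear": 1968}],
}]
tables = normalize("employee", employees)
```

Each resulting table contains only simple domains, and each subordinate table carries the keys of all its ancestors, matching the expanded keys described in the text.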

⁴ M. E. Sanko of IBM, San Jose, independently recognized the desirability of eliminating nonsimple domains.

If normalization as described above is to be applicable, the unnormalized collection of relations must satisfy the following conditions:

(1) The graph of interrelationships of the nonsimple domains is a collection of trees.
(2) No primary key has a component domain which is nonsimple.

The writer knows of no application which would require any relaxation of these conditions. Further operations of a normalizing kind are possible. These are not discussed in this paper.

The simplicity of the array representation which becomes feasible when all relations are cast in normal form is not only an advantage for storage purposes but also for communication of bulk data between systems which use widely different representations of the data. The communication form would be a suitably compressed version of the array representation and would have the following advantages:

(1) It would be devoid of pointers (address-valued or displacement-valued).
(2) It would avoid all dependence on hash addressing schemes.
(3) It would contain no indices or ordering lists.

If the user's relational model is set up in normal form, names of items of data in the data bank can take a simpler form than would otherwise be the case. A general name would take a form such as

R (g).r.d

where R is a relational name; g is a generation identifier (optional); r is a role name (optional); d is a domain name. Since g is needed only when several generations of a given relation exist, or are anticipated to exist, and r is needed only when the relation R has two or more domains named d, the simple form R.d will often be adequate.

1.5. SOME LINGUISTIC ASPECTS

The adoption of a relational model of data, as described above, permits the development of a universal data sublanguage based on an applied predicate calculus. A first-order predicate calculus suffices if the collection of relations is in normal form. Such a language would provide a yardstick of linguistic power for all other proposed data languages, and would itself be a strong candidate for embedding (with appropriate syntactic modification) in a variety of host languages (programming, command- or problem-oriented). While it is not the purpose of this paper to describe such a language in detail, its salient features would be as follows.

Let us denote the data sublanguage by R and the host language by H. R permits the declaration of relations and their domains. Each declaration of a relation identifies the primary key for that relation. Declared relations are added to the system catalog for use by any members of the user community who have appropriate authorization. H permits supporting declarations which indicate, perhaps less permanently, how these relations are represented in storage. R permits the specification for retrieval of any subset of data from the data bank. Action on such a retrieval request is subject to security constraints.

The universality of the data sublanguage lies in its descriptive ability (not its computing ability). In a large data bank each subset of the data has a very large number of possible (and sensible) descriptions, even when we assume (as we do) that there is only a finite set of function subroutines to which the system has access for use in qualifying data for retrieval. Thus, the class of qualification expressions which can be used in a set specification must have the descriptive power of the class of well-formed formulas of an applied predicate calculus. It is well known that to preserve this descriptive power it is unnecessary to express (in whatever syntax is chosen) every formula of the selected predicate calculus. For example, just those in [...]

Arithmetic functions may be needed in the qualification or other parts of retrieval statements. Such functions can be defined in H and invoked in R.

A set so specified may be fetched for query purposes only, or it may be held for possible changes. Insertions take the form of adding new elements to declared relations without regard to any ordering that may be present in their machine representation. Deletions which are effective for the community (as opposed to the individual user or subcommunities) take the form of removing elements from declared relations. Some deletions and updates may be triggered by others, if deletion and update dependencies between specified relations are declared in R.

One important effect that the view adopted toward data has on the language used to retrieve it is in the naming of data elements and sets. Some aspects of this have been discussed in the previous section. With the usual network view, users will often be burdened with coining and using more relation names than are absolutely necessary, since names are associated with paths (or path types) rather than with relations.

Once a user is aware that a certain relation is stored, he will expect to be able to exploit⁵ it using any combination of its arguments as "knowns" and the remaining arguments as "unknowns," because the information (like Everest) is there. This is a system feature (missing from many current information systems) which we shall call (logically) symmetric exploitation of relations. Naturally, symmetry in performance is not to be expected.

To support symmetric exploitation of a single binary relation, two directed paths are needed. For a relation of degree n, the number of paths to be named and controlled is n factorial.
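The n factorial count can be made concrete by enumerating one directed path per ordering of the domains. This is a sketch under the assumption that a path corresponds to an ordering of all n domains; the function name `access_paths` is hypothetical.

```python
from itertools import permutations
from math import factorial


def access_paths(domains):
    """Enumerate every ordering of the domains as one directed path."""
    return list(permutations(domains))


binary_paths = access_paths(("a", "b"))
supply_paths = access_paths(("supplier", "part", "project", "quantity"))
```

For the binary case this yields the two directed paths mentioned in the text, and for a relation of degree 4 it yields 24.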

Again, if a relational view is adopted in which every n-ary relation (n > 2) has to be expressed by the user as a nested expression involving only binary relations (see Feldman's LEAP System [10], for example) then 2n - 1 names have to be coined instead of only n + 1 with direct n-ary notation as described in Section 1.2. For example, the 4-ary relation supply of Figure 1, which entails 5 names in n-ary notation, would be represented in the form

P (supplier, Q (part, R (project, quantity)))

in nested binary notation and, thus, employ 7 names.

A further disadvantage of this kind of expression is its asymmetry. Although this asymmetry does not prohibit symmetric exploitation, it certainly makes some bases of interrogation very awkward for the user to express (consider, for example, a query for those parts and quantities related to certain given projects via Q and R).
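The name counts for the two notations can be reproduced with a small sketch. The helper names below are assumptions made for illustration; only the expression shape and the counts 2n - 1 and n + 1 come from the text.

```python
def nested_binary(relation_names, domains):
    """Build the nested binary expression, innermost relation last."""
    expr = f"{relation_names[-1]} ({domains[-2]}, {domains[-1]})"
    outer = zip(reversed(relation_names[:-1]), reversed(domains[:-2]))
    for rel, dom in outer:
        expr = f"{rel} ({dom}, {expr})"
    return expr


def names_nary(n):
    """One relation name plus n domain names."""
    return n + 1


def names_nested_binary(n):
    """n - 1 binary relation names plus n domain names."""
    return 2 * n - 1


expr = nested_binary(["P", "Q", "R"],
                     ["supplier", "part", "project", "quantity"])
```

The gap between the two counts is n - 2, so it grows with the degree of the relation.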

1.6. EXPRESSIBLE, NAMED, AND STORED RELATIONS

Associated with a data bank are two collections of relations: the named set and the expressible set. The named set is the collection of all those relations that the community of users can identify by means of a simple name (or identifier). A relation R acquires membership in the named set when a suitably authorized user declares R; it loses membership when a suitably authorized user cancels the declaration of R.

The expressible set is the total collection of relations that can be designated by expressions in the data language. Such expressions are constructed from simple names of relations in the named set; names of generations, roles and domains; logical connectives; the quantifiers of the predicate calculus;⁶ and certain constant relation symbols such as =, >. The named set is a subset of the expressible set, usually a very small subset.

Since some relations in the named set may be time-independent combinations of others in that set, it is useful to consider associating with the named set a collection of statements that define these time-independent constraints. We shall postpone further discussion of this until we have introduced several operations on relations (see Section 2).

One of the major problems confronting the designer of a data system which is to support a relational model for its users is that of determining the class of stored representations to be supported. Ideally, the variety of permitted data representations should be just adequate to cover the spectrum of performance requirements of the total collection of installations. Too great a variety leads to unnecessary overhead in storage and continual reinterpretation of descriptions for the structures currently in effect.

For any selected class of stored representations the data system must provide a means of translating user requests expressed in the data language of the relational model into corresponding (and efficient) actions on the current stored representation. For a high level data language this presents a challenging design problem. Nevertheless, it is a problem which must be solved: as more users obtain concurrent access to a large data bank, responsibility for providing efficient response and throughput shifts from the individual user to the data system.

⁶ Because each relation in a practical data bank is a finite set at [...]

2. Redundancy and Consistency

2.1. OPERATIONS ON RELATIONS

Since relations are sets, all of the usual set operations are applicable to them. Nevertheless, the result may not be a relation; for example, the union of a binary relation and a ternary relation is not a relation.
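The union example can be checked directly if relations are modeled as Python sets of tuples. This modeling and the helper `is_relation` are assumptions made for illustration.

```python
def is_relation(tuples):
    """A set of tuples is a relation only if all tuples share one length."""
    return len({len(t) for t in tuples}) <= 1


binary = {(1, 2), (3, 4)}
ternary = {(1, 2, 3)}

# An ordinary set union is well defined, but mixes 2-tuples and 3-tuples.
mixed = binary | ternary
```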

The operations discussed below are specifically for relations. These operations are introduced because of their key role in deriving relations from other relations. Their principal application is in noninferential information systems (systems which do not provide logical inference services), although their applicability is not necessarily destroyed when such services are added.

Most users would not be directly concerned with these operations. Information systems designers and people concerned with data bank control should, however, be thoroughly familiar with them.

2.1.1. Permutation. A binary relation has an array representation with two columns. Interchanging these columns yields the converse relation. More generally, if a permutation is applied to the columns of an n-ary relation, the resulting relation is said to be a permutation of the given relation. There are, for example, 4! = 24 permutations of the relation supply in Figure 1, if we include the identity permutation which leaves the ordering of columns unchanged.
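Permutation of columns is easy to sketch over sets of tuples. The function `permute`, its 0-based column positions, and the sample rows are assumptions for illustration.

```python
from itertools import permutations


def permute(relation, order):
    """Reorder the columns of every tuple; order lists 0-based positions."""
    return {tuple(row[i] for i in order) for row in relation}


# Converse of a binary relation: interchange its two columns.
r = {(1, "a"), (2, "b")}
converse = permute(r, (1, 0))

# Every column ordering of a 4-ary relation, identity included.
supply = {(1, 2, 5, 17), (1, 3, 5, 23)}   # invented rows
all_perms = [permute(supply, order) for order in permutations(range(4))]
```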

Since the user's relational model consists of a collection of relationships (domain-unordered relations), permutation is not relevant to such a model considered in isolation. It is, however, relevant to the consideration of stored representations of the model. In a system which provides symmetric exploitation of relations, the set of queries answerable by a stored relation is identical to the set answerable by any permutation of that relation. Although it is logically unnecessary to store both a relation and some permutation of it, performance considerations could make it advisable.

2.1.2. Projection. Suppose now we select certain columns of a relation (striking out the others) and then remove from the resulting array any duplication in the rows. The final array represents a relation which is said to be a projection of the given relation.

A selection operator $\pi$ is used to obtain any desired permutation, projection, or combination of the two operations. Thus, if L is a list of k indices $L = i_1, i_2, \ldots, i_k$ and R is an n-ary relation ($n \geq k$), then $\pi_L(R)$ is the k-ary relation whose jth column is column $i_j$ of R ($j = 1, 2, \ldots, k$), except that duplication in resulting rows is removed. Consider the relation supply of Figure 1. A permuted projection of this relation is exhibited in Figure 4. Note that, in this particular case, the projection has fewer n-tuples than the relation from which it is derived.
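The selection operator can be sketched in a few lines of Python, under the assumption that a relation is modelled as a set of equal-length tuples and that column indices are 1-based as in the text. The function name `project` and the sample rows of `supply` are illustrative choices, not part of the paper's notation.

```python
from itertools import permutations

def project(relation, indices):
    """Return pi_L(relation): column j of the result is column indices[j-1]
    of the input (1-based). Duplicate rows vanish because the result is a set."""
    return {tuple(row[i - 1] for i in indices) for row in relation}

# Hypothetical 4-ary relation with columns (supplier, part, project, quantity).
supply = {
    (1, 2, 5, 17),
    (1, 3, 5, 23),
    (2, 3, 7, 9),
    (2, 7, 5, 4),
    (4, 1, 1, 12),
}

# Permuted projection onto (project, supplier): columns 3 and 1, in that order.
project_supplier = project(supply, [3, 1])

# Every ordering of all four columns gives one permutation of the relation.
column_orders = list(permutations([1, 2, 3, 4]))
```

With this sample data the permuted projection has four rows although `supply` has five, because two rows of `supply` agree on project and supplier.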

2.1.3. Join. Suppose we are given two binary relations, which have some domain in common. Under what circumstances can we combine these relations to form a ternary relation which preserves all of the information in the given relations?

The example in Figure 5 shows two relations R, S, which are joinable without loss of information, while Figure 6 shows a join of R with S. A binary relation R is joinable with a binary relation S if there exists a ternary relation U such that $\pi_{12}(U) = R$ and $\pi_{23}(U) = S$. Any such ternary relation is called a join of R with S. If R, S are binary relations such that $\pi_2(R) = \pi_1(S)$, then R is joinable with S. One join that always exists in such a case is the natural join of R with S defined by

$$R * S = \{(a, b, c) : R(a, b) \wedge S(b, c)\}$$

where $R(a, b)$ has the value true if $(a, b)$ is a member of R and similarly for $S(b, c)$. It is immediate that

$$\pi_{12}(R * S) = R \quad \text{and} \quad \pi_{23}(R * S) = S.$$

Note that the join shown in Figure 6 is the natural join of R with S from Figure 5. Another join is shown in Figure 7.
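As a minimal sketch of the natural join of two binary relations, the following Python fragment again treats relations as sets of tuples. The helper names and the sample values of `R` and `S` are assumptions made for illustration; they are chosen so that the second domain of `R` and the first domain of `S` coincide.

```python
def project(relation, indices):
    """pi_L with 1-based column indices; duplicates removed by the set."""
    return {tuple(row[i - 1] for i in indices) for row in relation}

def natural_join(r, s):
    """R*S = {(a, b, c) : R(a, b) and S(b, c)} for binary relations r, s."""
    return {(a, b, c) for (a, b) in r for (b2, c) in s if b == b2}

R = {(1, 1), (2, 1), (2, 2)}   # illustrative (supplier, part) pairs
S = {(1, 1), (1, 2), (2, 1)}   # illustrative (part, project) pairs

# Sufficient condition for joinability stated in the text: pi_2(R) = pi_1(S).
joinable = project(R, [2]) == project(S, [1])

U = natural_join(R, S)
```

Projecting `U` back onto columns (1, 2) and (2, 3) recovers `R` and `S`, which is what makes `U` a join in the sense defined above.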

FIG. 4. A permuted projection of the relation in Figure 1

FIG. 7. Another join of R with S (from Figure 5)

Inspection of these relations reveals an element (element 1) of the domain part (the domain on which the join is to be made) with the property that it possesses more than one relative under R and also under S. It is this element which gives rise to the plurality of joins. Such an element in the joining domain is called a point of ambiguity with respect to the joining of R with S.

If either $\pi_{21}(R)$ or S is a function, no point of ambiguity can occur in joining R with S. In such a case, the natural join of R with S is the only join of R with S. Note that the reiterated qualification "of R with S" is necessary, because S might be joinable with R (as well as R with S), and this join would be an entirely separate consideration. In Figure 5, none of the relations R, $\pi_{21}(R)$, S, $\pi_{21}(S)$ is a function.
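The link between points of ambiguity and the plurality of joins can be explored by brute force on small inputs. The sketch below relies on the observation that every join of R with S is a subset of the natural join (each of its rows must satisfy both R and S), so it simply tries all subsets. The function names and sample data are assumptions for illustration, and the enumeration is exponential, so it is only suitable for toy relations.

```python
from itertools import combinations

def project(relation, indices):
    return {tuple(row[i - 1] for i in indices) for row in relation}

def natural_join(r, s):
    return {(a, b, c) for (a, b) in r for (b2, c) in s if b == b2}

def points_of_ambiguity(r, s):
    """Elements of the joining domain that have more than one relative
    under r and also more than one relative under s."""
    points = set()
    for b in {b for (_, b) in r}:
        left = {a for (a, b2) in r if b2 == b}
        right = {c for (b2, c) in s if b2 == b}
        if len(left) > 1 and len(right) > 1:
            points.add(b)
    return points

def all_joins(r, s):
    """Every ternary U with pi_12(U) = r and pi_23(U) = s (brute force)."""
    candidates = sorted(natural_join(r, s))
    joins = []
    for size in range(len(candidates) + 1):
        for subset in combinations(candidates, size):
            u = set(subset)
            if project(u, [1, 2]) == r and project(u, [2, 3]) == s:
                joins.append(u)
    return joins

R = {(1, 1), (2, 1), (2, 2)}   # illustrative (supplier, part)
S = {(1, 1), (1, 2), (2, 1)}   # illustrative (part, project)
R_fn = {(1, 1), (2, 2)}        # here pi_21(R_fn) is a function
```

For the sample `R` and `S`, part 1 is the only point of ambiguity and several joins exist; replacing `R` by `R_fn` removes the ambiguity and leaves the natural join as the only join.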

Ambiguity in the joining of R with S can sometimes be resolved by means of other relations. Suppose we are given, or can derive from sources independent of R and S, a relation T on the domains project and supplier with the following properties:

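(1) $\pi_1(T) = \pi_2(S)$,

(2) $\pi_2(T) = \pi_1(R)$,

(3) $T(j, s) \rightarrow \exists p\,(R(s, p) \wedge S(p, j))$,

(4) $R(s, p) \rightarrow \exists j\,(S(p, j) \wedge T(j, s))$,

(5) $S(p, j) \rightarrow \exists s\,(T(j, s) \wedge R(s, p))$,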

then we may form a three-way join of R, S, T; that is, a ternary relation U such that

$$\pi_{12}(U) = R, \quad \pi_{23}(U) = S, \quad \pi_{31}(U) = T.$$

Such a join will be called a cyclic 3-join to distinguish it from a linear 3-join which would be a quaternary relation V such that

$$\pi_{12}(V) = R, \quad \pi_{23}(V) = S, \quad \pi_{34}(V) = T.$$

While it is possible for more than one cyclic 3-join to exist (see Figures 8, 9, for an example), the circumstances under which this can occur entail much more severe constraints than those for a plurality of 2-joins. To be specific, the relations R, S, T must possess points of ambiguity with respect to joining R with S (say point x), S with T (say y), and T with R (say z), and, furthermore, y must be a relative of x under S, z a relative of y under T, and x a relative of z under R. Note that in Figure 8 the points x = a, y = d, z = 2 have this property.

FIG. 8. Binary relations with a plurality of cyclic 3-joins

FIG. 9. Two cyclic 3-joins of the relations in Figure 8

The natural linear 3-join of three binary relations R, S, T is given by

$$R * S * T = \{(a, b, c, d) : R(a, b) \wedge S(b, c) \wedge T(c, d)\}$$

where parentheses are not needed on the left-hand side because the natural 2-join (*) is associative. To obtain the cyclic counterpart, we introduce the operator $\gamma$ which produces a relation of degree n − 1 from a relation of degree n by tying its ends together. Thus, if R is an n-ary relation ($n \geq 2$), the tie of R is defined by the equation

$$\gamma(R) = \{(a_1, a_2, \ldots, a_{n-1}) : R(a_1, a_2, \ldots, a_{n-1}, a_n) \wedge a_1 = a_n\}.$$

We may now represent the natural cyclic 3-join of R, S, T by the expression

$$\gamma(R * S * T).$$
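A small Python sketch may make the linear and cyclic natural 3-joins concrete. It assumes relations are sets of tuples, reads "tying its ends together" as keeping the rows whose first and last entries agree and then dropping the last column, and uses made-up sample data; the names `natural_join` and `tie` are illustrative.

```python
def project(relation, indices):
    return {tuple(row[i - 1] for i in indices) for row in relation}

def natural_join(r, s):
    """Join on the last column of r and the first column of s."""
    return {row_r + row_s[1:] for row_r in r for row_s in s
            if row_r[-1] == row_s[0]}

def tie(relation):
    """Keep rows whose first and last entries agree, then drop the last
    column, lowering the degree by one."""
    return {row[:-1] for row in relation if row[0] == row[-1]}

R = {(1, "a"), (2, "a")}              # illustrative (supplier, part)
S = {("a", "d"), ("a", "e")}          # illustrative (part, project)
T = {("d", 1), ("d", 2), ("e", 2)}    # illustrative (project, supplier)

linear = natural_join(natural_join(R, S), T)   # quaternary relation
cyclic = tie(linear)                            # ternary relation
```

Grouping the joins the other way, `natural_join(R, natural_join(S, T))`, gives the same quaternary relation, in line with the associativity noted above. For this data the three pairwise projections of `cyclic` recover `R`, `S`, and `T`.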

Extension of the notions of linear and cyclic 3-join and their natural counterparts to the joining of n binary relations (where $n \geq 3$) is obvious. A few words may be appropriate, however, regarding the joining of relations which are not necessarily binary. Consider the case of two relations R (degree r), S (degree s) which are to be joined on p of their domains (p < r, p < s). For simplicity, suppose these p domains are the last p of the r domains of R, and the first p of the s domains of S. If this were not so, we could always apply appropriate permutations to make it so. Now, take the Cartesian product of the first r − p domains of R, and call this new domain A. Take the Cartesian product of the last p domains of R, and call this B. Take the Cartesian product of the last s − p domains of S and call this C.

We can treat R as if it were a binary relation on the domains A, B. Similarly, we can treat S as if it were a binary relation on the domains B, C. The notions of linear and cyclic 3-join are now directly applicable. A similar approach can be taken with the linear and cyclic n-joins of n relations of assorted degrees.
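The reduction to the binary case can be sketched as follows: each row of R is split into a pair (A, B) and each row of S into (B, C), and the natural join is taken on the shared B part. The function name `join_on` and the sample relations are hypothetical, and the sketch assumes the joining domains are already the last p columns of R and the first p columns of S.

```python
def join_on(r, s, p):
    """Natural join of r with s on the last p columns of r and the
    first p columns of s; the result has degree (deg r) + (deg s) - p."""
    result = set()
    for row_r in r:
        a, b = row_r[:-p], row_r[-p:]          # split row of r into (A, B)
        for row_s in s:
            if row_s[:p] == b:                 # B parts must agree
                result.add(a + b + row_s[p:])  # append the C part of s
    return result

# Hypothetical relations of degree 3 each, joined on p = 2 domains.
R = {(1, "x", 10), (2, "x", 10), (3, "y", 20)}
S = {("x", 10, "red"), ("y", 20, "blue"), ("y", 30, "green")}

U = join_on(R, S, 2)
```

The row `("y", 30, "green")` of `S` has no partner in `R`, so it contributes nothing to `U`.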

2.1.4. Composition. The reader is probably familiar with the notion of composition applied to functions. We shall discuss a generalization of that concept and apply it first to binary relations. Our definitions of composition and composability are based very directly on the definitions of join and joinability given above.

Suppose we are given two relations R, S. T is a composition of R with S if there exists a join U of R with S such that $T = \pi_{13}(U)$. Thus, two relations are composable if and only if they are joinable. However, the existence of more than one join of R with S does not imply the existence of more than one composition of R with S.

Corresponding to the natural join of R with S is the natural composition of R with S defined by

$$R \cdot S = \pi_{13}(R * S).$$

Taking the relations R, S from Figure 5, their natural composition is exhibited in Figure 10 and another composition is exhibited in Figure 11 (derived from the join exhibited in Figure 7).
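Composition as a projection of a join can be sketched in Python as below. The relations are illustrative stand-ins (sets of tuples with made-up values), and `other_join` is one non-natural join of the same pair, included to show that a different join can yield a different composition.

```python
def project(relation, indices):
    return {tuple(row[i - 1] for i in indices) for row in relation}

def natural_join(r, s):
    return {(a, b, c) for (a, b) in r for (b2, c) in s if b == b2}

def natural_composition(r, s):
    """Project the natural join of r with s onto its first and third columns."""
    return project(natural_join(r, s), [1, 3])

R = {(1, 1), (2, 1), (2, 2)}   # illustrative (supplier, part)
S = {(1, 1), (1, 2), (2, 1)}   # illustrative (part, project)

natural = natural_composition(R, S)

# A non-natural join of R with S: its projections still recover R and S.
other_join = {(1, 1, 2), (2, 1, 1), (2, 2, 1)}
is_join = project(other_join, [1, 2]) == R and project(other_join, [2, 3]) == S
other = project(other_join, [1, 3])
```

Here the natural composition contains four pairs while the composition derived from `other_join` contains only two.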

FIG. 10. The natural composition of R with S (from Figure 5)

FIG. 11. Another composition of R with S (from Figure 5)

When two or more joins exist, the number of distinct compositions may be as few as one or as many as the number of distinct joins. Figure 12 shows an example of two relations which have several joins but only one composition. Note that the ambiguity of point c is lost in composing R with S, because of unambiguous associations made via the points a, b, d, e.

Extension of composition to pairs of relations which are not necessarily binary (and which may be of different degrees) follows the same pattern as extension of pairwise joining to such relations.

A lack of understanding of relational composition has led several systems designers into what may be called the connection trap. This trap may be described in terms of the following example. Suppose each supplier description is linked by pointers to the descriptions of each part supplied by that supplier, and each part description is similarly linked to the descriptions of each project which uses that part. A conclusion is now drawn which is, in general, erroneous: namely that, if all possible paths are followed from a given supplier via the parts he supplies to the projects using those parts, one will obtain a valid set of all projects supplied by that supplier. Such a conclusion is correct only in the very special case that the target relation between projects and suppliers is, in fact, the natural composition of the other two relations, and we must normally add the phrase "for all time," because this is usually implied in claims concerning path-following techniques.
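The connection trap can be illustrated with a toy data set. In the sketch below all names and values are hypothetical: two suppliers supply the same part, two projects use that part, and the true supplier-project relation pairs each supplier with only one project. Following every path supplier → part → project computes the natural composition, which overstates the truth.

```python
def natural_composition(r, s):
    """Pairs (a, c) connected through some shared middle element b."""
    return {(a, c) for (a, b) in r for (b2, c) in s if b == b2}

supplies = {(1, "bolt"), (2, "bolt")}            # (supplier, part)
uses = {("bolt", "P1"), ("bolt", "P2")}          # (part, project)
actually_supplies = {(1, "P1"), (2, "P2")}       # true (supplier, project)

# Path following yields every supplier paired with every project.
paths = natural_composition(supplies, uses)

# Pairs reachable by path following that are not in the true relation.
spurious = paths - actually_supplies
```

The true relation is still a composition of `supplies` with `uses` in the sense defined above, just not the natural one, which is why path following gives a wrong answer here.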

the composition; see, for example, Kelley's "General Topology."

2.1.5. Restriction. A subset of a relation is a relation. One way in which a relation S may act on a relation R to generate a subset of R is through the operation restriction of R by S. This operation is a generalization of the restriction of a function to a subset of its domain, and is defined as follows.

Let L, M be equal-length lists of indices such that $L = i_1, i_2, \ldots, i_k$, $M = j_1, j_2, \ldots, j_k$ where $k \leq$ degree of R and $k \leq$ degree of S. Then the L, M restriction of R by S, denoted $R_L|_M S$, is the maximal subset R' of R such that

$$\pi_L(R') = \pi_M(S).$$

The operation is defined only if equality is applicable between elements of $\pi_{i_h}(R)$ on the one hand and $\pi_{j_h}(S)$ on the other for all $h = 1, 2, \ldots, k$.

The three relations R, S, R' of Figure 13 satisfy the equation $R' = R_{(2,3)}|_{(1,2)} S$.
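One way to sketch restriction in Python is to keep exactly those rows of R whose L-columns appear among the M-columns of S. This reading is an assumption of the sketch: it lets S contain rows that match nothing in R, which are then simply ignored. The function names and sample data are illustrative.

```python
def project(relation, indices):
    return {tuple(row[i - 1] for i in indices) for row in relation}

def restrict(r, l_indices, s, m_indices):
    """Rows of r whose columns l_indices match columns m_indices of some
    row of s (1-based indices, lists of equal length)."""
    allowed = project(s, m_indices)
    return {row for row in r
            if tuple(row[i - 1] for i in l_indices) in allowed}

# Illustrative relations: R has degree 3, S has degree 2.
R = {(1, "a", "A"), (2, "a", "A"), (2, "a", "B"), (2, "b", "A"), (2, "b", "B")}
S = {("a", "A"), ("c", "B"), ("b", "B")}

# Restrict R on its columns (2, 3) by columns (1, 2) of S.
R_prime = restrict(R, [2, 3], S, [1, 2])
```

The row `("c", "B")` of `S` matches no row of `R` and therefore has no effect on the result.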

FIG. 13. Example of restriction

We are now in a position to consider various applications of these operations on relations.

2.2. REDUNDANCY

Redundancy in the named set of relations must be distinguished from redundancy in the stored set of representations. We are primarily concerned here with the former. To begin with, we need a precise notion of derivability for relations.

Suppose $\theta$ is a collection of operations on relations and each operation has the property that from its operands it yields a unique relation (thus natural join is eligible, but join is not). A relation R is $\theta$-derivable from a set S of relations if there exists a sequence of operations from the collection $\theta$ which, for all time, yields R from members of S. The phrase "for all time" is present, because we are dealing with time-varying relations, and our interest is in derivability which holds over a significant period of time. For the named set of relationships in noninferential systems, it appears that an adequate collection $\theta_1$ contains the following operations: projection, natural join, tie, and restriction. Permutation is irrelevant and natural composition need not be included, because it is obtainable by taking a natural join and then a projection. For the stored set of representations, an adequate collection $\theta_2$ of operations would include permutation and additional operations concerned with subsetting and merging relations, and ordering and connecting their elements.

2.2.1. Strong Redundancy. A set of relations is strongly redundant if it contains at least one relation that possesses a projection which is derivable from other projections of relations in the set. The following two examples are intended to explain why strong redundancy is defined this way, and to demonstrate its practical use. In the first example the collection of relations consists of just the following relation:

employee (serial#, name, manager#, managername)

with serial# as the primary key and manager# as a foreign key. Let us denote the active domain by $\Delta_t$, and suppose that

$$\Delta_t(\text{manager\#}) \subset \Delta_t(\text{serial\#}) \quad \text{and} \quad \Delta_t(\text{managername}) \subset \Delta_t(\text{name})$$

for all time t. In this case the redundancy is obvious: the domain managername is unnecessary. To see that it is a strong redundancy as defined above, we observe that

$$\pi_{34}(\text{employee}) = \pi_{12}(\text{employee})_1|_1\, \pi_3(\text{employee}).$$

In the second example the collection of relations includes a relation S describing suppliers with primary key s#, a relation D describing departments with primary key d#, a relation J describing projects with primary key j#, and the following relations:

P (s#, d#, ...), Q (s#, j#, ...), R (d#, j#, ...),

where in each case ... denotes domains other than s#, d#, j#. Let us suppose the following condition C is known to hold independent of time: supplier s supplies department d (relation P) if and only if supplier s supplies some project j (relation Q) to which d is assigned (relation R). Then, we can write the equation

$$\pi_{12}(P) = \pi_{12}(Q) \cdot \pi_{21}(R)$$

and thereby exhibit a strong redundancy.

An important reason for the existence of strong redundancies in the named set of relationships is user convenience. A particular case of this is the retention of semiobsolete relationships in the named set so that old programs that refer to them by name can continue to run correctly. Knowledge of the existence of strong redundancies in the named set enables a system or data base administrator greater freedom in the selection of stored representations to cope more efficiently with current traffic. If the strong redundancies in the named set are directly reflected in strong redundancies in the stored set (or if other strong redundancies are introduced into the stored set), then, generally speaking, extra storage space and update time are consumed with a potential drop in query time for some queries and in load on the central processing units.
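The first example of strong redundancy (the employee relation whose manager-name column can be derived from the rest) can be checked on sample data. The sketch below assumes relations are sets of tuples, reads restriction as keeping the rows whose selected columns occur in the other relation, and uses invented employee rows in which every manager is also an employee.

```python
def project(relation, indices):
    return {tuple(row[i - 1] for i in indices) for row in relation}

def restrict(r, l_indices, s, m_indices):
    """Rows of r whose columns l_indices match columns m_indices of s."""
    allowed = project(s, m_indices)
    return {row for row in r
            if tuple(row[i - 1] for i in l_indices) in allowed}

# Hypothetical rows: (serial#, name, manager#, managername).
employee = {
    (1, "Jones", 1, "Jones"),
    (2, "Smith", 1, "Jones"),
    (3, "Lee", 2, "Smith"),
}

# Left side: the (manager#, managername) projection.
lhs = project(employee, [3, 4])

# Right side: (serial#, name) pairs restricted to serial numbers
# that occur as some employee's manager#.
rhs = restrict(project(employee, [1, 2]), [1], project(employee, [3]), [1])
```

Both sides agree on this data, so the pair of manager columns carries no information beyond what the serial number, name, and manager number already provide.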

2.2.2. Weak Redundancy. A second type of redundancy may exist. In contrast to strong redundancy it is not characterized by an equation. A collection of relations is weakly redundant if it contains a relation that has a projection which is not derivable from other members but is at all times a projection of some join of other projections of relations in the collection.

We can exhibit a weak redundancy by taking the second example (cited above) for a strong redundancy, and assuming now that condition C does not hold at all times. The relations $\pi_{12}(P)$, $\pi_{12}(Q)$, $\pi_{12}(R)$ are complex relations with the possibility of points of ambiguity occurring from time to time in the potential joining of any two. Under these circumstances, none of them is derivable from the other two. However, constraints do exist between them, since each is a projection of some cyclic join of the three of them. One of the weak redundancies can be characterized by the statement: for all time, $\pi_{12}(P)$ is some composition of $\pi_{12}(Q)$ with $\pi_{21}(R)$. The composition in question might be the natural one at some instant and a nonnatural one at another instant.

Generally speaking, weak redundancies are inherent in the logical needs of the community of users. They are not removable by the system or data base administrator. If they appear at all, they appear in both the named set and the stored set of representations.

2.3. CONSISTENCY

Whenever the named set of relations is redundant in either sense, we shall associate with that set a collection of statements which define all of the redundancies which hold independent of time between the member relations. If the information system lacks (and it most probably will) detailed semantic information about each named relation, it cannot deduce the redundancies applicable to the named set. It might, over a period of time, make attempts to induce the redundancies, but such attempts would be fallible.

Given a collection C of time-varying relations, an associated set Z of constraint statements and an instantaneous value V for C, we shall call the state (C, Z, V) consistent or inconsistent according as V does or does not satisfy Z. For example, given stored relations R, S, T together with the constraint statement "$\pi_{12}(T)$ is a composition of $\pi_{12}(R)$ with $\pi_{12}(S)$", we may check from time to time that the values stored for R, S, T satisfy this constraint. An algorithm for making this check would examine the first two columns of each of R, S, T (in whatever way they are represented in the system) and determine whether

(1) $\pi_1(T) = \pi_1(R)$,

(2) $\pi_2(T) = \pi_2(S)$,

(3) for every element pair $(a, c)$ in the relation $\pi_{12}(T)$ there is an element b such that $(a, b)$ is in $\pi_{12}(R)$ and $(b, c)$ is in $\pi_{12}(S)$.
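The three checks listed above translate directly into code. The following Python sketch implements exactly those checks on the first two columns of each relation; the function name, the set-of-tuples representation, and the sample values are assumptions for illustration.

```python
def project(relation, indices):
    return {tuple(row[i - 1] for i in indices) for row in relation}

def satisfies_composition_constraint(r, s, t):
    """Apply checks (1), (2), (3) to the first two columns of r, s, t."""
    r12, s12, t12 = project(r, [1, 2]), project(s, [1, 2]), project(t, [1, 2])
    # (1) first columns of T and R must coincide
    if project(t12, [1]) != project(r12, [1]):
        return False
    # (2) second columns of T and S must coincide
    if project(t12, [2]) != project(s12, [2]):
        return False
    # (3) every pair (a, c) of T needs a connecting element b
    middle = {b for (_, b) in r12}
    return all(
        any((a, b) in r12 and (b, c) in s12 for b in middle)
        for (a, c) in t12
    )

R = {(1, 1, "x"), (2, 2, "y")}   # third column is ignored by the check
S = {(1, 1), (2, 2)}
T_ok = {(1, 1), (2, 2)}
T_bad = {(1, 2), (2, 1)}         # passes (1) and (2) but fails (3)
T_short = {(1, 1)}               # fails (1)
```

Only the first two columns are inspected, so extra columns such as the third column of `R` do not affect the outcome.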

There are practical problems (which we shall not discuss here) in taking an instantaneous snapshot of a collection of relations, some of which may be very large and highly variable.

It is important to note that consistency as defined above is a property of the instantaneous state of a data bank, and is independent of how that state came about. Thus, in particular, there is no distinction made on the basis of whether a user generated an inconsistency due to an act of omission or an act of commission. Examination of a simple example will show the reasonableness of this (possibly unconventional) approach to consistency.

Suppose the named set C includes the relations S, J, D, P, Q, R of the example in Section 2.2 and that P, Q, R possess either the strong or weak redundancies described therein (in the particular case now under consideration, it does not matter which kind of redundancy occurs). Further, suppose that at some time t the data bank state is consistent and contains no project j such that supplier 2 supplies project j and j is assigned to department 5. Accordingly, there is no element (2, 5) in $\pi_{12}(P)$. Now, a user introduces the element (2, 5) into $\pi_{12}(P)$ by inserting some appropriate element into P. The data bank state is now inconsistent. The inconsistency could have arisen from an act of omission, if the input (2, 5) is correct, and there does exist a project j such that supplier 2 supplies j and j is assigned to department 5. In this case, it is very likely that the user intends in the near future to insert elements into Q and R which will have the effect of introducing (2, j) into $\pi_{12}(Q)$ and (5, j) into $\pi_{12}(R)$. On the other hand, the input (2, 5) might have been faulty. It could be the case that the user intended to insert some other element into P, namely an element whose insertion would transform a consistent state into a consistent state. The point is that the system will normally have no way of resolving this question without interrogating its environment (perhaps the user who created the inconsistency).

There are, of course, several possible ways in which a system can detect inconsistencies and respond to them. In one approach the system checks for possible inconsistency whenever an insertion, deletion, or key update occurs. Naturally, such checking will slow these operations down. If an inconsistency has been generated, details are logged internally, and if it is not remedied within some reasonable time interval, either the user or someone responsible for the security and integrity of the data is notified. Another approach is to conduct consistency checking as a batch operation once a day or less frequently. Inputs causing the inconsistencies which remain in the data bank state at checking time can be tracked down if the system maintains a journal of all state-changing transactions. This latter approach would certainly be superior if few nontransitory inconsistencies occurred.
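The first approach (checking on every insertion and logging any inconsistency) can be sketched with a toy class. Everything here is hypothetical: the class `DataBank`, its attribute names, the restriction to binary relations P, Q, R, and the choice of the strong form of the constraint (P equals the natural composition of Q with the converse of R). The journal it keeps is the kind of record a batch checker could also use to trace offending inputs.

```python
class DataBank:
    """Toy data bank holding binary relations P(s, d), Q(s, j), R(d, j)."""

    def __init__(self, p, q, r):
        self.relations = {"P": set(p), "Q": set(q), "R": set(r)}
        self.journal = []            # every state-changing transaction
        self.inconsistency_log = []  # journal positions leaving an inconsistent state

    def is_consistent(self):
        p, q, r = self.relations["P"], self.relations["Q"], self.relations["R"]
        # Natural composition of Q with the converse of R: pairs (s, d)
        # that share some project j.
        composed = {(s, d) for (s, j) in q for (d, j2) in r if j == j2}
        return p == composed

    def insert(self, name, row):
        """Insert a row, journal it, and log the step if the state is inconsistent."""
        self.relations[name].add(row)
        self.journal.append(("insert", name, row))
        consistent = self.is_consistent()
        if not consistent:
            self.inconsistency_log.append(len(self.journal) - 1)
        return consistent

bank = DataBank(p={(1, 3)}, q={(1, "j1")}, r={(3, "j1")})
states = [
    bank.insert("P", (2, 5)),      # supplier 2 supplies department 5
    bank.insert("Q", (2, "j2")),   # supplier 2 supplies project j2
    bank.insert("R", (5, "j2")),   # project j2 is assigned to department 5
]
```

The first two insertions leave the state inconsistent and are logged; the third restores consistency, mirroring the act-of-omission scenario described earlier in this section.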

2.4. SUMMARY

In Section 1 a relational model of data is proposed as a basis for protecting users of formatted data systems from the potentially disruptive changes in data representation caused by growth in the data bank and changes in traffic. A normal form for the time-varying collection of relationships is introduced.

In Section 2 operations on relations and two types of redundancy are defined and applied to the problem of maintaining the data in a consistent state. This is bound to become a serious practical problem as more and more different types of data are integrated together into common data banks.

Many questions are raised and left unanswered. For example, only a few of the more important properties of the data sublanguage in Section 1.4 are mentioned. Neither the purely linguistic details of such a language nor the implementation problems are discussed. Nevertheless, the material presented should be adequate for experienced systems programmers to visualize several approaches. It is also hoped that this paper can contribute to greater precision in work on formatted data systems.

Acknowledgment. It was C. T. Davies of IBM Poughkeepsie who convinced the author of the need for data independence in future information systems. The author wishes to thank him and also F. P. Palermo, C. P. Wang, E. B. Altman, and M. E. Senko of the IBM San Jose Research Laboratory for helpful discussions.

RECEIVED SEPTEMBER, 1969; REVISED FEBRUARY, 1970

