Programming Pearls - Self-Describing Data


  • 8/3/2019 Programming Pearls - Self-Describing Data


    by Jon Bentley

    programming pearls

    SELF-DESCRIBING DATA

    You just spent three CPU hours running a simulation to forecast your company's financial future, and your boss asks you to interpret the output:

        Scenario 1: 3.2% -12.0% 1.1%
        Scenario 2: 12.7% 0.0% 8.6%
        Scenario 3: 1.6% -8.3% 9.2%

    Hmmm. You dig through the program to find the meaning of each output variable. Good news: Scenario 2 paints a rosy picture for the next fiscal year. Now all you have to do is uncover the assumptions of each. Oops: the disaster in Scenario 1 is your company's current strategy, doomed to failure. What did Scenario 2 do that was so effective? Back to the program, trying to discover which input files each one reads. . . .

    Every programmer knows the frustration of trying to decipher mysterious data. The first two sections of this column discuss two techniques for embedding descriptions in data files. The third section then applies both methods to a concrete problem.

    © 1987 ACM 0001-0782/87/0600-0479 75¢

    Name-Value Pairs

    Many document production systems support bibliographic references in a form something like:

        %title The Art of Computer Programming,
            Volume 3: Sorting and Searching
        %author D. E. Knuth
        %pub Addison-Wesley
        %city Reading, Mass.
        %year 1973

        %author A. V. Aho
        %author M. J. Corasick
        %title Efficient string matching:
            an aid to bibliographic search
        %journal Communications of the ACM
        %volume 18
        %number 6
        %month June
        %year 1975
        %pages 333-340

    Blank lines separate entries in the file. A line that begins with a percent sign contains an identifying term followed by arbitrary text. Text may be continued on subsequent lines that do not start with a percent sign.

    The lines in the bibliography file are name-value pairs: each line contains the name of an attribute followed by its value. The names and the values are sufficiently self-describing that I don't need to elaborate further on them. This format is particularly well suited to bibliographies and other complex data models. It supports missing attributes (books have no volume number and journals have no city), multiple attributes (such as authors), and an arbitrary order of fields (one need not remember whether the volume number comes before or after the month).

    Name-value pairs are useful in many databases. One might, for instance, describe the aircraft carrier USS Nimitz in a database of naval vessels with these pairs:

        name Nimitz
        class CVN
        number 68
        displacement 81600
        length 1040
        beam 134
        draft 36.5
        flightdeck 252
        speed 30
        officers 447
        enlisted 5176
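    The bibliography format described at the start of this section (percent-sign names, continuation lines, blank-line separators) can be read with very little code. The following Python sketch is illustrative only: the function name parse_entries and the decision to store every attribute as a list of values are assumptions, not part of the original column.

```python
def parse_entries(text):
    """Parse %name value entries separated by blank lines.

    Every attribute maps to a list of values, so repeated attributes
    (several authors) and missing attributes need no special cases.
    """
    entries, entry, last = [], {}, None
    for line in text.splitlines():
        if not line.strip():                 # a blank line ends the current entry
            if entry:
                entries.append(entry)
            entry, last = {}, None
        elif line.startswith("%"):           # a new name-value pair
            name, _, value = line[1:].partition(" ")
            last = entry.setdefault(name, [])
            last.append(value.strip())
        elif last is not None:               # continuation of the previous value
            last[-1] += " " + line.strip()
    if entry:
        entries.append(entry)
    return entries


SAMPLE = """\
%title The Art of Computer Programming,
    Volume 3: Sorting and Searching
%author D. E. Knuth
%year 1973

%author A. V. Aho
%author M. J. Corasick
%year 1975
"""

entries = parse_entries(SAMPLE)
print(entries[0]["title"][0])
print(entries[1]["author"])
```

    Because fields are looked up by name, the parser never needs to know the order in which they were written, and an entry without a volume or city is simply an entry with fewer keys.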

    June 1987 Volume 30 Number 6 Communications of the ACM 479


    Such a record could be used for input, storage, and output. A user could prepare a record for entry into the database using a standard text editor. The database system could store records in exactly this form (we'll soon see a representation that is more space efficient). The same record could be included in the answer to the query "What ships have a displacement of more than 75,000 tons?"

    Name-value pairs offer several advantages for this hypothetical application. A single format can be used for reading, storing, and writing records; this simplifies life for user and implementer alike. The application is inherently variable-format because different ships have different attributes: submarines have no flight decks and aircraft carriers have no submerged depth. Unfortunately, the example does not document the units in which the various quantities are expressed; we'll return to that shortly.
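    The claim that one format can serve for reading, storing, and writing can be made concrete with a short Python sketch. The function names read_record and write_record are hypothetical, and the "database" here is just a list holding the single Nimitz record from the text.

```python
NIMITZ = """\
name Nimitz
class CVN
number 68
displacement 81600
length 1040
beam 134
draft 36.5
flightdeck 252
speed 30
officers 447
enlisted 5176
"""

def read_record(text):
    """Turn lines of 'name value' into a dict, preserving field order."""
    record = {}
    for line in text.splitlines():
        if line.strip():
            name, value = line.split(None, 1)
            record[name] = value
    return record

def write_record(record):
    """Emit a record in the same format that read_record accepts."""
    return "".join(f"{name} {value}\n" for name, value in record.items())

ships = [read_record(NIMITZ)]

# "What ships have a displacement of more than 75,000 tons?"
# A record that lacks the field is treated as not matching.
big = [s for s in ships if float(s.get("displacement", 0)) > 75000]

for ship in big:
    print(write_record(ship), end="")
```

    The answer to the query is printed in exactly the form in which the record was entered, so the output of one run is valid input to the next.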

    Some database systems store records on mass memory in exactly the form shown above. This format makes it particularly easy to add new fields to records in an existing database. The name-value format can be quite space efficient, especially compared to fixed-format records that have many fields, most of which are usually empty. If storage is critical, however, then the database could be squeezed to a compressed format:

        naNimitz|clCVN|nu68|di81600|le1040|be134|dr36.5|fl252|sp30|of447|en5176|

    Each field begins with a two-character name and ends with a vertical bar. The input and the stored formats are connected by a data dictionary, which might start:

        ABBR  NAME          UNITS
        na    name          text
        cl    class         text
        nu    number        text
        di    displacement  tons
        le    length        feet
        be    beam          feet
        dr    draft         feet
        fl    flightdeck    feet
        sp    speed         knots
        of    officers      personnel
        en    enlisted      personnel
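    A data dictionary like this one can drive the conversion in both directions. The Python sketch below is a minimal illustration: the names DICTIONARY, squeeze, and expand are hypothetical, and it assumes that field values never contain a vertical bar.

```python
# Data dictionary from the text: abbreviation -> (name, units).
DICTIONARY = {
    "na": ("name", "text"),
    "cl": ("class", "text"),
    "nu": ("number", "text"),
    "di": ("displacement", "tons"),
    "le": ("length", "feet"),
    "be": ("beam", "feet"),
    "dr": ("draft", "feet"),
    "fl": ("flightdeck", "feet"),
    "sp": ("speed", "knots"),
    "of": ("officers", "personnel"),
    "en": ("enlisted", "personnel"),
}
# Reverse index: full name -> abbreviation.
ABBR = {name: abbr for abbr, (name, _units) in DICTIONARY.items()}

def squeeze(record):
    """Compress a {name: value} record: two-character name, value, vertical bar."""
    return "".join(f"{ABBR[name]}{value}|" for name, value in record.items())

def expand(packed):
    """Invert squeeze(); assumes values never contain a vertical bar."""
    record = {}
    for field in packed.split("|"):
        if field:                            # skip the empty piece after the last bar
            name, _units = DICTIONARY[field[:2]]
            record[name] = field[2:]
    return record

record = {"name": "Nimitz", "class": "CVN", "number": "68", "speed": "30"}
packed = squeeze(record)
print(packed)
print(expand(packed))
```

    Looking up abbreviations in a table, rather than slicing the first two characters of the name, keeps the code correct even for a dictionary whose abbreviations follow no such pattern. The units column also supplies the documentation that the raw record lacks.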

    In this dictionary the abbreviations are always the first two characters of the name; that may not hold in general. Readers offended by hypocrisy may complain that the above data is not in a name-value format. The regular structure supports the tabular format, but observe that the header line is another kind of self-description embedded in the data.

    Name-value pairs are a handy way to give input to any program. (They are perhaps the tiniest of the little languages described in the September 1986 column.) They can help meet the criteria that Kernighan and Plauger propose in Chapter 5 of their Elements of Programming Style (2nd ed., McGraw-Hill, 1978):

        Use mnemonic input and output. Make input easy to prepare (and to prepare correctly). Echo the input and any defaults onto the output; make the output self-explanatory.

    Name-value pairs are useful in code far removed from input/output. Suppose we wanted to provide a subroutine that adds a ship to a database. Most languages denote the (formal) name of a parameter by its position in the parameter list. This mechanism leads to remarkably clumsy calls:

        addship("Nimitz", "CVN", "68",
            81600, 1040, 134, 36.5,
            447, 5176,,,30,,,252,,,,)

    The missing parameters denote fields not present in this record. Is 30 the speed in knots or the draft in feet? A little discipline in commenting helps unravel the mess:

        addship("Nimitz",   # name
            "CVN",          # class
            "68",           # number
            81600,          # disp
            1040,           # length
            . . . )

    Many programming languages support named parameters, which make the job easier:

        addship(name = "Nimitz",
            class = "CVN",
            number = "68",
            disp = 81600,
            length = 1040,
            . . . )
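    Python is one language with named parameters, and the call above translates almost directly. This is a sketch, not the column's own code: the parameter is spelled cls because class is a reserved word in Python, and the choice to drop absent fields from the stored record is an assumption.

```python
def addship(*, name, cls=None, number=None, disp=None, length=None,
            beam=None, draft=None, flightdeck=None, speed=None,
            officers=None, enlisted=None):
    """Build a ship record; fields left as None are absent from the record.

    The bare * makes every parameter keyword-only, so a caller cannot
    fall back on positional arguments.
    """
    fields = dict(name=name, cls=cls, number=number, disp=disp, length=length,
                  beam=beam, draft=draft, flightdeck=flightdeck, speed=speed,
                  officers=officers, enlisted=enlisted)
    return {k: v for k, v in fields.items() if v is not None}

ship = addship(name="Nimitz", cls="CVN", number="68",
               disp=81600, length=1040, speed=30)
print(ship)
```

    With the names written at the call site, there is no longer any doubt about whether 30 is the speed or the draft, and the arguments may appear in any order.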

    Even if your language doesn't have named parameters, you can usually simulate them with a few routines:


        shipstart()
        shipstr(name, "Nimitz")
        shipstr(class, "CVN")
        shipstr(number, "68")
        shipnum(disp, 81600)
        shipnum(length, 1040)
        . . .
        shipend()

    The name variables name, class, number, etc., are assigned unique integers.
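    The routines themselves are not shown in the column, so the following Python sketch is one guess at how they might work. The Field enumeration, the module-level _current record, and the database list are all assumptions made for illustration.

```python
from enum import IntEnum

class Field(IntEnum):
    """The 'name variables': each field name is a unique integer."""
    NAME = 1
    CLASS = 2
    NUMBER = 3
    DISP = 4
    LENGTH = 5

_current = None      # record under construction
database = []        # stand-in for the real database

def shipstart():
    """Begin a new, empty record."""
    global _current
    _current = {}

def shipstr(field, value):
    """Set a string-valued field of the current record."""
    _current[Field(field)] = str(value)

def shipnum(field, value):
    """Set a numeric field of the current record."""
    _current[Field(field)] = float(value)

def shipend():
    """Store the finished record in the database."""
    global _current
    database.append(_current)
    _current = None

shipstart()
shipstr(Field.NAME, "Nimitz")
shipstr(Field.CLASS, "CVN")
shipnum(Field.DISP, 81600)
shipend()
print(database)
```

    Each call names the field it sets, so fields can be given in any order and omitted fields never appear in the record.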

    Provenances in Programming

    The provenance of a museum piece lists the origin or source of the object. Antiques are worth more when they have a provenance (this chair was built in such-and-such, then purchased by so-and-so, etc.). You might think of a provenance as a pedigree for a nonliving object.

    The idea of a provenance is old hat to many programmers. Some software shops insist that the provenance of a program be kept in the source code itself: in addition to other documentation in a module, the provenance gives the history of the code (who changed what when, and why). The provenance of a data file is often kept in an associated file (a transaction log, for instance). Frank Starmer, a Professor in the Departments of Computer Science and Medicine at Duke University, tells how his programs produce data files that contain their own provenances:

    We constantly face the problem of keeping track of our manipulations of data. We typically explore data sets by setting up a UNIX® pipeline like

        sim.events -k 1.5 -l 3 |
        sample -t .01 |
        bins

    The first program is a simulation with the two parameters k and l (set in this example to 1.5 and 3). The vertical bar at the end of the first line pipes the output into the second program. That program samples the data at the designated frequency, and in turn pipes its output to the third program, which chops the input into bins (suitable for graphical display as a histogram).

    UNIX is a registered trademark of AT&T Bell Laboratories.

    Note that the two parameters are set by a simple name-value mechanism. -J. B.

    When looking at the result of a computation like this, it is helpful to have an audit trail of the various command lines and data files encountered. We therefore built a mechanism for commenting the files with various annotations so that when we review the output, everything is there on one display or piece of paper.

    We use several types of comments. An audit trail line identifies a data file or a command-line transformation. A dictionary line names the attributes in each column of the output. A frame separator sets apart a group of sequential records associated with a common event. A note allows us to place our remarks in the file. All comments begin with an exclamation mark and the type of the comment; other lines are passed through untouched and processed as data. Thus the output of the above pipeline might look like:

        !trail sim.events -k 1.5 -l 3
        !trail sample -t .01
        !trail bins -t .01
        !dict bin-bottom-value item-count
        0.00 72
        0.01 138
        0.02 121
        . . .
        !note there is a cluster around 0.75
        !frame

    All programs in our library automatically copy existing comments from their input onto their output, and additionally add a new trail comment to document their own action. Programs that reformat data (such as bins) add a dict comment to describe the new format.

    We've done this in order to survive. This discipline aids in making both input and output data files self-documenting. Many other people have built similar mechanisms; wherever possible, I have copied their enhancements rather than having to figure out new ones myself.

    Tom Duff of Bell Labs uses a similar strategy in a system for processing pictures. He has developed a large suite of UNIX programs that perform transformations on pictures. A picture file consists of text lines listing the commands that made the picture (terminated by a blank line) followed by the picture itself (represented in binary). The prelude provides a provenance of the picture. Before Duff started this practice he would sometimes find himself with a wonderful picture and no idea of what transformations produced it; now he can reconstruct any picture from its provenance.

    Duff implements the provenances in a single library routine that all programs call as they begin execution. The routine copies the old command lines to the output and then writes the command line of the current program.
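    Both Starmer's and Duff's conventions come down to a small prologue shared by every filter: copy the provenance already in the input, then append the current command line. The Python sketch below follows Starmer's text format. The function run_filter and its parameters are hypothetical, the data values are made up for illustration, and the stand-in for sample simply keeps every other line.

```python
import io

def run_filter(argv, transform, infile, outfile, dict_line=None):
    """Copy '!' comments through, add a !trail line, and transform the data.

    Comments are gathered first and written ahead of the data, which is a
    simplification: comments interleaved with data (such as !frame
    separators) lose their position in this sketch.
    """
    comments, data = [], []
    for line in infile:
        (comments if line.startswith("!") else data).append(line.rstrip("\n"))
    for line in comments:                    # existing provenance, unchanged
        outfile.write(line + "\n")
    outfile.write("!trail " + " ".join(argv) + "\n")
    if dict_line is not None:                # programs that reformat the data
        outfile.write("!dict " + dict_line + "\n")
    for line in transform(data):
        outfile.write(line + "\n")

# Illustrative input: one existing trail line and three made-up data values.
source = io.StringIO("!trail sim.events -k 1.5 -l 3\n0.5\n0.7\n0.9\n")
sink = io.StringIO()
run_filter(["sample", "-t", ".01"], lambda rows: rows[::2], source, sink)
print(sink.getvalue(), end="")
```

    Putting the logic in one shared routine, as Duff does, means that no individual program can forget to record its own step.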

    A Sorting Lab

    To make the above ideas more concrete, we'll apply them to a problem described in the July 1985 column: building scaffolding to experiment with sort routines. That column contains code for several sort routines and a few small programs to exercise them. This section will sketch a better interface to the routines. The input and output are both expressed in name-value pairs, and the output contains a complete description of the input (its provenance).

    Experiments on sorting algorithms involve adjusting various parameters, executing the specified routine, then reporting key attributes of the computation. The precise operations to be performed can be specified by a sequence of name-value pairs. Thus the input file to the sorting lab might look like this:

        n 100000
        input identical
        alg quick
        cutoff 10
        partition random
        seed 379

    In this example the problem size, n, is set to 100,000. The input array is initialized with identical elements (other options might include random, sorted, or reversed elements). The sorting algorithm in this experiment is quick for quicksort; insert (for insertion sort) and heap (for heapsort) might also be available. The cutoff and partition names specify further parameters in quicksort.

    The input to the simulation program is a sequence of experiments in the above format, separated by blank lines. Its output is in exactly the same format of name-value pairs, separated by blank lines. The first part of an output record contains the original input description, which gives the provenance of each experiment. The input is followed by three additional attributes: comps records the number of comparisons made, swaps counts swaps, and cpu records the run time of the procedure. Thus an output record might end with the fields:

        comps 4772
        swaps 4676
        cpu 0.1083

    Given the sort routines and other procedures to do the real work, the control program is easy to build. Its main loop is sketched in Program 1: the code reads each input line, copies it to the output, and processes the name-value pair. The simulate() routine performs the experiment and writes the output variables in name-value format; it is called at each blank line and also at the end of the file.

        loop
            read input line into string S
            if end of file then break
            if S = "" then simulate()
            write S on output
            F1 = first field in S
            F2 = second field in S
            if F1 = "n" then
                N = F2
            else if F1 = "alg" then
                if F2 = "insert" then
                    alg = 1
                else if F2 = "heap" then
                    alg = 2
                else if F2 = "quick" then
                    alg = 3
                else error("bad alg")
            else if F1 = "input" then
                . . .
        simulate()

    PROGRAM 1. A Loop to Process Name-Value Pairs
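    For readers who want something executable, here is one way Program 1 might look in Python. It is a sketch under stated assumptions, not the column's code: simulate() is a placeholder that only records the parameters, names other than n and alg are stored as plain strings (the column elides those cases), each experiment starts from a fresh parameter set, and a blank line is not processed further after it triggers simulate().

```python
import io
import sys

ALGS = {"insert": 1, "heap": 2, "quick": 3}   # codes used in Program 1

experiments = []   # parameter sets seen so far, kept only for illustration

def simulate(params, out):
    """Placeholder for the real experiment."""
    if not params:
        return
    experiments.append(dict(params))
    # A real version would run the sort here and write comps, swaps, and cpu.
    out.write("\n")                      # blank line ends the output record
    params.clear()

def control(infile, out=sys.stdout):
    params = {}
    for line in infile:
        s = line.strip()
        if not s:                        # blank line: run the experiment
            simulate(params, out)
            continue
        out.write(s + "\n")              # echo the input as provenance
        f1, _, f2 = s.partition(" ")
        f2 = f2.strip()
        if f1 == "n":
            params["n"] = int(f2)
        elif f1 == "alg":
            if f2 not in ALGS:
                raise ValueError("bad alg")
            params["alg"] = ALGS[f2]
        else:
            params[f1] = f2              # input, cutoff, partition, seed, ...
    simulate(params, out)                # final experiment at end of file

out = io.StringIO()
control(io.StringIO("n 100000\nalg quick\ncutoff 10\n\nn 10\nalg heap\n"), out)
print(out.getvalue(), end="")
```

    Because every input line is echoed before it is interpreted, the output of a run begins with a complete copy of the specification that produced it.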

    This structure is useful for many simulation programs. The program is easy to build and easy to use. Its output can be read by a human and can also be fed to later programs for statistical analysis. Because all the input variables (which together provide a provenance of the experiment) appear with the output variables, any particular experiment can be repeated and studied in detail. The variable format allows additional input and output parameters to be added to future simulations without having to restructure previous data. Problem 8 shows how the basic structure can be gracefully extended to performing sets of experiments.


    Principles

    This column has only scratched the surface of self-describing data. Many systems, for instance, allow a programmer to multiply two numeric objects of unspecified type (ranging from integers to arrays of complex numbers); at run time the system determines the types by inspecting descriptions stored with the operands, and then performs the appropriate action. Some tagged-architecture machines provide hardware support of self-describing objects, and some communications protocols store data along with a description of its format and types. It is easy to give even more exotic examples of self-describing data.

    This column has concentrated on two simple but useful kinds of self-descriptions. Each reflects an important principle of program documentation.

    • The most important aid to documentation is a clean programming language. Name-value pairs are a simple, elegant, and widely useful linguistic mechanism.

    • The best place for program documentation is in the source file itself. A data file is a fine place to store its own provenance: it is easy to manipulate and hard to lose.

    Problems

    1. Self-documenting programs contain useful comments and suggestive indentation. Experiment with formatting a data file to make it easier to read. If necessary, modify the programs that process the file to ignore white space and comments. Start your task using a text editor. If the resulting formatted records are indeed easier to read, try writing a pretty printing program to present an arbitrary record in the format.

    2. Give an example of a data file that contains a program to process itself.

    3. The comments in good programs make them self-describing. The ultimate in a self-describing program, though, is one that prints exactly its source code when executed. Is it possible to write such a program in your favorite language?

    4. Many files are implicitly self-describing: although the operating system has no idea what they contain, a human reader can tell at a glance whether a file contains program source text, English text, numeric data, or binary data. How would you write a program to make an enlightened guess as to the type of such a file?

    5. Give examples of name-value pairs in your computing environment.

    6. Find a program with fixed-format input that you find hard to use, and modify it to read name-value pairs. Is it easier to modify the program directly or to write a new program that sits in front of the existing program?

    7. A general principle states that the output of a program should be suitable for input to the program. For instance, if a program wants a date to be entered in the format "06/31/87" then it should not write:

        Enter date (default 31 June 1987):

    Give other contexts in which this rule holds.

    8. The text sketched how to specify a single experiment on a sorting algorithm. Usually, though, experiments come in sets, with several parameters systematically varied. Construct a generator program that will convert a description like

        n [100 300 1000 3000 10000]
        input [random identical sorted]
        alg quicksort
        cutoff [5 10 20 40]
        partition med-of-3

    into 5 x 3 x 4 = 60 different specifications, with each item in a bracket list represented by a single element in a cross product. How would you add more complex iterators to the language, such as

        [from 10 to 130 by 20]
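    One possible starting point for Problem 8, offered as a sketch rather than a reference solution, is to treat each line as a list of options and take the cross product. The function name expand_spec is hypothetical, and the sketch handles only plain bracket lists; the more complex iterators are left as the problem intends.

```python
import itertools

def expand_spec(text):
    """Yield one experiment (a list of 'name value' lines) per combination."""
    names, choices = [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, value = line.split(None, 1)
        value = value.strip()
        if value.startswith("[") and value.endswith("]"):
            options = value[1:-1].split()     # bracket list: one option per item
        else:
            options = [value]                 # plain value: a single option
        names.append(name)
        choices.append(options)
    for combo in itertools.product(*choices):
        yield [f"{n} {v}" for n, v in zip(names, combo)]

SPEC = """\
n [100 300 1000 3000 10000]
input [random identical sorted]
alg quicksort
cutoff [5 10 20 40]
partition med-of-3
"""

specs = list(expand_spec(SPEC))
print(len(specs))
print("\n".join(specs[0]))
```

    Each generated specification is itself a sequence of name-value pairs, so the generator can sit in front of the sorting lab without any change to the lab itself.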

    For Correspondence: Jon Bentley, AT&T Bell Laboratories, Room X-317, 600 Mountain Avenue, Murray Hill, NJ 07974.

    Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
