
LectureAll.4up


1CS 536 Spring 2008 © 

CS 536

Introduction to Programming Languages and Compilers

Charles N. Fischer

Spring 2008

http://www.cs.wisc.edu/~fischer/cs536.html

2CS 536 Spring 2008 © 

Class Meets

Mondays, Wednesdays & Fridays, 1:20 - 2:10

1325 Computer Sciences

Instructor

Charles N. Fischer

6367 Computer Sciences

Telephone: 262-6635

E-mail: [email protected]

Office Hours:

10:30 - Noon, Tuesdays & Thursdays, or by appointment

3CS 536 Spring 2008 © 

Teaching Assistant

Min Qiu

5393 Computer Sciences

E-mail: [email protected]

Telephone: 608.262.5340

Office Hours:

Monday: 9:15 - 10:45,

Wednesday: 4:00 - 5:30, or by appointment

4CS 536 Spring 2008 © 

Key Dates

• February 4: Assignment #1 (Symbol Table Routines)

• February 27: Assignment #2 (CSX Scanner)

• March 28: Assignment #3 (CSX Parser)

• April 2: Midterm Exam (tentative)

• April 21: Assignment #4 (CSX Type Checker)

• May 9: Assignment #5 (CSX Code Generator)

• May 17: Final Exam, 7:45 am - 9:45 am


5CS 536 Spring 2008 © 

Class Text

• Draft Chapters from:

Crafting a Compiler, Second Edition.
(Distributed directly to class members)

• Handouts and Web-based reading will also be used.

Reading Assignment

• Chapters 1-2 of CaC (as background)

Class Notes

• The transparencies used for each lecture will be made available prior to, and after, that lecture on the class Web page (under the “Lecture Notes” link).

6CS 536 Spring 2008 © 

Instructional Computers

Departmental Unix (Linux) machines (king01 - king12, emperor01 - emperor40) have been assigned to CS 536. All necessary compilers and tools will be available on these machines.

You may also use your own PC or laptop. It will be your responsibility to load needed software (instructions on where to find needed software are included on the class web pages).

The Systems Lab teaches brief tutorials on Unix if you are unfamiliar with that OS.

7CS 536 Spring 2008 © 

Academic Misconduct Policy

• You must do your assignments; no copying or sharing of solutions.

• You may discuss general concepts and ideas.

• All cases of misconduct must be reported to the Dean’s office.

• Penalties may be severe.

8CS 536 Spring 2008 © 

Program & Homework Late Policy

• An assignment may be handed in up to one week late.

• Each late day will be debited 3%, up to a maximum of 21%.

Approximate Grade Weights

Program 1 - Symbol Tables 5%

Program 2 - Scanner 12%

Program 3 - Parser 12%

Program 4 - Type Checker 12%

Program 5 - Code Generator 12%

Homework #1 9%

Midterm Exam 19%

Final Exam (non-cumulative) 19%


9CS 536 Spring 2008 © 

Partnership Policy

• Program #1 and the written homework must be done individually.

• For undergraduates, programs 2 to 5 may be done individually or by two-person teams (your choice). Graduate students must do all assignments individually.

10CS 536 Spring 2008 © 

Compilers

Compilers are fundamental to modern computing.

They act as translators, transforming human-oriented programming languages into computer-oriented machine languages.

To most users, a compiler can be viewed as a “black box” that performs the transformation shown below.

[Diagram: Programming Language → Compiler → Machine Language]

11CS 536 Spring 2008 © 

A compiler allows programmers to ignore the machine-dependent details of programming.

Compilers allow programs and programming skills to be machine-independent.

Compilers also aid in detecting programming errors (which are all too common).

Compiler techniques also help to improve computer security. For example, the Java Bytecode Verifier helps to guarantee that Java security rules are satisfied.

12CS 536 Spring 2008 © 

Compilers currently help in protection of intellectual property (using obfuscation) and provenance (through watermarking).


13CS 536 Spring 2008 © 

History of Compilers

The term compiler was coined in the early 1950s by Grace Murray Hopper. Translation was viewed as the “compilation” of a sequence of machine-language subprograms selected from a library.

One of the first real compilers was the FORTRAN compiler of the late 1950s. It allowed a programmer to use a problem-oriented source language.

14CS 536 Spring 2008 © 

Ambitious “optimizations” were used to produce efficient machine code, which was vital for early computers with quite limited capabilities.

Efficient use of machine resources is still an essential requirement for modern compilers.

15CS 536 Spring 2008 © 

Compilers Enable Programming Languages

Programming languages are used for much more than “ordinary” computation.

• TeX and LaTeX use compilers to translate text and formatting commands into intricate typesetting commands.

• Postscript, generated by text-formatters like LaTeX, Word, and FrameMaker, is really a programming language. It is translated and executed by laser printers and document previewers to produce a readable form of a document. A standardized document representation language allows documents to be freely interchanged, independent of how they were created and how they will be viewed.

• Mathematica is an interactive system that intermixes programming with mathematics; it is possible to solve intricate problems in both symbolic and numeric form. This system relies heavily on compiler techniques to handle the specification, internal representation, and solution of problems.

• Verilog and VHDL support the creation of VLSI circuits. A silicon compiler specifies the layout and composition of a VLSI circuit mask, using standard cell designs. Just as an ordinary compiler understands and enforces programming language rules, a silicon compiler understands and enforces the design rules that dictate the feasibility of a given circuit.


17CS 536 Spring 2008 © 

• Interactive tools often need a programming language to support automatic analysis and modification of an artifact. How do you automatically filter or change a MS Word document? You need a text-based specification that can be processed, like a program, to check validity or produce an updated version.

18CS 536 Spring 2008 © 

When do We Run a Compiler?

• Prior to execution

This is standard. We compile a program once, then use it repeatedly.

• At the start of each execution

We can incorporate values known at the start of the run to improve performance. A program may be partially compiled, then completed with values set at execution-time.

• During execution

Unused code need not be compiled. Active or “hot” portions of a program may be specially optimized.

• After execution

We can profile a program, looking for heavily used routines, which can be specially optimized for later runs.

19CS 536 Spring 2008 © 

What do Compilers Produce?

Pure Machine Code

Compilers may generate code for a particular machine, not assuming any operating system or library routines. This is “pure code” because it includes nothing beyond the instruction set. This form is rare; it is sometimes used with system implementation languages that define operating systems or embedded applications (like a programmable controller). Pure code can execute on bare hardware without dependence on any other software.

20CS 536 Spring 2008 © 

Augmented Machine Code

Commonly, compilers generate code for a machine architecture augmented with operating system routines and run-time language support routines. To use such a program, a particular operating system must be used and a collection of run-time support routines (I/O, storage allocation, mathematical functions, etc.) must be available. The combination of machine instructions and OS and run-time routines defines a virtual machine, a computer that exists only as a hardware/software combination.


21CS 536 Spring 2008 © 

Virtual Machine Code

Generated code can consist entirely of virtual instructions (no native code at all). This allows code to run on a variety of computers.

Java, with its JVM (Java Virtual Machine), is a great example of this approach.

If the virtual machine is kept simple and clean, its interpreter can be easy to write. Machine interpretation slows execution by a factor of 3:1 to perhaps 10:1 over compiled code.

A “Just in Time” (JIT) compiler can translate “hot” portions of virtual code into native code to speed execution.

22CS 536 Spring 2008 © 

Advantages of Virtual Instructions

Virtual instructions serve a variety of purposes.

• They simplify a compiler by providing suitable primitives (such as method calls, string manipulation, and so on).

• They aid compiler transportability.

• They may decrease the size of generated code since instructions are designed to match a particular programming language (for example, JVM code for Java).

Almost all compilers, to a greater or lesser extent, generate code for a virtual machine, some of whose operations must be interpreted.

23CS 536 Spring 2008 © 

Formats of Translated Programs

Compilers differ in the format of the target code they generate. Target formats may be categorized as assembly language, relocatable binary, or memory-image.

• Assembly Language (Symbolic) Format

A text file containing assembler source code is produced. A number of code generation decisions (jump targets, long vs. short address forms, and so on) can be left for the assembler. This approach is good for instructional projects.

24CS 536 Spring 2008 © 

Generating assembler code supports cross-compilation (running a compiler on one computer, while its target is a second computer). Generating assembly language also simplifies debugging and understanding a compiler (since you can see the generated code).

C, rather than a specific assembly language, can be generated, using C as a “universal assembly language.”


25CS 536 Spring 2008 © 

C is far more machine-independent than any particular assembly language. However, some aspects of a program (such as the run-time representations of program and data) are inaccessible using C code, but readily accessible in assembly language.

26CS 536 Spring 2008 © 

• Relocatable Binary Format

Target code may be generated in a binary format in which external references and local instruction and data addresses are not yet bound. Instead, addresses are assigned relative to the beginning of the module or relative to symbolically named locations. A linkage step adds support libraries and other separately compiled routines and produces an absolute binary program format that is executable.

27CS 536 Spring 2008 © 

• Memory-Image (Absolute Binary) Format

Compiled code may be loaded into memory and immediately executed. This is faster than going through the intermediate step of link/editing. The ability to access library and precompiled routines may be limited. The program must be recompiled for each execution. Memory-image compilers are useful for student and debugging use, where frequent changes are the rule and compilation costs far exceed execution costs.

28CS 536 Spring 2008 © 

Java is designed to use and share classes defined and implemented at a variety of organizations. Rather than use a fixed copy of a class (which may be outdated), the JVM supports dynamic linking of externally defined classes. When first referenced, a class definition may be remotely fetched, checked, and loaded during program execution. In this way “foreign code” can be guaranteed to be up-to-date and secure.


29CS 536 Spring 2008 © 

The Structure of a Compiler

A compiler performs two major tasks:

• Analysis of the source program being compiled

• Synthesis of a target program

Almost all modern compilers are syntax-directed: The compilation process is driven by the syntactic structure of the source program.

A parser builds semantic structure out of tokens, the elementary symbols of programming language syntax. Recognition of syntactic structure is a major part of the analysis task.

30CS 536 Spring 2008 © 

Semantic analysis examines the meaning (semantics) of the program. Semantic analysis plays a dual role.

It finishes the analysis task by performing a variety of correctness checks (for example, enforcing type and scope rules). Semantic analysis also begins the synthesis phase.

The synthesis phase may translate source programs into some intermediate representation (IR) or it may directly generate target code.

31CS 536 Spring 2008 © 

If an IR is generated, it then serves as input to a code generator component that produces the desired machine-language program. The IR may optionally be transformed by an optimizer so that a more efficient program may be generated.

32CS 536 Spring 2008 © 

[Figure: The Structure of a Syntax-Directed Compiler. Source Program (Character Stream) → Scanner → Tokens → Parser → Abstract Syntax Tree (AST) → Type Checker → Decorated AST → Translator → Intermediate Representation (IR) → Optimizer → IR → Code Generator → Target Machine Code. Symbol Tables are shared by all phases.]


33CS 536 Spring 2008 © 

Scanner

The scanner reads the source program, character by character. It groups individual characters into tokens (identifiers, integers, reserved words, delimiters, and so on). When necessary, the actual character string comprising the token is also passed along for use by the semantic phases.

The scanner:

• Puts the program into a compact and uniform format (a stream of tokens).

• Eliminates unneeded information (such as comments).

• Sometimes enters preliminary information into symbol tables (for example, to register the presence of a particular label or identifier).

• Optionally formats and lists the source program.

Building tokens is driven by token descriptions defined using regular expression notation.

Regular expressions are a formal notation able to describe the tokens used in modern programming languages. Moreover, they can drive the automatic generation of working scanners given only a specification of the tokens. Scanner generators (like Lex, Flex and JLex) are valuable compiler-building tools.

35CS 536 Spring 2008 © 

Parser

Given a syntax specification (as a context-free grammar, CFG), the parser reads tokens and groups them into language structures.

Parsers are typically created from a CFG using a parser generator (like Yacc, Bison or Java CUP).

The parser verifies correct syntax and may issue a syntax error message.

As syntactic structure is recognized, the parser usually builds an abstract syntax tree (AST), a concise representation of program structure, which guides semantic processing.

36CS 536 Spring 2008 © 

Type Checker (Semantic Analysis)

The type checker checks the static semantics of each AST node. It verifies that the construct is legal and meaningful (that all identifiers involved are declared, that types are correct, and so on). If the construct is semantically correct, the type checker “decorates” the AST node, adding type or symbol table information to it. If a semantic error is discovered, a suitable error message is issued.

Type checking is purely dependent on the semantic rules of the source language. It is independent of the compiler’s target machine.


37CS 536 Spring 2008 © 

Translator (Program Synthesis)

If an AST node is semantically correct, it can be translated. Translation involves capturing the run-time “meaning” of a construct.

For example, an AST for a while loop contains two subtrees, one for the loop’s control expression, and the other for the loop’s body. Nothing in the AST shows that a while loop loops! This “meaning” is captured when a while loop’s AST is translated. In the IR, the notion of testing the value of the loop control expression, and conditionally executing the loop body, becomes explicit.

The translator is dictated by the semantics of the source language. Little of the nature of the target machine need be made evident. Detailed information on the nature of the target machine (operations available, addressing, register characteristics, etc.) is reserved for the code generation phase.

In simple non-optimizing compilers (like our class project), the translator generates target code directly, without using an IR.

More elaborate compilers may first generate a high-level IR (that is source language oriented) and then subsequently translate it into a low-level IR (that is target machine oriented). This approach allows a cleaner separation of source and target dependencies.

40CS 536 Spring 2008 © 

Optimizer

The IR code generated by the translator is analyzed and transformed into functionally equivalent but improved IR code by the optimizer.

The term optimization is misleading: we don’t always produce the best possible translation of a program, even after optimization by the best of compilers.

Why? Some optimizations are impossible to do in all circumstances because they involve an undecidable problem. Eliminating unreachable (“dead”) code is, in general, impossible.


41CS 536 Spring 2008 © 

Other optimizations are too expensive to do in all cases. These involve NP-complete problems, believed to be inherently exponential. Assigning registers to variables is an example of an NP-complete problem.

Optimization can be complex; it may involve numerous subphases, which may need to be applied more than once.

Optimizations may be turned off to speed translation. Nonetheless, a well designed optimizer can significantly speed program execution by simplifying, moving or eliminating unneeded computations.

42CS 536 Spring 2008 © 

Code Generator

IR code produced by the translator is mapped into target machine code by the code generator. This phase uses detailed information about the target machine and includes machine-specific optimizations like register allocation and code scheduling.

Code generators can be quite complex since good target code requires consideration of many special cases.

Automatic generation of code generators is possible. The basic approach is to match a low-level IR to target instruction templates, choosing instructions which best match each IR instruction.

A well-known compiler using automatic code generation techniques is the GNU C compiler. GCC is a heavily optimizing compiler with machine description files for over ten popular computer architectures, and at least two language front ends (C and C++).

44CS 536 Spring 2008 © 

Symbol Tables

A symbol table allows information to be associated with identifiers and shared among compiler phases. Each time an identifier is used, a symbol table provides access to the information collected about the identifier when its declaration was processed.


45CS 536 Spring 2008 © 

Example

Our source language will be CSX, a blend of C, C++ and Java.

Our target language will be the Java JVM, using the Jasmin assembler.

• Our source line is

a = bb+abs(c-7);

This is a sequence of ASCII characters in a text file.

• The scanner groups characters into tokens, the basic units of a program.

a = bb+abs(c-7);

After scanning, we have the following token sequence:

Id(a) Asg Id(bb) Plus Id(abs) Lparen Id(c) Minus IntLiteral(7) Rparen Semi

46CS 536 Spring 2008 © 

• The parser groups these tokens into language constructs (expressions, statements, declarations, etc.) represented in tree form:

[AST: Asg has children Id(a) and Plus; Plus has children Id(bb) and Call; Call has children Id(abs) and Minus; Minus has children Id(c) and IntLiteral(7).]

(What happened to the parentheses and the semicolon?)

47CS 536 Spring 2008 © 

• The type checker resolves types and binds declarations within scopes:

[Decorated AST: the same tree, with Id(a), Id(bb) and Id(c) marked intloc, Id(abs) marked method, IntLiteral(7) marked int, and the Minus, Call, Plus and Asg nodes marked int.]

48CS 536 Spring 2008 © 

• Finally, JVM code is generated for each node in the tree (leaves first, then roots):

iload 3 ; push local 3 (bb)

iload 2 ; push local 2 (c)

ldc 7 ; Push literal 7

isub ; compute c-7

invokestatic java/lang/Math/abs(I)I

iadd ; compute bb+abs(c-7)

istore 1 ; store result into local 1 (a)


49CS 536 Spring 2008 © 

Interpreters

There are two different kinds of interpreters that support execution of programs, machine interpreters and language interpreters.

Machine Interpreters

Machine interpreters simulate the execution of a program compiled for a particular machine architecture. Java uses a bytecode interpreter to simulate the effects of programs compiled for the JVM.

Programs like SPIM simulate the execution of a MIPS program on a non-MIPS computer.

50CS 536 Spring 2008 © 

Language Interpreters

Language interpreters simulate the effect of executing a program without compiling it to any particular instruction set (real or virtual). Instead some IR form (perhaps an AST) is used to drive execution.

Interpreters provide a number of capabilities not found in compilers:

• Programs may be modified as execution proceeds. This provides a straightforward interactive debugging capability. Depending on program structure, program modifications may require reparsing or repeated semantic analysis. In Python, for example, any string variable may be interpreted as a Python expression or statement and executed.

51CS 536 Spring 2008 © 

• Interpreters readily support languages in which the type a variable denotes may change dynamically (e.g., Python or Scheme). The user program is continuously reexamined as execution proceeds, so symbols need not have a fixed type. Fluid bindings are much more troublesome for compilers, since dynamic changes in the type of a symbol make direct translation into machine code difficult or impossible.

• Interpreters provide better diagnostics. Source text analysis is intermixed with program execution, so especially good diagnostics are available, along with interactive debugging.

• Interpreters support machine independence. All operations are performed within the interpreter. To move to a new machine, we just recompile the interpreter.

52CS 536 Spring 2008 © 

However, interpretation can involve large overheads:

• As execution proceeds, program text is continuously reexamined, with bindings, types, and operations sometimes recomputed at each use. For very dynamic languages this can represent a 100:1 (or worse) factor in execution speed over compiled code. For more static languages (such as C or Java), the speed degradation is closer to 10:1.

• Startup time for small programs is slowed, since the interpreter must be loaded and the program partially recompiled before execution begins.


53CS 536 Spring 2008 © 

• Substantial space overhead may be involved. The interpreter and all support routines must usually be kept available. Source text is often not as compact as if it were compiled. This size penalty may lead to restrictions in the size of programs. Programs beyond these built-in limits cannot be handled by the interpreter.

Of course, many languages (including C, C++ and Java) have both interpreters (for debugging and program development) and compilers (for production work).

54CS 536 Spring 2008 © 

Symbol Tables & Scoping

Programming languages use scopes to limit the range over which an identifier is active (and visible).

Within a scope a name may be defined only once (though overloading may be allowed).

A symbol table (or dictionary) is commonly used to collect all the definitions that appear within a scope.

At the start of a scope, the symbol table is empty. At the end of a scope, all declarations within that scope are available within the symbol table.

55CS 536 Spring 2008 © 

A language definition may or may not allow forward references to an identifier.

If forward references are allowed, you may use a name that is defined later in the scope (Java does this for field and method declarations within a class).

If forward references are not allowed, an identifier is visible only after its declaration. C, C++ and Java do this for variable declarations.

In CSX no forward references are allowed.

In terms of symbol tables, forward references require two passes over a scope. First all declarations are gathered. Next, all references are resolved using the complete set of declarations stored in the symbol table.

If forward references are disallowed, one pass through a scope suffices, processing declarations and uses of identifiers together.


57CS 536 Spring 2008 © 

Block Structured Languages

• Introduced by Algol 60; includes C, C++, CSX and Java.

• Identifiers may have a non-global scope. Declarations may be local to a class, subprogram or block.

• Scopes may nest, with declarations propagating to inner (contained) scopes.

• The lexically nearest declaration of an identifier is bound to uses of that identifier.

58CS 536 Spring 2008 © 

Example (drawn from C):

int x, z;

void A() {
   float x, y;
   print(x, y, z);   /* x: float, y: float, z: int */
}

void B() {
   print(x, y, z);   /* x: int, y: undeclared, z: int */
}

59CS 536 Spring 2008 © 

Block Structure Concepts

• Nested Visibility

No access to identifiers outside their scope.

• Nearest Declaration Applies

Using static nesting of scopes.

• Automatic Allocation and Deallocation of Locals

Lifetime of data objects is bound to the scope of the identifiers that denote them.

60CS 536 Spring 2008 © 

Block-Structured Symbol Tables

Block structured symbol tables are designed to support the scoping rules of block structured languages.

For our CSX project we’ll use class Symb to represent symbols and SymbolTable to implement the operations needed for a block-structured symbol table.

Class Symb will contain a method

public String name()

that returns the name associated with a symbol.


61CS 536 Spring 2008 © 

Class SymbolTable contains the following methods:

• public void openScope()

A new and empty scope is opened.

• public void closeScope() throws EmptySTException

The innermost scope is closed. An exception is thrown if there is no scope to close.

• public void insert(Symb s) throws DuplicateException, EmptySTException

A Symb is inserted in the innermost scope. An exception is thrown if a Symb with the same name is already in the innermost scope or if there is no symbol table to insert into.

62CS 536 Spring 2008 © 

• public Symb localLookup(String s)

The innermost scope is searched for a Symb whose name is equal to s. Null is returned if no Symb named s is found.

• public Symb globalLookup(String s)

All scopes, from innermost to outermost, are searched for a Symb whose name is equal to s. The first Symb that matches s is returned; null is returned if no matching Symb is found.

63CS 536 Spring 2008 © 

Is Case Significant?

In some languages (C, C++, Java and many others) case is significant in identifiers. This means aa and AA are different symbols that may have entirely different definitions.

In other languages (Pascal, Ada, Scheme, CSX) case is not significant. In such languages aa and AA are two alternative spellings of the same identifier.

Data structures commonly used to implement symbol tables usually treat different cases as different symbols. This is fine when case is significant in a language. When case is insignificant, you probably will need to strip case before entering or looking up identifiers.

This just means that identifiers are converted to a uniform case before they are entered or looked up. Thus if we choose to use lower case uniformly, the identifiers aaa, AAA, and AaA are all converted to aaa for purposes of insertion or lookup.

BUT, inside the symbol table the identifier is stored in the form it was declared, so that programmers see the form of identifier they expect in listings, error messages, etc.
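One simple way to get this behavior, sketched below in Java, is to fold the key to lower case for insertion and lookup while the Symb object (the class with the name() method described earlier) keeps the declared spelling. The class name here is purely illustrative, not the required project interface.

import java.util.HashMap;
import java.util.Map;

// Sketch: case-insensitive lookup that preserves the declared spelling.
class CaseFoldingScope {
    // key: identifier folded to lower case; value: the Symb, which keeps
    // the spelling used at the point of declaration.
    private final Map<String, Symb> entries = new HashMap<String, Symb>();

    void insert(Symb s) {
        entries.put(s.name().toLowerCase(), s);
    }

    Symb lookup(String name) {
        return entries.get(name.toLowerCase());   // null if not present
    }
}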


65CS 536 Spring 2008 © 

How are Symbol Tables Implemented?

There are a number of data structures that can reasonably be used to implement a symbol table:

• An Ordered List

Symbols are stored in a linked list, sorted by the symbol’s name. This is simple, but may be a bit too slow if many identifiers appear in a scope.

• A Binary Search Tree

Lookup is much faster than in linked lists, but rebalancing may be needed. (Entering identifiers in sorted order turns a search tree into a linked list.)

• Hash Tables

The most popular choice.

66CS 536 Spring 2008 © 

Implementing Block-Structured Symbol Tables

To implement a block structured symbol table we need to be able to efficiently open and close individual scopes, and limit insertion to the innermost current scope.

This can be done using one symbol table structure if we tag individual entries with a “scope number.”

It is far easier (but more wasteful of space) to allocate one symbol table for each scope. Open scopes are stacked, pushing and popping tables as scopes are opened and closed.

67CS 536 Spring 2008 © 

Be careful, though: many preprogrammed stack implementations don’t allow you to “peek” at entries below the stack top. This is necessary to look up an identifier in all open scopes.

If a suitable stack implementation (with a peek operation) isn’t available, a linked list of symbol tables will suffice.
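A minimal Java sketch of the one-table-per-scope approach follows. It uses the operation names of the SymbolTable interface described earlier and assumes the Symb class from the slides, but simplifies the error handling to unchecked exceptions (the real project uses DuplicateException and EmptySTException).

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One HashMap per open scope, kept in a list so lookups can "peek"
// below the innermost scope.
class SymbolTableSketch {
    private final List<Map<String, Symb>> scopes =
        new ArrayList<Map<String, Symb>>();

    public void openScope() {
        scopes.add(new HashMap<String, Symb>());      // push an empty scope
    }

    public void closeScope() {
        if (scopes.isEmpty())
            throw new IllegalStateException("no scope to close");
        scopes.remove(scopes.size() - 1);             // pop the innermost scope
    }

    public void insert(Symb s) {
        if (scopes.isEmpty())
            throw new IllegalStateException("no open scope");
        Map<String, Symb> inner = scopes.get(scopes.size() - 1);
        if (inner.containsKey(s.name()))
            throw new IllegalArgumentException("duplicate: " + s.name());
        inner.put(s.name(), s);
    }

    public Symb localLookup(String name) {
        if (scopes.isEmpty())
            return null;
        return scopes.get(scopes.size() - 1).get(name);
    }

    public Symb globalLookup(String name) {
        // Search innermost to outermost; the first match wins.
        for (int i = scopes.size() - 1; i >= 0; i--) {
            Symb s = scopes.get(i).get(name);
            if (s != null)
                return s;
        }
        return null;
    }
}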

68CS 536 Spring 2008 © 

More on Hashtables

Hashtables are a very useful data structure. Java provides a predefined Hashtable class. Python includes a built-in dictionary type.

Every Java class has a hashCode method, which allows any object to be entered into a Java Hashtable.

For most objects, hash codes are pretty simple (the address of the corresponding object is often used).

But for strings Java uses a much more elaborate hash function:

Σ_{i=0}^{n-1} c_i × 37^i


69CS 536 Spring 2008 © 

n is the length of the string, c_i is the i-th character, and all arithmetic is done without overflow checking.

Why such an elaborate hash function?

Simpler hash functions can have major problems.

Consider Σ_{i=0}^{n-1} c_i (simply add the characters). For short identifiers the sum grows slowly, so large indices won’t often be used (leading to non-uniform use of the hash table).

70CS 536 Spring 2008 © 

We can try Π_{i=0}^{n-1} c_i (the product of the characters), but now (surprisingly) the size of the hash table becomes an issue.

The problem is that if even one character is encoded as an even number, the product must be even.

If the hash table is even in size (a natural thing to do), most hash table entries will be at even positions. Similarly, if even one character is encoded as a multiple of 3, the whole product will be a multiple of 3, so hash tables that are a multiple of three in size won’t be uniformly used.

71CS 536 Spring 2008 © 

To see how bad things can get, consider a hash table with size 210 (which is equal to 2×3×5×7). This should be a particularly bad table size if a product hash is used. (Why?)

Is it? As an experiment, all the words in the Unix spell checker’s dictionary (26000 words) were entered. Over 50% (56.7% actually) hit position 0 in the table!

Why such non-uniformity?

If an identifier contains characters that are multiples of 2, 3, 5 and 7, then their hash will be a multiple of 210 and will map to position 0.

72CS 536 Spring 2008 © 

For example, in Wisconsin, n has an ASCII code of 110 (2×55) and i has a code of 105 (7×5×3).

If we change the table size ever so slightly, to 211, no table entry gets more than 1% of the 26000 words hashed, which is very good.

Why such a big difference? Well, 211 is prime, and there is a bit of folk wisdom that states that prime numbers are good choices for hash table sizes. Now our product hash will cover table entries far more uniformly (small factors in the hash don’t divide the table size evenly).


73CS 536 Spring 2008 © 

Now the reason for Java’s more complex string hash function becomes evident: it can uniformly fill a hash table whose size isn’t prime.
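A small Java sketch of this polynomial hash, using the multiplier 37 from the slides (java.lang.String.hashCode itself uses 31, but the shape of the computation is the same):

// Computes the sum over i of c_i * 37^i via Horner's rule, with the usual
// silent wrap-around on overflow.
static int stringHash(String s) {
    int h = 0;
    for (int i = s.length() - 1; i >= 0; i--) {
        h = h * 37 + s.charAt(i);
    }
    return h;
}

// Maps a string to a slot in a table of the given size; the sign bit is
// masked off so the index is non-negative.
static int tableIndex(String s, int tableSize) {
    return (stringHash(s) & 0x7fffffff) % tableSize;
}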

74CS 536 Spring 2008 © 

How are Collisions Handled?

Since identifiers are often unlimited in length, the set of possible identifiers is infinite. Even if we limit ourselves to short identifiers (say 10 or fewer characters), the number of valid identifiers is greater than 26^10.

This means that all hash tables need to contend with collisions, when two different identifiers map to the same place in the table.

How are collisions handled?

The simplest approach is linear resolution. If identifier id hashes to position p in a hash table of size s and position p is already filled, we try (p+1) mod s, then (p+2) mod s, until a free position is found.

As long as the table is not too filled, this approach works well. When we approach an almost-filled situation, long search chains form, and we degenerate to an unordered list.
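A compact Java sketch of linear resolution on a fixed-size table of identifier strings (the class name and hash choice are illustrative):

// Open addressing: probe p, (p+1) mod s, (p+2) mod s, ... until an empty
// slot (or the identifier itself) is found.
class LinearProbeTable {
    private final String[] slots;

    LinearProbeTable(int size) { slots = new String[size]; }

    boolean insert(String id) {
        int p = (id.hashCode() & 0x7fffffff) % slots.length;
        for (int tries = 0; tries < slots.length; tries++) {
            int slot = (p + tries) % slots.length;
            if (slots[slot] == null || slots[slot].equals(id)) {
                slots[slot] = id;
                return true;
            }
        }
        return false;             // table is 100% full: linear resolution fails
    }

    boolean contains(String id) {
        int p = (id.hashCode() & 0x7fffffff) % slots.length;
        for (int tries = 0; tries < slots.length; tries++) {
            int slot = (p + tries) % slots.length;
            if (slots[slot] == null) return false;   // empty slot: not present
            if (slots[slot].equals(id)) return true;
        }
        return false;
    }
}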

If the table is 100% filled, linear resolution fails.

Some hash table implementations, including Java’s, set a load factor between 0 and 1.0. When the fraction of filled entries in the table exceeds the load factor, table size is increased and all entries are rehashed.

Note that bundling of a hashCode method within all Java objects makes rehashing easy to do automatically. If the hash function is external to the symbol table entries, rehashing may need to be done manually by the user.

An alternative to linear resolution is chained resolution, in which symbol table entries contain pointers to chains of symbols rather than a single symbol. All identifiers that hash to the same position appear on the same chain. Now overflowing table size is not catastrophic:


77CS 536 Spring 2008 © 

as the table fills, chains fromeach table position get longer.As long as the table is not toooverfilled, average chain lengthwill be small.

78CS 536 Spring 2008 © 

Reading Assignment

Read Chapter 3 of Crafting a Compiler, Second Edition.

79CS 536 Spring 2008 © 

Scanning

A scanner transforms a character stream into a token stream.

A scanner is sometimes called a lexical analyzer or lexer.

Scanners use a formal notation (regular expressions) to specify the precise structure of tokens.

But why bother? Aren’t tokens very simple in structure?

Token structure can be more detailed and subtle than one might expect. Consider simple quoted strings in C, C++ or Java. The body of a string can be any sequence of characters except a quote character (which must be escaped). But is this simple definition really correct?

80CS 536 Spring 2008 © 

Can a newline character appear in a string? In C it cannot, unless it is escaped with a backslash.

C, C++ and Java allow escaped newlines in strings; Pascal forbids them entirely. Ada forbids all unprintable characters.

Are null strings (zero-length) allowed? In C, C++, Java and Ada they are, but Pascal forbids them.

(In Pascal a string is a packed array of characters, and zero length arrays are disallowed.)

A precise definition of tokens can ensure that lexical rules are clearly stated and properly enforced.


81CS 536 Spring 2008 © 

Regular Expressions

Regular expressions specify simple (possibly infinite) sets of strings. Regular expressions routinely specify the tokens used in programming languages.

Regular expressions can drive a scanner generator.

Regular expressions are widely used in computer utilities:

• The Unix utility grep uses regular expressions to define search patterns in files.

• Unix shells allow regular expressions in file lists for a command.

82CS 536 Spring 2008 © 

• Most editors provide a “context search” command that specifies desired matches using regular expressions.

• The Windows Find utility allows some regular expressions.

83CS 536 Spring 2008 © 

Regular Sets

The sets of strings defined by regular expressions are called regular sets.

When scanning, a token class will be a regular set, whose structure is defined by a regular expression.

Particular instances of a token class are sometimes called lexemes, though we will simply call a string in a token class an instance of that token. Thus we call the string abc an identifier if it matches the regular expression that defines valid identifier tokens.

Regular expressions use a finite character set, or vocabulary (denoted Σ).

84CS 536 Spring 2008 © 

This vocabulary is normally the character set used by a computer. Today, the ASCII character set, which contains a total of 128 characters, is very widely used.

Java uses the Unicode character set, which includes all the ASCII characters as well as a wide variety of other characters.

An empty or null string is allowed (denoted λ, “lambda”). Lambda represents an empty buffer in which no characters have yet been matched. It also represents optional parts of tokens. An integer literal may begin with a plus or minus, or it may begin with λ if it is unsigned.


85CS 536 Spring 2008 © 

Catenation

Strings are built from characters in the character set Σ via catenation.

As characters are catenated to a string, it grows in length. The string do is built by first catenating d to λ, and then catenating o to the string d. The null string, when catenated with any string s, yields s. That is, s λ ≡ λ s ≡ s. Catenating λ to a string is like adding 0 to an integer: nothing changes.

Catenation is extended to sets of strings:

Let P and Q be sets of strings. (The symbol ∈ represents set membership.) If s1 ∈ P and s2 ∈ Q then string s1 s2 ∈ (P Q).

86CS 536 Spring 2008 © 

Alternation

Small finite sets are conveniently represented by listing their elements. Parentheses delimit expressions, and |, the alternation operator, separates alternatives.

For example, D, the set of the ten single digits, is defined as

D = (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9).

The characters (, ), ', ∗, +, and | are meta-characters (punctuation and regular expression operators).

Meta-characters must be quoted when used as ordinary characters to avoid ambiguity.

87CS 536 Spring 2008 © 

For example the expression

( '(' | ')' | ; | , )

defines four single character tokens (left parenthesis, right parenthesis, semicolon and comma). The parentheses are quoted when they represent individual tokens and are not used as delimiters in a larger regular expression.

Alternation is extended to sets of strings:

Let P and Q be sets of strings. Then string s ∈ (P | Q) if and only if s ∈ P or s ∈ Q.

For example, if LC is the set of lower-case letters and UC is the set of upper-case letters, then (LC | UC) is the set of all letters (in either case).

88CS 536 Spring 2008 © 

Kleene Closure

A useful operation is Kleene closure, represented by a postfix ∗ operator.

Let P be a set of strings. Then P* represents all strings formed by the catenation of zero or more selections (possibly repeated) from P.

Zero selections are denoted by λ.

For example, LC* is the set of all words composed of lower-case letters, of any length (including the zero length word, λ).

Precisely stated, a string s ∈ P* if and only if s can be broken into zero or more pieces: s = s1 s2 ... sn so that each si ∈ P (n ≥ 0, 1 ≤ i ≤ n). We allow n = 0, so λ is always in P*.


89CS 536 Spring 2008 © 

Definition of Regular Expressions

Using catenation, alternation and Kleene closure, we can define regular expressions as follows:

• ∅ is a regular expression denoting the empty set (the set containing no strings). ∅ is rarely used, but is included for completeness.

• λ is a regular expression denoting the set that contains only the empty string. This set is not the same as the empty set, because it contains one element.

• A string s is a regular expression denoting a set containing the single string s.

90CS 536 Spring 2008 © 

• If A and B are regular expressions, then A | B, A B, and A* are also regular expressions, denoting the alternation, catenation, and Kleene closure of the corresponding regular sets.

Each regular expression denotes a set of strings (a regular set). Any finite set of strings can be represented by a regular expression of the form (s1 | s2 | … | sk). Thus the reserved words of ANSI C can be defined as (auto | break | case | …).

91CS 536 Spring 2008 © 

The following additional operations are useful. They are not strictly necessary, because their effect can be obtained using alternation, catenation, and Kleene closure:

• P+ denotes all strings consisting of one or more strings in P catenated together:

P* = (P+ | λ) and P+ = P P*.

For example, ( 0 | 1 )+ is the set of all strings containing one or more bits.

• If A is a set of characters, Not(A) denotes (Σ − A); that is, all characters in Σ not included in A. Since Not(A) can never be larger than Σ and Σ is finite, Not(A) must also be finite, and is therefore regular. Not(A) does not contain λ since λ is not a character (it is a zero-length string).

92CS 536 Spring 2008 © 

For example, Not(Eol) is the set of all characters excluding Eol (the end of line character, '\n' in Java or C).

• It is possible to extend Not to strings, rather than just Σ. That is, if S is a set of strings, we define Not(S) to be (Σ* − S); the set of all strings except those in S. Though Not(S) is usually infinite, it is also regular if S is.

• If k is a constant, the set A^k represents all strings formed by catenating k (possibly different) strings from A.

That is, A^k = (A A A …) (k copies).

Thus ( 0 | 1 )^32 is the set of all bit strings exactly 32 bits long.


93CS 536 Spring 2008 © 

Examples

Let D be the ten single digits and let L be the set of all 52 letters. Then

• A Java or C++ single-line comment that begins with // and ends with Eol can be defined as:

Comment = // Not(Eol)* Eol

• A fixed decimal literal (e.g., 12.345) can be defined as:

Lit = D+ . D+

• An optionally signed integer literal can be defined as:

IntLiteral = ( '+' | − | λ ) D+

(Why the quotes on the plus?)

94CS 536 Spring 2008 © 

• A comment delimited by ## markers, which allows single #’s within the comment body:

Comment2 = ## ( (# | λ) Not(#) )* ##
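The definitions above translate fairly directly into java.util.regex notation; the sketch below is only an illustration (λ alternatives become ? operators, and Not(X) becomes a negated character class):

import java.util.regex.Pattern;

class TokenPatterns {
    // Comment = // Not(Eol)* Eol
    static final Pattern COMMENT    = Pattern.compile("//[^\n]*\n");
    // Lit = D+ . D+
    static final Pattern LIT        = Pattern.compile("[0-9]+\\.[0-9]+");
    // IntLiteral = ( '+' | - | λ ) D+
    static final Pattern INTLITERAL = Pattern.compile("[+-]?[0-9]+");
    // Comment2 = ## ( (# | λ) Not(#) )* ##
    static final Pattern COMMENT2   = Pattern.compile("##(#?[^#])*##");

    public static void main(String[] args) {
        System.out.println(INTLITERAL.matcher("-42").matches());        // true
        System.out.println(COMMENT2.matcher("## a #b c ##").matches()); // true
    }
}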

All finite sets and many infinite sets are regular. But not all infinite sets are regular. Consider the set of balanced brackets of the form [ [ [ … ] ] ].

This set is defined formally as { [^m ]^m | m ≥ 1 }.

This set is known not to be regular. Any regular expression that tries to define it either does not get all balanced nestings or it includes extra, unwanted strings.

95CS 536 Spring 2008 © 

Finite Automata and Scanners

A finite automaton (FA) can be used to recognize the tokens specified by a regular expression. FAs are simple, idealized computers that recognize strings belonging to regular sets. An FA consists of:

• A finite set of states

• A set of transitions (or moves) from one state to another, labeled with characters in Σ

• A special state called the start state

• A subset of the states called the accepting, or final, states

96CS 536 Spring 2008 © 

These four components of a finite automaton are often represented graphically:

[Legend: a circle is a state, a double circle is an accepting state, an unlabeled incoming arrow marks the start state, and a labeled arrow between states is a transition.]

Finite automata (the plural of automaton is automata) are represented graphically using transition diagrams. We start at the start state. If the next input character matches the label on


97CS 536 Spring 2008 © 

a transition from the current state, we go to the state it points to. If no move is possible, we stop. If we finish in an accepting state, the sequence of characters read forms a valid token; otherwise, we have not seen a valid token.

In this diagram, the valid tokens are the strings described by the regular expression (a b (c)+)+.

[FA diagram: the start state takes a to a second state, which takes b to a third state, which takes c to an accepting state; the accepting state loops on c and takes a back to the second state.]

98CS 536 Spring 2008 © 

Deterministic Finite Automata

As an abbreviation, a transition may be labeled with more than one character (for example, Not(c)). The transition may be taken if the current input character matches any of the characters labeling the transition.

If an FA always has a unique transition (for a given state and character), the FA is deterministic (that is, a deterministic FA, or DFA). Deterministic finite automata are easy to program and often drive a scanner.

If there are transitions to more than one state for some character, then the FA is nondeterministic (that is, an NFA).

99CS 536 Spring 2008 © 

A DFA is conveniently represented in a computer by a transition table. A transition table, T, is a two dimensional array indexed by a DFA state and a vocabulary symbol.

Table entries are either a DFA state or an error flag (often represented as a blank table entry). If we are in state s, and read character c, then T[s,c] will be the next state we visit, or T[s,c] will contain an error marker indicating that c cannot extend the current token. For example, the regular expression

// Not(Eol)* Eol

which defines a Java or C++ single-line comment, might be translated into

100CS 536 Spring 2008 © 

[DFA diagram: state 1 goes to state 2 on /, state 2 goes to state 3 on /, state 3 loops on Not(Eol), and state 3 goes to accepting state 4 on Eol.]

The corresponding transition table is:

State    /    Eol    a    b    …    eof
  1      2
  2      3
  3      3    4      3    3    3
  4

A complete transition table contains one column for each character. To save space, table compression may be used. Only non-error entries are explicitly represented in the table, using hashing, indirection or linked structures.


101CS 536 Spring 2008 © 

All regular expressions can be translated into DFAs that accept (as valid tokens) the strings defined by the regular expressions. This translation can be done manually by a programmer or automatically using a scanner generator.

A DFA can be coded in:

• Table-driven form

• Explicit control form

In the table-driven form, the transition table that defines a DFA’s actions is explicitly represented in a run-time table that is “interpreted” by a driver program.

In the direct control form, the transition table that defines a DFA’s actions appears implicitly as the control logic of the program.

102CS 536 Spring 2008 © 

For example, suppose CurrentChar is the current input character. End of file is represented by a special character value, eof. Using the DFA for the Java comments shown earlier, a table-driven scanner is:

State = StartState
while (true) {
    if (CurrentChar == eof)
        break
    NextState = T[State][CurrentChar]
    if (NextState == error)
        break
    State = NextState
    read(CurrentChar)
}
if (State in AcceptingStates)
    // Process valid token
else
    // Signal a lexical error

103CS 536 Spring 2008 © 

This form of scanner is produced by a scanner generator; it is definition-independent. The scanner is a driver that can scan any token if T contains the appropriate transition table.
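A concrete Java rendering of this driver, for the comment DFA above, might look like the following sketch; the table contents mirror the transition table shown earlier, while the class name and the column mapping are illustrative assumptions:

import java.io.IOException;
import java.io.InputStream;

// Sketch of a table-driven scanner for the comment DFA (states 1-4).
// ERROR marks blank table entries; only '/' and '\n' columns matter here,
// every other character maps to the OTHER column.
class CommentScanner {
    static final int ERROR = 0, START = 1, ACCEPT = 4;
    static final int SLASH = 0, EOL = 1, OTHER = 2;       // column indices

    // T[state][column]: rows 1..4 follow the transition table shown earlier.
    static final int[][] T = {
        {},                                               // unused row 0
        {2, ERROR, ERROR},                                // state 1: '/' -> 2
        {3, ERROR, ERROR},                                // state 2: '/' -> 3
        {3, 4, 3},                                        // state 3: Eol -> 4, else loop
        {ERROR, ERROR, ERROR}                             // state 4: accepting
    };

    static int column(int ch) {
        return ch == '/' ? SLASH : ch == '\n' ? EOL : OTHER;
    }

    static boolean scanComment(InputStream in) throws IOException {
        int state = START;
        int ch = in.read();                               // -1 represents eof
        while (ch != -1) {
            int next = T[state][column(ch)];
            if (next == ERROR) break;
            state = next;
            ch = in.read();
        }
        return state == ACCEPT;                           // ended in an accepting state?
    }
}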

Here is an explicit-control scanner for the same comment definition:

if (CurrentChar == '/') {
    read(CurrentChar)
    if (CurrentChar == '/')
        repeat
            read(CurrentChar)
        until (CurrentChar in {eol, eof})
    else // Signal a lexical error
} else // Signal a lexical error
if (CurrentChar == eol)
    // Process valid token
else // Signal a lexical error

104CS 536 Spring 2008 © 

The token being scanned is “hardwired” into the logic of the code. The scanner is usually easy to read and often is more efficient, but is specific to a single token definition.


105CS 536 Spring 2008 © 

More Examples

• A FORTRAN-like real literal (which requires digits on either or both sides of a decimal point, or just a string of digits) can be defined as

RealLit = (D+ (λ | . )) | (D* . D+)

This corresponds to the DFA

[DFA diagram: a D from the start state reaches an accepting state that loops on D; from there a '.' reaches another accepting state, and a further D reaches an accepting state that loops on D. A '.' taken directly from the start state leads to a state that needs at least one D before accepting.]

106CS 536 Spring 2008 © 

• An identifier consisting of letters, digits, and underscores, which begins with a letter and allows no adjacent or trailing underscores, may be defined as

ID = L (L | D)* ( _ (L | D)+ )*

This definition includes identifiers like sum or unit_cost, but excludes _one and two_ and grand___total.

[DFA diagram: an L from the start state reaches an accepting state that loops on L | D; an underscore leads to a non-accepting state that returns to the accepting state on L | D.]

107CS 536 Spring 2008 © 

Lex/Flex/JLex

Lex is a well-known Unix scanner generator. It builds a scanner, in C, from a set of regular expressions that define the tokens to be scanned.

Flex is a newer and faster version of Lex.

JLex is a Java version of Lex. It generates a scanner coded in Java, though its regular expression definitions are very close to those used by Lex and Flex.

Lex, Flex and JLex are largely non-procedural. You don’t need to tell the tools how to scan. All you need to tell them is what you want scanned (by giving them definitions of valid tokens).

This approach greatly simplifies building a scanner, since most of the details of scanning (I/O, buffering, character matching, etc.) are automatically handled.


109CS 536 Spring 2008 © 

JLex

JLex is coded in Java. To use it, you enter

java JLex.Main f.jlex

Your CLASSPATH should be set to search the directories where JLex’s classes are stored. (The CLASSPATH we gave you includes JLex’s classes.)

After JLex runs (assuming there are no errors in your token specifications), the Java source file f.jlex.java is created. (f stands for any file name you choose. Thus csx.jlex might hold token definitions for CSX, and csx.jlex.java would hold the generated scanner.)

110CS 536 Spring 2008 © 

You compile f.jlex.java just like any Java program, using your favorite Java compiler.

After compilation, the class file Yylex.class is created.

It contains the methods:

• Token yylex(), which is the actual scanner. The constructor for Yylex takes the file you want scanned, so new Yylex(System.in) will build a scanner that reads from System.in. Token is the token class you want returned by the scanner; you can tell JLex what class you want returned.

• String yytext() returns the character text matched by the last call to yylex.

111CS 536 Spring 2008 © 

A simple example of the use of JLex is in ~cs536-1/public/jlex

Just enter

make test
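A minimal driver using the generated scanner might look like the sketch below; following the description above, it assumes the generated class is Yylex, that yylex() returns your Token class, and that null signals end of input (details depend on how you configure JLex and define Token):

import java.io.FileInputStream;

class ScanDriver {
    public static void main(String[] args) throws Exception {
        Yylex lexer = new Yylex(new FileInputStream(args[0]));
        Token tok = lexer.yylex();                 // scan the first token
        while (tok != null) {
            // yytext() is the exact character text matched by the last yylex() call
            System.out.println(tok + " \"" + lexer.yytext() + "\"");
            tok = lexer.yylex();
        }
    }
}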

112CS 536 Spring 2008 © 

Input to JLex

There are three sections, delimited by %%. The general structure is:

User Code
%%
JLex Directives
%%
Regular Expression rules

The User Code section is Java source code to be copied into the generated Java source file. It contains utility classes or return type classes you need. Thus if you want to return a class IntlitToken (for integer literals that are scanned), you include its definition in the User Code section.


113CS 536 Spring 2008 © 

JLex directives are various instructions you can give JLex to customize the scanner you generate.

These are detailed in the JLex manual. The most important are:

• %{
  Code copied into the Yylex class (extra fields or methods you may want)
  %}

• %eof{
  Java code to be executed when the end of file is reached
  %eof}

• %type classname
  classname is the return type you want for the scanner method, yylex()

114CS 536 Spring 2008 © 

Macro Definitions

In section two you may also define macros that are used in section three. A macro allows you to give a name to a regular expression or character class. This allows you to reuse definitions and make regular expression rules more readable.

Macro definitions are of the form

name = def

Macros are defined one per line. Here are some simple examples:

Digit=[0-9]
AnyLet=[A-Za-z]

In section 3, you use a macro by placing its name within { and }. Thus {Digit} expands to the character class defining the digits 0 to 9.

115CS 536 Spring 2008 © 

Regular Expression Rules

The third section of the JLex input file is a series of token definition rules of the form

RegExpr {Java code}

When a token matching the given RegExpr is matched, the corresponding Java code (enclosed in “{“ and “}”) is executed. JLex figures out what RegExpr applies; you need only say what the token looks like (using RegExpr) and what you want done when the token is matched (this is usually to return some token object, perhaps with some processing of the token text).

116CS 536 Spring 2008 © 

Here are some examples:

"+"      {return new Token(sym.Plus);}

(" ")+   {/* skip white space */}

{Digit}+ {return new IntToken(sym.Intlit,
            new Integer(yytext()).intValue());}
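Putting the three sections together, a complete (if minimal) JLex specification might look like the sketch below. The Token and sym classes are stand-ins defined in the User Code section (in the real project those definitions come from elsewhere), and only features discussed above are used:

class sym {
    static final int Plus = 1, Intlit = 2, Identifier = 3;
}
class Token {
    int kind;
    Token(int kind) { this.kind = kind; }
    public String toString() { return "token " + kind; }
}
%%
%type Token
Digit=[0-9]
AnyLet=[A-Za-z]
%%
"+"                          {return new Token(sym.Plus);}
{AnyLet}({AnyLet}|{Digit})*  {return new Token(sym.Identifier);}
{Digit}+                     {return new Token(sym.Intlit);}
(" "|\t|\n)+                 {/* skip white space */}
.                            {/* unexpected character: lexical error */}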


117CS 536 Spring 2008 © 

Regular Expressions in JLex

To define a token in JLex, the user associates a regular expression with commands coded in Java.

When input characters that match a regular expression are read, the corresponding Java code is executed. As a user of JLex you don’t need to tell it how to match tokens; you need only say what you want done when a particular token is matched.

Tokens like white space are deleted simply by having their associated command not return anything. Scanning continues until a command with a return in it is executed.

The simplest form of regular expression is a single string that matches exactly itself.

118CS 536 Spring 2008 © 

For example,

if   {return new Token(sym.If);}

If you wish, you can quote the string representing the reserved word ("if"), but since the string contains no delimiters or operators, quoting it is unnecessary.

For a regular expression operator, like +, quoting is necessary:

"+"  {return new Token(sym.Plus);}

119CS 536 Spring 2008 © 

Character Classes

Our specification of the reserved word if, as shown earlier, is incomplete. We don’t (yet) handle upper or mixed-case.

To extend our definition, we’ll use a very useful feature of Lex and JLex: character classes.

Characters often naturally fall into classes, with all characters in a class treated identically in a token definition. In our definition of identifiers all letters form a class since any of them can be used to form an identifier. Similarly, in a number, any of the ten digit characters can be used.

120CS 536 Spring 2008 © 

Character classes are delimited by [ and ]; individual characters are listed without any quotation or separators. However \, ^, ] and -, because of their special meaning in character classes, must be escaped. The character class [xyz] can match a single x, y, or z.

The character class [\])] can match a single ] or ).

(The ] is escaped so that it isn’t misinterpreted as the end of the character class.)

Ranges of characters are separated by a -; [x-z] is the same as [xyz]. [0-9] is the set of all digits and [a-zA-Z] is the set of all letters, upper- and lower-case. \ is the escape character, used to represent unprintables and to escape special symbols.


121CS 536 Spring 2008 © 

Following C and Java conventions, \n is the newline (that is, end of line), \t is the tab character, \\ is the backslash symbol itself, and \010 is the character corresponding to octal 10.

The ^ symbol complements a character class (it is JLex’s representation of the Not operation).

[^xy] is the character class that matches any single character except x and y. The ^ symbol applies to all characters that follow it in a character class definition, so [^0-9] is the set of all characters that aren’t digits. [^] can be used to match all characters.

122CS 536 Spring 2008 © 

Here are some examples of character classes:

Character Class   Set of Characters Denoted
[abc]             Three characters: a, b and c
[cba]             Three characters: a, b and c
[a-c]             Three characters: a, b and c
[aabbcc]          Three characters: a, b and c
[^abc]            All characters except a, b and c
[\^\-\]]          Three characters: ^, - and ]
[^]               All characters
"[abc]"           Not a character class; this is the five character string [abc]

123CS 536 Spring 2008 © 

Regular Operators in JLex

JLex provides the standard regular operators, plus some additions.

• Catenation is specified by the juxtaposition of two expressions; no explicit operator is used. Outside of character class brackets, individual letters and numbers match themselves; other characters should be quoted (to avoid misinterpretation as regular expression operators). Case is significant.

Regular Expr       Characters Matched
a b cd             Four characters: abcd
(a)(b)(cd)         Four characters: abcd
[ab][cd]           Four different strings: ac or ad or bc or bd
while              Five characters: while
"while"            Five characters: while
[w][h][i][l][e]    Five characters: while

124CS 536 Spring 2008 © 

• The alternation operator is |. Parentheses can be used to control grouping of subexpressions. If we wish to match the reserved word while allowing any mixture of upper- and lowercase, we can use
(w|W)(h|H)(i|I)(l|L)(e|E)
or
[wW][hH][iI][lL][eE]

Regular Expr       Characters Matched
ab|cd              Two different strings: ab or cd
(ab)|(cd)          Two different strings: ab or cd
[ab]|[cd]          Four different strings: a or b or c or d


125CS 536 Spring 2008 © 

• Postfix operators:
  * Kleene closure: 0 or more matches.
    (ab)* matches λ or ab or abab or ababab ...
  + Positive closure: 1 or more matches.
    (ab)+ matches ab or abab or ababab ...
  ? Optional inclusion:
    expr? matches expr zero times or once. expr? is equivalent to (expr) | λ and eliminates the need for an explicit λ symbol.

[-+]?[0-9]+ defines an optionally signed integer literal.

126CS 536 Spring 2008 © 

• Single match:
The character "." matches any single character (other than a newline).

• Start of line:
The character ^ (when used outside a character class) matches the beginning of a line.

• End of line:
The character $ matches the end of a line. Thus,
^A.*e$
matches an entire line that begins with A and ends with e.

127CS 536 Spring 2008 © 

Overlapping Definitions

Regular expressions may overlap (match the same input sequence).

In the case of overlap, two rulesdetermine which regularexpression is matched:

• The longest possible match is performed. JLex automatically buffers characters while deciding how many characters can be matched.

• If two expressions match exactly the same string, the earlier expression (in the JLex specification) is preferred. Reserved words, for example, are often special cases of the pattern used for identifiers. Their definitions are therefore placed before the expression that defines an identifier token, as in the fragment below.
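For example, a fragment of the rules section might order its patterns like this (the token class and sym names here are illustrative, not part of JLex itself):

"if"    {return new Token(sym.If);}
"while" {return new Token(sym.While);}
[A-Za-z][A-Za-z0-9]* {return new IdToken(sym.Id, yytext());}

Because the reserved word rules appear first, an input like if is matched by them rather than by the identifier pattern.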

128CS 536 Spring 2008 © 

Often a "catch all" pattern is placed at the very end of the regular expression rules. It is used to catch characters that don't match any of the earlier patterns and hence are probably erroneous. Recall that "." matches any single character (other than a newline). It is useful in a catch-all pattern. However, avoid a pattern like .* which will consume all characters up to the next newline.

In JLex an unmatched character will cause a run-time error.
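A minimal catch-all rule, placed after every other rule, might simply report the offending character (the error message shown is illustrative):

. {System.err.println("Illegal character: " + yytext());}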

The operators and special symbols most commonly used in JLex are summarized below. Note that a symbol sometimes has one meaning in a regular expression and an entirely different meaning


129CS 536 Spring 2008 © 

in a character class (i.e., within a pair of brackets). If you find JLex behaving unexpectedly, it's a good idea to check this table to be sure of how the operators and symbols you've used behave.

Ordinary letters and digits, and symbols not mentioned (like @) represent themselves. If you're not sure if a character is special or not, you can always escape it or make it part of a quoted string.

130CS 536 Spring 2008 © 

Symbol   Meaning in Regular Expressions                Meaning in Character Classes
(        Matches with ) to group subexpressions.       Represents itself.
)        Matches with ( to group subexpressions.       Represents itself.
[        Begins a character class.                     Represents itself.
]        Represents itself.                            Ends a character class.
{        Matches with } to signal macro expansion.     Represents itself.
}        Matches with { to signal macro expansion.     Represents itself.
"        Matches with " to delimit strings             Represents itself.
         (only \ is special within strings).
\        Escapes individual characters. Also used to   Escapes individual characters. Also used to
         specify a character by its octal code.        specify a character by its octal code.
.        Matches any one character except \n.          Represents itself.
|        Alternation (or) operator.                    Represents itself.
*        Kleene closure operator (zero or more         Represents itself.
         matches).
+        Positive closure operator (one or more        Represents itself.
         matches).
?        Optional choice operator (one or zero         Represents itself.
         matches).
/        Context sensitive matching operator.          Represents itself.
^        Matches only at beginning of a line.          Complements remaining characters in the class.
$        Matches only at end of a line.                Represents itself.
-        Represents itself.                            Range of characters operator.

132CS 536 Spring 2008 © 

Potential Problems in UsingJLex

The following differences from "standard" Lex notation appear in JLex:

• Escaped characters within quoted strings are not recognized. Hence "\n" is not a newline character. Escaped characters outside of quoted strings (\n) and escaped characters within character classes ([\n]) are OK.

• A blank should not be used within a character class (i.e., between [ and ]). You may use \040 (which is the character code for a blank).

• A doublequote must be escaped within a character class. Use [\"] instead of ["].


133CS 536 Spring 2008 © 

• Unprintables are defined to be all characters before blank as well as the last ASCII character. These can be represented as: [\000-\037\177]

134CS 536 Spring 2008 © 

JLex Examples

A JLex scanner that looks for five letter words that begin with "P" and end with "T".

This example is in ~cs536-1/public/jlex

135CS 536 Spring 2008 © 

The JLex specification file is:

class Token {
  String text;
  Token(String t){text = t;}
}
%%
Digit=[0-9]
AnyLet=[A-Za-z]
Others=[0-9'&.]
WhiteSp=[\040\n]
// Tell JLex to have yylex() return a Token
%type Token
// Tell JLex what to return when eof of file is hit
%eofval{
return new Token(null);
%eofval}
%%
[Pp]{AnyLet}{AnyLet}{AnyLet}[Tt]{WhiteSp}+  {return new Token(yytext());}
({AnyLet}|{Others})+{WhiteSp}+  {/*skip*/}

136CS 536 Spring 2008 © 

The Java program that uses the scanner is:

import java.io.*;
class Main {
  public static void main(String args[])
      throws java.io.IOException {
    Yylex lex = new Yylex(System.in);
    Token token = lex.yylex();
    while ( token.text != null ) {
      System.out.print("\t" + token.text);
      token = lex.yylex(); //get next token
    }
  }
}


137CS 536 Spring 2008 © 

In case you care, the words that are matched include:

Pabst

 paint

 petit

 pilot

pivot
plant

 pleat

 point

 posit

Pratt

 print

138CS 536 Spring 2008 © 

An example of CSX token specifications. This example is in ~cs536-1/public/proj2/startup

139CS 536 Spring 2008 © 

The JLex specification file is:

import java_cup.runtime.*;

/* Expand this into your solution for project 2 */

class CSXToken {
  int linenum;
  int colnum;
  CSXToken(int line,int col){
    linenum=line;colnum=col;};
}

class CSXIntLitToken extends CSXToken {
  int intValue;
  CSXIntLitToken(int val,int line,int col){
    super(line,col);intValue=val;};
}

class CSXIdentifierToken extends CSXToken {
  String identifierText;
  CSXIdentifierToken(String text,int line,int col){
    super(line,col);identifierText=text;};
}

140CS 536 Spring 2008 © 

class CSXCharLitToken extends CSXToken {
  char charValue;
  CSXCharLitToken(char val,int line,int col){
    super(line,col);charValue=val;};
}

class CSXStringLitToken extends CSXToken {
  String stringText;
  CSXStringLitToken(String text,int line,int col){
    super(line,col);
    stringText=text; };
}

// This class is used to track line and column numbers
// Feel free to change or extend it
class Pos {
  static int linenum = 1;
  /* maintain this as line number current token was scanned on */
  static int colnum = 1;
  /* maintain this as column number current token began at */
  static int line = 1;
  /* maintain this as line number after scanning current token */


141CS 536 Spring 2008 © 

  static int col = 1;
  /* maintain this as column number after scanning current token */
  static void setpos() {
    //set starting pos for current token
    linenum = line;
    colnum = col;
  }
}
%%
Digit=[0-9]
// Tell JLex to have yylex() return a Symbol, as JavaCUP will require
%type Symbol
// Tell JLex what to return when eof of file is hit
%eofval{
return new Symbol(sym.EOF, new CSXToken(0,0));
%eofval}

142CS 536 Spring 2008 © 

%%
"+"   {Pos.setpos(); Pos.col +=1;
       return new Symbol(sym.PLUS,
         new CSXToken(Pos.linenum,Pos.colnum));}
"!="  {Pos.setpos(); Pos.col +=2;
       return new Symbol(sym.NOTEQ,
         new CSXToken(Pos.linenum,Pos.colnum));}
";"   {Pos.setpos(); Pos.col +=1;
       return new Symbol(sym.SEMI,
         new CSXToken(Pos.linenum,Pos.colnum));}
{Digit}+  {// This def doesn't check
           // for overflow
           Pos.setpos();
           Pos.col += yytext().length();
           return new Symbol(sym.INTLIT,
             new CSXIntLitToken(
               new Integer(yytext()).intValue(),
               Pos.linenum,Pos.colnum));}
\n    {Pos.line +=1; Pos.col = 1;}
" "   {Pos.col +=1;}

143CS 536 Spring 2008 © 

The Java program that uses this scanner (P2) is:

class P2 {
  public static void main(String args[])
      throws java.io.IOException {
    if (args.length != 1) {
      System.out.println(
        "Error: Input file must be named on command line." );
      System.exit(-1);
    }
    java.io.FileInputStream yyin = null;
    try {
      yyin = new java.io.FileInputStream(args[0]);
    } catch (FileNotFoundException notFound){
      System.out.println(
        "Error: unable to open input file.");
      System.exit(-1);
    }
    // lex is a JLex-generated scanner that
    // will read from yyin
    Yylex lex = new Yylex(yyin);
    System.out.println("Begin test of CSX scanner.");

144CS 536 Spring 2008 © 

/**********************************
 You should enter code here that thoroughly tests your scanner.
 Be sure to test extreme cases, like very long symbols or lines,
 illegal tokens, unrepresentable integers, illegal strings, etc.
 The following is only a starting point.
***********************************/
    Symbol token = lex.yylex();
    while ( token.sym != sym.EOF ) {
      System.out.print(
        ((CSXToken) token.value).linenum + ":"
        + ((CSXToken) token.value).colnum + " ");
      switch (token.sym) {
        case sym.INTLIT:
          System.out.println("\tinteger literal(" +
            ((CSXIntLitToken) token.value).intValue + ")");
          break;
        case sym.PLUS:
          System.out.println("\t+");
          break;


145CS 536 Spring 2008 © 

        case sym.NOTEQ:
          System.out.println("\t!=");
          break;
        default:
          throw new RuntimeException();
      }
      token = lex.yylex(); // get next token
    }
    System.out.println("End test of CSX scanner.");
  }
}

146CS 536 Spring 2008 © 

Other Scanner Issues

We will consider other practical issues in building real scanners for real programming languages.

Our finite automaton model sometimes needs to be augmented. Moreover, error handling must be incorporated into any practical scanner.

147CS 536 Spring 2008 © 

Identifiers vs. Reserved Words

Most programming languages contain reserved words like if, while, switch, etc. These tokens look like ordinary identifiers, but aren't.

It is up to the scanner to decide if what looks like an identifier is really a reserved word. This distinction is vital as reserved words have different token codes than identifiers and are parsed differently.

How can a scanner decide which tokens are identifiers and which are reserved words?

148CS 536 Spring 2008 © 

• We can scan identifiers and reserved words using the same pattern, and then look up the token in a special "reserved word" table (see the sketch after this list).

• It is known that any regular expression may be complemented to obtain all strings not in the original regular expression. Thus the complement of A is regular if A is. Using complementation we can write a regular expression for nonreserved identifiers:

ident ∩ Not( if | while | … )

Since scanner generators don't usually support complementation of regular expressions, this approach is more of theoretical than practical interest.


149CS 536 Spring 2008 © 

• We can give distinct regular expression definitions for each reserved word, and for identifiers. Since the definitions overlap (if will match a reserved word and the general identifier pattern), we give priority to reserved words. Thus a token is scanned as an identifier if it matches the identifier pattern and does not match any reserved word pattern. This approach is commonly used in scanner generators like Lex and JLex.
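A minimal sketch of the table-lookup approach from the first bullet, in Java (the token-code names and class are illustrative, not CSX's actual definitions):

import java.util.HashMap;
import java.util.Map;

// Scan every identifier-like token with one pattern, then decide
// with a table lookup whether it is really a reserved word.
class ReservedWords {
    static final int IF = 1, WHILE = 2, ID = 3;   // assumed token codes
    static final Map<String, Integer> TABLE = new HashMap<>();
    static {
        TABLE.put("if", IF);
        TABLE.put("while", WHILE);
    }
    // Return the reserved-word code if text is reserved,
    // otherwise the generic identifier code.
    static int codeFor(String text) {
        return TABLE.getOrDefault(text, ID);
    }
}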

150CS 536 Spring 2008 © 

Converting Token Values

For some tokens, we may need to convert from string form into numeric or binary form.

For example, for integers, we need to transform a string of digits into the internal (binary) form of integers.

We know the format of the token is valid (the scanner checked this), but:

• The string may represent an integer too large to represent in 32 or 64 bit form.

• Languages like CSX and ML use a non-standard representation for negative values (~123 instead of -123)

151CS 536 Spring 2008 © 

We can safely convert from string to integer form by first converting the string to double form, checking against max and min int, and then converting to int form if the value is representable.

Thus d = new Double(str) will create an object d containing the value of str in double form. If str is too large or too small to be represented as a double, plus or minus infinity is automatically substituted.

d.doubleValue() will give d's value as a Java double, which can be compared against Integer.MAX_VALUE or Integer.MIN_VALUE.

152CS 536 Spring 2008 © 

If d.doubleValue() represents a valid integer, (int) d.doubleValue() will create the appropriate integer value.

If a string representation of an integer begins with a "~" we can strip the "~", convert to a double and then negate the resulting value.
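Putting these steps together, a conversion routine might look roughly like this (the method name and the error handling are illustrative, not part of the project code):

// Convert a scanned integer literal (possibly "~"-prefixed) to an int,
// using double as an intermediate form, as described above.
static int convertIntLit(String str) {
    boolean negate = str.startsWith("~");
    String digits = negate ? str.substring(1) : str;
    double d = new Double(digits).doubleValue(); // becomes +infinity if huge
    if (negate) d = -d;
    if (d > Integer.MAX_VALUE || d < Integer.MIN_VALUE) {
        System.err.println("Integer literal out of range: " + str);
        return 0; // placeholder value; a real scanner might also flag an error
    }
    return (int) d;
}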


153CS 536 Spring 2008 © 

Scanner Termination

A scanner reads input characters and partitions them into tokens.

What happens when the end of the input file is reached? It may be useful to create an Eof pseudo-character when this occurs. In Java, for example, InputStream.read(), which reads a single byte, returns -1 when end of file is reached. A constant, EOF, defined as -1 can be treated as an "extended" ASCII character. This character then allows the definition of an Eof token that can be passed back to the parser.

An Eof token is useful because it allows the parser to verify that the logical end of a program corresponds to its physical end.
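A toy illustration of the EOF pseudo-character idea (the class and method names are made up for this example):

import java.io.InputStream;
import java.io.IOException;

class EofDemo {
    static final int EOF = -1;  // pseudo-character for end of file

    // Count characters until the EOF pseudo-character appears.
    static int countChars(InputStream in) throws IOException {
        int count = 0;
        int c = in.read();        // read() returns -1 at end of file
        while (c != EOF) {
            count++;
            c = in.read();
        }
        return count;
    }
}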

154CS 536 Spring 2008 © 

Most parsers require an end of file token.

Lex and JLex automatically create an Eof token when the scanner they build tries to scan an EOF character (or tries to scan when eof() is true).

155CS 536 Spring 2008 © 

Multi Character Lookahead

We may allow finite automata to look beyond the next input character.

This feature is necessary to implement a scanner for FORTRAN.

In FORTRAN, the statement
DO 10 J = 1,100
specifies a loop, with index J ranging from 1 to 100. The statement
DO 10 J = 1.100
is an assignment to the variable DO10J. (Blanks are not significant except in strings.)

A FORTRAN scanner decides whether the O is the last character of a DO token only after reading as far as the comma (or period).

156CS 536 Spring 2008 © 

A milder form of extended lookahead problem occurs in Pascal and Ada. The token 10.50 is a real literal, whereas 10..50 is three different tokens.

We need two-character lookahead after the 10 prefix to decide whether we are to return 10 (an integer literal) or 10.50 (a real literal).


157CS 536 Spring 2008 © 

Suppose we use the following FA.

[FA diagram omitted: digits, then a '.', then more digits; the state reached just after the '.' is not accepting.]

Given 10..100 we scan three characters and stop in a non-accepting state.

Whenever we stop reading in a non-accepting state, we back up along accepted characters until an accepting state is found.

Characters we back up over are rescanned to form later tokens. If no accepting state is reached during backup, we have a lexical error.

158CS 536 Spring 2008 © 

Performance Considerations

Because scanners do so much character-level processing, they can be a real performance bottleneck in production compilers.

Speed is not a concern in our project, but let's see why scanning speed can be a concern in production compilers.

Let's assume we want to compile at a rate of 1000 lines/sec. (so that most programs compile in just a few seconds).

Assuming 30 characters/line (on average), we need to scan 30,000 char/sec.

159CS 536 Spring 2008 © 

On a 30 SPECmark machine (30 million instructions/sec.), we have 1000 instructions per character to spend on all compiling steps.

If we allow 25% of compiling to be scanning (a compiler has a lot more to do than just scan!), that's just 250 instructions per character.

A key to efficient scanning is to group character-level operations whenever possible. It is better to do one operation on n characters rather than n operations on single characters.

In our examples we've read input one character at a time. A subroutine call can cost hundreds or thousands of instructions to execute—far too much to spend on a single character.

160CS 536 Spring 2008 © 

We prefer routines that do block reads, putting an entire block of characters directly into a buffer.
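As a rough illustration (the buffer size and per-character work are arbitrary), a block read in Java fills a buffer in one call instead of making one call per character:

import java.io.FileInputStream;
import java.io.IOException;

class BlockRead {
    public static void main(String[] args) throws IOException {
        byte[] buffer = new byte[4096];
        try (FileInputStream in = new FileInputStream(args[0])) {
            int count = in.read(buffer);   // one call fills up to 4096 bytes
            while (count > 0) {
                for (int i = 0; i < count; i++) {
                    // per-character work stays in a tight loop,
                    // with no call overhead per character
                }
                count = in.read(buffer);
            }
        }
    }
}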

Specialized scanner generators can produce particularly fast scanners.

The GLA scanner generator claims that the scanners it produces run as fast as:

while(c != Eof) {
   c = getchar();
}


161CS 536 Spring 2008 © 

Lexical Error Recovery

A character sequence that can't be scanned into any valid token is a lexical error.

Lexical errors are uncommon, but they still must be handled by a scanner. We won't stop compilation because of so minor an error.

Approaches to lexical error handling include:

• Delete the characters read so far and restart scanning at the next unread character.

• Delete the first character read by the scanner and resume scanning at the character following it.

Both of these approaches are reasonable.

162CS 536 Spring 2008 © 

The first is easy to do. We just reset the scanner and begin scanning anew.

The second is a bit harder but also is a bit safer (less is immediately deleted). It can be implemented using scanner backup.

Usually, a lexical error is caused by the appearance of some illegal character, mostly at the beginning of a token.

(Why at the beginning?)

In these cases, the two approaches are equivalent.

The effects of lexical error recovery might well create a later syntax error, handled by the parser.

163CS 536 Spring 2008 © 

Consider

...for$tnight...

The $ terminates scanning of for. Since no valid token begins with $, it is deleted. Then tnight is scanned as an identifier. In effect we get

...for tnight...

which will cause a syntax error. Such "false errors" are unavoidable, though a syntactic error-repair may help.

164CS 536 Spring 2008 © 

Error Tokens

Certain lexical errors require special care. In particular, runaway strings and runaway comments ought to receive special error messages.

In Java strings may not cross line boundaries, so a runaway string is detected when an end of a line is read within the string body. Ordinary recovery rules are inappropriate for this error. In particular, deleting the first character (the double quote character) and restarting scanning is a bad decision.

It will almost certainly lead to a cascade of "false" errors as the string text is inappropriately scanned as ordinary input.


165CS 536 Spring 2008 © 

One way to handle runaway strings is to define an error token.

An error token is not a valid token; it is never returned to the parser. Rather, it is a pattern for an error condition that needs special handling. We can define an error token that represents a string terminated by an end of line rather than a double quote character.

For a valid string, in which internal double quotes and backslashes are escaped (and no other escaped characters are allowed), we can use

" ( Not( " | Eol | \ ) | \" | \\ )* "

For a runaway string we use

" ( Not( " | Eol | \ ) | \" | \\ )* Eol

(Eol is the end of line character.)
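In JLex-style notation these two patterns might be written roughly as follows; this is only a sketch, and the exact escaping details may need adjustment for JLex:

\"([^\"\n\\]|\\\"|\\\\)*\"   {/* valid string: return a string token */}
\"([^\"\n\\]|\\\"|\\\\)*\n   {/* runaway string: issue the special error message */}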

166CS 536 Spring 2008 © 

When a runaway string token is recognized, a special error message should be issued.

Further, the string may be "repaired" into a correct string by returning an ordinary string token with the closing Eol replaced by a double quote.

This repair may or may not be "correct." If the closing double quote is truly missing, the repair will be good; if it is present on a succeeding line, a cascade of inappropriate lexical and syntactic errors will follow.

Still, we have told the programmer exactly what is wrong, and that is our primary goal.

167CS 536 Spring 2008 © 

In languages like C, C++, Java and CSX, which allow multiline comments, improperly terminated (runaway) comments present a similar problem.

A runaway comment is not detected until the scanner finds a close comment symbol (possibly belonging to some other comment) or until the end of file is reached. Clearly a special, detailed error message is required.

Let's look at Pascal-style comments that begin with a { and end with a }. Comments that begin and end with a pair of characters, like /* and */ in Java, C and C++, are a bit trickier.

168CS 536 Spring 2008 © 

Correct Pascal comments are defined quite simply:

{ Not( } )* }

To handle comments terminated by Eof, this error token can be used:

{ Not( } )* Eof

We want to handle comments unexpectedly closed by a close comment belonging to another comment:

{... missing close comment
... { normal comment }...

We will issue a warning (this form of comment is lexically legal). Any comment containing an open comment symbol in its body is most probably a missing } error.


169CS 536 Spring 2008 © 

We split our legal comment definition into two token definitions.

The definition that accepts an open comment in its body causes a warning message ("Possible unclosed comment") to be printed.

We now use:

{ Not( { | } )* }
and
{ (Not( { | } )* { Not( { | } )* )+ }

The first definition matches correct comments that do not contain an open comment in their body.

The second definition matches correct, but suspect, comments that contain at least one open comment in their body.

170CS 536 Spring 2008 © 

Single line comments, found in Java, CSX and C++, are terminated by Eol.

They can fall prey to a more subtle error—what if the last line has no Eol at its end?

The solution? Another error token for single line comments:

// Not(Eol)*

This rule will only be used for comments that don't end with an Eol, since scanners always match the longest rule possible.

171CS 536 Spring 2008 © 

Regular Expressions and FiniteAutomata

Regular expressions are fully equivalent to finite automata.

The main job of a scanner generator like JLex is to transform a regular expression definition into an equivalent finite automaton.

It first transforms a regular expression into a nondeterministic finite automaton (NFA).

Unlike ordinary deterministic finite automata, an NFA need not make a unique (deterministic) choice of a successor state to visit. As shown below, an NFA is allowed to have a state that has two transitions (arrows) coming out of it, labeled by the same

172CS 536 Spring 2008 © 

symbol. An NFA may also have transitions labeled with λ.

Transitions are normally labeled with individual characters in Σ, and although λ is a string (the string with no characters in it), it is definitely not a character. In the above example, when the automaton is in the state at the left and the next input character is a, it may choose to use the transition labeled a or first follow

[NFA diagram omitted: from the leftmost state, two transitions both labeled a leave the same state, and a λ transition leads to a state with a further a transition.]


173CS 536 Spring 2008 © 

the λ transition (you can always find λ wherever you look for it) and then follow an a transition. FAs that contain no λ transitions and that always have unique successor states for any symbol are deterministic.

174CS 536 Spring 2008 © 

Building Finite Automata FromRegular Expressions

We make an FA from a regular expression in two steps:

• Transform the regular expression into an NFA.
• Transform the NFA into a deterministic FA.

The first step is easy.

Regular expressions are all built out of the atomic regular expressions a (where a is a character in Σ) and λ by using the three operations
A B and A | B and A*.

175CS 536 Spring 2008 © 

Other operations (like A+) are just abbreviations for combinations of these.

NFAs for a and λ are trivial:

[NFA diagrams omitted: a single transition labeled a into an accepting state, and a single transition labeled λ into an accepting state.]

Suppose we have NFAs for A and B and want one for A | B. We construct the NFA shown below:

[NFA diagram omitted: a new start state with λ transitions into the automata for A and for B, and λ transitions from their accepting states into a new accepting state.]

176CS 536 Spring 2008 © 

The states labeled A and B were the accepting states of the automata for A and B; we create a new accepting state for the combined automaton.

A path through the top automaton accepts strings in A, and a path through the bottom automaton accepts strings in B, so the whole automaton matches A | B.

The construction for A B is even easier. The accepting state of the combined automaton is the same state that was the accepting state of B. We must follow a path through A's automaton, then through B's automaton, so overall A B is matched.

We could also just merge the accepting state of A with the initial state of B. We chose not to


177CS 536 Spring 2008 © 

only because the picture would be more difficult to draw.

[NFA diagram omitted: the automaton for A followed by a λ transition into the automaton for B.]

178CS 536 Spring 2008 © 

Finally, let's look at the NFA for A*. The start state reaches an accepting state via λ, so λ is accepted. Alternatively, we can follow a path through the FA for A one or more times, so zero or more strings that belong to A are matched.

[NFA diagram omitted: λ transitions allow skipping the automaton for A entirely and looping from its accepting state back to its start state.]

179CS 536 Spring 2008 © 

Creating DeterministicAutomata

The transformation from an NFA N to an equivalent DFA D works by what is sometimes called the subset construction.

Each state of D corresponds to a set of states of N.

The idea is that D will be in state {x, y, z} after reading a given input string if and only if N could be in any one of the states x, y, or z, depending on the transitions it chooses. Thus D keeps track of all the possible routes N might take and runs them simultaneously.

Because N is a finite automaton, it has only a finite number of states. The number of subsets of N's states is also finite, which makes

180CS 536 Spring 2008 © 

tracking various sets of states feasible.

An accepting state of D will be any set containing an accepting state of N, reflecting the convention that N accepts if there is any way it could get to its accepting state by choosing the "right" transitions.

The start state of D is the set of all states that N could be in without reading any input characters—that is, the set of states reachable from the start state of N following only λ transitions. Algorithm close computes those states that can be reached following only λ transitions.

Once the start state of D is built, we begin to create successor states:


181CS 536 Spring 2008 © 

We take each state S of D, and each character c, and compute S's successor under c.

S is identified with some set of N's states, {n1, n2,...}.

We find all the possible successor states to {n1, n2,...} under c, obtaining a set {m1, m2,...}.

Finally, we compute T = CLOSE({m1, m2,...}). T becomes a state in D, and a transition from S to T labeled with c is added to D.

We continue adding states and transitions to D until all possible successors to existing states are added.

Because each state corresponds to a finite subset of N's states, the

182CS 536 Spring 2008 © 

process of adding new states to D must eventually terminate.

Here is the algorithm for λ-closure, called close. It starts with a set of NFA states, S, and adds to S all states reachable from S using only λ transitions.

void close(NFASet S) {
   while (x in S and x →λ y
          and y notin S) {
      S = S U {y}
   }
}
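A Java rendering of the same closure computation might look like this (NFAState and its lambdaSuccessors() accessor are assumed names, not part of any particular library):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;

class Closure {
    // Add to S every state reachable from S using only lambda transitions.
    // A worklist replaces the "pick any x" loop of the pseudocode.
    static void close(Set<NFAState> S) {
        Deque<NFAState> work = new ArrayDeque<>(S);
        while (!work.isEmpty()) {
            NFAState x = work.pop();
            for (NFAState y : x.lambdaSuccessors()) {
                if (S.add(y)) {      // add() returns true only for new states
                    work.push(y);
                }
            }
        }
    }
}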

Using close, we can define the construction of a DFA, D, from an NFA, N:

183CS 536 Spring 2008 © 

DFA MakeDeterministic(NFA N) {
   DFA D; NFASet T
   D.StartState = { N.StartState }
   close(D.StartState)
   D.States = { D.StartState }
   while (states or transitions can be added to D) {
      Choose any state S in D.States
         and any character c in Alphabet
      T = {y in N.States such that
           x →c y for some x in S}
      close(T);
      if (T notin D.States) {
         D.States = D.States U {T}
         D.Transitions = D.Transitions U
            {the transition S →c T}
      }
   }
   D.AcceptingStates =
      { S in D.States such that an
        accepting state of N is in S}
}

184CS 536 Spring 2008 © 

Example

To see how the subset construction operates, consider the following NFA:

[NFA diagram omitted: a five-state NFA with start state 1, a λ transition from 1 to 2, a b self-loop on state 1, a and b transitions among states 2 through 5, and accepting state 5.]

We start with state 1, the start state of N, and add state 2, its λ-successor.

D's start state is {1,2}.

Under a, {1,2}'s successor is {3,4,5}.


185CS 536 Spring 2008 © 

State 1 has itself as a successor under b. When state 1's λ-successor, 2, is included, {1,2}'s successor is {1,2}. {3,4,5}'s successors under a and b are {5} and {4,5}.

{4,5}'s successor under b is {5}. Accepting states of D are those state sets that contain N's accepting state, which is 5.

The resulting DFA is:

[DFA diagram omitted: states {1,2}, {3,4,5}, {4,5} and {5}; {1,2} loops to itself on b and goes to {3,4,5} on a, and the states containing 5 are accepting.]

186CS 536 Spring 2008 © 

It is not too difficult to establish that the DFA constructed by MakeDeterministic is equivalent to the original NFA.

The idea is that each path to an accepting state in the original NFA has a corresponding path in the DFA. Similarly, all paths through the constructed DFA correspond to paths in the original NFA.

What is less obvious is the fact that the DFA that is built can sometimes be much larger than the original NFA. States of the DFA are identified with sets of NFA states.

If the NFA has n states, there are 2^n distinct sets of NFA states, and hence the DFA may have as many as 2^n states. Certain NFAs actually

187CS 536 Spring 2008 © 

exhibit this exponential blowup in size when made deterministic. Fortunately, the NFAs built from the kind of regular expressions used to specify programming language tokens do not exhibit this problem when they are made deterministic.

As a rule, DFAs used for scanning are simple and compact.

If creating a DFA is impractical (because of size or speed-of-generation concerns), we can scan using an NFA. Each possible path through an NFA is tracked, and reachable accepting states are identified. Scanning is slower using this approach, so it is used only when construction of a DFA is not practical.

188CS 536 Spring 2008 © 

Optimizing Finite Automata

We can improve the DFA created by MakeDeterministic.

Sometimes a DFA will have more states than necessary. For every DFA there is a unique smallest equivalent DFA (fewest states possible).

Some DFAs contain unreachable states that cannot be reached from the start state.

Other DFAs may contain dead states that cannot reach any accepting state.

It is clear that neither unreachable states nor dead states can participate in scanning any valid token. We therefore eliminate all such states as part of our optimization process.


189CS 536 Spring 2008 © 

We optimize a DFA by merging together states we know to be equivalent.

For example, two accepting states that have no transitions at all out of them are equivalent.

Why? Because they behave exactly the same way—they accept the string read so far, but will accept no additional characters.

If two states, s1 and s2, are equivalent, then all transitions to s2 can be replaced with transitions to s1. In effect, the two states are merged together into one common state.

How do we decide what states to merge together?

190CS 536 Spring 2008 © 

We take a greedy approach and try the most optimistic merger of states. By definition, accepting and non-accepting states are distinct, so we initially try to create only two states: one representing the merger of all accepting states and the other representing the merger of all non-accepting states.

This merger into only two states is almost certainly too optimistic. In particular, all the constituents of a merged state must agree on the same transition for each possible character. That is, for character c all the merged states must have no successor under c or they must all go to a single (possibly merged) state.

If all constituents of a merged state do not agree on the

191CS 536 Spring 2008 © 

transition to follow for some character, the merged state is split into two or more smaller states that do agree.

As an example, assume we start with the following automaton:

[DFA diagram omitted: states 1 through 7 with accepting states 4 and 7; state 1 goes to 2 on a and to 5 on d, states 2 and 5 go to 3 and 6 on b, and states 3 and 6 go to 4 and 7 on c.]

Initially we have a merged non-accepting state {1,2,3,5,6} and a merged accepting state {4,7}.

A merger is legal if and only if all constituent states agree on the same successor state for all characters. For example, states 3 and 6 would go to an accepting state given character c; states 1, 2, 5 would not, so a split must occur.

192CS 536 Spring 2008 © 

We will add an error state sE to the original DFA that is the successor state under any illegal character. (Thus reaching sE becomes equivalent to detecting an illegal token.) sE is not a real state; rather it allows us to assume every state has a successor under every character. sE is never merged with any real state.

Algorithm Split, shown below, splits merged states whose constituents do not agree on a common successor state for all characters. When Split terminates, we know that the states that remain merged are equivalent in that they always agree on common successors.


193CS 536 Spring 2008 © 

Split(FASet StateSet) {
   repeat
      for (each merged state S in StateSet) {
         Let S correspond to {s1,...,sn}
         for (each char c in Alphabet) {
            Let t1,...,tn be the successor states
               to s1,...,sn under c
            if (t1,...,tn do not all belong to
                the same merged state) {
               Split S into two or more new states
               such that si and sj remain in the
               same merged state if and only if
               ti and tj are in the same merged state
            }
         }
      }
   until no more splits are possible
}

194CS 536 Spring 2008 © 

Returning to our example, we initially have states {1,2,3,5,6} and {4,7}. Invoking Split, we first observe that states 3 and 6 have a common successor under c, and states 1, 2, and 5 have no successor under c (equivalently, have the error state sE as a successor).

This forces a split, yielding {1,2,5}, {3,6} and {4,7}.

Now, for character b, states 2 and 5 would go to the merged state {3,6}, but state 1 would not, so another split occurs.

We now have: {1}, {2,5}, {3,6} and {4,7}.

At this point we are done, as all constituents of merged states agree on the same successor for each input symbol.

195CS 536 Spring 2008 © 

Once Split is executed, we are essentially done. Transitions between merged states are the same as the transitions between states in the original DFA.

Thus, if there was a transition between state si and sj under character c, there is now a transition under c from the merged state containing si to the merged state containing sj. The start state is that merged state containing the original start state.

Accepting states are those merged states containing accepting states (recall that accepting and non-accepting states are never merged).

196CS 536 Spring 2008 © 

Returning to our example, the minimum state automaton we obtain is

[DFA diagram omitted: {1} goes to {2,5} on a or d, {2,5} goes to {3,6} on b, and {3,6} goes to the accepting state {4,7} on c.]


197CS 536 Spring 2008 © 

Properties of Regular Expressions and Finite Automata

• Some token patterns can't be defined as regular expressions or finite automata. Consider the set of balanced brackets of the form [ [ [ … ] ] ]. This set is defined formally as
{ [^m ]^m | m ≥ 1 }.
This set is not regular. No finite automaton that recognizes exactly this set can exist. Why? Consider the inputs [, [[, [[[, ... For two different counts (call them i and j), [^i and [^j must reach the same state of a given FA! (Why?) Once that happens, we know that if [^i ]^i is accepted (as it should be), then [^j ]^i will also be accepted (and that should not happen).

198CS 536 Spring 2008 © 

• The complement of R, Not(R) = V* - R, is regular if R is. Why? Build a finite automaton for R. Be careful to include transitions to an "error state" sE for illegal characters. Now invert final and non-final states. What was previously accepted is now rejected, and what was rejected is now accepted. That is, Not(R) is accepted by the modified automaton.

• Not all subsets of a regular set are themselves regular. The regular expression [+]+ has a subset that isn't regular. (What is that subset?)

199CS 536 Spring 2008 © 

• Let R be a set of strings. Define R^rev as all strings in R, in reversed (backward) character order. Thus if R = {abc, def} then R^rev = {cba, fed}. If R is regular, then R^rev is too. Why? Build a finite automaton for R. Make sure the automaton has only one final state. Now reverse the direction of all transitions, and interchange the start and final states. What does the modified automaton accept?

200CS 536 Spring 2008 © 

• If R1 and R2 are both regular, then R1 ∩ R2 is also regular. We can show this two different ways:

1. Build two finite automata, one for R1 and one for R2. Pair together states of the two automata to match R1 and R2 simultaneously. The paired-state automaton accepts only if both R1 and R2 would, so R1 ∩ R2 is matched.

2. We can use the fact that R1 ∩ R2 = Not( Not(R1) | Not(R2) ). We already know union and complementation are regular.


201CS 536 Spring 2008 © 

Reading Assignment

• Read Chapter 4 of Crafting a Compiler, Second Edition.

202CS 536 Spring 2008 © 

Context Free Grammars

A context-free grammar (CFG) is defined as:

• A finite terminal set Vt; these are the tokens produced by the scanner.

• A set of intermediate symbols, called non-terminals, Vn.

• A start symbol, a designated non-terminal, that starts all derivations.

• A set of productions (sometimes called rewriting rules) of the form
A → X1 ... Xm
X1 to Xm may be any combination of terminals and non-terminals. If m = 0 we have A → λ, which is a valid production.

203CS 536 Spring 2008 © 

Example

Prog → { Stmts }
Stmts → Stmts ; Stmt
Stmts → Stmt
Stmt → id = Expr
Expr → id
Expr → Expr + id

204CS 536 Spring 2008 © 

Often more than one production shares the same left-hand side. Rather than repeat the left-hand side, an "or notation" is used:

Prog → { Stmts }
Stmts → Stmts ; Stmt
      | Stmt
Stmt → id = Expr
Expr → id
      | Expr + id


205CS 536 Spring 2008 © 

Derivations

Starting with the start symbol,non-terminals are rewritten usingproductions until only terminalsremain.

Any terminal sequence that canbe generated in this manner issyntactically valid.

If a terminal sequence can’t begenerated using the productionsof the grammar it is invalid (hassyntax errors).

The set of strings derivable fromthe start symbol is the languageof the grammar (sometimesdenoted L(G)).

206CS 536 Spring 2008 © 

For example, starting at Prog we generate a terminal sequence by repeatedly applying productions:

Prog

{ Stmts }

{ Stmts ; Stmt }

{ Stmt ; Stmt }

{ id = Expr ; Stmt }

{ id = id ; Stmt }

{ id = id ; id = Expr }

{ id = id ; id = Expr + id}

{ id = id ; id = id + id}

207CS 536 Spring 2008 © 

Parse Trees

To illustrate a derivation, we can draw a derivation tree (also called a parse tree):

[Parse tree omitted: Prog at the root expands to { Stmts }; Stmts expands to Stmts ; Stmt; each Stmt expands to id = Expr; the final Expr expands to Expr + id.]

208CS 536 Spring 2008 © 

An abstract syntax tree (AST) shows essential structure but eliminates unnecessary delimiters and intermediate symbols:

[AST omitted: a Prog node over a Stmts list of two assignments; the second assignment's right-hand side is a + node over two ids.]


209CS 536 Spring 2008 © 

If A → γ is a production then αAβ ⇒ αγβ, where ⇒ denotes a one step derivation (using production A → γ).

We extend ⇒ to ⇒+ (derives in one or more steps), and ⇒* (derives in zero or more steps).

We can show our earlierderivation asProg ⇒{ Stmts } ⇒{ Stmts ; Stmt } ⇒{ Stmt ; Stmt } ⇒{ id = Expr ; Stmt } ⇒{ id = id ; Stmt } ⇒{ id = id ; id = Expr } ⇒

{ id = id ; id = Expr + id} ⇒{ id = id ; id = id + id}

Prog ⇒+ { id = id ; id = id + id}

210CS 536 Spring 2008 © 

When deriving a token sequence, if more than one non-terminal is present, we have a choice of which to expand next.

We must specify, at each step, which non-terminal is expanded, and what production is applied.

For simplicity we adopt a convention on what non-terminal is expanded at each step. We can choose the leftmost possible non-terminal at each step.

A derivation that follows this rule is a leftmost derivation.

If we know a derivation is leftmost, we need only specify what productions are used; the choice of non-terminal is always fixed.

211CS 536 Spring 2008 © 

To denote derivations that are leftmost, we use ⇒L, ⇒+L, and ⇒*L.

The production sequence discovered by a large class of parsers (the top-down parsers) is a leftmost derivation, hence these parsers produce a leftmost parse.

Prog ⇒L

{ Stmts } ⇒L

{ Stmts ; Stmt } ⇒L

{ Stmt ; Stmt } ⇒L{ id = Expr ; Stmt } ⇒L

{ id = id ; Stmt } ⇒L

{ id = id ; id = Expr } ⇒L

{ id = id ; id = Expr + id} ⇒L

{ id = id ; id = id + id}

Prog ⇒L+ { id = id ; id = id + id}

212CS 536 Spring 2008 © 

Rightmost Derivations

A rightmost derivation is analternative to a leftmostderivation. Now the rightmostnon-terminal is always expanded.

This derivation sequence mayseem less intuitive given ournormal left-to-right bias, but itcorresponds to an important classof parsers (the bottom-up parsers,including CUP).

As a bottom-up parser discovers

the productions used to derive atoken sequence, it discovers arightmost derivation, but inreverse order .

The last production applied in arightmost derivation is the firstthat is discovered. The firstproduction used, involving thestart symbol, is discovered last.


213CS 536 Spring 2008 © 

The sequence of productions recognized by a bottom-up parser is a rightmost parse.

It is the exact reverse of the production sequence that represents a rightmost derivation.

For rightmost derivations, we use the notation ⇒R, ⇒+R, and ⇒*R.

Prog ⇒R

{ Stmts } ⇒R

{ Stmts ; Stmt } ⇒R

{ Stmts ; id = Expr } ⇒R

{ Stmts ; id = Expr + id } ⇒R

{ Stmts ; id = id + id } ⇒R

{ Stmt ; id = id + id } ⇒R

{ id = Expr ; id = id + id } ⇒R

{ id = id ; id = id + id}

Prog ⇒+ { id = id ; id = id + id}

214CS 536 Spring 2008 © 

You can derive the same set of tokens using leftmost and rightmost derivations; the only difference is the order in which productions are used.

215CS 536 Spring 2008 © 

Ambiguous Grammars

Some grammars allow more than one parse tree for the same token sequence. Such grammars are ambiguous. Because compilers use syntactic structure to drive translation, ambiguity is undesirable—it may lead to an unexpected translation.

Consider

E → E - E
  | id

When parsing the input a-b-c (where a, b and c are scanned as identifiers) we can build the following two parse trees:

216CS 536 Spring 2008 © 

[Two parse trees omitted: one groups a-b-c as (a-b)-c, the other as a-(b-c).]

The effect is to parse a-b-c as either (a-b)-c or a-(b-c). These two groupings are certainly not equivalent.

Ambiguous grammars are usually avoided in building compilers; the tools we use, like Yacc and CUP, strongly prefer unambiguous grammars.

To correct this ambiguity, we use

E → E - id
  | id


217CS 536 Spring 2008 © 

Now a-b-c can only be parsed as:

[Parse tree omitted: E expands to E - id, whose left E again expands to E - id, grouping the expression as (a-b)-c.]

218CS 536 Spring 2008 © 

Operator Precedence

Most programming languages have operator precedence rules that state the order in which operators are applied (in the absence of explicit parentheses). Thus in C and Java and CSX, a+b*c means compute b*c, then add in a.

These operator precedence rules can be incorporated directly into a CFG.

Consider

E → E + T
  | T
T → T * P
  | P
P → id
  | ( E )

219CS 536 Spring 2008 © 

Does a+b*c mean (a+b)*c or a+(b*c)?

The grammar tells us! Look at the derivation tree:

[Derivation tree omitted: E expands to E + T; that T expands to T * P, so b*c is grouped in one subtree, which is then added to a.]

The other grouping can't be obtained unless explicit parentheses are used. (Why?)

220CS 536 Spring 2008 © 

Java CUP

Java CUP is a parser-generation tool, similar to Yacc.

CUP builds a Java parser for LALR(1) grammars from production rules and associated Java code fragments.

When a particular production is recognized, its associated code fragment is executed (typically to build an AST).

CUP generates a Java source file parser.java. It contains a class parser, with a method

Symbol parse()

The Symbol returned by the parser is associated with the grammar's start symbol and contains the AST for the whole source program.


221CS 536 Spring 2008 © 

The file sym.java is also built for use with a JLex-built scanner (so that both scanner and parser use the same token codes).

If an unrecovered syntax error occurs, Exception() is thrown by the parser.

CUP and Yacc accept exactly the same class of grammars—all LL(1) grammars, plus many useful non-LL(1) grammars.

CUP is called as

java java_cup.Main < file.cup

222CS 536 Spring 2008 © 

Java CUP Specifications

Java CUP specifications are of the form:

• Package and import specifications
• User code additions
• Terminal and non-terminal declarations
• A context-free grammar, augmented with Java code fragments

Package and Import Specifications

You define a package name as:
package name ;

You add imports to be used as:
import java_cup.runtime.*;

223CS 536 Spring 2008 © 

User Code Additions

You may define Java code to be included within the generated parser:

action code {: /*java code */ :}
This code is placed within the generated action class (which holds user-specified production actions).

parser code {: /*java code */ :}
This code is placed within the generated parser class.

init with {: /*java code */ :}
This code is used to initialize the generated parser.

scan with {: /*java code */ :}
This code is used to tell the generated parser how to get tokens from the scanner.

224CS 536 Spring 2008 © 

Terminal and Non-terminal Declarations

You define terminal symbols you will use as:
terminal classname name1, name2, ...
classname is a class used by the scanner for tokens (CSXToken, CSXIdentifierToken, etc.)

You define non-terminal symbols you will use as:
non terminal classname name1, name2, ...
classname is the class for the AST node associated with the non-terminal (stmtNode, exprNode, etc.)


225CS 536 Spring 2008 © 

Production Rules

Production rules are of the form
name ::= name1 name2 ... action ;
or
name ::= name1 name2 ... action1
       | name3 name4 ... action2
       | ...
       ;

Names are the names of terminals or non-terminals, as declared earlier.

Actions are Java code fragments, of the form
{: /*java code */ :}

The Java object associated with a symbol (a token or AST node) may be named by adding a :id suffix to a terminal or non-terminal in a rule.

226CS 536 Spring 2008 © 

RESULT names the left-hand side non-terminal.

The Java classes of the symbols are defined in the terminal and non-terminal declaration sections.

For example,

prog ::= LBRACE:l stmts:s RBRACE
  {: RESULT =
       new csxLiteNode(s, l.linenum, l.colnum); :}

This corresponds to the production

prog → { stmts }

The left brace is named l; the stmts non-terminal is called s.

In the action code, a new csxLiteNode is created and assigned to prog. It is constructed from the AST node associated with s. Its line and column numbers are those given to the left brace, l (by the scanner).

227CS 536 Spring 2008 © 

To tell CUP what non-terminal to use as the start symbol (prog in our example), we use the directive:

start with prog;

228CS 536 Spring 2008 © 

Example

Let's look at the CUP specification for CSX-lite. Recall its CFG is

program → { stmts }

stmts → stmt stmts

| λ 

stmt → id = expr ;

| if ( expr ) stmt

expr → expr + id

| expr - id

| id


229CS 536 Spring 2008 © 

The corresponding CUP specification is:

/***
 This Is A Java CUP Specification For CSX-lite,
 a Small Subset of The CSX Language, Used In CS536
***/

/* Preliminaries to set up and use the scanner. */

import java_cup.runtime.*;

parser code {:
 public void syntax_error(Symbol cur_token) {
   report_error("CSX syntax error at line "
     + String.valueOf(((CSXToken)
         cur_token.value).linenum),
     null);
 }
:};

init with {: :};

scan with {:
 return Scanner.next_token();
:};

230CS 536 Spring 2008 © 

/* Terminals (tokens returned by the scanner). */
terminal CSXIdentifierToken IDENTIFIER;
terminal CSXToken SEMI, LPAREN, RPAREN, ASG, LBRACE, RBRACE;
terminal CSXToken PLUS, MINUS, rw_IF;

/* Non terminals */
non terminal csxLiteNode prog;
non terminal stmtsNode stmts;
non terminal stmtNode stmt;
non terminal exprNode exp;
non terminal nameNode ident;

start with prog;

prog ::= LBRACE:l stmts:s RBRACE
  {: RESULT =
       new csxLiteNode(s, l.linenum, l.colnum); :}
  ;

stmts ::= stmt:s1 stmts:s2
  {: RESULT =
       new stmtsNode(s1, s2, s1.linenum, s1.colnum);
  :}

231CS 536 Spring 2008 © 

  |
  {: RESULT = stmtsNode.NULL; :}
  ;

stmt ::= ident:id ASG exp:e SEMI
  {: RESULT =
       new asgNode(id, e,
         id.linenum, id.colnum);
  :}
  | rw_IF:i LPAREN exp:e RPAREN stmt:s
  {: RESULT =
       new ifThenNode(e, s, stmtNode.NULL,
         i.linenum, i.colnum); :}
  ;

exp ::=
    exp:leftval PLUS:op ident:rightval
  {: RESULT =
       new binaryOpNode(leftval, sym.PLUS, rightval,
         op.linenum, op.colnum); :}
  | exp:leftval MINUS:op ident:rightval
  {: RESULT =
       new binaryOpNode(leftval, sym.MINUS, rightval,
         op.linenum, op.colnum); :}
  | ident:i
  {: RESULT = i; :}
  ;

232CS 536 Spring 2008 © 

ident ::= IDENTIFIER:i
  {: RESULT = new nameNode(
       new identNode(i.identifierText,
         i.linenum, i.colnum),
       exprNode.NULL,
       i.linenum, i.colnum); :}
  ;


233CS 536 Spring 2008 © 

Let's parse

{ a = b ; }

First, a is parsed using

ident ::= IDENTIFIER:i
  {: RESULT = new nameNode(
       new identNode(i.identifierText,
         i.linenum, i.colnum),
       exprNode.NULL,
       i.linenum, i.colnum); :}

We build

[AST fragment omitted: a nameNode whose children are an identNode for a and a nullExprNode.]

234CS 536 Spring 2008 © 

Next, b is parsed using

ident ::= IDENTIFIER:i
  {: RESULT = new nameNode(
       new identNode(i.identifierText,
         i.linenum, i.colnum),
       exprNode.NULL,
       i.linenum, i.colnum); :}

We build

[AST fragment omitted: a nameNode whose children are an identNode for b and a nullExprNode.]

235CS 536 Spring 2008 © 

Then b's subtree is recognized as an exp:

  | ident:i
  {: RESULT = i; :}

Now the assignment statement is recognized:

stmt ::= ident:id ASG exp:e SEMI
  {: RESULT =
       new asgNode(id, e,
         id.linenum, id.colnum);
  :}

We build

[AST fragment omitted: an asgNode whose children are the nameNode subtrees built for a and b.]

236CS 536 Spring 2008 © 

The stmts → λ production is matched (indicating that there are no more statements in the program).

CUP matches

stmts ::=
  {: RESULT = stmtsNode.NULL; :}

and we build

[AST fragment omitted: a nullStmtsNode.]

Next,

stmts → stmt stmts

is matched using

stmts ::= stmt:s1 stmts:s2
  {: RESULT =
       new stmtsNode(s1, s2, s1.linenum, s1.colnum);
  :}


237CS 536 Spring 2008 © 

This builds

[AST fragment omitted: a stmtsNode whose children are the asgNode subtree and a nullStmtsNode.]

As the last step of the parse, the parser matches

program → { stmts }

using the CUP rule

prog ::= LBRACE:l stmts:s RBRACE
  {: RESULT =
       new csxLiteNode(s, l.linenum, l.colnum); :}
  ;

238CS 536 Spring 2008 © 

The final AST returned by the parser is

[AST omitted: a csxLiteNode at the root, over a stmtsNode whose children are the asgNode (with nameNode subtrees for a and b) and a nullStmtsNode.]

239CS 536 Spring 2008 © 

Errors in Context-Free Grammars

Context-free grammars can contain errors, just as programs do. Some errors are easy to detect and fix; others are more subtle.

In context-free grammars we start with the start symbol, and apply productions until a terminal string is produced.

Some context-free grammars may contain useless non-terminals. Non-terminals that are unreachable (from the start symbol) or that derive no terminal string are considered useless.

Useless non-terminals (and productions that involve them) can be safely removed from a

240CS 536 Spring 2008 © 

grammar without changing the language defined by the grammar. A grammar containing useless non-terminals is said to be non-reduced.

After useless non-terminals are removed, the grammar is reduced.

Consider

S → A B
  | x
B → b
A → a A
C → d

Which non-terminals are unreachable? Which derive no terminal string?


241CS 536 Spring 2008 © 

Finding Useless Non-terminals

To find non-terminals that can derive one or more terminal strings, we'll use a marking algorithm.

We iteratively mark non-terminals that can derive a string of terminals, until no more non-terminals can be marked. Unmarked non-terminals are useless.

(1) Mark all terminal symbols
(2) Repeat
      If all symbols on the righthand side of a production are marked
      Then mark the lefthand side
    Until no more non-terminals can be marked
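A sketch of this marking algorithm in Java (the Production representation and class names are assumed, not taken from the project code):

import java.util.*;

class Production {
    String lhs;
    List<String> rhs;
    Production(String lhs, List<String> rhs) { this.lhs = lhs; this.rhs = rhs; }
}

class Marking {
    // Returns the set of symbols that can derive a terminal string.
    static Set<String> deriveTerminalStrings(Set<String> terminals,
                                             List<Production> productions) {
        Set<String> marked = new HashSet<>(terminals);  // (1) mark all terminals
        boolean changed = true;
        while (changed) {                               // (2) repeat until no change
            changed = false;
            for (Production p : productions) {
                // an empty right-hand side (A -> lambda) counts as all marked
                if (marked.containsAll(p.rhs) && marked.add(p.lhs)) {
                    changed = true;
                }
            }
        }
        return marked;
    }
}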

242CS 536 Spring 2008 © 

We can use a similar marking algorithm to determine which non-terminals can be reached from the start symbol:

(1) Mark the Start Symbol
(2) Repeat
      If the lefthand side of a production is marked
      Then mark all non-terminals in the righthand side
    Until no more non-terminals can be marked

243CS 536 Spring 2008 © 

λ Derivations

When parsing, we'll sometimes need to know which non-terminals can derive λ. (λ is "invisible" and hence tricky to parse.)

We can use the following marking algorithm to decide which non-terminals derive λ:

(1) For each production A → λ, mark A
(2) Repeat
      If the entire righthand side of a production is marked
      Then mark the lefthand side
    Until no more non-terminals can be marked

244CS 536 Spring 2008 © 

As an example consider

S → A B C

A → a

B → C D

D → d

| λ 

C → c

| λ 
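Working the algorithm on this grammar: C and D are marked in step (1) because each has a λ production; then the right-hand side of B → C D is entirely marked, so B is marked too. A and S are never marked (a is a terminal that step (1) never marks, and S's production mentions the unmarked A), so only B, C and D derive λ.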


245CS 536 Spring 2008 © 

Recall that compilers prefer an unambiguous grammar because a unique parse tree structure can be guaranteed for all inputs. Hence a unique translation, guided by the parse tree structure, will be obtained.

We would like an algorithm that checks if a grammar is ambiguous.

Unfortunately, it is undecidable whether a given CFG is ambiguous, so such an algorithm is impossible to create.

Fortunately, for certain grammar classes, including those for which we can generate parsers, we can prove included grammars are unambiguous.

246CS 536 Spring 2008 © 

Potentially, the most serious flawthat a grammar might have is thatit generates the “wronglanguage."

This is a subtle point as agrammar serves as the definition

of a language.For established languages (like Cor Java) there is usually a suite of programs created to test andvalidate new compilers. Anincorrect grammar will almostcertainly lead to incorrectcompilations of test programs,which can be automaticallyrecognized.

For new languages, initialimplementors must thoroughly

test the parser to verify thatinputs are scanned and parsed asexpected.

247CS 536 Spring 2008 © 

Parsers and Recognizers

Given a sequence of tokens, wecan ask:

"Is this input syntactically valid?"

(Is it generable from thegrammar?).

A program that answers thisquestion is a recognizer .

Alternatively, we can ask:

"Is this input valid and, if it is,what is its structure (parse tree)?"

A program that answers this moregeneral question is termed aparser .

We plan to use language structureto drive compilers, so we will beespecially interested in parsers.

248CS 536 Spring 2008 © 

Two general approaches to

parsing exist.The first approach is top-down.

A parser is top-down if it"discovers" the parse treecorresponding to a tokensequence by starting at the top of the tree (the start symbol), andthen expanding the tree (viapredictions) in a depth-firstmanner.

Top-down parsing techniques arepredictive in nature because theyalways predict the production thatis to be matched before matchingactually begins.


249CS 536 Spring 2008 © 

Consider

E → E + T | T

T → T * id | id

To parse id+id in a top-down manner, a parser builds a parse tree in the following steps:

[Parse-tree figure: E  ⇒  E over E + T  ⇒  the left E expands to T  ⇒  that T expands to id  ⇒  the right T expands to id]

250CS 536 Spring 2008 © 

A wide variety of parsingtechniques take a differentapproach.

They belong to the class of bottom-up parsers.

As the name suggests, bottom-up

parsers discover the structure of aparse tree by beginning at itsbottom (at the leaves of the treewhich are terminal symbols) anddetermining the productions usedto generate the leaves.

Then the productions used togenerate the immediate parentsof the leaves are discovered.

The parser continues until itreaches the production used toexpand the start symbol.

At this point the entire parse treehas been determined.

251CS 536 Spring 2008 © 

A bottom-up parse of id1+id2 would follow the following steps:

[Parse-tree figure: id1 is reduced to T and then to E; id2 is reduced to T; finally E + T is reduced to E]

252CS 536 Spring 2008 © 

A Simple Top-Down Parser

We’ll build a rudimentary top-down parser that simply trieseach possible expansion of a non-terminal, in order of productiondefinition.

If an expansion leads to a tokensequence that doesn’t match thecurrent token being parsed, webackup and try the next possibleproduction choice.

We stop when all the input tokens

are correctly matched or when allpossible production choices havebeen tried.


253CS 536 Spring 2008 © 

Example

Given the productions

S → a

| ( S )

we try a, then (a), then ((a)), etc.

Let’s next try an additionalalternative:

S → a

| ( S )

| ( S ]

Let’s try to parse a, then (a], then((a]], etc.

We’ll count the number of 

productions we try for each input.

254CS 536 Spring 2008 © 

• For input = aWe try S → a which works.Matches needed = 1

• For input = ( a ]We try S → a which fails.We next try S → ( S ).

We expand the inner S threedifferent ways; all fail.Finally, we try S → ( S ].The inner S expands to a, whichworks.Total matches tried =1 + (1+3)+(1+1)= 7.

• For input = (( a ]]We try S → a which fails.We next try S → ( S ).We match the inner S to (a] using 7

steps, then fail to match the last ].Finally, we try S → ( S ].We match the inner S to (a] using 7

255CS 536 Spring 2008 © 

steps, then match the last ].

Total matches tried =1 + (1+7)+(1+7)= 17.

• For input = ((( a ]]]We try S → a which fails.We next try S → ( S ).We match the inner S to ((a]] using17 steps, then fail to match the last].Finally, we try S → ( S ].We match the inner S to ((a]] using17 steps, then match the last ].

Total matches tried =1 + (1+17) + (1+17) = 37.

Adding one extra ( ... ] pair doubles the number of matches we need to do the parse.

In fact, to parse (^i a ]^i (i nested ( ... ] pairs) takes 5·2^i − 3 matches (7, 17, 37, ... for i = 1, 2, 3, ...). This is exponential growth!

256CS 536 Spring 2008 © 

With a more effective dynamic programming approach, in which results of intermediate parsing steps are cached, we can reduce the number of matches needed to n^3 for an input with n tokens.

Is this acceptable?

No!

Typical source programs have at least 1000 tokens, and 1000^3 = 10^9 is a lot of steps, even for a fast modern computer.

The solution?—Smarter selection in the choice of productions we try.


257CS 536 Spring 2008 © 

Reading Assignment

Read Chapter 5 of 

Crafting a Compiler, SecondEdition.

258CS 536 Spring 2008 © 

Prediction

We want to avoid tryingproductions that can’t possiblywork.

For example, if the current token

to be parsed is an identifier, it isuseless to try a production thatbegins with an integer literal.

Before we try a production, we’llconsider the set of terminals itmight initially produce. If thecurrent token is in this set, we’lltry the production.

If it isn’t, there is no way theproduction being consideredcould be part of the parse, sowe’ll ignore it.

A predict function tells us the setof tokens that might be initiallygenerated from any production.

259CS 536 Spring 2008 © 

For A → X1...Xn,

Predict(A → X1...Xn) = the set of all initial (first) tokens derivable from A → X1...Xn
                     = {a in Vt | A → X1...Xn ⇒* a...}

For example, given

Stmt → Label id = Expr ;

| Label if Expr then Stmt ;

| Label read ( IdList ) ;

| Label id ( Args ) ;

Label → intlit :

|λ 

Production Predict Set

Stmt → Label id = Expr ; {id, intlit}

Stmt → Label if Expr then Stmt ; {if, intlit}

Stmt → Label read ( IdList ) ; {read, intlit}

Stmt → Label id ( Args ) ; {id, intlit}

260CS 536 Spring 2008 © 

We now will match a production p

only if the next unmatched tokenis in p’s predict set. We’ll avoidtrying productions that clearlywon’t work, so parsing will befaster.

But what is the predict set of a λ-production?

It can't be what's generated by λ (which is nothing!), so we'll define it as the tokens that can follow the use of a λ-production.

That is, Predict(A → λ) = Follow(A)

where (by definition)

Follow(A) = {a in Vt | S ⇒+ ...Aa...}

In our example, Predict(Label → λ) = Follow(Label) = { id, if, read }

(since these terminals can immediately follow uses of Label in the given productions).


261CS 536 Spring 2008 © 

Now let’s parse

id ( intlit ) ;

Our start symbol is Stmt and theinitial token is id.

id can predictStmt → Label id = Expr ;

id then predicts Label → λ 

The id is matched, but “(“ doesn’tmatch “=” so we backup and try adifferent production for Stmt.

id also predictsStmt → Label id ( Args ) ;

Again, Label → λ is predicted andused, and the input tokens canmatch the rest of the remainingproduction.

We had only one misprediction,

which is better than before.Now we’ll rewrite the productionsa bit to make predictions easier.

262CS 536 Spring 2008 © 

We remove the Label prefix fromall the statement productions(now intlit won’t predict all fourproductions).

We now have

Stmt → Label BasicStmt

BasicStmt → id = Expr ;
| if Expr then Stmt ;

| read ( IdList ) ;

| id ( Args ) ;

Label → intlit :

| λ 

Now id predicts two different BasicStmt productions. If we rewrite these two productions into

BasicStmt → id StmtSuffix
StmtSuffix → = Expr ;
           | ( Args ) ;

263CS 536 Spring 2008 © 

we no longer have any doubt over

which production id predicts.We now have

This grammar generates the samestatements as our originalgrammar did, but now predictionnever fails!

Production Predict Set

Stmt → Label BasicStmt Not needed!

BasicStmt → id StmtSuffix {id}

BasicStmt → if Expr then Stmt ; {if}

BasicStmt → read ( IdList ) ; {read}

StmtSuffix → ( Args ) ; { ( }

StmtSuffix → = Expr ; { = }

Label → intlit : {intlit}

Label → λ  {if, id, read}

264CS 536 Spring 2008 © 

Whenever we must decide what

production to use, the predict setsfor productions with the samelefthand side are always disjoint.

Any input token will predict aunique production or noproduction at all (indicating asyntax error).

If we never mispredict aproduction, we never backup, soparsing will be fast and absolutelyaccurate!


265CS 536 Spring 2008 © 

LL(1) Grammars

A context-free grammar whosePredict sets are always disjoint(for the same non-terminal) is saidto be LL(1) .

LL(1) grammars are ideally suitedfor top-down parsing because it isalways possible to correctlypredict the expansion of any non-terminal. No backup is everneeded.

Formally, let

First(X1...Xn) = {a in Vt | X1...Xn ⇒* a...}

Follow(A) = {a in Vt | S ⇒+ ...Aa...}

266CS 536 Spring 2008 © 

Predict(A → X1...Xn) =
   If X1...Xn ⇒* λ
   Then First(X1...Xn) ∪ Follow(A)
   Else First(X1...Xn)

If some CFG, G, has the property that for all pairs of distinct productions with the same lefthand side,

A → X1...Xn and A → Y1...Ym,

it is the case that

Predict(A → X1...Xn) ∩ Predict(A → Y1...Ym) = φ

then G is LL(1).
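As an illustration, here is a minimal Java sketch of the Predict computation and the LL(1) test. It assumes a hypothetical Grammar class that already supplies First, Follow, and a derives-λ test (plus the Production class from the earlier sketch); none of these names come from the CSX project.

import java.util.*;

class LL1Test {
   // Predict(A -> X1...Xn): First(rhs), plus Follow(A) if the rhs can derive lambda.
   static Set<String> predict(Production p, Grammar g) {
      Set<String> result = new HashSet<>(g.first(p.rhs));
      if (g.derivesLambda(p.rhs))
         result.addAll(g.follow(p.lhs));
      return result;
   }

   // G is LL(1) iff, for each non-terminal, the predict sets of its productions are pairwise disjoint.
   static boolean isLL1(Grammar g) {
      for (String a : g.nonTerminals()) {
         Set<String> seen = new HashSet<>();
         for (Production p : g.productionsFor(a))
            for (String tok : predict(p, g))
               if (!seen.add(tok))      // token predicts two A-productions: not LL(1)
                  return false;
      }
      return true;
   }
}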

LL(1) grammars are easy to parsein a top-down manner sincepredictions are always correct.

267CS 536 Spring 2008 © 

Example

Since the predict sets of both Bproductions and both Dproductions are disjoint, thisgrammar is LL(1).

Production Predict Set

S → A a {b,d,a}

A → B D {b, d, a}

B → b { b }

B → λ  {d, a}

D → d { d }

D → λ  { a }

268CS 536 Spring 2008 © 

Recursive Descent Parsers

An early implementation of top-down (LL(1)) parsing wasrecursive descent.

A parser was organized as a set of parsing procedures , one for eachnon-terminal. Each parsingprocedure was responsible forparsing a sequence of tokensderivable from its non-terminal.

For example, a parsing procedure,A, when called, would call the

scanner and match a tokensequence derivable from A.

Starting with the start symbol’sparsing procedure, we would thenmatch the entire input, whichmust be derivable from the startsymbol.


269CS 536 Spring 2008 © 

This approach is called recursivedescent because the parsingprocedures were typicallyrecursive, and they descended down the input’s parse tree (astop-down parsers always do).

270CS 536 Spring 2008 © 

Building A Recursive DescentParser

We start with a procedure Match,that matches the current inputtoken against a predicted token:

void Match(Terminal a) {
   if (a == currentToken)
        currentToken = Scanner();
   else SyntaxError();
}

To build a parsing procedure for anon-terminal A, we look at allproductions with A on thelefthand side:

A → X1...Xn | A → Y1...Ym | ...

We use predict sets to decidewhich production to match (LL(1)grammars always have disjoint

predict sets).We match a production’srighthand side by calling Match to

271CS 536 Spring 2008 © 

match terminals, and calling

parsing procedures to match non-terminals.

The general form of a parsing procedure for

A → X1...Xn | A → Y1...Ym | ...

is

void A() {
   if (currentToken in Predict(A → X1...Xn))
        for (i = 1; i <= n; i++)
            if (X[i] is a terminal)
                 Match(X[i]);
            else X[i]();
   else if (currentToken in Predict(A → Y1...Ym))
        for (i = 1; i <= m; i++)
            if (Y[i] is a terminal)
                 Match(Y[i]);
            else Y[i]();
   else
        // Handle other A → ... productions
   else // No production predicted
        SyntaxError();
}

272CS 536 Spring 2008 © 

Usually this general form isn’t

used.Instead, each production is“macro-expanded” into asequence of  Match and parsingprocedure calls.


273CS 536 Spring 2008 © 

Example: CSX-Lite

Production Predict Set

Prog → { Stmts } Eof {

Stmts → Stmt Stmts id if

Stmts → λ  }

Stmt → id = Expr ; id

Stmt → if ( Expr ) Stmt if

Expr → id Etail id

Etail → + Expr +

Etail → - Expr -

Etail → λ  ) ;

274CS 536 Spring 2008 © 

CSX-Lite Parsing Procedures

void Prog() {
   Match("{");
   Stmts();
   Match("}");
   Match(Eof);
}

void Stmts() {
   if (currentToken == id ||
       currentToken == if) {
        Stmt();
        Stmts();
   } else {
        /* null */
   }
}

void Stmt() {
   if (currentToken == id) {
        Match(id);
        Match("=");
        Expr();
        Match(";");
   } else {
        Match(if);
        Match("(");
        Expr();
        Match(")");
        Stmt();
   }
}

275CS 536 Spring 2008 © 

void Expr() {
   Match(id);
   Etail();
}

void Etail() {
   if (currentToken == "+") {
        Match("+");
        Expr();
   } else if (currentToken == "-") {
        Match("-");
        Expr();
   } else {
        /* null */
   }
}

276CS 536 Spring 2008 © 

Let’s use recursive descent to parse

{ a = b + c; } EofWe start by calling Prog() since thisrepresents the start symbol.

Calls Pending Remaining Input

Prog() { a = b + c; } Eof

 Match("{");Stmts();

 Match("}"); Match(Eof);

{ a = b + c; } Eof

Stmts(); Match("}"); Match(Eof);

a = b + c; } Eof

Stmt();Stmts();

 Match("}"); Match(Eof);

a = b + c; } Eof

 Match(id); Match("=");Expr();

 Match(";");Stmts();

 Match("}"); Match(Eof);

a = b + c; } Eof


277CS 536 Spring 2008 © 

 Match("=");Expr();

 Match(";");Stmts();

 Match("}"); Match(Eof);

= b + c; } Eof

Expr(); Match(";");Stmts();

 Match("}"); Match(Eof);

b + c; } Eof

 Match(id);Etail();

 Match(";");Stmts();

 Match("}"); Match(Eof);

b + c; } Eof

Etail(); Match(";");Stmts();

 Match("}"); Match(Eof);

+ c; } Eof

Calls Pending Remaining Input

278CS 536 Spring 2008 © 

 Match("+");Expr();

 Match(";");Stmts();

 Match("}"); Match(Eof);

+ c; } Eof

Expr(); Match(";");Stmts();

 Match("}"); Match(Eof);

c; } Eof

 Match(id);Etail();

 Match(";");Stmts();

 Match("}"); Match(Eof);

c; } Eof

Etail(); Match(";");Stmts();

 Match("}"); Match(Eof);

; } Eof

/* null */ Match(";");Stmts();

 Match("}"); Match(Eof);

; } Eof

Calls Pending Remaining Input

279CS 536 Spring 2008 © 

 Match(";");Stmts();

 Match("}"); Match(Eof);

; } Eof

Stmts(); Match("}"); Match(Eof);

} Eof

/* null */ Match("}"); Match(Eof);

} Eof

 Match("}"); Match(Eof);

} Eof

 Match(Eof); Eof

Done! All input matched

Calls Pending Remaining Input

280CS 536 Spring 2008 © 

Syntax Errors in RecursiveDescent Parsing

In recursive descent parsing, syntax errors are automatically detected. In fact, they are detected as soon as possible (as soon as the first illegal token is seen).

How? When an illegal token is seen by the parser, either it fails to predict any valid production or it fails to match an expected token in a call to Match.

Let's see how the following illegal CSX-lite program is parsed:

{ b + c = a; } Eof

(Where should the first syntaxerror be detected?)


281CS 536 Spring 2008 © 

Calls Pending Remaining Input

Prog() { b + c = a; } Eof

 Match("{");Stmts();

 Match("}"); Match(Eof);

{ b + c = a; } Eof

Stmts(); Match("}"); Match(Eof);

b + c = a; } Eof

Stmt();Stmts();

 Match("}"); Match(Eof);

b + c = a; } Eof

 Match(id); Match("=");Expr();

 Match(";");Stmts();

 Match("}"); Match(Eof);

b + c = a; } Eof

282CS 536 Spring 2008 © 

 Match("=");Expr();

 Match(";");Stmts();

 Match("}"); Match(Eof);

+ c = a; } Eof

Call to Match fails! + c = a; } Eof

Calls Pending Remaining Input

283CS 536 Spring 2008 © 

Table-Driven Top-DownParsers

Recursive descent parsers havemany attractive features. They areactual pieces of code that can beread by programmers andextended.

This makes it fairly easy tounderstand how parsing is done.

Parsing procedures are alsoconvenient places to add code tobuild ASTs, or to do type-checking, or to generate code.A major drawback of recursivedescent is that it is quiteinconvenient to change thegrammar being parsed. Anychange, even a minor one, mayforce parsing procedures to be

284CS 536 Spring 2008 © 

reprogrammed, as productions

and predict sets are modified.To a less extent, recursive descentparsing is less efficient than itmight be, since subprograms arecalled just to match a single tokenor to recognize a righthand side.

An alternative to parsingprocedures is to encode allprediction in a parsing table. Apre-programed driver programcan use a parse table (and list of productions) to parse any LL(1)grammar.

If a grammar is changed, theparse table and list of productionswill change, but the driver neednot be changed.


285CS 536 Spring 2008 © 

LL(1) Parse Tables

An LL(1) parse table, T, is a two-dimensional array. Entries in T are production numbers or blank (error) entries.

T is indexed by:

• A, a non-terminal. A is the non-terminal we want to expand.

• CT, the current token that is to be matched.

T[A][CT] = A → X1...Xn   if CT is in Predict(A → X1...Xn)
T[A][CT] = error         if CT predicts no production with A as its lefthand side
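Given the predict sets, filling in T is mechanical. A minimal sketch, reusing the hypothetical Grammar/Production/LL1Test helpers from the earlier sketches (a missing map entry plays the role of a blank, i.e. error, entry):

import java.util.*;

class LL1TableBuilder {
   // T is indexed by a non-terminal A and the current token CT.
   static Map<String, Map<String, Production>> build(Grammar g) {
      Map<String, Map<String, Production>> T = new HashMap<>();
      for (String a : g.nonTerminals()) {
         Map<String, Production> row = new HashMap<>();
         for (Production p : g.productionsFor(a))
            for (String tok : LL1Test.predict(p, g))
               row.put(tok, p);    // disjoint predict sets guarantee no entry is overwritten
         T.put(a, row);
      }
      return T;                     // T.get(A).get(CT) == null means "error"
   }
}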

286CS 536 Spring 2008 © 

CSX-lite Example

Production Predict Set

1 Prog → { Stmts } Eof {

2 Stmts → Stmt Stmts id if

3 Stmts → λ  }

4 Stmt → id = Expr ; id5 Stmt → if ( Expr ) Stmt if

6 Expr → id Etail id

7 Etail → + Expr +

8 Etail → - Expr -

9 Etail → λ  ) ;

{ } if ( ) id = + - ; eof  

Prog 1

Stmts 3 2 2

Stmt 5 4

Expr 6

Etail 9 7 8 9

287CS 536 Spring 2008 © 

LL(1) Parser Driver

Here is the driver we'll use with the LL(1) parse table. We'll also use a parse stack that remembers symbols we have yet to match.

void LLDriver() {
   Push(StartSymbol);
   while (! stackEmpty()) {
      // Let X = Top symbol on parse stack
      // Let CT = current token to match
      if (isTerminal(X)) {
         match(X);   // CT is updated
         pop();      // X is updated
      } else if (T[X][CT] != Error) {
         // Let T[X][CT] = X → Y1...Ym
         Replace X with Y1...Ym on parse stack;
      } else SyntaxError(CT);
   }
}

288CS 536 Spring 2008 © 

Example of LL(1) Parsing

We’ll again parse{ a = b + c; } Eof

We start by placing Prog (the startsymbol) on the parse stack.

Parse Stack Remaining Input

Prog { a = b + c; } Eof

{Stmts}Eof

{ a = b + c; } Eof

Stmts}Eof

a = b + c; } Eof

StmtStmts}Eof

a = b + c; } Eof


289CS 536 Spring 2008 © 

id=Expr;Stmts}

Eof

a = b + c; } Eof

=Expr;Stmts}Eof

= b + c; } Eof

Expr;Stmts}Eof

b + c; } Eof

idEtail;

Stmts}Eof

b + c; } Eof

Parse Stack Remaining Input

290CS 536 Spring 2008 © 

Etail;Stmts}Eof

+ c; } Eof

+Expr;Stmts}Eof

+ c; } Eof

Expr;Stmts}Eof

c; } Eof

idEtail;Stmts}

Eof

c; } Eof

Parse Stack Remaining Input

291CS 536 Spring 2008 © 

Etail;Stmts}Eof

; } Eof

;Stmts}Eof

; } Eof

Stmts}Eof

} Eof

}Eof

} Eof

Eof Eof

Done! All input matched

Parse Stack Remaining Input

292CS 536 Spring 2008 © 

Syntax Errors in LL(1)Parsing

In LL(1) parsing, syntax errorsare automatically detected assoon as the first illegal token isseen.

How? When an illegal token isseen by the parser, either itfetches an error entry from theLL(1) parse table or it fails tomatch an expected token.

Let’s see how the followingillegal CSX-lite program isparsed:

{ b + c = a; } Eof

(Where should the first syntaxerror be detected?)


293CS 536 Spring 2008 © 

Parse Stack Remaining Input

Prog { b + c = a; } Eof 

{Stmts

}Eof

{ b + c = a; } Eof 

Stmts}Eof

b + c = a; } Eof 

StmtStmts}Eof

b + c = a; } Eof 

id=Expr;Stmts

}Eof

b + c = a; } Eof 

294CS 536 Spring 2008 © 

=Expr;Stmts}Eof

+ c = a; } Eof 

Current token (+) failsto match expectedtoken (=)!

+ c = a; } Eof 

Parse Stack Remaining Input

295CS 536 Spring 2008 © 

How do LL(1) Parsers BuildSyntax Trees?

So far our LL(1) parser has acted like a recognizer. It verifies that input tokens are syntactically correct, but it produces no output.

Building complete (concrete) parse trees automatically is fairly easy.

As tokens and non-terminals are matched, they are pushed onto a second stack, the semantic stack. At the end of each production, an action routine pops off n items from the semantic stack (where n is the length of the production's righthand side). It then builds a syntax tree whose root is the

296CS 536 Spring 2008 © 

lefthand side, and whose children

are the n items just popped off.

For example, for production

Stmt → id = Expr ;

the parser would include an action symbol after the ";" whose actions are:

P4 = pop(); // Semicolon token
P3 = pop(); // Syntax tree for Expr
P2 = pop(); // Assignment token
P1 = pop(); // Identifier token
Push(new StmtNode(P1,P2,P3,P4));
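A minimal sketch of the semantic-stack mechanism itself; the stack and node classes here are illustrative placeholders, not the CSX project's classes:

import java.util.*;

class ParseTreeNode {                       // hypothetical concrete-parse-tree node
   final String symbol; final Object[] children;
   ParseTreeNode(String symbol, Object[] children) { this.symbol = symbol; this.children = children; }
}

class SemanticStack {
   private final Deque<Object> items = new ArrayDeque<>();

   void push(Object item) { items.push(item); }   // tokens and subtrees are pushed as matched

   // At the end of a production with n righthand-side symbols,
   // pop those n items and push a tree rooted at the lefthand-side symbol.
   void reduce(int n, String lhsSymbol) {
      Object[] children = new Object[n];
      for (int i = n - 1; i >= 0; i--)             // items come off in right-to-left order
         children[i] = items.pop();
      items.push(new ParseTreeNode(lhsSymbol, children));
   }
}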


297CS 536 Spring 2008 © 

Creating Abstract SyntaxTrees

Recall that we prefer that parsers generate abstract syntax trees, since they are simpler and more concise.

Since a parser generator can't know what tree structure we want to keep, we must allow the user to define "custom" action code, just as Java CUP does.

We allow users to include "code snippets" in Java or C. We also allow labels on symbols so that we can refer to the tokens and trees we wish to access. Our production and action code will now look like this:

Stmt → id:i = Expr:e ;
   {: RESULT = new StmtNode(i,e); :}

298CS 536 Spring 2008 © 

How do We Make GrammarsLL(1)?

Not all grammars are LL(1); sometimes we need to modify a grammar's productions to create the disjoint Predict sets LL(1) requires.

There are two common problemsin grammars that make uniqueprediction difficult or impossible:

1. Common prefixes.Two or more productions withthe same lefthand side beginwith the same symbol(s).For example,

Stmt → id = Expr ;
Stmt → id ( Args ) ;

299CS 536 Spring 2008 © 

2. Left-Recursion

A production of the formA → A ...

is said to be left-recursive.

When a left-recursive productionis used, a non-terminal isimmediately replaced by itself (with additional symbolsfollowing).

Any grammar with a left-recursiveproduction can never be LL(1).

Why?

Assume a non-terminal A reaches the top of the parse stack, with CT as the current token. The LL(1) parse table entry, T[A][CT], predicts A → A ...

We expand A, A is again on top of the parse stack with CT unchanged, so T[A][CT] predicts A → A ... again. We are in an infinite prediction loop!

300CS 536 Spring 2008 © 

Eliminating Common Prefixes

Assume we have two or more productions with the same lefthand side and a common prefix on their righthand sides:

A → α β | α γ | ... | α δ

We create a new non-terminal, X.

We then rewrite the above productions into:

A → α X
X → β | γ | ... | δ

For example,

Stmt → id = Expr ;
Stmt → id ( Args ) ;

becomes

Stmt → id StmtSuffix

StmtSuffix → = Expr ;

StmtSuffix → ( Args ) ;


301CS 536 Spring 2008 © 

Eliminating Left Recursion

Assume we have a non-terminal that is left recursive:

A → A α
A → β | γ | ... | δ

To eliminate the left recursion, we create two new non-terminals, N and T.

We then rewrite the above productions into:

A → N T
N → β | γ | ... | δ
T → α T | λ

302CS 536 Spring 2008 © 

For example,

Expr → Expr + id

Expr → id

becomes

Expr → N T

N → id
T → + id T | λ

This simplifies to:

Expr → id T

T → + id T | λ 

303CS 536 Spring 2008 © 

Reading Assignment

Read Sections 6.1 to 6.5.1 of Crafting a Compiler featuring Java.

304CS 536 Spring 2008 © 

How does JavaCup Work?

The main limitation of LL(1)parsing is that it must predict thecorrect production to use when itfirst starts to match theproduction’s righthand side.

An improvement to this approachis the LALR(1) parsing methodthat is used in JavaCUP (and Yaccand Bison too).

The LALR(1) parser is bottom-upin approach. It tracks the portion

of a righthand side alreadymatched as tokens are scanned. Itmay not know immediately whichis the correct production tochoose, so it tracks sets of possible matching productions.


305CS 536 Spring 2008 © 

Configurations

We’ll use the notation

X → A B • C D

to represent the fact that we are trying to match the production X → A B C D, with A and B matched so far.

A production with a “•”somewhere in its righthand side iscalled a configuration.

Our goal is to reach aconfiguration with the “dot” at theextreme right:

X → A B C D •

This indicates that an entire

production has just beenmatched.

306CS 536 Spring 2008 © 

Since we may not know whichproduction will eventually be fullymatched, we may need to track aconfiguration set . A configurationset is sometimes called a state.

When we predict a production, we

place the “dot” at the beginning of a production:

X → • A B C D

This indicates that the productionmay possibly be matched, but nosymbols have actually yet beenmatched.

We may predict a λ -production:

X → λ •

When a λ -production is predicted,it is immediately matched, since λ 

can be matched at any time.

307CS 536 Spring 2008 © 

Starting the Parse

At the start of the parse, we knowsome production with the startsymbol must be used initially. Wedon’t yet know which one, so wepredict them all :

S → • A B C D

S → • e F g

S → • h I...

308CS 536 Spring 2008 © 

Closure

When we encounter aconfiguration with the dot to theleft of a non-terminal, we knowwe need to try to match that non-terminal.

Thus in

X → • A B C D

we need to match someproduction with A as its left handside.

Which production?We don’t know, so we predict all possibilities:

A → • P Q R

A → • s T

...


309CS 536 Spring 2008 © 

The newly added configurationsmay predict other non-terminals,forcing additional productions tobe included. We continue thisprocess until no additionalconfigurations can be added.

This process is called closure (of the configuration set).

Here is the closure algorithm:

ConfigSet Closure(ConfigSet C) {
   repeat
      if (X → α • B δ is in C &&
          B is a non-terminal)
         Add all configurations of
           the form B → • γ to C;
   until (no more configurations
          can be added);
   return C;
}

310CS 536 Spring 2008 © 

Example of Closure

Assume we have the followinggrammar:

S → A b

A → C D

C → D
C → c

D → d

To compute Closure(S → • A b)

we first include all productionsthat rewrite A:

A → • C D

Now C productions are included:

C → • D

C → • c

311CS 536 Spring 2008 © 

Finally, the D production is added:

D → • d

The complete configuration set is:

S → • A b

A → • C D

C → • D

C → • c

D → • d

This set tells us that if we want tomatch an A, we will need tomatch a C, and this is done bymatching a c or d token.

312CS 536 Spring 2008 © 

Shift Operations

When we match a symbol (aterminal or non-terminal), we shift the “dot” past the symbol justmatched. Configurations thatdon’t have a dot to the left of thematched symbol are deleted(since they didn’t correctlyanticipate the matched symbol).

The GoTo function computes anupdated configuration set after asymbol is shifted:

ConfigSet GoTo(ConfigSet C, Symbol X) {
   B = φ;
   for each configuration f in C {
      if (f is of the form A → α • X δ)
         Add A → α X • δ to B;
   }
   return Closure(B);
}


313CS 536 Spring 2008 © 

For example, if C is

S → • A b

A → • C D

C → • D

C → • c

D → • d

and X is C, then GoTo returns

A → C • D

D → • d

314CS 536 Spring 2008 © 

Reduce Actions

When the dot in a configurationreaches the rightmost position,we have matched an entirerighthand side. We are ready toreplace the righthand sidesymbols with the lefthand side of the production. The lefthand sidesymbol can now be consideredmatched.

If a configuration set can shift atoken and also reduce aproduction, we have a potentialshift/reduce error .

If we can reduce more than oneproduction, we have a potentialreduce/reduce error .

How do we decide whether to doa shift or reduce? How do wechoose among more than onereduction?

315CS 536 Spring 2008 © 

We examine the next token to see

if it is consistent with thepotential reduce actions.

The simplest way to do this is touse Follow sets, as we did in LL(1)parsing.

If we have a configuration

A → α •

we will reduce this productiononly if the current token, CT, is inFollow(A).

This makes sense since if we

reduce α to A, we can’t correctlymatch CT if CT can’t follow A.

316CS 536 Spring 2008 © 

Shift/Reduce and Reduce/Reduce Errors

If we have a parse state thatcontains the configurations

A → α •

B → β • a γ 

and a in Follow(A) then there is anunresolvable shift/reduce conflict.This grammar can’t be parsed.

Similarly, if we have a parse statethat contains the configurations

A → α •

B → β •

and Follow(A) ∩ Follow(B) ≠ φ,then the parser has anunresolvable reduce/reduceconflict. This grammar can’t beparsed.


317CS 536 Spring 2008 © 

Building Parse States

All the manipulations needed tobuild and complete configurationsets suggest that parsing may beslow—configuration sets need tobe updated after each token ismatched.Fortunately, all the configurationsets we ever will need can becomputed and tabled in advance,when a tool like Java Cup builds aparser.

The idea is simple. We firstcompute an initial parse state, s0,that corresponds to predictingproductions that expand the startsymbol. We then just compute

successor states for each tokenthat might be scanned. Acomplete set of states can becomputed. For typical

318CS 536 Spring 2008 © 

programming languagegrammars, only a few hundredstates are needed.

Here is the algorithm that builds acomplete set of parse states for agrammar:

StateSet BuildStates() {
   Let s0 = Closure({S → •α, S → •β, ...});
   C = {s0};
   while (not all states in C are marked) {
      Choose any unmarked state, s, in C;
      Mark s;
      For each X in terminals U nonterminals {
         if (GoTo(s,X) is not in C)
            Add GoTo(s,X) to C;
      }
   }
   return C;
}

319CS 536 Spring 2008 © 

Configuration Sets for CSX-Lite

State Configuration Set

s0 Prog → •{ Stmts } Eof

s1

Prog → { • Stmts } EofStmts → •Stmt Stmts

Stmts → λ •

Stmt → • id = Expr ;

Stmt → • if ( Expr ) Stmt

s2 Prog → { Stmts •} Eof

s3

Stmts → Stmt • StmtsStmts → •Stmt Stmts

Stmts → λ •

Stmt → • id = Expr ;

Stmt → • if ( Expr ) Stmt

s4 Stmt → id • = Expr ;

s5 Stmt → if • ( Expr ) Stmt

320CS 536 Spring 2008 © 

s6 Prog → { Stmts } •Eof

s7 Stmts → Stmt Stmts •

s8

Stmt → id = • Expr ;

Expr → • Expr + id

Expr → • Expr - id

Expr → • id

s9

Stmt → if ( • Expr ) Stmt

Expr → • Expr + id

Expr → • Expr - id

Expr → • id

s10 Prog → { Stmts } Eof •

s11

Stmt → id = Expr • ;

Expr → Expr • + id

Expr → Expr • - id

s12 Expr → id •

s13

Stmt → if ( Expr •) Stmt

Expr → Expr • + id

Expr → Expr • - id

State Configuration Set


321CS 536 Spring 2008 © 

s14 Stmt → id = Expr ; •

s15 Expr → Expr + • id

s16 Expr → Expr - • id

s17

Stmt → if ( Expr ) • Stmt

Stmt → • id = Expr ;

Stmt → • if ( Expr ) Stmt

s18 Expr → Expr + id •

s19 Expr → Expr - id •

s20 Stmt → if ( Expr ) Stmt •

State Configuration Set

322CS 536 Spring 2008 © 

Parser Action Table

We will table possible parseractions based on the current state(configuration set) and token.

Given configuration set C and input token T, four actions are possible:

• Reduce i: The i-th production hasbeen matched.

• Shift: Match the current token.

• Accept: Parse is correct andcomplete.

• Error: A syntax error has beendiscovered.

323CS 536 Spring 2008 © 

We will let A[C][T] represent the possible parser actions given configuration set C and input token T.

A[C][T] =
   { Reduce i | the i-th production is A → α,
                A → α • is in C,
                and T is in Follow(A) }
   ∪ (If (B → β • T γ is in C) {Shift} else φ)

This rule simply collects all the actions that a parser might do given C and T.

But we want parser actions to be unique, so we require that the parser action always be unique for any C and T.

If the parser action isn’t unique,

then we have a shift/reduce erroror reduce/reduce error. Thegrammar is then rejected asunparsable.

If parser actions are alwaysunique then we will consider ashift of EOF to be an acceptaction.

An empty (or undefined) actionfor C and T will signify that tokenT is illegal given configuration setC.

A syntax error will be signaled.


325CS 536 Spring 2008 © 

LALR Parser Driver

Given the GoTo and parser action tables, a Shift/Reduce (LALR) parser is fairly simple:

void LALRDriver() {
   Push(S0);
   while (true) {
      // Let S = Top state on parse stack
      // Let CT = current token to match
      switch (A[S][CT]) {
       case error:
          SyntaxError(CT);
          return;
       case accept:
          return;
       case shift:
          push(GoTo[S][CT]);
          CT = Scanner();
          break;
       case reduce i:
          // Let prod i = A → Y1...Ym
          pop m states;
          // Let S' = new top state
          push(GoTo[S'][A]);
          break;
      }
   }
}

326CS 536 Spring 2008 © 

Action Table for CSX-Lite

0 1 2 3 4 5 6 7 8 910

11

12

13

14

15

16

17

18

19

20

{ S

} R3 S R3 R2 R4 R5

if  S S R4 S R5

( S

) R8 S R6 R7

id S S S S R4 S S S

= S

+ S R8 S R6 R7

- S R8 S R6 R7

; S R8 R6 R7 R5

eof  A

327CS 536 Spring 2008 © 

GoTo Table for CSX-Lite

0 1 2 3 4 5 6 7 8 910

11

12

13

14

15

16

17

18

19

20

{ 1

} 6

if  5 5 5

( 9

) 17

id 4 4 12 12 18 19 4

= 8

+15 15

- 16 16

; 14

eof  10

stmts 2 7

stmt 3 3

expr 11 13

328CS 536 Spring 2008 © 

Example of LALR(1) Parsing

We’ll again parse{ a = b + c; } Eof

We start by pushing state 0 on theparse stack.

ParseStack

Top State Action Remaining Input

0 Prog → •{ Stmts } Eof Shift { a = b + c; } Eof

1

0

Prog → { • Stmts } EofStmts → • Stmt Stmts

Stmts → λ •

Stmt → • id = Expr ;

Stmt → • if ( Expr )

Shift a = b + c; } Eof

4

1

0

Stmt → id • = Expr ; = b + c; } Eof

8

4

1

0

Stmt → id = • Expr ;

Expr → • Expr + id

Expr → • Expr - id

Expr → • id

Shift b + c; } Eof


329CS 536 Spring 2008 © 

12

8

4

1

0

Expr → id • Reduce 8 + c; } Eof

11

8

4

1

0

Stmt → id = Expr • ;

Expr → Expr • + id

Expr → Expr • - id

Shift + c; } Eof

15

11

8

4

1

0

Expr → Expr + • id Shift c; } Eof

ParseStack

Top State Action Remaining Input

330CS 536 Spring 2008 © 

18

15

11

8

41

0

Expr → Expr + id • Reduce 6 ; } Eof

11

8

4

1

0

Stmt → id = Expr • ;

Expr → Expr • + id

Expr → Expr • - id

Shift ; } Eof

14

11

8

4

10

Stmt → id = Expr ; • Reduce 4 } Eof

ParseStack

Top State Action Remaining Input

331CS 536 Spring 2008 © 

3

1

0

Stmts → Stmt • StmtsStmts → •Stmt Stmts

Stmts → λ •

Stmt → • id = Expr ;

Stmt → • if ( Expr )Stmt

Reduce 3 } Eof

7

3

1

0

Stmts → Stmt Stmts • Reduce 2 } Eof

2

1

0

Prog → { Stmts •} Eof Shift } Eof

6

2

1

0

Prog → { Stmts } •Eof Accept Eof

Parse

Stack Top State Action Remaining Input

332CS 536 Spring 2008 © 

Error Detection in LALR Parsers

In bottom-up (LALR) parsers, syntax errors are discovered when a blank (error) entry is fetched from the parser action table.

Let's again trace how the following illegal CSX-lite program is parsed:

{ b + c = a; } Eof


333CS 536 Spring 2008 © 

ParseStack

Top State Action Remaining Input

0 Prog → •{ Stmts } Eof Shift {b+c=a;}Eof

1

0

Prog → { • Stmts } Eof

Stmts → • Stmt StmtsStmts → λ •

Stmt → • id = Expr ;

Stmt → • if ( Expr )

Shift b + c = a; } Eof

4

1

0

Stmt → id • = Expr ; Error(blank)

+ c = a; } Eof

334CS 536 Spring 2008 © 

LALR is More Powerful

Essentially all LL(1) grammars areLALR(1) plus many more.Grammar constructs that confuseLL(1) are readily handled.

• Common prefixes are no problem.Since sets of configurations aretracked, more than one prefix canbe followed. For example, in

Stmt → id = Expr ;Stmt → id ( Args ) ;

after we match an id we have

Stmt → id • = Expr ;Stmt → id • ( Args ) ;

The next token will tell us whichproduction to use.

335CS 536 Spring 2008 © 

• Left recursion is also not a

problem. Since sets of configurations are tracked, we canfollow a left-recursive productionand all others it might use. Forexample, in

Expr → • Expr + idExpr → • id

we can first match an id:

Expr → id •

Then the Expr is recognized:

Expr → Expr • + id

The left-recursion is handled!

336CS 536 Spring 2008 © 

• But ambiguity will still block

construction of an LALR parser.Some shift/reduce or reduce/reduce conflict must appear. (Sincetwo or more distinct parses arepossible for some input).Consider our original productionsfor if-then and if-then-elsestatements:

Stmt → if ( Expr ) Stmt •

Stmt → if ( Expr ) Stmt • else Stmt

Since else can follow Stmt, wehave an unresolvable shift/reduceconflict.


337CS 536 Spring 2008 © 

Grammar Engineering

Though LALR grammars are verygeneral and inclusive, sometimesa reasonable set of productions isrejected due to shift/reduce orreduce/reduce conflicts.

In such cases, the grammar mayneed to be “engineered” to allowthe parser to operate.

A good example of this is thedefinition of MemberDecls in CSX.A straightforward definition is

MemberDecls → FieldDecls MethodDecls

FieldDecls → FieldDecl FieldDecls
FieldDecls → λ
MethodDecls → MethodDecl MethodDecls
MethodDecls → λ
FieldDecl → int id ;

MethodDecl → int id ( ) ; Body

338CS 536 Spring 2008 © 

When we predict MemberDecls we get:

MemberDecls → • FieldDecls MethodDecls
FieldDecls → • FieldDecl FieldDecls
FieldDecls → λ •
FieldDecl → • int id ;

Now int follows FieldDecls since MethodDecls ⇒+ int ...

Thus an unresolvable shift/reduce conflict exists.

The problem is that int is derivable from both FieldDecls and MethodDecls, so when we see an int, we can't tell which way to parse it (and FieldDecls → λ requires we make an immediate decision!).

339CS 536 Spring 2008 © 

If we rewrite the grammar so that we can delay deciding from where the int was generated, a valid LALR parser can be built:

MemberDecls → FieldDecl MemberDecls
MemberDecls → MethodDecls
MethodDecls → MethodDecl MethodDecls
MethodDecls → λ
FieldDecl → int id ;
MethodDecl → int id ( ) ; Body

When MemberDecls is predicted we have

MemberDecls → • FieldDecl MemberDecls
MemberDecls → • MethodDecls
MethodDecls → • MethodDecl MethodDecls
MethodDecls → λ •
FieldDecl → • int id ;
MethodDecl → • int id ( ) ; Body

340CS 536 Spring 2008 © 

Now Follow(MethodDecls) = Follow(MemberDecls) = "}", so we have no shift/reduce conflict. After int id is matched, the next token (a ";" or a "(") will tell us whether a FieldDecl or a MethodDecl is being matched.


341CS 536 Spring 2008 © 

Properties of LL and LALRParsers

• Each prediction or reduce action isguaranteed correct. Hence the entireparse (built from LL predictions or

LALR reductions) must be correct.

This follows from the fact that LLparsers allow only one valid predictionper step. Similarly, an LALR parsernever skips a reduction if it isconsistent with the current token (andall possible reductions are tracked).

342CS 536 Spring 2008 © 

• LL and LALR parsers detect a syntax error as soon as the first invalid token is seen.

Neither parser can match an invalid program prefix. If a token is matched it must be part of a valid program prefix. In fact, the prediction made or the stacked configuration sets show a possible derivation of the tokens accepted so far.

• All LL and LALR grammars areunambiguous.

LL predictions are always unique andLALR shift/reduce or reduce/reduceconflicts are disallowed. Hence onlyone valid derivation of any token

sequence is possible.

343CS 536 Spring 2008 © 

• All LL and LALR parsers require only

linear time and space (in terms of thenumber of tokens parsed).

The parsers do only fixed work pernode of the concrete parse tree, andthe size of this tree is linear in termsof the number of leaves in it (evenwith λ -productions included!).

344CS 536 Spring 2008 © 

Reading Assignment

Read Chapter 8 of Crafting aCompiler, 2nd Edition.


345CS 536 Spring 2008 © 

Symbol Tables in CSX

CSX is designed to make symboltables easy to create and use.

There are three places where anew scope is opened:

• In the class that represents theprogram text. The scope is openedas soon as we begin processing theclassNode (that roots the entireprogram). The scope stays openuntil the entire class (the wholeprogram) is processed.

• When a methodDeclNode isprocessed. The name of themethod is entered in the top-level(global) symbol table. Declarationsof parameters and locals are placedin the method’s symbol table. A

method’s symbol table is closedafter all the statements in its bodyare type checked.

346CS 536 Spring 2008 © 

• When a blockNode is processed.Locals are placed in the block’ssymbol table. A block’s symboltable is closed after all thestatements in its body are typechecked.

347CS 536 Spring 2008 © 

CSX Allows no ForwardReferences

This means we can do type checking in one pass over the AST. As declarations are processed, their identifiers are added to the current (innermost) symbol table. When a use of an identifier occurs, we do an ordinary block-structured lookup, always using the innermost declaration found. Hence in

int i = j;
int j = i;

the first declaration initializes i to the nearest non-local definition of j. The second declaration initializes j to the current (local) definition of i.
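A minimal, self-contained sketch of the block-structured table that makes this one-pass discipline work (the class and method names are illustrative, not the exact CSX project interface):

import java.util.*;

class ScopedTable {
   private final Deque<Map<String, String>> scopes = new ArrayDeque<>();  // name -> type, innermost first

   void openScope()  { scopes.push(new HashMap<>()); }
   void closeScope() { scopes.pop(); }

   // Declarations go into the innermost scope only; false means "already declared here".
   boolean declare(String name, String type) {
      return scopes.peek().putIfAbsent(name, type) == null;
   }

   // Uses search from the innermost scope outward; with no forward references,
   // only declarations processed so far can be found.
   String lookup(String name) {
      for (Map<String, String> s : scopes)
         if (s.containsKey(name)) return s.get(name);
      return null;                                                        // undeclared
   }
}

With this discipline (and assuming each initializer is checked just before its identifier is declared, as in the declaration-checking steps later in these notes), int i = j; looks up j before the local i exists and so finds the enclosing j, while int j = i; finds the freshly declared local i.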

348CS 536 Spring 2008 © 

Forward References RequireTwo Passes

If forward references are allowed, we can process declarations in two passes.

First we walk the AST to establish symbol table entries for all local declarations. No uses (lookups) are handled in this pass.

On a second complete pass, all uses are processed, using the symbol table entries built on the first pass. Forward references make type checking a bit trickier, as we may reference a declaration not yet fully processed.

In Java, forward references tofields within a class are allowed.

Thus in


349CS 536 Spring 2008 © 

class Duh {
   int i = j;
   int j = i;
}

a Java compiler must recognizethat the initialization of i is to the

j field and that the j declarationis incomplete (Java forbidsuninitialized fields or variables).

Forward references do allowmethods to be mutually recursive.That is, we can let method a callb, while b calls a.

In CSX this is impossible!

(Why?)

350CS 536 Spring 2008 © 

Incomplete Declarations

Some languages, like C++, allowincomplete declarations.

First, part of a declaration (usuallythe header of a procedure or

method) is presented.Later, the declaration iscompleted.

For example (in C++):class C {

int i;

public:

int f();

};

int C::f(){return i+1;}

351CS 536 Spring 2008 © 

Incomplete declarations solve

potential forward referenceproblems, as you can declaremethod headers first, and bodiesthat use the headers later.

Headers support abstraction andseparate compilation too.

In C and C++, it is common to usea #include statement to add theheaders (but not bodies) of external or library routines youwish to use.

C++ also allows you to declare aclass by giving its fields andmethod headers first, with thebodies of the methods declaredlater. This is good for users of theclass, who don’t always want tosee implementation details.

352CS 536 Spring 2008 © 

Classes, Structs and Records

The fields and methods declared within a class, struct or record are stored within an individual symbol table allocated for its declarations.

Member names must be unique within the class, record or struct, but may clash with other visible declarations. This is allowed because member names are qualified by the object they occur in.

Hence the reference x.a means: look up x, using normal scoping rules. Object x should have a type that includes local fields. The type of x will include a pointer to the symbol table containing the field declarations. Field a is looked up in that symbol table.
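A sketch of the two-level lookup that x.a implies, reusing the ScopedTable sketch above; the per-class member-table representation here is an assumption made for illustration:

import java.util.*;

class MemberLookup {
   // classMembers maps a class/struct/record type name to its own member symbol table.
   static String lookupQualified(ScopedTable current,
                                 Map<String, Map<String, String>> classMembers,
                                 String objName, String fieldName) {
      String objType = current.lookup(objName);           // ordinary block-structured lookup of x
      if (objType == null) return null;                    // x is undeclared
      Map<String, String> members = classMembers.get(objType);
      if (members == null) return null;                    // x's type has no field table
      return members.get(fieldName);                       // look a up in that type's symbol table
   }
}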


353CS 536 Spring 2008 © 

Chains of field references are noproblem.

For example, in Java

System.out.println

is commonly used.

System is looked up and found tobe a class in one of the standard Java packages (java.lang). ClassSystem has a static member out(of type PrintStream ) andPrintStream has a member println.

354CS 536 Spring 2008 © 

Internal and External FieldAccess

Within a class, members may beaccessed without qualification.Thus in

class C {
   static int i;
   void subr() {
      int j = i;
   }
}

field i is accessed like an ordinarynon-local variable.

To implement this, we can treatmember declarations like an

ordinary scope in a block-structured symbol table.

355CS 536 Spring 2008 © 

When the class definition ends, its

symbol table is popped andmembers are referenced throughthe symbol table entry for theclass name.

This means a simple reference toi will no longer work, but C.i willbe valid.

In languages like C++ that allowincomplete declarations, symboltable references need extra care.In

class C {

int i;

public:

int f();

};

int C::f(){return i+1;}

356CS 536 Spring 2008 © 

when the definition of f() is

completed, we must restore C’sfield definitions as a containingscope so that the reference to i ini+1 is properly compiled.


357CS 536 Spring 2008 © 

Public and Private Access

C++ and Java (and most otherobject-oriented languages) allowmembers of a class to be marked public or private.

Within a class the distinction isignored; all members may beaccessed.

Outside of the class, when aqualified access like C.i isrequired, only public memberscan be accessed.

This means lookup of classmembers is a two-step process.First the member name is lookedup in the symbol table of theclass. Then, the public/ private

qualifier is checked. Access to private members from outsidethe class generates an errormessage.

358CS 536 Spring 2008 © 

C++ and Java also provide a protected qualifier that allowsaccess from subclasses of theclass containing the memberdefinition.

When a subclass is defined, it

“inherits” the member definitionsof its ancestor classes. Localdefinitions may hide inheriteddefinitions. Moreover, inheritedmember definitions must be public or protected; privatedefinitions may not be directlyaccessed (though they are stillinherited and may be indirectlyaccessed through other inheriteddefinitions).

 Java also allows “blank” accessqualifiers which allow publicaccess by all classes within apackage (a collection of classes).

359CS 536 Spring 2008 © 

Packages and Imports

 Java allows packages which groupclass and interface definitions intonamed units.

A package requires a symbol tableto access members. Thus areference

java.util.Vector

locates the package java.util(typically using a CLASSPATH) andlooks up Vector within it.

 Java supports import statementsthat modify symbol table lookuprules.

A single class import, like

import java.util.Vector;

brings the name Vector into thecurrent symbol table (unless a

360CS 536 Spring 2008 © 

definition of  Vector is already

present).An “import on demand” like

import java.util.*;

will lookup identifiers in thenamed packages after explicituser declarations have beenchecked.


365CS 536 Spring 2008 © 

Overloaded Operators

A few languages, like C++, allowoperators to be overloaded.

This means users may add newdefinitions for existing operators,

though they may not create newoperators or alter existingprecedence and associativityrules.

(Such changes would forcechanges to the scanner or parser.)

For example,

class complex {
   float re, im;
   complex operator+(complex d) {
      complex ans;
      ans.re = d.re + re;
      ans.im = d.im + im;
      return ans;
   }
};

complex c, d;  c = c + d;

366CS 536 Spring 2008 © 

During type checking of an operator, all visible definitions of the operator (including predefined definitions) are gathered and examined.

Only one definition should successfully pass type checks. Thus in the above example, there may be many definitions of +, but only one is defined to take complex operands.

367CS 536 Spring 2008 © 

Contextual Resolution

Overloading allows multipledefinitions of the same kind of object (method, procedure oroperator) to co-exist.

Programming languages alsosometimes allow reuse of thesame name in defining differentkinds of objects. Resolution is bycontext of use.

For example, in Java, a class namemay be used for both the class

and its constructor. Hence we seeC cvar = new C(10);

In Pascal, the name of a functionis also used for its return value.

 Java allows rather extensive reuseof an identifier, with the sameidentifier potentially denoting aclass (type), a class constructor, a

368CS 536 Spring 2008 © 

package name, a method and a field.

For example,

class C {
   double v;
   C(double f) { v = f; }
}

class D {
   int C;
   double C() { return 1.0; }
   C cval = new C(C + C());
}

At type-checking time we examine all potential definitions and use that definition that is consistent with the context of use. Hence new C() must be a constructor, +C() must be a function call, etc.


369CS 536 Spring 2008 © 

Allowing multiple definitions toco-exist certainly makes typechecking more complicated thanin other languages.

Whether such reuse benefitsprogrammers is unclear; it

certainly violates Java’s “keep itsimple” philosophy.

370CS 536 Spring 2008 © 

Type and Kind Information inCSX

In CSX symbol table entries and in AST nodes for expressions, it is useful to store type and kind information.

This information is created and tested during type checking. In fact, most of type checking involves deciding whether the type and kind values for the current construct and its components are valid.

Possible values for type include:

• Integer (int)

• Boolean (bool)

• Character (char)

• String

371CS 536 Spring 2008 © 

• Void
  Void is used to represent objects that have no declared type (e.g., a label or procedure).

• Error
  Error is used to represent objects that should have a type, but don't (because of type errors). Error types suppress further type checking, preventing cascaded error messages.

• Unknown
  Unknown is used as an initial value, before the type of an object is determined.

372CS 536 Spring 2008 © 

Possible values for kind include:

• Var (a local variable or field that may be assigned to)

• Value (a value that may be read but not changed)

• Array

• ScalarParm (a by-value scalar parameter)

• ArrayParm (a by-reference array parameter)

• Method (a procedure or function)

• Label (on a while loop)


373CS 536 Spring 2008 © 

Most combinations of type andkind represent something in CSX.

Hence type==Boolean andkind==Value is a bool constantor expression.

type==Void and kind==Method

is a procedure (a method thatreturns no value).

Type checking procedure andfunction declarations and callsrequires some care.

When a method is declared, youshould build a linked list of (type,kind) pairs, one for eachdeclared parameter.

When a call is type checked youshould build a second linked listof (type,kind) pairs for the

actual parameters of the call.

374CS 536 Spring 2008 © 

You compare the lengths of the lists of formal and actual parameters to check that the correct number of parameters has been passed.

You then compare corresponding formal and actual parameter pairs to check if each individual actual parameter correctly matches its corresponding formal parameter.

For example, given

p(int a, bool b[]) { ...

and the call

p(1, false);

you create the parameter list

(Integer, ScalarParm), (Boolean, ArrayParm)

for p's declaration and the parameter list

(Integer, Value), (Boolean, Value)

for p's call.

375CS 536 Spring 2008 © 

Since a Value can't match an ArrayParm, you know that the second parameter in p's call is incorrect.
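A minimal Java sketch of that formal/actual comparison; the Parm pair class and the string-based kinds are illustrative simplifications, not the CSX project's representation:

import java.util.*;

class ParmCheck {
   static class Parm {                        // hypothetical (type, kind) pair
      final String type, kind;
      Parm(String type, String kind) { this.type = type; this.kind = kind; }
   }

   static boolean parmMatches(Parm formal, Parm actual) {
      if (!formal.type.equals(actual.type)) return false;
      if (formal.kind.equals("ArrayParm"))                // by-reference array formal
         return actual.kind.equals("Array") || actual.kind.equals("ArrayParm");
      return !actual.kind.equals("Array");                 // scalar formal: actual must be scalar
   }

   static boolean callMatches(List<Parm> formals, List<Parm> actuals) {
      if (formals.size() != actuals.size()) return false;  // wrong number of parameters
      for (int i = 0; i < formals.size(); i++)
         if (!parmMatches(formals.get(i), actuals.get(i))) return false;
      return true;
   }
}

On the example above, the second pair fails because the actual (Boolean, Value) is a scalar while the formal (Boolean, ArrayParm) requires an array.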

376CS 536 Spring 2008 © 

Type Checking Simple VariableDeclarations

Type checking steps:

1. Check that identNode.idname isnot already in the symbol table.

2. Enter identNode.idname intosymboltable withtype=typeNode.type andkind = Variable.

varDeclNode

identNode typeNode


377CS 536 Spring 2008 © 

Type Checking InitializedVariable Declarations

Type checking steps:

1. Check that identNode.idname isnot already in the symbol table.

2. Type check initial valueexpression.

3. Check that the initial value’s typeis typeNode.type

varDeclNode

identNode typeNode

expr tree

378CS 536 Spring 2008 © 

4. Check that the initial value’s kindis scalar ( Variable, Value orScalarParm ).

5. Enter identNode.idname intosymbol table withtype = typeNode.type and

kind = Variable.

379CS 536 Spring 2008 © 

Type Checking Const Decls

Type checking steps:

1. Check that identNode.idname is not already in the symbol table.

2. Type check the const value expr.

3. Check that the const value's kind is scalar (Variable, Value or ScalarParm).

4. Enter identNode.idname into the symbol table with type = constValue.type and kind = Value.

constDeclNode

identNode

expr tree

380CS 536 Spring 2008 © 

Type Checking IdentNodes

Type checking steps:

1. Lookup identNode.idname in thesymbol table; error if absent.

2. Copy symbol table entry’s typeand kind information into theidentNode.

3. Store a link to the symbol tableentry in the identNode (in casewe later need to access symboltable information).

identNode


381CS 536 Spring 2008 © 

Type Checking NameNodes

Type checking steps:

1. Type check the identNode.

2. If the subscriptVal is a nullnode, copy the identNode’stype and kind values into thenameNode and return.

3. Type check the subscriptVal.

4. Check that identNode’s kind is

an array.

nameNode

identNode

expr tree

382CS 536 Spring 2008 © 

5. Check that subscriptVal’s kindis scalar and type is integer orcharacter.

6. Set the nameNode’s type to theidentNode’s type and thenameNode’s kind to Variable.

383CS 536 Spring 2008 © 

Type Checking BinaryOperators

Type checking steps:

1. Type check left and right operands.

2. Check that left and right operands are both scalars.

3. binaryOpNode.kind = Value.

[AST: binaryOpNode with a left expr tree and a right expr tree as children]

384CS 536 Spring 2008 © 

4. If binaryOpNode.operator is a plus, minus, star or slash:
   (a) Check that left and right operands have an arithmetic type (integer or character).
   (b) binaryOpNode.type = Integer.

5. If binaryOpNode.operator is an and or an or:
   (a) Check that left and right operands have a boolean type.
   (b) binaryOpNode.type = Boolean.

6. If binaryOpNode.operator is a relational operator:
   (a) Check that both left and right operands have an arithmetic type or both have a boolean type.
   (b) binaryOpNode.type = Boolean.
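The steps above condense into a routine like the following sketch; the node fields, Types/Kinds constants and helper predicates are illustrative names, not necessarily those of the CSX skeleton:

// Sketch of steps 1-6 for a binaryOpNode.
void checkBinaryOp(BinaryOpNode n) {
   check(n.left);  check(n.right);                          // 1. type check both operands
   require(isScalar(n.left) && isScalar(n.right),           // 2. both operands must be scalars
           "operands of a binary operator must be scalar");
   n.kind = Kinds.Value;                                    // 3. the result is a value
   if (isArithmeticOp(n.op)) {                              // 4. + - * /
      require(isArithmetic(n.left) && isArithmetic(n.right),
              "arithmetic operator needs int or char operands");
      n.type = Types.Integer;
   } else if (isLogicalOp(n.op)) {                          // 5. && ||
      require(isBool(n.left) && isBool(n.right),
              "logical operator needs bool operands");
      n.type = Types.Boolean;
   } else {                                                 // 6. relational operators
      require((isArithmetic(n.left) && isArithmetic(n.right)) ||
              (isBool(n.left) && isBool(n.right)),
              "relational operator needs matching operand types");
      n.type = Types.Boolean;
   }
}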


385CS 536 Spring 2008 © 

Type Checking Assignments

Type checking steps:

1. Type check the nameNode.

2. Type check the expression tree.

3. Check that the nameNode's kind is assignable (Variable, Array, ScalarParm, or ArrayParm).

4. If the nameNode's kind is scalar then check that the expression tree's kind is also scalar and that both have the same type. Then return.

AST: an asgNode with children nameNode and an expr tree (the right-hand side).

386CS 536 Spring 2008 © 

5. If the nameNode's and the expression tree's kinds are both arrays and both have the same type, check that the arrays have the same length. (Lengths of array parms are checked at run-time.) Then return.

6. If the nameNode's kind is array and its type is character and the expression tree's kind is string, check that both have the same length. (Lengths of array parms are checked at run-time.) Then return.

7. Otherwise, the expression may not be assigned to the nameNode.
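A Java sketch of the whole assignment check follows; as before, the node fields, helper predicates and error routine are hypothetical placeholders.

// Hypothetical sketch: type checking an asgNode.
void checkAssign(asgNode n, SymbolTable st) {
   checkName(n.target, st);                                        // step 1
   check(n.source, st);                                            // step 2
   require(isAssignable(n.target.kind),                            // step 3
           "this name may not be assigned to");
   if (isScalar(n.target.kind)) {                                  // step 4
      require(isScalar(n.source.kind) && n.source.type == n.target.type,
              "scalar assignment needs matching scalar types");
   } else if (isArrayKind(n.target.kind) && isArrayKind(n.source.kind)
              && n.source.type == n.target.type) {                 // step 5
      checkLengthsIfKnown(n.target, n.source);   // array-parm lengths checked at run-time
   } else if (isArrayKind(n.target.kind) && n.target.type == Types.Character
              && n.source.kind == Kinds.String) {                  // step 6
      checkLengthsIfKnown(n.target, n.source);
   } else {
      typeError(n, "expression may not be assigned to this name"); // step 7
   }
}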

387CS 536 Spring 2008 © 

Type Checking While Loops

Type checking steps:

1. Type check the condition (an expr tree).

2. Check that the condition's type is Boolean and its kind is scalar.

3. If the label is a null node then type check the stmtNode (the loop body) and return.

AST: a whileNode with children identNode (the label), an expr tree (the condition) and stmtNode (the loop body).

388CS 536 Spring 2008 © 

4. If there is a label (an identNode) then:
(a) Check that the label is not already present in the symbol table.
(b) If it isn't, enter the label in the symbol table with kind = VisibleLabel and type = void.
(c) Type check the stmtNode (the loop body).
(d) Change the label's kind (in the symbol table) to HiddenLabel.
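The label bookkeeping in step 4 can be expressed directly, as in this hypothetical Java sketch; the label's kind is flipped to HiddenLabel after the body is checked, so code appearing after the loop sees a HiddenLabel rather than a usable target.

// Hypothetical sketch: type checking a (possibly labeled) whileNode.
void checkWhile(whileNode n, SymbolTable st) {
   check(n.condition, st);                                         // step 1
   require(n.condition.type == Types.Boolean && isScalar(n.condition.kind),
           "loop condition must be a scalar boolean");             // step 2
   if (n.label.isNull()) {                                         // step 3
      check(n.loopBody, st);
      return;
   }
   require(st.lookup(n.label.idname) == null,                      // step 4(a)
           "label is already declared");
   SymbolInfo lab =
      new SymbolInfo(n.label.idname, Types.Void, Kinds.VisibleLabel);
   st.insert(lab);                                                 // step 4(b)
   check(n.loopBody, st);                                          // step 4(c)
   lab.kind = Kinds.HiddenLabel;                                   // step 4(d)
}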


389CS 536 Spring 2008 © 

Type Checking Breaks and Continues

Type checking steps:

1. Check that the identNode is declared in the symbol table.

2. Check that the identNode's kind is VisibleLabel. If the identNode's kind is HiddenLabel, issue a special error message.

AST: a breakNode with child identNode (the label).

390CS 536 Spring 2008 © 

Type Checking Returns

It is useful to arrange that a static field named currentMethod will always point to the methodDeclNode of the method we are currently checking.

Type checking steps:

1. If returnVal is a null node, check that currentMethod.returnType is Void.

2. If returnVal (an expr) is not null then check that returnVal's kind is scalar and returnVal's type is currentMethod.returnType.

AST: a returnNode with an expr tree (the return value).
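A corresponding Java sketch, again with hypothetical names, makes the role of currentMethod explicit.

// Hypothetical sketch: type checking a returnNode.
void checkReturn(returnNode n, SymbolTable st) {
   if (n.returnVal.isNull()) {                                     // step 1
      require(currentMethod.returnType == Types.Void,
              "this method must return a value");
      return;
   }
   check(n.returnVal, st);                                         // step 2
   require(isScalar(n.returnVal.kind)
           && n.returnVal.type == currentMethod.returnType,
           "returned value has the wrong type or kind");
}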

391CS 536 Spring 2008 © 

Type Checking Method Declarations

Type checking steps:

1. Create a new symbol table entry m, with type = typeNode.type and kind = Method.

2. Check that identNode.idname is not already in the symbol table; if it isn't, enter m using identNode.idname.

3. Create a new scope in the symbol table.

4. Set currentMethod = this methodDeclNode.

AST: a methodDeclNode with children identNode, typeNode and the args, decls and stmts subtrees.

392CS 536 Spring 2008 © 

5. Type check the args subtree.

6. Build a list of the symbol table nodes corresponding to the args subtree; store it in m.

7. Type check the decls subtree.

8. Type check the stmts subtree.

9. Close the current scope at the top of the symbol table.


393CS 536 Spring 2008 © 

Type Checking Method Calls

We consider calls of procedures in a statement. Calls of functions in an expression are very similar.

Type checking steps:

1. Check that identNode.idname is declared in the symbol table. Its type should be Void and its kind should be Method.

2. Type check the args subtree.

3. Build a list of the expression nodes found in the args subtree.

AST: a callNode with children identNode and an args tree.

394CS 536 Spring 2008 © 

4. Get the list of parameter symbols declared for the method (stored in the method's symbol table entry).

5. Check that the arguments list and the parameter symbols list both have the same length.

6. Compare each argument node with its corresponding parameter symbol:
(a) Both should have the same type.
(b) A Variable, Value, or ScalarParm kind in an argument node matches a ScalarParm parameter. An Array or ArrayParm kind in an argument node matches an ArrayParm parameter.
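Steps 5 and 6 amount to walking two lists in parallel, as in this hypothetical Java sketch (exprNode, SymbolInfo and the helpers are stand-ins; java.util.List is the only real library type used).

// Hypothetical sketch: matching a call's argument list against the declared parameters.
void checkArgs(java.util.List<exprNode> args, java.util.List<SymbolInfo> params) {
   if (args.size() != params.size()) {                            // step 5
      typeError("wrong number of arguments");
      return;
   }
   for (int i = 0; i < args.size(); i++) {                        // step 6
      exprNode a = args.get(i);
      SymbolInfo p = params.get(i);
      require(a.type == p.type,                                   // 6(a)
              "argument " + (i + 1) + " has the wrong type");
      boolean kindOk = (p.kind == Kinds.ScalarParm)               // 6(b)
                       ? isScalar(a.kind)
                       : isArrayKind(a.kind);
      require(kindOk, "argument " + (i + 1) + " has the wrong kind");
   }
}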

395CS 536 Spring 2008 © 

Reading Assignment

Read Chapter 9 of Crafting a Compiler, 2nd Edition.

396CS 536 Spring 2008 © 

Virtual Memory & Run-Time Memory Organization

The compiler decides how data and instructions are placed in memory.

It uses an address space provided by the hardware and operating system.

This address space is usually virtual—the hardware and operating system map instruction-level addresses to “actual” memory addresses. Virtual memory allows:

• Multiple processes to run in private, protected address spaces.

• Paging can be used to extend address ranges beyond actual memory limits.


397CS 536 Spring 2008 © 

Run-Time Data Structures

Static Structures

For static structures, a fixed address is used throughout execution.

This is the oldest and simplest memory organization.

In current compilers, it is used for:

• Program code (often read-only & sharable).

• Data literals (often read-only & sharable).

• Global variables.

• Static variables.

398CS 536 Spring 2008 © 

Stack Allocation

Modern programming languages allow recursion, which requires dynamic allocation.

Each recursive call allocates a new copy of a routine's local variables. The number of local data allocations required during program execution is not known at compile-time.

To implement recursion, all the data space required for a method is treated as a distinct data area that is called a frame or activation record.

Local data, within a frame, is accessible only while a subprogram is active.

399CS 536 Spring 2008 © 

In mainstream languages like C, C++ and Java, subprograms must return in a stack-like manner—the most recently called subprogram will be the first to return.

A frame is pushed onto a run-time stack when a method is called (activated).

When it returns, the frame is popped from the stack, freeing the routine's local data.

As an example, consider the following C subprogram:

p(int a) {
   double b;
   double c[10];
   b = c[a] * 2.51;
}

400CS 536 Spring 2008 © 

Procedure p requires space for the parameter a as well as the local variables b and c.

It also needs space for control information, such as the return address.

The compiler records the space requirements of a method.

The offset of each data item relative to the start of the frame is stored in the symbol table.

The total amount of space needed, and thus the size of the frame, is also recorded.

Assume p's control information requires 8 bytes (this size is usually the same for all methods).

Assume parameter a requires 4 bytes, local variable b requires 8 bytes, and local array c requires 80 bytes.


401CS 536 Spring 2008 © 

Many machines require that word and doubleword data be aligned, so it is common to pad a frame so that its size is a multiple of 4 or 8 bytes.

This guarantees that at all times the top of the stack is properly aligned.

Here is p's frame:

Offset  0: Control Information
Offset  8: Space for a
Offset 12: Space for b
Offset 20: Space for c
           Padding
Total size = 104

402CS 536 Spring 2008 © 

Within p, each local data object is addressed by its offset relative to the start of the frame.

This offset is a fixed constant, determined at compile-time.

We normally store the start of the frame in a register, so each piece of data can be addressed as a (Register, Offset) pair, which is a standard addressing mode in almost all computer architectures.

For example, if register R points to the beginning of p's frame, variable b can be addressed as (R,12), with 12 actually being added to the contents of R at run-time, as memory addresses are evaluated.
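A compiler can assign these offsets with nothing more than a running counter, as in the following hypothetical Java sketch; the 8-byte control area and padding to a multiple of 8 mirror the example above.

// Hypothetical sketch: assigning frame offsets as locals are declared.
class FrameBuilder {
   private int nextOffset = 8;               // first 8 bytes hold control information

   // Returns the offset recorded in the symbol table for this item.
   int allocate(int sizeInBytes) {
      int offset = nextOffset;
      nextOffset += sizeInBytes;
      return offset;
   }

   // Total frame size, padded up to a multiple of 8 bytes.
   int frameSize() {
      return (nextOffset + 7) / 8 * 8;
   }
}
// For p: allocate(4) returns 8 (a), allocate(8) returns 12 (b),
// allocate(80) returns 20 (c), and frameSize() returns 104.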

403CS 536 Spring 2008 © 

Normally, the literal 2.51 of procedure p is not stored in p's frame because the values of local data that are stored in a frame disappear with it at the end of a call.

It is easier and more efficient to allocate literals in a static area, often called a literal pool or constant pool. Java uses a constant pool to store literals, type, method and interface information as well as class and field names.

404CS 536 Spring 2008 © 

Accessing Frames at Run-Time

During execution there can be many frames on the stack. When a procedure A calls a procedure B, a frame for B's local variables is pushed on the stack, covering A's frame. A's frame can't be popped off because A will resume execution after B returns.

For recursive routines there can be hundreds or even thousands of frames on the stack. All frames but the topmost represent suspended subroutines, waiting for a call to return.

The topmost frame is active; it is important to access it directly.

The active frame is at the top of the stack, so the stack top register could be used to access it.


405CS 536 Spring 2008 © 

The run-time stack may also be used to hold data other than frames.

It is unwise to require that the currently active frame always be at exactly the top of the stack.

Instead a distinct register, often called the frame pointer, is used to access the current frame.

This allows local variables to be accessed directly as offset + frame pointer, using the indexed addressing mode found on all modern machines.

406CS 536 Spring 2008 © 

Consider the following recursive function that computes factorials:

int fact(int n) {
   if (n > 1)
      return n * fact(n-1);
   else
      return 1;
}

407CS 536 Spring 2008 © 

The run-time stack corresponding to the call fact(3) (when the call of fact(1) is about to return) is:

(top of stack; the frame pointer points at this frame)
   fact(1): Return Value = 1 | Space for n = 1 | Control Information
   fact(2): Return Value | Space for n = 2 | Control Information
   fact(3): Return Value | Space for n = 3 | Control Information
(bottom of stack)

We place a slot for the function's return value at the very beginning of the frame.

Upon return, the return value is conveniently placed on the stack, just beyond the end of the caller's frame. Often compilers return scalar values in specially

408CS 536 Spring 2008 © 

designated registers, eliminating unnecessary loads and stores. For values too large to fit in a register (arrays or objects), the stack is used.

When a method returns, its frame is popped from the stack and the frame pointer is reset to point to the caller's frame.

In simple cases this is done by adjusting the frame pointer by the size of the current frame.


409CS 536 Spring 2008 © 

Reading Assignment

Read Chapter 11 of Crafting a Compiler, Second Edition.

410CS 536 Spring 2008 © 

Dynamic Links

Because the stack may contain more than just frames (e.g., function return values or registers saved across calls), it is common to save the caller's frame pointer as part of the callee's control information.

Each frame points to its caller's frame on the stack. This pointer is called a dynamic link because it links a frame to its dynamic (run-time) predecessor.

411CS 536 Spring 2008 © 

The run-time stack corresponding to a call of fact(3), with dynamic links included, is:

(top of stack; the frame pointer points at this frame)
   fact(1): Return Value = 1 | Space for n = 1 | Dynamic Link
   fact(2): Return Value | Space for n = 2 | Dynamic Link
   fact(3): Return Value | Space for n = 3 | Dynamic Link = Null
(bottom of stack)

412CS 536 Spring 2008 © 

Classes and Objects

C, C++ and Java do not allow procedures or methods to nest.

A procedure may not be declared within another procedure.

This simplifies run-time data access—all variables are either global or local.

Global variables are statically allocated. Local variables are part of a single frame, accessed through the frame pointer.

Java and C++ allow classes to have member functions that have direct access to instance variables.


413CS 536 Spring 2008 © 

Consider:

class K {
   int a;
   int sum() {
      int b;
      return a+b;
   }
}

Each object that is an instance of class K contains a member function sum. Only one translation of sum is created; it is shared by all instances of K.

When sum executes it needs two pointers to access local and object-level data.

Local data, as usual, resides in a frame on the run-time stack.

414CS 536 Spring 2008 © 

Data values for a particular instance of K are accessed through an object pointer (called the this pointer in Java and C++). When obj.sum() is called, it is given an extra implicit parameter: a pointer to obj.

When a+b is computed, b, a local variable, is accessed directly through the frame pointer. a, a member of object obj, is accessed indirectly through the object pointer that is stored in the frame (as all parameters to a method are).

Figure: sum's frame, at the top of the stack and referenced by the frame pointer, holds Control Information, Space for b and the Object Pointer; the Object Pointer refers to object obj, which holds Space for a.

415CS 536 Spring 2008 © 

C++ and Java also allow inheritance via subclassing. A new class can extend an existing class, adding new fields and adding or redefining methods.

A subclass D, of class C, may be used in contexts expecting an object of class C (e.g., in method calls).

This is supported rather easily—objects of class D always contain a class C object within them.

If C has a field F within it, so does D. The fields D declares are merely appended at the end of the allocations for C.

As a result, access to fields of C within a class D object works perfectly.

416CS 536 Spring 2008 © 

Handling Multiple Scopes

Many languages allow procedure declarations to nest. Java now allows classes to nest.

Procedure nesting can be very useful, allowing a subroutine to directly access another routine's locals and parameters.

Run-time data structures are complicated because multiple frames, corresponding to nested procedure declarations, may need to be accessed.


417CS 536 Spring 2008 © 

To see the difficulties, assume that routines can nest in Java or C:

int p(int a){
   int q(int b){
      if (b < 0)
         return q(-b);
      else
         return a+b;
   }
   return q(-10);
}

When q executes, it may access not only its own frame, but also that of p, in which it is nested.

If the depth of nesting is unlimited, so is the number of frames that must be accessible. In practice, the level of nesting actually seen is modest—usually no greater than two or three.

418CS 536 Spring 2008 © 

Static Links

Two approaches are commonly used to support access to multiple frames. One approach generalizes the idea of dynamic links introduced earlier. Along with a dynamic link, we'll also include a static link in the frame's control information area. The static link points to the frame of the procedure that statically encloses the current procedure. If a procedure is not nested within any other procedure, its static link is null.

419CS 536 Spring 2008 © 

The following illustrates static links:

(top of stack; the frame pointer points at this frame)
   q: Space for b = 10 | Dynamic Link | Static Link (to p's frame)
   q: Space for b = -10 | Dynamic Link | Static Link (to p's frame)
   p: Space for a | Dynamic Link = Null | Static Link = Null
(bottom of stack)

As usual, dynamic links always point to the next frame down in the stack. Static links always point down, but they may skip past many frames. They always point to the most recent frame of the routine that statically encloses the current routine.

420CS 536 Spring 2008 © 

In our example, the static links of both of q's frames point to p, since it is p that encloses q's definition.

In evaluating the expression a+b that q returns, b, being local to q, is accessed directly through the frame pointer. Variable a is local to p, but also visible to q because q nests within p. a is accessed by extracting q's static link, then using that address (plus the appropriate offset) to access a.
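The code generator only needs to know how many static links to follow, which is just the difference between the nesting depth of the use and the nesting depth of the declaration; the Java sketch below is hypothetical and ignores the details of emitting the actual loads.

// Hypothetical sketch: addressing a variable through static links.
// Returns {number of static links to follow, offset within that frame}.
static int[] nonLocalAddress(int useDepth, int declDepth, int varOffset) {
   int linksToFollow = useDepth - declDepth;   // 0 means the variable is local
   return new int[] { linksToFollow, varOffset };
}
// In the example, q (depth 1) reaching a (declared in p, depth 0) follows
// one static link and then adds a's offset within p's frame.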


421CS 536 Spring 2008 © 

Displays

An alternative to using static links to access frames of enclosing routines is the use of a display.

A display generalizes our use of a frame pointer. Rather than maintaining a single register, we maintain a set of registers which comprise the display.

If procedure definitions nest n deep (this can be easily determined by examining a program's AST), we need n+1 display registers.

Each procedure definition is tagged with a nesting level. Procedures not nested within any other routine are at level 0. Procedures nested within only one routine are at level 1, etc.

422CS 536 Spring 2008 © 

Frames for routines at level 0 are always accessed using display register D0. Those at level 1 are always accessed using register D1, etc.

Whenever a procedure r is executing, we have direct access to r's frame plus the frames of all routines that enclose r. Each of these routines must be at a different nesting level, and hence will use a different display register.

423CS 536 Spring 2008 © 

The following illustrates the use of display registers:

(top of stack; display register D1 points at this frame)
   q: Space for b = 10 | Dynamic Link | Previous D1
   q: Space for b = -10 | Dynamic Link | Previous D1
   p: Space for a | Dynamic Link = Null | Previous D0   <-- display register D0
(bottom of stack)

Since q is at nesting level 1, its frame is pointed to by D1. All of q's local variables, including b, are at a fixed offset relative to D1.

Since p is at nesting level 0, its frame and local variables are accessed via D0. Each frame's control information area contains a slot for the previous value of the frame's display register. A display register is saved when a call

424CS 536 Spring 2008 © 

begins and restored when the call ends. A dynamic link is still needed, because the previous display value doesn't always point to the caller's frame.

Not all compiler writers agree on whether static links or displays are better to use. Displays allow direct access to all frames, and thus make access to all visible variables very efficient. However, if nesting is deep, several valuable registers may need to be reserved. Static links are very flexible, allowing unlimited nesting of procedures. However, access to non-local procedure variables can be slowed by the need to extract and follow static links.


425CS 536 Spring 2008 © 

Heap Management

A very flexible storage allocation mechanism is heap allocation.

Any number of data objects can be allocated and freed in a memory pool, called a heap. Heap allocation is enormously popular. Almost all non-trivial Java and C programs use new or malloc.

426CS 536 Spring 2008 © 

Heap Allocation

A request for heap space may be explicit or implicit.

An explicit request involves a call to a routine like new or malloc.

An explicit pointer to the newly allocated space is returned.

Some languages allow the creation of data objects of unknown size. In Java, the + operator is overloaded to represent string catenation.

The expression Str1 + Str2 creates a new string representing the catenation of strings Str1 and Str2. There is no compile-time bound on the sizes of Str1 and Str2, so heap space must be implicitly allocated to hold the newly created string.

427CS 536 Spring 2008 © 

Whether allocation is explicit or implicit, a heap allocator is needed. This routine takes a size parameter and examines unused heap space to find space that satisfies the request.

A heap block is returned. This block must be big enough to satisfy the space request, but it may well be bigger.

Heap blocks contain a header field that contains the size of the block as well as bookkeeping information.

The complexity of heap allocation depends in large measure on how deallocation is done.

Initially, the heap is one large block of unallocated memory. Memory requests can be satisfied by simply modifying an “end of

428CS 536 Spring 2008 © 

heap” pointer, very much as a stack is pushed by modifying a stack pointer.

Things get more involved when previously allocated heap objects are deallocated and reused.

Deallocated objects are stored for future reuse on a free space list.

When a request for n bytes of heap space is received, the heap allocator must search the free space list for a block of sufficient size. There are many search strategies that might be used:

• Best Fit
The free space list is searched for the free block that matches most closely the requested size. This minimizes wasted heap space, but the search may be quite slow.


429CS 536 Spring 2008 © 

• First Fit
The first free heap block of sufficient size is used. Unused space within the block is split off and linked as a smaller free space block. This approach is fast, but may “clutter” the beginning of the free space list with a number of blocks too small to satisfy most requests.

• Next Fit
This is a variant of first fit in which succeeding searches of the free space list begin at the position where the last search ended. The idea is to “cycle through” the entire free space list rather than always revisiting free blocks at the head of the list.

430CS 536 Spring 2008 © 

• Segregated Free Space Lists
There is no reason why we must have only one free space list. An alternative is to have several, indexed by the size of the free blocks they contain.
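As a concrete illustration of first fit, here is a small, self-contained Java sketch that keeps the free list as a list of block sizes; a real allocator would track addresses and headers as well, so treat this purely as an illustration of the search and split.

// Hypothetical sketch: first-fit search of a free-space list.
import java.util.LinkedList;

class FreeList {
   private final LinkedList<Integer> blocks = new LinkedList<>();  // free block sizes

   void free(int size) { blocks.add(size); }

   // Returns the size of the block actually used, or -1 if none is big enough.
   int allocateFirstFit(int request) {
      for (int i = 0; i < blocks.size(); i++) {
         int size = blocks.get(i);
         if (size >= request) {
            blocks.remove(i);                   // take the first sufficiently large block
            if (size > request)
               blocks.add(size - request);      // split off the unused part
            return size;
         }
      }
      return -1;                                // no block large enough
   }
}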

431CS 536 Spring 2008 © 

Deallocation Mechanisms

Allocating heap space is fairly easy. But how do we deallocate heap memory no longer in use?

Sometimes we may never need to deallocate! If heap objects are allocated infrequently or are very long-lived, deallocation is unnecessary. We simply fill heap space with “in use” objects.

Virtual memory & paging may allow us to allocate a very large heap area.

On a 64-bit machine, if we allocate heap space at 1 MB/sec, it will take 500,000 years to span the entire address space! Fragmentation of a very large heap space commonly forces us to include some form of reuse of heap space.

432CS 536 Spring 2008 © 

User-controlled Deallocation

Deallocation can be manual or automatic. Manual deallocation involves explicit programmer-initiated calls to routines like free(p) or delete(p).

The object is then added to a free-space list for subsequent reallocation.

It is the programmer's responsibility to free unneeded heap space by executing deallocation commands. The heap manager merely keeps track of freed space and makes it available for later reuse.

The really hard decision—when space should be freed—is shifted to the programmer, possibly leading to catastrophic dangling pointer errors.


433CS 536 Spring 2008 © 

Consider the following C program fragment:

q = p = malloc(1000);
free(p);
/* code containing more malloc's */
q[100] = 1234;

After p is freed, q is a dangling pointer. q points to heap space that is no longer considered allocated.

Calls to malloc may reassign the space pointed to by q. Assignment through q is illegal, but this error is almost never detected.

Such an assignment may change data that is now part of another heap object, leading to very subtle errors. It may even change a header field or a free-space link, causing the heap allocator itself to fail!

434CS 536 Spring 2008 © 

Automatic Garbage Collection

The alternative to manual deallocation of heap space is garbage collection.

Compiler-generated code tracks pointer usage. When a heap object is no longer pointed to, it is garbage, and is automatically collected for subsequent reuse.

Many garbage collection techniques exist. Here are some of the most important approaches:

435CS 536 Spring 2008 © 

Reference Counting

This is one of the oldest and simplest garbage collection techniques.

A reference count field is added to each heap object. It counts how many references to the heap object exist. When an object's reference count reaches zero, it is garbage and may be collected.

The reference count field is updated whenever a reference is created, copied, or destroyed. When a reference count reaches zero and an object is collected, all pointers in the collected object are also followed and the corresponding reference counts decremented.
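The bookkeeping can be seen in this self-contained Java sketch of a pointer-field assignment; the class is purely illustrative, not how a real collector represents heap objects.

// Hypothetical sketch: reference counting on pointer assignment.
import java.util.ArrayList;
import java.util.List;

class RCObject {
   int refCount = 0;
   List<RCObject> fields = new ArrayList<>();

   // Overwrite target.fields[slot] with newRef, keeping counts correct.
   static void assign(RCObject target, int slot, RCObject newRef) {
      if (newRef != null) newRef.refCount++;        // count the new reference first
      RCObject old = target.fields.set(slot, newRef);
      if (old != null && --old.refCount == 0)
         reclaim(old);                              // last reference gone
   }

   static void reclaim(RCObject dead) {
      for (RCObject child : dead.fields)            // follow pointers in the dead object
         if (child != null && --child.refCount == 0)
            reclaim(child);
      // dead's storage would now be returned to the free-space list
   }
}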

436CS 536 Spring 2008 © 

As shown below, reference counting has difficulty with circular structures. If pointer P is set to null, the object's reference count is reduced to 1. Both objects have a non-zero count, but neither is accessible through any external pointer. The two objects are garbage, but won't be recognized as such.

If circular structures are common, then an auxiliary technique, like mark-sweep collection, is needed to collect garbage that reference counting misses.

Figure: two heap objects, each with Link and Data fields, whose Link fields point to each other; global pointer P points to the first object (Reference Count = 2), while the second object has Reference Count = 1.


437CS 536 Spring 2008 © 

Mark-Sweep Collection

Many collectors, including mark & sweep, do nothing until heap space is nearly exhausted.

Then the collector executes a marking phase that identifies all live heap objects.

Starting with global pointers and pointers in stack frames, it marks reachable heap objects. Pointers in marked heap objects are also followed, until all live heap objects are marked.

After the marking phase, any object not marked is garbage that may be freed. We then sweep through the heap, collecting all unmarked objects. During the sweep phase we also clear all marks from heap objects found to be still in use.
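The two phases can be sketched in a few lines of Java over a simplified heap representation; the classes here are hypothetical and ignore issues such as finding the root pointers.

// Hypothetical sketch: mark and sweep over a simplified heap.
import java.util.ArrayList;
import java.util.List;

class MSObject {
   boolean marked = false;
   List<MSObject> pointers = new ArrayList<>();
}

class MarkSweep {
   // Marking phase: trace from each root (globals, stack frames).
   static void mark(MSObject o) {
      if (o == null || o.marked) return;
      o.marked = true;
      for (MSObject p : o.pointers) mark(p);
   }

   // Sweep phase: unmarked objects are garbage; clear marks on live ones.
   static List<MSObject> sweep(List<MSObject> heap) {
      List<MSObject> free = new ArrayList<>();
      for (MSObject o : heap) {
         if (o.marked) o.marked = false;
         else free.add(o);                  // would go on the free-space list
      }
      return free;
   }
}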

438CS 536 Spring 2008 © 

Mark-sweep garbage collection is illustrated below.

Figure: a row of heap objects; global pointers point to Object 1 and Object 3, and an internal pointer in Object 3 points to Object 5. The remaining (shaded) objects are unreachable.

Objects 1 and 3 are marked because they are pointed to by global pointers. Object 5 is marked because it is pointed to by object 3, which is marked. Shaded objects are not marked and will be added to the free-space list.

In any mark-sweep collector, it is vital that we mark all accessible heap objects. If we miss a pointer, we may fail to mark a live heap object and later incorrectly free it.

439CS 536 Spring 2008 © 

Finding all pointers is a bit tricky in languages like Java, C and C++, that have pointers mixed with other types within data structures, implicit pointers to temporaries, and so forth. Considerable information about data structures and frames must be available at run-time for this purpose. In cases where we can't be sure if a value is a pointer or not, we may need to do conservative garbage collection.

In mark-sweep garbage collection all heap objects must be swept. This is costly if most objects are dead. We'd prefer to examine only live objects.

440CS 536 Spring 2008 © 

Compaction

After the sweep phase, live heap objects are distributed throughout the heap space. This can lead to poor locality. If live objects span many memory pages, paging overhead may be increased. Cache locality may be degraded too.

We can add a compaction phase to mark-sweep garbage collection.

After live objects are identified, they are placed together at one end of the heap. This involves another tracing phase in which global, local and internal heap pointers are found and adjusted to reflect the object's new location.

Pointers are adjusted by the total size of all garbage objects


441CS 536 Spring 2008 © 

between the start of the heap and the current object. This is illustrated below:

Figure: after compaction, Objects 1, 3 and 5 sit together at the start of the heap; the global pointers and Object 3's internal pointer have been adjusted to the objects' new locations.

Compaction merges together freed objects into one large block of free heap space. Fragments are no longer a problem.

Moreover, heap allocation is greatly simplified. Using an “end of heap” pointer, whenever a heap request is received, the end of heap pointer is adjusted, making heap allocation no more complex than stack allocation.

442CS 536 Spring 2008 © 

Because pointers are adjusted, compaction may not be suitable for languages like C and C++, in which it is difficult to unambiguously identify pointers.

443CS 536 Spring 2008 © 

Copying Collectors

Compaction provides many valuable benefits. Heap allocation is simple and efficient. There is no fragmentation problem, and because live objects are adjacent, paging and cache behavior is improved.

An entire family of garbage collection techniques, called copying collectors, is designed to integrate copying with recognition of live heap objects. Copying collectors are very popular and are widely used.

Consider a simple copying collector that uses semispaces. We start with the heap divided into two halves—the from and to spaces.

444CS 536 Spring 2008 © 

Initially, we allocate heap requests from the from space, using a simple “end of heap” pointer. When the from space is exhausted, we stop and do garbage collection.

Actually, though, we don't collect garbage. We collect live heap objects—garbage is never touched.

We trace through global and local pointers, finding live objects. As each object is found, it is moved from its current position in the from space to the next available position in the to space.

The pointer is updated to reflect the object's new location. A “forwarding pointer” is left in the object's old location in case there are multiple pointers to the same object.
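The forwarding-pointer trick can be sketched as follows in Java; the object representation is hypothetical, and a real collector would copy raw storage rather than build new objects.

// Hypothetical sketch: copying a reachable object into to-space.
import java.util.ArrayList;
import java.util.List;

class CopyObject {
   CopyObject forwardedTo = null;            // set once the object has been moved
   List<CopyObject> pointers = new ArrayList<>();
}

class Copier {
   final List<CopyObject> toSpace = new ArrayList<>();

   CopyObject copy(CopyObject o) {
      if (o == null) return null;
      if (o.forwardedTo != null)             // already moved: reuse the new copy
         return o.forwardedTo;
      CopyObject moved = new CopyObject();
      o.forwardedTo = moved;                 // leave a forwarding pointer behind
      toSpace.add(moved);
      for (CopyObject p : o.pointers)        // move everything it points to as well
         moved.pointers.add(copy(p));
      return moved;
   }
}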


449CS 536 Spring 2008 © 

very fast at the expense of possibly wasted space.

If we have reason to believe that the time between garbage collections will be greater than the average lifetime of most heap objects, we can improve our use of heap space. Assume that 50% or more of the heap will be garbage when the collector is called. We can then divide the heap into 3 segments, which we'll call A, B and C. Initially, A and B will be used as the from space, utilizing 2/3 of the heap. When we copy live objects, we'll copy them into segment C, which will be big enough if half or more of the heap objects are garbage.

Then we treat C and A as the from space, using B as the to space for the next collection. If we are

450CS 536 Spring 2008 © 

unlucky and more than 1/2 the heap contains live objects, we can still get by. Excess objects are copied onto an auxiliary data space (perhaps the stack), then copied into A after all live objects in A have been moved. This slows collection down, but only rarely (if our estimate of 50% garbage per collection is sound). Of course, this idea generalizes to more than 3 segments. Thus if 2/3 of the heap were garbage (on average), we could use 3 of 4 segments as from space and the last segment as to space.

451CS 536 Spring 2008 © 

Generational Techniques

The great strength of copying collectors is that they do no work for objects that are born and die between collections. However, not all heap objects are so short-lived. In fact, some heap objects are very long-lived. For example, many programs create a dynamic data structure at their start, and utilize that structure throughout the program. Copying collectors handle long-lived objects poorly. They are repeatedly traced and moved between semispaces without any real benefit.

Generational garbage collection techniques were developed to better handle objects with varying lifetimes. The heap is divided into two or more generations, each

452CS 536 Spring 2008 © 

with its own to and from space.

New objects are allocated in the youngest generation, which is collected most frequently. If an object survives across one or more collections of the youngest generation, it is “promoted” to the next older generation, which is collected less often. Objects that survive one or more collections of this generation are then moved to the next older generation. This continues until very long-lived objects reach the oldest generation, which is collected very infrequently (perhaps even never).

The advantage of this approach is that long-lived objects are “filtered out,” greatly reducing the cost of repeatedly processing them. Of course, some long-lived


457CS 536 Spring 2008 © 

The basic idea is simple—if we can't be sure whether a value is a pointer or not, we'll be conservative and assume it is a pointer. If what we think is a pointer isn't, we may retain an object that's really dead, but we'll find all valid pointers, and never incorrectly collect a live object. We may mistake an integer (or a floating value, or even a string) as a pointer, so compaction in any form can't be done. However, mark-sweep collection will work.

Garbage collectors that work with ordinary C programs have been developed. User programs need not be modified. They simply are linked to different library routines, so that malloc and free properly support the garbage collector. When new heap space is

458CS 536 Spring 2008 © 

required, dead heap objects may be automatically collected, rather than relying entirely on explicit free commands (though frees are allowed; they sometimes simplify or speed heap reuse).

With garbage collection available, C programmers need not worry about explicit heap management. This reduces programming effort and eliminates errors in which objects are prematurely freed, or perhaps never freed. In fact, experiments have shown that conservative garbage collection is very competitive in performance with application-specific manual heap management.

459CS 536 Spring 2008 © 

Jump Code

The JVM code we generate for the following if statement is quite simple and efficient.

if (B) A = 1;
else A = 0;

    iload 2    ; Push local #2 (B) onto stack
    ifeq L1    ; Goto L1 if B is 0 (false)
    iconst_1   ; Push literal 1 onto stack
    istore 1   ; Store stk top into local #1 (A)
    goto L2    ; Skip around else part
L1: iconst_0   ; Push literal 0 onto stack
    istore 1   ; Store stk top into local #1 (A)
L2:

460CS 536 Spring 2008 © 

In contrast, the code generated for

if (F == G) A = 1;
else A = 0;

(where F and G are local variables of type integer) is significantly more complex:

    iload 4        ; Push local #4 (F) onto stack
    iload 5        ; Push local #5 (G) onto stack
    if_icmpeq L1   ; Goto L1 if F == G
    iconst_0       ; Push 0 (false) onto stack
    goto L2        ; Skip around next instruction
L1: iconst_1       ; Push 1 (true) onto the stack
L2: ifeq L3        ; Goto L3 if F==G is 0 (false)
    iconst_1       ; Push literal 1 onto stack
    istore 1       ; Store top into local #1 (A)
    goto L4        ; Skip around else part
L3: iconst_0       ; Push literal 0 onto stack
    istore 1       ; Store top into local #1 (A)
L4:


461CS 536 Spring 2008 © 

The problem is that in the JVM relational operators don't store a boolean value (0 or 1) onto the stack. Rather, instructions like if_icmpeq do a conditional branch.

So we branch to a push of 0 or 1 just so we can test the value and do a second conditional branch to the else part of the conditional.

Why did the JVM designers create such an odd way of evaluating relational operators?

A moment's reflection shows that we rarely actually want the value of a relational or logical expression. Rather, we usually

462CS 536 Spring 2008 © 

only want to do a conditional branch based on the expression's value in the context of a conditional or looping statement.

Jump code is an alternative representation of boolean values. Rather than placing a boolean value directly on the stack, we generate a conditional branch to either a true label or a false label. These labels are defined at the places where we wish execution to proceed once the boolean expression's value is known.

463CS 536 Spring 2008 © 

Returning to our previous example, we can generate F==G in jump code form as

    iload 4        ; Push local #4 (F) onto stack
    iload 5        ; Push local #5 (G) onto stack
    if_icmpne L1   ; Goto L1 if F != G

The label L1 is the “false label.” We branch to it if the expression F == G is false; otherwise, we fall through, executing the code that follows. We can then generate the then part, defining L1 at the point where the else part is to be computed. The code we generate is

464CS 536 Spring 2008 © 

    iload 4        ; Push local #4 (F) onto stack
    iload 5        ; Push local #5 (G) onto stack
    if_icmpne L1   ; Goto L1 if F != G
    iconst_1       ; Push literal 1 onto stack
    istore 1       ; Store top into local #1 (A)
    goto L2        ; Skip around else part
L1: iconst_0       ; Push literal 0 onto stack
    istore 1       ; Store top into local #1 (A)
L2:

This instruction sequence is significantly shorter (and faster) than our original translation. Jump code is routinely used in ifs, whiles and fors where we wish to alter flow-of-control rather than compute an explicit boolean value.


465CS 536 Spring 2008 © 

Jump code comes in two forms, JumpIfTrue and JumpIfFalse.

In JumpIfTrue form, the code sequence does a conditional jump (branch) if the expression is true, and “falls through” if the expression is false. Analogously, in JumpIfFalse form, the code sequence does a conditional jump (branch) if the expression is false, and “falls through” if the expression is true. We have two forms because different contexts prefer one or the other.

It is important to emphasize that even though jump code looks unusual, it is just an alternative representation of boolean values. We can convert

466CS 536 Spring 2008 © 

a boolean value on the stack to jump code by conditionally branching on its value to a true or false label.

Similarly, we convert from jump code to an explicit boolean value by placing the jump code's true label at a load of 1 and the false label at a load of 0.

467CS 536 Spring 2008 © 

Short-Circuit Evaluation

Our translation of the && and || operators parallels that of all other binary operators: evaluate both operands onto the stack and then do an “and” or “or” operation.

But in C, C++, C#, Java (and most other languages), && and || are handled specially.

These two operators are defined to work in “short circuit” mode. That is, if the left operand is sufficient to determine the result of the operation, the right operand isn't evaluated.

In particular, a&&b is defined as if a then b else false.

468CS 536 Spring 2008 © 

Similarly, a||b is defined as if a then true else b.

The conditional evaluation of the second operand isn't just an optimization—it's essential for correctness. For example, in (a!=0)&&(b/a>100) we would perform a division by zero if the right operand were evaluated when a==0.

Jump code meshes nicely with the short-circuit definitions of && and ||, since they are already defined in terms of conditional branches.

In particular, if exp1 and exp2 are in jump code form, then we need generate no further code to evaluate exp1&&exp2.
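In a jump-code code generator, && needs almost no code of its own: both operands branch to the same false label, in the style of the cg() routines used elsewhere in these notes. The Java sketch below is hypothetical (genJumpIfFalse is not an actual project routine).

// Hypothetical sketch: generating exp1 && exp2 in JumpIfFalse form.
void cgAndInJumpCode(exprNode exp1, exprNode exp2, String falseLabel) {
   genJumpIfFalse(exp1, falseLabel);   // if exp1 is false, the whole && is false
   genJumpIfFalse(exp2, falseLabel);   // exp2 is evaluated only if exp1 was true
   // falling through both tests means the && is true
}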


473CS 536 Spring 2008 © 

Otherwise, we continue evaluating the control expression.

We next test B. If it is greater than or equal to zero, B<0 is false, and so is the whole expression. We therefore branch to label F and execute the else part.

Otherwise, we finally test C. If C is not equal to 10, the control expression is false, so we branch to label F and execute the else part.

If C is equal to 10, the control expression is true, and we fall through to the then part.

474CS 536 Spring 2008 © 

For Loops

For loops are translated much like while loops.

The AST for a for loop adds subtrees corresponding to loop initialization and increment.

For loops are expected to iterate many times. Therefore, after executing the loop initialization, we skip past the loop body and increment code to reach the termination condition, which is placed at the bottom of the loop.

AST: a forNode with children initializer (a Stmt), condition (an Exp), increment (a Stmt) and loopBody (Stmts).

475CS 536 Spring 2008 © 

   {Initialization code}
   goto L1
L2:
   {Code for loop body}
   {Increment code}
L1:
   {Condition code}
   ifne L2          ; branch to L2 if true

cg() { // for forLoopNode
   String skip = genLab();
   String top = genLab();
   initializer.cg();
   branch(skip);
   defineLab(top);
   loopBody.cg();
   increment.cg();
   defineLab(skip);
   condition.cg();
   branchNZ(top);
}

476CS 536 Spring 2008 © 

As an example, consider this loop (i and j are locals with variable indices of 1 and 2):

for (i=100; i!=0; i--) {
   j = i;
}

The JVM code we generate is

    bipush 100   ; Push 100
    istore 1     ; Store into #1 (i)
    goto L1      ; Skip to exit test
L2:
    iload 1      ; Push local #1 (i)
    istore 2     ; Store into #2 (j)
    iload 1      ; Push local #1 (i)
    iconst_1     ; Push 1
    isub         ; Compute i-1
    istore 1     ; Store i-1 into #1 (i)
L1:
    iload 1      ; Push local #1 (i)
    ifne L2      ; Goto L2 if i is != 0


477CS 536 Spring 2008 © 

Java, C# and C++ allow a local declaration of a loop index as part of initialization, as illustrated by the following for loop:

for (int i=100; i!=0; i--) {
   j = i;
}

Local declarations are automatically handled during code generation for the initialization expression. A local variable is declared within the current frame with a scope limited to the body of the loop. Otherwise translation is identical.