24
1 CMPSC 160 Translation of Programming Languages Fall 2002 Instructor: Hugh McGuire slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon Lecture-Module 1 Introduction to Compilers: An Overview

1 CMPSC 160 Translation of Programming Languages Fall 2002 Instructor: Hugh McGuire slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon

Embed Size (px)

Citation preview

Page 1: 1 CMPSC 160 Translation of Programming Languages Fall 2002 Instructor: Hugh McGuire slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon

1

CMPSC 160Translation of Programming Languages

Fall 2002Instructor: Hugh McGuire

slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon

Lecture-Module 1

Introduction to Compilers: An Overview

Page 2: 1 CMPSC 160 Translation of Programming Languages Fall 2002 Instructor: Hugh McGuire slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon

2

CMPSC 160 - Fall 2002

• CMPSC160 - Translation of Programming Languages

– Course Description: Study of the structure of compilers. Topics include: lexical analysis; syntax analysis including LL and LR parsers; type checking; run-time environments; intermediate code generation; and compiler-construction tools.

• Instructor: Hugh McGuire – Office: PHELP 1409B, phone: 893-5412– Office Hours: M 9:00—10:30, W 10:30—12:00

• Teaching Assistants:

– – – ________ ________– Office Hours: T.B.A.; @CSIL

• Grading (tentatively):– Homeworks 15%, Project 35%, Midterm 20%, Final 30%

Page 3: 1 CMPSC 160 Translation of Programming Languages Fall 2002 Instructor: Hugh McGuire slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon

3

CMPSC 160 - Fall 2001

• Information about projects, homeworks, etc. are on the class web page– http://www.cs.ucsb.edu/~mcguire/teaching/160/

• Send questions to

[email protected]• Textbook: “Compilers: Principles, Techniques and Tools”

– Authors: Aho, Sethi, and Ullman

– This is a good reference book which covers the fundamental techniques in depth

• Reader: containing printouts of these lecture-notes

– Will be available in a little less than a week at campus bookstore

Page 4: 1 CMPSC 160 Translation of Programming Languages Fall 2002 Instructor: Hugh McGuire slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon

4

Class Work

• Exams

– Mid-term (20%) and final (30%)

– In class, closed book, closed notes

– Material from the lectures and the textbook

• Homeworks (15%)

– Several homeworks

– On material from the lectures and the textbook

– Good practice for the exams

• Project (35%)

– Four part project

– A compiler for a small subset of Java written in Java

– You will learn and use compiler construction tools such as Jlex and JavaCUP

Page 5: 1 CMPSC 160 Translation of Programming Languages Fall 2002 Instructor: Hugh McGuire slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon

5

Topics

• Overview of compilers

• Lexical analysis (Scanning)

• Syntactic analysis (Parsing)

• Context-sensitive analysis

• Type checking

• Runtime environments

• Symbol tables

• Intermediate representations

• Intermediate code generation

• Code optimization

Page 6: 1 CMPSC 160 Translation of Programming Languages Fall 2002 Instructor: Hugh McGuire slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon

6

Compilers

• What is a compiler?– A program that translates a program in one language (source

language) into an equivalent program in another language (target language), and reports errors in the source program

• A compiler typically lowers the level of abstraction of the programC assembly code for Sun SparcJava Java bytecode

• What is an interpreter?

– A program that reads an executable program and produces the results of executing that program

• C is typically compiled

• Scheme is typically interpreted

• Java is compiled to bytecodes, which are then interpreted

Page 7: 1 CMPSC 160 Translation of Programming Languages Fall 2002 Instructor: Hugh McGuire slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon

7

Why build compilers?

• Compilers provide an essential interface between applications and architectures

• High level programming languages:

– Increase programmer productivity

– Better maintenance

– Portable

• Low level machine details:

– Instruction selection

– Addressing modes

– Pipelines

– Registers and cache

• Compilers efficiently bridge the gap and shield the application developers from low level machine details

Page 8: 1 CMPSC 160 Translation of Programming Languages Fall 2002 Instructor: Hugh McGuire slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon

8

Why study compilers?

• Compilers embody a wide range of theoretical techniques and their application to practice– DFAs, PDAs, formal languages, formal grammars, fixpoints

algorithms, lattice theory• Compiler construction teaches programming and software engineering

skills• Compiler construction involves a variety of areas

– theory, algorithms, systems, architecture• The techniques used in various parts of compiler construction are

useful in a wide variety of applications– Many practical applications have embedded languages,

commands, macros, etc.• Is compiler construction a solved problem?

– No! New developments in programming languages (Java) and machine architectures (RISC processors) present new challenges

Page 9: 1 CMPSC 160 Translation of Programming Languages Fall 2002 Instructor: Hugh McGuire slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon

9

Desirable Properties of Compilers

• Compiler must generate a correct executable

– The input program and the output program must be equivalent, the compiler should preserve the meaning of the input program

• Output program should run fast

– For optimizing compilers we expect the output program to be more efficient than the input program

• Compiler itself should be fast

• Compiler should provide good diagnostics for programming errors

• Compiler should support separate compilation

• Compiler should work well with debuggers

• Optimizations should be consistent and predictable

• Compile time should be proportional to code size

Page 10: 1 CMPSC 160 Translation of Programming Languages Fall 2002 Instructor: Hugh McGuire slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon

10

What are the issues in compiler construction?

• Source code

– written in a high level programming language

//simple example

while (sum < total)

{

sum = sum + x*10;

}

• Target code– Assembly language (chapter 9)

which in turn is translated to machine code

L1:MOV total,R0CMP sum,R0CJ< L2

GOTO L3L2:MOV #10,R0

MUL x,R0ADD sum,R0MOV R0,sumGOTO L1

L3:first instructionfollowing the while

statement

Page 11: 1 CMPSC 160 Translation of Programming Languages Fall 2002 Instructor: Hugh McGuire slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon

11

What is the input?

• Input to the compiler is not

//simple examplewhile (sum < total) { sum = sum + x*10;}

• Input to the compiler is

//simple\bexample\nwhile\b(sum\b<\btotal)\b{\n\tsum\b=\bsum\b+\bx*10;\n}\n

• How does the compiler recognize the keywords, identifiers, the structure etc.?

Page 12: 1 CMPSC 160 Translation of Programming Languages Fall 2002 Instructor: Hugh McGuire slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon

12

First step: Lexical analysis (Scanning)

• The compiler scans the input file and produces a stream of tokens

WHILE,LPAREN,<ID,sum>,LT,<ID,total>,RPAREN,LBRACE,

<ID,sum>,EQ,<ID,sum>,PLUS,<ID,x>,TIMES,<NUM,10>,

SEMICOL,RBRACE

• Each token has a corresponding lexeme, the character string that corresponds to the token

– For example “while” is the lexeme for token WHILE– “sum”, “x”, “total” are lexemes for token ID

Page 13: 1 CMPSC 160 Translation of Programming Languages Fall 2002 Instructor: Hugh McGuire slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon

13

Lexical Analysis (Scanning)

• Compiler uses a set of patterns to specify valid tokens

– tokens: LPAREN, ID, NUM, WHILE, etc.

• Each pattern is specified as a regular expression

– LPAREN should match: (– WHILE should match: while– ID should match: [a-zA-Z][0-9a-zA-Z]*

• It uses finite automata to recognize these patterns

a-zA-Z 0-9a-zA-Z

ID automaton

Page 14: 1 CMPSC 160 Translation of Programming Languages Fall 2002 Instructor: Hugh McGuire slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon

14

Lexical analysis (Scanning)

• During the scan the lexical analyzer gets rid of the white space (\b,\t,\n, etc.) and comments

• Important additional task: Error messages!– var&1 Error! Not a token! – whle Error? It matches the identifier token.

• Natural language analogy: Tokens correspond to words and punctuation symbols in a natural language

Page 15: 1 CMPSC 160 Translation of Programming Languages Fall 2002 Instructor: Hugh McGuire slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon

15

Next Step: Syntax Analysis (Parsing)

• How does the compiler recognize the structure of the program?

– Loops, blocks, procedures, nesting?

• Parse the stream of tokens parse tree

Stmt

WhileStmt

WHILE

LPARENRPARENExpr

RelExpr

<ID,sum> LT <ID,total>

StmtBlock

LBRACE Stmt RBRACE

AssignStmt

<ID,sum> EQExpr SEMICOL

ArithExpr

Expr

<ID,sum>

PLUSExpr

ArithExprExpr Expr

<ID,x>

TIMES

<NUM,10>

Page 16: 1 CMPSC 160 Translation of Programming Languages Fall 2002 Instructor: Hugh McGuire slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon

16

Syntax Analysis (Parsing)

• The syntax a programming language is defined by a set of recursive rules. These sets of rules are called context free grammars.

Stmt WhileStmt | Block | ... WhileStmt WHILE LPAREN Expr RPAREN StmtExpr RelExpr | ArithExpr | ...RelExpr ...

• Compilers apply these rules to produce the parse tree• Again, important additional task: Error messages!

– Missing semicolumn, missing parenthesis, etc.• Natural language analogy: It is similar to parsing English text.

Paragraphs, sentences, nounphrases, verbphrases, verbs, prepositions, articles, nouns, etc.

Page 17: 1 CMPSC 160 Translation of Programming Languages Fall 2002 Instructor: Hugh McGuire slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon

17

Intermediate Representations

• The parse tree representation has too many details – LPAREN, LBRACE, SEMICOL, etc.

• Once the compiler understands the structure of the input program it does not need these details (they were used to prevent ambiguities)

• Compilers generate a more abstract representation after constructing the parse tree which does not include the details of the derivation

• Abstract syntax trees (AST): Nodes represent operators, children represent operands

while

<

<id,sum> <id,total>

assign

<id,sum>+

<id,sum>

*

<id,x> <num,10>

Page 18: 1 CMPSC 160 Translation of Programming Languages Fall 2002 Instructor: Hugh McGuire slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon

18

Next Step: Semantic (Context-Sensitive) Analysis

• Are variables declared before they are used?

– We can find out if “whle” is declared by looking at the symbol table

• Do variable types match?sum = sum + x*10;

+

<id,sum> *

<id,x> <num,10>

may become

+

<id,sum>

*

<id,x> <num,10>

int2float

sum

x

float

int

SymbolTable

sum can be a floating point number,

x can be an integer

Page 19: 1 CMPSC 160 Translation of Programming Languages Fall 2002 Instructor: Hugh McGuire slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon

19

Next Step: Code Generation

• Abstract syntax trees are a high-level intermediate representation used in earlier phases of the compilation

• There are lower level (i.e., closer to the machine code) intermediate representations

– Three –address code (Chapter 8): every instruction has at most three operands

– Jasmin: Assembly language for JVM (Java Virtual Machine) an abstract stack machine (used in the project)

• Intermediate code generation for these lower level representations and machine code generation are similar

Page 20: 1 CMPSC 160 Translation of Programming Languages Fall 2002 Instructor: Hugh McGuire slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon

20

Code Generation: Instruction Selection

• Source code

a = b + c;

d = a + e;

• Target code

MOV b,R0

ADD c,R0

MOV R0,a

MOV a,R0

ADD e,R0

MOV R0,d

If we generate code for each statement separatelywe will not generate efficient code

codefor firststatementcode for secondstatement

This instruction is redundant

Page 21: 1 CMPSC 160 Translation of Programming Languages Fall 2002 Instructor: Hugh McGuire slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon

21

Code Generation: Register Allocation

• There are a limited number of registers available on real machines

• Registers are valuable resources, the compiler has to use them efficiently

t =a-b;u=a-c;v=t+u;d=v+u;

d = (a-b)+(a-c)+(a-c); MOV a,R0SUB b,R0MOV a,R1SUB c,R1ADD R1,R0ADD R1,R0MOV R0,d

source code three-address code assembly code

Page 22: 1 CMPSC 160 Translation of Programming Languages Fall 2002 Instructor: Hugh McGuire slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon

22

Improving the Code: Code Optimization

• Compilers can improve the quality of code by static analysis

– Data flow analysis, dependence analysis, code transformations, dead code elimination, etc.

temp = x*10;

while (sum < total)

{

sum = sum + temp;

}

while (sum < total)

{

sum = sum + x*10;

}

We do not need to recompute x*10 in each iteration of the loop

transformationto more efficient code

Page 23: 1 CMPSC 160 Translation of Programming Languages Fall 2002 Instructor: Hugh McGuire slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon

23

Things to do

• Get the textbook

• Read chapters 1 and 2

Page 24: 1 CMPSC 160 Translation of Programming Languages Fall 2002 Instructor: Hugh McGuire slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon

24

Summary of this Lecture-Module

• Intro. Gen. Admin. of course

• Overview of compilation / topics of course

• Init. asst.