Essay about translator in computer science
Version 1.0 © Abelardo López Lagunas
TC2006 Programming Languages
Abelardo López Lagunas, Ph. D. School of Engineering and Architecture
Computer Science department ITESM Campus Toluca
Outline
• Course objectives
• Grading and policy of the course
• Tentative schedule
• Textbook and other references
• Introduction
  • Why so many programming languages?
  • Programming paradigms
  • Programming language specification
  • Abstraction, grammars & programming languages
Course Objectives
• Present the fundamental concepts
  • Programming language classification
  • Programming paradigms
  • How to describe programming languages
  • The basic translation process
  • Lexical and syntactic analysis
• Present the basics of functional languages
  • Motivation and the use of Scheme/LISP
• Present the basic concepts in concurrency
  • Motivation and possibly cover a parallel language
• Present other programming paradigms
  • Logic languages, script languages, and domain specific languages
Course Policy
• Class starts at 7:35 and ends at 8:55 on Mondays and Thursdays.
• Partial grades are reported using only the integer part of the number. The final grade is calculated using floating-point numbers with four digits of precision. Grades with a fractional part greater than or equal to 0.75 will be rounded up to the next integer; otherwise the fractional part will be truncated.
• The deadline for the assignments is 16:59 on the specified date. If you decide to submit your assignment electronically, the deadline is 22:59 (Campus Server time).
• There are four programming assignments, out of which I will select the three with the highest grades.
• No project reports will be accepted after the due date and time specified above. The projects should include the following:
  • Student ID and name of each team member, written on the cover page or in the upper right corner of the first sheet of paper. Also include the submission date.
  • Description of the project, including the design methodology, all data structures, and all function or procedure definitions.
  • Results and conclusions.
• Make-up exams will be given only with the corresponding medical or legal documentation and with the authorization of the dean of your department.
• Cheating will result in a grade of 10/100 on the exam or assignment in which it occurred, and a written report will be sent to both your department dean and the registrar so that it is attached to your file.
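The rounding rule for the final grade stated in the policy can be sketched as follows. The function name `final_grade` is hypothetical, and the sketch assumes a non-negative grade already computed with four digits of precision.

```python
import math

def final_grade(raw):
    """Round up when the fractional part is >= 0.75, otherwise truncate."""
    whole = math.floor(raw)
    # Four digits of precision, as stated in the course policy.
    fraction = round(raw - whole, 4)
    return whole + 1 if fraction >= 0.75 else whole
```

For example, `final_grade(89.75)` gives 90, while `final_grade(89.7499)` gives 89.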
Course Grading
• Exams:
  • Duration of 85 minutes.
  • No “cheat sheets”, textbooks, or articles are allowed.
• There is no project; instead there are four assignments that represent 20% of the grade.
  • Select your software environment by next week (Cygwin, Co-Linux, Linux, OS X). You can use Windows tools, but use this opportunity to learn something new.
  • As an incentive I’ll provide the tools only for non-Windows platforms
All dates are tentative
Element | Assigned on | Returned by | Weight | Description
1st | 11-Sep-14 | 11-Sep-14 | 25% | Topic 1 and 2.2
2nd | 27-Oct-14 | 27-Oct-14 | 25% | Topic 2.3 through 3
Final exam | TBD | TBD | 20% | Topics 1 through 5.2
Assignment # 1 | 28-Aug-14 | 04-Sep-14 | 5% | Lexical specification *
Assignment # 2 | 08-Sep-14 | 15-Sep-14 | 10% | Syntax specification *
Assignment # 3 | 02-Oct-14 | 16-Oct-14 | 5% | Programming assignment *
Assignment # 4 | 30-Oct-14 | 20-Nov-14 | 10% | Programming assignment *
Grade History
• Students who have taken this course have told me it was of above-average complexity.
• To give you an idea of the complexity of the course, the figure on the right is the grade histogram of past semesters.
• The historic average is 81.6 for a population of 53 students.
[Figure: grade histogram of past semesters; bins X<70, 70<X<80, 80<X<90, 90<X<95, X>95]
Assignment Description
• There are four assignments:
  • Lexical analyzer and symbol table management.
  • Syntactic analyzer with simple error recovery.
  • A programming assignment using Scheme/LISP.
  • A programming assignment on concurrent programming.
    • Can use C threads or a parallel language, such as Erlang
• All assignments must follow the guidelines set in the course policy
Office Hours
Time (Monday to Friday)
7:30 Programming Languages (127); Adv. Digital Systems (xxx)
9:00 Software Engineering (127); Analysis & Modeling SW (311); OS Lab (127)
10:30 Operating Systems (311)
15:30 Office hours
16:30 Class preparation
Office Location
Concepción Muciño
Office block at Aulas I Second Floor
My office
References
• Text:
  • Robert W. Sebesta, “Concepts of Programming Languages,” 9th edition, Addison Wesley.
• Auxiliary texts:
  • Michael L. Scott, “Programming Language Pragmatics,” Morgan Kaufmann, 2nd edition.
  • Harold Abelson, Gerald J. Sussman, “Structure and Interpretation of Computer Programs,” on-line version (http://mitpress.mit.edu/sicp/full-text/book/book.html)
  • Tom Niemann, “A Compact Guide to LEX & YACC,” epaperpress.com (can be downloaded from the web)
Concepts of Programming Languages
• Why study the basic concepts behind programming languages?
  • Increased ability to express ideas
  • Improved background for choosing appropriate languages (choose the right language for the task)
  • Increased ability to learn new languages
  • Better understanding of the significance of implementation
  • Better use of languages that are already known
    • Learn new paradigms and try to use them in other languages (if feasible).
  • Overall advancement of computing
Why so many languages?
Source: http://www.digibarn.com/collections/posters/tongues/ComputerLanguagesChart-med.png
Why so many languages?
• The goal of any programming language is to capture the programmer's intent.
• In the beginning, computer time was more important than programmer time, so the language was closer to the actual hardware implementation.
  • Very difficult to do, even for small programs
  • Little portability across different hardware devices
• As computing power increased and hardware costs decreased, tools were devised to translate a higher-level language into the instructions actually executed by the hardware.
• However, the programmer's intent can be captured in many ways
Language design
• There are several factors that enable the proliferation of computer languages:
  • Evolution of computer science: still growing
  • Special purpose tasks: some languages solve problems more efficiently in particular domains.
  • Personal preference: people think differently and thus can express themselves better in different languages.
  • Ease of use: how difficult is it to learn?
  • Ease of implementation: some languages are harder to map onto hardware platforms.
  • Expressive power: some languages are better than others at abstracting details.
  • Good translators: how efficient is the translator?
Language Evaluation Criteria
• Readability: the ease with which programs can be read and understood
• Writability: the ease with which a language can be used to create programs
• Reliability: conformance to specifications (i.e., performs to its specifications)
• Cost: the ultimate total cost
Source: Robert W. Sebesta. Copyright © 2009 Addison-Wesley. All rights reserved.
Language Evaluation Criteria
• Readability
  • Simplicity: manageable set of features and constructs with minimal feature multiplicity and minimal operator overloading.
  • Orthogonality: relatively small set of primitive constructs that can be combined in a relatively small number of ways. Every possible combination is legal
  • Useful data types and expressive syntax
• Writability
  • Abstraction: ability to define and use complex structures or operations in ways that allow details to be ignored
  • Expressivity: convenient ways of specifying operations
Source: Robert W. Sebesta. Copyright © 2009 Addison-Wesley. All rights reserved.
Evaluation Criteria: Reliability
• Type checking
  • Testing for type errors
• Exception handling
  • Intercept run-time errors and take corrective measures
• Aliasing
  • Presence of two or more distinct referencing methods for the same memory location
• Readability and writability
  • A language that does not support “natural” ways of expressing an algorithm will require the use of “unnatural” approaches, and hence reduce reliability
Source: Robert W. Sebesta. Copyright © 2009 Addison-Wesley. All rights reserved.
Evaluation Criteria: Cost
• Training programmers to use the language
• Writing programs (closeness to particular applications)
• Compiling programs
• Executing programs
• Language implementation system: availability of free compilers
• Reliability: poor reliability leads to high costs
• Maintaining programs
Source: Robert W. Sebesta. Copyright © 2009 Addison-Wesley. All rights reserved.
Evaluation Criteria: Others
• Portability
  • The ease with which programs can be moved from one implementation to another
• Generality
  • The applicability to a wide range of applications
• Well-definedness
  • The completeness and precision of the language’s official definition
Source: Robert W. Sebesta. Copyright © 2009 Addison-Wesley. All rights reserved.
The Four Rs of Programming Language Design
Dominic Orchard
Computer Laboratory, University of Cambridge, UK
Categories and Subject Descriptors D.1.0 [Software]: Programming Techniques - General; I.0 [Computing Methodologies]: GENERAL
General Terms Design, Languages
Keywords Programming language design, The Four Rs, Domain-specific languages
“I can learn the poor things reading, writing, and ’rithmetic, and counting as far as the rule of three, which is just as much as the likes of them require;” Lawrie Todd: Or the Settlers in the Woods, Galt (1832) [4].
Many will be familiar with the old adage that at the core of any child’s education should be the three Rs: reading, writing, and ’rithmetic. The phrase, which appeared first in print in 1825 [12] has been appropriated and parodied at length (“read, reason, recite”, “reduce, reuse, recycle”, etc.). Each permutation has the same purpose: to express succinctly the core tenets of an approach or philosophy.
The four Rs of programming language design is another such parody of this old phrase, providing a rubric, or framework, for the design and evaluation of effective programming languages and language features.
Since the very first programming language back in the 1940s [14] thousands of programming languages have been developed, representing a broad spectrum of paradigms, perspectives, and philosophies. And yet, there is no single language which is “all things to all men” (and women!).
The four Rs were born out of trying to answer a number of questions about the nature of programming languages and programming language design: what makes a programming language effective or ineffective? What should be the core aims of a language designer? How should programming languages and features be compared? Why is there no single “perfect” language? The four Rs go some way towards answering these questions.
Before I reveal the four Rs, let’s first consider some more foundational questions:
Why programming languages? The development of programming languages has greatly aided software engineering. As hardware and software have grown increasingly complex, programming languages have developed to manage this complexity more effectively, aiding us in expressing ideas and solving increasingly complex problems.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Onward! 2011, October 22–27, 2011, Portland, Oregon, USA. Copyright © 2011 ACM 978-1-4503-0941-7/11/10. . . $10.00
Programming languages provide abstraction, by both hiding details and allowing components to be reused, allowing programmers to more effectively manage complexity in software and hardware. While it is in principle possible for any program to be written in machine code, it’s hard to imagine some of the larger computer programs we interact with daily being developed in such a way. By building layers of abstraction with languages, increasingly complex systems can be constructed.
What is programming? In essence, programming is a communication process between one or more programmers and one or more computer systems. Programming languages are the medium of this communication.
Programming is not only a communication process, it is also a translation process. Each participant in the programming process has an internal language, both programmers and machines. In the case of a machine, the internal language comprises the instructions of the underlying hardware. In the case of a programmer, the internal language is far more nebulous, perhaps comprising natural and formal languages, along with other incorporeal, abstract thoughts.
In any case, a programming language acts as the intermediate language of translation between the participants. Programming is the translation from a programmer’s internal language to a programming language, and execution is the translation from the programming language to the machine’s internal language. McCracken, in 1957, captured some of this sentiment, saying “Programming [...] is basically a process of translating from the language convenient to human beings to the language convenient to the computer” where the convenient language for humans was “mathematics or English statements of decisions to be made” [8]. Here we consider the “language convenient to human beings” to be programming languages, bridging the gap between our ideas and the underlying, low-level instructions of a computer system.
Sometimes, programming is more exploration than communication. In which case, a programmer explores and learns about a problem by translating their internal thoughts into a program and then re-internalising the result to gain further insight. Again the process is translational.
It is from this view of programming, as a translation, communication, and exploration, that the four Rs are sculpted.
A programming language should improve the four Rs of programs: reading, writing, running, and reasoning.
These four tenets are both guidelines for language design and research, and criteria for judging a language. They are by no means mutually exclusive, independent, or orthogonal, but are all interrelated. They are also not designed to subsume or replace the
Additional Lectures
• Tom Jepsen “How Programming Languages Evolve,” IT Pro November ❘ December 1999.
68 IT Pro November ❘ December 1999 1520-9202/99/$10.00 © 1999 IEEE
How Programming Languages Evolve
Tom Jepsen
Learning what forces drive programming-language evolution can help you pick one for your project.
Once upon a time, life was simple. Large monolithic computers, usually painted blue, ran single-threaded batch programs under the watchful eyes of an operator and a system programmer. A card reader served as the input device, a tape drive provided storage, and a line printer processed output. Programmers wrote business applications in Cobol, scientific applications in Fortran. In either case, they worked out the program logic on paper first, used a keypunch to produce punched cards, then ran the resulting deck through the card reader. After a few debugging sessions, they received their computed results on sheets of fanfold paper. If a program required documentation, programmers produced it on a manual typewriter.
Today, a few lumbering programming languages cling to life while hordes of newer types struggle to replace them. These new species include a confusing swirl of languages that are
• compiled,
• interpreted,
• Web based,
• scripting and modeling capable,
• object oriented,
• graphically based,
• text-processing based, or
• founded on artificial intelligence routines.
In addition, all kinds of specialized languages address the challenges of developing specific applications. To an outside observer, this proliferation might seem strange. Why haven’t computer scientists and IT professionals (people generally scientific and rational almost to a fault) focused on creating a few robust and adaptable problem-solving languages that could succeed in any computing environment?
ECOLOGICAL NICHES
Quite simply, the rapid spread of computerization into all phases of modern life, and the diversity of application and problem domains this trend has created, require a wide range of problem-solving tools.
Programming languages have evolved to provide such tools and, like evolution in nature, have generated numerous mutations in the process, some successful and some not. A few forces drive this evolutionary process; understanding them and which languages evolved as an answer to these forces may help you select a language for your development project.
Send in the clones
Much of software programming consists of reinventing the wheel with a slightly different color scheme, turning radius, and spoke orientation. This realization launched programmers on a quest to develop techniques for cloning and recycling code used on previous projects. Structured programming led to code modularization, which in time led to object-oriented programming and component-based development. The quest for reusable software also contributed to the development of architecture- and platform-independent languages.
The incredible shrinking computer
With each succeeding generation, computers have housed the same processing power in smaller packages: First, mainframes shrank to minicomputers, then minicomputers dwindled to desktops. Now ever smaller computers find their way into handheld personal digital assistants, clothing, and jewelry. Computers have become ubiquitous. Almost all technological artifacts now contain a stored program of some sort. Such small computers demand economical languages that leave a comparably small footprint.
Have code, will travel
Manufacturers of mainframes and even early PCs designed their products for standalone
• Dominic Orchard “The Four Rs of Programming Language Design,” Onward! 2011, October 22–27, 2011, Portland, Oregon, USA. © 2011 ACM.
Language classification
• Languages can be grouped in two families
  • Declarative languages: what the computer is to do
    • Functional (LISP/Scheme, ML, Haskell)
    • Dataflow (Id, Val)
    • Logic or constraint-based (Prolog, spreadsheets)
    • Template-based (XSLT)
  • Imperative languages: how the computer should do it
    • Sequential or von Neumann (C, Ada, Fortran)
    • Scripting (Perl, Python, PHP)
    • Object-oriented (Smalltalk, Eiffel, C++, Java)
• Note that there are concurrent extensions of some of the above languages.
  • However, some languages are naturally concurrent, such as those based on the dataflow model.
Influence of Computer Engineering
von Neumann model
initialize the program counter
repeat forever
    fetch the instruction pointed to by the program counter (PC)
    increment the PC
    decode the instruction
    execute the instruction
end repeat
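The loop above can be sketched as a toy interpreter. The instruction set (`LOAD`, `ADD`, `JUMP`, `HALT`), the single accumulator, and the function name `run` are assumptions made only for illustration; they are not part of the slide.

```python
def run(program):
    """Toy fetch-decode-execute loop over (opcode, operand) pairs."""
    pc = 0                      # initialize the program counter
    acc = 0                     # single accumulator register (assumption)
    while True:                 # repeat forever; HALT is this sketch's way out
        opcode, operand = program[pc]   # fetch the instruction pointed to by the PC
        pc += 1                         # increment the PC
        if opcode == "LOAD":            # decode, then execute
            acc = operand
        elif opcode == "ADD":
            acc += operand
        elif opcode == "JUMP":
            pc = operand
        elif opcode == "HALT":
            return acc
        else:
            raise ValueError(f"unknown opcode {opcode!r}")
```

Every instruction, including the jump, goes through the same fetch from `program`, which is the memory traffic the bottleneck discussion refers to.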
von Neumann Bottleneck
• Connection speed between a computer’s memory and its processor determines the speed of a computer
  • Program instructions often can be executed much faster than the speed of the connection; the connection speed thus results in a bottleneck
  • Known as the von Neumann bottleneck; it is the primary limiting factor in the speed of computers
Motivation: Why do we need Translators?
• What is a translator (a.k.a. compiler)?
  • A program that transforms a description written in a source language (usually a high-level language) into a destination or object language (usually machine language)
  • The name “compiler” was used early on (~1950) because the translation process was seen as the compilation of subroutines selected from a library.
• A translator automates the process of mapping the high-level language into machine instructions, allowing the programmer to create higher levels of abstraction.
  • Machine Language -> Assembly -> Low Level Language -> High Level Language -> Application Programming -> ...
• Translators may optimize the mapping into machine instructions
  • Smaller code size, faster execution times, lower memory usage, etc.
• Translators may take care of resource management
  • Automatic memory management (e.g. garbage collection)
Motivation: Why do we need Translators? (2)
High level of abstraction
• Complex manipulation
• Automatic/implied resource management
Low level of abstraction
• Optimized machine instructions
• Explicit resource management
• Data path management (VLIW)
[Figure: a translator sits between the high and the low level of abstraction]
On Abstraction
‘The purpose of abstraction is not to be vague, but to create a new semantic level in which one can be absolutely precise’ - Edsger Dijkstra
Translator Types
• Translators can be classified depending on their function:
  • Compilers: translate from a high-level language to machine language
  • Assemblers: translate from assembly language to machine language
  • Interpreters: the translation process does not produce machine language. The high-level language is executed as it is translated. This will be the emphasis of this course.
  • Optimizers: translators that optimize the object language (size, speed, etc.)
  • One pass or multiple passes
    • One pass: the object code generation is done at the same time as the source language is analyzed.
    • Multiple passes: the object code generation is done as the last stage of translation.
  • Syntax checkers: analyze the source language for syntax and semantic errors or ambiguities (e.g. Lint)
  • Parallel or scalar
Translation Process (1)
• Lexical Analysis: takes the source language and extracts its fundamental parts, or tokens, such as identifiers, constants, and reserved words.
• Syntactic Analysis: takes the list of tokens and creates syntactic trees for grammatical structures such as expressions, statements, declarations, etc. It also verifies the syntactic correctness of the source.
• Semantic Analysis: checks the syntactic trees, verifies their structure, and determines their meaning. It generates an intermediate representation based on a predefined set of semantic actions.
• Intermediate Representation (IR): also known as intermediate code; represents the semantic actions in a format amenable to systematic manipulation (mainly optimization).
[Figure: stages of the translation process: Source Code, Lexical Analysis, Syntax Analysis, Semantic Analysis, IR, IR Optimization, Instruction Selection, Code Optimization, Assembly, Linking, Object Code; with an Interpretation branch]
Translation Process (2)
• IR Optimization: performs manipulations that reduce the IR by removing unnecessary code and references.
• Instruction Selection: translates IR code into object code using instruction templates. May also include register assignment.
• Object Code Optimization: manipulates the object code to minimize size, execution time, memory usage, memory references, etc.
• Assembly: translates the object code into machine language. May perform further manipulations.
• Linking: generates executable code by including runtime support routines.
• Each of the above stages may generate error messages. Error generation and reporting is very important because it is the main source of feedback to the user.
Translation: Lexical
• Identify tokens of the language
  • Numbers
  • Identifiers
  • Operators
  • Reserved words
  • Strings
• Find and report errors such as
  • Malformation of tokens
  • Unknown symbols
• Assign values to the tokens
  • See the values in parentheses in the example.
Source: Z = -1.8 * (X / 2)
Tokens: Id(Z) Eq_op Num(-1.8) Mul_op Lpar_op Id(X) Div_op Num(2) Rpar_op
Source: 72..18 / va#l
Tokens: Error bad number, Num(18), Div_op, Error bad Id
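A minimal sketch of a lexical analyzer for the example, using Python regular expressions. The token names follow the slide; the patterns, the `tokenize` name, and the choice to fold a leading minus sign into the number are assumptions of this sketch, and it reports each unknown character separately instead of the slide's "bad number" and "bad Id" messages.

```python
import re

# Order matters: earlier patterns win when several could match.
TOKEN_SPEC = [
    ("Num",     r"-?\d+(?:\.\d+)?"),      # assumption: minus sign is part of the number
    ("Id",      r"[A-Za-z][A-Za-z0-9]*"),
    ("Eq_op",   r"="),
    ("Mul_op",  r"\*"),
    ("Div_op",  r"/"),
    ("Lpar_op", r"\("),
    ("Rpar_op", r"\)"),
    ("Skip",    r"\s+"),
    ("Error",   r"."),                    # any other single character
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Return the token list for one line of source text."""
    tokens = []
    for match in MASTER.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "Skip":
            continue                      # whitespace carries no token
        if kind in ("Num", "Id", "Error"):
            tokens.append(f"{kind}({text})")   # tokens that carry a value
        else:
            tokens.append(kind)
    return tokens
```

Calling `tokenize("Z = -1.8 * (X / 2)")` yields the same token sequence as the first example on the slide.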
Translation: Syntax\Semantics
• The Syntax Analyzer builds a syntactic or parse tree
  • Enables checks for malformed expressions, statements, declarations, etc.
• The Semantic Analyzer uses the parse tree to:
  • Check declarations, types, possible promotion, uninitialized variables, etc.
  • Annotate the parse tree.
Source: Z = -1.8 * (X / 2)
Lexical Analysis: Id(Z) = Num(-1.8) Op(*) Op(() id(X) Op(/) Num(2) Op())
Syntax Analysis: [Figure: parse tree. Assignment -> id(Z) Op_eq Expression; Expression -> Num(-1.8) Op(*) ( Expression ); inner Expression -> id(X) Op(/) Num(2)]
Semantic Analysis: Check_Symbol(Z), Check_Symbol(X), Promote(2)
  • Z: declared as Float; get reference of Z
  • X: declared as Float; get value or reference of X
  • 2: promote to float
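A minimal sketch of how a syntax analyzer might build the tree for the example by recursive descent. The grammar used here (assignment -> Id Eq_op expr; expr -> factor ((Mul_op | Div_op) factor)*; factor -> Num | Id | Lpar_op expr Rpar_op), the tuple-based tree, and the name `parse_assignment` are assumptions for illustration.

```python
def parse_assignment(tokens):
    """Parse (kind, text) tokens for `Id Eq_op expr` into a nested-tuple tree."""
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else None

    def take(kind):
        nonlocal pos
        if peek() != kind:
            raise SyntaxError(f"expected {kind}, found {peek()}")
        text = tokens[pos][1]
        pos += 1
        return text

    def factor():
        if peek() == "Num":
            text = take("Num")
            # Keep integers as integers; promotion is left to semantic analysis.
            return ("Num", float(text) if "." in text else int(text))
        if peek() == "Id":
            return ("Id", take("Id"))
        take("Lpar_op")
        inner = expr()
        take("Rpar_op")
        return inner

    def expr():
        node = factor()
        while peek() in ("Mul_op", "Div_op"):
            op = take(peek())
            node = ("Op", op, node, factor())
        return node

    target = take("Id")
    take("Eq_op")
    tree = ("Assignment", ("Id", target), expr())
    if peek() is not None:
        raise SyntaxError(f"unexpected {peek()}")
    return tree

# Token stream for Z = -1.8 * (X / 2), written out by hand.
TOKENS = [("Id", "Z"), ("Eq_op", "="), ("Num", "-1.8"), ("Mul_op", "*"),
          ("Lpar_op", "("), ("Id", "X"), ("Div_op", "/"), ("Num", "2"),
          ("Rpar_op", ")")]
```

A malformed input, such as a missing closing parenthesis, raises `SyntaxError`, which is the kind of check the syntax analyzer bullet describes.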
Translation: IR\Optimization
• Translates the annotated parse tree into a linear notation
  • Easier for the optimizer to manipulate
Intermediate Representation
  Temp1 = int_to_float(2)
  Temp2 = Value(X) / Temp1
  Temp3 = -1.8 * Temp2
  Location(Z) = Temp3
[Figure: annotated parse tree for the assignment, with Z.location, X.value, and Promote(2) at the leaves]
IR Optimization
  Temp1 = Value(X) / 2.0
  Location(Z) = -1.8 * Temp1
• IR Optimization removes unnecessary code by transforming constants, resolving references, algebraic manipulations, etc.
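The step from the four-line IR to the two-line IR can be sketched with two simple passes: folding `int_to_float` of a constant at translation time, and merging a final copy into the instruction that produced its source. The tuple encoding `(dest, op, a, b)` and the name `optimize` are assumptions; the sketch does not renumber temporaries, so the surviving temporary keeps the name `Temp2`, and the copy merge assumes the temporary is not used again afterwards.

```python
# The unoptimized IR from the slide, one tuple per instruction.
IR = [
    ("Temp1", "int_to_float", 2, None),
    ("Temp2", "/", "Value(X)", "Temp1"),
    ("Temp3", "*", -1.8, "Temp2"),
    ("Location(Z)", "copy", "Temp3", None),
]

def optimize(ir):
    """Fold constant conversions and merge trailing copies."""
    consts = {}   # temporary name -> constant value known at translation time
    out = []
    for dest, op, a, b in ir:
        # Replace operands that are known constants.
        a = consts.get(a, a) if isinstance(a, str) else a
        b = consts.get(b, b) if isinstance(b, str) else b
        if op == "int_to_float" and isinstance(a, int):
            consts[dest] = float(a)            # no instruction is emitted
        elif op == "copy" and out and out[-1][0] == a:
            prev = out.pop()
            out.append((dest,) + prev[1:])     # write the result straight to dest
        else:
            out.append((dest, op, a, b))
    return out
```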
Translation: Back End
Optimized IR
  Temp1 = Value(X) / 2.0
  Location(Z) = -1.8 * Temp1
Instruction Selection
  Ld RA, [X.location]
  Div RB, RA, 2.0
  Mul RC, RB, -1.8
  St [Z.location], RC
Register Allocation
  Ld R0, [X.location]
  Div R1, R0, 2.0
  Mul R2, R1, -1.8
  St [Z.location], R2
Code Optimization (reclaim registers)
  Ld R0, [X.location]
  Div R1, R0, 2.0
  Mul R0, R1, -1.8
  St [Z.location], R0
CO: Remove References
  Ld R0, X.Value
  Div R1, R0, 2.0
  Mul R0, R1, -1.8
Assembly
  FFED FEC5 FBA0
Linker
  Runtime Code (load) FFED FEC5 FBA0 Runtime Code (end)
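The register allocation and register reclaiming steps can be sketched together: each virtual register gets the lowest free physical register, and a register is returned to the pool after the instruction that uses it for the last time. The `(op, dest, sources)` encoding, which lists only register operands, and the name `allocate` are assumptions; spilling to memory is not modeled.

```python
def allocate(instructions, num_regs=4):
    """Map virtual registers to R0..R{num_regs-1}, reusing each after its last use."""
    # Index of the last instruction that reads each virtual register.
    last_use = {}
    for i, (_op, _dest, srcs) in enumerate(instructions):
        for v in srcs:
            last_use[v] = i
    free = list(range(num_regs))
    mapping, out = {}, []
    for i, (op, dest, srcs) in enumerate(instructions):
        phys_srcs = [f"R{mapping[v]}" for v in srcs]
        phys_dest = None
        if dest is not None:
            if not free:
                raise RuntimeError("out of registers (spilling is not modeled)")
            mapping[dest] = free.pop(0)        # lowest-numbered free register
            phys_dest = f"R{mapping[dest]}"
        # Sources that die here become available to later instructions.
        for v in set(srcs):
            if last_use[v] == i:
                free.append(mapping[v])
        free.sort()
        out.append((op, phys_dest, phys_srcs))
    return out

# Register operands of the instruction-selection output on the slide.
SELECTED = [("Ld", "RA", []), ("Div", "RB", ["RA"]),
            ("Mul", "RC", ["RB"]), ("St", None, ["RC"])]
```

With this input the result uses only R0 and R1, matching the register numbers in the Code Optimization stage.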
Compiler hints
‘If you lie to the compiler, it will get its revenge.’ -- Henry Spencer
Translator Design & Formal Languages
• Translators are designed in a modular fashion
  • Enables support for different source languages on the same platform by only changing the analysis modules (“front-end”).
  • Enables support of several platforms for the same source language by only changing the synthesis modules (“back-end”).
  • Can explore different optimization algorithms without disrupting the rest of the modules.
• Translators use formal languages to describe the analysis of the source language:
  • Regular expressions for lexical analysis
  • Context-free grammars for syntactic analysis
  • Attributed grammars for semantic analysis
Formal Language Concepts (2)
• For example, if L = {A, B, ..., Z, a, b, ..., z} and D = {0, 1, ..., 9} then
  • L U D is the set of letters and digits
  • LD is the set of strings consisting of a letter followed by a digit
  • L^4 is the set of strings of four letters
  • L* is the set of all strings of letters, including ε
  • L(L U D)* is the set of all strings of letters and digits that start with a letter; ε is not included
  • D+ is the set of all strings of one or more digits
• As can be seen, this formalism helps manipulate an infinite number of string combinations with a small set of symbols.
  • But not all the combinations are meaningful
  • Need a way to specify strings with patterns of interest. Furthermore, we need to formalize how to generate strings.
  • Introduce the concept of grammars
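The set operations above can be tried out on small cases. In this sketch the helper names `concat`, `power`, and `star_up_to` are made up for illustration, and because L* is infinite only a bounded approximation of the star is computed.

```python
import string

L = set(string.ascii_letters)   # {A, ..., Z, a, ..., z}
D = set(string.digits)          # {0, ..., 9}

def concat(A, B):
    """AB: every string of A followed by every string of B."""
    return {a + b for a in A for b in B}

def power(A, n):
    """A^n: concatenation of A with itself n times; A^0 = {ε}."""
    result = {""}
    for _ in range(n):
        result = concat(result, A)
    return result

def star_up_to(A, max_len):
    """Finite part of A*: strings built from at most max_len symbols of A."""
    result = set()
    for n in range(max_len + 1):
        result |= power(A, n)
    return result

# L(L U D)* restricted to identifiers of at most two characters.
short_identifiers = concat(L, star_up_to(L | D, 1))
```

Here `"x1"` belongs to `short_identifiers` while `"1x"` and the empty string do not, which matches the description of L(L U D)*.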
Formal Language Concepts (3)
• A grammar is the ordered quartet G = (N, T, P, S) over a language L
  • N is a finite non-empty set of non-terminal symbols (written in bold)
  • T is a finite non-empty set of terminal symbols
  • P is a set of productions of the form u -> v such that u is an element of (N U T)+ and v is an element of (N U T)*
  • S is a non-terminal symbol called the start symbol
• Example:
  • identifier -> letter (letter | digit)*
  • letter -> A | B | ... | Z | a | b | ... | z
  • digit -> 0 | 1 | ... | 9
• So N = {letter, digit, identifier}, T = {A, B, ..., Z, a, b, ..., z, 0, ..., 9}, P is the set of productions above, and S = identifier
• The | operator denotes alternation (a logical “or”).
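The quartet for the example can be written down as data and checked against a few strings. This is only a sketch: the `Grammar` type and `to_regex` helper are hypothetical, the right sides are stored as regex-like text, and the expansion only works because this grammar is not recursive.

```python
import re
import string
from typing import NamedTuple

class Grammar(NamedTuple):
    N: frozenset   # non-terminal symbols
    T: frozenset   # terminal symbols
    P: dict        # productions: left side -> right side (regex-like text)
    S: str         # start symbol

G = Grammar(
    N=frozenset({"identifier", "letter", "digit"}),
    T=frozenset(string.ascii_letters + string.digits),
    P={"identifier": "letter(letter|digit)*",
       "letter": "[A-Za-z]",
       "digit": "[0-9]"},
    S="identifier",
)

def to_regex(g, symbol):
    """Expand the non-terminals in a right side until only regex text remains."""
    body = g.P[symbol]
    for name in g.N:
        if name != symbol and name in body:
            body = body.replace(name, f"(?:{to_regex(g, name)})")
    return body

def generated_by(g, text):
    """True when the start symbol can produce exactly this text."""
    return re.fullmatch(to_regex(g, g.S), text) is not None
```

For instance, `generated_by(G, "help3")` is true and `generated_by(G, "3help")` is false.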
Formal Language Concepts (4)
• To generate a string r from a grammar G:
  • Take the start symbol S and apply the productions to its non-terminal symbols to produce a new string r'.
  • Repeat the above step on the new string until it contains only terminal symbols from T.
• Note: a grammar generates strings from a language L with a pattern specified by the production rules. The language L is just a collection of strings over an alphabet.
• Let s, t, u, and v be strings over some language L.
  • If w = s u t, w' = s v t, and u -> v is a production of the grammar G, then the string w' is directly generated by the string w
  • If there is a sequence of strings w0, w1, ..., wn, w where w' is directly generated by w0, w0 is directly generated by w1, and in general wi-1 is directly generated by wi, then the sequence w', w0, w1, ..., wn, w is called a derivation sequence.
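Direct generation can be sketched as a single rewriting step on a sequence of symbols. The function name `derive_step` is hypothetical, and the productions used in the example (identifier -> identifier digit, identifier -> letter, letter -> a, digit -> 1) are an assumed plain-production rewriting of the identifier grammar, chosen so that no `*` operator is needed.

```python
def derive_step(w, u, v):
    """Given production u -> v, rewrite the leftmost w = s u t into w' = s v t."""
    n = len(u)
    for i in range(len(w) - n + 1):
        if w[i:i + n] == u:
            return w[:i] + v + w[i + n:]
    raise ValueError(f"{u} does not occur in {w}")

# Productions applied in order to derive the string "a1".
STEPS = [
    (("identifier",), ("identifier", "digit")),
    (("identifier",), ("letter",)),
    (("letter",), ("a",)),
    (("digit",), ("1",)),
]

def derivation(start, steps):
    """Return every intermediate string, from the start symbol to the final one."""
    sequence = [start]
    for u, v in steps:
        sequence.append(derive_step(sequence[-1], u, v))
    return sequence

SEQUENCE = derivation(("identifier",), STEPS)
```

Read from the last element back to the first, `SEQUENCE` is a derivation sequence in the sense defined above: each string is directly generated by the one before it in the list.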
Examples of Derivation Sequences
• Given
  • identifier -> letter (letter | digit)*
  • letter -> A | B | ... | Z | a | b | ... | z
  • digit -> 0 | 1 | ... | 9
• Then the string help3 is derived from:
  identifier -> letter letter letter letter digit -> h e l p 3
• Given a regular grammar G for some language L, it is possible to construct a finite state automaton that recognizes its strings (a context-free grammar needs a pushdown automaton instead).
  • The automaton accepts a string if there is a valid derivation sequence for it; otherwise it rejects it.
• Automata will be used to design lexical and syntactic analyzers that will accept all valid strings that can be generated.
• A grammar can be seen as a meta-language that specifies how to derive valid strings from a source language.
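A finite state automaton for the identifier grammar can be sketched with a transition table. The state names `start` and `ident` and the function names are assumptions made for illustration.

```python
import string

LETTERS = set(string.ascii_letters)
DIGITS = set(string.digits)

def char_class(ch):
    """Classify a character the way the grammar's terminals are grouped."""
    if ch in LETTERS:
        return "letter"
    if ch in DIGITS:
        return "digit"
    return "other"

# (state, input class) -> next state; a missing entry means "reject".
TRANSITIONS = {
    ("start", "letter"): "ident",
    ("ident", "letter"): "ident",
    ("ident", "digit"): "ident",
}
ACCEPTING = {"ident"}

def accepts(text):
    """Run the automaton over text and report whether it ends in an accepting state."""
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, char_class(ch)))
        if state is None:
            return False
    return state in ACCEPTING
```

The automaton accepts `help3`, and rejects `3help` because no transition leaves `start` on a digit.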
Grammar Classification
• Grammars can be classified in four groups:
  • Type 0 (unrestricted): there are no restrictions on the productions.
  • Type 1 (context sensitive): productions u -> v where u and v belong to (N U T)+ and v is at least as long as u
  • Type 2 (context free): productions u -> v where u is a single non-terminal in N and v belongs to (N U T)*
  • Type 3 (regular): productions of one of the forms:
    • Right linear: A -> tB or A -> t, where A and B belong to N and t belongs to T*
    • Left linear: A -> Bt or A -> t, where A and B belong to N and t belongs to T*
• Note that regular grammars are also context free, and regular languages are also context sensitive.
  • Although the number of symbols of a grammar is finite, it can generate an infinite number of strings.
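The classification can be sketched as a function that returns the most restrictive type whose conditions every production meets. Productions are assumed to be given as pairs of symbol tuples, the name `classify` is hypothetical, and the type 1 test uses the length condition (the right side is at least as long as the left side).

```python
def classify(productions, nonterminals):
    """Return 3, 2, 1, or 0: the most restrictive type that fits every production."""
    def is_nt(symbol):
        return symbol in nonterminals

    def right_linear(rhs):
        # A -> t B or A -> t, with t a (possibly empty) string of terminals.
        body = rhs[:-1] if rhs and is_nt(rhs[-1]) else rhs
        return not any(is_nt(s) for s in body)

    def left_linear(rhs):
        # A -> B t or A -> t.
        body = rhs[1:] if rhs and is_nt(rhs[0]) else rhs
        return not any(is_nt(s) for s in body)

    single_nt = all(len(lhs) == 1 and is_nt(lhs[0]) for lhs, _ in productions)
    if single_nt and (all(right_linear(rhs) for _, rhs in productions)
                      or all(left_linear(rhs) for _, rhs in productions)):
        return 3
    if single_nt:
        return 2
    if all(len(lhs) <= len(rhs) for lhs, rhs in productions):
        return 1
    return 0

REGULAR = [(("A",), ("a", "A")), (("A",), ("a",))]
CONTEXT_FREE = [(("S",), ("a", "S", "b")), (("S",), ())]
CONTEXT_SENSITIVE = [(("S",), ("a", "S", "B", "c")),
                     (("c", "B"), ("B", "c")),
                     (("S",), ("a", "b", "c"))]
UNRESTRICTED = [(("A", "B"), ("a",))]
```

The four sample grammars classify as types 3, 2, 1, and 0 respectively.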