Language Translation Issues




  • Language Translation Issues
    Lecture 5: Dolores Zage

  • Programming Language Syntax
    The arrangement of words as elements in a sentence to show their relationship.
    In C, X = Y + Z represents a valid sequence of symbols; XY +- does not.
    Syntax provides significant information for understanding a program and for its translation into an object program.
    Rules: 2 + 3 x 4 is 14, not 20. Writing (2 + 3) x 4 specifies the other interpretation by syntax.
    Syntax guides the translator.
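The precedence rule on this slide can be checked with a small sketch that uses Python's own parser through the standard `ast` module. Python is only a stand-in here, not the language the lecture discusses, and Python writes multiplication as `*` rather than `x`.

```python
import ast

# Parse the expression and look at the tree the syntax rules produce.
tree = ast.parse("2 + 3 * 4", mode="eval")
root = tree.body

# The addition sits at the root, so the multiplication is grouped first.
print(type(root.op).__name__)        # operator at the root of the tree
print(type(root.right.op).__name__)  # operator of the right subtree

# Evaluate the unparenthesized and the parenthesized forms.
value = eval(compile(tree, "<expr>", "eval"))
grouped = eval(compile(ast.parse("(2 + 3) * 4", mode="eval"), "<expr>", "eval"))
print(value, grouped)
```

The tree, not the left-to-right order of the symbols, decides which operation is applied first; parentheses change the tree and therefore the result.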

  • General Syntactic Criteria
    Provide a common notation between the programmer and the programming language processor.
    The choice of notation is constrained only slightly by the necessity to communicate particular items of information.
    For example, the fact that a variable is represented as a real can be conveyed by an explicit declaration, as in Pascal, or by an implicit naming convention, as in FORTRAN.
    General criteria: easy to read, easy to write, easy to translate, and unambiguous.

  • Readability
    The algorithm is apparent from inspection of the text (self-documenting).
    Natural statement formats.
    Liberal use of key words and noise words.
    Provision for embedded comments.
    Unrestricted-length identifiers.
    Mnemonic operator symbols.
    COBOL's design emphasizes readability, often at the expense of ease of writing and translation.

  • Writeability
    Enhanced by concise and regular structures. (Notice the contrast: readability favors verbose, varied constructs, which help us to distinguish programming features.)
    FORTRAN: implicit naming does not help us catch misspellings. For example, indx and index are both valid integer variables, even though the programmer wanted indx to be index.
    Redundancy can be good: it makes programs easier to read and allows for error checking.

  • Ease of Translation
    The key to easy translation is regularity of structure.
    LISP can be translated with a few short, easy rules, but it is a bear to read.
    COBOL has a large number of syntactic constructs, which makes it hard to translate.

  • Lack of ambiguity
    Central problem in every language design!
    An ambiguous construction allows for two or more different interpretations.
    Ambiguities do not arise in the structure of individual program elements but in the interplay between structures.
    The dangling else is a classic example:

  • If then else
    if (boolean) then if (boolean) then statement 1 else statement 2
    [Figure: two diagrams over conditions B1, B2 and statements S1, S2, showing the two interpretations]

  • Resolve dangling else
    ALGOL: include a begin ... end delimiter around the embedded conditional.
    Ada: close each conditional with the delimiter end if.
    C and Pascal: the final else is paired with the nearest then.
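The nearest-then rule can be sketched as a tiny recursive parser. This is an illustrative Python sketch, not how any particular compiler is written; the token names B1, B2, S1, S2 and the function `parse_statement` are assumptions made up for the example.

```python
def parse_statement(tokens, pos=0):
    """Parse one statement starting at tokens[pos]; return (tree, next_pos)."""
    if tokens[pos] == "if":
        condition = tokens[pos + 1]
        if tokens[pos + 2] != "then":
            raise SyntaxError("expected 'then'")
        then_branch, pos = parse_statement(tokens, pos + 3)
        else_branch = None
        # Greedy rule: an 'else' seen here belongs to the innermost open 'if',
        # because the inner call gets the first chance to consume it.
        if pos < len(tokens) and tokens[pos] == "else":
            else_branch, pos = parse_statement(tokens, pos + 1)
        return ("if", condition, then_branch, else_branch), pos
    # Anything else is treated as a simple statement.
    return tokens[pos], pos + 1


tokens = "if B1 then if B2 then S1 else S2".split()
tree, _ = parse_statement(tokens)
print(tree)
```

In the resulting tree the else branch S2 is attached to the inner conditional on B2, and the outer conditional on B1 has no else branch.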

  • Character set
    ASCII has 26 letters; other languages have hundreds of letters.
    Characters form identifiers, key words, and reserved words.
    Blanks may be not significant except in literal character-string data (FORTRAN), or they may be used as separators.
    Delimiters: begin ... end, { }

  • Other elements
    Identifiers, operators, key words, reserved words.
    Free vs. fixed format: free-format statements can be written anywhere on a line; in fixed format (FORTRAN), the first five characters are reserved for labels.
    Statements: simple (no embedding) or structured/nested (embedded).

  • Overall Program-Subprogram Structure
    Separate subprogram definitions (COMMON blocks in FORTRAN).
    Separate data definitions (class mechanism).
    Nested subprogram definitions (Pascal: nesting one subprogram in another).
    Separate interface definitions (package interface in Ada; in C you can do this with an include file).
    Data descriptions separated from executable statements (COBOL data and environment divisions).
    Unseparated subprogram divisions: no organization (early BASIC and SNOBOL).

  • Stages in Translation
    The process of translating a program from its original syntax into executable form is central to every programming language implementation.
    Translation can be quite simple, as in LISP and Prolog, but more often it is quite complex.
    Most languages could be implemented with only trivial translation if you wrote a software interpreter and were willing to accept slow execution speeds.

  • Stages in Translation
    The syntactic recognition parts of compiler theory are fairly standard.
    Analysis of the source program: the structure of the program must be laboriously built up character by character during translation.
    Synthesis of the object program: construction of the executable program from the output of the semantic analysis.

  • Structure of a Compiler
    [Figure: structure of a compiler]
    Source program recognition phases: lexical analysis, syntactic analysis, semantic analysis.
    Object code generation phases: optimization, code generation, linking.
    Data passed along: source program, lexical tokens, parse tree, intermediate code, optimized intermediate code, object code, executable code.
    Also shown: the symbol table, other tables, and object code from other compilations.

  • Analysis of the Source Program
    Lexical analysis (tokenizing).
    Parsing (syntactic analysis).
    Semantic analysis.
    Symbol-table maintenance.
    Insertion of implicit information (default settings).
    Macro processing and compile-time operations (#ifdefs).
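As a rough illustration of the first analysis step, lexical analysis, the following Python sketch splits a line of source text into tokens using the standard `re` module. The token classes (NUMBER, NAME, OP) and the function name `tokenize` are assumptions chosen for the example, not part of the lecture.

```python
import re

# Each token class is a named alternative in one combined regular expression.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("NAME", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/=()]"),
    ("SKIP", r"\s+"),
    ("MISMATCH", r"."),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (kind, lexeme) pairs, dropping blanks."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup
        if kind == "SKIP":
            continue  # blanks act only as separators here
        if kind == "MISMATCH":
            raise SyntaxError(f"unexpected character {match.group()!r}")
        tokens.append((kind, match.group()))
    return tokens


print(tokenize("X = Y + Z"))
```

The parser then works on these (kind, lexeme) pairs instead of raw characters.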

  • Synthesis of the Object Program
    Optimization.
    Code generation: the internal representation must be formed into assembly language statements, machine code, or other object form.
    Linking and loading: resolving references to external data or other subprograms.
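Code generation can be sketched on a toy scale: walk an expression tree and emit instructions for a hypothetical stack machine. The instruction names (PUSH, ADD, SUB, MUL, DIV), the tuple-based tree, and both functions are assumptions invented for this Python illustration; a real code generator targets an actual machine or object format.

```python
import operator

OPCODES = {"+": "ADD", "-": "SUB", "*": "MUL", "/": "DIV"}
ACTIONS = {"ADD": operator.add, "SUB": operator.sub,
           "MUL": operator.mul, "DIV": operator.truediv}


def generate(node, code=None):
    """Emit postfix instructions for a tree of the form (op, left, right)."""
    if code is None:
        code = []
    if isinstance(node, tuple):
        op, left, right = node
        generate(left, code)    # operands first ...
        generate(right, code)
        code.append(OPCODES[op])  # ... then the operator
    else:
        code.append(f"PUSH {node}")
    return code


def run(code):
    """Interpret the generated instructions on a simple operand stack."""
    stack = []
    for instruction in code:
        if instruction.startswith("PUSH"):
            stack.append(float(instruction.split()[1]))
        else:
            right, left = stack.pop(), stack.pop()
            stack.append(ACTIONS[instruction](left, right))
    return stack.pop()


program = generate(("+", 2, ("*", 3, 4)))
print(program)
print(run(program))
```

The tree for 2 + 3 * 4 becomes a linear instruction sequence in which the multiplication is emitted before the addition.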

  • Translator Groupings
    Crudely grouped by the number of passes they make over the source code.
    Standard: two passes. The first decomposes the program into components and collects information such as variable name usage; the second generates an object program from the collected information.
    One pass: fast compilation. Pascal was designed so that it could be compiled in one pass.
    Three or more passes: used if execution speed is paramount.

  • Formal Translation Models
    Based on the context-free theory of languages.
    The formal definition of the syntax of a programming language is called a grammar.
    A grammar consists of a set of rules (productions) that specify the sequences of characters (lexical items) that form allowable programs in the language being defined.

  • Chomsky Hierarchy
    Language syntax was one of the earliest formal models to be applied to programming language design.
    In 1959 Chomsky outlined a model of grammars.

  • Classes of grammar and abstract machines
    Chomsky level   Grammar class       Machine class
    0               Unrestricted        Turing machine
    1               Context sensitive   Linear-bounded automaton
    2               Context free        Pushdown automaton
    3               Regular             Finite-state automaton
    Type 2 grammars are our BNF grammars. Types 2 and 3 are what we use in programming languages.
    A type n language is one that is generated by a type n grammar, where there is no grammar of type n + 1 that also generates it.
    Every grammar of type n is, by definition, also a grammar of type n - 1.

  • Grammar
    To Chomsky it is a 4-tuple (V, T, P, Z) where:
    V is an alphabet.
    T, a subset of V, is an alphabet of terminal symbols.
    P is a finite set of rewriting rules.
    Z, the distinguished symbol, is a member of V - T (that is, a nonterminal).
    The language of a grammar is the set of terminal strings that can be derived from Z.
    The difference between the four types is in the form of the rewriting rules allowed in P.

  • Type 0 or phrase structure
    Rules can have the form u ::= v, with u in V+ and v in V*.
    That is, the left part u can also be a sequence of symbols, and the right part can be empty.
    Examples: abc -> dca, a -> nil

  • Type 1 or context sensitive or context dependent
    Restrict the rewriting rules to the form xUy ::= xuy.
    We are allowed to rewrite U as u only in the context x ... y.
    In all productions a -> b, the length of the left side a must always be less than or equal to the length of b.

  • G = ({S, B, C}, {a, b, c}, S, P), where P is:
    S -> aSBC
    S -> abC
    bB -> bb
    bC -> bc
    CB -> BC
    cC -> cc
    What language is generated by this context sensitive grammar?

  • Deciding the language?
    Always start with the start rule. In this case it is S, but it can be any nonterminal (look at the 4-tuple definition).
    Create a tree starting with the start rule and apply the productions, finishing with all terminals.
    Generalize the pattern.

  • Identifying L given G
    P = 1. S -> aSBC
        2. S -> abC
        3. bB -> bb
        4. bC -> bc
        5. CB -> BC
        6. cC -> cc

    S => abC => abc
    S => aSBC => aabCBC => aabBCC => aabbCC => aabbcC => aabbcc
    S => aSBC => aaSBCBC => aaabCBCBC => aaabBCCBC => aaabBCBCC => aaabBBCCC => aaabbBCCC => aaabbbCCC => aaabbbcCC => aaabbbccC => aaabbbccc
    L = { a^n b^n c^n | n >= 1 }
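The "apply the productions, then generalize the pattern" procedure can be mechanized for short strings. The Python sketch below is a brute-force search, not a standard parsing algorithm: it tries every production at every position and keeps the sentential forms up to a length bound. The bound is safe only because no production in this grammar shortens a string. The function name `generated_strings` is an assumption made for the example.

```python
from collections import deque

# The six productions of G, as (left side, right side) pairs.
PRODUCTIONS = [
    ("S", "aSBC"),
    ("S", "abC"),
    ("bB", "bb"),
    ("bC", "bc"),
    ("CB", "BC"),
    ("cC", "cc"),
]


def generated_strings(max_len, start="S"):
    """Return all terminal strings of length <= max_len derivable from start."""
    seen = {start}
    queue = deque([start])
    results = set()
    while queue:
        form = queue.popleft()
        if form.islower():  # only terminals left: a sentence of the language
            results.add(form)
            continue
        for lhs, rhs in PRODUCTIONS:
            i = form.find(lhs)
            while i != -1:  # rewrite every occurrence of the left side
                new = form[:i] + rhs + form[i + len(lhs):]
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
                i = form.find(lhs, i + 1)
    return sorted(results, key=len)


print(generated_strings(9))
```

Forms that get stuck with a nonterminal still present, such as aabcBC, are simply never reported, which matches the dead branches one meets when building the derivation tree by hand.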

  • Type 2 or context free
    U can be rewritten as u regardless of the context in which it appears.
    Each rule has only one symbol on the left-hand side.
    A rule is also allowed to go to the empty string.

  • Context Free Expression Grammar
    E -> E + T | E - T | T
    T -> T * F | T / F | F
    F -> number | name | (E)
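A minimal recursive-descent evaluator for this grammar might look as follows in Python. Two assumptions are worth stating: the left-recursive rules (E -> E + T, T -> T * F) are handled with loops rather than direct recursion, which keeps the left-to-right grouping, and the `name` alternative of F is left out so that the sketch needs no symbol table. The class and function names are made up for the example.

```python
import re


def tokenize(text):
    """Split the input into integer literals, operators, and parentheses."""
    return re.findall(r"\d+|[-+*/()]", text)


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        token = self.peek()
        self.pos += 1
        return token

    def expr(self):
        # E -> E + T | E - T | T, written as T followed by repeated (+|-) T
        value = self.term()
        while self.peek() in ("+", "-"):
            if self.take() == "+":
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):
        # T -> T * F | T / F | F, written as F followed by repeated (*|/) F
        value = self.factor()
        while self.peek() in ("*", "/"):
            if self.take() == "*":
                value *= self.factor()
            else:
                value /= self.factor()
        return value

    def factor(self):
        # F -> number | (E)
        token = self.take()
        if token == "(":
            value = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return value
        if token is not None and token.isdigit():
            return int(token)
        raise SyntaxError(f"unexpected token {token!r}")


def evaluate(text):
    parser = Parser(text)
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError("trailing input")
    return value


print(evaluate("2 + 3 * 4"), evaluate("(2 + 3) * 4"), evaluate("8 - 3 - 2"))
```

Because T sits below E in the grammar, multiplication and division bind tighter than addition and subtraction without any separate precedence table.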

  • Type 3 - regular grammars
    Restrict the rules once more.
    All rules must have the form U ::= N or U ::= WN, where N is a terminal and U and W are nonterminals.

  • Grammars
    As we move from type 3 to type 2 to type 1 to type 0, the resulting languages become more complex.
    Types 2 and 3 became important in programming languages.
    Type 3 provides a model (FSM) for building lexical analyzers.
    Type 2 (BNF) is used for developing parse trees of programs.
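As a small illustration of the finite-state model behind lexical analyzers, the Python sketch below recognizes identifiers with a two-state machine. The identifier rule used here (a letter or underscore followed by letters, underscores, or digits) and all names in the code are assumptions for the example, not a definition taken from the lecture.

```python
def char_class(ch):
    """Map a character to the input class the machine reacts to."""
    if ch.isalpha() or ch == "_":
        return "letter"
    if ch.isdigit():
        return "digit"
    return "other"


# Transition table: (state, input class) -> next state.
# Missing entries mean the machine rejects.
TRANSITIONS = {
    ("start", "letter"): "ident",
    ("ident", "letter"): "ident",
    ("ident", "digit"): "ident",
}
ACCEPTING = {"ident"}


def is_identifier(text):
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, char_class(ch)))
        if state is None:
            return False
    return state in ACCEPTING


print(is_identifier("indx"), is_identifier("x1"), is_identifier("1x"))
```

The machine needs no memory beyond its current state, which is what separates type 3 recognition from the stack needed for type 2 constructs such as nested parentheses.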

  • BNF Grammars
    Consider the structure of an English sentence. We usually describe it as a sequence of categories:
    subject / verb / object
    Examples: The girl / played / baseball. The boy / cooked / dinner.

  • BNF Grammars
    Each category can be further divided. For example, subject is represented by article noun:
    article / noun / verb / object
    There are other possible sentence structures besides the simple declarative ones, such as questions:
    auxiliary verb / subject / predicate
    Is / the boy / cooking dinner?
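The declarative pattern article / noun / verb / object can be sketched as a tiny recognizer in Python. The word lists contain only the words from the example sentences above; the names `LEXICON`, `DECLARATIVE`, and `is_declarative` are assumptions made for the illustration.

```python
# Words from the example sentences, grouped by category.
LEXICON = {
    "article": {"the"},
    "noun": {"girl", "boy"},
    "verb": {"played", "cooked"},
    "object": {"baseball", "dinner"},
}

# A declarative sentence is the sequence article noun verb object.
DECLARATIVE = ["article", "noun", "verb", "object"]


def is_declarative(sentence):
    """Check whether the sentence matches article / noun / verb / object."""
    words = sentence.lower().rstrip(".").split()
    if len(words) != len(DECLARATIVE):
        return False
    return all(word in LEXICON[category]
               for word, category in zip(words, DECLARATIVE))


print(is_declarative("The girl played baseball."))
print(is_declarative("Is the boy cooking dinner?"))
```

The question fails the check because it follows the other structure, auxiliary verb / subject / predicate, which would need its own rule.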

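  • Represent sentences by a set of rules
    <sentence> ::= <declarative> | <interrogative>
    <declarative> ::= <subject> <verb> <object> .
    <subject> ::= <article> <noun>
    <interrogative> ::= <auxiliary verb> <subject> <predicate>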

    This specific notation is called BNF (Backus-Naur form) and was developed in the late 1950s by John Backus as a way to express the syntactic definition of ALGOL. At about the same time, Chomsky developed a similar grammatical form, the context-free grammar. The BNF and context-free grammar forms are equivalent in power; the differences ar