Language Translation Principles Part 1: Language Specification

Embed Size (px)

Text of Language Translation Principles Part 1: Language Specification

  • Language Translation PrinciplesPart 1: Language Specification

  • Attributes of a languageSyntax: rules describing use of language tokensSemantics: logical meaning of combinations of tokensIn a programming language, tokens include identifiers, keywords, and punctuation

  • Linguistic correctnessA syntactically correct program is one in which the tokens are arranged so that the code can be successfully translated into a lower-level languageA semantically correct program is one that produces correct results

  • Language translation toolsParser: scans source code, compares with established syntax rulesCode generator: replaces high level source code with semantically equivalent low level code

  • Techniques to describe syntax of a languageGrammars: specify how you combine atomic elements of language (characters) to form legal strings (words, sentences)Finite State Machines: specify syntax of a language through a series of interconnected diagramsRegular Expressions: symbolic representation of patterns describing strings; applications include forming search queries as well as language specification

  • Elements of a LanguageAlphabet: finite, non-empty set of charactersnot precisely the same thing we mean when we speak of natural language alphabetfor example, the alphabet of C++ includes the upper- and lowercase letters of the English alphabet, the digits 0-9, and the following punctuation symbols:{,},[,],(,),+,-,*,/,%,=,>,
  • Language as ADTA language is an example of an Abstract Data Type (ADT)An ADT has these characteristics:Set of possible values (an alphabet)Set of operations on those valuesOne of the operations on the set of values in a language is concatenation

  • ConcatenationConcatenation is the joining of two or more characters to form a stringMany programming language tokens are formed this way; for example:> and = form >=& and & form &&1, 2, 3 and 4 form 1234Concatenation always involves two operands either one can be a string or a single character

  • String characteristicsThe number of characters in a string is the strings lengthAn empty string is a string with length 0; we denote the empty string with the symbol The is the identity element for concatenation; if x is string, then:x = x = x

  • Closure of an alphabetThe set of all possible strings that can formed by concatenating elements from an alphabet is the alphabets closure, denoted T* for some alphabet TThe closure of an alphabet includes strings that are not valid tokens in the language; it is not a finite setFor example, if R is the real number alphabet, then R* includes:-0.092 and 563.18 but also.0.0.- and 2-4-2.9..-5.

  • Languages & GrammarsA language is a subset of the closure of an alphabetA grammar specifies how to concatenate symbols from an alphabet to form legal strings in a language

  • Parts of a grammarN: a nonterminal alphabet; each element of N represents a group of characters from:T: a terminal alphabetP: a set of rules for string production; uses nonterminals to describe language structureS: the start symbol, an element of N

  • Terminal vs. non-terminal symbolsA non-terminal symbol is used to describe or represent a set of terminal symbolsFor example, the following standard data types are terminal symols in C++ and Java: int, double, float, charThe non-terminal symbol could be used to represent any or all of these

  • Valid stringsS (the start symbol) is a single symbol, not a setGiven S and P (rules for production), you can decide whether a set of symbols is a valid string in the languageConversely, starting from S, if you can generate a string of terminal symbols using P, you can create a valid string

  • ProductionsA wa non-terminalproducesa string of terminals &non-terminals

  • DerivationsA grammar specifies a language through the derivation process:begin with the start symbolsubstitute for non-terminals using rules of production until you get a string of terminals

  • Example: a grammar for identifiers (a toy example)N = {, , }T = {a, b, c, 1, 2, 3}P = the productions: ( means produces) a b c 1 2 3S =

  • Example: deriving a12bc: (rule 2) c (rule 6) means c (rule 2)derives in one bc (rule 5)step bc (rule 3) 2bc (rule 8) 2bc (rule 3) 12bc (rule 7) 12bc a12bc

  • Closure of derivationThe symbol * means derives in 0 or more stepsA language specified by a grammar consists of all strings derivable from the start symbol using the rules of productionprovides operational test for membership in the languageif a string cant be derived using production rules, it isnt in the language

  • Example: attempting to derive 2a aSince there is no combination in the production rules, we cant proceed any furtherThis means that 2a isnt a valid string in our language

  • A grammar for signed integersN = {I, F, M}I means integerF means first symbol; optional signM means magnitudeT = {+,-,d} (d means digit 0-9)P = the productions:I FMF +F -F (means +/- is optional)M dMM dS = I

  • ExamplesDeriving 14:Deriving -7:I FMI FM M -M dM -d dd -7 14

  • Recursive rulesBoth of the previous examples (identifiers, integers) have rules in which a nonterminal is defined in terms of itself: andM dMSuch rules produce languages with infinite sets of legal sentences

  • Context-sensitive grammarA grammar in which the production rules may contain more than one non-terminal on the left sideThe opposite (all of the examples we have seen thus far), have production rules restricted to a single non-terminal on the left: these are known as context-free grammars

  • ExampleN = {A,B,C}T = {a,b,c}P = the productions:A aABCA abCCB BCbB bbbC bcThis rule is context-sensitive: C can besubstituted with c only if C is immediatelypreceded by bcC ccS = A

  • Context-sensitive grammarN = {A, B, C}T = {a, b, c}P = the productionsA --> aABCA --> abCCB --> BCbB --> bbbC --> bccC --> ccS = AExample:aaabbbcc is a valid string by:A => aABC (1)=> aaABCBC (1)=> aaabCBCBC (2)=> aaabBCCBC (3)=> aaabBCBCC (3)=> aaabBBCCC (3)=> aaabbBCCC (4)=> aaabbbCCC (4)=> aaabbbcCC (5)Here, we substituted c for C;this is allowable only if C hasb in front of it=> aaabbbccC (6)=> aaabbbccc (6)

  • Valid & invalid strings from previous example:Valid:abcaabbccaaabbbcccaaaabbbbccccInvalid:aabccbabbbcccThe grammar describes a language consisting of strings that start with a number of as, followed by an equal number of bs and cs; this language can be defined mathematically as:

    L = {anbncn | n > 0}

    Note: an means the concatenation of n as

  • A grammar for expressionsN = {E, T, F} where:E: expressionT: term T = {+, *, (, ), a}F: factorP: the productions:E -> E + TE -> TT -> T * FT -> FF -> (E)F -> aS = E

  • Applying the grammarYou cant reach a valid conclusion if you dont have a valid string, but the opposite is not trueFor example, suppose we want to parse the string (a * a) + a using the grammar we just sawFirst attempt:E=> T (by rule 2)=> F (by rule 4) and, were stuck, because F can only produce (E) or a; so we reach a dead end, even though the string is valid

  • Applying the grammarHeres a parse that works for (a*a)+a:E=> E+T (rule 1)=> T+T (rule 2)=> F+T (rule 4)=> (E)+T (rule 5)=> (T)+T (rule 2)=> (T*F)+T (rule 3)=> (T*a)+T (rule 6)=> (F*a)+F (rule 4 applied twice)=> (a*a) + a (rule 6 applied twice)

  • Deriving a valid string from a grammarArbitrarily pick a nonterminal on right side of current intermediate string & select rules for substitution until you get a string of terminalsAutomatic translators have more difficult problem:given string of terminals, determine if string is valid, then produce matching object codeonly way to determine string validity is to derive it from the start string of the grammar this is called parsing

  • The parsing problemAutomatic translators arent at liberty to pick rules randomly (as illustrated by the first attempt to translate the preceding expression)Parsing algorithm must search for the right sequence of substitutions to derive a proposed stringTranslator must also be able to prove that no derivation exists if proposed string is not valid

  • Syntax treeA parse routine can be represented as a treestart symbol is the rootinterior nodes are nonterminal symbolsleaf nodes are terminal symbolschildren of an interior node are symbols from right side of production rule substituted for parent node in derivation

  • Syntax tree for (a*a)+a

  • Grammar for a programming languageA grammar for a subset of the C++ language is laid out on pages 340-341 of the textbookA sampling (suitable for either C++ or Java) is given on the next couple of slides

  • Rules for declarations -> ; -> char | int | double(remember, this is subset of actual language) -> | , -> | | -> a|b|c| |z|A|B||Z -> 0|1|2|3|4|5|6|7|8|9

  • Rules for control structures ->if () |if () else ->while () |do while () ;

  • Rules for expressions -> ; -> | = -> | < | > | = etc.

  • Backus-Naur Form (BNF)BNF is the standardized form for specification of a programming language by its rules of productionIn BNF, the -> operator is written ::=ALGOL-60 first popularized the form

  • BNF described in terms of itself (from Wikipedia)::= |

    ::= "" "::="

    ::= " " | ""

    ::= | "|"

    ::= |

    ::= |

    ::= | "

    ::= '"' '"' | "'" "'"