
TDT4205 Recitation 3: Lexical analysis

● Last week:
– Make and makefiles

– Text filters inside and out

– Some C, idiomatically

● Today: problem set 2
– We've raced through the preliminaries, time for compiler stuff (yay!)

– Analysis by hand, and by generated analyzer
– (This lecture is given both Monday and Thursday, to keep everyone on board even with the off-beat timing)

Today's tedious practical matter

● The exercises are part of your evaluation
– I'm not the one holding the ultimate responsibility that your evaluation is fair

– Thus, I can't decide on any kind of differential treatment

– In plain English, I can not extend your deadlines

– No, not even for world tours, moon landings or funerals

– Where it says “contact the instructor”, that's Dr. Elster

– (Generally, after Feb. 15th the deadlines harden)

Worthwhile questions in plenary

● (This one is from rec. 1, but I gave a somewhat foofy answer at the time...)

● Does it make a difference whether main ends in “return int” or “exit ( int )”?

– As it turns out, No.

– The reason I hesitated was that one can register function calls to happen at program exit (w. function pointers and the atexit function).

– This mechanism is required to behave the same in both cases, so it's really a clear case. (Live and learn...)

● For myself, I'll keep writing exit for “stop the program” and return for “stop the function” (unless there turns out to be a good reason why it's silly).

Where we're at

● Things needed to:
– Submit homework (pdfs and tarballs)

– Build programs (make, cc)

– Build scanners (Lex/flex)

– Build parsers (Yacc/bison)

– Build symbol tables (hashtables/libghthash)

– Assemble machine code (as)

● ...but first, a bit of handwaving

The science of computer languages: even experts reach for magical metaphors

(Slide images: a battle with a ferocious dragon; “The spirit which lives in the computer”)

My humble perspective on the subject

● Compiler construction lies at an intersection between

– Hardware architecture (very practical things)

– Software architecture (very complicated things)

– Complexity theory (very theoretical things)

– Theories of language (very non-computery things)
● What's cool about it is that handling the resulting calamity in the middle is a success story of computer science

● Even so, the dragon's sweater reads “complexity of compiler design”, and the knight's sword is a parser generator

● Moral: bring tools to the job

– Dragons find hero programmers crunchy, and good with ketchup

General terminology: bits and bobs of languages

● Different language models are suitable depending on what you want to look at:

– Lexical models say what goes into a word, and where they are separated from each other

– Syntactical models tell which roles a given word can play in a statement it is part of

– Semantics speak of what it means when a given word appears playing a given role

● There's a whole heap of other stuff which isn't usually applied to programming languages (morphology, pragmatics, …)

● What we're after today is lumping characters into words.

Lexical analysis, the ad-hoc way

● Say that we want to recognize an arbitrary fraction; should be easy, <ctype.h> is full of functions to classify characters (a C sketch follows below)...

– read character

– while ( isdigit ( character ) ) { read another }

– if ( character != ' / ' ) { die horribly }

– while ( isdigit ( character ) ) { keep reading }
● First loop takes an integer numerator

● Second loop takes an integer denominator

● Condition in the middle requires that they're separated by what we expect.

● This works if you only have a few different words in your care.
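
● In C, that recipe might look roughly like this (a minimal sketch, not the handout code; the names and the error handling are made up here):

    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Recognize "digits '/' digits" on stdin, e.g. "22/7" */
    int main ( void )
    {
        int c = getchar ();
        if ( ! isdigit ( c ) )          /* need at least one digit */
            exit ( EXIT_FAILURE );
        while ( isdigit ( c ) )         /* first loop: the numerator */
            c = getchar ();
        if ( c != '/' )                 /* the separator we expect */
            exit ( EXIT_FAILURE );      /* "die horribly" */
        c = getchar ();
        if ( ! isdigit ( c ) )
            exit ( EXIT_FAILURE );
        while ( isdigit ( c ) )         /* second loop: the denominator */
            c = getchar ();
        puts ( "Looks like a fraction" );
        return EXIT_SUCCESS;
    }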

The automaton way I

● DFAs are good too, they chew a character at a time

● Looking at the state diagram, each state has a finite number of transitions...

● ...so we can code them up in a finite amount of time.

● Here goes:

– if ( state=1 and c='a', or state=1 and c='b', or... ) { state = 14; /* lowercase letters go to 14 */ }

– else if ( state = 1 and c='0', or state=1 and c='1', or... ) { state = 42; /* digits in state 1...*/ }

– else if (… else if...

● (I'm beginning to think this wasn't such a fantastic idea after all)

The automaton way II

● DFAs can be tabulated:
– First, punch in the table

– Next, set a start state

– Loop while state isn't known to accept or reject:
● Next state = table [ this state ] [ character ] (see the sketch below)

● (A recipe like this is in the Dragon book, p. 151)

● Wonder of wonders, one algorithm will work for any DFA, just change the table!

● This is pretty much get_token in Task 2, it's not that hard.
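
● A sketch of that loop, with invented names (N_STATES, ACCEPT, REJECT and the table are placeholders here; the handout's get_token differs in its details):

    #include <stdio.h>

    #define N_STATES 64                 /* arbitrary size for the sketch     */
    #define REJECT   (-1)               /* unlisted transitions lead here    */
    #define ACCEPT   (N_STATES - 1)     /* pretend the last state accepts    */

    /* One algorithm works for any DFA: just change the table. */
    static int run_dfa ( const int table[N_STATES][256], int start, FILE *in )
    {
        int state = start, c;
        while ( state != ACCEPT && state != REJECT )
        {
            c = fgetc ( in );
            if ( c == EOF )
                return REJECT;
            state = table[state][c];    /* next state = table[this state][char] */
        }
        return state;
    }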

Surrounding logic of vm.c

● Basically, it's like last week's text filter description, but with tokens:

    T = token_get();
    while ( T != wrong )
    {
        do_something ( T );
        T = token_get();
    }

● 'token_get' is a little more involved than 'readchar()', but it's still just an integer to branch on

● The 'do_something' is already in place, you won't have to write that

Inside token_get: where did I leave my chars?

● DFAs have a horribly short memory; they barely know more than where they are.

● When the time comes to accept:

– What are we accepting? (Answer is the token)

– Why did we accept this? (Answer is the lexeme)

● At the accept state in the PS2 diagram, neither is known

● To do this, impart a sense of history to your code:

– The 2nd-to-last state determines the token; it can be set there, to be recalled upon reaching the accept state

– There's a buffer 'lexeme' to plop each char into as you go along, to tell “127” from “74” even though they both match to integer tokens
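
● One way to impart that history, again with invented names (the PS2 skeleton has its own variables, and pushing back the character that ended the token is glossed over here):

    #include <stdio.h>
    #define MAXLEX 64

    static char lexeme[MAXLEX];                  /* why we accepted: the text    */

    static int get_token_sketch ( const int table[][256], const int token_of[],
                                  int start, int accept, int reject, FILE *in )
    {
        int state = start, previous = start, length = 0, c;
        while ( state != accept && state != reject )
        {
            c = fgetc ( in );
            if ( c == EOF )
                return reject;
            if ( length < MAXLEX - 1 )
                lexeme[length++] = (char) c;     /* tells "127" from "74"        */
            previous = state;                    /* 2nd-to-last state remembered */
            state = table[state][c];
        }
        lexeme[length] = '\0';
        return ( state == accept ) ? token_of[previous] : reject;   /* the token */
    }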

A few more notes on vm.c
● The table is all set up, table[state]['a'] gives transition from state on 'a'

– Initially, all lead to state -1, which works for 'reject'

– We'll assume transitions not noted lead there

– Table is (much) bigger than it has to be, for the convenience of indexing with character codes

– There's a macro T(state,c) which expands to table[state][c]; this is just to save on the typing (see the sketch below)

● The language def. isn't splendidly clean (mixes in whitespace for good measure), but the intention is (hopefully) clear

● The 'lexeme' buffer is fixed-length, and can be easily overrun with long integers. We could fix it, but it's kind of beside the point at the moment, let's assume input is friendly.

● (The stack is finite too, so it won't do long programs.)
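
● For reference, the T macro mentioned above boils down to something like this (a guess at its shape; check vm.c for the real definition, the state numbers here are the invented ones from the earlier slide):

    #include <string.h>

    #define N_STATES 32                     /* arbitrary for the sketch         */
    #define T(state,c) table[(state)][(c)]  /* just to save on the typing       */

    static int table[N_STATES][256];

    static void setup_table ( void )
    {
        memset ( table, -1, sizeof table ); /* every byte 0xff: entries start   */
                                            /* as -1, i.e. everything rejects   */
        T(1,'a') = 14;                      /* lowercase letters in state 1     */
        T(1,'0') = 42;                      /* digits in state 1                */
    }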

Testing

● There are two files included, one for checking just the tokenizer, and one small program

● “./vm -t” will drop execution; this is used to test with an included list of lexemes

● (In a few cases, the input is broken through 'sed', to see if errors come out. Sed is just a text filter which can apply reg.exp. substitutions. It's a handy tool.)

● Just starting “./vm -t” without any pipeline will take stdin from the keyboard. (On most terminals, Ctrl + D will send the end-of-file character.)

The bridge to Lex

● What we just saw is exactly what Lex does:
– Take some regular expressions

– Write out the mother lode of a table

– Implement traversal

● The names are a little different:
– 'token_get (stdin)' is called “yylex()”

– The lexeme buffer is called yytext

● Major win: the tabulation is automated; less tedious, far less prone to mistakes

Lex specifications: declarations

● Declarations section contains initializer C code, some directives, and optionally, some named regular expressions

– “TABSTRING [\t]+” will define a symbolic name TABSTRING for the expression (which here matches a sequence of at least one tabulator character)

– References to these names can go into other expressions in the rules section: {TABSTRING}123 will match a string of at least one tab, followed by '123'

– Not necessary, but a boon for readability when expressions grow complicated

● Anything enclosed between '%{' and '%}' in this section goes in verbatim at the start of the generated C code

● There's a nasty macro in there, which gets more attention in a minute

Lex specifications: rules

● The rules section is just regular expressions annotated with basic blocks (semantic actions):

– a(a|b)* { return SOMETHING; } will see the yylex() function return “SOMETHING” whenever what appears on stdin matches an 'a' followed by zero or more ('a' or 'b')-s

– Any code could go into the semantic action, it's just a block of C. If it's empty, the reg.exp. will strip text from the input.

– A set of token values to return are already included in “parser.h”, so you don't have to invent token values

Gritty details

● The one rule already implemented in scanner.l is “. { RETURN(yytext[0]); }”, which matches any one character and returns its ASCII code as a token value.

● Keep this rule (as the last one), it will save us from defining long symbolic names for single-char tokens like '+' and '}' (...even though this overlaps the lexeme with the token value...)

● The RETURN() macro is a hack, but a useful one:

– #ifdef-d on DUMP_TOKENS, it not only returns a token value, but also prints it along with its lexeme. Thus, we can define DUMP_TOKENS and test the scanner without plopping a greater frame around it.

– When we're done, dropping DUMP_TOKENS will give us a well-behaved scanner which just returns values.
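
● The macro in the handout is the authoritative version, but its general shape is something like this (a sketch; the exact output format will differ):

    #ifdef DUMP_TOKENS
        #define RETURN(t) do {                                               \
            fprintf ( stderr, "token %d\tlexeme '%s'\n", (int)(t), yytext ); \
            return (t);                                                      \
        } while ( 0 )
    #else
        #define RETURN(t) return (t)
    #endif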

Bringing it all together

● The specification file consists of declarations, rules and function definitions (in that order), separated by %%

● We won't need the function definitions, but you can stuff auxiliary functions in there. (If you implement main there, you can make a standalone program in a Lex spec.)
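
● To make the layout concrete, here is a minimal standalone flex spec (an illustration only; the token value and the messages are invented, the course's scanner.l returns the values from parser.h instead):

    %{
    #include <stdio.h>
    #define NUMBER 258               /* made-up token value for the example */
    %}
    %option noyywrap
    DIGITS  [0-9]+
    %%
    {DIGITS}   { printf ( "NUMBER, lexeme '%s'\n", yytext ); return NUMBER; }
    [ \t\n]+   { /* empty action: whitespace is stripped from the input */ }
    .          { return yytext[0]; }
    %%
    int main ( void )
    {
        while ( yylex () != 0 )      /* yylex() returns 0 at end of input */
            ;
        return 0;
    }

● Built with something like “flex demo.l && cc lex.yy.c -o demo” (file names invented), it tokenizes stdin on its own.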

Outline of the vslc directory

● 'src' is for keeping handwritten sources.

● 'obj' fills up with object code on every build

● 'work' fills up with generated source code

● 'bin' contains the compiler binary

● 'vsl_programs' contains little examples in the language we're compiling, for testing purposes

● There's a separate makefile under vsl_programs, to manage the testing stuff separately

● 'clean' and 'purge' simply wipe out all the generated material

Outline of the vslc directory

● All this separation is a little over-engineered, but the idea is to keep the building blocks apart (since we're dismantling the compilation process anyway).

● Test cases are there to verify things at the end; noodling around, inventing your own input in such a way that you can verify the result, is still invaluable. Reminders:

– echo "FUNC main () { VAR a,b RETURN 0 }" | bin/vslc

– From file, 'cat myfile.vsl | bin/vslc' or 'bin/vslc -f myfile.vsl'

● Testing little bits at a time and trying to predict the outcome is likely to get things finished quicker than trying to write the entire scanner spec. correctly on the first go (even if that is still possible here...)

Notes which didn't fit anywhere else

● Lex regular expressions support a few more conveniences than the Dragon flavor. I found one nice reference by going to Google with “flexout flex regular expression cheatsheet” (gives 1st hit to a likeable one, there are hundreds of things like this)

● Mind the string literals; most escape sequences will take care of themselves (we'll interpolate them with printf later), but '"' has to be escaped by us because it's the delimiter of the strings themselves.

● “#define MACRO(p) do { a(p); b(p,0); p = 42; } while ( false )” may look a little redundant, but it's a/the way to wrap multiple statements in macros so that they look like single statements.

“#define MACRO(p) a(p); b(p,0); p=42;” fails as 'if' body

“#define MACRO(p) {a(p); b(p,0); p=42;}” won't allow ';' after it (before an 'else')
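
● A small made-up example of why the do/while wrapper matters (note: plain C spells it "while (0)" unless stdbool.h is included):

    #include <stdio.h>

    #define GOOD(p)  do { printf ( "a(%d) ", p ); printf ( "b(%d)\n", p ); } while ( 0 )
    #define BARE(p)  printf ( "a(%d) ", p ); printf ( "b(%d)\n", p );

    int main ( void )
    {
        int ready = 0;

        if ( ready )
            GOOD(42);       /* expands to one statement, so the else still parses */
        else
            puts ( "skipped" );

        if ( ready )
            BARE(42);       /* only the first printf is guarded by the if;        */
                            /* the second one always runs                         */
        /* (and a {...} version would not allow the ';' here before an 'else')    */
        return 0;
    }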