1

Introduction to JavaCC

Cheng-Chia Chen

2

What is a parser generator

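[Figure: the character stream "Total = price + tax ;" enters the Scanner, which emits the tokens id = id + id ;. The Parser turns them into a parse tree: an assignment whose left side is Total and whose right side is the Expr price + tax. Both Scanner and Parser are produced by the parser generator (JavaCC) from a lexical + grammar specification.]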
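The scanner half of the picture can be sketched by hand. This is only an illustration of what a scanner does, not code that JavaCC generates; the class name TinyScanner and the "id(...)" output format are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class TinyScanner {
    /** Turns a character stream into tokens: identifiers become id(name), operators stay as-is. */
    public static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                                   // whitespace separates tokens and is dropped
            } else if (Character.isLetter(c)) {
                int start = i;
                while (i < input.length() && Character.isLetterOrDigit(input.charAt(i))) {
                    i++;                               // consume the whole identifier
                }
                tokens.add("id(" + input.substring(start, i) + ")");
            } else if (c == '=' || c == '+' || c == ';') {
                tokens.add(String.valueOf(c));
                i++;
            } else {
                throw new IllegalArgumentException("Unexpected character '" + c + "' at " + i);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [id(Total), =, id(price), +, id(tax), ;]
        System.out.println(scan("Total = price + tax ;"));
    }
}
```

The parser then works on this token list rather than on raw characters, which is why the tree in the figure has id leaves.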

3

JavaCC

• JavaCC (Java Compiler Compiler) is a scanner and parser generator.
• It produces a scanner and/or a parser written in Java, and is itself written in Java.
• There are many other parser generators:
– yacc (Yet Another Compiler-Compiler) for the C programming language (dragon book, chapter 4.9)
– Bison from gnu.org
• There are also many parser generators written in Java:
– JavaCUP
– ANTLR
– SableCC

4

More on classification of java parser generators

• Bottom-up parser generator tools
– JavaCUP
– jay, YACC for Java: http://www.inf.uos.de/bernd/jay
– SableCC, The Sable Compiler Compiler: www.sablecc.org
• Top-down parser generator tools
– ANTLR, Another Tool for Language Recognition: http://www.antlr.org/
– JavaCC, Java Compiler Compiler: javacc.dev.java.net

5

Features of JavaCC

• Top-down LL(k) parser generator
• Lexical and grammar specifications in one file
• Tree-building preprocessor
– with JJTree
• Extremely customizable
– many different options selectable
• Document generation
– by using JJDoc
• Internationalized
– can handle full Unicode
• Syntactic and semantic lookahead

6

Features of JavaCC (cont’d)

• Permits extended BNF specifications
– can use | * ? + () on the right-hand side
• Lexical states and lexical actions
• Case-sensitive/insensitive lexical analysis
• Extensive debugging capability
• Special tokens
• Very good error reporting

7

JavaCC Installation

• Download the file javacc-3.X.zip from https://javacc.dev.java.net/
• Unzip javacc-3.X.zip to a directory %JCC_HOME%
• Add the %JCC_HOME%\bin directory to your %path%.
– javacc, jjtree, and jjdoc can now be invoked directly from the command line.



8

Steps to use JavaCC

• Write a JavaCC specification (.jj file)
– Defines the grammar and actions in a file (say, calc.jj)
• Run JavaCC to generate a scanner and a parser
– javacc calc.jj
– Generates the parser, scanner, token, … Java sources
• Write your program that uses the parser
– For example, UseParser.java
• Compile and run your program
– javac -classpath . *.java
– java -cp . mainpackage.MainClass

9

Example 1: parse a spec of regular expressions and match it with input strings
• Grammar: re.jj
• Example input:
– % all strings ending in "ab"
– (a|b)*ab;
– aba;
– ababb;
• Our task:
– For each input string (lines 3 and 4), determine whether it matches the regular expression (line 2).

10

the overall picture

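[Figure: javacc reads re.jj and generates REParserTokenManager and REParser. The input (a "% comment" line, the regular expression "(a|b)*ab;", then the test strings "a;" and "ab;") goes to REParserTokenManager, which passes tokens to REParser; MainClass receives the result.]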

11

Class diagram (to be added)

12

Format of a JavaCC input Grammar

• javacc_options

• PARSER_BEGIN ( <IDENTIFIER>1 )

java_compilation_unit

PARSER_END ( <IDENTIFIER>2 )

• ( production )*

13

14

the input spec file (re.jj)

options {
  USER_TOKEN_MANAGER=false;
  BUILD_TOKEN_MANAGER=true;
  OUTPUT_DIRECTORY="./reparser";
  STATIC=false;
}

15

16

re.jj

PARSER_BEGIN(REParser)
package reparser;

import java.lang.*;
…
import dfa.*;

public class REParser {
  public FA tg = new FA();

  // output error message with current line number
  public static void msg(String s) {
    System.out.println("ERROR" + s);
  }

  public static void main(String args[]) throws Exception {
    REParser reparser = new REParser(System.in);
    reparser.S();
  }
}
PARSER_END(REParser)

17

re.jj (Token definition)

TOKEN : {
  <SYMBOL: ["0"-"9","a"-"z","A"-"Z"] >
| <EPSILON: "epsilon" >
| <LPAREN: "(" >
| <RPAREN: ")" >
| <OR: "|" >
| <STAR: "*" >
| <SEMI: ";" >
}

SKIP : {
  < ( [" ","\t","\n","\r","\f"] )+ >
| < "%" ( ~["\n"] )* "\n" > { System.out.println(image); }
}

18

re.jj (productions)

void S() : { FA d1; }
{
  d1 = R() <SEMI>
  {
    tg = d1;
    System.out.println("------NFA");      tg.print();
    System.out.println("------DFA");      tg = tg.NFAtoDFA(); tg.print();
    System.out.println("------Minimize"); tg = tg.minimize(); tg.print();
    System.out.println("------Renumber"); tg = tg.renumber(); tg.print();
    System.out.println("------Execute");
  }
  testCases()
}

19

re.jj

void testCases() : {}
{
  ( testCase() )+
}

void testCase() : { String testInput; }
{
  testInput = symbols() <SEMI>
  { tg.execute(testInput); }
}

String symbols() : { Token token = null; StringBuffer result = new StringBuffer(); }
{
  ( token = <SYMBOL> { result.append(token.image); } )*
  { return result.toString(); }
}

20

re.jj (regular expression)

// R --> RUnit | RConcat | RChoice

FA R() : { FA result; }
{
  result = RChoice() { return result; }
}

FA RUnit() : { FA result; Token d1; }
{
  ( <LPAREN> result = RChoice() <RPAREN>
  | <EPSILON> { result = tg.epsilon(); }
  | d1 = <SYMBOL> { result = tg.symbol(d1.image); }
  )
  { return result; }
}

21

re.jj

FA RChoice() : { FA result, temp; }
{
  result = RConcat()
  ( <OR> temp = RConcat() { result = result.choice(temp); } )*
  { return result; }
}

FA RConcat() : { FA result, temp; }
{
  result = RStar()
  ( temp = RStar() { result = result.concat(temp); } )*
  { return result; }
}

FA RStar() : { FA result; }
{
  result = RUnit()
  ( <STAR> { result = result.closure(); } )*
  { return result; }
}
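The layering RChoice, RConcat, RStar, RUnit is what gives "|" the lowest precedence and "*" the highest. As a rough hand-written counterpart (a sketch, not the code JavaCC generates), the recursive-descent parser below has one method per production and returns a string showing the nesting instead of building an FA. The class name RegexShape and the output format are assumptions for illustration; the epsilon keyword and whitespace handling are left out, and Character.isLetterOrDigit accepts more than the SYMBOL token does.

```java
public class RegexShape {
    private final String src;
    private int pos = 0;

    private RegexShape(String src) { this.src = src; }

    /** Parses a regular expression and returns its structure as nested calls. */
    public static String parse(String regex) {
        RegexShape p = new RegexShape(regex);
        String result = p.choice();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("Unexpected input at " + p.pos);
        }
        return result;
    }

    private boolean at(char c) { return pos < src.length() && src.charAt(pos) == c; }

    private boolean startsUnit() {
        return pos < src.length() && (src.charAt(pos) == '(' || Character.isLetterOrDigit(src.charAt(pos)));
    }

    // RChoice: RConcat ( "|" RConcat )*
    private String choice() {
        String result = concat();
        while (at('|')) {
            pos++;
            result = "choice(" + result + "," + concat() + ")";
        }
        return result;
    }

    // RConcat: RStar ( RStar )*
    private String concat() {
        String result = star();
        while (startsUnit()) {
            result = "concat(" + result + "," + star() + ")";
        }
        return result;
    }

    // RStar: RUnit ( "*" )*
    private String star() {
        String result = unit();
        while (at('*')) {
            pos++;
            result = "closure(" + result + ")";
        }
        return result;
    }

    // RUnit: "(" RChoice ")" | SYMBOL
    private String unit() {
        if (at('(')) {
            pos++;
            String inner = choice();
            if (!at(')')) throw new IllegalArgumentException("Expected ')' at " + pos);
            pos++;
            return inner;
        }
        if (pos < src.length() && Character.isLetterOrDigit(src.charAt(pos))) {
            return String.valueOf(src.charAt(pos++));
        }
        throw new IllegalArgumentException("Expected a symbol or '(' at " + pos);
    }

    public static void main(String[] args) {
        // prints concat(concat(closure(choice(a,b)),a),b)
        System.out.println(parse("(a|b)*ab"));
    }
}
```

Each loop corresponds to a ( … )* in the grammar, and each left-nested result corresponds to the repeated result = result.choice(temp) or result.concat(temp) action in re.jj.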

22

Format of a JavaCC input Grammar

javacc_input ::= javacc_options

PARSER_BEGIN ( <IDENTIFIER>1 ) java_compilation_unit

PARSER_END ( <IDENTIFIER>2 ) ( production )* <EOF>

color usage:
– blue: nonterminal
– <orange>: a token type
– purple: token lexeme (reserved word, i.e., consisting of the literal itself)
– black: meta symbols

23

Notes

• <IDENTIFIER> means any Java identifier, such as var, class2, …
– IDENTIFIER (without brackets) means the literal word IDENTIFIER only.
• <IDENTIFIER>1 must equal <IDENTIFIER>2.
• java_compilation_unit is any Java code that as a whole can legally appear in a file.
– It must contain a main class declaration with the same name as <IDENTIFIER>1.

• Ex:

PARSER_BEGIN ( MyParser )

package mypackage;

import myotherpackage….;

public class MyParser { … }

class MyOtherUsefulClass { … } …

PARSER_END (MyParser)

24

The input and output of javacc

Input (MyLangSpec.jj):

PARSER_BEGIN ( MyParser )
package mypackage;
import myotherpackage….;
public class MyParser { … }
class MyOtherUsefulClass { … } …
PARSER_END (MyParser)

javacc turns this input into the following output files:
– MyParser.java
– MyParserTokenManager.java
– MyParserConstants.java
– Token.java
– ParseException.java

25

Notes:

• Token.java and ParseException.java are the same for all inputs and can be reused.
• The package declaration in *.jj is copied to all 3 outputs.
• The import declarations in *.jj are copied to the parser and token manager files.
• The parser file is assigned the file name <IDENTIFIER>1.java.
• The parser file has the contents:

  …
  class MyParser {
    …
    // generated parser is inserted here.
    …
  }

• The generated token manager provides one public method:
  Token getNextToken();
  (lexical errors are reported by throwing TokenMgrError)

26

Lexical Specification with JavaCC

27

javacc options

javacc_options ::=

[ options { ( option_binding )* } ]

• Each option_binding is of the form:
– <IDENTIFIER>3 = <java_literal> ;
– where <IDENTIFIER>3 is not case-sensitive.

• Ex:

options {

USER_TOKEN_MANAGER=true;

BUILD_TOKEN_MANAGER=false;

OUTPUT_DIRECTORY="./sax2jcc/personnel";

STATIC=false;

}

28

More Options (default value in parentheses)
• LOOKAHEAD
– java_integer_literal (1)
• CHOICE_AMBIGUITY_CHECK
– java_integer_literal (2), for A | B … | C
• OTHER_AMBIGUITY_CHECK
– java_integer_literal (1), for (A)*, (A)+ and (A)?
• STATIC (true)
• DEBUG_PARSER (false)
• DEBUG_LOOKAHEAD (false)
• DEBUG_TOKEN_MANAGER (false)
• OPTIMIZE_TOKEN_MANAGER
– java_boolean_literal (false)
• OUTPUT_DIRECTORY (current directory)
• ERROR_REPORTING (true)

29

More Options (default value in parentheses)
• JAVA_UNICODE_ESCAPE (false)
– replace an escape such as \u2245 with the actual Unicode character (6 chars → 1 char)
• UNICODE_INPUT (false)
– the input stream is in Unicode form
• IGNORE_CASE (false)
• USER_TOKEN_MANAGER (false)
– generate a TokenManager interface for the user's own scanner
• USER_CHAR_STREAM (false)
– generate a CharStream.java interface for the user's own input stream
• BUILD_PARSER (true)
– java_boolean_literal
• BUILD_TOKEN_MANAGER (true)
• SANITY_CHECK (true)
• FORCE_LA_CHECK (false)
• COMMON_TOKEN_ACTION (false)
– invoke void CommonTokenAction(Token t) after every getNextToken()
• CACHE_TOKENS (false)

30

Example: Figure 2.2

   Regular expression                          Token
1. if                                          IF
2. [a-z][a-z0-9]*                              ID
3. [0-9]+                                      NUM
4. ([0-9]+"."[0-9]*) | ([0-9]*"."[0-9]+)       REAL
5. ("--"[a-z]*"\n") | (" "|"\n"|"\t")+         nonToken, WS
6. .                                           error

• JavaCC notations
1. "if" or "i" "f" or ["i"]["f"]
2. ["a"-"z"](["a"-"z","0"-"9"])*
3. (["0"-"9"])+
4. (["0"-"9"])+ "." (["0"-"9"])* | (["0"-"9"])* "." (["0"-"9"])+

31

JavaCC spec for the tokens from Fig 2.2

PARSER_BEGIN(MyParser)
class MyParser {}
PARSER_END(MyParser)

/* For the regular expression on the right, the token on the left will be returned */
TOKEN : {
  < IF: "if" >
| < #DIGIT: ["0"-"9"] >
| < ID: ["a"-"z"] ( ["a"-"z"] | <DIGIT> )* >
| < NUM: (<DIGIT>)+ >
| < REAL: ( (<DIGIT>)+ "." (<DIGIT>)* ) | ( (<DIGIT>)* "." (<DIGIT>)+ ) >
}

32

JavaCC spec for the tokens from Fig 2.2 (continued)

/* The regular expressions here will be skipped during lexical analysis */
SKIP : { < " " > | < "\t" > | < "\n" > }

/* like SKIP, but the skipped text is accessible from parser actions */
SPECIAL_TOKEN : {
  < "--" (["a"-"z"])* ( "\n" | "\r" | "\n\r" ) >
}

/* For any substring not matching the lexical spec, javacc will throw an error */

/* main rule */
void start() : {}
{
  ( <IF> | <ID> | <NUM> | <REAL> )*
}
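The token manager resolves overlaps between these rules by taking the longest match and, on a tie, the rule listed first; this is why "if" becomes IF while "if8" becomes ID. The sketch below imitates that policy with java.util.regex. It is not how the generated token manager is implemented; the class name LongestMatch and the "KIND:image" output format are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LongestMatch {
    // Rules in declaration order; WS plays the role of the SKIP section.
    private static final String[] NAMES = {"IF", "ID", "NUM", "REAL", "WS"};
    private static final Pattern[] RULES = {
        Pattern.compile("if"),
        Pattern.compile("[a-z][a-z0-9]*"),
        Pattern.compile("[0-9]+"),
        Pattern.compile("[0-9]+\\.[0-9]*|[0-9]*\\.[0-9]+"),
        Pattern.compile("[ \\t\\n]+")
    };

    public static List<String> scan(String input) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int bestRule = -1;
            int bestLen = 0;
            for (int r = 0; r < RULES.length; r++) {
                Matcher m = RULES[r].matcher(input).region(pos, input.length());
                // Strictly longer wins, so on a tie the earlier rule is kept.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestRule = r;
                    bestLen = m.end() - pos;
                }
            }
            if (bestRule < 0) {
                throw new IllegalArgumentException("Lexical error at position " + pos);
            }
            if (!NAMES[bestRule].equals("WS")) {
                out.add(NAMES[bestRule] + ":" + input.substring(pos, pos + bestLen));
            }
            pos += bestLen;
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [IF:if, ID:if8, NUM:42, REAL:3.14]
        System.out.println(scan("if if8 42 3.14"));
    }
}
```

Placing IF before ID in the TOKEN section matters for exactly this reason: with the order reversed, the tie on "if" would go to ID.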

33

34

Grammar Specification with JavaCC

35

The Form of a Production

java_return_type java_identifier ( java_parameter_list ) :

java_block

{ expansion_choices }

• EX :

void XMLDocument(Logger logger): { int msg = 0; }

{ <StartDoc> { print(token); }

Element(logger)

<EndDoc> { print(token); }

| else()

}

36

Example ( Grammar 3.30 )

1. P → L
2. S → id := id
3. S → while id do S
4. S → begin L end
5. S → if id then S
6. S → if id then S else S
7. L → S
8. L → L ; S

Rules 1, 7, 8 combine to: P → S ( ; S )*

37

JavaCC Version of Grammar 3.30

PARSER_BEGIN(MyParser)
public class MyParser {}
PARSER_END(MyParser)

SKIP : { " " | "\t" | "\n" }

TOKEN : {
  <WHILE: "while"> | <BEGIN: "begin"> | <END: "end">
| <DO: "do"> | <IF: "if"> | <THEN: "then">
| <ELSE: "else"> | <SEMI: ";"> | <ASSIGN: "=">
| <#LETTER: ["a"-"z"]>
| <ID: <LETTER> ( <LETTER> | ["0"-"9"] )* >
}

38

JavaCC Version of Grammar 3.30 (cont’d)

void Prog() : {} { StmList() <EOF> }

void StmList() : {} {
  Stm() ( ";" Stm() )*
}

void Stm() : {} {
  <ID> "=" <ID>
| "while" <ID> "do" Stm()
| <BEGIN> StmList() <END>
| "if" <ID> "then" Stm() [ LOOKAHEAD(1) "else" Stm() ]
}

39

Types of productions

• production ::= javacode_production
             | regular_expr_production
             | bnf_production
             | token_manager_decls

Note:
– 1 and 3 are used to define the grammar.
– 2 is used to define tokens.
– 4 is used to embed code into the token manager.

40

JAVACODE production

• javacode_production ::= "JAVACODE"
    java_return_type java_identifier "(" java_parameter_list ")"
    java_block

• Note:
– Used to define nonterminals that recognize something that is hard to parse using normal productions.

41

Example JAVACODE

JAVACODE
void skip_to_matching_brace() {
  Token tok;
  int nesting = 1;
  while (true) {
    tok = getToken(1);
    if (tok.kind == LBRACE) nesting++;
    if (tok.kind == RBRACE) {
      nesting--;
      if (nesting == 0) break;
    }
    tok = getNextToken();
  }
}
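The nesting counter is the whole trick: it starts at 1 because the opening brace has already been consumed, and the scan stops at the brace that brings it back to 0. The stand-alone sketch below applies the same counting to a plain list of strings so it can be run without a generated parser. The class name BraceSkipper and the use of strings for token kinds are assumptions for illustration; unlike the loop above, which has no end-of-input check, the sketch returns -1 when no matching brace exists.

```java
import java.util.Arrays;
import java.util.List;

public class BraceSkipper {
    /**
     * Returns the index of the "}" that matches an already-consumed "{",
     * scanning from index start, or -1 if the input ends first.
     */
    public static int indexOfMatchingBrace(List<String> tokens, int start) {
        int nesting = 1;                        // one "{" is already open
        for (int i = start; i < tokens.size(); i++) {
            String tok = tokens.get(i);
            if (tok.equals("{")) nesting++;
            if (tok.equals("}")) {
                nesting--;
                if (nesting == 0) return i;     // stop before consuming the match
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> body = Arrays.asList("a", "{", "b", "}", "c", "}", "d");
        // prints 5
        System.out.println(indexOfMatchingBrace(body, 0));
    }
}
```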

42

Note:

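• Do not use a nonterminal defined by JAVACODE at a choice point without giving a LOOKAHEAD.
• Not OK (the JAVACODE nonterminal starts an alternative, so the parser cannot decide which branch to take):
  void NT() : {} { skip_to_matching_brace() | some_other_production() }
• OK (each alternative starts with a token):
  void NT() : {} { "{" skip_to_matching_brace() | "(" parameter_list() ")" }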

43

44

TOKEN_MANAGER_DECLS

token_manager_decls ::=

TOKEN_MGR_DECLS : java_block

• The token manager declarations start with the reserved word "TOKEN_MGR_DECLS", followed by a ":" and then a set of Java declarations and statements (the Java block).

• These declarations and statements are written into the generated token manager (MyParserTokenManager.java) and are accessible from within lexical actions.

• There can only be one token manager declaration in a JavaCC grammar file.

45

regular_expression_production

regular_expr_production ::= [ lexical_state_list ] regexpr_kind [ "[" IGNORE_CASE "]" ] ":" "{" regexpr_spec ( "|" regexpr_spec )* "}"

• regexpr_kind ::= TOKEN | SPECIAL_TOKEN | SKIP | MORE

• TOKEN is used to define normal tokens.
• SKIP is used to define skipped tokens (not passed on to the parser).
• MORE is used to define semi-tokens (i.e., only part of a token).
• SPECIAL_TOKEN sits between TOKEN and SKIP: it is passed on to the parser and is accessible to parser actions, but it is ignored by production rules (not counted as a token). Useful for representing comments.

46

lexical_state_list

lexical_state_list::=

  "<" "*" ">" | "<" java_identifier ( "," java_identifier )* ">"

• The lexical state list describes the set of lexical states for which the corresponding regular expression production applies.
• If this is written as "<*>", the regular expression production applies to all lexical states. Otherwise, it applies to all the lexical states in the identifier list within the angle brackets.
• If omitted, the DEFAULT lexical state is assumed.

47

regexpr_spec

regexpr_spec::=

regular_expression1 [ java_block ] [ : java_identifier ]

• Meaning:

• When regular_expression1 is matched:
– if java_block exists, execute it;
– if java_identifier appears, transition to that lexical state.

48

regular_expression

regular_expression ::=
  (1) java_string_literal
| (2) < [ [#] java_identifier : ] complex_regular_expression_choices >
| (3) <java_identifier>
| (4) <EOF>

• (1) java_string_literal is matched only by the string it denotes; used for unnamed tokens.
• (2) defines a labeled regular expression; if # occurs, the label is not visible outside the current TOKEN section.
• (3) <java_identifier> is a reference to another labeled regular_expression.
– used in bnf_production
• (4) <EOF> matches only at end of file.

49

Example

<DEFAULT, LEX_ST2> TOKEN [IGNORE_CASE] : {
  < FLOATING_POINT_LITERAL:
      (["0"-"9"])+ "." (["0"-"9"])* (<EXPONENT>)? (["f","F","d","D"])?
    | "." (["0"-"9"])+ (<EXPONENT>)? (["f","F","d","D"])?
    | (["0"-"9"])+ <EXPONENT> (["f","F","d","D"])?
    | (["0"-"9"])+ (<EXPONENT>)? ["f","F","d","D"] >
  { /* do something */ } : LEX_ST1
| < #EXPONENT: ["e","E"] (["+","-"])? (["0"-"9"])+ >
}

• Note: if # is omitted, E123 will be recognized erroneously as a token of kind EXPONENT.

50

Structure of complex_regular_expression

• complex_regular_expression_choices ::= complex_regular_expression ( | complex_regular_expression )*
• complex_regular_expression ::= ( complex_regular_expression_unit )*
• complex_regular_expression_unit ::= java_string_literal | < java_identifier > | character_list | ( complex_regular_expression_choices ) [ + | * | ? ]

• Note:
– units are combined by juxtaposition (concatenation) into a complex_regular_expression;
– complex_regular_expressions are combined by | (choice) into complex_regular_expression_choices;
– a parenthesized complex_regular_expression_choices, optionally followed by +, * or ?, is again a unit.

51

character_list

character_list ::=
  [ "~" ] "[" [ character_descriptor ( "," character_descriptor )* ] "]"

character_descriptor ::=
  java_string_literal [ "-" java_string_literal ]

java_string_literal ::= // reference to the Java grammar
  " singleCharString* "

note: java_string_literal here is restricted to length 1.

ex:
– ~["a","b"] --- all chars but a and b.
– ["a"-"f", "0"-"9", "A","B","C","D","E","F"] --- hexadecimal digit.
– ["a","b"]+ is not a regular_expression_unit. Why?
• It should be written ( ["a","b"] )+ instead.

52

bnf_production

• bnf_production ::=
    java_return_type java_identifier "(" java_parameter_list ")" ":"
    java_block
    "{" expansion_choices "}"

• expansion_choices ::= expansion ( "|" expansion )*
• expansion ::= ( expansion_unit )*

53

expansion_unit

• expansion_unit ::=
  (1) local_lookahead
| (2) java_block
| (3) "(" expansion_choices ")" [ "+" | "*" | "?" ]
| (4) "[" expansion_choices "]"
| (5) [ java_assignment_lhs "=" ] regular_expression
| (6) [ java_assignment_lhs "=" ] java_identifier "(" java_expression_list ")"

Notes:
– 1 is for lookahead.
– 2 is for semantic actions.
– 4 is the same as ( … )?
– 5 is for matching a token.
– 6 is for matching another nonterminal.

54

lookahead

• local_lookahead ::= "LOOKAHEAD" "(" [ java_integer_literal ] [ "," ] [ expansion_choices ] [ "," ] [ "{" java_expression "}" ] ")"

• Notes:
– 3 components: maximum number of lookahead tokens + syntax + semantics
– examples:
  – LOOKAHEAD(3)
  – LOOKAHEAD(5, Expr() <INT> | <REAL> , { true } )
• More on LOOKAHEAD
– see the minitutorial

55

JavaCC API

• Nonterminals in the input grammar
– If NT is a nonterminal, then the method
    returntype NT(parameters) throws ParseException;
  is generated in the parser class.

• API for parser actions
– Token token;
  – this variable always holds the last token and can be used in parser actions.
  – it is exactly the same as the token returned by getToken(0).
– Two other methods, getToken(int i) and getNextToken(), can also be used in actions to traverse the token list.

56

Token class

• public int kind;
– 0 for <EOF>
• public int beginLine, beginColumn, endLine, endColumn;
• public String image;
• public Token next;
• public Token specialToken;
• public String toString() { return image; }
• public static final Token newToken(int ofKind)

57

Error reporting and recovery

• It is not user friendly to throw an exception and exit the parsing once encountering a syntax error.

• Two exceptions
– ParseException: can be recovered from
– TokenMgrError: not expected to be recovered from
• Error reporting
– modify ParseException.java or TokenMgrError.java
– the generateParseException method can always be invoked in a parser action to report an error

58

Error Recovery in JavaCC:

• Two kinds: shallow error recovery and deep error recovery

• Shallow error recovery
– Ex:
  void Stm() : {} { IfStm() | WhileStm() }
– if getToken(1) is neither "if" nor "while" => shallow error

59

Shallow recovery

A shallow error can be recovered from by adding an extra choice:

void Stm() : {} { IfStm() | WhileStm() | error_skipto(SEMICOLON) }

where

JAVACODE
void error_skipto(int kind) {
  ParseException e = generateParseException();  // generate the exception object
  System.out.println(e.toString());             // print the error message
  Token t;
  do {
    t = getNextToken();
  } while (t.kind != kind);
}

60

Deep Error Recovery

• Same example:
  void Stm() : {} { IfStm() | WhileStm() }
• But this time the error occurs during parsing inside IfStm() or WhileStm(), instead of at the lookahead entry.
• The approach: use the Java try-catch construct.

  void Stm() : {} {
    try {
      ( IfStm() | WhileStm() )
    } catch (ParseException e) {
      error_skipto(SEMICOLON);
    }
  }

  note: this is a new piece of syntax for a JavaCC bnf_production.
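The same try/catch-and-skip idea can be tried out without JavaCC. The sketch below is a hand-written parser over a list of strings for a toy statement form ("if" or "while", then "id", then ";"); on an error anywhere inside a statement it records the message, skips past the next ";" and carries on. RecoveringParser, SyntaxError and the toy grammar are assumptions for illustration and are not part of the JavaCC API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RecoveringParser {
    /** Plays the role of ParseException in this sketch. */
    public static class SyntaxError extends Exception {
        public SyntaxError(String message) { super(message); }
    }

    private final List<String> tokens;
    private int pos = 0;
    public final List<String> errors = new ArrayList<>();
    public int parsed = 0;

    public RecoveringParser(List<String> tokens) { this.tokens = tokens; }

    private String peek() { return pos < tokens.size() ? tokens.get(pos) : "<EOF>"; }

    // The offending token is left unconsumed when the check fails.
    private void expect(String wanted) throws SyntaxError {
        if (!peek().equals(wanted)) {
            throw new SyntaxError("expected " + wanted + " but found " + peek());
        }
        pos++;
    }

    // Stm: ( "if" | "while" ) "id" ";"
    private void stm() throws SyntaxError {
        if (!peek().equals("if") && !peek().equals("while")) {
            throw new SyntaxError("expected if or while but found " + peek());
        }
        pos++;
        expect("id");
        expect(";");
    }

    // Counterpart of error_skipto: discard tokens up to and including the next one of this kind.
    private void skipTo(String kind) {
        while (pos < tokens.size() && !tokens.get(pos).equals(kind)) pos++;
        if (pos < tokens.size()) pos++;
    }

    public void program() {
        while (pos < tokens.size()) {
            try {
                stm();
                parsed++;
            } catch (SyntaxError e) {       // deep recovery: the error may come from any depth
                errors.add(e.getMessage());
                skipTo(";");
            }
        }
    }

    public static void main(String[] args) {
        RecoveringParser p = new RecoveringParser(
            Arrays.asList("if", "id", ";", "while", "while", ";", "if", "id", ";"));
        p.program();
        // prints 2 [expected id but found while]
        System.out.println(p.parsed + " " + p.errors);
    }
}
```

Because skipTo always consumes the synchronizing token when one exists, every failed statement advances the input by at least one token, so the loop in program() terminates.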
