Combinator parsing

Combinator ParsingBy Swanand Pagnis

Higher-Order Functions for ParsingBy Graham Hutton

• Abstract & Introduction

• Build a parser, one fn at a time

• Moving beyond toy parsers

Abstract

In combinator parsing, the text of parsers resembles BNF notation. We present the basic method, and a number of extensions. We address the special problems presented by whitespace, and parsers with separate lexical and syntactic phases. In particular, a combining form for handling the “offside rule” is given. Other extensions to the basic method include an “into” combining form with many useful applications, and a simple means by which combinator parsers can produce more informative error messages.

• Combinators that resemble BNF notation

• Whitespace handling through "Offside Rule"

• "Into" combining form for advanced parsing

• Strategy for better error messages

Introduction

Primitive Parsers

• Take input

• Process one character

• Return results and unused input

Combinators

• Combine primitives

• Define building blocks


Lexical analysis and syntax

• Combine the combinators

• Define lexical elements


input: "from:swiggy to:me" output: [("f", "rom:swiggy to:me")]

input: "42 !=> ans" output: [("4", "2 !=> ans")]

rule: 'a' followed by 'b' input: "abcdef" output: [(('a','b'),"cdef")]

rule: 'a' followed by 'b' input: "abcdef" output: [(('a','b'),"cdef")]

Combinator

Language choice

Suggested: Lazy Functional Languages

Miranda: Author's choice

Haskell: An obvious choice. 🤓

Racket: Another obvious choice. 🤓

Ruby: 🍎 to 🍊 so $ for learning

OCaml: Functional, but not lazy.

Haskell %

Simple when stick to fundamental FP

• Higher order functions

• Immutability

• Recursive problem solving

• Algebraic types

Let's build a parser, one fn at a time

type Parser a b = [a] !-> [(b, [a])]

Types help with abstraction

• We'll be dealing with parsers and combinators

• Parsers are functions, they accept input and return results

• Combinators accept parsers and return parsers

A parser is a function that accepts an input and returns parsed results and the unused input for each result

Parser is a function type that accepts a list of type a and returns all possible results as a list of tuples of type (b, [a])

(Parser Char Number) input: "42 it is!" !-- a is a [Char] output: [(42, " it is!")] !-- b is a Number

type Parser a b = [a] !-> [(b, [a])]

Primitive Parsers

succeed !:: b !-> Parser a b succeed v inp = [(v, inp)]

Always succeeds Returns "v" for all inputs

failure !:: Parser a b failure inp = []

Always fails Returns "[]" for all inputs

satisfy !:: (a !-> Bool) !-> Parser a a satisfy p [] = failure [] satisfy p (x:xs) | p x = succeed x xs !-- if p(x) is true | otherwise = failure []


Guard Clauses, if you want to Google

literal !:: Eq a !=> a !-> Parser a a literal x = satisfy (!== x)

match_3 = (literal '3') match_3 "345" !-- !=> [('3',"45")] match_3 "456" !-- !=> []

succeed failure satisfy literal

Combinators

match_3_or_4 = match_3 àlt` match_4 match_3_or_4 "345" !-- !=> [('3',"45")] match_3_or_4 "456" !-- !=> [('4',"56")]

alt !:: Parser a b !-> Parser a b !-> Parser a b (p1 àlt` p2) inp = p1 inp !++ p2 inp

(p1 àlt` p2) inp = p1 inp !++ p2 inpList concatenation

(match_3 ànd_then` match_4) "345" # !=> [(('3','4'),"5")]

🐉

and_then !:: Parser a b !-> Parser a c !-> Parser a (b, c) (p1 ànd_then` p2) inp = [ ((v1, v2), out2) | (v1, out1) !<- p1 inp, (v2, out2) !<- p2 out1 ]

and_then !:: Parser a b !-> Parser a c !-> Parser a (b, c) (p1 ànd_then` p2) inp = [ ((v1, v2), out2) | (v1, out1) !<- p1 inp, (v2, out2) !<- p2 out1 ]

List comprehensions

(v11, out11) (v12, out12) (v13, out13)

…

(v21, out21) (v22, out22)

…

(v31, out31) (v32, out32)

…

(v31, out31)

p1

p2

((v11, v21), out21) ((v11, v22), out22)

…

(match_3 ànd_then` match_4) "345" # !=> [(('3','4'),"5")]

Manipulating values

match_3 = (literal '3') match_3 "345" !-- !=> [('3',"45")] match_3 "456" !-- !=> []

(number "42") "42 !=> answer" # !=> [(42, " answer")]

(keyword "for") "for i in 1!..42" # !=> [(:for, " i in 1!..42")]

using !:: Parser a b !-> (b !-> c) !-> Parser a c (p ùsing` f) inp = [(f v, out) | (v, out) !<- p inp ]

((string "3") ùsing` float) "3" # !=> [(3.0, "")]

Levelling up

many !:: Parser a b !-> Parser a [b] many p = ((p ànd_then` many p) ùsing` cons) àlt` (succeed [])

0 or many

(many (literal 'a')) "aab" !=> [("aa","b"),("a","ab"),("","aab")]

(many (literal 'a')) "xyz" !=> [("","xyz")]

some !:: Parser a b !-> Parser a [b] some p = ((p ànd_then` many p) ùsing` cons)

1 or many

(some (literal 'a')) "aab" !=> [("aa","b"),("a","ab")]

(some (literal 'a')) "xyz" !=> []

positive_integer = some (satisfy Data.Char.isDigit)

negative_integer = ((literal '-') ànd_then` positive_integer) ùsing` cons

positive_decimal = (positive_integer ànd_then` (((literal '.') ànd_then` positive_integer) ùsing` cons)) ùsing` join

negative_decimal = ((literal '-') ànd_then` positive_decimal) ùsing` cons

number !:: Parser Char [Char] number = negative_decimal àlt` positive_decimal àlt` negative_integer àlt` positive_integer

word !:: Parser Char [Char] word = some (satisfy isLetter)

string !:: (Eq a) !=> [a] !-> Parser a [a] string [] = succeed [] string (x:xs) = (literal x ànd_then` string xs) ùsing` cons

(string "begin") "begin end" # !=> [("begin"," end")]

xthen !:: Parser a b !-> Parser a c !-> Parser a c p1 `xthen` p2 = (p1 ànd_then` p2) ùsing` snd

thenx !:: Parser a b !-> Parser a c !-> Parser a b p1 `thenx` p2 = (p1 ànd_then` p2) ùsing` fst

ret !:: Parser a b !-> c !-> Parser a c p `ret` v = p ùsing` (const v)

succeed, failure, satisfy, literal, alt, and_then, using, string, many, some, string, word, number, xthen, thenx, ret

Expression Parser & Evaluator

data Expr = Const Double | Expr Àdd` Expr | Expr `Sub` Expr | Expr `Mul` Expr | Expr `Div` Expr

(Const 3) `Mul` ((Const 6) Àdd` (Const 1))) # !=> "3*(6+1)"

parse "3*(6+1)" # !=> (Const 3) `Mul` ((Const 6) Àdd` (Const 1)))

(Const 3) Mul ((Const 6) Àdd` (Const 1))) # !=> 21

BNF Notation

expn !::= expn + expn | expn − expn | expn ∗ expn | expn / expn | digit+ | (expn)

Improving a little:

expn !::= term + term | term − term | term term !::= factor ∗ factor | factor / factor | factor factor !::= digit+ | (expn)

Parsers that resemble BNF

addition = ((term ànd_then` ((literal '+') `xthen` term)) ùsing` plus)

subtraction = ((term ànd_then` ((literal '-') `xthen` term)) ùsing` minus)

multiplication = ((factor ànd_then` ((literal '*') `xthen` factor)) ùsing` times)

division = ((factor ànd_then` ((literal '/') `xthen` factor)) ùsing` divide)

parenthesised_expression = ((nibble (literal '(')) `xthen` ((nibble expn) `thenx`(nibble (literal ')'))))

value xs = Const (numval xs) plus (x,y) = x Àdd` y minus (x,y) = x `Sub` y times (x,y) = x `Mul` y divide (x,y) = x `Div` y

expn = addition àlt` subtraction àlt` term

term = multiplication àlt` division àlt` factor

factor = (number ùsing` value) àlt` parenthesised_expn

expn "12*(5+(7-2))" # !=> [ (Const 12.0 `Mul` (Const 5.0 Àdd` (Const 7.0 `Sub` Const 2.0)),""), … ]

value xs = Const (numval xs) plus (x,y) = x Àdd` y minus (x,y) = x `Sub` y times (x,y) = x `Mul` y divide (x,y) = x `Div` y

value = numval plus (x,y) = x + y minus (x,y) = x - y times (x,y) = x * y divide (x,y) = x / y

expn "12*(5+(7-2))" # !=> [(120.0,""), (12.0,"*(5+(7-2))"), (1.0,"2*(5+(7-2))")]

expn "(12+1)*(5+(7-2))" # !=> [(130.0,""), (13.0,"*(5+(7-2))")]

Moving beyond toy parsers

Whitespace? 🤔 (

white = (literal " ") àlt` (literal "\t") àlt` (literal "\n")

white = many (any literal " \t\n")

/\s!*/

any p = foldr (alt.p) fail

any p [x1,x2,!!...,xn] = (p x1) àlt` (p x2) àlt` !!... àlt` (p xn)

white = many (any literal " \t\n")

nibble p = white `xthen` (p `thenx` white)

The parser (nibble p) has the same behaviour as parser p, except that it eats up any white-space in the input string before or afterwards

(nibble (literal 'a')) " a " # !=> [('a',""),('a'," "),('a'," ")]

symbol = nibble.string

symbol "$fold" " $fold " # !=> [("$fold", ""), ("$fold", " ")]

The Offside Rule

w = x + y where x = 10 y = 15 - 5 z = w * 2

w = x + y where x = 10 y = 15 - 5 z = w * 2

When obeying the offside rule, every token must lie either directly below, or to the right of its first token

i.e. A weak indentation policy

The Offside Combinator

type Pos a = (a, (Integer, Integer))

prelex "3 + \n 2 * (4 + 5)" # !=> [('3',(0,0)), ('+',(0,2)), ('2',(1,2)), ('*',(1,4)), … ]


satisfy !:: (a !-> Bool) !-> Parser (Pos a) a satisfy p [] = failure [] satisfy p (x:xs) | p a = succeed a xs !-- if p(a) is true | otherwise = failure [] where (a, (r, c)) = x

satisfy !:: (a !-> Bool) !-> Parser (Pos a) a satisfy p [] = failure [] satisfy p (x:xs) | p a = succeed a xs !-- if p(a) is true | otherwise = failure [] where (a, (r, c)) = x

offside !:: Parser (Pos a) b !-> Parser (Pos a) b offside p inp = [(v, inpOFF) | (v, []) !<- (p inpON)] where inpON = takeWhile (onside (head inp)) inp inpOFF = drop (length inpON) inp onside (a, (r, c)) (b, (r', c')) = r' !>= r !&& c' !>= c

offside !:: Parser (Pos a) b !-> Parser (Pos a) b

offside !:: Parser (Pos a) b !-> Parser (Pos a) b offside p inp = [(v, inpOFF) | (v, []) !<- (p inpON)]

(nibble (literal 'a')) " a " # !=> [('a',""),('a'," "),('a'," ")]

offside !:: Parser (Pos a) b !-> Parser (Pos a) b offside p inp = [(v, inpOFF) | (v, []) !<- (p inpON)]

offside !:: Parser (Pos a) b !-> Parser (Pos a) b offside p inp = [(v, inpOFF) | (v, []) !<- (p inpON)] where inpON = takeWhile (onside (head inp)) inp

offside !:: Parser (Pos a) b !-> Parser (Pos a) b offside p inp = [(v, inpOFF) | (v, []) !<- (p inpON)] where inpON = takeWhile (onside (head inp)) inp inpOFF = drop (length inpON) inp

offside !:: Parser (Pos a) b !-> Parser (Pos a) b offside p inp = [(v, inpOFF) | (v, []) !<- (p inpON)] where inpON = takeWhile (onside (head inp)) inp inpOFF = drop (length inpON) inp onside (a, (r, c)) (b, (r', c')) = r' !>= r !&& c' !>= c

(3 + 2 * (4 + 5)) + (8 * 10)

(3 + 2 * (4 + 5)) + (8 * 10)

(offside expn) (prelex inp_1) # !=> [(21.0,[('+',(2,0)),('(',(2,2)),('8',(2,3)),('*',(2,5)),('1',(2,7)),('0',(2,8)),(')',(2,9))])]

(offside expn) (prelex inp_2) # !=> [(101.0,[])]

Quick recap before we 🛫

∅ !|> succeed, fail !|> satisfy, literal !|> alt, and_then, using !|> many, some !|> string, thenx, xthen, return !|> expression parser & evaluator !|> any, nibble, symbol !|> prelex, offside

Practical parsers

🎯Syntactical analysis 🎯Lexical analysis 🎯Parse trees

type Parser a b = [a] !-> [(b, [a])] type Pos a = (a, (Integer, Integer))

data Tag = Ident | Number | Symbol | Junk deriving (Show, Eq) type Token = (Tag, [Char])

(Symbol, "if") (Number, "123")

Parse the string with parser p, & apply token t to the result

(p `tok` t) inp = [ (((t, xs), (r, c)), out) | (xs, out) !<- p inp] where (x, (r,c)) = head inp

(p `tok` t) inp = [ ((<token>,<pos>),<unused input>) | (xs, out) !<- p inp] where (x, (r,c)) = head inp

(p `tok` t) inp = [ (((t, xs), (r, c)), out) | (xs, out) !<- p inp] where (x, (r,c)) = head inp

((string "where") `tok` Symbol) inp # !=> ((Symbol,"where"), (r, c))

many ((p1 `tok` t1) àlt` (p2 `tok` t2) àlt` !!... àlt` (pn `tok` tn))

[(p1, t1), (p2, t2), …, (pn, tn)]

lex = many.(foldr op failure) where (p, t) òp` xs = (p `tok` t) àlt` xs

🐉

lex = many.(foldr op failure) where (p, t) òp` xs = (p `tok` t) àlt` xs

# Rightmost computation cn = (pn `tok` tn) àlt` failure

# Followed by (pn-1 `tok` tn-1) àlt` cn

many ((p1 `tok` t1) àlt` (p2 `tok` t2) àlt` !!... àlt` (pn `tok` tn))

lexer = lex [ ((some (any_of literal " \n\t")), Junk), ((string "where"), Symbol), (word, Ident), (number, Number), ((any_of string ["(", ")", "="]), Symbol)]



head (lexer (prelex "where x = 10")) # !=> ([((Symbol,"where"),(0,0)), ((Ident,"x"),(0,6)), ((Symbol,"="),(0,8)), ((Number,"10"),(0,10)) ],[])

(head.lexer.prelex) "where x = 10" # !=> ([((Symbol,"where"),(0,0)), ((Ident,"x"),(0,6)), ((Symbol,"="),(0,8)), ((Number,"10"),(0,10)) ],[])

(head.lexer.prelex) "where x = 10" # !=> ([((Symbol,"where"),(0,0)), ((Ident,"x"),(0,6)), ((Symbol,"="),(0,8)), ((Number,"10"),(0,10)) ],[])

Function composition

length ((lexer.prelex) "where x = 10") # !=> 198

Conflicts? Ambiguity?

In this case, "where" is a source of conflict. It can be a symbol, or identifier.

lexer = lex [ {- 1 -} ((some (any_of literal " \n\t")), Junk), {- 2 -} ((string "where"), Symbol), {- 3 -} (word, Ident), {- 4 -} (number, Number), {- 5 -} ((any_of string ["(",")","="]), Symbol)]

Higher priority, higher precedence

Removing Junk

strip !:: [(Pos Token)] !-> [(Pos Token)] strip = filter ((!!= Junk).fst.fst)

((!!= Junk).fst.fst) ((Symbol,"where"),(0,0)) # !=> True ((!!= Junk).fst.fst) ((Junk,"where"),(0,0)) # !=> False

(fst.head.lexer.prelex) "where x = 10" # !=> [((Symbol,"where"),(0,0)), ((Junk," "),(0,5)), ((Ident,"x"),(0,6)), ((Junk," "),(0,7)), ((Symbol,"="),(0,8)), ((Junk," "),(0,9)), ((Number,"10"),(0,10))]

(strip.fst.head.lexer.prelex) "where x = 10" # !=> [((Symbol,"where"),(0,0)), ((Ident,"x"),(0,6)), ((Symbol,"="),(0,8)), ((Number,"10"),(0,10))]

Syntax Analysis

characters !|> lexical analysis !|> tokens

tokens !|> syntax analysis !|> parse trees

f x y = add a b where a = 25 b = sub x y

answer = mult (f 3 7) 5



Script



Definition



Body



Expression



Definition



Primitives

data Script = Script [Def] data Def = Def Var [Var] Expn data Expn = Var Var | Num Double | Expn Àpply` Expn | Expn `Where` [Def] type Var = [Char]

prog = (many defn) ùsing` Script

defn = ( (some (kind Ident)) ànd_then` ((lit "=") `xthen` (offside body))) ùsing` defnFN

body = ( expr ànd_then` (((lit "where") `xthen` (some defn)) òpt` [])) ùsing` bodyFN

expr = (some prim) ùsing` (foldl1 Apply)

prim = ((kind Ident) ùsing` Var) àlt` ((kind Number) ùsing` numFN) àlt` ((lit "(") `xthen` (expr `thenx` (lit ")")))

!-- only allow a kind of tag kind !:: Tag !-> Parser (Pos Token) [Char] kind t = (satisfy ((!== t).fst)) ùsing` snd

— only allow a given symbol lit !:: [Char] !-> Parser (Pos Token) [Char] lit xs = (literal (Symbol, xs)) ùsing` snd

prog = (many defn) ùsing` Script

defn = ( (some (kind Ident)) ànd_then` ((lit "=") `xthen` (offside body))) ùsing` defnFN

body = ( expr ànd_then` (((lit "where") `xthen` (some defn)) òpt` [])) ùsing` bodyFN



data Script = Script [Def] data Def = Def Var [Var] Expn data Expn = Var Var | Num Double | Expn Àpply` Expn | Expn `Where` [Def] type Var = [Char]

Orange functions are for transforming values.

Use data constructors to generate parse trees

Use evaluation functions to evaluate and generate a value



Script [ Def "f" ["x","y"] ( ((Var "add" Àpply` Var "a") Àpply` Var "b") `Where` [ Def "a" [] (Num 25.0), Def "b" [] ((Var "sub" Àpply` Var "x") Àpply` Var "y")]), Def "answer" [] ( (Var "mult" Àpply` ( (Var "f" Àpply` Num 3.0) Àpply` Num 7.0)) Àpply` Num 5.0)]

Strategy for writing parsers

1. Identify components i.e. Lexical elements


2. Structure these elements a.k.a. syntax

defn = ((some (kind Ident)) ànd_then` ((lit "=") `xthen` (offside body))) ùsing` defnFN

body = (expr ànd_then` (((lit "where") `xthen` (some defn)) òpt` [])) ùsing` bodyFN



3. BNF notation is very helpful

4. TDD in the absence of types

Where to, next?

Monadic ParsersGraham Hutton, Eric Meijer

Introduction to FPPhilip Wadler

The Dragon BookIf your interest is in compilers

Libraries?

Haskell: Parsec, MegaParsec. ✨ OCaml: Angstrom. ✨ 🚀 Ruby: rparsec, or roll you own Elixir: Combine, ExParsec Python: Parsec. ✨

Thank you!

Twitter: @_swanand GitHub: @swanandp

Technology

Combinator parsing