aboutsummaryrefslogtreecommitdiffstats
path: root/doc/lispref/peg.texi
diff options
context:
space:
mode:
Diffstat (limited to 'doc/lispref/peg.texi')
-rw-r--r--doc/lispref/peg.texi202
1 files changed, 118 insertions, 84 deletions
diff --git a/doc/lispref/peg.texi b/doc/lispref/peg.texi
index ef4dfa7653e..fbf57852ee0 100644
--- a/doc/lispref/peg.texi
+++ b/doc/lispref/peg.texi
@@ -7,29 +7,34 @@
7@chapter Parsing Expression Grammars 7@chapter Parsing Expression Grammars
8@cindex text parsing 8@cindex text parsing
9@cindex parsing expression grammar 9@cindex parsing expression grammar
10@cindex PEG
10 11
11 Emacs Lisp provides several tools for parsing and matching text, 12 Emacs Lisp provides several tools for parsing and matching text,
12from regular expressions (@pxref{Regular Expressions}) to full 13from regular expressions (@pxref{Regular Expressions}) to full
13@acronym{LL} grammar parsers (@pxref{Top,, Bovine parser 14left-to-right (a.k.a.@: @acronym{LL}) grammar parsers (@pxref{Top,,
14development,bovine}). @dfn{Parsing Expression Grammars} 15Bovine parser development,bovine}). @dfn{Parsing Expression Grammars}
15(@acronym{PEG}) are another approach to text parsing that offer more 16(@acronym{PEG}) are another approach to text parsing that offer more
16structure and composibility than regular expressions, but less 17structure and composibility than regular expressions, but less
17complexity than context-free grammars. 18complexity than context-free grammars.
18 19
19A @acronym{PEG} parser is defined as a list of named rules, each of 20A Parsing Expression Grammar (@acronym{PEG}) describes a formal language
20which matches text patterns, and/or contains references to other 21in terms of a set of rules for recognizing strings in the language. In
22Emacs, a @acronym{PEG} parser is defined as a list of named rules, each
23of which matches text patterns and/or contains references to other
21rules. Parsing is initiated with the function @code{peg-run} or the 24rules. Parsing is initiated with the function @code{peg-run} or the
22macro @code{peg-parse} (see below), and parses text after point in the 25macro @code{peg-parse} (see below), and parses text after point in the
23current buffer, using a given set of rules. 26current buffer, using a given set of rules.
24 27
25@cindex parsing expression 28@cindex parsing expression
26The definition of each rule is referred to as a @dfn{parsing 29@cindex root, of parsing expression grammar
27expression} (@acronym{PEX}), and can consist of a literal string, a 30@cindex entry-point, of parsing expression grammar
28regexp-like character range or set, a peg-specific construct 31Each rule in a @acronym{PEG} is referred to as a @dfn{parsing
29resembling an elisp function call, a reference to another rule, or a 32expression} (@acronym{PEX}), and can be specified a a literal string, a
30combination of any of these. A grammar is expressed as a tree of 33regexp-like character range or set, a peg-specific construct resembling
31rules in which one rule is typically treated as a ``root'' or 34an Emacs Lisp function call, a reference to another rule, or a
32``entry-point'' rule. For instance: 35combination of any of these. A grammar is expressed as a tree of rules
36in which one rule is typically treated as a ``root'' or ``entry-point''
37rule. For instance:
33 38
34@example 39@example
35@group 40@group
@@ -56,14 +61,17 @@ first rule is considered the ``entry-point'':
56@end group 61@end group
57@end example 62@end example
58 63
59This macro represents the simplest use of the @acronym{PEG} library, 64@c FIXME: These two should be formally defined using @defmac and @defun.
60but also the least flexible, as the rules must be written directly 65@findex with-peg-rules
61into the source code. A more flexible approach involves use of three 66@findex peg-run
62macros in conjunction: @code{with-peg-rules}, a @code{let}-like 67The @code{peg-parse} macro represents the simplest use of the
63construct that makes a set of rules available within the macro body; 68@acronym{PEG} library, but also the least flexible, as the rules must be
64@code{peg-run}, which initiates parsing given a single rule; and 69written directly into the source code. A more flexible approach
65@code{peg}, which is used to wrap the entry-point rule name. In fact, 70involves use of three macros in conjunction: @code{with-peg-rules}, a
66a call to @code{peg-parse} expands to just this set of calls. The 71@code{let}-like construct that makes a set of rules available within the
72macro body; @code{peg-run}, which initiates parsing given a single rule;
73and @code{peg}, which is used to wrap the entry-point rule name. In
74fact, a call to @code{peg-parse} expands to just this set of calls. The
67above example could be written as: 75above example could be written as:
68 76
69@example 77@example
@@ -79,33 +87,43 @@ above example could be written as:
79This allows more explicit control over the ``entry-point'' of parsing, 87This allows more explicit control over the ``entry-point'' of parsing,
80and allows the combination of rules from different sources. 88and allows the combination of rules from different sources.
81 89
90@c FIXME: Use @defmac.
91@findex define-peg-rule
82Individual rules can also be defined using a more @code{defun}-like 92Individual rules can also be defined using a more @code{defun}-like
83syntax, using the macro @code{define-peg-rule}: 93syntax, using the macro @code{define-peg-rule}:
84 94
85@example 95@example
96@group
86(define-peg-rule digit () 97(define-peg-rule digit ()
87 [0-9]) 98 [0-9])
99@end group
88@end example 100@end example
89 101
90This also allows for rules that accept an argument (supplied by the 102This also allows for rules that accept an argument (supplied by the
91@code{funcall} PEG rule). 103@code{funcall} PEG rule, @pxref{PEX Definitions}).
92 104
105@c FIXME: Use @defmac.
106@findex define-peg-ruleset
93Another possibility is to define a named set of rules with 107Another possibility is to define a named set of rules with
94@code{define-peg-ruleset}: 108@code{define-peg-ruleset}:
95 109
96@example 110@example
111@group
97(define-peg-ruleset number-grammar 112(define-peg-ruleset number-grammar
98 '((number sign digit (* digit)) 113 '((number sign digit (* digit))
99 digit ;; A reference to the definition above. 114 digit ;; A reference to the definition above.
100 (sign (or "+" "-" "")))) 115 (sign (or "+" "-" ""))))
116@end group
101@end example 117@end example
102 118
103Rules and rulesets defined this way can be referred to by name in 119Rules and rulesets defined this way can be referred to by name in
104later calls to @code{peg-run} or @code{with-peg-rules}: 120later calls to @code{peg-run} or @code{with-peg-rules}:
105 121
106@example 122@example
123@group
107(with-peg-rules number-grammar 124(with-peg-rules number-grammar
108 (peg-run (peg number))) 125 (peg-run (peg number)))
126@end group
109@end example 127@end example
110 128
111By default, calls to @code{peg-run} or @code{peg-parse} produce no 129By default, calls to @code{peg-run} or @code{peg-parse} produce no
@@ -125,11 +143,11 @@ act upon parsed strings, rules can include @dfn{actions}, see
125Parsing expressions can be defined using the following syntax: 143Parsing expressions can be defined using the following syntax:
126 144
127@table @code 145@table @code
128@item (and E1 E2 ...) 146@item (and @var{e1} @var{e2}@dots{})
129A sequence of @acronym{PEX}s that must all be matched. The @code{and} form is 147A sequence of @acronym{PEX}s that must all be matched. The @code{and}
130optional and implicit. 148form is optional and implicit.
131 149
132@item (or E1 E2 ...) 150@item (or @var{e1} @var{e2}@dots{})
133Prioritized choices, meaning that, as in Elisp, the choices are tried 151Prioritized choices, meaning that, as in Elisp, the choices are tried
134in order, and the first successful match is used. Note that this is 152in order, and the first successful match is used. Note that this is
135distinct from context-free grammars, in which selection between 153distinct from context-free grammars, in which selection between
@@ -141,43 +159,43 @@ Matches any single character, as the regexp ``.''.
141@item @var{string} 159@item @var{string}
142A literal string. 160A literal string.
143 161
144@item (char @var{C}) 162@item (char @var{c})
145A single character @var{C}, as an Elisp character literal. 163A single character @var{c}, as an Elisp character literal.
146 164
147@item (* @var{E}) 165@item (* @var{e})
148Zero or more instances of expression @var{E}, as the regexp @samp{*}. 166Zero or more instances of expression @var{e}, as the regexp @samp{*}.
149Matching is always ``greedy''. 167Matching is always ``greedy''.
150 168
151@item (+ @var{E}) 169@item (+ @var{e})
152One or more instances of expression @var{E}, as the regexp @samp{+}. 170One or more instances of expression @var{e}, as the regexp @samp{+}.
153Matching is always ``greedy''. 171Matching is always ``greedy''.
154 172
155@item (opt @var{E}) 173@item (opt @var{e})
156Zero or one instance of expression @var{E}, as the regexp @samp{?}. 174Zero or one instance of expression @var{e}, as the regexp @samp{?}.
157 175
158@item SYMBOL 176@item @var{symbol}
159A symbol representing a previously-defined PEG rule. 177A symbol representing a previously-defined PEG rule.
160 178
161@item (range CH1 CH2) 179@item (range @var{ch1} @var{ch2})
162The character range between CH1 and CH2, as the regexp @samp{[CH1-CH2]}. 180The character range between @var{ch1} and @var{ch2}, as the regexp
181@samp{[@var{ch1}-@var{ch2}]}.
163 182
164@item [CH1-CH2 "+*" ?x] 183@item [@var{ch1}-@var{ch2} "+*" ?x]
165A character set, which can include ranges, character literals, or 184A character set, which can include ranges, character literals, or
166strings of characters. 185strings of characters.
167 186
168@item [ascii cntrl] 187@item [ascii cntrl]
169A list of named character classes. 188A list of named character classes.
170 189
171@item (syntax-class @var{NAME}) 190@item (syntax-class @var{name})
172A single syntax class. 191A single syntax class.
173 192
174@item (funcall E ARGS...) 193@item (funcall @var{e} @var{args}@dots{})
175Call @acronym{PEX} E (previously defined with @code{define-peg-rule}) 194Call @acronym{PEX} @var{e} (previously defined with
176with arguments @var{ARGS}. 195@code{define-peg-rule}) with arguments @var{args}.
177 196
178@item (null) 197@item (null)
179The empty string. 198The empty string.
180
181@end table 199@end table
182 200
183The following expressions are used as anchors or tests -- they do not 201The following expressions are used as anchors or tests -- they do not
@@ -210,19 +228,19 @@ Beginning of symbol.
210@item (eos) 228@item (eos)
211End of symbol. 229End of symbol.
212 230
213@item (if E) 231@item (if @var{e})
214Returns non-@code{nil} if parsing @acronym{PEX} E from point succeeds (point 232Returns non-@code{nil} if parsing @acronym{PEX} @var{e} from point
215is not moved). 233succeeds (point is not moved).
216
217@item (not E)
218Returns non-@code{nil} if parsing @acronym{PEX} E from point fails (point
219is not moved).
220 234
221@item (guard EXP) 235@item (not @var{e})
222Treats the value of the Lisp expression EXP as a boolean. 236Returns non-@code{nil} if parsing @acronym{PEX} @var{e} from point fails
237(point is not moved).
223 238
239@item (guard @var{exp})
240Treats the value of the Lisp expression @var{exp} as a boolean.
224@end table 241@end table
225 242
243@c FIXME: peg-char-classes should be mentioned in the text below.
226@vindex peg-char-classes 244@vindex peg-char-classes
227Character class matching can use the same named character classes as 245Character class matching can use the same named character classes as
228in regular expressions (@pxref{Top,, Character Classes,elisp}) 246in regular expressions (@pxref{Top,, Character Classes,elisp})
@@ -234,12 +252,13 @@ in regular expressions (@pxref{Top,, Character Classes,elisp})
234@cindex parsing stack 252@cindex parsing stack
235By default the process of parsing simply moves point in the current 253By default the process of parsing simply moves point in the current
236buffer, ultimately returning @code{t} if the parsing succeeds, and 254buffer, ultimately returning @code{t} if the parsing succeeds, and
237@code{nil} if it doesn't. It's also possible to define ``actions'' 255@code{nil} if it doesn't. It's also possible to define @dfn{parsing
238that can run arbitrary Elisp at certain points in the parsed text. 256actions} that can run arbitrary Elisp at certain points in the parsed
239These actions can optionally affect something called the @dfn{parsing 257text. These actions can optionally affect something called the
240stack}, which is a list of values returned by the parsing process. 258@dfn{parsing stack}, which is a list of values returned by the parsing
241These actions only run (and only return values) if the parsing process 259process. These actions only run (and only return values) if the parsing
242ultimately succeeds; if it fails the action code is not run at all. 260process ultimately succeeds; if it fails the action code is not run at
261all.
243 262
244Actions can be added anywhere in the definition of a rule. They are 263Actions can be added anywhere in the definition of a rule. They are
245distinguished from parsing expressions by an initial backquote 264distinguished from parsing expressions by an initial backquote
@@ -247,12 +266,13 @@ distinguished from parsing expressions by an initial backquote
247of hyphens (@samp{--}) somewhere within it. Symbols to the left of 266of hyphens (@samp{--}) somewhere within it. Symbols to the left of
248the hyphens are bound to values popped from the stack (they are 267the hyphens are bound to values popped from the stack (they are
249somewhat analogous to the argument list of a lambda form). Values 268somewhat analogous to the argument list of a lambda form). Values
250produced by code to the right are pushed to the stack (analogous to 269produced by code to the right of the hyphens are pushed onto the stack
251the return value of the lambda). For instance, the previous grammar 270(analogous to the return value of the lambda). For instance, the
252can be augmented with actions to return the parsed number as an actual 271previous grammar can be augmented with actions to return the parsed
253integer: 272number as an actual integer:
254 273
255@example 274@example
275@group
256(with-peg-rules ((number sign digit (* digit 276(with-peg-rules ((number sign digit (* digit
257 `(a b -- (+ (* a 10) b))) 277 `(a b -- (+ (* a 10) b)))
258 `(sign val -- (* sign val))) 278 `(sign val -- (* sign val)))
@@ -261,6 +281,7 @@ integer:
261 (and "" `(-- 1)))) 281 (and "" `(-- 1))))
262 (digit [0-9] `(-- (- (char-before) ?0)))) 282 (digit [0-9] `(-- (- (char-before) ?0))))
263 (peg-run (peg number))) 283 (peg-run (peg number)))
284@end group
264@end example 285@end example
265 286
266There must be values on the stack before they can be popped and 287There must be values on the stack before they can be popped and
@@ -271,43 +292,53 @@ only left-hand terms will consume (and discard) values from the stack.
271At the end of parsing, stack values are returned as a flat list. 292At the end of parsing, stack values are returned as a flat list.
272 293
273To return the string matched by a @acronym{PEX} (instead of simply 294To return the string matched by a @acronym{PEX} (instead of simply
274moving point over it), a rule like this can be used: 295moving point over it), a grammar can use a rule like this:
275 296
276@example 297@example
298@group
277(one-word 299(one-word
278 `(-- (point)) 300 `(-- (point))
279 (+ [word]) 301 (+ [word])
280 `(start -- (buffer-substring start (point)))) 302 `(start -- (buffer-substring start (point))))
303@end group
281@end example 304@end example
282 305
283The first action pushes the initial value of point to the stack. The 306@noindent
284intervening @acronym{PEX} moves point over the next word. The second 307The first action above pushes the initial value of point to the stack.
285action pops the previous value from the stack (binding it to the 308The intervening @acronym{PEX} moves point over the next word. The
286variable @code{start}), and uses that value to extract a substring 309second action pops the previous value from the stack (binding it to the
287from the buffer and push it to the stack. This pattern is so common 310variable @code{start}), then uses that value to extract a substring from
288that @acronym{PEG} provides a shorthand function that does exactly the 311the buffer and push it to the stack. This pattern is so common that
289above, along with a few other shorthands for common scenarios: 312@acronym{PEG} provides a shorthand function that does exactly the above,
313along with a few other shorthands for common scenarios:
290 314
291@table @code 315@table @code
292@item (substring @var{E}) 316@findex substring (a PEG shorthand)
293Match @acronym{PEX} @var{E} and push the matched string to the stack. 317@item (substring @var{e})
294 318Match @acronym{PEX} @var{e} and push the matched string onto the stack.
295@item (region @var{E}) 319
296Match @var{E} and push the start and end positions of the matched 320@findex region (a PEG shorthand)
297region to the stack. 321@item (region @var{e})
298 322Match @var{e} and push the start and end positions of the matched
299@item (replace @var{E} @var{replacement}) 323region onto the stack.
300Match @var{E} and replaced the matched region with the string @var{replacement}. 324
301 325@findex replace (a PEG shorthand)
302@item (list @var{E}) 326@item (replace @var{e} @var{replacement})
303Match @var{E}, collect all values produced by @var{E} (and its 327Match @var{e} and replaced the matched region with the string
304sub-expressions) into a list, and push that list to the stack. Stack 328@var{replacement}.
329
330@findex list (a PEG shorthand)
331@item (list @var{e})
332Match @var{e}, collect all values produced by @var{e} (and its
333sub-expressions) into a list, and push that list onto the stack. Stack
305values are typically returned as a flat list; this is a way of 334values are typically returned as a flat list; this is a way of
306``grouping'' values together. 335``grouping'' values together.
307@end table 336@end table
308 337
309@node Writing PEG Rules 338@node Writing PEG Rules
310@section Writing PEG Rules 339@section Writing PEG Rules
340@cindex PEG rules, pitfalls
341@cindex Parsing Expression Grammar, pitfalls in rules
311 342
312Something to be aware of when writing PEG rules is that they are 343Something to be aware of when writing PEG rules is that they are
313greedy. Rules which can consume a variable amount of text will always 344greedy. Rules which can consume a variable amount of text will always
@@ -319,9 +350,10 @@ backtracking. For instance, this rule will never succeed:
319(forest (+ "tree" (* [blank])) "tree" (eol)) 350(forest (+ "tree" (* [blank])) "tree" (eol))
320@end example 351@end example
321 352
322The @acronym{PEX} @code{(+ "tree" (* [blank]))} will consume all 353@noindent
323repetitions of the word ``tree'', leaving none to match the final 354The @acronym{PEX} @w{@code{(+ "tree" (* [blank]))}} will consume all
324@code{"tree"}. 355the repetitions of the word @samp{tree}, leaving none to match the final
356@samp{tree}.
325 357
326In these situations, the desired result can be obtained by using 358In these situations, the desired result can be obtained by using
327predicates and guards -- namely the @code{not}, @code{if} and 359predicates and guards -- namely the @code{not}, @code{if} and
@@ -331,6 +363,7 @@ predicates and guards -- namely the @code{not}, @code{if} and
331(forest (+ "tree" (* [blank])) (not (eol)) "tree" (eol)) 363(forest (+ "tree" (* [blank])) (not (eol)) "tree" (eol))
332@end example 364@end example
333 365
366@noindent
334The @code{if} and @code{not} operators accept a parsing expression and 367The @code{if} and @code{not} operators accept a parsing expression and
335interpret it as a boolean, without moving point. The contents of a 368interpret it as a boolean, without moving point. The contents of a
336@code{guard} operator are evaluated as regular Lisp (not a 369@code{guard} operator are evaluated as regular Lisp (not a
@@ -345,6 +378,7 @@ rule:
345(end-game "game" (eob)) 378(end-game "game" (eob))
346@end example 379@end example
347 380
381@noindent
348when run in a buffer containing the text ``game over'' after point, 382when run in a buffer containing the text ``game over'' after point,
349will move point to just after ``game'' then halt parsing, returning 383will move point to just after ``game'' then halt parsing, returning
350@code{nil}. Successful parsing will always return @code{t}, or the 384@code{nil}. Successful parsing will always return @code{t}, or the