DESIGN.md (19980B)
1 # parseclj Design Goals / Roadmap 2 3 parseclj is an Emacs Lisp library for parsing Clojure code and [EDN 4 data](https://github.com/edn-format/edn). It supports several input and output 5 formats, all powered by the same shift-reduce parser function. 6 7 This documents describes the design goals for parseclj, and as such may describe features which are not implemented yet. 8 9 ## Motivation 10 11 People's parsing needs can differ for several reasons 12 13 - parsing code (Clojure) vs parsing data (EDN, a subset of Clojure) 14 - asserting valid input (fail fast) vs dealing gracefully with syntax errors (editor integration) 15 - analyzing code (whitespace can be ignored) vs doing transformations on code (whitespace aware round-tripping) 16 - parsing the contents of a buffer, or a string, or a file 17 - strict parsing (all tagged literals must be understood) vs pass-through (parsing/unparsing unknown tagged literals while ignoring semantics) 18 19 This library aims to support all of these use cases, either directly, or by providing the building blocks to do it yourself. 20 21 ## Prior art 22 23 [edn.el](https://github.com/expez/edn.el) is an EDN-to-elisp parser based on the PEG parser generator library. 24 25 ## Challenges 26 27 The data structures available in Emacs are less rich than those used by Clojure. 28 29 - Clojure has `nil` and `false`, Emacs only has `nil`. 30 - Emacs has no notion of sets 31 - Emacs has no date/timestamp type 32 - there is a `time-date.el` which has functions for working with day/time represented as cons cells. 33 - Emacs has no "character" type (characters are represented as numbers) 34 - Emacs does not support custom records/types (there is a Common Lisp inspired object system, but it implements types on top of regular lists and vectors). 35 - Emacs does not support adding metadata to values 36 - Emacs does not support bignums 37 38 On the other hand Emacs supports strings/buffers with arbitrary encoding, on the JVM and on JavaScript strings are always UTF-16/UCS-2. 39 40 ## Architecture 41 42 The implementation is implemented in three parts: a lexer, a parser, and [multiple reducers](#multiple-reducers). 43 44 ### Lexer 45 46 The *lexer* turns the input text, a buffer, into tokens, data structures representing a single syntactical unit, such as a symbol, a number, or a delimiter like "(", ")", "#{", or "#_". 47 48 In parseclj the lexer is a single function `parseclj-lex-next` which can be called repeatedly to get a sequence of tokens. `parseclj-lex-next` returns the token at "point" (i.e. the Emacs cursor position), and moves point to after the token. 49 50 A *token* is a association list (list of cons cells), with keys `:token-type`, `:form`, `:position`, and optionally `:error-type`. 51 52 Note: we don't add line/column numbers to the token, the consumer can add these if needed based on the position of point before calling `parseclj-lex-next`. 53 54 Example: 55 56 This Clojure/EDN input: 57 58 ``` clojure 59 (42 "hello" #_ignored #{:a}) 60 ``` 61 62 Results in these tokens 63 64 ``` emacs-lisp 65 ((:token-type . :lparen) 66 (:form . "(") 67 (:pos . 1)) 68 69 ((:token-type . :number) 70 (:form . "42") 71 (:pos . 2)) 72 73 ((:token-type . :whitespace) 74 (:form . " ") 75 (:pos . 4)) 76 77 ((:token-type . :string) 78 (:form . "\"hello\"") 79 (:pos . 5)) 80 81 ((:token-type . :whitespace) 82 (:form . " ") 83 (:pos . 12)) 84 85 ((:token-type . :discard) 86 (:form . "#_") 87 (:pos . 13)) 88 89 ((:token-type . :symbol) 90 (:form . "ignored") 91 (:pos . 15)) 92 93 ((:token-type . :whitespace) 94 (:form . " ") 95 (:pos . 22)) 96 97 ((:token-type . :set) 98 (:form . "#{") 99 (:pos . 23)) 100 101 ((:token-type . :keyword) 102 (:form . ":a") 103 (:pos . 25)) 104 105 ((:token-type . :rbrace) 106 (:form . "}") 107 (:pos . 27)) 108 109 ((:token-type . :rparen) 110 (:form . ")") 111 (:pos . 28)) 112 113 ((:token-type . :eof) 114 (:form . nil) 115 (:pos . 29)) 116 ``` 117 118 Note that the lexer makes no attempt to "interpret" the tokens, it merely finds their boundaries. Concatentating the `:form` values yields a string identical to the original input. 119 120 Tokens can be recognized by the `:token-type` key, which must always come first in the association list. 121 122 ## Shift-reduce parser 123 124 The parser is again a single function `parseclj-parse`. It is a higher order function, with much of the final result determined by the `reduce-leaf` and `reduce-branch` functions passed in as arguments. 125 126 `parseclj-parse` internally operates by using a stack. This stack contains tokens (as returned by `parseclj-lex-next`), and reduced values. 127 128 `reduce-leaf` is a two-argument function. It takes the current value of the stack, and a token, and returns an updated stack, typically by parsing the token to a value and pushing that value onto the stack. 129 130 `reduce-branch` is a three-argument function. It takes the current value of the stack, a node type, and a list of children, and returns an updated stack. 131 132 The parser reads consecutive input tokens. If the token represents a leaf node (like a number, symbol, string), then it calls `reduce-leaf`, giving it a chance to add a value to the stack. If the token is a non-leaf node (a delimiter) it gets put on the stack as-is. This part is known as the "shift" phase. 133 134 After "shifting" a value on to the stack, the parser tries to "reduce", by inspecting the top one, two, or three items on the stack. 135 136 If the top item is a closing delimiter token, then the parser scans down the stack until it finds a matching opening delimiter. It pops both delimiters and everything in between them off the stack, and passes them to `reduce-branch`, which can "reduce" this sequence to a single value (say, a list), and push that item onto the stack. 137 138 The type of values pushed onto the stack depends on the reducing functions used. The parser only distinguishes between tokens and non-tokens. It follows that a reducing functions should not push raw tokens back onto the stack. 139 140 When parsing finishes the stack will contain all parsed forms in reverse order. It will call `reduce-branch` once more with a node type of `:root`, to give it a chance to finalize. 141 142 ### Example 143 144 An example, when parsing the following EDN, with parsing done up to the position of `|`, the stack looks as follows. 145 146 ``` clojure 147 ;; input 148 (1 2 (3|)) 149 ``` 150 151 ``` emacs-lisp 152 ;; stack 153 (3 154 ((:token-type . :lparen) (:form . "(") (:pos . 6))) 155 2 156 1 157 ((:token-type . :lparen) (:form . "(") (:pos . 1))) 158 ``` 159 160 Now the parser encounters the first closing parenthesis. It pops `3` and `:lparen` off the stack, and passes `(((:token-type . :lparen) 3 ((:token-type . :rparen)))` to `reduce-branch`, which reduces this a single list, and pushes it onto the stack. 161 162 ``` clojure 163 ;; input 164 (1 2 (3)|) 165 ``` 166 167 ``` emacs-lisp 168 ;; stack 169 ((3) 170 2 171 1 172 ((:token-type . :lparen) (:form . "(") (:pos . 1))) 173 ``` 174 175 Now the parser encounters the second closing parenthesis. It pops everything until `:lparen` off the stack, and passes it to `reduce-branch`, which turns the result into a list and pushes it onto the stack. 176 177 ``` clojure 178 ;; input 179 (1 2 (3))| 180 ``` 181 182 ``` emacs-lisp 183 ;; stack 184 ((1 2 (3))) 185 ``` 186 187 ### Dealing with parse errors 188 189 `parseclj-parse` needs to be able to parse invalid input. Imagine analyzing a user's buffer while they are editing, to provide contextual help or do linting. Even when delimiters are unbalanced it should still be possible to get a "best effort" parse result. It turns out the shift-reduce approach provides that out of the box. The result of parsing invalid input is a stack which still has unreduced tokens in it. 190 191 Unmatched opening delimiter: 192 193 ``` clojure 194 ;; input 195 (1 2 { 3) 196 ``` 197 198 ``` emacs-lisp 199 ;; result 200 ((1 2 ((:token-type . :lbrace)) 3)) 201 ``` 202 203 Unmatched closing delimiter: 204 205 ``` clojure 206 ;; input 207 (1 2 3)) 208 ``` 209 210 ``` emacs-lisp 211 ;; result 212 ((1 2 3) ((:token-type . :lparen))) 213 ``` 214 215 In many cases it will be desirable to "fail fast", and raise an error as soon as a syntax error is encountered. A `reduce-branch` function can do so if it wishes by checking its input sequence for raw tokens, and raising an error if any are present. 216 217 ## Multiple Reducers 218 219 While the shift-reduce parser parses input, it reduces parsed values (leafs tokens and collections) using "reducing functions". These functions are in charge of giving actual meaning to the parsed data. 220 221 Currently, there are two sets of reducing functions implemented: one set that reduces parsed values to AST data structures, and another one that reduces to Emacs Lisp values. 222 223 The AST reducing functions are part of [`parseclj`](https://github.com/clojure-emacs/parseclj), while the Emacs Lisp reducing functions are separated into a differente package called [`parseedn`](https://github.com/clojure-emacs/parseedn). 224 225 To understand why these two set of reducing functions are separated, it's important to look at the difference between EDN and Clojure, and also between Emacs Lisp elements and EDN elements. 226 227 ### EDN vs Clojure 228 229 EDN syntax is a subset of Clojure syntax used for representing data (as opposed to code). 230 231 Several constructs that are common in Clojure code are not valid in EDN. This includes 232 233 - quoting, syntax quoting, syntax quote splicing, etc. 234 - read time evaluation with `#=(,,,)` 235 236 ### Dealing with tagged literals 237 238 EDN/Clojure syntax is extensible. Values can be prefixed with user-provided tags, which indicates they need to be transformed after reading by a registered handler function. 239 240 In general a consumer of EDN needs to be aware of the possible tags that can occur in the input, as well as their semantics, in order to preserve the data's semantics. When a parser encounters an unknown tag then this is considered an error. 241 242 The EDN parser functions will take an optional tag handler function. This function receives two arguments, the tag symbol, and the parsed form that follows the tag. It should either return a correctly coerced value, or raise an error. A default function that knows how to deal with the tags that are part of the EDN spec will be provided. 243 244 When parsing code to an AST the situation is different, here tags simply become part of the AST, without caring about their semantics. 245 246 ### Representing EDN as Emacs Lisp values 247 248 See also the section at the top regarding differences between Clojure's data types vs those available in Emacs Lisp. 249 250 These are the choices that the edn.el library has made: 251 252 - represent sets, inst and uuid values as `cl-defstruct` objects 253 - `'(edn-set (... set values ...))` 254 - `'(edn-inst high low)` see `time-date.el` 255 - parse maps as hash tables 256 - represent characters as integers, *but* parse `\newline` `\return` `\space` and `\tab` as the symbols `'newline` `'return` etc. 257 - parse `true` as `t`, `nil` and `false` as `nil`. 258 259 #### Differences with EDN.el 260 261 At the moment the `parseedn-*` copy the parsing behavior of edn.el, *except* that the character literals `\newline`, `\return`, `\space`, and `\tab` are parsed to their character code (10, 13, 32, and 9 respectively), instead of to symbols. 262 263 ### AST 264 265 An AST (abstract syntax tree) is a tree structure made up of nodes. A node looks like this 266 267 ``` emacs-lisp 268 ((:node-type . :root) 269 (:position . 0) 270 (:children . (((:node-type . :comment) 271 (:position . 0) 272 (:form . ";; cool stuff\n;; in here")) 273 ((:node-type . :list) 274 (:position . 26) 275 (:children . (((:node-type . :number) 276 (:position . 27) 277 (:form "123") 278 (:value 123)))))))) 279 ``` 280 281 Nodes are an alist with `:node-type` as the first key and a `:position`. Leaf nodes have `:form` and `:value` keys, for the unparsed and parsed form respectively. `:whitespace` and `:comment` nodes only have a `:form`, not a `:value`. 282 283 Non-leaf nodes contain a list of `:children`. 284 285 ## Public API 286 287 > Disclaimer: outdated information -- this API has been deprecated in favor of [Alternative Package Layout](#alternative-package-layout), and it's only kept here for historical purposes. 288 289 parseclj provides three "parse modes" 290 291 - `edn` meant for parsing data, it parses EDN to emacs lisp data 292 - `ast` meant for analyzing source code, parses to a "semantic" AST, does not preserve whitespace or comments 293 - `source` meant for transforming/round-tripping source codes. Like `ast`, but preserves whitespace and comments 294 295 For each of these there can be the following functions 296 297 - `parseclj-{mode}` parse the current buffer starting at `point`, raise an error when syntax/lexing errors are encountered 298 - `parseclj-{mode}-full` same as above but ignore syntax errors, returning a partially parsed result 299 - `clj-print-{mode}` turn the result of the corresponding parse function back into Clojure/EDN, and insert it into the current buffer 300 301 Each of these have `-str` variant which instead works on strings. This yields a total potential API of: 302 303 ``` 304 (defun parseedn (&OPTIONAL tag-handler)) 305 (defun parseedn-full (&OPTIONAL tag-handler)) 306 (defun clj-print-edn (edn)) 307 (defun parseedn-str (string &OPTIONAL tag-handler)) 308 (defun parseedn-full-str (string &OPTIONAL tag-handler)) 309 (defun clj-print-edn-str (edn)) 310 (defun parseclj-ast ()) 311 (defun parseclj-ast-full ()) 312 (defun clj-print-ast (node)) 313 (defun parseclj-ast-str ()) 314 (defun parseclj-ast-full-str ()) 315 (defun clj-print-ast-str (node)) 316 (defun parseclj-source ()) 317 (defun parseclj-source-full ()) 318 (defun clj-print-source (node)) 319 (defun parseclj-source-str ()) 320 (defun parseclj-source-full-str ()) 321 (defun clj-print-source-str (node)) 322 ``` 323 324 This may seem like a massive API surface, but these will all be only one or two lines, combining the generic parsing function with the right reducers. 325 326 Beyond that we provide two sets of functions for working with AST nodes. 327 328 - zipper-like functions for navigating and transforming ASTs 329 - functions for reading values out of ASTs, effectively treating them as the data structure they represent. 330 331 This second set of functions is important for applications that need to deal with EDN data, but for which the lossy nature of EDN->Elisp transformation is not an option. For instance, unrepl sends EDN messages, but these messages contain code forms that we need to be able to reproduce. In this case converting `false` to `nil` or a set to a list is not acceptable. Instead we can parse the EDN to an AST, and deal with the AST directly. 332 333 ## Alternative package layout 334 335 **Package**: parseclj 336 337 Contains the core parser backend, and the Clojure-to-AST parser 338 339 - file: parseclj.el 340 341 ``` emacs-lisp 342 (defun parseclj-parse-clojure (&rest string-and-options) 343 "Parse Clojure source to AST. 344 345 Reads either from the current buffer, starting from point, until 346 point-max, or reads from the optional string argument. 347 348 STRING-AND-OPTIONS can be an optional string, followed by 349 key-value pairs to specify parsing options. 350 351 - `:lexical-preservation' Retain whitespace, comments, and 352 discards. Defaults to false (`nil'). 353 - `:fail-fast' Raise an error 354 when encountering invalid syntax. Defaults to true (`t'). ") 355 356 (defun parseclj-unparse-clojure (ast &rest options) 357 "Parse Clojure AST to source code. 358 359 Given an abstract syntax tree AST (as returned by 360 parseclj-parse-clojure), turn it back into source code, and 361 insert it into the current buffer. 362 363 OPTIONS is a list of key value pairs containing options. 364 365 - `:lexical-preservation' If `t', assume the AST contains 366 whitespace. If `nil', insert whitespace between forms. When 367 parsing with `:lexical-preservation', you should unparse the 368 same way. ") 369 370 (defun parseclj-unparse-clojure-to-string (ast &rest options) 371 "Parse Clojure AST to a source code string. 372 373 Given an abstract syntax tree AST (as returned by 374 parseclj-parse-clojure), turn it back into source code, and 375 return it as a string 376 377 OPTIONS is a list of key value pairs containing options. 378 379 - `:lexical-preservation' If `t', assume the AST contains 380 whitespace. If `nil', insert whitespace between forms. When 381 parsing with `:lexical-preservation', you should unparse the 382 same way.") 383 ``` 384 385 - file: parseclj-lex.el 386 387 ``` emacs-lisp 388 (defun parseclj-lex-next () 389 "Move past the token at point, and return the token") 390 ``` 391 392 - file: parseclj-ast.el 393 394 ``` emacs-lisp 395 (defun parseclj-ast--reduce-leaf (stack token) 396 "Create a leaf AST node and put it onto the stack. 397 398 Given the current parser STACK and a TOKEN coming from the lexer, 399 create an AST leaf node and return an updated stack. 400 401 Whitespace and comment tokens are ignored (i.e. the stack is 402 returned unchanged). 403 404 This function is only called for tokens that correspond with AST 405 leaf nodes.") 406 407 (defun parseclj-ast--reduce-leaf-with-lexical-preservation (stack token) 408 "Create a leaf AST node and put it onto the stack. 409 410 Given the current parser STACK and a TOKEN coming from the lexer, 411 create an AST leaf node and return an updated stack. 412 413 This functions creates nodes for whitespace and comment tokens, 414 for other tokens it delegates to `parseclj-ast--reduce-leaf'.") 415 416 (defun parseclj-ast--reduce-branch (stack type children) 417 "Create a branch AST node and put it onto the stack. 418 419 This function is passed the current parser STACK the node TYPE to 420 be created, and a list of AST nodes that will become the CHILDREN 421 of the newly created node. 422 423 This implementation ignores `:discard' nodes (#_), when called 424 with a TYPE of `:discard' it returns the stack unchanged.") 425 426 (defun parseclj-ast--reduce-branch-with-lexical-preservation (stack type children) 427 "Create a branch AST node and put it onto the stack. 428 429 This function is passed the current parser STACK the node TYPE to 430 be created, and a list of AST nodes that will become the CHILDREN 431 of the newly created node. 432 433 This implementation handles `:discard' nodes (#_), for other node 434 types it delegates to `parseclj-ast--reduce-branch'.") 435 436 (defun parseclj-ast-value (node) 437 "Given an AST NODE, returns its value. 438 439 Recursively convert the AST node into an Emacs Lisp value. E.g. 440 turn a `:list' node into a sexp, a `:number' node into a number. 441 442 This operation is lossy because not all Clojure types have a 443 corresponding type in Emacs. `nil' and `false' form a 444 particularly poignant case, both are converted to `nil'.") 445 ``` 446 447 **Package**: [parseedn](https://github.com/clojure-emacs/parseedn) 448 449 - file: parseedn.el 450 451 ``` emacs-lisp 452 (defun parseedn-read (&rest string-and-options) 453 "Reads an EDN form and converts it an Emacs Lisp value. 454 455 Reads either from the current buffer, starting from point, or 456 reads from the optional string argument. Only reads the first 457 complete form. When used on a buffer, this moves `(point)' to 458 after the form. 459 460 By default uses an output format that uses tagged lists to 461 preserve type information. This makes the conversion lossless, 462 but still easy to process. 463 464 \"#{1 2 3}\" => (set (1 2 3)) 465 \"false\" => (false) 466 \"t\" => t 467 \"true\" => (true) 468 \"(1 2 3)\" => (list (1 2 3)) 469 \"#uuid \\\"255efd69-dec9-4428-9142-cebd5357fb2a\\\"\" 470 => (uuid \"255efd69-dec9-4428-9142-cebd5357fb2a\") 471 472 Alternatively a compatibility mode is available which mimics 473 exactly the behavior of `edn-read' as implemented in `edn.el'. 474 475 STRING-AND-OPTIONS can be an optional string, followed by 476 key-value pairs to specify parsing options. 477 478 - `:compat' Mimic edn.el. Defaults to false (`nil'). 479 - `:tag-readers' An association list mapping symbols to 480 functions, used to parse tagged literals. The function is given 481 the parsed value and given an opportunity to transform it. 482 Defaults for `uuid' and `inst' are provided but can be 483 overridden. 484 - `:fail-fast' Raise an error when encountering invalid syntax. 485 Defaults to true (`t'). ") 486 487 (defun parseclj-print (value &rest options) 488 "Convert an Emacs Lisp value to EDN. 489 490 OPTIONS is a list of key value pairs containing options. 491 492 By default assumes that any list is of the form `(type value)'. 493 Extra `:tag-writers' can be specified to handle unknown types. 494 Alternatively a compatibility mode is available which emulates 495 the behavior of `edn.el' 496 497 - `:compat' If `t', mimic `edn.el'. Defaults to `false' (`nil'). 498 When this is set to `t' then `:tag-writers' is ignored. 499 - `:tag-writers' An association list from symbol to function. 500 Each function is given a list including `type' tag, and should 501 return a value that can be handled by `parseclj-print'.") 502 ```