dotemacs

My Emacs configuration
git clone git://git.entf.net/dotemacs
Log | Files | Refs | LICENSE

DESIGN.md (19980B)


      1 # parseclj Design Goals / Roadmap
      2 
      3 parseclj is an Emacs Lisp library for parsing Clojure code and [EDN
      4 data](https://github.com/edn-format/edn). It supports several input and output
      5 formats, all powered by the same shift-reduce parser function.
      6 
      7 This documents describes the design goals for parseclj, and as such may describe features which are not implemented yet.
      8 
      9 ## Motivation
     10 
     11 People's parsing needs can differ for several reasons
     12 
     13 - parsing code (Clojure) vs parsing data (EDN, a subset of Clojure)
     14 - asserting valid input (fail fast) vs dealing gracefully with syntax errors (editor integration)
     15 - analyzing code (whitespace can be ignored) vs doing transformations on code (whitespace aware round-tripping)
     16 - parsing the contents of a buffer, or a string, or a file
     17 - strict parsing (all tagged literals must be understood) vs pass-through (parsing/unparsing unknown tagged literals while ignoring semantics)
     18 
     19 This library aims to support all of these use cases, either directly, or by providing the building blocks to do it yourself.
     20 
     21 ## Prior art
     22 
     23 [edn.el](https://github.com/expez/edn.el) is an EDN-to-elisp parser based on the PEG parser generator library.
     24 
     25 ## Challenges
     26 
     27 The data structures available in Emacs are less rich than those used by Clojure.
     28 
     29 - Clojure has `nil` and `false`, Emacs only has `nil`.
     30 - Emacs has no notion of sets
     31 - Emacs has no date/timestamp type
     32   - there is a `time-date.el` which has functions for working with day/time represented as cons cells.
     33 - Emacs has no "character" type (characters are represented as numbers)
     34 - Emacs does not support custom records/types (there is a Common Lisp inspired object system, but it implements types on top of regular lists and vectors).
     35 - Emacs does not support adding metadata to values
     36 - Emacs does not support bignums
     37 
     38 On the other hand Emacs supports strings/buffers with arbitrary encoding, on the JVM and on JavaScript strings are always UTF-16/UCS-2.
     39 
     40 ## Architecture
     41 
     42 The implementation is implemented in three parts: a lexer, a parser, and [multiple reducers](#multiple-reducers).
     43 
     44 ### Lexer
     45 
     46 The *lexer* turns the input text, a buffer, into tokens, data structures representing a single syntactical unit, such as a symbol, a number, or a delimiter like "(", ")", "#{", or "#_".
     47 
     48 In parseclj the lexer is a single function `parseclj-lex-next` which can be called repeatedly to get a sequence of tokens. `parseclj-lex-next` returns the token at "point" (i.e. the Emacs cursor position), and moves point to after the token.
     49 
     50 A *token* is a association list (list of cons cells), with keys `:token-type`, `:form`, `:position`, and optionally `:error-type`.
     51 
     52 Note: we don't add line/column numbers to the token, the consumer can add these if needed based on the position of point before calling `parseclj-lex-next`.
     53 
     54 Example:
     55 
     56 This Clojure/EDN input:
     57 
     58 ``` clojure
     59 (42 "hello" #_ignored #{:a})
     60 ```
     61 
     62 Results in these tokens
     63 
     64 ``` emacs-lisp
     65 ((:token-type . :lparen)
     66  (:form . "(")
     67  (:pos . 1))
     68 
     69 ((:token-type . :number)
     70  (:form . "42")
     71  (:pos . 2))
     72 
     73 ((:token-type . :whitespace)
     74  (:form . " ")
     75  (:pos . 4))
     76 
     77 ((:token-type . :string)
     78  (:form . "\"hello\"")
     79  (:pos . 5))
     80 
     81 ((:token-type . :whitespace)
     82  (:form . " ")
     83  (:pos . 12))
     84 
     85 ((:token-type . :discard)
     86  (:form . "#_")
     87  (:pos . 13))
     88 
     89 ((:token-type . :symbol)
     90  (:form . "ignored")
     91  (:pos . 15))
     92 
     93 ((:token-type . :whitespace)
     94  (:form . " ")
     95  (:pos . 22))
     96 
     97 ((:token-type . :set)
     98  (:form . "#{")
     99  (:pos . 23))
    100 
    101 ((:token-type . :keyword)
    102  (:form . ":a")
    103  (:pos . 25))
    104 
    105 ((:token-type . :rbrace)
    106  (:form . "}")
    107  (:pos . 27))
    108 
    109 ((:token-type . :rparen)
    110  (:form . ")")
    111  (:pos . 28))
    112 
    113 ((:token-type . :eof)
    114  (:form . nil)
    115  (:pos . 29))
    116 ```
    117 
    118 Note that the lexer makes no attempt to "interpret" the tokens, it merely finds their boundaries. Concatentating the `:form` values yields a string identical to the original input.
    119 
    120 Tokens can be recognized by the `:token-type` key, which must always come first in the association list.
    121 
    122 ## Shift-reduce parser
    123 
    124 The parser is again a single function `parseclj-parse`. It is a higher order function, with much of the final result determined by the `reduce-leaf` and `reduce-branch` functions passed in as arguments.
    125 
    126 `parseclj-parse` internally operates by using a stack. This stack contains tokens (as returned by `parseclj-lex-next`), and reduced values.
    127 
    128 `reduce-leaf` is a two-argument function. It takes the current value of the stack, and a token, and returns an updated stack, typically by parsing the token to a value and pushing that value onto the stack.
    129 
    130 `reduce-branch` is a three-argument function. It takes the current value of the stack, a node type, and a list of children, and returns an updated stack.
    131 
    132 The parser reads consecutive input tokens. If the token represents a leaf node (like a number, symbol, string), then it calls `reduce-leaf`, giving it a chance to add a value to the stack. If the token is a non-leaf node (a delimiter) it gets put on the stack as-is. This part is known as the "shift" phase.
    133 
    134 After "shifting" a value on to the stack, the parser tries to "reduce", by inspecting the top one, two, or three items on the stack.
    135 
    136 If the top item is a closing delimiter token, then the parser scans down the stack until it finds a matching opening delimiter. It pops both delimiters and everything in between them off the stack, and passes them to `reduce-branch`, which can "reduce" this sequence to a single value (say, a list), and push that item onto the stack.
    137 
    138 The type of values pushed onto the stack depends on the reducing functions used. The parser only distinguishes between tokens and non-tokens. It follows that a reducing functions should not push raw tokens back onto the stack.
    139 
    140 When parsing finishes the stack will contain all parsed forms in reverse order. It will call `reduce-branch` once more with a node type of `:root`, to give it a chance to finalize.
    141 
    142 ### Example
    143 
    144 An example, when parsing the following EDN, with parsing done up to the position of `|`, the stack looks as follows.
    145 
    146 ``` clojure
    147 ;; input
    148 (1 2 (3|))
    149 ```
    150 
    151 ``` emacs-lisp
    152 ;; stack
    153 (3
    154  ((:token-type . :lparen) (:form . "(") (:pos . 6)))
    155  2
    156  1
    157  ((:token-type . :lparen) (:form . "(") (:pos . 1)))
    158 ```
    159 
    160 Now the parser encounters the first closing parenthesis. It pops `3` and `:lparen` off the stack, and passes `(((:token-type . :lparen) 3 ((:token-type . :rparen)))` to `reduce-branch`, which reduces this a single list, and pushes it onto the stack.
    161 
    162 ``` clojure
    163 ;; input
    164 (1 2 (3)|)
    165 ```
    166 
    167 ``` emacs-lisp
    168 ;; stack
    169 ((3)
    170  2
    171  1
    172  ((:token-type . :lparen) (:form . "(") (:pos . 1)))
    173 ```
    174 
    175 Now the parser encounters the second closing parenthesis. It pops everything until `:lparen` off the stack, and passes it to `reduce-branch`, which turns the result into a list and pushes it onto the stack.
    176 
    177 ``` clojure
    178 ;; input
    179 (1 2 (3))|
    180 ```
    181 
    182 ``` emacs-lisp
    183 ;; stack
    184 ((1 2 (3)))
    185 ```
    186 
    187 ### Dealing with parse errors
    188 
    189 `parseclj-parse` needs to be able to parse invalid input. Imagine analyzing a user's buffer while they are editing, to provide contextual help or do linting. Even when delimiters are unbalanced it should still be possible to get a "best effort" parse result. It turns out the shift-reduce approach provides that out of the box. The result of parsing invalid input is a stack which still has unreduced tokens in it.
    190 
    191 Unmatched opening delimiter:
    192 
    193 ``` clojure
    194 ;; input
    195 (1 2 { 3)
    196 ```
    197 
    198 ``` emacs-lisp
    199 ;; result
    200 ((1 2 ((:token-type . :lbrace)) 3))
    201 ```
    202 
    203 Unmatched closing delimiter:
    204 
    205 ``` clojure
    206 ;; input
    207 (1 2 3))
    208 ```
    209 
    210 ``` emacs-lisp
    211 ;; result
    212 ((1 2 3) ((:token-type . :lparen)))
    213 ```
    214 
    215 In many cases it will be desirable to "fail fast", and raise an error as soon as a syntax error is encountered. A `reduce-branch` function can do so if it wishes by checking its input sequence for raw tokens, and raising an error if any are present.
    216 
    217 ## Multiple Reducers
    218 
    219 While the shift-reduce parser parses input, it reduces parsed values (leafs tokens and collections) using "reducing functions".  These functions are in charge of giving actual meaning to the parsed data.
    220 
    221 Currently, there are two sets of reducing functions implemented: one set that reduces parsed values to AST data structures, and another one that reduces to Emacs Lisp values.
    222 
    223 The AST reducing functions are part of [`parseclj`](https://github.com/clojure-emacs/parseclj), while the Emacs Lisp reducing functions are separated into a differente package called [`parseedn`](https://github.com/clojure-emacs/parseedn).
    224 
    225 To understand why these two set of reducing functions are separated, it's important to look at the difference between EDN and Clojure, and also between Emacs Lisp elements and EDN elements.
    226 
    227 ### EDN vs Clojure
    228 
    229 EDN syntax is a subset of Clojure syntax used for representing data (as opposed to code).
    230 
    231 Several constructs that are common in Clojure code are not valid in EDN. This includes
    232 
    233 - quoting, syntax quoting, syntax quote splicing, etc.
    234 - read time evaluation with `#=(,,,)`
    235 
    236 ### Dealing with tagged literals
    237 
    238 EDN/Clojure syntax is extensible. Values can be prefixed with user-provided tags, which indicates they need to be transformed after reading by a registered handler function.
    239 
    240 In general a consumer of EDN needs to be aware of the possible tags that can occur in the input, as well as their semantics, in order to preserve the data's semantics. When a parser encounters an unknown tag then this is considered an error.
    241 
    242 The EDN parser functions will take an optional tag handler function. This function receives two arguments, the tag symbol, and the parsed form that follows the tag. It should either return a correctly coerced value, or raise an error. A default function that knows how to deal with the tags that are part of the EDN spec will be provided.
    243 
    244 When parsing code to an AST the situation is different, here tags simply become part of the AST, without caring about their semantics.
    245 
    246 ### Representing EDN as Emacs Lisp values
    247 
    248 See also the section at the top regarding differences between Clojure's data types vs those available in Emacs Lisp.
    249 
    250 These are the choices that the edn.el library has made:
    251 
    252 - represent sets, inst and uuid values as `cl-defstruct` objects
    253   - `'(edn-set (... set values ...))`
    254   - `'(edn-inst high low)` see `time-date.el`
    255 - parse maps as hash tables
    256 - represent characters as integers, *but* parse `\newline` `\return` `\space` and `\tab` as the symbols `'newline` `'return` etc.
    257 - parse `true` as `t`, `nil` and `false` as `nil`.
    258 
    259 #### Differences with EDN.el
    260 
    261 At the moment the `parseedn-*` copy the parsing behavior of edn.el, *except* that the character literals `\newline`, `\return`, `\space`, and `\tab` are parsed to their character code (10, 13, 32, and 9 respectively), instead of to symbols.
    262 
    263 ### AST
    264 
    265 An AST (abstract syntax tree) is a tree structure made up of nodes. A node looks like this
    266 
    267 ``` emacs-lisp
    268 ((:node-type . :root)
    269  (:position . 0)
    270  (:children . (((:node-type . :comment)
    271                 (:position . 0)
    272                 (:form . ";; cool stuff\n;; in here"))
    273                ((:node-type . :list)
    274                 (:position . 26)
    275                 (:children . (((:node-type . :number)
    276                                (:position . 27)
    277                                (:form "123")
    278                                (:value 123))))))))
    279 ```
    280 
    281 Nodes are an alist with `:node-type` as the first key and a `:position`. Leaf nodes have `:form` and `:value` keys, for the unparsed and parsed form respectively. `:whitespace` and `:comment` nodes only have a `:form`, not a `:value`.
    282 
    283 Non-leaf nodes contain a list of `:children`.
    284 
    285 ## Public API
    286 
    287 > Disclaimer: outdated information -- this API has been deprecated in favor of [Alternative Package Layout](#alternative-package-layout), and it's only kept here for historical purposes.
    288 
    289 parseclj provides three "parse modes"
    290 
    291 - `edn` meant for parsing data, it parses EDN to emacs lisp data
    292 - `ast` meant for analyzing source code, parses to a "semantic" AST, does not preserve whitespace or comments
    293 - `source` meant for transforming/round-tripping source codes. Like `ast`, but preserves whitespace and comments
    294 
    295 For each of these there can be the following functions
    296 
    297 - `parseclj-{mode}` parse the current buffer starting at `point`, raise an error when syntax/lexing errors are encountered
    298 - `parseclj-{mode}-full` same as above but ignore syntax errors, returning a partially parsed result
    299 - `clj-print-{mode}` turn the result of the corresponding parse function back into Clojure/EDN, and insert it into the current buffer
    300 
    301 Each of these have `-str` variant which instead works on strings. This yields a total potential API of:
    302 
    303 ```
    304 (defun parseedn (&OPTIONAL tag-handler))
    305 (defun parseedn-full (&OPTIONAL tag-handler))
    306 (defun clj-print-edn (edn))
    307 (defun parseedn-str (string &OPTIONAL tag-handler))
    308 (defun parseedn-full-str (string &OPTIONAL tag-handler))
    309 (defun clj-print-edn-str (edn))
    310 (defun parseclj-ast ())
    311 (defun parseclj-ast-full ())
    312 (defun clj-print-ast (node))
    313 (defun parseclj-ast-str ())
    314 (defun parseclj-ast-full-str ())
    315 (defun clj-print-ast-str (node))
    316 (defun parseclj-source ())
    317 (defun parseclj-source-full ())
    318 (defun clj-print-source (node))
    319 (defun parseclj-source-str ())
    320 (defun parseclj-source-full-str ())
    321 (defun clj-print-source-str (node))
    322 ```
    323 
    324 This may seem like a massive API surface, but these will all be only one or two lines, combining the generic parsing function with the right reducers.
    325 
    326 Beyond that we provide two sets of functions for working with AST nodes.
    327 
    328 - zipper-like functions for navigating and transforming ASTs
    329 - functions for reading values out of ASTs, effectively treating them as the data structure they represent.
    330 
    331 This second set of functions is important for applications that need to deal with EDN data, but for which the lossy nature of EDN->Elisp transformation is not an option. For instance, unrepl sends EDN messages, but these messages contain code forms that we need to be able to reproduce. In this case converting `false` to `nil` or a set to a list is not acceptable. Instead we can parse the EDN to an AST, and deal with the AST directly.
    332 
    333 ## Alternative package layout
    334 
    335 **Package**: parseclj
    336 
    337 Contains the core parser backend, and the Clojure-to-AST parser
    338 
    339 - file: parseclj.el
    340 
    341 ``` emacs-lisp
    342 (defun parseclj-parse-clojure (&rest string-and-options)
    343   "Parse Clojure source to AST.
    344 
    345 Reads either from the current buffer, starting from point, until
    346 point-max, or reads from the optional string argument.
    347 
    348 STRING-AND-OPTIONS can be an optional string, followed by
    349 key-value pairs to specify parsing options.
    350 
    351 - `:lexical-preservation' Retain whitespace, comments, and
    352   discards. Defaults to false (`nil').
    353 - `:fail-fast' Raise an error
    354   when encountering invalid syntax. Defaults to true (`t'). ")
    355 
    356 (defun parseclj-unparse-clojure (ast &rest options)
    357   "Parse Clojure AST to source code.
    358 
    359 Given an abstract syntax tree AST (as returned by
    360 parseclj-parse-clojure), turn it back into source code, and
    361 insert it into the current buffer.
    362 
    363 OPTIONS is a list of key value pairs containing options.
    364 
    365 - `:lexical-preservation' If `t', assume the AST contains
    366   whitespace. If `nil', insert whitespace between forms. When
    367   parsing with `:lexical-preservation', you should unparse the
    368   same way. ")
    369 
    370 (defun parseclj-unparse-clojure-to-string (ast &rest options)
    371   "Parse Clojure AST to a source code string.
    372 
    373 Given an abstract syntax tree AST (as returned by
    374 parseclj-parse-clojure), turn it back into source code, and
    375 return it as a string
    376 
    377 OPTIONS is a list of key value pairs containing options.
    378 
    379 - `:lexical-preservation' If `t', assume the AST contains
    380   whitespace. If `nil', insert whitespace between forms. When
    381   parsing with `:lexical-preservation', you should unparse the
    382   same way.")
    383 ```
    384 
    385 - file: parseclj-lex.el
    386 
    387 ``` emacs-lisp
    388 (defun parseclj-lex-next ()
    389   "Move past the token at point, and return the token")
    390 ```
    391 
    392 - file: parseclj-ast.el
    393 
    394 ``` emacs-lisp
    395 (defun parseclj-ast--reduce-leaf (stack token)
    396   "Create a leaf AST node and put it onto the stack.
    397 
    398 Given the current parser STACK and a TOKEN coming from the lexer,
    399 create an AST leaf node and return an updated stack.
    400 
    401 Whitespace and comment tokens are ignored (i.e. the stack is
    402 returned unchanged).
    403 
    404 This function is only called for tokens that correspond with AST
    405 leaf nodes.")
    406 
    407 (defun parseclj-ast--reduce-leaf-with-lexical-preservation (stack token)
    408   "Create a leaf AST node and put it onto the stack.
    409 
    410 Given the current parser STACK and a TOKEN coming from the lexer,
    411 create an AST leaf node and return an updated stack.
    412 
    413 This functions creates nodes for whitespace and comment tokens,
    414 for other tokens it delegates to `parseclj-ast--reduce-leaf'.")
    415 
    416 (defun parseclj-ast--reduce-branch (stack type children)
    417   "Create a branch AST node and put it onto the stack.
    418 
    419 This function is passed the current parser STACK the node TYPE to
    420 be created, and a list of AST nodes that will become the CHILDREN
    421 of the newly created node.
    422 
    423 This implementation ignores `:discard' nodes (#_), when called
    424 with a TYPE of `:discard' it returns the stack unchanged.")
    425 
    426 (defun parseclj-ast--reduce-branch-with-lexical-preservation (stack type children)
    427   "Create a branch AST node and put it onto the stack.
    428 
    429 This function is passed the current parser STACK the node TYPE to
    430 be created, and a list of AST nodes that will become the CHILDREN
    431 of the newly created node.
    432 
    433 This implementation handles `:discard' nodes (#_), for other node
    434 types it delegates to `parseclj-ast--reduce-branch'.")
    435 
    436 (defun parseclj-ast-value (node)
    437   "Given an AST NODE, returns its value.
    438 
    439 Recursively convert the AST node into an Emacs Lisp value. E.g.
    440 turn a `:list' node into a sexp, a `:number' node into a number.
    441 
    442 This operation is lossy because not all Clojure types have a
    443 corresponding type in Emacs. `nil' and `false' form a
    444 particularly poignant case, both are converted to `nil'.")
    445 ```
    446 
    447 **Package**: [parseedn](https://github.com/clojure-emacs/parseedn)
    448 
    449 - file: parseedn.el
    450 
    451 ``` emacs-lisp
    452 (defun parseedn-read (&rest string-and-options)
    453   "Reads an EDN form and converts it an Emacs Lisp value.
    454 
    455 Reads either from the current buffer, starting from point, or
    456 reads from the optional string argument. Only reads the first
    457 complete form. When used on a buffer, this moves `(point)' to
    458 after the form.
    459 
    460 By default uses an output format that uses tagged lists to
    461 preserve type information. This makes the conversion lossless,
    462 but still easy to process.
    463 
    464   \"#{1 2 3}\" => (set (1 2 3))
    465   \"false\"    => (false)
    466   \"t\"        => t
    467   \"true\"     => (true)
    468   \"(1 2 3)\"  => (list (1 2 3))
    469   \"#uuid \\\"255efd69-dec9-4428-9142-cebd5357fb2a\\\"\"
    470     => (uuid \"255efd69-dec9-4428-9142-cebd5357fb2a\")
    471 
    472 Alternatively a compatibility mode is available which mimics
    473 exactly the behavior of `edn-read' as implemented in `edn.el'.
    474 
    475 STRING-AND-OPTIONS can be an optional string, followed by
    476 key-value pairs to specify parsing options.
    477 
    478 - `:compat' Mimic edn.el. Defaults to false (`nil').
    479 - `:tag-readers' An association list mapping symbols to
    480   functions, used to parse tagged literals. The function is given
    481   the parsed value and given an opportunity to transform it.
    482   Defaults for `uuid' and `inst' are provided but can be
    483   overridden.
    484 - `:fail-fast' Raise an error when encountering invalid syntax.
    485   Defaults to true (`t'). ")
    486 
    487 (defun parseclj-print (value &rest options)
    488   "Convert an Emacs Lisp value to EDN.
    489 
    490 OPTIONS is a list of key value pairs containing options.
    491 
    492 By default assumes that any list is of the form `(type value)'.
    493 Extra `:tag-writers' can be specified to handle unknown types.
    494 Alternatively a compatibility mode is available which emulates
    495 the behavior of `edn.el'
    496 
    497 - `:compat' If `t', mimic `edn.el'. Defaults to `false' (`nil').
    498   When this is set to `t' then `:tag-writers' is ignored.
    499 - `:tag-writers' An association list from symbol to function.
    500   Each function is given a list including `type' tag, and should
    501   return a value that can be handled by `parseclj-print'.")
    502 ```