
Re: Handling Optional Values in Rust macro_rules; in Which the Peanut Gallery Shows Off Implementations for Their Pet Language

November 9, 2022

イェンス — Handling Optional Values in Rust macro_rules

JeanG3nie — Re: Handling Optional Values in Rust macro_rules

This page is CC0-1.0

Screenreader and assistive technologies note: All preformatted blocks on this page are Common Lisp code.

While I have a notable distaste for Rust in general, I'm actually quite a fan of its macros; although they're not as good as Lisp's macros (naturally, since one of Lisp's defining features is homoiconicity, which makes macros easy and powerful), they're better than the awful text-replacement macros used by really any other programming language, if the language has macros at all. IMO, macros (or at least some form of reflection) are essential to avoid otherwise unavoidable repetitive code—I'm in the minuscule minority, but I sometimes end up using m4 in languages without macros instead of just biting the bullet and repeating boilerplate over and over.

Wikipedia: Homoiconicity

However, the instant I read the original post it struck me as a quintessential job for Lisp's macros. These examples are taken from a gemtext parser I wrote a few months ago in Common Lisp—actually part of my as-yet-unreleased gemroff markup language that I'm currently writing this post in, so the real macros are moderately more complex, with some extra features.

The Simple Case

First off, it seems like JeanG3nie's Zig example is operating under the assumption that the parsed token symbols are named identically to their string representation. Maybe for something like Ada, where most language keywords are full words, this makes sense. If you're doing that, then you can ignore everything else here and just use the following:

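A minimal sketch of that approach (the accessor names match the ones used later, but the exact code is an assumption):

```
(defun string->token (string)
  ;; The CL reader normally upcases symbol names, so upcase the
  ;; string before interning it as a keyword.
  (intern (string-upcase string) :keyword))

(defun token->string (token)
  ;; Fold back down to the lower case humans prefer.
  (string-downcase (symbol-name token)))
```
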
The only thing notable about this code is that it case folds the strings. The Common Lisp reader normally case folds symbols to all upper case, so we do that when converting strings to tokens (this also conveniently makes tokens case-insensitive, which is often but not always what you want, so YMMV). Conversely, humans tend to prefer stuff being in lower case, so when converting the other way we case fold the string back down.

The Common Case

However, イェンス is definitely talking about mappings that can't be derived mechanically from the token name (e.g. the token name BANG_EQUAL maps to the string "!="), so you'll need to manually write out the mapping somewhere. The goal in something like this is to write the mapping precisely once and let the macro deal with expanding it into all the variants you need.

We'll define a macro that you can use like the following:

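Something like this (the token set here is just an example, loosely based on gemtext line types):

```
(deftokens (string->token token->string)
  (:text      nil)   ; plain text lines have no leading string
  (:link      "=>")
  (:heading-1 "#")
  (:heading-2 "##")
  (:heading-3 "###")
  (:list-item "*")
  (:quote     ">"))
```
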
Which will then define the accessors STRING->TOKEN and TOKEN->STRING that you can use like so:

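For instance, with the example token set above:

```
(string->token "=>")    ; => :LINK
(token->string :link)   ; => "=>"
(token->string :text)   ; => NIL
```
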
Here's the entire implementation:

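In sketch form (as mentioned, the real version has some extra features):

```
(defmacro deftokens ((string->token token->string) &body tokens)
  ;; Fresh symbols so the expansion can't capture the user's bindings.
  (let ((alist  (gensym "ALIST"))
        (string (gensym "STRING"))
        (token  (gensym "TOKEN")))
    ;; Build the alist once, at macroexpansion time, and close over it.
    `(let ((,alist ',(mapcar (lambda (tok)
                               (cons (first tok) (second tok)))
                             tokens)))
       (defun ,string->token (,string)
         ;; RASSOC searches by value; strings need an EQUAL test.
         (car (rassoc ,string ,alist :test #'equal)))
       (defun ,token->string (,token)
         (cdr (assoc ,token ,alist))))))
```
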
If you're familiar with how quasiquoting works then it should be pretty easy to decipher. I'll explain it a bit more in depth anyways, though.

Common Lisp HyperSpec: Backquote templating (known by Schemers as quasiquote, which is a better name IMO)

As is often the case, it's easier to tell what's going on by looking at an example final expansion. The usage of DEFTOKENS above would expand to the following:

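Give or take the exact GENSYM names, something like:

```
(let ((#:alist0 '((:text)            ; (:TEXT . NIL) prints as (:TEXT)
                  (:link . "=>")
                  (:heading-1 . "#")
                  (:heading-2 . "##")
                  (:heading-3 . "###")
                  (:list-item . "*")
                  (:quote . ">"))))
  (defun string->token (#:string1)
    (car (rassoc #:string1 #:alist0 :test #'equal)))
  (defun token->string (#:token2)
    (cdr (assoc #:token2 #:alist0))))
```
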
Really all the macro does is transform the list of tokens into a form that's easier to work with and define some accessor functions. You could just write this expansion directly, but keep in mind the rest of the implementation: there are many extensions to it that are discussed later, and in the gemroff parser I use it twice, to define the set of gemtext line types and the set of gemroff line types. I also use the generated list of token types later on in the renderer stage. Described here is a very isolated example.

Let's dive into it:

First, it generates some guaranteed-unique symbols with GENSYM that'll be used throughout the macro. If you don't do this and instead directly use an existing symbol, then those symbols could conflict with symbols the user of the macro might have defined and cause all sorts of issues (cf. Scheme's so-called hygienic macros, which are less powerful and moderately more difficult to write but will never cause issues like that). These GENSYM'd symbols are represented by the symbols beginning with #: in the expansion above. Note that this LET isn't actually in the generated expansion, just the symbols it creates.

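In the sketch, that's this part:

```
(let ((alist  (gensym "ALIST"))
      (string (gensym "STRING"))
      (token  (gensym "TOKEN")))
  ...)
```
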
We then get into the actual meat of the macro. It first uses MAPCAR to iterate over the list of tokens and transform them into an association list (alist).

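That's the ,(mapcar …) inside the backquote template; since it sits under a comma it runs at macroexpansion time, and the finished alist is quoted directly into the expansion:

```
`(let ((,alist ',(mapcar (lambda (tok)
                           (cons (first tok) (second tok)))
                         tokens)))
   ...)
```
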
An alist is structured like so:
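```
;; Schematic, with placeholder keys and values:
((key-1 . value-1)
 (key-2 . value-2)
 (key-3 . value-3))
```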
Basically, it's a list of pairs, where the `car` (first item in a pair) is the key and the `cdr` (second item in a pair) is the value.

— A previous post of mine: Some Inconsequential Lisp History; or, the Story of Associative Lists

It binds the alist to one of our GENSYM'd symbols inside a LET.

Inside of that LET it defines the accessor functions with the given names, which use the standard functions ASSOC and RASSOC to query the alist. This exploits the fact that DEFUN (DEFine a FUNction) always defines a global function even when it isn't at top level, and that the function it creates closes over the surrounding lexical environment; in other words the alist is visible to the accessor functions, but code outside of the macro cannot access it.

Wikipedia: Closure (computer programming)

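To illustrate the two queries: ASSOC searches pairs by key (the CAR) and RASSOC by value (the CDR), both return the whole matching pair, and RASSOC needs :TEST #'EQUAL here because strings aren't EQL:

```
(assoc :link '((:text) (:link . "=>")))                ; => (:LINK . "=>")
(rassoc "=>" '((:text) (:link . "=>")) :test #'equal)  ; => (:LINK . "=>")
(assoc :bazooper '((:text) (:link . "=>")))            ; => NIL
```
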
That's… literally it. The whole thing just makes a linked list mapping tokens to strings (or NIL for no string) and defines functions that search the list. It looks more complex because of the GENSYMs and quoting, but at its core it's really much simpler than the functionally equivalent macro in other languages. Notably there's no dealing with unwrapping return values as Optional; you just check if the result is NIL or not (which is essentially the same thing as unwrapping an Option in Rust, but 100x less of an eyesore IMO).

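For example (HANDLE-STRINGLESS-TOKEN being a made-up stand-in):

```
(let ((string (token->string token)))
  (if string
      (write-string string)
      (handle-stringless-token token)))
```
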
Making it More Robust

A noted limitation of this implementation versus any statically typed language (well, actually not Go or C, but we'll pretend those don't exist) is that the keyword symbols we're using to represent tokens are not limited to a specific set like an enum is. You could do something like `(token->string :bazooper)` and get NIL back, and with the implementation above you'd have no way to tell whether :BAZOOPER is a real token with no associated string like :TEXT or whether it's not part of the set of tokens at all. In practice this is rarely an issue, but I'm obsessed with ensuring code can handle every possible input, which includes dealing with random-ass keywords instead of the set of tokens you're expecting.

There are several ways to remedy the issue (return a second value that indicates whether it matched or not, return the whole pair from the alist instead of just its CAR or CDR, etc.), but my preferred way is to use Common Lisp's surprisingly useful and well-integrated type system.

Just add the following to the macro, with the parameters PREDICATE-NAME and TYPE-NAME added to its lambda list:

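In sketch form, these go inside the macro's backquote template alongside the DEFUNs:

```
(defun ,predicate-name (,token)
  ;; Return a plain boolean rather than the matched pair.
  (and (assoc ,token ,alist) t))

(deftype ,type-name ()
  ;; SATISFIES takes the name of a global one-argument predicate.
  '(satisfies ,predicate-name))
```
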
It defines a predicate function TOKENP that returns T (true) if the token exists and NIL (false) otherwise. It then defines a type TOKEN that says an object of type TOKEN is anything that satisfies the predicate TOKENP.

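That buys you the usual type-system integration:

```
(tokenp :link)         ; => T
(tokenp :bazooper)     ; => NIL
(typep :text 'token)   ; => T, even though it maps to NIL
(check-type x token)   ; signals a TYPE-ERROR if X isn't a token
```
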
It's at this point that you'd want to look into autogenerating the names for things: e.g. given the name TOKEN you'd generate the names TOKEN->STRING, STRING->TOKEN, TOKENP, and TOKEN. Or if this is a single-use macro not intended to be used anywhere else in your program nor as part of a library's API, just be lazy and hardcode the names of everything.

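One way is to INTERN a name built from the base symbol; here's a sketch with a little hypothetical helper (Alexandria provides a similar SYMBOLICATE):

```
(defun symbol-append (&rest parts)
  ;; Concatenate the names of symbols/strings and intern the result.
  (intern (format nil "~{~A~}" (mapcar #'string parts))))

(symbol-append 'token '-> 'string)  ; => TOKEN->STRING
(symbol-append 'token "P")          ; => TOKENP
```
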
Extending It

There are a lot of improvements you could make to it, and the great thing about Lisp is that it's so easy to add to and extend stuff. For example, as mentioned above you could automatically generate the accessor names à la DEFSTRUCT; or, instead of closing over the list of tokens, you could define a top-level variable (or constant) so other code can access the alist too.

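The latter is a one-line change to the sketch: swap the LET binding for a special variable (VARIABLE-NAME being one more hypothetical macro parameter) and have the accessors refer to it:

```
`(defparameter ,variable-name
   ',(mapcar (lambda (tok) (cons (first tok) (second tok)))
             tokens))
```
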
In my real implementation I added the additional information of how many arguments each line type expects, and the parser uses that to split the rest of the line appropriately. Think about how link lines in gemtext take two arguments; in gemroff many more line types take more than one argument (or zero arguments), so you can't readily special-case a single line type like in a gemtext parser.

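The exact shape of the real version isn't important, but imagine each entry growing an extra field:

```
(deftokens (string->token token->string)
  (:link "=>" 2)   ; link lines: a URL plus a label
  (:text nil  1))
```
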
A change you most likely shouldn't make is swapping the association list for a hash table. While a hash-table lookup is in general O(1) and an alist search is O(n), for maps with fewer than ~200–300 items an alist will actually be faster (and for a very small number of items, like the 30–40 keywords common in a tokenizer, it'll be much faster). The time to compute a key's hash is the same no matter how much data there is, and for a small table that constant cost exceeds the cost of linearly searching a short linked list. With some minor manual optimization, like sorting the tokens approximately from most to least common, an alist will be plenty fast. Note that the Rust version uses a linear search too, just with the Rust-idiomatic match statement instead of a Lisp-idiomatic association list.