This article sums it up perfectly. I was interested in building a compiler long before going to college and this was the most accessible body of work.
Building a recursive descent parser from scratch was an eye-opener to 17-year-old me on how a seemingly very complex problem that I had no idea how to approach can be made simple by breaking it down into the right primitives.
From the Publications section of that Wikipedia page:
>The April 1971 Communications of the ACM article "Program Development by Stepwise Refinement",[22][23] concerning the teaching of programming, is considered to be a classic text in software engineering.[24] The paper is considered to be the earliest work to formally outline the top-down method for designing programs.[25][26] The article was discussed by Fred Brooks in his influential book The Mythical Man-Month and was described as "seminal" in the ACM's brief biography of Wirth published in connection to his Turing Award.[27][28]
Wirth also wrote an extremely accessible book on Compiler Construction, using exactly the hand-written recursive descent parsing approach discussed by the OP.
The initial edition was published in 1976, in German, but the latest version is available online:
There are also parser generators like ANTLR (https://en.wikipedia.org/wiki/ANTLR) which take an input not unlike yacc's, but generate an LL parser using explicit code rather than yacc's table-driven LALR parsing.
Thank you. Just to confirm, by "accessible", do you mean easy to understand?
Anyway, I think I had come across that book on the net, but did not check it out at the time. I don't remember the exact reason; maybe it was because I didn't want to go into the subject of compilers yet and was only interested in interpreters, because I prefer to take things one step at a time.
Yes, accessible in the sense of being readable without extensive prior knowledge. If I recall correctly, I read the initial edition while still in high school.
"breaking things down into the right primitives" is the real key to programming. There are many books and web pages about algorithms, but I wish there were more searchable and browsable resources for how to approach problems through primitives.
The process of breaking a complex problem down into the right primitives requires great understanding of the original problem in the first place.
What's blocking me during programming usually are edge cases I had no idea about. It's still hard to find good material on compilers if you're not into reading dry-ass books. That's a me problem though, I simply can't force myself to read boring, facts-only content (one of the reasons why I love Beej's guides).
> The process of breaking a complex problem down into the right primitives requires great understanding of the original problem in the first place.
Yes, but with experience that just becomes a matter of recognizing problem and design patterns. When you see a parsing problem, you know that the simplest/best design pattern is just to define a Token class representing the units of the language (keywords, operators, etc.), write a NextToken() function to turn characters into tokens, then write a recursive descent parser using that (roughly as sketched below).
Any language may have its own gotchas and edge cases, but knowing that recursive descent is pretty much always going to be a viable design pattern (for any language you are likely to care about), you can tackle those when you come to them.
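To make that concrete, here is a minimal sketch of that shape in C++. The Token/NextToken/ParseExpr names and the toy expression grammar are purely illustrative, not anyone's actual code:

    // Minimal sketch: tokenizer + recursive descent parser for
    // expressions like "1 + 2 * 3".  Grammar:
    //   Expr   -> Term   { '+' Term }
    //   Term   -> Factor { '*' Factor }
    //   Factor -> NUMBER | '(' Expr ')'
    #include <cctype>
    #include <cstdlib>
    #include <iostream>
    #include <string>

    struct Token {
        enum Kind { Number, Plus, Star, LParen, RParen, End } kind;
        long value = 0;            // only meaningful for Number
    };

    class Parser {
    public:
        explicit Parser(std::string src) : src_(std::move(src)) { NextToken(); }

        long ParseExpr() {                       // Expr -> Term { '+' Term }
            long result = ParseTerm();
            while (tok_.kind == Token::Plus) { NextToken(); result += ParseTerm(); }
            return result;
        }

    private:
        long ParseTerm() {                       // Term -> Factor { '*' Factor }
            long result = ParseFactor();
            while (tok_.kind == Token::Star) { NextToken(); result *= ParseFactor(); }
            return result;
        }

        long ParseFactor() {                     // Factor -> NUMBER | '(' Expr ')'
            if (tok_.kind == Token::Number) { long v = tok_.value; NextToken(); return v; }
            if (tok_.kind == Token::LParen) {
                NextToken();
                long v = ParseExpr();
                if (tok_.kind != Token::RParen) Fail("expected ')'");
                NextToken();
                return v;
            }
            Fail("expected number or '('");
            return 0;
        }

        void NextToken() {                       // characters -> next Token
            while (pos_ < src_.size() && std::isspace(static_cast<unsigned char>(src_[pos_]))) ++pos_;
            if (pos_ >= src_.size()) { tok_ = {Token::End}; return; }
            char c = src_[pos_];
            if (std::isdigit(static_cast<unsigned char>(c))) {
                long v = 0;
                while (pos_ < src_.size() && std::isdigit(static_cast<unsigned char>(src_[pos_])))
                    v = v * 10 + (src_[pos_++] - '0');
                tok_ = {Token::Number, v};
                return;
            }
            ++pos_;
            switch (c) {
                case '+': tok_ = {Token::Plus};   return;
                case '*': tok_ = {Token::Star};   return;
                case '(': tok_ = {Token::LParen}; return;
                case ')': tok_ = {Token::RParen}; return;
                default:  Fail("unexpected character");
            }
        }

        [[noreturn]] void Fail(const char* msg) {
            std::cerr << "parse error: " << msg << "\n";
            std::exit(1);
        }

        std::string src_;
        size_t pos_ = 0;
        Token tok_;
    };

    int main() {
        std::cout << Parser("2 * (3 + 4)").ParseExpr() << "\n";   // prints 14
    }

Each grammar rule becomes one function, and the call stack does the work that an explicit parse stack would do in a table-driven parser.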
What language do you use parser combinators in, and what kind of grammar do you parse usually? Nom was terribly verbose and unergonomic even by Rust's standards. Haskell's Megaparsec/Parsec is good but yeah, it's Haskell, you need to handle multiple monads (Parser itself is monadic, then your AST state, and maybe some error handling) at once and that's where I got confused. But I appreciated the elegance.
I experimented with parser combinators in Haskell and Rust (nom), then moved on to parser generators in Rust (pest.rs), OCaml (Menhir), and Haskell (Happy), and finally ended up with Python's Lark - the speed of experimenting with different syntaxes/grammars is just insane.
Parser combinators are more of a concept than a library. You could make your own supporting just the stuff you need (a tiny sketch of the idea follows below). I like writing programs in languages I don't know or barely know. I usually just take one of the popular libraries in any given language.
For Rust I used Nom and I didn't mind it all that much, although I noticed it's quite baroque. If I had more to write, I'd probably make some wrappers or macros of my own for the most commonly used Nom snippets.
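For what it's worth, the hand-rolled version really can be tiny. A hedged C++ sketch of the concept, with all names invented for illustration: a parser is just a function from input to an optional (matched text, remaining input) pair, and combinators are functions that glue such parsers together:

    // Sketch of hand-rolled parser combinators: a parser is a function that
    // takes input and either fails or returns (matched text, remaining input).
    #include <functional>
    #include <iostream>
    #include <optional>
    #include <string>
    #include <string_view>
    #include <utility>

    using Result = std::optional<std::pair<std::string, std::string_view>>;
    using Parser = std::function<Result(std::string_view)>;

    // Match a single character satisfying a predicate.
    Parser satisfy(std::function<bool(char)> pred) {
        return [pred](std::string_view in) -> Result {
            if (!in.empty() && pred(in[0]))
                return std::pair{std::string(1, in[0]), in.substr(1)};
            return std::nullopt;
        };
    }

    // Run p then q, concatenating what they matched.
    Parser seq(Parser p, Parser q) {
        return [p, q](std::string_view in) -> Result {
            auto r1 = p(in);
            if (!r1) return std::nullopt;
            auto r2 = q(r1->second);
            if (!r2) return std::nullopt;
            return std::pair{r1->first + r2->first, r2->second};
        };
    }

    // Try p; if it fails, try q on the same input.
    Parser alt(Parser p, Parser q) {
        return [p, q](std::string_view in) { auto r = p(in); return r ? r : q(in); };
    }

    // Zero or more repetitions of p (p must consume input when it succeeds).
    Parser many(Parser p) {
        return [p](std::string_view in) -> Result {
            std::string matched;
            while (auto r = p(in)) { matched += r->first; in = r->second; }
            return std::pair{matched, in};
        };
    }

    int main() {
        Parser digit   = satisfy([](char c) { return c >= '0' && c <= '9'; });
        Parser number  = seq(digit, many(digit));          // one or more digits
        Parser sign    = alt(satisfy([](char c) { return c == '-'; }),
                             satisfy([](char c) { return c == '+'; }));
        Parser integer = alt(seq(sign, number), number);   // optional sign

        if (auto r = integer("-42abc"))
            std::cout << "matched \"" << r->first << "\", rest \"" << r->second << "\"\n";
    }

Real libraries like nom or Parsec add typed results, error reporting and backtracking control on top, but the core idea is no more than this.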
I've used tree-sitter to generate my parsers in Rust, just working with the untyped syntax tree it produces, and it gives you error tolerance for free. It's a bit of setup at first though, requiring an extra crate for the generated parser, but editing it from there saves so much time.
What do you mean exactly by "error-tolerance"? Is it like each node is wrapped in a result type that you have to match against every time you visit it, even though you know for a fact that it is not empty, or something like that?
I suppose that one of the pros of using tree-sitter is its portability? For example, I could define my grammar once and use it both to parse my code and to do proper syntax highlighting in the browser, with the same library and the same grammar? Is that correct? It is also used extensively in Neovim to define syntax for languages? Otherwise I would have had to slightly modify the grammar.
Oh no no, with tree-sitter you get an untyped syntax tree. That means you have a Cursor object to walk the tree, which creates Node objects as you traverse; these have a "kind" (the name of the tree-sitter node), a span, and children. (I recommend using the Rust tree-sitter bindings themselves, not the Rust wrapper rust-sitter.)
Yes, portability like that is a huge benefit, though I personally haven't utilized it for that yet. I just use it as an error-tolerant frontend to my compiler.
As to how errors are reported, tree-sitter creates an ERROR or MISSING node when a particular subtree has invalid syntax. I've found that it never leaves a node in an invalid state (so it would never create a binaryop(LeftNode(...), Op, ERROR) if RightNode is not optional; instead it would create an ERROR for the binaryop too). This allows you to safely unwrap known fields. ERROR nodes only really bunch up in repeat()s and optional()s, where you would implicitly handle them.
Error tolerance in this context means the parser produces a walkable AST even if the input code is syntactically invalid, instead of just throwing/reporting the error. It’s useful for IDEs, where the code is often in an invalid state as the developer is typing, but you still want to be able to report diagnostics on whatever parts of the code are syntactically valid.
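Purely to illustrate that, here is a rough sketch against tree-sitter's C API (which the Rust bindings wrap), not the commenter's actual code. tree_sitter_mylang() is a placeholder; real generated grammars export tree_sitter_<language>():

    // Rough sketch: parse possibly-broken source with tree-sitter's C API and
    // report only the ERROR / missing nodes; the rest of the tree stays walkable.
    #include <tree_sitter/api.h>
    #include <cstdint>
    #include <cstdio>
    #include <cstring>
    #include <string>

    // Placeholder for whatever generated grammar you link against
    // (e.g. tree_sitter_c(), tree_sitter_json(), ...).
    extern "C" const TSLanguage *tree_sitter_mylang();

    static void report_errors(TSNode node) {
        if (std::strcmp(ts_node_type(node), "ERROR") == 0 || ts_node_is_missing(node)) {
            std::printf("syntax problem at bytes %u..%u (%s)\n",
                        ts_node_start_byte(node), ts_node_end_byte(node),
                        ts_node_is_missing(node) ? "missing" : "error");
            // Keep walking: valid children can still exist underneath an ERROR node.
        }
        uint32_t n = ts_node_child_count(node);
        for (uint32_t i = 0; i < n; ++i)
            report_errors(ts_node_child(node, i));
    }

    int main() {
        std::string source = "fn broken( {";          // deliberately invalid input
        TSParser *parser = ts_parser_new();
        ts_parser_set_language(parser, tree_sitter_mylang());

        TSTree *tree = ts_parser_parse_string(parser, nullptr,
                                              source.c_str(),
                                              static_cast<uint32_t>(source.size()));
        TSNode root = ts_tree_root_node(tree);

        // The tree is still fully walkable; invalid regions show up as ERROR
        // or "missing" nodes rather than aborting the whole parse.
        if (ts_node_has_error(root))
            report_errors(root);

        ts_tree_delete(tree);
        ts_parser_delete(parser);
    }

The key point is that, with a language set, parsing always yields a tree: invalid regions show up as ERROR or missing nodes while everything else remains normal, typed-by-name nodes you can traverse.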
That's a good point - recursive descent as a general lesson in program design, in addition to being a good way to write a parser.
Table-driven parsers (using yacc etc.) used to be emphasized in old compiler-writing books such as Aho & Ullman's famous "dragon (front cover) book". I'm not sure why - maybe partly efficiency for the slower computers of the day, and partly because in the infancy of computing a more theoretical/algorithmic approach seemed more sophisticated and preferable (the canonical table-driven parser-building algorithm was one of Knuth's).
Nowadays it seems that recursive descent is the preferred approach for compilers because it's ultimately more practical and flexible. Table-driven parsing can still be a good option for small DSLs and simple parsing tasks, but recursive descent is so easy that it's hard to justify anything else, and LLM code generation now makes that truer than ever!
There is a huge difference in complexity between building a full-blown commercial quality optimizing compiler and a toy one built as a learning exercise. Using something like LLVM as a starting point for a learning exercise doesn't seem very useful (unless your goal is to build real compilers) since it's doing all the heavy lifting for you.
I guess you can argue about how much can be cut out of a toy compiler for it to still be a useful learning exercise in both compilers and tackling complex problems, but I don't see any harm in going straight from parsing to code generation, cutting out AST building and of course any IR and optimization. The problems this direct approach causes for code generation and optimization can themselves be a lesson in why a non-toy compiler uses those stages!
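As a hedged illustration of that "straight from parsing to code generation" point, here is a toy C++ sketch where the grammar functions emit instructions for a made-up stack machine as they parse - no tokens beyond single characters, no AST, no IR:

    // Sketch: recursive descent that emits code directly while parsing.
    // Output is a made-up stack machine: PUSH n, ADD, MUL.  No error handling.
    #include <cstddef>
    #include <iostream>
    #include <string_view>

    struct Emitter {
        std::string_view src;
        std::size_t pos = 0;

        char peek() const { return pos < src.size() ? src[pos] : '\0'; }

        void expr() {                       // Expr -> Term { '+' Term }
            term();
            while (peek() == '+') { ++pos; term(); std::cout << "ADD\n"; }
        }
        void term() {                       // Term -> Factor { '*' Factor }
            factor();
            while (peek() == '*') { ++pos; factor(); std::cout << "MUL\n"; }
        }
        void factor() {                     // Factor -> DIGIT | '(' Expr ')'
            if (peek() == '(') { ++pos; expr(); ++pos; /* skip ')' */ return; }
            std::cout << "PUSH " << peek() << "\n";
            ++pos;
        }
    };

    int main() {
        Emitter{"2*(3+4)"}.expr();
        // Prints: PUSH 2, PUSH 3, PUSH 4, ADD, MUL (one per line)
    }

The immediately visible cost is that the output order is fixed by the parse - there is nothing to reorder or optimize - which is exactly the lesson about why a non-toy compiler keeps an AST and IR.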
A fun approach I used at work once, wanting to support a pretty major C subset as the language for a programmable regression test tool, was even simpler... Rather than having the recursive descent parser generate code, I just had it generate executable data structures - subclasses of Statement and Expression base classes, with virtual Execute() and Value() methods respectively, so that the parsed program could be run by calling program->Execute() on the top-level object. The recursive descent functions just returned these statement or expression objects directly. To give a flavor of it, the ForLoopStatement subclass held pointers to the initialization, test and increment expressions, and its Execute() method could just call testExpression->Value() etc.
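A rough sketch of the shape being described, with class and member names guessed for illustration rather than taken from the original code:

    // Sketch of "executable data structures": the parser returns these objects
    // directly, and running the program is just Execute().
    #include <iostream>
    #include <map>
    #include <memory>
    #include <string>

    using Env = std::map<std::string, int>;   // variable name -> value

    struct Expression {
        virtual ~Expression() = default;
        virtual int Value(Env &env) const = 0;
    };
    struct Statement {
        virtual ~Statement() = default;
        virtual void Execute(Env &env) const = 0;
    };

    using ExprPtr = std::unique_ptr<Expression>;
    using StmtPtr = std::unique_ptr<Statement>;

    struct Number : Expression {
        int n;
        explicit Number(int n) : n(n) {}
        int Value(Env &) const override { return n; }
    };
    struct Variable : Expression {
        std::string name;
        explicit Variable(std::string name) : name(std::move(name)) {}
        int Value(Env &env) const override { return env[name]; }
    };
    struct Add : Expression {
        ExprPtr lhs, rhs;
        Add(ExprPtr l, ExprPtr r) : lhs(std::move(l)), rhs(std::move(r)) {}
        int Value(Env &env) const override { return lhs->Value(env) + rhs->Value(env); }
    };
    struct Less : Expression {
        ExprPtr lhs, rhs;
        Less(ExprPtr l, ExprPtr r) : lhs(std::move(l)), rhs(std::move(r)) {}
        int Value(Env &env) const override { return lhs->Value(env) < rhs->Value(env); }
    };

    struct Assign : Statement {
        std::string name; ExprPtr value;
        Assign(std::string n, ExprPtr v) : name(std::move(n)), value(std::move(v)) {}
        void Execute(Env &env) const override { env[name] = value->Value(env); }
    };
    struct Print : Statement {
        ExprPtr value;
        explicit Print(ExprPtr v) : value(std::move(v)) {}
        void Execute(Env &env) const override { std::cout << value->Value(env) << "\n"; }
    };
    struct ForLoopStatement : Statement {
        StmtPtr init;  ExprPtr test;  StmtPtr increment;  StmtPtr body;
        ForLoopStatement(StmtPtr i, ExprPtr t, StmtPtr inc, StmtPtr b)
            : init(std::move(i)), test(std::move(t)), increment(std::move(inc)), body(std::move(b)) {}
        void Execute(Env &env) const override {
            for (init->Execute(env); test->Value(env); increment->Execute(env))
                body->Execute(env);
        }
    };

    int main() {
        // What the parser would build for:  for (i = 0; i < 3; i = i + 1) print i;
        Env env;
        ForLoopStatement program(
            std::make_unique<Assign>("i", std::make_unique<Number>(0)),
            std::make_unique<Less>(std::make_unique<Variable>("i"), std::make_unique<Number>(3)),
            std::make_unique<Assign>("i", std::make_unique<Add>(std::make_unique<Variable>("i"),
                                                                std::make_unique<Number>(1))),
            std::make_unique<Print>(std::make_unique<Variable>("i")));
        program.Execute(env);    // prints 0, 1, 2 on separate lines
    }

The recursive descent functions would just construct and return these objects, so "compiling" and "loading" the program are effectively the same step.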
You can do this with SES on AWS or Mailgun.
We use SES for this exact use case. Giving agents programmatic access to an email inbox and programmatically creating inboxes.
Yes you can use SES or Mailgun. But you would also need to build inboxes, email threading, attachment parsing, semantic search, structured data extraction, and more.
They advertise in the sidebar, offer discounts for the first month, and run occasional free trials of Pro features. You don't have to try to use a Pro feature to get the sales pitch.
The problem is, it's not necessarily the job of the person who is tasked with doing it.
These are the kinds of things that fall through the cracks in smaller companies, and there's no expert to build this disaster recovery plan because there is no risk or compliance department.
It falls into the lap of whoever is dealing with the audits or whoever has a reputation for getting things done and unblocking people.
Shoe Dog - Phil Knight.
A great read about the struggles of how Nike started. But I was left unsatisfied, as the book ends with the company going public and doesn't cover its rise in the late '80s/'90s and how that went down.