Hacker News
Writing a Unix Shell – Part II (indradhanush.github.io)
210 points by dhanush on June 13, 2017 | 19 comments



If you want to learn more about writing a shell from an undergraduate coursework perspective (and far closer to how bash does things), this is a chapter entirely on writing your own shell: https://www.cs.purdue.edu/homes/grr/SystemsProgrammingBook/B...

That PDF covers the basics of making your own shell (i.e., splitting input into tokens with a lexer, parsing the resulting tokens with yacc [a parser generator], I/O redirection, piping, executing commands, wildcarding, interrupts, environment variables, history, subshells, revising what you've typed without having to retype it, etc.).

Every undergraduate CS major at Purdue University is required to do the infamous shell lab (essentially, recreate csh). This project really taught me how shells work. Before doing this project, I could do the minimum in shells, but now I'm fairly competent at it.


Since not a lot of folks at my shop knew this: wordexp() is the POSIX library function for lexing strings "like the shell".


Note that wordexp() will also, unless explicitly told otherwise, perform command substitution and thus is capable of executing other processes. Be wary of using it on untrusted input.
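To make that warning concrete, here's a hedged sketch (hypothetical print_words() helper) that passes WRDE_NOCMD so command substitution is rejected rather than executed:

```c
#include <stdio.h>
#include <wordexp.h>

/* Sketch: split a line shell-style, refusing command substitution.
   WRDE_NOCMD makes wordexp() fail (WRDE_CMDSUB) on `...` or $(...)
   instead of running the embedded command. */
static void print_words(const char *line) {
    wordexp_t we;
    if (wordexp(line, &we, WRDE_NOCMD) != 0) {
        fprintf(stderr, "refused or failed to expand: %s\n", line);
        return;
    }
    for (size_t i = 0; i < we.we_wordc; i++)
        printf("argv[%zu] = %s\n", i, we.we_wordv[i]);
    wordfree(&we);
}
```

With this, print_words("one 'two three' four") yields three words, while print_words("`rm -rf /`") is refused outright.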


I would not recommend using a lexer/parser generator for writing a shell, especially if you want to support things like backquote substitution and here-docs, since parsing and evaluating are interleaved. A recursive-descent parser is more flexible and better suited to the task.


Yes, this great article is all about how the maintainer of Bash regrets that it uses a parser generator (yacc):

http://www.aosabook.org/en/bash.html

I've mentioned this here before, but I was able to parse bash almost entirely up front, without interleaving parsing and execution. The first half of my blog [1] is about this.

To make a long story short, I use four interleaved parsers, and they ask the lexer to change state at the appropriate points. It's three separate recursive descent parsers, and then a Pratt parser for C-style arithmetic expressions.
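This is not Oil's actual code, but the Pratt idea for the arithmetic part can be sketched in a few lines of C: precedence climbing over + - * / with parentheses, where each binary operator pulls in a right operand parsed at a higher minimum precedence.

```c
#include <ctype.h>
#include <stdlib.h>

static const char *p;  /* cursor into the expression being parsed */

static long parse_expr(int min_prec);

static long parse_atom(void) {
    while (isspace((unsigned char)*p)) p++;
    if (*p == '(') {                 /* parenthesized sub-expression */
        p++;
        long v = parse_expr(0);
        p++;                         /* skip ')' */
        return v;
    }
    return strtol(p, (char **)&p, 10);
}

static int prec(char op) {
    switch (op) {
    case '+': case '-': return 1;
    case '*': case '/': return 2;
    }
    return 0;                        /* not a binary operator */
}

static long apply(char op, long a, long b) {
    switch (op) {
    case '+': return a + b;
    case '-': return a - b;
    case '*': return a * b;
    default:  return a / b;
    }
}

/* Pratt / precedence climbing: parsing the right operand at
   prec(op) + 1 gives left associativity at equal precedence. */
static long parse_expr(int min_prec) {
    long lhs = parse_atom();
    for (;;) {
        while (isspace((unsigned char)*p)) p++;
        char op = *p;
        if (!prec(op) || prec(op) < min_prec)
            return lhs;
        p++;
        lhs = apply(op, lhs, parse_expr(prec(op) + 1));
    }
}
```

Setting p = "1 + 2 * 3" and calling parse_expr(0) evaluates with the usual precedence; the same loop-plus-recursion shape extends to the full C-style operator set shells support in $(( )).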

It works very nicely, and surprisingly the algorithm is efficient, requiring only two tokens of lookahead: http://www.oilshell.org/blog/2016/11/17.html

Aside from lookahead, the lexer reads the text exactly once, not 2, 3, or 4 times.

There are two things you can't parse up front that I know of:

- Associative array syntax, but this is bash 4.0-specific: http://www.oilshell.org/blog/2016/10/20.html

- A crazy instance of runtime parsing of arithmetic expressions inside strings, AFTER variable substitution: https://github.com/oilshell/oil/issues/3 (all shells I tested implement this, not just bash)

Also there is one issue that would require arbitrary lookahead:

- Bash does arbitrary lookahead to distinguish $((1+2)) and $((echo hi)), the former being arithmetic, and the latter being a subshell inside a command sub, but it's not required by POSIX: http://www.oilshell.org/blog/2016/11/18.html

In bash, brace substitution is really metaprogramming which can be done at parse time. You can manipulate program fragments, e.g. a{b,$((i++)),c,d}e, and it doesn't rely on any program input.
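The fragment manipulation can be seen in a single-level sketch (hypothetical brace_expand(); real bash also handles nesting and ranges, and this sketch treats each alternative as plain text):

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical single-level brace expansion: "a{b,c,d}e" becomes
   "abe" "ace" "ade". Real shells also handle nesting ({a,{b,c}})
   and ranges ({1..5}); this sketch does neither. */
static int brace_expand(const char *s, char out[][64]) {
    const char *open = strchr(s, '{');
    const char *close = open ? strchr(open, '}') : NULL;
    if (!open || !close) {            /* no braces: one word, unchanged */
        snprintf(out[0], 64, "%s", s);
        return 1;
    }
    int n = 0;
    const char *item = open + 1;
    while (item <= close) {
        const char *comma = memchr(item, ',', (size_t)(close - item));
        const char *end = comma ? comma : close;
        /* prefix + one alternative + suffix */
        snprintf(out[n++], 64, "%.*s%.*s%s",
                 (int)(open - s), s, (int)(end - item), item, close + 1);
        item = end + 1;
    }
    return n;
}
```

Note this works purely on program text, which is why bash can do it at parse time.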

In ksh, brace substitution is done AFTER variable substitution, so it's another level of runtime parsing.

Globbing is done AFTER variable substitution in all shells.

But yes, lex and yacc are totally unsuitable for parsing shell. It's unbelievably awkward to express, and it results in more code, because the parser has to be used for interactive input (the $PS2 problem), and it should also be used for command completion, e.g. completing something like 'echo $(ls /b<TAB>...' .

It also forces you into parsing at runtime, as far as I can tell. The yylex() interface involves a lot of globals and the generated parsers probably don't compose as I would like.

[1] http://www.oilshell.org/blog/


It's much simpler than you would want to paint it.

  [dozzie@alojzy dozzie]$ `echo for` i in 1 2 3 4; do echo $i; done
  zsh: parse error near `do'
  [dozzie@alojzy dozzie]$ bash -c '`echo for` i in 1 2 3 4; do echo $i; done'
  bash: -c: line 0: syntax error near unexpected token `do'
  bash: -c: line 0: ``echo for` i in 1 2 3 4; do echo $i; done'
I think a parser generator will be sufficient; no need to resort to I-can't-write-grammars recursive descent.

Unless you were talking about csh syntax; that one I never learned, so I don't know if it is context-free.


> Every undergraduate CS major at Purdue University is required to do the infamous shell lab

Huh. Is the course based on CMU's book "Computer Systems: A Programmer's Perspective"? The CS:APP book is used by like a hundred schools nationwide and there's a shell lab there too: http://csapp.cs.cmu.edu/3e/shlab.pdf

My school (the University of Utah) used this book for one of our courses. Worse than the shell lab was implementing malloc by hand, haha. Rough weekend right there.


The undergrad OS class at Purdue has a malloc-from-scratch lab and a shell lab. Really great class. I don't know what, if anything, it is based on; they had us buy the "Dinosaur" book, but I don't remember ever using it. That was a while ago now, though.


My school did it as well; the shell lab was our final project.


Thanks for continuing your series.

This is a great illustration of how non-trivial a production quality shell is.

- Parsing input is "tricky"

- Each builtin needs comprehensive error handling


IRL, shells don't do string manipulation (well, technically everything becomes string manipulation at some point, but in this context not in the normal sense of the term). Shells generally use a lexer to split inputs up into tokens (often specified with regexes) [0] and then make sense of the tokens using a parser (the most famous parser generator being yacc [1]).

[0]: I was going to link to Bash's lex file here, but they appear to do something funky which would require a non-trivial amount of time to find, understand, and write here. So, you'll just have to take my word on this. I give you wikipedia as a substitute: https://en.wikipedia.org/wiki/Lexical_analysis

[1]: https://git.savannah.gnu.org/cgit/bash.git/tree/parse.y


The lexer for bash is inside that file, parse.y -- see yylex(), which calls read_token(). It doesn't use lex; it's written by hand.

I'm not sure what you mean that shells don't do string manipulation. Almost ALL they do is string manipulation.

That's true for the shell interpreter, which has to make sense of the input program, and for user programs, which are processing argv strings like file system paths, and stdin.

There are actually a handful of different parsers inside bash, which I mention here: http://www.oilshell.org/blog/2016/10/26.html

Brace substitution is another little parser as well. And globbing, and regex, both of which need their own parsers. (bash has its own glob parser, but some shells use libc's glob implementation). bash is really 4-7 sublanguages in one.
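For the libc route, the matching half of globbing is a single call; fnmatch() answers "does this name match this pattern?", and glob() layers the directory walk on top of it. A minimal wrapper:

```c
#include <fnmatch.h>

/* fnmatch() is libc's glob-pattern matcher. It returns 0 on a
   match, so a shell using libc only needs to feed it each
   directory entry; glob() does exactly that walk internally. */
static int matches(const char *pattern, const char *name) {
    return fnmatch(pattern, name, 0) == 0;
}
```

For example, matches("*.y", "parse.y") is true while matches("*.c", "parse.y") is false; a shell with its own glob parser reimplements this matching itself.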

The annoying thing about shell is that it makes it impossible NOT to do string manipulation in your program, because there is all this implicit stuff like word splitting.


One of my takeaways from TFA was along the lines of…

"Hmmm… he's using strtok, that's not how a real shell would work. What would a minimal shell, without scripting, pipes, redirects etc. do? Just correctly parsing legal file paths (which TFA needs to correctly implement 'cd') is well out of scope of a small article like this."


Right, a real shell obviously can't use strtok. If you're leaving out pipes, redirects, and any control flow, then separating a shell string into words for the argv[] array is fairly similar to lexing a C-escaped string (e.g. in C, Java, Python, JavaScript).

You have backslashes, single quotes, and double quotes basically. Traditionally this is done with switch statement in a loop in C.
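That switch-in-a-loop splitter can be sketched as follows (hypothetical split_words(); it simplifies the double-quote rules so backslash escapes any character, where a real shell only special-cases a few):

```c
#include <string.h>

/* Simplified word splitter: handles backslash escapes, single
   quotes, and double quotes, with no expansion or recursion.
   Writes up to `max` words into `words` and returns the count. */
static int split_words(const char *s, char words[][64], int max) {
    int n = 0, len = 0, in_word = 0;
    while (*s && n < max) {
        char c = *s++;
        if (!in_word && (c == ' ' || c == '\t'))
            continue;                      /* skip leading whitespace */
        switch (c) {
        case ' ': case '\t':               /* whitespace ends a word */
            words[n][len] = '\0'; n++; len = 0; in_word = 0;
            break;
        case '\\':                         /* next char taken literally */
            in_word = 1;
            if (*s) words[n][len++] = *s++;
            break;
        case '\'':                         /* single quotes: all literal */
            in_word = 1;
            while (*s && *s != '\'') words[n][len++] = *s++;
            if (*s) s++;
            break;
        case '"':                          /* double quotes: \ escapes */
            in_word = 1;
            while (*s && *s != '"') {
                if (*s == '\\' && s[1]) s++;
                words[n][len++] = *s++;
            }
            if (*s) s++;
            break;
        default:
            in_word = 1;
            words[n][len++] = c;
        }
    }
    if (in_word) { words[n][len] = '\0'; n++; }
    return n;
}
```

So split_words("echo 'a b' c", w, 8) gives three words: echo, a b, c.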

But that is not a good approach for a real shell. Even inside double quotes you can have a fully recursive program, like:

    $ echo "hi ${v1:-A${v2:-X${v3}Y}B}"
    hi AXYB
Once you have recursion then you need some kind of parser, not just a lexer.
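A hedged illustration of why: a recursive-descent expander for just the ${name:-default} form, where the default may itself contain ${...} (hypothetical expand(), with variables read from getenv()):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical recursive expander for ${name:-default} only. An
   unset or empty variable picks the default, which is itself
   expanded recursively. Returns a pointer to the unconsumed '}'
   (or NUL) so recursive calls know where their default ends. */
static const char *expand(const char *s, char *out) {
    while (*s && *s != '}') {
        if (s[0] == '$' && s[1] == '{') {
            char name[64], def[256] = "";
            char *n = name;
            s += 2;
            while (*s && *s != ':' && *s != '}')
                *n++ = *s++;
            *n = '\0';
            if (s[0] == ':' && s[1] == '-')
                s = expand(s + 2, def);   /* recurse into the default */
            if (*s == '}')
                s++;                      /* consume this ${...}'s brace */
            const char *val = getenv(name);
            strcat(out, (val && *val) ? val : def);
        } else {
            size_t len = strlen(out);     /* plain character: copy through */
            out[len] = *s++;
            out[len + 1] = '\0';
        }
    }
    return s;
}
```

With V1, V2, V3 all unset, expanding "hi ${V1:-A${V2:-X${V3}Y}B}" into an empty buffer produces "hi AXYB", matching the example above; a flat lexer can't track that nesting.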


I've been mocked on HN for saying this before, but Bash and other shells of its ilk are programming languages in their own right. I mean, sure, you're dependent on the suite of tools in $PATH to do anything useful, but that's not much different from the standard libraries that make modern languages so powerful.


I have a hard time seeing what there is to mock about your opinion of shells. I absolutely consider them languages - better at some things, worse at others.


Still missing the check for fork returning -1, which would make your waitpid() hang forever, as -1 waits on all p̶i̶d̶s̶,̶ ̶e̶v̶e̶n̶ ̶i̶n̶i̶t̶.̶ children.

Edit: less problematic than my original wrong guess, but still bad for a shell, which eventually supports multiple concurrent children.


Wrong, waitpid(-1, ...) waits on all child processes of the current process, and would return -1 with errno set to ECHILD if there are no children.

The point about missing error checking still stands though!
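For completeness, the checked sequence being discussed looks roughly like this (hypothetical run() helper, not the article's code): a failed fork() must return early rather than fall through to waitpid().

```c
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Sketch of fork/exec/wait with the missing error checks added. */
static int run(char *const argv[]) {
    pid_t pid = fork();
    if (pid < 0) {                  /* fork failed: there is no child */
        perror("fork");
        return -1;
    }
    if (pid == 0) {                 /* child: replace the process image */
        execvp(argv[0], argv);
        perror("execvp");           /* only reached if exec failed */
        _exit(127);
    }
    int status;
    /* Wait for this specific child, not -1 ("any child"), so a shell
       that later grows background jobs doesn't reap the wrong one. */
    if (waitpid(pid, &status, 0) < 0) {
        perror("waitpid");
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Here run((char *[]){"true", NULL}) returns the command's exit status, and every failure path is distinguishable from a successful wait.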


Ah, yes, any child...thanks for the catch.





