It's actually not that hard to deal with text, even in partially edited states. It's just that most people don't know how to build a decent incremental parser with fairly good error recovery, but some of us do.
But the hard part isn't dealing with invalid parses and error recovery. The hard part is dealing with completely valid bits of code that aren't yet finished:
(doseq [x (range|cursor|)]
)
You can wait until the end of time, but that won't finish ;) Whereas I probably wanted to get (range 10) out before it went off and looped forever.
Text also doesn't provide you with stable IDs. If I previously had function "foo" in a file and now that's gone, but there's a function "bar" in the same place with roughly the same code, does that mean it was renamed? Or did foo get deleted, so we need to surface that anything using foo is an error? How would you know? The only thing you have to go on is position, and position can change for all sorts of other reasons. I ran into this problem when trying to do functions in single editors: when the file can change in any way, reconciling that file with your previous view of the world is tremendously difficult.
I don't have a problem with this. If you accidentally encode an infinite loop, you just break it on the next code change. You can memoize your token stream in an incremental lexer and use tokens as stable IDs: you use the same symbol for bar as you did for foo because the token/tree start position never changed; only the contents of the existing token happened to change! This is what I do in YinYang, and it works well. Of course, to get here, you have to be 100% incremental and not use existing batch tools (or adapt them to be incremental first).
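
To make that concrete, here is a tiny sketch (my own illustration in Python, not YinYang's actual code) of what "tokens as stable IDs" buys you: the token object itself survives an edit that only rewrites its contents, so whatever was attached to it carries over from foo to bar for free.

    class Token:
        def __init__(self, start, text):
            self.start = start     # start offset; an in-place edit doesn't move it
            self.text = text       # contents; an edit may rewrite this
            self.symbol = None     # whatever later passes attach (symbols, trees)

    def edit_in_place(token, new_text):
        # The edit falls entirely inside this token, so we keep the same
        # Token object and only replace its contents.
        token.text = new_text
        return token

    name = Token(start=4, text="foo")
    name.symbol = "sym#42"          # hypothetical symbol ID attached on first parse

    edit_in_place(name, "bar")      # the rename from foo to bar
    assert name.symbol == "sym#42"  # same stable ID, so references stay resolved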
I agree that the problem is not error recovery; it is more generally described by Peter Naur's "Programming as Theory Building" [1]. In that view, programming is hard because creating a theory collectively, and teaching other people about it, is hard. Making it easier is therefore solving the problem Douglas Engelbart envisioned solving: augmenting the human intellect.
I wrote a workshop paper last year, but it wasn't very deep. Actually, incremental parsing, at least the way I do it, doesn't really involve any special algorithms; here are the three points:
* Make your lexer incremental first: memoize tokens between edits so you have something to attach trees to that will still be there after the next edit (sketched after this list).
* Match braces (or determine indented blocks) separately, before you even bother parsing; see the brace-matching sketch after this list. This makes incremental parsing more robust, and it is really easy to do without parse information (unless your language overloads brace lexemes, the way Java generics do; Scala didn't have this problem, however). Also, autocomplete braces even if your users really hate that, because otherwise you won't have anything nice to reason about 99% of the time (indentation-based languages are really better here for obvious reasons). Tell them to just use emacs if they don't like it.
* The last thing to do is to memoize parse trees (by attaching them to your memoized tokens) and then replay parsing on trees whenever their parses could have changed. So if there is an edit, invalidate the innermost tree, or multiple trees if the edit occurs at a boundary; if a child tree's parse changes, invalidate its parents (sketched below). Log your symbol table additions so you can undo them when a parse no longer produces them (or the parse tree is deleted); trace your symbol table references so they can be replayed if the binding for the name being looked up changes; and don't worry about type checking in a separate pass, because you can just replay a parse multiple times if necessary.
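
For the first point, roughly what the token memoization looks like (a toy lexer of my own, not the one from the paper): on an edit you re-lex only the damaged slice and splice the old token objects back in on either side, shifting the offsets to the right of the edit. A real implementation can be less conservative about tokens touching the edit boundary.

    import re

    class Token:
        def __init__(self, start, text):
            self.start, self.text = start, text
        @property
        def end(self):
            return self.start + len(self.text)

    def lex(text, base=0):
        # Toy lexer: identifiers/numbers and single punctuation characters.
        return [Token(base + m.start(), m.group())
                for m in re.finditer(r"\w+|[^\w\s]", text)]

    def relex(old_tokens, new_text, edit_start, removed, inserted):
        delta = inserted - removed
        edit_end = edit_start + removed
        keep_left  = [t for t in old_tokens if t.end < edit_start]
        keep_right = [t for t in old_tokens if t.start > edit_end]
        for t in keep_right:
            t.start += delta                     # same objects, shifted offsets
        lo = keep_left[-1].end if keep_left else 0
        hi = keep_right[0].start if keep_right else len(new_text)
        damaged = lex(new_text[lo:hi], base=lo)  # only this slice is re-lexed
        return keep_left + damaged + keep_right

    old = lex("(range)")
    new = relex(old, "(range 10)", edit_start=6, removed=0, inserted=3)
    assert [t.text for t in new] == ["(", "range", "10", ")"]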
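
For the second point, the brace pass really is just a stack over the token stream, run before any parsing; something like this sketch (again mine, and it ignores the overloaded-lexeme caveat above):

    def match_braces(tokens, pairs={"(": ")", "[": "]", "{": "}"}):
        opens = set(pairs)
        closes = {close: open_ for open_, close in pairs.items()}
        stack, matches, errors = [], {}, []
        for i, t in enumerate(tokens):
            if t.text in opens:
                stack.append(i)
            elif t.text in closes:
                if stack and tokens[stack[-1]].text == closes[t.text]:
                    matches[stack.pop()] = i   # index of open -> index of close
                else:
                    errors.append(i)           # stray or mismatched close
        errors.extend(stack)                   # unclosed opens
        return matches, errors

The matched spans give you well-nested regions to hang parse trees on even when the tokens in between are garbage, which is also why autocompleting braces pays off.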
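
And for the last point, the invalidation is a walk from the edited token up through the trees that own it; here is the shape of it as a sketch (symbol-table logging left out):

    class TreeNode:
        def __init__(self, parent=None):
            self.parent = parent
            self.valid = True
            self.children = []
            self.tokens = []       # memoized tokens this node was parsed from

        def attach(self, token):
            self.tokens.append(token)
            token.owner = self     # the token remembers its innermost tree

    def invalidate_from(token):
        # An edit rewrote this token: its innermost tree must be re-parsed,
        # and here the whole spine of parents is eagerly marked as well.
        node = getattr(token, "owner", None)
        while node is not None and node.valid:
            node.valid = False
            node = node.parent

    def replay(node, parse_node):
        # parse_node(node) re-parses one node from its tokens; valid subtrees
        # are reused untouched, only invalidated nodes are redone.
        if not node.valid:
            parse_node(node)
            node.valid = True
        for child in node.children:
            replay(child, parse_node)

This sketch invalidates every ancestor up front, which is the lazy-to-write version; the scheme described above only bubbles invalidation up when a child's parse actually changes, which is less work.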
The last point is quite fascinating and something I'm working on right now to turn into a more general programming paradigm. Check out: