This brings back old memories. We were given this task as an assignment for a course in college called "Theory of Computation". I stayed up for 3 days without sleep to implement this thing. I did it in C++ though, so a lot of that time was due to the choice of language.
Edit: Forgot to add, got 140 points on it (out of 100; there was a bonus). Yeah, I'm awesome.
Indeed. In college, I implemented a regular expression compiler in Java, targeting JVM bytecodes directly, which ended up outperforming java.util.regex of the time (JDK 1.4).
For regular expressions in actual practice, the expression is often kept as an NFA: features like captures with backreferences cannot be expressed as a DFA, and the NFA-to-DFA conversion can, in pathological cases, cause an exponential blowup in the number of states.
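That blowup is easy to see concretely. Here's a small sketch (my own illustration, not from the thread) using the textbook language "the n-th symbol from the end is 'a'": an NFA needs only n+1 states, but the subset construction reaches 2^n DFA states.

```python
def nfa_for_nth_from_end(n):
    """NFA over {a, b} accepting strings whose n-th symbol from the end is 'a'.

    State 0 loops on both symbols and guesses 'this a is n-th from the end'
    by also moving to state 1; states 1..n-1 just count down; state n accepts.
    """
    delta = {(0, 'a'): {0, 1}, (0, 'b'): {0}}
    for i in range(1, n):
        delta[(i, 'a')] = {i + 1}
        delta[(i, 'b')] = {i + 1}
    return delta, {0}

def subset_construction(delta, start):
    """Standard powerset construction; returns the set of reachable DFA states."""
    start_set = frozenset(start)
    seen = {start_set}
    work = [start_set]
    while work:
        s = work.pop()
        for sym in 'ab':
            nxt = frozenset(q2 for q in s for q2 in delta.get((q, sym), ()))
            if nxt not in seen:
                seen.add(nxt)
                work.append(nxt)
    return seen

for n in (2, 4, 8):
    delta, start = nfa_for_nth_from_end(n)
    print(n + 1, "NFA states ->", len(subset_construction(delta, start)), "DFA states")
```

For n = 8 the 9-state NFA already becomes a 256-state DFA, and it doubles with every increment of n.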
If you're comfortable with your lexing and parsing, tree building and threading (i.e. threaded tree), you can implement a basic NFA regex matcher supporting captures and backrefs quite easily, in no more than a day. Matching is basically a recursive visit over the threaded parse tree nodes, with some creative threading links, such as in series for concatenation, in parallel for alternation, and in cycles for closures. Backing out from recursion handles the nondeterminism.
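To make that concrete, here is a compact sketch of the same idea in Python. It uses continuation-passing recursion over a tuple-encoded parse tree rather than an explicitly threaded tree, but the structure matches the description: series for concatenation, parallel for alternation, a cycle for closure, and backing out of the recursion handles the nondeterminism. The node encoding and names are mine, purely for illustration.

```python
def match(node, s, i, groups, k):
    """Match `node` against s starting at i; on success call k(next_index, groups).
    Returning False backs out of the recursion, which handles the nondeterminism."""
    kind = node[0]
    if kind == 'lit':        # single literal character
        return i < len(s) and s[i] == node[1] and k(i + 1, groups)
    if kind == 'cat':        # concatenation: link the two children in series
        return match(node[1], s, i, groups,
                     lambda j, g: match(node[2], s, j, g, k))
    if kind == 'alt':        # alternation: try the two children in parallel
        return (match(node[1], s, i, groups, k)
                or match(node[2], s, i, groups, k))
    if kind == 'star':       # closure: a cycle back into the child
        def rep(j, g):
            # try leaving the loop first, then one more iteration;
            # the j2 > j guard stops infinite loops on empty matches
            return k(j, g) or match(node[1], s, j, g,
                                    lambda j2, g2: j2 > j and rep(j2, g2))
        return rep(i, groups)
    if kind == 'grp':        # capture group n: record the matched slice
        n, child = node[1], node[2]
        return match(child, s, i, groups,
                     lambda j, g: k(j, {**g, n: s[i:j]}))
    if kind == 'ref':        # backreference: re-match the captured text literally
        text = groups.get(node[1], '')
        return s.startswith(text, i) and k(i + len(text), groups)
    raise ValueError(node)

def fullmatch(tree, s):
    """Return the captures dict on a full match, else None."""
    out = None
    def done(j, g):
        nonlocal out
        if j == len(s):
            out = g
            return True
        return False
    return out if match(tree, s, 0, {}, done) else None

# (a+)b\1, i.e. a^n b a^n -- needs a backreference, no DFA can recognize it
tree = ('cat',
        ('grp', 1, ('cat', ('lit', 'a'), ('star', ('lit', 'a')))),
        ('cat', ('lit', 'b'), ('ref', 1)))
print(fullmatch(tree, 'aaabaaa'))   # {1: 'aaa'}
print(fullmatch(tree, 'aabaaa'))    # None
```

Note this is the exponential-worst-case backtracking strategy, not a Thompson-style simulation; that trade-off is exactly what the rest of the thread is about.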
Exactly. And that in a nutshell is the problem with this article. It's an exposition of the kind of regex engine we all learned about in school, not the kind of tool that is useful for practical software development.
A regex matcher, alone, isn't really useful for much beyond a lexer generator. It's captured subexpressions and backreferences (well, really just captures) that turn it into the gadget we all love. And as you point out, that is not achievable with the simple state machine engine we all remember from class.
> It's captured subexpressions and backreferences (well, really just captures) that turn it into the gadget we all love. And as you point out, that is not achievable with the simple state machine engine we all remember from class.
Did you read [0]? It demonstrates that submatch extraction can be (and has been) implemented with DFAs:
"The extraction of submatch boundaries has been mostly ignored by computer science theorists, and it is perhaps the most compelling argument for using recursive backtracking. However, Thompson-style algorithms can be adapted to track submatch boundaries without giving up efficient performance. The Eighth Edition Unix regexp(3) library implemented such an algorithm as early as 1985, though as explained below, it was not very widely used or even noticed."
I said captures with backreferences, not captures alone. As the link you cite mentions, "as far as the theoretical term is concerned, regular expressions with backreferences are not regular expressions. The power that backreferences add comes at great cost: in the worst case, the best known implementations require exponential search algorithms, like the one Perl uses. [...] No one knows how to implement regular expressions with backreferences efficiently, though no one can prove that it's impossible either. (Specifically, the problem is NP-complete [...])"
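A quick way to see why backreferences fall outside the regular languages (my example, using Python's `re`, a backtracking engine that supports them): `(a+)b\1` matches exactly a^n b a^n, which the pumping lemma shows no DFA can recognize.

```python
import re

# (a+)b\1 matches a^n b a^n -- equal runs of a's around the b.
# No DFA can count the first run and compare it to the second,
# which is why backreferences force the engine beyond DFAs.
pat = re.compile(r'(a+)b\1')

print(pat.fullmatch('aaabaaa').group(1))   # 'aaa' (n = 3 on both sides)
print(pat.fullmatch('aabaaa'))             # None  (unequal runs)
```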