How do you implement a non-backtracking NFA other than converting it to a DFA?

mananaysiempre · on June 20, 2024

Assuming each state transition consumes a character (“epsilon-free NFA”), just follow the definition. Before each step, you have a list of NFA states you might be in, initially containing only the start state. During each step, go through each state on that list and enter any acceptable successors into a new one. Deduplicate and repeat.

This is (potentially super)linear in the number of states, but that doesn’t preclude it from being linear in the input length. In particular, even though it’s essentially a bad lazy implementation of standard NFA-to-DFA conversion, you’re avoiding the exponential state blowup you get in the eager version. (I say it’s a bad lazy implementation because a proper one would memoize; that, however, would bring the worst-case blowup back.) I think that, in addition to the fact that people do actually do this kind of thing in production implementations, qualifies it as a separate approach.

If your NFA does have epsilon-transitions—like, say, the NFAs obtained from the textbook Thompson regexp-to-NFA construction—you can either use a different (Glushkov) regexp-to-NFA construction[1], eliminate those transitions as a separate pass[2], or adjust the algorithm above to eliminate duplicate NFA states on the fly[3].

[1]: https://en.wikipedia.org/wiki/Glushkov%27s_construction_algo...

[2]: https://news.ycombinator.com/item?id=19272990

[3]: https://swtch.com/~rsc/regexp/regexp2.html

zeroonetwothree · on June 21, 2024

Last time I implemented this myself I just computed the epsilon closure at each step.

burntsushi · on June 20, 2024

At search time, the principle difference between an NFA and a DFA is that, in an NFA, you can be in more than one state at the same time. There's no problem with implementing your transition function to advance all states you are in simultaneously for each character of input. You'll get `O(m * n)` time, where `m` is proportional to the number of states in the NFA and `n` is proportional to the length of the input.

The GP is remarking on a confusable term, "NFA algorithm," that has come to mean something different from its underlying theory. The PCRE2 documentation itself, for example[1], remarks about how it uses the "NFA algorithm" as its standard approach to matching. But of course, PCRE2 supports way more than the theoretical model of non-deterministic finite automata (NFA) actually supports. PCRE2 is much more expressive. Yet, the implementation approach is still called "NFA algorithm." Presumably the reason for this is that an NFA can be simulated by doing backtracking, but if you don't keep track of where you've visited, the worst case runtime of this is superlinear, possibly catastrophically so[2]. The problem is that the backtracking regex engines have evolved well beyond what an NFA actually supports. But the name stuck.

To make things even more confusing, as I mentioned above, an NFA can be used to execute a search in O(m * n) time. (We typically call that "linear in the size of the input" since the NFA itself can usually, but not always, be regarded as a constant factor. i.e., A literal regex string in your program source code.) So when you say something like, "a finite automata regex engine uses NFAs to guarantee linear time," you might be really confused if your background is in backtracking engines which... also use an "NFA," but of course cannot guarantee linear search time.

So what we have is ambiguous terminology, not unlike "regex" itself, which could perhaps mean a real theoretical "regular expression" (in the case of Hyperscan, RE2, Go's regexp package and Rust's regex crate) or it could mean "a very expressive DSL for pattern matching that supports more than what regular languages can describe" (in the case of PCRE2, ECMAScript's regex engine, Ruby's Oniguruma, Python's `re` module, and more). I've made my peace with "regex" being ambiguous, but the fact that "NFA algorithm" is itself ambiguous despite "non-deterministic finite automata" being a somewhat baroque and specific technical term, is very unfortunate.

To make things even worse---and I shit you not, I am not joking---the backtracking people seem to have taken to calling a linear time NFA search the "DFA algorithm."[3] (At least the PCRE2 docs mention it's not a "traditional" finite state machine...) But... it's not a DFA. The backtracking folks have effectively elevated the terms "NFA" and "DFA" to mean something very different from their theoretical underpinnings.

So when you have people that are only aware of one meaning of "NFA algorithm" talking to other people only aware of the other meaning of "NFA algorithm," chaos ensues.

[1]: https://www.pcre.org/current/doc/html/pcre2matching.html

[2]: https://github.com/BurntSushi/rebar?tab=readme-ov-file#cloud...

[3]: https://www.pcre.org/current/doc/html/pcre2matching.html#TOC...

fanf2 · on June 20, 2024

Yeah, it’s vexing. This backtracking / NFA / DFA confusion originally came from Jeffrey Friedl’s book “mastering regular expressions” (O’Reilly, 1997) — at least, that’s the first place I saw the mistake in print. Philip Hazel relied a lot on Friedl’s book when writing PCRE, which is why the PCRE docs get it wrong. (I spoke to Philip about this many years ago when we worked together.)

burntsushi · on June 20, 2024

Yeah I didn't mention Friedl because I've dumped on him in the past, and didn't want to belabor it. But Friedl definitely didn't start this. According to him, he was just using what was common vernacular even back then: http://regex.info/blog/2006-09-15/248 (Note that blog is from 2006, but he's actually quoting himself from ~10 years prior to that in 1997.)

But... he is the one who ultimately installed this ambiguity into a classic book that is still read to this day. So I think he can at least be blamed for popularizing the confusion. And especially since the entire book is framed around "NFA" and "DFA" regex engines. It's not like it was just some one-off mention. The mix-up is baked into the conceptual fabric of the book. The landscape also looked different back then. It predated RE2 for example, and RE2 is, I think, principally responsible for bringing "backtracking semantics" to finite automata oriented engines. So Friedl's book is forever stuck in a false dichotomy that only happened to exist back when he wrote it.

mikeqq2024 · on June 24, 2024

Thanks for the detailed explanation to grok the whole landscape.

loeg · on June 20, 2024

As GP said, you can keep a bag of current states.