Exploring the Linguistics Behind Regular Expressions

zokier · on March 4, 2018

What I find fascinating in the history of regexps is that their syntax has remained remarkably similar over the years and across very different implementations and uses. Considering how much variance we have in programming languages in general, and how many people consider regexp syntax to be unfriendly, I'm not sure why there hasn't been more experimentation (and serious alternatives!) in this area.

jwilk · on March 4, 2018

Perl 6 breaks with the traditional regexp syntax:

https://docs.perl6.org/language/regexes

zokier · on March 4, 2018

Also breaks with the traditional regexp by not actually being regular expressions :)

raiph · on March 6, 2018

There have been two such breaks, not one. Perls 1-5 correspond to the first such break. Perl 6 corresponds to the second.

If "regular expressions" is taken to refer to formal language theory regular expressions -- which is NOT what 99% of devs mean by the terms regex or regular expressions -- then "breaks with the tradition" (formal language theory "tradition") happened somewhere in the 1960s to 1980s timeframe, when capturing parens and backreferences were introduced in [qs]?ed or similar. (Years before the first Perl arrived in 1987 to popularize and extend said break with tradition.)

If "traditional regexp" is instead taken to refer to this latter notion of "regular expression", i.e to match what 99% of devs DO mean by the terms regex or regular expression -- Perl 5 compatible regexes, PCRE, etc. -- then Perl 6 represents a second break with tradition, breaking away from the currently still popular Perl 5 "tradition".

In other words:

* "Regex" meaning 1: formal regular expressions (from 1950s)

* "Regex" meaning 2: Perl (5) compatible regexps (from 1960s)

* "Regex" meaning 3: Perl 6 rules[1] (first officially available 2015)

[1] "Perl 6 rules are the regular expression, string matching and general-purpose parsing facility of Perl 6 ..." (https://en.wikipedia.org/wiki/Perl_6_rules)

brianon99 · on March 7, 2018

I think what most people mean when they say 'regex' is actually 2 thing:

1. The syntax of that pcre-like regex engine accept.

2. regular language, a kind of formal language.

Many regex engine nowadays like the one in perl5 and onigmura already breaks 2, but still makes 1 compatible. I think what perl6 does is also breaks 1. (I am not experienced in Perl6. Please correct me if I am wrong.) I don't think it is a problem, though.

b2gills · on March 13, 2018

In Perl 6 regexes are a type of method, and you can use them in grammars which are a type of class. (You can use them on their own as well)

Which means you can subclass grammars, compose in regexes with roles, and have parameterized regexes.

The syntax has also had an overhaul to make it more consistent with itself as well as the rest of Perl 6. Since you can embed Perl 6 code, some features of other regular expression engines haven't been implemented as they aren't needed.

The result of using a regex or grammar is also now a parse tree rather than True/False or the matched substring.

I generally recommend reading the code for JSON::Tiny::Grammar as a quick example of what it is like. https://github.com/moritz/json/blob/master/lib/JSON/Tiny/Gra...

bmn__ · on March 13, 2018

> I think what perl6 does is also breaks 1.

No, you can use the backward-compatible syntax if you don't want to spend any time porting to the newer improved syntax.

https://docs.perl6.org/language/regexes#Perl_5_compatibility...

http://design.perl6.org/Differences.html#New_regex_syntax

https://docs.perl6.org/language/5to6-nutshell#Add_:P5_or_:Pe...

jklehm · on March 4, 2018

There's this[1] effort for a more readable syntax[2][3].

[1] https://verbalexpressions.github.io/

[2] https://github.com/VerbalExpressions/JSVerbalExpressions

[3] https://github.com/VerbalExpressions/PythonVerbalExpressions

CJefferson · on March 4, 2018

I love the idea, but wow that website is awful. It just seems to link to implementations in a set of languages. Is there any overview, aim of the project, or language agnostic set of instructions?

posterboy · on March 4, 2018

I expected the article to dive a bit deeper into natural language understanding, identifying regexes in natural language and how those constructs are used to build grammars higher in the hierarchy.

I should really get around to read more on this, but it quickly explodes in complexity. Suddenly I'm on wikipedia reading about the analytic hierarchy of mathematics. All the while hardly anyone seems to expect that english should adhere to formal grammars.

It would be really interesting instead to only use the concepts that have been introduced. So, what's the bare minimum to be expected from a speaker prior to such a text? Purely appending composition, I guess, enumerations, starting with 'yes' and _. nothing. Typically the first word in a conversation is a greeting. Hello world! On the other hand, there's an obvious relation between 'I' and '1'.

It's interesting that a lot of language can be built up so that a sentence can be understood at each stage of its build (with a bit of abusing the language). Words then are learned simply by association of being close to other known words, so "hello" is implicitly expanded to a whole context. Which in turn is learned from gestures. And continuous repetition is very important. Feedback. ie. success is learned from quieting a crying baby down. Words are learned by echoing back wkrds that are heard repeatedly (rather phonemes, so I would start with "hi", not "hello"). And a lot of repetition is own sounds, to learn sounding at all and to not forget them again.

And later, hole ideas have to be repeated again and again and refined ... which I guess is why I am writing all this.

Although, "simulation" allows us to do all this quietly and heuristics and proofs can significantly simplify the process. I guess that can be linked to context free and higher grammars.

And because of the repition I appreciate this post.

steffann · on March 4, 2018

I get the suspicion that the writer doesn't understand the ?, * and + operators...

yorwba · on March 4, 2018

There is a problem here in that different regex libraries have different semantics for these.

I checked the manual for PCRE (man pcrepattern), and it says that ? has both the meaning of {0,1} (zero or one repetition), as well as turning * and + into non-greedy variants if directly following them.

Similarly, + usually has the meaning of {1,} (at least once) but can also quantify * and + to prevent backtracking.

For an engine whose semantics differ from PCRE, non-greedy matching or backtracking might not even make sense, if the matching is implemented differently (e.g. using finite automata that don't backtrack).

posterboy · on March 4, 2018

a correction would be helpful.

   /x?/ one or none x
   /x+/ one or more x
   /x*/ none or more x

with variable operators it's more complicated. In /.+/ the operator is repeated, not the first match as with /(.)\1*/

Also, there are extensions in various implementations that are not in fact a regular, so no finite automata.

rocqua · on March 4, 2018

I expected more focus on the creeping capabilities of regexes. Especially how this relates to the tendency towards Turing completeness.

I've heard it said that every language is 'doomed' to creep towards being turing complete. This is 'doom' because turing completeness entails suffering from the halting problem.

crdoconnor · on March 4, 2018

>I've heard it said that every language is 'doomed' to creep towards being turing complete.

Usually that happens for DSLs for tools which probably should have just been ordinary libraries in existing languages (e.g. ant).

It's often tricky knowing where architecturally to draw the line between turing completeness and non-turing completeness and the technology landscape is littered with examples of tools which put it in the wrong place and later tried to hack around it.

Turing completeness where it isn't necessary IMHO isn't really a problem because of the halting problem per se - it's a problem because turing complete code has a higher maintenance cost at the best of times and attracts a ton of technical debt at the worst.

Old school frameworkless PHP was the clearest example of this IMHO - the lack of a clear separation between business (should be TC) logic and presentation logic (should be low powered templating language) caused messes all over the place.

eesmith · on March 4, 2018

It's also tricky because it's so easy to make a Turing complete system by accident. You may not even realize that you've crossed that line.

crdoconnor · on March 4, 2018

It really shouldn't be hard to tell for the designer. If you're considering implementing loops, conditionals or variables in your DSL then you should kind of realize what direction you're headed in.

The hard part is realizing from the get go (before backwards compatibility concerns kick in), that your problem space is not conducive to non-turing complete languages in the first place, and that instead of inventing an exciting new DSL, maybe you should just write a library.

eesmith · on March 4, 2018

When CSS+HTML5 became Turing complete, do you think the designers knew it?

Or the designers of page fault handling in X86?

As I understand it, C++ templates were not supposed to be Turing complete, but they are.

These examples and more come from http://beza1e1.tuxen.de/articles/accidentally_turing_complet... .

ggm · on March 4, 2018

Regular Expressions existed before UNIX. But G/RE/P made regular expressions both expressive, and commonplace. POSIX carried the job forward, Perl had a role to play too.

One family of expression in grep, sed, awk, ed, ex and vi. Thats awesome.

What I find strange, is how late EMACS family editing came to a sensible mechanism to use them. Global search and replace in emacs has always felt significantly more 'clumsy' than in the ed/ex/vi family.

Maybe its me.

carapace · on March 4, 2018

One of the strange and wonderful things in the history of the world is that Chomsky's Transformational Grammar forms the basis of both computer languages and Neuro-Linguistic Programming.

Part of the origin story of NLP is that they used Transformational Grammar to analyze therapeutic exchanges between therapists and clients. The "Meta Model" explicitly uses grammatical structure to detect missing or elided information.