My biggest issue with regular expressions is remembering the exact syntax of each of the different common regex engines: JavaScript, Perl, `grep`, `grep -E`, vim, `awk`, `sed`, etc.
Each one seems to have slightly different syntax, to require different characters to be escaped, and to have different defaults (is global search enabled by default? Multiline? What about case sensitivity?); some don't support certain lookarounds, grouping works differently, and so on.
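To make the "different defaults" point concrete, here's a small Python sketch; the comments note how JavaScript differs (the non-Python details are from memory, worth verifying against each engine's docs):

```python
import re

# Python's re.sub replaces every match by default (like sed's /g flag),
# whereas JavaScript's str.replace(/re/, ...) replaces only the first
# match unless you add the g flag.
print(re.sub(r"\d+", "#", "a1 b22"))      # a# b#
# Multiline is opt-in in both, but spelled re.M here and /m in JS.
print(re.search(r"^b", "a\nb"))           # None: ^ matches only at the start
print(re.search(r"^b", "a\nb", re.M))     # matches: ^ now matches per line
```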
It's a real pain, especially when you want a quick one-off regex. In that case the learning curve changes the economics of which tool is right for the job. Often I'll just end up writing a program with a tool I already know, even though I'm aware it's the less efficient choice; at least its inefficiency is predictable. Of course, if you do that enough times, you've spent more than the original learning curve would have cost! I have done this in painfully many contexts. Regexes are an obvious case, probably because they're so obviously doing the same thing, just differently enough to waste your time.
Some people actually like searching for and paging through documentation to learn how, e.g., regex format X does character escaping. And they tend to remember such things, too. I don't, and I don't.
This is a very big issue and one I recently faced in the workplace. I needed a script to parse some log files and generate CSV reports for the business users. I knew jq (https://stedolan.github.io/jq/), and so was able to write fewer than 10 lines of jq to do it, combined with some preprocessing in sed.
I then realised that NOBODY on the team knew jq apart from me, and I had to rewrite it in Python, which took me 4 days to do correctly and to handle everything that jq did for me.
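For a sense of the gap: with JSON-formatted log lines and hypothetical field names, jq does the whole job with something like `jq -r '[.ts, .user, .status] | @csv'`, while even a minimal Python equivalent (a sketch, not my actual script) is already longer:

```python
import csv
import json
import sys

# Minimal JSON-lines -> CSV conversion; the field names are made up for
# illustration, and real logs would need the error handling jq spared me.
writer = csv.writer(sys.stdout)
for line in sys.stdin:
    record = json.loads(line)
    writer.writerow([record["ts"], record["user"], record["status"]])
```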
There are three types of regex as far as I know: basic (aka GNU), extended, and Perl. Grep uses GNU, as the name implies.
egrep or `grep -E` uses extended. Perl-style is used elsewhere, like JavaScript. Typically you'll see PCRE, which is the library for Perl Compatible Regular Expressions.
The name "grep" is unrelated to GNU. According to The Jargon File, the etymology is:
> from the qed/ed editor idiom g/re/p, where re stands for a regular expression, to Globally search for the Regular Expression and Print the lines containing matches to it
In fact, grep predates the GNU project by almost a decade.
Also, it's the first time I hear about "GNU" regexps.
There are many differences between "compatible" regular expressions; for (many) examples, read the end of the section "Important Notes About Lookbehind" on [1], or compare the 24 dialects on [2].
Erlang requires doubling up on escape backslashes, which generally causes me grief even after I flail about and finally remember it. Subtleties abound.
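Python has the same trap unless you use raw strings, which makes for a quick illustration:

```python
import re

# The regex \d+ must be written as "\\d+" in an ordinary string literal,
# because the string parser consumes one level of backslashes before the
# regex engine ever sees the pattern.
assert re.search("\\d+", "abc123")
assert re.search(r"\d+", "abc123")   # raw string: no doubling needed
# re.search("\d+", ...) happens to work too, but only because \d isn't a
# recognized string escape; recent Python versions warn about it.
```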
I wonder if there is a tool that lets you enter a regular expression from one engine and get back the equivalent regular expressions for other engines?
Some of the lookarounds might be too powerful to express in some engines, so it would not always work, but it would still be quite useful if it just handled the differences in escaping, in specifying modes like case insensitivity or multiline matching, and in referencing match groups in replacements.
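Even just those three would cover a lot. Here's roughly the kind of table such a tool would encode, sketched as Python with other engines' spellings in comments (the non-Python spellings are from memory):

```python
import re

# Case insensitivity: re.I here, the /i flag in JavaScript and Perl, or
# the inline (?i) modifier that most PCRE-style engines accept.
assert re.search(r"(?i)cat", "CATALOG")

# Group references in replacements: \1 or \g<1> here, $1 in JavaScript,
# \1 in sed. Same concept, three spellings.
print(re.sub(r"(\w+)@example\.com", r"\1@example.org", "bob@example.com"))
# -> bob@example.org
```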
Not exactly, but that's a(n internal) feature of my Strukt[1]. It's how the optimizer is able to take any of {exact string, glob, ICU regex} as user input, and convert them into efficient queries for various data sources.
For example, if you do `ListFiles[folder=~/Downloads] -> FilterString[field=filename, regex=\.pdf$]`, it will parse the regex, verify that it only uses features which are available in Spotlight's query syntax (roughly, a glob with slightly funny syntax), and rewrite that adjacent pair of operations into a single API call behind the scenes:

`kMDItemFSName LIKE[c] "*.pdf"`
I can report, having written parsers for a few common regex dialects, that there are a ton of obscure regex features, with different semantics everywhere. 100% conversion is almost never possible.
A lot of the work in deciding if a translation is possible is in identifying if something is good enough, e.g., SQLite can't do case-insensitive search (in general), but if your regex happens to be /([0-9]+)/ then case-sensitive search will work just fine. Fortunately, for Strukt, if a conversion is impossible, I can just run the operations as written: it's much slower, but still correct.
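A toy version of that kind of rewrite (not Strukt's actual code) shows the shape of the problem: translate when the pattern stays inside the target's feature set, and punt otherwise:

```python
# Turn a regex into a glob when it only uses literals and a trailing $
# anchor; give up (return None) on anything fancier.
def regex_to_glob(pattern):
    body, anchored_end = pattern, False
    if body.endswith("$") and not body.endswith("\\$"):
        body, anchored_end = body[:-1], True
    out, i = [], 0
    while i < len(body):
        c = body[i]
        if c == "\\" and i + 1 < len(body):
            out.append(body[i + 1])   # escaped char becomes a literal
            i += 2
        elif c in ".^$*+?()[]{}|":
            return None               # a real metacharacter: punt
        else:
            out.append(c)
            i += 1
    glob = "".join(out)
    return "*" + glob if anchored_end else "*" + glob + "*"

print(regex_to_glob(r"\.pdf$"))   # *.pdf  (matches the Spotlight query above)
print(regex_to_glob(r"[0-9]+"))   # None   (character classes don't translate)
```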
I've thought about breaking this part of Strukt off into a tool just for regex editing, but that always seemed rather esoteric. Would there be any use or demand for that, do you think?
> remembering the exact syntax of each of the different common regex engines
I have the same problem. There's no way around those little differences other than to suffer them. It simply isn't feasible to remember all the nuances of the different flavors.
I just know one really well and rely on getting it wrong for the others: guess, test, revise, repeat.
That's one of my pet peeves too. Another is that I only touch regexes once every few months. Once you have a regex set up to parse a domain of data, you move on to other things. After a while I forget the syntax, and when I have to parse another set of text or debug the older regex, I end up having to relearn a bit of it. But I guess it comes with the territory.
RegexBuddy explains what each character does just by hovering over a regex. It's the best tool for learning or for fine-tuning your regular expressions (with testing included).
Love RegexBuddy. That feature, along with the "Use" tab, which generates code for whatever language you select. I don't have to remember the specifics in all the languages I work in. I just select from the dropdowns, for example, "JavaScript (Chrome)" then "Use regex object to get the part of a string matched by a numbered group". Replace the placeholder variable names, and you're good to go!
It will also do things like warn you if you use named groups if your selected language doesn't support them, and the "Use" dropdown won't provide that option.
Hah wow... I've had this app for over a decade and never noticed that feature... right next to the Debug button which I have used numerous times on really gnarly regexes
https://www.debuggex.com/ is a neat one for showing a syntax chart (railroad diagram) visualization.
I generally use https://regex101.com for its display of matched groups (when I'm dealing with complex groups and/or replacement backreferences), and http://regexstorm.net/tester when I specifically need to check a regex that will be running in .NET or PowerShell.
For advanced users (and those who want to become regex gurus), the most helpful regex site for me is http://www.rexegg.com/. Also, the O'Reilly book "Mastering Regular Expressions" is worth its weight in gold.
I've been teaching regular expressions for years, and offer a free online e-mail course on the subject (http://RegexpCrashCourse.com/).
This site is a very nice summary of regexp syntax and is well written, but it's missing the two crucial pieces that help people learn: examples and exercises. Without practice, there's no way people can remember the syntax.
I recently encountered a case where a URL had an underscore at the end of a subdomain name. It seems underscores are okay anywhere else. My friend on Windows was able to load the website, but I couldn't on Linux, whether using Firefox, curl, or a remote screenshot service that presumably ran Linux. According to various RFCs, they should be okay anywhere within the subdomain name.
Has anyone encountered this behavior? Couldn't find anything on the internet; maybe it's just my computer?
I'm not enough of a history boffin to know how Microsoft came to support it differently (perhaps something from the NetBIOS and NT era). At this point, though, I don't see either party changing their default validation to agree on a single definition.
Wait a second...does this imply that if I put downloads that should only be of interest to our Windows customers on a server named something like downloads_.ourdomain.com, it might keep out all those annoying bots that ignore robots.txt and make a lot of noise in my logs? I'm guessing that most of the bots are not running on Windows.
That's a pretty bad idea; you shouldn't rely on this kind of stuff.
If there are people running OS X or Linux who want Windows downloads, or someone is behind a captive portal or proxy (like Squid), they probably won't be able to reach it anymore.
If you have a real problem with bots, I'd look at what IPs they are coming from, and how often they try to connect. Something like IP blacklisting, or fail2ban might work for your use case.
Yes, I feel that the "Bonus" section (which comes with no explanation, even) rather encourages beginners to misuse regular expressions in general and, more specifically, contains errors.
I personally avoid regexes where possible, including in this situation. IMO the right way to validate a URL is to feed it to a URL parser and see if it errors out. I can see errors in this regex right away - and in many other regexes you find from Googling. People just drop them into their codebase and their eyes glaze over when you ask them whether or not it's actually correct. How many websites fail on user+whatever@gmail.com because they copied a bad regex?
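As a minimal sketch of the parse-first approach in Python (note that `urlsplit` is lenient and rarely raises, so you check the pieces rather than catching errors):

```python
from urllib.parse import urlsplit

def looks_like_http_url(s):
    try:
        parts = urlsplit(s)
    except ValueError:          # e.g. malformed IPv6 literals do raise
        return False
    # urlsplit accepts nearly anything, so validate the components.
    return parts.scheme in ("http", "https") and bool(parts.netloc)

print(looks_like_http_url("https://example.com/a?b=c"))  # True
print(looks_like_http_url("http//typo.example.com"))     # False (no scheme)
```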
I.e., encountering them here, there, and everywhere, then one day realising you have a good knowledge of the subject without ever having set out to learn it.
I spent some time reading some resource or another on how regexes work, but the vast majority of my learning has been trying things in https://regex101.com/ and seeing if they do what I want. The breakdown on the side of the page is especially helpful.
And when you think you've learned regex, learn that you haven't: http://fent.github.io/randexp.js (a regex "reverser" of sorts) [1]
Seriously. Test every non-trivial regex with something like this, you'll probably be surprised at how permissive most regexes are.
Regexes are great. They're super-concise and perform amazingly well. But they're one of the biggest footguns I know of. Treat them as such and you'll probably do fine.
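To make "permissive" concrete, here's a hypothetical naive URL pattern (made up for illustration, not the one from the footnote below):

```python
import re

naive = re.compile(r"https?://\S+")
# It happily accepts strings that no browser would load:
print(bool(naive.fullmatch("http://!!!")))           # True
print(bool(naive.fullmatch("https://..........")))   # True
```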
---
[1] for instance, the URL regex they use is incorrect, and it's super obvious when you plug it into that site.
I didn't truly understand regular expressions until I saw how they are executed. There are simple algorithms for executing them, so that might not be such a bad way to teach.
I learned from the book The Unix Programming Environment, but I also didn't truly understand regexps until I read how they were executed. The first chapter of Programming in Standard ML shows how to implement a complete regexp package/parser: http://www.cs.cmu.edu/~rwh/isml/book.pdf
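If you want the flavor of "how they are executed" without a whole chapter, the classic minimal backtracking matcher (along the lines of the one in Kernighan and Pike's The Practice of Programming) fits in a few lines. A Python port, handling only literals, `.`, `*`, `^`, and `$`:

```python
def match(regexp, text):
    """Search for regexp anywhere in text."""
    if regexp.startswith("^"):
        return match_here(regexp[1:], text)
    while True:
        if match_here(regexp, text):
            return True
        if not text:
            return False
        text = text[1:]

def match_here(regexp, text):
    """Search for regexp at the beginning of text."""
    if not regexp:
        return True
    if len(regexp) >= 2 and regexp[1] == "*":
        return match_star(regexp[0], regexp[2:], text)
    if regexp == "$":
        return not text
    if text and (regexp[0] == "." or regexp[0] == text[0]):
        return match_here(regexp[1:], text[1:])
    return False

def match_star(c, regexp, text):
    """Search for c*regexp at the beginning of text."""
    while True:
        if match_here(regexp, text):
            return True
        if not (text and (c == "." or c == text[0])):
            return False
        text = text[1:]

assert match("ab*c$", "xabbbc")
assert not match("^ab*c", "xabc")
```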
The lack of readability of regex makes me wonder if there isn't a better way. I've seen Elm's parser which introduces a few neat concepts like parser pipelines. https://github.com/elm-tools/parser
Have you looked at Perl 6 at all? Whitespace in regexen is insignificant if not quoted, so not only can you add a little space between sections, you can split a regex over several lines and add comments throughout.
It also has first-class grammars, so you're less tempted to reach for regex when something more powerful would help.
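Python has a more modest version of the same escape hatch in `re.VERBOSE`, which ignores unescaped whitespace and allows inline comments:

```python
import re

date = re.compile(r"""
    (?P<year>\d{4})  -   # four-digit year
    (?P<month>\d{2}) -   # two-digit month
    (?P<day>\d{2})       # two-digit day
""", re.VERBOSE)
print(date.match("2024-05-17").groupdict())
# {'year': '2024', 'month': '05', 'day': '17'}
```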
For some people Regex Golf [1] might be an interesting way to learn. You are actually building increasingly complex regex as you go, and can just look up bits of syntax you don't know as needed.
Capture groups are where I get 99% of the value from regexes. Being able to quickly transform data is where I find regexes perform best. Just matching is not something I have to do all that often.
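For example, one substitution reshapes every US-style date in a string:

```python
import re

text = "shipped 03/14/2024, delivered 03/18/2024"
print(re.sub(r"(\d{2})/(\d{2})/(\d{4})", r"\3-\1-\2", text))
# shipped 2024-03-14, delivered 2024-03-18
```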