My biggest issue with regular expressions is remembering the exact syntax of each of the different common regex engines: JavaScript, Perl, `grep`, `grep -E`, vim, `awk`, `sed`, etc.
Each one seems to have slightly different syntax, to require different characters to be escaped, and to have different defaults (is global search enabled by default? Multiline? What about case sensitivity?); some don't support certain lookarounds, grouping works differently, and so on.
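To make the "different defaults" point concrete, here's a small Python sketch; the comments note how JavaScript differs (the non-Python details are from memory, worth verifying against each engine's docs):

```python
import re

# Python's re.sub replaces every match by default (like sed's /g flag),
# whereas JavaScript's str.replace(/re/, ...) replaces only the first
# match unless you add the g flag.
print(re.sub(r"\d+", "#", "a1 b22"))      # a# b#
# Multiline is opt-in in both, but spelled re.M here and /m in JS.
print(re.search(r"^b", "a\nb"))           # None: ^ matches only at the start
print(re.search(r"^b", "a\nb", re.M))     # matches: ^ now matches per line
```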
It's a real pain, especially when you want a quick one-off regex. In that case the learning curve changes the economics of which tool is right for the job. Often I'll just end up writing a program with a tool I already know, even though I'm aware it's the less efficient choice; at least its inefficiency is predictable. Of course, if you do that enough times, you've spent more than the original learning curve would have cost! I have done this in painfully many contexts. Regexes are an obvious case, probably because they're so obviously doing the same thing, just differently enough to waste your time.
Some people actually like searching for and paging through documentation to learn how, e.g., regex format X does character escaping. And they tend to remember such things, too. I don't, and I don't.
This is a very big issue and one I recently faced in the workplace. I needed a script to parse some log files and generate CSV reports for the business users. I knew jq (https://stedolan.github.io/jq/), and so was able to write fewer than 10 lines of jq to do it, combined with some preprocessing in sed.
I then realised that NOBODY on the team knew jq apart from me, and I had to rewrite it in Python, which took me 4 days to do correctly and to handle everything that jq did for me.
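For a sense of the gap: with JSON-formatted log lines and hypothetical field names, jq does the whole job with something like `jq -r '[.ts, .user, .status] | @csv'`, while even a minimal Python equivalent (a sketch, not my actual script) is already longer:

```python
import csv
import json
import sys

# Minimal JSON-lines -> CSV conversion; the field names are made up for
# illustration, and real logs would need the error handling jq spared me.
writer = csv.writer(sys.stdout)
for line in sys.stdin:
    record = json.loads(line)
    writer.writerow([record["ts"], record["user"], record["status"]])
```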
There are three types of regex as far as I know: basic (aka GNU), extended, and Perl. Grep uses GNU, as the name implies.
egrep or `grep -E` uses extended. Perl-style is used elsewhere, like JavaScript. Typically you'll see PCRE, which is the library for Perl Compatible Regular Expressions.
The name "grep" is unrelated to GNU. According to The Jargon File, the etymology is:
> from the qed/ed editor idiom g/re/p, where re stands for a regular expression, to Globally search for the Regular Expression and Print the lines containing matches to it
In fact, grep predates the GNU project by almost a decade.
Also, it's the first time I hear about "GNU" regexps.
There are many differences between "compatible" regular expressions; for (many) examples, read the end of the section "Important Notes About Lookbehind" on [1], or compare the 24 dialects on [2].
Erlang requires doubling up on escape backslashes, which generally causes me grief even after I flail about and finally remember it. Subtleties abound.
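Python has the same trap unless you use raw strings, which makes for a quick illustration:

```python
import re

# The regex \d+ must be written as "\\d+" in an ordinary string literal,
# because the string parser consumes one level of backslashes before the
# regex engine ever sees the pattern.
assert re.search("\\d+", "abc123")
assert re.search(r"\d+", "abc123")   # raw string: no doubling needed
# re.search("\d+", ...) happens to work too, but only because \d isn't a
# recognized string escape; recent Python versions warn about it.
```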
I wonder if there is a tool that lets you enter a regular expression from one engine and get back the equivalent regular expressions for other engines?
Some of the lookarounds might be too powerful to express in some engines, so it would not always work, but it would still be quite useful if it just handled the differences in escaping, in specifying modes like case insensitivity or multiline matching, and in referencing match groups in replacements.
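Even just those three would cover a lot. Here's roughly the kind of table such a tool would encode, sketched as Python with other engines' spellings in comments (the non-Python spellings are from memory):

```python
import re

# Case insensitivity: re.I here, the /i flag in JavaScript and Perl, or
# the inline (?i) modifier that most PCRE-style engines accept.
assert re.search(r"(?i)cat", "CATALOG")

# Group references in replacements: \1 or \g<1> here, $1 in JavaScript,
# \1 in sed. Same concept, three spellings.
print(re.sub(r"(\w+)@example\.com", r"\1@example.org", "bob@example.com"))
# -> bob@example.org
```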
Not exactly, but that's a(n internal) feature of my Strukt[1]. It's how the optimizer is able to take any of {exact string, glob, ICU regex} as user input, and convert them into efficient queries for various data sources.
For example, if you do `ListFiles[folder=~/Downloads] -> FilterString[field=filename, regex=\.pdf$]`, it will parse the regex, verify that it only uses features which are available in Spotlight's query syntax (roughly, a glob with slightly funny syntax), and rewrite that adjacent pair of operations into a single API call behind the scenes:

`kMDItemFSName LIKE[c] "*.pdf"`
I can report, having written parsers for a few common regex dialects, that there are a ton of obscure regex features, with different semantics everywhere. 100% conversion is almost never possible.
A lot of the work in deciding if a translation is possible is in identifying if something is good enough, e.g., SQLite can't do case-insensitive search (in general), but if your regex happens to be /([0-9]+)/ then case-sensitive search will work just fine. Fortunately, for Strukt, if a conversion is impossible, I can just run the operations as written: it's much slower, but still correct.
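A toy version of that kind of rewrite (not Strukt's actual code) shows the shape of the problem: translate when the pattern stays inside the target's feature set, and punt otherwise:

```python
# Turn a regex into a glob when it only uses literals and a trailing $
# anchor; give up (return None) on anything fancier.
def regex_to_glob(pattern):
    body, anchored_end = pattern, False
    if body.endswith("$") and not body.endswith("\\$"):
        body, anchored_end = body[:-1], True
    out, i = [], 0
    while i < len(body):
        c = body[i]
        if c == "\\" and i + 1 < len(body):
            out.append(body[i + 1])   # escaped char becomes a literal
            i += 2
        elif c in ".^$*+?()[]{}|":
            return None               # a real metacharacter: punt
        else:
            out.append(c)
            i += 1
    glob = "".join(out)
    return "*" + glob if anchored_end else "*" + glob + "*"

print(regex_to_glob(r"\.pdf$"))   # *.pdf  (matches the Spotlight query above)
print(regex_to_glob(r"[0-9]+"))   # None   (character classes don't translate)
```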
I've thought about breaking this part of Strukt off into a tool just for regex editing, but that always seemed rather esoteric. Would there be any use or demand for that, do you think?
> remembering the exact syntax of each of the different common regex engines
I have the same problem. There's no way around those little differences other than to suffer them. It simply isn't feasible to remember all the nuances of the different flavors.
I just know one really well and rely on getting it wrong for the others: guess, test, revise, repeat.
That's one of my pet peeves too. Another is that I only touch regexes once every few months. Once you have a regex set up to parse a domain of data, you move on to other things. After a while I forget the syntax, and when I have to parse another set of text or debug the older regex, I end up having to relearn a bit of it. But I guess it comes with the territory.
RegexBuddy explains what each character does just by hovering over a regex. It's the best tool for learning or for fine-tuning your regular expressions (with testing included).
Love RegexBuddy. That feature, along with the "Use" tab, which generates code for whatever language you select. I don't have to remember the specifics in all the languages I work in. I just select from the dropdowns, for example, "JavaScript (Chrome)" then "Use regex object to get the part of a string matched by a numbered group". Replace the placeholder variable names, and you're good to go!
It will also do things like warn you if you use named groups if your selected language doesn't support them, and the "Use" dropdown won't provide that option.
Hah wow... I've had this app for over a decade and never noticed that feature... right next to the Debug button which I have used numerous times on really gnarly regexes
https://www.debuggex.com/ is a neat one for showing a syntax chart (railroad diagram) visualization.
I generally use https://regex101.com for its display of matched groups (when I'm dealing with complex groups and/or replacement backreferences), and http://regexstorm.net/tester when I specifically need to check a regex that will be running in .NET or PowerShell.
For advanced users (and those who want to become regex gurus), the most helpful regex site for me is http://www.rexegg.com/. Also, the O'Reilly book "Mastering Regular Expressions" is worth its weight in gold.
I've been teaching regular expressions for years, and offer a free online e-mail course on the subject (http://RegexpCrashCourse.com/).
This site is a very nice summary of regexp syntax and is well written, but it's missing the two crucial pieces that help people learn: examples and exercises. Without practice, there's no way people can remember the syntax.
I recently encountered a case where a URL had an underscore at the end of a subdomain name. It seems underscores are okay anywhere else. My friend on Windows was able to load the website, but I couldn't on Linux, whether using Firefox, curl, or a remote screenshot service that presumably ran Linux. According to various RFCs, they should be okay anywhere within the subdomain name.
Has anyone encountered this behavior? Couldn't find anything on the internet; maybe it's just my computer?
I'm not enough of a history boffin to know how Microsoft came to support it differently (perhaps something from the NetBIOS and NT era). At this point, though, I don't see either party changing their default validation to agree on a single definition.
Wait a second...does this imply that if I put downloads that should only be of interest to our Windows customers on a server named something like downloads_.ourdomain.com, it might keep out all those annoying bots that ignore robots.txt and make a lot of noise in my logs? I'm guessing that most of the bots are not running on Windows.
That's a pretty bad idea; you shouldn't rely on this kind of stuff.
If there are people running OS X or Linux who want Windows downloads, or someone is behind a captive portal or proxy (like Squid), they probably won't be able to reach it anymore.
If you have a real problem with bots, I'd look at what IPs they are coming from, and how often they try to connect. Something like IP blacklisting, or fail2ban might work for your use case.
Yes, I feel that the "Bonus" section (which comes with no explanation, even) rather encourages beginners to misuse regular expressions in general and, more specifically, contains errors.
I personally avoid regexes where possible, including in this situation. IMO the right way to validate a URL is to feed it to a URL parser and see if it errors out. I can see errors in this regex right away - and in many other regexes you find from Googling. People just drop them into their codebase and their eyes glaze over when you ask them whether or not it's actually correct. How many websites fail on user+whatever@gmail.com because they copied a bad regex?
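As a minimal sketch of the parse-first approach in Python (note that `urlsplit` is lenient and rarely raises, so you check the pieces rather than catching errors):

```python
from urllib.parse import urlsplit

def looks_like_http_url(s):
    try:
        parts = urlsplit(s)
    except ValueError:          # e.g. malformed IPv6 literals do raise
        return False
    # urlsplit accepts nearly anything, so validate the components.
    return parts.scheme in ("http", "https") and bool(parts.netloc)

print(looks_like_http_url("https://example.com/a?b=c"))  # True
print(looks_like_http_url("http//typo.example.com"))     # False (no scheme)
```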
I.e., encountering them here, there, and everywhere, then one day realising you have a good knowledge of the subject without ever having set out to learn it.
I spent some time reading some resource or another on how regexes work, but the vast majority of my learning has been trying things in https://regex101.com/ and seeing if they do what I want. The breakdown on the side of the page is especially helpful.
And when you think you've learned regex, learn that you haven't: http://fent.github.io/randexp.js (a regex "reverser" of sorts) [1]
Seriously. Test every non-trivial regex with something like this, you'll probably be surprised at how permissive most regexes are.
Regexes are great. They're super-concise and perform amazingly well. But they're one of the biggest footguns I know of. Treat them as such and you'll probably do fine.
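To make "permissive" concrete, here's a hypothetical naive URL pattern (made up for illustration, not the one from the footnote below):

```python
import re

naive = re.compile(r"https?://\S+")
# It happily accepts strings that no browser would load:
print(bool(naive.fullmatch("http://!!!")))           # True
print(bool(naive.fullmatch("https://..........")))   # True
```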
---
[1] for instance, the URL regex they use is incorrect, and it's super obvious when you plug it into that site.
I didn't truly understand regular expressions until I saw how they are executed. There are simple algorithms for executing them, so that might not be such a bad way to teach.
I learned from the book The Unix Programming Environment, but I also didn't truly understand regexps until I read how they were executed. The first chapter of Programming in Standard ML shows how to implement a complete regexp package/parser: http://www.cs.cmu.edu/~rwh/isml/book.pdf
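If you want the flavor of "how they are executed" without a whole chapter, the classic minimal backtracking matcher (along the lines of the one in Kernighan and Pike's The Practice of Programming) fits in a few lines. A Python port, handling only literals, `.`, `*`, `^`, and `$`:

```python
def match(regexp, text):
    """Search for regexp anywhere in text."""
    if regexp.startswith("^"):
        return match_here(regexp[1:], text)
    while True:
        if match_here(regexp, text):
            return True
        if not text:
            return False
        text = text[1:]

def match_here(regexp, text):
    """Search for regexp at the beginning of text."""
    if not regexp:
        return True
    if len(regexp) >= 2 and regexp[1] == "*":
        return match_star(regexp[0], regexp[2:], text)
    if regexp == "$":
        return not text
    if text and (regexp[0] == "." or regexp[0] == text[0]):
        return match_here(regexp[1:], text[1:])
    return False

def match_star(c, regexp, text):
    """Search for c*regexp at the beginning of text."""
    while True:
        if match_here(regexp, text):
            return True
        if not (text and (c == "." or c == text[0])):
            return False
        text = text[1:]

assert match("ab*c$", "xabbbc")
assert not match("^ab*c", "xabc")
```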
The lack of readability of regex makes me wonder if there isn't a better way. I've seen Elm's parser which introduces a few neat concepts like parser pipelines. https://github.com/elm-tools/parser
Have you looked at Perl 6 at all? Whitespace in regexen is insignificant if not quoted, so not only can you add a little space between sections, you can split a regex over several lines and add comments throughout.
It also has first-class grammars, so you're less tempted to reach for regex when something more powerful would help.
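Python has a more modest version of the same escape hatch in `re.VERBOSE`, which ignores unescaped whitespace and allows inline comments:

```python
import re

date = re.compile(r"""
    (?P<year>\d{4})  -   # four-digit year
    (?P<month>\d{2}) -   # two-digit month
    (?P<day>\d{2})       # two-digit day
""", re.VERBOSE)
print(date.match("2024-05-17").groupdict())
# {'year': '2024', 'month': '05', 'day': '17'}
```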
For some people Regex Golf [1] might be an interesting way to learn. You are actually building increasingly complex regex as you go, and can just look up bits of syntax you don't know as needed.
Capture groups are where I get 99% of the value from regexes. Being able to quickly transform data is where I find regexes perform best. Just matching is not something I have to do all that often.
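For example, one substitution reshapes every US-style date in a string:

```python
import re

text = "shipped 03/14/2024, delivered 03/18/2024"
print(re.sub(r"(\d{2})/(\d{2})/(\d{4})", r"\3-\1-\2", text))
# shipped 2024-03-14, delivered 2024-03-18
```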