I did not verify this regex, on the very surface scan seems OK: ./main -m ./mode...

chrismorgan · on March 28, 2023

Badly wrong, in part because your prompt was badly wrong, because there is no short regex to validate an email address. (It needs a way of saying “I can’t complete that because the prompt makes no sense for this reason”.)

What it emitted accepts a large number of invalid addresses (due to things like not checking dot placement, and the inexplicable (…)+ wrapping around the entire thing), and doesn’t accept a large number of valid addresses (some comparatively esoteric, like local parts containing any of !#$%&'*/=?^`{|}~ or IP addresses for the domain name, and some very reasonable, like TLDs of more than six characters, or internationalised TLDs even in Punycode form).

The description it emits does not match the regular expression at all well, either.

The second regex it emits is even worse than the first, unnecessarily uses PCRE-specific syntax, and is given with a nonsensical description. (Note: the asterisks got turned into italics, backslash-escape them here on HN. With this fixed, the regex was \A[\w.]*@[\w.]*\.\w{1,4}\z.)
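A rough sketch of how that second regex behaves, for anyone who wants to check rather than eyeball it. Assumptions: Python's `re` module stands in for PCRE, and `re.fullmatch` replaces the `\A` … `\z` anchors, since `\z` is not portable across Python versions. The pattern body is unchanged. The helper name `accepts` and the sample addresses are made up for illustration.

```python
import re

# Second regex from the thread, with the \A ... \z anchors dropped in
# favour of re.fullmatch (an assumption made for portability).
SECOND_REGEX = re.compile(r"[\w.]*@[\w.]*\.\w{1,4}")


def accepts(address: str) -> bool:
    """Return True if the whole string matches the pattern."""
    return SECOND_REGEX.fullmatch(address) is not None


# Invalid addresses that the pattern nevertheless accepts:
# empty local part, empty domain labels, stray dots.
invalid_but_accepted = ["@.a", "..@..com"]

# Valid addresses that the pattern rejects.
valid_but_rejected = [
    "first+tag@example.com",     # '+' is legal in a local part
    "user@example.photography",  # TLD longer than four characters
]

for address in invalid_but_accepted + valid_but_rejected:
    verdict = "accepted" if accepts(address) else "rejected"
    print(f"{address!r}: {verdict}")
```

Both failure directions show up with only four inputs, which is the point: the pattern is wrong in ways a surface scan does not reveal.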

> on the very surface scan seems OK

And there’s the danger of this stuff. As a subject-matter expert on regex and email, I glanced at the regular expression and was immediately appalled (… quite apart from the whole “here we go again, this is certain to be terrible” cringe on the prompt). But it looks plausible enough if you aren’t.

mewpmewp2 · on March 28, 2023

It is a bit crazy to me that someone posts a regex like that without verifying it and says that on the surface it looks good, implying the whole thing was a useful, good result.

beiller · on March 28, 2023

I said it looks OK, not good. My comment is mostly about me being surprised that a valid regex came out. I also asked it to write a regex to parse HTML, which it happily answered. What does GPT-4 say about parsing HTML? ;)

mewpmewp2 · on March 28, 2023

But it is going to be either useful or harmful. It is harmful if doing the regex validation is worse than doing no validation at all, or worse than a very simple validation that just checks that there is an @ included somewhere.

pama · on March 28, 2023

For comparison, GPT-4 provides the following Python regex and then warns that it does not catch all edge cases and that it’s better to use a dedicated library like email-validator:

email_pattern = r"^(?=.{1,256})(?=.{1,64}@.{1,255}$)(?=\S)(?:(?!@)[\w&'+._%-]+(?:(?<!\\)[,;])?)(?<=\S)@((?=\S)(?!-)[A-Za-z0-9-]{1,63}(?<!-)\.?)+[A-Za-z]{2,19}(?<=\S)$"

mewpmewp2 · on March 28, 2023

I would say either that or just have a basic check that there is an @ somewhere.
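A minimal sketch of that basic check, in Python. The function name `looks_like_email` is made up, and requiring something non-empty on both sides of the @ goes slightly beyond "there is an @ somewhere"; treat that as an assumption of this sketch, not a rule.

```python
def looks_like_email(address: str) -> bool:
    """Loose sanity check: something, then '@', then something.

    Splitting on the LAST '@' is an assumption of this sketch; it keeps
    quoted local parts that themselves contain '@' from being rejected.
    """
    local, at, domain = address.rpartition("@")
    # rpartition returns ('', '', address) when no '@' is present,
    # so an empty `at` means the separator was missing.
    return bool(at) and bool(local) and bool(domain)


for candidate in ["user@example.com", "no-at-sign", "@example.com", "user@"]:
    print(f"{candidate!r}: {looks_like_email(candidate)}")
```

It will wave through plenty of garbage, but it should not turn away a valid address the way an over-eager regex does.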

mewpmewp2 · on March 28, 2023

There are top-level domains much longer than 6 characters, though.

And the second one is confusing me. It seems to be matching a single character only for the initial portion?

Neither of them seems good, especially the last one.

And the way it describes both seems off as well. I would have to say it brings more harm than good based on that.

bob1029 · on March 27, 2023

This looks like good performance. We are keeping an open mind with regard to actually-open alternatives.