"I will hire you anyway, but I will pay you less" Regex, useful in any job...

username_my1 · on March 20, 2024

regex is useful but chatgpt is amazing at it, so why spend a minute keeping such useless knowledge in mind.

if you know where to find something no point in knowing it.

ykonstant · on March 20, 2024

Does gpt produce efficient regex? Are there any experts here that can assess the quality and correctness of gpt-generated regex? I wonder how regex responses by gpt are validated if the prompter does not have the knowledge to read the output.

da39a3ee · on March 20, 2024

You don't have to be an expert; you should very rarely be using regexes so complex that you can't understand them.

zacmps · on March 20, 2024

It might not be obvious when you hit that point, bad regexes can be subtle, just see that old cloudflare postmortem.

mnau · on March 20, 2024

Even simple regexs can be problematic, e.g. Gitlab RCE bug through ExifTools

https://devcraft.io/2021/05/04/exiftool-arbitrary-code-execu...

> "a\ > ""

> The second quote was not escaped because in the regex $tok =~ /(\\+)$/ the $ will match the end of a string, but also match before a newline at the end of a string, so the code thinks that the quote is being escaped when it’s escaping the newline.

hnlmorg · on March 20, 2024

...and if you can understand them then you clearly understand regex enough not to need ChatGPT to write them

kaibee · on March 20, 2024

I understand assembly too.

hnlmorg · on March 21, 2024

The only time you'd want to write assembly in production code would be if you need to hand roll some optimisation. So I don't really understand your point here,

2devnull · on March 20, 2024

That was one of my first uh oh moments with gpt. Getting code that clearly had untestable/unreadable regexen, which given the source must have meant the regex were gpt generated. So much is going to go wrong, and soon.

thecatspaw · on March 20, 2024

what does gpt say how we should validate email addresses?

criley2 · on March 20, 2024

Prompt:

'I'm writing a nodejs javascript application and I need a regex to validate emails in my server. Can you write a regex that will safely and efficiently match emails?'

GPT4 / Gemini Advanced / Claude 3 Sonnet

GPT4: `const emailRegex = /^[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/;` Full answser: https://justpaste.it/cg4cl

Gemini Advanced: `const emailRegex = /^[a-zA-Z0-9.!#$%&'+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)$/;` Full answer: https://justpaste.it/589a5

Claude 3: `const emailRegex = /^([a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})$/;` Full answer: https://justpaste.it/82r2v

dfawcus · on March 20, 2024

Whereas email more or less lasts forever (mailbox contents), and has to be backwards compatible with older versions back to (at least) RFC 821/822, or those before. It also allows almost any character (when escaped at 821 level) in the host or domain part (domain names allow any byte value).

So a Internet email address match pattern has to be: "..*@..*", anything else can reject otherwise valid addresses.

That however does not account for earlier source routed addresses, not the old style UUCP bang paths. However those can probably be ignored for newly generated email.

I regularly use an email address with a "+" in the host part. When I used qmail, I often used addresses like: "foo-a/b-bar-tat@DOMAIN". Mainly for auto filtering received messages from mailing lists.

zaxomi · on March 20, 2024

Still doesn't support internationalized domain names.

croemer · on March 20, 2024

Terrible answers as far as I can tell, especially Chat got would throw out many valid email addresses.

rhd · on March 20, 2024

chatgpt-4:

^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$

https://chat.openai.com/share/696f7046-7f43-4331-b12b-538566...

chatgpt-3.5:

^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$

https://chat.openai.com/share/aaa09ae8-3fd9-4df7-a417-948436...

layer8 · on March 20, 2024

…which both excludes addresses allowed by the RFC and includes addresses disallowed by the RFC. (For example, the RFC disallows two consecutive dots in the local-part.)

KMnO4 · on March 20, 2024

I take the descriptivist approach to email validation, rather than the prescriptivist.

I know an email has to have a domain name after the @ so I know where to send it.

I also know it has to have something before the @ so the domain’s email server knows how to handle it.

But do I care if the email server is supports sub addresses, characters outside of the commonly supported range (eg quotation marks and spaces), or even characters which aren’t part of the RFC? I do not.

If the user gives me that email, I’ll trust them. Worst case they won’t receive the verification email and will need to double check it. But it’s a lot better than those websites who try to tell me my email is invalid because their regex is too picky.

jcranmer · on March 20, 2024

The HTML email regex validation [1] is probably the best rule to use for validating an email address in most user applications. It prohibits IP address domain literals (which the emailcore people have basically said is of limited utility [2]), and quoted strings in the localpart. Its biggest fault is allowing multiple dots to appear next to each other, which is a lot of faff to put in a regex when you already have to individually spell out every special character in atext.

[1] https://html.spec.whatwg.org/multipage/input.html#email-stat...

[2] https://datatracker.ietf.org/doc/draft-ietf-emailcore-as/

layer8 · on March 20, 2024

I generally agree, but the two consecutive dots (or leading/trailing dots) are an example that would very likely be a typo and that you wouldn’t particularly want to send. Similar for unbalanced quotes, angle brackets, and other grammar elements.

dumbo-octopus · on March 20, 2024

I wonder whether simply (regex) replacing a sequence of .'s with a single one as part of a post-processing step would be effective.

layer8 · on March 20, 2024

That would be bad form, IMO. The user may have typed john..kennedy@example.com by mistake instead of john.f.kennedy@example.com, and now you’ll be sending their email to john.kennedy@example.com. Similar for leading or trailing dots. You can’t just decide what a user probably meant, when they type in something invalid.

wtetzner · on March 20, 2024

Yeah, that's about as far as I've ever been comfortable going in terms of validating email addresses too: some stuff followed by "@" followed by more stuff.

Though I guess adding a check for invalid dot patterns might be worthwhile.

marcosdumay · on March 20, 2024

What is maybe more important to note, it completely disallows the language of some 4/5 of the humanity. And partially disallows some 2/3 of the rest.

sebstefan · on March 20, 2024

Actually pretty good response if the programmer bothers to read all of it

I'd be more emphatic that you shouldn't rely on regexes to validate emails and that this should only be used as an "in the form validation" first step to warn of user input error, but the gist is there

> This regex is *practical for most applications* (??), striking a balance between complexity and adherence to the standard. It allows for basic validation but does not fully enforce the specifications of RFC 5322, which are much more intricate and challenging to implement in a single regex pattern.

^ ("challenging"? Didn't I see that emails validation requires at least a grammar and not just a regex?)

> For example, it doesn't account for quoted strings (which can include spaces) in the local part, nor does it fully validate all possible TLDs. Implementing a regex that fully complies with the RFC specifications is impractical due to their complexity and the flexibility allowed in the specifications.

> For applications requiring strict compliance, it's often recommended to use a library or built-in function for email validation provided by the programming language or framework you're using, as these are more likely to handle the nuances and edge cases correctly. Additionally, the ultimate test of an email address's validity is sending a confirmation email to it.

bonki · on March 20, 2024

Not good at all, but a little better than expected. I use + in email addresses prominently and there are so many websites who don't even allow that...

zaxomi · on March 20, 2024

Remember to first punycode the domain part of an email address before trying to validate it, or it will not work with internationalized domain names.

jameshart · on March 20, 2024

Support for IDN email addresses is still patchy at best. Many systems can’t send to them; many email hosts still can’t handle being configured for them.

skeaker · on March 20, 2024

There really ought to be a regex repository of common use cases like these so we don't have to reinvent the wheel or dig up a random codebase that we hope is correct to copy from every time.

berkes · on March 20, 2024

> if you know where to find something no point in knowing it.

Nonsense. And you know it.

First, you need to know what to find, before knowing where to find it. And knowing what to find requires intricate knowledge of the thing. Not intricate implementation details, but enough to point yourself in the right direction.

Secondly, you need to know why to find thing X and not thing Y. If anything, ChatGPT is even worse than google or stackoverflow in "solving the XY problem for you". XY is a problem you don't want solved, but instead to be told that you don't want to solve it.

Maybe some future LLM can also push back. Maybe some future LLM can guide you to the right answer for a problem. But at the current state: nope.

Related: regexes are almost never the best answer to any question. They are available and quick, so all considered, maybe "the best" for this case. But overall: nah.

pksebben · on March 20, 2024

While I agree with your point that knowing things matters, it is entirely possible with the current batch of LLMs to get to an answer you don't know much about. It's actually one of the few things they do reliably well.

You start with what you do know, asking leading questions and being clear about what you don't, and you build towards deeper and deeper terminology until you get to the point where there are docs to read (because you still can't trust them to get the specifics right).

I've done this on a number of projects with pretty astonishing results, building stuff that would otherwise be completely out of my wheelhouse.

lolc · on March 20, 2024

Funny for me there have been instances where the LLM did push back. I had a plan of how to solve something and tasked the LLM with a draft implementation. It kept producing another solution which I kept rejecting and specifying more details so it wouldn't stray. In the end I had to accept that my solution couldn't work, and that the proposed one was acceptable. It's going to happen again, because it often comes up with inferior solutions so I'm not very open to the reverse situation.

berkes · on March 21, 2024

I should have clarified better. Because, indeed, I have the same experience with copilot. Where it suggested code that I disliked but was actually the right one and mine the wrong one.

I was talking about X-Y on a higher level though. Architecture, Design Patterns, that kind of stuff. LLMs are (still?) particularly bad at this. Which is rather obvious if you think of them as "just" statistical models: it'll just suggest what is done most often in your context, not what is current best for your context.

lolc · on March 21, 2024

Yea I don't think the crop of LLM is useful for this. They let themselves be lead by what's written, and struggle to understand negation even. So when I suspect there is a better solution, I have a hard time getting such an answer, even if asking explicitly for alternatives. I doubt it's just a question of training, they seem to lock themselves on the context. When using Phind, this is somewhat mitigated by mixing in context from the web, which can lead to responses that include alternatives.

HumblyTossed · on March 20, 2024

This is something ChatGPT would say.

gitaarik · on March 21, 2024

Sure, why bother understanding anything if ChatGPT can just produce the answers for you ;). You don't have to understand the answers even. Actually you don't need to understand the question also. Just forward the question from your manager to ChatGPT and forward the answer back to your manager ;). Why make life difficult for yourself?