Sorry, but the whole point is to not use a closed-source third-party API with a dubious privacy policy run by a multinational surveillance-capitalism megacorporation.
When you use it via the first-party Azure service, the data you input is de facto guaranteed not to be fed back into the model, so you don't expand your circle of trust by feeding your code to MSFT.
I did not verify this regex, but on a surface scan it seems OK:
./main -m ./models/7B/ggml-model-q4_0.bin -t 4 --temp 0.7 --top_k 40 --top_p 0.5 --repeat_last_n 256 --repeat_penalty 1.17647 -n 1024 -p $'Here is a handy short form regex to validate an email address:
'
Here is a handy short form regex to validate an email address:
^([a-zA-Z0-9_.+-]+@[a-zA-Z0-9.-]+\.[a-z\.]{2,6})+$
The first character can be either uppercase or lower case. The second group must contain at least one letter and the third group may contain any number of characters (up to 5). The last part ensures that it ends with @ followed by two more letters separated by dots.
If you want to make sure that your input string contains only valid characters for emails then use this regex instead:
\A[\w.]@[\w.]\.\w{1,4}\z
Badly wrong, in part because your prompt was badly wrong: there is no short regex to validate an email address. (It needs a way of saying “I can’t complete that because the prompt makes no sense for this reason”.)
What it emitted accepts a large number of invalid addresses (due to things like not checking dot placement, and the inexplicable (…)+ wrapping around the entire thing), and doesn’t accept a large number of valid addresses (some comparatively esoteric, like local parts containing any of !#$%&'*/=?^`{|}~ or IP addresses for the domain name, and some very reasonable, like TLDs of more than six characters, or internationalised TLDs even in Punycode form).
The description it emits does not match the regular expression at all well, either.
The second regex it emits is even worse than the first, unnecessarily uses PCRE-specific syntax, and is given with a nonsensical description. (Note: the asterisks got turned into italics; you need to backslash-escape them here on HN. With this fixed, the regex was \A[\w.]*@[\w.]*\.\w{1,4}\z.)
> but on a surface scan it seems OK
And there’s the danger of this stuff. As a subject-matter expert on regex and email, I glanced at the regular expression and was immediately appalled (… quite apart from the whole “here we go again, this is certain to be terrible” cringe on the prompt). But it looks plausible enough if you aren’t.
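To make those failure modes concrete, here is a quick check with Python's re, testing both regexes the model emitted (the second one with its asterisks restored and its \z written as Python's equivalent \Z), against a few addresses a correct validator should accept or reject:

import re

# First regex, exactly as the model emitted it
RE1 = re.compile(r"^([a-zA-Z0-9_.+-]+@[a-zA-Z0-9.-]+\.[a-z\.]{2,6})+$")
# Second regex with the asterisks restored; PCRE's \z written as Python's \Z
RE2 = re.compile(r"\A[\w.]*@[\w.]*\.\w{1,4}\Z")

tests = [
    ("user@example..com", False),              # consecutive dots in the domain: invalid
    ("a@b.coa@b.co", False),                   # two addresses concatenated: invalid, but the (...)+ wrapping lets it through RE1
    ("first.last@example.photography", True),  # TLD longer than six characters: valid
    ("@.com", False),                          # empty local part: invalid, but RE2 accepts it
]

for addr, should_be_valid in tests:
    print(f"{addr:35} valid={should_be_valid}  RE1={bool(RE1.match(addr))}  RE2={bool(RE2.match(addr))}")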
It's a bit crazy to me that someone posts a regex like that without verifying it, says it looks good at surface level, and implies the whole thing was useful and a good result.
I said it looks OK, not good. My comment is mostly about me being surprised that a valid regex came out at all. I also asked it to write a regex to parse HTML, which it happily answered. What does GPT-4 say about parsing HTML? ;)
But it is going to be either useful or harmful. Harmful if the regex validation itself is worse than doing no validation at all, or worse than a very simple validation that just checks that an @ is included somewhere.
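For reference, that bare-minimum baseline is essentially a one-liner:

def minimal_check(addr: str) -> bool:
    # the "very simple validation" described above: just require an @ somewhere
    return "@" in addr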
For comparison, GPT-4 provides the following Python regex and then warns that it does not catch all edge cases and that it's better to use a dedicated library like email-validator:
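The dedicated-library route it points to looks roughly like this (a minimal sketch, assuming the email-validator package on PyPI):

# pip install email-validator
from email_validator import validate_email, EmailNotValidError

def is_valid_address(addr: str) -> bool:
    try:
        # check_deliverability=False skips the DNS lookup and checks syntax only
        validate_email(addr, check_deliverability=False)
        return True
    except EmailNotValidError:
        return False

print(is_valid_address("first.last@example.photography"))  # True
print(is_valid_address("user@example..com"))                # False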
LLaMA's 7B model, run internally, is for me on a totally different level quality-wise. Even when explicitly instructed not to make things up and to just say 'I don't know', it will still go ahead, ramble, and invent things. When I tell it to use only the prompt data, it will still invent, or simply ignore the prompt data. It's not useful for production (i.e., to be exposed to 'regular', non-AI users).
ChatGPT, on the other hand, will listen to those instructions, will say when it does not know, and will keep to only the prompt data.
One problem with the current way these models are trained is that they have no idea what they're saying. It's just a recursive guess-the-next-word type of algorithm. I would not expect the confidence levels of any given fragment, let alone an average, to be a meaningful predictor of truth.
Also, completely spitballing: I expect that a big chunk of OpenAI's 'secret sauce' is simple processing layers above and beyond the model itself. If you input gibberish to llama, does it give you an output? If OpenAI is artificially tokenizing inputs (as opposed to just sending them straight to the software), it would both dramatically limit the input domain, thus improving output tuning, and give "it" the ability to say when it doesn't know something. I put "it" in quotes since that response would not be coming from the LLM, but from the preprocessing/tokenization system returning an error code in natural language.
I think there's some weak indirect evidence for this in the service itself, since incoherent inputs are instantly rejected, whereas even simple queries take dramatically longer to output even the first word. It's like the input is not even being sent to the LLM software for processing.
I've been debating the idea of building tiers or layers of models to accomplish the same.
It very well could be that this go/no-go pre-processor is simply another ML model trained on a binary classification task. Stack a few of these and you can wind up with some interesting programming models.
This would also explain the ease with which ChatGPT gets rid of escapes/bad prompts: they have an additional layer that assesses whether the question could be, for example, racist, and then spits out a 'Sorry, as a language model I am not trained to answer this kind of question'. No need to retrain the main 14B transformer model.
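A minimal sketch of that kind of stacked gate (everything here is a hypothetical stand-in; the real gates would presumably be small trained classifiers sitting in front of the expensive generative model):

from typing import Callable, List, Tuple

def looks_incoherent(prompt: str) -> bool:
    # placeholder heuristic standing in for a learned "is this a real question?" classifier
    return len(prompt.split()) < 2

def looks_disallowed(prompt: str) -> bool:
    # placeholder standing in for a learned moderation classifier
    return "forbidden topic" in prompt.lower()

def generate(prompt: str) -> str:
    # stand-in for the expensive LLM call
    return f"<model completion for {prompt!r}>"

# each gate pairs a cheap yes/no classifier with a canned refusal
GATES: List[Tuple[Callable[[str], bool], str]] = [
    (looks_incoherent, "Sorry, I don't understand the question."),
    (looks_disallowed, "Sorry, as a language model I am not trained to answer this kind of question."),
]

def answer(prompt: str) -> str:
    for predicate, canned_response in GATES:
        if predicate(prompt):
            return canned_response  # reject before the big model ever runs
    return generate(prompt)

print(answer("asdf"))
print(answer("Write a regex to validate an email address."))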
Even if we had a 100% private ChatGPT instance, it wouldn't fully cover our internal use case.
There is way more context to our business than can fit in 4/8/32k tokens. Even if we could fit within the 32k token budget, it would be very expensive to run like that 24/7. Fine-tuning a base model is the only practical/affordable path for us.