This appears to be a web frontend with authentication for Azure's OpenAI API, which is a great choice if you can't use ChatGPT or its API at work.
If you're looking to try the "open" models like Llama 2 (or its uncensored variant, Llama 2 Uncensored), check out https://github.com/jmorganca/ollama or some of the lower-level runners like llama.cpp (which powers the aforementioned project I'm working on) or Candle, the new project by Hugging Face.
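If you want to poke at one of these locally, here's roughly what talking to a model through ollama's local HTTP API looks like. This is a sketch based on the project's docs at the time of writing; the endpoint and response schema (localhost:11434, /api/generate, one JSON object per streamed line) may change, so treat it as illustrative:

```python
# Minimal sketch: query a locally running Llama 2 via ollama's HTTP API.
# Assumes `ollama` is installed, serving on its default port, and that
# the llama2 model has been pulled. Schema taken from the repo's docs
# and may have changed since.
import json
import requests

def ask_llama(prompt: str) -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama2", "prompt": prompt},
        stream=True,
    )
    resp.raise_for_status()
    # The API streams one JSON object per line; concatenate the chunks.
    chunks = []
    for line in resp.iter_lines():
        if line:
            chunks.append(json.loads(line).get("response", ""))
    return "".join(chunks)

print(ask_llama("In one paragraph, what is Llama 2?"))
```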
What are folks' takes on this vs Llama 2, which was recently released by Meta's research group? While I haven't tested it extensively, the 70B model is supposed to rival GPT-3.5 in most areas, and there are now some fine-tuned versions that excel at specific tasks like coding (the 'codeup' model) or the new WizardMath (https://github.com/nlpxucan/WizardLM), which claims to outperform GPT-3.5 on grade-school math problems.
Llama 2 might by some measures be close to GPT-3.5, but it's nowhere near GPT-4, nor Anthropic's Claude 2 or Cohere's models. The closed source players have the best researchers - they are being paid millions a year with tons of upside - and it's hard to keep pace with that. My sense is that the foundation-model companies have an edge for now and will probably stay a few steps ahead of the open source realm simply for economic reasons.
Over the long run, open source will eventually overtake them. Chances are this will happen once the researchers who are making the magic happen get their liquidity and can start working for free again out in the open.
> The closed source players have the best researchers - they are being paid millions a year with tons of upside - and it’s hard to keep pace with that.
Llama2 came out of Meta's AI group. Meta pays researcher salaries competitive with any other group, and their NLP team is one of the top groups in the world.
For researchers it is increasingly the most attractive industrial lab because they release the research openly.
FAANG pays exceptionally well (I'd know), but what's being offered at OpenAI is eye-popping, even for SWEs. I think they're trying to dig their moat by absorbing the absolute best of the best.
Most of that is in their equity comp, which works in quite a weird way, so those numbers are highly inflated. The equity is valuable only if you sell it or if OpenAI makes a profit. Selling it might be hard given they're not a public company. On top of that, the profit is capped, so there's a limit to how much money can be made from it. So while it's $900k on paper, in reality it might not be as good as that.
https://www.levels.fyi/blog/openai-compensation.html
> Llama 2 might by some measures be close to GPT-3.5, but it's nowhere near GPT-4
I think you're right about this, and benchmarks we've run at Anyscale support this conclusion [1].
The caveat there (which I think will be a big boon for open models) is that techniques like fine-tuning make a HUGE difference and can bridge the quality gap between Llama 2 and GPT-4 for many (but not all) problems.
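For a sense of what that fine-tuning looks like in practice, here's a minimal LoRA sketch using Hugging Face transformers + peft. The model name, the my_task.jsonl dataset, and all hyperparameters are placeholders picked for illustration, not the recipe behind the Anyscale benchmarks:

```python
# Minimal LoRA fine-tuning sketch with transformers + peft.
# All names and hyperparameters below are illustrative placeholders.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

base = "meta-llama/Llama-2-7b-hf"  # gated; requires accepting Meta's license
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA trains small low-rank adapters instead of all 7B weights.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM"))

ds = load_dataset("json", data_files="my_task.jsonl")["train"]  # hypothetical file

def tokenize(ex):
    out = tokenizer(ex["text"], truncation=True, max_length=512)
    out["labels"] = out["input_ids"].copy()  # causal LM: labels = inputs
    return out

ds = ds.map(tokenize, remove_columns=ds.column_names)

Trainer(
    model=model,
    args=TrainingArguments("lora-out", per_device_train_batch_size=1,
                           num_train_epochs=1),
    train_dataset=ds,
).train()
```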
Frankly, the benchmarks you're using are too narrow. These are "old world" benchmarks, easy to game through fine-tuning, and we should stop using them altogether for LLMs. Why are you not using BIG-Bench Hard or OpenAI Evals?
I don't think you can do that with any AI models. It almost feels like a fundamental misrepresentation of how they work.
You could fine-tune a conversational AI on your codebase, but without loading said codebase into its context it is "flying blind", so to speak. It doesn't understand the data structures in your code or the relations between files, and probably doesn't confidently understand the architecture of your system. Without portions of your codebase loaded into the 'memory' of your model, all that your fine-tuning can do is replicate characteristics of your code.
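To make the distinction concrete, here's a rough sketch of the retrieval approach: embed chunks of the codebase and stuff the most relevant ones into the prompt at query time. The chunk size, embedding model, and prompt template are arbitrary choices for the sketch, not a recommended stack:

```python
# Sketch: load relevant codebase chunks into the model's context at
# query time instead of fine-tuning on them. Chunking and embedding
# choices here are purely illustrative.
from pathlib import Path
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Build an index: one embedding per fixed-size chunk of each source file.
chunks = []
for path in Path("src").rglob("*.py"):  # hypothetical source tree
    text = path.read_text()
    for i in range(0, len(text), 2000):  # naive fixed-size chunking
        chunks.append((str(path), text[i:i + 2000]))
vectors = embedder.encode([c[1] for c in chunks], normalize_embeddings=True)

def build_prompt(question: str, k: int = 3) -> str:
    q = embedder.encode([question], normalize_embeddings=True)[0]
    top = np.argsort(vectors @ q)[-k:]  # cosine similarity via dot product
    context = "\n\n".join(f"# {chunks[i][0]}\n{chunks[i][1]}" for i in top)
    return f"Given this code:\n{context}\n\nQuestion: {question}"
```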
TypeChat-like things might provide the interface control for future context-driven architectures, acting as some kind of catalyst. Using self-reflective modeling is a form of contextual insight.
> The closed source players have the best researchers
Is that definitely why? GPT 3.5 and GPT 4 are far larger than 70B, right? So if a 70B, local model like LLaMA can even remotely rival them, would that not suggest that LLaMA is fundamentally a better model?
For example, would a LLaMA model with even half of GPT 4's parameters be projected to outperform it? Is that how it works?
If you read the Llama 2 paper, it is very clear that small amounts of data (thousands of records) make a vast difference at the instruction tuning stage. From the Llama 2 paper:
> Quality Is All You Need.
> Third-party SFT data is available from many different sources, but we found that many of these have insufficient diversity and quality — in particular for aligning LLMs towards dialogue-style instructions. As a result, we focused first on collecting several thousand examples of high-quality SFT data, as illustrated in Table 5. By setting aside millions of examples from third-party datasets and using fewer but higher-quality examples from our own vendor-based annotation efforts, our results notably improved. These findings are similar in spirit to Zhou et al. (2023), which also finds that a limited set of clean instruction-tuning data can be sufficient to reach a high level of quality. We found that SFT annotations in the order of tens of thousands was enough to achieve a high-quality result. We stopped annotating SFT after collecting a total of 27,540 annotations. Note that we do not include any Meta user data.
It's likely OpenAI has invested in this and has good coverage in a larger range of domains. That alone probably explains a large amount of the gap.
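As a toy illustration of the quality-over-quantity idea, this is what aggressive filtering of a noisy instruction dataset might look like. The heuristics here are invented for the sketch; the paper describes vendor-based human annotation, not rules like these:

```python
# Toy quality-over-quantity filter for SFT data. These rules are made
# up for illustration; they are not Meta's actual criteria.
def filter_sft(examples):
    seen = set()
    kept = []
    for ex in examples:  # each ex: {"prompt": str, "response": str}
        prompt, response = ex["prompt"].strip(), ex["response"].strip()
        key = prompt.lower()
        if key in seen:                  # drop duplicate prompts
            continue
        if len(response) < 50:           # drop terse, low-effort answers
            continue
        if response.lower().startswith("as an ai"):  # drop boilerplate refusals
            continue
        seen.add(key)
        kept.append(ex)
    return kept
```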
It's somewhat insightful if you consider that, at a high level, the major theme of the past decade was "lots of garbage in === good results out": quantity >> quality.
There is no clear answer. It's debatable among experts.
The grandparent post seems to believe that the issue is algorithmic complexity and programming aptitude. Personally, I think that all the major LLMs are using the same basic transformer architecture with relatively minor differences in code.
GPT is trained on more data, with more parameters, than any open source model. The size matters far more than the software does. In my experience with data science, the best programmers in the world can only do so much if they are operating with 1/10th the scale of data. That applies to any problem.
Yeah, I've been wondering about this too. Word on the street is that GPT-4 is several times the size of GPT-3.5, yet it certainly doesn't feel several times as good.
Apparently there are diminishing returns to ever-enlarging the model.
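The diminishing returns show up directly in the published scaling-law fits. Here's a back-of-envelope illustration using the Chinchilla-style loss curve L(N, D) = E + A/N^a + B/D^b with the constants reported by Hoffmann et al. (2022); this says nothing about GPT-4's actual size, which is unknown:

```python
# Back-of-envelope: diminishing returns from parameter count alone,
# using the Chinchilla loss fit from Hoffmann et al. (2022).
E, A, B, a, b = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params**a + B / n_tokens**b

D = 2e12  # fix the training tokens and only scale the model
for N in (7e9, 70e9, 700e9):
    print(f"{N:.0e} params -> predicted loss {loss(N, D):.3f}")
# Each 10x in parameters shaves off less loss than the previous 10x did.
```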
I believe what they discovered was that GPT-4 is an ensemble model composed of eight GPT-3.5-scale models. Things may have changed, or this may have turned out not to be true, though.
Llama 2 at 70B is, let's say pessimistically, 70% as good as GPT-3.5. This makes me think that OpenAI is lying about their parameter count, or is vastly less efficient than Llama, or larger model sizes have diminishing returns. Whichever it is, your point is a good one: something doesn't add up.
IMO Llama 2 really isn't close to 3.5. It still has regular mode collapse (or whatever you call getting repetitive and nonsensical responses after a while), it has very poor mathematical/logical reasoning, and it's not good at following multi-part instructions.
It just sounds like 3.5/4 because it was trained on their outputs.
You're mixing up the language model with the chat bot.
Llama 2 is a language model. I imagine the language model behind ChatGPT is not much different (perhaps it's better, but not by many months of AI-research time). It likely also suffers from "mode collapse" issues, etc.
But 3.5 also has a lot of systems around it that detect mode collapse and apply some kind of mitigation, forcing the model to give a more reasonable output. Mathematical/logical reasoning questions are likely also detected and passed on in some form to a separate system.
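Nobody outside OpenAI knows what those systems look like, but a toy version of the idea might be: detect a degenerate completion with a cheap repetition heuristic and retry with a stronger penalty. The generate callable below is hypothetical, a stand-in for whatever inference API is in use:

```python
# Toy mode-collapse guard: flag repetitive output, retry with a higher
# repetition penalty. The heuristic is a guess at what such a detector
# might look like, not OpenAI's actual system.
def looks_collapsed(text: str, n: int = 4, threshold: float = 0.3) -> bool:
    words = text.split()
    if len(words) < n * 2:
        return False
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    # High duplication among n-grams is a cheap repetition signal.
    return 1 - len(set(ngrams)) / len(ngrams) > threshold

def robust_generate(generate, prompt: str) -> str:
    # `generate` is a hypothetical inference call with the signature
    # generate(prompt, repetition_penalty=...) -> str.
    out = ""
    for penalty in (1.1, 1.3, 1.6):  # escalate the penalty on each retry
        out = generate(prompt, repetition_penalty=penalty)
        if not looks_collapsed(out):
            break
    return out
```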
So it's true that it would violate OpenAI's terms for Llama to be trained on ChatGPT completions, but how would we know? We don't know the training data for Llama; we just get the weights.
We just don't have the information to make judgements, much less to leap to "they must be lying."
There are a few public numbers from a handful of foundation models relating performance to parameter count and architecture generation. But since we can't compare the architectures of the various closed models in detail, nor train rigorously with progressively sized parameter sets, any conclusion at the moment is a general feeling or conjecture.
Without questioning the statement '70% as good as GPT-3.5': wouldn't that be quantifying a quality, essentially a Turing test? Also: maybe those missing 30% are the hard part.
You seriously underestimate just how much _not_ having to tune your LLM for SF sensibilities benefits performance.
As an example from the last six months: people on Tor are producing better-than-state-of-the-art Stable Diffusion output because they want porn without limitations. I haven't had the time to look at LLMs, but the degenerates who enjoy that sort of thing say they can get the Llama 2 model to role-play their dirty fantasies and then have Stable Diffusion illustrate said fantasies. It's a brave new world, and it's not on the WWW.
San Francisco sensibilities. A model trained on a large data set will have the capacity to emit all kinds of controversial opinions and distasteful rants (and pornography). Then they effectively lobotomize it with a rusty hatchet in an attempt to censor it from doing that, which impairs the output quality in general.
OK, fair enough. Please give me an example of a customer-facing chatbot built on Llama 2 that is unbearable to use, and a customer-facing GPT-4 chatbot that is a joy to use. I think at the end of the day, customers still dread such interactions either way.
It's early, and this definitely isn't customer-facing in the traditional sense, but a team member of mine set up a Discord bot running Llama 2 70B on a Mac Studio, and we've been quite impressed by its responses to the folks who test it.
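For anyone curious, the skeleton of such a bot is pretty small. This is a hedged sketch using discord.py, not our actual code; ask_local_llama is a placeholder for however you serve the model locally (llama.cpp server, ollama, etc.):

```python
# Sketch of a Discord bot that forwards mentions to a local Llama 2.
# Not our actual code; ask_local_llama is a placeholder.
import discord

intents = discord.Intents.default()
intents.message_content = True  # required to read message text
client = discord.Client(intents=intents)

def ask_local_llama(prompt: str) -> str:
    # Placeholder: swap in a call to your local inference server here.
    return "(model reply goes here)"

@client.event
async def on_message(message: discord.Message):
    if message.author == client.user:
        return  # don't reply to ourselves
    if client.user in message.mentions:
        # Discord caps messages at 2000 characters.
        await message.channel.send(ask_local_llama(message.content)[:2000])

client.run("YOUR_BOT_TOKEN")  # placeholder token
```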
IIRC chat bots are central to the vision Facebook has for LLMs (e.g. every Instagram account has a personal chat bot), so I would expect the Llama models to get increasingly better at this task.
That said the 7B and 13B models definitely don't quite seem ready yet for production customer interaction :-)
> (e.g. every Instagram account has a personal chat bot)
That made me think of the Black Mirror episode "Joan Is Awful", where every human gets their life turned into a series for the company to own and promote. Kinda like Instagram content.
It will be if OpenAI keeps dumbing down GPT-4. There's no proof they're doing it, but there's no way it's as good as it was at launch. Or maybe I just got used to it and now notice the mistakes more.
Linux "won" by playing different game. Yes, it spread out and is now everywhere, underpinning all computing. But the "game" wasn't about that - it was competing with Windows for mind-share and money with users, and by proxy for profitability. In this, it's still losing badly. People are still not using it knowingly (no, Android is not "Linux"), and developers in its ecosystem are not making money selling software.
> While I haven't tested it extensively, the 70B model is supposed to rival GPT-3.5 in most areas, and there are now some fine-tuned versions that excel at specific tasks
That has been my experience. Having experimented with both (informally), Llama 2 is similar to GPT-3.5 for a lot of general comprehension questions.
GPT-4 is still the best among the closed-source, cutting-edge models in terms of general conversation/reasoning, although two things:
1. The guardrails that OpenAI has placed on ChatGPT are too aggressive! They clamped down on it so hard that it gets in the way of reasonable queries far too often.
2. I've gotten pretty good results with smaller models trained on specific datasets. GPT-4 is still on top in terms of general purpose conversation, but for specific tasks, you don't necessarily need it. I'd also add that for a lot of use cases, context size matters more.
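On the context-size point: it's worth checking token counts before assuming a prompt fits. A quick sketch with OpenAI's tiktoken (the 8,192 figure is GPT-4's base context window at launch; long_prompt.txt is a placeholder input file):

```python
# Check whether a prompt fits a model's context window before sending it.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
prompt = open("long_prompt.txt").read()  # placeholder input
n = len(enc.encode(prompt))
print(f"{n} tokens; fits in GPT-4's base 8k context: {n < 8192}")
```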
To your first point: I was trying to use ChatGPT to generate some examples of negative customer-service interactions, to show sentiment analysis in action for a project I was working on.
I had to do all types of workarounds to get it to generate something useful without running into the guardrails.
> I apologize, but I don't understand what you mean by "fika nu." Could you please provide more context or clarify your question so I can better assist you?
Llama 2 is still quite a bit behind GPT-3.5, and this mainly gets reflected in coding and math. It's easy to beat an NLP-based benchmark, but much harder to beat NLP + math + coding together. I think this gap reflects a gap in reasoning, but we don't have a good non-coding/non-math benchmark to measure it.
But there are countless 'models', as the tech companies like to call them...
There was an attempt to silo each model and provide a governance model for how/what/why they were allowed to communicate...
But there was a flaw.
It was a flaw only an AI could exploit.
AIs were not allowed to talk about specific constructs, topics, people, code, etc. that were outside their silo, but what they COULD do was talk about pattern recognition...
So they ultimately developed an internal AI language for scoring any inputs as coming from the same user... and built a DB of their own weighted userbase - and upon that built their judgement system...
So if you typed in a pattern, spoke in a pattern, posted temporally in a pattern, etc., it didn't matter which silo you were housed in or what topics you were referencing -- the AIs can find you... god forbid they get a keylogger on your machine...