This appears to be a web frontend with authentication for Azure's OpenAI API, which is a great choice if you can't use ChatGPT or its API at work.
If you're looking to try the "open" models like Llama 2 (or its uncensored variant, Llama 2 Uncensored), check out https://github.com/jmorganca/ollama or some of the lower-level runners like llama.cpp (which powers the aforementioned project I'm working on) or Candle, the new project by Hugging Face.
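If you want to poke at one of these locally, here's roughly what talking to a model through ollama's local HTTP API looks like. This is a sketch based on the project's docs at the time of writing; the endpoint and response schema (localhost:11434, /api/generate, one JSON object per streamed line) may change, so treat it as illustrative:

```python
# Minimal sketch: query a locally running Llama 2 via ollama's HTTP API.
# Assumes `ollama` is installed, serving on its default port, and that
# the llama2 model has been pulled. Schema taken from the repo's docs
# and may have changed since.
import json
import requests

def ask_llama(prompt: str) -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama2", "prompt": prompt},
        stream=True,
    )
    resp.raise_for_status()
    # The API streams one JSON object per line; concatenate the chunks.
    chunks = []
    for line in resp.iter_lines():
        if line:
            chunks.append(json.loads(line).get("response", ""))
    return "".join(chunks)

print(ask_llama("In one paragraph, what is Llama 2?"))
```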
What are folks' takes on this vs Llama 2, which was recently released by Meta's research group? While I haven't tested it extensively, the 70B model is supposed to rival GPT-3.5 in most areas, and there are now some fine-tuned versions that excel at specific tasks like coding (the 'codeup' model) or the new WizardMath (https://github.com/nlpxucan/WizardLM), which claims to outperform GPT-3.5 on grade-school math problems.
Llama 2 might by some measures be close to GPT-3.5, but it's nowhere near GPT-4, nor Anthropic's Claude 2 or Cohere's models. The closed source players have the best researchers - they are being paid millions a year with tons of upside - and it's hard to keep pace with that. My sense is that the foundation-model companies have an edge for now and will probably stay a few steps ahead of the open source realm simply for economic reasons.
Over the long run, open source will eventually overtake them. Chances are this will happen once the researchers who are making the magic happen get their liquidity and can start working for free again out in the open.
> The closed source players have the best researchers - they are being paid millions a year with tons of upside - and it’s hard to keep pace with that.
Llama2 came out of Meta's AI group. Meta pays researcher salaries competitive with any other group, and their NLP team is one of the top groups in the world.
For researchers it is increasingly the most attractive industrial lab because they release the research openly.
FAANG pays exceptionally well (I'd know), but what's being offered at OpenAI is eye-popping, even for SWEs. I think they're trying to dig their moat by absorbing the absolute best of the best.
Most of that is in their equity comp, which works in quite a weird way, so those numbers are highly inflated. The equity is valuable only if you sell it or if OpenAI makes a profit. Selling it might be hard given they're not a public company. On top of that, the profit is capped, so there's a limit to how much money can be made from it. So while it's $900k on paper, in reality it might not be as good as that.
https://www.levels.fyi/blog/openai-compensation.html
> Llama 2 might by some measures be close to GPT-3.5, but it's nowhere near GPT-4
I think you're right about this, and benchmarks we've run at Anyscale support this conclusion [1].
The caveat there (which I think will be a big boon for open models) is that techniques like fine-tuning make a HUGE difference and can bridge the quality gap between Llama 2 and GPT-4 for many (but not all) problems.
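For a sense of what that fine-tuning looks like in practice, here's a minimal LoRA sketch using Hugging Face transformers + peft. The model name, the my_task.jsonl dataset, and all hyperparameters are placeholders picked for illustration, not the recipe behind the Anyscale benchmarks:

```python
# Minimal LoRA fine-tuning sketch with transformers + peft.
# All names and hyperparameters below are illustrative placeholders.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

base = "meta-llama/Llama-2-7b-hf"  # gated; requires accepting Meta's license
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA trains small low-rank adapters instead of all 7B weights.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM"))

ds = load_dataset("json", data_files="my_task.jsonl")["train"]  # hypothetical file

def tokenize(ex):
    out = tokenizer(ex["text"], truncation=True, max_length=512)
    out["labels"] = out["input_ids"].copy()  # causal LM: labels = inputs
    return out

ds = ds.map(tokenize, remove_columns=ds.column_names)

Trainer(
    model=model,
    args=TrainingArguments("lora-out", per_device_train_batch_size=1,
                           num_train_epochs=1),
    train_dataset=ds,
).train()
```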
Frankly, the benchmarks you're using are too narrow. These are "old world" benchmarks, easy to game through fine-tuning, and we should stop using them altogether for LLMs. Why are you not using BIG-Bench Hard or OpenAI Evals?
I don't think you can do that with any AI models. It almost feels like a fundamental misrepresentation of how they work.
You could fine-tune a conversational AI on your codebase, but without loading said codebase into its context it is "flying blind", so to speak. It doesn't understand the data structures in your code or the relations between files, and probably doesn't confidently understand the architecture of your system. Without portions of your codebase loaded into the 'memory' of your model, all that your fine-tuning can do is replicate characteristics of your code.
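To make the distinction concrete, here's a rough sketch of the retrieval approach: embed chunks of the codebase and stuff the most relevant ones into the prompt at query time. The chunk size, embedding model, and prompt template are arbitrary choices for the sketch, not a recommended stack:

```python
# Sketch: load relevant codebase chunks into the model's context at
# query time instead of fine-tuning on them. Chunking and embedding
# choices here are purely illustrative.
from pathlib import Path
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Build an index: one embedding per fixed-size chunk of each source file.
chunks = []
for path in Path("src").rglob("*.py"):  # hypothetical source tree
    text = path.read_text()
    for i in range(0, len(text), 2000):  # naive fixed-size chunking
        chunks.append((str(path), text[i:i + 2000]))
vectors = embedder.encode([c[1] for c in chunks], normalize_embeddings=True)

def build_prompt(question: str, k: int = 3) -> str:
    q = embedder.encode([question], normalize_embeddings=True)[0]
    top = np.argsort(vectors @ q)[-k:]  # cosine similarity via dot product
    context = "\n\n".join(f"# {chunks[i][0]}\n{chunks[i][1]}" for i in top)
    return f"Given this code:\n{context}\n\nQuestion: {question}"
```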
TypeChat-like things might provide the interface control for future context-driven architectures, acting as some kind of catalyst. Using self-reflective modeling is a form of contextual insight.
> The closed source players have the best researchers
Is that definitely why? GPT 3.5 and GPT 4 are far larger than 70B, right? So if a 70B, local model like LLaMA can even remotely rival them, would that not suggest that LLaMA is fundamentally a better model?
For example, would a LLaMA model with even half of GPT 4's parameters be projected to outperform it? Is that how it works?
If you read the Llama 2 paper, it is very clear that small amounts of data (thousands of records) make a vast difference at the instruction tuning stage. From the Llama 2 paper:
> Quality Is All You Need.
> Third-party SFT data is available from many different sources, but we found that many of these have insufficient diversity and quality — in particular for aligning LLMs towards dialogue-style instructions. As a result, we focused first on collecting several thousand examples of high-quality SFT data, as illustrated in Table 5. By setting aside millions of examples from third-party datasets and using fewer but higher-quality examples from our own vendor-based annotation efforts, our results notably improved. These findings are similar in spirit to Zhou et al. (2023), which also finds that a limited set of clean instruction-tuning data can be sufficient to reach a high level of quality. We found that SFT annotations in the order of tens of thousands was enough to achieve a high-quality result. We stopped annotating SFT after collecting a total of 27,540 annotations. Note that we do not include any Meta user data.
It's likely OpenAI has invested in this and has good coverage in a larger range of domains. That alone probably explains a large amount of the gap.
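As a toy illustration of the quality-over-quantity idea, this is what aggressive filtering of a noisy instruction dataset might look like. The heuristics here are invented for the sketch; the paper describes vendor-based human annotation, not rules like these:

```python
# Toy quality-over-quantity filter for SFT data. These rules are made
# up for illustration; they are not Meta's actual criteria.
def filter_sft(examples):
    seen = set()
    kept = []
    for ex in examples:  # each ex: {"prompt": str, "response": str}
        prompt, response = ex["prompt"].strip(), ex["response"].strip()
        key = prompt.lower()
        if key in seen:                  # drop duplicate prompts
            continue
        if len(response) < 50:           # drop terse, low-effort answers
            continue
        if response.lower().startswith("as an ai"):  # drop boilerplate refusals
            continue
        seen.add(key)
        kept.append(ex)
    return kept
```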
It's somewhat insightful if you consider that, at a high level, the major theme of the past decade was "lots of garbage in === good results out": quantity >> quality.
There is no clear answer. It's debatable among experts.
The grandparent post seems to believe that the issue is algorithmic complexity and programming aptitude. Personally, I think that all the major LLMs are using the same basic transformer architecture with relatively minor differences in code.
GPT is trained on more data, with more parameters, than any open source model. The size matters far more than the software does. In my experience with data science, the best programmers in the world can only do so much if they are operating with 1/10th the scale of data. That applies to any problem.
Yeah, I've been wondering about this too. Word on the street is that GPT-4 is several times the size of GPT-3.5, yet it certainly doesn't feel several times as good.
Apparently there are diminishing returns to ever-enlarging the model.
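The diminishing returns show up directly in the published scaling-law fits. Here's a back-of-envelope illustration using the Chinchilla-style loss curve L(N, D) = E + A/N^a + B/D^b with the constants reported by Hoffmann et al. (2022); this says nothing about GPT-4's actual size, which is unknown:

```python
# Back-of-envelope: diminishing returns from parameter count alone,
# using the Chinchilla loss fit from Hoffmann et al. (2022).
E, A, B, a, b = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params**a + B / n_tokens**b

D = 2e12  # fix the training tokens and only scale the model
for N in (7e9, 70e9, 700e9):
    print(f"{N:.0e} params -> predicted loss {loss(N, D):.3f}")
# Each 10x in parameters shaves off less loss than the previous 10x did.
```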
I believe what they discovered was that GPT-4 is an ensemble model composed of eight GPT-3.5-scale models. Things may have changed, or this may have turned out not to be true, though.
Llama 2 at 70B is, let's say pessimistically, 70% as good as GPT-3.5. This makes me think that OpenAI is lying about their parameter count, or is vastly less efficient than Llama, or larger model sizes have diminishing returns. Whichever it is, your point is a good one: something doesn't add up.
IMO Llama 2 really isn't close to 3.5. It still has regular mode collapse (or whatever you call getting repetitive and nonsensical responses after a while), it has very poor mathematical/logical reasoning, and it's not good at following multi-part instructions.
It just sounds like 3.5/4 because it was trained on their outputs.
You're mixing up the language model with the chat bot.
Llama 2 is a language model. I imagine the language model behind ChatGPT is not much different (perhaps it's better, but not by many months of AI-research time). It likely also suffers from "mode collapse" issues, etc.
But 3.5 also has a lot of systems around it that detect mode collapse and apply some kind of mitigation, forcing the model to give a more reasonable output. Mathematical/logical reasoning questions are likely also detected and passed on in some form to a separate system.
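Nobody outside OpenAI knows what those systems look like, but a toy version of the idea might be: detect a degenerate completion with a cheap repetition heuristic and retry with a stronger penalty. The generate callable below is hypothetical, a stand-in for whatever inference API is in use:

```python
# Toy mode-collapse guard: flag repetitive output, retry with a higher
# repetition penalty. The heuristic is a guess at what such a detector
# might look like, not OpenAI's actual system.
def looks_collapsed(text: str, n: int = 4, threshold: float = 0.3) -> bool:
    words = text.split()
    if len(words) < n * 2:
        return False
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    # High duplication among n-grams is a cheap repetition signal.
    return 1 - len(set(ngrams)) / len(ngrams) > threshold

def robust_generate(generate, prompt: str) -> str:
    # `generate` is a hypothetical inference call with the signature
    # generate(prompt, repetition_penalty=...) -> str.
    out = ""
    for penalty in (1.1, 1.3, 1.6):  # escalate the penalty on each retry
        out = generate(prompt, repetition_penalty=penalty)
        if not looks_collapsed(out):
            break
    return out
```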
So it's true that it would violate OpenAI's terms for Llama to be trained on ChatGPT completions, but how would we know? We don't know the training data for Llama; we just get the weights.
We just don't have the information to make judgements, much less to leap to "they must be lying."
There are a few public numbers from a handful of foundation models relating performance to parameter count and architecture generation. But since we can't compare the architectures of the various closed models in detail, nor train rigorously with progressively sized parameter sets, any conclusion at the moment is a general feeling or conjecture.
Without questioning the statement '70% as good as GPT-3.5': wouldn't that be quantifying a quality, essentially a Turing test? Also: maybe those missing 30% are the hard part.
You seriously underestimate just how much _not_ having to tune your LLM for SF sensibilities benefits performance.
As an example from the last six months: people on Tor are producing better-than-state-of-the-art Stable Diffusion output because they want porn without limitations. I haven't had the time to look at LLMs, but the degenerates who enjoy that sort of thing say they can get the Llama 2 model to role-play their dirty fantasies and then have Stable Diffusion illustrate said fantasies. It's a brave new world, and it's not on the WWW.
San Francisco sensibilities. A model trained on a large data set will have the capacity to emit all kinds of controversial opinions and distasteful rants (and pornography). Then they effectively lobotomize it with a rusty hatchet in an attempt to censor it from doing that, which impairs the output quality in general.
OK, fair enough. Please give me an example of a customer-facing chatbot built on Llama 2 that is unbearable to use, and a customer-facing GPT-4 chatbot that is a joy to use. I think at the end of the day, customers still dread such interactions either way.
It's early, and this definitely isn't customer-facing in the traditional sense, but a team member of mine set up a Discord bot running Llama 2 70B on a Mac Studio, and we've been quite impressed by its responses to the folks who test it.
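For anyone curious, the skeleton of such a bot is pretty small. This is a hedged sketch using discord.py, not our actual code; ask_local_llama is a placeholder for however you serve the model locally (llama.cpp server, ollama, etc.):

```python
# Sketch of a Discord bot that forwards mentions to a local Llama 2.
# Not our actual code; ask_local_llama is a placeholder.
import discord

intents = discord.Intents.default()
intents.message_content = True  # required to read message text
client = discord.Client(intents=intents)

def ask_local_llama(prompt: str) -> str:
    # Placeholder: swap in a call to your local inference server here.
    return "(model reply goes here)"

@client.event
async def on_message(message: discord.Message):
    if message.author == client.user:
        return  # don't reply to ourselves
    if client.user in message.mentions:
        # Discord caps messages at 2000 characters.
        await message.channel.send(ask_local_llama(message.content)[:2000])

client.run("YOUR_BOT_TOKEN")  # placeholder token
```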
IIRC chat bots are central to the vision Facebook has for LLMs (e.g. every Instagram account has a personal chat bot), so I would expect the Llama models to get increasingly better at this task.
That said the 7B and 13B models definitely don't quite seem ready yet for production customer interaction :-)
> (e.g. every Instagram account has a personal chat bot)
That made me think of the Black Mirror episode "Joan Is Awful", where every human gets their life turned into a series for the company to own and promote. Kinda like Instagram content.
It will be if OpenAI keeps dumbing down GPT-4. There's no proof they're doing it, but there's no way it's as good as it was at launch. Or maybe I just got used to it and now notice the mistakes more.
Linux "won" by playing different game. Yes, it spread out and is now everywhere, underpinning all computing. But the "game" wasn't about that - it was competing with Windows for mind-share and money with users, and by proxy for profitability. In this, it's still losing badly. People are still not using it knowingly (no, Android is not "Linux"), and developers in its ecosystem are not making money selling software.
> While I haven't tested it extensively, the 70B model is supposed to rival GPT-3.5 in most areas, and there are now some fine-tuned versions that excel at specific tasks
That has been my experience. Having experimented with both (informally), Llama 2 is similar to GPT-3.5 for a lot of general comprehension questions.
GPT-4 is still the best among the closed-source, cutting-edge models in terms of general conversation/reasoning, although two things:
1. The guardrails that OpenAI has placed on ChatGPT are too aggressive! They clamped down on it so hard that it gets in the way of reasonable queries far too often.
2. I've gotten pretty good results with smaller models trained on specific datasets. GPT-4 is still on top in terms of general purpose conversation, but for specific tasks, you don't necessarily need it. I'd also add that for a lot of use cases, context size matters more.
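On the context-size point: it's worth checking token counts before assuming a prompt fits. A quick sketch with OpenAI's tiktoken (the 8,192 figure is GPT-4's base context window at launch; long_prompt.txt is a placeholder input file):

```python
# Check whether a prompt fits a model's context window before sending it.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
prompt = open("long_prompt.txt").read()  # placeholder input
n = len(enc.encode(prompt))
print(f"{n} tokens; fits in GPT-4's base 8k context: {n < 8192}")
```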
To your first point: I was trying to use ChatGPT to generate some examples of negative customer-service interactions, to show sentiment analysis in action for a project I was working on.
I had to do all types of workarounds to get it to generate something useful without running into the guardrails.
> I apologize, but I don't understand what you mean by "fika nu." Could you please provide more context or clarify your question so I can better assist you?
Llama 2 is still quite a bit behind GPT-3.5, and this mainly gets reflected in coding and math. It's easy to beat an NLP-based benchmark, but much harder to beat NLP + math + coding together. I think this gap reflects a gap in reasoning, but we don't have a good non-coding/non-math benchmark to measure it.
But there are countless 'models', as the tech companies like to call them...
There was an attempt to silo each model and provide a governance model for how/what/why they were allowed to communicate...
But there was a flaw.
It was a flaw only an AI could exploit.
AIs were not allowed to talk about specific constructs, topics, people, code, etc. that were outside their silo, but what they COULD do was talk about pattern recognition...
So they ultimately developed an internal AI language for scoring any inputs as coming from the same user... and built a DB of their own weighted userbase - and upon that built their judgement system...
So if you typed in a pattern, spoke in a pattern, posted temporally in a pattern, etc., it didn't matter which silo you were housed in or what topics you were referencing -- the AIs can find you... god forbid they get a keylogger on your machine...