
> Opus hasn't yet gotten an update from 3 to 3.5, and if you line up the benchmarks, the Sonnet "3.5 New" model seems to beat it everywhere

Why isn't Anthropic clearer about Sonnet being better then? Why isn't Opus included in the benchmark if the new Sonnet beats it? Why are they so ambiguous with their language?

For example, https://www.anthropic.com/api says:

> Sonnet - Our best combination of performance and speed for efficient, high-throughput tasks.

> Opus - Our highest-performing model, which can handle complex analysis, longer tasks with many steps, and higher-order math and coding tasks.

And Opus is above/after Sonnet. That to me implies that Opus is indeed better than Sonnet.

But then you go to https://docs.anthropic.com/en/docs/about-claude/models and it says:

> Claude 3.5 Sonnet - Most intelligent model

> Claude 3 Opus - Powerful model for highly complex tasks

Does that mean Sonnet 3.5 is better than Opus even for highly complex tasks, since it's the "most intelligent model"? Or just for everything except "highly complex tasks"?

I don't understand why this seems purposefully ambiguous?



> Why isn't Anthropic clearer about Sonnet being better then?

They are clear that both Opus > Sonnet and 3.5 > 3.0. I don't think there is a clear, universal better/worse relationship between Sonnet 3.5 and Opus 3.0; which is better is task-dependent (though with Opus 3.0 being five times as expensive as Sonnet 3.5, I wouldn't use Opus 3.0 unless Sonnet 3.5 proved clearly inadequate for a task).
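To make the 5x concrete, a quick back-of-the-envelope sketch (using what I believe were the list prices at the time, roughly $3/$15 per million input/output tokens for Sonnet 3.5 and $15/$75 for Opus 3; treat the exact figures as an assumption):

    # Rough cost comparison; prices in USD per 1M tokens (input, output),
    # taken from the published list pricing at the time (assumption).
    PRICES = {
        "claude-3-5-sonnet": (3.00, 15.00),
        "claude-3-opus": (15.00, 75.00),
    }

    def cost(model, input_tokens, output_tokens):
        inp, out = PRICES[model]
        return (input_tokens * inp + output_tokens * out) / 1_000_000

    # e.g. a 10k-token prompt with a 1k-token reply:
    print(cost("claude-3-5-sonnet", 10_000, 1_000))  # ~$0.045
    print(cost("claude-3-opus", 10_000, 1_000))      # ~$0.225, i.e. 5x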


> I don't understand why this seems purposefully ambiguous?

I wouldn't attribute this to malice when it can also be explained by incompetence.

Sonnet 3.5 New > Opus 3 > Sonnet 3.5 is generally how they stack up against each other when looking at the total benchmarks.

"Sonnet 3.5 New" has just been announced, and they likely just haven't updated the marketing copy across the whole page yet, and maybe also haven't figured out how to graple with the fact that their new Sonnet model was ready faster than their next Opus model.

At the same time I think they want to keep their options open to either:

A) drop an Opus 3.5 soon that will bring the naming logic back in order

B) potentially phase out Opus, and instead introduce new branding for what they called a "reasoning model" like OpenAI did with o1(-preview)


> I wouldn't attribute this to malice when it can also be explained by incompetence.

I don't think it's malice either, but if Opus costs them more to run, and they've already set a price they can't raise, it makes sense that they'd want people to use models with a higher net return. That's just "business sense", not really malice.

> and they likely just haven't updated the marketing copy across the whole page yet

The API docs have been updated though, which is the second page I linked. It mentions the new model by its full name, "claude-3-5-sonnet-20241022", so clearly they've gone through at least that page. Yet the wording remains ambiguous.
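At least on the API side the ambiguity doesn't bite, since you select the exact dated snapshot yourself. A minimal sketch, assuming the official `anthropic` Python SDK and an API key in the environment:

    # Minimal example; assumes the `anthropic` Python SDK is installed and
    # ANTHROPIC_API_KEY is set in the environment.
    import anthropic

    client = anthropic.Anthropic()

    # Models are addressed by their full dated identifier, e.g. the new Sonnet:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": "Say hello."}],
    )
    print(response.content[0].text)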

> Sonnet 3.5 New > Opus 3 > Sonnet 3.5 is generally how they stack up against each other when looking at the total benchmarks.

Which ones are you looking at? Since the benchmark comparison in the blogpost itself doesn't include Opus at all.


> Which ones are you looking at? Since the benchmark comparison in the blogpost itself doesn't include Opus at all.

I manually compared it with the values from the benchmarks they published when they originally announced the Claude 3 model family[0].

Not all rows have a 1:1 counterpart in the current benchmarks, but I think it paints a good enough picture.

[0]: https://www.anthropic.com/news/claude-3-family


> B) potentially phase out Opus, and instead introduce new branding for what they called a "reasoning model" like OpenAI did with o1(-preview)

When should we be using the -o OpenAI models? I've not been keeping up and the official information now assumes far too much familiarity to be of much use.


I think it's first important to note that there is a huge difference between -o models (GPT 4o; GPT 4o mini) and the o1 models (o1-preview; o1-mini).

The -o models are "just" stronger versions of their non-suffixed predecessors. They are the latest (and maybe last?) version of models in the lineage of GPT models (roughly GPT-1 -> GPT-2 -> GPT-3 -> GPT-3.5 -> GPT-4 -> GPT-4o).

The o1 models (not sure what the naming structure for upcoming models will be) are a new family of models that try to excel at deep reasoning, by allowing the models to use an internal (opaque) chain-of-thought to produce better results at the expense of higher token usage (and thus cost) and longer latency.

Personally, I think the use cases that justify the current cost and slowness of o1 are incredibly narrow (e.g. offline analysis of financial documents or deep academic paper research). In most interactive use cases I'd rather opt for GPT-4o or Sonnet 3.5 over o1-preview, take the faster response time, and send a follow-up message if needed. Similarly, for non-interactive use cases I'd rather add a layer of tool calling around those faster models than use o1-preview.

I think the o1-like models will only really take off if their prices come down and it's clearly demonstrated that more "thinking tokens" correlate with predictably better results, results that can compete with the highly tuned prompts and fine-tuned models that are currently expensive to produce in terms of development time.


Agreed with all that, and also, when used via API the o1 models don't currently support system prompts, streaming, or function calling. That rules them out for all of the uses I have.
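For context, this is the sort of call that (at the time) you couldn't port to o1. A sketch assuming the official `openai` Python SDK (v1+), with a made-up tool name purely for illustration:

    # Typical Chat Completions call relying on the three features o1 lacked:
    # a system prompt, streaming, and tool/function calling.
    # Assumes the `openai` Python SDK (v1+) and OPENAI_API_KEY in the environment.
    from openai import OpenAI

    client = OpenAI()

    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a terse assistant."},  # system prompt
            {"role": "user", "content": "What changed in the latest deploy?"},
        ],
        tools=[{  # function calling; "get_deploy_log" is a hypothetical tool
            "type": "function",
            "function": {
                "name": "get_deploy_log",
                "description": "Fetch the most recent deployment log.",
                "parameters": {"type": "object", "properties": {}},
            },
        }],
        stream=True,  # streaming
    )
    for chunk in stream:
        delta = chunk.choices[0].delta
        if delta.content:
            print(delta.content, end="")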


> The -o models are "just" stronger versions of their non-suffixed predecessors.

Cheaper and faster, but not notably "stronger" at real-world use.


Thank you.


Jesus, maybe they should let the AIs run the product naming.



