
In practice it's under 30k tokens before performance degrades too much for 3.5/3.7. The 200k/64k figures are meaningless in this context.



Is there a benchmark to measure real effective context length?

Sure, GPT-4o has a context window of 128k tokens, but it loses a lot from the beginning/middle.


Here's an older study that includes Claude 3.5: https://www.databricks.com/blog/long-context-rag-capabilitie...?



Model vendors often publish "needle in a haystack" benchmarks that look very good, but my subjective experience with large contexts is always bad. Maybe we need better benchmarks.
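
For anyone unfamiliar, a "needle in a haystack" test is roughly the following: bury one fact at a chosen depth in a long filler text, ask about it, and check whether the answer contains it. This is only a minimal sketch; the function names, filler text, and "secret code" needle are all made up for illustration, and no model API is called.

```python
def build_haystack_prompt(filler_sentences, needle, depth, question):
    """Insert `needle` at relative `depth` (0.0 = start, 1.0 = end) of the filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences) + "\n\n" + question


def score_answer(answer, expected):
    """Crude scoring: does the model's answer contain the expected string?"""
    return expected.lower() in answer.lower()


# Hypothetical data, for illustration only.
filler = [f"Filler sentence number {i}." for i in range(1000)]
needle = "The secret code is 7421."
question = "What is the secret code?"

# One prompt per depth; each would be sent to the model under test.
prompts = {
    depth: build_haystack_prompt(filler, needle, depth, question)
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0)
}
```

Note that this only checks retrieval of a single planted fact, which is a much narrower task than actually using everything in the context, so a good score here may not say much about everyday long-context use.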



