AFAICT, kimi k2 was the first to apply this technique [1]. I wonder if Anthropic came up with it independently or if they trained a model in 5 months after seeing kimi’s performance.
The main thing that keeps me from using Jupyter notebooks for anything that's not entirely Python is Python itself.
For me, pipenv/pyenv/conda/poetry/uv/dependencies.txt and the inevitable "I need to upgrade Python to run this notebook, ugh, well, ok" -- two weeks later -- "g####m, that upgrade broke that unrelated and old Ansible and now I cannot fix these fifteen barely-held-together servers" is pure hell.
I try to stay away from Python for foundational stuff, as any Python project that I work on¹ will break at least yearly on some dependency or other runtime woe. That goes for Ansible, build pipelines, deploy.py or any such thing. I would certainly not use Jupyter notebooks for such crucial and foundational automation, as the giant tree of dependencies and requirements they come with makes this far worse.
¹ Granted, my job makes me work on an excessive number of codebases: at least six different Python projects in the last two months, some requiring Python 2.7, some requiring deprecated versions of lib-something.h, some cutting edge, some very strict in practice but not documented (it works on the machine of the one dev who works on it, as long as he never updates anything?). And Puppet or Chef, being Ruby, are just as bad, suffering from the exact same issues, except that Ruby has had one (and only one!) package management system for decades now.
I recently wrote a post outlining our method to reduce hallucinations in LLM agents by leveraging a verified semantic cache. The approach pre-populates the cache with verified question-answer pairs, ensuring that frequently asked questions are answered accurately and consistently without invoking the LLM unnecessarily.
The key idea lies in dynamically determining how queries are handled:
- Strong matches (≥80% similarity): Responses are directly served from the cache.
- Partial matches (60–80% similarity): Verified answers are used as few-shot examples to guide the LLM.
- No matches (<60% similarity): The query is processed by the LLM as usual.
This not only minimizes hallucinations but also reduces costs and improves response times.
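For the curious, here's a rough sketch of what that routing step can look like. It's not the exact implementation from the post; `embed` and `call_llm` are placeholders for whatever embedding model and LLM client you already have, and the cache is just a list of verified Q/A pairs with precomputed vectors.

```python
import numpy as np

STRONG, PARTIAL = 0.80, 0.60  # similarity thresholds from the post

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer(query, cache, embed, call_llm):
    """cache: list of dicts with 'question', 'answer', and precomputed 'vec'."""
    q_vec = embed(query)
    # Rank verified Q/A pairs by similarity to the incoming query.
    scored = sorted(cache, key=lambda e: cosine(q_vec, e["vec"]), reverse=True)
    best = scored[0] if scored else None
    sim = cosine(q_vec, best["vec"]) if best else 0.0

    if sim >= STRONG:
        # Strong match: serve the verified answer directly, no LLM call.
        return best["answer"]
    if sim >= PARTIAL:
        # Partial match: use the top verified pairs as few-shot examples.
        shots = "\n\n".join(f"Q: {e['question']}\nA: {e['answer']}" for e in scored[:3])
        return call_llm(f"{shots}\n\nQ: {query}\nA:")
    # No match: fall through to the LLM as usual.
    return call_llm(query)
```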
If the user asks such a question, your agent should not invoke RAG at all; it should simply answer from the conversation history. You need to focus on your orchestration step.
Search for ReAct agents; you can build one using either LangGraph or Bedrock Agents.
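If you go the LangGraph route, the prebuilt ReAct agent gets you most of the way. Rough sketch below; the tool body, model id, and prompt are placeholders, and it assumes langgraph and langchain-aws are installed.

```python
from langchain_core.tools import tool
from langchain_aws import ChatBedrockConverse
from langgraph.prebuilt import create_react_agent

@tool
def search_docs(query: str) -> str:
    """Look up internal docs / verified answers for the query."""
    return "..."  # plug in your retriever or semantic cache here

# Any LangChain chat model works; Bedrock's Converse API is one option.
model = ChatBedrockConverse(model_id="anthropic.claude-3-5-sonnet-20240620-v1:0")

agent = create_react_agent(model, [search_docs])
result = agent.invoke({"messages": [("user", "What's our refund policy?")]})
print(result["messages"][-1].content)
```

The agent decides per turn whether to call the tool (RAG) or just answer from the message history, which is exactly the orchestration decision above.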
Skeptical that this will be a "good" experience for everyone involved considering how generic AI openers/responses are, but also hopeful that it can reduce friction for some.
We are trying to make it as personalized to you as possible. It is still limited at the moment, but we gather as much context as we can about the user and their matches to personalize each message over time.
Not an expert by any means, but streaming HQ video is pretty expensive (even more so for live content); it seems like the only providers that can do it profitably are YouTube and Netflix. I'm sure a big reason for that is the engineering (esp. the CDN).
This is actually not true nowadays. Streaming HQ video is pretty cheap (check out the per-GB pricing from CloudFront or Fastly and divide that by 5-10 to get a realistic number).
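Rough numbers, to make that concrete: CloudFront's list price is around $0.085/GB for the first tier, and the divide-by-5-to-10 is the assumption about negotiated rates above, so take these as back-of-envelope only.

```python
# Back-of-envelope CDN cost per viewer-hour (all figures are rough assumptions).
list_price_per_gb = 0.085                       # CloudFront list price, first tier, USD/GB
negotiated_price = list_price_per_gb / 7        # "divide by 5-10" for a negotiated rate
bitrate_mbps = 5                                # typical 1080p stream
gb_per_hour = bitrate_mbps / 8 * 3600 / 1000    # ~2.25 GB per viewer-hour
cost_per_viewer_hour = gb_per_hour * negotiated_price
print(f"{gb_per_hour:.2f} GB/h -> ${cost_per_viewer_hour:.3f} per viewer-hour")
# roughly 2.25 GB/h * ~$0.012/GB ≈ $0.03 per viewer-hour
```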
90% of our customers do not allow this due to data sovereignty.
Bedrock here is lagging so far behind that several customers assume AWS simply isn't investing here anymore -- or, if they are, it's an afterthought, and a very expensive one at that.
I've spoken with several account managers and SAs and they seem similarly frustrated with the continual response from above that useful models are "coming soon".
You can't even BYO models here; we usually end up spinning up big ol' GPU EC2 instances and serving our own, or for some tasks running locally, since you can get better open-weight LLMs.
Hmm interesting, didn't realize that data sovereignty requirements were so stringent. Wonder how other cloud providers are doing in this sense considering GPU shortages across the board.
I'm confused: what's expensive about it? It's a serverless, pay-per-token model?
Do you mean specifically the Bedrock Knowledge Base / RAG? That uses serverless OpenSearch, which costs a minimum of roughly $200/month because it doesn't scale to zero.
https://aws.amazon.com/blogs/opensource/using-strands-agents...