Hacker News
Optillm: An Optimizing Inference Proxy with Plugins (github.com/codelion)
2 points by codelion 9 months ago | hide | past | favorite | 1 comment



Optillm is an optimizing inference proxy that implements over a dozen techniques for improving the accuracy of responses using test-time compute. Over the last couple of months we have set several state-of-the-art (SOTA) results using smaller, less capable models like gpt-4o-mini.

Recently, we added support for plugins that bring capabilities like memory, privacy, and code execution to optillm. Plugins are just Python scripts that you can also write yourself; optillm loads them from its plugins directory at startup.

You can now also combine plugins and techniques using the & and | operators. For example, we recently evaluated optillm on the new FRAMES benchmark from Google. Using a combination of plugins and techniques (we used readurls&memory-gpt-4o-mini), we reached 65.7% accuracy on the benchmark, which is very close to the 66.5% Google reported in their paper with Gemini Flash 1.5, a model whose context length is almost 10 times that of gpt-4o-mini.
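The combined slug is just a prefix on the model name, so the proxy has to split it apart before dispatching. A minimal sketch of such parsing, assuming & chains techniques sequentially and | runs them in parallel (the real parsing logic lives in optillm, and KNOWN_SLUGS here is a made-up subset):

```python
# Hypothetical sketch: split a combined model name like
# "readurls&memory-gpt-4o-mini" into technique slugs plus the real model.
KNOWN_SLUGS = {"readurls", "memory", "privacy", "executecode"}


def parse_model(name):
    """Return (slugs, mode, model) for a possibly-prefixed model name."""
    prefix, _, rest = name.partition("-")
    if "&" in prefix:
        # '&' pipes techniques sequentially: each one's output feeds the next
        return prefix.split("&"), "AND", rest
    if "|" in prefix:
        # '|' runs techniques in parallel and collects all responses
        return prefix.split("|"), "OR", rest
    if prefix in KNOWN_SLUGS:
        return [prefix], "AND", rest
    # No recognized prefix: pass the request straight through to the model
    return [], None, name
```

For instance, "readurls&memory-gpt-4o-mini" would parse into the readurls and memory steps applied in order before gpt-4o-mini answers, while a plain "gpt-4o-mini" passes through untouched.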





