I actually think you're onto something there. The "MicroLLMs Architecture" could mirror how microservices revolutionized web architecture.
Instead of one massive model trying to do everything, you'd have specialized models for OCR, code generation, image understanding, etc. Then a "router LLM" would direct queries to the appropriate specialized model and synthesize responses.
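Roughly the shape I have in mind, just as a sketch (the specialist registry, the keyword-based classify(), and route() are all made up stand-ins, not a real API):

```python
from typing import Callable

# Hypothetical registry of small specialist models.
SPECIALISTS: dict[str, Callable[[str], str]] = {
    "ocr": lambda q: f"[OCR specialist handles: {q}]",
    "code": lambda q: f"[code specialist handles: {q}]",
    "vision": lambda q: f"[vision specialist handles: {q}]",
}

def classify(query: str) -> str:
    # In practice this would be a small router model; keyword matching stands in here.
    q = query.lower()
    if "scan" in q or "receipt" in q:
        return "ocr"
    if "function" in q or "bug" in q:
        return "code"
    return "vision"

def route(query: str) -> str:
    task = classify(query)                 # router decides which specialist is needed
    answer = SPECIALISTS[task](query)      # only that specialist is invoked
    return f"Q: {query}\nA: {answer}"      # router synthesizes the final reply

print(route("extract the line items from this scanned receipt"))
```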
The efficiency gains could be substantial - why run a 1T parameter model when your query just needs a lightweight OCR specialist? You could dynamically load only what you need.
The challenge would be in the communication protocol between models and managing the complexity. We'd need something like a "prompt bus" for inter-model communication with standardized inputs/outputs.
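One possible shape for that "prompt bus" envelope, purely illustrative (all field names are invented), is a single message schema that every model consumes and emits:

```python
from dataclasses import dataclass, field, asdict
import json, uuid

@dataclass
class BusMessage:
    task: str        # e.g. "ocr", "codegen", "vision"
    payload: str     # the prompt or an intermediate result
    source: str      # which model emitted it
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)  # follow one request across models
    meta: dict = field(default_factory=dict)                         # confidence, token budget, etc.

# Every model on the bus reads and writes the same envelope.
msg = BusMessage(task="ocr", payload="page_004.png", source="router")
print(json.dumps(asdict(msg), indent=2))
```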
Has anyone here started building infrastructure for this kind of model orchestration yet? This feels like it could be the Kubernetes moment for AI systems.
This is already done with agents. Some agents only have tools and a single model; others orchestrate with other LLMs to handle more advanced use cases. It's a pretty obvious solution when you think about how to get good performance out of a model on a complex task when useful context length is limited: just run multiple models, each with its own context, and give them a supervisor model, just like how humans organize themselves in real life.
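A toy version of that supervisor pattern, where call_llm() is just a stand-in for a real model call:

```python
def call_llm(context: list[str]) -> str:
    # Stand-in for a real model call; returns a fake summary of the last message.
    return f"summary({context[-1][:40]})"

def worker(subtask: str) -> str:
    context = [subtask]          # each worker starts with its own fresh context
    return call_llm(context)     # and hands back a compact summary, not a full transcript

def supervise(task: str, subtasks: list[str]) -> str:
    summaries = [worker(s) for s in subtasks]   # workers run with isolated contexts
    return call_llm([task] + summaries)         # supervisor reasons over summaries only

print(supervise("summarize this 500-page report",
                ["chapters 1-5", "chapters 6-10", "appendices"]))
```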
I'm doing this personally for my own project: essentially an agent graph that starts with the image output, orients and cleans it, does a first pass with the tesseract LSTM "best" models to produce PDF/hOCR/ALTO, then passes the results to other LLMs and models based on their strengths to further refine towards markdown and LaTeX. My goal is less about RAG database population and more about preserving the structure, data, and analysis in a form that isn't manually typeset. There seems to be pretty limited tooling out there, since the goal generally seems to be the obvious, immediately commercial one of producing RAG-amenable forms that defer the "heavy" side of chart/graphic/tabular reproduction to some future time.
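The first stage looks roughly like this; the tesseract flags are real, but the tessdata path is just an example install location and refine_with_llm() is a placeholder for whatever downstream models you wire in:

```python
import subprocess
from pathlib import Path

def ocr_pass(image: Path, outbase: Path) -> Path:
    # --oem 1 selects the LSTM engine; point --tessdata-dir at the tessdata_best models
    # (path below is an assumed install location).
    subprocess.run(
        ["tesseract", str(image), str(outbase),
         "--oem", "1", "--tessdata-dir", "/usr/share/tessdata_best",
         "pdf", "hocr", "alto"],
        check=True,
    )
    return outbase.with_suffix(".hocr")    # hOCR keeps layout info for the next stage

def refine_with_llm(hocr_path: Path) -> str:
    # Placeholder: hand the structured hOCR to whichever model handles tables,
    # equations, and figures best, and ask for markdown/LaTeX back.
    return hocr_path.read_text(encoding="utf-8")

markdown = refine_with_llm(ocr_pass(Path("page_004.png"), Path("page_004")))
```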