Author here. I'm the founder of Mixpeek — we build multimodal search infrastructure.
The core problem: most vector search assumes your query is a sentence or a single image. But we kept getting customers who wanted to pass entire video files as queries — a media company searching their archive with a raw broadcast clip, a legal team querying with a full contract PDF, an IP safety pipeline scanning videos frame-by-frame against a brand index.
The key insight was that the decomposition pipeline we already use for ingestion (split → embed → store) is the same operation needed at query time — just routing output to search instead of write. Same extractor, same chunking, same embedding model. This guarantees query and index vectors are always in the same space.
The execution path: detect a large input → decompose it via the extractor → batch-embed the chunks in parallel → run N concurrent ANN searches → fuse results (RRF/max/avg). From the caller's perspective, the API shape doesn't change at all.
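For concreteness, here's a minimal sketch of the fusion step using Reciprocal Rank Fusion — the function name and the `k=60` constant are the textbook convention, not necessarily our exact implementation:

```python
from collections import defaultdict

def rrf_fuse(result_lists, k=60):
    """Reciprocal Rank Fusion over N per-chunk ANN result lists.

    result_lists: one ranked list of doc IDs per query chunk.
    Each doc scores sum(1 / (k + rank)) across the lists it appears in,
    so docs that rank well in many chunk searches rise to the top.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Three chunk searches that partially agree: "b" ranks high everywhere,
# so it wins even though "a" tops one list.
fused = rrf_fuse([["a", "b", "c"], ["b", "a", "d"], ["b", "c", "a"]])
```

Max/avg fusion swap the `+=` accumulation for `max()` or a mean over per-list similarity scores; RRF has the advantage of only needing ranks, so scores from different chunk searches don't have to be calibrated against each other.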
One decision I'd be curious to get feedback on: we explicitly dropped an "auto" mode that would pick chunking strategy based on file type. The right decomposition depends on what you're searching for, not just the file itself. Felt like the wrong abstraction to hide. Curious if others have found ways to make auto-config work well here.
Happy to answer questions about the fusion strategies, the credit model, or the architecture.
I built amux because running 5–10 Claude Code agents at once across different repos turned into an unmanageable mess of terminal tabs and forgotten sessions.
The core problem: Claude Code sessions crash at 3am from context compaction, agents silently block on permission prompts, and there's no good way to see which of your 8 running sessions actually needs attention. I was losing work and wasting money.
amux is a tmux-based multiplexer that gives you a single control plane for all your headless Claude Code sessions — from a web dashboard, your phone, or the CLI.
*What it actually does:*
- Registers Claude Code sessions as named tmux panes, each with its own conversation history and working directory
- Live status detection (working / needs input / idle) streamed via SSE — you see at a glance which agents need you
- Self-healing watchdog that auto-compacts context, restarts crashed sessions, and replays the last message
- Built-in kanban board backed by SQLite with atomic task claiming (CAS), so agents can pick up work items without race conditions
- REST API for everything — agents discover peers and delegate work via `curl`. The API reference gets injected into each agent's global memory, so plain-English orchestration works out of the box
- Per-session token tracking with daily spend breakdowns, so you know what each agent costs before the bill arrives
- Git conflict detection that warns when two agents share a directory + branch, with one-click branch isolation
*What it's not:*
It's not a wrapper around Claude Code's native agent teams feature. It operates at a layer below that — it doesn't modify Claude Code at all. It parses ANSI-stripped tmux output. No hooks, no patches, no monkey-patching. If Claude Code updates tomorrow, amux still works.
*Technical decisions:*
The whole thing is a single ~12,000-line Python file with inline HTML/CSS/JS. No npm, no build step, no Docker. I know the single-file approach is polarizing, but for a tool that runs on your dev machine and you might want to hack on, I've found it dramatically lowers the barrier. It restarts on save.
TLS is auto-provisioned in priority order: Tailscale cert → mkcert → self-signed fallback. The idea is you install it on your dev box, run `amux serve`, and access it securely from your phone over Tailscale while you're away from your desk. I use the mobile PWA daily — kick off a batch of tasks, go walk the dog, check progress from my phone.
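The selection logic is just a priority chain. A sketch of the decision (illustrative, not amux's exact code — the real version then shells out to `tailscale cert`, `mkcert`, or `openssl req -x509` respectively):

```python
import shutil

def pick_tls_strategy(which=shutil.which):
    """Pick the best available cert source, in priority order.

    `which` is injectable for testing; by default it checks PATH.
    """
    if which("tailscale"):
        return "tailscale"    # cert for the machine's ts.net name, trusted everywhere
    if which("mkcert"):
        return "mkcert"       # locally-trusted dev cert, no browser warnings on this box
    return "self-signed"      # openssl fallback; browsers warn once
```

The ordering matters for the phone use case: only the Tailscale cert is trusted by a device that never ran `mkcert -install`.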
The kanban board uses SQLite with compare-and-swap for task claiming. This matters because when you have multiple agents that can pick up work, you need atomicity — two agents hitting `/api/board/PROJ-5/claim` simultaneously should result in exactly one of them getting it.
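The CAS boils down to a single conditional UPDATE — roughly this (schema and column names are illustrative, not amux's actual schema):

```python
import sqlite3

def claim_task(db: sqlite3.Connection, task_id: str, agent: str) -> bool:
    """Atomically claim a board task via compare-and-swap.

    The WHERE clause only matches while status is still 'todo', so when
    two agents race on the same task, exactly one UPDATE reports
    rowcount == 1 — SQLite serializes the writes for us.
    """
    cur = db.execute(
        "UPDATE tasks SET status = 'claimed', claimed_by = ? "
        "WHERE id = ? AND status = 'todo'",
        (agent, task_id),
    )
    db.commit()
    return cur.rowcount == 1
```

The loser gets a clean `False` back rather than an error, so its agent just moves on to the next unclaimed item.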
One thing we've been thinking about with amux is that the unit of compute shouldn't just be the terminal session — it should be the agent itself. That means each pane/session can expose things like:
* tokens in / tokens out
* cumulative run cost
* model + pricing tier
* runtime duration
* optional budget caps
So when you spin up 5–10 agents, you can immediately see which one is burning tokens or looping.
Longer term I'd love for amux to treat agents a bit like processes in `htop`: resource usage across all agents in one place, with the ability to kill or restart the expensive ones quickly.
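In sketch form, the per-agent accounting could look something like this (illustrative only — field names and prices are placeholders, not amux's data model):

```python
from dataclasses import dataclass

@dataclass
class AgentStats:
    """htop-style resource accounting for one agent session."""
    name: str
    tokens_in: int
    tokens_out: int
    price_in_per_mtok: float    # $ per million input tokens (placeholder)
    price_out_per_mtok: float   # $ per million output tokens (placeholder)

    @property
    def cost(self) -> float:
        # cumulative run cost in dollars
        return (self.tokens_in * self.price_in_per_mtok
                + self.tokens_out * self.price_out_per_mtok) / 1_000_000

def top_by_cost(agents):
    """Sort agents most-expensive-first, like htop sorted by CPU."""
    return sorted(agents, key=lambda a: a.cost, reverse=True)
```

A budget cap then becomes a one-line check against `agent.cost` before dispatching the next task.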
Curious how you're currently surfacing cost in your setups — logs, dashboards, or something inline with the agent runtime?
We process video, images, and documents through 20+ ML models simultaneously at Mixpeek. A single 10-minute video triggers transcription, visual embeddings, scene descriptions, face detection, object detection, brand safety classification, and more — all in parallel with different compute requirements.
We wrote up the full Ray architecture we use in production on KubeRay/GKE. Not a tutorial — more of a "here's what we actually run and what bit us."
Some highlights:
- *Custom resource isolation* — We use a synthetic `{"batch": 1}` resource to prevent batch pipeline tasks from starving Ray Serve inference replicas. Same cluster, zero interference, no runtime overhead.
- *Flexible actor pools* — Fixed-size `ActorPoolStrategy(size=8)` deadlocks when concurrent jobs compete for workers. `min_size=1, max_size=N` guarantees every job can make progress.
- *Shared preprocessing* — Naive approach runs S3 download + format normalization once per extractor. With 10 extractors on 1,000 files, that's 10,000 redundant reads. We preprocess once and fan out via Ray Dataset.
- *Distributed Qdrant writes* — Ray Data's `Datasink` API distributes vector DB writes across all workers with backpressure, instead of collecting everything on one node.
- *Fire-and-forget progress tracking* — A Ray actor as a shared counter lets workers report progress without blocking the pipeline.
- *Zero-CPU head node* — Learned this one the hard way when a runaway batch job took down our scheduler.
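The synthetic-resource trick in sketch form (a config-level illustration, not our production code — the real YAML and pipeline code are in the post):

```python
import ray

# Batch-capable nodes are started with a synthetic resource, e.g.:
#   ray start --resources='{"batch": 8}'
# Ray Serve replicas never request "batch", so batch tasks can saturate
# their quota without ever displacing inference capacity on the same cluster.

@ray.remote(resources={"batch": 1})
def run_extractor(file_ref):
    # one unit of batch-pipeline work (transcription, embedding, etc.)
    return file_ref
```

Because `"batch"` is a logical resource, not a cgroup or CPU pin, the isolation costs nothing at runtime — the scheduler simply refuses to place batch tasks where the resource isn't advertised.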
The post includes the KubeRay YAML, Ray Serve autoscaling configs, pipeline code, and the LocalStack parquet workaround that saved us hours of debugging silent hangs.
LLM coordination is just one feature — the core reason I built amux was so I can quickly delegate from my phone, see outputs, and monitor sessions without raw SSH.
A couple implementation details for anyone curious: CAD previews and exploded views are rendered client-side using Replicad (WASM) + WebGL, so there’s no server-side geometry rendering.
I also recorded a short walkthrough showing a build from prompt → parts → enclosure → validation: