This is a pretty interesting idea for diffusion models, but much less so for LLMs, where (after prompt ingestion) the entire set of weights has to be cycled through for basically every token.
This works very well on my PC! I've got an i3-12100F CPU, 16GB RAM, and an RTX 2060. Running it with `llm --cuda 4`, it answers very fast, though with a lot of hallucination :|