74% smaller 713GB to 185GB.
Use the magic incantation -ot ".ffn_.*_exps.=CPU" to offload MoE layers to RAM, allowing non MoEs to fit < 24GB VRAM on 16K context! The rest sits in RAM & disk.
74% smaller 713GB to 185GB.
Use the magic incantation -ot ".ffn_.*_exps.=CPU" to offload MoE layers to RAM, allowing non MoEs to fit < 24GB VRAM on 16K context! The rest sits in RAM & disk.