Author here. Some context on what this is and why it matters.
Covenant-72B is the largest language model pre-trained through fully permissionless, decentralized coordination. 72 billion parameters, approximately 1.1 trillion tokens, trained across 70+ contributors on commodity internet connections. No datacenter, no central cluster, and no whitelisting of participants. Anyone with GPUs could join or leave at any time during the run.
The two hard problems in this setting are bandwidth and trust.
For bandwidth: synchronizing full gradients for a 72B model over residential internet is not feasible. We developed SparseLoCo, which compresses gradient communication by over 146x. Each peer transmits only 1.56% of gradient values per round, using top-k sparsification, 2-bit quantization, and error feedback. The result was 94.5% compute utilization and 70 seconds of communication overhead per round (versus 8.3 minutes for INTELLECT-1, a whitelisted 10B run).
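For intuition, here is a minimal sketch of how those three pieces compose on one pseudo-gradient. This is illustrative code, not our production implementation; the uniform 2-bit quantizer and all names are assumptions:

```python
import torch

def compress(pseudo_grad: torch.Tensor, error_buf: torch.Tensor, density: float = 0.0156):
    """Sketch: top-k sparsification + 2-bit quantization with error feedback.
    error_buf has the same shape as pseudo_grad and persists across rounds."""
    acc = (pseudo_grad + error_buf).flatten()      # fold in residual from past rounds
    k = max(1, int(density * acc.numel()))
    _, idx = torch.topk(acc.abs(), k)              # keep the k largest-magnitude entries
    vals = acc[idx]
    # Toy uniform 2-bit quantizer (4 levels) over the selected values.
    lo, hi = vals.min(), vals.max()
    scale = (hi - lo) / 3 if hi > lo else vals.new_tensor(1.0)
    codes = torch.round((vals - lo) / scale).clamp_(0, 3).to(torch.uint8)
    deq = codes.float() * scale + lo               # what peers will reconstruct
    # Untransmitted entries plus quantization error stay in the buffer.
    sent = torch.zeros_like(acc)
    sent[idx] = deq
    error_buf.copy_((acc - sent).view_as(error_buf))
    return idx, codes, lo, scale                   # the actual wire payload
```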
For trust: when anyone can participate, anyone can submit garbage updates. Gauntlet is our validation layer. It scores every submission every round by measuring loss improvement on assigned and held-out data, running integrity checks, and applying persistent ranking. Only top-scoring updates touch the model.
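In spirit, the per-round scoring looks something like the sketch below. This is heavily simplified and all names are illustrative; the real Gauntlet also runs integrity checks and keeps a persistent ranking across rounds:

```python
import torch

def score_submission(params, update, loss_on, assigned, heldout):
    """Sketch of loss-improvement scoring (names illustrative).
    params/update are flat tensors; loss_on(params, batch) -> scalar loss.
    A garbage update scores poorly on both splits; an overfit one fails held-out."""
    before = loss_on(params, assigned) + loss_on(params, heldout)
    after = loss_on(params + update, assigned) + loss_on(params + update, heldout)
    return (before - after).item()   # positive = the update genuinely reduced loss
```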
The base model is competitive with LLaMA-2-70B on ARC despite training on roughly half the token budget. After fine-tuning, the chat model outperforms both K2-Chat and LLaMA-2-70B-Chat on IFEval and MATH.
Some more detail on the compression side. SparseLoCo, developed at Templar AI, is a distributed training algorithm that achieves extreme compression ratios (1-3% density + 2-bit quantization) while outperforming existing methods like DiLoCo and DeMo on both loss and communication efficiency.
The Core Problem
Training LLMs across data centers or over the internet is bottlenecked by communication: as model scale grows, each synchronization can require transferring hundreds of gigabytes of pseudo-gradients. DiLoCo reduces the frequency of synchronizations, but the communication remains dense and large. This makes distributed training impractical for many scenarios, especially internet-scale collaboration.
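To make that concrete, a back-of-envelope estimate (assuming bf16 pseudo-gradients and a 500 Mbps residential link; the numbers are ours, not from the paper):

```python
params = 72e9                         # parameters in the model
payload_gb = params * 2 / 1e9         # bf16 = 2 bytes/param -> ~144 GB per sync
link_gbps = 0.5                       # a good residential connection
minutes = payload_gb * 8 / link_gbps / 60
print(f"{payload_gb:.0f} GB per dense sync, ~{minutes:.0f} min just to transmit")
# -> 144 GB per dense sync, ~38 min just to transmit
```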
Technical Approach
Our key insight: DiLoCo's infrequent communication can be aggressively compressed via TOP-k sparsification while actually improving performance.
Algorithm highlights (a simplified sketch follows the list):
* Replace global momentum with per-replica error feedback
* Apply TOP-k magnitude compression (1-3% density) + 2-bit quantization to pseudo-gradients
* Maintain infrequent communication (H=15-250 steps) like DiLoCo
* Use chunked TOP-k for better parallelism and reduced index overhead
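Put together, one outer round looks roughly like this. It is our simplification, not the paper's pseudocode: `run_inner` and `gather` are stand-ins for local training and the compressed all-reduce, and the outer learning rate is an assumed illustrative value:

```python
import torch

def sparseloco_round(params, run_inner, gather, error_buf,
                     H=50, density=0.02, outer_lr=0.7):
    """One simplified SparseLoCo outer step on a flat parameter vector.
    run_inner(params, H) -> params after H local optimizer steps.
    gather(x) -> list of all peers' sparse tensors (compressed in practice)."""
    new = run_inner(params.clone(), H)
    pseudo_grad = params - new                   # what local training learned
    acc = pseudo_grad + error_buf                # per-replica error feedback,
                                                 # replacing DiLoCo's global momentum
    k = max(1, int(density * acc.numel()))
    _, idx = torch.topk(acc.abs(), k)
    sparse = torch.zeros_like(acc)
    sparse[idx] = acc[idx]                       # (2-bit quantization omitted here)
    error_buf.copy_(acc - sparse)                # untransmitted mass carries over
    avg = torch.stack(gather(sparse)).mean(dim=0)
    return params - outer_lr * avg               # outer step on the averaged update
```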
Results
Communication reduction: With >97× compression, SparseLoCo outperforms DiLoCo across all benchmarks. Sparse aggregation appears to provide regularization benefits beyond just compression.
Communication infrequency: Consistently outperforms DiLoCo across communication periods H ∈ {15, 30, 50, 100, 250} on 512M-parameter models.
Real deployment: Currently running on Bittensor with a 70B model and 20 participants in each gather operation (out of many more total participants): 70 seconds of communication at <500 Mbps bandwidth. An earlier deployment, a medium-sized run (200B tokens) of an 8B-parameter model with 20 gather participants, averaged 12 seconds of communication against 4.5 minutes of compute per round.
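A rough payload estimate connects these numbers. This is our arithmetic, taking ~5.6 index bits per value from the implementation notes below plus the 2-bit payload; treat it as an order-of-magnitude check only:

```python
params = 70e9                         # deployed model size
density = 0.0156                      # top-k density used in the run
bits_per_value = 2 + 5.6              # 2-bit payload + compressed index (assumed)
sparse_gb = params * density * bits_per_value / 8e9
dense_gb = params * 16 / 8e9          # bf16 baseline
print(f"~{sparse_gb:.1f} GB/peer/round vs ~{dense_gb:.0f} GB dense")
# -> ~1.0 GB/peer/round vs ~140 GB dense; ~1 GB at a few hundred Mbps
#    lands in the tens of seconds, consistent with the 70 s observed.
```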
Key Technical Contributions
1. Local momentum approximation: Show that DiLoCo's global outer momentum can be well-approximated by local accumulators (>90% cosine similarity; see the toy check after this list)
2. Error feedback as momentum: Demonstrate that TOP-k + error feedback naturally provides similar benefits to outer momentum
3. Sparse aggregation benefits: Find that sparse aggregation actually improves performance over dense methods—likely due to emphasis on high-saliency components
4. Extreme quantization: Error feedback enables 2-bit quantization without additional accumulators or performance drops
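Contribution 1 is easy to get a feel for with a toy simulation (ours, not the paper's measurement): when every replica's pseudo-gradient shares a common signal, the local momentum accumulators stay closely aligned with the global one:

```python
import torch

def cosine(a, b):
    return (torch.dot(a, b) / (a.norm() * b.norm() + 1e-12)).item()

torch.manual_seed(0)
dim, replicas, beta = 10_000, 8, 0.9
global_m = torch.zeros(dim)
local_m = [torch.zeros(dim) for _ in range(replicas)]
for _ in range(50):                                  # 50 outer rounds
    shared = torch.randn(dim)                        # common descent direction
    grads = [shared + 0.5 * torch.randn(dim) for _ in range(replicas)]
    global_m = beta * global_m + torch.stack(grads).mean(dim=0)
    local_m = [beta * m + g for m, g in zip(local_m, grads)]
print([round(cosine(m, global_m), 3) for m in local_m])  # ~0.9+, echoing the >90% figure
```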
Implementation Details
* Chunked TOP-k (4096 elements/chunk) reduces index transmission overhead (see the sketch after this list)
* Custom index compression: 8.9, 6.6, 5.6 bits per value for different sparsity levels
* Drop-in replacement for DiLoCo all-reduce operations
* Compatible with existing distributed training frameworks
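The chunking trick in the first bullet is simple to sketch (illustrative, not the production kernel): a local offset within a 4096-element chunk needs only 12 bits, where a global index into a 72B-entry tensor would need ~37:

```python
import torch
import torch.nn.functional as F

def chunked_topk(x: torch.Tensor, chunk: int = 4096, density: float = 0.02):
    """Top-k per fixed-size chunk instead of one global top-k: the chunks are
    processed in parallel, and each index is a small local offset."""
    flat = x.flatten()
    pad = (-flat.numel()) % chunk                     # pad to a whole number of chunks
    xp = F.pad(flat, (0, pad)).view(-1, chunk)
    k = max(1, int(density * chunk))
    _, local_idx = torch.topk(xp.abs(), k, dim=1)     # one top-k per chunk, in parallel
    vals = torch.gather(xp, 1, local_idx)             # signed values to transmit
    return vals, local_idx.to(torch.int16)            # 12-bit offsets fit easily in int16
```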
Limitations & Future Work
* Ablations were run on 512M-parameter models (though deployed at 8B-70B scale)
* Chunk size optimization could be further explored
* Random-k performs significantly worse than TOP-k
This work makes distributed training viable over commodity internet connections and opens the door to global AI training collaborations that bandwidth constraints previously ruled out.
Weights are Apache 2.0 on HuggingFace: https://huggingface.co/1Covenant/Covenant-72B
Built by Covenant AI with Mila Quebec. Happy to answer questions about the training protocol, compression methods, or the validation mechanism.