Once you know how to compress 32-bit parameters to ternary, compressing ternary to binary is the easy part. :)
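For anyone curious what the "ternary to binary" step actually looks like, here's a rough sketch in plain Python (not anyone's real kernel, just an illustration): since 3^5 = 243 fits in a byte, you can pack 5 ternary weights per byte, which lands you at ~1.6 bits per weight, close to the log2(3) ≈ 1.58 floor.

```python
# Minimal sketch: pack ternary weights {-1, 0, +1} into bytes, 5 per byte.
# 3^5 = 243 <= 256, so five base-3 digits ("trits") fit in one byte.

def pack_ternary(weights):
    """Pack a list of ternary weights (-1, 0, +1) into bytes."""
    packed = bytearray()
    for i in range(0, len(weights), 5):
        chunk = weights[i:i + 5]
        value = 0
        for w in reversed(chunk):          # base-3 encode, 5 trits per byte
            value = value * 3 + (w + 1)    # map {-1, 0, +1} -> {0, 1, 2}
        packed.append(value)
    return bytes(packed)

def unpack_ternary(packed, count):
    """Inverse of pack_ternary; `count` is the original number of weights."""
    weights = []
    for byte in packed:
        for _ in range(5):
            weights.append(byte % 3 - 1)   # map {0, 1, 2} -> {-1, 0, +1}
            byte //= 3
    return weights[:count]

ws = [1, -1, 0, 0, 1, -1, -1]
assert unpack_ternary(pack_ternary(ws), len(ws)) == ws
```

Of course, in an actual inference kernel you'd want to unpack on the fly (or operate on the packed form directly), which is exactly where the pain mentioned below comes from.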
They'd keep recursively re-compressing the entire model until the whole thing was a single bit, but the unpacking and repacking during inference is a bitch.