Daniel & Michael Han:
DeepSeek-R1 has been making waves recently by rivaling OpenAI's o1 reasoning model while being fully open-source. We explored how to make it easier for local users to run, and managed to quantize DeepSeek's 671B-parameter R1 model down to 131GB, an 80% reduction from the original 720GB, while keeping it very functional.
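As a quick sanity check on the numbers above (720GB down to 131GB), the claimed reduction works out like this:

```python
# Arithmetic check of the quoted size reduction (figures taken from the text).
original_gb = 720   # original DeepSeek-R1 671B model size
quantized_gb = 131  # dynamic 1.58-bit quant size

reduction_pct = (1 - quantized_gb / original_gb) * 100
print(f"{reduction_pct:.1f}% smaller")  # ~81.8%, i.e. roughly an 80% reduction
```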
By studying DeepSeek-R1's architecture, we managed to selectively quantize certain layers to higher bits (like 4-bit) while taking most MoE layers (like those used in GPT-4) down to 1.58-bit (see Unsloth Dynamic 4-bit). Naively quantizing all layers breaks the model entirely, causing endless loops and gibberish outputs. Our dynamic quants solve this.
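The selective scheme above can be sketched as a per-layer bit-width rule. This is purely illustrative and not Unsloth's actual code; the layer-name patterns are hypothetical stand-ins for the sensitive layers kept at higher precision:

```python
# Illustrative sketch of a dynamic quantization rule (not Unsloth's real code).
# Sensitive layers keep 4-bit; the bulk of the MoE expert weights drop to 1.58-bit.
def choose_bits(layer_name: str) -> float:
    """Return a target bit-width for a layer under a dynamic quant scheme."""
    # Embeddings, attention, and dense/shared blocks are fragile: keep 4-bit.
    if any(k in layer_name for k in ("embed", "attn", "shared_expert", "dense")):
        return 4.0
    # Routed MoE expert weights (the vast majority of parameters): 1.58-bit.
    if "expert" in layer_name:
        return 1.58
    return 4.0  # default: stay conservative for anything unrecognized

# Hypothetical layer names, just to show the rule in action:
for name in ("model.embed_tokens", "blk.10.attn_q", "blk.10.ffn_expert_3"):
    print(name, "->", choose_bits(name), "bits")
```

Quantizing everything to 1.58-bit (i.e. dropping the first branch) is the "naive" approach the text says breaks the model.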
The 1.58-bit quantization should fit in 160GB of VRAM (2x H100 80GB) for fast inference, attaining around 140 tokens per second. You don't need VRAM (a GPU) to run the 1.58-bit R1; just 20GB of RAM (CPU) will work, however it may be slow. For optimal performance, we recommend the sum of VRAM + RAM to be at least 80GB.
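The memory guidance above can be captured as two simple checks. This is an illustrative helper, not an official tool; the thresholds come straight from the figures in the text:

```python
# Rough feasibility checks for the memory guidance above (illustrative only).
MODEL_GB = 131  # 1.58-bit quant size from the text

def fits_in_vram(vram_gb: float) -> bool:
    """Whole model in VRAM -> fast inference (e.g. 2x H100 80GB = 160GB)."""
    return vram_gb >= MODEL_GB

def meets_recommendation(vram_gb: float, ram_gb: float) -> bool:
    """The text recommends VRAM + RAM totaling at least 80GB."""
    return vram_gb + ram_gb >= 80

print(fits_in_vram(160))             # True: fits on 2x H100 80GB
print(meets_recommendation(0, 20))   # False: still runs, but expect it to be slow
```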