
DeepSeek.cpp: CPU-Only Inference for DeepSeek LLMs

What is the project about?

DeepSeek.cpp is a C++ implementation for running inference on DeepSeek large language models (LLMs). It's designed for CPU-only environments, focusing on efficiency and minimal dependencies. It's derived from the "Yet Another Language Model" (yalm) project.

What problem does it solve?

  • Provides a way to run DeepSeek LLMs on devices without GPUs or with limited GPU resources.
  • Offers a lightweight, dependency-minimal alternative to larger inference frameworks (such as llama.cpp and vLLM), making it suitable for low-end hardware.
  • Eliminates the need for a Python runtime, simplifying deployment in some environments.

What are the features of the project?

  • CPU-only inference: Runs entirely on the CPU.
  • DeepSeek model support: Specifically tailored for the DeepSeek family of LLMs (V2-Lite, V2, V2.5, V3, R1), with varying levels of support for different precisions (FP32, BF16, FP16, F8E5M2).
  • Quantization: Supports F8E5M2 quantization with blockwise quantization (128x128 blocks) for reduced memory usage and improved performance. MoE gates and layer norms remain in full precision for better accuracy. INT4 and 1.58-bit quantization are planned.
  • Multiple modes: Supports "completion" (text generation), "passkey" (a long-context retrieval test), and "perplexity" (measuring how well the model predicts a given text) modes.
  • Hugging Face weights conversion: Includes a Python script (convert.py) to convert Hugging Face safetensors weights and configuration files into the .dseek format used by the C++ code.
  • Sliding window context: Supports a sliding window context length.
  • CLI interface: Provides a command-line interface (./build/main) for easy interaction.
  • Temperature control: Allows adjusting the temperature parameter for controlling the randomness of text generation.
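
The F8E5M2 format mentioned above is an 8-bit float with 1 sign bit, 5 exponent bits, and 2 mantissa bits. As a rough illustration only (this is not deepseek.cpp's actual code, which additionally applies per-block scaling over the 128x128 tiles), here is a minimal C++ encode/decode sketch that truncates the mantissa and skips subnormal and NaN handling:

```cpp
#include <cstdint>
#include <cstring>

// Simplified fp32 -> F8E5M2 encode: truncate the mantissa to 2 bits and
// rebias the exponent from 8 bits (bias 127) to 5 bits (bias 15).
// Small values flush to zero; overflow clamps to the max finite value.
uint8_t f32_to_e5m2(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    uint32_t sign = (bits >> 31) & 1;
    int32_t  exp  = (int32_t)((bits >> 23) & 0xFF) - 127;  // unbias fp32 exponent
    uint32_t man  = (bits >> 21) & 0x3;                    // keep top 2 mantissa bits
    if (f == 0.0f) return (uint8_t)(sign << 7);
    exp += 15;                                             // rebias for 5-bit exponent
    if (exp <= 0)  return (uint8_t)(sign << 7);            // flush subnormals to zero
    if (exp >= 31) { exp = 30; man = 3; }                  // clamp to max finite
    return (uint8_t)((sign << 7) | ((uint32_t)exp << 2) | man);
}

// Decode F8E5M2 back to fp32 by widening the exponent and mantissa fields.
float e5m2_to_f32(uint8_t v) {
    uint32_t sign = (v >> 7) & 1;
    int32_t  exp  = (v >> 2) & 0x1F;
    uint32_t man  = v & 0x3;
    if (exp == 0 && man == 0) return sign ? -0.0f : 0.0f;
    uint32_t bits = (sign << 31) | ((uint32_t)(exp - 15 + 127) << 23) | (man << 21);
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```

With only 2 mantissa bits, values like 1.0 and 1.5 round-trip exactly, while 0.3 decodes to 0.25; this coarseness is why the project keeps MoE gates and layer norms in full precision.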

What are the technologies used in the project?

  • C++20: The core inference engine is written in C++20.
  • Python: Used for the weight conversion script (convert.py); its dependencies are listed in requirements.txt.
  • Hugging Face Transformers: The project interacts with models and weights in the Hugging Face format.
  • Safetensors: Uses the safetensors format for model weights.
  • Git LFS: Used for downloading large model files.
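
Since the converter reads safetensors files, the format's layout is worth noting: an 8-byte little-endian header length, followed by that many bytes of JSON metadata describing each tensor's dtype, shape, and byte offsets, followed by the raw tensor data. A minimal C++ sketch of reading the JSON header (illustrative only, not code from the project, and assuming a little-endian host):

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Read the JSON header of a .safetensors file. The first 8 bytes are a
// little-endian uint64 giving the header length in bytes; the JSON that
// follows maps tensor names to {dtype, shape, data_offsets}.
std::string read_safetensors_header(const std::string& path) {
    std::ifstream f(path, std::ios::binary);
    if (!f) return "";
    uint64_t header_len = 0;
    f.read(reinterpret_cast<char*>(&header_len), sizeof(header_len));
    std::string header(header_len, '\0');
    f.read(header.data(), static_cast<std::streamsize>(header_len));
    return f ? header : "";
}
```

A converter can parse this JSON to locate each tensor's byte range without loading the whole file into memory.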

What are the benefits of the project?

  • Accessibility: Enables running DeepSeek models on a wider range of hardware.
  • Efficiency: Optimized for CPU usage, potentially offering better performance on CPU-bound systems.
  • Simplicity: Small codebase (<2k LOC excluding external libraries) and minimal dependencies make it easier to understand, modify, and deploy.
  • No Python runtime: Reduces dependencies and simplifies deployment in environments where Python is not readily available.

What are the use cases of the project?

  • Low-resource environments: Running LLMs on devices with limited compute power, such as older computers, embedded systems, or cloud instances without GPUs.
  • Research and experimentation: Provides a simple platform for experimenting with DeepSeek models and potentially modifying the inference process.
  • Educational purposes: The small codebase makes it a good learning resource for understanding LLM inference.
  • Offline applications: Suitable for applications where internet connectivity is limited or unavailable.
  • Testing and development: Can be used for testing and development of applications that will eventually use DeepSeek models.

Important Notes (Limitations):

  • Decoding only: Currently only supports decoding (generating one token at a time). Prefill (processing a batch of prompt tokens) is not implemented.
  • Naive multi-head latent attention: Uses a basic implementation of multi-head latent attention (MLA), not the optimized variant.
  • Large memory requirements for V3: DeepSeek V3 requires significant memory (around 650GB for F8E5M2), which may necessitate swap space on many systems, leading to performance degradation.
  • Repetition issues: The models can sometimes get stuck in repetitive loops, especially at lower temperatures. A temperature around 1.0 is recommended.
  • Incomplete feature implementation: Some newer architectural features of DeepSeek V3 are not yet implemented.
  • WIP Status: Many features are marked as "WIP" (Work In Progress), indicating ongoing development.
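
To illustrate why temperature matters for the repetition issue above, here is a generic temperature-sampling sketch in C++ (an assumed interface for illustration, not deepseek.cpp's actual sampler). Logits are divided by the temperature before the softmax, so low temperatures sharpen the distribution toward the top token (increasing repetition risk), while a temperature near 1.0 preserves the model's own distribution:

```cpp
#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

// Sample a token index from raw logits at a given temperature.
// Dividing logits by the temperature before softmax sharpens (<1.0)
// or flattens (>1.0) the distribution; 1.0 leaves it unchanged.
int sample_with_temperature(const std::vector<float>& logits,
                            float temperature, std::mt19937& rng) {
    // Softmax over scaled logits, subtracting the max for numerical stability.
    std::vector<double> p(logits.size());
    double maxl = *std::max_element(logits.begin(), logits.end()) / temperature;
    double sum = 0.0;
    for (size_t i = 0; i < logits.size(); ++i) {
        p[i] = std::exp(logits[i] / temperature - maxl);
        sum += p[i];
    }
    // Draw from the categorical distribution by inverse CDF.
    std::uniform_real_distribution<double> dist(0.0, sum);
    double r = dist(rng), acc = 0.0;
    for (size_t i = 0; i < p.size(); ++i) {
        acc += p[i];
        if (r <= acc) return static_cast<int>(i);
    }
    return static_cast<int>(p.size()) - 1;
}
```

At temperature 0.1, logits {0, 10, 1} become {0, 100, 10} before the softmax, so the middle token is chosen essentially every time; at temperature 1.0 the other tokens retain small but nonzero probability, which helps break repetitive loops.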