
DeepSeek.cpp: CPU-Only Inference for DeepSeek LLMs

What is the project about?

DeepSeek.cpp is a C++ implementation for running inference on DeepSeek large language models (LLMs). It's designed for CPU-only environments, focusing on efficiency and minimal dependencies. It's derived from the "Yet Another Language Model" (yalm) project.

What problem does it solve?

  • Provides a way to run DeepSeek LLMs on devices without GPUs or with limited GPU resources.
  • Offers a lightweight, dependency-minimal alternative to larger inference frameworks (such as llama.cpp and vLLM), making it suitable for low-end hardware.
  • Eliminates the need for a Python runtime, simplifying deployment in some environments.

What are the features of the project?

  • CPU-only inference: Runs entirely on the CPU.
  • DeepSeek model support: Specifically tailored for the DeepSeek family of LLMs (V2-Lite, V2, V2.5, V3, R1), with varying levels of support for different precisions (FP32, BF16, FP16, F8E5M2).
  • Quantization: Supports F8E5M2 quantization with blockwise quantization (128x128 blocks) for reduced memory usage and improved performance. MoE gates and layer norms remain in full precision for better accuracy. INT4 and 1.58-bit quantization are planned.
  • Multiple modes: Supports "completion" (text generation), "passkey" (a long-context retrieval test), and "perplexity" (measuring how well the model predicts a given text) modes.
  • Hugging Face weights conversion: Includes a Python script (convert.py) to convert Hugging Face safetensors weights and configuration files into the .dseek format used by the C++ code.
  • Sliding window context: Supports a sliding window context length.
  • CLI interface: Provides a command-line interface (./build/main) for easy interaction.
  • Temperature control: Allows adjusting the temperature parameter for controlling the randomness of text generation.
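
The F8E5M2 format mentioned above is an 8-bit float with 1 sign bit, 5 exponent bits, and 2 mantissa bits. As a rough illustration only (this is not deepseek.cpp's actual code, which additionally applies per-block scaling over the 128x128 tiles), here is a minimal C++ encode/decode sketch that truncates the mantissa and skips subnormal and NaN handling:

```cpp
#include <cstdint>
#include <cstring>

// Simplified fp32 -> F8E5M2 encode: truncate the mantissa to 2 bits and
// rebias the exponent from 8 bits (bias 127) to 5 bits (bias 15).
// Small values flush to zero; overflow clamps to the max finite value.
uint8_t f32_to_e5m2(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    uint32_t sign = (bits >> 31) & 1;
    int32_t  exp  = (int32_t)((bits >> 23) & 0xFF) - 127;  // unbias fp32 exponent
    uint32_t man  = (bits >> 21) & 0x3;                    // keep top 2 mantissa bits
    if (f == 0.0f) return (uint8_t)(sign << 7);
    exp += 15;                                             // rebias for 5-bit exponent
    if (exp <= 0)  return (uint8_t)(sign << 7);            // flush subnormals to zero
    if (exp >= 31) { exp = 30; man = 3; }                  // clamp to max finite
    return (uint8_t)((sign << 7) | ((uint32_t)exp << 2) | man);
}

// Decode F8E5M2 back to fp32 by widening the exponent and mantissa fields.
float e5m2_to_f32(uint8_t v) {
    uint32_t sign = (v >> 7) & 1;
    int32_t  exp  = (v >> 2) & 0x1F;
    uint32_t man  = v & 0x3;
    if (exp == 0 && man == 0) return sign ? -0.0f : 0.0f;
    uint32_t bits = (sign << 31) | ((uint32_t)(exp - 15 + 127) << 23) | (man << 21);
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```

With only 2 mantissa bits, values like 1.0 and 1.5 round-trip exactly, while 0.3 decodes to 0.25; this coarseness is why the project keeps MoE gates and layer norms in full precision.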

What are the technologies used in the project?

  • C++20: The core inference engine is written in C++20.
  • Python: Used for the weight conversion script (convert.py); its dependencies are listed in requirements.txt.
  • Hugging Face Transformers: The project interacts with models and weights in the Hugging Face format.
  • Safetensors: Uses the safetensors format for model weights.
  • Git LFS: Used for downloading large model files.
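
Since the converter reads safetensors files, the format's layout is worth noting: an 8-byte little-endian header length, followed by that many bytes of JSON metadata describing each tensor's dtype, shape, and byte offsets, followed by the raw tensor data. A minimal C++ sketch of reading the JSON header (illustrative only, not code from the project, and assuming a little-endian host):

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Read the JSON header of a .safetensors file. The first 8 bytes are a
// little-endian uint64 giving the header length in bytes; the JSON that
// follows maps tensor names to {dtype, shape, data_offsets}.
std::string read_safetensors_header(const std::string& path) {
    std::ifstream f(path, std::ios::binary);
    if (!f) return "";
    uint64_t header_len = 0;
    f.read(reinterpret_cast<char*>(&header_len), sizeof(header_len));
    std::string header(header_len, '\0');
    f.read(header.data(), static_cast<std::streamsize>(header_len));
    return f ? header : "";
}
```

A converter can parse this JSON to locate each tensor's byte range without loading the whole file into memory.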

What are the benefits of the project?

  • Accessibility: Enables running DeepSeek models on a wider range of hardware.
  • Efficiency: Optimized for CPU usage, potentially offering better performance on CPU-bound systems.
  • Simplicity: Small codebase (<2k LOC excluding external libraries) and minimal dependencies make it easier to understand, modify, and deploy.
  • No Python runtime: Reduces dependencies and simplifies deployment in environments where Python is not readily available.

What are the use cases of the project?

  • Low-resource environments: Running LLMs on devices with limited compute power, such as older computers, embedded systems, or cloud instances without GPUs.
  • Research and experimentation: Provides a simple platform for experimenting with DeepSeek models and potentially modifying the inference process.
  • Educational purposes: The small codebase makes it a good learning resource for understanding LLM inference.
  • Offline applications: Suitable for applications where internet connectivity is limited or unavailable.
  • Testing and development: Can be used for testing and development of applications that will eventually use DeepSeek models.

Important Notes (Limitations):

  • Decoding only: Currently only supports decoding (generating one token at a time). Prefill (processing a batch of prompt tokens) is not implemented.
  • Naive multi-head latent attention: Uses a basic implementation of multi-head latent attention (MLA), not the optimized variant.
  • Large memory requirements for V3: DeepSeek V3 requires significant memory (around 650GB for F8E5M2), which may necessitate swap space on many systems, leading to performance degradation.
  • Repetition issues: The models can sometimes get stuck in repetitive loops, especially at lower temperatures. A temperature around 1.0 is recommended.
  • Incomplete feature implementation: Some newer architectural features of DeepSeek V3 are not yet implemented.
  • WIP Status: Many features are marked as "WIP" (Work In Progress), indicating ongoing development.
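
To illustrate why temperature matters for the repetition issue above, here is a generic temperature-sampling sketch in C++ (an assumed interface for illustration, not deepseek.cpp's actual sampler). Logits are divided by the temperature before the softmax, so low temperatures sharpen the distribution toward the top token (increasing repetition risk), while a temperature near 1.0 preserves the model's own distribution:

```cpp
#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

// Sample a token index from raw logits at a given temperature.
// Dividing logits by the temperature before softmax sharpens (<1.0)
// or flattens (>1.0) the distribution; 1.0 leaves it unchanged.
int sample_with_temperature(const std::vector<float>& logits,
                            float temperature, std::mt19937& rng) {
    // Softmax over scaled logits, subtracting the max for numerical stability.
    std::vector<double> p(logits.size());
    double maxl = *std::max_element(logits.begin(), logits.end()) / temperature;
    double sum = 0.0;
    for (size_t i = 0; i < logits.size(); ++i) {
        p[i] = std::exp(logits[i] / temperature - maxl);
        sum += p[i];
    }
    // Draw from the categorical distribution by inverse CDF.
    std::uniform_real_distribution<double> dist(0.0, sum);
    double r = dist(rng), acc = 0.0;
    for (size_t i = 0; i < p.size(); ++i) {
        acc += p[i];
        if (r <= acc) return static_cast<int>(i);
    }
    return static_cast<int>(p.size()) - 1;
}
```

At temperature 0.1, logits {0, 10, 1} become {0, 100, 10} before the softmax, so the middle token is chosen essentially every time; at temperature 1.0 the other tokens retain small but nonzero probability, which helps break repetitive loops.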