DeepSeek.cpp: CPU-Only Inference for DeepSeek LLMs
What is the project about?
DeepSeek.cpp is a C++ implementation for running inference on DeepSeek large language models (LLMs). It is designed for CPU-only environments, focusing on efficiency and minimal dependencies, and is derived from the "Yet Another Language Model" (yalm) project.
What problem does it solve?
- Provides a way to run DeepSeek LLMs on devices without GPUs or with limited GPU resources.
- Offers a lightweight and dependency-minimal alternative to larger inference frameworks (like llama.cpp and vllm), making it suitable for low-end hardware.
- Eliminates the need for a Python runtime, simplifying deployment in some environments.
What are the features of the project?
- CPU-only inference: Runs entirely on the CPU.
- DeepSeek model support: Specifically tailored for the DeepSeek family of LLMs (V2-Lite, V2, V2.5, V3, R1), with varying levels of support for different precisions (FP32, BF16, FP16, F8E5M2).
- Quantization: Supports F8E5M2 quantization with blockwise quantization (128x128 blocks) for reduced memory usage and improved performance. MoE gates and layer norms remain in full precision for better accuracy. INT4 and 1.58-bit quantization are planned.
- Multiple modes: Supports "completion" (text generation), "passkey" (a long-context retrieval test), and "perplexity" (measuring how well the model predicts a given text) modes.
- Hugging Face weights conversion: Includes a Python script (`convert.py`) to convert Hugging Face safetensors weights and configuration files into the `.dseek` format used by the C++ code.
- Sliding window context: Supports a sliding-window context length.
- CLI interface: Provides a command-line interface (`./build/main`) for easy interaction.
- Temperature control: Allows adjusting the temperature parameter to control the randomness of text generation.
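As a rough illustration of the blockwise F8E5M2 scheme above, the sketch below quantizes a 1-D block of floats with a single scale and decodes it back. This is not the project's actual code: the real scheme uses 128x128 2-D blocks, and the function names, truncating (round-toward-zero) conversion, and flush-to-zero handling of subnormals here are simplifications for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <vector>

// Encode a float as F8E5M2 (1 sign, 5 exponent, 2 mantissa bits).
// We rebias the float32 exponent and truncate the mantissa
// (round-toward-zero for brevity; real kernels round to nearest even).
uint8_t f32_to_e5m2(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    uint8_t sign = (bits >> 24) & 0x80;
    int32_t exp  = (int32_t)((bits >> 23) & 0xFF) - 127 + 15; // rebias 8-bit -> 5-bit
    uint8_t man  = (bits >> 21) & 0x3;                        // keep top 2 mantissa bits
    if (f == 0.0f) return sign;
    if (exp <= 0)  return sign;           // underflow -> signed zero (subnormals dropped)
    if (exp >= 31) return sign | 0x7C;    // overflow -> infinity
    return sign | (uint8_t)(exp << 2) | man;
}

float e5m2_to_f32(uint8_t v) {
    uint8_t sign = v & 0x80;
    int exp = (v >> 2) & 0x1F;
    int man = v & 0x3;
    if (exp == 0 && man == 0) return sign ? -0.0f : 0.0f;
    float mag = (exp == 0)
        ? std::ldexp(man / 4.0f, -14)                 // subnormal
        : std::ldexp(1.0f + man / 4.0f, exp - 15);    // normal: (1 + m/4) * 2^(e-15)
    return sign ? -mag : mag;
}

// One quantized block: a float32 scale plus one byte per value.
struct QuantBlock {
    float scale;
    std::vector<uint8_t> codes;
};

QuantBlock quantize_block(const std::vector<float>& x) {
    float amax = 0.0f;
    for (float v : x) amax = std::max(amax, std::fabs(v));
    QuantBlock b;
    b.scale = (amax > 0.0f) ? amax / 57344.0f : 1.0f; // 57344 = largest finite e5m2
    for (float v : x) b.codes.push_back(f32_to_e5m2(v / b.scale));
    return b;
}

std::vector<float> dequantize_block(const QuantBlock& b) {
    std::vector<float> out;
    for (uint8_t c : b.codes) out.push_back(e5m2_to_f32(c) * b.scale);
    return out;
}
```

Because E5M2 keeps only 2 mantissa bits, each value survives the round trip with at most ~25% relative error under truncation; keeping the block scale in float32 (and MoE gates and layer norms in full precision, as the project does) limits how much of the model that error touches.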
What are the technologies used in the project?
- C++20: The core inference engine is written in C++20.
- Python: Used for the weight conversion script (`convert.py`); its dependencies are listed in `requirements.txt`.
- Hugging Face Transformers: The project interacts with models and weights in the Hugging Face format.
- Safetensors: Uses the safetensors format for model weights.
- Git LFS: Used for downloading large model files.
What are the benefits of the project?
- Accessibility: Enables running DeepSeek models on a wider range of hardware.
- Efficiency: Optimized for CPU usage, potentially offering better performance on CPU-bound systems.
- Simplicity: Small codebase (<2k LOC excluding external libraries) and minimal dependencies make it easier to understand, modify, and deploy.
- No Python runtime: Reduces dependencies and simplifies deployment in environments where Python is not readily available.
What are the use cases of the project?
- Low-resource environments: Running LLMs on devices with limited compute power, such as older computers, embedded systems, or cloud instances without GPUs.
- Research and experimentation: Provides a simple platform for experimenting with DeepSeek models and potentially modifying the inference process.
- Educational purposes: The small codebase makes it a good learning resource for understanding LLM inference.
- Offline applications: Suitable for applications where internet connectivity is limited or unavailable.
- Testing and development: Can be used for testing and development of applications that will eventually use DeepSeek models.
Important Notes (Limitations):
- Decoding only: Currently only supports decoding (generating one token at a time). Prefill (processing a batch of prompt tokens) is not implemented.
- Naive multi-head latent attention: Uses a naive implementation of DeepSeek's multi-head latent attention (MLA), not an optimized one.
- Large memory requirements for V3: DeepSeek V3 requires significant memory (around 650GB for F8E5M2), which may necessitate swap space on many systems, leading to performance degradation.
- Repetition issues: The models can sometimes get stuck in repetitive loops, especially at lower temperatures. A temperature around 1.0 is recommended.
- Incomplete feature implementation: Some newer architectural features of DeepSeek V3 are not yet implemented.
- WIP Status: Many features are marked as "WIP" (Work In Progress), indicating ongoing development.
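The temperature recommendation above follows from how sampling typically works: logits are divided by the temperature before the softmax, so values below 1.0 sharpen the distribution toward the highest-probability token, which makes repetitive loops more likely, while 1.0 leaves the distribution as the model produced it. A minimal sketch of this standard technique (not the project's actual sampler):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Convert logits into a probability distribution, dividing by `temperature`
// first. Lower temperatures concentrate probability mass on the top token;
// higher temperatures flatten the distribution.
std::vector<double> softmax_with_temperature(const std::vector<double>& logits,
                                             double temperature) {
    std::vector<double> probs(logits.size());
    double max_logit = logits[0];
    for (double l : logits) max_logit = std::max(max_logit, l);
    double sum = 0.0;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        // Subtract the max before exponentiating for numerical stability.
        probs[i] = std::exp((logits[i] - max_logit) / temperature);
        sum += probs[i];
    }
    for (double& p : probs) p /= sum;
    return probs;
}
```

For logits {2, 1, 0}, dropping the temperature from 1.0 to 0.5 raises the top token's probability from roughly 0.67 to roughly 0.87, which is why low temperatures tend to re-sample the same tokens.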
