
Project Description: llama2.c

What is the project about?

llama2.c is a project focused on providing a minimalist, easy-to-read implementation for training and running inference on Llama 2 Large Language Models (LLMs). It pairs a PyTorch implementation for training with a single C file (run.c, ~700 lines) for inference. It is designed to be educational, easily hackable, and to have minimal dependencies.
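As a minimal sketch of the quickstart flow (commands follow the project's README; stories15M.bin is a small TinyStories checkpoint hosted under karpathy/tinyllamas on Hugging Face, and filenames may change between versions):

    # clone and build the fp32 inference engine
    git clone https://github.com/karpathy/llama2.c.git
    cd llama2.c
    make run

    # fetch a small demo checkpoint (~60 MB) and sample from it
    wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
    ./run stories15M.bin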

What problem does it solve?

  • Provides a simplified, understandable, and dependency-free way to run inference on Llama 2 models.
  • Allows users to experiment with and understand the Llama 2 architecture without complex frameworks.
  • Enables training and inference of smaller, customized LLMs for specific, narrow domains (see the training sketch after this list).
  • Offers a starting point for deploying LLMs in resource-constrained environments (the initial focus is fp32 inference, with int8 quantization also supported).
  • Reduces the barrier to entry for experimenting with LLMs, especially for educational purposes.
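A sketch of that small-model workflow, following the TinyStories recipe documented in the repository (script names, flags, and output paths are as documented there and may differ across versions; the custom-vocabulary steps are optional):

    # download and pretokenize the TinyStories dataset
    python tinystories.py download
    python tinystories.py pretokenize
    # optional: train a smaller custom SentencePiece vocabulary instead
    #   python tinystories.py train_vocab --vocab_size=4096
    #   python tinystories.py pretokenize --vocab_size=4096

    # train a small Llama 2-architecture model from scratch
    python train.py

    # run the exported checkpoint with the C engine
    ./run out/model.bin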

What are the features of the project?

  • Training: PyTorch-based training of Llama 2 models from scratch or finetuning.
  • Inference: Pure C inference engine (run.c) with no external dependencies (beyond math.h).
  • Model Support: Supports both custom-trained models and Meta's official Llama 2 models (after conversion).
  • Quantization: Includes support for int8 quantization (runq.c) to reduce model size and improve inference speed.
  • Custom Tokenizers: Allows training and using custom SentencePiece tokenizers for improved efficiency on specific datasets.
  • Sampling Options: Provides command-line arguments for controlling sampling parameters (temperature, top-p), as sketched after this list.
  • Chat Mode: Supports interactive chat with Llama 2 Chat models.
  • Hugging Face Model Support: Can load Hugging Face models that use the Llama 2 architecture.
  • OpenMP Support: Can be compiled with OpenMP for multi-core CPU acceleration.
  • Testing: Includes both Python (pytest) and C-based tests.
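The sampling and chat features map directly onto run.c's command-line flags. A sketch (flags as listed in the binary's usage output; the checkpoint names are placeholders):

    # generation with explicit sampling controls:
    # -t temperature, -p top-p, -s RNG seed, -n steps, -i prompt
    ./run stories42M.bin -t 0.8 -p 0.9 -s 42 -n 256 -i "One day, Lily met a Shoggoth"

    # interactive chat with a Llama 2 Chat checkpoint;
    # -y sets an optional system prompt
    ./run llama2_7b_chat.bin -m chat -y "You are a helpful assistant"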

What are the technologies used in the project?

  • Programming Languages: C (inference) and Python (training and model export), with a Makefile driving the build.
  • Libraries/Frameworks:
    • PyTorch (training)
    • SentencePiece (for custom tokenizers)
    • OpenMP (optional, for parallel processing; build sketch after this list)
  • Models: Llama 2 architecture.
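A sketch of the OpenMP build mentioned above, using the Makefile target from the repository (thread count is controlled by the standard OMP_NUM_THREADS environment variable):

    # compile run.c with -Ofast plus OpenMP pragmas enabled
    make runomp
    # choose a thread count appropriate for your CPU
    OMP_NUM_THREADS=4 ./run out/model.bin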

What are the benefits of the project?

  • Simplicity: Easy to understand and modify due to the minimal codebase.
  • Portability: The C inference engine has minimal dependencies, making it potentially portable to various platforms.
  • Educational: Serves as a valuable learning resource for understanding LLM inference.
  • Hackability: Designed to be easily forked and customized for specific applications.
  • Efficiency: Performs well, especially when compiled with -Ofast, parallelized with OpenMP, and run with int8 quantization.
  • Reduced Size: int8 quantization significantly reduces the size of model checkpoints (sketched after this list).
  • Flexibility: Supports custom tokenizers and different sampling strategies.
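As a sketch of the int8 path referenced above (per the repository's README, a --version 2 export writes an int8 Q8_0 checkpoint, which the separate runq binary consumes; paths are placeholders):

    # export an int8-quantized checkpoint from Meta's fp32 weights
    python export.py llama2_7b_q80.bin --version 2 --meta-llama path/to/llama/model/7B

    # build and run the quantized inference engine (runq.c)
    make runq
    ./runq llama2_7b_q80.bin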

What are the use cases of the project?

  • Education: Learning about LLM architectures and inference.
  • Research: Experimenting with LLM training and inference techniques.
  • Prototyping: Quickly building and testing LLM-powered applications.
  • Custom LLM Development: Training and deploying small, specialized LLMs for specific tasks.
  • Edge Deployment: With further optimization, the C inference engine could serve resource-constrained environments, though this is not the primary focus.
  • Chatbot Development: Using the chat mode to interact with Llama 2 Chat models (conversion and launch sketched after this list).
  • Code Generation: Using Code Llama models for code-related tasks (though support is still experimental).
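A sketch of getting from official weights to the chatbot use case above (export commands as documented in the README; the path/to/... placeholders stand for wherever Meta's downloaded weights live):

    # convert Meta's official chat weights into the llama2.c format
    python export.py llama2_7b_chat.bin --meta-llama path/to/llama-2-7b-chat
    ./run llama2_7b_chat.bin -m chat

    # or convert a Hugging Face model that uses the Llama 2 architecture
    python export.py llama2_7b_hf.bin --hf meta-llama/Llama-2-7b-hf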