albertan017/LLM4Decompile

LLM4Decompile: Decompiling Binary Code with Large Language Models

What is the project about?

LLM4Decompile is an open-source project that uses Large Language Models (LLMs) to decompile binary code back into human-readable C source code. It currently targets Linux x86_64 binaries compiled with GCC at optimization levels O0 through O3. It comes in two main flavors: LLM4Decompile-End, which decompiles directly from the disassembled binary, and LLM4Decompile-Ref, which refines the pseudo-code produced by the Ghidra decompiler.

What problem does it solve?

Decompilation is the process of reverse engineering executable (binary) code back into a higher-level representation such as C. It is a hard problem: traditional decompilers emit pseudo-code that is often difficult to read, so much of the work still falls to human experts. LLM4Decompile aims to automate more of this process and to produce output that is both readable and functionally correct. That matters wherever compiled code has to be understood without its source, for tasks like:

Security analysis: Finding vulnerabilities in software where source code is unavailable.
Malware analysis: Understanding the behavior of malicious programs.
Legacy code maintenance: Recovering lost source code or understanding old systems.
Software interoperability: Adapting software to different platforms.

What are the features of the project?

Decompilation of Linux x86_64 binaries: The current version focuses on this specific architecture and operating system.
Support for multiple optimization levels (O0-O3): Handles code compiled with different levels of GCC optimization.
Two decompilation approaches:
- LLM4Decompile-End: Directly decompiles assembly code (obtained via objdump) to C code.
- LLM4Decompile-Ref: Takes the pseudo-code output from the Ghidra decompiler and refines it, improving its quality.
Multiple model sizes: Offers models ranging from 1.3 billion to 22 billion parameters, allowing users to choose a balance between performance and resource requirements.
Evaluation benchmarks: Includes "HumanEval-Decompile" (functions that depend only on standard C libraries) and "ExeBench" (functions drawn from real-world projects) to measure decompilation quality.
Re-executability metric: Assesses the functional correctness of the decompiled code by recompiling it and checking whether it passes predefined test cases.
Simple model loading: The models can be loaded and run for decompilation with the Hugging Face Transformers library.
Colab notebook: Offers an interactive demonstration of the model's usage.
Training scripts: Includes scripts for training the model on a subset of the data.
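
The re-executability metric listed above can be illustrated with a minimal sketch. This is not the project's evaluation script: the function names `is_reexecutable` and `re_executability_rate`, the timeout, and the idea of a C test harness whose `main` returns 0 on success are all assumptions made for illustration. It requires `gcc` on the PATH.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def is_reexecutable(decompiled_c: str, test_harness_c: str,
                    timeout_s: float = 10.0) -> bool:
    """Return True if the decompiled code compiles and its tests pass.

    `test_harness_c` is a hypothetical C snippet with a `main` that
    asserts on the function's behavior and returns 0 on success.
    """
    gcc = shutil.which("gcc")
    if gcc is None:
        raise RuntimeError("gcc not found on PATH")
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "candidate.c"
        exe = Path(tmp) / "candidate"
        src.write_text(decompiled_c + "\n" + test_harness_c)
        build = subprocess.run([gcc, str(src), "-o", str(exe), "-lm"],
                               capture_output=True)
        if build.returncode != 0:
            return False  # the output does not even recompile
        try:
            run = subprocess.run([str(exe)], capture_output=True,
                                 timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return False  # treat a hang as a failed test
        return run.returncode == 0


def re_executability_rate(outcomes: list[bool]) -> float:
    """Fraction of samples whose decompiled code passed all of its tests."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```

A benchmark run would call `is_reexecutable` once per decompiled function and pass the collected booleans to `re_executability_rate`. Generated code is untrusted, so a real harness should run it inside a sandbox.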

What are the technologies used in the project?

Large Language Models (LLMs): The core technology. The released models are fine-tuned from existing code LLMs, Yi-Coder-9B among them.
Python: The primary programming language.
PyTorch: The deep learning framework.
Hugging Face Transformers: A library for working with pre-trained LLMs.
GCC: The GNU Compiler Collection, used for compiling the C code into binaries.
Objdump: A command-line tool for disassembling binary files into assembly code.
Ghidra: A software reverse engineering (SRE) framework developed by the NSA (used in the LLM4Decompile-Ref models).
Conda: For environment management.
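
How these tools fit together for the LLM4Decompile-End flow can be sketched roughly as follows: GCC compiles a C file, objdump disassembles it, the target function's instructions are extracted, and the text is handed to a model through Transformers. All helper names here are hypothetical, the prompt wording is an assumption that should be checked against the project's README, and `model_id` is left as a parameter so that a model from the project's published list can be supplied.

```python
def build_commands(c_file: str, opt: str = "O0") -> list[list[str]]:
    """Commands that compile one C file and disassemble the result."""
    if opt not in {"O0", "O1", "O2", "O3"}:
        raise ValueError(f"unsupported optimization level: {opt}")
    obj = f"{c_file}.{opt}.o"
    return [
        ["gcc", "-c", f"-{opt}", c_file, "-o", obj],
        ["objdump", "-d", obj],
    ]


def extract_function(objdump_text: str, func_name: str) -> str:
    """Pull one function's instructions out of `objdump -d` output."""
    label = f"<{func_name}>:"
    if label not in objdump_text:
        raise ValueError(f"{func_name} not found in disassembly")
    # A function's listing ends at the next blank line.
    body = objdump_text.split(label, 1)[1].split("\n\n", 1)[0]
    lines = [label]
    for line in body.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # blank lines and raw-byte continuation lines
        # Drop the address and byte columns, then any trailing comment.
        instr = "\t".join(fields[2:]).split("#", 1)[0].strip()
        if instr:
            lines.append(instr)
    return "\n".join(lines)


def build_prompt(asm: str) -> str:
    # Assumed prompt template; verify against the project's README.
    return f"# This is the assembly code:\n{asm}\n# What is the source code?\n"


def decompile(prompt: str, model_id: str, max_new_tokens: int = 512) -> str:
    # Imported lazily so the text helpers above work without PyTorch.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

In use, the commands from `build_commands` would be run with `subprocess.run`, the objdump output passed to `extract_function`, and the result of `build_prompt` passed to `decompile`. The LLM4Decompile-Ref flow would replace the objdump step with Ghidra's pseudo-code, which this sketch does not cover.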

What are the benefits of the project?

Automated decompilation: Reduces the need for manual reverse engineering, saving time and effort.
Improved decompilation quality: LLM output can be more readable than the pseudo-code from traditional tools, and the re-executability metric gives a concrete way to check whether it is also functionally correct.
Open-source and accessible: The code and models are publicly available, promoting research and collaboration.
Multiple model options: Users can choose the model that best suits their needs and resources.
Ongoing development: The project is actively being improved, with plans to support more architectures and configurations.

What are the use cases of the project?

Vulnerability research: Analyzing binaries for security flaws.
Malware analysis: Understanding the behavior of malicious software.
Software auditing: Verifying the security and correctness of closed-source software.
Legacy code recovery: Recovering source code from old binaries.
Code understanding: Learning how compiled code works.
Education: Teaching reverse engineering and compiler principles.
Software porting: Assisting in porting software to different platforms by providing a higher-level understanding of the original binary.