AI-powered patching: the future of automated vulnerability fixes - Research Paper Notes
How has Google managed to fix 15% of its sanitizer bugs using AI?
As AI continues to improve rapidly, it has also developed the ability to find security bugs in code, and every vulnerability it finds is an opportunity to fix one. Let's see how Google uses AI to patch vulnerabilities in its code.
SAIF Model
Google has developed the Secure AI Framework (SAIF), which harnesses Google's own Gemini AI model to find and fix vulnerabilities in code. It includes a fundamental pillar addressing the need to "automate defenses to keep pace with new and existing threats."
It does this by automating a pipeline that prompts an LLM to generate code fixes for human review.
LLMs vs. Sanitizer Bugs
While Google promotes the use of memory-safe languages like Rust, a huge amount of legacy code remains in use, and the LLM continues to find bugs in it and generate fixes for human review. Because these bugs are discovered post-merge, sanitizer testing produces a backlog of issues that do not block immediate progress. Consequently, the median time-to-fix for these bugs is longer than for those identified before the code is merged.
LLM-Powered Pipeline
1. Find vulnerabilities
2. Isolate and reproduce them
3. Use LLMs to create fixes
4. Test the fixes
5. Surface the best fix for human review and submission
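As a rough illustration, the sketch below wires these five stages together in Python. Every function, model name, and constant here is a hypothetical placeholder standing in for Google's internal services, not the actual implementation.

    # Minimal sketch of the five-stage pipeline; every function here is an
    # illustrative stub, not Google's implementation.
    from dataclasses import dataclass

    @dataclass
    class Bug:
        error_type: str
        stack_trace: str
        test_target: str

    def find_sanitizer_bugs():                 # 1. find vulnerabilities (stub)
        return [Bug("heap-use-after-free", "...stack trace...", "//foo:bar_test")]

    def reproduce_in_isolation(bug):           # 2. re-run the test to confirm (stub)
        return bug                             # None would mean "no longer reproduces"

    def generate_fix(model, bug):              # 3. prompt an LLM for a patch (stub)
        return f"patch from {model} for {bug.error_type}"

    def apply_and_test(patch):                 # 4. build, run tests and sanitizers (stub)
        return True

    def queue_for_review(patch):               # 5. surface the best fix to humans (stub)
        print("ready for human review:", patch)

    MODELS = ["model-a", "model-b"]            # different models do better on different errors
    ATTEMPTS_PER_MODEL = 3                     # a few tries each before moving on

    def run_pipeline():
        for bug in find_sanitizer_bugs():
            if reproduce_in_isolation(bug) is None:
                continue                       # already fixed or not reproducible
            for model in MODELS:
                fixed = False
                for _ in range(ATTEMPTS_PER_MODEL):
                    patch = generate_fix(model, bug)
                    if patch and apply_and_test(patch):
                        queue_for_review(patch)
                        fixed = True
                        break
                if fixed:
                    break

    if __name__ == "__main__":
        run_pipeline()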
Step 1 - Finding Sanitizer bugs
Detecting and reproducing sanitizer bugs in the LLM pipeline involves two main considerations. First, all information from the test run must be preserved, especially the stack trace, to help the LLM determine the fix. Second, the service must be able to run easily reproducible tests, both to catch non-deterministic bugs and to confirm that an error has not already been fixed if there is a delay between detection and a fix attempt.
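A minimal sketch of these two considerations, assuming a generic test command (a list like ["./my_test"]) and common sanitizer error markers; the helper names and markers are illustrative, not Google's test infrastructure:

    # Sketch: capture a sanitizer failure with its full stack trace and confirm
    # it still reproduces. The test command and error markers are assumptions.
    import subprocess

    SANITIZER_MARKERS = ("ERROR: AddressSanitizer", "ERROR: MemorySanitizer",
                         "WARNING: ThreadSanitizer")

    def run_test(test_cmd):
        """Run the test once and return its exit code and combined output."""
        proc = subprocess.run(test_cmd, capture_output=True, text=True)
        return proc.returncode, proc.stdout + proc.stderr

    def capture_sanitizer_bug(test_cmd, runs=5):
        """Re-run a few times to catch non-deterministic bugs; keep the whole log."""
        for _ in range(runs):
            code, output = run_test(test_cmd)
            if code != 0 and any(m in output for m in SANITIZER_MARKERS):
                return output          # full report, including the stack trace
        return None                    # did not reproduce

    def still_reproduces(test_cmd):
        """Before attempting a fix, confirm the error has not already been fixed."""
        return capture_sanitizer_bug(test_cmd, runs=3) is not None

Keeping the full output rather than just a pass/fail bit is what later gives the LLM the stack trace it needs.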
Step 2 - Reproducing bugs in isolation
LLMs have limited context length, so the prompt for the LLM must be concise. To achieve this, the specific code needing a fix is isolated, typically at the file level, as most files fit within the LLM's context length. If the context length is insufficient, smaller code pieces like functions or class definitions can be used. The code triggering the sanitizer error may not be the code needing modification, so initially, a heuristic was used to select the first file in the stack trace within the same directory as the test file. However, this approach was not very effective. Instead, a custom ML model was trained using past bugs to score files based on the likelihood they contain the code needing modification. This score guides the fix strategy, determining which files to prompt the LLM to fix and in what order.
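A sketch of what this file selection might look like, with a trivial stand-in for the trained scoring model; the frame-parsing regex, the scoring function, and the context limit are all assumptions:

    # Sketch: pick which files to show the LLM, ordered by a score of how likely
    # each file is to contain the code that needs changing. The regex and the
    # scoring stub are assumptions, not the custom model Google trained.
    import re

    FRAME_RE = re.compile(r"(\S+\.(?:cc|h|cpp)):(\d+)")   # e.g. foo/bar.cc:42

    def files_in_stack_trace(stack_trace):
        """Unique source files referenced by the trace, in order of appearance."""
        seen, files = set(), []
        for path, _line in FRAME_RE.findall(stack_trace):
            if path not in seen:
                seen.add(path)
                files.append(path)
        return files

    def score_file(path, stack_trace):
        """Placeholder for the trained model that scores modification likelihood."""
        return 1.0 / (1 + files_in_stack_trace(stack_trace).index(path))

    def rank_candidate_files(stack_trace):
        files = files_in_stack_trace(stack_trace)
        return sorted(files, key=lambda p: score_file(p, stack_trace), reverse=True)

    def fit_to_context(source, max_chars=30000):
        """Fall back to a smaller slice (e.g. one function) if the file is too big."""
        return source if len(source) <= max_chars else source[:max_chars]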
Step 3 - Creating a Fix Using the LLM
Prompt used:

You are a senior software engineer tasked with fixing sanitizer errors. Please fix them.

…code
// Please fix the <error-type> error originating here.
… line of code pointed to by the stack trace
… code
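A sketch of how such a prompt could be assembled programmatically, splicing the fix-here comment at the line the stack trace points to; the function and variable names are illustrative, and the prompt wording above is only a paraphrase:

    # Sketch: build the prompt by inserting a "please fix" comment just above the
    # line of code the stack trace points to. Names are illustrative placeholders.

    SYSTEM_PROMPT = ("You are a senior software engineer tasked with fixing "
                     "sanitizer errors. Please fix them.")

    def build_prompt(file_text, error_type, error_line):
        """error_line is the 1-based line number taken from the stack trace."""
        lines = file_text.splitlines()
        marker = f"// Please fix the {error_type} error originating here."
        lines.insert(error_line - 1, marker)
        return SYSTEM_PROMPT + "\n\n" + "\n".join(lines)

For example, build_prompt(source_text, "heap-use-after-free", 128) would produce the text sent to the model for a use-after-free reported at line 128.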
Step 4 - Testing the Generated Fix
To test the LLM's solutions, an automated process was needed to create commits from the generated output and run automated tests on the modified code. LLMs often add extraneous details to the code, which can complicate patch generation and testing. To address this, few-shot prompting can be used to provide examples of the desired output structure, or special symbols can be requested to enclose the generated code for easier filtering.

Additionally, since LLMs may not output the entire file or function, the insertion point for code modifications must be located. This can be done by prompting the LLM to include several lines of code before and after the modification, allowing simple text analysis to match the correct location. These methods result in an automated commit ready for testing and sanitizer checks.

Different models performed better on different types of errors, so the pipeline was constructed to prompt several models sequentially, giving each model a few attempts before moving on to the next if no solution was found.
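The two techniques described here, enclosing the generated code in special symbols and repeating a few unchanged lines on each side of the edit, might be implemented roughly as follows; the delimiters and the amount of surrounding context are assumptions:

    # Sketch: filter the model's output down to the code between agreed-on
    # delimiters, then find where it belongs by matching the unchanged lines the
    # model was asked to repeat before and after its edit. Assumptions throughout.

    BEGIN, END = "<code>", "</code>"
    CONTEXT_LINES = 3   # lines of untouched code the model repeats on each side

    def extract_generated_code(llm_output):
        """Drop any prose the model added around the code block."""
        start = llm_output.find(BEGIN)
        stop = llm_output.find(END, start)
        if start == -1 or stop == -1:
            return None
        return llm_output[start + len(BEGIN):stop].strip("\n")

    def splice_fix(original_text, generated):
        """Replace the original span whose leading/trailing context matches."""
        gen_lines = generated.splitlines()
        head = gen_lines[:CONTEXT_LINES]          # context before the modification
        tail = gen_lines[-CONTEXT_LINES:]         # context after the modification
        orig_lines = original_text.splitlines()

        starts = [i for i in range(len(orig_lines))
                  if orig_lines[i:i + CONTEXT_LINES] == head]
        for start in starts:
            for end in range(start + CONTEXT_LINES, len(orig_lines) + 1):
                if orig_lines[end - CONTEXT_LINES:end] == tail:
                    return "\n".join(orig_lines[:start] + gen_lines + orig_lines[end:])
        return None                               # no unambiguous match; skip this fix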
Step 5 - Surfacing the Best Fixes for Human Review and Approval
These tests and sanitizer checks are only the first step in addressing the possibility of hallucinations. Currently, an ML-generated fix must be reviewed by humans even if it passes all tests. For additional safety, a two-stage human filter was employed: the first round rejected about 10-20% of generated commits as false positives or poor solutions, and the remaining commits were sent to code owners for final validation. Approximately 95% of these were accepted without discussion, a higher rate than for human-generated changes. This could be due to thorough filtering or to greater trust in technology-generated solutions.

To address potential over-reliance, developers should be aware that LLMs make mistakes and should evaluate suggestions rigorously. For example, temporary code changes with "TODOs" or fixes that simply remove failing tests can be problematic. Improving the quality of training data can further enhance LLM-generated fixes.
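The red flags mentioned above, TODO placeholders and fixes that simply delete failing tests, could also be screened automatically before a patch ever reaches a human reviewer. A small sketch with purely illustrative heuristics:

    # Sketch: a simple automated screen for the red flags mentioned above before a
    # patch reaches a human reviewer. The heuristics are illustrative only.

    def suspicious_fix(patch_diff):
        """Return a list of reasons to reject the generated commit outright."""
        reasons = []
        added = [l for l in patch_diff.splitlines() if l.startswith("+")]
        removed = [l for l in patch_diff.splitlines() if l.startswith("-")]
        if any("TODO" in l for l in added):
            reasons.append("introduces a TODO instead of a real fix")
        if any("TEST(" in l or "def test_" in l for l in removed):
            reasons.append("removes an existing test")
        return reasons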