Credit: Pixabay/CC0 Public Domain
A team of computer scientists led by the University of Massachusetts Amherst recently announced a new method for automatically generating whole proofs that can be used to prevent software bugs and verify that the underlying code is correct.
The new method, called Baldur, harnesses the artificial intelligence power of large language models (LLMs) and, when combined with the state-of-the-art tool Thor, generates proofs with an unprecedented efficacy of nearly 66%. The team was recently awarded a Distinguished Paper Award at the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering.
"Even though our software is everywhere and we all use it every day, we have unfortunately come to expect it to be buggy," says Yuriy Brun, a professor in the Manning College of Information and Computer Sciences at UMass Amherst and senior author of the paper.
The consequences of buggy software can range from the merely annoying, such as glitchy formatting or sudden crashes, to the potentially catastrophic, as with security breaches or failures in the precision software used to control space exploration or health care devices.
Of course, there are methods for checking software as it is written. One popular method is simple: have a human go through the code line by line and manually verify that there are no errors. Alternatively, you can run the code and check that it does what you expect. For example, if you expect your word-processing software to break a line every time you press the "Return" key, but it instead outputs a question mark, you know something is wrong with the code.
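To make that second method concrete, here is a minimal sketch, in Python, of an expected-output check. The toy word-processor handler and its name are hypothetical illustrations, not code from the study:

```python
# Toy illustration of output-based checking (hypothetical example).
# We run the code and compare its behavior against what we expect.

def handle_keypress(buffer: str, key: str) -> str:
    """A toy word-processor keypress handler."""
    if key == "Return":
        return buffer + "\n"  # expected: Return inserts a line break
    return buffer + key

# One input, one expected output. Passing this check says nothing about
# the countless inputs we did not try -- the limitation described next.
assert handle_keypress("hello", "Return") == "hello\n"
print("test passed")
```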
The problem with both methods is that they are prone to human error, and checking for every possible glitch is extraordinarily time-consuming, expensive, and impossible for anything but trivial systems.
A more thorough, but far harder, method is to create a mathematical proof showing that the code does what is expected, and then use a theorem prover to confirm that the proof itself is correct. This method is called machine checking.
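For a sense of what a machine-checked proof looks like, here is a tiny, self-contained example in Isabelle/HOL, the proof language discussed below. The lemma, that reversing a list twice returns the original list, is a standard textbook exercise, not one drawn from the paper:

```isabelle
theory Example
  imports Main
begin

(* A property of code (here, the list-reversal function "rev"), stated
   as a lemma; the theorem prover checks the proof mechanically. *)
lemma rev_rev: "rev (rev xs) = xs"
  by (induct xs) auto

end
```

Here the entire proof is the single line "by (induct xs) auto". For realistic software, such proofs grow far longer and much harder to write, which is the problem Baldur attacks.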
But writing these proofs by hand is incredibly time-consuming and requires extensive expertise. "These proofs can be many times longer than the software code itself," says Emily First, the paper's lead author, who completed this research as part of her doctoral work at UMass Amherst.
With the rise of LLMs, of which ChatGPT is the best-known example, one possible solution is to try to automate the writing of such proofs. However, "one of the biggest challenges with LLMs is that they're not always correct," says Brun. "Instead of failing and letting you know something is wrong, they 'fail silently,' giving a wrong answer but presenting it as if it were correct. And, often, the worst thing you can do is fail silently."
This is where Baldur comes in.
First, whose team did its work at Google, used Minerva, an LLM trained on a large corpus of natural-language text and then fine-tuned on 118GB of mathematical scientific papers and webpages containing mathematical expressions.
Next, she further fine-tuned the LLM on Isabelle/HOL, a language in which mathematical proofs are written. Baldur then generated an entire proof and worked in tandem with the theorem prover to check its work. When the theorem prover caught an error, it fed the proof, along with information about the error, back into the LLM so that it could learn from its mistake and generate a new and, ideally, error-free proof.
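In outline, that generate-check-repair loop works roughly as in the following Python sketch. Every function name here is a hypothetical stand-in for illustration, not Baldur's actual API:

```python
# Minimal sketch of a Baldur-style generate-check-repair loop,
# as described above. Function names are hypothetical stand-ins.

from typing import Optional, Tuple

def llm_generate_proof(theorem: str, feedback: Optional[str] = None) -> str:
    """Ask the fine-tuned LLM for a candidate proof. On repair attempts,
    the failed proof and the prover's error message are included so the
    model can learn from its mistake. (Hypothetical stub.)"""
    raise NotImplementedError

def check_proof(theorem: str, proof: str) -> Tuple[bool, str]:
    """Have the theorem prover (e.g., Isabelle/HOL) check the candidate
    proof; returns (ok, error_message). (Hypothetical stub.)"""
    raise NotImplementedError

def prove(theorem: str, max_attempts: int = 3) -> Optional[str]:
    """Generate a whole proof, then repair it using prover feedback."""
    feedback = None
    for _ in range(max_attempts):
        proof = llm_generate_proof(theorem, feedback)
        ok, error = check_proof(theorem, proof)
        if ok:
            # The theorem prover, not the LLM, is what we trust: a proof
            # that checks cannot "fail silently."
            return proof
        # Feed the failed proof and the error back for the next attempt.
        feedback = f"Failed proof:\n{proof}\nProver error:\n{error}"
    return None  # no checked proof found within the attempt budget
```

The key design point is that the LLM never has the final word: only a proof the theorem prover accepts is returned, which is what guards against the silent failures Brun describes.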
This process yields a remarkable increase in accuracy. Thor, the previous state-of-the-art tool for automatically generating proofs, succeeds 57% of the time. When Baldur (Thor's brother, according to Norse mythology) is paired with Thor, the two can generate proofs 65.7% of the time.
Although there is still a considerable degree of error, Baldur is by far the most effective and efficient way yet devised to verify software correctness.
The paper was published as part of the Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering.
More information:
Emily First et al., Baldur: Whole-Proof Generation and Repair with Large Language Models, Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (2023). DOI: 10.1145/3611643.3616243