Understanding Vibe Proving | Towards Data Science

“What I cannot create, I do not understand”
— attributed to R. Feynman

After Vibe Coding, we seem to have entered the (very niche, but much cooler) era of Vibe Proving: DeepMind wins gold at the International Mathematical Olympiad, Harmonic solves a non-trivial problem in number theory, and for the first time in history AI systems appear to be doing serious mathematics.

We are on the cusp of a profound change in the field of mathematics. Vibe proving is here.

Aristotle from @HarmonicMath just proved Erdos Problem #124 in @leanprover, all by itself. This problem has been open for nearly 30 years since conjectured in the paper “Complete sequences…

— Vlad Tenev (@vladtenev) November 30, 2025

At the same time, we are constantly reminded that LLMs hallucinate: it seems we have a paradox at hand. If LLMs can produce both garbage text (including math proofs) and correct reasoning (including math reasoning), how do we tell which model completions are good and which ones are hallucinations?

Training for vibe proving: LLMs generate candidate proofs, which are verified by software; the verification result provides the reward for post-training. [ Image by the author ]

The short LinkedIn-length answer is that we use an LLM to generate mathematical reasoning, then we externalize our trust to special software which verifies it. But this short answer raises in turn new questions:

1. How does this “special software” work?

2. Why do we trust it?

3. How do we train an LLM to use it for proofs?

It is tempting to write articles on big ideas like “LLMs and math”, providing intuitive, non-rigorous answers in the most general setting possible. As an example, this is how a new book tries to sell you on the whole “math and computers” story:

The “truth oracle” that thinkers have sought for centuries, a tool to definitively verify or refute any mathematical or logical assertion. (…) [The book] offers a profound answer to a longstanding mystery: Can computers reveal universal truths?

We will do the exact opposite here. Instead of miracle-sounding words and analogies, we will build an LLM-for-proofs system from scratch, with the goal of giving exact answers for simple scenarios: we gain in precision, trading some generality. Instead of big-picture arguments, we strive for inspectable scripts which work on simple “proofs”. We break down our guide into two parts, which may also be enjoyed independently:

Part 1 (this one): understanding verifiers, and building a solid mental model of what makes math so special (why don’t we trust LLMs writing laws in the same way we do with proofs?). This will answer questions 1 and 2 above;

Part 2 (coming soon): training an open-source LLM to produce proofs, using our verifier to provide a reward signal during reinforcement learning. This will answer 3.

Of course, one could piece together the basics by reading papers at the frontier. Cutting-edge reports, however, typically use a special-purpose language and are filled with many other technical considerations, making it difficult to disentangle the essential components and build an initial mental model of the problem space.

By building an LLM-for-proofs, we get direct exposure to the main ideas so that further explorations should be easier: as we cannot answer all the questions in-line, more advanced follow-ups are addressed in the FAQs at the end.

Now, buckle up, clone the repo and code along.

What does it mean for a proof to be correct?

“Archimedes will be remembered when Aeschylus is forgotten, because languages die and mathematical ideas do not”

— G. H. Hardy

Proof checkers are software programs that take as input mathematical reasoning generated by an LLM and verify that it is correct. This is deceptively simple: understanding what “correct” means, and how correctness can be established, is the key to building a precise mental model of how LLMs can be so successful at math.

That computers can verify math in generality may indeed be surprising, but we are all familiar with the mechanical application of specific mathematical rules. In high school we learn the rule to check that 2x is the derivative of x² + 3 – crucially, we may not know what a derivative is, or what those symbols even mean, but as long as we apply the rule, we can verify the claim. In a nutshell, what we’ll find today is that this phenomenon is much more general than we might suspect: mathematical reasoning itself can be codified in symbols, and treated algorithmically.

Everybody knows the syllogism as a paradigmatic case of correct reasoning: “All men are mortal; Socrates is a man; therefore Socrates is mortal”. The standard for correctness is logical consequence: a statement C is a logical consequence of (necessarily follows from) premises S₁, S₂ … Sₖ if it is not possible for S₁, S₂ … Sₖ to be true and C to be false. In other words, whenever the premises are true, the conclusion is necessarily true.

In simple cases, it is easy to convince ourselves that the conclusion follows from the premises, but what about more complex ones? Can you “see” immediately that the existence of infinitely many prime numbers is a logical consequence of basic facts about multiplication and division?

Proofs are a sequence of statements, starting with one or more premises, and ending with a conclusion, such that each statement necessarily follows from the premises and / or previous statements: the purpose of a proof is to show us exactly how C follows from the premises, by proceeding step by step. For example, we can indeed show that there are infinitely many primes:

Suppose, for the sake of contradiction, that there are only finitely many primes. List them all: p₁, p₂, …, pₖ.
Consider a new number N = (p₁ × p₂ × … × pₖ) + 1. Since N is greater than 1, it is either prime or composite.
If N is prime, then we have found a prime that is not in the original list, contradicting the assumption that p₁, …, pₖ were all the primes.
If N is composite, it must have a prime divisor q. But q cannot be among p₁, …, pₖ, because dividing N by any p leaves remainder 1. So again we have found a prime not in the original list, q.
In both cases we contradict the assumption that there were only finitely many primes. Therefore, there must be infinitely many primes.

As long as you are comfortable with basic arithmetic and have an informal grasp on “negation” and “contradiction”, you can follow this chain of deductions. The crucial insight is that coming up with a proof in the first place may require a lot of (for lack of a better word) creativity, but checking one is a purely mechanical procedure: you don’t have to take my word for it, you can check for yourself that my proof is correct. Indeed, a proof you cannot check is not a proof: as Alonzo Church put it, if an auditor cannot check with “certain means” a sequence of formulas put forward as a proof, he may demand another proof, and “until the supplementary proof is provided, he may refuse to be convinced” (and so on, potentially forever).
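The construction in the proof can also be run on concrete lists of primes. The sketch below is only an illustration (the function names are assumptions, and running examples is of course not a proof): it builds N from a finite list and returns a prime that is missing from it.

```python
from math import prod

def smallest_prime_factor(n):
    """Return the smallest prime factor of n (n itself if n is prime), for n >= 2."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return n

def euclid_witness(primes):
    """Given a finite list of primes, return a prime that is not in the list."""
    n = prod(primes) + 1
    # Dividing N by any listed prime leaves remainder 1, as the proof states.
    assert all(n % p == 1 for p in primes)
    return smallest_prime_factor(n)

print(euclid_witness([2, 3, 5]))             # 31 (N = 31 is itself prime)
print(euclid_witness([2, 3, 5, 7, 11, 13]))  # 59 (N = 30031 = 59 * 509)
```

The two calls correspond to the two cases of the proof: in the first N is prime, in the second N is composite and has a prime divisor outside the list.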

To sum things up: a proof is a step-by-step checkable way to show that a conclusion follows from a set of premises. Since proofs are checkable by design, we can build software that performs the mechanical checks at scale, and use it to distinguish garbage from correct proofs generated by LLMs.

As the history of software shows, algorithms work well on formally defined languages, such as Python or Rust. On our way to build our own tiny verifier, we then need to pick a way to represent proofs so that a computer can easily check them.

A programming language for reasoning

“The virtue of formal texts is that their manipulations, in order to be legitimate, need to satisfy only a few simple rules; they are, when you come to think of it, an amazingly effective tool for ruling out all sorts of nonsense that, when we use our native tongues, are almost impossible to avoid.”
— E.W. Dijkstra

Vibe Coding may give us the illusion that we now “program in English”, but in the end computers always execute instructions in a programming language. If we want a computer to check proofs, we need to express them in a precise language: unambiguous to parse, easy to manipulate.

After dreaming about it for centuries, mathematicians have figured out how to use mathematics to model reasoning itself, a special programming language, if you will, to carry out proofs in a rigorous way. Let’s take this trivial proof, which purports to show that 12 is even and divisible by 3 starting from two self-evident premises:

1. S₁: 12 is greater than 10 and 12 is even
2. S₂: 12 is divisible by 3
3. By line 1, we can see that 12 is even
4. By line 2, we can see that 12 is divisible by 3
5. C, by lines 3 and 4: therefore, 12 is even and 12 is divisible by 3

We can use variables and boolean operations to represent sentences in symbols, analogously to abstracting strings into constants in Python: Q = “12 is greater than 10”; R = “12 is even”; Z = “12 is divisible by 3”

1. Q AND R
2. Z
3. R (1)
4. Z (2)
5. R AND Z (3,4)

With dependencies between lines expressed in brackets, it is easy to see what mechanical rules can be used here: one rule (“AND elimination”) states that from alpha AND beta (where alpha, beta are placeholders) we can eliminate AND and get one of the two; another rule (“AND introduction”) says to introduce alpha AND beta provided that alpha, beta appear in some previous step. Unlike the English proof, this translation allows for mechanical manipulation: if you wish for a coding analogy, a correct proof is like a script that “compiles correctly”, as automatically established by an algorithm (the compiler / verifier).
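As a rough sketch of what "compiling" the five-line proof above could look like, the snippet below checks each line against the rule it claims to use. Everything here is an assumption made for illustration: the tuple encoding of formulas, the rule names, and the extra `reiterate` rule (used for line 4, which simply repeats line 2). The checker in the companion repository works differently.

```python
def AND(left, right):
    """Hypothetical encoding: atoms are strings, conjunctions are tuples."""
    return ("AND", left, right)

def check_step(formula, rule, cited):
    """Check one proof line against the formulas on the lines it cites."""
    if rule == "premise":
        return not cited                      # premises cite nothing
    if rule == "reiterate":                   # repeat an earlier line as-is
        return len(cited) == 1 and formula == cited[0]
    if rule == "and_elim":                    # from (alpha AND beta) get alpha or beta
        if len(cited) != 1:
            return False
        (src,) = cited
        return isinstance(src, tuple) and src[0] == "AND" and formula in src[1:]
    if rule == "and_intro":                   # from alpha, beta get (alpha AND beta)
        return len(cited) == 2 and formula == AND(cited[0], cited[1])
    return False                              # unknown rule name

def check_proof(proof):
    """proof: list of (formula, rule, cited_line_numbers); lines count from 1."""
    for number, (formula, rule, refs) in enumerate(proof, start=1):
        if any(not 1 <= r < number for r in refs):
            return False                      # a line may only cite earlier lines
        cited = [proof[r - 1][0] for r in refs]
        if not check_step(formula, rule, cited):
            return False
    return True

proof = [
    (AND("Q", "R"), "premise", []),           # 1. Q AND R
    ("Z", "premise", []),                     # 2. Z
    ("R", "and_elim", [1]),                   # 3. R (1)
    ("Z", "reiterate", [2]),                  # 4. Z (2)
    (AND("R", "Z"), "and_intro", [3, 4]),     # 5. R AND Z (3,4)
]
bad = proof[:4] + [(AND("R", "Q"), "and_intro", [3, 4])]

print(check_proof(proof))  # True
print(check_proof(bad))    # False: lines 3 and 4 give R AND Z, not R AND Q
```

Note that the checker only compares shapes of symbols; it never asks what Q, R or Z stand for.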

In calculus, when we introduce trigonometric functions, we need new derivative rules. In a similar fashion, when we introduce more boolean operators (NOT, OR), we need new rules. The reader can check the details online (or directly in the repo), but the general gist remains: a rule will re-arrange one or more lines into a new line, either eliminating existing operators or introducing them.

To sum things up: we can transform an English proof to symbols that can be easily manipulated, not unlike treating a parabola as an equation so that we can get the derivative by some “re-arranging” of the terms. Checking the symbolic proof tells us whether the reasoning is correct and the mathematical conclusion can be trusted.

However, there are infinitely many things we may want to prove: how do we know that in a particularly complex proof the rules won’t break down and pass a garbage proof for a correct result? In other words, didn’t we just shift the burden from LLMs that hallucinate to rules that (plausibly fallible) mathematicians invented?

This is where the real magic happens: since reasoning itself is now precisely stated, we go meta and prove things about our rules for proofs themselves!

A sound(ness) check

“I don’t believe in empirical science. I only believe in a priori truth.”
— attributed to K. Gödel

Even if an algorithm checks that the rules in the proof were applied correctly, we still don’t know that the proof is verified; after all, programs have bugs all the time! In other words, how do we know the manipulation rules themselves are correct?

The insightful reader may have noticed our sleight of hand already. As humans, we care about the truth of our statements: we want to draw true conclusions from what we know already. However, machines care about the sequence of symbols on a proof line, and how a new line is created by mechanically rearranging those (remember: you can get the derivative of x² + 3 by re-arranging the symbols, without knowing what a slope is!). What is then the connection between symbols and truth? Ideally, these two perspectives should align: by using rules on true premises, I should get new lines which are themselves true. In other words, rules are truth-preserving: no matter how complex the proofs, no application of a rule will produce a false statement starting from a true one.

When we ascend from investigating specific proofs to discussing what all proofs should do, we go from logic to meta-logic. It is meta-logic that answers our question, by establishing the “soundness” of the system: a proof that, in every sequence of manipulations leading from premises to a conclusion, the conclusion is necessarily true whenever the premises are. As soundness proofs are tedious, we show how the proof works for only one rule, and take the chance to introduce how the concept of truth is modeled in our language.

The above proof added alpha AND beta as a new step starting from alpha, beta appearing separately. So far, so good. To model truth, we define an interpretation, which is an assignment I of truth values (1 = True, 0 = False) to our variables, with the standard recursive definition for complex sentences:

I(p AND q) = 1 if and only if I(p) = 1 and I(q) = 1
I(NOT q) = 1 if and only if I(q) = 0
…

This definition would feel like a déjà vu for any coder that has ever done something like if boolean_var and boolean_other_var, and that is because AND, OR, NOT in Python are modeled after the standard semantics of logical connectives (one of the many links between programming and logic!).
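The recursive definition translates almost word for word into code. In this sketch the nested-tuple encoding of formulas and the name `interpret` are assumptions for illustration; the OR clause is not listed above and is added with its standard meaning.

```python
def interpret(formula, world):
    """Return 1 or 0: the truth value of `formula` under the assignment `world`.

    `world` maps variable names to 0/1; formulas are nested tuples such as
    ("AND", "p", ("NOT", "q")). Both encodings are illustrative assumptions.
    """
    if isinstance(formula, str):              # a variable: look it up
        return world[formula]
    op, *args = formula
    if op == "AND":                           # I(p AND q) = 1 iff I(p) = 1 and I(q) = 1
        return int(interpret(args[0], world) == 1 and interpret(args[1], world) == 1)
    if op == "NOT":                           # I(NOT q) = 1 iff I(q) = 0
        return int(interpret(args[0], world) == 0)
    if op == "OR":                            # standard clause, not spelled out above
        return int(interpret(args[0], world) == 1 or interpret(args[1], world) == 1)
    raise ValueError(f"unknown operator: {op}")

world = {"p": 1, "q": 0}
print(interpret(("AND", "p", "q"), world))           # 0
print(interpret(("AND", "p", ("NOT", "q")), world))  # 1
```

Each branch mirrors one clause of the definition, and the recursion follows the structure of the sentence being evaluated.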

An interpretation function defines a possible state of the world by mapping variables to booleans: for example, one world has p true and q false, another world has both false, and so on (in the coding case, the interpretation is done in the script when initializing boolean_var and boolean_other_var). To complete the soundness check, we need to show truth preservation in all possible world states. We use a truth table to find out that, whenever p and q are true, p AND q is true as well:

As we wanted, the rule indeed complies with our definition of logical consequence: every time the premises are true, the conclusion is necessarily so (row 1). Is the converse true? As in, if a conclusion follows from some premises, is there always a proof of that? Only in some “programming languages” (like ours), the answer is yes: this is known as completeness, and it’s the dual of soundness (also it is more interesting, but much harder to prove!).
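The truth table for this rule is small enough to regenerate in a few lines. The sketch below (function names are assumptions) lists every world for p and q, and confirms that no world makes both premises true and the conclusion false; row 1 is the only row where the premises hold.

```python
from itertools import product

def truth_table():
    """One row per world: (I(p), I(q), I(p AND q))."""
    return [(p, q, int(p == 1 and q == 1)) for p, q in product([1, 0], repeat=2)]

def and_intro_is_truth_preserving():
    """No world makes both premises p, q true and the conclusion p AND q false."""
    return all(conclusion == 1
               for p, q, conclusion in truth_table()
               if p == 1 and q == 1)

print("p q | p AND q")
for p, q, conclusion in truth_table():
    print(p, q, "|", conclusion)

print(and_intro_is_truth_preserving())  # True
```

A full soundness proof repeats this kind of check for every rule of the system, not only for AND introduction.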

If we are able (and we are!) to guarantee that every rule we use is truth-preserving, we gain the certainty that no matter how complex a proof, no matter how far in the future, no matter if LLMs or humans are reasoning, the proof checking software will never pass a garbage proof for correct reasoning.

Barring implementation bugs (which are of course unavoidable) and assuming a faithful English-to-symbols translation, then, a verified proof is the ultimate standard for certainty: a precise, unambiguous form of reasoning which is mathematically sound and algorithmically checked. This is not only important as a guiding principle when trusting LLMs with mathematical discovery, but also underscores how different mathematics is from anything else: while we may wake up one day and find out Newton was wrong (we did!), we will never wake up one day and find out that there are only finitely many primes. Once a proof is verified, it is eternal knowledge we can build upon safely to gain new knowledge:

[Mathematicians from Ancient Greece] are not clever schoolboys, but Fellows of another college.

G. H. Hardy

See you, math cowboys!

“Talk is cheap. Show me the code.”
— L. Torvalds

In this first piece, we built a mental model to understand why we should even expect mathematical reasoning to be something that can be reliably verified by a computer. In answering our initial questions, we learned the distinctive, mechanical nature of proofs, and introduced the minimum ingredients we need to represent proofs as a program: a language and manipulation rules (basically, a DSL!)

In the companion repository, proof_checker_playground.py displays a sample of valid and invalid proofs, checked by a small Python class that implements our verification logic. The proof checker first verifies that a given series of steps is syntactically correct: like any programming language, we need symbols to be properly formatted! Then, it proceeds step by step, simulating a mathematician checking that every new line in the proof indeed follows from previous ones. Our code achieves this in a slightly unorthodox way compared to industrial-strength provers, sacrificing performance for pedagogical purposes (homework: can you guess why it is computationally inefficient? How large does the truth table grow as the number of variables increases?). When verifying step 4 in this sample proof:

1. S₁: …
2. S₂: …
3. S₃: A AND B
4. A (3)
…

the checker will retrieve the justification (the premise on line 3), then build a truth table with the required variables (A, B), with line 3 as the premise and the target step (line 4) being the conclusion:

A	B	A AND B (line 3)	*A (4)*
1	1	1	1
1	0	0	1
0	1	0	0
0	0	0	0

The savvy reader may already see where this is going: the proof step is considered correct if, whenever the premises are true, the conclusion is true as well – the definition of logical consequence we started from!
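A minimal sketch of this truth-table check follows. It is an illustration under assumed names and an assumed nested-tuple encoding of formulas, not the class from the companion repository: a step is accepted unless some row makes every cited premise true and the new line false.

```python
from itertools import product

def variables(formula):
    """Collect the variable names appearing in a nested-tuple formula."""
    if isinstance(formula, str):
        return {formula}
    return set().union(*(variables(arg) for arg in formula[1:]))

def value(formula, world):
    """Evaluate a formula to True/False under the assignment `world`."""
    if isinstance(formula, str):
        return world[formula]
    op, *args = formula
    if op == "AND":
        return all(value(a, world) for a in args)
    if op == "OR":
        return any(value(a, world) for a in args)
    if op == "NOT":
        return not value(args[0], world)
    raise ValueError(f"unknown operator: {op}")

def step_follows(premises, conclusion):
    """True if no row of the truth table has true premises and a false conclusion."""
    names = sorted(set().union(variables(conclusion),
                               *(variables(p) for p in premises)))
    for bits in product([True, False], repeat=len(names)):  # one row per assignment
        world = dict(zip(names, bits))
        if all(value(p, world) for p in premises) and not value(conclusion, world):
            return False  # counterexample row found
    return True

print(step_follows([("AND", "A", "B")], "A"))  # True: step 4 follows from line 3
print(step_follows(["A"], ("AND", "A", "B")))  # False: A alone does not give A AND B
```

The second call shows what a rejected step looks like: the row with A true and B false makes the premise true and the conclusion false.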

In the next piece, we will train an LLM to generate proofs, and use the scaffolding built today to verify those proofs. We will leverage closed-source LLMs and the checker to help us create a dataset of verified proofs, then use Tinker to set up a reinforcement learning loop. We prompt a small open-source model to prove a conclusion from some premises, then pass the generated proof to the checker and reward the model if it is correct.

While it is unclear whether RL is “just” eliciting existing capabilities or teaching genuinely new ones, either way we hope that even a small model can prove simple facts in our “programming language”: see you, math cowboys!

FAQ

If you want to go deeper, here are a few common follow-ups:

Rules are truth-preserving, but how do I know the premises are true in the first place? Logically speaking, a consequence can follow from wrong premises, but a proof that starts from 2+2=5 will hardly be interesting. In most real-world cases, our premises are the results of other theorems, so we know them to be consequences of other, more fundamental facts; the bad news is that at some point the buck has to stop, and no more premises can be invoked. It turns out that the vast majority of math can be backed by a few statements, the axioms of set theory: since axioms are the premises you cannot prove, it’s interesting to ask ourselves what justifies believing those in the first place. Math axioms are, however, less debatable than, say, moral principles, which is why it is way easier to believe in formalizing math (where everybody agrees on the premises, so we “just” check the reasoning) than in doing the same for legal arguments (where the two parties often start from different premises). The good news, though, is that a proof resembles a computer program in more than one way: once we have proved a theorem, we can “import” it into a larger proof in a similar way to how we import pandas in our code.
If proof checking is fully algorithmic, can we also write an algorithm to prove all truths? In full generality, we cannot. The existence of truths we cannot prove is likely one of the most profound mathematical facts of the last century: this is the essential take-away of Gödel’s Incompleteness Theorem. Moreover, some statements cannot be settled either way by our best axiom systems: the Axiom of Choice, for example, is independent of the other standard axioms of set theory (ZF).
Can we prove interesting things about the notion of proof itself? Yes! While Church’s point is definitely a sound argument, on his way to proving his celebrated theorem, Gödel proved that the predicate IsProofOf(proof, statement) is algorithmic. Plus, there is an entire logic for provability!
Where can I read more? My favorite intro-to-intermediate text is Language, Proof and Logic, which is especially geared towards non-math students; Computability and Logic is an intermediate text with a focus on the relationship between logic, proof and computation.

Acknowledgements

Thanks to Patrick John Chia, Federico Bianchi, Ethan Rosenthal, Ryan Vilim, Davis Treybig for precious feedback on previous versions of this draft. If you like the intersection of genAI, reasoning about distributed systems and verification, you can also check out our research at Bauplan.

AI coding assistants were used to write the companion repository, but no assistant was used to write the text (except for proofreading and typo correction).
