This question is somewhat related to my previous question and is also inspired by this other question concerning the credibility of extensive computations (although from a different perspective).
In the course of preparing a paper, I verified a conjecture computationally for a large range of cases.
For small cases, I was able to run the computations on my laptop, since they required little RAM. However, at some point this was no longer feasible. At my university we have access to a high-performance computing (HPC) cluster, which allowed me to perform the larger calculations.
This raises a concern: when the paper is submitted, the referee may wish to check the correctness of these computations. But what if they do not have access to an HPC cluster? In that case, they can only verify a portion of the claims.
Is it sufficient to provide the raw output together with the code on GitHub/Zenodo, or is this not considered credible? After all, the raw output will consist of a .txt file that could, in principle, be tampered with (although doing so would clearly be dishonest, against all academic standards, and would eventually be detected by the community). My concern is precisely how referees can reasonably test such results in order to avoid this potential issue.
Comments:
- (Jean Abou Samra, Sep 11) My first thoughts would be: (1) If you're solving an NP-complete problem, you probably need large computing resources to perform the computation, but you can include a certificate in the output that can be quickly verified. (2) If you're testing a conjecture from $n = 0$ to $n = 10^5$ and the computations are independent for different values of $n$, consider making it easy to run it just on a small range of large values and check only that in your output.
- (Piyush Grover, Sep 11) They won't, and if it is an important result, someone else will try to replicate it and find an issue if the result is incorrect. If it is not an important result, nothing will happen.
- (Andy Putman, Sep 11) Unless it is a major result, no referee is going to bother trying to replicate a computer calculation. It is even unusual for a referee to carefully check a complicated non-computer calculation. While a referee should believe the results in a paper, the authors are the ones who are ultimately responsible for a paper's correctness.
- (Kimball, Sep 12) Actually, even if it is an important result, no one may try to replicate it, at least for many years.
- (M. Winter, Sep 12) The problem you describe is just everyday business in every empirical field of inquiry that does not have absolute proof (so everything but math). Dan's answer addresses this, but I want to emphasize that it is mathematics that is the odd one out here, and one should therefore just adopt what is established best practice everywhere else.
5 Answers
I have written a few papers in which computer calculations form part of the proofs of some results, sometimes involving extensive use of high-performance computing resources that a referee could not be expected to reproduce. Here is the approach I have settled on (with the aid of helpful suggestions from referees!) to ensure that referees and other readers can see why such results are indeed true:
1. In the proof of any such result, describe the algorithm you implemented.
2. Include a proof that the algorithm indeed calculates what it's supposed to calculate.
3. Along with the description of the algorithm and the proof, give a link to a webpage where you have hosted your code, i.e., your implementation of the algorithm.
4. Add comments to the code explaining what each part does. In particular, the comments should allow the reader to match up each step in the algorithm with a block of code. The comments should also make it routine for the reader to see that the relevant step in the algorithm is indeed carried out by that code block, so the reader can verify the correctness of your code in roughly the same way that they would read an ordinary mathematical proof to verify its correctness.
5. Include a parameter in the code which lets the user run it up to some degree, some case, etc. Even if your referees or readers don't want to blow a week on a compute node on an HPC grid to run your code in every relevant case, you want them to be able to use the code in some limited range of cases, something completable in a few minutes on ordinary consumer hardware, in order to be further convinced that the code does what they think it does, and that they haven't misunderstood anything in it. (A minimal sketch of such a parameter appears below.)
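To make point 5 concrete, here is a minimal sketch of what such a range parameter could look like. Everything in it is hypothetical: the toy "conjecture" (Goldbach's, as a stand-in for whatever the paper actually checks), the function names, and the `--start`/`--stop` flags are illustrative only.

```python
"""Hypothetical verification script; the conjecture, names, and flags are
illustrative stand-ins, not taken from any particular paper."""
import argparse

def is_prime(m):
    # Step 1 of the toy algorithm: primality by trial division (fine at this scale).
    if m < 2:
        return False
    d = 2
    while d * d <= m:
        if m % d == 0:
            return False
        d += 1
    return True

def check_case(n):
    # Step 2: verify the toy conjecture for a single even n
    # (Goldbach: n is a sum of two primes).
    return any(is_prime(p) and is_prime(n - p) for p in range(2, n // 2 + 1))

def main():
    # Step 3: the range is a command-line parameter, so a reader can rerun a
    # small slice on a laptop while the full range is reserved for the cluster.
    parser = argparse.ArgumentParser()
    parser.add_argument("--start", type=int, default=4)
    parser.add_argument("--stop", type=int, default=10**4)
    args = parser.parse_args()
    for n in range(args.start + args.start % 2, args.stop + 1, 2):
        assert check_case(n), f"counterexample at n = {n}"
    print(f"conjecture verified for even n in [{args.start}, {args.stop}]")

if __name__ == "__main__":
    main()
```

A referee could then run something like `python verify.py --stop 5000` (hypothetical filename) in a few seconds and see exactly the same code path that the full HPC run used.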
I think the end result is pretty convincing, even if the referee or reader doesn't run the code at all: you provide a proof that a certain algorithm calculates something, and then you provide a piece of code that, as transparently as possible, implements that algorithm.
This is the best approach I've been able to think of, although maybe there are better ones I haven't thought of. My referees at least seem to have been satisfied with this approach. If you want to see an example of this approach in action, take a look at the proof of Proposition 3.1 in this paper that Hassan Abdallah and I wrote.
Comments:
- (Chess, Sep 12) This is a very useful answer. Thank you!
- (Kimball, Sep 12) One more thing: you can (and should) do sanity checks, such as comparing with existing data/algorithms and known theoretical results.
- (Joshua Grochow, Sep 15) This is nice, but I would also put the date of the code version used into the journal version, along with the version of MAGMA used (or better, use OSS like GAP and specify the version; there is no guarantee that old versions of MAGMA will be available even a year later).
In the experimental sciences it is very common for research to be published that would cost thousands, tens of thousands, or hundreds of thousands of dollars to replicate (in some cases even billions of dollars, as with the LHC and the LIGO gravitational wave detector). No one expects the referees to actually replicate the study before it can be published; this is simply an unrealistic standard to apply. Of course, that means the same "loophole" you are describing, where the data can be faked or tampered with, still exists. But it is what it is, and people understand this is simply the best system we have for uncovering scientific truths. Until the results are replicated by independent groups of researchers, the community can have doubts about their validity.
The situation in math is similar: as long as you describe your methods in sufficient detail (which in the case of computer-assisted proofs includes making your code public), you can feel that you have done your part, and the referees are in a good position to trust your results after making reasonable efforts to verify your work. With software that requires complicated setup or infrastructure, the referees likely won't run it, but that's okay. Even if they could run it, there would still be a possibility of errors creeping in because of honest mistakes: bugs, incorrect proofs, etc. (which are a lot more common than faked results). Such errors will eventually be discovered by other researchers.
Comments:
- (Chess, Sep 12) Interesting point, thank you!
Others have given good answers, but I want to give one striking case study, which may provide some "lessons learned." In their paper, Computing $\pi(x)$: The Meissel-Lehmer Method (Math. Comp. 44 (1985), 537–560), Lagarias, Miller, and Odlyzko gave a brief history of the computation of $\pi(x)$, the number of primes less than or equal to $x$.
- In 1885, Meissel claimed that $\pi(10^9) = 50847478$.
- In 1959, Lehmer showed that Meissel's value of $\pi(10^9)$ was incorrect. Lehmer extended the computation and claimed that $\pi(10^{10}) = 455052512$.
- In 1972, Bohman showed that Lehmer's value of $\pi(10^{10})$ was incorrect. Bohman extended the computation up to $\pi(10^{13})$.
- Lagarias, Miller, and Odlyzko showed that Bohman's value of $\pi(10^{13})$ was incorrect.
So have Lagarias, Miller, and Odlyzko finally "broken the curse"? Here's what they say.
> We checked our computations of $\pi(10^{13})$ in several ways. First, we checked that the program computed values of $\pi(x)$ that agreed with existing tables for values of $x$ smaller than 10ドル^{13}$. Second, the value of $\pi(10^{13})$ was computed several times using different sieving limits, in which case the intermediate terms summing to $\pi(10^{13})$ are different for each different sieving limit. Third, we computed $\pi(10^{13}+10^5)$ and sieved the interval $[10^{13}, 10^{13} + 10^5]$ to locate all primes in it, using this to get a check on the computation for $\pi(10^{13})$. Similarly, we checked all the other values of $\pi(x)$ in the table by also computing $\pi(x + 10^5)$ and sieving the interval $[x, x + 10^5]$. (The other values in Bohman's table agree with ours.)
The takeaway is that you should anticipate various cross-checks on your own HPC computation that a careful referee might want to do if they had access to your HPC resources, and, if possible, perform those cross-checks yourself.
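To illustrate the interval cross-check at toy scale, here is a sketch under assumptions: the sieve and the trial-division counter below are stand-ins for two genuinely different methods (in the actual paper, a Meissel-Lehmer-style computation versus direct sieving of a short interval), and none of this would scale anywhere near 10ドル^{13}$.

```python
# Toy version of the interval cross-check: pi(x + d) - pi(x), read off a sieve,
# must equal a direct count of primes in (x, x + d] by an independent code path.

def prime_sieve(n):
    # Method A: sieve of Eratosthenes, used to evaluate pi() below.
    sieve = bytearray([1]) * (n + 1)
    sieve[0:2] = b"\x00\x00"
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p :: p] = bytearray(len(sieve[p * p :: p]))
    return sieve

def is_prime(m):
    # Method B: plain trial division, deliberately a different algorithm.
    if m < 2:
        return False
    d = 2
    while d * d <= m:
        if m % d == 0:
            return False
        d += 1
    return True

x, delta = 10 ** 6, 10 ** 4
sieve = prime_sieve(x + delta)

def pi(n):
    # pi(n) = number of primes <= n, read off the sieve.
    return sum(sieve[: n + 1])

direct_count = sum(1 for m in range(x + 1, x + delta + 1) if is_prime(m))

# The two independent computations must agree.
assert pi(x + delta) - pi(x) == direct_count
print("interval cross-check passed:", direct_count, "primes in the interval")
```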
Another good example of a cross-check comes from record-breaking attempts to compute digits of $\pi$ (here, of course, $\pi$ is the famous constant, not the prime-counting function!). The computed decimal digits of $\pi$ are converted to hexadecimal and then verified using the BBP spigot algorithm.
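For reference, the BBP formula that makes this hexadecimal spot-check possible is
$$\pi = \sum_{k=0}^{\infty} \frac{1}{16^k}\left(\frac{4}{8k+1}-\frac{2}{8k+4}-\frac{1}{8k+5}-\frac{1}{8k+6}\right),$$
whose special structure allows one to compute the $n$th hexadecimal digit of $\pi$ without computing the earlier digits, so it provides a check that is largely independent of the main base-10 computation.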
Comments:
- (Chess, Sep 13) This is an interesting story that I didn't know at all. Thank you very much for sharing it!
- (JoshuaZ, Sep 13) A related historical note about computing digits of π (the constant): in the 19th century, William Shanks computed it to the then-unheard-of 527 digits, and a few years later extended the computation to 707 digits; but, due to an error right at the start of the extension, the remaining digits are all wrong (except for a few which match by coincidence). This would not be noticed and corrected until Ferguson, 70 years later.
Even correct programs can give wrong results, because computers really do make mistakes sometimes. However, this is a minuscule phenomenon compared to errors caused by humans.
One way to add confidence to a computational result is to replicate it independently. I find it is quite effective if there are two authors and each one writes separate programs.
If both authors code the same algorithm, there is the possibility of comparing many intermediate results, but on the other hand it has been known since the 60s that programmers tend to make the same errors when coding the same algorithm. An alternative is to code two different algorithms, if they are available, as then getting the same wrong answer requires making two different errors with the same effect.
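As a toy illustration of the two-algorithms idea (the derangement numbers below are just a stand-in for whatever quantity a real paper computes), the sketch computes the same sequence by a recurrence and by inclusion-exclusion and checks that they agree:

```python
"""Cross-checking with two genuinely different algorithms: both functions count
derangements. A toy example, not taken from the answer above."""
from math import factorial

def derangements_recurrence(n):
    # Algorithm A: D(n) = (n - 1) * (D(n-1) + D(n-2)), with D(0) = 1, D(1) = 0.
    a, b = 1, 0
    if n == 0:
        return a
    for m in range(2, n + 1):
        a, b = b, (m - 1) * (a + b)
    return b

def derangements_inclusion_exclusion(n):
    # Algorithm B: D(n) = sum_{k=0}^{n} (-1)^k * n! / k!.
    return sum((-1) ** k * factorial(n) // factorial(k) for k in range(n + 1))

# Getting the same wrong answer would require two unrelated bugs with the same effect.
for n in range(200):
    assert derangements_recurrence(n) == derangements_inclusion_exclusion(n)
print("both implementations agree up to n = 199")
```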
AI programs are getting better at writing correct short programs and finding errors in human code. ChatGPT recently impressed me by finding a subtle bug in around a hundred lines of C that I was working on. It won't be long before substantial pieces of code are written by such programs with typical accuracy better than a human programmer's, especially if the task is one that can be described very precisely.
Comments:
- (Timothy Chow, Sep 14) It may also not be long before computer-generated code contains bugs (whether created intentionally or unintentionally) that are more subtle and difficult to find than any human-generated bug.
- (Chess, Sep 14) That is an interesting point regarding AI. ChatGPT is likely more effective at writing short pieces of code or detecting errors in languages such as C++, Java, or Python, which are the most widely used. But in my field (algebraic computations, where I rely on Macaulay2) the syntax, functions, and overall style are quite different. ChatGPT often makes mistakes and is not particularly useful for identifying errors in my code. Perhaps in the future these tools will become more competitive in several areas, which would be a valuable development, as they could save researchers significant time.
- (Gordon Royle, Sep 15) The other problem with "independently coded" algorithms is that it is hard to be genuinely independent for low-level functionality. If two implementations both use, say, GMP, then are they independent?
- (Brendan McKay, Sep 15) @TimothyChow Malware writers are skilled at creating bugs designed to be hard for humans to detect (though they are not bugs in the eyes of the malware writers). AI will become both better than humans at detecting such bugs, and also better at producing them.
- (Timothy Chow, Sep 15) @GordonRoyle If the two implementations call GMP for different calculations then they are arguably nearly independent. If there is a bug in GMP, then chances are that it affects only a small fraction of calculations. If so, then the chances that two totally different sequences of computations give the same wrong answer are very small.
My answer isn’t so practical for the working mathematician, but it is very interesting!
Verifiable delay functions (VDFs; https://eprint.iacr.org/2018/627.pdf) can be used to generate computer-checkable proof certificates much smaller than the original proof.
The Great Internet Mersenne Prime Search (GIMPS) recently moved over to VDF-based certificates and doubled its throughput, since it no longer has to double-check all answers. Verifying the certificate is about 200x faster than generating the proof. See the posts on https://www.mersenne.org/.
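For a flavor of how such a certificate can be checked, here is a minimal sketch of a Wesolowski-style proof of exponentiation with toy parameters. This is not GIMPS's actual certificate format (their PRP proofs are based on a related, Pietrzak-style construction), and a real instantiation would use a modulus of unknown factorization and a hashed challenge; the numbers and names below are purely illustrative.

```python
"""Toy Wesolowski-style proof of exponentiation: verify y = x^(2^T) mod N
without redoing the T sequential squarings."""
import random

def is_prime(m):
    # Naive primality test; fine at this toy scale.
    if m < 2:
        return False
    d = 2
    while d * d <= m:
        if m % d == 0:
            return False
        d += 1
    return True

def next_prime(n):
    while not is_prime(n):
        n += 1
    return n

# Toy modulus; a real VDF needs a group of unknown order (e.g. an RSA modulus).
N = 1000003 * 1000033
T = 1 << 12            # number of sequential squarings (the "slow" part)
x = 5

# Prover: y = x^(2^T) mod N by T sequential squarings.
y = x
for _ in range(T):
    y = y * y % N

# Challenge prime ell (random here; the non-interactive version hashes (x, y)).
ell = next_prime(random.randrange(2 ** 20, 2 ** 21))

# Prover: proof = x^q mod N, where 2^T = q * ell + r.
q, r = divmod(1 << T, ell)
proof = pow(x, q, N)

# Verifier: check proof^ell * x^r == y (mod N); the verifier can obtain r
# cheaply as pow(2, T, ell), so this costs only a few small exponentiations
# instead of T squarings.
assert pow(proof, ell, N) * pow(x, r, N) % N == y
print("certificate verified")
```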