I am a beginner with `make` and I'm wondering when to use `make clean`.
One colleague told me that incremental builds with `make` are based on file timestamps: if you check out an old version of a file in your VCS, it will have an "old" timestamp and be marked as "no need to recompile this file", so that file wouldn't be included in the next build. According to that same colleague, this would be a reason to use `make clean`.
Anyway, I roughly got the answer to "when to use `make clean`" from other StackExchange questions, but my other question is:
Why do incremental builds using `make` rely on file timestamps rather than, for example, SHA-1 hashes? Git, for instance, shows that we can reliably determine whether a file was modified using SHA-1.

Is it for speed reasons?
3 Answers
An obvious (and arguably superficial) problem would be that the build system would have to keep a record of the hashes of the files that were used for the last build. While this problem could certainly be solved, it would require side storage when the time-stamp information is already present in the file system.
More seriously, though, the hash would not convey the same semantics. If you know that file T was built from dependency D with hash H1 and then find out that D now hashes to H2, should you re-build T? Probably yes, but it could also be that H2 actually refers to an older version of the file. Time-stamps define an ordering while hashes are only comparable for equality.
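This distinction can be demonstrated with a couple of shell commands (a minimal sketch; the file names are invented for illustration):

```shell
# Two files with identical contents but different modification times.
set -e
cd "$(mktemp -d)"
echo 'same content' > old.txt
sleep 1
echo 'same content' > new.txt

# Timestamps define an ordering: we can ask which file is newer.
if [ new.txt -nt old.txt ]; then echo 'new.txt is newer'; fi

# Hashes are only comparable for equality: identical contents hash
# identically, so the hash alone cannot say which file is older.
h_old=$(sha1sum < old.txt)
h_new=$(sha1sum < new.txt)
if [ "$h_old" = "$h_new" ]; then echo 'hashes are equal'; fi
```

The `-nt` test answers "is this newer than that?", a question a pair of hashes cannot answer.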
A feature that time-stamps support is that you can simply update the time-stamp (for example, using the POSIX command-line utility `touch`) in order to trick `make` into thinking that a dependency has changed or – more interestingly – that a target is more recent than it actually is. While playing with this is a great opportunity to shoot yourself in the foot, it is useful from time to time. In a hash-based system, you would need support from the build system itself to update its internal database of hashes used for the last build without actually building anything.
While an argument could certainly be made for using hashes over time-stamps, my point is that they are not a better solution to achieve the same goal but a different solution to achieve a different goal. Which of these goals is more desirable might be open to debate.
- While the semantics differ between hashes and time stamps, it's normally irrelevant in this case, as you most likely want a build based on the current files, no matter their age. – axl, May 25, 2016
- Most of what you say is correct. However, a well-implemented build system that uses hashes, like Google's blaze/bazel (blaze is the internal version; the open-source one is bazel), beats the pants off a timestamp-based system like Make. That said, you do have to put a lot of effort into repeatable builds so that it is always safe to use old build artifacts rather than rebuilding. – btilly, May 25, 2016
- The mapping here isn't many-to-one, it's one-to-one. If `D` now hashes to `H2`, and you don't have some output `T2` built from `D@H2`, you need to produce and store it. Thereafter, regardless of what order `D` switches between the `H1` and `H2` states in, you will be able to use cached output. – pxq, Jun 7, 2017
- Bazel, meson, please – they all absolutely suck usability-wise. Their DSLs are 8 to 20 times as verbose as make's. They're also absurdly opinionated and impose crazy restrictions on where things can be located. If you want to adopt any of those for an existing big project, you will probably have to refactor the entire project structure inside out. GNU Make imposes no restrictions. It allows recipes to read from anywhere and write anywhere your user has permissions to, even using absolute paths. It is perfect BUT for lack of hash support. – Szczepan Hołyszewski, Oct 8, 2020
- If your recipe can read from anywhere without sandboxing, then it becomes increasingly likely that a mistake in your dependency encoding will creep in over time. Either you take a dependency that is not really a dependency, leading to unnecessary rebuilds, or you forget a dependency, leading to incorrect builds. It also makes it harder to produce builds that work on any machine, since it is easy to accidentally take a system dependency. These are some common failure modes of Make on large projects. – sdgfsdh, Oct 8, 2020
A few points about hashes vs timestamps in build-systems:
1. When you check out a file, the timestamp should be updated to the current time, which triggers a rebuild. What your colleague describes is not usually a failure mode of timestamp systems.
2. Timestamps are marginally faster than hashes. A timestamp system only has to check the timestamp, whereas a hash system must check the timestamp and then potentially the hash.
3. Make is designed to be lightweight and self-contained. To overcome (2), hash-based systems will usually run a background process for checking hashes (e.g. Facebook's Watchman). This is counter to the design goals (and history) of Make.
4. Hashes prevent unnecessary rebuilds when a timestamp has changed but not the contents. Often, this offsets the cost of computing the hash.
5. Hashes enable artefact caches to be shared across projects and over a network. Again, this more than offsets the cost of computing hashes.
6. Modern hash-based build systems include Bazel (Google) and Buck (Facebook).
7. Most developers should consider using a hash-based system, since they do not have the same requirements as those under which Make was designed.
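The behavioural difference in point 4 can be sketched with `sha1sum`; the manifest file name is invented here, and real systems such as Bazel keep this state in an internal database rather than a file:

```shell
set -e
cd "$(mktemp -d)"
echo 'hello' > input.txt

# "Build" once and record the hash of the input in a manifest.
sha1sum input.txt > .manifest

# touch changes the timestamp but not the contents: a timestamp system
# would rebuild here, while the hash check sees the file is unchanged.
touch input.txt
sha1sum --status -c .manifest && echo 'up to date' || echo 'rebuild'
# prints "up to date"

# Actually changing the contents does invalidate the manifest.
echo 'changed' > input.txt
sha1sum --status -c .manifest && echo 'up to date' || echo 'rebuild'
# prints "rebuild"
```

Note that the manifest is exactly the "side storage" the first answer mentions: state that a timestamp-only system gets from the filesystem for free.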
- Bazel and Buck suck. They are absurdly opinionated and impose crazy restrictions on where things can be located. You try porting your rules one by one and you quickly realize that NOTHING CAN FIND ANYTHING, because there's some kind of sandboxing going on, and in order to pierce this stupid firewall you must write SCREENFULS of extra declarations in the build specs. And people clench their teeth and deal with it, because hashes. THAT is an indication of how useful this capability is. – Szczepan Hołyszewski, Oct 8, 2020
- The sandboxing is actually orthogonal to the hashing. However, both are features that lead to maintainable and predictable build systems in large projects. I personally don't find the sandboxing restrictive, and typically projects follow this convention anyway. Buck and Bazel declarations are very terse, being written in Python. The sandboxing trade-offs are a bit like type-checking: it makes a few things more difficult but gives many more guarantees. – sdgfsdh, Oct 8, 2020
- The user should be IN CONTROL of trade-offs. – Szczepan Hołyszewski, Oct 8, 2020
- If you allow users to easily break the sandbox, then you lose correctness guarantees across the build. This would preclude features that the Bazel team wanted to prioritize, such as correctness, distributed caching, distributed execution, and composability of projects. Sometimes more freedom can actually lead to fewer features. See youtube.com/watch?v=GqmsQeSzMdw for a good talk on this concept. – sdgfsdh, Oct 8, 2020
- @SzczepanHołyszewski Perhaps you should open an issue for the problem you are having with Bazel. It certainly does not make building software impossible, as evidenced by the various companies that are leveraging it successfully. "Constraints are freedom" is just a catchy title; don't read too much into it :) – sdgfsdh, Oct 9, 2020
Hashing an entire project is very slow: you have to read every single byte of every single file. Git doesn't hash every file every time you run a `git status` either. Nor do VCS checkouts normally set a file's modification time to the original authored time; a backup restore would, if you take care to do so. The whole reason filesystems have timestamps is for use cases like these.
A developer typically runs `make clean` when a dependency not directly tracked by the Makefile changes. Ironically, this usually includes the Makefile itself. It usually also includes compiler versions. Depending on how well your Makefile is written, it could include external library versions.
These are the sorts of things that tend to get updated when you do a version-control update, so most developers just get in the habit of running a `make clean` at the same time, so they know they're starting from a clean slate. You can get away without doing it a lot of the time, but it's really difficult to predict when you can't.
- You can use filesystems like ZFS, where the cost of hashing is amortized over the time when the files are being modified, rather than being paid all at once when you build. – pxq, Jun 7, 2017
`make` was created in the 70s. SHA-1 was created in the 90s. Git was created in the 00s. The last thing you want is for some obscure builds that were working for 30 years to suddenly fail because somebody decided to go all modern with a tried and tested system. If you stick with your existing `make`, your software won't break, and `make` makes rather an effort to have backwards compatibility in new versions. Changing core behavior for no good reason is pretty much the opposite of that. And the dates show why it was not originally made to use SHA-1, and why it was not easy to retrofit when SHA-1 became available (`make` was already decades old by then).