I am a beginner with `make` and I'm wondering when to use `make clean`.
One colleague told me that incremental builds with `make` are based on file timestamps: if you check out an old version of a file in your VCS, it will have an "old" timestamp and be marked as "no need to recompile this file", so that file wouldn't be included in the next build. According to that same colleague, this would be a reason to use `make clean`.
Anyway, I roughly got the answer to "when to use `make clean`" from other StackExchange questions, but my other question is:
Why do incremental builds using `make` rely on file timestamps rather than, for example, SHA-1 hashes? Git, for instance, shows that we can reliably determine whether a file was modified using SHA-1.

Is it for speed reasons?
3 Answers
An obvious (and arguably superficial) problem would be that the build system would have to keep a record of the hashes of the files that were used for the last build. While this problem could certainly be solved, it would require side storage when the time-stamp information is already present in the file system.
More seriously, though, the hash would not convey the same semantics. If you know that file T was built from dependency D with hash H1 and then find out that D now hashes to H2, should you re-build T? Probably yes, but it could also be that H2 actually refers to an older version of the file. Time-stamps define an ordering while hashes are only comparable for equality.
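This distinction can be demonstrated with a couple of shell commands (a minimal sketch; the file names are invented for illustration):

```shell
# Two files with identical contents but different modification times.
set -e
cd "$(mktemp -d)"
echo 'same content' > old.txt
sleep 1
echo 'same content' > new.txt

# Timestamps define an ordering: we can ask which file is newer.
if [ new.txt -nt old.txt ]; then echo 'new.txt is newer'; fi

# Hashes are only comparable for equality: identical contents hash
# identically, so the hash alone cannot say which file is older.
h_old=$(sha1sum < old.txt)
h_new=$(sha1sum < new.txt)
if [ "$h_old" = "$h_new" ]; then echo 'hashes are equal'; fi
```

The `-nt` test answers "is this newer than that?", a question a pair of hashes cannot answer.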
A feature that time-stamps support is that you can simply update the time-stamp (for example, using the POSIX command-line utility `touch`) in order to trick `make` into thinking that a dependency has changed or – more interestingly – that a target is more recent than it actually is. While playing with this is a great opportunity to shoot yourself in the foot, it is useful from time to time. In a hash-based system, you would need support from the build system itself to update its internal database of hashes used for the last build without actually building anything.
While an argument could certainly be made for using hashes over time-stamps, my point is that they are not a better solution to achieve the same goal but a different solution to achieve a different goal. Which of these goals is more desirable might be open to debate.
- While the semantics differ between hashes and time stamps, it's normally irrelevant in this case, as you most likely want a build based on the current files, no matter their age. – axl, May 25, 2016
- Most of what you say is correct. However, a well-implemented build system that uses hashes, like Google's blaze/bazel (blaze is the internal version; the open-source one is bazel), beats the pants off a timestamp-based system like Make. That said, you do have to put a lot of effort into repeatable builds so that it is always safe to use old build artifacts rather than rebuilding. – btilly, May 25, 2016
- The mapping here isn't many-to-one, it's one-to-one. If `D` now hashes to `H2`, and you don't have some output `T2` built from `D@H2`, you need to produce and store it. Thereafter, regardless of what order `D` switches between the `H1` and `H2` states in, you will be able to use cached output. – pxq, Jun 7, 2017
- Bazel, meson, please – they all absolutely suck usability-wise. Their DSLs are 8 to 20 times as verbose as make's. They're also absurdly opinionated and impose crazy restrictions on where things can be located. If you want to adopt any of those for an existing big project, you will probably have to refactor the entire project structure inside out. GNU Make imposes no restrictions. It allows recipes to read from anywhere and write anywhere your user has permissions to, even using absolute paths. It is perfect BUT for lack of hash support. – Szczepan Hołyszewski, Oct 8, 2020
- If your recipe can read from anywhere without sandboxing, then it becomes increasingly likely that a mistake in your dependency encoding will creep in over time. Either you take a dependency that is not really a dependency, leading to unnecessary rebuilds, or you forget a dependency, leading to incorrect builds. It also makes it harder to produce builds that work on any machine, since it is easy to accidentally take a system dependency. These are some common failure modes of Make on large projects. – sdgfsdh, Oct 8, 2020
A few points about hashes vs timestamps in build-systems:
1. When you check out a file, the timestamp should be updated to the current time, which triggers a rebuild. What your colleague describes is not usually a failure mode of timestamp systems.
2. Timestamps are marginally faster than hashes. A timestamp system only has to check the timestamp, whereas a hash system must check the timestamp and then potentially the hash.
3. Make is designed to be lightweight and self-contained. To overcome (2), hash-based systems will usually run a background process for checking hashes (e.g. Facebook's Watchman). This is counter to the design goals (and history) of Make.
4. Hashes prevent unnecessary rebuilds when a timestamp has changed but not the contents. Often, this offsets the cost of computing the hash.
5. Hashes enable artefact caches to be shared across projects and over a network. Again, this more than offsets the cost of computing hashes.
6. Modern hash-based build systems include Bazel (Google) and Buck (Facebook).
7. Most developers should consider using a hash-based system, since they do not have the same requirements as those under which Make was designed.
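The behavioural difference in point 4 can be sketched with `sha1sum`; the manifest file name is invented here, and real systems such as Bazel keep this state in an internal database rather than a file:

```shell
set -e
cd "$(mktemp -d)"
echo 'hello' > input.txt

# "Build" once and record the hash of the input in a manifest.
sha1sum input.txt > .manifest

# touch changes the timestamp but not the contents: a timestamp system
# would rebuild here, while the hash check sees the file is unchanged.
touch input.txt
sha1sum --status -c .manifest && echo 'up to date' || echo 'rebuild'
# prints "up to date"

# Actually changing the contents does invalidate the manifest.
echo 'changed' > input.txt
sha1sum --status -c .manifest && echo 'up to date' || echo 'rebuild'
# prints "rebuild"
```

Note that the manifest is exactly the "side storage" the first answer mentions: state that a timestamp-only system gets from the filesystem for free.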
- Bazel and Buck suck. They are absurdly opinionated and impose crazy restrictions on where things can be located. You try porting your rules one by one and you quickly realize that NOTHING CAN FIND ANYTHING, because there's some kind of sandboxing going on, and in order to pierce this stupid firewall you must write SCREENFULS of extra declarations in the build specs. And people clench their teeth and deal with it, because hashes. THAT is an indication of how useful this capability is. – Szczepan Hołyszewski, Oct 8, 2020
- The sandboxing is actually orthogonal to the hashing. However, both are features that lead to maintainable and predictable build systems in large projects. I personally don't find the sandboxing restrictive, and typically projects follow this convention anyway. Buck and Bazel declarations are very terse, being written in Python. The sandboxing trade-offs are a bit like type-checking: it makes a few things more difficult but gives many more guarantees. – sdgfsdh, Oct 8, 2020
- The user should be IN CONTROL of trade-offs. – Szczepan Hołyszewski, Oct 8, 2020
- If you allow users to easily break the sandbox, then you lose correctness guarantees across the build. This would preclude features that the Bazel team wanted to prioritize, such as correctness, distributed caching, distributed execution, and composability of projects. Sometimes more freedom can actually lead to fewer features. See youtube.com/watch?v=GqmsQeSzMdw for a good talk on this concept. – sdgfsdh, Oct 8, 2020
- @SzczepanHołyszewski Perhaps you should open an issue for the problem you are having with Bazel. It certainly does not make building software impossible, as evidenced by the various companies that are leveraging it successfully. "Constraints are freedom" is just a catchy title; don't read too much into it :) – sdgfsdh, Oct 9, 2020
Hashing an entire project is very slow: you have to read every single byte of every single file. Git doesn't hash every file every time you run a `git status` either. Nor do VCS checkouts normally set a file's modification time to the original authored time; a backup restore would, if you take care to do so. The whole reason filesystems have timestamps is for use cases like these.
A developer typically runs `make clean` when a dependency not directly tracked by the Makefile changes. Ironically, this usually includes the Makefile itself. It usually also includes compiler versions. Depending on how well your Makefile is written, it could include external library versions.
These are the sorts of things that tend to get updated when you do a version-control update, so most developers just get in the habit of running a `make clean` at the same time, so they know they're starting from a clean slate. You can get away without doing it a lot of the time, but it's really difficult to predict when you can't.
- You can use filesystems like ZFS, where the cost of hashing is amortized over the time when the files are being modified, rather than being paid all at once when you build. – pxq, Jun 7, 2017
`make` was created in the 70s. SHA-1 was created in the 90s. Git was created in the 00s. The last thing you want is for some obscure builds that were working for 30 years to suddenly fail because somebody decided to go all modern with a tried and tested system. If you stick with your existing `make`, your software won't break, and `make` makes rather an effort to have backwards compatibility in new versions. Changing core behavior for no good reason is pretty much the opposite of that. And the dates show why it was not originally made to use SHA-1, and why it was not easy to retrofit when SHA-1 became available (`make` was already decades old by then).