I always wondered why Git prefers hashes over revision numbers. Revision numbers are much clearer and easier to refer to (in my opinion): there is a difference between telling someone to take a look at revision 1200 and telling them to look at commit 92ba93e! (Just to give one example.)
So, is there any reason for this design?
- You can tag a commit with "v1.0" and then refer to the commit by that tag. See git-scm.com/book/en/v2/Git-Basics-Tagging (Michael Durrant, May 7, 2015 at 10:15)
6 Answers
A single, monotonically increasing revision number only really makes sense for a centralized version control system, where all revisions flow to a single place that can track and assign numbers. Once you get into the DVCS world, where numerous copies of the repository exist and changes are being pulled from and pushed to them in arbitrary workflows, the concept just doesn't apply. (For example, there's no one place to assign revision numbers - if I fork your repository and you decide a year later to pull my changes, how could a system ensure that our revision numbers don't conflict?)
- You might want to look at the Bazaar way: a DVCS that still maintains revision numbers. The only guarantee there is that revision numbers are unique within a branch. (krlmlr, Jul 19, 2013 at 19:58)
- @krlmlr Person 1: "Hey, <P2>, what was revision 12345 for?" P2: "Revision 12345 was committed by <P3>." P3: "I don't have a revision 12345..." If I remember correctly, Mercurial has a similar problem. On the other hand, if they were using git, they'd all have identical references for each commit. (Izkata, Jul 20, 2013 at 2:27)
- @Izkata: P1: "Do you have revision with the GUID gdlmsnblngoijlafd-35345-fg?" ... Bazaar still has GUIDs... (krlmlr, Jul 20, 2013 at 10:17)
- @Izkata Mercurial does not have a similar problem. They use hashes, just like git. They also provide a local-only rev number for ease of typing. (Hank Gay, Jul 25, 2013 at 16:03)
- With git, the first 5 characters of the hash are often unique enough to use as a shorthand for the full revision ID. (mendota, Aug 5, 2016 at 0:12)
You need hashes in a distributed system. Let's say you and a colleague are both working on the same repository, and you both commit a change locally and then push it. Who gets to be revision number 1200 and who gets 1201, given that neither party knows anything about the other? The only realistic technical solution is to create a hash of the changes using a known method and link things up based on that.
Interestingly, Mercurial (hg) does support revision numbers, but they are explicitly a local-only feature: your repository has one set, and your co-worker's repository will have a different set depending on how they pushed and pulled. It does make command-line usage a bit friendlier than Git's, though.
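To make the "hash using a known method" point concrete, here is a minimal Python sketch of how an identifier can be derived from content alone, with no central authority handing out numbers. It follows the blob format of a SHA-1 Git repository (a `blob <size>` header, a NUL byte, then the content); the function name `git_blob_id` and the sample contents are made up for illustration. Commits are hashed over their own content in a similar way.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID for a blob with this content.

    The ID depends only on the bytes being stored, so anyone can
    compute it without coordinating with anyone else.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Two developers hash the same content on different machines.
alice_id = git_blob_id(b"hello\n")
bob_id = git_blob_id(b"hello\n")
# A third developer stores different content.
carol_id = git_blob_id(b"hello, world\n")

print(alice_id == bob_id)    # True: same content, same ID
print(alice_id == carol_id)  # False: different content, different ID
```

Nobody had to ask "who gets 1200?": both parties arrive at the same name for the same content, and at different names for different content.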
Data integrity.
I respectfully disagree with the current answers. Hashes are not necessary for a DVCS; see the Bazaar way. You could do just as well with any other kind of globally unique identifier. The hashes are a measure to guarantee data integrity: they represent a digest of the information contained in the object (commits, trees, ...) referred to by the hash. Altering the contents without altering the hash (i.e., a preimage attack or collision attack) is believed to be difficult, although not impossible. (If you're really into it, take a look at the 2011 paper by Marc Stevens.)
Hence, referring to objects by their SHA hash makes it possible to check whether the contents have been tampered with. And, given that they're (almost) guaranteed to be unique, they can be used as revision identifiers, too -- conveniently so.
See Chapter 9 of the Git book for more details.
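As a rough illustration of the integrity argument, here is a toy Python model in which each commit ID is a digest over the parent's ID plus the commit's own content. This is a simplification and not Git's real commit format; `commit_id`, `build_history`, and the sample messages are hypothetical names chosen for the sketch.

```python
import hashlib


def commit_id(parent_id: str, content: str) -> str:
    """Toy commit ID: a digest over the parent ID and this commit's content."""
    data = f"parent {parent_id}\n{content}".encode("utf-8")
    return hashlib.sha1(data).hexdigest()


def build_history(contents):
    """Return the list of toy commit IDs for a linear history."""
    ids, parent = [], ""
    for content in contents:
        parent = commit_id(parent, content)
        ids.append(parent)
    return ids


original = build_history(["add README", "fix typo", "release"])
tampered = build_history(["add README", "fix typo (and a backdoor)", "release"])

print(original[0] == tampered[0])  # True: history before the change is unaffected
print(original[1] == tampered[1])  # False: the altered commit gets a new ID
print(original[2] == tampered[2])  # False: so does every descendant
```

Because each ID covers its parent's ID, agreeing on the latest commit ID is enough to detect a change anywhere in the history behind it.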
- It's not a security measure, since the hash can easily be re-calculated for a modified commit. It's only used for integrity, to verify the contents against the calculated hash - see this comment from Linus Torvalds on the use of SHA-1 in Git. (Lee, Jul 19, 2013 at 20:40)
- @Lee: If Chuck's repository is different from the one that Alice and Bob have in terms of revision hashes, it is guaranteed that Chuck also has different contents. On the other hand, it's very difficult for Chuck to fabricate a repository with different contents that looks identical w.r.t. their revision hashes. (krlmlr, Jul 19, 2013 at 20:59)
- @Lee: Missed your link. Let's call it "data integrity" then... (krlmlr, Jul 19, 2013 at 21:09)
- Should be the correct answer. (SuperUberDuper, Jun 22, 2016 at 10:08)
In layman's terms:
- Hashes are intended to be nearly universally unique. It is NOT guaranteed, but it is extremely unlikely that the same SHA is generated for different content. In practical terms, for a given project you can treat them as unique.
- With revision numbers you would have to use a namespace in order to refer specifically to revision 1200.
- Git can work in a distributed and/or centralized fashion. So how do you keep revision numbers correct and unique?
- Also, using revision numbers would create the false expectation that newer revisions should have higher numbers, and that would not be true because of branching, merging, rebasing, etc.
- You always have the option of tagging commits.
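To put a rough number on "extremely unlikely", here is a back-of-the-envelope Python sketch using the standard birthday-bound approximation. It assumes a 160-bit hash (as with SHA-1) whose outputs behave like uniformly random values, so it only covers accidental collisions and says nothing about deliberately constructed ones. The function name and the object count are made up for the example.

```python
HASH_BITS = 160  # assumption: SHA-1 sized object IDs


def collision_probability(num_objects: int, bits: int = HASH_BITS) -> float:
    """Approximate chance that any two of `num_objects` random hashes collide.

    Uses the birthday bound: (number of pairs) / (number of possible hashes).
    """
    pairs = num_objects * (num_objects - 1) // 2
    return pairs / 2**bits


# A hypothetical repository with a billion objects.
p = collision_probability(10**9)
print(f"{p:.1e}")  # on the order of 1e-31
```

Even at a billion objects the estimate stays around 30 orders of magnitude below 1, which is why treating the hash as unique within a project is safe in practice.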
- Not guaranteed to be unique, just incredibly likely to be unique. :) (dsw88, Jul 19, 2013 at 14:30)
- @mustang2009cobra That's true. (Tulains Córdova, Jul 19, 2013 at 14:35)
- It's possible that my change is not accepted because the hash is unchanged. It's much more likely that two meteors strike my computer and the computer with the repository at the same second, destroying the computers and killing everyone involved. (gnasher729, May 7, 2015 at 8:55)
In mathematical terms:
- A total order over Git's commits would be required for monotonically increasing version numbers.
- Git's commits form a directed, acyclic graph (DAG) that can only be ordered partially / topologically.
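The two points above can be sketched in Python with a tiny made-up history: commit A, two branches B and C that both start from A, and a merge D. The function below enumerates every ordering in which parents come before children; the commit names and the `topological_orders` helper are illustrative only.

```python
def topological_orders(parents):
    """Yield every ordering of commits in which parents precede children.

    `parents` maps each commit name to the set of its parent commits.
    """
    def extend(order, remaining):
        if not remaining:
            yield list(order)
            return
        for commit in sorted(remaining):
            # A commit can come next only once all its parents are placed.
            if parents[commit] <= set(order):
                yield from extend(order + [commit], remaining - {commit})

    yield from extend([], set(parents))


history = {
    "A": set(),
    "B": {"A"},       # branch 1
    "C": {"A"},       # branch 2
    "D": {"B", "C"},  # merge of both branches
}

orders = list(topological_orders(history))
for order in orders:
    print(" -> ".join(order))
# A -> B -> C -> D
# A -> C -> B -> D
```

Both orderings respect the graph, so nothing in the history itself says whether B or C should be "revision 2". Any numbering has to pick one arbitrarily, and two repositories can pick differently.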
Hashes are not the only solution for a distributed VCS. But when dealing with a distributed system, only a partial ordering of events can be recorded. (For a VCS, the event can be a commit.) That is why maintaining a monotonically increasing revision number is impossible. Usually we adopt something like a vector clock (or vector timestamp) to record such a partially ordered relation. This is the solution used in Bazaar.
But why does Git use hashes rather than vector clocks? I think the root cause is cherry-pick. When we perform a cherry-pick on a repository, the partial ordering of commits changes. Some commits' vector clocks must be reassigned to represent the new partial ordering. However, such reassignment in a distributed system would produce inconsistent vector clocks. That is the real problem that hashes deal with.
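For readers unfamiliar with vector clocks, here is a minimal Python sketch of the general technique, showing how they capture a partial order. It is a generic illustration and not a description of how any particular VCS stores revisions; the `compare` function, the replica names, and the counters are all made up.

```python
def compare(clock_a, clock_b):
    """Compare two vector clocks (dicts mapping replica name to counter).

    Returns "before", "after", "equal", or "concurrent".
    """
    replicas = set(clock_a) | set(clock_b)
    a_le_b = all(clock_a.get(r, 0) <= clock_b.get(r, 0) for r in replicas)
    b_le_a = all(clock_b.get(r, 0) <= clock_a.get(r, 0) for r in replicas)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


base = {"alice": 1}
alice_commit = {"alice": 2}          # Alice commits on top of base
bob_commit = {"alice": 1, "bob": 1}  # Bob commits on top of base, independently
merge = {"alice": 3, "bob": 1}       # Alice merges Bob's work

print(compare(base, alice_commit))        # before
print(compare(alice_commit, bob_commit))  # concurrent
print(compare(merge, bob_commit))         # after
```

The "concurrent" result is the partial order showing through: neither commit happened before the other, so no single increasing number can describe both.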