Why does git use hashes instead of revision numbers?

Question 1

I always wondered why git prefers hashes over revision numbers. Revision numbers are much clearer and easier to refer to (in my opinion): there is a difference between telling someone to take a look at revision 1200 and telling them to look at commit 92ba93e (just to give one example).

So, is there any reason for this design?

Question 2

You can tag a commit with "v1.0" and then refer to the commit by that tag. See git-scm.com/book/en/v2/Git-Basics-Tagging

Question 14

should be correct answer

Question 17

@mustang2009cobra That's true.

Josh Kelley · score 120 · Answer 1 · 2013-07-19 14:14:30Z

A single, monotonically increasing revision number only really makes sense for a centralized version control system, where all revisions flow to a single place that can track and assign numbers. Once you get into the DVCS world, where numerous copies of the repository exist and changes are being pulled from and pushed to them in arbitrary workflows, the concept just doesn't apply. (For example, there's no one place to assign revision numbers - if I fork your repository and you decide a year later to pull my changes, how could a system ensure that our revision numbers don't conflict?)
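To make the fork example concrete, here is a minimal Python sketch. The `Clone` class and `content_id` helper are hypothetical toys, not anything Git or another VCS actually provides. Two clones that number commits locally hand out the same number for different changes, while an identifier derived from the content needs no coordination.

```python
import hashlib


class Clone:
    """Toy repository clone that numbers its commits locally (hypothetical model)."""

    def __init__(self, history=None):
        self.history = list(history or [])

    def commit(self, change):
        self.history.append(change)
        return len(self.history)  # the local "revision number"

    def fork(self):
        return Clone(self.history)


def content_id(change):
    # Identifier derived from the content alone; no central counter involved.
    return hashlib.sha1(change.encode("utf-8")).hexdigest()


upstream = Clone()
upstream.commit("initial import")
mine = upstream.fork()

rev_upstream = upstream.commit("fix typo in README")
rev_mine = mine.commit("add logging")

collided = rev_upstream == rev_mine  # both clones handed out the same number
ids_differ = content_id("fix typo in README") != content_id("add logging")
print(rev_upstream, rev_mine, collided, ids_differ)  # 2 2 True True
```

Neither clone did anything wrong here; there is simply nobody in a position to decide which change gets to keep the number 2.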


You might want to look at the Bazaar way -- a DVCS that still maintains revision numbers. The only guarantee there is that revision numbers are unique within a branch.

– krlmlr, Jul 19, 2013 at 19:58

@krlmlr Person 1: "Hey, <P2>, what was revision 12345 for?" P2: "Revision 12345 was committed by <P3>." P3: "I don't have a revision 12345..." - If I remember correctly, Mercurial has a similar problem. On the other hand, if they were using git, they'd all have identical references for each commit.

– Izkata, Jul 20, 2013 at 2:27

@Izkata: P1: "Do you have the revision with the GUID gdlmsnblngoijlafd-35345-fg?" ... Bazaar still has GUIDs...

– krlmlr, Jul 20, 2013 at 10:17

@Izkata Mercurial does not have a similar problem. They use hashes, just like git. They also provide a local-only rev number for ease of typing.

– Hank Gay, Jul 25, 2013 at 16:03

With git, the first 5 characters of the hash are often unique enough to use as a shorthand for the full revision ID.

– mendota, Aug 5, 2016 at 0:12
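The idea of a short unique prefix can be sketched in a few lines of Python. This is only an illustration: `shortest_unique_prefix` is a hypothetical helper, the IDs are SHA-1 digests of made-up strings rather than real commits, and the `minimum` length is just a parameter of the sketch.

```python
import hashlib


def shortest_unique_prefix(target, all_ids, minimum=4):
    """Return the shortest prefix of target (at least `minimum` characters)
    that no other ID in all_ids starts with."""
    others = [i for i in all_ids if i != target]
    for length in range(minimum, len(target) + 1):
        prefix = target[:length]
        if not any(other.startswith(prefix) for other in others):
            return prefix
    return target


# Made-up object IDs: SHA-1 digests of arbitrary strings, not real commits.
ids = [hashlib.sha1(f"change {n}".encode()).hexdigest() for n in range(1000)]
prefix = shortest_unique_prefix(ids[0], ids)
print(len(prefix), prefix)
```

The more objects a repository holds, the longer the prefix generally has to be before it stops being ambiguous, which is why "often unique enough" is the right way to put it.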


Wyatt Barnett · Answer 2 · 2013-07-19 14:15:57Z

You need hashes in a distributed system. Let's say you and a colleague are both working on the same repository and you both commit a change locally and then push it. Who gets to be revision number 1200 and who is revision number 1201 given neither party has any knowledge about each other? The only realistic technical solution is to create a hash of the changes using a known method and link things up based on that.

Interestingly, Mercurial (hg) does support revision numbers, but they are explicitly a local-only feature -- your repository has one set, your co-worker's repo will have a different set depending on how they pushed and pulled. It does make command line usage a bit more friendly than Git though.
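As a rough illustration of "a hash using a known method", the Python sketch below computes the ID Git gives to a blob (file content) under its default SHA-1 object format: the digest of a `blob <size>\0` header followed by the bytes. The function name `git_blob_id` and the sample contents are made up for this example. Note that Git hashes the stored objects themselves rather than a diff of the changes.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    # SHA-1 over a "blob <size>" header, a NUL byte, then the raw content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Two people hashing the same bytes get the same ID without coordinating.
alice = git_blob_id(b"print('hello')\n")
bob = git_blob_id(b"print('hello')\n")
carol = git_blob_id(b"print('goodbye')\n")
print(alice == bob, alice == carol)  # True False
```

Because the ID depends only on the content, nobody has to ask a central server which number comes next.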

krlmlr · Answer 3 · 2013-07-19 20:08:25Z

Data integrity.

I respectfully disagree with the current answers. Hashes are not necessary for a DVCS, see the Bazaar way. You could do as well with any other kind of globally unique identifier. The hashes are a measure to guarantee data integrity: They represent a digest of the information contained in the object (commit, trees, ...) referred to by the hash. Altering the contents without altering the hash (i.e., a preimage attack or collision attack) is believed to be difficult, although not impossible. (If you're really into it, take a look at the 2011 paper by Marc Stevens).

Hence, referring to objects by their SHA hash makes it possible to check whether the contents have been tampered with. And, given that the hashes are (almost) guaranteed to be unique, they can conveniently be used as revision identifiers, too.

See Chapter 9 of the Git book for more details.
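A simplified Python sketch of the integrity argument follows. It is not Git's real commit format (real commits hash a tree ID, author, dates and more); the helpers `commit_id`, `build_history` and `verify` are hypothetical. Each ID covers the parent's ID plus the content, so changing old content without changing its ID is caught when the IDs are recomputed.

```python
import hashlib


def commit_id(parent_id, message, snapshot):
    # Simplified stand-in for a commit hash: covers the parent ID and the content.
    payload = "\n".join([parent_id, message, snapshot]).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


def build_history(entries):
    """Return a list of (id, message, snapshot) tuples chained by parent ID."""
    history, parent = [], ""
    for message, snapshot in entries:
        parent = commit_id(parent, message, snapshot)
        history.append((parent, message, snapshot))
    return history


def verify(history):
    """Recompute every ID from the contents and compare with the recorded one."""
    parent = ""
    for recorded_id, message, snapshot in history:
        if commit_id(parent, message, snapshot) != recorded_id:
            return False
        parent = recorded_id
    return True


history = build_history([("init", "a = 1"), ("bump", "a = 2"), ("fix", "a = 3")])
tampered = list(history)
tampered[1] = (history[1][0], "bump", "a = 999")  # content changed, ID kept
print(verify(history), verify(tampered))  # True False
```

If the tamperer recomputes the ID of the modified entry instead, every later entry no longer matches its recorded parent, so the change still shows up as a different set of IDs.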

It's not a security measure, since the hash can easily be re-calculated for a modified commit. It's only used for integrity, to verify the contents against the calculated hash - see this comment from Linus Torvalds on the use of SHA-1 in Git.
@Lee: If Chuck's repository is different from the one that Alice and Bob have in terms of revision hashes, it is guaranteed that Chuck also has different contents. On the other hand, it's very difficult for Chuck to fabricate a repository with different contents that looks identical w.r.t. their revision hashes.
@Lee: Missed your link. Let's call it "data integrity" then...

score 9 · Answer 4 · 2013-07-19 14:20:03Z

In layman's words:

Hashes are intended to be nearly universally unique. It is NOT guaranteed, but it is extremely unlikely that the same SHAs are generated for different content. In practical terms, for a given project you can treat them as unique.
With revision numbers you would have to use a namespace in order to refer specifically to revision 1200.
Git can work both distributed and/or centralized. So how do you get revision numbers correct and unique?
Also, using revision numbers would create the false expectation that newer revisions should have higher numbers, and that would not be true because of branching, merging, rebasing, etc.
You always have the option to tag commits.
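To put a number on "extremely unlikely", here is a small Python sketch of the birthday-style estimate. It assumes SHA-1 behaves like a uniformly random 160-bit function, so it says nothing about deliberately constructed collisions; `collision_probability_bound` is a name made up for this example.

```python
from fractions import Fraction


def collision_probability_bound(num_objects, hash_bits=160):
    """Union bound over all pairs: the chance that any two of num_objects
    uniformly random hashes coincide is at most pairs / 2**hash_bits."""
    pairs = num_objects * (num_objects - 1) // 2
    return Fraction(pairs, 2 ** hash_bits)


for n in (10**6, 10**9, 10**12):
    print(n, float(collision_probability_bound(n)))
```

Even for a trillion objects the bound stays far below anything one would worry about in practice, which is the sense in which the IDs can be treated as unique.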

Not guaranteed to be unique, just incredibly likely to be unique. :)
It's possible that my change is not accepted because the hash is unchanged. It's much more likely that two meteors strike my computer and the computer with the repository at the same second, destroying the computers and killing everyone involved.

Bengt · Answer 5 · 2013-07-22 20:11:03Z

In mathematical terms:

A total order over Git's commits would be required for monotonically increasing version numbers.
Git's commits form a directed, acyclic graph (DAG) that can only be ordered partially / topologically.
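A tiny Python sketch of the point, using a hypothetical four-commit history in which B and C both branch from A and are merged in D. The DAG admits more than one valid linear order, so there is no single "correct" numbering of B and C.

```python
from itertools import permutations

# Hypothetical history: B and C both branch from A and are merged in D.
parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}


def is_valid_order(order):
    """True if every commit appears after all of its parents."""
    position = {commit: i for i, commit in enumerate(order)}
    return all(position[p] < position[c] for c in parents for p in parents[c])


valid_orders = [o for o in permutations(parents) if is_valid_order(o)]
print(valid_orders)  # [('A', 'B', 'C', 'D'), ('A', 'C', 'B', 'D')]
```

Whether B or C would be "revision 2" depends on which linearization one happens to pick, and two clones can reasonably pick differently.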

Che-Sheng Lin · Answer 6 · 2015-05-07 08:31:24Z

Hashes are not the only solution for a distributed VCS. But when dealing with a distributed system, only a partial ordering of events can be recorded. (For a VCS, an event can be a commit.) That is why maintaining a monotonically increasing revision number is impossible. Usually something like a vector clock (or vector timestamp) is adopted to record such a partial order. This is the solution used in Bazaar.

But why does Git use hashes rather than vector clocks? I think the root cause is cherry-pick. When we cherry-pick in a repository, the partial ordering of commits changes. Some commits' vector clocks must be reassigned to represent the new partial ordering. However, such reassignment in a distributed system would produce inconsistent vector clocks. That is the real problem that hashes deal with.
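For readers unfamiliar with vector clocks, here is a generic Python sketch of how they capture a partial order. It is a textbook-style illustration with made-up replica names and counters, not a description of how Bazaar or any other VCS is implemented.

```python
def compare(clock_a, clock_b):
    """Compare two vector clocks given as dicts mapping replica name -> counter.
    Returns "before", "after", "equal", or "concurrent"."""
    replicas = set(clock_a) | set(clock_b)
    a_le_b = all(clock_a.get(r, 0) <= clock_b.get(r, 0) for r in replicas)
    b_le_a = all(clock_b.get(r, 0) <= clock_a.get(r, 0) for r in replicas)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


base = {"alice": 1}
alice_commit = {"alice": 2}           # Alice commits on top of base
bob_commit = {"alice": 1, "bob": 1}   # Bob commits on top of base, independently
merge = {"alice": 2, "bob": 2}        # Bob merges Alice's work

print(compare(base, alice_commit),        # before
      compare(alice_commit, bob_commit),  # concurrent
      compare(bob_commit, merge))         # before
```

The "concurrent" result is exactly the case a single increasing revision number cannot express: neither commit happened before the other.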
