This document describes what a transition to a new hash function for
Git would look like. Add it to Documentation/technical/ as the plan of
record so that future changes can be recorded as patches.

Also-by: Brandon Williams <bmwill@google.com>
Also-by: Jonathan Tan <jonathantanmy@google.com>
Also-by: Stefan Beller <sbeller@google.com>
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>


2 changed files with 798 additions and 0 deletions
@@ -0,0 +1,797 @@
|
||||
Git hash function transition |
||||
============================ |
||||
|
||||
Objective |
||||
--------- |
||||
Migrate Git from SHA-1 to a stronger hash function. |
||||
|
||||
Background |
||||
---------- |
||||
At its core, the Git version control system is a content addressable |
||||
filesystem. It uses the SHA-1 hash function to name content. For |
||||
example, files, directories, and revisions are referred to by hash |
||||
values unlike in other traditional version control systems where files |
||||
or versions are referred to via sequential numbers. The use of a hash |
||||
function to address its content delivers a few advantages: |
||||
|
||||
* Integrity checking is easy. Bit flips, for example, are easily |
||||
detected, as the hash of corrupted content does not match its name. |
||||
* Lookup of objects is fast. |
||||
|
||||
Using a cryptographically secure hash function brings additional |
||||
advantages: |
||||
|
||||
* Object names can be signed and third parties can trust the hash to |
||||
address the signed object and all objects it references. |
||||
* Communication using the Git protocol and out-of-band communication
  methods have a short string that can be used to reliably address
  stored content.
||||
|
||||
Over time some flaws in SHA-1 have been discovered by security |
||||
researchers. https://shattered.io demonstrated a practical SHA-1 hash |
||||
collision. As a result, SHA-1 cannot be considered cryptographically |
||||
secure any more. This impacts the communication of hash values because |
||||
we cannot trust that a given hash value represents the known good |
||||
version of content that the speaker intended. |
||||
|
||||
SHA-1 still possesses the other properties such as fast object lookup
and safe error checking, but other hash functions that are believed to
be cryptographically secure are equally suitable.
||||
|
||||
Goals |
||||
----- |
||||
Where NewHash is a strong 256-bit hash function to replace SHA-1 (see |
||||
"Selection of a New Hash", below): |
||||
|
||||
1. The transition to NewHash can be done one local repository at a time. |
||||
a. Requiring no action by any other party. |
||||
b. A NewHash repository can communicate with SHA-1 Git servers |
||||
(push/fetch). |
||||
c. Users can use SHA-1 and NewHash identifiers for objects |
||||
interchangeably (see "Object names on the command line", below). |
||||
d. New signed objects make use of a stronger hash function than |
||||
SHA-1 for their security guarantees. |
||||
2. Allow a complete transition away from SHA-1. |
||||
a. Local metadata for SHA-1 compatibility can be removed from a |
||||
repository if compatibility with SHA-1 is no longer needed. |
||||
3. Maintainability throughout the process. |
||||
a. The object format is kept simple and consistent. |
||||
b. Creation of a generalized repository conversion tool. |
||||
|
||||
Non-Goals |
||||
--------- |
||||
1. Add NewHash support to Git protocol. This is valuable and the |
||||
logical next step but it is out of scope for this initial design. |
||||
2. Transparently improving the security of existing SHA-1 signed |
||||
objects. |
||||
3. Intermixing objects using multiple hash functions in a single |
||||
repository. |
||||
4. Taking the opportunity to fix other bugs in Git's formats and |
||||
protocols. |
||||
5. Shallow clones and fetches into a NewHash repository. (This will |
||||
change when we add NewHash support to Git protocol.) |
||||
6. Skip fetching some submodules of a project into a NewHash |
||||
repository. (This also depends on NewHash support in Git |
||||
protocol.) |
||||
|
||||
Overview |
||||
-------- |
||||
We introduce a new repository format extension. Repositories with this |
||||
extension enabled use NewHash instead of SHA-1 to name their objects. |
||||
This affects both object names and object content --- both the names |
||||
of objects and all references to other objects within an object are |
||||
switched to the new hash function. |
||||
|
||||
NewHash repositories cannot be read by older versions of Git. |
||||
|
||||
Alongside the packfile, a NewHash repository stores a bidirectional |
||||
mapping between NewHash and SHA-1 object names. The mapping is generated |
||||
locally and can be verified using "git fsck". Object lookups use this
mapping to allow naming objects using their SHA-1 and NewHash names
interchangeably.
||||
|
||||
"git cat-file" and "git hash-object" gain options to display an object |
||||
in its sha1 form and write an object given its sha1 form. This |
||||
requires all objects referenced by that object to be present in the |
||||
object database so that they can be named using the appropriate name |
||||
(using the bidirectional hash mapping). |
||||
|
||||
Fetches from a SHA-1 based server convert the fetched objects into |
||||
NewHash form and record the mapping in the bidirectional mapping table |
||||
(see below for details). Pushes to a SHA-1 based server convert the |
||||
objects being pushed into sha1 form so the server does not have to be |
||||
aware of the hash function the client is using. |
||||
|
||||
Detailed Design |
||||
--------------- |
||||
Repository format extension |
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
||||
A NewHash repository uses repository format version `1` (see |
||||
Documentation/technical/repository-version.txt) with extensions |
||||
`objectFormat` and `compatObjectFormat`: |
||||
|
||||
[core] |
||||
repositoryFormatVersion = 1 |
||||
[extensions] |
||||
objectFormat = newhash |
||||
compatObjectFormat = sha1 |
||||
|
||||
Specifying a repository format extension ensures that versions of Git |
||||
not aware of NewHash do not try to operate on these repositories, |
||||
instead producing an error message: |
||||
|
||||
$ git status |
||||
fatal: unknown repository extensions found: |
||||
objectformat |
||||
compatobjectformat |
||||
|
||||
See the "Transition plan" section below for more details on these |
||||
repository extensions. |
||||
|
||||
Object names |
||||
~~~~~~~~~~~~ |
||||
Objects can be named by their 40 hexadecimal digit sha1-name or 64 |
||||
hexadecimal digit newhash-name, plus names derived from those (see |
||||
gitrevisions(7)). |
||||
|
||||
The sha1-name of an object is the SHA-1 of the concatenation of its |
||||
type, length, a nul byte, and the object's sha1-content. This is the |
||||
traditional <sha1> used in Git to name objects. |
||||
|
||||
The newhash-name of an object is the NewHash of the concatenation of its |
||||
type, length, a nul byte, and the object's newhash-content. |
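The naming rule above can be sketched in a few lines. This is only an illustration: SHA-256 stands in for the not-yet-selected NewHash, and `object_name` is a hypothetical helper, not a Git API. Git's header is the type, a space, the decimal length, and a NUL byte.

```python
import hashlib

def object_name(obj_type: str, content: bytes, algo: str = "sha1") -> str:
    """Hash "<type> <length>\\0<content>" and return the hex object name."""
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\x00"
    return hashlib.new(algo, header + content).hexdigest()

# A blob refers to no other object, so the same content is hashed both ways.
sha1_name = object_name("blob", b"hello\n", "sha1")
# Assumption: SHA-256 is used here only as a placeholder for NewHash.
newhash_name = object_name("blob", b"hello\n", "sha256")

print(len(sha1_name), len(newhash_name))
```

The two names differ in length (40 versus 64 hexadecimal digits), which is what lets later sections tell them apart.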
||||
|
||||
Object format |
||||
~~~~~~~~~~~~~ |
||||
The content as a byte sequence of a tag, commit, or tree object named |
||||
by sha1 and newhash differ because an object named by newhash-name refers to |
||||
other objects by their newhash-names and an object named by sha1-name |
||||
refers to other objects by their sha1-names. |
||||
|
||||
The newhash-content of an object is the same as its sha1-content, except |
||||
that objects referenced by the object are named using their newhash-names |
||||
instead of sha1-names. Because a blob object does not refer to any |
||||
other object, its sha1-content and newhash-content are the same. |
||||
|
||||
The format allows round-trip conversion between newhash-content and |
||||
sha1-content. |
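As a rough sketch of the round trip for commit objects, the helper below rewrites only the "tree" and "parent" header lines using a name mapping. The function name and the mapping argument are hypothetical, and the sketch deliberately ignores mergetag and signature fields as well as tree and tag objects.

```python
import re

# Commit header fields whose value is a single hex object name.
_REF_LINE = re.compile(rb"^(tree|parent) ([0-9a-f]+)$")

def convert_commit(content: bytes, name_map: dict) -> bytes:
    """Rewrite tree/parent names in a commit header using name_map."""
    header, sep, body = content.partition(b"\n\n")
    out = []
    for line in header.split(b"\n"):
        m = _REF_LINE.match(line)
        if m:
            # Swap the referenced name; every other byte is left untouched.
            line = m.group(1) + b" " + name_map[m.group(2)]
        out.append(line)
    return b"\n".join(out) + sep + body
```

Applying the function with the sha1-to-newhash mapping and then with its inverse should return the original bytes, which is the round-trip property this section relies on.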
||||
|
||||
Object storage |
||||
~~~~~~~~~~~~~~ |
||||
Loose objects use zlib compression and packed objects use the packed |
||||
format described in Documentation/technical/pack-format.txt, just like |
||||
today. The content that is compressed and stored uses newhash-content |
||||
instead of sha1-content. |
||||
|
||||
Pack index |
||||
~~~~~~~~~~ |
||||
Pack index (.idx) files use a new v3 format that supports multiple |
||||
hash functions. They have the following format (all integers are in |
||||
network byte order): |
||||
|
||||
- A header appears at the beginning and consists of the following: |
||||
- The 4-byte pack index signature: '\377tOc'
||||
- 4-byte version number: 3 |
||||
- 4-byte length of the header section, including the signature and |
||||
version number |
||||
- 4-byte number of objects contained in the pack |
||||
- 4-byte number of object formats in this pack index: 2 |
||||
- For each object format: |
||||
- 4-byte format identifier (e.g., 'sha1' for SHA-1) |
||||
- 4-byte length in bytes of shortened object names. This is the |
||||
shortest possible length needed to make names in the shortened |
||||
object name table unambiguous. |
||||
- 4-byte integer, recording where tables relating to this format |
||||
are stored in this index file, as an offset from the beginning. |
||||
- 4-byte offset to the trailer from the beginning of this file. |
||||
- Zero or more additional key/value pairs (4-byte key, 4-byte |
||||
value). Only one key is supported: 'PSRC'. See the "Loose objects |
||||
and unreachable objects" section for supported values and how this |
||||
is used. All other keys are reserved. Readers must ignore |
||||
unrecognized keys. |
||||
- Zero or more NUL bytes. This can optionally be used to improve the |
||||
alignment of the full object name table below. |
||||
- Tables for the first object format: |
||||
- A sorted table of shortened object names. These are prefixes of |
||||
the names of all objects in this pack file, packed together |
||||
without offset values to reduce the cache footprint of the binary |
||||
search for a specific object name. |
||||
|
||||
- A table of full object names in pack order. This allows resolving |
||||
a reference to "the nth object in the pack file" (from a |
||||
reachability bitmap or from the next table of another object |
||||
format) to its object name. |
||||
|
||||
- A table of 4-byte values mapping object name order to pack order. |
||||
For an object in the table of sorted shortened object names, the |
||||
value at the corresponding index in this table is the index in the |
||||
previous table for that same object. |
||||
|
||||
This can be used to look up the object in reachability bitmaps or |
||||
to look up its name in another object format. |
||||
|
||||
- A table of 4-byte CRC32 values of the packed object data, in the |
||||
order that the objects appear in the pack file. This is to allow |
||||
compressed data to be copied directly from pack to pack during |
||||
repacking without undetected data corruption. |
||||
|
||||
- A table of 4-byte offset values. For an object in the table of |
||||
sorted shortened object names, the value at the corresponding |
||||
index in this table indicates where that object can be found in |
||||
the pack file. These are usually 31-bit pack file offsets, but |
||||
large offsets are encoded as an index into the next table with the |
||||
most significant bit set. |
||||
|
||||
- A table of 8-byte offset entries (empty for pack files less than |
||||
2 GiB). Pack files are organized with heavily used objects toward |
||||
the front, so most object references should not need to refer to |
||||
this table. |
||||
- Zero or more NUL bytes. |
||||
- Tables for the second object format, with the same layout as above, |
||||
up to and not including the table of CRC32 values. |
||||
- Zero or more NUL bytes. |
||||
- The trailer consists of the following: |
||||
- A copy of the NewHash checksum at the end of the corresponding
  packfile.

- NewHash checksum of all of the above.
||||
|
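The fixed-size part of the header and the offset encoding can be sketched as follows. Several things here are assumptions made for illustration: the format identifier `newh` is invented, the header length is taken to include the length field itself, the optional key/value pairs and NUL padding are omitted, and the signature is the byte sequence '\377tOc' used by existing .idx files.

```python
import struct

SIGNATURE = b"\xfftOc"  # same bytes as the existing .idx signature
VERSION = 3

def build_header(num_objects, formats, trailer_offset):
    """formats: list of (4-byte id, shortened-name length, table offset)."""
    body = struct.pack(">II", num_objects, len(formats))
    for fmt_id, short_len, table_offset in formats:
        body += struct.pack(">4sII", fmt_id, short_len, table_offset)
    body += struct.pack(">I", trailer_offset)
    # Assumption: the length counts signature, version and the length field.
    header_len = 12 + len(body)
    return SIGNATURE + struct.pack(">II", VERSION, header_len) + body

def parse_header(data):
    """Inverse of build_header; all integers are in network byte order."""
    sig, version, header_len = struct.unpack_from(">4sII", data, 0)
    if sig != SIGNATURE or version != VERSION:
        raise ValueError("not a v3 pack index")
    num_objects, num_formats = struct.unpack_from(">II", data, 12)
    formats = [struct.unpack_from(">4sII", data, 20 + 12 * i)
               for i in range(num_formats)]
    (trailer_offset,) = struct.unpack_from(">I", data, 20 + 12 * num_formats)
    return header_len, num_objects, formats, trailer_offset

def resolve_offset(entry, large_offsets):
    """Decode a 4-byte offset entry; a set top bit indexes the 8-byte table."""
    if entry & 0x80000000:
        return large_offsets[entry & 0x7FFFFFFF]
    return entry
```

A reader that honours the header length field can skip key/value pairs and padding it does not understand, which is why the length is recorded explicitly.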
||||
Loose object index |
||||
~~~~~~~~~~~~~~~~~~ |
||||
A new file $GIT_OBJECT_DIR/loose-object-idx contains information about |
||||
all loose objects. Its format is |
||||
|
||||
# loose-object-idx |
||||
(newhash-name SP sha1-name LF)* |
||||
|
||||
where the object names are in hexadecimal format. The file is not |
||||
sorted. |
||||
|
||||
The loose object index is protected against concurrent writes by a |
||||
lock file $GIT_OBJECT_DIR/loose-object-idx.lock. To add a new loose |
||||
object: |
||||
|
||||
1. Write the loose object to a temporary file, like today. |
||||
2. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the lock. |
||||
3. Rename the loose object into place. |
||||
4. Open loose-object-idx with O_APPEND and write the new object's entry.
||||
5. Unlink loose-object-idx.lock to release the lock. |
||||
|
||||
To remove entries (e.g. in "git repack" or "git prune"):
||||
|
||||
1. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the |
||||
lock. |
||||
2. Write the new content to loose-object-idx.lock. |
||||
3. Unlink any loose objects being removed. |
||||
4. Rename to replace loose-object-idx, releasing the lock. |
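Both procedures can be sketched with ordinary POSIX-style file operations. The function names and arguments below are hypothetical, and the sketch assumes the "# loose-object-idx" header line is written when the file is first created. It does not retry when the lock is already held.

```python
import os

def _acquire(lock_path):
    # O_EXCL makes creation fail if another writer already holds the lock.
    return os.open(lock_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)

def add_entry(objdir, newhash_name, sha1_name, tmp_path, final_path):
    """Steps 2-5 of adding an object; step 1 (writing tmp_path) is done."""
    idx = os.path.join(objdir, "loose-object-idx")
    lock = idx + ".lock"
    os.close(_acquire(lock))                      # step 2
    try:
        os.rename(tmp_path, final_path)           # step 3
        new_file = not os.path.exists(idx)
        with open(idx, "a") as f:                 # step 4: append mode
            if new_file:
                f.write("# loose-object-idx\n")
            f.write(f"{newhash_name} {sha1_name}\n")
    finally:
        os.unlink(lock)                           # step 5

def remove_entries(objdir, doomed):
    """doomed maps newhash-name to the path of the loose object to unlink."""
    idx = os.path.join(objdir, "loose-object-idx")
    lock = idx + ".lock"
    fd = _acquire(lock)                           # step 1
    try:
        with os.fdopen(fd, "w") as dst, open(idx) as src:
            for line in src:                      # step 2: copy survivors
                if line.startswith("#") or line.split(" ")[0] not in doomed:
                    dst.write(line)
        for path in doomed.values():              # step 3
            os.unlink(path)
        os.replace(lock, idx)                     # step 4: releases the lock
    except BaseException:
        os.unlink(lock)
        raise
```

Because removal writes the new content into the lock file and renames it over the index, readers see either the old or the new index, never a partial one.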
||||
|
||||
Translation table |
||||
~~~~~~~~~~~~~~~~~ |
||||
The index files support a bidirectional mapping between sha1-names |
||||
and newhash-names. The lookup proceeds similarly to ordinary object |
||||
lookups. For example, to convert a sha1-name to a newhash-name: |
||||
|
||||
1. Look for the object in idx files. If a match is present in the |
||||
idx's sorted list of truncated sha1-names, then: |
||||
a. Read the corresponding entry in the sha1-name order to pack
   order mapping.
||||
b. Read the corresponding entry in the full sha1-name table to |
||||
verify we found the right object. If it is, then |
||||
c. Read the corresponding entry in the full newhash-name table. |
||||
That is the object's newhash-name. |
||||
2. Check for a loose object. Read lines from loose-object-idx until |
||||
we find a match. |
||||
|
||||
Step (1) takes the same amount of time as an ordinary object lookup: |
||||
O(number of packs * log(objects per pack)). Step (2) takes O(number of |
||||
loose objects) time. To maintain good performance it will be necessary |
||||
to keep the number of loose objects low. See the "Loose objects and |
||||
unreachable objects" section below for more details. |
||||
|
||||
Since all operations that make new objects (e.g., "git commit") add |
||||
the new objects to the corresponding index, this mapping is possible |
||||
for all objects in the object store. |
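The two-step lookup can be modelled in memory as below. This is a sketch under simplifying assumptions: names are hex strings, `PackIndex` and `lookup` are hypothetical names, and the caller supplies a shortened-name length that keeps prefixes unambiguous within the pack.

```python
from bisect import bisect_left

class PackIndex:
    """In-memory stand-in for the sha1 and newhash tables of one idx file."""

    def __init__(self, pairs, short_len):
        # pairs: (sha1_name, newhash_name) tuples in pack order.
        self.short_len = short_len
        self.sha1_full = [s for s, _ in pairs]
        self.newhash_full = [n for _, n in pairs]
        # Name order to pack order mapping, plus the sorted prefix table.
        self.name_to_pack = sorted(range(len(pairs)),
                                   key=lambda i: self.sha1_full[i])
        self.sha1_short = [self.sha1_full[i][:short_len]
                           for i in self.name_to_pack]

    def sha1_to_newhash(self, sha1_name):
        prefix = sha1_name[:self.short_len]
        i = bisect_left(self.sha1_short, prefix)       # binary search
        if i == len(self.sha1_short) or self.sha1_short[i] != prefix:
            return None
        pack_pos = self.name_to_pack[i]                # step 1a
        if self.sha1_full[pack_pos] != sha1_name:      # step 1b
            return None
        return self.newhash_full[pack_pos]             # step 1c

def lookup(sha1_name, packs, loose_idx_lines):
    for idx in packs:                                  # step 1: each pack
        found = idx.sha1_to_newhash(sha1_name)
        if found is not None:
            return found
    for line in loose_idx_lines:                       # step 2: linear scan
        if line.startswith("#"):
            continue
        newhash_name, _, sha1 = line.rstrip("\n").partition(" ")
        if sha1 == sha1_name:
            return newhash_name
    return None
```

The linear scan in step 2 is the part whose cost grows with the number of loose objects, which motivates keeping that number low.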
||||
|
||||
Reading an object's sha1-content |
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
||||
The sha1-content of an object can be read by converting all newhash-names
that its newhash-content references to sha1-names using the translation
table.
||||
|
||||
Fetch |
||||
~~~~~ |
||||
Fetching from a SHA-1 based server requires translating between SHA-1 |
||||
and NewHash based representations on the fly. |
||||
|
||||
SHA-1s named in the ref advertisement that are present on the client |
||||
can be translated to NewHash and looked up as local objects using the |
||||
translation table. |
||||
|
||||
Negotiation proceeds as today. Any "have"s generated locally are |
||||
converted to SHA-1 before being sent to the server, and SHA-1s |
||||
mentioned by the server are converted to NewHash when looking them up |
||||
locally. |
||||
|
||||
After negotiation, the server sends a packfile containing the |
||||
requested objects. We convert the packfile to NewHash format using |
||||
the following steps: |
||||
|
||||
1. index-pack: inflate each object in the packfile and compute its |
||||
SHA-1. Objects can contain deltas in OBJ_REF_DELTA format against |
||||
objects the client has locally. These objects can be looked up |
||||
using the translation table and their sha1-content read as |
||||
described above to resolve the deltas. |
||||
2. topological sort: starting at the "want"s from the negotiation |
||||
phase, walk through objects in the pack and emit a list of them, |
||||
excluding blobs, in reverse topologically sorted order, with each |
||||
object coming later in the list than all objects it references. |
||||
(This list only contains objects reachable from the "wants". If the |
||||
pack from the server contained additional extraneous objects, then |
||||
they will be discarded.) |
||||
3. convert to newhash: open a new (newhash) packfile. Read the topologically |
||||
sorted list just generated. For each object, inflate its |
||||
sha1-content, convert to newhash-content, and write it to the newhash |
||||
pack. Record the new sha1<->newhash mapping entry for use in the idx. |
||||
4. sort: reorder entries in the new pack to match the order of objects
   in the pack the server generated and include blobs. Write a newhash idx
   file.
||||
5. clean up: remove the SHA-1 based pack file, index, and |
||||
topologically sorted list obtained from the server in steps 1 |
||||
and 2. |
||||
|
||||
Step 3 requires every object referenced by the new object to be in the |
||||
translation table. This is why the topological sort step is necessary. |
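The ordering required by step 2 is a post-order walk of the object graph. The sketch below assumes the references of each object in the pack have already been extracted into a dictionary; `conversion_order` and its arguments are hypothetical names, and objects absent from the dictionary are treated as already present locally.

```python
def conversion_order(wants, refs, blobs):
    """List non-blob objects reachable from wants, dependencies first.

    refs maps each object name in the pack to the names it references.
    blobs is the set of names that are blobs (excluded from the result).
    """
    order, seen = [], set()
    for want in wants:
        stack = [(want, False)]
        while stack:
            name, expanded = stack.pop()
            if expanded:
                # All referenced objects were emitted before this point.
                order.append(name)
                continue
            if name in seen or name in blobs or name not in refs:
                continue
            seen.add(name)
            stack.append((name, True))
            stack.extend((child, False) for child in refs[name])
    return order
```

Objects in the pack that are never reached from a "want" do not appear in the result, matching the note that extraneous objects are discarded.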
||||
|
||||
As an optimization, step 1 could write a file describing what non-blob |
||||
objects each object it has inflated from the packfile references. This |
||||
makes the topological sort in step 2 possible without inflating the |
||||
objects in the packfile for a second time. The objects need to be |
||||
inflated again in step 3, for a total of two inflations. |
||||
|
||||
Step 4 is probably necessary for good read-time performance. "git |
||||
pack-objects" on the server optimizes the pack file for good data |
||||
locality (see Documentation/technical/pack-heuristics.txt). |
||||
|
||||
Details of this process are likely to change. It will take some |
||||
experimenting to get this to perform well. |
||||
|
||||
Push |
||||
~~~~ |
||||
Push is simpler than fetch because the objects referenced by the |
||||
pushed objects are already in the translation table. The sha1-content |
||||
of each object being pushed can be read as described in the "Reading |
||||
an object's sha1-content" section to generate the pack written by git |
||||
send-pack. |
||||
|
||||
Signed Commits |
||||
~~~~~~~~~~~~~~ |
||||
We add a new field "gpgsig-newhash" to the commit object format to allow |
||||
signing commits without relying on SHA-1. It is similar to the |
||||
existing "gpgsig" field. Its signed payload is the newhash-content of the |
||||
commit object with any "gpgsig" and "gpgsig-newhash" fields removed. |
||||
|
||||
This means commits can be signed |
||||
1. using SHA-1 only, as in existing signed commit objects |
||||
2. using both SHA-1 and NewHash, by using both gpgsig-newhash and gpgsig |
||||
fields. |
||||
3. using only NewHash, by only using the gpgsig-newhash field. |
||||
|
||||
Old versions of "git verify-commit" can verify the gpgsig signature in |
||||
cases (1) and (2) without modifications and view case (3) as an |
||||
ordinary unsigned commit. |
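Computing the signed payload amounts to dropping two header fields together with their continuation lines. A minimal sketch, assuming multi-line header values continue on lines that begin with a space (as the existing "gpgsig" field does); `signed_payload` is a hypothetical helper name.

```python
SIGNATURE_FIELDS = (b"gpgsig", b"gpgsig-newhash")

def signed_payload(content: bytes) -> bytes:
    """Return commit content with gpgsig and gpgsig-newhash fields removed."""
    header, sep, body = content.partition(b"\n\n")
    kept, skipping = [], False
    for line in header.split(b"\n"):
        if line.startswith(b" "):
            # Continuation line: shares the fate of the field it belongs to.
            if not skipping:
                kept.append(line)
            continue
        skipping = line.split(b" ", 1)[0] in SIGNATURE_FIELDS
        if not skipping:
            kept.append(line)
    return b"\n".join(kept) + sep + body
```

The input here would be the newhash-content of the commit; the same filtering applied to a commit carrying only one of the two fields leaves the rest of the header untouched.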
||||
|
||||
Signed Tags |
||||
~~~~~~~~~~~ |
||||
We add a new field "gpgsig-newhash" to the tag object format to allow |
||||
signing tags without relying on SHA-1. Its signed payload is the |
||||
newhash-content of the tag with its gpgsig-newhash field and "-----BEGIN PGP |
||||
SIGNATURE-----" delimited in-body signature removed. |
||||
|
||||
This means tags can be signed |
||||
1. using SHA-1 only, as in existing signed tag objects |
||||
2. using both SHA-1 and NewHash, by using gpgsig-newhash and an in-body |
||||
signature. |
||||
3. using only NewHash, by only using the gpgsig-newhash field. |
||||
|
||||
Mergetag embedding |
||||
~~~~~~~~~~~~~~~~~~ |
||||
The mergetag field in the sha1-content of a commit contains the |
||||
sha1-content of a tag that was merged by that commit. |
||||
|
||||
The mergetag field in the newhash-content of the same commit contains the |
||||
newhash-content of the same tag. |
||||
|
||||
Submodules |
||||
~~~~~~~~~~ |
||||
To convert recorded submodule pointers, you need to have the converted |
||||
submodule repository in place. The translation table of the submodule |
||||
can be used to look up the new hash. |
||||
|
||||
Loose objects and unreachable objects |
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
||||
Fast lookups in the loose-object-idx require that the number of loose |
||||
objects not grow too high. |
||||
|
||||
"git gc --auto" currently waits for there to be 6700 loose objects |
||||
present before consolidating them into a packfile. We will need to |
||||
measure to find a more appropriate threshold for it to use. |
||||
|
||||
"git gc --auto" currently waits for there to be 50 packs present |
||||
before combining packfiles. Packing loose objects more aggressively |
||||
may cause the number of pack files to grow too quickly. This can be |
||||
mitigated by using a strategy similar to Martin Fick's exponential |
||||
rolling garbage collection script: |
||||
https://gerrit-review.googlesource.com/c/gerrit/+/35215 |
||||
|
||||
"git gc" currently expels any unreachable objects it encounters in |
||||
pack files to loose objects in an attempt to prevent a race when |
||||
pruning them (in case another process is simultaneously writing a new |
||||
object that refers to the about-to-be-deleted object). This leads to |
||||
an explosion in the number of loose objects present and disk space |
||||
usage due to the objects in delta form being replaced with independent |
||||
loose objects. Worse, the race is still present for loose objects. |
||||
|
||||
Instead, "git gc" will need to move unreachable objects to a new |
||||
packfile marked as UNREACHABLE_GARBAGE (using the PSRC field; see |
||||
below). To avoid the race when writing new objects referring to an |
||||
about-to-be-deleted object, code paths that write new objects will |
||||
need to copy any objects from UNREACHABLE_GARBAGE packs that they |
||||
refer to to new, non-UNREACHABLE_GARBAGE packs (or loose objects). |
||||
UNREACHABLE_GARBAGE packs are then safe to delete if their creation time
(as indicated by the file's mtime) is long enough ago.
||||
|
||||
To avoid a proliferation of UNREACHABLE_GARBAGE packs, they can be |
||||
combined under certain circumstances. If "gc.garbageTtl" is set to |
||||
greater than one day, then packs created within a single calendar day, |
||||
UTC, can be coalesced together. The resulting packfile would have an |
||||
mtime before midnight on that day, so this makes the effective maximum |
||||
ttl the garbageTtl + 1 day. If "gc.garbageTtl" is less than one day, |
||||
then we divide the calendar day into intervals one-third of that ttl |
||||
in duration. Packs created within the same interval can be coalesced |
||||
together. The resulting packfile would have an mtime before the end of |
||||
the interval, so this makes the effective maximum ttl equal to the |
||||
garbageTtl * 4/3. |
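The arithmetic behind the two cases can be checked with a short sketch. The function names are hypothetical, and since the text does not say what happens when the ttl is exactly one day, the sketch arbitrarily treats that case like the shorter one.

```python
from datetime import timedelta

DAY = timedelta(days=1)

def coalesce_interval(garbage_ttl: timedelta) -> timedelta:
    """Width of the window within which garbage packs may be merged."""
    # Assumption: a ttl of exactly one day falls into the "one-third" case.
    return DAY if garbage_ttl > DAY else garbage_ttl / 3

def effective_max_ttl(garbage_ttl: timedelta) -> timedelta:
    """Worst case: a pack created at the start of a window gets an mtime
    just before the end of that window, delaying deletion by one window."""
    return garbage_ttl + coalesce_interval(garbage_ttl)

print(effective_max_ttl(timedelta(days=14)))   # ttl + 1 day
print(effective_max_ttl(timedelta(hours=6)))   # ttl * 4/3
```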
||||
|
||||
This rule comes from Thirumala Reddy Mutchukota's JGit change |
||||
https://git.eclipse.org/r/90465. |
||||
|
||||
The UNREACHABLE_GARBAGE setting goes in the PSRC field of the pack |
||||
index. More generally, that field indicates where a pack came from: |
||||
|
||||
- 1 (PACK_SOURCE_RECEIVE) for a pack received over the network |
||||
- 2 (PACK_SOURCE_AUTO) for a pack created by a lightweight |
||||
"gc --auto" operation |
||||
- 3 (PACK_SOURCE_GC) for a pack created by a full gc |
||||
- 4 (PACK_SOURCE_UNREACHABLE_GARBAGE) for potential garbage |
||||
discovered by gc |
||||
- 5 (PACK_SOURCE_INSERT) for locally created objects that were |
||||
written directly to a pack file, e.g. from "git add ." |
||||
|
||||
This information can be useful for debugging and for "gc --auto" to |
||||
make appropriate choices about which packs to coalesce. |
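For illustration, the PSRC values could be represented as an enumeration and serialized as one 4-byte key plus 4-byte value pair in the idx header. The class and function names are hypothetical, and the pruning check is only a sketch of the rule described earlier in this section.

```python
import struct
from enum import IntEnum

class PackSource(IntEnum):
    RECEIVE = 1               # received over the network
    AUTO = 2                  # lightweight "gc --auto"
    GC = 3                    # full gc
    UNREACHABLE_GARBAGE = 4   # potential garbage discovered by gc
    INSERT = 5                # locally created, written directly to a pack

def encode_psrc(source: PackSource) -> bytes:
    """One header key/value pair: 4-byte key, 4-byte big-endian value."""
    return struct.pack(">4sI", b"PSRC", int(source))

def may_delete(source: PackSource, age_seconds: float, ttl_seconds: float) -> bool:
    """Only garbage packs older than the ttl are candidates for deletion."""
    return source is PackSource.UNREACHABLE_GARBAGE and age_seconds > ttl_seconds
```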
||||
|
||||
Caveats |
||||
------- |
||||
Invalid objects |
||||
~~~~~~~~~~~~~~~ |
||||
The conversion from sha1-content to newhash-content retains any |
||||
brokenness in the original object (e.g., tree entry modes encoded with |
||||
leading 0, tree objects whose paths are not sorted correctly, and |
||||
commit objects without an author or committer). This is a deliberate |
||||
feature of the design to allow the conversion to round-trip. |
||||
|
||||
More profoundly broken objects (e.g., a commit with a truncated "tree" |
||||
header line) cannot be converted but were not usable by current Git |
||||
anyway. |
||||
|
||||
Shallow clone and submodules |
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
||||
Because it requires all referenced objects to be available in the |
||||
locally generated translation table, this design does not support |
||||
shallow clone or unfetched submodules. Protocol improvements might |
||||
allow lifting this restriction. |
||||
|
||||
Alternates |
||||
~~~~~~~~~~ |
||||
For the same reason, a newhash repository cannot borrow objects from a |
||||
sha1 repository using objects/info/alternates or |
||||
$GIT_ALTERNATE_OBJECT_REPOSITORIES. |
||||
|
||||
git notes |
||||
~~~~~~~~~ |
||||
The "git notes" tool annotates objects using their sha1-name as key. |
||||
This design does not describe a way to migrate notes trees to use |
||||
newhash-names. That migration is expected to happen separately (for |
||||
example using a file at the root of the notes tree to describe which |
||||
hash it uses). |
||||
|
||||
Server-side cost |
||||
~~~~~~~~~~~~~~~~ |
||||
Until Git protocol gains NewHash support, using NewHash based storage |
||||
on public-facing Git servers is strongly discouraged. Once Git |
||||
protocol gains NewHash support, NewHash based servers are likely not |
||||
to support SHA-1 compatibility, to avoid what may be a very expensive |
||||
hash reencode during clone and to encourage peers to modernize. |
||||
|
||||
The design described here allows fetches by SHA-1 clients of a |
||||
personal NewHash repository because it's not much more difficult than |
||||
allowing pushes from that repository. This support needs to be guarded |
||||
by a configuration option --- servers like git.kernel.org that serve a |
||||
large number of clients would not be expected to bear that cost. |
||||
|
||||
Meaning of signatures |
||||
~~~~~~~~~~~~~~~~~~~~~ |
||||
The signed payload for signed commits and tags does not explicitly |
||||
name the hash used to identify objects. If some day Git adopts a new |
||||
hash function with the same length as the current SHA-1 (40
hexadecimal digit) or NewHash (64 hexadecimal digit) object names, then the
||||
intent behind the PGP signed payload in an object signature is |
||||
unclear: |
||||
|
||||
object e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 |
||||
type commit |
||||
tag v2.12.0 |
||||
tagger Junio C Hamano <gitster@pobox.com> 1487962205 -0800 |
||||
|
||||
Git 2.12 |
||||
|
||||
Does this mean Git v2.12.0 is the commit with sha1-name |
||||
e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 or the commit with |
||||
new-40-digit-hash-name e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7? |
||||
|
||||
Fortunately NewHash and SHA-1 have different lengths. If Git starts |
||||
using another hash with the same length to name objects, then it will |
||||
need to change the format of signed payloads using that hash to |
||||
address this issue. |
||||
|
||||
Object names on the command line |
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
||||
To support the transition (see Transition plan below), this design |
||||
supports four different modes of operation: |
||||
|
||||
1. ("dark launch") Treat object names input by the user as SHA-1 and |
||||
convert any object names written to output to SHA-1, but store |
||||
objects using NewHash. This allows users to test the code with no |
||||
visible behavior change except for performance. It also allows
running even tests that assume the SHA-1 hash function, to
sanity-check the behavior of the new mode.
||||
|
||||
2. ("early transition") Allow both SHA-1 and NewHash object names in |
||||
input. Any object names written to output use SHA-1. This allows |
||||
users to continue to make use of SHA-1 to communicate with peers |
||||
(e.g. by email) that have not migrated yet and prepares for mode 3. |
||||
|
||||
3. ("late transition") Allow both SHA-1 and NewHash object names in |
||||
input. Any object names written to output use NewHash. In this |
||||
mode, users are using a more secure object naming method by |
||||
default. The disruption is minimal as long as most of their peers |
||||
are in mode 2 or mode 3. |
||||
|
||||
4. ("post-transition") Treat object names input by the user as |
||||
NewHash and write output using NewHash. This is safer than mode 3 |
||||
because there is less risk that input is incorrectly interpreted |
||||
using the wrong hash function. |
||||
|
||||
The mode is specified in configuration. |
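The four modes can be summarized as a small table of accepted input formats and the output format. The mode keys and helper names below are invented for this sketch (the design does not name the configuration values), and only full-length names are classified; abbreviated names are out of scope here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mode:
    accepts: frozenset   # object name formats allowed in input
    output: str          # object name format used in output

MODES = {
    "dark-launch": Mode(frozenset({"sha1"}), "sha1"),
    "early-transition": Mode(frozenset({"sha1", "newhash"}), "sha1"),
    "late-transition": Mode(frozenset({"sha1", "newhash"}), "newhash"),
    "post-transition": Mode(frozenset({"newhash"}), "newhash"),
}

# Full names are told apart by length: 40 or 64 hexadecimal digits.
FORMAT_BY_HEX_LENGTH = {40: "sha1", 64: "newhash"}

def classify_input(name: str, mode: str) -> str:
    """Return the format of a full object name, or raise if not accepted."""
    fmt = FORMAT_BY_HEX_LENGTH.get(len(name))
    if fmt is None or fmt not in MODES[mode].accepts:
        raise ValueError(f"object name not accepted in mode {mode!r}")
    return fmt

def output_format(mode: str) -> str:
    return MODES[mode].output
```

In every mode objects are stored using NewHash; the table only governs how names are read from and shown to the user.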
||||
|
||||
The user can also explicitly specify which format to use for a |
||||
particular revision specifier and for output, overriding the mode. For |
||||
example: |
||||
|
||||
git --output-format=sha1 log abac87a^{sha1}..f787cac^{newhash} |
||||
|
||||
Selection of a New Hash |
||||
----------------------- |
||||
In early 2005, around the time that Git was written, Xiaoyun Wang, |
||||
Yiqun Lisa Yin, and Hongbo Yu announced an attack finding SHA-1 |
||||
collisions in 2^69 operations. In August they published details. |
||||
Luckily, no practical demonstrations of a collision in full SHA-1 were
published until 12 years later, in 2017.
||||
|
||||
The hash function NewHash to replace SHA-1 should be stronger than |
||||
SHA-1 was: we would like it to be trustworthy and useful in practice |
||||
for at least 10 years. |
||||
|
||||
Some other relevant properties: |
||||
|
||||
1. A 256-bit hash (long enough to match common security practice; not
   so long as to hurt performance and disk usage).
||||
|
||||
2. High quality implementations should be widely available (e.g. in |
||||
OpenSSL). |
||||
|
||||
3. The hash function's properties should match Git's needs (e.g. Git |
||||
requires collision and 2nd preimage resistance and does not require |
||||
length extension resistance). |
||||
|
||||
4. As a tiebreaker, the hash should be fast to compute (fortunately |
||||
many contenders are faster than SHA-1). |
||||
|
||||
Some hashes under consideration are SHA-256, SHA-512/256, SHA-256x16, |
||||
K12, and BLAKE2bp-256. |
||||
|
||||
Transition plan |
||||
--------------- |
||||
Some initial steps can be implemented independently of one another: |
||||
- adding a hash function API (vtable) |
||||
- teaching fsck to tolerate the gpgsig-newhash field |
||||
- excluding gpgsig-* from the fields copied by "git commit --amend" |
||||
- annotating tests that depend on SHA-1 values with a SHA1 test |
||||
prerequisite |
||||
- using "struct object_id", GIT_MAX_RAWSZ, and GIT_MAX_HEXSZ |
||||
consistently instead of "unsigned char *" and the hardcoded |
||||
constants 20 and 40. |
||||
- introducing index v3 |
||||
- adding support for the PSRC field and safer object pruning |
||||
|
||||
|
||||
The first user-visible change is the introduction of the objectFormat |
||||
extension (without compatObjectFormat). This requires: |
||||
- implementing the loose-object-idx |
||||
- teaching fsck about this mode of operation |
||||
- using the hash function API (vtable) when computing object names |
||||
- signing objects and verifying signatures |
||||
- rejecting attempts to fetch from or push to an incompatible |
||||
repository |
||||
|
||||
Next comes introduction of compatObjectFormat: |
||||
- translating object names between object formats |
||||
- translating object content between object formats |
||||
- generating and verifying signatures in the compat format |
||||
- adding appropriate index entries when adding a new object to the |
||||
object store |
||||
- --output-format option |
||||
- ^{sha1} and ^{newhash} revision notation |
||||
- configuration to specify default input and output format (see |
||||
"Object names on the command line" above) |
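
Translating object names between formats amounts to a lookup in a table of (sha1-name, newhash-name) pairs. The sketch below assumes an in-memory array sorted by sha1-name and a 32-byte NewHash; "struct name_pair", sort_pairs_by_sha1(), and sha1_to_newhash() are hypothetical names, and the on-disk layout defined elsewhere in this document is not reproduced here.

```c
#include <stdlib.h>
#include <string.h>

#define SHA1_RAWSZ 20
#define NEWHASH_RAWSZ 32   /* assumption: NewHash is 256 bits */

/* One translation table entry: both names of the same object. */
struct name_pair {
	unsigned char sha1[SHA1_RAWSZ];
	unsigned char newhash[NEWHASH_RAWSZ];
};

/* Order entries by sha1-name only. */
static int cmp_by_sha1(const void *a, const void *b)
{
	const struct name_pair *x = a, *y = b;

	return memcmp(x->sha1, y->sha1, SHA1_RAWSZ);
}

/* Sort the table so that it can be binary-searched by sha1-name. */
void sort_pairs_by_sha1(struct name_pair *table, size_t nr)
{
	qsort(table, nr, sizeof(*table), cmp_by_sha1);
}

/*
 * Return the newhash-name for a sha1-name, or NULL if the object is
 * not in the table. The table must already be sorted by sha1-name.
 */
const unsigned char *sha1_to_newhash(const struct name_pair *table,
				     size_t nr, const unsigned char *sha1)
{
	struct name_pair key;
	const struct name_pair *hit;

	memcpy(key.sha1, sha1, SHA1_RAWSZ);  /* only .sha1 is compared */
	hit = bsearch(&key, table, nr, sizeof(*table), cmp_by_sha1);
	return hit ? hit->newhash : NULL;
}
```

The reverse direction (newhash-name to sha1-name) would need a second ordering of the same entries, sorted by newhash-name.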
||||
|
||||
The next step is supporting fetches and pushes to SHA-1 repositories: |
||||
- allow pushes to a repository using the compat format |
||||
- generate a topologically sorted list of the SHA-1 names of fetched |
||||
objects |
||||
- convert the fetched packfile to newhash format and generate an idx |
||||
file |
||||
- re-sort to match the order of objects in the fetched packfile |
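
The topological sort matters because an object can only be converted once the new names of everything it references are known. A minimal sketch of such an ordering, using a depth-first post-order walk over a small in-memory graph, is shown below. "struct obj", MAX_REFS, and topo_order() are hypothetical; a real implementation would work on SHA-1 names read from the fetched pack and would likely avoid recursion.

```c
#include <stddef.h>

#define MAX_REFS 4   /* arbitrary limit for this sketch */

struct obj {
	int refs[MAX_REFS];  /* indexes of the objects this one references */
	int nr_refs;
	int visited;
};

/* Emit everything objs[i] references before emitting objs[i] itself. */
static void visit(struct obj *objs, int i, int *order, int *n)
{
	int k;

	if (objs[i].visited)
		return;
	objs[i].visited = 1;
	for (k = 0; k < objs[i].nr_refs; k++)
		visit(objs, objs[i].refs[k], order, n);
	order[(*n)++] = i;
}

/*
 * Fill order[] with object indexes such that every object appears
 * after all objects it references. Returns the number of entries
 * written. The object graph is assumed to be acyclic, which holds
 * for objects that refer to each other by hash.
 */
int topo_order(struct obj *objs, int nr, int *order)
{
	int i, n = 0;

	for (i = 0; i < nr; i++)
		objs[i].visited = 0;
	for (i = 0; i < nr; i++)
		visit(objs, i, order, &n);
	return n;
}
```

For a child commit (index 0) that references its parent commit (index 1) and a tree (index 2), where the parent references the same tree, the walk yields the tree first, then the parent, then the child.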
||||
|
||||
The infrastructure supporting fetch also allows converting an existing |
||||
repository. In converted repositories and new clones, end users can |
||||
gain support for the new hash function without any visible change in |
||||
behavior (see "dark launch" in the "Object names on the command line" |
||||
section). In particular this allows users to verify NewHash signatures |
||||
on objects in the repository, and it should ensure the transition code |
||||
is stable in production in preparation for using it more widely. |
||||
|
||||
Over time projects would encourage their users to adopt the "early |
||||
transition" and then "late transition" modes to take advantage of the |
||||
new, more futureproof NewHash object names. |
||||
|
||||
When objectFormat and compatObjectFormat are both set, commands |
||||
generating signatures would generate both SHA-1 and NewHash signatures |
||||
by default to support both new and old users. |
||||
|
||||
In projects using NewHash heavily, users could be encouraged to adopt |
||||
the "post-transition" mode to avoid accidentally making implicit use |
||||
of SHA-1 object names. |
||||
|
||||
Once a critical mass of users have upgraded to a version of Git that |
||||
can verify NewHash signatures and have converted their existing |
||||
repositories to support verifying them, we can add support for a |
||||
setting to generate only NewHash signatures. This is expected to be at |
||||
least a year later. |
||||
|
||||
That is also a good moment to advertise the ability to convert |
||||
repositories to use NewHash only, stripping out all SHA-1 related |
||||
metadata. This improves performance by eliminating translation |
||||
overhead and security by avoiding the possibility of accidentally |
||||
relying on the safety of SHA-1. |
||||
|
||||
Updating Git's protocols to allow a server to specify which hash |
||||
functions it supports is also an important part of this transition. It |
||||
is not discussed in detail in this document but this transition plan |
||||
assumes it happens. :) |
||||
|
||||
Alternatives considered |
||||
----------------------- |
||||
Upgrading everyone working on a particular project on a flag day |
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
||||
Projects like the Linux kernel are large and complex enough that |
||||
flipping the switch for all projects based on the repository at once |
||||
is infeasible. |
||||
|
||||
Not only would all developers and server operators supporting |
||||
developers have to switch on the same flag day, but supporting tooling |
||||
(continuous integration, code review, bug trackers, etc.) would have to |
||||
be adapted as well. This also makes it difficult to get early feedback |
||||
from some project participants testing before it is time for mass |
||||
adoption. |
||||
|
||||
Using hash functions in parallel |
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
||||
(e.g. https://public-inbox.org/git/22708.8913.864049.452252@chiark.greenend.org.uk/ ) |
||||
Objects newly created would be addressed by the new hash, but inside |
||||
such an object (e.g. commit) it is still possible to address objects |
||||
using the old hash function. |
||||
* You cannot trust its history (needed for bisectability) in the |
||||
future without further work |
||||
* Maintenance burden as the number of supported hash functions grows |
||||
(they will never go away, so they accumulate). In this proposal, by |
||||
comparison, converted objects lose all references to SHA-1. |
||||
|
||||
Signed objects with multiple hashes |
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
||||
Instead of introducing the gpgsig-newhash field in commit and tag objects |
||||
for newhash-content based signatures, an earlier version of this design |
||||
added "hash newhash <newhash-name>" fields to strengthen the existing |
||||
sha1-content based signatures. |
||||
|
||||
In other words, a single signature was used to attest to the object |
||||
content using both hash functions. This had some advantages: |
||||
* Using one signature instead of two speeds up the signing process. |
||||
* Having one signed payload with both hashes allows the signer to |
||||
attest to the sha1-name and newhash-name referring to the same object. |
||||
* All users consume the same signature. Broken signatures are likely |
||||
to be detected quickly using current versions of git. |
||||
|
||||
However, it also came with disadvantages: |
||||
* Verifying a signed object requires access to the sha1-names of all |
||||
objects it references, even after the transition is complete and |
||||
the translation table is no longer needed for anything else. To support |
||||
this, the design added fields such as "hash sha1 tree <sha1-name>" |
||||
and "hash sha1 parent <sha1-name>" to the newhash-content of a signed |
||||
commit, complicating the conversion process. |
||||
* Allowing signed objects without a sha1 (for after the transition is |
||||
complete) complicated the design further, requiring a "nohash sha1" |
||||
field to suppress including "hash sha1" fields in the newhash-content |
||||
and signed payload. |
||||
|
||||
Lazily populated translation table |
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
||||
Some of the work of building the translation table could be deferred to |
||||
push time, but that would significantly complicate and slow down pushes. |
||||
Calculating the sha1-name at object creation time at the same time it is |
||||
being streamed to disk and having its newhash-name calculated should be |
||||
an acceptable cost. |
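
The single-pass idea can be sketched as two hash contexts that are updated from the same chunks as the object is streamed. The sketch below only covers content that is byte-for-byte identical under both formats (for example a blob, which holds no references to other objects); objects that reference other objects would need their content translated before being fed to the second context. "struct dual_ctx" and the dual_* and toy_* functions are hypothetical, and the toy hash (seeded 64-bit FNV-1a) is a non-cryptographic stand-in for SHA-1 and NewHash.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy streaming hash (seeded 64-bit FNV-1a). NOT cryptographic. */
struct toy_ctx {
	uint64_t state;
};

static void toy_init(struct toy_ctx *c, uint64_t seed)
{
	c->state = 14695981039346656037ULL ^ seed;
}

static void toy_update(struct toy_ctx *c, const void *buf, size_t len)
{
	const unsigned char *p = buf;
	size_t i;

	for (i = 0; i < len; i++) {
		c->state ^= p[i];
		c->state *= 1099511628211ULL;
	}
}

/* Both hash states, advanced together while the data is streamed. */
struct dual_ctx {
	struct toy_ctx sha1;     /* stand-in for the SHA-1 state */
	struct toy_ctx newhash;  /* stand-in for the NewHash state */
};

void dual_init(struct dual_ctx *d)
{
	toy_init(&d->sha1, 1);
	toy_init(&d->newhash, 2);
}

/* Feed one chunk to both hashes; the data is read only once. */
void dual_update(struct dual_ctx *d, const void *chunk, size_t len)
{
	toy_update(&d->sha1, chunk, len);
	toy_update(&d->newhash, chunk, len);
}

/* Produce both names once the whole object has been streamed. */
void dual_final(const struct dual_ctx *d, uint64_t *sha1_name,
		uint64_t *newhash_name)
{
	*sha1_name = d->sha1.state;
	*newhash_name = d->newhash.state;
}
```

Because both states advance chunk by chunk, the names do not depend on how the stream is split, and the only extra cost over computing one name is the second hash computation.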
||||
|
||||
Document History |
||||
---------------- |
||||
|
||||
2017-03-03 |
||||
bmwill@google.com, jonathantanmy@google.com, jrnieder@gmail.com, |
||||
sbeller@google.com |
||||
|
||||
Initial version sent to |
||||
http://public-inbox.org/git/20170304011251.GA26789@aiede.mtv.corp.google.com |
||||
|
||||
2017-03-03 jrnieder@gmail.com |
||||
Incorporated suggestions from jonathantanmy and sbeller: |
||||
* describe purpose of signed objects with each hash type |
||||
* redefine signed object verification using object content under the |
||||
first hash function |
||||
|
||||
2017-03-06 jrnieder@gmail.com |
||||
* Use SHA3-256 instead of SHA2 (thanks, Linus and brian m. carlson).[1][2] |
||||
* Make sha3-based signatures a separate field, avoiding the need for |
||||
"hash" and "nohash" fields (thanks to peff[3]). |
||||
* Add a sorting phase to fetch (thanks to Junio for noticing the need |
||||
for this). |
||||
* Omit blobs from the topological sort during fetch (thanks to peff). |
||||
* Discuss alternates, git notes, and git servers in the caveats |
||||
section (thanks to Junio Hamano, brian m. carlson[4], and Shawn |
||||
Pearce). |
||||
* Clarify language throughout (thanks to various commenters, |
||||
especially Junio). |
||||
|
||||
2017-09-27 jrnieder@gmail.com, sbeller@google.com |
||||
* use placeholder NewHash instead of SHA3-256 |
||||
* describe criteria for picking a hash function. |
||||
* include a transition plan (thanks especially to Brandon Williams |
||||
for fleshing these ideas out) |
||||
* define the translation table (thanks, Shawn Pearce[5], Jonathan |
||||
Tan, and Masaya Suzuki) |
||||
* avoid loose object overhead by packing more aggressively in |
||||
"git gc --auto" |
||||
|
||||
[1] http://public-inbox.org/git/CA+55aFzJtejiCjV0e43+9oR3QuJK2PiFiLQemytoLpyJWe6P9w@mail.gmail.com/ |
||||
[2] http://public-inbox.org/git/CA+55aFz+gkAsDZ24zmePQuEs1XPS9BP_s8O7Q4wQ7LV7X5-oDA@mail.gmail.com/ |
||||
[3] http://public-inbox.org/git/20170306084353.nrns455dvkdsfgo5@sigill.intra.peff.net/ |
||||
[4] http://public-inbox.org/git/20170304224936.rqqtkdvfjgyezsht@genre.crustytoothpaste.net |
||||
[5] https://public-inbox.org/git/CAJo=hJtoX9=AyLHHpUJS7fueV9ciZ_MNpnEPHUz8Whui6g9F0A@mail.gmail.com/ |