technical doc: add a design doc for hash function transition

This document describes what a transition to a new hash function for
Git would look like. Add it to Documentation/technical/ as the plan of
record so that future changes can be recorded as patches.

Also-by: Brandon Williams <bmwill@google.com>
Also-by: Jonathan Tan <jonathantanmy@google.com>
Also-by: Stefan Beller <sbeller@google.com>
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
parent 20fed7cad4
commit 752414ae43
				|  | @ -67,6 +67,7 @@ SP_ARTICLES += howto/maintain-git | |||
| API_DOCS = $(patsubst %.txt,%,$(filter-out technical/api-index-skel.txt technical/api-index.txt, $(wildcard technical/api-*.txt))) | ||||
| SP_ARTICLES += $(API_DOCS) | ||||
|  | ||||
| TECH_DOCS += technical/hash-function-transition | ||||
| TECH_DOCS += technical/http-protocol | ||||
| TECH_DOCS += technical/index-format | ||||
| TECH_DOCS += technical/pack-format | ||||
|  |  | |||
|  | @ -0,0 +1,797 @@ | |||
| Git hash function transition | ||||
| ============================ | ||||
|  | ||||
| Objective | ||||
| --------- | ||||
| Migrate Git from SHA-1 to a stronger hash function. | ||||
|  | ||||
| Background | ||||
| ---------- | ||||
| At its core, the Git version control system is a content addressable | ||||
| filesystem. It uses the SHA-1 hash function to name content. For | ||||
| example, files, directories, and revisions are referred to by hash | ||||
| values, unlike in traditional version control systems, where files | ||||
| or versions are referred to via sequential numbers. The use of a hash | ||||
| function to address its content delivers a few advantages: | ||||
|  | ||||
| * Integrity checking is easy. Bit flips, for example, are easily | ||||
|   detected, as the hash of corrupted content does not match its name. | ||||
| * Lookup of objects is fast. | ||||
|  | ||||
| Using a cryptographically secure hash function brings additional | ||||
| advantages: | ||||
|  | ||||
| * Object names can be signed and third parties can trust the hash to | ||||
|   address the signed object and all objects it references. | ||||
| * Communication using Git protocol and out of band communication | ||||
|   methods have a short reliable string that can be used to reliably | ||||
|   address stored content. | ||||
|  | ||||
| Over time some flaws in SHA-1 have been discovered by security | ||||
| researchers. https://shattered.io demonstrated a practical SHA-1 hash | ||||
| collision. As a result, SHA-1 cannot be considered cryptographically | ||||
| secure any more. This impacts the communication of hash values because | ||||
| we cannot trust that a given hash value represents the known good | ||||
| version of content that the speaker intended. | ||||
|  | ||||
| SHA-1 still possesses the other properties such as fast object lookup | ||||
| and safe error checking, but other hash functions that are believed to | ||||
| be cryptographically secure are equally suitable for those purposes. | ||||
|  | ||||
| Goals | ||||
| ----- | ||||
| Where NewHash is a strong 256-bit hash function to replace SHA-1 (see | ||||
| "Selection of a New Hash", below): | ||||
|  | ||||
| 1. The transition to NewHash can be done one local repository at a time. | ||||
|    a. Requiring no action by any other party. | ||||
|    b. A NewHash repository can communicate with SHA-1 Git servers | ||||
|       (push/fetch). | ||||
|    c. Users can use SHA-1 and NewHash identifiers for objects | ||||
|       interchangeably (see "Object names on the command line", below). | ||||
|    d. New signed objects make use of a stronger hash function than | ||||
|       SHA-1 for their security guarantees. | ||||
| 2. Allow a complete transition away from SHA-1. | ||||
|    a. Local metadata for SHA-1 compatibility can be removed from a | ||||
|       repository if compatibility with SHA-1 is no longer needed. | ||||
| 3. Maintainability throughout the process. | ||||
|    a. The object format is kept simple and consistent. | ||||
|    b. Creation of a generalized repository conversion tool. | ||||
|  | ||||
| Non-Goals | ||||
| --------- | ||||
| 1. Add NewHash support to Git protocol. This is valuable and the | ||||
|    logical next step but it is out of scope for this initial design. | ||||
| 2. Transparently improving the security of existing SHA-1 signed | ||||
|    objects. | ||||
| 3. Intermixing objects using multiple hash functions in a single | ||||
|    repository. | ||||
| 4. Taking the opportunity to fix other bugs in Git's formats and | ||||
|    protocols. | ||||
| 5. Shallow clones and fetches into a NewHash repository. (This will | ||||
|    change when we add NewHash support to Git protocol.) | ||||
| 6. Skip fetching some submodules of a project into a NewHash | ||||
|    repository. (This also depends on NewHash support in Git | ||||
|    protocol.) | ||||
|  | ||||
| Overview | ||||
| -------- | ||||
| We introduce a new repository format extension. Repositories with this | ||||
| extension enabled use NewHash instead of SHA-1 to name their objects. | ||||
| This affects both object names and object content --- both the names | ||||
| of objects and all references to other objects within an object are | ||||
| switched to the new hash function. | ||||
|  | ||||
| NewHash repositories cannot be read by older versions of Git. | ||||
|  | ||||
| Alongside the packfile, a NewHash repository stores a bidirectional | ||||
| mapping between NewHash and SHA-1 object names. The mapping is generated | ||||
| locally and can be verified using "git fsck". Object lookups use this | ||||
| mapping to allow naming objects by either their SHA-1 or NewHash names | ||||
| interchangeably. | ||||
|  | ||||
| "git cat-file" and "git hash-object" gain options to display an object | ||||
| in its sha1 form and write an object given its sha1 form. This | ||||
| requires all objects referenced by that object to be present in the | ||||
| object database so that they can be named using the appropriate name | ||||
| (using the bidirectional hash mapping). | ||||
|  | ||||
| Fetches from a SHA-1 based server convert the fetched objects into | ||||
| NewHash form and record the mapping in the bidirectional mapping table | ||||
| (see below for details). Pushes to a SHA-1 based server convert the | ||||
| objects being pushed into sha1 form so the server does not have to be | ||||
| aware of the hash function the client is using. | ||||
|  | ||||
| Detailed Design | ||||
| --------------- | ||||
| Repository format extension | ||||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||||
| A NewHash repository uses repository format version `1` (see | ||||
| Documentation/technical/repository-version.txt) with extensions | ||||
| `objectFormat` and `compatObjectFormat`: | ||||
|  | ||||
| 	[core] | ||||
| 		repositoryFormatVersion = 1 | ||||
| 	[extensions] | ||||
| 		objectFormat = newhash | ||||
| 		compatObjectFormat = sha1 | ||||
|  | ||||
| Specifying a repository format extension ensures that versions of Git | ||||
| not aware of NewHash do not try to operate on these repositories, | ||||
| instead producing an error message: | ||||
|  | ||||
| 	$ git status | ||||
| 	fatal: unknown repository extensions found: | ||||
| 		objectformat | ||||
| 		compatobjectformat | ||||
|  | ||||
| See the "Transition plan" section below for more details on these | ||||
| repository extensions. | ||||
|  | ||||
| Object names | ||||
| ~~~~~~~~~~~~ | ||||
| Objects can be named by their 40 hexadecimal digit sha1-name or 64 | ||||
| hexadecimal digit newhash-name, plus names derived from those (see | ||||
| gitrevisions(7)). | ||||
|  | ||||
| The sha1-name of an object is the SHA-1 of the concatenation of its | ||||
| type, length, a nul byte, and the object's sha1-content. This is the | ||||
| traditional <sha1> used in Git to name objects. | ||||
|  | ||||
| The newhash-name of an object is the NewHash of the concatenation of its | ||||
| type, length, a nul byte, and the object's newhash-content. | ||||
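
As a rough illustration of the two naming rules above, the sketch below
hashes the same blob under both schemes. SHA-256 is used purely as a
stand-in for the not-yet-chosen NewHash, and the helper name
`object_name` is hypothetical, not part of Git.

```python
import hashlib


def object_name(obj_type: bytes, content: bytes, algo: str) -> str:
    """Hash "<type> <length>NUL<content>" and return the hex object name."""
    header = obj_type + b" " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.new(algo, header + content).hexdigest()


blob = b"hello\n"
# A blob refers to no other object, so the same bytes are hashed both times.
sha1_name = object_name(b"blob", blob, "sha1")       # 40 hex digits
newhash_name = object_name(b"blob", blob, "sha256")  # 64 hex digits (stand-in)
```

Only the hash function differs here because the object is a blob. For
trees, commits, and tags the hashed content differs as well, as the next
section describes.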
|  | ||||
| Object format | ||||
| ~~~~~~~~~~~~~ | ||||
| The content as a byte sequence of a tag, commit, or tree object named | ||||
| by sha1 differs from that of the same object named by newhash, because | ||||
| an object named by newhash-name refers to other objects by their | ||||
| newhash-names and an object named by sha1-name refers to other objects | ||||
| by their sha1-names. | ||||
|  | ||||
| The newhash-content of an object is the same as its sha1-content, except | ||||
| that objects referenced by the object are named using their newhash-names | ||||
| instead of sha1-names. Because a blob object does not refer to any | ||||
| other object, its sha1-content and newhash-content are the same. | ||||
|  | ||||
| The format allows round-trip conversion between newhash-content and | ||||
| sha1-content. | ||||
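
A minimal sketch of the round trip for a commit object, assuming the
name mapping is available as a plain dictionary. The function
`convert_commit` is hypothetical and rewrites only the "tree" and
"parent" header lines; tree and tag objects, which store references
differently, are not handled.

```python
def convert_commit(content: bytes, mapping: dict[bytes, bytes]) -> bytes:
    """Rewrite object names in a commit's tree/parent headers via mapping."""
    header, sep, body = content.partition(b"\n\n")
    out = []
    for line in header.split(b"\n"):
        key, _, value = line.partition(b" ")
        if key in (b"tree", b"parent"):
            # Every referenced object must already be in the mapping.
            line = key + b" " + mapping[value]
        out.append(line)
    # The commit message is copied through untouched.
    return b"\n".join(out) + sep + body


# Made-up names for illustration only.
sha1_to_newhash = {b"1" * 40: b"a" * 64, b"2" * 40: b"b" * 64}
newhash_to_sha1 = {v: k for k, v in sha1_to_newhash.items()}

sha1_content = (
    b"tree " + b"1" * 40 + b"\n"
    b"parent " + b"2" * 40 + b"\n"
    b"author A U Thor <author@example.com> 0 +0000\n"
    b"\n"
    b"example message\n"
)
newhash_content = convert_commit(sha1_content, sha1_to_newhash)
round_tripped = convert_commit(newhash_content, newhash_to_sha1)
```

Applying the inverse mapping to the converted content yields the
original bytes again, which is the property the format relies on.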
|  | ||||
| Object storage | ||||
| ~~~~~~~~~~~~~~ | ||||
| Loose objects use zlib compression and packed objects use the packed | ||||
| format described in Documentation/technical/pack-format.txt, just like | ||||
| today. The content that is compressed and stored uses newhash-content | ||||
| instead of sha1-content. | ||||
|  | ||||
| Pack index | ||||
| ~~~~~~~~~~ | ||||
| Pack index (.idx) files use a new v3 format that supports multiple | ||||
| hash functions. They have the following format (all integers are in | ||||
| network byte order): | ||||
|  | ||||
| - A header appears at the beginning and consists of the following: | ||||
|   - The 4-byte pack index signature: '\377tOc' | ||||
|   - 4-byte version number: 3 | ||||
|   - 4-byte length of the header section, including the signature and | ||||
|     version number | ||||
|   - 4-byte number of objects contained in the pack | ||||
|   - 4-byte number of object formats in this pack index: 2 | ||||
|   - For each object format: | ||||
|     - 4-byte format identifier (e.g., 'sha1' for SHA-1) | ||||
|     - 4-byte length in bytes of shortened object names. This is the | ||||
|       shortest possible length needed to make names in the shortened | ||||
|       object name table unambiguous. | ||||
|     - 4-byte integer, recording where tables relating to this format | ||||
|       are stored in this index file, as an offset from the beginning. | ||||
|   - 4-byte offset to the trailer from the beginning of this file. | ||||
|   - Zero or more additional key/value pairs (4-byte key, 4-byte | ||||
|     value). Only one key is supported: 'PSRC'. See the "Loose objects | ||||
|     and unreachable objects" section for supported values and how this | ||||
|     is used.  All other keys are reserved. Readers must ignore | ||||
|     unrecognized keys. | ||||
| - Zero or more NUL bytes. This can optionally be used to improve the | ||||
|   alignment of the full object name table below. | ||||
| - Tables for the first object format: | ||||
|   - A sorted table of shortened object names.  These are prefixes of | ||||
|     the names of all objects in this pack file, packed together | ||||
|     without offset values to reduce the cache footprint of the binary | ||||
|     search for a specific object name. | ||||
|  | ||||
|   - A table of full object names in pack order. This allows resolving | ||||
|     a reference to "the nth object in the pack file" (from a | ||||
|     reachability bitmap or from the next table of another object | ||||
|     format) to its object name. | ||||
|  | ||||
|   - A table of 4-byte values mapping object name order to pack order. | ||||
|     For an object in the table of sorted shortened object names, the | ||||
|     value at the corresponding index in this table is the index in the | ||||
|     previous table for that same object. | ||||
|  | ||||
|     This can be used to look up the object in reachability bitmaps or | ||||
|     to look up its name in another object format. | ||||
|  | ||||
|   - A table of 4-byte CRC32 values of the packed object data, in the | ||||
|     order that the objects appear in the pack file. This is to allow | ||||
|     compressed data to be copied directly from pack to pack during | ||||
|     repacking without undetected data corruption. | ||||
|  | ||||
|   - A table of 4-byte offset values. For an object in the table of | ||||
|     sorted shortened object names, the value at the corresponding | ||||
|     index in this table indicates where that object can be found in | ||||
|     the pack file. These are usually 31-bit pack file offsets, but | ||||
|     large offsets are encoded as an index into the next table with the | ||||
|     most significant bit set. | ||||
|  | ||||
|   - A table of 8-byte offset entries (empty for pack files less than | ||||
|     2 GiB). Pack files are organized with heavily used objects toward | ||||
|     the front, so most object references should not need to refer to | ||||
|     this table. | ||||
| - Zero or more NUL bytes. | ||||
| - Tables for the second object format, with the same layout as above, | ||||
|   up to and not including the table of CRC32 values. | ||||
| - Zero or more NUL bytes. | ||||
| - The trailer consists of the following: | ||||
|   - A copy of the 20-byte NewHash checksum at the end of the | ||||
|     corresponding packfile. | ||||
|  | ||||
|   - 20-byte NewHash checksum of all of the above. | ||||
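
To illustrate how the two offset tables described above fit together,
here is a small sketch that resolves an entry of the 4-byte offset table
to a pack file offset. The function name and the in-memory `bytes`
arguments are assumptions for the example; a real reader would work on
the mapped index file.

```python
import struct


def resolve_offset(index: int, offsets32: bytes, offsets64: bytes) -> int:
    """Return the pack file offset stored at `index` of the 4-byte table."""
    # All integers are in network byte order (big-endian).
    (value,) = struct.unpack_from(">I", offsets32, 4 * index)
    if value & 0x80000000:
        # Most significant bit set: the low 31 bits index the 8-byte table.
        large_index = value & 0x7FFFFFFF
        (value,) = struct.unpack_from(">Q", offsets64, 8 * large_index)
    return value


# Two objects: one at offset 12, one beyond 2 GiB stored in the 8-byte table.
offsets32 = struct.pack(">II", 12, 0x80000000 | 0)
offsets64 = struct.pack(">Q", 5 * 2**30)
small = resolve_offset(0, offsets32, offsets64)
large = resolve_offset(1, offsets32, offsets64)
```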
|  | ||||
| Loose object index | ||||
| ~~~~~~~~~~~~~~~~~~ | ||||
| A new file $GIT_OBJECT_DIR/loose-object-idx contains information about | ||||
| all loose objects. Its format is | ||||
|  | ||||
|   # loose-object-idx | ||||
|   (newhash-name SP sha1-name LF)* | ||||
|  | ||||
| where the object names are in hexadecimal format. The file is not | ||||
| sorted. | ||||
|  | ||||
| The loose object index is protected against concurrent writes by a | ||||
| lock file $GIT_OBJECT_DIR/loose-object-idx.lock. To add a new loose | ||||
| object: | ||||
|  | ||||
| 1. Write the loose object to a temporary file, like today. | ||||
| 2. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the lock. | ||||
| 3. Rename the loose object into place. | ||||
| 4. Open loose-object-idx with O_APPEND and write the new object's entry. | ||||
| 5. Unlink loose-object-idx.lock to release the lock. | ||||
|  | ||||
| To remove entries (e.g. in "git pack-refs" or "git prune"): | ||||
|  | ||||
| 1. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the | ||||
|    lock. | ||||
| 2. Write the new content to loose-object-idx.lock. | ||||
| 3. Unlink any loose objects being removed. | ||||
| 4. Rename to replace loose-object-idx, releasing the lock. | ||||
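
The steps for adding a loose object can be sketched as follows. This is
an illustration under assumptions, not Git's implementation: the
function name and arguments are hypothetical, the caller is assumed to
have already written the temporary file (step 1), and handling of the
"# loose-object-idx" header line is omitted.

```python
import os


def add_loose_object(objdir, tmp_path, final_path, newhash_name, sha1_name):
    """Move a loose object into place and record it in loose-object-idx."""
    idx = os.path.join(objdir, "loose-object-idx")
    lock = idx + ".lock"
    # Step 2: O_CREAT | O_EXCL raises FileExistsError if the lock is held.
    lock_fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o666)
    try:
        os.close(lock_fd)
        # Step 3: rename the loose object into place.
        os.rename(tmp_path, final_path)
        # Step 4: append the new entry (O_CREAT is a convenience of this sketch).
        fd = os.open(idx, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o666)
        try:
            os.write(fd, f"{newhash_name} {sha1_name}\n".encode("ascii"))
        finally:
            os.close(fd)
    finally:
        # Step 5: release the lock.
        os.unlink(lock)
```

A second writer that finds the lock file present fails at step 2 and
has to retry or give up; the sketch leaves that policy to the caller.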
|  | ||||
| Translation table | ||||
| ~~~~~~~~~~~~~~~~~ | ||||
| The index files support a bidirectional mapping between sha1-names | ||||
| and newhash-names. The lookup proceeds similarly to ordinary object | ||||
| lookups. For example, to convert a sha1-name to a newhash-name: | ||||
|  | ||||
|  1. Look for the object in idx files. If a match is present in the | ||||
|     idx's sorted list of truncated sha1-names, then: | ||||
|     a. Read the corresponding entry in the sha1-name order to pack | ||||
|        name order mapping. | ||||
|     b. Read the corresponding entry in the full sha1-name table to | ||||
|        verify we found the right object. If it is, then | ||||
|     c. Read the corresponding entry in the full newhash-name table. | ||||
|        That is the object's newhash-name. | ||||
|  2. Check for a loose object. Read lines from loose-object-idx until | ||||
|     we find a match. | ||||
|  | ||||
| Step (1) takes the same amount of time as an ordinary object lookup: | ||||
| O(number of packs * log(objects per pack)). Step (2) takes O(number of | ||||
| loose objects) time. To maintain good performance it will be necessary | ||||
| to keep the number of loose objects low. See the "Loose objects and | ||||
| unreachable objects" section below for more details. | ||||
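
The lookup can be modelled in memory as below. The class `PackIdx` and
the function `lookup` are hypothetical names, the tables are Python
lists rather than on-disk structures, and the sketch assumes (as the
index format requires) that the shortened names are long enough to be
unambiguous within a pack.

```python
import bisect


class PackIdx:
    """In-memory stand-in for the per-format tables of one idx file."""

    def __init__(self, pairs, short_len=4):
        # pairs: (sha1_name, newhash_name) tuples in pack order.
        self.short_len = short_len
        self.full_sha1 = [s for s, _ in pairs]
        self.full_newhash = [n for _, n in pairs]
        order = sorted(range(len(pairs)), key=lambda i: self.full_sha1[i])
        self.short_sha1 = [self.full_sha1[i][:short_len] for i in order]
        self.name_to_pack = order  # name order -> pack order

    def sha1_to_newhash(self, sha1_name):
        prefix = sha1_name[: self.short_len]
        pos = bisect.bisect_left(self.short_sha1, prefix)  # binary search
        if pos == len(self.short_sha1) or self.short_sha1[pos] != prefix:
            return None
        pack_pos = self.name_to_pack[pos]          # step 1a
        if self.full_sha1[pack_pos] != sha1_name:  # step 1b
            return None
        return self.full_newhash[pack_pos]         # step 1c


def lookup(sha1_name, packs, loose_idx_lines):
    for pack in packs:                 # step 1: O(packs * log(objects))
        found = pack.sha1_to_newhash(sha1_name)
        if found is not None:
            return found
    for line in loose_idx_lines:       # step 2: O(loose objects)
        if line.startswith("#"):
            continue
        newhash_name, _, sha1 = line.strip().partition(" ")
        if sha1 == sha1_name:
            return newhash_name
    return None
```

The linear scan in step 2 is what makes a large number of loose objects
expensive.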
|  | ||||
| Since all operations that make new objects (e.g., "git commit") add | ||||
| the new objects to the corresponding index, this mapping is possible | ||||
| for all objects in the object store. | ||||
|  | ||||
| Reading an object's sha1-content | ||||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||||
| The sha1-content of an object can be read by converting all newhash-names | ||||
| its newhash-content references to sha1-names using the translation table. | ||||
|  | ||||
| Fetch | ||||
| ~~~~~ | ||||
| Fetching from a SHA-1 based server requires translating between SHA-1 | ||||
| and NewHash based representations on the fly. | ||||
|  | ||||
| SHA-1s named in the ref advertisement that are present on the client | ||||
| can be translated to NewHash and looked up as local objects using the | ||||
| translation table. | ||||
|  | ||||
| Negotiation proceeds as today. Any "have"s generated locally are | ||||
| converted to SHA-1 before being sent to the server, and SHA-1s | ||||
| mentioned by the server are converted to NewHash when looking them up | ||||
| locally. | ||||
|  | ||||
| After negotiation, the server sends a packfile containing the | ||||
| requested objects. We convert the packfile to NewHash format using | ||||
| the following steps: | ||||
|  | ||||
| 1. index-pack: inflate each object in the packfile and compute its | ||||
|    SHA-1. Objects can contain deltas in OBJ_REF_DELTA format against | ||||
|    objects the client has locally. These objects can be looked up | ||||
|    using the translation table and their sha1-content read as | ||||
|    described above to resolve the deltas. | ||||
| 2. topological sort: starting at the "want"s from the negotiation | ||||
|    phase, walk through objects in the pack and emit a list of them, | ||||
|    excluding blobs, in reverse topologically sorted order, with each | ||||
|    object coming later in the list than all objects it references. | ||||
|    (This list only contains objects reachable from the "wants". If the | ||||
|    pack from the server contained additional extraneous objects, then | ||||
|    they will be discarded.) | ||||
| 3. convert to newhash: open a new (newhash) packfile. Read the topologically | ||||
|    sorted list just generated. For each object, inflate its | ||||
|    sha1-content, convert to newhash-content, and write it to the newhash | ||||
|    pack. Record the new sha1<->newhash mapping entry for use in the idx. | ||||
| 4. sort: reorder entries in the new pack to match the order of objects | ||||
|    in the pack the server generated and include blobs. Write a newhash idx | ||||
|    file. | ||||
| 5. clean up: remove the SHA-1 based pack file, index, and | ||||
|    topologically sorted list obtained from the server in steps 1 | ||||
|    and 2. | ||||
|  | ||||
| Step 3 requires every object referenced by the new object to be in the | ||||
| translation table. This is why the topological sort step is necessary. | ||||
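
The ordering required by step 2 can be sketched as a post-order
depth-first walk. The function name and the `refs` dictionary are
assumptions for illustration: `refs` maps each non-blob object in the
received pack to the names it references, and anything absent from it
(blobs, or objects the client already has) is skipped.

```python
def conversion_order(wants, refs):
    """List non-blob pack objects so each follows everything it references."""
    order = []
    seen = set()
    for want in wants:
        if want in seen or want not in refs:
            continue
        seen.add(want)
        stack = [(want, iter(refs[want]))]
        while stack:
            name, children = stack[-1]
            for child in children:
                if child in refs and child not in seen:
                    # Descend into an unvisited referenced object first.
                    seen.add(child)
                    stack.append((child, iter(refs[child])))
                    break
            else:
                # All references handled: this object can now be converted.
                stack.pop()
                order.append(name)
    return order


refs = {
    "commit2": ["tree2", "commit1"],
    "commit1": ["tree1"],
    "tree1": ["blob1"],
    "tree2": ["blob1", "blob2"],
    "extra": [],  # sent by the server but not reachable from the wants
}
order = conversion_order(["commit2"], refs)
```

Objects that are not reachable from the "want"s, such as "extra" above,
never enter the list and are therefore dropped during conversion.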
|  | ||||
| As an optimization, step 1 could write a file describing what non-blob | ||||
| objects each object it has inflated from the packfile references. This | ||||
| makes the topological sort in step 2 possible without inflating the | ||||
| objects in the packfile for a second time. The objects need to be | ||||
| inflated again in step 3, for a total of two inflations. | ||||
|  | ||||
| Step 4 is probably necessary for good read-time performance. "git | ||||
| pack-objects" on the server optimizes the pack file for good data | ||||
| locality (see Documentation/technical/pack-heuristics.txt). | ||||
|  | ||||
| Details of this process are likely to change. It will take some | ||||
| experimenting to get this to perform well. | ||||
|  | ||||
| Push | ||||
| ~~~~ | ||||
| Push is simpler than fetch because the objects referenced by the | ||||
| pushed objects are already in the translation table. The sha1-content | ||||
| of each object being pushed can be read as described in the "Reading | ||||
| an object's sha1-content" section to generate the pack written by git | ||||
| send-pack. | ||||
|  | ||||
| Signed Commits | ||||
| ~~~~~~~~~~~~~~ | ||||
| We add a new field "gpgsig-newhash" to the commit object format to allow | ||||
| signing commits without relying on SHA-1. It is similar to the | ||||
| existing "gpgsig" field. Its signed payload is the newhash-content of the | ||||
| commit object with any "gpgsig" and "gpgsig-newhash" fields removed. | ||||
|  | ||||
| This means commits can be signed | ||||
| 1. using SHA-1 only, as in existing signed commit objects | ||||
| 2. using both SHA-1 and NewHash, by using both gpgsig-newhash and gpgsig | ||||
|    fields. | ||||
| 3. using only NewHash, by only using the gpgsig-newhash field. | ||||
|  | ||||
| Old versions of "git verify-commit" can verify the gpgsig signature in | ||||
| cases (1) and (2) without modifications and view case (3) as an | ||||
| ordinary unsigned commit. | ||||
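
One way to picture the signed payload is as the commit content with
both signature fields filtered out. The sketch below is hypothetical
code, not Git's; it assumes header continuation lines start with a
space, as they do for the existing "gpgsig" field.

```python
SIG_FIELDS = (b"gpgsig", b"gpgsig-newhash")


def signed_payload(commit: bytes) -> bytes:
    """Return commit content with gpgsig and gpgsig-newhash fields removed."""
    header, sep, body = commit.partition(b"\n\n")
    kept = []
    skipping = False
    for line in header.split(b"\n"):
        if not line.startswith(b" "):
            # A new header field starts; decide whether to drop it.
            skipping = line.split(b" ", 1)[0] in SIG_FIELDS
        if not skipping:
            kept.append(line)
    return b"\n".join(kept) + sep + body


commit = (
    b"tree " + b"a" * 64 + b"\n"
    b"author A U Thor <author@example.com> 0 +0000\n"
    b"gpgsig -----BEGIN PGP SIGNATURE-----\n"
    b" \n"
    b" placeholder\n"
    b" -----END PGP SIGNATURE-----\n"
    b"gpgsig-newhash -----BEGIN PGP SIGNATURE-----\n"
    b" placeholder\n"
    b" -----END PGP SIGNATURE-----\n"
    b"committer C O Mitter <committer@example.com> 0 +0000\n"
    b"\n"
    b"example message\n"
)
payload = signed_payload(commit)
```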
|  | ||||
| Signed Tags | ||||
| ~~~~~~~~~~~ | ||||
| We add a new field "gpgsig-newhash" to the tag object format to allow | ||||
| signing tags without relying on SHA-1. Its signed payload is the | ||||
| newhash-content of the tag with its gpgsig-newhash field and "-----BEGIN PGP | ||||
| SIGNATURE-----" delimited in-body signature removed. | ||||
|  | ||||
| This means tags can be signed | ||||
| 1. using SHA-1 only, as in existing signed tag objects | ||||
| 2. using both SHA-1 and NewHash, by using gpgsig-newhash and an in-body | ||||
|    signature. | ||||
| 3. using only NewHash, by only using the gpgsig-newhash field. | ||||
|  | ||||
| Mergetag embedding | ||||
| ~~~~~~~~~~~~~~~~~~ | ||||
| The mergetag field in the sha1-content of a commit contains the | ||||
| sha1-content of a tag that was merged by that commit. | ||||
|  | ||||
| The mergetag field in the newhash-content of the same commit contains the | ||||
| newhash-content of the same tag. | ||||
|  | ||||
| Submodules | ||||
| ~~~~~~~~~~ | ||||
| To convert recorded submodule pointers, you need to have the converted | ||||
| submodule repository in place. The translation table of the submodule | ||||
| can be used to look up the new hash. | ||||
|  | ||||
| Loose objects and unreachable objects | ||||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||||
| Fast lookups in the loose-object-idx require that the number of loose | ||||
| objects not grow too high. | ||||
|  | ||||
| "git gc --auto" currently waits for there to be 6700 loose objects | ||||
| present before consolidating them into a packfile. We will need to | ||||
| measure to find a more appropriate threshold for it to use. | ||||
|  | ||||
| "git gc --auto" currently waits for there to be 50 packs present | ||||
| before combining packfiles. Packing loose objects more aggressively | ||||
| may cause the number of pack files to grow too quickly. This can be | ||||
| mitigated by using a strategy similar to Martin Fick's exponential | ||||
| rolling garbage collection script: | ||||
| https://gerrit-review.googlesource.com/c/gerrit/+/35215 | ||||
|  | ||||
| "git gc" currently expels any unreachable objects it encounters in | ||||
| pack files to loose objects in an attempt to prevent a race when | ||||
| pruning them (in case another process is simultaneously writing a new | ||||
| object that refers to the about-to-be-deleted object). This leads to | ||||
| an explosion in the number of loose objects present and disk space | ||||
| usage due to the objects in delta form being replaced with independent | ||||
| loose objects.  Worse, the race is still present for loose objects. | ||||
|  | ||||
| Instead, "git gc" will need to move unreachable objects to a new | ||||
| packfile marked as UNREACHABLE_GARBAGE (using the PSRC field; see | ||||
| below). To avoid the race when writing new objects referring to an | ||||
| about-to-be-deleted object, code paths that write new objects will | ||||
| need to copy any objects from UNREACHABLE_GARBAGE packs that they | ||||
| refer to to new, non-UNREACHABLE_GARBAGE packs (or loose objects). | ||||
| UNREACHABLE_GARBAGE packs are then safe to delete if their creation time (as | ||||
| indicated by the file's mtime) is long enough ago. | ||||
|  | ||||
| To avoid a proliferation of UNREACHABLE_GARBAGE packs, they can be | ||||
| combined under certain circumstances. If "gc.garbageTtl" is set to | ||||
| greater than one day, then packs created within a single calendar day, | ||||
| UTC, can be coalesced together. The resulting packfile would have an | ||||
| mtime before midnight on that day, so this makes the effective maximum | ||||
| ttl the garbageTtl + 1 day. If "gc.garbageTtl" is less than one day, | ||||
| then we divide the calendar day into intervals one-third of that ttl | ||||
| in duration. Packs created within the same interval can be coalesced | ||||
| together. The resulting packfile would have an mtime before the end of | ||||
| the interval, so this makes the effective maximum ttl equal to the | ||||
| garbageTtl * 4/3. | ||||
|  | ||||
| This rule comes from Thirumala Reddy Mutchukota's JGit change | ||||
| https://git.eclipse.org/r/90465. | ||||
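
The coalescing rule can be expressed as a bucketing function: two
UNREACHABLE_GARBAGE packs may be combined when they fall into the same
bucket. This is a sketch with a hypothetical function name; times are
Unix seconds in UTC, and a ttl of exactly one day, which the text does
not cover, is treated here like the longer case.

```python
DAY = 24 * 60 * 60


def coalesce_bucket(mtime: int, ttl: int) -> tuple[int, int]:
    """Return an identifier for the interval a garbage pack belongs to."""
    day = mtime // DAY
    if ttl >= DAY:
        # One bucket per UTC calendar day.
        return (day, 0)
    # Divide the day into intervals one-third of the ttl in duration.
    interval = max(ttl // 3, 1)
    return (day, (mtime % DAY) // interval)


# With a 3-hour ttl the intervals are one hour long.
same = coalesce_bucket(600, 3 * 3600) == coalesce_bucket(3000, 3 * 3600)
different = coalesce_bucket(600, 3 * 3600) != coalesce_bucket(4200, 3 * 3600)
```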
|  | ||||
| The UNREACHABLE_GARBAGE setting goes in the PSRC field of the pack | ||||
| index. More generally, that field indicates where a pack came from: | ||||
|  | ||||
|  - 1 (PACK_SOURCE_RECEIVE) for a pack received over the network | ||||
|  - 2 (PACK_SOURCE_AUTO) for a pack created by a lightweight | ||||
|    "gc --auto" operation | ||||
|  - 3 (PACK_SOURCE_GC) for a pack created by a full gc | ||||
|  - 4 (PACK_SOURCE_UNREACHABLE_GARBAGE) for potential garbage | ||||
|    discovered by gc | ||||
|  - 5 (PACK_SOURCE_INSERT) for locally created objects that were | ||||
|    written directly to a pack file, e.g. from "git add ." | ||||
|  | ||||
| This information can be useful for debugging and for "gc --auto" to | ||||
| make appropriate choices about which packs to coalesce. | ||||
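
The PSRC values listed above map naturally onto an enumeration. The
sketch below pairs them with the deletion rule described earlier; the
helper `safe_to_delete` and the use of a single ttl value as the age
threshold are assumptions for illustration.

```python
from enum import IntEnum


class PackSource(IntEnum):
    RECEIVE = 1              # received over the network
    AUTO = 2                 # created by a lightweight "gc --auto"
    GC = 3                   # created by a full gc
    UNREACHABLE_GARBAGE = 4  # potential garbage discovered by gc
    INSERT = 5               # locally created objects written to a pack


def safe_to_delete(source: int, mtime: int, now: int, ttl: int) -> bool:
    """Only sufficiently old UNREACHABLE_GARBAGE packs may be deleted."""
    return source == PackSource.UNREACHABLE_GARBAGE and now - mtime > ttl
```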
|  | ||||
| Caveats | ||||
| ------- | ||||
| Invalid objects | ||||
| ~~~~~~~~~~~~~~~ | ||||
| The conversion from sha1-content to newhash-content retains any | ||||
| brokenness in the original object (e.g., tree entry modes encoded with | ||||
| leading 0, tree objects whose paths are not sorted correctly, and | ||||
| commit objects without an author or committer). This is a deliberate | ||||
| feature of the design to allow the conversion to round-trip. | ||||
|  | ||||
| More profoundly broken objects (e.g., a commit with a truncated "tree" | ||||
| header line) cannot be converted but were not usable by current Git | ||||
| anyway. | ||||
|  | ||||
| Shallow clone and submodules | ||||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||||
| Because it requires all referenced objects to be available in the | ||||
| locally generated translation table, this design does not support | ||||
| shallow clone or unfetched submodules. Protocol improvements might | ||||
| allow lifting this restriction. | ||||
|  | ||||
| Alternates | ||||
| ~~~~~~~~~~ | ||||
| For the same reason, a newhash repository cannot borrow objects from a | ||||
| sha1 repository using objects/info/alternates or | ||||
| $GIT_ALTERNATE_OBJECT_REPOSITORIES. | ||||
|  | ||||
| git notes | ||||
| ~~~~~~~~~ | ||||
| The "git notes" tool annotates objects using their sha1-name as key. | ||||
| This design does not describe a way to migrate notes trees to use | ||||
| newhash-names. That migration is expected to happen separately (for | ||||
| example using a file at the root of the notes tree to describe which | ||||
| hash it uses). | ||||
|  | ||||
| Server-side cost | ||||
| ~~~~~~~~~~~~~~~~ | ||||
| Until Git protocol gains NewHash support, using NewHash based storage | ||||
| on public-facing Git servers is strongly discouraged. Once Git | ||||
| protocol gains NewHash support, NewHash based servers are likely not | ||||
| to support SHA-1 compatibility, to avoid what may be a very expensive | ||||
| hash reencode during clone and to encourage peers to modernize. | ||||
|  | ||||
| The design described here allows fetches by SHA-1 clients of a | ||||
| personal NewHash repository because it's not much more difficult than | ||||
| allowing pushes from that repository. This support needs to be guarded | ||||
| by a configuration option --- servers like git.kernel.org that serve a | ||||
| large number of clients would not be expected to bear that cost. | ||||
|  | ||||
| Meaning of signatures | ||||
| ~~~~~~~~~~~~~~~~~~~~~ | ||||
| The signed payload for signed commits and tags does not explicitly | ||||
| name the hash used to identify objects. If some day Git adopts a new | ||||
| hash function with the same length as the current SHA-1 (40 | ||||
| hexadecimal digit) or NewHash (64 hexadecimal digit) objects then the | ||||
| intent behind the PGP signed payload in an object signature is | ||||
| unclear: | ||||
|  | ||||
| 	object e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 | ||||
| 	type commit | ||||
| 	tag v2.12.0 | ||||
| 	tagger Junio C Hamano <gitster@pobox.com> 1487962205 -0800 | ||||
|  | ||||
| 	Git 2.12 | ||||
|  | ||||
| Does this mean Git v2.12.0 is the commit with sha1-name | ||||
| e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 or the commit with | ||||
| new-40-digit-hash-name e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7? | ||||
|  | ||||
| Fortunately NewHash and SHA-1 have different lengths. If Git starts | ||||
| using another hash with the same length to name objects, then it will | ||||
| need to change the format of signed payloads using that hash to | ||||
| address this issue. | ||||
|  | ||||
| Object names on the command line | ||||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||||
| To support the transition (see Transition plan below), this design | ||||
| supports four different modes of operation: | ||||
|  | ||||
|  1. ("dark launch") Treat object names input by the user as SHA-1 and | ||||
|     convert any object names written to output to SHA-1, but store | ||||
|     objects using NewHash.  This allows users to test the code with no | ||||
|     visible behavior change except for performance.  This allows | ||||
|     running even tests that assume the SHA-1 hash function, to | ||||
|     sanity-check the behavior of the new mode. | ||||
|  | ||||
|  2. ("early transition") Allow both SHA-1 and NewHash object names in | ||||
|     input. Any object names written to output use SHA-1. This allows | ||||
|     users to continue to make use of SHA-1 to communicate with peers | ||||
|     (e.g. by email) that have not migrated yet and prepares for mode 3. | ||||
|  | ||||
|  3. ("late transition") Allow both SHA-1 and NewHash object names in | ||||
|     input. Any object names written to output use NewHash. In this | ||||
|     mode, users are using a more secure object naming method by | ||||
|     default.  The disruption is minimal as long as most of their peers | ||||
|     are in mode 2 or mode 3. | ||||
|  | ||||
|  4. ("post-transition") Treat object names input by the user as | ||||
|     NewHash and write output using NewHash. This is safer than mode 3 | ||||
|     because there is less risk that input is incorrectly interpreted | ||||
|     using the wrong hash function. | ||||
|  | ||||
| The mode is specified in configuration. | ||||
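
The four modes can be summarized as a table of accepted input formats
and the output format. The sketch below is illustrative only: the
dictionary layout and function names are hypothetical, and it classifies
only full-length names (40 or 64 hexadecimal digits), leaving
abbreviated names aside.

```python
MODES = {
    1: {"name": "dark launch", "input": {"sha1"}, "output": "sha1"},
    2: {"name": "early transition", "input": {"sha1", "newhash"}, "output": "sha1"},
    3: {"name": "late transition", "input": {"sha1", "newhash"}, "output": "newhash"},
    4: {"name": "post-transition", "input": {"newhash"}, "output": "newhash"},
}


def full_name_format(name: str):
    """Classify a full-length hexadecimal object name by its length."""
    return {40: "sha1", 64: "newhash"}.get(len(name))


def accepts(mode: int, name: str) -> bool:
    """Whether a full-length object name is valid input in the given mode."""
    return full_name_format(name) in MODES[mode]["input"]
```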
|  | ||||
| The user can also explicitly specify which format to use for a | ||||
| particular revision specifier and for output, overriding the mode. For | ||||
| example: | ||||
|  | ||||
| 	git --output-format=sha1 log abac87a^{sha1}..f787cac^{newhash} | ||||
|  | ||||
| Selection of a New Hash | ||||
| ----------------------- | ||||
| In early 2005, around the time that Git was written, Xiaoyun Wang, | ||||
| Yiqun Lisa Yin, and Hongbo Yu announced an attack finding SHA-1 | ||||
| collisions in 2^69 operations. In August they published details. | ||||
| Luckily, no practical demonstrations of a collision in full SHA-1 were | ||||
| published until 12 years later, in 2017. | ||||
|  | ||||
| The hash function NewHash to replace SHA-1 should be stronger than | ||||
| SHA-1 was: we would like it to be trustworthy and useful in practice | ||||
| for at least 10 years. | ||||
|  | ||||
| Some other relevant properties: | ||||
|  | ||||
| 1. A 256-bit hash (long enough to match common security practice; not | ||||
|    so long as to hurt performance and disk usage). | ||||
|  | ||||
| 2. High quality implementations should be widely available (e.g. in | ||||
|    OpenSSL). | ||||
|  | ||||
| 3. The hash function's properties should match Git's needs (e.g. Git | ||||
|    requires collision and 2nd preimage resistance and does not require | ||||
|    length extension resistance). | ||||
|  | ||||
| 4. As a tiebreaker, the hash should be fast to compute (fortunately | ||||
|    many contenders are faster than SHA-1). | ||||
|  | ||||
| Some hashes under consideration are SHA-256, SHA-512/256, SHA-256x16, | ||||
| K12, and BLAKE2bp-256. | ||||
|  | ||||
| Transition plan | ||||
| --------------- | ||||
| Some initial steps can be implemented independently of one another: | ||||
| - adding a hash function API (vtable) | ||||
| - teaching fsck to tolerate the gpgsig-newhash field | ||||
| - excluding gpgsig-* from the fields copied by "git commit --amend" | ||||
| - annotating tests that depend on SHA-1 values with a SHA1 test | ||||
|   prerequisite | ||||
| - using "struct object_id", GIT_MAX_RAWSZ, and GIT_MAX_HEXSZ | ||||
|   consistently instead of "unsigned char *" and the hardcoded | ||||
|   constants 20 and 40. | ||||
| - introducing index v3 | ||||
| - adding support for the PSRC field and safer object pruning | ||||
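
To make the "hash function API (vtable)" step more concrete, here is a minimal sketch in Python rather than Git's C. It assumes a table of algorithm descriptors through which all hashing is dispatched. The names HashAlgo, ALGOS, MAX_RAWSZ, MAX_HEXSZ and object_name are hypothetical, and SHA-256 is only a stand-in for NewHash, which this document has not chosen.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class HashAlgo:
    """One entry of a hypothetical hash function table ("vtable")."""
    name: str
    rawsz: int      # digest length in bytes
    new: Callable   # returns a fresh hash context

    @property
    def hexsz(self):
        return 2 * self.rawsz


ALGOS = {
    "sha1": HashAlgo("sha1", 20, hashlib.sha1),
    # Assumption: SHA-256 stands in for NewHash, which is not yet chosen.
    "newhash": HashAlgo("newhash", 32, hashlib.sha256),
}

# Sizes large enough for any supported algorithm, used instead of the
# hardcoded constants 20 and 40.
MAX_RAWSZ = max(a.rawsz for a in ALGOS.values())
MAX_HEXSZ = max(a.hexsz for a in ALGOS.values())


def object_name(algo, obj_type, content):
    """Name an object by hashing '<type> <size>' plus a NUL, then the content."""
    h = algo.new()
    h.update(b"%s %d\0" % (obj_type.encode(), len(content)))
    h.update(content)
    return h.hexdigest()
```

Callers select an entry once (for example ALGOS["newhash"]) and never mention a concrete hash function again, which is the property the vtable is meant to provide.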
|  | ||||
|  | ||||
| The first user-visible change is the introduction of the objectFormat | ||||
| extension (without compatObjectFormat). This requires: | ||||
| - implementing the loose-object-idx | ||||
| - teaching fsck about this mode of operation | ||||
| - using the hash function API (vtable) when computing object names | ||||
| - signing objects and verifying signatures | ||||
| - rejecting attempts to fetch from or push to an incompatible | ||||
|   repository | ||||
|  | ||||
| Next comes introduction of compatObjectFormat: | ||||
| - translating object names between object formats | ||||
| - translating object content between object formats | ||||
| - generating and verifying signatures in the compat format | ||||
| - adding appropriate index entries when adding a new object to the | ||||
|   object store | ||||
| - --output-format option | ||||
| - ^{sha1} and ^{newhash} revision notation | ||||
| - configuration to specify default input and output format (see | ||||
|   "Object names on the command line" above) | ||||
|  | ||||
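The two translation steps listed for compatObjectFormat can be sketched as follows, assuming a two-way table of object names and a commit whose "tree" and "parent" header lines refer to other objects by name. TranslationTable and convert_commit are hypothetical names. The sketch leaves every other header line untouched, so it ignores signatures, and it does not cover tree or tag objects.

```python
class TranslationTable:
    """Hypothetical two-way map between sha1-names and newhash-names."""

    def __init__(self):
        self._to_newhash = {}
        self._to_sha1 = {}

    def add(self, sha1_name, newhash_name):
        self._to_newhash[sha1_name] = newhash_name
        self._to_sha1[newhash_name] = sha1_name

    def lookup(self, name, target):
        table = self._to_sha1 if target == "sha1" else self._to_newhash
        return table[name]  # KeyError: referenced object not converted yet


def convert_commit(content, table, target):
    """Rewrite the 'tree' and 'parent' header lines of a commit to `target`."""
    header, sep, message = content.partition("\n\n")
    lines = []
    for line in header.split("\n"):
        field, _, value = line.partition(" ")
        if field in ("tree", "parent"):
            line = f"{field} {table.lookup(value, target)}"
        lines.append(line)
    return "\n".join(lines) + sep + message
```

Because lookup fails for names that are not in the table yet, referenced objects have to be converted before the objects that refer to them.
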
| The next step is supporting fetches and pushes to SHA-1 repositories: | ||||
| - allow pushes to a repository using the compat format | ||||
| - generate a topologically sorted list of the SHA-1 names of fetched | ||||
|   objects | ||||
| - convert the fetched packfile to newhash format and generate an idx | ||||
|   file | ||||
| - re-sort to match the order of objects in the fetched packfile | ||||
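
The sorting steps in this list can be illustrated with a small Python sketch. It assumes the references of each fetched object are already known as a mapping from object name to the names it refers to. conversion_order and resort_to_pack_order are hypothetical names, and graphlib requires Python 3.9 or later. Blobs refer to nothing, so they could be left out of the sort, as the document history notes.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def conversion_order(refs):
    """Order names so every object comes after the objects it references.

    `refs` maps each fetched object's name to the names it references.
    """
    fetched = set(refs)
    # Only edges between fetched objects matter; anything else is assumed
    # to be in the local translation table already.
    graph = {name: [r for r in deps if r in fetched]
             for name, deps in refs.items()}
    # TopologicalSorter treats the mapped values as predecessors, so
    # referenced objects are yielded before the objects that refer to them.
    return list(TopologicalSorter(graph).static_order())


def resort_to_pack_order(converted, pack_order):
    """Return converted entries in the order the objects had in the pack."""
    return [converted[name] for name in pack_order]
```

Converting in this order means the name of every referenced object is available when the referring object is rewritten; the final re-sort restores the layout of the fetched packfile.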
|  | ||||
| The infrastructure supporting fetch also allows converting an existing | ||||
| repository. In converted repositories and new clones, end users can | ||||
| gain support for the new hash function without any visible change in | ||||
| behavior (see "dark launch" in the "Object names on the command line" | ||||
| section). In particular this allows users to verify NewHash signatures | ||||
| on objects in the repository, and it should ensure the transition code | ||||
| is stable in production in preparation for using it more widely. | ||||
|  | ||||
| Over time projects would encourage their users to adopt the "early | ||||
| transition" and then "late transition" modes to take advantage of the | ||||
| new, more futureproof NewHash object names. | ||||
|  | ||||
| When objectFormat and compatObjectFormat are both set, commands | ||||
| generating signatures would generate both SHA-1 and NewHash signatures | ||||
| by default to support both new and old users. | ||||
|  | ||||
| In projects using NewHash heavily, users could be encouraged to adopt | ||||
| the "post-transition" mode to avoid accidentally making implicit use | ||||
| of SHA-1 object names. | ||||
|  | ||||
| Once a critical mass of users have upgraded to a version of Git that | ||||
| can verify NewHash signatures and have converted their existing | ||||
| repositories to support verifying them, we can add support for a | ||||
| setting to generate only NewHash signatures. This is expected to be at | ||||
| least a year later. | ||||
|  | ||||
| That is also a good moment to advertise the ability to convert | ||||
| repositories to use NewHash only, stripping out all SHA-1 related | ||||
| metadata. This improves performance by eliminating translation | ||||
| overhead and security by avoiding the possibility of accidentally | ||||
| relying on the safety of SHA-1. | ||||
|  | ||||
| Updating Git's protocols to allow a server to specify which hash | ||||
| functions it supports is also an important part of this transition. It | ||||
| is not discussed in detail in this document but this transition plan | ||||
| assumes it happens. :) | ||||
|  | ||||
| Alternatives considered | ||||
| ----------------------- | ||||
| Upgrading everyone working on a particular project on a flag day | ||||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||||
| Projects like the Linux kernel are large and complex enough that | ||||
| flipping the switch for all projects based on the repository at once | ||||
| is infeasible. | ||||
|  | ||||
| Not only would all developers and server operators supporting | ||||
| developers have to switch on the same flag day, but supporting tooling | ||||
| (continuous integration, code review, bug trackers, etc.) would have to | ||||
| be adapted as well. This also makes it difficult to get early feedback | ||||
| from some project participants testing before it is time for mass | ||||
| adoption. | ||||
|  | ||||
| Using hash functions in parallel | ||||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||||
| (e.g. https://public-inbox.org/git/22708.8913.864049.452252@chiark.greenend.org.uk/ ) | ||||
| Objects newly created would be addressed by the new hash, but inside | ||||
| such an object (e.g. commit) it is still possible to address objects | ||||
| using the old hash function. | ||||
| * You cannot trust the history of such a repository (needed for | ||||
|   bisectability) in the future without further work | ||||
| * Maintenance burden as the number of supported hash functions grows | ||||
|   (they will never go away, so they accumulate). In this proposal, by | ||||
|   comparison, converted objects lose all references to SHA-1. | ||||
|  | ||||
| Signed objects with multiple hashes | ||||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||||
| Instead of introducing the gpgsig-newhash field in commit and tag objects | ||||
| for newhash-content based signatures, an earlier version of this design | ||||
| added "hash newhash <newhash-name>" fields to strengthen the existing | ||||
| sha1-content based signatures. | ||||
|  | ||||
| In other words, a single signature was used to attest to the object | ||||
| content using both hash functions. This had some advantages: | ||||
| * Using one signature instead of two speeds up the signing process. | ||||
| * Having one signed payload with both hashes allows the signer to | ||||
|   attest to the sha1-name and newhash-name referring to the same object. | ||||
| * All users consume the same signature. Broken signatures are likely | ||||
|   to be detected quickly using current versions of git. | ||||
|  | ||||
| However, it also came with disadvantages: | ||||
| * Verifying a signed object requires access to the sha1-names of all | ||||
|   objects it references, even after the transition is complete and | ||||
|   the translation table is no longer needed for anything else. To support | ||||
|   this, the design added fields such as "hash sha1 tree <sha1-name>" | ||||
|   and "hash sha1 parent <sha1-name>" to the newhash-content of a signed | ||||
|   commit, complicating the conversion process. | ||||
| * Allowing signed objects without a sha1 (for after the transition is | ||||
|   complete) complicated the design further, requiring a "nohash sha1" | ||||
|   field to suppress including "hash sha1" fields in the newhash-content | ||||
|   and signed payload. | ||||
|  | ||||
| Lazily populated translation table | ||||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||||
| Some of the work of building the translation table could be deferred to | ||||
| push time, but that would significantly complicate and slow down pushes. | ||||
| Calculating the sha1-name at object creation time at the same time it is | ||||
| being streamed to disk and having its newhash-name calculated should be | ||||
| an acceptable cost. | ||||
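
A minimal sketch of computing both names in one pass while an object is streamed, assuming a blob (whose content is the same under both formats) and using SHA-256 as a stand-in for NewHash. name_blob_both and its write callback are hypothetical. Trees, commits and tags embed other object names, so their sha1-content has to be produced by translation before it can be hashed.

```python
import hashlib


def name_blob_both(size, chunks, write=lambda chunk: None):
    """Stream a blob once and return its (sha1-name, newhash-name).

    `size` is the total content length, which the object header needs
    before any content has been seen.
    """
    header = b"blob %d\0" % size
    sha1 = hashlib.sha1(header)
    # Assumption: SHA-256 stands in for the undecided NewHash.
    newhash = hashlib.sha256(header)
    for chunk in chunks:
        write(chunk)            # e.g. append the chunk to the object on disk
        sha1.update(chunk)
        newhash.update(chunk)
    return sha1.hexdigest(), newhash.hexdigest()
```

Each chunk is read once and fed to both hash contexts, so the extra cost over computing only the newhash-name is the second hash update.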
|  | ||||
| Document History | ||||
| ---------------- | ||||
|  | ||||
| 2017-03-03 | ||||
| bmwill@google.com, jonathantanmy@google.com, jrnieder@gmail.com, | ||||
| sbeller@google.com | ||||
|  | ||||
| Initial version sent to | ||||
| http://public-inbox.org/git/20170304011251.GA26789@aiede.mtv.corp.google.com | ||||
|  | ||||
| 2017-03-03 jrnieder@gmail.com | ||||
| Incorporated suggestions from jonathantanmy and sbeller: | ||||
| * describe purpose of signed objects with each hash type | ||||
| * redefine signed object verification using object content under the | ||||
|   first hash function | ||||
|  | ||||
| 2017-03-06 jrnieder@gmail.com | ||||
| * Use SHA3-256 instead of SHA2 (thanks, Linus and brian m. carlson).[1][2] | ||||
| * Make sha3-based signatures a separate field, avoiding the need for | ||||
|   "hash" and "nohash" fields (thanks to peff[3]). | ||||
| * Add a sorting phase to fetch (thanks to Junio for noticing the need | ||||
|   for this). | ||||
| * Omit blobs from the topological sort during fetch (thanks to peff). | ||||
| * Discuss alternates, git notes, and git servers in the caveats | ||||
|   section (thanks to Junio Hamano, brian m. carlson[4], and Shawn | ||||
|   Pearce). | ||||
| * Clarify language throughout (thanks to various commenters, | ||||
|   especially Junio). | ||||
|  | ||||
| 2017-09-27 jrnieder@gmail.com, sbeller@google.com | ||||
| * use placeholder NewHash instead of SHA3-256 | ||||
| * describe criteria for picking a hash function | ||||
| * include a transition plan (thanks especially to Brandon Williams | ||||
|   for fleshing these ideas out) | ||||
| * define the translation table (thanks, Shawn Pearce[5], Jonathan | ||||
|   Tan, and Masaya Suzuki) | ||||
| * avoid loose object overhead by packing more aggressively in | ||||
|   "git gc --auto" | ||||
|  | ||||
| [1] http://public-inbox.org/git/CA+55aFzJtejiCjV0e43+9oR3QuJK2PiFiLQemytoLpyJWe6P9w@mail.gmail.com/ | ||||
| [2] http://public-inbox.org/git/CA+55aFz+gkAsDZ24zmePQuEs1XPS9BP_s8O7Q4wQ7LV7X5-oDA@mail.gmail.com/ | ||||
| [3] http://public-inbox.org/git/20170306084353.nrns455dvkdsfgo5@sigill.intra.peff.net/ | ||||
| [4] http://public-inbox.org/git/20170304224936.rqqtkdvfjgyezsht@genre.crustytoothpaste.net | ||||
| [5] https://public-inbox.org/git/CAJo=hJtoX9=AyLHHpUJS7fueV9ciZ_MNpnEPHUz8Whui6g9F0A@mail.gmail.com/ | ||||