docs: improve ambiguous areas of pack format documentation

It is fair to say that our pack and indexing code is quite complex.
Contributors who wish to work on this code, and authors of other
implementations, would benefit from clear, unambiguous documentation
about how our data formats are structured and encoded and what data is
used in the computation of certain values.  Unfortunately, some of this
documentation is missing, which leads to confusion and frustration.

Let's document some of this data to help clarify things.  Specify over
what data CRC32 values are computed and also note which CRC32 algorithm
is used, since Wikipedia mentions at least four 32-bit CRC algorithms
and notes that it's possible to use different bit orderings.

In addition, note how we encode objects in the pack.  One might be led
to believe that packed objects are always stored with the "<type>
<size>\0" prefix of loose objects, but that is not the case, although
for obvious reasons this data is included in the computation of the
object ID.  Explain why this is for the curious reader.

Finally, indicate what the size field of the packed object represents.
Otherwise, a reader might think that the size of a delta is the size of
the full object or that it might include the offset or object ID,
neither of which is the case.  State explicitly that the values
represent uncompressed sizes to avoid confusion.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
brian m. carlson 2025-10-09 21:56:21 +00:00 committed by Junio C Hamano
parent d477892b30
commit 24d46f8633
1 changed files with 19 additions and 0 deletions

@@ -32,6 +32,10 @@ In a repository using the traditional SHA-1, pack checksums, index checksums,
and object IDs (object names) mentioned below are all computed using SHA-1.
Similarly, in SHA-256 repositories, these values are computed using SHA-256.

CRC32 checksums are always computed over the entire packed object, including
the header (n-byte type and length); the base object name or offset, if any;
and the entire compressed object. The CRC32 algorithm used is that of zlib.
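
As a rough illustration of the paragraph above, the following Python sketch
feeds the three parts of a packed object entry to `zlib.crc32` in order. The
function name `entry_crc32` and the sample entry are hypothetical and are not
part of Git; only `zlib.crc32` and `zlib.compress` are real library calls.

```python
import zlib


def entry_crc32(header: bytes, base_ref: bytes, compressed: bytes) -> int:
    """CRC32 over the header, the optional base name/offset, and the data."""
    crc = zlib.crc32(header)
    crc = zlib.crc32(base_ref, crc)    # empty for undeltified objects
    crc = zlib.crc32(compressed, crc)  # the compressed bytes, not the raw ones
    return crc


# Hypothetical undeltified blob: type 3, size 12, so the header fits one byte.
raw = b"hello world\n"
header = bytes([(3 << 4) | len(raw)])
compressed = zlib.compress(raw)
crc = entry_crc32(header, b"", compressed)
```

Feeding the parts incrementally gives the same result as one call over their
concatenation, which is convenient when an entry is read from a pack in pieces.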

== pack-*.pack files have the following format:

- A header appears at the beginning and consists of the following:
@@ -80,6 +84,16 @@ Valid object types are:

Type 5 is reserved for future expansion. Type 0 is invalid.

=== Object encoding

Unlike loose objects, packed objects do not have a prefix containing the type,
size, and a NUL byte. These are not necessary because they can be determined by
the n-byte type and length that prefixes the data and so they are omitted from
the compressed and deltified data.

The computation of the object ID still uses this prefix by reconstructing it
from the type and length as needed.
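
The reconstruction described above can be sketched in Python as follows. The
helper name `object_id` and its parameters are assumptions made for this
example; it rebuilds the loose-object prefix from the type name and the
uncompressed length, then hashes the prefix together with the raw data.

```python
import hashlib


def object_id(obj_type: str, raw: bytes, algo: str = "sha1") -> str:
    """Hash "<type> <size>\\0" followed by the raw, uncompressed object."""
    prefix = f"{obj_type} {len(raw)}".encode("ascii") + b"\0"
    return hashlib.new(algo, prefix + raw).hexdigest()


# The prefix is rebuilt on demand; it is never read from the pack itself.
blob_id = object_id("blob", b"hello world\n")
blob_id_sha256 = object_id("blob", b"hello world\n", algo="sha256")
```

For a blob, the SHA-1 result should agree with what `git hash-object` reports
for the same content in a SHA-1 repository.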

=== Size encoding

This document uses the following "size encoding" of non-negative
@@ -92,6 +106,11 @@ values are more significant.
This size encoding should not be confused with the "offset encoding",
which is also used in this document.

When encoding the size of an undeltified object in a pack, the size is that of
the uncompressed raw object. For deltified objects, it is the size of the
uncompressed delta. The base object name or offset is not included in the size
computation.
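
To make the meaning of the size field concrete, here is a minimal Python sketch
that decodes the type and size from the start of an undeltified entry and
compares the size with the inflated data. The function `read_entry_header` and
the hand-built entry are hypothetical. A deltified entry would additionally
carry the base object name or offset between the header and the compressed
data, which this sketch does not handle.

```python
import zlib


def read_entry_header(buf: bytes, pos: int = 0):
    """Return (type, size, next position) for the entry starting at pos."""
    byte = buf[pos]
    pos += 1
    obj_type = (byte >> 4) & 0x07  # 3-bit type
    size = byte & 0x0F             # low 4 bits of the size
    shift = 4
    while byte & 0x80:             # MSB set: another size byte follows
        byte = buf[pos]
        pos += 1
        size |= (byte & 0x7F) << shift
        shift += 7
    return obj_type, size, pos


# Hand-built blob entry (type 3) for 1000 bytes of raw data.
raw = b"x" * 1000
entry = bytes([0xB8, 0x3E]) + zlib.compress(raw)
obj_type, size, pos = read_entry_header(entry)
inflated = zlib.decompress(entry[pos:])
```

Here `size` matches `len(inflated)`, not the number of compressed bytes stored
in the entry.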

=== Deltified representation

Conceptually there are only four object types: commit, tree, tag and