1099 lines
		
	
	
		
			37 KiB
		
	
	
	
		
			Plaintext
		
	
	
			
		
		
	
	
			1099 lines
		
	
	
		
			37 KiB
		
	
	
	
		
			Plaintext
		
	
	
| reftable
 | |
| --------
 | |
| 
 | |
| Overview
 | |
| ~~~~~~~~
 | |
| 
 | |
| Problem statement
 | |
| ^^^^^^^^^^^^^^^^^
 | |
| 
 | |
| Some repositories contain a lot of references (e.g. android at 866k,
 | |
| rails at 31k). The existing packed-refs format takes up a lot of space
 | |
| (e.g. 62M), and does not scale with additional references. Lookup of a
 | |
| single reference requires linearly scanning the file.
 | |
| 
 | |
| Atomic pushes modifying multiple references require copying the entire
 | |
| packed-refs file, which can be a considerable amount of data moved
 | |
| (e.g. 62M in, 62M out) for even small transactions (2 refs modified).
 | |
| 
 | |
| Repositories with many loose references occupy a large number of disk
 | |
| blocks from the local file system, as each reference is its own file
 | |
| storing 41 bytes (and another file for the corresponding reflog). This
 | |
| negatively affects the number of inodes available when a large number of
 | |
| repositories are stored on the same filesystem. Readers can be penalized
 | |
| due to the larger number of syscalls required to traverse and read the
 | |
| `$GIT_DIR/refs` directory.
 | |
| 
 | |
| 
 | |
| Objectives
 | |
| ^^^^^^^^^^
 | |
| 
 | |
| * Near constant time lookup for any single reference, even when the
 | |
| repository is cold and not in process or kernel cache.
 | |
| * Near constant time verification if an object name is referred to by at least
 | |
| one reference (for allow-tip-sha1-in-want).
 | |
| * Efficient enumeration of an entire namespace, such as `refs/tags/`.
 | |
| * Support atomic push with `O(size_of_update)` operations.
 | |
| * Combine reflog storage with ref storage for small transactions.
 | |
| * Separate reflog storage for base refs and historical logs.
 | |
| 
 | |
| Description
 | |
| ^^^^^^^^^^^
 | |
| 
 | |
| A reftable file is a portable binary file format customized for
 | |
| reference storage. References are sorted, enabling linear scans, binary
 | |
| search lookup, and range scans.
 | |
| 
 | |
| Storage in the file is organized into variable sized blocks. Prefix
 | |
| compression is used within a single block to reduce disk space. Block
 | |
| size and alignment are tunable by the writer.
 | |
| 
 | |
| Performance
 | |
| ^^^^^^^^^^^
 | |
| 
 | |
| Space used, packed-refs vs. reftable:
 | |
| 
 | |
| [cols=",>,>,>,>,>",options="header",]
 | |
| |===============================================================
 | |
| |repository |packed-refs |reftable |% original |avg ref |avg obj
 | |
| |android |62.2 M |36.1 M |58.0% |33 bytes |5 bytes
 | |
| |rails |1.8 M |1.1 M |57.7% |29 bytes |4 bytes
 | |
| |git |78.7 K |48.1 K |61.0% |50 bytes |4 bytes
 | |
| |git (heads) |332 b |269 b |81.0% |33 bytes |0 bytes
 | |
| |===============================================================
 | |
| 
 | |
| Scan (read 866k refs), by reference name lookup (single ref from 866k
 | |
| refs), and by SHA-1 lookup (refs with that SHA-1, from 866k refs):
 | |
| 
 | |
| [cols=",>,>,>,>",options="header",]
 | |
| |=========================================================
 | |
| |format |cache |scan |by name |by SHA-1
 | |
| |packed-refs |cold |402 ms |409,660.1 usec |412,535.8 usec
 | |
| |packed-refs |hot | |6,844.6 usec |20,110.1 usec
 | |
| |reftable |cold |112 ms |33.9 usec |323.2 usec
 | |
| |reftable |hot | |20.2 usec |320.8 usec
 | |
| |=========================================================
 | |
| 
 | |
| Space used for 149,932 log entries for 43,061 refs, reflog vs. reftable:
 | |
| 
 | |
| [cols=",>,>",options="header",]
 | |
| |================================
 | |
| |format |size |avg entry
 | |
| |$GIT_DIR/logs |173 M |1209 bytes
 | |
| |reftable |5 M |37 bytes
 | |
| |================================
 | |
| 
 | |
| Details
 | |
| ~~~~~~~
 | |
| 
 | |
| Peeling
 | |
| ^^^^^^^
 | |
| 
 | |
| References stored in a reftable are peeled, a record for an annotated
 | |
| (or signed) tag records both the tag object, and the object it refers
 | |
| to. This is analogous to storage in the packed-refs format.
 | |
| 
 | |
| Reference name encoding
 | |
| ^^^^^^^^^^^^^^^^^^^^^^^
 | |
| 
 | |
| Reference names are an uninterpreted sequence of bytes that must pass
 | |
| linkgit:git-check-ref-format[1] as a valid reference name.
 | |
| 
 | |
| Key unicity
 | |
| ^^^^^^^^^^^
 | |
| 
 | |
| Each entry must have a unique key; repeated keys are disallowed.
 | |
| 
 | |
| Network byte order
 | |
| ^^^^^^^^^^^^^^^^^^
 | |
| 
 | |
| All multi-byte, fixed width fields are in network byte order.
 | |
| 
 | |
| Varint encoding
 | |
| ^^^^^^^^^^^^^^^
 | |
| 
 | |
| Varint encoding is identical to the ofs-delta encoding method used
 | |
| within pack files.
 | |
| 
 | |
| Decoder works as follows:
 | |
| 
 | |
| ....
 | |
| val = buf[ptr] & 0x7f
 | |
| while (buf[ptr] & 0x80) {
 | |
|   ptr++
 | |
|   val = ((val + 1) << 7) | (buf[ptr] & 0x7f)
 | |
| }
 | |
| ....
 | |
| 
 | |
| Ordering
 | |
| ^^^^^^^^
 | |
| 
 | |
| Blocks are lexicographically ordered by their first reference.
 | |
| 
 | |
| Directory/file conflicts
 | |
| ^^^^^^^^^^^^^^^^^^^^^^^^
 | |
| 
 | |
| The reftable format accepts both `refs/heads/foo` and
 | |
| `refs/heads/foo/bar` as distinct references.
 | |
| 
 | |
| This property is useful for retaining log records in reftable, but may
 | |
| confuse versions of Git using `$GIT_DIR/refs` directory tree to maintain
 | |
| references. Users of reftable may choose to continue to reject `foo` and
 | |
| `foo/bar` type conflicts to prevent problems for peers.
 | |
| 
 | |
| File format
 | |
| ~~~~~~~~~~~
 | |
| 
 | |
| Structure
 | |
| ^^^^^^^^^
 | |
| 
 | |
| A reftable file has the following high-level structure:
 | |
| 
 | |
| ....
 | |
| first_block {
 | |
|   header
 | |
|   first_ref_block
 | |
| }
 | |
| ref_block*
 | |
| ref_index*
 | |
| obj_block*
 | |
| obj_index*
 | |
| log_block*
 | |
| log_index*
 | |
| footer
 | |
| ....
 | |
| 
 | |
| A log-only file omits the `ref_block`, `ref_index`, `obj_block` and
 | |
| `obj_index` sections, containing only the file header and log block:
 | |
| 
 | |
| ....
 | |
| first_block {
 | |
|   header
 | |
| }
 | |
| log_block*
 | |
| log_index*
 | |
| footer
 | |
| ....
 | |
| 
 | |
| In a log-only file, the first log block immediately follows the file
 | |
| header, without padding to block alignment.
 | |
| 
 | |
| Block size
 | |
| ^^^^^^^^^^
 | |
| 
 | |
| The file's block size is arbitrarily determined by the writer, and does
 | |
| not have to be a power of 2. The block size must be larger than the
 | |
| longest reference name or log entry used in the repository, as
 | |
| references cannot span blocks.
 | |
| 
 | |
| Powers of two that are friendly to the virtual memory system or
 | |
| filesystem (such as 4k or 8k) are recommended. Larger sizes (64k) can
 | |
| yield better compression, with a possible increased cost incurred by
 | |
| readers during access.
 | |
| 
 | |
| The largest block size is `16777215` bytes (15.99 MiB).
 | |
| 
 | |
| Block alignment
 | |
| ^^^^^^^^^^^^^^^
 | |
| 
 | |
| Writers may choose to align blocks at multiples of the block size by
 | |
| including `padding` filled with NUL bytes at the end of a block to round
 | |
| out to the chosen alignment. When alignment is used, writers must
 | |
| specify the alignment with the file header's `block_size` field.
 | |
| 
 | |
| Block alignment is not required by the file format. Unaligned files must
 | |
| set `block_size = 0` in the file header, and omit `padding`. Unaligned
 | |
| files with more than one ref block must include the link:#Ref-index[ref
 | |
| index] to support fast lookup. Readers must be able to read both aligned
 | |
| and non-aligned files.
 | |
| 
 | |
| Very small files (e.g. a single ref block) may omit `padding` and the ref
 | |
| index to reduce total file size.
 | |
| 
 | |
| Header (version 1)
 | |
| ^^^^^^^^^^^^^^^^^^
 | |
| 
 | |
| A 24-byte header appears at the beginning of the file:
 | |
| 
 | |
| ....
 | |
| 'REFT'
 | |
| uint8( version_number = 1 )
 | |
| uint24( block_size )
 | |
| uint64( min_update_index )
 | |
| uint64( max_update_index )
 | |
| ....
 | |
| 
 | |
| Aligned files must specify `block_size` to configure readers with the
 | |
| expected block alignment. Unaligned files must set `block_size = 0`.
 | |
| 
 | |
| The `min_update_index` and `max_update_index` describe bounds for the
 | |
| `update_index` field of all log records in this file. When reftables are
 | |
| used in a stack for link:#Update-transactions[transactions], these
 | |
| fields can order the files such that the prior file's
 | |
| `max_update_index + 1` is the next file's `min_update_index`.
 | |
| 
 | |
| Header (version 2)
 | |
| ^^^^^^^^^^^^^^^^^^
 | |
| 
 | |
| A 28-byte header appears at the beginning of the file:
 | |
| 
 | |
| ....
 | |
| 'REFT'
 | |
| uint8( version_number = 2 )
 | |
| uint24( block_size )
 | |
| uint64( min_update_index )
 | |
| uint64( max_update_index )
 | |
| uint32( hash_id )
 | |
| ....
 | |
| 
 | |
| The header is identical to `version_number=1`, with the 4-byte hash ID
 | |
| ("sha1" for SHA1 and "s256" for SHA-256) appended to the header.
 | |
| 
 | |
| For maximum backward compatibility, it is recommended to use version 1 when
 | |
| writing SHA1 reftables.
 | |
| 
 | |
| First ref block
 | |
| ^^^^^^^^^^^^^^^
 | |
| 
 | |
| The first ref block shares the same block as the file header, and is 24
 | |
| bytes smaller than all other blocks in the file. The first block
 | |
| immediately begins after the file header, at position 24.
 | |
| 
 | |
| If the first block is a log block (a log-only file), its block header
 | |
| begins immediately at position 24.
 | |
| 
 | |
| Ref block format
 | |
| ^^^^^^^^^^^^^^^^
 | |
| 
 | |
| A ref block is written as:
 | |
| 
 | |
| ....
 | |
| 'r'
 | |
| uint24( block_len )
 | |
| ref_record+
 | |
| uint24( restart_offset )+
 | |
| uint16( restart_count )
 | |
| 
 | |
| padding?
 | |
| ....
 | |
| 
 | |
| Blocks begin with `block_type = 'r'` and a 3-byte `block_len` which
 | |
| encodes the number of bytes in the block up to, but not including the
 | |
| optional `padding`. This is always less than or equal to the file's
 | |
| block size. In the first ref block, `block_len` includes 24 bytes for
 | |
| the file header.
 | |
| 
 | |
| The 2-byte `restart_count` stores the number of entries in the
 | |
| `restart_offset` list, which must not be empty. Readers can use
 | |
| `restart_count` to binary search between restarts before starting a
 | |
| linear scan.
 | |
| 
 | |
| Exactly `restart_count` 3-byte `restart_offset` values precede the
 | |
| `restart_count`. Offsets are relative to the start of the block and
 | |
| refer to the first byte of any `ref_record` whose name has not been
 | |
| prefix compressed. Entries in the `restart_offset` list must be sorted,
 | |
| ascending. Readers can start linear scans from any of these records.
 | |
| 
 | |
| A variable number of `ref_record` fill the middle of the block,
 | |
| describing reference names and values. The format is described below.
 | |
| 
 | |
| As the first ref block shares the first file block with the file header,
 | |
| all `restart_offset` in the first block are relative to the start of the
 | |
| file (position 0), and include the file header. This forces the first
 | |
| `restart_offset` to be `28`.
 | |
| 
 | |
| ref record
 | |
| ++++++++++
 | |
| 
 | |
| A `ref_record` describes a single reference, storing both the name and
 | |
| its value(s). Records are formatted as:
 | |
| 
 | |
| ....
 | |
| varint( prefix_length )
 | |
| varint( (suffix_length << 3) | value_type )
 | |
| suffix
 | |
| varint( update_index_delta )
 | |
| value?
 | |
| ....
 | |
| 
 | |
| The `prefix_length` field specifies how many leading bytes of the prior
 | |
| reference record's name should be copied to obtain this reference's
 | |
| name. This must be 0 for the first reference in any block, and also must
 | |
| be 0 for any `ref_record` whose offset is listed in the `restart_offset`
 | |
| table at the end of the block.
 | |
| 
 | |
| Recovering a reference name from any `ref_record` is a simple concat:
 | |
| 
 | |
| ....
 | |
| this_name = prior_name[0..prefix_length] + suffix
 | |
| ....
 | |
| 
 | |
| The `suffix_length` value provides the number of bytes available in
 | |
| `suffix` to copy from `suffix` to complete the reference name.
 | |
| 
 | |
| The `update_index` that last modified the reference can be obtained by
 | |
| adding `update_index_delta` to the `min_update_index` from the file
 | |
| header: `min_update_index + update_index_delta`.
 | |
| 
 | |
| The `value` follows. Its format is determined by `value_type`, one of
 | |
| the following:
 | |
| 
 | |
| * `0x0`: deletion; no value data (see transactions, below)
 | |
| * `0x1`: one object name; value of the ref
 | |
| * `0x2`: two object names; value of the ref, peeled target
 | |
| * `0x3`: symbolic reference: `varint( target_len ) target`
 | |
| 
 | |
| Symbolic references use `0x3`, followed by the complete name of the
 | |
| reference target. No compression is applied to the target name.
 | |
| 
 | |
| Types `0x4..0x7` are reserved for future use.
 | |
| 
 | |
| Ref index
 | |
| ^^^^^^^^^
 | |
| 
 | |
| The ref index stores the name of the last reference from every ref block
 | |
| in the file, enabling reduced disk seeks for lookups. Any reference can
 | |
| be found by searching the index, identifying the containing block, and
 | |
| searching within that block.
 | |
| 
 | |
| The index may be organized into a multi-level index, where the 1st level
 | |
| index block points to additional ref index blocks (2nd level), which may
 | |
| in turn point to either additional index blocks (e.g. 3rd level) or ref
 | |
| blocks (leaf level). Disk reads required to access a ref go up with
 | |
| higher index levels. Multi-level indexes may be required to ensure no
 | |
| single index block exceeds the file format's max block size of
 | |
| `16777215` bytes (15.99 MiB). To achieve constant O(1) disk seeks for
 | |
| lookups the index must be a single level, which is permitted to exceed
 | |
| the file's configured block size, but not the format's max block size of
 | |
| 15.99 MiB.
 | |
| 
 | |
| If present, the ref index block(s) appears after the last ref block.
 | |
| 
 | |
| If there are at least 4 ref blocks, a ref index block should be written
 | |
| to improve lookup times. Cold reads using the index require 2 disk reads
 | |
| (read index, read block), and binary searching < 4 blocks also requires
 | |
| <= 2 reads. Omitting the index block from smaller files saves space.
 | |
| 
 | |
| If the file is unaligned and contains more than one ref block, the ref
 | |
| index must be written.
 | |
| 
 | |
| Index block format:
 | |
| 
 | |
| ....
 | |
| 'i'
 | |
| uint24( block_len )
 | |
| index_record+
 | |
| uint24( restart_offset )+
 | |
| uint16( restart_count )
 | |
| 
 | |
| padding?
 | |
| ....
 | |
| 
 | |
| The index blocks begin with `block_type = 'i'` and a 3-byte `block_len`
 | |
| which encodes the number of bytes in the block, up to but not including
 | |
| the optional `padding`.
 | |
| 
 | |
| The `restart_offset` and `restart_count` fields are identical in format,
 | |
| meaning and usage as in ref blocks.
 | |
| 
 | |
| To reduce the number of reads required for random access in very large
 | |
| files the index block may be larger than other blocks. However, readers
 | |
| must hold the entire index in memory to benefit from this, so it's a
 | |
| time-space tradeoff in both file size and reader memory.
 | |
| 
 | |
| Increasing the file's block size decreases the index size. Alternatively
 | |
| a multi-level index may be used, keeping index blocks within the file's
 | |
| block size, but increasing the number of blocks that need to be
 | |
| accessed.
 | |
| 
 | |
| index record
 | |
| ++++++++++++
 | |
| 
 | |
| An index record describes the last entry in another block. Index records
 | |
| are written as:
 | |
| 
 | |
| ....
 | |
| varint( prefix_length )
 | |
| varint( (suffix_length << 3) | 0 )
 | |
| suffix
 | |
| varint( block_position )
 | |
| ....
 | |
| 
 | |
| Index records use prefix compression exactly like `ref_record`.
 | |
| 
 | |
| Index records store `block_position` after the suffix, specifying the
 | |
| absolute position in bytes (from the start of the file) of the block
 | |
| that ends with this reference. Readers can seek to `block_position` to
 | |
| begin reading the block header.
 | |
| 
 | |
| Readers must examine the block header at `block_position` to determine
 | |
| if the next block is another level index block, or the leaf-level ref
 | |
| block.
 | |
| 
 | |
| Reading the index
 | |
| +++++++++++++++++
 | |
| 
 | |
| Readers loading the ref index must first read the footer (below) to
 | |
| obtain `ref_index_position`. If not present, the position will be 0. The
 | |
| `ref_index_position` is for the 1st level root of the ref index.
 | |
| 
 | |
| Obj block format
 | |
| ^^^^^^^^^^^^^^^^
 | |
| 
 | |
| Object blocks are optional. Writers may choose to omit object blocks,
 | |
| especially if readers will not use the object name to ref mapping.
 | |
| 
 | |
| Object blocks use unique, abbreviated 2-31 byte object name keys, mapping to
 | |
| ref blocks containing references pointing to that object directly, or as
 | |
| the peeled value of an annotated tag. Like ref blocks, object blocks use
 | |
| the file's standard block size. The abbreviation length is available in
 | |
| the footer as `obj_id_len`.
 | |
| 
 | |
| To save space in small files, object blocks may be omitted if the ref
 | |
| index is not present, as brute force search will only need to read a few
 | |
| ref blocks. When missing, readers should brute force a linear search of
 | |
| all references to lookup by object name.
 | |
| 
 | |
| An object block is written as:
 | |
| 
 | |
| ....
 | |
| 'o'
 | |
| uint24( block_len )
 | |
| obj_record+
 | |
| uint24( restart_offset )+
 | |
| uint16( restart_count )
 | |
| 
 | |
| padding?
 | |
| ....
 | |
| 
 | |
| Fields are identical to ref block. Binary search using the restart table
 | |
| works the same as in reference blocks.
 | |
| 
 | |
| Because object names are abbreviated by writers to the shortest unique
 | |
| abbreviation within the reftable, obj key lengths have a variable length. Their
 | |
| length must be at least 2 bytes. Readers must compare only for common prefix
 | |
| match within an obj block or obj index.
 | |
| 
 | |
| obj record
 | |
| ++++++++++
 | |
| 
 | |
| An `obj_record` describes a single object abbreviation, and the blocks
 | |
| containing references using that unique abbreviation:
 | |
| 
 | |
| ....
 | |
| varint( prefix_length )
 | |
| varint( (suffix_length << 3) | cnt_3 )
 | |
| suffix
 | |
| varint( cnt_large )?
 | |
| varint( position_delta )*
 | |
| ....
 | |
| 
 | |
| Like in reference blocks, abbreviations are prefix compressed within an
 | |
| obj block. On large reftables with many unique objects, higher block
 | |
| sizes (64k), and higher restart interval (128), a `prefix_length` of 2
 | |
| or 3 and `suffix_length` of 3 may be common in obj records (unique
 | |
| abbreviation of 5-6 raw bytes, 10-12 hex digits).
 | |
| 
 | |
| Each record contains `position_count` number of positions for matching
 | |
| ref blocks. For 1-7 positions the count is stored in `cnt_3`. When
 | |
| `cnt_3 = 0` the actual count follows in a varint, `cnt_large`.
 | |
| 
 | |
| The use of `cnt_3` bets most objects are pointed to by only a single
 | |
| reference, some may be pointed to by a couple of references, and very
 | |
| few (if any) are pointed to by more than 7 references.
 | |
| 
 | |
| A special case exists when `cnt_3 = 0` and `cnt_large = 0`: there are no
 | |
| `position_delta`, but at least one reference starts with this
 | |
| abbreviation. A reader that needs exact reference names must scan all
 | |
| references to find which specific references have the desired object.
 | |
| Writers should use this format when the `position_delta` list would have
 | |
| overflowed the file's block size due to a high number of references
 | |
| pointing to the same object.
 | |
| 
 | |
| The first `position_delta` is the position from the start of the file.
 | |
| Additional `position_delta` entries are sorted ascending and relative to
 | |
| the prior entry, e.g. a reader would perform:
 | |
| 
 | |
| ....
 | |
| pos = position_delta[0]
 | |
| prior = pos
 | |
| for (j = 1; j < position_count; j++) {
 | |
|   pos = prior + position_delta[j]
 | |
|   prior = pos
 | |
| }
 | |
| ....
 | |
| 
 | |
| With a position in hand, a reader must linearly scan the ref block,
 | |
| starting from the first `ref_record`, testing each reference's object names
 | |
| (for `value_type = 0x1` or `0x2`) for full equality. Faster searching by
 | |
| object name within a single ref block is not supported by the reftable format.
 | |
| Smaller block sizes reduce the number of candidates this step must
 | |
| consider.
 | |
| 
 | |
| Obj index
 | |
| ^^^^^^^^^
 | |
| 
 | |
| The obj index stores the abbreviation from the last entry for every obj
 | |
| block in the file, enabling reduced disk seeks for all lookups. It is
 | |
| formatted exactly the same as the ref index, but refers to obj blocks.
 | |
| 
 | |
| The obj index should be present if obj blocks are present, as obj blocks
 | |
| should only be written in larger files.
 | |
| 
 | |
| Readers loading the obj index must first read the footer (below) to
 | |
| obtain `obj_index_position`. If not present, the position will be 0.
 | |
| 
 | |
| Log block format
 | |
| ^^^^^^^^^^^^^^^^
 | |
| 
 | |
| Unlike ref and obj blocks, log blocks are always unaligned.
 | |
| 
 | |
| Log blocks are variable in size, and do not match the `block_size`
 | |
| specified in the file header or footer. Writers should choose an
 | |
| appropriate buffer size to prepare a log block for deflation, such as
 | |
| `2 * block_size`.
 | |
| 
 | |
| A log block is written as:
 | |
| 
 | |
| ....
 | |
| 'g'
 | |
| uint24( block_len )
 | |
| zlib_deflate {
 | |
|   log_record+
 | |
|   uint24( restart_offset )+
 | |
|   uint16( restart_count )
 | |
| }
 | |
| ....
 | |
| 
 | |
| Log blocks look similar to ref blocks, except `block_type = 'g'`.
 | |
| 
 | |
| The 4-byte block header is followed by the deflated block contents using
 | |
| zlib deflate. The `block_len` in the header is the inflated size
 | |
| (including 4-byte block header), and should be used by readers to
 | |
| preallocate the inflation output buffer. A log block's `block_len` may
 | |
| exceed the file's block size.
 | |
| 
 | |
| Offsets within the log block (e.g. `restart_offset`) still include the
 | |
| 4-byte header. Readers may prefer prefixing the inflation output buffer
 | |
| with the 4-byte header.
 | |
| 
 | |
| Within the deflate container, a variable number of `log_record` describe
 | |
| reference changes. The log record format is described below. See ref
 | |
| block format (above) for a description of `restart_offset` and
 | |
| `restart_count`.
 | |
| 
 | |
| Because log blocks have no alignment or padding between blocks, readers
 | |
| must keep track of the bytes consumed by the inflater to know where the
 | |
| next log block begins.
 | |
| 
 | |
| log record
 | |
| ++++++++++
 | |
| 
 | |
| Log record keys are structured as:
 | |
| 
 | |
| ....
 | |
| ref_name '\0' reverse_int64( update_index )
 | |
| ....
 | |
| 
 | |
| where `update_index` is the unique transaction identifier. The
 | |
| `update_index` field must be unique within the scope of a `ref_name`.
 | |
| See the update transactions section below for further details.
 | |
| 
 | |
| The `reverse_int64` function inverses the value so lexicographical
 | |
| ordering the network byte order encoding sorts the more recent records
 | |
| with higher `update_index` values first:
 | |
| 
 | |
| ....
 | |
| reverse_int64(int64 t) {
 | |
|   return 0xffffffffffffffff - t;
 | |
| }
 | |
| ....
 | |
| 
 | |
| Log records have a similar starting structure to ref and index records,
 | |
| utilizing the same prefix compression scheme applied to the log record
 | |
| key described above.
 | |
| 
 | |
| ....
 | |
|     varint( prefix_length )
 | |
|     varint( (suffix_length << 3) | log_type )
 | |
|     suffix
 | |
|     log_data {
 | |
|       old_id
 | |
|       new_id
 | |
|       varint( name_length    )  name
 | |
|       varint( email_length   )  email
 | |
|       varint( time_seconds )
 | |
|       sint16( tz_offset )
 | |
|       varint( message_length )  message
 | |
|     }?
 | |
| ....
 | |
| 
 | |
| Log record entries use `log_type` to indicate what follows:
 | |
| 
 | |
| * `0x0`: deletion; no log data.
 | |
| * `0x1`: standard git reflog data using `log_data` above.
 | |
| 
 | |
| The `log_type = 0x0` is mostly useful for `git stash drop`, removing an
 | |
| entry from the reflog of `refs/stash` in a transaction file (below),
 | |
| without needing to rewrite larger files. Readers reading a stack of
 | |
| reflogs must treat this as a deletion.
 | |
| 
 | |
| For `log_type = 0x1`, the `log_data` section follows
 | |
| linkgit:git-update-ref[1] logging and includes:
 | |
| 
 | |
| * two object names (old id, new id)
 | |
| * varint string of committer's name
 | |
| * varint string of committer's email
 | |
| * varint time in seconds since epoch (Jan 1, 1970)
 | |
| * 2-byte timezone offset in minutes (signed)
 | |
| * varint string of message
 | |
| 
 | |
| `tz_offset` is the absolute number of minutes from GMT the committer was
 | |
| at the time of the update. For example `GMT-0800` is encoded in reftable
 | |
| as `sint16(-480)` and `GMT+0230` is `sint16(150)`.
 | |
| 
 | |
| The committer email does not contain `<` or `>`, it's the value normally
 | |
| found between the `<>` in a git commit object header.
 | |
| 
 | |
| The `message_length` may be 0, in which case there was no message
 | |
| supplied for the update.
 | |
| 
 | |
| Contrary to traditional reflog (which is a file), renames are encoded as
 | |
| a combination of ref deletion and ref creation.  A deletion is a log
 | |
| record with a zero new_id, and a creation is a log record with a zero old_id.
 | |
| 
 | |
| Reading the log
 | |
| +++++++++++++++
 | |
| 
 | |
| Readers accessing the log must first read the footer (below) to
 | |
| determine the `log_position`. The first block of the log begins at
 | |
| `log_position` bytes since the start of the file. The `log_position` is
 | |
| not block aligned.
 | |
| 
 | |
| Importing logs
 | |
| ++++++++++++++
 | |
| 
 | |
| When importing from `$GIT_DIR/logs` writers should globally order all
 | |
| log records roughly by timestamp while preserving file order, and assign
 | |
| unique, increasing `update_index` values for each log line. Newer log
 | |
| records get higher `update_index` values.
 | |
| 
 | |
| Although an import may write only a single reftable file, the reftable
 | |
| file must span many unique `update_index`, as each log line requires its
 | |
| own `update_index` to preserve semantics.
 | |
| 
 | |
| Log index
 | |
| ^^^^^^^^^
 | |
| 
 | |
| The log index stores the log key
 | |
| (`refname \0 reverse_int64(update_index)`) for the last log record of
 | |
| every log block in the file, supporting bounded-time lookup.
 | |
| 
 | |
| A log index block must be written if 2 or more log blocks are written to
 | |
| the file. If present, the log index appears after the last log block.
 | |
| There is no padding used to align the log index to block alignment.
 | |
| 
 | |
| Log index format is identical to ref index, except the keys are 9 bytes
 | |
| longer to include `'\0'` and the 8-byte `reverse_int64(update_index)`.
 | |
| Records use `block_position` to refer to the start of a log block.
 | |
| 
 | |
| Reading the index
 | |
| +++++++++++++++++
 | |
| 
 | |
| Readers loading the log index must first read the footer (below) to
 | |
| obtain `log_index_position`. If not present, the position will be 0.
 | |
| 
 | |
| Footer
 | |
| ^^^^^^
 | |
| 
 | |
| After the last block of the file, a file footer is written. It begins
 | |
| like the file header, but is extended with additional data.
 | |
| 
 | |
| ....
 | |
|     HEADER
 | |
| 
 | |
|     uint64( ref_index_position )
 | |
|     uint64( (obj_position << 5) | obj_id_len )
 | |
|     uint64( obj_index_position )
 | |
| 
 | |
|     uint64( log_position )
 | |
|     uint64( log_index_position )
 | |
| 
 | |
|     uint32( CRC-32 of above )
 | |
| ....
 | |
| 
 | |
| If a section is missing (e.g. ref index) the corresponding position
 | |
| field (e.g. `ref_index_position`) will be 0.
 | |
| 
 | |
| * `obj_position`: byte position for the first obj block.
 | |
| * `obj_id_len`: number of bytes used to abbreviate object names in
 | |
| obj blocks.
 | |
| * `log_position`: byte position for the first log block.
 | |
| * `ref_index_position`: byte position for the start of the ref index.
 | |
| * `obj_index_position`: byte position for the start of the obj index.
 | |
| * `log_index_position`: byte position for the start of the log index.
 | |
| 
 | |
| The size of the footer is 68 bytes for version 1, and 72 bytes for
 | |
| version 2.
 | |
| 
 | |
| Reading the footer
 | |
| ++++++++++++++++++
 | |
| 
 | |
| Readers must first read the file start to determine the version
 | |
| number. Then they seek to `file_length - FOOTER_LENGTH` to access the
 | |
| footer. A trusted external source (such as `stat(2)`) is necessary to
 | |
| obtain `file_length`. When reading the footer, readers must verify:
 | |
| 
 | |
| * 4-byte magic is correct
 | |
| * 1-byte version number is recognized
 | |
| * 4-byte CRC-32 matches the other 64 bytes (including magic, and
 | |
| version)
 | |
| 
 | |
| Once verified, the other fields of the footer can be accessed.
 | |
| 
 | |
| Empty tables
 | |
| ++++++++++++
 | |
| 
 | |
| A reftable may be empty. In this case, the file starts with a header
 | |
| and is immediately followed by a footer.
 | |
| 
 | |
| Binary search
 | |
| ^^^^^^^^^^^^^
 | |
| 
 | |
| Binary search within a block is supported by the `restart_offset` fields
 | |
| at the end of the block. Readers can binary search through the restart
 | |
| table to locate between which two restart points the sought reference or
 | |
| key should appear.
 | |
| 
 | |
| Each record identified by a `restart_offset` stores the complete key in
 | |
| the `suffix` field of the record, making the compare operation during
 | |
| binary search straightforward.
 | |
| 
 | |
| Once a restart point lexicographically before the sought reference has
 | |
| been identified, readers can linearly scan through the following record
 | |
| entries to locate the sought record, terminating if the current record
 | |
| sorts after (and therefore the sought key is not present).
 | |
| 
 | |
| Restart point selection
 | |
| +++++++++++++++++++++++
 | |
| 
 | |
| Writers determine the restart points at file creation. The process is
 | |
| arbitrary, but every 16 or 64 records is recommended. Every 16 may be
 | |
| more suitable for smaller block sizes (4k or 8k), every 64 for larger
 | |
| block sizes (64k).
 | |
| 
 | |
| More frequent restart points reduces prefix compression and increases
 | |
| space consumed by the restart table, both of which increase file size.
 | |
| 
 | |
| Less frequent restart points makes prefix compression more effective,
 | |
| decreasing overall file size, with increased penalties for readers
 | |
| walking through more records after the binary search step.
 | |
| 
 | |
| A maximum of `65535` restart points per block is supported.
 | |
| 
 | |
| Considerations
 | |
| ~~~~~~~~~~~~~~
 | |
| 
 | |
| Lightweight refs dominate
 | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^
 | |
| 
 | |
| The reftable format assumes the vast majority of references are single
 | |
| object names valued with common prefixes, such as Gerrit Code Review's
 | |
| `refs/changes/` namespace, GitHub's `refs/pulls/` namespace, or many
 | |
| lightweight tags in the `refs/tags/` namespace.
 | |
| 
 | |
| Annotated tags storing the peeled object cost an additional object name per
 | |
| reference.
 | |
| 
 | |
| Low overhead
 | |
| ^^^^^^^^^^^^
 | |
| 
 | |
| A reftable with very few references (e.g. git.git with 5 heads) is 269
 | |
| bytes for reftable, vs. 332 bytes for packed-refs. This supports
 | |
| reftable scaling down for transaction logs (below).
 | |
| 
 | |
| Block size
 | |
| ^^^^^^^^^^
 | |
| 
 | |
| For a Gerrit Code Review type repository with many change refs, larger
 | |
| block sizes (64 KiB) and less frequent restart points (every 64) yield
 | |
| better compression due to more references within the block compressing
 | |
| against the prior reference.
 | |
| 
 | |
| Larger block sizes reduce the index size, as the reftable will require
 | |
| fewer blocks to store the same number of references.
 | |
| 
 | |
| Minimal disk seeks
 | |
| ^^^^^^^^^^^^^^^^^^
 | |
| 
 | |
| Assuming the index block has been loaded into memory, binary searching
 | |
| for any single reference requires exactly 1 disk seek to load the
 | |
| containing block.
 | |
| 
 | |
| Scans and lookups dominate
 | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^
 | |
| 
 | |
| Scanning all references and lookup by name (or namespace such as
 | |
| `refs/heads/`) are the most common activities performed on repositories.
 | |
| Object names are stored directly with references to optimize this use case.
 | |
| 
 | |
| Logs are infrequently read
 | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^
 | |
| 
 | |
| Logs are infrequently accessed, but can be large. Deflating log blocks
 | |
| saves disk space, with some increased penalty at read time.
 | |
| 
 | |
| Logs are stored in an isolated section from refs, reducing the burden on
 | |
| reference readers that want to ignore logs. Further, historical logs can
 | |
| be isolated into log-only files.
 | |
| 
 | |
| Logs are read backwards
 | |
| ^^^^^^^^^^^^^^^^^^^^^^^
 | |
| 
 | |
| Logs are frequently accessed backwards (most recent N records for master
 | |
| to answer `master@{4}`), so log records are grouped by reference, and
 | |
| sorted descending by update index.
 | |
| 
 | |
| Repository format
 | |
| ~~~~~~~~~~~~~~~~~
 | |
| 
 | |
| Version 1
 | |
| ^^^^^^^^^
 | |
| 
 | |
| A repository must set its `$GIT_DIR/config` to configure reftable:
 | |
| 
 | |
| ....
 | |
| [core]
 | |
|     repositoryformatversion = 1
 | |
| [extensions]
 | |
|     refStorage = reftable
 | |
| ....
 | |
| 
 | |
| Layout
 | |
| ^^^^^^
 | |
| 
 | |
| A collection of reftable files are stored in the `$GIT_DIR/reftable/` directory.
 | |
| Their names should have a random element, such that each filename is globally
 | |
| unique; this helps avoid spurious failures on Windows, where open files cannot
 | |
| be removed or overwritten. It suggested to use
 | |
| `${min_update_index}-${max_update_index}-${random}.ref` as a naming convention.
 | |
| 
 | |
| Log-only files use the `.log` extension, while ref-only and mixed ref
 | |
| and log files use `.ref`. extension.
 | |
| 
 | |
| The stack ordering file is `$GIT_DIR/reftable/tables.list` and lists the
 | |
| current files, one per line, in order, from oldest (base) to newest
 | |
| (most recent):
 | |
| 
 | |
| ....
 | |
| $ cat .git/reftable/tables.list
 | |
| 00000001-00000001-RANDOM1.log
 | |
| 00000002-00000002-RANDOM2.ref
 | |
| 00000003-00000003-RANDOM3.ref
 | |
| ....
 | |
| 
 | |
| Readers must read `$GIT_DIR/reftable/tables.list` to determine which
 | |
| files are relevant right now, and search through the stack in reverse
 | |
| order (last reftable is examined first).
 | |
| 
 | |
| Reftable files not listed in `tables.list` may be new (and about to be
 | |
| added to the stack by the active writer), or ancient and ready to be
 | |
| pruned.
 | |
| 
 | |
| Backward compatibility
 | |
| ^^^^^^^^^^^^^^^^^^^^^^
 | |
| 
 | |
| Older clients should continue to recognize the directory as a git
 | |
| repository so they don't look for an enclosing repository in parent
 | |
| directories. To this end, a reftable-enabled repository must contain the
 | |
| following dummy files
 | |
| 
 | |
| * `.git/HEAD`, a regular file containing `ref: refs/heads/.invalid`.
 | |
| * `.git/refs/`, a directory
 | |
| * `.git/refs/heads`, a regular file
 | |
| 
 | |
| Readers
 | |
| ^^^^^^^
 | |
| 
 | |
| Readers can obtain a consistent snapshot of the reference space by
 | |
| following:
 | |
| 
 | |
| 1.  Open and read the `tables.list` file.
 | |
| 2.  Open each of the reftable files that it mentions.
 | |
| 3.  If any of the files is missing, goto 1.
 | |
| 4.  Read from the now-open files as long as necessary.
 | |
| 
 | |
| Update transactions
 | |
| ^^^^^^^^^^^^^^^^^^^
 | |
| 
 | |
| Although reftables are immutable, mutations are supported by writing a
 | |
| new reftable and atomically appending it to the stack:
 | |
| 
 | |
| 1.  Acquire `tables.list.lock`.
 | |
| 2.  Read `tables.list` to determine current reftables.
 | |
| 3.  Select `update_index` to be most recent file's
 | |
| `max_update_index + 1`.
 | |
| 4.  Prepare temp reftable `tmp_XXXXXX`, including log entries.
 | |
| 5.  Rename `tmp_XXXXXX` to `${update_index}-${update_index}-${random}.ref`.
 | |
| 6.  Copy `tables.list` to `tables.list.lock`, appending file from (5).
 | |
| 7.  Rename `tables.list.lock` to `tables.list`.
 | |
| 
 | |
| During step 4 the new file's `min_update_index` and `max_update_index`
 | |
| are both set to the `update_index` selected by step 3. All log records
 | |
| for the transaction use the same `update_index` in their keys. This
 | |
| enables later correlation of which references were updated by the same
 | |
| transaction.
 | |
| 
 | |
| Because a single `tables.list.lock` file is used to manage locking, the
 | |
| repository is single-threaded for writers. Writers may have to busy-spin
 | |
| (with backoff) around creating `tables.list.lock`, for up to an
 | |
| acceptable wait period, aborting if the repository is too busy to
 | |
| mutate. Application servers wrapped around repositories (e.g. Gerrit
 | |
| Code Review) can layer their own lock/wait queue to improve fairness to
 | |
| writers.
 | |
| 
 | |
| Reference deletions
 | |
| ^^^^^^^^^^^^^^^^^^^
 | |
| 
 | |
| Deletion of any reference can be explicitly stored by setting the `type`
 | |
| to `0x0` and omitting the `value` field of the `ref_record`. This serves
 | |
| as a tombstone, overriding any assertions about the existence of the
 | |
| reference from earlier files in the stack.
 | |
| 
 | |
| Compaction
 | |
| ^^^^^^^^^^
 | |
| 
 | |
| A partial stack of reftables can be compacted by merging references
 | |
| using a straightforward merge join across reftables, selecting the most
 | |
| recent value for output, and omitting deleted references that do not
 | |
| appear in remaining, lower reftables.
 | |
| 
 | |
| A compacted reftable should set its `min_update_index` to the smallest
 | |
| of the input files' `min_update_index`, and its `max_update_index`
 | |
| likewise to the largest input `max_update_index`.
 | |
| 
 | |
| For sake of illustration, assume the stack currently consists of
 | |
| reftable files (from oldest to newest): A, B, C, and D. The compactor is
 | |
| going to compact B and C, leaving A and D alone.
 | |
| 
 | |
| 1.  Obtain lock `tables.list.lock` and read the `tables.list` file.
 | |
| 2.  Obtain locks `B.lock` and `C.lock`. Ownership of these locks
 | |
| prevents other processes from trying to compact these files.
 | |
| 3.  Release `tables.list.lock`.
 | |
| 4.  Compact `B` and `C` into a temp file
 | |
| `${min_update_index}-${max_update_index}_XXXXXX`.
 | |
| 5.  Reacquire lock `tables.list.lock`.
 | |
| 6.  Verify that `B` and `C` are still in the stack, in that order. This
 | |
| should always be the case, assuming that other processes are adhering to
 | |
| the locking protocol.
 | |
| 7.  Rename `${min_update_index}-${max_update_index}_XXXXXX` to
 | |
| `${min_update_index}-${max_update_index}-${random}.ref`.
 | |
| 8.  Write the new stack to `tables.list.lock`, replacing `B` and `C`
 | |
| with the file from (4).
 | |
| 9.  Rename `tables.list.lock` to `tables.list`.
 | |
| 10. Delete `B` and `C`, perhaps after a short sleep to avoid forcing
 | |
| readers to backtrack.
 | |
| 
 | |
| This strategy permits compactions to proceed independently of updates.
 | |
| 
 | |
| Each reftable (compacted or not) is uniquely identified by its name, so
 | |
| open reftables can be cached by their name.
 | |
| 
 | |
| Windows
 | |
| ^^^^^^^
 | |
| 
 | |
| On windows, and other systems that do not allow deleting or renaming to open
 | |
| files, compaction may succeed, but other readers may prevent obsolete tables
 | |
| from being deleted.
 | |
| 
 | |
| On these platforms, the following strategy can be followed: on closing a
 | |
| reftable stack, reload `tables.list`, and delete any tables no longer mentioned
 | |
| in `tables.list`.
 | |
| 
 | |
| Irregular program exit may still leave about unused files. In this case, a
 | |
| cleanup operation should proceed as follows:
 | |
| 
 | |
| * take a lock `tables.list.lock` to prevent concurrent modifications
 | |
| * refresh the reftable stack, by reading `tables.list`
 | |
| * for each `*.ref` file, remove it if
 | |
| ** it is not mentioned in `tables.list`, and
 | |
| ** its max update_index is not beyond the max update_index of the stack
 | |
| 
 | |
| 
 | |
| Alternatives considered
 | |
| ~~~~~~~~~~~~~~~~~~~~~~~
 | |
| 
 | |
| bzip packed-refs
 | |
| ^^^^^^^^^^^^^^^^
 | |
| 
 | |
| `bzip2` can significantly shrink a large packed-refs file (e.g. 62 MiB
 | |
| compresses to 23 MiB, 37%). However the bzip format does not support
 | |
| random access to a single reference. Readers must inflate and discard
 | |
| while performing a linear scan.
 | |
| 
 | |
| Breaking packed-refs into chunks (individually compressing each chunk)
 | |
| would reduce the amount of data a reader must inflate, but still leaves
 | |
| the problem of indexing chunks to support readers efficiently locating
 | |
| the correct chunk.
 | |
| 
 | |
| Given the compression achieved by reftable's encoding, it does not seem
 | |
| necessary to add the complexity of bzip/gzip/zlib.
 | |
| 
 | |
| Michael Haggerty's alternate format
 | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 | |
| 
 | |
| Michael Haggerty proposed
 | |
| link:https://lore.kernel.org/git/CAMy9T_HCnyc1g8XWOOWhe7nN0aEFyyBskV2aOMb_fe%2BwGvEJ7A%40mail.gmail.com/[an
 | |
| alternate] format to reftable on the Git mailing list. This format uses
 | |
| smaller chunks, without the restart table, and avoids block alignment
 | |
| with padding. Reflog entries immediately follow each ref, and are thus
 | |
| interleaved between refs.
 | |
| 
 | |
| Performance testing indicates reftable is faster for lookups (51%
 | |
| faster, 11.2 usec vs. 5.4 usec), although reftable produces a slightly
 | |
| larger file (+ ~3.2%, 28.3M vs 29.2M):
 | |
| 
 | |
| [cols=">,>,>,>",options="header",]
 | |
| |=====================================
 | |
| |format |size |seek cold |seek hot
 | |
| |mh-alt |28.3 M |23.4 usec |11.2 usec
 | |
| |reftable |29.2 M |19.9 usec |5.4 usec
 | |
| |=====================================
 | |
| 
 | |
| JGit Ketch RefTree
 | |
| ^^^^^^^^^^^^^^^^^^
 | |
| 
 | |
| https://dev.eclipse.org/mhonarc/lists/jgit-dev/msg03073.html[JGit Ketch]
 | |
| proposed
 | |
| link:https://lore.kernel.org/git/CAJo%3DhJvnAPNAdDcAAwAvU9C4RVeQdoS3Ev9WTguHx4fD0V_nOg%40mail.gmail.com/[RefTree],
 | |
| an encoding of references inside Git tree objects stored as part of the
 | |
| repository's object database.
 | |
| 
 | |
| The RefTree format adds additional load on the object database storage
 | |
| layer (more loose objects, more objects in packs), and relies heavily on
 | |
| the packer's delta compression to save space. Namespaces which are flat
 | |
| (e.g. thousands of tags in refs/tags) initially create very large loose
 | |
| objects, and so RefTree does not address the problem of copying many
 | |
| references to modify a handful.
 | |
| 
 | |
| Flat namespaces are not efficiently searchable in RefTree, as tree
 | |
| objects in canonical formatting cannot be binary searched. This fails
 | |
| the need to handle a large number of references in a single namespace,
 | |
| such as GitHub's `refs/pulls`, or a project with many tags.
 | |
| 
 | |
| LMDB
 | |
| ^^^^
 | |
| 
 | |
| David Turner proposed
 | |
| https://lore.kernel.org/git/1455772670-21142-26-git-send-email-dturner@twopensource.com/[using
 | |
| LMDB], as LMDB is lightweight (64k of runtime code) and GPL-compatible
 | |
| license.
 | |
| 
 | |
| A downside of LMDB is its reliance on a single C implementation. This
 | |
| makes embedding inside JGit (a popular reimplementation of Git)
 | |
| difficult, and hoisting onto virtual storage (for JGit DFS) virtually
 | |
| impossible.
 | |
| 
 | |
| A common format that can be supported by all major Git implementations
 | |
| (git-core, JGit, libgit2) is strongly preferred.
 |