Commit Graph

61 Commits (v2.50.0-rc2)

Author SHA1 Message Date
Junio C Hamano 2b3303166b Merge branch 'ly/reftable-writer-leakfix'
Leakfix.

* ly/reftable-writer-leakfix:
  reftable/writer: fix memory leak when `writer_index_hash()` fails
  reftable/writer: fix memory leak when `padded_write()` fails
2025-05-19 16:02:47 -07:00
Lidong Yan 91db6c735d reftable/writer: fix memory leak when `writer_index_hash()` fails
In reftable/writer.c:writer_index_hash(), if `reftable_buf_add` failed,
key allocated by `reftable_malloc` will not be insert into `obj_index_tree`
thus leaks. Simple add reftable_free(key) will solve this problem.

Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-05-12 09:19:50 -07:00
Lidong Yan c8e752eaef reftable/writer: fix memory leak when `padded_write()` fails
In reftable/writer.c:padded_write(), if w->writer failed, zeroed
allocated in `reftable_calloc` will leak. w->writer could be
`reftable_write_data` in reftable/stack.c, and could fail due to
some write error. Simply add reftable_free(zeroed) will solve this
problem.

Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-05-12 09:19:49 -07:00
Junio C Hamano a819a3da85 Merge branch 'ps/reftable-api-revamp'
Overhaul of the reftable API.

* ps/reftable-api-revamp:
  reftable/table: move printing logic into test helper
  reftable/constants: make block types part of the public interface
  reftable/table: introduce iterator for table blocks
  reftable/table: add `reftable_table` to the public interface
  reftable/block: expose a generic iterator over reftable records
  reftable/block: make block iterators reseekable
  reftable/block: store block pointer in the block iterator
  reftable/block: create public interface for reading blocks
  git-zlib: use `struct z_stream_s` instead of typedef
  reftable/block: rename `block_reader` to `reftable_block`
  reftable/block: rename `block` to `block_data`
  reftable/table: move reading block into block reader
  reftable/block: simplify how we track restart points
  reftable/blocksource: consolidate code into a single file
  reftable/reader: rename data structure to "table"
  reftable: fix formatting of the license header
2025-04-29 14:21:30 -07:00
Junio C Hamano 6e2a3b8ae0 Merge branch 'ps/reftable-sans-compat-util'
Make the code in reftable library less reliant on the service
routines it used to borrow from Git proper, to make it easier to
use by external users of the library.

* ps/reftable-sans-compat-util:
  Makefile: skip reftable library for Coccinelle
  reftable: decouple from Git codebase by pulling in "compat/posix.h"
  git-compat-util.h: split out POSIX-emulating bits
  compat/mingw: split out POSIX-related bits
  reftable/basics: introduce `REFTABLE_UNUSED` annotation
  reftable/basics: stop using `SWAP()` macro
  reftable/stack: stop using `sleep_millisec()`
  reftable/system: introduce `reftable_rand()`
  reftable/reader: stop using `ARRAY_SIZE()` macro
  reftable/basics: provide wrappers for big endian conversion
  reftable/basics: stop using `st_mult()` in array allocators
  reftable: stop using `BUG()` in trivial cases
  reftable/record: don't `BUG()` in `reftable_record_cmp()`
  reftable/record: stop using `BUG()` in `reftable_record_init()`
  reftable/record: stop using `COPY_ARRAY()`
  reftable/blocksource: stop using `xmmap()`
  reftable/stack: stop using `write_in_full()`
  reftable/stack: stop using `read_in_full()`
2025-04-08 11:43:14 -07:00
Patrick Steinhardt 0f8ee94b63 reftable/constants: make block types part of the public interface
Now that reftable blocks can be read individually via the public
interface it becomes necessary for callers to be able to distinguish the
different types of blocks. Expose the relevant constants.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-04-07 14:53:12 -07:00
Patrick Steinhardt 6dcc05ffc3 reftable: fix formatting of the license header
The license headers used across the reftable library doesn't follow our
typical coding style for multi-line comments. Fix it.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-04-07 14:53:09 -07:00
Meet Soni 0e1b9c5eed reftable: adapt write_object_record() to propagate block_writer_add() errors
Previously, write_object_record() would flush the current block and retry
appending the record whenever block_writer_add() returned any nonzero
error. This forced an assumption that every failure meant the block was
full, even when errors such as memory allocation or I/O failures occurred.

Update the write_object_record() to inspect the error code returned by
block_writer_add() and flush and reinitialize the writer iff the
error is REFTABLE_ENTRY_TOO_BIG_ERROR. For any other error, immediately
propagate it.

If the flush and reinitialization still fail with
REFTABLE_ENTRY_TOO_BIG_ERROR, reset the record's offset length to zero
before a final attempt.

All call sites now handle various error codes returned by
block_writer_add().

Signed-off-by: Meet Soni <meetsoni3017@gmail.com>
Acked-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-21 01:51:08 -07:00
Meet Soni 9ce297239b reftable: adapt writer_add_record() to propagate block_writer_add() errors
Previously, writer_add_record() would flush the current block and retry
appending the record whenever block_writer_add() returned any nonzero
error. This forced an assumption that every failure meant the block was
full, even when errors such as memory allocation or I/O failures occurred.

Update the writer_add_record() to inspect the error code returned by
block_writer_add() and only flush and reinitialize the writer when the
error is REFTABLE_ENTRY_TOO_BIG_ERROR. For any other error, immediately
propagate it.

Signed-off-by: Meet Soni <meetsoni3017@gmail.com>
Acked-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-21 01:51:07 -07:00
Patrick Steinhardt f93b2a0424 reftable/basics: introduce `REFTABLE_UNUSED` annotation
Introduce the `REFTABLE_UNUSED` annotation and replace all existing
users of `UNUSED` in the reftable library to use the new macro instead.

Note that we unconditionally define `MAYBE_UNUSED` in the exact same
way, so doing so unconditionally for `REFTABLE_UNUSED` should be fine,
too.

Suggested-by: Toon Claes <toon@iotcl.com>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-02-18 10:55:38 -08:00
Patrick Steinhardt e676694298 reftable/basics: provide wrappers for big endian conversion
We're using a mixture of big endian conversion functions provided by
both the reftable library, but also by the Git codebase. Refactor the
code so that we exclusively use reftable-provided wrappers in order to
untangle us from the Git codebase.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-02-18 10:55:37 -08:00
Patrick Steinhardt 445f9f4f35 reftable: stop using `BUG()` in trivial cases
Stop using `BUG()` in the remaining trivial cases that we still have in
the reftable library. Instead of aborting the program, we'll now bubble
up a `REFTABLE_API_ERROR` to indicate misuse of the calling conventions.

Note that in both `reftable_reader_{inc,dec}ref()` we simply stop
calling `BUG()` altogether. The only situation where the counter should
be zero is when the structure has already been free'd anyway, so we
would run into undefined behaviour regardless of whether we try to abort
the program or not.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-02-18 10:55:36 -08:00
Junio C Hamano 82522a9e2c Merge branch 'kn/reflog-migration-fix-followup'
Code clean-up.

* kn/reflog-migration-fix-followup:
  reftable: prevent 'update_index' changes after adding records
  refs: use 'uint64_t' for 'ref_update.index'
  refs: mark `ref_transaction_update_reflog()` as static
2025-02-14 17:53:48 -08:00
Karthik Nayak 017bd89239 reftable: prevent 'update_index' changes after adding records
The function `reftable_writer_set_limits()` allows updating the
'min_update_index' and 'max_update_index' of a reftable writer. These
values are written to both the writer's header and footer.

Since the header is written during the first block write, any subsequent
changes to the update index would create a mismatch between the header
and footer values. The footer would contain the newer values while the
header retained the original ones.

To protect against this bug, prevent callers from updating these values
after any record is written. To do this, modify the function to return
an error whenever the limits are modified after any record adds. Check
for record adds within `reftable_writer_set_limits()` by checking the
`last_key` and `next` variable. The former is updated after each record
added, but is reset at certain points. The latter is set after writing
the first block.

Modify all callers of the function to anticipate a return type and
handle it accordingly. Add a unit test to also ensure the function
returns the error as expected.

Helped-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-01-22 09:51:36 -08:00
Patrick Steinhardt 5ac65f0d6b reftable/basics: adjust `common_prefix_size()` to return `size_t`
The `common_prefix_size()` function computes the length of the common
prefix between two buffers. As such its return value will always be an
unsigned integer, as the length cannot be negative. Regardless of that,
the function returns a signed integer, which is nonsensical and causes a
couple of -Wsign-compare warnings all over the place.

Adjust the function to return a `size_t` instead.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-01-21 14:20:29 -08:00
René Scharfe 8db127d43f reftable: avoid leaks on realloc error
When realloc(3) fails, it returns NULL and keeps the original allocation
intact.  REFTABLE_ALLOC_GROW overwrites both the original pointer and
the allocation count variable in that case, simultaneously leaking the
original allocation and misrepresenting the number of storable items.

parse_names() and reftable_buf_add() avoid leaking by restoring the
original pointer value on failure, but all other callers seem to be OK
with losing the old allocation.  Add a new variant of the macro,
REFTABLE_ALLOC_GROW_OR_NULL, which plugs the leak and zeros the
allocation counter.  Use it for those callers.

Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-28 08:00:44 -08:00
Junio C Hamano f7c607fac3 Merge branch 'kn/reftable-writer-log-write-verify'
Reftable backend adds check for upper limit of log's update_index.

* kn/reftable-writer-log-write-verify:
  reftable/writer: ensure valid range for log's update_index
2024-12-23 09:32:08 -08:00
Junio C Hamano de9278127e Merge branch 'ps/reftable-detach'
Isolates the reftable subsystem from the rest of Git's codebase by
using fewer pieces of Git's infrastructure.

* ps/reftable-detach:
  reftable/system: provide thin wrapper for lockfile subsystem
  reftable/stack: drop only use of `get_locked_file_path()`
  reftable/system: provide thin wrapper for tempfile subsystem
  reftable/stack: stop using `fsync_component()` directly
  reftable/system: stop depending on "hash.h"
  reftable: explicitly handle hash format IDs
  reftable/system: move "dir.h" to its only user
2024-12-10 10:04:56 +09:00
Karthik Nayak 49c6b912e2 reftable/writer: ensure valid range for log's update_index
Each reftable addition has an associated update_index. While writing
refs, the update_index is verified to be within the range of the
reftable writer, i.e. `writer.min_update_index <= ref.update_index` and
`writer.max_update_index => ref.update_index`.

The corresponding check for reflogs in `reftable_writer_add_log` is
however missing. Add a similar check, but only check for the upper
limit. This is because reflogs are treated a bit differently than refs.
Each reflog entry in reftable has an associated update_index and we also
allow expiring entries in the middle, which is done by simply writing a
new reflog entry with the same update_index. This means, writing reflog
entries with update_index lesser than the writer's update_index is an
expected scenario.

Add a new unit test to check for the limits and fix some of the existing
tests, which were setting arbitrary values for the update_index by
ensuring they stay within the now checked limits.

Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-07 08:04:46 +09:00
Patrick Steinhardt ef46ad0815 reftable: rename scratch buffer
Both `struct block_writer` and `struct reftable_writer` have a `buf`
member that is being reused to optimize the number of allocations.
Rename the variable to `scratch` to clarify its intend and provide a
comment explaining why it exists.

Suggested-by: Christian Couder <christian.couder@gmail.com>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-26 08:39:38 +09:00
Patrick Steinhardt 66ed011bf7 reftable/writer: optimize allocations by using a scratch buffer
Both `writer_add_record()` and `reftable_writer_add_ref()` get executed
for every single ref record we're adding to the reftable writer. And as
both functions use a local buffer to write data, the allocations we have
to do here add up during larger transactions.

Refactor the code to use a scratch buffer part of the `reftable_writer`
itself such that we can reuse it. This signifcantly reduces the number
of allocations during large transactions, e.g. when migrating refs from
the "files" backend to the "reftable" backend. Before this change:

    HEAP SUMMARY:
        in use at exit: 80,048 bytes in 49 blocks
      total heap usage: 5,032,171 allocs, 5,032,122 frees, 418,792,092 bytes allocated

After this change:

    HEAP SUMMARY:
        in use at exit: 80,048 bytes in 49 blocks
      total heap usage: 3,025,864 allocs, 3,025,815 frees, 372,746,291 bytes allocated

This also translate into a small speedup:

    Benchmark 1: migrate files:reftable (refcount = 1000000, revision = HEAD~)
      Time (mean ± σ):     827.2 ms ±  16.5 ms    [User: 689.4 ms, System: 124.9 ms]
      Range (min … max):   809.0 ms … 924.7 ms    50 runs

    Benchmark 2: migrate files:reftable (refcount = 1000000, revision = HEAD)
      Time (mean ± σ):     813.6 ms ±  11.6 ms    [User: 679.0 ms, System: 123.4 ms]
      Range (min … max):   786.7 ms … 833.5 ms    50 runs

    Summary
      migrate files:reftable (refcount = 1000000, revision = HEAD) ran
        1.02 ± 0.02 times faster than migrate files:reftable (refcount = 1000000, revision = HEAD~)

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-21 07:59:16 +09:00
Patrick Steinhardt c2f08236ed reftable/system: stop depending on "hash.h"
We include "hash.h" in "reftable/system.h" such that we can use hash
format IDs as well as the raw size of SHA1 and SHA256. As we are in the
process of converting the reftable library to become standalone we of
course cannot rely on those constants anymore.

Introduce a new `enum reftable_hash` to replace internal uses of the
hash format IDs and new constants that replace internal uses of the hash
size. Adapt the reftable backend to set up the correct hash function.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-19 12:23:10 +09:00
Patrick Steinhardt 88e297275b reftable: explicitly handle hash format IDs
The hash format IDs are used for two different things across the
reftable codebase:

  - They are used as a 32 bit unsigned integer when reading and writing
    the header in order to identify the hash function.

  - They are used internally to identify which hash function is in use.

When one only considers the second usecase one might think that one can
easily change the representation of those hash IDs. But because those
IDs end up in the reftable header and footer on disk it is important
that those never change.

Create separate constants `REFTABLE_FORMAT_ID_*` and use them in
contexts where we read or write reftable headers. This serves multiple
purposes:

  - It allows us to more easily discern cases where we actually use
    those constants for the on-disk format.

  - It detangles us from the same constants that are defined in
    libgit.a, which is another required step to convert the reftable
    library to become standalone.

  - It makes the next step easier where we stop using `GIT_*_FORMAT_ID`
    constants in favor of a custom enum.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-19 12:23:09 +09:00
Patrick Steinhardt 20590cd287 reftable: handle trivial `reftable_buf` errors
Convert the reftable library such that we handle failures with the
new `reftable_buf` interfaces.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-17 16:59:56 -04:00
Patrick Steinhardt 4abc8022ff reftable/record: adapt `reftable_record_key()` to handle allocation failures
The `reftable_record_key()` function cannot pass any errors to the
caller as it has a `void` return type. Adapt it and its callers such
that we can handle errors and start handling allocation failures.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-17 16:59:56 -04:00
Patrick Steinhardt be4c070a3c reftable: convert from `strbuf` to `reftable_buf`
Convert the reftable library to use the `reftable_buf` interface instead
of the `strbuf` interface. This is mostly a mechanical change via sed(1)
with some manual fixes where functions for `strbuf` and `reftable_buf`
differ. The converted code does not yet handle allocation failures. This
will be handled in subsequent commits.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-17 16:59:56 -04:00
Patrick Steinhardt 409f04995e reftable: stop using `strbuf_addbuf()`
We're about to introduce our own `reftable_buf` type to replace
`strbuf`. Get rid of the seldomly-used `strbuf_addbuf()` function such
that we have to reimplement one less function.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-17 16:59:55 -04:00
Patrick Steinhardt 24e0ade65b reftable: introduce `REFTABLE_FREE_AND_NULL()`
We have several calls to `FREE_AND_NULL()` in the reftable library,
which of course uses free(3P). As the reftable allocators are pluggable
we should rather call the reftable specific function, which is
`reftable_free()`.

Introduce a new macro `REFTABLE_FREE_AND_NULL()` and adapt the callsites
accordingly.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-10-02 07:53:56 -07:00
Patrick Steinhardt 12b9078066 reftable: handle trivial allocation failures
Handle trivial allocation failures in the reftable library and its unit
tests.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-10-02 07:53:55 -07:00
Patrick Steinhardt 51afc709dc reftable/tree: handle allocation failures
The tree interfaces of the reftable library handle both insertion and
searching of tree nodes with a single function, where the behaviour is
altered between the two via an `insert` bit. This makes it quit awkward
to handle allocation failures because on inserting we'd have to check
for `NULL` pointers and return an error, whereas on searching entries we
don't have to handle it as an allocation error.

Split up concerns of this function into two separate functions, one for
inserting entries and one for searching entries. This makes it easy for
us to check for allocation errors as `tree_insert()` should never return
a `NULL` pointer now. Adapt callers accordingly.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-10-02 07:53:55 -07:00
Patrick Steinhardt 2d5dbb37b2 reftable/block: handle allocation failures
Handle allocation failures in `block_writer_init()` and
`block_reader_init()`. This requires us to bubble up error codes into
`writer_reinit_block_writer()`. Adapt call sites accordingly.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-10-02 07:53:55 -07:00
Patrick Steinhardt 74d1c18757 reftable/writer: handle allocation failures in `reftable_new_writer()`
Handle allocation failures in `reftable_new_writer()`. Adapt the
function to return an error code to return such failures. While at it,
rename it to match our code style as we have to touch up every callsite
anyway.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-10-02 07:53:52 -07:00
Patrick Steinhardt b680af2dba reftable/writer: handle allocation failures in `writer_index_hash()`
Handle allocation errors in `writer_index_hash()`. Adjust its only
caller in `reftable_writer_add_ref()` accordingly.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-10-02 07:53:52 -07:00
Jeff King 4695c3f3a9 reftable: mark unused parameters in virtual functions
The reftable code uses a lot of virtual function pointers, but many of
the concrete implementations do not need all of the parameters.

For the most part these are obviously fine to just mark as UNUSED (e.g.,
the empty_iterator functions unsurprisingly do not do anything). Here
are a few cases where I dug a little deeper (but still ended up just
marking them UNUSED):

  - the iterator exclude_patterns is best-effort and optional (though it
    would be nice to support in the long run as an optimization)

  - ignoring the ref_store in many transaction functions is unexpected,
    but works because the ref_transaction itself carries enough
    information to do what we need.

  - ignoring "err" for in some cases (e.g., transaction abort) is OK
    because we do not return any errors. It is a little odd for
    reftable_be_create_reflog(), though, since we do return errors
    there. We should perhaps be creating string error messages at this
    layer, but I've punted on that for now.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-08-17 09:46:12 -07:00
Patrick Steinhardt c22d75b027 reftable/writer: improve error when passed an invalid block size
The reftable format only supports block sizes up to 16MB. When the
writer is being passed a value bigger than that it simply calls
abort(3P), which isn't all that helpful due to the lack of a proper
error message.

Improve this by calling `BUG()` instead.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-13 17:02:38 -07:00
Patrick Steinhardt e0cf3d8f8b reftable/writer: drop static variable used to initialize strbuf
We have a static variable in the reftable writer code that is merely
used to initialize the `last_key` of the writer. Convert the code to
instead use `strbuf_init()` and drop the variable.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-13 17:02:38 -07:00
Patrick Steinhardt 799237852b reftable: pass opts as constant pointer
We sometimes pass the refatble write options as value and sometimes as a
pointer. This is quite confusing and makes the reader wonder whether the
options get modified sometimes.

In fact, `reftable_new_writer()` does cause the caller-provided options
to get updated when some values aren't set up. This is quite unexpected,
but didn't cause any harm until now.

Adapt the code so that we do not modify the caller-provided values
anymore. While at it, refactor the code to code to consistently pass the
options as a constant pointer to clarify that the caller-provided opts
will not ever get modified.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-13 17:02:38 -07:00
Patrick Steinhardt 8aaeffe3b5 reftable/writer: reset `last_key` instead of releasing it
The reftable writer tracks the last key that it has written so that it
can properly compute the compressed prefix for the next record it is
about to write. This last key must be reset whenever we move on to write
the next block, which is done in `writer_reinit_block_writer()`. We do
this by calling `strbuf_release()` though, which needlessly deallocates
the underlying buffer.

Convert the code to use `strbuf_reset()` instead, which saves one
allocation per block we're about to write. This requires us to also
amend `reftable_writer_free()` to release the buffer's memory now as we
previously seemingly relied on `writer_reinit_block_writer()` to release
the memory for us. Releasing memory here is the right thing to do
anyway.

While at it, convert a callsite where we truncate the buffer by setting
its length to zero to instead use `strbuf_reset()`, too.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-04-08 17:01:41 -07:00
Patrick Steinhardt 60dd319519 reftable/writer: unify releasing memory
There are two code paths which release memory of the reftable writer:

  - `reftable_writer_close()` releases internal state after it has
    written data.

  - `reftable_writer_free()` releases the block that was written to and
    the writer itself.

Both code paths free different parts of the writer, and consequently the
caller must make sure to call both. And while callers mostly do this
already, this falls apart when a write failure causes the caller to skip
calling `reftable_write_close()`.

Introduce a new function `reftable_writer_release()` that releases all
internal state and call it from both paths. Like this it is fine for the
caller to not call `reftable_writer_close()`.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-04-08 17:01:41 -07:00
Patrick Steinhardt 7e892fec47 reftable/writer: refactorings for `writer_flush_nonempty_block()`
Large parts of the reftable library do not conform to Git's typical code
style. Refactor `writer_flush_nonempty_block()` such that it conforms
better to it and add some documentation that explains some of its more
intricate behaviour.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-04-08 17:01:41 -07:00
Patrick Steinhardt d0dd119f72 reftable/writer: refactorings for `writer_add_record()`
Large parts of the reftable library do not conform to Git's typical code
style. Refactor `writer_add_record()` such that it conforms better to it
and add some documentation that explains some of its more intricate
behaviour.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-04-08 17:01:41 -07:00
Junio C Hamano f424d7c33d Merge branch 'ps/reftable-styles'
Code clean-up in various reftable code paths.

* ps/reftable-styles:
  reftable/record: improve semantics when initializing records
  reftable/merged: refactor initialization of iterators
  reftable/merged: refactor seeking of records
  reftable/stack: use `size_t` to track stack length
  reftable/stack: use `size_t` to track stack slices during compaction
  reftable/stack: index segments with `size_t`
  reftable/stack: fix parameter validation when compacting range
  reftable: introduce macros to allocate arrays
  reftable: introduce macros to grow arrays
2024-02-12 13:16:10 -08:00
Patrick Steinhardt b4ff12c8ee reftable: introduce macros to allocate arrays
Similar to the preceding commit, let's carry over macros to allocate
arrays with `REFTABLE_ALLOC_ARRAY()` and `REFTABLE_CALLOC_ARRAY()`. This
requires us to change the signature of `reftable_calloc()`, which only
takes a single argument right now and thus puts the burden on the caller
to calculate the final array's size. This is a net improvement though as
it means that we can now provide proper overflow checks when multiplying
the array size with the member size.

Convert callsites of `reftable_calloc()` to the new signature and start
using the new macros where possible.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-02-06 12:10:08 -08:00
Patrick Steinhardt f6b58c1be4 reftable: introduce macros to grow arrays
Throughout the reftable library we have many cases where we need to grow
arrays. In order to avoid too many reallocations, we roughly double the
capacity of the array on each iteration. The resulting code pattern is
duplicated across many sites.

We have similar patterns in our main codebase, which is why we have
eventually introduced an `ALLOC_GROW()` macro to abstract it away and
avoid some code duplication. We cannot easily reuse this macro here
though because `ALLOC_GROW()` uses `REALLOC_ARRAY()`, which in turn will
call realloc(3P) to grow the array. The reftable code is structured as a
library though (even if the boundaries are fuzzy), and one property this
brings with it is that it is possible to plug in your own allocators. So
instead of using realloc(3P), we need to use `reftable_realloc()` that
knows to use the user-provided implementation.

So let's introduce two new macros `REFTABLE_REALLOC_ARRAY()` and
`REFTABLE_ALLOC_GROW()` that mirror what we do in our main codebase,
with two modifications:

  - They use `reftable_realloc()`, as explained above.

  - They use a different growth factor of `2 * cap + 1` instead of `(cap
    + 16) * 3 / 2`.

The second change is because we know a bit more about the allocation
patterns in the reftable library. In most cases, we end up only having a
handful of items in the array and don't end up growing them. The initial
capacity that our normal growth factor uses (which is 24) would thus end
up over-allocating in a lot of code paths. This effect is measurable:

  - Before change:

      HEAP SUMMARY:
          in use at exit: 671,983 bytes in 152 blocks
        total heap usage: 3,843,446 allocs, 3,843,294 frees, 223,761,402 bytes allocated

  - After change with a growth factor of `(2 * alloc + 1)`:

      HEAP SUMMARY:
          in use at exit: 671,983 bytes in 152 blocks
        total heap usage: 3,843,446 allocs, 3,843,294 frees, 223,761,410 bytes allocated

  - After change with a growth factor of `(alloc + 16)* 2 / 3`:

      HEAP SUMMARY:
          in use at exit: 671,983 bytes in 152 blocks
        total heap usage: 3,833,673 allocs, 3,833,521 frees, 4,728,251,742 bytes allocated

While the total heap usage is roughly the same, we do end up allocating
significantly more bytes with our usual growth factor (in fact, roughly
21 times as many).

Convert the reftable library to use these new macros.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-02-06 12:10:08 -08:00
Patrick Steinhardt 4950acae7d reftable: document reading and writing indices
The way the index gets written and read is not trivial at all and
requires the reader to piece together a bunch of parts to figure out how
it works. Add some documentation to hopefully make this easier to
understand for the next reader.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-02-01 11:11:33 -08:00
Patrick Steinhardt e7485601ca reftable/writer: fix writing multi-level indices
When finishing a section we will potentially write an index that makes
it more efficient to look up relevant blocks. The index records written
will encode, for each block of the indexed section, what the offset of
that block is as well as the last key of that block. Thus, the reader
would iterate through the index records to find the first key larger or
equal to the wanted key and then use the encoded offset to look up the
desired block.

When there are a lot of blocks to index though we may end up writing
multiple index blocks, too. To not require a linear search across all
index blocks we instead end up writing a multi-level index. Instead of
referring to the block we are after, an index record may point to
another index block. The reader will then access the highest-level index
and follow down the chain of index blocks until it hits the sought-after
block.

It has been observed though that it is impossible to seek ref records of
the last ref block when using a multi-level index. While the multi-level
index exists and looks fine for most of the part, the highest-level
index was missing an index record pointing to the last block of the next
index. Thus, every additional level made more refs become unseekable at
the end of the ref section.

The root cause is that we are not flushing the last block of the current
level once done writing the level. Consequently, it wasn't recorded in
the blocks that need to be indexed by the next-higher level and thus we
forgot about it.

Fix this bug by flushing blocks after we have written all index records.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-02-01 11:11:32 -08:00
Patrick Steinhardt b66e006ff5 reftable/writer: simplify writing index records
When finishing the current section some index records might be written
for the section to the table. The logic that adds these records to the
writer duplicates what we already have in `writer_add_record()`, making
this more complicated than it really has to be.

Simplify the code by using `writer_add_record()` instead. While at it,
drop the unneeded braces around a loop to make the code conform to our
code style better.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-02-01 11:11:32 -08:00
Patrick Steinhardt 9ebb2d7b08 reftable/writer: use correct type to iterate through index entries
The reftable writer is tracking the number of blocks it has to index via
the `index_len` variable. But while this variable is of type `size_t`,
some sites use an `int` to loop through the index entries.

Convert the code to consistently use `size_t`.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-02-01 11:11:32 -08:00
John Cai 1df18a1c9a reftable: honor core.fsync
While the reffiles backend honors configured fsync settings, the
reftable backend does not. Address this by fsyncing reftable files using
the write-or-die api's fsync_component() in two places: when we
add additional entries into the table, and when we close the reftable
writer.

This commits adds a flush function pointer as a new member of
reftable_writer because we are not sure that the first argument to the
*write function pointer always contains a file descriptor. In the case of
strbuf_add_void, the first argument is a buffer. This way, we can pass
in a corresponding flush function that knows how to flush depending on
which writer is being used.

This patch does not contain tests as they will need to wait for another
patch to start to exercise the reftable backend. At that point, the
tests will be added to observe that fsyncs are happening when the
reftable is in use.

Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-01-23 13:45:27 -08:00
Patrick Steinhardt ddac965965 reftable/writer: fix index corruption when writing multiple indices
Each reftable may contain multiple types of blocks for refs, objects and
reflog records, where each of these may have an index that makes it more
efficient to find the records. It was observed that the index for log
records can become corrupted under certain circumstances, where the
first entry of the index points into the object index instead of to the
log records.

As it turns out, this corruption can occur whenever we write a log index
as well as at least one additional index. Writing records and their index
is basically a two-step process:

  1. We write all blocks for the corresponding record. Each block that
     gets written is added to a list of blocks to index.

  2. Once all blocks were written we finish the section. If at least two
     blocks have been added to the list of blocks to index then we will
     now write the index for those blocks and flush it, as well.

When we have a very large number of blocks then we may decide to write a
multi-level index, which is why we also keep track of the list of the
index blocks in the same way as we previously kept track of the blocks
to index.

Now when we have finished writing all index blocks we clear the index
and flush the last block to disk. This is done in the wrong order though
because flushing the block to disk will re-add it to the list of blocks
to be indexed. The result is that the next section we are about to write
will have an entry in the list of blocks to index that points to the
last block of the preceding section's index, which will corrupt the log
index.

Fix this corruption by clearing the index after having written the last
block.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-01-03 09:54:20 -08:00