Commit Graph

249 Commits (0c9d6b7ced47d24cf3126a8b8eff3e4c2a0f8aad)

Author SHA1 Message Date
Junio C Hamano 36d8035d27 Merge branch 'ps/object-file-cleanup'
Code clean-up.

* ps/object-file-cleanup:
  object-store: merge "object-store-ll.h" and "object-store.h"
  object-store: remove global array of cached objects
  object: split out functions relating to object store subsystem
  object-file: drop `index_blob_stream()`
  object-file: split up concerns of `HASH_*` flags
  object-file: split out functions relating to object store subsystem
  object-file: move `xmmap()` into "wrapper.c"
  object-file: move `git_open_cloexec()` to "compat/open.c"
  object-file: move `safe_create_leading_directories()` into "path.c"
  object-file: move `mkdir_in_gitdir()` into "path.c"
2025-04-24 17:25:33 -07:00
Junio C Hamano a271b05066 Merge branch 'ps/cat-file-filter-batch'
"git cat-file --batch" and friends learned to allow "--filter=" to
omit certain objects, just like the transport layer does.

* ps/cat-file-filter-batch:
  builtin/cat-file: use bitmaps to efficiently filter by object type
  builtin/cat-file: deduplicate logic to iterate over all objects
  pack-bitmap: introduce function to check whether a pack is bitmapped
  pack-bitmap: add function to iterate over filtered bitmapped objects
  pack-bitmap: allow passing payloads to `show_reachable_fn()`
  builtin/cat-file: support "object:type=" objects filter
  builtin/cat-file: support "blob:limit=" objects filter
  builtin/cat-file: support "blob:none" objects filter
  builtin/cat-file: wire up an option to filter objects
  builtin/cat-file: introduce function to report object status
  builtin/cat-file: rename variable that tracks usage
2025-04-16 13:54:21 -07:00
Junio C Hamano ee847e0034 Merge branch 'ps/object-wo-the-repository'
The object layer has been updated to take an explicit repository
instance as a parameter in more code paths.

* ps/object-wo-the-repository:
  hash: stop depending on `the_repository` in `null_oid()`
  hash: fix "-Wsign-compare" warnings
  object-file: split out logic regarding hash algorithms
  delta-islands: stop depending on `the_repository`
  object-file-convert: stop depending on `the_repository`
  pack-bitmap-write: stop depending on `the_repository`
  pack-revindex: stop depending on `the_repository`
  pack-check: stop depending on `the_repository`
  environment: move access to "core.bigFileThreshold" into repo settings
  pack-write: stop depending on `the_repository` and `the_hash_algo`
  object: stop depending on `the_repository`
  csum-file: stop depending on `the_repository`
2025-04-15 13:50:15 -07:00
Patrick Steinhardt 68cd492a3e object-store: merge "object-store-ll.h" and "object-store.h"
The "object-store-ll.h" header has been introduced to keep transitive
header dependendcies and compile times at bay. Now that we have created
a new "object-store.c" file though we can easily move the last remaining
additional bit of "object-store.h", the `odb_path_map`, out of the
header.

Do so. As the "object-store.h" header is now equivalent to its low-level
alternative we drop the latter and inline it into the former.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-04-15 08:24:37 -07:00
Patrick Steinhardt 97dc141fd6 object-file: move `git_open_cloexec()` to "compat/open.c"
The `git_open_cloexec()` wrapper function provides the ability to open a
file with `O_CLOEXEC` in a platform-agnostic way. This function is
provided by "object-file.c" even though it is not specific to the object
subsystem at all.

Move the file into "compat/open.c". This file already exists before this
commit, but has only been compiled conditionally depending on whether or
not open(3p) may return EINTR. With this change we now unconditionally
compile the object, but wrap `git_open_with_retry()` in an ifdef.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-04-15 08:24:35 -07:00
Patrick Steinhardt c9b94a7785 pack-bitmap: introduce function to check whether a pack is bitmapped
Introduce a function that allows us to verify whether a pack is
bitmapped or not. This functionality will be used in a subsequent
commit.

Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-04-07 14:43:51 -07:00
Patrick Steinhardt 5420901bde pack-bitmap: add function to iterate over filtered bitmapped objects
Introduce a function that allows the caller to iterate over all
bitmapped objects that match a given filter. This mechanism will be used
in a subsequent commit to optimize object filters in git-cat-file(1).

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-04-07 14:43:51 -07:00
Patrick Steinhardt 3d45483846 pack-bitmap: allow passing payloads to `show_reachable_fn()`
The `show_reachable_fn` callback is used by a couple of functions to
present reachable objects to the caller. The function does not provide a
way for the caller to pass a payload though, which is functionality that
we'll require in a subsequent commit.

Change the callback type to accept a payload and adapt all callsites
accordingly.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-04-07 14:43:51 -07:00
Taylor Blau 5999b44fcb pack-bitmap.c: use `ewah_or_iterator` for type bitmap iterators
Now that we have initialized arrays for each bitmap layer's type bitmaps
in the previous commit, adjust existing callers to use them in
preparation for multi-layered bitmaps.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Acked-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-21 04:34:13 -07:00
Taylor Blau e07af41139 pack-bitmap.c: keep track of each layer's type bitmaps
Prepare for reading the type-level bitmaps from previous bitmap layers
by maintaining an array for each type, where each element in that type's
array corresponds to one layer's bitmap for that type.

These fields will be used in a later commit to instantiate the 'struct
ewah_or_iterator' for each type.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Acked-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-21 04:34:09 -07:00
Taylor Blau e2837e29e6 pack-bitmap.c: apply pseudo-merge commits with incremental MIDXs
Prepare for using pseudo-merges with incremental MIDX bitmaps by
attempting to apply pseudo-merges from each layer when encountering a
given commit during a walk.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Acked-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-21 04:34:01 -07:00
Taylor Blau db17e777c8 pack-bitmap.c: compute disk-usage with incremental MIDXs
In a similar fashion as previous commits, use nth_midxed_pack() instead
of accessing the MIDX's ->packs array directly to support incremental
MIDXs.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Acked-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-21 04:33:57 -07:00
Taylor Blau faab03f9a1 pack-bitmap.c: teach `rev-list --test-bitmap` about incremental MIDXs
Implement support for the special `--test-bitmap` mode of `git rev-list`
when using incremental MIDXs.

The bitmap_test_data structure is extended to contain a "base" pointer
that mirrors the structure of the bitmap chain that it is being used to
test.

When we find a commit to test, we first chase down the ->base pointer to
find the appropriate bitmap_test_data for the bitmap layer that the
given commit is contained within, and then perform the test on that
bitmap.

In order to implement this, light modifications are made to
bitmap_for_commit() to reimplement it in terms of a new function,
find_bitmap_for_commit(), which fills out a pointer which indicates the
bitmap layer which contains the given commit.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Acked-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-21 04:33:53 -07:00
Taylor Blau 3886c72621 pack-bitmap.c: support bitmap pack-reuse with incremental MIDXs
In a similar fashion as previous commits in the first phase of
incremental MIDXs, enumerate not just the packs in the current
incremental MIDX layer, but previous ones as well.

Likewise, in reuse_partial_packfile_from_bitmap(), when reusing only a
single pack from a MIDX, use the oldest layer's preferred pack as it is
likely to contain the largest number of reusable sections.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Acked-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-21 04:33:49 -07:00
Taylor Blau f77dbf0285 pack-bitmap.c: teach `show_objects_for_type()` about incremental MIDXs
Since we may ask for a pack_id that is in an earlier MIDX layer relative
to the one corresponding to our bitmap, use nth_midxed_pack() instead of
accessing the ->packs array directly.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Acked-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-21 04:33:45 -07:00
Taylor Blau ae61324f0a pack-bitmap.c: teach `bitmap_for_commit()` about incremental MIDXs
The pack-bitmap machinery uses `bitmap_for_commit()` to locate the
EWAH-compressed bitmap corresponding to some given commit object.

Teach this function about incremental MIDX bitmaps by teaching it to
recur on earlier bitmap layers when it fails to find a given commit in
the current layer.

The changes to do so are as follows:

  - Avoid initializing hash_pos at its declaration, since
    bitmap_for_commit() is now a recursive function and may receive a
    NULL bitmap_index pointer as its first argument.

  - In cases where we would previously return NULL (to indicate that a
    lookup failed and the given bitmap_index does not contain an entry
    corresponding to the given commit), recursively call the function on
    the previous bitmap layer.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Acked-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-21 04:33:41 -07:00
Taylor Blau f31a17cea5 pack-bitmap.c: open and store incremental bitmap layers
Prepare the pack-bitmap machinery to work with incremental MIDXs by
adding a new "base" field to keep track of the bitmap index associated
with the previous MIDX layer.

The changes in this commit are mostly boilerplate to open the correct
bitmap(s), add them to the chain of bitmap layers along the "base"
pointer, ensure that the correct packs and their reverse indexes are
loaded across MIDX layers, etc.

While we're at it, keep track of a base_nr field to indicate how many
bitmap layers (including the current bitmap) exist. This will be used in
a future commit to allocate an array of 'struct ewah_bitmap' pointers to
collect all of the respective type bitmaps among all layers to
initialize a multi-EWAH iterator.

Subsequent commits will teach the functions within the pack-bitmap
machinery how to interact with these new fields.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Acked-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-21 04:33:36 -07:00
Taylor Blau 8331c17b79 pack-revindex: prepare for incremental MIDX bitmaps
Prepare the reverse index machinery to handle object lookups in an
incremental MIDX bitmap. These changes are broken out across a few
functions:

  - load_midx_revindex() learns to use the appropriate MIDX filename
    depending on whether the given 'struct multi_pack_index *' is
    incremental or not.

  - pack_pos_to_midx() and midx_to_pack_pos() now both take in a global
    object position in the MIDX pseudo-pack order, and find the
    earliest containing MIDX (similar to midx.c::midx_for_object().

  - midx_pack_order_cmp() adjusts its call to pack_pos_to_midx() by the
    number of objects in the base (since 'vb - midx->revindx_data' is
    relative to the containing MIDX, and pack_pos_to_midx() expects a
    global position).

    Likewise, this function adjusts its output by adding
    m->num_objects_in_base to return a global position out through the
    `*pos` pointer.

Together, these changes are sufficient to use the multi-pack index's
reverse index format for incremental multi-pack reachability bitmaps.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Acked-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-21 04:33:33 -07:00
Patrick Steinhardt 74d414c9f1 object: stop depending on `the_repository`
There are a couple of functions exposed by "object.c" that implicitly
depend on `the_repository`. Remove this dependency by injecting the
repository via a parameter. Adapt callers accordingly by simply using
`the_repository`, except in cases where the subsystem is already free of
the repository. In that case, we instead pass the repository provided by
the caller's context.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-10 13:16:18 -07:00
Patrick Steinhardt 228457c9d9 csum-file: stop depending on `the_repository`
There are multiple sites in "csum-file.c" where we use the global
`the_repository` variable, either explicitly or implicitly by using
`the_hash_algo`.

Refactor the code to stop using `the_repository` by adapting functions
to receive required data as parameters. Adapt callsites accordingly by
either using `the_repository->hash_algo`, or by using a context-provided
hash algorithm in case the subsystem already got rid of its dependency
on `the_repository`.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-10 13:16:18 -07:00
Junio C Hamano 7b39a128c8 Merge branch 'ps/the-repository'
More code paths have a repository passed through the callchain,
instead of assuming the primary the_repository object.

* ps/the-repository:
  match-trees: stop using `the_repository`
  graph: stop using `the_repository`
  add-interactive: stop using `the_repository`
  tmp-objdir: stop using `the_repository`
  resolve-undo: stop using `the_repository`
  credential: stop using `the_repository`
  mailinfo: stop using `the_repository`
  diagnose: stop using `the_repository`
  server-info: stop using `the_repository`
  send-pack: stop using `the_repository`
  serve: stop using `the_repository`
  trace: stop using `the_repository`
  pager: stop using `the_repository`
  progress: stop using `the_repository`
2025-01-21 08:44:54 -08:00
Junio C Hamano a08ebf8b3e Merge branch 'tb/bitmap-fix-pack-reuse'
Code to reuse objects based on bitmap contents have been tightened
to avoid race condition even when multiple packs are involved.

* tb/bitmap-fix-pack-reuse:
  pack-bitmap.c: ensure pack validity for all reuse packs
2024-12-23 09:32:22 -08:00
Junio C Hamano 4156b6a741 Merge branch 'ps/build-sign-compare'
Start working to make the codebase buildable with -Wsign-compare.

* ps/build-sign-compare:
  t/helper: don't depend on implicit wraparound
  scalar: address -Wsign-compare warnings
  builtin/patch-id: fix type of `get_one_patchid()`
  builtin/blame: fix type of `length` variable when emitting object ID
  gpg-interface: address -Wsign-comparison warnings
  daemon: fix type of `max_connections`
  daemon: fix loops that have mismatching integer types
  global: trivial conversions to fix `-Wsign-compare` warnings
  pkt-line: fix -Wsign-compare warning on 32 bit platform
  csum-file: fix -Wsign-compare warning on 32-bit platform
  diff.h: fix index used to loop through unsigned integer
  config.mak.dev: drop `-Wno-sign-compare`
  global: mark code units that generate warnings with `-Wsign-compare`
  compat/win32: fix -Wsign-compare warning in "wWinMain()"
  compat/regex: explicitly ignore "-Wsign-compare" warnings
  git-compat-util: introduce macros to disable "-Wsign-compare" warnings
2024-12-23 09:32:11 -08:00
Patrick Steinhardt 1f7e6478dc progress: stop using `the_repository`
Stop using `the_repository` in the "progress" subsystem by passing in a
repository when initializing `struct progress`. Furthermore, store a
pointer to the repository in that struct so that we can pass it to the
trace2 API when logging information.

Adjust callers accordingly by using `the_repository`. While there may be
some callers that have a repository available in their context, this
trivial conversion allows for easier verification and bubbles up the use
of `the_repository` by one level.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-18 10:44:30 -08:00
Junio C Hamano 913a1e157c Merge branch 'ps/build-sign-compare' into ps/the-repository
* ps/build-sign-compare:
  t/helper: don't depend on implicit wraparound
  scalar: address -Wsign-compare warnings
  builtin/patch-id: fix type of `get_one_patchid()`
  builtin/blame: fix type of `length` variable when emitting object ID
  gpg-interface: address -Wsign-comparison warnings
  daemon: fix type of `max_connections`
  daemon: fix loops that have mismatching integer types
  global: trivial conversions to fix `-Wsign-compare` warnings
  pkt-line: fix -Wsign-compare warning on 32 bit platform
  csum-file: fix -Wsign-compare warning on 32-bit platform
  diff.h: fix index used to loop through unsigned integer
  config.mak.dev: drop `-Wno-sign-compare`
  global: mark code units that generate warnings with `-Wsign-compare`
  compat/win32: fix -Wsign-compare warning in "wWinMain()"
  compat/regex: explicitly ignore "-Wsign-compare" warnings
  git-compat-util: introduce macros to disable "-Wsign-compare" warnings
2024-12-18 10:43:16 -08:00
Taylor Blau 62b3ec8a3f pack-bitmap.c: ensure pack validity for all reuse packs
Commit 44f9fd6496 (pack-bitmap.c: check preferred pack validity when
opening MIDX bitmap, 2022-05-24) prevents a race condition whereby the
preferred pack disappears between opening the MIDX bitmap and attempting
verbatim reuse out of its packs.

That commit forces open_midx_bitmap_1() to ensure the validity of the
MIDX's preferred pack, meaning that we have an open file handle on the
*.pack, ensuring that we can reuse bytes out of verbatim later on in the
process[^1].

But 44f9fd6496 was not extended to cover multi-pack reuse, meaning that
this same race condition exists for non-preferred packs during verbatim
reuse. Work around that race in the same way by only marking valid packs
as reuse-able. For packs that aren't reusable, skip over them but
include the number of objects they have to ensure we allocate a large
enough 'reuse' bitmap (e.g. if a pack in the middle of the MIDX
disappeared but we still want to reuse later packs).

Since we're ensuring the validity of these packs within the verbatim
reuse code, we no longer have to special-case the preferred pack and
open it within the open_midx_bitmap_1() function.

An alternative approach to the one taken here would be to open all
MIDX'd packs from within open_midx_bitmap_1(). But that would be both
slower and make the bitmaps less useful, since we can still perform some
pack reuse among the packs that still exist when the *.bitmap is opened.

After applying this patch, we can simulate the new behavior after
instrumenting Git like so:

    diff --git a/packfile.c b/packfile.c
    index 9560f0a33c..aedce72524 100644
    --- a/packfile.c
    +++ b/packfile.c
    @@ -557,6 +557,11 @@ static int open_packed_git_1(struct packed_git *p)
     		; /* nothing */

     	p->pack_fd = git_open(p->pack_name);
    +	{
    +		const char *delete = getenv("GIT_RACILY_DELETE");
    +		if (delete && !strcmp(delete, pack_basename(p)))
    +			return -1;
    +	}
     	if (p->pack_fd < 0 || fstat(p->pack_fd, &st))
     		return -1;
     	pack_open_fds++;

and adding the following test:

    test_expect_success 'disappearing packs' '
            git init disappearing-packs &&
            (
                    cd disappearing-packs &&

                    git config pack.allowPackReuse multi &&

                    test_commit A &&
                    test_commit B &&
                    test_commit C &&

                    A="$(echo "A" | git pack-objects --revs $packdir/pack-A)" &&
                    B="$(echo "A..B" | git pack-objects --revs $packdir/pack-B)" &&
                    C="$(echo "B..C" | git pack-objects --revs $packdir/pack-C)" &&

                    git multi-pack-index write --bitmap --preferred-pack=pack-A-$A.idx &&

                    test_pack_objects_reused_all 9 3 &&

                    test_env GIT_RACILY_DELETE=pack-A-$A.pack \
                            test_pack_objects_reused_all 6 2 &&
                    test_env GIT_RACILY_DELETE=pack-B-$B.pack \
                            test_pack_objects_reused_all 6 2 &&
                    test_env GIT_RACILY_DELETE=pack-C-$C.pack \
                            test_pack_objects_reused_all 6 2

            )
    '

Note that we could relax the single-pack version of this which was most
recently addressed in dc1daacdcc (pack-bitmap: check pack validity when
opening bitmap, 2021-07-23), but only partially. Because we still need
to know the object count in the pack, we'd still have to open the pack's
*.idx, so the savings there are marginal.

Note likewise that we add a new "if (!packs_nr)" early return in the
pack reuse code to avoid a potentially expensive allocation on the
'reuse' bitmap in the case that no packs are available for reuse.

[^1]: Unless we run out of open file handles. If that happens and we are
  forced to close the only open file handle of a file that has been
  removed from underneath us, there is nothing we can do.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-18 09:51:09 -08:00
Junio C Hamano ca43bd2562 Merge branch 'kn/midx-wo-the-repository'
Yet another "pass the repository through the callchain" topic.

* kn/midx-wo-the-repository:
  midx: inline the `MIDX_MIN_SIZE` definition
  midx: pass down `hash_algo` to functions using global variables
  midx: pass `repository` to `load_multi_pack_index`
  midx: cleanup internal usage of `the_repository` and `the_hash_algo`
  midx-write: pass down repository to `write_midx_file[_only]`
  write-midx: add repository field to `write_midx_context`
  midx-write: use `revs->repo` inside `read_refs_snapshot`
  midx-write: pass down repository to static functions
  packfile.c: remove unnecessary prepare_packed_git() call
  midx: add repository to `multi_pack_index` struct
  config: make `packed_git_(limit|window_size)` non-global variables
  config: make `delta_base_cache_limit` a non-global variable
  packfile: pass down repository to `for_each_packed_object`
  packfile: pass down repository to `has_object[_kept]_pack`
  packfile: pass down repository to `odb_pack_name`
  packfile: pass `repository` to static function in the file
  packfile: use `repository` from `packed_git` directly
  packfile: add repository to struct `packed_git`
2024-12-13 07:33:44 -08:00
Patrick Steinhardt 41f43b8243 global: mark code units that generate warnings with `-Wsign-compare`
Mark code units that generate warnings with `-Wsign-compare`. This
allows for a structured approach to get rid of all such warnings over
time in a way that can be easily measured.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-06 20:20:02 +09:00
Karthik Nayak f59de71cf7 midx: pass down `hash_algo` to functions using global variables
The functions `get_split_midx_filename_ext()`, `get_midx_filename()` and
`get_midx_filename_ext()` use `hash_to_hex()` which internally uses the
`the_hash_algo` global variable.

Remove this dependency on global variables by passing down the
`hash_algo` through to the functions mentioned and instead calling
`hash_to_hex_algop()` along with the obtained `hash_algo`.

Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-04 10:32:21 +09:00
Junio C Hamano 0a0712e05f Merge branch 'tb/boundary-traversal-fix'
A trivial "correctness" fix that does not yet matter in practice.

* tb/boundary-traversal-fix:
  pack-bitmap.c: typofix in `find_boundary_objects()`
2024-12-04 10:14:44 +09:00
Karthik Nayak e106040722 midx: add repository to `multi_pack_index` struct
The `multi_pack_index` struct represents the MIDX for a repository.
Here, we add a pointer to the repository in this struct, allowing direct
use of the repository variable without relying on the global
`the_repository` struct.

With this addition, we can determine the repository associated with a
`bitmap_index` struct. A `bitmap_index` points to either a `packed_git`
or a `multi_pack_index`, both of which have direct repository
references. To support this, we introduce a static helper function,
`bitmap_repo`, in `pack-bitmap.c`, which retrieves a repository given a
`bitmap_index`.

With this, we clear up all usages of `the_repository` within
`pack-bitmap.c` and also remove the `USE_THE_REPOSITORY_VARIABLE`
definition. Bringing us another step closer to remove all global
variable usage.

Although this change also opens up the potential to clean up `midx.c`,
doing so would require additional refactoring to pass the repository
struct to functions where the MIDX struct is created: a task better
suited for future patches.

Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-04 08:21:55 +09:00
Karthik Nayak cc656f4eb2 packfile: pass down repository to `has_object[_kept]_pack`
The functions `has_object[_kept]_pack` currently rely on the global
variable `the_repository`. To eliminate global variable usage in
`packfile.c`, we should progressively shift the dependency on
the_repository to higher layers. Let's remove its usage from these
functions and any related ones.

Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-04 08:21:54 +09:00
Taylor Blau 91f88f76e6 pack-bitmap.c: typofix in `find_boundary_objects()`
In the boundary-based bitmap traversal, we use the given 'rev_info'
structure to first do a commit-only walk in order to determine the
boundary between interesting and uninteresting objects.

That walk only looks at commit objects, regardless of the state of
revs->blob_objects, revs->tree_objects, and so on. In order to do this,
we store the state of these variables in temporary fields before
setting them back to zero, performing the traversal, and then setting
them back.

But there is a typo here that dates back to b0afdce5da (pack-bitmap.c:
use commit boundary during bitmap traversal, 2023-05-08), where we
incorrectly store the value of the "tags" field as "revs->blob_objects".

This could lead to problems later on if, say, the caller wants tag
objects but *not* blob objects. In the pre-image behavior, we'd set
revs->tag_objects back to the old value of revs->blob_objects, thus
emitting fewer objects than expected back to the caller.

Fix that by correctly assigning the value of 'revs->tag_objects' to the
'tmp_tags' field.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-22 08:57:18 +09:00
Jeff King 479ab76c9f packfile: use object_id in find_pack_entry_one()
The main function we use to search a pack index for an object is
find_pack_entry_one(). That function still takes a bare pointer to the
hash, despite the fact that its underlying bsearch_pack() function needs
an object_id struct. And so we end up making an extra copy of the hash
into the struct just to do a lookup.

As it turns out, all callers but one already have such an object_id. So
we can just take a pointer to that struct and use it directly. This
avoids the extra copy and provides a more type-safe interface.

The one exception is get_delta_base() in packfile.c, when we are chasing
a REF_DELTA from inside the pack (and thus we have a pointer directly to
the mmap'd pack memory, not a struct). We can just bump the hashcpy()
from inside find_pack_entry_one() to this one caller that needs it.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-25 17:35:46 -04:00
Patrick Steinhardt 55e563a90c pseudo-merge: fix various memory leaks
Fix various memory leaks hit by the pseudo-merge machinery. These leaks
are exposed by t5333, but plugging them does not yet make the whole test
suite pass.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-09-30 11:23:06 -07:00
Taylor Blau db40e3c92b pack-bitmap.c: avoid repeated `pack_pos_to_offset()` during reuse
When calling `try_partial_reuse()`, the (sole) caller from the function
`reuse_partial_packfile_from_bitmap_1()` has to translate its bit
position to a pack position.

In the MIDX bitmap case, the caller translates from the bit position, to
a position in the MIDX's pseudo-pack order (with `pack_pos_to_midx()`),
then get a pack offset (with `nth_midxed_offset()`) before finally
working backwards to get the pack position in the source pack by calling
`offset_to_pack_pos()`.

In the non-MIDX bitmap case, we can use the bit position as the pack
position directly (see the comment at the beginning of the
`reuse_partial_packfile_from_bitmap_1()` function for why).

In either case, the first thing that `try_partial_reuse()` does after
being called is determine the offset of the object at the given pack
position by calling `pack_pos_to_offset()`. But we already have that
information in the MIDX case!

Avoid re-computing that information by instead passing it in. In the
MIDX case, we already have that information stored. In the non-MIDX
case, the call to `pack_pos_to_offset()` moves from the function
`try_partial_reuse()` to its caller. In total, we'll save one call to
`pack_pos_to_offset()` when processing MIDX bitmaps.

(On my machine, there is a slight speed-up on the order of ~2ms, but it
is within the margin of error over 10 runs, so I think you'd have to
have a truly gigantic repository to confidently measure any significant
improvement here).

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-08-27 14:50:27 -07:00
Taylor Blau 41cd4b478f pack-bitmap: tag bitmapped packs with their corresponding MIDX
The next commit will need to use the bitmap's MIDX (if one exists) to
translate bit positions into pack-relative positions in the source pack.

Ordinarily, we'd use the "midx" field of the bitmap_index struct. But
since that struct is defined within pack-bitmap.c, and our caller is in
a separate compilation unit, we do not have access to the MIDX field.

Instead, add a "from_midx" field to the bitmapped_pack structure so that
we can use that piece of data from outside of pack-bitmap.c. The caller
that uses this new piece of information will be added in the following
commit.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-08-27 14:50:26 -07:00
Junio C Hamano 7b472da915 Merge branch 'ps/use-the-repository'
A CPP macro USE_THE_REPOSITORY_VARIABLE is introduced to help
transition the codebase to rely less on the availability of the
singleton the_repository instance.

* ps/use-the-repository:
  hex: guard declarations with `USE_THE_REPOSITORY_VARIABLE`
  t/helper: remove dependency on `the_repository` in "proc-receive"
  t/helper: fix segfault in "oid-array" command without repository
  t/helper: use correct object hash in partial-clone helper
  compat/fsmonitor: fix socket path in networked SHA256 repos
  replace-object: use hash algorithm from passed-in repository
  protocol-caps: use hash algorithm from passed-in repository
  oidset: pass hash algorithm when parsing file
  http-fetch: don't crash when parsing packfile without a repo
  hash-ll: merge with "hash.h"
  refs: avoid include cycle with "repository.h"
  global: introduce `USE_THE_REPOSITORY_VARIABLE` macro
  hash: require hash algorithm in `empty_tree_oid_hex()`
  hash: require hash algorithm in `is_empty_{blob,tree}_oid()`
  hash: make `is_null_oid()` independent of `the_repository`
  hash: convert `oidcmp()` and `oideq()` to compare whole hash
  global: ensure that object IDs are always padded
  hash: require hash algorithm in `oidread()` and `oidclr()`
  hash: require hash algorithm in `hasheq()`, `hashcmp()` and `hashclr()`
  hash: drop (mostly) unused `is_empty_{blob,tree}_sha1()` functions
2024-07-02 09:59:00 -07:00
Junio C Hamano ffa47b75cf Merge branch 'tb/pseudo-merge-reachability-bitmap'
The pseudo-merge reachability bitmap to help more efficient storage
of the reachability bitmap in a repository with too many refs has
been added.

* tb/pseudo-merge-reachability-bitmap: (26 commits)
  pack-bitmap.c: ensure pseudo-merge offset reads are bounded
  Documentation/technical/bitmap-format.txt: add missing position table
  t/perf: implement performance tests for pseudo-merge bitmaps
  pseudo-merge: implement support for finding existing merges
  ewah: `bitmap_equals_ewah()`
  pack-bitmap: extra trace2 information
  pack-bitmap.c: use pseudo-merges during traversal
  t/test-lib-functions.sh: support `--notick` in `test_commit_bulk()`
  pack-bitmap: implement test helpers for pseudo-merge
  ewah: implement `ewah_bitmap_popcount()`
  pseudo-merge: implement support for reading pseudo-merge commits
  pack-bitmap.c: read pseudo-merge extension
  pseudo-merge: scaffolding for reads
  pack-bitmap: extract `read_bitmap()` function
  pack-bitmap-write.c: write pseudo-merge table
  pseudo-merge: implement support for selecting pseudo-merge commits
  config: introduce `git_config_double()`
  pack-bitmap: make `bitmap_writer_push_bitmapped_commit()` public
  pack-bitmap: implement `bitmap_writer_has_bitmapped_object_id()`
  pack-bitmap-write: support storing pseudo-merge commits
  ...
2024-06-24 16:39:13 -07:00
Taylor Blau a83e21de6b pack-bitmap.c: ensure pseudo-merge offset reads are bounded
After reading the pseudo-merge extension's metadata table, we allocate
an array to store information about each pseudo-merge, including its
byte offset within the .bitmap file itself.

This is done like so:

    pseudo_merge_ofs = index_end - 24 -
            (index->pseudo_merges.nr * sizeof(uint64_t));
    for (i = 0; i < index->pseudo_merges.nr; i++) {
            index->pseudo_merges.v[i].at = get_be64(pseudo_merge_ofs);
            pseudo_merge_ofs += sizeof(uint64_t);
    }

But if the pseudo-merge table is corrupt, we'll keep calling get_be64()
past the end of the pseudo-merge extension, potentially reading off the
end of the mmap'd region.

Prevent this by ensuring that we have at least `table_size - 24` many
bytes available to read (adding 24 to the left-hand side of our
inequality to account for the length of the metadata component).

This is sufficient to prevent us from reading off the end of the
pseudo-merge extension, and ensures that all of the get_be64() calls
below are in bounds.

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-14 14:19:27 -07:00
Patrick Steinhardt e7da938570 global: introduce `USE_THE_REPOSITORY_VARIABLE` macro
Use of the `the_repository` variable is deprecated nowadays, and we
slowly but steadily convert the codebase to not use it anymore. Instead,
callers should be passing down the repository to work on via parameters.

It is hard though to prove that a given code unit does not use this
variable anymore. The most trivial case, merely demonstrating that there
is no direct use of `the_repository`, is already a bit of a pain during
code reviews as the reviewer needs to manually verify claims made by the
patch author. The bigger problem though is that we have many interfaces
that implicitly rely on `the_repository`.

Introduce a new `USE_THE_REPOSITORY_VARIABLE` macro that allows code
units to opt into usage of `the_repository`. The intent of this macro is
to demonstrate that a certain code unit does not use this variable
anymore, and to keep it from new dependencies on it in future changes,
be it explicit or implicit

For now, the macro only guards `the_repository` itself as well as
`the_hash_algo`. There are many more known interfaces where we have an
implicit dependency on `the_repository`, but those are not guarded at
the current point in time. Over time though, we should start to add
guards as required (or even better, just remove them).

Define the macro as required in our code units. As expected, most of our
code still relies on the global variable. Nearly all of our builtins
rely on the variable as there is no way yet to pass `the_repository` to
their entry point. For now, declare the macro in "biultin.h" to keep the
required changes at least a little bit more contained.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-14 10:26:33 -07:00
Patrick Steinhardt f4836570a7 hash: require hash algorithm in `hasheq()`, `hashcmp()` and `hashclr()`
Many of our hash functions have two variants, one receiving a `struct
git_hash_algo` and one that derives it via `the_repository`. Adapt all
of those functions to always require the hash algorithm as input and
drop the variants that do not accept one.

As those functions are now independent of `the_repository`, we can move
them from "hash.h" to "hash-ll.h".

Note that both in this and subsequent commits in this series we always
just pass `the_repository->hash_algo` as input even if it is obvious
that there is a repository in the context that we should be using the
hash from instead. This is done to be on the safe side and not introduce
any regressions. All callsites should eventually be amended to use a
repo passed via parameters, but this is outside the scope of this patch
series.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-14 10:26:32 -07:00
Taylor Blau ed4a1d6ae1 pack-bitmap.c: avoid uninitialized `pack_int_id` during reuse
When performing multi-pack reuse, reuse_partial_packfile_from_bitmap()
is responsible for generating an array of bitmapped_pack structs from
which to perform reuse.

In the multi-pack case, we loop over the MIDXs packs and copy the result
of calling `nth_bitmapped_pack()` to construct the list of reusable
paths.

But we may also want to do pack-reuse over a single pack, either because
we only had one pack to perform reuse over (in the case of single-pack
bitmaps), or because we explicitly asked to do single pack reuse even
with a MIDX[^1].

When this is the case, the array we generate of reusable packs contains
only a single element, which is either (a) the pack attached to the
single-pack bitmap, or (b) the MIDX's preferred pack.

In 795006fff4 (pack-bitmap: gracefully handle missing BTMP chunks,
2024-04-15), we refactored the reuse_partial_packfile_from_bitmap()
function and stopped assigning the pack_int_id field when reusing only
the MIDX's preferred pack. This results in an uninitialized read down in
try_partial_reuse() like so:

    ==7474==WARNING: MemorySanitizer: use-of-uninitialized-value
    #0 0x55c5cd191dde in try_partial_reuse pack-bitmap.c:1887:8
    #1 0x55c5cd191dde in reuse_partial_packfile_from_bitmap_1 pack-bitmap.c:2001:8
    #2 0x55c5cd191dde in reuse_partial_packfile_from_bitmap pack-bitmap.c:2105:3
    #3 0x55c5cce0bd0e in get_object_list_from_bitmap builtin/pack-objects.c:4043:3
    #4 0x55c5cce0bd0e in get_object_list builtin/pack-objects.c:4156:27
    #5 0x55c5cce0bd0e in cmd_pack_objects builtin/pack-objects.c:4596:3
    #6 0x55c5ccc8fac8 in run_builtin git.c:474:11

which happens when try_partial_reuse() tries to call
midx_pair_to_pack_pos() when it tries to reject cross-pack deltas.

Avoid the uninitialized read by ensuring that the pack_int_id field is
set in the single-pack reuse case by setting it to either the MIDX
preferred pack's pack_int_id, or '-1', in the case of single-pack
bitmaps.  In the latter case, we never read the pack_int_id field, so
the choice of '-1' is intentional as a "garbage in, garbage out"
measure.

Guard against further regressions in this area by adding a test which
ensures that we do not throw out deltas from the preferred pack as
"cross-pack" due to an uninitialized pack_int_id.

[^1]: This can happen for a couple of reasons, either because the
  repository is configured with 'pack.allowPackReuse=(true|single)', or
  because the MIDX was generated prior to the introduction of the BTMP
  chunk, which contains information necessary to perform multi-pack
  reuse.

Reported-by: Kyle Lippincott <spectral@google.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-11 16:08:28 -07:00
Taylor Blau 4cac79a50e pack-bitmap.c: reimplement `midx_bitmap_filename()` with helper
Now that we have the `get_midx_filename_ext()` helper, we can
reimplement the `midx_bitmap_filename()` function in terms of it.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-30 13:43:52 -07:00
Taylor Blau 7252d9a036 pseudo-merge: implement support for finding existing merges
This patch implements support for reusing existing pseudo-merge commits
when writing bitmaps when there is an existing pseudo-merge bitmap which
has exactly the same set of parents as one that we are about to write.

Note that unstable pseudo-merges are likely to change between
consecutive repacks, and so are generally poor candidates for reuse.
However, stable pseudo-merges (see the configuration option
'bitmapPseudoMerge.<name>.stableThreshold') are by definition unlikely
to change between runs (as they represent long-running branches).

Because there is no index from a *set* of pseudo-merge parents to a
matching pseudo-merge bitmap, we have to construct the bitmap
corresponding to the set of parents for each pending pseudo-merge commit
and see if a matching bitmap exists.

This is technically quadratic in the number of pseudo-merges, but is OK
in practice for a couple of reasons:

  - non-matching pseudo-merge bitmaps are rejected quickly as soon as
    they differ in a single bit

  - already-matched pseudo-merge bitmaps are discarded from subsequent
    rounds of search

  - the number of pseudo-merges is generally small, even for large
    repositories

In order to do this, implement (a) a function that finds a matching
pseudo-merge given some uncompressed bitset describing its parents, (b)
a function that computes the bitset of parents for a given pseudo-merge
commit, and (c) call that function before computing the set of reachable
objects for some pending pseudo-merge.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-24 11:40:44 -07:00
Taylor Blau 25163f50a2 pack-bitmap: extra trace2 information
Add some extra trace2 lines to capture the number of bitmap lookups that
are hits versus misses, as well as the number of reachability roots that
have bitmap coverage (versus those that do not).

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-24 11:40:44 -07:00
Taylor Blau 11d45a6e6a pack-bitmap.c: use pseudo-merges during traversal
Now that all of the groundwork has been laid to support reading and
using pseudo-merges, make use of that work in this commit by teaching
the pack-bitmap machinery to use pseudo-merge(s) when available during
traversal.

The basic operation is as follows:

  - When enumerating objects on either side of a reachability query,
    first see if any subset of the roots satisfies some pseudo-merge
    bitmap. If it does, apply that pseudo-merge bitmap.

  - If any pseudo-merge bitmap(s) were applied in the previous step, OR
    them into the result[^1]. Then repeat the process over all
    pseudo-merge bitmaps (we'll refer to this as "cascading"
    pseudo-merges). Once this is done, OR in the resulting bitmap.

  - If there is no fill-in traversal to be done, return the bitmap for
    that side of the reachability query. If there is fill-in traversal,
    then for each commit we encounter via show_commit(), check to see if
    any unsatisfied pseudo-merges containing that commit as one of its
    parents has been made satisfied by the presence of that commit.

    If so, OR in the object set from that pseudo-merge bitmap, and then
    cascade. If not, continue traversal.

A similar implementation is present in the boundary-based bitmap
traversal routines.

[^1]: Importantly, we cannot OR in the entire set of roots along with
  the objects reachable from whatever pseudo-merge bitmaps were
  satisfied.  This may leave some dangling bits corresponding to any
  unsatisfied root(s) getting OR'd into the resulting bitmap, tricking
  other parts of the traversal into thinking we already have a
  reachability closure over those commit(s) when we do not.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-24 11:40:43 -07:00
Taylor Blau 71eca9ab79 pack-bitmap: implement test helpers for pseudo-merge
Implement three new sub-commands for the "bitmap" test-helper:

  - t/helper test-tool bitmap dump-pseudo-merges
  - t/helper test-tool bitmap dump-pseudo-merge-commits <n>
  - t/helper test-tool bitmap dump-pseudo-merge-objects <n>

These three helpers dump the list of pseudo merges, the "parents" of the
nth pseudo-merges, and the set of objects reachable from those parents,
respectively.

These helpers will be useful in subsequent patches when we add test
coverage for pseudo-merge bitmaps.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-24 11:40:43 -07:00
Taylor Blau 7c0fae8844 pack-bitmap.c: read pseudo-merge extension
Now that the scaffolding for reading the pseudo-merge extension has been
laid, teach the pack-bitmap machinery to read the pseudo-merge extension
when present.

Note that pseudo-merges themselves are not yet used during traversal,
this step will be taken by a future commit.

In the meantime, read the table and initialize the pseudo_merge_map
structure introduced by a previous commit. When the pseudo-merge
extension is present, `load_bitmap_header()` performs basic sanity
checks to make sure that the table is well-formed.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-24 11:40:43 -07:00
Taylor Blau 79621f3e41 pack-bitmap: extract `read_bitmap()` function
The pack-bitmap machinery uses the `read_bitmap_1()` function to read a
bitmap from within the mmap'd region corresponding to the .bitmap file.
As as side-effect of calling this function, `read_bitmap_1()` increments
the `index->map_pos` variable to reflect the number of bytes read.

Extract the core of this routine to a separate function (that operates
over a `const unsigned char *`, a `size_t` and a `size_t *` pointer)
instead of a `struct bitmap_index *` pointer.

This function (called `read_bitmap()`) is part of the pack-bitmap.h API
so that it can be used within the upcoming portion of the implementation
in pseduo-merge.ch.

Rewrite the existing function, `read_bitmap_1()`, in terms of its more
generic counterpart.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-24 11:40:42 -07:00