Reduce implicit assumption and dependence on the_repository in the
object-file subsystem.
* ps/object-file-wo-the-repository:
object-file: get rid of `the_repository` in index-related functions
object-file: get rid of `the_repository` in `force_object_loose()`
object-file: get rid of `the_repository` in `read_loose_object()`
object-file: get rid of `the_repository` in loose object iterators
object-file: remove declaration for `for_each_file_in_obj_subdir()`
object-file: inline `for_each_loose_file_in_objdir_buf()`
object-file: get rid of `the_repository` when writing objects
odb: introduce `odb_write_object()`
loose: write loose objects map via their source
object-file: get rid of `the_repository` in `finalize_object_file()`
object-file: get rid of `the_repository` in `loose_object_info()`
object-file: get rid of `the_repository` when freshening objects
object-file: inline `check_and_freshen()` functions
object-file: get rid of `the_repository` in `has_loose_object()`
object-file: stop using `the_hash_algo`
object-file: fix -Wsign-compare warnings
The config API had a set of convenience wrapper functions that
implicitly use the_repository instance; they have been removed and
inlined at the calling sites.
* ps/config-wo-the-repository: (21 commits)
config: fix sign comparison warnings
config: move Git config parsing into "environment.c"
config: remove unused `the_repository` wrappers
config: drop `git_config_set_multivar()` wrapper
config: drop `git_config_get_multivar_gently()` wrapper
config: drop `git_config_set_multivar_in_file_gently()` wrapper
config: drop `git_config_set_in_file_gently()` wrapper
config: drop `git_config_set()` wrapper
config: drop `git_config_set_gently()` wrapper
config: drop `git_config_set_in_file()` wrapper
config: drop `git_config_get_bool()` wrapper
config: drop `git_config_get_ulong()` wrapper
config: drop `git_config_get_int()` wrapper
config: drop `git_config_get_string()` wrapper
config: drop `git_config_get_string()` wrapper
config: drop `git_config_get_string_multi()` wrapper
config: drop `git_config_get_value()` wrapper
config: drop `git_config_get_value()` wrapper
config: drop `git_config_get()` wrapper
config: drop `git_config_clear()` wrapper
...
In 036876a106 (config: hide functions using `the_repository` by
default, 2024-08-13) we have moved around a bunch of functions in the
config subsystem that depend on `the_repository`. Those function have
been converted into mere wrappers around their equivalent function that
takes in a repository as parameter, and the intent was that we'll
eventually remove those wrappers to make the dependency on the global
repository variable explicit at the callsite.
Follow through with that intent and remove `git_config()`. All callsites
are adjusted so that they use `repo_config(the_repository, ...)`
instead. While some callsites might already have a repository available,
this mechanical conversion is the exact same as the current situation
and thus cannot cause any regression. Those sites should eventually be
cleaned up in a later patch series.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The function `force_object_loose()` forces an object to become a loose
object in case it only exists in its packed form. To do so it implicitly
relies on `the_repository`.
Refactor the function by passing a `struct odb_source` as parameter.
While the check whether any such loose object exists already acts on the
whole object database, writing the loose object happens in one specific
source.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The iterators for loose objects still rely on `the_repository`. Refactor
them:
- `for_each_loose_file_in_objdir()` is refactored so that the caller
is now expected to pass an `odb_source` as parameter instead of the
path to that source. Furthermore, it is renamed accordingly to
`for_each_loose_file_in_source()`.
- `for_each_loose_object()` is refactored to take in an object
database now and calls the above function in a loop.
This allows us to get rid of the global dependency.
Adjust callers accordingly.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We implicitly depend on `the_repository` when moving an object file into
place in `finalize_object_file()`. Get rid of this global dependency by
passing in a repository.
Note that one might be pressed to inject an object database instead of a
repository. But the function doesn't really care about the ODB at all.
All it does is to move a file into place while checking whether there is
any collision. As such, the functionality it provides is independent of
the object database and only needs the repository as parameter so that
it can adjust permissions of the file we are about to finalize.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We implicitly depend on `the_repository` in `has_loose_object()`.
Refactor the function to accept an `odb_source` as input that should be
checked for such a loose object.
This refactoring changes semantics of the function to not check the
whole object database for such a loose object anymore, but instead we
now only check that single source. Existing callers thus need to loop
through all sources manually now.
While this change may seem illogical at first, whether or not an object
exists in a specific format should be answered by the source using that
format. As such, we can eventually convert this into a generic function
`odb_source_has_object()` that simply checks whether a given object
exists in an object source. And as we will know about the format that
any given source uses it allows us to derive whether the object exists
in a given format.
This change also makes `has_loose_object_nonlocal()` obsolete. The only
caller of this function is adapted so that it skips the primary object
source.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Code clean-up around object access API.
* ps/object-store:
odb: rename `read_object_with_reference()`
odb: rename `pretend_object_file()`
odb: rename `has_object()`
odb: rename `repo_read_object_file()`
odb: rename `oid_object_info()`
odb: trivial refactorings to get rid of `the_repository`
odb: get rid of `the_repository` when handling submodule sources
odb: get rid of `the_repository` when handling the primary source
odb: get rid of `the_repository` in `for_each()` functions
odb: get rid of `the_repository` when handling alternates
odb: get rid of `the_repository` in `odb_mkstemp()`
odb: get rid of `the_repository` in `assert_oid_type()`
odb: get rid of `the_repository` in `find_odb()`
odb: introduce parent pointers
object-store: rename files to "odb.{c,h}"
object-store: rename `object_directory` to `odb_source`
object-store: rename `raw_object_store` to `object_database`
The function `get_multi_pack_index()` loads multi-pack indices via
`prepare_packed_git()` and then returns the linked list of multi-pack
indices that is stored in `struct object_database`. That list is in the
process of being removed though in favor of storing the MIDX as part of
the object database source it belongs to.
Refactor `get_multi_pack_index()` so that it returns the multi-pack
index for a single object source. Callers are now expected to call this
function for each source they are interested in. This requires them to
iterate through alternates, so we have to prepare alternate object
sources before doing so.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* tb/midx-avoid-cruft-packs:
repack: exclude cruft pack(s) from the MIDX where possible
pack-objects: introduce '--stdin-packs=follow'
pack-objects: swap 'show_{object,commit}_pack_hint'
pack-objects: fix typo in 'show_object_pack_hint()'
pack-objects: perform name-hash traversal for unpacked objects
pack-objects: declare 'rev_info' for '--stdin-packs' earlier
pack-objects: factor out handling '--stdin-packs'
pack-objects: limit scope in 'add_object_entry_from_pack()'
pack-objects: use standard option incompatibility functions
* ps/object-store:
odb: rename `read_object_with_reference()`
odb: rename `pretend_object_file()`
odb: rename `has_object()`
odb: rename `repo_read_object_file()`
odb: rename `oid_object_info()`
odb: trivial refactorings to get rid of `the_repository`
odb: get rid of `the_repository` when handling submodule sources
odb: get rid of `the_repository` when handling the primary source
odb: get rid of `the_repository` in `for_each()` functions
odb: get rid of `the_repository` when handling alternates
odb: get rid of `the_repository` in `odb_mkstemp()`
odb: get rid of `the_repository` in `assert_oid_type()`
odb: get rid of `the_repository` in `find_odb()`
odb: introduce parent pointers
object-store: rename files to "odb.{c,h}"
object-store: rename `object_directory` to `odb_source`
object-store: rename `raw_object_store` to `object_database`
* ps/object-store:
odb: rename `read_object_with_reference()`
odb: rename `pretend_object_file()`
odb: rename `has_object()`
odb: rename `repo_read_object_file()`
odb: rename `oid_object_info()`
odb: trivial refactorings to get rid of `the_repository`
odb: get rid of `the_repository` when handling submodule sources
odb: get rid of `the_repository` when handling the primary source
odb: get rid of `the_repository` in `for_each()` functions
odb: get rid of `the_repository` when handling alternates
odb: get rid of `the_repository` in `odb_mkstemp()`
odb: get rid of `the_repository` in `assert_oid_type()`
odb: get rid of `the_repository` in `find_odb()`
odb: introduce parent pointers
object-store: rename files to "odb.{c,h}"
object-store: rename `object_directory` to `odb_source`
object-store: rename `raw_object_store` to `object_database`
Rename `read_object_with_reference()` to `odb_read_object_peeled()` to
match other functions related to the object database and our modern
coding guidelines. Furthermore though, the old name didn't really
describe very well what this function actually does, which is to walk
down any commit and tag objects until an object of the required type has
been found. This is generally referred to as "peeling", so the new name
should be way more descriptive.
No compatibility wrapper is introduced as the function is not used a lot
throughout our codebase.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Rename `has_object()` to `odb_has_object()` to match other functions
related to the object database and our modern coding guidelines.
Introduce a compatibility wrapper so that any in-flight topics will
continue to compile.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Rename `repo_read_object_file()` to `odb_read_object()` to match other
functions related to the object database and our modern coding
guidelines.
Introduce a compatibility wrapper so that any in-flight topics will
continue to compile.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Rename `oid_object_info()` to `odb_read_object_info()` as well as their
`_extended()` variant to match other functions related to the object
database and our modern coding guidelines.
Introduce compatibility wrappers so that any in-flight topics will
continue to compile.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In the preceding commits we have renamed the structures contained in
"object-store.h" to `struct object_database` and `struct odb_backend`.
As such, the code files "object-store.{c,h}" are confusingly named now.
Rename them to "odb.{c,h}" accordingly.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When invoked with '--stdin-packs', pack-objects will generate a pack
which contains the objects found in the "included" packs, less any
objects from "excluded" packs.
Packs that exist in the repository but weren't specified as either
included or excluded are in practice treated like the latter, at least
in the sense that pack-objects won't include objects from those packs.
This behavior forces us to include any cruft pack(s) in a repository's
multi-pack index for the reasons described in ddee3703b3
(builtin/repack.c: add cruft packs to MIDX during geometric repack,
2022-05-20).
The full details are in ddee3703b3, but the gist is if you
have a once-unreachable object in a cruft pack which later becomes
reachable via one or more commits in a pack generated with
'--stdin-packs', you *have* to include that object in the MIDX via the
copy in the cruft pack, otherwise we cannot generate reachability
bitmaps for any commits which reach that object.
Note that the traversal here is best-effort, similar to the existing
traversal which provides name-hash hints. This means that the object
traversal may hand us back a blob that does not actually exist. We
*won't* see missing trees/commits with 'ignore_missing_links' because:
- missing commit parents are discarded at the commit traversal stage by
revision.c::process_parents()
- missing tag objects are discarded by revision.c::handle_commit()
- missing tree objects are discarded by the list-objects code in
list-objects.c::process_tree()
But we have to handle potentially-missing blobs specially by making a
separate check to ensure they exist in the repository. Failing to do so
would mean that we'd add an object to the packing list which doesn't
actually exist, rendering us unable to write out the pack.
This prepares us for new repacking behavior which will "resurrect"
objects found in cruft or otherwise unspecified packs when generating
new packs. In the context of geometric repacking, this may be used to
maintain a sequence of geometrically-repacked packs, the union of which
is closed under reachability, even in the case described earlier.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
show_commit_pack_hint() has heretofore been a noop, so its position
within its compilation unit only needs to appear before its first use.
But the following commit will sometimes have `show_commit_pack_hint()`
call `show_object_pack_hint()`, so reorder the former to appear after
the latter to minimize the code movement in that patch.
Suggested-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
With '--unpacked', pack-objects adds loose objects (which don't appear
in any of the excluded packs from '--stdin-packs') to the output pack
without considering them as reachability tips for the name-hash
traversal.
This was an oversight in the original implementation of '--stdin-packs',
since the code which enumerates and adds loose objects to the output
pack (`add_unreachable_loose_objects()`) did not have access to the
'rev_info' struct found in `read_packs_list_from_stdin()`.
Excluding unpacked objects from that traversal doesn't affect the
correctness of the resulting pack, but it does make it harder to
discover good deltas for loose objects.
Now that the 'rev_info' struct is declared outside of
`read_packs_list_from_stdin()`, we can pass it to
`add_objects_in_unpacked_packs()` and add any loose objects as tips to
the above-mentioned traversal, in theory producing slightly tighter
packs as a result.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Once 'read_packs_list_from_stdin()' has called for_each_object_in_pack()
on each of the input packs, we do a reachability traversal to discover
names for any objects we picked up so we can generate name hash values
and hopefully get higher quality deltas as a result.
A future commit will change the purpose of this reachability traversal
to find and pack objects which are reachable from commits in the input
packs, but are packed in an unknown (not included nor excluded) pack.
Extract the code which initializes and performs the reachability
traversal to take place in the caller, not the callee, which prepares us
to share this code for the '--unpacked' case (see the function
add_unreachable_loose_objects() for more details).
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
At the bottom of cmd_pack_objects() we check which mode the command is
running in (e.g., generating a cruft pack, handling '--stdin-packs',
using the internal rev-list, etc.) and handle the mode appropriately.
The '--stdin-packs' case is handled inline (dating back to its
introduction in 339bce27f4 (builtin/pack-objects.c: add '--stdin-packs'
option, 2021-02-22)) since it is relatively short. Extract the body of
"if (stdin_packs)" into its own function to prepare for the
implementation to become lengthier in a following commit.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In add_object_entry_from_pack() we declare 'revs' (given to us through
the miscellaneous context argument) earlier in the "if (p)" conditional
than is necessary. Move it down as far as it can go to reduce its
scope.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
pack-objects has a handful of explicit checks for pairs of command-line
options which are mutually incompatible. Many of these pre-date
a699367bb8 (i18n: factorize more 'incompatible options' messages,
2022-01-31).
Convert the explicit checks into die_for_incompatible_opt2() calls,
which simplifies the implementation and standardizes pack-objects'
output when given incompatible options (e.g., --stdin-packs with
--filter gives different output than --keep-unreachable with
--unpack-unreachable).
There is one minor piece of test fallout in t5331 that expects the old
format, which has been corrected.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
"git pack-objects" learns to find delta bases from blobs at the
same path, using the --path-walk API.
* ds/path-walk-2:
pack-objects: allow --shallow and --path-walk
path-walk: add new 'edge_aggressive' option
pack-objects: thread the path-based compression
pack-objects: refactor path-walk delta phase
scalar: enable path-walk during push via config
pack-objects: enable --path-walk via config
repack: add --path-walk option
t5538: add tests to confirm deltas in shallow pushes
pack-objects: introduce GIT_TEST_PACK_PATH_WALK
p5313: add performance tests for --path-walk
pack-objects: update usage to match docs
pack-objects: add --path-walk option
pack-objects: extract should_attempt_deltas()
There does not appear to be anything particularly incompatible about the
--shallow and --path-walk options of 'git pack-objects'. If shallow
commits are to be handled differently, then it is by the revision walk
that defines the commit set and which are interesting or uninteresting.
However, before the previous change, a trivial removal of the warning
would cause a failure in t5500-fetch-pack.sh when
GIT_TEST_PACK_PATH_WALK is enabled. The shallow fetch would provide more
objects than we desired, due to some incorrect behavior of the path-walk
API, especially around walking uninteresting objects.
The recently-added tests in t5538-push-shallow.sh help to confirm this
behavior is working with the --path-walk option if
GIT_TEST_PACK_PATH_WALK is enabled. These tests passed previously due to
the --path-walk feature being disabled in the presence of a shallow
clone.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Adapting the implementation of ll_find_deltas(), create a threaded
version of the --path-walk compression step in 'git pack-objects'.
This involves adding a 'regions' member to the thread_params struct,
allowing each thread to own a section of paths. We can simplify the way
jobs are split because there is no value in extending the batch based on
name-hash the way sections of the object entry array are attempted to be
grouped. We re-use the 'list_size' and 'remaining' items for the purpose
of borrowing work in progress from other "victim" threads when a thread
has finished its batch of work more quickly.
Using the Git repository as a test repo, the p5313 performance test
shows that the resulting size of the repo is the same, but the threaded
implementation gives gains of varying degrees depending on the number of
objects being packed. (This was tested on a 16-core machine.)
Test HEAD~1 HEAD
---------------------------------------------------
5313.20: big pack 2.38 1.99 -16.4%
5313.21: big pack size 16.1M 16.0M -0.2%
5313.24: repack 107.32 45.41 -57.7%
5313.25: repack size 213.3M 213.2M -0.0%
(Test output is formatted to better fit in message.)
This ~60% reduction in 'git repack --path-walk' time is typical across
all repos I used for testing. What is interesting is to compare when the
overall time improves enough to outperform the --name-hash-version=1
case. These time improvements correlate with repositories with data
shapes that significantly improve their data size as well. The
--path-walk feature frequently takes longer than --name-hash-version=2,
trading some extra computation for some additional compression. The
natural place where this additional computation comes from is the two
compression passes that --path-walk takes, though the first pass is
naturally faster due to the path boundaries avoiding a number of delta
compression attempts.
For example, the microsoft/fluentui repo has significant size reduction
from --name-hash-version=1 to --name-hash-version=2 followed by further
improvements with --path-walk. The threaded computation makes
--path-walk more competitive in time compared to --name-hash-version=2,
though still ~31% more expensive in that metric.
Repack Method Pack Size Time
------------------------------------------
Hash v1 439.4M 87.24s
Hash v2 161.7M 21.51s
Path Walk (Before) 142.5M 81.29s
Path Walk (After) 142.5M 28.16s
Similar results hold for the Git repository:
Repack Method Pack Size Time
------------------------------------------
Hash v1 248.8M 30.44s
Hash v2 249.0M 30.15s
Path Walk (Before) 213.2M 142.50s
Path Walk (After) 213.3M 45.41s
...as well as the nodejs/node repository:
Repack Method Pack Size Time
------------------------------------------
Hash v1 739.9M 71.18s
Hash v2 764.6M 67.82s
Path Walk (Before) 698.1M 208.10s
Path Walk (After) 698.0M 75.10s
Finally, the Linux kernel repository is a good test for this repacking
time change, even though the space savings is more subtle:
Repack Method Pack Size Time
------------------------------------------
Hash v1 2.5G 554.41s
Hash v2 2.5G 549.62s
Path Walk (before) 2.2G 1562.36s
Path Walk (before) 2.2G 559.00s
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Previously, the --path-walk option to 'git pack-objects' would compute
deltas inline with the path-walk logic. This would make the progress
indicator look like it is taking a long time to enumerate objects, and
then very quickly computed deltas.
Instead of computing deltas on each region of objects organized by tree,
store a list of regions corresponding to these groups. These can later
be pulled from the list for delta compression before doing the "global"
delta search.
This presents a new progress indicator that can be used in tests to
verify that this stage is happening.
The current implementation is not integrated with threads, but we are
setting it up to arrive in the next change.
Since we do not attempt to sort objects by size until after exploring
all trees, we can remove the previous change to t5530 due to a different
error message appearing first.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Users may want to enable the --path-walk option for 'git pack-objects' by
default, especially underneath commands like 'git push' or 'git repack'.
This should be limited to client repositories, since the --path-walk option
disables bitmap walks, so would be bad to include in Git servers when
serving fetches and clones. There is potential that it may be helpful to
consider when repacking the repository, to take advantage of improved deltas
across historical versions of the same files.
Much like how "pack.useSparse" was introduced and included in
"feature.experimental" before being enabled by default, use the repository
settings infrastructure to make the new "pack.usePathWalk" config enabled by
"feature.experimental" and "feature.manyFiles".
In order to test that this config works, add a new trace2 region around
the path walk code that can be checked by a 'git push' command.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
There are many tests that validate whether 'git pack-objects' works as
expected. Instead of duplicating these tests, add a new test environment
variable, GIT_TEST_PACK_PATH_WALK, that implies --path-walk by default
when specified.
This was useful in testing the implementation of the --path-walk
implementation, helping to find tests that are overly specific to the
default object walk. These include:
- t0411-clone-from-partial.sh : One test fetches from a repo that does
not have the boundary objects. This causes the path-based walk to
fail. Disable the variable for this test.
- t5306-pack-nobase.sh : Similar to t0411, one test fetches from a repo
without a boundary object.
- t5310-pack-bitmaps.sh : One test compares the case when packing with
bitmaps to the case when packing without them. Since we disable the
test variable when writing bitmaps, this causes a difference in the
object list (the --path-walk option adds an extra object). Specify
--no-path-walk in both processes for the comparison. Another test
checks for a specific delta base, but when computing dynamically
without using bitmaps, the base object it too small to be considered
in the delta calculations so no base is used.
- t5316-pack-delta-depth.sh : This script cares about certain delta
choices and their chain lengths. The --path-walk option changes how
these chains are selected, and thus changes the results of this test.
- t5322-pack-objects-sparse.sh : This demonstrates the effectiveness of
the --sparse option and how it combines with --path-walk.
- t5332-multi-pack-reuse.sh : This test verifies that the preferred
pack is used for delta reuse when possible. The --path-walk option is
not currently aware of the preferred pack at all, so finds a
different delta base.
- t7406-submodule-update.sh : When using the variable, the --depth
option collides with the --path-walk feature, resulting in a warning
message. Disable the variable so this warning does not appear.
I want to call out one specific test change that is only temporary:
- t5530-upload-pack-error.sh : One test cares specifically about an
"unable to read" error message. Since the current implementation
performs delta calculations within the path-walk API callback, a
different "unable to get size" error message appears. When this
is changed in a future refactoring, this test change can be reverted.
Similar to GIT_TEST_NAME_HASH_VERSION, we do not add this option to the
linux-TEST-vars CI build as that's already an overloaded build.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The t0450 test script verifies that builtin usage matches the synopsis
in the documentation. Adjust the builtin to match and then remove 'git
pack-objects' from the exception list.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In order to more easily compute delta bases among objects that appear at
the exact same path, add a --path-walk option to 'git pack-objects'.
This option will use the path-walk API instead of the object walk given
by the revision machinery. Since objects will be provided in batches
representing a common path, those objects can be tested for delta bases
immediately instead of waiting for a sort of the full object list by
name-hash. This has multiple benefits, including avoiding collisions by
name-hash.
The objects marked as UNINTERESTING are included in these batches, so we
are guaranteeing some locality to find good delta bases.
After the individual passes are done on a per-path basis, the default
name-hash is used to find other opportunistic delta bases that did not
match exactly by the full path name.
The current implementation performs delta calculations while walking
objects, which is not ideal for a few reasons. First, this will cause
the "Enumerating objects" phase to be much longer than usual. Second, it
does not take advantage of threading during the path-scoped delta
calculations. Even with this lack of threading, the path-walk option is
sometimes faster than the usual approach. Future changes will refactor
this code to allow for threading, but that complexity is deferred until
later to keep this patch as simple as possible.
This new walk is incompatible with some features and is ignored by
others:
* Object filters are not currently integrated with the path-walk API,
such as sparse-checkout or tree depth. A blobless packfile could be
integrated easily, but that is deferred for later.
* Server-focused features such as delta islands, shallow packs, and
using a bitmap index are incompatible with the path-walk API.
* The path walk API is only compatible with the --revs option, not
taking object lists or pack lists over stdin. These alternative ways
to specify the objects currently ignores the --path-walk option
without even a warning.
Future changes will create performance tests that demonstrate the power
of this approach.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This will be helpful in a future change, which will reuse this logic.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Update parse-options API to catch mistakes to pass address of an
integral variable of a wrong type/size.
* ps/parse-options-integers:
parse-options: detect mismatches in integer signedness
parse-options: introduce precision handling for `OPTION_UNSIGNED`
parse-options: introduce precision handling for `OPTION_INTEGER`
parse-options: rename `OPT_MAGNITUDE()` to `OPT_UNSIGNED()`
parse-options: support unit factors in `OPT_INTEGER()`
global: use designated initializers for options
parse: fix off-by-one for minimum signed values
Code clean-up.
* ps/object-file-cleanup:
object-store: merge "object-store-ll.h" and "object-store.h"
object-store: remove global array of cached objects
object: split out functions relating to object store subsystem
object-file: drop `index_blob_stream()`
object-file: split up concerns of `HASH_*` flags
object-file: split out functions relating to object store subsystem
object-file: move `xmmap()` into "wrapper.c"
object-file: move `git_open_cloexec()` to "compat/open.c"
object-file: move `safe_create_leading_directories()` into "path.c"
object-file: move `mkdir_in_gitdir()` into "path.c"
With the preceding commit, `OPT_INTEGER()` has learned to support unit
factors. Consequently, the major differencen between `OPT_INTEGER()` and
`OPT_MAGNITUDE()` isn't the support of unit factors anymore, as both of
them do support them now. Instead, the difference is that one handles
signed and the other handles unsigned integers.
Adapt the name of `OPT_MAGNITUDE()` accordingly by renaming it to
`OPT_UNSIGNED()`.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
"git cat-file --batch" and friends learned to allow "--filter=" to
omit certain objects, just like the transport layer does.
* ps/cat-file-filter-batch:
builtin/cat-file: use bitmaps to efficiently filter by object type
builtin/cat-file: deduplicate logic to iterate over all objects
pack-bitmap: introduce function to check whether a pack is bitmapped
pack-bitmap: add function to iterate over filtered bitmapped objects
pack-bitmap: allow passing payloads to `show_reachable_fn()`
builtin/cat-file: support "object:type=" objects filter
builtin/cat-file: support "blob:limit=" objects filter
builtin/cat-file: support "blob:none" objects filter
builtin/cat-file: wire up an option to filter objects
builtin/cat-file: introduce function to report object status
builtin/cat-file: rename variable that tracks usage
The object layer has been updated to take an explicit repository
instance as a parameter in more code paths.
* ps/object-wo-the-repository:
hash: stop depending on `the_repository` in `null_oid()`
hash: fix "-Wsign-compare" warnings
object-file: split out logic regarding hash algorithms
delta-islands: stop depending on `the_repository`
object-file-convert: stop depending on `the_repository`
pack-bitmap-write: stop depending on `the_repository`
pack-revindex: stop depending on `the_repository`
pack-check: stop depending on `the_repository`
environment: move access to "core.bigFileThreshold" into repo settings
pack-write: stop depending on `the_repository` and `the_hash_algo`
object: stop depending on `the_repository`
csum-file: stop depending on `the_repository`
The "object-store-ll.h" header has been introduced to keep transitive
header dependendcies and compile times at bay. Now that we have created
a new "object-store.c" file though we can easily move the last remaining
additional bit of "object-store.h", the `odb_path_map`, out of the
header.
Do so. As the "object-store.h" header is now equivalent to its low-level
alternative we drop the latter and inline it into the former.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* ps/object-wo-the-repository:
hash: stop depending on `the_repository` in `null_oid()`
hash: fix "-Wsign-compare" warnings
object-file: split out logic regarding hash algorithms
delta-islands: stop depending on `the_repository`
object-file-convert: stop depending on `the_repository`
pack-bitmap-write: stop depending on `the_repository`
pack-revindex: stop depending on `the_repository`
pack-check: stop depending on `the_repository`
environment: move access to "core.bigFileThreshold" into repo settings
pack-write: stop depending on `the_repository` and `the_hash_algo`
object: stop depending on `the_repository`
csum-file: stop depending on `the_repository`
Incrementally updating multi-pack index files.
* tb/incremental-midx-part-2:
midx: implement writing incremental MIDX bitmaps
pack-bitmap.c: use `ewah_or_iterator` for type bitmap iterators
pack-bitmap.c: keep track of each layer's type bitmaps
ewah: implement `struct ewah_or_iterator`
pack-bitmap.c: apply pseudo-merge commits with incremental MIDXs
pack-bitmap.c: compute disk-usage with incremental MIDXs
pack-bitmap.c: teach `rev-list --test-bitmap` about incremental MIDXs
pack-bitmap.c: support bitmap pack-reuse with incremental MIDXs
pack-bitmap.c: teach `show_objects_for_type()` about incremental MIDXs
pack-bitmap.c: teach `bitmap_for_commit()` about incremental MIDXs
pack-bitmap.c: open and store incremental bitmap layers
pack-revindex: prepare for incremental MIDX bitmaps
Documentation: describe incremental MIDX bitmaps
Documentation: remove a "future work" item from the MIDX docs
The `show_reachable_fn` callback is used by a couple of functions to
present reachable objects to the caller. The function does not provide a
way for the caller to pass a payload though, which is functionality that
we'll require in a subsequent commit.
Change the callback type to accept a payload and adapt all callsites
accordingly.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Now that the pack-bitmap machinery has learned how to read and interact
with an incremental MIDX bitmap, teach the pack-bitmap-write.c machinery
(and relevant callers from within the MIDX machinery) to write such
bitmaps.
The details for doing so are mostly straightforward. The main changes
are as follows:
- find_object_pos() now makes use of an extra MIDX parameter which is
used to locate the bit positions of objects which are from previous
layers (and thus do not exist in the current layer's pack_order
field).
(Note also that the pack_order field is moved into struct
write_midx_context to further simplify the callers for
write_midx_bitmap()).
- bitmap_writer_build_type_index() first determines how many objects
precede the current bitmap layer and offsets the bits it sets in
each respective type-level bitmap by that amount so they can be OR'd
together.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Acked-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Once an object is written into a cruft pack, we can only freshen it by
writing a new loose or packed copy of that object with a more recent
mtime.
Prior to 61568efa95 (builtin/pack-objects.c: support `--max-pack-size`
with `--cruft`, 2023-08-28), we typically had at most one cruft pack in
a repository at any given time. So freshening unreachable objects was
straightforward when already rewriting the cruft pack (and its *.mtimes
file).
But 61568efa95 changes things: 'pack-objects' now supports writing
multiple cruft packs when invoked with `--cruft` and the
`--max-pack-size` flag. Cruft packs are rewritten until they reach some
size threshold, at which point they are considered "frozen", and will
only be modified in a pruning GC, or if the threshold itself is
adjusted.
Prior to this patch, however, this process breaks down when we attempt
to freshen an object packed in an earlier cruft pack, and that cruft
pack is larger than the threshold and thus will survive the repack.
When this is the case, it is impossible to freshen objects in cruft
pack(s) when those cruft packs are larger than the threshold. This is
because we would avoid writing them in the new cruft pack entirely, for
a couple of reasons.
1. When enumerating packed objects via 'add_objects_in_unpacked_packs()'
we pass the SKIP_IN_CORE_KEPT_PACKS, which is used to avoid looping
over the packs we're going to retain (which are marked as kept
in-core by 'read_cruft_objects()').
This means that we will avoid enumerating additional packed copies
of objects found in any cruft packs which are larger than the given
size threshold. Thus there is no opportunity to call
'create_object_entry()' whatsoever.
2. We likewise will discard the loose copy (if one exists) of any
unreachable object packed in a cruft pack that is larger than the
threshold. Here our call path is 'add_unreachable_loose_objects()',
which uses the 'add_loose_object()' callback.
That function will eventually land us in 'want_object_in_pack()'
(via 'add_cruft_object_entry()'), and we'll discard the object as it
appears in one of the packs which we marked as kept in-core.
This means in effect that it is impossible to freshen an unreachable
object once it appears in a cruft pack larger than the given threshold.
Instead, we should pack an additional copy of an unreachable object we
want to freshen even if it appears in a cruft pack, provided that the
cruft copy has an mtime which is before the mtime of the copy we are
trying to pack/freshen. This is sub-optimal in the sense that it
requires keeping an additional copy of unreachable objects upon
freshening, but we don't have a better alternative without the ability
to make in-place modifications to existing *.mtimes files.
In order to implement this, we have to adjust the behavior of
'want_found_object()'. When 'pack-objects' is told that we're *not*
going to retain any cruft packs (i.e. the set of packs marked as kept
in-core does not contain a cruft pack), the behavior is unchanged.
But when there *is* at least one cruft pack that we're holding onto, it
is no longer sufficient to reject a copy of an object found in that
cruft pack for that reason alone. In this case, we only want to reject a
candidate object when copies of that object either:
- exists in a non-cruft pack that we are retaining, regardless of that
pack's mtime, or
- exists in a cruft pack with an mtime at least as recent as the copy
we are debating whether or not to pack, in which case freshening
would be redundant.
To do this, keep track of whether or not we have any cruft packs in our
in-core kept list with a new 'ignore_packed_keep_in_core_has_cruft'
flag. When we end up in this new special case, we replace a call to
'has_object_kept_pack()' to 'want_cruft_object_mtime()', and only reject
objects when we have a copy in an existing cruft pack with at least as
recent an mtime as our candidate (in which case "freshening" would be
redundant).
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
There are multiple sites in "delta-islands.c" where we use the
global `the_repository` variable, either explicitly or implicitly by
using `the_hash_algo`.
Refactor the code to stop using `the_repository`. In most cases this is
trivial because we already had a repository available in the calling
context, with the only exception being `propagate_island_marks()`. Adapt
it so that the repository gets passed in via a parameter.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The "core.bigFileThreshold" setting is stored in a global variable and
populated via `git_default_core_config()`. This may cause issues in
the case where one is handling multiple different repositories in a
single process with different values for that config key, as we may or
may not see the correct value in that case. Furthermore, global state
blocks our path towards libification.
Refactor the code so that we instead store the value in `struct
repo_settings`, where the value is computed as-needed and cached.
Note that this change requires us to adapt one test in t1050 that
verifies that we die when parsing an invalid "core.bigFileThreshold"
value. The exercised Git command doesn't use the value at all, and thus
it won't hit the new code path that parses the value. This is addressed
by using git-hash-object(1) instead, which does read the value.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
There are a couple of functions in "pack-write.c" that implicitly depend
on `the_repository` or `the_hash_algo`. Remove this dependency by
injecting the repository via a parameter and adapt callers accordingly.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>