The variables `packed_git_window_size` and `packed_git_limit` are global
config variables used in the `packfile.c` file. Since it is only used in
this file, let's change it from being a global config variable to a
local variable for the subsystem.
With this, we rid `packfile.c` from all global variable usage and this
means we can also remove the `USE_THE_REPOSITORY_VARIABLE` guard from
the file.
Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The function `for_each_packed_object` currently relies on the global
variable `the_repository`. To eliminate global variable usage in
`packfile.c`, we should progressively shift the dependency on
the_repository to higher layers. Let's remove its usage from this
function and closely related function `is_promisor_object`.
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The functions `has_object[_kept]_pack` currently rely on the global
variable `the_repository`. To eliminate global variable usage in
`packfile.c`, we should progressively shift the dependency on
the_repository to higher layers. Let's remove its usage from these
functions and any related ones.
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The function `odb_pack_name` currently relies on the global variable
`the_repository`. To eliminate global variable usage in `packfile.c`, we
should progressively shift the dependency on the_repository to higher
layers.
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The struct `packed_git` holds information regarding a packed object
file. Let's add the repository variable to this object, to represent the
repository that this packfile belongs to. This helps remove dependency
on the global `the_repository` object in `packfile.c` by simply using
repository information now readily available in the struct.
We do need to consider that a packfile could be part of the alternates
of a repository, but considering that we only have one repository struct
and also that we currently anyways use 'the_repository', we should be
OK with this change.
We also modify `alloc_packed_git` to ensure that the repository is added
to newly created `packed_git` structs. This requires modifying the
function and all its callee to pass the repository object down the
levels.
Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The main function we use to search a pack index for an object is
find_pack_entry_one(). That function still takes a bare pointer to the
hash, despite the fact that its underlying bsearch_pack() function needs
an object_id struct. And so we end up making an extra copy of the hash
into the struct just to do a lookup.
As it turns out, all callers but one already have such an object_id. So
we can just take a pointer to that struct and use it directly. This
avoids the extra copy and provides a more type-safe interface.
The one exception is get_delta_base() in packfile.c, when we are chasing
a REF_DELTA from inside the pack (and thus we have a pointer directly to
the mmap'd pack memory, not a struct). We can just bump the hashcpy()
from inside find_pack_entry_one() to this one caller that needs it.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
The find_sha1_pack() function has a few problems:
- it's badly named, since it works with any object hash
- it takes the hash as a bare pointer rather than an object_id struct
We can fix both of these easily, as all callers actually have a real
object_id anyway.
I also found the existence of this function somewhat confusing, as it is
about looking in an arbitrary set of linked packed_git structs. It's
good for things like dumb-http which are looking in downloaded remote
packs, and not our local packs. But despite the name, it is not a good
way to find the pack which contains a local object (it skips the use of
the midx, the pack mru list, and so on).
So let's also add an explanatory comment above the declaration that may
point people in the right direction.
I suspect the calls in fast-import.c, which use the packed_git list from
the repository struct, could actually just be using find_pack_entry().
But since we'd need to keep it anyway for dumb-http, I didn't dig
further there. If we eventually drop dumb-http support, then it might be
worth examining them to see if we can get rid of the function entirely.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
With a name like parse_packed_git(), you might think it's the right way
to access a local pack index and its associated objects. But not so!
It's a one-off used by the dumb-http code to access pack idx files we've
downloaded from the remote, but whose packs we might not have.
There's only one caller left for this function, and ideally we'd drop it
completely and just inline it there. But that would require exposing
other internals from packfile.[ch], like alloc_packed_git() and
check_packed_git_idx().
So let's leave it be for now, and just warn people that it's probably
not what they're looking for. Perhaps in the long run if we eventually
drop dumb-http support, we can remove the function entirely then.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Like sha1_pack_name() that we dropped in the previous commit, this
function uses an error-prone static strbuf and the somewhat misleading
name "sha1".
The only caller left is in pack-redundant.c. While this command is
marked for potential removal in our BreakingChanges document, we still
have it for now. But it's simple enough to convert it to use its own
strbuf with the underlying odb_pack_name() function, letting us drop the
otherwise obsolete function.
Note that odb_pack_name() does its own strbuf_reset(), so it's safe to
use directly within a loop like this.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
The sha1_pack_name() function has a few ugly bits:
- it writes into a static strbuf (and not even a ring buffer of them),
which can lead to subtle invalidation problems
- it uses the term "sha1", but it's really using the_hash_algo, which
could be sha256
There's only one caller of it left. And in fact that caller is better
off using the underlying odb_pack_name() function itself, since it's
just copying the result into its own strbuf anyway.
Converting that caller lets us get rid of this now-obselete function.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
The has_pack_index() function has several oddities that may make it
surprising if you are trying to find out if we have a pack with some
$hash:
- it is not looking for a valid pack that we found while searching
object directories. It just looks for any pack-$hash.idx file in the
pack directory.
- it only looks in the local directory, not any alternates
- it takes a bare "unsigned char" hash, which we try to avoid these
days
The only caller it has is in the dumb http code; it wants to know if we
already have the pack idx in question. This can happen if we downloaded
the pack (and generated its index) during a previous fetch.
Before the previous patch ("dumb-http: store downloaded pack idx as
tempfile"), it could also happen if we downloaded the .idx from the
remote but didn't get the matching .pack. But since that patch, we don't
hold on to those .idx files. So there's no need to look for the .idx
file in the filesystem; we can just scan through the packed_git list to
see if we have it.
That lets us simplify the dumb http code a bit, as we know that if we
have the .idx we have the matching .pack already. And it lets us get rid
of this odd function that is unlikely to be needed again.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Now that the rest of the MIDX subsystem and relevant callers have been
updated to learn about how to read and process incremental MIDX chains,
let's finally update the implementation in `write_midx_internal()` to be
able to write incremental MIDX chains.
This new feature is available behind the `--incremental` option for the
`multi-pack-index` builtin, like so:
$ git multi-pack-index write --incremental
The implementation for doing so is relatively straightforward, and boils
down to a handful of different kinds of changes implemented in this
patch:
- The `compute_sorted_entries()` function is taught to reject objects
which appear in any existing MIDX layer.
- Functions like `write_midx_revindex()` are adjusted to write
pack_order values which are offset by the number of objects in the
base MIDX layer.
- The end of `write_midx_internal()` is adjusted to move
non-incremental MIDX files when necessary (i.e. when creating an
incremental chain with an existing non-incremental MIDX in the
repository).
There are a handful of other changes that are introduced, like new
functions to clear incremental MIDX files that are unrelated to the
current chain (using the same "keep_hash" mechanism as in the
non-incremental case).
The tests explicitly exercising the new incremental MIDX feature are
relatively limited for two reasons:
1. Most of the "interesting" behavior is already thoroughly covered in
t5319-multi-pack-index.sh, which handles the core logic of reading
objects through a MIDX.
The new tests in t5334-incremental-multi-pack-index.sh are mostly
focused on creating and destroying incremental MIDXs, as well as
stitching their results together across layers.
2. A new GIT_TEST environment variable is added called
"GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL", which modifies the
entire test suite to write incremental MIDXs after repacking when
combined with the "GIT_TEST_MULTI_PACK_INDEX" variable.
This exercises the long tail of other interesting behavior that is
defined implicitly throughout the rest of the CI suite. It is
likewise added to the linux-TEST-vars job.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The "hash-ll.h" header was introduced via d1cbe1e6d8 (hash-ll.h: split
out of hash.h to remove dependency on repository.h, 2023-04-22) to make
explicit the split between hash-related functions that rely on the
global `the_repository`, and those that don't. This split is no longer
necessary now that we we have removed the reliance on `the_repository`.
Merge "hash-ll.h" back into "hash.h". This causes some code units to not
include "repository.h" anymore, which requires us to add some forward
declarations.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
One parameter is called `file_pach`. On the face of it, this looks as if
it was supposed to talk about a `path` instead of a `pach`.
However, looking at the way this callback is called, it gets fed the
`d_name` from a directory entry, which provides just the file name, not
the full path. Therefore, let's fix this by calling the parameter
`file_name` instead.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* ab/remove-implicit-use-of-the-repository:
libs: use "struct repository *" argument, not "the_repository"
post-cocci: adjust comments for recent repo_* migration
cocci: apply the "revision.h" part of "the_repository.pending"
cocci: apply the "rerere.h" part of "the_repository.pending"
cocci: apply the "refs.h" part of "the_repository.pending"
cocci: apply the "promisor-remote.h" part of "the_repository.pending"
cocci: apply the "packfile.h" part of "the_repository.pending"
cocci: apply the "pretty.h" part of "the_repository.pending"
cocci: apply the "object-store.h" part of "the_repository.pending"
cocci: apply the "diff.h" part of "the_repository.pending"
cocci: apply the "commit.h" part of "the_repository.pending"
cocci: apply the "commit-reach.h" part of "the_repository.pending"
cocci: apply the "cache.h" part of "the_repository.pending"
cocci: add missing "the_repository" macros to "pending"
cocci: sort "the_repository" rules by header
cocci: fix incorrect & verbose "the_repository" rules
cocci: remove dead rule from "the_repository.pending.cocci"
Apply the part of "the_repository.pending.cocci" pertaining to
"packfile.h".
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Ever since a64215b6cd ("object.h: stop depending on cache.h; make
cache.h depend on object.h", 2023-02-24), we have a few headers that
could have replaced their include of cache.h with an include of
object.h. Make that change now.
Some C files had to start including cache.h after this change (or some
smaller header it had brought in), because the C files were depending
on things from cache.h but were only formerly implicitly getting
cache.h through one of these headers being modified in this patch.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Since its definition in 2f4ba2a867 (packfile: prepare for the existence
of '*.rev' files, 2021-01-25), the only caller of
`close_pack_revindex()` was within packfile.c.
Thus there is no need for this to be exposed via packfile.h, and instead
can remain static within packfile.c's compilation unit. While we're
here, move the function's opening brace onto its own line.
Noticed-by: Ramsay Jones <ramsay@ramsayjones.plus.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The single caller has a full object ID, so pass it on instead of just
its hash member.
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
All callers have full object IDs, so pass them on instead of just their
hash member.
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Future callers will want a function to fill a 'struct pack_entry' for a
given object id but _only_ from its position in any kept pack(s).
In particular, an new 'git repack' mode which ensures the resulting
packs form a geometric progress by object count will mark packs that it
does not want to repack as "kept in-core", and it will want to halt a
reachability traversal as soon as it visits an object in any of the kept
packs. But, it does not want to halt the traversal at non-kept, or
.keep packs.
The obvious alternative is 'find_pack_entry()', but this doesn't quite
suffice since it only returns the first pack it finds, which may or may
not be kept (and the mru cache makes it unpredictable which one you'll
get if there are options).
Short of that, you could walk over all packs looking for the object in
each one, but it scales with the number of packs, which may be
prohibitive.
Introduce 'find_kept_pack_entry()', a function which is like
'find_pack_entry()', but only fills in objects in the kept packs.
Handle packs which have .keep files, as well as in-core kept packs
separately, since certain callers will want to distinguish one from the
other. (Though on-disk and in-core kept packs share the adjective
"kept", it is best to think of the two sets as independent.)
There is a gotcha when looking up objects that are duplicated in kept
and non-kept packs, particularly when the MIDX stores the non-kept
version and the caller asked for kept objects only. This could be
resolved by teaching the MIDX to resolve duplicates by always favoring
the kept pack (if one exists), but this breaks an assumption in existing
MIDXs, and so it would require a format change.
The benefit to changing the MIDX in this way is marginal, so we instead
have a more thorough check here which is explained with a comment.
Callers will be added in subsequent patches.
Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Specify the format of the on-disk reverse index 'pack-*.rev' file, as
well as prepare the code for the existence of such files.
The reverse index maps from pack relative positions (i.e., an index into
the array of object which is sorted by their offsets within the
packfile) to their position within the 'pack-*.idx' file. Today, this is
done by building up a list of (off_t, uint32_t) tuples for each object
(the off_t corresponding to that object's offset, and the uint32_t
corresponding to its position in the index). To convert between pack and
index position quickly, this array of tuples is radix sorted based on
its offset.
This has two major drawbacks:
First, the in-memory cost scales linearly with the number of objects in
a pack. Each 'struct revindex_entry' is sizeof(off_t) +
sizeof(uint32_t) + padding bytes for a total of 16.
To observe this, force Git to load the reverse index by, for e.g.,
running 'git cat-file --batch-check="%(objectsize:disk)"'. When asking
for a single object in a fresh clone of the kernel, Git needs to
allocate 120+ MB of memory in order to hold the reverse index in memory.
Second, the cost to sort also scales with the size of the pack.
Luckily, this is a linear function since 'load_pack_revindex()' uses a
radix sort, but this cost still must be paid once per pack per process.
As an example, it takes ~60x longer to print the _size_ of an object as
it does to print that entire object's _contents_:
Benchmark #1: git.compile cat-file --batch <obj
Time (mean ± σ): 3.4 ms ± 0.1 ms [User: 3.3 ms, System: 2.1 ms]
Range (min … max): 3.2 ms … 3.7 ms 726 runs
Benchmark #2: git.compile cat-file --batch-check="%(objectsize:disk)" <obj
Time (mean ± σ): 210.3 ms ± 8.9 ms [User: 188.2 ms, System: 23.2 ms]
Range (min … max): 193.7 ms … 224.4 ms 13 runs
Instead, avoid computing and sorting the revindex once per process by
writing it to a file when the pack itself is generated.
The format is relatively straightforward. It contains an array of
uint32_t's, the length of which is equal to the number of objects in the
pack. The ith entry in this table contains the index position of the
ith object in the pack, where "ith object in the pack" is determined by
pack offset.
One thing that the on-disk format does _not_ contain is the full (up to)
eight-byte offset corresponding to each object. This is something that
the in-memory revindex contains (it stores an off_t in 'struct
revindex_entry' along with the same uint32_t that the on-disk format
has). Omit it in the on-disk format, since knowing the index position
for some object is sufficient to get a constant-time lookup in the
pack-*.idx file to ask for an object's offset within the pack.
This trades off between the on-disk size of the 'pack-*.rev' file for
runtime to chase down the offset for some object. Even though the lookup
is constant time, the constant is heavier, since it can potentially
involve two pointer walks in v2 indexes (one to access the 4-byte offset
table, and potentially a second to access the double wide offset table).
Consider trying to map an object's pack offset to a relative position
within that pack. In a cold-cache scenario, more page faults occur while
switching between binary searching through the reverse index and
searching through the *.idx file for an object's offset. Sure enough,
with a cold cache (writing '3' into '/proc/sys/vm/drop_caches' after
'sync'ing), printing out the entire object's contents is still
marginally faster than printing its size:
Benchmark #1: git.compile cat-file --batch-check="%(objectsize:disk)" <obj >/dev/null
Time (mean ± σ): 22.6 ms ± 0.5 ms [User: 2.4 ms, System: 7.9 ms]
Range (min … max): 21.4 ms … 23.5 ms 41 runs
Benchmark #2: git.compile cat-file --batch <obj >/dev/null
Time (mean ± σ): 17.2 ms ± 0.7 ms [User: 2.8 ms, System: 5.5 ms]
Range (min … max): 15.6 ms … 18.2 ms 45 runs
(Numbers taken in the kernel after cheating and using the next patch to
generate a reverse index). There are a couple of approaches to improve
cold cache performance not pursued here:
- We could include the object offsets in the reverse index format.
Predictably, this does result in fewer page faults, but it triples
the size of the file, while simultaneously duplicating a ton of data
already available in the .idx file. (This was the original way I
implemented the format, and it did show
`--batch-check='%(objectsize:disk)'` winning out against `--batch`.)
On the other hand, this increase in size also results in a large
block-cache footprint, which could potentially hurt other workloads.
- We could store the mapping from pack to index position in more
cache-friendly way, like constructing a binary search tree from the
table and writing the values in breadth-first order. This would
result in much better locality, but the price you pay is trading
O(1) lookup in 'pack_pos_to_index()' for an O(log n) one (since you
can no longer directly index the table).
So, neither of these approaches are taken here. (Thankfully, the format
is versioned, so we are free to pursue these in the future.) But, cold
cache performance likely isn't interesting outside of one-off cases like
asking for the size of an object directly. In real-world usage, Git is
often performing many operations in the revindex (i.e., asking about
many objects rather than a single one).
The trade-off is worth it, since we will avoid the vast majority of the
cost of generating the revindex that the extra pointer chase will look
like noise in the following patch's benchmarks.
This patch describes the format and prepares callers (like in
pack-revindex.c) to be able to read *.rev files once they exist. An
implementation of the writer will appear in the next patch, and callers
will gradually begin to start using the writer in the patches that
follow after that.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When a repository has an alternate object directory configured, callers
can traverse through each alternate's MIDX by walking the '->next'
pointer.
But, when 'prepare_multi_pack_index_one()' loads multiple MIDXs, it
places the new ones at the front of this pointer chain, not at the end.
This can be confusing for callers such as 'git repack -ad', causing test
failures like in t7700.6 with 'GIT_TEST_MULTI_PACK_INDEX=1'.
The occurs when dropping a pack known to the local MIDX with alternates
configured that have their own MIDX. Since the alternate's MIDX is
returned via 'get_multi_pack_index()', 'midx_contains_pack()' returns
true (which is correct, since it traverses through the '->next' pointer
to find the MIDX in the chain that does contain the requested object).
But, we call 'clear_midx_file()' on 'the_repository', which drops the
MIDX at the path of the first MIDX in the chain, which (in the case of
t7700.6 is the one in the alternate).
This patch addresses that by:
- placing the local MIDX first in the chain when calling
'prepare_multi_pack_index_one()', and
- introducing a new 'get_local_multi_pack_index()', which explicitly
returns the repository-local MIDX, if any.
Don't impose an additional order on the MIDX's '->next' pointer beyond
that the first item in the chain must be local if one exists so that we
avoid a quadratic insertion.
Likewise, use 'get_local_multi_pack_index()' in
'remove_redundant_pack()' to fix the formerly broken t7700.6 when run
with 'GIT_TEST_MULTI_PACK_INDEX=1'.
Finally, note that the MIDX ordering invariant is only preserved by the
insertion order in 'prepare_packed_git()', which traverses through the
ODB's '->next' pointer, meaning we visit the local object store first.
This fragility makes this an undesirable long-term solution if more
callers are added, but it is acceptable for now since this is the only
caller.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Once upon a time, nth_packed_object_sha1() was the primary way to get
the oid of a packfile's index position. But these days we have the more
type-safe nth_packed_object_id() wrapper, and all callers have been
converted.
Let's drop the "sha1" version (turning the safer wrapper into a single
function) so that nobody is tempted to introduce new callers.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Our nth_packed_object_sha1() function returns NULL for error. So when we
wrapped it with nth_packed_object_oid(), we kept the same semantics. But
it's a bit funny, because the caller actually passes in an out
parameter, and the pointer we return is just that same struct they
passed to us (or NULL).
It's not too terrible, but it does make the interface a little
non-idiomatic. Let's switch to our usual "0 for success, negative for
error" return value. Most callers either don't check it, or are
trivially converted. The one that requires the biggest change is
actually improved, as we can ditch an extra aliased pointer variable.
Since we are changing the interface in a subtle way that the compiler
wouldn't catch, let's also change the name to catch any topics in
flight. We can drop the 'o' and make it nth_packed_object_id(). That's
slightly shorter, but also less redundant since the 'o' stands for
"object" already.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In a following commit get_delta_base() will be used outside
packfile.c, so let's make it non static and declare it in
packfile.h.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In 336226c259 (packfile.h: drop extern from function declarations,
2019-04-05), `extern` was removed from function declarations because
it's redundant. However, in 8434e85d5f (repack: refactor pack deletion
for future use, 2019-06-10), an `extern` was mistakenly included.
Remove this spurious `extern`.
Signed-off-by: Denton Liu <liu.denton@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
"git multi-pack-index" learned expire and repack subcommands.
* ds/midx-expire-repack:
t5319: use 'test-tool path-utils' instead of 'ls -l'
t5319-multi-pack-index.sh: test batch size zero
midx: add test that 'expire' respects .keep files
multi-pack-index: test expire while adding packs
midx: implement midx_repack()
multi-pack-index: prepare 'repack' subcommand
multi-pack-index: implement 'expire' subcommand
midx: refactor permutation logic and pack sorting
midx: simplify computation of pack name lengths
multi-pack-index: prepare for 'expire' subcommand
Docs: rearrange subcommands for multi-pack-index
repack: refactor pack deletion for future use
The commit-graph file is now part of the "files that the runtime
may keep open file descriptors on, all of which would need to be
closed when done with the object store", and the file descriptor to
an existing commit-graph file now is closed before "gc" finalizes a
new instance to replace it.
* ds/close-object-store:
packfile: rename close_all_packs to close_object_store
packfile: close commit-graph in close_all_packs
commit-graph: use raw_object_store when closing
The close_all_packs() method is now responsible for more than just pack-files.
It also closes the commit-graph and the multi-pack-index. Rename the function
to be more descriptive of its larger role. The name also fits because the
input parameter is a raw_object_store.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The repack builtin deletes redundant pack-files and their
associated .idx, .promisor, .bitmap, and .keep files. We will want
to re-use this logic in the future for other types of repack, so
pull the logic into 'unlink_pack_path()' in packfile.c.
The 'ignore_keep' parameter is enabled for the use in repack, but
will be important for a future caller.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Further code clean-up to allow the lowest level of name-to-object
mapping layer to work with a passed-in repository other than the
default one.
* nd/sha1-name-c-wo-the-repository: (34 commits)
sha1-name.c: remove the_repo from get_oid_mb()
sha1-name.c: remove the_repo from other get_oid_*
sha1-name.c: remove the_repo from maybe_die_on_misspelt_object_name
submodule-config.c: use repo_get_oid for reading .gitmodules
sha1-name.c: add repo_get_oid()
sha1-name.c: remove the_repo from get_oid_with_context_1()
sha1-name.c: remove the_repo from resolve_relative_path()
sha1-name.c: remove the_repo from diagnose_invalid_index_path()
sha1-name.c: remove the_repo from handle_one_ref()
sha1-name.c: remove the_repo from get_oid_1()
sha1-name.c: remove the_repo from get_oid_basic()
sha1-name.c: remove the_repo from get_describe_name()
sha1-name.c: remove the_repo from get_oid_oneline()
sha1-name.c: add repo_interpret_branch_name()
sha1-name.c: remove the_repo from interpret_branch_mark()
sha1-name.c: remove the_repo from interpret_nth_prior_checkout()
sha1-name.c: remove the_repo from get_short_oid()
sha1-name.c: add repo_for_each_abbrev()
sha1-name.c: store and use repo in struct disambiguate_state
sha1-name.c: add repo_find_unique_abbrev_r()
...
Code clean-up around a much-less-important-than-it-used-to-be
update_server_info() funtion.
* jk/server-info-rabbit-hole:
update_info_refs(): drop unused force parameter
server-info: drop objdirlen pointer arithmetic
server-info: drop nr_alloc struct member
server-info: use strbuf to read old info/packs file
server-info: simplify cleanup in parse_pack_def()
server-info: fix blind pointer arithmetic
http: simplify parsing of remote objects/info/packs
packfile: fix pack basename computation
midx: check both pack and index names for containment
t5319: drop useless --buffer from cat-file
t5319: fix bogus cat-file argument
pack-revindex: open index if necessary
packfile.h: drop extern from function declarations
When we have a multi-pack-index that covers many packfiles, we try to
avoid opening the .idx for those packfiles. To do that we feed the pack
name to midx_contains_pack(). But that function wants to see only the
basename, which we compute using strrchr() to find the final slash. But
that leaves an extra "/" at the start of our string.
We can fix this by incrementing the pointer. That also raises the
question of what to do when the name does not have a '/' at all. This
should generally not happen (we always find files in "pack/"), but it
doesn't hurt to be defensive here.
Let's wrap all of that up in a helper function and make it publicly
available, since a later patch will need to use it, too.
The tests don't notice because there's nothing about opening those .idx
files that would cause us to give incorrect output. It's just a little
slower. The new test checks this case by corrupting the covered .idx,
and then making sure we don't complain about it.
We also have to tweak t5570, which intentionally corrupts a .idx file
and expects us to notice it. When run with GIT_TEST_MULTI_PACK_INDEX,
this will fail since we now will (correctly) not bother opening the .idx
at all. We can fix that by unconditionally dropping any midx that's
there, which ensures we'll have to read the .idx.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
As CodingGuidelines recommends, we do not need an "extern" when
declaring a public function. Let's drop these. Note that we leave the
extern on report_garbage(), as that is actually a function pointer, not
a function itself.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Teach `multi-pack-index verify` to sort the set of object by
packfile so that only one packfile needs to be open at a time.
This is a performance improvement. Previously, objects were
verified in OID order. This essentially requires all packfiles
to be open at the same time. If the number of packfiles exceeds
the open file limit, packfiles would be LRU-closed and re-opened
many times.
Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The in-core repository instances are passed through more codepaths.
* sb/more-repo-in-api: (23 commits)
t/helper/test-repository: celebrate independence from the_repository
path.h: make REPO_GIT_PATH_FUNC repository agnostic
commit: prepare free_commit_buffer and release_commit_memory for any repo
commit-graph: convert remaining functions to handle any repo
submodule: don't add submodule as odb for push
submodule: use submodule repos for object lookup
pretty: prepare format_commit_message to handle arbitrary repositories
commit: prepare logmsg_reencode to handle arbitrary repositories
commit: prepare repo_unuse_commit_buffer to handle any repo
commit: prepare get_commit_buffer to handle any repo
commit-reach: prepare in_merge_bases[_many] to handle any repo
commit-reach: prepare get_merge_bases to handle any repo
commit-reach.c: allow get_merge_bases_many_0 to handle any repo
commit-reach.c: allow remove_redundant to handle any repo
commit-reach.c: allow merge_bases_many to handle any repo
commit-reach.c: allow paint_down_to_common to handle any repo
commit: allow parse_commit* to handle any repo
object: parse_object to honor its repository argument
object-store: prepare has_{sha1, object}_file to handle any repo
object-store: prepare read_object_file to deal with any repo
...
has_packed_and_bad is not widely used, so just migrate it all at once.
Signed-off-by: Stefan Beller <sbeller@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Breaks the majority of check_packed_git_idx() into a separate function,
load_idx(). The latter function operates on arbitrary buffers, which
makes it suitable as a fuzzing test target.
Signed-off-by: Josh Steadmon <steadmon@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
If a repo contains a multi-pack-index, then the packed_git list
does not contain the packfiles that are covered by the multi-pack-index.
This is important for doing object lookups, abbreviations, and
approximating object count. However, there are many operations that
really want to iterate over all packfiles.
Create a new 'all_packs' linked list that contains this list, starting
with the packfiles in the multi-pack-index and then continuing along
the packed_git linked list.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* ds/multi-pack-index: (23 commits)
midx: clear midx on repack
packfile: skip loading index if in multi-pack-index
midx: prevent duplicate packfile loads
midx: use midx in approximate_object_count
midx: use existing midx when writing new one
midx: use midx in abbreviation calculations
midx: read objects from multi-pack-index
config: create core.multiPackIndex setting
midx: write object offsets
midx: write object id fanout chunk
midx: write object ids in a chunk
midx: sort and deduplicate objects from packfiles
midx: read pack names into array
multi-pack-index: write pack names in chunk
multi-pack-index: read packfile list
packfile: generalize pack directory list
t5319: expand test data
multi-pack-index: load into memory
midx: write header information to lockfile
multi-pack-index: add 'write' verb
...
The API to iterate over all objects learned to optionally list
objects in the order they appear in packfiles, which helps locality
of access if the caller accesses these objects while as objects are
enumerated.
* jk/for-each-object-iteration:
for_each_*_object: move declarations to object-store.h
cat-file: use a single strbuf for all output
cat-file: split batch "buf" into two variables
cat-file: use oidset check-and-insert
cat-file: support "unordered" output for --batch-all-objects
cat-file: rename batch_{loose,packed}_object callbacks
t1006: test cat-file --batch-all-objects with duplicates
for_each_packed_object: support iterating in pack-order
for_each_*_object: give more comprehensive docstrings
for_each_*_object: take flag arguments as enum
for_each_*_object: store flag definitions in a single location
The for_each_loose_object() and for_each_packed_object()
functions are meant to be part of a unified interface: they
use the same set of for_each_object_flags, and it's not
inconceivable that we might one day add a single
for_each_object() wrapper around them.
Let's put them together in a single file, so we can avoid
awkwardness like saying "the flags for this function are
over in cache.h". Moving the loose functions to packfile.h
is silly. Moving the packed functions to cache.h works, but
makes the "cache.h is a kitchen sink" problem worse. The
best place is the recently-created object-store.h, since
these are quite obviously related to object storage.
The for_each_*_in_objdir() functions do not use the same
flags, but they are logically part of the same interface as
for_each_loose_object(), and share callback signatures. So
we'll move those, as well, as they also make sense in
object-store.h.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We currently iterate over objects within a pack in .idx
order, which uses the object hashes. That means that it
is effectively random with respect to the location of the
object within the pack. If you're going to access the actual
object data, there are two reasons to move linearly through
the pack itself:
1. It improves the locality of access in the packfile. In
the cold-cache case, this may mean fewer disk seeks, or
better usage of disk cache.
2. We store related deltas together in the packfile. Which
means that the delta base cache can operate much more
efficiently if we visit all of those related deltas in
sequence, as the earlier items are likely to still be
in the cache. Whereas if we visit the objects in
random order, our cache entries are much more likely to
have been evicted by unrelated deltas in the meantime.
So in general, if you're going to access the object contents
pack order is generally going to end up more efficient.
But if you're simply generating a list of object names, or
if you're going to end up sorting the result anyway, you're
better off just using the .idx order, as finding the pack
order means generating the in-memory pack-revindex.
According to the numbers in 8b8dfd5132 (pack-revindex:
radix-sort the revindex, 2013-07-11), that takes about 200ms
for linux.git, and 20ms for git.git (those numbers are a few
years old but are still a good ballpark).
That makes it a good optimization for some cases (we can
save tens of seconds in git.git by having good locality of
delta access, for a 20ms cost), but a bad one for others
(e.g., right now "cat-file --batch-all-objects
--batch-check="%(objectname)" is 170ms in git.git, so adding
20ms to that is noticeable).
Hence this patch makes it an optional flag. You can't
actually do any interesting timings yet, as it's not plumbed
through to any user-facing tools like cat-file. That will
come in a later patch.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We already mention the local/alternate behavior of these
functions, but we can help clarify a few other behaviors:
- there's no need to mention LOCAL_ONLY specifically, since
we already reference the flags by type (and as we add
more flags, we don't want to have to mention each)
- clarify that reachability doesn't matter here; this is
all accessible objects
- what ordering/uniqueness guarantees we give
- how pack-specific flags are handled for the loose case
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>