Once upon a time, nth_packed_object_sha1() was the primary way to get
the oid of a packfile's index position. But these days we have the more
type-safe nth_packed_object_id() wrapper, and all callers have been
converted.
Let's drop the "sha1" version (turning the safer wrapper into a single
function) so that nobody is tempted to introduce new callers.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The previous commit changed the public interface of packed_object_info()
to return a struct object_id rather than a bare hash. That enables us to
convert our internal helper, as well. We can use nth_packed_object_id()
directly for OFS_DELTA, but we'll still have to use oidread() to pull
the hash for a REF_DELTA out of the packfile.
There should be no additional cost, since we're copying directly into
the object_id the caller provided us (just as we did before; it's just
happening now via nth_packed_object_id()).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
If a caller sets the object_info.delta_base_sha1 to a non-NULL pointer,
we'll write the oid of the object's delta base to it. But we can
increase our type safety by switching this to a real object_id struct.
All of our callers are just pointing into the hash member of an
object_id anyway, so there's no inconvenience.
Note that we do still keep it as a pointer-to-struct, because the NULL
sentinel value tells us whether the caller is even interested in the
information.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Our nth_packed_object_sha1() function returns NULL for error. So when we
wrapped it with nth_packed_object_oid(), we kept the same semantics. But
it's a bit funny, because the caller actually passes in an out
parameter, and the pointer we return is just that same struct they
passed to us (or NULL).
It's not too terrible, but it does make the interface a little
non-idiomatic. Let's switch to our usual "0 for success, negative for
error" return value. Most callers either don't check it, or are
trivially converted. The one that requires the biggest change is
actually improved, as we can ditch an extra aliased pointer variable.
Since we are changing the interface in a subtle way that the compiler
wouldn't catch, let's also change the name to catch any topics in
flight. We can drop the 'o' and make it nth_packed_object_id(). That's
slightly shorter, but also less redundant since the 'o' stands for
"object" already.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Some fields in struct raw_object_store are lazy initialized by the
thread-unsafe packfile.c:prepare_packed_git(). Although this function is
present in the call stack of git-grep threads, all paths to it are
currently protected by obj_read_lock() (and the main thread usually
indirectly calls it before firing the worker threads, anyway). However,
it's possible that future modifications add new unprotected paths to it,
introducing a race condition. Because errors derived from it wouldn't
happen often, it could be hard to detect. So to prevent future
headaches, let's force eager initialization of packed_git when setting
git-grep up. There'll be a small overhead in the cases where we didn't
really need to prepare packed_git during execution but this shouldn't be
very noticeable.
Also, packed_git may be re-initialized by
packfile.c:reprepare_packed_git(). Again, all paths to it in git-grep
are already protected by obj_read_lock() but it may suffer from the same
problem in the future. So let's also internally protect it with
obj_read_lock() (which is a recursive mutex).
Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Allow object reading to be performed by multiple threads protecting it
with an internal lock, the obj_read_mutex. The lock usage can be toggled
with enable_obj_read_lock() and disable_obj_read_lock(). Currently, the
functions which can be safely called in parallel are:
read_object_file_extended(), repo_read_object_file(),
read_object_file(), read_object_with_reference(), read_object(),
oid_object_info() and oid_object_info_extended(). It's also possible
to use obj_read_lock() and obj_read_unlock() to protect other sections
that cannot execute in parallel with object reading.
Probably there are many spots in the functions listed above that could
be executed unlocked (and thus, in parallel). But, for now, we are most
interested in allowing parallel access to zlib inflation. This is one of
the sections where object reading spends most of the time in (e.g. up to
one-third of git-grep's execution time in the chromium repo corresponds
to inflation) and it's already thread-safe. So, to take advantage of
that, the obj_read_mutex is released when calling git_inflate() and
re-acquired right after, for every calling spot in
oid_object_info_extended()'s call chain. We may refine this lock to also
exploit other possible parallel spots in the future, but for now,
threaded zlib inflation should already give great speedups for threaded
object reading callers.
Note that add_delta_base_cache() was also modified to skip adding
already present entries to the cache. This wasn't possible before, but
it would be now, with the parallel inflation. Take for example the
following situation, where two threads - A and B - are executing the
code at unpack_entry():
1. Thread A is performing the decompression of a base O (which is not
yet in the cache) at PHASE II. Thread B is simultaneously trying to
unpack O, but just starting at PHASE I.
2. Since O is not yet in the cache, B will go to PHASE II to also
perform the decompression.
3. When they finish decompressing, one of them will get the object
reading mutex and go to PHASE III while the other waits for the
mutex. Let’s say A got the mutex first.
4. Thread A will add O to the cache, go throughout the rest of PHASE III
and return.
5. Thread B gets the mutex, also add O to the cache (if the check wasn't
there) and returns.
Finally, it is also important to highlight that the object reading lock
can only ensure thread-safety in the mentioned functions thanks to two
complementary mechanisms: the use of 'struct raw_object_store's
replace_mutex, which guards sections in the object reading machinery
that would otherwise be thread-unsafe; and the 'struct pack_window's
inuse_cnt, which protects window reading operations (such as the one
performed during the inflation of a packed object), allowing them to
execute without the acquisition of the obj_read_mutex.
Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We already have pread emulation for portability, so there's
there's no reason to make two syscalls where one suffices.
Furthermore, readers of the packfile will be using mmap
(or pread to emulate mmap), anyways, so the file description
offset does not matter in this case.
Signed-off-by: Eric Wong <e@80x24.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
git_open sets close-on-exec since cd66ada065
("sha1_file: open window into packfiles with O_CLOEXEC").
There's no reason to keep using fcntl to set the close-on-exec
flag, anymore.
Signed-off-by: Eric Wong <e@80x24.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When loading packfiles on start-up, we traverse the internal packfile
list once per file to avoid reloading packfiles that have already
been loaded. This check runs in quadratic time, so for poorly
maintained repos with a large number of packfiles, it can be pretty
slow.
Add a hashmap containing the packfile names as we load them so that
the average runtime cost of checking for already-loaded packs becomes
constant.
Add a perf test to p5303 to show speed-up.
The existing p5303 test runtimes are dominated by other factors and do
not show an appreciable speed-up. The new test in p5303 clearly exposes
a speed-up in bad cases. In this test we create 10,000 packfiles and
measure the start-up time of git rev-parse, which does little else
besides load in the packs.
Here are the numbers for the new p5303 test:
Test HEAD^ HEAD
---------------------------------------------------------------------
5303.12: load 10,000 packs 1.03(0.92+0.10) 0.12(0.02+0.09) -88.3%
Signed-off-by: Colin Stolley <cstolley@runbox.com>
Helped-by: Jeff King <peff@peff.net>
[jc: squashed the change to call hashmap in install_packed_git() by peff]
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Another step in eliminating the requirement of hashmap_entry
being the first member of a struct.
Signed-off-by: Eric Wong <e@80x24.org>
Reviewed-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Update callers to use hashmap_get_entry, hashmap_get_entry_from_hash
or container_of as appropriate.
This is another step towards eliminating the requirement of
hashmap_entry being the first field in a struct.
Signed-off-by: Eric Wong <e@80x24.org>
Reviewed-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This is less error-prone than "const void *" as the compiler
now detects invalid types being passed.
Signed-off-by: Eric Wong <e@80x24.org>
Reviewed-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This is less error-prone than "void *" as the compiler now
detects invalid types being passed.
Signed-off-by: Eric Wong <e@80x24.org>
Reviewed-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
C compilers do type checking to make life easier for us. So
rely on that and update all hashmap_entry_init callers to take
"struct hashmap_entry *" to avoid future bugs while improving
safety and readability.
Signed-off-by: Eric Wong <e@80x24.org>
Reviewed-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This hashmap_entry_init function is intended to take a
hashmap_entry struct pointer, not a hashmap struct pointer.
This was not noticed because hashmap_entry_init takes a "void *"
arg instead of "struct hashmap_entry *", and the hashmap struct
is larger and can be cast into a hashmap_entry struct without
data corruption.
This has the beneficial side effect of reducing the size of
a delta_base_cache_entry from 104 bytes to 72 bytes on 64-bit
systems.
Signed-off-by: Eric Wong <e@80x24.org>
Reviewed-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Found with "git grep '^#include ' '*.c' | sort | uniq -d".
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In a following commit get_delta_base() will be used outside
packfile.c, so let's make it non static and declare it in
packfile.h.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Avoid derefencing ->tagged without checking for NULL by using the
convenience wrapper for getting the ID of the tagged object. It die()s
when encountering a broken tag instead of segfaulting.
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Replace a use of sha1_to_hex with hash_to_hex so that this code works
with a hash algorithm other than SHA-1.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Long ago, in 97bfeb34df (Release pack windows before reporting out of
memory., 2006-12-24), we taught xmalloc() and friends to try unmapping
pack windows when malloc() failed. It's unlikely that his helps a lot in
practice, and it has some downsides. First, the downsides:
1. It makes xmalloc() not thread-safe. We've worked around this in
pack-objects.c, which installs its own locking version of the
try_to_free_routine(). But other threaded code doesn't.
2. It makes the system as a whole harder to reason about. Functions
which allocate heap memory under the hood may have farther-reaching
effects than expected.
That might be worth the tradeoff if there's a benefit. But in practice,
it seems unlikely. We're generally dealing with mmap'd files, so the OS
is going to do a much better job at responding to memory pressure by
dropping individual pages (the exception is systems with NO_MMAP, but
even there the OS can probably respond just as well with swapping).
So the only thing we're really freeing is address space. On 64-bit
systems, we have plenty of that to go around. On 32-bit systems, it
could possibly help. But around the same time we made two other changes:
77ccc5bbd1 (Introduce new config option for mmap limit., 2006-12-23) and
60bb8b1453 (Fully activate the sliding window pack access., 2006-12-23).
Together that means that a 32-bit system should have no more than 256MB
total of packed-git mmaps at one time, split between a few 32MB windows.
It's unlikely we have any address space problems since then, but we
don't have any data since the features were all added at the same time.
Likewise, xmmap() will try to free memory. At first glance, it seems
like we'd need this (when we try to mmap a new window, we might need to
close an old one to save address space on a 32-bit system). But we're
saved again by core.packedGitLimit: if we're going to exceed our 256MB
limit, we'll close an existing window before we even call mmap().
So it seems unlikely that this feature is actually doing anything
useful. And while we don't have reports of it harming anything (probably
because it rarely if ever kicks in), it would be nice to simplify the
system overall. This patch drops the whole try_to_free system from
xmalloc(), as well as the manual pack memory release in xmmap().
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Instead of using the repository_format_partial_clone global
and fetch_objects() directly, let's use has_promisor_remote()
and promisor_remote_get_direct().
This way all the configured promisor remotes will be taken
into account, not only the one specified by
extensions.partialClone.
Also when cloning or fetching using a partial clone filter,
remote.origin.promisor will be set to "true" instead of
setting extensions.partialClone to "origin". This makes it
possible to use many promisor remote just by fetching from
them.
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Convert calls of memcpy(3) to use COPY_ARRAY, which shortens and
simplifies the code a bit.
Patch generated by Coccinelle and contrib/coccinelle/array.cocci.
Signed-off-by: Rene Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The close_all_packs() method is now responsible for more than just pack-files.
It also closes the commit-graph and the multi-pack-index. Rename the function
to be more descriptive of its larger role. The name also fits because the
input parameter is a raw_object_store.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The close_all_packs() method is used to close all read handles to
pack-files and the multi-pack-index before running 'git gc --auto'.
This is particularly important on the Windows platform, where read
handles block any writes to those files. Replacing one of these
files with a rename() will fail in this situation.
The commit-graph also performs a rename, so is susceptable to this
problem. We are careful to close the commit-graph before writing,
but that doesn't work when a 'git fetch' (or similar) process runs
'git gc --auto' which may write a commit-graph.
Here, close the commit-graph as part of close_all_packs().
Reported-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The repack builtin deletes redundant pack-files and their
associated .idx, .promisor, .bitmap, and .keep files. We will want
to re-use this logic in the future for other types of repack, so
pull the logic into 'unlink_pack_path()' in packfile.c.
The 'ignore_keep' parameter is enabled for the use in repack, but
will be important for a future caller.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
use_pack has its own error message on mmap error, but it can't be
reached when using xmmap, which dies with its own error.
Signed-off-by: Mike Hommey <mh@glandium.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The multi-pack-index allows searching for objects across multiple
packs using one object list. The original design gains many of
these performance benefits by keeping the packs in the
multi-pack-index out of the packed_git list.
Unfortunately, this has one major drawback. If the multi-pack-index
covers thousands of packs, and a command loads many of those packs,
then we can hit the limit for open file descriptors. The
close_one_pack() method is used to limit this resource, but it
only looks at the packed_git list, and uses an LRU cache to prevent
thrashing.
Instead of complicating this close_one_pack() logic to include
direct references to the multi-pack-index, simply add the packs
opened by the multi-pack-index to the packed_git list. This
immediately solves the file-descriptor limit problem, but requires
some extra steps to avoid performance issues or other problems:
1. Create a multi_pack_index bit in the packed_git struct that is
one if and only if the pack was loaded from a multi-pack-index.
2. Skip packs with the multi_pack_index bit when doing object
lookups and abbreviations. These algorithms already check the
multi-pack-index before the packed_git struct. This has a very
small performance hit, as we need to walk more packed_git
structs. This is acceptable, since these operations run binary
search on the other packs, so this walk-and-ignore logic is
very fast by comparison.
3. When closing a multi-pack-index file, do not close its packs,
as those packs will be closed using close_all_packs(). In some
cases, such as 'git repack', we run 'close_midx()' without also
closing the packs, so we need to un-set the multi_pack_index bit
in those packs. This is necessary, and caught by running
t6501-freshen-objects.sh with GIT_TEST_MULTI_PACK_INDEX=1.
To manually test this change, I inserted trace2 logging into
close_pack_fd() and set pack_max_fds to 10, then ran 'git rev-list
--all --objects' on a copy of the Git repo with 300+ pack-files and
a multi-pack-index. The logs verified the packs are closed as
we read them beyond the file descriptor limit.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Much of the multi-pack-index code focuses on the multi_pack_index
struct, and so we only pass a pointer to the current one. However,
we will insert a dependency on the packed_git linked list in a
future change, so we will need a repository reference. Inserting
these parameters is a significant enough change to split out.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When we have a multi-pack-index that covers many packfiles, we try to
avoid opening the .idx for those packfiles. To do that we feed the pack
name to midx_contains_pack(). But that function wants to see only the
basename, which we compute using strrchr() to find the final slash. But
that leaves an extra "/" at the start of our string.
We can fix this by incrementing the pointer. That also raises the
question of what to do when the name does not have a '/' at all. This
should generally not happen (we always find files in "pack/"), but it
doesn't hurt to be defensive here.
Let's wrap all of that up in a helper function and make it publicly
available, since a later patch will need to use it, too.
The tests don't notice because there's nothing about opening those .idx
files that would cause us to give incorrect output. It's just a little
slower. The new test checks this case by corrupting the covered .idx,
and then making sure we don't complain about it.
We also have to tweak t5570, which intentionally corrupts a .idx file
and expects us to notice it. When run with GIT_TEST_MULTI_PACK_INDEX,
this will fail since we now will (correctly) not bother opening the .idx
at all. We can fix that by unconditionally dropping any midx that's
there, which ensures we'll have to read the .idx.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We can't create a pack revindex if we haven't actually looked at the
index. Normally we would never get as far as creating a revindex without
having already been looking in the pack, so this code never bothered to
double-check that pack->index_data had been loaded.
But with the new multi-pack-index feature, many code paths might not
load the individual pack .idx at all (they'd find objects via the midx
and then open the .pack, but not its index).
This can't yet be triggered in practice, because a bug in the midx code
means we accidentally open up the individual .idx files anyway. But in
preparation for fixing that, let's have the revindex code check that
everything it needs has been loaded.
In most cases this will just be a quick noop. But note that this does
introduce a possibility of error (if we have to open the index and it's
corrupt), so load_pack_revindex() now returns a result code, and callers
need to handle the error.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This member is used to represent the pack checksum of the pack in
question. Expand this member to be GIT_MAX_RAWSZ bytes in length so it
works with longer hashes and rename it to be "hash" instead of "sha1".
This transformation was made with a change to the definition and the
following semantic patch:
@@
struct packed_git *E1;
@@
- E1->sha1
+ E1->hash
@@
struct packed_git E1;
@@
- E1.sha1
+ E1.hash
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Teach `multi-pack-index verify` to sort the set of object by
packfile so that only one packfile needs to be open at a time.
This is a performance improvement. Previously, objects were
verified in OID order. This essentially requires all packfiles
to be open at the same time. If the number of packfiles exceeds
the open file limit, packfiles would be LRU-closed and re-opened
many times.
Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When parsing a tree, we read the object ID directly out of the tree
buffer. This is normally fine, but such an object ID cannot be used with
oidcpy, which copies GIT_MAX_RAWSZ bytes, because if we are using SHA-1,
there may not be that many bytes to copy.
Instead, store the object ID in a separate struct member. Since we can
no longer efficiently compute the path length, store that information as
well in struct name_entry. Ensure we only copy the object ID into the
new buffer if the path length is nonzero, as some callers will pass us
an empty path with no object ID following it, and we will not want to
read past the end of the buffer.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Add and use a function for emptying the loose object cache, so callers
don't have to know any of its implementation details.
Signed-off-by: Rene Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Our object_directory struct has a loose objects cache that all users of
the struct can see. But the only one that knows how to load the cache is
find_short_object_filename(). Let's extract that logic in to a reusable
function.
While we're at it, let's also reset the cache when we re-read the object
directories. This shouldn't have an impact on performance, as re-reads
are meant to be rare (and are already expensive, so we avoid them with
things like OBJECT_INFO_QUICK).
Since the cache is already meant to be an approximation, it's tempting
to skip even this bit of safety. But it's necessary to allow more code
to use it. For instance, fetch-pack explicitly re-reads the object
directory after performing its fetch, and would be confused if we didn't
clear the cache.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Our handling of alternate object directories is needlessly different
from the main object directory. As a result, many places in the code
basically look like this:
do_something(r->objects->objdir);
for (odb = r->objects->alt_odb_list; odb; odb = odb->next)
do_something(odb->path);
That gets annoying when do_something() is non-trivial, and we've
resorted to gross hacks like creating fake alternates (see
find_short_object_filename()).
Instead, let's give each raw_object_store a unified list of
object_directory structs. The first will be the main store, and
everything after is an alternate. Very few callers even care about the
distinction, and can just loop over the whole list (and those who care
can just treat the first element differently).
A few observations:
- we don't need r->objects->objectdir anymore, and can just
mechanically convert that to r->objects->odb->path
- object_directory's path field needs to become a real pointer rather
than a FLEX_ARRAY, in order to fill it with expand_base_dir()
- we'll call prepare_alt_odb() earlier in many functions (i.e.,
outside of the loop). This may result in us calling it even when our
function would be satisfied looking only at the main odb.
But this doesn't matter in practice. It's not a very expensive
operation in the first place, and in the majority of cases it will
be a noop. We call it already (and cache its results) in
prepare_packed_git(), and we'll generally check packs before loose
objects. So essentially every program is going to call it
immediately once per program.
Arguably we should just prepare_alt_odb() immediately upon setting
up the repository's object directory, which would save us sprinkling
calls throughout the code base (and forgetting to do so has been a
source of subtle bugs in the past). But I've stopped short of that
here, since there are already a lot of other moving parts in this
patch.
- Most call sites just get shorter. The check_and_freshen() functions
are an exception, because they have entry points to handle local and
nonlocal directories separately.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In preparation for unifying the handling of alt odb's and the normal
repo object directory, let's use a more neutral name. This patch is
purely mechanical, swapping the type name, and converting any variables
named "alt" to "odb". There should be no functional change, but it will
reduce the noise in subsequent diffs.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Whenever we delete pack-files from the object directory, we need
to also delete the multi-pack-index that may refer to those
objects. Sometimes, this connection is obvious, like during a
repack. Other times, this is less obvious, like when gc calls
a repack command and then does other actions on the objects, like
write a commit-graph file.
The pattern we use to avoid out-of-date in-memory packed_git
structs is to call close_all_packs(). This should also call
close_midx(). Since we already pass an object store to
close_all_packs(), this is a nicely scoped operation.
This fixes a test failure when running t6500-gc.sh with
GIT_TEST_MULTI_PACK_INDEX=1.
Reported-by: Szeder Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
has_packed_and_bad is not widely used, so just migrate it all at once.
Signed-off-by: Stefan Beller <sbeller@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Breaks the majority of check_packed_git_idx() into a separate function,
load_idx(). The latter function operates on arbitrary buffers, which
makes it suitable as a fuzzing test target.
Signed-off-by: Josh Steadmon <steadmon@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Replace uses of GIT_SHA1_RAWSZ with references to the_hash_algo to avoid
dependence on a particular hash length.
It's likely that in the future, we'll update the pack format to indicate
what hash algorithm it uses, and then this code will change. However,
at least on an interim basis, make it easier to develop on a pure
SHA-256 Git by using the_hash_algo here.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This rounds out the previous three patches, covering the
inequality logic for the "hash" variant of the functions.
As with the previous three, the accompanying code changes
are the mechanical result of applying the coccinelle patch;
see those patches for more discussion.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This is the partner patch to the previous one, but covering
the "hash" variants instead of "oid". Note that our
coccinelle rule is slightly more complex to avoid triggering
the call in hasheq().
I didn't bother to add a new rule to convert:
- hasheq(E1->hash, E2->hash)
+ oideq(E1, E2)
Since these are new functions, there won't be any such
existing callers. And since most of the code is already
using oideq, we're not likely to introduce new ones.
We might still see "!hashcmp(E1->hash, E2->hash)" from topics
in flight. But because our new rule comes after the existing
ones, that should first get converted to "!oidcmp(E1, E2)"
and then to "oideq(E1, E2)".
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
There are many places in the codebase that want to iterate over
all packfiles known to Git. The purposes are wide-ranging, and
those that can take advantage of the multi-pack-index already
do. So, use get_all_packs() instead of get_packed_git() to be
sure we are iterating over all packfiles.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
If a repo contains a multi-pack-index, then the packed_git list
does not contain the packfiles that are covered by the multi-pack-index.
This is important for doing object lookups, abbreviations, and
approximating object count. However, there are many operations that
really want to iterate over all packfiles.
Create a new 'all_packs' linked list that contains this list, starting
with the packfiles in the multi-pack-index and then continuing along
the packed_git linked list.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When prepare_packed_git is called with the report_garbage method
initialized, we report unexpected files in the objects directory
as garbage. Stop reporting the multi-pack-index and the pack-files
it covers as garbage.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
A pack-file is 'local' if it is stored within the usual object
directory. If it is stored in an alternate, it is non-local.
Pack-files are stored using a 'pack_local' member in the packed_git
struct. Add a similar 'local' member to the multi_pack_index struct
and 'local' parameters to the methods that load and prepare multi-
pack-indexes.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We currently iterate over objects within a pack in .idx
order, which uses the object hashes. That means that it
is effectively random with respect to the location of the
object within the pack. If you're going to access the actual
object data, there are two reasons to move linearly through
the pack itself:
1. It improves the locality of access in the packfile. In
the cold-cache case, this may mean fewer disk seeks, or
better usage of disk cache.
2. We store related deltas together in the packfile. Which
means that the delta base cache can operate much more
efficiently if we visit all of those related deltas in
sequence, as the earlier items are likely to still be
in the cache. Whereas if we visit the objects in
random order, our cache entries are much more likely to
have been evicted by unrelated deltas in the meantime.
So in general, if you're going to access the object contents
pack order is generally going to end up more efficient.
But if you're simply generating a list of object names, or
if you're going to end up sorting the result anyway, you're
better off just using the .idx order, as finding the pack
order means generating the in-memory pack-revindex.
According to the numbers in 8b8dfd5132 (pack-revindex:
radix-sort the revindex, 2013-07-11), that takes about 200ms
for linux.git, and 20ms for git.git (those numbers are a few
years old but are still a good ballpark).
That makes it a good optimization for some cases (we can
save tens of seconds in git.git by having good locality of
delta access, for a 20ms cost), but a bad one for others
(e.g., right now "cat-file --batch-all-objects
--batch-check="%(objectname)" is 170ms in git.git, so adding
20ms to that is noticeable).
Hence this patch makes it an optional flag. You can't
actually do any interesting timings yet, as it's not plumbed
through to any user-facing tools like cat-file. That will
come in a later patch.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
It's not wrong to pass our flags in an "unsigned", as we
know it will be at least as large as the enum. However,
using the enum in the declaration makes it more obvious
where to find the list of flags.
While we're here, let's also drop the "extern" noise-words
from the declarations, per our modern coding style.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>