Commit Graph

495 Commits (cefcada1d3084ae6df4ac8b00d1c19fd6ea00ed1)

Author SHA1 Message Date
Jeff King 9622610e55 commit-graph: detect out-of-bounds extra-edges pointers
If an entry in a commit-graph file has more than 2 parents, the
fixed-size parent fields instead point to an offset within an "extra
edges" chunk. We blindly follow these, assuming that the chunk is
present and sufficiently large; this can lead to an out-of-bounds read
for a corrupt or malicious file.

We can fix this by recording the size of the chunk and adding a
bounds-check in fill_commit_in_graph(). There are a few tricky bits:

  1. We'll switch from working with a pointer to an offset. This makes
     some corner cases just fall out naturally:

      a. If we did not find an EDGE chunk at all, our size will
         correctly be zero (so everything is "out of bounds").

      b. Comparing "size / 4" lets us make sure we have at least 4 bytes
         to read, and we never compute a pointer more than one element
         past the end of the array (computing a larger pointer is
         probably OK in practice, but is technically undefined
         behavior).

      c. The current code casts to "uint32_t *". Replacing it with an
         offset avoids any comparison between different types of pointer
         (since the chunk is stored as "unsigned char *").

  2. This is the first case in which fill_commit_in_graph() may return
     anything but success. We need to make sure to roll back the
     "parsed" flag (and any parents we might have added before running
     out of buffer) so that the caller can cleanly fall back to
     loading the commit object itself.

     It's a little non-trivial to do this, and we might benefit from
     factoring it out. But we can wait on that until we actually see a
     second case where we return an error.

As a bonus, this lets us drop the st_mult() call. Since we've already
done a bounds check, we know there won't be any integer overflow (it
would imply our buffer is larger than a size_t can hold).

The included test does not actually segfault before this patch (though
you could construct a case where it does). Instead, it reads garbage
from the next chunk which results in it complaining about a bogus parent
id. This is sufficient for our needs, though (we care that the fallback
succeeds, and that stderr mentions the out-of-bounds read).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09 15:55:01 -07:00
Jeff King b72df612af commit-graph: check size of commit data chunk
We expect a commit-graph file to have a fixed-size data record for each
commit in the file (and we know the number of commits to expct from the
size of the lookup table). If we encounter a file where this is too
small, we'll look past the end of the chunk (and possibly even off the
mapped memory).

We can fix this by checking the size up front when we record the
pointer.

The included test doesn't segfault, since it ends up reading bytes
from another chunk. But it produces nonsense results, since the values
it reads are garbage. Our test notices this by comparing the output to a
non-corrupted run of the same command (and of course we also check that
the expected error is printed to stderr).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09 15:55:01 -07:00
Jeff King c9b9fefc13 midx: enforce chunk alignment on reading
The midx reader assumes chunks are aligned to a 4-byte boundary: we
treat the fanout chunk as an array of uint32_t, indexing it to feed the
results to ntohl(). Without aligning the chunks, we may violate the
CPU's alignment constraints. Though many platforms allow this, some do
not. And certanily UBSan will complain, since it is undefined behavior.

Even though most chunks are naturally 4-byte-aligned (because they are
storing uint32_t or larger types), PNAM is not. It stores NUL-terminated
pack names, so you can have a valid chunk with any length. The writing
side handles this by 4-byte-aligning the chunk, introducing a few extra
NULs as necessary. But since we don't check this on the reading side, we
may end up with a misaligned fanout and trigger the undefined behavior.

We have two options here:

  1. Swap out ntohl(fanout[i]) for get_be32(fanout+i) everywhere. The
     latter handles alignment itself. It's possible that it's slightly
     slower (though in practice I'm not sure how true that is,
     especially for these code paths which then go on to do a binary
     search).

  2. Enforce the alignment when reading the chunks. This is easy to do,
     since the table-of-contents reader can check it in one spot.

I went with the second option here, just because it places less burden
on maintenance going forward (it is OK to continue using ntohl), and we
know it can't have any performance impact on the actual reads.

The commit-graph code uses the same chunk API. It's usually also 4-byte
aligned, but some chunks are not (like Bloom filter BDAT chunks). So
we'll pass "1" here to allow any alignment. It doesn't suffer from the
same problem as midx with its fanout because the fanout chunk is always
the first (and the rest of the format dictates that the first chunk will
start aligned).

The new test shows the effect on a midx with a misaligned PNAM chunk.
Note that the midx-reading code treats chunk-toc errors as soft, falling
back to the non-midx path rather than calling die(), as we do for other
parsing errors. Arguably we should make all of these behave the same,
but that's out of scope for this patch. For now the test just expects
the fallback behavior.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09 15:55:01 -07:00
Jeff King 4169d89645 commit-graph: check consistency of fanout table
We use bsearch_hash() to look up items in the oid index of a
commit-graph. It also has a fanout table to reduce the initial range in
which we'll search. But since the fanout comes from the on-disk file, a
corrupted or malicious file can cause us to look outside of the
allocated index memory.

One solution here would be to pass the total table size to
bsearch_hash(), which could then bounds check the values it reads from
the fanout. But there's an inexpensive up-front check we can do, and
it's the same one used by the midx and pack idx code (both of which
likewise have fanout tables and use bsearch_hash(), but are not affected
by this bug):

  1. We can check the value of the final fanout entry against the size
     of the table we got from the index chunk. These must always match,
     since the fanout is just slicing up the index.

       As a side note, the midx and pack idx code compute it the other
       way around: they use the final fanout value as the object count, and
       check the index size against it. Either is valid; if they
       disagree we cannot know which is wrong (a corrupted fanout value,
       or a too-small table of oids).

  2. We can quickly scan the fanout table to make sure it is
     monotonically increasing. If it is, then we know that every value
     is less than or equal to the final value, and therefore less than
     or equal to the table size.

     It would also be sufficient to just check that each fanout value is
     smaller than the final one, but the midx and pack idx code both do
     a full monotonicity check. It's the same cost, and it catches some
     other corruptions (though not all; the checks done by "commit-graph
     verify" are more complete but more expensive, and our goal here is
     to be fast and memory-safe).

There are two new tests. One just checks the final fanout value (this is
the mirror image of the "too small oid lookup" case added for the midx
in the previous commit; it's flipped here because commit-graph considers
the oid lookup chunk to be the source of truth).

The other actually creates a fanout with many out-of-bounds entries, and
prior to this patch, it does cause the segfault you'd expect. But note
that the error is not "your fanout entry is out-of-bounds", but rather
"fanout value out of order". That's because we leave the final fanout
value in place (to get past the table size check), making the index
non-monotonic (the second-to-last entry is big, but the last one must
remain small to match the actual table).

We need adjustments to a few existing tests, as well:

  - an earlier test in t5318 corrupts the fanout and runs "commit-graph
    verify". Its message is now changed, since we catch the problem
    earlier (during the load step, rather than the careful validation
    step).

  - in t5324, we test that "commit-graph verify --shallow" does not do
    expensive verification on the base file of the chain. But the
    corruption it uses (munging a byte at offset 1000) happens to be in
    the middle of the fanout table. And now we detect that problem in
    the cheaper checks that are performed for every part of the graph.
    We'll push this back to offset 1500, which is only caught by the
    more expensive checksum validation.

    Likewise, there's a later test in t5324 which munges an offset 100
    bytes into a file (also in the fanout table) that is referenced by
    an alternates file. So we now find that corruption during the load
    step, rather than the verification step. At the very least we need
    to change the error message (like the case above in t5318). But it
    is probably good to make sure we handle all parts of the
    verification even for alternate graph files. So let's likewise
    corrupt byte 1500 and make sure we found the invalid checksum.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09 15:55:00 -07:00
Jeff King 52e2e8d43d commit-graph: check size of oid fanout chunk
We load the oid fanout chunk with pair_chunk(), which means we never see
the size of the chunk. We just assume the on-disk file uses the
appropriate size, and if it's too small we'll access random memory.

It's easy to check this up-front; the fanout always consists of 256
uint32's, since it is a fanout of the first byte of the hash pointing
into the oid index. These parameters can't be changed without
introducing a new chunk type.

This matches the similar check in the midx OIDF chunk (but note that
rather than checking for the error immediately, the graph code just
leaves parts of the struct NULL and checks for required fields later).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09 15:55:00 -07:00
Jeff King 570b8b8836 chunk-format: note that pair_chunk() is unsafe
The pair_chunk() function is provided as an easy helper for parsing
chunks that just want a pointer to a set of bytes. But every caller has
a hidden bug: because we return only the pointer without the matching
chunk size, the callers have no clue how many bytes they are allowed to
look at. And as a result, they may read off the end of the mmap'd data
when the on-disk file does not match their expectations.

Since chunk files are typically used for local-repository data like
commit-graph files and midx's, the security implications here are pretty
mild. The worst that can happen is that you hand somebody a corrupted
repository tarball, and running Git on it does an out-of-bounds read and
crashes. So it's worth being more defensive, but we don't need to drop
everything and fix every caller immediately.

I noticed the problem because the pair_chunk_fn() callback does not look
at its chunk_size argument, and wanted to annotate it to silence
-Wunused-parameter. We could do that now, but we'd lose the hint that
this code should be audited and fixed.

So instead, let's set ourselves up for going down that path:

  1. Provide a pair_chunk() function that does return the size, which
     prepares us for fixing these cases.

  2. Rename the existing function to pair_chunk_unsafe(). That gives us
     an easy way to grep for cases which still need to be fixed, and the
     name should cause anybody adding new calls to think twice before
     using it.

There are no callers of the "safe" version yet, but we'll add some in
subsequent patches.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09 15:55:00 -07:00
Jeff King d9c84c6d67 commit-graph: free write-context base_graph_name during cleanup
Commit 6c622f9f0b (commit-graph: write commit-graph chains, 2019-06-18)
added a base_graph_name string to the write_commit_graph_context struct.
But the end-of-function cleanup forgot to free it, causing a leak.

This (presumably in combination with the preceding leak-fixes) lets us
mark t5328 as leak-free.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-03 14:28:24 -07:00
Jeff King 274bfa7f28 commit-graph: free write-context entries before overwriting
When writing a split graph file, we replace the final element of the
commit_graph_hash_after and commit_graph_filenames_after arrays. But
since these are allocated strings, we need to free them before
overwriting to avoid leaking the old string.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-03 14:28:24 -07:00
Jeff King 1d94abfe1e commit-graph: free graph struct that was not added to chain
When reading the graph chain file, we open (and allocate) each
individual slice it mentions and then add them to a linked-list chain.
But if adding to the chain fails (e.g., because the base-graph chunk it
contains didn't match what we expected), we leave the function without
freeing the graph struct that caused the failure, leaking it.

We can fix it by calling free_graph_commit().

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-03 14:28:24 -07:00
Jeff King 991d549f74 commit-graph: delay base_graph assignment in add_graph_to_chain()
When adding a graph to a chain, we do some consistency checks and then
if everything looks good, set g->base_graph to add a link to the chain.
But when we added a new consistency check in 209250ef38 (commit-graph.c:
prevent overflow in add_graph_to_chain(), 2023-07-12), it comes _after_
we've already set g->base_graph. So we might return failure, even though
we actually added to the chain.

This hasn't caused a bug yet, because after failing to add to the chain,
we discard the failed graph struct completely, leaking it. But in order
to fix that, it's important that the struct be in a consistent and
predictable state after the failure.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-03 14:28:24 -07:00
Jeff King 09a75abba4 commit-graph: free all elements of graph chain
When running "commit-graph verify", we call free_commit_graph(). That's
sufficient for the case of a single graph file, but if we loaded a chain
of split graph files, they form a linked list via the base_graph
pointers. We need to free all of them, or we leak all but the first
struct.

We can make this work by teaching free_commit_graph() to walk the
base_graph pointers and free each element. This in turn lets us simplify
close_commit_graph(), which does the same thing by recursion (we cannot
just use close_commit_graph() in "commit-graph verify", as the function
takes a pointer to an object store, and the verify command creates a
single one-off graph struct).

While indenting the code in free_commit_graph() for the loop, I noticed
that setting g->data to NULL is rather pointless, as we free the struct
a few lines later. So I cleaned that up while we're here.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-03 14:28:24 -07:00
Jeff King ac6d45d11f commit-graph: move slab-clearing to close_commit_graph()
When closing and freeing a commit-graph, the main entry point is
close_commit_graph(), which then uses close_commit_graph_one() to
recurse through the base_graph links and free each one.

Commit 957ba814bf (commit-graph: when closing the graph, also release
the slab, 2021-09-08) put the call to clear the slab into the recursive
function, but this is pointless: there's only a single global slab
variable. It works OK in practice because clearing the slab is
idempotent, but it makes the code harder to reason about and refactor.

Move it into the parent function so it's only called once (and there are
no other direct callers of the recursive close_commit_graph_one(), so we
are not hurting them).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-03 14:28:24 -07:00
Jeff King 5f259197ee commit-graph: report incomplete chains during verification
The load_commit_graph_chain_fd_st() function will stop loading chains
when it sees an error. But if it has loaded any graph slice at all, it
will return it. This is a good thing for normal use (we use what data we
can, and this is just an optimization). But it's a bad thing for
"commit-graph verify", which should be careful about finding any
irregularities. We do complain to stderr with a warning(), but the
verify command still exits with a successful return code.

The new tests here cover corruption of both the base and tip slices of
the chain. The corruption of the base file already works (it is the
first file we look at, so when we see the error we return NULL). The
"tip" case is what is fixed by this patch (it complains to stderr but
still returns the base slice).

Likewise the existing tests for corruption of the commit-graph-chain
file itself need to be updated. We already exited non-zero correctly for
the "base" case, but the "tip" case can now do so, too.

Note that this also causes us to adjust a test later in the file that
similarly corrupts a tip (though confusingly the test script calls this
"base"). It checks stderr but erroneously expects the whole "verify"
command to exit with a successful code.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-28 07:00:43 -07:00
Jeff King 7754a565e2 commit-graph: tighten chain size check
When we open a commit-graph-chain file, if it's smaller than a single
entry, we just quietly treat that as ENOENT. That make some sense if the
file is truly zero bytes, but it means that "commit-graph verify" will
quietly ignore a file that contains garbage if that garbage happens to
be short.

Instead, let's only simulate ENOENT when the file is truly empty, and
otherwise return EINVAL. The normal graph-loading routines don't care,
but "commit-graph verify" will notice and complain about the difference.

It's not entirely clear to me that the 0-is-ENOENT case actually happens
in real life, so we could perhaps just eliminate this special-case
altogether. But this is how we've always behaved, so I'm preserving it
in the name of backwards compatibility (though again, it really only
matters for "verify", as the regular routines are happy to load what
they can).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-28 07:00:43 -07:00
Jeff King 8298b54317 commit-graph: check mixed generation validation when loading chain file
In read_commit_graph_one(), we call validate_mixed_generation_chain()
after loading the graph. Even though we don't check the return value,
this has the side effect of clearing the read_generation_data flag,
which is important when working with mixed generation numbers.

But doing this in load_commit_graph_chain_fd_st() makes more sense:

  1. We are calling it even when we did not load a chain at all, which
     is pointless (you cannot have mixed generations in a single file).

  2. For now, all callers load the graph via read_commit_graph_one().
     But the point of factoring out the open/load in the previous commit
     was to let "commit-graph verify" call them separately. So it needs
     to trigger this function as part of the load.

     Without this patch, the mixed-generation tests in t5324 would start
     failing on "git commit-graph verify" calls, once we switch to using
     a separate open/load call there.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-28 07:00:43 -07:00
Jeff King 7ed76b4eb2 commit-graph: factor out chain opening function
The load_commit_graph_chain() function opens the chain file and all of
the slices of graph that it points to. If there is no chain file (which
is a totally normal condition), we return NULL. But if we run into
errors with the chain file or loading the actual graph data, we also
return NULL, and the caller cannot tell the difference.

The caller can check for ENOENT for the unremarkable "no such file"
case. But I'm hesitant to assume that the rest of the function would
never accidentally set errno to ENOENT itself, since it is opening the
slice files (and that would mean the caller fails to notice a real
error).

So let's break this into two functions: one to open the file, and one to
actually load it. This matches the interface we provide for the
non-chain graph file, which will also come in handy in a moment when we
fix some bugs in the "git commit-graph verify" code.

Some notes:

  - I've kept the "1 is good, 0 is bad" return convention (and the weird
    "fd" out-parameter) used by the matching open_commit_graph()
    function and other parts of the commit-graph code. This is unlike
    most of the rest of Git (which would just return the fd, with -1 for
    error), but it makes sense to stay consistent with the adjacent bits
    of the API here.

  - The existing chain loading function will quietly return if the file
    is too small to hold a single entry. I've retained that behavior
    (and explicitly set ENOENT in the opener function) for now, under
    the notion that it's probably valid (though I'd imagine unusual) to
    have an empty chain file.

There are two small behavior changes here, but I think both are strictly
positive:

  1. The original blindly did a stat() before checking if fopen()
     succeeded, meaning we were making a pointless extra stat call.

  2. We now use fstat() to check the file size. The previous code using
     a regular stat() on the pathname meant we could technically race
     with somebody updating the chain file, and end up with a size that
     does not match what we just opened with fopen(). I doubt anybody
     ever hit this in practice, but it may have caused an out-of-bounds
     read.

We'll retain the load_commit_graph_chain() function which does both the
open and reading steps (most existing callers do not care about seeing
errors anyway, since loading commit-graphs is optimistic).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-28 07:00:43 -07:00
Junio C Hamano 25ff15d108 Merge branch 'jk/unused-post-2.42'
Unused parameters to functions are marked as such, and/or removed,
in order to bring us closer to -Wunused-parameter clean.

* jk/unused-post-2.42: (22 commits)
  update-ref: mark unused parameter in parser callbacks
  gc: mark unused descriptors in scheduler callbacks
  bundle-uri: mark unused parameters in callbacks
  fetch: mark unused parameter in ref_transaction callback
  credential: mark unused parameter in urlmatch callback
  grep: mark unused parmaeters in pcre fallbacks
  imap-send: mark unused parameters with NO_OPENSSL
  worktree: mark unused parameters in noop repair callback
  negotiator/noop: mark unused callback parameters
  add-interactive: mark unused callback parameters
  grep: mark unused parameter in output function
  test-trace2: mark unused argv/argc parameters
  trace2: mark unused config callback parameter
  trace2: mark unused us_elapsed_absolute parameters
  stash: mark unused parameter in diff callback
  ls-tree: mark unused parameter in callback
  commit-graph: mark unused data parameters in generation callbacks
  worktree: mark unused parameters in each_ref_fn callback
  pack-bitmap: mark unused parameters in show_object callback
  ref-filter: mark unused parameters in parser callbacks
  ...
2023-09-07 15:06:07 -07:00
Jeff King e1cba404db commit-graph: mark unused data parameters in generation callbacks
The compute_generation_info code uses function pointers to abstract the
get/set generation operations. Some callers don't need the extra void
data pointer, which should be annotated to appease -Wunused-parameter.

Note that we can drop the assignment of the "data" parameter in
compute_generation_numbers(), as we've just shown that neither of the
callbacks it uses will access it. This matches the caller in
ensure_generations_valid(), which already does not bother to set "data".

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-08-29 17:56:24 -07:00
Taylor Blau db6044d762 commit-graph: avoid repeated mixed generation number warnings
When validating that a commit-graph has either all zero, or all non-zero
generation numbers, we emit a warning on both the rising and falling
edge of transitioning between the two.

So if we are unfortunate enough to see a commit-graph which has a
repeating sequence of zero, then non-zero generation numbers, we'll
generate many warnings that contain more or less the same information.

Avoid this by keeping track of a single example for a commit with zero-
and non-zero generation, and emit a single warning at the end of
verification if both are non-NULL.

Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-08-21 16:16:23 -07:00
Jeff King cc9c9a00a5 commit-graph: verify swapped zero/non-zero generation cases
In verify_one_commit_graph(), we have code that complains when a commit
is found with a generation number of zero, and then later with a
non-zero number. It works like this:

  1. When we see an entry with generation zero, we set the
     generation_zero flag to GENERATION_ZERO_EXISTS.

  2. When we later see an entry with a non-zero generation, we complain
     if the flag is GENERATION_ZERO_EXISTS.

There's a matching GENERATION_NUMBER_EXISTS value, which in theory would
be used to find the case that we see the entries in the opposite order:

  1. When we see an entry with a non-zero generation, we set the
     generation_zero flag to GENERATION_NUMBER_EXISTS.

  2. When we later see an entry with a zero generation, we complain if
     the flag is GENERATION_NUMBER_EXISTS.

But that doesn't work; step 2 is implemented, but there is no step 1. We
never use NUMBER_EXISTS at all, and Coverity rightly complains that step
2 is dead code.

We can fix that by implementing that step 1.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-08-21 16:07:33 -07:00
Taylor Blau 868c991155 commit-graph: introduce `commit_graph_generation_from_graph()`
In 2ee11f7261 (commit-graph: return generation from memory, 2023-03-20),
the `commit_graph_generation()` function stopped returning zeros when
asked to locate the generation number of a given commit.

This was done at the time to prepare for a later change which set
generation values in memory, meaning that we could no longer rely on
`graph_pos` alone to tell us whether or not to trust the generation
number returned by this function.

In 2ee11f7261, it was noted that this change only impacted very old
commit-graphs, which were written with all commits having generation
number 0. Indeed, zero is not a valid generation number, so we should
never expect to see that value outside of the aforementioned case.

The test fallout in 2ee11f7261 indicated that we were no longer able to
fsck a specific old case of commit-graph corruption, where we see a
non-zero generation number after having seen a generation number of 0
earlier.

Introduce a variant of `commit_graph_generation()` which behaves like
that function did prior to 2ee11f7261, known as
`commit_graph_generation_from_graph()`. Then use this function in the
context of `verify_one_commit_graph()`, where we only want to trust the
values from the graph.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-08-21 16:07:33 -07:00
Junio C Hamano 4488bb3bed Merge branch 'tb/object-access-overflow-protection'
Various offset computation in the code that accesses the packfiles
and other data in the object layer has been hardened against
arithmetic overflow, especially on 32-bit systems.

* tb/object-access-overflow-protection:
  commit-graph.c: prevent overflow in `verify_commit_graph()`
  commit-graph.c: prevent overflow in `write_commit_graph()`
  commit-graph.c: prevent overflow in `merge_commit_graph()`
  commit-graph.c: prevent overflow in `split_graph_merge_strategy()`
  commit-graph.c: prevent overflow in `load_tree_for_commit()`
  commit-graph.c: prevent overflow in `fill_commit_in_graph()`
  commit-graph.c: prevent overflow in `fill_commit_graph_info()`
  commit-graph.c: prevent overflow in `load_oid_from_graph()`
  commit-graph.c: prevent overflow in add_graph_to_chain()
  commit-graph.c: prevent overflow in `write_commit_graph_file()`
  pack-bitmap.c: ensure that eindex lookups don't overflow
  midx.c: prevent overflow in `fill_included_packs_batch()`
  midx.c: prevent overflow in `write_midx_internal()`
  midx.c: store `nr`, `alloc` variables as `size_t`'s
  midx.c: prevent overflow in `nth_midxed_offset()`
  midx.c: prevent overflow in `nth_midxed_object_oid()`
  midx.c: use `size_t`'s for fanout nr and alloc
  packfile.c: use checked arithmetic in `nth_packed_object_offset()`
  packfile.c: prevent overflow in `load_idx()`
  packfile.c: prevent overflow in `nth_packed_object_id()`
2023-07-25 12:05:23 -07:00
Junio C Hamano 6016ee0a71 Merge branch 'tb/fsck-no-progress'
"git fsck --no-progress" still spewed noise from the commit-graph
subsystem, which has been corrected.

* tb/fsck-no-progress:
  commit-graph.c: avoid duplicated progress output during `verify`
  commit-graph.c: pass progress to `verify_one_commit_graph()`
  commit-graph.c: iteratively verify commit-graph chains
  commit-graph.c: extract `verify_one_commit_graph()`
  fsck: suppress MIDX output with `--no-progress`
  fsck: suppress commit-graph output with `--no-progress`
2023-07-18 07:28:53 -07:00
Taylor Blau 9a25cad7e0 commit-graph.c: prevent overflow in `verify_commit_graph()`
In a similar spirit as previous commits, ensure that we don't overflow
when trying to read an OID out of an existing commit-graph during
verification.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-14 09:32:03 -07:00
Taylor Blau 588af1bfd3 commit-graph.c: prevent overflow in `write_commit_graph()`
In a similar spirit as previous commits, ensure that we don't overflow
when trying to read an existing OID while writing a new commit-graph.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-14 09:32:03 -07:00
Taylor Blau d76e0a744d commit-graph.c: prevent overflow in `merge_commit_graph()`
When merging two commit graphs, ensure that we don't attempt to merge
two graphs which, when combined, have more total commits than the 32-bit
unsigned maximum.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-14 09:32:03 -07:00
Taylor Blau 19565d093d commit-graph.c: prevent overflow in `split_graph_merge_strategy()`
In a similar spirit as previous commits, ensure that we don't overflow
when choosing how to split and merge different layers of the
commit-graph.

In particular, avoid a potential overflow between `size_mult` and
`num_commits`, as well as a potential overflow between the number of
commits currently in the merged graph, and the number of commits in the
graph about to be merged.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-14 09:32:03 -07:00
Taylor Blau 51c31a6408 commit-graph.c: prevent overflow in `load_tree_for_commit()`
In a similar spirit as previous commits, ensure that we don't overflow
when computing an offset into the commit_data chunk when the (relative)
graph position exceeds 2^32-1/GRAPH_DATA_WIDTH.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-14 09:32:03 -07:00
Taylor Blau 50a71c2942 commit-graph.c: prevent overflow in `fill_commit_in_graph()`
In a similar spirit as previous commits, ensure that we don't overflow
when the lex_index of the commit we are trying to fill out exceeds
2^32-1/(g->hash_len+16).

The other hunk touched in this patch is not susceptible to overflow,
since an explicit cast is made to a 64-bit unsigned value. For clarity
and consistency with the rest of the commits in this series, avoid a
tricky to reason about cast, and use `st_mult()` directly.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-14 09:32:03 -07:00
Taylor Blau 2740ed1c76 commit-graph.c: prevent overflow in `fill_commit_graph_info()`
In a similar spirit as previous commits, ensure that we don't overflow
in a few spots within `fill_commit_graph_info()`:

  - First, when computing an offset into the commit data chunk, which
    can occur when the `lex_index` of the item we're looking up exceeds
    2^32-1/GRAPH_DATA_WIDTH.

  - A similar issue when computing the generation date offset for
    commits with `lex_index` greater than 2^32-1/4. Note that in
    practice this will never overflow, since the left-hand operand is
    from calling `sizeof(...)` and is thus already a `size_t`. But wrap
    that in an `st_mult()` to make it clear that we intend to perform
    this computation using 64-bit operands.

  - Finally, a nearly identical issue as above when computing an offset
    into the `generation_data_overflow` chunk.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-14 09:32:03 -07:00
Taylor Blau 0bd8f30a0e commit-graph.c: prevent overflow in `load_oid_from_graph()`
In a similar spirit as previous commits, ensure that we don't overflow
when trying to compute an offset into the `chunk_oid_lookup` table when
the `lex_index` of the item we're trying to look up exceeds
`2^32-1/g->hash_len`.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-14 09:32:03 -07:00
Taylor Blau 209250ef38 commit-graph.c: prevent overflow in add_graph_to_chain()
The commit-graph uses a fanout table with 4-byte entries to store the
number of commits at each shard of the commit-graph. So it is OK to have
a commit graph with as many as 2^32-1 stored commits. But we risk
overflowing any computation which may exceed the 32-bit (unsigned)
maximum when those computations are (incorrectly) performed using 32-bit
operands.

There are a couple of spots in `add_graph_to_chain()` where we could
potentially overflow the result:

  - First, when comparing the list of existing entries in the
    commit-graph chain. It is unlikely that this should ever overflow,
    since it would require having roughly 2^32-1/g->hash_len
    commit-graphs in the chain. But let's guard that computation with a
    `st_mult()` just to be safe.

  - Second, when computing the number of commits in the graph added to
    the front of the chain. This value is also a 32-bit unsigned, but we
    should make sure that it does not grow beyond the maximum value.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-14 09:32:03 -07:00
Taylor Blau 48f3f8cf37 commit-graph.c: prevent overflow in `write_commit_graph_file()`
When writing a commit-graph, we use the chunk-format API to write out
each individual chunk of the commit-graph. Each chunk of the
commit-graph is tracked via a call to `add_chunk()`, along with the
expected size of that chunk.

Similar to an earlier commit which handled the identical issue in the
MIDX machinery, guard against overflow when dealing with a commit-graph
with a large number of entries to avoid corrupting the contents of the
commit-graph itself.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-14 09:32:03 -07:00
Taylor Blau 9281cd07f0 commit-graph.c: avoid duplicated progress output during `verify`
When `git commit-graph verify` was taught how to verify commit-graph
chains in 3da4b609bb (commit-graph: verify chains with --shallow mode,
2019-06-18), it produced one line of progress per layer of the
commit-graph chain.

    $ git.compile commit-graph verify
    Verifying commits in commit graph: 100% (4356/4356), done.
    Verifying commits in commit graph: 100% (131912/131912), done.

This could be somewhat confusing to users, who may wonder why there are
multiple occurrences of "Verifying commits in commit graph".

There are likely good arguments on whether or not there should be
one line of progress output per commit-graph layer. On the one hand, the
existing output shows us verifying each individual layer of the chain.
But on the other hand, the fact that a commit-graph may be stored among
multiple layers is an implementation detail that the caller need not be
aware of.

Clarify this by showing a single progress meter regardless of the number
of layers in the commit-graph chain. After this patch, the output
reflects the logical contents of a commit-graph chain, instead of
showing one line of output per commit-graph layer:

    $ git.compile commit-graph verify
    Verifying commits in commit graph: 100% (136268/136268), done.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Acked-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-10 10:02:45 -07:00
Taylor Blau 7248857b6e commit-graph.c: pass progress to `verify_one_commit_graph()`
This is the final step to prepare for consolidating the output of `git
commit-graph verify`. Instead of having each call to
`verify_one_commit_graph()` initialize its own progress struct, have the
caller pass one in instead.

This patch does not alter the output of `git commit-graph verify`, but
the next commit will consolidate the output.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Acked-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-10 10:02:43 -07:00
Taylor Blau f5facaa465 commit-graph.c: iteratively verify commit-graph chains
Now that we have a function which can verify a single layer of a
commit-graph chain, implement `verify_commit_graph()` in terms of
iterating over commit-graphs along their `->base_graph` pointers.

This further prepares us to consolidate the progress output of `git
commit-graph verify`.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Acked-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-10 10:02:42 -07:00
Taylor Blau eb319d6771 commit-graph.c: extract `verify_one_commit_graph()`
When the `verify_commit_graph()` function was extended to support
commit-graph chains via 3da4b609bb (commit-graph: verify chains with
--shallow mode, 2019-06-18), it did so by recursively calling itself on
each layer of the commit-graph chain.

In practice this poses no issues, since commit-graph chains do not loop,
and there are few enough of them that adding additional frames to the
stack is not a problem.

A future commit will consolidate the progress output from `git
commit-graph verify` when verifying chained commit-graphs to print a
single line instead of one progress meter per commit-graph layer.
Prepare for this by extracting a routine to verify a single layer of a
commit-graph.

Note that `verify_commit_graph()` is still recursive after this patch,
but this will change in the subsequent patch.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Acked-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-10 10:02:41 -07:00
Calvin Wan da9502ff4d treewide: remove unnecessary includes for wrapper.h
Signed-off-by: Calvin Wan <calvinwan@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-05 11:41:59 -07:00
Junio C Hamano a1264a08a1 Merge branch 'en/header-split-cache-h-part-3'
Header files cleanup.

* en/header-split-cache-h-part-3: (28 commits)
  fsmonitor-ll.h: split this header out of fsmonitor.h
  hash-ll, hashmap: move oidhash() to hash-ll
  object-store-ll.h: split this header out of object-store.h
  khash: name the structs that khash declares
  merge-ll: rename from ll-merge
  git-compat-util.h: remove unneccessary include of wildmatch.h
  builtin.h: remove unneccessary includes
  list-objects-filter-options.h: remove unneccessary include
  diff.h: remove unnecessary include of oidset.h
  repository: remove unnecessary include of path.h
  log-tree: replace include of revision.h with simple forward declaration
  cache.h: remove this no-longer-used header
  read-cache*.h: move declarations for read-cache.c functions from cache.h
  repository.h: move declaration of the_index from cache.h
  merge.h: move declarations for merge.c from cache.h
  diff.h: move declaration for global in diff.c from cache.h
  preload-index.h: move declarations for preload-index.c from elsewhere
  sparse-index.h: move declarations for sparse-index.c from cache.h
  name-hash.h: move declarations for name-hash.c from cache.h
  run-command.h: move declarations for run-command.c from cache.h
  ...
2023-06-29 16:43:21 -07:00
Junio C Hamano d9f9f6b358 Merge branch 'ds/disable-replace-refs'
Introduce a mechanism to disable replace refs globally and per
repository.

* ds/disable-replace-refs:
  repository: create read_replace_refs setting
  replace-objects: create wrapper around setting
  repository: create disable_replace_refs()
2023-06-22 16:29:06 -07:00
Elijah Newren a034e9106f object-store-ll.h: split this header out of object-store.h
The vast majority of files including object-store.h did not need dir.h
nor khash.h.  Split the header into two files, and let most just depend
upon object-store-ll.h, while letting the two callers that need it
depend on the full object-store.h.

After this patch:
    $ git grep -h include..object-store | sort | uniq -c
          2 #include "object-store.h"
        129 #include "object-store-ll.h"

Diff best viewed with `--color-moved`.

Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-06-21 13:39:54 -07:00
Elijah Newren c339932bd8 repository: remove unnecessary include of path.h
This also made it clear that several .c files that depended upon path.h
were missing a #include for it; add the missing includes while at it.

Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-06-21 13:39:53 -07:00
Derrick Stolee f1178380ac replace-objects: create wrapper around setting
The 'read_replace_objects' constant is initialized by git_default_config
(if core.useReplaceRefs is disabled) and within setup_git_env (if
GIT_NO_REPLACE_OBJECTS) is set. To ensure that this variable cannot be
set accidentally in other places, wrap it in a replace_refs_enabled()
method.

Since we still assign this global in config.c, we are not able to remove
the global scope of this variable and make it a static within
replace-object.c. This will happen in a later change which will also
prevent the variable from being read before it is initialized.

Centralizing read access to the variable is an important first step.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-06-12 13:34:55 -07:00
Junio C Hamano ccd12a3d6c Merge branch 'en/header-split-cache-h-part-2'
More header clean-up.

* en/header-split-cache-h-part-2: (22 commits)
  reftable: ensure git-compat-util.h is the first (indirect) include
  diff.h: reduce unnecessary includes
  object-store.h: reduce unnecessary includes
  commit.h: reduce unnecessary includes
  fsmonitor: reduce includes of cache.h
  cache.h: remove unnecessary headers
  treewide: remove cache.h inclusion due to previous changes
  cache,tree: move basic name compare functions from read-cache to tree
  cache,tree: move cmp_cache_name_compare from tree.[ch] to read-cache.c
  hash-ll.h: split out of hash.h to remove dependency on repository.h
  tree-diff.c: move S_DIFFTREE_IFXMIN_NEQ define from cache.h
  dir.h: move DTYPE defines from cache.h
  versioncmp.h: move declarations for versioncmp.c functions from cache.h
  ws.h: move declarations for ws.c functions from cache.h
  match-trees.h: move declarations for match-trees.c functions from cache.h
  pkt-line.h: move declarations for pkt-line.c functions from cache.h
  base85.h: move declarations for base85.c functions from cache.h
  copy.h: move declarations for copy.c functions from cache.h
  server-info.h: move declarations for server-info.c functions from cache.h
  packfile.h: move pack_window and pack_entry from cache.h
  ...
2023-05-09 16:45:46 -07:00
Junio C Hamano 0807e57807 Merge branch 'en/header-split-cache-h'
Header clean-up.

* en/header-split-cache-h: (24 commits)
  protocol.h: move definition of DEFAULT_GIT_PORT from cache.h
  mailmap, quote: move declarations of global vars to correct unit
  treewide: reduce includes of cache.h in other headers
  treewide: remove double forward declaration of read_in_full
  cache.h: remove unnecessary includes
  treewide: remove cache.h inclusion due to pager.h changes
  pager.h: move declarations for pager.c functions from cache.h
  treewide: remove cache.h inclusion due to editor.h changes
  editor: move editor-related functions and declarations into common file
  treewide: remove cache.h inclusion due to object.h changes
  object.h: move some inline functions and defines from cache.h
  treewide: remove cache.h inclusion due to object-file.h changes
  object-file.h: move declarations for object-file.c functions from cache.h
  treewide: remove cache.h inclusion due to git-zlib changes
  git-zlib: move declarations for git-zlib functions from cache.h
  treewide: remove cache.h inclusion due to object-name.h changes
  object-name.h: move declarations for object-name.c functions from cache.h
  treewide: remove unnecessary cache.h inclusion
  treewide: be explicit about dependence on mem-pool.h
  treewide: be explicit about dependence on oid-array.h
  ...
2023-04-25 13:56:20 -07:00
Elijah Newren d4a4f9291d commit.h: reduce unnecessary includes
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-24 12:47:33 -07:00
Elijah Newren b6fdc44c84 treewide: remove cache.h inclusion due to object-file.h changes
Signed-off-by: Elijah Newren <newren@gmail.com>
Acked-by: Calvin Wan <calvinwan@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-11 08:52:10 -07:00
Elijah Newren 87bed17907 object-file.h: move declarations for object-file.c functions from cache.h
Signed-off-by: Elijah Newren <newren@gmail.com>
Acked-by: Calvin Wan <calvinwan@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-11 08:52:10 -07:00
Elijah Newren 6f2d743043 treewide: be explicit about dependence on oid-array.h
Signed-off-by: Elijah Newren <newren@gmail.com>
Acked-by: Calvin Wan <calvinwan@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-11 08:52:09 -07:00
Junio C Hamano 6047b28eb7 Merge branch 'en/header-split-cleanup'
Split key function and data structure definitions out of cache.h to
new header files and adjust the users.

* en/header-split-cleanup:
  csum-file.h: remove unnecessary inclusion of cache.h
  write-or-die.h: move declarations for write-or-die.c functions from cache.h
  treewide: remove cache.h inclusion due to setup.h changes
  setup.h: move declarations for setup.c functions from cache.h
  treewide: remove cache.h inclusion due to environment.h changes
  environment.h: move declarations for environment.c functions from cache.h
  treewide: remove unnecessary includes of cache.h
  wrapper.h: move declarations for wrapper.c functions from cache.h
  path.h: move function declarations for path.c functions from cache.h
  cache.h: remove expand_user_path()
  abspath.h: move absolute path functions from cache.h
  environment: move comment_line_char from cache.h
  treewide: remove unnecessary cache.h inclusion from several sources
  treewide: remove unnecessary inclusion of gettext.h
  treewide: be explicit about dependence on gettext.h
  treewide: remove unnecessary cache.h inclusion from a few headers
2023-04-06 13:38:31 -07:00
Junio C Hamano 72871b198f Merge branch 'ab/remove-implicit-use-of-the-repository'
Code clean-up around the use of the_repository.

* ab/remove-implicit-use-of-the-repository:
  libs: use "struct repository *" argument, not "the_repository"
  post-cocci: adjust comments for recent repo_* migration
  cocci: apply the "revision.h" part of "the_repository.pending"
  cocci: apply the "rerere.h" part of "the_repository.pending"
  cocci: apply the "refs.h" part of "the_repository.pending"
  cocci: apply the "promisor-remote.h" part of "the_repository.pending"
  cocci: apply the "packfile.h" part of "the_repository.pending"
  cocci: apply the "pretty.h" part of "the_repository.pending"
  cocci: apply the "object-store.h" part of "the_repository.pending"
  cocci: apply the "diff.h" part of "the_repository.pending"
  cocci: apply the "commit.h" part of "the_repository.pending"
  cocci: apply the "commit-reach.h" part of "the_repository.pending"
  cocci: apply the "cache.h" part of "the_repository.pending"
  cocci: add missing "the_repository" macros to "pending"
  cocci: sort "the_repository" rules by header
  cocci: fix incorrect & verbose "the_repository" rules
  cocci: remove dead rule from "the_repository.pending.cocci"
2023-04-06 13:38:30 -07:00
Junio C Hamano 119e82a515 Merge branch 'ps/ahead-behind-truncation-fix'
Fix unnecessary truncation of generation numbers used in-core.

* ps/ahead-behind-truncation-fix:
  commit-graph: fix truncated generation numbers
2023-04-06 13:38:27 -07:00
Junio C Hamano 7727da99df Merge branch 'ds/ahead-behind'
"git for-each-ref" learns '%(ahead-behind:<base>)' that computes the
distances from a single reference point in the history with bunch
of commits in bulk.

* ds/ahead-behind:
  commit-reach: add tips_reachable_from_bases()
  for-each-ref: add ahead-behind format atom
  commit-reach: implement ahead_behind() logic
  commit-graph: introduce `ensure_generations_valid()`
  commit-graph: return generation from memory
  commit-graph: simplify compute_generation_numbers()
  commit-graph: refactor compute_topological_levels()
  for-each-ref: explicitly test no matches
  for-each-ref: add --stdin option
2023-04-06 13:38:21 -07:00
Junio C Hamano e7dca80692 Merge branch 'ab/remove-implicit-use-of-the-repository' into en/header-split-cache-h
* ab/remove-implicit-use-of-the-repository:
  libs: use "struct repository *" argument, not "the_repository"
  post-cocci: adjust comments for recent repo_* migration
  cocci: apply the "revision.h" part of "the_repository.pending"
  cocci: apply the "rerere.h" part of "the_repository.pending"
  cocci: apply the "refs.h" part of "the_repository.pending"
  cocci: apply the "promisor-remote.h" part of "the_repository.pending"
  cocci: apply the "packfile.h" part of "the_repository.pending"
  cocci: apply the "pretty.h" part of "the_repository.pending"
  cocci: apply the "object-store.h" part of "the_repository.pending"
  cocci: apply the "diff.h" part of "the_repository.pending"
  cocci: apply the "commit.h" part of "the_repository.pending"
  cocci: apply the "commit-reach.h" part of "the_repository.pending"
  cocci: apply the "cache.h" part of "the_repository.pending"
  cocci: add missing "the_repository" macros to "pending"
  cocci: sort "the_repository" rules by header
  cocci: fix incorrect & verbose "the_repository" rules
  cocci: remove dead rule from "the_repository.pending.cocci"
2023-04-04 08:25:52 -07:00
Ævar Arnfjörð Bjarmason 4a93b899c1 libs: use "struct repository *" argument, not "the_repository"
As can easily be seen from grepping in our sources, we had these uses
of "the_repository" in various library code in cases where the
function in question was already getting a "struct repository *"
argument. Let's use that argument instead.

Out of these changes only the changes to "cache-tree.c",
"commit-reach.c", "shallow.c" and "upload-pack.c" would have cleanly
applied before the migration away from the "repo_*()" wrapper macros
in the preceding commits.

The rest aren't new, as we'd previously implicitly refer to
"the_repository", but it's now more obvious that we were doing the
wrong thing all along, and should have used the parameter instead.

The change to change "get_index_format_default(the_repository)" in
"read-cache.c" to use the "r" variable instead should arguably have
been part of [1], or in the subsequent cleanup in [2]. Let's do it
here, as can be seen from the initial code in [3] it's not important
that we use "the_repository" there, but would prefer to always use the
current repository.

This change excludes the "the_repository" use in "upload-pack.c"'s
upload_pack_advertise(), as the in-flight [4] makes that change.

1. ee1f0c242e (read-cache: add index.skipHash config option,
   2023-01-06)
2. 6269f8eaad (treewide: always have a valid "index_state.repo"
   member, 2023-01-17)
3. 7211b9e753 (repo-settings: consolidate some config settings,
   2019-08-13)
4. <Y/hbUsGPVNAxTdmS@coredump.intra.peff.net>

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-28 07:36:46 -07:00
Ævar Arnfjörð Bjarmason afe27c8894 cocci: apply the "packfile.h" part of "the_repository.pending"
Apply the part of "the_repository.pending.cocci" pertaining to
"packfile.h".

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-28 07:36:45 -07:00
Ævar Arnfjörð Bjarmason ecb5091fd4 cocci: apply the "commit.h" part of "the_repository.pending"
Apply the part of "the_repository.pending.cocci" pertaining to
"commit.h".

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-28 07:36:45 -07:00
Patrick Steinhardt d3af1c193d commit-graph: fix truncated generation numbers
In 80c928d947 (commit-graph: simplify compute_generation_numbers(),
2023-03-20), the code to compute generation numbers was simplified to
use the same infrastructure as is used to compute topological levels.
This refactoring introduced a bug where the generation numbers are
truncated when they exceed UINT32_MAX because we explicitly cast the
computed generation number to `uint32_t`. This is not required though:
both the computed value and the field of `struct commit_graph_data` are
of the same type `timestamp_t` already, so casting to `uint32_t` will
cause truncation.

This cast can cause us to miscompute generation data overflows:

    1. Given a commit with no parents and committer date
       `UINT32_MAX + 1`.

    2. We compute its generation number as `UINT32_MAX + 1`, but
       truncate it to `1`.

    3. We calculate the generation offset via `$generation - $date`,
       which is thus `1 - (UINT32_MAX + 1)`. The computation underflows
       and we thus end up with an offset that is bigger than the maximum
       allowed offset.

As a result, we'd be writing generation data overflow information into
the commit-graph that is bogus and ultimately not even required.

Fix this bug by removing the needless cast.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-27 10:52:06 -07:00
Elijah Newren ec2f026961 csum-file.h: remove unnecessary inclusion of cache.h
With the change in the last commit to move several functions to
write-or-die.h, csum-file.h no longer needs to include cache.h.
However, removing that include forces several other C files, which
directly or indirectly dependend upon csum-file.h's inclusion of
cache.h, to now be more explicit about their dependencies.

Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-21 10:56:55 -07:00
Elijah Newren d5ebb50dcb wrapper.h: move declarations for wrapper.c functions from cache.h
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-21 10:56:53 -07:00
Elijah Newren f394e093df treewide: be explicit about dependence on gettext.h
Dozens of files made use of gettext functions, without explicitly
including gettext.h.  This made it more difficult to find which files
could remove a dependence on cache.h.  Make C files explicitly include
gettext.h if they are using it.

However, while compat/fsmonitor/fsm-ipc-darwin.c should also gain an
include of gettext.h, it was left out to avoid conflicting with an
in-flight topic.

Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-21 10:56:51 -07:00
Taylor Blau c08645b353 commit-graph: introduce `ensure_generations_valid()`
Use the just-introduced compute_reachable_generation_numbers_1() to
implement a function which dynamically computes topological levels (or
corrected commit dates) for out-of-graph commits.

This will be useful for the ahead-behind algorithm we are about to
introduce, which needs accurate topological levels on _all_ commits
reachable from the tips in order to avoid over-counting.

Co-authored-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-20 12:17:33 -07:00
Derrick Stolee 2ee11f7261 commit-graph: return generation from memory
The commit_graph_generation() method used to report a value of
GENERATION_NUMBER_INFINITY if the commit_graph_data_slab had an instance
for the given commit but the graph_pos indicated the commit was not in
the commit-graph file.

However, an upcoming change will introduce the ability to set generation
values in-memory without writing the commit-graph file. Thus, we can no
longer trust 'graph_pos' to indicate whether or not the generation
member can be trusted.

Instead, trust the 'generation' member if the commit has a value in the
slab _and_ the 'generation' member is non-zero. Otherwise, treat it as
GENERATION_NUMBER_INFINITY.

This only makes a difference for a very old case for the commit-graph:
the very first Git release to write commit-graph files wrote zeroes in
the topological level positions. If we are parsing a commit-graph with
all zeroes, those commits will now appear to have
GENERATION_NUMBER_INFINITY (as if they were not parsed from the
commit-graph).

I attempted several variations to work around the need for providing an
uninitialized 'generation' member, but this was the best one I found. It
does require a change to a verification test in t5318 because it reports
a different error than the one about non-zero generation numbers.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-20 12:17:33 -07:00
Derrick Stolee 80c928d947 commit-graph: simplify compute_generation_numbers()
The previous change introduced the generic algorithm
compute_reachable_generation_numbers() and used it as the core
functionality of compute_topological_levels(). Now, use it as the core
functionality of compute_generation_numbers().

The main difference here is that we use generation version 2, which is
used in to toggle the logic in compute_generation_from_max() for
computing the corrected commit date based on the corrected commit dates
of the parent commits (and the commit date of the current commit). It
also uses different methods for (get|set)_generation in the vtable in
order to store and access the value in the correct places.

Co-authored-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-20 12:17:33 -07:00
Derrick Stolee 368d19b0b7 commit-graph: refactor compute_topological_levels()
This patch extracts the common code used to compute topological levels
and corrected committer dates into a common routine,
compute_reachable_generation_numbers(). For ease of reading, it only
modifies compute_topological_levels() to use this new routine, leaving
compute_generation_numbers() to be modified in the next change.

This new routine dispatches to call the necessary functions to get and
set the generation number for a given commit through a vtable (the
compute_generation_info struct).

Computing the generation number itself is done in
compute_generation_from_max(), which dispatches its implementation based
on the generation version requested, or issuing a BUG() for unrecognized
generation versions. This does not use a vtable because the logic
depends only on the generation number version, not where the data is
being loaded from or being stored to. This is a subtle point that will
make more sense in a future change that modifies the in-memory
generation values instead of just preparing values for writing to a
commit-graph file.

This change looks like it adds a lot of new code. However, two upcoming
changes will be quite small due to the work being done in this change.

Co-authored-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-20 12:17:33 -07:00
Elijah Newren 41771fa435 cache.h: remove dependence on hex.h; make other files include it explicitly
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-02-23 17:25:29 -08:00
René Scharfe 6e57841096 use DUP_ARRAY
Add a semantic patch for replace ALLOC_ARRAY+COPY_ARRAY with DUP_ARRAY
to reduce code duplication and apply its results.

Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-01-09 13:28:36 +09:00
Junio C Hamano 3239100b5a Merge branch 'ml/commit-graph-expire-dir-leak-fix'
A result from opendir() was leaking in the commit-graph expiration
codepath, which has been plugged.

* ml/commit-graph-expire-dir-leak-fix:
  commit-graph: Fix missing closedir in expire_commit_graphs
2022-09-21 14:23:14 -07:00
Miaoqian Lin 12f1ae5324 commit-graph: Fix missing closedir in expire_commit_graphs
The function calls opendir() but missing the corresponding
closedir() before exit the function.
Add missing closedir() to fix it.

Signed-off-by: Miaoqian Lin <linmq006@gmail.com>
Reviewed-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-09-19 10:42:38 -07:00
Junio C Hamano dd407f1c7c Merge branch 'ab/unused-annotation'
Undoes 'jk/unused-annotation' topic and redoes it to work around
Coccinelle rules misfiring false positives in unrelated codepaths.

* ab/unused-annotation:
  git-compat-util.h: use "deprecated" for UNUSED variables
  git-compat-util.h: use "UNUSED", not "UNUSED(var)"
2022-09-14 12:56:39 -07:00
Junio C Hamano a6b42ec0c6 Merge branch 'jk/unused-annotation'
Annotate function parameters that are not used (but cannot be
removed for structural reasons), to prepare us to later compile
with -Wunused warning turned on.

* jk/unused-annotation:
  is_path_owned_by_current_uid(): mark "report" parameter as unused
  run-command: mark unused async callback parameters
  mark unused read_tree_recursive() callback parameters
  hashmap: mark unused callback parameters
  config: mark unused callback parameters
  streaming: mark unused virtual method parameters
  transport: mark bundle transport_options as unused
  refs: mark unused virtual method parameters
  refs: mark unused reflog callback parameters
  refs: mark unused each_ref_fn parameters
  git-compat-util: add UNUSED macro
2022-09-14 12:56:39 -07:00
Junio C Hamano 655e494047 Merge branch 'jk/rev-list-verify-objects-fix'
"git rev-list --verify-objects" ought to inspect the contents of
objects and notice corrupted ones, but it didn't when the commit
graph is in use, which has been corrected.

* jk/rev-list-verify-objects-fix:
  rev-list: disable commit graph with --verify-objects
  lookup_commit_in_graph(): use prepare_commit_graph() to check for graph
2022-09-13 11:38:24 -07:00
Jeff King d6045294a9 lookup_commit_in_graph(): use prepare_commit_graph() to check for graph
We exit early from lookup_commit_in_graph() if the commit_graph pointer
is NULL, under the assumption that we don't have a graph to look at. But
the graph pointer is lazy-loaded; if no other code happens to have
called prepare_commit_graph(), we'll incorrectly assume that one isn't
available at all.

This has a pretty small performance impact in practice, because the
fallback will generally be to call parse_object() instead. That ends up
in parse_commit_buffer(), which loads the graph data itself. So the
first commit we see won't use the graph, but subsequent ones will. Since
using the graph is just an optimization there's generally no
user-visible difference, but if you instrument rev-list like so:

  diff --git a/revision.c b/revision.c
  index ee702e498a..63c488ffb6 100644
  --- a/revision.c
  +++ b/revision.c
  @@ -381,6 +381,9 @@ static struct object *get_reference(struct rev_info *revs, const char *name,
           * parsing commit data from disk.
           */
          commit = lookup_commit_in_graph(revs->repo, oid);
  +       warning("%s %s in commit graph",
  +               commit ? "found" : "did not find",
  +               name);
          if (commit)
                  object = &commit->object;
          else

and run (in git.git):

  git commit-graph write --reachable
  git rev-list origin/master origin/next >/dev/null

you'll see that we fail to find the first one:

  warning: did not find origin/master in commit graph
  warning: found origin/next in commit graph

After this patch, you'll see that we find both:

  warning: found origin/master in commit graph
  warning: found origin/next in commit graph

Even though the performance implication is small here, there are two
important reasons to do this:

  - it's downright confusing if you are hunting a bug triggered by the
    use of the commit graph. It may or may not trigger depending on the
    number and ordering of tips you ask for.

  - prepare_commit_graph() has other policy logic, too. In particular,
    if we've loaded a commit graph and then disabled the graph via
    disable_commit_graph(), that should take precedence.

    I'm not sure if this can trigger bad behavior in practice. The only
    caller there is upload-pack's deepen_by_rev_list(), which should be
    avoiding the commit graph for its traversal tips, but probably
    wasn't before this patch. Whether you could come up with a case
    where that mattered is unclear. Still, this is obviously the right
    thing to be doing.

Signed-off-by: Jeff King <peff@peff.net>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-09-07 09:44:28 -07:00
Ævar Arnfjörð Bjarmason 5cf88fd8b0 git-compat-util.h: use "UNUSED", not "UNUSED(var)"
As reported in [1] the "UNUSED(var)" macro introduced in
2174b8c75de (Merge branch 'jk/unused-annotation' into next,
2022-08-24) breaks coccinelle's parsing of our sources in files where
it occurs.

Let's instead partially go with the approach suggested in [2] of
making this not take an argument. As noted in [1] "coccinelle" will
ignore such tokens in argument lists that it doesn't know about, and
it's less of a surprise to syntax highlighters.

This undoes the "help us notice when a parameter marked as unused is
actually use" part of 9b24034754 (git-compat-util: add UNUSED macro,
2022-08-19), a subsequent commit will further tweak the macro to
implement a replacement for that functionality.

1. https://lore.kernel.org/git/220825.86ilmg4mil.gmgdl@evledraar.gmail.com/
2. https://lore.kernel.org/git/220819.868rnk54ju.gmgdl@evledraar.gmail.com/

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-09-01 10:49:48 -07:00
Jeff King 63e14ee2d6 refs: mark unused each_ref_fn parameters
Functions used with for_each_ref(), etc, need to conform to the
each_ref_fn interface. But most of them don't need every parameter;
let's annotate the unused ones to quiet -Wunused-parameter.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-08-19 12:18:54 -07:00
Junio C Hamano 042159a509 Merge branch 'tb/commit-graph-genv2-upgrade-fix' into maint
There was a bug in the codepath to upgrade generation information
in commit-graph from v1 to v2 format, which has been corrected.
source: <cover.1657667404.git.me@ttaylorr.com>

* tb/commit-graph-genv2-upgrade-fix:
  commit-graph: fix corrupt upgrade from generation v1 to v2
  commit-graph: introduce `repo_find_commit_pos_in_graph()`
  t5318: demonstrate commit-graph generation v2 corruption
2022-08-10 21:52:35 -07:00
Junio C Hamano 312d5b7429 Merge branch 'hx/lookup-commit-in-graph-fix' into maint
A corner case bug where lazily fetching objects from a promisor
remote resulted in infinite recursion has been corrected.
source: <cover.1656593279.git.hanxin.hx@bytedance.com>

* hx/lookup-commit-in-graph-fix:
  t5330: remove run_with_limited_processses()
  commit-graph.c: no lazy fetch in lookup_commit_in_graph()
2022-08-10 21:52:32 -07:00
Junio C Hamano 37e4bdd5ee Merge branch 'tb/commit-graph-genv2-upgrade-fix'
There was a bug in the codepath to upgrade generation information
in commit-graph from v1 to v2 format, which has been corrected.

* tb/commit-graph-genv2-upgrade-fix:
  commit-graph: fix corrupt upgrade from generation v1 to v2
  commit-graph: introduce `repo_find_commit_pos_in_graph()`
  t5318: demonstrate commit-graph generation v2 corruption
2022-08-03 13:36:08 -07:00
Junio C Hamano 36d7bd19cf Merge branch 'js/commit-graph-parsing-without-repo-settings'
API tweak to make it easier to run fuzz testing on commit-graph parser.

* js/commit-graph-parsing-without-repo-settings:
  commit-graph: pass repo_settings instead of repository
2022-07-27 09:16:52 -07:00
Junio C Hamano 99c0d94eaa Merge branch 'hx/lookup-commit-in-graph-fix'
A corner case bug where lazily fetching objects from a promisor
remote resulted in infinite recursion has been corrected.

* hx/lookup-commit-in-graph-fix:
  t5330: remove run_with_limited_processses()
  commit-graph.c: no lazy fetch in lookup_commit_in_graph()
2022-07-19 16:40:16 -07:00
Taylor Blau 7805360b7a commit-graph: introduce `repo_find_commit_pos_in_graph()`
Low-level callers in systems that are adjacent to the commit-graph (like
the changed-path Bloom filter code) could benefit from being able to
call a function like `parse_commit_in_graph()` without modifying the
corresponding commit slab data.

This is useful in contexts where that slab data is being used to prepare
for an upcoming commit-graph write, where Git must be careful to avoid
clobbering any of that data during a read operation.

Introduce a low-level variant of `parse_commit_in_graph()` which returns
the graph position of a given commit only, without modifying any of the
slab data.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-07-15 16:51:39 -07:00
Taylor Blau a92d8523ce commit-graph: pass repo_settings instead of repository
The parse_commit_graph() function takes a 'struct repository *' pointer,
but it only ever accesses config settings (either directly or through
the .settings field of the repo struct). Move all relevant config
settings into the repo_settings struct, and update parse_commit_graph()
and its existing callers so that it takes 'struct repo_settings *'
instead.

Callers of parse_commit_graph() will now need to call
prepare_repo_settings() themselves, or initialize a 'struct
repo_settings' directly.

Prior to ab14d0676c (commit-graph: pass a 'struct repository *' in more
places, 2020-09-09), parsing a commit-graph was a pure function
depending only on the contents of the commit-graph itself. Commit
ab14d0676c introduced a dependency on a `struct repository` pointer, and
later commits such as b66d84756f (commit-graph: respect
'commitGraph.readChangedPaths', 2020-09-09) added dependencies on config
settings, which were accessed through the `settings` field of the
repository pointer. This field was initialized via a call to
`prepare_repo_settings()`.

Additionally, this fixes an issue in fuzz-commit-graph: In 44c7e62
(2021-12-06, repo-settings:prepare_repo_settings only in git repos),
prepare_repo_settings was changed to issue a BUG() if it is called by a
process whose CWD is not a Git repository.

The combination of commits mentioned above broke fuzz-commit-graph,
which attempts to parse arbitrary fuzzing-engine-provided bytes as a
commit graph file. Prior to this change, parse_commit_graph() called
prepare_repo_settings(), but since we run the fuzz tests without a valid
repository, we are hitting the BUG() from 44c7e62 for every test case.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Josh Steadmon <steadmon@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-07-14 15:42:17 -07:00
Han Xin 3a1ea94a49 commit-graph.c: no lazy fetch in lookup_commit_in_graph()
The commit-graph is used to opportunistically optimize accesses to
certain pieces of information on commit objects, and
lookup_commit_in_graph() tries to say "no" when the requested commit
does not locally exist by returning NULL, in which case the caller
can ask for (which may result in on-demand fetching from a promisor
remote) and parse the commit object itself.

However, it uses a wrong helper, repo_has_object_file(), to do so.
This helper not only checks if an object is mmediately available in
the local object store, but also tries to fetch from a promisor remote.
But the fetch machinery calls lookup_commit_in_graph(), thus causing an
infinite loop.

We should make lookup_commit_in_graph() expect that a commit given to it
can be legitimately missing from the local object store, by using the
has_object_file() helper instead.

Signed-off-by: Han Xin <hanxin.hx@bytedance.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-06-30 23:00:32 -07:00
Junio C Hamano a50036da1a Merge branch 'tb/cruft-packs'
A mechanism to pack unreachable objects into a "cruft pack",
instead of ejecting them into loose form to be reclaimed later, has
been introduced.

* tb/cruft-packs:
  sha1-file.c: don't freshen cruft packs
  builtin/gc.c: conditionally avoid pruning objects via loose
  builtin/repack.c: add cruft packs to MIDX during geometric repack
  builtin/repack.c: use named flags for existing_packs
  builtin/repack.c: allow configuring cruft pack generation
  builtin/repack.c: support generating a cruft pack
  builtin/pack-objects.c: --cruft with expiration
  reachable: report precise timestamps from objects in cruft packs
  reachable: add options to add_unseen_recent_objects_to_traversal
  builtin/pack-objects.c: --cruft without expiration
  builtin/pack-objects.c: return from create_object_entry()
  t/helper: add 'pack-mtimes' test-tool
  pack-mtimes: support writing pack .mtimes files
  chunk-format.h: extract oid_version()
  pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
  pack-mtimes: support reading .mtimes files
  Documentation/technical: add cruft-packs.txt
2022-06-03 14:30:37 -07:00
Taylor Blau d9fef9d90d chunk-format.h: extract oid_version()
There are three definitions of an identical function which converts
`the_hash_algo` into either 1 (for SHA-1) or 2 (for SHA-256). There is a
copy of this function for writing both the commit-graph and
multi-pack-index file, and another inline definition used to write the
.rev header.

Consolidate these into a single definition in chunk-format.h. It's not
clear that this is the best header to define this function in, but it
should do for now.

(Worth noting, the .rev caller expects a 4-byte unsigned, but the other
two callers work with a single unsigned byte. The consolidated version
uses the latter type, and lets the compiler widen it when required).

Another caller will be added in a subsequent patch.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-05-26 15:48:26 -07:00
Junio C Hamano 1b8138fb08 Merge branch 'ab/valgrind-fixes'
A bit of test framework fixes with a few fixes to issues found by
valgrind.

* ab/valgrind-fixes:
  commit-graph.c: don't assume that stat() succeeds
  object-file: fix a unpack_loose_header() regression in 3b6a8db3b0
  log test: skip a failing mkstemp() test under valgrind
  tests: using custom GIT_EXEC_PATH breaks --valgrind tests
2022-05-23 14:39:54 -07:00
Junio C Hamano 538dc459a0 Merge branch 'ep/maint-equals-null-cocci'
Introduce and apply coccinelle rule to discourage an explicit
comparison between a pointer and NULL, and applies the clean-up to
the maintenance track.

* ep/maint-equals-null-cocci:
  tree-wide: apply equals-null.cocci
  tree-wide: apply equals-null.cocci
  contrib/coccinnelle: add equals-null.cocci
2022-05-20 15:26:59 -07:00
Ævar Arnfjörð Bjarmason 7c898554d7 commit-graph.c: don't assume that stat() succeeds
Fix code added in 8d84097f96 (commit-graph: expire commit-graph
files, 2019-06-18) to check the return value of the stat() system
call. Not doing so caused us to use uninitialized memory in the "Bloom
generation is limited by --max-new-filters" test in
t4216-log-bloom.sh:

	+ rm -f trace.event
	+ pwd
	+ GIT_TRACE2_EVENT=[...]/t/trash directory.t4216-log-bloom/limits/trace.event git commit-graph write --reachable --split=replace --changed-paths --max-new-filters=2
	==24835== Syscall param utimensat(times[0].tv_sec) points to uninitialised byte(s)
	==24835==    at 0x499E65A: __utimensat64_helper (utimensat.c:34)
	==24835==    by 0x4999142: utime (utime.c:36)
	==24835==    by 0x552BE0: mark_commit_graphs (commit-graph.c:2213)
	==24835==    by 0x550822: write_commit_graph (commit-graph.c:2424)
	==24835==    by 0x54E3A0: write_commit_graph_reachable (commit-graph.c:1681)
	==24835==    by 0x4374BB: graph_write (commit-graph.c:269)
	==24835==    by 0x436F7D: cmd_commit_graph (commit-graph.c:326)
	==24835==    by 0x407B9A: run_builtin (git.c:465)
	==24835==    by 0x406651: handle_builtin (git.c:719)
	==24835==    by 0x407575: run_argv (git.c:786)
	==24835==    by 0x406410: cmd_main (git.c:917)
	==24835==    by 0x511F09: main (common-main.c:56)
	==24835==  Address 0x1ffeffde70 is on thread 1's stack
	==24835==  in frame #1, created by utime (utime.c:25)
	==24835==  Uninitialised value was created by a stack allocation
	==24835==    at 0x552B50: mark_commit_graphs (commit-graph.c:2201)
	==24835==
	[...]
	error: last command exited with $?=126
	not ok 137 - Bloom generation is limited by --max-new-filters

This would happen as we stat'd the non-existing
".git/objects/info/commit-graph" file. Let's fix mark_commit_graphs()
to check the stat()'s return value, and while we're at it fix another
case added in the same commit to do the same.

The caller in expire_commit_graphs() would have been less likely to
run into this, as it's operating on files it just got from readdir(),
but it could still happen due to a race with e.g. a concurrent "rm
-rf" of the commit-graph files.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-05-12 15:42:26 -07:00
Junio C Hamano 2b0a58d164 Merge branch 'ep/maint-equals-null-cocci' for maint-2.35
* ep/maint-equals-null-cocci:
  tree-wide: apply equals-null.cocci
  contrib/coccinnelle: add equals-null.cocci
2022-05-02 10:06:04 -07:00
Junio C Hamano afe8a9070b tree-wide: apply equals-null.cocci
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-05-02 09:50:37 -07:00
Kleber Tarcísio c0befa0c03 commit-graph: close file before returning NULL
There are two reasons that we could return NULL early within
load_commit_graph_chain():

 1. The file does not exist, so the file pointer is NULL.
 2. The file exists, but is too small to contain a single hash.

These were grouped together when the function was first written in
5c84b3396 (commit-graph: load commit-graph chains, 2019-06-18) in order
to simplify how the 'chain_name' string is freed. However, the current
code leaves a narrow window where the file pointer is not closed when
the file exists, but is rejected for being too small.

Split out these cases separately to ensure we close the file in this
case.

Signed-off-by: Kleber Tarcísio <klebertarcisio@yahoo.com.br>
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-04-20 13:56:24 -07:00
Junio C Hamano eb804cd405 Merge branch 'ns/core-fsyncmethod'
Replace core.fsyncObjectFiles with two new configuration variables,
core.fsync and core.fsyncMethod.

* ns/core-fsyncmethod:
  core.fsync: documentation and user-friendly aggregate options
  core.fsync: new option to harden the index
  core.fsync: add configuration parsing
  core.fsync: introduce granular fsync control infrastructure
  core.fsyncmethod: add writeout-only mode
  wrapper: make inclusion of Windows csprng header tightly scoped
2022-03-25 16:38:24 -07:00
Junio C Hamano 38bbb9e990 Merge branch 'ab/string-list-count-in-size-t'
Count string_list items in size_t, not "unsigned int".

* ab/string-list-count-in-size-t:
  string-list API: change "nr" and "alloc" to "size_t"
  gettext API users: don't explicitly cast ngettext()'s "n"
2022-03-16 17:53:09 -07:00
Junio C Hamano a54cc523ad Merge branch 'ds/commit-graph-gen-v2-fixes'
Fixes to the way generation number v2 in the commit-graph files are
(not) handled.

* ds/commit-graph-gen-v2-fixes:
  commit-graph: declare bankruptcy on GDAT chunks
  commit-graph: fix generation number v2 overflow values
  commit-graph: start parsing generation v2 (again)
  commit-graph: fix ordering bug in generation numbers
  t5318: extract helpers to lib-commit-graph.sh
  test-read-graph: include extra post-parse info
2022-03-16 17:53:09 -07:00
Neeraj Singh 020406eaa5 core.fsync: introduce granular fsync control infrastructure
This commit introduces the infrastructure for the core.fsync
configuration knob. The repository components we want to sync
are identified by flags so that we can turn on or off syncing
for specific components.

If core.fsyncObjectFiles is set and the core.fsync configuration
also includes FSYNC_COMPONENT_LOOSE_OBJECT, we will fsync any
loose objects. This picks the strictest data integrity behavior
if core.fsync and core.fsyncObjectFiles are set to conflicting values.

This change introduces the currently unused fsync_component
helper, which will be used by a later patch that adds fsyncing to
the refs backend.

Actual configuration and documentation of the fsync components
list are in other patches in the series to separate review of
the underlying mechanism from the policy of how it's configured.

Helped-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-10 15:10:22 -08:00
Ævar Arnfjörð Bjarmason 99d60545f8 string-list API: change "nr" and "alloc" to "size_t"
Change the "nr" and "alloc" members of "struct string_list" to use
"size_t" instead of "nr". On some platforms the size of an "unsigned
int" will be smaller than a "size_t", e.g. a 32 bit unsigned v.s. 64
bit unsigned. As "struct string_list" is a generic API we use in a lot
of places this might cause overflows.

As one example: code in "refs.c" keeps track of the number of refs
with a "size_t", and auxiliary code in builtin/remote.c in
get_ref_states() appends those to a "struct string_list".

While we're at it split the "nr" and "alloc" in string-list.h across
two lines, which is the case for most such struct member
declarations (e.g. in "strbuf.h" and "strvec.h").

Changing e.g. "int i" to "size_t i" in run_and_feed_hook() isn't
strictly necessary, and there are a lot more cases where we'll use a
local "int", "unsigned int" etc. variable derived from the "nr" in the
"struct string_list". But in that case as well as
add_wrapped_shortlog_msg() in builtin/shortlog.c we need to adjust the
printf format referring to "nr" anyway, so let's also change the other
variables referring to it.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-07 12:02:04 -08:00
Derrick Stolee 6dbf4b8172 commit-graph: declare bankruptcy on GDAT chunks
The Generation Data (GDAT) and Generation Data Overflow (GDOV) chunks
store corrected commit date offsets, used for generation number v2.
Recent changes have demonstrated that previous versions of Git were
incorrectly parsing data from these chunks, but might have also been
writing them incorrectly.

I asserted [1] that the previous fixes were sufficient because the known
reasons for incorrectly writing generation number v2 data relied on
parsing the information incorrectly out of a commit-graph file, but the
previous versions of Git were not reading the generation number v2 data.

However, Patrick demonstrated [2] a case where in split commit-graphs
across an alternate boundary (and possibly some other special
conditions) it was possible to have a commit-graph that was generated by
a previous version of Git have incorrect generation number v2 data which
results in errors like the following:

  commit-graph generation for commit <oid> is 1623273624 < 1623273710

[1] https://lore.kernel.org/git/f50e74f0-9ffa-f4f2-4663-269801495ed3@github.com/
[2] https://lore.kernel.org/git/Yh93vOkt2DkrGPh2@ncase/

Clearly, there is something else going on. The situation is not
completely understood, but the errors do not reproduce if the
commit-graphs are all generated by a Git version including these recent
fixes.

If we cannot trust the existing data in the GDAT and GDOV chunks, then
we can alter the format to change the chunk IDs for these chunks. This
causes the new version of Git to silently ignore the older chunks (and
disabling generation number v2 in the process) while writing new
commit-graph files with correct data in the GDA2 and GDO2 chunks.

Update commit-graph-format.txt including a historical note about these
deprecated chunks.

Reported-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-07 09:17:03 -08:00
Ævar Arnfjörð Bjarmason ef3fe21448 lockfile API users: simplify and don't leak "path"
Fix a memory leak in code added in 6c622f9f0b (commit-graph: write
commit-graph chains, 2019-06-18). We needed to free the "lock_name" if
we encounter errors, and the "graph_name" after we'd run unlink() on
it.

For the case of write_commit_graph_file() refactoring the code to free
the "lock_name" after we were done using the "struct lock_file lk"
would have made the control flow more complex. Luckily we can free the
"lock_file" right after the hold_lock_file_for_update() call, if it
makes use of "path" at all it'll have copied its contents to a "struct
strbuf" of its own.

While I'm at it let's fix code added in fb10ca5b54 (sparse-checkout:
write using lockfile, 2019-11-21) in write_patterns_and_update() to
avoid the same complexity that I thought I needed when I wrote the
initial fix for write_commit_graph_file(). We can free the
"sparse_filename" right after calling hold_lock_file_for_update(), we
don't need to wait until we're exiting the function to do so.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-04 13:24:19 -08:00
Ævar Arnfjörð Bjarmason 51a94d8ffe commit-graph: stop fill_oids_from_packs() progress on error and free()
Fix a bug in fill_oids_from_packs(), we should always stop_progress(),
but did not do so if we returned an error here. This also plugs a
memory leak in those cases by releasing the two "struct strbuf"
variables the function uses.

While I'm at it stop hardcoding "-1" here and just use the return
value of error() instead, which happens to be "-1".

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-04 13:24:19 -08:00
Ævar Arnfjörð Bjarmason 4a0479086a commit-graph: fix memory leak in misused string_list API
When this code was migrated to the string_list API in
d88b14b3fd (commit-graph: use string-list API for input, 2018-06-27)
it was made to use used both STRING_LIST_INIT_NODUP and a
strbuf_detach() pattern.

Those should not be used together if string_list_clear() is expected
to free the memory, instead we need to either use STRING_LIST_INIT_DUP
with a string_list_append_nodup(), or a STRING_LIST_INIT_NODUP and
manually fiddle with the "strdup_strings" member before calling
string_list_clear(). Let's do the former.

Since "strdup_strings = 1" is set now other code might be broken by
relying on "pack_indexes" not to duplicate it strings, but that
doesn't happen. When we pass this down to write_commit_graph() that
code uses the "struct string_list" without modifying it. Let's add a
"const" to the variable to have the compiler enforce that assumption.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-04 13:24:18 -08:00
Derrick Stolee c8d67b9a68 commit-graph: fix generation number v2 overflow values
The Generation Data Chunk was implemented and tested in e8b63005c
(commit-graph: implement generation data chunk, 2021-01-16), but the
test was carefully constructed to work on systems with 32-bit dates.
Since the corrected commit date offsets still required more than 31
bits, this triggered writing the generation_data_overflow chunk.

However, upon closer look, the
write_graph_chunk_generation_data_overflow() method writes the offsets
to the chunk (as dictated by the format) but fill_commit_graph_info()
treats the value in the chunk as if it is the full corrected commit date
(not an offset). For some reason, this does not cause an issue when
using the FUTURE_DATE specified in t5318-commit-graph.sh, but it does
show up as a failure in 'git commit-graph verify' if we increase that
FUTURE_DATE to be above four billion.

Fix this error and create a 64-bit timestamp version of the test so we
can test these larger values.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-01 12:15:06 -08:00
Derrick Stolee 3b0199d4c3 commit-graph: start parsing generation v2 (again)
The 'read_generation_data' member of 'struct commit_graph' was
introduced by 1fdc383c5 (commit-graph: use generation v2 only if entire
chain does, 2021-01-16). The intention was to avoid using corrected
commit dates if not all layers of a commit-graph had that data stored.
The logic in validate_mixed_generation_chain() at that point incorrectly
initialized read_generation_data to 1 if and only if the tip
commit-graph contained the Corrected Commit Date chunk.

This was "fixed" in 448a39e65 (commit-graph: validate layers for
generation data, 2021-02-02) to validate that read_generation_data was
either non-zero for all layers, or it would set read_generation_data to
zero for all layers.

The problem here is that read_generation_data is not initialized to be
non-zero anywhere!

This change initializes read_generation_data immediately after the chunk
is parsed, so each layer will have its value present as soon as
possible.

The read_generation_data member is used in fill_commit_graph_info() to
determine if we should use the corrected commit date or the topological
levels stored in the Commit Data chunk. Due to this bug, all previous
versions of Git were defaulting to topological levels in all cases!

This can be measured with some performance tests. Using the Linux kernel
as a testbed, I generated a complete commit-graph containing corrected
commit dates and tested the 'new' version against the previous, 'old'
version.

First, rev-list with --topo-order demonstrates a 26% improvement using
corrected commit dates:

hyperfine \
	-n "old" "$OLD_GIT rev-list --topo-order -1000 v3.6" \
	-n "new" "$NEW_GIT rev-list --topo-order -1000 v3.6" \
	--warmup=10

Benchmark 1: old
  Time (mean ± σ):      57.1 ms ±   3.1 ms
  Range (min … max):    52.9 ms …  62.0 ms    55 runs

Benchmark 2: new
  Time (mean ± σ):      45.5 ms ±   3.3 ms
  Range (min … max):    39.9 ms …  51.7 ms    59 runs

Summary
  'new' ran
    1.26 ± 0.11 times faster than 'old'

These performance improvements are due to the algorithmic improvements
given by walking fewer commits due to the higher cutoffs from corrected
commit dates.

However, this comes at a cost. The additional I/O cost of parsing the
corrected commit dates is visible in case of merge-base commands that do
not reduce the overall number of walked commits.

hyperfine \
        -n "old" "$OLD_GIT merge-base v4.8 v4.9" \
        -n "new" "$NEW_GIT merge-base v4.8 v4.9" \
        --warmup=10

Benchmark 1: old
  Time (mean ± σ):     110.4 ms ±   6.4 ms
  Range (min … max):    96.0 ms … 118.3 ms    25 runs

Benchmark 2: new
  Time (mean ± σ):     150.7 ms ±   1.1 ms
  Range (min … max):   149.3 ms … 153.4 ms    19 runs

Summary
  'old' ran
    1.36 ± 0.08 times faster than 'new'

Performance issues like this are what motivated 702110aac (commit-graph:
use config to specify generation type, 2021-02-25).

In the future, we could fix this performance problem by inserting the
corrected commit date offsets into the Commit Date chunk instead of
having that data in an extra chunk.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-01 12:15:06 -08:00
Derrick Stolee 75979d9460 commit-graph: fix ordering bug in generation numbers
When computing the generation numbers for a commit-graph, we compute
the corrected commit dates and then check if their offsets from the
actual dates is too large to fit in the 32-bit Generation Data chunk.
However, there is a problem with this approach: if we have parsed the
generation data from the previous commit-graph, then we continue the
loop because the corrected commit date is already computed. This causes
an under-count in the number of overflow values.

It is incorrect to add an increment to num_generation_data_overflows
next to this 'continue' statement, because we might start
double-counting commits that are computed because of the depth-first
search walk from a commit with an earlier OID.

Instead, iterate over the full commit list at the end, checking the
offsets to see how many grow beyond the maximum value.

Create a new t5328-commit-graph-64-bit-time.sh test script to handle
special cases of testing 64-bit timestamps. This helps demonstrate this
bug in more cases. It still won't hit all potential cases until the next
change, which reenables reading generation numbers. Use the skip_all
trick from 0a2bfccb9c (t0051: use "skip_all" under !MINGW in
single-test file, 2022-02-04) to make the output clean when run on a
32-bit system.

Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-01 12:14:57 -08:00
Lessley Dennington 0803f9c7cd commit-graph: return if there is no git directory
Return early if git directory does not exist. This will protect against
test failures in the upcoming change to BUG in prepare_repo_settings if no
git directory exists.

Signed-off-by: Lessley Dennington <lessleydennington@gmail.com>
Reviewed-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-12-06 09:55:05 -08:00
Junio C Hamano c042ad5ad5 Merge branch 'js/run-command-close-packs'
The run-command API has been updated so that the callers can easily
ask the file descriptors open for packfiles to be closed immediately
before spawning commands that may trigger auto-gc.

* js/run-command-close-packs:
  Close object store closer to spawning child processes
  run_auto_maintenance(): implicitly close the object store
  run-command: offer to close the object store before running
  run-command: prettify the `RUN_COMMAND_*` flags
  pull: release packs before fetching
  commit-graph: when closing the graph, also release the slab
2021-09-20 15:20:45 -07:00
Junio C Hamano df0c308c1a Merge branch 'ab/progress-users-adjust-counters'
The code to show progress indicator in a few code paths did not
cover between 0-100%, which has been corrected.

* ab/progress-users-adjust-counters:
  entry: show finer-grained counter in "Filtering content" progress line
  commit-graph: fix bogus counter in "Scanning merged commits" progress line
2021-09-20 15:20:41 -07:00
SZEDER Gábor 4011224944 commit-graph: fix bogus counter in "Scanning merged commits" progress line
The final value of the counter of the "Scanning merged commits"
progress line is always one less than its expected total, e.g.:

  Scanning merged commits:  83% (5/6), done.

This happens because while iterating over an array the loop variable
is passed to display_progress() as-is, but while C arrays (and thus
the loop variable) start at 0 and end at N-1, the progress counter
must end at N. Fix this by passing 'i + 1' to display_progress(), like
most other callsites do.

There's an RFC series to add a GIT_TEST_CHECK_PROGRESS=1 mode[1] which
catches this issue in the 'fetch.writeCommitGraph' and
'fetch.writeCommitGraph with submodules' tests in
't5510-fetch.sh'. The GIT_TEST_CHECK_PROGRESS=1 mode is not part of
this series, but future changes to progress.c may add it or similar
assertions to catch this and similar bugs elsewhere.

1. https://lore.kernel.org/git/20210620200303.2328957-1-szeder.dev@gmail.com/

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-09 09:58:19 -07:00
Johannes Schindelin 957ba814bf commit-graph: when closing the graph, also release the slab
The slab has information about the commit graph. That means that it is
meaningless (and even misleading) when the commit graph was closed.

This seems not to matter currently, but we're about to fix a
Windows-specific bug where `git pull` does not close the object store
before fetching (risking that an implicit auto-gc fails to remove the
now-obsolete pack file(s)), and once we have that bug fix in place, it
does matter: after that bug fix, we will open the object store, do some
stuff with it, then close it, fetch, and then open it again, and do more
stuff. If we close the commit graph without releasing the corresponding
slab, we're hit by a symptom like this in t5520.19:

	BUG: commit-reach.c:85: bad generation skip 9223372036854775807
	> 3 at 5cd378271655d43a3b4477520014f02213ad1546

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-08 12:17:14 -07:00
Patrick Steinhardt f559d6d45e revision: avoid hitting packfiles when commits are in commit-graph
When queueing references in git-rev-list(1), we try to optimize parsing
of commits via the commit-graph. To do so, we first look up the object's
type, and if it is a commit we call `repo_parse_commit()` instead of
`parse_object()`. This is quite inefficient though given that we're
always uncompressing the object header in order to determine the type.
Instead, we can opportunistically search the commit-graph for the object
ID: in case it's found, we know it's a commit and can directly fill in
the commit object without having to uncompress the object header.

Expose a new function `lookup_commit_in_graph()`, which tries to find a
commit in the commit-graph by ID, and convert `get_reference()` to use
this function. This provides a big performance win in cases where we
load references in a repository with lots of references pointing to
commits. The following has been executed in a real-world repository with
about 2.2 million refs:

    Benchmark #1: HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev
      Time (mean ± σ):      4.458 s ±  0.044 s    [User: 4.115 s, System: 0.342 s]
      Range (min … max):    4.409 s …  4.534 s    10 runs

    Benchmark #2: HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev
      Time (mean ± σ):      3.089 s ±  0.015 s    [User: 2.768 s, System: 0.321 s]
      Range (min … max):    3.061 s …  3.105 s    10 runs

    Summary
      'HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev' ran
        1.44 ± 0.02 times faster than 'HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev'

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-08-09 09:51:12 -07:00
Patrick Steinhardt 809ea28f80 commit-graph: split out function to search commit position
The function `find_commit_in_graph()` assumes that the caller has passed
an object which was already determined to be a commit given that it will
access the commit's graph position, which is stored in a commit slab. In
a subsequent patch, we want to search for an object ID though without
knowing whether it is a commit or not, which is not currently possible.

Split out the logic to search the commit graph for a given object ID to
prepare for this change. This commit also renames the function to
`find_commit_pos_in_graph()`, which more accurately reflects what this
function does. Furthermore, in order to allow for the searched object ID
to be const, we need to adjust `bsearch_graph()`'s signature to accept a
constant object ID as input, too.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-08-09 09:51:12 -07:00
Junio C Hamano dd6d3c90ee Merge branch 'ab/attribute-format'
Many "printf"-like helper functions we have have been annotated
with __attribute__() to catch placeholder/parameter mismatches.

* ab/attribute-format:
  advice.h: add missing __attribute__((format)) & fix usage
  *.h: add a few missing __attribute__((format))
  *.c static functions: add missing __attribute__((format))
  sequencer.c: move static function to avoid forward decl
  *.c static functions: don't forward-declare __attribute__
2021-07-28 13:17:59 -07:00
Ævar Arnfjörð Bjarmason 48ca53cac4 *.c static functions: add missing __attribute__((format))
Add missing __attribute__((format)) function attributes to various
"static" functions that take printf arguments.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-07-13 15:20:20 -07:00
Taylor Blau 15316a4732 commit-graph: rewrite to use checksum_valid()
Rewrite an existing caller in `git commit-graph verify` to take
advantage of checksum_valid().

Note that the replacement isn't a verbatim cut-and-paste, since the new
function avoids using hashfile at all and instead talks to the_hash_algo
directly, but it is functionally equivalent.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-06-28 20:36:17 -07:00
brian m. carlson 72871b132c commit-graph: don't store file hashes as struct object_id
The idea behind struct object_id is that it is supposed to represent the
identifier of a standard Git object or a special pseudo-object like the
all-zeros object ID.  In this case, we have file hashes, which, while
similar, are distinct from the identifiers of objects.

Switch these code paths to use an unsigned char array.  This is both
more logically consistent and it means that we need not set the
algorithm identifier for the struct object_id.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-27 16:31:39 +09:00
brian m. carlson 92e2cab96b Always use oidread to read into struct object_id
In the future, we'll want oidread to automatically set the hash
algorithm member for an object ID we read into it, so ensure we use
oidread instead of hashcpy everywhere we're copying a hash value into a
struct object_id.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-27 16:31:38 +09:00
Junio C Hamano d20fa3cf9d Merge branch 'ds/commit-graph-generation-config'
A new configuration variable has been introduced to allow choosing
which version of the generation number gets used in the
commit-graph file.

* ds/commit-graph-generation-config:
  commit-graph: use config to specify generation type
  commit-graph: create local repository pointer
2021-03-22 14:00:23 -07:00
René Scharfe ca56dadb4b use CALLOC_ARRAY
Add and apply a semantic patch for converting code that open-codes
CALLOC_ARRAY to use it instead.  It shortens the code and infers the
element size automatically.

Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-13 16:00:09 -08:00
Junio C Hamano 660dd97a62 Merge branch 'ds/chunked-file-api'
The common code to deal with "chunked file format" that is shared
by the multi-pack-index and commit-graph files have been factored
out, to help codepaths for both filetypes to become more robust.

* ds/chunked-file-api:
  commit-graph.c: display correct number of chunks when writing
  chunk-format: add technical docs
  chunk-format: restore duplicate chunk checks
  midx: use 64-bit multiplication for chunk sizes
  midx: use chunk-format read API
  commit-graph: use chunk-format read API
  chunk-format: create read chunk API
  midx: use chunk-format API in write_midx_internal()
  midx: drop chunk progress during write
  midx: return success/failure in chunk write methods
  midx: add num_large_offsets to write_midx_context
  midx: add pack_perm to write_midx_context
  midx: add entries to write_midx_context
  midx: use context in write_midx_pack_names()
  midx: rename pack_info to write_midx_context
  commit-graph: use chunk-format write API
  chunk-format: create chunk format write API
  commit-graph: anonymize data in chunk_write_fn
2021-03-01 14:02:57 -08:00
Junio C Hamano c0b27e3964 Merge branch 'js/commit-graph-warning'
* js/commit-graph-warning:
  Revert "commit-graph: when incompatible with graphs, indicate why"
2021-03-01 09:21:24 -08:00
Junio C Hamano cdc986a7c2 Revert "commit-graph: when incompatible with graphs, indicate why"
This reverts commit c85eec7fc3, as
it is a bit overzealous, we are in prerelease freeze, and we want
to have enough time to get this right and cook in 'next'.

cf. <8735xgkvuo.fsf@evledraar.gmail.com>
2021-03-01 09:19:37 -08:00
Derrick Stolee 702110aac6 commit-graph: use config to specify generation type
We have two established generation number versions:

 1: topological levels
 2: corrected commit dates

The corrected commit dates are enabled by default, but they also write
extra data in the GDAT and GDOV chunks. Services that host Git data
might want to have more control over when this feature rolls out than
just updating the Git binaries.

Add a new "commitGraph.generationVersion" config option that specifies
the intended generation number version. If this value is less than 2,
then the GDAT chunk is never written _or read_ from an existing file.

This can replace our use of the GIT_TEST_COMMIT_GRAPH_NO_GDAT
environment variable in the test suite. Remove it.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-25 15:10:41 -08:00
Derrick Stolee c7ef8fe608 commit-graph: create local repository pointer
The write_commit_graph() method uses 'the_repository' in a few places. A
new need for a repository pointer is coming in the following change, so
group these instances into a local variable 'r' that could eventually
become part of the method signature, if so desired.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-25 15:10:40 -08:00
Taylor Blau c4ff24bbb3 commit-graph.c: display correct number of chunks when writing
When writing a commit-graph, a progress meter is shown which indicates
the number of pieces of data to write (one per commit in each chunk).

In 47410aa837 (commit-graph: use chunk-format write API, 2021-02-18),
the number of chunks became tracked by the new chunk-format API. But a
stray local variable was left behind from when write_commit_graph_file()
used to keep track of the same.

Since this was no longer updated after 47410aa837, the progress meter
appeared broken:

    $ git commit-graph write --reachable
    Expanding reachable commits in commit graph: 837569, done.
    Writing out commit graph in 3 passes: 166% (4187845/2512707), done.

Drop the local variable and rely instead on the chunk-format API to tell
us the correct number of chunks.

Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Acked-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-24 11:44:34 -08:00
Andrzej Hunt bf4bb9f9f5 commit-graph: avoid leaking topo_levels slab in write_commit_graph()
write_commit_graph initialises topo_levels using init_topo_level_slab(),
next it calls compute_topological_levels() which can cause the slab to
grow, we therefore need to clear the slab again using
clear_topo_level_slab() when we're done.

First introduced in 72a2bfca (commit-graph: add a slab to store
topological levels, 2021-01-16).

LeakSanitizer output:

==1026==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 8 byte(s) in 1 object(s) allocated from:
    #0 0x498ae9 in realloc /src/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:164:3
    #1 0xafbed8 in xrealloc /src/git/wrapper.c:126:8
    #2 0x7966d1 in topo_level_slab_at_peek /src/git/commit-graph.c:71:1
    #3 0x7965e0 in topo_level_slab_at /src/git/commit-graph.c:71:1
    #4 0x78fbf5 in compute_topological_levels /src/git/commit-graph.c:1472:12
    #5 0x78c5c3 in write_commit_graph /src/git/commit-graph.c:2456:2
    #6 0x535c5f in graph_write /src/git/builtin/commit-graph.c:299:6
    #7 0x5350ca in cmd_commit_graph /src/git/builtin/commit-graph.c:337:11
    #8 0x4cddb1 in run_builtin /src/git/git.c:453:11
    #9 0x4cabe2 in handle_builtin /src/git/git.c:704:3
    #10 0x4cd084 in run_argv /src/git/git.c:771:4
    #11 0x4ca424 in cmd_main /src/git/git.c:902:19
    #12 0x707fb6 in main /src/git/common-main.c:52:11
    #13 0x7fee4249383f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2083f)

Indirect leak of 524256 byte(s) in 1 object(s) allocated from:
    #0 0x498942 in calloc /src/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:154:3
    #1 0xafc088 in xcalloc /src/git/wrapper.c:140:8
    #2 0x796870 in topo_level_slab_at_peek /src/git/commit-graph.c:71:1
    #3 0x7965e0 in topo_level_slab_at /src/git/commit-graph.c:71:1
    #4 0x78fbf5 in compute_topological_levels /src/git/commit-graph.c:1472:12
    #5 0x78c5c3 in write_commit_graph /src/git/commit-graph.c:2456:2
    #6 0x535c5f in graph_write /src/git/builtin/commit-graph.c:299:6
    #7 0x5350ca in cmd_commit_graph /src/git/builtin/commit-graph.c:337:11
    #8 0x4cddb1 in run_builtin /src/git/git.c:453:11
    #9 0x4cabe2 in handle_builtin /src/git/git.c:704:3
    #10 0x4cd084 in run_argv /src/git/git.c:771:4
    #11 0x4ca424 in cmd_main /src/git/git.c:902:19
    #12 0x707fb6 in main /src/git/common-main.c:52:11
    #13 0x7fee4249383f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2083f)

SUMMARY: AddressSanitizer: 524264 byte(s) leaked in 2 allocation(s).

Signed-off-by: Andrzej Hunt <ajrhunt@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-22 13:45:01 -08:00
Derrick Stolee 2692c2f6fd commit-graph: use chunk-format read API
Instead of parsing the table of contents directly, use the chunk-format
API methods read_table_of_contents() and pair_chunk(). While the current
implementation loses the duplicate-chunk detection, that will be added
in a future change.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-18 13:38:16 -08:00
Derrick Stolee 47410aa837 commit-graph: use chunk-format write API
The commit-graph write logic is ready to make use of the chunk-format
write API. Each chunk write method is already in the correct prototype.
We only need to use the 'struct chunkfile' pointer and the correct API
calls.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-18 13:38:16 -08:00
Junio C Hamano 726b11d68a Merge branch 'js/commit-graph-warning'
When certain features (e.g. grafts) used in the repository are
incompatible with the use of the commit-graph, we used to silently
turned commit-graph off; we now tell the user what we are doing.

* js/commit-graph-warning:
  commit-graph: when incompatible with graphs, indicate why
2021-02-17 17:21:42 -08:00
Junio C Hamano 5bd0b21bf7 Merge branch 'ds/commit-graph-genno-fix'
Fix incremental update of commit-graph file around corrected commit
date data.

* ds/commit-graph-genno-fix:
  commit-graph: prepare commit graph
  commit-graph: be extra careful about mixed generations
  commit-graph: compute generations separately
  commit-graph: validate layers for generation data
  commit-graph: always parse before commit_graph_data_at()
  commit-graph: use repo_parse_commit
2021-02-17 17:21:40 -08:00
Junio C Hamano 8b4701ae4f Merge branch 'ak/corrected-commit-date'
The commit-graph learned to use corrected commit dates instead of
the generation number to help topological revision traversal.

* ak/corrected-commit-date:
  doc: add corrected commit date info
  commit-reach: use corrected commit dates in paint_down_to_common()
  commit-graph: use generation v2 only if entire chain does
  commit-graph: implement generation data chunk
  commit-graph: implement corrected commit date
  commit-graph: return 64-bit generation number
  commit-graph: add a slab to store topological levels
  t6600-test-reach: generalize *_three_modes
  commit-graph: consolidate fill_commit_graph_info
  revision: parse parent in indegree_walk_step()
  commit-graph: fix regression when computing Bloom filters
2021-02-17 17:21:40 -08:00
Johannes Schindelin c85eec7fc3 commit-graph: when incompatible with graphs, indicate why
When `gc.writeCommitGraph = true`, it is possible that the commit-graph
is _still_ not written: replace objects, grafts and shallow repositories
are incompatible with the commit-graph feature.

Under such circumstances, we need to indicate to the user why the
commit-graph was not written instead of staying silent about it.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Acked-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-11 09:33:01 -08:00
Junio C Hamano 9d5b1c06ac Merge branch 'jk/use-oid-pos'
Code clean-up to ensure our use of hashtables using object names as
keys use the "struct object_id" objects, not the raw hash values.

* jk/use-oid-pos:
  oid_pos(): access table through const pointers
  hash_pos(): convert to oid_pos()
  rerere: use strmap to store rerere directories
  rerere: tighten rr-cache dirname check
  rerere: check dirname format while iterating rr_cache directory
  commit_graft_pos(): take an oid instead of a bare hash
2021-02-10 14:48:31 -08:00
Derrick Stolee eb9071912f commit-graph: anonymize data in chunk_write_fn
In preparation for creating an API around file formats using chunks and
tables of contents, prepare the commit-graph write code to use
prototypes that will match this new API.

Specifically, convert chunk_write_fn to take a "void *data" parameter
instead of the commit-graph-specific "struct write_commit_graph_context"
pointer.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-05 15:40:41 -08:00
Junio C Hamano 973e20b83f Merge branch 'jk/peel-iterated-oid'
The peel_ref() API has been replaced with peel_iterated_oid().

* jk/peel-iterated-oid:
  refs: switch peel_ref() to peel_iterated_oid()
2021-02-03 15:04:49 -08:00
Derrick Stolee bc50d6c91f commit-graph: prepare commit graph
Before checking if the repository has a commit-graph loaded, be sure
to run prepare_commit_graph(). This is necessary because otherwise
the topo_levels slab is not initialized. As we compute topo_levels for
the new commits, we iterate further into the lower layers since the
first visit to each commit looks as though the topo_level is not
populated.

By properly initializing the topo_slab, we fix the previously broken
case of a split commit graph where a base layer has the
generation_data_overflow chunk.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-01 21:03:36 -08:00
Derrick Stolee fde55b0906 commit-graph: be extra careful about mixed generations
When upgrading to a commit-graph with corrected commit dates from
one without, there are a few things that need to be considered.

When computing generation numbers for the new commit-graph file that
expects to add the generation_data chunk with corrected commit
dates, we need to ensure that the 'generation' member of the
commit_graph_data struct is set to zero for these commits.

Unfortunately, the fallback to use topological level for generation
number when corrected commit dates are not available are causing us
harm here: parsing commits notices that read_generation_data is
false and populates 'generation' with the topological level.

The solution is to iterate through the commits, parse the commits
to populate initial values, then reset the generation values to
zero to trigger recalculation. This loop only occurs when the
existing commit-graph data has no corrected commit dates.

While this improves our situation somewhat, we have not completely
solved the issue for correctly computing generation numbers for mixed
layers. That follows in the next change.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-01 21:03:36 -08:00
Derrick Stolee 9c2c0a8256 commit-graph: compute generations separately
The compute_generation_numbers() method was introduced by 3258c663
(commit-graph: compute generation numbers, 2018-05-01) to compute what
is now known as "topological levels". These are still stored in the
commit-graph file for compatibility sake while c1a09119 (commit-graph:
implement corrected commit date, 2021-01-16) updated the method to also
compute the new version of generation numbers: corrected commit date.

It makes sense why these are grouped. They perform very similar walks of
the necessary commits and compute similar maximums over each parent.
However, having these two together conflates them in subtle ways that is
hard to separate.

In particular, the topo_level slab is used to store the topological
levels in all cases, but the commit_graph_data_at(c)->generation member
stores different values depending on the state of the existing
commit-graph file.

* If the existing commit-graph file has a "GDAT" chunk, then these
  values represent corrected commit dates.

* If the existing commit-graph file doesn't have a "GDAT" chunk, then
  these values are actually the topological levels.

This issue only occurs only when upgrading an existing commit-graph file
into one that has the "GDAT" chunk. The current change does not resolve
this upgrade problem, but splitting the implementation into two pieces
here helps with that process, which will follow in the next change.

The important thing this helps with is the case where the
num_generation_data_overflows was being incremented incorrectly,
triggering a write of the overflow chunk.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-01 21:03:36 -08:00
Derrick Stolee 448a39e65d commit-graph: validate layers for generation data
We need to be extra careful that we don't use corrected
commit dates from any layer of a commit-graph chain if there is a
single commit-graph file that is missing the generation_data chunk.
Update validate_mixed_generation_chain() to correctly update each
layer to ignore the generation_data chunk in this case. It now also
returns 1 if all layers have a generation_data chunk. This return
value will be used in the next change.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-01 21:03:36 -08:00
Derrick Stolee 90cb1c47c7 commit-graph: always parse before commit_graph_data_at()
There is a subtle failure happening when computing corrected commit
dates with --split enabled. It requires a base layer needing the
generation_data_overflow chunk. Then, the next layer on top
erroneously thinks it needs an overflow chunk due to a bug leading
to recalculating all reachable generation numbers. The output of
the failure is

  BUG: commit-graph.c:1912: expected to write 8 bytes to
  chunk 47444f56, but wrote 0 instead

These "expected" 8 bytes are due to re-computing the corrected
commit date for the lower layer but the new layer does not need
any overflow.

Add a test to t5318-commit-graph.sh that demonstrates this bug. However,
it does not trigger consistently with the existing code.

The generation number data is stored in a slab and accessed by
commit_graph_data_at(). This data is initialized when parsing a commit,
but is otherwise used assuming it has been populated. The loop in
compute_generation_numbers() did not enforce that all reachable
commits were parsed and had correct values. This could lead to some
problems when writing a commit-graph with corrected commit dates based
on a commit-graph without them.

It has been difficult to identify the issue here because it was so hard
to reproduce. It relies on this uninitialized data having a non-zero
value, but also on specifically in a way that overwrites the existing
data.

This patch adds the extra parse to ensure the data is filled before we
compute the generation number of a commit. This triggers the new test
to fail because the generation number overflow count does not match
between this computation and the write for that chunk.

The actual fix will follow as the next few changes.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-01 21:03:36 -08:00
Derrick Stolee c4cc083169 commit-graph: use repo_parse_commit
The write_commit_graph_context has a repository pointer, so use it.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-01 21:03:35 -08:00
Jeff King 8380dcd700 oid_pos(): access table through const pointers
When we are looking up an oid in an array, we obviously don't need to
write to the array. Let's mark it as const in the function interfaces,
as well as in the local variables we use to derference the void pointer
(note a few cases use pointers-to-pointers, so we mark everything
const).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-28 12:03:26 -08:00
Jeff King 45ee13b942 hash_pos(): convert to oid_pos()
All of our callers are actually looking up an object_id, not a bare
hash. Likewise, the arrays they are looking in are actual arrays of
object_id (not just raw bytes of hashes, as we might find in a pack
.idx; those are handled by bsearch_hash()).

Using an object_id gives us more type safety, and makes the callers
slightly shorter. It also gets rid of the word "sha1" from several
access functions, though we could obviously also rename those with
s/sha1/hash/.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-28 12:02:39 -08:00
Junio C Hamano 58e2ce9112 Merge branch 'ma/more-opaque-lock-file'
Code clean-up.

* ma/more-opaque-lock-file:
  read-cache: try not to peek into `struct {lock_,temp}file`
  refs/files-backend: don't peek into `struct lock_file`
  midx: don't peek into `struct lock_file`
  commit-graph: don't peek into `struct lock_file`
  builtin/gc: don't peek into `struct lock_file`
2021-01-25 14:19:17 -08:00
Jeff King 36a317929b refs: switch peel_ref() to peel_iterated_oid()
The peel_ref() interface is confusing and error-prone:

  - it's typically used by ref iteration callbacks that have both a
    refname and oid. But since they pass only the refname, we may load
    the ref value from the filesystem again. This is inefficient, but
    also means we are open to a race if somebody simultaneously updates
    the ref. E.g., this:

      int some_ref_cb(const char *refname, const struct object_id *oid, ...)
      {
              if (!peel_ref(refname, &peeled))
                      printf("%s peels to %s",
                             oid_to_hex(oid), oid_to_hex(&peeled);
      }

    could print nonsense. It is correct to say "refname peels to..."
    (you may see the "before" value or the "after" value, either of
    which is consistent), but mentioning both oids may be mixing
    before/after values.

    Worse, whether this is possible depends on whether the optimization
    to read from the current iterator value kicks in. So it is actually
    not possible with:

      for_each_ref(some_ref_cb);

    but it _is_ possible with:

      head_ref(some_ref_cb);

    which does not use the iterator mechanism (though in practice, HEAD
    should never peel to anything, so this may not be triggerable).

  - it must take a fully-qualified refname for the read_ref_full() code
    path to work. Yet we routinely pass it partial refnames from
    callbacks to for_each_tag_ref(), etc. This happens to work when
    iterating because there we do not call read_ref_full() at all, and
    only use the passed refname to check if it is the same as the
    iterator. But the requirements for the function parameters are quite
    unclear.

Instead of taking a refname, let's instead take an oid. That fixes both
problems. It's a little funny for a "ref" function not to involve refs
at all. The key thing is that it's optimizing under the hood based on
having access to the ref iterator. So let's change the name to make it
clear why you'd want this function versus just peel_object().

There are two other directions I considered but rejected:

  - we could pass the peel information into the each_ref_fn callback.
    However, we don't know if the caller actually wants it or not. For
    packed-refs, providing it is essentially free. But for loose refs,
    we actually have to peel the object, which would be wasteful in most
    cases. We could likewise pass in a flag to the callback indicating
    whether the peeled information is known, but that complicates those
    callbacks, as they then have to decide whether to manually peel
    themselves. Plus it requires changing the interface of every
    callback, whether they care about peeling or not, and there are many
    of them.

  - we could make a function to return the peeled value of the current
    iterated ref (computing it if necessary), and BUG() otherwise. I.e.:

      int peel_current_iterated_ref(struct object_id *out);

    Each of the current callers is an each_ref_fn callback, so they'd
    mostly be happy. But:

      - we use those callbacks with functions like head_ref(), which do
        not use the iteration code. So we'd need to handle the fallback
        case there, anyway.

      - it's possible that a caller would want to call into generic code
        that sometimes is used during iteration and sometimes not. This
        encapsulates the logic to do the fast thing when possible, and
        fallback when necessary.

The implementation is mostly obvious, but I want to call out a few
things in the patch:

  - the test-tool coverage for peel_ref() is now meaningless, as it all
    collapses to a single peel_object() call (arguably they were pretty
    uninteresting before; the tricky part of that function is the
    fast-path we see during iteration, but these calls didn't trigger
    that). I've just dropped it entirely, though note that some other
    tests relied on the tags we created; I've moved that creation to the
    tests where it matters.

  - we no longer need to take a ref_store parameter, since we'd never
    look up a ref now. We do still rely on a global "current iterator"
    variable which _could_ be kept per-ref-store. But in practice this
    is only useful if there are multiple recursive iterations, at which
    point the more appropriate solution is probably a stack of
    iterators. No caller used the actual ref-store parameter anyway
    (they all call the wrapper that passes the_repository).

  - the original only kicked in the optimization when the "refname"
    pointer matched (i.e., not string comparison). We do likewise with
    the "oid" parameter here, but fall back to doing an actual oideq()
    call. This in theory lets us kick in the optimization more often,
    though in practice no current caller cares. It should never be
    wrong, though (peeling is a property of an object, so two refs
    pointing to the same object would peel identically).

  - the original took care not to touch the peeled out-parameter unless
    we found something to put in it. But no caller cares about this, and
    anyway, it is enforced by peel_object() itself (and even in the
    optimized iterator case, that's where we eventually end up). We can
    shorten the code and avoid an extra copy by just passing the
    out-parameter through the stack.

Signed-off-by: Jeff King <peff@peff.net>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-21 15:51:31 -08:00
Abhishek Kumar 8d00d7c3df commit-reach: use corrected commit dates in paint_down_to_common()
091f4cf (commit: don't use generation numbers if not needed,
2018-08-30) changed paint_down_to_common() to use commit dates instead
of generation numbers v1 (topological levels) as the performance
regressed on certain topologies. With generation number v2 (corrected
commit dates) implemented, we no longer have to rely on commit dates and
can use generation numbers.

For example, the command `git merge-base v4.8 v4.9` on the Linux
repository walks 167468 commits, taking 0.135s for committer date and
167496 commits, taking 0.157s for corrected committer date respectively.

While using corrected commit dates, Git walks nearly the same number of
commits as commit date, the process is slower as for each comparision we
have to access a commit-slab (for corrected committer date) instead of
accessing struct member (for committer date).

This change incidentally broke the fragile t6404-recursive-merge test.
t6404-recursive-merge sets up a unique repository where all commits have
the same committer date without a well-defined merge-base.

While running tests with GIT_TEST_COMMIT_GRAPH unset, we use committer
date as a heuristic in paint_down_to_common(). 6404.1 'combined merge
conflicts' merges commits in the order:
- Merge C with B to form an intermediate commit.
- Merge the intermediate commit with A.

With GIT_TEST_COMMIT_GRAPH=1, we write a commit-graph and subsequently
use the corrected committer date, which changes the order in which
commits are merged:
- Merge A with B to form an intermediate commit.
- Merge the intermediate commit with C.

While resulting repositories are equivalent, 6404.4 'virtual trees were
processed' fails with GIT_TEST_COMMIT_GRAPH=1 as we are selecting
different merge-bases and thus have different object ids for the
intermediate commits.

As this has already causes problems (as noted in 859fdc0 (commit-graph:
define GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable commit graph
within t6404-recursive-merge.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18 16:21:18 -08:00
Abhishek Kumar 1fdc383c5c commit-graph: use generation v2 only if entire chain does
Since there are released versions of Git that understand generation
numbers in the commit-graph's CDAT chunk but do not understand the GDAT
chunk, the following scenario is possible:

1. "New" Git writes a commit-graph with the GDAT chunk.
2. "Old" Git writes a split commit-graph on top without a GDAT chunk.

If each layer of split commit-graph is treated independently, as it was
the case before this commit, with Git inspecting only the current layer
for chunk_generation_data pointer, commits in the lower layer (one with
GDAT) whould have corrected commit date as their generation number,
while commits in the upper layer would have topological levels as their
generation. Corrected commit dates usually have much larger values than
topological levels. This means that if we take two commits, one from the
upper layer, and one reachable from it in the lower layer, then the
expectation that the generation of a parent is smaller than the
generation of a child would be violated.

It is difficult to expose this issue in a test. Since we _start_ with
artificially low generation numbers, any commit walk that prioritizes
generation numbers will walk all of the commits with high generation
number before walking the commits with low generation number. In all the
cases I tried, the commit-graph layers themselves "protect" any
incorrect behavior since none of the commits in the lower layer can
reach the commits in the upper layer.

This issue would manifest itself as a performance problem in this case,
especially with something like "git log --graph" since the low
generation numbers would cause the in-degree queue to walk all of the
commits in the lower layer before allowing the topo-order queue to write
anything to output (depending on the size of the upper layer).

Therefore, When writing the new layer in split commit-graph, we write a
GDAT chunk only if the topmost layer has a GDAT chunk. This guarantees
that if a layer has GDAT chunk, all lower layers must have a GDAT chunk
as well.

Rewriting layers follows similar approach: if the topmost layer below
the set of layers being rewritten (in the split commit-graph chain)
exists, and it does not contain GDAT chunk, then the result of rewrite
does not have GDAT chunks either.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18 16:21:18 -08:00
Abhishek Kumar e8b63005c4 commit-graph: implement generation data chunk
As discovered by Ævar, we cannot increment graph version to
distinguish between generation numbers v1 and v2 [1]. Thus, one of
pre-requistes before implementing generation number v2 was to
distinguish between graph versions in a backwards compatible manner.

We are going to introduce a new chunk called Generation DATa chunk (or
GDAT). GDAT will store corrected committer date offsets whereas CDAT
will still store topological level.

Old Git does not understand GDAT chunk and would ignore it, reading
topological levels from CDAT. New Git can parse GDAT and take advantage
of newer generation numbers, falling back to topological levels when
GDAT chunk is missing (as it would happen with a commit-graph written
by old Git).

We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
which forces commit-graph file to be written without generation data
chunk to emulate a commit-graph file written by old Git.

To minimize the space required to store corrrected commit date, Git
stores corrected commit date offsets into the commit-graph file, instea
of corrected commit dates. This saves us 4 bytes per commit, decreasing
the GDAT chunk size by half, but it's possible for the offset to
overflow the 4-bytes allocated for storage. As such overflows are and
should be exceedingly rare, we use the following overflow management
scheme:

We introduce a new commit-graph chunk, Generation Data OVerflow ('GDOV')
to store corrected commit dates for commits with offsets greater than
GENERATION_NUMBER_V2_OFFSET_MAX.

If the offset is greater than GENERATION_NUMBER_V2_OFFSET_MAX, we set
the MSB of the offset and the other bits store the position of corrected
commit date in GDOV chunk, similar to how Extra Edge List is maintained.

We test the overflow-related code with the following repo history:

           F - N - U
          /         \
U - N - U            N
         \          /
	  N - F - N

Where the commits denoted by U have committer date of zero seconds
since Unix epoch, the commits denoted by N have committer date of
1112354055 (default committer date for the test suite) seconds since
Unix epoch and the commits denoted by F have committer date of
(2 ^ 31 - 2) seconds since Unix epoch.

The largest offset observed is 2 ^ 31, just large enough to overflow.

[1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18 16:21:18 -08:00
Abhishek Kumar c1a09119f6 commit-graph: implement corrected commit date
With most of preparations done, let's implement corrected commit date.

The corrected commit date for a commit is defined as:

* A commit with no parents (a root commit) has corrected commit date
  equal to its committer date.
* A commit with at least one parent has corrected commit date equal to
  the maximum of its commit date and one more than the largest corrected
  commit date among its parents.

As a special case, a root commit with timestamp of zero (01.01.1970
00:00:00Z) has corrected commit date of one, to be able to distinguish
from GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit
date).

To minimize the space required to store corrected commit date, Git
stores corrected commit date offsets into the commit-graph file. The
corrected commit date offset for a commit is defined as the difference
between its corrected commit date and actual commit date.

Storing corrected commit date requires sizeof(timestamp_t) bytes, which
in most cases is 64 bits (uintmax_t). However, corrected commit date
offsets can be safely stored using only 32-bits. This halves the size
of GDAT chunk, which is a reduction of around 6% in the size of
commit-graph file.

However, using offsets be problematic if a commit is malformed but valid
and has committer date of 0 Unix time, as the offset would be the same
as corrected commit date and thus require 64-bits to be stored properly.

While Git does not write out offsets at this stage, Git stores the
corrected commit dates in member generation of struct commit_graph_data.
It will begin writing commit date offsets with the introduction of
generation data chunk.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18 16:21:18 -08:00
Abhishek Kumar d7f92784c6 commit-graph: return 64-bit generation number
In a preparatory step for introducing corrected commit dates, let's
return timestamp_t values from commit_graph_generation(), use
timestamp_t for local variables and define GENERATION_NUMBER_INFINITY
as (2 ^ 63 - 1) instead.

We rename GENERATION_NUMBER_MAX to GENERATION_NUMBER_V1_MAX to
represent the largest topological level we can store in the commit data
chunk.

With corrected commit dates implemented, we will have two such *_MAX
variables to denote the largest offset and largest topological level
that can be stored.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18 16:21:18 -08:00
Abhishek Kumar 72a2bfcaf0 commit-graph: add a slab to store topological levels
In a later commit we will introduce corrected commit date as the
generation number v2. Corrected commit dates will be stored in the new
seperate Generation Data chunk. However, to ensure backwards
compatibility with "Old" Git we need to continue to write generation
number v1 (topological levels) to the commit data chunk. Thus, we need
to compute and store both versions of generation numbers to write the
commit-graph file.

Therefore, let's introduce a commit-slab `topo_level_slab` to store
topological levels; corrected commit date will be stored in the member
`generation` of struct commit_graph_data.

The macros `GENERATION_NUMBER_INFINITY` and `GENERATION_NUMBER_ZERO`
mark commits not in the commit-graph file and commits written by a
version of Git that did not compute generation numbers respectively.
Generation numbers are computed identically for both kinds of commits.

A "slab-miss" should return `GENERATION_NUMBER_INFINITY` as the commit
is not in the commit-graph file. However, since the slab is
zero-initialized, it returns 0 (or rather `GENERATION_NUMBER_ZERO`).
Thus, we no longer need to check if the topological level of a commit is
`GENERATION_NUMBER_INFINITY`.

We will add a pointer to the slab in `struct write_commit_graph_context`
and `struct commit_graph` to populate the slab in
`fill_commit_graph_info` if the commit has a pre-computed topological
level as in case of split commit-graphs.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18 16:21:18 -08:00
Abhishek Kumar f90fca638e commit-graph: consolidate fill_commit_graph_info
Both fill_commit_graph_info() and fill_commit_in_graph() parse
information present in commit data chunk. Let's simplify the
implementation by calling fill_commit_graph_info() within
fill_commit_in_graph().

fill_commit_graph_info() used to not load committer data from commit data
chunk. However, with the upcoming switch to using corrected committer
date as generation number v2, we will have to load committer date to
compute generation number value anyway.

e51217e15 (t5000: test tar files that overflow ustar headers,
30-06-2016) introduced a test 'generate tar with future mtime' that
creates a commit with committer date of (2^36 + 1) seconds since
EPOCH. The CDAT chunk provides 34-bits for storing committer date, thus
committer time overflows into generation number (within CDAT chunk) and
has undefined behavior.

The test used to pass as fill_commit_graph_info() would not set struct
member `date` of struct commit and load committer date from the object
database, generating a tar file with the expected mtime.

However, with corrected commit date, we will load the committer date
from CDAT chunk (truncated to lower 34-bits to populate the generation
number. Thus, Git sets date and generates tar file with the truncated
mtime.

The ustar format (the header format used by most modern tar programs)
only has room for 11 (or 12, depending on some implementations) octal
digits for the size and mtime of each file.

As the CDAT chunk is overflow by 12-octal digits but not 11-octal
digits, we split the existing tests to test both implementations
separately and add a new explicit test for 11-digit implementation.

To test the 11-octal digit implementation, we create a future commit
with committer date of 2^34 - 1, which overflows 11-octal digits without
overflowing 34-bits of the Commit Date chunks.

To test the 12-octal digit implementation, the smallest committer date
possible is 2^36 + 1, which overflows the CDAT chunk and thus
commit-graph must be disabled for the test.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18 16:21:18 -08:00