Commit Graph

8 Commits (next)

Author SHA1 Message Date
Junio C Hamano def5e32bc5 Merge branch 'tb/refs-exclude-fixes'
The refname exclusion logic in the packed-ref backend has been
broken for some time, which confused upload-pack to advertise
different set of refs.  This has been corrected.

* tb/refs-exclude-fixes:
  refs.c: stop matching non-directory prefixes in exclude patterns
  refs.c: remove empty '--exclude' patterns
2025-03-26 16:26:10 +09:00
Taylor Blau 10e8a9352b refs.c: stop matching non-directory prefixes in exclude patterns
In the packed-refs backend, our implementation of '--exclude' (dating
back to 59c35fac54 (refs/packed-backend.c: implement jump lists to avoid
excluded pattern(s), 2023-07-10)) considers, for example:

    $ git for-each-ref --exclude=refs/heads/ba

to exclude "refs/heads/bar", "refs/heads/baz", and so on.

The files backend, which does not implement '--exclude' (and relies on
the caller to cull out results that don't match) naturally will
enumerate "refs/heads/bar" and so on.

So in the above example, 'for-each-ref' will try and see if
"refs/heads/ba" matches "refs/heads/bar" (since the files backend simply
enumerated every loose reference), and, realizing that it does not
match, output the reference as expected. (A caller that did want to
exclude "refs/heads/bar" and "refs/heads/baz" might instead run "git
for-each-ref --exclude='refs/heads/ba*'").

This can lead to strange behavior, like seeing a different set of
references advertised via 'upload-pack' depending on what set of
references were loose versus packed.

So there is a subtle bug with '--exclude' which is that in the
packed-refs backend we will consider "refs/heads/bar" to be a pattern
match against "refs/heads/ba" when we shouldn't. Likewise, the reftable
backend (which in this case is bug-compatible with the packed backend)
exhibits the same broken behavior.

There are a few ways to fix this. One is to tighten the rules in
cmp_record_to_refname(), which is used to determine the start/end-points
of the jump list used by the packed backend. In this new "strict" mode,
the comparison function would handle the case where we've reached the
end of the pattern by introducing a new check like so:

    while (1) {
        if (*r1 == '\n')
            return *r2 ? -1 : 0;
        if (!*r2)
            if (strict && *r1 != '/')        /* <- here */
                return 1;
            return start ? 1 : -1;
        if (*r1 != *r2)
            return (unsigned char)*r1 < (unsigned char)*r2 ? -1 : +1;
        r1++;
        r2++;
    }

(eliding out the rest of cmp_record_to_refname()). Equivalently, we
could teach refs/packed-backend::populate_excluded_jump_list() to append
a trailing '/' if one does not already exist, forcing an exclude pattern
like "refs/heads/ba" to only match "refs/heads/ba/abc" and so forth.

But since the same problem exists in reftable, we can fix both at once
by performing this pre-processing step one layer up in refs.c at the
common entrypoint for the two, which is 'refs_ref_iterator_begin()'.

Since that solution is both the simplest and only requires modification
in one spot, let's normalize exclude patterns so that they end with a
trailing slash. This causes us to unify the behavior between all three
backends.

There is some minor test fallout in the "overlapping excluded regions"
test, which happens to use 'refs/ba' as an exclude pattern, and expects
references under the "refs/heads/bar/*" and "refs/heads/baz/*"
hierarchies to be excluded from the results.

But that test fallout is expected, because the test was codifying the
buggy behavior to begin with, and should have never been written that
way. Split that into its own test (since the range is no longer
overlapping under the stricter interpretation of --exclude patterns
presented here). Create a new test which does have overlapping
regions by using a refs/heads/bar/4/... hierarchy and excluding both
"refs/heads/bar" and "refs/heads/bar/4".

Reported-by: SURA <surak8806@gmail.com>
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-06 09:11:05 -08:00
Taylor Blau 27be76b230 refs.c: remove empty '--exclude' patterns
In 59c35fac54 (refs/packed-backend.c: implement jump lists to avoid
excluded pattern(s), 2023-07-10), the packed-refs backend learned how to
construct "jump lists" to avoid enumerating sections of the packed-refs
file that we know the caller is going to throw out anyway.

This process works by finding the start- and end-points (that is, where
in the packed-refs file corresponds to the range we're going to ignore)
for each exclude pattern, then constructing a jump list based on that.
At enumeration time we'll consult the jump list to skip past everything
in the range(s) found in the previous step, saving time when excluding a
large portion of references.

But when there is a --exclude pattern which is just the empty string,
the behavior is a little funky. When we try and exclude the empty
string, the matched range covers the entire packed-refs file, meaning
that we won't output any packed references. But the empty pattern
doesn't actually match any references to begin with! For example, on my
copy of git.git I can do:

    $ git for-each-ref '' | wc -l
    0

So "git for-each-ref --exclude=''" shouldn't actually remove anything
from the output, and ought to be equivalent to "git for-each-ref". But
it's not, and in fact:

    $ git for-each-ref | wc -l
    2229
    $ git for-each-ref --exclude='' | wc -l
    480

But why does the '--exclude' version output only some of the references
in the repository? Here's a hint:

    $ find .git/refs -type f | wc -l
    480

Indeed, because the files backend doesn't implement[^1] the same jump
list concept as the packed backend we get the correct result for the
loose references, but none of the packed references.

Since the empty string exclude pattern doesn't match anything, we can
discard them before the packed-refs backend has a chance to even see it
(and likewise for reftable, which also implements a similar concept
since 1869525066 (refs/reftable: wire up support for exclude patterns,
2024-09-16)).

This approach (copying only some of the patterns into a strvec at the
refs.c layer) may seem heavy-handed, but it's setting us up to fix
another bug in the following commit where the fix will involve modifying
the incoming patterns.

[^1]: As noted in 59c35fac54. We technically could avoid opening and
  enumerating the contents of, for e.g., "$GIT_DIR/refs/heads/foo/" if
  we knew that we were excluding anything under the 'refs/heads/foo'
  hierarchy. But the --exclude stuff is all best-effort anyway, since
  the caller is expected to cull out any results that they don't want.

Noticed-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-06 09:11:04 -08:00
Patrick Steinhardt fc1ddf42af t: remove TEST_PASSES_SANITIZE_LEAK annotations
Now that the default value for TEST_PASSES_SANITIZE_LEAK is `true` there
is no longer a need to have that variable declared in all of our tests.
Drop it.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-21 08:23:48 +09:00
Patrick Steinhardt 1869525066 refs/reftable: wire up support for exclude patterns
Exclude patterns can be used by reference backends to skip over blocks
of references that are uninteresting to the caller. Reference backends
do not have to wire up support for them, and all callers are expected to
behave as if the backend didn't support them. In fact, the only backend
that supports exclude patterns right now is the "packed" backend.

Exclude patterns can be quite an important performance optimization in
repositories that have loads of references. The patterns are set up in
case "transfer.hideRefs" and friends are configured during a fetch, so
handling these patterns becomes important once there are lots of hidden
refs in a served repository.

Now that we have properly re-seekable reftable iterators we can also
wire up support for these patterns in the "reftable" backend. Doing so
is conceptually simple: once we hit a reference whose prefix matches the
current exclude pattern we re-seek the iterator to the first reference
that doesn't match the pattern anymore. This schema only works for
trivial patterns that do not have any globbing characters in them, but
this restriction also applies do the "packed" backend.

This makes t1419 work with the "reftable" backend with some slight
modifications. Of course it also speeds up listing of references with
hidden refs. The following benchmark prints one reference with 1 million
hidden references:

    Benchmark 1: HEAD~
      Time (mean ± σ):      93.3 ms ±   2.1 ms    [User: 90.3 ms, System: 2.5 ms]
      Range (min … max):    89.8 ms …  97.2 ms    33 runs

    Benchmark 2: HEAD
      Time (mean ± σ):       4.2 ms ±   0.6 ms    [User: 2.2 ms, System: 1.8 ms]
      Range (min … max):     3.1 ms …   8.1 ms    765 runs

    Summary
      HEAD ran
       22.15 ± 3.19 times faster than HEAD~

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-09-16 13:57:19 -07:00
Patrick Steinhardt 61e1c560bc t1419: mark test suite as files-backend specific
With 59c35fac54 (refs/packed-backend.c: implement jump lists to avoid
excluded pattern(s), 2023-07-10) we have implemented logic to handle
excluded refs more efficiently in the "packed" ref backend. This logic
allows us to skip emitting refs completely which we know to not be of
any interest to the caller, which can avoid quite some allocations and
object lookups.

This was wired up via a new `exclude_patterns` parameter passed to the
backend's ref iterator. The backend only needs to handle them on a best
effort basis though, and in fact we only handle it for the "packed-refs"
file, but not for loose references. Consequently, all callers must still
filter emitted refs with those exclude patterns.

The result is that handling exclude patterns is completely optional in
the ref backend, and any future backends may or may not implement it.
Let's thus mark the test for t1419 to depend on the REFFILES prereq.

An alternative would be to introduce a new prereq that tells us whether
the backend under test supports exclude patterns or not. But this does
feel a bit overblown:

  - It would either map to the REFFILES prereq, in which case it feels
    overengineered because the prereq is only ever relevant to t1419.

  - Otherwise, it could auto-detect whether the backend supports exclude
    patterns. But this could lead to silent failures in case the support
    for this feature breaks at any point in time.

It should thus be good enough to just use the REFFILES prereq for now.
If future backends ever grow support for exclude patterns we can easily
add their respective prereq as another condition for this test suite to
execute.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Reviewed-by: Christian Couder <christian.couder@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-01-29 13:54:33 -08:00
Taylor Blau c489f47a64 refs/packed-backend.c: add trace2 counters for jump list
The previous commit added low-level tests to ensure that the packed-refs
iterator did not enumerate excluded sections of the refspace.

However, there was no guarantee that these sections weren't being
visited, only that they were being suppressed from the output. To harden
these tests, add a trace2 counter which tracks the number of regions
skipped by the packed-refs iterator, and assert on its value.

Suggested-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-10 14:48:55 -07:00
Taylor Blau 59c35fac54 refs/packed-backend.c: implement jump lists to avoid excluded pattern(s)
When iterating through the `packed-refs` file in order to answer a query
like:

    $ git for-each-ref --exclude=refs/__hidden__

it would be useful to avoid walking over all of the entries in
`refs/__hidden__/*` when possible, since we know that the ref-filter
code is going to throw them away anyways.

In certain circumstances, doing so is possible. The algorithm for doing
so is as follows:

  - For each excluded pattern, find the first record that matches it,
    and the first record that *doesn't* match it (i.e. the location
    you'd next want to consider when excluding that pattern).

  - Sort the set of excluded regions from the previous step in ascending
    order of the first location within the `packed-refs` file that
    matches.

  - Clean up the results from the previous step: discard empty regions,
    and combine adjacent regions. The set of regions which remains is
    referred to as the "jump list", and never contains any references
    which should be included in the result set.

Then when iterating through the `packed-refs` file, if `iter->pos` is
ever contained in one of the regions from the previous steps, advance
`iter->pos` past the end of that region, and continue enumeration.

Note that we only perform this optimization when none of the excluded
pattern(s) have special meta-characters in them. For a pattern like
"refs/foo[ac]", the excluded regions ("refs/fooa", "refs/fooc", and
everything underneath them) are not connected. A future implementation
that handles this case may split the character class (pretending as if
two patterns were excluded: "refs/fooa", and "refs/fooc").

There are a few other gotchas worth considering. First, note that the
jump list is sorted, so once we jump past a region, we can avoid
considering it (or any regions preceding it) again. The member
`jump_pos` is used to track the first next-possible region to jump
through.

Second, note that the jump list is best-effort, since we do not handle
loose references, and because of the meta-character issue above. The
jump list may not skip past all references which won't appear in the
results, but will never skip over a reference which does appear in the
result set.

In repositories with a large number of hidden references, the speed-up
can be significant. Tests here are done with a copy of linux.git with a
reference "refs/pull/N" pointing at every commit, as in:

    $ git rev-list HEAD | awk '{ print "create refs/pull/" NR " " $0 }' |
        git update-ref --stdin
    $ git pack-refs --all

, it is significantly faster to have `for-each-ref` jump over the
excluded references, as opposed to filtering them out after the fact:

    $ hyperfine \
      'git for-each-ref --format="%(objectname) %(refname)" | grep -vE "^[0-9a-f]{40} refs/pull/"' \
      'git.prev for-each-ref --format="%(objectname) %(refname)" --exclude="refs/pull"' \
      'git.compile for-each-ref --format="%(objectname) %(refname)" --exclude="refs/pull"'
    Benchmark 1: git for-each-ref --format="%(objectname) %(refname)" | grep -vE "^[0-9a-f]{40} refs/pull/"
      Time (mean ± σ):     798.1 ms ±   3.3 ms    [User: 687.6 ms, System: 146.4 ms]
      Range (min … max):   794.5 ms … 805.5 ms    10 runs

    Benchmark 2: git.prev for-each-ref --format="%(objectname) %(refname)" --exclude="refs/pull"
      Time (mean ± σ):      98.9 ms ±   1.4 ms    [User: 93.1 ms, System: 5.7 ms]
      Range (min … max):    97.0 ms … 104.0 ms    29 runs

    Benchmark 3: git.compile for-each-ref --format="%(objectname) %(refname)" --exclude="refs/pull"
      Time (mean ± σ):       4.5 ms ±   0.2 ms    [User: 0.7 ms, System: 3.8 ms]
      Range (min … max):     4.1 ms …   5.8 ms    524 runs

    Summary
      'git.compile for-each-ref --format="%(objectname) %(refname)" --exclude="refs/pull"' ran
       21.87 ± 1.05 times faster than 'git.prev for-each-ref --format="%(objectname) %(refname)" --exclude="refs/pull"'
      176.52 ± 8.19 times faster than 'git for-each-ref --format="%(objectname) %(refname)" | grep -vE "^[0-9a-f]{40} refs/pull/"'

(Comparing stock git and this patch isn't quite fair, since an earlier
commit in this series adds a naive implementation of the `--exclude`
option. `git.prev` is built from the previous commit and includes this
naive implementation).

Using the jump list is fairly straightforward (see the changes to
`refs/packed-backend.c::next_record()`), but constructing the list is
not. To ensure that the construction is correct, add a new suite of
tests in t1419 covering various corner cases (overlapping regions,
partially overlapping regions, adjacent regions, etc.).

Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-10 14:48:55 -07:00