Command line completion support for zsh (in contrib/) has been
updated to stop exposing internal state to end-user shell
interaction.
* dk/zsh-git-repo-path-fix:
completion: zsh: stop leaking local cache variable
zsh can pretend to be a normal shell pretty well except for some
glitches that we tickle in some of our scripts. Work them around
so that "vimdiff" and our test suite work well enough with it.
* bc/zsh-compatibility:
vimdiff: make script and tests work with zsh
t4046: avoid continue in &&-chain for zsh
A scheduled "git maintenance" job is expected to work on all
repositories it knows about, but it stopped at the first one that
errored out. Now it keeps going.
* js/for-each-repo-keep-going:
maintenance: running maintenance should not stop on errors
for-each-repo: optionally keep going on an error
The procedure to build multi-pack-index got confused by the
replace-refs mechanism, which has been corrected by disabling the
latter.
* xx/disable-replace-when-building-midx:
midx: disable replace objects
"git rebase --signoff" used to forget that it needs to add a
sign-off to the resulting commit when told to continue after a
conflict stops its operation.
* pw/rebase-m-signoff-fix:
rebase -m: fix --signoff with conflicts
sequencer: store commit message in private context
sequencer: move current fixups to private context
sequencer: start removing private fields from public API
sequencer: always free "struct replay_opts"
The clear_skip_worktree_from_present_files() method was first introduced
in af6a51875a (repo_read_index: clear SKIP_WORKTREE bit from files
present in worktree, 2022-01-14) to allow better interaction with the
working directory in the presence of paths outside of the
sparse-checkout. The initial implementation would lstat() every single
SKIP_WORKTREE path to see if it existed; if it ran across a sparse
directory that existed (when a sparse index was in use), then it would
expand the index and then check every SKIP_WORKTREE path.
Since these lstat() calls were very expensive, this was improved in
d79d299352 (Accelerate clear_skip_worktree_from_present_files() by
caching, 2022-01-14) by caching directories that do not exist so it
could avoid lstat()ing any files under such directories. However, there
are some inefficiencies in that caching mechanism.
The caching mechanism stored only the immediate parent directory as not
existing, even when a higher-level ancestor directory also does not
exist. This means that wasted lstat() calls occur whenever consecutive
paths passed to path_found() have different immediate parent directories
that all live under the same non-existing ancestor directory.
To create an example repository that demonstrates this problem, it helps
to have a directory outside of the sparse-checkout that contains many
deep paths. In particular, the first paths (in lexicographic order)
underneath the sparse directory should have deep directory structures,
maximizing the difference between the old caching algorithm that looks
to a single parent and the new caching algorithm that looks to the
top-most missing directory.
The performance test script p2000-sparse-operations.sh takes the sample
repository and copies its HEAD to several copies nested in directories
of the form f<i>/f<j>/f<k> where i, j, and k are numbers from 1 to 4.
The sparse-checkout cone is then selected as "f2/f4/". Creating "f1/f1/"
will trigger the behavior and also lead to some interesting cases for
the caching algorithm since "f1/f1/" exists but "f1/f2/" and "f3/" do
not.
This is difficult to notice when running performance tests using the Git
repository (or a blow-up of the Git repository, as in
p2000-sparse-operations.sh) because Git has a very shallow directory
structure.
This change reorganizes the caching algorithm to focus on storing the
highest level leading directory that does not exist; specifically this
means that that directory's parent _does_ exist. By doing a little extra
work on a path passed to path_found(), we can short-circuit all of the
paths passed to path_found() afterwards that match a prefix with that
non-existing directory. When in a repository where the first sparse file
is likely to have a much deeper path than the first non-existing
directory, this can realize significant gains.
The details of this algorithm require careful attention, so the new
implementation of path_found() has detailed comments, including the use
of a new max_common_dir_prefix() method that may be of independent
interest.
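To make the shape of the new algorithm concrete, here is a self-contained
sketch (an illustration only; the path_found() and max_common_dir_prefix()
shown here follow the description above, not the exact code and signatures
in the sparse-checkout machinery):

    #include <limits.h>
    #include <string.h>
    #include <sys/stat.h>

    /* Length of the longest leading directory prefix shared by two paths. */
    static size_t max_common_dir_prefix(const char *a, const char *b)
    {
        size_t common = 0, i;
        for (i = 0; a[i] && a[i] == b[i]; i++)
            if (a[i] == '/')
                common = i + 1; /* include the '/' */
        return common;
    }

    /* Highest-level leading directory known not to exist ("" if none yet). */
    static char missing_dir[PATH_MAX];

    static int path_found(const char *path)
    {
        struct stat st;
        char dir[PATH_MAX];
        size_t i;

        /* Short-circuit: path lies under a directory we know is missing. */
        if (*missing_dir &&
            max_common_dir_prefix(path, missing_dir) == strlen(missing_dir))
            return 0;

        /* Walk the leading directories; cache the first one that is missing. */
        for (i = 0; path[i]; i++) {
            if (path[i] != '/' || i + 1 >= sizeof(dir))
                continue;
            memcpy(dir, path, i + 1);
            dir[i + 1] = '\0';
            if (lstat(dir, &st) < 0) {
                strcpy(missing_dir, dir); /* its parent exists; cache it */
                return 0;
            }
        }
        return !lstat(path, &st);
    }

The extra lstat() calls in the walk are what establish the exact directory
to cache, which is why the lstat count can go up slightly in repositories
with shallow directory structures, as noted below.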
It's worth noting that this is not universally positive, since we are
doing extra lstat() calls to establish the exact path to cache. In the
blow-up of the Git repository, we can see that the lstat count
_increases_ from 28 to 31. However, these numbers were already
artificially low.
Contributor Elijah Newren created a publicly-available test repository
that demonstrates the difference in these caching algorithms in the most
extreme way. To test, follow these steps:
git clone --sparse https://github.com/newren/gvfs-like-git-bomb
cd gvfs-like-git-bomb
./runme.sh # NOTE: check scripts before running!
At this point, assuming you do not have index.sparse=true set globally,
the index has one million paths with the SKIP_WORKTREE bit and they will
all be sent to path_found() in the sparse loop. You can measure this by
running 'git status' with GIT_TRACE2_PERF=1:
Sparse files in the index: 1,000,000
sparse_lstat_count (before): 200,000
sparse_lstat_count (after): 2
And here are the performance numbers:
Benchmark 1: old
Time (mean ± σ): 397.5 ms ± 4.1 ms
Range (min … max): 391.2 ms … 404.8 ms 10 runs
Benchmark 2: new
Time (mean ± σ): 252.7 ms ± 3.1 ms
Range (min … max): 249.4 ms … 259.5 ms 11 runs
Summary
'new' ran
1.57 ± 0.02 times faster than 'old'
By modifying this example further, we can demonstrate a more realistic
example and include the sparse index expansion. Continue by creating
this directory, confusing both caching algorithms somewhat:
mkdir -p bomb/d/e/f/a/a
Then re-run the 'git status' tests to see these statistics:
Sparse files in the index: 1,000,000
sparse_lstat_count (before): 724,010
sparse_lstat_count (after): 106
Benchmark 1: old
Time (mean ± σ): 753.0 ms ± 3.5 ms
Range (min … max): 749.7 ms … 760.9 ms 10 runs
Benchmark 2: new
Time (mean ± σ): 201.4 ms ± 3.2 ms
Range (min … max): 196.0 ms … 207.9 ms 14 runs
Summary
'new' ran
3.74 ± 0.06 times faster than 'old'
Note that if this repository had a sparse index enabled, the additional
cost of expanding the sparse index affects the total time of these
commands by over four seconds, significantly diminishing the benefit of
the caching algorithm. Having existing paths outside of the
sparse-checkout is a known performance issue for the sparse index and is
a known trade-off for the performance benefits given when no such paths
exist.
Using an internal monorepo with over two million paths at HEAD and a
typical sparse-checkout cone such that the sparse index contains
~190,000 entries (including over two thousand sparse trees), I was able
to measure these lstat counts when one sparse directory actually exists
on disk:
Sparse files in expanded index: 1,841,997
full_lstat_count (before): 1,188,161
full_lstat_count (after): 4,404
This resulted in this absolute time change, on a warm disk:
Time in full loop (before): 13.481 s
Time in full loop (after): 0.081 s
(These times were calculated on a Windows machine, where lstat() is
slower than a similar Linux machine.)
Helped-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Reviewed-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The clear_skip_worktree.. methods already report some statistics about
how many cache entries are checked against path_found() due to having
the skip-worktree bit set. However, due to path_found() performing some
caching, this isn't the only information that would be helpful to
report.
Add a new lstat_count member to the path_found_data struct to count the
number of times path_found() calls lstat(). This will help explain
performance problems in this method and will demonstrate future changes
to the caching algorithm in a more concrete way than end-to-end timings.
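As a rough sketch (the wrapper name here is made up for illustration; the
actual change simply increments the counter next to each lstat() call),
the accounting amounts to:

    #include <stddef.h>
    #include <sys/stat.h>

    struct path_found_data {
        size_t lstat_count;   /* number of lstat() calls issued so far */
        /* ... caching fields omitted ... */
    };

    /* Every stat of a candidate path goes through here so it is counted. */
    static int path_found_lstat(struct path_found_data *data,
                                const char *path, struct stat *st)
    {
        data->lstat_count++;
        return lstat(path, st);
    }

The counter can then be reported as a trace2 data event next to the
existing sparse/full path count statistics.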
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Reviewed-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The path_found() method previously reused strings from the cache entries
the calling methods were using. This prevents string manipulation in
place and causes some odd reallocation before the final lstat() call in
the method.
Refactor the method to use a strbuf and copy the path into it, but only
the parent directory rather than the whole path. This looks like extra
copying when assigning the path to the strbuf, but we save an allocation
by dropping the 'tmp' string; the copy that used to populate 'tmp' now
populates the strbuf instead.
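A minimal sketch of the idea using Git's strbuf API (the helper name is
hypothetical): only the leading directory of the incoming path, trailing
slash included, is copied into a reusable buffer.

    #include "git-compat-util.h"
    #include "strbuf.h"

    /* Copy only the parent directory of 'path' (with its '/') into 'dir'. */
    static void copy_parent_dir(struct strbuf *dir, const char *path)
    {
        const char *slash = strrchr(path, '/');

        strbuf_reset(dir);
        if (slash)
            strbuf_add(dir, path, slash - path + 1);
    }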
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Reviewed-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In advance of changing the behavior of path_found(), take all of the
intermediate data values and group them into a single struct. This
simplifies the method prototype as well as the initialization. Future
changes can be made directly to the struct and method without changing
the callers with this approach.
Note that the clear_path_found_data() method is currently empty, as
there is nothing to free. This method is a placeholder for future
changes that require a non-trivial implementation. The stub is created
now so that callers can start using it and will not need to change in
later patches.
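In outline, the grouping looks something like this (field names are
illustrative rather than the exact dir.c definitions), with the clear
function intentionally left as a no-op placeholder:

    #include <stddef.h>

    struct path_found_data {
        const char *last_dirname;   /* cached parent directory */
        size_t last_dirname_len;
        int last_dir_found;         /* did that directory exist? */
    };

    #define PATH_FOUND_DATA_INIT { 0 }

    static void clear_path_found_data(struct path_found_data *data)
    {
        /* Nothing to free yet; later changes will release owned memory here. */
    }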
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Reviewed-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The clear_skip_worktree_from_present_files() method was introduced in
af6a51875a (repo_read_index: clear SKIP_WORKTREE bit from files present
in worktree, 2022-01-14) to help cases where sparse-checkout is enabled
but some paths outside of the sparse-checkout also exist on disk. This
operation can be slow as it needs to check path existence in a way not
stored in the index, so caching was introduced in d79d299352 (Accelerate
clear_skip_worktree_from_present_files() by caching, 2022-01-14).
This check is particularly confusing in the presence of a sparse index,
as a sparse tree entry corresponding to an existing directory must first
be expanded to a full index before examining the paths within. This is
currently implemented using a 'goto' and a boolean variable to ensure we
restart only once.
Even with that caching, it was noticed that this could take a long time
to execute. 89aaab11a3 (index: add trace2 region for clear skip
worktree, 2022-11-03) introduced trace2 regions to measure this time.
Further, the way the loop repeats itself was slightly confusing and
prone to breakage, so a BUG() statement was added in 8c7abdc596 (index:
raise a bug if the index is materialised more than once, 2022-11-03) to
be sure that the second run of the loop does not hit any sparse trees.
One thing that can be confusing about the current setup is that the
trace2 regions nest and it is not clear that a second loop is running
after a sparse index is expanded. Here is an example of what the regions
look like in a typical case:
| region_enter | ... | label:clear_skip_worktree_from_present_files
| region_enter | ... | ..label:update
| region_leave | ... | ..label:update
| region_enter | ... | ..label:ensure_full_index
| region_enter | ... | ....label:update
| region_leave | ... | ....label:update
| region_leave | ... | ..label:ensure_full_index
| data | ... | ..sparse_path_count:1
| data | ... | ..sparse_path_count_full:269538
| region_leave | ... | label:clear_skip_worktree_from_present_files
One thing that is particularly difficult to understand about these
regions is that most of the time is spent between the close of the
ensure_full_index region and the reporting of the end data. This is
because of the restart of the loop being within the same region as the
first iteration of the loop.
This change refactors the method into two separate methods that are
traced separately. This will be more important later when we change
other features of the methods, but for now the only functional change is
the difference in the structure of the trace regions.
After this change, the same telemetry section is split into three
distinct chunks:
| region_enter | ... | label:clear_skip_worktree_from_present_files_sparse
| data | ... | ..sparse_path_count:1
| region_leave | ... | label:clear_skip_worktree_from_present_files_sparse
| region_enter | ... | label:update
| region_leave | ... | label:update
| region_enter | ... | label:ensure_full_index
| region_enter | ... | ..label:update
| region_leave | ... | ..label:update
| region_leave | ... | label:ensure_full_index
| region_enter | ... | label:clear_skip_worktree_from_present_files_full
| data | ... | ..full_path_count:269538
| region_leave | ... | label:clear_skip_worktree_from_present_files_full
Here, we see the sparse loop terminating early with its first sparse
path being a sparse directory containing a file. Then, that loop's
region terminates before ensure_full_index begins (in this case, the
cache-tree must also be computed). Then, _after_ the index is expanded,
the full loop begins with its own region.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Reviewed-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
"git fetch-pack -k -k" without passing "--lock-pack" (which we
never do ourselves) did not work at all, which has been corrected.
* jk/fetch-pack-fsck-wo-lock-pack:
fetch-pack: fix segfault when fscking without --lock-pack
A helper function shared between two tests had a copy-paste bug,
which has been corrected.
* jk/t5500-typofix:
t5500: fix mistaken $SERVER reference in helper function
An unused extern declaration for mingw has been removed to prevent
it from causing build failure.
* js/mingw-remove-unused-extern-decl:
mingw: drop bogus (and unneeded) declaration of `_pgmptr`
When "git merge" sees that the index cannot be refreshed (e.g. due
to another process doing the same in the background), it died but
after writing MERGE_HEAD etc. files, which was useless for the
purpose to recover from the failure.
* kz/merge-fail-early-upon-refresh-failure:
merge: avoid write merge state when unable to write index
A few of the bundle URI tests point config at a fake bundle; they care
only that the client has been configured with _some_ bundle, but it
doesn't have to actually contain objects.
For the file:// tests, we use "$BUNDLE_URI_REPO_URI/fake.bdl", a
non-existent file inside the actual remote repo. But for git:// and
http:// tests, we use "https://example.com/fake.bdl". This works OK in
practice, but it means we actually make a request to example.com (which
returns a placeholder HTML response). That can be annoying when running
the test suite on a spotty network (it doesn't produce a wrong result,
since we expect it to fail, but it may introduce delays).
We can reduce our dependency on the outside world by using a local URL.
It would work to just do "file://$PWD/fake.bdl" here, since the bundle
code does not care about the actual location. But in the long run I
suspect we may have more restrictions on which protocols can be passed
around as bundle URIs. So instead, let's stick with the file:// repo's
pattern and just point to a bogus name based on the remote repo's URL.
For http this makes perfect sense; we'll make a request to the local
http server and find that there's nothing there. For git:// it's a
little weird, as you wouldn't normally access a bundle file over git://
at all. But it's probably the most reasonable guess we can make for now,
and anybody who tightens protocol selection later will know better
what's the best path forward.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
t5551 tries to access a URL with a bogus hostname and confirms that
http.curloptResolve lets us use this otherwise unresolvable name.
Before doing so, though, we confirm that trying to access the bogus
hostname without http.curloptResolve fails as expected. This isn't
testing Git at all, but is confirming the test's assumptions. That's
often a good thing to do, but in this case it means that we'll actually
try to resolve the external name. Even though it's unlikely that
"gitbogusexamplehost.invalid" would ever resolve, the DNS lookup itself
may take time.
It's probably reasonable to just assume that this obviously-bogus name
would not actually resolve in practice, which lets us reduce our test
suite's dependency on the outside world.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We test how "fetch --set-upstream" behaves when given an invalid URL,
using the bogus URL "http://nosuchdomain.example.com". But finding out
that it is invalid requires an actual DNS lookup.
Reduce our dependency on external factors by using an invalid local
filesystem URL, which works just as well for our purposes.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When describe is run with the '--dirty' flag, we refresh the index
to make sure it is in sync with the filesystem before
determining if the working tree is dirty. However, this is
not done in the codepath where the '--broken' flag is used.
This causes `git describe --broken --dirty` to falsely
report the worktree as dirty if a file merely has
different stat info than what is recorded in the index.
Running `git update-index -q --refresh` to refresh the index
before running diff-index fixes the problem.
Also add tests that deliberately update the stat info of a
file before running describe to verify it behaves correctly.
Reported-by: Paul Millar <paul.millar@desy.de>
Suggested-by: Junio C Hamano <gitster@pobox.com>
Helped-by: Junio C Hamano <gitster@pobox.com>
Helped-by: Phillip Wood <phillip.wood123@gmail.com>
Signed-off-by: Abhijeet Sonar <abhijeet.nkt@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Tom Scogland noticed that `--add-virtual-file` option uses the path
specified as its value as-is, without prepending any value given to
the `--prefix` option like `--add-file` does.
The behaviour has always been that way since the option was
introduced, but the documentation has always been wrong and said
that it would use the value of `--prefix` just like `--add-file`
does.
We could modify the behaviour to make it literally work like the
documentation said, but it would break existing scripts the users
use.
Noticed-by: Tom Scogland <scogland1@llnl.gov>
Acked-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Overriding the date of a commit to be close to "1970-01-01 00:00:00"
with a large enough positive timezone offset for the equivalent GMT time
to be before the epoch is considered valid by `parse_date_basic`. Similar
behaviour occurs when using a date close to "2099-12-31 23:59:59" (the
maximum date allowed by `tm_to_time_t`) with a large enough negative
timezone offset.
This leads to an integer underflow or overflow, respectively, in the
commit timestamp, which is not caught by `git-commit`, but will cause
other services to fail, such as `git-fsck`, which, for the first case,
reports "badDateOverflow: invalid author/committer line - date causes
integer overflow".
Instead check the timezone offset and fail if the resulting time comes
before the epoch "1970-01-01T00:00:00Z" or after the maximum date
"2099-12-31T23:59:59Z".
Using the REQUIRE_64BIT_TIME prerequisite, make sure that the tests
near the end of Git time (aka the end of year 2099) are not attempted on
purely 32-bit systems, as they cannot express a timestamp beyond 2038
anyway.
Signed-off-by: Darcy Burke <acednes@gmail.com>
[jc: fixups for 32-bit platforms]
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The system must support 64-bit time and its time_t must be 64-bit
wide to pass these tests. Combine these two prerequisites together
to simplify the tests. In theory, they could be fulfilled
independently and tests could require only one without the other,
but in practice, these must come hand-in-hand.
Update the "check_parse" test helper to pay attention to the
REQUIRE_64BIT_TIME variable, which can be set to the HAVE_64BIT_TIME
prerequisite so that a parse test can be skipped on 32-bit systems.
This will be used in the next step to skip tests for timestamps near
the end of year 2099, as 32-bit systems will not be able to express
a timestamp beyond 2038 anyway.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
After we are done using Bloom filters, we do not currently clean up any
memory allocated by the commit slab used to store those filters in the
first place.
Besides the bloom_filter structures themselves, there is mostly nothing
to free() in the first place, since in the read-only path every Bloom
filter's `data` member points to a memory mapped region in the
commit-graph file itself.
But when generating Bloom filters from scratch (or initializing
truncated filters) we allocate additional memory to store the filter's
data.
Keep track of when we need to free() this additional chunk of memory by
using an extra pointer `to_free`. Most of the time this will be NULL
(indicating that we are representing an existing Bloom filter stored in
a memory mapped region). When it is non-NULL, free it before discarding
the Bloom filters slab.
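Sketched out (member names follow the description above; this is an
illustration rather than the exact bloom.h definition), the ownership
scheme is:

    #include <stdlib.h>

    struct bloom_filter {
        unsigned char *data;    /* filter bits; usually point into the mmap */
        size_t len;
        unsigned char *to_free; /* non-NULL only if we allocated 'data' */
    };

    static void free_one_bloom_filter(struct bloom_filter *filter)
    {
        free(filter->to_free);  /* no-op for mmap-backed filters */
        filter->data = NULL;
        filter->to_free = NULL;
    }

Discarding the slab then walks each initialized filter and frees only the
chunks we allocated ourselves.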
Suggested-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In an earlier commit, a bug was described where it's possible for Git to
produce non-murmur3 hashes when the platform's "char" type is signed,
and there are paths with characters whose highest bit is set (i.e. all
characters >= 0x80).
That patch allows the caller to control which version of Bloom filters
is read and written. However, even on platforms with a signed "char"
type, it is possible to reuse existing Bloom filters if and only if
there are no changed paths in any commit's first-parent tree-diff whose
characters have their highest bit set.
When this is the case, we can reuse the existing filter without having
to compute a new one. This is done by marking trees which are known to
have (or not have) any such paths. When a commit's root tree is verified
to not have any such paths, we mark it as such and declare that the
commit's Bloom filter is reusable.
Note that this heuristic only goes in one direction. If neither a commit
nor its first parent have any paths in their trees with non-ASCII
characters, then we know for certain that a path with non-ASCII
characters will not appear in a tree-diff against that commit's first
parent. The reverse isn't necessarily true: just because the tree-diff
doesn't contain any such paths does not imply that no such paths exist
in either tree.
So we end up recomputing some Bloom filters that we don't strictly have
to (i.e. their bits are the same no matter which version of murmur3 we
use). But culling these out is impossible, since we'd have to perform
the full tree-diff, which is the same effort as computing the Bloom
filter from scratch.
But because we can cache our results in each tree's flag bits, we can
often avoid recomputing many filters, thereby reducing the time it takes
to run
$ git commit-graph write --changed-paths --reachable
when upgrading from v1 to v2 Bloom filters.
To benchmark this, let's generate a commit-graph in linux.git with v1
changed-paths in generation order[^1]:
$ git clone git@github.com:torvalds/linux.git
$ cd linux
$ git commit-graph write --reachable --changed-paths
$ graph=".git/objects/info/commit-graph"
$ mv $graph{,.bak}
Then let's time how long it takes to go from v1 to v2 filters (with and
without the upgrade path enabled), resetting the state of the
commit-graph each time:
$ git config commitGraph.changedPathsVersion 2
$ hyperfine -p 'cp -f $graph.bak $graph' -L v 0,1 \
'GIT_TEST_UPGRADE_BLOOM_FILTERS={v} git.compile commit-graph write --reachable --changed-paths'
On linux.git (where there aren't any non-ASCII paths), the timings
indicate that this patch represents a speed-up over recomputing all
Bloom filters from scratch:
Benchmark 1: GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit-graph write --reachable --changed-paths
Time (mean ± σ): 124.873 s ± 0.316 s [User: 124.081 s, System: 0.643 s]
Range (min … max): 124.621 s … 125.227 s 3 runs
Benchmark 2: GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths
Time (mean ± σ): 79.271 s ± 0.163 s [User: 74.611 s, System: 4.521 s]
Range (min … max): 79.112 s … 79.437 s 3 runs
Summary
'GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths' ran
1.58 ± 0.01 times faster than 'GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit-graph write --reachable --changed-paths'
On git.git, we do have some non-ASCII paths, giving us a more modest
improvement from 4.163 seconds to 3.348 seconds, for a 1.24x speed-up.
On my machine, the stats for git.git are:
- 8,285 Bloom filters computed from scratch
- 10 Bloom filters generated as empty
- 4 Bloom filters generated as truncated due to too many changed paths
- 65,114 Bloom filters were reused when transitioning from v1 to v2.
[^1]: Note that this is important, since `--stdin-packs` or
`--stdin-commits` order commits in the commit-graph by their pack
position (with `--stdin-packs`) or by their order in the raw input
(with `--stdin-commits`).
Since we compute Bloom filters in the same order that commits appear
in the graph, we must see a commit's (first) parent before we process
the commit itself. This is only guaranteed to happen when sorting
commits by their generation number.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The murmur3 implementation in bloom.c has a bug when converting series
of 4 bytes into network-order integers when char is signed (which is
controllable by a compiler option, and the default signedness of char is
platform-specific). When a string contains characters with the high bit
set, this bug causes results that, although internally consistent within
Git, do not accord with other implementations of murmur3 (thus,
the changed path filters wouldn't be readable by other off-the-shelf
implementations of murmur3), and differ even between Git binaries
compiled with different signedness of char. This bug affects both how
Git writes changed path filters to disk and how Git interprets changed
path filters on disk.
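For illustration only (this is not the actual bloom.c code), the heart of
the problem is how a 'char' with its high bit set widens to a 32-bit value:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        const char buf[] = { '\x99', '\xaa', '\xbb', '\xcc' };

        /* 0xffffff99 when char is signed; the murmur3 block is corrupted. */
        uint32_t wrong = (uint32_t)buf[0];
        /* 0x00000099 regardless of the signedness of char. */
        uint32_t right = (uint32_t)(unsigned char)buf[0];

        printf("sign-extended: %08x, expected: %08x\n",
               (unsigned)wrong, (unsigned)right);
        return 0;
    }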
Therefore, introduce a new version (2) of changed path filters that
corrects this problem. The existing version (1) is still supported and
is still the default, but users should migrate away from it as soon
as possible.
Because this bug only manifests with characters that have the high bit
set, it may be possible that some (or all) commits in a given repo would
have the same changed path filter both before and after this fix is
applied. However, in order to determine whether this is the case, the
changed paths would first have to be computed, at which point it is not
much more expensive to just compute a new changed path filter.
So this patch does not include any mechanism to "salvage" changed path
filters from repositories. There is also no "mixed" mode - for each
invocation of Git, reading and writing changed path filters are done
with the same version number; this version number may be explicitly
stated (typically if the user knows which version they need) or
automatically determined from the version of the existing changed path
filters in the repository.
There is a change in write_commit_graph(): graph_read_bloom_data()
makes it possible for chunk_bloom_data to be non-NULL while
bloom_filter_settings is NULL, which causes a segfault later on. I
produced such a segfault while developing this patch, but couldn't find
a way to reproduce it either with this complete patch or without it;
in any case, the guard seemed like a good thing to include that might
help future patch authors.
The value in t0095 was obtained from another murmur3 implementation
using the following Go source code:
package main
import "fmt"
import "github.com/spaolacci/murmur3"
func main() {
        fmt.Printf("%x\n", murmur3.Sum32([]byte("Hello world!")))
        fmt.Printf("%x\n", murmur3.Sum32([]byte{0x99, 0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff}))
}
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In an earlier commit, we began ignoring the Bloom data ("BDAT") chunk
for commit-graphs whose Bloom filters were computed using a hash version
incompatible with the value of `commitGraph.changedPathsVersion`.
Now that the Bloom API has been hardened to discard these incompatible
filters (with the exception of low-level APIs), we can safely load these
Bloom filters unconditionally.
We no longer want to return early from `graph_read_bloom_data()`, and
similarly do not want to set the bloom_settings' `hash_version` field as
a side-effect. The latter is because we want to wait until we know which
Bloom settings we're using (either the defaults, from the GIT_TEST
variables, or from the previous commit-graph layer) before deciding what
hash_version to use.
If we detect an existing BDAT chunk, we'll infer the rest of the
settings (e.g., number of hashes, bits per entry, and maximum number of
changed paths) from the earlier graph layer. The hash_version will be
inferred from the previous layer as well, unless one has already been
specified via configuration.
Once all of that is done, we normalize the value of the hash_version to
either "1" or "2".
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Callers use the inline `get_bloom_filter()` implementation as a thin
wrapper around `get_or_compute_bloom_filter()`. The former calls the
latter with a value of "0" for `compute_if_not_present`, making
`get_bloom_filter()` the default read-only path for fetching an existing
Bloom filter.
Callers expect that the value returned from `get_bloom_filter()` is
usable, that is, compatible with the configured value corresponding to
`commitGraph.changedPathsVersion`.
This is OK, since the commit-graph machinery only initializes its BDAT
chunk (thereby enabling it to service Bloom filter queries) when the
Bloom filter hash_version is compatible with our settings. So any value
returned by `get_bloom_filter()` is trivially usable.
However, subsequent commits will load the BDAT chunk even when the Bloom
filters are built with incompatible hash versions. Prepare to handle
this by teaching `get_bloom_filter()` to discard filters that are
incompatible with the configured hash version.
Callers who wish to read incompatible filters (e.g., for upgrading
filters from v1 to v2) may use the lower level routine,
`get_or_compute_bloom_filter()`.
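The compatibility rule itself is simple; a hypothetical helper capturing
it could look like this (the -1 "automatic" value is an assumption made
for the sake of illustration):

    /*
     * May a filter computed with 'filter_version' be used when
     * commitGraph.changedPathsVersion is 'configured_version'?
     * Here -1 stands for "use whatever the commit-graph already has".
     */
    static int bloom_filter_usable(int filter_version, int configured_version)
    {
        return configured_version == -1 ||
               configured_version == filter_version;
    }

`get_bloom_filter()` returns NULL when the filter is unusable by this
rule, while `get_or_compute_bloom_filter()` still hands back the raw
filter for callers that know what they are doing.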
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In subsequent commits, we will want to load existing Bloom filters out
of a commit-graph, even when the hash version they were computed with
does not match the value of `commitGraph.changedPathsVersion`.
In order to differentiate between the two, add a "version" field to each
Bloom filter.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
A subsequent commit will introduce another version of the changed-path
filter in the commit graph file. In order to control which version to
write (and read), a config variable is needed.
Therefore, introduce this config variable. For forwards compatibility,
teach Git to not read commit graphs when the config variable
is set to an unsupported version. Because we teach Git this,
commitgraph.readChangedPaths is now redundant, so deprecate it and
define its behavior in terms of the config variable we introduce.
This commit does not change the behavior of writing (Git writes changed
path filters when explicitly instructed regardless of any config
variable), but a subsequent commit will restrict Git such that it will
only write when commitgraph.changedPathsVersion is a recognized value.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Subsequent commits will teach Git another version of the changed-path
filter that behaves differently for paths containing at least one
character with its high bit set, so test the existing behavior as a
baseline.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Implement a mode of the "read-graph" test helper to dump out the
hexadecimal contents of the Bloom filter(s) contained in a commit-graph.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Prepare for a future commit to use the load_bloom_filter_from_graph()
function directly to load specific Bloom filters out of the commit-graph
for manual inspection (to be used during tests).
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Prepare for the 'read-graph' test helper to perform other tasks besides
dumping high-level information about the commit-graph by extracting its
main routine into a separate function.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The code change to Git to support version 2 will be done in subsequent
commits.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The changed-path Bloom filter mechanism is parameterized by a couple of
variables, notably the number of bits per hash (typically "m" in Bloom
filter literature) and the number of hashes themselves (typically "k").
It is critically important that filters are read with the Bloom filter
settings that they were written with. Failing to do so would mean that
each query is liable to compute different fingerprints, meaning that the
filter itself could return a false negative. This goes against a basic
assumption of using Bloom filters (that they may return false positives,
but never false negatives) and can lead to incorrect results.
We have some existing logic to carry forward existing Bloom filter
settings from one layer to the next. In `write_commit_graph()`, we have
something like:
if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {
struct commit_graph *g = ctx->r->objects->commit_graph;
/* We have changed-paths already. Keep them in the next graph */
if (g && g->chunk_bloom_data) {
ctx->changed_paths = 1;
ctx->bloom_settings = g->bloom_filter_settings;
}
}
, which drags forward Bloom filter settings across adjacent layers.
This doesn't quite address all cases, however, since it is possible for
intermediate layers to contain no Bloom filters at all. For example,
suppose we have two layers in a commit-graph chain, say, {G1, G2}. If G1
contains Bloom filters, but G2 doesn't, a new G3 (whose base graph is
G2) may be written with arbitrary Bloom filter settings, because we only
check the immediately adjacent layer's settings for compatibility.
This behavior has existed since the introduction of changed-path Bloom
filters. But in practice, this is not such a big deal, since the only
way up until this point to modify the Bloom filter settings at write
time is with the undocumented environment variables:
- GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY
- GIT_TEST_BLOOM_SETTINGS_NUM_HASHES
- GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS
(it is still possible to tweak MAX_CHANGED_PATHS between layers, but
this does not affect reads, so is allowed to differ across multiple
graph layers).
But in future commits, we will introduce another parameter to change the
hash algorithm used to compute Bloom fingerprints itself. This will be
exposed via a configuration setting, making this foot-gun easier to use.
To prevent this potential issue, validate that all layers of a split
commit-graph have compatible settings with the newest layer which
contains Bloom filters.
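In outline, the validation walks the chain from the newest layer downward
and insists that every layer carrying Bloom data agrees with the newest
such layer (a sketch; the structs mirror the description above rather
than the exact commit-graph.h layout):

    #include <stddef.h>

    struct bloom_filter_settings {
        unsigned bits_per_entry;
        unsigned num_hashes;
        unsigned hash_version;
    };

    struct commit_graph {
        struct commit_graph *base_graph;              /* older layer, or NULL */
        struct bloom_filter_settings *bloom_settings; /* NULL without BDAT */
    };

    /* Return the settings all layers must match, or NULL to disable Bloom. */
    static struct bloom_filter_settings *
    validate_layered_bloom_settings(struct commit_graph *g)
    {
        struct bloom_filter_settings *newest = NULL;

        for (; g; g = g->base_graph) {
            if (!g->bloom_settings)
                continue;
            if (!newest) {
                newest = g->bloom_settings;
                continue;
            }
            if (g->bloom_settings->bits_per_entry != newest->bits_per_entry ||
                g->bloom_settings->num_hashes != newest->num_hashes ||
                g->bloom_settings->hash_version != newest->hash_version)
                return NULL; /* incompatible layers; ignore Bloom data */
        }
        return newest;
    }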
Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Original-test-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The commit-graph stores changed-path Bloom filters which represent the
set of paths included in a tree-level diff between a commit's root tree
and that of its parent.
When a commit has no parents, the tree-diff is computed against that
commit's root tree and the empty tree. In other words, every path in
that commit's tree is stored in the Bloom filter (since they all appear
in the diff).
Consult these filters during pathspec-limited traversals in the function
`rev_same_tree_as_empty()`. Doing so yields a performance improvement
where we can avoid enumerating the full set of paths in a parentless
commit's root tree when we know that the path(s) of interest were not
listed in that commit's changed-path Bloom filter.
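Schematically (hypothetical helper names, not the revision.c code): if a
parentless commit has a usable changed-path filter and that filter says
the path of interest is definitely absent, the root-tree enumeration can
be skipped, relying on the fact that Bloom filters never return false
negatives.

    struct commit;
    struct bloom_filter;

    /* Assumed helpers standing in for the Bloom filter query machinery. */
    extern struct bloom_filter *get_usable_filter(struct commit *commit);
    extern int maybe_changed_path(struct bloom_filter *filter, const char *path);

    static int root_tree_may_match(struct commit *commit, const char *path)
    {
        struct bloom_filter *filter = get_usable_filter(commit);

        if (filter && !maybe_changed_path(filter, path))
            return 0; /* definitely not in the parentless commit's diff */
        return 1;     /* fall back to enumerating the root tree */
    }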
Suggested-by: SZEDER Gábor <szeder.dev@gmail.com>
Original-patch-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The existing implementation of test_bloom_filters_not_used() asserts
that the Bloom filter sub-system has not been initialized at all, by
checking that no data from it appears in the trace2 output.
In the following commit, it will become possible to load Bloom filters
without using them (e.g., because the `commitGraph.changedPathsVersion`
introduced later in this series is incompatible with the hash version
with which the commit-graph's Bloom filters were written).
When that happens, it is possible to initialize the Bloom filter
sub-system while still not using any Bloom filters. In that case, check
that the data dump from the Bloom sub-system is all zeros,
indicating that no filters were used.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When trying to execute a non-existent program from GIT_PAGER, we display
an error. However, we also send the complete text to the terminal
and return a successful exit code. This can be confusing for the user
and the displayed error could easily become obscured by a lengthy
text.
For example, here the error message would be very far above after
sending 50 MB of text:
$ GIT_PAGER=non-existent t/test-terminal.perl git log | wc -c
error: cannot run non-existent: No such file or directory
50314363
Let's make the error clear by aborting the process and returning an
error so that the user can easily correct their mistake.
This will be the result of the change:
$ GIT_PAGER=non-existent t/test-terminal.perl git log | wc -c
error: cannot run non-existent: No such file or directory
fatal: unable to execute pager 'non-existent'
0
The behavior change we're introducing in this commit affects two tests
in t7006, which is a good sign regarding test coverage, and requires us
to adjust them.
The first test is 'git skips paging non-existing command'. This test
comes from f7991f01f2 (t7006: clean up SIGPIPE handling in trace2 tests,
2021-11-21), where a modification was made to a test that was originally
introduced in c24b7f6736 (pager: test for exit code with and without
SIGPIPE, 2021-02-02). That original test was, IMHO, going in the same
direction we're going in this commit.
At any rate, this test obviously needs to be adjusted to check the new
behavior we are introducing. Do it.
The second test being affected is: 'non-existent pager doesnt cause
crash', introduced in f917f57f40 (pager: fix crash when pager program
doesn't exist, 2021-11-24). As its name states, it has the intention of
checking that we don't introduce a regression that produces a crash when
GIT_PAGER points to a nonexistent program.
This test could be considered redundant nowadays, due to us already
having several tests checking implicitly what a non-existent command in
GIT_PAGER produces. However, let's maintain a good belt-and-suspenders
strategy; adapt it to the new world.
Finally, it's worth noting that we are not changing the behavior if the
command specified in GIT_PAGER is a shell command. In such cases, the
result is:
$ GIT_PAGER=:\;non-existent t/test-terminal.perl git log
:;non-existent: 1: non-existent: not found
died of signal 13 at t/test-terminal.perl line 33.
Signed-off-by: Rubén Justo <rjusto@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>