Derrick Stolee b16a827764 bloom/diff: properly short-circuit on max_changes
Commit e3696980 (diff: halt tree-diff early after max_changes,
2020-03-30) intended to create a mechanism to short-circuit a diff
calculation after a certain number of paths were modified. By
incrementing a "num_changes" counter throughout the recursive
ll_diff_tree_paths(), this was supposed to match the number of changes
that would be written into the changed-path Bloom filters.
Unfortunately, this was not implemented correctly: the counter misses
simple cases like file modifications. As a result, very large
changed-path filters are still written unless the commit adds or removes
many files.

To start, change the implementation in ll_diff_tree_paths() to instead
use the global diff_queued_diff struct's 'nr' member as the count. This
simplifies the logic and avoids making more mistakes in the complicated
diff code.
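
As a rough illustration of why counting queue entries is more robust
than a separate counter, here is a minimal standalone sketch. It is not
the code in tree-diff.c: 'struct change_queue', queue_change() and
walk_changes() are hypothetical names, and treating a max_changes of 0
as "no limit" is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

enum change_kind { CHANGE_ADD, CHANGE_REMOVE, CHANGE_MODIFY };

struct change {
	enum change_kind kind;
	const char *path;
};

/* Hypothetical stand-in for the diff queue; only its length matters here. */
struct change_queue {
	size_t nr;
};

/* Every kind of change goes through the same queueing function. */
static void queue_change(struct change_queue *q, const struct change *c)
{
	(void)c; /* a real queue would store the entry */
	q->nr++;
}

/*
 * Queue changes one by one and halt once the queue holds more than
 * max_changes entries. Returns 1 if the walk halted early, 0 otherwise.
 * Assumption of this sketch: max_changes == 0 disables the limit.
 */
static int walk_changes(struct change_queue *q, const struct change *changes,
			size_t n, size_t max_changes)
{
	size_t i;

	assert(q && (changes || !n));
	for (i = 0; i < n; i++) {
		queue_change(q, &changes[i]);
		if (max_changes && q->nr > max_changes)
			return 1;
	}
	return 0;
}
```

Because the limit is checked against the queue length, a modification
counts exactly like an add or a remove; no per-case counter update can
be forgotten.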

This has a drawback: the diff_queued_diff struct only lists the paths
corresponding to blob changes, not their leading directories. Thus,
get_or_compute_bloom_filter() needs an additional check to see if the
hashmap with the leading directories becomes too large.
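
The two checks can be sketched as follows. This is a standalone
illustration, not the code in bloom.c: 'struct path_set' (a simple
linear-search string set standing in for the hashmap) and
count_filter_entries() are hypothetical, and "too large" is assumed to
mean strictly greater than the limit.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the hashmap of paths: a tiny string set. */
struct path_set {
	char **items;
	size_t nr, alloc;
};

/* Add the first 'len' bytes of 'p' unless an equal string is present. */
static void path_set_add(struct path_set *s, const char *p, size_t len)
{
	size_t i;
	char *copy;

	for (i = 0; i < s->nr; i++)
		if (strlen(s->items[i]) == len && !memcmp(s->items[i], p, len))
			return;
	if (s->nr == s->alloc) {
		size_t alloc = s->alloc ? 2 * s->alloc : 16;
		char **items = realloc(s->items, alloc * sizeof(*items));
		if (!items)
			abort();
		s->items = items;
		s->alloc = alloc;
	}
	copy = malloc(len + 1);
	if (!copy)
		abort();
	memcpy(copy, p, len);
	copy[len] = '\0';
	s->items[s->nr++] = copy;
}

static void path_set_clear(struct path_set *s)
{
	size_t i;

	for (i = 0; i < s->nr; i++)
		free(s->items[i]);
	free(s->items);
	s->items = NULL;
	s->nr = s->alloc = 0;
}

/*
 * Count the entries a filter would hold: each changed path plus each of
 * its leading directories, without duplicates. Returns -1 as soon as
 * the count exceeds max_changes.
 */
static int count_filter_entries(const char **paths, size_t nr,
				size_t max_changes)
{
	struct path_set set = { NULL, 0, 0 };
	int result;
	size_t i;

	assert(paths || !nr);
	/* First check: the changed blobs alone already exceed the limit. */
	if (nr > max_changes)
		return -1;
	for (i = 0; i < nr; i++) {
		const char *p = paths[i];
		const char *slash = p;

		/* "a/b/c" contributes the directories "a" and "a/b". */
		while ((slash = strchr(slash, '/'))) {
			path_set_add(&set, p, (size_t)(slash - p));
			slash++;
		}
		path_set_add(&set, p, strlen(p));
		/* Second check: directories pushed the set over the limit. */
		if (set.nr > max_changes) {
			path_set_clear(&set);
			return -1;
		}
	}
	result = (int)set.nr;
	path_set_clear(&set);
	return result;
}
```

For the paths "a/b/c.txt", "a/b/d.txt" and "e.txt" the set holds five
entries (two directories and three files), so a limit of 5 is accepted
while a limit of 4 is rejected, even though only three blobs changed.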

One reason this was not caught by test cases is that the test in
t4216-log-bloom.sh that was supposed to check this "too many changes"
condition only checked it on the initial commit of a repository. The
old logic counted that case correctly, since an initial commit only
adds files. Update this test in a few ways:

1. Use GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS to reduce the limit,
   allowing smaller commits to engage with this logic.

2. Create several interesting cases of edits, adds, removes, and mode
   changes (in the second commit). By testing both sides of the
   inequality with the *_MAX_CHANGED_PATHS variable, we can see that
   the count is exactly correct, so none of these changes are missed
   or over-counted.

3. Use the trace2 data value filter_found_large to verify that these
   commits are on the correct side of the limit.

Another way to verify the behavior is correct is through performance
tests. By testing on my local copies of the Git repository and the Linux
kernel repository, I could measure the effect of these short-circuits
when computing a fresh commit-graph file with changed-path Bloom filters
using the command

  GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=N time \
    git commit-graph write --reachable --changed-paths

and reporting the wall time and resulting commit-graph size.

For Git, the results are

|        |      N=1       |       N=10     |      N=512     |
|--------|----------------|----------------|----------------|
| HEAD~1 | 10.90s  9.18MB | 11.11s  9.34MB | 11.31s  9.35MB |
| HEAD   |  9.21s  8.62MB | 11.11s  9.29MB | 11.29s  9.34MB |

For Linux, the results are

|        |       N=1      |     N=20      |     N=512     |
|--------|----------------|---------------|---------------|
| HEAD~1 | 61.28s  64.3MB | 76.9s  72.6MB | 77.6s  72.6MB |
| HEAD   | 49.44s  56.3MB | 68.7s  65.9MB | 69.2s  65.9MB |

Naturally, the improvement shrinks as the limit grows, since fewer
commits trigger the short-circuit.

Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17 09:31:25 -07:00
.github Merge branch 'es/advertise-contribution-doc' 2020-06-17 21:54:06 -07:00
Documentation commit-graph: respect 'commitGraph.readChangedPaths' 2020-09-09 12:51:48 -07:00
block-sha1
builtin commit-graph: pass a 'struct repository *' in more places 2020-09-09 12:51:48 -07:00
ci ci: use absolute PYTHON_PATH in the Linux jobs 2020-07-23 15:32:06 -07:00
compat Merge branch 'js/msvc-build-fix' 2020-06-17 21:54:03 -07:00
contrib Merge branch 'mp/complete-show-color-moved' 2020-08-04 13:53:56 -07:00
ewah
git-gui Merge https://github.com/prati0100/git-gui into master 2020-07-20 12:04:06 -07:00
gitk-git
gitweb
mergetools
negotiator
perl perl: make SVN code hash independent 2020-06-22 11:21:07 -07:00
po Merge branch 'master' of github.com:Softcatala/git-po 2020-07-27 00:05:41 +08:00
ppc
refs refs: move the logic to add \t to reflog to the files backend 2020-07-31 10:21:51 -07:00
sha1collisiondetection@855827c583
sha1dc
sha256
t bloom/diff: properly short-circuit on max_changes 2020-09-17 09:31:25 -07:00
templates
trace2
vcs-svn
xdiff
.cirrus.yml
.clang-format
.editorconfig
.gitattributes
.gitignore
.gitmodules
.mailmap
.travis.yml
.tsan-suppressions
CODE_OF_CONDUCT.md
COPYING
GIT-VERSION-GEN Git 2.28 2020-07-26 18:01:43 -07:00
INSTALL
LGPL-2.1
Makefile Merge branch 'lo/sparse-universal-zero-init' 2020-06-02 13:35:04 -07:00
README.md
RelNotes First batch post 2.28 2020-07-30 13:20:36 -07:00
abspath.c
aclocal.m4
add-interactive.c
add-interactive.h
add-patch.c comment: fix spelling mistakes inside comments 2020-07-29 11:39:40 -07:00
advice.c
advice.h
alias.c
alias.h
alloc.c commit: move members graph_pos, generation to a slab 2020-06-17 14:37:30 -07:00
alloc.h object: drop parsed_object_pool->commit_count 2020-06-17 14:37:14 -07:00
apply.c
apply.h
archive-tar.c
archive-zip.c
archive.c
archive.h
argv-array.c
argv-array.h
attr.c
attr.h
banned.h
base85.c
bisect.c
bisect.h
blame.c bloom: split 'get_bloom_filter()' in two 2020-09-17 09:31:25 -07:00
blame.h
blob.c object: drop parsed_object_pool->commit_count 2020-06-17 14:37:14 -07:00
blob.h
bloom.c bloom/diff: properly short-circuit on max_changes 2020-09-17 09:31:25 -07:00
bloom.h bloom: use provided 'struct bloom_filter_settings' 2020-09-17 09:31:25 -07:00
branch.c Merge branch 'es/get-worktrees-unsort' 2020-07-06 22:09:15 -07:00
branch.h
bugreport.c Merge branch 'rs/retire-strbuf-write-fd' 2020-06-29 14:17:26 -07:00
builtin.h
bulk-checkin.c
bulk-checkin.h
bundle.c bundle: detect hash algorithm when reading refs 2020-06-19 14:04:09 -07:00
bundle.h bundle: detect hash algorithm when reading refs 2020-06-19 14:04:09 -07:00
cache-tree.c
cache-tree.h
cache.h Merge branch 'jk/reject-newer-extensions-in-v0' into master 2020-07-30 13:20:32 -07:00
chdir-notify.c
chdir-notify.h
check-builtins.sh
check_bindir
checkout.c
checkout.h
color.c
color.h
column.c comment: fix spelling mistakes inside comments 2020-07-29 11:39:40 -07:00
column.h
combine-diff.c
command-list.txt bash-completion: add git-prune into bash completion 2020-06-22 11:29:38 -07:00
commit-graph.c bloom: use provided 'struct bloom_filter_settings' 2020-09-17 09:31:25 -07:00
commit-graph.h commit-graph: pass a 'struct repository *' in more places 2020-09-09 12:51:48 -07:00
commit-reach.c Merge branch 'cb/is-descendant-of' 2020-07-06 22:09:16 -07:00
commit-reach.h commit-reach: avoid is_descendant_of() shim 2020-06-23 16:36:53 -07:00
commit-slab-decl.h Merge branch 'sg/commit-graph-cleanups' into master 2020-07-30 13:20:30 -07:00
commit-slab-impl.h commit-slab: add a function to deep free entries on the slab 2020-06-08 12:28:49 -07:00
commit-slab.h commit-slab: add a function to deep free entries on the slab 2020-06-08 12:28:49 -07:00
commit.c Merge branch 'tb/fix-persistent-shallow' into master 2020-07-09 14:00:44 -07:00
commit.h commit: move members graph_pos, generation to a slab 2020-06-17 14:37:30 -07:00
common-main.c
config.c
config.h
config.mak.dev
config.mak.in
config.mak.uname Merge branch 'cb/no-more-gmtime' 2020-05-20 08:33:27 -07:00
configure.ac
connect.c Merge branch 'bc/sha-256-part-2' 2020-07-06 22:09:13 -07:00
connect.h Merge branch 'bc/sha-256-part-2' 2020-07-06 22:09:13 -07:00
connected.c fetch-pack: support more than one pack lockfile 2020-06-10 18:06:34 -07:00
connected.h
convert.c
convert.h
copy.c
credential-cache--daemon.c
credential-cache.c
credential-store.c
credential.c
credential.h
csum-file.c
csum-file.h
ctype.c
daemon.c
date.c
decorate.c
decorate.h
delta-islands.c
delta-islands.h
delta.h
detect-compiler
diff-delta.c
diff-lib.c diff-files --raw: show correct post-image of intent-to-add files 2020-07-01 16:15:43 -07:00
diff-no-index.c
diff.c Merge branch 'jk/diff-memuse-optim-with-stat-unmatch' 2020-06-17 21:54:00 -07:00
diff.h bloom/diff: properly short-circuit on max_changes 2020-09-17 09:31:25 -07:00
diffcore-break.c
diffcore-delta.c
diffcore-order.c
diffcore-pickaxe.c
diffcore-rename.c
diffcore.h
dir-iterator.c
dir-iterator.h
dir.c Merge branch 'en/fill-directory-exponential' into master 2020-07-30 13:20:36 -07:00
dir.h
editor.c
entry.c Merge branch 'mt/entry-fstat-fallback-fix' into master 2020-07-09 14:00:45 -07:00
environment.c
exec-cmd.c
exec-cmd.h
fast-import.c Merge branch 'en/fast-import-looser-date' 2020-06-02 13:35:05 -07:00
fetch-negotiator.c
fetch-negotiator.h
fetch-pack.c Merge branch 'bc/sha-256-part-2' 2020-07-06 22:09:13 -07:00
fetch-pack.h fetch-pack: support more than one pack lockfile 2020-06-10 18:06:34 -07:00
fmt-merge-msg.c fmt-merge-msg: allow merge destination to be omitted again 2020-07-30 12:43:10 -07:00
fmt-merge-msg.h
fsck.c Merge branch 'rs/fsck-duplicate-names-in-trees' 2020-06-08 18:06:29 -07:00
fsck.h
fsmonitor.c Remove doubled words in various comments 2020-07-28 14:28:14 -07:00
fsmonitor.h
fuzz-commit-graph.c commit-graph: pass a 'struct repository *' in more places 2020-09-09 12:51:48 -07:00
fuzz-pack-headers.c
fuzz-pack-idx.c
generate-cmdlist.sh
generate-configlist.sh
gettext.c
gettext.h
git-add--interactive.perl checkout -p: handle new files correctly 2020-05-27 14:50:20 -07:00
git-archimport.perl
git-bisect.sh bisect: treat BISECT_HEAD as a pseudo ref 2020-07-10 13:53:37 -07:00
git-compat-util.h Merge branch 'bc/sha-256-part-2' 2020-07-06 22:09:13 -07:00
git-cvsexportcommit.perl git-cvsexportcommit: port to SHA-256 2020-06-22 11:21:07 -07:00
git-cvsimport.perl git-cvsimport: port to SHA-256 2020-06-22 11:21:07 -07:00
git-cvsserver.perl git-cvsserver: port to SHA-256 2020-06-22 11:21:07 -07:00
git-difftool--helper.sh
git-filter-branch.sh
git-instaweb.sh
git-merge-octopus.sh
git-merge-one-file.sh
git-merge-resolve.sh
git-mergetool--lib.sh
git-mergetool.sh
git-p4.py Merge branch 'bk/p4-prepare-p4-only-fix' 2020-06-02 13:35:01 -07:00
git-parse-remote.sh
git-quiltimport.sh
git-rebase--preserve-merges.sh
git-request-pull.sh
git-send-email.perl send-email: restore --in-reply-to superseding behavior 2020-07-01 16:12:21 -07:00
git-sh-i18n.sh
git-sh-setup.sh
git-submodule.sh submodule: port subcommand 'set-branch' from shell to C 2020-06-02 10:51:54 -07:00
git-svn.perl git-svn: set the OID length based on hash algorithm 2020-06-22 11:21:07 -07:00
git-web--browse.sh
git.c Merge branch 'ta/wait-on-aliased-commands-upon-signal' into master 2020-07-15 16:29:43 -07:00
git.rc
gpg-interface.c
gpg-interface.h
graph.c
graph.h
grep.c comment: fix spelling mistakes inside comments 2020-07-29 11:39:40 -07:00
grep.h
hash.h
hashmap.c
hashmap.h hashmap: fix typo in usage docs 2020-07-28 14:28:15 -07:00
help.c
help.h
hex.c
http-backend.c
http-fetch.c http-fetch: support fetching packfiles by URL 2020-06-10 18:06:34 -07:00
http-push.c Merge branch 'bc/http-push-flagsfix' 2020-07-06 22:09:17 -07:00
http-walker.c http: refactor finish_http_pack_request() 2020-06-10 18:06:34 -07:00
http.c Merge branch 'jt/cdn-offload' 2020-06-25 12:27:47 -07:00
http.h Merge branch 'jt/cdn-offload' 2020-06-25 12:27:47 -07:00
ident.c
imap-send.c
interdiff.c
interdiff.h
iterator.h
json-writer.c
json-writer.h
khash.h
kwset.c
kwset.h
levenshtein.c
levenshtein.h
line-log.c bloom: split 'get_bloom_filter()' in two 2020-09-17 09:31:25 -07:00
line-log.h
line-range.c
line-range.h
linear-assignment.c
linear-assignment.h
list-objects-filter-options.c repository: add a helper function to perform repository format upgrade 2020-06-05 10:13:30 -07:00
list-objects-filter-options.h
list-objects-filter.c
list-objects-filter.h
list-objects.c
list-objects.h
list.h
ll-merge.c
ll-merge.h
lockfile.c
lockfile.h
log-tree.c
log-tree.h
ls-refs.c
ls-refs.h
mailinfo.c
mailinfo.h
mailmap.c
mailmap.h
match-trees.c
mem-pool.c
mem-pool.h
merge-blobs.c
merge-blobs.h
merge-recursive.c
merge-recursive.h
merge.c
mergesort.c
mergesort.h
midx.c
midx.h
name-hash.c
notes-cache.c
notes-cache.h
notes-merge.c
notes-merge.h
notes-utils.c
notes-utils.h
notes.c
notes.h
object-store.h packfile: compute and use the index CRC offset 2020-05-27 10:07:07 -07:00
object.c object: drop parsed_object_pool->commit_count 2020-06-17 14:37:14 -07:00
object.h Merge branch 'tb/fix-persistent-shallow' into master 2020-07-09 14:00:44 -07:00
oid-array.c
oid-array.h
oidmap.c
oidmap.h
oidset.c
oidset.h
pack-bitmap-write.c
pack-bitmap.c
pack-bitmap.h
pack-check.c
pack-objects.c
pack-objects.h
pack-revindex.c
pack-revindex.h
pack-write.c Merge branch 'jb/doc-packfile-name' into master 2020-07-30 21:34:32 -07:00
pack.h
packfile.c packfile: compute and use the index CRC offset 2020-05-27 10:07:07 -07:00
packfile.h
pager.c
parse-options-cb.c
parse-options.c
parse-options.h
patch-delta.c
patch-ids.c
patch-ids.h
path.c
path.h
pathspec.c
pathspec.h
pkt-line.c Merge branch 'bc/sha-256-part-2' 2020-07-06 22:09:13 -07:00
pkt-line.h Merge branch 'bc/sha-256-part-2' 2020-07-06 22:09:13 -07:00
preload-index.c
pretty.c
pretty.h
prio-queue.c
prio-queue.h
progress.c
progress.h
promisor-remote.c
promisor-remote.h
prompt.c
prompt.h
protocol.c config: let feature.experimental imply protocol.version=2 2020-05-21 09:31:42 -07:00
protocol.h
prune-packed.c
prune-packed.h
quote.c
quote.h
range-diff.c
range-diff.h
reachable.c
reachable.h
read-cache.c read-cache: remove bogus shortcut 2020-07-16 10:42:52 -07:00
rebase-interactive.c
rebase-interactive.h
rebase.c
rebase.h
ref-filter.c Merge branch 'sk/typofixes' into master 2020-07-30 21:34:29 -07:00
ref-filter.h
reflog-walk.c
reflog-walk.h
refs.c Merge branch 'hn/reftable' into master 2020-08-01 13:49:13 -07:00
refs.h Merge branch 'js/default-branch-name' 2020-07-06 22:09:17 -07:00
refspec.c
refspec.h
remote-curl.c Merge branch 'bc/push-cas-cquoted-refname' into master 2020-07-30 13:20:34 -07:00
remote-testsvn.c testsvn: respect `init.defaultBranch` 2020-06-24 09:14:21 -07:00
remote.c remote: use the configured default branch name when appropriate 2020-06-24 09:14:21 -07:00
remote.h stateless-connect: send response end packet 2020-05-24 16:26:00 -07:00
replace-object.c
replace-object.h
repo-settings.c commit-graph: respect 'commitGraph.readChangedPaths' 2020-09-09 12:51:48 -07:00
repository.c
repository.h commit-graph: respect 'commitGraph.readChangedPaths' 2020-09-09 12:51:48 -07:00
rerere.c
rerere.h
reset.c
reset.h
resolve-undo.c
resolve-undo.h
revision.c bloom: split 'get_bloom_filter()' in two 2020-09-17 09:31:25 -07:00
revision.h Merge branch 'ds/commit-graph-bloom-updates' into master 2020-07-30 13:20:31 -07:00
run-command.c Merge branch 'ta/wait-on-aliased-commands-upon-signal' into master 2020-07-15 16:29:43 -07:00
run-command.h Merge branch 'ta/wait-on-aliased-commands-upon-signal' into master 2020-07-15 16:29:43 -07:00
send-pack.c Merge branch 'js/default-branch-name' 2020-07-06 22:09:17 -07:00
send-pack.h
sequencer.c
sequencer.h
serve.c Merge branch 'bc/sha-256-part-2' 2020-07-06 22:09:13 -07:00
serve.h
server-info.c
setup.c Merge branch 'jk/reject-newer-extensions-in-v0' into master 2020-07-30 13:20:32 -07:00
sh-i18n--envsubst.c
sha1-file.c Merge branch 'jt/pretend-object-never-come-from-elsewhere' 2020-08-04 13:53:58 -07:00
sha1-lookup.c
sha1-lookup.h
sha1-name.c
sha1dc_git.c
sha1dc_git.h
shallow.c Merge branch 'sg/commit-graph-cleanups' into master 2020-07-30 13:20:30 -07:00
shallow.h
shell.c
shortlog.h
sideband.c
sideband.h
sigchain.c
sigchain.h
split-index.c
split-index.h
stable-qsort.c
strbuf.c Merge branch 'rs/retire-strbuf-write-fd' 2020-06-29 14:17:26 -07:00
strbuf.h Merge branch 'rs/retire-strbuf-write-fd' 2020-06-29 14:17:26 -07:00
streaming.c
streaming.h
string-list.c
string-list.h
sub-process.c
sub-process.h
submodule-config.c
submodule-config.h
submodule.c
submodule.h
symlinks.c
tag.c object: drop parsed_object_pool->commit_count 2020-06-17 14:37:14 -07:00
tag.h
tar.h
tempfile.c
tempfile.h
thread-utils.c
thread-utils.h
tmp-objdir.c
tmp-objdir.h
trace.c
trace.h
trace2.c
trace2.h
trailer.c
trailer.h
transport-helper.c Merge branch 'js/default-branch-name' 2020-07-06 22:09:17 -07:00
transport-internal.h
transport.c Merge branch 'bc/sha-256-part-2' 2020-07-06 22:09:13 -07:00
transport.h Merge branch 'bc/sha-256-part-2' 2020-07-06 22:09:13 -07:00
tree-diff.c bloom/diff: properly short-circuit on max_changes 2020-09-17 09:31:25 -07:00
tree-walk.c tree-walk.c: don't match submodule entries for 'submod/anything' 2020-06-08 12:28:48 -07:00
tree-walk.h
tree.c object: drop parsed_object_pool->commit_count 2020-06-17 14:37:14 -07:00
tree.h
unicode-width.h
unimplemented.sh
unix-socket.c
unix-socket.h
unpack-trees.c Merge branch 'en/sparse-checkout' 2020-05-20 08:33:29 -07:00
unpack-trees.h
upload-pack.c upload-pack: do not lazy-fetch "have" objects 2020-07-16 14:07:19 -07:00
upload-pack.h
url.c
url.h
urlmatch.c
urlmatch.h
usage.c
userdiff.c
userdiff.h
utf8.c
utf8.h
varint.c
varint.h
version.c
version.h
versioncmp.c
walker.c
walker.h
wildmatch.c
wildmatch.h
worktree.c Merge branch 'es/worktree-code-cleanup' 2020-07-06 22:09:19 -07:00
worktree.h worktree: drop get_worktrees() unused 'flags' argument 2020-06-22 10:31:15 -07:00
wrap-for-bin.sh
wrapper.c wrapper: add function to compare strings with different NUL termination 2020-05-27 10:07:06 -07:00
write-or-die.c
ws.c
wt-status.c Remove doubled words in various comments 2020-07-28 14:28:14 -07:00
wt-status.h wt-status: show sparse checkout status as well 2020-06-18 14:12:28 -07:00
xdiff-interface.c
xdiff-interface.h
zlib.c

README.md


Git - fast, scalable, distributed revision control system

Git is a fast, scalable, distributed revision control system with an unusually rich command set that provides both high-level operations and full access to internals.

Git is an Open Source project covered by the GNU General Public License version 2 (some parts of it are under different licenses, compatible with the GPLv2). It was originally written by Linus Torvalds with the help of a group of hackers around the net.

Please read the file INSTALL for installation instructions.

Many Git online resources are accessible from https://git-scm.com/ including full documentation and Git-related tools.

See Documentation/gittutorial.txt to get started, then see Documentation/giteveryday.txt for a useful minimum set of commands, and Documentation/git-<commandname>.txt for documentation of each command. If git has been correctly installed, then the tutorial can also be read with man gittutorial or git help tutorial, and the documentation of each command with man git-<commandname> or git help <commandname>.

CVS users may also want to read Documentation/gitcvs-migration.txt (man gitcvs-migration or git help cvs-migration if git is installed).

The user discussion and development of Git take place on the Git mailing list -- everyone is welcome to post bug reports, feature requests, comments and patches to git@vger.kernel.org (read Documentation/SubmittingPatches for instructions on patch submission). To subscribe to the list, send an email with just "subscribe git" in the body to majordomo@vger.kernel.org. The mailing list archives are available at https://lore.kernel.org/git/, http://marc.info/?l=git and other archival sites.

Issues which are security relevant should be disclosed privately to the Git Security mailing list git-security@googlegroups.com.

The maintainer frequently sends the "What's cooking" reports that list the current status of various development topics to the mailing list. The discussion following them gives a good reference for project status, development direction and remaining tasks.

The name "git" was given by Linus Torvalds when he wrote the very first version. He described the tool as "the stupid content tracker" and the name as (depending on your mood):

  • random three-letter combination that is pronounceable, and not actually used by any common UNIX command. The fact that it is a mispronunciation of "get" may or may not be relevant.
  • stupid. contemptible and despicable. simple. Take your pick from the dictionary of slang.
  • "global information tracker": you're in a good mood, and it actually works for you. Angels sing, and a light suddenly fills the room.
  • "goddamn idiotic truckload of sh*t": when it breaks