git/builtin
Derrick Stolee 6840fe9ee2 backfill: add --min-batch-size=<n> option
Users may want to specify a minimum batch size for their needs. This is only
a minimum: the path-walk API provides a list of OIDs that correspond to the
same path, and thus it is optimal to allow delta compression across those
objects in a single server request.

We could consider limiting the request to have a maximum batch size in the
future. For now, we let the path-walk API batches determine the
boundaries.
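
To make the "minimum, not maximum" behavior concrete, here is a minimal
sketch in Python of how per-path batches could be accumulated into
requests. It is an illustration only, not the code in backfill.c: the
function name plan_requests and the (path, oids) input shape are
assumptions made for this example.

```python
from typing import Iterable, List, Tuple


def plan_requests(
    path_batches: Iterable[Tuple[str, List[str]]],
    min_batch_size: int,
) -> List[List[str]]:
    """Group per-path OID lists into requests (hypothetical helper).

    Each per-path list is appended whole, so a request is only closed
    at a path boundary and may end up larger than min_batch_size.
    """
    requests: List[List[str]] = []
    current: List[str] = []
    for _path, oids in path_batches:
        # Never split the objects of one path across two requests.
        current.extend(oids)
        if len(current) >= min_batch_size:
            requests.append(current)
            current = []
    if current:
        # Send whatever is left once the walk is finished.
        requests.append(current)
    return requests
```

With a minimum of 4 and per-path lists of sizes 3, 1 and 5, this sketch
produces two requests of sizes 4 and 5: the second one exceeds the
minimum because the last path is kept together.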

To get a feeling for the value of specifying the --min-batch-size parameter,
I tested a number of open source repositories available on GitHub. The
procedure was generally:

 1. git clone --filter=blob:none <url>
 2. git backfill

Checking the number of packfiles and the size of the .git/objects/pack
directory helps to identify the effects of different batch sizes.
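
One way to collect those two numbers is sketched below. This is an
assumption about how such a measurement could be made, not the procedure
actually used for the tables that follow; the helper name pack_stats is
hypothetical, and it sums every regular file in the directory (packs,
indexes and any other auxiliary files).

```python
import tempfile
from pathlib import Path
from typing import Tuple


def pack_stats(pack_dir: str) -> Tuple[int, int]:
    """Return (number of *.pack files, total bytes of all files)."""
    directory = Path(pack_dir)
    pack_count = sum(1 for _ in directory.glob("*.pack"))
    total_bytes = sum(
        entry.stat().st_size for entry in directory.iterdir() if entry.is_file()
    )
    return pack_count, total_bytes


if __name__ == "__main__":
    # Small self-check on a throwaway directory instead of a real repository.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "pack-a.pack").write_bytes(b"x" * 10)
        (Path(tmp) / "pack-a.idx").write_bytes(b"x" * 5)
        print(pack_stats(tmp))
```

In a real clone, the argument would be the .git/objects/pack directory
of the repository being measured.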

For the Git repository, we get these results:

| Batch Size      | Pack Count | Pack Size | Time  |
|-----------------|------------|-----------|-------|
| (Initial clone) | 2          | 119 MB    |       |
| 25K             | 8          | 290 MB    | 24s   |
| 50K             | 5          | 290 MB    | 24s   |
| 100K            | 4          | 290 MB    | 29s   |

Other than the packfile counts decreasing as we need fewer batches, the
size and time required do not change much for this small example.

For the nodejs/node repository, we see these results:

| Batch Size      | Pack Count | Pack Size | Time   |
|-----------------|------------|-----------|--------|
| (Initial clone) | 2          | 330 MB    |        |
| 25K             | 19         | 1,222 MB  | 1m 22s |
| 50K             | 11         | 1,221 MB  | 1m 24s |
| 100K            | 7          | 1,223 MB  | 1m 40s |
| 250K            | 4          | 1,224 MB  | 2m 23s |
| 500K            | 3          | 1,216 MB  | 4m 38s |

Here, we don't have much difference in the size of the repo, though the
500K batch size saves a few MB. That comes at a cost of a
much longer time. This extra time is due to server-side delta
compression happening as the on-disk deltas don't appear to be reusable
all the time. But for smaller batch sizes, the server is able to find
reasonable deltas partly because we are asking for objects that appear
in the same region of the directory tree and include all versions of a
file at a specific path.

To contrast this example, I tested the microsoft/fluentui repo, which
has been known to have inefficient packing due to name hash collisions.
These results were collected before GitHub had the opportunity to repack
the server with more advanced name hash versions:

| Batch Size      | Pack Count | Pack Size | Time   |
|-----------------|------------|-----------|--------|
| (Initial clone) | 2          | 105 MB    |        |
| 5K              | 53         | 348 MB    | 2m 26s |
| 10K             | 28         | 365 MB    | 2m 22s |
| 15K             | 19         | 407 MB    | 2m 21s |
| 20K             | 15         | 393 MB    | 2m 28s |
| 25K             | 13         | 417 MB    | 2m 06s |
| 50K             | 8          | 509 MB    | 1m 34s |
| 100K            | 5          | 535 MB    | 1m 56s |
| 250K            | 4          | 698 MB    | 1m 33s |
| 500K            | 3          | 696 MB    | 1m 42s |

Here, a wider range of batch sizes was chosen because of the great
variation in results. By asking the server to send small batches
corresponding to fewer paths at a time, the server is able to provide
better compression for these batches than it would for a regular clone.
A typical full clone for this repository would require 738 MB.

This example justifies the choice to batch requests by path name,
leading to improved communication with a server that is not optimally
packed.

Finally, the same experiment for the Linux repository had these results:

| Batch Size      | Pack Count | Pack Size | Time    |
|-----------------|------------|-----------|---------|
| (Initial clone) | 2          | 2,153 MB  |         |
| 25K             | 63         | 6,380 MB  | 14m 08s |
| 50K             | 58         | 6,126 MB  | 15m 11s |
| 100K            | 30         | 6,135 MB  | 18m 11s |
| 250K            | 14         | 6,146 MB  | 18m 22s |
| 500K            | 8          | 6,143 MB  | 33m 29s |

Even in this example, where the default name hash algorithm leads to
decent compression of the Linux kernel repository, there is value in
selecting a smaller batch size, up to a point. The 25K batch size has the
fastest time, but uses about 250 MB more than the 50K batch size. The 500K
batch size took much more time due to server compression time and thus
we should avoid large batch sizes like this.
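
The trade-off can be checked directly from the Linux table. The sketch
below copies the table values and compares each batch size against the
50K row; the helper names to_seconds and compare are made up for this
example.

```python
import re
from typing import Dict, Tuple

# (pack count, pack size in MB, wall time) copied from the Linux table.
LINUX_RESULTS: Dict[str, Tuple[int, int, str]] = {
    "25K": (63, 6380, "14m 08s"),
    "50K": (58, 6126, "15m 11s"),
    "100K": (30, 6135, "18m 11s"),
    "250K": (14, 6146, "18m 22s"),
    "500K": (8, 6143, "33m 29s"),
}


def to_seconds(text: str) -> int:
    """Convert a duration such as '14m 08s' or '24s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m\s*)?(\d+)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + int(match.group(2))


def compare(
    results: Dict[str, Tuple[int, int, str]], baseline: str
) -> Dict[str, Tuple[int, int]]:
    """Return (extra MB, extra seconds) of each row relative to baseline."""
    _, base_mb, base_time = results[baseline]
    base_seconds = to_seconds(base_time)
    return {
        name: (size_mb - base_mb, to_seconds(time) - base_seconds)
        for name, (_, size_mb, time) in results.items()
    }
```

Relative to 50K, the 25K row comes out 254 MB larger and 63 seconds
faster, while the 500K row is 17 MB larger and 1,098 seconds slower.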

Based on these experiments, a batch size of 50,000 was chosen as the
default value.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-02-03 16:12:42 -08:00
add.c global: trivial conversions to fix `-Wsign-compare` warnings 2024-12-06 20:20:04 +09:00
am.c Merge branch 'jc/show-usage-help' 2025-01-28 13:02:22 -08:00
annotate.c Merge branch 'jc/a-commands-without-the-repo' 2024-10-25 14:02:36 -04:00
apply.c builtin: remove USE_THE_REPOSITORY_VARIABLE from builtin.h 2024-09-13 14:32:24 -07:00
archive.c archive: remove the_repository global variable 2024-10-11 09:37:18 -07:00
backfill.c backfill: add --min-batch-size=<n> option 2025-02-03 16:12:42 -08:00
bisect.c global: mark code units that generate warnings with `-Wsign-compare` 2024-12-06 20:20:02 +09:00
blame.c Merge branch 'ps/the-repository' 2025-01-21 08:44:54 -08:00
branch.c Merge branch 'rs/ref-fitler-used-atoms-value-fix' 2025-01-29 14:05:09 -08:00
bugreport.c diagnose: stop using `the_repository` 2024-12-18 10:44:31 -08:00
bundle.c Merge branch 'jt/bundle-fsck' 2024-12-13 07:33:36 -08:00
cat-file.c Merge branch 'ps/build-sign-compare' 2024-12-23 09:32:11 -08:00
check-attr.c builtin: remove USE_THE_REPOSITORY_VARIABLE from builtin.h 2024-09-13 14:32:24 -07:00
check-ignore.c builtin: remove USE_THE_REPOSITORY_VARIABLE from builtin.h 2024-09-13 14:32:24 -07:00
check-mailmap.c Merge branch 'jc/pass-repo-to-builtins' 2024-09-23 10:35:09 -07:00
check-ref-format.c builtin: send usage() help text to standard output 2025-01-17 13:30:03 -08:00
checkout--worker.c builtins: send usage_with_options() help text to standard output 2025-01-17 13:30:03 -08:00
checkout-index.c builtins: send usage_with_options() help text to standard output 2025-01-17 13:30:03 -08:00
checkout.c Merge branch 'ps/build-sign-compare' 2024-12-23 09:32:11 -08:00
clean.c global: mark code units that generate warnings with `-Wsign-compare` 2024-12-06 20:20:02 +09:00
clone.c global: mark code units that generate warnings with `-Wsign-compare` 2024-12-06 20:20:02 +09:00
column.c builtin: remove USE_THE_REPOSITORY_VARIABLE from builtin.h 2024-09-13 14:32:24 -07:00
commit-graph.c progress: stop using `the_repository` 2024-12-18 10:44:30 -08:00
commit-tree.c builtins: send usage_with_options() help text to standard output 2025-01-17 13:30:03 -08:00
commit.c Merge branch 'ja/doc-commit-markup-updates' 2025-01-29 14:05:09 -08:00
config.c builtin: pass repository to sub commands 2024-11-26 10:36:08 +09:00
count-objects.c packfile: pass down repository to `has_object[_kept]_pack` 2024-12-04 08:21:54 +09:00
credential-cache--daemon.c Merge branch 'mh/credential-cache-authtype-request-fix' 2025-01-28 13:02:24 -08:00
credential-cache.c Merge branch 'rj/cygwin-exit' 2024-11-01 12:53:19 -04:00
credential-store.c builtin: remove USE_THE_REPOSITORY_VARIABLE from builtin.h 2024-09-13 14:32:24 -07:00
credential.c Merge branch 'jc/show-usage-help' 2025-01-28 13:02:22 -08:00
describe.c Merge branch 'ps/build-sign-compare' 2024-12-23 09:32:11 -08:00
diagnose.c diagnose: stop using `the_repository` 2024-12-18 10:44:31 -08:00
diff-files.c builtin: send usage() help text to standard output 2025-01-17 13:30:03 -08:00
diff-index.c builtin: send usage() help text to standard output 2025-01-17 13:30:03 -08:00
diff-tree.c builtin: send usage() help text to standard output 2025-01-17 13:30:03 -08:00
diff.c global: mark code units that generate warnings with `-Wsign-compare` 2024-12-06 20:20:02 +09:00
difftool.c global: trivial conversions to fix `-Wsign-compare` warnings 2024-12-06 20:20:04 +09:00
fast-export.c global: mark code units that generate warnings with `-Wsign-compare` 2024-12-06 20:20:02 +09:00
fast-import.c Merge branch 'kn/pack-write-with-reduced-globals' 2025-02-03 10:23:34 -08:00
fetch-pack.c oddballs: send usage() help text to standard output 2025-01-17 13:30:03 -08:00
fetch.c Merge branch 'bf/fetch-set-head-config' 2025-01-06 12:02:21 -08:00
fmt-merge-msg.c Merge branch 'jc/pass-repo-to-builtins' 2024-09-23 10:35:09 -07:00
for-each-ref.c ref-filter: remove ref_format_clear() 2025-01-21 09:06:24 -08:00
for-each-repo.c global: trivial conversions to fix `-Wsign-compare` warnings 2024-12-06 20:20:04 +09:00
fsck.c progress: stop using `the_repository` 2024-12-18 10:44:30 -08:00
fsmonitor--daemon.c builtins: send usage_with_options() help text to standard output 2025-01-17 13:30:03 -08:00
gc.c Merge branch 'jc/show-usage-help' 2025-01-28 13:02:22 -08:00
get-tar-commit-id.c builtin: send usage() help text to standard output 2025-01-17 13:30:03 -08:00
grep.c Revert barrier-based LSan threading race workaround 2025-01-01 14:13:01 -08:00
hash-object.c builtin: remove USE_THE_REPOSITORY_VARIABLE from builtin.h 2024-09-13 14:32:24 -07:00
help.c pager: stop using `the_repository` 2024-12-18 10:44:30 -08:00
hook.c builtin: pass repository to sub commands 2024-11-26 10:36:08 +09:00
index-pack.c Merge branch 'kn/pack-write-with-reduced-globals' 2025-02-03 10:23:34 -08:00
init-db.c builtin/init-db: fix leaking directory paths 2024-11-21 08:23:45 +09:00
interpret-trailers.c trailer: spread usage of "trailer_block" language 2024-10-14 12:33:02 -04:00
log.c Merge branch 'ps/the-repository' 2025-01-21 08:44:54 -08:00
ls-files.c builtins: send usage_with_options() help text to standard output 2025-01-17 13:30:03 -08:00
ls-remote.c builtin/ls-remote: plug leaking server options 2024-11-04 22:37:51 -08:00
ls-tree.c builtin: remove USE_THE_REPOSITORY_VARIABLE from builtin.h 2024-09-13 14:32:24 -07:00
mailinfo.c mailinfo: stop using `the_repository` 2024-12-18 10:44:31 -08:00
mailsplit.c builtin: send usage() help text to standard output 2025-01-17 13:30:03 -08:00
merge-base.c commit-reach: use `size_t` to track indices when computing merge bases 2024-12-27 08:12:40 -08:00
merge-file.c global: mark code units that generate warnings with `-Wsign-compare` 2024-12-06 20:20:02 +09:00
merge-index.c builtin: send usage() help text to standard output 2025-01-17 13:30:03 -08:00
merge-ours.c builtin: send usage() help text to standard output 2025-01-17 13:30:03 -08:00
merge-recursive.c builtin: send usage() help text to standard output 2025-01-17 13:30:03 -08:00
merge-tree.c global: trivial conversions to fix `-Wsign-compare` warnings 2024-12-06 20:20:04 +09:00
merge.c builtins: send usage_with_options() help text to standard output 2025-01-17 13:30:03 -08:00
mktag.c builtin: remove USE_THE_REPOSITORY_VARIABLE from builtin.h 2024-09-13 14:32:24 -07:00
mktree.c builtin: remove USE_THE_REPOSITORY_VARIABLE from builtin.h 2024-09-13 14:32:24 -07:00
multi-pack-index.c midx-write: pass down repository to `write_midx_file[_only]` 2024-12-04 10:32:20 +09:00
mv.c global: mark code units that generate warnings with `-Wsign-compare` 2024-12-06 20:20:02 +09:00
name-rev.c global: mark code units that generate warnings with `-Wsign-compare` 2024-12-06 20:20:02 +09:00
notes.c builtin: pass repository to sub commands 2024-11-26 10:36:08 +09:00
pack-objects.c Merge branch 'kn/pack-write-with-reduced-globals' 2025-02-03 10:23:34 -08:00
pack-redundant.c builtin: send usage() help text to standard output 2025-01-17 13:30:03 -08:00
pack-refs.c diff.h: fix index used to loop through unsigned integer 2024-12-06 20:20:03 +09:00
patch-id.c builtin/patch-id: fix type of `get_one_patchid()` 2024-12-06 20:20:05 +09:00
prune-packed.c builtin: remove USE_THE_REPOSITORY for those without the_repository 2024-09-13 14:33:30 -07:00
prune.c progress: stop using `the_repository` 2024-12-18 10:44:30 -08:00
pull.c global: trivial conversions to fix `-Wsign-compare` warnings 2024-12-06 20:20:04 +09:00
push.c global: trivial conversions to fix `-Wsign-compare` warnings 2024-12-06 20:20:04 +09:00
range-diff.c Merge branch 'js/range-diff-diff-merges' 2024-12-23 09:32:17 -08:00
read-tree.c builtin: remove USE_THE_REPOSITORY_VARIABLE from builtin.h 2024-09-13 14:32:24 -07:00
rebase.c builtins: send usage_with_options() help text to standard output 2025-01-17 13:30:03 -08:00
receive-pack.c Merge branch 'kn/pack-write-with-reduced-globals' 2025-02-03 10:23:34 -08:00
reflog.c diff.h: fix index used to loop through unsigned integer 2024-12-06 20:20:03 +09:00
refs.c Merge branch 'kn/pass-repo-to-builtin-sub-sub-commands' 2024-12-04 10:14:47 +09:00
remote-ext.c builtin: send usage() help text to standard output 2025-01-17 13:30:03 -08:00
remote-fd.c builtin: send usage() help text to standard output 2025-01-17 13:30:03 -08:00
remote.c Merge branch 'ps/3.0-remote-deprecation' 2025-02-03 10:23:33 -08:00
repack.c Merge branch 'ps/build-sign-compare' 2024-12-23 09:32:11 -08:00
replace.c refs: allow passing flags when setting up a transaction 2024-11-21 07:59:14 +09:00
replay.c global: mark code units that generate warnings with `-Wsign-compare` 2024-12-06 20:20:02 +09:00
rerere.c global: trivial conversions to fix `-Wsign-compare` warnings 2024-12-06 20:20:04 +09:00
reset.c diff.h: fix index used to loop through unsigned integer 2024-12-06 20:20:03 +09:00
rev-list.c Merge branch 'jc/show-usage-help' 2025-01-28 13:02:22 -08:00
rev-parse.c builtin: send usage() help text to standard output 2025-01-17 13:30:03 -08:00
revert.c diff.h: fix index used to loop through unsigned integer 2024-12-06 20:20:03 +09:00
rm.c global: mark code units that generate warnings with `-Wsign-compare` 2024-12-06 20:20:02 +09:00
send-pack.c send-pack: stop using `the_repository` 2024-12-18 10:44:30 -08:00
shortlog.c diff.h: fix index used to loop through unsigned integer 2024-12-06 20:20:03 +09:00
show-branch.c global: mark code units that generate warnings with `-Wsign-compare` 2024-12-06 20:20:02 +09:00
show-index.c Merge branch 'jc/show-index-h-update' 2025-01-31 09:44:16 -08:00
show-ref.c builtin: remove USE_THE_REPOSITORY_VARIABLE from builtin.h 2024-09-13 14:32:24 -07:00
sparse-checkout.c global: mark code units that generate warnings with `-Wsign-compare` 2024-12-06 20:20:02 +09:00
stash.c global: trivial conversions to fix `-Wsign-compare` warnings 2024-12-06 20:20:04 +09:00
stripspace.c builtin: remove USE_THE_REPOSITORY_VARIABLE from builtin.h 2024-09-13 14:32:24 -07:00
submodule--helper.c global: trivial conversions to fix `-Wsign-compare` warnings 2024-12-06 20:20:04 +09:00
symbolic-ref.c builtin: remove USE_THE_REPOSITORY_VARIABLE from builtin.h 2024-09-13 14:32:24 -07:00
tag.c ref-filter: remove ref_format_clear() 2025-01-21 09:06:24 -08:00
unpack-file.c oddballs: send usage() help text to standard output 2025-01-17 13:30:03 -08:00
unpack-objects.c Merge branch 'jk/pack-header-parse-alignment-fix' 2025-01-28 13:02:23 -08:00
update-index.c builtins: send usage_with_options() help text to standard output 2025-01-17 13:30:03 -08:00
update-ref.c global: mark code units that generate warnings with `-Wsign-compare` 2024-12-06 20:20:02 +09:00
update-server-info.c server-info: stop using `the_repository` 2024-12-18 10:44:30 -08:00
upload-archive.c builtin: send usage() help text to standard output 2025-01-17 13:30:03 -08:00
upload-pack.c serve: stop using `the_repository` 2024-12-18 10:44:30 -08:00
var.c Merge branch 'jc/show-usage-help' 2025-01-28 13:02:22 -08:00
verify-commit.c builtin: remove USE_THE_REPOSITORY_VARIABLE from builtin.h 2024-09-13 14:32:24 -07:00
verify-pack.c builtin: remove USE_THE_REPOSITORY_VARIABLE from builtin.h 2024-09-13 14:32:24 -07:00
verify-tag.c ref-filter: remove ref_format_clear() 2025-01-21 09:06:24 -08:00
worktree.c Merge branch 'ps/build-sign-compare' 2024-12-23 09:32:11 -08:00
write-tree.c Merge branch 'jc/pass-repo-to-builtins' 2024-09-23 10:35:09 -07:00