git/Documentation
Derrick Stolee 6840fe9ee2 backfill: add --min-batch-size=<n> option
Users may want to specify a minimum batch size for their needs. This is only
a minimum: the path-walk API provides a list of OIDs that correspond to the
same path, and thus it is optimal to allow delta compression across those
objects in a single server request.

We could consider limiting the request to have a maximum batch size in the
future. For now, we let the path-walk API batches determine the
boundaries.
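
As an illustration of the "minimum, not maximum" semantics described above, here is a minimal sketch in Python. It is not the implementation in Git: the function name batch_requests and the representation of path-walk output as lists of OID strings are assumptions made for this example. Each per-path list is appended whole, and a request is emitted once the pending count reaches the minimum, so a request can exceed the minimum but never splits a path.

```python
from typing import Iterable, Iterator, List


def batch_requests(
    path_batches: Iterable[List[str]], min_batch_size: int
) -> Iterator[List[str]]:
    """Group per-path OID lists into requests of at least min_batch_size OIDs.

    Hypothetical helper: `path_batches` stands in for what a path-walk
    would report, one list of OIDs per path.
    """
    pending: List[str] = []
    for oids in path_batches:
        # Keep all OIDs for one path in the same request so the server
        # can delta-compress versions of the same file together.
        pending.extend(oids)
        if len(pending) >= min_batch_size:
            yield pending
            pending = []
    if pending:
        # The final request may be smaller than the minimum.
        yield pending
```

Because the check happens only after a whole path has been appended, one path with many versions produces a single oversized request rather than several split ones.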

To get a feeling for the value of specifying the --min-batch-size parameter,
I tested a number of open source repositories available on GitHub. The
procedure was generally:

 1. git clone --filter=blob:none <url>
 2. git backfill

Checking the number of packfiles and the size of the .git/objects/pack
directory helps to identify the effects of different batch sizes.
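
The two measurements mentioned above could be gathered with a small script like the following. This is a sketch, not the procedure actually used for the tables: the helper name pack_stats is made up for this example, and it assumes a non-bare repository whose packs live under .git/objects/pack.

```python
from pathlib import Path
from typing import Tuple


def pack_stats(repo: str) -> Tuple[int, int]:
    """Return (number of *.pack files, total bytes in the pack directory).

    Hypothetical helper; assumes `repo` is a non-bare working tree.
    """
    pack_dir = Path(repo) / ".git" / "objects" / "pack"
    files = [p for p in pack_dir.iterdir() if p.is_file()]
    # Count only the packfiles themselves, not their .idx or other
    # companion files.
    pack_count = sum(1 for p in files if p.suffix == ".pack")
    # The directory size includes every file, companions as well.
    total_bytes = sum(p.stat().st_size for p in files)
    return pack_count, total_bytes
```

Calling pack_stats once after the clone and once after git backfill --min-batch-size=<n> gives the before and after values for one row of a table.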

For the Git repository, we get these results:

| Batch Size      | Pack Count | Pack Size | Time  |
|-----------------|------------|-----------|-------|
| (Initial clone) | 2          | 119 MB    |       |
| 25K             | 8          | 290 MB    | 24s   |
| 50K             | 5          | 290 MB    | 24s   |
| 100K            | 4          | 290 MB    | 29s   |

Other than the packfile counts decreasing as we need fewer batches, the
size and time required do not change much for this small example.

For the nodejs/node repository, we see these results:

| Batch Size      | Pack Count | Pack Size | Time   |
|-----------------|------------|-----------|--------|
| (Initial clone) | 2          | 330 MB    |        |
| 25K             | 19         | 1,222 MB  | 1m 22s |
| 50K             | 11         | 1,221 MB  | 1m 24s |
| 100K            | 7          | 1,223 MB  | 1m 40s |
| 250K            | 4          | 1,224 MB  | 2m 23s |
| 500K            | 3          | 1,216 MB  | 4m 38s |

Here, we don't have much difference in the size of the repo, though the
500K batch size saves a few MB. That comes at the cost of a much longer
time. This extra time is due to server-side delta compression, as the
on-disk deltas don't appear to be reusable all the time. But for smaller
batch sizes, the server is able to find reasonable deltas partly because
we are asking for objects that appear in the same region of the
directory tree and include all versions of a file at a specific path.

In contrast to this example, I tested the microsoft/fluentui repo, which
has been known to have inefficient packing due to name hash collisions.
These results were collected before GitHub had the opportunity to repack
the server with more advanced name hash versions:

| Batch Size      | Pack Count | Pack Size | Time   |
|-----------------|------------|-----------|--------|
| (Initial clone) | 2          | 105 MB    |        |
| 5K              | 53         | 348 MB    | 2m 26s |
| 10K             | 28         | 365 MB    | 2m 22s |
| 15K             | 19         | 407 MB    | 2m 21s |
| 20K             | 15         | 393 MB    | 2m 28s |
| 25K             | 13         | 417 MB    | 2m 06s |
| 50K             | 8          | 509 MB    | 1m 34s |
| 100K            | 5          | 535 MB    | 1m 56s |
| 250K            | 4          | 698 MB    | 1m 33s |
| 500K            | 3          | 696 MB    | 1m 42s |

Here, a wider range of batch sizes was chosen because of the great
variation in results. By asking the server for small batches
corresponding to fewer paths at a time, the server is able to provide
better compression for these batches than it would for a regular clone.
A typical full clone for this repository would require 738 MB.

This example justifies the choice to batch requests by path name,
leading to improved communication with a server that is not optimally
packed.

Finally, the same experiment for the Linux repository had these results:

| Batch Size      | Pack Count | Pack Size | Time    |
|-----------------|------------|-----------|---------|
| (Initial clone) | 2          | 2,153 MB  |         |
| 25K             | 63         | 6,380 MB  | 14m 08s |
| 50K             | 58         | 6,126 MB  | 15m 11s |
| 100K            | 30         | 6,135 MB  | 18m 11s |
| 250K            | 14         | 6,146 MB  | 18m 22s |
| 500K            | 8          | 6,143 MB  | 33m 29s |

Even in this example, where the default name hash algorithm leads to
decent compression of the Linux kernel repository, there is value in
selecting a smaller batch size, up to a point. The 25K batch size has
the fastest time, but uses about 250 MB more than the 50K batch size.
The 500K batch size took much more time due to server compression time,
and thus we should avoid large batch sizes like this.

Based on these experiments, a batch size of 50,000 was chosen as the
default value.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-02-03 16:12:42 -08:00
RelNotes The seventh batch 2025-02-03 10:23:35 -08:00
config Merge branch 'ja/doc-notes-markup-updates' 2025-01-31 09:44:15 -08:00
howto meson: generate articles 2024-12-27 08:28:11 -08:00
includes
mergetools
technical backfill: basic functionality and tests 2025-02-03 16:12:41 -08:00
.gitattributes
.gitignore Documentation: wire up sanity checks for Meson 2024-12-27 08:28:12 -08:00
BreakingChanges.txt remote: announce removal of "branches/" and "remotes/" 2025-01-24 08:08:56 -08:00
CodingGuidelines Merge branch 'ps/build' 2024-12-15 17:54:33 -08:00
DecisionMaking.txt Documentation: fix typos 2024-09-23 12:47:36 -07:00
Makefile Documentation: wire up sanity checks for Meson 2024-12-27 08:28:12 -08:00
MyFirstContribution.txt MyFirstContribution: mention contrib/contacts/git-contacts 2024-04-18 14:55:09 -07:00
MyFirstObjectWalk.txt MyFirstObjectWalk: add stderr to pipe processing 2024-03-27 09:24:35 -07:00
ReviewingGuidelines.txt ReviewingGuidelines: encourage positive reviews more 2024-07-25 08:50:09 -07:00
SubmittingPatches Merge branch 'as/gitk-git-gui-repo-update' 2024-12-28 10:11:42 -08:00
ToolsForGit.txt
asciidoc.conf.in Documentation: inline user-manual.conf 2024-12-27 08:28:10 -08:00
asciidoctor-extensions.rb.in asciidoctor-extensions.rb.in: inject GIT_DATE 2024-12-20 17:34:35 -08:00
blame-options.txt
build-docdep.perl Documentation: allow sourcing generated includes from separate dir 2024-12-07 07:52:12 +09:00
cat-texi.perl
cmd-list.perl Documentation: teach "cmd-list.perl" about out-of-tree builds 2024-12-07 07:52:12 +09:00
config.txt Merge branch 'tb/pseudo-merge-reachability-bitmap' 2024-06-24 16:39:13 -07:00
date-formats.txt Documentation: fix typos describing date format 2024-04-12 09:03:03 -07:00
diff-format.txt doc: git-diff: apply format changes to diff-format 2024-11-19 12:31:04 +09:00
diff-generate-patch.txt doc: git-diff: apply format changes to diff-generate-patch 2024-11-19 12:31:05 +09:00
diff-options.txt doc: git-diff: apply format changes to diff-options 2024-11-19 12:31:04 +09:00
doc-diff
docbook-xsl.css
docbook.xsl
docinfo-html.in Doc: fix Asciidoctor css workaround 2024-07-23 11:02:52 -07:00
everyday.txto
fetch-options.txt doc: correct misleading descriptions for --shallow-exclude 2024-11-04 22:53:23 -08:00
fix-texi.perl
fsck-msgids.txt ref: add symlink ref content check for files backend 2024-11-21 08:21:34 +09:00
generate-mergetool-list.sh Documentation: extract script to generate a list of mergetools 2024-12-07 07:52:13 +09:00
git-add.txt doc: git-add.txt: convert to new style convention 2024-11-12 09:32:18 +09:00
git-am.txt Merge branch 'jk/am-retry' 2024-06-17 15:55:56 -07:00
git-annotate.txt
git-apply.txt apply: support --ours, --theirs, and --union for three-way merges 2024-09-09 10:07:24 -07:00
git-archimport.txt
git-archive.txt archive: document that --add-virtual-file takes full path 2024-06-26 12:56:45 -07:00
git-backfill.txt backfill: add --min-batch-size=<n> option 2025-02-03 16:12:42 -08:00
git-bisect-lk2009.txt
git-bisect.txt
git-blame.txt
git-branch.txt
git-bugreport.txt bugreport.c: fix a crash in `git bugreport` with `--no-suffix` option 2024-03-16 09:31:42 -07:00
git-bundle.txt Merge branch 'kh/doc-bundle-typofix' 2024-12-13 07:33:38 -08:00
git-cat-file.txt docs: explain the order of output in the batched mode of git-cat-file(1) 2024-08-22 14:59:22 -07:00
git-check-attr.txt
git-check-ignore.txt
git-check-mailmap.txt check-mailmap: add options for additional mailmap sources 2024-08-27 14:51:29 -07:00
git-check-ref-format.txt
git-checkout-index.txt
git-checkout.txt checkout: refer to other-worktree branch, not ref 2024-10-10 13:09:13 -07:00
git-cherry-pick.txt cherry-pick: add `--empty` for more robust redundant commit handling 2024-03-25 16:45:41 -07:00
git-cherry.txt
git-citool.txt
git-clean.txt
git-clone.txt Merge branch 'bc/allow-upload-pack-from-other-people' 2024-12-10 10:04:55 +09:00
git-column.txt
git-commit-graph.txt
git-commit-tree.txt
git-commit.txt doc: make more direct explanations in git commit options 2025-01-15 14:43:36 -08:00
git-config.txt git-config.1: remove value from positional args in unset usage 2024-10-08 23:35:45 -07:00
git-count-objects.txt
git-credential-cache--daemon.txt
git-credential-cache.txt docs: discuss caching personal access tokens 2025-01-10 15:10:00 -08:00
git-credential-store.txt
git-credential.txt credential: add method for querying capabilities 2024-04-16 22:39:08 -07:00
git-cvsexportcommit.txt
git-cvsimport.txt
git-cvsserver.txt
git-daemon.txt
git-describe.txt
git-diagnose.txt
git-diff-files.txt
git-diff-index.txt
git-diff-tree.txt Merge branch 'jc/grammo-fixes' into maint-2.46 2024-09-12 11:02:19 -07:00
git-diff.txt doc: git-diff: apply new documentation guidelines 2024-11-19 12:31:04 +09:00
git-difftool.txt
git-fast-export.txt
git-fast-import.txt Merge branch 'xx/rfc2822-date-format-in-doc' 2024-04-23 11:52:40 -07:00
git-fetch-pack.txt doc: correct misleading descriptions for --shallow-exclude 2024-11-04 22:53:23 -08:00
git-fetch.txt
git-filter-branch.txt
git-fmt-merge-msg.txt
git-for-each-ref.txt for-each-ref: add 'is-base' token 2024-08-14 10:10:06 -07:00
git-for-each-repo.txt for-each-repo: optionally keep going on an error 2024-04-24 10:46:03 -07:00
git-format-patch.txt global: Fix duplicate word typos 2024-10-21 16:05:04 -04:00
git-fsck-objects.txt
git-fsck.txt
git-fsmonitor--daemon.txt
git-gc.txt builtin/gc: add a `--detach` flag 2024-08-16 09:46:25 -07:00
git-get-tar-commit-id.txt
git-grep.txt grep docs: describe --no-index further and improve formatting a bit 2024-03-25 14:00:03 -07:00
git-gui.txt SubmittingPatches: welcome the new maintainer of git-gui part 2024-05-11 14:31:30 -07:00
git-hash-object.txt
git-help.txt
git-hook.txt
git-http-backend.txt
git-http-fetch.txt
git-http-push.txt
git-imap-send.txt
git-index-pack.txt index-pack: teach --promisor to forbid pack name 2024-11-20 10:37:56 +09:00
git-init-db.txt
git-init.txt doc: apply synopsis simplification on git-clone and git-init 2024-09-24 10:20:26 -07:00
git-instaweb.txt
git-interpret-trailers.txt doc: fix some placeholders formating 2024-03-16 10:04:53 -07:00
git-log.txt
git-ls-files.txt doc: fix hex code escapes in git-ls-files 2024-07-26 10:53:21 -07:00
git-ls-remote.txt transport.c::handshake: make use of server options from remote 2024-10-08 10:22:08 -07:00
git-ls-tree.txt
git-mailinfo.txt
git-mailsplit.txt
git-maintenance.txt doc: add a note about staggering of maintenance 2024-10-03 11:23:09 -07:00
git-merge-base.txt
git-merge-file.txt
git-merge-index.txt
git-merge-one-file.txt
git-merge-tree.txt doc: merge-tree: improve example script 2024-10-09 10:40:42 -07:00
git-merge.txt
git-mergetool--lib.txt
git-mergetool.txt
git-mktag.txt
git-mktree.txt
git-multi-pack-index.txt midx: implement support for writing incremental MIDX chains 2024-08-06 12:01:39 -07:00
git-mv.txt
git-name-rev.txt
git-notes.txt doc: convert git-notes to new documentation format 2025-01-10 15:19:52 -08:00
git-p4.txt
git-pack-objects.txt
git-pack-redundant.txt
git-pack-refs.txt builtin/pack-refs: introduce new "--auto" flag 2024-03-25 09:54:07 -07:00
git-patch-id.txt
git-prune-packed.txt
git-prune.txt
git-pull.txt doc: format alternatives in synopsis 2024-03-16 10:04:45 -07:00
git-push.txt
git-quiltimport.txt
git-range-diff.txt range-diff: introduce the convenience option `--remerge-diff` 2024-12-16 08:45:48 -08:00
git-read-tree.txt
git-rebase.txt Merge branch 'jc/doc-rebase-fuzz-vs-offset-fix' into maint-2.46 2024-08-16 12:50:55 -07:00
git-receive-pack.txt
git-reflog.txt
git-refs.txt refs: add support for migrating reflogs 2024-12-16 09:45:34 -08:00
git-remote-ext.txt
git-remote-fd.txt
git-remote-helpers.txto
git-remote.txt
git-repack.txt
git-replace.txt
git-replay.txt Documentation: fix linkgit reference 2024-04-15 11:02:43 -07:00
git-request-pull.txt
git-rerere.txt
git-reset.txt
git-restore.txt doc: convert git-restore to new style format 2025-01-10 15:21:21 -08:00
git-rev-list.txt
git-rev-parse.txt Merge branch 'jc/rev-parse-fatal-doc' into maint-2.45 2024-06-28 15:53:14 -07:00
git-revert.txt
git-rm.txt
git-send-email.txt send-email: document --mailmap and associated configuration 2024-09-25 08:58:38 -07:00
git-send-pack.txt
git-sh-i18n--envsubst.txt
git-sh-i18n.txt
git-sh-setup.txt
git-shell.txt
git-shortlog.txt
git-show-branch.txt Merge branch 'ri/doc-show-branch-fix' 2024-07-15 10:11:43 -07:00
git-show-index.txt show-index: the short help should say the command reads from its input 2024-12-20 17:30:57 -08:00
git-show-ref.txt show-ref: introduce --branches and deprecate --heads 2024-06-04 15:07:08 -07:00
git-show.txt
git-sparse-checkout.txt
git-stage.txt
git-stash.txt
git-status.txt Merge branch 'jc/show-untracked-false' 2024-03-28 14:13:50 -07:00
git-stripspace.txt
git-submodule.txt builtin/submodule: allow "add" to use different ref storage format 2024-08-08 09:22:21 -07:00
git-svn.txt git-svn: mention `svn:global-ignores` in help+docs 2024-08-14 15:10:24 -07:00
git-switch.txt
git-symbolic-ref.txt Documentation: mutually link update-ref and symbolic-ref 2024-10-21 16:49:31 -04:00
git-tag.txt builtin/tag: add --trailer option 2024-05-07 10:06:03 -07:00
git-tools.txt
git-unpack-file.txt
git-unpack-objects.txt
git-update-index.txt documentation: git-update-index: add --show-index-version to synopsis 2024-05-13 16:57:17 -07:00
git-update-ref.txt Merge branch 'kh/doc-update-ref-grammofix' 2024-12-13 07:33:39 -08:00
git-update-server-info.txt
git-upload-archive.txt
git-upload-pack.txt Sync with 2.42.2 2024-04-19 12:38:50 +02:00
git-var.txt
git-verify-commit.txt
git-verify-pack.txt
git-verify-tag.txt
git-version.txt
git-web--browse.txt
git-whatchanged.txt
git-worktree.txt worktree: add relative cli/config options to `repair` command 2024-12-02 09:36:17 +09:00
git-write-tree.txt
git.txt Merge branch 'mh/doc-windows-home-env' 2025-01-06 08:23:29 -08:00
gitattributes.txt docs: fix typesetting of merge driver placeholders 2025-01-07 15:11:36 -08:00
gitcli.txt Merge branch 'jc/cli-doc-option-and-config' 2025-01-23 15:07:02 -08:00
gitcore-tutorial.txt
gitcredentials.txt docs: list popular credential helpers 2025-01-10 15:10:00 -08:00
gitcvs-migration.txt
gitdiffcore.txt
giteveryday.txt
gitfaq.txt gitfaq: add entry about syncing working trees 2024-07-09 21:24:42 -07:00
gitformat-bundle.txt
gitformat-chunk.txt
gitformat-commit-graph.txt Documentation: fix typos 2024-09-23 12:47:36 -07:00
gitformat-index.txt
gitformat-pack.txt
gitformat-signature.txt
gitglossary.txt
githooks.txt Merge branch 'jt/doc-post-receive-hook-update' into maint-2.46 2024-08-16 12:50:53 -07:00
gitignore.txt
gitk.txt
gitmailmap.txt
gitmodules.txt
gitnamespaces.txt
gitpacking.txt Documentation/gitpacking: make sample configs listing blocks 2024-07-17 08:48:30 -07:00
gitprotocol-capabilities.txt
gitprotocol-common.txt
gitprotocol-http.txt
gitprotocol-pack.txt
gitprotocol-v2.txt Merge branch 'xx/protocol-v2-doc-markup-fix' into maint-2.47 2024-11-25 12:29:47 +09:00
gitremote-helpers.txt Merge branch 'jk/remote-helper-object-format-option-fix' 2024-04-03 10:56:18 -07:00
gitrepository-layout.txt remote: announce removal of "branches/" and "remotes/" 2025-01-24 08:08:56 -08:00
gitrevisions.txt
gitsubmodules.txt
gittutorial-2.txt
gittutorial.txt Merge branch 'jc/grammo-fixes' into maint-2.46 2024-09-12 11:02:19 -07:00
gitweb.conf.txt
gitweb.txt Documentation: fix typos 2024-09-23 12:47:36 -07:00
gitworkflows.txt
glossary-content.txt Documentation/glossary: describe "trailer" 2024-11-18 09:41:24 +09:00
i18n.txt doc: migrate git-commit manpage secondary files to new format 2025-01-15 14:43:36 -08:00
install-doc-quick.sh
install-webdoc.sh
line-range-format.txt
line-range-options.txt
lint-fsck-msgids.perl
lint-gitlink.perl
lint-man-end-blurb.perl
lint-man-section-order.perl
lint-manpages.sh Documentation/lint-manpages: bubble up errors 2024-06-06 08:20:51 -07:00
manpage-bold-literal.xsl
manpage-normal.xsl
manpage.xsl
merge-options.txt
merge-strategies.txt
meson.build backfill: add builtin boilerplate 2025-02-03 16:12:41 -08:00
object-format-disclaimer.txt
pretty-formats.txt Merge branch 'bl/doc-key-val-sep-fix' 2024-03-25 16:16:35 -07:00
pretty-options.txt
pull-fetch-param.txt doc: clarify <src> in refspec syntax 2024-10-09 16:59:01 -07:00
ref-reachability-filters.txt
ref-storage-format.txt
rerere-options.txt
rev-list-description.txt
rev-list-options.txt Merge branch 'kk/doc-ancestry-path' 2024-12-13 07:33:46 -08:00
revisions.txt
scalar.txt scalar: add --no-tags option to 'scalar clone' 2024-09-06 14:13:48 -07:00
sequencer.txt
signoff-option.txt doc: migrate git-commit manpage secondary files to new format 2025-01-15 14:43:36 -08:00
texi.xsl
trace2-target-values.txt
transfer-data-leaks.txt
urls-remotes.txt
urls.txt doc: apply synopsis simplification on git-clone and git-init 2024-09-24 10:20:26 -07:00
user-manual.txt Documentation/user-manual.txt: example for generating object hashes 2024-03-12 13:32:11 -07:00