Junio C Hamano
7 years ago
1 changed files with 324 additions and 0 deletions
Partial Clone Design Notes
==========================

The "Partial Clone" feature is a performance optimization for Git that
allows Git to function without having a complete copy of the repository.
The goal of this work is to allow Git to better handle extremely large
repositories.

During clone and fetch operations, Git downloads the complete contents
and history of the repository. This includes all commits, trees, and
blobs for the complete life of the repository. For extremely large
repositories, clones can take hours (or days) and consume 100+GiB of disk
space.

Often in these repositories there are many blobs and trees that the user
does not need, such as:

  1. files outside of the user's work area in the tree. For example, in
     a repository with 500K directories and 3.5M files in every commit,
     we can avoid downloading many objects if the user only needs a
     narrow "cone" of the source tree.

  2. large binary assets. For example, in a repository where large build
     artifacts are checked into the tree, we can avoid downloading all
     previous versions of these non-mergeable binary assets and only
     download versions that are actually referenced.

Partial clone allows us to avoid downloading such unneeded objects *in
advance* during clone and fetch operations and thereby reduce download
times and disk usage. Missing objects can later be "demand fetched"
if/when needed.

Use of partial clone requires that the user be online and the origin
remote be available for on-demand fetching of missing objects. This may
or may not be problematic for the user. For example, if the user can
stay within the pre-selected subset of the source tree, they may not
encounter any missing objects. Alternatively, the user could try to
pre-fetch various objects if they know that they are going offline.


Non-Goals
---------

Partial clone is a mechanism to limit the number of blobs and trees downloaded
*within* a given range of commits -- and is therefore independent of and not
intended to conflict with existing DAG-level mechanisms to limit the set of
requested commits (i.e. shallow clone, single branch, or fetch '<refspec>').


Design Overview
---------------

Partial clone logically consists of the following parts:

- A mechanism for the client to describe unneeded or unwanted objects to
  the server.

- A mechanism for the server to omit such unwanted objects from packfiles
  sent to the client.

- A mechanism for the client to gracefully handle missing objects (that
  were previously omitted by the server).

- A mechanism for the client to backfill missing objects as needed.
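
The first two parts listed above can be pictured with a toy model. This is
only a sketch under stated assumptions: the object graph, the `Obj` type,
`build_filtered_pack()` and the `omit_blobs()` predicate are hypothetical
names invented for illustration, not Git's implementation. The predicate
stands in for a filter that omits every blob.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    oid: str
    kind: str          # "commit", "tree", or "blob"
    refs: tuple = ()   # OIDs this object refers to


def build_filtered_pack(objects, wants, omit):
    """Collect objects reachable from `wants`, skipping those the filter omits."""
    pack, seen, stack = {}, set(), list(wants)
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        obj = objects[oid]
        if omit(obj):
            continue   # left out of the pack; the client will see it as missing
        pack[oid] = obj
        stack.extend(obj.refs)
    return pack


def omit_blobs(obj):
    """Hypothetical filter predicate: drop every blob."""
    return obj.kind == "blob"


# A made-up repository: one commit, one tree, two blobs.
REPO = {
    "c1": Obj("c1", "commit", ("t1",)),
    "t1": Obj("t1", "tree", ("b1", "b2")),
    "b1": Obj("b1", "blob"),
    "b2": Obj("b2", "blob"),
}

pack = build_filtered_pack(REPO, ["c1"], omit_blobs)
# Objects referenced from inside the pack but not contained in it.
missing = sorted({r for o in pack.values() for r in o.refs} - set(pack))
print(sorted(pack), missing)
```

The resulting pack holds the commit and the tree, while both blobs are
referenced but absent. That is the "incomplete" state the client has to
tolerate and later backfill.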


Design Details
--------------

- A new pack-protocol capability "filter" is added to the fetch-pack and
  upload-pack negotiation.

  This uses the existing capability discovery mechanism.
  See "filter" in Documentation/technical/pack-protocol.txt.

- Clients pass a "filter-spec" to clone and fetch, which is passed on to
  the server to request filtering during packfile construction.

  There are various filters available to accommodate different situations.
  See "--filter=<filter-spec>" in Documentation/rev-list-options.txt.

- On the server, pack-objects applies the requested filter-spec as it
  creates "filtered" packfiles for the client.

  These filtered packfiles are *incomplete* in the traditional sense because
  they may contain objects that reference objects not contained in the
  packfile and that the client doesn't already have. For example, the
  filtered packfile may contain trees or tags that reference missing blobs
  or commits that reference missing trees.

- On the client, these incomplete packfiles are marked as "promisor
  packfiles" and treated differently by various commands.

- On the client, a repository extension is added to the local config to
  prevent older versions of git from failing mid-operation because of
  missing objects that they cannot handle.
  See "extensions.partialClone" in
  Documentation/technical/repository-version.txt.


Handling Missing Objects
------------------------

- An object may be missing due to a partial clone or fetch, or missing due
  to repository corruption. To differentiate these cases, the local
  repository specially indicates such filtered packfiles obtained from the
  promisor remote as "promisor packfiles".

  These promisor packfiles consist of a "<name>.promisor" file with
  arbitrary contents (like the "<name>.keep" files), in addition to
  their "<name>.pack" and "<name>.idx" files.

- The local repository considers a "promisor object" to be an object that
  it knows (to the best of its ability) that the promisor remote has promised
  that it has, either because the local repository has that object in one of
  its promisor packfiles, or because another promisor object refers to it.

  When Git encounters a missing object, Git can check whether it is a
  promisor object and handle it appropriately. If it is not, Git can
  report a corruption.

  This means that there is no need for the client to explicitly maintain an
  expensive-to-modify list of missing objects.[a]

- Since almost all Git code currently expects any referenced object to be
  present locally and because we do not want to force every command to do
  a dry-run first, a fallback mechanism is added to allow Git to attempt
  to dynamically fetch missing objects from the promisor remote.

  When the normal object lookup fails to find an object, Git invokes
  fetch-object to try to get the object from the server and then retry
  the object lookup. This allows objects to be "faulted in" without
  complicated prediction algorithms.

  For efficiency reasons, no check as to whether the missing object is
  actually a promisor object is performed.

  Dynamic object fetching tends to be slow as objects are fetched one at
  a time.

- `checkout` (and any other command using `unpack-trees`) has been taught
  to bulk pre-fetch all required missing blobs in a single batch.

- `rev-list` has been taught to print missing objects.

  This can be used by other commands to bulk prefetch objects.
  For example, a "git log -p A..B" may internally want to first do
  something like "git rev-list --objects --quiet --missing=print A..B"
  and prefetch those objects in bulk.

- `fsck` has been updated to be fully aware of promisor objects.

- `repack` in GC has been updated to not touch promisor packfiles at all,
  and to only repack other objects.

- The global variable "fetch_if_missing" is used to control whether an
  object lookup will attempt to dynamically fetch a missing object or
  report an error.

  We are not happy with this global variable and would like to remove it,
  but that requires significant refactoring of the object code to pass an
  additional flag. We hope that concurrent efforts to add an ODB API can
  encompass this.
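
The promisor-object rule and the fault-in fallback described in this section
can be sketched in a few lines. Everything below is a simplified model, not
Git's code: the `Repo` class, its dictionaries and the `fetches` log are
hypothetical, and the "remote" is just an in-memory mapping. Only the
`fetch_if_missing` name is taken from the text above.

```python
class MissingObject(Exception):
    """Raised when an object is absent and cannot (or may not) be fetched."""


class Repo:
    def __init__(self, promisor_pack, other_objects, remote):
        self.promisor_pack = dict(promisor_pack)  # oid -> list of referenced oids
        self.other = dict(other_objects)          # objects outside promisor packs
        self.remote = remote                      # stand-in for the promisor remote
        self.fetch_if_missing = True              # models the global described above
        self.fetches = []                         # log of dynamic fetches

    def is_promisor_object(self, oid):
        # Either the object sits in a promisor packfile...
        if oid in self.promisor_pack:
            return True
        # ...or some object in a promisor packfile refers to it.
        return any(oid in refs for refs in self.promisor_pack.values())

    def _local(self, oid):
        if oid in self.promisor_pack:
            return self.promisor_pack[oid]
        return self.other.get(oid)

    def lookup(self, oid):
        found = self._local(oid)
        if found is None and self.fetch_if_missing and oid in self.remote:
            # Fault the object in and retry. As in the text, no promisor
            # check is made before fetching.
            self.promisor_pack[oid] = self.remote[oid]
            self.fetches.append(oid)
            found = self._local(oid)
        if found is None:
            raise MissingObject(oid)
        return found


remote = {"c1": ["t1"], "t1": ["b1"], "b1": []}
repo = Repo({"c1": ["t1"], "t1": ["b1"]}, {}, remote)

print(repo.is_promisor_object("b1"))  # missing, but promised via tree t1
print(repo.is_promisor_object("zz"))  # unknown object: a corruption candidate

repo.fetch_if_missing = False
try:
    repo.lookup("b1")
    raised = False
except MissingObject:
    raised = True                     # lookup reports an error instead of fetching

repo.fetch_if_missing = True
repo.lookup("b1")                     # faulted in on demand
print(raised, repo.fetches)
```

Note that the model never stores a list of missing objects. Whether an
absent object is "promised" is inferred from what the promisor pack
references, which is the point of footnote [a].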


Fetching Missing Objects
------------------------

- Fetching of objects is done using the existing transport mechanism using
  transport_fetch_refs(), setting a new transport option
  TRANS_OPT_NO_DEPENDENTS to indicate that only the objects themselves are
  desired, not any object that they refer to.

  Because some transports invoke fetch_pack() in the same process, fetch_pack()
  has been updated to not use any object flags when the corresponding argument
  (no_dependents) is set.

- The local repository sends a request with the hashes of all requested
  objects as "want" lines, and does not perform any packfile negotiation.
  It then receives a packfile.

- Because we are reusing the existing fetch-pack mechanism, fetching
  currently fetches all objects referred to by the requested objects, even
  though they are not necessary.
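
To make the "want lines without negotiation" request more concrete, here is
a rough sketch of how such a request body could be framed in pkt-line
format (a four-digit hex length that counts itself, followed by the
payload). The helper names are hypothetical, and the sketch deliberately
leaves out the capability list that a real first "want" line carries, so it
should be read as an illustration of the framing rather than a working
client.

```python
FLUSH_PKT = b"0000"  # pkt-line flush packet


def pkt_line(payload: str) -> bytes:
    """Frame one payload: 4 hex digits of total length, then the data."""
    data = payload.encode("ascii")
    return f"{len(data) + 4:04x}".encode("ascii") + data


def build_want_request(oids):
    """One "want" line per object, a flush, then "done" (no "have" lines)."""
    wants = b"".join(pkt_line(f"want {oid}\n") for oid in oids)
    return wants + FLUSH_PKT + pkt_line("done\n")


# Placeholder object names, 40 hex digits each.
oids = ["a" * 40, "b" * 40]
request = build_want_request(oids)
print(request.decode("ascii"))
```

Because no "have" lines are sent, the server has nothing to negotiate
against and simply answers with a packfile for the wanted objects.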


Current Limitations
-------------------

- The remote used for a partial clone (or the first partial fetch
  following a regular clone) is marked as the "promisor remote".

  We are currently limited to a single promisor remote and only that
  remote may be used for subsequent partial fetches.

  We accept this limitation because we believe initial users of this
  feature will be using it on repositories with a strong single central
  server.

- Dynamic object fetching will only ask the promisor remote for missing
  objects. We assume that the promisor remote has a complete view of the
  repository and can satisfy all such requests.

- Repack essentially treats promisor and non-promisor packfiles as two
  distinct partitions and does not mix them. Repack currently only works
  on non-promisor packfiles and loose objects.

- Dynamic object fetching invokes fetch-pack once *for each item*
  because most algorithms stumble upon a missing object and need to have
  it resolved before continuing their work. This may incur significant
  overhead -- and multiple authentication requests -- if many objects are
  needed.

- Dynamic object fetching currently uses the existing pack protocol V0,
  which means that each object is requested via fetch-pack. The server
  will send a full set of info/refs when the connection is established.
  If there are a large number of refs, this may incur significant overhead.
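
One way to soften the per-object cost noted above is the bulk prefetch
described under "Handling Missing Objects": run rev-list with
"--missing=print" first, collect the object names it reports as missing
(they are printed with a leading "?"), and fetch them in a few large
requests. The sketch below only covers the collection step. The function
names, the batch size and the sample output are made up for illustration.

```python
def parse_missing(rev_list_output):
    """Return the OIDs flagged as missing, de-duplicated, in first-seen order."""
    seen = set()
    missing = []
    for line in rev_list_output.splitlines():
        if not line.startswith("?"):
            continue                      # object is present locally
        fields = line[1:].split()
        if fields and fields[0] not in seen:
            seen.add(fields[0])
            missing.append(fields[0])
    return missing


def batches(oids, size):
    """Split the OID list into chunks so each fetch request stays bounded."""
    return [oids[i:i + size] for i in range(0, len(oids), size)]


# Invented sample output: two present objects, two distinct missing ones.
SAMPLE = """\
1111111111111111111111111111111111111111
2222222222222222222222222222222222222222 src
?3333333333333333333333333333333333333333
?4444444444444444444444444444444444444444
?3333333333333333333333333333333333333333
"""

to_fetch = parse_missing(SAMPLE)
print(len(to_fetch), batches(to_fetch, 1000))
```

Each batch would then become a single fetch request instead of one
connection (and possibly one authentication prompt) per object.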


Future Work
-----------

- Allow more than one promisor remote and define a strategy for fetching
  missing objects from specific promisor remotes or for iterating over the
  set of promisor remotes until a missing object is found.

  A user might want to have multiple geographically-close cache servers
  for fetching missing blobs while continuing to do filtered `git-fetch`
  commands from the central server, for example.

  Or the user might want to work in a triangular workflow with multiple
  promisor remotes that each have an incomplete view of the repository.

- Allow repack to work on promisor packfiles (while keeping them distinct
  from non-promisor packfiles).

- Allow non-pathname-based filters to make use of packfile bitmaps (when
  present). This was just an omission during the initial implementation.

- Investigate use of a long-running process to dynamically fetch a series
  of objects, such as proposed in [5,6] to reduce process startup and
  overhead costs.

  It would be nice if pack protocol V2 could allow that long-running
  process to make a series of requests over a single long-running
  connection.

- Investigate pack protocol V2 to avoid the info/refs broadcast on
  each connection with the server to dynamically fetch missing objects.

- Investigate the need to handle loose promisor objects.

  Objects in promisor packfiles are allowed to reference missing objects
  that can be dynamically fetched from the server. An assumption was
  made that loose objects are only created locally and therefore should
  not reference a missing object. We may need to revisit that assumption
  if, for example, we dynamically fetch a missing tree and store it as a
  loose object rather than a single object packfile.

  This does not necessarily mean we need to mark loose objects as promisor;
  it may be sufficient to relax the object lookup or is-promisor functions.


Non-Tasks
---------

- Every time the subject of "demand loading blobs" comes up it seems
  that someone suggests that the server be allowed to "guess" and send
  additional objects that may be related to the requested objects.

  No work has gone into actually doing that; we're just documenting that
  it is a common suggestion. We're not sure how it would work and have
  no plans to work on it.

  It is valid for the server to send more objects than requested (even
  for a dynamic object fetch), but we are not building on that.


Footnotes
---------

[a] expensive-to-modify list of missing objects: Earlier in the design of
    partial clone we discussed the need for a single list of missing objects.
    This would essentially be a sorted linear list of OIDs that were
    omitted by the server during a clone or subsequent fetches.

    This file would need to be loaded into memory on every object lookup.
    It would need to be read, updated, and re-written (like the .git/index)
    on every explicit "git fetch" command *and* on any dynamic object fetch.

    The cost to read, update, and write this file could add significant
    overhead to every command if there are many missing objects. For example,
    if there are 100M missing blobs, this file would be at least 2GiB on disk.

    With the "promisor" concept, we *infer* a missing object based upon the
    type of packfile that references it.
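
As a rough check on the size estimate in footnote [a], the arithmetic can be
worked through under two assumed encodings. The footnote does not say how
the list would be stored, so both the 20-byte binary form and the
41-byte text form (40 hex digits plus a newline) are assumptions made here
for illustration.

```python
MISSING = 100_000_000   # 100M missing blobs, as in the footnote
GIB = 2 ** 30

raw_sha1 = MISSING * 20   # assumed: 20-byte binary SHA-1 per entry
hex_lines = MISSING * 41  # assumed: 40 hex digits plus a newline per entry

print(f"binary: {raw_sha1 / GIB:.2f} GiB, text: {hex_lines / GIB:.2f} GiB")
```

Under the binary assumption the list is 2,000,000,000 bytes (about 1.86
GiB), and the text form is roughly twice that. Either way the file is far
too large to read and rewrite on every fetch, which is the point the
footnote makes.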


Related Links
-------------
[0] https://bugs.chromium.org/p/git/issues/detail?id=2
    Chromium work item for: Partial Clone

[1] https://public-inbox.org/git/20170113155253.1644-1-benpeart@microsoft.com/
    Subject: [RFC] Add support for downloading blobs on demand
    Date: Fri, 13 Jan 2017 10:52:53 -0500

[2] https://public-inbox.org/git/cover.1506714999.git.jonathantanmy@google.com/
    Subject: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches)
    Date: Fri, 29 Sep 2017 13:11:36 -0700

[3] https://public-inbox.org/git/20170426221346.25337-1-jonathantanmy@google.com/
    Subject: Proposal for missing blob support in Git repos
    Date: Wed, 26 Apr 2017 15:13:46 -0700

[4] https://public-inbox.org/git/1488999039-37631-1-git-send-email-git@jeffhostetler.com/
    Subject: [PATCH 00/10] RFC Partial Clone and Fetch
    Date: Wed, 8 Mar 2017 18:50:29 +0000

[5] https://public-inbox.org/git/20170505152802.6724-1-benpeart@microsoft.com/
    Subject: [PATCH v7 00/10] refactor the filter process code into a reusable module
    Date: Fri, 5 May 2017 11:27:52 -0400

[6] https://public-inbox.org/git/20170714132651.170708-1-benpeart@microsoft.com/
    Subject: [RFC/PATCH v2 0/1] Add support for downloading blobs on demand
    Date: Fri, 14 Jul 2017 09:26:50 -0400