From: Junio C Hamano <junkio@cox.net>
Subject: Re: Make "git clone" less of a deathly quiet experience
Date: Sun, 12 Feb 2006 19:36:41 -0800
Message-ID: <7v4q3453qu.fsf@assigned-by-dhcp.cox.net>
References: <Pine.LNX.4.64.0602102018250.3691@g5.osdl.org>
    <7vwtg2o37c.fsf@assigned-by-dhcp.cox.net>
    <Pine.LNX.4.64.0602110943170.3691@g5.osdl.org>
    <1139685031.4183.31.camel@evo.keithp.com> <43EEAEF3.7040202@op5.se>
    <1139717510.4183.34.camel@evo.keithp.com>
    <46a038f90602121806jfcaac41tb98b8b4cd4c07c23@mail.gmail.com>
Content-Type: text/plain; charset=us-ascii
Cc: Keith Packard <keithp@keithp.com>, Andreas Ericsson <ae@op5.se>,
    Linus Torvalds <torvalds@osdl.org>,
    Git Mailing List <git@vger.kernel.org>,
    Petr Baudis <pasky@suse.cz>
Return-path: <git-owner@vger.kernel.org>
In-Reply-To: <46a038f90602121806jfcaac41tb98b8b4cd4c07c23@mail.gmail.com>
    (Martin Langhoff's message of "Mon, 13 Feb 2006 15:06:42 +1300")

Martin Langhoff <martin.langhoff@gmail.com> writes:

> +1... there should be an easy-to-compute threshold trigger to say --
> hey, let's quit being smart and send this client the packs we got and
> get it over with. Or perhaps a client flag so large projects can
> recommend that users do their initial clone with --gimme-all-packs?

What upload-pack does boils down to:

 * find out the latest of what the client has and what the client
   asked for.

 * run "rev-list --objects ^client ours" to make a list of the
   objects the client needs.  The actual command line has
   multiple "^client" arguments to exclude what does not need to
   be sent, and multiple "ours" arguments to include the refs
   asked for.  When you are doing a full clone, ^client is empty
   and ours is essentially --all.

 * feed that output to "pack-objects --stdout" and send out the
   result.

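The steps above can be sketched in shell.  This is only an
illustration, not what upload-pack actually executes: the helper
name rev_list_args and its two arguments are made up for this
example, and the refs are assumed to contain no whitespace or
glob characters.

```shell
# Hypothetical helper: print the rev-list arguments for one request.
#   $1 = space-separated refs the client asked for ("ours")
#   $2 = space-separated commits the client already has ("client")
rev_list_args() {
    wants=$1
    haves=$2
    args="--objects"
    # Each commit the client already has is excluded with a caret.
    for h in $haves; do
        args="$args ^$h"
    done
    # Each requested ref is included as a positive argument.
    for w in $wants; do
        args="$args $w"
    done
    printf '%s\n' "$args"
}
```

With such a helper the whole pipeline would look roughly like:

    $ git-rev-list $(rev_list_args "$wants" "$haves") |
      git-pack-objects --stdout

For a full clone the exclusion list is empty, so
rev_list_args "--all" "" prints "--objects --all".
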
If you run this command:

    $ git-rev-list --objects --all |
      git-pack-objects --stdout >/dev/null

it reports progress as it runs.  The messages below mark the
phases of the operation:

    Generating pack...
    Counting objects XXXX...
    Done counting XXXX objects.
    Packing XXXXX objects.....

Phase (1).  From the time it says "Generating pack..." up to
"Done counting XXXX objects.", the time is spent by rev-list
listing all the objects to be sent out.

Phase (2).  After that, it decides which object to delta
against which other object, while twenty or so dots are
printed after "Packing XXXXX objects." (see the #git irc log
from a couple of days ago; Linus describes how pack building
works).

Phase (3).  After the dots stop, the program becomes silent.
That is where it actually does the delta compression and
writeout.

You would notice that quite a lot of time is spent in every
phase.

There is an internal hook to create a full repository pack
inside upload-pack (which is what runs on the other end when
you run fetch-pack or clone-pack), but it works slightly
differently from what you are suggesting, in that it still
tries to do the "correct" thing.  It still runs "rev-list
--objects --all", so "dangling objects" are never sent out.

We could cheat in all phases to speed things up, at the expense
of ending up sending excess objects.  So let's pretend we
decided to treat everything in .git/objects/pack/pack-* (and
the packs found in alternates as well) as containing objects
that are interesting to the cloner.

(1) This part unfortunately cannot be totally eliminated.  By
    assuming all packs are interesting, we could use the object
    names from the pack index, which is a lot cheaper than
    rev-list object traversal.  We would still need to run
    "rev-list --objects --all --unpacked" to pick up the loose
    objects, which we cannot find by looking at the pack index.

    This however needs to be done in conjunction with the
    second phase change.  pack-objects depends on the hint
    that rev-list --objects output gives it to group the blobs
    and trees with the same pathnames together, and that
    greatly affects the packing efficiency.  Unfortunately the
    pack index does not have that information -- it knows
    neither types nor pathnames.  Type is relatively cheap to
    obtain, but pathnames for blob objects are inherently
    unavailable.

(2) This part can be mostly eliminated for already packed
    objects, because we have already decided to cheat by
    sending everything, so we can just reuse the way objects
    are deltified in the existing packs.  It still needs to be
    done for the loose objects we collected to fill the gap in
    (1).

(3) This also can be sped up by reusing what is already in the
    packs.  The pack index records the starting (but not the
    end) offset of each object in the pack, so we can sort by
    offset to find out which part of the existing pack
    corresponds to which object, and use that to reorder the
    objects in the final pack.  This needs to be done somewhat
    carefully to preserve the locality of objects (again, see
    the #git log).  The deltifying and compressing for loose
    objects cannot be avoided.

    While we are writing things out in (3), we need to keep
    track of the running SHA1 sum of what we write out so that
    we can fill in the correct checksum at the end, but I am
    guessing that is relatively cheap compared to the
    deltification and compression cost we are currently paying
    in this phase.

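The sort-by-offset idea in (3) can be illustrated with a small
shell sketch.  Since the index gives only the start of each
object, the end of an object is taken to be the start of the
next one, and the last object ends where the 20-byte SHA1
checksum at the end of the pack begins.  The function name
object_extents is hypothetical, and the input is assumed to be
"offset name" lines, similar to what git-show-index prints;
this is not code from git itself.

```shell
# Hypothetical helper: print "name start length" for every object.
#   stdin = index entries, one "offset name" pair per line, any order
#   $1    = total size of the pack file in bytes
object_extents() {
    pack_size=$1
    # The pack ends with a 20-byte SHA1 checksum; the object data
    # stops right before it.
    sort -n | awk -v limit="$((pack_size - 20))" '
        # An object ends where the next one (by offset) starts.
        NR > 1 { print prev_name, prev_off, $1 - prev_off }
        { prev_off = $1; prev_name = $2 }
        # The last object runs up to the trailing checksum.
        END { if (NR > 0) print prev_name, prev_off, limit - prev_off }
    '
}
```

For example, for a 420-byte pack whose index lists objects at
offsets 112, 12 and 300, the sketch reports lengths of 188, 100
and 100 bytes for them, printed in offset order.
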
NB.  In the #git log, Linus made it sound like I am clueless
about how a pack is generated, but if you check commit 9d5ab96,
the "recency of delta is inherited from base", one of the
tricks that have a big performance impact, was done by me ;-).