From: Junio C Hamano <junkio@cox.net>
Subject: Re: Make "git clone" less of a deathly quiet experience
Date: Sun, 12 Feb 2006 19:36:41 -0800
Message-ID: <7v4q3453qu.fsf@assigned-by-dhcp.cox.net>
References: <Pine.LNX.4.64.0602102018250.3691@g5.osdl.org>
<7vwtg2o37c.fsf@assigned-by-dhcp.cox.net>
<Pine.LNX.4.64.0602110943170.3691@g5.osdl.org>
<1139685031.4183.31.camel@evo.keithp.com> <43EEAEF3.7040202@op5.se>
<1139717510.4183.34.camel@evo.keithp.com>
<46a038f90602121806jfcaac41tb98b8b4cd4c07c23@mail.gmail.com>
Content-Type: text/plain; charset=us-ascii
Cc: Keith Packard <keithp@keithp.com>, Andreas Ericsson <ae@op5.se>,
Linus Torvalds <torvalds@osdl.org>,
Git Mailing List <git@vger.kernel.org>,
Petr Baudis <pasky@suse.cz>
Return-path: <git-owner@vger.kernel.org>
In-Reply-To: <46a038f90602121806jfcaac41tb98b8b4cd4c07c23@mail.gmail.com>
(Martin Langhoff's message of "Mon, 13 Feb 2006 15:06:42 +1300")

Martin Langhoff <martin.langhoff@gmail.com> writes:

> +1... there should be an easy-to-compute threshold trigger to say --
> hey, let's quit being smart and send this client the packs we got and
> get it over with. Or perhaps a client flag so large projects can
> recommend that users do their initial clone with --gimme-all-packs?

What upload-pack does boils down to:

 * find out the latest of what client has and what client asked.

 * run "rev-list --objects ^client ours" to make a list of
   objects client needs.  The actual command line has multiple
   "clients" to exclude what is unneeded to be sent, and
   multiple "ours" to include refs asked.  When you are doing
   a full clone, ^client is empty and ours is essentially
   --all.

 * feed that output to "pack-objects --stdout" and send out
   the result.
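
To make the "^client ours" step concrete, here is a minimal sketch in Python of the set difference it computes.  It is a toy model only: the graph is a hypothetical dict mapping each object name to the objects it refers to, and the function names are made up for illustration; the real rev-list walks commits, trees and blobs in the object database and treats the exclusion side less exhaustively.

```python
def reachable(graph, tips):
    """Return every object reachable from the given tips.

    graph is a hypothetical dict: object name -> names it refers to
    (commit -> parents and tree, tree -> blobs and subtrees).
    """
    seen = set()
    stack = list(tips)
    while stack:
        obj = stack.pop()
        if obj in seen:
            continue
        seen.add(obj)
        stack.extend(graph.get(obj, ()))
    return seen


def objects_to_send(graph, ours, client):
    """Objects reachable from 'ours' but not from what the client has."""
    return reachable(graph, ours) - reachable(graph, client)
```

With an empty client list this returns everything reachable from the requested refs, which is the full-clone case.  An object that nothing refers to never shows up in the result.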

If you run this command:

    $ git-rev-list --objects --all |
      git-pack-objects --stdout >/dev/null

It would say some things.  The phases of operations are:

    Generating pack...
    Counting objects XXXX...
    Done counting XXXX objects.
    Packing XXXXX objects.....

Phase (1).  Between the time it says "Generating pack..." and
"Done counting XXXX objects.", the time is spent by rev-list
listing all the objects to be sent out.

Phase (2).  After that, it decides which object to delta against
which other object, while twenty or so dots are printed after
"Packing XXXXX objects." (see #git irc log a couple of days ago;
Linus describes how pack building works).

Phase (3).  After the dots stop, the program becomes silent.
That is where it actually does delta compression and write-out.

You would notice that quite a lot of time is spent in all
phases.

There is an internal hook to create full repository pack inside
upload-pack (which is what runs on the other end when you run
fetch-pack or clone-pack), but it works slightly differently
from what you are suggesting, in that it still tries to do the
"correct" thing.  It still runs "rev-list --objects --all", so
"dangling objects" are never sent out.

We could cheat in all phases to speed things up, at the expense
of ending up sending excess objects.  So let's pretend we
decided to treat everything in .git/objects/pack/pack-* (and
the ones found in alternates as well) as having interesting
objects for the cloner.

(1) This part unfortunately cannot be totally eliminated.  By
    assuming all packs are interesting, we could use the object
    names from the pack index, which is a lot cheaper than
    rev-list object traversal.  To cover the rest, we still need
    to run rev-list --objects --all --unpacked to pick up loose
    objects, which we cannot find by looking at the pack index.

    This however needs to be done in conjunction with the second
    phase change.  pack-objects depends on the hint rev-list
    --objects output gives it to group the blobs and trees with
    the same pathnames together, and that greatly affects the
    packing efficiency.  Unfortunately pack index does not have
    that information -- it does not know type, nor pathnames.
    Type is relatively cheap to obtain but pathnames for blob
    objects are inherently unavailable.

(2) This part can be mostly eliminated for already packed
    objects, because we have already decided to cheat by sending
    everything, so we can just reuse how objects are deltified
    in existing packs.  It still needs to be done for loose
    objects we collected to fill the gap in (1).

(3) This also can be sped up by reusing what is already in
    packs.  Pack index records the starting (but not the end)
    offset of each object in the pack, so we can sort by offset
    to find out which part of the existing pack corresponds to
    which object, to reorder the objects in the final pack.
    This needs to be done somewhat carefully to preserve the
    locality of objects (again, see #git log).  The deltifying
    and compressing for loose objects cannot be avoided.

    While we are writing things out in (3), we need to keep
    track of running SHA1 sum of what we write out so that we
    can fill out the correct checksum at the end, but I am
    guessing that is relatively cheap compared to the
    deltification and compression cost we are currently paying
    in this phase.
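
The two mechanical pieces of (3), recovering each object's extent from start offsets alone and keeping a running SHA1 while writing, can be sketched as follows.  This is an illustration in Python under stated assumptions, not how pack-objects is implemented: the names object_extents, ChecksummingWriter and copy_objects are hypothetical, the index is modelled as a plain list of (name, start offset) pairs, object data is treated as opaque bytes, and the pack header (with its object count) is left out.  The only format fact relied on is that a pack ends with a 20-byte SHA1 of everything before it.

```python
import hashlib

TRAILER_LEN = 20  # a pack ends with a 20-byte SHA1 of everything before it


def object_extents(index_entries, pack_size):
    """Map object name -> (start, length) given only start offsets.

    index_entries is a list of (name, start_offset) pairs.  The end of
    each object is inferred from the start of the next one in offset
    order; the last object ends where the trailing checksum begins.
    """
    by_offset = sorted(index_entries, key=lambda entry: entry[1])
    end_of_data = pack_size - TRAILER_LEN
    extents = {}
    for i, (name, start) in enumerate(by_offset):
        if i + 1 < len(by_offset):
            end = by_offset[i + 1][1]
        else:
            end = end_of_data
        extents[name] = (start, end - start)
    return extents


class ChecksummingWriter:
    """Write bytes to 'out' while keeping a running SHA1 of them."""

    def __init__(self, out):
        self.out = out
        self.sha1 = hashlib.sha1()

    def write(self, data):
        self.sha1.update(data)
        self.out.write(data)

    def finish(self):
        # Append the checksum of everything written so far.
        self.out.write(self.sha1.digest())


def copy_objects(old_pack, index_entries, order, out):
    """Copy already-packed object bytes into 'out' in the given order."""
    extents = object_extents(index_entries, len(old_pack))
    writer = ChecksummingWriter(out)
    for name in order:
        start, length = extents[name]
        writer.write(old_pack[start:start + length])
    writer.finish()
```

The hash is updated once per chunk as it goes out, so the checksum costs one extra pass over bytes that are being written anyway; nothing is decompressed or re-deltified for objects copied this way.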

NB. In the #git log, Linus made it sound like I am clueless
about how a pack is generated, but if you check commit 9d5ab96,
the "recency of delta is inherited from base", one of the tricks
that have a big performance impact, was done by me ;-).