You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
126 lines
5.4 KiB
126 lines
5.4 KiB
![]()
19 years ago
|
From: Junio C Hamano <junkio@cox.net>
|
||
|
Subject: Re: Make "git clone" less of a deathly quiet experience
|
||
|
Date: Sun, 12 Feb 2006 19:36:41 -0800
|
||
|
Message-ID: <7v4q3453qu.fsf@assigned-by-dhcp.cox.net>
|
||
|
References: <Pine.LNX.4.64.0602102018250.3691@g5.osdl.org>
|
||
|
<7vwtg2o37c.fsf@assigned-by-dhcp.cox.net>
|
||
|
<Pine.LNX.4.64.0602110943170.3691@g5.osdl.org>
|
||
|
<1139685031.4183.31.camel@evo.keithp.com> <43EEAEF3.7040202@op5.se>
|
||
|
<1139717510.4183.34.camel@evo.keithp.com>
|
||
|
<46a038f90602121806jfcaac41tb98b8b4cd4c07c23@mail.gmail.com>
|
||
|
Content-Type: text/plain; charset=us-ascii
|
||
|
Cc: Keith Packard <keithp@keithp.com>, Andreas Ericsson <ae@op5.se>,
|
||
|
Linus Torvalds <torvalds@osdl.org>,
|
||
|
Git Mailing List <git@vger.kernel.org>,
|
||
|
Petr Baudis <pasky@suse.cz>
|
||
|
Return-path: <git-owner@vger.kernel.org>
|
||
|
In-Reply-To: <46a038f90602121806jfcaac41tb98b8b4cd4c07c23@mail.gmail.com>
|
||
|
(Martin Langhoff's message of "Mon, 13 Feb 2006 15:06:42 +1300")
|
||
|
|
||
|
Martin Langhoff <martin.langhoff@gmail.com> writes:
|
||
|
|
||
|
> +1... there should be an easy-to-compute threshold trigger to say --
|
||
|
> hey, let's quit being smart and send this client the packs we got and
|
||
|
> get it over with. Or perhaps a client flag so large projects can
|
||
|
> recommend that uses do their initial clone with --gimme-all-packs?
|
||
|
|
||
|
What upload-pack does boils down to:
|
||
|
|
||
|
* find out the latest of what client has and what client asked.
|
||
|
|
||
|
* run "rev-list --objects ^client ours" to make a list of
|
||
|
objects client needs. The actual command line has multiple
|
||
|
"clients" to exclude what is unneeded to be sent, and
|
||
|
multiple "ours" to include refs asked. When you are doing
|
||
|
a full clone, ^client is empty and ours is essentially
|
||
|
--all.
|
||
|
|
||
|
* feed that output to "pack-objects --stdout" and send out
|
||
|
the result.
|
||
|
|
||
|
If you run this command:
|
||
|
|
||
|
$ git-rev-list --objects --all |
|
||
|
git-pack-objects --stdout >/dev/null
|
||
|
|
||
|
It would say some things. The phases of operations are:
|
||
|
|
||
|
Generating pack...
|
||
|
Counting objects XXXX...
|
||
|
Done counting XXXX objects.
|
||
|
Packing XXXXX objects.....
|
||
|
|
||
|
Phase (1). Between the time it says "Generating pack..." upto
|
||
|
"Done counting XXXX objects.", the time is spent by rev-list to
|
||
|
list up all the objects to be sent out.
|
||
|
|
||
|
Phase (2). After that, it tries to make decision what object to
|
||
|
delta against what other object, while twenty or so dots are
|
||
|
printed after "Packing XXXXX objects." (see #git irc log a
|
||
|
couple of days ago; Linus describes how pack building works).
|
||
|
|
||
|
Phase (3). After the dot stops, the program becomes silent.
|
||
|
That is where it actually does delta compression and writeout.
|
||
|
|
||
|
You would notice that quite a lot of time is spent in all
|
||
|
phases.
|
||
|
|
||
|
There is an internal hook to create full repository pack inside
|
||
|
upload-pack (which is what runs on the other end when you run
|
||
|
fetch-pack or clone-pack), but it works slightly differently
|
||
|
from what you are suggesting, in that it still tries to do the
|
||
|
"correct" thing. It still runs "rev-list --objects --all", so
|
||
|
"dangling objects" are never sent out.
|
||
|
|
||
|
We could cheat in all phases to speed things up, at the expense
|
||
|
of ending up sending excess objects. So let's pretend we
|
||
|
decided to treat everything in .git/objects/packs/pack-* (and
|
||
|
the ones found in alternates as well) have interesting objects
|
||
|
for the cloner.
|
||
|
|
||
|
(1) This part unfortunately cannot be totally eliminated. By
|
||
|
assume all packs are interesting, we could use the object
|
||
|
names from the pack index, which is a lot cheaper than
|
||
|
rev-list object traversal. We still need to run rev-list
|
||
|
--objects --all --unpacked to pick up loose objects we would
|
||
|
not be able to tell by looking at the pack index to cover
|
||
|
the rest.
|
||
|
|
||
|
This however needs to be done in conjunction with the second
|
||
|
phase change. pack-objects depends on the hint rev-list
|
||
|
--objects output gives it to group the blobs and trees with
|
||
|
the same pathnames together, and that greatly affects the
|
||
|
packing efficiency. Unfortunately pack index does not have
|
||
|
that information -- it does not know type, nor pathnames.
|
||
|
Type is relatively cheap to obtain but pathnames for blob
|
||
|
objects are inherently unavailable.
|
||
|
|
||
|
(2) This part can be mostly eliminated for already packed
|
||
|
objects, because we have already decided to cheat by sending
|
||
|
everything, so we can just reuse how objects are deltified
|
||
|
in existing packs. It still needs to be done for loose
|
||
|
objects we collected to fill the gap in (1).
|
||
|
|
||
|
(3) This also can be sped up by reusing what are already in
|
||
|
packs. Pack index records starting (but not end) offset of
|
||
|
each object in the pack, so we can sort by offset to find
|
||
|
out which part of the existing pack corresponds to what
|
||
|
object, to reorder the objects in the final pack. This
|
||
|
needs to be done somewhat carefully to preserve the locality
|
||
|
of objects (again, see #git log). The deltifying and
|
||
|
compressing for loose objects cannot be avoided.
|
||
|
|
||
|
While we are writing things out in (3), we need to keep
|
||
|
track of running SHA1 sum of what we write out so that we
|
||
|
can fill out the correct checksum at the end, but I am
|
||
|
guessing that is relatively cheap compared to the
|
||
|
deltification and compression cost we are currently paying
|
||
|
in this phase.
|
||
|
|
||
|
NB. In the #git log, Linus made it sound like I am clueless
|
||
|
about how pack is generated, but if you check commit 9d5ab96,
|
||
|
the "recency of delta is inherited from base", one of the tricks
|
||
|
that have a big performance impact, was done by me ;-).
|
||
|
|
||
|
|