This introduces --no-reuse-delta option to disable reusing of
existing delta, which is a large part of the optimization
introduced by this series. This may become necessary if
repeated repacking makes delta chain too long. With this, the
output of the command becomes identical to that of the older
implementation. But the performance suffers greatly.
It still allows reusing non-deltified representations; there is
no point uncompressing and recompressing the whole text.
It also adds a couple more statistics output, while squelching
it under -q flag, which the last round forgot to do.
$ time old-git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
real 12m8.530s user 11m1.450s sys 0m57.920s
$ time git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 138297), reused 178833 (delta 134081)
real 0m59.549s user 0m56.670s sys 0m2.400s
$ time git-pack-objects --stdout --no-reuse-delta >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 134833), reused 47904 (delta 0)
real 11m13.830s user 9m45.240s sys 0m44.330s
There is one remaining issue when --no-reuse-delta option is not
used. It can create delta chains that are deeper than specified.
A<--B<--C<--D E F G
Suppose we have a delta chain A to D (A is stored in full either
in a pack or as a loose object. B is depth1 delta relative to A,
C is depth2 delta relative to B...) with loose objects E, F, G.
And we are going to pack all of them.
B, C and D are left as delta against A, B and C respectively.
So A, E, F, and G are examined for deltification, and let's say
we decided to keep E expanded, and store the rest as deltas like
this:
E<--F<--G<--A
Oops. We ended up making D a bit too deep, didn't we? B, C and
D form a chain on top of A!
This is because we did not know what the final depth of A would
be, when we checked objects and decided to keep the existing
delta. Unfortunately, deferring the decision until just before
the deltification is not an option. To be able to make B, C,
and D candidates for deltification with the rest, we need to
know the type and final unexpanded size of them, but the major
part of the optimization comes from the fact that we do not read
the delta data to do so -- getting the final size is quite an
expensive operation.
To prevent this from happening, we should keep A from being
deltified. But how would we tell that, cheaply?
To do this most precisely, after check_object() runs, each
object that is used as the base object of some existing delta
needs to be marked with the maximum depth of the objects we
decided to keep deltified (in this case, D is depth 3 relative
to A, so if no other delta chain that is longer than 3 based on
A exists, mark A with 3). Then when attempting to deltify A, we
would take that number into account to see if the final delta
chain that leads to D becomes too deep.
However, this is a bit cumbersome to compute, so we would cheat
and reduce the maximum depth for A arbitrarily to depth/4 in
this implementation.
Signed-off-by: Junio C Hamano <junkio@cox.net>
When generating a new pack, notice if we have already needed
objects in existing packs. If an object is stored deltified,
and its base object is also what we are going to pack, then
reuse the existing deltified representation unconditionally,
bypassing all the expensive find_deltas() and try_deltas()
calls.
Also, notice if what we are going to write out exactly match
what is already in an existing pack (either deltified or just
compressed). In such a case, we can just copy it instead of
going through the usual uncompressing & recompressing cycle.
Without this patch, in linux-2.6 repository with about 1500
loose objects and a single mega pack:
$ git-rev-list --objects v2.6.16-rc3 >RL
$ wc -l RL
184141 RL
$ time git-pack-objects p <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
real 12m4.323s
user 11m2.560s
sys 0m55.950s
With this patch, the same input:
$ time ../git.junio/git-pack-objects q <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
Total 184141, written 184141, reused 182441
real 1m2.608s
user 0m55.090s
sys 0m1.830s
Signed-off-by: Junio C Hamano <junkio@cox.net>
This change provides a 8% saving on the pack size with a 4% CPU time
increase for git-repack -a on the current git archive.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This goes together with "rev-list --object-edge" change, to feed
pack-objects list of edge commits in addition to the usual
object list. Upon seeing such list, pack-objects loosens the
usual "self contained delta" constraints, and can produce delta
against blobs and trees contained in the edge commits without
storing the delta base objects themselves.
The resulting packfile is not usable in .git/object/packs, but
is a good way to implement "delta-only" transfer.
Signed-off-by: Junio C Hamano <junkio@cox.net>
This tries to rework the solution for the excess delta chain
problem. An earlier commit worked it around ``cheaply'', but
repeated repacking risks unbound growth of delta chains.
This version counts the length of delta chain we are reusing
from the existing pack, and makes sure a base object that has
sufficiently long delta chain does not get deltified.
Signed-off-by: Junio C Hamano <junkio@cox.net>
This introduces --no-reuse-delta option to disable reusing of
existing delta, which is a large part of the optimization
introduced by this series. This may become necessary if
repeated repacking makes delta chain too long. With this, the
output of the command becomes identical to that of the older
implementation. But the performance suffers greatly.
It still allows reusing non-deltified representations; there is
no point uncompressing and recompressing the whole text.
It also adds a couple more statistics output, while squelching
it under -q flag, which the last round forgot to do.
$ time old-git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
real 12m8.530s user 11m1.450s sys 0m57.920s
$ time git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 138297), reused 178833 (delta 134081)
real 0m59.549s user 0m56.670s sys 0m2.400s
$ time git-pack-objects --stdout --no-reuse-delta >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 134833), reused 47904 (delta 0)
real 11m13.830s user 9m45.240s sys 0m44.330s
There is one remaining issue when --no-reuse-delta option is not
used. It can create delta chains that are deeper than specified.
A<--B<--C<--D E F G
Suppose we have a delta chain A to D (A is stored in full either
in a pack or as a loose object. B is depth1 delta relative to A,
C is depth2 delta relative to B...) with loose objects E, F, G.
And we are going to pack all of them.
B, C and D are left as delta against A, B and C respectively.
So A, E, F, and G are examined for deltification, and let's say
we decided to keep E expanded, and store the rest as deltas like
this:
E<--F<--G<--A
Oops. We ended up making D a bit too deep, didn't we? B, C and
D form a chain on top of A!
This is because we did not know what the final depth of A would
be, when we checked objects and decided to keep the existing
delta. Unfortunately, deferring the decision until just before
the deltification is not an option. To be able to make B, C,
and D candidates for deltification with the rest, we need to
know the type and final unexpanded size of them, but the major
part of the optimization comes from the fact that we do not read
the delta data to do so -- getting the final size is quite an
expensive operation.
To prevent this from happening, we should keep A from being
deltified. But how would we tell that, cheaply?
To do this most precisely, after check_object() runs, each
object that is used as the base object of some existing delta
needs to be marked with the maximum depth of the objects we
decided to keep deltified (in this case, D is depth 3 relative
to A, so if no other delta chain that is longer than 3 based on
A exists, mark A with 3). Then when attempting to deltify A, we
would take that number into account to see if the final delta
chain that leads to D becomes too deep.
However, this is a bit cumbersome to compute, so we would cheat
and reduce the maximum depth for A arbitrarily to depth/4 in
this implementation.
Signed-off-by: Junio C Hamano <junkio@cox.net>
When generating a new pack, notice if we have already needed
objects in existing packs. If an object is stored deltified,
and its base object is also what we are going to pack, then
reuse the existing deltified representation unconditionally,
bypassing all the expensive find_deltas() and try_deltas()
calls.
Also, notice if what we are going to write out exactly match
what is already in an existing pack (either deltified or just
compressed). In such a case, we can just copy it instead of
going through the usual uncompressing & recompressing cycle.
Without this patch, in linux-2.6 repository with about 1500
loose objects and a single mega pack:
$ git-rev-list --objects v2.6.16-rc3 >RL
$ wc -l RL
184141 RL
$ time git-pack-objects p <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
real 12m4.323s
user 11m2.560s
sys 0m55.950s
With this patch, the same input:
$ time ../git.junio/git-pack-objects q <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
Total 184141, written 184141, reused 182441
real 1m2.608s
user 0m55.090s
sys 0m1.830s
Signed-off-by: Junio C Hamano <junkio@cox.net>
You could give -q to squelch it, but currently no tool does it.
This would make 'git clone host:repo here' over ssh not silent
again.
Signed-off-by: Junio C Hamano <junkio@cox.net>
This provides (minimal) documentation for the --non-empty command-line
option to the pack-objects command.
Signed-off-by: Nikolai Weibull <nikolai@bitwi.se>
Signed-off-by: Junio C Hamano <junkio@cox.net>
These commands are converted to run from a subdirectory.
commit-tree convert-objects merge-base merge-index mktag
pack-objects pack-redundant prune-packed read-tree tar-tree
unpack-file unpack-objects update-server-info write-tree
Signed-off-by: Junio C Hamano <junkio@cox.net>
In a corrupt repository, git-repack produces a pack that does not
contain needed objects without complaining, and the result of this
combined with -d flag can be very painful -- e.g. a lossage of one
tree object can lead to lossage of blobs reachable only through that
tree.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
git-pack-objects can reuse pack files stored in $GIT_DIR/pack-cache
directory, when a necessary pack is found. This is hopefully useful
when upload-pack (called from git-daemon) is expected to receive
requests for the same set of objects many times (e.g full cloning
request of any project, or updates from the set of heads previous day
to the latest for a slow moving project).
Currently git-pack-objects does *not* keep pack files it creates for
reusing. It might be useful to add --update-cache option to it,
which would allow it store pack files it created in the pack-cache
directory, and prune rarely used ones from it.
Signed-off-by: Junio C Hamano <junkio@cox.net>
Do our own ctype.h, just to get the sane semantics: we want
locale-independence, _and_ we want the right signed behaviour. Plus we
only use a very small subset of ctype.h anyway (isspace, isalpha,
isdigit and isalnum).
Signed-off-by: Junio C Hamano <junkio@cox.net>
This adds the "--local" flag to git-pack-objects, which acts like
"--incremental", except that instead of ignoring all packed objects, it
only ignores objects that are packed and in an alternate object tree.
As a result, it effectively only does a local re-pack: any remote-packed
objects will stay in the alternate object directories.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This changes the generation of hash packfiles have in their names, from
"hash of object names as fed to us" to "hash of object names in the
resulting pack, in the order they appear in the index file". The new
"git-index-pack" command is taught to output the computed hash value
to its standard output.
With this, we can store downloaded pack in a temporary file without
knowing its final name, run git-index-pack to generate idx for it
while finding out its final name, and then rename the pack and idx to
their final names.
Signed-off-by: Junio C Hamano <junkio@cox.net>
find_deltas() should free its temporary objects before returning.
[jc: Sergey, if you have [PATCH] title on the Subject line of your
e-mail, please do not repeat it on the first line in your message
body. Thanks.]
Signed-off-by: Sergey Vlasov <vsu@altlinux.ru>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This means that the .git/objects/pack directory is also rsync'able,
since the filenames created there-in are either unique or refer to the
same data.
Otherwise you might not be able to pull from a directory that is partly
packed without having to worry about missing objects due to pack-file
name clashes.
This is a wrap-up patch including all the cleanups I've done to the
delta code and its usage. The most important change is the
factorization of the delta header handling code.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This makes it match the new delta encoding, and admittedly makes the
code easier to follow.
This also updates the PACK file version to 2, since this (and the delta
encoding change in the previous commit) are incompatible with the old
format.
Deltas are useless by themselves and when you use them you need to get
to their base objects. A base object should inherit recency from the
most recent deltified object that is based on it and that is what this
patch teaches git-pack-objects.
Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Standalone unpack-objects command was not adjusted for header length
encoding change when dealing with deltified entry. This fixes it.
Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This also adds a header with a signature, version info, and the number
of objects to the pack file. It also encodes the file length and type
more efficiently.
This lets us eliminate one use of map_sha1_file() outside
sha1_file.c, to bring us one step closer to the packed GIT.
Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Packed delta files created by git-pack-objects seems to be the
way to go, and existing "delta" object handling code has exposed
the object representation details to too many places. Remove it
while we refactor code to come up with a proper interface in
sha1_file.c.
Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Also, make the writing of the SHA1 as a end-header be conditional: not
every user will necessarily want to write the SHA1 to the file itself,
even though current users do (but we migh end up using the same helper
functions for the object files themselves, that don't do this).
This also makes the packed index file contain the SHA1 of the packed
data file at the end (just before its own SHA1). That way you can
validate the pairing of the two if you want to.
We want to be able to check their integrity later, and putting the
sha1-sum of the contents at the end is a good thing. The writing
routines are generic, so we could try to re-use them for the index file,
instead of having the same logic duplicated.
Update unpack-objects to know about the extra 20 bytes at the end
of the index.
Starting from big objects and going backwards means that we end up
picking a delta that goes from a bigger object to a smaller one. That's
advantageous for two reasons: the bigger object is likely the newer one
(since things tend to grow, rather than shrink), and doing a delete
tends to be smaller than doing an add.
So the deltas don't tend to be top-of-tree, and the packed end result is
just slightly smaller.
This actually successfully packed and unpacked a git archive down to
1.3MB (17MB unpacked).
Right now unpacking is way too noisy, lots of debug messages left.
This finishes the initial round of git-pack-object /
git-unpack-object pair. They are now good enough to be used as
a transport medium:
- Fix delta direction in pack-objects; the original was
computing delta to create the base object from the object to
be squashed, which was quite unfriendly for unpacker ;-).
- Add a script to test the very basics.
- Implement unpacker for both regular and deltified objects.
Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A zero disables delta generation (like before), but we make the window
be one bigger than specified, since we use one entry for the one to be
tested (it used to be that "--window=1" was meaningless, since we'd have
used up the single-entry window with the entry to be tested, and had no
chance of actually ever finding a delta).
The default window remains at 10, but now it really means "test the 10
closest objects", not "test the 9 closest objects".
Anything that generates a delta to see if two objects are close usually
isn't interested in the delta ends up being bigger than some specified
size, and this allows us to stop delta generation early when that
happens.
When Junio fixed the lack of a successful error code from try_delta(),
that uncovered an off-by-one error in the caller.
Also, some testing made it clear that we now find a lot more deltas,
because we used to (incorrectly) break early on bogus "failure"
cases.
Return value of try_delta is checked for negativeness, but the
success path does not return anything, letting compiler warn and
presumably return garbage.
Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is kind of like a tar-ball for a set of objects, ready to be
shipped off to another end. Alternatively, you could use is as a packed
representation of the object database directly, if you changed
"read_sha1_file()" to read these kinds of packs.
The latter is partiularly useful to generate a "packed history", ie you
could pack up your old history efficiently, but still have it available
(at a performance hit, of course).
I haven't actually written an unpacker yet, so the end result has not
been verified in any way yet. I obviously always write bug-free code,
so it just has to work, no?