kernel/git - git - PowerEL Git System

Commit Graph

Author	SHA1	Message	Date
Nicolas Pitre	03aa8ff3be	Nicolas Pitre has a new email address Due to problems at cam.org, my nico@cam.org email address is no longer valid. From now on, nico@fluxnic.net should be used instead. Signed-off-by: Nicolas Pitre <nico@fluxnic.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	16 years ago
Nicolas Pitre	ce85b053d8	fix style of a few comments in diff-delta.c Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>	17 years ago
Pierre Habouzit	f9c5a80cdf	Fix segfault in diff-delta.c when FLEX_ARRAY is 1 aka don't do pointer arithmetics on structs that have a FLEX_ARRAY member, or you'll end up believing your array is 1 cell off its real address. Signed-off-by: Pierre Habouzit <madcoder@debian.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>	17 years ago
David Kastrup	02e665ce49	diff-delta.c: Rationalize culling of hash buckets The previous hash bucket culling resulted in a somewhat unpredictable number of hash bucket entries in the order of magnitude of HASH_LIMIT. Replace this with a Bresenham-like algorithm leaving us with exactly HASH_LIMIT entries by uniform culling. Signed-off-by: David Kastrup <dak@gnu.org> Acked-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>	18 years ago
David Kastrup	d2100860fd	diff-delta.c: pack the index structure In normal use cases, the performance wins are not overly impressive: we get something like 5-10% due to the slightly better locality of memory accesses using the packed structure. However, since the data structure for index entries saves 33% of memory on 32-bit platforms and 40% on 64-bit platforms, the behavior when memory gets limited should be nicer. This is a rather well-contained change. One obvious improvement would be sorting the elements in one bucket according to their hash, then using binary probing to find the elements with the right hash value. As it stands, the output should be strictly the same as previously unless one uses the option for limiting the amount of used memory, in which case the created packs might be better. Signed-off-by: David Kastrup <dak@gnu.org> Acked-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>	18 years ago
David Kastrup	b1d884a9e3	diff-delta.c: Fix broken skip calculation. A particularly bad case was HASH_LIMIT <= hash_count[i] < 2HASH_LIMIT: in that case, only a single hash survived. For larger cases, 2HASH_LIMIT was the actual limiting value after pruning. Signed-off-by: David Kastrup <dak@gnu.org> Acked-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>	18 years ago
Brian Downing	11779e7907	Support fetching the memory usage of a delta index Delta indices, at least on 64-bit platforms, tend to be larger than the actual uncompressed data. As such, keeping track of this storage is important if you want to successfully limit the memory size of your pack window. Squirrel away the total allocation size inside the delta_index struct, and add an accessor "sizeof_delta_index" to access it. Signed-off-by: Brian Downing <bdowning@lavos.net> Acked-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>	18 years ago
Martin Koegler	b75c6c6de1	diff-delta: use realloc instead of xrealloc Commit `83572c1a91` changed many realloc to xrealloc. This change was made in diff-delta.c too, although the code can handle an out of memory failure. This patch reverts this change in diff-delta.c. Signed-off-by: Martin Koegler <mkoegler@auto.tuwien.ac.at> Signed-off-by: Junio C Hamano <junkio@cox.net>	18 years ago
Nicolas Pitre	366b53c170	update diff-delta.c copyright There is actually nothing left from the original LibXDiff code I used over 2 years ago, and even the GIT implementation has diverged quite a bit from LibXDiff's at this point. Let's update the copyright notice to better reflect that fact. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>	18 years ago
Nicolas Pitre	843366961c	improve delta long block matching with big files Martin Koegler noted that create_delta() performs a new hash lookup after every block copy encoding which are currently limited to 64KB. In case of larger identical blocks, the next hash lookup would normally point to the next 64KB block in the reference buffer and multiple block copy operations will be consecutively encoded. It is however possible that the reference buffer be sparsely indexed if hash buckets have been trimmed down in create_delta_index() when hashing of the reference buffer isn't well balanced. In that case the hash lookup following a block copy might fail to match anything and the fact that the reference buffer still matches beyond the previous 64KB block will be missed. Let's rework the code so that buffer comparison isn't bounded to 64KB anymore. The match size should be as large as possible up front and only then should multiple block copy be encoded to cover it all. Also, fewer hash lookups will be performed in the end. According to Martin, this patch should reduce his 92MB pack down to 75MB with the dataset he has. Tests performed on the Linux kernel repo show a slightly smaller pack and a slightly faster repack. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>	18 years ago
Junio C Hamano	85023577a8	simplify inclusion of system header files. This is a mechanical clean-up of the way *.c files include system header files. (1) sources under compat/, platform sha-1 implementations, and xdelta code are exempt from the following rules; (2) the first #include must be "git-compat-util.h" or one of our own header file that includes it first (e.g. config.h, builtin.h, pkt-line.h); (3) system headers that are included in "git-compat-util.h" need not be included in individual C source files. (4) "git-compat-util.h" does not have to include subsystem specific header files (e.g. expat.h). Signed-off-by: Junio C Hamano <junkio@cox.net>	18 years ago
Junio C Hamano	29f049a0c2	Revert "move pack creation to version 3" This reverts commit `16854571aa`. Git as recent as v1.1.6 do not understand version 3 delta. v1.2.0 is Ok and I personally would say it is old enough, but the improvement between version 2 and version 3 delta is not bit enough to justify breaking older clients. We should resurrect this later, but when we do so we shold make it conditional. Signed-off-by: Junio C Hamano <junkio@cox.net>	18 years ago
Nicolas Pitre	16854571aa	move pack creation to version 3 It's been quite a while now that GIT is able to read version 3 packs. Let's create them at last. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>	19 years ago
Jonas Fonseca	83572c1a91	Use xrealloc instead of realloc Change places that use realloc, without a proper error path, to instead use xrealloc. Drop an erroneous error path in the daemon code that used errno in the die message in favour of the simpler xrealloc. Signed-off-by: Jonas Fonseca <fonseca@diku.dk> Signed-off-by: Junio C Hamano <junkio@cox.net>	19 years ago
Pierre Habouzit	b05faa2da9	Fix a comparison bug in diff-delta.c (1 << i) < hspace is compared in the `int` space rather that in the unsigned one. the result will be wrong if hspace is between 0x40000000 and 0x80000000. Signed-off-by: Pierre Habouzit <madcoder@debian.org> Signed-off-by: Junio C Hamano <junkio@cox.net>	19 years ago
Pavel Roskin	82e5a82fd7	Fix more typos, primarily in the code The only visible change is that git-blame doesn't understand "--compability" anymore, but it does accept "--compatibility" instead, which is already documented. Signed-off-by: Pavel Roskin <proski@gnu.org> Signed-off-by: Junio C Hamano <junkio@cox.net>	19 years ago
Florian Forster	1d7f171c3a	Remove all void-pointer arithmetic. ANSI C99 doesn't allow void-pointer arithmetic. This patch fixes this in various ways. Usually the strategy that required the least changes was used. Signed-off-by: Florian Forster <octo@verplant.org> Signed-off-by: Junio C Hamano <junkio@cox.net>	19 years ago
Florian Forster	63f175693e	Initialize FAMs using `FLEX_ARRAY'. When initializing a `flexible array member' the macro `FLEX_ARRAY' should be used. This was forgotten in `diff-delta.c'. Signed-off-by: Florian Forster <octo@verplant.org> Signed-off-by: Junio C Hamano <junkio@cox.net>	19 years ago
Nicolas Pitre	639ca54972	fix diff-delta bad memory access It cannot be assumed that the given buffer will never be moved when shrinking the allocated memory size with realloc(). So let's ignore that optimization for now. This patch makes Electric Fence happy on Linux. Signed-off-by: Nicolas Pitre <nico@cam.org> Acked-by: Linus Torvalds <torvalds@osdl.org> Signed-off-by: Junio C Hamano <junkio@cox.net>	19 years ago
Junio C Hamano	ac4c758adc	delta: stricter constness Signed-off-by: Junio C Hamano <junkio@cox.net>	19 years ago
Nicolas Pitre	06a9f92035	improve diff-delta with sparse and/or repetitive data It is useless to preserve multiple hash entries for consecutive blocks with the same hash. Keeping only the first one will allow for matching the longest string of identical bytes while subsequent blocks will only allow for shorter matches. The backward matching code will match the end of it as necessary. This improves both performances (no repeated string compare with long successions of identical bytes, or even small group of bytes), as well as compression (less likely to need random hash bucket entry culling), especially with sparse files. With well behaved data sets this patch doesn't change much. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>	19 years ago
Nicolas Pitre	2d08e5dd73	tiny optimization to diff-delta This is my assembly freak side looking at generated code again. And since create_delta() is certainly pretty high on the radar every bits count. In this case shorter code is generated if hash_mask is not copied to a local variable. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>	19 years ago
Nicolas Pitre	3dc5a9e4cd	replace adler32 with Rabin's polynomial in diff-delta This brings another small repacking speedup for sensibly the same pack size. On the Linux kernel repo, git-repack -a -f is 3.7% faster for a 0.4% larger pack. Credits to Geert Bosch who brought the Rabin's polynomial idea to my attention. This also eliminate the issue of adler32() reading past the data buffer, as noticed by Johannes Schindelin. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>	19 years ago
Nicolas Pitre	08abe669c0	split the diff-delta interface This patch splits the diff-delta interface into index creation and delta generation. A wrapper is provided to preserve the diff-delta() call. This will allow for an optimization in pack-objects.c where the source object could be fixed and a full window of objects tentatively tried against that same source object without recomputing the source index each time. This patch only restructure things, plus a couple cleanups for good measure. There is no performance change yet. Signed-off-by: Nicolas Pitre <nico@cam.org>	19 years ago
Nicolas Pitre	5a1fb2ca92	3% tighter packs for free This patch makes for 3.4% smaller pack with the git repository, and a bit more than 3% smaller pack with the kernel repository. And so with _no_ measurable CPU difference. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>	19 years ago
Nicolas Pitre	c13c6bf758	diff-delta: bound hash list length to avoid O(mn) behavior The diff-delta code can exhibit O(mn) behavior with some patological data set where most hash entries end up in the same hash bucket. To prevent this, a limit is imposed to the number of entries that can exist in the same hash bucket. Because of the above the code is a tiny bit more expensive on average, even if some small optimizations were added as well to atenuate the overhead. But the problematic samples used to diagnoze the issue are now orders of magnitude less expensive to process with only a slight loss in compression. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>	19 years ago
Nicolas Pitre	38fd0721d0	diff-delta: allow reusing of the reference buffer index When a reference buffer is used multiple times then its index can be computed only once and reused multiple times. This patch adds an extra pointer to a pointer argument (from_index) to diff_delta() for this. If from_index is NULL then everything is like before. If from_index is non NULL and from_index is NULL then the index is created and its location stored to from_index. In this case the caller has the responsibility to free the memory pointed to by from_index. If from_index and from_index are non NULL then the index is reused as is. This currently saves about 10% of CPU time to repack the git archive. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>	19 years ago
Nicolas Pitre	5bb86b82ba	diff-delta: bound hash list length to avoid O(mn) behavior The diff-delta code can exhibit O(mn) behavior with some patological data set where most hash entries end up in the same hash bucket. The latest code rework reduced the block size making it particularly vulnerable to this issue, but the issue was always there and can be triggered regardless of the block size. This patch does two things: 1) the hashing has been reworked to offer a better distribution to atenuate the problem a bit, and 2) a limit is imposed to the number of entries that can exist in the same hash bucket. Because of the above the code is a bit more expensive on average, but the problematic samples used to diagnoze the issue are now orders of magnitude less expensive to process with only a slight loss in compression. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>	19 years ago
Nicolas Pitre	cc5c59a30c	diff-delta: produce optimal pack data Indexing based on adler32 has a match precision based on the block size (currently 16). Lowering the block size would produce smaller deltas but the indexing memory and computing cost increases significantly. For optimal delta result the indexing block size should be 3 with an increment of 1 (instead of 16 and 16). With such low params the adler32 becomes a clear overhead increasing the time for git-repack by a factor of 3. And with such small blocks the adler 32 is not very useful as the whole of the block bits can be used directly. This patch replaces the adler32 with an open coded index value based on 3 characters directly. This gives sufficient bits for hashing and allows for optimal delta with reasonable CPU cycles. The resulting packs are 6% smaller on average. The increase in CPU time is about 25%. But this cost is now hidden by the delta reuse patch while the saving on data transfers is always there. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>	19 years ago
Junio C Hamano	c436eb8cf1	diff-delta: cull collided hash bucket more aggressively. This tries to limit collided hash buckets by removing identical three-byte prefix from the same hashbucket.	19 years ago
Nicolas Pitre	2b8d9347aa	diff-delta: bound hash list length to avoid O(mn) behavior The diff-delta code can exhibit O(mn) behavior with some patological data set where most hash entries end up in the same hash bucket. The latest code rework reduced the block size making it particularly vulnerable to this issue, but the issue was always there and can be triggered regardless of the block size. This patch does two things: 1) the hashing has been reworked to offer a better distribution to atenuate the problem a bit, and 2) a limit is imposed to the number of entries that can exist in the same hash bucket. Because of the above the code is a bit more expensive on average, but the problematic samples used to diagnoze the issue are now orders of magnitude less expensive to process with only a slight loss in compression. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>	19 years ago
Junio C Hamano	bec2a69fe4	Revert "Revert "diff-delta: produce optimal pack data""	19 years ago
Junio C Hamano	eae3fe5e50	Revert "diff-delta: produce optimal pack data" This reverts `6b7d25d97b` commit. It turns out that the new algorithm has a really bad corner case, that literally spends minutes for inputs that takes less than a quater seconds to delta with the old algorithm. The resulting delta is 50% smaller which is admirable, but the performance degradation is simply unacceptable for unconditional use. Some example cases are these blobs in Linux 2.6 repository: 4917ec509720a42846d513addc11cbd25e0e3c4f 9af06ba723df75fed49f7ccae5b6c9c34bc5115f dfc9cd58dc065d17030d875d3fea6e7862ede143 Signed-off-by: Junio C Hamano <junkio@cox.net>	19 years ago
Nicolas Pitre	8e1454b5ad	diff-delta: big code simplification This is much smaller and hopefully clearer code now. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>	19 years ago
Nicolas Pitre	6b7d25d97b	diff-delta: produce optimal pack data Indexing based on adler32 has a match precision based on the block size (currently 16). Lowering the block size would produce smaller deltas but the indexing memory and computing cost increases significantly. For optimal delta result the indexing block size should be 3 with an increment of 1 (instead of 16 and 16). With such low params the adler32 becomes a clear overhead increasing the time for git-repack by a factor of 3. And with such small blocks the adler 32 is not very useful as the whole of the block bits can be used directly. This patch replaces the adler32 with an open coded index value based on 3 characters directly. This gives sufficient bits for hashing and allows for optimal delta with reasonable CPU cycles. The resulting packs are 6% smaller on average. The increase in CPU time is about 25%. But this cost is now hidden by the delta reuse patch while the saving on data transfers is always there. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>	19 years ago
Nicolas Pitre	fe474b588b	diff-delta: fold two special tests into one plus cleanups Testing for realloc and size limit can be done with only one test per loop. Make it so and fix a theoretical off-by-one comparison error in the process. The output buffer memory allocation is also bounded by max_size when specified. Finally make some variable unsigned to allow the handling of files up to 4GB in size instead of 2GB. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>	19 years ago
Peter Eriksen	04fe2a1706	Use adler32() from zlib instead of defining our own. Since we already depend on zlib, we don't need to define our own adler32(). Spotted by oprofile. Signed-off-by: Peter Eriksen <s022018@student.dtu.dk> Signed-off-by: Junio C Hamano <junkio@cox.net>	19 years ago
Nicolas Pitre	e5e3a9d8f9	small cleanup for diff-delta.c This patch removes unused remnants of the original xdiff source. No functional change. Possible tiny speed improvement. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>	19 years ago
Junio C Hamano	c7a45bd20e	Revert "diff-delta.c: allow delta with empty blob." This reverts `962537a3eb` commit to play safe.	19 years ago
Junio C Hamano	962537a3eb	diff-delta.c: allow delta with empty blob. Delta computation with an empty blob used to punt and returned NULL. This commit allows creation with empty blob; all combination of empty->empty, empty->something, and something->empty are allowed. Signed-off-by: Junio C Hamano <junkio@cox.net>	19 years ago
Nicolas Pitre	dcde55bc58	[PATCH] assorted delta code cleanup This is a wrap-up patch including all the cleanups I've done to the delta code and its usage. The most important change is the factorization of the delta header handling code. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	20 years ago
Nicolas Pitre	69a2d426f0	[PATCH] denser delta header encoding Since the delta data format is not tied to any actual git object anymore, now is the time to add a small improvement to the delta data header as it is been done for packed object header. This patch allows for reducing the delta header of about 2 bytes and makes for simpler code. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	20 years ago
Linus Torvalds	75c42d8cc3	Add a "max_size" parameter to diff_delta() Anything that generates a delta to see if two objects are close usually isn't interested in the delta ends up being bigger than some specified size, and this allows us to stop delta generation early when that happens.	20 years ago
Nicolas Pitre	a310d43494	[PATCH] Deltification library work by Nicolas Pitre. This patch adds the basic library functions to create and replay delta information. Also included is a test-delta utility to validate the code. diff-delta was based on LibXDiff written by Davide Libenzi Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	20 years ago

35 Commits (be53deef0d94b80b0a2df465d16f21ed29c3165b)