Commit Graph

20 Commits (578421fbd8db6602f290862f44843fc896b4fb0b)

Author SHA1 Message Date
Junio C Hamano b45563a229 rename: Break filepairs with different types.
When we consider if a path has been totally rewritten, we did not
touch changes from symlinks to files or vice versa.  But a change
that modifies even the type of a blob surely should count as a
complete rewrite.

While we are at it, modernise diffcore-break to be aware of gitlinks (we
do not want to touch them).

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-12-02 02:24:46 -08:00
Linus Torvalds 6dd4b66fde Fix diffcore-break total breakage
Ok, so on the kernel list, some people noticed that "git log --follow"
doesn't work too well with some files in the x86 merge, because a lot of
files got renamed in very special ways.

In particular, there was a pattern of doing single commits with renames
that looked basically like

 - rename "filename.h" -> "filename_64.h"
 - create new "filename.c" that includes "filename_32.h" or
   "filename_64.h" depending on whether we're 32-bit or 64-bit.

which was preparatory for smushing the two trees together.

Now, there's two issues here:

 - "filename.c" *remained*. Yes, it was a rename, but there was a new file
   created with the old name in the same commit. This was important,
   because we wanted each commit to compile properly, so that it was
   bisectable, so splitting the rename into one commit and the "create
   helper file" into another was *not* an option.

   So we need to break associations where the contents change too much.
   Fine. We have the -B flag for that. When we break things up, then the
   rename detection will be able to figure out whether there are better
   alternatives.

 - "git log --follow" didn't with with -B.

Now, the second case was really simple: we use a different "diffopt"
structure for the rename detection than the basic one (which we use for
showing the diffs). So that second case is trivially fixed by a trivial
one-liner that just copies the break_opt values from the "real" diffopts
to the one used for rename following. So now "git log -B --follow" works
fine:

	diff --git a/tree-diff.c b/tree-diff.c
	index 26bdbdd..7c261fd 100644
	--- a/tree-diff.c
	+++ b/tree-diff.c
	@@ -319,6 +319,7 @@ static void try_to_follow_renames(struct tree_desc *t1, struct tree_desc *t2, co
	 	diff_opts.detect_rename = DIFF_DETECT_RENAME;
	 	diff_opts.output_format = DIFF_FORMAT_NO_OUTPUT;
	 	diff_opts.single_follow = opt->paths[0];
	+	diff_opts.break_opt = opt->break_opt;
	 	paths[0] = NULL;
	 	diff_tree_setup_paths(paths, &diff_opts);
	 	if (diff_setup_done(&diff_opts) < 0)

however, the end result does *not* work. Because our diffcore-break.c
logic is totally bogus!

In particular:

 - it used to do

	if (base_size < MINIMUM_BREAK_SIZE)
		return 0; /* we do not break too small filepair */

   which basically says "don't bother to break small files". But that
   "base_size" is the *smaller* of the two sizes, which means that if some
   large file was rewritten into one that just includes another file, we
   would look at the (small) result, and decide that it's smaller than the
   break size, so it cannot be worth it to break it up! Even if the other
   side was ten times bigger and looked *nothing* like the samell file!

   That's clearly bogus. I replaced "base_size" with "max_size", so that
   we compare the *bigger* of the filepair with the break size.

 - It calculated a "merge_score", which was the score needed to merge it
   back together if nothing else wanted it. But even if it was *so*
   different that we would never want to merge it back, we wouldn't
   consider it a break! That makes no sense. So I added

	if (*merge_score_p > break_score)
		return 1;

   to make it clear that if we wouldn't want to merge it at the end, it
   was *definitely* a break.

 - It compared the whole "extent of damage", counting all inserts and
   deletes, but it based this score on the "base_size", and generated the
   damage score with

	delta_size = src_removed + literal_added;
	damage_score = delta_size * MAX_SCORE / base_size;

   but that makes no sense either, since quite often, this will result in
   a number that is *bigger* than MAX_SCORE! Why? Because base_size is
   (again) the smaller of the two files we compare, and when you start out
   from a small file and add a lot (or start out from a large file and
   remove a lot), the base_size is going to be much smaller than the
   damage!

   Again, the fix was to replace "base_size" with "max_size", at which
   point the damage actually becomes a sane percentage of the whole.

With these changes in place, not only does "git log -B --follow" work for
the case that triggered this in the first place, ie now

	git log -B --follow arch/x86/kernel/vmlinux_64.lds.S

actually gives reasonable results. But I also wanted to verify it in
general, by doing a full-history

	git log --stat -B -C

on my kernel tree with the old code and the new code.

There's some tweaking to be done, but generally, the new code generates
much better results wrt breaking up files (and then finding better rename
candidates). Here's a few examples of the "--stat" output:

 - This:
	include/asm-x86/Kbuild        |    2 -
	include/asm-x86/debugreg.h    |   79 +++++++++++++++++++++++++++++++++++------
	include/asm-x86/debugreg_32.h |   64 ---------------------------------
	include/asm-x86/debugreg_64.h |   65 ---------------------------------
	4 files changed, 68 insertions(+), 142 deletions(-)

      Becomes:

	include/asm-x86/Kbuild                        |    2 -
	include/asm-x86/{debugreg_64.h => debugreg.h} |    9 +++-
	include/asm-x86/debugreg_32.h                 |   64 -------------------------
	3 files changed, 7 insertions(+), 68 deletions(-)

 - This:
	include/asm-x86/bug.h    |   41 +++++++++++++++++++++++++++++++++++++++--
	include/asm-x86/bug_32.h |   37 -------------------------------------
	include/asm-x86/bug_64.h |   34 ----------------------------------
	3 files changed, 39 insertions(+), 73 deletions(-)

      Becomes

	include/asm-x86/{bug_64.h => bug.h} |   20 +++++++++++++-----
	include/asm-x86/bug_32.h            |   37 -----------------------------------
	2 files changed, 14 insertions(+), 43 deletions(-)

Now, in some other cases, it does actually turn a rename into a real
"delete+create" pair, and then the diff is usually bigger, so truth in
advertizing: it doesn't always generate a nicer diff. But for what -B was
meant for, I think this is a big improvement, and I suspect those cases
where it generates a bigger diff are tweakable.

So I think this diff fixes a real bug, but we might still want to tweak
the default values and perhaps the exact rules for when a break happens.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
2007-10-21 01:59:42 -04:00
Junio C Hamano d8c3d03a0b diffcore_count_changes: pass diffcore_filespec
We may want to use richer information on the data we are dealing
with in this function, so instead of passing a buffer address
and length, just pass the diffcore_filespec structure.  Existing
callers always call this function with parameters taken from a
filespec anyway, so there is no functionality changes.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-06-30 20:51:31 -07:00
Shawn O. Pearce dc49cd769b Cast 64 bit off_t to 32 bit size_t
Some systems have sizeof(off_t) == 8 while sizeof(size_t) == 4.
This implies that we are able to access and work on files whose
maximum length is around 2^63-1 bytes, but we can only malloc or
mmap somewhat less than 2^32-1 bytes of memory.

On such a system an implicit conversion of off_t to size_t can cause
the size_t to wrap, resulting in unexpected and exciting behavior.
Right now we are working around all gcc warnings generated by the
-Wshorten-64-to-32 option by passing the off_t through xsize_t().

In the future we should make xsize_t on such problematic platforms
detect the wrapping and die if such a file is accessed.

Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-03-07 11:15:26 -08:00
David Rientjes a89fccd281 Do not use memcmp(sha1_1, sha1_2, 20) with hardcoded length.
Introduces global inline:

	hashcmp(const unsigned char *sha1, const unsigned char *sha2)

Uses memcmp for comparison and returns the result based on the length of
the hash name (a future runtime decision).

Acked-by: Alex Riesen <raa.lkml@gmail.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-08-17 14:23:53 -07:00
Junio C Hamano c06c79667c diffcore-rename: somewhat optimized.
This changes diffcore-rename to reuse statistics information
gathered during similarity estimation, and updates the hashtable
implementation used to keep track of the statistics to be
denser.  This seems to give better performance.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-12 03:22:10 -08:00
Junio C Hamano 4d0f39cecf diffcore-break: similarity estimator fix.
This is a companion patch to the previous fix to diffcore-rename.
The merging-back process should use a logic similar to what is used
there.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-04 13:26:36 -08:00
Junio C Hamano 65416758cd diffcore-rename: split out the delta counting code.
This is to rework diffcore break/rename/copy detection code
so that it does not affected when deltifier code gets improved.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-28 20:20:04 -08:00
Junio C Hamano aeecd23ae2 diffcore-break: micro-optimize by avoiding delta between identical files.
We did not check if we have the same file on both sides when
computing break score.  This is usually not a problem, but if
the user said --find-copies-harde with -B, we ended up trying a
delta between the same data even when we know the SHA1 hash of
both sides match.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-28 20:19:47 -08:00
Junio C Hamano 0532a5e46b diffcore-break: do not break too small filepair.
Somehow we checked only one side and not the other.  By checking
the filesize upfront, we can bypass generating delta
unnecessarily.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-12-12 17:15:55 -08:00
Junio C Hamano d28c8af623 diffcore-break.c: check diff_delta() return value.
This bug caused Darrin Thompson to notice that our deltifier was
half broken and punting on an empty blob.

Signed-off-by: Junio C Hamano <junio@twinsun.com>
2005-12-12 12:57:25 -08:00
Junio C Hamano 19397b4521 Revert "[PATCH] plug memory leak in diff.c::diff_free_filepair()"
This reverts 068eac91ce commit.
2005-09-14 14:06:50 -07:00
Yasushi SHOJI 068eac91ce [PATCH] plug memory leak in diff.c::diff_free_filepair()
When I run git-diff-tree on big change, it seems the command eats so
much memory.  so I just put git under valgrind to see what's going on.
diff_free_filespec_data() doesn't free diff_filespec itself.

[jc: I ended up doing things slightly differently from Yasushi's
patch.  The original idea was to use free_filespec_data() only to
free the data portion and keep useing the filespec itself, but
no existing code seems to do things that way, so I just yanked
that part out.]

Signed-off-by: Yasushi SHOJI <yashi@atmark-techno.com>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-08-13 18:28:55 -07:00
Junio C Hamano 3c84974207 [PATCH] Fixlets on top of Nico's clean-up.
If we prefer 0 as maxsize for diff_delta() to say "unlimited", let's be
consistent about it.

This patch also fixes type mismatch in a call to get_delta_hdr_size()
from packed_delta_info().

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-29 09:11:38 -07:00
Linus Torvalds 75c42d8cc3 Add a "max_size" parameter to diff_delta()
Anything that generates a delta to see if two objects are close usually
isn't interested in the delta ends up being bigger than some specified
size, and this allows us to stop delta generation early when that
happens.
2005-06-25 19:30:20 -07:00
Junio C Hamano 366175ef8c [PATCH] Rework -B output.
Patch for a completely rewritten file detected by the -B flag
was shown as a pair of creation followed by deletion in earlier
versions.  This was an misguided attempt to make reviewing such
a complete rewrite easier, and unnecessarily ended up confusing
git-apply.  Instead, show the entire contents of old version
prefixed with '-', followed by the entire contents of new
version prefixed with '+'.  This gives the same easy-to-review
for human consumer while keeping it a single, regular
modification patch for machine consumption, something that even
GNU patch can grok.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-19 20:13:18 -07:00
Junio C Hamano f78c79c5d4 [PATCH] diffcore-break.c: various fixes.
This fixes three bugs in the -B heuristics.

 - Although it was advertised that the initial break criteria
   used was the same as what diffcore-rename uses, it was using
   something different.  Instead of using smaller of src and dst
   size to compare with "edit" size, (insertion and deletion),
   it was using larger of src and dst, unlike the rename/copy
   detection logic.  This caused the parameter to -B to mean
   something different from the one to -M and -C.  To compensate
   for this change, the default break score is also changed to
   match that of the default for rename/copy.

 - The code would have crashed with division by zero when trying
   to break an originally empty file.

 - Contrary to what the comment said, the algorithm was breaking
   small files, only to later merge them together.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-05 14:14:58 -07:00
Junio C Hamano eeaa460314 [PATCH] diff: Update -B heuristics.
As Linus pointed out on the mailing list discussion, -B should
break a files that has many inserts even if it still keeps
enough of the original contents, so that the broken pieces can
later be matched with other files by -M or -C.  However, if such
a broken pair does not get picked up by -M or -C, we would want
to apply different criteria; namely, regardless of the amount of
new material in the result, the determination of "rewrite"
should be done by looking at the amount of original material
still left in the result.  If you still have the original 97
lines from a 100-line document, it does not matter if you add
your own 13 lines to make a 110-line document, or if you add 903
lines to make a 1000-line document.  It is not a rewrite but an
in-place edit.  On the other hand, if you did lose 97 lines from
the original, it does not matter if you added 27 lines to make a
30-line document or if you added 997 lines to make a 1000-line
document.  You did a complete rewrite in either case.

This patch introduces a post-processing phase that runs after
diffcore-rename matches up broken pairs diffcore-break creates.
The purpose of this post-processing is to pick up these broken
pieces and merge them back into in-place modifications.  For
this, the score parameter -B option takes is changed into a pair
of numbers, and it takes "-B99/80" format when fully spelled
out.  The first number is the minimum amount of "edit" (same
definition as what diffcore-rename uses, which is "sum of
deletion and insertion") that a modification needs to have to be
broken, and the second number is the minimum amount of "delete"
a surviving broken pair must have to avoid being merged back
together.  It can be abbreviated to "-B" to use default for
both, "-B9" or "-B9/" to use 90% for "edit" but default (80%)
for merge avoidance, or "-B/75" to use default (99%) "edit" and
75% for merge avoidance.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-03 11:23:03 -07:00
Junio C Hamano 355e76a4a3 [PATCH] Tweak count-delta interface
Make it return copied source and insertion separately, so that
later implementation of heuristics can use them more flexibly.

This does not change the heuristics implemented in
diffcore-rename nor diffcore-break in any way.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-03 11:23:03 -07:00
Junio C Hamano f345b0a066 [PATCH] Add -B flag to diff-* brothers.
A new diffcore transformation, diffcore-break.c, is introduced.

When the -B flag is given, a patch that represents a complete
rewrite is broken into a deletion followed by a creation.  This
makes it easier to review such a complete rewrite patch.

The -B flag takes the same syntax as the -M and -C flags to
specify the minimum amount of non-source material the resulting
file needs to have to be considered a complete rewrite, and
defaults to 99% if not specified.

As the new test t4008-diff-break-rewrite.sh demonstrates, if a
file is a complete rewrite, it is broken into a delete/create
pair, which can further be subjected to the usual rename
detection if -M or -C is used.  For example, if file0 gets
completely rewritten to make it as if it were rather based on
file1 which itself disappeared, the following happens:

    The original change looks like this:

	file0     --> file0' (quite different from file0)
	file1     --> /dev/null

    After diffcore-break runs, it would become this:

	file0     --> /dev/null
	/dev/null --> file0'
	file1     --> /dev/null

    Then diffcore-rename matches them up:

	file1     --> file0'

The internal score values are finer grained now.  Earlier
maximum of 10000 has been raised to 60000; there is no user
visible changes but there is no reason to waste available bits.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-30 10:35:49 -07:00