Commit Graph

43 Commits (19be71db9c3faafc113c16b4ca9fc30281baf1ce)

Author SHA1 Message Date
Patrick Steinhardt a5aecb2cdc diff: improve lifecycle management of diff queues
The lifecycle management of diff queues is somewhat confusing:

  - For most of the part this can be attributed to `DIFF_QUEUE_CLEAR()`,
    which does not release any memory but rather initializes the queue,
    only. This is in contrast to our common naming schema, where
    "clearing" means that we release underlying memory and then
    re-initialize the data structure such that it is ready to use.

  - A second offender is `diff_free_queue()`, which does not free the
    queue structure itself. It is rather a release-style function.

Refactor the code to make things less confusing. `DIFF_QUEUE_CLEAR()` is
replaced by `DIFF_QUEUE_INIT` and `diff_queue_init()`, while
`diff_free_queue()` is replaced by `diff_queue_release()`. While on it,
adapt callsites where we call `DIFF_QUEUE_CLEAR()` with the intent to
release underlying memory to instead call `diff_queue_clear()` to fix
memory leaks.

This memory leak is exposed by t4211, but plugging it alone does not
make the whole test suite pass.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-09-30 11:23:05 -07:00
Patrick Steinhardt 12dfc2475c diffcore-break: fix leaking filespecs when merging broken pairs
When merging file pairs after they have been broken up we queue a new
file pair and discard the broken-up ones. The newly-queued file pair
reuses one filespec of the broken up pairs each, where the respective
other filespec gets discarded. But we only end up freeing the filespec's
data, not the filespec itself, and thus leak memory.

Fix these leaks by using `free_filespec()` instead.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-09-27 08:25:37 -07:00
Patrick Steinhardt e7da938570 global: introduce `USE_THE_REPOSITORY_VARIABLE` macro
Use of the `the_repository` variable is deprecated nowadays, and we
slowly but steadily convert the codebase to not use it anymore. Instead,
callers should be passing down the repository to work on via parameters.

It is hard though to prove that a given code unit does not use this
variable anymore. The most trivial case, merely demonstrating that there
is no direct use of `the_repository`, is already a bit of a pain during
code reviews as the reviewer needs to manually verify claims made by the
patch author. The bigger problem though is that we have many interfaces
that implicitly rely on `the_repository`.

Introduce a new `USE_THE_REPOSITORY_VARIABLE` macro that allows code
units to opt into usage of `the_repository`. The intent of this macro is
to demonstrate that a certain code unit does not use this variable
anymore, and to keep it from new dependencies on it in future changes,
be it explicit or implicit

For now, the macro only guards `the_repository` itself as well as
`the_hash_algo`. There are many more known interfaces where we have an
implicit dependency on `the_repository`, but those are not guarded at
the current point in time. Over time though, we should start to add
guards as required (or even better, just remove them).

Define the macro as required in our code units. As expected, most of our
code still relies on the global variable. Nearly all of our builtins
rely on the variable as there is no way yet to pass `the_repository` to
their entry point. For now, declare the macro in "biultin.h" to keep the
required changes at least a little bit more contained.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-14 10:26:33 -07:00
Elijah Newren eea0e59ffb treewide: remove unnecessary includes in source files
Each of these were checked with
   gcc -E -I. ${SOURCE_FILE} | grep ${HEADER_FILE}
to ensure that removing the direct inclusion of the header actually
resulted in that header no longer being included at all (i.e. that
no other header pulled it in transitively).

...except for a few cases where we verified that although the header
was brought in transitively, nothing from it was directly used in
that source file.  These cases were:
  * builtin/credential-cache.c
  * builtin/pull.c
  * builtin/send-pack.c

Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-26 12:04:31 -08:00
Elijah Newren df6e874496 diff.h: remove unnecessary include of oidset.h
This also made it clear that several .c files depended upon various
things that oidset included, but had omitted the direct #include for
those headers.  Add those now.

Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-06-21 13:39:53 -07:00
Elijah Newren 08c46a499a read-cache*.h: move declarations for read-cache.c functions from cache.h
For the functions defined in read-cache.c, move their declarations from
cache.h to a new header, read-cache-ll.h.  Also move some related inline
functions from cache.h to read-cache.h.  The purpose of the
read-cache-ll.h/read-cache.h split is that about 70% of the sites don't
need the inline functions and the extra headers they include.

Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-06-21 13:39:53 -07:00
Ævar Arnfjörð Bjarmason a5183d7696 cocci: apply the "promisor-remote.h" part of "the_repository.pending"
Apply the part of "the_repository.pending.cocci" pertaining to
"promisor-remote.h".

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-28 07:36:46 -07:00
Jonathan Tan 95acf11a3d diff: restrict when prefetching occurs
Commit 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-08)
optimized "diff" by prefetching blobs in a partial clone, but there are
some cases wherein blobs do not need to be prefetched. In these cases,
any command that uses the diff machinery will unnecessarily fetch blobs.

diffcore_std() may read blobs when it calls the following functions:
 (1) diffcore_skip_stat_unmatch() (controlled by the config variable
     diff.autorefreshindex)
 (2) diffcore_break() and diffcore_merge_broken() (for break-rewrite
     detection)
 (3) diffcore_rename() (for rename detection)
 (4) diffcore_pickaxe() (for detecting addition/deletion of specified
     string)

Instead of always prefetching blobs, teach diffcore_skip_stat_unmatch(),
diffcore_break(), and diffcore_rename() to prefetch blobs upon the first
read of a missing object. This covers (1), (2), and (3): to cover the
rest, teach diffcore_std() to prefetch if the output type is one that
includes blob data (and hence blob data will be required later anyway),
or if it knows that (4) will be run.

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-07 16:09:29 -07:00
Jonathan Tan 1c37e86ab2 diff: make diff_populate_filespec_options struct
The behavior of diff_populate_filespec() currently can be customized
through a bitflag, but a subsequent patch requires it to support a
non-boolean option. Replace the bitflag with an options struct.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-07 16:09:29 -07:00
Alex Henrie baed6bbb5b diffcore-break: use a goto instead of a redundant if statement
The condition "if (q->nr <= j)" checks whether the loop exited normally
or via a break statement. Avoid this check by replacing the jump out of
the inner loop with a jump to the end of the outer loop, which makes it
obvious that diff_q is not executed when the peer survives.

Signed-off-by: Alex Henrie <alexhenrie24@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-02 15:04:21 +09:00
Junio C Hamano 11877b9ebe Merge branch 'nd/the-index'
Various codepaths in the core-ish part learn to work on an
arbitrary in-core index structure, not necessarily the default
instance "the_index".

* nd/the-index: (23 commits)
  revision.c: reduce implicit dependency the_repository
  revision.c: remove implicit dependency on the_index
  ws.c: remove implicit dependency on the_index
  tree-diff.c: remove implicit dependency on the_index
  submodule.c: remove implicit dependency on the_index
  line-range.c: remove implicit dependency on the_index
  userdiff.c: remove implicit dependency on the_index
  rerere.c: remove implicit dependency on the_index
  sha1-file.c: remove implicit dependency on the_index
  patch-ids.c: remove implicit dependency on the_index
  merge.c: remove implicit dependency on the_index
  merge-blobs.c: remove implicit dependency on the_index
  ll-merge.c: remove implicit dependency on the_index
  diff-lib.c: remove implicit dependency on the_index
  read-cache.c: remove implicit dependency on the_index
  diff.c: remove implicit dependency on the_index
  grep.c: remove implicit dependency on the_index
  diff.c: remove the_index dependency in textconv() functions
  blame.c: rename "repo" argument to "r"
  combine-diff.c: remove implicit dependency on the_index
  ...
2018-10-19 13:34:02 +09:00
Nguyễn Thái Ngọc Duy b78ea5fc35 diff.c: reduce implicit dependency on the_index
diff and textconv code has so widespread use that it's hard to simply
update their api and all call sites at once because it would result in
a big patch. For now reduce the_index references to two places:
diff_setup() and fill_textconv().

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-09-21 09:48:10 -07:00
Jeff King 4a7e27e957 convert "oidcmp() == 0" to oideq()
Using the more restrictive oideq() should, in the long run,
give the compiler more opportunities to optimize these
callsites. For now, this conversion should be a complete
noop with respect to the generated code.

The result is also perhaps a little more readable, as it
avoids the "zero is equal" idiom. Since it's so prevalent in
C, I think seasoned programmers tend not to even notice it
anymore, but it can sometimes make for awkward double
negations (e.g., we can drop a few !!oidcmp() instances
here).

This patch was generated almost entirely by the included
coccinelle patch. This mechanical conversion should be
completely safe, because we check explicitly for cases where
oidcmp() is compared to 0, which is what oideq() is doing
under the hood. Note that we don't have to catch "!oidcmp()"
separately; coccinelle's standard isomorphisms make sure the
two are treated equivalently.

I say "almost" because I did hand-edit the coccinelle output
to fix up a few style violations (it mostly keeps the
original formatting, but sometimes unwraps long lines).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-29 11:32:49 -07:00
Junio C Hamano 6d40812e4b Merge branch 'tk/diffcore-delta-remove-unused'
Code cleanup.

* tk/diffcore-delta-remove-unused:
  diffcore-delta: remove unused parameter to diffcore_count_changes()
2016-11-17 13:45:22 -08:00
Tobias Klauser 974e0044d6 diffcore-delta: remove unused parameter to diffcore_count_changes()
The delta_limit parameter to diffcore_count_changes() has been unused
since commit ba23bbc8e ("diffcore-delta: make change counter to byte
oriented again.", 2006-03-04).

Remove the parameter and adjust all callers.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-11-14 09:24:04 -08:00
brian m. carlson 41c9560ee5 diff: rename struct diff_filespec's sha1_valid member
Now that this struct's sha1 member is called "oid", update the comment
and the sha1_valid member to be called "oid_valid" instead.  The
following Coccinelle semantic patch was used to implement this, followed
by the transformations in object_id.cocci:

@@
struct diff_filespec o;
@@
- o.sha1_valid
+ o.oid_valid

@@
struct diff_filespec *p;
@@
- p->sha1_valid
+ p->oid_valid

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-06-28 11:39:02 -07:00
brian m. carlson a0d12c4433 diff: convert struct diff_filespec to struct object_id
Convert struct diff_filespec's sha1 member to use a struct object_id
called "oid" instead.  The following Coccinelle semantic patch was used
to implement this, followed by the transformations in object_id.cocci:

@@
struct diff_filespec o;
@@
- o.sha1
+ o.oid.hash

@@
struct diff_filespec *p;
@@
- p->sha1
+ p->oid.hash

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-06-28 11:39:02 -07:00
Junio C Hamano 6936b5859c diff -B -M: fix output for "copy and then rewrite" case
Starting from a single file, A, if you create B as a copy of A (and
possibly make some edit) and then make extensive change to A, you
will see:

    $ git diff -C --name-status
    C89    A    B
    M      A

which is expected.  However, if you ask the same question in a
different way, you see this:

    $ git diff -B -M --name-status
    R89    A    B
    M100   A

telling us that A was rename-edited into B (as if "A will no longer
exist as the result") and at the same time A itself was extensively
edited.

In this case, because the resulting tree still does have file A
(even if it has contents vastly different from the original), we
should use "C"opy, not "R"ename, to avoid hinting that A somehow
goes away.

Two existing tests were depending on the wrong behaviour, and fixed.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-10-23 16:17:09 -07:00
John Keeping e7b00c5764 diffcore-break: don't divide by zero
When the source file is empty, the calculation of the merge score
results in a division by zero.  In the situation:

     == preimage ==             == postimage ==

     F (empty file)             F (a large file)
                                E (a new empty file)

it does not make sense to consider F->E as a rename, so it is better not
to break the pre- and post-image of F.

Bail out early in this case to avoid hitting the divide-by-zero.  This
causes the merge score to be left at zero.

Signed-off-by: John Keeping <john@keeping.me.uk>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-04-03 12:48:02 -07:00
Bo Yang 9ca5df9061 Add a macro DIFF_QUEUE_CLEAR.
Refactor the diff_queue_struct code, this macro help
to reset the structure.

Signed-off-by: Bo Yang <struggleyb.nku@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-05-07 09:34:27 -07:00
Jeff King 8282de94bc diffcore-break: save cnt_data for other phases
The "break" phase works by counting changes between two
blobs with the same path. We do this by splitting the file
into chunks (or lines for text oriented files) and then
keeping a count of chunk hashes.

The "rename" phase counts changes between blobs at two
different paths. However, it uses the exact same set of
chunk hashes (which are immutable for a given sha1).

The rename phase can therefore use the same hash data as
break. Unfortunately, we were throwing this data away after
computing it in the break phase. This patch instead attaches
it to the filespec and lets it live through the rename
phase, working under the assumption that most of the time
that breaks are being computed, renames will be too.

We only do this optimization for files which have actually
been broken, as those ones will be candidates for rename
detection (and it is a time-space tradeoff, so we don't want
to waste space keeping useless data).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-11-16 13:21:12 -08:00
Jeff King f4f19fb634 diffcore-break: free filespec data as we go
As we look at each changed file and consider breaking it, we
load the blob data and make a decision about whether to
break, which is independent of any other blobs that might
have changed. However, we keep the data in memory while we
consider breaking all of the other files. Which means that
both versions of every file you are diffing are in memory at
the same time.

This patch instead frees the blob data as we finish with
each file pair, leading to much lower memory usage.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-11-16 13:21:11 -08:00
Benjamin Kramer eb3a9dd327 Remove unused function scope local variables
These variables were unused and can be removed safely:

  builtin-clone.c::cmd_clone(): use_local_hardlinks, use_separate_remote
  builtin-fetch-pack.c::find_common(): len
  builtin-remote.c::mv(): symref
  diff.c::show_stats():show_stats(): total
  diffcore-break.c::should_break(): base_size
  fast-import.c::validate_raw_date(): date, sign
  fsck.c::fsck_tree(): o_sha1, sha1
  xdiff-interface.c::parse_num(): read_some

Signed-off-by: Benjamin Kramer <benny.kra@googlemail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-03-07 20:52:17 -08:00
Junio C Hamano b45563a229 rename: Break filepairs with different types.
When we consider if a path has been totally rewritten, we did not
touch changes from symlinks to files or vice versa.  But a change
that modifies even the type of a blob surely should count as a
complete rewrite.

While we are at it, modernise diffcore-break to be aware of gitlinks (we
do not want to touch them).

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-12-02 02:24:46 -08:00
Linus Torvalds 6dd4b66fde Fix diffcore-break total breakage
Ok, so on the kernel list, some people noticed that "git log --follow"
doesn't work too well with some files in the x86 merge, because a lot of
files got renamed in very special ways.

In particular, there was a pattern of doing single commits with renames
that looked basically like

 - rename "filename.h" -> "filename_64.h"
 - create new "filename.c" that includes "filename_32.h" or
   "filename_64.h" depending on whether we're 32-bit or 64-bit.

which was preparatory for smushing the two trees together.

Now, there's two issues here:

 - "filename.c" *remained*. Yes, it was a rename, but there was a new file
   created with the old name in the same commit. This was important,
   because we wanted each commit to compile properly, so that it was
   bisectable, so splitting the rename into one commit and the "create
   helper file" into another was *not* an option.

   So we need to break associations where the contents change too much.
   Fine. We have the -B flag for that. When we break things up, then the
   rename detection will be able to figure out whether there are better
   alternatives.

 - "git log --follow" didn't with with -B.

Now, the second case was really simple: we use a different "diffopt"
structure for the rename detection than the basic one (which we use for
showing the diffs). So that second case is trivially fixed by a trivial
one-liner that just copies the break_opt values from the "real" diffopts
to the one used for rename following. So now "git log -B --follow" works
fine:

	diff --git a/tree-diff.c b/tree-diff.c
	index 26bdbdd..7c261fd 100644
	--- a/tree-diff.c
	+++ b/tree-diff.c
	@@ -319,6 +319,7 @@ static void try_to_follow_renames(struct tree_desc *t1, struct tree_desc *t2, co
	 	diff_opts.detect_rename = DIFF_DETECT_RENAME;
	 	diff_opts.output_format = DIFF_FORMAT_NO_OUTPUT;
	 	diff_opts.single_follow = opt->paths[0];
	+	diff_opts.break_opt = opt->break_opt;
	 	paths[0] = NULL;
	 	diff_tree_setup_paths(paths, &diff_opts);
	 	if (diff_setup_done(&diff_opts) < 0)

however, the end result does *not* work. Because our diffcore-break.c
logic is totally bogus!

In particular:

 - it used to do

	if (base_size < MINIMUM_BREAK_SIZE)
		return 0; /* we do not break too small filepair */

   which basically says "don't bother to break small files". But that
   "base_size" is the *smaller* of the two sizes, which means that if some
   large file was rewritten into one that just includes another file, we
   would look at the (small) result, and decide that it's smaller than the
   break size, so it cannot be worth it to break it up! Even if the other
   side was ten times bigger and looked *nothing* like the samell file!

   That's clearly bogus. I replaced "base_size" with "max_size", so that
   we compare the *bigger* of the filepair with the break size.

 - It calculated a "merge_score", which was the score needed to merge it
   back together if nothing else wanted it. But even if it was *so*
   different that we would never want to merge it back, we wouldn't
   consider it a break! That makes no sense. So I added

	if (*merge_score_p > break_score)
		return 1;

   to make it clear that if we wouldn't want to merge it at the end, it
   was *definitely* a break.

 - It compared the whole "extent of damage", counting all inserts and
   deletes, but it based this score on the "base_size", and generated the
   damage score with

	delta_size = src_removed + literal_added;
	damage_score = delta_size * MAX_SCORE / base_size;

   but that makes no sense either, since quite often, this will result in
   a number that is *bigger* than MAX_SCORE! Why? Because base_size is
   (again) the smaller of the two files we compare, and when you start out
   from a small file and add a lot (or start out from a large file and
   remove a lot), the base_size is going to be much smaller than the
   damage!

   Again, the fix was to replace "base_size" with "max_size", at which
   point the damage actually becomes a sane percentage of the whole.

With these changes in place, not only does "git log -B --follow" work for
the case that triggered this in the first place, ie now

	git log -B --follow arch/x86/kernel/vmlinux_64.lds.S

actually gives reasonable results. But I also wanted to verify it in
general, by doing a full-history

	git log --stat -B -C

on my kernel tree with the old code and the new code.

There's some tweaking to be done, but generally, the new code generates
much better results wrt breaking up files (and then finding better rename
candidates). Here's a few examples of the "--stat" output:

 - This:
	include/asm-x86/Kbuild        |    2 -
	include/asm-x86/debugreg.h    |   79 +++++++++++++++++++++++++++++++++++------
	include/asm-x86/debugreg_32.h |   64 ---------------------------------
	include/asm-x86/debugreg_64.h |   65 ---------------------------------
	4 files changed, 68 insertions(+), 142 deletions(-)

      Becomes:

	include/asm-x86/Kbuild                        |    2 -
	include/asm-x86/{debugreg_64.h => debugreg.h} |    9 +++-
	include/asm-x86/debugreg_32.h                 |   64 -------------------------
	3 files changed, 7 insertions(+), 68 deletions(-)

 - This:
	include/asm-x86/bug.h    |   41 +++++++++++++++++++++++++++++++++++++++--
	include/asm-x86/bug_32.h |   37 -------------------------------------
	include/asm-x86/bug_64.h |   34 ----------------------------------
	3 files changed, 39 insertions(+), 73 deletions(-)

      Becomes

	include/asm-x86/{bug_64.h => bug.h} |   20 +++++++++++++-----
	include/asm-x86/bug_32.h            |   37 -----------------------------------
	2 files changed, 14 insertions(+), 43 deletions(-)

Now, in some other cases, it does actually turn a rename into a real
"delete+create" pair, and then the diff is usually bigger, so truth in
advertizing: it doesn't always generate a nicer diff. But for what -B was
meant for, I think this is a big improvement, and I suspect those cases
where it generates a bigger diff are tweakable.

So I think this diff fixes a real bug, but we might still want to tweak
the default values and perhaps the exact rules for when a break happens.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
2007-10-21 01:59:42 -04:00
Junio C Hamano d8c3d03a0b diffcore_count_changes: pass diffcore_filespec
We may want to use richer information on the data we are dealing
with in this function, so instead of passing a buffer address
and length, just pass the diffcore_filespec structure.  Existing
callers always call this function with parameters taken from a
filespec anyway, so there is no functionality changes.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-06-30 20:51:31 -07:00
Shawn O. Pearce dc49cd769b Cast 64 bit off_t to 32 bit size_t
Some systems have sizeof(off_t) == 8 while sizeof(size_t) == 4.
This implies that we are able to access and work on files whose
maximum length is around 2^63-1 bytes, but we can only malloc or
mmap somewhat less than 2^32-1 bytes of memory.

On such a system an implicit conversion of off_t to size_t can cause
the size_t to wrap, resulting in unexpected and exciting behavior.
Right now we are working around all gcc warnings generated by the
-Wshorten-64-to-32 option by passing the off_t through xsize_t().

In the future we should make xsize_t on such problematic platforms
detect the wrapping and die if such a file is accessed.

Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-03-07 11:15:26 -08:00
David Rientjes a89fccd281 Do not use memcmp(sha1_1, sha1_2, 20) with hardcoded length.
Introduces global inline:

	hashcmp(const unsigned char *sha1, const unsigned char *sha2)

Uses memcmp for comparison and returns the result based on the length of
the hash name (a future runtime decision).

Acked-by: Alex Riesen <raa.lkml@gmail.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-08-17 14:23:53 -07:00
Junio C Hamano c06c79667c diffcore-rename: somewhat optimized.
This changes diffcore-rename to reuse statistics information
gathered during similarity estimation, and updates the hashtable
implementation used to keep track of the statistics to be
denser.  This seems to give better performance.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-12 03:22:10 -08:00
Junio C Hamano 4d0f39cecf diffcore-break: similarity estimator fix.
This is a companion patch to the previous fix to diffcore-rename.
The merging-back process should use a logic similar to what is used
there.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-04 13:26:36 -08:00
Junio C Hamano 65416758cd diffcore-rename: split out the delta counting code.
This is to rework diffcore break/rename/copy detection code
so that it does not affected when deltifier code gets improved.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-28 20:20:04 -08:00
Junio C Hamano aeecd23ae2 diffcore-break: micro-optimize by avoiding delta between identical files.
We did not check if we have the same file on both sides when
computing break score.  This is usually not a problem, but if
the user said --find-copies-harde with -B, we ended up trying a
delta between the same data even when we know the SHA1 hash of
both sides match.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-28 20:19:47 -08:00
Junio C Hamano 0532a5e46b diffcore-break: do not break too small filepair.
Somehow we checked only one side and not the other.  By checking
the filesize upfront, we can bypass generating delta
unnecessarily.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-12-12 17:15:55 -08:00
Junio C Hamano d28c8af623 diffcore-break.c: check diff_delta() return value.
This bug caused Darrin Thompson to notice that our deltifier was
half broken and punting on an empty blob.

Signed-off-by: Junio C Hamano <junio@twinsun.com>
2005-12-12 12:57:25 -08:00
Junio C Hamano 19397b4521 Revert "[PATCH] plug memory leak in diff.c::diff_free_filepair()"
This reverts 068eac91ce commit.
2005-09-14 14:06:50 -07:00
Yasushi SHOJI 068eac91ce [PATCH] plug memory leak in diff.c::diff_free_filepair()
When I run git-diff-tree on big change, it seems the command eats so
much memory.  so I just put git under valgrind to see what's going on.
diff_free_filespec_data() doesn't free diff_filespec itself.

[jc: I ended up doing things slightly differently from Yasushi's
patch.  The original idea was to use free_filespec_data() only to
free the data portion and keep useing the filespec itself, but
no existing code seems to do things that way, so I just yanked
that part out.]

Signed-off-by: Yasushi SHOJI <yashi@atmark-techno.com>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-08-13 18:28:55 -07:00
Junio C Hamano 3c84974207 [PATCH] Fixlets on top of Nico's clean-up.
If we prefer 0 as maxsize for diff_delta() to say "unlimited", let's be
consistent about it.

This patch also fixes type mismatch in a call to get_delta_hdr_size()
from packed_delta_info().

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-29 09:11:38 -07:00
Linus Torvalds 75c42d8cc3 Add a "max_size" parameter to diff_delta()
Anything that generates a delta to see if two objects are close usually
isn't interested in the delta ends up being bigger than some specified
size, and this allows us to stop delta generation early when that
happens.
2005-06-25 19:30:20 -07:00
Junio C Hamano 366175ef8c [PATCH] Rework -B output.
Patch for a completely rewritten file detected by the -B flag
was shown as a pair of creation followed by deletion in earlier
versions.  This was an misguided attempt to make reviewing such
a complete rewrite easier, and unnecessarily ended up confusing
git-apply.  Instead, show the entire contents of old version
prefixed with '-', followed by the entire contents of new
version prefixed with '+'.  This gives the same easy-to-review
for human consumer while keeping it a single, regular
modification patch for machine consumption, something that even
GNU patch can grok.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-19 20:13:18 -07:00
Junio C Hamano f78c79c5d4 [PATCH] diffcore-break.c: various fixes.
This fixes three bugs in the -B heuristics.

 - Although it was advertised that the initial break criteria
   used was the same as what diffcore-rename uses, it was using
   something different.  Instead of using smaller of src and dst
   size to compare with "edit" size, (insertion and deletion),
   it was using larger of src and dst, unlike the rename/copy
   detection logic.  This caused the parameter to -B to mean
   something different from the one to -M and -C.  To compensate
   for this change, the default break score is also changed to
   match that of the default for rename/copy.

 - The code would have crashed with division by zero when trying
   to break an originally empty file.

 - Contrary to what the comment said, the algorithm was breaking
   small files, only to later merge them together.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-05 14:14:58 -07:00
Junio C Hamano eeaa460314 [PATCH] diff: Update -B heuristics.
As Linus pointed out on the mailing list discussion, -B should
break a files that has many inserts even if it still keeps
enough of the original contents, so that the broken pieces can
later be matched with other files by -M or -C.  However, if such
a broken pair does not get picked up by -M or -C, we would want
to apply different criteria; namely, regardless of the amount of
new material in the result, the determination of "rewrite"
should be done by looking at the amount of original material
still left in the result.  If you still have the original 97
lines from a 100-line document, it does not matter if you add
your own 13 lines to make a 110-line document, or if you add 903
lines to make a 1000-line document.  It is not a rewrite but an
in-place edit.  On the other hand, if you did lose 97 lines from
the original, it does not matter if you added 27 lines to make a
30-line document or if you added 997 lines to make a 1000-line
document.  You did a complete rewrite in either case.

This patch introduces a post-processing phase that runs after
diffcore-rename matches up broken pairs diffcore-break creates.
The purpose of this post-processing is to pick up these broken
pieces and merge them back into in-place modifications.  For
this, the score parameter -B option takes is changed into a pair
of numbers, and it takes "-B99/80" format when fully spelled
out.  The first number is the minimum amount of "edit" (same
definition as what diffcore-rename uses, which is "sum of
deletion and insertion") that a modification needs to have to be
broken, and the second number is the minimum amount of "delete"
a surviving broken pair must have to avoid being merged back
together.  It can be abbreviated to "-B" to use default for
both, "-B9" or "-B9/" to use 90% for "edit" but default (80%)
for merge avoidance, or "-B/75" to use default (99%) "edit" and
75% for merge avoidance.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-03 11:23:03 -07:00
Junio C Hamano 355e76a4a3 [PATCH] Tweak count-delta interface
Make it return copied source and insertion separately, so that
later implementation of heuristics can use them more flexibly.

This does not change the heuristics implemented in
diffcore-rename nor diffcore-break in any way.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-03 11:23:03 -07:00
Junio C Hamano f345b0a066 [PATCH] Add -B flag to diff-* brothers.
A new diffcore transformation, diffcore-break.c, is introduced.

When the -B flag is given, a patch that represents a complete
rewrite is broken into a deletion followed by a creation.  This
makes it easier to review such a complete rewrite patch.

The -B flag takes the same syntax as the -M and -C flags to
specify the minimum amount of non-source material the resulting
file needs to have to be considered a complete rewrite, and
defaults to 99% if not specified.

As the new test t4008-diff-break-rewrite.sh demonstrates, if a
file is a complete rewrite, it is broken into a delete/create
pair, which can further be subjected to the usual rename
detection if -M or -C is used.  For example, if file0 gets
completely rewritten to make it as if it were rather based on
file1 which itself disappeared, the following happens:

    The original change looks like this:

	file0     --> file0' (quite different from file0)
	file1     --> /dev/null

    After diffcore-break runs, it would become this:

	file0     --> /dev/null
	/dev/null --> file0'
	file1     --> /dev/null

    Then diffcore-rename matches them up:

	file1     --> file0'

The internal score values are finer grained now.  Earlier
maximum of 10000 has been raised to 60000; there is no user
visible changes but there is no reason to waste available bits.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-30 10:35:49 -07:00