When you do this, existing "blame -C -C" would not find that the
latter half of the file2 came from the existing file1:
... both file1 and file2 are tracked ...
$ cat file1 >>file2
$ git add file1 file2
$ git commit
This is because we avoid the expensive find-copies-harder code
that makes unchanged file (in this case, file1) as a candidate
for copy & paste source when annotating an existing file
(file2). The third -C now allows it. However, this obviously
makes the process very expensive. We've actually seen this
patch before, but I dismissed it because it covers such a narrow
(and arguably stupid) corner case.
Signed-off-by: Junio C Hamano <junkio@cox.net>
The -C option to blame tries to find a section of a preimage
file by running diff against the lines whose origin is still
unknown, and excluding the different parts. The code however
did not cover the case where the tail part of the section
matched, which we handle for the normal non-move/copy codepath.
This breakage was most visible when preimage file matches in its
entirety and failed to pass blame in such a case.
Signed-off-by: Junio C Hamano <junkio@cox.net>
Moved options that pertained to both git-blame and git-annotate to a
common file blame-options.txt.
builtin-blame.c: Removed --compatibility, --long, --time from the
short usage as they are not handled in the code.
Documentation/git-blame.txt: Removed common options to git-annotate.
Added documentation for --score-debug. Removed --compatibility.
Adjusted usage at top to not wrap on 80 columns.
Documentation/git-annotate.txt: Using common options blame-options.txt.
Documentation/blame-options.txt: Added -b note about associated config
option, added --root note about associated config option, added
documentation for --show-stats. Removed --long, --time, --rev-file as
those options do not really exist. Added documentation for -M/-C taking
an optional score argument for detection of moved lines.
Signed-off-by: Andrew Ruder <andy@aeruder.net>
Signed-off-by: Junio C Hamano <junkio@cox.net>
git-blame would overflow commit->buffer when annotating files with long paths.
Signed-off-by: Michael Spang <mspang@uwaterloo.ca>
Signed-off-by: Junio C Hamano <junkio@cox.net>
The earlier round makes the function return "is it different"
and it does not return a value suitable for sorting anymore. Reverse
the logic to return "are they the same suspect" instead, and rename
it to "same_suspect()".
Signed-off-by: Junio C Hamano <junkio@cox.net>
The commit structures are guaranteed their uniqueness by the object
layer, so we can check their address and see if they are the same
without going down to the object sha1 level.
Signed-off-by: Junio C Hamano <junkio@cox.net>
Some systems have sizeof(off_t) == 8 while sizeof(size_t) == 4.
This implies that we are able to access and work on files whose
maximum length is around 2^63-1 bytes, but we can only malloc or
mmap somewhat less than 2^32-1 bytes of memory.
On such a system an implicit conversion of off_t to size_t can cause
the size_t to wrap, resulting in unexpected and exciting behavior.
Right now we are working around all gcc warnings generated by the
-Wshorten-64-to-32 option by passing the off_t through xsize_t().
In the future we should make xsize_t on such problematic platforms
detect the wrapping and die if such a file is accessed.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
We shouldn't attempt to assign constant strings into char*, as the
string is not writable at runtime. Likewise we should always be
treating unsigned values as unsigned values, not as signed values.
Most of these are very straightforward. The only exception is the
(unnecessary) xstrdup/free in builtin-branch.c for the detached
head case. Since this is a user-level interactive type program
and that particular code path is executed no more than once, I feel
that the extra xstrdup call is well worth the easy elimination of
this warning.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
We currently have two parallel notation for dealing with object types
in the code: a string and a numerical value. One of them is obviously
redundent, and the most used one requires more stack space and a bunch
of strcmp() all over the place.
This is an initial step for the removal of the version using a char array
found in object reading code paths. The patch is unfortunately large but
there is no sane way to split it in smaller parts without breaking the
system.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
Previous step converted use of strncmp() with literal string
mechanically even when the result is only used as a boolean:
if (!strncmp("foo", arg, 3)) ==> if (!(-prefixcmp(arg, "foo")))
This step manually cleans them up to read:
if (!prefixcmp(arg, "foo"))
Signed-off-by: Junio C Hamano <junkio@cox.net>
This mechanically converts strncmp() to use prefixcmp(), but only when
the parameters match specific patterns, so that they can be verified
easily. Leftover from this will be fixed in a separate step, including
idiotic conversions like
if (!strncmp("foo", arg, 3))
=>
if (!(-prefixcmp(arg, "foo")))
This was done by using this script in px.perl
#!/usr/bin/perl -i.bak -p
if (/strncmp\(([^,]+), "([^\\"]*)", (\d+)\)/ && (length($2) == $3)) {
s|strncmp\(([^,]+), "([^\\"]*)", (\d+)\)|prefixcmp($1, "$2")|;
}
if (/strncmp\("([^\\"]*)", ([^,]+), (\d+)\)/ && (length($1) == $3)) {
s|strncmp\("([^\\"]*)", ([^,]+), (\d+)\)|(-prefixcmp($2, "$1"))|;
}
and running:
$ git grep -l strncmp -- '*.c' | xargs perl px.perl
Signed-off-by: Junio C Hamano <junkio@cox.net>
The 3rd branch in builtin-blame.c should also check for lacking
arguments. Running that in top dir does not trigger the problem
because the 'prefix' is NULL.
Signed-off-by: Tommi Kyntola <tommi.kyntola@ray.fi>
Signed-off-by: Junio C Hamano <junkio@cox.net>
git-merge-recursive wants an null tree as the fake merge base
while producing the merge result tree. The null tree does not
have to be written out in the object store as it won't be part
of the result, and it is a prime example for using the new
pretend_sha1_file() function.
git-blame needs to register an arbitrary data to in-core index
while annotating a working tree file (or standard input), but
git-blame is a read-only application and the user of it could
even lack the privilege to write into the object store; it is
another good example for pretend_sha1_file().
Signed-off-by: Junio C Hamano <junkio@cox.net>
Warning: this changes the semantics.
This makes "git blame" without any positive rev to start digging
from the working tree copy, which is made into a fake commit
whose sole parent is the HEAD.
It also adds --contents <file> option to pretend as if the
working tree copy has the contents of the named file. You can
use '-' to make the command read from the standard input.
If you want the command to start annotating from the HEAD
commit, you need to explicitly give HEAD parameter.
Signed-off-by: Junio C Hamano <junkio@cox.net>
Starting a pager defeats the purpose of the incremental output
mode. This changes git-blame to only paginate if --incremental
was not given.
git -p blame --incremental still starts the pager, though.
Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
Signed-off-by: Junio C Hamano <junkio@cox.net>
Otherwise a pathname that has funny characters such as LF would
screw up the parsing programs of the output.
Strictly speaking, this is not backward compatible, but the
current output for pathnames that have embedded LF and such
cannot be sanely parsed anyway, and pathnames that only use
characters from the portable pathname character set won't be
affected.
Signed-off-by: Junio C Hamano <junkio@cox.net>
This adds --incremental option to help GUI porcelains to show
the result from git-blame incrementally. The output gives the
origin information in the same format as the porcelain format.
The first line has commit object name, the line number of the
first line in the group in the original file, the line number of
that file in the final image, and number of lines in the group.
Then subsequent lines show the metainformation for the commit
when the commit is shown for the first time, except the filename
information is always shown (we cannot even make it conditional
to -C option as blame always follows the renaming of the file
wholesale).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This is a mechanical clean-up of the way *.c files include
system header files.
(1) sources under compat/, platform sha-1 implementations, and
xdelta code are exempt from the following rules;
(2) the first #include must be "git-compat-util.h" or one of
our own header file that includes it first (e.g. config.h,
builtin.h, pkt-line.h);
(3) system headers that are included in "git-compat-util.h"
need not be included in individual C source files.
(4) "git-compat-util.h" does not have to include subsystem
specific header files (e.g. expat.h).
Signed-off-by: Junio C Hamano <junkio@cox.net>
When blame.blankboundary is set (or -b option is given), commit
object names are blanked out in the "human readable" output
format for boundary commits.
When blame.showroot is not set (or --root is not given), the
root commits are treated as boundary commits. The code still
attributes the lines to them, but with -b their object names are
not shown.
Signed-off-by: Junio C Hamano <junkio@cox.net>
When blaming with revision ranges, often many lines are attributed
to different commits at the boundary, but they are not interesting
for the purpose of finding project history during that revision range.
This outputs the lines blamed on boundary commits differently. When
showing "human format" output, their SHA-1 are shown with '^' prefixed.
In "porcelain format", the commit will be shown with an extra attribute
line "boundary".
Signed-off-by: Junio C Hamano <junkio@cox.net>
We lacked "--" termination in the underlying init_revisions() call
which made it impossible to specify a revision that happens to
have the same name as an existing file.
Signed-off-by: Junio C Hamano <junkio@cox.net>
We used to get the case that more than two paths came from the
same commit wrong when computing the output width and deciding
to turn on --show-name option automatically. When we find that
lines that came from a path that is different from what we
started digging from, we should always turn --show-name on, and
we should count the name length for all files involved.
Signed-off-by: Junio C Hamano <junkio@cox.net>
The graft file can contain comment lines and read_graft_line can
return NULL for such an input, which should be skipped by the
reader.
Signed-off-by: Junio C Hamano <junkio@cox.net>
Just make it take over blame's place. Documentation and command
have all stopped mentioning "git-pickaxe". The built-in synonym
is left in the command table, so you can still say "git pickaxe",
but it probably is a good idea to retire it as well.
Signed-off-by: Junio C Hamano <junkio@cox.net>
With this,
git pickaxe -L '/--progress/,+20' v1.4.0 -- pack-objects.c
gives you 20 lines starting from the first occurrence of
'--progress' in pack-objects, digging from v1.4.0 version.
You can also say
git pickaxe -L '/--progress/,-5' v1.4.0 -- pack-objects.c
Signed-off-by: Junio C Hamano <junkio@cox.net>
With this change, you can specify the beginning and the ending
line of the range you wish to inspect with pattern matching.
For example, these are equivalent with the git.git sources:
git pickaxe -L 7,21 v1.4.0 -- commit.c
git pickaxe -L '/^struct sort_node/,/^}/' v1.4.0 -- commit.c
git pickaxe -L '7,/^}/' v1.4.0 -- commit.c
git pickaxe -L '/^struct sort_node/,21' v1.4.0 -- commit.c
Signed-off-by: Junio C Hamano <junkio@cox.net>
It turns out that pickaxe reads the same blob repeatedly while
blame can reuse the blob already read for the parent when
handling a child commit when it's parent's turn to pass its
blame to the grandparent. Have a cache in the origin structure
to keep the blob there, which will be garbage collected when the
origin loses the last reference to it.
Signed-off-by: Junio C Hamano <junkio@cox.net>
When we introduced the cached origin per commit, we gave up proper
garbage collecting because it meant that commits hold onto their
cached copy. There is no need to do so.
Signed-off-by: Junio C Hamano <junkio@cox.net>
The reason to do this is the same as in the previous change for
line copy detection within the same file (-M).
Also this fixes -C and -C -C (aka find-copies-harder) logic; in
this application we are not interested in the similarity
matching diffcore-rename makes, because we are only interested
in scanning files that were modified, or in the case of -C -C,
scanning all files in the parent and we want to do that
ourselves.
Signed-off-by: Junio C Hamano <junkio@cox.net>
Otherwise we would miss copied lines that are contained in the
parts before or after the part that we find after splitting the
blame_entry (i.e. split[0] and split[2]).
Signed-off-by: Junio C Hamano <junkio@cox.net>
If more than one parents in an Octopus merge have the same
origin, ignore later ones because it would not make any
difference in the outcome.
Signed-off-by: Junio C Hamano <junkio@cox.net>
The idea is that we are interested in renaming into only one path, so
we do not care about renames that happen elsewhere.
Signed-off-by: Junio C Hamano <junkio@cox.net>
We forgot to add prefix to the given path.
[jc: interestingly enough, Jeff King had the same idea after I
pushed mine out to "pu", and his patch was cleaner, so I dropped
mine.]
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <junkio@cox.net>
Depending on how bushy the commit DAG is, this saves calls to
the internal diff-tree for fork-point commits. For example,
annotating Makefile in the kernel repository saves about a third
of such diff-tree calls.
Signed-off-by: Junio C Hamano <junkio@cox.net>
When a merge adds a new file from the second parent, the
earlier code tried to find renames in the first parent before
noticing that the vertion from the second parent was added
without modification.
Signed-off-by: Junio C Hamano <junkio@cox.net>
When compiled for debugging, make sure that refcnt sanity check
code detects underflows in origin reference counting.
Signed-off-by: Junio C Hamano <junkio@cox.net>
This makes "git-pickaxe -C master -- revision.c" to finish with
proper refcounts for all origins. I am reasonably happy with
it.
Signed-off-by: Junio C Hamano <junkio@cox.net>
The command rejects -L1,10 as an invalid line range specifier
and I got frustrated enough by it, so this makes it allow both
forms of input.
Signed-off-by: Junio C Hamano <junkio@cox.net>
The origin structure is allocated for each commit and path while
the code traverse down it is copied into different blame entries.
To avoid leaks, try refcounting them.
This still seems to leak, which I haven't tracked down fully yet.
Signed-off-by: Junio C Hamano <junkio@cox.net>
When assigning blames for code movements across file boundaries,
we used to iterate over blame entries (i.e. groups of lines to
be blamed) in the outer loop and compared each entry with paths
in the parent commit in an inner loop. This meant that we
opened the blob data from each path number of times.
Reorganize the loop so that we read the same path only once, and
compare it against all relevant blame entries.
This should perform better, but seems to give mixed results,
though.
Signed-off-by: Junio C Hamano <junkio@cox.net>
After finding out which path in the parent to scan to pass
blames, using get_tree_entry() to extract the blob information
again was quite wasteful, since diff-tree already gave us that
information. Separate the function to create an origin out as
get_origin().
You'll never know what is more efficient unless you try and/or
think hard. I somehow thought that extracting one known path
out of commit's tree is cheaper than running a diff-tree for the
current path between the commit and its parent, but it is not
the case. In real, non-toy projects, most commits do not touch
the path you are interested in, and if the path is a few levels
away from the toplevel, whole-subdirectory comparison logic
diff-tree allows us to skip opening lower subdirectories.
This commit rewrites find_origin() function to use a single-path
diff-tree to see if the parent has the same blob as the current
suspect, which is cheaper than extracting the blob information
using get_tree_entry() and comparing it with what the current
suspect has. This shaves about 6% overhead when annotating
kernel/sched.c in the Linux kernel repository on my machine.
The saving rises to 25% for arch/i386/kernel/Makefile.
Signed-off-by: Junio C Hamano <junkio@cox.net>
It used to be that we can compare the address of the origin
structure to determine if they are the same because they are
always registered with scoreboard. After introduction of the
loop to try finding the best split, that is not true anymore.
The current code has rather serious leaks with origin structure,
but more importantly it gets confused when two origins that
points at the same commit and same path.
We might eventually have to refcount and gc origin, but let's
fix the correctness issue first.
Signed-off-by: Junio C Hamano <junkio@cox.net>
This adds scoring logic to blame_entry to prevent blames on very
trivial chunks (e.g. lots of empty lines, indent followed by a
closing brace) from being passed down to unrelated lines in the
parent.
The current heuristics are quite simple and may need to be
tweaked later, but we need to start somewhere.
Signed-off-by: Junio C Hamano <junkio@cox.net>
Instead of comparing number of lines matched, look at the
matched characters and count alnums, so that we do not pass
blame on not-so-interesting lines, such as an empty line and
a line that is indentation followed by a closing brace.
Add an option --score-debug to show the score of each
blame_entry while we cook this further on the "next" branch.
Signed-off-by: Junio C Hamano <junkio@cox.net>