blame: add a fingerprint heuristic to match ignored lines

This algorithm will replace the heuristic used to identify lines from
ignored commits with one that finds likely candidate lines in the
parent's version of the file. The actual replacement occurs in an
upcoming commit.

The old heuristic simply assigned lines in the target to the same line
number (plus offset) in the parent. The new function uses a
fingerprinting algorithm to detect similarity between lines.

The new heuristic is designed to accurately match changes made
mechanically by formatting tools such as clang-format and clang-tidy.
These tools make changes such as breaking up lines to fit within a
character limit or changing identifiers to fit with a naming convention.
The heuristic is not intended to match more extensive refactoring
changes and may give misleading results in such cases.

In most cases formatting tools preserve line ordering, so the heuristic
is optimised for such cases. (Some types of changes do reorder lines,
e.g. sorting, but these keep the line content identical, so the git
blame -M option can already be used to address them.) Relying on
ordering is advantageous because source code often repeats the same
character sequences, e.g. declaring an identifier on one line and using
that identifier on several subsequent lines. This means that lines can
look very similar to each other, which presents a problem when doing
fuzzy matching. Relying on ordering gives us extra clues to point
towards the true match.

The heuristic operates on a single diff chunk change at a time. It
creates a “fingerprint” for each line on each side of the change.
Fingerprints are described in detail in the comment for `struct
fingerprint`, but essentially are a multiset of the character pairs in a
line. The heuristic first identifies the line in the target entry whose
fingerprint is most clearly matched to a line fingerprint in the parent
entry. Where fingerprints match identically, the position of the lines
is used as a tie-break. The heuristic locks in the best match, and
subtracts the fingerprint of the line in the target entry from the
fingerprint of the line in the parent entry to prevent other lines being
matched on the same parts of that line. It then repeats the process
recursively on the section of the chunk before the match, and then the
section of the chunk after the match.

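To make the "multiset of character pairs" idea concrete, here is a minimal sketch in shell and awk. It is not the code in blame.c: the helper name `pair_similarity` is hypothetical, whitespace is treated like any other character, and similarity is taken to be the size of the multiset intersection. Those are simplifying assumptions made for illustration only.

```shell
#!/bin/sh
# Hypothetical helper, not part of git: count the character pairs that
# two strings have in common, treating each string as a multiset of its
# adjacent character pairs. Arguments must not contain backslashes,
# because awk -v interprets escape sequences.
pair_similarity () {
	awk -v a="$1" -v b="$2" 'BEGIN {
		# Build the multiset of pairs for the first string.
		for (i = 1; i < length(a); i++)
			count[substr(a, i, 2)]++
		# Consume one matching pair for every pair of the second string.
		shared = 0
		for (i = 1; i < length(b); i++) {
			pair = substr(b, i, 2)
			if (count[pair] > 0) {
				count[pair]--
				shared++
			}
		}
		print shared
	}'
}

# A continuation line split off by a formatter is equally similar to
# both candidate parent lines, so content alone cannot decide.
pair_similarity 'void func_1(void *x, void *y);' 'void *y);'
pair_similarity 'void func_2(void *x, void *y);' 'void *y);'
```

Both calls print the same score, which is why a content-only comparison needs the position tie-break described above.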
Here's an example of the difference the fingerprinting makes. Consider
a file with two commits:

    commit-a 1) void func_1(void *x, void *y);
    commit-b 2) void func_2(void *x, void *y);

After a commit 'X', we have:

    commit-X 1) void func_1(void *x,
    commit-X 2) void *y);
    commit-X 3) void func_2(void *x,
    commit-X 4) void *y);

When we blame with commit X ignored using the old algorithm, we get:

    commit-a 1) void func_1(void *x,
    commit-b 2) void *y);
    commit-X 3) void func_2(void *x,
    commit-X 4) void *y);

Here commit-b is blamed for line 2 instead of line 3. With the
fingerprint algorithm, we get:

    commit-a 1) void func_1(void *x,
    commit-a 2) void *y);
    commit-b 3) void func_2(void *x,
    commit-b 4) void *y);

Note that line 2 could be matched with either commit-a or commit-b as it
is equally similar to both lines, but it is matched with commit-a
because its position as a fraction of the new line range is closer to
the position of commit-a as a fraction of the old line range. Line 4 is
also equally similar to both lines, but as it appears after line 3,
which will be matched first, it cannot be matched with an earlier line.

For many more examples, see t/t8014-blame-ignore-fuzzy.sh, which contains
example parent and target files and the line numbers in the parent that
must be matched.

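The position tie-break in the example above can be sketched with integer arithmetic. The helper name `nearest_parent_line` and the exact rounding (comparing line midpoints) are assumptions made for illustration; the real rule lives in blame.c and may differ in detail.

```shell
#!/bin/sh
# Hypothetical helper: map a 0-based line index in the target range to
# the 0-based index of the parent line at the same relative position.
# Usage: nearest_parent_line INDEX_B LENGTH_B LENGTH_A
nearest_parent_line () {
	# (2 * index + 1) / (2 * length_b) is the midpoint of the target line
	# as a fraction of its range; scale it by the parent range length.
	echo $(( ($1 * 2 + 1) * $3 / ($2 * 2) ))
}

# Four target lines (commit X) against two parent lines (commit-a, commit-b).
for b in 0 1 2 3
do
	a=$(nearest_parent_line "$b" 4 2)
	echo "target line $((b + 1)) -> parent line $((a + 1))"
done
```

Under these assumptions, target lines 1 and 2 map to parent line 1 (commit-a) and target lines 3 and 4 map to parent line 2 (commit-b), matching the blame output shown for the fingerprint algorithm.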
Signed-off-by: Michael Platings <michael@platin.gs>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
6 years ago
#!/bin/sh

test_description='git blame ignore fuzzy heuristic'
. ./test-lib.sh

pick_author='s/^[0-9a-f^]* *(\([^ ]*\) .*/\1/'

# Each test is composed of 4 variables:
# titleN - the test name
# aN - the initial content
# bN - the final content
# expectedN - the line numbers from aN that we expect git blame
#             on bN to identify, or "Final" if bN itself should
#             be identified as the origin of that line.

# We start at test 2 because setup will show as test 1
title2="Regression test for partially overlapping search ranges"
cat <<EOF >a2
1
2
3
abcdef
5
6
7
ijkl
9
10
11
pqrs
13
14
15
wxyz
17
18
19
EOF
cat <<EOF >b2
abcde
ijk
pqr
wxy
EOF
cat <<EOF >expected2
4
8
12
16
EOF

title3="Combine 3 lines into 2"
cat <<EOF >a3
if ((maxgrow==0) ||
( single_line_field && (field->dcols < maxgrow)) ||
(!single_line_field && (field->drows < maxgrow)))
EOF
cat <<EOF >b3
if ((maxgrow == 0) || (single_line_field && (field->dcols < maxgrow)) ||
(!single_line_field && (field->drows < maxgrow))) {
EOF
cat <<EOF >expected3
2
3
EOF

title4="Add curly brackets"
cat <<EOF >a4
if (rows) *rows = field->rows;
if (cols) *cols = field->cols;
if (frow) *frow = field->frow;
if (fcol) *fcol = field->fcol;
EOF
cat <<EOF >b4
if (rows) {
*rows = field->rows;
}
if (cols) {
*cols = field->cols;
}
if (frow) {
*frow = field->frow;
}
if (fcol) {
*fcol = field->fcol;
}
EOF
cat <<EOF >expected4
1
1
Final
2
2
Final
3
3
Final
4
4
Final
EOF


title5="Combine many lines and change case"
cat <<EOF >a5
for(row=0,pBuffer=field->buf;
row<height;
row++,pBuffer+=width )
{
if ((len = (int)( After_End_Of_Data( pBuffer, width ) - pBuffer )) > 0)
{
wmove( win, row, 0 );
waddnstr( win, pBuffer, len );
EOF
cat <<EOF >b5
for (Row = 0, PBuffer = field->buf; Row < Height; Row++, PBuffer += Width) {
if ((Len = (int)(afterEndOfData(PBuffer, Width) - PBuffer)) > 0) {
wmove(win, Row, 0);
waddnstr(win, PBuffer, Len);
EOF
cat <<EOF >expected5
1
5
7
8
EOF

title6="Rename and combine lines"
cat <<EOF >a6
bool need_visual_update = ((form != (FORM *)0) &&
(form->status & _POSTED) &&
(form->current==field));

if (need_visual_update)
Synchronize_Buffer(form);

if (single_line_field)
{
growth = field->cols * amount;
if (field->maxgrow)
growth = Minimum(field->maxgrow - field->dcols,growth);
field->dcols += growth;
if (field->dcols == field->maxgrow)
EOF
cat <<EOF >b6
bool NeedVisualUpdate = ((Form != (FORM *)0) && (Form->status & _POSTED) &&
(Form->current == field));

if (NeedVisualUpdate) {
synchronizeBuffer(Form);
}

if (SingleLineField) {
Growth = field->cols * amount;
if (field->maxgrow) {
Growth = Minimum(field->maxgrow - field->dcols, Growth);
}
field->dcols += Growth;
if (field->dcols == field->maxgrow) {
EOF
cat <<EOF >expected6
1
3
4
5
6
Final
7
8
10
11
12
Final
13
14
EOF

# Both lines match identically so position must be used to tie-break.
title7="Same line twice"
cat <<EOF >a7
abc
abc
EOF
cat <<EOF >b7
abcd
abcd
EOF
cat <<EOF >expected7
1
2
EOF

title8="Enforce line order"
cat <<EOF >a8
abcdef
ghijkl
ab
EOF
cat <<EOF >b8
ghijk
abcd
EOF
cat <<EOF >expected8
2
3
EOF

title9="Expand lines and rename variables"
cat <<EOF >a9
int myFunction(int ArgumentOne, Thing *ArgTwo, Blah XuglyBug) {
Squiggle FabulousResult = squargle(ArgumentOne, *ArgTwo,
XuglyBug) + EwwwGlobalWithAReallyLongNameYepTooLong;
return FabulousResult * 42;
}
EOF
cat <<EOF >b9
int myFunction(int argument_one, Thing *arg_asdfgh,
Blah xugly_bug) {
Squiggle fabulous_result = squargle(argument_one,
*arg_asdfgh, xugly_bug)
+ g_ewww_global_with_a_really_long_name_yep_too_long;
return fabulous_result * 42;
}
EOF
cat <<EOF >expected9
1
1
2
3
3
4
5
EOF

title10="Two close matches versus one less close match"
cat <<EOF >a10
abcdef
abcdef
ghijkl
EOF
cat <<EOF >b10
gh
abcdefx
EOF
cat <<EOF >expected10
Final
2
EOF

# The first line of b matches best with the last line of a, but the overall
# match is better if we match it with the first line of a.
title11="Piggy in the middle"
cat <<EOF >a11
abcdefg
ijklmn
abcdefgh
EOF
cat <<EOF >b11
abcdefghx
ijklm
EOF
cat <<EOF >expected11
1
2
EOF

title12="No trailing newline"
printf "abc\ndef" >a12
printf "abx\nstu" >b12
cat <<EOF >expected12
1
Final
EOF

title13="Reorder includes"
cat <<EOF >a13
#include "c.h"
#include "b.h"
#include "a.h"
#include "e.h"
#include "d.h"
EOF
cat <<EOF >b13
#include "a.h"
#include "b.h"
#include "c.h"
#include "d.h"
#include "e.h"
EOF
cat <<EOF >expected13
3
2
1
5
4
EOF

last_test=13

test_expect_success setup '
	for i in $(test_seq 2 $last_test)
	do
		# Append each line in a separate commit to make it easy to
		# check which original line the blame output relates to.

		line_count=0 &&
		while IFS= read line
		do
			line_count=$((line_count+1)) &&
			echo "$line" >>"$i" &&
			git add "$i" &&
			test_tick &&
			GIT_AUTHOR_NAME="$line_count" git commit -m "$line_count"
		done <"a$i"
	done &&

	for i in $(test_seq 2 $last_test)
	do
		# Overwrite the files with the final content.
		cp b$i $i &&
		git add $i
	done &&
	test_tick &&

	# Commit the final content all at once so it can all be
	# referred to with the same commit ID.
	GIT_AUTHOR_NAME=Final git commit -m Final &&

	IGNOREME=$(git rev-parse HEAD)
'

for i in $(test_seq 2 $last_test); do
	eval title="\$title$i"
	test_expect_success "$title" \
	"git blame -M9 --ignore-rev $IGNOREME $i >output &&
	sed -e \"$pick_author\" output >actual &&
	test_cmp expected$i actual"
done

# This invoked a null pointer dereference when the chunk callback was called
# with a zero length parent chunk and there were no more suspects.
test_expect_success 'Diff chunks with no suspects' '
	test_write_lines xy1 A B C xy1 >file &&
	git add file &&
	test_tick &&
	GIT_AUTHOR_NAME=1 git commit -m 1 &&

	test_write_lines xy2 A B xy2 C xy2 >file &&
	git add file &&
	test_tick &&
	GIT_AUTHOR_NAME=2 git commit -m 2 &&
	REV_2=$(git rev-parse HEAD) &&

	test_write_lines xy3 A >file &&
	git add file &&
	test_tick &&
	GIT_AUTHOR_NAME=3 git commit -m 3 &&
	REV_3=$(git rev-parse HEAD) &&

	test_write_lines 1 1 >expected &&

	git blame --ignore-rev $REV_2 --ignore-rev $REV_3 file >output &&
	sed -e "$pick_author" output >actual &&

	test_cmp expected actual
'

test_expect_success 'position matching' '
	test_write_lines abc def >file2 &&
	git add file2 &&
	test_tick &&
	GIT_AUTHOR_NAME=1 git commit -m 1 &&

	test_write_lines abc def abc def >file2 &&
	git add file2 &&
	test_tick &&
	GIT_AUTHOR_NAME=2 git commit -m 2 &&

	test_write_lines abcx defx abcx defx >file2 &&
	git add file2 &&
	test_tick &&
	GIT_AUTHOR_NAME=3 git commit -m 3 &&
	REV_3=$(git rev-parse HEAD) &&

	test_write_lines abcy defy abcx defx >file2 &&
	git add file2 &&
	test_tick &&
	GIT_AUTHOR_NAME=4 git commit -m 4 &&
	REV_4=$(git rev-parse HEAD) &&

	test_write_lines 1 1 2 2 >expected &&

	git blame --ignore-rev $REV_3 --ignore-rev $REV_4 file2 >output &&
	sed -e "$pick_author" output >actual &&

	test_cmp expected actual
'

# This fails if each blame entry is processed independently instead of
# processing each diff change in full.
test_expect_success 'preserve order' '
	test_write_lines bcde >file3 &&
	git add file3 &&
	test_tick &&
	GIT_AUTHOR_NAME=1 git commit -m 1 &&

	test_write_lines bcde fghij >file3 &&
	git add file3 &&
	test_tick &&
	GIT_AUTHOR_NAME=2 git commit -m 2 &&

	test_write_lines bcde fghij abcd >file3 &&
	git add file3 &&
	test_tick &&
	GIT_AUTHOR_NAME=3 git commit -m 3 &&

	test_write_lines abcdx fghijx bcdex >file3 &&
	git add file3 &&
	test_tick &&
	GIT_AUTHOR_NAME=4 git commit -m 4 &&
	REV_4=$(git rev-parse HEAD) &&

	test_write_lines abcdx fghijy bcdex >file3 &&
	git add file3 &&
	test_tick &&
	GIT_AUTHOR_NAME=5 git commit -m 5 &&
	REV_5=$(git rev-parse HEAD) &&

	test_write_lines 1 2 3 >expected &&

	git blame --ignore-rev $REV_4 --ignore-rev $REV_5 file3 >output &&
	sed -e "$pick_author" output >actual &&

	test_cmp expected actual
'

test_done
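
Every test in this script reduces the blame output to one author name per line with the `pick_author` sed expression and compares that with the expected file. The standalone sketch below shows what the expression does to a single line. The sample lines are made up for illustration (the hashes, dates and contents are not taken from a real run) and only assume the default `git blame` layout of hash, opening parenthesis, author name, date and line number.

```shell
#!/bin/sh
# Same expression as in the test script: skip the (possibly ^-prefixed)
# abbreviated hash, then capture the first word after the "(".
pick_author='s/^[0-9a-f^]* *(\([^ ]*\) .*/\1/'

# Hypothetical blame output lines; the setup step names each author after
# the line number it committed, or "Final" for the ignored commit.
sample_final='^1a2b3c4 (Final 2005-04-07 15:13:13 -0700 1) abcde'
sample_line='5d6e7f80 (12 2005-04-07 15:24:13 -0700 3) pqr'

author_final=$(printf '%s\n' "$sample_final" | sed -e "$pick_author")
author_line=$(printf '%s\n' "$sample_line" | sed -e "$pick_author")

echo "$author_final"
echo "$author_line"
```

Because the setup commits each line of aN under an author name equal to its line number, the extracted names can be compared directly with the numbers listed in expectedN.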