From d0e52c1728bdab27f7ca61ee0d4ff91055646bae Mon Sep 17 00:00:00 2001
From: Jeff King <peff@peff.net>
Date: Wed, 6 Nov 2024 16:16:58 -0500
Subject: [PATCH 1/5] t6120: demonstrate weakness in disjoint-root handling

Commit 30b1c7ad9d (describe: don't abort too early when searching tags,
2020-02-26) tried to fix a problem that happens when there are disjoint
histories: to accurately compare the counts for different tags, we need
to keep walking the history longer in order to find a common base.

But its fix misses a case: we may still bail early if we hit the
max_candidates limit, producing suboptimal output. You can see this in
action by adding "--candidates=2" to the tests; we'll stop traversing as
soon as we see the second tag and will produce the wrong answer. I hit
this in practice while trying to teach git-describe not to keep looking
for candidates after we've seen all tags in the repo (effectively adding
--candidates=2, since these toy repos have only two tags each).

This is probably fixable by continuing to walk after hitting the
max-candidates limit, all the way down to a common ancestor of all
candidates. But it's not clear in practice what the preformance
implications would be (it would depend on how long the branches that
hold the candidates are).

So I'm punting on that for now, but I'd like to adjust the tests to be
more resilient, and to document the findings. So this patch:

  1. Adds an extra tag at the bottom of history. This shouldn't change
     the output, but does mean we are more resilient to low values of
     --candidates (e.g., if we start reducing it to the total number of
     tags). This is arguably closer to the real world anyway, where
     you're not going to have just 2 tags, but an arbitrarily long
     history going back in time, possibly with multiple irrelevant tags
     in it (I called the new tag "H" here for "history").

  2. Run the same tests with --candidates=2, which shows that even with
     the current code they can fail if we end the traversal early. That
     leaves a trail for anybody interested in trying to improve the
     behavior.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 t/t6120-describe.sh | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/t/t6120-describe.sh b/t/t6120-describe.sh
index 05ed2510d9..69689d2f36 100755
--- a/t/t6120-describe.sh
+++ b/t/t6120-describe.sh
@@ -19,6 +19,7 @@ TEST_PASSES_SANITIZE_LEAK=true
 
 check_describe () {
 	indir= &&
+	outcome=success &&
 	while test $# != 0
 	do
 		case "$1" in
@@ -26,6 +27,9 @@ check_describe () {
 			indir="$2"
 			shift
 			;;
+		--expect-failure)
+			outcome=failure
+			;;
 		*)
 			break
 			;;
@@ -36,7 +40,7 @@ check_describe () {
 	expect="$1"
 	shift
 	describe_opts="$@"
-	test_expect_success "describe $describe_opts" '
+	test_expect_${outcome} "describe $describe_opts" '
 		git ${indir:+ -C "$indir"} describe $describe_opts >raw &&
 		sed -e "s/-g[0-9a-f]*\$/-gHASH/" <raw >actual &&
 		echo "$expect" >expect &&
@@ -617,7 +621,7 @@ test_expect_success 'name-rev --annotate-stdin works with commitGraph' '
 
 #               B
 #               o
-#                \
+#  H             \
 #  o-----o---o----x
 #        A
 #
@@ -627,6 +631,7 @@ test_expect_success 'setup: describe commits with disjoint bases' '
 		cd disjoint1 &&
 
 		echo o >> file && git add file && git commit -m o &&
+		git tag H -a -m H &&
 		echo A >> file && git add file && git commit -m A &&
 		git tag A -a -m A &&
 		echo o >> file && git add file && git commit -m o &&
@@ -639,8 +644,9 @@ test_expect_success 'setup: describe commits with disjoint bases' '
 '
 
 check_describe -C disjoint1 "A-3-gHASH" HEAD
+check_describe -C disjoint1 --expect-failure "A-3-gHASH" --candidates=2 HEAD
 
-#           B
+#       H   B
 #   o---o---o------------.
 #                         \
 #                  o---o---x
@@ -658,6 +664,7 @@ test_expect_success 'setup: describe commits with disjoint bases 2' '
 		git checkout --orphan branch &&
 		echo o >> file2 && git add file2 && GIT_COMMITTER_DATE="2020-01-01 15:00" git commit -m o &&
 		echo o >> file2 && git add file2 && GIT_COMMITTER_DATE="2020-01-01 15:01" git commit -m o &&
+		git tag H -a -m H &&
 		echo B >> file2 && git add file2 && GIT_COMMITTER_DATE="2020-01-01 15:02" git commit -m B &&
 		git tag B -a -m B &&
 		git merge --no-ff --allow-unrelated-histories main -m x
@@ -665,6 +672,7 @@ test_expect_success 'setup: describe commits with disjoint bases 2' '
 '
 
 check_describe -C disjoint2 "B-3-gHASH" HEAD
+check_describe -C disjoint2 --expect-failure "B-3-gHASH" --candidates=2 HEAD
 
 test_expect_success 'setup misleading taggerdates' '
 	GIT_COMMITTER_DATE="2006-12-12 12:31" git tag -a -m "another tag" newer-tag-older-commit unique-file~1

From bb0830c6820bf25cdf4722c63a3ff06470e18b0e Mon Sep 17 00:00:00 2001
From: Jeff King <peff@peff.net>
Date: Wed, 6 Nov 2024 16:17:08 -0500
Subject: [PATCH 2/5] t/perf: add tests for git-describe

We don't have a perf script for git-describe, despite it often being
accused of slowness. Let's add a few simple tests to start with.

Rather than use the existing tags from our test repo, we'll make our own
so that we have a known quantity and position. We'll add a "new" tag
near the tip of HEAD, and an "old" one that is at the very bottom. And
then our tests are:

  1. Describing HEAD naively requires walking all the way down to the
     old tag as we collect candidates. This gives us a baseline for what
     "slow" looks like.

  2. Doing the same with --candidates=1 can potentially be fast, because
     we can quie after finding "new". But we don't, and it's also slow.

  3. Likewise we should be able to quit when there are no more tags to
     find. This can happen naturally if a repo has few tags, but also if
     you restrict the set of tags with --match.

Here are the results running against linux.git. Note that I have a
commit-graph built for the repo, so "slow" here is ~700ms. Without a
commit graph it's more like 9s!

  Test                                           HEAD
  --------------------------------------------------------------
  6100.2: describe HEAD                          0.70(0.66+0.04)
  6100.3: describe HEAD with one max candidate   0.70(0.66+0.04)
  6100.4: describe HEAD with one tag             0.70(0.64+0.06)

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 t/perf/p6100-describe.sh | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)
 create mode 100755 t/perf/p6100-describe.sh

diff --git a/t/perf/p6100-describe.sh b/t/perf/p6100-describe.sh
new file mode 100755
index 0000000000..069f91ce49
--- /dev/null
+++ b/t/perf/p6100-describe.sh
@@ -0,0 +1,30 @@
+#!/bin/sh
+
+test_description='performance of git-describe'
+. ./perf-lib.sh
+
+test_perf_default_repo
+
+# clear out old tags and give us a known state
+test_expect_success 'set up tags' '
+	git for-each-ref --format="delete %(refname)" refs/tags >to-delete &&
+	git update-ref --stdin <to-delete &&
+	new=$(git rev-list -1000 HEAD | tail -n 1) &&
+	git tag -m new new $new &&
+	old=$(git rev-list       HEAD | tail -n 1) &&
+	git tag -m old old $old
+'
+
+test_perf 'describe HEAD' '
+	git describe HEAD
+'
+
+test_perf 'describe HEAD with one max candidate' '
+	git describe --candidates=1 HEAD
+'
+
+test_perf 'describe HEAD with one tag' '
+	git describe --match=new HEAD
+'
+
+test_done

From 737904622116cb9b01aa18a0b98e55be16c92a0c Mon Sep 17 00:00:00 2001
From: Jeff King <peff@peff.net>
Date: Wed, 6 Nov 2024 16:17:14 -0500
Subject: [PATCH 3/5] describe: stop digging for max_candidates+1

By default, describe considers only 10 candidate matches, and stops
traversing when we have enough. This makes things much faster in a large
repository, where collecting all candidates requires walking all the way
down to the root (or at least to the oldest tag). This goes all the way
back to 8713ab3079 (Improve git-describe performance by reducing
revision listing., 2007-01-13).

However, we don't stop immediately when we have enough candidates. We
keep traversing and only bail when we find one more candidate that we're
ignoring. Usually this is not too expensive, if the tags are sprinkled
evenly throughout history. But if you are unlucky, you might hit the max
candidate quickly, and then have a huge swath of history before finding
the next one.

Our p6100 test has exactly this unlucky case: with a max of "1", we find
a recent tag quickly and then have to go all the way to the root to find
the old tag that will be discarded.

A more interesting real-world case is:

  git describe --candidates=1 --match=v6.12-rc4 HEAD

in the linux.git repo. There we restrict the set of tags to a single
one, so there is no older candidate to find at all! But despite
--candidates=1, we keep traversing to the root only to find nothing.

So why do we keep traversing after hitting thet max? There are two
reasons I can see:

  1. In theory the extra information that there was another candidate
     could be useful, and we record it in the gave_up_on variable. But
     we only show this information with --debug.

  2. After finding the candidate, there's more processing we do in our
     loop. The most important of this is propagating the "within" flags
     to our parent commits, and putting them in the commit_list we'll
     use for finish_depth_computation().

     That function continues the traversal until we've counted all
     commits reachable from the starting point but not reachable from
     our best candidate tag (so essentially counting "$tag..$start", but
     avoiding re-walking over the bits we've seen).  If we break
     immediately without putting those commits into the list, our depth
     computation will be wrong (in the worst case we'll count all the
     way down to the root, not realizing those commits are included in
     our tag).

But we don't need to find a new candidate for (2). As soon as we finish
the loop iteration where we hit max_candidates, we can then quit on the
next iteration. This should produce the same output as the original code
(which could, after all, find a candidate on the very next commit
anyway) but ends the traversal with less pointless digging.

We still have to set "gave_up_on"; we've popped it off the list and it
has to go back. An alternative would be to re-order the loop so that it
never gets popped, but it's perhaps still useful to show in the --debug
output, so we need to know it anyway. We do have to adjust the --debug
output since it's now just a commit where we stopped traversing, and not
the max+1th candidate.

p6100 shows the speedup using linux.git:

  Test                                           HEAD^             HEAD
  ---------------------------------------------------------------------------------------
  6100.2: describe HEAD                          0.70(0.63+0.06)   0.71(0.66+0.04) +1.4%
  6100.3: describe HEAD with one max candidate   0.70(0.64+0.05)   0.01(0.00+0.00) -98.6%
  6100.4: describe HEAD with one tag             0.70(0.67+0.03)   0.70(0.63+0.06) +0.0%

Reported-by: Josh Poimboeuf <jpoimboe@kernel.org>
Helped-by: Rasmus Villemoes <ravi@prevas.dk>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 builtin/describe.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/builtin/describe.c b/builtin/describe.c
index 7330a77b38..69f2d942be 100644
--- a/builtin/describe.c
+++ b/builtin/describe.c
@@ -366,6 +366,12 @@ static void describe_commit(struct object_id *oid, struct strbuf *dst)
 		struct commit_name **slot;
 
 		seen_commits++;
+
+		if (match_cnt == max_candidates) {
+			gave_up_on = c;
+			break;
+		}
+
 		slot = commit_names_peek(&commit_names, c);
 		n = slot ? *slot : NULL;
 		if (n) {
@@ -381,10 +387,6 @@ static void describe_commit(struct object_id *oid, struct strbuf *dst)
 				if (n->prio == 2)
 					annotated_cnt++;
 			}
-			else {
-				gave_up_on = c;
-				break;
-			}
 		}
 		for (cur_match = 0; cur_match < match_cnt; cur_match++) {
 			struct possible_tag *t = &all_matches[cur_match];
@@ -470,9 +472,8 @@ static void describe_commit(struct object_id *oid, struct strbuf *dst)
 		fprintf(stderr, _("traversed %lu commits\n"), seen_commits);
 		if (gave_up_on) {
 			fprintf(stderr,
-				_("more than %i tags found; listed %i most recent\n"
-				"gave up search at %s\n"),
-				max_candidates, max_candidates,
+				_("found %i tags; gave up search at %s\n"),
+				max_candidates,
 				oid_to_hex(&gave_up_on->object.oid));
 		}
 	}

From b8150bfee1f8d609e3d1f38eaa98cfb64a1c2ca5 Mon Sep 17 00:00:00 2001
From: Jeff King <peff@peff.net>
Date: Wed, 6 Nov 2024 16:17:17 -0500
Subject: [PATCH 4/5] describe: stop traversing when we run out of names

When trying to describe a commit, we'll traverse from the commit,
collecting candidate tags that point to its ancestors. But once we've
seen all of the tags in the repo, there's no point in traversing
further. There's nothing left to find!

For a default "git describe", this isn't usually a big problem. In a
large repo you'll probably have multiple tags, so we'll eventually find
10 candidates (the default for max_candidates) and stop there. And in a
small repo, it's quick to traverse to the root.

But you can imagine a large repo with few tags. Or, as we saw in a real
world case, explicitly limiting the set of matches like this (on
linux.git):

  git describe --match=v6.12-rc4 HEAD

which goes all the way to the root before realizing that no, there are
no other tags under consideration besides the one we fed via --match.
If we add in "--candidates=1" there, it's much faster (at least as of
the previous commit).

But we should be able to speed this up without the user asking for it.
After expanding all matching tags, we know the total number of names. We
could just stop the traversal there, but as hinted at above we already
have a mechanism for doing that: the max_candidate limit. So we can just
reduce that limit to match the number of possible candidates.

Our p6100 test shows this off:

  Test                                           HEAD^             HEAD
  ---------------------------------------------------------------------------------------
  6100.2: describe HEAD                          0.71(0.65+0.06)   0.72(0.68+0.04) +1.4%
  6100.3: describe HEAD with one max candidate   0.01(0.00+0.00)   0.01(0.00+0.00) +0.0%
  6100.4: describe HEAD with one tag             0.72(0.66+0.05)   0.01(0.00+0.00) -98.6%

Now we are fast automatically, just as if --candidates=1 were supplied
by the user.

Reported-by: Josh Poimboeuf <jpoimboe@kernel.org>
Helped-by: Rasmus Villemoes <ravi@prevas.dk>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 builtin/describe.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/builtin/describe.c b/builtin/describe.c
index 69f2d942be..8ec3be87df 100644
--- a/builtin/describe.c
+++ b/builtin/describe.c
@@ -667,6 +667,8 @@ int cmd_describe(int argc,
 			     NULL);
 	if (!hashmap_get_size(&names) && !always)
 		die(_("No names found, cannot describe anything."));
+	if (hashmap_get_size(&names) < max_candidates)
+		max_candidates = hashmap_get_size(&names);
 
 	if (argc == 0) {
 		if (broken) {

From db162862b36f8c2de8e7020801870abf33b61a20 Mon Sep 17 00:00:00 2001
From: Jeff King <peff@peff.net>
Date: Fri, 6 Dec 2024 00:42:18 -0500
Subject: [PATCH 5/5] describe: split "found all tags" and max_candidates logic

Commit a30154187a (describe: stop traversing when we run out of names,
2024-10-31) taught git-describe to automatically reduce the
max_candidates setting to match the total number of possible names. This
lets us break out of the traversal rather than fruitlessly searching for
more candidates when there are no more to be found.

However, setting max_candidates to 0 (e.g., if the repo has no tags)
overlaps with the --exact-match option, which explicitly uses the same
value. And this causes a regression with --always, which is ignored in
exact-match mode. We used to get this in a repo with no tags:

  $ git describe --always HEAD
  b2f0a7f

and now we get:

  $ git describe --always HEAD
  fatal: no tag exactly matches 'b2f0a7f47f5f2aebe1e7fceff19a57de20a78c06'

The reason is that we bail early in describe_commit() when
max_candidates is set to 0. This logic goes all the way back to
2c33f75754 (Teach git-describe --exact-match to avoid expensive tag
searches, 2008-02-24).

We should obviously fix this regression, but there are two paths,
depending on what you think:

  $ git describe --always --exact-match

and

  $ git describe --always --candidates=0

should do. Since the "--always" option was added, it has always been
ignored in --exact-match (or --candidates=0) mode. I.e., we treat
--exact-match as a true exact match of a tag, and never fall back to
using --always, even if it was requested.

If we think that's a bug (or at least a misfeature), then the right
solution is to fix it by removing the early bail-out from 2c33f75754,
letting the noop algorithm run and then hitting the --always fallback
output. And then our regression naturally goes away, because it follows
the same path.

If we think that the current "--exact-match --always" behavior is the
right thing, then we have to differentiate the case where we
automatically reduced max_candidates to 0 from the case where the user
asked for it specifically. That's possible to do with a flag, but we can
also just reimplement the logic from a30154187a to explicitly break out
of the traversal when we run out of candidates (rather than relying on
the existing max_candidates check).

My gut feeling is along the lines of option 1 (it's a bug, and people
would be happy for "--exact-match --always" to give the fallback rather
than ignoring "--always"). But the documentation can be interpreted in
the other direction, and we've certainly lived with the existing
behavior for many years. So it's possible that changing it now is the
wrong thing.

So this patch fixes the regression by taking the second option,
retaining the "--exact-match" behavior as-is. There are two new tests.
The first shows that the regression is fixed (we don't even need a new
repo without tags; a restrictive --match is enough to create the
situation that there are no candidate names).

The second test confirms that the "--exact-match --always" behavior
remains unchanged and continues to die when there is no tag pointing at
the specified commit. It's possible we may reconsider this in the
future, but this shows that the approach described above is implemented
faithfully.

We can also run the perf tests in p6100 to see that we've retained the
speedup that a30154187a was going for:

  Test                                           HEAD^             HEAD
  --------------------------------------------------------------------------------------
  6100.2: describe HEAD                          0.72(0.64+0.07)   0.72(0.66+0.06) +0.0%
  6100.3: describe HEAD with one max candidate   0.01(0.00+0.00)   0.01(0.00+0.00) +0.0%
  6100.4: describe HEAD with one tag             0.01(0.01+0.00)   0.01(0.01+0.00) +0.0%

Reported-by: Josh Steadmon <steadmon@google.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 builtin/describe.c  |  5 ++---
 t/t6120-describe.sh | 10 ++++++++++
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/builtin/describe.c b/builtin/describe.c
index 8ec3be87df..a6ef8af32a 100644
--- a/builtin/describe.c
+++ b/builtin/describe.c
@@ -367,7 +367,8 @@ static void describe_commit(struct object_id *oid, struct strbuf *dst)
 
 		seen_commits++;
 
-		if (match_cnt == max_candidates) {
+		if (match_cnt == max_candidates ||
+		    match_cnt == hashmap_get_size(&names)) {
 			gave_up_on = c;
 			break;
 		}
@@ -667,8 +668,6 @@ int cmd_describe(int argc,
 			     NULL);
 	if (!hashmap_get_size(&names) && !always)
 		die(_("No names found, cannot describe anything."));
-	if (hashmap_get_size(&names) < max_candidates)
-		max_candidates = hashmap_get_size(&names);
 
 	if (argc == 0) {
 		if (broken) {
diff --git a/t/t6120-describe.sh b/t/t6120-describe.sh
index 69689d2f36..8e8bfed05b 100755
--- a/t/t6120-describe.sh
+++ b/t/t6120-describe.sh
@@ -716,4 +716,14 @@ test_expect_success 'describe --broken --dirty with a file with changed stat' '
 	)
 '
 
+test_expect_success '--always with no refs falls back to commit hash' '
+	git rev-parse HEAD >expect &&
+	git describe --no-abbrev --always --match=no-such-tag >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success '--exact-match does not show --always fallback' '
+	test_must_fail git describe --exact-match --always
+'
+
 test_done