The reftable/ library code has been made -Wsign-compare clean.
* ps/reftable-sign-compare:
reftable: address trivial -Wsign-compare warnings
reftable/blocksource: adjust `read_block()` to return `ssize_t`
reftable/blocksource: adjust type of the block length
reftable/block: adjust type of the restart length
reftable/block: adapt header and footer size to return a `size_t`
reftable/basics: adjust `hash_size()` to return `uint32_t`
reftable/basics: adjust `common_prefix_size()` to return `size_t`
reftable/record: handle overflows when decoding varints
reftable/record: drop unused `print` function pointer
meson: stop disabling -Wsign-compare
The "cache" credential back-end did not handle authtype correctly,
which has been corrected.
* mh/credential-cache-authtype-request-fix:
credential-cache: respect authtype capability
It was possible for "git unpack-objects" and "git index-pack" to
make an unaligned access, which has been corrected.
* jk/pack-header-parse-alignment-fix:
index-pack, unpack-objects: use skip_prefix to avoid magic number
index-pack, unpack-objects: use get_be32() for reading pack header
parse_pack_header_option(): avoid unaligned memory writes
packfile: factor out --pack_header argument parsing
bswap.h: squelch potential sparse -Wcast-truncate warnings
The meson-driven build is now aware of "git-subtree" housed in
contrib/subtree hierarchy.
* ps/build-meson-subtree:
meson: wire up the git-subtree(1) command
meson: introduce build option for contrib
contrib/subtree: fix building docs
The code in connect.c has been updated to work around complaints
from -Wsign-compare.
* mh/connect-sign-compare:
connect: address -Wsign-compare warnings
Move a few more unit tests to the clar test framework.
* sk/unit-tests:
t/unit-tests: convert reftable tree test to use clar test framework
t/unit-tests: adapt priority queue test to use clar test framework
t/unit-tests: convert mem-pool test to use clar test framework
t/unit-tests: handle dashes in test suite filenames
The help text from "git $cmd -h" appear on the standard output for
some $cmd and the standard error for others. The built-in commands
have been fixed to show them on the standard output consistently.
* jc/show-usage-help:
builtin: send usage() help text to standard output
oddballs: send usage() help text to standard output
builtins: send usage_with_options() help text to standard output
usage: add show_usage_if_asked()
parse-options: add show_usage_with_options_if_asked()
t0012: optionally check that "-h" output goes to stdout
When the --name-hash-version option is used in 'git pack-objects', it
can change from the initial assignment to when it is used based on
interactions with other arguments. Specifically, when writing or reading
bitmaps, we must force version 1 for now. This could change in the
future when the bitmap format can store a name hash version value,
indicating which was used during the writing of the packfile.
Protect the 'git pack-objects' process from getting confused by failing
with a BUG() statement if the value of the name hash version changes
between calls to pack_name_hash_fn().
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Add a new test-tool helper, name-hash, to output the value of the
name-hash algorithms for the input list of strings, one per line.
Since the name-hash values can be stored in the .bitmap files, it is
important that these hash functions do not change across Git versions.
Add a simple test to t5310-pack-bitmaps.sh to provide some testing of
the current values. Due to how these functions are implemented, it would
be difficult to change them without disturbing these values. The paths
used for this test are carefully selected to demonstrate some of the
behavior differences of the two current name hash versions, including
which conditions will cause them to collide.
Create a performance test that uses test_size to demonstrate how
collisions occur for these hash algorithms. This test helps inform
someone as to the behavior of the name-hash algorithms for their repo
based on the paths at HEAD.
My copy of the Git repository shows modest statistics around the
collisions of the default name-hash algorithm:
Test this tree
--------------------------------------------------
5314.1: paths at head 4.5K
5314.2: distinct hash value: v1 4.1K
5314.3: maximum multiplicity: v1 13
5314.4: distinct hash value: v2 4.2K
5314.5: maximum multiplicity: v2 9
Here, the maximum collision multiplicity is 13, but around 10% of paths
have a collision with another path.
In a more interesting example, the microsoft/fluentui [1] repo had these
statistics at time of committing:
Test this tree
--------------------------------------------------
5314.1: paths at head 19.5K
5314.2: distinct hash value: v1 8.2K
5314.3: maximum multiplicity: v1 279
5314.4: distinct hash value: v2 17.8K
5314.5: maximum multiplicity: v2 44
[1] https://github.com/microsoft/fluentui
That demonstrates that of the nearly twenty thousand path names, they
are assigned around eight thousand distinct values. 279 paths are
assigned to a single value, leading the packing algorithm to sort
objects from those paths together, by size.
With the v2 name hash function, the maximum multiplicity lowers to 44,
leaving some room for further improvement.
In a more extreme example, an internal monorepo had a much worse
collision rate:
Test this tree
--------------------------------------------------
5314.1: paths at head 227.3K
5314.2: distinct hash value: v1 72.3K
5314.3: maximum multiplicity: v1 14.4K
5314.4: distinct hash value: v2 166.5K
5314.5: maximum multiplicity: v2 138
Here, we can see that the v2 name hash function provides somem
improvements, but there are still a number of collisions that could lead
to repacking problems at this scale.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
As custom options are added to 'git pack-objects' and 'git repack' to
adjust how compression is done, use this new performance test script to
demonstrate their effectiveness in performance and size.
The recently-added --name-hash-version option allows for testing
different name hash functions. Version 2 intends to preserve some of the
locality of version 1 while more often breaking collisions due to long
filenames.
Distinguishing objects by more of the path is critical when there are
many name hash collisions and several versions of the same path in the
full history, giving a significant boost to the full repack case. The
locality of the hash function is critical to compressing something like
a shallow clone or a thin pack representing a push of a single commit.
This can be seen by running pt5313 on the open source fluentui
repository [1]. Most commits will have this kind of output for the thin
and big pack cases, though certain commits (such as [2]) will have
problematic thin pack size for other reasons.
[1] https://github.com/microsoft/fluentui
[2] a637a06df05360ce5ff21420803f64608226a875
Checked out at the parent of [2], I see the following statistics:
Test HEAD
---------------------------------------------------------------
5313.2: thin pack with version 1 0.37(0.44+0.02)
5313.3: thin pack size with version 1 1.2M
5313.4: big pack with version 1 2.04(7.77+0.23)
5313.5: big pack size with version 1 20.4M
5313.6: shallow fetch pack with version 1 1.41(2.94+0.11)
5313.7: shallow pack size with version 1 34.4M
5313.8: repack with version 1 95.70(676.41+2.87)
5313.9: repack size with version 1 439.3M
5313.10: thin pack with version 2 0.12(0.12+0.06)
5313.11: thin pack size with version 2 22.0K
5313.12: big pack with version 2 2.80(5.43+0.34)
5313.13: big pack size with version 2 25.9M
5313.14: shallow fetch pack with version 2 1.77(2.80+0.19)
5313.15: shallow pack size with version 2 33.7M
5313.16: repack with version 2 33.68(139.52+2.58)
5313.17: repack size with version 2 160.5M
To make comparisons easier, I will reformat this output into a different
table style:
| Test | V1 Time | V2 Time | V1 Size | V2 Size |
|--------------|---------|---------|---------|---------|
| Thin Pack | 0.37 s | 0.12 s | 1.2 M | 22.0 K |
| Big Pack | 2.04 s | 2.80 s | 20.4 M | 25.9 M |
| Shallow Pack | 1.41 s | 1.77 s | 34.4 M | 33.7 M |
| Repack | 95.70 s | 33.68 s | 439.3 M | 160.5 M |
The v2 hash function successfully differentiates the CHANGELOG.md files
from each other, which leads to significant improvements in the thin
pack (simulating a push of this commit) and the full repack. There is
some bloat in the "big pack" scenario and essentially the same results
for the shallow pack.
In the case of the Git repository, these numbers show some of the issues
with this approach:
| Test | V1 Time | V2 Time | V1 Size | V2 Size |
|--------------|---------|---------|---------|---------|
| Thin Pack | 0.02 s | 0.02 s | 1.1 K | 1.1 K |
| Big Pack | 1.69 s | 1.95 s | 13.5 M | 14.5 M |
| Shallow Pack | 1.26 s | 1.29 s | 12.0 M | 12.2 M |
| Repack | 29.51 s | 29.01 s | 237.7 M | 238.2 M |
Here, the attempts to remove conflicts in the v2 function seem to cause
slight bloat to these sizes. This shows that the Git repository benefits
a lot from cross-path delta pairs.
The results are similar with the nodejs/node repo:
| Test | V1 Time | V2 Time | V1 Size | V2 Size |
|--------------|---------|---------|---------|---------|
| Thin Pack | 0.02 s | 0.02 s | 1.6 K | 1.6 K |
| Big Pack | 4.61 s | 3.26 s | 56.0 M | 52.8 M |
| Shallow Pack | 7.82 s | 7.51 s | 104.6 M | 107.0 M |
| Repack | 88.90 s | 73.75 s | 740.1 M | 764.5 M |
Here, the v2 name-hash causes some size bloat more often than it reduces
the size, but it also universally improves performance time, which is an
interesting reversal. This must mean that it is helping to short-circuit
some delta computations even if it is not finding the most efficient
ones. The performance improvement cannot be explained only due to the
I/O cost of writing the resulting packfile.
The Linux kernel repository was the initial target of the default name
hash value, and its naming conventions are practically build to take the
most advantage of the default name hash values:
| Test | V1 Time | V2 Time | V1 Size | V2 Size |
|--------------|----------|----------|---------|---------|
| Thin Pack | 0.17 s | 0.07 s | 4.6 K | 4.6 K |
| Big Pack | 17.88 s | 12.35 s | 201.1 M | 159.1 M |
| Shallow Pack | 11.05 s | 22.94 s | 269.2 M | 273.8 M |
| Repack | 727.39 s | 566.95 s | 2.5 G | 2.5 G |
Here, the thin and big packs gain some performance boosts in time, with
a modest gain in the size of the big pack. The shallow pack, however, is
more expensive to compute, likely because similarly-named files across
different directories are farther apart in the name hash ordering in v2.
The repack also gains benefits in computation time but no meaningful
change to the full size.
Finally, an internal Javascript repo of moderate size shows significant
gains when repacking with --name-hash-version=2 due to it having many name
hash collisions. However, it's worth noting that only the full repack
case has significant differences from the v1 name hash:
| Test | V1 Time | V2 Time | V1 Size | V2 Size |
|-----------|-----------|----------|---------|---------|
| Thin Pack | 8.28 s | 7.28 s | 16.8 K | 16.8 K |
| Big Pack | 12.81 s | 11.66 s | 29.1 M | 29.1 M |
| Shallow | 4.86 s | 4.06 s | 42.5 M | 44.1 M |
| Repack | 3126.50 s | 496.33 s | 6.2 G | 855.6 M |
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Add a new environment variable to opt-in to different values of the
--name-hash-version=<n> option in 'git pack-objects'. This allows for
extra testing of the feature without repeating all of the test
scenarios. Unlike many GIT_TEST_* variables, we are choosing to not add
this to the linux-TEST-vars CI build as that test run is already
overloaded. The behavior exposed by this test variable is of low risk
and should be sufficient to allow manual testing when an issue arises.
But this option isn't free. There are a few tests that change behavior
with the variable enabled.
First, there are a few tests that are very sensitive to certain delta
bases being picked. These are both involving the generation of thin
bundles and then counting their objects via 'git index-pack --fix-thin'
which pulls the delta base into the new packfile. For these tests,
disable the option as a decent long-term option.
Second, there are some tests that compare the exact output of a 'git
pack-objects' process when using bitmaps. The warning that ignores the
--name-hash-version=2 and forces version 1 causes these tests to fail.
Disable the environment variable to get around this issue.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The new '--name-hash-version' option for 'git repack' is a simple
pass-through to the underlying 'git pack-objects' subcommand. However,
this subcommand may have other options and a temporary filename as part
of the subcommand execution that may not be predictable or could change
over time.
The existing test_subcommand method requires an exact list of arguments
for the subcommand. This is too rigid for our needs here, so create a
new method, test_subcommand_flex. Use it to check that the
--name-hash-version option is passing through.
Since we are modifying the 'git repack' command, let's bring its usage
in line with the Documentation's synopsis. This removes it from the
allow list in t0450 so it will remain in sync in the future.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The previous change introduced a new pack_name_hash_v2() function that
intends to satisfy much of the hash locality features of the existing
pack_name_hash() function while also distinguishing paths with similar
final components of their paths.
This change adds a new --name-hash-version option for 'git pack-objects'
to allow users to select their preferred function version. This use of
an integer version allows for future expansion and a direct way to later
store a name hash version in the .bitmap format.
For now, let's consider how effective this mechanism is when repacking a
repository with different name hash versions. Specifically, we will
execute 'git pack-objects' the same way a 'git repack -adf' process
would, except we include --name-hash-version=<n> for testing.
On the Git repository, we do not expect much difference. All path names
are short. This is backed by our results:
| Stage | Pack Size | Repack Time |
|-----------------------|-----------|-------------|
| After clone | 260 MB | N/A |
| --name-hash-version=1 | 127 MB | 129s |
| --name-hash-version=2 | 127 MB | 112s |
This example demonstrates how there is some natural overhead coming from
the cloned copy because the server is hosting many forks and has not
optimized for exactly this set of reachable objects. But the full repack
has similar characteristics for both versions.
Let's consider some repositories that are hitting too many collisions
with version 1. First, let's explore the kinds of paths that are
commonly causing these collisions:
* "/CHANGELOG.json" is 15 characters, and is created by the beachball
[1] tool. Only the final character of the parent directory can
differentiate different versions of this file, but also only the two
most-significant digits. If that character is a letter, then this is
always a collision. Similar issues occur with the similar
"/CHANGELOG.md" path, though there is more opportunity for
differences In the parent directory.
* Localization files frequently have common filenames but
differentiates via parent directories. In C#, the name
"/strings.resx.lcl" is used for these localization files and they
will all collide in name-hash.
[1] https://github.com/microsoft/beachball
I've come across many other examples where some internal tool uses a
common name across multiple directories and is causing Git to repack
poorly due to name-hash collisions.
One open-source example is the fluentui [2] repo, which uses beachball
to generate CHANGELOG.json and CHANGELOG.md files, and these files have
very poor delta characteristics when comparing against versions across
parent directories.
| Stage | Pack Size | Repack Time |
|-----------------------|-----------|-------------|
| After clone | 694 MB | N/A |
| --name-hash-version=1 | 438 MB | 728s |
| --name-hash-version=2 | 168 MB | 142s |
[2] https://github.com/microsoft/fluentui
In this example, we see significant gains in the compressed packfile
size as well as the time taken to compute the packfile.
Using a collection of repositories that use the beachball tool, I was
able to make similar comparisions with dramatic results. While the
fluentui repo is public, the others are private so cannot be shared for
reproduction. The results are so significant that I find it important to
share here:
| Repo | --name-hash-version=1 | --name-hash-version=2 |
|----------|-----------------------|-----------------------|
| fluentui | 440 MB | 161 MB |
| Repo B | 6,248 MB | 856 MB |
| Repo C | 37,278 MB | 6,755 MB |
| Repo D | 131,204 MB | 7,463 MB |
Future changes could include making --name-hash-version implied by a config
value or even implied by default during a full repack.
It is important to point out that the name hash value is stored in the
.bitmap file format, so we must force --name-hash-version=1 when bitmaps
are being read or written. Later, the bitmap format could be updated to
be aware of the name hash version so deltas can be quickly computed
across the bitmapped/not-bitmapped boundary. To promote the safety of
this parameter, the validate_name_hash_version() method will die() if
the given name-hash version is incorrect and will disable newer versions
if not yet compatible with other features, such as --write-bitmap-index.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
As we will explore in later changes, the default name-hash function used
in 'git pack-objects' has a tendency to cause collisions and cause poor
delta selection. This change creates an alternative that avoids some
collisions while preserving some amount of hash locality.
The pack_name_hash() method has not been materially changed since it was
introduced in ce0bd64 (pack-objects: improve path grouping
heuristics., 2006-06-05). The intention here is to group objects by path
name, but also attempt to group similar file types together by making
the most-significant digits of the hash be focused on the final
characters.
Here's the crux of the implementation:
/*
* This effectively just creates a sortable number from the
* last sixteen non-whitespace characters. Last characters
* count "most", so things that end in ".c" sort together.
*/
while ((c = *name++) != 0) {
if (isspace(c))
continue;
hash = (hash >> 2) + (c << 24);
}
As the comment mentions, this only cares about the last sixteen
non-whitespace characters. This cause some filenames to collide more than
others. This collision is somewhat by design in order to promote hash
locality for files that have similar types (.c, .h, .json) or could be the
same file across a directory rename (a/foo.txt to b/foo.txt). This leads to
decent cross-path deltas in cases like shallow clones or packing a
repository with very few historical versions of files that share common data
with other similarly-named files.
However, when the name-hash instead leads to a large number of name-hash
collisions for otherwise unrelated files, this can lead to confusing the
delta calculation to prefer cross-path deltas over previous versions of the
same file.
The new pack_name_hash_v2() function attempts to fix this issue by
taking more of the directory path into account through its hash
function. Its naming implies that we will later wire up details for
choosing a name-hash function by version.
The first change is to be more careful about paths using non-ASCII
characters. With these characters in mind, reverse the bits in the byte
as the least-significant bits have the highest entropy and we want to
maximize their influence. This is done with some bit manipulation that
swaps the two halves, then the quarters within those halves, and then
the bits within those quarters.
The second change is to perform hash composition operations at every
level of the path. This is done by storing a 'base' hash value that
contains the hash of the parent directory. When reaching a directory
boundary, we XOR the current level's name-hash value with a downshift of
the previous level's hash. This perturbation intends to create low-bit
distinctions for paths with the same final 16 bytes but distinct parent
directory structures.
The collision rate and effectiveness of this hash function will be
explored in later changes as the function is integrated with 'git
pack-objects' and 'git repack'.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When migrating reflogs between reference backends, maintaining the
original order of the reflog entries is crucial. To achieve this, an
`index` field is stored within the `ref_update` struct that encodes the
relative order of reflog entries. This field is used by the reftable
backend as update index for the respective reflog entries to maintain
that ordering.
These update indices must be respected when writing table headers, which
encode the minimum and maximum update index of contained records in the
header and footer. This logic was added in commit bc67b4ab5f (reftable:
write correct max_update_index to header, 2025-01-15), which started to
use `reftable_writer_set_limits()` to propagate the mininum and maximum
update index of all records contained in a ref transaction.
However, we only set the maximum update index for the first transaction
argument, even though there can be multiple such arguments. This is the
case when we write to multiple stacks in a single transaction, e.g. when
updating references in two different worktrees at once. Consequently,
the update index for all but the first argument remain uninitialized,
which may cause undefined behaviour.
Fix this by moving the assignment of the maximum update index in
`reftable_be_transaction_finish()` inside the loop, which ensures that
all elements of the array are correctly initialized.
Furthermore, initialize the `max_index` field to 0 when queueing a new
transaction argument. This is not strictly necessary, as all elements of
`write_transaction_table_arg.max_index` are now assigned correctly.
However, this initialization is added for consistency and to safeguard
against potential future changes that might inadvertently introduce
uninitialized memory access.
Reported-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In b1b713f722 (fetch set_head: handle mirrored bare repositories,
2024-11-22) it was implicitly assumed that all remotes will be mirrors
in a bare repository, thus fetching a non-mirrored remote could lead to
HEAD pointing to a non-existent reference. Make sure we only overwrite
HEAD if we are in a bare repository and fetching from a mirror.
Otherwise, proceed as normally, and create
refs/remotes/<nonmirrorremote>/HEAD instead.
Reported-by: Christian Hesse <list@eworm.de>
Signed-off-by: Bence Ferdinandy <bence@ferdinandy.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
As a preparatory step to use even more properties from the remote
struct, refactor set_head to take the entire struct as a parameter,
instead of the necessary bits. This also allows consolidating the use of
gtransport->remote in set_head, making the access of the remote's
properties consistent in the function.
Signed-off-by: Bence Ferdinandy <bence@ferdinandy.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Already when introduced in c7a8a16239 (Add bundle transport,
2007-09-10), the `bundle` transport had a bug where it would open a file
descriptor to the bundle file and then close it _twice_: First, the file
descriptor (`data->fd`) is passed to `unbundle()`, which would use it as
the `stdin` of the `index-pack` process, which as a consequence would
close it via `start_command()`. However, `data->fd` would still hold the
numerical value of the file descriptor, and `close_bundle()` would see
that and happily close it again.
This seems not to have caused too many problems in almost two decades,
but I encountered a situation today where it _does_ cause problems: In
i686 variants of Git for Windows, it seems that file descriptors are
reused quickly after they have been closed.
In the particular scenario I faced, `git fetch <bundle> <ref>` gets the
same file descriptor value when opening the bundle file and importing
its embedded packfile (which implicitly closes the file descriptor) and
then when opening a pack file in `fetch_and_consume_refs()` while
looking up an object's header.
Later on, after the bundle has been imported (and the `close_bundle()`
function erroneously closes the file descriptor that has _already_ been
closed when using it as `stdin` for `git index-pack`), the same file
descriptor value has now been reused via `use_pack()`. Now, when either
the recursive fetch (which defaults to "on", unfortunately) or a
commit-graph update needs to `mmap()` the packfile, it fails due to a
now-invalid file descriptor that _should_ point to the pack file but
doesn't anymore.
To fix that, let's invalidate `data->fd` after calling `unbundle()`.
That way, `close_bundle()` does not close a file descriptor that may
have been reused for something different. While at it, document that
`unbundle()` closes the file descriptor, and ensure that it also does
that when failing to verify the bundle.
Luckily, this bug does not affect the bundle URI feature, it only
affects the `git fetch <bundle>` code path.
Note that this patch does not _completely_ clarifies who is responsible
to close that file descriptor, as `run_command()` may fail _without_
closing `cmd->in`. Addressing this issue thoroughly, however, would
require a rather thorough re-design of the `start_command()` and
`finish_command()` functionality to make it a lot less murky who is
responsible for what file descriptors.
At least this here patch is relatively easy to reason about, and
addresses a hard failure (`fatal: mmap: could not determine filesize`)
at the expense of leaking a file descriptor under very rare
circumstances in which `git fetch` would error out anyway.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This commit extends the functionality of `git gc`
by adding a new option, `--expire-to=<dir>`. Previously,
this feature was implemented in 91badeba32 (builtin/repack.c:
implement `--expire-to` for storing pruned objects, 2022-10-24),
which allowing users to specify a directory where unreachable
and expired cruft packs are stored during garbage collection.
However, users had to run `git repack --cruft --expire-to=<dir>`
followed by `git prune` to achieve similar results within `git gc`.
By introducing `--expire-to=<dir>` directly into `git gc`,
we simplify the process for users who wish to manage their
repository's cleanup more efficiently. This change involves
passing the `--expire-to=<dir>` parameter through to `git repack`,
making it easier for users to set up a backup location for cruft
packs that will be pruned.
Due to the original `git gc --prune=now` deleting all unreachable
objects by passing the `-a` parameter to git repack. With the
addition of the `--cruft` and `--expire-to` options, it is necessary
to modify this default behavior: instead of deleting these
unreachable objects, they should be merged into a cruft pack and
collected in a specified directory. Therefore, we do not pass `-a`
to the repack command but instead pass `--cruft`, `--expire-to`,
and `--cruft-expiration=now` to repack.
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The trailer.* configuration variables are currently only described in
git-interpret-trailers(1) but affect git-commit and git-tag as well.
Move that section into its own config/trailer.txt file and also include
it in git-config(1).
Signed-off-by: Julian Prein <julian@druckdev.xyz>
Acked-by: Eric Sesterhenn <eric.sesterhenn@x41-dsec.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Back when Git was in its infancy, remotes were configured via separate
files in "branches/" (back in 2005). This mechanism was replaced later
that year with the "remotes/" directory. Both mechanisms have eventually
been replaced by config-based remotes, and it is very unlikely that
anybody still uses these directories to configure their remotes.
Both of these directories have been marked as deprecated, one in 2005
and the other one in 2011. Follow through with the deprecation and
finally announce the removal of these features in Git 3.0.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
[jc: with a small tweak to the help message]
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Document that it is insecure to use Personal Access Tokens, which
some hosting providers take as username/password, embedded in URLs.
* mh/doc-credential-helpers-with-pat:
docs: discuss caching personal access tokens
docs: list popular credential helpers
The "instaweb" bound only to local IP address without "--local" and
to all addresses with "--local", which was the other way around, when
using Python's http.server class, which has been corrected.
* ak/instaweb-python-port-binding-fix:
instaweb: fix ip binding for the python http.server
The meson build procedure for Documentation/technical/ hierarchy was
missing necessary dependencies, which has been corrected.
* sj/meson-doc-technical-dependency-fix:
meson: fix missing deps for technical articles
The meson build procedure looked for the 'version-def.h' file in a
wrong directory, which has been corrected.
* tc/meson-use-our-version-def-h:
meson: ensure correct version-def.h is used
Extended SHA-1 expression parser did not work well when a branch
with an unusual name (e.g. "foo{bar") is involved.
* en/object-name-with-funny-refname-fix:
object-name: be more strict in parsing describe-like output
object-name: fix resolution of object names containing curly braces
* ds/path-walk-1:
path-walk: drop redundant parse_tree() call
path-walk: reorder object visits
path-walk: mark trees and blobs as UNINTERESTING
path-walk: visit tags and cached objects
path-walk: allow consumer to specify object types
t6601: add helper for testing path-walk API
test-lib-functions: add test_cmp_sorted
path-walk: introduce an object walk by path
Now that all callers have been converted from:
the_hash_algo->unsafe_init_fn();
to
unsafe_hash_algo(the_hash_algo)->init_fn();
and similar, we can remove the scaffolding for the unsafe_ function
variants and force callers to use the new unsafe_hash_algo() mechanic
instead.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In 106140a99f (builtin/fast-import: fix segfault with unsafe SHA1
backend, 2024-12-30) and 9218c0bfe1 (bulk-checkin: fix segfault with
unsafe SHA1 backend, 2024-12-30), we observed the effects of failing to
initialize a hashfile_checkpoint with the same hash function
implementation as is used by the hashfile it is used to checkpoint.
While both 106140a99f and 9218c0bfe1 work around the immediate crash,
changing the hash function implementation within the hashfile API to,
for example, the non-unsafe variant would re-introduce the crash. This
is a result of the tight coupling between initializing hashfiles and
hashfile_checkpoints.
Introduce and use a new function which ensures that both parts of a
hashfile and hashfile_checkpoint pair use the same hash function
implementation to avoid such crashes.
A few things worth noting:
- In the change to builtin/fast-import.c::stream_blob(), we can see
that by removing the explicit reference to
'the_hash_algo->unsafe_init_fn()', we are hardened against the
hashfile API changing away from the_hash_algo (or its unsafe
variant) in the future.
- The bulk-checkin code no longer needs to explicitly zero-initialize
the hashfile_checkpoint, since it is now done as a result of calling
'hashfile_checkpoint_init()'.
- Also in the bulk-checkin code, we add an additional call to
prepare_to_stream() outside of the main loop in order to initialize
'state->f' so we know which hash function implementation to use when
calling 'hashfile_checkpoint_init()'.
This is OK, since subsequent 'prepare_to_stream()' calls are noops.
However, we only need to call 'prepare_to_stream()' when we have the
HASH_WRITE_OBJECT bit set in our flags. Without that bit, calling
'prepare_to_stream()' does not assign 'state->f', so we have nothing
to initialize.
- Other uses of the 'checkpoint' in 'deflate_blob_to_pack()' are
appropriately guarded.
Helped-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Remove a series of conditionals within the shared cmd_hash_impl() helper
that powers the 'sha1' and 'sha1-unsafe' helpers.
Instead, replace them with a single conditional that transforms the
specified hash algorithm into its unsafe variant. Then all subsequent
calls can directly use whatever function it wants to call without having
to decide between the safe and unsafe variants.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Instead of calling the unsafe_ hash function variants directly, make use
of the shared 'algop' pointer by initializing it to:
f->algop = unsafe_hash_algo(the_hash_algo);
, thus making all calls use the unsafe variants directly.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In 253ed9ecff (hash.h: scaffolding for _unsafe hashing variants,
2024-09-26), we introduced "unsafe" variants of the SHA-1 hashing
functions by introducing new functions like "unsafe_init_fn()" and so
on.
This approach has a major shortcoming that callers must remember to
consistently use one variant or the other. Failing to consistently use
(or not use) the unsafe variants can lead to crashes at best, or subtle
memory corruption issues at worst.
In the hashfile API, this isn't difficult to achieve, but verifying that
all callers consistently use the unsafe variants is somewhat of a chore
given how spread out all of the callers are. In the sha1 and sha1-unsafe
test helpers, all of the calls to various hash functions are guarded by
an "if (unsafe)" conditional, which is repetitive and cumbersome.
Address these issues by introducing a new pattern whereby one
'git_hash_algo' can return a pointer to another 'git_hash_algo' that
represents the unsafe version of itself. So instead of having something
like:
if (unsafe)
the_hash_algo->init_fn(...);
the_hash_algo->update_fn(...);
the_hash_algo->final_fn(...);
else
the_hash_algo->unsafe_init_fn(...);
the_hash_algo->unsafe_update_fn(...);
the_hash_algo->unsafe_final_fn(...);
we can instead write:
struct git_hash_algo *algop = the_hash_algo;
if (unsafe)
algop = unsafe_hash_algo(algop);
algop->init_fn(...);
algop->update_fn(...);
algop->final_fn(...);
This removes the existing shortcoming by no longer forcing the caller to
"remember" which variant of the hash functions it wants to call, only to
hold onto a 'struct git_hash_algo' pointer that is initialized once.
Similarly, while there currently is still a way to "mix" safe and unsafe
functions, this too will go away after subsequent commits remove all
direct calls to the unsafe_ variants.
Note that hash_algo_by_ptr() needs an adjustment to allow passing in the
unsafe variant of a hash function. All other query functions on the
hash_algos array will continue to return the safe variants of any
function.
Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Perform a similar transformation as in the previous commit, but focused
instead on hashfile_checksum_valid(). This function does not work with a
hashfile structure itself, and instead validates the raw contents of a
file written using the hashfile API.
We'll want to be prepared for a similar change to this function in the
future, so prepare ourselves for that by extracting 'the_hash_algo' into
its own field for use within this function.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Throughout the hashfile API, we rely on a reference to 'the_hash_algo',
and call its _unsafe function variants directly.
Prepare for a future change where we may use a different 'git_hash_algo'
pointer (instead of just relying on 'the_hash_algo' throughout) by
making the 'git_hash_algo' pointer a member of the 'hashfile' structure
itself.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
With the new "unsafe" SHA-1 build knob, it is convenient to have a
test-tool that can exercise Git's unsafe SHA-1 wrappers for testing,
similar to 't/helper/test-tool sha1'.
Implement that helper by altering the implementation of that test-tool
(in cmd_hash_impl(), which is generic and parameterized over different
hash functions) to conditionally run the unsafe variants of the chosen
hash function, and expose the new behavior via a new 'sha1-unsafe' test
helper.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When TRACE2 analytics is enabled, a configuration variable set to
"valueless true" causes a segfault.
Steps to Reproduce
GIT_TRACE2=true GIT_TRACE2_CONFIG_PARAMS=status.*
git -c status.relativePaths version
Expected Result
git version 2.46.0
Actual Result
zsh: segmentation fault GIT_TRACE2=true
Add checks to prevent the segfault and instead show that the
variable without value.
Signed-off-by: Adam Murray <ad@canva.com>
Acked-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The commit 297c09eabb (refs: allow multiple reflog entries for the
same refname, 2024-12-16) added logic to exit early in
`lock_ref_for_update()` after obtaining the required lock. This was
added as a performance optimization on a false assumption that no
further processing was required for reflog-only updates.
However the assumption was wrong. For a symref's reflog entry, the
update needs to be populated with the old_oid value, but the early
exit skipped this necessary step.
This caused a bug in Git 2.48 in the files backend where target
references of symrefs being updated would create a corrupted reflog
entry for the symref since the old_oid is not populated.
Everything the early exit skipped in the code path is necessary for
both regular and symbolic ref, so eliminate the mistaken
optimization, and also add a test to ensure that such an issue
doesn't arise in the future.
Reported-by: Nika Layzell <nika@thelayzells.com>
Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This call to parse_tree() was flagged by Coverity for ignoring the
return value. But if we look a little further up the function, we can
see that there is already a call to parse_tree_gently(), and we'll
return early if that fails. So by this point the tree will always be
parsed, and the call is redundant.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* ps/build-meson-fixes:
ci: wire up Visual Studio build with Meson
ci: raise error when Meson generates warnings
meson: fix compilation with Visual Studio
meson: make the CSPRNG backend configurable
meson: wire up fuzzers
meson: wire up generation of distribution archive
meson: wire up development environments
meson: fix dependencies for generated headers
meson: populate project version via GIT-VERSION-GEN
GIT-VERSION-GEN: allow running without input and output files
GIT-VERSION-GEN: simplify computing the dirty marker
Add a new job to GitHub Actions and GitLab CI that builds and tests
Meson-based builds with Visual Studio.
A couple notes:
- While the build job is mandatory, the test job is marked as "manual"
on GitLab so that it doesn't run by default. We already have a bunch
of Windows-based jobs, and the computational overhead that these
cause is simply out of proportion to run the test suite twice.
The same isn't true for GitHub as I could not find a way to make a
subset of jobs manually triggered.
- We disable Perl. This is because we pick up Perl from Git for
Windows, which outputs different paths ("/c/" instead of "C:\") than
what we expect in our tests.
- We don't use the Git for Windows SDK. Instead, the build only
depends on Visual Studio, Meson and Git for Windows. All the other
dependencies like curl, pcre2 and zlib get pulled in and compiled
automatically by Meson and thus do not have to be provided by the
system.
- We open-code "ci/run-test-slice.sh". This is because we only have
direct access to PowerShell, so we manually implement the logic.
There is an upstream pull request for the Meson build system [1] to
implement test slicing in Meson directly.
- We don't process test artifacts for failed CI jobs. This is done to
keep down prerequisites to a minimum.
All tests are passing.
[1]: https://github.com/mesonbuild/meson/pull/14092
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Meson prints warnings in several cases, like for example when using a
feature supported by the current version of Meson, but not yet supported
by the minimum required version as declared by the project. These
warnings will not cause the setup to fail by default, which makes it
quite easy to miss them.
Improve this by passing `--fatal-meson-warnings` to `meson setup` so
that our CI jobs will fail on warnings.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The Visual Studio compiler defaults to C89 unless explicitly asked to
use a different version of the C standard. We don't specify any C
standard at all though in our Meson build, and consequently compiling
Git fails:
...\git\git-compat-util.h(14): fatal error C1189: #error: "Required C99 support is in a test phase. Please see git-compat-util.h for more details."
Fix the issue by specifying the project's C standard. Funny enough,
specifying C99 does not work because apparently, `__STDC_VERSION__` is
not getting defined in that version at all. Instead, we have to specify
C11 as the project's C standard, which is also done in our CMake build
instructions.
We don't want to generally enforce C11 though, as our requiremets only
state that a C99 compiler is required. In fact, we don't even require
plain C99, but rather the GNU variant thereof.
Meson allows us to handle this case rather easily by specifying
"gnu99,c11", which will cause it to fall back to C11 in case GNU C99 is
unsupported. This feature has only been introduced with Meson 1.3.0
though, and we support 0.61.0 and newer. In case we use such an oldish
version though we fall back to requiring GNU99 unconditionally. This
means that Windows essentially requires Meson 1.3.0 and newer when using
Visual Studio, but I doubt that this is ever going to be a real problem.
Tested-by: M Hickford <mirth.hickford@gmail.com>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The CSPRNG backend is not configurable in Meson and isn't quite
discoverable, either. Make it configurable and add the actual backend
used to the summary.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Meson does not yet know to build our fuzzers. Introduce a new build
option "fuzzers" and wire up the fuzzers in case it is enabled. Adapt
our CI jobs so that they build the fuzzers by default.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Meson knows to generate distribution archives via `meson dist`. In
addition to generating the archive itself, this target also knows to
compile and execute tests from that archive, which helps to ensure that
the result is an adequate drop-in replacement for the versioned project.
While this already works as-is, one omission is that we don't propagate
the commit that this is built from into the resulting archive. This can
be fixed though by adding a distribution script that propagates the
version into the "version" file, which GIT-VERSION-GEN knows to read if
present.
Use GIT-VERSION-GEN to populate that file. As the script is executed in
the build directory, not in the directory where we generate the archive,
we have to use a shell to resolve the "MESON_DIST_ROOT" environment
variable.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The Meson build system is able to wire up development environments. The
intent is to make build artifacts of the project available. This is
typically used to export e.g. paths to linkable libraries, which isn't
all that interesting in our context given that we don't have an official
library interface.
But what we can use this mechanism for is to expose the built Git
executables as well as the build directory. This allows users to play
around with the built Git version in the devenv, and allows them to
execute our test scripts directly with the built distribution.
Wire up this feature, which can then be used via `meson devenv` in the
build directory.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We generate a couple of headers from our documentation. These headers
are added to the libgit sources, but two of them aren't used by the
library, but instead by our builtins. This can cause parallel builds to
fail because the builtin object may be compiled before the header was
generated.
Fix the issue by adding both "config-list.h" and "hook-list.h" to the
list of builtin sources. While "command-list.h" is generated similarly,
it is used by "help.c" and thus part of the libgit sources indeed.
Reported-by: Evan Martin <evan.martin@gmail.com>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The Git version for Meson is currently wired up manually. It can thus
grow (and already has grown) stale quite easily, as having multiple
sources of truth is never a good idea. This issue is mostly of cosmetic
nature as we don't use the project version anywhere, and instead use the
GIT-VERSION-GEN script to propagate the correct version into our build.
But it is somewhat puzzling when `meson setup` announces to build an old
Git release.
There are a couple of alternatives for how to solve this:
- We can keep the version undefined, but this makes Meson output
"undefined" for the version, as well.
- We can use GIT-VERSION-GEN to generate the version for us. At the
point of configuring the project we haven't yet figured out host
details though, and thus we didn't yet set up the shell environment.
While not an issue for Unix-based systems, this would be an issue in
Windows, where the shell typically gets provided via Git for Windows
and thus requires some special setup.
- We can pull the default version out of GIT-VERSION-GEN and move it
into its own file. This likely requires some adjustments for scripts
that bump the version, but allows Meson to read the version from
that file trivially.
Pick the second option and use GIT-VERSION-GEN as it gives us the most
accurate version. In order to fix the bootstrapping issue on Windows
systems we simply set the version to 'unknown' in case no shell was
found. As the version is only of cosmetic value this isn't really much
of an issue.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>