When testing corruption of files using the chunk format (like
commit-graphs and midx files), it's helpful to be able to modify bytes
in specific chunks. This requires being able both to read the
table-of-contents (to find the chunk to modify) but also to adjust it
(to account for size changes in the offsets of subsequent chunks).
We have some tests already which corrupt chunk files, but they have some
downsides:
1. They are very brittle, as they manually compute the expected size
of a particular instance of the file (e.g., see the definitions
starting with NUM_OBJECTS in t5319).
2. Because they rely on manual offsets and don't read the
table-of-contents, they're limited to overwriting bytes. But there
are many interesting corruptions that involve changing the sizes of
chunks (especially smaller-than-expected ones).
This patch adds a perl script which makes such corruptions easy. We'll
use it in subsequent patches.
Note that we could get by with just a big "perl -e" inside the helper
function. I chose to put it in a separate script for two reasons. One,
so we don't have to worry about the extra layer of shell quoting. And
two, the script is kind of big, and running the tests with "-x" would
repeatedly dump it into the log output.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The pair_chunk() function is provided as an easy helper for parsing
chunks that just want a pointer to a set of bytes. But every caller has
a hidden bug: because we return only the pointer without the matching
chunk size, the callers have no clue how many bytes they are allowed to
look at. And as a result, they may read off the end of the mmap'd data
when the on-disk file does not match their expectations.
Since chunk files are typically used for local-repository data like
commit-graph files and midx's, the security implications here are pretty
mild. The worst that can happen is that you hand somebody a corrupted
repository tarball, and running Git on it does an out-of-bounds read and
crashes. So it's worth being more defensive, but we don't need to drop
everything and fix every caller immediately.
I noticed the problem because the pair_chunk_fn() callback does not look
at its chunk_size argument, and wanted to annotate it to silence
-Wunused-parameter. We could do that now, but we'd lose the hint that
this code should be audited and fixed.
So instead, let's set ourselves up for going down that path:
1. Provide a pair_chunk() function that does return the size, which
prepares us for fixing these cases.
2. Rename the existing function to pair_chunk_unsafe(). That gives us
an easy way to grep for cases which still need to be fixed, and the
name should cause anybody adding new calls to think twice before
using it.
There are no callers of the "safe" version yet, but we'll add some in
subsequent patches.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Modify the 'readdir' loop in 'loose_fill_ref_dir' to, rather than 'stat' a
file to determine whether it is a directory or not, use 'get_dtype'.
Currently, the loop uses 'stat' to determine whether each dirent is a
directory itself or not in order to construct the appropriate ref cache
entry. If 'stat' fails (returning a negative value), the dirent is silently
skipped; otherwise, 'S_ISDIR(st.st_mode)' is used to check whether the entry
is a directory.
On platforms that include an entry's d_type in in the 'dirent' struct, this
extra 'stat' check is redundant. We can use the 'get_dtype' method to
extract this information on platforms that support it (i.e. where
NO_D_TYPE_IN_DIRENT is unset), and derive it with 'stat' on platforms that
don't. Because 'stat' is an expensive call, this confers a
modest-but-noticeable performance improvement when iterating over large
numbers of refs (approximately 20% speedup in 'git for-each-ref' in a 30k
ref repo).
Unlike other existing usage of 'get_dtype', the 'follow_symlinks' arg is set
to 1 to replicate the existing handling of symlink dirents. This
unfortunately requires calling 'stat' on the associated entry regardless of
platform, but symlinks in the loose ref store are highly unlikely since
they'd need to be created manually by a user.
Note that this patch also changes the condition for skipping creation of a
ref entry from "when 'stat' fails" to "when the d_type is anything other
than DT_REG or DT_DIR". If a dirent's d_type is DT_UNKNOWN (either because
the platform doesn't support d_type in dirents or some other reason) or
DT_LNK, 'get_dtype' will try to derive the underlying type with 'stat'. If
the 'stat' fails, the d_type will remain 'DT_UNKNOWN' and dirent will be
skipped. However, it will also be skipped if it is any other valid d_type
(e.g. DT_FIFO for named pipes, DT_LNK for a nested symlink). Git does not
handle these properly anyway, so we can safely constrain accepted types to
directories and regular files.
Signed-off-by: Victoria Dye <vdye@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Add a 'follow_symlink' boolean option to 'get_type()'. If 'follow_symlink'
is enabled, DT_LNK (in addition to DT_UNKNOWN) d_types triggers the
stat-based d_type resolution, using 'stat' instead of 'lstat' to get the
type of the followed symlink. Note that symlinks are not followed
recursively, so a symlink pointing to another symlink will still resolve to
DT_LNK.
Update callers in 'diagnose.c' to specify 'follow_symlink = 0' to preserve
current behavior.
Signed-off-by: Victoria Dye <vdye@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Move 'get_dtype()' from 'diagnose.c' to 'dir.c' and add its declaration to
'dir.h' so that it is accessible to callers in other files. The function and
its documentation are moved verbatim except for a small addition to the
description clarifying what the 'path' arg represents.
Signed-off-by: Victoria Dye <vdye@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Update 'cache_ref_iterator_advance' to skip over refs that are not matched
by the given prefix.
Currently, a ref entry is considered "matched" if the entry name is fully
contained within the prefix:
* prefix: "refs/heads/v1"
* entry: "refs/heads/v1.0"
OR if the prefix is fully contained in the entry name:
* prefix: "refs/heads/v1.0"
* entry: "refs/heads/v1"
The first case is always correct, but the second is only correct if the ref
cache entry is a directory, for example:
* prefix: "refs/heads/example"
* entry: "refs/heads/"
Modify the logic in 'cache_ref_iterator_advance' to reflect these
expectations:
1. If 'overlaps_prefix' returns 'PREFIX_EXCLUDES_DIR', then the prefix and
ref cache entry do not overlap at all. Skip this entry.
2. If 'overlaps_prefix' returns 'PREFIX_WITHIN_DIR', then the prefix matches
inside this entry if it is a directory. Skip if the entry is not a
directory, otherwise iterate over it.
3. Otherwise, 'overlaps_prefix' returned 'PREFIX_CONTAINS_DIR', indicating
that the cache entry (directory or not) is fully contained by or equal to
the prefix. Iterate over this entry.
Note that condition 2 relies on the names of directory entries having the
appropriate trailing slash. The existing function documentation of
'create_dir_entry' explicitly calls out the trailing slash requirement, so
this is a safe assumption to make.
This bug generally doesn't have any user-facing impact, since it requires:
1. using a non-empty prefix without a trailing slash in an iteration like
'for_each_fullref_in',
2. the callback to said iteration not reapplying the original filter (as
for-each-ref does) to ensure unmatched refs are skipped, and
3. the repository having one or more refs that match part of, but not all
of, the prefix.
However, there are some niche scenarios that meet those criteria
(specifically, 'rev-parse --bisect' and '(log|show|shortlog) --bisect'). Add
tests covering those cases to demonstrate the fix in this patch.
Signed-off-by: Victoria Dye <vdye@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
initialize_attr_index() does not initialize the repo member of
attr_index. Starting in 44451a2e5e (attr: teach "--attr-source=<tree>"
global option to "git", 2023-05-06), this became a problem because
istate->repo gets passed down the call chain starting in
git_check_attr(). This gets passed all the way down to
replace_refs_enabled(), which segfaults when accessing r->gitdir.
Fix this by initializing the repository in the index state.
Signed-off-by: John Cai <johncai86@gmail.com>
Helped-by: Christian Couder <christian.couder@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
'--dd' only makes sense for 'git log' and 'git show', so add it to
__git_log_show_options which is referenced in the completion for these
two commands.
Signed-off-by: Sergey Organov <sorganov@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This option provides a shortcut to request diff with respect to first
parent for any kind of commit, universally. It's implemented as pure
synonym for "--diff-merges=first-parent --patch".
Gives user quick and universal way to see what changes, exactly, were
brought to a branch by merges as well as by regular commits.
Signed-off-by: Sergey Organov <sorganov@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* Put descriptions of convenience shortcuts first, so they are the
first things reader observes rather than lengthy detailed stuff.
* Get rid of very long line containing all the --diff-merges formats
by replacing them with <format>, and putting each supported format
on its own line.
Signed-off-by: Sergey Organov <sorganov@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The DESCRIPTION's "first form" is actually the 1st, 2nd, 3rd and 5th
form in SYNOPSIS, the "second form" is the 4th one.
Interestingly, this state of affairs was introduced in
97fe725075 (cat-file docs: fix SYNOPSIS and "-h" output, 2021-12-28)
with the claim of "Now the two will match again." ("the two" being
DESCRIPTION and SYNOPSIS)...
The description also suffers from other correctness and clarity issues,
e.g., the "first form" paragraph discusses -p, -s and -t, but leaves out
-e, which is included in the corresponding SYNOPSIS section; the second
paragraph mentions <format>, which doesn't occur in SYNOPSIS at all, and
of the three batch options, really only describes the behavior of
--batch-check. Also the mention of "drivers" seems an implementation
detail not adding much clarity in a short summary (and isn't expanded
upon in the rest of the man page, either).
Rather than trying to maintain one-to-one (or N-to-M) correspondence
between the DESCRIPTION and SYNOPSIS forms, creating duplication and
providing opportunities for error, shorten the former into a concise
summary describing the two general modes of operation: batch and
non-batch, leaving details to the subsequent manual sections.
While here, fix a grammar error in the description of -e and make the
following further minor improvements:
NAME:
shorten ("content or type and size" isn't the whole story; say
"details" and leave the actual details to later sections)
SYNOPSIS and --help:
move the (--textconv | --filters) form before --batch, closer
to the other non-batch forms
Signed-off-by: Štěpán Němec <stepnem@smrk.net>
Acked-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The soft limit of the first line of the commit message should be
"no more than 50 characters" or "50 characters or less", but not
"less than 50 character".
This is an addition to commit c2c349a15c (doc: correct the 50 characters
soft limit, 2023-09-28).
Signed-off-by: 谢致邦 (XIE Zhibang) <Yeking@Red54.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Get rid of extraneous whitespace, replace tab-after-fullstop with
space, etc.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Mark pretty formats containing "%(decorate" as requiring decoration in
userformat_find_requirements(), same as "%d" and "%D".
Without this, cmd_log_init_finish() didn't invoke load_ref_decorations()
with the decoration_filter it puts together, and hence filtering options
such as --decorate-refs were quietly ignored.
Amend one of the %(decorate) checks in t4205-log-pretty-formats.sh to
test this.
Signed-off-by: Andy Koppe <andy.koppe@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We allocate an array of packed_git pointers so that we can sort the list
of cruft packs, but we never free the array, causing a small leak. Note
that we don't need to free the packed_git structs themselves; they're
owned by the repository object.
Signed-off-by: Jeff King <peff@peff.net>
Acked-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
No disrespect to other mailing list archives, but the local part of
their URLs will become pretty much meaningless once the archives go
out of service, and we learned the lesson hard way when $gmane
stopped serving.
Let's point into https://lore.kernel.org/ for an article that can be
found there, because the local part of the URL has the Message-Id:
that can be used to find the same message in other archives, even if
lore goes down.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We build up a string list of listen addresses from the command-line
arguments, but never free it. This causes t5811 to complain of a leak
(though curiously it seems to do so only when compiled with gcc, not
with clang).
To handle this correctly, we have to do a little refactoring:
- there are two exit points from the main function, depending on
whether we are entering the main loop or serving a single client
(since rather than a traditional fork model, we re-exec ourselves
with the extra "--serve" argument to accommodate Windows).
We don't need --listen at all in the --serve case, of course, but it
is passed along by the parent daemon, which simply copies all of the
command-line options it got.
- we just "return serve()" to run the main loop, giving us no chance
to do any cleanup
So let's use a "ret" variable to store the return code, and give
ourselves a single exit point at the end. That gives us one place to do
cleanup.
Note that this code also uses the "use a no-dup string-list, but
allocate strings we add to it" trick, meaning string_list_clear() will
not realize it should free them. We can fix this by switching to a "dup"
string-list, but using the "append_nodup" function to add to it (this is
preferable to tweaking the strdup_strings flag before clearing, as it
puts all the subtle memory-ownership code together).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The point of release_revisions() is to free memory associated with the
rev_info struct, but we have several "struct decoration" members that
are left untouched. Since the previous commit introduced a function to
do that, we can just call it.
We do have to provide some specialized callbacks to map the void
pointers onto real ones (the alternative would be casting the existing
function pointers; this generally works because "void *" is usually
interchangeable with a struct pointer, but it is technically forbidden
by the standard).
Since the line-log code does not expose the type it stores in the
decoration (nor of course the function to free it), I put this behind a
generic line_log_free() entry point. It's possible we may need to add
more line-log specific bits anyway (running t4211 shows a number of
other leaks in the line-log code).
While this doubtless cleans up many leaks triggered by the test suite,
the only script which becomes leak-free is t4217, as it does very little
beyond a simple traversal (its existing leak was from the use of
--children, which is now fixed).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
There's not currently any way to free the resources associated with a
decoration struct. As a result, we have several memory leaks which
cannot easily be plugged.
Let's add a "clear" function and make use of it in the example code of
t9004. This removes the only leak from that script, so we can mark it as
passing the leak sanitizer.
Curiously this leak is found only when running SANITIZE=leak with clang,
but not with gcc. But it is a bog-standard leak: we allocate some
memory in a local variable struct, and then exit main() without
releasing it. I'm not sure why gcc doesn't find it. After this
patch, both compilers report it as leak-free.
Note that the clear function takes a callback to free the individual
entries. That's not needed for our example (which is just decorating
with ints), but will be for real callers.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When doing a `--geometric` repack, we make sure that the preferred pack
(if writing a MIDX) is the largest pack that we *didn't* repack. That
has the effect of keeping the preferred pack in sync with the pack
containing a majority of the repository's reachable objects.
But if the repository happens to double in size, we'll repack
everything. Here we don't specify any `--preferred-pack`, and instead
let the MIDX code choose.
In the past, that worked fine, since there would only be one pack to
choose from: the one we just wrote. But it's no longer necessarily the
case that there is one pack to choose from. It's possible that the
repository also has a cruft pack, too.
If the cruft pack happens to come earlier in lexical order (and has an
earlier mtime than any non-cruft pack), we'll pick that pack as
preferred. This makes it impossible to reuse chunks of the reachable
pack verbatim from pack-objects, so is sub-optimal.
Luckily, this is a somewhat rare circumstance to be in, since we would
have to repack the entire repository during a `--geometric` repack, and
the cruft pack would have to sort ahead of the pack we just created.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Cruft packs are an alternative mechanism for storing a collection of
unreachable objects whose mtimes are recent enough to avoid being
pruned out of the repository.
When cruft packs were first introduced back in b757353676
(builtin/pack-objects.c: --cruft without expiration, 2022-05-20) and
a7d493833f (builtin/pack-objects.c: --cruft with expiration,
2022-05-20), the recommended workflow consisted of:
- Repacking periodically, either by packing anything loose in the
repository (via `git repack -d`) or producing a geometric sequence
of packs (via `git repack --geometric=<d> -d`).
- Every so often, splitting the repository into two packs, one cruft
to store the unreachable objects, and another non-cruft pack to
store the reachable objects.
Repositories may (out of band with the above) choose periodically to
prune out some unreachable objects which have aged out of the grace
period by generating a pack with `--cruft-expiration=<approxidate>`.
This allowed repositories to maintain relatively few packs on average,
and quarantine unreachable objects together in a cruft pack, avoiding
the pitfalls of holding unreachable objects as loose while they age out
(for more, see some of the details in 3d89a8c118
(Documentation/technical: add cruft-packs.txt, 2022-05-20)).
This all works, but can be costly from an I/O-perspective when
frequently repacking a repository that has many unreachable objects.
This problem is exacerbated when those unreachable objects are rarely
(if every) pruned.
Since there is at most one cruft pack in the above scheme, each time we
update the cruft pack it must be rewritten from scratch. Because much of
the pack is reused, this is a relatively inexpensive operation from a
CPU-perspective, but is very costly in terms of I/O since we end up
rewriting basically the same pack (plus any new unreachable objects that
have entered the repository since the last time a cruft pack was
generated).
At the time, we decided against implementing more robust support for
multiple cruft packs. This patch implements that support which we were
lacking.
Introduce a new option `--max-cruft-size` which allows repositories to
accumulate cruft packs up to a given size, after which point a new
generation of cruft packs can accumulate until it reaches the maximum
size, and so on. To generate a new cruft pack, the process works like
so:
- Sort a list of any existing cruft packs in ascending order of pack
size.
- Starting from the beginning of the list, group cruft packs together
while the accumulated size is smaller than the maximum specified
pack size.
- Combine the objects in these cruft packs together into a new cruft
pack, along with any other unreachable objects which have since
entered the repository.
Once a cruft pack grows beyond the size specified via `--max-cruft-size`
the pack is effectively frozen. This limits the I/O churn up to a
quadratic function of the value specified by the `--max-cruft-size`
option, instead of behaving quadratically in the number of total
unreachable objects.
When pruning unreachable objects, we bypass the new code paths which
combine small cruft packs together, and instead start from scratch,
passing in the appropriate `--max-pack-size` down to `pack-objects`,
putting it in charge of keeping the resulting set of cruft packs sized
correctly.
This may seem like further I/O churn, but in practice it isn't so bad.
We could prune old cruft packs for whom all or most objects are removed,
and then generate a new cruft pack with just the remaining set of
objects. But this additional complexity buys us relatively little,
because most objects end up being pruned anyway, so the I/O churn is
well contained.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The repack builtin takes a `--max-pack-size` command-line argument which
it uses to feed into any of the pack-objects children that it may spawn
when generating a new pack.
This option is parsed with OPT_STRING, meaning that we'll accept
anything as input, punting on more fine-grained validation until we get
down into pack-objects.
This is fine, but it's wasteful to spend an entire sub-process just to
figure out that one of its option is bogus. Instead, parse the value of
`--max-pack-size` with OPT_MAGNITUDE in 'git repack', and then pass the
known-good result down to pack-objects.
Suggested-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
With the broken quoting the test wouldn't even parse correctly, but
there's also the '==' instead of POSIX '=' (of the shells I tested,
busybox ash, bash and ksh (93 and OpenBSD) accept '==', dash and zsh do
not), and 'print 2' from Python 2 days.
(I assume the test failing due to 3 != 4 is intentional or immaterial.)
Fixes: 93a5724613 ("test-lib: Add support for multiple test prerequisites")
Signed-off-by: Štěpán Němec <stepnem@smrk.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The terminology was changed in b0d12fc9b2 (Use the word 'stuck'
instead of 'sticked').
Signed-off-by: Štěpán Němec <stepnem@smrk.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
There's even an example of such usage in the README.
Fixes: 67ba13e5a4 ("git-jump: pass "merge" arguments to ls-files")
Signed-off-by: Štěpán Němec <stepnem@smrk.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Fix the grammar ("which default value is") and reword to match other
similar descriptions (say "configuration variable" instead of
"parameter", link to git-config(1)).
Signed-off-by: Štěpán Němec <stepnem@smrk.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>