When reading a loose object, we first try to expand the first 32 bytes
to read the type+size header. This is enough for any of the normal Git
types. But since 46f034483e (sha1_file: support reading from a loose
object of unknown type, 2015-05-03), the caller can also ask us to parse
any unknown names, which can be much longer. In this case we keep
inflating until we find the NUL at the end of the header, or hit
Z_STREAM_END.
But what if zlib can't make forward progress? For example, if the loose
object file is truncated, we'll have no more data to feed it. It will
return Z_BUF_ERROR, and we'll just loop infinitely, calling
git_inflate() over and over but never seeing new bytes nor an
end-of-stream marker.
We can fix this by only looping when we think we can make forward
progress. This will always be Z_OK in this case. In other code we might
also be able to continue on Z_BUF_ERROR, but:
- We will never see Z_BUF_ERROR because the output buffer is full; we
always feed a fresh 32-byte buffer on each call to git_inflate().
- We may see Z_BUF_ERROR if we run out of input. But since we've fed
the whole mmap'd buffer to zlib, if it runs out of input there is
nothing more we can do.
So if we don't see Z_OK (and didn't see the end-of-header NUL, otherwise
we'd have broken out of the loop), then we should stop looping and
return an error.
The test case shows an example where the input is truncated (which gives
us the input Z_BUF_ERROR case above).
Although we do operate on objects we might get from an untrusted remote,
I don't think the security implications of this bug are too great. It
can only trigger if both of these are true:
- You're reading a loose object whose on-disk representation was
written by an attacker. So fetching an object (or receiving a push)
are mostly OK, because even with unpack-objects it is our local,
trusted code that writes out the object file. The exception may be
fetching from an untrusted local repo, or using dumb-http, which
copies objects verbatim. But...
- The only code path which triggers the inflate loop is cat-file's
--allow-unknown-type option. This is unlikely to be called at all
outside of debugging. But I also suspect that objects with
non-standard types (or that are truncated) would not survive the
usual fetch/receive checks in the first place.
So I think it would be quite hard to trick somebody into running the
infinite loop, and we can just fix the bug.
Co-authored-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
If a caller asks us to read the whole loose object header value into a
strbuf (e.g., via "cat-file --allow-unknown-type"), we'll keep reading
until we see a NUL byte marking the end of the header.
If we hit Z_STREAM_END before seeing the NUL, we obviously have to stop.
But we return ULHR_TOO_LONG, which doesn't make any sense. The "too
long" return code is used in the normal, 32-byte limited mode to
indicate that we stopped looking. There is no such thing as "too long"
here, as we'd keep reading forever until we see the end of stream or the
NUL.
Instead, we should return ULHR_BAD. The loose object has no NUL marking
the end of header, so it is malformed. The behavior difference is
slight; in either case we'd consider the object unreadable and refuse to
go further. The only difference is the specific error message we
produce.
There's no test case here, as we'd need to generate a valid zlib stream
without a NUL. That's not something Git will do without writing new
custom code. And in the next patch we'll fix another bug in this area
which will make this easier to do (and we will test it then).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When using OBJECT_INFO_ALLOW_UNKNOWN_TYPE to unpack a header that
doesn't fit into our initial 32-byte buffer, we loop over calls
git_inflate(), feeding it our buffer to the "next_out" pointer each
time. As the code is written, we reset next_out after each inflate call
(and after reading the output), ready for the next loop.
This isn't wrong, but there are a few advantages to setting up
"next_out" right before each inflate call, rather than after:
1. It drops a few duplicated lines of code.
2. It makes it obvious that we always feed a fresh buffer on each call
(and thus can never see Z_BUF_ERROR due to due to a lack of output
space).
3. After we exit the loop, we'll leave stream->next_out pointing to
the end of the fetched data (this is how zlib callers find out how
much data is in the buffer). This doesn't matter in practice, since
nobody looks at it again. But it's probably the least-surprising
thing to do, as it matches how next_out is left when the whole
thing fits in the initial 32-byte buffer (and we don't enter the
loop at all).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
After unpack_loose_header() returns, it will have inflated not only the
object header, but possibly some bytes of the object content. When we
call unpack_loose_rest() to extract the actual content, it finds those
extra bytes by skipping past the header's terminating NUL in the buffer.
Like this:
int bytes = strlen(buffer) + 1;
n = stream->total_out - bytes;
...
memcpy(buf, (char *) buffer + bytes, n);
This won't work with the OBJECT_INFO_ALLOW_UNKNOWN_TYPE flag, as there
we allow a header of arbitrary size. We put into a strbuf, but feed only
the final 32-byte chunk we read to unpack_loose_rest(). In that case
stream->total_out may unexpectedly large, and thus our "n" will be
large, causing an out-of-bounds read (we do check it against our
allocated buffer size, which prevents an out-of-bounds write).
Probably this could be made to work by feeding the strbuf to
unpack_loose_rest(), along with adjusting some types (e.g., "bytes"
would need to be a size_t, since it is no longer operating on a 32-byte
buffer).
But I don't think it's possible to actually trigger this in practice.
The only caller who passes ALLOW_UNKNOWN_TYPE is cat-file, which only
allows it with the "-t" and "-s" options (neither of which access the
content). There is one way you can _almost_ trigger it: the oid compat
routines (i.e., accessing sha1 via sha256 names and vice versa) will
convert objects on the fly (which requires access to the content) using
the same flags that were passed in. So in theory this:
t='some very large type field that causes an extra inflate call'
sha1_oid=$(git hash-object -w -t "$t" file)
sha256_oid=$(git rev-parse --output-object-format=sha256 $sha1_oid)
git cat-file --allow-unknown-type -s $sha256_oid
would try to access the content. But it doesn't work, because using
compat objects requires an entry in the .git/objects/loose-object-idx
file, and we don't generate such an entry for non-standard types (see
the "compat" section of write_object_file_literally()).
If we use "t=blob" instead, then it does access the compat object, but
it doesn't trigger the problem (because "blob" is a standard short type
name, and it fits in the initial 32-byte buffer).
So given that this is almost a memory error bug, I think it's worth
addressing. But because we can't actually trigger the situation, I'm
hesitant to try a fix that we can't run. Instead let's document the
restriction and protect ourselves from the out-of-bounds read by adding
a BUG() check.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Remove `git_path_buf()` in favor of `repo_git_path_replace()`. The
latter does essentially the same, with the only exception that it does
not rely on `the_repository` but takes the repo as separate parameter.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Remove `git_pathdup()` in favor of `repo_git_path()`. The latter does
essentially the same, with the only exception that it does not rely on
`the_repository` but takes the repo as separate parameter.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The API around choosing to use unsafe variant of SHA-1
implementation has been updated in an attempt to make it harder to
abuse.
* tb/unsafe-hash-cleanup:
hash.h: drop unsafe_ function variants
csum-file: introduce hashfile_checkpoint_init()
t/helper/test-hash.c: use unsafe_hash_algo()
csum-file.c: use unsafe_hash_algo()
hash.h: introduce `unsafe_hash_algo()`
csum-file.c: extract algop from hashfile_checksum_valid()
csum-file: store the hash algorithm as a struct field
t/helper/test-tool: implement sha1-unsafe helper
Adapt callers to use generic hash context helpers instead of using the
hash algorithm to update them. This makes the callsites easier to reason
about and removes the possibility that the wrong hash algorithm is used
to update the hash context's state. And as a nice side effect this also
gets rid of a bunch of users of `the_hash_algo`.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The hash context is supposed to be updated via the `git_hash_algo`
structure, which contains a list of function pointers to update, clone
or finalize a hashing context. This requires the callers to track which
algorithm was used to initialize the context and continue to use the
exact same algorithm. If they fail to do that correctly, it can happen
that we start to access context state of one hash algorithm with
functions of a different hash algorithm. The result would typically be a
segfault, as could be seen e.g. in the patches part of 98422943f0 (Merge
branch 'ps/weak-sha1-for-tail-sum-fix', 2025-01-01).
The situation was significantly improved starting with 04292c3796
(hash.h: drop unsafe_ function variants, 2025-01-23) and its parent
commits. These refactorings ensure that it is not possible to mix up
safe and unsafe variants of the same hash algorithm anymore. But in
theory, it is still possible to mix up different hash algorithms with
each other, even though this is a lot less likely to happen.
But still, we can do better: instead of asking the caller to remember
the hash algorithm used to initialize a context, we can instead make the
context itself remember which algorithm it has been initialized with. If
we do so, callers can use a set of generic helpers to update the context
and don't need to be aware of the hash algorithm at all anymore.
Adapt the context initialization functions to store the hash algorithm
in the hashing context and introduce these generic helpers. Callers will
be adapted in the subsequent commit.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We generally avoid using `typedef` in the Git codebase. One exception
though is the `git_hash_ctx`, likely because it used to be a union
rather than a struct until the preceding commit refactored it. But now
that it is a normal `struct` there isn't really a need for a typedef
anymore.
Drop the typedef and adapt all callers accordingly.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The `git_hash_context` is a union containing the different hash-specific
states for SHA1, its unsafe variant as well as SHA256. We know that only
one of these states will ever be in use at the same time because hash
contexts cannot be used for multiple different hashes at the same point
in time.
We're about to extend the structure though to keep track of the hash
algorithm used to initialize the context, which is impossible to do
while the context is a union. Refactor it to instead be a structure that
contains the union of context states.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* tb/unsafe-hash-cleanup:
hash.h: drop unsafe_ function variants
csum-file: introduce hashfile_checkpoint_init()
t/helper/test-hash.c: use unsafe_hash_algo()
csum-file.c: use unsafe_hash_algo()
hash.h: introduce `unsafe_hash_algo()`
csum-file.c: extract algop from hashfile_checksum_valid()
csum-file: store the hash algorithm as a struct field
t/helper/test-tool: implement sha1-unsafe helper
Now that all callers have been converted from:
the_hash_algo->unsafe_init_fn();
to
unsafe_hash_algo(the_hash_algo)->init_fn();
and similar, we can remove the scaffolding for the unsafe_ function
variants and force callers to use the new unsafe_hash_algo() mechanic
instead.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In 253ed9ecff (hash.h: scaffolding for _unsafe hashing variants,
2024-09-26), we introduced "unsafe" variants of the SHA-1 hashing
functions by introducing new functions like "unsafe_init_fn()" and so
on.
This approach has a major shortcoming that callers must remember to
consistently use one variant or the other. Failing to consistently use
(or not use) the unsafe variants can lead to crashes at best, or subtle
memory corruption issues at worst.
In the hashfile API, this isn't difficult to achieve, but verifying that
all callers consistently use the unsafe variants is somewhat of a chore
given how spread out all of the callers are. In the sha1 and sha1-unsafe
test helpers, all of the calls to various hash functions are guarded by
an "if (unsafe)" conditional, which is repetitive and cumbersome.
Address these issues by introducing a new pattern whereby one
'git_hash_algo' can return a pointer to another 'git_hash_algo' that
represents the unsafe version of itself. So instead of having something
like:
if (unsafe)
the_hash_algo->init_fn(...);
the_hash_algo->update_fn(...);
the_hash_algo->final_fn(...);
else
the_hash_algo->unsafe_init_fn(...);
the_hash_algo->unsafe_update_fn(...);
the_hash_algo->unsafe_final_fn(...);
we can instead write:
struct git_hash_algo *algop = the_hash_algo;
if (unsafe)
algop = unsafe_hash_algo(algop);
algop->init_fn(...);
algop->update_fn(...);
algop->final_fn(...);
This removes the existing shortcoming by no longer forcing the caller to
"remember" which variant of the hash functions it wants to call, only to
hold onto a 'struct git_hash_algo' pointer that is initialized once.
Similarly, while there currently is still a way to "mix" safe and unsafe
functions, this too will go away after subsequent commits remove all
direct calls to the unsafe_ variants.
Note that hash_algo_by_ptr() needs an adjustment to allow passing in the
unsafe variant of a hash function. All other query functions on the
hash_algos array will continue to return the safe variants of any
function.
Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
CI jobs gave sporadic failures, which turns out that that the
object finalization code was giving an error when it did not have
to.
* ps/object-collision-check:
object-file: retry linking file into place when occluding file vanishes
object-file: don't special-case missing source file in collision check
object-file: rename variables in `check_collision()`
object-file: fix race in object collision check
Prior to 0ad3d65652 (object-file: fix race in object collision check,
2024-12-30), callers could expect that a successful return from
`finalize_object_file()` means that either the file was moved into
place, or the identical bytes were already present. If neither of those
happens, we'd return an error.
Since that commit, if the destination file disappears between our
link(3p) call and the collision check, we'd return success without
actually checking the contents, and without retrying the link. This
solves the common case that the files were indeed the same, but it means
that we may corrupt the repository if they weren't (this implies a hash
collision, but the whole point of this function is protecting against
hash collisions).
We can't be pessimistic and assume they're different; that hurts the
common case that the mentioned commit was trying to fix. But after
seeing that the destination file went away, we can retry linking again.
Adapt the code to do so when we see that the destination file has racily
vanished. This should generally succeed as we have just observed that
the destination file does not exist anymore, except in the very unlikely
event that it gets recreated by another concurrent process again.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In 0ad3d65652 (object-file: fix race in object collision check,
2024-12-30) we have started to ignore ENOENT when opening either the
source or destination file of the collision check. This was done to
handle races more gracefully in case either of the potentially-colliding
disappears.
The fix is overly broad though: while the destination file may indeed
vanish racily, this shouldn't ever happen for the source file, which is
a temporary object file (either loose or in packfile format) that we
have just created. So if any concurrent process would have removed that
temporary file it would indicate an actual issue.
Stop treating ENOENT specially for the source file so that we always
bubble up this error.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Rename variables used in `check_collision()` to clearly identify which
file is the source and which is the destination. This will make the next
step easier to reason about when we start to treat those files different
from one another.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
One of the tests in t5616 asserts that git-fetch(1) with `--refetch`
triggers repository maintenance with the correct set of arguments. This
test is flaky and causes us to fail sometimes:
++ git -c protocol.version=0 -c gc.autoPackLimit=0 -c maintenance.incremental-repack.auto=1234 -C pc1 fetch --refetch origin
error: unable to open .git/objects/pack/pack-029d08823bd8a8eab510ad6ac75c823cfd3ed31e.pack: No such file or directory
fatal: unable to rename temporary file to '.git/objects/pack/pack-029d08823bd8a8eab510ad6ac75c823cfd3ed31e.pack'
fatal: could not finish pack-objects to repack local links
fatal: index-pack failed
error: last command exited with $?=128
The error message is quite confusing as it talks about trying to rename
a temporary packfile. A first hunch would thus be that this packfile
gets written by git-fetch(1), but removed by git-maintenance(1) while it
hasn't yet been finalized, which shouldn't ever happen. And indeed, when
looking closer one notices that the file that is supposedly of temporary
nature does not have the typical `tmp_pack_` prefix.
As it turns out, the "unable to rename temporary file" fatal error is a
red herring and the real error is "unable to open". That error is raised
by `check_collision()`, which is called by `finalize_object_file()` when
moving the new packfile into place. Because t5616 re-fetches objects, we
end up with the exact same pack as we already have in the repository. So
when the concurrent git-maintenance(1) process rewrites the preexisting
pack and unlinks it exactly at the point in time where git-fetch(1)
wants to check the old and new packfiles for equality we will see ENOENT
and thus `check_collision()` returns an error, which gets bubbled up by
`finalize_object_file()` and is then handled by `rename_tmp_packfile()`.
That function does not know about the exact root cause of the error and
instead just claims that the rename has failed.
This race is thus caused by b1b8dfde69 (finalize_object_file():
implement collision check, 2024-09-26), where we have newly introduced
the collision check.
By definition, two files cannot collide with each other when one of them
has been removed. We can thus trivially fix the issue by ignoring ENOENT
when opening either of the files we're about to check for collision.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Mark code units that generate warnings with `-Wsign-compare`. This
allows for a structured approach to get rid of all such warnings over
time in a way that can be easily measured.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We define macros with the bytes of the empty trees and blobs for sha1
and sha256. But since e1ccd7e2b1 (sha1_file: only expose empty object
constants through git_hash_algo, 2018-05-02), those are used only for
initializing the git_hash_algo entries. Any other code using the macros
directly would be suspicious, since a hash_algo pointer is the level of
indirection we use to make everything work with both sha1 and sha256.
So let's future proof against code doing the wrong thing by dropping the
macros entirely and just initializing the structs directly.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The cached-object API maps oids to in-memory entries. Once inserted,
these entries should be immutable. Let's return them from the
find_cached_object() call with a const tag to make this clear.
Suggested-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The pretend_object_file() function adds to an array mapping oids to
object contents, which are later retrieved with find_cached_object().
We naturally need to store the oid for each entry, since it's the lookup
key.
But find_cached_object() also returns a hard-coded empty_tree object.
There we don't care about its oid field and instead compare against
the_hash_algo->empty_tree. The oid field is left as all-zeroes.
This all works, but it means that the cached_object struct we return
from find_cached_object() may or may not have a valid oid field, depend
whether it is the hard-coded tree or came from pretend_object_file().
Nobody looks at the field, so there's no bug. But let's future-proof it
by returning only the object contents themselves, not the oid. We'll
continue to call this "struct cached_object", and the array entry
mapping the key to those contents will be a "cached_object_entry".
This would also let us swap out the array for a better data structure
(like a hashmap) if we chose, but there's not much point. The only code
that adds an entry is git-blame, which adds at most a single entry per
process.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The fake empty_tree struct is a static global, but the only code that
looks at it is find_cached_object(). The struct itself is a little odd,
with an invalid "oid" field that is handled specially by that function.
Since it's really just an implementation detail, let's move it to a
static within the function. That future-proofs against other code trying
to use it and seeing the weird oid value.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We treat the empty tree specially, providing an in-memory "cached" copy,
which allows you to diff against it even if the object doesn't exist in
the repository. This is implemented as part of the larger cached_object
subsystem, but we use a stand-alone empty_tree struct.
We initialize the oid of that struct using EMPTY_TREE_SHA1_BIN_LITERAL.
At first glance, that seems like a bug; how could this ever work for
sha256 repositories?
The answer is that we never look at the oid field! The oid field is used
to look up entries added by pretend_object_file() to the cached_objects
array. But for our stand-alone entry, we look for it independently using
the_hash_algo->empty_tree, which will point to the correct algo struct
for the repository.
This happened in 62ba93eaa9 (sha1_file: convert cached object code to
struct object_id, 2018-05-02), which even mentions that this field is
never used. Let's reduce confusion for anybody reading this code by
replacing the sha1 initializer with a comment. The resulting field will
be all-zeroes, so any violation of our assumption that the oid field is
not used will break equally for sha1 and sha256.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We hard-code a few well-known hash values for empty trees and blobs in
both sha1 and sha256 formats. We do so with string literals like this:
#define EMPTY_TREE_SHA256_BIN_LITERAL \
"\x6e\xf1\x9b\x41\x22\x5c\x53\x69\xf1\xc1" \
"\x04\xd4\x5d\x8d\x85\xef\xa9\xb0\x57\xb5" \
"\x3b\x14\xb4\xb9\xb9\x39\xdd\x74\xde\xcc" \
"\x53\x21"
and then use it to initialize the hash field of an object_id struct.
That hash field is exactly 32 bytes long (the size we need for sha256).
But the string literal above is actually 33 bytes long due to the NUL
terminator. This is legal in C, and the NUL is ignored.
Side note on legality: in general excess initializer elements are
forbidden, and gcc will warn on both of these:
char foo[3] = { 'h', 'u', 'g', 'e' };
char bar[3] = "VeryLongString";
I couldn't find specific language in the standard allowing
initialization from a string literal where _just_ the NUL is ignored,
but C99 section 6.7.8 (Initialization), paragraph 32 shows this exact
case as "example 8".
However, the upcoming gcc 15 will start warning for this case (when
compiled with -Wextra via DEVELOPER=1):
CC object-file.o
object-file.c:52:9: warning: initializer-string for array of ‘unsigned char’ is too long [-Wunterminated-string-initialization]
52 | "\x6e\xf1\x9b\x41\x22\x5c\x53\x69\xf1\xc1" \
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
object-file.c:79:17: note: in expansion of macro ‘EMPTY_TREE_SHA256_BIN_LITERAL’
which is understandable. Even though this is not a bug for us, since we
do not care about the NUL terminator (and are just using the literal as
a convenient format), it would be easy to accidentally create an array
that was mistakenly unterminated.
We can avoid this warning by switching the initializer to an actual
array of unsigned values. That arguably demonstrates our intent more
clearly anyway.
Reported-by: Sam James <sam@gentoo.org>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The checksum at the tail of files are now computed without
collision detection protection. This is safe as the consumer of
the information to protect itself from replay attacks checks for
hash collisions independently.
* tb/weak-sha1-for-tail-sum:
csum-file.c: use unsafe SHA-1 implementation when available
Makefile: allow specifying a SHA-1 for non-cryptographic uses
hash.h: scaffolding for _unsafe hashing variants
sha1: do not redefine `platform_SHA_CTX` and friends
pack-objects: use finalize_object_file() to rename pack/idx/etc
finalize_object_file(): implement collision check
finalize_object_file(): refactor unlink_or_warn() placement
finalize_object_file(): check for name collision before renaming
Git's default SHA-1 implementation is collision-detecting, which hardens
us against known SHA-1 attacks against Git objects. This makes Git
object writes safer at the expense of some speed when hashing through
the collision-detecting implementation, which is slower than
non-collision detecting alternatives.
Prepare for loading a separate "unsafe" SHA-1 implementation that can be
used for non-cryptographic purposes, like computing the checksum of
files that use the hashwrite() API.
This commit does not actually introduce any new compile-time knobs to
control which implementation is used as the unsafe SHA-1 variant, but
does add scaffolding so that the "git_hash_algo" structure has five new
function pointers which are "unsafe" variants of the five existing
hashing-related function pointers:
- git_hash_init_fn unsafe_init_fn
- git_hash_clone_fn unsafe_clone_fn
- git_hash_update_fn unsafe_update_fn
- git_hash_final_fn unsafe_final_fn
- git_hash_final_oid_fn unsafe_final_oid_fn
The following commit will introduce compile-time knobs to specify which
SHA-1 implementation is used for non-cryptographic uses.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We've had "FIXME!!! Collision check here ?" in finalize_object_file()
since aac1794132 (Improve sha1 object file writing., 2005-05-03). That
is, when we try to write a file with the same name, we assume the
on-disk contents are the same and blindly throw away the new copy.
One of the reasons we never implemented this is because the files it
moves are all named after the cryptographic hash of their contents
(either loose objects, or packs which have their hash in the name these
days). So we are unlikely to see such a collision by accident. And even
though there are weaknesses in sha1, we assume they are mitigated by our
use of sha1dc.
So while it's a theoretical concern now, it hasn't been a priority.
However, if we start using weaker hashes for pack checksums and names,
this will become a practical concern. So in preparation, let's actually
implement a byte-for-byte collision check.
The new check will cause the write of new differing content to be a
failure, rather than a silent noop, and we'll retain the temporary file
on disk. If there's no collision present, we'll clean up the temporary
file as usual after either rename()-ing or link()-ing it into place.
Note that this may cause some extra computation when the files are in
fact identical, but this should happen rarely.
Loose objects are exempt from this check, and the collision check may be
skipped by calling the _flags variant of this function with the
FOF_SKIP_COLLISION_CHECK bit set. This is done for a couple of reasons:
- We don't treat the hash of the loose object file's contents as a
checksum, since the same loose object can be stored using different
bytes on disk (e.g., when adjusting core.compression, using a
different version of zlib, etc.).
This is fundamentally different from cases where
finalize_object_file() is operating over a file which uses the hash
value as a checksum of the contents. In other words, a pair of
identical loose objects can be stored using different bytes on disk,
and that should not be treated as a collision.
- We already use the path of the loose object as its hash value /
object name, so checking for collisions at the content level doesn't
add anything.
Adding a content-level collision check would have to happen at a
higher level than in finalize_object_file(), since (avoiding race
conditions) writing an object loose which already exists in the
repository will prevent us from even reaching finalize_object_file()
via the object freshening code.
There is a collision check in index-pack via its `check_collision()`
function, but there isn't an analogous function in unpack-objects,
which just feeds the result to write_object_file().
So skipping the collision check here does not change for better or
worse the hardness of loose object writes.
As a small note related to the latter bullet point above, we must teach
the tmp-objdir routines to similarly skip the content-level collision
checks when calling migrate_one() on a loose object file, which we do by
setting the FOF_SKIP_COLLISION_CHECK bit when we are inside of a loose
object shard.
Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Jeff King <peff@peff.net>
Helped-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
As soon as we've tried to link() a temporary object into place, we then
unlink() the tempfile immediately, whether we were successful or not.
For the success case, this is because we no longer need the old file
(it's now linked into place).
For the error case, there are two outcomes. Either we got EEXIST, in
which case we consider the collision to be a noop. Or we got a system
error, in which we case we are just cleaning up after ourselves.
Using a single line for all of these cases has some problems:
- in the error case, our unlink() may clobber errno, which we use in
the error message
- for the collision case, there's a FIXME that indicates we should do
a collision check. In preparation for implementing that, we'll need
to actually hold on to the file.
Split these three cases into their own calls to unlink_or_warn(). This
is more verbose, but lets us do the right thing in each case.
Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We prefer link()/unlink() to rename() for object files, with the idea
that we should prefer the data that is already on disk to what is
incoming. But we may fall back to rename() if the user has configured us
to do so, or if the filesystem seems not to support cross-directory
links. This loses the "prefer what is on disk" property.
We can mitigate this somewhat by trying to stat() the destination
filename before doing the rename. This is racy, since the object could
be created between the stat() and rename() calls. But in practice it is
expanding the definition of "what is already on disk" to be the point
that the function is called. That is enough to deal with any potential
attacks where an attacker is trying to collide hashes with what's
already in the repository.
Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Typofix.
* ak/typofix-2.46-maint:
upload-pack: fix a typo
sideband: fix a typo
setup: fix a typo
run-command: fix a typo
revision: fix a typo
refs: fix typos
rebase: fix a typo
read-cache-ll: fix a typo
pretty: fix a typo
object-file: fix a typo
merge-ort: fix typos
merge-ll: fix a typo
http: fix a typo
gpg-interface: fix a typo
git-p4: fix typos
git-instaweb: fix a typo
fsmonitor-settings: fix a typo
diffcore-rename: fix typos
config.mak.dev: fix a typo
The `odb_mkstemp()` and `odb_pack_keep()` functions are quite clearly
tied to the object store, but regardless of that they are located in
"environment.c". Move them over, which also helps to get rid of
dependencies on `the_repository` in the environment subsystem.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The `get_object_directory()` function retrieves the path to the object
directory for `the_repository`. Make it accept a `struct repository`
such that it can work on arbitrary repositories and make it part of the
repository subsystem. This reduces our reliance on `the_repository` and
clarifies scope.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
More leak fixes.
* ps/leakfixes-part-4: (22 commits)
builtin/diff: free symmetric diff members
diff: free state populated via options
builtin/log: fix leak when showing converted blob contents
userdiff: fix leaking memory for configured diff drivers
builtin/format-patch: fix various trivial memory leaks
diff: fix leak when parsing invalid ignore regex option
unpack-trees: clear index when not propagating it
sequencer: release todo list on error paths
merge-ort: unconditionally release attributes index
builtin/fast-export: plug leaking tag names
builtin/fast-export: fix leaking diff options
builtin/fast-import: plug trivial memory leaks
builtin/notes: fix leaking `struct notes_tree` when merging notes
builtin/rebase: fix leaking `commit.gpgsign` value
config: fix leaking comment character config
submodule-config: fix leaking name entry when traversing submodules
read-cache: fix leaking hashfile when writing index fails
bulk-checkin: fix leaking state TODO
object-name: fix leaking symlink paths in object context
object-file: fix memory leak when reading corrupted headers
...
When reading corrupt object headers in `read_loose_object()`, we bail
out immediately. This causes a memory leak though because we would have
already initialized the zstream in `unpack_loose_header()`, and it is
the callers responsibility to finish the zstream even on error. While
this feels weird, other callsites do it correctly already.
Fix this leak by ending the zstream even on errors. We may want to
revisit this interface in the future such that the callee handles this
for us already when there was an error.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The "fsck_error" callback is designed to report the objects-related
error messages. It accepts two parameter "oid" and "object_type" which
is not generic. In order to provide a unified callback which can report
either objects or refs, remove the objects-related parameters and add
the generic parameter "void *fsck_report".
Create a new "fsck_object_report" structure which incorporates the
removed parameters "oid" and "object_type". Then change the
corresponding references to adapt to new "fsck_error" callback.
Mentored-by: Patrick Steinhardt <ps@pks.im>
Mentored-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: shejialuo <shejialuo@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
A CPP macro USE_THE_REPOSITORY_VARIABLE is introduced to help
transition the codebase to rely less on the availability of the
singleton the_repository instance.
* ps/use-the-repository:
hex: guard declarations with `USE_THE_REPOSITORY_VARIABLE`
t/helper: remove dependency on `the_repository` in "proc-receive"
t/helper: fix segfault in "oid-array" command without repository
t/helper: use correct object hash in partial-clone helper
compat/fsmonitor: fix socket path in networked SHA256 repos
replace-object: use hash algorithm from passed-in repository
protocol-caps: use hash algorithm from passed-in repository
oidset: pass hash algorithm when parsing file
http-fetch: don't crash when parsing packfile without a repo
hash-ll: merge with "hash.h"
refs: avoid include cycle with "repository.h"
global: introduce `USE_THE_REPOSITORY_VARIABLE` macro
hash: require hash algorithm in `empty_tree_oid_hex()`
hash: require hash algorithm in `is_empty_{blob,tree}_oid()`
hash: make `is_null_oid()` independent of `the_repository`
hash: convert `oidcmp()` and `oideq()` to compare whole hash
global: ensure that object IDs are always padded
hash: require hash algorithm in `oidread()` and `oidclr()`
hash: require hash algorithm in `hasheq()`, `hashcmp()` and `hashclr()`
hash: drop (mostly) unused `is_empty_{blob,tree}_sha1()` functions
I'm not sure exactly how to trigger the leak, but it seems fairly
obvious that the `content' buffer should be freed even if
convert_object_file() fails. Noticed while working in this area
on unrelated things.
Signed-off-by: Eric Wong <e@80x24.org>
Acked-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Building with "-Werror -Wwrite-strings" is now supported.
* ps/no-writable-strings: (27 commits)
config.mak.dev: enable `-Wwrite-strings` warning
builtin/merge: always store allocated strings in `pull_twohead`
builtin/rebase: always store allocated string in `options.strategy`
builtin/rebase: do not assign default backend to non-constant field
imap-send: fix leaking memory in `imap_server_conf`
imap-send: drop global `imap_server_conf` variable
mailmap: always store allocated strings in mailmap blob
revision: always store allocated strings in output encoding
remote-curl: avoid assigning string constant to non-const variable
send-pack: always allocate receive status
parse-options: cast long name for OPTION_ALIAS
http: do not assign string constant to non-const field
compat/win32: fix const-correctness with string constants
pretty: add casts for decoration option pointers
object-file: make `buf` parameter of `index_mem()` a constant
object-file: mark cached object buffers as const
ident: add casts for fallback name and GECOS
entry: refactor how we remove items for delayed checkouts
line-log: always allocate the output prefix
line-log: stop assigning string constant to file parent buffer
...
Use of the `the_repository` variable is deprecated nowadays, and we
slowly but steadily convert the codebase to not use it anymore. Instead,
callers should be passing down the repository to work on via parameters.
It is hard though to prove that a given code unit does not use this
variable anymore. The most trivial case, merely demonstrating that there
is no direct use of `the_repository`, is already a bit of a pain during
code reviews as the reviewer needs to manually verify claims made by the
patch author. The bigger problem though is that we have many interfaces
that implicitly rely on `the_repository`.
Introduce a new `USE_THE_REPOSITORY_VARIABLE` macro that allows code
units to opt into usage of `the_repository`. The intent of this macro is
to demonstrate that a certain code unit does not use this variable
anymore, and to keep it from new dependencies on it in future changes,
be it explicit or implicit
For now, the macro only guards `the_repository` itself as well as
`the_hash_algo`. There are many more known interfaces where we have an
implicit dependency on `the_repository`, but those are not guarded at
the current point in time. Over time though, we should start to add
guards as required (or even better, just remove them).
Define the macro as required in our code units. As expected, most of our
code still relies on the global variable. Nearly all of our builtins
rely on the variable as there is no way yet to pass `the_repository` to
their entry point. For now, declare the macro in "biultin.h" to keep the
required changes at least a little bit more contained.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The `empty_tree_oid_hex()` function use `the_repository` to derive the
hash function that shall be used. Require callers to pass in the hash
algorithm to get rid of this implicit dependency.
While at it, remove the unused `empty_blob_oid_hex()` function.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The `oidcmp()` and `oideq()` functions only compare the prefix length as
specified by the given hash algorithm. This mandates that the object IDs
have a valid hash algorithm set, or otherwise we wouldn't be able to
figure out that prefix. As we do not have a hash algorithm in many
cases, for example when handling null object IDs, this assumption cannot
always be fulfilled. We thus have a fallback in place that instead uses
`the_repository` to derive the hash function. This implicit dependency
is hidden away from callers and can be quite surprising, especially in
contexts where there may be no repository.
In theory, we can adapt those functions to always memcmp(3P) the whole
length of their hash arrays. But there exist a couple of sites where we
populate `struct object_id`s such that only the prefix of its hash that
is actually used by the hash algorithm is populated. The remaining bytes
are left uninitialized. The fact that those bytes are uninitialized also
leads to warnings under Valgrind in some places where we copy those
bytes.
Refactor callsites where we populate object IDs to always initialize all
bytes. This also allows us to get rid of `oidcpy_with_padding()`, for
one because the input is now fully initialized, and because `oidcpy()`
will now always copy the whole hash array.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Both `oidread()` and `oidclr()` use `the_repository` to derive the hash
function that shall be used. Require callers to pass in the hash
algorithm to get rid of this implicit dependency.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The `buf` parameter of `index_mem()` is a non-constant string. This will
break once we enable `-Wwrite-strings` because we also pass constants
from at least one callsite.
Adapt the parameter to be a constant. As we cannot free the buffer
without casting now, this also requires us to move the lifetime of the
nested buffer around.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The buffers of cached objects are never modified, but are still stored
as a non-constant pointer. This will cause a compiler warning once we
enable the `-Wwrite-strings` compiler warning as we assign an empty
constant string when initializing the static `empty_tree` cached object.
Convert the field to be constant. This requires us to shuffle around
the code a bit because we memcpy(3P) into the allocated buffer in
`pretend_object_file()`. This is easily fixed though by allocating the
buffer into a temporary variable first.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In `resolve_gitlink_ref()` we implicitly rely on `the_repository` to
look up the submodule ref store. Now that we can look up submodule ref
stores for arbitrary repositories we can improve this function to
instead accept a repository as parameter for which we want to resolve
the gitlink.
Do so and adjust callers accordingly.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Work to support a repository that work with both SHA-1 and SHA-256
hash algorithms has started.
* eb/hash-transition: (30 commits)
t1016-compatObjectFormat: add tests to verify the conversion between objects
t1006: test oid compatibility with cat-file
t1006: rename sha1 to oid
test-lib: compute the compatibility hash so tests may use it
builtin/ls-tree: let the oid determine the output algorithm
object-file: handle compat objects in check_object_signature
tree-walk: init_tree_desc take an oid to get the hash algorithm
builtin/cat-file: let the oid determine the output algorithm
rev-parse: add an --output-object-format parameter
repository: implement extensions.compatObjectFormat
object-file: update object_info_extended to reencode objects
object-file-convert: convert commits that embed signed tags
object-file-convert: convert commit objects when writing
object-file-convert: don't leak when converting tag objects
object-file-convert: convert tag objects when writing
object-file-convert: add a function to convert trees between algorithms
object: factor out parse_mode out of fast-import and tree-walk into in object.h
cache: add a function to read an OID of a specific algorithm
tag: sign both hashes
commit: export add_header_signature to support handling signatures on tags
...
Each of these were checked with
gcc -E -I. ${SOURCE_FILE} | grep ${HEADER_FILE}
to ensure that removing the direct inclusion of the header actually
resulted in that header no longer being included at all (i.e. that
no other header pulled it in transitively).
...except for a few cases where we verified that although the header
was brought in transitively, nothing from it was directly used in
that source file. These cases were:
* builtin/credential-cache.c
* builtin/pull.c
* builtin/send-pack.c
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Update check_object_signature to find the hash algorithm the exising
signature uses, and to use the same hash algorithm when recomputing it
to check the signature is valid.
This will be useful when teaching git ls-tree to display objects
encoded with the compat hash algorithm.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
oid_object_info_extended is updated to detect an oid encoding that
does not match the current repository, use repo_oid_to_algop to find
the correspoding oid in the current repository and to return the data
for the oid.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
To create the proper signatures for commit objects both versions of
the commit object need to be generated and signed. After that it is
a waste to throw away the work of generating the compatibility hash
so update write_object_file_flags to take a compatibility hash input
parameter that it can use to skip the work of generating the
compatability hash.
Update the places that don't generate the compatability hash to
pass NULL so it is easy to tell write_object_file_flags should
not attempt to use their compatability hash.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
To implement SHA1 compatibility on SHA256 repositories the loose
object map needs to be updated whenver a loose object is written.
Updating the loose object map this way allows git to support
the old hash algorithm in constant time.
The functions write_loose_object, and stream_loose_object are
the only two functions that write to the loose object store.
Update stream_loose_object to compute the compatibiilty hash, update
the loose object, and then call repo_add_loose_object_map to update
the loose object map.
Update write_object_file_flags to convert the object into
it's compatibility encoding, hash the compatibility encoding,
write the object, and then update the loose object map.
Update force_object_loose to lookup the hash of the compatibility
encoding, write the loose object, and then update the loose object
map.
Update write_object_file_literally to convert the object into it's
compatibility hash encoding, hash the compatibility enconding, write
the object, and then update the loose object map, when the type string
is a known type. For objects with an unknown type this results in a
partially broken repository, as the objects are not mapped.
The point of write_object_file_literally is to generate a partially
broken repository for testing. For testing skipping writing the loose
object map is much more useful than refusing to write the broken
object at all.
Except that the loose objects are updated before the loose object map
I have not done any analysis to see how robust this scheme is in the
event of failure.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
As the code is written today index_bulk_checkin only accepts blobs.
Remove the enum object_type parameter and rename index_bulk_checkin to
index_blob_bulk_checkin, index_stream to index_blob_stream,
deflate_to_pack to deflate_blob_to_pack, stream_to_pack to
stream_blob_to_pack, to make this explicit.
Not supporting commits, tags, or trees has no downside as it is not
currently supported now, and commits, tags, and trees being smaller by
design do not have the problem that the problem that index_bulk_checkin
was built to solve.
Before we start adding code to support the hash function transition
supporting additional objects types in index_bulk_checkin has no real
additional cost, just an extra function parameter to know what the
object type is. Once we begin the hash function transition this is not
the case.
The hash function transition document specifies that a repository with
compatObjectFormat enabled will compute and store both the SHA-1 and
SHA-256 hash of every object in the repository.
What makes this a challenge is that it is not just an additional hash
over the same object. Instead the hash function transition document
specifies that the compatibility hash (specified with
compatObjectFormat) be computed over the equivalent object that another
git repository whose storage hash (specified with objectFormat) would
store. When comparing equivalent repositories built with different
storage hash functions, the oids embedded in objects used to refer to
other objects differ and the location of signatures within objects
differ.
As blob objects have neither oids referring to other objects nor stored
signatures their storage hash and their compatibility hash are computed
over the same object.
The other kinds of objects: trees, commits, and tags, all store oids
referring to other objects. Signatures are stored in commit and tag
objects. As oids and the tags to store signatures are not the same size
in repositories built with different storage hashes the size of the
equivalent objects are also different.
A version of index_bulk_checkin that supports more than just blobs when
computing both the SHA-1 and the SHA-256 of every object added would
need a different, and more expensive structure. The structure is more
expensive because it would be required to temporarily buffering the
equivalent object the compatibility hash needs to be computed over.
A temporary object is needed, because before a hash over an object can
computed it's object header needs to be computed. One of the members of
the object header is the entire size of the object. To know the size of
an equivalent object an entire pass over the original object needs to be
made, as trees, commits, and tags are composed of a variable number of
variable sized pieces. Unfortunately there is no formula to compute the
size of an equivalent object from just the size of the original object.
Avoid all of those future complications by limiting index_bulk_checkin
to only work on blobs.
Inspired-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Mark-up unused parameters in the code so that we can eventually
enable -Wunused-parameter by default.
* jk/unused-parameter:
t/helper: mark unused callback void data parameters
tag: mark unused parameters in each_tag_name_fn callbacks
rev-parse: mark unused parameter in for_each_abbrev callback
replace: mark unused parameter in each_mergetag_fn callback
replace: mark unused parameter in ref callback
merge-tree: mark unused parameter in traverse callback
fsck: mark unused parameters in various fsck callbacks
revisions: drop unused "opt" parameter in "tweak" callbacks
count-objects: mark unused parameter in alternates callback
am: mark unused keep_cr parameters
http-push: mark unused parameter in xml callback
http: mark unused parameters in curl callbacks
do_for_each_ref_helper(): mark unused repository parameter
test-ref-store: drop unimplemented reflog-expire command
There are a few callback functions which are used with the fsck code,
but it's natural that not all callbacks need all parameters. For
reporting, even something as obvious as "the oid of the object which had
a problem" is not always used, as some callers are only checking a
single object in the first place. And for both reporting and walking,
things like void data pointers and the fsck_options aren't always
necessary.
But since each such parameter is used by _some_ callback, we have to
keep them in the interface. Mark the unused ones in specific callbacks
to avoid triggering -Wunused-parameter.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
alloc_nr, ALLOC_GROW, and ALLOC_GROW_BY are commonly used macros for
dynamic array allocation. Moving these macros to git-compat-util.h with
the other alloc macros focuses alloc.[ch] to allocation for Git objects
and additionally allows us to remove inclusions to alloc.h from files
that solely used the above macros.
Signed-off-by: Calvin Wan <calvinwan@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This also made it clear that several .c files that depended upon path.h
were missing a #include for it; add the missing includes while at it.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
More header clean-up.
* en/header-split-cache-h-part-2: (22 commits)
reftable: ensure git-compat-util.h is the first (indirect) include
diff.h: reduce unnecessary includes
object-store.h: reduce unnecessary includes
commit.h: reduce unnecessary includes
fsmonitor: reduce includes of cache.h
cache.h: remove unnecessary headers
treewide: remove cache.h inclusion due to previous changes
cache,tree: move basic name compare functions from read-cache to tree
cache,tree: move cmp_cache_name_compare from tree.[ch] to read-cache.c
hash-ll.h: split out of hash.h to remove dependency on repository.h
tree-diff.c: move S_DIFFTREE_IFXMIN_NEQ define from cache.h
dir.h: move DTYPE defines from cache.h
versioncmp.h: move declarations for versioncmp.c functions from cache.h
ws.h: move declarations for ws.c functions from cache.h
match-trees.h: move declarations for match-trees.c functions from cache.h
pkt-line.h: move declarations for pkt-line.c functions from cache.h
base85.h: move declarations for base85.c functions from cache.h
copy.h: move declarations for copy.c functions from cache.h
server-info.h: move declarations for server-info.c functions from cache.h
packfile.h: move pack_window and pack_entry from cache.h
...
Geometric repacking ("git repack --geometric=<n>") in a repository
that borrows from an alternate object database had various corner
case bugs, which have been corrected.
* ps/fix-geom-repack-with-alternates:
repack: disable writing bitmaps when doing a local repack
repack: honor `-l` when calculating pack geometry
t/helper: allow chmtime to print verbosely without modifying mtime
pack-objects: extend test coverage of `--stdin-packs` with alternates
pack-objects: fix error when same packfile is included and excluded
pack-objects: fix error when packing same pack twice
pack-objects: split out `--stdin-packs` tests into separate file
repack: fix generating multi-pack-index with only non-local packs
repack: fix trying to use preferred pack in alternates
midx: fix segfault with no packs and invalid preferred pack
In order to write a bitmap, we need to have full coverage of all objects
that are about to be packed. In the traditional non-multi-pack-index
world this meant we need to do a full repack of all objects into a
single packfile. But in the new multi-pack-index world we can get away
with writing bitmaps when we have multiple packfiles as long as the
multi-pack-index covers all objects.
This is not always the case though. When asked to perform a repack of
local objects, only, then we cannot guarantee to have full coverage of
all objects regardless of whether we do a full repack or a repack with a
multi-pack-index. The end result is that writing the bitmap will fail in
both worlds:
$ git multi-pack-index write --stdin-packs --bitmap <packfiles
warning: Failed to write bitmap index. Packfile doesn't have full closure (object 1529341d78cf45377407369acb0f4ff2b5cdae42 is missing)
error: could not write multi-pack bitmap
Now there are two different ways to fix this. The first one would be to
amend git-multi-pack-index(1) to disable writing bitmaps when we notice
that we don't have full object coverage.
- We don't have enough information in git-multi-pack-index(1) in
order to tell whether the local repository _should_ have full
coverage. Because even when connected to an alternate object
directory, it may be the case that we still have all objects
around in the main object database.
- git-multi-pack-index(1) is quite a low-level tool. Automatically
disabling functionality that it was asked to provide does not feel
like the right thing to do.
We can easily fix it at a higher level in git-repack(1) though. When
asked to only include local objects via `-l` and when connected to an
alternate object directory then we will override the user's ask and
disable writing bitmaps with a warning. This is similar to what we do in
git-pack-objects(1), where we also disable writing bitmaps in case we
omit an object from the pack.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* ab/remove-implicit-use-of-the-repository:
libs: use "struct repository *" argument, not "the_repository"
post-cocci: adjust comments for recent repo_* migration
cocci: apply the "revision.h" part of "the_repository.pending"
cocci: apply the "rerere.h" part of "the_repository.pending"
cocci: apply the "refs.h" part of "the_repository.pending"
cocci: apply the "promisor-remote.h" part of "the_repository.pending"
cocci: apply the "packfile.h" part of "the_repository.pending"
cocci: apply the "pretty.h" part of "the_repository.pending"
cocci: apply the "object-store.h" part of "the_repository.pending"
cocci: apply the "diff.h" part of "the_repository.pending"
cocci: apply the "commit.h" part of "the_repository.pending"
cocci: apply the "commit-reach.h" part of "the_repository.pending"
cocci: apply the "cache.h" part of "the_repository.pending"
cocci: add missing "the_repository" macros to "pending"
cocci: sort "the_repository" rules by header
cocci: fix incorrect & verbose "the_repository" rules
cocci: remove dead rule from "the_repository.pending.cocci"
In preceding commits we changed many calls to macros that were
providing a "the_repository" argument to invoke corresponding repo_*()
function instead. Let's follow-up and adjust references to those in
comments, which coccinelle didn't (and inherently can't) catch.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Apply the part of "the_repository.pending.cocci" pertaining to
"object-store.h".
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
With the change in the last commit to move several functions to
write-or-die.h, csum-file.h no longer needs to include cache.h.
However, removing that include forces several other C files, which
directly or indirectly dependend upon csum-file.h's inclusion of
cache.h, to now be more explicit about their dependencies.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This is another step towards letting us remove the include of cache.h in
strbuf.c. It does mean that we also need to add includes of abspath.h
in a number of C files.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Dozens of files made use of gettext functions, without explicitly
including gettext.h. This made it more difficult to find which files
could remove a dependence on cache.h. Make C files explicitly include
gettext.h if they are using it.
However, while compat/fsmonitor/fsm-ipc-darwin.c should also gain an
include of gettext.h, it was left out to avoid conflicting with an
in-flight topic.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
More work towards -Wunused.
* jk/unused-post-2.39-part2: (21 commits)
help: mark unused parameter in git_unknown_cmd_config()
run_processes_parallel: mark unused callback parameters
userformat_want_item(): mark unused parameter
for_each_commit_graft(): mark unused callback parameter
rewrite_parents(): mark unused callback parameter
fetch-pack: mark unused parameter in callback function
notes: mark unused callback parameters
prio-queue: mark unused parameters in comparison functions
for_each_object: mark unused callback parameters
list-objects: mark unused callback parameters
mark unused parameters in signal handlers
run-command: mark error routine parameters as unused
mark "pointless" data pointers in callbacks
ref-filter: mark unused callback parameters
http-backend: mark unused parameters in virtual functions
http-backend: mark argc/argv unused
object-name: mark unused parameters in disambiguate callbacks
serve: mark unused parameters in virtual functions
serve: use repository pointer to get config
ls-refs: drop config caching
...
The for_each_{loose,packed}_object interface uses callback functions,
but not every callback needs all of the parameters. Mark the unused ones
to satisfy -Wunused-parameter.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This allows us to replace includes of cache.h with includes of the much
smaller alloc.h in many places. It does mean that we also need to add
includes of alloc.h in a number of C files.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
"git hash-object" now checks that the resulting object is well
formed with the same code as "git fsck".
* jk/hash-object-fsck:
fsck: do not assume NUL-termination of buffers
hash-object: use fsck for object checks
fsck: provide a function to fsck buffer without object struct
t: use hash-object --literally when created malformed objects
t7030: stop using invalid tag name
t1006: stop using 0-padded timestamps
t1007: modernize malformed object tests
Since c879daa237 (Make hash-object more robust against malformed
objects, 2011-02-05), we've done some rudimentary checks against objects
we're about to write by running them through our usual parsers for
trees, commits, and tags.
These parsers catch some problems, but they are not nearly as careful as
the fsck functions (which make sense; the parsers are designed to be
fast and forgiving, bailing only when the input is unintelligible). We
are better off doing the more thorough fsck checks when writing objects.
Doing so at write time is much better than writing garbage only to find
out later (after building more history atop it!) that fsck complains
about it, or hosts with transfer.fsckObjects reject it.
This is obviously going to be a user-visible behavior change, and the
test changes earlier in this series show the scope of the impact. But
I'd argue that this is OK:
- the documentation for hash-object is already vague about which
checks we might do, saying that --literally will allow "any
garbage[...] which might not otherwise pass standard object parsing
or git-fsck checks". So we are already covered under the documented
behavior.
- users don't generally run hash-object anyway. There are a lot of
spots in the tests that needed to be updated because creating
garbage objects is something that Git's tests disproportionately do.
- it's hard to imagine anyone thinking the new behavior is worse. Any
object we reject would be a potential problem down the road for the
user. And if they really want to create garbage, --literally is
already the escape hatch they need.
Note that the change here is actually in index_mem(), which handles the
HASH_FORMAT_CHECK flag passed by hash-object. That flag is also used by
"git-replace --edit" to sanity-check the result. Covering that with more
thorough checks likewise seems like a good thing.
Besides being more thorough, there are a few other bonuses:
- we get rid of some questionable stack allocations of object structs.
These don't seem to currently cause any problems in practice, but
they subtly violate some of the assumptions made by the rest of the
code (e.g., the "struct commit" we put on the stack and
zero-initialize will not have a proper index from
alloc_comit_index().
- likewise, those parsed object structs are the source of some small
memory leaks
- the resulting messages are much better. For example:
[before]
$ echo 'tree 123' | git hash-object -t commit --stdin
error: bogus commit object 0000000000000000000000000000000000000000
fatal: corrupt commit
[after]
$ echo 'tree 123' | git.compile hash-object -t commit --stdin
error: object fails fsck: badTreeSha1: invalid 'tree' line format - bad sha1
fatal: refusing to create malformed object
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Commit b25562e63f (object-file: inline calls to read_object(),
2023-01-07) accidentally indented a conditional block with spaces
instead of a tab.
Reported-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
Acked-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The only caller of read_object_file_extended() is the thin wrapper of
repo_read_object_file(). Instead of wrapping, let's just rename the
inner function and let people call it directly. This cleans up the
namespace and reduces confusion.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Our sole caller always passes in "1", so we can just drop the parameter
entirely. Anybody who doesn't want this behavior could easily call
oid_object_info_extended() themselves, as we're just a thin wrapper
around it.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Since read_object() is these days just a thin wrapper around
oid_object_info_extended(), and since it only has two callers, let's
just inline those calls. This has a few positive outcomes:
- it's a net reduction in source code lines
- even though the callers end up with a few extra lines, they're now
more flexible and can use object_info flags directly. So no more
need to convert die_if_corrupt between parameter/flag, and we can
ask for lookup replacement with a flag rather than doing it
ourselves.
- there's one fewer function in an already crowded namespace (e.g.,
the difference between read_object() and read_object_file() was not
immediately obvious; now we only have one of them).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Even in a repository with promisor remote, it is useless to
attempt to lazily attempt fetching an object that is expected to be
commit, because no "filter" mode omits commit objects. Take
advantage of this assumption to fail fast on errors.
* jt/avoid-lazy-fetch-commits:
commit: don't lazy-fetch commits
object-file: emit corruption errors when detected
object-file: refactor map_loose_object_1()
object-file: remove OBJECT_INFO_IGNORE_LOOSE
Instead of relying on errno being preserved across function calls, teach
do_oid_object_info_extended() to itself report object corruption when
it first detects it. There are 3 types of corruption being detected:
- when a replacement object is missing
- when a loose object is corrupt
- when a packed object is corrupt and the object cannot be read
in another way
Note that in the RHS of this patch's diff, a check for ENOENT that was
introduced in 3ba7a06552 (A loose object is not corrupt if it cannot
be read due to EMFILE, 2010-10-28) is also removed. The purpose of this
check is to avoid a false report of corruption if the errno contains
something like EMFILE (or anything that is not ENOENT), in which case
a more generic report is presented. Because, as of this patch, we no
longer rely on such a heuristic to determine corruption, but surface
the error message at the point when we read something that we did not
expect, this check is no longer necessary.
Besides being more resilient, this also prepares for a future patch in
which an indirect caller of do_oid_object_info_extended() will need
such functionality.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This function can do 3 things:
1. Gets an fd given a path
2. Simultaneously gets a path and fd given an OID
3. Memory maps an fd
Keep 3 (renaming the function accordingly) and inline 1 and 2 into their
respective callers.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Its last user was removed in 97b2fa08b6 (fetch-pack: drop
custom loose object cache, 2018-11-12), so we can remove it.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
write_buffer() reports the OS error if it is unable to write. Its only
caller dies in that case, giving some more context in its last message.
Inline this function and show only a single error message that includes
both the context (writing a loose object file) and the OS error. This
shortens the code and simplifies the output.
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When adding an alternate ODB, we check if the alternate has the same
path as the object dir, and if so, we do nothing. However, that
comparison does not resolve symlinks. This makes it possible to add the
object dir as an alternate, which may result in bad behavior. For
example, it can trick "git repack -a -l -d" (possibly run by "git gc")
into thinking that all packs come from an alternate and delete all
objects.
rm -rf test &&
git clone https://github.com/git/git test &&
(
cd test &&
ln -s objects .git/alt-objects &&
# -c repack.updateserverinfo=false silences a warning about not
# being able to update "info/refs", it isn't needed to show the
# bad behavior
GIT_ALTERNATE_OBJECT_DIRECTORIES=".git/alt-objects" git \
-c repack.updateserverinfo=false repack -a -l -d &&
# It's broken!
git status
# Because there are no more objects!
ls .git/objects/pack
)
Fix this by resolving symlinks and relative paths before comparing the
alternate and object dir. This lets us clean up a number of issues noted
in 37a95862c6 (alternates: re-allow relative paths from environment,
2016-11-07):
- Now that we compare the real paths, duplicate detection is no longer
foiled by relative paths.
- Using strbuf_realpath() allows us to "normalize" paths that
strbuf_normalize_path() can't, so we can stop silently ignoring errors
when "normalizing" paths from the environment.
- We now store an absolute path based on getcwd() (the "future
direction" named in 37a95862c6), so chdir()-ing in the process no
longer changes the directory pointed to by the alternate. This is a
change in behavior, but a desirable one.
Signed-off-by: Glen Choo <chooglen@google.com>
Acked-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
More UNUSED annotation to help using -Wunused option with the
compiler.
* jk/unused-anno-more:
ll-merge: mark unused parameters in callbacks
diffcore-pickaxe: mark unused parameters in pickaxe functions
convert: mark unused parameter in null stream filter
apply: mark unused parameters in noop error/warning routine
apply: mark unused parameters in handlers
date: mark unused parameters in handler functions
string-list: mark unused callback parameters
object-file: mark unused parameters in hash_unknown functions
mark unused parameters in trivial compat functions
update-index: drop unused argc from do_reupdate()
submodule--helper: drop unused argc from module_list_compute()
diffstat_consume(): assert non-zero length
The 0'th entry of our hash_algos array fills out the virtual methods
with a series of functions which simply BUG(). This is the right thing
to do, since the point is to catch use of an invalid algo parameter, but
we need to annotate them to appease -Wunused-parameters.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Remove error detection from a function that fetches from promisor
remotes, and make it die when such a fetch fails to bring all the
requested objects, to give an early failure to various operations.
* jt/promisor-remote-fetch-tweak:
promisor-remote: die upon failing fetch
promisor-remote: remove a return value
In a partial clone, an attempt to read a missing object results in an
attempt to fetch that single object. In order to avoid multiple
sequential fetches, which would occur when multiple objects are missing
(which is the typical case), some commands have been taught to prefetch
in a batch: such a command would, in a partial clone, notice that
several objects that it will eventually need are missing, and call
promisor_remote_get_direct() with all such objects at once.
When this batch prefetch fails, these commands fall back to the
sequential fetches. But at $DAYJOB we have noticed that this results in
a bad user experience: a command would take unexpectedly long to finish
(and possibly use up a lot of bandwidth) if the batch prefetch would
fail for some intermittent reason, but all subsequent fetches would
work. It would be a better user experience for such a command would
just fail.
Therefore, make it a fatal error if the prefetch fails and at least one
object being fetched is known to be a promisor object. (The latter
criterion is to make sure that we are not misleading the user that such
an object would be present from the promisor remote. For example, a
missing object may be a result of repository corruption and not because
it is expectedly missing due to the repository being a partial clone.)
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Allow large objects read from a packstream to be streamed into a
loose object file straight, without having to keep it in-core as a
whole.
* hx/unpack-streaming:
unpack-objects: use stream_loose_object() to unpack large objects
core doc: modernize core.bigFileThreshold documentation
object-file.c: add "stream_loose_object()" to handle large object
object-file.c: factor out deflate part of write_loose_object()
object-file.c: refactor write_loose_object() to several steps
unpack-objects: low memory footprint for get_data() in dry_run mode
If we want unpack and write a loose object using "write_loose_object",
we have to feed it with a buffer with the same size of the object, which
will consume lots of memory and may cause OOM. This can be improved by
feeding data to "stream_loose_object()" in a stream.
Add a new function "stream_loose_object()", which is a stream version of
"write_loose_object()" but with a low memory footprint. We will use this
function to unpack large blob object in later commit.
Another difference with "write_loose_object()" is that we have no chance
to run "write_object_file_prepare()" to calculate the oid in advance.
In "write_loose_object()", we know the oid and we can write the
temporary file in the same directory as the final object, but for an
object with an undetermined oid, we don't know the exact directory for
the object.
Still, we need to save the temporary file we're preparing
somewhere. We'll do that in the top-level ".git/objects/"
directory (or whatever "GIT_OBJECT_DIRECTORY" is set to). Once we've
streamed it we'll know the OID, and will move it to its canonical
path.
"freshen_packed_object()" or "freshen_loose_object()" will be called
inside "stream_loose_object()" after obtaining the "oid". After the
temporary file is written, we wants to mark the object to recent and we
may find that where indeed is already the object. We should remove the
temporary and do not leave a new copy of the object.
Helped-by: René Scharfe <l.s.r@web.de>
Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Helped-by: Jiang Xin <zhiyou.jx@alibaba-inc.com>
Signed-off-by: Han Xin <chiyutianyi@gmail.com>
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Split out the part of write_loose_object() that deals with calling
git_deflate() into a utility function, a subsequent commit will
introduce another function that'll make use of it.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>