The `null_oid()` function returns the object ID that only consists of
zeroes. Naturally, this ID also depends on the hash algorithm used, as
the number of zeroes is different between SHA1 and SHA256. Consequently,
the function returns the hash-algorithm-specific null object ID.
This is currently done by depending on `the_hash_algo`, which implicitly
makes us depend on `the_repository`. Refactor the function to instead
pass in the hash algorithm for which we want to retrieve the null object
ID. Adapt callsites accordingly by passing in `the_repository`, thus
bubbling up the dependency on that global variable by one layer.
There are a couple of trivial exceptions for subsystems that already got
rid of `the_repository`. These subsystems instead use the repository
that is available via the calling context:
- "builtin/grep.c"
- "grep.c"
- "refs/debug.c"
There are also two non-trivial exceptions:
- "diff-no-index.c": Here we know that we may not have a repository
initialized at all, so we cannot rely on `the_repository`. Instead,
we adapt `diff_no_index()` to get a `struct git_hash_algo` as
parameter. The only caller is located in "builtin/diff.c", where we
know to call `repo_set_hash_algo()` in case we're running outside of
a Git repository. Consequently, it is fine to continue passing
`the_repository->hash_algo` even in this case.
- "builtin/ls-files.c": There is an in-flight patch series that drops
`USE_THE_REPOSITORY_VARIABLE` in this file, which causes a semantic
conflict because we use `null_oid()` in `show_submodule()`. The
value is passed to `repo_submodule_init()`, which may use the object
ID to resolve a tree-ish in the superproject from which we want to
read the submodule config. As such, the object ID should refer to an
object in the superproject, and consequently we need to use its hash
algorithm.
This means that we could in theory just not bother about this edge
case at all and just use `the_repository` in "diff-no-index.c". But
doing so would feel misdesigned.
Remove the `USE_THE_REPOSITORY_VARIABLE` preprocessor define in
"hash.c".
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
There are a couple of trivial "-Wsign-compare" warnings in "hash.c". Fix
them.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
While we have a "hash.h" header, the actual implementation of the
subsystem is hosted by "object-file.c". This makes it harder than
necessary to find the actual implementation of the hash subsystem and
intermingles the different concerns with one another.
Split out the implementation of hash algorithms into a new, separate
"hash.c" file.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
For the find_exact_renames() function, this allows us to pass the
diff_options structure pointer to the low-level routines. We will use
that to distinguish between the "rename" and "copy" cases.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This is in an effort to make the source index of 'unpack_trees()' as
being const, and thus making the compiler help us verify that we only
access it for reading.
The constification also extended to some of the hashing helpers that get
called indirectly.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We were returning the _address of_ the stored item (or NULL)
instead of the item itself. While this sort of indirection
is useful for insertion (since you can lookup and then
modify), it is unnecessary for read-only lookup. Since the
hash code splits these functions between the internal
lookup_hash_entry function and the public lookup_hash
function, it makes sense for the latter to provide what
users of the library expect.
The result of this was that the index caching returned bogus
results on lookup. We unfortunately didn't catch this
because we were returning a "struct cache_entry **" as a
"void *", and accidentally assigning it to a "struct
cache_entry *".
As it happens, this actually _worked_ most of the time,
because the entries were defined as:
struct cache_entry {
struct cache_entry *next;
...
};
meaning that interpreting a "struct cache_entry **" as a
"struct cache_entry *" would yield an entry where all fields
were totally bogus _except_ for the next pointer, which
pointed to the actual cache entry. When walking the list, we
would look at the bogus "name" field, which was unlikely to
match our lookup, and then proceed to the "real" entry.
The reading of bogus data was silently ignored most of the
time, but could cause a segfault for some data (which seems
to be more common on OS X).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This implements a smarter rename detector for exact renames, which
rather than doing a pairwise comparison (time O(m*n)) will just hash the
files into a hash-table (size O(n+m)), and only do pairwise comparisons
to renames that have the same hash (time O(n+m) except for unrealistic
hash collissions, which we just cull aggressively).
Admittedly the exact rename case is not nearly as interesting as the
generic case, but it's an important case none-the-less. A similar general
approach should work for the generic case too, but even then you do need
to handle the exact renames/copies separately (to avoid the inevitable
added cost factor that comes from the _size_ of the file), so this is
worth doing.
In the expectation that we will indeed do the same hashing trick for the
general rename case, this code uses a generic hash-table implementation
that can be used for other things too. In fact, we might be able to
consolidate some of our existing hash tables with the new generic code
in hash.[ch].
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>