71 Commits (87e0910fb821d125fc06d25a8309d3c668045f57)

Author SHA1 Message Date
Junio C Hamano 246569bf83 Merge branch 'ps/hash-cleanup'
Further code clean-up on the use of hash functions.  Now the
context object knows what hash function it is working with.

* ps/hash-cleanup:
  global: adapt callers to use generic hash context helpers
  hash: provide generic wrappers to update hash contexts
  hash: stop typedeffing the hash context
  hash: convert hashing context to a structure
2025-02-10 10:18:31 -08:00
Junio C Hamano 9d0e81e2ae Merge branch 'ps/zlib-ng'
The code paths to interact with zlib have been cleaned up in
preparation for building with zlib-ng.

* ps/zlib-ng:
  ci: make "linux-musl" job use zlib-ng
  ci: switch linux-musl to use Meson
  compat/zlib: allow use of zlib-ng as backend
  git-zlib: cast away potential constness of `next_in` pointer
  compat/zlib: provide stubs for `deflateSetHeader()`
  compat/zlib: provide `deflateBound()` shim centrally
  git-compat-util: move include of "compat/zlib.h" into "git-zlib.h"
  compat: introduce new "zlib.h" header
  git-compat-util: drop `z_const` define
  compat: drop `uncompress2()` compatibility shim
2025-02-06 14:56:45 -08:00
Patrick Steinhardt 0578f1e66a global: adapt callers to use generic hash context helpers
Adapt callers to use generic hash context helpers instead of using the
hash algorithm to update them. This makes the callsites easier to reason
about and removes the possibility that the wrong hash algorithm is used
to update the hash context's state. And as a nice side effect this also
gets rid of a bunch of users of `the_hash_algo`.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-01-31 10:06:11 -08:00
Patrick Steinhardt 7346e340f1 hash: stop typedeffing the hash context
We generally avoid using `typedef` in the Git codebase. One exception
though is the `git_hash_ctx`, likely because it used to be a union
rather than a struct until the preceding commit refactored it. But now
that it is a normal `struct` there isn't really a need for a typedef
anymore.

Drop the typedef and adapt all callers accordingly.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-01-31 10:06:10 -08:00
Patrick Steinhardt 41f1a8435a git-compat-util: move include of "compat/zlib.h" into "git-zlib.h"
We include "compat/zlib.h" in "git-compat-util.h", which is
unnecessarily broad given that we only have a small handful of files
that use the zlib library. Move the header into "git-zlib.h" instead and
adapt users of zlib to include that header.

One exception is the reftable library, as we don't want to use the
Git-specific wrapper of zlib there, so we include "compat/zlib.h"
instead. Furthermore, we move the include into "reftable/system.h" so
that users of the library other than Git can wire up zlib themselves.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-01-28 13:03:22 -08:00
Taylor Blau a8dd3821fe csum-file: introduce hashfile_checkpoint_init()
In 106140a99f (builtin/fast-import: fix segfault with unsafe SHA1
backend, 2024-12-30) and 9218c0bfe1 (bulk-checkin: fix segfault with
unsafe SHA1 backend, 2024-12-30), we observed the effects of failing to
initialize a hashfile_checkpoint with the same hash function
implementation as is used by the hashfile it is used to checkpoint.

While both 106140a99f and 9218c0bfe1 work around the immediate crash,
changing the hash function implementation within the hashfile API to,
for example, the non-unsafe variant would re-introduce the crash. This
is a result of the tight coupling between initializing hashfiles and
hashfile_checkpoints.

Introduce and use a new function which ensures that both parts of a
hashfile and hashfile_checkpoint pair use the same hash function
implementation to avoid such crashes.

A few things worth noting:

  - In the change to builtin/fast-import.c::stream_blob(), we can see
    that by removing the explicit reference to
    'the_hash_algo->unsafe_init_fn()', we are hardened against the
    hashfile API changing away from the_hash_algo (or its unsafe
    variant) in the future.

  - The bulk-checkin code no longer needs to explicitly zero-initialize
    the hashfile_checkpoint, since it is now done as a result of calling
    'hashfile_checkpoint_init()'.

  - Also in the bulk-checkin code, we add an additional call to
    prepare_to_stream() outside of the main loop in order to initialize
    'state->f' so we know which hash function implementation to use when
    calling 'hashfile_checkpoint_init()'.

    This is OK, since subsequent 'prepare_to_stream()' calls are noops.
    However, we only need to call 'prepare_to_stream()' when we have the
    HASH_WRITE_OBJECT bit set in our flags. Without that bit, calling
    'prepare_to_stream()' does not assign 'state->f', so we have nothing
    to initialize.

  - Other uses of the 'checkpoint' in 'deflate_blob_to_pack()' are
    appropriately guarded.

Helped-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-01-23 10:28:17 -08:00
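As a rough illustration of the a8dd3821fe entry above, the sketch below shows why it helps to derive a checkpoint's hash implementation from the hashfile it checkpoints rather than from a global. All `toy_*` names, the two fake "implementations", and the struct layouts are assumptions made for this example; they are not Git's actual definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_hash_algo;

/* Hypothetical hash context that remembers which implementation owns it. */
struct toy_hash_ctx {
	const struct toy_hash_algo *algop;
	unsigned long state;
};

struct toy_hash_algo {
	const char *name;
	void (*init_fn)(struct toy_hash_ctx *ctx);
};

static void toy_fast_init(struct toy_hash_ctx *ctx) { ctx->state = 1; }
static void toy_safe_init(struct toy_hash_ctx *ctx) { ctx->state = 2; }

/* Two stand-in implementations, e.g. an "unsafe" fast one and a safe one. */
static const struct toy_hash_algo toy_fast = { "fast", toy_fast_init };
static const struct toy_hash_algo toy_safe = { "safe", toy_safe_init };

struct toy_hashfile {
	const struct toy_hash_algo *algop; /* chosen once, when the file is set up */
	struct toy_hash_ctx ctx;
};

struct toy_checkpoint {
	size_t offset;
	struct toy_hash_ctx ctx;
};

static void toy_hashfile_init(struct toy_hashfile *f, const struct toy_hash_algo *algop)
{
	f->algop = algop;
	f->ctx.algop = algop;
	algop->init_fn(&f->ctx);
}

/* Zero the checkpoint, then initialize it with the hashfile's own algorithm. */
static void toy_checkpoint_init(const struct toy_hashfile *f, struct toy_checkpoint *cp)
{
	assert(f->algop != NULL);
	memset(cp, 0, sizeof(*cp));
	cp->ctx.algop = f->algop;
	f->algop->init_fn(&cp->ctx);
}
```

Because the checkpoint never names an implementation itself, switching the hashfile from one implementation to the other cannot leave the pair mismatched.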
Taylor Blau f0c266af4e csum-file.c: use unsafe_hash_algo()
Instead of calling the unsafe_ hash function variants directly, make use
of the shared 'algop' pointer by initializing it to:

    f->algop = unsafe_hash_algo(the_hash_algo);

, thus making all calls use the unsafe variants directly.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-01-23 10:28:17 -08:00
Taylor Blau 5fcc683338 csum-file.c: extract algop from hashfile_checksum_valid()
Perform a similar transformation as in the previous commit, but focused
instead on hashfile_checksum_valid(). This function does not work with a
hashfile structure itself, and instead validates the raw contents of a
file written using the hashfile API.

We'll want to be prepared for a similar change to this function in the
future, so prepare ourselves for that by extracting 'the_hash_algo' into
its own field for use within this function.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-01-23 10:28:16 -08:00
Taylor Blau 48524fac64 csum-file: store the hash algorithm as a struct field
Throughout the hashfile API, we rely on a reference to 'the_hash_algo',
and call its _unsafe function variants directly.

Prepare for a future change where we may use a different 'git_hash_algo'
pointer (instead of just relying on 'the_hash_algo' throughout) by
making the 'git_hash_algo' pointer a member of the 'hashfile' structure
itself.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-01-23 10:28:16 -08:00
Junio C Hamano ba8f6018b5 csum-file: fix -Wsign-compare warning on 32-bit platform
On 32-bit platforms, ssize_t may be "int" while size_t may be
"unsigned int".  At times we compare the number of bytes we read
stored in a ssize_t variable with "unsigned int", but that is done
after we check that we did not get an error return (which is
negative---and that is the whole reason why we used ssize_t and not
size_t), so these comparisons are safe.

But compilers may not realize that.  Cast these to size_t to work
around the false positives.  On platforms with size_t/ssize_t wider
than a normal int, this won't be an issue.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-06 20:20:03 +09:00
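The reasoning in ba8f6018b5 above can be sketched in a few lines: once the negative (error) return has been ruled out, casting the `ssize_t` to `size_t` cannot change its value, so the comparison is safe and the compiler no longer sees a mixed-sign comparison. The helper name `is_short_read` is hypothetical and only serves this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h> /* ssize_t (POSIX) */

/*
 * Hypothetical helper: report whether a successful read returned fewer
 * bytes than requested. Callers must handle the error return first.
 */
static int is_short_read(ssize_t nread, size_t want)
{
	assert(nread >= 0); /* error case was already checked by the caller */
	/* nread is non-negative here, so the cast preserves its value. */
	return (size_t)nread < want;
}
```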
Patrick Steinhardt 41f43b8243 global: mark code units that generate warnings with `-Wsign-compare`
Mark code units that generate warnings with `-Wsign-compare`. This
allows for a structured approach to get rid of all such warnings over
time in a way that can be easily measured.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-06 20:20:02 +09:00
Taylor Blau 1b9e9be8b4 csum-file.c: use unsafe SHA-1 implementation when available
Update hashwrite() and friends to use the unsafe_-variants of hashing
functions, calling, e.g., "the_hash_algo->unsafe_update_fn()" instead
of "the_hash_algo->update_fn()".

These callers only use the_hash_algo to produce a checksum, which we
depend on for data integrity, but not for cryptographic purposes, so
these callers are safe to use the unsafe (non-collision detecting) SHA-1
implementation.

To time this, I took a freshly packed copy of linux.git, and ran the
following with and without the OPENSSL_SHA1_UNSAFE=1 build-knob. Both
versions were compiled with -O3:

    $ git for-each-ref --format='%(objectname)' refs/heads refs/tags >in
    $ valgrind --tool=callgrind ~/src/git/git-pack-objects \
        --revs --stdout --all-progress --use-bitmap-index <in >/dev/null

Without OPENSSL_SHA1_UNSAFE=1 (that is, using the collision-detecting
SHA-1 implementation for both cryptographic and non-cryptographic
purposes), we spend a significant amount of our instruction count in
hashwrite():

    $ callgrind_annotate --inclusive=yes | grep hashwrite | head -n1
    159,998,868,413 (79.42%)  /home/ttaylorr/src/git/csum-file.c:hashwrite [/home/ttaylorr/src/git/git-pack-objects]

, and the resulting "clone" takes 19.219 seconds of wall clock time,
18.94 seconds of user time and 0.28 seconds of system time.

Compiling with OPENSSL_SHA1_UNSAFE=1, we spend ~60% fewer instructions
in hashwrite():

    $ callgrind_annotate --inclusive=yes | grep hashwrite | head -n1
     59,164,001,176 (58.79%)  /home/ttaylorr/src/git/csum-file.c:hashwrite [/home/ttaylorr/src/git/git-pack-objects]

, and generate the resulting "clone" much faster, in only 11.597 seconds
of wall time, 11.37 seconds of user time, and 0.23 seconds of system
time, for a ~40% speed-up.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-09-27 11:27:47 -07:00
Junio C Hamano 1b6b2bfae5 Merge branch 'ps/leakfixes-part-4'
More leak fixes.

* ps/leakfixes-part-4: (22 commits)
  builtin/diff: free symmetric diff members
  diff: free state populated via options
  builtin/log: fix leak when showing converted blob contents
  userdiff: fix leaking memory for configured diff drivers
  builtin/format-patch: fix various trivial memory leaks
  diff: fix leak when parsing invalid ignore regex option
  unpack-trees: clear index when not propagating it
  sequencer: release todo list on error paths
  merge-ort: unconditionally release attributes index
  builtin/fast-export: plug leaking tag names
  builtin/fast-export: fix leaking diff options
  builtin/fast-import: plug trivial memory leaks
  builtin/notes: fix leaking `struct notes_tree` when merging notes
  builtin/rebase: fix leaking `commit.gpgsign` value
  config: fix leaking comment character config
  submodule-config: fix leaking name entry when traversing submodules
  read-cache: fix leaking hashfile when writing index fails
  bulk-checkin: fix leaking state TODO
  object-name: fix leaking symlink paths in object context
  object-file: fix memory leak when reading corrupted headers
  ...
2024-08-23 09:02:33 -07:00
Patrick Steinhardt c81dcf630c bulk-checkin: fix leaking state TODO
When flushing a bulk-checkin to disk we also reset the `struct
bulk_checkin_packfile` state. But while we free some of its members,
others aren't being free'd, leading to memory leaks:

  - The temporary packfile name is not getting freed.

  - The `struct hashfile` only gets freed in case we end up calling
    `finalize_hashfile()`. There are code paths though where that is not
    the case, namely when nothing has been written. For this, we need to
    make `free_hashfile()` public.

Fix those leaks.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-08-14 10:07:57 -07:00
Junio C Hamano c3d034df16 csum-file: introduce discard_hashfile()
The hashfile API is used to write out a "hashfile", which has a
final checksum (typically SHA-1) at the end.  An in-core hashfile
structure has up to two file descriptors and a few buffers that can
only be freed by calling a helper function that is private to the
csum-file implementation.

The usual flow of a user of the API is to first open a file
descriptor for writing, obtain a hashfile associated with that write
file descriptor by calling either hashfd() or hashfd_check(), call
hashwrite() a number of times to write data to the file, and then call
finalize_hashfile(), which appends the checksum to the end of the
file, closes file descriptors and releases associated buffers.

But what if a caller finds some error after calling hashfd() to
start the process and/or hashwrite() to send some data to the file,
and wants to abort the operation?  The underlying file descriptor is
often managed by the tempfile API, so aborting will clean the file
out of the filesystem, but the resources associated with the in-core
hashfile structure are lost.

Introduce discard_hashfile() API function to allow them to release
the resources held by a hashfile structure the callers want to
dispose of, and use that in read-cache.c:do_write_index(), which is
a central place that writes the index file.

Mark t2107 as leak-free, as this leak in "update-index --cacheinfo"
test that deliberately makes it fail is now plugged.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-07-26 09:04:02 -07:00
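The abort path that c3d034df16 above describes might look roughly like the following. This is a minimal sketch under stated assumptions: the `toy_*` functions and the two-buffer struct are invented for the example and only mimic the shape of the real API, with no file I/O at all.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical miniature of an in-core hashfile; not Git's real struct. */
struct toy_hashfile {
	unsigned char *buffer;
	unsigned char *check_buffer; /* stays NULL unless verifying */
};

static struct toy_hashfile *toy_hashfd(size_t buffer_len)
{
	struct toy_hashfile *f = calloc(1, sizeof(*f));
	if (!f)
		return NULL;
	f->buffer = malloc(buffer_len);
	if (!f->buffer) {
		free(f);
		return NULL;
	}
	return f;
}

/* Release everything the structure owns without writing a checksum. */
static void toy_discard_hashfile(struct toy_hashfile *f)
{
	if (!f)
		return;
	free(f->buffer);
	free(f->check_buffer); /* free(NULL) is a no-op */
	free(f);
}

/* Returns 0 on success, -1 if the caller hit an error and aborted. */
static int toy_write_index(int simulate_error)
{
	struct toy_hashfile *f = toy_hashfd(8192);
	assert(f != NULL);
	if (simulate_error) {
		toy_discard_hashfile(f); /* abort path: nothing leaks */
		return -1;
	}
	/* A real caller would write data and finalize here instead. */
	toy_discard_hashfile(f);
	return 0;
}
```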
Patrick Steinhardt e7da938570 global: introduce `USE_THE_REPOSITORY_VARIABLE` macro
Use of the `the_repository` variable is deprecated nowadays, and we
slowly but steadily convert the codebase to not use it anymore. Instead,
callers should be passing down the repository to work on via parameters.

It is hard though to prove that a given code unit does not use this
variable anymore. The most trivial case, merely demonstrating that there
is no direct use of `the_repository`, is already a bit of a pain during
code reviews as the reviewer needs to manually verify claims made by the
patch author. The bigger problem though is that we have many interfaces
that implicitly rely on `the_repository`.

Introduce a new `USE_THE_REPOSITORY_VARIABLE` macro that allows code
units to opt into usage of `the_repository`. The intent of this macro is
to demonstrate that a certain code unit does not use this variable
anymore, and to keep it from new dependencies on it in future changes,
be it explicit or implicit.

For now, the macro only guards `the_repository` itself as well as
`the_hash_algo`. There are many more known interfaces where we have an
implicit dependency on `the_repository`, but those are not guarded at
the current point in time. Over time though, we should start to add
guards as required (or even better, just remove them).

Define the macro as required in our code units. As expected, most of our
code still relies on the global variable. Nearly all of our builtins
rely on the variable as there is no way yet to pass `the_repository` to
their entry point. For now, declare the macro in "builtin.h" to keep the
required changes at least a little bit more contained.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-14 10:26:33 -07:00
Patrick Steinhardt f4836570a7 hash: require hash algorithm in `hasheq()`, `hashcmp()` and `hashclr()`
Many of our hash functions have two variants, one receiving a `struct
git_hash_algo` and one that derives it via `the_repository`. Adapt all
of those functions to always require the hash algorithm as input and
drop the variants that do not accept one.

As those functions are now independent of `the_repository`, we can move
them from "hash.h" to "hash-ll.h".

Note that both in this and subsequent commits in this series we always
just pass `the_repository->hash_algo` as input even if it is obvious
that there is a repository in the context that we should be using the
hash from instead. This is done to be on the safe side and not introduce
any regressions. All callsites should eventually be amended to use a
repo passed via parameters, but this is outside the scope of this patch
series.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-14 10:26:32 -07:00
Junio C Hamano 331f20d52d Merge branch 'ew/hash-with-openssl-evp'
Fix-up new-ish code to support OpenSSL EVP API.

* ew/hash-with-openssl-evp:
  treewide: fix various bugs w/ OpenSSL 3+ EVP API
2023-09-13 10:07:57 -07:00
Eric Wong e0b8c84240 treewide: fix various bugs w/ OpenSSL 3+ EVP API
The OpenSSL 3+ EVP API for SHA-* cannot support our prior use cases
supported by other SHA-* implementations.  It has the following
differences:

1. ->init_fn is required before all use
2. struct assignments don't work and require ->clone_fn
3. can't support ->update_fn after ->final_*fn

While fixing cases 1 and 2 is merely the matter of calling ->init_fn and
->clone_fn as appropriate, fixing case 3 requires calling ->final_*fn on
a temporary context that's cloned from the primary context.

Reported-by: Bagas Sanjaya <bagasdotme@gmail.com>
Link: https://lore.kernel.org/ZPCL11k38PXTkFga@debian.me/
Helped-by: brian m. carlson <sandals@crustytoothpaste.net>
Fixes: 3e440ea0ab ("sha256: avoid functions deprecated in OpenSSL 3+")
Fixes: bda9c12073 ("avoid SHA-1 functions deprecated in OpenSSL 3+")
Signed-off-by: Eric Wong <e@80x24.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-08-31 22:26:01 -07:00
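Case 3 in e0b8c84240 above (no update after final) can be illustrated with a toy context that enforces the same restriction. The sketch below is not the OpenSSL API; `toy_ctx` and its functions are assumptions made up for the example, and the "hash" is a trivial rolling sum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy hash context: once finalized, it refuses further updates. */
struct toy_ctx {
	uint32_t sum;
	int finalized;
};

static void toy_init(struct toy_ctx *c)
{
	c->sum = 0;
	c->finalized = 0;
}

static void toy_update(struct toy_ctx *c, const void *data, size_t len)
{
	const unsigned char *p = data;
	assert(!c->finalized); /* mirrors "can't update after final" */
	while (len--)
		c->sum = c->sum * 31 + *p++;
}

static void toy_clone(struct toy_ctx *dst, const struct toy_ctx *src)
{
	*dst = *src; /* a plain copy is enough for this toy; real backends may need more */
}

static uint32_t toy_final(struct toy_ctx *c)
{
	c->finalized = 1;
	return c->sum;
}

/* Read the digest so far by finalizing a temporary clone, not the primary. */
static uint32_t toy_peek(const struct toy_ctx *primary)
{
	struct toy_ctx tmp;
	toy_clone(&tmp, primary);
	return toy_final(&tmp);
}
```

The primary context stays usable after `toy_peek()`, which is the property the commit restores for callers that keep writing after taking an intermediate checksum.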
Calvin Wan da9502ff4d treewide: remove unnecessary includes for wrapper.h
Signed-off-by: Calvin Wan <calvinwan@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-05 11:41:59 -07:00
Elijah Newren d1cbe1e6d8 hash-ll.h: split out of hash.h to remove dependency on repository.h
hash.h depends upon and includes repository.h, due to the definition and
use of the_hash_algo (defined as the_repository->hash_algo).  However,
most headers trying to include hash.h are only interested in the layout
of the structs like object_id.  Move the parts of hash.h that do not
depend upon repository.h into a new file hash-ll.h (the "low level"
parts of hash.h), and adjust other files to use this new header where
the convenience inline functions aren't needed.

This allows hash.h and object.h to be fairly small, minimal headers.  It
also exposes a lot of hidden dependencies on both path.h (which was
brought in by repository.h) and repository.h (which was previously
implicitly brought in by object.h), so also adjust other files to be
more explicit about what they depend upon.

Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-24 12:47:32 -07:00
Elijah Newren 9875058870 treewide: remove cache.h inclusion due to environment.h changes
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-21 10:56:54 -07:00
Elijah Newren d5ebb50dcb wrapper.h: move declarations for wrapper.c functions from cache.h
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-21 10:56:53 -07:00
Derrick Stolee 1687150b5d hashfile: allow skipping the hash function
The hashfile API is useful for generating files that include a trailing
hash of the file's contents up to that point. Using such a hash is
helpful for verifying the file for corruption-at-rest, such as a faulty
drive causing flipped bits.

Git's index file includes this trailing hash, so it uses a 'struct
hashfile' to handle the I/O to the file. This was very convenient to
allow using the hashfile methods during these operations.

However, hashing the file contents during write comes at a performance
penalty. It's slower to hash the bytes on their way to the disk than
without that step. This problem is made worse by the replacement of
hardware-accelerated SHA1 computations with the software-based sha1dc
computation.

This write cost is significant, and the checksum capability is likely
not worth that cost for such a short-lived file. The index is rewritten
frequently and the only time the checksum is checked is during 'git
fsck'. Thus, it would be helpful to allow a user to opt-out of the hash
computation.

We first need to allow Git to opt-out of the hash computation in the
hashfile API. The buffered writes of the API are still helpful, so it
makes sense to make the change here.

Introduce a new 'skip_hash' option to 'struct hashfile'. When set, the
update_fn and final_fn members of the_hash_algo are skipped. When
finalizing the hashfile, the trailing hash is replaced with the null
hash.

This use of a trailing null hash would be desirable in either case,
since we do not want to special case a file format to have a different
length depending on whether it was hashed or not. When the final bytes
of a file are all zero, we can infer that it was written without
hashing, and thus that verification is not available as a check for file
consistency. This also means that we could easily toggle hashing for any
file format we desire.

A version of this patch has existed in the microsoft/git fork since
2017 [1] (the linked commit was rebased in 2018, but the original dates
back to January 2017). Here, the change to make the index use this fast
path is delayed until a later change.

[1] 21fed2d914

Co-authored-by: Kevin Willford <kewillf@microsoft.com>
Signed-off-by: Kevin Willford <kewillf@microsoft.com>
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-01-07 07:46:14 +09:00
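The inference described in 1687150b5d above, that an all-zero trailer means the file was written without hashing, could be checked along these lines. `TOY_RAWSZ` and `toy_has_null_trailer` are hypothetical names, and the 20-byte trailer length is an assumption matching SHA-1.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_RAWSZ 20 /* assumed trailer length, as for SHA-1 */

/* Returns 1 if the last TOY_RAWSZ bytes are all zero, 0 otherwise. */
static int toy_has_null_trailer(const unsigned char *data, size_t len)
{
	size_t i;

	assert(data != NULL);
	if (len < TOY_RAWSZ)
		return 0; /* too short to carry a trailer at all */
	for (i = len - TOY_RAWSZ; i < len; i++)
		if (data[i])
			return 0;
	return 1;
}
```

A reader that sees a null trailer would skip checksum verification rather than report corruption.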
Neeraj Singh 020406eaa5 core.fsync: introduce granular fsync control infrastructure
This commit introduces the infrastructure for the core.fsync
configuration knob. The repository components we want to sync
are identified by flags so that we can turn on or off syncing
for specific components.

If core.fsyncObjectFiles is set and the core.fsync configuration
also includes FSYNC_COMPONENT_LOOSE_OBJECT, we will fsync any
loose objects. This picks the strictest data integrity behavior
if core.fsync and core.fsyncObjectFiles are set to conflicting values.

This change introduces the currently unused fsync_component
helper, which will be used by a later patch that adds fsyncing to
the refs backend.

Actual configuration and documentation of the fsync components
list are in other patches in the series to separate review of
the underlying mechanism from the policy of how it's configured.

Helped-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-10 15:10:22 -08:00
René Scharfe 66e905b7dd use xopen() to handle fatal open(2) failures
Add and apply a semantic patch for using xopen() instead of calling
open(2) and die() or die_errno() explicitly.  This makes the error
messages more consistent and shortens the code.

Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-08-25 14:39:08 -07:00
Junio C Hamano 3b57e72c0c Merge branch 'tb/midx-use-checksum'
When rebuilding the multi-pack index file reusing an existing one,
we used to blindly trust the existing file and ended up carrying
corrupted data into the updated file, which has been corrected.

* tb/midx-use-checksum:
  midx: report checksum mismatches during 'verify'
  midx: don't reuse corrupt MIDXs when writing
  commit-graph: rewrite to use checksum_valid()
  csum-file: introduce checksum_valid()
2021-07-16 17:42:46 -07:00
Taylor Blau f9221e2cf5 csum-file: introduce checksum_valid()
Introduce a new function which checks the validity of a file's trailing
checksum. This is similar to hashfd_check(), but different since it is
intended to be used by callers who aren't writing the same data (like
`git index-pack --verify`), but who instead want to validate the
integrity of data that they are reading.

Rewrite the first of two callers which could benefit from this new
function in pack-check.c. Subsequent callers will be added in the
following patches.

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-06-28 20:36:17 -07:00
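For the f9221e2cf5 entry above, here is a minimal sketch of validating a trailing checksum over data that is being read rather than written. It is not Git's implementation: the 4-byte FNV-1a checksum stands in for a real hash, and every `toy_*` name is an assumption for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_SUMSZ 4 /* toy trailer size; a real hash is much longer */

/* Toy checksum (32-bit FNV-1a) standing in for a real hash function. */
static void toy_sum(const unsigned char *data, size_t len, unsigned char out[TOY_SUMSZ])
{
	uint32_t h = 2166136261u; /* FNV-1a offset basis */
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= data[i];
		h *= 16777619u; /* FNV prime */
	}
	for (i = 0; i < TOY_SUMSZ; i++)
		out[i] = (unsigned char)(h >> (8 * i));
}

/* Returns 1 if the trailing TOY_SUMSZ bytes match the checksum of the rest. */
static int toy_checksum_valid(const unsigned char *data, size_t len)
{
	unsigned char got[TOY_SUMSZ];

	assert(data != NULL);
	if (len < TOY_SUMSZ)
		return 0;
	toy_sum(data, len - TOY_SUMSZ, got);
	return !memcmp(got, data + len - TOY_SUMSZ, TOY_SUMSZ);
}
```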
Derrick Stolee 2ca245f8be csum-file.h: increase hashfile buffer size
The hashfile API uses a hard-coded buffer size of 8KB and has ever since
it was introduced in c38138c (git-pack-objects: write the pack files
with a SHA1 csum, 2005-06-26). It performs a similar function to the
hashing buffers in read-cache.c, but that code was updated from 8KB to
128KB in f279894 (read-cache: make the index write buffer size 128K,
2021-02-18). The justification there was that do_write_index() improves
from 1.02s to 0.72s. Since our end goal is to have the index writing
code use the hashfile API, we need to unify this buffer size to avoid a
performance regression.

There is a buffer, 'check_buffer', that is used to verify the check_fd
file descriptor. When this buffer increases to 128K to fit the data
being flushed, it causes the stack to overflow the limits placed in the
test suite. To avoid issues with stack size, move both 'buffer' and
'check_buffer' to be heap pointers within 'struct hashfile'. The
'check_buffer' member is left as NULL unless check_fd is set in
hashfd_check(). Both buffers are cleared as part of finalize_hashfile()
which also frees the full structure.

Since these buffers are now on the heap, we can adjust their size based
on the needs of the consumer. In particular, callers to
hashfd_throughput() are expecting to report progress indicators as the
buffer flushes. These callers would prefer the smaller 8k buffer to
avoid large delays between updates, especially for users with slower
networks. When the progress indicator is not used, the larger buffer is
preferable.

By adding a new trace2 region in the chunk-format API, we can see that
the writing portion of 'git multi-pack-index write' lowers from ~1.49s
to ~1.47s on a Linux machine. These effects may be more pronounced or
diminished on other filesystems. The end-to-end timing is too noisy to
have a definitive change either way.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-19 16:41:21 +09:00
Derrick Stolee 68142e117c hashfile: use write_in_full()
The flush() logic in csum-file.c was introduced originally by c38138c
(git-pack-objects: write the pack files with a SHA1 csum, 2005-06-26)
and a portion of the logic performs similar utility to write_in_full()
in wrapper.c. The history of write_in_full() is full of moves and
renames, but was originally introduced by 7230e6d (Add write_or_die(), a
helper function, 2006-08-21).

The point of these sections of code is to flush a write buffer using
xwrite() and report errors in the case of disk space issues or other
generic input/output errors. The logic in flush() can interpret the
output of write_in_full() to provide the correct error messages to
users.

The logic in the hashfile API has an additional set of logic to augment
the progress indicator between calls to xwrite(). This was introduced by
2a128d6 (add throughput display to git-push, 2007-10-30). It seems that
since the hashfile's buffer is only 8KB, these additional progress
indicators might not be incredibly necessary. Instead, update the
progress only when write_in_full() completes.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-18 06:32:35 +09:00
Derrick Stolee ddaf1f62e3 csum-file: make hashwrite() more readable
The hashwrite() method takes an input buffer and updates a hashfile's
hash function while writing the data to a file. To avoid overuse of
flushes, the hashfile has an internal buffer and most writes will use
memcpy() to transfer data from the input 'buf' to the hashfile's buffer
of size 8 * 1024 bytes.

Logic introduced by a8032d12 (sha1write: don't copy full sized buffers,
2008-09-02) reduces the number of memcpy() calls when the input buffer
is sufficiently longer than the hashfile's buffer, causing nr to be the
length of the full buffer. In these cases, the input buffer is used
directly in chunks equal to the hashfile's buffer size.

This method caught my attention while investigating some performance
issues, but it turns out that these performance issues were noise within
the variance of the experiment.

However, during this investigation, I inspected hashwrite() and
misunderstood it, even after looking closely and trying to make it
faster. This change simply reorganizes some parts of the loop within
hashwrite() to make it clear that each batch either uses memcpy() to the
hashfile's buffer or writes directly from the input buffer. The previous
code relied on indirection through local variables and essentially
inlined the implementation of hashflush() to reduce lines of code.

Helped-by: Jeff King <peff@peff.net>
Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-26 14:32:45 -07:00
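The two paths that ddaf1f62e3 above separates, copying into the internal buffer versus consuming a full chunk straight from the input, can be sketched as follows. This is a simplified model, not the actual hashwrite(): the buffer is shrunk to 8 bytes, hashing and writing are replaced by a stub, and the counters exist only so the behaviour can be observed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_BUFSZ 8 /* tiny on purpose; the commit describes 8 * 1024 */

struct toy_hashfile {
	unsigned char buffer[TOY_BUFSZ];
	size_t offset;        /* bytes currently held in buffer */
	size_t direct_chunks; /* chunks consumed straight from the input */
	size_t flushes;       /* times the internal buffer was flushed */
};

/* Stub: a real implementation would hash and write the bytes here. */
static void toy_consume(struct toy_hashfile *f, const void *p, size_t n)
{
	(void)f; (void)p; (void)n;
}

static void toy_hashwrite(struct toy_hashfile *f, const void *buf, size_t count)
{
	const unsigned char *in = buf;

	while (count) {
		size_t left, nr;

		assert(f->offset < TOY_BUFSZ); /* a full buffer is always flushed */
		left = TOY_BUFSZ - f->offset;
		nr = count > left ? left : count;

		if (nr == TOY_BUFSZ) {
			/* Buffer is empty and the input covers a full chunk: skip memcpy(). */
			toy_consume(f, in, nr);
			f->direct_chunks++;
		} else {
			memcpy(f->buffer + f->offset, in, nr);
			f->offset += nr;
			if (f->offset == TOY_BUFSZ) {
				toy_consume(f, f->buffer, TOY_BUFSZ);
				f->offset = 0;
				f->flushes++;
			}
		}
		in += nr;
		count -= nr;
	}
}
```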
brian m. carlson 768e30ea27 hash: implement and use a context cloning function
For all of our SHA-1 implementations and most of our SHA-256
implementations, the hash context we use is a real struct.  For these
implementations, it's possible to copy a hash context by making a copy
of the struct.

However, for our libgcrypt implementation, our hash context is a
pointer.  Consequently, copying it does not lead to an independent hash
context like we intended.

Fortunately, however, libgcrypt provides us with a handy function to
copy hash contexts.  Let's add a cloning function to the hash algorithm
API, and use it in the one place we need to make a hash context copy.
With this change, our libgcrypt SHA-256 implementation is fully
functional with all of our other hash implementations.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-02-24 09:33:21 -08:00
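The problem that 768e30ea27 above fixes can be reproduced with a toy context that holds only a pointer. The sketch is not libgcrypt code; `toy_ctx`, `toy_state`, and the helper functions are assumptions invented for the example, and the "hash" is a running byte sum.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical context that only holds a handle, as a pointer-based backend would. */
struct toy_state { unsigned long sum; };
struct toy_ctx { struct toy_state *handle; };

static void toy_init(struct toy_ctx *c)
{
	c->handle = calloc(1, sizeof(*c->handle));
	assert(c->handle != NULL);
}

static void toy_update(struct toy_ctx *c, unsigned char byte)
{
	c->handle->sum += byte;
}

/* Deep copy: the clone gets its own state instead of sharing the handle. */
static void toy_clone(struct toy_ctx *dst, const struct toy_ctx *src)
{
	toy_init(dst);
	*dst->handle = *src->handle;
}

static void toy_release(struct toy_ctx *c)
{
	free(c->handle);
	c->handle = NULL;
}
```

With this layout, `shallow = original` copies only the pointer, so later updates to the original also change the "copy", whereas `toy_clone()` yields an independent context.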
Derrick Stolee cfe83216e4 csum-file: refactor finalize_hashfile() method
If we want to use a hashfile on the temporary file for a lockfile, then
we need finalize_hashfile() to fully write the trailing hash but also keep
the file descriptor open.

Do this by adding a new CSUM_HASH_IN_STREAM flag along with a functional
change that checks this flag before writing the checksum to the stream.
This differs from previous behavior since it would be written if either
CSUM_CLOSE or CSUM_FSYNC is provided.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-02 14:27:30 -07:00
Derrick Stolee f2af9f5e02 csum-file: rename hashclose() to finalize_hashfile()
The hashclose() method behaves very differently depending on the flags
parameter. In particular, the file descriptor is not always closed.

Perform a simple rename of "hashclose()" to "finalize_hashfile()" in
preparation for functional changes.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-02 14:27:30 -07:00
brian m. carlson 4d2735005a csum-file: abstract uses of SHA-1
Convert several direct uses of SHA-1 to use the_hash_algo instead.
Convert one use of the constant 20 as well.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-02-02 11:28:41 -08:00
brian m. carlson 98a3beab6a csum-file: rename sha1file to hashfile
Rename struct sha1file to struct hashfile, along with all of its related
functions.

The transformation in this commit was made by global search-and-replace.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-02-02 11:28:41 -08:00
Jeff King 61d36330b4 prefer "!=" when checking read_in_full() result
Comparing the result of read_in_full() using less-than is
potentially dangerous, as a negative return value may be
converted to an unsigned type and be considered a success.
This is discussed further in 561598cfcf (read_pack_header:
handle signed/unsigned comparison in read result,
2017-09-13).

Each of these instances is actually fine in practice:

 - in get-tar-commit-id, the HEADERSIZE macro expands to a
   signed integer. If it were switched to an unsigned type
   (e.g., a size_t), then it would be a bug.

 - the other two callers check for a short read only after
   handling a negative return separately. This is a fine
   practice, but we'd prefer to model "!=" as a general
   rule.

So all of these cases can be considered cleanups and not
actual bugfixes.
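
The hazard can be shown with two small checks. This is a sketch, not code from the patch: the function names are hypothetical, and the explicit cast only spells out the conversion a compiler performs implicitly when ssize_t and size_t have the same width.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>	/* ssize_t */

/*
 * Fragile check: a return value of -1 becomes SIZE_MAX once converted
 * to size_t, so it is not "less than" len and the error goes unnoticed.
 */
static int looks_short_lt(ssize_t ret, size_t len)
{
	return (size_t)ret < len;
}

/*
 * Preferred check: anything other than exactly len bytes is a failure,
 * which also covers the converted negative value.
 */
static int looks_short_ne(ssize_t ret, size_t len)
{
	return (size_t)ret != len;
}
```

With ret == -1 and len == 10, the first check reports no problem while the second one does.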

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-09-27 15:45:24 +09:00
Jeff King 599d223107 sha1fd_check: die when we cannot open the file
Right now we return a NULL "struct sha1file" if we encounter
an error. However, the sole caller (write_idx_file) does not
check the return value, and will segfault if we hit this
case.

One option would be to handle the error in the caller.
However, there's really nothing for it to do but die. This
code path is hit during "git index-pack --verify"; after we
verify the packfile, we check that the ".idx" we would
generate from it is byte-wise identical to what is on disk.
We hit the error (and segfault) if we can't open the .idx
file (a likely cause of this is that somebody else ran "git
repack -ad" while we were verifying). Since we can't
complete the requested verification, we really have no
choice but to die.

Furthermore, the rest of the sha1fd_* functions simply die
on errors. So if we were to open the file successfully, for
example, and then hit a read error, sha1write would call
die() for us. So pushing the die() down into sha1fd_check
keeps the interface consistent.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-03-19 13:35:15 -07:00
Junio C Hamano f06a5e607d Merge branch 'jk/sha1write-void'
Code clean-up.

* jk/sha1write-void:
  do not pretend sha1write returns errors
2014-01-10 10:33:09 -08:00
Jeff King 9af270e8c2 do not pretend sha1write returns errors
The sha1write function returns an int, but it will always be
"0". The failure-prone parts of the function happen in the
"flush" callback, which cannot pass an error back to us. So
we just end up calling die() during the flush.

Let's just drop the return value altogether, as it only
confuses callers into thinking that it might be useful.

Only one call site actually checked the return value. We can
drop that check, since it just led to a die() anyway.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-12-26 11:50:20 -08:00
Jeff King e74435a516 sha1write: make buffer const-correct
We are passed a "void *" and write it out without ever
touching it; let's indicate that by using "const".

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-10-24 15:44:18 -07:00
Junio C Hamano 6c52614864 csum-file: introduce sha1file_checkpoint
It is useful to be able to rewind a check-summed file to a certain
previous state after writing data into it using sha1write() API. The
fast-import command does this after streaming a blob data to the packfile
being generated and then noticing that the same blob has already been
written, and it does this with a private function, truncate_pack(), that is
commented as "Yes, this is a layering violation".

Introduce two API functions: sha1file_checkpoint(), which lets the caller
save the state of a sha1file, and sha1file_truncate(), which later reverts
it to the saved state.
Use it to reimplement truncate_pack().
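
The save-and-rewind idea can be sketched on an in-memory stand-in. Everything here is an assumption made for illustration: the sketch_* names are hypothetical, the "file" is a fixed array, and the running checksum is a toy multiplicative sum rather than SHA-1.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct sketch_file {
	unsigned char data[256];
	size_t len;		/* bytes written so far */
	unsigned long sum;	/* running checksum over data[0..len) */
};

struct sketch_checkpoint {
	size_t len;
	unsigned long sum;
};

static void sketch_file_write(struct sketch_file *f, const void *buf, size_t n)
{
	const unsigned char *p = buf;
	size_t i;

	assert(n <= sizeof(f->data) - f->len);
	memcpy(f->data + f->len, p, n);
	for (i = 0; i < n; i++)
		f->sum = f->sum * 31 + p[i];
	f->len += n;
}

/* Record the offset and the checksum state together. */
static void sketch_checkpoint_save(const struct sketch_file *f,
				   struct sketch_checkpoint *cp)
{
	cp->len = f->len;
	cp->sum = f->sum;
}

/* Rewind both, so later writes continue as if the tail never existed. */
static void sketch_checkpoint_restore(struct sketch_file *f,
				      const struct sketch_checkpoint *cp)
{
	assert(cp->len <= f->len);
	f->len = cp->len;
	f->sum = cp->sum;
}
```

The point of saving the checksum state alongside the offset is that truncating the data alone would leave the running hash covering bytes that are no longer part of the file.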

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-11-30 14:27:59 -08:00
Junio C Hamano d907bf8ef3 Merge branch 'jc/index-pack'
* jc/index-pack:
  verify-pack: use index-pack --verify
  index-pack: show histogram when emulating "verify-pack -v"
  index-pack: start learning to emulate "verify-pack -v"
  index-pack: a miniscule refactor
  index-pack --verify: read anomalous offsets from v2 idx file
  write_idx_file: need_large_offset() helper function
  index-pack: --verify
  write_idx_file: introduce a struct to hold idx customization options
  index-pack: group the delta-base array entries also by type

Conflicts:
	builtin/verify-pack.c
	cache.h
	sha1_file.c
2011-07-19 09:54:51 -07:00
Stephen Boyd 1e4cd68c00 sparse: Fix errors and silence warnings
 * load_file() returns a void pointer but is using 0 for the return
   value

 * builtin/receive-pack.c forgot to include builtin.h

 * packet_trace_prefix can be marked static

 * ll_merge takes a pointer for its last argument, not an int

 * crc32 expects a pointer as the second argument but Z_NULL is defined
   to be 0 (see 38f4d13 sparse fix: Using plain integer as NULL pointer,
   2006-11-18 for more info)

Signed-off-by: Stephen Boyd <bebarino@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-04-03 10:14:53 -07:00
Junio C Hamano e337a04de2 index-pack: --verify
Given an existing .pack file and the .idx file that describes it,
this new mode of operation reads and re-indexes the packfile and makes
sure the existing .idx file matches the result byte-for-byte.

All the objects in the .pack file are validated during this operation as
well.  Unlike verify-pack, which visits each object described in the .idx
file in the SHA-1 order, index-pack efficiently exploits the delta-chain
to avoid rebuilding the objects that are used as the base of deltified
objects over and over again while validating the objects, resulting in
much quicker verification of the .pack file and its .idx file.

This version however cannot verify a .pack/.idx pair with a handcrafted v2
index that uses 64-bit offset representation for offsets that would fit
within 31 bits. You can create such an .idx file by giving a custom offset
to --index-version option to the command.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-02-27 23:29:03 -08:00
Thomas Rast d824cbba02 Convert existing die(..., strerror(errno)) to die_errno()
Change calls to die(..., strerror(errno)) to use the new die_errno().

In the process, also make slight style adjustments: at least state
_something_ about the function that failed (instead of just printing
the pathname), and put paths in single quotes.

Signed-off-by: Thomas Rast <trast@student.ethz.ch>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-06-27 11:14:53 -07:00
Shawn O. Pearce e782e12f89 Merge branch 'maint'
* maint:
  rebase -i: do not fail when there is no commit to cherry-pick
  test-lib: fix color reset in say_color()
  fix pread()'s short read in index-pack

Conflicts:
	csum-file.c
2008-10-10 08:39:20 -07:00
Nicolas Pitre 838cd34664 fix pread()'s short read in index-pack
Since v1.6.0.2~13^2~ the completion of a thin pack uses sha1write() for
its ability to compute a SHA1 on the written data.  This also provides
data buffering which, along with commit 92392b4a45, will confuse pread()
whenever an appended object is 1) freed due to memory pressure because
of the depth-first delta processing, and 2) needed again because it has
many delta children, and 3) its data is still buffered by sha1write().

Let's fix the issue by simply forcing cached data out when such an
object is written so it can be pread()'d at leisure.
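
The interaction between a userspace write buffer and pread() can be sketched as follows. This is an illustration under assumptions, not the code from this fix: the buffered_* names and the 64-byte buffer are hypothetical, and error handling is reduced to abort().

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

struct buffered_out {
	int fd;
	char buf[64];
	size_t used;	/* bytes held in buf, not yet visible in the file */
};

/* Push everything held in the buffer out to the descriptor. */
static void buffered_flush(struct buffered_out *o)
{
	size_t done = 0;

	assert(o->used <= sizeof(o->buf));
	while (done < o->used) {
		ssize_t n = write(o->fd, o->buf + done, o->used - done);
		if (n <= 0)
			abort();	/* sketch only: no EINTR handling */
		done += (size_t)n;
	}
	o->used = 0;
}

/* Accumulate data; it reaches the file only when the buffer fills. */
static void buffered_write(struct buffered_out *o, const char *p, size_t n)
{
	while (n) {
		size_t room = sizeof(o->buf) - o->used;
		size_t chunk = n < room ? n : room;

		memcpy(o->buf + o->used, p, chunk);
		o->used += chunk;
		p += chunk;
		n -= chunk;
		if (o->used == sizeof(o->buf))
			buffered_flush(o);
	}
}
```

Until buffered_flush() runs, a pread() on the same descriptor cannot see the pending bytes, which is why forcing the cached data out makes the later read reliable.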

Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
2008-10-10 07:09:30 -07:00
Nicolas Pitre 9126f0091f fix openssl headers conflicting with custom SHA1 implementations
On ARM I have the following compilation errors:

    CC fast-import.o
In file included from cache.h:8,
                 from builtin.h:6,
                 from fast-import.c:142:
arm/sha1.h:14: error: conflicting types for 'SHA_CTX'
/usr/include/openssl/sha.h:105: error: previous declaration of 'SHA_CTX' was here
arm/sha1.h:16: error: conflicting types for 'SHA1_Init'
/usr/include/openssl/sha.h:115: error: previous declaration of 'SHA1_Init' was here
arm/sha1.h:17: error: conflicting types for 'SHA1_Update'
/usr/include/openssl/sha.h:116: error: previous declaration of 'SHA1_Update' was here
arm/sha1.h:18: error: conflicting types for 'SHA1_Final'
/usr/include/openssl/sha.h:117: error: previous declaration of 'SHA1_Final' was here
make: *** [fast-import.o] Error 1

This is because openssl header files are always included in
git-compat-util.h since commit 684ec6c63c whenever NO_OPENSSL is not
set, which somehow brings in <openssl/sha.h> clashing with the custom
ARM version.  Compilation of git is probably broken on PPC too for the
same reason.

Turns out that the only file requiring openssl/ssl.h and openssl/err.h
is imap-send.c.  But only moving those problematic includes there
doesn't solve the issue as it also includes cache.h which brings in the
conflicting local SHA1 header file.

As suggested by Jeff King, the best solution is to rename our references
to SHA1 functions and structure to something git specific, and define those
according to the implementation used.
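
The renaming approach can be sketched like this. All names are hypothetical stand-ins (a toy "implementation" and sketch_-prefixed aliases); the real prefix and the real implementations are whatever the patch defines.

```c
#include <assert.h>
#include <stddef.h>

/* A stand-in "custom" implementation with its own, non-clashing names. */
struct toy_sha_ctx {
	unsigned long state;
};

static void toy_sha_init(struct toy_sha_ctx *c)
{
	assert(c != NULL);
	c->state = 0;
}

static void toy_sha_update(struct toy_sha_ctx *c, const void *data, size_t n)
{
	const unsigned char *p = data;
	size_t i;

	for (i = 0; i < n; i++)
		c->state = c->state * 31 + p[i];
}

/*
 * Project-specific names, bound to whichever implementation was
 * selected at build time. Callers use only these, so a system header
 * declaring SHA_CTX or SHA1_Init can no longer conflict with them.
 */
#define sketch_SHA_CTX		struct toy_sha_ctx
#define sketch_SHA1_Init	toy_sha_init
#define sketch_SHA1_Update	toy_sha_update
```

Switching implementations then means changing only the three defines, not the call sites.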

Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
2008-10-02 18:06:56 -07:00
Nicolas Pitre a8032d1224 sha1write: don't copy full sized buffers
No need to memcpy() source buffer data when we might just process the
data in place instead of accumulating it into a separate buffer.
This is the case when a whole buffer would have been copied, summed,
written out and then discarded right away.

Also move the CRC32 processing within the loop so the data is more likely
to remain in the L1 CPU cache between the CRC32 sum, SHA1 sum and the
write call.
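
The in-place path can be sketched as below. This is an illustration under assumptions: the sketch_* names and the 8-byte buffer are hypothetical, and sketch_process() stands in for the combined CRC32, SHA-1 and write step.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_BUFSZ 8

struct sketch_sink {
	unsigned char buf[SKETCH_BUFSZ];
	size_t used;		/* bytes pending in buf */
	unsigned long sum;	/* stand-in for checksum + write */
	size_t copied;		/* bytes that went through memcpy() */
};

static void sketch_process(struct sketch_sink *s,
			   const unsigned char *p, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		s->sum = s->sum * 31 + p[i];
}

static void sketch_sink_write(struct sketch_sink *s, const void *data, size_t n)
{
	const unsigned char *p = data;

	while (n) {
		size_t room, chunk;

		assert(s->used < SKETCH_BUFSZ);
		room = SKETCH_BUFSZ - s->used;
		chunk = n < room ? n : room;

		if (chunk == SKETCH_BUFSZ) {
			/* Buffer is empty and a full chunk is available:
			 * process the caller's memory directly. */
			sketch_process(s, p, chunk);
		} else {
			memcpy(s->buf + s->used, p, chunk);
			s->copied += chunk;
			s->used += chunk;
			if (s->used == SKETCH_BUFSZ) {
				sketch_process(s, s->buf, SKETCH_BUFSZ);
				s->used = 0;
			}
		}
		p += chunk;
		n -= chunk;
	}
}
```

A write that is a whole multiple of the buffer size on an empty buffer never touches memcpy(), while an unaligned sequence of writes still produces the same checksum.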

Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-09-02 17:05:47 -07:00