The ref backends do not have any way to disable the creation of reflog
entries. This will be required for upcoming ref format migration logic
so that we do not create any entries that didn't exist in the original
ref database.
Provide a new `REF_SKIP_CREATE_REFLOG` flag that allows the caller to
disable reflog entry creation.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Further clean-up the refs subsystem to stop relying on
the_repository, and instead use the repository associated to the
ref_store object.
* ps/refs-without-the-repository-updates:
refs/packed: remove references to `the_hash_algo`
refs/files: remove references to `the_hash_algo`
refs/files: use correct repository
refs: remove `dwim_log()`
refs: drop `git_default_branch_name()`
refs: pass repo when peeling objects
refs: move object peeling into "object.c"
refs: pass ref store when detecting dangling symrefs
refs: convert iteration over replace refs to accept ref store
refs: retrieve worktree ref stores via associated repository
refs: refactor `resolve_gitlink_ref()` to accept a repository
refs: pass repo when retrieving submodule ref store
refs: track ref stores via strmap
refs: implement releasing ref storages
refs: rename `init_db` callback to avoid confusion
refs: adjust names for `init` and `init_db` callbacks
Code clean-up to make the reftable iterator closer to be reusable.
* ps/reftable-reusable-iterator:
reftable/merged: adapt interface to allow reuse of iterators
reftable/stack: provide convenience functions to create iterators
reftable/reader: adapt interface to allow reuse of iterators
reftable/generic: adapt interface to allow reuse of iterators
reftable/generic: move seeking of records into the iterator
reftable/merged: simplify indices for subiterators
reftable/merged: split up initialization and seeking of records
reftable/reader: set up the reader when initializing table iterator
reftable/reader: inline `reader_seek_internal()`
reftable/reader: separate concerns of table iter and reftable reader
reftable/reader: unify indexed and linear seeking
reftable/reader: avoid copying index iterator
reftable/block: use `size_t` to track restart point index
The knobs to tweak how reftable files are written have been made
available as configuration variables.
* ps/reftable-write-options:
refs/reftable: allow configuring geometric factor
reftable: make the compaction factor configurable
refs/reftable: allow disabling writing the object index
refs/reftable: allow configuring restart interval
reftable: use `uint16_t` to track restart interval
refs/reftable: allow configuring block size
reftable/dump: support dumping a table's block structure
reftable/writer: improve error when passed an invalid block size
reftable/writer: drop static variable used to initialize strbuf
reftable: pass opts as constant pointer
reftable: consistently refer to `reftable_write_options` as `opts`
Terminology to call various ref-like things are getting
straightened out.
* ps/pseudo-ref-terminology:
refs: refuse to write pseudorefs
ref-filter: properly distinuish pseudo and root refs
refs: pseudorefs are no refs
refs: classify HEAD as a root ref
refs: do not check ref existence in `is_root_ref()`
refs: rename `is_special_ref()` to `is_pseudo_ref()`
refs: rename `is_pseudoref()` to `is_root_ref()`
Documentation/glossary: define root refs as refs
Documentation/glossary: clarify limitations of pseudorefs
Documentation/glossary: redefine pseudorefs as special refs
Expose "name conflict" error when a ref creation fails due to D/F
conflict in the ref namespace, to improve an error message given by
"git fetch".
* it/refs-name-conflict:
refs: return conflict error when checking packed refs
* kn/ref-transaction-symref:
refs: remove `create_symref` and associated dead code
refs: rename `refs_create_symref()` to `refs_update_symref()`
refs: use transaction in `refs_create_symref()`
refs: add support for transactional symref updates
refs: move `original_update_refname` to 'refs.c'
refs: support symrefs in 'reference-transaction' hook
files-backend: extract out `create_symref_lock()`
refs: accept symref values in `ref_transaction_update()`
* ps/pseudo-ref-terminology:
refs: refuse to write pseudorefs
ref-filter: properly distinuish pseudo and root refs
refs: pseudorefs are no refs
refs: classify HEAD as a root ref
refs: do not check ref existence in `is_root_ref()`
refs: rename `is_special_ref()` to `is_pseudo_ref()`
refs: rename `is_pseudoref()` to `is_root_ref()`
Documentation/glossary: define root refs as refs
Documentation/glossary: clarify limitations of pseudorefs
Documentation/glossary: redefine pseudorefs as special refs
Updates to symbolic refs can now be made as a part of ref
transaction.
* kn/ref-transaction-symref:
refs: remove `create_symref` and associated dead code
refs: rename `refs_create_symref()` to `refs_update_symref()`
refs: use transaction in `refs_create_symref()`
refs: add support for transactional symref updates
refs: move `original_update_refname` to 'refs.c'
refs: support symrefs in 'reference-transaction' hook
files-backend: extract out `create_symref_lock()`
refs: accept symref values in `ref_transaction_update()`
Remove references to `the_hash_algo` in favor of the hash algo specified
by the repository associated with the packed ref store.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Remove references to `the_hash_algo` in favor of the hash algo specified
by the repository associated with the files ref store.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
There are several places in the "files" backend where we use
`the_repository` instead of the repository associated with the ref store
itself. Adapt those to use the correct repository.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Both `peel_object()` and `peel_iterated_oid()` implicitly rely on
`the_repository` to look up objects. Despite the fact that we want to
get rid of `the_repository`, it also leads to some restrictions in our
ref iterators when trying to retrieve the peeled value for a repository
other than `the_repository`.
Refactor these functions such that both take a repository as argument
and remove the now-unnecessary restrictions.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Peeling an object has nothing to do with refs, but we still have the
code in "refs.c". Move it over into "object.c", which is a more natural
place to put it.
Ideally, we'd also move `peel_iterated_oid()` over into "object.c". But
this function is tied to the refs interfaces because it uses a global
ref iterator variable to optimize peeling when the iterator already has
the peeled object ID readily available.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The function `for_each_replace_ref()` is a bit of an oddball across the
refs interfaces as it accepts a pointer to the repository instead of a
pointer to the ref store. The only reason for us to accept a repository
is so that we can eventually pass it back to the callback function that
the caller has provided. This is somewhat arbitrary though, as callers
that need the repository can instead make it accessible via the callback
payload.
Refactor the function to instead accept the ref store and adjust callers
accordingly. This allows us to get rid of some of the boilerplate that
we had to carry to pass along the repository and brings us in line with
the other functions that iterate through refs.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Looking up submodule ref stores has two deficiencies:
- The initialized subrepo will be attributed to `the_repository`.
- The submodule ref store will be tracked in a global map.
This makes it impossible to have submodule ref stores for a repository
other than `the_repository`.
Modify the function to accept the parent repository as parameter and
move the global map into `struct repository`. Like this it becomes
possible to look up submodule ref stores for arbitrary repositories.
Note that this also adds a new reference to `the_repository` in
`resolve_gitlink_ref()`, which is part of the refs interfaces. This will
get adjusted in the next patch.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Ref storages are typically only initialized once for `the_repository`
and then never released. Until now we got away with that without causing
memory leaks because `the_repository` stays reachable, and because the
ref backend is reachable via `the_repository` its memory basically never
leaks.
This is about to change though because of the upcoming migration logic,
which will create a secondary ref storage. In that case, we will either
have to release the old or new ref storage to avoid leaks.
Implement a new `release` callback and expose it via a new
`ref_storage_release()` function.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Reference backends have two callbacks `init` and `init_db`. The
similarity of these two callbacks has repeatedly confused me whenever I
was looking at them, where I always had to look up which of them does
what.
Rename the `init_db` callback to `create_on_disk`, which should
hopefully be clearer.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The names of the functions that implement the `init` and `init_db`
callbacks in the "files" and "packed" backends do not match the names of
the callbacks, which is inconsistent. Rename them so that they match,
which makes it easier to discover their respective implementations.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Root refs are those refs that live in the root of the ref hierarchy.
Our old and venerable "HEAD" reference falls into this category, but we
don't yet classify it as such in `is_root_ref()`.
Adapt the function to also treat "HEAD" as a root ref. This change is
safe to do for all current callers:
- `ref_kind_from_refname()` already handles "HEAD" explicitly before
calling `is_root_ref()`.
- The "files" and "reftable" backends explicitly call both
`is_root_ref()` and `is_headref()` together.
This also aligns behaviour or `is_root_ref()` and `is_headref()` such
that we stop checking for ref existence. This changes semantics for our
backends:
- In the reftable backend we already know that the ref must exist
because `is_headref()` is called as part of the ref iterator. The
existence check is thus redundant, and the change is safe to do.
- In the files backend we use it when populating root refs, where we
would skip adding the "HEAD" file if it was not possible to resolve
it. The new behaviour is to instead mark "HEAD" as broken, which
will cause us to emit warnings in various places.
As there are no callers of `is_headref()` left afer the refactoring, we
can absorb it completely into `is_root_ref()`.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Before this patch series, root refs except for "HEAD" and our special
refs were classified as pseudorefs. Furthermore, our terminology
clarified that pseudorefs must not be symbolic refs. This restriction
is enforced in `is_root_ref()`, which explicitly checks that a supposed
root ref resolves to an object ID without recursing.
This has been extremely confusing right from the start because (in old
terminology) a ref name may sometimes be a pseudoref and sometimes not
depending on whether it is a symbolic or regular ref. This behaviour
does not seem reasonable at all and I very much doubt that it results in
anything sane.
Last but not least, the current behaviour can actually lead to a
segfault when calling `is_root_ref()` with a reference that either does
not exist or that is a symbolic ref because we never initialized `oid`,
but then read it via `is_null_oid()`.
We have now changed terminology to clarify that pseudorefs are really
only "MERGE_HEAD" and "FETCH_HEAD", whereas all the other refs that live
in the root of the ref hierarchy are just plain refs. Thus, we do not
need to check whether the ref is symbolic or not. In fact, we can now
avoid looking up the ref completely as the name is sufficient for us to
figure out whether something would be a root ref or not.
This change of course changes semantics for our callers. As there are
only three of them we can assess each of them individually:
- "ref-filter.c:ref_kind_from_refname()" uses it to classify refs.
It's clear that the intent is to classify based on the ref name,
only.
- "refs/reftable_backend.c:reftable_ref_iterator_advance()" uses it to
filter root refs. Again, using existence checks is pointless here as
the iterator has just surfaced the ref, so we know it does exist.
- "refs/files_backend.c:add_pseudoref_and_head_entries()" uses it to
determine whether it should add a ref to the root directory of its
iterator. This had the effect that we skipped over any files that
are either a symbolic ref, or which are not a ref at all.
The new behaviour is to include symbolic refs know, which aligns us
with the adapted terminology. Furthermore, files which look like
root refs but aren't are now mark those as "broken". As broken refs
are not surfaced by our tooling, this should not lead to a change in
user-visible behaviour, but may cause us to emit warnings. This
feels like the right thing to do as we would otherwise just silently
ignore corrupted root refs completely.
So in all cases the existence check was either superfluous, not in line
with the adapted terminology or masked potential issues. This commit
thus changes the behaviour as proposed and drops the existence check
altogether.
Add a test that verifies that this does not change user-visible
behaviour. Namely, we still don't want to show broken refs to the user
by default in git-for-each-ref(1). What this does allow though is for
internal callers to surface dangling root refs when they pass in the
`DO_FOR_EACH_INCLUDE_BROKEN` flag.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Rename `is_pseudoref()` to `is_root_ref()` to adapt to the newly defined
terminology in our gitglossary(7).
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
There exist a bunch of call sites in the reftable backend that want to
create iterators for a reftable stack. This is rather convoluted right
now, where you always have to go via the merged table. And it is about
to become even more convoluted when we split up iterator initialization
and seeking in the next commit.
Introduce convenience functions that allow the caller to create an
iterator from a reftable stack directly without going through the merged
table. Adapt callers accordingly.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Allow configuring the geometric factor used by the auto-compaction
algorithm whenever a new table is appended to the stack of tables.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Besides the expected "ref" and "log" records, the reftable library also
writes "obj" records. These are basically a reverse mapping of object
IDs to their respective ref records so that it becomes efficient to
figure out which references point to a specific object. The motivation
for this data structure is the "uploadpack.allowTipSHA1InWant" config,
which allows a client to fetch any object by its hash that has a ref
pointing to it.
This reverse index is not used by Git at all though, and the expectation
is that most hosters nowadays use "uploadpack.allowAnySHA1InWant". It
may thus be preferable for many users to disable writing these optional
object indices altogether to safe some precious disk space.
Add a new config "reftable.indexObjects" that allows the user to disable
the object index altogether.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Add a new option `reftable.restartInterval` that allows the user to
control the restart interval when writing reftable records used by the
reftable library.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Add a new option `reftable.blockSize` that allows the user to control
the block size used by the reftable library.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We sometimes pass the refatble write options as value and sometimes as a
pointer. This is quite confusing and makes the reader wonder whether the
options get modified sometimes.
In fact, `reftable_new_writer()` does cause the caller-provided options
to get updated when some values aren't set up. This is quite unexpected,
but didn't cause any harm until now.
Adapt the code so that we do not modify the caller-provided values
anymore. While at it, refactor the code to code to consistently pass the
options as a constant pointer to clarify that the caller-provided opts
will not ever get modified.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Code to write out reftable has seen some optimization and
simplification.
* ps/reftable-write-optim:
reftable/block: reuse compressed array
reftable/block: reuse zstream when writing log blocks
reftable/writer: reset `last_key` instead of releasing it
reftable/writer: unify releasing memory
reftable/writer: refactorings for `writer_flush_nonempty_block()`
reftable/writer: refactorings for `writer_add_record()`
refs/reftable: don't recompute committer ident
reftable: remove name checks
refs/reftable: skip duplicate name checks
refs/reftable: perform explicit D/F check when writing symrefs
refs/reftable: fix D/F conflict error message on ref copy
In the previous commits, we converted `refs_create_symref()` to utilize
transactions to perform symref updates. Earlier `refs_create_symref()`
used `create_symref()` to do the same.
We can now remove `create_symref()` and any code associated with it
which is no longer used. We remove `create_symref()` code from all the
reference backends and also remove it entirely from the `ref_storage_be`
struct.
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The reference backends currently support transactional reference
updates. While this is exposed to users via 'git-update-ref' and its
'--stdin' mode, it is also used internally within various commands.
However, we do not support transactional updates of symrefs. This commit
adds support for symrefs in both the 'files' and the 'reftable' backend.
Here, we add and use `ref_update_has_null_new_value()`, a helper
function which is used to check if there is a new_value in a reference
update. The new value could either be a symref target `new_target` or a
OID `new_oid`.
We also add another common function `ref_update_check_old_target` which
will be used to check if the update's old_target corresponds to a
reference's current target.
Now transactional updates (verify, create, delete, update) can be used
for:
- regular refs
- symbolic refs
- conversion of regular to symbolic refs and vice versa
This also allows us to expose this to users via new commands in
'git-update-ref' in the future.
Note that a dangling symref update does not record a new reflog entry,
which is unchanged before and after this commit.
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The files backend and the reftable backend implement
`original_update_refname` to obtain the original refname of the update.
Move it out to 'refs.c' and only expose it internally to the refs
library. This will be used in an upcoming commit to also introduce
another common functionality for the two backends.
We also rename the function to `ref_update_original_update_refname` to
keep it consistent with the upcoming other 'ref_update_*' functions
that'll be introduced.
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The function `create_symref_locked()` creates a symref by creating a
'<symref>.lock' file and then committing the symref lock, which creates
the final symref.
Extract the early half of `create_symref_locked()` into a new helper
function `create_symref_lock()`. Because the name of the new function is
too similar to the original, rename the original to
`create_and_commit_symref()` to avoid confusion.
The new function `create_symref_locked()` can be used to create the
symref lock in a separate step from that of committing it. This allows
to add transactional support for symrefs, where the lock would be
created in the preparation step and the lock would be committed in the
finish step.
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The function `ref_transaction_update()` obtains ref information and
flags to create a `ref_update` and add them to the transaction at hand.
To extend symref support in transactions, we need to also accept the
old and new ref targets and process it. This commit adds the required
parameters to the function and modifies all call sites.
The two parameters added are `new_target` and `old_target`. The
`new_target` is used to denote what the reference should point to when
the transaction is applied. Some functions allow this parameter to be
NULL, meaning that the reference is not changed.
The `old_target` denotes the value the reference must have before the
update. Some functions allow this parameter to be NULL, meaning that the
old value of the reference is not checked.
We also update the internal function `ref_transaction_add_update()`
similarly to take the two new parameters.
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The TRANSACTION_NAME_CONFLICT error code refers to a failure to create a
ref due to a name conflict with another ref. An example of this is a
directory/file conflict such as ref names A/B and A.
"git fetch" uses this error code to more accurately describe the error
by recommending to the user that they try running "git remote prune" to
remove any old refs that are deleted by the remote which would clear up
any directory/file conflicts.
This helpful error message is not displayed when the conflicted ref is
stored in packed refs. This change fixes this by ensuring error return
code consistency in `lock_raw_ref`.
Signed-off-by: Ivan Tse <ivan.tse1@gmail.com>
Acked-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
"git pack-refs" learned the "--auto" option, which is a useful
addition to be triggered from "git gc --auto".
Acked-by: Karthik Nayak <karthik.188@gmail.com>
cf. <CAOLa=ZRAEA7rSUoYL0h-2qfEELdbPHbeGpgBJRqesyhHi9Q6WQ@mail.gmail.com>
* ps/pack-refs-auto:
builtin/gc: pack refs when using `git maintenance run --auto`
builtin/gc: forward git-gc(1)'s `--auto` flag when packing refs
t6500: extract objects with "17" prefix
builtin/gc: move `struct maintenance_run_opts`
builtin/pack-refs: introduce new "--auto" flag
builtin/pack-refs: release allocated memory
refs/reftable: expose auto compaction via new flag
refs: remove `PACK_REFS_ALL` flag
refs: move `struct pack_refs_opts` to where it's used
t/helper: drop pack-refs wrapper
refs/reftable: print errors on compaction failure
reftable/stack: gracefully handle failed auto-compaction due to locks
reftable/stack: use error codes when locking fails during compaction
reftable/error: discern locked/outdated errors
reftable/stack: fix error handling in `reftable_stack_init_addition()`
In order to write reflog entries we need to compute the committer's
identity as it gets encoded in the log record itself. The reftable
backend does this via `git_committer_info()` and `split_ident_line()` in
`fill_reftable_log_record()`, which use the Git config as well as
environment variables to figure out the identity.
While most callers would only call `fill_reftable_log_record()` once or
twice, `write_transaction_table()` will call it as many times as there
are queued ref updates. This can be quite a waste of effort when writing
many refs with reflog entries in a single transaction.
Refactor the code to pre-compute the committer information. This results
in a small speedup when writing 100000 refs in a single transaction:
Benchmark 1: update-ref: create many refs (HEAD~)
Time (mean ± σ): 2.895 s ± 0.020 s [User: 1.516 s, System: 1.374 s]
Range (min … max): 2.868 s … 2.983 s 100 runs
Benchmark 2: update-ref: create many refs (HEAD)
Time (mean ± σ): 2.845 s ± 0.017 s [User: 1.461 s, System: 1.379 s]
Range (min … max): 2.803 s … 2.913 s 100 runs
Summary
update-ref: create many refs (HEAD) ran
1.02 ± 0.01 times faster than update-ref: create many refs (HEAD~)
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In the preceding commit we have disabled name checks in the "reftable"
backend. These checks were responsible for verifying multiple things
when writing records to the reftable stack:
- Detecting file/directory conflicts. Starting with the preceding
commits this is now handled by the reftable backend itself via
`refs_verify_refname_available()`.
- Validating refnames. This is handled by `check_refname_format()` in
the generic ref transacton layer.
The code in the reftable library is thus not used anymore and likely to
bitrot over time. Remove it.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
All the callback functions which write refs in the reftable backend
perform D/F conflict checks via `refs_verify_refname_available()`. But
in reality we perform these D/F conflict checks a second time in the
reftable library via `stack_check_addition()`.
Interestingly, the code in the reftable library is inferior compared to
the generic function:
- It is slower than `refs_verify_refname_available()`, even though
this can probably be optimized.
- It does not provide a proper error message to the caller, and thus
all the user would see is a generic "file/directory conflict"
message.
Disable the D/F conflict checks in the reftable library by setting the
`skip_name_check` write option. This results in a non-negligible speedup
when writing many refs. The following benchmark writes 100k refs in a
single transaction:
Benchmark 1: update-ref: create many refs (HEAD~)
Time (mean ± σ): 3.241 s ± 0.040 s [User: 1.854 s, System: 1.381 s]
Range (min … max): 3.185 s … 3.454 s 100 runs
Benchmark 2: update-ref: create many refs (HEAD)
Time (mean ± σ): 2.878 s ± 0.024 s [User: 1.506 s, System: 1.367 s]
Range (min … max): 2.838 s … 2.960 s 100 runs
Summary
update-ref: create many refs (HEAD~) ran
1.13 ± 0.02 times faster than update-ref: create many refs (HEAD)
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We already perform explicit D/F checks in all reftable callbacks which
write refs, except when writing symrefs. For one this leads to an error
message which isn't perfectly actionable because we only tell the user
that there was a D/F conflict, but not which refs conflicted with each
other. But second, once all ref updating callbacks explicitly check for
D/F conflicts, we can disable the D/F checks in the reftable library
itself and thus avoid some duplicated efforts.
Refactor the code that writes symref tables to explicitly call into
`refs_verify_refname_available()` when writing symrefs.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The `write_copy_table()` function is shared between the reftable
implementations for renaming and copying refs. The only difference
between those two cases is that the rename will also delete the old
reference, whereas copying won't.
This has resulted in a bug though where we don't properly verify refname
availability. When calling `refs_verify_refname_available()`, we always
add the old ref name to the list of refs to be skipped when computing
availability, which indicates that the name would be available even if
it already exists at the current point in time. This is only the right
thing to do for renames though, not for copies.
The consequence of this bug is quite harmless because the reftable
backend has its own checks for D/F conflicts further down in the call
stack, and thus we refuse the update regardless of the bug. But all the
user gets in this case is an uninformative message that copying the ref
has failed, without any further details.
Fix the bug and only add the old name to the skip-list in case we rename
the ref. Consequently, this error case will now be handled by
`refs_verify_refname_available()`, which knows to provide a proper error
message.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In future tests it will be neccesary to create repositories with a set
number of tables. To make this easier, introduce the
`GIT_TEST_REFTABLE_AUTOCOMPACTION` environment variable that, when set
to false, disables autocompaction of reftables.
Signed-off-by: Justin Tobler <jltobler@gmail.com>
Acked-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* ps/pack-refs-auto:
builtin/gc: pack refs when using `git maintenance run --auto`
builtin/gc: forward git-gc(1)'s `--auto` flag when packing refs
t6500: extract objects with "17" prefix
builtin/gc: move `struct maintenance_run_opts`
builtin/pack-refs: introduce new "--auto" flag
builtin/pack-refs: release allocated memory
refs/reftable: expose auto compaction via new flag
refs: remove `PACK_REFS_ALL` flag
refs: move `struct pack_refs_opts` to where it's used
t/helper: drop pack-refs wrapper
refs/reftable: print errors on compaction failure
reftable/stack: gracefully handle failed auto-compaction due to locks
reftable/stack: use error codes when locking fails during compaction
reftable/error: discern locked/outdated errors
reftable/stack: fix error handling in `reftable_stack_init_addition()`
Under normal circumstances, the "reftable" backend will automatically
perform compaction after appending to the stack. It is thus not
necessary and may even be considered wasteful to run git-pack-refs(1) in
"reftable"-backed repositories as it will cause the backend to compact
all tables into a single one. We do exactly that though when running
`git maintenance run --auto` or `git gc --auto`, which gets spawned by
Git after running some specific commands.
The `--auto` mode is typically only executing optimizations as needed.
To do so, we already use several heuristics for the various different
data structures in Git to determine whether to optimize them or not.
We do not use any heuristics for refs though and instead always optimize
them.
Introduce a new `PACK_REFS_AUTO` flag that can be passed to the backend.
When not handled by the backend we will continue to behave the exact
same as we do right now, that is we optimize refs unconditionally. This
is done for the "files" backend for now to retain current behaviour,
even though we may eventually also want to introduce heuristics here.
For the "reftable" backend though we already do have auto-compaction, so
we can easily reuse that logic to implement the new auto-packing flag.
Note that under normal circumstances, this should always end up being a
no-op. After all, we already invoke the code for every single addition
to the stack. But there are special cases where it can still be helpful
to execute the auto-compaction code explicitly:
- Concurrent writers may cause compaction to not run due to locks.
- Callers may decide to disable compaction altogether and then pack
refs at a later point due to various reasons.
- Other implementations of the reftable format may do compaction
differently or even not at all.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When git-pack-refs(1) fails in the reftable backend we end up printing
no error message at all, leaving the caller puzzled as to why compaction
has failed. Fix this.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The code to iterate over reflogs in the reftable has been optimized
to reduce memory allocation and deallocation.
Reviewed-by: Josh Steadmon <steadmon@google.com>
cf. <Ze9eX-aaWoVaqsPP@google.com>
* ps/reftable-reflog-iteration-perf:
refs/reftable: track last log record name via strbuf
reftable/record: use scratch buffer when decoding records
reftable/record: reuse message when decoding log records
reftable/record: reuse refnames when decoding log records
reftable/record: avoid copying author info
reftable/record: convert old and new object IDs to arrays
refs/reftable: reload correct stack when creating reflog iter
The code to iterate over refs with the reftable backend has seen
some optimization.
* ps/reftable-iteration-perf-part2:
refs/reftable: precompute prefix length
reftable: allow inlining of a few functions
reftable/record: decode keys in place
reftable/record: reuse refname when copying
reftable/record: reuse refname when decoding
reftable/merged: avoid duplicate pqueue emptiness check
reftable/merged: circumvent pqueue with single subiter
reftable/merged: handle subiter cleanup on close only
reftable/merged: remove unnecessary null check for subiters
reftable/merged: make subiters own their records
reftable/merged: advance subiter on subsequent iteration
reftable/merged: make `merged_iter` structure private
reftable/pq: use `size_t` to track iterator index
Clear the fallout from a fix for 2.44 regression.
* ps/reftable-repo-init-fix:
t0610: remove unused variable assignment
refs/reftable: don't fail empty transactions in repo without HEAD
"git for-each-ref" learned "--include-root-refs" option to show
even the stuff outside the 'refs/' hierarchy.
* kn/for-all-refs:
for-each-ref: add new option to include root refs
ref-filter: rename 'FILTER_REFS_ALL' to 'FILTER_REFS_REGULAR'
refs: introduce `refs_for_each_include_root_refs()`
refs: extract out `loose_fill_ref_dir_regular_file()`
refs: introduce `is_pseudoref()` and `is_headref()`
The reflog iterator enumerates all reflogs known to a ref backend. In
the "reftable" backend there is no way to list all existing reflogs
directly. Instead, we have to iterate through all reflog entries and
discard all those redundant entries for which we have already returned a
reflog entry.
This logic is implemented by tracking the last reflog name that we have
emitted to the iterator's user. If the next log record has the same name
we simply skip it until we find another record with a different refname.
This last reflog name is stored in a simple C string, which requires us
to free and reallocate it whenever we need to update the reflog name.
Convert it to use a `struct strbuf` instead, which reduces the number of
allocations. Before:
HEAP SUMMARY:
in use at exit: 13,473 bytes in 122 blocks
total heap usage: 1,068,485 allocs, 1,068,363 frees, 281,122,886 bytes allocated
After:
HEAP SUMMARY:
in use at exit: 13,473 bytes in 122 blocks
total heap usage: 68,485 allocs, 68,363 frees, 256,234,072 bytes allocated
Note that even after this change we still allocate quite a lot of data,
even though the number of allocations does not scale with the number of
log records anymore. This remainder comes mostly from decompressing the
log blocks, where we decompress each block into newly allocated memory.
This will be addressed at a later point in time.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In 7af607c58d (reftable/record: store "val1" hashes as static arrays,
2024-01-03) and b31e3cc620 (reftable/record: store "val2" hashes as
static arrays, 2024-01-03) we have converted ref records to store their
object IDs in a static array. Convert log records to do the same so that
their old and new object IDs are arrays, too.
This change results in two allocations less per log record that we're
iterating over. Before:
HEAP SUMMARY:
in use at exit: 13,473 bytes in 122 blocks
total heap usage: 8,068,495 allocs, 8,068,373 frees, 401,011,862 bytes allocated
After:
HEAP SUMMARY:
in use at exit: 13,473 bytes in 122 blocks
total heap usage: 6,068,489 allocs, 6,068,367 frees, 361,011,822 bytes allocated
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When creating a new reflog iterator, we first have to reload the stack
that the iterator is being created. This is done so that any concurrent
writes to the stack are reflected. But `reflog_iterator_for_stack()`
always reloads the main stack, which is wrong.
Fix this and reload the correct stack.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* ps/reftable-iteration-perf-part2:
refs/reftable: precompute prefix length
reftable: allow inlining of a few functions
reftable/record: decode keys in place
reftable/record: reuse refname when copying
reftable/record: reuse refname when decoding
reftable/merged: avoid duplicate pqueue emptiness check
reftable/merged: circumvent pqueue with single subiter
reftable/merged: handle subiter cleanup on close only
reftable/merged: remove unnecessary null check for subiters
reftable/merged: make subiters own their records
reftable/merged: advance subiter on subsequent iteration
reftable/merged: make `merged_iter` structure private
reftable/pq: use `size_t` to track iterator index
We're recomputing the prefix length on every iteration of the ref
iterator. Precompute it for another speedup when iterating over 1
million refs:
Benchmark 1: show-ref: single matching ref (revision = HEAD~)
Time (mean ± σ): 100.3 ms ± 3.7 ms [User: 97.3 ms, System: 2.8 ms]
Range (min … max): 97.5 ms … 139.7 ms 1000 runs
Benchmark 2: show-ref: single matching ref (revision = HEAD)
Time (mean ± σ): 95.8 ms ± 3.4 ms [User: 92.9 ms, System: 2.8 ms]
Range (min … max): 93.0 ms … 121.9 ms 1000 runs
Summary
show-ref: single matching ref (revision = HEAD) ran
1.05 ± 0.05 times faster than show-ref: single matching ref (revision = HEAD~)
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Under normal circumstances, it shouldn't ever happen that a repository
has no HEAD reference. In fact, git-update-ref(1) would fail any request
to delete the HEAD reference, and a newly initialized repository always
pre-creates it, too.
We have however changed git-clone(1) to partially initialize the
refdb just up to the point where remote helpers can find the
repository. With that change, we are going to run into a situation
where repositories have no refs at all.
Now there is a very particular edge case in this situation: when
preparing an empty ref transacton, we end up returning whatever value
`read_ref_without_reload()` returned to the caller. Under normal
conditions this would be fine: "HEAD" should usually exist, and thus the
function would return `0`. But if "HEAD" doesn't exist, the function
returns a positive value which we end up returning to the caller.
Fix this bug by resetting the return code to `0` and add a test.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The git-for-each-ref(1) command doesn't provide a way to print root refs
i.e pseudorefs and HEAD with the regular "refs/" prefixed refs.
This commit adds a new option "--include-root-refs" to
git-for-each-ref(1). When used this would also print pseudorefs and HEAD
for the current worktree.
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Introduce a new ref iteration flag `DO_FOR_EACH_INCLUDE_ROOT_REFS`,
which will be used to iterate over regular refs plus pseudorefs and
HEAD.
Refs which fall outside the `refs/` and aren't either pseudorefs or HEAD
are more of a grey area. This is because we don't block the users from
creating such refs but they are not officially supported.
Introduce `refs_for_each_include_root_refs()` which calls
`do_for_each_ref()` with this newly introduced flag.
In `refs/files-backend.c`, introduce a new function
`add_pseudoref_and_head_entries()` to add pseudorefs and HEAD to the
`ref_dir`. We then finally call `add_pseudoref_and_head_entries()`
whenever the `DO_FOR_EACH_INCLUDE_ROOT_REFS` flag is set. Any new ref
backend will also have to implement similar changes on its end.
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Extract out the code for adding a single file to the loose ref dir as
`loose_fill_ref_dir_regular_file()` from `loose_fill_ref_dir()` in
`refs/files-backend.c`.
This allows us to use this function independently in the following
commits where we add code to also add pseudorefs to the ref dir.
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The reflog iterator tries to resolve the corresponding ref for every
reflog that it is about to yield. Historically, this was done due to
multiple reasons:
- It ensures that the refname is safe because we end up calling
`check_refname_format()`. Also, non-conformant refnames are skipped
altogether.
- The iterator used to yield the resolved object ID as well as its
flags to the callback. This info was never used though, and the
corresponding parameters were dropped in the preceding commit.
- When a ref is corrupt then the reflog is not emitted at all.
We're about to introduce a new `git reflog list` subcommand that will
print all reflogs that the refdb knows about. Skipping over reflogs
whose refs are corrupted would be quite counterproductive in this case
as the user would have no way to learn about reflogs which may still
exist in their repository to help and rescue such a corrupted ref. Thus,
the only remaining reason for why we'd want to resolve the ref is to
verify its refname.
Refactor the code to call `check_refname_format()` directly instead of
trying to resolve the ref. This is significantly more efficient given
that we don't have to hit the object database anymore to list reflogs.
And second, it ensures that we end up showing reflogs of broken refs,
which will help to make the reflog more useful.
Note that this really only impacts the case where the corresponding ref
is corrupt. Reflogs for nonexistent refs would have been returned to the
caller beforehand already as we did not pass `RESOLVE_REF_READING` to
the function, and thus `refs_resolve_ref_unsafe()` would have returned
successfully in that case.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The ref and reflog iterators share much of the same underlying code to
iterate over the corresponding entries. This results in some weird code
because the reflog iterator also exposes an object ID as well as a flag
to the callback function. Neither of these fields do refer to the reflog
though -- they refer to the corresponding ref with the same name. This
is quite misleading. In practice at least the object ID cannot really be
implemented in any other way as a reflog does not have a specific object
ID in the first place. This is further stressed by the fact that none of
the callbacks except for our test helper make use of these fields.
Split up the infrastucture so that ref and reflog iterators use separate
callback signatures. This allows us to drop the nonsensical fields from
the reflog iterator.
Note that internally, the backends still use the same shared infra to
iterate over both types. As the backends should never end up being
called directly anyway, this is not much of a problem and thus kept
as-is for simplicity's sake.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In the preceding commit we have converted the reflog iterator of the
"files" backend to be ordered, which was the only remaining ref iterator
that wasn't ordered. Refactor the ref iterator infrastructure so that we
always assume iterators to be ordered, thus simplifying the code.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When iterating through reflogs in a worktree we create a merged iterator
that merges reflogs from both refdbs. The resulting refs are ordered so
that instead we first return all worktree reflogs before we return all
common refs.
This is the only remaining case where a ref iterator returns entries in
a non-lexicographic order. The result would look something like the
following (listed with a command we introduce in a subsequent commit):
```
$ git reflog list
HEAD
refs/worktree/per-worktree
refs/heads/main
refs/heads/wt
```
So we first print the per-worktree reflogs in lexicographic order, then
the common reflogs in lexicographic order. This is confusing and not
consistent with how we print per-worktree refs, which are exclusively
sorted lexicographically.
Sort reflogs lexicographically in the same way as we sort normal refs.
As this is already implemented properly by the "reftable" backend via a
separate selection function, we simply pull out that logic and reuse it
for the "files" backend. As logs are properly sorted now, mark the
merged reflog iterator as sorted.
Tests will be added in a subsequent commit.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We use a directory iterator to return reflogs via the reflog iterator.
This iterator returns entries in the same order as readdir(3P) would and
will thus yield reflogs with no discernible order.
Set the new `DIR_ITERATOR_SORTED` flag that was introduced in the
preceding commit so that the order is deterministic. While the effect of
this can only been observed in a test tool, a subsequent commit will
start to expose this functionality to users via a new `git reflog list`
subcommand.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When copying a ref with the reftable backend we also copy the
corresponding log records. When seeking the first log record that we're
about to copy fails though we directly return from `write_copy_table()`
without doing any cleanup, leaking several allocated data structures.
Fix this by exiting via our common cleanup logic instead.
Reported-by: Jeff King <peff@peff.net> via Coverity
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Due to scalability issues, Shawn Pearce has originally proposed a new
"reftable" format more than six years ago [1]. Initially, this new
format was implemented in JGit with promising results. Around two years
ago, we have then added the "reftable" library to the Git codebase via
a4bbd13be3 (Merge branch 'hn/reftable', 2021-12-15). With this we have
landed all the low-level code to read and write reftables. Notably
missing though was the integration of this low-level code into the Git
code base in the form of a new ref backend that ties all of this
together.
This gap is now finally closed by introducing a new "reftable" backend
into the Git codebase. This new backend promises to bring some notable
improvements to Git repositories:
- It becomes possible to do truly atomic writes where either all refs
are committed to disk or none are. This was not possible with the
"files" backend because ref updates were split across multiple loose
files.
- The disk space required to store many refs is reduced, both compared
to loose refs and packed-refs. This is enabled both by the reftable
format being a binary format, which is more compact, and by prefix
compression.
- We can ignore filesystem-specific behaviour as ref names are not
encoded via paths anymore. This means there is no need to handle
case sensitivity on Windows systems or Unicode precomposition on
macOS.
- There is no need to rewrite the complete refdb anymore every time a
ref is being deleted like it was the case for packed-refs. This
means that ref deletions are now constant time instead of scaling
linearly with the number of refs.
- We can ignore file/directory conflicts so that it becomes possible
to store both "refs/heads/foo" and "refs/heads/foo/bar".
- Due to this property we can retain reflogs for deleted refs. We have
previously been deleting reflogs together with their refs to avoid
file/directory conflicts, which is not necessary anymore.
- We can properly enumerate all refs. With the "files" backend it is
not easily possible to distinguish between refs and non-refs because
they may live side by side in the gitdir.
Not all of these improvements are realized with the current "reftable"
backend implementation. At this point, the new backend is supposed to be
a drop-in replacement for the "files" backend that is used by basically
all Git repositories nowadays. It strives for 1:1 compatibility, which
means that a user can expect the same behaviour regardless of whether
they use the "reftable" backend or the "files" backend for most of the
part.
Most notably, this means we artificially limit the capabilities of the
"reftable" backend to match the limits of the "files" backend. It is not
possible to create refs that would end up with file/directory conflicts,
we do not retain reflogs, we perform stricter-than-necessary checks.
This is done intentionally due to two main reasons:
- It makes it significantly easier to land the "reftable" backend as
tests behave the same. It would be tough to argue for each and every
single test that doesn't pass with the "reftable" backend.
- It ensures compatibility between repositories that use the "files"
backend and repositories that use the "reftable" backend. Like this,
hosters can migrate their repositories to use the "reftable" backend
without causing issues for clients that use the "files" backend in
their clones.
It is expected that these artificial limitations may eventually go away
in the long term.
Performance-wise things very much depend on the actual workload. The
following benchmarks compare the "files" and "reftable" backends in the
current version:
- Creating N refs in separate transactions shows that the "files"
backend is ~50% faster. This is not surprising given that creating a
ref only requires us to create a single loose ref. The "reftable"
backend will also perform auto compaction on updates. In real-world
workloads we would likely also want to perform pack loose refs,
which would likely change the picture.
Benchmark 1: update-ref: create refs sequentially (refformat = files, refcount = 1)
Time (mean ± σ): 2.1 ms ± 0.3 ms [User: 0.6 ms, System: 1.7 ms]
Range (min … max): 1.8 ms … 4.3 ms 133 runs
Benchmark 2: update-ref: create refs sequentially (refformat = reftable, refcount = 1)
Time (mean ± σ): 2.7 ms ± 0.1 ms [User: 0.6 ms, System: 2.2 ms]
Range (min … max): 2.4 ms … 2.9 ms 132 runs
Benchmark 3: update-ref: create refs sequentially (refformat = files, refcount = 1000)
Time (mean ± σ): 1.975 s ± 0.006 s [User: 0.437 s, System: 1.535 s]
Range (min … max): 1.969 s … 1.980 s 3 runs
Benchmark 4: update-ref: create refs sequentially (refformat = reftable, refcount = 1000)
Time (mean ± σ): 2.611 s ± 0.013 s [User: 0.782 s, System: 1.825 s]
Range (min … max): 2.597 s … 2.622 s 3 runs
Benchmark 5: update-ref: create refs sequentially (refformat = files, refcount = 100000)
Time (mean ± σ): 198.442 s ± 0.241 s [User: 43.051 s, System: 155.250 s]
Range (min … max): 198.189 s … 198.670 s 3 runs
Benchmark 6: update-ref: create refs sequentially (refformat = reftable, refcount = 100000)
Time (mean ± σ): 294.509 s ± 4.269 s [User: 104.046 s, System: 190.326 s]
Range (min … max): 290.223 s … 298.761 s 3 runs
- Creating N refs in a single transaction shows that the "files"
backend is significantly slower once we start to write many refs.
The "reftable" backend only needs to update two files, whereas the
"files" backend needs to write one file per ref.
Benchmark 1: update-ref: create many refs (refformat = files, refcount = 1)
Time (mean ± σ): 1.9 ms ± 0.1 ms [User: 0.4 ms, System: 1.4 ms]
Range (min … max): 1.8 ms … 2.6 ms 151 runs
Benchmark 2: update-ref: create many refs (refformat = reftable, refcount = 1)
Time (mean ± σ): 2.5 ms ± 0.1 ms [User: 0.7 ms, System: 1.7 ms]
Range (min … max): 2.4 ms … 3.4 ms 148 runs
Benchmark 3: update-ref: create many refs (refformat = files, refcount = 1000)
Time (mean ± σ): 152.5 ms ± 5.2 ms [User: 19.1 ms, System: 133.1 ms]
Range (min … max): 148.5 ms … 167.8 ms 15 runs
Benchmark 4: update-ref: create many refs (refformat = reftable, refcount = 1000)
Time (mean ± σ): 58.0 ms ± 2.5 ms [User: 28.4 ms, System: 29.4 ms]
Range (min … max): 56.3 ms … 72.9 ms 40 runs
Benchmark 5: update-ref: create many refs (refformat = files, refcount = 1000000)
Time (mean ± σ): 152.752 s ± 0.710 s [User: 20.315 s, System: 131.310 s]
Range (min … max): 152.165 s … 153.542 s 3 runs
Benchmark 6: update-ref: create many refs (refformat = reftable, refcount = 1000000)
Time (mean ± σ): 51.912 s ± 0.127 s [User: 26.483 s, System: 25.424 s]
Range (min … max): 51.769 s … 52.012 s 3 runs
- Deleting a ref in a fully-packed repository shows that the "files"
backend scales with the number of refs. The "reftable" backend has
constant-time deletions.
Benchmark 1: update-ref: delete ref (refformat = files, refcount = 1)
Time (mean ± σ): 1.7 ms ± 0.1 ms [User: 0.4 ms, System: 1.2 ms]
Range (min … max): 1.6 ms … 2.1 ms 316 runs
Benchmark 2: update-ref: delete ref (refformat = reftable, refcount = 1)
Time (mean ± σ): 1.8 ms ± 0.1 ms [User: 0.4 ms, System: 1.3 ms]
Range (min … max): 1.7 ms … 2.1 ms 294 runs
Benchmark 3: update-ref: delete ref (refformat = files, refcount = 1000)
Time (mean ± σ): 2.0 ms ± 0.1 ms [User: 0.5 ms, System: 1.4 ms]
Range (min … max): 1.9 ms … 2.5 ms 287 runs
Benchmark 4: update-ref: delete ref (refformat = reftable, refcount = 1000)
Time (mean ± σ): 1.9 ms ± 0.1 ms [User: 0.5 ms, System: 1.3 ms]
Range (min … max): 1.8 ms … 2.1 ms 217 runs
Benchmark 5: update-ref: delete ref (refformat = files, refcount = 1000000)
Time (mean ± σ): 229.8 ms ± 7.9 ms [User: 182.6 ms, System: 46.8 ms]
Range (min … max): 224.6 ms … 245.2 ms 6 runs
Benchmark 6: update-ref: delete ref (refformat = reftable, refcount = 1000000)
Time (mean ± σ): 2.0 ms ± 0.0 ms [User: 0.6 ms, System: 1.3 ms]
Range (min … max): 2.0 ms … 2.1 ms 3 runs
- Listing all refs shows no significant advantage for either of the
backends. The "files" backend is a bit faster, but not by a
significant margin. When repositories are not packed the "reftable"
backend outperforms the "files" backend because the "reftable"
backend performs auto-compaction.
Benchmark 1: show-ref: print all refs (refformat = files, refcount = 1, packed = true)
Time (mean ± σ): 1.6 ms ± 0.1 ms [User: 0.4 ms, System: 1.1 ms]
Range (min … max): 1.5 ms … 2.0 ms 1729 runs
Benchmark 2: show-ref: print all refs (refformat = reftable, refcount = 1, packed = true)
Time (mean ± σ): 1.6 ms ± 0.1 ms [User: 0.4 ms, System: 1.1 ms]
Range (min … max): 1.5 ms … 1.8 ms 1816 runs
Benchmark 3: show-ref: print all refs (refformat = files, refcount = 1000, packed = true)
Time (mean ± σ): 4.3 ms ± 0.1 ms [User: 0.9 ms, System: 3.3 ms]
Range (min … max): 4.1 ms … 4.6 ms 645 runs
Benchmark 4: show-ref: print all refs (refformat = reftable, refcount = 1000, packed = true)
Time (mean ± σ): 4.5 ms ± 0.2 ms [User: 1.0 ms, System: 3.3 ms]
Range (min … max): 4.2 ms … 5.9 ms 643 runs
Benchmark 5: show-ref: print all refs (refformat = files, refcount = 1000000, packed = true)
Time (mean ± σ): 2.537 s ± 0.034 s [User: 0.488 s, System: 2.048 s]
Range (min … max): 2.511 s … 2.627 s 10 runs
Benchmark 6: show-ref: print all refs (refformat = reftable, refcount = 1000000, packed = true)
Time (mean ± σ): 2.712 s ± 0.017 s [User: 0.653 s, System: 2.059 s]
Range (min … max): 2.692 s … 2.752 s 10 runs
Benchmark 7: show-ref: print all refs (refformat = files, refcount = 1, packed = false)
Time (mean ± σ): 1.6 ms ± 0.1 ms [User: 0.4 ms, System: 1.1 ms]
Range (min … max): 1.5 ms … 1.9 ms 1834 runs
Benchmark 8: show-ref: print all refs (refformat = reftable, refcount = 1, packed = false)
Time (mean ± σ): 1.6 ms ± 0.1 ms [User: 0.4 ms, System: 1.1 ms]
Range (min … max): 1.4 ms … 2.0 ms 1840 runs
Benchmark 9: show-ref: print all refs (refformat = files, refcount = 1000, packed = false)
Time (mean ± σ): 13.8 ms ± 0.2 ms [User: 2.8 ms, System: 10.8 ms]
Range (min … max): 13.3 ms … 14.5 ms 208 runs
Benchmark 10: show-ref: print all refs (refformat = reftable, refcount = 1000, packed = false)
Time (mean ± σ): 4.5 ms ± 0.2 ms [User: 1.2 ms, System: 3.3 ms]
Range (min … max): 4.3 ms … 6.2 ms 624 runs
Benchmark 11: show-ref: print all refs (refformat = files, refcount = 1000000, packed = false)
Time (mean ± σ): 12.127 s ± 0.129 s [User: 2.675 s, System: 9.451 s]
Range (min … max): 11.965 s … 12.370 s 10 runs
Benchmark 12: show-ref: print all refs (refformat = reftable, refcount = 1000000, packed = false)
Time (mean ± σ): 2.799 s ± 0.022 s [User: 0.735 s, System: 2.063 s]
Range (min … max): 2.769 s … 2.836 s 10 runs
- Printing a single ref shows no real difference between the "files"
and "reftable" backends.
Benchmark 1: show-ref: print single ref (refformat = files, refcount = 1)
Time (mean ± σ): 1.5 ms ± 0.1 ms [User: 0.4 ms, System: 1.0 ms]
Range (min … max): 1.4 ms … 1.8 ms 1779 runs
Benchmark 2: show-ref: print single ref (refformat = reftable, refcount = 1)
Time (mean ± σ): 1.6 ms ± 0.1 ms [User: 0.4 ms, System: 1.1 ms]
Range (min … max): 1.4 ms … 2.5 ms 1753 runs
Benchmark 3: show-ref: print single ref (refformat = files, refcount = 1000)
Time (mean ± σ): 1.5 ms ± 0.1 ms [User: 0.3 ms, System: 1.1 ms]
Range (min … max): 1.4 ms … 1.9 ms 1840 runs
Benchmark 4: show-ref: print single ref (refformat = reftable, refcount = 1000)
Time (mean ± σ): 1.6 ms ± 0.1 ms [User: 0.4 ms, System: 1.1 ms]
Range (min … max): 1.5 ms … 2.0 ms 1831 runs
Benchmark 5: show-ref: print single ref (refformat = files, refcount = 1000000)
Time (mean ± σ): 1.6 ms ± 0.1 ms [User: 0.4 ms, System: 1.1 ms]
Range (min … max): 1.5 ms … 2.1 ms 1848 runs
Benchmark 6: show-ref: print single ref (refformat = reftable, refcount = 1000000)
Time (mean ± σ): 1.6 ms ± 0.1 ms [User: 0.4 ms, System: 1.1 ms]
Range (min … max): 1.5 ms … 2.1 ms 1762 runs
So overall, performance depends on the usecases. Except for many
sequential writes the "reftable" backend is roughly on par or
significantly faster than the "files" backend though. Given that the
"files" backend has received 18 years of optimizations by now this can
be seen as a win. Furthermore, we can expect that the "reftable" backend
will grow faster over time when attention turns more towards
optimizations.
The complete test suite passes, except for those tests explicitly marked
to require the REFFILES prerequisite. Some tests in t0610 are marked as
failing because they depend on still-in-flight bug fixes. Tests can be
run with the new backend by setting the GIT_TEST_DEFAULT_REF_FORMAT
environment variable to "reftable".
There is a single known conceptual incompatibility with the dumb HTTP
transport. As "info/refs" SHOULD NOT contain the HEAD reference, and
because the "HEAD" file is not valid anymore, it is impossible for the
remote client to figure out the default branch without changing the
protocol. This shortcoming needs to be handled in a subsequent patch
series.
As the reftable library has already been introduced a while ago, this
commit message will not go into the details of how exactly the on-disk
format works. Please refer to our preexisting technical documentation at
Documentation/technical/reftable for this.
[1]: https://public-inbox.org/git/CAJo=hJtyof=HRy=2sLP0ng0uZ4=S-DpZ5dR1aF+VHVETKG20OQ@mail.gmail.com/
Original-idea-by: Shawn Pearce <spearce@spearce.org>
Based-on-patch-by: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Instead of manually creating refs/ hierarchy on disk upon a
creation of a secondary worktree, which is only usable via the
files backend, use the refs API to populate it.
* ps/worktree-refdb-initialization:
builtin/worktree: create refdb via ref backend
worktree: expose interface to look up worktree by name
builtin/worktree: move setup of commondir file earlier
refs/files: skip creation of "refs/{heads,tags}" for worktrees
setup: move creation of "refs/" into the files backend
refs: prepare `refs_init_db()` for initializing worktree refs
Introduce a new extension "refstorage" so that we can mark a
repository that uses a non-default ref backend, like reftable.
* ps/refstorage-extension:
t9500: write "extensions.refstorage" into config
builtin/clone: introduce `--ref-format=` value flag
builtin/init: introduce `--ref-format=` value flag
builtin/rev-parse: introduce `--show-ref-format` flag
t: introduce GIT_TEST_DEFAULT_REF_FORMAT envvar
setup: introduce GIT_DEFAULT_REF_FORMAT envvar
setup: introduce "extensions.refStorage" extension
setup: set repository's formats on init
setup: start tracking ref storage format
refs: refactor logic to look up storage backends
worktree: skip reading HEAD when repairing worktrees
t: introduce DEFAULT_REPO_FORMAT prereq
Remove unused header "#include".
* en/header-cleanup:
treewide: remove unnecessary includes in source files
treewide: add direct includes currently only pulled in transitively
trace2/tr2_tls.h: remove unnecessary include
submodule-config.h: remove unnecessary include
pkt-line.h: remove unnecessary include
line-log.h: remove unnecessary include
http.h: remove unnecessary include
fsmonitor--daemon.h: remove unnecessary includes
blame.h: remove unnecessary includes
archive.h: remove unnecessary include
treewide: remove unnecessary includes in source files
treewide: remove unnecessary includes from header files
The files ref backend will create both "refs/heads" and "refs/tags" in
the Git directory. While this logic makes sense for normal repositories,
it does not for worktrees because those refs are "common" refs that
would always be contained in the main repository's ref database.
Introduce a new flag telling the backend that it is expected to create a
per-worktree ref database and skip creation of these dirs in the files
backend when the flag is set. No other backends (currently) need
worktree-specific logic, so this is the only required change to start
creating per-worktree ref databases via `refs_init_db()`.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When creating the ref database we unconditionally create the "refs/"
directory in "setup.c". This is a mandatory prerequisite for all Git
repositories regardless of the ref backend in use, because Git will be
unable to detect the directory as a repository if "refs/" doesn't exist.
We are about to add another new caller that will want to create a ref
database when creating worktrees. We would require the same logic to
create the "refs/" directory even though the caller really should not
care about such low-level details. Ideally, the ref database should be
fully initialized after calling `refs_init_db()`.
Move the code to create the directory into the files backend itself to
make it so. This means that future ref backends will also need to have
equivalent logic around to ensure that the directory exists, but it
seems a lot more sensible to have it this way round than to require
callers to create the directory themselves.
An alternative to this would be to create "refs/" in `refs_init_db()`
directly. This feels conceptually unclean though as the creation of the
refdb is now cluttered across different callsites. Furthermore, both the
"files" and the upcoming "reftable" backend write backend-specific data
into the "refs/" directory anyway, so splitting up this logic would only
make it harder to reason about.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The purpose of `refs_init_db()` is to initialize the on-disk files of a
new ref database. The function is quite inflexible right now though, as
callers can neither specify the `struct ref_store` nor can they pass any
flags.
Refactor the interface to accept both of these. This will be required so
that we can start initializing per-worktree ref databases via the ref
backend instead of open-coding the initialization in "worktree.c".
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In order to look up ref storage backends, we're currently using a linked
list of backends, where each backend is expected to set up its `next`
pointer to the next ref storage backend. This is kind of a weird setup
as backends need to be aware of other backends without much of a reason.
Refactor the code so that the array of backends is centrally defined in
"refs.c", where each backend is now identified by an integer constant.
Expose functions to translate from those integer constants to the name
and vice versa, which will be required by subsequent patches.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Each of these were checked with
gcc -E -I. ${SOURCE_FILE} | grep ${HEADER_FILE}
to ensure that removing the direct inclusion of the header actually
resulted in that header no longer being included at all (i.e. that
no other header pulled it in transitively).
...except for a few cases where we verified that although the header
was brought in transitively, nothing from it was directly used in
that source file. These cases were:
* builtin/credential-cache.c
* builtin/pull.c
* builtin/send-pack.c
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Now that `refs_delete_refs` is implemented in a generic way via the ref
transaction interfaces there are no callers left that invoke the
`delete_refs` callback anymore. Remove it from all of our backends.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Both the files and the packed-refs reference backends now use the same
generic transactions-based code to delete references. Let's pull these
implementations up into `refs_delete_refs()` to deduplicate the code.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In the `files_delete_refs()` callback function of the files backend we
implement deletion of references. This is done in two steps:
1. We lock the packed-refs file and delete all references from it in
a single transaction.
2. We delete all loose references via separate calls to
`refs_delete_ref()`.
These steps essentially duplicate the logic around locking and deletion
order that we already have in the transactional interfaces, where we do
know to lock and evict references from the packed-refs file. Despite the
fact that we duplicate the logic, it's also less efficient than if we
used a single generic transaction:
- The transactional interface knows to skip locking of the packed
refs in case they don't contain any of the refs which are about to
be deleted.
- We end up creating N+1 separate reference transactions, one for
the packed-refs file and N for the individual loose references.
Refactor the code to instead delete references via a single transaction.
As we don't assert the expected old object ID this is equivalent to the
previous behaviour, and we already do the same in the packed-refs
backend.
Despite the fact that the result is simpler to reason about, this change
also results in improved performance. The following benchmarks have been
executed in linux.git:
```
$ hyperfine -n '{rev}, packed={packed} refcount={refcount}' \
-L packed true,false -L refcount 1,1000 -L rev master,pks-ref-store-generic-delete-refs \
--setup 'git -C /home/pks/Development/git switch --detach {rev} && make -C /home/pks/Development/git -j17' \
--prepare 'printf "create refs/heads/new-branch-%d HEAD\n" $(seq {refcount}) | git -C /home/pks/Reproduction/linux.git update-ref --stdin && if test {packed} = true; then git pack-refs --all; fi' \
--warmup=10 \
'/home/pks/Development/git/bin-wrappers/git -C /home/pks/Reproduction/linux.git branch -d new-branch-{1..{refcount}}'
Benchmark 1: master packed=true refcount=1
Time (mean ± σ): 7.8 ms ± 1.6 ms [User: 3.4 ms, System: 4.4 ms]
Range (min … max): 5.5 ms … 11.0 ms 120 runs
Benchmark 2: master packed=false refcount=1
Time (mean ± σ): 7.0 ms ± 1.1 ms [User: 3.2 ms, System: 3.8 ms]
Range (min … max): 5.7 ms … 9.8 ms 180 runs
Benchmark 3: master packed=true refcount=1000
Time (mean ± σ): 283.8 ms ± 5.2 ms [User: 45.7 ms, System: 231.5 ms]
Range (min … max): 276.7 ms … 291.6 ms 10 runs
Benchmark 4: master packed=false refcount=1000
Time (mean ± σ): 284.4 ms ± 5.3 ms [User: 44.2 ms, System: 233.6 ms]
Range (min … max): 277.1 ms … 293.3 ms 10 runs
Benchmark 5: generic-delete-refs packed=true refcount=1
Time (mean ± σ): 6.2 ms ± 1.8 ms [User: 2.3 ms, System: 3.9 ms]
Range (min … max): 4.1 ms … 12.2 ms 142 runs
Benchmark 6: generic-delete-refs packed=false refcount=1
Time (mean ± σ): 7.1 ms ± 1.4 ms [User: 2.8 ms, System: 4.3 ms]
Range (min … max): 4.2 ms … 10.8 ms 157 runs
Benchmark 7: generic-delete-refs packed=true refcount=1000
Time (mean ± σ): 198.9 ms ± 1.7 ms [User: 29.5 ms, System: 165.7 ms]
Range (min … max): 196.1 ms … 201.4 ms 10 runs
Benchmark 8: generic-delete-refs packed=false refcount=1000
Time (mean ± σ): 199.7 ms ± 7.8 ms [User: 32.2 ms, System: 163.2 ms]
Range (min … max): 193.8 ms … 220.7 ms 10 runs
Summary
generic-delete-refs packed=true refcount=1 ran
1.14 ± 0.37 times faster than master packed=false refcount=1
1.15 ± 0.40 times faster than generic-delete-refs packed=false refcount=1
1.26 ± 0.44 times faster than master packed=true refcount=1
32.24 ± 9.17 times faster than generic-delete-refs packed=true refcount=1000
32.36 ± 9.29 times faster than generic-delete-refs packed=false refcount=1000
46.00 ± 13.10 times faster than master packed=true refcount=1000
46.10 ± 13.13 times faster than master packed=false refcount=1000
```
Especially in the case where we have many references we can see a clear
performance speedup of nearly 30%.
This is in contrast to the stated objecive in a27dcf89b6 (refs: make
delete_refs() virtual, 2016-09-04), where the virtual `delete_refs()`
function was introduced with the intent to speed things up rather than
making things slower. So it seems like we have outlived the need for a
virtual function.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Modify the 'readdir' loop in 'loose_fill_ref_dir' to, rather than 'stat' a
file to determine whether it is a directory or not, use 'get_dtype'.
Currently, the loop uses 'stat' to determine whether each dirent is a
directory itself or not in order to construct the appropriate ref cache
entry. If 'stat' fails (returning a negative value), the dirent is silently
skipped; otherwise, 'S_ISDIR(st.st_mode)' is used to check whether the entry
is a directory.
On platforms that include an entry's d_type in in the 'dirent' struct, this
extra 'stat' check is redundant. We can use the 'get_dtype' method to
extract this information on platforms that support it (i.e. where
NO_D_TYPE_IN_DIRENT is unset), and derive it with 'stat' on platforms that
don't. Because 'stat' is an expensive call, this confers a
modest-but-noticeable performance improvement when iterating over large
numbers of refs (approximately 20% speedup in 'git for-each-ref' in a 30k
ref repo).
Unlike other existing usage of 'get_dtype', the 'follow_symlinks' arg is set
to 1 to replicate the existing handling of symlink dirents. This
unfortunately requires calling 'stat' on the associated entry regardless of
platform, but symlinks in the loose ref store are highly unlikely since
they'd need to be created manually by a user.
Note that this patch also changes the condition for skipping creation of a
ref entry from "when 'stat' fails" to "when the d_type is anything other
than DT_REG or DT_DIR". If a dirent's d_type is DT_UNKNOWN (either because
the platform doesn't support d_type in dirents or some other reason) or
DT_LNK, 'get_dtype' will try to derive the underlying type with 'stat'. If
the 'stat' fails, the d_type will remain 'DT_UNKNOWN' and dirent will be
skipped. However, it will also be skipped if it is any other valid d_type
(e.g. DT_FIFO for named pipes, DT_LNK for a nested symlink). Git does not
handle these properly anyway, so we can safely constrain accepted types to
directories and regular files.
Signed-off-by: Victoria Dye <vdye@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Update 'cache_ref_iterator_advance' to skip over refs that are not matched
by the given prefix.
Currently, a ref entry is considered "matched" if the entry name is fully
contained within the prefix:
* prefix: "refs/heads/v1"
* entry: "refs/heads/v1.0"
OR if the prefix is fully contained in the entry name:
* prefix: "refs/heads/v1.0"
* entry: "refs/heads/v1"
The first case is always correct, but the second is only correct if the ref
cache entry is a directory, for example:
* prefix: "refs/heads/example"
* entry: "refs/heads/"
Modify the logic in 'cache_ref_iterator_advance' to reflect these
expectations:
1. If 'overlaps_prefix' returns 'PREFIX_EXCLUDES_DIR', then the prefix and
ref cache entry do not overlap at all. Skip this entry.
2. If 'overlaps_prefix' returns 'PREFIX_WITHIN_DIR', then the prefix matches
inside this entry if it is a directory. Skip if the entry is not a
directory, otherwise iterate over it.
3. Otherwise, 'overlaps_prefix' returned 'PREFIX_CONTAINS_DIR', indicating
that the cache entry (directory or not) is fully contained by or equal to
the prefix. Iterate over this entry.
Note that condition 2 relies on the names of directory entries having the
appropriate trailing slash. The existing function documentation of
'create_dir_entry' explicitly calls out the trailing slash requirement, so
this is a safe assumption to make.
This bug generally doesn't have any user-facing impact, since it requires:
1. using a non-empty prefix without a trailing slash in an iteration like
'for_each_fullref_in',
2. the callback to said iteration not reapplying the original filter (as
for-each-ref does) to ensure unmatched refs are skipped, and
3. the repository having one or more refs that match part of, but not all
of, the prefix.
However, there are some niche scenarios that meet those criteria
(specifically, 'rev-parse --bisect' and '(log|show|shortlog) --bisect'). Add
tests covering those cases to demonstrate the fix in this patch.
Signed-off-by: Victoria Dye <vdye@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Enumerating refs in the packed-refs file, while excluding refs that
match certain patterns, has been optimized.
* tb/refs-exclusion-and-packed-refs:
ls-refs.c: avoid enumerating hidden refs where possible
upload-pack.c: avoid enumerating hidden refs where possible
builtin/receive-pack.c: avoid enumerating hidden references
refs.h: implement `hidden_refs_to_excludes()`
refs.h: let `for_each_namespaced_ref()` take excluded patterns
revision.h: store hidden refs in a `strvec`
refs/packed-backend.c: add trace2 counters for jump list
refs/packed-backend.c: implement jump lists to avoid excluded pattern(s)
refs/packed-backend.c: refactor `find_reference_location()`
refs: plumb `exclude_patterns` argument throughout
builtin/for-each-ref.c: add `--exclude` option
ref-filter.c: parameterize match functions over patterns
ref-filter: add `ref_filter_clear()`
ref-filter: clear reachable list pointers after freeing
ref-filter.h: provide `REF_FILTER_INIT`
refs.c: rename `ref_filter`
The previous commit added low-level tests to ensure that the packed-refs
iterator did not enumerate excluded sections of the refspace.
However, there was no guarantee that these sections weren't being
visited, only that they were being suppressed from the output. To harden
these tests, add a trace2 counter which tracks the number of regions
skipped by the packed-refs iterator, and assert on its value.
Suggested-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When iterating through the `packed-refs` file in order to answer a query
like:
$ git for-each-ref --exclude=refs/__hidden__
it would be useful to avoid walking over all of the entries in
`refs/__hidden__/*` when possible, since we know that the ref-filter
code is going to throw them away anyways.
In certain circumstances, doing so is possible. The algorithm for doing
so is as follows:
- For each excluded pattern, find the first record that matches it,
and the first record that *doesn't* match it (i.e. the location
you'd next want to consider when excluding that pattern).
- Sort the set of excluded regions from the previous step in ascending
order of the first location within the `packed-refs` file that
matches.
- Clean up the results from the previous step: discard empty regions,
and combine adjacent regions. The set of regions which remains is
referred to as the "jump list", and never contains any references
which should be included in the result set.
Then when iterating through the `packed-refs` file, if `iter->pos` is
ever contained in one of the regions from the previous steps, advance
`iter->pos` past the end of that region, and continue enumeration.
Note that we only perform this optimization when none of the excluded
pattern(s) have special meta-characters in them. For a pattern like
"refs/foo[ac]", the excluded regions ("refs/fooa", "refs/fooc", and
everything underneath them) are not connected. A future implementation
that handles this case may split the character class (pretending as if
two patterns were excluded: "refs/fooa", and "refs/fooc").
There are a few other gotchas worth considering. First, note that the
jump list is sorted, so once we jump past a region, we can avoid
considering it (or any regions preceding it) again. The member
`jump_pos` is used to track the first next-possible region to jump
through.
Second, note that the jump list is best-effort, since we do not handle
loose references, and because of the meta-character issue above. The
jump list may not skip past all references which won't appear in the
results, but will never skip over a reference which does appear in the
result set.
In repositories with a large number of hidden references, the speed-up
can be significant. Tests here are done with a copy of linux.git with a
reference "refs/pull/N" pointing at every commit, as in:
$ git rev-list HEAD | awk '{ print "create refs/pull/" NR " " $0 }' |
git update-ref --stdin
$ git pack-refs --all
, it is significantly faster to have `for-each-ref` jump over the
excluded references, as opposed to filtering them out after the fact:
$ hyperfine \
'git for-each-ref --format="%(objectname) %(refname)" | grep -vE "^[0-9a-f]{40} refs/pull/"' \
'git.prev for-each-ref --format="%(objectname) %(refname)" --exclude="refs/pull"' \
'git.compile for-each-ref --format="%(objectname) %(refname)" --exclude="refs/pull"'
Benchmark 1: git for-each-ref --format="%(objectname) %(refname)" | grep -vE "^[0-9a-f]{40} refs/pull/"
Time (mean ± σ): 798.1 ms ± 3.3 ms [User: 687.6 ms, System: 146.4 ms]
Range (min … max): 794.5 ms … 805.5 ms 10 runs
Benchmark 2: git.prev for-each-ref --format="%(objectname) %(refname)" --exclude="refs/pull"
Time (mean ± σ): 98.9 ms ± 1.4 ms [User: 93.1 ms, System: 5.7 ms]
Range (min … max): 97.0 ms … 104.0 ms 29 runs
Benchmark 3: git.compile for-each-ref --format="%(objectname) %(refname)" --exclude="refs/pull"
Time (mean ± σ): 4.5 ms ± 0.2 ms [User: 0.7 ms, System: 3.8 ms]
Range (min … max): 4.1 ms … 5.8 ms 524 runs
Summary
'git.compile for-each-ref --format="%(objectname) %(refname)" --exclude="refs/pull"' ran
21.87 ± 1.05 times faster than 'git.prev for-each-ref --format="%(objectname) %(refname)" --exclude="refs/pull"'
176.52 ± 8.19 times faster than 'git for-each-ref --format="%(objectname) %(refname)" | grep -vE "^[0-9a-f]{40} refs/pull/"'
(Comparing stock git and this patch isn't quite fair, since an earlier
commit in this series adds a naive implementation of the `--exclude`
option. `git.prev` is built from the previous commit and includes this
naive implementation).
Using the jump list is fairly straightforward (see the changes to
`refs/packed-backend.c::next_record()`), but constructing the list is
not. To ensure that the construction is correct, add a new suite of
tests in t1419 covering various corner cases (overlapping regions,
partially overlapping regions, adjacent regions, etc.).
Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The function `find_reference_location()` is used to perform a
binary search-like function over the contents of a repository's
`$GIT_DIR/packed-refs` file.
The search it implements is unlike a standard binary search in that the
records it searches over are not of a fixed width, so the comparison
must locate the end of a record before comparing it.
Extract the core routine of `find_reference_location()` in order to
implement a function in the following patch which will find the first
location in the `packed-refs` file that *doesn't* match the given
pattern.
The behavior of `find_reference_location()` is unchanged.
Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The subsequent patch will want to access an optional `excluded_patterns`
array within `refs/packed-backend.c` that will cull out certain
references matching any of the given patterns on a best-effort basis.
To do so, the refs subsystem needs to be updated to pass this value
across a number of different locations.
Prepare for a future patch by introducing this plumbing now, passing
NULLs at top-level APIs in order to make that patch less noisy and more
easily readable.
Signed-off-by: Taylor Blau <me@ttaylorr.co>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This also made it clear that several .c files that depended upon path.h
were missing a #include for it; add the missing includes while at it.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Since this header showed up in some places besides just #include
statements, update/clean-up/remove those other places as well.
Note that compat/fsmonitor/fsm-path-utils-darwin.c previously got
away with violating the rule that all files must start with an include
of git-compat-util.h (or a short-list of alternate headers that happen
to include it first). This change exposed the violation and caused it
to stop building correctly; fix it by having it include
git-compat-util.h first, as per policy.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
These functions do not depend upon struct cache_entry or struct
index_state in any way, and it seems more logical to break them out into
this file, especially since statinfo.h already has the struct stat_data
declaration.
Diff best viewed with `--color-moved`.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
"git pack-refs" learns "--include" and "--exclude" to tweak the ref
hierarchy to be packed using pattern matching.
* jc/pack-ref-exclude-include:
pack-refs: teach pack-refs --include option
pack-refs: teach --exclude option to exclude refs from being packed
docs: clarify git-pack-refs --all will pack all refs
Allow users to be more selective over which refs to pack by adding an
--include option to git-pack-refs.
The existing options allow some measure of selectivity. By default
git-pack-refs packs all tags. --all can be used to include all refs,
and the previous commit added the ability to exclude certain refs with
--exclude.
While these knobs give the user some selection over which refs to pack,
it could be useful to give more control. For instance, a repository may
have a set of branches that are rarely updated and would benefit from
being packed. --include would allow the user to easily include a set of
branches to be packed while leaving everything else unpacked.
Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
At GitLab, we have a system that creates ephemeral internal refs that
don't live long before getting deleted. Having an option to exclude
certain refs from a packed-refs file allows these internal references to
be deleted much more efficiently.
Add an --exclude option to the pack-refs builtin, and use the ref
exclusions API to exclude certain refs from being packed into the final
packed-refs file
Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
More header clean-up.
* en/header-split-cache-h-part-2: (22 commits)
reftable: ensure git-compat-util.h is the first (indirect) include
diff.h: reduce unnecessary includes
object-store.h: reduce unnecessary includes
commit.h: reduce unnecessary includes
fsmonitor: reduce includes of cache.h
cache.h: remove unnecessary headers
treewide: remove cache.h inclusion due to previous changes
cache,tree: move basic name compare functions from read-cache to tree
cache,tree: move cmp_cache_name_compare from tree.[ch] to read-cache.c
hash-ll.h: split out of hash.h to remove dependency on repository.h
tree-diff.c: move S_DIFFTREE_IFXMIN_NEQ define from cache.h
dir.h: move DTYPE defines from cache.h
versioncmp.h: move declarations for versioncmp.c functions from cache.h
ws.h: move declarations for ws.c functions from cache.h
match-trees.h: move declarations for match-trees.c functions from cache.h
pkt-line.h: move declarations for pkt-line.c functions from cache.h
base85.h: move declarations for base85.c functions from cache.h
copy.h: move declarations for copy.c functions from cache.h
server-info.h: move declarations for server-info.c functions from cache.h
packfile.h: move pack_window and pack_entry from cache.h
...
Mark strtok() and strtok_r() to be banned.
* tb/ban-strtok:
banned.h: mark `strtok()` and `strtok_r()` as banned
t/helper/test-json-writer.c: avoid using `strtok()`
t/helper/test-oidmap.c: avoid using `strtok()`
t/helper/test-hashmap.c: avoid using `strtok()`
string-list: introduce `string_list_setlen()`
string-list: multi-delimiter `string_list_split_in_place()`
Enhance `string_list_split_in_place()` to accept multiple characters as
delimiters instead of a single character.
Instead of using `strchr(2)` to locate the first occurrence of the given
delimiter character, `string_list_split_in_place_multi()` uses
`strcspn(2)` to move past the initial segment of characters comprised of
any characters in the delimiting set.
When only a single delimiting character is provided, `strpbrk(2)` (which
is implemented with `strcspn(2)`) has equivalent performance to
`strchr(2)`. Modern `strcspn(2)` implementations treat an empty
delimiter or the singleton delimiter as a special case and fall back to
calling strchrnul(). Both glibc[1] and musl[2] implement `strcspn(2)`
this way.
This change is one step to removing `strtok(2)` from the tree. Note that
`string_list_split_in_place()` is not a strict replacement for
`strtok()`, since it will happily turn sequential delimiter characters
into empty entries in the resulting string_list. For example:
string_list_split_in_place(&xs, "foo:;:bar:;:baz", ":;", -1)
would yield a string list of:
["foo", "", "", "bar", "", "", "baz"]
Callers that wish to emulate the behavior of strtok(2) more directly
should call `string_list_remove_empty_items()` after splitting.
To avoid regressions for the new multi-character delimter cases, update
t0063 in this patch as well.
[1]: https://sourceware.org/git/?p=glibc.git;a=blob;f=string/strcspn.c;hb=glibc-2.37#l35
[2]: https://git.musl-libc.org/cgit/musl/tree/src/string/strcspn.c?h=v1.2.3#n11
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
hash.h depends upon and includes repository.h, due to the definition and
use of the_hash_algo (defined as the_repository->hash_algo). However,
most headers trying to include hash.h are only interested in the layout
of the structs like object_id. Move the parts of hash.h that do not
depend upon repository.h into a new file hash-ll.h (the "low level"
parts of hash.h), and adjust other files to use this new header where
the convenience inline functions aren't needed.
This allows hash.h and object.h to be fairly small, minimal headers. It
also exposes a lot of hidden dependencies on both path.h (which was
brought in by repository.h) and repository.h (which was previously
implicitly brought in by object.h), so also adjust other files to be
more explicit about what they depend upon.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We had a handful of headers including cache.h that didn't need to
anymore. Drop those includes and replace them with includes of
smaller files, or forward declarations. However, note that two .c
files now need to directly include cache.h, though they should have
been including it all along given they are directly using structs
defined in it.
Signed-off-by: Elijah Newren <newren@gmail.com>
Acked-by: Calvin Wan <calvinwan@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
cache.h's nature of a dumping ground of includes prevented it from
being included in some compat/ files, forcing us into a workaround
of having a double forward declaration of the read_in_full() function
(see commit 14086b0a13 ("compat/pread.c: Add a forward declaration to
fix a warning", 2007-11-17)). Now that we have moved functions like
read_in_full() from cache.h to wrapper.h, and wrapper.h isn't littered
with unrelated and scary #defines, get rid of the extra forward
declaration and just have compat/pread.c include wrapper.h.
Signed-off-by: Elijah Newren <newren@gmail.com>
Acked-by: Calvin Wan <calvinwan@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>