The coming index format change doesn't allow for the number of objects
to be determined from the size of the index file directly. Instead, Let's
initialize a field in the packed_git structure with the object count when
the index is validated since the count is always known at that point.
While at it let's reorder some struct packed_git fields to avoid padding
due to needed 64-bit alignment for some of them.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
Prior to 1.5.0 the git-lost-found utility was useful to locate
commits that were not referenced by any ref. These were often
amends, or resets, or tips of branches that had been deleted.
Being able to locate a 'lost' commit and recover it by creating a
new branch was a useful feature in those days.
Unfortunately 1.5.0 added the reflogs to the reachability analysis
performed by git-fsck, which means that most commits users would
consider to be lost are still reachable through a reflog. So most
(or all!) commits are reachable, and nothing gets output from
git-lost-found.
Now git-fsck can be told to ignore reflogs during its reachability
analysis, making git-lost-found useful again to locate commits
that are no longer referenced by a ref itself, but may still be
referenced by a reflog.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
Let's avoid the open coded pack index reference in pack-object and use
nth_packed_object_sha1() instead. This will help encapsulating index
format differences in one place.
And while at it there is no reason to copy SHA1's over and over while a
direct pointer to it in the index will do just fine.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Acked-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This removes slightly more lines than it adds, but the real reason for
doing this is that future optimizations will require more setup of the
tree descriptor, and so we want to do it in one place.
Also renamed the "desc.buf" field to "desc.buffer" just to trigger
compiler errors for old-style manual initializations, making sure I
didn't miss anything.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
As we permit up to 2^32-1 objects in a single packfile we cannot
use a signed int to represent the object offset within a packfile,
after 2^31-1 objects we will start seeing negative indexes and
error out or compute bad addresses within the mmap'd index.
This is a minor cleanup that does not introduce any significant
logic changes. It is roach free.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
git-fsck always exited with status 0, which was a bit sloppy.
This makes it exit with a non-zero status when errors are
found. The error code is an OR'ed result of:
1 if corrupted objects are found.
2 if objects that are ought to be reachable are missing or corrupt.
For example, it would exit with 1 in a repository with an
unreachable corrupt object. If a tree object of the HEAD commit
is corrupt, you would get 3.
Signed-off-by: Junio C Hamano <junkio@cox.net>
When "git fsck" without --full found a loose object missing
because it was broken, it mistakenly thought it was not parsed
because we found it in one of the packs. Back when this code
was written, we did not have a way to explicitly check if we
have the object in pack, but we do now.
Signed-off-by: Junio C Hamano <junkio@cox.net>
This mechanically converts strncmp() to use prefixcmp(), but only when
the parameters match specific patterns, so that they can be verified
easily. Leftover from this will be fixed in a separate step, including
idiotic conversions like
if (!strncmp("foo", arg, 3))
=>
if (!(-prefixcmp(arg, "foo")))
This was done by using this script in px.perl
#!/usr/bin/perl -i.bak -p
if (/strncmp\(([^,]+), "([^\\"]*)", (\d+)\)/ && (length($2) == $3)) {
s|strncmp\(([^,]+), "([^\\"]*)", (\d+)\)|prefixcmp($1, "$2")|;
}
if (/strncmp\("([^\\"]*)", ([^,]+), (\d+)\)/ && (length($1) == $3)) {
s|strncmp\("([^\\"]*)", ([^,]+), (\d+)\)|(-prefixcmp($2, "$1"))|;
}
and running:
$ git grep -l strncmp -- '*.c' | xargs perl px.perl
Signed-off-by: Junio C Hamano <junkio@cox.net>
The earlier change df391b192 to rename fsck-objects to fsck broke
fsck-objects. This should fix it again.
Signed-off-by: Mark Wooding <mdw@distorted.org.uk>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This separates the connectivity check into separate codepaths,
one for reachable objects and the other for unreachable ones,
while adding a lot of comments to explain what is going on.
When checking an unreachable object, unlike a reachable one, we
do not have to complain if it does not exist (we used to
complain about a missing blob even when the only thing that
references it is a tree that is dangling). Also we do not have
to check and complain about objects that are referenced by an
unreachable object.
This makes the messages from fsck-objects a lot less noisy and
more useful.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This fixes another problem that Andy's case showed: git-fsck-objects
reports nonsensical results for corrupt objects.
There were actually two independent and confusing problems:
- when we had a zero-sized file and used map_sha1_file, mmap() would
return EINVAL, and git-fsck-objects would report that as an insane and
confusing error. I don't know when this was introduced, it might have
been there forever.
- when "parse_object()" returned NULL, fsck would say "object not found",
which can be very confusing, since obviously the object might "exist",
it's just unparseable because it's totally corrupt.
So this just makes "xmmap()" return NULL for a zero-sized object (which is
a valid thing pointer, exactly the same way "malloc()" can return NULL for
a zero-sized allocation). That fixes the first problem (but we could have
fixed it in the caller too - I don't personally much care whichever way it
goes, but maybe somebody should check that the NO_MMAP case does
something sane in this case too?).
And the second problem is solved by just making the error message slightly
clearer - the failure to parse an object may be because it's missing or
corrupt, not necessarily because it's not "found".
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
It used to ignore the return value of the helper function; now, it
expects it to return 0, and stops iteration upon non-zero return
values; this value is then passed on as the return value of
for_each_reflog_ent().
Further, it makes no sense to force the parsing upon the helper
functions; for_each_reflog_ent() now calls the helper function with
old and new sha1, the email, the timestamp & timezone, and the message.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This is a mechanical clean-up of the way *.c files include
system header files.
(1) sources under compat/, platform sha-1 implementations, and
xdelta code are exempt from the following rules;
(2) the first #include must be "git-compat-util.h" or one of
our own header file that includes it first (e.g. config.h,
builtin.h, pkt-line.h);
(3) system headers that are included in "git-compat-util.h"
need not be included in individual C source files.
(4) "git-compat-util.h" does not have to include subsystem
specific header files (e.g. expat.h).
Signed-off-by: Junio C Hamano <junkio@cox.net>
This adds a "int *flag" parameter to resolve_ref() and makes
for_each_ref() family to call callback function with an extra
"int flag" parameter. They are used to give two bits of
information (REF_ISSYMREF and REF_ISPACKED) about the ref.
Signed-off-by: Junio C Hamano <junkio@cox.net>
This is a long overdue fix to the API for for_each_ref() family
of functions. It allows the callers to specify a callback data
pointer, so that the caller does not have to use static
variables to communicate with the callback funciton.
The updated for_each_ref() family takes a function of type
int (*fn)(const char *, const unsigned char *, void *)
and a void pointer as parameters, and calls the function with
the name of the ref and its SHA-1 with the caller-supplied void
pointer as parameters.
The commit updates two callers, builtin-name-rev.c and
builtin-pack-refs.c as an example.
Signed-off-by: Junio C Hamano <junkio@cox.net>
Like xmalloc and xrealloc xstrdup dies with a useful message if
the native strdup() implementation returns NULL rather than a
valid pointer.
I just tried to use xstrdup in new code and found it to be missing.
However I expected it to be present as xmalloc and xrealloc are
already commonly used throughout the code.
[jc: removed the part that deals with last_XXX, which I am
finding more and more dubious these days.]
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
The comment added says it all: if we have lost all references in a git
archive, git-fsck-objects should still work, so instead of dying it should
just notify the user about that condition.
This change was triggered by me just doing a "git-init-db" and then
populating that empty git archive with a pack/index file to look at it.
Having git-fsck-objects not work just because I didn't have any references
handy was rather irritating, since part of the reason for running
git-fsck-objects in the first place was to _find_ the missing references.
However, "--unreachable" really doesn't make sense in that situation, and
we want to turn it off to protect anybody who uses the old "git prune"
shell-script (rather than the modern built-in). The old pruning script
used to remove all objects that were reported as unreachable, and without
any refs, that obviously means everything - not worth it.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This abstracts away the size of the hash values when copying them
from memory location to memory location, much as the introduction
of hashcmp abstracted away hash value comparsion.
A few call sites were using char* rather than unsigned char* so
I added the cast rather than open hashcpy to be void*. This is a
reasonable tradeoff as most call sites already use unsigned char*
and the existing hashcmp is also declared to be unsigned char*.
[jc: Splitted the patch to "master" part, to be followed by a
patch for merge-recursive.c which is not in "master" yet.
Fixed the cast in the latter hunk to combine-diff.c which was
wrong in the original.
Also converted ones left-over in combine-diff.c, diff-lib.c and
upload-pack.c ]
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
[jc: I needed to hand merge the changes to the updated codebase,
so the result needs to be checked.]
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Junio C Hamano <junkio@cox.net>
Replace sha1 comparisons to null_sha1 with a global inline (which previously an
unused static inline in builtin-apply.c)
[jc: with a fix from Jonas Fonseca.]
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This updates the type-enumeration constants introduced to reduce
the memory footprint of "struct object" to match the type bits
already used in the packfile format, by removing the former
(i.e. TYPE_* constant macros) and using the latter (i.e. enum
object_type) throughout the code for consistency.
Eventually we can stop passing around the "type strings"
entirely, and this will help - no confusion about two different
integer enumeration.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
There are a few special places where some programs accessed the object
hash array directly, which bothered me because I wanted to play with some
simple re-organizations.
So this patch makes the object hash array data structures all entirely
local to object.c, and the few users who wanted to look at it now get to
use a function to query how many object index entries there can be, and to
actually access the array.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This shrinks "struct object" to the absolutely minimal size possible.
It now contains /only/ the object flags and the SHA1 hash name of the
object.
The "refs" field, which is really needed only for fsck, is maintained in
a separate hashed lookup-table, allowing all normal users to totally
ignore it.
This helps memory usage, although not as much as I hoped: it looks like
the allocation overhead of malloc (and the alignment constraints in
particular) means that while the structure size shrinks, the actual
allocation overhead mostly does not.
[ That said: memory usage is actually down, but not as much as it should
be: I suspect just one of the object types actually ended up shrinking
its effective allocation size.
To get to the next level, we probably need specialized allocators that
don't pad the allocation more than necessary. ]
The separation makes for some code cleanup, though, and makes the ref
tracking that fsck wants a clearly separate thing.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This shrinks "struct object" by a small amount, by getting rid of the
"struct type *" pointer and replacing it with a 3-bit bitfield instead.
In addition, we merge the bitfields and the "flags" field, which
incidentally should also remove a useless 4-byte padding from the object
when in 64-bit mode.
Now, our "struct object" is still too damn large, but it's now less
obviously bloated, and of the remaining fields, only the "util" (which is
not used by most things) is clearly something that should be eventually
discarded.
This shrinks the "git-rev-list --all" memory use by about 2.5% on the
kernel archive (and, perhaps more importantly, on the larger mozilla
archive). That may not sound like much, but I suspect it's more on a
64-bit platform.
There are other remaining inefficiencies (the parent lists, for example,
probably have horrible malloc overhead), but this was pretty obvious.
Most of the patch is just changing the comparison of the "type" pointer
from one of the constant string pointers to the appropriate new TYPE_xxx
small integer constant.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
Prime example of where the raw tree parser is easier for everybody.
[jc: "Aieee" one-liner fix from the list applied. ]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
Instead, just use the tree buffer directly, and use the tree-walk
infrastructure to walk the buffers instead of the tree-entry list.
The tree-entry list is inefficient, and generates tons of small
allocations for no good reason. The tree-walk infrastructure is
generally no harder to use than following a linked list, and allows
us to do most tree parsing in-place.
Some programs still use the old tree-entry lists, and are a bit
painful to convert without major surgery. For them we have a helper
function that creates a temporary tree-entry list on demand.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This is preparatory work for further cleanups, where we try to make
tree_entry look more like the more efficient tree-walk descriptor.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This allows us to avoid allocating information for names etc, because
we can just use the information from the tree buffer directly.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This finally removes the tree-entry list from "struct tree", since most of
the users can just use the tree-walk infrastructure to walk the raw tree
buffers instead of the tree-entry list.
The tree-entry list is inefficient, and generates tons of small
allocations for no good reason. The tree-walk infrastructure is generally
no harder to use than following a linked list, and allows us to do most
tree parsing in-place.
Some programs still use the old tree-entry lists, and are a bit painful to
convert without major surgery. For them we have a helper function that
creates a temporary tree-entry list on demand. We can convert those too
eventually, but with this they no longer affect any users who don't need
the explicit lists.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This is preparatory work for further cleanups, where we try to make
tree_entry look more like the more efficient tree-walk descriptor.
Instead of having a union of pointers to blob/tree/objects, this just
makes "struct tree_entry" have the raw SHA1, and makes all the users use
that instead (often that implies adding a "lookup_tree(..)" on the sha1,
but sometimes the user just wanted the SHA1 in the first place, and it
just avoids an unnecessary indirection).
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This allows us to avoid allocating information for names etc, because
we can just use the information from the tree buffer directly.
We still keep the old "tree_entry_list" in struct tree as well, so old
users aren't affected, apart from the fact that the allocations are
different (if you free a tree entry, you should no longer free the name
allocation for it, since it's allocated as part of "tree->buffer")
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
The fsck-objects command (back then it was called fsck-cache)
used to complain if objects referred to by files in .git/refs/
or objects stored in files under .git/objects/??/ were not found
as stand-alone SHA1 files (i.e. found in alternate object pools
or packed archives stored under .git/objects/pack). Back then,
packs and alternates were new curiosity and having everything as
loose objects were the norm.
When we adjusted the behaviour of fsck-cache to consider objects
found in packs are OK, we introduced the --standalone flag as a
backward compatibility measure.
It still correctly checks if your repository is complete and
consists only of loose objects, so in that sense it is doing the
"right" thing, but checking that is pointless these days. This
commit removes --standalone flag.
See also:
23676d407c8a498a05c3
Signed-off-by: Junio C Hamano <junkio@cox.net>
- Fix -Wundef -Wold-style-definition warnings
- Make pll_free() static
[jc: original patch by Timo had another unrelated bits:
- Use setenv() instead of putenv()
I'm postponing that part for now.]
Signed-off-by: Timo Hirvonen <tihirvon@gmail.com>
Signed-off-by: Junio C Hamano <junkio@cox.net>
In a simple test, this brings down the CPU time from 47 sec to 22 sec.
Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Junio C Hamano <junkio@cox.net>
The d_ino field is only used for performance reasons in
fsck-objects. On a typical filesystem, i-number tends to have a
strong correlation with where the actual bits sit on the disk
platter, and we sort the entries to allow us scan things that
ought to be close together together.
If the platform lacks support for it, it is not a big deal.
Just do not use d_ino for sorting, and scan them unsorted.
Signed-off-by: Junio C Hamano <junkio@cox.net>
Store pointers to referenced objects in a variable sized array instead
of linked list. This cuts down memory usage of utilities which use
object references; e.g., git-fsck-objects --full on the git.git
repository consumes about 2 MB of memory tracked by Massif instead of
7 MB before the change. Object refs are still the biggest consumer of
memory (57%), but the malloc overhead for a single block instead of a
linked list is substantially smaller.
Signed-off-by: Sergey Vlasov <vsu@altlinux.ru>
Signed-off-by: Junio C Hamano <junkio@cox.net>
The Massif tool of Valgrind revealed that parsed tree entries occupy
more than 60% of memory allocated by git-fsck-objects. These entries
can be freed immediately after use, which significantly decreases
memory consumption.
Signed-off-by: Sergey Vlasov <vsu@altlinux.ru>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This makes it possible to have a "sparse" git object subdirectory
structure, something that has become much more attractive now that people
use pack-files all the time.
As a result of pack-files, a git object directory doesn't necessarily have
any individual objects lying around, and in that case it's just wasting
space to keep the empty first-level object directories around: on many
filesystems the 256 empty directories will be aboue 1MB of diskspace.
Even more importantly, after you re-pack a project that _used_ to be
unpacked, you could be left with huge directories that no longer contain
anything, but that waste space and take time to look through.
With this change, "git prune-packed" can just do an rmdir() on the
directories, and they'll get removed if empty, and re-created on demand.
This patch also tries to fix up "write_sha1_from_fd()" to use the new
common infrastructure for creating the object files, closing a hole where
we might otherwise leave half-written objects in the object database.
[jc: I unoptimized the part that really removes the fan-out directories
to ease transition. init-db still wastes 1MB of diskspace to hold 256
empty fan-outs, and prune-packed rmdir()'s the grown but empty directories,
but runs mkdir() immediately after that -- reducing the saving from 150KB
to 146KB. These parts will be re-introduced when everybody has the
on-demand capability.]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This adds the counterpart of git-update-ref that lets you read
and create "symbolic refs". By default it uses a symbolic link
to represent ".git/HEAD -> refs/heads/master", but it can be compiled
to use the textfile symbolic ref.
The places that did 'readlink .git/HEAD' and 'ln -s refs/heads/blah
.git/HEAD' have been converted to use new git-symbolic-ref command, so
that they can deal with either implementation.
Signed-off-by: Junio C Hamano <junio@twinsun.com>