We are dealing with the 'istate' index, not 'the_index'.
Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This allows for optionally getting the size of the returned data and
will be used in a follow-up patch.
Signed-off-by: Lukas Fleischer <git@cryptocrack.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Extract the read_index_data() function from attr.c and move it to
read-cache.c; rename it to read_blob_data_from_index() and update
the function signature of it to align better with index/cache API
functions.
This allows for reusing the function in convert.c later.
Signed-off-by: Lukas Fleischer <git@cryptocrack.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
With core.ignorecase=true, name-hash.c builds a case insensitive index of
all tracked directories. Currently, the existing cache entry structures are
added multiple times to the same hashtable (with different name lengths and
hash codes). However, there's only one dir_next pointer, which gets
completely messed up in case of hash collisions. In the worst case, this
causes an endless loop if ce == ce->dir_next (see t7062).
Use a separate hashtable and separate structures for the directory index
so that each directory entry has its own next pointer. Use reference
counting to track which directory entry contains files.
There are only slight changes to the name-hash.c API:
- new free_name_hash() used by read_cache.c::discard_index()
- remove_name_hash() takes an additional index_state parameter
- index_name_exists() for a directory (trailing '/') may return a cache
entry that has been removed (CE_UNHASHED). This is not a problem as the
return value is only used to check if the directory exists (dir.c) or to
normalize casing of directory names (read-cache.c).
Getting rid of cache_entry.dir_next reduces memory consumption, especially
with core.ignorecase=false (which doesn't use that member at all).
With core.ignorecase=true, building the directory index is slightly faster
as we add / check the parent directory first (instead of going through all
directory levels for each file in the index). E.g. with WebKit (~200k
files, ~7k dirs), time spent in lazy_init_name_hash is reduced from 176ms
to 130ms.
Signed-off-by: Karsten Blees <blees@dcon.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
9d22778 (read-cache.c: write prefix-compressed names in the index -
2012-04-04) defined these. Interestingly, they were not used by
read-cache.c, or anywhere in that patch. They were used in
builtin/update-index.c later for checking supported index
versions. Use them here too.
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Specifically the fields uid, gid, ctime, ino and dev are set to zero
by JGit. Other implementations, eg. Git in cygwin are allegedly also
somewhat incompatible with Git For Windows and on *nix platforms
the resolution of the timestamps may differ.
Any stat checking by git will then need to check content, which may
be very slow, particularly on Windows. Since mtime and size
is typically enough we should allow the user to tell git to avoid
checking these fields if they are set to zero in the index.
This change introduces a core.checkstat config option where the
the user can select to check all fields (default), or just size
and the whole second part of mtime (minimal).
Signed-off-by: Robin Rosenberg <robin.rosenberg@dewire.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
These assignments comes from the very first commit e83c516 (Initial
revision of "git", the information manager from hell - 2005-04-07).
Back then we did not die() when errors happened so correct errno was
required.
Since 5d1a5c0 ([PATCH] Better error reporting for "git status" -
2005-10-01), read_index_from() learned to die rather than just return
-1 and these assignments became irrelevant. Remove them.
While at it, move die_errno() next to xmmap() call because it's the
mmap's error code that we care about. Otherwise if close(fd); fails,
it could overwrite mmap's errno.
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We should never need to write the null sha1 into an index
entry (short of the 1 in 2^160 chance that somebody actually
has content that hashes to it). If we attempt to do so, it
is much more likely that it is a bug, since we use the null
sha1 as a sentinel value to mean "not valid".
The presence of null sha1s in the index (which can come
from, among other things, "update-index --cacheinfo", or by
reading a corrupted tree) can cause problems for later
readers, because they cannot distinguish the literal null
sha1 from its use a sentinel value. For example, "git
diff-files" on such an entry would make it appear as if it
is stat-dirty, and until recently, the diff code assumed
such an entry meant that we should be diffing a working tree
file rather than a blob.
Ideally, we would stop such entries from entering even our
in-core index. However, we do sometimes legitimately add
entries with null sha1s in order to represent these sentinel
situations; simply forbidding them in add_index_entry breaks
a lot of the existing code. However, we can at least make
sure that our in-core sentinel representation never makes it
to disk.
To be thorough, we will test an attempt to add both a blob
and a submodule entry. In the former case, we might run into
problems anyway because we will be missing the blob object.
But in the latter case, we do not enforce connectivity
across gitlink entries, making this our only point of
enforcement. The current implementation does not care which
type of entry we are seeing, but testing both cases helps
future-proof the test suite in case that changes.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Strip the name length from the ce_flags field and move it
into its own ce_namelen field in struct cache_entry. This
will both give us a tiny bit of a performance enhancement
when working with long pathnames and is a refactoring for
more readability of the code.
It enhances readability, by making it more clear what
is a flag, and where the length is stored and make it clear
which functions use stages in comparisions and which only
use the length.
It also makes CE_NAMEMASK private, so that users don't
mistakenly write the name length in the flags.
Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We failed to use ce_namelen() equivalent and instead only compared
up to the CE_NAMEMASK bytes by mistake. Adding an overlong path
that shares the same common prefix as an existing entry in the index
did not add a new entry, but instead replaced the existing one, as
the result.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Replace strlen(ce->name) with ce_namelen() in a couple
of places which gives us some additional bits of
performance.
Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Teach the code to write the index in the v4 on-disk format.
Record the format version of the on-disk index we read from in the
index_state, and use the format when writing the new index out.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Because the entries are sorted by path, adjacent entries in the index tend
to share the leading components of them, and it makes sense to only store
the differences in later entries. In the v4 on-disk format of the index,
each on-disk cache entry stores the number of bytes to be stripped from
the end of the previous name, and the bytes to append to the result, to
come up with its name.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The function is the one that is reading from the data stream. It only is
natural to make it responsible for reporting this number, not the caller.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Both the on-disk format v2 and v3 pads the "name" field to the multiple of
eight to make sure that various quantities in network long/short type can
be accessed with ntohl/ntohs without having to worry about alignment, but
this forces us to waste disk I/O bandwidth.
Introduce ntoh_s()/ntoh_l() macros that the callers can use as if they were
the regular ntohs()/ntohl() on a field that may not be aligned correctly.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The on-disk format of the index file is a detail whose implementation is
neatly encapsulated in read-cache.c; there is no need to expose it to the
general public that include the cache.h header file.
Also add a prominent mark to read-cache.c to delineate the parts that deal
with the index file I/O routines from the remainder of the file.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The read-cache implementation defines this static function,
but it is a generally useful concept in git. Let's give
the empty blob the same treatment as the empty tree,
providing both hex and binary forms of the sha1.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When running "git add --refresh <pathspec>", we incorrectly showed the
path that is unmerged even if it is outside the specified pathspec, even
though we did honor pathspec and refreshed only the paths that matched.
Note that this cange does not affect "git update-index --refresh"; for
hysterical raisins, it does not take a pathspec (it takes real paths) and
more importantly itss command line options are parsed and executed one by
one as they are encountered, so "git update-index --refresh foo" means
"first refresh the index, and then update the entry 'foo' by hashing the
contents in file 'foo'", not "refresh only entry 'foo'".
Signed-off-by: Junio C Hamano <gitster@pobox.com>
If you have a deleted file and a porcelain refreshes the
cache, we print:
Unstaged changes after reset:
M file
This is technically correct, in that the file is modified,
but it's friendlier to the user if we further differentiate
the case of a deleted file (especially because this output
looks a lot like "diff --name-status", which would also make
the distinction).
Similarly, we can distinguish typechanges ("T") and
intent-to-add files ("A"), both of which appear as just "M"
in the current output.
The plumbing output for all cases remains "needs update" for
historical compatibility.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When refreshing the index, for modified (or unmerged) files we will print
"needs update" (or "needs merge") for plumbing, or line similar to the
output from "diff --name-status" for porcelain.
The variables holding which type of message to show are named after the
plumbing messages. However, as we begin to differentiate more cases at the
porcelain level (with the plumbing message staying the same), that naming
scheme will become awkward.
Instead, name the variables after which case we found (modified or
unmerged), not what we will output.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This will enable refresh_cache to differentiate more cases
of modification (such as typechange) when telling the user
what isn't fresh.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The code to estimate the in-memory size of the index based on its on-disk
representation is subtly wrong for certain architecture-dependent struct
layouts. Instead of fixing it, replace the code to keep the index entries
in a single large block of memory and allocate each entry separately
instead. This is both simpler and more flexible, as individual entries
can now be freed. Actually using that added flexibility is left for a
later patch.
Suggested-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
estimate_cache_size() tries to guess how much memory is needed for the
in-memory representation of an index file. It does that by using the
file size, the number of entries and the difference of the sizes of the
on-disk and in-memory structs -- without having to check the length of
the name of each entry, which varies for each entry, but their sums are
the same no matter the representation.
Except there can be a difference. First of all, the size is really
calculated by ce_size and ondisk_ce_size based on offsetof(..., name),
not sizeof, which can be different. And entries are padded with 1 to 8
NULs at the end (after the variable name) to make their total length a
multiple of eight.
So in order to allocate enough memory to hold the index, change the
delta calculation to be based on offsetof(..., name) and round up to
the next multiple of eight.
On a 32-bit Linux, this delta was used before:
sizeof(struct cache_entry) == 72
sizeof(struct ondisk_cache_entry) == 64
---
8
The actual difference for an entry with a filename length of one was,
however (find the definitions are in cache.h):
offsetof(struct cache_entry, name) == 72
offsetof(struct ondisk_cache_entry, name) == 62
ce_size == (72 + 1 + 8) & ~7 == 80
ondisk_ce_size == (62 + 1 + 8) & ~7 == 64
---
16
So eight bytes less had been allocated for such entries. The new
formula yields the correct delta:
(72 - 62 + 7) & ~7 == 16
Reported-by: John Hsing <tsyj2007@gmail.com>
Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
verify_dotfile() currently assumes that the path seperator is '/', but on
Windows it can also be '\\', so use is_dir_sep() instead.
Signed-off-by: Theo Niessink <theo@taletn.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We simply want to say "At a directory boundary, be careful with a name
that begins with a dot, forbid a name that ends with the boundary
character or has duplicated bounadry characters".
Signed-off-by: Junio C Hamano <gitster@pobox.com>
If someone manage to create a repo with a 'C:' entry in the
root-tree, files can be written outside of the working-dir. This
opens up a can-of-worms of exploits.
Fix it by explicitly checking for a dos drive prefix when verifying
a paht. While we're at it, make sure that paths beginning with '\' is
considered absolute as well.
Noticed-by: Theo Niessink <theo@taletn.com>
Signed-off-by: Erik Faye-Lund <kusmabite@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The "format_check" parameter tucked after the existing parameters is too
ugly an afterthought to live in any reasonable API.
Combine it with the other boolean parameter "write_object" into a single
"flags" parameter.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Traditional "opportunistic index update" done by read-only "diff" and
"status" was about updating cached lstat(2) information in the index for
the next round. We missed another obvious optimization opportunity: when
there are racily clean entries that will cease to be racily clean by
updating $GIT_INDEX_FILE. Detect that case and write $GIT_INDEX_FILE out
to give it a newer timestamp.
Noticed by Lasse Makholm by stracing "git status" in a fresh checkout and
counting the number of open(2) calls.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When we had to refresh the index internally before running diff or status,
we opportunistically updated the $GIT_INDEX_FILE so that later invocation
of git can use the lstat(2) we already did in this invocation.
Make them share a helper function to do so.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Commits, trees and tags have structure. Don't let users feed git
with malformed ones. Sooner or later git will die() when
encountering them.
Note that this patch does not check semantics. A tree that points
to non-existent objects is perfectly OK (and should be so, users
may choose to add commit first, then its associated tree for example).
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When MyDir/ABC/filea.txt is added to Git, the disk directory MyDir/ABC/
is renamed to mydir/aBc/, and then mydir/aBc/fileb.txt is added, the
index will contain MyDir/ABC/filea.txt and mydir/aBc/fileb.txt. Although
the earlier portions of this patch series account for those differences
in case, this patch makes the pathing consistent by folding the case of
newly added files against the first file added with that path.
In read-cache.c's add_to_index(), the index_name_exists() support used
for git status's case insensitive directory lookups is used to find the
proper directory case according to what the user already checked in.
That is, MyDir/ABC/'s case is used to alter the stored path for
fileb.txt to MyDir/ABC/fileb.txt (instead of mydir/aBc/fileb.txt).
This is especially important when cloning a repository to a case
sensitive file system. MyDir/ABC/ and mydir/aBc/ exist in the same
directory on a Windows machine, but on Linux, the files exist in two
separate directories. The update to add_to_index(), in effect, treats a
Windows file system as case sensitive by making path case consistent.
Signed-off-by: Joshua Jensen <jjensen@workspacewhiz.com>
Signed-off-by: Johannes Sixt <j6t@kdbg.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The new dircache extension CACHE_EXT_RESOLVE_UNDO, whose value is
0x52455543, is actually the ASCII sequence 'REUC', not the ASCII
sequence 'REUN'.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The rule has always been that a cache entry that is ce_uptodate(ce)
means that we already have checked the work tree entity and we know
there is no change in the work tree compared to the index, and nobody
should have to double check. Note that false ce_uptodate(ce) does not
mean it is known to be dirty---it only means we don't know if it is
clean.
There are a few codepaths (refresh-index and preload-index are among
them) that mark a cache entry as up-to-date based solely on the return
value from ie_match_stat(); this function uses lstat() to see if the
work tree entity has been touched, and for a submodule entry, if its
HEAD points at the same commit as the commit recorded in the index of
the superproject (a submodule that is not even cloned is considered
clean).
A submodule is no longer considered unmodified merely because its HEAD
matches the index of the superproject these days, in order to prevent
people from forgetting to commit in the submodule and updating the
superproject index with the new submodule commit, before commiting the
state in the superproject. However, the patch to do so didn't update
the codepath that marks cache entries up-to-date based on the updated
definition and instead worked it around by saying "we don't trust the
return value of ce_uptodate() for submodules."
This makes ce_uptodate() trustworthy again by not marking submodule
entries up-to-date.
The next step _could_ be to introduce a few "in-core" flag bits to
cache_entry structure to record "this entry is _known_ to be dirty",
call is_submodule_modified() from ie_match_stat(), and use these new
bits to avoid running this rather expensive check more than once, but
that can be a separate patch.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Exal Sibeaz pointed out that some git files are way too big, and that
add_files_to_cache() brings in all the diff machinery to any git binary
that needs the basic git SHA1 object operations from read-cache.c. Which
is pretty much all of them.
It's doubly silly, since add_files_to_cache() is only used by builtin
programs (add, checkout and commit), so it's fairly easily fixed by just
moving the thing to builtin-add.c, and avoiding the dependency entirely.
I initially argued to Exal that it would probably be best to try to depend
on smart compilers and linkers, but after spending some time trying to
make -ffunction-sections work and giving up, I think Exal was right, and
the fix is to just do some trivial cleanups like this.
This trivial cleanup results in pretty stunning file size differences.
The diff machinery really is mostly used by just the builtin programs, and
you have things like these trivial before-and-after numbers:
-rwxr-xr-x 1 torvalds torvalds 1727420 2010-01-21 10:53 git-hash-object
-rwxrwxr-x 1 torvalds torvalds 940265 2010-01-21 11:16 git-hash-object
Now, I'm not saying that 940kB is good either, but that's mostly all the
debug information - you can see the real code with 'size':
text data bss dec hex filename
418675 3920 127408 550003 86473 git-hash-object (before)
230650 2288 111728 344666 5425a git-hash-object (after)
ie we have a nice 24% size reduction from this trivial cleanup.
It's not just that one file either. I get:
[torvalds@nehalem git]$ du -s /home/torvalds/libexec/git-core
45640 /home/torvalds/libexec/git-core (before)
33508 /home/torvalds/libexec/git-core (after)
so we're talking 12MB of diskspace here.
(Of course, stripping all the binaries brings the 33MB down to 9MB, so the
whole debug information thing is still the bulk of it all, but that's a
separate issue entirely)
Now, I'm sure there are other things we should do, and changing our
compiler flags from -O2 to -Os would bring the text size down by an
additional almost 20%, but this thing Exal pointed out seems to be some
good low-hanging fruit.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Commit 9e8ecea (Add 'merge' mode to 'git reset', 2008-12-01) disallowed
"git reset --merge" when there was unmerged entries. But it wished if
unmerged entries were reset as if --hard (instead of --merge) has been
used. This makes sense because all "mergy" operations makes sure that
any path involved in the merge does not have local modifications before
starting, so resetting such a path away won't lose any information.
The previous commit changed the behavior of --merge to accept resetting
unmerged entries if they are reset to a different state than HEAD, but it
did not reset the changes in the work tree, leaving the conflict markers
in the resulting file in the work tree.
Fix it by doing three things:
- Update the documentation to match the wish of original "reset --merge"
better, namely, "An unmerged entry is a sign that the path didn't have
any local modification and can be safely resetted to whatever the new
HEAD records";
- Update read_index_unmerged(), which reads the index file into the cache
while dropping any higher-stage entries down to stage #0, not to copy
the object name from the higher stage entry. The code used to take the
object name from the a stage entry ("base" if you happened to have
stage #1, or "ours" if both sides added, etc.), which essentially meant
that you are getting random results depending on what the merge did.
The _only_ reason we want to keep a previously unmerged entry in the
index at stage #0 is so that we don't forget the fact that we have
corresponding file in the work tree in order to be able to remove it
when the tree we are resetting to does not have the path. In order to
differentiate such an entry from ordinary cache entry, the cache entry
added by read_index_unmerged() is marked as CE_CONFLICTED.
- Update merged_entry() and deleted_entry() so that they pay attention to
cache entries marked as CE_CONFLICTED. They are previously unmerged
entries, and the files in the work tree that correspond to them are
resetted away by oneway_merge() to the version from the tree we are
resetting to.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
On big endian platforms with 8-byte unsigned long, the code reads the
size of the index extension section (which is a 4-byte network byte
order integer) incorrectly.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When resolving a conflict using "git add" to create a stage #0 entry, or
"git rm" to remove entries at higher stages, remove_index_entry_at()
function is eventually called to remove unmerged (i.e. higher stage)
entries from the index. Introduce a "resolve_undo_info" structure and
keep track of the removed cache entries, and save it in a new index
extension section in the index_state.
Operations like "read-tree -m", "merge", "checkout [-m] <branch>" and
"reset" are signs that recorded information in the index is no longer
necessary. The data is removed from the index extension when operations
start; they may leave conflicted entries in the index, and later user
actions like "git add" will record their conflicted states afresh.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Previously CE_MATCH_IGNORE_VALID flag is used by both valid and
skip-worktree bits. While the two bits have similar behaviour, sharing
this flag means "git update-index --really-refresh" will ignore
skip-worktree while it should not. Instead another flag is
introduced to ignore skip-worktree bit, CE_MATCH_IGNORE_VALID only
applies to valid bit.
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
grep: turn on --cached for files that is marked skip-worktree
ls-files: do not check for deleted file that is marked skip-worktree
update-index: ignore update request if it's skip-worktree, while still allows removing
diff*: skip worktree version
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>