#ifndef DIR_H
#define DIR_H

#include "cache.h"
#include "hashmap.h"
#include "strbuf.h"

/**
 * The directory listing API is used to enumerate paths in the work tree,
 * optionally taking `.git/info/exclude` and `.gitignore` files per directory
 * into account.
 */

/**
 * Calling sequence
 * ----------------
 *
 * Note: The index may be checked for .gitignore files that are
 * CE_SKIP_WORKTREE marked. If you want to exclude files, make sure you have
 * loaded the index first.
 *
 * - Prepare `struct dir_struct dir` using the `dir_init()` function.
 *
 * - To add a single exclude pattern, call `add_pattern_list()` and then
 *   `add_pattern()`.
 *
 * - To add patterns from a file (e.g. `.git/info/exclude`), call
 *   `add_patterns_from_file()`, and/or set `dir.exclude_per_dir`.
 *
 * - A short-hand function `setup_standard_excludes()` can be used to set
 *   up the standard set of exclude settings, instead of manually calling
 *   the add_pattern*() family of functions.
 *
 * - Call `fill_directory()`.
 *
 * - Use `dir.entries[]` and `dir.ignored[]`.
 *
 * - Call `dir_clear()` when the contained elements are no longer in use.
 *
 */

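/*
 * Example
 * -------
 *
 * A minimal sketch of the calling sequence above, not taken from any
 * caller in the tree. It assumes that the index `istate` is already
 * loaded, that `ps` is a parsed `struct pathspec`, that the entry count
 * is kept in `dir.nr`, and that `fill_directory()` takes the arguments
 * shown; check the declarations further down in this header first.
 *
 *	struct dir_struct dir;
 *	int i;
 *
 *	dir_init(&dir);
 *	setup_standard_excludes(&dir);
 *	fill_directory(&dir, istate, &ps);
 *	for (i = 0; i < dir.nr; i++)
 *		printf("%s\n", dir.entries[i]->name);
 *	dir_clear(&dir);
 */
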
struct dir_entry {
	unsigned int len;
	char name[FLEX_ARRAY]; /* more */
};

#define PATTERN_FLAG_NODIR 1
#define PATTERN_FLAG_ENDSWITH 4
#define PATTERN_FLAG_MUSTBEDIR 8
#define PATTERN_FLAG_NEGATIVE 16

struct path_pattern {
	/*
	 * This allows callers of last_matching_pattern() etc.
	 * to determine the origin of the matching pattern.
	 */
	struct pattern_list *pl;

	const char *pattern;
	int patternlen;
	int nowildcardlen;
	const char *base;
	int baselen;
	unsigned flags;		/* PATTERN_FLAG_* */

	/*
	 * Counting starts from 1 for line numbers in ignore files,
	 * and from -1 decrementing for patterns from CLI args.
	 */
	int srcpos;
};

/* used for hashmaps for cone patterns */
struct pattern_entry {
	struct hashmap_entry ent;
	char *pattern;
	size_t patternlen;
};

/*
 * Each excludes file will be parsed into a fresh exclude_list which
 * is appended to the relevant exclude_list_group (either EXC_DIRS or
 * EXC_FILE).  An exclude_list within the EXC_CMDL exclude_list_group
 * can also be used to represent the list of --exclude values passed
 * via CLI args.
 */
struct pattern_list {
	int nr;
	int alloc;

	/* remember pointer to exclude file contents so we can free() */
	char *filebuf;

	/* origin of list, e.g. path to filename, or descriptive string */
	const char *src;

	struct path_pattern **patterns;

	/*
	 * While scanning the excludes, we attempt to match the patterns
	 * with a more restricted set that allows us to use hashsets for
	 * matching logic, which is faster than the linear lookup in the
	 * excludes array above. If non-zero, that check succeeded.
	 */
	unsigned use_cone_patterns;
	unsigned full_cone;

	/*
	 * Stores paths where everything starting with those paths
	 * is included.
	 */
	struct hashmap recursive_hashmap;

	/*
	 * Used to check single-level parents of blobs.
	 */
	struct hashmap parent_hashmap;
};

/*
 * The contents of the per-directory exclude files are lazily read on
 * demand and then cached in memory, one per exclude_stack struct, in
 * order to avoid opening and parsing each one every time that
 * directory is traversed.
 */
struct exclude_stack {
	struct exclude_stack *prev; /* the struct exclude_stack for the parent directory */
	int baselen;
	int exclude_ix; /* index of exclude_list within EXC_DIRS exclude_list_group */
	struct untracked_cache_dir *ucd;
};

struct exclude_list_group {
	int nr, alloc;
	struct pattern_list *pl;
};

struct oid_stat {
	struct stat_data stat;
	struct object_id oid;
	int valid;
};

/*
 *  Untracked cache
 *
 *  The following inputs are sufficient to determine what files in a
 *  directory are excluded:
 *
 *   - The list of files and directories of the directory in question
 *   - The $GIT_DIR/index
 *   - dir_struct flags
 *   - The content of $GIT_DIR/info/exclude
 *   - The content of core.excludesfile
 *   - The content (or the lack) of .gitignore of all parent directories
 *     from $GIT_WORK_TREE
 *   - The check_only flag in read_directory_recursive (for
 *     DIR_HIDE_EMPTY_DIRECTORIES)
 *
 *  The first input can be checked using directory mtime. In many
 *  filesystems, directory mtime (stat_data field) is updated when its
 *  files or direct subdirs are added or removed.
 *
 *  The second one can be hooked from cache_tree_invalidate_path().
 *  Whenever a file (or a submodule) is added to or removed from a
 *  directory, we invalidate that directory.
 *
 *  The remaining inputs are easy: their SHA-1 could be used to verify
 *  their contents (exclude_sha1[], info_exclude_sha1[] and
 *  excludes_file_sha1[])
 */
struct untracked_cache_dir {
	struct untracked_cache_dir **dirs;
	char **untracked;
	struct stat_data stat_data;
	unsigned int untracked_alloc, dirs_nr, dirs_alloc;
	unsigned int untracked_nr;
	unsigned int check_only : 1;
	/* all data except 'dirs' in this struct are good */
	unsigned int valid : 1;
	unsigned int recurse : 1;
	/* null object ID means this directory does not have .gitignore */
	struct object_id exclude_oid;
	char name[FLEX_ARRAY];
};

struct untracked_cache {
|
|
|
|
struct oid_stat ss_info_exclude;
|
|
|
|
struct oid_stat ss_excludes_file;
|
untracked cache: record .gitignore information and dir hierarchy
The idea is if we can capture all input and (non-rescursive) output of
read_directory_recursive(), and can verify later that all the input is
the same, then the second r_d_r() should produce the same output as in
the first run.
The requirement for this to work is stat info of a directory MUST
change if an entry is added to or removed from that directory (and
should not change often otherwise). If your OS and filesystem do not
meet this requirement, untracked cache is not for you. Most file
systems on *nix should be fine. On Windows, NTFS is fine while FAT may
not be [1] even though FAT on Linux seems to be fine.
The list of input of r_d_r() is in the big comment block in dir.h. In
short, the output of a directory (not counting subdirs) mainly depends
on stat info of the directory in question, all .gitignore leading to
it and the check_only flag when r_d_r() is called recursively. This
patch records all this info (and the output) as r_d_r() runs.
Two hash_sha1_file() are required for $GIT_DIR/info/exclude and
core.excludesfile unless their stat data matches. hash_sha1_file() is
only needed when .gitignore files in the worktree are modified,
otherwise their SHA-1 in index is used (see the previous patch).
We could store stat data for .gitignore files so we don't have to
rehash them if their content is different from index, but I think
.gitignore files are rarely modified, so not worth extra cache data
(and hashing penalty read-cache.c:verify_hdr(), as we will be storing
this as an index extension).
The implication is, if you change .gitignore, you better add it to the
index soon or you lose all the benefit of untracked cache because a
modified .gitignore invalidates all subdirs recursively. This is
especially bad for .gitignore at root.
This cached output is about untracked files only, not ignored files
because the number of tracked files is usually small, so small cache
overhead, while the number of ignored files could go really high
(e.g. *.o files mixing with source code).
[1] "Description of NTFS date and time stamps for files and folders"
http://support.microsoft.com/kb/299648
Helped-by: Torsten Bögershausen <tboegi@web.de>
Helped-by: David Turner <dturner@twopensource.com>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
10 years ago
|
|
|
const char *exclude_per_dir;
|
|
|
|
struct strbuf ident;
|
untracked cache: record .gitignore information and dir hierarchy
The idea is if we can capture all input and (non-rescursive) output of
read_directory_recursive(), and can verify later that all the input is
the same, then the second r_d_r() should produce the same output as in
the first run.
The requirement for this to work is stat info of a directory MUST
change if an entry is added to or removed from that directory (and
should not change often otherwise). If your OS and filesystem do not
meet this requirement, untracked cache is not for you. Most file
systems on *nix should be fine. On Windows, NTFS is fine while FAT may
not be [1] even though FAT on Linux seems to be fine.
The list of input of r_d_r() is in the big comment block in dir.h. In
short, the output of a directory (not counting subdirs) mainly depends
on stat info of the directory in question, all .gitignore leading to
it and the check_only flag when r_d_r() is called recursively. This
patch records all this info (and the output) as r_d_r() runs.
Two hash_sha1_file() are required for $GIT_DIR/info/exclude and
core.excludesfile unless their stat data matches. hash_sha1_file() is
only needed when .gitignore files in the worktree are modified,
otherwise their SHA-1 in index is used (see the previous patch).
We could store stat data for .gitignore files so we don't have to
rehash them if their content is different from index, but I think
.gitignore files are rarely modified, so not worth extra cache data
(and hashing penalty read-cache.c:verify_hdr(), as we will be storing
this as an index extension).
The implication is, if you change .gitignore, you better add it to the
index soon or you lose all the benefit of untracked cache because a
modified .gitignore invalidates all subdirs recursively. This is
especially bad for .gitignore at root.
This cached output is about untracked files only, not ignored files
because the number of tracked files is usually small, so small cache
overhead, while the number of ignored files could go really high
(e.g. *.o files mixing with source code).
[1] "Description of NTFS date and time stamps for files and folders"
http://support.microsoft.com/kb/299648
Helped-by: Torsten Bögershausen <tboegi@web.de>
Helped-by: David Turner <dturner@twopensource.com>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
10 years ago
|
|
|
/*
|
|
|
|
* dir_struct#flags must match dir_flags or the untracked
|
|
|
|
* cache is ignored.
|
|
|
|
*/
|
|
|
|
unsigned dir_flags;
|
|
|
|
struct untracked_cache_dir *root;
|
|
|
|
/* Statistics */
|
|
|
|
int dir_created;
|
|
|
|
int gitignore_invalidated;
|
untracked cache: record/validate dir mtime and reuse cached output
The main readdir loop in read_directory_recursive() is replaced with a
new one that checks if cached results of a directory is still valid.
If a file is added or removed from the index, the containing directory
is invalidated (but not its subdirs). If directory's mtime is changed,
the same happens. If a .gitignore is updated, the containing directory
and all subdirs are invalidated recursively. If dir_struct#flags or
other conditions change, the cache is ignored.
If a directory is invalidated, we opendir/readdir/closedir and run the
exclude machinery on that directory listing as usual. If untracked
cache is also enabled, we'll update the cache along the way. If a
directory is validated, we simply pull the untracked listing out from
the cache. The cache also records the list of direct subdirs that we
have to recurse in. Fully excluded directories are seen as "untracked
files".
In the best case when no dirs are invalidated, read_directory()
becomes a series of
stat(dir), open(.gitignore), fstat(), read(), close() and optionally
hash_sha1_file()
For comparison, standard read_directory() is a sequence of
opendir(), readdir(), open(.gitignore), fstat(), read(), close(), the
expensive last_exclude_matching() and closedir().
We already try not to open(.gitignore) if we know it does not exist,
so open/fstat/read/close sequence does not apply to every
directory. The sequence could be reduced further, as noted in
prep_exclude() in another patch. So in theory, the entire best-case
read_directory sequence could be reduced to a series of stat() and
nothing else.
This is not a silver bullet approach. When you compile a C file, for
example, the old .o file is removed and a new one with the same name
created, effectively invalidating the containing directory's cache
(but not its subdirectories). If your build process touches every
directory, this cache adds extra overhead for nothing, so it's a good
idea to separate generated files from tracked files. Editors may use
the same strategy for saving files. And of course you're out of luck
running your repo on an unsupported filesystem and/or operating system.
Helped-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
10 years ago
|
|
|
int dir_invalidated;
|
|
|
|
int dir_opened;
|
|
|
|
/* fsmonitor invalidation data */
|
|
|
|
unsigned int use_fsmonitor : 1;
|
untracked cache: record .gitignore information and dir hierarchy
The idea is if we can capture all input and (non-recursive) output of
read_directory_recursive(), and can verify later that all the input is
the same, then the second r_d_r() should produce the same output as in
the first run.
The requirement for this to work is stat info of a directory MUST
change if an entry is added to or removed from that directory (and
should not change often otherwise). If your OS and filesystem do not
meet this requirement, untracked cache is not for you. Most file
systems on *nix should be fine. On Windows, NTFS is fine while FAT may
not be [1] even though FAT on Linux seems to be fine.
The list of input of r_d_r() is in the big comment block in dir.h. In
short, the output of a directory (not counting subdirs) mainly depends
on stat info of the directory in question, all .gitignore leading to
it and the check_only flag when r_d_r() is called recursively. This
patch records all this info (and the output) as r_d_r() runs.
Two hash_sha1_file() are required for $GIT_DIR/info/exclude and
core.excludesfile unless their stat data matches. hash_sha1_file() is
only needed when .gitignore files in the worktree are modified,
otherwise their SHA-1 in index is used (see the previous patch).
We could store stat data for .gitignore files so we don't have to
rehash them if their content is different from index, but I think
.gitignore files are rarely modified, so not worth extra cache data
(and hashing penalty read-cache.c:verify_hdr(), as we will be storing
this as an index extension).
The implication is, if you change .gitignore, you better add it to the
index soon or you lose all the benefit of untracked cache because a
modified .gitignore invalidates all subdirs recursively. This is
especially bad for .gitignore at root.
This cached output is about untracked files only, not ignored files
because the number of untracked files is usually small, so small cache
overhead, while the number of ignored files could go really high
(e.g. *.o files mixing with source code).
[1] "Description of NTFS date and time stamps for files and folders"
http://support.microsoft.com/kb/299648
Helped-by: Torsten Bögershausen <tboegi@web.de>
Helped-by: David Turner <dturner@twopensource.com>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
10 years ago
|
|
|
};
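The invalidation rules described in the untracked-cache commit messages above can be illustrated with a small, standalone sketch. It is not the code in dir.c: `struct cached_dir_sketch` and `cached_listing_usable()` are hypothetical names, and the real cache ignores the whole cache on a flag mismatch and compares more stat fields than a single mtime.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical, simplified stand-in for one cached directory record. */
struct cached_dir_sketch {
	time_t mtime;       /* directory mtime recorded when the cache was written */
	unsigned dir_flags; /* traversal flags in effect at that time */
	int valid;          /* cleared when the index or a .gitignore changes */
};

/*
 * Return 1 if the cached untracked listing may be reused, 0 if the
 * directory has to be re-read with opendir()/readdir()/closedir().
 */
static int cached_listing_usable(const struct cached_dir_sketch *c,
				 time_t current_mtime,
				 unsigned current_flags)
{
	if (!c->valid)
		return 0; /* explicitly invalidated */
	if (c->dir_flags != current_flags)
		return 0; /* flags differ: cached output does not apply */
	return c->mtime == current_mtime; /* unchanged directory */
}
```

The function is pure on purpose: the caller would obtain `current_mtime` from `stat()` on the directory, which is the only system call the best case described above needs per directory.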
|
|
|
|
|
|
|
|
/**
|
|
|
|
* This structure is used to pass directory traversal options to the library and to
|
|
|
|
* record the paths discovered. A single `struct dir_struct` is used regardless
|
|
|
|
* of whether or not the traversal recursively descends into subdirectories.
|
|
|
|
*/
|
|
|
|
struct dir_struct {
|
|
|
|
|
|
|
|
/* The number of members in `entries[]` array. */
|
|
|
|
int nr;
|
|
|
|
|
|
|
|
/* Internal use; keeps track of allocation of `entries[]` array. */
|
|
|
|
int alloc;
|
|
|
|
|
|
|
|
/* The number of members in `ignored[]` array. */
|
|
|
|
int ignored_nr;
|
|
|
|
|
|
|
|
int ignored_alloc;
|
|
|
|
|
|
|
|
/* bit-field of options */
|
|
|
|
enum {
|
|
|
|
|
|
|
|
/**
|
|
|
|
* Return just ignored files in `entries[]`, not untracked files.
|
|
|
|
* This flag is mutually exclusive with `DIR_SHOW_IGNORED_TOO`.
|
|
|
|
*/
|
|
|
|
DIR_SHOW_IGNORED = 1<<0,
|
|
|
|
|
|
|
|
/* Include a directory that is not tracked. */
|
|
|
|
DIR_SHOW_OTHER_DIRECTORIES = 1<<1,
|
|
|
|
|
|
|
|
/* Do not include a directory that is not tracked and is empty. */
|
|
|
|
DIR_HIDE_EMPTY_DIRECTORIES = 1<<2,
|
|
|
|
|
|
|
|
/**
|
|
|
|
* If set, recurse into a directory that looks like a Git directory.
|
|
|
|
* Otherwise it is shown as a directory.
|
|
|
|
*/
|
|
|
|
DIR_NO_GITLINKS = 1<<3,
|
|
|
|
|
|
|
|
/**
|
|
|
|
* Special mode for git-add. Return ignored files in `ignored[]` and
|
|
|
|
* untracked files in `entries[]`. Only returns ignored files that match
|
|
|
|
* pathspec exactly (no wildcards). Does not recurse into ignored
|
|
|
|
* directories.
|
|
|
|
*/
|
|
|
|
DIR_COLLECT_IGNORED = 1<<4,
|
|
|
|
|
|
|
|
/**
|
|
|
|
* Similar to `DIR_SHOW_IGNORED`, but return ignored files in
|
|
|
|
* `ignored[]` in addition to untracked files in `entries[]`.
|
|
|
|
* This flag is mutually exclusive with `DIR_SHOW_IGNORED`.
|
|
|
|
*/
|
ls-files -k: a directory only can be killed if the index has a non-directory
"ls-files -o" and "ls-files -k" both traverse the working tree down
to find either all untracked paths or those that will be "killed"
(removed from the working tree to make room) when the paths recorded
in the index are checked out. It is necessary to traverse the
working tree fully when enumerating all the "other" paths, but when
we are only interested in "killed" paths, we can take advantage of
the fact that paths that do not overlap with entries in the index
can never be killed.
The treat_one_path() helper function, which is called during the
recursive traversal, is the ideal place to implement an
optimization.
When we are looking at a directory P in the working tree, there are
three cases:
(1) P exists in the index. Everything inside the directory P in
the working tree needs to go when P is checked out from the
index.
(2) P does not exist in the index, but there is P/Q in the index.
We know P will stay a directory when we check out the contents
of the index, but we do not know yet if there is a directory
P/Q in the working tree to be killed, so we need to recurse.
(3) P does not exist in the index, and there is no P/Q in the index
to require P to be a directory, either. Only in this case, we
know that everything inside P will not be killed without
recursing.
Note that this helper is called by treat_leading_path() that decides
if we need to traverse only subdirectories of a single common
leading directory, which is essential for this optimization to be
correct. This caller checks each level of the leading path
component from shallower directory to deeper ones, and that is what
allows us to only check if the path appears in the index. If the
call to treat_one_path() weren't there, given a path P/Q/R, the real
traversal may start from directory P/Q/R, even when the index
records P as a regular file, and we would end up having to check if
any leading subpath in P/Q/R, e.g. P, appears in the index.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
12 years ago
|
|
|
DIR_SHOW_IGNORED_TOO = 1<<5,
|
|
|
|
|
|
|
|
DIR_COLLECT_KILLED_ONLY = 1<<6,
|
|
|
|
|
|
|
|
/**
|
|
|
|
* Only has meaning if `DIR_SHOW_IGNORED_TOO` is also set; if this is
|
|
|
|
* set, the untracked contents of untracked directories are also
|
|
|
|
* returned in `entries[]`.
|
|
|
|
*/
|
status: add option to show ignored files differently
Teach the status command more flexibility in how ignored files are
reported. Currently, the reporting of ignored files and untracked
files are linked. You cannot control how ignored files are reported
independently of how untracked files are reported (i.e. `all` vs
`normal`). This makes it impossible to show untracked files with the
`all` option, but show ignored files with the `normal` option.
This work 1) adds the ability to control the reporting of ignored
files independently of untracked files and 2) introduces the concept
of status reporting ignored paths that explicitly match an ignored
pattern. There are 2 benefits to these changes: 1) if a consumer needs
all untracked files but not all ignored files, there is a performance
benefit to not scanning all contents of an ignored directory and 2)
returning ignored files that explicitly match a path allow a consumer
to make more informed decisions about when a status result might be
stale.
This commit implements --ignored=matching with --untracked-files=all.
The following commit will implement --ignored=matching with
--untracked-files=normal.
As an example of where this flexibility could be useful, our
application (Visual Studio) runs the status command and presents the
output. It shows all untracked files individually (e.g. using the
'--untracked-files=all' option), and would like to know which
paths are ignored. It uses information about ignored paths to make
decisions about when the status result might have changed.
Additionally, many projects place build output into directories inside
a repository's working directory (e.g. in "bin/" and "obj/"
directories). Normal usage is to explicitly ignore these 2 directory
names in the .gitignore file (rather than or in addition to the *.obj
pattern). If an application could know that these directories are
explicitly ignored, it could infer that all contents are ignored as
well and make better informed decisions about files in these
directories. It could infer that any changes under these paths would
not affect the output of status. Additionally, there can be a
significant performance benefit by avoiding scanning through ignored
directories.
When status is set to report matching ignored files, it has the
following behavior. Ignored files and directories that explicitly
match an exclude pattern are reported. If an ignored directory matches
an exclude pattern, then the path of the directory is returned. If a
directory does not match an exclude pattern, but all of its contents
are ignored, then the contained files are reported instead of the
directory.
Signed-off-by: Jameson Miller <jamill@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
7 years ago
|
|
|
DIR_KEEP_UNTRACKED_CONTENTS = 1<<7,
|
|
|
|
|
|
|
|
/**
|
|
|
|
* Only has meaning if `DIR_SHOW_IGNORED_TOO` is also set; if this is
|
|
|
|
* set, returns ignored files and directories that match an exclude
|
|
|
|
* pattern. If a directory matches an exclude pattern, then the
|
|
|
|
* directory is returned and the contained paths are not. A directory
|
|
|
|
* that does not match an exclude pattern will not be returned even if
|
|
|
|
* all of its contents are ignored. In this case, the contents are
|
|
|
|
* returned as individual entries.
|
|
|
|
*
|
|
|
|
* If this is set, files and directories that explicitly match an ignore
|
|
|
|
* pattern are reported. Implicitly ignored directories (directories that
|
|
|
|
* do not match an ignore pattern, but whose contents are all ignored)
|
|
|
|
* are not reported, instead all of the contents are reported.
|
|
|
|
*/
|
clean: avoid removing untracked files in a nested git repository
Users expect files in a nested git repository to be left alone unless
sufficiently forced (with two -f's). Unfortunately, in certain
circumstances, git would delete both tracked (and possibly dirty) files
and untracked files within a nested repository. To explain how this
happens, let's contrast a couple cases. First, take the following
example setup (which assumes we are already within a git repo):
git init nested
cd nested
>tracked
git add tracked
git commit -m init
>untracked
cd ..
In this setup, everything works as expected; running 'git clean -fd'
will result in fill_directory() returning the following paths:
nested/
nested/tracked
nested/untracked
and then correct_untracked_entries() would notice this can be compressed
to
nested/
and then since "nested/" is a directory, we would call
remove_dirs("nested/", ...), which would
check is_nonbare_repository_dir() and then decide to skip it.
However, if someone also creates an ignored file:
>nested/ignored
then running 'git clean -fd' would result in fill_directory() returning
the same paths:
nested/
nested/tracked
nested/untracked
but correct_untracked_entries() will notice that we had ignored entries
under nested/ and thus simplify this list to
nested/tracked
nested/untracked
Since these are not directories, we do not call remove_dirs() which was
the only place that had the is_nonbare_repository_dir() safety check --
resulting in us deleting both the untracked file and the tracked (and
possibly dirty) file.
One possible fix for this issue would be walking the parent directories
of each path and checking if they represent nonbare repositories, but
that would be wasteful. Even if we added caching of some sort, it's
still a waste because we should have been able to check that "nested/"
represented a nonbare repository before even descending into it in the
first place. Add a DIR_SKIP_NESTED_GIT flag to dir_struct.flags and use
it to prevent fill_directory() and friends from descending into nested
git repos.
With this change, we also modify two regression tests added in commit
91479b9c72f1 ("t7300: add tests to document behavior of clean and nested
git", 2015-06-15). Neither that commit, nor its series, nor the six previous
iterations of that series on the mailing list discussed why those tests
coded the expectation they did. In fact, it appears their purpose was
simply to test _existing_ behavior to make sure that the performance
changes didn't change the behavior. However, these two tests directly
contradicted the manpage's claims that two -f's were required to delete
files/directories under a nested git repository. While one could argue
that the user gave an explicit path which matched files/directories that
were within a nested repository, there's a slippery slope that becomes
very difficult for users to understand once you go down that route (e.g.
what if they specified "git clean -f -d '*.c'"?) It would also be hard
to explain what the exact behavior was; avoid such problems by making it
really simple.
Also, clean up some grammar errors describing this functionality in the
git-clean manpage.
Finally, there are still a couple bugs with -ffd not cleaning out enough
(e.g. missing the nested .git) and with -ffdX possibly cleaning out the
wrong files (paying attention to outer .gitignore instead of inner).
This patch does not address these cases at all (and does not change the
behavior relative to those flags), it only fixes the handling when given
a single -f. See
https://public-inbox.org/git/20190905212043.GC32087@szeder.dev/ for more
discussion of the -ffd[X?] bugs.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
5 years ago
|
|
|
DIR_SHOW_IGNORED_TOO_MODE_MATCHING = 1<<8,
|
|
|
|
|
|
|
|
DIR_SKIP_NESTED_GIT = 1<<9
|
|
|
|
} flags;
|
|
|
|
|
|
|
|
/* An array of `struct dir_entry`, each element of which describes a path. */
|
|
|
|
struct dir_entry **entries;
|
|
|
|
|
|
|
|
/**
|
|
|
|
* Used for ignored paths with the `DIR_SHOW_IGNORED_TOO` and
|
|
|
|
* `DIR_COLLECT_IGNORED` flags.
|
|
|
|
*/
|
|
|
|
struct dir_entry **ignored;
|
|
|
|
|
|
|
|
/**
|
|
|
|
* The name of the file to be read in each directory for excluded files
|
|
|
|
* (typically `.gitignore`).
|
|
|
|
*/
|
|
|
|
const char *exclude_per_dir;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We maintain three groups of exclude pattern lists:
|
|
|
|
*
|
|
|
|
* EXC_CMDL lists patterns explicitly given on the command line.
|
|
|
|
* EXC_DIRS lists patterns obtained from per-directory ignore files.
|
|
|
|
* EXC_FILE lists patterns from fallback ignore files, e.g.
|
|
|
|
* - .git/info/exclude
|
|
|
|
* - core.excludesfile
|
|
|
|
*
|
|
|
|
* Each group contains multiple exclude lists, a single list
|
|
|
|
* per source.
|
|
|
|
*/
|
|
|
|
#define EXC_CMDL 0
|
|
|
|
#define EXC_DIRS 1
|
|
|
|
#define EXC_FILE 2
|
|
|
|
struct exclude_list_group exclude_list_group[3];
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Temporary variables which are used during loading of the
|
|
|
|
* per-directory exclude lists.
|
|
|
|
*
|
|
|
|
* exclude_stack points to the top of the exclude_stack, and
|
|
|
|
* basebuf contains the full path to the current
|
dir.c: unify is_excluded and is_path_excluded APIs
The is_excluded and is_path_excluded APIs are very similar, except for a
few noteworthy differences:
is_excluded doesn't handle ignored directories, results for paths within
ignored directories are incorrect. This is probably based on the premise
that recursive directory scans should stop at ignored directories, which
is no longer true (in certain cases, read_directory_recursive currently
calls is_excluded *and* is_path_excluded to get correct ignored state).
is_excluded caches parsed .gitignore files of the last directory in struct
dir_struct. If the directory changes, it finds a common parent directory
and is very careful to drop only as much state as necessary. On the other
hand, is_excluded will also read and parse .gitignore files in already
ignored directories, which are completely irrelevant.
is_path_excluded correctly handles ignored directories by checking if any
component in the path is excluded. As it uses is_excluded internally, this
unfortunately forces is_excluded to drop and re-read all .gitignore files,
as there is no common parent directory for the root dir.
is_path_excluded tracks state in a separate struct path_exclude_check,
which is essentially a wrapper of dir_struct with two more fields. However,
as is_path_excluded also modifies dir_struct, it is not possible to e.g.
use multiple path_exclude_check structures with the same dir_struct in
parallel. The additional structure just unnecessarily complicates the API.
Teach is_excluded / prep_exclude about ignored directories: whenever
entering a new directory, first check if the entire directory is excluded.
Remember the excluded state in dir_struct. Don't traverse into already
ignored directories (i.e. don't read irrelevant .gitignore files).
Directories could also be excluded by exclude patterns specified on the
command line or .git/info/exclude, so we cannot simply skip prep_exclude
entirely if there's no .gitignore file name (dir_struct.exclude_per_dir).
Move this check to just before actually reading the file.
is_path_excluded is now equivalent to is_excluded, so we can simply
redirect to it (the public API is cleaned up in the next patch).
The performance impact of the additional ignored check per directory is
hardly noticeable when reading directories recursively (e.g. 'git status').
However, performance of git commands using the is_path_excluded API (e.g.
'git ls-files --cached --ignored --exclude-standard') is greatly improved
as this no longer re-reads .gitignore files on each call.
Here's some performance data from the linux and WebKit repos (best of 10
runs on a Debian Linux on SSD, core.preloadIndex=true):
| ls-files -ci | status | status --ignored
| linux | WebKit | linux | WebKit | linux | WebKit
-------+-------+--------+-------+--------+-------+---------
before | 0.506 | 6.539 | 0.212 | 1.555 | 0.323 | 2.541
after | 0.080 | 1.191 | 0.218 | 1.583 | 0.321 | 2.579
gain | 6.325 | 5.490 | 0.972 | 0.982 | 1.006 | 0.985
Signed-off-by: Karsten Blees <blees@dcon.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
12 years ago
|
|
|
* (sub)directory in the traversal. pattern points to the
|
|
|
|
* matching path_pattern if the directory is excluded.
|
|
|
|
*/
|
|
|
|
struct exclude_stack *exclude_stack;
|
|
|
|
struct path_pattern *pattern;
|
|
|
|
struct strbuf basebuf;
|
|
|
|
|
|
|
|
/* Enable untracked file cache if set */
|
|
|
|
struct untracked_cache *untracked;
|
|
|
|
struct oid_stat ss_info_exclude;
|
|
|
|
struct oid_stat ss_excludes_file;
|
|
|
|
unsigned unmanaged_exclude_files;
|
|
|
|
};
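The flag comments in `struct dir_struct` above state two constraints: `DIR_SHOW_IGNORED` and `DIR_SHOW_IGNORED_TOO` are mutually exclusive, and two modifier flags only have meaning together with `DIR_SHOW_IGNORED_TOO`. A caller could check a flag combination before starting a traversal with something like the sketch below. `dir_flags_consistent()` is a hypothetical helper, not part of dir.h, and treating a meaningless modifier as an error is a stricter reading than the comments require.

```c
#include <assert.h>

/* Values copied from the flags enum above so the sketch compiles on its own. */
enum {
	DIR_SHOW_IGNORED = 1<<0,
	DIR_SHOW_IGNORED_TOO = 1<<5,
	DIR_KEEP_UNTRACKED_CONTENTS = 1<<7,
	DIR_SHOW_IGNORED_TOO_MODE_MATCHING = 1<<8
};

/* Hypothetical helper: 1 if the combination respects the documented constraints. */
static int dir_flags_consistent(unsigned flags)
{
	const unsigned modifiers = DIR_KEEP_UNTRACKED_CONTENTS |
				   DIR_SHOW_IGNORED_TOO_MODE_MATCHING;

	/* The two "show ignored" modes are documented as mutually exclusive. */
	if ((flags & DIR_SHOW_IGNORED) && (flags & DIR_SHOW_IGNORED_TOO))
		return 0;
	/* The modifiers only have meaning together with DIR_SHOW_IGNORED_TOO. */
	if ((flags & modifiers) && !(flags & DIR_SHOW_IGNORED_TOO))
		return 0;
	return 1;
}
```

For example, `DIR_SHOW_IGNORED_TOO | DIR_KEEP_UNTRACKED_CONTENTS` passes the check, while `DIR_SHOW_IGNORED | DIR_SHOW_IGNORED_TOO` does not.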
|
|
|
|
|
|
|
|
/* Count the number of slashes in string s. */
|
|
|
|
int count_slashes(const char *s);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The ordering of these constants is significant, with
|
|
|
|
* higher-numbered match types signifying "closer" (i.e. more
|
|
|
|
* specific) matches which will override lower-numbered match types
|
|
|
|
* when populating the seen[] array.
|
|
|
|
*/
|
|
|
|
#define MATCHED_RECURSIVELY 1
|
dir: if our pathspec might match files under a dir, recurse into it
For git clean, if a directory is entirely untracked and the user did not
specify -d (corresponding to DIR_SHOW_IGNORED_TOO), then we usually do
not want to remove that directory and thus do not recurse into it.
However, if the user manually specified specific (or even globbed) paths
somewhere under that directory to remove, then we need to recurse into
the directory to make sure we remove the relevant paths under that
directory as the user requested.
Note that this does not mean that the recursed-into directory will be
added to dir->entries for later removal; as of a few commits earlier in
this series, there is another more strict match check that is run after
returning from a recursed-into directory before deciding to add it to the
list of entries. Therefore, this will only result in files underneath
the given directory which match one of the pathspecs being added to the
entries list.
Two notes of potential interest to future readers:
* If we wanted to only recurse into a directory when it is specifically
matched rather than matched-via-glob (e.g. '*.c'), then we could do
so via making the final non-zero return in match_pathspec_item be
MATCHED_RECURSIVELY instead of MATCHED_RECURSIVELY_LEADING_PATHSPEC.
(Note that the relative order of MATCHED_RECURSIVELY_LEADING_PATHSPEC
and MATCHED_RECURSIVELY are important for such a change.) I was
leaving open that possibility while writing an RFC asking for the
behavior we want, but even though we don't want it, that knowledge
might help you understand the code flow better.
* There is a growing amount of logic in read_directory_recursive() for
deciding whether to recurse into a subdirectory. However, there is a
comment immediately preceding this logic that says to recurse if
instructed by treat_path(). It may be better for the logic in
read_directory_recursive() to ultimately be moved to treat_path() (or
another function it calls, such as treat_directory()), but I have
left that for someone else to tackle in the future.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
5 years ago
|
|
|
#define MATCHED_RECURSIVELY_LEADING_PATHSPEC 2
|
|
|
|
#define MATCHED_FNMATCH 3
|
|
|
|
#define MATCHED_EXACTLY 4
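The comment above says that higher-numbered match types override lower-numbered ones when the `seen[]` array is populated. A minimal sketch of that rule follows. `record_match_sketch()` is a hypothetical name, and the update condition is an assumption about how a caller would apply the ordering rather than a copy of the code in dir.c.

```c
#include <assert.h>
#include <stddef.h>

/* Values copied from the defines above so the sketch compiles on its own. */
#define MATCHED_RECURSIVELY 1
#define MATCHED_RECURSIVELY_LEADING_PATHSPEC 2
#define MATCHED_FNMATCH 3
#define MATCHED_EXACTLY 4

/*
 * Record how pathspec item `item` matched. A closer (higher-numbered)
 * match replaces a weaker one; a weaker match never downgrades the entry.
 */
static void record_match_sketch(char *seen, int item, int how)
{
	if (seen && seen[item] < how)
		seen[item] = (char)how;
}
```

With this rule the order in which paths are visited does not matter: after the traversal, `seen[i]` holds the most specific way item `i` matched any path, or 0 if it never matched.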
|
|
|
|
int simple_length(const char *match);
|
|
|
|
int no_wildcard(const char *string);
|
|
|
|
char *common_prefix(const struct pathspec *pathspec);
|
|
|
|
int match_pathspec(const struct index_state *istate,
|
|
|
|
const struct pathspec *pathspec,
|
|
|
|
const char *name, int namelen,
|
|
|
|
int prefix, char *seen, int is_dir);
|
|
|
|
int report_path_error(const char *ps_matched, const struct pathspec *pathspec);
|
|
|
|
int within_depth(const char *name, int namelen, int depth, int max_depth);
|
|
|
|
|
dir: fix problematic API to avoid memory leaks
The dir structure seemed to have a number of leaks and problems around
it. First I noticed that parent_hashmap and recursive_hashmap were
being leaked (though Peff noticed and submitted fixes before me). Then
I noticed in the previous commit that clear_directory() was only taking
responsibility for a subset of fields within dir_struct, despite the
fact that entries[] and ignored[] were allocated internally to dir.c.
That, of course, resulted in many callers either leaking or haphazardly
trying to free these arrays and their contents.
Digging further, I found that despite the pretty clear documentation
near the top of dir.h that folks were supposed to call clear_directory()
when the user no longer needed the dir_struct, there were four callers
that didn't bother doing that at all. However, two of them clearly
thought about leaks since they had an UNLEAK(dir) directive, which to me
suggests that the method to free the data was too unclear. I suspect
the non-obviousness of the API and its holes led folks to avoid it,
which then snowballed into the problems with entries[], ignored[],
parent_hashmap, and recursive_hashmap.
Rename clear_directory() to dir_clear() to be more in line with other
data structures in git, and introduce a dir_init() to handle the
suggested memsetting of dir_struct to all zeroes. I hope that a name
like "dir_clear()" is more clear, and that the presence of dir_init()
will provide a hint to those looking at the code that they need to look
for either a dir_clear() or a dir_free() and lead them to find
dir_clear().
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
4 years ago
|
|
|
void dir_init(struct dir_struct *dir);
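The commit message above argues for a paired init/clear API in which the clear function takes responsibility for everything the module allocated internally. The standalone sketch below shows that ownership pattern on a hypothetical `struct entry_list_sketch`. It does not call the real `dir_init()` or `dir_clear()`, and all names in it are illustrative.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature of a structure that owns a growing array of paths. */
struct entry_list_sketch {
	char **entries;
	int nr, alloc;
};

static void entry_list_init(struct entry_list_sketch *l)
{
	memset(l, 0, sizeof(*l));
}

/* Append a private copy of `path`; returns 0 on success, -1 on allocation failure. */
static int entry_list_add(struct entry_list_sketch *l, const char *path)
{
	char *copy;

	if (l->nr == l->alloc) {
		int new_alloc = l->alloc ? l->alloc * 2 : 4;
		char **grown = realloc(l->entries, new_alloc * sizeof(*grown));
		if (!grown)
			return -1;
		l->entries = grown;
		l->alloc = new_alloc;
	}
	copy = malloc(strlen(path) + 1);
	if (!copy)
		return -1;
	strcpy(copy, path);
	l->entries[l->nr++] = copy;
	return 0;
}

/* Free the array and its contents, then leave the struct ready for reuse. */
static void entry_list_clear(struct entry_list_sketch *l)
{
	int i;

	for (i = 0; i < l->nr; i++)
		free(l->entries[i]);
	free(l->entries);
	entry_list_init(l);
}
```

Because the clear function frees both the array and its elements, callers never need to know how the entries were allocated, which is the kind of leak the commit message describes callers running into.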
|
|
|
|
|
|
|
|
int fill_directory(struct dir_struct *dir,
|
|
|
|
struct index_state *istate,
|
|
|
|
const struct pathspec *pathspec);
|
|
|
|
int read_directory(struct dir_struct *, struct index_state *istate,
|
|
|
|
const char *path, int len,
|
|
|
|
const struct pathspec *pathspec);
|
|
|
|
|
|
|
|
enum pattern_match_result {
|
|
|
|
UNDECIDED = -1,
|
|
|
|
NOT_MATCHED = 0,
|
|
|
|
MATCHED = 1,
|
|
|
|
MATCHED_RECURSIVE = 2,
|
|
|
|
};
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Scan the list of patterns to determine if the ordered list
|
|
|
|
* of patterns matches on 'pathname'.
|
|
|
|
*
|
|
|
|
* Return 1 for a match, 2 for a recursive match, 0 for not matched and -1 for undecided.
|
|
|
|
*/
|
|
|
|
enum pattern_match_result path_matches_pattern_list(const char *pathname,
|
|
|
|
int pathlen,
|
|
|
|
const char *basename, int *dtype,
|
|
|
|
struct pattern_list *pl,
|
|
|
|
struct index_state *istate);
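The three basic results of scanning an ordered pattern list can be sketched as follows, assuming the usual ignore-file convention that the last matching pattern decides and that a negated pattern ("!pattern") un-matches the path. The sketch compares pattern text literally instead of using wildcards, omits `MATCHED_RECURSIVE`, and uses the hypothetical names `struct pattern_sketch` and `match_list_sketch()`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Values copied from enum pattern_match_result above (MATCHED_RECURSIVE omitted). */
enum { UNDECIDED = -1, NOT_MATCHED = 0, MATCHED = 1 };

/* Hypothetical, simplified pattern: literal text plus a negation marker. */
struct pattern_sketch {
	const char *text; /* compared literally; real patterns support wildcards */
	int negated;      /* 1 for a "!pattern" entry */
};

/*
 * Scan from the last pattern to the first so that later entries take
 * precedence. If no pattern applies, the list does not decide the path.
 */
static int match_list_sketch(const struct pattern_sketch *pl, size_t nr,
			     const char *pathname)
{
	size_t i;

	for (i = nr; i > 0; i--) {
		const struct pattern_sketch *p = &pl[i - 1];
		if (!strcmp(p->text, pathname))
			return p->negated ? NOT_MATCHED : MATCHED;
	}
	return UNDECIDED;
}
```

Returning `UNDECIDED` rather than `NOT_MATCHED` lets a caller fall through to the next pattern list, for example from per-directory patterns to the fallback exclude files.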
struct dir_entry *dir_add_ignored(struct dir_struct *dir,
				  struct index_state *istate,
				  const char *pathname, int len);

/*
 * these implement the matching logic for dir.c:excluded_from_list and
 * attr.c:path_matches()
 */
int match_basename(const char *, int,
		   const char *, int, int, unsigned);
int match_pathname(const char *, int,
		   const char *, int,
		   const char *, int, int, unsigned);

struct path_pattern *last_matching_pattern(struct dir_struct *dir,
					   struct index_state *istate,
					   const char *name, int *dtype);

int is_excluded(struct dir_struct *dir,
		struct index_state *istate,
		const char *name, int *dtype);

int pl_hashmap_cmp(const void *unused_cmp_data,
		   const struct hashmap_entry *a,
		   const struct hashmap_entry *b,
		   const void *key);
int hashmap_contains_parent(struct hashmap *map,
			    const char *path,
			    struct strbuf *buffer);
struct pattern_list *add_pattern_list(struct dir_struct *dir,
				      int group_type, const char *src);
int add_patterns_from_file_to_list(const char *fname, const char *base, int baselen,
				   struct pattern_list *pl, struct index_state *istate);
void add_patterns_from_file(struct dir_struct *, const char *fname);
int add_patterns_from_blob_to_list(struct object_id *oid,
				   const char *base, int baselen,
				   struct pattern_list *pl);
void parse_path_pattern(const char **string, int *patternlen, unsigned *flags, int *nowildcardlen);
void add_pattern(const char *string, const char *base,
		 int baselen, struct pattern_list *pl, int srcpos);
void clear_pattern_list(struct pattern_list *pl);
void dir_clear(struct dir_struct *dir);

int repo_file_exists(struct repository *repo, const char *path);
int file_exists(const char *);

int is_inside_dir(const char *dir);
int dir_inside_of(const char *subdir, const char *dir);

static inline int is_dot_or_dotdot(const char *name)
{
	return (name[0] == '.' &&
		(name[1] == '\0' ||
		 (name[1] == '.' && name[2] == '\0')));
}
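As a small illustration of how is_dot_or_dotdot() might be used, the sketch below filters "." and ".." out of a list of names. The function body is copied from above so the example compiles on its own; `count_real_entries()` is a hypothetical helper, not something dir.h provides.

```c
#include <assert.h>
#include <stddef.h>

/* Copied from the definition above so this sketch is self-contained. */
static inline int is_dot_or_dotdot(const char *name)
{
	return (name[0] == '.' &&
		(name[1] == '\0' ||
		 (name[1] == '.' && name[2] == '\0')));
}

/* Hypothetical helper: count the names that are neither "." nor "..". */
static size_t count_real_entries(const char *const *names, size_t n)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (!is_dot_or_dotdot(names[i]))
			count++;
	return count;
}
```

Names that merely start with a dot, such as ".git" or "...", are not filtered out: the check only accepts the exact strings "." and "..".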

int is_empty_dir(const char *dir);

void setup_standard_excludes(struct dir_struct *dir);


/* Constants for remove_dir_recursively: */

/*
 * If a non-directory is found within path, stop and return an error.
 * (In this case some empty directories might already have been
 * removed.)
 */
#define REMOVE_DIR_EMPTY_ONLY 01

/*
 * If any Git work trees are found within path, skip them without
 * considering it an error.
 */
#define REMOVE_DIR_KEEP_NESTED_GIT 02

/* Remove the contents of path, but leave path itself. */
#define REMOVE_DIR_KEEP_TOPLEVEL 04

/*
 * Remove path and its contents, recursively. flag is a combination
 * of the above REMOVE_DIR_* constants. Return 0 on success.
 *
 * This function uses path as temporary scratch space, but restores it
 * before returning.
 */
int remove_dir_recursively(struct strbuf *path, int flag);
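The REMOVE_DIR_* constants are distinct bits written in octal, so they can be OR-ed together into the single `flag` argument. The sketch below shows only the bit arithmetic and does not call remove_dir_recursively(); `has_remove_dir_flag()` is a hypothetical helper introduced for illustration.

```c
#include <assert.h>

/* Copied from the definitions above so this sketch is self-contained. */
#define REMOVE_DIR_EMPTY_ONLY 01
#define REMOVE_DIR_KEEP_NESTED_GIT 02
#define REMOVE_DIR_KEEP_TOPLEVEL 04

/* Hypothetical helper: is the given REMOVE_DIR_* bit set in flag? */
static int has_remove_dir_flag(int flag, int bit)
{
	return (flag & bit) != 0;
}
```

For example, `REMOVE_DIR_KEEP_NESTED_GIT | REMOVE_DIR_KEEP_TOPLEVEL` asks for nested work trees to be skipped and for path itself to be left in place.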

/*
 * Try to remove path and any directories along it that are left empty;
 * ENOENT is ignored.
 */
int remove_path(const char *path);

int fspathcmp(const char *a, const char *b);
int fspathncmp(const char *a, const char *b, size_t count);

/*
 * The prefix part of pattern must not contain wildcards.
 */
struct pathspec_item;
int git_fnmatch(const struct pathspec_item *item,
		const char *pattern, const char *string,
		int prefix);

int submodule_path_match(const struct index_state *istate,
			 const struct pathspec *ps,
			 const char *submodule_name,
			 char *seen);

static inline int ce_path_match(const struct index_state *istate,
				const struct cache_entry *ce,
				const struct pathspec *pathspec,
				char *seen)
{
	return match_pathspec(istate, pathspec, ce->name, ce_namelen(ce), 0, seen,
			      S_ISDIR(ce->ce_mode) || S_ISGITLINK(ce->ce_mode));
}

static inline int dir_path_match(const struct index_state *istate,
				 const struct dir_entry *ent,
				 const struct pathspec *pathspec,
				 int prefix, char *seen)
{
	int has_trailing_dir = ent->len && ent->name[ent->len - 1] == '/';
	int len = has_trailing_dir ? ent->len - 1 : ent->len;
	return match_pathspec(istate, pathspec, ent->name, len, prefix, seen,
			      has_trailing_dir);
}
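dir_path_match() treats a trailing '/' as a directory marker: it drops the slash from the length it matches on and passes the marker along as a separate flag. The sketch below isolates that step. `toy_entry` is a hypothetical stand-in holding only the two fields used here, and `match_len()` is an illustrative name, not a function in dir.h.

```c
#include <assert.h>

/* Hypothetical stand-in for the two dir_entry fields used above. */
struct toy_entry {
	int len;
	const char *name;
};

/*
 * Return the length to match on, without any trailing '/', and report
 * through is_dir whether a trailing '/' was present.
 */
static int match_len(const struct toy_entry *ent, int *is_dir)
{
	int has_trailing_dir = ent->len && ent->name[ent->len - 1] == '/';

	*is_dir = has_trailing_dir;
	return has_trailing_dir ? ent->len - 1 : ent->len;
}
```

The `ent->len &&` guard matters: without it an empty name would read `name[-1]`.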

int cmp_dir_entry(const void *p1, const void *p2);
int check_dir_entry_contains(const struct dir_entry *out, const struct dir_entry *in);

dir.c: ignore paths containing .git when invalidating untracked cache
read_directory() code ignores all paths named ".git" even if it's not
a valid git repository. See treat_path() for details. Since ".git" is
basically invisible to read_directory(), when we are asked to
invalidate a path that contains ".git", we can safely ignore it
because the slow path would not consider it anyway.
This helps when fsmonitor is used and we have a real ".git" repo at
worktree top. Occasionally .git/index will be updated and if the
fsmonitor hook does not filter it, untracked cache is asked to
invalidate the path ".git/index".
Without this patch, we invalidate the root directory unnecessarily,
which:
- makes read_directory() fall back to slow path for root directory
(slower)
- makes the index dirty (because UNTR extension is updated). Depending
on the index size, writing it down could also be slow.
A note about the new "safe_path" knob. Since this new check could be
relatively expensive, avoid it when we know it's not needed. If the
path comes from the index, it can't contain ".git". If it does
contain one, we have problems at many more levels, not just this one.
Noticed-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
7 years ago
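The check described above amounts to asking whether any component of a path is named ".git". A minimal sketch of such a test follows. `has_dot_git_component()` is a hypothetical helper written for illustration; it compares components exactly and case-sensitively, which is a simplification and should not be read as the check dir.c actually performs.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: return 1 if any '/'-separated component of path
 * is exactly ".git", 0 otherwise.
 */
static int has_dot_git_component(const char *path)
{
	while (*path) {
		const char *end = strchr(path, '/');
		size_t len = end ? (size_t)(end - path) : strlen(path);

		if (len == 4 && !memcmp(path, ".git", 4))
			return 1;
		if (!end)
			break;
		path = end + 1;
	}
	return 0;
}
```

With a test like this, a request to invalidate ".git/index" can be recognised and skipped, while a path such as "src/.gitignore" is still processed.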
void untracked_cache_invalidate_path(struct index_state *, const char *, int safe_path);
void untracked_cache_remove_from_index(struct index_state *, const char *);
void untracked_cache_add_to_index(struct index_state *, const char *);

void free_untracked_cache(struct untracked_cache *);
struct untracked_cache *read_untracked_extension(const void *data, unsigned long sz);
void write_untracked_extension(struct strbuf *out, struct untracked_cache *untracked);
void add_untracked_cache(struct index_state *istate);
void remove_untracked_cache(struct index_state *istate);

/*
 * Connect a worktree to a git directory by creating (or overwriting) a
 * '.git' file containing the location of the git directory. In the git
 * directory set the core.worktree setting to indicate where the worktree is.
 * When `recurse_into_nested` is set, recurse into any nested submodules,
 * connecting them as well.
 */
void connect_work_tree_and_git_dir(const char *work_tree,
				   const char *git_dir,
				   int recurse_into_nested);
void relocate_gitdir(const char *path,
		     const char *old_git_dir,
		     const char *new_git_dir);
#endif