From fbce4fa9ae4f010e509be288d8b113610e78781d Mon Sep 17 00:00:00 2001 From: Jeff King Date: Thu, 22 Sep 2022 06:11:43 -0400 Subject: [PATCH 1/3] fsck: free tree buffers after walking unreachable objects After calling fsck_walk(), a tree object struct may be left in the parsed state, with the full tree contents available via tree->buffer. It's the responsibility of the caller to free these when it's done with the object to avoid having many trees allocated at once. In a regular "git fsck", we hit fsck_walk() only from fsck_obj(), which does call free_tree_buffer(). Likewise for "--connectivity-only", we see most objects via traverse_one_object(), which makes a similar call. The exception is in mark_unreachable_referents(). When using both "--connectivity-only" and "--dangling" (the latter of which is the default), we walk all of the unreachable objects, and there we forget to free. Most cases would not notice this, because they don't have a lot of unreachable objects, but you can make a pathological case like this: git clone --bare /path/to/linux.git repo.git cd repo.git rm packed-refs ;# now everything is unreachable! git fsck --connectivity-only That ends up with peak heap usage ~18GB, which is (not coincidentally) close to the size of all uncompressed trees in the repository. After this patch, the peak heap is only ~2GB. A few things to note: - it might seem like fsck_walk(), if it is parsing the trees, should be responsible for freeing them. But the situation is quite tricky. In the non-connectivity mode, after we call fsck_walk() we then proceed with fsck_object() which actually does the type-specific sanity checks on the object contents. We do pass our own separate buffer to fsck_object(), but there's a catch: our earlier call to parse_object_buffer() may have attached that buffer to the object struct! So by freeing it, we leave the rest of the code with a dangling pointer. Likewise, the call to fsck_walk() in index-pack is subtle. It attaches a buffer to the tree object that must not be freed! And so rather than calling free_tree_buffer(), it actually detaches it by setting tree->buffer to NULL. These cases would _probably_ be fixable by having fsck_walk() free the tree buffer only when it was the one who allocated it via parse_tree(). But that would still leave the callers responsible for freeing other cases, so they wouldn't be simplified. While the current semantics for fsck_walk() make it easy to accidentally leak in new callers, at least they are simple to explain, and it's not a function that's likely to get a lot of new call-sites. And in any case, it's probably sensible to fix the leak first with this simple patch, and try any more complicated refactoring separately. - a careful reader may notice that fsck_obj() also frees commit buffers, but neither the call in traverse_one_object() nor the one touched in this patch does so. And indeed, this is another problem for --connectivity-only (and accounts for most of the 2GB heap after this patch), but it's one we'll fix in a separate commit. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano --- builtin/fsck.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/builtin/fsck.c b/builtin/fsck.c index 6c73092f10..2119758ef4 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -228,6 +228,8 @@ static void mark_unreachable_referents(const struct object_id *oid) options.walk = mark_used; fsck_walk(obj, NULL, &options); + if (obj->type == OBJ_TREE) + free_tree_buffer((struct tree *)obj); } static int mark_loose_unreachable_referents(const struct object_id *oid, From 069e4452567143000b839d563d2529cec93ec9e8 Mon Sep 17 00:00:00 2001 From: Jeff King Date: Thu, 22 Sep 2022 06:13:36 -0400 Subject: [PATCH 2/3] fsck: turn off save_commit_buffer When parsing a commit, the default behavior is to stuff the original buffer into a commit_slab (which takes ownership of it). But for a tool like fsck, this isn't useful. While we may look at the buffer further as part of fsck_commit(), we'll always do so through a separate pointer; attaching the buffer to the slab doesn't help. Worse, it means we have to remember to free the commit buffer in all call paths. We do so in fsck_obj(), which covers a regular "git fsck". But with "--connectivity-only", we forget to do so in both traverse_one_object(), which covers reachable objects, and mark_unreachable_referents(), which covers unreachable ones. As a result, that mode ends up storing an uncompressed copy of every commit on the heap at once. We could teach the code paths for --connectivity-only to also free commit buffers. But there's an even easier fix: we can just turn off the save_commit_buffer flag, and then we won't attach them to the commits in the first place. This reduces the peak heap of running "git fsck --connectivity-only" in a clone of linux.git from ~2GB to ~1GB. According to massif, the remaining memory goes where you'd expect: the object structs themselves, the obj_hash containing them, and the delta base cache. Note that we'll leave the call to free commit buffers in fsck_obj() for now; it's not quite redundant because of a related bug that we'll fix in a subsequent commit. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano --- builtin/fsck.c | 1 + 1 file changed, 1 insertion(+) diff --git a/builtin/fsck.c b/builtin/fsck.c index 2119758ef4..38922a369d 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -853,6 +853,7 @@ int cmd_fsck(int argc, const char **argv, const char *prefix) errors_found = 0; read_replace_refs = 0; + save_commit_buffer = 0; argc = parse_options(argc, argv, prefix, fsck_opts, fsck_usage, 0); From 51b27747e5ba939942fe0e1f9a61e86e0ead19ed Mon Sep 17 00:00:00 2001 From: Jeff King Date: Thu, 22 Sep 2022 06:15:38 -0400 Subject: [PATCH 3/3] parse_object_buffer(): respect save_commit_buffer If the global variable "save_commit_buffer" is set to 0, then parse_commit() will throw away the commit object data after parsing it, rather than sticking it into a commit slab. This goes all the way back to 60ab26de99 ([PATCH] Avoid wasting memory in git-rev-list, 2005-09-15). But there's another code path which may similarly stash the buffer: parse_object_buffer(). This is where we end up if we parse a commit via parse_object(), and it's used directly in a few other code paths like git-fsck. The original goal of 60ab26de99 was avoiding extra memory usage for rev-list. And there it's not all that important to catch parse_object(). We use that function only for looking at the tips of the traversal, and the majority of the commits are parsed by following parent links, where we use parse_commit() directly. So we were wasting some memory, but only a small portion. It's much easier to see the effect with fsck. Since we now turn off save_commit_buffer by default there, we _should_ be able to drop the freeing of the commit buffer in fsck_obj(). But if we do so (taking the first hunk of this patch without the rest), then the peak heap of "git fsck" in a clone of git.git goes from 136MB to 194MB. Teaching parse_object_buffer() to respect save_commit_buffer brings that down to 134.5MB (it's hard to tell from massif's output, but I suspect the savings comes from avoiding the overhead of the mostly-empty commit slab). Other programs should see a small improvement. Both "rev-list --all" and "fsck --connectivity-only" improve by a few hundred kilobytes, as they'd avoid loading the tip objects of their traversals. Most importantly, no code should be hurt by doing this. Any program that turns off save_commit_buffer is already making the assumption that any commit it sees may need to have its object data loaded on demand, as it doesn't know which ones were parsed by parse_commit() versus parse_object(). Not to mention that anything parsed by the commit graph may be in the same boat, even if save_commit_buffer was not disabled. This should be the only spot that needs to be fixed. Grepping for set_commit_buffer() shows that this and parse_commit() are the only relevant calls. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano --- builtin/fsck.c | 3 --- object.c | 3 ++- 2 files changed, 2 insertions(+), 4 deletions(-) diff --git a/builtin/fsck.c b/builtin/fsck.c index 38922a369d..a7592ef5ff 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -439,9 +439,6 @@ static int fsck_obj(struct object *obj, void *buffer, unsigned long size) out: if (obj->type == OBJ_TREE) free_tree_buffer((struct tree *)obj); - if (obj->type == OBJ_COMMIT) - free_commit_buffer(the_repository->parsed_objects, - (struct commit *)obj); return err; } diff --git a/object.c b/object.c index 588b8156f1..868e623719 100644 --- a/object.c +++ b/object.c @@ -233,7 +233,8 @@ struct object *parse_object_buffer(struct repository *r, const struct object_id if (commit) { if (parse_commit_buffer(r, commit, buffer, size, 1)) return NULL; - if (!get_cached_commit_buffer(r, commit, NULL)) { + if (save_commit_buffer && + !get_cached_commit_buffer(r, commit, NULL)) { set_commit_buffer(r, commit, buffer, size); *eaten_p = 1; }