From 65168c42df25c61cfbe3ab481549b250c150f1f6 Mon Sep 17 00:00:00 2001 From: Taylor Blau Date: Mon, 22 Aug 2022 15:50:32 -0400 Subject: [PATCH 1/7] t5326: demonstrate potential bitmap corruption It is possible to generate a corrupt MIDX bitmap when certain conditions are met. This happens when the preferred pack "P" changes to one (say, "Q") that: - "Q" has objects included in an existing MIDX, - but "Q" is different than "P", - and "Q" and "P" have some objects in common When this is the case, not all objects from "Q" will be selected from "Q" (ie., the generated MIDX will represent them as coming from a different pack), despite "Q" being preferred. This is an invariant violation, since all objects contained in the MIDX's preferred pack are supposed to originate from the preferred pack. In other words, all duplicate objects are resolved in favor of the copy that comes from the MIDX's preferred pack, if any. This violation results in a corrupt object order, which cannot be interpreted by the pack-bitmap code, leading to broken clones and other defects. This test demonstrates the above problem by constructing a minimal reproduction, and showing that the final `git clone` invocation fails. The reproduction is mostly straightforward, except that the new pack generated between MIDX writes (which is necessary in order to prevent that operation from being a noop) must sort ahead of all existing packs in order to prevent a different pack (neither "P" nor "Q") from appearing as preferred (meaning all its objects appear in order at the beginning of the pseudo-pack order). Subsequent commits will first refactor the midx.c::get_sorted_entries() function, and then fix this bug. Reported-by: Abhradeep Chakraborty Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano --- t/t5326-multi-pack-bitmaps.sh | 47 +++++++++++++++++++++++++++++++++++ 1 file changed, 47 insertions(+) diff --git a/t/t5326-multi-pack-bitmaps.sh b/t/t5326-multi-pack-bitmaps.sh index 4fe57414c1..c364677ae8 100755 --- a/t/t5326-multi-pack-bitmaps.sh +++ b/t/t5326-multi-pack-bitmaps.sh @@ -307,4 +307,51 @@ test_expect_success 'graceful fallback when missing reverse index' ' ) ' +test_expect_success 'preferred pack change with existing MIDX bitmap' ' + git init preferred-pack-with-existing && + ( + cd preferred-pack-with-existing && + + test_commit base && + test_commit other && + + git rev-list --objects --no-object-names base >p1.objects && + git rev-list --objects --no-object-names other >p2.objects && + + p1="$(git pack-objects "$objdir/pack/pack" \ + --delta-base-offset Date: Mon, 22 Aug 2022 15:50:35 -0400 Subject: [PATCH 2/7] t/lib-bitmap.sh: avoid silencing stderr The midx_bitmap_partial_tests() function is responsible for setting up a state where some (but not all) packs in the repository are covered by a MIDX (and bitmap). This function has redirected the `git multi-pack-index write --bitmap`'s stderr to a file "err" since its introduction back in c51f5a6437 (t5326: test multi-pack bitmap behavior, 2021-08-31). This was likely a stray change left over from a slightly different version of this test, since the file "err" is never read after being written. This leads to confusingly-missing output, especially when the contents of stderr are important. Resolve this confusion by avoiding silencing stderr in this case. Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano --- t/lib-bitmap.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/t/lib-bitmap.sh b/t/lib-bitmap.sh index a95537e759..f595937094 100644 --- a/t/lib-bitmap.sh +++ b/t/lib-bitmap.sh @@ -440,7 +440,7 @@ midx_bitmap_partial_tests () { test_commit packed && git repack && test_commit loose && - git multi-pack-index write --bitmap 2>err && + git multi-pack-index write --bitmap && test_path_is_file $midx && test_path_is_file $midx-$(midx_checksum $objdir).bitmap ' From 989d9cbd5c32fad54391a4ed06b1271f0a891077 Mon Sep 17 00:00:00 2001 From: Taylor Blau Date: Mon, 22 Aug 2022 15:50:38 -0400 Subject: [PATCH 3/7] midx.c: extract `struct midx_fanout` To build up a list of objects (along with their packs, and the offsets within those packs that each object appears at), the MIDX code implements `get_sorted_entries()` which builds up a list of candidates, sorts them, and then removes duplicate entries. To do this, it keeps an array of `pack_midx_entry` structures that it builds up once for each fanout level (ie., for all possible values of the first byte of each object's ID). This array is a function-local variable of `get_sorted_entries()`. Since it uses the ALLOC_GROW() macro, having the `alloc_fanout` variable also be local to that function, and only modified within that function is convenient. However, subsequent changes will extract the two ways this array is filled (from a pack at some fanout value, and from an existing MIDX at some fanout value) into separate functions. Instead of passing around pointers to the entries array, along with `nr_fanout` and `alloc_fanout`, encapsulate these three into a structure instead. Then pass around a pointer to this structure instead. This patch does not yet extract the above two functions, but sets us up to begin doing so in the following commit. For now, the implementation of get_sorted_entries() is only modified to replace `entries_by_fanout` with `fanout.entries`, `nr_fanout` with `fanout.nr`, and so on. Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano --- midx.c | 54 +++++++++++++++++++++++++++++++++++------------------- 1 file changed, 35 insertions(+), 19 deletions(-) diff --git a/midx.c b/midx.c index 5f0dd386b0..b4f6e37cd0 100644 --- a/midx.c +++ b/midx.c @@ -577,6 +577,22 @@ static void fill_pack_entry(uint32_t pack_int_id, entry->preferred = !!preferred; } +struct midx_fanout { + struct pack_midx_entry *entries; + uint32_t nr; + uint32_t alloc; +}; + +static void midx_fanout_grow(struct midx_fanout *fanout, uint32_t nr) +{ + ALLOC_GROW(fanout->entries, nr, fanout->alloc); +} + +static void midx_fanout_sort(struct midx_fanout *fanout) +{ + QSORT(fanout->entries, fanout->nr, midx_oid_compare); +} + /* * It is possible to artificially get into a state where there are many * duplicate copies of objects. That can create high memory pressure if @@ -595,8 +611,8 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m, int preferred_pack) { uint32_t cur_fanout, cur_pack, cur_object; - uint32_t alloc_fanout, alloc_objects, total_objects = 0; - struct pack_midx_entry *entries_by_fanout = NULL; + uint32_t alloc_objects, total_objects = 0; + struct midx_fanout fanout = { 0 }; struct pack_midx_entry *deduplicated_entries = NULL; uint32_t start_pack = m ? m->num_packs : 0; @@ -608,14 +624,14 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m, * slices to be evenly distributed, with some noise. Hence, * allocate slightly more than one 256th. */ - alloc_objects = alloc_fanout = total_objects > 3200 ? total_objects / 200 : 16; + alloc_objects = fanout.alloc = total_objects > 3200 ? total_objects / 200 : 16; - ALLOC_ARRAY(entries_by_fanout, alloc_fanout); + ALLOC_ARRAY(fanout.entries, fanout.alloc); ALLOC_ARRAY(deduplicated_entries, alloc_objects); *nr_objects = 0; for (cur_fanout = 0; cur_fanout < 256; cur_fanout++) { - uint32_t nr_fanout = 0; + fanout.nr = 0; if (m) { uint32_t start = 0, end; @@ -625,15 +641,15 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m, end = ntohl(m->chunk_oid_fanout[cur_fanout]); for (cur_object = start; cur_object < end; cur_object++) { - ALLOC_GROW(entries_by_fanout, nr_fanout + 1, alloc_fanout); + midx_fanout_grow(&fanout, fanout.nr + 1); nth_midxed_pack_midx_entry(m, - &entries_by_fanout[nr_fanout], + &fanout.entries[fanout.nr], cur_object); if (nth_midxed_pack_int_id(m, cur_object) == preferred_pack) - entries_by_fanout[nr_fanout].preferred = 1; + fanout.entries[fanout.nr].preferred = 1; else - entries_by_fanout[nr_fanout].preferred = 0; - nr_fanout++; + fanout.entries[fanout.nr].preferred = 0; + fanout.nr++; } } @@ -646,36 +662,36 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m, end = get_pack_fanout(info[cur_pack].p, cur_fanout); for (cur_object = start; cur_object < end; cur_object++) { - ALLOC_GROW(entries_by_fanout, nr_fanout + 1, alloc_fanout); + midx_fanout_grow(&fanout, fanout.nr + 1); fill_pack_entry(cur_pack, info[cur_pack].p, cur_object, - &entries_by_fanout[nr_fanout], + &fanout.entries[fanout.nr], preferred); - nr_fanout++; + fanout.nr++; } } - QSORT(entries_by_fanout, nr_fanout, midx_oid_compare); + midx_fanout_sort(&fanout); /* * The batch is now sorted by OID and then mtime (descending). * Take only the first duplicate. */ - for (cur_object = 0; cur_object < nr_fanout; cur_object++) { - if (cur_object && oideq(&entries_by_fanout[cur_object - 1].oid, - &entries_by_fanout[cur_object].oid)) + for (cur_object = 0; cur_object < fanout.nr; cur_object++) { + if (cur_object && oideq(&fanout.entries[cur_object - 1].oid, + &fanout.entries[cur_object].oid)) continue; ALLOC_GROW(deduplicated_entries, *nr_objects + 1, alloc_objects); memcpy(&deduplicated_entries[*nr_objects], - &entries_by_fanout[cur_object], + &fanout.entries[cur_object], sizeof(struct pack_midx_entry)); (*nr_objects)++; } } - free(entries_by_fanout); + free(fanout.entries); return deduplicated_entries; } From 852c530102c5ca5b69ae863acaa1acb50cbba861 Mon Sep 17 00:00:00 2001 From: Taylor Blau Date: Mon, 22 Aug 2022 15:50:41 -0400 Subject: [PATCH 4/7] midx.c: extract `midx_fanout_add_midx_fanout()` Extract a routine to add all objects whose object ID's first byte is `cur_fanout` from an existing MIDX. This function will only be called once, so extracting it is purely cosmetic to improve the readability of `get_sorted_entries()` (its sole caller) below. The functionality is unchanged in this commit. Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano --- midx.c | 47 ++++++++++++++++++++++++++++------------------- 1 file changed, 28 insertions(+), 19 deletions(-) diff --git a/midx.c b/midx.c index b4f6e37cd0..17bdceed25 100644 --- a/midx.c +++ b/midx.c @@ -593,6 +593,31 @@ static void midx_fanout_sort(struct midx_fanout *fanout) QSORT(fanout->entries, fanout->nr, midx_oid_compare); } +static void midx_fanout_add_midx_fanout(struct midx_fanout *fanout, + struct multi_pack_index *m, + int preferred_pack, + uint32_t cur_fanout) +{ + uint32_t start = 0, end; + uint32_t cur_object; + + if (cur_fanout) + start = ntohl(m->chunk_oid_fanout[cur_fanout - 1]); + end = ntohl(m->chunk_oid_fanout[cur_fanout]); + + for (cur_object = start; cur_object < end; cur_object++) { + midx_fanout_grow(fanout, fanout->nr + 1); + nth_midxed_pack_midx_entry(m, + &fanout->entries[fanout->nr], + cur_object); + if (nth_midxed_pack_int_id(m, cur_object) == preferred_pack) + fanout->entries[fanout->nr].preferred = 1; + else + fanout->entries[fanout->nr].preferred = 0; + fanout->nr++; + } +} + /* * It is possible to artificially get into a state where there are many * duplicate copies of objects. That can create high memory pressure if @@ -633,25 +658,9 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m, for (cur_fanout = 0; cur_fanout < 256; cur_fanout++) { fanout.nr = 0; - if (m) { - uint32_t start = 0, end; - - if (cur_fanout) - start = ntohl(m->chunk_oid_fanout[cur_fanout - 1]); - end = ntohl(m->chunk_oid_fanout[cur_fanout]); - - for (cur_object = start; cur_object < end; cur_object++) { - midx_fanout_grow(&fanout, fanout.nr + 1); - nth_midxed_pack_midx_entry(m, - &fanout.entries[fanout.nr], - cur_object); - if (nth_midxed_pack_int_id(m, cur_object) == preferred_pack) - fanout.entries[fanout.nr].preferred = 1; - else - fanout.entries[fanout.nr].preferred = 0; - fanout.nr++; - } - } + if (m) + midx_fanout_add_midx_fanout(&fanout, m, preferred_pack, + cur_fanout); for (cur_pack = start_pack; cur_pack < nr_packs; cur_pack++) { uint32_t start = 0, end; From 1d6f4c64084f31bd1be558ae72f25e7600c39f4d Mon Sep 17 00:00:00 2001 From: Taylor Blau Date: Mon, 22 Aug 2022 15:50:43 -0400 Subject: [PATCH 5/7] midx.c: extract `midx_fanout_add_pack_fanout()` Extract a routine to add all objects whose object ID's first byte is `cur_fanout` from a given pack (identified by its index into the `struct pack_info` array maintained by the MIDX writing routine). Unlike the previous extraction (for `midx_fanout_add_midx_fanout()`), this function will be called twice, once for all new packs, and again for the preferred pack (if it appears in an existing MIDX). The latter change is to resolve the bug described a few patches ago, and will be made in the subsequent commit. Similar to the previous refactoring, this function also enhances the readability of its caller in `get_sorted_entries()`. Its functionality is unchanged in this commit. Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano --- midx.c | 43 ++++++++++++++++++++++++++++--------------- 1 file changed, 28 insertions(+), 15 deletions(-) diff --git a/midx.c b/midx.c index 17bdceed25..b2ef4ece64 100644 --- a/midx.c +++ b/midx.c @@ -618,6 +618,31 @@ static void midx_fanout_add_midx_fanout(struct midx_fanout *fanout, } } +static void midx_fanout_add_pack_fanout(struct midx_fanout *fanout, + struct pack_info *info, + uint32_t cur_pack, + int preferred, + uint32_t cur_fanout) +{ + struct packed_git *pack = info[cur_pack].p; + uint32_t start = 0, end; + uint32_t cur_object; + + if (cur_fanout) + start = get_pack_fanout(pack, cur_fanout - 1); + end = get_pack_fanout(pack, cur_fanout); + + for (cur_object = start; cur_object < end; cur_object++) { + midx_fanout_grow(fanout, fanout->nr + 1); + fill_pack_entry(cur_pack, + info[cur_pack].p, + cur_object, + &fanout->entries[fanout->nr], + preferred); + fanout->nr++; + } +} + /* * It is possible to artificially get into a state where there are many * duplicate copies of objects. That can create high memory pressure if @@ -663,22 +688,10 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m, cur_fanout); for (cur_pack = start_pack; cur_pack < nr_packs; cur_pack++) { - uint32_t start = 0, end; int preferred = cur_pack == preferred_pack; - - if (cur_fanout) - start = get_pack_fanout(info[cur_pack].p, cur_fanout - 1); - end = get_pack_fanout(info[cur_pack].p, cur_fanout); - - for (cur_object = start; cur_object < end; cur_object++) { - midx_fanout_grow(&fanout, fanout.nr + 1); - fill_pack_entry(cur_pack, - info[cur_pack].p, - cur_object, - &fanout.entries[fanout.nr], - preferred); - fanout.nr++; - } + midx_fanout_add_pack_fanout(&fanout, + info, cur_pack, + preferred, cur_fanout); } midx_fanout_sort(&fanout); From cdf517be06b83c01d0411709c1fac0c2fc5c1a9b Mon Sep 17 00:00:00 2001 From: Taylor Blau Date: Mon, 22 Aug 2022 15:50:46 -0400 Subject: [PATCH 6/7] midx.c: include preferred pack correctly with existing MIDX This patch resolves an issue where the object order used to generate a MIDX bitmap would violate an invariant that all of the preferred pack's objects are represented by that pack in the MIDX. The problem arises when reusing an existing MIDX while generating a new one, and occurs specifically when the identity of the preferred pack changes from one MIDX to another, along with a few other conditions: - the new preferred pack must also be present in the existing MIDX - the new preferred pack must *not* have been the preferred pack in the existing MIDX - most importantly, there must be at least one object present in the physical preferred pack (ie., it shows up in that pack's index) but was selected from a *different* pack when the previous MIDX was generated When the above conditions are all met, we end up (incorrectly) discarding copies of some objects in the pack selected as the preferred pack. This is because `get_sorted_entries()` adds objects to its list by doing the following at each fanout level: - first, adding all objects from that fanout level from an existing MIDX - then, adding all objects from that fanout level in each pack *not* included in the existing MIDX So if some object was not selected from the to-be-preferred pack when writing the previous MIDX, then we will never consider it as a candidate when generating the new MIDX. This means that it's possible for the preferred pack to not include all of its objects in the MIDX's pseudo-pack object order, which is an invariant violation of that order. Resolve this by adding all objects from the preferred pack separately when it appears in the existing MIDX (if one was present). This will duplicate objects from that pack that *did* appear in the MIDX, but this is fine, since get_sorted_entries() already handles duplicates. (A future optimization in this area could avoid adding copies of objects that we know already existing in the MIDX.) Note that we no longer need to compute the preferred-ness of objects added from the MIDX, since we only want to select the preferred objects from a single source. (We could still mark these preferred bits, but doing so is redundant and unnecessary). This resolves the bug demonstrated by t5326.174 ("preferred pack change with existing MIDX bitmap"). Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano --- midx.c | 14 +++++++------- t/t5326-multi-pack-bitmaps.sh | 13 +++++-------- 2 files changed, 12 insertions(+), 15 deletions(-) diff --git a/midx.c b/midx.c index b2ef4ece64..aa9574f73e 100644 --- a/midx.c +++ b/midx.c @@ -595,7 +595,6 @@ static void midx_fanout_sort(struct midx_fanout *fanout) static void midx_fanout_add_midx_fanout(struct midx_fanout *fanout, struct multi_pack_index *m, - int preferred_pack, uint32_t cur_fanout) { uint32_t start = 0, end; @@ -610,10 +609,7 @@ static void midx_fanout_add_midx_fanout(struct midx_fanout *fanout, nth_midxed_pack_midx_entry(m, &fanout->entries[fanout->nr], cur_object); - if (nth_midxed_pack_int_id(m, cur_object) == preferred_pack) - fanout->entries[fanout->nr].preferred = 1; - else - fanout->entries[fanout->nr].preferred = 0; + fanout->entries[fanout->nr].preferred = 0; fanout->nr++; } } @@ -684,8 +680,7 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m, fanout.nr = 0; if (m) - midx_fanout_add_midx_fanout(&fanout, m, preferred_pack, - cur_fanout); + midx_fanout_add_midx_fanout(&fanout, m, cur_fanout); for (cur_pack = start_pack; cur_pack < nr_packs; cur_pack++) { int preferred = cur_pack == preferred_pack; @@ -694,6 +689,11 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m, preferred, cur_fanout); } + if (-1 < preferred_pack && preferred_pack < start_pack) + midx_fanout_add_pack_fanout(&fanout, info, + preferred_pack, 1, + cur_fanout); + midx_fanout_sort(&fanout); /* diff --git a/t/t5326-multi-pack-bitmaps.sh b/t/t5326-multi-pack-bitmaps.sh index c364677ae8..89ecd1062c 100755 --- a/t/t5326-multi-pack-bitmaps.sh +++ b/t/t5326-multi-pack-bitmaps.sh @@ -338,19 +338,16 @@ test_expect_success 'preferred pack change with existing MIDX bitmap' ' git pack-objects --all --unpacked $objdir/pack/pack0 && # Generate a new MIDX which changes the preferred pack - # to a pack contained in the existing MIDX, such that - # not all objects from p2 that appear in the MIDX had - # their copy selected from p2. + # to a pack contained in the existing MIDX. git multi-pack-index write --bitmap \ --preferred-pack="pack-$p2.pack" && test_path_is_file $midx && test_path_is_file $midx-$(midx_checksum $objdir).bitmap && - # When the above circumstances are met, an existing bug - # in the MIDX machinery will cause the reverse index to - # be read incorrectly, resulting in failed clones (among - # other things). - test_must_fail git clone --no-local . clone2 + # When the above circumstances are met, the preferred + # pack should change appropriately and clones should + # (still) succeed. + git clone --no-local . clone2 ) ' From 99e4d084ffc4c6f8cb28ec61fdbb44facdd47ac7 Mon Sep 17 00:00:00 2001 From: Taylor Blau Date: Mon, 22 Aug 2022 15:50:49 -0400 Subject: [PATCH 7/7] midx.c: avoid adding preferred objects twice The last commit changes the behavior of midx.c's `get_sorted_objects()` function to handle the case of writing a MIDX bitmap while reusing an existing MIDX and changing the identity of the preferred pack separately. As part of this change, all objects from the (new) preferred pack are added to the fanout table in a separate pass. Since these copies of the objects all have their preferred bits set, any duplicates will be resolved in their favor. Importantly, this includes any copies of those same objects that come from the existing MIDX. We know at the time of adding them that they'll be redundant if their source pack is the (new) preferred one, so we can avoid adding them to the list in this case. Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano --- midx.c | 15 +++++++++++++-- 1 file changed, 13 insertions(+), 2 deletions(-) diff --git a/midx.c b/midx.c index aa9574f73e..1ce4c5f05c 100644 --- a/midx.c +++ b/midx.c @@ -595,7 +595,8 @@ static void midx_fanout_sort(struct midx_fanout *fanout) static void midx_fanout_add_midx_fanout(struct midx_fanout *fanout, struct multi_pack_index *m, - uint32_t cur_fanout) + uint32_t cur_fanout, + int preferred_pack) { uint32_t start = 0, end; uint32_t cur_object; @@ -605,6 +606,15 @@ static void midx_fanout_add_midx_fanout(struct midx_fanout *fanout, end = ntohl(m->chunk_oid_fanout[cur_fanout]); for (cur_object = start; cur_object < end; cur_object++) { + if ((preferred_pack > -1) && + (preferred_pack == nth_midxed_pack_int_id(m, cur_object))) { + /* + * Objects from preferred packs are added + * separately. + */ + continue; + } + midx_fanout_grow(fanout, fanout->nr + 1); nth_midxed_pack_midx_entry(m, &fanout->entries[fanout->nr], @@ -680,7 +690,8 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m, fanout.nr = 0; if (m) - midx_fanout_add_midx_fanout(&fanout, m, cur_fanout); + midx_fanout_add_midx_fanout(&fanout, m, cur_fanout, + preferred_pack); for (cur_pack = start_pack; cur_pack < nr_packs; cur_pack++) { int preferred = cur_pack == preferred_pack;