From 75979d94607d92d53b1cad5d65f20594c66d275c Mon Sep 17 00:00:00 2001 From: Derrick Stolee Date: Tue, 1 Mar 2022 19:48:30 +0000 Subject: commit-graph: fix ordering bug in generation numbers MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit When computing the generation numbers for a commit-graph, we compute the corrected commit dates and then check if their offsets from the actual dates is too large to fit in the 32-bit Generation Data chunk. However, there is a problem with this approach: if we have parsed the generation data from the previous commit-graph, then we continue the loop because the corrected commit date is already computed. This causes an under-count in the number of overflow values. It is incorrect to add an increment to num_generation_data_overflows next to this 'continue' statement, because we might start double-counting commits that are computed because of the depth-first search walk from a commit with an earlier OID. Instead, iterate over the full commit list at the end, checking the offsets to see how many grow beyond the maximum value. Create a new t5328-commit-graph-64-bit-time.sh test script to handle special cases of testing 64-bit timestamps. This helps demonstrate this bug in more cases. It still won't hit all potential cases until the next change, which reenables reading generation numbers. Use the skip_all trick from 0a2bfccb9c8 (t0051: use "skip_all" under !MINGW in single-test file, 2022-02-04) to make the output clean when run on a 32-bit system. Helped-by: Ævar Arnfjörð Bjarmason Signed-off-by: Derrick Stolee Signed-off-by: Junio C Hamano --- commit-graph.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) (limited to 'commit-graph.c') diff --git a/commit-graph.c b/commit-graph.c index 265c010122..a19bd96c2e 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -1556,12 +1556,16 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx) if (current->date && current->date > max_corrected_commit_date) max_corrected_commit_date = current->date - 1; commit_graph_data_at(current)->generation = max_corrected_commit_date + 1; - - if (commit_graph_data_at(current)->generation - current->date > GENERATION_NUMBER_V2_OFFSET_MAX) - ctx->num_generation_data_overflows++; } } } + + for (i = 0; i < ctx->commits.nr; i++) { + struct commit *c = ctx->commits.list[i]; + timestamp_t offset = commit_graph_data_at(c)->generation - c->date; + if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) + ctx->num_generation_data_overflows++; + } stop_progress(&ctx->progress); } -- cgit v1.2.3 From 3b0199d4c3c9cc2ec39413c52c15cfd16e4d0980 Mon Sep 17 00:00:00 2001 From: Derrick Stolee Date: Tue, 1 Mar 2022 19:48:31 +0000 Subject: commit-graph: start parsing generation v2 (again) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The 'read_generation_data' member of 'struct commit_graph' was introduced by 1fdc383c5 (commit-graph: use generation v2 only if entire chain does, 2021-01-16). The intention was to avoid using corrected commit dates if not all layers of a commit-graph had that data stored. The logic in validate_mixed_generation_chain() at that point incorrectly initialized read_generation_data to 1 if and only if the tip commit-graph contained the Corrected Commit Date chunk. This was "fixed" in 448a39e65 (commit-graph: validate layers for generation data, 2021-02-02) to validate that read_generation_data was either non-zero for all layers, or it would set read_generation_data to zero for all layers. The problem here is that read_generation_data is not initialized to be non-zero anywhere! This change initializes read_generation_data immediately after the chunk is parsed, so each layer will have its value present as soon as possible. The read_generation_data member is used in fill_commit_graph_info() to determine if we should use the corrected commit date or the topological levels stored in the Commit Data chunk. Due to this bug, all previous versions of Git were defaulting to topological levels in all cases! This can be measured with some performance tests. Using the Linux kernel as a testbed, I generated a complete commit-graph containing corrected commit dates and tested the 'new' version against the previous, 'old' version. First, rev-list with --topo-order demonstrates a 26% improvement using corrected commit dates: hyperfine \ -n "old" "$OLD_GIT rev-list --topo-order -1000 v3.6" \ -n "new" "$NEW_GIT rev-list --topo-order -1000 v3.6" \ --warmup=10 Benchmark 1: old Time (mean ± σ): 57.1 ms ± 3.1 ms Range (min … max): 52.9 ms … 62.0 ms 55 runs Benchmark 2: new Time (mean ± σ): 45.5 ms ± 3.3 ms Range (min … max): 39.9 ms … 51.7 ms 59 runs Summary 'new' ran 1.26 ± 0.11 times faster than 'old' These performance improvements are due to the algorithmic improvements given by walking fewer commits due to the higher cutoffs from corrected commit dates. However, this comes at a cost. The additional I/O cost of parsing the corrected commit dates is visible in case of merge-base commands that do not reduce the overall number of walked commits. hyperfine \ -n "old" "$OLD_GIT merge-base v4.8 v4.9" \ -n "new" "$NEW_GIT merge-base v4.8 v4.9" \ --warmup=10 Benchmark 1: old Time (mean ± σ): 110.4 ms ± 6.4 ms Range (min … max): 96.0 ms … 118.3 ms 25 runs Benchmark 2: new Time (mean ± σ): 150.7 ms ± 1.1 ms Range (min … max): 149.3 ms … 153.4 ms 19 runs Summary 'old' ran 1.36 ± 0.08 times faster than 'new' Performance issues like this are what motivated 702110aac (commit-graph: use config to specify generation type, 2021-02-25). In the future, we could fix this performance problem by inserting the corrected commit date offsets into the Commit Date chunk instead of having that data in an extra chunk. Signed-off-by: Derrick Stolee Signed-off-by: Junio C Hamano --- commit-graph.c | 3 +++ t/lib-commit-graph.sh | 12 +++++++++++- t/t4216-log-bloom.sh | 2 +- t/t5318-commit-graph.sh | 2 +- t/t5324-split-commit-graph.sh | 9 +++++++-- 5 files changed, 23 insertions(+), 5 deletions(-) (limited to 'commit-graph.c') diff --git a/commit-graph.c b/commit-graph.c index a19bd96c2e..8e52bb0955 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -407,6 +407,9 @@ struct commit_graph *parse_commit_graph(struct repository *r, &graph->chunk_generation_data); pair_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW, &graph->chunk_generation_data_overflow); + + if (graph->chunk_generation_data) + graph->read_generation_data = 1; } if (r->settings.commit_graph_read_changed_paths) { diff --git a/t/lib-commit-graph.sh b/t/lib-commit-graph.sh index 07e12b9d2f..5d79e1a4e9 100755 --- a/t/lib-commit-graph.sh +++ b/t/lib-commit-graph.sh @@ -37,11 +37,21 @@ graph_read_expect() { OPTIONAL=" $2" NUM_CHUNKS=$((3 + $(echo "$2" | wc -w))) fi + GENERATION_VERSION=2 + if test -n "$3" + then + GENERATION_VERSION=$3 + fi + OPTIONS= + if test $GENERATION_VERSION -gt 1 + then + OPTIONS=" read_generation_data" + fi cat >expect <<- EOF header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0 num_commits: $1 chunks: oid_fanout oid_lookup commit_metadata$OPTIONAL - options: + options:$OPTIONS EOF test-tool read-graph >output && test_cmp expect output diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh index 5ed6d2a21c..fa9d32facf 100755 --- a/t/t4216-log-bloom.sh +++ b/t/t4216-log-bloom.sh @@ -48,7 +48,7 @@ graph_read_expect () { header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0 num_commits: $1 chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data - options: bloom(1,10,7) + options: bloom(1,10,7) read_generation_data EOF test-tool read-graph >actual && test_cmp expect actual diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index 0ed7e9de8e..fbf0d64578 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -456,7 +456,7 @@ test_expect_success 'git commit-graph verify' ' cd "$TRASH_DIRECTORY/full" && git rev-parse commits/8 | git -c commitGraph.generationVersion=1 commit-graph write --stdin-commits && git commit-graph verify >output && - graph_read_expect 9 extra_edges + graph_read_expect 9 extra_edges 1 ' NUM_COMMITS=9 diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh index 778fa418de..669ddc645f 100755 --- a/t/t5324-split-commit-graph.sh +++ b/t/t5324-split-commit-graph.sh @@ -30,11 +30,16 @@ graph_read_expect() { then NUM_BASE=$2 fi + OPTIONS= + if test -z "$3" + then + OPTIONS=" read_generation_data" + fi cat >expect <<- EOF header: 43475048 1 $(test_oid oid_version) 4 $NUM_BASE num_commits: $1 chunks: oid_fanout oid_lookup commit_metadata generation_data - options: + options:$OPTIONS EOF test-tool read-graph >output && test_cmp expect output @@ -624,7 +629,7 @@ test_expect_success 'write generation data chunk if topmost remaining layer has header: 43475048 1 $(test_oid oid_version) 5 1 num_commits: $(($NUM_SECOND_LAYER_COMMITS + $NUM_THIRD_LAYER_COMMITS + $NUM_FOURTH_LAYER_COMMITS + $NUM_FIFTH_LAYER_COMMITS)) chunks: oid_fanout oid_lookup commit_metadata generation_data - options: + options: read_generation_data EOF test_cmp expect output ) -- cgit v1.2.3 From c8d67b9a682e2a52beafab60f057ac317e35358d Mon Sep 17 00:00:00 2001 From: Derrick Stolee Date: Tue, 1 Mar 2022 19:48:32 +0000 Subject: commit-graph: fix generation number v2 overflow values The Generation Data Chunk was implemented and tested in e8b63005c (commit-graph: implement generation data chunk, 2021-01-16), but the test was carefully constructed to work on systems with 32-bit dates. Since the corrected commit date offsets still required more than 31 bits, this triggered writing the generation_data_overflow chunk. However, upon closer look, the write_graph_chunk_generation_data_overflow() method writes the offsets to the chunk (as dictated by the format) but fill_commit_graph_info() treats the value in the chunk as if it is the full corrected commit date (not an offset). For some reason, this does not cause an issue when using the FUTURE_DATE specified in t5318-commit-graph.sh, but it does show up as a failure in 'git commit-graph verify' if we increase that FUTURE_DATE to be above four billion. Fix this error and create a 64-bit timestamp version of the test so we can test these larger values. Signed-off-by: Derrick Stolee Signed-off-by: Junio C Hamano --- commit-graph.c | 2 +- t/t5328-commit-graph-64bit-time.sh | 27 +++++++++++++++++++++++++++ 2 files changed, 28 insertions(+), 1 deletion(-) (limited to 'commit-graph.c') diff --git a/commit-graph.c b/commit-graph.c index 8e52bb0955..b86a6a634f 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -806,7 +806,7 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, die(_("commit-graph requires overflow generation data but has none")); offset_pos = offset ^ CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW; - graph_data->generation = get_be64(g->chunk_generation_data_overflow + 8 * offset_pos); + graph_data->generation = item->date + get_be64(g->chunk_generation_data_overflow + 8 * offset_pos); } else graph_data->generation = item->date + offset; } else diff --git a/t/t5328-commit-graph-64bit-time.sh b/t/t5328-commit-graph-64bit-time.sh index 28114bcaf4..093f0c067a 100755 --- a/t/t5328-commit-graph-64bit-time.sh +++ b/t/t5328-commit-graph-64bit-time.sh @@ -36,4 +36,31 @@ test_expect_success 'lower layers have overflow chunk' ' graph_git_behavior 'overflow' '' HEAD~2 HEAD +test_expect_success 'set up and verify repo with generation data overflow chunk' ' + mkdir repo && + cd repo && + git init && + test_commit --date "$UNIX_EPOCH_ZERO" 1 && + test_commit 2 && + test_commit --date "$UNIX_EPOCH_ZERO" 3 && + git commit-graph write --reachable && + graph_read_expect 3 generation_data && + test_commit --date "$FUTURE_DATE" 4 && + test_commit 5 && + test_commit --date "$UNIX_EPOCH_ZERO" 6 && + git branch left && + git reset --hard 3 && + test_commit 7 && + test_commit --date "$FUTURE_DATE" 8 && + test_commit 9 && + git branch right && + git reset --hard 3 && + test_merge M left right && + git commit-graph write --reachable && + graph_read_expect 10 "generation_data generation_data_overflow" && + git commit-graph verify +' + +graph_git_behavior 'overflow 2' repo left right + test_done -- cgit v1.2.3 From 6dbf4b8172ef9edd50bdf9ca2da4681ba9153f75 Mon Sep 17 00:00:00 2001 From: Derrick Stolee Date: Wed, 2 Mar 2022 09:45:13 -0500 Subject: commit-graph: declare bankruptcy on GDAT chunks The Generation Data (GDAT) and Generation Data Overflow (GDOV) chunks store corrected commit date offsets, used for generation number v2. Recent changes have demonstrated that previous versions of Git were incorrectly parsing data from these chunks, but might have also been writing them incorrectly. I asserted [1] that the previous fixes were sufficient because the known reasons for incorrectly writing generation number v2 data relied on parsing the information incorrectly out of a commit-graph file, but the previous versions of Git were not reading the generation number v2 data. However, Patrick demonstrated [2] a case where in split commit-graphs across an alternate boundary (and possibly some other special conditions) it was possible to have a commit-graph that was generated by a previous version of Git have incorrect generation number v2 data which results in errors like the following: commit-graph generation for commit is 1623273624 < 1623273710 [1] https://lore.kernel.org/git/f50e74f0-9ffa-f4f2-4663-269801495ed3@github.com/ [2] https://lore.kernel.org/git/Yh93vOkt2DkrGPh2@ncase/ Clearly, there is something else going on. The situation is not completely understood, but the errors do not reproduce if the commit-graphs are all generated by a Git version including these recent fixes. If we cannot trust the existing data in the GDAT and GDOV chunks, then we can alter the format to change the chunk IDs for these chunks. This causes the new version of Git to silently ignore the older chunks (and disabling generation number v2 in the process) while writing new commit-graph files with correct data in the GDA2 and GDO2 chunks. Update commit-graph-format.txt including a historical note about these deprecated chunks. Reported-by: Patrick Steinhardt Signed-off-by: Derrick Stolee Signed-off-by: Junio C Hamano --- Documentation/technical/commit-graph-format.txt | 12 ++++++++++-- commit-graph.c | 4 ++-- 2 files changed, 12 insertions(+), 4 deletions(-) (limited to 'commit-graph.c') diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt index 87971c27dd..484b185ba9 100644 --- a/Documentation/technical/commit-graph-format.txt +++ b/Documentation/technical/commit-graph-format.txt @@ -93,7 +93,7 @@ CHUNK DATA: 2 bits of the lowest byte, storing the 33rd and 34th bit of the commit time. - Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional] + Generation Data (ID: {'G', 'D', 'A', '2' }) (N * 4 bytes) [Optional] * This list of 4-byte values store corrected commit date offsets for the commits, arranged in the same order as commit data chunk. * If the corrected commit date offset cannot be stored within 31 bits, @@ -104,7 +104,7 @@ CHUNK DATA: by compatible versions of Git and in case of split commit-graph chains, the topmost layer also has Generation Data chunk. - Generation Data Overflow (ID: {'G', 'D', 'O', 'V' }) [Optional] + Generation Data Overflow (ID: {'G', 'D', 'O', '2' }) [Optional] * This list of 8-byte values stores the corrected commit date offsets for commits with corrected commit date offsets that cannot be stored within 31 bits. @@ -156,3 +156,11 @@ CHUNK DATA: TRAILER: H-byte HASH-checksum of all of the above. + +== Historical Notes: + +The Generation Data (GDA2) and Generation Data Overflow (GDO2) chunks have +the number '2' in their chunk IDs because a previous version of Git wrote +possibly erroneous data in these chunks with the IDs "GDAT" and "GDOV". By +changing the IDs, newer versions of Git will silently ignore those older +chunks and write the new information without trusting the incorrect data. diff --git a/commit-graph.c b/commit-graph.c index b86a6a634f..fb2ced0bd6 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -39,8 +39,8 @@ void git_test_write_commit_graph_or_die(void) #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */ #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */ #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */ -#define GRAPH_CHUNKID_GENERATION_DATA 0x47444154 /* "GDAT" */ -#define GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW 0x47444f56 /* "GDOV" */ +#define GRAPH_CHUNKID_GENERATION_DATA 0x47444132 /* "GDA2" */ +#define GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW 0x47444f32 /* "GDO2" */ #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */ #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */ #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */ -- cgit v1.2.3