diff options
| author | Derrick Stolee <dstolee@microsoft.com> | 2020-05-11 11:56:12 +0000 |
|---|---|---|
| committer | Junio C Hamano <gitster@pobox.com> | 2020-05-11 09:33:56 -0700 |
| commit | 65c1a28bb60d1c17a672306c651bb402378f81b0 (patch) | |
| tree | 4c8077d3df7884329fa1b1ab925d24e30bffdbe2 | |
| parent | Documentation: changed-path Bloom filters use byte words (diff) | |
| download | git-65c1a28bb60d1c17a672306c651bb402378f81b0.tar.gz git-65c1a28bb60d1c17a672306c651bb402378f81b0.zip | |
bloom: de-duplicate directory entries
When computing a changed-path Bloom filter, we need to take the
files that changed from the diff computation and extract the parent
directories. That way, a directory pathspec such as "Documentation"
could match commits that change "Documentation/git.txt".
However, the current code does a poor job of this process. The paths
are added to a hashmap, but we do not check if an entry already
exists with that path. This can create many duplicate entries and
cause the filter to have a much larger length than it should. This
means that the filter is more sparse than intended, which helps the
false positive rate, but wastes a lot of space.
Properly use hashmap_get() before hashmap_add(). Also be sure to
include a comparison function so these can be matched correctly.
This has an effect on a test in t0095-bloom.sh. This makes sense,
there are ten changes inside "smallDir" so the total number of
paths in the filter should be 11. This would result in 11 * 10 bits
required, and with 8 bits per byte, this results in 14 bytes.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
| -rw-r--r-- | bloom.c | 35 | ||||
| -rwxr-xr-x | t/t0095-bloom.sh | 4 |
2 files changed, 28 insertions, 11 deletions
@@ -158,6 +158,19 @@ void init_bloom_filters(void) init_bloom_filter_slab(&bloom_filters); } +static int pathmap_cmp(const void *hashmap_cmp_fn_data, + const struct hashmap_entry *eptr, + const struct hashmap_entry *entry_or_key, + const void *keydata) +{ + const struct pathmap_hash_entry *e1, *e2; + + e1 = container_of(eptr, const struct pathmap_hash_entry, entry); + e2 = container_of(entry_or_key, const struct pathmap_hash_entry, entry); + + return strcmp(e1->path, e2->path); +} + struct bloom_filter *get_bloom_filter(struct repository *r, struct commit *c, int compute_if_not_present) @@ -206,25 +219,29 @@ struct bloom_filter *get_bloom_filter(struct repository *r, struct hashmap pathmap; struct pathmap_hash_entry *e; struct hashmap_iter iter; - hashmap_init(&pathmap, NULL, NULL, 0); + hashmap_init(&pathmap, pathmap_cmp, NULL, 0); for (i = 0; i < diff_queued_diff.nr; i++) { const char *path = diff_queued_diff.queue[i]->two->path; /* - * Add each leading directory of the changed file, i.e. for - * 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so - * the Bloom filter could be used to speed up commands like - * 'git log dir/subdir', too. - * - * Note that directories are added without the trailing '/'. - */ + * Add each leading directory of the changed file, i.e. for + * 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so + * the Bloom filter could be used to speed up commands like + * 'git log dir/subdir', too. + * + * Note that directories are added without the trailing '/'. + */ do { char *last_slash = strrchr(path, '/'); FLEX_ALLOC_STR(e, path, path); hashmap_entry_init(&e->entry, strhash(path)); - hashmap_add(&pathmap, &e->entry); + + if (!hashmap_get(&pathmap, &e->entry, NULL)) + hashmap_add(&pathmap, &e->entry); + else + free(e); if (!last_slash) last_slash = (char*)path; diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh index 8f9eef116d..6defeb544f 100755 --- a/t/t0095-bloom.sh +++ b/t/t0095-bloom.sh @@ -89,8 +89,8 @@ test_expect_success 'get bloom filter for commit with 10 changes' ' git add smallDir && git commit -m "commit with 10 changes" && cat >expect <<-\EOF && - Filter_Length:25 - Filter_Data:82|a0|65|47|0c|92|90|c0|a1|40|02|a0|e2|40|e0|04|0a|9a|66|cf|80|19|85|42|23| + Filter_Length:14 + Filter_Data:02|b3|c4|a0|34|e7|fe|eb|cb|47|fe|a0|e8|72| EOF test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual && test_cmp expect actual |
