git/object.c, branch v2.38.4

Merge branch 'jk/fsck-on-diet' into maint-2.38

2022-10-26T00:11:33Z

"git fsck" failed to release contents of tree objects already used from the memory, which has been fixed. * jk/fsck-on-diet: parse_object_buffer(): respect save_commit_buffer fsck: turn off save_commit_buffer fsck: free tree buffers after walking unreachable objects

parse_object_buffer(): respect save_commit_buffer

2022-09-22T18:40:47Z

If the global variable "save_commit_buffer" is set to 0, then parse_commit() will throw away the commit object data after parsing it, rather than sticking it into a commit slab. This goes all the way back to 60ab26de99 ([PATCH] Avoid wasting memory in git-rev-list, 2005-09-15).

But there's another code path which may similarly stash the buffer: parse_object_buffer(). This is where we end up if we parse a commit via parse_object(), and it's used directly in a few other code paths like git-fsck.

The original goal of 60ab26de99 was avoiding extra memory usage for rev-list. And there it's not all that important to catch parse_object(). We use that function only for looking at the tips of the traversal, and the majority of the commits are parsed by following parent links, where we use parse_commit() directly. So we were wasting some memory, but only a small portion.

It's much easier to see the effect with fsck. Since we now turn off save_commit_buffer by default there, we _should_ be able to drop the freeing of the commit buffer in fsck_obj(). But if we do so (taking the first hunk of this patch without the rest), then the peak heap of "git fsck" in a clone of git.git goes from 136MB to 194MB. Teaching parse_object_buffer() to respect save_commit_buffer brings that down to 134.5MB (it's hard to tell from massif's output, but I suspect the savings comes from avoiding the overhead of the mostly-empty commit slab).

Other programs should see a small improvement. Both "rev-list --all" and "fsck --connectivity-only" improve by a few hundred kilobytes, as they'd avoid loading the tip objects of their traversals.

Most importantly, no code should be hurt by doing this. Any program that turns off save_commit_buffer is already making the assumption that any commit it sees may need to have its object data loaded on demand, as it doesn't know which ones were parsed by parse_commit() versus parse_object(). Not to mention that anything parsed by the commit graph may be in the same boat, even if save_commit_buffer was not disabled.

This should be the only spot that needs to be fixed. Grepping for set_commit_buffer() shows that this and parse_commit() are the only relevant calls.

Signed-off-by: Jeff King
Signed-off-by: Junio C Hamano
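The ownership rule this commit describes can be sketched in a few lines of C. This is a toy model, not the code in object.c: the struct toy_commit type and toy_parse_commit_buffer() are hypothetical names, and only the global save_commit_buffer is borrowed from the commit message. The idea shown is that the parser keeps ("eats") the buffer only when the global is set, and otherwise leaves it for the caller to free.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for the global named in the commit message. */
static int save_commit_buffer = 1;

/* Hypothetical, simplified commit: only the fields this sketch needs. */
struct toy_commit {
    int parsed;
    char *buffer; /* cached object data, or NULL if nothing is cached */
};

/*
 * Mark the commit as parsed (real parsing is omitted). Returns 1 if the
 * commit took ownership of "data", 0 if the caller still owns it and is
 * expected to free it.
 */
static int toy_parse_commit_buffer(struct toy_commit *c, char *data)
{
    assert(c && data);
    c->parsed = 1;
    if (save_commit_buffer && !c->buffer) {
        c->buffer = data; /* stash it for later readers */
        return 1;
    }
    return 0; /* respect save_commit_buffer == 0: do not retain */
}
```

With the global cleared, every parsed commit leaves its buffer with the caller, which is why a program that disables it must already be prepared to reload object data on demand.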

parse_object(): check commit-graph when skip_hash set

2022-09-07T19:27:02Z

If the caller told us that they don't care about us checking the object hash, then we're free to implement any optimizations that get us the parsed value more quickly. An obvious one is to check the commit graph before loading an object from disk. And in fact, both of the callers who pass in this flag are already doing so before they call parse_object()!

So we can simplify those callers, as well as any possible future ones, by moving the logic into parse_object().

There are two subtle things to note in the diff, but neither has any impact in practice:

  - it seems least-surprising here to do the graph lookup on the git-replace'd oid, rather than the original. This is in theory a change of behavior from the earlier code, as neither caller did a replace lookup itself. But in practice it doesn't matter, as we disable the commit graph entirely if there are any replace refs.

  - the caller in get_reference() passes the skip_hash flag only if revs->verify_objects isn't set, whereas it would look in the commit graph unconditionally. In practice this should not matter as we should disable the commit graph entirely when using verify_objects (and that was done recently in another patch).

So this should be a pure cleanup with no behavior change.

Signed-off-by: Jeff King
Signed-off-by: Junio C Hamano
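The lookup order described in this commit can be illustrated with a small model. Everything here is an assumption made for the sketch: objects are plain integers, the "commit graph" and "disk" are arrays, and the names graph_demo_repo, graph_demo_parse() and GRAPH_DEMO_SKIP_HASH are invented. It only shows the control flow: consult the graph first when the caller waived the hash check and the graph is enabled, otherwise fall through to the on-disk lookup.

```c
#include <assert.h>
#include <stddef.h>

enum graph_demo_flags { GRAPH_DEMO_SKIP_HASH = 1 << 0 };
enum graph_demo_source { DEMO_NOT_FOUND, DEMO_FROM_GRAPH, DEMO_FROM_DISK };

/* Hypothetical repository: two id sets standing in for graph and disk. */
struct graph_demo_repo {
    int graph_enabled; /* e.g. cleared when replace refs exist */
    const int *graph_ids;
    size_t graph_nr;
    const int *disk_ids;
    size_t disk_nr;
};

static int demo_contains(const int *ids, size_t nr, int id)
{
    size_t i;
    for (i = 0; i < nr; i++)
        if (ids[i] == id)
            return 1;
    return 0;
}

/* Report where the object would be loaded from. */
static enum graph_demo_source graph_demo_parse(const struct graph_demo_repo *r,
                                               int id, unsigned flags)
{
    /* Fast path: only allowed when the caller does not need the hash. */
    if ((flags & GRAPH_DEMO_SKIP_HASH) && r->graph_enabled &&
        demo_contains(r->graph_ids, r->graph_nr, id))
        return DEMO_FROM_GRAPH;
    if (demo_contains(r->disk_ids, r->disk_nr, id))
        return DEMO_FROM_DISK;
    return DEMO_NOT_FOUND;
}
```

Putting the check inside the parse function means callers that pass the flag no longer need their own graph lookup, which is the simplification the commit message argues for.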

parse_object(): allow skipping hash check

2022-09-07T19:18:57Z

The parse_object() function checks the object hash of any object it parses. This is a nice feature, as it means we may catch bit corruption during normal use, rather than waiting for specific fsck operations.

But it also can be slow. It's particularly noticeable for blobs, where except for the hash check, we could return without loading the object contents at all. Now one may wonder what is the point of calling parse_object() on a blob in the first place then, but usually it's not intentional: we were fed an oid from somewhere, don't know the type, and want an object struct. For commits and trees, the parsing is usually helpful; we're about to look at the contents anyway. But this is less true for blobs, where we may be collecting them as part of a reachability traversal, etc, and don't actually care what's in them. And blobs, of course, tend to be larger.

We don't want to just throw out the hash-checks for blobs, though. We do depend on them in some circumstances (e.g., rev-list --verify-objects uses parse_object() to check them). It's only the callers that know how they're going to use the result. And so we can help them by providing a special flag to skip the hash check.

We could just apply this to blobs, as they're going to be the main source of performance improvement. But if a caller doesn't care about checking the hash, we might as well skip it for other object types, too. Even though we can't avoid reading the object contents, we can still skip the actual hash computation.

If this seems like it is making Git a little bit less safe against corruption, it may be. But it's part of a series of tradeoffs we're already making. For instance, "rev-list --objects" does not open the contents of blobs it prints. And when a commit graph is present, we skip opening most commits entirely. The important thing will be to use this flag in cases where it's safe to skip the check. For instance, when serving a pack for a fetch, we know the client will fully index the objects and do a connectivity check itself. There's little to be gained from the server side re-hashing a blob itself. And indeed, most of the time we don't! The revision machinery won't open up a blob reached by traversal, but only one requested directly with a "want" line. So applied properly, this new feature shouldn't make anything less safe in practice.

Signed-off-by: Jeff King
Signed-off-by: Junio C Hamano
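A minimal sketch of the opt-out flag follows. It is not Git's implementation: the hash is FNV-1a standing in for the real object hash, and hashcheck_parse(), HASHCHECK_SKIP and the hashcheck_calls counter are names made up for this example. The point illustrated is that verification stays the default and only a caller that explicitly passes the flag avoids the hash computation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum hashcheck_flags { HASHCHECK_SKIP = 1 << 0 };

/* Counts hash computations so the saving is observable. */
static unsigned long hashcheck_calls;

/* FNV-1a, used here only as a cheap stand-in for the object hash. */
static uint32_t hashcheck_hash(const unsigned char *buf, size_t len)
{
    uint32_t h = 2166136261u;
    size_t i;
    hashcheck_calls++;
    for (i = 0; i < len; i++) {
        h ^= buf[i];
        h *= 16777619u;
    }
    return h;
}

/*
 * "Parse" an object whose name claims hash value "expected".
 * Returns 0 on success, -1 if the check ran and the hash did not match.
 */
static int hashcheck_parse(uint32_t expected, const unsigned char *buf,
                           size_t len, unsigned flags)
{
    if (!(flags & HASHCHECK_SKIP) && hashcheck_hash(buf, len) != expected)
        return -1;
    return 0; /* real parsing of buf is omitted */
}
```

Note that with the flag set a corrupted buffer goes unnoticed in this model, which mirrors the tradeoff the message discusses: the flag is only appropriate where some other party will verify the data.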

object-file API: have hash_object_file() take "enum object_type"

2022-02-26T01:16:32Z

Change the hash_object_file() function to take an "enum object_type".

As of a preceding commit, all of its callers pass either "{commit,tree,blob,tag}_type" or the result of a call to type_name(); the parse_object() caller that would pass NULL now uses stream_object_signature() instead.

Signed-off-by: Ævar Arnfjörð Bjarmason
Signed-off-by: Junio C Hamano
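To illustrate why an enum parameter is attractive here, the sketch below maps a typed object kind to its name and formats the "<type> <size>" header that precedes object contents when they are hashed. The enum typed_kind, typed_name() and typed_format_header() are hypothetical names for this example, not the functions in object-file.c. Passing an enum lets the callee reject an unknown type in one place instead of trusting an arbitrary string.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical enum for the four standard object kinds. */
enum typed_kind { TYPED_COMMIT = 1, TYPED_TREE, TYPED_BLOB, TYPED_TAG };

/* Map a kind to its name, or NULL for a value outside the enum. */
static const char *typed_name(enum typed_kind t)
{
    switch (t) {
    case TYPED_COMMIT: return "commit";
    case TYPED_TREE:   return "tree";
    case TYPED_BLOB:   return "blob";
    case TYPED_TAG:    return "tag";
    }
    return NULL;
}

/*
 * Write "<type> <size>" plus the terminating NUL into hdr.
 * Returns the header length including the NUL, or -1 on an unknown
 * type or a buffer that is too small.
 */
static int typed_format_header(char *hdr, size_t hdrlen,
                               enum typed_kind t, unsigned long size)
{
    const char *name = typed_name(t);
    int n;
    if (!name)
        return -1;
    n = snprintf(hdr, hdrlen, "%s %lu", name, size);
    if (n < 0 || (size_t)n >= hdrlen)
        return -1;
    return n + 1;
}
```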

object-file API: split up and simplify check_object_signature()

2022-02-26T01:16:31Z

Split up the check_object_signature() function into a non-streaming version (which accepts an already filled "buf"), and a new stream_object_signature() which will retrieve the object from storage, and hash it on-the-fly.

All of the callers of check_object_signature() were effectively calling two different functions, if we go by cyclomatic complexity. I.e. they'd either take the early "if (map)" branch and return early, or not. This has been the case since the "if (map)" condition was added in 090ea12671b (parse_object: avoid putting whole blob in core, 2012-03-07).

We can then further simplify the resulting check_object_signature() function since only one caller wanted to pass a non-NULL "buf" and a non-NULL "real_oidp". That "read_loose_object()" codepath used by "git fsck" can instead use hash_object_file() followed by oideq().

Signed-off-by: Ævar Arnfjörð Bjarmason
Signed-off-by: Junio C Hamano
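The split described above, one function for data already in memory and another that hashes while reading, can be sketched as follows. This is an illustration under stated assumptions: FNV-1a replaces the real hash, and split_check_buffer(), split_check_stream() and the in-memory reader are invented names, not the object-file.c API. Both entry points share the same incremental hash state, so they agree on the result while the streaming one never holds more than a small chunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Incremental FNV-1a state, a stand-in for the real hash context. */
struct split_ctx { uint32_t h; };

static void split_init(struct split_ctx *c) { c->h = 2166136261u; }

static void split_update(struct split_ctx *c, const unsigned char *p, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        c->h ^= p[i];
        c->h *= 16777619u;
    }
}

/* Non-streaming check: the whole object is already in "buf". */
static int split_check_buffer(uint32_t expected, const unsigned char *buf,
                              size_t len)
{
    struct split_ctx c;
    split_init(&c);
    split_update(&c, buf, len);
    return c.h == expected ? 0 : -1;
}

/* Reader callback: copy up to "max" bytes into out, return bytes read. */
typedef size_t (*split_read_fn)(void *src, unsigned char *out, size_t max);

/* Streaming check: hash chunk by chunk without holding the object. */
static int split_check_stream(uint32_t expected, split_read_fn rd, void *src)
{
    struct split_ctx c;
    unsigned char chunk[4]; /* tiny on purpose, to force several reads */
    size_t n;
    split_init(&c);
    while ((n = rd(src, chunk, sizeof(chunk))) > 0)
        split_update(&c, chunk, n);
    return c.h == expected ? 0 : -1;
}

/* In-memory source used only to demonstrate the streaming path. */
struct split_mem { const unsigned char *p; size_t left; };

static size_t split_mem_read(void *src, unsigned char *out, size_t max)
{
    struct split_mem *s = src;
    size_t n = s->left < max ? s->left : max;
    memcpy(out, s->p, n);
    s->p += n;
    s->left -= n;
    return n;
}
```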

Merge branch 'ns/tmp-objdir'

2022-01-04T00:24:15Z

New interface into the tmp-objdir API to help in-core use of the quarantine feature.

* ns/tmp-objdir:
  tmp-objdir: disable ref updates when replacing the primary odb
  tmp-objdir: new API for creating temporary writable databases

tmp-objdir: new API for creating temporary writable databases

2021-12-08T22:06:36Z

The tmp_objdir API provides the ability to create temporary object directories, but was designed with the goal of having subprocesses access these object stores, followed by the main process migrating objects from it to the main object store or just deleting it. The subprocesses would view it as their primary datastore and write to it.

Here we add the tmp_objdir_replace_primary_odb function that replaces the current process's writable "main" object directory with the specified one. The previous main object directory is restored in either tmp_objdir_migrate or tmp_objdir_destroy.

For the --remerge-diff use case, add a new `will_destroy` flag in `struct object_database` to mark ephemeral object databases that do not require fsync durability.

Add 'git prune' support for removing temporary object databases, and make sure that they have a name starting with tmp_ and containing an operation-specific name.

Based-on-patch-by: Elijah Newren
Signed-off-by: Neeraj Singh
Reviewed-by: Elijah Newren
Signed-off-by: Junio C Hamano
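The swap-and-restore behavior described in this commit can be modeled in a few lines. The types and functions below (swap_odb, swap_repo, swap_tmp, swap_replace_primary(), swap_restore_primary(), swap_needs_fsync()) are hypothetical simplifications, as is the directory name used in the usage comment; only the will_destroy flag name comes from the message. The sketch shows the temporary database remembering the previous primary so it can be put back, and the flag marking data that does not need durable writes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical object database: a path plus the ephemeral marker. */
struct swap_odb {
    const char *path;
    int will_destroy; /* set for databases that will be thrown away */
};

struct swap_repo { struct swap_odb *primary; };

struct swap_tmp {
    struct swap_odb odb;
    struct swap_odb *prev_primary; /* non-NULL while installed */
};

/* Make the temporary database the repository's writable primary. */
static void swap_replace_primary(struct swap_repo *r, struct swap_tmp *t,
                                 int will_destroy)
{
    assert(!t->prev_primary); /* must not already be installed */
    t->odb.will_destroy = will_destroy;
    t->prev_primary = r->primary;
    r->primary = &t->odb;
}

/* Undo the swap; in this model both migrate and destroy would call it. */
static void swap_restore_primary(struct swap_repo *r, struct swap_tmp *t)
{
    if (!t->prev_primary)
        return;
    r->primary = t->prev_primary;
    t->prev_primary = NULL;
}

/* Ephemeral databases can skip the cost of durable writes. */
static int swap_needs_fsync(const struct swap_odb *odb)
{
    return !odb->will_destroy;
}

/*
 * Usage, with made-up paths:
 *   struct swap_odb main_odb = { ".git/objects", 0 };
 *   struct swap_repo repo = { &main_odb };
 *   struct swap_tmp tmp = { { "tmp_objdir-example", 0 }, NULL };
 *   swap_replace_primary(&repo, &tmp, 1);  // writes now go to tmp
 *   swap_restore_primary(&repo, &tmp);     // main_odb is primary again
 */
```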

object.c: use BUG(...) not die("BUG: ...") in lookup_object_by_type()

2021-12-07T20:33:58Z

Adjust code added in 7463064b280 (object.h: add lookup_object_by_type() function, 2021-06-22) to use the BUG() function.

Signed-off-by: Junio C Hamano
Signed-off-by: Ævar Arnfjörð Bjarmason
Signed-off-by: Junio C Hamano

Merge branch 'ab/fsck-unexpected-type'

2021-10-25T23:06:56Z

"git fsck" has been taught to report mismatch between expected and actual types of an object better. * ab/fsck-unexpected-type: fsck: report invalid object type-path combinations fsck: don't hard die on invalid object types object-file.c: stop dying in parse_loose_header() object-file.c: return ULHR_TOO_LONG on "header too long" object-file.c: use "enum" return type for unpack_loose_header() object-file.c: simplify unpack_loose_short_header() object-file.c: make parse_loose_header_extended() public object-file.c: return -1, not "status" from unpack_loose_header() object-file.c: don't set "typep" when returning non-zero cat-file tests: test for current --allow-unknown-type behavior cat-file tests: add corrupt loose object test cat-file tests: test for missing/bogus object with -t, -s and -p cat-file tests: move bogus_* variable declarations earlier fsck tests: test for garbage appended to a loose object fsck tests: test current hash/type mismatch behavior fsck tests: refactor one test to use a sub-repo fsck tests: add test for fsck-ing an unknown type