<feed xmlns='http://www.w3.org/2005/Atom'>
<title>git/object.c, branch v2.38.4</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/git/git.git/
</subtitle>
<id>https://www.git.shady.money/git/atom?h=v2.38.4</id>
<link rel='self' href='https://www.git.shady.money/git/atom?h=v2.38.4'/>
<link rel='alternate' type='text/html' href='https://www.git.shady.money/git/'/>
<updated>2022-10-26T00:11:33Z</updated>
<entry>
<title>Merge branch 'jk/fsck-on-diet' into maint-2.38</title>
<updated>2022-10-26T00:11:33Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2022-10-26T00:11:33Z</published>
<link rel='alternate' type='text/html' href='https://www.git.shady.money/git/commit/?id=cf96b393d62df78fe1b19e599a026d2eef303828'/>
<id>urn:sha1:cf96b393d62df78fe1b19e599a026d2eef303828</id>
<content type='text'>
"git fsck" failed to release contents of tree objects already used
from the memory, which has been fixed.

* jk/fsck-on-diet:
  parse_object_buffer(): respect save_commit_buffer
  fsck: turn off save_commit_buffer
  fsck: free tree buffers after walking unreachable objects
</content>
</entry>
<entry>
<title>parse_object_buffer(): respect save_commit_buffer</title>
<updated>2022-09-22T18:40:47Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2022-09-22T10:15:38Z</published>
<link rel='alternate' type='text/html' href='https://www.git.shady.money/git/commit/?id=51b27747e5ba939942fe0e1f9a61e86e0ead19ed'/>
<id>urn:sha1:51b27747e5ba939942fe0e1f9a61e86e0ead19ed</id>
<content type='text'>
If the global variable "save_commit_buffer" is set to 0, then
parse_commit() will throw away the commit object data after parsing it,
rather than sticking it into a commit slab. This goes all the way back
to 60ab26de99 ([PATCH] Avoid wasting memory in git-rev-list,
2005-09-15).

But there's another code path which may similarly stash the buffer:
parse_object_buffer(). This is where we end up if we parse a commit via
parse_object(), and it's used directly in a few other code paths like
git-fsck.

The original goal of 60ab26de99 was avoiding extra memory usage for
rev-list. And there it's not all that important to catch parse_object().
We use that function only for looking at the tips of the traversal, and
the majority of the commits are parsed by following parent links, where
we use parse_commit() directly. So we were wasting some memory, but only
a small portion.

It's much easier to see the effect with fsck. Since we now turn off
save_commit_buffer by default there, we _should_ be able to drop the
freeing of the commit buffer in fsck_obj(). But if we do so (taking the
first hunk of this patch without the rest), then the peak heap of "git
fsck" in a clone of git.git goes from 136MB to 194MB. Teaching
parse_object_buffer() to respect save_commit_buffer brings that down to
134.5MB (it's hard to tell from massif's output, but I suspect the
savings comes from avoiding the overhead of the mostly-empty commit
slab).

Other programs should see a small improvement. Both "rev-list --all" and
"fsck --connectivity-only" improve by a few hundred kilobytes, as they'd
avoid loading the tip objects of their traversals.

Most importantly, no code should be hurt by doing this. Any program that
turns off save_commit_buffer is already making the assumption that any
commit it sees may need to have its object data loaded on demand, as it
doesn't know which ones were parsed by parse_commit() versus
parse_object(). Not to mention that anything parsed by the commit graph
may be in the same boat, even if save_commit_buffer was not disabled.

This should be the only spot that needs to be fixed. Grepping for
set_commit_buffer() shows that this and parse_commit() are the only
relevant calls.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>parse_object(): check commit-graph when skip_hash set</title>
<updated>2022-09-07T19:27:02Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2022-09-06T23:06:25Z</published>
<link rel='alternate' type='text/html' href='https://www.git.shady.money/git/commit/?id=9a8c3c4a5f94aaed54ae26e4796c08ca9a1cbfbc'/>
<id>urn:sha1:9a8c3c4a5f94aaed54ae26e4796c08ca9a1cbfbc</id>
<content type='text'>
If the caller told us that they don't care about us checking the object
hash, then we're free to implement any optimizations that get us the
parsed value more quickly. An obvious one is to check the commit graph
before loading an object from disk. And in fact, both of the callers who
pass in this flag are already doing so before they call parse_object()!

So we can simplify those callers, as well as any possible future ones,
by moving the logic into parse_object().

There are two subtle things to note in the diff, but neither has any
impact in practice:

  - it seems least-surprising here to do the graph lookup on the
    git-replace'd oid, rather than the original. This is in theory a
    change of behavior from the earlier code, as neither caller did a
    replace lookup itself. But in practice it doesn't matter, as we
    disable the commit graph entirely if there are any replace refs.

  - the caller in get_reference() passes the skip_hash flag only if
    revs-&gt;verify_objects isn't set, whereas it would look in the commit
    graph unconditionally. In practice this should not matter as we
    should disable the commit graph entirely when using verify_objects
    (and that was done recently in another patch).

So this should be a pure cleanup with no behavior change.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>parse_object(): allow skipping hash check</title>
<updated>2022-09-07T19:18:57Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2022-09-06T23:01:34Z</published>
<link rel='alternate' type='text/html' href='https://www.git.shady.money/git/commit/?id=c868d8e91fee01999eb34c5559250ea059293b33'/>
<id>urn:sha1:c868d8e91fee01999eb34c5559250ea059293b33</id>
<content type='text'>
The parse_object() function checks the object hash of any object it
parses. This is a nice feature, as it means we may catch bit corruption
during normal use, rather than waiting for specific fsck operations.

But it also can be slow. It's particularly noticeable for blobs, where
except for the hash check, we could return without loading the object
contents at all. Now one may wonder what is the point of calling
parse_object() on a blob in the first place then, but usually it's not
intentional: we were fed an oid from somewhere, don't know the type, and
want an object struct. For commits and trees, the parsing is usually
helpful; we're about to look at the contents anyway. But this is less
true for blobs, where we may be collecting them as part of a
reachability traversal, etc, and don't actually care what's in them. And
blobs, of course, tend to be larger.

We don't want to just throw out the hash-checks for blobs, though. We do
depend on them in some circumstances (e.g., rev-list --verify-objects
uses parse_object() to check them). It's only the callers that know
how they're going to use the result. And so we can help them by
providing a special flag to skip the hash check.

We could just apply this to blobs, as they're going to be the main
source of performance improvement. But if a caller doesn't care about
checking the hash, we might as well skip it for other object types, too.
Even though we can't avoid reading the object contents, we can still
skip the actual hash computation.

If this seems like it is making Git a little bit less safe against
corruption, it may be. But it's part of a series of tradeoffs we're
already making. For instance, "rev-list --objects" does not open the
contents of blobs it prints. And when a commit graph is present, we skip
opening most commits entirely. The important thing will be to use this
flag in cases where it's safe to skip the check. For instance, when
serving a pack for a fetch, we know the client will fully index the
objects and do a connectivity check itself. There's little to be gained
from the server side re-hashing a blob itself. And indeed, most of the
time we don't! The revision machinery won't open up a blob reached by
traversal, but only one requested directly with a "want" line. So
applied properly, this new feature shouldn't make anything less safe in
practice.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>object-file API: have hash_object_file() take "enum object_type"</title>
<updated>2022-02-26T01:16:32Z</updated>
<author>
<name>Ævar Arnfjörð Bjarmason</name>
<email>avarab@gmail.com</email>
</author>
<published>2022-02-04T23:48:32Z</published>
<link rel='alternate' type='text/html' href='https://www.git.shady.money/git/commit/?id=44439c1c5827480f68b37c3cc38f257eaeb3ed2c'/>
<id>urn:sha1:44439c1c5827480f68b37c3cc38f257eaeb3ed2c</id>
<content type='text'>
Change the hash_object_file() function to take an "enum
object_type".

Since a preceding commit all of its callers are passing either
"{commit,tree,blob,tag}_type", or the result of a call to type_name(),
the parse_object() caller that would pass NULL is now using
stream_object_signature().

Signed-off-by: Ævar Arnfjörð Bjarmason &lt;avarab@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>object-file API: split up and simplify check_object_signature()</title>
<updated>2022-02-26T01:16:31Z</updated>
<author>
<name>Ævar Arnfjörð Bjarmason</name>
<email>avarab@gmail.com</email>
</author>
<published>2022-02-04T23:48:30Z</published>
<link rel='alternate' type='text/html' href='https://www.git.shady.money/git/commit/?id=0f156dbb04b434d95ce5465e6b07d8869d55e8e0'/>
<id>urn:sha1:0f156dbb04b434d95ce5465e6b07d8869d55e8e0</id>
<content type='text'>
Split up the check_object_signature() function into that non-streaming
version (it accepts an already filled "buf"), and a new
stream_object_signature() which will retrieve the object from storage,
and hash it on-the-fly.

All of the callers of check_object_signature() were effectively
calling two different functions, if we go by cyclomatic
complexity. I.e. they'd either take the early "if (map)" branch and
return early, or not. This has been the case since the "if (map)"
condition was added in 090ea12671b (parse_object: avoid putting whole
blob in core, 2012-03-07).

We can then further simplify the resulting check_object_signature()
function since only one caller wanted to pass a non-NULL "buf" and a
non-NULL "real_oidp". That "read_loose_object()" codepath used by "git
fsck" can instead use hash_object_file() followed by oideq().

Signed-off-by: Ævar Arnfjörð Bjarmason &lt;avarab@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Merge branch 'ns/tmp-objdir'</title>
<updated>2022-01-04T00:24:15Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2022-01-04T00:24:14Z</published>
<link rel='alternate' type='text/html' href='https://www.git.shady.money/git/commit/?id=0dc90d954db852c6604796b8d817365f94e92a16'/>
<id>urn:sha1:0dc90d954db852c6604796b8d817365f94e92a16</id>
<content type='text'>
New interface into the tmp-objdir API to help in-core use of the
quarantine feature.

* ns/tmp-objdir:
  tmp-objdir: disable ref updates when replacing the primary odb
  tmp-objdir: new API for creating temporary writable databases
</content>
</entry>
<entry>
<title>tmp-objdir: new API for creating temporary writable databases</title>
<updated>2021-12-08T22:06:36Z</updated>
<author>
<name>Neeraj Singh</name>
<email>neerajsi@microsoft.com</email>
</author>
<published>2021-12-06T22:05:04Z</published>
<link rel='alternate' type='text/html' href='https://www.git.shady.money/git/commit/?id=b3cecf49eac00d62e361bf6e6e81392f5a2fb571'/>
<id>urn:sha1:b3cecf49eac00d62e361bf6e6e81392f5a2fb571</id>
<content type='text'>
The tmp_objdir API provides the ability to create temporary object
directories, but was designed with the goal of having subprocesses
access these object stores, followed by the main process migrating
objects from it to the main object store or just deleting it.  The
subprocesses would view it as their primary datastore and write to it.

Here we add the tmp_objdir_replace_primary_odb function that replaces
the current process's writable "main" object directory with the
specified one. The previous main object directory is restored in either
tmp_objdir_migrate or tmp_objdir_destroy.

For the --remerge-diff usecase, add a new `will_destroy` flag in `struct
object_database` to mark ephemeral object databases that do not require
fsync durability.

Add 'git prune' support for removing temporary object databases, and
make sure that they have a name starting with tmp_ and containing an
operation-specific name.

Based-on-patch-by: Elijah Newren &lt;newren@gmail.com&gt;

Signed-off-by: Neeraj Singh &lt;neerajsi@microsoft.com&gt;
Reviewed-by: Elijah Newren &lt;newren@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>object.c: use BUG(...) no die("BUG: ...") in lookup_object_by_type()</title>
<updated>2021-12-07T20:33:58Z</updated>
<author>
<name>Ævar Arnfjörð Bjarmason</name>
<email>avarab@gmail.com</email>
</author>
<published>2021-12-07T11:05:54Z</published>
<link rel='alternate' type='text/html' href='https://www.git.shady.money/git/commit/?id=eafd6e7e5585ecf7e0d4bd542d7565fea008795a'/>
<id>urn:sha1:eafd6e7e5585ecf7e0d4bd542d7565fea008795a</id>
<content type='text'>
Adjust code added in 7463064b280 (object.h: add
lookup_object_by_type() function, 2021-06-22) to use the BUG()
function.

Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
Signed-off-by: Ævar Arnfjörð Bjarmason &lt;avarab@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Merge branch 'ab/fsck-unexpected-type'</title>
<updated>2021-10-25T23:06:56Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2021-10-25T23:06:56Z</published>
<link rel='alternate' type='text/html' href='https://www.git.shady.money/git/commit/?id=061a21d36d807bdcf996f388d5e487d5e1993bbc'/>
<id>urn:sha1:061a21d36d807bdcf996f388d5e487d5e1993bbc</id>
<content type='text'>
"git fsck" has been taught to report mismatch between expected and
actual types of an object better.

* ab/fsck-unexpected-type:
  fsck: report invalid object type-path combinations
  fsck: don't hard die on invalid object types
  object-file.c: stop dying in parse_loose_header()
  object-file.c: return ULHR_TOO_LONG on "header too long"
  object-file.c: use "enum" return type for unpack_loose_header()
  object-file.c: simplify unpack_loose_short_header()
  object-file.c: make parse_loose_header_extended() public
  object-file.c: return -1, not "status" from unpack_loose_header()
  object-file.c: don't set "typep" when returning non-zero
  cat-file tests: test for current --allow-unknown-type behavior
  cat-file tests: add corrupt loose object test
  cat-file tests: test for missing/bogus object with -t, -s and -p
  cat-file tests: move bogus_* variable declarations earlier
  fsck tests: test for garbage appended to a loose object
  fsck tests: test current hash/type mismatch behavior
  fsck tests: refactor one test to use a sub-repo
  fsck tests: add test for fsck-ing an unknown type
</content>
</entry>
</feed>
