<feed xmlns='http://www.w3.org/2005/Atom'>
<title>git/diff.c, branch v2.28.1</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/git/git.git/
</subtitle>
<id>https://www.git.shady.money/git/atom?h=v2.28.1</id>
<link rel='self' href='https://www.git.shady.money/git/atom?h=v2.28.1'/>
<link rel='alternate' type='text/html' href='https://www.git.shady.money/git/'/>
<updated>2020-06-18T04:54:00Z</updated>
<entry>
<title>Merge branch 'jk/diff-memuse-optim-with-stat-unmatch'</title>
<updated>2020-06-18T04:54:00Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2020-06-18T04:54:00Z</published>
<link rel='alternate' type='text/html' href='https://www.git.shady.money/git/commit/?id=0cd0afc9c6193dd4a762f5a2d1bc23e58629b23f'/>
<id>urn:sha1:0cd0afc9c6193dd4a762f5a2d1bc23e58629b23f</id>
<content type='text'>
Reduce memory usage during "diff --quiet" in a worktree with too
many stat-unmatched paths.

* jk/diff-memuse-optim-with-stat-unmatch:
  diff: discard blob data from stat-unmatched pairs
</content>
</entry>
<entry>
<title>diff: discard blob data from stat-unmatched pairs</title>
<updated>2020-06-02T16:28:56Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2020-06-01T20:22:18Z</published>
<link rel='alternate' type='text/html' href='https://www.git.shady.money/git/commit/?id=d2d7fbe12945dd9032c9c7c14a4563dbecc7e629'/>
<id>urn:sha1:d2d7fbe12945dd9032c9c7c14a4563dbecc7e629</id>
<content type='text'>
When performing a tree-level diff against the working tree, we may find
that our index stat information is dirty, so we queue a filepair to be
examined later. If the actual content hasn't changed, we call this a
stat-unmatch; the stat information was out of date, but there's no
actual diff.  Normally diffcore_std() would detect and remove these
identical filepairs via diffcore_skip_stat_unmatch().  However, when
"--quiet" is used, we want to stop the diff as soon as we see any
changes, so we check for stat-unmatches immediately in diff_change().

That check may require us to actually load the file contents into the
pair of diff_filespecs. If we find that the pair isn't a stat-unmatch,
then no big deal; we'd likely load the contents later anyway to generate
a patch, do rename detection, etc, so we want to hold on to it. But if
it is a stat-unmatch, then we have no more use for that data; the whole
point is that we're going discard the pair. However, we never free the
allocated diff_filespec data.

In most cases, keeping that data isn't a problem. We don't expect a lot
of stat-unmatch entries, and since we're using --quiet, we'd quit as
soon as we saw such a real change anyway. However, there are extreme
cases where it makes a big difference:

  1. We'd generally mmap() the working tree half of the pair. And since
     the OS may limit the total number of maps, we can run afoul of this
     in large repositories. E.g.:

       $ cd linux
       $ git ls-files | wc -l
       67959
       $ sysctl vm.max_map_count
       vm.max_map_count = 65530
       $ git ls-files | xargs touch ;# everything is stat-dirty!
       $ git diff --quiet
       fatal: mmap failed: Cannot allocate memory

     It should be unusual to have so many files stat-dirty, but it's
     possible if you've just run a script like "sed -i" or similar.

     After this patch, the above correctly exits with code 0.

  2. Even if you don't hit mmap limits, the index half of the pair will
     have been pulled from the object database into heap memory. Again
     in a clone of linux.git, running:

       $ git ls-files | head -n 10000 | xargs touch
       $ git diff --quiet

     peaks at 145MB heap before this patch, and 94MB after.

This patch solves the problem by freeing any diff_filespec data we
picked up during the "--quiet" stat-unmatch check in diff_changes.
Nobody is going to need that data later, so there's no point holding on
to it. There are a few things to note:

  - we could skip queueing the pair entirely, which could in theory save
    a little work. But there's not much to save, as we need a
    diff_filepair to feed to diff_filespec_check_stat_unmatch() anyway.
    And since we cache the result of the stat-unmatch checks, a later
    call to diffcore_skip_stat_unmatch() call will quickly skip over
    them. The diffcore code also counts up the number of stat-unmatched
    pairs as it removes them. It's doubtful any callers would care about
    that in combination with --quiet, but we'd have to reimplement the
    logic here to be on the safe side. So it's not really worth the
    trouble.

  - I didn't write a test, because we always produce the correct output
    unless we run up against system mmap limits, which are both
    unportable and expensive to test against. Measuring peak heap
    would be interesting, but our perf suite isn't yet capable of that.

  - note that diff without "--quiet" does not suffer from the same
    problem. In diffcore_skip_stat_unmatch(), we detect the stat-unmatch
    entries and drop them immediately, so we're not carrying their data
    around.

  - you _can_ still trigger the mmap limit problem if you truly have
    that many files with actual changes. But it's rather unlikely. The
    stat-unmatch check avoids loading the file contents if the sizes
    don't match, so you'd need a pretty trivial change in every single
    file. Likewise, inexact rename detection might load the data for
    many files all at once. But you'd need not just 64k changes, but
    that many deletions and additions. The most likely candidate is
    perhaps break-detection, which would load the data for all pairs and
    keep it around for the content-level diff. But again, you'd need 64k
    actually changed files in the first place.

    So it's still possible to trigger this case, but it seems like "I
    accidentally made all my files stat-dirty" is the most likely case
    in the real world.

Reported-by: Jan Christoph Uhde &lt;Jan@UhdeJc.com&gt;
Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>diff: add config option relative</title>
<updated>2020-05-24T23:23:59Z</updated>
<author>
<name>Laurent Arnoud</name>
<email>laurent@spkdev.net</email>
</author>
<published>2020-05-22T10:46:18Z</published>
<link rel='alternate' type='text/html' href='https://www.git.shady.money/git/commit/?id=c28ded83fc95be8157c851c8be179733a7d4b137'/>
<id>urn:sha1:c28ded83fc95be8157c851c8be179733a7d4b137</id>
<content type='text'>
The `diff.relative` boolean option set to `true` shows only changes in
the current directory/value specified by the `path` argument of the
`relative` option and shows pathnames relative to the aforementioned
directory.

Teach `--no-relative` to override earlier `--relative`

Add for git-format-patch(1) options documentation `--relative` and
`--no-relative`

Signed-off-by: Laurent Arnoud &lt;laurent@spkdev.net&gt;
Acked-by: Đoàn Trần Công Danh &lt;congdanhqx@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Merge branch 'jt/avoid-prefetch-when-able-in-diff'</title>
<updated>2020-04-28T22:50:04Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2020-04-28T22:50:04Z</published>
<link rel='alternate' type='text/html' href='https://www.git.shady.money/git/commit/?id=8f5dc5a4af3fa32bf9261b76b0f1146fd15b8643'/>
<id>urn:sha1:8f5dc5a4af3fa32bf9261b76b0f1146fd15b8643</id>
<content type='text'>
"git diff" in a partial clone learned to avoid lazy loading blob
objects in more casese when they are not needed.

* jt/avoid-prefetch-when-able-in-diff:
  diff: restrict when prefetching occurs
  diff: refactor object read
  diff: make diff_populate_filespec_options struct
  promisor-remote: accept 0 as oid_nr in function
</content>
</entry>
<entry>
<title>diff: restrict when prefetching occurs</title>
<updated>2020-04-07T23:09:29Z</updated>
<author>
<name>Jonathan Tan</name>
<email>jonathantanmy@google.com</email>
</author>
<published>2020-04-07T22:11:43Z</published>
<link rel='alternate' type='text/html' href='https://www.git.shady.money/git/commit/?id=95acf11a3dc3d18ec999f4913ec6c6a54545c6b7'/>
<id>urn:sha1:95acf11a3dc3d18ec999f4913ec6c6a54545c6b7</id>
<content type='text'>
Commit 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-08)
optimized "diff" by prefetching blobs in a partial clone, but there are
some cases wherein blobs do not need to be prefetched. In these cases,
any command that uses the diff machinery will unnecessarily fetch blobs.

diffcore_std() may read blobs when it calls the following functions:
 (1) diffcore_skip_stat_unmatch() (controlled by the config variable
     diff.autorefreshindex)
 (2) diffcore_break() and diffcore_merge_broken() (for break-rewrite
     detection)
 (3) diffcore_rename() (for rename detection)
 (4) diffcore_pickaxe() (for detecting addition/deletion of specified
     string)

Instead of always prefetching blobs, teach diffcore_skip_stat_unmatch(),
diffcore_break(), and diffcore_rename() to prefetch blobs upon the first
read of a missing object. This covers (1), (2), and (3): to cover the
rest, teach diffcore_std() to prefetch if the output type is one that
includes blob data (and hence blob data will be required later anyway),
or if it knows that (4) will be run.

Helped-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Jonathan Tan &lt;jonathantanmy@google.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>diff: refactor object read</title>
<updated>2020-04-07T23:09:29Z</updated>
<author>
<name>Jonathan Tan</name>
<email>jonathantanmy@google.com</email>
</author>
<published>2020-04-07T22:11:42Z</published>
<link rel='alternate' type='text/html' href='https://www.git.shady.money/git/commit/?id=c14b6f83ec7453d2a93bba04f45caf26905f2bff'/>
<id>urn:sha1:c14b6f83ec7453d2a93bba04f45caf26905f2bff</id>
<content type='text'>
Refactor the object reads in diff_populate_filespec() to have the first
object read not be in an if/else branch, because in a future patch, a
retry will be added to that first object read.

Signed-off-by: Jonathan Tan &lt;jonathantanmy@google.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>diff: make diff_populate_filespec_options struct</title>
<updated>2020-04-07T23:09:29Z</updated>
<author>
<name>Jonathan Tan</name>
<email>jonathantanmy@google.com</email>
</author>
<published>2020-04-07T22:11:41Z</published>
<link rel='alternate' type='text/html' href='https://www.git.shady.money/git/commit/?id=1c37e86ab2834dfca311799e799568794bc474ce'/>
<id>urn:sha1:1c37e86ab2834dfca311799e799568794bc474ce</id>
<content type='text'>
The behavior of diff_populate_filespec() currently can be customized
through a bitflag, but a subsequent patch requires it to support a
non-boolean option. Replace the bitflag with an options struct.

Signed-off-by: Jonathan Tan &lt;jonathantanmy@google.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>promisor-remote: accept 0 as oid_nr in function</title>
<updated>2020-04-02T19:42:32Z</updated>
<author>
<name>Jonathan Tan</name>
<email>jonathantanmy@google.com</email>
</author>
<published>2020-04-02T19:19:16Z</published>
<link rel='alternate' type='text/html' href='https://www.git.shady.money/git/commit/?id=db7ed7418b3702ab2b2df755764c3f452917a890'/>
<id>urn:sha1:db7ed7418b3702ab2b2df755764c3f452917a890</id>
<content type='text'>
There are 3 callers to promisor_remote_get_direct() that first check if
the number of objects to be fetched is equal to 0. Fold that check into
promisor_remote_get_direct(), and in doing so, be explicit as to what
promisor_remote_get_direct() does if oid_nr is 0 (it returns 0, success,
immediately).

Signed-off-by: Jonathan Tan &lt;jonathantanmy@google.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>convert: provide additional metadata to filters</title>
<updated>2020-03-16T18:37:02Z</updated>
<author>
<name>brian m. carlson</name>
<email>bk2204@github.com</email>
</author>
<published>2020-03-16T18:05:03Z</published>
<link rel='alternate' type='text/html' href='https://www.git.shady.money/git/commit/?id=c397aac02f9f97976f675115aa5df6ca01e26d59'/>
<id>urn:sha1:c397aac02f9f97976f675115aa5df6ca01e26d59</id>
<content type='text'>
Now that we have the codebase wired up to pass any additional metadata
to filters, let's collect the additional metadata that we'd like to
pass.

The two main places we pass this metadata are checkouts and archives.
In these two situations, reading HEAD isn't a valid option, since HEAD
isn't updated for checkouts until after the working tree is written and
archives can accept an arbitrary tree.  In other situations, HEAD will
usually reflect the refname of the branch in current use.

We pass a smaller amount of data in other cases, such as git cat-file,
where we can really only logically know about the blob.

This commit updates only the parts of the checkout code where we don't
use unpack_trees.  That function and callers of it will be handled in a
future commit.

In the archive code, we leak a small amount of memory, since nothing we
pass in the archiver argument structure is freed.

Signed-off-by: brian m. carlson &lt;bk2204@github.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>convert: permit passing additional metadata to filter processes</title>
<updated>2020-03-16T18:37:02Z</updated>
<author>
<name>brian m. carlson</name>
<email>bk2204@github.com</email>
</author>
<published>2020-03-16T18:05:02Z</published>
<link rel='alternate' type='text/html' href='https://www.git.shady.money/git/commit/?id=ab90ecae992e44e3e8303f143ad858608acabcf5'/>
<id>urn:sha1:ab90ecae992e44e3e8303f143ad858608acabcf5</id>
<content type='text'>
There are a variety of situations where a filter process can make use of
some additional metadata.  For example, some people find the ident
filter too limiting and would like to include the commit or the branch
in their smudged files.  This information isn't available during
checkout as HEAD hasn't been updated at that point, and it wouldn't be
available in archives either.

Let's add a way to pass this metadata down to the filter.  We pass the
blob we're operating on, the treeish (preferring the commit over the
tree if one exists), and the ref we're operating on.  Note that we won't
pass this information in all cases, such as when renormalizing or when
we're performing diffs, since it doesn't make sense in those cases.

The data we currently get from the filter process looks like the
following:

  command=smudge
  pathname=git.c
  0000

With this change, we'll get data more like this:

  command=smudge
  pathname=git.c
  refname=refs/tags/v2.25.1
  treeish=c522f061d551c9bb8684a7c3859b2ece4499b56b
  blob=7be7ad34bd053884ec48923706e70c81719a8660
  0000

There are a couple things to note about this approach.  For operations
like checkout, treeish will always be a commit, since we cannot check
out individual trees, but for other operations, like archive, we can end
up operating on only a particular tree, so we'll provide only a tree as
the treeish.  Similar comments apply for refname, since there are a
variety of cases in which we won't have a ref.

This commit wires up the code to print this information, but doesn't
pass any of it at this point.  In a future commit, we'll have various
code paths pass the actual useful data down.

Signed-off-by: brian m. carlson &lt;bk2204@github.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
</feed>
