git/diff.c, branch v2.28.1

Merge branch 'jk/diff-memuse-optim-with-stat-unmatch'

2020-06-18T04:54:00Z

Reduce memory usage during "diff --quiet" in a worktree with too many stat-unmatched paths. * jk/diff-memuse-optim-with-stat-unmatch: diff: discard blob data from stat-unmatched pairs

diff: discard blob data from stat-unmatched pairs

2020-06-02T16:28:56Z

When performing a tree-level diff against the working tree, we may find that our index stat information is dirty, so we queue a filepair to be examined later. If the actual content hasn't changed, we call this a stat-unmatch; the stat information was out of date, but there's no actual diff. Normally diffcore_std() would detect and remove these identical filepairs via diffcore_skip_stat_unmatch(). However, when "--quiet" is used, we want to stop the diff as soon as we see any changes, so we check for stat-unmatches immediately in diff_change(). That check may require us to actually load the file contents into the pair of diff_filespecs. If we find that the pair isn't a stat-unmatch, then no big deal; we'd likely load the contents later anyway to generate a patch, do rename detection, etc, so we want to hold on to it. But if it is a stat-unmatch, then we have no more use for that data; the whole point is that we're going discard the pair. However, we never free the allocated diff_filespec data. In most cases, keeping that data isn't a problem. We don't expect a lot of stat-unmatch entries, and since we're using --quiet, we'd quit as soon as we saw such a real change anyway. However, there are extreme cases where it makes a big difference: 1. We'd generally mmap() the working tree half of the pair. And since the OS may limit the total number of maps, we can run afoul of this in large repositories. E.g.: $ cd linux $ git ls-files | wc -l 67959 $ sysctl vm.max_map_count vm.max_map_count = 65530 $ git ls-files | xargs touch ;# everything is stat-dirty! $ git diff --quiet fatal: mmap failed: Cannot allocate memory It should be unusual to have so many files stat-dirty, but it's possible if you've just run a script like "sed -i" or similar. After this patch, the above correctly exits with code 0. 2. Even if you don't hit mmap limits, the index half of the pair will have been pulled from the object database into heap memory. Again in a clone of linux.git, running: $ git ls-files | head -n 10000 | xargs touch $ git diff --quiet peaks at 145MB heap before this patch, and 94MB after. This patch solves the problem by freeing any diff_filespec data we picked up during the "--quiet" stat-unmatch check in diff_changes. Nobody is going to need that data later, so there's no point holding on to it. There are a few things to note: - we could skip queueing the pair entirely, which could in theory save a little work. But there's not much to save, as we need a diff_filepair to feed to diff_filespec_check_stat_unmatch() anyway. And since we cache the result of the stat-unmatch checks, a later call to diffcore_skip_stat_unmatch() call will quickly skip over them. The diffcore code also counts up the number of stat-unmatched pairs as it removes them. It's doubtful any callers would care about that in combination with --quiet, but we'd have to reimplement the logic here to be on the safe side. So it's not really worth the trouble. - I didn't write a test, because we always produce the correct output unless we run up against system mmap limits, which are both unportable and expensive to test against. Measuring peak heap would be interesting, but our perf suite isn't yet capable of that. - note that diff without "--quiet" does not suffer from the same problem. In diffcore_skip_stat_unmatch(), we detect the stat-unmatch entries and drop them immediately, so we're not carrying their data around. - you _can_ still trigger the mmap limit problem if you truly have that many files with actual changes. But it's rather unlikely. The stat-unmatch check avoids loading the file contents if the sizes don't match, so you'd need a pretty trivial change in every single file. Likewise, inexact rename detection might load the data for many files all at once. But you'd need not just 64k changes, but that many deletions and additions. The most likely candidate is perhaps break-detection, which would load the data for all pairs and keep it around for the content-level diff. But again, you'd need 64k actually changed files in the first place. So it's still possible to trigger this case, but it seems like "I accidentally made all my files stat-dirty" is the most likely case in the real world. Reported-by: Jan Christoph Uhde Signed-off-by: Jeff King Signed-off-by: Junio C Hamano

diff: add config option relative

2020-05-24T23:23:59Z

The `diff.relative` boolean option set to `true` shows only changes in the current directory/value specified by the `path` argument of the `relative` option and shows pathnames relative to the aforementioned directory. Teach `--no-relative` to override earlier `--relative` Add for git-format-patch(1) options documentation `--relative` and `--no-relative` Signed-off-by: Laurent Arnoud Acked-by: Đoàn Trần Công Danh Signed-off-by: Junio C Hamano

Merge branch 'jt/avoid-prefetch-when-able-in-diff'

2020-04-28T22:50:04Z

"git diff" in a partial clone learned to avoid lazy loading blob objects in more casese when they are not needed. * jt/avoid-prefetch-when-able-in-diff: diff: restrict when prefetching occurs diff: refactor object read diff: make diff_populate_filespec_options struct promisor-remote: accept 0 as oid_nr in function

diff: restrict when prefetching occurs

2020-04-07T23:09:29Z

Commit 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-08) optimized "diff" by prefetching blobs in a partial clone, but there are some cases wherein blobs do not need to be prefetched. In these cases, any command that uses the diff machinery will unnecessarily fetch blobs. diffcore_std() may read blobs when it calls the following functions: (1) diffcore_skip_stat_unmatch() (controlled by the config variable diff.autorefreshindex) (2) diffcore_break() and diffcore_merge_broken() (for break-rewrite detection) (3) diffcore_rename() (for rename detection) (4) diffcore_pickaxe() (for detecting addition/deletion of specified string) Instead of always prefetching blobs, teach diffcore_skip_stat_unmatch(), diffcore_break(), and diffcore_rename() to prefetch blobs upon the first read of a missing object. This covers (1), (2), and (3): to cover the rest, teach diffcore_std() to prefetch if the output type is one that includes blob data (and hence blob data will be required later anyway), or if it knows that (4) will be run. Helped-by: Jeff King Signed-off-by: Jonathan Tan Signed-off-by: Junio C Hamano

diff: refactor object read

2020-04-07T23:09:29Z

Refactor the object reads in diff_populate_filespec() to have the first object read not be in an if/else branch, because in a future patch, a retry will be added to that first object read. Signed-off-by: Jonathan Tan Signed-off-by: Junio C Hamano

diff: make diff_populate_filespec_options struct

2020-04-07T23:09:29Z

The behavior of diff_populate_filespec() currently can be customized through a bitflag, but a subsequent patch requires it to support a non-boolean option. Replace the bitflag with an options struct. Signed-off-by: Jonathan Tan Signed-off-by: Junio C Hamano

promisor-remote: accept 0 as oid_nr in function

2020-04-02T19:42:32Z

There are 3 callers to promisor_remote_get_direct() that first check if the number of objects to be fetched is equal to 0. Fold that check into promisor_remote_get_direct(), and in doing so, be explicit as to what promisor_remote_get_direct() does if oid_nr is 0 (it returns 0, success, immediately). Signed-off-by: Jonathan Tan Signed-off-by: Junio C Hamano

convert: provide additional metadata to filters

2020-03-16T18:37:02Z

Now that we have the codebase wired up to pass any additional metadata to filters, let's collect the additional metadata that we'd like to pass. The two main places we pass this metadata are checkouts and archives. In these two situations, reading HEAD isn't a valid option, since HEAD isn't updated for checkouts until after the working tree is written and archives can accept an arbitrary tree. In other situations, HEAD will usually reflect the refname of the branch in current use. We pass a smaller amount of data in other cases, such as git cat-file, where we can really only logically know about the blob. This commit updates only the parts of the checkout code where we don't use unpack_trees. That function and callers of it will be handled in a future commit. In the archive code, we leak a small amount of memory, since nothing we pass in the archiver argument structure is freed. Signed-off-by: brian m. carlson Signed-off-by: Junio C Hamano

convert: permit passing additional metadata to filter processes

2020-03-16T18:37:02Z

There are a variety of situations where a filter process can make use of some additional metadata. For example, some people find the ident filter too limiting and would like to include the commit or the branch in their smudged files. This information isn't available during checkout as HEAD hasn't been updated at that point, and it wouldn't be available in archives either. Let's add a way to pass this metadata down to the filter. We pass the blob we're operating on, the treeish (preferring the commit over the tree if one exists), and the ref we're operating on. Note that we won't pass this information in all cases, such as when renormalizing or when we're performing diffs, since it doesn't make sense in those cases. The data we currently get from the filter process looks like the following: command=smudge pathname=git.c 0000 With this change, we'll get data more like this: command=smudge pathname=git.c refname=refs/tags/v2.25.1 treeish=c522f061d551c9bb8684a7c3859b2ece4499b56b blob=7be7ad34bd053884ec48923706e70c81719a8660 0000 There are a couple things to note about this approach. For operations like checkout, treeish will always be a commit, since we cannot check out individual trees, but for other operations, like archive, we can end up operating on only a particular tree, so we'll provide only a tree as the treeish. Similar comments apply for refname, since there are a variety of cases in which we won't have a ref. This commit wires up the code to print this information, but doesn't pass any of it at this point. In a future commit, we'll have various code paths pass the actual useful data down. Signed-off-by: brian m. carlson Signed-off-by: Junio C Hamano