diff options
Diffstat (limited to 'Documentation/technical')
| -rw-r--r-- | Documentation/technical/api-index-skel.txt | 2 | ||||
| -rw-r--r-- | Documentation/technical/api-merge.txt | 4 | ||||
| -rw-r--r-- | Documentation/technical/api-simple-ipc.txt | 10 | ||||
| -rw-r--r-- | Documentation/technical/api-trace2.txt | 190 | ||||
| -rw-r--r-- | Documentation/technical/bitmap-format.txt | 6 | ||||
| -rw-r--r-- | Documentation/technical/bundle-uri.txt | 8 | ||||
| -rw-r--r-- | Documentation/technical/commit-graph.txt | 10 | ||||
| -rw-r--r-- | Documentation/technical/hash-function-transition.txt | 2 | ||||
| -rw-r--r-- | Documentation/technical/parallel-checkout.txt | 12 | ||||
| -rw-r--r-- | Documentation/technical/partial-clone.txt | 8 | ||||
| -rw-r--r-- | Documentation/technical/racy-git.txt | 10 | ||||
| -rw-r--r-- | Documentation/technical/reftable.txt | 10 | ||||
| -rw-r--r-- | Documentation/technical/remembering-renames.txt | 2 | ||||
| -rw-r--r-- | Documentation/technical/repository-version.txt | 6 | ||||
| -rw-r--r-- | Documentation/technical/rerere.txt | 8 | ||||
| -rw-r--r-- | Documentation/technical/sparse-checkout.txt | 1103 |
16 files changed, 1284 insertions, 107 deletions
diff --git a/Documentation/technical/api-index-skel.txt b/Documentation/technical/api-index-skel.txt index eda8c195c1..7780a76b08 100644 --- a/Documentation/technical/api-index-skel.txt +++ b/Documentation/technical/api-index-skel.txt @@ -1,7 +1,7 @@ Git API Documents ================= -Git has grown a set of internal API over time. This collection +Git has grown a set of internal APIs over time. This collection documents them. //////////////////////////////////////////////////////////////// diff --git a/Documentation/technical/api-merge.txt b/Documentation/technical/api-merge.txt index 487d4d83ff..c2ba01828c 100644 --- a/Documentation/technical/api-merge.txt +++ b/Documentation/technical/api-merge.txt @@ -28,9 +28,9 @@ and `diff.c` for examples. * `struct ll_merge_options` -Check ll-merge.h for details. +Check merge-ll.h for details. Low-level (single file) merge ----------------------------- -Check ll-merge.h for details. +Check merge-ll.h for details. diff --git a/Documentation/technical/api-simple-ipc.txt b/Documentation/technical/api-simple-ipc.txt index d44ada98e7..c4fb152b23 100644 --- a/Documentation/technical/api-simple-ipc.txt +++ b/Documentation/technical/api-simple-ipc.txt @@ -2,7 +2,7 @@ Simple-IPC API ============== The Simple-IPC API is a collection of `ipc_` prefixed library routines -and a basic communication protocol that allow an IPC-client process to +and a basic communication protocol that allows an IPC-client process to send an application-specific IPC-request message to an IPC-server process and receive an application-specific IPC-response message. @@ -20,12 +20,12 @@ IPC-client. The IPC-client routines within a client application process connect to the IPC-server and send a request message and wait for a response. -When received, the response is returned back the caller. +When received, the response is returned back to the caller. For example, the `fsmonitor--daemon` feature will be built as a server application on top of the IPC-server library routines. It will have threads watching for file system events and a thread pool waiting for -client connections. Clients, such as `git status` will request a list +client connections. Clients, such as `git status`, will request a list of file system events since a point in time and the server will respond with a list of changed files and directories. The formats of the request and response are application-specific; the IPC-client and @@ -37,7 +37,7 @@ Comparison with sub-process model The Simple-IPC mechanism differs from the existing `sub-process.c` model (Documentation/technical/long-running-process-protocol.txt) and -used by applications like Git-LFS. In the LFS-style sub-process model +used by applications like Git-LFS. In the LFS-style sub-process model, the helper is started by the foreground process, communication happens via a pair of file descriptors bound to the stdin/stdout of the sub-process, the sub-process only serves the current foreground @@ -102,4 +102,4 @@ stateless request, receive an application-specific response, and disconnect. It is a one round trip facility for querying the server. The Simple-IPC routines hide the socket, named pipe, and thread pool details and allow the application -layer to focus on the application at hand. +layer to focus on the task at hand. diff --git a/Documentation/technical/api-trace2.txt b/Documentation/technical/api-trace2.txt index 2afa28bb5a..de5fc25059 100644 --- a/Documentation/technical/api-trace2.txt +++ b/Documentation/technical/api-trace2.txt @@ -148,20 +148,18 @@ filename collisions). == Trace2 API -All public Trace2 functions and macros are defined in `trace2.h` and -`trace2.c`. All public symbols are prefixed with `trace2_`. +The Trace2 public API is defined and documented in `trace2.h`; refer to it for +more information. All public functions and macros are prefixed +with `trace2_` and are implemented in `trace2.c`. There are no public Trace2 data structures. The Trace2 code also defines a set of private functions and data types in the `trace2/` directory. These symbols are prefixed with `tr2_` -and should only be used by functions in `trace2.c`. +and should only be used by functions in `trace2.c` (or other private +source files in `trace2/`). -== Conventions for Public Functions and Macros - -The functions defined by the Trace2 API are declared and documented -in `trace2.h`. It defines the API functions and wrapper macros for -Trace2. +=== Conventions for Public Functions and Macros Some functions have a `_fl()` suffix to indicate that they take `file` and `line-number` arguments. @@ -172,52 +170,7 @@ take a `va_list` argument. Some functions have a `_printf_fl()` suffix to indicate that they also take a `printf()` style format with a variable number of arguments. -There are CPP wrapper macros and `#ifdef`s to hide most of these details. -See `trace2.h` for more details. The following discussion will only -describe the simplified forms. - -== Public API - -All Trace2 API functions send a message to all of the active -Trace2 Targets. This section describes the set of available -messages. - -It helps to divide these functions into groups for discussion -purposes. - -=== Basic Command Messages - -These are concerned with the lifetime of the overall git process. -e.g: `void trace2_initialize_clock()`, `void trace2_initialize()`, -`int trace2_is_enabled()`, `void trace2_cmd_start(int argc, const char **argv)`. - -=== Command Detail Messages - -These are concerned with describing the specific Git command -after the command line, config, and environment are inspected. -e.g: `void trace2_cmd_name(const char *name)`, -`void trace2_cmd_mode(const char *mode)`. - -=== Child Process Messages - -These are concerned with the various spawned child processes, -including shell scripts, git commands, editors, pagers, and hooks. - -e.g: `void trace2_child_start(struct child_process *cmd)`. - -=== Git Thread Messages - -These messages are concerned with Git thread usage. - -e.g: `void trace2_thread_start(const char *thread_name)`. - -=== Region and Data Messages - -These are concerned with recording performance data -over regions or spans of code. e.g: -`void trace2_region_enter(const char *category, const char *label, const struct repository *repo)`. - -Refer to trace2.h for details about all trace2 functions. +CPP wrapper macros are defined to hide most of these details. == Trace2 Target Formats @@ -685,8 +638,8 @@ The "exec_id" field is a command-unique id and is only useful if the `"thread_start"`:: This event is generated when a thread is started. It is - generated from *within* the new thread's thread-proc (for TLS - reasons). + generated from *within* the new thread's thread-proc (because + it needs to access data in the thread's thread-local storage). + ------------ { @@ -698,7 +651,7 @@ The "exec_id" field is a command-unique id and is only useful if the `"thread_exit"`:: This event is generated when a thread exits. It is generated - from *within* the thread's thread-proc (for TLS reasons). + from *within* the thread's thread-proc. + ------------ { @@ -816,6 +769,73 @@ The "value" field may be an integer or a string. } ------------ +`"th_timer"`:: + This event logs the amount of time that a stopwatch timer was + running in the thread. This event is generated when a thread + exits for timers that requested per-thread events. ++ +------------ +{ + "event":"th_timer", + ... + "category":"my_category", + "name":"my_timer", + "intervals":5, # number of time it was started/stopped + "t_total":0.052741, # total time in seconds it was running + "t_min":0.010061, # shortest interval + "t_max":0.011648 # longest interval +} +------------ + +`"timer"`:: + This event logs the amount of time that a stopwatch timer was + running aggregated across all threads. This event is generated + when the process exits. ++ +------------ +{ + "event":"timer", + ... + "category":"my_category", + "name":"my_timer", + "intervals":5, # number of time it was started/stopped + "t_total":0.052741, # total time in seconds it was running + "t_min":0.010061, # shortest interval + "t_max":0.011648 # longest interval +} +------------ + +`"th_counter"`:: + This event logs the value of a counter variable in a thread. + This event is generated when a thread exits for counters that + requested per-thread events. ++ +------------ +{ + "event":"th_counter", + ... + "category":"my_category", + "name":"my_counter", + "count":23 +} +------------ + +`"counter"`:: + This event logs the value of a counter variable across all threads. + This event is generated when the process exits. The total value + reported here is the sum across all threads. ++ +------------ +{ + "event":"counter", + ... + "category":"my_category", + "name":"my_counter", + "count":23 +} +------------ + + == Example Trace2 API Usage Here is a hypothetical usage of the Trace2 API showing the intended @@ -1206,7 +1226,7 @@ worked on 508 items at offset 2032. Thread "th04" worked on 508 items at offset 508. + This example also shows that thread names are assigned in a racy manner -as each thread starts and allocates TLS storage. +as each thread starts. Config (def param) Events:: @@ -1247,6 +1267,60 @@ d0 | main | data | r0 | 0.002126 | 0.002126 | fsy d0 | main | exit | | 0.000470 | | | code:0 d0 | main | atexit | | 0.000477 | | | code:0 ---------------- + +Stopwatch Timer Events:: + + Measure the time spent in a function call or span of code + that might be called from many places within the code + throughout the life of the process. ++ +---------------- +static void expensive_function(void) +{ + trace2_timer_start(TRACE2_TIMER_ID_TEST1); + ... + sleep_millisec(1000); // Do something expensive + ... + trace2_timer_stop(TRACE2_TIMER_ID_TEST1); +} + +static int ut_100timer(int argc, const char **argv) +{ + ... + + expensive_function(); + + // Do something else 1... + + expensive_function(); + + // Do something else 2... + + expensive_function(); + + return 0; +} +---------------- ++ +In this example, we measure the total time spent in +`expensive_function()` regardless of when it is called +in the overall flow of the program. ++ +---------------- +$ export GIT_TRACE2_PERF_BRIEF=1 +$ export GIT_TRACE2_PERF=~/log.perf +$ t/helper/test-tool trace2 100timer 3 1000 +... +$ cat ~/log.perf +d0 | main | version | | | | | ... +d0 | main | start | | 0.001453 | | | t/helper/test-tool trace2 100timer 3 1000 +d0 | main | cmd_name | | | | | trace2 (trace2) +d0 | main | exit | | 3.003667 | | | code:0 +d0 | main | timer | | | | test | name:test1 intervals:3 total:3.001686 min:1.000254 max:1.000929 +d0 | main | atexit | | 3.003796 | | | code:0 +---------------- + + == Future Work === Relationship to the Existing Trace Api (api-trace.txt) diff --git a/Documentation/technical/bitmap-format.txt b/Documentation/technical/bitmap-format.txt index c2e652b71a..f5d200939b 100644 --- a/Documentation/technical/bitmap-format.txt +++ b/Documentation/technical/bitmap-format.txt @@ -114,7 +114,7 @@ result in an empty bitmap (no bits set). * N entries with compressed bitmaps, one for each indexed commit + -Where `N` is the total amount of entries in this bitmap index. +Where `N` is the total number of entries in this bitmap index. Each entry contains the following: ** {empty} @@ -126,7 +126,7 @@ Each entry contains the following: ** {empty} 1-byte XOR-offset: :: The xor offset used to compress this bitmap. For an entry - in position `x`, a XOR offset of `y` means that the actual + in position `x`, an XOR offset of `y` means that the actual bitmap representing this commit is composed by XORing the bitmap for this entry with the bitmap in entry `x-y` (i.e. the bitmap `y` entries before this one). @@ -239,7 +239,7 @@ bitmaps. For a `.bitmap` containing `nr_entries` reachability bitmaps, the table contains a list of `nr_entries` <commit_pos, offset, xor_row> triplets -(sorted in the ascending order of `commit_pos`). The content of i'th +(sorted in the ascending order of `commit_pos`). The content of the i'th triplet is - * {empty} diff --git a/Documentation/technical/bundle-uri.txt b/Documentation/technical/bundle-uri.txt index b78d01d9ad..91d3a13e32 100644 --- a/Documentation/technical/bundle-uri.txt +++ b/Documentation/technical/bundle-uri.txt @@ -479,14 +479,14 @@ outline for submitting these features: (This choice is an opt-in via a config option and a command-line option.) -4. Allow the client to understand the `bundle.flag=forFetch` configuration +4. Allow the client to understand the `bundle.heuristic` configuration key and the `bundle.<id>.creationToken` heuristic. When `git clone` - discovers a bundle URI with `bundle.flag=forFetch`, it configures the - client repository to check that bundle URI during later `git fetch <remote>` + discovers a bundle URI with `bundle.heuristic`, it configures the client + repository to check that bundle URI during later `git fetch <remote>` commands. 5. Allow clients to discover bundle URIs during `git fetch` and configure - a bundle URI for later fetches if `bundle.flag=forFetch`. + a bundle URI for later fetches if `bundle.heuristic` is set. 6. Implement the "inspect headers" heuristic to reduce data downloads when the `bundle.<id>.creationToken` heuristic is not available. diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt index 90c9760c23..2c26e95e51 100644 --- a/Documentation/technical/commit-graph.txt +++ b/Documentation/technical/commit-graph.txt @@ -1,4 +1,4 @@ -Git Commit Graph Design Notes +Git Commit-Graph Design Notes ============================= Git walks the commit graph for many reasons, including: @@ -17,7 +17,7 @@ There are two main costs here: The commit-graph file is a supplemental data structure that accelerates commit graph walks. If a user downgrades or disables the 'core.commitGraph' -config setting, then the existing ODB is sufficient. The file is stored +config setting, then the existing object database is sufficient. The file is stored as "commit-graph" either in the .git/objects/info directory or in the info directory of an alternate. @@ -95,7 +95,7 @@ with default order), but is not used when the topological order is required (such as merge base calculations, "git log --graph"). In practice, we expect some commits to be created recently and not stored -in the commit graph. We can treat these commits as having "infinite" +in the commit-graph. We can treat these commits as having "infinite" generation number and walk until reaching commits with known generation number. @@ -136,7 +136,7 @@ Design Details - Commit grafts and replace objects can change the shape of the commit history. The latter can also be enabled/disabled on the fly using - `--no-replace-objects`. This leads to difficultly storing both possible + `--no-replace-objects`. This leads to difficulty storing both possible interpretations of a commit id, especially when computing generation numbers. The commit-graph will not be read or written when replace-objects or grafts are present. @@ -149,7 +149,7 @@ Design Details helpful for these clones, anyway. The commit-graph will not be read or written when shallow commits are present. -Commit Graphs Chains +Commit-Graphs Chains -------------------- Typically, repos grow with near-constant velocity (commits per day). Over time, diff --git a/Documentation/technical/hash-function-transition.txt b/Documentation/technical/hash-function-transition.txt index e2ac36dd21..ed57481089 100644 --- a/Documentation/technical/hash-function-transition.txt +++ b/Documentation/technical/hash-function-transition.txt @@ -562,7 +562,7 @@ hash re-encode during clone and to encourage peers to modernize. The design described here allows fetches by SHA-1 clients of a personal SHA-256 repository because it's not much more difficult than allowing pushes from that repository. This support needs to be guarded -by a configuration option --- servers like git.kernel.org that serve a +by a configuration option -- servers like git.kernel.org that serve a large number of clients would not be expected to bear that cost. Meaning of signatures diff --git a/Documentation/technical/parallel-checkout.txt b/Documentation/technical/parallel-checkout.txt index e790258a1a..b4a144e5f4 100644 --- a/Documentation/technical/parallel-checkout.txt +++ b/Documentation/technical/parallel-checkout.txt @@ -56,14 +56,14 @@ Rejected Multi-Threaded Solution The most "straightforward" implementation would be to spread the set of to-be-updated cache entries across multiple threads. But due to the -thread-unsafe functions in the ODB code, we would have to use locks to +thread-unsafe functions in the object database code, we would have to use locks to coordinate the parallel operation. An early prototype of this solution showed that the multi-threaded checkout would bring performance improvements over the sequential code, but there was still too much lock contention. A `perf` profiling indicated that around 20% of the runtime during a local Linux clone (on an SSD) was spent in locking functions. For this reason this approach was rejected in favor of using multiple -child processes, which led to a better performance. +child processes, which led to better performance. Multi-Process Solution ---------------------- @@ -126,7 +126,7 @@ Then, for each assigned item, each worker: * W5: Writes the result to the file descriptor opened at W2. -* W6: Calls `fstat()` or lstat()` on the just-written path, and sends +* W6: Calls `fstat()` or `lstat()` on the just-written path, and sends the result back to the main process, together with the end status of the operation and the item's identification number. @@ -148,7 +148,7 @@ information, the main process handles the results in two steps: - First, it updates the in-memory index with the `lstat()` information sent by the workers. (This must be done first as this information - might me required in the following step.) + might be required in the following step.) - Then it writes the items which collided on disk (i.e. items marked with `PC_ITEM_COLLIDED`). More on this below. @@ -185,7 +185,7 @@ quite straightforward: for each parallel-eligible entry, the main process must remove all files that prevent this entry from being written (before enqueueing it). This includes any non-directory file in the leading path of the entry. Later, when a worker gets assigned the entry, -it looks again for the non-directories files and for an already existing +it looks again for the non-directory files and for an already existing file at the entry's path. If any of these checks finds something, the worker knows that there was a path collision. @@ -232,7 +232,7 @@ conversion and re-encoding, are eligible for parallel checkout. Ineligible entries are checked out by the classic sequential codepath *before* spawning workers. -Note: submodules's files are also eligible for parallel checkout (as +Note: submodules' files are also eligible for parallel checkout (as long as they don't fall into any of the excluding categories mentioned above). But since each submodule is checked out in its own child process, we don't mix the superproject's and the submodules' files in diff --git a/Documentation/technical/partial-clone.txt b/Documentation/technical/partial-clone.txt index 92fcee2bff..cd948b0072 100644 --- a/Documentation/technical/partial-clone.txt +++ b/Documentation/technical/partial-clone.txt @@ -3,7 +3,7 @@ Partial Clone Design Notes The "Partial Clone" feature is a performance optimization for Git that allows Git to function without having a complete copy of the repository. -The goal of this work is to allow Git better handle extremely large +The goal of this work is to allow Git to better handle extremely large repositories. During clone and fetch operations, Git downloads the complete contents @@ -256,7 +256,7 @@ remote in a specific order. - Dynamic object fetching currently uses the existing pack protocol V0 which means that each object is requested via fetch-pack. The server will send a full set of info/refs when the connection is established. - If there are large number of refs, this may incur significant overhead. + If there are a large number of refs, this may incur significant overhead. Future Work @@ -265,7 +265,7 @@ Future Work - Improve the way to specify the order in which promisor remotes are tried. + -For example this could allow to specify explicitly something like: +For example this could allow specifying explicitly something like: "When fetching from this remote, I want to use these promisor remotes in this order, though, when pushing or fetching to that remote, I want to use those promisor remotes in that order." @@ -322,7 +322,7 @@ Footnotes [a] expensive-to-modify list of missing objects: Earlier in the design of partial clone we discussed the need for a single list of missing objects. - This would essentially be a sorted linear list of OIDs that the were + This would essentially be a sorted linear list of OIDs that were omitted by the server during a clone or subsequent fetches. This file would need to be loaded into memory on every object lookup. diff --git a/Documentation/technical/racy-git.txt b/Documentation/technical/racy-git.txt index ceda4bbfda..59bea66c0f 100644 --- a/Documentation/technical/racy-git.txt +++ b/Documentation/technical/racy-git.txt @@ -11,7 +11,7 @@ write out the next tree object to be committed. The state is "virtual" in the sense that it does not necessarily have to, and often does not, match the files in the working tree. -There are cases Git needs to examine the differences between the +There are cases where Git needs to examine the differences between the virtual working tree state in the index and the files in the working tree. The most obvious case is when the user asks `git diff` (or its low level implementation, `git diff-files`) or @@ -165,9 +165,9 @@ Avoiding runtime penalty In order to avoid the above runtime penalty, post 1.4.2 Git used to have a code that made sure the index file -got timestamp newer than the youngest files in the index when -there are many young files with the same timestamp as the -resulting index file would otherwise would have by waiting +got a timestamp newer than the youngest files in the index when +there were many young files with the same timestamp as the +resulting index file otherwise would have by waiting before finishing writing the index file out. I suspected that in practice the situation where many paths in the @@ -190,7 +190,7 @@ In a large project where raciness avoidance cost really matters, however, the initial computation of all object names in the index takes more than one second, and the index file is written out after all that happens. Therefore the timestamp of the -index file will be more than one seconds later than the +index file will be more than one second later than the youngest file in the working tree. This means that in these cases there actually will not be any racily clean entry in the resulting index. diff --git a/Documentation/technical/reftable.txt b/Documentation/technical/reftable.txt index 6a67cc4174..dd0b37c4e3 100644 --- a/Documentation/technical/reftable.txt +++ b/Documentation/technical/reftable.txt @@ -46,7 +46,7 @@ search lookup, and range scans. Storage in the file is organized into variable sized blocks. Prefix compression is used within a single block to reduce disk space. Block -size and alignment is tunable by the writer. +size and alignment are tunable by the writer. Performance ^^^^^^^^^^^ @@ -115,7 +115,7 @@ Varint encoding Varint encoding is identical to the ofs-delta encoding method used within pack files. -Decoder works such as: +Decoder works as follows: .... val = buf[ptr] & 0x7f @@ -175,7 +175,7 @@ log_index* footer .... -in a log-only file the first log block immediately follows the file +In a log-only file, the first log block immediately follows the file header, without padding to block alignment. Block size @@ -247,7 +247,7 @@ uint32( hash_id ) .... The header is identical to `version_number=1`, with the 4-byte hash ID -("sha1" for SHA1 and "s256" for SHA-256) append to the header. +("sha1" for SHA1 and "s256" for SHA-256) appended to the header. For maximum backward compatibility, it is recommended to use version 1 when writing SHA1 reftables. @@ -288,7 +288,7 @@ The 2-byte `restart_count` stores the number of entries in the `restart_count` to binary search between restarts before starting a linear scan. -Exactly `restart_count` 3-byte `restart_offset` values precedes the +Exactly `restart_count` 3-byte `restart_offset` values precede the `restart_count`. Offsets are relative to the start of the block and refer to the first byte of any `ref_record` whose name has not been prefix compressed. Entries in the `restart_offset` list must be sorted, diff --git a/Documentation/technical/remembering-renames.txt b/Documentation/technical/remembering-renames.txt index 1e34d91390..73f41761e2 100644 --- a/Documentation/technical/remembering-renames.txt +++ b/Documentation/technical/remembering-renames.txt @@ -664,7 +664,7 @@ skip-irrelevant-renames optimization means we sometimes don't detect renames for any files within a directory that was renamed, in which case we will not have been able to detect any rename for the directory itself. In such a case, we do not know whether the directory was -renamed; we want to be careful to avoid cacheing some kind of "this +renamed; we want to be careful to avoid caching some kind of "this directory was not renamed" statement. If we did, then a subsequent commit being rebased could add a file to the old directory, and the user would expect it to end up in the correct directory -- something diff --git a/Documentation/technical/repository-version.txt b/Documentation/technical/repository-version.txt index 7844ef30ff..045a76756f 100644 --- a/Documentation/technical/repository-version.txt +++ b/Documentation/technical/repository-version.txt @@ -82,9 +82,9 @@ When the config key `extensions.preciousObjects` is set to `true`, objects in the repository MUST NOT be deleted (e.g., by `git-prune` or `git repack -d`). -==== `partialclone` +==== `partialClone` -When the config key `extensions.partialclone` is set, it indicates +When the config key `extensions.partialClone` is set, it indicates that the repo was created with a partial clone (or later performed a partial fetch) and that the remote may have omitted sending certain unwanted objects. Such a remote is called a "promisor remote" @@ -96,7 +96,7 @@ The value of this key is the name of the promisor remote. ==== `worktreeConfig` If set, by default "git config" reads from both "config" and -"config.worktree" file from GIT_DIR in that order. In +"config.worktree" files from GIT_DIR in that order. In multiple working directory mode, "config" file is shared while "config.worktree" is per-working directory (i.e., it's in GIT_COMMON_DIR/worktrees/<id>/config.worktree) diff --git a/Documentation/technical/rerere.txt b/Documentation/technical/rerere.txt index 35d4541433..580f23360a 100644 --- a/Documentation/technical/rerere.txt +++ b/Documentation/technical/rerere.txt @@ -60,7 +60,7 @@ By resolving this conflict, to leave line D, the user declares: what AB and AC wanted to do. As branch AC2 refers to the same commit as AC, the above implies that -this is also compatible what AB and AC2 wanted to do. +this is also compatible with what AB and AC2 wanted to do. By extension, this means that rerere should recognize that the above conflicts are the same. To do this, the labels on the conflict @@ -76,7 +76,7 @@ examples would both result in the following normalized conflict: Sorting hunks ~~~~~~~~~~~~~ -As before, lets imagine that a common ancestor had a file with line A +As before, let's imagine that a common ancestor had a file with line A its early part, and line X in its late part. And then four branches are forked that do these things: @@ -99,7 +99,7 @@ conflict to leave line D means that the user declares: compatible with what AB and AC wanted to do. So the conflict we would see when merging AB into ACAB should be -resolved the same way---it is the resolution that is in line with that +resolved the same way--it is the resolution that is in line with that declaration. Imagine that similarly previously a branch XYXZ was forked from XY, @@ -145,7 +145,7 @@ Nested conflicts Nested conflicts are handled very similarly to "simple" conflicts. Similar to simple conflicts, the conflict is first normalized by stripping the labels from conflict markers, stripping the common ancestor -version, and the sorting the conflict hunks, both for the outer and the +version, and sorting the conflict hunks, both for the outer and the inner conflict. This is done recursively, so any number of nested conflicts can be handled. diff --git a/Documentation/technical/sparse-checkout.txt b/Documentation/technical/sparse-checkout.txt new file mode 100644 index 0000000000..fa0d01cbda --- /dev/null +++ b/Documentation/technical/sparse-checkout.txt @@ -0,0 +1,1103 @@ +Table of contents: + + * Terminology + * Purpose of sparse-checkouts + * Usecases of primary concern + * Oversimplified mental models ("Cliff Notes" for this document!) + * Desired behavior + * Behavior classes + * Subcommand-dependent defaults + * Sparse specification vs. sparsity patterns + * Implementation Questions + * Implementation Goals/Plans + * Known bugs + * Reference Emails + + +=== Terminology === + +cone mode: one of two modes for specifying the desired subset of files + in a sparse-checkout. In cone-mode, the user specifies + directories (getting both everything under that directory as + well as everything in leading directories), while in non-cone + mode, the user specifies gitignore-style patterns. Controlled + by the --[no-]cone option to sparse-checkout init|set. + +SKIP_WORKTREE: When tracked files do not match the sparse specification and + are removed from the working tree, the file in the index is marked + with a SKIP_WORKTREE bit. Note that if a tracked file has the + SKIP_WORKTREE bit set but the file is later written by the user to + the working tree anyway, the SKIP_WORKTREE bit will be cleared at + the beginning of any subsequent Git operation. + + Most sparse checkout users are unaware of this implementation + detail, and the term should generally be avoided in user-facing + descriptions and command flags. Unfortunately, prior to the + `sparse-checkout` subcommand this low-level detail was exposed, + and as of time of writing, is still exposed in various places. + +sparse-checkout: a subcommand in git used to reduce the files present in + the working tree to a subset of all tracked files. Also, the + name of the file in the $GIT_DIR/info directory used to track + the sparsity patterns corresponding to the user's desired + subset. + +sparse cone: see cone mode + +sparse directory: An entry in the index corresponding to a directory, which + appears in the index instead of all the files under that directory + that would normally appear. See also sparse-index. Something that + can cause confusion is that the "sparse directory" does NOT match + the sparse specification, i.e. the directory is NOT present in the + working tree. May be renamed in the future (e.g. to "skipped + directory"). + +sparse index: A special mode for sparse-checkout that also makes the + index sparse by recording a directory entry in lieu of all the + files underneath that directory (thus making that a "skipped + directory" which unfortunately has also been called a "sparse + directory"), and does this for potentially multiple + directories. Controlled by the --[no-]sparse-index option to + init|set|reapply. + +sparsity patterns: patterns from $GIT_DIR/info/sparse-checkout used to + define the set of files of interest. A warning: It is easy to + over-use this term (or the shortened "patterns" term), for two + reasons: (1) users in cone mode specify directories rather than + patterns (their directories are transformed into patterns, but + users may think you are talking about non-cone mode if you use the + word "patterns"), and (b) the sparse specification might + transiently differ in the working tree or index from the sparsity + patterns (see "Sparse specification vs. sparsity patterns"). + +sparse specification: The set of paths in the user's area of focus. This + is typically just the tracked files that match the sparsity + patterns, but the sparse specification can temporarily differ and + include additional files. (See also "Sparse specification + vs. sparsity patterns") + + * When working with history, the sparse specification is exactly + the set of files matching the sparsity patterns. + * When interacting with the working tree, the sparse specification + is the set of tracked files with a clear SKIP_WORKTREE bit or + tracked files present in the working copy. + * When modifying or showing results from the index, the sparse + specification is the set of files with a clear SKIP_WORKTREE bit + or that differ in the index from HEAD. + * If working with the index and the working copy, the sparse + specification is the union of the paths from above. + +vivifying: When a command restores a tracked file to the working tree (and + hopefully also clears the SKIP_WORKTREE bit in the index for that + file), this is referred to as "vivifying" the file. + + +=== Purpose of sparse-checkouts === + +sparse-checkouts exist to allow users to work with a subset of their +files. + +You can think of sparse-checkouts as subdividing "tracked" files into two +categories -- a sparse subset, and all the rest. Implementationally, we +mark "all the rest" in the index with a SKIP_WORKTREE bit and leave them +out of the working tree. The SKIP_WORKTREE files are still tracked, just +not present in the working tree. + +In the past, sparse-checkouts were defined by "SKIP_WORKTREE means the file +is missing from the working tree but pretend the file contents match HEAD". +That was not only bogus (it actually meant the file missing from the +working tree matched the index rather than HEAD), but it was also a +low-level detail which only provided decent behavior for a few commands. +There were a surprising number of ways in which that guiding principle gave +command results that violated user expectations, and as such was a bad +mental model. However, it persisted for many years and may still be found +in some corners of the code base. + +Anyway, the idea of "working with a subset of files" is simple enough, but +there are multiple different high-level usecases which affect how some Git +subcommands should behave. Further, even if we only considered one of +those usecases, sparse-checkouts can modify different subcommands in over a +half dozen different ways. Let's start by considering the high level +usecases: + + A) Users are _only_ interested in the sparse portion of the repo + + A*) Users are _only_ interested in the sparse portion of the repo + that they have downloaded so far + + B) Users want a sparse working tree, but are working in a larger whole + + C) sparse-checkout is a behind-the-scenes implementation detail allowing + Git to work with a specially crafted in-house virtual file system; + users are actually working with a "full" working tree that is + lazily populated, and sparse-checkout helps with the lazy population + piece. + +It may be worth explaining each of these in a bit more detail: + + + (Behavior A) Users are _only_ interested in the sparse portion of the repo + +These folks might know there are other things in the repository, but +don't care. They are uninterested in other parts of the repository, and +only want to know about changes within their area of interest. Showing +them other files from history (e.g. from diff/log/grep/etc.) is a +usability annoyance, potentially a huge one since other changes in +history may dwarf the changes they are interested in. + +Some of these users also arrive at this usecase from wanting to use partial +clones together with sparse checkouts (in a way where they have downloaded +blobs within the sparse specification) and do disconnected development. +Not only do these users generally not care about other parts of the +repository, but consider it a blocker for Git commands to try to operate on +those. If commands attempt to access paths in history outside the sparsity +specification, then the partial clone will attempt to download additional +blobs on demand, fail, and then fail the user's command. (This may be +unavoidable in some cases, e.g. when `git merge` has non-trivial changes to +reconcile outside the sparse specification, but we should limit how often +users are forced to connect to the network.) + +Also, even for users using partial clones that do not mind being +always connected to the network, the need to download blobs as +side-effects of various other commands (such as the printed diffstat +after a merge or pull) can lead to worries about local repository size +growing unnecessarily[10]. + + (Behavior A*) Users are _only_ interested in the sparse portion of the repo + that they have downloaded so far (a variant on the first usecase) + +This variant is driven by folks who using partial clones together with +sparse checkouts and do disconnected development (so far sounding like a +subset of behavior A users) and doing so on very large repositories. The +reason for yet another variant is that downloading even just the blobs +through history within their sparse specification may be too much, so they +only download some. They would still like operations to succeed without +network connectivity, though, so things like `git log -S${SEARCH_TERM} -p` +or `git grep ${SEARCH_TERM} OLDREV ` would need to be prepared to provide +partial results that depend on what happens to have been downloaded. + +This variant could be viewed as Behavior A with the sparse specification +for history querying operations modified from "sparsity patterns" to +"sparsity patterns limited to the blobs we have already downloaded". + + (Behavior B) Users want a sparse working tree, but are working in a + larger whole + +Stolee described this usecase this way[11]: + +"I'm also focused on users that know that they are a part of a larger +whole. They know they are operating on a large repository but focus on +what they need to contribute their part. I expect multiple "roles" to +use very different, almost disjoint parts of the codebase. Some other +"architect" users operate across the entire tree or hop between different +sections of the codebase as necessary. In this situation, I'm wary of +scoping too many features to the sparse-checkout definition, especially +"git log," as it can be too confusing to have their view of the codebase +depend on your "point of view." + +People might also end up wanting behavior B due to complex inter-project +dependencies. The initial attempts to use sparse-checkouts usually involve +the directories you are directly interested in plus what those directories +depend upon within your repository. But there's a monkey wrench here: if +you have integration tests, they invert the hierarchy: to run integration +tests, you need not only what you are interested in and its in-tree +dependencies, you also need everything that depends upon what you are +interested in or that depends upon one of your dependencies...AND you need +all the in-tree dependencies of that expanded group. That can easily +change your sparse-checkout into a nearly dense one. + +Naturally, that tends to kill the benefits of sparse-checkouts. There are +a couple solutions to this conundrum: either avoid grabbing in-repo +dependencies (maybe have built versions of your in-repo dependencies pulled +from a CI cache somewhere), or say that users shouldn't run integration +tests directly and instead do it on the CI server when they submit a code +review. Or do both. Regardless of whether you stub out your in-repo +dependencies or stub out the things that depend upon you, there is +certainly a reason to want to query and be aware of those other stubbed-out +parts of the repository, particularly when the dependencies are complex or +change relatively frequently. Thus, for such uses, sparse-checkouts can be +used to limit what you directly build and modify, but these users do not +necessarily want their sparse checkout paths to limit their queries of +versions in history. + +Some people may also be interested in behavior B over behavior A simply as +a performance workaround: if they are using non-cone mode, then they have +to deal with its inherent quadratic performance problems. In that mode, +every operation that checks whether paths match the sparsity specification +can be expensive. As such, these users may only be willing to pay for +those expensive checks when interacting with the working copy, and may +prefer getting "unrelated" results from their history queries over having +slow commands. + + (Behavior C) sparse-checkout is an implementational detail supporting a + special VFS. + +This usecase goes slightly against the traditional definition of +sparse-checkout in that it actually tries to present a full or dense +checkout to the user. However, this usecase utilizes the same underlying +technical underpinnings in a new way which does provide some performance +advantages to users. The basic idea is that a company can have an in-house +Git-aware Virtual File System which pretends all files are present in the +working tree, by intercepting all file system accesses and using those to +fetch and write accessed files on demand via partial clones. The VFS uses +sparse-checkout to prevent Git from writing or paying attention to many +files, and manually updates the sparse checkout patterns itself based on +user access and modification of files in the working tree. See commit +ecc7c8841d ("repo_read_index: add config to expect files outside sparse +patterns", 2022-02-25) and the link at [17] for a more detailed description +of such a VFS. + +The biggest difference here is that users are completely unaware that the +sparse-checkout machinery is even in use. The sparse patterns are not +specified by the user but rather are under the complete control of the VFS +(and the patterns are updated frequently and dynamically by it). The user +will perceive the checkout as dense, and commands should thus behave as if +all files are present. + + +=== Usecases of primary concern === + +Most of the rest of this document will focus on Behavior A and Behavior +B. Some notes about the other two cases and why we are not focusing on +them: + + (Behavior A*) + +Supporting this usecase is estimated to be difficult and a lot of work. +There are no plans to implement it currently, but it may be a potential +future alternative. Knowing about the existence of additional alternatives +may affect our choice of command line flags (e.g. if we need tri-state or +quad-state flags rather than just binary flags), so it was still important +to at least note. + +Further, I believe the descriptions below for Behavior A are probably still +valid for this usecase, with the only exception being that it redefines the +sparse specification to restrict it to already-downloaded blobs. The hard +part is in making commands capable of respecting that modified definition. + + (Behavior C) + +This usecase violates some of the early sparse-checkout documented +assumptions (since files marked as SKIP_WORKTREE will be displayed to users +as present in the working tree). That violation may mean various +sparse-checkout related behaviors are not well suited to this usecase and +we may need tweaks -- to both documentation and code -- to handle it. +However, this usecase is also perhaps the simplest model to support in that +everything behaves like a dense checkout with a few exceptions (e.g. branch +checkouts and switches write fewer things, knowing the VFS will lazily +write the rest on an as-needed basis). + +Since there is no publically available VFS-related code for folks to try, +the number of folks who can test such a usecase is limited. + +The primary reason to note the Behavior C usecase is that as we fix things +to better support Behaviors A and B, there may be additional places where +we need to make tweaks allowing folks in this usecase to get the original +non-sparse treatment. For an example, see ecc7c8841d ("repo_read_index: +add config to expect files outside sparse patterns", 2022-02-25). The +secondary reason to note Behavior C, is so that folks taking advantage of +Behavior C do not assume they are part of the Behavior B camp and propose +patches that break things for the real Behavior B folks. + + +=== Oversimplified mental models === + +An oversimplification of the differences in the above behaviors is: + + Behavior A: Restrict worktree and history operations to sparse specification + Behavior B: Restrict worktree operations to sparse specification; have any + history operations work across all files + Behavior C: Do not restrict either worktree or history operations to the + sparse specification...with the exception of branch checkouts or + switches which avoid writing files that will match the index so + they can later lazily be populated instead. + + +=== Desired behavior === + +As noted previously, despite the simple idea of just working with a subset +of files, there are a range of different behavioral changes that need to be +made to different subcommands to work well with such a feature. See +[1,2,3,4,5,6,7,8,9,10] for various examples. In particular, at [2], we saw +that mere composition of other commands that individually worked correctly +in a sparse-checkout context did not imply that the higher level command +would work correctly; it sometimes requires further tweaks. So, +understanding these differences can be beneficial. + +* Commands behaving the same regardless of high-level use-case + + * commands that only look at files within the sparsity specification + + * diff (without --cached or REVISION arguments) + * grep (without --cached or REVISION arguments) + * diff-files + + * commands that restore files to the working tree that match sparsity + patterns, and remove unmodified files that don't match those + patterns: + + * switch + * checkout (the switch-like half) + * read-tree + * reset --hard + + * commands that write conflicted files to the working tree, but otherwise + will omit writing files to the working tree that do not match the + sparsity patterns: + + * merge + * rebase + * cherry-pick + * revert + + * `am` and `apply --cached` should probably be in this section but + are buggy (see the "Known bugs" section below) + + The behavior for these commands somewhat depends upon the merge + strategy being used: + * `ort` behaves as described above + * `recursive` tries to not vivify files unnecessarily, but does sometimes + vivify files without conflicts. + * `octopus` and `resolve` will always vivify any file changed in the merge + relative to the first parent, which is rather suboptimal. + + It is also important to note that these commands WILL update the index + outside the sparse specification relative to when the operation began, + BUT these commands often make a commit just before or after such that + by the end of the operation there is no change to the index outside the + sparse specification. Of course, if the operation hits conflicts or + does not make a commit, then these operations clearly can modify the + index outside the sparse specification. + + Finally, it is important to note that at least the first four of these + commands also try to remove differences between the sparse + specification and the sparsity patterns (much like the commands in the + previous section). + + * commands that always ignore sparsity since commits must be full-tree + + * archive + * bundle + * commit + * format-patch + * fast-export + * fast-import + * commit-tree + + * commands that write any modified file to the working tree (conflicted + or not, and whether those paths match sparsity patterns or not): + + * stash + * apply (without `--index` or `--cached`) + +* Commands that may slightly differ for behavior A vs. behavior B: + + Commands in this category behave mostly the same between the two + behaviors, but may differ in verbosity and types of warning and error + messages. + + * commands that make modifications to which files are tracked: + * add + * rm + * mv + * update-index + + The fact that files can move between the 'tracked' and 'untracked' + categories means some commands will have to treat untracked files + differently. But if we have to treat untracked files differently, + then additional commands may also need changes: + + * status + * clean + + In particular, `status` may need to report any untracked files outside + the sparsity specification as an erroneous condition (especially to + avoid the user trying to `git add` them, forcing `git add` to display + an error). + + It's not clear to me exactly how (or even if) `clean` would change, + but it's the other command that also affects untracked files. + + `update-index` may be slightly special. Its --[no-]skip-worktree flag + may need to ignore the sparse specification by its nature. Also, its + current --[no-]ignore-skip-worktree-entries default is totally bogus. + + * commands for manually tweaking paths in both the index and the working tree + * `restore` + * the restore-like half of `checkout` + + These commands should be similar to add/rm/mv in that they should + only operate on the sparse specification by default, and require a + special flag to operate on all files. + + Also, note that these commands currently have a number of issues (see + the "Known bugs" section below) + +* Commands that significantly differ for behavior A vs. behavior B: + + * commands that query history + * diff (with --cached or REVISION arguments) + * grep (with --cached or REVISION arguments) + * show (when given commit arguments) + * blame (only matters when one or more -C flags are passed) + * and annotate + * log + * whatchanged + * ls-files + * diff-index + * diff-tree + * ls-tree + + Note: for log and whatchanged, revision walking logic is unaffected + but displaying of patches is affected by scoping the command to the + sparse-checkout. (The fact that revision walking is unaffected is + why rev-list, shortlog, show-branch, and bisect are not in this + list.) + + ls-files may be slightly special in that e.g. `git ls-files -t` is + often used to see what is sparse and what is not. Perhaps -t should + always work on the full tree? + +* Commands I don't know how to classify + + * range-diff + + Is this like `log` or `format-patch`? + + * cherry + + See range-diff + +* Commands unaffected by sparse-checkouts + + * shortlog + * show-branch + * rev-list + * bisect + + * branch + * describe + * fetch + * gc + * init + * maintenance + * notes + * pull (merge & rebase have the necessary changes) + * push + * submodule + * tag + + * config + * filter-branch (works in separate checkout without sparse-checkout setup) + * pack-refs + * prune + * remote + * repack + * replace + + * bugreport + * count-objects + * fsck + * gitweb + * help + * instaweb + * merge-tree (doesn't touch worktree or index, and merges always compute full-tree) + * rerere + * verify-commit + * verify-tag + + * commit-graph + * hash-object + * index-pack + * mktag + * mktree + * multi-pack-index + * pack-objects + * prune-packed + * symbolic-ref + * unpack-objects + * update-ref + * write-tree (operates on index, possibly optimized to use sparse dir entries) + + * for-each-ref + * get-tar-commit-id + * ls-remote + * merge-base (merges are computed full tree, so merge base should be too) + * name-rev + * pack-redundant + * rev-parse + * show-index + * show-ref + * unpack-file + * var + * verify-pack + + * <Everything under 'Interacting with Others' in 'git help --all'> + * <Everything under 'Low-level...Syncing' in 'git help --all'> + * <Everything under 'Low-level...Internal Helpers' in 'git help --all'> + * <Everything under 'External commands' in 'git help --all'> + +* Commands that might be affected, but who cares? + + * merge-file + * merge-index + * gitk? + + +=== Behavior classes === + +From the above there are a few classes of behavior: + + * "restrict" + + Commands in this class only read or write files in the working tree + within the sparse specification. + + When moving to a new commit (e.g. switch, reset --hard), these commands + may update index files outside the sparse specification as of the start + of the operation, but by the end of the operation those index files + will match HEAD again and thus those files will again be outside the + sparse specification. + + When paths are explicitly specified, these paths are intersected with + the sparse specification and will only operate on such paths. + (e.g. `git restore [--staged] -- '*.png'`, `git reset -p -- '*.md'`) + + Some of these commands may also attempt, at the end of their operation, + to cull transient differences between the sparse specification and the + sparsity patterns (see "Sparse specification vs. sparsity patterns" for + details, but this basically means either removing unmodified files not + matching the sparsity patterns and marking those files as + SKIP_WORKTREE, or vivifying files that match the sparsity patterns and + marking those files as !SKIP_WORKTREE). + + * "restrict modulo conflicts" + + Commands in this class generally behave like the "restrict" class, + except that: + (1) they will ignore the sparse specification and write files with + conflicts to the working tree (thus temporarily expanding the + sparse specification to include such files.) + (2) they are grouped with commands which move to a new commit, since + they often create a commit and then move to it, even though we + know there are many exceptions to moving to the new commit. (For + example, the user may rebase a commit that becomes empty, or have + a cherry-pick which conflicts, or a user could run `merge + --no-commit`, and we also view `apply --index` kind of like `am + --no-commit`.) As such, these commands can make changes to index + files outside the sparse specification, though they'll mark such + files with SKIP_WORKTREE. + + * "restrict also specially applied to untracked files" + + Commands in this class generally behave like the "restrict" class, + except that they have to handle untracked files differently too, often + because these commands are dealing with files changing state between + 'tracked' and 'untracked'. Often, this may mean printing an error + message if the command had nothing to do, but the arguments may have + referred to files whose tracked-ness state could have changed were it + not for the sparsity patterns excluding them. + + * "no restrict" + + Commands in this class ignore the sparse specification entirely. + + * "restrict or no restrict dependent upon behavior A vs. behavior B" + + Commands in this class behave like "no restrict" for folks in the + behavior B camp, and like "restrict" for folks in the behavior A camp. + However, when behaving like "restrict" a warning of some sort might be + provided that history queries have been limited by the sparse-checkout + specification. + + +=== Subcommand-dependent defaults === + +Note that we have different defaults depending on the command for the +desired behavior : + + * Commands defaulting to "restrict": + * diff-files + * diff (without --cached or REVISION arguments) + * grep (without --cached or REVISION arguments) + * switch + * checkout (the switch-like half) + * reset (<commit>) + + * restore + * checkout (the restore-like half) + * checkout-index + * reset (with pathspec) + + This behavior makes sense; these interact with the working tree. + + * Commands defaulting to "restrict modulo conflicts": + * merge + * rebase + * cherry-pick + * revert + + * am + * apply --index (which is kind of like an `am --no-commit`) + + * read-tree (especially with -m or -u; is kind of like a --no-commit merge) + * reset (<tree-ish>, due to similarity to read-tree) + + These also interact with the working tree, but require slightly + different behavior either so that (a) conflicts can be resolved or (b) + because they are kind of like a merge-without-commit operation. + + (See also the "Known bugs" section below regarding `am` and `apply`) + + * Commands defaulting to "no restrict": + * archive + * bundle + * commit + * format-patch + * fast-export + * fast-import + * commit-tree + + * stash + * apply (without `--index`) + + These have completely different defaults and perhaps deserve the most + detailed explanation: + + In the case of commands in the first group (format-patch, + fast-export, bundle, archive, etc.), these are commands for + communicating history, which will be broken if they restrict to a + subset of the repository. As such, they operate on full paths and + have no `--restrict` option for overriding. Some of these commands may + take paths for manually restricting what is exported, but it needs to + be very explicit. + + In the case of stash, it needs to vivify files to avoid losing the + user's changes. + + In the case of apply without `--index`, that command needs to update + the working tree without the index (or the index without the working + tree if `--cached` is passed), and if we restrict those updates to the + sparse specification then we'll lose changes from the user. + + * Commands defaulting to "restrict also specially applied to untracked files": + * add + * rm + * mv + * update-index + * status + * clean (?) + + Our original implementation for the first three of these commands was + "no restrict", but it had some severe usability issues: + * `git add <somefile>` if honored and outside the sparse + specification, can result in the file randomly disappearing later + when some subsequent command is run (since various commands + automatically clean up unmodified files outside the sparse + specification). + * `git rm '*.jpg'` could very negatively surprise users if it deletes + files outside the range of the user's interest. + * `git mv` has similar surprises when moving into or out of the cone, + so best to restrict by default + + So, we switched `add` and `rm` to default to "restrict", which made + usability problems much less severe and less frequent, but we still got + complaints because commands like: + git add <file-outside-sparse-specification> + git rm <file-outside-sparse-specification> + would silently do nothing. We should instead print an error in those + cases to get usability right. + + update-index needs to be updated to match, and status and maybe clean + also need to be updated to specially handle untracked paths. + + There may be a difference in here between behavior A and behavior B in + terms of verboseness of errors or additional warnings. + + * Commands falling under "restrict or no restrict dependent upon behavior + A vs. behavior B" + + * diff (with --cached or REVISION arguments) + * grep (with --cached or REVISION arguments) + * show (when given commit arguments) + * blame (only matters when one or more -C flags passed) + * and annotate + * log + * and variants: shortlog, gitk, show-branch, whatchanged, rev-list + * ls-files + * diff-index + * diff-tree + * ls-tree + + For now, we default to behavior B for these, which want a default of + "no restrict". + + Note that two of these commands -- diff and grep -- also appeared in a + different list with a default of "restrict", but only when limited to + searching the working tree. The working tree vs. history distinction + is fundamental in how behavior B operates, so this is expected. Note, + though, that for diff and grep with --cached, when doing "restrict" + behavior, the difference between sparse specification and sparsity + patterns is important to handle. + + "restrict" may make more sense as the long term default for these[12]. + Also, supporting "restrict" for these commands might be a fair amount + of work to implement, meaning it might be implemented over multiple + releases. If that behavior were the default in the commands that + supported it, that would force behavior B users to need to learn to + slowly add additional flags to their commands, depending on git + version, to get the behavior they want. That gradual switchover would + be painful, so we should avoid it at least until it's fully + implemented. + + +=== Sparse specification vs. sparsity patterns === + +In a well-behaved situation, the sparse specification is given directly +by the $GIT_DIR/info/sparse-checkout file. However, it can transiently +diverge for a few reasons: + + * needing to resolve conflicts (merging will vivify conflicted files) + * running Git commands that implicitly vivify files (e.g. "git stash apply") + * running Git commands that explicitly vivify files (e.g. "git checkout + --ignore-skip-worktree-bits FILENAME") + * other commands that write to these files (perhaps a user copies it + from elsewhere) + +For the last item, note that we do automatically clear the SKIP_WORKTREE +bit for files that are present in the working tree. This has been true +since 82386b4496 ("Merge branch 'en/present-despite-skipped'", +2022-03-09) + +However, such a situation is transient because: + + * Such transient differences can and will be automatically removed as + a side-effect of commands which call unpack_trees() (checkout, + merge, reset, etc.). + * Users can also request such transient differences be corrected via + running `git sparse-checkout reapply`. Various places recommend + running that command. + * Additional commands are also welcome to implicitly fix these + differences; we may add more in the future. + +While we avoid dropping unstaged changes or files which have conflicts, +we otherwise aggressively try to fix these transient differences. If +users want these differences to persist, they should run the `set` or +`add` subcommands of `git sparse-checkout` to reflect their intended +sparse specification. + +However, when we need to do a query on history restricted to the +"relevant subset of files" such a transiently expanded sparse +specification is ignored. There are a couple reasons for this: + + * The behavior wanted when doing something like + git grep expression REVISION + is roughly what the users would expect from + git checkout REVISION && git grep expression + (modulo a "REVISION:" prefix), which has a couple ramifications: + + * REVISION may have paths not in the current index, so there is no + path we can consult for a SKIP_WORKTREE setting for those paths. + + * Since `checkout` is one of those commands that tries to remove + transient differences in the sparse specification, it makes sense + to use the corrected sparse specification + (i.e. $GIT_DIR/info/sparse-checkout) rather than attempting to + consult SKIP_WORKTREE anyway. + +So, a transiently expanded (or restricted) sparse specification applies to +the working tree, but not to history queries where we always use the +sparsity patterns. (See [16] for an early discussion of this.) + +Similar to a transiently expanded sparse specification of the working tree +based on additional files being present in the working tree, we also need +to consider additional files being modified in the index. In particular, +if the user has staged changes to files (relative to HEAD) that do not +match the sparsity patterns, and the file is not present in the working +tree, we still want to consider the file part of the sparse specification +if we are specifically performing a query related to the index (e.g. git +diff --cached [REVISION], git diff-index [REVISION], git restore --staged +--source=REVISION -- PATHS, etc.) Note that a transiently expanded sparse +specification for the index usually only matters under behavior A, since +under behavior B index operations are lumped with history and tend to +operate full-tree. + + +=== Implementation Questions === + + * Do the options --scope={sparse,all} sound good to others? Are there better + options? + * Names in use, or appearing in patches, or previously suggested: + * --sparse/--dense + * --ignore-skip-worktree-bits + * --ignore-skip-worktree-entries + * --ignore-sparsity + * --[no-]restrict-to-sparse-paths + * --full-tree/--sparse-tree + * --[no-]restrict + * --scope={sparse,all} + * --focus/--unfocus + * --limit/--unlimited + * Rationale making me lean slightly towards --scope={sparse,all}: + * We want a name that works for many commands, so we need a name that + does not conflict + * We know that we have more than two possible usecases, so it is best + to avoid a flag that appears to be binary. + * --scope={sparse,all} isn't overly long and seems relatively + explanatory + * `--sparse`, as used in add/rm/mv, is totally backwards for + grep/log/etc. Changing the meaning of `--sparse` for these + commands would fix the backwardness, but possibly break existing + scripts. Using a new name pairing would allow us to treat + `--sparse` in these commands as a deprecated alias. + * There is a different `--sparse`/`--dense` pair for commands using + revision machinery, so using that naming might cause confusion + * There is also a `--sparse` in both pack-objects and show-branch, which + don't conflict but do suggest that `--sparse` is overloaded + * The name --ignore-skip-worktree-bits is a double negative, is + quite a mouthful, refers to an implementation detail that many + users may not be familiar with, and we'd need a negation for it + which would probably be even more ridiculously long. (But we + can make --ignore-skip-worktree-bits a deprecated alias for + --no-restrict.) + + * If a config option is added (sparse.scope?) what should the values and + description be? "sparse" (behavior A), "worktree-sparse-history-dense" + (behavior B), "dense" (behavior C)? There's a risk of confusion, + because even for Behaviors A and B we want some commands to be + full-tree and others to operate sparsely, so the wording may need to be + more tied to the usecases and somehow explain that. Also, right now, + the primary difference we are focusing is just the history-querying + commands (log/diff/grep). Previous config suggestion here: [13] + + * Is `--no-expand` a good alias for ls-files's `--sparse` option? + (`--sparse` does not map to either `--scope=sparse` or `--scope=all`, + because in non-cone mode it does nothing and in cone-mode it shows the + sparse directory entries which are technically outside the sparse + specification) + + * Under Behavior A: + * Does ls-files' `--no-expand` override the default `--scope=all`, or + does it need an extra flag? + * Does ls-files' `-t` option imply `--scope=all`? + * Does update-index's `--[no-]skip-worktree` option imply `--scope=all`? + + * sparse-checkout: once behavior A is fully implemented, should we take + an interim measure to ease people into switching the default? Namely, + if folks are not already in a sparse checkout, then require + `sparse-checkout init/set` to take a + `--set-scope=(sparse|worktree-sparse-history-dense|dense)` flag (which + would set sparse.scope according to the setting given), and throw an + error if the flag is not provided? That error would be a great place + to warn folks that the default may change in the future, and get them + used to specifying what they want so that the eventual default switch + is seamless for them. + + +=== Implementation Goals/Plans === + + * Get buy-in on this document in general. + + * Figure out answers to the 'Implementation Questions' sections (above) + + * Fix bugs in the 'Known bugs' section (below) + + * Provide some kind of method for backfilling the blobs within the sparse + specification in a partial clone + + [Below here is kind of spitballing since the first two haven't been resolved] + + * update-index: flip the default to --no-ignore-skip-worktree-entries, + nuke this stupid "Oh, there's a bug? Let me add a flag to let users + request that they not trigger this bug." flag + + * Flags & Config + * Make `--sparse` in add/rm/mv a deprecated alias for `--scope=all` + * Make `--ignore-skip-worktree-bits` in checkout-index/checkout/restore + a deprecated aliases for `--scope=all` + * Create config option (sparse.scope?), tie it to the "Cliff notes" + overview + + * Add --scope=sparse (and --scope=all) flag to each of the history querying + commands. IMPORTANT: make sure diff machinery changes don't mess with + format-patch, fast-export, etc. + +=== Known bugs === + +This list used to be a lot longer (see e.g. [1,2,3,4,5,6,7,8,9]), but we've +been working on it. + +0. Behavior A is not well supported in Git. (Behavior B didn't used to + be either, but was the easier of the two to implement.) + +1. am and apply: + + apply, without `--index` or `--cached`, relies on files being present + in the working copy, and also writes to them unconditionally. As + such, it should first check for the files' presence, and if found to + be SKIP_WORKTREE, then clear the bit and vivify the paths, then do + its work. Currently, it just throws an error. + + apply, with either `--cached` or `--index`, will not preserve the + SKIP_WORKTREE bit. This is fine if the file has conflicts, but + otherwise SKIP_WORKTREE bits should be preserved for --cached and + probably also for --index. + + am, if there are no conflicts, will vivify files and fail to preserve + the SKIP_WORKTREE bit. If there are conflicts and `-3` is not + specified, it will vivify files and then complain the patch doesn't + apply. If there are conflicts and `-3` is specified, it will vivify + files and then complain that those vivified files would be + overwritten by merge. + +2. reset --hard: + + reset --hard provides confusing error message (works correctly, but + misleads the user into believing it didn't): + + $ touch addme + $ git add addme + $ git ls-files -t + H addme + H tracked + S tracked-but-maybe-skipped + $ git reset --hard # usually works great + error: Path 'addme' not uptodate; will not remove from working tree. + HEAD is now at bdbbb6f third + $ git ls-files -t + H tracked + S tracked-but-maybe-skipped + $ ls -1 + tracked + + `git reset --hard` DID remove addme from the index and the working tree, contrary + to the error message, but in line with how reset --hard should behave. + +3. read-tree + + `read-tree` doesn't apply the 'SKIP_WORKTREE' bit to *any* of the + entries it reads into the index, resulting in all your files suddenly + appearing to be "deleted". + +4. Checkout, restore: + + These command do not handle path & revision arguments appropriately: + + $ ls + tracked + $ git ls-files -t + H tracked + S tracked-but-maybe-skipped + $ git status --porcelain + $ git checkout -- '*skipped' + error: pathspec '*skipped' did not match any file(s) known to git + $ git ls-files -- '*skipped' + tracked-but-maybe-skipped + $ git checkout HEAD -- '*skipped' + error: pathspec '*skipped' did not match any file(s) known to git + $ git ls-tree HEAD | grep skipped + 100644 blob 276f5a64354b791b13840f02047738c77ad0584f tracked-but-maybe-skipped + $ git status --porcelain + $ git checkout HEAD~1 -- '*skipped' + $ git ls-files -t + H tracked + H tracked-but-maybe-skipped + $ git status --porcelain + M tracked-but-maybe-skipped + $ git checkout HEAD -- '*skipped' + $ git status --porcelain + $ + + Note that checkout without a revision (or restore --staged) fails to + find a file to restore from the index, even though ls-files shows + such a file certainly exists. + + Similar issues occur with HEAD (--source=HEAD in restore's case), + but suddenly works when HEAD~1 is specified. And then after that it + will work with HEAD specified, even though it didn't before. + + Directories are also an issue: + + $ git sparse-checkout set nomatches + $ git status + On branch main + You are in a sparse checkout with 0% of tracked files present. + + nothing to commit, working tree clean + $ git checkout . + error: pathspec '.' did not match any file(s) known to git + $ git checkout HEAD~1 . + Updated 1 path from 58916d9 + $ git ls-files -t + S tracked + H tracked-but-maybe-skipped + +5. checkout and restore --staged, continued: + + These commands do not correctly scope operations to the sparse + specification, and make it worse by not setting important SKIP_WORKTREE + bits: + + $ git restore --source OLDREV --staged outside-sparse-cone/ + $ git status --porcelain + MD outside-sparse-cone/file1 + MD outside-sparse-cone/file2 + MD outside-sparse-cone/file3 + + We can add a --scope=all mode to `git restore` to let it operate outside + the sparse specification, but then it will be important to set the + SKIP_WORKTREE bits appropriately. + +6. Performance issues; see: + https://lore.kernel.org/git/CABPp-BEkJQoKZsQGCYioyga_uoDQ6iBeW+FKr8JhyuuTMK1RDw@mail.gmail.com/ + + +=== Reference Emails === + +Emails that detail various bugs we've had in sparse-checkout: + +[1] (Original descriptions of behavior A & behavior B) + https://lore.kernel.org/git/CABPp-BGJ_Nvi5TmgriD9Bh6eNXE2EDq2f8e8QKXAeYG3BxZafA@mail.gmail.com/ +[2] (Fix stash applications in sparse checkouts; bugs from behavioral differences) + https://lore.kernel.org/git/ccfedc7140dbf63ba26a15f93bd3885180b26517.1606861519.git.gitgitgadget@gmail.com/ +[3] (Present-despite-skipped entries) + https://lore.kernel.org/git/11d46a399d26c913787b704d2b7169cafc28d639.1642175983.git.gitgitgadget@gmail.com/ +[4] (Clone --no-checkout interaction) + https://lore.kernel.org/git/pull.801.v2.git.git.1591324899170.gitgitgadget@gmail.com/ (clone --no-checkout) +[5] (The need for update_sparsity() and avoiding `read-tree -mu HEAD`) + https://lore.kernel.org/git/3a1f084641eb47515b5a41ed4409a36128913309.1585270142.git.gitgitgadget@gmail.com/ +[6] (SKIP_WORKTREE is advisory, not mandatory) + https://lore.kernel.org/git/844306c3e86ef67591cc086decb2b760e7d710a3.1585270142.git.gitgitgadget@gmail.com/ +[7] (`worktree add` should copy sparsity settings from current worktree) + https://lore.kernel.org/git/c51cb3714e7b1d2f8c9370fe87eca9984ff4859f.1644269584.git.gitgitgadget@gmail.com/ +[8] (Avoid negative surprises in add, rm, and mv) + https://lore.kernel.org/git/cover.1617914011.git.matheus.bernardino@usp.br/ + https://lore.kernel.org/git/pull.1018.v4.git.1632497954.gitgitgadget@gmail.com/ +[9] (Move from out-of-cone to in-cone) + https://lore.kernel.org/git/20220630023737.473690-6-shaoxuan.yuan02@gmail.com/ + https://lore.kernel.org/git/20220630023737.473690-4-shaoxuan.yuan02@gmail.com/ +[10] (Unnecessarily downloading objects outside sparse specification) + https://lore.kernel.org/git/CAOLTT8QfwOi9yx_qZZgyGa8iL8kHWutEED7ok_jxwTcYT_hf9Q@mail.gmail.com/ + +[11] (Stolee's comments on high-level usecases) + https://lore.kernel.org/git/1a1e33f6-3514-9afc-0a28-5a6b85bd8014@gmail.com/ + +[12] Others commenting on eventually switching default to behavior A: + * https://lore.kernel.org/git/xmqqh719pcoo.fsf@gitster.g/ + * https://lore.kernel.org/git/xmqqzgeqw0sy.fsf@gitster.g/ + * https://lore.kernel.org/git/a86af661-cf58-a4e5-0214-a67d3a794d7e@github.com/ + +[13] Previous config name suggestion and description + * https://lore.kernel.org/git/CABPp-BE6zW0nJSStcVU=_DoDBnPgLqOR8pkTXK3dW11=T01OhA@mail.gmail.com/ + +[14] Tangential issue: switch to cone mode as default sparse specification mechanism: + https://lore.kernel.org/git/a1b68fd6126eb341ef3637bb93fedad4309b36d0.1650594746.git.gitgitgadget@gmail.com/ + +[15] Lengthy email on grep behavior, covering what should be searched: + * https://lore.kernel.org/git/CABPp-BGVO3QdbfE84uF_3QDF0-y2iHHh6G5FAFzNRfeRitkuHw@mail.gmail.com/ + +[16] Email explaining sparsity patterns vs. SKIP_WORKTREE and history operations, + search for the parenthetical comment starting "We do not check". + https://lore.kernel.org/git/CABPp-BFsCPPNOZ92JQRJeGyNd0e-TCW-LcLyr0i_+VSQJP+GCg@mail.gmail.com/ + +[17] https://lore.kernel.org/git/20220207190320.2960362-1-jonathantanmy@google.com/ |
