linux - Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/

Age	Commit message (Collapse)	Author	Files	Lines
2019-05-23	tools/io_uring: fix Makefile for pthread library link	Jens Axboe	1	-1/+1
	Currently fails with: io_uring-bench.o: In function `main': /home/axboe/git/linux-block/tools/io_uring/io_uring-bench.c:560: undefined reference to `pthread_create' /home/axboe/git/linux-block/tools/io_uring/io_uring-bench.c:588: undefined reference to `pthread_join' collect2: error: ld returned 1 exit status Makefile:11: recipe for target 'io_uring-bench' failed make: *** [io_uring-bench] Error 1 Move -lpthread to the end. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-23	blk-mq: fix hang caused by freeze/unfreeze sequence	Bob Liu	3	-11/+18
	The following is a description of a hang in blk_mq_freeze_queue_wait(). The hang happens on attempt to freeze a queue while another task does queue unfreeze. The root cause is an incorrect sequence of percpu_ref_resurrect() and percpu_ref_kill() and as a result those two can be swapped: CPU#0 CPU#1 ---------------- ----------------- q1 = blk_mq_init_queue(shared_tags) q2 = blk_mq_init_queue(shared_tags): blk_mq_add_queue_tag_set(shared_tags): blk_mq_update_tag_set_depth(shared_tags): list_for_each_entry() blk_mq_freeze_queue(q1) > percpu_ref_kill() > blk_mq_freeze_queue_wait() blk_cleanup_queue(q1) blk_mq_freeze_queue(q1) > percpu_ref_kill() ^^^^^^ freeze_depth can't guarantee the order blk_mq_unfreeze_queue() > percpu_ref_resurrect() > blk_mq_freeze_queue_wait() ^^^^^^ Hang here!!!! This wrong sequence raises kernel warning: percpu_ref_kill_and_confirm called more than once on blk_queue_usage_counter_release! WARNING: CPU: 0 PID: 11854 at lib/percpu-refcount.c:336 percpu_ref_kill_and_confirm+0x99/0xb0 But the most unpleasant effect is a hang of a blk_mq_freeze_queue_wait(), which waits for a zero of a q_usage_counter, which never happens because percpu-ref was reinited (instead of being killed) and stays in PERCPU state forever. How to reproduce: - "insmod null_blk.ko shared_tags=1 nr_devices=0 queue_mode=2" - cpu0: python Script.py 0; taskset the corresponding process running on cpu0 - cpu1: python Script.py 1; taskset the corresponding process running on cpu1 Script.py: ------ #!/usr/bin/python3 import os import sys while True: on = "echo 1 > /sys/kernel/config/nullb/%s/power" % sys.argv[1] off = "echo 0 > /sys/kernel/config/nullb/%s/power" % sys.argv[1] os.system(on) os.system(off) ------ This bug was first reported and fixed by Roman, previous discussion: [1] Message id: 1443287365-4244-7-git-send-email-akinobu.mita@gmail.com [2] Message id: 1443563240-29306-6-git-send-email-tj@kernel.org [3] https://patchwork.kernel.org/patch/9268199/ Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com> Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-23	block: remove the bi_seg_{front,back}_size fields in struct bio	Christoph Hellwig	2	-89/+12
	At this point these fields aren't used for anything, so we can remove them. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-23	block: remove the segment size check in bio_will_gap	Christoph Hellwig	1	-18/+1
	We fundamentally do not have a maximum segement size for devices with a virt boundary. So don't bother checking it, especially given that the existing checks didn't properly work to start with as we never fully update the front/back segment size and miss the bi_seg_front_size that wuld have been required for some cases. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-23	block: force an unlimited segment size on queues with a virt boundary	Christoph Hellwig	1	-0/+11
	We currently fail to update the front/back segment size in the bio when deciding to allow an otherwise gappy segement to a device with a virt boundary. The reason why this did not cause problems is that devices with a virt boundary fundamentally don't use segments as we know it and thus don't care. Make that assumption formal by forcing an unlimited segement size in this case. Fixes: f6970f83ef79 ("block: don't check if adjacent bvecs in one bio can be mergeable") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-23	block: don't decrement nr_phys_segments for physically contigous segments	Christoph Hellwig	1	-22/+1
	Currently ll_merge_requests_fn, unlike all other merge functions, reduces nr_phys_segments by one if the last segment of the previous, and the first segment of the next segement are contigous. While this seems like a nice solution to avoid building smaller than possible requests it causes a mismatch between the segments actually present in the request and those iterated over by the bvec iterators, including __rq_for_each_bio. This can for example mistrigger the single segment optimization in the nvme-pci driver, and might lead to mismatching nr_phys_segments number when recalculating the number of request when inserting a cloned request. We could possibly work around this by making the bvec iterators take the front and back segment size into account, but that would require moving them from the bio to the bio_iter and spreading this mess over all users of bvecs. Or we could simply remove this optimization under the assumption that most users already build good enough bvecs, and that the bio merge patch never cared about this optimization either. The latter is what this patch does. dff824b2aadb ("nvme-pci: optimize mapping of small single segment requests"). Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-23	sbitmap: fix improper use of smp_mb__before_atomic()	Andrea Parri	1	-1/+1
	This barrier only applies to the read-modify-write operations; in particular, it does not apply to the atomic_set() primitive. Replace the barrier with an smp_mb(). Fixes: 6c0ca7ae292ad ("sbitmap: fix wakeup hang after sbq resize") Cc: stable@vger.kernel.org Reported-by: "Paul E. McKenney" <paulmck@linux.ibm.com> Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Omar Sandoval <osandov@fb.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: linux-block@vger.kernel.org Cc: "Paul E. McKenney" <paulmck@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-23	bio: fix improper use of smp_mb__before_atomic()	Andrea Parri	1	-1/+1
	This barrier only applies to the read-modify-write operations; in particular, it does not apply to the atomic_set() primitive. Replace the barrier with an smp_mb(). Fixes: dac56212e8127 ("bio: skip atomic inc/dec of ->bi_cnt for most use cases") Cc: stable@vger.kernel.org Reported-by: "Paul E. McKenney" <paulmck@linux.ibm.com> Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Ming Lei <ming.lei@redhat.com> Cc: linux-block@vger.kernel.org Cc: "Paul E. McKenney" <paulmck@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-23	aoe: list new maintainer for aoe driver	Ed Cashin	1	-1/+1
	Justin Sanders, who has extensive experience with ATA over Ethernet in general and AoE SCSI and block-device drivers in particular, is ready to take on the role of aoe maintainer. The driver needs a more active maintainer. Signed-off-by: Ed Cashin <ed.cashin@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-22	nvme-pci: use blk-mq mapping for unmanaged irqs	Keith Busch	1	-1/+1
	If a device is providing a single IRQ vector, the IO queue will share that vector with the admin queue. This is an unmanaged vector, so does not have a valid PCI IRQ affinity. Avoid trying to extract a managed affinity in this case and let blk-mq set up the cpu:queue mapping instead. Otherwise we'd hit the following warning when the device is using MSI: WARNING: CPU: 4 PID: 7 at drivers/pci/msi.c:1272 pci_irq_get_affinity+0x66/0x80 Modules linked in: nvme nvme_core serio_raw CPU: 4 PID: 7 Comm: kworker/u16:0 Tainted: G W 5.2.0-rc1+ #494 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 Workqueue: nvme-reset-wq nvme_reset_work [nvme] RIP: 0010:pci_irq_get_affinity+0x66/0x80 Code: 0b 31 c0 c3 83 e2 10 48 c7 c0 b0 83 35 91 74 2a 48 8b 87 d8 03 00 00 48 85 c0 74 0e 48 8b 50 30 48 85 d2 74 05 39 70 14 77 05 <0f> 0b 31 c0 c3 48 63 f6 48 8d 04 76 48 8d 04 c2 f3 c3 48 8b 40 30 RSP: 0000:ffffb5abc01d3cc8 EFLAGS: 00010246 RAX: ffff9536786a39c0 RBX: 0000000000000000 RCX: 0000000000000080 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9536781ed000 RBP: ffff95367346a008 R08: ffff95367d43f080 R09: ffff953678c07800 R10: ffff953678164800 R11: 0000000000000000 R12: 0000000000000000 R13: ffff9536781ed000 R14: 00000000ffffffff R15: ffff95367346a008 FS: 0000000000000000(0000) GS:ffff95367d400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fdf814a3ff0 CR3: 000000001a20f000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: blk_mq_pci_map_queues+0x37/0xd0 nvme_pci_map_queues+0x80/0xb0 [nvme] blk_mq_alloc_tag_set+0x133/0x2f0 nvme_reset_work+0x105d/0x1590 [nvme] process_one_work+0x291/0x530 worker_thread+0x218/0x3d0 ? process_one_work+0x530/0x530 kthread+0x111/0x130 ? kthread_park+0x90/0x90 ret_from_fork+0x1f/0x30 ---[ end trace 74587339d93c83c0 ]--- Fixes: 22b5560195bd6 ("nvme-pci: Separate IO and admin queue IRQ vectors") Reported-by: Iván Chavero <ichavero@chavero.com.mx> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Keith Busch <keith.busch@intel.com>
2019-05-21	nvme: update MAINTAINERS	Keith Busch	1	-1/+1
	Use my kernel.org email for nvme. This forwards to all my accounts. Signed-off-by: Keith Busch <keith.busch@intel.com>
2019-05-21	nvme: copy MTFA field from identify controller	Laine Walker-Avina	1	-0/+1
	We use the controller's reported maximum firmware activation time as our timeout before resetting a controller for a failed activation notice, but this value was never being read so we could only use the default timeout. Copy the Identify Controller MTFA field to the corresponding nvme_ctrl's mtfa field. Fixes: b6dccf7fae433 (“nvme: add support for FW activation without reset”). Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Minwoo Im <minwoo.im@samsung.com> Signed-off-by: Laine Walker-Avina <laine.walker-avina@intel.com> [changelog, fix endian] Signed-off-by: Keith Busch <keith.busch@intel.com>
2019-05-17	nvme: fix memory leak for power latency tolerance	Yufen Yu	1	-0/+1
	Unconditionally hide device pm latency tolerance when uninitializing the controller to ensure all qos resources are released so that we're not leaking this memory. This is safe to call if none were allocated in the first place, or were previously freed. Fixes: c5552fde102fc("nvme: Enable autonomous power state transitions") Suggested-by: Keith Busch <keith.busch@intel.com> Tested-by: David Milburn <dmilburn@redhat.com> Signed-off-by: Yufen Yu <yuyufen@huawei.com> [changelog] Signed-off-by: Keith Busch <keith.busch@intel.com>
2019-05-17	nvme: release namespace SRCU protection before performing controller ioctls	Christoph Hellwig	1	-5/+20
	Holding the SRCU critical section protecting the namespace list can cause deadlocks when using the per-namespace admin passthrough ioctl to delete as namespace. Release it earlier when performing per-controller ioctls to avoid that. Reported-by: Kenneth Heitke <kenneth.heitke@intel.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-05-17	nvme: merge nvme_ns_ioctl into nvme_ioctl	Christoph Hellwig	1	-23/+24
	Merge the two functions to make future changes a little easier. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2019-05-17	nvme: remove the ifdef around nvme_nvm_ioctl	Christoph Hellwig	1	-2/+0
	We already have a proper stub if lightnvm is not enabled, so don't bother with the ifdef. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2019-05-17	nvme: fix srcu locking on error return in nvme_get_ns_from_disk	Christoph Hellwig	1	-4/+9
	If we can't get a namespace don't leak the SRCU lock. nvme_ioctl was working around this, but nvme_pr_command wasn't handling this properly. Just do what callers would usually expect. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2019-05-17	nvme: Fix known effects	Keith Busch	1	-1/+1
	We're trying to append known effects to the ones reported in the controller's log. The original patch accomplished this, but something went wrong when patch was merged causing the effects log to override the known effects. Link: http://lists.infradead.org/pipermail/linux-nvme/2019-May/023710.html Fixes: f4524cc45626 ("nvme-pci: add known admin effects to augument admin effects log page") Cc: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Keith Busch <keith.busch@intel.com>
2019-05-17	nvme-pci: Sync queues on reset	Keith Busch	3	-0/+14
	A controller with multiple namespaces may have multiple request_queues with their own timeout work. If a controller fails with IO outstanding to diffent namespaces, each request queue may attempt to handle it, so ensure there is no previously scheduled timeout work executing prior to starting controller initialization by synchronizing with each queue. Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com>
2019-05-17	nvme-pci: Unblock reset_work on IO failure	Keith Busch	1	-5/+4
	The reset_work waits for queued IO to complete before setting the controller to live. If any of these times out and requeues, we won't be able to restart the controller because the reset_work is already running. Flush all entered requests to a failed completion if a timeout occurs in the connecting state, and ensure the controller can't transition to the live state after we've unblocked it from waiting for completions. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com>
2019-05-17	nvme-pci: Don't disable on timeout in reset state	Keith Busch	1	-1/+2
	The reset state doesn't dispatch commands that it needs to wait for anymore. If a timeout occurs in this state, the reset work is already disabling the controller, so just reset the request's timer. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com>
2019-05-17	nvme-pci: Fix controller freeze wait disabling	Keith Busch	1	-6/+6
	If a controller disabling didn't start a freeze, don't wait for the operation to complete. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com>
2019-05-17	kvm: fix compilation on aarch64	Paolo Bonzini	1	-1/+1
	Commit e45adf665a53 ("KVM: Introduce a new guest mapping API", 2019-01-31) introduced a build failure on aarch64 defconfig: $ make -j$(nproc) ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- O=out defconfig \ Image.gz ... ../arch/arm64/kvm/../../../virt/kvm/kvm_main.c: In function '__kvm_map_gfn': ../arch/arm64/kvm/../../../virt/kvm/kvm_main.c:1763:9: error: implicit declaration of function 'memremap'; did you mean 'memset_p'? ../arch/arm64/kvm/../../../virt/kvm/kvm_main.c:1763:46: error: 'MEMREMAP_WB' undeclared (first use in this function) ../arch/arm64/kvm/../../../virt/kvm/kvm_main.c: In function 'kvm_vcpu_unmap': ../arch/arm64/kvm/../../../virt/kvm/kvm_main.c:1795:3: error: implicit declaration of function 'memunmap'; did you mean 'vm_munmap'? because these functions are declared in <linux/io.h> rather than <asm/io.h>, and the former was being pulled in already on x86 but not on aarch64. Reported-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-16	slab: remove /proc/slab_allocators	Qian Cai	3	-232/+1
	It turned out that DEBUG_SLAB_LEAK is still broken even after recent recue efforts that when there is a large number of objects like kmemleak_object which is normal on a debug kernel, # grep kmemleak /proc/slabinfo kmemleak_object 2243606 3436210 ... reading /proc/slab_allocators could easily loop forever while processing the kmemleak_object cache and any additional freeing or allocating objects will trigger a reprocessing. To make a situation worse, soft-lockups could easily happen in this sitatuion which will call printk() to allocate more kmemleak objects to guarantee an infinite loop. Also, since it seems no one had noticed when it was totally broken more than 2-year ago - see the commit fcf88917dd43 ("slab: fix a crash by reading /proc/slab_allocators"), probably nobody cares about it anymore due to the decline of the SLAB. Just remove it entirely. Suggested-by: Vlastimil Babka <vbabka@suse.cz> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-16	afs: Fix application of the results of a inline bulk status fetch	David Howells	2	-7/+45
	Fix afs_do_lookup() such that when it does an inline bulk status fetch op, it will update inodes that are already extant (something that afs_iget() doesn't do) and to cache permits for each inode created (thereby avoiding a follow up FS.FetchStatus call to determine this). Extant inodes need looking up in advance so that their cb_break counters before and after the operation can be compared. To this end, the inode pointers are cached so that they don't need looking up again after the op. Fixes: 5cf9dd55a0ec ("afs: Prospectively look up extra files when doing a single lookup") Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16	afs: Pass pre-fetch server and volume break counts into afs_iget5_set()	David Howells	4	-49/+78
	Pass the server and volume break counts from before the status fetch operation that queried the attributes of a file into afs_iget5_set() so that the new vnode's break counters can be initialised appropriately. This allows detection of a volume or server break that happened whilst we were fetching the status or setting up the vnode. Fixes: c435ee34551e ("afs: Overhaul the callback handling") Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16	afs: Fix unlink to handle YFS.RemoveFile2 better	David Howells	6	-28/+33
	Make use of the status update for the target file that the YFS.RemoveFile2 RPC op returns to correctly update the vnode as to whether the file was actually deleted or just had nlink reduced. Fixes: 30062bd13e36 ("afs: Implement YFS support in the fs client") Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16	afs: Clear AFS_VNODE_CB_PROMISED if we detect callback expiry	David Howells	1	-1/+13
	Fix afs_validate() to clear AFS_VNODE_CB_PROMISED on a vnode if we detect any condition that causes the callback promise to be broken implicitly, including server break (cb_s_break), volume break (cb_v_break) or callback expiry. Fixes: ae3b7361dc0e ("afs: Fix validation/callback interaction") Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16	afs: Make vnode->cb_interest RCU safe	David Howells	7	-59/+100
	Use RCU-based freeing for afs_cb_interest struct objects and use RCU on vnode->cb_interest. Use that change to allow afs_check_validity() to use read_seqbegin_or_lock() instead of read_seqlock_excl(). This also requires the caller of afs_check_validity() to hold the RCU read lock across the call. Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16	afs: Split afs_validate() so first part can be used under LOOKUP_RCU	David Howells	2	-13/+25
	Split afs_validate() so that the part that decides if the vnode is still valid can be used under LOOKUP_RCU conditions from afs_d_revalidate(). Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16	afs: Don't save callback version and type fields	David Howells	6	-16/+5
	Don't save callback version and type fields as the version is about the format of the callback information and the type is relative to the particular RPC call. Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16	dt-bindings: Convert vendor prefixes to json-schema	Rob Herring	2	-476/+977
	Convert the vendor prefix registry to a schema. This will enable checking that new vendor prefixes are added (in addition to the less than perfect checkpatch.pl check) and will also check against adding other prefixes which are not vendors. Converted vendor-prefixes.txt using the following sed script: sed -e 's/$[a-zA-Z0-9\-]$[[:space:]]$[a-zA-Z0-9].$/ "^\1,\.\\":\n description: \2/' Signed-off-by: Rob Herring <robh@kernel.org>
2019-05-16	signal: unconditionally leave the frozen state in ptrace_stop()	Roman Gushchin	1	-0/+1
	Alex Xu reported a regression in strace, caused by the introduction of the cgroup v2 freezer. The regression can be reproduced by stracing the following simple program: #include <unistd.h> int main() { write(1, "a", 1); return 0; } An attempt to run strace ./a.out leads to the infinite loop: [ pre-main omitted ] write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) [ repeats forever ] The problem occurs because the traced task leaves ptrace_stop() (and the signal handling loop) with the frozen bit set. So let's call cgroup_leave_frozen(true) unconditionally after sleeping in ptrace_stop(). With this patch applied, strace works as expected: [ pre-main omitted ] write(1, "a", 1) = 1 exit_group(0) = ? +++ exited with 0 +++ Reported-by: Alex Xu <alex_y_xu@yahoo.ca> Fixes: 76f969e8948d ("cgroup: cgroup v2 freezer") Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2019-05-16	uapi: Wire up the mount API syscalls on non-x86 arches [ver #2]	David Howells	17	-2/+110
	Wire up the mount API syscalls on non-x86 arches. Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-16	uapi, x86: Fix the syscall numbering of the mount API syscalls [ver #2]	David Howells	2	-12/+12
	Fix the syscall numbering of the mount API syscalls so that the numbers match between i386 and x86_64 and that they're in the common numbering scheme space. Fixes: a07b20004793 ("vfs: syscall: Add open_tree(2) to reference or clone a mount") Fixes: 2db154b3ea8e ("vfs: syscall: Add move_mount(2) to move mounts around") Fixes: 24dcb3d90a1f ("vfs: syscall: Add fsopen() to prepare for superblock creation") Fixes: ecdab150fddb ("vfs: syscall: Add fsconfig() for configuring and managing a context") Fixes: 93766fbd2696 ("vfs: syscall: Add fsmount() to create a mount for a superblock") Fixes: cf3cba4a429b ("vfs: syscall: Add fspick() to select a superblock for reconfiguration") Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-16	uapi, fsopen: use square brackets around "fscontext" [ver #2]	Christian Brauner	1	-1/+1
	Make the name of the anon inode fd "[fscontext]" instead of "fscontext". This is minor but most core-kernel anon inode fds already carry square brackets around their name: [eventfd] [eventpoll] [fanotify] [io_uring] [pidfd] [signalfd] [timerfd] [userfaultfd] For the sake of consistency lets do the same for the fscontext anon inode fd that comes with the new mount api. Signed-off-by: Christian Brauner <christian@brauner.io> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-16	afs: Fix double inc of vnode->cb_break	David Howells	1	-3/+1
	When __afs_break_callback() clears the CB_PROMISED flag, it increments vnode->cb_break to trigger a future refetch of the status and callback - however it also calls afs_clear_permits(), which also increments vnode->cb_break. Fix this by removing the increment from afs_clear_permits(). Whilst we're at it, fix the conditional call to afs_put_permits() as the function checks to see if the argument is NULL, so the check is redundant. Fixes: be080a6f43c4 ("afs: Overhaul permit caching"); Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16	afs: Fix application of status and callback to be under same lock	David Howells	12	-1004/+921
	When applying the status and callback in the response of an operation, apply them in the same critical section so that there's no race between checking the callback state and checking status-dependent state (such as the data version). Fix this by: (1) Allocating a joint {status,callback} record (afs_status_cb) before calling the RPC function for each vnode for which the RPC reply contains a status or a status plus a callback. A flag is set in the record to indicate if a callback was actually received. (2) These records are passed into the RPC functions to be filled in. The afs_decode_status() and yfs_decode_status() functions are removed and the cb_lock is no longer taken. (3) xdr_decode_AFSFetchStatus() and xdr_decode_YFSFetchStatus() no longer update the vnode. (4) xdr_decode_AFSCallBack() and xdr_decode_YFSCallBack() no longer update the vnode. (5) vnodes, expected data-version numbers and callback break counters (cb_break) no longer need to be passed to the reply delivery functions. Note that, for the moment, the file locking functions still need access to both the call and the vnode at the same time. (6) afs_vnode_commit_status() is now given the cb_break value and the expected data_version and the task of applying the status and the callback to the vnode are now done here. This is done under a single taking of vnode->cb_lock. (7) afs_pages_written_back() is now called by afs_store_data() rather than by the reply delivery function. afs_pages_written_back() has been moved to before the call point and is now given the first and last page numbers rather than a pointer to the call. (8) The indicator from YFS.RemoveFile2 as to whether the target file actually got removed (status.abort_code == VNOVNODE) rather than merely dropping a link is now checked in afs_unlink rather than in xdr_decode_YFSFetchStatus(). Supplementary fixes: () afs_cache_permit() now gets the caller_access mask from the afs_status_cb object rather than picking it out of the vnode's status record. afs_fetch_status() returns caller_access through its argument list for this purpose also. () afs_inode_init_from_status() now uses a write lock on cb_lock rather than a read lock and now sets the callback inside the same critical section. Fixes: c435ee34551e ("afs: Overhaul the callback handling") Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16	afs: Fix lock-wait/callback-break double locking	David Howells	2	-10/+1
	__afs_break_callback() holds vnode->lock around its call of afs_lock_may_be_available() - which also takes that lock. Fix this by not taking the lock in __afs_break_callback(). Also, there's no point checking the granted_locks and pending_locks queues; it's sufficient to check lock_state, so move that check out of afs_lock_may_be_available() into __afs_break_callback() to replace the queue checks. Fixes: e8d6c554126b ("AFS: implement file locking") Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16	afs: Always get the reply time	David Howells	5	-16/+3
	Always ask for the reply time from AF_RXRPC as it's used to calculate the callback expiry time and lock expiry times, so it's needed by most FS operations. Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16	afs: Don't invalidate callback if AFS_VNODE_DIR_VALID not set	David Howells	1	-5/+2
	Don't invalidate the callback promise on a directory if the AFS_VNODE_DIR_VALID flag is not set (which indicates that the directory contents are invalid, due to edit failure, callback break, page reclaim). The directory will be reloaded next time the directory is accessed, so clearing the callback flag at this point may race with a reload of the directory and cancel it's recorded callback promise. Fixes: f3ddee8dc4e2 ("afs: Fix directory handling") Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16	afs: Fix order-1 allocation in afs_do_lookup()	David Howells	5	-50/+45
	afs_do_lookup() will do an order-1 allocation to allocate status records if there are more than 39 vnodes to stat. Fix this by allocating an array of {status,callback} records for each vnode we want to examine using vmalloc() if larger than a page. This not only gets rid of the order-1 allocation, but makes it easier to grow beyond 50 records for YFS servers. It also allows us to move to {status,callback} tuples for other calls too and makes it easier to lock across the application of the status and the callback to the vnode. Fixes: 5cf9dd55a0ec ("afs: Prospectively look up extra files when doing a single lookup") Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16	afs: Fix calculation of callback expiry time	David Howells	2	-57/+51
	Fix the calculation of the expiry time of a callback promise, as obtained from operations like FS.FetchStatus and FS.FetchData. The time should be based on the timestamp of the first DATA packet in the reply and the calculation needs to turn the ktime_t timestamp into a time64_t. Fixes: c435ee34551e ("afs: Overhaul the callback handling") Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16	afs: Get rid of afs_call::reply[]	David Howells	10	-303/+293
	Replace the afs_call::reply[] array with a bunch of typed members so that the compiler can use type-checking on them. It's also easier for the eye to see what's going on. Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16	afs: Make dynamic root population wait uninterruptibly for proc_cells_lock	David Howells	1	-2/+1
	Make dynamic root population wait uninterruptibly for proc_cells_lock. Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16	afs: Don't pass the vnode pointer through into the inline bulk status op	David Howells	2	-15/+2
	Don't pass the vnode pointer through into the inline bulk status op. We want to process the status records outside of it anyway. Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16	afs: Make some RPC operations non-interruptible	David Howells	14	-35/+99
	Make certain RPC operations non-interruptible, including: () Set attributes () Store data We don't want to get interrupted during a flush on close, flush on unlock, writeback or an inode update, leaving us in a state where we still need to do the writeback or update. () Extend lock () Release lock We don't want to get lock extension interrupted as the file locks on the server are time-limited. Interruption during lock release is less of an issue since the lock is time-limited, but it's better to complete the release to avoid a several-minute wait to recover it. Setting the lock isn't a problem if it's interrupted since we can just return to the user and tell them they were interrupted - at which point they can elect to retry. (*) Silly unlink We want to remove silly unlink files if we can, rather than leaving them for the salvager to clear up. Note that whilst these calls are no longer interruptible, they do have timeouts on them, so if the server stops responding the call will fail with something like ETIME or ECONNRESET. Without this, the following: kAFS: Unexpected error from FS.StoreData -512 appears in dmesg when a pending store data gets interrupted and some processes may just hang. Additionally, make the code that checks/updates the server record ignore failure due to interruption if the main call is uninterruptible and if the server has an address list. The next op will check it again since the expiration time on the old list has past. Fixes: d2ddc776a458 ("afs: Overhaul volume and server record caching and fileserver rotation") Reported-by: Jonathan Billings <jsbillings@jsbillings.org> Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16	rxrpc: Allow the kernel to mark a call as being non-interruptible	David Howells	8	-4/+28
	Allow kernel services using AF_RXRPC to indicate that a call should be non-interruptible. This allows kafs to make things like lock-extension and writeback data storage calls non-interruptible. If this is set, signals will be ignored for operations on that call where possible - such as waiting to get a call channel on an rxrpc connection. It doesn't prevent UDP sendmsg from being interrupted, but that will be handled by packet retransmission. rxrpc_kernel_recv_data() isn't affected by this since that never waits, preferring instead to return -EAGAIN and leave the waiting to the caller. Userspace initiated calls can't be set to be uninterruptible at this time. Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16	afs: Fix error propagation from server record check/update	David Howells	1	-3/+3
	afs_check/update_server_record() should be setting fc->error rather than fc->ac.error as they're called from within the cursor iteration function. afs_fs_cursor::error is where the error code of the attempt to call the operation on multiple servers is integrated and is the final result, whereas afs_addr_cursor::error is used to hold the error from individual iterations of the call loop. (Note there's also an afs_vl_cursor which also wraps afs_addr_cursor for accessing VL servers rather than file servers). Fix this by setting fc->error in the afs_check/update_server_record() so that any error incurred whilst talking to the VL server correctly propagates to the final result. This results in: kAFS: Unexpected error from FS.StoreData -512 being seen, even though the store-data op is non-interruptible. The error is actually coming from the server record update getting interrupted. Fixes: d2ddc776a458 ("afs: Overhaul volume and server record caching and fileserver rotation") Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16	afs: Fix the maximum lifespan of VL and probe calls	David Howells	5	-0/+13
	If an older AFS server doesn't support an operation, it may accept the call and then sit on it forever, happily responding to pings that make kafs think that the call is still alive. Fix this by setting the maximum lifespan of Volume Location service calls in particular and probe calls in general so that they don't run on endlessly if they're not supported. Signed-off-by: David Howells <dhowells@redhat.com>