aboutsummaryrefslogtreecommitdiffstats
path: root/kernel/sched (follow)
AgeCommit message (Collapse)AuthorFilesLines
2025-11-06sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remainingAaron Lu2-10/+7
When a cfs_rq is to be throttled, its limbo list should be empty and that's why there is a warn in tg_throttle_down() for non empty cfs_rq->throttled_limbo_list. When running a test with the following hierarchy: root / \ A* ... / | \ ... B / \ C* where both A and C have quota settings, that warn on non empty limbo list is triggered for a cfs_rq of C, let's call it cfs_rq_c(and ignore the cpu part of the cfs_rq for the sake of simpler representation). Debug showed it happened like this: Task group C is created and quota is set, so in tg_set_cfs_bandwidth(), cfs_rq_c is initialized with runtime_enabled set, runtime_remaining equals to 0 and *unthrottled*. Before any tasks are enqueued to cfs_rq_c, *multiple* throttled tasks can migrate to cfs_rq_c (e.g., due to task group changes). When enqueue_task_fair(cfs_rq_c, throttled_task) is called and cfs_rq_c is in a throttled hierarchy (e.g., A is throttled), these throttled tasks are directly placed into cfs_rq_c's limbo list by enqueue_throttled_task(). Later, when A is unthrottled, tg_unthrottle_up(cfs_rq_c) enqueues these tasks. The first enqueue triggers check_enqueue_throttle(), and with zero runtime_remaining, cfs_rq_c can be throttled in throttle_cfs_rq() if it can't get more runtime and enters tg_throttle_down(), where the warning is hit due to remaining tasks in the limbo list. I think it's a chaos to trigger throttle on unthrottle path, the status of a being unthrottled cfs_rq can be in a mixed state in the end, so fix this by granting 1ns to cfs_rq in tg_set_cfs_bandwidth(). This ensures cfs_rq_c has a positive runtime_remaining when initialized as unthrottled and cannot enter tg_unthrottle_up() with zero runtime_remaining. Also, update outdated comments in tg_throttle_down() since unthrottle_cfs_rq() is no longer called with zero runtime_remaining. While at it, remove a redundant assignment to se in tg_throttle_down(). Fixes: e1fad12dcb66 ("sched/fair: Switch to task based throttle model") Reviewed-By: Benjamin Segall <bsegall@google.com> Suggested-by: Benjamin Segall <bsegall@google.com> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Hao Jia <jiahao1@lixiang.com> Link: https://patch.msgid.link/20251030032755.560-1-ziqianlu@bytedance.com
2025-10-27Merge tag 'sched_ext-for-6.18-rc3-fixes' of ↵Linus Torvalds2-15/+112
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Fix scx_kick_pseqs corruption when multiple schedulers are loaded concurrently - Allocate scx_kick_cpus_pnt_seqs lazily using kvzalloc() to handle systems with large CPU counts - Defer queue_balance_callback() until after ops.dispatch to fix callback ordering issues - Sync error_irq_work before freeing scx_sched to prevent use-after-free - Mark scx_bpf_dsq_move_set_[slice|vtime]() with KF_RCU for proper RCU protection - Fix flag check for deferred callbacks * tag 'sched_ext-for-6.18-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: fix flag check for deferred callbacks sched_ext: Fix scx_kick_pseqs corruption on concurrent scheduler loads sched_ext: Allocate scx_kick_cpus_pnt_seqs lazily using kvzalloc() sched_ext: defer queue_balance_callback() until after ops.dispatch sched_ext: Sync error_irq_work before freeing scx_sched sched_ext: Mark scx_bpf_dsq_move_set_[slice|vtime]() with KF_RCU
2025-10-26Merge tag 'sched_urgent_for_v6.18_rc3' of ↵Linus Torvalds1-0/+12
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Borislav Petkov: - Make sure a CFS runqueue on a throttled hierarchy has its PELT clock throttled otherwise task movement and manipulation would lead to dangling cfs_rq references and an eventual crash * tag 'sched_urgent_for_v6.18_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Start a cfs_rq on throttled hierarchy with PELT clock throttled
2025-10-24sched: Remove never used code in mm_cid_get()Andy Shevchenko1-2/+0
Clang is not happy with set but unused variable (this is visible with `make W=1` build: kernel/sched/sched.h:3744:18: error: variable 'cpumask' set but not used [-Werror,-Wunused-but-set-variable] It seems like the variable was never used along with the assignment that does not have side effects as far as I can see. Remove those altogether. Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced by mm_cid") Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Tested-by: Eric Biggers <ebiggers@kernel.org> Reviewed-by: Breno Leitao <leitao@debian.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-10-22sched/fair: Start a cfs_rq on throttled hierarchy with PELT clock throttledK Prateek Nayak1-0/+12
Matteo reported hitting the assert_list_leaf_cfs_rq() warning from enqueue_task_fair() post commit fe8d238e646e ("sched/fair: Propagate load for throttled cfs_rq") which transitioned to using cfs_rq_pelt_clock_throttled() check for leaf cfs_rq insertions in propagate_entity_cfs_rq(). The "cfs_rq->pelt_clock_throttled" flag is used to indicate if the hierarchy has its PELT frozen. If a cfs_rq's PELT is marked frozen, all its descendants should have their PELT frozen too or weird things can happen as a result of children accumulating PELT signals when the parents have their PELT clock stopped. Another side effect of this is the loss of integrity of the leaf cfs_rq list. As debugged by Aaron, consider the following hierarchy: root(#) / \ A(#) B(*) | C <--- new cgroup | D <--- new cgroup # - Already on leaf cfs_rq list * - Throttled with PELT frozen The newly created cgroups don't have their "pelt_clock_throttled" signal synced with cgroup B. Next, the following series of events occur: 1. online_fair_sched_group() for cgroup D will call propagate_entity_cfs_rq(). (Same can happen if a throttled task is moved to cgroup C and enqueue_task_fair() returns early.) propagate_entity_cfs_rq() adds the cfs_rq of cgroup C to "rq->tmp_alone_branch" since its PELT clock is not marked throttled and cfs_rq of cgroup B is not on the list. cfs_rq of cgroup B is skipped since its PELT is throttled. root cfs_rq already exists on cfs_rq leading to list_add_leaf_cfs_rq() returning early. The cfs_rq of cgroup C is left dangling on the "rq->tmp_alone_branch". 2. A new task wakes up on cgroup A. Since the whole hierarchy is already on the leaf cfs_rq list, list_add_leaf_cfs_rq() keeps returning early without any modifications to "rq->tmp_alone_branch". The final assert_list_leaf_cfs_rq() in enqueue_task_fair() sees the dangling reference to cgroup C's cfs_rq in "rq->tmp_alone_branch". !!! Splat !!! Syncing the "pelt_clock_throttled" indicator with parent cfs_rq is not enough since the new cfs_rq is not yet enqueued on the hierarchy. A dequeue on other subtree on the throttled hierarchy can freeze the PELT clock for the parent hierarchy without setting the indicators for this newly added cfs_rq which was never enqueued. Since there are no tasks on the new hierarchy, start a cfs_rq on a throttled hierarchy with its PELT clock throttled. The first enqueue, or the distribution (whichever happens first) will unfreeze the PELT clock and queue the cfs_rq on the leaf cfs_rq list. While at it, add an assert_list_leaf_cfs_rq() in propagate_entity_cfs_rq() to catch such cases in the future. Closes: https://lore.kernel.org/lkml/58a587d694f33c2ea487c700b0d046fa@codethink.co.uk/ Fixes: e1fad12dcb66 ("sched/fair: Switch to task based throttle model") Reported-by: Matteo Martelli <matteo.martelli@codethink.co.uk> Suggested-by: Aaron Lu <ziqianlu@bytedance.com> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Aaron Lu <ziqianlu@bytedance.com> Tested-by: Aaron Lu <ziqianlu@bytedance.com> Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk> Link: https://patch.msgid.link/20251021053522.37583-1-kprateek.nayak@amd.com
2025-10-16sched_ext: fix flag check for deferred callbacksEmil Tsalapatis1-1/+1
When scheduling the deferred balance callbacks, check SCX_RQ_BAL_CB_PENDING instead of SCX_RQ_BAL_PENDING. This way schedule_deferred() properly tests whether there is already a pending request for queue_balance_callback() to be invoked at the end of .balance(). Fixes: a8ad873113d3 ("sched_ext: defer queue_balance_callback() until after ops.dispatch") Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-14sched_ext: Fix scx_kick_pseqs corruption on concurrent scheduler loadsAndrea Righi1-5/+5
If we load a BPF scheduler while another scheduler is already running, alloc_kick_pseqs() would be called again, overwriting the previously allocated arrays. Fix by moving the alloc_kick_pseqs() call after the scx_enable_state() check, ensuring that the arrays are only allocated when a scheduler can actually be loaded. Fixes: 14c1da3895a11 ("sched_ext: Allocate scx_kick_cpus_pnt_seqs lazily using kvzalloc()") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-14sched/fair: Fix pelt lost idle time detectionVincent Guittot1-13/+13
The check for some lost idle pelt time should be always done when pick_next_task_fair() fails to pick a task and not only when we call it from the fair fast-path. The case happens when the last running task on rq is a RT or DL task. When the latter goes to sleep and the /Sum of util_sum of the rq is at the max value, we don't account the lost of idle time whereas we should. Fixes: 67692435c411 ("sched: Rework pick_next_task() slow-path") Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-10-14sched/deadline: Stop dl_server before CPU goes offlinePeter Zijlstra (Intel)2-0/+5
IBM CI tool reported kernel warning[1] when running a CPU removal operation through drmgr[2]. i.e "drmgr -c cpu -r -q 1" WARNING: CPU: 0 PID: 0 at kernel/sched/cpudeadline.c:219 cpudl_set+0x58/0x170 NIP [c0000000002b6ed8] cpudl_set+0x58/0x170 LR [c0000000002b7cb8] dl_server_timer+0x168/0x2a0 Call Trace: [c000000002c2f8c0] init_stack+0x78c0/0x8000 (unreliable) [c0000000002b7cb8] dl_server_timer+0x168/0x2a0 [c00000000034df84] __hrtimer_run_queues+0x1a4/0x390 [c00000000034f624] hrtimer_interrupt+0x124/0x300 [c00000000002a230] timer_interrupt+0x140/0x320 Git bisects to: commit 4ae8d9aa9f9d ("sched/deadline: Fix dl_server getting stuck") This happens since: - dl_server hrtimer gets enqueued close to cpu offline, when kthread_park enqueues a fair task. - CPU goes offline and drmgr removes it from cpu_present_mask. - hrtimer fires and warning is hit. Fix it by stopping the dl_server before CPU is marked dead. [1]: https://lore.kernel.org/all/8218e149-7718-4432-9312-f97297c352b9@linux.ibm.com/ [2]: https://github.com/ibm-power-utilities/powerpc-utils/tree/next/src/drmgr [sshegde: wrote the changelog and tested it] Fixes: 4ae8d9aa9f9d ("sched/deadline: Fix dl_server getting stuck") Closes: https://lore.kernel.org/all/8218e149-7718-4432-9312-f97297c352b9@linux.ibm.com Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
2025-10-13sched_ext: Allocate scx_kick_cpus_pnt_seqs lazily using kvzalloc()Tejun Heo1-10/+79
On systems with >4096 CPUs, scx_kick_cpus_pnt_seqs allocation fails during boot because it exceeds the 32,768 byte percpu allocator limit. Restructure to use DEFINE_PER_CPU() for the per-CPU pointers, with each CPU pointing to its own kvzalloc'd array. Move allocation from boot time to scx_enable() and free in scx_disable(), so the O(nr_cpu_ids^2) memory is only consumed when sched_ext is active. Use RCU to guard against racing with free. Arrays are freed via call_rcu() and kick_cpus_irq_workfn() uses rcu_dereference_bh() with a NULL check. While at it, rename to scx_kick_pseqs for brevity and update comments to clarify these are pick_task sequence numbers. v2: RCU protect scx_kick_seqs to manage kick_cpus_irq_workfn() racing against disable as per Andrea. v3: Fix bugs notcied by Andrea. Reported-by: Phil Auld <pauld@redhat.com> Link: http://lkml.kernel.org/r/20251007133523.GA93086@pauld.westford.csb Cc: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Reviewed-by: Phil Auld <pauld@redhat.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-13sched_ext: defer queue_balance_callback() until after ops.dispatchEmil Tsalapatis2-2/+28
The sched_ext code calls queue_balance_callback() during enqueue_task() to defer operations that drop multiple locks until we can unpin them. The call assumes that the rq lock is held until the callbacks are invoked, and the pending callbacks will not be visible to any other threads. This is enforced by a WARN_ON_ONCE() in rq_pin_lock(). However, balance_one() may actually drop the lock during a BPF dispatch call. Another thread may win the race to get the rq lock and see the pending callback. To avoid this, sched_ext must only queue the callback after the dispatch calls have completed. CPU 0 CPU 1 CPU 2 scx_balance() rq_unpin_lock() scx_balance_one() |= IN_BALANCE scx_enqueue() ops.dispatch() rq_unlock() rq_lock() queue_balance_callback() rq_unlock() [WARN] rq_pin_lock() rq_lock() &= ~IN_BALANCE rq_repin_lock() Changelog v2-> v1 (https://lore.kernel.org/sched-ext/aOgOxtHCeyRT_7jn@gpd4) - Fixed explanation in patch description (Andrea) - Fixed scx_rq mask state updates (Andrea) - Added Reviewed-by tag from Andrea Reported-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-13sched_ext: Sync error_irq_work before freeing scx_schedTejun Heo1-0/+2
By the time scx_sched_free_rcu_work() runs, the scx_sched is no longer reachable. However, a previously queued error_irq_work may still be pending or running. Ensure it completes before proceeding with teardown. Fixes: bff3b5aec1b7 ("sched_ext: Move disable machinery into scx_sched") Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-13sched_ext: Mark scx_bpf_dsq_move_set_[slice|vtime]() with KF_RCUTejun Heo1-4/+4
scx_bpf_dsq_move_set_slice() and scx_bpf_dsq_move_set_vtime() take a DSQ iterator argument which has to be valid. Mark them with KF_RCU. Fixes: 4c30f5ce4f7a ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()") Cc: stable@vger.kernel.org # v6.12+ Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-02Merge tag 'mm-stable-2025-10-01-19-00' of ↵Linus Torvalds1-5/+5
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "mm, swap: improve cluster scan strategy" from Kairui Song improves performance and reduces the failure rate of swap cluster allocation - "support large align and nid in Rust allocators" from Vitaly Wool permits Rust allocators to set NUMA node and large alignment when perforning slub and vmalloc reallocs - "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend DAMOS_STAT's handling of the DAMON operations sets for virtual address spaces for ops-level DAMOS filters - "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren Baghdasaryan reduces mmap_lock contention during reads of /proc/pid/maps - "mm/mincore: minor clean up for swap cache checking" from Kairui Song performs some cleanup in the swap code - "mm: vm_normal_page*() improvements" from David Hildenbrand provides code cleanup in the pagemap code - "add persistent huge zero folio support" from Pankaj Raghav provides a block layer speedup by optionalls making the huge_zero_pagepersistent, instead of releasing it when its refcount falls to zero - "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to the recently added Kexec Handover feature - "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo Stoakes turns mm_struct.flags into a bitmap. To end the constant struggle with space shortage on 32-bit conflicting with 64-bit's needs - "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap code - "selftests/mm: Fix false positives and skip unsupported tests" from Donet Tom fixes a few things in our selftests code - "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised" from David Hildenbrand "allows individual processes to opt-out of THP=always into THP=madvise, without affecting other workloads on the system". It's a long story - the [1/N] changelog spells out the considerations - "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on the memdesc project. Please see https://kernelnewbies.org/MatthewWilcox/Memdescs and https://blogs.oracle.com/linux/post/introducing-memdesc - "Tiny optimization for large read operations" from Chi Zhiling improves the efficiency of the pagecache read path - "Better split_huge_page_test result check" from Zi Yan improves our folio splitting selftest code - "test that rmap behaves as expected" from Wei Yang adds some rmap selftests - "remove write_cache_pages()" from Christoph Hellwig removes that function and converts its two remaining callers - "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD selftests issues - "introduce kernel file mapped folios" from Boris Burkov introduces the concept of "kernel file pages". Using these permits btrfs to account its metadata pages to the root cgroup, rather than to the cgroups of random inappropriate tasks - "mm/pageblock: improve readability of some pageblock handling" from Wei Yang provides some readability improvements to the page allocator code - "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON to understand arm32 highmem - "tools: testing: Use existing atomic.h for vma/maple tests" from Brendan Jackman performs some code cleanups and deduplication under tools/testing/ - "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes a couple of 32-bit issues in tools/testing/radix-tree.c - "kasan: unify kasan_enabled() and remove arch-specific implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific initialization code into a common arch-neutral implementation - "mm: remove zpool" from Johannes Weiner removes zspool - an indirection layer which now only redirects to a single thing (zsmalloc) - "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a couple of cleanups in the fork code - "mm: remove nth_page()" from David Hildenbrand makes rather a lot of adjustments at various nth_page() callsites, eventually permitting the removal of that undesirable helper function - "introduce kasan.write_only option in hw-tags" from Yeoreum Yun creates a KASAN read-only mode for ARM, using that architecture's memory tagging feature. It is felt that a read-only mode KASAN is suitable for use in production systems rather than debug-only - "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does some tidying in the hugetlb folio allocation code - "mm: establish const-correctness for pointer parameters" from Max Kellermann makes quite a number of the MM API functions more accurate about the constness of their arguments. This was getting in the way of subsystems (in this case CEPH) when they attempt to improving their own const/non-const accuracy - "Cleanup free_pages() misuse" from Vishal Moola fixes a number of code sites which were confused over when to use free_pages() vs __free_pages() - "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the mapletree code accessible to Rust. Required by nouveau and by its forthcoming successor: the new Rust Nova driver - "selftests/mm: split_huge_page_test: split_pte_mapped_thp improvements" from David Hildenbrand adds a fix and some cleanups to the thp selftesting code - "mm, swap: introduce swap table as swap cache (phase I)" from Chris Li and Kairui Song is the first step along the path to implementing "swap tables" - a new approach to swap allocation and state tracking which is expected to yield speed and space improvements. This patchset itself yields a 5-20% performance benefit in some situations - "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc layer to clean up the ptdesc code a little - "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some issues in our 5-level pagetable selftesting code - "Minor fixes for memory allocation profiling" from Suren Baghdasaryan addresses a couple of minor issues in relatively new memory allocation profiling feature - "Small cleanups" from Matthew Wilcox has a few cleanups in preparation for more memdesc work - "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from Quanmin Yan makes some changes to DAMON in furtherance of supporting arm highmem - "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad Anjum adds that compiler check to selftests code and fixes the fallout, by removing dead code - "Improvements to Victim Process Thawing and OOM Reaper Traversal Order" from zhongjinji makes a number of improvements in the OOM killer: mainly thawing a more appropriate group of victim threads so they can release resources - "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park is a bunch of small and unrelated fixups for DAMON - "mm/damon: define and use DAMON initialization check function" from SeongJae Park implement reliability and maintainability improvements to a recently-added bug fix - "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from SeongJae Park provides additional transparency to userspace clients of the DAMON_STAT information - "Expand scope of khugepaged anonymous collapse" from Dev Jain removes some constraints on khubepaged's collapsing of anon VMAs. It also increases the success rate of MADV_COLLAPSE against an anon vma - "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()" from Lorenzo Stoakes moves us further towards removal of file_operations.mmap(). This patchset concentrates upon clearing up the treatment of stacked filesystems - "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau provides some fixes and improvements to mlock's tracking of large folios. /proc/meminfo's "Mlocked" field became more accurate - "mm/ksm: Fix incorrect accounting of KSM counters during fork" from Donet Tom fixes several user-visible KSM stats inaccuracies across forks and adds selftest code to verify these counters - "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses some potential but presently benign issues in KSM's mm_slot handling * tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits) mm: swap: check for stable address space before operating on the VMA mm: convert folio_page() back to a macro mm/khugepaged: use start_addr/addr for improved readability hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list alloc_tag: fix boot failure due to NULL pointer dereference mm: silence data-race in update_hiwater_rss mm/memory-failure: don't select MEMORY_ISOLATION mm/khugepaged: remove definition of struct khugepaged_mm_slot mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL hugetlb: increase number of reserving hugepages via cmdline selftests/mm: add fork inheritance test for ksm_merging_pages counter mm/ksm: fix incorrect KSM counter handling in mm_struct during fork drivers/base/node: fix double free in register_one_node() mm: remove PMD alignment constraint in execmem_vmalloc() mm/memory_hotplug: fix typo 'esecially' -> 'especially' mm/rmap: improve mlock tracking for large folios mm/filemap: map entire large folio faultaround mm/fault: try to map the entire file folio in finish_fault() mm/rmap: mlock large folios in try_to_unmap_one() mm/rmap: fix a mlock race condition in folio_referenced_one() ...
2025-09-30Merge tag 'timers-core-2025-09-29' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer core updates from Thomas Gleixner: - Address the inconsistent shutdown sequence of per CPU clockevents on CPU hotplug, which only removed it from the core but failed to invoke the actual device driver shutdown callback. This kept the timer active, which prevented power savings and caused pointless noise in virtualization. - Encapsulate the open coded access to the hrtimer clock base, which is a private implementation detail, so that the implementation can be changed without breaking a lot of usage sites. - Enhance the debug output of the clocksource watchdog to provide better information for analysis. - The usual set of cleanups and enhancements all over the place * tag 'timers-core-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: time: Fix spelling mistakes in comments clocksource: Print durations for sync check unconditionally LoongArch: Remove clockevents shutdown call on offlining tick: Do not set device to detached state in tick_shutdown() hrtimer: Reorder branches in hrtimer_clockid_to_base() hrtimer: Remove hrtimer_clock_base:: Get_time hrtimer: Use hrtimer_cb_get_time() helper media: pwm-ir-tx: Avoid direct access to hrtimer clockbase ALSA: hrtimer: Avoid direct access to hrtimer clockbase lib: test_objpool: Avoid direct access to hrtimer clockbase sched/core: Avoid direct access to hrtimer clockbase timers/itimer: Avoid direct access to hrtimer clockbase posix-timers: Avoid direct access to hrtimer clockbase jiffies: Remove obsolete SHIFTED_HZ comment
2025-09-30Merge tag 'sched-core-2025-09-26' of ↵Linus Torvalds7-293/+431
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Core scheduler changes: - Make migrate_{en,dis}able() inline, to improve performance (Menglong Dong) - Move STDL_INIT() functions out-of-line (Peter Zijlstra) - Unify the SCHED_{SMT,CLUSTER,MC} Kconfig (Peter Zijlstra) Fair scheduling: - Defer throttling to when tasks exit to user-space, to reduce the chance & impact of throttle-preemption with held locks and other resources (Aaron Lu, Valentin Schneider) - Get rid of sched_domains_curr_level hack for tl->cpumask(), as the warning was getting triggered on certain topologies (Peter Zijlstra) Misc cleanups & fixes: - Header cleanups (Menglong Dong) - Fix race in push_dl_task() (Harshit Agarwal)" * tag 'sched-core-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Fix some typos in include/linux/preempt.h sched: Make migrate_{en,dis}able() inline rcu: Replace preempt.h with sched.h in include/linux/rcupdate.h arch: Add the macro COMPILE_OFFSETS to all the asm-offsets.c sched/fair: Do not balance task to a throttled cfs_rq sched/fair: Do not special case tasks in throttled hierarchy sched/fair: update_cfs_group() for throttled cfs_rqs sched/fair: Propagate load for throttled cfs_rq sched/fair: Get rid of throttled_lb_pair() sched/fair: Task based throttle time accounting sched/fair: Switch to task based throttle model sched/fair: Implement throttle task work and related helpers sched/fair: Add related data structure for task based throttle sched: Unify the SCHED_{SMT,CLUSTER,MC} Kconfig sched: Move STDL_INIT() functions out-of-line sched/fair: Get rid of sched_domains_curr_level hack for tl->cpumask() sched/deadline: Fix race in push_dl_task()
2025-09-30Merge tag 'sched_ext-for-6.18' of ↵Linus Torvalds6-1298/+1510
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext updates from Tejun Heo: - Code organization cleanup. Separate internal types and accessors to ext_internal.h to reduce the size of ext.c and improve maintainability. - Prepare for cgroup sub-scheduler support by adding @sch parameter to various functions and helpers, reorganizing scheduler instance handling, and dropping obsolete helpers like scx_kf_exit() and kf_cpu_valid(). - Add new scx_bpf_cpu_curr() and scx_bpf_locked_rq() BPF helpers to provide safer access patterns with proper RCU protection. scx_bpf_cpu_rq() is deprecated with warnings due to potential race conditions. - Improve debugging with migration-disabled counter in error state dumps, SCX_EFLAG_INITIALIZED flag, bitfields for warning flags, and other enhancements to help diagnose issues. - Use cgroup_lock/unlock() for cgroup synchronization instead of scx_cgroup_rwsem based synchronization. This is simpler and allows enable/disable paths to synchronize against cgroup changes independent of the CPU controller. - rhashtable_lookup() replacement to avoid redundant RCU locking was reverted due to RCU usage warnings. Will be redone once rhashtable is updated to use rcu_dereference_all(). - Other misc updates and fixes including bypass handling improvements, scx_task_iter_relock() improvements, tools/sched_ext updates, and compatibility helpers. * tag 'sched_ext-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (28 commits) Revert "sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast()" sched_ext: Misc updates around scx_sched instance pointer sched_ext: Drop scx_kf_exit() and scx_kf_error() sched_ext: Add the @sch parameter to scx_dsq_insert_preamble/commit() sched_ext: Drop kf_cpu_valid() sched_ext: Add the @sch parameter to ext_idle helpers sched_ext: Add the @sch parameter to __bstr_format() sched_ext: Separate out scx_kick_cpu() and add @sch to it tools/sched_ext: scx_qmap: Make debug output quieter by default sched_ext: Make qmap dump operation non-destructive sched_ext: Add SCX_EFLAG_INITIALIZED to indicate successful ops.init() sched_ext: Use bitfields for boolean warning flags sched_ext: Fix stray scx_root usage in task_can_run_on_remote_rq() sched_ext: Improve SCX_KF_DISPATCH comment sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast() sched_ext: Verify RCU protection in scx_bpf_cpu_curr() sched_ext: Add migration-disabled counter to error state dump sched_ext: Fix NULL dereference in scx_bpf_cpu_rq() warning tools/sched_ext: Add compat helper for scx_bpf_cpu_curr() sched_ext: deprecation warn for scx_bpf_cpu_rq() ...
2025-09-29Merge tag 'kernel-6.18-rc1.clone3' of ↵Linus Torvalds3-5/+5
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull copy_process updates from Christian Brauner: "This contains the changes to enable support for clone3() on nios2 which apparently is still a thing. The more exciting part of this is that it cleans up the inconsistency in how the 64-bit flag argument is passed from copy_process() into the various other copy_*() helpers" [ Fixed up rv ltl_monitor 32-bit support as per Sasha Levin in the merge ] * tag 'kernel-6.18-rc1.clone3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: nios2: implement architecture-specific portion of sys_clone3 arch: copy_thread: pass clone_flags as u64 copy_process: pass clone_flags as u64 across calltree copy_sighand: Handle architectures where sizeof(unsigned long) < sizeof(u64)
2025-09-26Merge tag 'sched-urgent-2025-09-26' of ↵Linus Torvalds3-44/+35
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Fix two dl_server regressions: a race that can end up leaving the dl_server stuck, and a dl_server throttling bug causing lag to fair tasks" * tag 'sched-urgent-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/deadline: Fix dl_server behaviour sched/deadline: Fix dl_server getting stuck
2025-09-25sched: Make migrate_{en,dis}able() inlineMenglong Dong2-49/+26
For now, migrate_enable and migrate_disable are global, which makes them become hotspots in some case. Take BPF for example, the function calling to migrate_enable and migrate_disable in BPF trampoline can introduce significant overhead, and following is the 'perf top' of FENTRY's benchmark (./tools/testing/selftests/bpf/bench trig-fentry): 54.63% bpf_prog_2dcccf652aac1793_bench_trigger_fentry [k] bpf_prog_2dcccf652aac1793_bench_trigger_fentry 10.43% [kernel] [k] migrate_enable 10.07% bpf_trampoline_6442517037 [k] bpf_trampoline_6442517037 8.06% [kernel] [k] __bpf_prog_exit_recur 4.11% libc.so.6 [.] syscall 2.15% [kernel] [k] entry_SYSCALL_64 1.48% [kernel] [k] memchr_inv 1.32% [kernel] [k] fput 1.16% [kernel] [k] _copy_to_user 0.73% [kernel] [k] bpf_prog_test_run_raw_tp So in this commit, we make migrate_enable/migrate_disable inline to obtain better performance. The struct rq is defined internally in kernel/sched/sched.h, and the field "nr_pinned" is accessed in migrate_enable/migrate_disable, which makes it hard to make them inline. Alexei Starovoitov suggests to generate the offset of "nr_pinned" in [1], so we can define the migrate_enable/migrate_disable in include/linux/sched.h and access "this_rq()->nr_pinned" with "(void *)this_rq() + RQ_nr_pinned". The offset of "nr_pinned" is generated in include/generated/rq-offsets.h by kernel/sched/rq-offsets.c. Generally speaking, we move the definition of migrate_enable and migrate_disable to include/linux/sched.h from kernel/sched/core.c. The calling to __set_cpus_allowed_ptr() is leaved in ___migrate_enable(). The "struct rq" is not available in include/linux/sched.h, so we can't access the "runqueues" with this_cpu_ptr(), as the compilation will fail in this_cpu_ptr() -> raw_cpu_ptr() -> __verify_pcpu_ptr(): typeof((ptr) + 0) So we introduce the this_rq_raw() and access the runqueues with arch_raw_cpu_ptr/PERCPU_PTR directly. The variable "runqueues" is not visible in the kernel modules, and export it is not a good idea. As Peter Zijlstra advised in [2], we define and export migrate_enable/migrate_disable in kernel/sched/core.c too, and use them for the modules. Before this patch, the performance of BPF FENTRY is: fentry : 113.030 ± 0.149M/s fentry : 112.501 ± 0.187M/s fentry : 112.828 ± 0.267M/s fentry : 115.287 ± 0.241M/s After this patch, the performance of BPF FENTRY increases to: fentry : 143.644 ± 0.670M/s fentry : 149.764 ± 0.362M/s fentry : 149.642 ± 0.156M/s fentry : 145.263 ± 0.221M/s Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/bpf/CAADnVQ+5sEDKHdsJY5ZsfGDO_1SEhhQWHrt2SMBG5SYyQ+jt7w@mail.gmail.com/ [1] Link: https://lore.kernel.org/all/20250819123214.GH4067720@noisy.programming.kicks-ass.net/ [2]
2025-09-25sched/deadline: Fix dl_server behaviourPeter Zijlstra2-23/+33
John reported undesirable behaviour with the dl_server since commit: cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling"). When starving fair tasks on purpose (starting spinning FIFO tasks), his fair workload, which often goes (briefly) idle, would delay fair invocations for a second, running one invocation per second was both unexpected and terribly slow. The reason this happens is that when dl_se->server_pick_task() returns NULL, indicating no runnable tasks, it would yield, pushing any later jobs out a whole period (1 second). Instead simply stop the server. This should restore behaviour in that a later wakeup (which restarts the server) will be able to continue running (subject to the CBS wakeup rules). Notably, this does not re-introduce the behaviour cccb45d7c4295 set out to solve, any start/stop cycle is naturally throttled by the timer period (no active cancel). Fixes: cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling") Reported-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: John Stultz <jstultz@google.com>
2025-09-25sched/deadline: Fix dl_server getting stuckPeter Zijlstra3-21/+2
John found it was easy to hit lockup warnings when running locktorture on a 2 CPU VM, which he bisected down to: commit cccb45d7c429 ("sched/deadline: Less agressive dl_server handling"). While debugging it seems there is a chance where we end up with the dl_server dequeued, with dl_se->dl_server_active. This causes dl_server_start() to return without enqueueing the dl_server, thus it fails to run when RT tasks starve the cpu. When this happens, dl_server_timer() catches the '!dl_se->server_has_tasks(dl_se)' case, which then calls replenish_dl_entity() and dl_server_stopped() and finally return HRTIMER_NO_RESTART. This ends in no new timer and also no enqueue, leaving the dl_server 'dead', allowing starvation. What should have happened is for the bandwidth timer to start the zero-laxity timer, which in turn would enqueue the dl_server and cause dl_se->server_pick_task() to be called -- which will stop the dl_server if no fair tasks are observed for a whole period. IOW, it is totally irrelevant if there are fair tasks at the moment of bandwidth refresh. This removes all dl_se->server_has_tasks() users, so remove the whole thing. Fixes: cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling") Reported-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: John Stultz <jstultz@google.com>
2025-09-23Revert "sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast()"Tejun Heo1-1/+1
This reverts commit c8191ee8e64a8c5c021a34e32868f2380965e82b which triggers the following suspicious RCU usage warning: [ 6.647598] ============================= [ 6.647603] WARNING: suspicious RCU usage [ 6.647605] 6.17.0-rc7-virtme #1 Not tainted [ 6.647608] ----------------------------- [ 6.647608] ./include/linux/rhashtable.h:602 suspicious rcu_dereference_check() usage! [ 6.647610] [ 6.647610] other info that might help us debug this: [ 6.647610] [ 6.647612] [ 6.647612] rcu_scheduler_active = 2, debug_locks = 1 [ 6.647613] 1 lock held by swapper/10/0: [ 6.647614] #0: ffff8b14bbb3cc98 (&rq->__lock){-.-.}-{2:2}, at: +raw_spin_rq_lock_nested+0x20/0x90 [ 6.647630] [ 6.647630] stack backtrace: [ 6.647633] CPU: 10 UID: 0 PID: 0 Comm: swapper/10 Not tainted 6.17.0-rc7-virtme #1 +PREEMPT(full) [ 6.647643] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 6.647646] Sched_ext: beerland_1.0.2_g27d63fc3_x86_64_unknown_linux_gnu (enabled+all) [ 6.647648] Call Trace: [ 6.647652] <IRQ> [ 6.647655] dump_stack_lvl+0x78/0xe0 [ 6.647665] lockdep_rcu_suspicious+0x14a/0x1b0 [ 6.647672] __rhashtable_lookup.constprop.0+0x1d5/0x250 [ 6.647680] find_dsq_for_dispatch+0xbc/0x190 [ 6.647684] do_enqueue_task+0x25b/0x550 [ 6.647689] enqueue_task_scx+0x21d/0x360 [ 6.647692] ? trace_lock_acquire+0x22/0xb0 [ 6.647695] enqueue_task+0x2e/0xd0 [ 6.647698] ttwu_do_activate+0xa2/0x290 [ 6.647703] sched_ttwu_pending+0xfd/0x250 [ 6.647706] __flush_smp_call_function_queue+0x1cd/0x610 [ 6.647714] __sysvec_call_function_single+0x34/0x150 [ 6.647720] sysvec_call_function_single+0x6e/0x80 [ 6.647726] </IRQ> [ 6.647726] <TASK> [ 6.647727] asm_sysvec_call_function_single+0x1a/0x20 Reported-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23sched_ext: Merge branch 'for-6.17-fixes' into for-6.18Tejun Heo3-7/+33
Pull sched_ext/for-6.17-fixes to receive: 55ed11b181c4 ("sched_ext: idle: Handle migration-disabled tasks in BPF code") which conflicts with the following commit in for-6.18: 2407bae23d1e ("sched_ext: Add the @sch parameter to ext_idle helpers") The conflict is a simple context conflict which can be resolved by taking the updated parts from both commits. Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23sched_ext: Misc updates around scx_sched instance pointerTejun Heo1-22/+40
In preparation for multiple scheduler support: - Add the @sch parameter to find_global_dsq() and refill_task_slice_dfl(). - Restructure scx_allow_ttwu_queue() and make it read scx_root into $sch. - Make RCU protection in scx_dsq_move() and scx_bpf_dsq_move_to_local() explicit. v2: Add scx_root -> sch conversion in scx_allow_ttwu_queue(). Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23sched_ext: Drop scx_kf_exit() and scx_kf_error()Tejun Heo2-63/+88
The intention behind scx_kf_exit/error() was that when called from kfuncs, scx_kf_exit/error() would be able to implicitly determine the scx_sched instance being operated on and thus wouldn't need the @sch parameter passed in explicitly. This turned out to be unnecessarily complicated to implement and not have enough practical benefits. Replace scx_kf_exit/error() usages with scx_exit/error() which take an explicit @sch parameter. - Add the @sch parameter to scx_kf_allowed(), scx_kf_allowed_on_arg_tasks, mark_direct_dispatch() and other intermediate functions transitively. - In callers that don't already have @sch available, grab RCU, read $scx_root, verify it's not NULL and use it. Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23sched_ext: Add the @sch parameter to scx_dsq_insert_preamble/commit()Tejun Heo1-7/+22
In preparation for multiple scheduler support, add the @sch parameter to scx_dsq_insert_preamble/commit() and update the callers to read $scx_root and pass it in. The passed in @sch parameter is not used yet. Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23sched_ext: Drop kf_cpu_valid()Tejun Heo2-31/+48
The intention behind kf_cpu_valid() was that when called from kfuncs, kf_cpu_valid() would be able to implicitly determine the scx_sched instance being operated on and thus wouldn't need @sch passed in explicitly. This turned out to be unnecessarily complicated to implement and not have justifiable practical benefits. Replace kf_cpu_valid() usages with ops_cpu_valid() which takes explicit @sch. Callers which don't have $sch available in the context are updated to read $scx_root under RCU read lock, verify that it's not NULL and pass it in. scx_bpf_cpu_rq() is restructured to use guard(rcu)() instead of explicit rcu_read_[un]lock(). Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23sched_ext: Add the @sch parameter to ext_idle helpersTejun Heo1-15/+94
In preparation for multiple scheduler support, add the @sch parameter to validate_node(), check_builtin_idle_enabled() and select_cpu_from_kfunc(), and update their callers to read $scx_root, verify that it's not NULL and pass it in. The passed in @sch parameter is not used yet. Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23sched_ext: Add the @sch parameter to __bstr_format()Tejun Heo1-7/+21
In preparation for multiple scheduler support, add the @sch parameter to __bstr_format() and update the callers to read $scx_root, verify that it's not NULL and pass it in. The passed in @sch parameter is not used yet. Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23sched_ext: Separate out scx_kick_cpu() and add @sch to itTejun Heo1-16/+27
In preparation for multiple scheduler support, separate out scx_kick_cpu() from scx_bpf_kick_cpu() and add the @sch parameter to it. scx_bpf_kick_cpu() now acquires an RCU read lock, reads $scx_root, and calls scx_kick_cpu() with it if non-NULL. The passed in @sch parameter is not used yet. Internal uses of scx_bpf_kick_cpu() are converted to scx_kick_cpu(). Where $sch is available, it's used. In the pick_task_scx() path where no associated scheduler can be identified, $scx_root is used directly. Note that $scx_root cannot be NULL in this case. Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23sched_ext: Add SCX_EFLAG_INITIALIZED to indicate successful ops.init()Tejun Heo2-0/+14
ops.exit() may be called even if the loading failed before ops.init() finishes successfully. This is because ops.exit() allows rich exit info communication. Add SCX_EFLAG_INITIALIZED flag to scx_exit_info.flags to indicate whether ops.init() finished successfully. This enables BPF schedulers to distinguish between exit scenarios and handle cleanup appropriately based on initialization state. Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23sched_ext: Use bitfields for boolean warning flagsTejun Heo1-2/+2
Convert warned_zero_slice and warned_deprecated_rq in scx_sched struct to single-bit bitfields. While this doesn't reduce struct size immediately, it prepares for future bitfield additions. v2: Update patch description. Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23sched_ext: Fix stray scx_root usage in task_can_run_on_remote_rq()Tejun Heo1-2/+1
task_can_run_on_remote_rq() takes @sch but it is using scx_root when incrementing SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE, which is inconsistent and gets in the way of implementing multiple scheduler support. Use @sch instead. As currently scx_root is the only possible scheduler instance, this doesn't cause any behavior changes. Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast()Tejun Heo1-1/+1
The find_user_dsq() function is called from contexts that are already under RCU read lock protection. Switch from rhashtable_lookup_fast() to rhashtable_lookup() to avoid redundant RCU locking. Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23sched_ext: Verify RCU protection in scx_bpf_cpu_curr()Andrea Righi1-1/+1
scx_bpf_cpu_curr() has been introduced to retrieve the current task of a given runqueue, allowing schedulers to interact with that task. The kfunc assumes that it is always called in an RCU context, but this is not always guaranteed and some BPF schedulers can trigger the following warning: WARNING: suspicious RCU usage sched_ext: BPF scheduler "cosmos_1.0.2_gd0e71ca_x86_64_unknown_linux_gnu_debug" enabled 6.17.0-rc1 #1-NixOS Not tainted ----------------------------- kernel/sched/ext.c:6415 suspicious rcu_dereference_check() usage! ... Call Trace: <IRQ> dump_stack_lvl+0x6f/0xb0 lockdep_rcu_suspicious.cold+0x4e/0x96 scx_bpf_cpu_curr+0x7e/0x80 bpf_prog_c68b2b6b6b1b0ff8_sched_timerfn+0xce/0x1dc bpf_timer_cb+0x7b/0x130 __hrtimer_run_queues+0x1ea/0x380 hrtimer_run_softirq+0x8c/0xd0 handle_softirqs+0xc9/0x3b0 __irq_exit_rcu+0x96/0xc0 irq_exit_rcu+0xe/0x20 sysvec_apic_timer_interrupt+0x73/0x80 </IRQ> <TASK> To address this, mark the kfunc with KF_RCU_PROTECTED, so the verifier can enforce its usage only inside RCU-protected sections. Note: this also requires commit 1512231b6cc86 ("bpf: Enforce RCU protection for KF_RCU_PROTECTED"), currently in bpf-next, to enforce the proper KF_RCU_PROTECTED. Fixes: 20b158094a1ad ("sched_ext: Introduce scx_bpf_cpu_curr()") Cc: Christian Loehle <christian.loehle@arm.com> Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-22Merge tag 'sched_ext-for-6.17-rc7-fixes' of ↵Linus Torvalds1-1/+27
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fix from jun Heo: "This contains a fix for sched_ext idle CPU selection that likely fixes a substantial performance regression. The scx_bpf_select_cpu_dfl/and() kfuncs were incorrectly detecting all tasks as migration-disabled when called outside ops.select_cpu(), causing them to always return -EBUSY instead of finding idle CPUs. The fix properly distinguishes between genuinely migration-disabled tasks vs. the current task whose migration is temporarily disabled by BPF execution" * tag 'sched_ext-for-6.17-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: idle: Handle migration-disabled tasks in BPF code
2025-09-22sched_ext: idle: Handle migration-disabled tasks in BPF codeAndrea Righi1-1/+27
When scx_bpf_select_cpu_dfl()/and() kfuncs are invoked outside of ops.select_cpu() we can't rely on @p->migration_disabled to determine if migration is disabled for the task @p. In fact, migration is always disabled for the current task while running BPF code: __bpf_prog_enter() disables migration and __bpf_prog_exit() re-enables it. To handle this, when @p->migration_disabled == 1, check whether @p is the current task. If so, migration was not disabled before entering the callback, otherwise migration was disabled. This ensures correct idle CPU selection in all cases. The behavior of ops.select_cpu() remains unchanged, because this callback is never invoked for the current task and migration-disabled tasks are always excluded. Example: without this change scx_bpf_select_cpu_and() called from ops.enqueue() always returns -EBUSY; with this change applied, it correctly returns idle CPUs. Fixes: 06efc9fe0b8de ("sched_ext: idle: Handle migration-disabled tasks in idle selection") Cc: stable@vger.kernel.org # v6.16+ Signed-off-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-18sched_ext: Add migration-disabled counter to error state dumpAndrea Righi1-1/+2
Include the task's migration-disabled counter when dumping task state during an error exit. This can help diagnose cases where tasks can get stuck, because they're unable to migrate elsewhere. tj: s/nomig/no_mig/ for readability and consistency with other keys. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17Merge tag 'sched_ext-for-6.17-rc6-fixes' of ↵Linus Torvalds2-6/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Fix build failure when !FAIR_GROUP_SCHED && EXT_GROUP_SCHED - Revert "sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()" which was causing issues with per-CPU task scheduling and reenqueuing behavior * tag 'sched_ext-for-6.17-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext, sched/core: Fix build failure when !FAIR_GROUP_SCHED && EXT_GROUP_SCHED Revert "sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()"
2025-09-16sched_ext, sched/core: Fix build failure when !FAIR_GROUP_SCHED && ↵Tejun Heo1-1/+1
EXT_GROUP_SCHED While collecting SCX related fields in struct task_group into struct scx_task_group, 6e6558a6bc41 ("sched_ext, sched/core: Factor out struct scx_task_group") forgot update tg->scx_weight usage in tg_weight(), which leads to build failure when CONFIG_FAIR_GROUP_SCHED is disabled but CONFIG_EXT_GROUP_SCHED is enabled. Fix it. Fixes: 6e6558a6bc41 ("sched_ext, sched/core: Factor out struct scx_task_group") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202509170230.MwZsJSWa-lkp@intel.com/ Tested-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-16Revert "sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()"Andrea Righi1-5/+1
scx_bpf_reenqueue_local() can be called from ops.cpu_release() when a CPU is taken by a higher scheduling class to give tasks queued to the CPU's local DSQ a chance to be migrated somewhere else, instead of waiting indefinitely for that CPU to become available again. In doing so, we decided to skip migration-disabled tasks, under the assumption that they cannot be migrated anyway. However, when a higher scheduling class preempts a CPU, the running task is always inserted at the head of the local DSQ as a migration-disabled task. This means it is always skipped by scx_bpf_reenqueue_local(), and ends up being confined to the same CPU even if that CPU is heavily contended by other higher scheduling class tasks. As an example, let's consider the following scenario: $ schedtool -a 0,1, -e yes > /dev/null $ sudo schedtool -F -p 99 -a 0, -e \ stress-ng -c 1 --cpu-load 99 --cpu-load-slice 1000 The first task (SCHED_EXT) can run on CPU0 or CPU1. The second task (SCHED_FIFO) is pinned to CPU0 and consumes ~99% of it. If the SCHED_EXT task initially runs on CPU0, it will remain there because it always sees CPU0 as "idle" in the short gaps left by the RT task, resulting in ~1% utilization while CPU1 stays idle: 0[||||||||||||||||||||||100.0%] 8[ 0.0%] 1[ 0.0%] 9[ 0.0%] 2[ 0.0%] 10[ 0.0%] 3[ 0.0%] 11[ 0.0%] 4[ 0.0%] 12[ 0.0%] 5[ 0.0%] 13[ 0.0%] 6[ 0.0%] 14[ 0.0%] 7[ 0.0%] 15[ 0.0%] PID USER PRI NI S CPU CPU%▽MEM% TIME+ Command 1067 root RT 0 R 0 99.0 0.2 0:31.16 stress-ng-cpu [run] 975 arighi 20 0 R 0 1.0 0.0 0:26.32 yes By allowing scx_bpf_reenqueue_local() to re-enqueue migration-disabled tasks, the scheduler can choose to migrate them to other CPUs (CPU1 in this case) via ops.enqueue(), leading to better CPU utilization: 0[||||||||||||||||||||||100.0%] 8[ 0.0%] 1[||||||||||||||||||||||100.0%] 9[ 0.0%] 2[ 0.0%] 10[ 0.0%] 3[ 0.0%] 11[ 0.0%] 4[ 0.0%] 12[ 0.0%] 5[ 0.0%] 13[ 0.0%] 6[ 0.0%] 14[ 0.0%] 7[ 0.0%] 15[ 0.0%] PID USER PRI NI S CPU CPU%▽MEM% TIME+ Command 577 root RT 0 R 0 100.0 0.2 0:23.17 stress-ng-cpu [run] 555 arighi 20 0 R 1 100.0 0.0 0:28.67 yes It's debatable whether per-CPU tasks should be re-enqueued as well, but doing so is probably safer: the scheduler can recognize re-enqueued tasks through the %SCX_ENQ_REENQ flag, reassess their placement, and either put them back at the head of the local DSQ or let another task attempt to take the CPU. This also prevents giving per-CPU tasks an implicit priority boost, which would otherwise make them more likely to reclaim CPUs preempted by higher scheduling classes. Fixes: 97e13ecb02668 ("sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()") Cc: stable@vger.kernel.org # v6.15+ Signed-off-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-15sched/fair: Do not balance task to a throttled cfs_rqAaron Lu1-4/+18
When doing load balance and the target cfs_rq is in throttled hierarchy, whether to allow balancing there is a question. The good side to allow balancing is: if the target CPU is idle or less loaded and the being balanced task is holding some kernel resources, then it seems a good idea to balance the task there and let the task get the CPU earlier and release kernel resources sooner. The bad part is, if the task is not holding any kernel resources, then the balance seems not that useful. While theoretically it's debatable, a performance test[0] which involves 200 cgroups and each cgroup runs hackbench(20 sender, 20 receiver) in pipe mode showed a performance degradation on AMD Genoa when allowing load balance to throttled cfs_rq. Analysis[1] showed hackbench doesn't like task migration across LLC boundary. For this reason, add a check in can_migrate_task() to forbid balancing to a cfs_rq that is in throttled hierarchy. This reduced task migration a lot and performance restored. [0]: https://lore.kernel.org/lkml/20250822110701.GB289@bytedance/ [1]: https://lore.kernel.org/lkml/20250903101102.GB42@bytedance/ Signed-off-by: Aaron Lu <ziqianlu@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
2025-09-15sched/fair: Do not special case tasks in throttled hierarchyAaron Lu1-4/+4
With the introduction of task based throttle model, task in a throttled hierarchy is allowed to continue to run till it gets throttled on its ret2user path. For this reason, remove those throttled_hierarchy() checks in the following functions so that those tasks can get their turn as normal tasks: dequeue_entities(), check_preempt_wakeup_fair() and yield_to_task_fair(). The benefit of doing it this way is: if those tasks gets the chance to run earlier and if they hold any kernel resources, they can release those resources earlier. The downside is, if they don't hold any kernel resouces, all they can do is to throttle themselves on their way back to user space so the favor to let them run seems not that useful and for check_preempt_wakeup_fair(), that favor may be bad for curr. K Prateek Nayak pointed out prio_changed_fair() can send a throttled task to check_preempt_wakeup_fair(), further tests showed the affinity change path from move_queued_task() can also send a throttled task to check_preempt_wakeup_fair(), that's why the check of task_is_throttled() in that function. Signed-off-by: Aaron Lu <ziqianlu@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-09-15sched/fair: update_cfs_group() for throttled cfs_rqsAaron Lu1-3/+0
With task based throttle model, tasks in a throttled hierarchy are allowed to continue to run if they are running in kernel mode. For this reason, PELT clock is not stopped for these cfs_rqs in throttled hierarchy when they still have tasks running or queued. Since PELT clock is not stopped, whether to allow update_cfs_group() doing its job for cfs_rqs which are in throttled hierarchy but still have tasks running/queued is a question. The good side is, continue to run update_cfs_group() can get these cfs_rq entities with an up2date weight and that up2date weight can be useful to derive an accurate load for the CPU as well as ensure fairness if multiple tasks of different cgroups are running on the same CPU. OTOH, as Benjamin Segall pointed: when unthrottle comes around the most likely correct distribution is the distribution we had at the time of throttle. In reality, either way may not matter that much if tasks in throttled hierarchy don't run in kernel mode for too long. But in case that happens, let these cfs_rq entities have an up2date weight seems a good thing to do. Signed-off-by: Aaron Lu <ziqianlu@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-09-15sched/fair: Propagate load for throttled cfs_rqAaron Lu1-8/+18
Before task based throttle model, propagating load will stop at a throttled cfs_rq and that propagate will happen on unthrottle time by update_load_avg(). Now that there is no update_load_avg() on unthrottle for throttled cfs_rq and all load tracking is done by task related operations, let the propagate happen immediately. While at it, add a comment to explain why cfs_rqs that are not affected by throttle have to be added to leaf cfs_rq list in propagate_entity_cfs_rq() per my understanding of commit 0258bdfaff5b ("sched/fair: Fix unfairness caused by missing load decay"). Signed-off-by: Aaron Lu <ziqianlu@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
2025-09-13mm: replace (20 - PAGE_SHIFT) with common macros for pages<->MB conversionYe Liu1-3/+2
Replace repeated (20 - PAGE_SHIFT) calculations with standard macros: - MB_TO_PAGES(mb) converts MB to page count - PAGES_TO_MB(pages) converts pages to MB No functional change. [akpm@linux-foundation.org: remove arc's private PAGES_TO_MB, remove its unused PAGES_TO_KB] [akpm@linux-foundation.org: don't include mm.h due to include file ordering mess] Link: https://lkml.kernel.org/r/20250718024134.1304745-1-ye.liu@linux.dev Signed-off-by: Ye Liu <liuye@kylinos.cn> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Ben Segall <bsegall@google.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lai jiangshan <jiangshanlai@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mel Gorman <mgorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Neeraj Upadhyay <neeraj.upadhyay@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13mm: memory-tiering: fix PGPROMOTE_CANDIDATE countingRuan Shiyang1-2/+3
Goto-san reported confusing pgpromote statistics where the pgpromote_success count significantly exceeded pgpromote_candidate. On a system with three nodes (nodes 0-1: DRAM 4GB, node 2: NVDIMM 4GB): # Enable demotion only echo 1 > /sys/kernel/mm/numa/demotion_enabled numactl -m 0-1 memhog -r200 3500M >/dev/null & pid=$! sleep 2 numactl memhog -r100 2500M >/dev/null & sleep 10 kill -9 $pid # terminate the 1st memhog # Enable promotion echo 2 > /proc/sys/kernel/numa_balancing After a few seconds, we observeed `pgpromote_candidate < pgpromote_success` $ grep -e pgpromote /proc/vmstat pgpromote_success 2579 pgpromote_candidate 0 In this scenario, after terminating the first memhog, the conditions for pgdat_free_space_enough() are quickly met, and triggers promotion. However, these migrated pages are only counted for in PGPROMOTE_SUCCESS, not in PGPROMOTE_CANDIDATE. To solve these confusing statistics, introduce PGPROMOTE_CANDIDATE_NRL to count the missed promotion pages. And also, not counting these pages into PGPROMOTE_CANDIDATE is to avoid changing the existing algorithm or performance of the promotion rate limit. Link: https://lkml.kernel.org/r/20250901090122.124262-1-ruansy.fnst@fujitsu.com Link: https://lkml.kernel.org/r/20250729035101.1601407-1-ruansy.fnst@fujitsu.com Fixes: c6833e10008f ("memory tiering: rate limit NUMA migration throughput") Co-developed-by: Li Zhijian <lizhijian@fujitsu.com> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> Signed-off-by: Ruan Shiyang <ruansy.fnst@fujitsu.com> Reported-by: Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com> Suggested-by: Huang Ying <ying.huang@linux.alibaba.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Huang Ying <ying.huang@linux.alibaba.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-09sched/core: Avoid direct access to hrtimer clockbaseThomas Weißschuh1-1/+1
The field timer->base->get_time is a private implementation detail and should not be accessed outside of the hrtimer core. Switch to the equivalent helper. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/20250821-hrtimer-cleanup-get_time-v2-3-3ae822e5bfbd@linutronix.de
2025-09-04sched_ext: Fix NULL dereference in scx_bpf_cpu_rq() warningAndrea Righi1-2/+5
When printing the deprecation warning for scx_bpf_cpu_rq(), we may hit a NULL pointer dereference if the kfunc is called before a BPF scheduler is fully attached, for example, when invoked from a BPF timer or during ops.init(): [ 50.752775] BUG: kernel NULL pointer dereference, address: 0000000000000331 ... [ 50.764205] RIP: 0010:scx_bpf_cpu_rq+0x30/0xa0 ... [ 50.787661] Call Trace: [ 50.788398] <TASK> [ 50.789061] bpf_prog_08f7fd2dcb187aaf_wakeup_timerfn+0x75/0x1a8 [ 50.792477] bpf_timer_cb+0x7e/0x140 [ 50.796003] hrtimer_run_softirq+0x91/0xe0 [ 50.796952] handle_softirqs+0xce/0x3c0 [ 50.799087] run_ksoftirqd+0x3e/0x70 [ 50.800197] smpboot_thread_fn+0x133/0x290 [ 50.802320] kthread+0x115/0x220 [ 50.804984] ret_from_fork+0x17a/0x1d0 [ 50.806920] ret_from_fork_asm+0x1a/0x30 [ 50.807799] </TASK> Fix this by only printing the warning once the scheduler is fully registered. Fixes: 5c48d88fe0049 ("sched_ext: deprecation warn for scx_bpf_cpu_rq()") Cc: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>