path: root/kernel/sched/ext.c
2025-10-16  sched_ext: fix flag check for deferred callbacks  (Emil Tsalapatis; 1 file, -1/+1)
When scheduling the deferred balance callbacks, check SCX_RQ_BAL_CB_PENDING instead of SCX_RQ_BAL_PENDING. This way schedule_deferred() properly tests whether there is already a pending request for queue_balance_callback() to be invoked at the end of .balance(). Fixes: a8ad873113d3 ("sched_ext: defer queue_balance_callback() until after ops.dispatch") Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
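For illustration, a minimal sketch of the corrected check described above. The flag names come from the changelog; the function shape around them is an assumption, not the verbatim kernel code.

	static void schedule_deferred(struct rq *rq)
	{
		lockdep_assert_rq_held(rq);

		/*
		 * Test whether a queue_balance_callback() request is already
		 * pending (SCX_RQ_BAL_CB_PENDING), rather than whether
		 * .balance() is merely in progress (SCX_RQ_BAL_PENDING).
		 */
		if (rq->scx.flags & SCX_RQ_BAL_CB_PENDING)
			return;

		rq->scx.flags |= SCX_RQ_BAL_CB_PENDING;
		/* the callback itself is queued at the end of .balance() */
	}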
2025-10-14  sched_ext: Fix scx_kick_pseqs corruption on concurrent scheduler loads  (Andrea Righi; 1 file, -5/+5)
If we load a BPF scheduler while another scheduler is already running, alloc_kick_pseqs() would be called again, overwriting the previously allocated arrays. Fix by moving the alloc_kick_pseqs() call after the scx_enable_state() check, ensuring that the arrays are only allocated when a scheduler can actually be loaded. Fixes: 14c1da3895a11 ("sched_ext: Allocate scx_kick_cpus_pnt_seqs lazily using kvzalloc()") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
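The shape of the fix, as described above: perform the enable-state check first so the arrays are only allocated when a scheduler can actually be loaded. A rough sketch (error label and return code assumed):

	/* in scx_enable(), before any per-load allocations */
	if (scx_enable_state() != SCX_DISABLED) {
		ret = -EBUSY;		/* another scheduler is already loaded */
		goto err_unlock;
	}

	ret = alloc_kick_pseqs();	/* no longer clobbers a live scheduler's arrays */
	if (ret)
		goto err_unlock;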
2025-10-13  sched_ext: Allocate scx_kick_cpus_pnt_seqs lazily using kvzalloc()  (Tejun Heo; 1 file, -10/+79)
On systems with >4096 CPUs, scx_kick_cpus_pnt_seqs allocation fails during boot because it exceeds the 32,768 byte percpu allocator limit. Restructure to use DEFINE_PER_CPU() for the per-CPU pointers, with each CPU pointing to its own kvzalloc'd array. Move allocation from boot time to scx_enable() and free in scx_disable(), so the O(nr_cpu_ids^2) memory is only consumed when sched_ext is active. Use RCU to guard against racing with free. Arrays are freed via call_rcu() and kick_cpus_irq_workfn() uses rcu_dereference_bh() with a NULL check. While at it, rename to scx_kick_pseqs for brevity and update comments to clarify these are pick_task sequence numbers. v2: RCU protect scx_kick_seqs to manage kick_cpus_irq_workfn() racing against disable as per Andrea. v3: Fix bugs noticed by Andrea. Reported-by: Phil Auld <pauld@redhat.com> Link: http://lkml.kernel.org/r/20251007133523.GA93086@pauld.westford.csb Cc: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Reviewed-by: Phil Auld <pauld@redhat.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
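A condensed sketch of the scheme described above. The variable names follow the changelog; the exact field layout, annotations and error handling are assumptions.

	static DEFINE_PER_CPU(unsigned long __rcu *, scx_kick_pseqs);

	static int alloc_kick_pseqs(void)
	{
		int cpu;

		for_each_possible_cpu(cpu) {
			unsigned long *pseqs;

			/* one kvzalloc'd array per CPU instead of one huge percpu chunk */
			pseqs = kvzalloc(nr_cpu_ids * sizeof(*pseqs), GFP_KERNEL);
			if (!pseqs)
				return -ENOMEM;	/* earlier allocations freed on the error path, elided */
			rcu_assign_pointer(per_cpu(scx_kick_pseqs, cpu), pseqs);
		}
		return 0;
	}

	/* reader side, e.g. in kick_cpus_irq_workfn() */
	unsigned long *pseqs = rcu_dereference_bh(*this_cpu_ptr(&scx_kick_pseqs));

	if (!pseqs)
		return;		/* raced with scx_disable() freeing the arrays */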
2025-10-13  sched_ext: defer queue_balance_callback() until after ops.dispatch  (Emil Tsalapatis; 1 file, -2/+27)
The sched_ext code calls queue_balance_callback() during enqueue_task() to defer operations that drop multiple locks until we can unpin them. The call assumes that the rq lock is held until the callbacks are invoked, and the pending callbacks will not be visible to any other threads. This is enforced by a WARN_ON_ONCE() in rq_pin_lock(). However, balance_one() may actually drop the lock during a BPF dispatch call. Another thread may win the race to get the rq lock and see the pending callback. To avoid this, sched_ext must only queue the callback after the dispatch calls have completed.

   CPU 0                       CPU 1                       CPU 2
   scx_balance()
     rq_unpin_lock()
     scx_balance_one()
       |= IN_BALANCE
                               scx_enqueue()
       ops.dispatch()
         rq_unlock()
                                 rq_lock()
                                 queue_balance_callback()
                                 rq_unlock()
                                                           [WARN] rq_pin_lock()
         rq_lock()
       &= ~IN_BALANCE
     rq_repin_lock()

Changelog v2 -> v1 (https://lore.kernel.org/sched-ext/aOgOxtHCeyRT_7jn@gpd4)

- Fixed explanation in patch description (Andrea)
- Fixed scx_rq mask state updates (Andrea)
- Added Reviewed-by tag from Andrea

Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-13  sched_ext: Sync error_irq_work before freeing scx_sched  (Tejun Heo; 1 file, -0/+2)
By the time scx_sched_free_rcu_work() runs, the scx_sched is no longer reachable. However, a previously queued error_irq_work may still be pending or running. Ensure it completes before proceeding with teardown. Fixes: bff3b5aec1b7 ("sched_ext: Move disable machinery into scx_sched") Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
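Roughly the ordering this enforces (struct member names assumed, not the verbatim kernel code):

	static void scx_sched_free_rcu_work(struct work_struct *work)
	{
		struct scx_sched *sch = container_of(work, struct scx_sched, rcu_work.work);

		/*
		 * The scx_sched is unreachable by now, but a previously queued
		 * error_irq_work may still be pending or running. Wait for it
		 * before any teardown it could race with.
		 */
		irq_work_sync(&sch->error_irq_work);

		/* ... existing teardown: free DSQs, percpu data, etc. ... */
	}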
2025-10-13  sched_ext: Mark scx_bpf_dsq_move_set_[slice|vtime]() with KF_RCU  (Tejun Heo; 1 file, -4/+4)
scx_bpf_dsq_move_set_slice() and scx_bpf_dsq_move_set_vtime() take a DSQ iterator argument which has to be valid. Mark them with KF_RCU. Fixes: 4c30f5ce4f7a ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()") Cc: stable@vger.kernel.org # v6.12+ Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
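The marking itself is a one-line change per kfunc in the BTF ID set; KF_RCU tells the verifier that the pointer argument (the DSQ iterator here) must be a valid, RCU-protected pointer. Roughly:

	BTF_ID_FLAGS(func, scx_bpf_dsq_move_set_slice, KF_RCU)
	BTF_ID_FLAGS(func, scx_bpf_dsq_move_set_vtime, KF_RCU)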
2025-09-23Revert "sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast()"Tejun Heo1-1/+1
This reverts commit c8191ee8e64a8c5c021a34e32868f2380965e82b which triggers the following suspicious RCU usage warning:

  [    6.647598] =============================
  [    6.647603] WARNING: suspicious RCU usage
  [    6.647605] 6.17.0-rc7-virtme #1 Not tainted
  [    6.647608] -----------------------------
  [    6.647608] ./include/linux/rhashtable.h:602 suspicious rcu_dereference_check() usage!
  [    6.647610]
  [    6.647610] other info that might help us debug this:
  [    6.647610]
  [    6.647612]
  [    6.647612] rcu_scheduler_active = 2, debug_locks = 1
  [    6.647613] 1 lock held by swapper/10/0:
  [    6.647614]  #0: ffff8b14bbb3cc98 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x20/0x90
  [    6.647630]
  [    6.647630] stack backtrace:
  [    6.647633] CPU: 10 UID: 0 PID: 0 Comm: swapper/10 Not tainted 6.17.0-rc7-virtme #1 PREEMPT(full)
  [    6.647643] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  [    6.647646] Sched_ext: beerland_1.0.2_g27d63fc3_x86_64_unknown_linux_gnu (enabled+all)
  [    6.647648] Call Trace:
  [    6.647652]  <IRQ>
  [    6.647655]  dump_stack_lvl+0x78/0xe0
  [    6.647665]  lockdep_rcu_suspicious+0x14a/0x1b0
  [    6.647672]  __rhashtable_lookup.constprop.0+0x1d5/0x250
  [    6.647680]  find_dsq_for_dispatch+0xbc/0x190
  [    6.647684]  do_enqueue_task+0x25b/0x550
  [    6.647689]  enqueue_task_scx+0x21d/0x360
  [    6.647692]  ? trace_lock_acquire+0x22/0xb0
  [    6.647695]  enqueue_task+0x2e/0xd0
  [    6.647698]  ttwu_do_activate+0xa2/0x290
  [    6.647703]  sched_ttwu_pending+0xfd/0x250
  [    6.647706]  __flush_smp_call_function_queue+0x1cd/0x610
  [    6.647714]  __sysvec_call_function_single+0x34/0x150
  [    6.647720]  sysvec_call_function_single+0x6e/0x80
  [    6.647726]  </IRQ>
  [    6.647726]  <TASK>
  [    6.647727]  asm_sysvec_call_function_single+0x1a/0x20

Reported-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23  sched_ext: Merge branch 'for-6.17-fixes' into for-6.18  (Tejun Heo; 1 file, -5/+5)
Pull sched_ext/for-6.17-fixes to receive: 55ed11b181c4 ("sched_ext: idle: Handle migration-disabled tasks in BPF code") which conflicts with the following commit in for-6.18: 2407bae23d1e ("sched_ext: Add the @sch parameter to ext_idle helpers") The conflict is a simple context conflict which can be resolved by taking the updated parts from both commits. Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23  sched_ext: Misc updates around scx_sched instance pointer  (Tejun Heo; 1 file, -22/+40)
In preparation for multiple scheduler support:

- Add the @sch parameter to find_global_dsq() and refill_task_slice_dfl().
- Restructure scx_allow_ttwu_queue() and make it read scx_root into $sch.
- Make RCU protection in scx_dsq_move() and scx_bpf_dsq_move_to_local() explicit.

v2: Add scx_root -> sch conversion in scx_allow_ttwu_queue().

Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23  sched_ext: Drop scx_kf_exit() and scx_kf_error()  (Tejun Heo; 1 file, -54/+72)
The intention behind scx_kf_exit/error() was that when called from kfuncs, scx_kf_exit/error() would be able to implicitly determine the scx_sched instance being operated on and thus wouldn't need the @sch parameter passed in explicitly. This turned out to be unnecessarily complicated to implement and not have enough practical benefits. Replace scx_kf_exit/error() usages with scx_exit/error() which take an explicit @sch parameter.

- Add the @sch parameter to scx_kf_allowed(), scx_kf_allowed_on_arg_tasks(), mark_direct_dispatch() and other intermediate functions transitively.
- In callers that don't already have @sch available, grab RCU, read $scx_root, verify it's not NULL and use it.

Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23  sched_ext: Add the @sch parameter to scx_dsq_insert_preamble/commit()  (Tejun Heo; 1 file, -7/+22)
In preparation for multiple scheduler support, add the @sch parameter to scx_dsq_insert_preamble/commit() and update the callers to read $scx_root and pass it in. The passed in @sch parameter is not used yet. Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23  sched_ext: Drop kf_cpu_valid()  (Tejun Heo; 1 file, -27/+40)
The intention behind kf_cpu_valid() was that when called from kfuncs, kf_cpu_valid() would be able to implicitly determine the scx_sched instance being operated on and thus wouldn't need @sch passed in explicitly. This turned out to be unnecessarily complicated to implement and not have justifiable practical benefits. Replace kf_cpu_valid() usages with ops_cpu_valid() which takes explicit @sch. Callers which don't have $sch available in the context are updated to read $scx_root under RCU read lock, verify that it's not NULL and pass it in. scx_bpf_cpu_rq() is restructured to use guard(rcu)() instead of explicit rcu_read_[un]lock(). Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23  sched_ext: Add the @sch parameter to __bstr_format()  (Tejun Heo; 1 file, -7/+21)
In preparation for multiple scheduler support, add the @sch parameter to __bstr_format() and update the callers to read $scx_root, verify that it's not NULL and pass it in. The passed in @sch parameter is not used yet. Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23  sched_ext: Separate out scx_kick_cpu() and add @sch to it  (Tejun Heo; 1 file, -16/+27)
In preparation for multiple scheduler support, separate out scx_kick_cpu() from scx_bpf_kick_cpu() and add the @sch parameter to it. scx_bpf_kick_cpu() now acquires an RCU read lock, reads $scx_root, and calls scx_kick_cpu() with it if non-NULL. The passed in @sch parameter is not used yet. Internal uses of scx_bpf_kick_cpu() are converted to scx_kick_cpu(). Where $sch is available, it's used. In the pick_task_scx() path where no associated scheduler can be identified, $scx_root is used directly. Note that $scx_root cannot be NULL in this case. Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23  sched_ext: Add SCX_EFLAG_INITIALIZED to indicate successful ops.init()  (Tejun Heo; 1 file, -0/+1)
ops.exit() may be called even if the loading failed before ops.init() finishes successfully. This is because ops.exit() allows rich exit info communication. Add SCX_EFLAG_INITIALIZED flag to scx_exit_info.flags to indicate whether ops.init() finished successfully. This enables BPF schedulers to distinguish between exit scenarios and handle cleanup appropriately based on initialization state. Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
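On the BPF side this lets ops.exit() tell an aborted load apart from a normal shutdown. A sketch of the intended usage (BPF_STRUCT_OPS is the helper macro from the scx tooling; the scheduler name is made up):

	void BPF_STRUCT_OPS(myops_exit, struct scx_exit_info *ei)
	{
		if (!(ei->flags & SCX_EFLAG_INITIALIZED)) {
			/* ops.init() never completed; skip cleanup of state
			 * that only exists after a successful init. */
			return;
		}

		/* normal teardown for a fully initialized scheduler */
	}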
2025-09-23  sched_ext: Fix stray scx_root usage in task_can_run_on_remote_rq()  (Tejun Heo; 1 file, -2/+1)
task_can_run_on_remote_rq() takes @sch but it is using scx_root when incrementing SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE, which is inconsistent and gets in the way of implementing multiple scheduler support. Use @sch instead. As currently scx_root is the only possible scheduler instance, this doesn't cause any behavior changes. Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23  sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast()  (Tejun Heo; 1 file, -1/+1)
The find_user_dsq() function is called from contexts that are already under RCU read lock protection. Switch from rhashtable_lookup_fast() to rhashtable_lookup() to avoid redundant RCU locking. Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23  sched_ext: Verify RCU protection in scx_bpf_cpu_curr()  (Andrea Righi; 1 file, -1/+1)
scx_bpf_cpu_curr() has been introduced to retrieve the current task of a given runqueue, allowing schedulers to interact with that task. The kfunc assumes that it is always called in an RCU context, but this is not always guaranteed and some BPF schedulers can trigger the following warning:

  WARNING: suspicious RCU usage
  sched_ext: BPF scheduler "cosmos_1.0.2_gd0e71ca_x86_64_unknown_linux_gnu_debug" enabled
  6.17.0-rc1 #1-NixOS Not tainted
  -----------------------------
  kernel/sched/ext.c:6415 suspicious rcu_dereference_check() usage!
  ...
  Call Trace:
   <IRQ>
   dump_stack_lvl+0x6f/0xb0
   lockdep_rcu_suspicious.cold+0x4e/0x96
   scx_bpf_cpu_curr+0x7e/0x80
   bpf_prog_c68b2b6b6b1b0ff8_sched_timerfn+0xce/0x1dc
   bpf_timer_cb+0x7b/0x130
   __hrtimer_run_queues+0x1ea/0x380
   hrtimer_run_softirq+0x8c/0xd0
   handle_softirqs+0xc9/0x3b0
   __irq_exit_rcu+0x96/0xc0
   irq_exit_rcu+0xe/0x20
   sysvec_apic_timer_interrupt+0x73/0x80
   </IRQ>
  <TASK>

To address this, mark the kfunc with KF_RCU_PROTECTED, so the verifier can enforce its usage only inside RCU-protected sections.

Note: this also requires commit 1512231b6cc86 ("bpf: Enforce RCU protection for KF_RCU_PROTECTED"), currently in bpf-next, to enforce the proper KF_RCU_PROTECTED behavior.

Fixes: 20b158094a1ad ("sched_ext: Introduce scx_bpf_cpu_curr()")
Cc: Christian Loehle <christian.loehle@arm.com>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-18  sched_ext: Add migration-disabled counter to error state dump  (Andrea Righi; 1 file, -1/+2)
Include the task's migration-disabled counter when dumping task state during an error exit. This can help diagnose cases where tasks can get stuck, because they're unable to migrate elsewhere. tj: s/nomig/no_mig/ for readability and consistency with other keys. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-16Revert "sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()"Andrea Righi1-5/+1
scx_bpf_reenqueue_local() can be called from ops.cpu_release() when a CPU is taken by a higher scheduling class to give tasks queued to the CPU's local DSQ a chance to be migrated somewhere else, instead of waiting indefinitely for that CPU to become available again. In doing so, we decided to skip migration-disabled tasks, under the assumption that they cannot be migrated anyway. However, when a higher scheduling class preempts a CPU, the running task is always inserted at the head of the local DSQ as a migration-disabled task. This means it is always skipped by scx_bpf_reenqueue_local(), and ends up being confined to the same CPU even if that CPU is heavily contended by other higher scheduling class tasks. As an example, let's consider the following scenario:

  $ schedtool -a 0,1, -e yes > /dev/null
  $ sudo schedtool -F -p 99 -a 0, -e \
      stress-ng -c 1 --cpu-load 99 --cpu-load-slice 1000

The first task (SCHED_EXT) can run on CPU0 or CPU1. The second task (SCHED_FIFO) is pinned to CPU0 and consumes ~99% of it. If the SCHED_EXT task initially runs on CPU0, it will remain there because it always sees CPU0 as "idle" in the short gaps left by the RT task, resulting in ~1% utilization while CPU1 stays idle:

  0[||||||||||||||||||||||100.0%]   8[ 0.0%]
  1[ 0.0%]                          9[ 0.0%]
  2[ 0.0%]                         10[ 0.0%]
  3[ 0.0%]                         11[ 0.0%]
  4[ 0.0%]                         12[ 0.0%]
  5[ 0.0%]                         13[ 0.0%]
  6[ 0.0%]                         14[ 0.0%]
  7[ 0.0%]                         15[ 0.0%]

    PID USER     PRI  NI  S  CPU  CPU%▽ MEM%  TIME+    Command
   1067 root      RT   0  R    0   99.0  0.2  0:31.16  stress-ng-cpu [run]
    975 arighi    20   0  R    0    1.0  0.0  0:26.32  yes

By allowing scx_bpf_reenqueue_local() to re-enqueue migration-disabled tasks, the scheduler can choose to migrate them to other CPUs (CPU1 in this case) via ops.enqueue(), leading to better CPU utilization:

  0[||||||||||||||||||||||100.0%]   8[ 0.0%]
  1[||||||||||||||||||||||100.0%]   9[ 0.0%]
  2[ 0.0%]                         10[ 0.0%]
  3[ 0.0%]                         11[ 0.0%]
  4[ 0.0%]                         12[ 0.0%]
  5[ 0.0%]                         13[ 0.0%]
  6[ 0.0%]                         14[ 0.0%]
  7[ 0.0%]                         15[ 0.0%]

    PID USER     PRI  NI  S  CPU  CPU%▽ MEM%  TIME+    Command
    577 root      RT   0  R    0  100.0  0.2  0:23.17  stress-ng-cpu [run]
    555 arighi    20   0  R    1  100.0  0.0  0:28.67  yes

It's debatable whether per-CPU tasks should be re-enqueued as well, but doing so is probably safer: the scheduler can recognize re-enqueued tasks through the %SCX_ENQ_REENQ flag, reassess their placement, and either put them back at the head of the local DSQ or let another task attempt to take the CPU. This also prevents giving per-CPU tasks an implicit priority boost, which would otherwise make them more likely to reclaim CPUs preempted by higher scheduling classes.

Fixes: 97e13ecb02668 ("sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()")
Cc: stable@vger.kernel.org # v6.15+
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-04  sched_ext: Fix NULL dereference in scx_bpf_cpu_rq() warning  (Andrea Righi; 1 file, -2/+5)
When printing the deprecation warning for scx_bpf_cpu_rq(), we may hit a NULL pointer dereference if the kfunc is called before a BPF scheduler is fully attached, for example, when invoked from a BPF timer or during ops.init():

  [   50.752775] BUG: kernel NULL pointer dereference, address: 0000000000000331
  ...
  [   50.764205] RIP: 0010:scx_bpf_cpu_rq+0x30/0xa0
  ...
  [   50.787661] Call Trace:
  [   50.788398]  <TASK>
  [   50.789061]  bpf_prog_08f7fd2dcb187aaf_wakeup_timerfn+0x75/0x1a8
  [   50.792477]  bpf_timer_cb+0x7e/0x140
  [   50.796003]  hrtimer_run_softirq+0x91/0xe0
  [   50.796952]  handle_softirqs+0xce/0x3c0
  [   50.799087]  run_ksoftirqd+0x3e/0x70
  [   50.800197]  smpboot_thread_fn+0x133/0x290
  [   50.802320]  kthread+0x115/0x220
  [   50.804984]  ret_from_fork+0x17a/0x1d0
  [   50.806920]  ret_from_fork_asm+0x1a/0x30
  [   50.807799]  </TASK>

Fix this by only printing the warning once the scheduler is fully registered.

Fixes: 5c48d88fe0049 ("sched_ext: deprecation warn for scx_bpf_cpu_rq()")
Cc: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
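The shape of the fix: emit the deprecation warning only once a scheduler is fully registered, so the warning path cannot dereference state that isn't set up yet. A sketch; the exact condition used by the kernel is assumed:

	__bpf_kfunc struct rq *scx_bpf_cpu_rq(s32 cpu)
	{
		/* ... CPU validity check elided ... */

		/* only warn once a scheduler is fully registered */
		if (scx_enable_state() == SCX_ENABLED)
			pr_warn_once("%s() is deprecated; use scx_bpf_locked_rq() or scx_bpf_cpu_curr() instead\n",
				     __func__);

		return cpu_rq(cpu);
	}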
2025-09-03  sched_ext: deprecation warn for scx_bpf_cpu_rq()  (Christian Loehle; 1 file, -0/+9)
scx_bpf_cpu_rq() works on an unlocked rq which generally isn't safe. For the common use-cases scx_bpf_locked_rq() and scx_bpf_cpu_curr() work, so add a deprecation warning to scx_bpf_cpu_rq() so it can eventually be removed. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-03  sched_ext: Introduce scx_bpf_cpu_curr()  (Christian Loehle; 1 file, -0/+14)
Provide scx_bpf_cpu_curr() as a way for scx schedulers to check the curr task of a remote rq without assuming its lock is held. Many scx schedulers make use of scx_bpf_cpu_rq() to check a remote curr (e.g. to see if it should be preempted). This is problematic because scx_bpf_cpu_rq() provides access to all fields of struct rq, most of which aren't safe to use without holding the associated rq lock. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
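Typical BPF-side usage, per the description above. The preemption predicate is just an example, and (per the later RCU fix) the call must sit inside an RCU read-side section:

	/* decide whether the task running on @cpu looks preemptable */
	static bool remote_curr_expired(s32 cpu)
	{
		struct task_struct *curr;
		bool expired = false;

		bpf_rcu_read_lock();
		curr = scx_bpf_cpu_curr(cpu);
		if (curr)
			expired = curr->scx.slice == 0;
		bpf_rcu_read_unlock();

		return expired;
	}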
2025-09-03  sched_ext: Introduce scx_bpf_locked_rq()  (Christian Loehle; 1 file, -0/+23)
Most fields of the rq returned by scx_bpf_cpu_rq() assume that its rq lock is held; without the lock they are meaningless. Make a safer version of scx_bpf_cpu_rq() that only returns a rq if we hold that rq's lock. Also mark the new scx_bpf_locked_rq() as possibly returning NULL, as scx_bpf_cpu_rq() should have been too. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-03  sched_ext: Use cgroup_lock/unlock() to synchronize against cgroup operations  (Tejun Heo; 1 file, -52/+14)
SCX hooks into CPU cgroup controller operations and read-locks scx_cgroup_rwsem to exclude them while enabling and disabling schedulers. While this works, it's unnecessarily complicated given that cgroup_[un]lock() are available and thus the cgroup operations can be locked out that way. Drop scx_cgroup_rwsem locking from the tg on/offline and cgroup [can_]attach operations. Instead, grab cgroup_lock() from scx_cgroup_lock(). Drop scx_cgroup_finish_attach() which is no longer necessary. Drop the now unnecessary rcu locking and css ref bumping in scx_cgroup_init() and scx_cgroup_exit(). As scx_cgroup_set_weight/bandwidth() paths aren't protected by cgroup_lock(), rename scx_cgroup_rwsem to scx_cgroup_ops_rwsem and retain the locking there. This is overall simpler and will also allow enable/disable paths to synchronize against cgroup changes independent of the CPU controller. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-09-03  sched_ext: Put event_stats_cpu in struct scx_sched_pcpu  (Tejun Heo; 1 file, -9/+9)
scx_sched.event_stats_cpu is the percpu counters that are used to track stats. Introduce struct scx_sched_pcpu and move the counters inside. This will ease adding more per-cpu fields. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-09-03  sched_ext: Move internal type and accessor definitions to ext_internal.h  (Tejun Heo; 1 file, -1034/+0)
There currently isn't a shared home for SCX-internal types and accessors used by both ext.c and ext_idle.c. Create kernel/sched/ext_internal.h and move internal type and accessor definitions there. This trims ext.c a bit and makes future additions easier. Pure code reorganization. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-09-03  sched_ext: Keep bypass on between enable failure and scx_disable_workfn()  (Tejun Heo; 1 file, -1/+1)
scx_enable() turns on the bypass mode while enable is in progress. If enabling fails, it turns off the bypass mode and then triggers scx_error(). scx_error() will trigger scx_disable_workfn() which will turn on the bypass mode again and unload the failed scheduler. This moves the system out of bypass mode between the enable error path and the disable path, which is unnecessary and can be brittle - e.g. the thread running scx_enable() may already be on the failed scheduler and can be switched out before it triggers scx_error() leading to a stall. The watchdog would eventually kick in, so the situation isn't critical but is still suboptimal. There is nothing to be gained by turning off the bypass mode between scx_enable() failure and scx_disable_workfn(). Keep bypass on. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-09-03  sched_ext: Make explicit scx_task_iter_relock() calls unnecessary  (Tejun Heo; 1 file, -20/+23)
During tasks iteration, the locks can be dropped using scx_task_iter_unlock() to perform e.g. sleepable allocations. Afterwards, scx_task_iter_relock() has to be called prior to other iteration operations, which is error-prone. This can be easily automated by tracking whether scx_tasks_lock is held in scx_task_iter and re-acquiring when necessary. It already tracks whether the task's rq is locked after all.

- Add scx_task_iter->list_locked which remembers whether scx_tasks_lock is held.
- Rename scx_task_iter->locked to scx_task_iter->locked_task to better distinguish it from ->list_locked.
- Replace scx_task_iter_relock() with __scx_task_iter_maybe_relock() which is automatically called by scx_task_iter_next() and scx_task_iter_stop().
- Drop explicit scx_task_iter_relock() calls.

The resulting behavior should be equivalent.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
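A sketch of the bookkeeping this adds. The field and helper names follow the changelog; the bodies are simplified assumptions:

	struct scx_task_iter {
		/* ... cursor fields elided ... */
		struct task_struct	*locked_task;	/* task whose rq is currently locked */
		bool			list_locked;	/* is scx_tasks_lock currently held? */
	};

	static void __scx_task_iter_maybe_relock(struct scx_task_iter *iter)
	{
		/* lazily re-acquire scx_tasks_lock if the caller dropped it */
		if (!iter->list_locked) {
			spin_lock_irq(&scx_tasks_lock);
			iter->list_locked = true;
		}
	}

	/* called automatically from scx_task_iter_next() and scx_task_iter_stop() */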
2025-08-11  sched/ext: Fix invalid task state transitions on class switch  (Andrea Righi; 1 file, -0/+4)
When enabling a sched_ext scheduler, we may trigger invalid task state transitions, resulting in warnings like the following (which can be easily reproduced by running the hotplug selftest in a loop):

  sched_ext: Invalid task state transition 0 -> 3 for fish[770]
  WARNING: CPU: 18 PID: 787 at kernel/sched/ext.c:3862 scx_set_task_state+0x7c/0xc0
  ...
  RIP: 0010:scx_set_task_state+0x7c/0xc0
  ...
  Call Trace:
   <TASK>
   scx_enable_task+0x11f/0x2e0
   switching_to_scx+0x24/0x110
   scx_enable.isra.0+0xd14/0x13d0
   bpf_struct_ops_link_create+0x136/0x1a0
   __sys_bpf+0x1edd/0x2c30
   __x64_sys_bpf+0x21/0x30
   do_syscall_64+0xbb/0x370
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

This happens because we skip initialization for tasks that are already dead (with their usage counter set to zero), but we don't exclude them during the scheduling class transition phase. Fix this by also skipping dead tasks during class switching, preventing invalid task state transitions.

Fixes: a8532fac7b5d2 ("sched_ext: TASK_DEAD tasks must be switched into SCX on ops_enable")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-07-31  Merge tag 'sched_ext-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext  (Linus Torvalds; 1 file, -130/+120)
Pull sched_ext updates from Tejun Heo:

 - Add support for cgroup "cpu.max" interface

 - Code organization cleanup so that ext_idle.c doesn't depend on the source-file-inclusion build method of sched/

 - Drop UP paths in accordance with sched core changes

 - Documentation and other misc changes

* tag 'sched_ext-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Fix scx_bpf_reenqueue_local() reference
  sched_ext: Drop kfuncs marked for removal in 6.15
  sched_ext, rcu: Eject BPF scheduler on RCU CPU stall panic
  kernel/sched/ext.c: fix typo "occured" -> "occurred" in comments
  sched_ext: Add support for cgroup bandwidth control interface
  sched_ext, sched/core: Factor out struct scx_task_group
  sched_ext: Return NULL in llc_span
  sched_ext: Always use SMP versions in kernel/sched/ext_idle.h
  sched_ext: Always use SMP versions in kernel/sched/ext_idle.c
  sched_ext: Always use SMP versions in kernel/sched/ext.h
  sched_ext: Always use SMP versions in kernel/sched/ext.c
  sched_ext: Documentation: Clarify time slice handling in task lifecycle
  sched_ext: Make scx_locked_rq() inline
  sched_ext: Make scx_rq_bypassing() inline
  sched_ext: idle: Make local functions static in ext_idle.c
  sched_ext: idle: Remove unnecessary ifdef in scx_bpf_cpu_node()
2025-07-17  sched_ext: Fix scx_bpf_reenqueue_local() reference  (Christian Loehle; 1 file, -1/+1)
The comment mentions bpf_scx_reenqueue_local(), but the function is provided for the BPF program implementing scx; as such, the naming convention is scx_bpf_reenqueue_local(). Fix the comment. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-07-16  sched/ext: Prevent update_locked_rq() calls with NULL rq  (Breno Leitao; 1 file, -4/+8)
Avoid invoking update_locked_rq() when the runqueue (rq) pointer is NULL in the SCX_CALL_OP and SCX_CALL_OP_RET macros. Previously, calling update_locked_rq(NULL) with preemption enabled could trigger the following warning:

  BUG: using __this_cpu_write() in preemptible [00000000]

This happens because __this_cpu_write() is unsafe to use in preemptible context. rq is NULL when an op is invoked from an unlocked context. In such cases, we don't need to store any rq, since the value should already be NULL (unlocked). Ensure that update_locked_rq() is only called when rq is non-NULL, preventing __this_cpu_write() from being called in preemptible context.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Fixes: 18853ba782bef ("sched_ext: Track currently locked rq")
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v6.15
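The change boils down to guarding the store, since a NULL rq already means "unlocked" and there is nothing to record. Roughly:

	/* before: always called, hitting __this_cpu_write() in preemptible context */
	update_locked_rq(rq);

	/* after: only record the locked rq when there actually is one */
	if (rq)
		update_locked_rq(rq);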
2025-06-25  sched_ext: Drop kfuncs marked for removal in 6.15  (Jake Hillion; 1 file, -69/+2)
sched_ext performed a kfunc renaming pass in 6.13 and kept the old names around for compatibility with old binaries. These were scheduled for cleanup in 6.15 but were missed. Submitting for cleanup in for-next. Removed the kfuncs, their flags, and any references I could find to them in doc comments. Left the entries in include/scx/compat.bpf.h as they're still useful to make new binaries compatible with old kernels. Tested by applying to my kernel. It builds and a modern version of scx_lavd loads fine. Signed-off-by: Jake Hillion <jake@hillion.co.uk> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-24  sched_ext, rcu: Eject BPF scheduler on RCU CPU stall panic  (David Dai; 1 file, -0/+35)
For systems that use a sched_ext scheduler and have panic_on_rcu_stall enabled, try kicking out the current scheduler before issuing a panic. While there are numerous reasons for RCU CPU stalls that are not directly attributed to the scheduler, deferring the panic gives sched_ext an opportunity to provide additional debug info when ejecting the current scheduler. Also, handling the event more gracefully allows us to potentially recover the system instead of incurring additional down time. Suggested-by: Tejun Heo <tj@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: David Dai <david.dai@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-23  kernel/sched/ext.c: fix typo "occured" -> "occurred" in comments  (Ke Ma; 1 file, -2/+2)
Fixes a minor spelling mistake in two comment lines Signed-off-by: Ke Ma <makebit1999@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-20  sched_ext: Add support for cgroup bandwidth control interface  (Tejun Heo; 1 file, -3/+63)
- Add CONFIG_GROUP_SCHED_BANDWIDTH which is selected by both CONFIG_CFS_BANDWIDTH and EXT_GROUP_SCHED.
- Put bandwidth control interface files for both cgroup v1 and v2 under CONFIG_GROUP_SCHED_BANDWIDTH.
- Update tg_bandwidth() to fetch configuration parameters from fair if CONFIG_CFS_BANDWIDTH, SCX otherwise.
- Update tg_set_bandwidth() to update the parameters for both fair and SCX.
- Add bandwidth control parameters to struct scx_cgroup_init_args.
- Add sched_ext_ops.cgroup_set_bandwidth() which is invoked on bandwidth control parameter updates.
- Update scx_qmap and maximal selftest to test the new feature.

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-20  sched_ext, sched/core: Factor out struct scx_task_group  (Tejun Heo; 1 file, -16/+16)
More sched_ext fields will be added to struct task_group. In preparation, factor out sched_ext fields into struct scx_task_group to reduce clutter in the common header. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-20  sched_ext: Merge branch 'for-6.16-fixes' into for-6.17  (Tejun Heo; 1 file, -6/+11)
Pull sched_ext/for-6.16-fixes to receive: c50784e99f0e ("sched_ext: Make scx_group_set_weight() always update tg->scx.weight") 33796b91871a ("sched_ext, sched/core: Don't call scx_group_set_weight() prematurely from sched_create_group()") which are needed to implement CPU bandwidth control interface support. Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-17  sched_ext, sched/core: Don't call scx_group_set_weight() prematurely from sched_create_group()  (Tejun Heo; 1 file, -0/+5)
During task_group creation, sched_create_group() calls scx_group_set_weight() with CGROUP_WEIGHT_DFL to initialize the sched_ext portion. This is premature and ends up calling ops.cgroup_set_weight() with an incorrect @cgrp before ops.cgroup_init() is called. sched_create_group() should just initialize SCX related fields in the new task_group. Fix it by factoring out scx_tg_init() from sched_init() and making sched_create_group() call that function instead of scx_group_set_weight(). v2: Retain CONFIG_EXT_GROUP_SCHED ifdef in sched_init() as removing it leads to build failures on !CONFIG_GROUP_SCHED configs. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 819513666966 ("sched_ext: Add cgroup support") Cc: stable@vger.kernel.org # v6.12+
2025-06-17  sched_ext: Make scx_group_set_weight() always update tg->scx.weight  (Tejun Heo; 1 file, -6/+6)
Otherwise, tg->scx.weight can go out of sync while scx_cgroup is not enabled and ops.cgroup_init() may be called with a stale weight value. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 819513666966 ("sched_ext: Add cgroup support") Cc: stable@vger.kernel.org # v6.12+
2025-06-13  sched_ext: Always use SMP versions in kernel/sched/ext.c  (Cheng-Yang Chou; 1 file, -25/+1)
Simplify the scheduler by making formerly SMP-only primitives and data structures unconditional. tj: Updated subject for clarity. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-09  sched_ext: Make scx_locked_rq() inline  (Andrea Righi; 1 file, -11/+2)
scx_locked_rq() is used both from ext.c and ext_idle.c, move it to ext.h as a static inline function. No functional changes. v2: Rename locked_rq to scx_locked_rq_state, expose it and make scx_locked_rq() inline, as suggested by Tejun. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-09  sched_ext: Make scx_rq_bypassing() inline  (Andrea Righi; 1 file, -5/+0)
scx_rq_bypassing() is used both from ext.c and ext_idle.c, move it to ext.h as a static inline function. No functional changes. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-28  Merge tag 'bpf-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next  (Linus Torvalds; 1 file, -14/+1)
Pull bpf updates from Alexei Starovoitov:

 - Fix and improve BTF deduplication of identical BTF types (Alan Maguire and Andrii Nakryiko)

 - Support up to 12 arguments in BPF trampoline on arm64 (Xu Kuohai and Alexis Lothoré)

 - Support load-acquire and store-release instructions in BPF JIT on riscv64 (Andrea Parri)

 - Fix uninitialized values in BPF_{CORE,PROBE}_READ macros (Anton Protopopov)

 - Streamline allowed helpers across program types (Feng Yang)

 - Support atomic update for hashtab of BPF maps (Hou Tao)

 - Implement json output for BPF helpers (Ihor Solodrai)

 - Several s390 JIT fixes (Ilya Leoshkevich)

 - Various sockmap fixes (Jiayuan Chen)

 - Support mmap of vmlinux BTF data (Lorenz Bauer)

 - Support BPF rbtree traversal and list peeking (Martin KaFai Lau)

 - Tests for sockmap/sockhash redirection (Michal Luczaj)

 - Introduce kfuncs for memory reads into dynptrs (Mykyta Yatsenko)

 - Add support for dma-buf iterators in BPF (T.J. Mercier)

 - The verifier support for __bpf_trap() (Yonghong Song)

* tag 'bpf-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (135 commits)
  bpf, arm64: Remove unused-but-set function and variable.
  selftests/bpf: Add tests with stack ptr register in conditional jmp
  bpf: Do not include stack ptr register in precision backtracking bookkeeping
  selftests/bpf: enable many-args tests for arm64
  bpf, arm64: Support up to 12 function arguments
  bpf: Check rcu_read_lock_trace_held() in bpf_map_lookup_percpu_elem()
  bpf: Avoid __bpf_prog_ret0_warn when jit fails
  bpftool: Add support for custom BTF path in prog load/loadall
  selftests/bpf: Add unit tests with __bpf_trap() kfunc
  bpf: Warn with __bpf_trap() kfunc maybe due to uninitialized variable
  bpf: Remove special_kfunc_set from verifier
  selftests/bpf: Add test for open coded dmabuf_iter
  selftests/bpf: Add test for dmabuf_iter
  bpf: Add open coded dmabuf iterator
  bpf: Add dmabuf iterator
  dma-buf: Rename debugfs symbols
  bpf: Fix error return value in bpf_copy_from_user_dynptr
  libbpf: Use mmap to parse vmlinux BTF from sysfs
  selftests: bpf: Add a test for mmapable vmlinux BTF
  btf: Allow mmap of vmlinux btf
  ...
2025-05-20  sched_ext: Make scx_kf_allowed_if_unlocked() available outside ext.c  (Andrea Righi; 1 file, -5/+0)
Relocate scx_kf_allowed_if_unlocked() so it can be used from other source files (e.g., ext_idle.c). No functional change. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-14  sched_ext: Explain the temporary situation around scx_root dereferences  (Tejun Heo; 1 file, -0/+8)
Naked scx_root dereferences are being used as temporary markers to indicate that they need to be updated to point to the right scheduler instance. Explain the situation. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-05-14  sched_ext: Add @sch to SCX_CALL_OP*()  (Tejun Heo; 1 file, -109/+175)
In preparation of hierarchical scheduling support, add @sch to scx_exit() and friends:

- scx_exit/error() updated to take explicit @sch instead of assuming scx_root.
- scx_kf_exit/error() added. These are to be used from kfuncs, don't take @sch and internally determine the scx_sched instance to abort. Currently, it's always scx_root but once multiple scheduler support is in place, it will be the scx_sched instance that invoked the kfunc. This simplifies many callsites and defers scx_sched lookup until error is triggered.
- @sch is propagated to ops_cpu_valid() and ops_sanitize_err(). The CPU validity conditions in ops_cpu_valid() are factored into __cpu_valid() to implement kf_cpu_valid() which is the counterpart to scx_kf_exit/error().
- All users are converted. Most conversions are straightforward. check_rq_for_timeouts() and scx_softlockup() are updated to use explicit rcu_dereference*(scx_root) for safety as they may execute asynchronous to the exit path. scx_tick() is also updated to use rcu_dereference(). While not strictly necessary due to the preceding scx_enabled() test and IRQ disabled context, this removes the subtlety at no noticeable cost.

No behavior changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2025-05-14  sched_ext: Cleanup [__]scx_exit/error*()  (Tejun Heo; 1 file, -23/+25)
__scx_exit() is the base exit implementation and there are three wrappers on top of it - scx_exit(), __scx_error() and scx_error(). This is more confusing than helpful especially given that there are only a couple users of scx_exit() and __scx_error(). To simplify the situation:

- Make __scx_exit() take va_list and rename it to scx_vexit(). This is to ease implementing more complex extensions on top.
- Make scx_exit() a varargs wrapper around __scx_exit(). scx_exit() now takes both @kind and @exit_code.
- Convert existing scx_exit() and __scx_error() users to use the new scx_exit().
- scx_error() remains unchanged.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
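The resulting layering, roughly (signatures and __printf annotations assumed; bodies elided):

	static __printf(3, 0) void scx_vexit(enum scx_exit_kind kind, s64 exit_code,
					     const char *fmt, va_list args)
	{
		/* base implementation: record @kind/@exit_code, format the message, kick exit */
	}

	static __printf(3, 4) void scx_exit(enum scx_exit_kind kind, s64 exit_code,
					    const char *fmt, ...)
	{
		va_list args;

		va_start(args, fmt);
		scx_vexit(kind, exit_code, fmt, args);
		va_end(args);
	}

	#define scx_error(fmt, args...)	scx_exit(SCX_EXIT_ERROR, 0, fmt, ##args)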
2025-05-14  sched_ext: Add @sch to SCX_CALL_OP*()  (Tejun Heo; 1 file, -58/+74)
In preparation of hierarchical scheduling support, make SCX_CALL_OP*() take explicit @sch instead of assuming scx_root. As scx_root is still the only scheduler instance, this patch doesn't make any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>