aboutsummaryrefslogtreecommitdiffstats
path: root/kernel/trace (follow)
AgeCommit message (Collapse)AuthorFilesLines
2025-11-07Merge tag 'trace-v6.18-rc4' of ↵Linus Torvalds2-2/+8
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Check for reader catching up in ring_buffer_map_get_reader() If the reader catches up to the writer in the memory mapped ring buffer then calling rb_get_reader_page() will return NULL as there's no pages left. But this isn't checked for before calling rb_get_reader_page() and the return of NULL causes a warning. If it is detected that the reader caught up to the writer, then simply exit the routine - Fix memory leak in histogram create_field_var() The couple of the error paths in create_field_var() did not properly clean up what was allocated. Make sure everything is freed properly on error - Fix help message of tools latency_collector The help message incorrectly stated that "-t" was the same as "--threads" whereas "--threads" is actually represented by "-e" * tag 'trace-v6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/tools: Fix incorrcet short option in usage text for --threads tracing: Fix memory leaks in create_field_var() ring-buffer: Do not warn in ring_buffer_map_get_reader() when reader catches up
2025-11-06tracing: Fix memory leaks in create_field_var()Zilin Guan1-2/+4
The function create_field_var() allocates memory for 'val' through create_hist_field() inside parse_atom(), and for 'var' through create_var(), which in turn allocates var->type and var->var.name internally. Simply calling kfree() to release these structures will result in memory leaks. Use destroy_hist_field() to properly free 'val', and explicitly release the memory of var->type and var->var.name before freeing 'var' itself. Link: https://patch.msgid.link/20251106120132.3639920-1-zilin@seu.edu.cn Fixes: 02205a6752f22 ("tracing: Add support for 'field variables'") Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-06ring-buffer: Do not warn in ring_buffer_map_get_reader() when reader catches upSteven Rostedt1-0/+4
The function ring_buffer_map_get_reader() is a bit more strict than the other get reader functions, and except for certain situations the rb_get_reader_page() should not return NULL. If it does, it triggers a warning. This warning was triggering but after looking at why, it was because another acceptable situation was happening and it wasn't checked for. If the reader catches up to the writer and there's still data to be read on the reader page, then the rb_get_reader_page() will return NULL as there's no new page to get. In this situation, the reader page should not be updated and no warning should trigger. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Vincent Donnefort <vdonnefort@google.com> Reported-by: syzbot+92a3745cea5ec6360309@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/690babec.050a0220.baf87.0064.GAE@google.com/ Link: https://lore.kernel.org/20251016132848.1b11bb37@gandalf.local.home Fixes: 117c39200d9d7 ("ring-buffer: Introducing ring-buffer mapping functions") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-07tracing: tprobe-events: Fix to put tracepoint_user when disable the tprobeMasami Hiramatsu (Google)1-0/+4
__unregister_trace_fprobe() checks tf->tuser to put it when removing tprobe. However, disable_trace_fprobe() does not use it and only calls unregister_fprobe(). Thus it forgets to disable tracepoint_user. If the trace_fprobe has tuser, put it for unregistering the tracepoint callbacks when disabling tprobe correctly. Link: https://lore.kernel.org/all/176244794466.155515.3971904050506100243.stgit@devnote2/ Fixes: 2867495dea86 ("tracing: tprobe-events: Register tracepoint when enable tprobe event") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Tested-by: Beau Belgrave <beaub@linux.microsoft.com> Reviewed-by: Beau Belgrave <beaub@linux.microsoft.com>
2025-11-07tracing: tprobe-events: Fix to register tracepoint correctlyMasami Hiramatsu (Google)1-1/+2
Since __tracepoint_user_init() calls tracepoint_user_register() without initializing tuser->tpoint with given tracpoint, it does not register tracepoint stub function as callback correctly, and tprobe does not work. Initializing tuser->tpoint correctly before tracepoint_user_register() so that it sets up tracepoint callback. I confirmed below example works fine again. echo "t sched_switch preempt prev_pid=prev->pid next_pid=next->pid" > /sys/kernel/tracing/dynamic_events echo 1 > /sys/kernel/tracing/events/tracepoints/sched_switch/enable cat /sys/kernel/tracing/trace_pipe Link: https://lore.kernel.org/all/176244793514.155515.6466348656998627773.stgit@devnote2/ Fixes: 2867495dea86 ("tracing: tprobe-events: Register tracepoint when enable tprobe event") Reported-by: Beau Belgrave <beaub@linux.microsoft.com> Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Tested-by: Beau Belgrave <beaub@linux.microsoft.com> Reviewed-by: Beau Belgrave <beaub@linux.microsoft.com>
2025-10-20rv: Make rtapp/pagefault monitor depends on CONFIG_MMUNam Cao1-0/+1
There is no page fault without MMU. Compiling the rtapp/pagefault monitor without CONFIG_MMU fails as page fault tracepoints' definitions are not available. Make rtapp/pagefault monitor depends on CONFIG_MMU. Fixes: 9162620eb604 ("rv: Add rtapp_pagefault monitor") Signed-off-by: Nam Cao <namcao@linutronix.de> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202509260455.6Z9Vkty4-lkp@intel.com/ Cc: stable@vger.kernel.org Reviewed-by: Gabriele Monaco <gmonaco@redhat.com> Link: https://lore.kernel.org/r/20251002082317.973839-1-namcao@linutronix.de Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
2025-10-20rv: Fully convert enabled_monitors to use list_head as iteratorNam Cao1-6/+6
The callbacks in enabled_monitors_seq_ops are inconsistent. Some treat the iterator as struct rv_monitor *, while others treat the iterator as struct list_head *. This causes a wrong type cast and crashes the system as reported by Nathan. Convert everything to use struct list_head * as iterator. This also makes enabled_monitors consistent with available_monitors. Fixes: de090d1ccae1 ("rv: Fix wrong type cast in enabled_monitors_next()") Reported-by: Nathan Chancellor <nathan@kernel.org> Closes: https://lore.kernel.org/linux-trace-kernel/20250923002004.GA2836051@ax162/ Signed-off-by: Nam Cao <namcao@linutronix.de> Cc: stable@vger.kernel.org Reviewed-by: Gabriele Monaco <gmonaco@redhat.com> Link: https://lore.kernel.org/r/20251002082235.973099-1-namcao@linutronix.de Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
2025-10-11Merge tag 'trace-v6.18-3' of ↵Linus Torvalds1-4/+8
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: "The previous fix to trace_marker required updating trace_marker_raw as well. The difference between trace_marker_raw from trace_marker is that the raw version is for applications to write binary structures directly into the ring buffer instead of writing ASCII strings. This is for applications that will read the raw data from the ring buffer and get the data structures directly. It's a bit quicker than using the ASCII version. Unfortunately, it appears that our test suite has several tests that test writes to the trace_marker file, but lacks any tests to the trace_marker_raw file (this needs to be remedied). Two issues came about the update to the trace_marker_raw file that syzbot found: - Fix tracing_mark_raw_write() to use per CPU buffer The fix to use the per CPU buffer to copy from user space was needed for both the trace_maker and trace_maker_raw file. The fix for reading from user space into per CPU buffers properly fixed the trace_marker write function, but the trace_marker_raw file wasn't fixed properly. The user space data was correctly written into the per CPU buffer, but the code that wrote into the ring buffer still used the user space pointer and not the per CPU buffer that had the user space data already written. - Stop the fortify string warning from writing into trace_marker_raw After converting the copy_from_user_nofault() into a memcpy(), another issue appeared. As writes to the trace_marker_raw expects binary data, the first entry is a 4 byte identifier. The entry structure is defined as: struct { struct trace_entry ent; int id; char buf[]; }; The size of this structure is reserved on the ring buffer with: size = sizeof(*entry) + cnt; Then it is copied from the buffer into the ring buffer with: memcpy(&entry->id, buf, cnt); This use to be a copy_from_user_nofault(), but now converting it to a memcpy() triggers the fortify-string code, and causes a warning. The allocated space is actually more than what is copied, as the cnt used also includes the entry->id portion. Allocating sizeof(*entry) plus cnt is actually allocating 4 bytes more than what is needed. Change the size function to: size = struct_size(entry, buf, cnt - sizeof(entry->id)); And update the memcpy() to unsafe_memcpy()" * tag 'trace-v6.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Stop fortify-string from warning in tracing_mark_raw_write() tracing: Fix tracing_mark_raw_write() to use buf and not ubuf
2025-10-11tracing: Stop fortify-string from warning in tracing_mark_raw_write()Steven Rostedt1-2/+6
The way tracing_mark_raw_write() records its data is that it has the following structure: struct { struct trace_entry; int id; char buf[]; }; But memcpy(&entry->id, buf, size) triggers the following warning when the size is greater than the id: ------------[ cut here ]------------ memcpy: detected field-spanning write (size 6) of single field "&entry->id" at kernel/trace/trace.c:7458 (size 4) WARNING: CPU: 7 PID: 995 at kernel/trace/trace.c:7458 write_raw_marker_to_buffer.isra.0+0x1f9/0x2e0 Modules linked in: CPU: 7 UID: 0 PID: 995 Comm: bash Not tainted 6.17.0-test-00007-g60b82183e78a-dirty #211 PREEMPT(voluntary) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-debian-1.17.0-1 04/01/2014 RIP: 0010:write_raw_marker_to_buffer.isra.0+0x1f9/0x2e0 Code: 04 00 75 a7 b9 04 00 00 00 48 89 de 48 89 04 24 48 c7 c2 e0 b1 d1 b2 48 c7 c7 40 b2 d1 b2 c6 05 2d 88 6a 04 01 e8 f7 e8 bd ff <0f> 0b 48 8b 04 24 e9 76 ff ff ff 49 8d 7c 24 04 49 8d 5c 24 08 48 RSP: 0018:ffff888104c3fc78 EFLAGS: 00010292 RAX: 0000000000000000 RBX: 0000000000000006 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 1ffffffff6b363b4 RDI: 0000000000000001 RBP: ffff888100058a00 R08: ffffffffb041d459 R09: ffffed1020987f40 R10: 0000000000000007 R11: 0000000000000001 R12: ffff888100bb9010 R13: 0000000000000000 R14: 00000000000003e3 R15: ffff888134800000 FS: 00007fa61d286740(0000) GS:ffff888286cad000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000560d28d509f1 CR3: 00000001047a4006 CR4: 0000000000172ef0 Call Trace: <TASK> tracing_mark_raw_write+0x1fe/0x290 ? __pfx_tracing_mark_raw_write+0x10/0x10 ? security_file_permission+0x50/0xf0 ? rw_verify_area+0x6f/0x4b0 vfs_write+0x1d8/0xdd0 ? __pfx_vfs_write+0x10/0x10 ? __pfx_css_rstat_updated+0x10/0x10 ? count_memcg_events+0xd9/0x410 ? fdget_pos+0x53/0x5e0 ksys_write+0x182/0x200 ? __pfx_ksys_write+0x10/0x10 ? do_user_addr_fault+0x4af/0xa30 do_syscall_64+0x63/0x350 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7fa61d318687 Code: 48 89 fa 4c 89 df e8 58 b3 00 00 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 1a 5b c3 0f 1f 84 00 00 00 00 00 48 8b 44 24 10 0f 05 <5b> c3 0f 1f 80 00 00 00 00 83 e2 39 83 fa 08 75 de e8 23 ff ff ff RSP: 002b:00007ffd87fe0120 EFLAGS: 00000202 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 00007fa61d286740 RCX: 00007fa61d318687 RDX: 0000000000000006 RSI: 0000560d28d509f0 RDI: 0000000000000001 RBP: 0000560d28d509f0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000006 R13: 00007fa61d4715c0 R14: 00007fa61d46ee80 R15: 0000000000000000 </TASK> ---[ end trace 0000000000000000 ]--- This is because fortify string sees that the size of entry->id is only 4 bytes, but it is writing more than that. But this is OK as the dynamic_array is allocated to handle that copy. The size allocated on the ring buffer was actually a bit too big: size = sizeof(*entry) + cnt; But cnt includes the 'id' and the buffer data, so adding cnt to the size of *entry actually allocates too much on the ring buffer. Change the allocation to: size = struct_size(entry, buf, cnt - sizeof(entry->id)); and the memcpy() to unsafe_memcpy() with an added justification. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/20251011112032.77be18e4@gandalf.local.home Fixes: 64cf7d058a00 ("tracing: Have trace_marker use per-cpu data to read user space") Reported-by: syzbot+9a2ede1643175f350105@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68e973f5.050a0220.1186a4.0010.GAE@google.com/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-10-10tracing: Fix tracing_mark_raw_write() to use buf and not ubufSteven Rostedt1-2/+2
The fix to use a per CPU buffer to read user space tested only the writes to trace_marker. But it appears that the selftests are missing tests to the trace_maker_raw file. The trace_maker_raw file is used by applications that writes data structures and not strings into the file, and the tools read the raw ring buffer to process the structures it writes. The fix that reads the per CPU buffers passes the new per CPU buffer to the trace_marker file writes, but the update to the trace_marker_raw write read the data from user space into the per CPU buffer, but then still used then passed the user space address to the function that records the data. Pass in the per CPU buffer and not the user space address. TODO: Add a test to better test trace_marker_raw. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/20251011035243.386098147@kernel.org Fixes: 64cf7d058a00 ("tracing: Have trace_marker use per-cpu data to read user space") Reported-by: syzbot+9a2ede1643175f350105@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68e973f5.050a0220.1186a4.0010.GAE@google.com/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-10-09Merge tag 'trace-v6.18-2' of ↵Linus Torvalds5-79/+241
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing clean up and fixes from Steven Rostedt: - Have osnoise tracer use memdup_user_nul() The function osnoise_cpus_write() open codes a kmalloc() and then a copy_from_user() and then adds a nul byte at the end which is the same as simply using memdup_user_nul(). - Fix wakeup and irq tracers when failing to acquire calltime When the wakeup and irq tracers use the function graph tracer for tracing function times, it saves a timestamp into the fgraph shadow stack. It is possible that this could fail to be stored. If that happens, it exits the routine early. These functions also disable nesting of the operations by incremeting the data "disable" counter. But if the calltime exits out early, it never increments the counter back to what it needs to be. Since there's only a couple of lines of code that does work after acquiring the calltime, instead of exiting out early, reverse the if statement to be true if calltime is acquired, and place the code that is to be done within that if block. The clean up will always be done after that. - Fix ring_buffer_map() return value on failure of __rb_map_vma() If __rb_map_vma() fails in ring_buffer_map(), it does not return an error. This means the caller will be working against a bad vma mapping. Have ring_buffer_map() return an error when __rb_map_vma() fails. - Fix regression of writing to the trace_marker file A bug fix was made to change __copy_from_user_inatomic() to copy_from_user_nofault() in the trace_marker write function. The trace_marker file is used by applications to write into it (usually with a file descriptor opened at the start of the program) to record into the tracing system. It's usually used in critical sections so the write to trace_marker is highly optimized. The reason for copying in an atomic section is that the write reserves space on the ring buffer and then writes directly into it. After it writes, it commits the event. The time between reserve and commit must have preemption disabled. The trace marker write does not have any locking nor can it allocate due to the nature of it being a critical path. Unfortunately, converting __copy_from_user_inatomic() to copy_from_user_nofault() caused a regression in Android. Now all the writes from its applications trigger the fault that is rejected by the _nofault() version that wasn't rejected by the _inatomic() version. Instead of getting data, it now just gets a trace buffer filled with: tracing_mark_write: <faulted> To fix this, on opening of the trace_marker file, allocate per CPU buffers that can be used by the write call. Then when entering the write call, do the following: preempt_disable(); cpu = smp_processor_id(); buffer = per_cpu_ptr(cpu_buffers, cpu); do { cnt = nr_context_switches_cpu(cpu); migrate_disable(); preempt_enable(); ret = copy_from_user(buffer, ptr, size); preempt_disable(); migrate_enable(); } while (!ret && cnt != nr_context_switches_cpu(cpu)); if (!ret) ring_buffer_write(buffer); preempt_enable(); This works similarly to seqcount. As it must enabled preemption to do a copy_from_user() into a per CPU buffer, if it gets preempted, the buffer could be corrupted by another task. To handle this, read the number of context switches of the current CPU, disable migration, enable preemption, copy the data from user space, then immediately disable preemption again. If the number of context switches is the same, the buffer is still valid. Otherwise it must be assumed that the buffer may have been corrupted and it needs to try again. Now the trace_marker write can get the user data even if it has to fault it in, and still not grab any locks of its own. * tag 'trace-v6.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Have trace_marker use per-cpu data to read user space ring buffer: Propagate __rb_map_vma return value to caller tracing: Fix irqoff tracers on failure of acquiring calltime tracing: Fix wakeup tracers on failure of acquiring calltime tracing/osnoise: Replace kmalloc + copy_from_user with memdup_user_nul
2025-10-08tracing: Have trace_marker use per-cpu data to read user spaceSteven Rostedt1-48/+220
It was reported that using __copy_from_user_inatomic() can actually schedule. Which is bad when preemption is disabled. Even though there's logic to check in_atomic() is set, but this is a nop when the kernel is configured with PREEMPT_NONE. This is due to page faulting and the code could schedule with preemption disabled. Link: https://lore.kernel.org/all/20250819105152.2766363-1-luogengkun@huaweicloud.com/ The solution was to change the __copy_from_user_inatomic() to copy_from_user_nofault(). But then it was reported that this caused a regression in Android. There's several applications writing into trace_marker() in Android, but now instead of showing the expected data, it is showing: tracing_mark_write: <faulted> After reverting the conversion to copy_from_user_nofault(), Android was able to get the data again. Writes to the trace_marker is a way to efficiently and quickly enter data into the Linux tracing buffer. It takes no locks and was designed to be as non-intrusive as possible. This means it cannot allocate memory, and must use pre-allocated data. A method that is actively being worked on to have faultable system call tracepoints read user space data is to allocate per CPU buffers, and use them in the callback. The method uses a technique similar to seqcount. That is something like this: preempt_disable(); cpu = smp_processor_id(); buffer = this_cpu_ptr(&pre_allocated_cpu_buffers, cpu); do { cnt = nr_context_switches_cpu(cpu); migrate_disable(); preempt_enable(); ret = copy_from_user(buffer, ptr, size); preempt_disable(); migrate_enable(); } while (!ret && cnt != nr_context_switches_cpu(cpu)); if (!ret) ring_buffer_write(buffer); preempt_enable(); It's a little more involved than that, but the above is the basic logic. The idea is to acquire the current CPU buffer, disable migration, and then enable preemption. At this moment, it can safely use copy_from_user(). After reading the data from user space, it disables preemption again. It then checks to see if there was any new scheduling on this CPU. If there was, it must assume that the buffer was corrupted by another task. If there wasn't, then the buffer is still valid as only tasks in preemptable context can write to this buffer and only those that are running on the CPU. By using this method, where trace_marker open allocates the per CPU buffers, trace_marker writes can access user space and even fault it in, without having to allocate or take any locks of its own. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Luo Gengkun <luogengkun@huaweicloud.com> Cc: Wattson CI <wattson-external@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/20251008124510.6dba541a@gandalf.local.home Fixes: 3d62ab32df065 ("tracing: Fix tracing_marker may trigger page fault during preempt_disable") Reported-by: Runping Lai <runpinglai@google.com> Tested-by: Runping Lai <runpinglai@google.com> Closes: https://lore.kernel.org/linux-trace-kernel/20251007003417.3470979-2-runpinglai@google.com/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-10-08ring buffer: Propagate __rb_map_vma return value to callerAnkit Khushwaha1-1/+1
The return value from `__rb_map_vma()`, which rejects writable or executable mappings (VM_WRITE, VM_EXEC, or !VM_MAYSHARE), was being ignored. As a result the caller of `__rb_map_vma` always returned 0 even when the mapping had actually failed, allowing it to proceed with an invalid VMA. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20251008172516.20697-1-ankitkhushwaha.linux@gmail.com Fixes: 117c39200d9d7 ("ring-buffer: Introducing ring-buffer mapping functions") Reported-by: syzbot+ddc001b92c083dbf2b97@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?id=194151be8eaebd826005329b2e123aecae714bdb Signed-off-by: Ankit Khushwaha <ankitkhushwaha.linux@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-10-08tracing: Fix irqoff tracers on failure of acquiring calltimeSteven Rostedt1-13/+10
The functions irqsoff_graph_entry() and irqsoff_graph_return() both call func_prolog_dec() that will test if the data->disable is already set and if not, increment it and return. If it was set, it returns false and the caller exits. The caller of this function must decrement the disable counter, but misses doing so if the calltime fails to be acquired. Instead of exiting out when calltime is NULL, change the logic to do the work if it is not NULL and still do the clean up at the end of the function if it is NULL. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20251008114943.6f60f30f@gandalf.local.home Fixes: a485ea9e3ef3 ("tracing: Fix irqsoff and wakeup latency tracers when using function graph") Reported-by: Sasha Levin <sashal@kernel.org> Closes: https://lore.kernel.org/linux-trace-kernel/20251006175848.1906912-2-sashal@kernel.org/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-10-08tracing: Fix wakeup tracers on failure of acquiring calltimeSteven Rostedt1-10/+6
The functions wakeup_graph_entry() and wakeup_graph_return() both call func_prolog_preempt_disable() that will test if the data->disable is already set and if not, increment it and disable preemption. If it was set, it returns false and the caller exits. The caller of this function must decrement the disable counter, but misses doing so if the calltime fails to be acquired. Instead of exiting out when calltime is NULL, change the logic to do the work if it is not NULL and still do the clean up at the end of the function if it is NULL. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20251008114835.027b878a@gandalf.local.home Fixes: a485ea9e3ef3 ("tracing: Fix irqsoff and wakeup latency tracers when using function graph") Reported-by: Sasha Levin <sashal@kernel.org> Closes: https://lore.kernel.org/linux-trace-kernel/20251006175848.1906912-1-sashal@kernel.org/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-10-08tracing/osnoise: Replace kmalloc + copy_from_user with memdup_user_nulThorsten Blum1-7/+4
Replace kmalloc() followed by copy_from_user() with memdup_user_nul() to simplify and improve osnoise_cpus_write(). Remove the manual NUL-termination. No functional changes intended. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20251001130907.364673-2-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-10-05Merge tag 'trace-v6.18' of ↵Linus Torvalds8-20/+26
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing updates from Steven Rostedt: - Use READ_ONCE() and WRITE_ONCE() instead of RCU for syscall tracepoints Individual system call trace events are pseudo events attached to the raw_syscall trace events that just trace the entry and exit of all system calls. When any of these individual system call trace events get enabled, an element in an array indexed by the system call number is assigned to the trace file that defines how to trace it. When the trace event triggers, it reads this array and if the array has an element, it uses that trace file to know what to write it (the trace file defines the output format of the corresponding system call). The issue is that it uses rcu_dereference_ptr() and marks the elements of the array as using RCU. This is incorrect. There is no RCU synchronization here. The event file that is pointed to has a completely different way to make sure its freed properly. The reading of the array during the system call trace event is only to know if there is a value or not. If not, it does nothing (it means this system call isn't being traced). If it does, it uses the information to store the system call data. The RCU usage here can simply be replaced by READ_ONCE() and WRITE_ONCE() macros. - Have the system call trace events use "0x" for hex values Some system call trace events display hex values but do not have "0x" in front of it. Seeing "count: 44" can be assumed that it is 44 decimal when in actuality it is 44 hex (68 decimal). Display "0x44" instead. - Use vmalloc_array() in tracing_map_sort_entries() The function tracing_map_sort_entries() used array_size() and vmalloc() when it could have simply used vmalloc_array(). - Use for_each_online_cpu() in trace_osnoise.c() Instead of open coding for_each_cpu(cpu, cpu_online_mask), use for_each_online_cpu(). - Move the buffer field in struct trace_seq to the end The buffer field in struct trace_seq is architecture dependent in size, and caused padding for the fields after it. By moving the buffer to the end of the structure, it compacts the trace_seq structure better. - Remove redundant zeroing of cmdline_idx field in saved_cmdlines_buffer() The structure that contains cmdline_idx is zeroed by memset(), no need to explicitly zero any of its fields after that. - Use system_percpu_wq instead of system_wq in user_event_mm_remove() As system_wq is being deprecated, use the new wq. - Add cond_resched() is ftrace_module_enable() Some modules have a lot of functions (thousands of them), and the enabling of those functions can take some time. On non preemtable kernels, it was triggering a watchdog timeout. Add a cond_resched() to prevent that. - Add a BUILD_BUG_ON() to make sure PID_MAX_DEFAULT is always a power of 2 There's code that depends on PID_MAX_DEFAULT being a power of 2 or it will break. If in the future that changes, make sure the build fails to ensure that the code is fixed that depends on this. - Grab mutex_lock() before ever exiting s_start() The s_start() function is a seq_file start routine. As s_stop() is always called even if s_start() fails, and s_stop() expects the event_mutex to be held as it will always release it. That mutex must always be taken in s_start() even if that function fails. * tag 'trace-v6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Fix lock imbalance in s_start() memory allocation failure path tracing: Ensure optimized hashing works ftrace: Fix softlockup in ftrace_module_enable tracing: replace use of system_wq with system_percpu_wq tracing: Remove redundant 0 value initialization tracing: Move buffer in trace_seq to end of struct tracing/osnoise: Use for_each_online_cpu() instead of for_each_cpu() tracing: Use vmalloc_array() to improve code tracing: Have syscall trace events show "0x" for values greater than 10 tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE()
2025-10-05Merge tag 'probes-fixes-v6.17' of ↵Linus Torvalds4-14/+28
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probe fix from Masami Hiramatsu: - Fix race condition in kprobe initialization causing NULL pointer dereference. This happens on weak memory model, which does not correctly manage the flags access with appropriate memory barriers. Use RELEASE-ACQUIRE to fix it. * tag 'probes-fixes-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Fix race condition in kprobe initialization causing NULL pointer dereference
2025-10-03Merge tag 'pull-f_path' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull file->f_path constification from Al Viro: "Only one thing was modifying ->f_path of an opened file - acct(2). Massaging that away and constifying a bunch of struct path * arguments in functions that might be given &file->f_path ends up with the situation where we can turn ->f_path into an anon union of const struct path f_path and struct path __f_path, the latter modified only in a few places in fs/{file_table,open,namei}.c, all for struct file instances that are yet to be opened" * tag 'pull-f_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (23 commits) Have cc(1) catch attempts to modify ->f_path kernel/acct.c: saner struct file treatment configfs:get_target() - release path as soon as we grab configfs_item reference apparmor/af_unix: constify struct path * arguments ovl_is_real_file: constify realpath argument ovl_sync_file(): constify path argument ovl_lower_dir(): constify path argument ovl_get_verity_digest(): constify path argument ovl_validate_verity(): constify {meta,data}path arguments ovl_ensure_verity_loaded(): constify datapath argument ksmbd_vfs_set_init_posix_acl(): constify path argument ksmbd_vfs_inherit_posix_acl(): constify path argument ksmbd_vfs_kern_path_unlock(): constify path argument ksmbd_vfs_path_lookup_locked(): root_share_path can be const struct path * check_export(): constify path argument export_operations->open(): constify path argument rqst_exp_get_by_name(): constify path argument nfs: constify path argument of __vfs_getattr() bpf...d_path(): constify path argument done_path_create(): constify path argument ...
2025-10-03Merge tag 'pull-fs_context' of ↵Linus Torvalds1-2/+1
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull fs_context updates from Al Viro: "Change vfs_parse_fs_string() calling conventions Get rid of the length argument (almost all callers pass strlen() of the string argument there), add vfs_parse_fs_qstr() for the cases that do want separate length" * tag 'pull-fs_context' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: do_nfs4_mount(): switch to vfs_parse_fs_string() change the calling conventions for vfs_parse_fs_string()
2025-10-03tracing: Fix lock imbalance in s_start() memory allocation failure pathSasha Levin1-2/+1
When s_start() fails to allocate memory for set_event_iter, it returns NULL before acquiring event_mutex. However, the corresponding s_stop() function always tries to unlock the mutex, causing a lock imbalance warning: WARNING: bad unlock balance detected! 6.17.0-rc7-00175-g2b2e0c04f78c #7 Not tainted ------------------------------------- syz.0.85611/376514 is trying to release lock (event_mutex) at: [<ffffffff8dafc7a4>] traverse.part.0.constprop.0+0x2c4/0x650 fs/seq_file.c:131 but there are no more locks to release! The issue was introduced by commit b355247df104 ("tracing: Cache ':mod:' events for modules not loaded yet") which added the kzalloc() allocation before the mutex lock, creating a path where s_start() could return without locking the mutex while s_stop() would still try to unlock it. Fix this by unconditionally acquiring the mutex immediately after allocation, regardless of whether the allocation succeeded. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/20250929113238.3722055-1-sashal@kernel.org Fixes: b355247df104 ("tracing: Cache ":mod:" events for modules not loaded yet") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-10-02tracing: Fix race condition in kprobe initialization causing NULL pointer ↵Yuan Chen4-14/+28
dereference There is a critical race condition in kprobe initialization that can lead to NULL pointer dereference and kernel crash. [1135630.084782] Unable to handle kernel paging request at virtual address 0000710a04630000 ... [1135630.260314] pstate: 404003c9 (nZcv DAIF +PAN -UAO) [1135630.269239] pc : kprobe_perf_func+0x30/0x260 [1135630.277643] lr : kprobe_dispatcher+0x44/0x60 [1135630.286041] sp : ffffaeff4977fa40 [1135630.293441] x29: ffffaeff4977fa40 x28: ffffaf015340e400 [1135630.302837] x27: 0000000000000000 x26: 0000000000000000 [1135630.312257] x25: ffffaf029ed108a8 x24: ffffaf015340e528 [1135630.321705] x23: ffffaeff4977fc50 x22: ffffaeff4977fc50 [1135630.331154] x21: 0000000000000000 x20: ffffaeff4977fc50 [1135630.340586] x19: ffffaf015340e400 x18: 0000000000000000 [1135630.349985] x17: 0000000000000000 x16: 0000000000000000 [1135630.359285] x15: 0000000000000000 x14: 0000000000000000 [1135630.368445] x13: 0000000000000000 x12: 0000000000000000 [1135630.377473] x11: 0000000000000000 x10: 0000000000000000 [1135630.386411] x9 : 0000000000000000 x8 : 0000000000000000 [1135630.395252] x7 : 0000000000000000 x6 : 0000000000000000 [1135630.403963] x5 : 0000000000000000 x4 : 0000000000000000 [1135630.412545] x3 : 0000710a04630000 x2 : 0000000000000006 [1135630.421021] x1 : ffffaeff4977fc50 x0 : 0000710a04630000 [1135630.429410] Call trace: [1135630.434828] kprobe_perf_func+0x30/0x260 [1135630.441661] kprobe_dispatcher+0x44/0x60 [1135630.448396] aggr_pre_handler+0x70/0xc8 [1135630.454959] kprobe_breakpoint_handler+0x140/0x1e0 [1135630.462435] brk_handler+0xbc/0xd8 [1135630.468437] do_debug_exception+0x84/0x138 [1135630.475074] el1_dbg+0x18/0x8c [1135630.480582] security_file_permission+0x0/0xd0 [1135630.487426] vfs_write+0x70/0x1c0 [1135630.493059] ksys_write+0x5c/0xc8 [1135630.498638] __arm64_sys_write+0x24/0x30 [1135630.504821] el0_svc_common+0x78/0x130 [1135630.510838] el0_svc_handler+0x38/0x78 [1135630.516834] el0_svc+0x8/0x1b0 kernel/trace/trace_kprobe.c: 1308 0xffff3df8995039ec <kprobe_perf_func+0x2c>: ldr x21, [x24,#120] include/linux/compiler.h: 294 0xffff3df8995039f0 <kprobe_perf_func+0x30>: ldr x1, [x21,x0] kernel/trace/trace_kprobe.c 1308: head = this_cpu_ptr(call->perf_events); 1309: if (hlist_empty(head)) 1310: return 0; crash> struct trace_event_call -o struct trace_event_call { ... [120] struct hlist_head *perf_events; //(call->perf_event) ... } crash> struct trace_event_call ffffaf015340e528 struct trace_event_call { ... perf_events = 0xffff0ad5fa89f088, //this value is correct, but x21 = 0 ... } Race Condition Analysis: The race occurs between kprobe activation and perf_events initialization: CPU0 CPU1 ==== ==== perf_kprobe_init perf_trace_event_init tp_event->perf_events = list;(1) tp_event->class->reg (2)← KPROBE ACTIVE Debug exception triggers ... kprobe_dispatcher kprobe_perf_func (tk->tp.flags & TP_FLAG_PROFILE) head = this_cpu_ptr(call->perf_events)(3) (perf_events is still NULL) Problem: 1. CPU0 executes (1) assigning tp_event->perf_events = list 2. CPU0 executes (2) enabling kprobe functionality via class->reg() 3. CPU1 triggers and reaches kprobe_dispatcher 4. CPU1 checks TP_FLAG_PROFILE - condition passes (step 2 completed) 5. CPU1 calls kprobe_perf_func() and crashes at (3) because call->perf_events is still NULL CPU1 sees that kprobe functionality is enabled but does not see that perf_events has been assigned. Add pairing read and write memory barriers to guarantee that if CPU1 sees that kprobe functionality is enabled, it must also see that perf_events has been assigned. Link: https://lore.kernel.org/all/20251001022025.44626-1-chenyuan_fl@163.com/ Fixes: 50d780560785 ("tracing/kprobes: Add probe handler dispatcher to support perf and ftrace concurrent use") Cc: stable@vger.kernel.org Signed-off-by: Yuan Chen <chenyuan@kylinos.cn> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-09-30Merge tag 'bpf-next-6.18' of ↵Linus Torvalds1-187/+14
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Pull bpf updates from Alexei Starovoitov: - Support pulling non-linear xdp data with bpf_xdp_pull_data() kfunc (Amery Hung) Applied as a stable branch in bpf-next and net-next trees. - Support reading skb metadata via bpf_dynptr (Jakub Sitnicki) Also a stable branch in bpf-next and net-next trees. - Enforce expected_attach_type for tailcall compatibility (Daniel Borkmann) - Replace path-sensitive with path-insensitive live stack analysis in the verifier (Eduard Zingerman) This is a significant change in the verification logic. More details, motivation, long term plans are in the cover letter/merge commit. - Support signed BPF programs (KP Singh) This is another major feature that took years to materialize. Algorithm details are in the cover letter/marge commit - Add support for may_goto instruction to s390 JIT (Ilya Leoshkevich) - Add support for may_goto instruction to arm64 JIT (Puranjay Mohan) - Fix USDT SIB argument handling in libbpf (Jiawei Zhao) - Allow uprobe-bpf program to change context registers (Jiri Olsa) - Support signed loads from BPF arena (Kumar Kartikeya Dwivedi and Puranjay Mohan) - Allow access to union arguments in tracing programs (Leon Hwang) - Optimize rcu_read_lock() + migrate_disable() combination where it's used in BPF subsystem (Menglong Dong) - Introduce bpf_task_work_schedule*() kfuncs to schedule deferred execution of BPF callback in the context of a specific task using the kernel’s task_work infrastructure (Mykyta Yatsenko) - Enforce RCU protection for KF_RCU_PROTECTED kfuncs (Kumar Kartikeya Dwivedi) - Add stress test for rqspinlock in NMI (Kumar Kartikeya Dwivedi) - Improve the precision of tnum multiplier verifier operation (Nandakumar Edamana) - Use tnums to improve is_branch_taken() logic (Paul Chaignon) - Add support for atomic operations in arena in riscv JIT (Pu Lehui) - Report arena faults to BPF error stream (Puranjay Mohan) - Search for tracefs at /sys/kernel/tracing first in bpftool (Quentin Monnet) - Add bpf_strcasecmp() kfunc (Rong Tao) - Support lookup_and_delete_elem command in BPF_MAP_STACK_TRACE (Tao Chen) * tag 'bpf-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (197 commits) libbpf: Replace AF_ALG with open coded SHA-256 selftests/bpf: Add stress test for rqspinlock in NMI selftests/bpf: Add test case for different expected_attach_type bpf: Enforce expected_attach_type for tailcall compatibility bpftool: Remove duplicate string.h header bpf: Remove duplicate crypto/sha2.h header libbpf: Fix error when st-prefix_ops and ops from differ btf selftests/bpf: Test changing packet data from kfunc selftests/bpf: Add stacktrace map lookup_and_delete_elem test case selftests/bpf: Refactor stacktrace_map case with skeleton bpf: Add lookup_and_delete_elem for BPF_MAP_STACK_TRACE selftests/bpf: Fix flaky bpf_cookie selftest selftests/bpf: Test changing packet data from global functions with a kfunc bpf: Emit struct bpf_xdp_sock type in vmlinux BTF selftests/bpf: Task_work selftest cleanup fixes MAINTAINERS: Delete inactive maintainers from AF_XDP bpf: Mark kfuncs as __noclone selftests/bpf: Add kprobe multi write ctx attach test selftests/bpf: Add kprobe write ctx attach test selftests/bpf: Add uprobe context ip register change test ...
2025-09-30tracing: Ensure optimized hashing worksMichal Koutný1-0/+2
If ever PID_MAX_DEFAULT changes, it must be compatible with tracing hashmaps assumptions. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250924113810.2433478-1-mkoutny@suse.com Link: https://lore.kernel.org/r/20240409110126.651e94cb@gandalf.local.home/ Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-30ftrace: Fix softlockup in ftrace_module_enableVladimir Riabchun1-0/+2
A soft lockup was observed when loading amdgpu module. If a module has a lot of tracable functions, multiple calls to kallsyms_lookup can spend too much time in RCU critical section and with disabled preemption, causing kernel panic. This is the same issue that was fixed in commit d0b24b4e91fc ("ftrace: Prevent RCU stall on PREEMPT_VOLUNTARY kernels") and commit 42ea22e754ba ("ftrace: Add cond_resched() to ftrace_graph_set_hash()"). Fix it the same way by adding cond_resched() in ftrace_module_enable. Link: https://lore.kernel.org/aMQD9_lxYmphT-up@vova-pc Signed-off-by: Vladimir Riabchun <ferr.lambarginio@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-28Merge tag 'trace-v6.17-rc7' of ↵Linus Torvalds2-1/+14
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix buffer overflow in osnoise_cpu_write() The allocated buffer to read user space did not add a nul terminating byte after copying from user the string. It then reads the string, and if user space did not add a nul byte, the read will continue beyond the string. Add a nul terminating byte after reading the string. - Fix missing check for lockdown on tracing There's a path from kprobe events or uprobe events that can update the tracing system even if lockdown on tracing is activate. Add a check in the dynamic event path. - Add a recursion check for the function graph return path Now that fprobes can hook to the function graph tracer and call different code between the entry and the exit, the exit code may now call functions that are not called in entry. This means that the exit handler can possibly trigger recursion that is not caught and cause the system to crash. Add the same recursion checks in the function exit handler as exists in the entry handler path. * tag 'trace-v6.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: fgraph: Protect return handler from recursion loop tracing: dynevent: Add a missing lockdown check on dynevent tracing/osnoise: Fix slab-out-of-bounds in _parse_integer_limit()
2025-09-27tracing: fgraph: Protect return handler from recursion loopMasami Hiramatsu (Google)1-0/+12
function_graph_enter_regs() prevents itself from recursion by ftrace_test_recursion_trylock(), but __ftrace_return_to_handler(), which is called at the exit, does not prevent such recursion. Therefore, while it can prevent recursive calls from fgraph_ops::entryfunc(), it is not able to prevent recursive calls to fgraph from fgraph_ops::retfunc(), resulting in a recursive loop. This can lead an unexpected recursion bug reported by Menglong. is_endbr() is called in __ftrace_return_to_handler -> fprobe_return -> kprobe_multi_link_exit_handler -> is_endbr. To fix this issue, acquire ftrace_test_recursion_trylock() in the __ftrace_return_to_handler() after unwind the shadow stack to mark this section must prevent recursive call of fgraph inside user-defined fgraph_ops::retfunc(). This is essentially a fix to commit 4346ba160409 ("fprobe: Rewrite fprobe on function-graph tracer"), because before that fgraph was only used from the function graph tracer. Fprobe allowed user to run any callbacks from fgraph after that commit. Reported-by: Menglong Dong <menglong8.dong@gmail.com> Closes: https://lore.kernel.org/all/20250918120939.1706585-1-dongml2@chinatelecom.cn/ Fixes: 4346ba160409 ("fprobe: Rewrite fprobe on function-graph tracer") Cc: stable@vger.kernel.org Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/175852292275.307379.9040117316112640553.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Menglong Dong <menglong8.dong@gmail.com> Acked-by: Menglong Dong <menglong8.dong@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-24Merge tag 'probes-fixes-v6.17-rc7' of ↵Linus Torvalds2-3/+8
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes fixes from Masami Hiramatsu: - fprobe: Even if there is a memory allocation failure, try to remove the addresses recorded until then from the filter. Previously we just skipped it. - tracing: dynevent: Add a missing lockdown check on dynevent. This dynevent is the interface for all probe events. Thus if there is no check, any probe events can be added after lock down the tracefs. * tag 'probes-fixes-v6.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: dynevent: Add a missing lockdown check on dynevent tracing: fprobe: Fix to remove recorded module addresses from filter
2025-09-25tracing: dynevent: Add a missing lockdown check on dyneventMasami Hiramatsu (Google)1-0/+4
Since dynamic_events interface on tracefs is compatible with kprobe_events and uprobe_events, it should also check the lockdown status and reject if it is set. Link: https://lore.kernel.org/all/175824455687.45175.3734166065458520748.stgit@devnote2/ Fixes: 17911ff38aa5 ("tracing: Add locked_down checks to the open calls of files created for tracefs") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: stable@vger.kernel.org
2025-09-24tracing: fprobe: Fix to remove recorded module addresses from filterMasami Hiramatsu (Google)1-3/+4
Even if there is a memory allocation failure in fprobe_addr_list_add(), there is a partial list of module addresses. So remove the recorded addresses from filter if exists. This also removes the redundant ret local variable. Fixes: a3dc2983ca7b ("tracing: fprobe: Cleanup fprobe hash when module unloading") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: stable@vger.kernel.org Reviewed-by: Menglong Dong <menglong8.dong@gmail.com>
2025-09-24bpf: Allow uprobe program to change context registersJiri Olsa1-2/+7
Currently uprobe (BPF_PROG_TYPE_KPROBE) program can't write to the context registers data. While this makes sense for kprobe attachments, for uprobe attachment it might make sense to be able to change user space registers to alter application execution. Since uprobe and kprobe programs share the same type (BPF_PROG_TYPE_KPROBE), we can't deny write access to context during the program load. We need to check on it during program attachment to see if it's going to be kprobe or uprobe. Storing the program's write attempt to context and checking on it during the attachment. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20250916215301.664963-2-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-23tracing: dynevent: Add a missing lockdown check on dyneventMasami Hiramatsu (Google)1-0/+4
Since dynamic_events interface on tracefs is compatible with kprobe_events and uprobe_events, it should also check the lockdown status and reject if it is set. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/175824455687.45175.3734166065458520748.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-23tracing/osnoise: Fix slab-out-of-bounds in _parse_integer_limit()Wang Liang1-1/+2
When config osnoise cpus by write() syscall, the following KASAN splat may be observed: BUG: KASAN: slab-out-of-bounds in _parse_integer_limit+0x103/0x130 Read of size 1 at addr ffff88810121e3a1 by task test/447 CPU: 1 UID: 0 PID: 447 Comm: test Not tainted 6.17.0-rc6-dirty #288 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x55/0x70 print_report+0xcb/0x610 kasan_report+0xb8/0xf0 _parse_integer_limit+0x103/0x130 bitmap_parselist+0x16d/0x6f0 osnoise_cpus_write+0x116/0x2d0 vfs_write+0x21e/0xcc0 ksys_write+0xee/0x1c0 do_syscall_64+0xa8/0x2a0 entry_SYSCALL_64_after_hwframe+0x77/0x7f </TASK> This issue can be reproduced by below code: const char *cpulist = "1"; int fd=open("/sys/kernel/debug/tracing/osnoise/cpus", O_WRONLY); write(fd, cpulist, strlen(cpulist)); Function bitmap_parselist() was called to parse cpulist, it require that the parameter 'buf' must be terminated with a '\0' or '\n'. Fix this issue by adding a '\0' to 'buf' in osnoise_cpus_write(). Cc: <mhiramat@kernel.org> Cc: <mathieu.desnoyers@efficios.com> Cc: <tglozar@redhat.com> Link: https://lore.kernel.org/20250916063948.3154627-1-wangliang74@huawei.com Fixes: 17f89102fe23 ("tracing/osnoise: Allow arbitrarily long CPU string") Signed-off-by: Wang Liang <wangliang74@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-23tracing: replace use of system_wq with system_percpu_wqMarco Crivellari1-1/+1
Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistentcy cannot be addressed without refactoring the API. system_wq is a per-CPU worqueue, yet nothing in its name tells about that CPU affinity constraint, which is very often not required by users. Make it clear by adding a system_percpu_wq. queue_work() / queue_delayed_work() mod_delayed_work() will now use the new per-cpu wq: whether the user still stick on the old name a warn will be printed along a wq redirect to the new one. This patch add the new system_percpu_wq except for mm, fs and net subsystem, whom are handled in separated patches. The old wq will be kept for a few release cylces. Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Link: https://lore.kernel.org/20250905091040.109772-2-marco.crivellari@suse.com Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-23tracing: Remove redundant 0 value initializationLiao Yuanhong1-1/+0
The saved_cmdlines_buffer struct is already zeroed by memset(). It's redundant to initialize s->cmdline_idx to 0. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250825123200.306272-1-liaoyuanhong@vivo.com Signed-off-by: Liao Yuanhong <liaoyuanhong@vivo.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-23tracing/osnoise: Use for_each_online_cpu() instead of for_each_cpu()Fushuai Wang1-2/+2
Replace the opencoded for_each_cpu(cpu, cpu_online_mask) loop with the more readable and equivalent for_each_online_cpu(cpu) macro. Link: https://lore.kernel.org/20250811064158.2456-1-wangfushuai@baidu.com Signed-off-by: Fushuai Wang <wangfushuai@baidu.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-23tracing: Use vmalloc_array() to improve codeQianfeng Rong1-1/+1
Remove array_size() calls and replace vmalloc() with vmalloc_array() in tracing_map_sort_entries(). vmalloc_array() is optimized better, uses fewer instructions, and handles overflow more concisely[1]. [1]: https://lore.kernel.org/lkml/abc66ec5-85a4-47e1-9759-2f60ab111971@vivo.com/ Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250817084725.59477-1-rongqianfeng@vivo.com Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-23tracing: Have syscall trace events show "0x" for values greater than 10Steven Rostedt1-3/+9
Currently the syscall trace events show each value as hexadecimal, but without adding "0x" it can be confusing: sys_write(fd: 4, buf: 0x55c4a1fa9270, count: 44) Looks like the above write wrote 44 bytes, when in reality it wrote 68 bytes. Add a "0x" for all values greater or equal to 10 to remove the ambiguity. For values less than 10, leave off the "0x" as that just adds noise to the output. Also change the iterator to check if "i" is nonzero and print the ", " delimiter at the start, then adding the logic to the trace_seq_printf() at the end. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Takaya Saeki <takayas@google.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ian Rogers <irogers@google.com> Cc: Douglas Raillard <douglas.raillard@arm.com> Link: https://lore.kernel.org/20250923130713.764558957@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-23tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE()Steven Rostedt2-10/+8
The syscall events are pseudo events that hook to the raw syscalls. The ftrace_syscall_enter/exit() callback is called by the raw_syscall enter/exit tracepoints respectively whenever any of the syscall events are enabled. The trace_array has an array of syscall "files" that correspond to the system calls based on their __NR_SYSCALL number. The array is read and if there's a pointer to a trace_event_file then it is considered enabled and if it is NULL that syscall event is considered disabled. Currently it uses an rcu_dereference_sched() to get this pointer and a rcu_assign_ptr() or RCU_INIT_POINTER() to write to it. This is unnecessary as the file pointer will not go away outside the synchronization of the tracepoint logic itself. And this code adds no extra RCU synchronization that uses this. Replace these functions with a simple READ_ONCE() and WRITE_ONCE() which is all they need. This will also allow this code to not depend on preemption being disabled as system call tracepoints are now allowed to fault. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Takaya Saeki <takayas@google.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ian Rogers <irogers@google.com> Cc: Douglas Raillard <douglas.raillard@arm.com> Link: https://lore.kernel.org/20250923130713.594320290@kernel.org Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-18bpf: Move the signature kfuncs to helpers.cKP Singh1-183/+0
No functional changes, except for the addition of the headers for the kfuncs so that they can be used for signature verification. Signed-off-by: KP Singh <kpsingh@kernel.org> Link: https://lore.kernel.org/r/20250914215141.15144-8-kpsingh@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-18Merge tag 'trace-rv-v6.17-rc5' of ↵Linus Torvalds2-2/+6
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull runtime verifier fixes from Steven Rostedt: - Fix build in some RISC-V flavours Some system calls only are available for the 64bit RISC-V machines. #ifdef out the cases of clock_nanosleep and futex in the sleep monitor if they are not supported by the architecture. - Fix wrong cast, obsolete after refactoring Use container_of() to get to the rv_monitor structure from the enable_monitors_next() 'p' pointer. The assignment worked only because the list field used happened to be the first field of the structure. - Remove redundant include files Some include files were listed twice. Remove the extra ones and sort the includes. - Fix missing unlock on failure There was an error path that exited the rv_register_monitor() function without releasing a lock. Change that to goto the lock release. - Add Gabriele Monaco to be Runtime Verifier maintainer Gabriele is doing most of the work on RV as well as collecting patches. Add him to the maintainers file for Runtime Verification. * tag 'trace-rv-v6.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: rv: Add Gabriele Monaco as maintainer for Runtime Verification rv: Fix missing mutex unlock in rv_register_monitor() include/linux/rv.h: remove redundant include file rv: Fix wrong type cast in enabled_monitors_next() rv: Support systems with time64-only syscalls
2025-09-18tracing: kprobe-event: Fix null-ptr-deref in trace_kprobe_create_internal()Wang Liang1-0/+2
A crash was observed with the following output: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN PTI KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] CPU: 1 UID: 0 PID: 2899 Comm: syz.2.399 Not tainted 6.17.0-rc5+ #5 PREEMPT(none) RIP: 0010:trace_kprobe_create_internal+0x3fc/0x1440 kernel/trace/trace_kprobe.c:911 Call Trace: <TASK> trace_kprobe_create_cb+0xa2/0xf0 kernel/trace/trace_kprobe.c:1089 trace_probe_create+0xf1/0x110 kernel/trace/trace_probe.c:2246 dyn_event_create+0x45/0x70 kernel/trace/trace_dynevent.c:128 create_or_delete_trace_kprobe+0x5e/0xc0 kernel/trace/trace_kprobe.c:1107 trace_parse_run_command+0x1a5/0x330 kernel/trace/trace.c:10785 vfs_write+0x2b6/0xd00 fs/read_write.c:684 ksys_write+0x129/0x240 fs/read_write.c:738 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0x5d/0x2d0 arch/x86/entry/syscall_64.c:94 </TASK> Function kmemdup() may return NULL in trace_kprobe_create_internal(), add check for it's return value. Link: https://lore.kernel.org/all/20250916075816.3181175-1-wangliang74@huawei.com/ Fixes: 33b4e38baa03 ("tracing: kprobe-event: Allocate string buffers from heap") Signed-off-by: Wang Liang <wangliang74@huawei.com> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-09-15bpf...d_path(): constify path argumentAl Viro1-1/+1
Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-15rv: Fix missing mutex unlock in rv_register_monitor()Zhen Ni1-1/+1
If create_monitor_dir() fails, the function returns directly without releasing rv_interface_lock. This leaves the mutex locked and causes subsequent monitor registration attempts to deadlock. Fix it by making the error path jump to out_unlock, ensuring that the mutex is always released before returning. Fixes: 24cbfe18d55a ("rv: Merge struct rv_monitor_def into struct rv_monitor") Signed-off-by: Zhen Ni <zhen.ni@easystack.cn> Reviewed-by: Gabriele Monaco <gmonaco@redhat.com> Reviewed-by: Nam Cao <namcao@linutronix.de> Link: https://lore.kernel.org/r/20250903065112.1878330-1-zhen.ni@easystack.cn Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
2025-09-15rv: Fix wrong type cast in enabled_monitors_next()Nam Cao1-1/+1
Argument 'p' of enabled_monitors_next() is not a pointer to struct rv_monitor, it is actually a pointer to the list_head inside struct rv_monitor. Therefore it is wrong to cast 'p' to struct rv_monitor *. This wrong type cast has been there since the beginning. But it still worked because the list_head was the first field in struct rv_monitor_def. This is no longer true since commit 24cbfe18d55a ("rv: Merge struct rv_monitor_def into struct rv_monitor") moved the list_head, and this wrong type cast became a functional problem. Properly use container_of() instead. Fixes: 24cbfe18d55a ("rv: Merge struct rv_monitor_def into struct rv_monitor") Signed-off-by: Nam Cao <namcao@linutronix.de> Reviewed-by: Gabriele Monaco <gmonaco@redhat.com> Link: https://lore.kernel.org/r/20250806120911.989365-1-namcao@linutronix.de Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
2025-09-15rv: Support systems with time64-only syscallsPalmer Dabbelt1-0/+4
Some systems (like 32-bit RISC-V) only have the 64-bit time_t versions of syscalls. So handle the 32-bit time_t version of those being undefined. Fixes: f74f8bb246cf ("rv: Add rtapp_sleep monitor") Closes: https://lore.kernel.org/oe-kbuild-all/202508160204.SsFyNfo6-lkp@intel.com Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com> Acked-by: Nam Cao <namcao@linutronix.de> Link: https://lore.kernel.org/r/20250804194518.97620-2-palmer@dabbelt.com Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
2025-09-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc5Alexei Starovoitov8-30/+62
Cross-merge BPF and other fixes after downstream PR. No conflicts. Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-08tracing: Silence warning when chunk allocation fails in trace_pid_writePu Lehui1-1/+5
Syzkaller trigger a fault injection warning: WARNING: CPU: 1 PID: 12326 at tracepoint_add_func+0xbfc/0xeb0 Modules linked in: CPU: 1 UID: 0 PID: 12326 Comm: syz.6.10325 Tainted: G U 6.14.0-rc5-syzkaller #0 Tainted: [U]=USER Hardware name: Google Compute Engine/Google Compute Engine RIP: 0010:tracepoint_add_func+0xbfc/0xeb0 kernel/tracepoint.c:294 Code: 09 fe ff 90 0f 0b 90 0f b6 74 24 43 31 ff 41 bc ea ff ff ff RSP: 0018:ffffc9000414fb48 EFLAGS: 00010283 RAX: 00000000000012a1 RBX: ffffffff8e240ae0 RCX: ffffc90014b78000 RDX: 0000000000080000 RSI: ffffffff81bbd78b RDI: 0000000000000001 RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000001 R12: ffffffffffffffef R13: 0000000000000000 R14: dffffc0000000000 R15: ffffffff81c264f0 FS: 00007f27217f66c0(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2e80dff8 CR3: 00000000268f8000 CR4: 00000000003526f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> tracepoint_probe_register_prio+0xc0/0x110 kernel/tracepoint.c:464 register_trace_prio_sched_switch include/trace/events/sched.h:222 [inline] register_pid_events kernel/trace/trace_events.c:2354 [inline] event_pid_write.isra.0+0x439/0x7a0 kernel/trace/trace_events.c:2425 vfs_write+0x24c/0x1150 fs/read_write.c:677 ksys_write+0x12b/0x250 fs/read_write.c:731 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f We can reproduce the warning by following the steps below: 1. echo 8 >> set_event_notrace_pid. Let tr->filtered_pids owns one pid and register sched_switch tracepoint. 2. echo ' ' >> set_event_pid, and perform fault injection during chunk allocation of trace_pid_list_alloc. Let pid_list with no pid and assign to tr->filtered_pids. 3. echo ' ' >> set_event_pid. Let pid_list is NULL and assign to tr->filtered_pids. 4. echo 9 >> set_event_pid, will trigger the double register sched_switch tracepoint warning. The reason is that syzkaller injects a fault into the chunk allocation in trace_pid_list_alloc, causing a failure in trace_pid_list_set, which may trigger double register of the same tracepoint. This only occurs when the system is about to crash, but to suppress this warning, let's add failure handling logic to trace_pid_list_set. Link: https://lore.kernel.org/20250908024658.2390398-1-pulehui@huaweicloud.com Fixes: 8d6e90983ade ("tracing: Create a sparse bitmask for pid filtering") Reported-by: syzbot+161412ccaeff20ce4dde@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/67cb890e.050a0220.d8275.022e.GAE@google.com Signed-off-by: Pu Lehui <pulehui@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-06tracing/osnoise: Fix null-ptr-deref in bitmap_parselist()Wang Liang1-0/+3
A crash was observed with the following output: BUG: kernel NULL pointer dereference, address: 0000000000000010 Oops: Oops: 0000 [#1] SMP NOPTI CPU: 2 UID: 0 PID: 92 Comm: osnoise_cpus Not tainted 6.17.0-rc4-00201-gd69eb204c255 #138 PREEMPT(voluntary) RIP: 0010:bitmap_parselist+0x53/0x3e0 Call Trace: <TASK> osnoise_cpus_write+0x7a/0x190 vfs_write+0xf8/0x410 ? do_sys_openat2+0x88/0xd0 ksys_write+0x60/0xd0 do_syscall_64+0xa4/0x260 entry_SYSCALL_64_after_hwframe+0x77/0x7f </TASK> This issue can be reproduced by below code: fd=open("/sys/kernel/debug/tracing/osnoise/cpus", O_WRONLY); write(fd, "0-2", 0); When user pass 'count=0' to osnoise_cpus_write(), kmalloc() will return ZERO_SIZE_PTR (16) and cpulist_parse() treat it as a normal value, which trigger the null pointer dereference. Add check for the parameter 'count'. Cc: <mhiramat@kernel.org> Cc: <mathieu.desnoyers@efficios.com> Cc: <tglozar@redhat.com> Link: https://lore.kernel.org/20250906035610.3880282-1-wangliang74@huawei.com Fixes: 17f89102fe23 ("tracing/osnoise: Allow arbitrarily long CPU string") Signed-off-by: Wang Liang <wangliang74@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-06trace/fgraph: Fix error handlingGuenter Roeck1-1/+2
Commit edede7a6dcd7 ("trace/fgraph: Fix the warning caused by missing unregister notifier") added a call to unregister the PM notifier if register_ftrace_graph() failed. It does so unconditionally. However, the PM notifier is only registered with the first call to register_ftrace_graph(). If the first registration was successful and a subsequent registration failed, the notifier is now unregistered even if ftrace graphs are still registered. Fix the problem by only unregistering the PM notifier during error handling if there are no active fgraph registrations. Fixes: edede7a6dcd7 ("trace/fgraph: Fix the warning caused by missing unregister notifier") Closes: https://lore.kernel.org/all/63b0ba5a-a928-438e-84f9-93028dd72e54@roeck-us.net/ Cc: Ye Weihua <yeweihua4@huawei.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250906050618.2634078-1-linux@roeck-us.net Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>