diff options
| author | Rik van Riel <riel@surriel.com> | 2025-02-25 22:00:45 -0500 |
|---|---|---|
| committer | Ingo Molnar <mingo@kernel.org> | 2025-03-19 11:12:29 +0100 |
| commit | 4afeb0ed1753ebcad93ee3b45427ce85e9c8ec40 (patch) | |
| tree | 1d68278b950a636707c4bf0ca449f76795a37d9c /arch/x86/include/asm/tlbflush.h | |
| parent | x86/mm: Add global ASID process exit helpers (diff) | |
| download | linux-4afeb0ed1753ebcad93ee3b45427ce85e9c8ec40.tar.gz linux-4afeb0ed1753ebcad93ee3b45427ce85e9c8ec40.zip | |
x86/mm: Enable broadcast TLB invalidation for multi-threaded processes
There is not enough room in the 12-bit ASID address space to hand out
broadcast ASIDs to every process. Only hand out broadcast ASIDs to processes
when they are observed to be simultaneously running on 4 or more CPUs.
This also allows single threaded process to continue using the cheaper, local
TLB invalidation instructions like INVLPGB.
Due to the structure of flush_tlb_mm_range(), the INVLPGB flushing is done in
a generically named broadcast_tlb_flush() function which can later also be
used for Intel RAR.
Combined with the removal of unnecessary lru_add_drain calls() (see
https://lore.kernel.org/r/20241219153253.3da9e8aa@fangorn) this results in
a nice performance boost for the will-it-scale tlb_flush2_threads test on an
AMD Milan system with 36 cores:
- vanilla kernel: 527k loops/second
- lru_add_drain removal: 731k loops/second
- only INVLPGB: 527k loops/second
- lru_add_drain + INVLPGB: 1157k loops/second
Profiling with only the INVLPGB changes showed while TLB invalidation went
down from 40% of the total CPU time to only around 4% of CPU time, the
contention simply moved to the LRU lock.
Fixing both at the same time about doubles the number of iterations per second
from this case.
Comparing will-it-scale tlb_flush2_threads with several different numbers of
threads on a 72 CPU AMD Milan shows similar results. The number represents the
total number of loops per second across all the threads:
threads tip INVLPGB
1 315k 304k
2 423k 424k
4 644k 1032k
8 652k 1267k
16 737k 1368k
32 759k 1199k
64 636k 1094k
72 609k 993k
1 and 2 thread performance is similar with and without INVLPGB, because
INVLPGB is only used on processes using 4 or more CPUs simultaneously.
The number is the median across 5 runs.
Some numbers closer to real world performance can be found at Phoronix, thanks
to Michael:
https://www.phoronix.com/news/AMD-INVLPGB-Linux-Benefits
[ bp:
- Massage
- :%s/\<static_cpu_has\>/cpu_feature_enabled/cgi
- :%s/\<clear_asid_transition\>/mm_clear_asid_transition/cgi
- Fold in a 0day bot fix: https://lore.kernel.org/oe-kbuild-all/202503040000.GtiWUsBm-lkp@intel.com
]
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Nadav Amit <nadav.amit@gmail.com>
Link: https://lore.kernel.org/r/20250226030129.530345-11-riel@surriel.com
Diffstat (limited to 'arch/x86/include/asm/tlbflush.h')
| -rw-r--r-- | arch/x86/include/asm/tlbflush.h | 6 |
1 files changed, 6 insertions, 0 deletions
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index e6c3be06dd21..7cad283d502d 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -280,6 +280,11 @@ static inline void mm_assign_global_asid(struct mm_struct *mm, u16 asid) smp_store_release(&mm->context.global_asid, asid); } +static inline void mm_clear_asid_transition(struct mm_struct *mm) +{ + WRITE_ONCE(mm->context.asid_transition, false); +} + static inline bool mm_in_asid_transition(struct mm_struct *mm) { if (!cpu_feature_enabled(X86_FEATURE_INVLPGB)) @@ -291,6 +296,7 @@ static inline bool mm_in_asid_transition(struct mm_struct *mm) static inline u16 mm_global_asid(struct mm_struct *mm) { return 0; } static inline void mm_init_global_asid(struct mm_struct *mm) { } static inline void mm_assign_global_asid(struct mm_struct *mm, u16 asid) { } +static inline void mm_clear_asid_transition(struct mm_struct *mm) { } static inline bool mm_in_asid_transition(struct mm_struct *mm) { return false; } #endif /* CONFIG_BROADCAST_TLB_FLUSH */ |
