//! CPU local storage
//!
//! We want some statics to be cpu-local (e.g. [`CURRENT_THREAD`]). We could implement this fully
//! in software, by having an area of memory that is replicated for every cpu core, where
//! statics are indexes in this memory area, and provide getters and setters to access and modify
//! the cpu-local statics.
//!
//! However this is not ideal: it is not really optimized, and pretty tedious.
//!
//! Instead we use the very common concept of Thread Local Storage (TLS), apply it to cpu cores
//! instead of threads, and let the compiler do all the hard work for us.
//!
//! # Usage
//!
//! In the kernel you declare a cpu-local using the [#\[thread_local\] attribute]:
//!
//! ```
//! #[thread_local]
//! static MY_CPU_LOCAL: core::cell::Cell<u8> = core::cell::Cell::new(42);
//! ```
//!
//! and access it as if it were a regular static, except that each cpu core will have its own view
//! of the static.
//!
//! The compiler is responsible for generating code that will access the right address, provided
//! we configured TLS correctly.
//!
//! ##### Early boot
//!
//! Note that you can't access a cpu-local static before [`init_cpu_locals`] is called, because
//! the cpu-local areas aren't initialized yet, and this will likely result in a cpu exception
//! being raised, or UB.
//!
//! This means you can't ever access cpu-locals in early boot. If your code might be called during
//! early boot, we advise you to use [`ARE_CPU_LOCALS_INITIALIZED_YET`] to know if you're allowed
//! to access your cpu-local static, and if not return an error of some kind.
//!
//! # Inner workings
//!
//! We implement TLS according to the conventions laid out in [Ulrich Drepper's paper on TLS],
//! which are followed by LLVM and most compilers.
//!
//! Since we're running on i386, we follow variant II.
//!
//! Each cpu core's `gs` segment points to a thread local memory area where cpu-local statics live.
//! Cpu-local statics are simply accessed through an offset from `gs`.
//! Those regions can be found in [`CPU_LOCAL_REGIONS`].
//!
//! The linker is in charge of creating an ELF segment of type `PT_TLS` where an initialization image
//! for cpu local regions can be found, and is meant to be copy-pasted for every ~~thread we create~~
//! cpu core we have.
//!
//! ##### Segmentation
//!
//! Each core gets its own [GDT]. In each of these there is a `KTls` segment which points to this
//! core's cpu-local area, and which is meant to be loaded into `gs`.
//!
//! Because userspace might want to use Thread Local Storage too, and also needs `gs` to point to its
//! thread local area (see [`set_thread_area`]), we swap the segment `gs` points to every time
//! we enter and leave the kernel in [`trap_gate_asm`], from `UTls_Elf` to `KTls` and back.
//!
//! TLS on x86 is really weird. It uses variant II, where offsets must be *subtracted* from `gs`,
//! even though segmentation only supports *adding* offsets. The only way to make this work is to
//! have the `gs` segment's limit be `0xffffffff`, effectively spanning the whole address space.
//! When the cpu adds a "negative" offset (e.g. `0xfffffffc` for -4), it treats it as a huge
//! unsigned positive offset, which when added to `gs`'s base will "wrap around" the address space,
//! and effectively end up 4 bytes behind `gs`'s base.
//!
//! Illustration:
//!
//! ![cpu backflip](https://raw.githubusercontent.com/sunriseos/SunriseOS/master/kernel/res/cpu_locals_segmentation_doc.gif)
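//!
//! In code, the same wrap-around arithmetic looks like this (a minimal sketch; the base value is
//! made up, the real one is whatever this core's `KTls` segment points to):
//!
//! ```
//! // A "negative" offset of -4 is encoded as the unsigned value 0xfffffffc.
//! let gs_base: u32 = 0x8040_0000;            // hypothetical KTls base
//! let offset: u32 = 0u32.wrapping_sub(4);    // "-4" == 0xfffffffc
//! // Segmentation adds the offset to the base modulo 2^32, so the sum wraps around
//! // the address space and lands 4 bytes *behind* the base.
//! assert_eq!(gs_base.wrapping_add(offset), gs_base - 4);
//! ```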
//!
//! ##### dtv and `__tls_get_addr`
//!
//! We're the kernel, and we don't do dynamic loading (no loadable kernel modules).
//! Because of this, we know our TLS model will be static (either Initial Exec or Local Exec).
//! Those models always access thread-locals directly via `gs`, and always short-circuit the dtv.
//!
//! So we don't even bother allocating a dtv array at all. Neither do we define a `__tls_get_addr`
//! function.
//!
//! [`CURRENT_THREAD`]: crate::scheduler::CURRENT_THREAD
//! [`init_cpu_locals`]: init_cpu_locals
//! [`ARE_CPU_LOCALS_INITIALIZED_YET`]: ARE_CPU_LOCALS_INITIALIZED_YET
//! [Ulrich Drepper's paper on TLS]: https://web.archive.org/web/20190710135250/https://akkadia.org/drepper/tls.pdf
//! [`CPU_LOCAL_REGIONS`]: crate::cpu_locals::CPU_LOCAL_REGIONS
//! [GDT]: crate::i386::gdt
//! [`set_thread_area`]: crate::syscalls::set_thread_area
//! [#\[thread_local\] attribute]: https://github.com/rust-lang/rust/issues/10310

use crate::i386::multiboot;
use crate::elf_loader::map_grub_module;
use crate::i386::gdt::{GDT, GdtIndex};
use sunrise_libutils::div_ceil;
use xmas_elf::program::{Type, SegmentData};
use alloc::alloc::{alloc_zeroed, dealloc};
use core::mem::align_of;
use core::alloc::Layout;
use core::mem::size_of;
use alloc::vec::Vec;
use crate::sync::Once;
use core::sync::atomic::{AtomicBool, Ordering};
use core::fmt::Debug;

/// Use this if your code might run in an early boot stage to know if you're
/// allowed to access a cpu-local variable. Accessing one when this is false is UB.
///
/// Always true after [`init_cpu_locals`] has been called.
pub static ARE_CPU_LOCALS_INITIALIZED_YET: AtomicBool = AtomicBool::new(false);

/// Array of cpu local regions, copied from the initialization image in the kernel's ELF.
///
/// One per cpu core.
static CPU_LOCAL_REGIONS: Once<Vec<CpuLocalRegion>> = Once::new();

/// Address that should be put in the `KTls` segment's base.
/// The limit should be `0xffffffff`.
///
/// Used for creating a core's GDT, before starting it.
///
/// # Panics
///
/// Panics if `cpu_id` is greater than or equal to the `cpu_count` that was supplied to
/// [`init_cpu_locals`].
pub fn get_cpu_locals_ptr_for_core(cpu_id: usize) -> *const u8 {
    CPU_LOCAL_REGIONS.r#try()
        .expect("CPU_LOCAL_REGIONS not initialized")
        .get(cpu_id)
        .unwrap_or_else(|| panic!("cpu locals not initialized for cpu id {}", cpu_id))
        .tcb() as *const ThreadControlBlock as *const u8
}
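
/// Illustrative sketch only: a hypothetical helper (not part of the kernel's API) showing the
/// guard pattern recommended in the "Early boot" section of the module documentation. `f` would
/// be a closure that touches `#[thread_local]` statics.
#[allow(dead_code)]
fn with_cpu_locals<T, F: FnOnce() -> T>(f: F) -> Option<T> {
    if ARE_CPU_LOCALS_INITIALIZED_YET.load(Ordering::Relaxed) {
        // gs points to this core's cpu-local region, cpu-local accesses are fine.
        Some(f())
    } else {
        // Too early in boot: the cpu-local regions don't exist yet, accessing one would be UB.
        None
    }
}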

/// Initializes cpu locals during the early boot stage.
///
/// * Maps the kernel's ELF to get our `PT_TLS` program header information, including the TLS
///   initialization image.
/// * Allocates an array of `cpu_count` cpu local regions and stores them in [CPU_LOCAL_REGIONS].
/// * Makes this core's `KTls` segment point to `CPU_LOCAL_REGIONS[0]`'s [`ThreadControlBlock`].
///
/// # Panics
///
/// * Failed to map the kernel's ELF.
/// * Failed to get the kernel ELF's TLS initialization image.
pub fn init_cpu_locals(cpu_count: usize) {
    debug_assert!(cpu_count > 0, "You can't have 0 cpu cores - I'm running code, therefore I am");

    CPU_LOCAL_REGIONS.call_once(|| {
        // map our own ELF so that we can access our PT_TLS
        let mapped_kernel_elf = multiboot::try_get_boot_information()
            .and_then(|info| info.module_tags().next())
            .and_then(|module| map_grub_module(module).ok())
            .expect("cpu_locals: cannot get kernel elf");
        let kernel_elf = mapped_kernel_elf.elf.as_ref()
            .expect("cpu_locals: module 0 is not kernel elf");

        // find the PT_TLS header
        let tls_program_header = kernel_elf.program_iter()
            .find(|p_header|
                p_header.get_type().ok()
                    .map(|p_header_type| match p_header_type {
                        Type::Tls => true,
                        _ => false
                    })
                    .unwrap_or(false)
            )
            .expect("cpu_locals: kernel elf has no PT_TLS program header");

        // get our tls initialisation image at header.p_offset, header.p_filesz
        let tls_init_image = match tls_program_header.get_data(kernel_elf)
            .expect("cpu_locals: cannot get PT_TLS content") {
            SegmentData::Undefined(tls_data) => tls_data,
            x => panic!("PT_TLS: Unexpected Segment data {:?}", x)
        };

        // create one cpu local region per cpu from the initialisation image
        let mut cpu_local_regions = Vec::with_capacity(cpu_count);
        for _ in 0..cpu_count {
            cpu_local_regions.push(
                CpuLocalRegion::allocate(
                    tls_init_image,
                    tls_program_header.mem_size() as usize,
                    tls_program_header.align() as usize
                )
            );
        }

        // make gs point to the first cpu local region.
        let mut gdt = GDT.r#try()
            .expect("GDT not initialized")
            .lock();
        gdt.table[GdtIndex::KTls as usize].set_base(
            cpu_local_regions[0].tcb() as *const _ as usize as u32
        );
        gdt.commit(None, None, None, None, None, None);

        cpu_local_regions
    });

    // yes, they are 😌
    ARE_CPU_LOCALS_INITIALIZED_YET.store(true, Ordering::Relaxed);
}

/// The `round` function, as defined in section 3.0 of Drepper's paper:
///
/// ```text
/// round(x,y) = y * ⌈x/y⌉
/// ```
///
/// Just a poorly-named `align_up`.
fn tls_align_up(x: usize, y: usize) -> usize {
    y * div_ceil(x, y)
}

/// Elf TLS TCB
///
/// Variant II leaves the specification of the ThreadControlBlock (TCB) to the implementor,
/// with the only requirement being that the first word of the TCB, pointed to by `tp`, contains
/// its own address, i.e. is a pointer to itself (GNU variant).
///
/// We don't need to store anything else in the TCB; it's just the self pointer.
#[repr(C)]
#[derive(Debug)]
struct ThreadControlBlock {
    /// Pointer containing its own address.
    tp_self_ptr: *const ThreadControlBlock,
}
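
/// Illustrative sketch only (hypothetical, not used by the kernel): what the self-pointer
/// requirement above means in practice. Once a region is initialised, the word at `gs:0`
/// contains the address of `gs:0` itself, which is what this predicate checks.
#[allow(dead_code)]
fn tcb_self_ptr_invariant_holds(tcb: &ThreadControlBlock) -> bool {
    // The TCB's only field must point back to the TCB itself.
    core::ptr::eq(tcb.tp_self_ptr, tcb as *const ThreadControlBlock)
}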

/// Represents an allocated cpu local region.
///
/// Because cpu local regions have a really specific layout, we don't use Box and instead interact
/// with the allocator directly. This type is the equivalent of a Box: it stores the pointer to the
/// allocated memory, and deallocates it on Drop.
struct CpuLocalRegion {
    /// Pointer to the allocated memory.
    ptr: usize,
    /// Layout of the allocated memory. Used when deallocating.
    layout: Layout,
    /// Offset of the TCB in this allocation.
    tcb_offset: usize,
}

impl CpuLocalRegion {
    /// Returns a pointer to the [ThreadControlBlock] in the allocated region.
    /// All cpu-local arithmetic is done relative to this pointer.
    ///
    /// For TLS to work, the value stored at this address must be the address itself, i.e.
    /// a pointer pointing to itself.
    fn tcb(&self) -> &ThreadControlBlock {
        unsafe {
            // safe: - guaranteed to be aligned, and still in the allocation,
            //       - no one should ever have a mut reference to the ThreadControlBlock after its
            //         initialisation.
            &*((self.ptr + self.tcb_offset) as *const ThreadControlBlock)
        }
    }

    /// Allocates a CpuLocalRegion.
    ///
    /// The region's content is copied from the TLS initialisation image described by `block_src`,
    /// padded with 0s up to `block_size`, to which is appended a [`ThreadControlBlock`].
    ///
    /// The CpuLocalRegion uses `PT_TLS`'s `p_align` field, passed in `block_align`,
    /// to compute its layout and total size.
    ///
    /// ### Alignment
    ///
    /// ```text
    ///
    ///       V----------------------V  tls_align_up(tls_size_1, align_1)
    ///
    ///                              +-- gs:0
    ///                              |
    ///       +----------------------|-- tls_offset_1 = gs:0 - tls_align_up(tls_size_1, align_1)
    ///       |                      |
    ///       V                      V
    ///
    ///       j----------------~-----j---------j
    ///   ... |   tls_size_1   | pad |   TCB   |
    ///       j----------------~-----j---------j
    ///
    ///   ^                    ^     ^
    ///   |                    |     |
    ///   |                    |     +-- TCB_align: Determines alignment of everything.
    ///   |                    |         = max(align_of::<TCB>(), align_1). e.g.: 16.
    ///   |                    |
    ///   |                    +-------- TCB_align - n * align_1
    ///   |                              => still aligned to align_1 because TCB is aligned to align_1.
    ///   |
    ///   +------------------------------ alloc_align == TCB_align
    ///                                   => &TCB = &alloc + tls_align_up(gs:0 - tls_offset_1, TCB_align)
    ///
    ///   ^---^ alloc_pad
    ///
    /// ```
    #[allow(clippy::cast_ptr_alignment)]
    fn allocate(block_src: &[u8], block_size: usize, block_align: usize) -> Self {
        let tls_offset1 = tls_align_up(block_size, block_align);
        let tcb_align = usize::max(align_of::<ThreadControlBlock>(), block_align);
        let tcb_offset = tls_align_up(tls_offset1, tcb_align);
        let alloc_pad_size = tcb_offset - tls_offset1;
        let layout = Layout::from_size_align(
            tcb_offset + size_of::<ThreadControlBlock>(),
            tcb_align
        ).unwrap();
        let alloc = unsafe {
            // safe: layout.size >= sizeof::<TCB> -> layout.size != 0
            alloc_zeroed(layout)
        };
        assert!(!alloc.is_null(), "cpu_locals: failed static area allocation");

        unsafe {
            // safe: everything is done within our allocation, u8 is always aligned.
            // copy data
            core::ptr::copy_nonoverlapping(
                block_src as *const [u8] as *const u8,
                alloc.add(alloc_pad_size),
                block_src.len()
            );
            // .tbss + pad are already set to 0 by alloc_zeroed.
            // write tcb
            core::ptr::write(
                alloc.add(tcb_offset) as *mut ThreadControlBlock,
                ThreadControlBlock {
                    tp_self_ptr: alloc.add(tcb_offset) as *const ThreadControlBlock
                }
            );
        };
        Self {
            ptr: alloc as usize,
            layout,
            tcb_offset
        }
    }
}

impl Drop for CpuLocalRegion {
    /// Dropping a CpuLocalRegion deallocates it.
    fn drop(&mut self) {
        unsafe {
            // safe: - self.ptr is obviously allocated.
            //       - self.layout is the same argument that was used for alloc.
            dealloc(self.ptr as *mut u8, self.layout)
        };
    }
}

impl Debug for CpuLocalRegion {
    fn fmt(&self, f: &mut core::fmt::Formatter) -> Result<(), core::fmt::Error> {
        f.debug_struct("CpuLocalRegion")
            .field("start_address", &self.ptr)
            .field("tcb_address", &self.tcb())
            .field("total_size", &self.layout.size())
            .finish()
    }
}
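
// Illustrative tests: a sketch that exercises only the pure layout arithmetic above, assuming
// the crate can be compiled with the standard `cfg(test)` harness (the numbers are made up).
#[cfg(test)]
mod tests {
    use super::{tls_align_up, ThreadControlBlock};
    use core::mem::{align_of, size_of};

    /// `round(x, y) = y * ⌈x/y⌉`, checked on a few values.
    #[test]
    fn tls_align_up_rounds_up() {
        assert_eq!(tls_align_up(0, 4), 0);
        assert_eq!(tls_align_up(13, 8), 16);
        assert_eq!(tls_align_up(16, 8), 16);
    }

    /// Walks the same layout computation as `CpuLocalRegion::allocate` with made-up numbers:
    /// a 13-byte TLS image aligned to 8 puts the TCB at offset 16.
    #[test]
    fn allocate_layout_arithmetic() {
        let (block_size, block_align) = (13usize, 8usize);
        let tls_offset1 = tls_align_up(block_size, block_align);
        let tcb_align = usize::max(align_of::<ThreadControlBlock>(), block_align);
        let tcb_offset = tls_align_up(tls_offset1, tcb_align);

        assert_eq!(tls_offset1, 16);
        assert_eq!(tcb_offset, 16);
        // The TCB stays aligned, and the whole region covers the TLS image plus the TCB.
        assert_eq!(tcb_offset % tcb_align, 0);
        assert!(tcb_offset + size_of::<ThreadControlBlock>() >= block_size);
    }
}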