scx_utils/infeasible.rs
// Copyright (c) Meta Platforms, Inc. and affiliates.

// This software may be used and distributed according to the terms of the
// GNU General Public License version 2.

//! # SCX Load Calculator
//!
//! A crate providing abstractions for calculating load between scheduling
//! domains.
//!
//! Load Balancing Primer
//! ---------------------
//!
//! Let's begin with a brief overview of load balancing. We'll first introduce
//! the relevant terms, and then explain what load balancing is from there:
//!
//! *Weight*
//!
//! A positive integer value representing a task's priority in the system. The
//! meaning of weight is ultimately up to the caller to determine and apply, but
//! conceptually speaking weight is a linear scaling factor of a task's
//! perceived load (defined below) in the system.
//!
//! *Duty Cycle*
//!
//! A task's duty cycle is the proportion of time that it could have used a CPU
//! in some time window. For example, say that a task does nothing but spin in a
//! tight loop for 10ms, and then sleep for 10ms. Then in a 20ms time window,
//! its duty cycle would be 0.5.
//!
//! Note that this is not the same thing as the proportion of time it's
//! _runnable_. For example, if you had two such tasks sharing a CPU, one of
//! them may wait for 10ms to use the core while the other one runs, and then it
//! would run for its 10ms duty cycle. It was runnable for 20ms, but could only
//! actually use the CPU for 10ms, so its duty cycle was 10ms / 20ms = 0.5.
//!
//! *Load*
//!
//! A scheduling entity's load l_x is simply the product of its weight and duty
//! cycle:
//!
//! l_x = w_x * d_x
//!
//! *Infeasible Weights*
//!
//! At a conceptual level, the goal of a load balancer is of course to balance
//! load across the system. If a scheduling entity has load l_x, and the total
//! sum of all entities' loads on the system is L, then entity x should receive
//! l_x / L proportion of the available CPU capacity on the system.
//!
//! This formulation works fine in many cases, but can break down when an
//! entity's proportion of load (and thus allotted CPU capacity) exceeds the
//! amount of CPU it can consume.
//!
//! For example, say you were on a system with 2 sets of 2-core groups (for a
//! total of 4 cores on the system), and eight tasks with the following duty
//! cycle and weight:
//!
//! ID    Weight    Duty Cycle    Load
//!
//! 0     1         1.0           1
//! 1     1         1.0           1
//! 2     1         1.0           1
//! 3     1         1.0           1
//! 4     1         1.0           1
//! 5     1         1.0           1
//! 6     1         1.0           1
//! 7     10000     1.0           10000
//!
//! The load sum L of the system is 1 * 7 + 1 * 10000 = 10007. This means that
//! tasks 0-6 have a load proportion of 1 / 10007 ~= 0.0001, and task 7 has a
//! load proportion of 0.9993. In other words, tasks 0-6 are entitled to
//! 0.0001 * 4 = 0.0004 CPUs worth of capacity, and task 7 is entitled to
//! 0.9993 * 4 = 3.9972 CPUs worth of capacity.
//!
//! Task 7 can only consume at most 1.0 CPU due to its duty cycle being 1.0, so
//! its weight is "infeasible" as the amount of CPU capacity it would be
//! afforded exceeds what it can actually use.
//!
//! What we instead want is to find an adjusted weight w' that we can assign to
//! Task 7 such that it gets its full duty cycle of 1.0, but allows the
//! remaining 3 cores worth of compute to be equally spread amongst the
//! remaining tasks. The task of finding this weight w' is known as the
//! "infeasible weights problem", and is solved by this crate.
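//!
//! To make this concrete with the table above (a sketch using only those
//! numbers, not this crate's API): with P = 4 CPUs, a feasible load sum
//! Lf = 7, and an infeasible duty cycle sum Di = 1.0, the adjusted weight
//! works out to w' = Lf / (P - Di) = 7 / 3:
//!
//!```rust
//! let p = 4.0_f64; // CPUs on the system
//! let lf = 7.0_f64; // load sum of the feasible tasks 0-6
//! let di = 1.0_f64; // duty cycle sum of the infeasible task 7
//!
//! // Adjusted weight w' = Lf / (P - Di)
//! let w_prime = lf / (p - di);
//! assert!((w_prime - 7.0 / 3.0).abs() < 0.0001);
//!
//! // With w', task 7's allotted capacity is exactly its duty cycle of
//! // 1.0 CPU, leaving 3.0 CPUs to be shared by tasks 0-6.
//! let l_prime = lf + w_prime * di;
//! let task7_cpus = p * (w_prime * 1.0) / l_prime;
//! assert!((task7_cpus - 1.0).abs() < 0.0001);
//!```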
//!
//! More details on load balancing and the infeasible weights problem are
//! provided in the following Google Drive document:
//!
//! <https://drive.google.com/file/d/1fAoWUlmW-HTp6akuATVpMxpUpvWcGSAv>
//!
//! Using the Crate
//! ---------------
//!
//! This crate has two primary sets of APIs:
//!
//! (1) APIs for aggregating domain loads specified by the caller
//! (2) APIs for querying those loads, after being adjusted for infeasibility
//!
//! LoadAggregator
//! --------------
//!
//! The object that addresses (1) is the LoadAggregator. This object receives
//! load passed by the user, and once all load has been received, runs the
//! numbers to create load sums and weights that are adjusted for infeasibility
//! as described above. LoadAggregator objects can be created and used as
//! follows:
//!
//! Assume we're on a 16-core (32-CPU) host with two core complexes:
//!
//!```rust
//! use scx_utils::LoadAggregator;
//! use log::info;
//!
//! // Create a LoadAggregator object, specifying the number of CPUs on the
//! // system, and whether it should only aggregate duty cycle.
//! let mut aggregator = LoadAggregator::new(32, false);
//!
//! // Domain 0, weight 1, has duty cycle 1.0.
//! aggregator.record_dom_load(0, 1, 1.0);
//!
//! // Domain 1, weight 1, has duty cycle 1.0.
//! aggregator.record_dom_load(1, 1, 1.0);
//! // ...
//! aggregator.record_dom_load(31, 1, 1.0);
//! // ...
//! aggregator.record_dom_load(63, 1, 1.0);
//!
//! // Domain 64, weight 10000, has duty cycle 1.0.
//! aggregator.record_dom_load(64, 10000, 1.0);
//!
//! // Note that it is allowed to record load for a domain more than once.
//! // For a given domain you may only record load for a specific weight
//! // once, but you may record loads for multiple different weights for a
//! // single domain.
//!
//! // Create the LoadLedger object
//! let ledger = aggregator.calculate();
//!
//! // Outputs: 66.06451612903226
//! info!("{}", ledger.global_load_sum());
//!```
//!
//! In the above example, we have 65 domains, all with duty cycle 1.0. 64 of
//! them have weight 1, and one of them has weight 10000. If there were multiple
//! tasks per domain, we would record different or additional values. For
//! example, if we had two tasks with weight 1 in domain 0, and an additional
//! task with weight 100 in domain 1, we would record their loads as follows:
//!
//!```rust,ignore
//! // Assume the same aggregator as above.
//!
//! // In this version, domain 0 has 2 tasks with weight 1 and duty cycle
//! // 1.0.
//! aggregator.record_dom_load(0, 1, 2.0);
//!
//! // In this version, domain 1 also has a task with weight 100 and duty
//! // cycle 1.0.
//! aggregator.record_dom_load(1, 100, 1.0);
//!```
//!
//! Note that the abstractions here are meant to be generic across schedulers.
//! LoadAggregator assumes nothing about the scheduling domains being passed to
//! it. For example, they may span layers defined in a scheduler, domains
//! specified by the user via cpumask strings, or domains that span L3 caches.
//!
//! LoadLedger
//! ----------
//!
//! Once you have fed all load values to the LoadAggregator, you can use it to
//! calculate load sums (including adjusting for weight infeasibility), and
//! create a read-only LoadLedger object that can be used to query the values as
//! described in (2).
//!
//! There are various values of interest that can be queried from a LoadLedger
//! object. For example, you may ask for the sum of load (adjusted for
//! infeasibility) for the whole system, or the sum of duty cycle for the whole
//! system, or the sum of load for each domain (adjusted for infeasibility):
//!
//! ```rust
//! use scx_utils::LoadAggregator;
//! use log::info;
//! let mut aggregator = LoadAggregator::new(32, false);
//! aggregator.record_dom_load(0, 1, 1.0);
//! // ...
//! aggregator.record_dom_load(63, 1, 1.0);
//! aggregator.record_dom_load(64, 10000, 1.0);
//!
//! let ledger = aggregator.calculate();
//!
//! info!("Global load sum: {}", ledger.global_load_sum());
//! info!("Global duty cycle sum: {}", ledger.global_dcycle_sum());
//!
//! let dom_dcycles = ledger.dom_dcycle_sums();
//! let dom_loads = ledger.dom_load_sums();
//! let effective_max_weight = ledger.effective_max_weight();
//!
//! // ...
//! ```

use anyhow::Result;
use anyhow::bail;
use std::collections::BTreeMap;

const MIN_WEIGHT: usize = 1;

#[derive(Debug)]
pub struct LoadLedger {
    dom_load_sums: Vec<f64>,
    dom_dcycle_sums: Vec<f64>,
    global_dcycle_sum: f64,
    global_load_sum: f64,
    effective_max_weight: f64,
}

impl LoadLedger {
    /// Return the global, host-wide sum of duty cycles.
    pub fn global_dcycle_sum(&self) -> f64 {
        self.global_dcycle_sum
    }

    /// Return the global, host-wide load sum, adjusted for infeasibility.
    pub fn global_load_sum(&self) -> f64 {
        self.global_load_sum
    }

    /// Return an array of domain duty cycle sums, indexed by ID.
    pub fn dom_dcycle_sums(&self) -> &[f64] {
        &self.dom_dcycle_sums
    }

    /// Return an array of domain load sums, indexed by ID, and adjusted for
    /// infeasibility.
    pub fn dom_load_sums(&self) -> &[f64] {
        &self.dom_load_sums
    }

    /// If applicable, return the adjusted weight for all infeasible scheduling
    /// entities.
    pub fn effective_max_weight(&self) -> f64 {
        self.effective_max_weight
    }
}

#[derive(Debug)]
struct Domain {
    loads: BTreeMap<usize, f64>,
    dcycle_sum: f64,
    load_sum: f64,
}

fn approx_eq(a: f64, b: f64) -> bool {
    (a - b).abs() <= 0.0001f64
}

fn approx_ge(a: f64, b: f64) -> bool {
    a > b || approx_eq(a, b)
}

#[derive(Debug)]
pub struct LoadAggregator {
    doms: BTreeMap<usize, Domain>,
    global_loads: BTreeMap<usize, f64>,
    nr_cpus: usize,
    max_weight: usize,
    global_dcycle_sum: f64,
    global_load_sum: f64,
    effective_max_weight: f64,
    dcycle_only: bool,
}

impl LoadAggregator {
    /// Create a LoadAggregator object that can be given domains and loads by
    /// the caller, and then used to create a LoadLedger object.
    pub fn new(nr_cpus: usize, dcycle_only: bool) -> LoadAggregator {
        LoadAggregator {
            doms: BTreeMap::new(),
            global_loads: BTreeMap::new(),
            nr_cpus,
            max_weight: 0,
            global_dcycle_sum: 0.0f64,
            global_load_sum: 0.0f64,
            effective_max_weight: 10000.0f64,
            dcycle_only,
        }
    }

    /// Given a LoadAggregator with recorded domain loads, compute the
    /// system-wide load, adjusting for infeasible weights when necessary.
    pub fn calculate(&mut self) -> LoadLedger {
        if !self.dcycle_only && approx_ge(self.max_weight as f64, self.infeasible_threshold()) {
            self.adjust_infeas_weights();
        }

        let mut dom_load_sums = Vec::new();
        let mut dom_dcycle_sums = Vec::new();

        for (_, dom) in self.doms.iter() {
            dom_load_sums.push(dom.load_sum);
            dom_dcycle_sums.push(dom.dcycle_sum);
        }

        LoadLedger {
            dom_load_sums,
            dom_dcycle_sums,
            global_dcycle_sum: self.global_dcycle_sum,
            global_load_sum: self.global_load_sum,
            effective_max_weight: self.effective_max_weight,
        }
    }

    /// Init a domain and set default load values.
    /// Does nothing if the domain already exists.
    pub fn init_domain(&mut self, dom_id: usize) {
        self.doms.entry(dom_id).or_insert(Domain {
            loads: BTreeMap::new(),
            dcycle_sum: 0.0f64,
            load_sum: 0.0f64,
        });
    }

    /// Record an instance of some domain's load (by specifying its weight and
    /// dcycle). Returns an error if duty cycle is specified more than once
    /// for a given (Domain, weight) tuple.
    pub fn record_dom_load(&mut self, dom_id: usize, weight: usize, dcycle: f64) -> Result<()> {
        if weight < MIN_WEIGHT {
            bail!(
                "weight {} is less than minimum weight {}",
                weight,
                MIN_WEIGHT
            );
        }

        let domain = self.doms.entry(dom_id).or_insert(Domain {
            loads: BTreeMap::new(),
            dcycle_sum: 0.0f64,
            load_sum: 0.0f64,
        });

        if domain.loads.insert(weight, dcycle).is_some() {
            bail!("Domain {} already had load for weight {}", dom_id, weight);
        }

        let weight_dcycle = self.global_loads.entry(weight).or_insert(0.0f64);
        *weight_dcycle += dcycle;

        let load = weight as f64 * dcycle;

        domain.dcycle_sum += dcycle;
        domain.load_sum += load;

        self.global_dcycle_sum += dcycle;
        self.global_load_sum += load;

        if weight > self.max_weight {
            self.max_weight = weight;
        }

        Ok(())
    }

    fn infeasible_threshold(&self) -> f64 {
        // If the sum of duty cycle on the system is >= P, any weight w_x of a
        // task that exceeds L / P is guaranteed to be infeasible. Furthermore,
        // if any weight w_x == L / P then we know that task t_x can get its
        // full duty cycle, as:
        //
        // c_x = P * (w_x * d_x) / L
        //     = P * (L/P * d_x) / L
        //     = (L * d_x) / L
        //     = d_x
        //
        // If there is no scheduling entity whose weight exceeds L / P that has
        // a nonzero duty cycle, then all weights are feasible and we can use
        // the data we collected above without having to adjust for
        // infeasibility. Otherwise, we have at least one infeasible weight.
        //
        // See the comment in adjust_infeas_weights() for a more comprehensive
        // description of the algorithm for adjusting for infeasible weights.
        self.global_load_sum / self.nr_cpus as f64
    }

    fn apply_infeasible_threshold(&mut self, lambda_x: f64) {
        self.effective_max_weight = lambda_x;
        self.global_load_sum = 0.0f64;
        for (_, dom) in self.doms.iter_mut() {
            dom.load_sum = 0.0f64;
            for (weight, dcycle) in dom.loads.iter() {
                let adjusted = (*weight as f64).min(lambda_x);
                let load = adjusted * dcycle;

                dom.load_sum += load;
            }
            self.global_load_sum += dom.load_sum;
        }
    }

    fn adjust_infeas_weights(&mut self) {
        // At this point we have the following data points:
        //
        // P : The number of cores on the system
        // L : The total load sum of the system before any adjustments for
        //     infeasibility
        // Lf: The load sum of all feasible scheduling entities
        // D : The total sum of duty cycles across all domains in the system
        // Di: The duty cycle sum of all infeasible tasks
        //
        // We need to find a weight lambda_x such that every infeasible
        // scheduling entity in the system is granted a CPU allocation equal
        // to its duty cycle, and the remaining compute capacity in the system
        // is divided fairly amongst the feasible entities according to their
        // load.
        //
        // If L' is the load sum on the system after clamping all weights
        // w_x > lambda_x to lambda_x, then lambda_x can be defined as follows:
        //
        // lambda_x = L' / P
        //
        // => L' = lambda_x * Di + Lf
        // => lambda_x * P = lambda_x * Di + Lf
        // => lambda_x * (P - Di) = Lf
        // => lambda_x = Lf / (P - Di)
        //
        // Thus, we need to iterate over different values of x (in descending
        // weight order) until we find a lambda_x such that:
        //
        // w_x >= lambda_x >= w_(x+1)
        //
        // Once we find a lambda_x, we need to:
        //
        // 1. Adjust the maximum weights of any w_x > lambda_x -> lambda_x
        // 2. Subtract (w_x - lambda_x) * d_x from the load sums that the
        //    infeasible entities were contributing to.
        // 3. Re-calculate the per-domain load, and the global load average.
        //
        // Note that we should always find a lambda_x at this point, as we
        // verified in the caller that there is at least one infeasible entity
        // in the system.
        //
        // All of this is described and proven in detail in the following pdf:
        //
        // <https://drive.google.com/file/d/1fAoWUlmW-HTp6akuATVpMxpUpvWcGSAv>
        let p = self.nr_cpus as f64;
        let mut curr_dcycle_sum = 0.0f64;
        let mut curr_load_sum = self.global_load_sum;
        let mut lambda_x = curr_load_sum / p;

        for (weight, dcycles) in self.global_loads.iter().rev() {
            if approx_ge(lambda_x, *weight as f64) {
                self.apply_infeasible_threshold(lambda_x);
                return;
            }

            curr_dcycle_sum += dcycles;
            curr_load_sum -= *weight as f64 * dcycles;
            lambda_x = curr_load_sum / (p - curr_dcycle_sum);
        }

        // We can fail to find an infeasible weight if the host is
        // under-utilized. In this case, just fall back to using the
        // unadjusted weights. If this is happening due to a stale system-wide
        // util value caused by the tuner not having run recently enough, the
        // condition should self-correct soon. If it is the result of the user
        // configuring us to use weights even when the system is
        // under-utilized, they were warned when the scheduler was launched.
    }
}
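
// A small sanity-check module sketching the behavior documented above; the
// numbers mirror the 65-domain example in the crate docs. This is a sketch
// assuming the usual `cargo test` flow, not part of the public API.
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn infeasible_weight_is_clamped() {
        let mut agg = LoadAggregator::new(32, false);
        // 64 domains with weight 1, plus one domain whose weight 10000 is
        // infeasible on a 32-CPU host, all at duty cycle 1.0.
        for dom in 0..64 {
            agg.record_dom_load(dom, 1, 1.0).unwrap();
        }
        agg.record_dom_load(64, 10000, 1.0).unwrap();

        let ledger = agg.calculate();

        // lambda_x = Lf / (P - Di) = 64.0 / (32.0 - 1.0)
        let lambda = 64.0f64 / 31.0;
        assert!((ledger.effective_max_weight() - lambda).abs() < 0.0001);
        assert!((ledger.global_load_sum() - (64.0 + lambda)).abs() < 0.0001);
    }

    #[test]
    fn duplicate_weight_for_domain_errors() {
        let mut agg = LoadAggregator::new(4, false);
        agg.record_dom_load(0, 1, 0.5).unwrap();
        // Recording the same (domain, weight) pair twice is an error.
        assert!(agg.record_dom_load(0, 1, 0.25).is_err());
    }
}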