// Copyright (c) Meta Platforms, Inc. and affiliates.

// This software may be used and distributed according to the terms of the
// GNU General Public License version 2.

//! # SCX Load Calculator
//!
//! A crate providing abstractions for calculating load between scheduling
//! domains.
//!
//! Load Balancing Primer
//! ---------------------
//!
//! Let's begin with a brief overview of load balancing. We'll first introduce
//! the relevant terms, and then explain what load balancing is from there:
//!
//! *Weight*
//!
//! A positive integer value representing a task's priority in the system. The
//! meaning of weight is ultimately up to the caller to determine and apply, but
//! conceptually speaking weight is a linear scaling factor of a task's
//! perceived load (defined below) in the system.
//!
//! *Duty Cycle*
//!
//! A task's duty cycle is the proportion of time that it could have used a CPU
//! in some time window. For example, say that a task does nothing but spin in a
//! tight loop for 10ms, and then sleep for 10ms. Then in a 20ms time window,
//! its duty cycle would be 0.5.
//!
//! Note that this is not the same thing as the proportion of time it's
//! _runnable_. For example, if you had two such tasks sharing a CPU, one of
//! them may wait for 10ms to use the core while the other one runs, and then
//! run for its own 10ms burst. It was runnable for 20ms, but could only
//! actually use the CPU for 10ms, so its duty cycle was 10ms / 20ms = 0.5.
//!
//! *Load*
//!
//! A scheduling entity's load l_x is simply the product of its weight and duty
//! cycle:
//!
//! l_x = w_x * d_x
//!
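//! For example, a task with weight 200 and a duty cycle of 0.5 has a load of
//! 200 * 0.5 = 100, the same load as a weight-100 task that runs continuously.
//!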
//! *Infeasible Weights*
//!
//! At a conceptual level, the goal of a load balancer is of course to balance
//! load across the system. If a scheduling entity has load l_x, and the total
//! sum of all entities' loads on the system is L, then entity x should receive
//! l_x / L proportion of the available CPU capacity on the system.
//!
//! This formulation works fine in many cases, but can break down when an
//! entity's proportion of load (and thus allotted CPU capacity) exceeds the
//! amount of CPU it can consume.
//!
//! For example, say you were on a system with 2 sets of 2-core groups (for a
//! total of 4 cores on the system), and eight tasks with the following duty
//! cycles and weights:
//!
//! ID          Weight          Duty Cycle              Load
//!
//! 0           1               1.0                     1
//! 1           1               1.0                     1
//! 2           1               1.0                     1
//! 3           1               1.0                     1
//! 4           1               1.0                     1
//! 5           1               1.0                     1
//! 6           1               1.0                     1
//! 7           10000           1.0                     10000
//!
//! The load sum L of the system is 7 * 1 + 1 * 10000 = 10007. This means that
//! tasks 0-6 each have a load proportion of 1 / 10007 ~= 0.0001, and task 7 has
//! a load proportion of 0.9993. In other words, tasks 0-6 are each entitled to
//! 0.0001 * 4 = 0.0004 CPUs worth of capacity, and task 7 is entitled to
//! 0.9993 * 4 = 3.9972 CPUs worth of capacity.
//!
//! Task 7 can consume at most 1.0 CPU due to its duty cycle being 1.0, so its
//! weight is "infeasible": the amount of CPU capacity it would be afforded
//! exceeds what it can actually use.
//!
//! What we instead want is to find an adjusted weight w' that we can assign to
//! Task 7 such that it gets its full duty cycle of 1.0, but allows the
//! remaining 3 cores worth of compute to be equally spread amongst the
//! remaining tasks. The task of finding this weight w' is known as the
//! "infeasible weights problem", and is solved by this crate.
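//!
//! For this example, w' can be derived directly: task 7 should receive exactly
//! 1.0 CPUs worth of capacity, so its share of the adjusted load sum must
//! satisfy (w' / (7 + w')) * 4 = 1.0, which gives w' = 7 / 3 ~= 2.33. Each
//! weight-1 task is then entitled to (1 / (7 + 7/3)) * 4 = 3/7 of a CPU, so
//! tasks 0-6 evenly share the remaining 3 cores.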
//!
//! More details on load balancing and the infeasible weights problem are
//! provided in the following Google Drive document:
//!
//! <https://drive.google.com/file/d/1fAoWUlmW-HTp6akuATVpMxpUpvWcGSAv>
//!
//! Using the Crate
//! ---------------
//!
//! This crate has two primary sets of APIs:
//!
//! (1) APIs for aggregating domain loads specified by the caller
//! (2) APIs for querying those loads, after being adjusted for infeasibility
//!
//! LoadAggregator
//! --------------
//!
//! The object that addresses (1) is the LoadAggregator. This object receives
//! load passed by the user, and once all load has been received, runs the
//! numbers to create load sums and weights that are adjusted for infeasibility
//! as described above. LoadAggregator objects can be created and used as
//! follows:
//!
//! Assume we're on a 16-core (32-CPU) host with two core complexes:
//!
//!```rust
//!     use scx_utils::LoadAggregator;
//!     use log::info;
//!
//!     // Create a LoadAggregator object, specifying the number of CPUs on the
//!     // system, and whether it should only aggregate duty cycle.
//!     let mut aggregator = LoadAggregator::new(32, false);
//!
//!     // Domain 0, weight 1, has duty cycle 1.0.
//!     aggregator.record_dom_load(0, 1, 1.0);
//!
//!     // Domain 1, weight 1, has duty cycle 1.0.
//!     aggregator.record_dom_load(1, 1, 1.0);
//!     // ...
//!     aggregator.record_dom_load(31, 1, 1.0);
//!     // ...
//!     aggregator.record_dom_load(63, 1, 1.0);
//!
//!     // Domain 64, weight 10000, has duty cycle 1.0.
//!     aggregator.record_dom_load(64, 10000, 1.0);
//!
//!     // Note that it is allowed to record load for a domain more than once.
//!     // For a given domain you may only record load for a specific weight
//!     // once, but you may record loads for multiple different weights for a
//!     // single domain.
//!
//!     // Create the LoadLedger object
//!     let ledger = aggregator.calculate();
//!
//!     // Prints 66.06451612903226 (assuming all 64 weight-1 domains above
//!     // were recorded).
//!     info!("{}", ledger.global_load_sum());
//!```
//!
//! In the above example, we have 65 domains, all with duty cycle 1.0. 64 of
//! them have weight 1, and one of them has weight 10000. If there were multiple
//! tasks per domain, we would record different or additional values. For
//! example, if we had two tasks with weight 1 in domain 0, and an additional
//! task with weight 100 in domain 1, we would record their loads as follows:
//!
//!```rust,ignore
//!     // Assume the same aggregator as above.
//!
//!     // In this version, domain 0 has 2 tasks with weight 1 and duty cycle
//!     // 1.0, recorded as a combined duty cycle of 2.0.
//!     aggregator.record_dom_load(0, 1, 2.0);
//!
//!     // In this version, domain 1 also has a task with weight 100 and duty
//!     // cycle 1.0.
//!     aggregator.record_dom_load(1, 100, 1.0);
//!```
//!
//! Note that the abstractions here are meant to be generic across schedulers.
//! LoadAggregator assumes nothing about the scheduling domains being passed to
//! it. For example, they may span layers defined in a scheduler, domains
//! specified by the user via cpumask strings, or domains that span L3 caches.
//!
//! LoadLedger
//! ----------
//!
//! Once you have fed all load values to the LoadAggregator, you can use it to
//! calculate load sums (including adjusting for weight infeasibility), and
//! create a read-only LoadLedger object that can be used to query the values as
//! described in (2).
//!
//! There are various values of interest that can be queried from a LoadLedger
//! object. For example, you may ask for the sum of load (adjusted for
//! infeasibility) for the whole system, or the sum of duty cycle for the whole
//! system, or the sum of load for each domain (adjusted for infeasibility):
//!
//! ```rust
//!     use scx_utils::LoadAggregator;
//!     use log::info;
//!     let mut aggregator = LoadAggregator::new(32, false);
//!     aggregator.record_dom_load(0, 1, 1.0);
//!     // ...
//!     aggregator.record_dom_load(63, 1, 1.0);
//!     aggregator.record_dom_load(64, 10000, 1.0);
//!
//!     let ledger = aggregator.calculate();
//!
//!     info!("Global load sum: {}", ledger.global_load_sum());
//!     info!("Global duty cycle sum: {}", ledger.global_dcycle_sum());
//!
//!     let dom_dcycles = ledger.dom_dcycle_sums();
//!     let dom_loads = ledger.dom_load_sums();
//!     let effective_max_weight = ledger.effective_max_weight();
//!
//!     // ...
//! ```
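//!
//! As an illustrative sketch (not an API provided by this crate), a scheduler
//! could convert these sums back into per-domain CPU allocations using the
//! l_x / L proportion described above, here assuming the same 32-CPU host:
//!
//!```rust,ignore
//!     // Assume the same ledger as above.
//!     let nr_cpus = 32.0f64;
//!     let global_load = ledger.global_load_sum();
//!
//!     // Each domain is entitled to (load / L) * P CPUs worth of capacity.
//!     let dom_cpus: Vec<f64> = ledger
//!         .dom_load_sums()
//!         .iter()
//!         .map(|load| load / global_load * nr_cpus)
//!         .collect();
//!```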

use anyhow::Result;
use anyhow::bail;
use std::collections::BTreeMap;

const MIN_WEIGHT: usize = 1;

#[derive(Debug)]
pub struct LoadLedger {
    dom_load_sums: Vec<f64>,
    dom_dcycle_sums: Vec<f64>,
    global_dcycle_sum: f64,
    global_load_sum: f64,
    effective_max_weight: f64,
}

impl LoadLedger {
    /// Return the global, host-wide sum of duty cycles.
    pub fn global_dcycle_sum(&self) -> f64 {
        self.global_dcycle_sum
    }

    /// Return the global, host-wide load sum, adjusted for infeasibility.
    pub fn global_load_sum(&self) -> f64 {
        self.global_load_sum
    }

    /// Return an array of domain duty cycle sums, indexed by ID.
    pub fn dom_dcycle_sums(&self) -> &[f64] {
        &self.dom_dcycle_sums
    }

    /// Return an array of domain load sums, indexed by ID, and adjusted for
    /// infeasibility.
    pub fn dom_load_sums(&self) -> &[f64] {
        &self.dom_load_sums
    }

    /// If applicable, return the adjusted weight for all infeasible scheduling
    /// entities.
    pub fn effective_max_weight(&self) -> f64 {
        self.effective_max_weight
    }
}

#[derive(Debug)]
struct Domain {
    loads: BTreeMap<usize, f64>,
    dcycle_sum: f64,
    load_sum: f64,
}

fn approx_eq(a: f64, b: f64) -> bool {
    (a - b).abs() <= 0.0001f64
}

fn approx_ge(a: f64, b: f64) -> bool {
    a > b || approx_eq(a, b)
}

#[derive(Debug)]
pub struct LoadAggregator {
    doms: BTreeMap<usize, Domain>,
    global_loads: BTreeMap<usize, f64>,
    nr_cpus: usize,
    max_weight: usize,
    global_dcycle_sum: f64,
    global_load_sum: f64,
    effective_max_weight: f64,
    dcycle_only: bool,
}

impl LoadAggregator {
    /// Create a LoadAggregator object that can be given domains and loads by
    /// the caller, and then used to create a LoadLedger object.
    pub fn new(nr_cpus: usize, dcycle_only: bool) -> LoadAggregator {
        LoadAggregator {
            doms: BTreeMap::new(),
            global_loads: BTreeMap::new(),
            nr_cpus,
            max_weight: 0,
            global_dcycle_sum: 0.0f64,
            global_load_sum: 0.0f64,
            effective_max_weight: 10000.0f64,
            dcycle_only,
        }
    }

    /// Given a LoadAggregator with recorded domain loads, compute the
    /// system-wide load, adjusting for infeasible weights when necessary.
    pub fn calculate(&mut self) -> LoadLedger {
        if !self.dcycle_only && approx_ge(self.max_weight as f64, self.infeasible_threshold()) {
            self.adjust_infeas_weights();
        }

        let mut dom_load_sums = Vec::new();
        let mut dom_dcycle_sums = Vec::new();

        for (_, dom) in self.doms.iter() {
            dom_load_sums.push(dom.load_sum);
            dom_dcycle_sums.push(dom.dcycle_sum);
        }

        LoadLedger {
            dom_load_sums,
            dom_dcycle_sums,
            global_dcycle_sum: self.global_dcycle_sum,
            global_load_sum: self.global_load_sum,
            effective_max_weight: self.effective_max_weight,
        }
    }

    /// Init a domain and set default load values.
    /// Does nothing if the domain already exists.
    pub fn init_domain(&mut self, dom_id: usize) {
        self.doms.entry(dom_id).or_insert(Domain {
            loads: BTreeMap::new(),
            dcycle_sum: 0.0f64,
            load_sum: 0.0f64,
        });
    }

    /// Record an instance of some domain's load (by specifying its weight and
    /// dcycle). Returns an error if the weight is below the minimum weight of
    /// 1, or if a duty cycle has already been recorded for the given
    /// (domain, weight) tuple.
    pub fn record_dom_load(&mut self, dom_id: usize, weight: usize, dcycle: f64) -> Result<()> {
        if weight < MIN_WEIGHT {
            bail!(
                "weight {} is less than minimum weight {}",
                weight,
                MIN_WEIGHT
            );
        }

        let domain = self.doms.entry(dom_id).or_insert(Domain {
            loads: BTreeMap::new(),
            dcycle_sum: 0.0f64,
            load_sum: 0.0f64,
        });

        if domain.loads.insert(weight, dcycle).is_some() {
            bail!("Domain {} already had load for weight {}", dom_id, weight);
        }

        let weight_dcycle = self.global_loads.entry(weight).or_insert(0.0f64);
        *weight_dcycle += dcycle;

        let load = weight as f64 * dcycle;

        domain.dcycle_sum += dcycle;
        domain.load_sum += load;

        self.global_dcycle_sum += dcycle;
        self.global_load_sum += load;

        if weight > self.max_weight {
            self.max_weight = weight;
        }

        Ok(())
    }

    fn infeasible_threshold(&self) -> f64 {
        // If the sum of duty cycles on the system is >= P, any weight w_x of a
        // task that exceeds L / P is guaranteed to be infeasible. Furthermore,
        // if any weight w_x == L / P then we know that task t_x can get its
        // full duty cycle, as:
        //
        // c_x = P * (w_x * d_x) / L
        //     = P * ((L / P) * d_x) / L
        //     = (L * d_x) / L
        //     = d_x
        //
        // If there is no scheduling entity whose weight exceeds L / P that has
        // a nonzero duty cycle, then all weights are feasible and we can use
        // the data we collected above without having to adjust for
        // infeasibility. Otherwise, we have at least one infeasible weight.
        //
        // See the comment in adjust_infeas_weights() for a more comprehensive
        // description of the algorithm for adjusting for infeasible weights.
        self.global_load_sum / self.nr_cpus as f64
    }

    fn apply_infeasible_threshold(&mut self, lambda_x: f64) {
        self.effective_max_weight = lambda_x;
        self.global_load_sum = 0.0f64;
        for (_, dom) in self.doms.iter_mut() {
            dom.load_sum = 0.0f64;
            for (weight, dcycle) in dom.loads.iter() {
                let adjusted = (*weight as f64).min(lambda_x);
                let load = adjusted * dcycle;

                dom.load_sum += load;
            }
            self.global_load_sum += dom.load_sum;
        }
    }

    fn adjust_infeas_weights(&mut self) {
        // At this point we have the following data points:
        //
        // P : The number of CPUs on the system
        // L : The total load sum of the system before any adjustments for
        //     infeasibility
        // Lf: The load sum of all feasible scheduling entities
        // D : The total sum of duty cycles across all domains in the system
        // Di: The duty cycle sum of all infeasible tasks
        //
        // Our goal is to find a weight lambda_x such that every infeasible
        // scheduling entity is granted a CPU allocation equal to its duty
        // cycle, and the remaining compute capacity on the system is divided
        // fairly amongst the feasible entities according to their load.
        //
        // If L' is the load sum on the system after clamping all weights
        // w_x > lambda_x to lambda_x, then lambda_x can be defined as follows:
        //
        // lambda_x = L' / P
        //
        // => L'                  = lambda_x * Di + Lf
        // => lambda_x * P        = lambda_x * Di + Lf
        // => lambda_x * (P - Di) = Lf
        // => lambda_x            = Lf / (P - Di)
        //
        // Thus, we need to iterate over different values of x until we find a
        // lambda_x such that:
        //
        //      w_x >= lambda_x >= w_(x+1)
        //
        // Once we find a lambda_x, we need to:
        //
        // 1. Clamp every weight w_i > lambda_x down to lambda_x.
        // 2. Subtract (w_i - lambda_x) * d_i from the load sums that the
        //    infeasible entities were contributing to.
        // 3. Re-calculate the per-domain load sums and the global load sum.
        //
        // Note that we will normally find a lambda_x at this point, as the
        // caller verified that at least one weight exceeds the initial L / P
        // threshold. The exception is an under-utilized host, in which case
        // the loop below falls through and no adjustment is applied (see the
        // comment after the loop).
        //
        // All of this is described and proven in detail in the following pdf:
        //
        // <https://drive.google.com/file/d/1fAoWUlmW-HTp6akuATVpMxpUpvWcGSAv>
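        //
        // As a concrete trace, consider the example from the crate docs above:
        // 32 CPUs, 64 entities with weight 1 and one entity with weight 10000,
        // all with duty cycle 1.0. L = 10064, so lambda_x starts at
        // 10064 / 32 = 314.5. That is below 10000, so weight 10000 is
        // infeasible; dropping its load and duty cycle leaves
        // curr_load_sum = 64 and curr_dcycle_sum = 1.0, giving
        // lambda_x = 64 / 31 ~= 2.06. That is >= the next weight (1), so
        // 64 / 31 is applied as the clamped weight below.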
        let p = self.nr_cpus as f64;
        let mut curr_dcycle_sum = 0.0f64;
        let mut curr_load_sum = self.global_load_sum;
        let mut lambda_x = curr_load_sum / p;

        for (weight, dcycles) in self.global_loads.iter().rev() {
            if approx_ge(lambda_x, *weight as f64) {
                self.apply_infeasible_threshold(lambda_x);
                return;
            }

            curr_dcycle_sum += dcycles;
            curr_load_sum -= *weight as f64 * dcycles;
            lambda_x = curr_load_sum / (p - curr_dcycle_sum);
        }

        // We can fail to find an infeasible weight if the host is
        // under-utilized. In this case, just fall back to using the
        // unadjusted weights. If this is happening because of a stale
        // system-wide utilization value (e.g. the tuner not having run
        // recently enough), it is a condition that should self-correct soon.
        // If it is the result of the user configuring us to use weights even
        // when the system is under-utilized, they were warned when the
        // scheduler was launched.
    }
}
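
// A minimal sanity check sketching the 32-CPU example from the module docs:
// 64 domains of weight 1 and one domain of weight 10000, all with duty cycle
// 1.0. The expected values follow from the clamped weight lambda_x = 64 / 31
// derived in adjust_infeas_weights(). The module and test names here are
// illustrative, not part of any existing test suite.
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn infeasible_weight_is_clamped() {
        let mut aggregator = LoadAggregator::new(32, false);
        for dom in 0..64 {
            aggregator.record_dom_load(dom, 1, 1.0).unwrap();
        }
        aggregator.record_dom_load(64, 10000, 1.0).unwrap();

        let ledger = aggregator.calculate();

        // Weight 10000 is infeasible and gets clamped to lambda_x = 64 / 31.
        assert!(approx_eq(ledger.effective_max_weight(), 64.0 / 31.0));
        // The adjusted global load sum is 64 + 64 / 31 ~= 66.06.
        assert!(approx_eq(ledger.global_load_sum(), 64.0 + 64.0 / 31.0));
        // Duty cycles are never adjusted for infeasibility.
        assert!(approx_eq(ledger.global_dcycle_sum(), 65.0));

        // Recording a second duty cycle for an existing (domain, weight)
        // pair is an error.
        assert!(aggregator.record_dom_load(0, 1, 1.0).is_err());
    }
}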