§Rusty load balancer
The module that includes logic for performing load balancing in the scx_rusty scheduler.
§Load Balancing
scx_rusty performs load balancing using the following general workflow:
- Determine domain load averages from the duty cycle buckets in the dom_ctx_map_elem map, aggregate the load using scx_utils::LoadCalculator, and then determine the load distribution (accounting for infeasible weights) using the scx_utils::LoadLedger object.
- Create a hierarchy representing load using NumaNode and Domain objects as follows:
```text
                 o--------------------------------o
                 |            LB Root             |
                 |                                |
                 | PushNodes:     <Load, NumaNode>|
                 | PullNodes:     <Load, NumaNode>|
                 | BalancedNodes: <Load, NumaNode>|
                 o----------------o---------------o
                                  |
             o--------------------o--------------------o
             |                                         |
o------------o---------------o   ...   o---------------o------------o
|          NumaNode          |         |          NumaNode          |
|            ID 0            |         |            ID 1            |
| PushDomains <Load, Domain> |         | PushDomains <Load, Domain> |
| PullDomains <Load, Domain> |         | PullDomains <Load, Domain> |
| BalancedDomains <Domain>   |         | BalancedDomains <Domain>   |
| LoadSum f64                |         | LoadSum f64                |
| LoadAvg f64                |         | LoadAvg f64                |
| LoadImbal f64              |         | LoadImbal f64              |
| BalanceCost f64            |         | BalanceCost f64            |
| ...                        |         | ...                        |
o------------o---------------o         o----------------------------o
             |
      o------o--------------------o
      |                           |
o-----o--------------o   ...   o--o-----------------o
|       Domain       |         |       Domain       |
|        ID 0        |         |        ID 1        |
| Tasks <Load, Task> |         | Tasks <Load, Task> |
| LoadSum f64        |         | LoadSum f64        |
| LoadAvg f64        |         | LoadAvg f64        |
| LoadImbal f64      |         | LoadImbal f64      |
| BalanceCost f64    |         | BalanceCost f64    |
| ...                |         | ...                |
o-----o--------------o         o--------------------o
      |
   o--o----------------o
   |                   |
o--o-------------o   o-o--------------o
|      Task      |...|      Task      |
|     PID 0      |   |     PID 1      |
| Load f64       |   | Load f64       |
| Migrated bool  |   | Migrated bool  |
| IsKworker bool |   | IsKworker bool |
o----------------o   o----------------o
```
As mentioned above, the hierarchy is created by querying BPF for each domain’s duty cycle, and using the infeasible.rs crate to determine load averages and load sums for each domain.
- From the LB Root, we begin by iterating over all NUMA nodes and migrating load from nodes with an excess of load (push nodes) to nodes with a deficit of load (pull nodes). The cost of migrations here is higher than when migrating load between domains within a node. Ultimately, migrations are performed by moving tasks between domains. The difference in this step is that imbalances are first addressed by moving tasks between NUMA nodes, and that such migrations only take place when imbalances are sufficiently high to warrant them.
- Once load has been migrated between NUMA nodes, we iterate over each NUMA node and migrate load between the domains inside it. The cost of migrations here is lower than between NUMA nodes. As with load balancing between NUMA nodes, migrations here amount to moving tasks between domains.
The load hierarchy is always created when load_balance() is called on a LoadBalancer object, but actual load balancing is only performed if the balance_load option is specified.
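The two-phase workflow above can be sketched in miniature. The following is a hypothetical, simplified illustration of the push/pull idea, not scx_rusty's actual implementation: the `balance` function and its thresholds are invented for this example, and the real scheduler moves individual tasks with an associated migration cost rather than transferring raw load values.

```rust
// Hypothetical sketch: equalize a slice of load values toward their average,
// but only act on entries whose excess exceeds `threshold`. This mirrors the
// idea that cross-node migrations (phase 1) demand a larger imbalance than
// intra-node migrations (phase 2) before they are worth performing.
fn balance(loads: &mut [f64], threshold: f64) {
    let avg = loads.iter().sum::<f64>() / loads.len() as f64;
    for i in 0..loads.len() {
        // A "push" entity carries load sufficiently above the average.
        let excess = loads[i] - avg;
        if excess > threshold {
            for j in 0..loads.len() {
                // A "pull" entity sits below the average and can absorb load.
                let deficit = avg - loads[j];
                if deficit > 0.0 {
                    let xfer = excess.min(deficit);
                    loads[i] -= xfer;
                    loads[j] += xfer;
                    break;
                }
            }
        }
    }
}

fn main() {
    // Phase 1: balance between NUMA nodes. Cross-node migrations are more
    // expensive, so only act on large imbalances (higher threshold).
    let mut node_loads = vec![10.0, 2.0];
    balance(&mut node_loads, 1.0);
    println!("{:?}", node_loads); // prints [6.0, 6.0]

    // Phase 2: within each node, balance between its domains. Intra-node
    // migrations are cheaper, so a finer threshold is acceptable.
    let mut dom_loads = vec![3.0, 1.0];
    balance(&mut dom_loads, 0.1);
    println!("{:?}", dom_loads); // prints [2.0, 2.0]
}
```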
§Statistics
After load balancing has occurred, statistics may be queried by invoking the get_stats() function on the LoadBalancer object:
```rust
let lb = LoadBalancer::new(...)?;
lb.load_balance()?;
let stats = lb.get_stats();
...
```
Statistics are exported as a vector of NumaStat objects, each of which contains load balancing statistics for that NUMA node, as well as statistics for any Domains contained therein as DomainStats objects.
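To make the shape of that vector concrete, here is a hypothetical, heavily simplified sketch; the real NumaStat and DomainStats types in scx_rusty carry more fields than shown, and the field names here are illustrative assumptions.

```rust
// Hypothetical, simplified stand-ins for the exported statistics types.
#[derive(Debug)]
struct DomainStats {
    id: usize,
    load: f64,  // this domain's load sum for the round
    imbal: f64, // remaining imbalance relative to the average
}

#[derive(Debug)]
struct NumaStat {
    id: usize,
    load: f64,
    imbal: f64,
    domains: Vec<DomainStats>, // per-domain stats for this node
}

// Walking the vector returned by get_stats() might then look like:
fn print_stats(stats: &[NumaStat]) {
    for node in stats {
        println!("NODE[{:02}] load={:.2} imbal={:.2}", node.id, node.load, node.imbal);
        for dom in &node.domains {
            println!("  DOM[{:02}] load={:.2} imbal={:.2}", dom.id, dom.load, dom.imbal);
        }
    }
}

fn main() {
    let stats = vec![NumaStat {
        id: 0,
        load: 4.0,
        imbal: 0.0,
        domains: vec![DomainStats { id: 0, load: 4.0, imbal: 0.0 }],
    }];
    print_stats(&stats);
}
```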
§Future Improvements
There are a few ways that we could further improve the implementation here:
- The logic for load balancing between NUMA nodes, and load balancing within a specific NUMA node (i.e. between domains in that NUMA node), could probably be improved to avoid code duplication using traits and/or generics.
- When deciding whether to migrate a task, we’re only looking at its impact on addressing load imbalances. In reality, this is a complex, multivariate cost function. For example, a domain whose load is low enough that it should pull more load perhaps should not do so if it’s running tasks that are much better suited to isolation. Or, a domain may not want to push a task to another domain if that task is co-located with other tasks that benefit from shared L3 cache locality. Coming up with an extensible and clean way to model and implement this is likely a large project in itself.
- We’re not accounting for cgroups when performing load balancing.