pub(crate) struct Opts {
pub(crate) slice_us: u64,
pub(crate) max_exec_us: u64,
pub(crate) interval: f64,
pub(crate) no_load_frac_limit: bool,
pub(crate) exit_dump_len: u32,
pub(crate) verbose: u8,
pub(crate) disable_topology: Option<bool>,
pub(crate) xnuma_preemption: bool,
pub(crate) monitor_disable: bool,
pub(crate) example: Option<String>,
pub(crate) layer_preempt_weight_disable: f64,
pub(crate) layer_growth_weight_disable: f64,
pub(crate) stats: Option<f64>,
pub(crate) monitor: Option<f64>,
pub(crate) run_example: bool,
pub(crate) local_llc_iteration: bool,
pub(crate) lo_fb_wait_us: u64,
pub(crate) lo_fb_share: f64,
pub(crate) disable_antistall: bool,
pub(crate) antistall_sec: u64,
pub(crate) enable_gpu_support: bool,
pub(crate) gpu_kprobe_level: u64,
pub(crate) netdev_irq_balance: bool,
pub(crate) disable_queued_wakeup: bool,
pub(crate) disable_percpu_kthread_preempt: bool,
pub(crate) help_stats: bool,
pub(crate) specs: Vec<String>,
}
scx_layered: A highly configurable multi-layer sched_ext scheduler
scx_layered allows classifying tasks into multiple layers and applying different scheduling policies to them. The configuration is specified in JSON and composed of two parts - matches and policies.
Matches
Whenever a task is forked or its attributes are changed, the task goes through a series of matches to determine the layer it belongs to. A match set is composed of OR groups of AND blocks. An example:
"matches": [
    [
        { "CgroupPrefix": "system.slice/" }
    ],
    [
        { "CommPrefix": "fbagent" },
        { "NiceAbove": 0 }
    ]
],
The outer array contains the OR groups and the inner AND blocks, so the above matches:
- Tasks which are in the cgroup sub-hierarchy under "system.slice".
- Or tasks whose comm starts with "fbagent" and have a nice value > 0.
Currently, the following matches are supported:
- CgroupPrefix: Matches the prefix of the cgroup that the task belongs to. As this is a string match, whether the pattern has the trailing '/' makes a difference. For example, "TOP/CHILD/" only matches tasks which are under that particular cgroup while "TOP/CHILD" also matches tasks under "TOP/CHILD0/" or "TOP/CHILD1/".
- CommPrefix: Matches the task's comm prefix.
- PcommPrefix: Matches the task's thread group leader's comm prefix.
- NiceAbove: Matches if the task's nice value is greater than the pattern.
- NiceBelow: Matches if the task's nice value is smaller than the pattern.
- NiceEquals: Matches if the task's nice value is exactly equal to the pattern.
- UIDEquals: Matches if the task's effective user id matches the value.
- GIDEquals: Matches if the task's effective group id matches the value.
- PIDEquals: Matches if the task's pid matches the value.
- PPIDEquals: Matches if the task's ppid matches the value.
- TGIDEquals: Matches if the task's tgid matches the value.
- NSPIDEquals: Matches if the task's namespace id and pid match the values.
- NSEquals: Matches if the task's namespace id matches the value.
- IsGroupLeader: Bool. When true, matches if the task is the group leader (i.e. PID == TGID), aka the thread from which other threads are made. When false, matches if the task is not the group leader (i.e. the rest).
- CmdJoin: Matches when the task uses pthread_setname_np to send a join/leave command to the scheduler. See examples/cmdjoin.c for more details.
- UsedGpuTid: Bool. When true, matches tasks which have used GPUs, identified by tid.
- UsedGpuPid: Bool. When true, matches tasks which have used GPUs, identified by tgid/pid.
While there are complexity limitations as the matches are performed in BPF, it is straightforward to add more types of matches.
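For illustration, here is a hedged sketch of a match set combining several of the kinds above (the comm prefix and UID values are made up, not taken from the upstream examples):

"matches": [
    [
        { "PcommPrefix": "postgres" },
        { "NiceBelow": 0 }
    ],
    [
        { "UIDEquals": 1000 }
    ]
],

This would match tasks whose thread group leader's comm starts with "postgres" and whose nice value is below 0, or any task whose effective user id is 1000.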
Policies
The following is an example policy configuration for a layer.
"kind": { "Confined": { "cpus_range": [1, 8], "util_range": [0.8, 0.9] } }
It’s of “Confined” kind, which tries to concentrate the layer’s tasks into a limited number of CPUs. In the above case, the number of CPUs assigned to the layer is scaled between 1 and 8 so that the per-cpu utilization is kept between 80% and 90%. If the CPUs are loaded higher than 90%, more CPUs are allocated to the layer. If the utilization drops below 80%, the layer loses CPUs.
Currently, the following policy kinds are supported:
- Confined: Tasks are restricted to the allocated CPUs. The number of CPUs allocated is modulated to keep the per-CPU utilization in "util_range". The range can optionally be restricted with the "cpus_range" property.
- Grouped: Similar to Confined but tasks may spill outside if there are idle CPUs outside the allocated ones. The range can optionally be restricted with the "cpus_range" property.
- Open: Prefer the CPUs which are not occupied by Confined or Grouped layers. Tasks in this group will spill into occupied CPUs if there are no unoccupied idle CPUs.
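As a sketch of the other two kinds, assuming Grouped accepts the same "cpus_range" and "util_range" properties as Confined and that Open takes an empty property block (verify against the example generated with -e):

"kind": { "Grouped": { "cpus_range": [4, 16], "util_range": [0.6, 0.8] } }
"kind": { "Open": {} }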
All layers take the following options:
- min_exec_us: Minimum execution time in microseconds. Whenever a task is scheduled in, this is the minimum CPU time that it's charged no matter how short the actual execution time may be.
- yield_ignore: Yield ignore ratio. If 0.0, yield(2) forfeits a whole execution slice. 0.25 yields three quarters of an execution slice and so on. If 1.0, yield is completely ignored.
- slice_us: Scheduling slice duration in microseconds.
- fifo: Use FIFO queues within the layer instead of the default vtime.
- preempt: If true, tasks in the layer will preempt tasks which belong to other non-preempting layers when no idle CPUs are available.
- preempt_first: If true, tasks in the layer will try to preempt tasks in their previous CPUs before trying to find idle CPUs.
- exclusive: If true, tasks in the layer will occupy the whole core. The other logical CPUs sharing the same core will be kept idle. This isn't a hard guarantee, so don't depend on it for security purposes.
- allow_node_aligned: Put node-aligned tasks on layer DSQs instead of the lo fallback. This is a hack to support node-affine tasks without making the whole scheduler node aware and should only be used with open layers on non-saturated machines to avoid possible stalls.
- prev_over_idle_core: On SMT enabled systems, prefer using the same CPU when picking a CPU for tasks on this layer, even if that CPU's SMT sibling is processing a task.
- weight: Weight of the layer, which is a range from 1 to 10000 with a default of 100. Layer weights are used during contention to prevent starvation across layers. Weights are used in combination with utilization to determine the infeasible adjusted weight, with higher weights having a larger adjustment in adjusted utilization.
- disallow_open_after_us: Duration to wait after the machine reaches saturation before confining tasks in Open layers.
- cpus_range_frac: Array of 2 floats between 0 and 1.0. Lower and upper bound fractions of all CPUs to give to a layer. Mutually exclusive with cpus_range.
- disallow_preempt_after_us: Duration to wait after the machine reaches saturation before disallowing tasks in the layer from preempting.
- xllc_mig_min_us: Skip cross-LLC migrations if the task is likely to run on its existing LLC sooner than this.
- idle_smt: ***DEPRECATED***
- growth_algo: When a layer is allocated new CPUs, different algorithms can be used to determine which CPU should be allocated next. The default algorithm is a "sticky" algorithm that attempts to spread layers evenly across cores.
- perf: CPU performance target. 0 means no configuration. A value between 1 and 1024 indicates the performance level CPUs running tasks in this layer are configured to using scx_bpf_cpuperf_set().
- idle_resume_us: Sets the idle resume QoS value. CPU idle time governors are expected to regard the minimum of the global (effective) CPU latency limit and the effective resume latency constraint for the given CPU as the upper limit for the exit latency of the idle states. See the latest kernel docs for more details: https://www.kernel.org/doc/html/latest/admin-guide/pm/cpuidle.html
- nodes: If set, the layer will use the set of NUMA nodes for scheduling decisions. If unset, all available NUMA nodes will be used. If the llcs value is set, the cpuset of NUMA nodes will be or'ed with the LLC config.
- llcs: If set, the layer will use the set of LLCs (last level caches) for scheduling decisions. If unset, all LLCs will be used. If the nodes value is set, the cpuset of LLCs will be or'ed with the nodes config.
Similar to matches, adding new policies and extending existing ones should be relatively straightforward.
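Putting a kind and a few of the above options together, a single layer spec might look like the following sketch. It assumes the common options nest inside the kind block alongside "cpus_range" and "util_range", as in the example generated with -e; all values are illustrative:

{
    "name": "batch",
    "matches": [
        [ { "NiceAbove": 0 } ]
    ],
    "kind": {
        "Confined": {
            "cpus_range": [1, 8],
            "util_range": [0.8, 0.9],
            "min_exec_us": 1000,
            "slice_us": 20000,
            "weight": 100
        }
    }
}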
Configuration example and running scx_layered
An scx_layered config is composed of layer configs. A layer config is composed of a name, a set of matches, and a policy block. Running the following will write an example configuration into example.json.
$ scx_layered -e example.json
Note that the last layer in the configuration must have an empty match set as a catch-all for tasks which haven’t been matched into previous layers.
The configuration can be specified in multiple JSON files and command line arguments, which are concatenated in the specified order. Each must contain valid layer configurations.
By default, an argument to scx_layered is interpreted as a JSON string. If the argument is a path to a JSON file, it should be prefixed with file: or f: as follows:
$ scx_layered file:example.json
$ scx_layered f:example.json
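To illustrate the catch-all requirement, the final layer of a config might look like the sketch below. It assumes a single empty AND block serves as the empty match set and that Open takes an empty property block; check the example generated with -e for the exact shape:

{
    "name": "normal",
    "matches": [ [] ],
    "kind": { "Open": {} }
}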
Monitoring Statistics
Run with --stats INTERVAL to enable stats monitoring. There is also an scx_stat server listening on /var/run/scx/root/stat that can be monitored by running scx_layered --monitor INTERVAL separately.
$ scx_layered --monitor 1
tot= 117909 local=86.20 open_idle= 0.21 affn_viol= 1.37 proc=6ms
busy= 34.2 util= 1733.6 load= 21744.1 fallback_cpu= 1
batch : util/frac= 11.8/ 0.7 load/frac= 29.7: 0.1 tasks= 2597
tot= 3478 local=67.80 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
cpus= 2 [ 2, 2] 04000001 00000000
immediate: util/frac= 1218.8/ 70.3 load/frac= 21399.9: 98.4 tasks= 1107
tot= 68997 local=90.57 open_idle= 0.26 preempt= 9.36 affn_viol= 0.00
cpus= 50 [ 50, 50] fbfffffe 000fffff
normal : util/frac= 502.9/ 29.0 load/frac= 314.5: 1.4 tasks= 3512
tot= 45434 local=80.97 open_idle= 0.16 preempt= 0.00 affn_viol= 3.56
cpus= 50 [ 50, 50] fbfffffe 000fffff
Global statistics: see SysStats
Per-layer statistics: see LayerStats
Fields
slice_us: u64
Scheduling slice duration in microseconds.
max_exec_us: u64
Maximum consecutive execution time in microseconds. A task may be allowed to keep executing on a CPU for this long. Note that this is the upper limit and a task may have to be moved off the CPU earlier. 0 indicates default - 20 * slice_us.
interval: f64
Scheduling interval in seconds.
no_load_frac_limit: bool
***DEPRECATED*** Disable load-fraction based max layer CPU limit.
exit_dump_len: u32
Exit debug dump buffer length. 0 indicates default.
verbose: u8
Enable verbose output, including libbpf details. Specify multiple times to increase verbosity.
disable_topology: Option<bool>
Disable topology awareness. When enabled, the “nodes” and “llcs” settings on a layer are ignored. Defaults to false on topologies with multiple NUMA nodes or LLCs, and true otherwise.
xnuma_preemption: bool
Enable cross NUMA preemption.
monitor_disable: bool
Disable monitor
example: Option<String>
Write example layer specifications into the file and exit.
layer_preempt_weight_disable: f64
***DEPRECATED*** Disables preemption if the weighted load fraction of a layer (load_frac_adj) exceeds the threshold. The default is disabled (0.0).
layer_growth_weight_disable: f64
***DEPRECATED*** Disables layer growth if the weighted load fraction of a layer (load_frac_adj) exceeds the threshold. The default is disabled (0.0).
stats: Option<f64>
Enable stats monitoring with the specified interval.
monitor: Option<f64>
Run in stats monitoring mode with the specified interval. Scheduler is not launched.
run_example: bool
Run with example layer specifications (useful for e.g. CI pipelines)
local_llc_iteration: bool
***DEPRECATED*** Enables iteration over local LLCs first for dispatch.
lo_fb_wait_us: u64
Low priority fallback DSQs are used to execute tasks with custom CPU affinities. These DSQs are immediately executed iff a CPU is otherwise idle. However, after the specified wait, they are guaranteed up to --lo-fb-share fraction of each CPU.
lo_fb_share: f64
The fraction of CPU time guaranteed to low priority fallback DSQs. See --lo-fb-wait-us.
disable_antistall: bool
Disable antistall
antistall_sec: u64
Maximum task runnable_at delay (in seconds) before antistall turns on
enable_gpu_support: bool
Enable gpu support
gpu_kprobe_level: u64
GPU kprobe level. The value set here determines how aggressive the kprobes enabled on GPU driver functions are. Higher values are more aggressive, incurring more system overhead while identifying PIDs using GPUs more accurately and in a more timely manner. Lower values incur less system overhead, at the cost of less accurately identifying GPU PIDs and taking longer to do so.
netdev_irq_balance: bool
Enable netdev IRQ balancing. This is experimental and should be used with caution.
disable_queued_wakeup: bool
Disable queued wakeup optimization.
disable_percpu_kthread_preempt: bool
Per-cpu kthreads are preempting by default. Make it not so.
help_stats: bool
Show descriptions for statistics.
specs: Vec<String>
Layer specification. See --help.
Trait Implementations

impl Args for Opts

fn group_id() -> Option<Id>
Report the ArgGroup::id for this set of arguments.

fn augment_args(__clap_app: Command) -> Command

fn augment_args_for_update(__clap_app: Command) -> Command
Append to Command so it can instantiate self via FromArgMatches::update_from_arg_matches_mut.

impl FromArgMatches for Opts

fn from_arg_matches(__clap_arg_matches: &ArgMatches) -> Result<Self, Error>

fn from_arg_matches_mut(__clap_arg_matches: &mut ArgMatches) -> Result<Self, Error>

fn update_from_arg_matches(&mut self, __clap_arg_matches: &ArgMatches) -> Result<(), Error>
Assign values from ArgMatches to self.

fn update_from_arg_matches_mut(&mut self, __clap_arg_matches: &mut ArgMatches) -> Result<(), Error>
Assign values from ArgMatches to self.

impl Parser for Opts

fn parse_from<I, T>(itr: I) -> Self

fn try_parse_from<I, T>(itr: I) -> Result<Self, Error>

fn update_from<I, T>(&mut self, itr: I)

fn try_update_from<I, T>(&mut self, itr: I) -> Result<(), Error>
Auto Trait Implementations
impl Freeze for Opts
impl RefUnwindSafe for Opts
impl Send for Opts
impl Sync for Opts
impl Unpin for Opts
impl UnwindSafe for Opts
Blanket Implementations

impl<T> BorrowMut<T> for T where T: ?Sized
impl<T> Conv for T
impl<T> FmtForward for T
impl<T> Instrument for T
impl<T> IntoEither for T
impl<T> Pipe for T where T: ?Sized
impl<T> Pointable for T
impl<T> Tap for T