SLA
Assert that your workflows meet SLAs.
What is an SLA
A Service Level Agreement (SLA) is a core property of a flow that defines a behavior
to trigger if the flow runs too long or fails to meet the defined assertion.
SLA types
Currently, Kestra supports the following SLA types:
- MAX_DURATION — the maximum allowed execution duration before the SLA is breached
- EXECUTION_ASSERTION — an assertion defined by a Pebble expression that must be met during the execution. If the assertion doesn’t hold true, the SLA is breached.
How to use SLAs
SLAs are defined using the sla property at the root of a flow, and they declare the desired state that must be met during executions of the flow.
MAX_DURATION
If a workflow execution exceeds the expected duration, an SLA can trigger corrective actions, such as cancelling the execution.
The following SLA cancels an execution if it takes more than 8 hours:
id: sla_example
namespace: company.team

sla:
  - id: maxDuration
    type: MAX_DURATION
    duration: PT8H
    behavior: CANCEL
    labels:
      sla: miss
      reason: durationExceeded

tasks:
  - id: punctual
    type: io.kestra.plugin.core.log.Log
    message: Workflow started, monitoring SLA compliance

  - id: sleepyhead
    type: io.kestra.plugin.core.flow.Sleep
    duration: PT9H

  - id: never_executed_task
    type: io.kestra.plugin.core.log.Log
    message: This task will never start because the SLA was breached
EXECUTION_ASSERTION
An SLA can also be based on an assertion that must hold true during execution. If the assertion fails, the SLA is breached.
The following SLA fails if the output of mytask is not equal to "expected output":
id: sla_demo
namespace: company.team

sla:
  - id: assert_output
    type: EXECUTION_ASSERTION
    assert: "{{ outputs.mytask.value == 'expected output' }}"
    behavior: FAIL
    labels:
      sla: miss
      reason: outputMismatch

tasks:
  - id: mytask
    type: io.kestra.plugin.core.debug.Return
    format: expected output
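Because sla is declared as a list, a single flow can presumably carry more than one SLA. The sketch below combines the two types shown so far. The flow id sla_combined and the PT1H threshold are hypothetical values chosen for illustration, and it assumes each entry in the list is evaluated independently.

```yaml
id: sla_combined  # hypothetical flow id
namespace: company.team

sla:
  # Assumed to be evaluated independently of the assertion below
  - id: maxDuration
    type: MAX_DURATION
    duration: PT1H  # illustrative threshold
    behavior: CANCEL
    labels:
      sla: miss
      reason: durationExceeded

  - id: assert_output
    type: EXECUTION_ASSERTION
    assert: "{{ outputs.mytask.value == 'expected output' }}"
    behavior: FAIL
    labels:
      sla: miss
      reason: outputMismatch

tasks:
  - id: mytask
    type: io.kestra.plugin.core.debug.Return
    format: expected output
```

Giving each SLA its own reason label makes it possible to tell from the execution's labels which SLA was breached.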
SLA behavior
The behavior property of an SLA defines the action to take when the SLA is breached. The following behaviors are supported:
- CANCEL — cancels the execution
- FAIL — fails the execution
- NONE — logs a message
In addition, each breached SLA can set labels that can be used to filter executions or trigger follow-up actions.
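As an illustration of the NONE behavior combined with labels, here is a minimal sketch of a "soft deadline". The flow id, SLA id, and durations are hypothetical. Based on the description above, the breach should only be logged and labeled, so the execution is expected to keep running rather than be cancelled or failed.

```yaml
id: sla_log_only  # hypothetical flow id
namespace: company.team

sla:
  - id: softDeadline  # hypothetical SLA id
    type: MAX_DURATION
    duration: PT30M  # illustrative threshold
    behavior: NONE  # logs a message instead of cancelling or failing
    labels:
      sla: miss
      reason: durationExceeded

tasks:
  - id: long_running
    type: io.kestra.plugin.core.flow.Sleep
    duration: PT45M  # deliberately longer than the SLA duration
```

The sla: miss label can then be used to filter for executions that ran over the deadline without interrupting them.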
Alerts on SLA breaches
To receive a Slack alert when an SLA is breached, use a Flow trigger that reacts to cancelled or failed executions labeled with sla: miss:
id: sla_miss_alert
namespace: system

tasks:
  - id: send_alert
    type: io.kestra.plugin.notifications.slack.SlackIncomingWebhook
    url: "{{secret('SLACK_WEBHOOK')}}"
    messageText: "SLA breached for flow `{{trigger.namespace}}.{{trigger.flowId}}` with ID `{{trigger.executionId}}`"

triggers:
  - id: alert_on_failure
    type: io.kestra.plugin.core.trigger.Flow
    labels:
      sla: miss
    states:
      - FAILED
      - WARNING
      - CANCELLED
Best practice: Use labels with SLAs to track SLA breaches across environments, and pair them with alerting or monitoring flows for proactive response.