Current behavior (today)
| Action | Supported? |
|---|---|
| Pause / Resume | Yes, only while run.status is paused |
| Retry same run (run_id) | No, failed runs stay terminal |
| Start new run | Yes, full re-execution with a new run_id |
| Auto retry_count on step | Transient errors only (timeout, rate limit), not “fix credential and continue” |
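The retry_count row can be illustrated with a minimal sketch. It assumes a step is a plain callable and uses two hypothetical exception classes, TransientError and ConfigError, to stand in for the platform's error classes; none of these names come from the platform itself.

```python
class TransientError(Exception):
    """Hypothetical marker for retryable errors (timeout, rate limit)."""


class ConfigError(Exception):
    """Hypothetical marker for errors an operator must fix (e.g. bad credential)."""


def run_step(step, retry_count):
    """Call step(); retry only on TransientError, at most retry_count times."""
    retries_used = 0
    while True:
        try:
            return step()
        except TransientError:
            if retries_used >= retry_count:
                raise  # retries exhausted: the step fails
            retries_used += 1
        # Any other exception (e.g. ConfigError) propagates immediately,
        # so a wrong credential is never retried automatically.
```

In this sketch a misconfigured step fails on its first attempt regardless of retry_count, which matches the "transient errors only" rule in the table.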
What to do after a failure
- Open the failed run in Workflow Studio or Command Center.
- Find the step marked Failed and read the error (often LLM_REQUEST_FAILED or MCP_TOOL_FAILED).
- Fix the root cause: model, provider credential, MCP binding, tool args, or graph logic.
- Publish a new workflow version if the graph changed.
- Start a new run — do not expect the old run to resume.
Why Cancelled and Blocked appear
On fail-fast runs, the platform marks orphan steps honestly:
| Badge | Meaning |
|---|---|
| Failed | Step that caused the run to fail |
| Cancelled | Was running in parallel when another step failed |
| Skipped (Blocked) | Never started — downstream blocked by the failure |
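The badge rules in the table can be sketched as a small classifier. This is an illustration only: the function name, its arguments, and the "Completed" label for steps that finished before the failure are assumptions, not platform identifiers.

```python
def assign_badges(steps, failed_step, running, completed):
    """Map each step name to a badge at the moment a fail-fast run fails."""
    badges = {}
    for step in steps:
        if step == failed_step:
            badges[step] = "Failed"
        elif step in completed:
            badges[step] = "Completed"  # assumed label; not in the table
        elif step in running:
            badges[step] = "Cancelled"  # was running in parallel
        else:
            badges[step] = "Skipped (Blocked)"  # never started
    return badges


badges = assign_badges(
    steps=["fetch", "summarize", "classify", "join"],
    failed_step="summarize",
    running={"summarize", "classify"},
    completed={"fetch"},
)
```

Here summarize is Failed, its parallel sibling classify is Cancelled, and join never starts, so it is Skipped (Blocked).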
Planned recovery (not shipped)
The platform team is designing continue-from-failure so operators can fix config (e.g. wrong LLM model) without redoing every upstream step.
Option A: New run with checkpoint (likely first)
Start a run with a new run_id but reuse completed step outputs from a parent failed run. Only re-execute from the failed step (and downstream).
UX (planned): Retry run (reuse checkpoints) vs Run again (fresh).
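Since Option A is not shipped, the following is only a sketch of how the reuse decision could work, assuming the workflow is a graph given as a dict from each step to its downstream steps. The function names and data shapes are hypothetical.

```python
from collections import deque


def downstream_closure(edges, start):
    """Return start plus every step reachable from it via edges."""
    seen = {start}
    queue = deque([start])
    while queue:
        step = queue.popleft()
        for nxt in edges.get(step, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def plan_checkpoint_retry(edges, parent_outputs, failed_step):
    """Split steps into outputs to reuse and steps to execute in the new run."""
    invalidated = downstream_closure(edges, failed_step)
    # Reuse only outputs the parent run completed and the failure did not invalidate.
    reused = {s: out for s, out in parent_outputs.items() if s not in invalidated}
    # Everything else runs again, in the order the steps are declared in edges.
    to_run = [s for s in edges if s not in reused]
    return reused, to_run


edges = {"fetch": ["summarize", "classify"], "summarize": ["join"],
         "classify": ["join"], "join": []}
parent_outputs = {"fetch": "raw text", "classify": "billing"}
reused, to_run = plan_checkpoint_retry(edges, parent_outputs, "summarize")
```

One design choice in this sketch goes slightly beyond the description above: a step with no completed output in the parent run (for example a cancelled sibling) is also scheduled to run, because there is nothing to reuse for it.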
Option B: Retry on same run
Reopen a failed run: reset the failed step and downstream steps to pending, keep upstream results, and append new events (attempt 2 on the same timeline).
UX (planned): Retry from failure on the failed run row in Command Center.
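A rough sketch of Option B, assuming a run is held as a plain dict with a status per step and an append-only event list. The field names and the event types step_failed and retry_from_failure are invented for illustration; the planned design may differ.

```python
def retry_from_failure(run, edges, failed_step):
    """Reset the failed step and its downstream steps on the same run."""
    # Collect the failed step and everything downstream of it.
    affected, stack = set(), [failed_step]
    while stack:
        step = stack.pop()
        if step not in affected:
            affected.add(step)
            stack.extend(edges.get(step, []))
    for step in affected:
        run["steps"][step] = "pending"  # upstream results stay untouched
    run["attempt"] += 1
    run["status"] = "running"
    # Events are appended, never rewritten, so the timeline keeps attempt 1.
    run["events"].append(
        {"type": "retry_from_failure", "step": failed_step, "attempt": run["attempt"]}
    )
    return run


run = {
    "run_id": "run-1",
    "status": "failed",
    "attempt": 1,
    "steps": {"fetch": "completed", "summarize": "failed", "join": "skipped"},
    "events": [{"type": "step_failed", "step": "summarize", "attempt": 1}],
}
edges = {"fetch": ["summarize"], "summarize": ["join"], "join": []}
retry_from_failure(run, edges, "summarize")
```

The run_id does not change, which is the main difference from Option A: the history of the first attempt and the retry live on one timeline.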
Option C: Recoverable failures (later)
Some error classes (e.g. missing credential) would not fail the whole run immediately; the step would wait for an operator fix and a Retry step action.
Open product question
When one parallel branch fails, should sibling branches be allowed to finish?
| Policy | Today | Alternative |
|---|---|---|
| Fail-fast | Run fails immediately; siblings Cancelled | — |
| Drain parallel | — | Siblings finish; join may still fail |
| Lenient join | — | Join uses fallbacks (e.g. default subject from an earlier Lua step) |
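To make the three policies easier to compare, here is a hedged sketch that models a join over parallel branches. The policy strings, the function, and its return shape are assumptions for illustration. In particular, the drain case is modeled as a join that fails whenever an input is missing, which is only one way "join may still fail" could play out.

```python
def resolve(policy, branch_outputs, fallbacks=None):
    """Model one join over parallel branches.

    branch_outputs maps branch -> the output it would produce if allowed to
    finish, or None if that branch failed.
    Returns (run_status, sibling_badge, join_inputs).
    """
    fallbacks = fallbacks or {}
    failed = [b for b, out in branch_outputs.items() if out is None]
    if not failed:
        return "completed", "Completed", dict(branch_outputs)
    if policy == "fail_fast":
        # Today's behavior: the run fails at once and siblings are cancelled.
        return "failed", "Cancelled", None
    if policy == "drain_parallel":
        # Siblings finish, but the join still lacks an input in this sketch.
        return "failed", "Completed", None
    if policy == "lenient_join":
        if all(b in fallbacks for b in failed):
            inputs = {
                b: fallbacks[b] if out is None else out
                for b, out in branch_outputs.items()
            }
            return "completed", "Completed", inputs
        return "failed", "Completed", None
    raise ValueError(f"unknown policy: {policy}")


outputs = {"subject": None, "body": "Hello"}
lenient = resolve("lenient_join", outputs, fallbacks={"subject": "(no subject)"})
```

With a fallback for the failed subject branch, the lenient join completes; without one, it fails like the drain policy.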
API preview (future)
Checkpoint retry (Option A) may look like a start command issued after fixing the workflow.
Related
- Runs and Command Center: run lifecycle and failure badges
- Troubleshooting: failure codes and recovery steps
- Workflow patterns (Retries): retry_count for transient errors
- Run event types: step_cancelled, blocked skip payloads