Skip to content

feat(supervisor): custom tolerations for scheduled runs#3297

Merged
myftija merged 3 commits intomainfrom
scheduled-run-tolerations
Mar 30, 2026
Merged

feat(supervisor): custom tolerations for scheduled runs#3297
myftija merged 3 commits intomainfrom
scheduled-run-tolerations

Conversation

@myftija
Copy link
Copy Markdown
Collaborator

@myftija myftija commented Mar 30, 2026

Adds support for taint tolerations for scheduled runs. Useful for selectively tolerating taints on dedicated node pools.

The new KUBERNETES_SCHEDULED_RUN_TOLERATIONS env variable accepts a comma-separated list in the format key=value:effect (or key:effect for the Exists operator).

Drive-by: renames all KUBERNETES_SCHEDULE_* affinity env vars to KUBERNETES_SCHEDULED_RUN_* for clarity — this feature isn't used in production yet or published in a tagged image; the name change is fine.

@changeset-bot
Copy link
Copy Markdown

changeset-bot bot commented Mar 30, 2026

⚠️ No Changeset found

Latest commit: 450910e

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 30, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 29c705a2-15ef-4eff-b2d0-633350f052a2

📥 Commits

Reviewing files that changed from the base of the PR and between 46521c5 and 450910e.

📒 Files selected for processing (1)
  • apps/supervisor/src/env.ts
📜 Recent review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
  • GitHub Check: sdk-compat / Node.js 20.20 (ubuntu-latest)
  • GitHub Check: sdk-compat / Node.js 22.12 (ubuntu-latest)
  • GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
  • GitHub Check: sdk-compat / Cloudflare Workers
  • GitHub Check: sdk-compat / Deno Runtime
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: typecheck / typecheck
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: sdk-compat / Bun Runtime
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

Files:

  • apps/supervisor/src/env.ts
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

Files:

  • apps/supervisor/src/env.ts
**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries

**/*.ts: Use typecheck to verify changes in apps and internal packages (apps/*, internal-packages/*), not build - building proves almost nothing about correctness
When writing Trigger.dev tasks, always import from @trigger.dev/sdk. Never use @trigger.dev/sdk/v3 or deprecated client.defineJob
Add crumbs as you write code - mark lines with // @Crumbs or wrap blocks in `// `#region` `@crumbs for agentcrumbs debug tracing, then strip before merge

Files:

  • apps/supervisor/src/env.ts
**/*.{js,ts,jsx,tsx,json,md,yaml,yml}

📄 CodeRabbit inference engine (AGENTS.md)

Format code using Prettier before committing

Files:

  • apps/supervisor/src/env.ts
apps/supervisor/src/env.ts

📄 CodeRabbit inference engine (apps/supervisor/CLAUDE.md)

Environment configuration should be defined in src/env.ts

Files:

  • apps/supervisor/src/env.ts
apps/**/*

📄 CodeRabbit inference engine (CLAUDE.md)

When modifying only server components (apps/webapp/, apps/supervisor/, etc.) with no package changes, add a .server-changes/ file instead of a changeset

Files:

  • apps/supervisor/src/env.ts
🧠 Learnings (8)
📚 Learning: 2026-03-02T12:42:47.652Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: apps/supervisor/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:42:47.652Z
Learning: Applies to apps/supervisor/src/env.ts : Environment configuration should be defined in `src/env.ts`

Applied to files:

  • apps/supervisor/src/env.ts
📚 Learning: 2026-03-27T18:11:55.459Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/index.ts:83-98
Timestamp: 2026-03-27T18:11:55.459Z
Learning: In `apps/supervisor/src/index.ts`, `RESOURCE_MONITOR_ENABLED` (env var in `apps/supervisor/src/env.ts`) defaults to `false`. As a result, the local `ResourceMonitor`-based `maxResources`/`skipDequeue` gating in `preDequeue` is inactive in compute mode deployments. Do not flag local resource monitor usage in compute mode as a live bug; it has no practical impact unless `RESOURCE_MONITOR_ENABLED` is explicitly set to `true`.

Applied to files:

  • apps/supervisor/src/env.ts
📚 Learning: 2026-03-27T11:45:37.910Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/workloadServer/index.ts:832-840
Timestamp: 2026-03-27T11:45:37.910Z
Learning: In `apps/supervisor/src/workloadManager/compute.ts` and the supervisor restore flow, `TRIGGER_METADATA_URL` does not need to be re-injected on VM restore because it is baked into the instance environment at creation time and the environment is preserved through snapshot/restore. The Kubernetes restore path follows the same pattern. Do not flag the absence of `TRIGGER_METADATA_URL` re-injection on restore as a bug.

Applied to files:

  • apps/supervisor/src/env.ts
📚 Learning: 2025-08-14T18:35:44.370Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 2390
File: apps/webapp/app/env.server.ts:764-765
Timestamp: 2025-08-14T18:35:44.370Z
Learning: The BoolEnv helper in apps/webapp/app/utils/boolEnv.ts uses z.preprocess with inconsistent default value types across the codebase - some usages pass boolean defaults (correct) while others pass string defaults (incorrect), leading to type confusion. The helper should enforce boolean-only defaults or have clearer documentation.

Applied to files:

  • apps/supervisor/src/env.ts
📚 Learning: 2026-03-21T21:23:35.117Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3244
File: apps/webapp/app/components/runs/v3/ai/extractAISummarySpanData.ts:149-150
Timestamp: 2026-03-21T21:23:35.117Z
Learning: In `apps/webapp/app/components/runs/v3/ai/extractAISummarySpanData.ts` (and related AI span extraction files in `apps/webapp/app/components/runs/v3/ai/`), manual JSON.parse with typeof guards and type assertions is intentional. These functions are on the hot path for span rendering, so Zod validation is deliberately avoided for performance reasons. Do not suggest replacing manual JSON parsing with Zod schemas in these files.

Applied to files:

  • apps/supervisor/src/env.ts
📚 Learning: 2026-03-26T09:02:07.973Z
Learnt from: myftija
Repo: triggerdotdev/trigger.dev PR: 3274
File: apps/webapp/app/services/runsReplicationService.server.ts:922-924
Timestamp: 2026-03-26T09:02:07.973Z
Learning: In `triggerdotdev/trigger.dev`, `TaskRun.annotations` are always written atomically in one operation that conforms exactly to the `RunAnnotations` schema (from `trigger.dev/core/v3`). Using `RunAnnotations.safeParse` in `#parseAnnotations` (e.g., in `apps/webapp/app/services/runsReplicationService.server.ts`) is intentional and correct — there is no risk of partial/legacy annotation payloads causing schema mismatches, so suggesting a relaxed passthrough schema for this parsing is unnecessary.

Applied to files:

  • apps/supervisor/src/env.ts
📚 Learning: 2026-03-22T13:26:12.060Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3244
File: apps/webapp/app/components/code/TextEditor.tsx:81-86
Timestamp: 2026-03-22T13:26:12.060Z
Learning: In the triggerdotdev/trigger.dev codebase, do not flag `navigator.clipboard.writeText(...)` calls for `missing-await`/`unhandled-promise` issues. These clipboard writes are intentionally invoked without `await` and without `catch` handlers across the project; keep that behavior consistent when reviewing TypeScript/TSX files (e.g., usages like in `apps/webapp/app/components/code/TextEditor.tsx`).

Applied to files:

  • apps/supervisor/src/env.ts
📚 Learning: 2026-03-22T19:24:14.403Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 3187
File: apps/webapp/app/v3/services/alerts/deliverErrorGroupAlert.server.ts:200-204
Timestamp: 2026-03-22T19:24:14.403Z
Learning: In the triggerdotdev/trigger.dev codebase, webhook URLs are not expected to contain embedded credentials/secrets (e.g., fields like `ProjectAlertWebhookProperties` should only hold credential-free webhook endpoints). During code review, if you see logging or inclusion of raw webhook URLs in error messages, do not automatically treat it as a credential-leak/secrets-in-logs issue by default—first verify the URL does not contain embedded credentials (for example, no username/password in the URL, no obvious secret/token query params or fragments). If the URL is credential-free per this project’s conventions, allow the logging.

Applied to files:

  • apps/supervisor/src/env.ts
🔇 Additional comments (3)
apps/supervisor/src/env.ts (3)

126-135: LGTM!

Reformatting to explicit .trim().min(1) chains improves readability while maintaining the same validation behavior.


148-165: LGTM!

The rename from KUBERNETES_SCHEDULE_* to KUBERNETES_SCHEDULED_RUN_* provides better clarity about the purpose of these variables.


167-223: Well-structured toleration parsing with proper validation.

The implementation correctly handles both key=value:effect and key:effect formats, validates effects against known Kubernetes taint effects, and rejects empty keys. The ctx.addIssue pattern allows collecting all validation errors in a single parse attempt.

One edge case to be aware of: key=:NoSchedule (empty value) is accepted and produces { key: "key", operator: "Equal", value: "", effect: "NoSchedule" }. This is technically valid in Kubernetes (matches taints with an empty value), but might indicate user error. Consider whether an empty value warning is warranted.


Walkthrough

Kubernetes schedule-affinity environment variables in the supervisor were renamed from KUBERNETES_SCHEDULE_* to KUBERNETES_SCHEDULED_RUN_*. A new optional KUBERNETES_SCHEDULED_RUN_TOLERATIONS env var was added; it is parsed from a comma-separated key=value:effect or key:effect format into Kubernetes toleration objects with Zod validation that rejects malformed entries and invalid effects. Zod schema calls for schedule-related labels were reformatted without changing validation. Pod creation now conditionally sets spec.tolerations for scheduled runs via a new helper, and schedule-related affinity logic was updated to use the renamed env vars.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Description check ❓ Inconclusive The description explains the feature, format, and rationale, but omits required template sections like Testing, Changelog details, and the checklist confirmation. Complete the PR description template by adding Testing section (test steps taken), expanding Changelog section with more detail, and confirming the checklist items were followed.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main feature being added: custom tolerations for scheduled runs in the supervisor module.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch scheduled-run-tolerations

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

devin-ai-integration[bot]

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

@myftija myftija merged commit 7210bde into main Mar 30, 2026
39 checks passed
@myftija myftija deleted the scheduled-run-tolerations branch March 30, 2026 11:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants