feat(supervisor): add node affinity rules for large machine worker pool scheduling #2869
Conversation
**Background**

Runs with `large-1x` or `large-2x` machine presets are disproportionately affected by scheduling delays during peak times. This is partly because the worker pool is shared across all runs, so large runs compete with smaller runs for available capacity. Because large runs require significantly more CPU and memory, they are harder for the scheduler to bin-pack onto existing nodes, often requiring a node with a large amount of free resources or waiting for a new node to spin up entirely. This effect is amplified during peak times, when nodes are already densely packed with smaller workloads, leaving insufficient contiguous resources for large runs. Large runs also make up only a small percentage of total runs.

**Changes**

This PR adds Kubernetes node affinity settings to separate large and standard machine workloads across node pools (a rough sketch of the affinity shapes follows the list below).

- Controlled via the `KUBERNETES_LARGE_MACHINE_POOL_LABEL` env var (disabled when not set)
- Large machine presets (`large-*`) get a soft preference to schedule on the large pool, with fallback to standard nodes
- Non-large machines are excluded from the large pool via required anti-affinity
- This ensures the large machine pool is reserved for large workloads while allowing large workloads to spill over to standard nodes if needed
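For illustration, here is a minimal TypeScript sketch of the two affinity shapes described above, using `@kubernetes/client-node` types. The `machinepool` label key, the pool value, and the helper name are assumptions drawn from the review discussion below, not a verbatim copy of the PR's code.

```ts
import * as k8s from "@kubernetes/client-node";

// Sketch only: the "machinepool" label key and the helper name are illustrative assumptions.
function nodeAffinityFor(isLargeMachine: boolean, largePoolValue: string): k8s.V1Affinity {
  if (isLargeMachine) {
    // Soft preference: large runs prefer the large pool but can fall back to standard nodes.
    return {
      nodeAffinity: {
        preferredDuringSchedulingIgnoredDuringExecution: [
          {
            weight: 100,
            preference: {
              matchExpressions: [{ key: "machinepool", operator: "In", values: [largePoolValue] }],
            },
          },
        ],
      },
    };
  }

  // Hard exclusion: standard runs must not be scheduled onto large-pool nodes.
  return {
    nodeAffinity: {
      requiredDuringSchedulingIgnoredDuringExecution: {
        nodeSelectorTerms: [
          {
            matchExpressions: [{ key: "machinepool", operator: "NotIn", values: [largePoolValue] }],
          },
        ],
      },
    },
  };
}
```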
⚠️ No Changeset found
Latest commit: a847ef6
Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.
PR Review: Node Affinity Rules for Large Machine Worker Pool Scheduling
Summary
This PR adds Kubernetes node affinity configuration to separate large machine workloads (large-1x, large-2x) from standard workloads across different node pools. This is a well-targeted optimization for reducing scheduling delays during peak times.
Code Quality & Design
Strengths:
- Clean, focused implementation with minimal changes
- Feature is opt-in via the `KUBERNETES_LARGE_MACHINE_POOL_LABEL` env var, so there are no breaking changes
- Good asymmetric strategy: large machines get a soft preference (can fall back to standard nodes) while standard machines have a hard exclusion from the large pool
- Comments clearly explain the scheduling behavior
Observations:
- **Hardcoded `machinepool` label key** (kubernetes.ts:378, 399): The node label key is hardcoded to `"machinepool"`. Consider whether this should be configurable alongside the label value, or at minimum, document this requirement in the env var comment. Operators will need to know their nodes must carry the `machinepool=<value>` label (see the sketch after this list).
- **Machine name prefix matching** (kubernetes.ts:362): The `#isLargeMachine` function uses `preset.name.startsWith("large-")`. This works correctly with the current `MachinePresetName` enum (`large-1x`, `large-2x`), but be aware this couples the scheduling logic to the naming convention.
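A hedged sketch of what the observations above could look like; `KUBERNETES_LARGE_MACHINE_POOL_LABEL_KEY` is an invented name used only to show the idea of a configurable label key, and the prefix check mirrors what the review describes.

```ts
// Hypothetical: make the label key configurable with a sane default
// (KUBERNETES_LARGE_MACHINE_POOL_LABEL_KEY is not part of the PR).
const LARGE_POOL_LABEL_KEY = process.env.KUBERNETES_LARGE_MACHINE_POOL_LABEL_KEY ?? "machinepool";

// Prefix-based detection as described above; this couples scheduling to the
// preset naming convention ("large-1x", "large-2x", and any future "large-*").
function isLargeMachine(presetName: string): boolean {
  return presetName.startsWith("large-");
}
```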
Potential Issues
- **No unit tests**: The `apps/supervisor/src/workloadManager/` directory has no test files. Given this is Kubernetes scheduling logic that affects production workload placement, consider adding tests for:
  - `#isLargeMachine()` correctly identifies large presets
  - `#getNodeAffinity()` returns `undefined` when the env var is not set
  - Correct affinity structure for large vs non-large machines
Per CLAUDE.md guidelines, tests should use testcontainers, but these particular functions are pure logic and could be unit tested directly.
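A minimal test sketch of the cases listed above, assuming vitest (suggested by the repo's existing `*.test.ts` files) and assuming the affinity logic is extracted into exported pure helpers; the module path and helper names are hypothetical, since the PR keeps them as private methods.

```ts
import { describe, expect, it } from "vitest";
// Hypothetical module and exports; the PR implements these as private class methods.
import { isLargeMachine, nodeAffinityFor } from "./nodeAffinity";

describe("large machine pool affinity", () => {
  it("identifies large presets by prefix", () => {
    expect(isLargeMachine("large-1x")).toBe(true);
    expect(isLargeMachine("large-2x")).toBe(true);
    expect(isLargeMachine("small-1x")).toBe(false);
  });

  it("soft-prefers the large pool for large presets", () => {
    const affinity = nodeAffinityFor(true, "large");
    expect(affinity.nodeAffinity?.preferredDuringSchedulingIgnoredDuringExecution).toHaveLength(1);
  });

  it("hard-excludes standard presets from the large pool", () => {
    const affinity = nodeAffinityFor(false, "large");
    const term = affinity.nodeAffinity?.requiredDuringSchedulingIgnoredDuringExecution?.nodeSelectorTerms[0];
    expect(term?.matchExpressions?.[0]?.operator).toBe("NotIn");
  });
});
```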
- **Interaction with existing `nodeSelector`**: The `#defaultPodSpec` already sets a `nodeSelector` based on `KUBERNETES_WORKER_NODETYPE_LABEL`. The new affinity rules are additive, which should work correctly with Kubernetes scheduling, but operators should understand that both constraints apply (illustrated below).
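A short sketch of how the two constraints coexist on one pod spec; Kubernetes requires both the `nodeSelector` and any required node affinity terms to be satisfied. The label keys and values here are placeholders, not the supervisor's actual configuration.

```ts
import * as k8s from "@kubernetes/client-node";

const podSpec: k8s.V1PodSpec = {
  containers: [{ name: "run", image: "example/run:latest" }],
  // Existing constraint (key/value are placeholders for the KUBERNETES_WORKER_NODETYPE_LABEL selector).
  nodeSelector: { nodetype: "worker" },
  // New constraint from this PR: the scheduler requires BOTH the nodeSelector and
  // the required node affinity terms to match before placing the pod.
  affinity: {
    nodeAffinity: {
      requiredDuringSchedulingIgnoredDuringExecution: {
        nodeSelectorTerms: [
          { matchExpressions: [{ key: "machinepool", operator: "NotIn", values: ["large"] }] },
        ],
      },
    },
  },
};
```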
Performance Considerations
The implementation looks efficient - no additional API calls or heavy computation. The affinity configuration is built once per pod creation.
Security
No security concerns. The feature is configuration-driven and affects pod scheduling only.
Minor Suggestions
- The env var comment could be more descriptive:

```ts
// Current:
KUBERNETES_LARGE_MACHINE_POOL_LABEL: z.string().optional(), // if set, large-* presets affinity for machinepool=<value>

// Suggested:
KUBERNETES_LARGE_MACHINE_POOL_LABEL: z.string().optional(), // If set, applies node affinity: large-* presets prefer nodes with label machinepool=<value>, others avoid them
```
- Consider adding a changeset if this is a notable feature for managed deployments (though since this is in `apps/supervisor` and not a public package, it may not be required).
Verdict
Approve with minor suggestions. The implementation is clean, well-scoped, and solves a real scheduling problem with a sensible asymmetric approach. The main suggestion is to add unit tests for the affinity logic to prevent regressions.
Actionable comments posted: 0
🧹 Nitpick comments (1)
apps/supervisor/src/workloadManager/kubernetes.ts (1)
365-410: Well-designed affinity logic with appropriate scheduling semantics. The implementation correctly applies:
- Soft preference (weight 100) for large machines → allows fallback to standard nodes
- Hard exclusion for non-large machines → reserves the large pool
The `NotIn` operator correctly allows scheduling on nodes that don't have the `machinepool` label at all. Consider adding debug logging when affinity rules are applied to aid troubleshooting scheduling issues:
♻️ Optional enhancement for observability
```diff
 #getNodeAffinity(preset: MachinePreset): k8s.V1Affinity | undefined {
   if (!env.KUBERNETES_LARGE_MACHINE_POOL_LABEL) {
     return undefined;
   }

   if (this.#isLargeMachine(preset)) {
+    this.logger.debug("[KubernetesWorkloadManager] Applying soft affinity for large machine pool", {
+      preset: preset.name,
+      poolLabel: env.KUBERNETES_LARGE_MACHINE_POOL_LABEL,
+    });
     // soft preference for the large-machine pool, falls back to standard if unavailable
     return {
       // ... existing code
     };
   }

+  this.logger.debug("[KubernetesWorkloadManager] Applying anti-affinity for large machine pool", {
+    preset: preset.name,
+    poolLabel: env.KUBERNETES_LARGE_MACHINE_POOL_LABEL,
+  });
   // not schedulable in the large-machine pool
   return {
     // ... existing code
   };
 }
```
📜 Review details
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- `apps/supervisor/src/env.ts`
- `apps/supervisor/src/workloadManager/kubernetes.ts`
🔇 Additional comments (3)

apps/supervisor/src/env.ts (1)

94-94: LGTM! The optional environment variable follows the existing patterns in this file. The inline comment clearly documents that the feature is disabled when unset and explains the label semantics (`machinepool=<value>`).

apps/supervisor/src/workloadManager/kubernetes.ts (2)

98-98: LGTM! Clean integration with the pod spec. Returning `undefined` when the feature is disabled correctly omits the `affinity` field entirely.
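As a small illustration of why returning `undefined` works: undefined object properties are dropped when the pod spec is serialized to JSON, so the `affinity` field never reaches the API server (the field names here are illustrative).

```ts
const affinity = undefined; // what the getter returns when the env var is unset (per the review)

const spec = { restartPolicy: "Never", affinity };
console.log(JSON.stringify(spec)); // {"restartPolicy":"Never"} - no affinity key is emitted
```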
361-363: LGTM! Simple and effective prefix check. All large machine presets (`large-1x`, `large-2x`) follow the `large-` prefix convention, and this approach is flexible enough to accommodate any future `large-*` presets without code changes.
PR Review: Node Affinity Rules for Large Machine Worker Pool Scheduling
Summary
This PR adds Kubernetes node affinity rules to separate large machine workloads (large-1x, large-2x) from standard workloads, addressing scheduling delays during peak times. The implementation is clean and follows the existing patterns in the codebase.
Code Quality
Strengths:
- Clean, well-structured code that follows existing patterns in the file
- Good use of private methods with descriptive names (#isLargeMachine, #getNodeAffinity)
- Appropriate use of soft preference (preferredDuringSchedulingIgnoredDuringExecution) for large machines, allowing fallback to standard nodes
- Hard constraint (requiredDuringSchedulingIgnoredDuringExecution) for standard machines prevents them from consuming large pool capacity
- Feature is opt-in via env var, maintaining backward compatibility
Minor Suggestions:
- Consider using the existing `PLACEMENT_TAGS_PREFIX` (which defaults to `node.cluster.x-k8s.io`) for the node label key instead of hardcoding `node.cluster.x-k8s.io/machinepool`. This would allow for more flexibility.
- The comment in the env file could be slightly more descriptive about what value to provide.
Potential Issues
- **Edge case - nodes without the label**: When `KUBERNETES_LARGE_MACHINE_POOL_LABEL` is set, non-large machines use a `NotIn` constraint. If a node does not have the `node.cluster.x-k8s.io/machinepool` label at all, the `NotIn` constraint will still allow scheduling on that node (since the label is simply absent, not matching the excluded value). This is probably the desired behavior but worth confirming.
- **No `DoesNotExist` handling**: Standard workloads can still schedule on nodes that do not have the `machinepool` label at all. Consider whether you want to add an additional term to handle nodes without the label (see the sketch below), though this might be unnecessary complexity if your cluster setup guarantees all nodes carry it.
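To make the trade-off concrete, here is a hedged sketch of both behaviors; requirements within a single `matchExpressions` array are ANDed by the scheduler, and the label key and pool value are assumptions taken from the review text.

```ts
import * as k8s from "@kubernetes/client-node";

// Behavior described for the PR: NotIn alone still matches nodes that have no
// "machinepool" label at all, so standard runs can use unlabeled nodes.
const allowUnlabeledNodes: k8s.V1NodeSelectorTerm = {
  matchExpressions: [{ key: "machinepool", operator: "NotIn", values: ["large"] }],
};

// Stricter alternative: additionally require the label to exist, so standard runs
// may only land on explicitly labeled, non-large pools.
const requireLabeledNodes: k8s.V1NodeSelectorTerm = {
  matchExpressions: [
    { key: "machinepool", operator: "Exists" },
    { key: "machinepool", operator: "NotIn", values: ["large"] },
  ],
};
```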
Performance Considerations
- The affinity rules add minimal overhead to pod scheduling
- Using soft preference (weight 100) for large machines is a good balance - it strongly prefers the large pool but allows spillover
- No performance concerns identified
Security Considerations
- No security concerns - this is purely a scheduling optimization
- The env var pattern follows existing conventions
Test Coverage
- No unit tests added for the new `#isLargeMachine` and `#getNodeAffinity` methods
- I see the supervisor has some test files (e.g., `envUtil.test.ts`, `podCleaner.test.ts`)
- Consider adding tests for:
  - `#isLargeMachine` returns `true` for `large-1x`, `large-2x` and `false` for other presets
  - `#getNodeAffinity` returns `undefined` when the env var is not set
  - `#getNodeAffinity` returns the correct affinity structure for large and non-large machines
Overall Assessment
This is a well-designed, practical solution to improve scheduling performance for large machine workloads. The implementation is clean and follows existing patterns. The main suggestion is to add unit tests for the new logic to ensure correctness and prevent regressions.
Recommendation: Approve with minor suggestions (consider tests and the optional improvements noted above).