Summary
Introduce a new Triton backend (WITH_TRITON) built on Triton's AOT compiler. Operators live under src/triton/ops/<op>/ and are compiled to cubins + C dispatchers at CMake configure time.
Add scripts/triton/generate_ops.py to walk the operator tree and invoke each operator's build.py.
Add scripts/triton/aot.py to encapsulate the Triton AOT build process.
Extend scripts/generate_wrappers.py with a --with-triton flag so the generated dispatch table picks up Triton implementations.
Wire WITH_TRITON through CMakeLists.txt and src/CMakeLists.txt: a new option, mutual exclusion with non-NVIDIA backends, enable_language(C) for the generated .c files, and a TRITON_PYTHON_EXECUTABLE cache variable mirroring the NineToothed pattern.
Implement Add as the first operator (src/triton/ops/add/{add.py,build.py,add.h}), registered as Operator<Add, kNvidia, 8>.
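The operator-tree walk described in the summary can be sketched roughly as follows. This is a minimal illustration, not the contents of scripts/triton/generate_ops.py: the function names `find_operator_builds` and `run_operator_builds`, and the convention that each build.py receives its output directory as a single command-line argument, are assumptions made for the example. The `python` parameter stands in for the interpreter that TRITON_PYTHON_EXECUTABLE would select.

```python
import subprocess
import sys
from pathlib import Path


def find_operator_builds(ops_root):
    """Return every `<ops_root>/<op>/build.py`, sorted by operator name."""
    root = Path(ops_root)
    return sorted(
        op_dir / "build.py"
        for op_dir in root.iterdir()
        if op_dir.is_dir() and (op_dir / "build.py").is_file()
    )


def run_operator_builds(ops_root, output_dir, python=sys.executable):
    """Run each operator's `build.py` and return the operator names built."""
    built = []
    for build_script in find_operator_builds(ops_root):
        op_name = build_script.parent.name
        op_output = Path(output_dir) / op_name
        op_output.mkdir(parents=True, exist_ok=True)
        # Hypothetical convention: `build.py` takes the output directory
        # as its only argument. A failing build aborts the whole run.
        subprocess.run([python, str(build_script), str(op_output)], check=True)
        built.append(op_name)
    return built
```

Directories without a build.py are skipped, so helper or scratch folders under the operator tree would not break the walk. Using `check=True` makes a failing operator build stop the run immediately, which suits a step that executes at CMake configure time.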
Motivation
Triton is not yet integrated into InfiniOps. Adding a Triton backend lets contributors write kernels in Python instead of CUDA C++, plugs cleanly into the existing Operator<...> dispatch via AOT-compiled kernels, and lays the groundwork for reaching non-NVIDIA targets as Triton's upstream support expands. Add ships as the first operator to validate the end-to-end pipeline.
Closes #N/A
Type of Change
feat — new feature / new operator / new platform
fix — bug fix
perf — performance improvement (no behavioral change)
refactor — code restructuring without behavior change
test — adding or fixing tests only
docs — documentation only
build / ci — build system or CI configuration
chore — tooling, formatting, or other non-code changes
Breaking change (requires a ! in the Conventional Commits prefix or a BREAKING CHANGE: footer)
Notes for Reviewers
This PR integrates Triton via the AOT pipeline; the overall flow mirrors the existing WITH_NINETOOTHED integration (CMakeLists.txt option, codegen script, *_PYTHON_EXECUTABLE cache var, generated .c dispatchers). Add is the first operator.
Checklist
Every contributor must verify every item below before requesting
review. Tick each box only after the check has actually been performed —
do not tick speculatively. If an item truly does not apply, replace the
checkbox with N/A and briefly explain why in an inline comment.
Title, Branch, and Commits
PR title follows Conventional Commits (e.g. feat(nvidia): ..., fix(cuda/gemm): ...).
Branch name follows <type>/xxx-yyyy-zzzz where <type> matches the PR title's Conventional Commits type and words are joined with hyphens (see CONTRIBUTING.md §Branches).
Each commit message follows Conventional Commits.
Small PR is a single squashable commit; or, for a large PR, every commit is meaningful, well-formed, and independently reviewable (see CONTRIBUTING.md §Pull Requests).
No stray merge commits from master — the branch is rebased cleanly on top of the current master.
No fixup! / squash! / wip commits remain.
Scope and Design
Changes are minimal — nothing unrelated to the stated motivation was added (CONTRIBUTING.md §Code/General).
No dead code, commented-out blocks, debug prints, printf/std::cout/print(...) left behind, or TODO without an owner and issue link.
No unrelated formatting churn that would obscure the diff.
Public API changes (if any) are intentional, documented, and reflected in affected callers/tests.
General Code Hygiene (applies to all languages)
The code is self-explanatory; comments were added only where the why is non-obvious (CONTRIBUTING.md §Code/General).
Every modified or added file ends with a single trailing newline (CONTRIBUTING.md §Code/General).
No trailing whitespace, tab/space mixing, or stray BOMs.
Identifiers in comments and error messages are wrapped in backticks (e.g. the `seqlens_k` tensor) (CONTRIBUTING.md §Code/General).
All comments and error messages are in English (CONTRIBUTING.md §Code/General).
Comments and error messages are complete sentences — capitalized first letter, terminal punctuation — unless the language/framework convention says otherwise (CONTRIBUTING.md §Code/General; §Python).
C++ Specific (if C++ files changed)
clang-format (version 21, per .github/workflows/clang-format.yml) has been run against all modified .h, .cc, .cuh, and .mlu files; the diff is clean.
clang-tidy concerns (per .clang-tidy) have been reviewed — no new warnings beyond the existing baseline.
Operator parameter order is inputs first, outputs last; attributes are between inputs and outputs; naming follows PyTorch → ONNX → CUDA API precedence (CONTRIBUTING.md §C++).
No exceptions are thrown. Error paths use assert with messages that include at least __FILE__, __LINE__, and __func__ (CONTRIBUTING.md §C++).
Error and warning message wording follows the LLVM Coding Standards (CONTRIBUTING.md §C++).
Kernel files are named correctly: custom = kernel / kernel_v2 / ...; well-known algorithms use the algorithm name; library-based implementations use the library name (CONTRIBUTING.md §C++).
N/A Separate .cuh + .cu for kernel — Triton AOT generates .c files instead of CUDA source; the extern "C" block in add.h includes the generated launchers directly, matching the pattern in src/ninetoothed/ops/<op>/<op>.h.
Constructor initializer list order matches member declaration order (CONTRIBUTING.md §C++).
Exactly one blank line between classes, between classes and functions, and between functions (CONTRIBUTING.md §C++).
Exactly one blank line between members (functions and variables) within a class (CONTRIBUTING.md §C++).
Exactly one blank line before and after the contents of a namespace (CONTRIBUTING.md §C++).
New operators added via src/base/<op>.h (inheriting Operator<Op>) with platform implementations under src/<category>/<platform>/ inheriting the base (CONTRIBUTING.md §Adding an Operator).
No raw new/delete; RAII / smart pointers / existing allocators are used.
Python Specific (if Python files changed)
Code is PEP 8 compliant; ruff check passes cleanly on CI (see .github/workflows/ruff.yml).
ruff format --check passes cleanly — if not, run ruff format and commit the result.
Comments are complete English sentences, starting with a capital letter and ending with punctuation; Markdown backticks are used for code references (CONTRIBUTING.md §Python).
N/A No new framework-specific conventions touched (no new pytest.skip calls).
No blank line between the function signature and the body when there is no docstring or comment (CONTRIBUTING.md §Python).
A blank line is present before and after if, for, and similar control-flow statements (CONTRIBUTING.md §Python).
A blank line appears before each return, except when it directly follows a control-flow statement (CONTRIBUTING.md §Python).
N/A No new docstrings added.
Type hints are added / kept consistent with the surrounding code.
Testing
pytest was run locally on every supported platform that this PR can affect, and the results are recorded in the "Test Results" table above (CONTRIBUTING.md §Pull Requests).
For any platform that could not be tested, an explicit reason is given in the table and a reviewer with access has been tagged.
New functionality has matching tests under tests/ following tests/test_add.py / tests/test_gemm.py patterns (CONTRIBUTING.md §Adding an Operator).
Tests use pytest.mark.parametrize correctly: dependent parameters share one decorator (e.g. @pytest.mark.parametrize("dtype, rtol, atol", ...)), independent parameters use separate decorators ordered by parameter declaration.
N/A pytest.mark.auto_act_and_assert — covered by the existing test scaffolding.
Default dtype / device parameterization is relied on, or overridden with an explicit pytest.mark.parametrize when necessary.
N/A No new flaky tests introduced.
N/A Not a bug fix; no regression test required.
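The parametrization convention in the Testing items above can be illustrated with a small, self-contained sketch. It is not taken from tests/test_add.py: the `add_reference` helper, the `size` parameter, and the tolerance values are hypothetical and stand in for the project's real fixtures.

```python
import pytest


def add_reference(a, b):
    """Hypothetical reference implementation used only for this illustration."""
    return [x + y for x, y in zip(a, b)]


# Independent parameter: it gets its own decorator.
@pytest.mark.parametrize("size", [1, 16, 1024])
# Dependent parameters: the tolerances depend on the dtype,
# so all three share one decorator.
@pytest.mark.parametrize(
    "dtype, rtol, atol",
    [("float32", 1e-5, 1e-6), ("float16", 1e-3, 1e-3)],
)
def test_add(size, dtype, rtol, atol):
    # In a real test `dtype` would select the tensor type; here it only
    # labels the case, because the sketch uses plain Python floats.
    a = [float(i) for i in range(size)]
    b = [2.0 * i for i in range(size)]
    expected = [3.0 * i for i in range(size)]
    assert add_reference(a, b) == pytest.approx(expected, rel=rtol, abs=atol)
```

Grouping `dtype`, `rtol`, and `atol` keeps each tolerance paired with the dtype it belongs to, while the separate `size` decorator lets pytest generate the full cross product of sizes and dtype cases.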
Build, CI, and Tooling
The project builds cleanly from a fresh directory with pip install .[dev] on at least one affected platform.
compile_commands.json still regenerates (CMake option CMAKE_EXPORT_COMPILE_COMMANDS=ON in pyproject.toml — required by the code-lint skill and clang-tidy -p).
New backends / devices have been added to auto-detection in CMakeLists.txt under if(AUTO_DETECT_DEVICES) and to if(AUTO_DETECT_BACKENDS) if applicable.
Only one CUDA-like GPU backend is selectable at a time — the existing mutual-exclusion check in CMakeLists.txt is not broken.
Both CI workflows (clang-format.yml, ruff.yml) are green locally (or expected to be green on CI).
No new runtime dependency was added without updating pyproject.toml's [project.optional-dependencies] (or justified in the PR description).
Documentation
N/A README.md / CONTRIBUTING.md not updated — WITH_TRITON follows the exact same opt-in pattern as WITH_NINETOOTHED, which is similarly undocumented. Happy to add a section if requested.
New operators, new dispatch helpers, or new public utilities are documented (docstring, header comment, or an addition to CONTRIBUTING.md §Some Code Explanations).
N/A No user-visible breaking changes.
Security and Safety
No secrets, access tokens, internal URLs, customer data, or personal hardware identifiers have been committed.
Third-party code is license-compatible and attributed.
No unsafe pointer arithmetic, uninitialized reads, or missing bounds checks were introduced.
Platforms Affected
WITH_CPU
WITH_NVIDIA
WITH_ILUVATAR
WITH_METAX
WITH_CAMBRICON
WITH_MOORE
WITH_ASCEND
WITH_TORCH
Test Results on Supported Platforms
pytest Result: pytest tests/test_add.py -k cuda-8 → 108 passed, 0 failed
Benchmark / Performance Impact
N/A