# RFC: Backend Test Infrastructure #11140
## Introduction

The purpose of this RFC is to propose shared test suites and stress test infrastructure that can run against all ExecuTorch backends. Backends should be able to integrate with this framework with minimal effort, and the framework should cover all major capabilities such as partitioner modes, tensor dtypes, operator implementations, and quantization flows.

The initial goal of this effort is to ensure that all ExecuTorch backends meet a standard of functional reliability and correctness defined by ExecuTorch as a platform. Practically, this means being able to take any PT2 graph, partition only what a backend can handle, lower, generate a .pte, load it, and run without errors or crashes on most relevant platforms for a given backend.

This document proposes a design aimed at ExecuTorch GA release needs. However, we are intentionally limiting the implementation scope, as reflected in the milestones for the v0.7 release, so that we remain adaptable to GA requirements. Consequently, performance is explicitly a non-goal for the ET v0.7 release. Strict numerical correctness is also outside the scope for v0.7, as the primary focus is on functionality. Additionally, exhaustively measuring coverage across platforms and operating systems for backends, such as Vulkan on Windows, is not part of the early goals. However, as highlighted earlier, we ultimately anticipate the proposed design to support all of these aspects if needed for the GA release.

## Notes for Reviewers

We've intentionally focused on high-level interfaces and goals in this document over implementation details. We expect that the implementation will not be particularly controversial, but we would be interested in points that you foresee as potentially complex or problematic. We are especially looking for feedback from backend authors on the following questions:

## Motivation

As ExecuTorch approaches its 1.0 GA release in October of this year, we have explicit goals on software reliability and out-of-box experience.
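To make the target concrete, the following is a minimal sketch of the export → partition → lower → .pte → execute path that the harness needs to exercise end to end. It uses the XNNPACK backend and a toy model purely as an example; exact module paths and APIs may vary by ExecuTorch version.

```python
import torch
from executorch.exir import to_edge
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x + 1)

model = TinyModel().eval()
example_inputs = (torch.randn(1, 8),)

# Export a PT2 graph, partition what the backend can handle, and lower.
exported = torch.export.export(model, example_inputs)
edge = to_edge(exported)
edge = edge.to_backend(XnnpackPartitioner())
et_program = edge.to_executorch()

# Serialize to .pte and run it via pybindings, checking for errors/crashes.
with open("tiny_model.pte", "wb") as f:
    f.write(et_program.buffer)

from executorch.extension.pybindings.portable_lib import _load_for_executorch
module = _load_for_executorch("tiny_model.pte")
outputs = module.forward(example_inputs)
```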
## Scope

### Design Goals for 0.7
Potential design goals for GA include performance, numerical correctness, and platform coverage.

### Non-Goals for 0.7
## Motivating Use Cases

The proposed design should serve to support a number of related use cases for backend validation. Some specific motivating examples include:
## Design

The proposed design involves a core test harness, which takes as input a matrix of configurations. The output of the test run is a test report, which includes pass/fail status for all tests and any additional collected metrics. The test harness should be able to be run ad hoc locally or integrated into CI jobs.

### Configuration Matrix

The configuration matrix determines which tests to run, which quantization and lowering flows to use, and which runtime targets to run on. Each axis takes a set of one or more configurations. The test harness is responsible for evaluating each combination in the Cartesian product of the configurations.

Orthogonal Configuration Axes:
 Let’s talk about each in a bit more detail. 
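As a rough illustration of how the harness might expand the matrix, here is a small sketch. The dataclasses and field names below are hypothetical and exist only to show the Cartesian-product expansion; they are not an existing ExecuTorch API.

```python
import itertools
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class TestCase:
    name: str
    make_model: Callable[[], Any]     # returns an eager-mode nn.Module
    make_inputs: Callable[[], tuple]  # returns example inputs for the model

@dataclass
class RunConfig:
    test: TestCase
    dtype: Any        # backend-independent configuration (dtype, dynamic shapes, ...)
    flow: str         # backend lowering/quantization recipe name
    runtime: str      # runtime target (e.g. pybindings, on-device)

def expand_matrix(tests, dtypes, flows, runtimes):
    """Expand the Cartesian product of all configuration axes into run configs."""
    for test, dtype, flow, runtime in itertools.product(tests, dtypes, flows, runtimes):
        yield RunConfig(test=test, dtype=dtype, flow=flow, runtime=runtime)
```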
 Each test is defined by an eager mode model and a set of test inputs. The test itself is intended to be orthogonal of backend, quantization, and runtime configuration. We may need to relax tolerances between backends, but we expect functional correctness and reliability across all backends and configurations, which is what this effort aims to achieve. We intend to introduce two primary test suites: models and operators. The operator test suite should cover the entire ATen op set, and any other operators that are common. In practice, we see many non-core ops that are not decomposed, and there is a general de-emphasis on a single op-set, so we will want to add operators as we see them. Model tests will leverage our existing example models as a baseline, but it should be easy to integrate external model libraries, such as HF Transformers. We will also need to create artificial models to validate specific configurations, such as dynamically quantized linears with multiple consumers, or other cases we have seen problems with in the past. For operator-level tests, we anticipate using FACTO to generate input permutations, including dtypes. For model-level tests, we will likely need to manage dtype as an independent axis, and input tensor generation will be coupled to the dtype. 2. Backend-Independent ConfigurationThis configuration controls model DTypes and whether the model is exported with dynamic shapes. Even in cases where backends do not support dynamic shapes, it is beneficial to validate that they do not attempt to partition nodes that they cannot handle. 3. Backend Interface and RecipesThe backend interface is responsible for allowing backends to register quantization and lowering configurations. Quantization is coupled to the backend, but is considered a separate axis, such that we can test multiple quantization schemes against multiple lowering schemes independently. Such as testing (no quant, 8-bit, 4-bit) x (to_edge, to_edge_transformer_and_lower), or perhaps with different partitioner options. When the high-level export API and recipes are available (Tarun's Recipe RFC), we may integrate that, as it fulfills effectively the same functionality. However, I’d like to avoid a hard dependency on this work, so we will maintain test recipes independently for now. We can re-evaluate when the high-level export API is available. 4. Device configurationsWhen integrating with an external executor, this configuration controls which devices are used to run benchmarks. There is necessarily a dependency on the specific backends used, as backends are largely hardware-dependent. This configuration acts as one filter in this set, where the backend configuration also factors in, such that tests are run only on devices that pass all filters. 5. Runtime InterfaceThe runtime interface abstracts the underlying .pte execution mechanism. The test harness will provide a .pte file, expected outputs and tolerances, and any runtime configuration options (thread count, etc.). The runtime provider will be responsible for executing the .pte with the given inputs and validating that the outputs are within tolerance. This design intends to provide an abstraction for the underlying runtime executor. Initially, we will use pybindings as the runtime executor. In a later milestone, we will add the option to run tests on-device using AWS device farm. This may be necessary to support all backends, and it provides a more realistic execution environment. Question for reviewers: Which backends support pybindings currently? 
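One possible shape for this abstraction is sketched below. The class and method names are illustrative assumptions, and the pybindings entry point is assumed to be the current `executorch.extension.pybindings.portable_lib` module; the real interface may differ.

```python
from abc import ABC, abstractmethod
import torch

class RuntimeProvider(ABC):
    """Executes a .pte with the given inputs and returns outputs (hypothetical interface)."""

    @abstractmethod
    def run(self, pte_path: str, inputs: tuple) -> list:
        ...

class PybindingsRuntime(RuntimeProvider):
    """Initial runtime provider backed by the ExecuTorch pybindings."""

    def run(self, pte_path: str, inputs: tuple) -> list:
        from executorch.extension.pybindings.portable_lib import _load_for_executorch
        module = _load_for_executorch(pte_path)
        return module.forward(inputs)

def validate_outputs(actual, expected, atol=1e-3, rtol=1e-3):
    """Check that backend outputs match eager reference outputs within tolerance."""
    for a, e in zip(actual, expected):
        torch.testing.assert_close(a, e, atol=atol, rtol=rtol)
```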
Question for reviewers: Which backends support pybindings currently? Are we going to enforce pybind support in ET (via simulators or similar) for all backends?

### 6. Metrics

As part of the test execution framework, we want to be able to collect metrics on the lowering, execution, and outputs. The test framework will include common logic for storing and aggregating metrics. Each metric should be recorded per test case and aggregated for each configuration set. Desired metrics include:
### 7. Outputs

The primary output of the test run is a report, which includes pass/fail information for each test, logging and output for failed tests, and individual and aggregated metrics for the run. At a high level, the goals are that it should be easy to view summary statistics, we should be able to access the raw result and metric data for post-processing, and it should be easy to debug failed tests.
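For illustration only, a per-test result record and a simple aggregation along these lines could back such a report; the field names here are assumptions, not a committed schema.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class TestResult:
    test_name: str
    config_name: str      # which point in the configuration matrix was run
    passed: bool
    error_log: str = ""   # captured logging/output for failed tests
    metrics: dict = field(default_factory=dict)  # e.g. delegated node count

def summarize(results: list[TestResult]) -> dict:
    """Aggregate pass/fail counts per configuration for the summary view."""
    summary: dict[str, Counter] = {}
    for r in results:
        summary.setdefault(r.config_name, Counter())["pass" if r.passed else "fail"] += 1
    return summary
```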
## Milestones for v0.7

The following features are proposed to be delivered with the ExecuTorch 0.7 milestone (mid/end of June, 2025).

Milestone 1 - end of May 2025:
 Milestone 2 - end of June 2025: 
## Dependencies (XFN)
## Risks