XLA Tooling

The XLA development workflow is usually centered around HLO IR, which represents an isolated functional computation given to the compiler. XLA comes with multiple command-line tools (described below) which consume HLO and either run it or provide an intermediate compilation stage. Using such tools is invaluable for a fast compile->modify->run iteration cycle, as HLO is both visualizable and hackable, and iteratively changing and running it is often the fastest way to understand and fix XLA performance or behavior issues.

The easiest way to obtain the HLO for a program being compiled with XLA is usually to use the XLA_FLAGS environment variable:

$ XLA_FLAGS=--xla_dump_to=/tmp/myfolder ./myprogram-entry-point

which stores all before-optimization HLO files in the folder specified, along with many other useful artifacts.

[run_hlo_module] Run HLO Modules

bazel run //xla/tools:run_hlo_module -- [flags] <filename>

The tool run_hlo_module operates on pre-optimization HLO, and by default bundles compilation, running and comparison with the reference interpreter implementation. For example, the usual invocation to run an input file computation.hlo on an NVIDIA GPU and to check it for correctness is:

run_hlo_module --platform=CUDA --reference_platform=Interpreter computation.hlo
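For reference, a minimal computation.hlo could look like the following (a hand-written sketch, not taken from a real dump):

HloModule add_example

ENTRY main {
  p0 = f32[4] parameter(0)
  p1 = f32[4] parameter(1)
  ROOT add = f32[4] add(p0, p1)
}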

Run Multiple HLO Modules

Invocation with multiple HLO modules is supported for run_hlo_module. To run all HLO modules from a dump directory:

bazel run //xla/tools:run_hlo_module -- [flags] /dump/*before_optimizations*

[multihost_hlo_runner] Run HLO Modules With SPMD Support

# Note: Binary name is `hlo_runner_main`.
bazel run //xla/tools/multihost_hlo_runner:hlo_runner_main -- [flags] <filename>

Multihost HLO runner is a very similar tool, with the caveat that it supports SPMD, including cross-host communication. See Multi-Host HLO Runner for details.
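As an illustration, an HLO module exercising the kind of cross-replica communication this runner supports might look like the following (a hand-written sketch, not from a real dump):

HloModule all_reduce_example

add {
  a = f32[] parameter(0)
  b = f32[] parameter(1)
  ROOT add = f32[] add(a, b)
}

ENTRY main {
  p0 = f32[8] parameter(0)
  ROOT ar = f32[8] all-reduce(p0), replica_groups={}, to_apply=add
}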

Run Multiple HLO Modules With SPMD Support

Similar to run_hlo_module, multihost_hlo_runner also supports invocation with multiple modules.

bazel run //xla/tools/multihost_hlo_runner:hlo_runner_main -- [flags] /dump/*before_optimizations*

[hlo-opt] Compile HLO Module

bazel run //xla/tools:hlo-opt -- --platform=[gpu|cpu|...] [more flags] <filename>

When debugging or understanding the workings of the compiler, it is often useful to get the expansion for particular hardware at a particular point in the pipeline (be it HLO, optimized HLO, TritonIR, or LLVM), for a given HLO or StableHLO input.

hlo-opt supports multiple output stages, be it PTX, HLO after optimizations, LLVM IR before optimizations, or TritonIR. The exact set of stages supported depends on the platform (e.g. PTX is NVIDIA-specific) and can be seen using the --list-stages command:

hlo-opt --platform=CUDA --list-stages
buffer-assignment
hlo
hlo-backend
html
llvm
llvm-after-optimizations
llvm-before-optimizations
ptx

After selecting a stage, the user can write the result of the conversion for a given platform to a given stream:

hlo-opt --platform=cpu --stage=hlo input.hlo

which would print the dump to stdout (or to a given file if -o was specified).
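For example, to write the result to a file instead of stdout (the output path is illustrative):

hlo-opt --platform=cpu --stage=hlo -o /tmp/out.hlo input.hlo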

Deviceless Compilation for GPU

Deviceless compilation does not need access to a GPU. It provides a way to specify the GPU spec on the command line (--xla_gpu_target_config_filename) for stages that would otherwise require a GPU, eliminating the need for a GPU device.

Example: LLVM IR output without access to a GPU device:

hlo-opt --platform=CUDA --stage=llvm --xla_gpu_target_config_filename=/xla/tools/hlo_opt/gpu_specs/a100_pcie_80.txtpb input.hlo

Specs for popular GPUs are shipped with the compiler, and the provided file is a text serialization of device_description.proto:

gpu_device_info {
 cuda_compute_capability {
 major: 8
 minor: 0
 }
 threads_per_block_limit: 1024
 threads_per_warp: 32
 shared_memory_per_block: 127152
 shared_memory_per_core: 65536
 threads_per_core_limit: 2048
 core_count: 6192
 fpus_per_core: 64
 block_dim_limit_x: 2147483647
 block_dim_limit_y: 65535
 block_dim_limit_z: 65535
 memory_bandwidth: 2039000000000
 l2_cache_size: 4194304
 clock_rate_ghz: 1.1105
 device_memory_size: 79050250240
}
platform_name: "CUDA"

More GPU specs are located at /xla/tools/hlo_opt/gpu_specs.

Autotuning

Depending on the selected --stage, compilation may involve autotuning. For deviceless compilation to work, the user needs to either:

  * disable autotuning with --xla_gpu_autotune_level=0, or
  * load pre-existing autotuning results with --xla_gpu_load_autotune_results_from=<filename> (obtained with --xla_gpu_dump_autotune_results_to=<filename>).
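
For example, a deviceless compilation with autotuning disabled (reusing the GPU spec file from above):

hlo-opt --platform=CUDA --stage=llvm --xla_gpu_target_config_filename=gpu_specs/a100_pcie_80.txtpb --xla_gpu_autotune_level=0 input.hlo

Alternatively, with pre-recorded autotuning results: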

hlo-opt --platform=CUDA --stage=llvm --xla_gpu_target_config_filename=gpu_specs/a100_pcie_80.txtpb --xla_gpu_load_autotune_results_from=results.textpb input.hlo

The autotune file is a text serialization of autotune_results.proto; for example:

version: 3
results {
  device: "CUDA: 8.0, Cores: 108, GPU clock: 1.41 GHz, Memory bandwidth: 1555 GB/s, L2 cache: 40 MB"
  hlo: "{\n tmp_0 = f16[1,16,17,3]{3,2,1,0} parameter(0)\n tmp_1 = f16[16,51]{1,0} bitcast(f16[1,16,17,3]{3,2,1,0} tmp_0)\n tmp_2 = s8[16,17,3]{2,1,0} parameter(1)\n tmp_3 = s8[51,16]{0,1} bitcast(s8[16,17,3]{2,1,0} tmp_2)\n tmp_4 = f16[51,16]{0,1} convert(s8[51,16]{0,1} tmp_3)\n tmp_5 = f16[16,16]{1,0} dot(f16[16,51]{1,0} tmp_1, f16[51,16]{0,1} tmp_4), lhs_contracting_dims={1}, rhs_contracting_dims={0}\n ROOT tmp_6 = f16[1,16,16]{2,1,0} bitcast(f16[16,16]{1,0} tmp_5)\n}"
  result {
    run_time {
      nanos: 31744
    }
    triton {
      block_m: 32
      block_n: 32
      block_k: 32
      split_k: 1
      num_stages: 1
      num_warps: 4
    }
  }
}

The autotuning database can be serialized using XLA_FLAGS=--xla_gpu_dump_autotune_results_to=<myfile.pbtxt>
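For example (a sketch mirroring the dump invocation at the top of this page; the program name is a placeholder):

$ XLA_FLAGS=--xla_gpu_dump_autotune_results_to=/tmp/autotune_results.pbtxt ./myprogram-entry-point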

[hlo-opt] HLO Pass Development And Debugging

# If you are working with hardware-independent passes from the
# `xla/hlo/transforms/` directory, prefer the light-weight version
# of the `hlo-opt` tool with fewer dependencies:
bazel run //xla/hlo/tools:hlo-opt -- [flags] <filename>
# Otherwise, for hardware-independent and CPU/GPU passes, use
# the same binary from the "Compile HLO Module" section above:
bazel run //xla/tools:hlo-opt -- [flags] <filename>

The hlo-opt tool allows execution of individual passes independently of the platform compilation stages. This isolation helps to quickly run passes on an input HLO module and pinpoint the root cause of failures.

hlo-opt --passes=schedule-aware-collective-cse input.hlo

The hlo-opt tool also supports DebugOptions XLA_FLAGS:

hlo-opt --passes=schedule-aware-collective-cse \
  --xla_gpu_experimental_collective_cse_distance_threshold=20 input.hlo

Use the --list-passes option to get the pass name strings.

hlo-opt --list-passes

Users can create their own custom pipeline by specifying more than one pass in the --passes option.

hlo-opt --passes=pass1,pass2,pass3 input.hlo

Assist New HLO Pass Development

  1. First, write your pass.
  2. Register the new pass to the hlo-opt tool pass registry (see the C++ sketch after this list).

    RegisterPass<FooPass>(FooPassInputOptions)
    

    Based on the pass type, choose one of the following locations for registration:

    * opt_lib.cc: hardware-independent passes.
    * cpu_opt.cc: CPU-specific passes.
    * gpu_opt.cc: GPU-specific passes.
    * compiled_opt.cc: passes common to CPU, GPU, and XPU.

    Don't forget to add the build dependency.

    Include pass registration as part of your PR (example) so that the pass will be available to all hlo-opt users.

  3. Rebuild the hlo-opt tool, validate successful pass registration using the --list-passes option, and then use the --passes option to run the pass.

    $ hlo-opt --passes=foo-pass input.hlo
    
  4. Writing unit tests for the pass? Refer to https://openxla.org/xla/test_hlo_passes for more details.
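
A trivial pass skeleton compatible with the registration call in step 2 might look like the following C++ sketch. FooPass is the placeholder name from step 2; the header paths and exact interface should be checked against the hlo-opt sources for your XLA version.

// A no-op HLO pass sketched against the usual HloModulePass interface.
// Header paths may differ between XLA versions.
#include "absl/container/flat_hash_set.h"
#include "absl/status/statusor.h"
#include "absl/strings/string_view.h"
#include "xla/hlo/ir/hlo_module.h"
#include "xla/hlo/pass/hlo_pass_interface.h"

namespace xla {

class FooPass : public HloModulePass {
 public:
  // The name reported by --list-passes and used with --passes.
  absl::string_view name() const override { return "foo-pass"; }

  using HloPassInterface::Run;
  // Returns true iff the pass changed the module.
  absl::StatusOr<bool> Run(
      HloModule* module,
      const absl::flat_hash_set<absl::string_view>& execution_threads)
      override {
    return false;  // This skeleton performs no rewrites.
  }
};

}  // namespace xla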

Pass Runtime Measurement

For large models, full compilation runs can take up to a few minutes, making it challenging to detect subtle performance regressions. In contrast, individual pass runs using hlo-opt allow for precise performance measurement and easy detection of even small increases in execution time caused by new code changes.

time hlo-opt --passes=reduce-window-rewriter,scatter_simplifier \
  --xla_reduce_window_rewrite_base_length=128 input.hlo

[hlo-opt] Convert HLO Module Formats

# Use the light-weight version of the `hlo-opt` tool.
bazel run //xla/hlo/tools:hlo-opt -- [flags] <filename>

Convert HLO Text -> HLO Proto

hlo-opt --emit-proto input.hlo

Convert HLO Proto or HLO Proto Binary -> HLO Text

hlo-opt input.pbtxt
hlo-opt input.pb

[ptx-opt] Compile LLVM Module Down to PTX

The tool runs the LLVM IR optimization pipeline and then calls CompileToPtx.

bazel run //xla/hlo/tools/ptx-opt -- --arch=9.0 <filename>

The tool can also dump LLVM IR after every pass.

bazel run //xla/hlo/tools/ptx-opt -- --arch=9.0 --xla_dump_to=<path> --xla_gpu_dump_llvmir <filename>
