Name	Name	Last commit message	Last commit date
Latest commit History 9 Commits
src	src
LICENSE.md	LICENSE.md
PTXprofiler.sln	PTXprofiler.sln
PTXprofiler.vcxproj	PTXprofiler.vcxproj
README.md	README.md
make.sh	make.sh

Name

Last commit message

Last commit date

Latest commit

History

src

PTXprofiler

A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis.

How to compile?

on Windows: compile with Visual Studio Community
on Linux: run chmod +x make.sh and ./make.sh path/to/kernel.ptx

How to use?

Generate a .ptx file from your application; this works only with an Nvidia GPU. With the OpenCL-Wrapper, you can simply uncomment #define PTX in src/opencl.hpp and compile and run. A file kernel.ptx is created, containing the PTX assembly code.
Run bin/PTXprofiler.exe path/to/kernel.ptx. For FluidX3D for example, this table is generated:

kernel name |flops (float int bit )|copy |branch|cache (load store)|memory (load cached store)
--------------------------------|---------------------------|------|------|--------------------|---------------------------
initialize | 283 129 61 93| 33| 6| 0 0 0| 135 35 0 100
stream_collide | 363 261 35 67| 23| 2| 0 0 0| 153 77 0 76
update_fields | 160 56 37 67| 21| 2| 0 0 0| 93 77 0 16
voxelize_mesh | 170 91 34 45| 40| 11| 84 48 36| 37 36 0 1
transfer_extract_fi | 460 0 221 239| 122| 63| 0 0 0| 180 80 20 80
transfer__insert_fi | 483 0 247 236| 115| 47| 0 0 0| 180 80 20 80
transfer_extract_rho_u_flags | 47 0 39 8| 23| 1| 0 0 0| 68 34 0 34
transfer__insert_rho_u_flags | 47 0 39 8| 23| 1| 0 0 0| 68 34 0 34

For each OpenCL/CUDA kernel, instructions are counted and listed:
- GPUs compute floating-point, integer and bit manipulation operations on the same ALUs, so they are counted combined as flops, but also listed separately as float, int and bit.
- Data movement operations are listed under copy.
- Branches are listed under branch.
- Total shared/local memory (L1 cache) accesses in Byte are listed under cache, with separate counters for load and store.
- Total global memory (VRAM) accesses in Byte are listed under memory, with separate counters for load, cached (load from VRAM or L2 cache) and store.
You can use the counted flops and memory accesses, together with the measured execution time of the kernel, to place it in a roofline model diagram.

Limitations

Matrix/tensor operations are not yet supported.
Non-unrolled loops are only counted for one iteration, but may be executed multiple times, duplicating the number of actually executed instructions inside the loop.

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ProjectPhysX/PTXprofiler

Folders and files

Latest commit

History

Repository files navigation

PTXprofiler

How to compile?

How to use?

Limitations

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Uh oh!

Contributors

Uh oh!

Languages