Fixes for building with LLVM / XL OpenMP offload #706
Conversation
bbpbuildbot
commented
Dec 10, 2021
Logfiles from GitLab pipeline #28712 (:no_entry:) have been uploaded here!
Status and direct links:
pramodk
commented
Dec 10, 2021
On Ascent @ ORNL:
```
module load nvhpc/21.9 python/3.7.0 cmake flex bison
module swap cuda/10.1.243 cuda/11.0.2
module use /ccsopen/proj/gen170/neuron/spack_modules/linux-rhel7-power9le
module load caliper ninja py-pytest py-sympy py-jinja2 py-pyyaml boost
export NVLOCALRC=/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/localrc
export PATH=/sw/ascent/gcc/6.4.0/bin:$PATH
module load xl/16.1.1-10
```
Configure XL for newer GCC:
```
xlc_configure -gcc /sw/ascent/gcc/10.2.0 -o /ccsopen/proj/gen170/neuron/nersc-gpu-hackaxlc_gcc10.cfg -cuda /sw/ascent/cuda/11.0.2 -cudaVersion 11.0 -cuda_cc_major 7 -cuda_cc_minor 0
```
And build with:
```
cmake .. -DCORENRN_ENABLE_CALIPER_PROFILING=ON -DCORENRN_ENABLE_GPU=ON -DCORENRN_ENABLE_NMODL=ON \
  -DCMAKE_INSTALL_PREFIX=../install -DCMAKE_CXX_FLAGS="-DR123_USE_SSE=0" -DCMAKE_CUDA_ARCHITECTURES=70 \
  -DCMAKE_CUDA_COMPILER=nvcc -DCORENRN_EXTERNAL_BENCHMARK_DATA=/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/ \
  -DCMAKE_CXX_FLAGS="" -DCORENRN_ENABLE_UNIT_TESTS=OFF -DCMAKE_CXX_COMPILER=xlc++_r \
  -DCORENRN_NMODL_DIR=/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/users/kumbhar/nmodl/build/install \
  -DCORENRN_ENABLE_MPI=ON
```
Force-pushed 7cb13a1 to a8958d1
bbpbuildbot
commented
Dec 10, 2021
Logfiles from GitLab pipeline #28718 (:no_entry:) have been uploaded here!
Status and direct links:
Force-pushed b468870 to affbc25
bbpbuildbot
commented
Dec 10, 2021
Logfiles from GitLab pipeline #28723 (:no_entry:) have been uploaded here!
Status and direct links:
Force-pushed affbc25 to 9cd7ace
olupton
commented
Dec 17, 2021
I rebased this and tried to resolve the conflicts fairly blindly.
bbpbuildbot
commented
Dec 17, 2021
Logfiles from GitLab pipeline #29782 (:no_entry:) have been uploaded here!
Status and direct links:
- ⛔ build:coreneuron+nmodl:intel
- ⛔ build:coreneuron:intel
- ⛔ build:coreneuron+nmodl:gpu
- ⛔ build:coreneuron+nmodl~openmp:gpu
- ⛔ build:coreneuron:gpu
- ⏩ test:coreneuron+nmodl:intel
- ⏩ test:coreneuron:intel
- ⏩ test:coreneuron+nmodl:gpu
- ⏩ test:coreneuron+nmodl~openmp:gpu
- ⏩ test:coreneuron:gpu
- ⏩ build:neuron+nmodl:intel
- ⏩ build:neuron:intel
- ⏩ build:neuron+nmodl:gpu
- ⏩ build:neuron+nmodl~openmp:gpu
- ⏩ build:neuron:gpu
- ⏩ test:neuron+nmodl:intel
- ⏩ test:neuron:intel
- ⏩ test:neuron+nmodl:gpu
- ⏩ test:neuron+nmodl~openmp:gpu
- ⏩ test:neuron:gpu
Force-pushed 9cd7ace to dc9632a
bbpbuildbot
commented
Dec 23, 2021
Logfiles from GitLab pipeline #30397 (:no_entry:) have been uploaded here!
Status and direct links:
- ⏩ build:coreneuron+nmodl:intel
- ⛔ build:coreneuron:intel
- ⏩ build:coreneuron+nmodl:gpu
- ⏩ build:coreneuron+nmodl~openmp:gpu
- ⛔ build:coreneuron:gpu
- ⏩ test:coreneuron+nmodl:intel
- ⏩ test:coreneuron:intel
- ⏩ test:coreneuron+nmodl:gpu
- ⏩ test:coreneuron+nmodl~openmp:gpu
- ⏩ test:coreneuron:gpu
- ⏩ build:neuron+nmodl:intel
- ⏩ build:neuron:intel
- ⏩ build:neuron+nmodl:gpu
- ⏩ build:neuron+nmodl~openmp:gpu
- ⏩ build:neuron:gpu
- ⏩ test:neuron+nmodl:intel
- ⏩ test:neuron:intel
- ⏩ test:neuron+nmodl:gpu
- ⏩ test:neuron+nmodl~openmp:gpu
- ⏩ test:neuron:gpu
... working

TODO:
- temporary changes to OpenAccHelper.cmake, needs refinement
- see caliper linking issue
- _OPENACC needs to be renamed CORENRN_ENABLE_GPU so that OpenMP-based builds can use GPU offload
- hardcoded CXX flags for quick build
Force-pushed dc9632a to 439c560
bbpbuildbot
commented
Dec 31, 2021
Logfiles from GitLab pipeline #30552 (:no_entry:) have been uploaded here!
Status and direct links:
- ⏩ build:coreneuron+nmodl:intel
- ⛔ build:coreneuron:intel
- ⏩ build:coreneuron+nmodl:gpu
- ⏩ build:coreneuron+nmodl~openmp:gpu
- ⛔ build:coreneuron:gpu
- ⏩ test:coreneuron+nmodl:intel
- ⏩ test:coreneuron:intel
- ⏩ test:coreneuron+nmodl:gpu
- ⏩ test:coreneuron+nmodl~openmp:gpu
- ⏩ test:coreneuron:gpu
- ⏩ build:neuron+nmodl:intel
- ⏩ build:neuron:intel
- ⏩ build:neuron+nmodl:gpu
- ⏩ build:neuron+nmodl~openmp:gpu
- ⏩ build:neuron:gpu
- ⏩ test:neuron+nmodl:intel
- ⏩ test:neuron:intel
- ⏩ test:neuron+nmodl:gpu
- ⏩ test:neuron+nmodl~openmp:gpu
- ⏩ test:neuron:gpu
* The CMAKE_EXE_LINKER_FLAGS setting was also being applied to NMODL when NMODL is built from the submodule.
* With LLVM OpenMP offload, if a binary is linked from an object file that was not compiled with the OpenMP offload flags, it causes a link error:

```
echo "int main() { return 0; } " > foo.cpp
clang++ -c foo.cpp  # adding the OpenMP flags here avoids the error below
clang++ foo.o -o foo -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda
nvlink fatal   : Could not open input file '/tmp/foo-0b9a1a.cubin'
clang-12: error: nvlink command failed with exit code 1 (use -v to see invocation)
```
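For reference, compiling the object with the same offload flags makes the two-step build link cleanly, because the object file then embeds the device-code part that nvlink looks for (a sketch only; assumes a Clang toolchain built with NVPTX offload support):

```
echo "int main() { return 0; }" > foo.cpp
# compile WITH the offload flags so foo.o carries the embedded device code
clang++ -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -c foo.cpp
# linking with the same flags now succeeds
clang++ foo.o -o foo -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda
```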
bbpbuildbot
commented
Jan 3, 2022
Logfiles from GitLab pipeline #30663 (:no_entry:) have been uploaded here!
Status and direct links:
- ⏩ build:coreneuron+nmodl:intel
- ⛔ build:coreneuron:intel
- ⏩ build:coreneuron+nmodl:gpu
- ⏩ build:coreneuron+nmodl~openmp:gpu
- ⛔ build:coreneuron:gpu
- ⏩ test:coreneuron+nmodl:intel
- ⏩ test:coreneuron:intel
- ⏩ test:coreneuron+nmodl:gpu
- ⏩ test:coreneuron+nmodl~openmp:gpu
- ⏩ test:coreneuron:gpu
- ⏩ build:neuron+nmodl:intel
- ⏩ build:neuron:intel
- ⏩ build:neuron+nmodl:gpu
- ⏩ build:neuron+nmodl~openmp:gpu
- ⏩ build:neuron:gpu
- ⏩ test:neuron+nmodl:intel
- ⏩ test:neuron:intel
- ⏩ test:neuron+nmodl:gpu
- ⏩ test:neuron+nmodl~openmp:gpu
- ⏩ test:neuron:gpu
pramodk
commented
Jan 4, 2022
Here are some notes from various experiments and debugging attempts to get OpenMP offload working with LLVM v13.0.
- On a Slack thread, Olli reminded me of a small reproducer from the NERSC Hackathon showing an issue with global variables and a static library:
```
[olupton@r2i3n6 build_olli_gpu]$ cat test.sh
CXX=clang++
CXXFLAGS="-fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -g"
${CXX} ${CXXFLAGS} -c test.cpp
ar cq libtest.a test.o
${CXX} ${CXXFLAGS} -o test1 main.cpp -L. -ltest
${CXX} ${CXXFLAGS} -o test2 main.cpp test.o
[olupton@r2i3n6 build_olli_gpu]$ cat test.cpp
#pragma omp declare target
int y;
#pragma omp end declare target
int test() {
  y = 24;
#pragma omp target update to(y)
  y = 42;
  int x;
#pragma omp target map(from:x)
  { x = y; }
  return x;
}
[olupton@r2i3n6 build_olli_gpu]$ cat main.cpp
extern int test();
int main() { return test(); }
```
A small reproducer for the problem I am seeing now: going via the static library seems to cause problems 🤷
```
[olupton@r2i3n6 build_olli_gpu]$ ./test1
CUDA error: Loading global 'y' Failed
CUDA error: named symbol not found
Libomptarget error: Unable to generate entries table for device id 0.
Libomptarget error: Failed to init globals on device 0
Libomptarget error: Run with LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
test.cpp:7:3: Libomptarget fatal error 1: failure of target construct while offloading is mandatory
Aborted
[olupton@r2i3n6 build_olli_gpu]$ ./test2; echo $?
24
```
- So I switched to a shared library in CoreNEURON + nrnivmodl-core, i.e. built libcorenrnmech as a shared library. This gave undefined-symbol errors for global variables and functions defined in libcoreneuron.a:
```
nvlink error : Undefined reference to '_ZN10coreneuron7celsiusE' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/_dimplic-ba4873.cubin'
nvlink error : Undefined reference to '_ZN10coreneuron2piE' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/_dimplic-ba4873.cubin'
nvlink error : Undefined reference to '_ZN10coreneuron11secondorderE' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/_dimplic-ba4873.cubin'
nvlink error : Undefined reference to '_ZN10coreneuron7at_timeEPNS_9NrnThreadEd' in
```
See also Olli's comment in NERSC GPU Hackathon Slack:
With the Clang/OpenMP build + shared libraries instead of static, I avoided some link errors by removing the annotations in the header
```
#pragma omp declare target
extern double celsius;
#pragma omp end declare target
```
and the remaining one (at_time function) by making its body available. But now I have a new segfault at startup:
```
Program received signal SIGSEGV, Segmentation fault.
0x00007fffed92b801 in RTLsTy::RegisterRequires(long) ()
   from /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/llvm-13.0.0-rrwbmv/lib/libomptarget.so
(gdb) bt
#0  0x00007fffed92b801 in RTLsTy::RegisterRequires(long) ()
   from /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/llvm-13.0.0-rrwbmv/lib/libomptarget.so
#1  0x00007fffed92824e in __tgt_register_requires ()
   from /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/llvm-13.0.0-rrwbmv/lib/libomptarget.so
#2  0x00007fffed8f29c3 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#3  0x00007fffed8e417a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
```
Still investigating...
- Then I avoided building the libcorenrnmech library and tried to use the mechanism object files directly to create special-core (by modifying nrnivmodl-core-makefile). This didn't go far: I saw similar linking errors for global symbols in the libcoreneuron library:
```
=> Binary creating x86_64/special-core
/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/llvm-13.0.0-lvcrm6/bin/clang++ -fopenmp=libomp -std=c++14 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Wno-unknown-cuda-version -I/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/include -Wl,--as-needed -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Wno-unknown-cuda-version -I/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/include -o x86_64/special-core /gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/build/share/coreneuron/coreneuron.cpp \
  -I/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/build/include \
  x86_64/corenrn/build/_mod_func.o x86_64/corenrn/build/_dimplic.o x86_64/corenrn/build/exp2syn.o x86_64/corenrn/build/expsyn.o x86_64/corenrn/build/halfgap.o x86_64/corenrn/build/hh.o x86_64/corenrn/build/netstim.o x86_64/corenrn/build/passive.o x86_64/corenrn/build/pattern.o x86_64/corenrn/build/stim.o x86_64/corenrn/build/svclmp.o x86_64/corenrn/build/enginemech.o
nvlink error : Undefined reference to '_ZN10coreneuron7celsiusE' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/_dimplic-ba4873.cubin'
nvlink error : Undefined reference to '_ZN10coreneuron2piE' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/_dimplic-ba4873.cubin'
nvlink error : Undefined reference to '_ZN10coreneuron11secondorderE' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/_dimplic-ba4873.cubin'
nvlink error : Undefined reference to '_ZN10coreneuron7at_timeEPNS_9NrnThreadEd' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/svclmp-852313.cubin'
```
- Next, I wanted to avoid building any intermediate libraries, i.e. create the special-core binary from all object files directly. For this, I took the mechanism cpp files generated by nrnivmodl-core and added them into the coreneuron source tree (temporarily). Then I updated CMake to build special-core directly, i.e. something like (partial diff):
```diff
+++ b/coreneuron/CMakeLists.txt
@@ -47,6 +47,13 @@ list(APPEND CORENEURON_CODE_FILES ${PROJECT_BINARY_DIR}/coreneuron/config/config
 set(DIMPLIC_CODE_FILE "mechanism/mech/dimplic.cpp")
 set(ENGINEMECH_CODE_FILE "mechanism/mech/enginemech.cpp")
+file(GLOB CORENEURON_SPECIAL_CORE_FILES "exe/*.cpp")
+list(APPEND CORENEURON_SPECIAL_CORE_FILES "${CMAKE_CURRENT_SOURCE_DIR}/apps/coreneuron.cpp")
+list(APPEND CORENEURON_SPECIAL_CORE_FILES "${DIMPLIC_CODE_FILE}")
+list(APPEND CORENEURON_SPECIAL_CORE_FILES "${ENGINEMECH_CODE_FILE}")
+
 # for external mod files we need to generate modl_ref function in mod_func.c
 set(MODFUNC_PERL_SCRIPT "mechanism/mech/mod_func.c.pl")
@@ -184,6 +191,8 @@ if(CORENRN_ENABLE_MPI AND NOT CORENRN_ENABLE_MPI_DYNAMIC)
   set(CORENRN_MPI_OBJ $<TARGET_OBJECTS:${CORENRN_MPI_LIB_NAME}>)
 endif()
+set(COMPILE_LIBRARY_TYPE OBJECT)
+
 # main coreneuron library
 add_library(
   coreneuron
@@ -319,7 +328,7 @@ add_custom_command(
     "${modfile_directory}"
   WORKING_DIRECTORY ${CMAKE_BINARY_DIR}/bin
   COMMENT "Running nrnivmodl-core with halfgap.mod")
-add_custom_target(nrniv-core ALL DEPENDS ${output_binaries})
+#add_custom_target(nrniv-core ALL DEPENDS ${output_binaries})
 include_directories(${CORENEURON_PROJECT_SOURCE_DIR})
@@ -358,6 +367,12 @@ configure_file("utils/profile/profiler_interface.h"
 # main program required for building special-core
 file(COPY apps/coreneuron.cpp DESTINATION ${CMAKE_BINARY_DIR}/share/coreneuron)
+add_executable(special-core-gpu ${CORENEURON_SPECIAL_CORE_FILES})
+target_compile_options(special-core-gpu BEFORE PRIVATE $<$<COMPILE_LANGUAGE:CXX>:${CORENRN_ACC_FLAGS}>)
+target_compile_definitions(special-core-gpu PUBLIC -DADDITIONAL_MECHS)
+target_link_libraries(special-core-gpu coreneuron ${CMAKE_DL_LIBS})
```
This created the executable from all object files:
[ 87%] Linking CXX executable ../bin/special-core-gpu cd /gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/build/coreneuron && /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cmake-3.21.4-cdyb7k/bin/cmake -E cmake_link_script CMakeFiles/special-core-gpu.dir/link.txt --verbose=1 /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/llvm-13.0.0-klsplt/bin/clang++ -fopenmp=libomp -Wl,--as-needed -fopenmp=libomp -fopenmp-targets=nvptx64-nvidia-cuda -Wno-unknown-cuda-version -I/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/include CMakeFiles/special-core-gpu.dir/exe/_mod_func.cpp.o CMakeFiles/special-core-gpu.dir/exe/exp2syn.cpp.o CMakeFiles/special-core-gpu.dir/exe/expsyn.cpp.o CMakeFiles/special-core-gpu.dir/exe/halfgap.cpp.o CMakeFiles/special-core-gpu.dir/exe/hh.cpp.o CMakeFiles/special-core-gpu.dir/exe/netstim.cpp.o CMakeFiles/special-core-gpu.dir/exe/passive.cpp.o CMakeFiles/special-core-gpu.dir/exe/pattern.cpp.o CMakeFiles/special-core-gpu.dir/exe/stim.cpp.o CMakeFiles/special-core-gpu.dir/exe/svclmp.cpp.o CMakeFiles/special-core-gpu.dir/apps/coreneuron.cpp.o CMakeFiles/special-core-gpu.dir/mechanism/mech/dimplic.cpp.o CMakeFiles/special-core-gpu.dir/mechanism/mech/enginemech.cpp.o CMakeFiles/coreneuron.dir/apps/corenrn_parameters.cpp.o CMakeFiles/coreneuron.dir/apps/main1.cpp.o CMakeFiles/coreneuron.dir/gpu/nrn_acc_manager.cpp.o CMakeFiles/coreneuron.dir/io/core2nrn_data_return.cpp.o CMakeFiles/coreneuron.dir/io/file_utils.cpp.o CMakeFiles/coreneuron.dir/io/global_vars.cpp.o CMakeFiles/coreneuron.dir/io/lfp.cpp.o CMakeFiles/coreneuron.dir/io/mech_report.cpp.o CMakeFiles/coreneuron.dir/io/mem_layout_util.cpp.o CMakeFiles/coreneuron.dir/io/mk_mech.cpp.o CMakeFiles/coreneuron.dir/io/nrn2core_data_init.cpp.o CMakeFiles/coreneuron.dir/io/nrn_checkpoint.cpp.o CMakeFiles/coreneuron.dir/io/nrn_filehandler.cpp.o 
CMakeFiles/coreneuron.dir/io/nrn_setup.cpp.o CMakeFiles/coreneuron.dir/io/output_spikes.cpp.o CMakeFiles/coreneuron.dir/io/phase1.cpp.o CMakeFiles/coreneuron.dir/io/phase2.cpp.o CMakeFiles/coreneuron.dir/io/prcellstate.cpp.o CMakeFiles/coreneuron.dir/io/reports/binary_report_handler.cpp.o CMakeFiles/coreneuron.dir/io/reports/nrnreport.cpp.o CMakeFiles/coreneuron.dir/io/reports/report_configuration_parser.cpp.o CMakeFiles/coreneuron.dir/io/reports/report_event.cpp.o CMakeFiles/coreneuron.dir/io/reports/report_handler.cpp.o CMakeFiles/coreneuron.dir/io/reports/sonata_report_handler.cpp.o CMakeFiles/coreneuron.dir/io/setup_fornetcon.cpp.o CMakeFiles/coreneuron.dir/mechanism/capac.cpp.o CMakeFiles/coreneuron.dir/mechanism/eion.cpp.o CMakeFiles/coreneuron.dir/mechanism/mech_mapping.cpp.o CMakeFiles/coreneuron.dir/mechanism/patternstim.cpp.o CMakeFiles/coreneuron.dir/mechanism/register_mech.cpp.o CMakeFiles/coreneuron.dir/network/cvodestb.cpp.o CMakeFiles/coreneuron.dir/network/multisend.cpp.o CMakeFiles/coreneuron.dir/network/multisend_setup.cpp.o CMakeFiles/coreneuron.dir/network/netcvode.cpp.o CMakeFiles/coreneuron.dir/network/netpar.cpp.o CMakeFiles/coreneuron.dir/network/partrans.cpp.o CMakeFiles/coreneuron.dir/network/partrans_setup.cpp.o CMakeFiles/coreneuron.dir/network/tqueue.cpp.o CMakeFiles/coreneuron.dir/permute/balance.cpp.o CMakeFiles/coreneuron.dir/permute/cellorder.cpp.o CMakeFiles/coreneuron.dir/permute/cellorder1.cpp.o CMakeFiles/coreneuron.dir/permute/cellorder2.cpp.o CMakeFiles/coreneuron.dir/permute/data_layout.cpp.o CMakeFiles/coreneuron.dir/permute/node_permute.cpp.o CMakeFiles/coreneuron.dir/sim/fadvance_core.cpp.o CMakeFiles/coreneuron.dir/sim/fast_imem.cpp.o CMakeFiles/coreneuron.dir/sim/finitialize.cpp.o CMakeFiles/coreneuron.dir/sim/multicore.cpp.o CMakeFiles/coreneuron.dir/sim/solve_core.cpp.o CMakeFiles/coreneuron.dir/sim/treeset_core.cpp.o CMakeFiles/coreneuron.dir/utils/ispc/globals.cpp.o CMakeFiles/coreneuron.dir/utils/ivocvect.cpp.o 
CMakeFiles/coreneuron.dir/utils/lpt.cpp.o CMakeFiles/coreneuron.dir/utils/memory.cpp.o CMakeFiles/coreneuron.dir/utils/memory_utils.cpp.o CMakeFiles/coreneuron.dir/utils/nrn_stats.cpp.o CMakeFiles/coreneuron.dir/utils/nrnoc_aux.cpp.o CMakeFiles/coreneuron.dir/utils/nrntimeout.cpp.o CMakeFiles/coreneuron.dir/utils/progressbar/progressbar.cpp.o CMakeFiles/coreneuron.dir/utils/randoms/nrnran123.cpp.o CMakeFiles/coreneuron.dir/utils/string_utils.cpp.o CMakeFiles/coreneuron.dir/utils/utils.cpp.o CMakeFiles/coreneuron.dir/utils/vrecord.cpp.o CMakeFiles/coreneuron.dir/config/config.cpp.o CMakeFiles/coreneuron.dir/mpi/core/nrnmpi_def_cinc.cpp.o CMakeFiles/coreneuron.dir/mpi/core/nrnmpi.cpp.o CMakeFiles/coreneuron.dir/mpi/core/nrnmpidec.cpp.o -o ../bin/special-core-gpu -Wl,-rpath,/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/lib64 -ldl /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/lib64/libcudart.so
Unfortunately, this still gives a similar cryptic error at launch, for both CPU and GPU execution:

```
kumbhar@r2i3n6:~/workarena/systems/bbpv/repos/bbp/coreneuron/build$ ./bin/special-core-gpu -e 1 -d ../tests/integration/ring
 Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
 Version : 1.0 4de7951f (2022-01-04 16:29:44 +0100)
 CUDA error: Error returned from cuModuleLoadDataEx
 CUDA error: out of memory
 Libomptarget error: Unable to generate entries table for device id 0.
 Libomptarget error: Failed to init globals on device 0
 Libomptarget error: Run with LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
 Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
 Libomptarget fatal error 1: failure of target construct while offloading is mandatory
Aborted
kumbhar@r2i3n6:~/workarena/systems/bbpv/repos/bbp/coreneuron/build$ ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu
 Info : 4 GPUs shared by 1 ranks per node
 Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
 Version : 1.0 4de7951f (2022-01-04 16:29:44 +0100)
 CUDA error: Error returned from cuModuleLoadDataEx
 CUDA error: out of memory
 Libomptarget error: Unable to generate entries table for device id 0.
 Libomptarget error: Failed to init globals on device 0
 Libomptarget error: Run with LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
 Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
 Libomptarget fatal error 1: failure of target construct while offloading is mandatory
Aborted
```
I saw one such error reported here but that doesn't seem relevant.
- Next, I wanted to incrementally enable the OpenMP offload parts to find out what causes the above error. The first thing I did was disable the OpenMP offload parts in the mechanism cpp files, i.e. simply something like:
```
cd coreneuron/exe
# this directory contains exp2syn.cpp expsyn.cpp halfgap.cpp hh.cpp _kinderiv.h
# _mod_func.cpp netstim.cpp passive.cpp pattern.cpp stim.cpp svclmp.cpp
sed -i 's#nrn_pragma_omp#//nrn_pragma_omp#g' *.cpp
```
After rebuilding, I saw that the special-core-gpu binary we build runs fine on both CPU and GPU!
So the issue seems to be somehow related to the mechanism cpp files:
$ nvprof ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu ==112850== NVPROF is profiling process 112850, command: ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu Info : 4 GPUs shared by 1 ranks per node Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020 Version : 1.0 4de7951f (2022年01月04日 16:29:44 +0100) Additional mechanisms from files exp2syn.mod expsyn.mod halfgap.mod hh.mod netstim.mod passive.mod pattern.mod stim.mod svclmp.mod Exp2Syn Reg Memory (MBs) : After mk_mech : Max 298.0469, Min 298.0469, Avg 298.0469 Memory (MBs) : After MPI_Init : Max 296.1094, Min 296.1094, Avg 296.1094 Memory (MBs) : Before nrn_setup : Max 296.1406, Min 296.1406, Avg 296.1406 WARNING : GPU execution requires --cell-permute type 1 or 2. Setting it to 1. Setup Done : 0.00 seconds Model size : 84.19 kB Memory (MBs) : After nrn_setup : Max 296.4258, Min 296.4258, Avg 296.4258 GENERAL PARAMETERS --mpi=false --mpi-lib= --gpu=true --dt=0.025 --tstop=1 GPU --nwarp=65536 --cell-permute=0 --cuda-interface=false INPUT PARAMETERS --voltage=-65 --seed=-1 --datpath=../tests/integration/ring --filesdat=files.dat --pattern= --report-conf= --restore= PARALLEL COMPUTATION PARAMETERS --threading=false --skip_mpi_finalize=false SPIKE EXCHANGE --ms_phases=2 --ms_subintervals=2 --multisend=false --spk_compress=0 --binqueue=false CONFIGURATION --spikebuf=100000 --prcellgid=-1 --forwardskip=0 --celsius=6.3 --mindelay=1 --report-buffer-size=4 OUTPUT PARAMETERS --dt_io=0.1 --outpath=. --checkpoint= Start time (t) = 0 Memory (MBs) : After mk_spikevec_buffer : Max 296.4258, Min 296.4258, Avg 296.4258 .... 
Memory (MBs) : After nrn_finitialize : Max 301.5273, Min 301.5273, Avg 301.5273 psolve |=========================================================| t: 1.00 ETA: 0h00m00s Solver Time : 0.24223 Simulation Statistics Number of cells: 20 Number of compartments: 804 Number of presyns: 21 Number of input presyns: 0 Number of synapses: 21 Number of point processes: 41 Number of transfer sources: 0 Number of transfer targets: 0 Number of spikes: 40 Number of spikes with non negative gid-s: 40 ==112850== Profiling application: ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu ==112850== Profiling result: Type Time(%) Time Calls Avg Min Max Name GPU activities: 11.68% 20.892ms 160 130.57us 126.98us 134.59us __omp_offloading_2f_188133fe__ZN10coreneuron11nrn_cur_ionEPNS_9NrnThreadEPNS_9Memb_listEi_l273 9.64% 17.250ms 80 215.63us 186.62us 222.72us __omp_offloading_2f_2979dc5__ZN10coreneuron12nrn_state_hhEPNS_9NrnThreadEPNS_9Memb_listEi_l520 8.98% 16.057ms 160 100.35us 99.007us 101.92us __omp_offloading_2f_12055abf__ZN10coreneuron22net_buf_receive_ExpSynEPNS_9NrnThreadE_l290 7.72% 13.816ms 80 172.70us 169.18us 174.59us __omp_offloading_2f_2979dc5__ZN10coreneuron10nrn_cur_hhEPNS_9NrnThreadEPNS_9Memb_listEi_l472 6.87% 12.293ms 80 153.66us 152.13us 154.94us __omp_offloading_2f_12055abf__ZN10coreneuron14nrn_cur_ExpSynEPNS_9NrnThreadEPNS_9Memb_listEi_l383 6.82% 12.200ms 80 152.50us 150.37us 154.30us __omp_offloading_2f_3962f8e__ZN10coreneuron11nrn_cur_pasEPNS_9NrnThreadEPNS_9Memb_listEi_l276 6.07% 10.865ms 80 135.81us 135.23us 136.58us __omp_offloading_2f_12055abf__ZN10coreneuron16nrn_state_ExpSynEPNS_9NrnThreadEPNS_9Memb_listEi_l426 6.05% 10.826ms 80 135.33us 132.64us 136.80us __omp_offloading_2f_191507f3__ZN10coreneuronL7nrn_lhsEPNS_9NrnThreadE_l166 6.04% 10.806ms 80 135.07us 133.41us 137.73us __omp_offloading_2f_191507f3__ZN10coreneuronL7nrn_rhsEPNS_9NrnThreadE_l96 6.01% 10.746ms 80 134.32us 133.73us 135.33us 
__omp_offloading_2f_198f6a66__ZN10coreneuron21nrn_jacob_capacitanceEPNS_9NrnThreadEPNS_9Memb_listEi_l74 5.94% 10.621ms 80 132.76us 132.35us 136.48us __omp_offloading_2f_191507c2__ZN10coreneuron8NetCvode12check_threshEPNS_9NrnThreadE_l541 5.92% 10.589ms 80 132.36us 131.52us 133.31us __omp_offloading_2f_198f6a66__ZN10coreneuron19nrn_cur_capacitanceEPNS_9NrnThreadEPNS_9Memb_listEi_l120 5.84% 10.446ms 80 130.57us 127.07us 131.42us __omp_offloading_2f_191507f3__ZN10coreneuronL7nrn_rhsEPNS_9NrnThreadE_l37 5.79% 10.358ms 80 129.48us 128.90us 132.19us __omp_offloading_2f_1d8e98f0__ZN10coreneuron6updateEPNS_9NrnThreadE_l217 0.33% 594.30us 6 99.050us 2.4320us 446.46us [CUDA memset] 0.18% 318.05us 246 1.2920us 1.2470us 1.7920us [CUDA memcpy HtoD] 0.10% 185.74us 125 1.4850us 1.3750us 2.9440us [CUDA memcpy DtoH] API calls: 38.95% 214.73ms 1 214.73ms 214.73ms 214.73ms cuDevicePrimaryCtxRetain 38.43% 211.84ms 1450 146.09us 756ns 275.66us cuStreamSynchronize 9.61% 52.977ms 1 52.977ms 52.977ms 52.977ms cuModuleLoadDataEx 4.59% 25.275ms 1 25.275ms 25.275ms 25.275ms cuModuleUnload 3.69% 20.328ms 6 3.3880ms 8.9020us 20.278ms cudaMallocManaged 2.10% 11.594ms 84 138.03us 11.220us 146.78us cuMemcpyDtoHAsync 1.17% 6.4369ms 1280 5.0280us 4.2990us 17.611us cuLaunchKernel 0.30% 1.6383ms 407 4.0250us 132ns 194.16us cuDeviceGetAttribute 0.28% 1.5530ms 4 388.24us 385.38us 395.38us cuDeviceTotalMem 0.26% 1.4583ms 1619 900ns 161ns 581.21us cuCtxSetCurrent 0.19% 1.0351ms 246 4.2070us 3.6880us 13.495us cuMemcpyHtoDAsync 0.15% 844.22us 6 140.70us 26.971us 516.61us cudaMemset 0.11% 591.25us 41 14.420us 11.523us 44.503us cuMemcpyDtoH 0.07% 407.33us 32 12.729us 1.7660us 190.68us cuStreamCreate 0.03% 172.74us 4 43.183us 36.483us 57.844us cuDeviceGetName 0.02% 125.94us 32 3.9350us 2.2320us 27.967us cuStreamDestroy 0.02% 95.955us 373 257ns 144ns 4.1180us cuGetProcAddress 0.01% 44.318us 54 820ns 292ns 5.4780us cuModuleGetGlobal 0.01% 30.319us 41 739ns 377ns 4.3020us cuModuleGetFunction 0.00% 16.994us 1 
16.994us 16.994us 16.994us cuMemAlloc 0.00% 13.340us 1 13.340us 13.340us 13.340us cudaSetDevice 0.00% 12.246us 4 3.0610us 985ns 8.0850us cuDeviceGetPCIBusId 0.00% 5.9690us 9 663ns 141ns 3.1970us cuDeviceGet 0.00% 2.8740us 1 2.8740us 2.8740us 2.8740us cuDevicePrimaryCtxGetState 0.00% 2.7800us 2 1.3900us 201ns 2.5790us cuCtxGetLimit 0.00% 1.7460us 5 349ns 256ns 595ns cuFuncGetAttribute 0.00% 1.1300us 4 282ns 202ns 443ns cuDeviceGetCount 0.00% 946ns 4 236ns 161ns 322ns cuDeviceGetUuid 0.00% 895ns 1 895ns 895ns 895ns cuInit 0.00% 737ns 1 737ns 737ns 737ns cuDevicePrimaryCtxSetFlags 0.00% 591ns 1 591ns 591ns 591ns cuCtxGetDevice 0.00% 371ns 1 371ns 371ns 371ns cuDevicePrimaryCtxRelease 0.00% 217ns 1 217ns 217ns 217ns cuDriverGetVersion ==112850== Unified Memory profiling result: Device "Tesla V100-SXM2-16GB (0)" Count Avg Size Min Size Max Size Total Size Total Time Name 4 32.000KB 4.0000KB 60.000KB 128.0000KB 20.64000us Host To Device 6 32.000KB 4.0000KB 60.000KB 192.0000KB 21.34400us Device To Host 3 - - - - 575.5490us Gpu page fault groups 16 4.0000KB 4.0000KB 4.0000KB 64.00000KB - Memory thrashes Total CPU Page faults: 3 Total CPU thrashes: 16
- Then I was able to isolate the issue to a single file, exp2syn.cpp: if I comment out all OpenMP offload pragmas in exp2syn.cpp, then the special-core-gpu binary works fine!
But the funny part with exp2syn.cpp is that this mechanism is not used in the ring test model at all!:
```
$ nvprof ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu --model-stats
==112982== NVPROF is profiling process 112982, command: ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu --model-stats
 Info : 4 GPUs shared by 1 ranks per node
 Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
 Version : 1.0 4de7951f (2022-01-04 16:29:44 +0100)
 Additional mechanisms from files
 exp2syn.mod expsyn.mod halfgap.mod hh.mod netstim.mod passive.mod pattern.mod stim.mod svclmp.mod
 Memory (MBs) : After mk_mech : Max 298.0469, Min 298.0469, Avg 298.0469
 Memory (MBs) : After MPI_Init : Max 296.1094, Min 296.1094, Avg 296.1094
 Memory (MBs) : Before nrn_setup : Max 296.1406, Min 296.1406, Avg 296.1406
 WARNING : GPU execution requires --cell-permute type 1 or 2. Setting it to 1.
 ================ MECHANISMS COUNT BY TYPE ==================
   Id              Name  Count
    0            (null)      0
    1            (null)      0
    2        morphology      0
    3       capacitance    392
    4               pas    372
    5     extracellular      0
    6           fastpas      0
    7            IClamp      0
    8      AlphaSynapse      0
    9            ExpSyn     40
   10           Exp2Syn      0
   11           SEClamp      0
   12            VClamp      0
   13            OClamp      0
   14           APCount      0
   15            na_ion     20
   16             k_ion     20
   17                hh     20
   18           NetStim      1
   19          IntFire1      0
   20          IntFire2      0
   21          IntFire4      0
   22  PointProcessMark      0
   23       PatternStim      0
   24           HalfGap      0
```
You can see that Exp2Syn count is 0!
- In exp2syn.cpp, if I enable a single OpenMP pragma, e.g. in the function nrn_state_Exp2Syn() (which is not executed!), then I still get the error below:
```
$ nvprof ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu --model-stats
==113465== NVPROF is profiling process 113465, command: ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu --model-stats
 Info : 4 GPUs shared by 1 ranks per node
 Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
 Version : 1.0 4de7951f (2022-01-04 16:29:44 +0100)
 CUDA error: Error returned from cuModuleLoadDataEx
 CUDA error: out of memory
 Libomptarget error: Unable to generate entries table for device id 0.
 Libomptarget error: Failed to init globals on device 0
 Libomptarget error: Run with LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
 Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
 Libomptarget fatal error 1: failure of target construct while offloading is mandatory
```
- During this experimentation I realised that CoreNEURON's CMake doesn't set any optimisation flags when the LLVM compiler is used. See https://github.com/BlueBrain/CoreNeuron/pull/734/files. So I compiled exp2syn.cpp with the optimisation flag -O1 and the error disappeared!
```
kumbhar@r2i3n6:~/workarena/systems/bbpv/repos/bbp/coreneuron/build/coreneuron$ rm ../bin/special-core-gpu
kumbhar@r2i3n6:~/workarena/systems/bbpv/repos/bbp/coreneuron/build/coreneuron$ /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/llvm-13.0.0-klsplt/bin/clang++ -DADDITIONAL_MECHS -DCORENEURON_BUILD -DCORENEURON_CUDA_PROFILING -DCORENEURON_ENABLE_GPU -DCORENEURON_PREFER_OPENMP_OFFLOAD -DDISABLE_HOC_EXP -DENABLE_SPLAYTREE_QUEUING -DLAYOUT=0 -DNRNMPI=0 -DNRN_MULTISEND=0 -I/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/targets/x86_64-linux/include -I/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/build/include -I/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/coreneuron/utils/randoms -I/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron -I/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/build/coreneuron -isystem /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/include -fopenmp=libomp -fopenmp=libomp -fopenmp-targets=nvptx64-nvidia-cuda -Wno-unknown-cuda-version -I/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/include -std=c++14 -MD -MT coreneuron/CMakeFiles/special-core-gpu.dir/exe/exp2syn.cpp.o -MF CMakeFiles/special-core-gpu.dir/exe/exp2syn.cpp.o.d -o CMakeFiles/special-core-gpu.dir/exe/exp2syn.cpp.o -c /gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/coreneuron/exe/exp2syn.cpp -O1
kumbhar@r2i3n6:~/workarena/systems/bbpv/repos/bbp/coreneuron/build/coreneuron$ make
[  1%] Built target pyembed
[  2%] Built target fmt
[  3%] Built target pywrapper
[  3%] Built target pyastgen
[ 18%] Built target lexer_obj
[ 18%] Built target lexer
[ 21%] Built target util_obj
[ 22%] Built target util
[ 38%] Built target visitor_obj
[ 39%] Built target visitor
[ 43%] Built target codegen
[ 45%] Built target printer_obj
[ 45%] Built target printer
[ 46%] Built target symtab_obj
[ 47%] Built target symtab
[ 48%] Built target nmodl
[ 48%] Built target nrnivmodl-core
[ 50%] Built target kin_deriv_header
[ 83%] Built target coreneuron
[ 86%] Built target scopmath
Consolidate compiler generated dependencies of target special-core-gpu
[ 87%] Linking CXX executable ../bin/special-core-gpu
[ 93%] Built target special-core-gpu
[100%] Built target coreneuron-copy-nrnivmodl-core-dependencies
kumbhar@r2i3n6:~/workarena/systems/bbpv/repos/bbp/coreneuron/build/coreneuron$ nvprof ../bin/special-core-gpu -e 1 -d ../../tests/integration/ring --gpu --model-stats
==115317== NVPROF is profiling process 115317, command: ../bin/special-core-gpu -e 1 -d ../../tests/integration/ring --gpu --model-stats
Info : 4 GPUs shared by 1 ranks per node

Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
Version : 1.0 4de7951f (2022-01-04 16:29:44 +0100)

Additional mechanisms from files
 exp2syn.mod expsyn.mod halfgap.mod hh.mod netstim.mod passive.mod pattern.mod stim.mod svclmp.mod
 Exp2Syn Reg
Memory (MBs) : After mk_mech : Max 297.8047, Min 297.8047, Avg 297.8047
Memory (MBs) : After MPI_Init : Max 296.2422, Min 296.2422, Avg 296.2422
Memory (MBs) : Before nrn_setup : Max 296.2734, Min 296.2734, Avg 296.2734
WARNING : GPU execution requires --cell-permute type 1 or 2. Setting it to 1.
....
Memory (MBs) : After nrn_finitialize : Max 301.6562, Min 301.6562, Avg 301.6562
psolve |=========================================================| t: 1.00 ETA: 0h00m01s

Solver Time : 0.253621

 Simulation Statistics
 Number of cells: 20
 Number of compartments: 804
 Number of presyns: 21
 Number of input presyns: 0
 Number of synapses: 21
 Number of point processes: 41
 Number of transfer sources: 0
 Number of transfer targets: 0
 Number of spikes: 40
 Number of spikes with non negative gid-s: 40
==115317== Profiling application: ../bin/special-core-gpu -e 1 -d ../../tests/integration/ring --gpu --model-stats
==115317== Profiling result:
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:   11.69%  21.776ms    160  136.10us  127.74us  143.14us  __omp_offloading_2f_188133fe__ZN10coreneuron11nrn_cur_ionEPNS_9NrnThreadEPNS_9Memb_listEi_l273
                    9.66%  17.998ms     80  224.98us  198.98us  236.25us  __omp_offloading_2f_2979dc5__ZN10coreneuron12nrn_state_hhEPNS_9NrnThreadEPNS_9Memb_listEi_l520
                    8.98%  16.731ms    160  104.57us  99.968us  108.10us  __omp_offloading_2f_12055abf__ZN10coreneuron22net_buf_receive_ExpSynEPNS_9NrnThreadE_l290
                    7.82%  14.565ms     80  182.06us  176.06us  187.84us  __omp_offloading_2f_2979dc5__ZN10coreneuron10nrn_cur_hhEPNS_9NrnThreadEPNS_9Memb_listEi_l472
                    6.86%  12.780ms     80  159.75us  153.70us  163.58us  __omp_offloading_2f_12055abf__ZN10coreneuron14nrn_cur_ExpSynEPNS_9NrnThreadEPNS_9Memb_listEi_l383
                    6.82%  12.697ms     80  158.71us  152.13us  163.20us  __omp_offloading_2f_3962f8e__ZN10coreneuron11nrn_cur_pasEPNS_9NrnThreadEPNS_9Memb_listEi_l276
                    6.04%  11.251ms     80  140.64us  135.58us  144.13us  __omp_offloading_2f_12055abf__ZN10coreneuron16nrn_state_ExpSynEPNS_9NrnThreadEPNS_9Memb_listEi_l426
                    6.04%  11.250ms     80  140.62us  134.53us  146.59us  __omp_offloading_2f_191507f3__ZN10coreneuronL7nrn_rhsEPNS_9NrnThreadE_l96
                    6.03%  11.240ms     80  140.50us  135.33us  143.87us  __omp_offloading_2f_198f6a66__ZN10coreneuron21nrn_jacob_capacitanceEPNS_9NrnThreadEPNS_9Memb_listEi_l74
                    6.01%  11.203ms     80  140.04us  133.09us  145.22us  __omp_offloading_2f_191507f3__ZN10coreneuronL7nrn_lhsEPNS_9NrnThreadE_l166
                    5.95%  11.076ms     80  138.45us  133.34us  144.96us  __omp_offloading_2f_191507c2__ZN10coreneuron8NetCvode12check_threshEPNS_9NrnThreadE_l541
                    5.92%  11.035ms     80  137.94us  132.42us  141.38us  __omp_offloading_2f_198f6a66__ZN10coreneuron19nrn_cur_capacitanceEPNS_9NrnThreadEPNS_9Memb_listEi_l120
                    5.80%  10.795ms     80  134.94us  130.05us  139.90us  __omp_offloading_2f_1d8e98f0__ZN10coreneuron6updateEPNS_9NrnThreadE_l217
                    5.77%  10.751ms     80  134.39us  129.34us  137.79us  __omp_offloading_2f_191507f3__ZN10coreneuronL7nrn_rhsEPNS_9NrnThreadE_l37
                    0.32%  598.59us      6  99.765us  2.7200us  447.39us  [CUDA memset]
                    0.17%  323.17us    246  1.3130us  1.2470us  1.8240us  [CUDA memcpy HtoD]
                    0.11%  195.62us    126  1.5520us  1.3760us  3.0090us  [CUDA memcpy DtoH]
      API calls:   39.19%  222.11ms   1450  153.18us     811ns  3.8766ms  cuStreamSynchronize
                   38.77%  219.72ms      1  219.72ms  219.72ms  219.72ms  cuDevicePrimaryCtxRetain
                    9.24%  52.366ms      1  52.366ms  52.366ms  52.366ms  cuModuleLoadDataEx
                    4.44%  25.182ms      1  25.182ms  25.182ms  25.182ms  cuModuleUnload
                    3.59%  20.330ms      6  3.3883ms  9.0800us  20.277ms  cudaMallocManaged
                    2.13%  12.063ms     84  143.60us  11.176us  157.59us  cuMemcpyDtoHAsync
                    1.20%  6.7851ms   1280  5.3000us  4.5340us  19.189us  cuLaunchKernel
                    0.29%  1.6442ms    407  4.0390us     133ns  187.16us  cuDeviceGetAttribute
                    0.28%  1.5656ms      4  391.41us  387.70us  395.00us  cuDeviceTotalMem
                    0.25%  1.4421ms   1619     890ns     159ns  579.67us  cuCtxSetCurrent
                    0.19%  1.0798ms    246  4.3890us  3.7420us  17.665us  cuMemcpyHtoDAsync
                    0.15%  853.90us      6  142.32us  27.310us  519.82us  cudaMemset
                    0.11%  647.30us     42  15.411us  11.264us  49.087us  cuMemcpyDtoH
                    0.07%  412.45us     32  12.889us  1.7620us  191.92us  cuStreamCreate
                    0.03%  168.17us      4  42.041us  36.633us  56.245us  cuDeviceGetName
                    0.02%  112.19us     32  3.5050us  2.1360us  21.674us  cuStreamDestroy
                    0.02%  108.93us    373     292ns     138ns  4.2190us  cuGetProcAddress
                    0.01%  44.237us     55     804ns     401ns  3.3740us  cuModuleGetGlobal
                    0.01%  34.313us     42     816ns     354ns  6.8130us  cuModuleGetFunction
                    0.00%  14.817us      1  14.817us  14.817us  14.817us  cudaSetDevice
                    0.00%  13.742us      1  13.742us  13.742us  13.742us  cuMemAlloc
                    0.00%  11.865us      4  2.9660us     937ns  7.8230us  cuDeviceGetPCIBusId
                    0.00%  7.1150us      9     790ns     142ns  4.0720us  cuDeviceGet
                    0.00%  2.6710us      1  2.6710us  2.6710us  2.6710us  cuDevicePrimaryCtxGetState
                    0.00%  2.5190us      2  1.2590us     224ns  2.2950us  cuCtxGetLimit
                    0.00%  2.2450us      5     449ns     231ns     744ns  cuFuncGetAttribute
                    0.00%  1.2930us      4     323ns     180ns     528ns  cuDeviceGetCount
                    0.00%  1.1640us      1  1.1640us  1.1640us  1.1640us  cuInit
                    0.00%     984ns      4     246ns     162ns     333ns  cuDeviceGetUuid
                    0.00%     674ns      1     674ns     674ns     674ns  cuDevicePrimaryCtxSetFlags
                    0.00%     502ns      1     502ns     502ns     502ns  cuCtxGetDevice
                    0.00%     359ns      1     359ns     359ns     359ns  cuDevicePrimaryCtxRelease
                    0.00%     215ns      1     215ns     215ns     215ns  cuDriverGetVersion
..
```
So using #734 I am able to avoid the `CUDA error: out of memory` failure.
TODO: go back to the nrnivmodl-core based build and see how this could be tested there (tomorrow).
Summary of the IBM XL compiler issues
Issues reported on llvm openmp-dev mailing list: https://lists.llvm.org/pipermail/openmp-dev/2022-January/004276.html
- Issue/Question #1: both the static and the shared library work.
- Issue/Question #2: the static library works, but the shared library fails to link:
```
+ CXX=xlc++_r
+ CXXFLAGS='-fopenmp -fPIC -qsmp=omp -qoffload -g -O2'
+ xlc++_r -fopenmp -fPIC -qsmp=omp -qoffload -g -O2 -c test.cpp
+ ar cq libtest.a test.o
+ xlc++_r -fopenmp -fPIC -qsmp=omp -qoffload -g -O2 -o test1 main.cpp -L. -ltest
+ xlc++_r -fopenmp -fPIC -qsmp=omp -qoffload -g -O2 -o test2 main.cpp test.o
+ rm test.o
+ xlc++_r -fopenmp -fPIC -qsmp=omp -qoffload -g -O2 -fpic -shared test.cpp -o libtest.so
/cineca/prod/opt/compilers/xl/16.1.1_sp4.1/binary/xlC/16.1.1/bin/.orig/xlc++_r: warning: 1501-269 fpic is not supported on this Operating System platform. Option fpic will be ignored.
/cineca/prod/opt/compilers/xl/16.1.1_sp4.1/binary/xlC/16.1.1/bin/.orig/xlc++_r: warning: 1501-308 The device linker only supports static linking. Any device code placed into a shared library by the qmkshrobj option will be inaccessible.
/cineca/prod/opt/compilers/xl/16.1.1_sp4.1/binary/xlC/16.1.1/bin/.orig/minilink: warning: 1501-308 The device linker only supports static linking. Any device code placed into a shared library by the qmkshrobj option will be inaccessible.
+ xlc++_r -fopenmp -fPIC -qsmp=omp -qoffload -g -O2 -o test3 main.cpp -L. -ltest -Wl,-rpath .
nvlink error : Undefined reference to 'y' in '/tmp/24507_0.o'
...
...
$ nvprof ./test1
==29304== NVPROF is profiling process 29304, command: ./test1
--> 0
--> 1
--> 4
--> 2
--> 3
==29304== Profiling application: ./test1
==29304== Profiling result:
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:   79.20%  41.659us      1  41.659us  41.659us  41.659us  __xl_main_l11_OL_1
                   11.99%  6.3040us      3  2.1010us  1.8560us  2.5280us  [CUDA memcpy DtoH]
                    4.98%  2.6200us      1  2.6200us  2.6200us  2.6200us  __xl__Z4testv_l8_OL_1
```
- Another issue that took quite some time to debug is the following, historically famous one:
```cpp
int* _displ = nrb->_displ;
int _displ_cnt = nrb->_displ_cnt;
#pragma omp target update to(_displ[0:_displ_cnt])
```
vs.
```cpp
#pragma omp target update to(nrb->_displ[0:nrb->_displ_cnt])
```
- The first form updates the contents of the array on the device.
- The second form updates the `_displ` pointer itself from host to device, and hence results in a wrong pointer on the device side.
In `coreneuron/gpu/nrn_acc_manager.cpp`, under `update_net_receive_buffer()`, I did:
```cpp
#if 0
#pragma omp target update to(_displ[0:nrb->_displ_cnt])
#else
#pragma omp target update to(nrb->_displ[0:nrb->_displ_cnt])
#endif
#pragma omp target
{
    printf("nrb->_displ :%p \n", nrb->_displ);
}
abort();
```
And this produces the following in the first and second case respectively:
```
Updating nrb now
nrb->_displ :0x7ffef000b100
Aborted (core dumped)
....
nrb->_displ :0x26a6b9c0
Aborted (core dumped)
```
Uh oh!
How to test this?
Outstanding issues?
Running for 0 msec gives:

CI_BRANCHES:NMODL_BRANCH=hackathon_main,NEURON_BRANCH=master,