Search code, repositories, users, issues, pull requests...

Copy link

Collaborator

@pramodk pramodk commented Dec 10, 2021 •

edited

Loading

Various changes (including temporary) to make XL OpenMP offload build working
 * todo: need to rebase this branch on latest hackathon_master
 * todo: temporary changes to OpenAccHelper.cmake, needs refinement
 * todo: see caliper linkling issue
 * todo: _OPENACC needs to be renamed CORENRN_ENABLE_GPU so that OpenMP
 based builds can use GPU offload.
 * todo: hardcoded CXX flags for quick build

How to test this?

Outstanding issues?

Building with XLC gives (57cecb5):

make[2]: *** [coreneuron/CMakeFiles/coreneuron.dir/build.make:114: coreneuron/CMakeFiles/coreneuron.dir/io/core2nrn_data_return.cpp.o] Error 1
In file included from /m100/home/userexternal/pkumbhar/CoreNeuron/coreneuron/io/mech_report.cpp:12:
In file included from /m100/home/userexternal/pkumbhar/CoreNeuron/coreneuron/coreneuron.hpp:24:
In file included from /m100/home/userexternal/pkumbhar/CoreNeuron/coreneuron/utils/randoms/nrnran123.h:42:
In file included from /m100/home/userexternal/pkumbhar/CoreNeuron/build/include/Random123/philox.h:37:
In file included from /m100/home/userexternal/pkumbhar/CoreNeuron/build/include/Random123/features/compilerfeatures.h:218:
In file included from /m100/home/userexternal/pkumbhar/CoreNeuron/build/include/Random123/features/clangfeatures.h:91:
/m100/home/userexternal/pkumbhar/CoreNeuron/build/include/Random123/features/gccfeatures.h:48:10: fatal error: 'ppu_intrinsics.h' file not found
#include <ppu_intrinsics.h>
 ^~~~~~~~~~~~~~~~~~

Running ringtest with XLC gives (57cecb5):

(venv) [pkumbhar@login01 build]$ ./bin/ppc64le/special-core -d ../tests/integration/ring --gpu --cell-permute=2
 Info : 4 GPUs shared by 1 ranks per node
 Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
 Version : 1.0 57cecb59 (2022年01月03日 17:00:52 +0100)
 Additional mechanisms from files
 exp2syn.mod expsyn.mod halfgap.mod hh.mod netstim.mod passive.mod pattern.mod stim.mod svclmp.mod
 Memory (MBs) : After mk_mech : Max 390.8125, Min 390.8125, Avg 390.8125
 Memory (MBs) : After MPI_Init : Max 390.8125, Min 390.8125, Avg 390.8125
 Memory (MBs) : Before nrn_setup : Max 390.8125, Min 390.8125, Avg 390.8125
best_balance=0.848837 ncell=10 ntype=3 nwarp=10
best_balance=0.82093 ncell=10 ntype=3 nwarp=10
 Setup Done : 0.13 seconds
 Model size : 84.19 kB
 Memory (MBs) : After nrn_setup : Max 398.8750, Min 398.8750, Avg 398.8750
GENERAL PARAMETERS
--mpi=false
--mpi-lib=
--gpu=true
--dt=0.025
--tstop=100
GPU
--nwarp=65536
--cell-permute=2
--cuda-interface=false
INPUT PARAMETERS
--voltage=-65
--seed=-1
--datpath=../tests/integration/ring
--filesdat=files.dat
--pattern=
--report-conf=
--restore=
PARALLEL COMPUTATION PARAMETERS
--threading=false
--skip_mpi_finalize=false
SPIKE EXCHANGE
--ms_phases=2
--ms_subintervals=2
--multisend=false
--spk_compress=0
--binqueue=false
CONFIGURATION
--spikebuf=100000
--prcellgid=-1
--forwardskip=0
--celsius=6.3
--mindelay=1
--report-buffer-size=4
OUTPUT PARAMETERS
--dt_io=0.1
--outpath=.
--checkpoint=
 Start time (t) = 0
 Memory (MBs) : After mk_spikevec_buffer : Max 398.8750, Min 398.8750, Avg 398.8750
 Memory (MBs) : After nrn_finitialize : Max 398.6875, Min 398.6875, Avg 398.6875
1587-175 The underlying GPU runtime reported the following error "an illegal memory access was encountered".
1587-175 The underlying GPU runtime reported the following error "an illegal memory access was encountered".
1587-163 Error encountered while attempting to execute on the target device 0. The program will stop.
1587-163 Error encountered while attempting to execute on the target device 0. The program will stop.
free(): corrupted unsorted chunks

Issue seems to be with net_buf_receive_ExpSyn(). If I comment it out then it runs further

Running for 0 msec gives:

(venv) [pkumbhar@login01 build]$ gdb --args ./bin/ppc64le/special-core -d ../tests/integration/ring --gpu --cell-permute=2 -e 0
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-6.el8_0
...
(gdb) r
Starting program: /m100/home/userexternal/pkumbhar/CoreNeuron/build/bin/ppc64le/special-core -d ../tests/integration/ring --gpu --cell-permute=2 -e 0
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.28-72.el8_1.1.ppc64le
Missing separate debuginfo for /cineca/prod/opt/compilers/cuda/11.2/none/compat/libcuda.so.1
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/d1/e9e189d76f924564adf8c9d73eeb713d5b23d9.debug
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/power9/libthread_db.so.1".
[New Thread 0x7fffefcbd890 (LWP 128546)]
 Info : 4 GPUs shared by 1 ranks per node
 Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
 Version : 1.0 57cecb59 (2022年01月03日 17:00:52 +0100)
[New Thread 0x7fffcfffd890 (LWP 128551)]
 Additional mechanisms from files
 exp2syn.mod expsyn.mod halfgap.mod hh.mod netstim.mod passive.mod pattern.mod stim.mod svclmp.mod
 Memory (MBs) : After mk_mech : Max 390.7500, Min 390.7500, Avg 390.7500
 Memory (MBs) : After MPI_Init : Max 390.7500, Min 390.7500, Avg 390.7500
 Memory (MBs) : Before nrn_setup : Max 390.7500, Min 390.7500, Avg 390.7500
best_balance=0.848837 ncell=10 ntype=3 nwarp=10
best_balance=0.82093 ncell=10 ntype=3 nwarp=10
 Setup Done : 0.00 seconds
 Model size : 84.19 kB
 Memory (MBs) : After nrn_setup : Max 390.7500, Min 390.7500, Avg 390.7500
GENERAL PARAMETERS
--mpi=false
--mpi-lib=
--gpu=true
--dt=0.025
--tstop=0
GPU
--nwarp=65536
--cell-permute=2
--cuda-interface=false
INPUT PARAMETERS
--voltage=-65
--seed=-1
--datpath=../tests/integration/ring
--filesdat=files.dat
--pattern=
--report-conf=
--restore=
PARALLEL COMPUTATION PARAMETERS
--threading=false
--skip_mpi_finalize=false
SPIKE EXCHANGE
--ms_phases=2
--ms_subintervals=2
--multisend=false
--spk_compress=0
--binqueue=false
CONFIGURATION
--spikebuf=100000
--prcellgid=-1
--forwardskip=0
--celsius=6.3
--mindelay=1
--report-buffer-size=4
OUTPUT PARAMETERS
--dt_io=0.1
--outpath=.
--checkpoint=
 Start time (t) = 0
 Memory (MBs) : After mk_spikevec_buffer : Max 390.7500, Min 390.7500, Avg 390.7500
 Memory (MBs) : After nrn_finitialize : Max 390.5625, Min 390.5625, Avg 390.5625
psolve |=========================================================| t: 0.00 ETA: 0h00m00s
Solver Time : 4.69685e-05
Thread 1 "special-core" received signal SIGSEGV, Segmentation fault.
IPRA.$_ZN10coreneuronL17update_ml_on_hostEPKNS_9Memb_listEi (ml=0x7fff50009f00, type=-16288) at /m100/home/userexternal/pkumbhar/CoreNeuron/coreneuron/gpu/nrn_acc_manager.cpp:180
180	 int n = ml->nodecount;
Missing separate debuginfos, use: dnf debuginfo-install libgcc-8.3.1-4.5.el8.ppc64le libstdc++-8.3.1-4.5.el8.ppc64le
(gdb) bt
#0 IPRA.$_ZN10coreneuronL17update_ml_on_hostEPKNS_9Memb_listEi (ml=0x7fff50009f00, type=-16288) at /m100/home/userexternal/pkumbhar/CoreNeuron/coreneuron/gpu/nrn_acc_manager.cpp:180
#1 0x000000001014c28c in coreneuron::update_nrnthreads_on_host () at /m100/home/userexternal/pkumbhar/CoreNeuron/coreneuron/gpu/nrn_acc_manager.cpp:831
#2 0x00000000100df048 in run_solve_core (argc=-14016, argv=0x0) at /m100/home/userexternal/pkumbhar/CoreNeuron/coreneuron/apps/main1.cpp:630
#3 0x000000001003a048 in solve_core (argc=-13968, argv=0x0) at /m100/home/userexternal/pkumbhar/CoreNeuron/build/share/coreneuron/enginemech.cpp:49
#4 0x0000000010039fd8 in main (argc=-13936, argv=0x7fff0000002e) at /m100/home/userexternal/pkumbhar/CoreNeuron/build/share/coreneuron/coreneuron.cpp:14

if I comment out update_nrnthreads_on_host() then it runs further.

CI_BRANCHES:NMODL_BRANCH=hackathon_main,NEURON_BRANCH=master,

@pramodk pramodk marked this pull request as draft

December 10, 2021 00:36

Copy link

Collaborator

bbpbuildbot commented Dec 10, 2021

Logfiles from GitLab pipeline #28712 (:no_entry:) have been uploaded here!

Status and direct links:

Copy link

Collaborator Author

pramodk commented Dec 10, 2021

On Ascent @ ORNL:

module load nvhpc/21.9 python/3.7.0 cmake flex bison
module swap cuda/10.1.243 cuda/11.0.2
module use /ccsopen/proj/gen170/neuron/spack_modules/linux-rhel7-power9le
module load caliper ninja py-pytest py-sympy py-jinja2 py-pyyaml boost
export NVLOCALRC=/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/localrc
export PATH=/sw/ascent/gcc/6.4.0/bin:$PATH
module load xl/16.1.1-10

Configure XL for newer GCC:

xlc_configure -gcc /sw/ascent/gcc/10.2.0 -o /ccsopen/proj/gen170/neuron/nersc-gpu-hackaxlc_gcc10.cfg -cuda /sw/ascent/cuda/11.0.2 -cudaVersion 11.0 -cuda_cc_major 7 -cuda_cc_minor 0

And build with:

cmake .. -DCORENRN_ENABLE_CALIPER_PROFILING=ON -DCORENRN_ENABLE_GPU=ON -DCORENRN_ENABLE_NMODL=ON -DCMAKE_INSTALL_PREFIX=../install -DCMAKE_CXX_FLAGS="-DR123_USE_SSE=0" -DCMAKE_CUDA_ARCHITECTURES=70 -DCMAKE_CUDA_COMPILER=nvcc -DCORENRN_EXTERNAL_BENCHMARK_DATA=/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/ -DCMAKE_CXX_FLAGS="" -DCORENRN_ENABLE_UNIT_TESTS=OFF -DCMAKE_CXX_COMPILER=xlc++_r -DCORENRN_NMODL_DIR=/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/users/kumbhar/nmodl/build/install -DCORENRN_ENABLE_MPI=ON

@pramodk pramodk force-pushed the olupton/llvm-gpu branch from 7cb13a1 to a8958d1 Compare

December 10, 2021 06:58

Copy link

Collaborator

bbpbuildbot commented Dec 10, 2021

Logfiles from GitLab pipeline #28718 (:no_entry:) have been uploaded here!

Status and direct links:

@olupton olupton force-pushed the olupton/llvm-gpu branch 2 times, most recently from b468870 to affbc25 Compare

December 10, 2021 07:57

Fixes and improvements from LLVM/XLC work. #716

Copy link

Collaborator

bbpbuildbot commented Dec 10, 2021

Logfiles from GitLab pipeline #28723 (:no_entry:) have been uploaded here!

Status and direct links:

@olupton olupton mentioned this pull request

Dec 13, 2021

Merged

@olupton olupton force-pushed the olupton/llvm-gpu branch from affbc25 to 9cd7ace Compare

December 17, 2021 15:48

@olupton

Copy link

Contributor

olupton commented Dec 17, 2021

I rebased this and tried to resolve the conflicts fairly blindly.

Copy link

Collaborator

bbpbuildbot commented Dec 17, 2021

@olupton olupton force-pushed the olupton/llvm-gpu branch from 9cd7ace to dc9632a Compare

December 23, 2021 11:08

@olupton olupton changed the base branch from hackathon_main to master

December 23, 2021 11:08

Copy link

Collaborator

bbpbuildbot commented Dec 23, 2021

olupton and others added 7 commits

December 31, 2021 14:02


 fixups for llvm

b440d11


 Tweaks towards an LLVM build.

7e3e2c7


 Declare cnrn_ methods.

72cb02a


 Various changes (including temporary) to make XL OpenMP offload build...

fbf9551

... working
 * todo: temporary changes to OpenAccHelper.cmake, needs refinement
 * todo: see caliper linkling issue
 * todo: _OPENACC needs to be renamed CORENRN_ENABLE_GPU so that OpenMP
 based builds can use GPU offload.
 * todo: hardcoded CXX flags for quick build


 Use CORENEURON_PREFER_OPENMP_OFFLOAD.


 Disable CUDA complation of partial piv and nrnran123.

4b8b093


 fixup.

439c560

@pramodk pramodk force-pushed the olupton/llvm-gpu branch from dc9632a to 439c560 Compare

December 31, 2021 13:07

Copy link

Collaborator

bbpbuildbot commented Dec 31, 2021

pramodk added 2 commits

January 3, 2022 16:53


 Avoid setting additional linker flags at global scope

6ff530c

 * Setting CMAKE_EXE_LINKER_FLAGS was used also for NMODL
 when NMODL is built from submodule
 * In case of LLVM OpenMP offload, if binary is being created
 from the object that is not compiled with OpenMP flags,
 it causes link error:
 echo "int main() { return 0; } " > foo.cpp
 clang++ -c foo.cpp # adding openmp flags here doesn't cause below error
 clang++ foo.o -o foo -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda
 nvlink fatal : Could not open input file '/tmp/foo-0b9a1a.cubin'
 clang-12: error: nvlink command failed with exit code 1 (use -v to see invocation)


 Skip cuda files in the build (temporarily) for LLVM only build

57cecb5

Copy link

Collaborator

bbpbuildbot commented Jan 3, 2022

Copy link

Collaborator Author

pramodk commented Jan 4, 2022

Here are some notes from various experimentation / debugging attempts to get OpenMP offload working with LLVM v13.0.

On Slack thread, Olli reminded small reproducer from NERSC Hackathon showing issue with global variables and static library:

[olupton@r2i3n6 build_olli_gpu]$ cat test.sh
CXX=clang++
CXXFLAGS="-fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -g"
${CXX} ${CXXFLAGS} -c test.cpp
ar cq libtest.a test.o
${CXX} ${CXXFLAGS} -o test1 main.cpp -L. -ltest
${CXX} ${CXXFLAGS} -o test2 main.cpp test.o
[olupton@r2i3n6 build_olli_gpu]$ cat test.cpp
#pragma omp declare target
int y;
#pragma omp end declare target
int test() {
 y = 24;
 #pragma omp target update to(y)
 y = 42;
 int x;
 #pragma omp target map(from:x)
 {
 x = y;
 }
 return x;
}
[olupton@r2i3n6 build_olli_gpu]$ cat main.cpp
extern int test();
int main() {
 return test();
}

small reproducer for the problem I am seeing now — going via the static library seems to cause problems 🤷

[olupton@r2i3n6 build_olli_gpu]$ ./test1
CUDA error: Loading global 'y' Failed
CUDA error: named symbol not found
Libomptarget error: Unable to generate entries table for device id 0.
Libomptarget error: Failed to init globals on device 0
Libomptarget error: Run with LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
test.cpp:7:3: Libomptarget fatal error 1: failure of target construct while offloading is mandatory
Aborted
[olupton@r2i3n6 build_olli_gpu]$ ./test2; echo $?
24

So I switched to shared library in coreneuron + nrnivmodl-core i.e. build shared library of libcorenrnmech. This was giving undefined symbol errors for global variables & function defined in libcoreneuron.a:

nvlink error : Undefined reference to '_ZN10coreneuron7celsiusE' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/_dimplic-ba4873.cubin'
nvlink error : Undefined reference to '_ZN10coreneuron2piE' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/_dimplic-ba4873.cubin'
nvlink error : Undefined reference to '_ZN10coreneuron11secondorderE' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/_dimplic-ba4873.cubin'
nvlink error : Undefined reference to '_ZN10coreneuron7at_timeEPNS_9NrnThreadEd' in

See also Olli's comment in NERSC GPU Hackathon Slack:

With the Clang/OpenMP build + shared libraries instead of static, I avoided some link errors by removing the annotations in the header

#pragma omp declare target 
extern double celsius;
#pragma omp end declare target

and the remaining one (at_time function) by making its body available. But now I have a new segfault at startup:

Program received signal SIGSEGV, Segmentation fault.
0x00007fffed92b801 in RTLsTy::RegisterRequires(long) () from /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/llvm-13.0.0-rrwbmv/lib/libomptarget.so
(gdb) bt
#0 0x00007fffed92b801 in RTLsTy::RegisterRequires(long) () from /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/llvm-13.0.0-rrwbmv/lib/libomptarget.so
#1 0x00007fffed92824e in __tgt_register_requires () from /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/llvm-13.0.0-rrwbmv/lib/libomptarget.so
#2 0x00007fffed8f29c3 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#3 0x00007fffed8e417a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2

 still investigating..

Then I avoided building libcorenrnmech library and tried to use mechanisms object files directly to create special-core (by modifying nrnivmodl-core-makefile). This didn't go too far - I saw similar linking errors with global symbols in libcoreneuron library:

 => Binary creating x86_64/special-core
/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/llvm-13.0.0-lvcrm6/bin/clang++ -fopenmp=libomp -std=c++14 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Wno-unknown-cuda-version -I/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/include -Wl,--as-needed -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Wno-unknown-cuda-version -I/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/include -o x86_64/special-core /gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/build/share/coreneuron/coreneuron.cpp \
 -I/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/build/include \
 x86_64/corenrn/build/_mod_func.o x86_64/corenrn/build/_dimplic.o x86_64/corenrn/build/exp2syn.o x86_64/corenrn/build/expsyn.o x86_64/corenrn/build/halfgap.o x86_64/corenrn/build/hh.o x86_64/corenrn/build/netstim.o x86_64/corenrn/build/passive.o x86_64/corenrn/build/pattern.o x86_64/corenrn/build/stim.o x86_64/corenrn/build/svclmp.o x86_64/corenrn/build/enginemech.o
nvlink error : Undefined reference to '_ZN10coreneuron7celsiusE' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/_dimplic-ba4873.cubin'
nvlink error : Undefined reference to '_ZN10coreneuron2piE' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/_dimplic-ba4873.cubin'
nvlink error : Undefined reference to '_ZN10coreneuron11secondorderE' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/_dimplic-ba4873.cubin'
nvlink error : Undefined reference to '_ZN10coreneuron7at_timeEPNS_9NrnThreadEd' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/svclmp-852313.cubin'

Next, I wanted to avoid building any intermediate libraries i.e. create special-core binary by taking all objects. For this, I took mechanism cpp files generated by nrnivmodl-core and added into coreneuron source tree (temporarily). Then, updated cmake to build special-core directly via cmake i.e. something like (partial diff):

+++ b/coreneuron/CMakeLists.txt
@@ -47,6 +47,13 @@ list(APPEND CORENEURON_CODE_FILES ${PROJECT_BINARY_DIR}/coreneuron/config/config
 set(DIMPLIC_CODE_FILE "mechanism/mech/dimplic.cpp")
 set(ENGINEMECH_CODE_FILE "mechanism/mech/enginemech.cpp")
+file(GLOB CORENEURON_SPECIAL_CORE_FILES "exe/*.cpp")
+list(APPEND CORENEURON_SPECIAL_CORE_FILES "${CMAKE_CURRENT_SOURCE_DIR}/apps/coreneuron.cpp")
+list(APPEND CORENEURON_SPECIAL_CORE_FILES "${DIMPLIC_CODE_FILE}")
+list(APPEND CORENEURON_SPECIAL_CORE_FILES "${ENGINEMECH_CODE_FILE}")
+
 # for external mod files we need to generate modl_ref function in mod_func.c
 set(MODFUNC_PERL_SCRIPT "mechanism/mech/mod_func.c.pl")
@@ -184,6 +191,8 @@ if(CORENRN_ENABLE_MPI AND NOT CORENRN_ENABLE_MPI_DYNAMIC)
 set(CORENRN_MPI_OBJ $<TARGET_OBJECTS:${CORENRN_MPI_LIB_NAME}>)
 endif()
+set(COMPILE_LIBRARY_TYPE OBJECT)
+
 # main coreneuron library
 add_library(
 coreneuron
@@ -319,7 +328,7 @@ add_custom_command(
 "${modfile_directory}"
 WORKING_DIRECTORY ${CMAKE_BINARY_DIR}/bin
 COMMENT "Running nrnivmodl-core with halfgap.mod")
-add_custom_target(nrniv-core ALL DEPENDS ${output_binaries})
+#add_custom_target(nrniv-core ALL DEPENDS ${output_binaries})
 include_directories(${CORENEURON_PROJECT_SOURCE_DIR})
@@ -358,6 +367,12 @@ configure_file("utils/profile/profiler_interface.h"
 # main program required for building special-core
 file(COPY apps/coreneuron.cpp DESTINATION ${CMAKE_BINARY_DIR}/share/coreneuron)
+add_executable(special-core-gpu ${CORENEURON_SPECIAL_CORE_FILES})
+target_compile_options(special-core-gpu BEFORE PRIVATE $<$<COMPILE_LANGUAGE:CXX>:${CORENRN_ACC_FLAGS}>)
+target_compile_definitions(special-core-gpu PUBLIC -DADDITIONAL_MECHS)
+target_link_libraries(special-core-gpu coreneuron ${CMAKE_DL_LIBS})

This created exe using all object files:

[ 87%] Linking CXX executable ../bin/special-core-gpu
cd /gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/build/coreneuron && /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cmake-3.21.4-cdyb7k/bin/cmake -E cmake_link_script CMakeFiles/special-core-gpu.dir/link.txt --verbose=1
/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/llvm-13.0.0-klsplt/bin/clang++ -fopenmp=libomp -Wl,--as-needed -fopenmp=libomp -fopenmp-targets=nvptx64-nvidia-cuda -Wno-unknown-cuda-version -I/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/include CMakeFiles/special-core-gpu.dir/exe/_mod_func.cpp.o CMakeFiles/special-core-gpu.dir/exe/exp2syn.cpp.o CMakeFiles/special-core-gpu.dir/exe/expsyn.cpp.o CMakeFiles/special-core-gpu.dir/exe/halfgap.cpp.o CMakeFiles/special-core-gpu.dir/exe/hh.cpp.o CMakeFiles/special-core-gpu.dir/exe/netstim.cpp.o CMakeFiles/special-core-gpu.dir/exe/passive.cpp.o CMakeFiles/special-core-gpu.dir/exe/pattern.cpp.o CMakeFiles/special-core-gpu.dir/exe/stim.cpp.o CMakeFiles/special-core-gpu.dir/exe/svclmp.cpp.o CMakeFiles/special-core-gpu.dir/apps/coreneuron.cpp.o CMakeFiles/special-core-gpu.dir/mechanism/mech/dimplic.cpp.o CMakeFiles/special-core-gpu.dir/mechanism/mech/enginemech.cpp.o CMakeFiles/coreneuron.dir/apps/corenrn_parameters.cpp.o CMakeFiles/coreneuron.dir/apps/main1.cpp.o CMakeFiles/coreneuron.dir/gpu/nrn_acc_manager.cpp.o CMakeFiles/coreneuron.dir/io/core2nrn_data_return.cpp.o CMakeFiles/coreneuron.dir/io/file_utils.cpp.o CMakeFiles/coreneuron.dir/io/global_vars.cpp.o CMakeFiles/coreneuron.dir/io/lfp.cpp.o CMakeFiles/coreneuron.dir/io/mech_report.cpp.o CMakeFiles/coreneuron.dir/io/mem_layout_util.cpp.o CMakeFiles/coreneuron.dir/io/mk_mech.cpp.o CMakeFiles/coreneuron.dir/io/nrn2core_data_init.cpp.o CMakeFiles/coreneuron.dir/io/nrn_checkpoint.cpp.o CMakeFiles/coreneuron.dir/io/nrn_filehandler.cpp.o CMakeFiles/coreneuron.dir/io/nrn_setup.cpp.o CMakeFiles/coreneuron.dir/io/output_spikes.cpp.o CMakeFiles/coreneuron.dir/io/phase1.cpp.o CMakeFiles/coreneuron.dir/io/phase2.cpp.o CMakeFiles/coreneuron.dir/io/prcellstate.cpp.o CMakeFiles/coreneuron.dir/io/reports/binary_report_handler.cpp.o CMakeFiles/coreneuron.dir/io/reports/nrnreport.cpp.o CMakeFiles/coreneuron.dir/io/reports/report_configuration_parser.cpp.o CMakeFiles/coreneuron.dir/io/reports/report_event.cpp.o CMakeFiles/coreneuron.dir/io/reports/report_handler.cpp.o CMakeFiles/coreneuron.dir/io/reports/sonata_report_handler.cpp.o CMakeFiles/coreneuron.dir/io/setup_fornetcon.cpp.o CMakeFiles/coreneuron.dir/mechanism/capac.cpp.o CMakeFiles/coreneuron.dir/mechanism/eion.cpp.o CMakeFiles/coreneuron.dir/mechanism/mech_mapping.cpp.o CMakeFiles/coreneuron.dir/mechanism/patternstim.cpp.o CMakeFiles/coreneuron.dir/mechanism/register_mech.cpp.o CMakeFiles/coreneuron.dir/network/cvodestb.cpp.o CMakeFiles/coreneuron.dir/network/multisend.cpp.o CMakeFiles/coreneuron.dir/network/multisend_setup.cpp.o CMakeFiles/coreneuron.dir/network/netcvode.cpp.o CMakeFiles/coreneuron.dir/network/netpar.cpp.o CMakeFiles/coreneuron.dir/network/partrans.cpp.o CMakeFiles/coreneuron.dir/network/partrans_setup.cpp.o CMakeFiles/coreneuron.dir/network/tqueue.cpp.o CMakeFiles/coreneuron.dir/permute/balance.cpp.o CMakeFiles/coreneuron.dir/permute/cellorder.cpp.o CMakeFiles/coreneuron.dir/permute/cellorder1.cpp.o CMakeFiles/coreneuron.dir/permute/cellorder2.cpp.o CMakeFiles/coreneuron.dir/permute/data_layout.cpp.o CMakeFiles/coreneuron.dir/permute/node_permute.cpp.o CMakeFiles/coreneuron.dir/sim/fadvance_core.cpp.o CMakeFiles/coreneuron.dir/sim/fast_imem.cpp.o CMakeFiles/coreneuron.dir/sim/finitialize.cpp.o CMakeFiles/coreneuron.dir/sim/multicore.cpp.o CMakeFiles/coreneuron.dir/sim/solve_core.cpp.o CMakeFiles/coreneuron.dir/sim/treeset_core.cpp.o CMakeFiles/coreneuron.dir/utils/ispc/globals.cpp.o CMakeFiles/coreneuron.dir/utils/ivocvect.cpp.o CMakeFiles/coreneuron.dir/utils/lpt.cpp.o CMakeFiles/coreneuron.dir/utils/memory.cpp.o CMakeFiles/coreneuron.dir/utils/memory_utils.cpp.o CMakeFiles/coreneuron.dir/utils/nrn_stats.cpp.o CMakeFiles/coreneuron.dir/utils/nrnoc_aux.cpp.o CMakeFiles/coreneuron.dir/utils/nrntimeout.cpp.o CMakeFiles/coreneuron.dir/utils/progressbar/progressbar.cpp.o CMakeFiles/coreneuron.dir/utils/randoms/nrnran123.cpp.o CMakeFiles/coreneuron.dir/utils/string_utils.cpp.o CMakeFiles/coreneuron.dir/utils/utils.cpp.o CMakeFiles/coreneuron.dir/utils/vrecord.cpp.o CMakeFiles/coreneuron.dir/config/config.cpp.o CMakeFiles/coreneuron.dir/mpi/core/nrnmpi_def_cinc.cpp.o CMakeFiles/coreneuron.dir/mpi/core/nrnmpi.cpp.o CMakeFiles/coreneuron.dir/mpi/core/nrnmpidec.cpp.o -o ../bin/special-core-gpu -Wl,-rpath,/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/lib64 -ldl /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/lib64/libcudart.so

Unfortunately, this still gives similar cryptic error at launch for GPU or CPU execution:

kumbhar@r2i3n6:~/workarena/systems/bbpv/repos/bbp/coreneuron/build$ ./bin/special-core-gpu -e 1 -d ../tests/integration/ring
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
Version : 1.0 4de7951f (2022年01月04日 16:29:44 +0100)
CUDA error: Error returned from cuModuleLoadDataEx
CUDA error: out of memory
Libomptarget error: Unable to generate entries table for device id 0.
Libomptarget error: Failed to init globals on device 0
Libomptarget error: Run with LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
Aborted
kumbhar@r2i3n6:~/workarena/systems/bbpv/repos/bbp/coreneuron/build$ ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu
Info : 4 GPUs shared by 1 ranks per node
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
Version : 1.0 4de7951f (2022年01月04日 16:29:44 +0100)
CUDA error: Error returned from cuModuleLoadDataEx
CUDA error: out of memory
Libomptarget error: Unable to generate entries table for device id 0.
Libomptarget error: Failed to init globals on device 0
Libomptarget error: Run with LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
Aborted

I saw one such error reported here but that doesn't seem relevant.

Next, I wanted to incrementally enable OpenMP offload part and find out what causes above error. So, first thing I did was to disable OpenMP offload parts from mechanisms cpp files i.e. simply something like:

cd coreneuron/exe # this directory contain exp2syn.cpp expsyn.cpp halfgap.cpp hh.cpp _kinderiv.h _mod_func.cpp netstim.cpp passive.cpp pattern.cpp stim.cpp svclmp.cpp
sed -i 's#nrn_pragma_omp#//nrn_pragma_omp#g' *.cpp

And by re-building I saw that the special-core-gpu binary that we build is running fine on CPU or GPU!
So the issue seems to be somehow related to mechanisms cpp files!

$ nvprof ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu
==112850== NVPROF is profiling process 112850, command: ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu
 Info : 4 GPUs shared by 1 ranks per node
 Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
 Version : 1.0 4de7951f (2022年01月04日 16:29:44 +0100)
 Additional mechanisms from files
 exp2syn.mod expsyn.mod halfgap.mod hh.mod netstim.mod passive.mod pattern.mod stim.mod svclmp.mod
Exp2Syn Reg
 Memory (MBs) : After mk_mech : Max 298.0469, Min 298.0469, Avg 298.0469
 Memory (MBs) : After MPI_Init : Max 296.1094, Min 296.1094, Avg 296.1094
 Memory (MBs) : Before nrn_setup : Max 296.1406, Min 296.1406, Avg 296.1406
 WARNING : GPU execution requires --cell-permute type 1 or 2. Setting it to 1.
 Setup Done : 0.00 seconds
 Model size : 84.19 kB
 Memory (MBs) : After nrn_setup : Max 296.4258, Min 296.4258, Avg 296.4258
GENERAL PARAMETERS
--mpi=false
--mpi-lib=
--gpu=true
--dt=0.025
--tstop=1
GPU
--nwarp=65536
--cell-permute=0
--cuda-interface=false
INPUT PARAMETERS
--voltage=-65
--seed=-1
--datpath=../tests/integration/ring
--filesdat=files.dat
--pattern=
--report-conf=
--restore=
PARALLEL COMPUTATION PARAMETERS
--threading=false
--skip_mpi_finalize=false
SPIKE EXCHANGE
--ms_phases=2
--ms_subintervals=2
--multisend=false
--spk_compress=0
--binqueue=false
CONFIGURATION
--spikebuf=100000
--prcellgid=-1
--forwardskip=0
--celsius=6.3
--mindelay=1
--report-buffer-size=4
OUTPUT PARAMETERS
--dt_io=0.1
--outpath=.
--checkpoint=
 Start time (t) = 0
 Memory (MBs) : After mk_spikevec_buffer : Max 296.4258, Min 296.4258, Avg 296.4258
....
 Memory (MBs) : After nrn_finitialize : Max 301.5273, Min 301.5273, Avg 301.5273
psolve |=========================================================| t: 1.00 ETA: 0h00m00s
Solver Time : 0.24223
 Simulation Statistics
 Number of cells: 20
 Number of compartments: 804
 Number of presyns: 21
 Number of input presyns: 0
 Number of synapses: 21
 Number of point processes: 41
 Number of transfer sources: 0
 Number of transfer targets: 0
 Number of spikes: 40
 Number of spikes with non negative gid-s: 40
==112850== Profiling application: ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu
==112850== Profiling result:
 Type Time(%) Time Calls Avg Min Max Name
 GPU activities: 11.68% 20.892ms 160 130.57us 126.98us 134.59us __omp_offloading_2f_188133fe__ZN10coreneuron11nrn_cur_ionEPNS_9NrnThreadEPNS_9Memb_listEi_l273
 9.64% 17.250ms 80 215.63us 186.62us 222.72us __omp_offloading_2f_2979dc5__ZN10coreneuron12nrn_state_hhEPNS_9NrnThreadEPNS_9Memb_listEi_l520
 8.98% 16.057ms 160 100.35us 99.007us 101.92us __omp_offloading_2f_12055abf__ZN10coreneuron22net_buf_receive_ExpSynEPNS_9NrnThreadE_l290
 7.72% 13.816ms 80 172.70us 169.18us 174.59us __omp_offloading_2f_2979dc5__ZN10coreneuron10nrn_cur_hhEPNS_9NrnThreadEPNS_9Memb_listEi_l472
 6.87% 12.293ms 80 153.66us 152.13us 154.94us __omp_offloading_2f_12055abf__ZN10coreneuron14nrn_cur_ExpSynEPNS_9NrnThreadEPNS_9Memb_listEi_l383
 6.82% 12.200ms 80 152.50us 150.37us 154.30us __omp_offloading_2f_3962f8e__ZN10coreneuron11nrn_cur_pasEPNS_9NrnThreadEPNS_9Memb_listEi_l276
 6.07% 10.865ms 80 135.81us 135.23us 136.58us __omp_offloading_2f_12055abf__ZN10coreneuron16nrn_state_ExpSynEPNS_9NrnThreadEPNS_9Memb_listEi_l426
 6.05% 10.826ms 80 135.33us 132.64us 136.80us __omp_offloading_2f_191507f3__ZN10coreneuronL7nrn_lhsEPNS_9NrnThreadE_l166
 6.04% 10.806ms 80 135.07us 133.41us 137.73us __omp_offloading_2f_191507f3__ZN10coreneuronL7nrn_rhsEPNS_9NrnThreadE_l96
 6.01% 10.746ms 80 134.32us 133.73us 135.33us __omp_offloading_2f_198f6a66__ZN10coreneuron21nrn_jacob_capacitanceEPNS_9NrnThreadEPNS_9Memb_listEi_l74
 5.94% 10.621ms 80 132.76us 132.35us 136.48us __omp_offloading_2f_191507c2__ZN10coreneuron8NetCvode12check_threshEPNS_9NrnThreadE_l541
 5.92% 10.589ms 80 132.36us 131.52us 133.31us __omp_offloading_2f_198f6a66__ZN10coreneuron19nrn_cur_capacitanceEPNS_9NrnThreadEPNS_9Memb_listEi_l120
 5.84% 10.446ms 80 130.57us 127.07us 131.42us __omp_offloading_2f_191507f3__ZN10coreneuronL7nrn_rhsEPNS_9NrnThreadE_l37
 5.79% 10.358ms 80 129.48us 128.90us 132.19us __omp_offloading_2f_1d8e98f0__ZN10coreneuron6updateEPNS_9NrnThreadE_l217
 0.33% 594.30us 6 99.050us 2.4320us 446.46us [CUDA memset]
 0.18% 318.05us 246 1.2920us 1.2470us 1.7920us [CUDA memcpy HtoD]
 0.10% 185.74us 125 1.4850us 1.3750us 2.9440us [CUDA memcpy DtoH]
 API calls: 38.95% 214.73ms 1 214.73ms 214.73ms 214.73ms cuDevicePrimaryCtxRetain
 38.43% 211.84ms 1450 146.09us 756ns 275.66us cuStreamSynchronize
 9.61% 52.977ms 1 52.977ms 52.977ms 52.977ms cuModuleLoadDataEx
 4.59% 25.275ms 1 25.275ms 25.275ms 25.275ms cuModuleUnload
 3.69% 20.328ms 6 3.3880ms 8.9020us 20.278ms cudaMallocManaged
 2.10% 11.594ms 84 138.03us 11.220us 146.78us cuMemcpyDtoHAsync
 1.17% 6.4369ms 1280 5.0280us 4.2990us 17.611us cuLaunchKernel
 0.30% 1.6383ms 407 4.0250us 132ns 194.16us cuDeviceGetAttribute
 0.28% 1.5530ms 4 388.24us 385.38us 395.38us cuDeviceTotalMem
 0.26% 1.4583ms 1619 900ns 161ns 581.21us cuCtxSetCurrent
 0.19% 1.0351ms 246 4.2070us 3.6880us 13.495us cuMemcpyHtoDAsync
 0.15% 844.22us 6 140.70us 26.971us 516.61us cudaMemset
 0.11% 591.25us 41 14.420us 11.523us 44.503us cuMemcpyDtoH
 0.07% 407.33us 32 12.729us 1.7660us 190.68us cuStreamCreate
 0.03% 172.74us 4 43.183us 36.483us 57.844us cuDeviceGetName
 0.02% 125.94us 32 3.9350us 2.2320us 27.967us cuStreamDestroy
 0.02% 95.955us 373 257ns 144ns 4.1180us cuGetProcAddress
 0.01% 44.318us 54 820ns 292ns 5.4780us cuModuleGetGlobal
 0.01% 30.319us 41 739ns 377ns 4.3020us cuModuleGetFunction
 0.00% 16.994us 1 16.994us 16.994us 16.994us cuMemAlloc
 0.00% 13.340us 1 13.340us 13.340us 13.340us cudaSetDevice
 0.00% 12.246us 4 3.0610us 985ns 8.0850us cuDeviceGetPCIBusId
 0.00% 5.9690us 9 663ns 141ns 3.1970us cuDeviceGet
 0.00% 2.8740us 1 2.8740us 2.8740us 2.8740us cuDevicePrimaryCtxGetState
 0.00% 2.7800us 2 1.3900us 201ns 2.5790us cuCtxGetLimit
 0.00% 1.7460us 5 349ns 256ns 595ns cuFuncGetAttribute
 0.00% 1.1300us 4 282ns 202ns 443ns cuDeviceGetCount
 0.00% 946ns 4 236ns 161ns 322ns cuDeviceGetUuid
 0.00% 895ns 1 895ns 895ns 895ns cuInit
 0.00% 737ns 1 737ns 737ns 737ns cuDevicePrimaryCtxSetFlags
 0.00% 591ns 1 591ns 591ns 591ns cuCtxGetDevice
 0.00% 371ns 1 371ns 371ns 371ns cuDevicePrimaryCtxRelease
 0.00% 217ns 1 217ns 217ns 217ns cuDriverGetVersion
==112850== Unified Memory profiling result:
Device "Tesla V100-SXM2-16GB (0)"
 Count Avg Size Min Size Max Size Total Size Total Time Name
 4 32.000KB 4.0000KB 60.000KB 128.0000KB 20.64000us Host To Device
 6 32.000KB 4.0000KB 60.000KB 192.0000KB 21.34400us Device To Host
 3 - - - - 575.5490us Gpu page fault groups
 16 4.0000KB 4.0000KB 4.0000KB 64.00000KB - Memory thrashes
Total CPU Page faults: 3
Total CPU thrashes: 16

Then, I was able to isolate the issue to single file exp2syn.cpp - if I comment out all OpenMP offload pragmas in exp2syn.cpp then special-core-gpu binary works fine!

But the funny part with exp2syn.cpp is that this mechanism is not used in the ringtest model at all ! :

$ nvprof ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu --model-stats
==112982== NVPROF is profiling process 112982, command: ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu --model-stats
 Info : 4 GPUs shared by 1 ranks per node
 Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
 Version : 1.0 4de7951f (2022年01月04日 16:29:44 +0100)
 Additional mechanisms from files
 exp2syn.mod expsyn.mod halfgap.mod hh.mod netstim.mod passive.mod pattern.mod stim.mod svclmp.mod
 Memory (MBs) : After mk_mech : Max 298.0469, Min 298.0469, Avg 298.0469
 Memory (MBs) : After MPI_Init : Max 296.1094, Min 296.1094, Avg 296.1094
 Memory (MBs) : Before nrn_setup : Max 296.1406, Min 296.1406, Avg 296.1406
 WARNING : GPU execution requires --cell-permute type 1 or 2. Setting it to 1.
================ MECHANISMS COUNT BY TYPE ==================
 Id Name Count
 0 (null) 0
 1 (null) 0
 2 morphology 0
 3 capacitance 392
 4 pas 372
 5 extracellular 0
 6 fastpas 0
 7 IClamp 0
 8 AlphaSynapse 0
 9 ExpSyn 40
 10 Exp2Syn 0
 11 SEClamp 0
 12 VClamp 0
 13 OClamp 0
 14 APCount 0
 15 na_ion 20
 16 k_ion 20
 17 hh 20
 18 NetStim 1
 19 IntFire1 0
 20 IntFire2 0
 21 IntFire4 0
 22 PointProcessMark 0
 23 PatternStim 0
 24 HalfGap 0

You can see that Exp2Syn count is 0!

In the exp2syn.cpp if I enable single OpenMP pragma e.g. in the functionnrn_state_Exp2Syn() (which is not executed!) then I still get below error:

$ nvprof ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu --model-stats
==113465== NVPROF is profiling process 113465, command: ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu --model-stats
 Info : 4 GPUs shared by 1 ranks per node
 Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
 Version : 1.0 4de7951f (2022年01月04日 16:29:44 +0100)
CUDA error: Error returned from cuModuleLoadDataEx
CUDA error: out of memory
Libomptarget error: Unable to generate entries table for device id 0.
Libomptarget error: Failed to init globals on device 0
Libomptarget error: Run with LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory

During this experimentation I realised that CoreNEURON CMake doesn't set any optimisation flags when LLVM compiler is used. See https://github.com/BlueBrain/CoreNeuron/pull/734/files. So I compiled exp2syn.cpp with optimisation flag -O1 and the error disappeared!

kumbhar@r2i3n6:~/workarena/systems/bbpv/repos/bbp/coreneuron/build/coreneuron$ rm ../bin/special-core-gpu
kumbhar@r2i3n6:~/workarena/systems/bbpv/repos/bbp/coreneuron/build/coreneuron$ /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/llvm-13.0.0-klsplt/bin/clang++ -DADDITIONAL_MECHS -DCORENEURON_BUILD -DCORENEURON_CUDA_PROFILING -DCORENEURON_ENABLE_GPU -DCORENEURON_PREFER_OPENMP_OFFLOAD -DDISABLE_HOC_EXP -DENABLE_SPLAYTREE_QUEUING -DLAYOUT=0 -DNRNMPI=0 -DNRN_MULTISEND=0 -I/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/targets/x86_64-linux/include -I/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/build/include -I/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/coreneuron/utils/randoms -I/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron -I/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/build/coreneuron -isystem /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/include -fopenmp=libomp -fopenmp=libomp -fopenmp-targets=nvptx64-nvidia-cuda -Wno-unknown-cuda-version -I/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/include -std=c++14 -MD -MT coreneuron/CMakeFiles/special-core-gpu.dir/exe/exp2syn.cpp.o -MF CMakeFiles/special-core-gpu.dir/exe/exp2syn.cpp.o.d -o CMakeFiles/special-core-gpu.dir/exe/exp2syn.cpp.o -c /gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/coreneuron/exe/exp2syn.cpp -O1
kumbhar@r2i3n6:~/workarena/systems/bbpv/repos/bbp/coreneuron/build/coreneuron$ make
[ 1%] Built target pyembed
[ 2%] Built target fmt
[ 3%] Built target pywrapper
[ 3%] Built target pyastgen
[ 18%] Built target lexer_obj
[ 18%] Built target lexer
[ 21%] Built target util_obj
[ 22%] Built target util
[ 38%] Built target visitor_obj
[ 39%] Built target visitor
[ 43%] Built target codegen
[ 45%] Built target printer_obj
[ 45%] Built target printer
[ 46%] Built target symtab_obj
[ 47%] Built target symtab
[ 48%] Built target nmodl
[ 48%] Built target nrnivmodl-core
[ 50%] Built target kin_deriv_header
[ 83%] Built target coreneuron
[ 86%] Built target scopmath
Consolidate compiler generated dependencies of target special-core-gpu
[ 87%] Linking CXX executable ../bin/special-core-gpu
[ 93%] Built target special-core-gpu
[100%] Built target coreneuron-copy-nrnivmodl-core-dependencies
kumbhar@r2i3n6:~/workarena/systems/bbpv/repos/bbp/coreneuron/build/coreneuron$ nvprof ../bin/special-core-gpu -e 1 -d ../../tests/integration/ring --gpu --model-stats
==115317== NVPROF is profiling process 115317, command: ../bin/special-core-gpu -e 1 -d ../../tests/integration/ring --gpu --model-stats
 Info : 4 GPUs shared by 1 ranks per node
 Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
 Version : 1.0 4de7951f (2022年01月04日 16:29:44 +0100)
 Additional mechanisms from files
 exp2syn.mod expsyn.mod halfgap.mod hh.mod netstim.mod passive.mod pattern.mod stim.mod svclmp.mod
Exp2Syn Reg
 Memory (MBs) : After mk_mech : Max 297.8047, Min 297.8047, Avg 297.8047
 Memory (MBs) : After MPI_Init : Max 296.2422, Min 296.2422, Avg 296.2422
 Memory (MBs) : Before nrn_setup : Max 296.2734, Min 296.2734, Avg 296.2734
 WARNING : GPU execution requires --cell-permute type 1 or 2. Setting it to 1.
....
 Memory (MBs) : After nrn_finitialize : Max 301.6562, Min 301.6562, Avg 301.6562
psolve |=========================================================| t: 1.00 ETA: 0h00m01s
Solver Time : 0.253621
 Simulation Statistics
 Number of cells: 20
 Number of compartments: 804
 Number of presyns: 21
 Number of input presyns: 0
 Number of synapses: 21
 Number of point processes: 41
 Number of transfer sources: 0
 Number of transfer targets: 0
 Number of spikes: 40
 Number of spikes with non negative gid-s: 40
==115317== Profiling application: ../bin/special-core-gpu -e 1 -d ../../tests/integration/ring --gpu --model-stats
==115317== Profiling result:
 Type Time(%) Time Calls Avg Min Max Name
 GPU activities: 11.69% 21.776ms 160 136.10us 127.74us 143.14us __omp_offloading_2f_188133fe__ZN10coreneuron11nrn_cur_ionEPNS_9NrnThreadEPNS_9Memb_listEi_l273
 9.66% 17.998ms 80 224.98us 198.98us 236.25us __omp_offloading_2f_2979dc5__ZN10coreneuron12nrn_state_hhEPNS_9NrnThreadEPNS_9Memb_listEi_l520
 8.98% 16.731ms 160 104.57us 99.968us 108.10us __omp_offloading_2f_12055abf__ZN10coreneuron22net_buf_receive_ExpSynEPNS_9NrnThreadE_l290
 7.82% 14.565ms 80 182.06us 176.06us 187.84us __omp_offloading_2f_2979dc5__ZN10coreneuron10nrn_cur_hhEPNS_9NrnThreadEPNS_9Memb_listEi_l472
 6.86% 12.780ms 80 159.75us 153.70us 163.58us __omp_offloading_2f_12055abf__ZN10coreneuron14nrn_cur_ExpSynEPNS_9NrnThreadEPNS_9Memb_listEi_l383
 6.82% 12.697ms 80 158.71us 152.13us 163.20us __omp_offloading_2f_3962f8e__ZN10coreneuron11nrn_cur_pasEPNS_9NrnThreadEPNS_9Memb_listEi_l276
 6.04% 11.251ms 80 140.64us 135.58us 144.13us __omp_offloading_2f_12055abf__ZN10coreneuron16nrn_state_ExpSynEPNS_9NrnThreadEPNS_9Memb_listEi_l426
 6.04% 11.250ms 80 140.62us 134.53us 146.59us __omp_offloading_2f_191507f3__ZN10coreneuronL7nrn_rhsEPNS_9NrnThreadE_l96
 6.03% 11.240ms 80 140.50us 135.33us 143.87us __omp_offloading_2f_198f6a66__ZN10coreneuron21nrn_jacob_capacitanceEPNS_9NrnThreadEPNS_9Memb_listEi_l74
 6.01% 11.203ms 80 140.04us 133.09us 145.22us __omp_offloading_2f_191507f3__ZN10coreneuronL7nrn_lhsEPNS_9NrnThreadE_l166
 5.95% 11.076ms 80 138.45us 133.34us 144.96us __omp_offloading_2f_191507c2__ZN10coreneuron8NetCvode12check_threshEPNS_9NrnThreadE_l541
 5.92% 11.035ms 80 137.94us 132.42us 141.38us __omp_offloading_2f_198f6a66__ZN10coreneuron19nrn_cur_capacitanceEPNS_9NrnThreadEPNS_9Memb_listEi_l120
 5.80% 10.795ms 80 134.94us 130.05us 139.90us __omp_offloading_2f_1d8e98f0__ZN10coreneuron6updateEPNS_9NrnThreadE_l217
 5.77% 10.751ms 80 134.39us 129.34us 137.79us __omp_offloading_2f_191507f3__ZN10coreneuronL7nrn_rhsEPNS_9NrnThreadE_l37
 0.32% 598.59us 6 99.765us 2.7200us 447.39us [CUDA memset]
 0.17% 323.17us 246 1.3130us 1.2470us 1.8240us [CUDA memcpy HtoD]
 0.11% 195.62us 126 1.5520us 1.3760us 3.0090us [CUDA memcpy DtoH]
 API calls: 39.19% 222.11ms 1450 153.18us 811ns 3.8766ms cuStreamSynchronize
 38.77% 219.72ms 1 219.72ms 219.72ms 219.72ms cuDevicePrimaryCtxRetain
 9.24% 52.366ms 1 52.366ms 52.366ms 52.366ms cuModuleLoadDataEx
 4.44% 25.182ms 1 25.182ms 25.182ms 25.182ms cuModuleUnload
 3.59% 20.330ms 6 3.3883ms 9.0800us 20.277ms cudaMallocManaged
 2.13% 12.063ms 84 143.60us 11.176us 157.59us cuMemcpyDtoHAsync
 1.20% 6.7851ms 1280 5.3000us 4.5340us 19.189us cuLaunchKernel
 0.29% 1.6442ms 407 4.0390us 133ns 187.16us cuDeviceGetAttribute
 0.28% 1.5656ms 4 391.41us 387.70us 395.00us cuDeviceTotalMem
 0.25% 1.4421ms 1619 890ns 159ns 579.67us cuCtxSetCurrent
 0.19% 1.0798ms 246 4.3890us 3.7420us 17.665us cuMemcpyHtoDAsync
 0.15% 853.90us 6 142.32us 27.310us 519.82us cudaMemset
 0.11% 647.30us 42 15.411us 11.264us 49.087us cuMemcpyDtoH
 0.07% 412.45us 32 12.889us 1.7620us 191.92us cuStreamCreate
 0.03% 168.17us 4 42.041us 36.633us 56.245us cuDeviceGetName
 0.02% 112.19us 32 3.5050us 2.1360us 21.674us cuStreamDestroy
 0.02% 108.93us 373 292ns 138ns 4.2190us cuGetProcAddress
 0.01% 44.237us 55 804ns 401ns 3.3740us cuModuleGetGlobal
 0.01% 34.313us 42 816ns 354ns 6.8130us cuModuleGetFunction
 0.00% 14.817us 1 14.817us 14.817us 14.817us cudaSetDevice
 0.00% 13.742us 1 13.742us 13.742us 13.742us cuMemAlloc
 0.00% 11.865us 4 2.9660us 937ns 7.8230us cuDeviceGetPCIBusId
 0.00% 7.1150us 9 790ns 142ns 4.0720us cuDeviceGet
 0.00% 2.6710us 1 2.6710us 2.6710us 2.6710us cuDevicePrimaryCtxGetState
 0.00% 2.5190us 2 1.2590us 224ns 2.2950us cuCtxGetLimit
 0.00% 2.2450us 5 449ns 231ns 744ns cuFuncGetAttribute
 0.00% 1.2930us 4 323ns 180ns 528ns cuDeviceGetCount
 0.00% 1.1640us 1 1.1640us 1.1640us 1.1640us cuInit
 0.00% 984ns 4 246ns 162ns 333ns cuDeviceGetUuid
 0.00% 674ns 1 674ns 674ns 674ns cuDevicePrimaryCtxSetFlags
 0.00% 502ns 1 502ns 502ns 502ns cuCtxGetDevice
 0.00% 359ns 1 359ns 359ns 359ns cuDevicePrimaryCtxRelease
 0.00% 215ns 1 215ns 215ns 215ns cuDriverGetVersion
..

So using #734 I am able to avoid the CUDA error: out of memory error.

TODO: Go back to nrnivmodl-core based build and see how this could be tested there. (tomorrow)