ggerganov
Oct 11, 2025
Maintainer

Overview

In this guide we will configure the NVIDIA DGXTM Spark as a local and private AI assistant using the ggml software stack. The guide is geared towards developers and builders. We are going to setup the following AI capabilities:

AI chat
AI coding agent
Inline text completion service
Embeddings service
Vision service
Speech-to-text (STT) service

These features will run simultaneously, in your local network, allowing you to fully utilize the power of your device at home or in the office.

image

Software

We are going to use the following open-source software:

ggml-org/llama.cpp
ggml-org/whisper.cpp
ggml-org/llama.vim (vim plugin)
ggml-org/llama.vscode (VSCode extension)

Setup

Simply run the following command in a terminal on your NVIDIA DGXTM Spark:

bash <(curl -s https://ggml.ai/dgx-spark.sh)

Note

The dgx-spark.sh script above is quite basic and is merely one of the many possible ways you can configure your device for AI use cases. It is provided here mainly for convenience and as an example. Feel free to inspect it and adjust it for your needs.

The command downloads and builds the latest version of the ggml software stack and starts multiple HTTP REST services as shown in the following table:

Port	Base URL	Model	Typical Use‐Case
8021	`http://localhost:8021`	Gemma	Generate text embeddings
8022	`http://localhost:8022`	Qwen	Fill‐in‐the‐middle (in‐fill) text generation
8023	`http://localhost:8023`	GPT-OSS	General‐purpose LLM completions, chat and tool use
8024	`http://localhost:8024`	Gemma (Vision)	Vision tasks – image‐to‐text, multimodal inference
8025	`http://localhost:8025`	Whisper	Speech‐to‐text transcription

The first time running the command can take a few minutes to download the model weights. If everything goes OK, you should see the following output:

image

At this point, the machine is fully configured and ready to be used. Internet connection is no longer necessary.

Here's sample output of nvidia-smi while the ggml services are running:

image

Use cases

Here is a small fraction of the AI use cases that are possible with this configuration.

Basic chat

Simply point your browser to the chat endpoint http://localhost:8023:

image

Inline code completions (FIM)

Install the llama.vim plugin in your Vim/Neovim editor and configure it to use the FIM endpoint http://localhost:8022:
image
In VSCode, install the llama.vscode extension and configure it in a similar way to use the FIM endpoint:
image

Coding agent

In VSCode, configure the llama.vscode extension to use the endpoints for completions, chat, embeddings and tools:

image

Document and image processing

Submit PDFs and image documents in the WebUI to analyze them with a multimodal LLM. For visuals, use the vision endpoint http://localhost:8024:

image

Audio transcription

Use the speech-to-text endpoint at http://localhost:8025 to quickly transcribe audio files:

image

Performance

For performance numbers, see Performance of llama.cpp on NVIDIA DGX Spark

Conclusion

The new NVIDIA DGX Spark is a great choice for serving the latest AI models locally and privately. With 128GB of unified system memory it has the capacity to host multiple AI services simultaneously. And the ggml software stack is the best way to do that.

Replies: 3 comments 3 replies

mfoldes
Oct 19, 2025

This is an outstanding guide and I was able to get everything up and running on the Spark with minimal fuss.

I did run into an interesting issue. Right now out of the box with the latest updates, it looks like my nvcc version was reported as this

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Aug_20_01:57:39_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0

and my gcc version was

gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Apparently, there were some changes made to the gcc vector type definitions starting with version 13, so I had to revert and rebuild with gcc 12 by appending these headers to your shell script

...{everything before line 48}
printf "[I] Installing llama.cpp\n"
git clone https://github.com/ggml-org/llama.cpp ~/ggml-org/llama.cpp
cd ~/ggml-org/llama.cpp
cmake -B build-cuda -DCMAKE_C_COMPILER=/usr/bin/gcc-12 -DCMAKE_CXX_COMPILER=/usr/bin/g++-12 -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-12 -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc -DGGML_CUDA=ON
cmake --build build-cuda -j
printf "[I] Installing whisper.cpp\n"
git clone https://github.com/ggml-org/whisper.cpp ~/ggml-org/whisper.cpp
cd ~/ggml-org/whisper.cpp
cmake -B build-cuda -DCMAKE_C_COMPILER=/usr/bin/gcc-12 -DCMAKE_CXX_COMPILER=/usr/bin/g++-12 -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-12 -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc -DGGML_CUDA=ON
cmake --build build-cuda -j
...{everything after line 58}

2 replies

@ggerganov

ggerganov Oct 20, 2025
Maintainer Author

Thanks for the feedback.

That's weird - I have the same config as you and there are no issues with the build:

ggml@spark-17ed:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Aug_20_01:57:39_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0
ggml@spark-17ed:~$ gcc --version
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
ggml@spark-17ed:~$ bash <(curl -s https://ggml.ai/dgx-spark.sh)
[I] Setting up NVIDIA DGX Spark for local AI with ggml
[W] The directory '~/ggml-org' already exists and will be deleted. Continue? (y/N) y
[I] Proceeding...
[I] Installing llama.cpp
Cloning into '/home/ggml/ggml-org/llama.cpp'...
remote: Enumerating objects: 65253, done.
remote: Counting objects: 100% (104/104), done.
remote: Compressing objects: 100% (86/86), done.
remote: Total 65253 (delta 72), reused 18 (delta 18), pack-reused 65149 (from 4)
Receiving objects: 100% (65253/65253), 178.28 MiB | 20.42 MiB/s, done.
Resolving deltas: 100% (47464/47464), done.
-- The C compiler identification is GNU 13.3.0
-- The CXX compiler identification is GNU 13.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
...
[I] Downloading Whisper model...
[P] Starting service on port 8021 (embd) ...
[P] Starting service on port 8022 (fim) ...
[P] Starting service on port 8023 (chat, tools) ...
[P] Starting service on port 8024 (vision) ...
[P] Starting service on port 8025 (stt) ...
[I] Downloading models and waiting for services to become healthy - please wait ... (this can take a long time)
[P] Service on port 8021 is ready (waiting for 4 services to initialize ...)
[P] Service on port 8022 is loading model ... (waiting for 4 services to initialize ...)
[P] Service on port 8024 is loading model ... (waiting for 4 services to initialize ...)
[P] Service on port 8025 is ready (waiting for 3 services to initialize ...)
[P] Service on port 8023 is loading model ... (waiting for 3 services to initialize ...)
[P] Service on port 8024 is ready (waiting for 2 services to initialize ...)
[P] Service on port 8022 is ready (waiting for 1 services to initialize ...)
[P] Service on port 8023 is ready (waiting for 0 services to initialize ...)
[I] All ggml services are up and ready - your NVIDIA DGX Spark is ready to use!
[I] Entering monitoring loop (Ctrl-C to stop)

@mfoldes

mfoldes Oct 20, 2025

It was a weird issue for sure and the build environments should be identical since I'd have to assume you're running the same Ubuntu variant that comes with all DGX Sparks, but I couldn't for the life of me get it to compile. But the workaround sorted it out whereas before I was getting this.

-- The C compiler identification is GNU 13.3.0
-- The CXX compiler identification is GNU 13.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMAKE_BUILD_TYPE=Release
-- Found Git: /usr/bin/git (found version "2.43.0") 
-- The ASM compiler identification is GNU
-- Found assembler: /usr/bin/cc
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE 
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: aarch64
-- GGML_SYSTEM_ARCH: ARM
-- Including CPU backend
-- Found OpenMP_C: -fopenmp (found version "4.5") 
-- Found OpenMP_CXX: -fopenmp (found version "4.5") 
-- Found OpenMP: TRUE (found version "4.5") 
-- ARM detected
-- Performing Test GGML_COMPILER_SUPPORTS_FP16_FORMAT_I3E
-- Performing Test GGML_COMPILER_SUPPORTS_FP16_FORMAT_I3E - Failed
-- ARM -mcpu not found, -mcpu=native will be used
-- Performing Test GGML_MACHINE_SUPPORTS_dotprod
-- Performing Test GGML_MACHINE_SUPPORTS_dotprod - Failed
-- Performing Test GGML_MACHINE_SUPPORTS_nodotprod
-- Performing Test GGML_MACHINE_SUPPORTS_nodotprod - Failed
-- Performing Test GGML_MACHINE_SUPPORTS_i8mm
-- Performing Test GGML_MACHINE_SUPPORTS_i8mm - Failed
-- Performing Test GGML_MACHINE_SUPPORTS_noi8mm
-- Performing Test GGML_MACHINE_SUPPORTS_noi8mm - Failed
-- Performing Test GGML_MACHINE_SUPPORTS_sve
-- Performing Test GGML_MACHINE_SUPPORTS_sve - Failed
-- Performing Test GGML_MACHINE_SUPPORTS_nosve
-- Performing Test GGML_MACHINE_SUPPORTS_nosve - Failed
-- Performing Test GGML_MACHINE_SUPPORTS_sme
-- Performing Test GGML_MACHINE_SUPPORTS_sme - Failed
-- Performing Test GGML_MACHINE_SUPPORTS_nosme
-- Performing Test GGML_MACHINE_SUPPORTS_nosme - Failed
-- ARM feature FMA enabled
-- Adding CPU backend variant ggml-cpu: -mcpu=native 
-- Found CUDAToolkit: /usr/local/cuda/targets/sbsa-linux/include (found version "13.0.88") 
-- CUDA Toolkit found
-- Using CUDA architectures: native
CMake Error at /usr/share/cmake-3.28/Modules/CMakeDetermineCompilerId.cmake:780 (message):
 Compiling the CUDA compiler identification source file
 "CMakeCUDACompilerId.cu" failed.
 Compiler: /usr/bin/nvcc
 Build flags:
 Id flags: --keep;--keep-dir;tmp -v
 The output was:
 1
 #$ _NVVM_BRANCH_=nvvm
 #$ _SPACE_=
 #$ _CUDART_=cudart
 #$ _HERE_=/usr/lib/nvidia-cuda-toolkit/bin
 #$ _THERE_=/usr/lib/nvidia-cuda-toolkit/bin
 #$ _TARGET_SIZE_=
 #$ _TARGET_DIR_=
 #$ _TARGET_SIZE_=64
 #$ NVVMIR_LIBRARY_DIR=/usr/lib/nvidia-cuda-toolkit/libdevice
 #$
 PATH=/usr/lib/nvidia-cuda-toolkit/bin:/home/holycudabatman/.local/bin/:/home/holycudabatman/.vscode-server/cli/servers/Stable-7d842fb85a0275a4a8e4d7e040d2625abbf7f084/server/bin/remote-cli:/usr/local/cuda/bin:/opt/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/holycudabatman/ngc-cli:/home/holycudabatman/ngc-cli:/home/holycudabatman/.vscode-server/extensions/ms-python.debugpy-2025141-linux-arm64/bundled/scripts/noConfigScripts:/home/holycudabatman/.vscode-server/data/User/globalStorage/github.copilot-chat/debugCommand
 #$ LIBRARIES= -L/usr/lib/aarch64-linux-gnu/stubs
 -L/usr/lib/aarch64-linux-gnu
 #$ rm tmp/a_dlink.reg.c
 #$ gcc -D__CUDA_ARCH_LIST__=520 -E -x c++ -D__CUDACC__ -D__NVCC__
 -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=0
 -D__CUDACC_VER_BUILD__=140 -D__CUDA_API_VER_MAJOR__=12
 -D__CUDA_API_VER_MINOR__=0 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include
 "cuda_runtime.h" "CMakeCUDACompilerId.cu" -o
 "tmp/CMakeCUDACompilerId.cpp4.ii"
 #$ cudafe++ --c++17 --gnu_version=130300 --display_error_number
 --orig_src_file_name "CMakeCUDACompilerId.cu" --orig_src_path_name
 "/home/holycudabatman/ggml-org/llama.cpp/build-cuda/CMakeFiles/3.28.3/CompilerIdCUDA/CMakeCUDACompilerId.cu"
 --allow_managed --unsigned_chars --unsigned_wchar_t --m64 --parse_templates
 --gen_c_file_name "tmp/CMakeCUDACompilerId.cudafe1.cpp" --stub_file_name
 "CMakeCUDACompilerId.cudafe1.stub.c" --gen_module_id_file
 --module_id_file_name "tmp/CMakeCUDACompilerId.module_id"
 "tmp/CMakeCUDACompilerId.cpp4.ii"
 #$ gcc -D__CUDA_ARCH__=520 -D__CUDA_ARCH_LIST__=520 -E -x c++
 -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__
 -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=0
 -D__CUDACC_VER_BUILD__=140 -D__CUDA_API_VER_MAJOR__=12
 -D__CUDA_API_VER_MINOR__=0 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include
 "cuda_runtime.h" "CMakeCUDACompilerId.cu" -o
 "tmp/CMakeCUDACompilerId.cpp1.ii"
 #$ cicc --c++17 --gnu_version=130300 --display_error_number
 --orig_src_file_name "CMakeCUDACompilerId.cu" --orig_src_path_name
 "/home/holycudabatman/ggml-org/llama.cpp/build-cuda/CMakeFiles/3.28.3/CompilerIdCUDA/CMakeCUDACompilerId.cu"
 --allow_managed --unsigned_chars --unsigned_wchar_t -arch compute_52 -m64
 --no-version-ident -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1
 --include_file_name "CMakeCUDACompilerId.fatbin.c" -tused
 --module_id_file_name "tmp/CMakeCUDACompilerId.module_id" --gen_c_file_name
 "tmp/CMakeCUDACompilerId.cudafe1.c" --stub_file_name
 "tmp/CMakeCUDACompilerId.cudafe1.stub.c" --gen_device_file_name
 "tmp/CMakeCUDACompilerId.cudafe1.gpu" "tmp/CMakeCUDACompilerId.cpp1.ii" -o
 "tmp/CMakeCUDACompilerId.ptx"
 /usr/include/aarch64-linux-gnu/bits/math-vector.h(96): error: identifier
 "__Float32x4_t" is undefined
 /usr/include/aarch64-linux-gnu/bits/math-vector.h(97): error: identifier
 "__Float64x2_t" is undefined
 /usr/include/aarch64-linux-gnu/bits/math-vector.h(106): error: identifier
 "__SVFloat32_t" is undefined
 /usr/include/aarch64-linux-gnu/bits/math-vector.h(107): error: identifier
 "__SVFloat64_t" is undefined
 /usr/include/aarch64-linux-gnu/bits/math-vector.h(108): error: identifier
 "__SVBool_t" is undefined
 5 errors detected in the compilation of "CMakeCUDACompilerId.cu".
 # --error 0x1 --
Call Stack (most recent call first):
 /usr/share/cmake-3.28/Modules/CMakeDetermineCompilerId.cmake:8 (CMAKE_DETERMINE_COMPILER_ID_BUILD)
 /usr/share/cmake-3.28/Modules/CMakeDetermineCompilerId.cmake:53 (__determine_compiler_id_test)
 /usr/share/cmake-3.28/Modules/CMakeDetermineCUDACompiler.cmake:135 (CMAKE_DETERMINE_COMPILER_ID)
 ggml/src/ggml-cuda/CMakeLists.txt:41 (enable_language)
-- Configuring incomplete, errors occurred!
gmake: Makefile: No such file or directory
gmake: *** No rule to make target 'Makefile'. Stop.

openmarmot
Oct 22, 2025

just got my spark, had to install the libcurl dev package so cmake could find curl but otherwise no issues
sudo apt install libcurl4-openssl-dev

1 reply

@ggerganov

ggerganov Oct 22, 2025
Maintainer Author

Thanks for the report - fixed it with ggml-org/ggml-org.github.io@c13a770. Any additional feedback is highly appreciated.

nbergantino
Oct 25, 2025

Why is the model LLama 3.3 70B (4 bits quantization) so slow on the DGX Spark, using the latest version of llama.cpp (checkout 25 october 2025) ?

On other consumer architectures this model is very fast with llama.cpp. Can you give me some suggests ? Thank you

command line (the model was quantized 2 days ago with the latest version of llama.cpp):
llama-cli -m /code/cpp/llamacpp/models/meta/Llama-3.3-Instruct-70B-Q4-K-M.gguf -p "write a paragraph about the quantum computing"

In the following a video.
https://github.com/user-attachments/assets/bd5dd08b-6bee-47a9-8bff-5400c19db889

0 replies

guide : setting up NVIDIA DGX Spark with ggml #16514

Uh oh!

Uh oh!

ggerganov Oct 11, 2025 Maintainer

Overview

Software

Setup

Use cases

Basic chat

Inline code completions (FIM)

Coding agent

Document and image processing

Audio transcription

Performance

Conclusion

Replies: 3 comments · 3 replies

Uh oh!

Uh oh!

mfoldes Oct 19, 2025

Uh oh!

ggerganov Oct 20, 2025 Maintainer Author

Uh oh!

mfoldes Oct 20, 2025

Uh oh!

openmarmot Oct 22, 2025

Uh oh!

ggerganov Oct 22, 2025 Maintainer Author

Uh oh!

nbergantino Oct 25, 2025

ggerganov
Oct 11, 2025
Maintainer

Replies: 3 comments 3 replies

mfoldes
Oct 19, 2025

ggerganov Oct 20, 2025
Maintainer Author

openmarmot
Oct 22, 2025

ggerganov Oct 22, 2025
Maintainer Author

nbergantino
Oct 25, 2025