DGX Containers

NVIDIA DGX Spark · GB10 · ARM64 · CUDA 13

Build vertexnova/gsplat-spark and vertexnova/aiml-spark from the same PyTorch base image. Work through the tabs in order: prerequisites (once), then each image, then verification and day-to-day use, with the summary and known limitations last. For raw Docker commands, open Docker on Spark in another tab; it follows the same tab and copy layout.

DGX OS 7.4.0 · CUDA 13.0.2 · Driver 580.142 · ARM64 · GB10 · SM 12.0 · 128 GB unified memory

Prerequisites — do once after first boot

1 Disable swap (done once)

DGX Spark uses unified memory: the CPU and GPU share the same 128 GB pool. With swap enabled, a heavy GPU workload can trigger a death spiral: training fills memory, the OS starts swapping to disk, and the machine freezes. With swap disabled, the job that runs out of memory crashes cleanly instead of freezing the machine.

# Comment out swap in fstab
sudo nano /etc/fstab
# Change: /swap.img  none  swap  sw  0  0
# To:   # /swap.img  none  swap  sw  0  0

# Apply immediately
sudo swapoff -a
swapon --show    # should print nothing
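
The same check can be scripted. The sketch below is not part of the original setup; it assumes a Linux host exposing /proc/meminfo, and the helper name swap_total_kib is hypothetical.

```python
from pathlib import Path


def swap_total_kib(meminfo_text: str) -> int:
    """Return the SwapTotal value (in KiB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            # The line looks like: "SwapTotal:       0 kB"
            return int(line.split()[1])
    raise ValueError("SwapTotal not found in meminfo text")


# Assumption: running on the Spark host (or any Linux machine).
meminfo = Path("/proc/meminfo")
if meminfo.exists():
    total = swap_total_kib(meminfo.read_text())
    print("swap disabled" if total == 0 else f"swap still active: {total} KiB")
```

A non-zero value after a reboot usually means the fstab entry is still active.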
2 Verify GPU access from Docker (done once)
docker run --rm --gpus all \
  nvcr.io/nvidia/cuda:13.0.0-base-ubuntu24.04 nvidia-smi
Expected output: NVIDIA-SMI 580.142 · Driver 580.142 · CUDA 13.0 · GB10
3 Pull base image (done once)

All containers build on NVIDIA's official PyTorch container — Blackwell-optimised, ARM64 native, CUDA 13, PyTorch 2.7.

docker pull nvcr.io/nvidia/pytorch:25.03-py3
# Size: 21.8GB — download once, reuse for all containers
4 Standard run flags (always use)

Use these flags for every container on DGX Spark. Without --ipc=host, the container's shared memory is capped at Docker's 64 MB default, and PyTorch can crash during training.

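# --ipc=host                 shared memory, required for PyTorch
# --ulimit memlock=-1        no memory lock limit
# --ulimit stack=67108864    64 MB stack
# -v ~/workspace:/workspace  persist files
# -p 7007:7007               nerfstudio viewer
# (Comments sit above the command: text after a trailing "\" breaks the line continuation.)
docker run --gpus all -it \
  --name CONTAINER_NAME \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -v ~/workspace:/workspace \
  -p 7007:7007 \
  IMAGE_NAME bash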
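
Because every launch in this guide repeats the same flags, a small helper can assemble them. This is a minimal sketch, not part of the original workflow: docker_run_command is a hypothetical name, and it assumes the workspace lives at ~/workspace as above. It only prints the command; it does not run Docker.

```python
import shlex
from pathlib import Path


def docker_run_command(name, image, ports=()):
    """Build a `docker run` argument list with the standard DGX Spark flags."""
    workspace = Path.home() / "workspace"  # expanded here; no shell to expand "~"
    cmd = [
        "docker", "run", "--gpus", "all", "-it",
        "--name", name,
        "--ipc=host",                   # shared memory for PyTorch
        "--ulimit", "memlock=-1",       # no memory lock limit
        "--ulimit", "stack=67108864",   # 64 MB stack
        "-v", f"{workspace}:/workspace",
    ]
    for port in ports:
        cmd += ["-p", f"{port}:{port}"]  # publish each port on the same number
    return cmd + [image, "bash"]


cmd = docker_run_command("gsplat-dev", "vertexnova/gsplat-spark:v1", ports=[7007])
print(shlex.join(cmd))  # copy-pasteable shell form of the argument list
```

Returning a list rather than a string keeps the arguments safe to pass to subprocess.run without shell quoting.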
5 CUDA arch for GB10 (critical)

The GB10 is compute capability SM 12.0. PyTorch 2.7 in the NVIDIA container does not recognise 12.1 — use 12.0 when building any CUDA extension from source.

export TORCH_CUDA_ARCH_LIST="12.0"
# Set this before: pip install gsplat, any CUDA extension build
Using 12.1 raises an unknown-arch error in this PyTorch build. Use 12.0 only.
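
A quick guard before any source build can catch a wrong or missing value. The sketch below is illustrative only: arch_list_matches is a hypothetical helper, and it assumes the variable holds space- or semicolon-separated entries with an optional +PTX suffix.

```python
import os


def arch_list_matches(arch_list: str, expected: str = "12.0") -> bool:
    """True if every entry in a TORCH_CUDA_ARCH_LIST value equals `expected`."""
    # Entries may be separated by spaces or semicolons, e.g. "12.0" or "8.0;12.0+PTX".
    entries = arch_list.replace(";", " ").split()
    return bool(entries) and all(e.removesuffix("+PTX") == expected for e in entries)


value = os.environ.get("TORCH_CUDA_ARCH_LIST", "")
if arch_list_matches(value):
    print("TORCH_CUDA_ARCH_LIST is set to 12.0")
else:
    print(f"TORCH_CUDA_ARCH_LIST={value!r}: export it as 12.0 before building")
```

Running it in the same shell as the pip build confirms the export took effect.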

vertexnova/gsplat-spark:v1

3D Gaussian Splatting workspace — gsplat 1.5.3, PyTorch 2.7, ARM64 native.

PyTorch 2.7.0a0 (NVIDIA build, pre-installed)
gsplat 1.5.3 (built from source, SM 12.0)
CUDA 13.0.2 (pre-installed, SM 12.0)
gdown 6.0.0 (scene download utility)
1 Launch base container
docker run --gpus all -it \
  --name gsplat-build \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -v ~/workspace:/workspace \
  -p 7007:7007 \
  nvcr.io/nvidia/pytorch:25.03-py3 bash
2 Verify GPU inside container
python3 -c "import torch; print(torch.cuda.get_device_name(0))"
# → NVIDIA GB10
3 Install gsplat from source

gsplat must be built from source: no precompiled ARM64 + CUDA 13 wheel exists on PyPI. The --no-build-isolation flag makes pip build against the PyTorch already installed in the container instead of trying to install a conflicting PyTorch version.

export TORCH_CUDA_ARCH_LIST="12.0"

pip install git+https://github.com/nerfstudio-project/gsplat.git \
  --no-build-isolation

# Compilation takes 5-10 minutes — normal
Do not point pip at the public PyTorch cu130 index here; it conflicts with the NVIDIA container build.
4 Verify gsplat
python3 -c "import gsplat; print('gsplat OK', gsplat.__version__)"
# → gsplat OK 1.5.3
5 Exit and commit
exit

docker commit gsplat-build vertexnova/gsplat-spark:v1
docker images | grep vertexnova

vertexnova/aiml-spark:v1

CV, tabular work, light 3D mesh I/O, and JupyterLab on top of the same NVIDIA PyTorch base. NumPy, pandas, matplotlib, and scikit-learn already ship in the base image.

PyTorch 2.7.0a0 (NVIDIA base image)
OpenCV 4.10.0 (headless, pip stage 1)
JupyterLab latest (pip stage 4)
trimesh 4.x, plyfile 1.x (mesh + PLY I/O)
scikit-image 0.26, imageio 2.37 (pip stage 1)
seaborn, rich, tqdm (pip stage 2)
Not covered here: Open3D (no stable ARM64 wheel in this flow), TensorFlow on Python 3.12 (use a separate TF-focused image if needed).
1 Launch base container
docker run --gpus all -it \
  --name aiml-dev \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -v ~/workspace:/workspace \
  nvcr.io/nvidia/pytorch:25.03-py3 bash
2 Stage 1 — computer vision
pip install \
  opencv-python-headless \
  scikit-image \
  imageio \
  Pillow
3 Stage 2 — data science
pip install \
  seaborn \
  tqdm \
  rich
# pandas, matplotlib, scikit-learn, numpy already in base image
4 Stage 3 — 3D geometry
pip install trimesh plyfile
# open3d skipped — no ARM64 wheel available
5 Stage 4 — dev tools
pip install \
  jupyterlab \
  ipywidgets \
  gdown \
  python-dotenv \
  pyyaml \
  requests
6 Exit and commit
exit

docker commit aiml-dev vertexnova/aiml-spark:v1
docker images | grep vertexnova
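
Before committing, it can help to confirm that all four pip stages actually landed. The sketch below is an assumption-labelled helper, not part of the original steps: missing_modules is a hypothetical name, and the list uses each package's import name, which differs from its pip name in a few cases (cv2, skimage, PIL, dotenv, yaml).

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


# Import names for the packages installed in stages 1 to 4.
STAGE_MODULES = [
    "cv2", "skimage", "imageio", "PIL",          # stage 1: computer vision
    "seaborn", "tqdm", "rich",                   # stage 2: data science
    "trimesh", "plyfile",                        # stage 3: 3D geometry
    "jupyterlab", "ipywidgets", "gdown",         # stage 4: dev tools
    "dotenv", "yaml", "requests",
]

missing = missing_modules(STAGE_MODULES)
print("all stages installed" if not missing else f"missing: {missing}")
```

find_spec only locates each module without importing it, so the check stays fast even for heavy packages.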

Verify & launch containers

1 Health checks after every reboot
# 1. GPU driver OK
nvidia-smi

# 2. Docker GPU access OK
docker run --rm --gpus all \
  nvcr.io/nvidia/cuda:13.0.0-base-ubuntu24.04 nvidia-smi

# 3. Images intact
docker images | grep vertexnova

# 4. Optional — containers you expect to stay up
docker ps | grep YOUR_SERVICE
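
The first two checks can be wrapped in a script that reports pass or fail per command. This is a rough sketch under stated assumptions: run_checks and CHECKS are hypothetical names, and a check counts as passed only when the command exits with status 0.

```python
import subprocess


def run_checks(checks, timeout=120):
    """Run each named command; map name -> True when it exits with status 0."""
    results = {}
    for name, argv in checks.items():
        try:
            proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
            results[name] = proc.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            # A missing binary or a hung command counts as a failed check.
            results[name] = False
    return results


# Commands taken from the health-check list above.
CHECKS = {
    "gpu driver": ["nvidia-smi"],
    "docker gpu access": [
        "docker", "run", "--rm", "--gpus", "all",
        "nvcr.io/nvidia/cuda:13.0.0-base-ubuntu24.04", "nvidia-smi",
    ],
}
```

On the Spark, calling run_checks(CHECKS) returns a dictionary such as {"gpu driver": True, ...}; any False value points at the step to investigate.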
2 Launch gsplat container
docker run --gpus all -it \
  --name gsplat-dev \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -v ~/workspace:/workspace \
  -p 7007:7007 \
  vertexnova/gsplat-spark:v1 bash

# Inside — verify
python3 -c "import gsplat; print('gsplat', gsplat.__version__)"
3 Launch AI/ML container
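# The build container from the image steps is also named aiml-dev.
# If it still exists, remove it first: docker rm aiml-dev
docker run --gpus all -it \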
  --name aiml-dev \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -v ~/workspace:/workspace \
  -p 8888:8888 \
  vertexnova/aiml-spark:v1 bash

# Inside — verify everything
python3 -c "
import torch, cv2, numpy, pandas, trimesh, plyfile
print('torch:', torch.__version__, '| GPU:', torch.cuda.get_device_name(0))
print('opencv:', cv2.__version__)
print('trimesh:', trimesh.__version__)
print('ALL OK')
"
4 Launch JupyterLab
# Inside aiml-dev container
jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root

# On your laptop browser
# http://dgx-spark.local:8888
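
If the page does not load, a plain TCP probe from the laptop tells a networking problem apart from a JupyterLab problem. The sketch below is illustrative: port_open is a hypothetical helper, and dgx-spark.local is the hostname assumed above.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Calling port_open("dgx-spark.local", 8888) from the laptop should return True while JupyterLab is running and the container was started with -p 8888:8888.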
5 Re-enter stopped container
# Check if stopped
docker ps -a | grep dev

# Restart and re-enter
docker start aiml-dev
docker exec -it aiml-dev bash

# Or one line
docker start -ai aiml-dev
6 Save work after install session
exit
docker commit aiml-dev vertexnova/aiml-spark:v2
docker images | grep vertexnova

Container registry summary

vertexnova/gsplat-spark · v1 · built

3D Gaussian Splatting workspace.

gsplat 1.5.3 · PyTorch 2.7 · CUDA 13 · ARM64 · gdown
Size: 21.9 GB

vertexnova/aiml-spark · v1 · built

General AI/ML workspace.

PyTorch 2.7 · OpenCV 4.10 · trimesh · plyfile · JupyterLab · pandas · scikit-learn · seaborn
Size: 22.0 GB

vertexnova/nerfstudio-spark · planned

Full nerfstudio with COLMAP for training 3DGS from custom scenes.

nerfstudio · COLMAP · gsplat · Open3D

vertexnova/vertexnova-spark · planned

C++ rendering engine development: Vulkan, CMake, SPIRV-Cross.

CMake · Vulkan SDK · SPIRV-Cross · clang

Known ARM64 + CUDA 13 limitations

open3d (no wheel): No ARM64 binary. Must build from source (~60 min) or use trimesh instead.
tensorflow-cpu (no wheel): No Python 3.12 ARM64 wheel. Use dedicated TF container with Python 3.10.
vllm (fragile): No stable cu130 aarch64 release. Pin to specific nightly wheel.
flash-attn (fragile): aarch64 builds exist but need TORCH_CUDA_ARCH_LIST=12.0 at build time.
CUDA 12.x packages (incompatible): DGX Spark only has libcudart.so.13. Any package built against CUDA 12 will fail with ImportError.
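
One way to spot a CUDA 12 build before it fails at import time is to scan a compiled extension for the libcudart version it names. This is a heuristic sketch, not a full ELF parser: cudart_majors is a hypothetical helper, and it assumes the library name appears as plain text inside the shared object.

```python
import re


def cudart_majors(binary: bytes) -> set:
    """Return the libcudart major versions named inside a shared object's bytes."""
    return {int(m) for m in re.findall(rb"libcudart\.so\.(\d+)", binary)}


# Example with synthetic bytes standing in for a real .so file.
sample = b"\x00libc.so.6\x00libcudart.so.12\x00"
print(cudart_majors(sample))
```

For a real file, pass Path("some_extension.so").read_bytes() (a placeholder path); a result containing 12 suggests the package was built against CUDA 12 and will not load on this system.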