ML stack reference

Python, NumPy, PyTorch, and LibTorch C++ in one place — tensor work, training loops, and patterns from gsplat, nerfstudio, vne3dgs, and custom C++ inference.

Create arrays

np.zeros((N, 3))  # N×3 array of zeros

np.ones((N, 3))  # N×3 array of ones

np.random.randn(N, 3)  # N×3 samples from the standard normal distribution

np.linspace(0, 1, 100)  # 100 evenly spaced values from 0 to 1, endpoints included

np.arange(0, 10, 2)  # [0, 2, 4, 6, 8]

Shape & reshape

a.shape  # Returns a tuple, e.g. (5834784, 3)

a.reshape(N, -1)  # Reshape; -1 infers the remaining dimension

a.flatten()  # 1D copy of the array

np.stack([a, b], axis=1)  # [N] + [N] → [N, 2]

np.concatenate([a, b], axis=0)  # [N, 3] + [M, 3] → [N+M, 3]
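
The difference between stacking (new axis) and concatenating (existing axis) is easiest to see by printing shapes. A minimal sketch with made-up sizes `N = 4` and `M = 2`:

```python
import numpy as np

N, M = 4, 2
x = np.arange(N, dtype=np.float32)          # shape (4,)
y = np.arange(N, dtype=np.float32) * 10     # shape (4,)

# stack creates a new axis: (4,) + (4,) -> (4, 2)
pairs = np.stack([x, y], axis=1)

# concatenate joins along an existing axis: (4, 3) + (2, 3) -> (6, 3)
a = np.zeros((N, 3), dtype=np.float32)
b = np.ones((M, 3), dtype=np.float32)
joined = np.concatenate([a, b], axis=0)

flat = joined.reshape(-1)                   # (18,)
back = flat.reshape(N + M, -1)              # -1 is inferred as 3

print(pairs.shape, joined.shape, flat.shape, back.shape)
```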

Indexing & masking

a[0]  # First row

a[:, 1]  # All rows, column 1

a[10:20]  # Rows 10 to 19

a[a > 0.5]  # Boolean mask: keeps matching values as a 1D array

np.where(a > 0, a, 0)  # Conditional select: a where positive, else 0
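
Masking with a full-shape boolean array flattens the result, while a per-row mask keeps whole rows. A small sketch on a hand-written 3×2 array (the values are arbitrary):

```python
import numpy as np

a = np.array([[0.2, 0.9],
              [0.7, 0.1],
              [0.6, 0.8]])

# Element mask: result is 1D, in row-major order
selected = a[a > 0.5]

# Row mask: keep only rows where every entry passes the test
row_mask = (a > 0.5).all(axis=1)
rows = a[row_mask]

# np.where keeps the shape and fills failing entries with 0
clipped = np.where(a > 0.5, a, 0.0)

print(selected, rows.shape, clipped.shape)
```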

Math ops

np.exp(a)  # Elementwise exp; 3DGS scale activation

np.log(a)  # Elementwise natural log

1 / (1 + np.exp(-a))  # Sigmoid; 3DGS opacity activation

np.linalg.norm(a, axis=-1)  # L2 norm per row

a @ b  # Matrix multiply

a.mean(axis=0)  # Mean over axis 0 (one value per column)
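
The exp and sigmoid rows are the activations applied to stored 3DGS parameters, so their inverses (log and logit) recover the raw values. A minimal round-trip sketch; the helper names `sigmoid` and `logit` and the sample numbers are illustrative, not from any library:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    # Inverse of sigmoid, valid for 0 < p < 1
    return np.log(p / (1.0 - p))

# Scales: stored as log, activated with exp
true_scales = np.array([0.01, 0.5, 2.0])
scales = np.exp(np.log(true_scales))

# Opacity: stored as logit, activated with sigmoid
true_opacity = np.array([0.1, 0.5, 0.99])
opacity = sigmoid(logit(true_opacity))

# Row-wise L2 normalization using keepdims for broadcasting
v = np.array([[3.0, 4.0], [0.0, 2.0]])
unit = v / np.linalg.norm(v, axis=-1, keepdims=True)

print(scales, opacity, unit)
```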

Type conversion

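a.astype(np.float32)  # Cast to float32, the usual dtype for GPU work

a.astype(np.uint8)  # Cast to uint8 for saving images (scale to 0..255 first)

torch.tensor(a)  # numpy → torch tensor (copies; torch.from_numpy(a) shares memory)

tensor.numpy()  # torch → numpy (tensor must be on CPU and not require grad)

tensor.cpu().numpy()  # GPU tensor → numpy (call .detach() first if it requires grad)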
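
Whether a conversion copies or shares memory matters when the array is modified afterwards. A short CPU-only sketch of the two numpy → torch routes:

```python
import numpy as np
import torch

a = np.zeros(3, dtype=np.float64)

copied = torch.tensor(a)       # independent copy, keeps float64
shared = torch.from_numpy(a)   # view onto the same memory

a[0] = 1.0                     # modify the numpy array in place

as_f32 = torch.from_numpy(a).float()   # cast to float32 (this makes a copy)
back = as_f32.numpy()                  # CPU tensor without grad → numpy

print(copied[0].item(), shared[0].item(), as_f32.dtype, back.dtype)
```

The copy keeps its old value while the shared tensor sees the write.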

Create tensors

torch.zeros(N, 3)  # CPU zeros

torch.zeros(N, 3, device='cuda')  # GPU zeros

torch.randn(N, 3, device='cuda')  # GPU standard normal samples

torch.tensor(np_array, dtype=torch.float32)  # From numpy (copies and casts)

torch.eye(4, device='cuda')  # 4×4 identity matrix

torch.arange(0, 10, device='cuda')  # Integers 0..9 on GPU

Shape manipulation

x.shape  # torch.Size([N, 3])

x.view(N, -1)  # Reshape without copying; needs compatible (usually contiguous) memory

x.reshape(N, -1)  # Reshape; copies only if a view is not possible

x.unsqueeze(0)  # [N, 3] → [1, N, 3], adds a dim

x.squeeze(0)  # [1, N, 3] → [N, 3], removes a size-1 dim

x.permute(2, 0, 1)  # Reorder dims (generalized transpose), e.g. [H, W, C] → [C, H, W]

x.contiguous()  # Make memory contiguous after permute
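
The view / reshape / contiguous distinction only shows up once a tensor's strides no longer match its shape, typically after `permute`. A small sketch that triggers the failure on purpose:

```python
import torch

x = torch.arange(24).reshape(2, 3, 4)
p = x.permute(2, 0, 1)            # shape [4, 2, 3], same storage, new strides

is_contig = p.is_contiguous()     # False after this permute

# view cannot flatten these strides without copying, so it raises
try:
    p.view(-1)
    view_failed = False
except RuntimeError:
    view_failed = True

flat_a = p.reshape(-1)            # reshape falls back to a copy
flat_b = p.contiguous().view(-1)  # explicit copy, then a cheap view

print(is_contig, view_failed, flat_a[:3].tolist())
```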

Device management

x.to('cuda')  # Move to GPU (returns a new tensor; x itself is unchanged)

x.to('cpu')  # Move to CPU

x.cuda()  # Shorthand for .to('cuda')

x.cpu()  # Shorthand for .to('cpu')

x.device  # Returns the device, e.g. cuda:0

torch.cuda.is_available()  # Check whether a GPU is available

torch.cuda.get_device_name(0)  # GPU name string

torch.device('cuda:1')  # Explicit device handle; pass to .to(device)

torch.cuda.set_device(1)  # Set the current CUDA device, i.e. what a bare 'cuda' resolves to

Dtype

x.dtype  # Returns torch.float32 etc.

x.float()  # Cast to float32

x.half()  # Cast to float16: often faster on GPU, with less range and precision

x.long()  # Cast to int64

x.bool()  # Cast to bool

Reduction ops

x.sum()  # Sum all elements

x.sum(dim=0)  # Sum over dim 0: [N, 3] → [3]

x.mean(dim=-1)  # Mean along last dim

x.max(dim=0).values  # Max values (max(dim=...) returns a (values, indices) namedtuple)

x.min(dim=0).indices  # Indices of min values

x.norm(dim=-1)  # L2 norm per row

x.median(dim=0).values  # Median; used in scene_center()

x.quantile(0.9)  # 90th percentile

Elementwise ops

x * y  # Elementwise multiply

x + y  # Elementwise add

torch.exp(x)  # e^x; 3DGS scale activation

torch.sigmoid(x)  # 1/(1+e^-x); opacity activation

torch.relu(x)  # max(0, x)

x.clamp(0, 1)  # Clip to [0, 1], e.g. for RGB

x.abs()  # Absolute value

Matrix ops

x @ y  # Matrix multiply

torch.matmul(x, y)  # Same as @

x.T  # Transpose 2D tensor

torch.inverse(x)  # Matrix inverse (alias of torch.linalg.inv)

torch.linalg.norm(x)  # Norm of x; Frobenius norm for a 2D matrix by default

torch.bmm(a, b)  # Batched matmul: [B, N, K] @ [B, K, M] → [B, N, M]

torch.einsum('ij,jk->ik', a, b)  # Einstein summation; the index notation papers use for tensor contractions
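
The three batched-multiply spellings above compute the same thing, which is easy to confirm numerically. A sketch on random CPU tensors (shapes chosen arbitrarily), plus the common "rotate N points" case written both ways:

```python
import torch

torch.manual_seed(0)
a = torch.randn(2, 3, 4)   # [B, N, K]
b = torch.randn(2, 4, 5)   # [B, K, M]

out_bmm = torch.bmm(a, b)                       # [B, N, M]
out_at = a @ b                                  # matmul broadcasts over batch
out_es = torch.einsum('bnk,bkm->bnm', a, b)     # same contraction, spelled out

# Apply one 3x3 matrix R to N row-vector points
R = torch.randn(3, 3)
pts = torch.randn(10, 3)
rot_mm = pts @ R.T
rot_es = torch.einsum('ij,nj->ni', R, pts)

print(out_bmm.shape, rot_mm.shape)
```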

Combine tensors

torch.cat([a, b], dim=0)  # Concat along an existing dim: [N, 3] + [M, 3] → [N+M, 3]

torch.stack([a, b], dim=0)  # New dim: [N, 3] + [N, 3] → [2, N, 3]

a.expand(4, -1, -1)  # Repeat a size-1 dim without copying (broadcast view): [1, N, 3] → [4, N, 3]

a.repeat(4, 1, 1)  # Repeat with copy
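
Whether the repeat is a view or a copy can be checked from strides and data pointers. A minimal sketch with a `[1, 2, 3]` tensor:

```python
import torch

a = torch.arange(6.0).reshape(1, 2, 3)

e = a.expand(4, -1, -1)   # view: the leading dim gets stride 0
r = a.repeat(4, 1, 1)     # real copy with 4x the memory

same_shape = e.shape == r.shape
shares_memory = e.data_ptr() == a.data_ptr()
copy_is_separate = r.data_ptr() != a.data_ptr()

print(e.shape, e.stride(), r.stride())
```

Because the expanded tensor aliases one row of storage, in-place writes to it are best avoided; call `.clone()` first if it needs to be modified.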

Basic training loop

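import torch
import torch.nn.functional as F

# MyNet, x, target, and n_epochs are assumed to be defined elsewhere
model = MyNet().cuda()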
optimizer = torch.optim.Adam(
    model.parameters(), lr=1e-3
)

for epoch in range(n_epochs):
    # Forward
    pred = model(x)
    loss = F.mse_loss(pred, target)

    # Backward
    optimizer.zero_grad()   # clear old grads
    loss.backward()         # compute grads
    optimizer.step()        # update params

    if epoch % 10 == 0:
        print(f"{epoch}: {loss.item():.4f}")
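
For a version of this loop that runs end to end, here is a minimal sketch that fits a line with a single `nn.Linear`. The synthetic data (`y = 2x + 1`), the learning rate, and the step count are illustrative choices, and it falls back to CPU when no GPU is present:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Synthetic regression data: target = 2x + 1
x = torch.linspace(-1, 1, 64, device=device).unsqueeze(1)   # [64, 1]
target = 2.0 * x + 1.0

model = nn.Linear(1, 1).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)

losses = []
for epoch in range(200):
    pred = model(x)                    # forward
    loss = F.mse_loss(pred, target)

    optimizer.zero_grad()              # clear old grads
    loss.backward()                    # compute grads
    optimizer.step()                   # update params
    losses.append(loss.item())

print(f"first {losses[0]:.4f} -> last {losses[-1]:.4f}")
```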

Optimizers

torch.optim.Adam(params, lr=1e-3)  # Common default; a reasonable first choice

torch.optim.SGD(params, lr=0.01)  # Stochastic gradient descent

torch.optim.AdamW(params, lr=1e-3)  # Adam with decoupled weight decay

optimizer.zero_grad()  # Clear gradients; call before backward()

optimizer.step()  # Update parameters

Loss functions

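F.mse_loss(pred, target)  # Mean squared error

F.l1_loss(pred, target)  # Mean absolute error

F.cross_entropy(pred, labels)  # Classification loss; pred holds raw logits, labels hold class indices

F.binary_cross_entropy(pred, target)  # Binary classification; pred must already be probabilities in [0, 1]

(1 - ssim(pred, target))  # SSIM loss term used in 3DGS training; ssim() comes from a helper or third-party package, not core PyTorch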
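
A frequent mistake with these losses is passing probabilities where logits are expected, or the reverse. A small sketch with hand-written numbers that checks `cross_entropy` against its log-softmax definition and shows the logits variant of BCE:

```python
import torch
import torch.nn.functional as F

# Multi-class: raw logits in, integer class labels as targets
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
labels = torch.tensor([0, 2])
ce = F.cross_entropy(logits, labels)
manual_ce = -F.log_softmax(logits, dim=1)[torch.arange(2), labels].mean()

# Binary: either sigmoid first, or use the numerically safer logits variant
bin_logits = torch.tensor([1.5, -0.3])
target = torch.tensor([1.0, 0.0])
bce_prob = F.binary_cross_entropy(torch.sigmoid(bin_logits), target)
bce_logit = F.binary_cross_entropy_with_logits(bin_logits, target)

print(ce.item(), bce_prob.item())
```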

Gradient control

with torch.no_grad():  # Disable gradient tracking, for inference

with torch.inference_mode():  # Stricter than no_grad: also skips view tracking and version-counter updates

model.eval()  # Disable dropout and use BatchNorm running stats; call before an eval loop

model.train()  # Restore training behavior after inference

tensor.detach()  # Detach from the graph; stops gradient flow

tensor.requires_grad_(True)  # Enable gradient tracking

loss.backward()  # Compute gradients for all leaf tensors that require grad

tensor.grad  # Access the accumulated gradient

torch.nn.utils.clip_grad_norm_(params, 1.0)  # Gradient clipping: rescale so the total norm is at most 1.0
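
Gradients accumulate across `backward()` calls until they are cleared, which is why `zero_grad()` belongs in every step. A minimal sketch on a two-element tensor showing accumulation, clipping, and the two ways of stepping outside the graph:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w ** 2).sum().backward()
first = w.grad.clone()            # d/dw sum(w^2) = 2w

(w ** 2).sum().backward()         # no zeroing in between, so grads add up
accumulated = w.grad.clone()

# clip_grad_norm_ returns the total norm measured before clipping
total_norm = torch.nn.utils.clip_grad_norm_([w], max_norm=1.0)
clipped_norm = w.grad.norm()

with torch.no_grad():
    y = w * 3                     # not recorded in the graph
d = w.detach()                    # shares data with w, carries no grad history

print(first, accumulated, total_norm.item(), clipped_norm.item())
```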

Save & load

torch.save(model.state_dict(), 'model.pt')  # Save weights

model.load_state_dict(torch.load('model.pt'))  # Load weights (pass map_location to load on a different device)

torch.save(checkpoint, 'ckpt.pt')  # Save a full checkpoint, e.g. a dict of model and optimizer state

torch.compile(model)  # PyTorch 2 graph compile; optional speedup (Python only)
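
A "full checkpoint" is usually just a dict. The sketch below shows one possible layout; the keys `epoch`, `model`, and `optimizer` are a naming assumption rather than a PyTorch convention, and an in-memory buffer stands in for a file path so the example leaves nothing on disk:

```python
import io

import torch
import torch.nn as nn

model = nn.Linear(3, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Hypothetical checkpoint layout
checkpoint = {
    "epoch": 5,
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
}

buffer = io.BytesIO()                 # replace with a path such as 'ckpt.pt'
torch.save(checkpoint, buffer)
buffer.seek(0)

# map_location lets a GPU-trained checkpoint load on a CPU-only machine
restored = torch.load(buffer, map_location="cpu")

model2 = nn.Linear(3, 1)
model2.load_state_dict(restored["model"])
optimizer2 = torch.optim.Adam(model2.parameters(), lr=1e-3)
optimizer2.load_state_dict(restored["optimizer"])
start_epoch = restored["epoch"] + 1

print(start_epoch)
```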

Mixed precision (AMP)

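import torch

# torch.cuda.amp still works; newer releases prefer
# torch.amp.GradScaler("cuda") and torch.autocast("cuda")
scaler = torch.cuda.amp.GradScaler()
for x, y in loader:
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        loss = model(x, y)          # here the model returns the loss directly
    scaler.scale(loss).backward()   # backward on the scaled loss
    scaler.step(optimizer)          # unscales grads; skips the step on inf/NaN
    scaler.update()                 # adjust the scale for the next iteration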

torch.cuda.amp.autocast()  # Run the forward pass in lower precision where safe

GradScaler()  # Scale the loss to avoid fp16 gradient underflow in backward

DataLoader

DataLoader(ds, batch_size=32, shuffle=True)  # Basic batched iteration

num_workers=4  # Parallel CPU loading; start small on Docker/NFS

pin_memory=True  # Faster host→GPU copy when training on CUDA

persistent_workers=True  # Keep workers alive between epochs (PyTorch 1.7+, needs num_workers > 0)

collate_fn=batch_list  # Custom stacking when samples are variable-length

drop_last=True  # Omit the final partial batch; avoids a size-1 batch hitting BatchNorm
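
The `collate_fn=batch_list` row refers to a user-supplied function. One possible implementation, sketched here under the assumption that each sample is a `[length, 3]` tensor, pads every batch to its longest sample and returns the true lengths alongside:

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# A plain list works as a map-style dataset (it has __len__ and __getitem__)
samples = [torch.ones(n, 3) for n in (2, 5, 3, 4)]

def batch_list(batch):
    """Pad variable-length samples to the longest one in the batch."""
    lengths = torch.tensor([len(s) for s in batch])
    padded = pad_sequence(batch, batch_first=True)   # [B, max_len, 3]
    return padded, lengths

loader = DataLoader(samples, batch_size=2, shuffle=False,
                    collate_fn=batch_list)

batches = list(loader)
shapes = [tuple(p.shape) for p, _ in batches]
lengths = [l.tolist() for _, l in batches]
print(shapes, lengths)
```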

CUDA memory & sync

torch.cuda.empty_cache()  # Return unused cached blocks to the driver; does not free tensors still referenced

torch.cuda.memory_allocated()  # Bytes currently occupied by tensors on the device

torch.cuda.max_memory_allocated()  # Peak since last reset; pair with reset_peak_memory_stats()

torch.cuda.synchronize()  # Block until queued GPU work completes; use when timing

torch.cuda.reset_peak_memory_stats()  # Clear the peak counter before a benchmark region
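
CUDA kernels launch asynchronously, so a timer that skips `synchronize()` mostly measures launch overhead. A minimal timing helper is sketched below; the name `timed` and its `warmup` / `iters` parameters are illustrative, and the CUDA calls are skipped when no GPU is present:

```python
import time

import torch

def timed(fn, *args, warmup=1, iters=5):
    """Return (last output, mean milliseconds per call)."""
    use_cuda = torch.cuda.is_available()
    for _ in range(warmup):              # warm-up runs are not timed
        fn(*args)
    if use_cuda:
        torch.cuda.synchronize()         # drain pending work before starting
    start = time.perf_counter()
    for _ in range(iters):
        out = fn(*args)
    if use_cuda:
        torch.cuda.synchronize()         # wait for the timed work to finish
    elapsed_ms = (time.perf_counter() - start) / iters * 1000
    return out, elapsed_ms

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(256, 256, device=device)
out, ms = timed(torch.matmul, a, a)
print(f"{ms:.3f} ms per matmul on {device}")
```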

LibTorch is the C++ API for PyTorch. Types live in at:: / torch::; autograd and torch::nn mirror Python. Match your LibTorch download to the CUDA/toolchain you link against.

CMake & include

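// main.cpp (file name is a placeholder)
#include <torch/torch.h>

# CMakeLists.txt (typical)
cmake_minimum_required(VERSION 3.18)
project(myapp)
find_package(Torch REQUIRED)
add_executable(myapp main.cpp)
target_link_libraries(myapp PRIVATE "${TORCH_LIBRARIES}")
set_property(TARGET myapp PROPERTY CXX_STANDARD 17)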

-DCMAKE_PREFIX_PATH=/path/to/libtorch  # Point CMake at the extracted prebuilt libtorch

${TORCH_CXX_FLAGS}  # Append to CMAKE_CXX_FLAGS; ABI / codegen flags from find_package

Tensor creation

torch::zeros({N, 3}, torch::kCUDA)  // Float32 on default CUDA device

torch::randn({N, 3}, torch::dtype(torch::kFloat16).device(torch::kCUDA))  // Explicit TensorOptions

torch::tensor({1., 2., 3.}, torch::kFloat32)  // From initializer list

torch::from_blob(ptr, {H, W, 3}, torch::kFloat32).clone()  // Wrap existing CPU memory without copying; clone() if the tensor must outlive the buffer

torch::empty_like(x)  // Uninitialized buffer with the same shape/device/dtype as x

Shape, device, dtype

x.sizes()  // IntArrayRef, like shape in Python

x.dim()  // Number of dimensions

x.view({N, -1})  // Reshape without copying (needs compatible strides; otherwise use reshape)

x.reshape({N, -1})  // Reshape, copying if needed

x.to(torch::kCUDA)  // Move tensor to CUDA

x.to(torch::kCPU)  // Move to CPU

x.to(torch::kFloat16)  // Cast dtype

x.device()  // c10::Device, e.g. cuda:0 or cpu

torch::cuda::is_available()  // Runtime GPU check

Common ops (C++)

torch::matmul(a, b)  // Same rules as Python @

a.mm(b)  // 2D matrix multiply

torch::cat({a, b}, /*dim=*/0)  // Concat along dim

torch::stack({a, b}, 0)  // New leading dim

torch::relu(x)  // Elementwise max(0, x)

torch::sigmoid(x)  // Opacity-style activation

torch::exp(x)  // Scale-style activation

x.clamp(0, 1)  // Clip range

x.norm(/*p=*/2, /*dim=*/-1, /*keepdim=*/true)  // L2 norm along dim

Autograd

x.requires_grad_(true)  // Track ops on this tensor

loss.backward()  // Accumulate grads into .grad()

x.grad()  // Gradient tensor; check .defined() before reading

torch::NoGradGuard no_grad;  // RAII: disables grad inside the enclosing scope (inference)

torch::Tensor y = x.detach();  // Stop gradient through y

torch::nn::Module

struct Net : torch::nn::Module {
  Net()
    : fc1(register_module("fc1", torch::nn::Linear(3, 64))),
      fc2(register_module("fc2", torch::nn::Linear(64, 1))) {}

  torch::Tensor forward(torch::Tensor x) {
    x = torch::relu(fc1->forward(x));
    return fc2->forward(x);
  }

  torch::nn::Linear fc1{nullptr}, fc2{nullptr};
};

auto net = std::make_shared<Net>();
net->to(torch::kCUDA);

Training loop (C++)

auto net = std::make_shared<Net>();
net->to(torch::kCUDA);
torch::optim::Adam optimizer(
    net->parameters(), torch::optim::AdamOptions(1e-3));

for (int64_t epoch = 0; epoch < 100; ++epoch) {
  optimizer.zero_grad();
  auto pred = net->forward(batch_x);
  auto loss = torch::mse_loss(pred, batch_y);
  loss.backward();
  optimizer.step();
}

torch::optim::AdamW  // Same family as Python AdamW; configure with AdamWOptions(lr)

torch::mse_loss  // Also in torch/nn/functional.h as torch::nn::functional::mse_loss

Save / load

torch::save(net, "model.pt");  // Archive module state (trainable module)

torch::load(net, "model.pt");  // Restore; net must have the same structure as the saved module

torch::save(tensor, "t.pt");  // Single tensor or vector of tensors

torch::jit::load("model.ts")  // TorchScript / traced model; a different pipeline from nn::Module save

Python ↔ C++ weights

torch.save(model.state_dict(), "w.pt")  # Python: saves only the tensors

torch::pickle_load  // C++: read Python state_dict bytes, then map tensor names to named_parameters(); or prefer TorchScript for deployment

torch.jit.trace / script  # Python: trace or script the model and save the archive, then torch::jit::load it in C++ for inference

Indexing & scalars

x.select(0, i)  // Slice along dim 0 at index i

x.masked_select(mask)  // Bool mask (same shape or broadcastable); returns 1D flattened values

x.item<float>()  // Scalar tensor → C++ scalar (exactly one element)

x.accessor<float, 2>()  // Fast CPU element access as a[i][j]; CPU tensor whose dtype and rank match the template arguments

Gotchas

view vs reshape: Same as Python. After permute/transpose, call contiguous() before view

std::move(tensor): Moving a tensor handle avoids a refcount update; plain copies are shallow and share storage, so use clone() when a deep copy is needed

TORCH_MODULE(Net): Macro that defines Net as a holder around std::shared_ptr<NetImpl> (the class itself must be named NetImpl), with ->forward; an alternative to a raw struct + make_shared

torch::TensorOptions(): Defaults to dtype float32, device CPU; chain .dtype(torch::kFloat16).device(torch::kCUDA)

torch::cuda::synchronize(): Block until CUDA work finishes; use when timing C++ paths

ATen threading: CPU ops use intra-op parallelism; set the thread count with at::set_num_threads if needed

ABI: Prebuilt LibTorch must match your compiler's C++ ABI (e.g. the libstdc++ cxx11 ABI setting) and CUDA major version

nn.Module template

import torch.nn as nn

class MyNet(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(in_dim, 128),
            nn.ReLU(),
            nn.Linear(128, out_dim),
        )

    def forward(self, x):
        return self.layers(x)

model = MyNet(3, 1).cuda()
print(sum(p.numel() for p in model.parameters()),
      "parameters")

Dataset + DataLoader

from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

dataset = MyDataset(X, y)
loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,
)

for batch_x, batch_y in loader:
    loss = train_step(batch_x, batch_y)

Inference pipeline

model.eval()           # disable dropout/BN

with torch.no_grad():  # disable gradients
    x = preprocess(input).cuda()
    pred = model(x)
    result = postprocess(pred.cpu())

model.train()          # back to training mode

Debug tensor issues

# Always check these when something breaks
print(x.shape)         # shape
print(x.dtype)         # float32 / int64 etc.
print(x.device)        # cpu / cuda:0
print(x.min(), x.max()) # value range
print(x.isnan().any()) # NaN check
print(x.isinf().any()) # Inf check

# Visualize a batch
import matplotlib.pyplot as plt
img = renders[0].detach().clamp(0, 1).cpu().numpy()  # [H, W, 3] floats in [0, 1]
plt.imshow(img); plt.savefig("debug.png")

Dataclasses

from dataclasses import dataclass, field

@dataclass
class Config:
    lr: float = 1e-3
    batch_size: int = 32
    device: str = "cuda"
    tags: list = field(default_factory=list)

cfg = Config(lr=1e-4)
print(cfg.lr, cfg.device)
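
Two standard-library helpers make dataclass configs easier to log and vary: `asdict` for serialization and `replace` for deriving a modified copy. A short sketch reusing the same `Config` fields (the tag value is made up):

```python
import json
from dataclasses import asdict, dataclass, field, replace

@dataclass
class Config:
    lr: float = 1e-3
    batch_size: int = 32
    device: str = "cuda"
    tags: list = field(default_factory=list)

base = Config()
trial = replace(base, lr=1e-4, tags=["low-lr"])   # new object, base untouched

# Round-trip through JSON, e.g. to store next to a checkpoint
text = json.dumps(asdict(trial))
restored = Config(**json.loads(text))

print(text)
```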

pathlib

from pathlib import Path

root = Path("/workspace/datasets")
ply = root / "garden" / "point_cloud.ply"

print(ply.exists())     # True/False
print(ply.suffix)       # .ply
print(ply.stem)         # point_cloud
print(ply.parent)       # /workspace/datasets/garden

# Glob
plys = list(root.glob("**/*.ply"))
for p in plys:
    print(p)

argparse

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--ply",    required=True)
parser.add_argument("--output", default="out.png")
parser.add_argument("--width",  type=int, default=1280)
parser.add_argument("--fov",    type=float, default=60.0)
parser.add_argument("--verbose",action="store_true")
args = parser.parse_args()

print(args.ply, args.width)
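
`parse_args` also accepts an explicit list, which makes the parser testable from a notebook or unit test without touching `sys.argv`. A sketch wrapping the same arguments in a hypothetical `build_parser` helper:

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--ply", required=True)
    parser.add_argument("--output", default="out.png")
    parser.add_argument("--width", type=int, default=1280)
    parser.add_argument("--fov", type=float, default=60.0)
    parser.add_argument("--verbose", action="store_true")
    return parser

# Pass the argument list directly instead of reading the command line
args = build_parser().parse_args(
    ["--ply", "scene.ply", "--width", "640", "--verbose"])

print(args.ply, args.width, args.fov, args.verbose, args.output)
```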

List comprehensions

# Basic
squares = [x**2 for x in range(10)]

# With filter
big = [x for x in values if x > threshold]

# Nested
flat = [item for row in matrix for item in row]

# Dict comprehension
d = {k: v for k, v in zip(keys, values)}

# Generator (memory efficient, no [])
gen = (x**2 for x in range(1_000_000))

Context managers

# File I/O
with open("log.txt", "w") as f:
    f.write("done\n")

# PyTorch no_grad (most common in ML)
with torch.no_grad():
    pred = model(x)

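# Timing (plain code, not a context manager; for GPU work,
# call torch.cuda.synchronize() before reading the clock)
import time
t = time.perf_counter()
result = expensive_op()
print(f"{(time.perf_counter()-t)*1000:.1f}ms")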
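
The timing pattern can itself be turned into a context manager with `contextlib`. A minimal sketch; the `timer` name and the `results` dict it writes into are illustrative choices:

```python
import time
from contextlib import contextmanager

@contextmanager
def timer(label, results):
    """Store the elapsed milliseconds of the with-block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises
        results[label] = (time.perf_counter() - start) * 1000

timings = {}
with timer("sum", timings):
    total = sum(range(100_000))

print(f"sum took {timings['sum']:.3f} ms")
```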

f-strings

n = 5_834_784
t = 0.133

# Basic
print(f"Gaussians: {n:,}")       # 5,834,784
print(f"Time: {t*1000:.1f}ms")   # 133.0ms
print(f"FPS: {1/t:.1f}")         # 7.5

# Width padding (stats is a dict of name → float)
print(f"{'loss':<10} {'value':>8}")
for k, v in stats.items():
    print(f"{k:<10} {v:>8.4f}")

Load .ply scene

from plyfile import PlyData
import numpy as np, torch

ply = PlyData.read("point_cloud.ply")
v = ply['vertex']

means    = torch.tensor(
    np.stack([v['x'], v['y'], v['z']], axis=1),
    dtype=torch.float32, device='cuda')

# Scales stored as log → exp()
scales   = torch.tensor(
    np.exp(np.stack([
        v['scale_0'], v['scale_1'], v['scale_2']
    ], axis=1)), dtype=torch.float32, device='cuda')

# Opacity stored as logit → sigmoid()
opacities = torch.sigmoid(torch.tensor(
    np.array(v['opacity']),
    dtype=torch.float32, device='cuda'))

# SH DC → RGB: 0.5 + 0.28209 * f_dc
colors = (0.5 + 0.28209 * torch.tensor(
    np.stack([v['f_dc_0'], v['f_dc_1'], v['f_dc_2']], axis=1),
    dtype=torch.float32, device='cuda')).clamp(0, 1)

# Rotations: wxyz quaternions, normalized here for safety
# (assumes the usual rot_0..rot_3 property names of 3DGS exports)
quats = torch.nn.functional.normalize(torch.tensor(
    np.stack([v['rot_0'], v['rot_1'], v['rot_2'], v['rot_3']], axis=1),
    dtype=torch.float32, device='cuda'), dim=-1)

gsplat rasterization call

from gsplat import rasterization

renders, alphas, info = rasterization(
    means=means,      # [N, 3]
    quats=quats,      # [N, 4] wxyz normalized
    scales=scales,    # [N, 3] after exp()
    opacities=opacities, # [N] after sigmoid()
    colors=colors,    # [N, 3] RGB [0,1]
    viewmats=viewmat, # [C, 4, 4] world→cam
    Ks=K,             # [C, 3, 3] intrinsics
    width=W,
    height=H,
)
# renders: [C, H, W, 3]
# alphas:  [C, H, W, 1]

Camera intrinsics

import math
import torch

def make_K(W, H, fov_deg=60.0, device='cuda'):
    # fov_deg is the horizontal field of view; square pixels, so fy = fx
    fov = math.radians(fov_deg)
    fx = (W / 2) / math.tan(fov / 2)
    K = torch.tensor([[
        [fx,  0, W/2],
        [ 0, fx, H/2],
        [ 0,  0,   1],
    ]], dtype=torch.float32, device=device)
    return K  # [1, 3, 3]
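
A quick way to sanity-check an intrinsics matrix is to project known camera-space points: a point on the optical axis should land at the image center, and a point at the edge of the horizontal field of view should land at the image border. A NumPy sketch of that check; `make_K_np` and `project` are illustrative helpers, and the pinhole model has no distortion:

```python
import math

import numpy as np

def make_K_np(W, H, fov_deg=60.0):
    # Same construction as above: horizontal FOV, square pixels
    fx = (W / 2) / math.tan(math.radians(fov_deg) / 2)
    return np.array([[fx, 0.0, W / 2],
                     [0.0, fx, H / 2],
                     [0.0, 0.0, 1.0]])

def project(K, pts_cam):
    """Pinhole projection of [N, 3] camera-space points to [N, 2] pixels."""
    uvw = pts_cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]      # divide by depth

W, H = 1280, 720
K = make_K_np(W, H, fov_deg=60.0)

z = 5.0
pts = np.array([
    [0.0, 0.0, z],                                  # on the optical axis
    [z * math.tan(math.radians(30.0)), 0.0, z],     # right edge of the FOV
])
uv = project(K, pts)
print(uv)
```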

Alpha compositing

import imageio
import numpy as np

# gsplat returns the render plus an accumulated alpha.
# The render is already alpha-weighted, so composite over
# a background with:
# out = render + bg * (1 - alpha)

bg = torch.ones(3, device='cuda')  # white
alpha = alphas[0]                   # [H, W, 1]
composited = (renders[0]
    + bg.view(1,1,3) * (1 - alpha)) # [H, W, 3]

# Save to PNG
img = (composited.clamp(0,1).cpu().numpy()
       * 255).astype(np.uint8)
imageio.imwrite("out.png", img)

Scene inspection

# Find scene center (median is robust to outliers)
center = means.median(dim=0).values

# Scene spread
spread = (means - center).norm(dim=-1).mean()

# High opacity core
core = means[opacities > 0.5]
core_center = core.mean(dim=0)

# Orbit radius
radius = spread * 2.0

print(f"Gaussians: {len(means):,}")
print(f"Center:    {center.cpu().numpy().round(2)}")
print(f"Spread:    {spread:.2f}")
print(f"Radius:    {radius:.2f}")
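
The center and radius are typically fed into an orbit of cameras looking at the scene. The sketch below builds world→camera matrices with a look-at construction. It assumes an OpenCV-style camera (x right, y down, z forward) and a world whose up axis is +z; both are assumptions to adjust per dataset, and `look_at` and `orbit_viewmats` are illustrative helpers, not gsplat functions:

```python
import numpy as np

def look_at(eye, target, up=(0.0, 0.0, 1.0)):
    """World→camera 4x4 matrix; camera axes: x right, y down, z forward."""
    eye = np.asarray(eye, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    forward = target - eye
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, np.asarray(up, dtype=np.float64))
    right /= np.linalg.norm(right)       # breaks down if forward is parallel to up
    down = np.cross(forward, right)
    R = np.stack([right, down, forward])  # rows: camera axes in world coords
    V = np.eye(4)
    V[:3, :3] = R
    V[:3, 3] = -R @ eye
    return V

def orbit_viewmats(center, radius, n_views=8, height=0.0):
    center = np.asarray(center, dtype=np.float64)
    angles = np.linspace(0.0, 2.0 * np.pi, n_views, endpoint=False)
    eyes = [center + np.array([radius * np.cos(a), radius * np.sin(a), height])
            for a in angles]
    return np.stack([look_at(e, center) for e in eyes])   # [C, 4, 4]

center = np.array([1.0, -2.0, 0.5])
viewmats = orbit_viewmats(center, radius=4.0, n_views=8)

# In every view the scene center should sit on the optical axis at depth = radius
center_cam = viewmats @ np.append(center, 1.0)             # [C, 4]
print(center_cam[:, :3].round(3))
```

Converting the result with `torch.tensor(viewmats, dtype=torch.float32, device='cuda')` gives a `[C, 4, 4]` tensor in the shape the `viewmats` argument expects.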

DGX Spark gotchas

TORCH_CUDA_ARCH_LIST="12.0"  # Required before building any CUDA extension

--no-build-isolation  # Required for gsplat pip install; the build then uses the installed torch instead of pulling a conflicting version

--ipc=host  # Docker flag; PyTorch needs host IPC for shared memory (DataLoader workers)

swapoff -a  # Disable swap; unified memory + swap can freeze the machine

nvcr.io/nvidia/pytorch:25.03-py3  # Base image: ARM64 + CUDA 13 native (confirm the bundled CUDA version in the image's release notes)

ML stack — notes starter

Copy these templates into your notebook so every study session captures what you learned and what to practice next.

Daily study note (20-30 min)

# Date:
# Topic:
# Goal:

## Quick recap
- What concept did I study?
- What is still unclear?

## Code I ran
# paste snippet here

## Shape + dtype checks
- input shape:
- output shape:
- dtype/device:

## One takeaway
- Example: "Broadcasting works right-to-left by dimension."

## Next step
- 1 small thing to try tomorrow

PyTorch debug checklist

# When code breaks, always check:
print("shape:", x.shape)
print("dtype:", x.dtype)
print("device:", x.device)
print("range:", x.min().item(), x.max().item())
print("has_nan:", x.isnan().any().item())
print("has_inf:", x.isinf().any().item())

# If training:
print("loss:", loss.item())
for name, p in model.named_parameters():
    if p.grad is not None:
        print(name, p.grad.norm().item())

# If output looks wrong:
# 1) inspect one sample
# 2) clamp values to expected range
# 3) move to cpu() then visualize

Experiment log template

# Experiment ID:
# Dataset/scene:
# Model/config:

## Hypothesis
# "If I change X, metric Y should improve because..."

## Changes made
- learning rate:
- batch size:
- loss terms:
- augmentation:

## Results
- train loss:
- val metric:
- visual quality notes:

## Decision
- keep / discard / retry
- reason:

## Follow-up
- next experiment to run:

Python mini-reference for notes

dict.get(key, default)  # Safe key lookup without KeyError

enumerate(items)  # Loop with index and value

zip(a, b)  # Iterate two lists together (stops at the shorter one)

sorted(items, key=...)  # Sort by custom field

any(flags), all(flags)  # Quick boolean checks

assert condition, "message"  # Guard assumptions early

from pprint import pprint  # Readable nested dict output