ML stack reference

Python, NumPy, PyTorch, and LibTorch C++ in one place — tensor work, training loops, and patterns from gsplat, nerfstudio, vne3dgs, and custom C++ inference.

Create arrays
np.zeros((N, 3))          # N×3 array of zeros
np.ones((N, 3))           # N×3 array of ones
np.random.randn(N, 3)     # N×3 samples from the standard normal distribution
np.linspace(0, 1, 100)    # 100 evenly spaced values
np.arange(0, 10, 2)       # [0, 2, 4, 6, 8]
Shape & reshape
a.shape                           # Returns tuple, e.g. (5834784, 3)
a.reshape(N, -1)                  # Reshape; -1 infers the remaining dimension
a.flatten()                       # 1D array
np.stack([a, b], axis=1)          # [N] + [N] → [N, 2]
np.concatenate([a, b], axis=0)    # [N, 3] + [M, 3] → [N+M, 3]
Indexing & masking
a[0]                     # First row
a[:, 1]                  # All rows, column 1
a[10:20]                 # Rows 10 to 19
a[a > 0.5]               # Boolean mask: keeps matching values (1D result)
np.where(a > 0, a, 0)    # Conditional select
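The indexing patterns above are easiest to compare side by side on a small array. The sketch below uses made-up values and an arbitrary 0.5 threshold; it mainly shows that a full-array boolean mask flattens the result, while a per-row mask keeps whole rows.

```python
import numpy as np

# Toy [4, 3] array; values are illustrative only.
a = np.array([
    [0.1, 0.9, 0.3],
    [0.7, 0.2, 0.8],
    [0.4, 0.6, 0.0],
    [1.0, 0.5, 0.2],
])

first_row = a[0]          # shape (3,)
col1 = a[:, 1]            # shape (4,)
middle = a[1:3]           # rows 1 and 2, shape (2, 3)

# A boolean mask over the full array returns a flattened 1D result.
big_values = a[a > 0.5]

# A per-row mask keeps whole rows instead.
row_mask = a[:, 0] > 0.5
kept_rows = a[row_mask]   # shape (2, 3)

# np.where keeps the original shape and substitutes where the test fails.
thresholded = np.where(a > 0.5, a, 0.0)
```

The per-row form is the one that carries over to point data, where one mask of length N selects rows of several [N, ...] arrays at once.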
Math ops
np.exp(a)                     # Elementwise exp (3DGS scale activation)
np.log(a)                     # Elementwise natural log
1 / (1 + np.exp(-a))          # Sigmoid (3DGS opacity activation)
np.linalg.norm(a, axis=-1)    # L2 norm per row
a @ b                         # Matrix multiply
a.mean(axis=0)                # Mean over rows: [N, 3] → [3]
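The math ops above combine into the usual 3DGS parameter activations: exp for log-scales, sigmoid for opacity logits, and a per-row norm to get unit quaternions. The following is a minimal NumPy sketch; the function name `activate_gaussians` and the toy inputs are assumptions for illustration, not part of any library.

```python
import numpy as np

def activate_gaussians(log_scales, opacity_logits, quats):
    """Map raw stored parameters to the ranges a renderer expects (hypothetical helper)."""
    scales = np.exp(log_scales)                        # positive scales
    opacities = 1.0 / (1.0 + np.exp(-opacity_logits))  # values in (0, 1)
    # keepdims=True keeps shape [N, 1] so the division broadcasts per row.
    unit = quats / np.linalg.norm(quats, axis=-1, keepdims=True)
    return (scales.astype(np.float32),
            opacities.astype(np.float32),
            unit.astype(np.float32))

# Toy inputs (illustrative values, not from a real scene).
log_scales = np.zeros((2, 3))
opacity_logits = np.array([0.0, 2.0])
quats = np.array([[2.0, 0.0, 0.0, 0.0],
                  [1.0, 1.0, 1.0, 1.0]])

scales, opacities, unit_quats = activate_gaussians(log_scales, opacity_logits, quats)
```

Casting to float32 at the end matches the type-conversion advice in this reference, since NumPy defaults to float64.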
Type conversion
a.astype(np.float32)    # Cast to float32; always use for GPU
a.astype(np.uint8)      # Cast to uint8 for saving images
torch.tensor(a)         # numpy → torch tensor (copies the data)
tensor.numpy()          # torch → numpy (tensor must be on CPU)
tensor.cpu().numpy()    # GPU tensor → numpy
Create tensors
torch.zeros(N, 3)                              # CPU zeros
torch.zeros(N, 3, device='cuda')               # GPU zeros
torch.randn(N, 3, device='cuda')               # GPU random normal
torch.tensor(np_array, dtype=torch.float32)    # From numpy
torch.eye(4, device='cuda')                    # 4×4 identity matrix
torch.arange(0, 10, device='cuda')             # Range on GPU
Shape manipulation
x.shape               # torch.Size([N, 3])
x.view(N, -1)         # Reshape without copying (needs compatible, usually contiguous, memory)
x.reshape(N, -1)      # Reshape; copies only when the memory layout requires it
x.unsqueeze(0)        # [N, 3] → [1, N, 3], add a dim
x.squeeze(0)          # [1, N, 3] → [N, 3], remove a size-1 dim
x.permute(2, 0, 1)    # Reorder dims (generalized transpose)
x.contiguous()        # Make memory contiguous, e.g. after permute
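The difference between view and reshape only shows up once the strides no longer match the shape. The sketch below, run on CPU with a toy tensor, permutes two dims so that `view` raises, then shows the two usual fixes.

```python
import torch

x = torch.arange(24).reshape(2, 3, 4)   # [2, 3, 4], contiguous
y = x.permute(1, 0, 2)                  # [3, 2, 4], same storage, new strides

# view() cannot merge the last two dims with these strides, so it raises.
try:
    y.view(3, -1)
    view_failed = False
except RuntimeError:
    view_failed = True

flat_a = y.contiguous().view(3, -1)     # copy into contiguous memory, then view
flat_b = y.reshape(3, -1)               # reshape makes that copy only when needed

batched = flat_a.unsqueeze(0)           # [1, 3, 8]
restored = batched.squeeze(0)           # [3, 8]
```

Not every permute breaks `view`; some permuted layouts still have mergeable strides, which is why `reshape` is the safer default when the copy cost does not matter.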
Device management
x.to('cuda')                     # Move to GPU
x.to('cpu')                      # Move to CPU
x.cuda()                         # Shorthand for .to('cuda')
x.cpu()                          # Shorthand for .to('cpu')
x.device                         # Returns device, e.g. cuda:0
torch.cuda.is_available()        # Check whether a GPU is available
torch.cuda.get_device_name(0)    # GPU name string
torch.device('cuda:1')           # Explicit device handle; pass to .to(device)
torch.cuda.set_device(1)         # Default CUDA device index for new tensors
Dtype
x.dtype      # Returns torch.float32 etc.
x.float()    # Cast to float32
x.half()     # Cast to float16 (faster on GPU)
x.long()     # Cast to int64
x.bool()     # Cast to bool
Reduction ops
x.sum()                   # Sum all elements
x.sum(dim=0)              # Sum over rows: [N, 3] → [3]
x.mean(dim=-1)            # Mean along last dim
x.max(dim=0).values       # Max values (max with dim returns a namedtuple)
x.min(dim=0).indices      # Indices of min values
x.norm(dim=-1)            # L2 norm per row
x.median(dim=0).values    # Median, used in scene_center()
x.quantile(0.9)           # 90th percentile
Elementwise ops
x * y               # Elementwise multiply
x + y               # Elementwise add
torch.exp(x)        # e^x, 3DGS scale activation
torch.sigmoid(x)    # 1/(1+e^-x), opacity activation
torch.relu(x)       # max(0, x)
x.clamp(0, 1)       # Clip to [0, 1], e.g. for RGB
x.abs()             # Absolute value
Matrix ops
x @ y                              # Matrix multiply
torch.matmul(x, y)                 # Same as @
x.T                                # Transpose 2D tensor
torch.inverse(x)                   # Matrix inverse
torch.linalg.norm(x)               # Matrix norm
torch.bmm(a, b)                    # Batched matmul: [B, N, K] @ [B, K, M] → [B, N, M]
torch.einsum('ij,jk->ik', a, b)    # Einstein summation: the index notation used for tensor contractions in papers (this one is a plain matmul)
Combine tensors
torch.cat([a, b], dim=0)      # Concat along an existing dim: [N, 3] + [M, 3] → [N+M, 3]
torch.stack([a, b], dim=0)    # Stack along a new dim: [N, 3] + [N, 3] → [2, N, 3]
a.expand(4, -1, -1)           # Repeat a size-1 dim without copying (broadcast view)
a.repeat(4, 1, 1)             # Repeat with copy
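The "without copy" versus "with copy" distinction between expand and repeat can be checked directly. This is a small CPU sketch with a toy [1, 1, 3] tensor; the variable names are illustrative.

```python
import torch

a = torch.tensor([[1.0, 2.0, 3.0]]).unsqueeze(0)   # [1, 1, 3]

view4 = a.expand(4, -1, -1)   # [4, 1, 3], stride 0 on dim 0, no new memory
copy4 = a.repeat(4, 1, 1)     # [4, 1, 3], freshly allocated copy

shares_memory = view4.data_ptr() == a.data_ptr()
copy_is_separate = copy4.data_ptr() != a.data_ptr()

# Writing through the source shows up in every expanded slice,
# while the repeated copy keeps its own values.
a[0, 0, 0] = 10.0
```

Because the expanded tensor aliases one row of memory, in-place writes to it are generally unsafe; call `.clone()` or use `repeat` when the result needs to be modified.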
Basic training loop
import torch
import torch.nn.functional as F

model = MyNet().cuda()
optimizer = torch.optim.Adam(
    model.parameters(), lr=1e-3
)

for epoch in range(n_epochs):
    # Forward
    pred = model(x)
    loss = F.mse_loss(pred, target)

    # Backward
    optimizer.zero_grad()   # clear old grads
    loss.backward()         # compute grads
    optimizer.step()        # update params

    if epoch % 10 == 0:
        print(f"{epoch}: {loss.item():.4f}")
Optimizers
torch.optim.Adam(params, lr=1e-3)     # Best default; use this first
torch.optim.SGD(params, lr=0.01)      # Stochastic gradient descent
torch.optim.AdamW(params, lr=1e-3)    # Adam with decoupled weight decay
optimizer.zero_grad()                 # Clear gradients; call before backward()
optimizer.step()                      # Update parameters
Loss functions
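F.mse_loss(pred, target)                # Mean squared error
F.l1_loss(pred, target)                 # Mean absolute error
F.cross_entropy(pred, labels)           # Classification loss (pred holds raw logits)
F.binary_cross_entropy(pred, target)    # Binary classification (pred holds probabilities in [0, 1]; use F.binary_cross_entropy_with_logits for raw logits)
(1 - ssim(pred, target))                # SSIM loss term, used in 3DGS training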
Gradient control
with torch.no_grad():                          # Disable gradients, for inference
with torch.inference_mode():                   # Stricter than no_grad; no version counter ops
model.eval()                                   # Turn off dropout / use BatchNorm running stats; call before eval loop
model.train()                                  # Restore training behavior after inference
tensor.detach()                                # Detach from graph; stops gradient flow
tensor.requires_grad_(True)                    # Enable gradient tracking
loss.backward()                                # Compute all gradients
tensor.grad                                    # Access computed gradient
torch.nn.utils.clip_grad_norm_(params, 1.0)    # Gradient clipping
Save & load
torch.save(model.state_dict(), 'model.pt')       # Save weights
model.load_state_dict(torch.load('model.pt'))    # Load weights
torch.save(checkpoint, 'ckpt.pt')                # Save full checkpoint
torch.compile(model)                             # PyTorch 2 graph compile; optional speedup (Python)
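The "full checkpoint" entry above does not say what the checkpoint holds. A common layout, sketched here as an assumption rather than a fixed format, is a plain dict with the model state, the optimizer state, and the epoch counter, so training can resume where it stopped. The tiny `nn.Linear` model and the key names are illustrative.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(3, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One dummy step so the optimizer has state worth saving.
loss = model(torch.randn(8, 3)).pow(2).mean()
loss.backward()
optimizer.step()

checkpoint = {
    "epoch": 5,                          # illustrative value
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
}

path = os.path.join(tempfile.mkdtemp(), "ckpt.pt")
torch.save(checkpoint, path)

# Restore into freshly constructed objects of the same architecture.
restored_model = nn.Linear(3, 1)
restored_optimizer = torch.optim.Adam(restored_model.parameters(), lr=1e-3)

state = torch.load(path, map_location="cpu")
restored_model.load_state_dict(state["model"])
restored_optimizer.load_state_dict(state["optimizer"])
start_epoch = state["epoch"] + 1
```

Saving the optimizer state matters for Adam-style optimizers, whose moment estimates would otherwise restart from zero on resume.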
Mixed precision (AMP)
scaler = torch.cuda.amp.GradScaler()
for x, y in loader:
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        loss = model(x, y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
torch.cuda.amp.autocast()    # Run forward in lower precision where safe (recent PyTorch 2.x spells this torch.amp.autocast('cuda'))
GradScaler()                 # Scale loss to avoid fp16 underflow in backward
DataLoader
DataLoader(ds, batch_size=32, shuffle=True)    # Basic batched iteration
num_workers=4                                  # Parallel CPU loading; start small on Docker/NFS
pin_memory=True                                # Faster host→GPU copy when training on CUDA
persistent_workers=True                        # Keep workers alive between epochs (PyTorch 1.7+, needs num_workers > 0)
collate_fn=batch_list                          # Custom stacking when samples are variable-length
drop_last=True                                 # Omit final partial batch; avoids a size-1 batch in BatchNorm
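The `collate_fn=batch_list` entry above names a function without showing one. A minimal sketch of what such a function might look like for variable-length samples follows; the padding strategy (zero-pad to the longest sequence and return the true lengths) and the toy data are assumptions for illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Toy variable-length samples: (sequence [T_i, 3], integer label).
samples = [(torch.ones(n, 3) * n, n % 2) for n in (2, 5, 3, 4)]

def batch_list(batch):
    """Pad sequences to the longest one in the batch and return true lengths."""
    seqs, labels = zip(*batch)
    lengths = torch.tensor([s.shape[0] for s in seqs])
    padded = pad_sequence(list(seqs), batch_first=True)   # [B, T_max, 3], zero padded
    return padded, lengths, torch.tensor(labels)

# A plain list works as a map-style dataset (it has __len__ and __getitem__).
loader = DataLoader(samples, batch_size=2, shuffle=False, collate_fn=batch_list)
batches = list(loader)
```

Returning the lengths alongside the padded tensor lets the training step mask out the padding when computing the loss.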
CUDA memory & sync
torch.cuda.empty_cache()                # Return cached blocks to driver; does not free tensors in scope
torch.cuda.memory_allocated()           # Current tensor bytes on device
torch.cuda.max_memory_allocated()       # Peak since reset; pair with reset_peak_memory_stats()
torch.cuda.synchronize()                # Block until GPU work completes; use when timing
torch.cuda.reset_peak_memory_stats()    # Clear peak counter before a benchmark region

LibTorch is the C++ API for PyTorch. Types live in at:: / torch::; autograd and torch::nn mirror Python. Match your LibTorch download to the CUDA/toolchain you link against.

CMake & include
#include <torch/torch.h>

# CMakeLists.txt (typical)
find_package(Torch REQUIRED)
target_link_libraries(myapp PRIVATE "${TORCH_LIBRARIES}")
set_property(TARGET myapp PROPERTY CXX_STANDARD 17)
-DCMAKE_PREFIX_PATH=/path/to/libtorch    # Point CMake at extracted libtorch (prebuilt)
${TORCH_CXX_FLAGS}                       # Append to CMAKE_CXX_FLAGS; ABI / codegen flags from find_package
Tensor creation
torch::zeros({N, 3}, torch::kCUDA)  // Float32 on default CUDA device
torch::randn({N, 3}, torch::dtype(torch::kFloat16).device(torch::kCUDA))  // Explicit TensorOptions
torch::tensor({1., 2., 3.}, torch::kFloat32)  // From initializer list
torch::from_blob(ptr, {H, W, 3}, torch::kFloat32).clone()  // Wrap CPU memory without owning it; clone() if the tensor must outlive the buffer
torch::empty_like(x)  // Uninitialized buffer, same shape/device/dtype as x
Shape, device, dtype
x.sizes()                      // IntArrayRef, like shape in Python
x.dim()                        // Number of dimensions
x.view({N, -1})                // Reshape (must be contiguous, else use reshape)
x.reshape({N, -1})             // Reshape, with copy if needed
x.to(torch::kCUDA)             // Move tensor to CUDA
x.to(torch::kCPU)              // Move to CPU
x.to(torch::kFloat16)          // Cast dtype
x.device()                     // c10::Device, e.g. cuda:0 / cpu
torch::cuda::is_available()    // Runtime GPU check
Common ops (C++)
torch::matmul(a, b)                              // Same rules as Python @
a.mm(b)                                          // 2D matrix multiply
torch::cat({a, b}, /*dim=*/0)                    // Concat along dim
torch::stack({a, b}, 0)                          // New leading dim
torch::relu(x)                                   // Elementwise
torch::sigmoid(x)                                // Opacity-style activation
torch::exp(x)                                    // Scale-style activation
x.clamp(0, 1)                                    // Clip range
x.norm(/*p=*/2, /*dim=*/-1, /*keepdim=*/true)    // L2 norm along dim
Autograd
x.requires_grad_(true)           // Track ops on this tensor
loss.backward()                  // Accumulate grads into .grad()
x.grad()                         // Gradient tensor; check .defined() before reading
torch::NoGradGuard no_grad;      // RAII: disable grad inside scope (inference)
torch::Tensor y = x.detach();    // Stop gradient through y
torch::nn::Module
struct Net : torch::nn::Module {
  Net()
    : fc1(register_module("fc1", torch::nn::Linear(3, 64))),
      fc2(register_module("fc2", torch::nn::Linear(64, 1))) {}

  torch::Tensor forward(torch::Tensor x) {
    x = torch::relu(fc1->forward(x));
    return fc2->forward(x);
  }

  torch::nn::Linear fc1{nullptr}, fc2{nullptr};
};

auto net = std::make_shared<Net>();
net->to(torch::kCUDA);
Training loop (C++)
auto net = std::make_shared<Net>();
net->to(torch::kCUDA);
torch::optim::Adam optimizer(
    net->parameters(), torch::optim::AdamOptions(1e-3));

for (int64_t epoch = 0; epoch < 100; ++epoch) {
  optimizer.zero_grad();
  auto pred = net->forward(batch_x);
  auto loss = torch::mse_loss(pred, batch_y);
  loss.backward();
  optimizer.step();
}
torch::optim::AdamW    // Same family as Python AdamW; AdamWOptions(lr)
torch::mse_loss        // Also in torch/nn/functional.h as torch::nn::functional::mse_loss
Save / load
torch::save(net, "model.pt");    // Archive module state (trainable module)
torch::load(net, "model.pt");    // Restore; net type must match saved module
torch::save(tensor, "t.pt");     // Single tensor or vector of tensors
torch::jit::load("model.ts")     // TorchScript / traced model; different pipeline from nn.Module save
Python ↔ C++ weights
torch.save(model.state_dict(), "w.pt")    # Python: only tensors
torch::pickle_load                        // C++: read Python state_dict bytes; map tensor names to named_parameters(), or prefer TorchScript for deploy
torch.jit.trace / script                  # Python: save a TorchScript archive (.pt), then torch::jit::load in C++ for inference
Indexing & scalars
x.select(0, i)            // Slice along dim 0 at index i
x.masked_select(mask)     // Bool mask, same shape; returns 1D flattened values
x.item<float>()           // Scalar tensor → C++ scalar (exactly one element)
x.accessor<float, 2>()    // Fast CPU element access via [i][j] (CPU tensor; dtype and rank must match the template arguments)
Gotchas
view vs reshape               // Same as Python: after permute/transpose, call contiguous() before view
std::move(tensor)             // Moving tensors is cheap; avoid copies in hot loops
TORCH_MODULE(Net)             // Macro: defines holder Net around std::shared_ptr<NetImpl> with ->forward; alternative to raw struct + make_shared
torch::TensorOptions()        // Default dtype float32, device CPU; chain .dtype(torch::kFloat16).device(torch::kCUDA)
torch::cuda::synchronize()    // Block until CUDA work finishes; use when timing C++ paths
ATen threading                // CPU ops use intra-op parallelism; set with at::set_num_threads if needed
ABI                           // Prebuilt LibTorch must match compiler (e.g. libstdc++) and CUDA major
nn.Module template
import torch.nn as nn

class MyNet(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(in_dim, 128),
            nn.ReLU(),
            nn.Linear(128, out_dim),
        )

    def forward(self, x):
        return self.layers(x)

model = MyNet(3, 1).cuda()
print(sum(p.numel() for p in model.parameters()),
      "parameters")
Dataset + DataLoader
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

dataset = MyDataset(X, y)
loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,
)

for batch_x, batch_y in loader:
    loss = train_step(batch_x, batch_y)
Inference pipeline
model.eval()           # dropout off, BatchNorm uses running stats

with torch.no_grad():  # disable gradients
    x = preprocess(input).cuda()
    pred = model(x)
    result = postprocess(pred.cpu())

model.train()          # back to training mode
Debug tensor issues
# Always check these when something breaks
print(x.shape)         # shape
print(x.dtype)         # float32 / int64 etc.
print(x.device)        # cpu / cuda:0
print(x.min(), x.max()) # value range
print(x.isnan().any()) # NaN check
print(x.isinf().any()) # Inf check

# Visualize a batch
import matplotlib.pyplot as plt
img = renders[0].cpu().numpy()
plt.imshow(img); plt.savefig("debug.png")
Dataclasses
from dataclasses import dataclass, field

@dataclass
class Config:
    lr: float = 1e-3
    batch_size: int = 32
    device: str = "cuda"
    tags: list = field(default_factory=list)

cfg = Config(lr=1e-4)
print(cfg.lr, cfg.device)
pathlib
from pathlib import Path

root = Path("/workspace/datasets")
ply = root / "garden" / "point_cloud.ply"

print(ply.exists())     # True/False
print(ply.suffix)       # .ply
print(ply.stem)         # point_cloud
print(ply.parent)       # /workspace/datasets/garden

# Glob
plys = list(root.glob("**/*.ply"))
for p in plys:
    print(p)
argparse
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--ply",    required=True)
parser.add_argument("--output", default="out.png")
parser.add_argument("--width",  type=int, default=1280)
parser.add_argument("--fov",    type=float, default=60.0)
parser.add_argument("--verbose",action="store_true")
args = parser.parse_args()

print(args.ply, args.width)
List comprehensions
# Basic
squares = [x**2 for x in range(10)]

# With filter
big = [x for x in values if x > threshold]

# Nested
flat = [item for row in matrix for item in row]

# Dict comprehension
d = {k: v for k, v in zip(keys, values)}

# Generator (memory efficient, no [])
gen = (x**2 for x in range(1_000_000))
Context managers
# File I/O
with open("log.txt", "w") as f:
    f.write("done\n")

# PyTorch no_grad (most common in ML)
with torch.no_grad():
    pred = model(x)

# Timing
import time
t = time.time()
result = expensive_op()
print(f"{(time.time()-t)*1000:.1f}ms")
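The timing snippet above is written inline rather than as a context manager. A minimal sketch of the same measurement wrapped with `contextlib.contextmanager` is shown below; the `timer` name and the optional `sink` dict are hypothetical, and `time.perf_counter()` is swapped in for `time.time()` because it is the clock intended for measuring intervals.

```python
import time
from contextlib import contextmanager

@contextmanager
def timer(label, sink=None):
    """Time the enclosed block; optionally store milliseconds in `sink`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so the timing is always reported.
        elapsed_ms = (time.perf_counter() - start) * 1000
        if sink is not None:
            sink[label] = elapsed_ms
        print(f"{label}: {elapsed_ms:.1f}ms")

timings = {}
with timer("sleep", sink=timings):
    time.sleep(0.01)
```

For GPU code, the block should end with `torch.cuda.synchronize()` so that queued kernels have finished before the clock is read.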
f-strings
n = 5_834_784
t = 0.133

# Basic
print(f"Gaussians: {n:,}")       # 5,834,784
print(f"Time: {t*1000:.1f}ms")   # 133.0ms
print(f"FPS: {1/t:.1f}")         # 7.5

# Width padding
print(f"{'loss':<10} {'value':>8}")
for k, v in stats.items():
    print(f"{k:<10} {v:>8.4f}")
Load .ply scene
from plyfile import PlyData
import numpy as np, torch

ply = PlyData.read("point_cloud.ply")
v = ply['vertex']

means    = torch.tensor(
    np.stack([v['x'], v['y'], v['z']], axis=1),
    dtype=torch.float32, device='cuda')

# Scales stored as log → exp()
scales   = torch.tensor(
    np.exp(np.stack([
        v['scale_0'], v['scale_1'], v['scale_2']
    ], axis=1)), dtype=torch.float32, device='cuda')

# Opacity stored as logit → sigmoid()
opacities = torch.sigmoid(torch.tensor(
    np.array(v['opacity']),
    dtype=torch.float32, device='cuda'))

# SH DC → RGB: 0.5 + 0.28209 * f_dc
colors = (0.5 + 0.28209 * torch.tensor(
    np.stack([v['f_dc_0'], v['f_dc_1'], v['f_dc_2']], axis=1),
    dtype=torch.float32, device='cuda')).clamp(0, 1)
gsplat rasterization call
from gsplat import rasterization

renders, alphas, info = rasterization(
    means=means,      # [N, 3]
    quats=quats,      # [N, 4] wxyz normalized
    scales=scales,    # [N, 3] after exp()
    opacities=opacities, # [N] after sigmoid()
    colors=colors,    # [N, 3] RGB [0,1]
    viewmats=viewmat, # [C, 4, 4] world→cam
    Ks=K,             # [C, 3, 3] intrinsics
    width=W,
    height=H,
)
# renders: [C, H, W, 3]
# alphas:  [C, H, W, 1]
Camera intrinsics
import math

import torch

def make_K(W, H, fov_deg=60.0, device='cuda'):
    fov = math.radians(fov_deg)
    fx = (W / 2) / math.tan(fov / 2)
    K = torch.tensor([[
        [fx,  0, W/2],
        [ 0, fx, H/2],
        [ 0,  0,   1],
    ]], dtype=torch.float32, device=device)
    return K  # [1, 3, 3]
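`make_K` above reads `fov_deg` as the horizontal field of view and reuses `fx` for the vertical focal length, so the vertical field of view is implied by the aspect ratio rather than set directly. The sketch below works that out with the standard pinhole relation; the helper names and the 1280×720 size are assumptions for illustration.

```python
import math

def focal_from_fov(size_px, fov_deg):
    """Focal length in pixels for a field of view spanning `size_px` pixels."""
    return (size_px / 2) / math.tan(math.radians(fov_deg) / 2)

def fov_from_focal(size_px, focal_px):
    """Inverse of focal_from_fov, in degrees."""
    return math.degrees(2 * math.atan((size_px / 2) / focal_px))

W, H = 1280, 720
fx = focal_from_fov(W, 60.0)   # horizontal FOV of 60 degrees
vfov = fov_from_focal(H, fx)   # vertical FOV implied by fy = fx

print(f"fx = {fx:.1f}px, vertical FOV = {vfov:.1f} deg")
```

If a dataset specifies the vertical field of view instead, the same helper applies with `H` in place of `W`.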
Alpha compositing
import imageio
import numpy as np

# gsplat returns raw render + alpha
# Composite manually over background:
# out = render * alpha + bg * (1 - alpha)

bg = torch.ones(3, device='cuda')  # white
alpha = alphas[0]                   # [H, W, 1]
composited = (renders[0] * alpha
    + bg.view(1,1,3) * (1 - alpha)) # [H, W, 3]

# Save to PNG
img = (composited.clamp(0,1).cpu().numpy()
       * 255).astype(np.uint8)
imageio.imwrite("out.png", img)
Scene inspection
# Find scene center (median is robust to outliers)
center = means.median(dim=0).values

# Scene spread
spread = (means - center).norm(dim=-1).mean()

# High opacity core
core = means[opacities > 0.5]
core_center = core.mean(dim=0)

# Orbit radius
radius = spread * 2.0

print(f"Gaussians: {len(means):,}")
print(f"Center:    {center.cpu().numpy().round(2)}")
print(f"Spread:    {spread:.2f}")
print(f"Radius:    {radius:.2f}")
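The center and orbit radius computed above are the inputs for a ring of cameras around the scene. Here is a NumPy sketch that builds world→camera matrices of shape [C, 4, 4] for such an orbit. It assumes OpenCV-style camera axes (x right, y down, z forward) and a world up direction of +y; both are assumptions that need checking against the scene and the renderer. The function names are hypothetical.

```python
import numpy as np

def look_at_viewmat(eye, target, up=(0.0, 1.0, 0.0)):
    """World-to-camera 4x4 matrix; camera looks along +z with y pointing down
    (OpenCV-style axes, assumed here)."""
    eye = np.asarray(eye, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    forward = target - eye
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, np.asarray(up, dtype=np.float64))
    right = right / np.linalg.norm(right)
    down = np.cross(forward, right)
    R = np.stack([right, down, forward], axis=0)   # rows are camera axes in world coords
    viewmat = np.eye(4)
    viewmat[:3, :3] = R
    viewmat[:3, 3] = -R @ eye
    return viewmat.astype(np.float32)

def orbit_viewmats(center, radius, n_views=8, height=0.0):
    """Cameras on a circle around `center` in the x-z plane, all looking at it."""
    center = np.asarray(center, dtype=np.float64)
    angles = np.linspace(0.0, 2.0 * np.pi, n_views, endpoint=False)
    mats = []
    for theta in angles:
        offset = np.array([radius * np.cos(theta), height, radius * np.sin(theta)])
        mats.append(look_at_viewmat(center + offset, center))
    return np.stack(mats, axis=0)   # [C, 4, 4]

# Toy values standing in for center.cpu().numpy() and radius.item().
viewmats = orbit_viewmats(center=(1.0, 2.0, 3.0), radius=4.0, n_views=8)
```

The result can be moved to the GPU with `torch.from_numpy(viewmats).to('cuda')` before being passed as `viewmats`. The look-at construction breaks down if the viewing direction is parallel to `up`, which does not happen on this horizontal orbit.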
DGX Spark gotchas
TORCH_CUDA_ARCH_LIST="12.0"         # Required before building any CUDA extension
--no-build-isolation                # Required for gsplat pip install; prevents torch version conflict
--ipc=host                          # Docker flag; PyTorch needs host IPC for shared memory
swapoff -a                          # Disable swap; unified memory + swap = machine freeze
nvcr.io/nvidia/pytorch:25.03-py3    # Base image; ARM64 + CUDA 13 native

ML stack — notes starter

Copy these templates into your notebook so every study session captures what you learned and what to practice next.

Daily study note (20-30 min)
# Date:
# Topic:
# Goal:

## Quick recap
- What concept did I study?
- What is still unclear?

## Code I ran
# paste snippet here

## Shape + dtype checks
- input shape:
- output shape:
- dtype/device:

## One takeaway
- Example: "Broadcasting works right-to-left by dimension."

## Next step
- 1 small thing to try tomorrow
PyTorch debug checklist
# When code breaks, always check:
print("shape:", x.shape)
print("dtype:", x.dtype)
print("device:", x.device)
print("range:", x.min().item(), x.max().item())
print("has_nan:", x.isnan().any().item())
print("has_inf:", x.isinf().any().item())

# If training:
print("loss:", loss.item())
for name, p in model.named_parameters():
    if p.grad is not None:
        print(name, p.grad.norm().item())

# If output looks wrong:
# 1) inspect one sample
# 2) clamp values to expected range
# 3) move to cpu() then visualize
Experiment log template
# Experiment ID:
# Dataset/scene:
# Model/config:

## Hypothesis
# "If I change X, metric Y should improve because..."

## Changes made
- learning rate:
- batch size:
- loss terms:
- augmentation:

## Results
- train loss:
- val metric:
- visual quality notes:

## Decision
- keep / discard / retry
- reason:

## Follow-up
- next experiment to run:
Python mini-reference for notes
dict.get(key, default)         # Safe key lookup without KeyError
enumerate(items)               # Loop with index and value
zip(a, b)                      # Iterate two lists together
sorted(items, key=...)         # Sort by custom field
any(flags), all(flags)         # Quick boolean checks
assert condition, "message"    # Guard assumptions early
from pprint import pprint      # Readable nested dict output
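The helpers in this mini-reference tend to show up together when summarizing experiment notes. The sketch below ranks a few runs by metric; the run records and field names are made up for illustration.

```python
from pprint import pprint

# Hypothetical experiment records for illustration.
runs = [
    {"id": "a", "lr": 1e-3, "psnr": 27.1},
    {"id": "b", "lr": 1e-4, "psnr": 28.4},
    {"id": "c", "lr": 3e-4},                 # metric missing on purpose
]

assert all("id" in r for r in runs), "every run needs an id"

ids = [r["id"] for r in runs]
# dict.get avoids a KeyError for the run without a metric.
psnrs = [r.get("psnr", float("-inf")) for r in runs]

# zip pairs the two lists; sorted with a key puts the best metric first.
ranked = sorted(zip(ids, psnrs), key=lambda pair: pair[1], reverse=True)

lines = []
for rank, (run_id, psnr) in enumerate(ranked, start=1):
    lines.append(f"{rank}. {run_id} psnr={psnr}")

any_missing = any("psnr" not in r for r in runs)
pprint(runs)
```

Using `float("-inf")` as the default keeps runs without a metric at the bottom of the ranking instead of raising.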