ML stack reference
Python, NumPy, PyTorch, and LibTorch C++ in one place — tensor work, training loops, and patterns from gsplat, nerfstudio, vne3dgs, and custom C++ inference.
Create arrays
np.zeros((N, 3))          N×3 array of zeros
np.ones((N, 3))           N×3 array of ones
np.random.randn(N, 3)     N×3 samples from the standard normal distribution
np.linspace(0, 1, 100)    100 evenly spaced values
np.arange(0, 10, 2)       [0, 2, 4, 6, 8]

Shape & reshape
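As a quick check of the shape rules in this section, here is a minimal sketch. The sizes N and M and the array names are arbitrary choices for illustration.

```python
import numpy as np

N, M = 4, 2
a = np.arange(N, dtype=np.float32)        # shape (4,)
b = np.ones(N, dtype=np.float32)          # shape (4,)

# stack creates a NEW axis: (4,) + (4,) -> (4, 2)
pairs = np.stack([a, b], axis=1)

# concatenate joins along an EXISTING axis: (4, 3) + (2, 3) -> (6, 3)
p = np.zeros((N, 3), dtype=np.float32)
q = np.ones((M, 3), dtype=np.float32)
joined = np.concatenate([p, q], axis=0)

# -1 lets NumPy infer the remaining dimension (18 elements / 3 rows = 6)
regrouped = joined.reshape(3, -1)
flat = joined.flatten()                   # 1D copy

print(pairs.shape, joined.shape, regrouped.shape, flat.shape)
```

The distinction to remember: stack needs inputs of identical shape, while concatenate only needs the non-joined axes to agree.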
a.shape                           Returns a tuple, e.g. (5834784, 3)
a.reshape(N, -1)                  Reshape; -1 infers the remaining dimension
a.flatten()                       1D array
np.stack([a, b], axis=1)          [N] + [N] → [N, 2]
np.concatenate([a, b], axis=0)    [N, 3] + [M, 3] → [N+M, 3]

Indexing & masking
a[0]                      First row
a[:, 1]                   All rows, column 1
a[10:20]                  Rows 10 to 19
a[a > 0.5]                Boolean mask; filters values
np.where(a > 0, a, 0)     Conditional select

Math ops
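The math ops in this section can be tried on a few hand-picked numbers. This is a minimal sketch: the helper name sigmoid and all input values are made up for illustration, and the log/logit storage convention is the one this reference describes for 3DGS scales and opacities.

```python
import numpy as np

def sigmoid(a):
    """Elementwise logistic function, 1 / (1 + e^-a)."""
    return 1.0 / (1.0 + np.exp(-a))

# Scales stored in log space come back through exp()
raw_scales = np.log(np.array([[0.5, 1.0, 2.0]], dtype=np.float32))
scales = np.exp(raw_scales)

# Opacities stored as logits come back through sigmoid()
raw_opacity = np.array([0.0, 2.0, -2.0], dtype=np.float32)
opacity = sigmoid(raw_opacity)

points = np.array([[3.0, 4.0, 0.0],
                   [0.0, 0.0, 2.0]], dtype=np.float32)
lengths = np.linalg.norm(points, axis=-1)   # one L2 norm per row
col_mean = points.mean(axis=0)              # averages over rows, one value per column

print(scales, opacity, lengths, col_mean)
```

Note that axis=0 collapses the row dimension, so the result of points.mean(axis=0) has one entry per column.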
np.exp(a)                     Elementwise exp (3DGS scale activation)
np.log(a)                     Elementwise log
1 / (1 + np.exp(-a))          Sigmoid (3DGS opacity activation)
np.linalg.norm(a, axis=-1)    L2 norm per row
a @ b                         Matrix multiply
a.mean(axis=0)                Mean over rows (one value per column)

Type conversion
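A small sketch of the casts in this section, assuming a float image with values nominally in [0, 1]. The helper name to_uint8_image is hypothetical. Clipping first matters because casting out-of-range floats straight to uint8 does not saturate.

```python
import numpy as np

def to_uint8_image(img):
    """Convert a float image in [0, 1] to uint8 in [0, 255]."""
    clipped = np.clip(img, 0.0, 1.0)          # guard against out-of-range values
    return (clipped * 255.0).round().astype(np.uint8)

img64 = np.array([[-0.2, 0.0],
                  [0.5, 1.3]])                # NumPy defaults to float64
img32 = img64.astype(np.float32)              # halve memory before sending to a GPU
img8 = to_uint8_image(img32)

print(img64.dtype, img32.dtype, img8.dtype)
print(img8)
```

astype always returns a new array, so the original float data stays untouched.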
a.astype(np.float32)     Cast to float32; the usual dtype for GPU work
a.astype(np.uint8)       Cast to uint8 for saving images
torch.tensor(a)          NumPy → torch tensor (copies the data)
tensor.numpy()           torch → NumPy (tensor must be on CPU)
tensor.cpu().numpy()     GPU tensor → NumPy

Create tensors
torch.zeros(N, 3)                              CPU zeros
torch.zeros(N, 3, device='cuda')               GPU zeros
torch.randn(N, 3, device='cuda')               GPU random normal
torch.tensor(np_array, dtype=torch.float32)    From NumPy
torch.eye(4, device='cuda')                    4×4 identity matrix
torch.arange(0, 10, device='cuda')             Range on GPU

Shape manipulation
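The view / reshape / contiguous entries in this section are easiest to see on a permuted tensor. The sketch below runs on CPU; the tensor sizes are arbitrary.

```python
import torch

x = torch.arange(24, dtype=torch.float32).reshape(2, 3, 4)   # [2, 3, 4]
y = x.permute(2, 0, 1)                                       # [4, 2, 3], same storage, new strides

try:
    y.view(-1)                    # flattening a permuted tensor needs contiguous memory
    view_ok = True
except RuntimeError:
    view_ok = False

flat_a = y.reshape(-1)            # works: copies because y is non-contiguous
flat_b = y.contiguous().view(-1)  # same result: explicit copy, then a cheap view

batched = x.unsqueeze(0)          # [1, 2, 3, 4]
restored = batched.squeeze(0)     # [2, 3, 4]

print(y.is_contiguous(), view_ok, flat_a.shape, batched.shape)
```

A reasonable habit is to use reshape by default and reach for view only when you want an error if a copy would be needed.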
x.shape                torch.Size([N, 3])
x.view(N, -1)          Reshape without copying (needs contiguous memory)
x.reshape(N, -1)       Reshape; copies if the tensor is non-contiguous
x.unsqueeze(0)         [N, 3] → [1, N, 3] (add a dim)
x.squeeze(0)           [1, N, 3] → [N, 3] (remove a size-1 dim)
x.permute(2, 0, 1)     Reorder dims, like a generalized transpose
x.contiguous()         Make memory contiguous after permute

Device management
x.to('cuda')                        Move to GPU
x.to('cpu')                         Move to CPU
x.cuda()                            Shorthand for .to('cuda')
x.cpu()                             Shorthand for .to('cpu')
x.device                            Returns the device, e.g. cuda:0
torch.cuda.is_available()           Check whether a GPU is available
torch.cuda.get_device_name(0)       GPU name string
torch.device('cuda:1')              Explicit device handle; pass to .to(device)
torch.cuda.set_device(1)            Default CUDA device index for new tensors

Dtype
x.dtype      Returns torch.float32 etc.
x.float()    Cast to float32
x.half()     Cast to float16; often faster on GPU, at reduced precision
x.long()     Cast to int64
x.bool()     Cast to bool

Reduction ops
x.sum()                    Sum of all elements
x.sum(dim=0)               Sum over dim 0: [N, 3] → [3]
x.mean(dim=-1)             Mean along the last dim
x.max(dim=0).values        Max values (max with dim returns a namedtuple)
x.min(dim=0).indices       Indices of the min values
x.norm(dim=-1)             L2 norm per row
x.median(dim=0).values     Median; used in scene_center()
x.quantile(0.9)            90th percentile

Elementwise ops
x * y               Elementwise multiply
x + y               Elementwise add
torch.exp(x)        e^x (3DGS scale activation)
torch.sigmoid(x)    1/(1+e^-x) (opacity activation)
torch.relu(x)       max(0, x)
x.clamp(0, 1)       Clip to [0, 1], e.g. for RGB
x.abs()             Absolute value

Matrix ops
x @ y                              Matrix multiply
torch.matmul(x, y)                 Same as @
x.T                                Transpose of a 2D tensor
torch.inverse(x)                   Matrix inverse
torch.linalg.norm(x)               Matrix norm (Frobenius by default)
torch.bmm(a, b)                    Batched matmul: [B,N,K] @ [B,K,M] → [B,N,M]
torch.einsum('ij,jk->ik', a, b)    Einstein summation; mirrors the index notation used in papers

Combine tensors
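The difference between the combining ops in this section shows up in the resulting shapes and in whether memory is shared. A minimal CPU sketch with arbitrary sizes:

```python
import torch

a = torch.zeros(2, 3)
b = torch.ones(2, 3)

cat = torch.cat([a, b], dim=0)      # joins along an existing dim -> [4, 3]
stk = torch.stack([a, b], dim=0)    # adds a new leading dim      -> [2, 2, 3]

row = torch.tensor([[1.0, 2.0, 3.0]])   # [1, 3]
view = row.expand(4, -1)                # [4, 3], no copy: dim 0 gets stride 0
copy = row.repeat(4, 1)                 # [4, 3], real copy of the data

view_shares_memory = view.data_ptr() == row.data_ptr()
copy_shares_memory = copy.data_ptr() == row.data_ptr()

print(cat.shape, stk.shape, view.stride(), view_shares_memory, copy_shares_memory)
```

Because an expanded tensor aliases the same memory in every repeated slot, writing into it in place is unsafe; use repeat (or clone the expanded view) when the result will be modified.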
torch.cat([a, b], dim=0)      Concat along an existing dim: [N,3] + [M,3] → [N+M,3]
torch.stack([a, b], dim=0)    New dim: [N,3] + [N,3] → [2,N,3]
a.expand(4, -1, -1)           Repeat without copying (broadcast view; only size-1 dims can expand)
a.repeat(4, 1, 1)             Repeat with a copy

Basic training loop
import torch
import torch.nn.functional as F

# MyNet, x, target and n_epochs are defined elsewhere
model = MyNet().cuda()
optimizer = torch.optim.Adam(
    model.parameters(), lr=1e-3
)
for epoch in range(n_epochs):
    # Forward
    pred = model(x)
    loss = F.mse_loss(pred, target)
    # Backward
    optimizer.zero_grad()   # clear old grads
    loss.backward()         # compute grads
    optimizer.step()        # update params
    if epoch % 10 == 0:
        print(f"{epoch}: {loss.item():.4f}")

Optimizers
torch.optim.Adam(params, lr=1e-3)     Good default; try this first
torch.optim.SGD(params, lr=0.01)      Stochastic gradient descent
torch.optim.AdamW(params, lr=1e-3)    Adam + weight decay
optimizer.zero_grad()                 Clear gradients; call before backward()
optimizer.step()                      Update parameters

Loss functions
F.mse_loss(pred, target)                Mean squared error
F.l1_loss(pred, target)                 Mean absolute error
F.cross_entropy(pred, labels)           Classification loss (pred holds raw logits)
F.binary_cross_entropy(pred, target)    Binary classification (pred holds probabilities in [0, 1])
(1 - ssim(pred, target))                SSIM term used in 3DGS training (ssim comes from a separate helper, not from torch)

Gradient control
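The gradient-control calls in this section can be exercised on a single small tensor. This is a minimal CPU sketch; the tensor w and the toy loss are made up for illustration.

```python
import torch

w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
loss = (w ** 2).sum()
loss.backward()                      # d(loss)/dw = 2w
grad_before = w.grad.clone()

# Rescale gradients in place so their total L2 norm is at most 1.0.
# The return value is the norm measured BEFORE clipping.
total_norm = torch.nn.utils.clip_grad_norm_([w], max_norm=1.0)
clipped_norm = w.grad.norm()

with torch.no_grad():
    y = w * 2                        # not recorded in the autograd graph
z = (w * 2).detach()                 # recorded, then cut off from the graph

print(grad_before, total_norm.item(), clipped_norm.item())
print(y.requires_grad, z.requires_grad)
```

In a training loop the clipping call goes between loss.backward() and optimizer.step().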
with torch.no_grad():                          Disable gradient tracking, for inference
with torch.inference_mode():                   Stricter than no_grad; also skips version-counter tracking
model.eval()                                   Turn off dropout and use BatchNorm running stats; call before the eval loop
model.train()                                  Restore training behavior after inference
tensor.detach()                                Detach from the graph; stops gradient flow
tensor.requires_grad_(True)                    Enable gradient tracking
loss.backward()                                Compute all gradients
tensor.grad                                    Access the computed gradient
torch.nn.utils.clip_grad_norm_(params, 1.0)    Gradient clipping

Save & load
torch.save(model.state_dict(), 'model.pt')       Save weights
model.load_state_dict(torch.load('model.pt'))    Load weights
torch.save(checkpoint, 'ckpt.pt')                Save full checkpoint
torch.compile(model)                             PyTorch 2 graph compile; optional speedup (Python)

Mixed precision (AMP)
# Recent PyTorch releases prefer torch.amp.GradScaler('cuda')
# and torch.amp.autocast('cuda') over the torch.cuda.amp spellings.
scaler = torch.cuda.amp.GradScaler()
for x, y in loader:
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        loss = model(x, y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

torch.cuda.amp.autocast()    Run the forward pass in lower precision where safe
GradScaler()                 Scale the loss to avoid fp16 underflow in backward

DataLoader
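For the collate_fn entry in this section, here is one possible custom collate function for variable-length samples. It is a sketch under assumptions: the dataset is a plain list of 1D tensors, the name pad_collate is hypothetical, and zero-padding to the longest sample in each batch is just one reasonable strategy.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Hypothetical dataset: 1D tensors with different lengths
samples = [torch.ones(n) for n in (2, 5, 3, 4)]

def pad_collate(batch):
    """Pad each batch to its longest sample and return the original lengths."""
    lengths = torch.tensor([len(s) for s in batch])
    padded = pad_sequence(batch, batch_first=True, padding_value=0.0)
    return padded, lengths

loader = DataLoader(samples, batch_size=2, shuffle=False, collate_fn=pad_collate)
batches = list(loader)

for padded, lengths in batches:
    print(padded.shape, lengths.tolist())
```

Returning the lengths alongside the padded tensor lets the training step build a mask so the padding does not contribute to the loss.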
DataLoader(ds, batch_size=32, shuffle=True)    Basic batched iteration
num_workers=4                                  Parallel CPU loading; start small on Docker/NFS
pin_memory=True                                Faster host→GPU copy when training on CUDA
persistent_workers=True                        Keep workers alive between epochs (PyTorch 1.7+)
collate_fn=batch_list                          Custom stacking when samples are variable-length
drop_last=True                                 Omit the final partial batch; avoids a size-1 batch in BatchNorm

CUDA memory & sync
torch.cuda.empty_cache()                Return cached blocks to the driver; does not free tensors still in scope
torch.cuda.memory_allocated()           Current tensor bytes on the device
torch.cuda.max_memory_allocated()       Peak since the last reset; pair with reset_peak_memory_stats()
torch.cuda.synchronize()                Block until GPU work completes; use when timing
torch.cuda.reset_peak_memory_stats()    Clear the peak counter before a benchmark region

LibTorch is the C++ API for PyTorch. Types live in at:: / torch::; autograd and torch::nn mirror Python. Match your LibTorch download to the CUDA/toolchain you link against.
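One way to see what a LibTorch build has to match, assuming the Python PyTorch package is installed in the same environment, is to ask that package for its own build details. The helper name libtorch_build_info is hypothetical; the attributes it reads are part of the torch package.

```python
import torch
import torch.utils

def libtorch_build_info():
    """Collect the build facts a matching LibTorch should agree with."""
    return {
        "torch_version": torch.__version__,
        "cuda_version": torch.version.cuda,              # None on CPU-only builds
        "cxx11_abi": torch.compiled_with_cxx11_abi(),    # libstdc++ ABI flag
        "cmake_prefix_path": torch.utils.cmake_prefix_path,
    }

info = libtorch_build_info()
for key, value in info.items():
    print(f"{key}: {value}")
```

The cmake_prefix_path value can be passed to CMake as CMAKE_PREFIX_PATH if you want to build against the libraries bundled with the Python package instead of a separate download.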
CMake & include
// C++ source
#include <torch/torch.h>

# CMakeLists.txt (typical)
find_package(Torch REQUIRED)
target_link_libraries(myapp PRIVATE "${TORCH_LIBRARIES}")
set_property(TARGET myapp PROPERTY CXX_STANDARD 17)

-DCMAKE_PREFIX_PATH=/path/to/libtorch    Point CMake at the extracted (prebuilt) libtorch
${TORCH_CXX_FLAGS}                       Append to CMAKE_CXX_FLAGS; ABI / codegen flags from find_package

Tensor creation
torch::zeros({N, 3}, torch::kCUDA)
    Float32 on the default CUDA device
torch::randn({N, 3}, torch::dtype(torch::kFloat16).device(torch::kCUDA))
    Explicit TensorOptions
torch::tensor({1., 2., 3.}, torch::kFloat32)
    From an initializer list
torch::from_blob(ptr, {H, W, 3}, torch::kFloat32).clone()
    Wrap CPU memory without owning it; clone() if the tensor must outlive the buffer
torch::empty_like(x)
    Uninitialized buffer with the same shape/device/dtype as x

Shape, device, dtype
x.sizes()                      IntArrayRef; like shape in Python
x.dim()                        Number of dimensions
x.view({N, -1})                Reshape (must be contiguous, else use reshape)
x.reshape({N, -1})             Reshape, copying if needed
x.to(torch::kCUDA)             Move tensor to CUDA
x.to(torch::kCPU)              Move to CPU
x.to(torch::kFloat16)          Cast dtype
x.device()                     c10::Device, e.g. cuda:0 or cpu
torch::cuda::is_available()    Runtime GPU check

Common ops (C++)
torch::matmul(a, b)                               Same rules as Python @
a.mm(b)                                           2D matrix multiply
torch::cat({a, b}, /*dim=*/0)                     Concat along dim
torch::stack({a, b}, 0)                           New leading dim
torch::relu(x)                                    Elementwise
torch::sigmoid(x)                                 Opacity-style activation
torch::exp(x)                                     Scale-style activation
x.clamp(0, 1)                                     Clip range
x.norm(/*p=*/2, /*dim=*/-1, /*keepdim=*/true)     L2 norm along dim

Autograd
x.requires_grad_(true)           Track ops on this tensor
loss.backward()                  Accumulate grads into .grad()
x.grad()                         Gradient tensor; check .defined() before reading
torch::NoGradGuard no_grad;      RAII; disables grad inside the scope (inference)
torch::Tensor y = x.detach();    Stop gradient through y

torch::nn::Module
struct Net : torch::nn::Module {
  Net()
      : fc1(register_module("fc1", torch::nn::Linear(3, 64))),
        fc2(register_module("fc2", torch::nn::Linear(64, 1))) {}

  torch::Tensor forward(torch::Tensor x) {
    x = torch::relu(fc1->forward(x));
    return fc2->forward(x);
  }

  torch::nn::Linear fc1{nullptr}, fc2{nullptr};
};

auto net = std::make_shared<Net>();
net->to(torch::kCUDA);

Training loop (C++)
auto net = std::make_shared<Net>();
net->to(torch::kCUDA);
torch::optim::Adam optimizer(
    net->parameters(), torch::optim::AdamOptions(1e-3));
// batch_x / batch_y: input and target tensors, already on the CUDA device
for (int64_t epoch = 0; epoch < 100; ++epoch) {
  optimizer.zero_grad();
  auto pred = net->forward(batch_x);
  auto loss = torch::mse_loss(pred, batch_y);
  loss.backward();
  optimizer.step();
}

torch::optim::AdamW    Same family as Python AdamW; configure with AdamWOptions(lr)
torch::mse_loss        Also in torch/nn/functional.h as torch::nn::functional::mse_loss

Save / load
torch::save(net, "model.pt");    Archive module state (trainable module)
torch::load(net, "model.pt");    Restore; net type must match the saved module
torch::save(tensor, "t.pt");     Single tensor or vector of tensors
torch::jit::load("model.ts")     TorchScript / traced model; a different pipeline from the nn.Module save

Python ↔ C++ weights
torch.save(model.state_dict(), "w.pt")    Python: saves only the tensors
torch::pickle_load                        C++: read the Python state_dict bytes, then map tensor names to named_parameters(); or prefer TorchScript for deployment
torch.jit.trace / script                  Python: save a .pt archive, then torch::jit::load in C++ for inference

Indexing & scalars
x.select(0, i)            Slice along dim 0 at index i
x.masked_select(mask)     Bool mask of the same shape; returns 1D flattened values
x.item<float>()           Scalar tensor → C++ scalar (exactly one element)
x.accessor<float, 2>()    Fast CPU element access via [i][j] (CPU tensors only; dtype and rank must match)

Gotchas
view vs reshape               Same as Python: after permute/transpose, call contiguous() before view
std::move(tensor)             Moving tensors is cheap; avoid copies in hot loops
TORCH_MODULE(Net)             Macro: defines Net as a shared_ptr-style holder around NetImpl with ->forward; alternative to raw struct + make_shared
torch::TensorOptions()        Defaults to dtype float32, device CPU; chain .dtype(torch::kFloat16).device(torch::kCUDA)
torch::cuda::synchronize()    Block until CUDA work finishes; use when timing C++ paths
ATen threading                CPU ops use intra-op parallelism; set with at::set_num_threads if needed
ABI                           Prebuilt LibTorch must match the compiler ABI (e.g. libstdc++) and the CUDA major version

nn.Module template
import torch.nn as nn

class MyNet(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(in_dim, 128),
            nn.ReLU(),
            nn.Linear(128, out_dim),
        )

    def forward(self, x):
        return self.layers(x)

model = MyNet(3, 1).cuda()
print(sum(p.numel() for p in model.parameters()),
      "parameters")

Dataset + DataLoader
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

dataset = MyDataset(X, y)
loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,
)
for batch_x, batch_y in loader:
    loss = train_step(batch_x, batch_y)

Inference pipeline
model.eval()              # dropout off, BatchNorm uses running stats
with torch.no_grad():     # disable gradients
    x = preprocess(input).cuda()
    pred = model(x)
    result = postprocess(pred.cpu())
model.train()             # back to training mode

Debug tensor issues
# Always check these when something breaks
print(x.shape)             # shape
print(x.dtype)             # float32 / int64 etc.
print(x.device)            # cpu / cuda:0
print(x.min(), x.max())    # value range
print(x.isnan().any())     # NaN check
print(x.isinf().any())     # Inf check
# Visualize a batch
import matplotlib.pyplot as plt
img = renders[0].cpu().numpy()
plt.imshow(img); plt.savefig("debug.png")

Dataclasses
from dataclasses import dataclass, field

@dataclass
class Config:
    lr: float = 1e-3
    batch_size: int = 32
    device: str = "cuda"
    tags: list = field(default_factory=list)

cfg = Config(lr=1e-4)
print(cfg.lr, cfg.device)

pathlib
from pathlib import Path

root = Path("/workspace/datasets")
ply = root / "garden" / "point_cloud.ply"
print(ply.exists())    # True/False
print(ply.suffix)      # .ply
print(ply.stem)        # point_cloud
print(ply.parent)      # /workspace/datasets/garden
# Glob
plys = list(root.glob("**/*.ply"))
for p in plys:
    print(p)

argparse
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--ply", required=True)
parser.add_argument("--output", default="out.png")
parser.add_argument("--width", type=int, default=1280)
parser.add_argument("--fov", type=float, default=60.0)
parser.add_argument("--verbose", action="store_true")
args = parser.parse_args()
print(args.ply, args.width)

List comprehensions
# Basic
squares = [x**2 for x in range(10)]
# With filter
big = [x for x in values if x > threshold]
# Nested
flat = [item for row in matrix for item in row]
# Dict comprehension
d = {k: v for k, v in zip(keys, values)}
# Generator (memory efficient, no [])
gen = (x**2 for x in range(1_000_000))

Context managers
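The timing pattern in this section can itself be written as a context manager. A minimal sketch using only the standard library; the names timer and timings are hypothetical.

```python
import time
from contextlib import contextmanager

@contextmanager
def timer(label, results=None):
    """Time the enclosed block in milliseconds; optionally store it in `results`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so the timing is always reported
        elapsed_ms = (time.perf_counter() - start) * 1000
        if results is not None:
            results[label] = elapsed_ms
        print(f"{label}: {elapsed_ms:.1f}ms")

timings = {}
with timer("sum", timings):
    total = sum(range(100_000))
```

When the timed block launches GPU work, call torch.cuda.synchronize() at the end of the block; otherwise the clock is read before the kernels have finished.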
# File I/O
with open("log.txt", "w") as f:
    f.write("done\n")
# PyTorch no_grad (most common in ML)
with torch.no_grad():
    pred = model(x)
# Timing (perf_counter is the monotonic clock meant for measuring intervals)
import time
t = time.perf_counter()
result = expensive_op()
print(f"{(time.perf_counter()-t)*1000:.1f}ms")

f-strings
n = 5_834_784
t = 0.133
# Basic
print(f"Gaussians: {n:,}")        # 5,834,784
print(f"Time: {t*1000:.1f}ms")    # 133.0ms
print(f"FPS: {1/t:.1f}")          # 7.5
# Width padding
print(f"{'loss':<10} {'value':>8}")
for k, v in stats.items():
    print(f"{k:<10} {v:>8.4f}")

Load .ply scene
from plyfile import PlyData
import numpy as np, torch

ply = PlyData.read("point_cloud.ply")
v = ply['vertex']
means = torch.tensor(
    np.stack([v['x'], v['y'], v['z']], axis=1),
    dtype=torch.float32, device='cuda')
# Scales stored as log → exp()
scales = torch.tensor(
    np.exp(np.stack([
        v['scale_0'], v['scale_1'], v['scale_2']
    ], axis=1)), dtype=torch.float32, device='cuda')
# Opacity stored as logit → sigmoid()
opacities = torch.sigmoid(torch.tensor(
    np.array(v['opacity']),
    dtype=torch.float32, device='cuda'))
# SH DC → RGB: 0.5 + 0.28209 * f_dc
colors = (0.5 + 0.28209 * torch.tensor(
    np.stack([v['f_dc_0'], v['f_dc_1'], v['f_dc_2']], axis=1),
    dtype=torch.float32, device='cuda')).clamp(0, 1)

gsplat rasterization call
from gsplat import rasterization

renders, alphas, info = rasterization(
    means=means,            # [N, 3]
    quats=quats,            # [N, 4] wxyz normalized
    scales=scales,          # [N, 3] after exp()
    opacities=opacities,    # [N] after sigmoid()
    colors=colors,          # [N, 3] RGB [0,1]
    viewmats=viewmat,       # [C, 4, 4] world→cam
    Ks=K,                   # [C, 3, 3] intrinsics
    width=W,
    height=H,
)
# renders: [C, H, W, 3]
# alphas: [C, H, W, 1]

Camera intrinsics
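The pinhole relation used in this section, fx = (W / 2) / tan(fov / 2), can be checked numerically without a GPU. The sketch below uses only the math module; the helper names focal_from_fov and fov_from_focal are hypothetical, and the field of view is taken to be horizontal because it is paired with the image width.

```python
import math

def focal_from_fov(width, fov_deg):
    """Focal length in pixels for a horizontal field of view given in degrees."""
    return (width / 2) / math.tan(math.radians(fov_deg) / 2)

def fov_from_focal(width, fx):
    """Inverse relation: horizontal field of view in degrees from a focal length in pixels."""
    return math.degrees(2 * math.atan((width / 2) / fx))

fx = focal_from_fov(1280, 60.0)
print(f"fx = {fx:.2f} px")
print(f"fov = {fov_from_focal(1280, fx):.2f} deg")
```

A wider field of view gives a smaller focal length: at 90 degrees the focal length equals half the image width.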
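import math
import torch

def make_K(W, H, fov_deg=60.0, device='cuda'):
    fov = math.radians(fov_deg)          # horizontal field of view
    fx = (W / 2) / math.tan(fov / 2)     # focal length in pixels
    # fy = fx assumes square pixels; principal point at the image center
    K = torch.tensor([[
        [fx,  0, W/2],
        [ 0, fx, H/2],
        [ 0,  0,   1],
    ]], dtype=torch.float32, device=device)
    return K  # [1, 3, 3]

Alpha compositing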
# gsplat returns the rendered color and the accumulated alpha.
# The rendered color is already alpha-weighted (assuming the default RGB
# render mode), so composite over a background without multiplying again:
# out = render + bg * (1 - alpha)
import imageio
import numpy as np

bg = torch.ones(3, device='cuda')    # white
alpha = alphas[0]                    # [H, W, 1]
composited = (renders[0]
              + bg.view(1, 1, 3) * (1 - alpha))    # [H, W, 3]
# Save to PNG
img = (composited.clamp(0, 1).cpu().numpy()
       * 255).astype(np.uint8)
imageio.imwrite("out.png", img)

Scene inspection
# Find scene center (median is robust to outliers)
center = means.median(dim=0).values
# Scene spread
spread = (means - center).norm(dim=-1).mean()
# High opacity core
core = means[opacities > 0.5]
core_center = core.mean(dim=0)
# Orbit radius
radius = spread * 2.0
print(f"Gaussians: {len(means):,}")
print(f"Center: {center.cpu().numpy().round(2)}")
print(f"Spread: {spread:.2f}")
print(f"Radius: {radius:.2f}")

DGX Spark gotchas
TORCH_CUDA_ARCH_LIST="12.0"         Required before building any CUDA extension
--no-build-isolation                Required for the gsplat pip install; prevents a torch version conflict
--ipc=host                          Docker flag; PyTorch needs host IPC for shared memory
swapoff -a                          Disable swap; unified memory plus swap can freeze the machine
nvcr.io/nvidia/pytorch:25.03-py3    Base image; ARM64 + CUDA 13 native

ML stack: notes starter
Copy these templates into your notebook so every study session captures what you learned and what to practice next.
Daily study note (20-30 min)
# Date:
# Topic:
# Goal:
## Quick recap
- What concept did I study?
- What is still unclear?
## Code I ran
# paste snippet here
## Shape + dtype checks
- input shape:
- output shape:
- dtype/device:
## One takeaway
- Example: "Broadcasting works right-to-left by dimension."
## Next step
- 1 small thing to try tomorrow

PyTorch debug checklist
# When code breaks, always check:
print("shape:", x.shape)
print("dtype:", x.dtype)
print("device:", x.device)
print("range:", x.min().item(), x.max().item())
print("has_nan:", x.isnan().any().item())
print("has_inf:", x.isinf().any().item())
# If training:
print("loss:", loss.item())
for name, p in model.named_parameters():
    if p.grad is not None:
        print(name, p.grad.norm().item())
# If output looks wrong:
# 1) inspect one sample
# 2) clamp values to expected range
# 3) move to cpu() then visualize

Experiment log template
# Experiment ID:
# Dataset/scene:
# Model/config:
## Hypothesis
# "If I change X, metric Y should improve because..."
## Changes made
- learning rate:
- batch size:
- loss terms:
- augmentation:
## Results
- train loss:
- val metric:
- visual quality notes:
## Decision
- keep / discard / retry
- reason:
## Follow-up
- next experiment to run:

Python mini-reference for notes
dict.get(key, default)         Safe key lookup without KeyError
enumerate(items)               Loop with index and value
zip(a, b)                      Iterate two lists together
sorted(items, key=...)         Sort by a custom field
any(flags), all(flags)         Quick boolean checks
assert condition, "message"    Guard assumptions early
from pprint import pprint      Readable nested dict output
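The built-ins in this mini-reference combine naturally when summarizing an experiment log. A minimal sketch; the run names and metric values are made up purely for illustration.

```python
from pprint import pprint

# Made-up experiment records; "notes" is deliberately missing from some
runs = [
    {"name": "baseline", "psnr": 27.1},
    {"name": "low_lr", "psnr": 28.4, "notes": "lr=1e-4"},
    {"name": "no_ssim", "psnr": 26.3},
]
assert runs, "expected at least one run"

ranked = sorted(runs, key=lambda r: r["psnr"], reverse=True)   # best first
names = [r["name"] for r in ranked]
notes = [r.get("notes", "-") for r in ranked]                  # no KeyError when absent

lines = [
    f"{i}. {name} ({note})"
    for i, (name, note) in enumerate(zip(names, notes), start=1)
]
all_scored = all("psnr" in r for r in runs)
any_above_28 = any(r["psnr"] > 28 for r in runs)

pprint(ranked)
print("\n".join(lines))
```

enumerate takes a start argument, which avoids the usual i + 1 when numbering for display.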