MatLogica benchmark built on the ISDA-SIMM open-source project with added pricers, comparing CPU + AADC against GPU for margin sensitivity calculations: exact gradients in a single adjoint pass replace bump-and-revalue. MatLogica does not provide a SIMM model; AADC accelerates your existing models. All timings are from actual runs, not estimates. Try AADC on your own models.
ISDA SIMM v2.6 is a deterministic, differentiable function of portfolio sensitivities. Four computational challenges emerge at production scale.
- For T trades and K risk factors, bump-and-revalue costs O(T × K) pricings. 5,000 IR swaps × 12 tenors = 60,000 pricing calls.
- "What is the marginal IM of this new trade?" requires a full SIMM recalculation per candidate trade per counterparty: O(C × N × SIMM).
- Allocating T trades to P netting sets to minimize IM has a discrete search space of P^T, and each objective evaluation needs a full SIMM computation.
- "Which trades consume margin?" Naive leave-one-out requires T full SIMM recalculations: O(T × SIMM). For 100,000 trades at 13 ms/eval, that's 19.1 minutes.

Record once, differentiate everywhere. AADC compiles the SIMM formula into a replayable kernel with exact adjoint gradients.
SIMM formula traced with K ≈ 50 risk-factor inputs. One-time ~409 ms cost.
All P portfolios in one aadc.evaluate() call. 10–200× vs P separate calls.
Exact ∂IM/∂S for all K risk factors in one backward sweep. No bump size.
The critical insight: record the SIMM kernel with K ≈ 100 aggregated risk-factor inputs. The kernel size stays constant as the portfolio grows.
```python
# Kernel: K ~ 100 inputs (constant as the portfolio grows)
agg_S = allocation.T @ sensitivity_matrix    # (P, K) aggregation in NumPy, fast
inputs = {sens_handles[k]: agg_S[:, k] for k in range(K)}
results = aadc.evaluate(funcs, request, inputs, workers)  # ONE call for ALL P portfolios

# Chain rule: full T x P allocation gradient
gradient = S @ dIM_dS.T                      # (T, P) NumPy matrix multiply

# Cost: O(T x K x P) NumPy ops + 1 kernel eval
# For T=10K, K=50, P=20: 10M multiply-adds + 1 AADC call (~7 ms)
```
Why Per-Iteration Cost Stays at ~6 ms: Instead of calling aadc.evaluate() P times (one per portfolio), we pass arrays of length P and compute all portfolios in one dispatch call. The Python→C++ overhead is paid once, not P times. For P=20 portfolios, this yields 10–200× speedup vs a naive loop.
The problem: "Which trades consume margin?" Naive leave-one-out requires T full SIMM recalculations.
With AADC: a single gradient computation gives the exact Euler decomposition for all T trades, with Euler error below 10⁻¹².
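Because SIMM is positively homogeneous of degree 1 in the sensitivities, Euler's theorem makes the per-factor contributions s_i × ∂IM/∂s_i sum exactly to the total IM. A minimal sketch on a SIMM-like quadratic-form margin, with an analytic gradient standing in for the AADC adjoint (the matrix Q is illustrative, not a SIMM calibration):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 50
s = rng.normal(size=K)                   # aggregated sensitivities
A = rng.normal(size=(K, K))
Q = A @ A.T + K * np.eye(K)              # positive-definite "SIMM-like" matrix

im = np.sqrt(s @ Q @ s)                  # margin, homogeneous of degree 1 in s
grad = Q @ s / im                        # exact gradient dIM/ds
contrib = s * grad                       # Euler allocation per risk factor

# Contributions sum back to the total margin to machine precision
assert abs(contrib.sum() - im) < 1e-9 * im
```

The same identity is what lets a single cached gradient replace T leave-one-out recalculations.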
| Trades | AADC Python | AADC C++ | GPU Pathwise | GPU BF | Baseline* |
|---|---|---|---|---|---|
| 100 | 0.06 ms | 1.83 ms | 0.03 ms | 1.79 ms | ~5 min |
| 500 | 0.3 ms | 2.41 ms | 0.2 ms | 3.73 ms | ~22 min* |
| 1,000 | 0.13 ms | 2.67 ms | 0.07 ms | 2.75 ms | ~43 min* |
| 5,000 | 1.46 ms | 3.95 ms | 1.18 ms | 3.9 ms | ~3.6 hr* |
| 10,000 | 1.53 ms | 6.15 ms | 1.11 ms | 5.08 ms | ~7.2 hr* |
Why all backends are fast: AADC and GPU Pathwise cache the full gradient (dIM/dS) from portfolio setup — attribution is then a free dot product on the cached gradient, requiring zero additional SIMM evaluations. GPU BF lacks a cached gradient and must bump-and-revalue each trade. Timings exclude one-time compilation: AADC kernel recording (~25 ms) and GPU Numba CUDA JIT compilation (seconds on first invocation), both done once at start of day.
Why AADC C++ is slower than Python here: attribution is a single-evaluation workflow where the time is dominated by O(T×K×P) matrix multiplications (sensitivity aggregation and gradient chain rule), not the AADC kernel. Python uses BLAS-optimised NumPy for these; C++ uses unoptimised OpenMP loops. C++ AADC's kernel advantage (26×) only dominates at EOD scale where thousands of kernel evaluations are needed.
* Baseline = bump-and-revalue (no AADC or GPU). Measured at up to 200 trades on earlier hardware (Dual Intel Xeon, 112 cores); larger trade counts extrapolated linearly. Baseline computes gradients by bumping each sensitivity and re-evaluating SIMM — O(T×K) full recalculations per gradient, making interactive use infeasible at scale.
The problem: "Which counterparty gives lowest marginal IM?" Full SIMM recalculation per query per counterparty. With AADC: Pre-compute gradient once, then each query = O(K) dot product.
| Trades | AADC Python | AADC C++ | GPU Pathwise | GPU BF | Baseline* |
|---|---|---|---|---|---|
| 1,000 | 847 q/s | 575 q/s | 448 q/s | 769 q/s | <1 q/s* |
| 5,000 | 893 q/s | 410 q/s | 462 q/s | 722 q/s | <1 q/s* |
| 10,000 | 823 q/s | 265 q/s | 448 q/s | 725 q/s | <1 q/s* |
Why AADC is fast: the gradient from portfolio setup turns each routing query into an O(K) dot product — marginal_IM = grad · s_new — instead of a full SIMM recalculation per counterparty.
AADC and GPU Pathwise need only 5 gradient evaluations for 50 trades (refresh every 10); GPU BF needs 100 forward-only evaluations (no gradient shortcut).
Timings exclude one-time compilation: AADC kernel recording (~25 ms) and GPU Numba CUDA JIT compilation (seconds on first invocation).
Why AADC C++ is slower than Python here: same as attribution — few kernel evaluations, dominated by matrix multiplications where Python's BLAS-backed NumPy outperforms C++'s unoptimised OpenMP loops. At EOD scale (thousands of evaluations), C++ AADC reaches 34,012 evals/sec — 26× faster than AADC Python. See Allocation Optimization for convergence data.
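The routing query itself is a first-order Taylor step: marginal_IM ≈ grad · s_new, accurate when the candidate trade is small relative to the netting set. A toy sketch with a Euclidean-norm margin and an analytic gradient standing in for the cached AADC gradient (`s_new` is a hypothetical candidate trade):

```python
import numpy as np

rng = np.random.default_rng(1)
K = 50
s = rng.normal(size=K) * 100             # existing netting-set sensitivities
im = lambda x: np.sqrt(np.sum(x * x))    # toy margin, stand-in for SIMM
grad = s / im(s)                         # gradient cached once at portfolio setup

s_new = rng.normal(size=K)               # candidate trade, small vs the portfolio
approx = grad @ s_new                    # O(K) dot product per routing query
exact = im(s + s_new) - im(s)            # full recomputation, for comparison

assert abs(approx - exact) < 0.01 * im(s)  # first-order error is tiny for small trades
```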
The problem: "What if I unwind this trade? Add a hedge? Apply a stress scenario?" Each scenario requires recomputing SIMM from scratch.
With AADC: The recorded kernel enables instant scenario evaluation. Gradient-based approximations for small perturbations; full kernel replay for larger changes.
| Scenario | AADC Python | AADC C++ | GPU Pathwise | GPU BF | Baseline* | Use Case |
|---|---|---|---|---|---|---|
| Unwind top 10 contributors | 138 ms | 7.9 ms | 316 ms | 197 ms | ~2s* | Identify margin reduction opportunities |
| Add offsetting hedge | 17.7 ms | 1 ms | 40 ms | 25 ms | ~2s* | Test hedge effectiveness before execution |
| IR stress +50 bps | 7.5 ms | 0.43 ms | 17 ms | 11 ms | ~10s* | Real-time stress testing |
| FX shock ±10% | 8.2 ms | 0.47 ms | 19 ms | 12 ms | ~10s* | Currency exposure analysis |
Scenario throughput on H100 (5,000 trades): AADC Python ~1,320/sec · AADC C++ ~34,012/sec · GPU Pathwise ~721/sec · GPU Brute-Force ~31/sec
At AADC speeds, a risk manager can sweep parameter grids interactively. GPU kernel launch overhead per scenario forces batch-all-at-once workflows — answering pre-defined questions rather than enabling discovery.
Why GPU Pathwise is slower than GPU Brute-Force here: What-if scenarios are forward-only evaluations — computing the new SIMM value, not gradients. GPU Pathwise carries the overhead of its AD computation graph even when only the forward value is needed, while GPU Brute-Force runs a leaner forward-only kernel with no differentiation infrastructure. Where gradients are needed (attribution, optimization), Pathwise is faster; for forward-only what-if evaluations, the BF kernel has lower per-call overhead.
Why AADC Python is slower than C++ here: What-if scenarios require kernel re-evaluation per scenario. AADC C++ executes the compiled kernel directly in native code (~0.4–8 ms per scenario), while AADC Python pays Python→C++ dispatch overhead on each call (~8–138 ms). For single-evaluation tasks like attribution, this overhead is negligible relative to NumPy matrix ops; for multi-scenario workflows it dominates. See EOD Pipeline for throughput comparison.
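The split between gradient-based approximation and full kernel replay comes down to shock size: the first-order step grad · Δs is accurate for small perturbations, and its error grows roughly quadratically with the shock, which is why larger stresses replay the kernel. A toy NumPy sketch (illustrative margin function, not the SIMM formula):

```python
import numpy as np

rng = np.random.default_rng(7)
s = rng.normal(size=50) * 100
im = lambda x: np.sqrt(np.sum(x * x))    # toy margin, stand-in for SIMM
grad = s / im(s)                         # gradient cached at setup

def approx_error(shock):
    approx = grad @ shock                # gradient-based what-if, no replay
    exact = im(s + shock) - im(s)        # "full kernel replay" answer
    return abs(approx - exact)

small = rng.normal(size=50)              # small perturbation (e.g. a hedge tweak)
large = 50 * small                       # large stress scenario

assert approx_error(small) < approx_error(large)   # bigger shocks need full replay
```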
The pipeline: CRIF generation + SIMM aggregation + gradient computation. CRIF is shared across all implementations. SIMM + gradient time differs per backend.
| Implementation | CRIF (shared) | SIMM + Gradient | Total |
|---|---|---|---|
| Backend | Evals/sec | Optimization converges? |
|---|---|---|
| AADC Python | 1,320/s | ✓ converges |
| AADC C++ | 34,012/s | ✓ 1 iteration |
| GPU Pathwise | 721/s | ✓ 2 iterations |
| GPU Brute-Force | 31/s | ✗ fails at 500+ trades |
| Baseline (bump & revalue)* | <0.1/s | ✗ infeasible |
AADC C++ is 47× faster than GPU pathwise, and GPU brute-force cannot converge due to noisy finite-difference gradients. See Allocation Optimization for details.
JIT compilation note: All throughput figures exclude one-time compilation costs. AADC kernel recording: ~25 ms (Python) / ~200–400 ms (C++). GPU implementations use Numba CUDA JIT, which incurs seconds of compilation on the first kernel invocation. All compilation is amortised over thousands of evaluations at EOD scale.
Per risk class (intra-bucket):
K_r = sqrt(Σ_i Σ_j ρ_ij × WS_i × WS_j)
where WS_i = w_i × s_i (weighted sensitivity), w_i = risk weight, ρ_ij = intra-bucket correlation.
Total IM (cross-risk-class):
IM = sqrt(Σ_r Σ_s ψ_rs × K_r × K_s)
where ψ_rs = cross-risk-class correlation (6×6 matrix covering Rates, CreditQ, CreditNonQ, Equity, Commodity, FX).
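Both aggregation levels can be sketched in NumPy; the weights and correlations below are illustrative placeholders, not the official SIMM v2.6 calibration:

```python
import numpy as np

def bucket_margin(s, w, rho):
    """K_r = sqrt(sum_ij rho_ij * WS_i * WS_j), with WS_i = w_i * s_i."""
    ws = w * s                        # weighted sensitivities
    return np.sqrt(ws @ rho @ ws)     # quadratic form under intra-bucket correlation

def total_im(K_vec, psi):
    """IM = sqrt(sum_rs psi_rs * K_r * K_s) across risk classes."""
    return np.sqrt(K_vec @ psi @ K_vec)

# Illustrative inputs: 3 risk factors in one class, 2 risk classes
s = np.array([100.0, -50.0, 30.0])            # raw sensitivities
w = np.array([0.5, 0.5, 0.5])                 # risk weights (placeholder values)
rho = np.full((3, 3), 0.3)
np.fill_diagonal(rho, 1.0)                    # intra-bucket correlation matrix

K1 = bucket_margin(s, w, rho)
K2 = 40.0                                     # margin of a second risk class
psi = np.array([[1.0, 0.28], [0.28, 1.0]])    # cross-class correlation (placeholder)
im_total = total_im(np.array([K1, K2]), psi)
```

With positive cross-class correlation, the total IM always lands between the largest single K_r and the simple sum of the K_r.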
Record phase: with aadc.record_kernel() as funcs traces the computational graph of the SIMM formula.
Inputs: K aggregated sensitivities (one per risk factor bucket). Constants (weights, correlations) baked into the tape as Python floats.
Output: Total IM value, marked with .mark_as_output().
Evaluate phase: aadc.evaluate(funcs, request, inputs, workers) replays the tape with new input values.
Batching: Pass arrays of length P (one per portfolio) to compute all portfolios in a single dispatch call.
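The record/evaluate split can be mimicked with a plain-Python closure: the "record" step bakes the constants into a reusable function of the K inputs, and the "evaluate" step replays it on a (P, K) batch so all portfolios share one dispatch. This is a NumPy stand-in for illustration, not the actual aadc API:

```python
import numpy as np

def record_simm_kernel(weights, rho):
    """'Record' phase: bake constants (weights, correlations) into a closure."""
    Q = np.diag(weights) @ rho @ np.diag(weights)
    def kernel(agg_S):
        # agg_S has shape (P, K): one row of aggregated sensitivities per portfolio
        return np.sqrt(np.einsum("pk,kj,pj->p", agg_S, Q, agg_S))
    return kernel

K, P = 4, 3
kernel = record_simm_kernel(np.full(K, 0.5), np.eye(K))

# 'Evaluate' phase: one batched call covers all P portfolios
agg_S = np.arange(P * K, dtype=float).reshape(P, K)
ims = kernel(agg_S)
assert ims.shape == (P,)
```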
Per-trade pricing + full Greeks via AADC in C++ (100 trades, 4 threads):
| Trade Type | Kernel Inputs | Eval Time (per trade) | Recording Time |
|---|---|---|---|
| IR Swap | 12 | 0.57 μs | 10.1 ms |
| Equity Option | 14 | 0.29 μs | 6.8 ms |
| FX Option | 26 | 0.32 μs | 6.8 ms |
| Inflation Swap | 24 | 0.10 μs | 6.4 ms |
| XCCY Swap | 25 | 0.32 μs | 8.5 ms |
| Risk Class | Measures | Tenors/Buckets |
|---|---|---|
| Rates (IR) | Delta, Vega | 12 tenor buckets (2w – 30y) |
| Credit Qualifying | Delta, Vega, BaseCorr | 1–12 buckets |
| Credit Non-Qualifying | Delta, Vega | 1–2 buckets |
| Equity | Delta, Vega | 1–12 sectors |
| Commodity | Delta, Vega | 1–17 buckets |
| FX | Delta, Vega | By currency pair |
AADC Kernel Recording (~25ms Python, ~200-400ms C++): One-time cost to trace the SIMM formula as a computational graph. Done once at start of day.
GPU JIT Compilation (seconds): One-time cost for Numba CUDA to compile Python kernels to GPU machine code on first invocation. Excluded from per-evaluation timings.
Kernel Evaluation (~1-10ms): Replay the recorded kernel with new inputs. Produces both IM value and exact gradients.
Gradient Computation (included in eval): AADC adjoint pass computes dIM/dSensitivity for all K risk factors simultaneously — O(K) cost.
Marginal IM Query (~6 μs): after the gradient is computed, each new trade's impact is a dot product with the gradient, taking microseconds per query.
Hardware: X13 8U GPU System · NVIDIA HGX H100 8-GPU · Dual 5th Gen Intel Xeon Platinum 8568Y+
* Baseline = bump-and-revalue (no AADC or GPU). Measured at up to 200 trades on earlier hardware (Dual Intel Xeon, 112 cores); larger trade counts extrapolated linearly. Each gradient evaluation requires O(T×K) full recalculations, making interactive use infeasible at scale.
AADC C++ computes SIMM sensitivities at 34,000 evaluations/sec on CPU, 47× faster than GPU pathwise differentiation on an NVIDIA H100. AADC computes exact adjoint gradients in a single backward pass; GPU pathwise also provides analytic gradients but at lower throughput, while GPU brute-force relies on noisy finite-difference approximations that fail to converge at 500+ trades.
GPU brute-force approximates gradients by bumping each risk factor and re-evaluating, producing noisy estimates. At 500+ trades, the noise overwhelms the signal and the optimizer wanders, hitting the iteration cap without converging. AADC and GPU pathwise both provide analytic gradients and converge in 1-2 iterations at all scales.
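The bump-size dilemma behind that noise is easy to reproduce: too large a bump incurs truncation error, too small a bump amplifies floating-point roundoff. A sketch on a toy margin function (Euclidean norm standing in for SIMM):

```python
import numpy as np

rng = np.random.default_rng(2)
s = rng.normal(size=50)
im = lambda x: np.sqrt(np.sum(x * x))      # toy stand-in for the SIMM value
exact = s / im(s)                          # analytic (adjoint-style) gradient

def fd_grad(h):
    # Brute-force: bump each factor and re-evaluate, one call per factor
    g = np.empty(s.size)
    for k in range(s.size):
        e = np.zeros(s.size)
        e[k] = h
        g[k] = (im(s + e) - im(s)) / h
    return g

err_mid  = np.max(np.abs(fd_grad(1e-6)  - exact))   # near-optimal bump
err_big  = np.max(np.abs(fd_grad(1e-1)  - exact))   # truncation error dominates
err_tiny = np.max(np.abs(fd_grad(1e-12) - exact))   # roundoff noise dominates
assert err_big > err_mid and err_tiny > err_mid
```

An adjoint pass has no bump parameter to tune, which is why its gradients stay exact at any portfolio scale.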
The AADC pipeline has four steps: (1) Record the SIMM formula as a kernel with K ≈ 50 risk-factor inputs, (2) Batch evaluate all P portfolios in one aadc.evaluate() call, (3) Compute adjoint gradients for all K risk factors in one backward sweep, (4) Apply to attribution, pre-trade routing, or optimization.
For 100,000 trades, AADC computes full trade-level margin attribution in 73ms — a 15,618× speedup over naive leave-one-out recalculation which takes 19.1 minutes. The speedup grows with portfolio size because AADC cost is O(K) while leave-one-out is O(T × SIMM).
The benchmark ran on an X13 8U GPU System with NVIDIA HGX H100 8-GPU and Dual 5th Gen Intel Xeon Platinum 8568Y+. AADC runs on CPU only — it outperforms the H100 GPU by 47× for SIMM sensitivity calculations.
AADC pre-computes the gradient once, then each pre-trade query reduces to an O(K) dot product — marginal_IM = grad · s_new. This enables 823-893 queries/sec for 10K trade portfolios, compared to less than 1 query/sec for baseline bump-and-revalue.
For 10K IR swaps, the full EOD pipeline (CRIF generation + SIMM + gradients) takes 12.0s with AADC C++, 23.1s with AADC Python, 34.1s with GPU pathwise, and 22.2s with GPU brute-force. Baseline bump-and-revalue takes approximately 7 hours.
Yes. AADC records your existing SIMM implementation as a kernel — no model rewrite needed. The kernel is O(K) where K ≈ 50-100 risk factors, independent of portfolio size. Kernel recording takes 200-400ms (one-time, start of day), then each evaluation takes 1-10ms.
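The O(K) claim follows from aggregation: per-trade sensitivities are summed into K bucket totals before they reach the kernel, so the kernel's input count is independent of T. A minimal NumPy illustration with hypothetical shapes:

```python
import numpy as np

K = 50
for T in (100, 1_000, 10_000):
    # Hypothetical per-trade sensitivity matrix, one row per trade
    trade_S = np.random.default_rng(3).normal(size=(T, K))
    agg = trade_S.sum(axis=0)      # aggregate once, outside the kernel
    assert agg.shape == (K,)       # kernel always sees K inputs, whatever T is
```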