ISDA SIMM v2.6 · MatLogica Benchmark

SIMM Sensitivities
CPU + AADC vs GPU

MatLogica benchmark built on the ISDA-SIMM open-source project with added pricers. Compares CPU + AADC against GPU for margin sensitivity calculations — exact gradients in a single adjoint pass, replacing bump-and-revalue.

34K/s
SIMM Evals/sec
47×
vs GPU Pathwise
1-2
Iterations to Converge
15,618×
Attribution Speedup
X13 8U GPU System · NVIDIA HGX H100 8-GPU · Dual 5th Gen Intel Xeon Platinum 8568Y+

MatLogica benchmark built on the ISDA-SIMM open-source project with added pricers, comparing CPU + AADC vs GPU. MatLogica does not provide a SIMM model — AADC accelerates your existing models. All timings are from actual runs, not estimated. Try AADC on your own models.

Why SIMM Is Hard at Scale

ISDA SIMM v2.6 is a deterministic, differentiable function of portfolio sensitivities. Four computational challenges emerge at production scale.

1

Sensitivity Computation

For T trades and K risk factors, bump-and-revalue costs O(T × K) pricings. 5,000 IR swaps × 12 tenors = 60,000 pricing calls.

O(T × K) pricings
2

Marginal IM

"What is the marginal IM of this new trade?" requires a full SIMM recalculation per candidate trade per counterparty.

O(C × N × SIMM)
3

Trade Allocation

Allocating T trades to P netting sets to minimize IM. The search space is P^T (discrete). Each objective evaluation needs a full SIMM computation.

P^T search space
4

Margin Attribution

"Which trades consume margin?" Naive leave-one-out requires T full SIMM recalculations. For 100,000 trades at 13 ms/eval, that's 19.1 minutes.

O(T × SIMM)
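To make the O(T × K) cost in challenge 1 concrete, here is a toy bump-and-revalue gradient in Python. The pricer `price_portfolio` is a hypothetical stand-in (not one of the benchmark pricers); the point is the call count: one full revaluation per risk factor.

```python
import numpy as np

def price_portfolio(curve):
    # Hypothetical stand-in for a real pricer: any smooth function
    # of the K curve points serves for illustration.
    return float(np.sum(np.sin(curve) * curve))

def bump_and_revalue_gradient(curve, bump=1e-6):
    """Finite-difference gradient: one full revaluation per risk factor."""
    base = price_portfolio(curve)
    grad = np.empty_like(curve)
    for k in range(curve.size):
        bumped = curve.copy()
        bumped[k] += bump
        grad[k] = (price_portfolio(bumped) - base) / bump
    return grad, curve.size + 1          # gradient, pricing-call count

curve = np.linspace(0.01, 0.05, 12)      # 12 tenor points (K = 12)
grad, n_pricings = bump_and_revalue_gradient(curve)
# Per trade: K + 1 pricings. For T = 5,000 trades that is ~T x K = 60,000
# bumped pricings per gradient, versus a single adjoint sweep per trade
# that returns all K derivatives at once.
```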

The AADC Pipeline

Record once, differentiate everywhere. AADC compiles the SIMM formula into a replayable kernel with exact adjoint gradients.

1

Record Kernel

SIMM formula traced with K ≈ 50 risk-factor inputs. One-time recording cost: ~25 ms (Python) / ~200–400 ms (C++).

2

Batch Evaluate

All P portfolios in one aadc.evaluate() call. 10–200× vs P separate calls.

3

Adjoint Gradient

Exact ∂IM/∂S for all K risk factors in one backward sweep. No bump size.

O(K) Kernel, Not O(T×P)

The critical insight: record the SIMM kernel with K ≈ 50–100 aggregated risk-factor inputs. The kernel size stays constant as the portfolio grows.

# Kernel: K ~ 100 inputs (constant)
agg_S = allocation.T @ sensitivity_matrix   # (P, K) — NumPy, fast
inputs = {sens_handles[k]: agg_S[:, k] for k in range(K)}
results = aadc.evaluate(funcs, request, inputs, workers)  # ONE call for ALL P portfolios

# Chain rule: full T×P allocation gradient
gradient = S @ dIM_dS.T   # (T, P) — NumPy matrix multiply

# Cost: O(T × K × P) NumPy ops + 1 kernel eval
# For T=10K, K=50, P=20: 10M multiply-adds + 1 AADC call (~7 ms)

Why Per-Iteration Cost Stays at ~6 ms: Instead of calling aadc.evaluate() P times (one per portfolio), we pass arrays of length P and compute all portfolios in one dispatch call. The Python→C++ overhead is paid once, not P times. For P=20 portfolios, this yields 10–200× speedup vs a naive loop.

Margin Attribution

The problem: "Which trades consume margin?" Naive leave-one-out requires T full SIMM recalculations.

With AADC: A single gradient computation gives the exact Euler decomposition for all T trades. Euler error < 10⁻¹².

| Trades | AADC Python | AADC C++ | GPU Pathwise | GPU BF | Baseline* |
|---|---|---|---|---|---|
| 100 | 0.06 ms | 1.83 ms | 0.03 ms | 1.79 ms | ~5 min |
| 500 | 0.3 ms | 2.41 ms | 0.2 ms | 3.73 ms | ~22 min* |
| 1,000 | 0.13 ms | 2.67 ms | 0.07 ms | 2.75 ms | ~43 min* |
| 5,000 | 1.46 ms | 3.95 ms | 1.18 ms | 3.9 ms | ~3.6 hr* |
| 10,000 | 1.53 ms | 6.15 ms | 1.11 ms | 5.08 ms | ~7.2 hr* |

Why all backends are fast: AADC and GPU Pathwise cache the full gradient (dIM/dS) from portfolio setup — attribution is then a free dot product on the cached gradient, requiring zero additional SIMM evaluations. GPU BF lacks a cached gradient and must bump-and-revalue each trade. Timings exclude one-time compilation: AADC kernel recording (~25 ms) and GPU Numba CUDA JIT compilation (seconds on first invocation), both done once at start of day.

Why AADC C++ is slower than Python here: attribution is a single-evaluation workflow where the time is dominated by O(T×K×P) matrix multiplications (sensitivity aggregation and gradient chain rule), not the AADC kernel. Python uses BLAS-optimised NumPy for these; C++ uses unoptimised OpenMP loops. C++ AADC's kernel advantage (26×) only dominates at EOD scale where thousands of kernel evaluations are needed.

* Baseline = bump-and-revalue (no AADC or GPU). Measured at up to 200 trades on earlier hardware (Dual Intel Xeon, 112 cores); larger trade counts extrapolated linearly. Baseline computes gradients by bumping each sensitivity and re-evaluating SIMM — O(T×K) full recalculations per gradient, making interactive use infeasible at scale.
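The Euler decomposition behind this attribution can be sketched on a toy one-bucket IM. The weights and correlations below are illustrative values, not calibrated SIMM parameters; the point is that the IM formula is positively homogeneous of degree 1 in the sensitivities, so the per-factor contributions s_i · ∂IM/∂s_i sum back to IM exactly.

```python
import numpy as np

# Toy one-bucket "IM": illustrative weights and correlations,
# NOT calibrated SIMM v2.6 parameters.
K = 12
rho = np.full((K, K), 0.3)
np.fill_diagonal(rho, 1.0)
w = np.linspace(50.0, 100.0, K)          # risk weights
C = np.outer(w, w) * rho                 # combined weight/correlation matrix

rng = np.random.default_rng(0)
s = rng.normal(size=K)                   # aggregated sensitivities

im = np.sqrt(s @ C @ s)                  # bucket margin
grad = (C @ s) / im                      # exact gradient dIM/ds (adjoint result)

# Euler decomposition: IM is homogeneous of degree 1 in s, so the
# per-factor contributions s_i * dIM/ds_i sum back to IM exactly.
contributions = s * grad
```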

Pre-Trade Routing

The problem: "Which counterparty gives lowest marginal IM?" Full SIMM recalculation per query per counterparty. With AADC: Pre-compute gradient once, then each query = O(K) dot product.

| Trades | AADC Python | AADC C++ | GPU Pathwise | GPU BF | Baseline* |
|---|---|---|---|---|---|
| 1,000 | 847 q/s | 575 q/s | 448 q/s | 769 q/s | <1 q/s* |
| 5,000 | 893 q/s | 410 q/s | 462 q/s | 722 q/s | <1 q/s* |
| 10,000 | 823 q/s | 265 q/s | 448 q/s | 725 q/s | <1 q/s* |

Why AADC is fast: the gradient from portfolio setup turns each routing query into an O(K) dot product — marginal_IM = grad · s_new — instead of a full SIMM recalculation per counterparty. AADC and GPU Pathwise need only 5 gradient evaluations for 50 trades (refresh every 10); GPU BF needs 100 forward-only evaluations (no gradient shortcut). Timings exclude one-time compilation: AADC kernel recording (~25 ms) and GPU Numba CUDA JIT compilation (seconds on first invocation).

Why AADC C++ is slower than Python here: same as attribution — few kernel evaluations, dominated by matrix multiplications where Python's BLAS-backed NumPy outperforms C++'s unoptimised OpenMP loops. At EOD scale (thousands of evaluations), C++ AADC reaches 34,012 evals/sec — 26× faster than AADC Python. See Allocation Optimization for convergence data.
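The O(K) routing query can be sketched with the same toy single-bucket IM used above (illustrative parameters, not SIMM calibration): the cached gradient turns each candidate trade into one dot product, with a full recomputation only as the reference answer.

```python
import numpy as np

# Toy single-bucket IM (illustrative parameters, not SIMM calibration)
K = 12
rho = np.full((K, K), 0.3)
np.fill_diagonal(rho, 1.0)
w = np.linspace(50.0, 100.0, K)
C = np.outer(w, w) * rho

def im(s):
    return np.sqrt(s @ C @ s)

rng = np.random.default_rng(1)
s_port = rng.normal(size=K) * 100.0      # existing netting-set sensitivities
grad = (C @ s_port) / im(s_port)         # gradient cached at portfolio setup

s_new = rng.normal(size=K)               # candidate trade (small vs portfolio)

# O(K) routing query: first-order marginal IM from the cached gradient
marginal_fast = grad @ s_new

# Reference answer: full re-aggregation and IM recomputation
marginal_exact = im(s_port + s_new) - im(s_port)
```

For a trade that is small relative to the netting set, the first-order estimate tracks the exact marginal IM closely; the gradient is refreshed periodically (the benchmark refreshes every 10 trades).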

What-If Scenario Analysis

The problem: "What if I unwind this trade? Add a hedge? Apply a stress scenario?" Each scenario requires recomputing SIMM from scratch.

With AADC: The recorded kernel enables instant scenario evaluation. Gradient-based approximations for small perturbations; full kernel replay for larger changes.

| Scenario | AADC Python | AADC C++ | GPU Pathwise | GPU BF | Baseline* | Use Case |
|---|---|---|---|---|---|---|
| Unwind top 10 contributors | 138 ms | 7.9 ms | 316 ms | 197 ms | ~2s* | Identify margin reduction opportunities |
| Add offsetting hedge | 17.7 ms | 1 ms | 40 ms | 25 ms | ~2s* | Test hedge effectiveness before execution |
| IR stress +50 bps | 7.5 ms | 0.43 ms | 17 ms | 11 ms | ~10s* | Real-time stress testing |
| FX shock ±10% | 8.2 ms | 0.47 ms | 19 ms | 12 ms | ~10s* | Currency exposure analysis |

Scenario throughput on H100 (5,000 trades): AADC Python ~1,320/sec · AADC C++ ~34,012/sec · GPU Pathwise ~721/sec · GPU Brute-Force ~31/sec

At AADC speeds, a risk manager can sweep parameter grids interactively. GPU kernel launch overhead per scenario forces batch-all-at-once workflows — answering pre-defined questions rather than enabling discovery.

Why GPU Pathwise is slower than GPU Brute-Force here: What-if scenarios are forward-only evaluations — computing the new SIMM value, not gradients. GPU Pathwise carries the overhead of its AD computation graph even when only the forward value is needed, while GPU Brute-Force runs a leaner forward-only kernel with no differentiation infrastructure. Where gradients are needed (attribution, optimization), Pathwise is faster; for forward-only what-if evaluations, the BF kernel has lower per-call overhead.

Why AADC Python is slower than C++ here: What-if scenarios require kernel re-evaluation per scenario. AADC C++ executes the compiled kernel directly in native code (~0.4–8 ms per scenario), while AADC Python pays Python→C++ dispatch overhead on each call (~8–138 ms). For single-evaluation tasks like attribution, this overhead is negligible relative to NumPy matrix ops; for multi-scenario workflows it dominates. See EOD Pipeline for throughput comparison.
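The split between gradient-based approximation and full kernel replay can be illustrated on a toy single-bucket IM (illustrative parameters only): for a small perturbation the linearisation is cheap and accurate, while a large change such as zeroing out the biggest contributors calls for re-evaluating the kernel.

```python
import numpy as np

# Toy single-bucket IM (illustrative parameters, not SIMM calibration)
K = 12
rho = np.full((K, K), 0.3)
np.fill_diagonal(rho, 1.0)
w = np.linspace(50.0, 100.0, K)
C = np.outer(w, w) * rho

def im(s):
    return np.sqrt(s @ C @ s)

rng = np.random.default_rng(2)
s = rng.normal(size=K) * 100.0
base = im(s)
grad = (C @ s) / base                    # gradient cached at setup

# Small perturbation (e.g. a modest hedge): first-order estimate suffices
ds_small = rng.normal(size=K)
est_small = base + grad @ ds_small
exact_small = im(s + ds_small)

# Large change (e.g. unwinding the three largest contributors):
# replay the full kernel rather than trusting the linearisation
ds_large = np.zeros(K)
top3 = np.argsort(-np.abs(s))[:3]
ds_large[top3] = -s[top3]
est_large = base + grad @ ds_large
exact_large = im(s + ds_large)
```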

Full Pipeline & GPU Comparison

The pipeline: CRIF generation + SIMM aggregation + gradient computation. CRIF is shared across all implementations. SIMM + gradient time differs per backend.

Full-pipeline timings (100 IR swaps, 3 netting sets) by implementation: CRIF (shared) · SIMM + Gradient · Total.

Gradient+Optimization Throughput (5,000 IR swaps)

| Backend | Evals/sec | Optimization converges? |
|---|---|---|
| AADC Python | 1,320/s | ✓ converges |
| AADC C++ | 34,012/s | ✓ 1 iteration |
| GPU Pathwise | 721/s | ✓ 2 iterations |
| GPU Brute-Force | 31/s | ✗ fails at 500+ trades |
| Baseline (bump & revalue)* | <0.1/s | ✗ infeasible |

AADC C++ is 47× faster than GPU pathwise, and GPU brute-force cannot converge due to noisy finite-difference gradients. See Allocation Optimization for details.

JIT compilation note: All throughput figures exclude one-time compilation costs. AADC kernel recording: ~25 ms (Python) / ~200–400 ms (C++). GPU implementations use Numba CUDA JIT, which incurs seconds of compilation on the first kernel invocation. All compilation is amortised over thousands of evaluations at EOD scale.

Technical Deep Dive

ISDA SIMM v2.6 Formula

Per risk class (intra-bucket):

K_r = sqrt(Σ_i Σ_j ρ_ij × WS_i × WS_j)

where WS_i = w_i × s_i (weighted sensitivity), w_i = risk weight, ρ_ij = intra-bucket correlation.

Total IM (cross-risk-class):

IM = sqrt(Σ_r Σ_s ψ_rs × K_r × K_s)

where ψ_rs = cross-risk-class correlation (6×6 matrix covering Rates, CreditQ, CreditNonQ, Equity, Commodity, FX).
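The two aggregation levels above can be sketched directly in NumPy. The weights and correlations below are toy numbers for illustration, not the calibrated SIMM v2.6 parameters, and the psi matrix is shown as 2×2 rather than the full 6×6.

```python
import numpy as np

def bucket_margin(ws, rho):
    """K_r = sqrt(sum_ij rho_ij * WS_i * WS_j) within one risk class."""
    return float(np.sqrt(ws @ rho @ ws))

def total_im(k, psi):
    """IM = sqrt(sum_rs psi_rs * K_r * K_s) across risk classes."""
    return float(np.sqrt(k @ psi @ k))

# Toy inputs (NOT the calibrated SIMM v2.6 weights/correlations)
s = np.array([1.0e6, -2.0e5, 3.0e5])     # raw sensitivities s_i
w = np.array([77.0, 64.0, 58.0])         # risk weights w_i
ws = w * s                               # weighted sensitivities WS_i = w_i * s_i
rho = np.array([[1.0, 0.7, 0.6],
                [0.7, 1.0, 0.8],
                [0.6, 0.8, 1.0]])        # intra-bucket correlations

K_rates = bucket_margin(ws, rho)

# Cross-risk-class aggregation with a toy 2x2 psi matrix
psi = np.array([[1.00, 0.27],
                [0.27, 1.00]])
im = total_im(np.array([K_rates, 0.5 * K_rates]), psi)
```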

Kernel Recording Mechanics

Record phase: with aadc.record_kernel() as funcs traces the computational graph of the SIMM formula.

Inputs: K aggregated sensitivities (one per risk factor bucket). Constants (weights, correlations) baked into the tape as Python floats.

Output: Total IM value, marked with .mark_as_output().

Evaluate phase: aadc.evaluate(funcs, request, inputs, workers) replays the tape with new input values.

Batching: Pass arrays of length P (one per portfolio) to compute all portfolios in a single dispatch call.
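The batching step can be illustrated with a plain-NumPy stand-in: the quadratic-form "kernel" and its parameters below are toy values (the real kernel is the recorded SIMM tape), but the pattern is the same — evaluate all P portfolios in one vectorized call instead of P separate dispatches.

```python
import numpy as np

# Toy IM kernel over K aggregated risk factors (illustrative parameters)
K, P = 12, 20
rho = np.full((K, K), 0.3)
np.fill_diagonal(rho, 1.0)
w = np.linspace(50.0, 100.0, K)
C = np.outer(w, w) * rho

rng = np.random.default_rng(3)
agg_S = rng.normal(size=(P, K)) * 100.0  # one row of sensitivities per portfolio

# Naive: one evaluation per portfolio, i.e. P separate dispatches
im_loop = np.array([np.sqrt(s @ C @ s) for s in agg_S])

# Batched: all P portfolios in one vectorized call, analogous to passing
# length-P input arrays to a single aadc.evaluate() dispatch
im_batch = np.sqrt(np.einsum('pi,ij,pj->p', agg_S, C, agg_S))
```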

C++ AADC Kernel Micro-Benchmarks

Per-trade pricing + full Greeks via AADC in C++ (100 trades, 4 threads):

| Trade Type | Kernel Inputs | Eval Time (per trade) | Recording Time |
|---|---|---|---|
| IR Swap | 12 | 0.57 μs | 10.1 ms |
| Equity Option | 14 | 0.29 μs | 6.8 ms |
| FX Option | 26 | 0.32 μs | 6.8 ms |
| Inflation Swap | 24 | 0.10 μs | 6.4 ms |
| XCCY Swap | 25 | 0.32 μs | 8.5 ms |

Supported Risk Classes and Measures

| Risk Class | Measures | Tenors/Buckets |
|---|---|---|
| Rates (IR) | Delta, Vega | 12 tenor buckets (2w – 30y) |
| Credit Qualifying | Delta, Vega, BaseCorr | 1–12 buckets |
| Credit Non-Qualifying | Delta, Vega | 1–2 buckets |
| Equity | Delta, Vega | 1–12 sectors |
| Commodity | Delta, Vega | 1–17 buckets |
| FX | Delta, Vega | By currency pair |

What Each Timing Means

AADC Kernel Recording (~25ms Python, ~200-400ms C++): One-time cost to trace the SIMM formula as a computational graph. Done once at start of day.

GPU JIT Compilation (seconds): One-time cost for Numba CUDA to compile Python kernels to GPU machine code on first invocation. Excluded from per-evaluation timings.

Kernel Evaluation (~1-10ms): Replay the recorded kernel with new inputs. Produces both IM value and exact gradients.

Gradient Computation (included in eval): AADC adjoint pass computes dIM/dSensitivity for all K risk factors simultaneously — O(K) cost.

Marginal IM Query (~6μs): After gradient is computed, each new trade's impact = dot product with gradient. Sub-millisecond.

Hardware: X13 8U GPU System · NVIDIA HGX H100 8-GPU · Dual 5th Gen Intel Xeon Platinum 8568Y+

MatLogica Benchmark · ISDA SIMM v2.6 open-source + added pricers · CPU + AADC vs GPU · All timings from actual runs

* Baseline = bump-and-revalue (no AADC or GPU). Measured at up to 200 trades on earlier hardware (Dual Intel Xeon, 112 cores); larger trade counts extrapolated linearly. Each gradient evaluation requires O(T×K) full recalculations, making interactive use infeasible at scale.

Frequently Asked Questions

How does AADC compare to GPU for SIMM calculations?

AADC C++ computes SIMM sensitivities at 34,000 evaluations/sec on CPU — 47× faster than GPU pathwise differentiation on NVIDIA H100. The key difference is that AADC computes exact adjoint gradients in a single backward pass, while GPU brute-force uses noisy finite-difference approximations that fail to converge at 500+ trades.

Why does GPU brute-force fail at scale?

GPU brute-force approximates gradients by bumping each risk factor and re-evaluating, producing noisy estimates. At 500+ trades, the noise overwhelms the signal and the optimizer wanders, hitting the iteration cap without converging. AADC and GPU pathwise both provide analytic gradients and converge in 1-2 iterations at all scales.

What is the AADC pipeline for SIMM?

The AADC pipeline has four steps: (1) Record the SIMM formula as a kernel with K ≈ 50 risk-factor inputs, (2) Batch evaluate all P portfolios in one aadc.evaluate() call, (3) Compute adjoint gradients for all K risk factors in one backward sweep, (4) Apply to attribution, pre-trade routing, or optimization.

What speedup does AADC provide for margin attribution?

For 100,000 trades, AADC computes full trade-level margin attribution in 73ms — a 15,618× speedup over naive leave-one-out recalculation which takes 19.1 minutes. The speedup grows with portfolio size because AADC cost is O(K) while leave-one-out is O(T × SIMM).

What hardware was used for the SIMM benchmark?

The benchmark ran on an X13 8U GPU System with NVIDIA HGX H100 8-GPU and Dual 5th Gen Intel Xeon Platinum 8568Y+. AADC runs on CPU only — it outperforms the H100 GPU by 47× for SIMM sensitivity calculations.

How does AADC handle pre-trade margin routing?

AADC pre-computes the gradient once, then each pre-trade query reduces to an O(K) dot product — marginal_IM = grad · s_new. This enables 823-893 queries/sec for 10K trade portfolios, compared to less than 1 query/sec for baseline bump-and-revalue.

What is the end-of-day pipeline performance?

For 10K IR swaps, the full EOD pipeline (CRIF generation + SIMM + gradients) takes 12.0s with AADC C++, 23.1s with AADC Python, 34.1s with GPU pathwise, and 22.2s with GPU brute-force. Baseline bump-and-revalue takes approximately 7 hours.

Can AADC SIMM integrate with existing risk systems?

Yes. AADC records your existing SIMM implementation as a kernel — no model rewrite needed. The kernel is O(K) where K ≈ 50-100 risk factors, independent of portfolio size. Kernel recording takes 200-400ms (one-time, start of day), then each evaluation takes 1-10ms.