ISDA SIMM v2.6 · MatLogica Benchmark

SIMM Trade Allocation Optimisation
CPU + AADC vs GPU

MatLogica benchmark built on the ISDA-SIMM open-source project with added pricers. Compares CPU + AADC against GPU for gradient-based trade allocation optimization — AADC converges in 1–2 iterations, 25× faster than GPU.

- 33K/s evals/sec (AADC C++)
- 25× faster than GPU
- 19 ms for a 50K trade portfolio
- 1–2 iterations to converge
X13 8U GPU System · NVIDIA HGX H100 8-GPU · Dual 5th Gen Intel Xeon Platinum 8568Y+
Scenario: 5K IR, 100 groups

- 5,000 trades
- 100 netting sets
- 3 currencies
- 12 IR tenor buckets
- Asset classes: IR
AADC C++ Time: 30.5 ms (JumpStart Benchmark)
Greedy Refinement Convergence

Chart: total IM reduction per greedy round, with accepted vs. tried moves per data point (dimensions: trades, netting sets, IM reduction, trades moved, greedy rounds).

The Optimization Pipeline

From random portfolio to optimized allocation

1. Generate Portfolio: T trades × K risk factors; random initial allocation to P netting sets.
2. Record AADC Kernel: SIMM formula traced once; K inputs → 1 IM output + ∂IM/∂S.
3. Continuous Optimization: Adam/BFGS with exact gradients; soft allocation under simplex constraints.
4. Greedy Refinement: discrete rounding with IM-aware local search per trade.
5. Final Allocation: optimized in 1–2 iterations; trades moved to their optimal netting sets.
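Step 3's soft allocation under simplex constraints can be sketched with a softmax parameterisation. This is one common choice for keeping each trade's weights over netting sets positive and summing to 1; it is an illustration, not necessarily the parameterisation used in the benchmark.

```python
import numpy as np

def soft_allocation(logits):
    """Map unconstrained logits (T x P) to a row-stochastic soft
    allocation: each trade's weights over the P netting sets are
    positive and sum to 1 (the simplex constraint), via softmax."""
    z = logits - logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    w = np.exp(z)
    return w / w.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
x = soft_allocation(rng.normal(size=(5, 3)))   # 5 trades, 3 netting sets
assert np.allclose(x.sum(axis=1), 1.0)         # every row lies on the simplex
```

Because softmax is smooth, Adam/BFGS can optimize the underlying logits freely while the allocation itself always satisfies the constraint; step 4 then rounds each row to its largest weight.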

- AADC advantage: a single adjoint pass yields all T×P gradients.
- Convergence: 1–2 iterations with exact gradients.
- GPU BF problem: noisy gradients require 1,848 iterations.

Optimization Time: AADC vs GPU

Wall-clock time (log scale) — GPU BF marked ✘ where it fails to converge

* Baseline (bump & revalue) not shown: estimated optimization time >20 min at 1K trades, >3.5 hr at 5K trades — off chart scale. Each gradient evaluation requires O(T×K) bump-and-revalue pricings, making iterative optimization infeasible at scale.

Convergence: 5,000 Trades

GPU brute-force can eventually converge — but at what cost?

| | AADC C++ | AADC Python | GPU Pathwise | GPU BF (100 iters) | GPU BF (1,848 iters) | Baseline* (bump & revalue) |
|---|---|---|---|---|---|---|
| Iterations | 1 | 2 | 2 | 100 | 1,848 | N/A |
| Wall-clock time | 268 ms | 6.91 s | 12.6 s | 3.2 s | 56–86 s | N/A |
| Evals/sec | 34,012 | 1,320 | 721 | 31 | 31 | <0.1 |
| IM reduction | 7.00% | 7.00% | 7.0% | 7.04% | 7.16% | N/A |
| Converged | ✓ | ✓ | ✓ | ✘ | ✓ | infeasible |

AADC C++ converges in 1 iteration and 268 ms — 34,012 evals/sec throughput. AADC Python converges in 2 iterations and 6.91 s — Python wrapper adds ~26× overhead. GPU Pathwise converges in 2 iterations and 12.6 s — analytic gradients work but 47× slower throughput. GPU Brute-Force needs 1,848 iterations and 56–86 seconds to converge — noisy finite-difference gradients. Baseline* computes gradients via O(T×K) bump-and-revalue per iteration — each gradient evaluation takes >10s at 5K trades, making iterative optimization infeasible.

Root cause: the Adam optimizer needs the gradient of total IM with respect to every trade-to-netting-set allocation weight. AADC computes this exactly via automatic differentiation: one adjoint pass produces all gradients simultaneously. GPU brute-force approximates gradients by bumping each risk factor and re-evaluating, producing noisy estimates. At 500+ trades, the noise overwhelms the signal and the optimizer wanders, hitting the iteration cap without converging.
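A toy example illustrates why bump-and-revalue gradients degrade: a central difference divides any evaluation noise by 2h, so even tiny pricing noise swamps the gradient signal. This is a hypothetical objective for illustration, not the benchmark code.

```python
import numpy as np

rng = np.random.default_rng(42)

def f(x, noise=0.0):
    # Stand-in for one bumped SIMM revaluation; `noise` mimics
    # evaluation noise (e.g. Monte Carlo pricing error).
    return float(np.sum(x ** 2)) + noise * rng.normal()

def fd_grad(x, h, noise):
    # Central finite-difference gradient: evaluation noise is amplified by 1/(2h).
    g = np.empty_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e, noise) - f(x - e, noise)) / (2 * h)
    return g

x = np.ones(10)
exact = 2 * x                                    # analytic gradient of sum(x^2)
err_clean = np.max(np.abs(fd_grad(x, 1e-4, 0.0) - exact))
err_noisy = np.max(np.abs(fd_grad(x, 1e-4, 1e-6) - exact))
# even 1e-6 evaluation noise inflates the gradient error by orders of magnitude
```

An adjoint pass has no bump parameter h, so there is no noise amplification: the gradient is exact to machine precision, which is why the optimizer converges in 1–2 iterations.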

GPU pathwise note: GPU pathwise also provides analytic gradients and converges in 2 iterations at all scales — but at 47× lower throughput than AADC C++. Where analytic gradients are available on GPU, the quality issue disappears; only the speed gap remains.

How AADC Enables Real-Time Optimization

Trade allocation optimization minimizes total IM by reallocating trades across P netting sets. The discrete search space has P^T possible allocations, but AADC makes gradient-based optimization feasible:

// The gradient w.r.t. allocation (T × P matrix):
// ∂total_IM / ∂x[t,p] = Σ_k (∂IM_p / ∂S_p[k]) × S[t,k]

// AADC computes ∂IM/∂S for all P portfolios in ONE call
// Chain rule via numpy: gradient = S @ dIM_dS.T
// Cost: O(T×K×P) numpy ops + 1 AADC call (~3–30 ms)
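The chain rule above can be made concrete in numpy. Shapes are illustrative, and `dIM_dS` stands in for the adjoint output (∂IM_p/∂S_p[k]) that a single AADC call would return.

```python
import numpy as np

T, K, P = 6, 4, 3                  # trades, risk factors, netting sets (toy sizes)
rng = np.random.default_rng(1)
S = rng.normal(size=(T, K))        # per-trade sensitivities S[t, k]
dIM_dS = rng.normal(size=(P, K))   # adjoint output dIM_p/dS_p[k] from one AADC call

# Chain rule: d(total_IM)/dx[t, p] = sum_k dIM_p/dS_p[k] * S[t, k]
grad = S @ dIM_dS.T                # all T x P allocation gradients at once

assert grad.shape == (T, P)
# spot-check one entry against the explicit sum
assert np.isclose(grad[2, 1], np.sum(dIM_dS[1] * S[2]))
```

One matrix multiply turns P adjoint vectors into the full T×P gradient, which is why the per-iteration cost stays at O(T×K×P) numpy work plus a single kernel call.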

Three-phase approach: (1) Continuous relaxation with simplex constraints, (2) Projected gradient descent / Adam / BFGS, (3) Greedy rounding to discrete allocation with IM-aware local search.
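Phase 3's greedy rounding can be sketched as follows. The `im_delta` callback is hypothetical: it stands in for evaluating the IM impact of placing trade t in netting set p via the recorded kernel.

```python
import numpy as np

def greedy_round(x_soft, im_delta):
    """Round a soft allocation (T x P) to hard assignments, then do one
    pass of IM-aware local search: move a trade whenever another netting
    set lowers its IM cost. `im_delta(t, p)` is a hypothetical callback
    returning the marginal IM cost of placing trade t in netting set p."""
    assign = np.argmax(x_soft, axis=1)           # discrete rounding
    for t in range(len(assign)):
        deltas = [im_delta(t, p) for p in range(x_soft.shape[1])]
        best = int(np.argmin(deltas))
        if deltas[best] < deltas[assign[t]]:     # accept only improving moves
            assign[t] = best
    return assign

# toy cost: trade t is cheapest in netting set t % 3
x_soft = np.array([[0.9, 0.05, 0.05],
                   [0.1, 0.8, 0.1],
                   [0.6, 0.3, 0.1],
                   [0.2, 0.7, 0.1]])
assign = greedy_round(x_soft, lambda t, p: abs(p - t % 3))
# every trade lands in its cheapest set: [0, 1, 2, 0]
```

In the benchmark the local search repeats over several greedy rounds until no improving move remains, which is what the convergence chart tracks.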

MatLogica Benchmark · ISDA SIMM v2.6 open-source + added pricers · CPU + AADC vs GPU · All timings from actual runs

* Baseline = bump-and-revalue (no AADC or GPU). Measured at up to 200 trades on earlier hardware (Dual Intel Xeon, 112 cores); larger trade counts extrapolated linearly. Each gradient evaluation requires O(T×K) full pricings, making iterative optimization infeasible at scale.

Frequently Asked Questions (8)

What initial margin reduction can AADC achieve?

AADC achieves 20-48% initial margin reduction across all portfolio scales tested, from 20 trades to 36K multi-asset trades. The reduction comes from gradient-based optimization that efficiently reallocates trades across netting sets to minimize total ISDA SIMM.

How fast is AADC compared to GPU for SIMM optimization?

AADC C++ achieves 33,580 evaluations/sec — 25× faster than GPU. AADC C++ converges in 1 iteration and 268 ms for 5,000 trades, while GPU brute-force needs 1,848 iterations and 56-86 seconds. GPU pathwise converges in 2 iterations but at 47× lower throughput.

How many iterations does AADC need to converge?

AADC C++ converges in 1 iteration because it computes exact adjoint gradients. AADC Python and GPU pathwise converge in 2 iterations. GPU brute-force requires 1,848 iterations due to noisy finite-difference gradients, and fails to converge at 500+ trades.

What portfolio sizes does SIMM optimization support?

The benchmark demonstrates optimization from 20 trades to 36K multi-asset trades across 3-100 netting sets. AADC C++ optimizes a 50K trade portfolio in 19 ms. Performance scales linearly with portfolio size.

How does the optimization pipeline work?

The five-phase pipeline: (1) Generate portfolio with random allocation, (2) Record AADC kernel tracing the SIMM formula, (3) Continuous optimization with Adam/BFGS and simplex constraints, (4) Greedy refinement with discrete rounding, (5) Final optimized allocation. AADC computes all T×P allocation gradients in a single adjoint pass.

Why does GPU brute-force fail to converge?

GPU brute-force approximates gradients by bumping each risk factor and re-evaluating, producing noisy estimates. At 500+ trades, the noise overwhelms the optimization signal and the optimizer wanders without converging. AADC and GPU pathwise provide exact analytic gradients and converge reliably at all scales.

What asset classes are supported?

The benchmark supports IR, FX, Equity, Inflation, and Cross-Currency (XCCY) swaps. Multi-asset portfolios up to 36K trades have been tested with 3 currencies and full ISDA SIMM v2.6 granularity (12 IR tenor buckets, intra-bucket correlations, concentration thresholds).

How does baseline bump-and-revalue compare?

Baseline bump-and-revalue is infeasible for optimization — estimated >20 minutes at 1K trades and >3.5 hours at 5K trades. Each gradient evaluation requires O(T×K) full pricings, making iterative optimization impossible at production scale.