ISDA SIMM v2.6 · MatLogica Benchmark

SIMM Trade Allocation Optimisation
CPU + AADC vs GPU

MatLogica benchmark built on the ISDA-SIMM open-source project with added pricers. Compares CPU + AADC against GPU for gradient-based trade allocation optimization — AADC converges in 1–2 iterations, 25× faster than GPU.

- 33K/s evals/sec (AADC C++)
- 25× faster than GPU
- 19 ms for a 50K trade portfolio
- 1–2 iterations to converge
X13 8U GPU System · NVIDIA HGX H100 8-GPU · Dual 5th Gen Intel Xeon Platinum 8568Y+
Scenario: 5K IR, 100 groups

- 5,000 trades
- 100 netting sets
- 3 currencies
- 12 IR tenor buckets
- Asset classes: IR
AADC C++ Time: 30.5 ms (JumpStart Benchmark)
Greedy Refinement Convergence

Chart: total IM reduction per greedy round, with accepted vs. tried moves per data point (dimensions: trades, netting sets, IM reduction, trades moved, greedy rounds).

The Optimization Pipeline

From random portfolio to optimized allocation

1. Generate Portfolio: T trades × K risk factors; random initial allocation to P netting sets.
2. Record AADC Kernel: SIMM formula traced once; K inputs → 1 IM output + ∂IM/∂S.
3. Continuous Optimization: Adam/BFGS with exact gradients; soft allocation under simplex constraints.
4. Greedy Refinement: discrete rounding with IM-aware local search per trade.
5. Final Allocation: optimized in 1–2 iterations; trades moved to their optimal netting sets.
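Step 3's soft allocation under simplex constraints can be sketched with a softmax parameterisation. This is one common choice for keeping each trade's weights over netting sets positive and summing to 1; it is an illustration, not necessarily the parameterisation used in the benchmark.

```python
import numpy as np

def soft_allocation(logits):
    """Map unconstrained logits (T x P) to a row-stochastic soft
    allocation: each trade's weights over the P netting sets are
    positive and sum to 1 (the simplex constraint), via softmax."""
    z = logits - logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    w = np.exp(z)
    return w / w.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
x = soft_allocation(rng.normal(size=(5, 3)))   # 5 trades, 3 netting sets
assert np.allclose(x.sum(axis=1), 1.0)         # every row lies on the simplex
```

Because softmax is smooth, Adam/BFGS can optimize the underlying logits freely while the allocation itself always satisfies the constraint; step 4 then rounds each row to its largest weight.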

- AADC advantage: a single adjoint pass yields all T×P gradients.
- Convergence: 1–2 iterations with exact gradients.
- GPU BF problem: noisy gradients require 1,848 iterations.

Optimization Time: AADC vs GPU

Wall-clock time (log scale) — GPU BF marked ✘ where it fails to converge

* Baseline (bump & revalue) not shown: estimated optimization time >20 min at 1K trades, >3.5 hr at 5K trades — off chart scale. Each gradient evaluation requires O(T×K) bump-and-revalue pricings, making iterative optimization infeasible at scale.

Convergence: 5,000 Trades

GPU brute-force can eventually converge — but at what cost?

| | AADC C++ | AADC Python | GPU Pathwise | GPU BF (100 iters) | GPU BF (1,848 iters) | Baseline* (bump & revalue) |
|---|---|---|---|---|---|---|
| Iterations | 1 | 2 | 2 | 100 | 1,848 | N/A |
| Wall-clock time | 268 ms | 6.91 s | 12.6 s | 3.2 s | 56–86 s | N/A |
| Evals/sec | 34,012 | 1,320 | 721 | 31 | 31 | <0.1 |
| IM reduction | 7.00% | 7.00% | 7.0% | 7.04% | 7.16% | N/A |
| Converged | ✓ | ✓ | ✓ | ✘ | ✓ | infeasible |

AADC C++ converges in 1 iteration and 268 ms — 34,012 evals/sec throughput. AADC Python converges in 2 iterations and 6.91 s — Python wrapper adds ~26× overhead. GPU Pathwise converges in 2 iterations and 12.6 s — analytic gradients work but 47× slower throughput. GPU Brute-Force needs 1,848 iterations and 56–86 seconds to converge — noisy finite-difference gradients. Baseline* computes gradients via O(T×K) bump-and-revalue per iteration — each gradient evaluation takes >10s at 5K trades, making iterative optimization infeasible.

Root cause: the Adam optimizer needs the gradient of total IM with respect to every trade-to-netting-set allocation weight. AADC computes this exactly via automatic differentiation: one adjoint pass produces all gradients simultaneously. GPU brute-force approximates gradients by bumping each risk factor and re-evaluating, producing noisy estimates. At 500+ trades, the noise overwhelms the signal and the optimizer wanders, hitting the iteration cap without converging.
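A toy example illustrates why bump-and-revalue gradients degrade: a central difference divides any evaluation noise by 2h, so even tiny pricing noise swamps the gradient signal. This is a hypothetical objective for illustration, not the benchmark code.

```python
import numpy as np

rng = np.random.default_rng(42)

def f(x, noise=0.0):
    # Stand-in for one bumped SIMM revaluation; `noise` mimics
    # evaluation noise (e.g. Monte Carlo pricing error).
    return float(np.sum(x ** 2)) + noise * rng.normal()

def fd_grad(x, h, noise):
    # Central finite-difference gradient: evaluation noise is amplified by 1/(2h).
    g = np.empty_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e, noise) - f(x - e, noise)) / (2 * h)
    return g

x = np.ones(10)
exact = 2 * x                                    # analytic gradient of sum(x^2)
err_clean = np.max(np.abs(fd_grad(x, 1e-4, 0.0) - exact))
err_noisy = np.max(np.abs(fd_grad(x, 1e-4, 1e-6) - exact))
# even 1e-6 evaluation noise inflates the gradient error by orders of magnitude
```

An adjoint pass has no bump parameter h, so there is no noise amplification: the gradient is exact to machine precision, which is why the optimizer converges in 1–2 iterations.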

GPU pathwise note: GPU pathwise also provides analytic gradients and converges in 2 iterations at all scales — but at 47× lower throughput than AADC C++. Where analytic gradients are available on GPU, the quality issue disappears; only the speed gap remains.

How AADC Enables Real-Time Optimization

Trade allocation optimization minimizes total IM by reallocating trades across P netting sets. The discrete search space has P^T possible allocations, but AADC makes gradient-based optimization feasible:

// The gradient w.r.t. allocation (T × P matrix):
// ∂total_IM / ∂x[t,p] = Σ_k (∂IM_p / ∂S_p[k]) × S[t,k]

// AADC computes ∂IM/∂S for all P portfolios in ONE call
// Chain rule via numpy: gradient = S @ dIM_dS.T
// Cost: O(T×K×P) numpy ops + 1 AADC call (~3–30 ms)
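The chain rule above can be made concrete in numpy. Shapes are illustrative, and `dIM_dS` stands in for the adjoint output (∂IM_p/∂S_p[k]) that a single AADC call would return.

```python
import numpy as np

T, K, P = 6, 4, 3                  # trades, risk factors, netting sets (toy sizes)
rng = np.random.default_rng(1)
S = rng.normal(size=(T, K))        # per-trade sensitivities S[t, k]
dIM_dS = rng.normal(size=(P, K))   # adjoint output dIM_p/dS_p[k] from one AADC call

# Chain rule: d(total_IM)/dx[t, p] = sum_k dIM_p/dS_p[k] * S[t, k]
grad = S @ dIM_dS.T                # all T x P allocation gradients at once

assert grad.shape == (T, P)
# spot-check one entry against the explicit sum
assert np.isclose(grad[2, 1], np.sum(dIM_dS[1] * S[2]))
```

One matrix multiply turns P adjoint vectors into the full T×P gradient, which is why the per-iteration cost stays at O(T×K×P) numpy work plus a single kernel call.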

Three-phase approach: (1) Continuous relaxation with simplex constraints, (2) Projected gradient descent / Adam / BFGS, (3) Greedy rounding to discrete allocation with IM-aware local search.
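Phase 3's greedy rounding can be sketched as follows. The `im_delta` callback is hypothetical: it stands in for evaluating the IM impact of placing trade t in netting set p via the recorded kernel.

```python
import numpy as np

def greedy_round(x_soft, im_delta):
    """Round a soft allocation (T x P) to hard assignments, then do one
    pass of IM-aware local search: move a trade whenever another netting
    set lowers its IM cost. `im_delta(t, p)` is a hypothetical callback
    returning the marginal IM cost of placing trade t in netting set p."""
    assign = np.argmax(x_soft, axis=1)           # discrete rounding
    for t in range(len(assign)):
        deltas = [im_delta(t, p) for p in range(x_soft.shape[1])]
        best = int(np.argmin(deltas))
        if deltas[best] < deltas[assign[t]]:     # accept only improving moves
            assign[t] = best
    return assign

# toy cost: trade t is cheapest in netting set t % 3
x_soft = np.array([[0.9, 0.05, 0.05],
                   [0.1, 0.8, 0.1],
                   [0.6, 0.3, 0.1],
                   [0.2, 0.7, 0.1]])
assign = greedy_round(x_soft, lambda t, p: abs(p - t % 3))
# every trade lands in its cheapest set: [0, 1, 2, 0]
```

In the benchmark the local search repeats over several greedy rounds until no improving move remains, which is what the convergence chart tracks.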

MatLogica Benchmark · ISDA SIMM v2.6 open-source + added pricers · CPU + AADC vs GPU · All timings from actual runs

* Baseline = bump-and-revalue (no AADC or GPU). Measured at up to 200 trades on earlier hardware (Dual Intel Xeon, 112 cores); larger trade counts extrapolated linearly. Each gradient evaluation requires O(T×K) full pricings, making iterative optimization infeasible at scale.

Frequently Asked Questions (8)

What initial margin reduction can AADC achieve?

AADC achieves 20-48% initial margin reduction across all portfolio scales tested, from 20 trades to 36K multi-asset trades. The reduction comes from gradient-based optimization that efficiently reallocates trades across netting sets to minimize total ISDA SIMM.

How fast is AADC compared to GPU for SIMM optimization?

AADC C++ achieves 33,580 evaluations/sec — 25× faster than GPU. AADC C++ converges in 1 iteration and 268 ms for 5,000 trades, while GPU brute-force needs 1,848 iterations and 56-86 seconds. GPU pathwise converges in 2 iterations but at 47× lower throughput.

How many iterations does AADC need to converge?

AADC C++ converges in 1 iteration because it computes exact adjoint gradients. AADC Python and GPU pathwise converge in 2 iterations. GPU brute-force requires 1,848 iterations due to noisy finite-difference gradients, and fails to converge at 500+ trades.

What portfolio sizes does SIMM optimization support?

The benchmark demonstrates optimization from 20 trades to 36K multi-asset trades across 3-100 netting sets. AADC C++ optimizes a 50K trade portfolio in 19 ms. Performance scales linearly with portfolio size.

How does the optimization pipeline work?

The five-phase pipeline: (1) Generate portfolio with random allocation, (2) Record AADC kernel tracing the SIMM formula, (3) Continuous optimization with Adam/BFGS and simplex constraints, (4) Greedy refinement with discrete rounding, (5) Final optimized allocation. AADC computes all T×P allocation gradients in a single adjoint pass.

Why does GPU brute-force fail to converge?

GPU brute-force approximates gradients by bumping each risk factor and re-evaluating, producing noisy estimates. At 500+ trades, the noise overwhelms the optimization signal and the optimizer wanders without converging. AADC and GPU pathwise provide exact analytic gradients and converge reliably at all scales.

What asset classes are supported?

The benchmark supports IR, FX, Equity, Inflation, and Cross-Currency (XCCY) swaps. Multi-asset portfolios up to 36K trades have been tested with 3 currencies and full ISDA SIMM v2.6 granularity (12 IR tenor buckets, intra-bucket correlations, concentration thresholds).

How does baseline bump-and-revalue compare?

Baseline bump-and-revalue is infeasible for optimization — estimated >20 minutes at 1K trades and >3.5 hours at 5K trades. Each gradient evaluation requires O(T×K) full pricings, making iterative optimization impossible at production scale.