MatLogica benchmark built on the ISDA-SIMM open-source project with added pricers. Compares CPU + AADC against GPU for gradient-based trade allocation optimization — AADC converges in 1–2 iterations, 25× faster than GPU.
Total IM reduction per round — hover for accepted/tried moves
From random portfolio to optimized allocation
T trades × K risk factors
Random initial allocation to P netting sets
SIMM formula traced once
K inputs → 1 IM output + ∂IM/∂S
Adam/BFGS with exact gradients
Soft allocation → simplex constraints
Discrete rounding
IM-aware local search per trade
Optimized in 1–2 iterations
Trades moved to optimal netting sets
Wall-clock time (log scale) — GPU BF marked ✘ where it fails to converge
* Baseline (bump & revalue) not shown: estimated optimization time >20 min at 1K trades, >3.5 hr at 5K trades — off chart scale. Each gradient evaluation requires O(T×K) bump-and-revalue pricings, making iterative optimization infeasible at scale.
GPU brute-force can eventually converge — but at what cost?
| | AADC C++ | AADC Python | GPU Pathwise | GPU BF (100 iters) | GPU BF (1,848 iters) | Baseline* (bump & revalue) |
|---|---|---|---|---|---|---|
| Iterations | 1 | 2 | 2 | 100 | 1,848 | N/A |
| Wall-clock time | 268 ms | 6.91 s | 12.6 s | 3.2 s | 56–86 s | N/A |
| Evals/sec | 34,012 | 1,320 | 721 | 31 | 31 | <0.1 |
| IM Reduction | 7.00% | 7.00% | 7.00% | 7.04% | 7.16% | N/A |
| Converged | ✔ | ✔ | ✔ | ✘ | ✔ | ✘ infeasible |
AADC C++ converges in 1 iteration and 268 ms — 34,012 evals/sec throughput. AADC Python converges in 2 iterations and 6.91 s — Python wrapper adds ~26× overhead. GPU Pathwise converges in 2 iterations and 12.6 s — analytic gradients work but 47× slower throughput. GPU Brute-Force needs 1,848 iterations and 56–86 seconds to converge — noisy finite-difference gradients. Baseline* computes gradients via O(T×K) bump-and-revalue per iteration — each gradient evaluation takes >10s at 5K trades, making iterative optimization infeasible.
Root cause: The Adam optimizer needs the gradient of total IM with respect to every trade-portfolio allocation. AADC computes this exactly via automatic differentiation: one adjoint pass produces all gradients simultaneously. GPU brute-force approximates gradients by bumping each risk factor and re-evaluating, producing noisy estimates. At 500+ trades, the noise overwhelms the signal and the optimizer wanders, hitting the iteration cap without converging.
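The adjoint-vs-bumping contrast above can be sketched on a toy aggregation function. This is not the actual SIMM formula or the AADC API, just a minimal stand-in: one reverse pass yields every partial ∂IM/∂S_k at roughly the cost of a single evaluation, while bump-and-revalue needs K+1 full evaluations for the same (noisier) gradient.

```python
import math

def im(S, w):
    """Toy SIMM-like aggregation: IM = sqrt(sum_k w_k * S_k^2)."""
    return math.sqrt(sum(wk * sk * sk for wk, sk in zip(w, S)))

def im_with_adjoints(S, w):
    """One forward + one reverse pass: IM and dIM/dS_k for every k at once."""
    im_val = im(S, w)
    # Reverse pass: d(sqrt(u))/du = 1/(2*sqrt(u)), du/dS_k = 2*w_k*S_k,
    # so dIM/dS_k = w_k * S_k / IM.
    grad = [wk * sk / im_val for wk, sk in zip(w, S)]
    return im_val, grad

def im_bumped_gradient(S, w, h=1e-6):
    """Bump-and-revalue: K+1 full evaluations for the same gradient."""
    base = im(S, w)
    grad = []
    for k in range(len(S)):
        bumped = list(S)
        bumped[k] += h
        grad.append((im(bumped, w) - base) / h)
    return base, grad
```

The adjoint gradient here is exact; the bumped gradient carries truncation error that, in a real pricer with numerical noise, grows into the convergence failures seen in the GPU brute-force column.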
GPU pathwise note: GPU pathwise also provides analytic gradients and converges in 2 iterations at all scales — but at 47× lower throughput than AADC C++. Where analytic gradients are available on GPU, the quality issue disappears; only the speed gap remains.
Trade allocation optimization minimizes total IM by reallocating trades across P netting sets. The discrete search space has P^T possible allocations, but AADC makes gradient-based optimization feasible:
Three-phase approach: (1) Continuous relaxation with simplex constraints, (2) Projected gradient descent / Adam / BFGS, (3) Greedy rounding to discrete allocation with IM-aware local search.
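Phases (1) and (2) can be sketched as follows. This is an illustrative sketch, not MatLogica's implementation: each trade's soft allocation is a row of a T×P matrix constrained to the probability simplex, and each gradient step is followed by a Euclidean projection back onto the simplex (the standard sort-based projection algorithm).

```python
import numpy as np

def project_simplex(x):
    """Euclidean projection of each row of x onto the probability simplex
    (sort-based algorithm): every row becomes non-negative and sums to 1."""
    n = x.shape[1]
    u = np.sort(x, axis=1)[:, ::-1]            # sort each row descending
    css = np.cumsum(u, axis=1) - 1.0
    ind = np.arange(1, n + 1)
    rho = ((u - css / ind) > 0).sum(axis=1)    # support size per row
    theta = css[np.arange(x.shape[0]), rho - 1] / rho
    return np.maximum(x - theta[:, None], 0.0)

def projected_gradient_step(A, grad, lr=0.1):
    """One projected-gradient step on the soft allocation matrix A (T x P),
    using the exact dIM/dA gradient from the adjoint pass."""
    return project_simplex(A - lr * grad)
```

Phase (3) then rounds each row to its largest entry (`A.argmax(axis=1)`) before the IM-aware local search.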
* Baseline = bump-and-revalue (no AADC or GPU). Measured at up to 200 trades on earlier hardware (Dual Intel Xeon, 112 cores); larger trade counts extrapolated linearly. Each gradient evaluation requires O(T×K) full pricings, making iterative optimization infeasible at scale.
AADC achieves a 20–48% initial margin reduction across all portfolio scales tested, from 20 trades to 36K multi-asset trades. The reduction comes from gradient-based optimization that efficiently reallocates trades across netting sets to minimize total ISDA SIMM initial margin.
AADC C++ achieves 34,012 evaluations/sec, 25× faster than GPU. AADC C++ converges in 1 iteration and 268 ms for 5,000 trades, while GPU brute-force needs 1,848 iterations and 56–86 seconds. GPU pathwise converges in 2 iterations but at 47× lower throughput.
AADC C++ converges in 1 iteration because it computes exact adjoint gradients. AADC Python and GPU pathwise converge in 2 iterations. GPU brute-force requires 1,848 iterations due to noisy finite-difference gradients, and fails to converge at 500+ trades.
The benchmark demonstrates optimization from 20 trades to 36K multi-asset trades across 3-100 netting sets. AADC C++ optimizes a 50K trade portfolio in 19 ms. Performance scales linearly with portfolio size.
The five-phase pipeline: (1) Generate portfolio with random allocation, (2) Record AADC kernel tracing the SIMM formula, (3) Continuous optimization with Adam/BFGS and simplex constraints, (4) Greedy refinement with discrete rounding, (5) Final optimized allocation. AADC computes all T×P allocation gradients in a single adjoint pass.
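Phase (4), the greedy refinement, can be sketched with a toy netting-set IM (a stand-in for the full SIMM formula, not MatLogica's code): each trade is tried in every netting set and kept where total IM is lowest, sweeping until no single-trade move helps.

```python
import math, random

def total_im(alloc, sens, P):
    """Toy netting-set IM: sum over netting sets of sqrt(sum of squared
    aggregated sensitivities). Stands in for the full SIMM aggregation."""
    K = len(sens[0])
    agg = [[0.0] * K for _ in range(P)]
    for t, p in enumerate(alloc):
        for k in range(K):
            agg[p][k] += sens[t][k]
    return sum(math.sqrt(sum(s * s for s in a)) for a in agg)

def greedy_local_search(alloc, sens, P):
    """IM-aware local search: move each trade to the netting set that most
    reduces total IM; repeat sweeps until no single-trade move improves it.
    Costs O(T * P) IM evaluations per sweep."""
    improved = True
    while improved:
        improved = False
        for t in range(len(alloc)):
            best_p, best_im = alloc[t], total_im(alloc, sens, P)
            for p in range(P):
                alloc[t] = p
                im_p = total_im(alloc, sens, P)
                if im_p < best_im - 1e-12:
                    best_p, best_im = p, im_p
                    improved = True
            alloc[t] = best_p
    return alloc
```

Because each accepted move strictly decreases total IM over a finite state space, the sweep terminates; seeding it with the rounded continuous solution rather than a random allocation is what keeps the number of sweeps small.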
GPU brute-force approximates gradients by bumping each risk factor and re-evaluating, producing noisy estimates. At 500+ trades, the noise overwhelms the optimization signal and the optimizer wanders without converging. AADC and GPU pathwise provide exact analytic gradients and converge reliably at all scales.
The benchmark supports IR, FX, Equity, Inflation, and Cross-Currency (XCCY) swaps. Multi-asset portfolios up to 36K trades have been tested with 3 currencies and full ISDA SIMM v2.6 granularity (12 IR tenor buckets, intra-bucket correlations, concentration thresholds).
Baseline bump-and-revalue is infeasible for optimization — estimated >20 minutes at 1K trades and >3.5 hours at 5K trades. Each gradient evaluation requires O(T×K) full pricings, making iterative optimization impossible at production scale.
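A back-of-envelope calculation shows why the O(T×K) cost is prohibitive. The factor count and per-pricing time below are illustrative assumptions, not benchmark measurements:

```python
# Cost of one finite-difference gradient via bump-and-revalue.
T = 5_000          # trades
K = 24             # risk factors bumped per trade (assumed)
pricing_ms = 1.0   # assumed cost of one full portfolio pricing, in ms

grad_seconds = T * K * pricing_ms / 1000.0
print(f"one bump-and-revalue gradient: ~{grad_seconds:.0f} s")
# An adjoint pass, by contrast, costs a small constant multiple of a single
# pricing, independent of how many sensitivities are needed.
```

Multiply that per-gradient cost by the dozens to hundreds of optimizer iterations and the hours-long estimates above follow directly.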