Open Source xVA Benchmark

1770x Speedup for Valuation, 832x for AAD Risk


Introduction

This benchmark demonstrates how real-world problems can be solved using the AADC library to achieve top performance on Intel Xeon Scalable CPUs (Cascade Lake). The benchmark is purposely implemented in plain C++ for ease of understanding and to minimize runtime overheads from object abstractions.

The code that represents client analytics is designed and implemented with no upfront provision for multi-threading or vectorization. The result of this work demonstrates that applying MatLogica's technique to this code base achieves top performance on modern Xeon CPUs.

Previously, to get comparable performance, users would need extensive rework to port existing analytics to GPUs. Moreover, we demonstrate that computing sensitivities using the Automatic Differentiation method is also very efficient, and that problem size isn't constrained by available memory.

Benchmark Description and Configuration

We used simple and efficient analytics implemented as scalar single-threaded C++ which computes so-called Valuation Adjustments (xVA): calculations that are computationally demanding and required by the financial industry for daily operations. We chose to focus on financial applications; however, the technique is not limited to any specific domain and can be applied across a multitude of problems.

Interest Rate Model: Hull-White model with 1 IR curve
Projection Curves: 3 projection curves
Credit Curves: 2 curves (company and counterparty)
Portfolio: 100 50-year IR swaps with random future start dates
IR Curve Points: 250 interpolation points
Credit Curve Points: 140 interpolation points
Risk Outputs: 2564 sensitivities (CVA and DVA)

Hardware Configuration

CPU: Cascade Lake CLX-8280M with 56 cores (112 threads)
Memory: 192GB
AADC Version: MatLogica AADC library (5/15 release)
Compiler: ICC 19.1.1

Valuation Acceleration

In the provided source code, the user's analytics are assumed to be implemented by the xVAProblem class, running in single-threaded mode. The class uses a template type for all real values and can be instantiated with the native double type as well as the active type idouble.

Computations that use the native double instantiation set the performance baseline of the benchmark. Baseline execution times are measured on a single CPU core, relying on the compiler (Intel C++) to do all vectorization work.

Applying the operator-overloading technique with the idouble active type to the xVAProblem class allows AADC to extract the valuation graph and form binary instructions that replicate the user analytics at runtime.

How AADC Acceleration Works

  1. Record an AADC function for one Monte Carlo path
  2. Include evolution of all market objects, pricing of all trades at all future time points
  3. Compute CVA/DVA integrals with normal random variables as input
  4. Automatically create vectorized function to handle 4 (AVX2) or 8 (AVX512) samples in parallel
  5. Allocate memory per thread for safe multi-threaded execution

As a result, we have safely parallelized analytics that wasn't originally designed for multi-thread execution.

xVA Pricing Only (Multi-threaded)

Speedup with AADC relative to the single-threaded native-double baseline:

Threads                  1       2       4       8      16      28      56     112
AVX2 time (sec)      73.94   42.88   19.40   11.13    5.56    3.27    1.64    1.34
AVX-512 time (sec)   49.33   25.64   12.68    7.25    3.63    2.13    1.18    0.98
AVX2 speedup           23x     40x     89x    154x    310x    531x   1068x   1302x
AVX-512 speedup        35x     67x    135x    236x    473x    808x   1467x   1770x

Valuation and AAD Risk Acceleration

In this section, we demonstrate how AADC can be used to accelerate valuations and computation of sensitivities using Automatic Adjoint Differentiation. We showcase the performance of the AADC library against state-of-the-art AAD libraries—not the AAD method itself.

For this reason, we define our baseline as the time it takes for the open source AAD Adept library to calculate all AAD sensitivities using the original user analytics and a single CPU core. Running the original analytics and Adept in multi-thread mode would require careful design and programming, so for the baseline we resort to executing it in single-threaded mode only.

Technical Approach

The same operator overloading pattern with active idouble instance of the xVAProblem class can be used to form machine code instructions for the adjoint function that calculates sensitivities to the specified input variables.

We mark all model parameters and points on the market curves as variables we want sensitivities for using the .markAsDiff() method. This allows AADC to generate adjoint instructions only for the variables where sensitivities are actually needed, even though these variables are considered "constant" from the Monte Carlo path valuation point of view.

This is a very powerful optimization and can be configured at runtime for a specific set of required risks.

In order to produce the full Jacobian of the CVA and DVA output variables with respect to all model inputs, we need to execute two passes of the adjoint function, one seeded for each output. So each Monte Carlo path involves one valuation and two adjoint evaluations. Averaging over all simulations gives us the xVA values and their path-wise sensitivities.

xVA Pricing & Greeks (Multi-threaded)

AADC (valuation and AAD risk) relative to the single-threaded Adept baseline:

Threads                  1       2       4       8      16      24      48      96
AVX2 time (sec)     679.86  353.41  173.65   88.26   47.18   32.19   18.49   13.61
AVX-512 time (sec)  360.26  185.54   90.41   47.25   24.58   15.26   11.30   10.11
AVX2 speedup           12x     23x     48x     94x    176x    262x    455x    618x
AVX-512 speedup        23x     45x     93x    175x    338x    543x    737x    832x

Comparison to Commercial AAD Libraries

This benchmark uses Adept, an open-source tape-based AAD library, as the baseline comparison. NAG dco/c++, a commercial tape-based AAD solution, is not included because NAG does not publish benchmark code.

However, tape-based AAD libraries share similar architectural characteristics—storing computation traces in memory and replaying them for differentiation. The performance patterns demonstrated with Adept are representative of tape-based approaches generally.

AADC's code generation approach differs fundamentally: instead of storing a tape, it generates optimized machine code at runtime, enabling:

  • Thread-safe execution without synchronization overhead
  • AVX-512 vectorization for maximum throughput
  • Memory efficiency (no tape storage required)
  • Adjoint factor below 1 (the full set of derivatives can be computed faster than the value-only baseline)

Conclusion

In this paper we demonstrated how MatLogica's AADC library can be used to achieve breakthrough performance improvement for a computationally expensive xVA valuation and risk model on Intel AVX-512.

We used an approachable example in the finance sector as well as relatively simple C++ code to showcase the ease of integrating MatLogica AADC with an existing code base, introducing a novel way not only to improve performance but also to simplify development and support of computationally intensive analytics.

Full AVX-512 Utilization

By using MatLogica AADC, the benefits of Intel AVX-512 can be realized fully and top performance results can be achieved due to our patent-pending proprietary way to drive CPU execution flow.

Proven at Scale

For this benchmark we focused on a simple C++ example; however, this technique has been proven to work with large projects such as the popular open source QuantLib library in quantitative finance.

Cross-Industry Application

This solution can be extended to other industries with a need for high computational performance on problems that are usually associated with GPU.

Glossary

xVA
X-Value Adjustment - umbrella term for CVA, DVA, and other valuation adjustments
CSA
Credit Support Annex - collateral agreement terms
AVX-512
Advanced Vector Extensions 512-bit - Intel SIMD instruction set
AAD
Automatic Adjoint Differentiation - reverse-mode automatic differentiation
Operator Overloading
Technique to intercept operations for AD recording
Adept
Open source C++ AAD library for comparison


Frequently Asked Questions

What speedup does AADC achieve for xVA pricing?

AADC achieves up to 1770x speedup for xVA pricing only using 112 threads with AVX-512 on Intel Xeon Cascade Lake, compared to a single-threaded baseline.

How does AADC compare to tape-based AAD libraries like Adept?

AADC achieves 832x speedup for xVA pricing plus Greeks using 96 threads with AVX-512, compared to single-threaded Adept. This is possible because AADC generates optimized machine code that's inherently thread-safe, while tape-based libraries require careful redesign for multi-threading.

Does AADC require code changes for multi-threading?

No. AADC automatically generates thread-safe code. The original single-threaded analytics can be parallelized without any changes to the user's code. Memory is allocated per thread, allowing multiple threads to process Monte Carlo paths safely in parallel.

What's the difference between AVX2 and AVX-512 performance?

AVX-512 processes 8 samples in parallel versus 4 for AVX2, resulting in approximately 1.3-1.5x additional speedup. For xVA pricing, AVX-512 achieves 1770x vs 1302x for AVX2 at 112 threads.
