Open Source xVA Benchmark

1770x Speedup for Valuation, 832x for AAD Risk


Introduction

This benchmark demonstrates how real-world problems can be solved using the AADC library to achieve top performance on Intel Xeon Scalable CPUs (Cascade Lake). The benchmark is purposely implemented in plain C++ for ease of understanding and to minimize runtime overheads from object abstractions.

The code that represents client analytics is designed and implemented with no upfront provision for multi-threading or vectorization. The result of this work demonstrates that applying MatLogica's technique to this code base achieves top performance on modern Xeon CPUs.

Previously, to get comparable performance, users would need extensive rework to port existing analytics to GPUs. Moreover, we demonstrate that computing sensitivities using the Automatic Differentiation method is also very efficient, and that problem size isn't constrained by available memory.

Benchmark Description and Configuration

We used simple and efficient analytics implemented as scalar single-threaded C++ which computes so-called Valuation Adjustments (xVA): calculations that are computationally demanding and required by the financial industry for daily operations. We chose to focus on financial applications; however, the technique is not limited to any specific domain and can be applied across a multitude of problems.

Interest Rate Model: Hull-White model with 1 IR curve
Projection Curves: 3 projection curves
Credit Curves: 2 curves (company and counterparty)
Portfolio: 100 50-year IR swaps with random future start dates
IR Curve Points: 250 interpolation points
Credit Curve Points: 140 interpolation points
Risk Outputs: 2564 sensitivities (CVA and DVA)

Hardware Configuration

CPU: Cascade Lake CLX-8280M with 56 cores (112 threads)
Memory: 192GB
AADC Version: MatLogica AADC library (5/15 release)
Compiler: ICC 19.1.1

Valuation Acceleration

In the provided source code, the user's analytics are assumed to be implemented by the xVAProblem class, running in single-threaded mode. The class uses a template type for all real values and can be instantiated with the native double type as well as the active type idouble.

Computations that use the native double instantiation set the performance baseline of the benchmark. Baseline execution times are measured on a single CPU core, relying on the compiler (Intel C++) to do all vectorization work.

Applying the operator-overloading technique with the idouble active type to the xVAProblem class allows AADC to extract the valuation graph and form binary instructions that replicate the user analytics at runtime.

How AADC Acceleration Works

  1. Record an AADC function for one Monte Carlo path
  2. Include evolution of all market objects, pricing of all trades at all future time points
  3. Compute CVA/DVA integrals with normal random variables as input
  4. Automatically create vectorized function to handle 4 (AVX2) or 8 (AVX512) samples in parallel
  5. Allocate memory per thread for safe multi-threaded execution

As a result, we have safely parallelized analytics that wasn't originally designed for multi-thread execution.

xVA Pricing Only (Multi-threaded)

Speedup with AADC relative to the single-threaded native-double baseline:

Threads                  1       2       4       8      16      28      56     112
AVX2 time (sec)      73.94   42.88   19.40   11.13    5.56    3.27    1.64    1.34
AVX-512 time (sec)   49.33   25.64   12.68    7.25    3.63    2.13    1.18    0.98
AVX2 speedup           23x     40x     89x    154x    310x    531x   1068x   1302x
AVX-512 speedup        35x     67x    135x    236x    473x    808x   1467x   1770x

Valuation and AAD Risk Acceleration

In this section, we demonstrate how AADC can be used to accelerate valuations and computation of sensitivities using Automatic Adjoint Differentiation. We showcase the performance of the AADC library against state-of-the-art AAD libraries—not the AAD method itself.

For this reason, we define our baseline as the time it takes for the open source AAD Adept library to calculate all AAD sensitivities using the original user analytics and a single CPU core. Running the original analytics and Adept in multi-thread mode would require careful design and programming, so for the baseline we resort to executing it in single-threaded mode only.

Technical Approach

The same operator overloading pattern with active idouble instance of the xVAProblem class can be used to form machine code instructions for the adjoint function that calculates sensitivities to the specified input variables.

We mark all model parameters and points on the market curves as variables we want sensitivities for using the .markAsDiff() method. This allows AADC to generate adjoint instructions only for the variables where sensitivities are actually needed, even though these variables are considered "constant" from the Monte Carlo path valuation point of view.

This is a very powerful optimization and can be configured at runtime for a specific set of required risks.

In order to produce the full Jacobian of the CVA and DVA output variables with respect to all model inputs, we need to execute two passes of the adjoint function, one seeded for each output. So each Monte Carlo path involves one valuation and two adjoint evaluations. Averaging over all simulations gives us the xVA values and their path-wise sensitivities.

xVA Pricing & Greeks (Multi-threaded)

AADC (valuation and AAD risk) relative to the single-threaded Adept baseline:

Threads                  1       2       4       8      16      24      48      96
AVX2 time (sec)     679.86  353.41  173.65   88.26   47.18   32.19   18.49   13.61
AVX-512 time (sec)  360.26  185.54   90.41   47.25   24.58   15.26   11.30   10.11
AVX2 speedup           12x     23x     48x     94x    176x    262x    455x    618x
AVX-512 speedup        23x     45x     93x    175x    338x    543x    737x    832x

Comparison to Commercial AAD Libraries

This benchmark uses Adept, an open-source tape-based AAD library, as the baseline comparison. NAG dco/c++, a commercial tape-based AAD solution, is not included because NAG does not publish benchmark code.

However, tape-based AAD libraries share similar architectural characteristics—storing computation traces in memory and replaying them for differentiation. The performance patterns demonstrated with Adept are representative of tape-based approaches generally.

AADC's code generation approach differs fundamentally: instead of storing a tape, it generates optimized machine code at runtime, enabling:

  • Thread-safe execution without synchronization overhead
  • AVX-512 vectorization for maximum throughput
  • Memory efficiency (no tape storage required)
  • Adjoint factor below 1 (the full set of derivatives can be computed faster than the value-only baseline)

Conclusion

In this paper we demonstrated how MatLogica's AADC library can be used to achieve breakthrough performance improvement for a computationally expensive xVA valuation and risk model on Intel AVX-512.

We used an approachable example in the finance sector as well as relatively simple C++ code to showcase the ease of integrating MatLogica AADC with an existing code base, introducing a novel way not only to improve performance but also to simplify development and support of computationally intensive analytics.

Full AVX-512 Utilization

By using MatLogica AADC, the benefits of Intel AVX-512 can be realized fully and top performance results can be achieved due to our patent-pending proprietary way to drive CPU execution flow.

Proven at Scale

For this benchmark we focused on a simple C++ example; however, this technique has been proven to work with large projects such as the popular open source QuantLib library in quantitative finance.

Cross-Industry Application

This solution can be extended to other industries with a need for high computational performance on problems that are usually associated with GPU.

Glossary

xVA
X-Value Adjustment - umbrella term for CVA, DVA, and other valuation adjustments
CSA
Credit Support Annex - collateral agreement terms
AVX-512
Advanced Vector Extensions 512-bit - Intel SIMD instruction set
AAD
Automatic Adjoint Differentiation - reverse-mode automatic differentiation
Operator Overloading
Technique to intercept operations for AD recording
Adept
Open source C++ AAD library for comparison


Frequently Asked Questions

What speedup does AADC achieve for xVA pricing?

AADC achieves up to 1770x speedup for xVA pricing only using 112 threads with AVX-512 on Intel Xeon Cascade Lake, compared to a single-threaded baseline.

How does AADC compare to tape-based AAD libraries like Adept?

AADC achieves 832x speedup for xVA pricing plus Greeks using 96 threads with AVX-512, compared to single-threaded Adept. This is possible because AADC generates optimized machine code that's inherently thread-safe, while tape-based libraries require careful redesign for multi-threading.

Does AADC require code changes for multi-threading?

No. AADC automatically generates thread-safe code. The original single-threaded analytics can be parallelized without any changes to the user's code. Memory is allocated per thread, allowing multiple threads to process Monte Carlo paths safely in parallel.

What's the difference between AVX2 and AVX-512 performance?

AVX-512 processes 8 samples in parallel versus 4 for AVX2, resulting in approximately 1.3-1.5x additional speedup. For xVA pricing, AVX-512 achieves 1770x vs 1302x for AVX2 at 112 threads.
