The CTO’s Dilemma: Striking the Balance Between Performance and Ease-of-Use in Financial Modelling
In the ever-evolving landscape of financial simulations and computational intricacies, developers retain a strong affinity for coding in object-oriented (OO) languages, owing to the extensive feature sets built up over more than a decade of IT investment in their projects.
However, as we explore the challenges of developing easy-to-maintain and performant financial models, we discover that their repetitive nature is both a performance bottleneck and an opportunity for speed-ups. The calculations themselves involve only basic arithmetic operations and are inherently straightforward. The complexity lies in the layers of virtual functions, abstractions, and memory allocations: constructs designed to make code easier to write, but at the expense of execution speed.
The challenge demands a transformative solution that preserves the coding preferences of C++ and Python while simultaneously eliminating the performance toll driven by these object-oriented constructs. This is precisely what MatLogica delivers: ensuring that the technical investment of the past decade is leveraged whilst also unlocking GPU-equivalent performance. In this post we demonstrate how we have redefined the narrative around financial simulations and the entwined computational challenges, marrying the OO languages with extraordinary acceleration from MatLogica’s Code Generation Kernels.
What are the Code Generation Kernels?
If, for example, your function needs to be executed 10,000 times, and very quickly, you could spend days squeezing out a few milliseconds of performance with some clever optimisations. Or you could instantly achieve 6-100x speed-ups by using MatLogica’s Automatic Adjoint Differentiation Compiler (AADC).
AADC Explained - What's Inside
AADC is a custom JIT compiler, meticulously designed for repetitive simulations and the computation of AAD risks. The user simply marks the inputs and outputs of the function and instructs AADC to record the calculation and, when required, its adjoint. On the fly, AADC generates an optimised, vectorised, multi-thread-safe binary kernel that is ready to process new data samples in subsequent iterations. Once created for a specific task configuration, the kernel is reusable and can efficiently compute the original function and its adjoints as and when required. Essentially, AADC transforms the original object-oriented code into a data-oriented, high-performance kernel.
Whether in physics, weather modelling, industrial mathematics, or financial modelling, simulations (identical calculations) need to be performed across numerous samples of data to obtain reliable results. The sequence of operations remains constant for every dataset, be it a Monte-Carlo simulation, a scenario like stress-testing, backtesting, or Value at Risk (VaR).
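The record-then-replay idea can be illustrated with a toy sketch. This is not AADC's API: the names `Tape`, `Var`, `replay`, `FloatingLeg`, and `FixedLeg` are hypothetical, only addition and multiplication are supported, and the "kernel" is an interpreted list of operations rather than generated machine code. It only shows how running object-oriented code once on recording variables yields a flat sequence of elementary operations that can then be re-run on whole arrays of samples.

```python
import numpy as np

class Tape:
    """Hypothetical recorder: a flat list of elementary operations."""
    def __init__(self, n_inputs):
        self.n_inputs = n_inputs
        self.n_slots = n_inputs      # slots 0..n_inputs-1 hold the marked inputs
        self.ops = []                # entries: (op, out_slot, lhs_slot, rhs)

    def inputs(self):
        return [Var(self, i) for i in range(self.n_inputs)]

class Var:
    """Stand-in for a number; arithmetic on it is recorded, not evaluated."""
    def __init__(self, tape, idx):
        self.tape, self.idx = tape, idx

    def _emit(self, op, other):
        # Plain numbers are static data, so they are baked in as constants.
        rhs = ("var", other.idx) if isinstance(other, Var) else ("const", float(other))
        out = self.tape.n_slots
        self.tape.n_slots += 1
        self.tape.ops.append((op, out, self.idx, rhs))
        return Var(self.tape, out)

    def __add__(self, other): return self._emit("add", other)
    def __mul__(self, other): return self._emit("mul", other)
    __radd__ = __add__               # valid because + and * are commutative
    __rmul__ = __mul__

def replay(tape, out_idx, *inputs):
    """Re-run the recorded operations; inputs may be scalars or NumPy arrays."""
    slots = [np.asarray(x, dtype=float) for x in inputs]
    slots += [None] * (tape.n_slots - tape.n_inputs)
    for op, out, lhs, (kind, val) in tape.ops:
        rhs = slots[val] if kind == "var" else val
        slots[out] = slots[lhs] + rhs if op == "add" else slots[lhs] * rhs
    return slots[out_idx]

# Object-oriented "analytics" with dynamic dispatch (illustrative only).
class FloatingLeg:
    def __init__(self, notional, spread):
        self.notional, self.spread = notional, spread
    def value(self, rate, df):
        return (rate + self.spread) * self.notional * df

class FixedLeg:
    def __init__(self, notional, coupon):
        self.cashflow = notional * coupon
    def value(self, rate, df):
        return df * self.cashflow

def portfolio_value(legs, rate, df):
    total = 0.0
    for leg in legs:
        total = total + leg.value(rate, df)
    return total

legs = [FloatingLeg(100.0, 0.01), FixedLeg(50.0, 0.02)]

# Record once: mark two inputs, run the OO code, keep the output slot.
tape = Tape(n_inputs=2)
rate_in, df_in = tape.inputs()
recorded_out = portfolio_value(legs, rate_in, df_in)

# Replay many times: every sample flows through the same flat operation list.
rates = np.linspace(0.01, 0.05, 5)
dfs = np.full(5, 0.9)
kernel_values = replay(tape, recorded_out.idx, rates, dfs)
direct_values = portfolio_value(legs, rates, dfs)
```

After recording, the loop over legs and the method dispatch no longer appear anywhere in `tape.ops`; only additions and multiplications remain, which is the property a code-generating compiler can exploit.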
Ultimately, the performance of each iteration becomes a crucial concern. Virtual functions and abstractions, while providing code flexibility, introduce overhead, impacting the speed of execution. Memory allocations further compound the challenge, leading developers to seek innovative solutions that balance flexibility and performance.
What is the problem with repetitive calculations?
Templates, including expression templates, offer a means to write flexible code in C++ without sacrificing performance, because data types are specified at compile time rather than resolved at runtime. While templates offer a way to achieve flexibility, their use can lead to code bloat, where multiple instances of template code are generated, potentially impacting compile times and binary sizes. Additionally, debugging templated code poses unique challenges, requiring a delicate balance to harness their benefits effectively.
In finance, the computation of higher-order derivatives (Greeks) is required for accurate hedging ratios and P&L explain. The debate over the methodology for computing higher-order derivatives rages on, with the prevailing view favouring Bump&Revalue over AAD for these orders. MatLogica’s Code Generation Kernels are the catalyst for performance and precision when calculating derivatives like Vanna, Volga, Cross-Gamma, Charm, or Speed, because they enable fast AAD and can also speed up Bump&Revalue.
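As a reminder of what Bump&Revalue means for second-order Greeks, here is a minimal sketch using central finite differences on a plain Black-Scholes call. The function names, bump sizes, and market data are illustrative assumptions, and nothing here uses AADC; the closed-form Gamma and Vanna are included only as a check.

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def bs_call(spot, vol, strike=100.0, rate=0.01, t=1.0):
    """Black-Scholes price of a European call without dividends."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * t) / (vol * math.sqrt(t))
    d2 = d1 - vol * math.sqrt(t)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * t) * norm_cdf(d2)

def bump_gamma(price, spot, vol, h=0.1):
    """Second derivative in spot: three revaluations."""
    return (price(spot + h, vol) - 2.0 * price(spot, vol) + price(spot - h, vol)) / (h * h)

def bump_vanna(price, spot, vol, hs=0.1, hv=0.001):
    """Cross derivative in spot and volatility: four revaluations."""
    return (price(spot + hs, vol + hv) - price(spot + hs, vol - hv)
            - price(spot - hs, vol + hv) + price(spot - hs, vol - hv)) / (4.0 * hs * hv)

def analytic_gamma_vanna(spot, vol, strike=100.0, rate=0.01, t=1.0):
    """Closed-form values, used here only to check the bumped numbers."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * t) / (vol * math.sqrt(t))
    d2 = d1 - vol * math.sqrt(t)
    gamma = norm_pdf(d1) / (spot * vol * math.sqrt(t))
    vanna = -norm_pdf(d1) * d2 / vol
    return gamma, vanna

spot, vol = 100.0, 0.2
gamma_fd = bump_gamma(bs_call, spot, vol)
vanna_fd = bump_vanna(bs_call, spot, vol)
gamma_exact, vanna_exact = analytic_gamma_vanna(spot, vol)
```

Each Greek costs several full revaluations per risk factor pair, which is why the speed of a single revaluation matters so much for Bump&Revalue.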
How code generation kernels can help with the performance and precision of Greeks, XVA calculations, and expected shortfall modelling
In the world of XVA (CVA, DVA, FVA, etc.) daily compute costs can exceed $100K, and every millisecond matters. MatLogica’s Code Generation Kernels revolutionise these calculations across their spectrum, from managing IT infrastructure costs to assessing counterparty risk and optimising trading strategies.
With market dynamics in constant flux, the modelling of expected shortfall is a linchpin of effective risk management. The Kernels enable financial institutions to swiftly calculate and respond to potential losses in extreme scenarios, ensuring risk assessments are both precise and lightning-fast.
Whilst MatLogica’s solution is mainly tuned for finance applications, it is notable that these Code Generation Kernels have a very broad range of potential applications across Machine Learning. For tasks like training deep neural networks and optimising complex algorithms, the GPU-like performance of these kernels is a formidable enabler.
Where Does the Performance Come From?
We have so far explained the source of the performance penalty from OO languages. Below are the primary optimisations that the Code Generation Kernels deliver.
1. Optimisations Relating to Static Data and Constants
Schedules, model parameters, and other static data are hard-coded in the kernels, reducing runtime computations and enhancing efficiency. A good example is volatility interpolation in Black-Scholes pricing, where a binary search locates the volatility bucket for the maturity date before the interpolation weights are established. With AADC, this binary search needs to be performed only once, at recording time, as the maturity date is constant. The resulting weights are then baked into the kernel as constants, resulting in a huge performance gain.
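The volatility interpolation example can be sketched as follows. The tenor grid, function names, and volatilities are hypothetical, and a Python closure stands in for the generated kernel: the binary search runs once when the kernel is built, and each later call only applies the stored indices and weight.

```python
import bisect

TENORS = [0.25, 0.5, 1.0, 2.0, 5.0]   # hypothetical bucket maturities, in years

def locate(maturity):
    """Binary search for the bracketing buckets and the weight on the right one."""
    hi = bisect.bisect_left(TENORS, maturity)
    hi = min(max(hi, 1), len(TENORS) - 1)   # outside the grid: use the edge buckets
    lo = hi - 1
    weight = (maturity - TENORS[lo]) / (TENORS[hi] - TENORS[lo])
    return lo, hi, weight

def vol_naive(vols, maturity):
    """Original analytics: the search is repeated on every call."""
    lo, hi, w = locate(maturity)
    return (1.0 - w) * vols[lo] + w * vols[hi]

def make_vol_kernel(maturity):
    """Search once at build time; the kernel only sees constants lo, hi, w."""
    lo, hi, w = locate(maturity)
    def kernel(vols):
        return (1.0 - w) * vols[lo] + w * vols[hi]
    return kernel

kernel = make_vol_kernel(1.5)
scenario_vols = [0.30, 0.28, 0.25, 0.24, 0.22]   # one vol per bucket, per scenario
interpolated = kernel(scenario_vols)
```

The maturity is static for a given trade, so only the bucket volatilities vary between scenarios; the kernel therefore reduces to two multiplications and one addition per scenario.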
2. Vectorisation and Multi-Threading
The resulting code is generated in a single thread and then optimised for the target hardware architecture, such as AVX2 or AVX512. AVX2 processes four double-precision data samples per instruction, and AVX512 delivers another 1.7x performance boost. The kernels are NUMA-aware and contain only the minimal number of operations theoretically necessary to complete the task. They are multi-thread-safe, even if the original analytics are not. This ensures parallel execution, further elevating performance.
3. Enhanced memory use
In a Monte Carlo scenario, multi-threading can be enabled by AADC Code Generation Kernels, even if the original analytics are not multi-thread safe. Multi-threading generally needs less memory as the threads share the same memory space, enhancing efficiency through direct data access. In contrast, multi-processing incurs additional memory overhead as it uses separate memory space for each process. Multi-threading is thus the methodology of choice for AADC-enabled simulations.
Additionally, traditional approaches to memory use often leave variables scattered across memory, resulting in suboptimal use of the CPU cache. AADC’s groundbreaking solution is to retain only the memory required for real values, combined with a carefully thought-through use of CPU registers, delivering a profound reduction in memory consumption.
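The memory argument for threads can be illustrated with a small sketch, assuming NumPy and the standard library's thread pool; the function `terminal_spot` and all parameters are illustrative and unrelated to AADC. Each thread reads a slice of one shared input array and writes to a disjoint slice of one shared output array, so no sample data is copied per worker. In CPython the global interpreter lock limits how much pure-Python code runs in parallel, so this shows the memory layout rather than a benchmark.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def terminal_spot(shocks, spot=100.0, vol=0.2, t=1.0):
    """Stateless, thread-safe kernel: lognormal terminal value per sample."""
    return spot * np.exp(-0.5 * vol * vol * t + vol * np.sqrt(t) * shocks)

rng = np.random.default_rng(0)
shocks = rng.standard_normal(100_000)   # one shared, read-only input buffer
out = np.empty_like(shocks)             # one shared output buffer

def work(bounds):
    lo, hi = bounds
    # Basic slices are views into the shared buffers, not copies.
    out[lo:hi] = terminal_spot(shocks[lo:hi])

n_threads = 4
edges = np.linspace(0, shocks.size, n_threads + 1, dtype=int)
chunks = list(zip(edges[:-1], edges[1:]))

with ThreadPoolExecutor(max_workers=n_threads) as pool:
    list(pool.map(work, chunks))        # consume the iterator to surface errors
```

With separate processes, each worker would typically need its own copy of the inputs or an explicit shared-memory mechanism, which is the overhead the text refers to.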
4. Enhanced Inputs with Derivatives
Derivatives can be calculated rapidly, cost-effectively, and precisely by using the Adjoint Kernels. With an adjoint factor below 1 (the time to compute the function together with all its derivatives, divided by the time to run the original function), the function and all its derivatives are computed quicker than the original function itself.
AADC is proven to be 16x faster than an implementation involving manual adjoints. It is also less error-prone, as a consistent methodology can be used for first- as well as higher-order Greeks. It also requires up to 70% fewer lines of code than the alternatives.
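For readers less familiar with adjoint differentiation, the following toy sketch shows the general reverse-mode idea: one forward evaluation records local partial derivatives, and one backward sweep then yields the sensitivity to every input at once. The `Node` class, the helper names, and the test function are hypothetical; this is a textbook-style illustration, not AADC's implementation, and it makes no claim about adjoint factors.

```python
import math

class Node:
    """Value plus links to parents with the local partial derivatives."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents     # tuple of (parent node, d(self)/d(parent))
        self.adjoint = 0.0

    def __add__(self, other):
        other = lift(other)
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = lift(other)
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))

    __radd__ = __add__
    __rmul__ = __mul__

def lift(x):
    """Wrap plain numbers as constant nodes."""
    return x if isinstance(x, Node) else Node(float(x))

def exp(x):
    v = math.exp(x.value)
    return Node(v, ((x, v),))

def backward(output):
    """One reverse sweep: propagate adjoints from the output to all inputs."""
    order, seen = [], set()
    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)         # parents are appended before their children
    visit(output)
    output.adjoint = 1.0
    for node in reversed(order):   # children are processed before their parents
        for parent, local in node.parents:
            parent.adjoint += node.adjoint * local

# Illustrative function: f(s, r, t) = s * exp(r * t) + 0.5 * s * s
s, r, t = Node(2.0), Node(0.05), Node(3.0)
f = s * exp(r * t) + 0.5 * s * s
backward(f)
grad = (s.adjoint, r.adjoint, t.adjoint)
```

A single call to `backward` fills in all three sensitivities, whereas Bump&Revalue would need at least one extra revaluation per input.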
5. The Need for a Custom JIT Compiler in Code Generation
The kernel recording time is a critical factor, as the function and its adjoint must be regenerated each time the task configuration changes - whether pricing a new trade, altering the trading date, or amending the portfolio. The time taken to generate the kernel becomes a pivotal element in the overall execution, making AADC an indispensable tool for achieving substantial performance gains in real-life quant and risk systems. When using an off-the-shelf compiler, code generation can take the time equivalent of over 10,000 executions of the original code, making its use prohibitive for smaller simulations. In contrast, the AADC JIT compiler takes, on average, around 200x the time needed to execute the original function, making two-fold performance increases a reality.
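The trade-off between recording time and kernel speed can be made concrete with a back-of-the-envelope sketch. Costs are measured in units of one execution of the original function; the recording costs of 200 and 10,000 come from the figures above, while the kernel speed-up of 6x (the lower end of the 6-100x range quoted earlier) and the function names are assumptions for illustration.

```python
def total_cost(n_runs, recording_cost, kernel_speedup):
    """Record once, then run the kernel n_runs times (units: original executions)."""
    return recording_cost + n_runs / kernel_speedup

def break_even_runs(recording_cost, kernel_speedup):
    """Smallest n with recording_cost + n / speedup <= n, as a real number."""
    return recording_cost / (1.0 - 1.0 / kernel_speedup)

n_runs = 10_000
baseline = float(n_runs)                          # original code, no kernel

jit_cost = total_cost(n_runs, 200, 6.0)           # custom JIT, ~200x recording
offline_cost = total_cost(n_runs, 10_000, 6.0)    # off-the-shelf compiler

jit_overall_speedup = baseline / jit_cost
offline_overall_speedup = baseline / offline_cost
jit_break_even = break_even_runs(200, 6.0)
```

Under these assumptions the kernel pays for itself after a few hundred runs with a 200x recording cost, while a 10,000x recording cost leaves a 10,000-run simulation slower overall than the original code.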
Can Code Generation Kernels be used in the Cloud?
Yes, they can! And, remarkably, these kernels are almost Enigma-secure. AADC receives a sequence of elementary operations from the original analytics, with all the loops unrolled, and just hard-coded values that represent strike, expiry date, or valuation date. With no visibility of the original models, AADC generates optimised and compressed kernels, where all business-critical information is hidden between the ones and zeros. Accordingly, even the same portfolio of trades will have a different binary representation from one trading day to another.
Object-oriented languages, such as C++ and Python, with their extensive feature sets, not only serve as familiar coding pillars but embody a significant legacy of development. Unveiling the challenges embedded in the repetitive nature of computations exposes not only performance bottlenecks but also opens a gateway to unprecedented speeds. Amidst layers of virtual functions, abstractions, and memory allocations, MatLogica's Code Generation methodology emerges as the optimal transformative solution, preserving your coding preferences while annihilating the speed barriers imposed by object-oriented constructs.