Executive Summary
Many organisations committed to GPU computing for their analytics; in some cases that decision was made almost a decade ago. Since then, CPU-based systems have made a huge leap in parallel-compute capacity and are now comparable to, and sometimes more performant than, GPU systems once total cost of ownership is accounted for.
In this post we present a solution that allows organisations with existing CUDA projects to assess the performance loss (or gain) of transitioning from GPU to modern CPU systems. Using real-life CUDA examples, we demonstrate how existing GPU-only code can be adapted to run on CPU or GPU from a single codebase. This should allow companies to make a fair assessment of the performance delivered by both technologies.
Developing analytics for CPU usually requires less effort and allows for advanced techniques such as Automatic Adjoint Differentiation (AAD). Companies therefore face difficult decisions when weighing the pros and cons of the two technologies.
We have made available an open-source equity pricing model benchmark, implemented for both CPU and GPU, which will help practitioners extract top performance from both platforms and therefore compare the two using unbiased metrics.
Introduction
For many years, the crucial factor in favour of GPU was the ability to run kernels that can be safely processed in parallel. With proclaimed performance gains of 1000x, a CFO might be persuaded to switch to GPU despite the significant investment required to transition to CUDA and the subsequent higher software support costs.
However, according to the most trustworthy and impartial benchmark (STAC-A2), when hardware manufacturers put maximum software development effort into extracting top performance from their offerings, CPU and GPU actually run neck and neck.
In the example below we provide an approximate comparison of the performance and operational costs of modern CPUs vs GPUs, using cloud costs as a proxy for owning such a setup. It shows that the average cost of a CPU TFLOP is no more than roughly 30% higher than that of a GPU. Therefore, the maximum theoretical saving for a CFO is about 30%, not 1000x!
Other key considerations include the software redesign effort and the increased support and maintenance costs associated with CUDA, which are driven by the specialised nature of the code and the resulting need for specialist developers. In addition, GPU vendor lock-in is likely to drive incremental cost increases as older generations of hardware become outdated.
The performance gained in transitioning from CPU to GPU cannot be attributed solely to the change of hardware. It also involves a costly, and easy-to-overlook, change of mindset stemming from the transition from object-oriented languages to a matrix-vector programming paradigm. This change alone yields performance improvements regardless of chip architecture.
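As a purely illustrative C++ sketch (not taken from the benchmark code), compare a per-trade, object-oriented computation with its batched equivalent; the batched form gives the compiler one tight loop that auto-vectorises on a CPU just as it parallelises on a GPU:

#include <cstddef>
#include <vector>

// Object-oriented style: one scalar computation per trade object.
struct CallOption {
    double strike;
    double payoff(double spot) const { return spot > strike ? spot - strike : 0.0; }
};

// "Matrix-vector" style: the same payoffs computed over flat arrays
// in a single loop the compiler can vectorise on any hardware.
void payoffs(const std::vector<double>& spots,
             const std::vector<double>& strikes,
             std::vector<double>& out)
{
    for (std::size_t i = 0; i < out.size(); ++i) {
        double p = spots[i] - strikes[i];
        out[i] = p > 0.0 ? p : 0.0;
    }
}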
Many large banks made a long-term commitment to CUDA/GPU a few years ago. Some have come to realise that this decision has in fact created a raft of new liabilities: high maintenance costs; scarcity of expertise, and thus difficulty in finding and recruiting new talent; hardware reaching end-of-service and becoming obsolete; and hitting the technical limits of GPU as new business needs arise. However, there is a way out of the vendor lock-in imposed by this migration to CUDA/NVIDIA.
Until now, no technology similar to CUDA, enabling and supporting safe multithreading, was available on the CPU. In response, MatLogica has developed AADC. Unlike CUDA, AADC can use existing C++ object-oriented code to generate optimised kernels for scalable execution on a CPU with minimal effort from developers.
AADC can simply reuse existing CUDA analytics, implemented for GPU, and run them on scalable CPUs instead. With minimal changes, existing CUDA code can be adapted for AADC and executed using multi-threading and vectorisation on a CPU to achieve top performance. Unlike a GPU, a CPU has plenty of memory to solve large problems and support AAD!
Idea: Using AADC to generate scalable CPU kernels with EXISTING CUDA analytics
The idea is to recompile the existing CUDA kernel as ordinary C++ and let AADC turn it into a vectorised, multi-threaded CPU kernel. More complex problems, such as American Monte-Carlo pricing and xVA, can be handled with a similar approach, albeit with a modest increase in code complexity.
How: Going back to host
The new compilation unit for AADC on CPU may look like this:
#define double idouble             // change native types to active AADC types
#define bool ibool                 // to take advantage of operator overloading

// Override CUDA extensions:
#define __global__                 // ignore __global__
void __syncthreads() {}            // simple stub implementation for CUDA-specific API;
                                   // other functions, such as cudaMemGetInfo, can be stubbed similarly
struct { int x = 0; } threadIdx;   // use the zero-th thread to record MC path 0
struct { int x = 0; } blockIdx;
struct { int x = 0; } blockDim;

#include "kernel.cu"               // original user CUDA kernel

// Revert the overrides:
#undef double
#undef bool
#undef __global__

// Normal C++ code follows here
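For reference, a minimal kernel in the spirit of the QuantStart-inspired example might look as follows. This is an illustrative sketch, not the actual benchmark kernel (which lives in the repository); the names price_paths and gauss are ours, and we assume AADC provides overloads of exp, sqrt and fmax for idouble:

// Hypothetical "kernel.cu": prices one vanilla call per thread under
// Black-Scholes dynamics. Compiles as CUDA; under the overrides above,
// double/bool become idouble/ibool and the thread index collapses to path 0.
__global__ void price_paths(double* payoffs, const double* gauss,
                            double S0, double K, double r,
                            double v, double T, int n_paths)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_paths) {
        // terminal spot under geometric Brownian motion
        double ST = S0 * exp((r - 0.5 * v * v) * T + v * sqrt(T) * gauss[i]);
        // discounted call payoff; fmax keeps the code branch-free
        payoffs[i] = exp(-r * T) * fmax(ST - K, 0.0);
    }
}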
We can now add the AADC kernel compilation and the execution driver, as with any other C++ code. This normally consists of two steps:
1. Start kernel compilation by executing the analytics from kernel.cu once. For this we need to explicitly identify the model inputs and outputs.
2. Use the compiled CPU kernel instead of the original function for all subsequent Monte-Carlo iterations, running the simulation across multiple CPU cores with AVX2/AVX512 vectorisation, as in the sketch below.
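Put together, a host-side driver for the hypothetical kernel above might look like the following sketch. We assume AVX2 (four doubles per vector lane); the API names (aadc::AADCFunctions, markAsInput/markAsOutput, createWorkSpace, forward) follow the style of the AADC manual but should be checked against Manual.pdf, and the buffer handling is purely illustrative:

#include <memory>
#include <vector>
#include <immintrin.h>
#include <aadc/aadc.h>   // assumption: main AADC header

typedef __m256d mmType;  // AVX2: 4 doubles per vector lane

void price_on_cpu(const std::vector<double>& normals, int n_paths)
{
    aadc::AADCFunctions<mmType> aad_funcs;

    // Step 1: record the kernel once. The stubbed thread/block indices
    // are all zero, so this records MC path 0.
    idouble gauss(0.0), payoff(0.0);
    aad_funcs.startRecording();
    aadc::AADCArgument gauss_arg = gauss.markAsInput();
    price_paths(&payoff, &gauss, /*S0*/100.0, /*K*/100.0,
                /*r*/0.02, /*v*/0.2, /*T*/1.0, /*n_paths*/1);
    aadc::AADCResult payoff_res = payoff.markAsOutput();
    aad_funcs.stopRecording();

    // Step 2: replay the compiled kernel, 4 paths per AVX2 vector
    // (n_paths assumed to be a multiple of 4 for brevity). When run
    // across multiple cores, each thread would use its own workspace.
    auto ws = aad_funcs.createWorkSpace();
    std::vector<double> payoffs(n_paths);
    for (int i = 0; i < n_paths; i += 4) {
        for (int k = 0; k < 4; ++k)
            ws->val(gauss_arg)[k] = normals[i + k];   // fill the 4 lanes
        aad_funcs.forward(*ws);                       // one vectorised replay
        for (int k = 0; k < 4; ++k)
            payoffs[i + k] = ws->val(payoff_res)[k];
    }
}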
Example: Equity Derivative Pricing
The original code is taken from GitHub and is inspired by QuantStart.
The source code can be built on Linux and Windows and is available in "CUDA_Example/AADC_Enabled/one-asset ELS/code"; the user manual is available as "Manual.pdf". For the vanilla option pricer, no changes to "kernel.cu" are required. For the path-dependent ELS option, minimal changes were needed and GPU/CPU compatibility is maintained.
The source code can be obtained on request.
What about performance?
*The results are preliminary and are being validated by the hardware vendors.
Based on these results, we see comparable performance between top-of-the-line GPU/CUDA and AADC-adapted CUDA code running on a CPU. The changes required to the CUDA code are minimal. Apart from integrating MatLogica AADC, no additional optimisations were performed.
This example is open source, and anyone can run it themselves as well as recommend improvements for both CPU and GPU. We will update this table as we receive feedback from hardware manufacturers and developers.
Conclusion
In this post, we used an example of an embarrassingly parallel pricing method. At MatLogica, we have solutions for a wide range of more complex models typical in quantitative finance, such as Longstaff-Schwartz pricing of callable products, XVA, and PDEs.