Migration Guide

An Elegant Approach to Run Existing CUDA Analytics on Both GPU and CPU

Escape GPU vendor lock-in with minimal code changes. Run your CUDA code on scalable CPUs with comparable performance, plus the added benefit of Automatic Adjoint Differentiation (AAD). Make informed hardware decisions for your quantitative finance applications.

The Decision Made Nearly a Decade Ago

Why CPU has caught up with GPU

Historical Context

Many financial institutions committed to CUDA/GPU years ago, chasing proclaimed 100-1000x performance gains over CPU. The substantial investment in transitioning analytics to CUDA seemed justified at the time.

CPU Evolution

Since then, CPU-based systems have made a quantum leap in parallel compute capacity. Modern CPUs are now comparable to GPU systems, and sometimes exceed them, once total cost of ownership is properly accounted for.

Key Insight

Organizations with existing CUDA projects can now assess the performance gain or loss of transitioning from GPU to modern CPU systems. With minimal code changes, existing GPU-only code can be adapted to run on both CPU and GPU.

The Real Economics: 30%, Not 1000x

Cloud cost analysis reveals the truth about GPU vs CPU pricing

According to the most trustworthy and impartial benchmark (STAC-A2), when hardware manufacturers invest maximum software development effort to extract top performance, CPU and GPU run neck and neck.

Feature                                   GPU (V100)   CPU (2 x Xeon)
Number of cores                           5,120        2 x 56
Clock frequency                           877 MHz      2.6 GHz
Operations per clock (single precision)   1            32
FMA (ops per cycle)                       2            2
Peak TFLOPS (product of the above)        8.98         18.64
Approx. monthly cost (GCP)                $1,300       $3,416
Approx. monthly cost per TFLOP            $145         $183

The average cost of a CPU TFLOP is ~30% higher than a GPU TFLOP. Therefore, the maximum theoretical saving from choosing GPU is about 30%, not 1000x! When development costs, maintenance, specialist developers, and vendor lock-in are factored in, the true economics often favor CPU.

The Hidden Liabilities of GPU Commitment

What many banks discovered after their CUDA migration

Maintenance Burden

  • Specialized CUDA code requires specialized developers
  • Higher support costs due to complexity
  • Scarcity of talent makes hiring difficult and expensive
  • Knowledge concentration risk with limited team members

Vendor Lock-in

  • CUDA/NVIDIA ecosystem dependency
  • Hardware obsolescence as older GPUs reach end-of-service
  • Forced upgrades drive incremental costs
  • Limited negotiating power with single vendor

Technical Limitations

  • Strict memory constraints (8-48GB vs 64-512GB+ CPU)
  • No AAD support for gradient calculations
  • Complex control flow challenges
  • Hitting GPU limits with new business needs

Development Overhead

  • Matrix-vector paradigm shift from OOP
  • Costly code redesign effort
  • Extended development cycles
  • Testing complexity across environments

Until now, no CUDA-like technology for safe multithreading was available on the CPU. MatLogica's AADC changes this: unlike CUDA, AADC uses existing C++ object-oriented code to generate optimized kernels for scalable CPU execution with minimal developer effort.

The AADC Approach: Reuse Existing CUDA Analytics

Run your GPU code on scalable CPUs with minimal changes

AADC can simply reuse existing CUDA analytics implemented for GPU and run them on scalable CPUs instead. With minimal changes, existing CUDA code can be adapted for AADC and executed using multi-threading and vectorization on CPU to achieve top performance.

Unlike GPU, CPU has plenty of memory to solve large problems and natively supports AAD!

How It Works: Recording Scalable CPU Kernels

CUDA mainly uses C++ syntax with extensions for parallel programming and GPU management. The AADC approach records scalable CPU kernels by executing original user code for one data sample (e.g., one Monte Carlo path).

  • Execute CUDA analytics with AADC for one data sample on CPU - This records the full valuation graph
  • Compile scalable CPU kernels - Support execution in safe multithreaded environment
  • Leverage AVX native CPU vector arithmetics - Achieve comparable performance to GPU

More complex problems such as American Monte Carlo pricing and xVA calculations can be handled with a similar approach, with only modest increases in code complexity.

Implementation: Going Back to Host

Step-by-step guide to enable CPU execution of CUDA code

To run existing CUDA code with AADC on the CPU, we disable the CUDA extensions so the code compiles with a standard C++ compiler and is ready for AADC kernel compilation.

1. Type Overrides

Change native types to active AADC types

  • #define double idouble - Enable AADC tracking
  • #define bool ibool - Take advantage of operator overloading
  • Minimal intrusion to existing code
  • Maintain type safety and semantics
2. Override CUDA Extensions

Disable GPU-specific directives and API calls

  • #define __global__ - Ignore GPU kernel marker
  • void __syncthreads() {} - Provide stub implementation
  • Implement other CUDA API calls as needed (cudaMemGetInfo, etc.)
  • Use zero-th thread to record MC path 0
3. Include Original Kernel

Reference your existing CUDA kernel code

  • #include "kernel.cu" - Original user CUDA kernel
  • No changes to kernel.cu required for vanilla options
  • Minimal changes for path-dependent options
  • GPU/CPU compatibility maintained
4. Revert Overrides

Clean up preprocessor definitions

  • #undef double - Restore original types
  • #undef bool - Restore original types
  • #undef __global__ - Restore original definitions
  • Normal C++ code follows
5. AADC Kernel Compilation

Compile and execute with AADC

  • Identify model inputs and outputs explicitly
  • Start kernel compilation from kernel.cu
  • Execute analytics for one sample
  • Use compiled CPU kernel for subsequent iterations
6. Scale to Production

Run simulation across multiple CPU cores

  • Deploy across multiple CPU cores
  • Enable AVX2/AVX512 parallelization
  • Run Monte Carlo iterations at scale
  • Monitor and optimize performance
// Type overrides: switch native types to active AADC types
#define double idouble
#define bool ibool

// Override CUDA extensions so a standard C++ compiler accepts the kernel:
#define __global__
inline void __syncthreads() {}      // no-op on CPU
struct { int x = 0; } threadIdx;    // zero-th thread records MC path 0
struct { int x = 0; } blockIdx;
struct { int x = 0; } blockDim;

#include "kernel.cu"  // Original user CUDA kernel, unchanged

// Revert the overrides:
#undef double
#undef bool
#undef __global__

// Normal C++ code follows here with AADC kernel compilation

In real projects, this code can be wrapped for simplified use. The explicit approach is shown here for demonstration purposes.

Performance Benchmarks: GPU vs CPU

Real-world equity derivative pricing comparison

Benchmark Setup: One-Asset Equity Linked Security

Option Type: Path-dependent ELS
Timesteps: 1,080
Simulations: 100,000 Monte Carlo paths
Source: Based on github.com/ymh1989/CUDA_MC
Measurement: Process simulation and pricing logic only
Excluded: Random number generation

Machine                     Price              Time     Time per $1,000/month
NVIDIA V100 (GPU)           $1,300 USD/month   10.2 ms  7.8 ms
CPU, 30 threads + AVX512    $915 USD/month     13.5 ms  14.8 ms

Performance difference: the CPU run is 32% slower but 30% cheaper per month - plus AAD support that is impossible on GPU.

Results are preliminary and being validated by hardware vendors

Your Migration Path Forward

Practical steps for organizations with existing CUDA investments

1. Assess Current Position

Understand your CUDA investment and constraints

  • Inventory CUDA projects and dependencies
  • Identify GPU memory constraints in production
  • Calculate true CUDA maintenance costs
  • Document AAD requirements for risk models
2. Proof of Concept

Test AADC with representative workload

  • Select representative pricing model
  • Apply minimal CUDA overrides
  • Compile with AADC on CPU
  • Benchmark against GPU baseline
3. Support Dual Builds

Maintain both GPU and CPU capabilities

  • Create dual build configuration
  • Test compatibility across platforms
  • Document performance characteristics
  • Choose optimal platform per use case
4. Gradual Transition

Migrate workloads strategically

  • Start with AAD-critical models
  • Move memory-intensive workloads
  • Transition complex control flow models
  • Reduce CUDA dependency over time

CUDA Is Not a One-Way Street

With minimal changes, it's possible to run CUDA code on scalable 64-bit CPU and take advantage of AAD as an additional benefit.

We've demonstrated it's reasonably simple to support existing CUDA projects for dual CPU and GPU builds.

  • Make informed hardware decisions
  • Escape vendor lock-in
  • Choose best platform per workload
  • Leverage AAD where needed
  • Reduce maintenance costs
  • Future-proof infrastructure

The performance gained from transitioning from CPU to GPU often comes from the shift to a matrix-vector paradigm, not just the hardware - a benefit that AADC brings to the CPU.

Ready to Test Your CUDA Code on CPU?

Get a comprehensive benchmark of your CUDA code running on modern CPUs with AADC. Experience comparable performance with the added benefits of AAD support and larger memory capacity - capabilities impossible on GPU.

Comparable Performance: Real benchmarks show minimal performance difference
Minimal Code Changes: Simple overrides and AADC integration
AAD Support Included: Gradient calculations impossible on GPU