Multi-threading Contract

AADC provides specific guarantees and constraints for multi-threaded execution that differ between the recording phase and the kernel execution phase. Understanding these contracts is essential for developing robust multi-threaded applications with AADC.

Executive Summary

Recording Phase: Multiple threads can record simultaneously into separate AADCFunctions kernel instances. Each thread maintains its own independent recording state.

Kernel Execution Phase: Fully thread-safe. Multiple threads can execute AADC kernels simultaneously with different input data using separate workspaces.

Mixed Operations: Threads that are not recording can safely perform mathematical operations on active types while other threads record, at the cost of a lightweight check per operation.

Enable Multi-threading: AADC can enable scalable multi-threaded execution even for large, legacy single-threaded codebases.

Recording Phase Threading Contract

Multi-Threaded Recording Support

AADC supports simultaneous recording from multiple threads, with each thread recording into its own AADCFunctions kernel instance. This is achieved through:

Per-thread recording state: Each thread maintains its own recording state
Per-instance tracking: Each AADCFunctions object tracks whether it is currently recording
Optimized checks: Fast-path optimization that avoids overhead when no thread is recording
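
The three mechanisms above can be illustrated with a small, library-free sketch. It is not AADC's implementation: ToyKernel, ToyGuard, toyMul, and recordInParallel are hypothetical stand-ins, and the sketch only assumes that a thread-local pointer, a per-instance flag, and a global atomic counter are one plausible way to obtain the described behavior.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical stand-in for a kernel; NOT the AADC class.
struct ToyKernel {
    bool recording = false;  // per-instance tracking
    int ops = 0;             // number of operations captured
};

std::atomic<int> g_recorders{0};             // threads currently recording
thread_local ToyKernel* t_active = nullptr;  // per-thread recording state

// RAII scope that starts and stops recording for the calling thread.
struct ToyGuard {
    explicit ToyGuard(ToyKernel& k) : kernel(k) {
        assert(!kernel.recording && "one recorder per kernel");
        kernel.recording = true;
        t_active = &kernel;
        g_recorders.fetch_add(1, std::memory_order_relaxed);
    }
    ~ToyGuard() {
        g_recorders.fetch_sub(1, std::memory_order_relaxed);
        t_active = nullptr;
        kernel.recording = false;
    }
    ToyGuard(const ToyGuard&) = delete;
    ToyGuard& operator=(const ToyGuard&) = delete;
    ToyKernel& kernel;
};

// Stand-in for an "active" arithmetic operation.
inline double toyMul(double a, double b) {
    // Fast path: if no thread is recording, skip the per-thread lookup.
    if (g_recorders.load(std::memory_order_relaxed) != 0 && t_active != nullptr) {
        ++t_active->ops;  // only the calling thread's kernel is touched
    }
    return a * b;
}

// Records n multiplications on each thread, one kernel per thread.
inline std::vector<int> recordInParallel(int threads, int n) {
    std::vector<ToyKernel> kernels(threads);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&kernels, t, n] {
            ToyGuard guard(kernels[t]);
            double x = 1.0;
            for (int i = 0; i < n; ++i) x = toyMul(x, 1.0);
        });
    }
    for (auto& th : pool) th.join();
    std::vector<int> counts;
    for (const auto& k : kernels) counts.push_back(k.ops);
    return counts;
}
```

Calling recordInParallel(4, 100) yields one count per kernel, each reflecting only the operations of the thread that recorded into it. A thread that calls toyMul without holding a guard records nothing.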

Multi-Threaded Recording Example

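#include <aadc/aadc.h>
#include <functional>  // std::ref
#include <memory>      // std::make_shared, std::shared_ptr
#include <thread>
#include <vector>

// Bundles a recorded kernel with the handles needed to execute it
struct KernelData {
    std::shared_ptr<AADCFunctions<__m256d>> kernel;
    AADCArgument x_arg, y_arg;
    AADCResult z_res;
};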

void recordKernelInThread(int thread_id, std::vector<KernelData>& results) {
    // Each thread creates its own AADCFunctions instance
    auto kernel = std::make_shared<AADCFunctions<__m256d>>();

    idouble x(1.0 + thread_id), y(2.0), z;

    // Use RecordingGuard for exception-safe recording
    {
        aadc::recording::RecordingGuard<__m256d> guard(*kernel);

        auto x_arg = x.markAsInput();
        auto y_arg = y.markAsInput();

        z = x * x + y * y;  // Each thread records its own operations

        auto z_res = z.markAsOutput();

        results[thread_id] = {kernel, x_arg, y_arg, z_res};
    }
    // Recording automatically stopped when guard goes out of scope
}

int main() {
    const int num_threads = 4;
    std::vector<KernelData> results(num_threads);
    std::vector<std::thread> threads;

    // Launch multiple recording threads simultaneously
    for (int i = 0; i < num_threads; ++i) {
        threads.emplace_back(recordKernelInThread, i, std::ref(results));
    }

    // Wait for all recordings to complete
    for (auto& t : threads) {
        t.join();
    }

    // Each thread produced an independent kernel
    for (int i = 0; i < num_threads; ++i) {
        auto ws = results[i].kernel->createWorkSpace();
        // ... execute each kernel
    }
}

Recording Constraints

Multiple threads can record simultaneously, subject to the following constraints:

Separate AADCFunctions instance per thread: Two threads cannot record into the same kernel
Cross-thread variable access: Active variables that belong to one thread’s recording are treated as constants when another recording thread accesses them (their current values are captured, not tracked as active)
Complete recording before sharing the kernel: A kernel can be shared for execution only after stopRecording() has returned or its RecordingGuard has gone out of scope
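
The last constraint amounts to a publish-after-recording handoff. The following is a minimal sketch of that pattern using only the standard library. ToyKernel and recordThenPublish are hypothetical stand-ins rather than AADC types, and std::promise is just one of several ways to make the handoff explicit.

```cpp
#include <cassert>
#include <future>
#include <memory>
#include <thread>

// Hypothetical stand-in for a recorded kernel; NOT the AADC class.
struct ToyKernel {
    bool recording = false;
    double coeff = 0.0;  // stands in for the recorded graph
    double eval(double x) const { return coeff * x * x; }
};

// Records on a worker thread and publishes the kernel only after recording ends.
inline std::shared_ptr<const ToyKernel> recordThenPublish(double coeff) {
    std::promise<std::shared_ptr<const ToyKernel>> ready;
    auto published = ready.get_future();

    std::thread recorder([&ready, coeff] {
        auto kernel = std::make_shared<ToyKernel>();
        {
            kernel->recording = true;   // recording scope begins
            kernel->coeff = coeff;      // "record" the computation
            kernel->recording = false;  // recording scope ends
        }
        // Hand over a read-only view strictly after recording has stopped.
        ready.set_value(kernel);
    });

    auto kernel = published.get();  // blocks until the recorder publishes
    recorder.join();
    return kernel;
}
```

The consumer receives a pointer to const, so it cannot accidentally restart or modify the recording, and the future ensures it never observes a half-recorded kernel.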

Exception Safety with RecordingGuard

Use aadc::recording::RecordingGuard for exception-safe recording:

void safeRecording(AADCFunctions<mmType>& kernel) {
    idouble x(1.0), y(2.0), z;

    {
        aadc::recording::RecordingGuard<mmType> guard(kernel);
        auto x_arg = x.markAsInput();
        auto y_arg = y.markAsInput();

        z = computeSomething(x, y);  // If this throws, recording is properly stopped

        auto z_res = z.markAsOutput();
    }
    // Recording automatically stopped here
}

Performance Impact During Recording

Threads that are not recording pay only a lightweight check on each mathematical operation while other threads record. When no thread is recording, a fast-path check avoids even that overhead.

Kernel Execution Phase Threading Contract

Full Thread Safety

Recording and kernel execution are both thread-safe, provided each thread records into its own AADCFunctions instance and executes with its own workspace:

// Multiple threads can record and execute simultaneously
void worker_thread(int thread_id) {
    // Each thread creates its own AADCFunctions instance for recording
    AADCFunctions<mmType> kernel;
    AADCArgument x_arg, y_arg;
    AADCResult z_res;

    // Thread-safe recording
    {
        idouble x(1.0 + thread_id), y(2.0), z;
        aadc::recording::RecordingGuard<mmType> guard(kernel);
        x_arg = x.markAsInput();
        y_arg = y.markAsInput();
        z = x * x + y * y;
        z_res = z.markAsOutput();
    }

    // Thread-safe execution (each thread uses its own workspace)
    std::shared_ptr<AADCWorkSpace<mmType>> ws(kernel.createWorkSpace());
    ws->setVal(x_arg, 3.0);
    ws->setVal(y_arg, 4.0);

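    kernel.forward(*ws);
    ws->setDiff(z_res, 1.0);  // seed the output adjoint before the reverse pass
    kernel.reverse(*ws);

    // Read results from this thread's own workspace
    mmType result = ws->val(z_res);
    double scalar_result = mmSum(result);  // sum across the AVX lanes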
}

Workspace Requirements

Each thread must use its own AADCWorkSpace instance:

// Correct: Each thread has its own workspace
std::vector<std::unique_ptr<std::thread>> threads;
std::vector<double> thread_results(num_threads);

for (int i = 0; i < num_threads; ++i) {
    threads.push_back(
        std::make_unique<std::thread>([&kernel, &thread_results, i]() {
            std::shared_ptr<AADCWorkSpace<mmType>> ws(kernel.createWorkSpace());

            // Set thread-specific inputs
            setInputData(*ws, i);

            // Safe parallel execution
            kernel.forward(*ws);
            thread_results[i] = processResults(*ws, i);
        })
    );
}

// Wait for all threads to complete
for (auto&& t : threads) {
    t->join();
}

Shared Read-Only Data

Recorded kernels and constant data can be safely shared across threads:

// Safe: Kernel objects are immutable once recording has finished
AADCFunctions<mmType> shared_kernel = /* ... */;

// Safe: Read-only access to shared inputs
const std::vector<double> market_data = /* ... */;

// Each iteration processes one scenario with its own workspace
// (parallel_for is a placeholder for any parallel loop primitive)
parallel_for(0, total_scenarios, [&](int scenario_id) {
    std::shared_ptr<AADCWorkSpace<mmType>> ws(shared_kernel.createWorkSpace());

    // Use shared read-only data
    copyInputs(*ws, market_data, scenario_id);
    shared_kernel.forward(*ws);
});
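
Because parallel_for is not defined in this section, the same idea can be sketched with plain std::thread. The sketch splits the scenarios into contiguous chunks so that each thread creates one workspace and reuses it for its whole chunk. ToyKernel, ToyWorkspace, and runScenarios are hypothetical stand-ins, not AADC types, and the chunking scheme is an assumption rather than a recommendation from the AADC documentation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-ins; NOT the AADC classes.
struct ToyWorkspace { double input = 0.0; double output = 0.0; };
struct ToyKernel {
    // const: execution reads the kernel and mutates only the workspace
    void forward(ToyWorkspace& ws) const { ws.output = ws.input * ws.input; }
};

inline std::vector<double> runScenarios(const ToyKernel& kernel,
                                        const std::vector<double>& market_data,
                                        unsigned num_threads) {
    const std::size_t n = market_data.size();
    std::vector<double> results(n);
    num_threads = std::max(1u, num_threads);
    const std::size_t chunk = (n + num_threads - 1) / num_threads;  // ceil(n / threads)

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) break;  // no scenarios left for further threads
        pool.emplace_back([&kernel, &market_data, &results, begin, end] {
            ToyWorkspace ws;  // one workspace per thread, reused for every scenario
            for (std::size_t i = begin; i < end; ++i) {
                ws.input = market_data[i];  // shared data is only read
                kernel.forward(ws);
                results[i] = ws.output;     // disjoint slots, so no locking
            }
        });
    }
    for (auto& th : pool) th.join();
    return results;
}
```

The kernel and the input vector are passed by const reference and only read, while every mutable object (the workspace and the result slot) is touched by exactly one thread.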

Mixed Recording and Execution

One thread can record a new kernel while other threads execute existing kernels:

// Thread 1: Recording a new kernel
void recording_thread() {
    AADCFunctions<mmType> new_kernel;
    aadc::recording::RecordingGuard<mmType> guard(new_kernel);
    // ... record operations
}

// Thread 2: Executing an existing kernel (runs concurrently)
void execution_thread(const AADCFunctions<mmType>& existing_kernel) {
    auto ws = existing_kernel.createWorkSpace();
    existing_kernel.forward(*ws);
    existing_kernel.reverse(*ws);
}

Performance Considerations

Recording Phase Overhead

While one or more threads are recording:

Minimal overhead: Non-recording threads pay only a lightweight check per mathematical operation
No blocking: Threads never wait for other recordings to complete
Independent progress: Each recording thread proceeds independently

With AADC++: Non-recording threads have near-zero overhead (~1.0x native performance).

Execution Phase Performance

Kernel execution scales close to linearly with thread count:

No synchronization overhead: Each thread operates independently
Cache-friendly: Each workspace maintains its own memory space
NUMA-aware: Workspaces can be allocated on appropriate NUMA nodes

Memory Usage

Each thread requires:

Workspace memory: Proportional to the maximum number of active variables in the recording
Stack space: Used for adjoint kernel execution when AAD derivatives are required; it is allocated as part of the workspace object

Because every thread owns a workspace, total memory grows linearly with the number of threads. Plan memory allocation accordingly for multi-threaded scenarios.
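
For rough capacity planning, the proportionality above can be turned into a back-of-the-envelope estimate. The sketch below is an assumption-laden illustration, not a formula from AADC. It assumes one vector-sized slot per live variable for values, plus a second slot when adjoints are needed. The real constant depends on the library version and the recorded kernel, and WorkspaceEstimate, bytesPerThread, and totalBytes are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Rough planning figure only; the per-variable footprint is an assumption.
struct WorkspaceEstimate {
    std::size_t live_variables;  // max number of simultaneously active variables
    std::size_t bytes_per_slot;  // e.g. 32 for a 4-lane AVX2 vector of doubles
    bool with_adjoints;          // reverse pass stores adjoints as well
};

// Assumed model: one slot per variable for values, one more for adjoints.
constexpr std::size_t bytesPerThread(const WorkspaceEstimate& e) {
    return e.live_variables * e.bytes_per_slot * (e.with_adjoints ? 2u : 1u);
}

// Each thread owns a workspace, so the total scales linearly with threads.
constexpr std::size_t totalBytes(const WorkspaceEstimate& e, std::size_t threads) {
    return bytesPerThread(e) * threads;
}
```

Under these assumptions, one million live variables with 32-byte slots and adjoints enabled come to 64,000,000 bytes per thread, or 512,000,000 bytes for eight threads. Measuring an actual workspace is the only reliable way to confirm the figure.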

Debugging Multi-threaded Issues

AADC’s per-thread recording isolation helps avoid many common multi-threading issues:

No shared mutable state during recording: Each thread’s recording state is isolated
Clear ownership: Each AADCFunctions instance belongs to one recording thread
Deterministic graphs: The computational graph recorded by each thread is deterministic
