Multi-threading Contract
AADC provides specific guarantees and constraints for multi-threaded execution that differ between the recording phase and the kernel execution phase. Understanding these contracts is essential for developing robust multi-threaded applications with AADC.
Executive Summary
Recording Phase: Multiple threads can record simultaneously into separate AADCFunctions kernel instances. Each thread maintains its own independent recording state.
Kernel Execution Phase: Fully thread-safe. Multiple threads can execute AADC kernels simultaneously with different input data using separate workspaces.
Mixed Operations: Non-recording threads can safely perform mathematical operations using active types while other threads are recording, with some performance overhead.
Enable Multi-threading: AADC can be used to enable scalable multi-threaded execution even for large legacy single-threaded codebases.
Recording Phase Threading Contract
Multi-Threaded Recording Support
AADC supports simultaneous recording from multiple threads, with each thread recording into its own AADCFunctions kernel instance. This is achieved through:
- Per-thread recording state: Each thread maintains its own recording state
- Per-instance tracking: Each AADCFunctions object tracks whether it is currently recording
- Optimized checks: Fast-path optimization that avoids overhead when no thread is recording
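To make the interplay between per-thread state and the fast-path check concrete, here is a minimal standalone sketch. It is not AADC's implementation: the namespace and every name in it (active_recordings, this_thread_recording, shouldRecord) are hypothetical, and it only assumes that such a scheme can be built from a thread_local flag plus one shared atomic counter.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names throughout; this is an illustration, not AADC code.
namespace sketch {

// Number of threads currently recording, visible to all threads.
static std::atomic<int> active_recordings{0};

// Recording flag private to each thread.
static thread_local bool this_thread_recording = false;

void startRecording() {
    assert(!this_thread_recording);
    this_thread_recording = true;
    active_recordings.fetch_add(1, std::memory_order_relaxed);
}

void stopRecording() {
    assert(this_thread_recording);
    this_thread_recording = false;
    active_recordings.fetch_sub(1, std::memory_order_relaxed);
}

// The check an active type could run on every arithmetic operation.
bool shouldRecord() {
    // Fast path: a single relaxed load when no thread is recording.
    if (active_recordings.load(std::memory_order_relaxed) == 0) return false;
    // Slow path: consult this thread's own flag.
    return this_thread_recording;
}

// Returns true if a second thread sees itself as non-recording
// while the calling thread is recording.
bool otherThreadIsIsolated() {
    startRecording();
    bool other_records = true;
    std::thread t([&other_records] { other_records = shouldRecord(); });
    t.join();
    stopRecording();
    return !other_records;
}

}  // namespace sketch
```

In this sketch a non-recording thread pays for one atomic load when nothing is being recorded, and for one extra thread-local read while some other thread records.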
Multi-Threaded Recording Example
#include <aadc/aadc.h>
#include <functional>
#include <memory>
#include <thread>
#include <vector>

// Helper struct assumed by this example (not part of AADC): bundles a
// recorded kernel with the handles needed to execute it later.
struct KernelData {
    std::shared_ptr<AADCFunctions<__m256d>> kernel;
    AADCArgument x_arg, y_arg;
    AADCResult z_res;
};

void recordKernelInThread(int thread_id, std::vector<KernelData>& results) {
    // Each thread creates its own AADCFunctions instance
    auto kernel = std::make_shared<AADCFunctions<__m256d>>();
    idouble x(1.0 + thread_id), y(2.0), z;
    // Use RecordingGuard for exception-safe recording
    {
        aadc::recording::RecordingGuard<__m256d> guard(*kernel);
        auto x_arg = x.markAsInput();
        auto y_arg = y.markAsInput();
        z = x * x + y * y;  // Each thread records its own operations
        auto z_res = z.markAsOutput();
        // Each thread writes only to its own slot of the pre-sized vector
        results[thread_id] = {kernel, x_arg, y_arg, z_res};
    }
    // Recording automatically stopped when guard goes out of scope
}

int main() {
    const int num_threads = 4;
    std::vector<KernelData> results(num_threads);
    std::vector<std::thread> threads;
    // Launch multiple recording threads simultaneously
    for (int i = 0; i < num_threads; ++i) {
        threads.emplace_back(recordKernelInThread, i, std::ref(results));
    }
    // Wait for all recordings to complete
    for (auto& t : threads) {
        t.join();
    }
    // Each thread produced an independent kernel
    for (int i = 0; i < num_threads; ++i) {
        auto ws = results[i].kernel->createWorkSpace();
        // ... execute each kernel
    }
}
Recording Constraints
While multiple threads can record simultaneously, each thread must:
- Use a separate AADCFunctions instance: Two threads cannot record into the same kernel
- Cross-thread variable access: If active variables from one thread’s recording are accessed by another recording thread, they are treated as constants (their current values are captured, not tracked as active)
- Complete recording before sharing the kernel: The kernel can only be shared for execution after stopRecording() returns
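The cross-thread rule above can be pictured with a small toy model. This is a single-threaded illustration under stated assumptions, not a description of how AADC works internally: Tape, Value, markAsInput and add are hypothetical names, and the only behavior modeled is that an operand tracked by a different recording is captured by value instead of being treated as active.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <vector>

// Toy model with hypothetical names; not AADC's data structures.
namespace sketch {

struct Tape {
    std::size_t num_active = 0;     // variables tracked by this recording
    std::vector<double> constants;  // values captured from outside it
};

// A value that remembers which tape, if any, tracks it.
struct Value {
    double v;
    const Tape* owner;  // nullptr means a plain constant
};

Value markAsInput(Tape& tape, double v) {
    ++tape.num_active;
    return Value{v, &tape};
}

// Record a + b on `tape`. An operand owned by another tape is
// frozen at its current value rather than tracked.
Value add(Tape& tape, const Value& a, const Value& b) {
    bool any_active = false;
    for (const Value* operand : {&a, &b}) {
        if (operand->owner == &tape) {
            any_active = true;
        } else {
            tape.constants.push_back(operand->v);
        }
    }
    if (any_active) ++tape.num_active;
    return Value{a.v + b.v, any_active ? &tape : nullptr};
}

// Usage: y belongs to tape_b; x belongs to tape_a and is read by value.
bool crossTapeDemo() {
    Tape tape_a, tape_b;
    Value x = markAsInput(tape_a, 1.5);
    Value y = markAsInput(tape_b, 2.0);
    Value z = add(tape_b, x, y);
    assert(z.owner == &tape_b);            // result is tracked by tape_b
    assert(tape_b.constants.size() == 1);  // x was frozen at 1.5
    return z.v == 3.5;
}

}  // namespace sketch
```

The practical consequence in this model is that derivatives of z with respect to x are not available from tape_b, because x entered that recording only as the number 1.5.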
Exception Safety with RecordingGuard
Use aadc::recording::RecordingGuard for exception-safe recording:
void safeRecording(AADCFunctions<mmType>& kernel) {
    idouble x(1.0), y(2.0), z;
    {
        aadc::recording::RecordingGuard<mmType> guard(kernel);
        auto x_arg = x.markAsInput();
        auto y_arg = y.markAsInput();
        z = computeSomething(x, y);  // If this throws, recording is properly stopped
        auto z_res = z.markAsOutput();
    }
    // Recording automatically stopped here
}
Performance Impact During Recording
Non-recording threads experience minimal overhead during concurrent recording: AADC's optimized checks keep the per-operation cost low even while other threads are recording.
Kernel Execution Phase Threading Contract
Full Thread Safety
AADC recording and execution are now fully multi-thread safe:
// Multiple threads can record and execute simultaneously
void worker_thread(int thread_id) {
    // Each thread creates its own AADCFunctions instance for recording
    AADCFunctions<mmType> kernel;
    AADCArgument x_arg, y_arg;
    AADCResult z_res;
    // Thread-safe recording
    {
        idouble x(1.0 + thread_id), y(2.0), z;
        aadc::recording::RecordingGuard<mmType> guard(kernel);
        x_arg = x.markAsInput();
        y_arg = y.markAsInput();
        z = x * x + y * y;
        z_res = z.markAsOutput();
    }
    // Thread-safe execution (each thread uses its own workspace)
    std::shared_ptr<AADCWorkSpace<mmType>> ws(kernel.createWorkSpace());
    ws->setVal(x_arg, 3.0);
    ws->setVal(y_arg, 4.0);
    kernel.forward(*ws);
    kernel.reverse(*ws);
    // Extract results safely
    mmType result = ws->val(z_res);
    double scalar_result = mmSum(result);
}
Workspace Requirements
Each thread must use its own AADCWorkSpace instance:
// Correct: Each thread has its own workspace
std::vector<std::unique_ptr<std::thread>> threads;
std::vector<double> thread_results(num_threads);
for (int i = 0; i < num_threads; ++i) {
    threads.push_back(
        std::make_unique<std::thread>([&kernel, &thread_results, i]() {
            std::shared_ptr<AADCWorkSpace<mmType>> ws(kernel.createWorkSpace());
            // Set thread-specific inputs
            setInputData(*ws, i);
            // Safe parallel execution
            kernel.forward(*ws);
            thread_results[i] = processResults(*ws, i);
        })
    );
}
// Wait for all threads to complete
for (auto&& t : threads) {
    t->join();
}
Mixed Recording and Execution
AADC supports concurrent recording and execution operations:
// Thread 1: Recording a new kernel
void recording_thread() {
    AADCFunctions<mmType> new_kernel;
    aadc::recording::RecordingGuard<mmType> guard(new_kernel);
    // ... record operations
}
// Thread 2: Executing an existing kernel (runs concurrently)
void execution_thread(const AADCFunctions<mmType>& existing_kernel) {
    auto ws = existing_kernel.createWorkSpace();
    existing_kernel.forward(*ws);
    existing_kernel.reverse(*ws);
}
Performance Considerations
Recording Phase Overhead
During recording, non-recording threads experience:
- Minimal overhead: Lightweight checks per mathematical operation
- No blocking: Threads never wait for other recordings to complete
- Independent progress: Each recording thread proceeds independently
With AADC++: Non-recording threads have near-zero overhead (~1.0x native performance).
Execution Phase Performance
Kernel execution scales close to linearly with thread count:
- No synchronization overhead: Each thread operates independently
- Cache-friendly: Each workspace maintains its own memory space
- NUMA-aware: Workspaces can be allocated on appropriate NUMA nodes
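The reason execution needs no synchronization can be shown with a library-free sketch: the kernel is read-only after it is built, and all mutable state lives in a workspace owned by one thread. Kernel, Workspace and runParallel below are hypothetical stand-ins, not AADC types; the formula mirrors the z = x * x + y * y example used elsewhere on this page.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical stand-ins; they only mimic the ownership pattern.
namespace sketch {

// All mutable state for one evaluation.
struct Workspace {
    double x;
    double y;
    double z;
};

// Immutable after construction, so it can be shared freely.
struct Kernel {
    double scale;
    // Reads only from *this and writes only into the caller's workspace.
    void forward(Workspace& ws) const {
        ws.z = scale * (ws.x * ws.x + ws.y * ws.y);
    }
};

std::vector<double> runParallel(const Kernel& kernel, int num_threads) {
    assert(num_threads > 0);
    std::vector<double> results(num_threads);
    std::vector<std::thread> threads;
    for (int i = 0; i < num_threads; ++i) {
        threads.emplace_back([&kernel, &results, i] {
            Workspace ws{1.0 + i, 2.0, 0.0};  // private to this thread
            kernel.forward(ws);
            results[i] = ws.z;  // each thread writes only its own slot
        });
    }
    for (std::thread& t : threads) t.join();
    return results;
}

}  // namespace sketch
```

No mutex or atomic appears in the hot path: the only cross-thread coordination is the final join.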
Memory Usage
Each thread requires:
- Workspace memory: Proportional to the maximum number of active variables in the recording
- Stack space: For adjoint kernel execution when AAD derivatives are required. This is part of the AADCWorkSpace object.
Plan memory allocation accordingly for multi-threaded scenarios.
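As a rough planning aid, the proportionality above can be turned into a back-of-the-envelope calculation. The sketch below is an estimate under explicit assumptions, not AADC's actual memory layout: it assumes one AVX2 vector slot (32 bytes, the size of __m256d) per active variable, and a second slot per variable for adjoints when derivatives are required. All function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Back-of-the-envelope estimate; the layout assumptions are not from AADC.
namespace sketch {

// Bytes in one AVX2 vector of doubles (4 lanes x 8 bytes), as in __m256d.
constexpr std::size_t kAvx2VectorBytes = 32;

// Per-thread estimate. Assumes one vector slot per active variable and,
// when derivatives are needed, one more slot for its adjoint.
constexpr std::size_t workspaceBytes(std::size_t max_active_variables,
                                     bool with_adjoints) {
    return max_active_variables * kAvx2VectorBytes * (with_adjoints ? 2 : 1);
}

// Total across threads, since every thread owns a separate workspace.
std::size_t totalBytes(std::size_t num_threads,
                       std::size_t max_active_variables,
                       bool with_adjoints) {
    assert(num_threads > 0);
    return num_threads * workspaceBytes(max_active_variables, with_adjoints);
}

}  // namespace sketch
```

Under these assumptions, a recording with 1,000,000 active variables and adjoints enabled needs about 64 MB per workspace, or about 512 MB for 8 threads. Measured figures for a real kernel should replace this estimate wherever they are available.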
Debugging Multi-threaded Issues
AADC’s per-thread recording isolation helps avoid many common multi-threading issues:
- No shared mutable state during recording: Each thread’s recording state is isolated
- Clear ownership: Each AADCFunctions instance belongs to one recording thread
- Deterministic graphs: The computational graph recorded by each thread is deterministic
This is a preview of the Multi-threading documentation.
The full documentation includes implementation details, debugging techniques, and reference test examples.