Interactive Benchmarks - Claude Documentation

This document provides mappings and instructions for updating benchmark data from execution results.

Quick Reference: Updating Benchmarks

```bash
# Regenerate benchmark_data.json after any changes
python3 scripts/update_benchmarks.py
```

Key Files:


Naming Convention

IMPORTANT: Always use “AADC Python” (not “Python AADC”)

| Model Name in Execution Log | Display Name | Notes |
|---|---|---|
| gbm_asian_AADC_scalar_py | AADC Python | Main Python implementation (scalar kernel v6.0.0) |
| gbm_asian_AADC_cpp | AADC C++ | C++ implementation |
| gbm_asian_optimised_safe_cpp | Optimised C++ | Hand-optimised C++ baseline |
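To enforce the convention programmatically, a lookup like the following could be used. This is a sketch, not the actual `update_benchmarks.py` code; the dict and helper names are illustrative:

```python
# Hypothetical mapping from execution-log model names to display names.
# Note: "AADC Python", never "Python AADC", per the naming convention above.
DISPLAY_NAMES = {
    "gbm_asian_AADC_scalar_py": "AADC Python",
    "gbm_asian_AADC_cpp": "AADC C++",
    "gbm_asian_optimised_safe_cpp": "Optimised C++",
}

def display_name(model_name: str) -> str:
    """Return the canonical display name, falling back to the raw log name."""
    return DISPLAY_NAMES.get(model_name, model_name)
```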

Actual vs Estimated Values

The system tracks whether data is actual (from execution log) or estimated (extrapolated):

Display Rules

  1. Performance Comparison table:

  2. Key Findings section:

  3. Missing implementations:

Implementation in update_benchmarks.py

```python
# Estimate detection
if not has_actual_data:
    impl["is_estimate"] = True
    impl["estimate_note"] = f"Estimated from {closest_trades}×{closest_scenarios}"
    impl["closest_trades"] = closest_trades
    impl["closest_scenarios"] = closest_scenarios
```

Dynamic Key Insights

Key Insights now update based on the selected configuration (no longer static).

Category-Specific Insights Generated

| Category | Insights |
|---|---|
| aad_tools | AADC vs Enzyme-AD/CoDiPack speedups, open-source alternative recommendation |
| ml_libraries | AADC vs JAX/PyTorch speedups, JIT overhead warning |
| python_implementations | AADC vs NumPy/Basic Python speedups |
| languages | AADC C++ vs Optimised C++, Python vs C++ ratio |

Example Output (at 10×100K)

```
Performance: AADC Python (668.7 ms) is 9x faster than Enzyme-AD (6.27s)
             and 13x faster than CoDiPack (8.85s) at 100K scenarios

Open-Source Choice: If commercial AADC isn't an option, Enzyme-AD (6.27s)
                    is the fastest open-source alternative but 9x slower than AADC.
```

Kernel Call Explanation

AADC C++ and AADC Python have very different batch models, so the component now shows separate explanations:

| Implementation | Batch Size | Reason |
|---|---|---|
| AADC Python | 65,536 scenarios | Large NumPy-style batches |
| AADC C++ | 4-8 scenarios | AVX vector width (AVX2=4, AVX512=8) |

Example at 1000 trades × 100K scenarios:


CRITICAL: Total Evaluation Time is the Production Metric

The total_eval_time_sec column is the MOST IMPORTANT metric for production benchmarks.

```
total_eval_time_sec = eval_time_sec + recording_time_sec
```

Why This Matters

| Implementation | eval_time | recording_time | total_eval_time |
|---|---|---|---|
| AADC Python | 2.7s | 0.15s | 2.85s |
| JAX | 6.5s | 12.5s | 19.0s |

Key Insight: JAX’s eval_time (6.5s) is only 2.4× slower than AADC, but total_eval_time is 6.7× slower (19s vs 2.85s) due to JIT compilation overhead.

Recording/Compilation Happens on Every Cold Start

Small Benchmarks Hide the Problem

At 10 trades × 1K scenarios, JAX’s 12.5s compilation might seem acceptable. But this isn’t production-representative. In production:

Always use total_eval_time_sec for production performance comparisons. Always show both eval_time and total_eval_time in benchmark displays.
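The arithmetic behind the JAX-vs-AADC comparison above can be sketched in a few lines (numbers taken from the table in this section; the dict layout is illustrative):

```python
# Why total_eval_time_sec changes the ranking (times in seconds,
# from the AADC Python vs JAX example above).
aadc = {"eval": 2.7, "recording": 0.15}
jax = {"eval": 6.5, "recording": 12.5}   # recording = JIT compile time

def total_eval(t):
    """total_eval_time_sec = eval_time_sec + recording_time_sec."""
    return t["eval"] + t["recording"]

eval_ratio = jax["eval"] / aadc["eval"]            # ~2.4x on eval_time alone
total_ratio = total_eval(jax) / total_eval(aadc)   # ~6.7x on total_eval_time
```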


Important: Execution Log Column Mapping

Before using execution log data, be aware that different model types have different column counts:

| Model Type | Columns | Issue |
|---|---|---|
| Python AADC, NumPy, etc. | 39-40 | ✓ Correct |
| Julia, Haskell | 37 | Missing recording_time, kernel_execution_time |
| Old C++ (adept, codipack, etc.) | 34 | Missing timing and memory breakdown fields |

Always use execution_log_clean_16Jan.csv which has been corrected with fix_column_mapping.py. See /home/natashamanito/Asian Options Benchmark/AI Asian Pricer/data/CLAUDE.md for full details.

New calculated column: total_eval_time_sec = eval_time_sec + recording_time_sec
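A minimal sketch of deriving that column when parsing the log rows. The sample rows are illustrative (in practice they come from execution_log_clean_16Jan.csv); the key point is that implementations without a recording phase have no recording_time_sec, which should count as zero:

```python
# Illustrative rows; in practice parse execution_log_clean_16Jan.csv.
rows = [
    {"model_name": "gbm_asian_AADC_scalar_py", "eval_time_sec": 2.7,
     "recording_time_sec": 0.15},
    {"model_name": "gbm_asian_numpy", "eval_time_sec": 45.0,
     "recording_time_sec": None},  # missing for non-recording implementations
]
for row in rows:
    # A missing recording time counts as zero, so total_eval == eval_time there.
    row["total_eval_time_sec"] = (
        row["eval_time_sec"] + (row["recording_time_sec"] or 0.0)
    )
```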

Data Sources

There are two data files that must be kept in sync:

  1. Scatter Plot Data: src/content/data/benchmark-scatter-data.md

  2. Asian Option Benchmark Stats: src/content/resources/benchmarks/asian-option-monte-carlo.md

Field Mappings

CSV Execution Fields → scatter-data.md Properties

| CSV Field | scatter-data.md Property | Notes |
|---|---|---|
| eval_time_sec | greeks_time | Evaluation time (EXCLUDES recording for AADC) |
| recording_time_sec | recording_time | Kernel recording time (AADC only) |
| steady_state_time_sec | steady_state_time | Steady-state eval time (all kernel calls) |
| total_eval_time_sec | N/A (calculated in component) | eval_time + recording_time |
| memory_mb | memory | Peak memory usage |
| model_total_lines | code_lines | Total lines changed for integration |
| recording_time_sec or kernel_execution_time_sec | compile_time | JIT/kernel compilation time (for scatter plot) |
| N/A (count Greeks: delta, rho, vega) | num_greeks | Usually 3 (Delta, Rho, Vega) |

Important Timing Relationship:

```
For JIT/recording implementations (AADC, JAX, Enzyme):
  greeks_time (eval_time) = first_eval + steady_state   (excludes recording/JIT)
  total_eval_time         = greeks_time + recording_time (or compile_time)

For traditional implementations (NumPy, PyTorch, Basic Python, etc.):
  greeks_time = total evaluation time
```

Note: The component uses both recording_time and compile_time fields interchangeably. Implementations with JIT compilation (JAX, Enzyme) have compile_time in scatter data.
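The relationship above can be expressed as a small helper. This is a sketch, not the component's actual code; field names follow the scatter-data.md mapping in the table above:

```python
def total_eval_time(impl: dict) -> float:
    """Total evaluation time in seconds.

    JIT/recording implementations (AADC, JAX, Enzyme) pay a one-time
    recording/compile cost on top of greeks_time; traditional
    implementations have no such cost, so the extra term is zero.
    """
    compile_cost = impl.get("recording_time") or impl.get("compile_time") or 0.0
    return impl["greeks_time"] + compile_cost
```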

CSV Execution Fields → asian-option-monte-carlo.md benchmarkStats

| CSV Field | benchmarkStats Property | Notes |
|---|---|---|
| eval_time_sec (AADC) | aadcTime | Format as “Xs” or “~X hours” |
| eval_time_sec (C++) | cppTime | Format as “Xs” or “~X hours” |
| eval_time_sec (Basic Python) | basicPythonTime | Format as “~X hours” |
| eval_time_sec (NumPy) | numpyTime | Format as “~X hours” |
| model_total_lines (AADC) | aadcLines | Raw line count |
| model_total_lines (Basic Python) | basicPythonLines | Raw line count |
| model_total_lines (AADC) - model_total_lines (Basic Python) | linesAdded | Format as “+X” |

Configuration Fields

| CSV Field | benchmarkStats Property |
|---|---|
| num_trades | Part of testConfig |
| num_scenarios | Part of testConfig |
| num_timesteps | Part of testConfig |
| num_threads | threads |

testConfig Format: "{num_trades} trades × {num_scenarios/1000}K scenarios × {num_timesteps} timesteps"
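A sketch of building that string (the function name and the 252-timestep example value are illustrative):

```python
def format_test_config(num_trades: int, num_scenarios: int,
                       num_timesteps: int) -> str:
    """Build the testConfig string, e.g. '1000 trades × 500K scenarios × 252 timesteps'."""
    return (f"{num_trades} trades × {num_scenarios // 1000}K scenarios "
            f"× {num_timesteps} timesteps")
```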


Computed Metrics (Formulas)

Speedup Calculations

```python
# Speedup vs baseline
speedup = baseline_time / implementation_time

# Format: "Xx" (e.g., "420x")
aadcVsBasicPython = basicPythonTime / aadcTime
aadcVsNumpy = numpyTime / aadcTime
aadcGreeksVsCpp = cppGreeksTime / aadcGreeksTime
cppVsBasicPython = basicPythonTime / cppTime
numpyVsBasicPython = basicPythonTime / numpyTime
```

Greeks Overhead Calculation

```python
# Greeks overhead = how much slower computing Greeks is vs a price-only run
# Format: "+X%" (e.g., "+42%")

overhead_pct = ((greeks_time - priceonly_time) / priceonly_time) * 100

# Example:
# AADC: priceonly = 67s, greeks = 95s
# overhead = ((95 - 67) / 67) * 100 = 41.8% → "+42%"
```
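As a runnable sketch of the same calculation plus the “+X%” formatting (the helper name is illustrative):

```python
def greeks_overhead(priceonly_time: float, greeks_time: float) -> str:
    """Format the Greeks overhead vs a price-only run as '+X%'."""
    pct = (greeks_time - priceonly_time) / priceonly_time * 100
    return f"+{round(pct)}%"
```
For the example above, `greeks_overhead(67, 95)` yields `"+42%"`.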

Progress Bar Width (InteractiveBenchmark.astro)

The progress bar uses a logarithmic scale to handle wide time ranges:

```js
// All times in the same unit (seconds)
const logMin = Math.log10(Math.max(minTime, 0.001));
const logMax = Math.log10(Math.max(maxTime, 0.001));
const logVal = Math.log10(Math.max(currentTime, 0.001));
const logRange = logMax - logMin;

// Colored bar represents time: slowest = 100% fill, fastest = smallest fill
const barWidthPercent = logRange > 0
  ? ((logVal - logMin) / logRange) * 100
  : 50;
```

Speedup vs Baseline (Table Column)

```python
# Baseline = slowest implementation in the group (typically Basic Python or autodiff)
speedup = slowest_time / implementation_time

# Format: "X.Xx" (e.g., "420.0x", "1.0x" for baseline)
```
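The same rule as a runnable sketch (function name illustrative):

```python
def speedup_label(slowest_time: float, impl_time: float) -> str:
    """Format speedup vs the slowest implementation as 'X.Xx' ('1.0x' for baseline)."""
    return f"{slowest_time / impl_time:.1f}x"
```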

Scatter Plot Data Structure

Each implementation in benchmark-scatter-data.md:

```yaml
- id: aadc_python              # Unique identifier
  name: "AADC Python"          # Display name
  shortName: "Py AADC"         # Short name for legends
  category: "ml-libraries"     # One of: aad-tools, ml-libraries, python-implementations, languages
  color: "#22d3ee"             # Hex color for chart
  language: "Python"           # Programming language

  # Metrics (from CSV)
  greeks_time: 0.348           # eval_time_sec (seconds)
  memory: 52                   # memory_mb
  code_lines: 176              # model_total_lines or delta from baseline
  compile_time: 0.095          # recording_time_sec or kernel_execution_time_sec
  num_greeks: 3                # Number of Greeks computed

  recommended: true            # Optional: highlight as recommended
```

Page-Specific Categories

aad-tools.astro

ml-libraries.astro

python-implementations.astro

languages.astro

production.astro


Updating accelerate-python-models

The resources/publications/accelerate-python-models.astro page should use the same data as asian-option-monte-carlo.md. It currently does so via:

```js
const benchmark = await getEntry('resources', 'benchmarks/asian-option-monte-carlo');
const benchmarkStats = benchmark?.data?.benchmarkStats;
```

When benchmark data changes:

  1. Update asian-option-monte-carlo.md with new benchmarkStats:

  2. Update benchmark-scatter-data.md with new metrics:

  3. The accelerate-python-models page will automatically reflect the changes because it reads from asian-option-monte-carlo.md.

Displayed Metrics on accelerate-python-models

| Display Location | benchmarkStats Field | Format |
|---|---|---|
| Hero headline | aadcVsBasicPython | “420x” |
| Key findings | aadcVsBasicPython, linesAdded, aadcGreeksVsCpp | “420x speedup with +176 lines” |
| Performance table | basicPythonLines, numpyLines, aadcLines, cppLines | Raw numbers |
| Performance table | basicPythonTime, numpyTime, aadcTime, cppTime | “~X hours” or “Xs” |
| Comparison box | aadcGreeksVsCpp, bumpOverhead, aadOverhead | “5.4x”, “+582%” |
| Scale note | testConfig, threads | “1000 trades × 500K scenarios…” |
| FAQ answers | Various speedups and overheads | Dynamic substitution |

Step-by-Step: Processing New Benchmark Results

1. Parse CSV Execution Log

Extract these fields for each implementation:

```
model_name, language, eval_time_sec, memory_mb, model_total_lines,
recording_time_sec, num_trades, num_scenarios, num_timesteps, num_threads
```

2. Calculate Derived Metrics

```python
# For each implementation pair
speedup = baseline_time / impl_time
overhead = ((greeks_time - price_time) / price_time) * 100
lines_added = aadc_lines - baseline_lines
```
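A runnable sketch of step 2 as a single helper. The input values in the test below are illustrative, chosen to mirror the “420x” / “+42%” / “+176 lines” figures quoted elsewhere in this document; the raw line counts are hypothetical:

```python
def derived_metrics(baseline_time: float, impl_time: float,
                    greeks_time: float, price_time: float,
                    aadc_lines: int, baseline_lines: int) -> dict:
    """Compute the derived display metrics for one implementation pair."""
    return {
        "speedup": f"{baseline_time / impl_time:.0f}x",
        "overhead": f"+{(greeks_time - price_time) / price_time * 100:.0f}%",
        "lines_added": f"+{aadc_lines - baseline_lines}",
    }
```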

3. Update benchmark-scatter-data.md

```yaml
implementations:
  - id: aadc_python
    greeks_time: <eval_time_sec>
    memory: <memory_mb>
    code_lines: <model_total_lines>
    compile_time: <recording_time_sec>
```

4. Update asian-option-monte-carlo.md

```yaml
benchmarkStats:
  aadcVsBasicPython: "<calculated>x"
  aadcTime: "<eval_time_sec>s"
  linesAdded: "+<calculated>"
  # ... etc
```

5. Verify accelerate-python-models Updates

Run npm run build and verify:


Validation Checklist

When updating benchmark data:


Timing Breakdown Explanation

AADC / Kernel-Recording Implementations

For AADC and similar kernel-recording implementations, the timing fields have specific meanings:

| CSV Field | Display Name | Meaning |
|---|---|---|
| recording_time_sec | Recording | One-time cost to build the computation graph (kernel). Happens once per model. |
| eval_time_sec (greeks_time) | Evaluation | Time to run the recorded kernel. This excludes recording time. |
| (calculated) | Cold Start | recording_time + eval_time. Total time if starting fresh without a recorded kernel. |

IMPORTANT: eval_time_sec (displayed as “Greeks Time” in main table) does NOT include recording time. It measures only the kernel evaluation.

Key Insight:

Per-Kernel Execution Time Calculation

The “Per Kernel” column shows the time for a SINGLE kernel execution. This is calculated as:

```
Per Kernel = steady_state_time / num_kernel_calls_steady_state
```

Where num_kernel_calls depends on the implementation:

| Implementation | Batch Size | Kernel Calls Calculation |
|---|---|---|
| AADC Python | scenario_batch_size (65536) | (num_trades - 1) × ceil(num_scenarios / batch_size) |
| AADC C++ | vector_size (4=AVX2, 8=AVX512) | (num_trades - 1) × ceil(num_scenarios / vector_size) |
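These formulas can be checked with a short sketch; the numbers reproduce the 10 trades × 1K scenarios example in this section (function names are illustrative):

```python
import math

def kernel_calls(num_trades: int, num_scenarios: int, batch_size: int) -> int:
    """Steady-state kernel calls: (num_trades - 1) × ceil(num_scenarios / batch_size)."""
    return (num_trades - 1) * math.ceil(num_scenarios / batch_size)

def per_kernel_time(steady_state_sec: float, calls: int) -> float:
    """Time for a single kernel execution, in seconds."""
    return steady_state_sec / calls

# 10 trades × 1K scenarios:
cpp_calls = kernel_calls(10, 1000, 8)      # AVX512 vector width → 9 × 125 = 1125
py_calls = kernel_calls(10, 1000, 65536)   # one NumPy-style batch → 9 × 1 = 9
```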

Example (10 trades × 1K scenarios):

| Implementation | Batch Size | Kernel Calls (steady) | Steady State | Per Kernel |
|---|---|---|---|---|
| AADC C++ (AVX512) | 8 | 9 × 125 = 1,125 | 12.8 ms | 11.4 μs |
| AADC Python | 65536 | 9 × 1 = 9 | 35.5 ms | 3.94 ms |

Key insight: C++ AADC processes 8 scenarios per kernel call (AVX512), while Python AADC processes up to 65536 scenarios per call. This explains the large difference in per-kernel time but similar total evaluation time.

Full timing breakdown (Python AADC at 10 trades × 1K scenarios):

Traditional Implementations

For non-kernel-recording implementations:

| CSV Field | Display Name | Meaning |
|---|---|---|
| first_run_time_sec | JIT Warmup | First-run time including JIT compilation (if applicable) |
| steady_state_time_sec | Evaluation | Steady-state per-run evaluation time |
| eval_time_sec | Total Eval Time | Total wall-clock time |

Traditional implementations re-compute everything on each evaluation, so there is no amortization benefit.


Additional CSV Fields (Reference)

These fields are available but not currently mapped:

| CSV Field | Potential Use |
|---|---|
| portfolio_value | Validation - should match across implementations |
| avg_option_value | Validation |
| avg_delta, avg_rho, avg_vega | Validation - should match across implementations |
| first_run_time_sec | Cold start time (JIT warmup) |
| steady_state_time_sec | Warmed-up performance |
| data_memory_mb, kernel_memory_mb | Memory breakdown |
| batch_size | Vectorization info |
| total_generic_ops, total_exp_ops, etc. | Operation counts for complexity analysis |
| model_math_lines | Lines with mathematical operations |

File Locations Summary

```
src/
├── content/
│   ├── data/
│   │   └── benchmark-scatter-data.md     # Scatter plot metrics
│   └── resources/
│       ├── benchmarks/
│       │   └── asian-option-monte-carlo.md  # benchmarkStats (single source)
│       └── articles/
│           └── accelerate-python-models.md  # References benchmarkStats
├── pages/
│   ├── technology/
│   │   └── benchmarks/
│   │       └── interactive-benchmarks/
│   │           ├── CLAUDE.md (this file)
│   │           ├── aad-tools.astro
│   │           ├── ml-libraries.astro
│   │           ├── python-implementations.astro
│   │           ├── languages.astro
│   │           └── production.astro
│   └── resources/
│       └── publications/
│           └── accelerate-python-models.astro  # Uses benchmarkStats
└── components/
    └── benchmark/
        └── BenchmarkScatterPlot.astro    # Uses scatter-data.md
```