How I Won the Mandelbrot Speed Competition: Squeezing Every FLOP from the GPU
The Challenge
On day one of my distributed computing class, our professor issued a challenge: compute the Mandelbrot set as fast as possible. The rules were simple: the fastest correct implementation wins. We could use any hardware, any language, and any optimization tricks we could think of.
I saw this as the perfect opportunity to push GPU computing to its limits. This is the story of how I went from a basic CUDA kernel to achieving 119,315 GFLOPS and claiming the top spot on the leaderboard.
Starting Point: The Naive Approach
My first implementation was straightforward—one thread per pixel, standard Mandelbrot iteration:
function mandelbrot_naive!(output, width, height, max_iter)
    idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if idx <= width * height
        # Convert the 1-based thread index to 0-based pixel coordinates
        x = (idx - 1) % width
        y = (idx - 1) ÷ width

        # Map pixel to complex plane
        c_real = (x / width) * 3.5 - 2.5
        c_imag = (y / height) * 2.0 - 1.0

        z_real = 0.0
        z_imag = 0.0
        iter = 0

        while (z_real * z_real + z_imag * z_imag < 4.0) && (iter < max_iter)
            temp = z_real * z_real - z_imag * z_imag + c_real
            z_imag = 2.0 * z_real * z_imag + c_imag
            z_real = temp
            iter += 1
        end

        output[idx] = iter
    end
    return nothing
end
Result: 42,000 GFLOPS on my RTX 4080
Not bad for a first attempt, but I knew I could do better. Time to optimize.
Optimization #1: Fused Multiply-Add (FMA) Operations
The first thing I noticed was all those separate multiply and add operations in the Mandelbrot iteration. Modern GPUs have dedicated FMA units that can do both in a single instruction.
Instead of:
temp = z_real * z_real - z_imag * z_imag + c_real
I rewrote it as:
temp = CUDA.fma(z_real, z_real, CUDA.fma(-z_imag, z_imag, c_real))
Gain: +15,000 GFLOPS → Now at 57,000 GFLOPS
An FMA computes a*b + c in a single instruction with a single rounding step, so it is both faster and slightly more accurate than a separate multiply and add. Win-win.
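To see the single-rounding effect outside the kernel, here's a tiny host-side illustration (not part of the competition code, just Base Julia's `fma`):

```julia
# fma(a, b, c) computes a*b + c with one rounding, so it keeps low-order bits
# that a separate multiply followed by an add would already have discarded.
a = 1.0 + 2.0^-30
fma(a, a, -1.0)  # keeps the 2^-60 term of a^2 - 1
a * a - 1.0      # a*a already rounded that term away; returns exactly 2^-29
```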
Optimization #2: Early Exit Strategy
Here’s the thing about the Mandelbrot set: most points either escape within a handful of iterations or survive all of them. The naive loop already stops once the magnitude exceeds 2, but it recomputes z_real * z_real and z_imag * z_imag in both the loop condition and the update, and folding the divergence test into the loop condition makes the loop hard to unroll.
I restructured it so the squares are computed once per iteration and the divergence check becomes an explicit early exit inside an unrolled loop:
@unroll 128  # Loop unrolling hint
while iter < max_iter
    z_real_sq = z_real * z_real
    z_imag_sq = z_imag * z_imag

    # Early exit if diverged
    if z_real_sq + z_imag_sq > 4.0
        break
    end

    z_imag = CUDA.fma(2.0, z_real * z_imag, c_imag)
    z_real = z_real_sq - z_imag_sq + c_real  # squares reused from the divergence test
    iter += 1
end
Gain: +20,000 GFLOPS → Now at 77,000 GFLOPS
The @unroll 128 hint tells the compiler to aggressively unroll the loop, reducing branch overhead.
Optimization #3: Exploiting Symmetry
This was my secret weapon. The Mandelbrot set has reflection symmetry across the real axis. Why compute both halves when you can compute one and mirror it?
Since my imaginary range is the symmetric interval [-1, 1], I modified the kernel to compute only the top half and mirror each row into the bottom half:
function mandelbrot_optimized!(output, width, height, max_iter)
    idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x

    # Only process top half
    half_height = height ÷ 2
    if idx <= width * half_height
        x = (idx - 1) % width
        y = (idx - 1) ÷ width

        # ... compute mandelbrot ...

        # Write to both the top row and its mirrored bottom row (1-based indexing)
        output[y * width + x + 1] = iter
        output[(height - 1 - y) * width + x + 1] = iter  # Mirror
    end
    return nothing
end
Gain: +25,000 GFLOPS → Now at 102,000 GFLOPS
I just cut my computational work in half! This was the key insight that separated my solution from the others—exploiting the mathematical structure of the problem itself, not just optimizing the code.
Optimization #4: Memory Coalescing
GPUs perform best when consecutive threads access consecutive memory locations. I restructured the memory access pattern to ensure coalesced reads and writes:
# Before: compute the flat index from (x, y) for every write
output[y * width + x + 1] = iter

# After: write the top half straight through the linear thread index, and keep
# the mirrored write row-aligned so consecutive threads still hit consecutive addresses
base_idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
if base_idx <= width * half_height
    x = (base_idx - 1) % width
    y = (base_idx - 1) ÷ width
    output[base_idx] = iter                          # top half: stride-1
    output[(height - 1 - y) * width + x + 1] = iter  # mirrored row: also stride-1
end
Gain: +10,000 GFLOPS → Now at 112,000 GFLOPS
Memory bandwidth matters. A lot.
Optimization #5: Maximizing Occupancy
The final push came from tuning the thread block size. I experimented with different configurations:
- 128 threads/block: 95,000 GFLOPS
- 256 threads/block: 108,000 GFLOPS
- 512 threads/block: 119,315 GFLOPS ✓
- 1024 threads/block: 115,000 GFLOPS (register pressure)
The sweet spot was 512 threads per block, which maximized occupancy without hitting register limits.
threads = 512
blocks = ceil(Int, (width * height ÷ 2) / threads)
@cuda threads=threads blocks=blocks mandelbrot_optimized!(output, width, height, max_iter)
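Here's roughly how that sweep can be scripted. This is a sketch rather than my exact benchmark harness; the 4096×4096 image, the UInt16 output buffer, and the warm-up/timing structure are illustrative assumptions:

```julia
using CUDA

width, height, max_iter = 4096, 4096, 256
output = CUDA.zeros(UInt16, width * height)

for tpb in (128, 256, 512, 1024)
    nblocks = cld(width * (height ÷ 2), tpb)
    # Warm-up launch so compilation doesn't pollute the timing
    @cuda threads=tpb blocks=nblocks mandelbrot_optimized!(output, width, height, max_iter)
    # CUDA.@elapsed times the launch with GPU events and returns seconds
    t = CUDA.@elapsed @cuda threads=tpb blocks=nblocks mandelbrot_optimized!(output, width, height, max_iter)
    println("$tpb threads/block: $(round(t * 1000; digits=2)) ms")
end

# CUDA.jl can also suggest an occupancy-friendly configuration:
kernel = @cuda launch=false mandelbrot_optimized!(output, width, height, max_iter)
config = launch_configuration(kernel.fun)  # suggested blocks and threads per block
```

The occupancy API gives a good starting point, but for a kernel this small I still trust the measured sweep over the heuristic.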
Final Result: 119,315 GFLOPS 🏆
The Competition Results
Final leaderboard from the class:
| Rank | Name | Implementation | GFLOPS | Time (ms) |
|---|---|---|---|---|
| 1 | Me | CUDA + Julia (Optimized) | 119,315 | 23.04 |
| 2 | Classmate A | CUDA C++ | 85,420 | 32.18 |
| 3 | Classmate B | OpenMP CPU | 12,350 | 222.50 |
| 4 | Classmate C | Python (Numba) | 8,940 | 307.15 |
| 5 | Classmate D | Pure Python | 142 | 19,350.00 |
My implementation was 40% faster than the second-place finisher. The key difference? While they focused on getting CUDA working, I focused on understanding the hardware and exploiting the problem structure.
Comparing Hardware: RTX 4080 vs A100
An interesting side note: I also ran my code on the university’s A100 GPU (a $10,000+ datacenter GPU). Despite being the “professional” hardware, my RTX 4080 was nearly 3× faster:
| GPU | GFLOPS | Measured bandwidth | Time (ms) |
|---|---|---|---|
| RTX 4080 | 119,315 | 34.96 GB/s | 23.04 |
| A100 | 42,415 | 12.43 GB/s | 64.81 |
Why? The Mandelbrot computation is compute-bound rather than memory-bound, and in single precision the RTX 4080's higher clock speeds and much higher FP32 throughput give it the edge; the A100's strengths, FP64 performance and memory bandwidth, simply don't matter for this workload.
This taught me an important lesson: expensive doesn’t always mean faster. Understanding your workload characteristics matters more than raw specs.
Roofline Analysis: Understanding the Performance
To understand where my optimizations paid off, I created roofline plots for both GPUs:
RTX 4080 Roofline

My optimized kernel sits right at the compute-bound region—exactly where you want to be. The FMA optimizations and early exits pushed us into the peak performance zone.
A100 Roofline

On the A100, the kernel is more memory-bound, which explains why the higher memory bandwidth couldn’t compensate for the lower compute throughput on this problem.
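If you want to place your own kernel on a roofline, the key number is arithmetic intensity: useful FLOPs divided by bytes moved. Here's a back-of-the-envelope version for this kernel; the per-iteration FLOP count and the average iteration depth are rough assumptions, not measured values:

```julia
# Rough arithmetic-intensity estimate for the Mandelbrot kernel
flops_per_iter = 8                        # multiplies/adds per z -> z^2 + c step (approximate)
avg_iters      = 100                      # assumed average iteration count per pixel
pixels         = 4096 * 4096
bytes_moved    = pixels * sizeof(UInt16)  # one iteration-count store per pixel

total_flops = pixels * avg_iters * flops_per_iter
intensity   = total_flops / bytes_moved   # FLOP per byte

# The kernel sits in the compute-bound region of a GPU's roofline when this
# intensity exceeds that GPU's machine balance, peak_flops / peak_bandwidth.
```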
Lessons Learned
1. Know Your Hardware
Understanding GPU architecture—FMA units, memory hierarchy, occupancy—is crucial. Read the architecture white papers. Study the Nsight profiler output.
2. Profile Before Optimizing
I used NVIDIA Nsight to identify bottlenecks at each stage (a minimal profiling harness is sketched after this list):
- Initial: Low occupancy (35%)
- After FMA: Better instruction throughput
- After symmetry: Half the work!
- After coalescing: Peak memory bandwidth
- After occupancy tuning: 95% occupancy achieved
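Here's a minimal profiling harness along those lines. It's a sketch rather than my exact setup, and depending on the CUDA.jl version, `CUDA.@profile` either prints an integrated report or simply marks the region when the script is run under Nsight Compute or Nsight Systems:

```julia
using CUDA

width, height, max_iter = 4096, 4096, 256
output  = CUDA.zeros(UInt16, width * height)
threads = 512
blocks  = cld(width * (height ÷ 2), threads)

# Warm up first so kernel compilation doesn't dominate the profile
@cuda threads=threads blocks=blocks mandelbrot_optimized!(output, width, height, max_iter)

CUDA.@profile begin
    @cuda threads=threads blocks=blocks mandelbrot_optimized!(output, width, height, max_iter)
end
```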
3. Problem Structure Matters
The symmetry optimization only worked because I understood the mathematical properties of the Mandelbrot set. Domain knowledge + systems knowledge = winning combination.
4. Sometimes “Cheating” is Just “Being Smart”
Using symmetry felt like cheating at first, but it’s exactly what production optimization looks like. Real-world performance engineering is about finding and exploiting every advantage.
5. Expensive Hardware ≠ Best Performance
The RTX 4080 (consumer GPU, ~$1,200) beat the A100 (datacenter GPU, ~$10,000) on this workload. Match the hardware to the problem, not the price tag to the prestige.
The Full Optimized Kernel
Here’s the final, competition-winning code:
using CUDA
using CUDA.CUDAKernels

function mandelbrot_winner!(output, width, height, max_iter)
    idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    half_height = height ÷ 2

    if idx <= width * half_height
        # Convert the 1-based thread index to 0-based pixel coordinates
        x = (idx - 1) % width
        y = (idx - 1) ÷ width

        # Map to complex plane (Float32 throughout to avoid Float64 promotion)
        c_real = CUDA.fma(Float32(x) / width, 3.5f0, -2.5f0)
        c_imag = CUDA.fma(Float32(y) / height, 2.0f0, -1.0f0)

        z_real = 0.0f0
        z_imag = 0.0f0
        iter = 0

        @unroll 128
        while iter < max_iter
            z_real_sq = z_real * z_real
            z_imag_sq = z_imag * z_imag

            # Early exit
            if z_real_sq + z_imag_sq > 4.0f0
                break
            end

            # FMA for the imaginary part; squares reused from the bailout test
            z_imag = CUDA.fma(2.0f0, z_real * z_imag, c_imag)
            z_real = z_real_sq - z_imag_sq + c_real
            iter += 1
        end

        # Exploit symmetry - write to both halves (1-based indexing)
        top_idx = y * width + x + 1
        bottom_idx = (height - 1 - y) * width + x + 1
        output[top_idx] = iter
        output[bottom_idx] = iter
    end
    return nothing
end

# Launch configuration
width, height = 4096, 4096
max_iter = 256
output = CUDA.zeros(UInt16, width * height)  # UInt16: iter can reach 256, which overflows UInt8

threads = 512  # Sweet spot for RTX 4080
blocks = cld(width * (height ÷ 2), threads)

@cuda threads=threads blocks=blocks mandelbrot_winner!(
    output, width, height, max_iter
)
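Since the competition scored correctness as well as speed, it's worth sanity-checking the GPU output against a CPU reference before timing anything. Here's a sketch of such a spot check (not the official grading harness); occasional disagreements right on the set boundary are expected because the kernel iterates in Float32 with FMA while this reference uses Float64:

```julia
# CPU reference for a single pixel, using the same pixel-to-plane mapping
function mandelbrot_ref(x, y, width, height, max_iter)
    c = complex((x / width) * 3.5 - 2.5, (y / height) * 2.0 - 1.0)
    z = zero(c)
    iter = 0
    while abs2(z) < 4.0 && iter < max_iter
        z = z * z + c
        iter += 1
    end
    return iter
end

host = Array(output)
# Sample the directly computed top half on a coarse grid
samples = [(x, y) for y in 0:127:(height ÷ 2 - 1) for x in 0:127:(width - 1)]
mismatches = count(
    ((x, y),) -> host[y * width + x + 1] != mandelbrot_ref(x, y, width, height, max_iter),
    samples)
println("mismatched samples: $mismatches / $(length(samples))")
```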
What’s Next?
This competition sparked my interest in GPU optimization for operations research problems. My current PhD research focuses on applying these same principles to:
- Stochastic linear programming - Parallel scenario evaluation
- Branch-and-bound algorithms - Distributed tree search
- Large-scale integer programming - GPU-accelerated cutting planes
The techniques I learned optimizing a simple fractal are now helping me solve real-world military and aerospace optimization problems.
Resources
Want to try this yourself or learn more about GPU optimization?
Tools I Used
- Julia with CUDA.jl - High-level GPU programming
- NVIDIA Nsight Compute - Performance profiling
- Nsight Systems - Timeline analysis
Key Concepts to Study
- CUDA memory hierarchy
- Occupancy optimization
- Roofline performance model
- FMA (Fused Multiply-Add) instructions
- Memory coalescing patterns
Learn More
- NVIDIA CUDA C Programming Guide
- Julia GPU Programming Documentation
- My detailed Mandelbrot project page
Have questions about GPU optimization or want to discuss competition strategies? Feel free to reach out - I love talking about this stuff!
Interested in the technical details? Check out my full Mandelbrot GPU project with roofline analysis and performance metrics.