Accelerating an SRCNN on an FPGA


Image super-resolution is the process of reconstructing a high-resolution (HR) image from a low-resolution (LR) counterpart. This post details an effort to accelerate the Super-Resolution Convolutional Neural Network (SRCNN), a three-layer architecture proposed by Dong et al., on an FPGA. The primary goal was to maximize throughput and minimize latency while respecting the hardware’s resource constraints.

The SRCNN Architecture

The SRCNN pipeline begins by upscaling an LR input image using bicubic interpolation. This upscaled image is then fed into a three-layer CNN to refine details and produce the final HR output. Each layer serves a distinct purpose:

  1. Feature Extraction: A 9x9 convolution extracts low-level features from the input image.
  2. Non-linear Mapping: A 1x1 convolution non-linearly maps each feature vector onto another feature space, capturing more complex relationships.
  3. Reconstruction: A 5x5 convolution aggregates these features to reconstruct the final HR image.

The first two layers use a ReLU activation function to introduce non-linearity.
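
For reference, the three layers implement the mapping defined by Dong et al., where Y is the bicubic-upscaled input, * denotes convolution, and Wi, Bi are each layer's filters and biases:

  F1(Y) = max(0, W1 * Y + B1)      (9x9 feature extraction)
  F2(Y) = max(0, W2 * F1(Y) + B2)  (1x1 non-linear mapping)
  F(Y)  = W3 * F2(Y) + B3          (5x5 reconstruction)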

(Diagram: the SRCNN pipeline, from bicubic upscaling through the three convolutional layers.)

The Acceleration Journey: An Iterative Approach

Our acceleration strategy centered on Output Stationary (OS) Tiled Convolution. Tiling is crucial on FPGAs because it breaks large convolutions into smaller, manageable chunks whose working set fits in on-chip BRAM, minimizing costly off-chip DRAM accesses; in an output-stationary schedule, each output tile stays resident in local buffers while the corresponding inputs and weights are streamed past it. We progressed through several design iterations.
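
As a rough illustration of the idea, a single output-stationary tiled layer looks something like the sketch below. The single-channel simplification, tile size, and loop order are placeholders for exposition, not our actual kernel.

```cpp
// Minimal single-channel sketch of an output-stationary tiled convolution.
// The real SRCNN layers have many input/output channels; TILE and the loop
// order here are illustrative only.
#include <algorithm>

constexpr int TILE = 32;   // output tile edge (assumption)
constexpr int K    = 9;    // first-layer kernel size
constexpr int PAD  = K / 2;

void conv_layer_tiled(const float *in, const float *w, float *out,
                      int H, int W) {
    for (int ty = 0; ty < H; ty += TILE) {
        for (int tx = 0; tx < W; tx += TILE) {
            float in_buf[TILE + K - 1][TILE + K - 1]; // input tile + halo (BRAM)
            float out_buf[TILE][TILE];                // output tile stays on chip

            // 1. Load the input tile plus its halo from off-chip memory,
            //    clamping reads at the image border.
            for (int y = 0; y < TILE + K - 1; ++y)
                for (int x = 0; x < TILE + K - 1; ++x) {
                    int sy = std::min(std::max(ty + y - PAD, 0), H - 1);
                    int sx = std::min(std::max(tx + x - PAD, 0), W - 1);
                    in_buf[y][x] = in[sy * W + sx];
                }

            // 2. Accumulate the K x K convolution entirely from on-chip data.
            for (int y = 0; y < TILE; ++y) {
                for (int x = 0; x < TILE; ++x) {
#pragma HLS PIPELINE II=1
                    float acc = 0.0f;
                    for (int ky = 0; ky < K; ++ky)
                        for (int kx = 0; kx < K; ++kx)
                            acc += w[ky * K + kx] * in_buf[y + ky][x + kx];
                    out_buf[y][x] = acc;
                }
            }

            // 3. Write the finished tile back to off-chip memory once.
            for (int y = 0; y < TILE && ty + y < H; ++y)
                for (int x = 0; x < TILE && tx + x < W; ++x)
                    out[(ty + y) * W + (tx + x)] = out_buf[y][x];
        }
    }
}
```

The defining property is that each output element is written back exactly once after accumulating entirely on chip; pipelining the per-pixel accumulation is the kind of per-layer pragma work described in Test 1 below.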

Baseline: The Golden Reference

First, we established a baseline with a simple, unoptimized C++ implementation. This “Golden Reference” processed each layer sequentially, loading entire feature maps from memory and writing them back after each operation. As expected, performance was poor:

  • Latency: ~4.97 billion cycles
  • Accuracy (MSE): 0.00223 (on the Set5 butterfly test image)

This provided a correct but slow benchmark to measure our optimizations against.

Test 1: Initial Tiled Convolution

Our first optimized attempt implemented OS Tiled Convolution with HLS pragmas for unrolling and pipelining within each layer. The results were a mixed success:

  • Latency: Reduced dramatically to ~64.8 million cycles (a ~76x speedup).
  • BRAM: Exploded to 11,272 units (3913% of the available BRAM).

The latency improvement was significant, but the design was impossible to implement on the device. The fatal flaw was storing the full intermediate feature maps on chip between layers, which consumed an unsustainable amount of BRAM.
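
A back-of-the-envelope estimate shows why. The 64/32 channel counts below follow the standard SRCNN 9-1-5 configuration, and the 256x256 image size is an assumption for illustration rather than the actual test image:

```cpp
// Rough on-chip cost of Test 1's full intermediate feature maps.
// 64 and 32 channels follow the standard SRCNN configuration; the
// 256 x 256 upscaled image size is an assumption for illustration.
constexpr long H = 256, W = 256;
constexpr long C1 = 64, C2 = 32;

constexpr long fmap1_bytes = C1 * H * W * sizeof(float);  // ~16.8 MB
constexpr long fmap2_bytes = C2 * H * W * sizeof(float);  // ~8.4 MB

// Roughly 25 MB of intermediate storage in total. Assuming 18 Kb
// (2.25 KB) BRAM blocks, that is on the order of 11,000 blocks, which is
// consistent with the utilization the synthesis report flagged and far
// beyond what the device provides.
```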

Test 2: Aggressive Memory Optimization

To fix the BRAM issue, we modified the design to use only a single tile buffer for data transfer between layers. This drastically reduced the memory footprint.

  • Latency: Further reduced to ~1.83 million cycles.
  • BRAM: An excellent 132 units.
  • Accuracy (MSE): Degraded catastrophically to 0.275.

While this version was extremely efficient in both resource usage and speed, it failed to produce a correct output. A KxK convolution needs a border of (K-1)/2 pixels of context around each tile; processing the image one tile at a time without this halo from neighboring tiles at layer boundaries discarded data and corrupted the reconstruction.
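
The arithmetic behind that failure is simple. The tile size below is a placeholder, but the halo sizes follow directly from SRCNN's kernel sizes:

```cpp
// Context ("halo") each SRCNN layer needs around an output tile.
constexpr int TILE    = 32;      // output tile edge (assumption)
constexpr int HALO_L1 = 9 / 2;   // 4 px for the 9x9 feature-extraction layer
constexpr int HALO_L2 = 1 / 2;   // 0 px for the 1x1 mapping layer
constexpr int HALO_L3 = 5 / 2;   // 2 px for the 5x5 reconstruction layer

// Chaining all three layers on one tile without refetching data means the
// first layer's input tile must be padded by the sum of all the halos:
constexpr int HALO_ALL = HALO_L1 + HALO_L2 + HALO_L3;  // 6 px per side
constexpr int IN_TILE  = TILE + 2 * HALO_ALL;          // 44 x 44 input tile

// Test 2 passed tiles between layers without this extra border, so every
// layer after the first computed its tile edges from missing context,
// which is what pushed the MSE from 0.00223 to 0.275.
```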

Test 3: A Balanced Approach with HLS Dataflow

The final design sought a balance between performance, resource usage, and accuracy. We replaced the intermediate tile arrays with HLS streams and applied the #pragma HLS dataflow directive, transforming the architecture into a true producer-consumer pipeline in which the layers operate concurrently: as one layer produces output data, the next immediately begins consuming it through a stream buffer (a minimal sketch of this structure appears at the end of this section).

  • Latency: ~72.8 million cycles.
  • BRAM: 286 units (99% of available BRAM).
  • Accuracy (MSE): Restored to the correct 0.00223.

This implementation successfully balanced the trade-offs. Its latency was far higher than that of the inaccurate Test 2, and slightly higher than Test 1's, but it still delivered a ~68x speedup over the Golden Reference. Most importantly, it achieved this while maintaining model accuracy and fitting within the FPGA’s BRAM budget. The increased usage of DSPs, FFs, and LUTs reflected more effective parallelization of the workload across the FPGA fabric.
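
A minimal sketch of that producer-consumer structure is shown below. The stage bodies are simple pass-throughs standing in for the real tiled convolutions, and the names, element type, image size, and stream depths are placeholders rather than our actual kernel interface.

```cpp
#include <hls_stream.h>

// Sketch of the dataflow top level: each stage reads from one stream and
// writes to the next, so all stages run concurrently under DATAFLOW.
constexpr int N = 256 * 256;  // elements per image (assumption)

static void stage(hls::stream<float> &in, hls::stream<float> &out) {
    for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
        out.write(in.read());  // real design: tiled convolution (+ ReLU) here
    }
}

void srcnn_dataflow(hls::stream<float> &img_in, hls::stream<float> &img_out) {
#pragma HLS DATAFLOW
    hls::stream<float> s12("layer1_to_layer2");
    hls::stream<float> s23("layer2_to_layer3");
#pragma HLS STREAM variable=s12 depth=64
#pragma HLS STREAM variable=s23 depth=64

    stage(img_in, s12);   // feature extraction (producer for layer 2)
    stage(s12, s23);      // non-linear mapping, consumes layer 1 as it arrives
    stage(s23, img_out);  // reconstruction, streamed straight out
}
```

Because each stream buffers only a small window of data instead of a full feature map, BRAM stays bounded while the layers overlap in time.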

Conclusion

Accelerating CNNs on FPGAs is a process of navigating trade-offs between latency, throughput, and resource utilization. Our iterative journey with SRCNN demonstrated that:

  1. Tiled convolution is an effective strategy for managing memory on FPGAs.
  2. Naive memory optimization can easily break model accuracy.
  3. HLS dataflow and streaming provide a powerful paradigm for creating efficient, pipelined hardware architectures that balance performance with correctness.