FPGA Accelerator for 3DES Algorithm Based on OpenCL: Design and Performance Analysis

1. Introduction & Overview

In digital currency, blockchain, and cloud data encryption, high-speed, low-power cryptographic processing is in heavy demand. Software implementations of algorithms such as 3DES are limited by low throughput, high CPU utilization, and high power draw. Field-Programmable Gate Arrays (FPGAs) offer hardware acceleration, but development in low-level Hardware Description Languages (HDLs) such as Verilog or VHDL is time-consuming and complex.

This paper presents a design for a 3DES accelerator on FPGA using the Open Computing Language (OpenCL) framework. The proposed architecture uses high-level synthesis (HLS) to close the productivity gap and implements the 48 rounds of 3DES as a pipelined parallel structure. Five optimizations (data storage adjustment, data bit-width improvement, instruction stream optimization, kernel vectorization, and compute unit replication) give the design large performance and energy-efficiency gains over both CPU and GPU platforms.

Peak throughput on Intel Stratix 10: 111.8 Gb/s
Performance vs. Intel Core i7-9700: 372x
Energy efficiency vs. NVIDIA GTX 1080 Ti: 9x

2. Technical Background

2.1 The 3DES Algorithm

Triple Data Encryption Standard (3DES) is a symmetric-key block cipher derived from the older DES algorithm. To strengthen DES against brute-force attacks, 3DES applies the DES cipher three times to each 64-bit data block. The standard defines three keying options, the most secure of which uses three independent keys (Keying Option 1): $C = E_{K3}(D_{K2}(E_{K1}(P)))$, where $E$ is encryption, $D$ is decryption, $K1, K2, K3$ are the keys, $P$ is plaintext, and $C$ is ciphertext. This gives a 168-bit key (three 56-bit DES keys), although meet-in-the-middle attacks reduce the effective security to about 112 bits. Each block passes through 48 rounds of computation (three DES passes of 16 rounds each).
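The encrypt-decrypt-encrypt (EDE) composition can be sketched in a few lines. The block below is a minimal illustration, not DES itself: `toy_round` and `round_keys` are hypothetical stand-ins for the real DES round function and key schedule, chosen only so the example stays short and self-contained. What it does show faithfully is the 16-round Feistel structure, the fact that decryption is the same network with the round keys reversed, and the EDE ordering of the three keys.

```python
import hashlib

MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def toy_round(right: int, round_key: int) -> int:
    """Toy round function (NOT the DES F-function): hash of half-block and key."""
    data = right.to_bytes(4, "big") + round_key.to_bytes(8, "big")
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")

def round_keys(key: int, rounds: int = 16) -> list[int]:
    """Toy key schedule (NOT the DES schedule): one round key per round."""
    return [(key * (2 * i + 3) + i) & MASK64 for i in range(rounds)]

def feistel(block: int, keys: list[int]) -> int:
    """Run a 64-bit block through one Feistel round per key."""
    left, right = (block >> 32) & MASK32, block & MASK32
    for k in keys:
        left, right = right, left ^ toy_round(right, k)
    # Final swap, so the same network with reversed keys inverts it.
    return (right << 32) | left

def encrypt(block: int, key: int) -> int:
    return feistel(block, round_keys(key))

def decrypt(block: int, key: int) -> int:
    return feistel(block, round_keys(key)[::-1])

def ede_encrypt(p: int, k1: int, k2: int, k3: int) -> int:
    """C = E_K3(D_K2(E_K1(P))): 3 passes x 16 rounds = 48 rounds."""
    return encrypt(decrypt(encrypt(p, k1), k2), k3)

def ede_decrypt(c: int, k1: int, k2: int, k3: int) -> int:
    """P = D_K1(E_K2(D_K3(C)))."""
    return decrypt(encrypt(decrypt(c, k3), k2), k1)
```

With all three keys equal, the middle decryption cancels the first encryption and `ede_encrypt` reduces to a single `encrypt`. The same property is what keeps 3DES backward compatible with single DES.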

2.2 OpenCL for FPGA Programming

OpenCL is an open, royalty-free standard for parallel programming across heterogeneous platforms (CPUs, GPUs, FPGAs, DSPs). For FPGAs, tools like the Intel FPGA SDK for OpenCL act as a High-Level Synthesis (HLS) compiler, translating kernel code written in a C-like language into efficient hardware circuits. This abstraction significantly reduces development time and complexity compared to RTL design, making FPGA acceleration accessible to software developers and domain experts.

3. Accelerator Architecture & Design

3.1 Pipeline Parallel Structure

The core of the accelerator is a deeply pipelined architecture that unrolls the 48 rounds of the 3DES algorithm. This design allows multiple data blocks to be processed simultaneously at different stages of the encryption pipeline, maximizing hardware utilization and throughput. The pipeline is carefully balanced to avoid stalls and ensure continuous data flow.
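A rough cycle-count model illustrates why unrolling pays off. The sketch below assumes one pipeline stage per round and one clock cycle per stage, which is a simplification (an HLS compiler typically generates more stages per round); the function names are illustrative and do not come from the paper.

```python
def pipeline_cycles(num_blocks: int, depth: int = 48) -> int:
    """Cycles until the last block leaves a full pipeline that accepts
    one new block per cycle (initiation interval of 1)."""
    if num_blocks == 0:
        return 0
    return depth + num_blocks - 1

def sequential_cycles(num_blocks: int, depth: int = 48) -> int:
    """Cycles when each block must finish all rounds before the next starts."""
    return depth * num_blocks

def simulate_pipeline(num_blocks: int, depth: int = 48) -> int:
    """Step a shift-register model of the pipeline and count clock cycles."""
    stages = [None] * depth            # stages[i] holds the block in round i
    next_block, finished, cycles = 0, 0, 0
    while finished < num_blocks:
        incoming = next_block if next_block < num_blocks else None
        next_block += 1
        stages = [incoming] + stages[:-1]   # every block advances one round
        cycles += 1
        if stages[-1] is not None:          # a block completes its last round
            finished += 1
    return cycles

if __name__ == "__main__":
    n = 1000
    print(pipeline_cycles(n), sequential_cycles(n), simulate_pipeline(n))
```

Under these assumptions the fill latency of 48 cycles is paid once, after which one block completes per cycle, so for long streams the pipelined design approaches a 48-fold advantage over round-by-round processing.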

3.2 Data Transmission Optimization

To overcome the memory bandwidth bottleneck common in accelerator designs, two key strategies are employed:

Data Storage Adjustment: Optimizing data layout in host and device memory to enable efficient burst transfers and minimize access latency.
Data Bit-width Improvement: Increasing the width of data paths between memory and the kernel to match the FPGA's internal bus capabilities, thereby improving the effective bandwidth utilization.
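The effect of widening the data path can be sketched as packing several 64-bit cipher blocks into one wide memory word, so that a single memory transaction feeds several blocks to the kernel. The 512-bit word width below is an assumption chosen for illustration; the summary does not state the width the authors used.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def pack_blocks(blocks: list[int], word_bits: int = 512) -> list[int]:
    """Pack 64-bit blocks into wide words, lowest lane first."""
    per_word = word_bits // 64
    words = []
    for i in range(0, len(blocks), per_word):
        word = 0
        for lane, block in enumerate(blocks[i:i + per_word]):
            word |= (block & MASK64) << (64 * lane)
        words.append(word)
    return words

def unpack_words(words: list[int], count: int, word_bits: int = 512) -> list[int]:
    """Recover the first `count` 64-bit blocks from packed wide words."""
    per_word = word_bits // 64
    blocks = [(w >> (64 * lane)) & MASK64
              for w in words for lane in range(per_word)]
    return blocks[:count]

if __name__ == "__main__":
    data = [(i * 0x0101010101010101) & MASK64 for i in range(20)]
    packed = pack_blocks(data)
    # 20 narrow accesses become len(packed) wide accesses.
    print(len(data), "blocks ->", len(packed), "words")
```

With this layout, consecutive blocks sit at consecutive addresses, which is also what allows the burst transfers mentioned under data storage adjustment.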

3.3 Kernel Optimization Strategies

The OpenCL kernel is optimized using several techniques:

Instruction Stream Optimization: Reordering and simplifying operations to create an efficient pipeline schedule, reducing dependencies and idle cycles.
Kernel Vectorization: Using Single Instruction, Multiple Data (SIMD) operations to process multiple data elements concurrently within a single kernel instance.
Compute Unit Replication: Instantiating multiple copies of the optimized kernel (Compute Units) on the FPGA fabric to process independent data streams in parallel, scaling performance with available resources.
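Vectorization and replication both multiply the number of 64-bit blocks that complete per clock cycle, so their combined effect can be estimated with simple arithmetic. The sketch below assumes a fully pipelined lane that finishes one block per cycle; the 250 MHz clock and the lane counts are hypothetical values for illustration, not figures from the paper.

```python
import math

BLOCK_BITS = 64  # 3DES block size in bits

def throughput_gbps(clock_mhz: float, simd_lanes: int, compute_units: int) -> float:
    """Ideal throughput when every lane completes one block per cycle."""
    blocks_per_second = clock_mhz * 1e6 * simd_lanes * compute_units
    return BLOCK_BITS * blocks_per_second / 1e9

def lanes_needed(target_gbps: float, clock_mhz: float) -> int:
    """Smallest total lane count (SIMD width x compute units) that reaches
    the target throughput at the given clock."""
    per_lane = throughput_gbps(clock_mhz, 1, 1)
    return math.ceil(target_gbps / per_lane)

if __name__ == "__main__":
    clock = 250.0  # hypothetical kernel clock in MHz
    print(throughput_gbps(clock, 1, 1))      # one lane
    print(lanes_needed(111.801, clock))      # lanes for the reported peak
```

The model ignores memory bandwidth and routing limits, which in practice cap how far either form of replication can scale.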

4. Experimental Results & Performance

The accelerator was implemented and tested on an Intel Stratix 10 GX2800 FPGA. The key performance metrics are as follows:

Throughput: Achieved a peak throughput of 111.801 Gb/s.
vs. CPU (Intel Core i7-9700): Performance improved by a factor of 372, with energy efficiency 644 times better.
vs. GPU (NVIDIA GeForce GTX 1080 Ti): Outperformed in both metrics, delivering 20% higher performance and 9 times better energy efficiency.
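The reported ratios imply a few figures that are not stated directly. The following back-of-the-envelope calculation assumes that "20% higher performance" means a 1.20x throughput ratio and that energy efficiency is throughput per watt; the derived values are estimates under those assumptions, not measurements from the paper.

```python
FPGA_GBPS = 111.801          # reported peak FPGA throughput
CPU_SPEEDUP = 372.0          # FPGA vs. CPU throughput ratio
CPU_EFFICIENCY_GAIN = 644.0  # FPGA vs. CPU energy-efficiency ratio
GPU_SPEEDUP = 1.20           # assumed reading of "20% higher performance"
GPU_EFFICIENCY_GAIN = 9.0    # FPGA vs. GPU energy-efficiency ratio

def implied_figures() -> dict:
    """Derive baseline throughputs and relative power from the ratios.

    efficiency = throughput / power, so
    power_fpga / power_other = speedup / efficiency_gain.
    """
    return {
        "cpu_gbps": FPGA_GBPS / CPU_SPEEDUP,
        "gpu_gbps": FPGA_GBPS / GPU_SPEEDUP,
        "fpga_to_cpu_power": CPU_SPEEDUP / CPU_EFFICIENCY_GAIN,
        "fpga_to_gpu_power": GPU_SPEEDUP / GPU_EFFICIENCY_GAIN,
    }

if __name__ == "__main__":
    for name, value in implied_figures().items():
        print(f"{name}: {value:.4f}")
```

Under these assumptions the CPU baseline is about 0.30 Gb/s and the GPU about 93 Gb/s, and the FPGA draws roughly 58% of the CPU's power and 13% of the GPU's power during the benchmark.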

Chart Description (Implied): A bar chart would effectively visualize this comparative analysis. The x-axis would list the three platforms (Stratix 10 FPGA, Core i7 CPU, GTX 1080 Ti GPU). Two y-axes could be used: the left for Throughput (Gb/s), where the FPGA reaches 111.8 Gb/s; the right for Normalized Performance (CPU=1), showing the FPGA bar at 372 and the GPU bar at roughly 310 (372/1.2, since the FPGA is reported as 20% faster than the GPU). A separate clustered bar chart could show Energy Efficiency (throughput per watt or similar), highlighting the FPGA's 644x lead over the CPU and 9x lead over the GPU.

5. Core Insight & Analyst Perspective

Core Insight: This paper isn't just about making 3DES fast on an FPGA; it's a compelling blueprint for democratizing hardware acceleration. The authors demonstrate that by strategically applying OpenCL-based HLS, you can achieve performance that not only crushes general-purpose CPUs but also surpasses high-end GPUs in a targeted domain, all while sidestepping the prohibitive engineering cost of traditional RTL design.

Logical Flow: The argument is methodical. It starts by identifying the critical pain points in software (slow) and traditional FPGA development (hard). The solution path is clear: use OpenCL/HLS for productivity, then apply a sequence of well-understood but critical optimizations (pipelining, vectorization, replication) to extract maximum hardware efficiency. The performance comparisons against established CPU and GPU baselines validate the entire approach.

Strengths & Flaws: The main strength is the size of the result: the reported 372x performance and 644x energy-efficiency gains over a modern CPU highlight the raw potential of domain-specific hardware. The use of OpenCL is a major practical strength, aligning with the industry trend toward accessible heterogeneous computing, as seen in frameworks such as TensorFlow for ML or Intel's oneAPI. However, a critical flaw is the lack of a comparative baseline against a hand-optimized Verilog/VHDL 3DES core on the same Stratix 10 FPGA. While the GPU/CPU comparison is excellent for market positioning, the HLS community needs to know the "efficiency gap" between HLS and expert RTL design for this specific problem. Furthermore, as noted by research from the University of Toronto on HLS productivity, the abstraction can obscure low-level control, potentially leaving some performance on the table compared to an optimal RTL implementation.

Actionable Insights: For product teams, the message is clear: for high-volume, fixed-function cryptographic workloads (beyond just 3DES), an OpenCL-based FPGA accelerator should be a serious contender in the architecture evaluation phase, especially where power efficiency is a key constraint (e.g., edge data centers, network appliances). The methodology is portable. The real takeaway is the optimization playbook: data layout, bit-width, pipelining, vectorization, replication. These are not new concepts, but seeing them applied together in an OpenCL context to beat a flagship GPU is a powerful proof point. The next step is to apply the same blueprint to post-quantum cryptography algorithms such as Kyber or Dilithium, which are computationally intensive and prime candidates for such acceleration.

6. Technical Details & Mathematical Formulation

The 3DES encryption process with three independent keys (EDE mode) is formally defined as:

$Ciphertext = E_{K_3}(D_{K_2}(E_{K_1}(Plaintext)))$

Each of the three DES operations runs 16 rounds, and every round applies the round function $F(R, K)$, which is central to the computation. It consists of four steps:

Expansion: The 32-bit right half $R$ is expanded to 48 bits via a fixed permutation table $E$.
Key Mixing: The expanded $R$ is XORed with a 48-bit round key $K$ derived from the main key.
Substitution (S-Boxes): The 48-bit result is divided into eight 6-bit chunks $B_1, \dots, B_8$, and each chunk $B_i$ is mapped to a 4-bit output by its own non-linear substitution box $S_i$. This is the core non-linear operation: $S(B_1 \| \dots \| B_8) = S_1(B_1) \| \dots \| S_8(B_8)$, where $\|$ denotes concatenation.
Permutation (P-Box): The 32-bit output from the S-Boxes is permuted by a fixed function $P$.

The round function output is: $F(R, K) = P(S(E(R) \oplus K))$.

The accelerator's pipeline effectively computes this $F$ function 48 times per data block, with the pipeline stages mapping to the expansion, XOR, S-Box lookup, and permutation operations, all optimized for parallel execution.
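The first three steps of $F$ can be illustrated on a small scale. The sketch below generates the expansion table $E$ from its regular pattern and includes only the first S-box, $S_1$, as published in the DES standard; the other seven S-boxes and the permutation $P$ are omitted for brevity, so this is a partial illustration rather than a complete round function. The helper names are illustrative.

```python
# Expansion table E (1-indexed bit positions, bit 1 = most significant).
# Each 6-bit row overlaps its neighbours by one bit on either side.
E_TABLE = [((4 * row + offset - 1) % 32) + 1
           for row in range(8) for offset in range(6)]

# S-box S1 from the DES standard: 4 rows x 16 columns.
S1 = [
    [14, 4, 13, 1, 2, 15, 11, 8, 3, 10, 6, 12, 5, 9, 0, 7],
    [0, 15, 7, 4, 14, 2, 13, 1, 10, 6, 12, 11, 9, 5, 3, 8],
    [4, 1, 14, 8, 13, 6, 2, 11, 15, 12, 9, 7, 3, 10, 5, 0],
    [15, 12, 8, 2, 4, 9, 1, 7, 5, 11, 3, 14, 10, 0, 6, 13],
]

def expand(r32: int) -> int:
    """Expand the 32-bit right half to 48 bits using E."""
    out = 0
    for pos in E_TABLE:
        out = (out << 1) | ((r32 >> (32 - pos)) & 1)
    return out

def sbox_lookup(table: list[list[int]], six_bits: int) -> int:
    """Outer two bits select the row, inner four bits select the column."""
    row = ((six_bits >> 4) & 0b10) | (six_bits & 1)
    col = (six_bits >> 1) & 0xF
    return table[row][col]

def first_sbox_output(r32: int, round_key48: int) -> int:
    """Expansion, key mixing, then S1 applied to the leftmost 6-bit chunk."""
    mixed = expand(r32) ^ round_key48
    return sbox_lookup(S1, (mixed >> 42) & 0x3F)
```

On an FPGA the expansion and the P-box cost no logic at all, since they are fixed rewirings of bits, while each S-box maps naturally onto a small lookup table. That is why every round fits comfortably into a short pipeline stage.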

7. Analysis Framework & Case Example

Framework for Evaluating HLS-based Accelerators:

When analyzing a paper like this, we apply a multi-dimensional framework:

Performance: Absolute throughput (Gb/s) and latency. Comparison to relevant baselines (CPU, GPU, other FPGA works).
Efficiency: Performance per Watt (Energy Efficiency). Resource utilization (Logic Elements, BRAM, DSP blocks on FPGA).
Productivity: Implied development time saved by using OpenCL vs. HDL. Portability of code across FPGA families.
Methodology Validity: Are the optimization strategies clearly explained and justified? Is the experimental setup (tools, versions, benchmark data) reproducible?
Generality: Can the core architectural strategies (pipeline, vectorization) be applied to other algorithms (e.g., AES, SHA-3)?

Case Example: Applying the Framework

Let's apply the fifth criterion (Generality) to the AES algorithm. The paper's strategy is highly transferable:

Pipeline Parallel Structure: AES-128 has 10 rounds. A 10-stage (or deeper via unrolling) pipeline can be constructed.
Data Transmission Optimization: The same data width and layout optimizations would apply to feed the AES kernel.
Kernel Vectorization: AES operations on the 128-bit state matrix are highly parallelizable within a single block.
Compute Unit Replication: Multiple independent AES cores can be instantiated.

The primary architectural change would be replacing the DES $F$-function datapath with the AES round transformation (SubBytes, ShiftRows, MixColumns, AddRoundKey). The optimization principles remain identical. A similar study by researchers at ETH Zurich on OpenCL-based AES acceleration on FPGAs achieved comparable performance leaps, confirming the generality of this approach.
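Two of the four AES round steps show why the state is easy to process in parallel. The sketch below implements ShiftRows and AddRoundKey on a 16-byte state stored column by column, as in the AES specification; SubBytes and MixColumns are left out to keep the example short, and the function names are illustrative.

```python
def shift_rows(state: bytes) -> bytes:
    """AES ShiftRows: row r of the column-major state rotates left by r."""
    if len(state) != 16:
        raise ValueError("AES state must be 16 bytes")
    # Output byte at (row r, column c) comes from input (row r, column c + r).
    return bytes(state[r + 4 * ((c + r) % 4)]
                 for c in range(4) for r in range(4))

def add_round_key(state: bytes, round_key: bytes) -> bytes:
    """AES AddRoundKey: byte-wise XOR of the state with the round key."""
    if len(state) != 16 or len(round_key) != 16:
        raise ValueError("state and round key must be 16 bytes")
    return bytes(s ^ k for s, k in zip(state, round_key))

if __name__ == "__main__":
    state = bytes(range(16))
    print(list(shift_rows(state)))
```

Every output byte of both steps depends on a single input byte position, so all 16 bytes can be computed at once. In hardware ShiftRows is pure wiring and AddRoundKey is a single layer of XOR gates.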

8. Future Applications & Research Directions

The success of this design opens several promising avenues:

Post-Quantum Cryptography (PQC): NIST published its first PQC standards in 2024, including ML-KEM (FIPS 203, derived from CRYSTALS-Kyber, for key encapsulation) and ML-DSA (FIPS 204, derived from CRYSTALS-Dilithium, for signatures). These algorithms involve complex polynomial arithmetic that is highly parallelizable and computationally intensive, making them ideal targets for this FPGA acceleration blueprint.
Homomorphic Encryption Acceleration: Performing computations on encrypted data is massively compute-bound. Optimized FPGA accelerators could make certain homomorphic schemes practical for real-world use.
Integrated Secure Data Processing Units: Combining this cryptographic accelerator with network interface controllers (SmartNICs) or storage controllers to provide transparent, line-rate encryption/decryption for data-at-rest and data-in-motion within data centers.
Toolchain Enhancement: Future research could focus on automating the optimization strategies presented here. Could the OpenCL compiler automatically infer optimal data bit-widths or suggest compute unit replication based on kernel analysis and target FPGA resources?
Multi-Algorithm Agile Accelerators: Designing reconfigurable kernels that can support multiple symmetric ciphers (3DES, AES, ChaCha20) based on workload demand, leveraging the partial reconfiguration capability of modern FPGAs.

9. References

WU J., ZHENG B., NIE Y., CHAI Z. (2021). FPGA Accelerator for 3DES Algorithm Based on OpenCL. Computer Engineering, 47(12), 147-155, 162.
National Institute of Standards and Technology (NIST). (2017). Recommendation for the Triple Data Encryption Algorithm (TDEA) Block Cipher. NIST Special Publication 800-67, Revision 2.
Khronos Group. (2024). OpenCL Overview. https://www.khronos.org/opencl/
Intel Corporation. (2023). Intel FPGA SDK for OpenCL. https://www.intel.com/content/www/us/en/software/programmable/sdk-for-opencl/overview.html
Ismail, A., & Shannon, L. (2019). High-Level Synthesis for FPGA-Based Cryptography: A Survey. In Proceedings of the International Conference on Field-Programmable Technology (FPT).
University of Toronto, Department of Electrical & Computer Engineering. (2022). Research in High-Level Synthesis and FPGA Architectures. https://www.eecg.utoronto.ca/~jayar/research/hls.html
ETH Zurich, Secure & Reliable Systems Group. (2021). Hardware Acceleration of Modern Cryptography. https://srs.group.ethz.ch/research.html
Zhuo, L., & Prasanna, V. K. (2005). High-Performance Designs for Linear Algebra Operations on Reconfigurable Hardware. IEEE Transactions on Parallel and Distributed Systems.