Institutional Repository
Technical University of Crete
EN  |  EL



My Space

A CNN framework accelerated by an FPGA in synthesized C

Chatzidaki Eleftheria

Full record

Year 2019
Type of Item Diploma Work
Bibliographic Citation Eleftheria Chatzidaki, "A CNN framework accelerated by an FPGA in synthesized C", Diploma Work, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 2019
Appears in Collections


These days, Convolutional Neural Networks are popular for image classification and recognition. We prefer to utilize them because they achieve high accuracy by exploiting the inherent properties of images. A major disadvantage of CNN is that they perform many and complex calculations that cost a lot of time, energy and resources. The best solution that we can suggest is to take advantage of the properties of Field Programmable Gate Arrays. FPGAs are specialized in the acceleration of calculations and they consume less energy than Graphics Processing Units or Central Processing Units. We introduce a framework written in C++ that can adopt FPGA kernels to accelerate calculations as matrix multiplications. We connected an available matrix multiplication with addition implementation to the framework and we tested it on a Trenz Platform. Apart from that, we implemented a fast and cache-aware framework applying OpenMP, GCC option flags, OpenMP environment variables and features from C++17. Our CNN framework is tested on a LeNet-5 architecture using the MNIST dataset containing L1, L2 Regularizations, Vanilla, Momentum, Momentum with Nesterov Updates, He-et-al weight initialization, Fisher-Yates shuffle, Stochastic Gradient Descent techniques that are all implemented from scratch. Furthermore, we implemented 3 ways of load MNIST dataset, as well as, naive, cache blocking, OpenMP and Hybrid cache blocking with OpenMP in matrix multiplication, transpose and copy algorithms, that we tested and investigated their behavior among the mini-batch sizes and the number of used threads. Besides all the aforementioned that was made from scratch, we used the Xilinx Vivado SDK to make a bare-metal C++ project with the appropriate cache size linker script and adjusted the matrix multiplication with addition code to our framework. Afterwards, we programmed the Trenz Platform that contains an ARM CPU and an FPGA accelerator. As a result, we achieved a 4.3x-8.5x better performance using an FPGA to accelerate matrix multiplication with addition than using a naive or a cache blocking single thread implementations and in specific unfair(lack of multi-threading on Trenz) cases, depending on the mini-batch size of multi-threading OpenMP(up to 1.27x) or Hybrid algorithms(up to 2.27x) on a CPU.

Available Files