Institutional Repository
Technical University of Crete




Evaluating the Intel Harp (tightly-coupled CPU-FPGA) platform with an ARM many-core accelerator

Pekridis Georgios


Year: 2019
Type of Item: Master Thesis
Bibliographic Citation: Georgios Pekridis, "Evaluating the Intel Harp (tightly-coupled CPU-FPGA) platform with an ARM many-core accelerator", Master Thesis, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 2019


The purpose of this thesis was to evaluate Intel's platform, which combines a scalable Xeon CPU with an integrated Arria 10 FPGA. The CPU and the FPGA communicate over three physical channels: one coherent QPI (QuickPath Interconnect) channel and two non-coherent PCIe channels. The two sides also share memory, and the FPGA's read and write bandwidth to this shared memory is approximately 19 GB/s each. The FPGA side consists of a static part and a reconfigurable part. The static part implements all the components needed to establish communication with the CPU. The reconfigurable part connects to the static part through the Core Cache Interface protocol (CCI-P), which gives developers a level of abstraction on which to build accelerators. The system comprises both software and hardware implementations. The evaluation was carried out with an ARM many-core accelerator. Each ARM core has a 3-stage pipeline, uses a 32-bit architecture, and implements the ARMv4 instruction set along with a few basic floating-point instructions. The RTL for the ARM core was written in Bluespec SystemVerilog (BSV). The hardware architecture contains 16 ARM cores, each with a direct-mapped cache of variable size. The instruction and data memories of every core can be initialized from software, so that the processors can execute programs defined by the developer. The code and data for each core's internal memories are read from binary files. Each core is assigned buffers with a certain amount of memory space, which it uses to read and write data; the hardware accesses these buffers through physical addresses. The STREAM benchmark was used to measure the bandwidth of the design, and a matrix multiplication test was run to check how the architecture handles real-life applications.
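
As an illustration of how a STREAM-style bandwidth figure is obtained, the following is a minimal sketch of the Triad kernel and its byte accounting. It is not the code used in the thesis, which runs on the ARM cores and is not shown in this record. The function names (`stream_triad`, `triad_bytes`, `measure_triad`), the array length, and the repeat count are assumptions chosen for illustration, and pure Python timings say nothing about the platform's actual bandwidth.

```python
import time
from array import array


def stream_triad(a, b, c, scalar):
    """STREAM Triad kernel: a[i] = b[i] + scalar * c[i]."""
    for i in range(len(a)):
        a[i] = b[i] + scalar * c[i]


def triad_bytes(n, elem_size):
    """Bytes STREAM counts for one Triad pass: two reads and one write per element."""
    return 3 * n * elem_size


def measure_triad(n=100_000, scalar=3.0, repeats=3):
    """Time several Triad passes and return (best bytes per second, result array).

    The array length and repeat count are illustrative assumptions.
    """
    a = array("d", [0.0]) * n
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        stream_triad(a, b, c, scalar)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)  # STREAM reports the best (shortest) pass
    return triad_bytes(n, a.itemsize) / best, a
```

Calling `measure_triad()` returns the bandwidth in bytes per second together with the result array; dividing the first value by `1e9` gives GB/s, the unit in which the abstract quotes the shared-memory bandwidth.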
