WP6-13
OODK: Onboard Overlay Development Kit
ID | WP6-13 |
Contributor | UNIMORE |
Levels | Tool |
Require | OODK configuration, HW accelerators and FPGA-based System-on-Chip |
Provide | Automated deployment of HW accelerators in the companion computer platform |
Input |
|
Output |
|
C4D tooling | n.a. |
TRL | 5/6 |
License | Open-source |
Detailed Description
Although FPGA technology can satisfy the performance, energy and predictability requirements of drone systems and applications, FPGA development is a notoriously complex task. This component is a methodology to ease the deployment of application-specific accelerators - or Hardware Processing Units (HWPU) - in the companion computer platform.
It allows for:
- Automated deployment of HW accelerators in the companion computer platform.
- Generation of an application-tailored FPGA overlay. The latter is an HW/SW abstraction layer that integrates and supports HW accelerators and is instantiated in the FPGA-based System-on-Chip.
- SW stack support for streamlined offloading of computation from the host processor side to the hardware-specific accelerators.
Target drone application(s) will then run on a Heterogeneous System-on-Chip where:
- Host CPU is an industry-standard, hard-macro multi-core CPU. It executes full-fledged operating systems and other legacy software (e.g., ROS).
- FPGA overlay consists of several clusters grouping a small number of RISC-V-based proxy cores to control the operation of one or more HW accelerators.
- Applications start on the host CPU, and then compute-intensive parts can be offloaded to the FPGA overlay on the programmable logic.
- Target SoC supports interaction with autopilot, ground station, sensor, and/or other I/O.
Contribution and Improvements
Within the C4D contributions, this component had to satisfy the requirements associated with UC5-DEM10-DTC-04 and UC5-DEM10-DTC-05:
- Computation offloading capabilities are enabled by the OpenMP4 offloading support. The approach features single source and compiler-assisted code generation (pragma), thus enabling streamlined offloading between the host CPU and HW accelerators.
- OODK enables the automatic integration of HW accelerators with a single configuration file. Thus, users can integrate accelerators without writing in HDL and without the need to be an expert HW designer.
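The single-source, pragma-based offloading style described above can be sketched as follows. This is a minimal illustration that uses only standard OpenMP 4 `target` and `map` constructs. The function name `xor_offload` and the XOR kernel are hypothetical stand-ins chosen for brevity; they are not the UC5-D1 AES accelerator and not part of the OODK API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example kernel: XOR every input byte with a repeating key.
 * It only stands in for a compute-intensive region; it is NOT AES and it
 * does not call any OODK-specific function. */
void xor_offload(const uint8_t *in, uint8_t *out, size_t n,
                 const uint8_t *key, size_t key_len)
{
    assert(in != NULL && out != NULL && key != NULL && key_len > 0);
    if (n == 0) {
        return; /* nothing to transfer or compute */
    }

    /* The target construct marks the region to offload. The map clauses
     * describe the data movement: 'in' and 'key' are copied to the device,
     * 'out' is copied back to the host when the region ends. */
    #pragma omp target map(to: in[0:n], key[0:key_len]) map(from: out[0:n])
    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i) {
        out[i] = (uint8_t)(in[i] ^ key[i % key_len]);
    }
}
```

The same source file serves both sides: when no offload device is available, or when the file is compiled without OpenMP support, the pragmas have no effect and the loop simply runs on the host CPU with the same result.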
In WP4, synthetic and realistic applications were compared, showing that the implementation effort, measured in LOC written for the application, is minor.
Interoperability with other C4D tools
The original Onboard Overlay Development Kit also includes a tool for integrating Xilinx Vivado HLS accelerators into the overlay template. Thanks to the C4D project, this component has been extended to automatically integrate and configure coarse-grained reconfigurable HW accelerators implemented with the more complete and powerful Multi-Dataflow Composer (MDC, WP6-15 component) tool developed by UNISS.
In the context of UC5-D1, OODK has been used to instantiate the Onboard Overlay Compute Platform as a fabric for the Lightweight Cryptography (AES) HW accelerator, which the UC5-D1 application providers designed using MDC. OODK was employed to integrate the MDC-based dataflow graph (DF Network) into the Onboard Overlay Compute Platform and then to synthesize the generated system for a Xilinx FPGA. The high-level programming interfaces for the AES accelerator are implemented through the OpenMP 4.0 specification. The figure above shows the dependency/interoperability between OODK and MDC inside UC5-D1.
Current Status
OODK has been tested in the context of UC5-D1 to evaluate the improvement over state-of-the-art (SoA) FPGA-based acceleration (e.g., Xilinx SDSoC). With respect to the main metrics:
- The code required for the HW/SW integration of the Lightweight Cryptography accelerator in the FPGA overlay is automatically generated. A total of 46 KLOC (thousand lines of code) is replaced by the definition of just 11 parameters describing the system micro-architecture plus 33 for the accelerator wrapper. Overall, this reduces LOC by a factor of 1049X.
- For heterogeneous applications, the reduction in the number of lines of code is application-specific. In the UC5-D1 use case, the application consists of 927 lines of code for the main program, which are written manually in the convenient OpenMP coding style. In addition, 1709 lines of code are required to extend the firmware's semantics for the specific accelerators in the platform. In the proposed methodology, the latter are also generated automatically from the same 11+33 parameters used to define the architecture. This leads to a 2.7X reduction in the number of lines of code.
- The execution time speedup of the HW version of the Lightweight Cryptography layer over the SW version is approximately 2X. This does not translate directly into an equivalent ratio in energy savings, because the host CPU remains active while the accelerator operates and contributes to the overall energy spent. The energy reduction when using the accelerated version of the Lightweight Cryptography layer amounts to 48%.
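The ratios above can be approximately reproduced from the figures given in the text. The worked arithmetic below is a reading of those figures, not a statement of how they were originally computed. It assumes that the 11+33 parameters count as 44 manually written lines, that the 46 KLOC figure is rounded, and that the speedup is exactly 2 (the text only says "approximately").

```latex
\begin{align*}
\text{Integration code:}\quad
  & \frac{46\,000}{11 + 33} = \frac{46\,000}{44} \approx 1045 \\[4pt]
\text{Application code:}\quad
  & \frac{927 + 1709}{927 + (11 + 33)} = \frac{2636}{971} \approx 2.7 \\[4pt]
\text{Energy:}\quad
  & \frac{E_{\mathrm{HW}}}{E_{\mathrm{SW}}}
    = \frac{\bar{P}_{\mathrm{HW}}\, t_{\mathrm{HW}}}{\bar{P}_{\mathrm{SW}}\, t_{\mathrm{SW}}}
    = 1 - 0.48 = 0.52,
  \qquad
  \frac{t_{\mathrm{SW}}}{t_{\mathrm{HW}}} \approx 2
  \;\Rightarrow\;
  \frac{\bar{P}_{\mathrm{HW}}}{\bar{P}_{\mathrm{SW}}} \approx 0.52 \times 2 = 1.04
\end{align*}
```

The first line gives about 1045 rather than 1049, which is consistent with an unrounded line count slightly above 46,000. The last line suggests that the average power drawn during the accelerated run is only a few percent higher than in the SW-only run, which is why the energy saving (48%) stays just below the 50% that a 2X speedup at equal power would give.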
Design and Implementation
The snippet above shows the only configuration file for OODK that the user must fill in to integrate one or more application-specific HW accelerators. For application design, the tool provides a single-source, OpenMP 4.5-enabled programming interface. Supporting the OpenMP 4.5 accelerator model requires an OpenMP 4.5-enabled compiler that targets both the host ISA and the RISC-V ISA, together with a runtime system implementing the OpenMP standard. Thus, the OODK collection contains:
- Clang/LLVM Compiler. Compiler configured to support OpenMP offloading from the AArch64 ISA to the RISC-V ISA.
- Overlay Runtime Libraries. Host and overlay communication and runtime libraries.
- Overlay Rootfs and Linux OS Generator. Automated scripts for creating a Linux-based rootfs for the host.