COMP4DRONES - User contributions [en]

WP6-13

2022-10-10T10:52:36Z

Unimore:

= OODK: Onboard Overlay Development Kit =
{|class="wikitable"
| ID|| WP6-13
|-
| Contributor || UNIMORE
|-
| Levels || Tool
|-
| Require || OODK configuration, HW accelerators and FPGA-based System-on-Chip
|-
| Provide || Automated deployment of HW accelerators in the companion computer platform
|-
| Input ||
* OODK configuration file
* HW Accelerator RTL designs
|-
| Output ||
* OODK system ready to be deployed on FPGA-based System-on-Chip
* SW stack for offloading computation from the host processor side to the hardware-specific accelerators
|-
| C4D tooling || n.a.
|-
| TRL || 5/6
|-
| License || Open-source
|}

== Detailed Description ==
Although FPGA technology can satisfy the performance, energy and predictability requirements of drone systems and applications, FPGA development is a notoriously complex task.
The Onboard Overlay Development Kit (OODK) [1] is a methodology to ease the deployment of application-specific accelerators - or Hardware Processing Units (HWPU) [2][3][4] - in the companion computer platform.

It allows for:
* Automated deployment of HW accelerators in the companion computer platform.
* Generation of an application-tailored FPGA overlay. The latter is an HW/SW abstraction layer that integrates and supports HW accelerators and is instantiated in the FPGA-based System-on-Chip.
* SW stack support for streamlined offloading of computation from the host processor side to the hardware-specific accelerators.

Target drone application(s) will then run on a Heterogeneous System-on-Chip where:
* Host CPU is an industry-standard, hard-macro multi-core CPU. It executes full-fledged operating systems and other legacy software (e.g., ROS).
* FPGA overlay consists of several clusters, each grouping a small number of RISC-V-based proxy cores that control the operation of one or more HW accelerators.
* Applications start on the host CPU, and then compute-intensive parts can be offloaded to the FPGA overlay on the programmable logic.
* Target SoC supports interaction with autopilot, ground station, sensor, and/or other I/O.

==Contribution and Improvements==
In the context of the C4D contributions, this component had to demonstrate requirements associated with UC5-DEM10-DTC-04 and UC5-DEM10-DTC-05:
* Computation offloading capabilities are enabled by the OpenMP4 offloading support. The approach features single source and compiler-assisted code generation (pragma), thus enabling streamlined offloading between the host CPU and HW accelerators.
* OODK enables the automatic integration of HW accelerators with a single configuration file. Thus, users can integrate accelerators without writing in HDL and without the need to be an expert HW designer.
In WP4, synthetic and realistic applications were compared, showing that only a minor implementation effort, measured in lines of code (LOC) written for the application, is required.

== Interoperability with other C4D tools ==
[[File:oodk_mdc.png|frame|center|OODK interoperability graph]]

The original Onboard Overlay Development Kit also includes a tool for integrating Xilinx Vivado HLS accelerators into the overlay template.
Within the C4D project, this component has been extended to automatically integrate and configure coarse-grained reconfigurable HW accelerators implemented with the more complete and powerful Multi-Dataflow Composer (MDC, WP6-15 component) tool developed by UNISS.
In the context of UC5-D1, OODK has been used to instantiate the Onboard Overlay Compute Platform as a fabric for the Lightweight Cryptography (AES) HW accelerator, which the UC5-D1 application providers designed using the MDC tool.
OODK has been employed to integrate the MDC-based dataflow graph (DF Network) into the Onboard Overlay Compute Platform and then to synthesize the generated system targeting a Xilinx FPGA. The high-level programming interface for the AES accelerator is implemented through the OpenMP 4.0 specification. The figure above shows the dependency and interoperability between OODK and MDC within UC5-D1.

==Current Status==
OODK has been tested in the context of UC5-D1 to evaluate the improvement over state-of-the-art (SoA) FPGA-based acceleration flows (e.g., Xilinx SDSoC). With respect to the main metrics:

* The code required for the HW/SW integration of the Lightweight Cryptography accelerator in the FPGA overlay is automatically generated. A total of 46 KLOC (thousand lines of code) is replaced by the definition of just 11 parameters describing the system micro-architecture plus 33 for the accelerator wrapper. Overall, this reduces LOC by a factor of 1049.

* For heterogeneous applications, the reduction in the number of lines of code is application-specific. In the UC5-D1 use case, the application consists of 927 lines of code for the main program, written manually in the OpenMP coding style. In addition, 1709 lines of code are required to extend the firmware's semantics for the specific accelerators in the platform. In the proposed methodology, the latter are also generated automatically from the same 11+33 parameters used to define the architecture. This reduces the number of lines of code by a factor of 2.7.

* The execution-time speedup of the HW version of the Lightweight Cryptography layer over the SW version is approximately 2X. This does not translate directly into an equivalent ratio in energy savings, because the host CPU remains active while the accelerator operates and thus contributes to the overall energy spent. The energy reduction when using the accelerated version of the Lightweight Cryptography layer amounts to 48%.
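
The reduction factors and the energy remark above can be reproduced with a few lines of arithmetic. The following Python sketch is illustrative only: it assumes that the application-level factor compares hand-writing both the main program and the firmware extensions against writing the main program plus the 11+33 parameters, and that energy equals average power times execution time. All variable and function names are hypothetical. Because the text rounds the generated code to 46 KLOC, the first factor comes out slightly below the reported 1049.

```python
# Figures quoted in the metrics list above.
INTEGRATION_LOC = 46_000   # "46 KLOC", rounded in the text
ARCH_PARAMS = 11           # system micro-architecture parameters
WRAPPER_PARAMS = 33        # accelerator wrapper parameters
APP_LOC = 927              # hand-written OpenMP main program
FIRMWARE_LOC = 1709        # firmware extensions, generated by OODK
SPEEDUP = 2.0              # "approximately 2X"
ENERGY_REDUCTION = 0.48    # 48 %


def reduction_factor(without_oodk: int, with_oodk: int) -> float:
    """Ratio of lines written without OODK to lines written with it."""
    return without_oodk / with_oodk


params = ARCH_PARAMS + WRAPPER_PARAMS  # 44 user-written entries

# HW/SW integration: generated code versus configuration parameters.
integration = reduction_factor(INTEGRATION_LOC, params)

# Application (assumed baseline): main program and firmware written by
# hand, versus main program plus the configuration parameters.
application = reduction_factor(APP_LOC + FIRMWARE_LOC, APP_LOC + params)

# Energy = average power x time, so the average power ratio follows from
# the energy ratio and the execution-time ratio.
time_ratio = 1.0 / SPEEDUP
energy_ratio = 1.0 - ENERGY_REDUCTION
power_ratio = energy_ratio / time_ratio

print(f"integration reduction: about {integration:.0f}x")
print(f"application reduction: about {application:.1f}x")
print(f"average power ratio (accelerated / SW): {power_ratio:.2f}")
```

Under these assumptions the sketch gives factors of about 1045 and 2.7, and an average power ratio of about 1.04: the accelerated run draws slightly more power on average, but for roughly half the time.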

==Design and Implementation==
[[File:oodk_kit_components.png|frame|center|Onboard Overlay Development Kit Components]]

The figure above shows the main functionalities assigned to the OODK.
OODK is used to integrate and design the complete system of the FPGA overlay.
Thus, it includes an automated flow for integrating and implementing custom HW accelerators on the FPGA overlay.

For the application design, the tool provides a single-source, OpenMP 4.5-enabled programming interface. Supporting the OpenMP 4.5 accelerator model requires an OpenMP 4.5-enabled compiler that targets both the host ISA and the RISC-V ISA, together with a runtime system implementing the OpenMP standard. Thus, the OODK collection contains:

* '''Clang/LLVM Compiler'''. Compiler configured for supporting OpenMP offloading from AARCH64 ISA to RISC-V ISA.
* '''Overlay Runtime Libraries'''. Host and Overlay Communication and Runtime Libraries.
* '''Overlay Rootfs and Linux OS Generator'''. Automated scripts for the creation of Linux-based rootfs for the host.

==References==

[1] "Onboard Overlay Development Kit (OODK)", https://github.com/gbellocchi/arov

[2] "Hardware Processing Engines - Documentation", https://hwpe-doc.readthedocs.io/en/latest/

[3] "Hardware Processing Engines - Streamer", https://github.com/pulp-platform/hwpe-stream

[4] "Hardware Processing Engines - Controller", https://github.com/pulp-platform/hwpe-ctrl

WP3-22

2022-10-10T10:48:33Z

Unimore:

=Onboard Overlay Compute Platform (OOCP)=
{|class="wikitable"
| ID|| WP3-22
|-
| Contributor || UNIMORE
|-
| Levels || System
|-
| Require || Application definition and FPGA-based System-on-Chip
|-
| Provide || Accelerator-rich Overlay for FPGA-based System-on-Chip
|-
| Input || Application-specific HW accelerator descriptions in HDL, in HLS, or using the WP3-28 methodology
|-
| Output || Synthesizable Overlay for FPGA with integrated Application Specific HW Accelerators
|-
| C4D building block || The methodology is generic and applicable to host HW accelerators for different tasks and scenarios. With respect to C4D, it could be used to implement HW accelerators related to perception, actuation, flight-control, payload management or data management.
|-
| TRL || 4
|}

[[File:building_block_wp3_22.png|frame|center|Building Block diagram for WP3-22]]

==Detailed Description==

The '''Onboard Overlay Compute Platform Design Methodology''' is composed of two main contributions:
* '''Onboard Overlay Compute Platform (OOCP)'''. The OOCP is an evolution of the HERO architecture targeting specifically FPGA acceleration on FPGA-based heterogeneous systems-on-chip (HeSoCs).
* '''Onboard Overlay Development Kit (OODK)'''. The OODK contains tools and libraries for the automatic integration and programming of application-specific hardware accelerators, and for offloading computation to them.

The '''Onboard Overlay Compute Platform Design Methodology''' is generic and applicable to host accelerators for different tasks and scenarios. With respect to the C4D Drone Reference Architecture, the OOCP could potentially be used to host and integrate different HW accelerators on an FPGA-based SoC (e.g., for perception, actuation, flight control, payload management, or data management).

==Specifications and contribution==
The increasing demand for autonomy in UAVs requires adequate onboard smart sensing and computing capabilities to support safe decision making, based on large amounts of data that are sensed, analysed, and understood in real time. The capability of flexibly defining parallel, non-Von-Neumann processing logic and custom memory hierarchies, all within contained power envelopes, makes FPGA-based heterogeneous systems-on-chip (HeSoCs) an ideal candidate for implementing onboard compute platforms for UAVs.

Within the C4D project, we are developing an '''Onboard Overlay Compute Platform Design Methodology''' that targets FPGA-based HeSoCs and leverages ''soft-cores'' for flexible control of user-defined, ''application-specific accelerators''. Different accelerators can flexibly operate and reconfigure their operation without costly host intervention, thus avoiding significant performance degradation. Normal accelerator operation and accelerator reconfiguration can both be achieved via standard computation offloading from the host CPU to the soft-cores (e.g., OpenMP v4.x+). Users can rely on any methodology of their choice to design the accelerators (e.g., WP3-28 C4D components or Vivado HLS). Moreover, the '''Onboard Overlay Compute Platform''' includes dedicated logic (the wrapper) to provide ''plug-and-play'' HW/SW integration of such accelerators developed within C4D WP6 activities.

==Design and Implementation==
Figure 84 shows an overview of the proposed '''Onboard Overlay Compute Platform (OOCP)'''.
The Onboard Overlay is based on the Parallel Ultra Low Power Platform (PULP) [1], and particularly on HERO [2], an open-source research platform based on FPGA emulation of PULP-based heterogeneous many-core systems.
HERO can be instantiated on FPGA SoCs like the Xilinx Zynq family.

HERO constitutes a convenient starting point to implement the Onboard Overlay Compute Platform: being conceived as a many-core architecture, HERO naturally complies with some of the basic requirements to build an accelerator-rich design, most notably the cluster-based design and the multibank shared memory design.
However, HERO clusters are designed for general-purpose (or, at best, signal-processing oriented) parallel execution and thus have substantial limitations in the context of FPGA hardware acceleration that we target. HERO uses the FPGA merely as a medium for emulation of projects meant for IC realization.

The proposed Onboard Overlay Compute Platform uses the FPGA as a target for acceleration.
For an overlay to be an efficient and convenient solution, it should offer: (i) System-level design capabilities; (ii) transparent accelerator integration flow; (iii) streamlined resource usage.

The Onboard Overlay Compute Platform is designed to be lightweight (in terms of resource utilization) and configurable.
The OOCP features a configurable number of clusters, and each cluster is composed of one or more RISC-V Ibex cores (RV32IMC) [3], an instruction cache, a DMA, and a multi-ported, multibanked L1 data memory (scratchpad). Each cluster can host one or more application-specific accelerators, which are interfaced to the shared L1 data memory through a ''wrapper''.
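
The cluster organization described above can be pictured as a small data model. The Python sketch below is purely illustrative: the class names, fields, and default values are hypothetical and do not reflect the actual OOCP parameters or the OODK configuration file format. Only the structure (clusters containing Ibex proxy cores, a multibanked L1 scratchpad, and accelerators attached through wrappers) comes from the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Accelerator:
    """Application-specific accelerator attached to the L1 via a wrapper."""
    name: str  # hypothetical label, e.g. "aes"


@dataclass
class Cluster:
    n_cores: int = 1   # Ibex (RV32IMC) proxy cores: one or more
    l1_banks: int = 8  # hypothetical bank count of the L1 scratchpad
    accelerators: List[Accelerator] = field(default_factory=list)

    def validate(self) -> None:
        if self.n_cores < 1:
            raise ValueError("a cluster needs at least one proxy core")
        if self.l1_banks < 1:
            raise ValueError("the L1 scratchpad needs at least one bank")


@dataclass
class Overlay:
    clusters: List[Cluster]

    def validate(self) -> None:
        if not self.clusters:
            raise ValueError("the overlay needs at least one cluster")
        for cluster in self.clusters:
            cluster.validate()

    def summary(self) -> str:
        cores = sum(c.n_cores for c in self.clusters)
        accs = sum(len(c.accelerators) for c in self.clusters)
        return (f"{len(self.clusters)} cluster(s), "
                f"{cores} core(s), {accs} accelerator(s)")


# One cluster with a single proxy core controlling one accelerator.
overlay = Overlay([Cluster(n_cores=1, accelerators=[Accelerator("aes")])])
overlay.validate()
print(overlay.summary())
```

Keeping validation separate from construction lets a configuration be assembled incrementally and checked once before it is handed to a generation flow.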

==References==

[1] "PULP platform", https://pulp-platform.org/

[2] "HERO", https://pulp-platform.org/hero.html

[3] "Ibex", https://github.com/lowRISC/ibex

WP3-22

2022-10-10T10:48:08Z

Unimore:


WP3-22

2022-10-10T10:45:19Z

Unimore: /* Design and Implementation */

WP6-13

2022-10-10T10:25:52Z

Unimore:


WP6-13

2022-10-10T10:13:38Z

Unimore:

==Design and Implementation==
[[File:oodk_kit_components.png|frame|center|Onboard Overlay Development Kit Components]]

The snippet above shows the only OODK configuration file that the user needs to fill in to integrate one or more application-specific HW accelerators.
For the application design, the tool provides a single-source, OpenMP 4.5-enabled programming interface. Supporting the OpenMP 4.5 accelerator model requires an OpenMP 4.5-enabled compiler that targets both the host ISA and the RISC-V ISA, together with a runtime system implementing the OpenMP standard. Thus, the OODK collection contains:

* '''Clang/LLVM Compiler'''. Compiler configured for supporting OpenMP offloading from AARCH64 ISA to RISC-V ISA.
* '''Overlay Runtime Libraries'''. Host and Overlay Communication and Runtime Libraries.
* '''Overlay Rootfs and Linux OS Generator'''. Automated scripts for the creation of Linux-based rootfs for the host.

File:Oodk mdc.png

2022-10-10T09:24:37Z

Unimore:

File:Oodk kit components.png

2022-10-10T09:20:37Z

Unimore:

WP6-13

2022-10-10T09:18:44Z

Unimore: Created page with "= OODK: Onboard Overlay Development Kit = {|class="wikitable" | ID|| WP6-13 |- | Contributor || UNIMORE |- | Levels || Tool |- | Require || OODK configuration, HW accelerators and FPGA-based System-on-Chip |- | Provide || Automated deployment of HW accelerators in the companion computer platform |- | Input || * OODK configuration file * HW Accelerator RTL designs |- | Output || * OODK system ready to be deployed on FPGA-based System-on-Chip * SW stack for..."

= OODK: Onboard Overlay Development Kit =
{|class="wikitable"
| ID|| WP6-13
|-
| Contributor || UNIMORE
|-
| Levels || Tool
|-
| Require || OODK configuration, HW accelerators and FPGA-based System-on-Chip
|-
| Provide || Automated deployment of HW accelerators in the companion computer platform
|-
| Input ||
* OODK configuration file
* HW Accelerator RTL designs

|-
| Output ||
* OODK system ready to be deployed on FPGA-based System-on-Chip
* SW stack for offloading computation from the host processor side to the hardware-specific accelerators

|-
| C4D tooling || n.a.
|-
| TRL || 5/6
|-
| License || Open-source
|}

== Detailed Description ==
Although FPGA technology can satisfy the performance, energy and predictability requirements of drone systems and applications, FPGA development is a notoriously complex task.
This component is a methodology to ease the deployment of application-specific accelerators - or Hardware Processing Units (HWPU) - in the companion computer platform.

It allows for:

* Automated deployment of HW accelerators in the companion computer platform.
* Generation of an application-tailored FPGA overlay. The latter is an HW/SW abstraction layer that integrates and supports HW accelerators and is instantiated in the FPGA-based System-on-Chip.
* SW stack support for streamlined offloading of computation from the host processor side to the hardware-specific accelerators.

Target drone application(s) will then run on a Heterogeneous System-on-Chip where:

* Host CPU is an industry-standard, hard-macro multi-core CPU. It executes full-fledged operating systems and other legacy software (e.g., ROS).
* FPGA overlay consists of several clusters grouping a small number of RISC-V-based proxy cores to control the operation of one or more HW accelerators.
* Applications start on the host CPU, and then compute-intensive parts can be offloaded to the FPGA overlay on the programmable logic.
* Target SoC supports interaction with autopilot, ground station, sensor, and/or other I/O.

==Contribution and Improvements==
OODK is a collection of tools for integrating and configuring HW accelerators (see below UC5-DEM10-DTC-05) and for quickly and effectively offloading computation from the host processor side to the hardware-specific accelerators (see UC5-DEM10-DTC-04).

Figure 80 shows the main functionalities assigned to the OODK. OODK is used to integrate and design the complete system of the FPGA overlay.
Thus, it includes an automated flow for the integration and implementation of custom HW accelerators (from MDC and other design methodologies, such as Xilinx Vivado HLS) on our overlay.

== Interoperability with other C4D tools ==
Figure 80 shows the main functionalities assigned to the OODK. OODK is used to integrate and design the complete system of the FPGA overlay.

The original Onboard Overlay Development Kit also includes a tool for integrating Xilinx Vivado HLS accelerators into the overlay template. Thanks to the C4D project, the flow for integrating HW accelerators is being enriched with the more complete and powerful MDC tool developed by UNISS.

The OODK is used in the context of UC5-D1 to instantiate the Onboard Overlay Compute Platform that is used as a fabric for the AES Hardware Accelerator generated with the Multi-Dataflow Compose tool designed by UNISS (see Figure 81).

The UC5-D1 application providers designed an HW accelerator for AES encryption using MDC. The generated dataflow graph (DF Network) is then integrated into the Onboard Overlay Compute Platform using the OODK scripts. The OODK is also used to synthesize the whole RTL design into a Xilinx FPGA bitstream and to implement the high-level programming interface for the AES accelerator according to the OpenMP 4.0 specification. Figure 83 shows the dependency/interoperability between OODK and MDC inside UC5-D1.

==Current Status==
OODK has been evaluated with the Onboard Overlay design methodology developed in WP3, providing end-to-end examples of HWPU accelerators under development within the WP4 effort.
Note: OODK has not been validated in UC5-D1 directly because the proposed tools and runtime libraries are not "visible" at the UAV-user level but only on the engineering side.

OODK is used for implementing and designing software Subsystem and Component elements of the C4D Drone Reference Platform targeting the Onboard Overlay Compute Platform (WP3-22).

OODK is used in UC5-D1 for software (application) design and implementation and for implementing HW (FPGA-Overlay) design and integration.

Completed:
* By enabling OpenMP 4 offloading support, OODK provides easy and streamlined computation offload between the host and application-specific HW accelerators: single-source, compiler-assisted code generation (pragma). Comparison with synthetic and realistic applications (see WP4) shows a low implementation effort in terms of lines of code (LOC) written for the application.
* OODK enables automatic integration of application-specific HW accelerators with a single configuration file (Python). Users can integrate accelerators without writing HDL and without being expert HW designers. Comparison with synthetic and realistic applications (see WP4) shows a low implementation effort in terms of LOC written for the HW integration and in terms of design time.

==Design and Implementation==

The snippet above shows the only OODK configuration file that the user must fill in to integrate one or more application-specific HW accelerators.
For the application design, our tool provides a single-source, OpenMP 4.5-enabled programming interface. Supporting the OpenMP 4.5 accelerator model requires an OpenMP 4.5-enabled compiler that targets both the host ISA and the RISC-V ISA, and a runtime system implementing the OpenMP standard. Thus, the OODK collection contains:
* '''Clang/LLVM Compiler.''' Compiler configured for supporting OpenMP offloading from AARCH64 ISA to RISC-V ISA.
* '''Overlay Runtime Libraries.''' Host and Overlay Communication and Runtime Libraries.
* '''Overlay Rootfs and Linux OS Generator.''' Automated scripts for the creation of Linux-based rootfs for the host.

WP3-22

2022-10-06T15:28:03Z

Unimore:

=Onboard Overlay Compute Platform (OOCP)=
{|class="wikitable"
| ID|| WP3-22
|-
| Contributor || UNIMORE
|-
| Levels || System
|-
| Require || Application definition and FPGA-based System-on-Chip
|-
| Provide || Accelerator-rich Overlay for FPGA-based System-on-Chip
|-
| Input || Application Specific HW Accelerators description in HDL, or HLS, or using WP3-28 Methodology
|-
| Output || Synthesizable Overlay for FPGA with integrated Application Specific HW Accelerators
|-
| C4D building block || The methodology is generic and applicable to host HW accelerators for different tasks and scenarios. With respect to C4D, it could be used to implement HW accelerators related to perception, actuation, flight-control, payload management or data management.
|-
| TRL || 4
|}

[[File:building_block_wp3_22.png|frame|center|Building Block diagram for WP3-22]]

==Detailed Description==

The '''Onboard Overlay Compute Platform Design Methodology''' is composed of two main contributions:
* '''Onboard Overlay Compute Platform (OOCP)'''. The OOCP is an evolution of the HERO architecture targeting specifically FPGA acceleration on FPGA-based heterogeneous systems-on-chip (HeSoCs).
* '''Onboard Overlay Development Kit (OODK)'''. The OODK contains tools and libraries for the automatic integration and programming of, and offloading to, Application Specific Hardware Accelerators.

The '''Onboard Overlay Compute Platform Design Methodology''' is generic and applicable to host accelerators for different tasks and scenarios. With respect to the C4D Drone Reference Architecture, the OOCP could potentially be used to host and integrate different HW accelerators on an FPGA-based SoC (e.g., for perception, actuation, flight control, payload management or data management).

==Specifications and contribution==
The increasing demand for autonomy in UAVs requires adequate onboard smart sensing and computing capability to support safe decision making, based on large amounts of data that are sensed, analysed and understood in real time. The capability of flexibly defining parallel, non-von-Neumann processing logic and custom memory hierarchies, all within contained power envelopes, makes FPGA-based heterogeneous systems-on-chip (HeSoCs) ideal candidates for implementing onboard compute platforms for UAVs.

Within the C4D project, we are developing an '''Onboard Overlay Compute Platform Design Methodology''' that targets FPGA-based HeSoCs and leverages ''soft-cores'' for flexible control of user-defined, ''application-specific accelerators''. Different accelerators can flexibly operate and reconfigure their operation without costly host intervention, thus avoiding significant performance degradation. Normal accelerator operation and accelerator reconfiguration can both be achieved via standard computation offloading from the host CPU to the soft-cores (e.g., OpenMP v4.x+). The user can rely on any methodology of their choice to design the accelerators (e.g., WP3-28 C4D components or Vivado HLS). Moreover, the '''Onboard Overlay Compute Platform''' includes dedicated logic (the wrapper) to provide ''plug-and-play'' HW/SW integration of such accelerators developed within C4D WP6 activities.
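
To make the idea of reconfiguring an accelerator without host intervention more concrete, the sketch below issues two accelerator jobs with different modes from within a single OpenMP <code>target</code> region, so control never returns to the host between them. Everything accelerator-related here is an assumption made for illustration: <code>hwpu_mode_t</code>, <code>hwpu_run</code> and <code>process_two_phases</code> are hypothetical names, and <code>hwpu_run</code> is a plain software stand-in, not the OOCP wrapper interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical accelerator modes (illustration only, not an OOCP API). */
typedef enum { HWPU_MODE_ADD = 0, HWPU_MODE_XOR = 1 } hwpu_mode_t;

#pragma omp declare target
/* Software stand-in for a job that a soft-core would issue to an
 * accelerator: the mode argument plays the role of a reconfiguration. */
static void hwpu_run(hwpu_mode_t mode, uint32_t *buf, size_t n, uint32_t operand)
{
    for (size_t i = 0; i < n; ++i) {
        buf[i] = (mode == HWPU_MODE_ADD) ? buf[i] + operand : buf[i] ^ operand;
    }
}
#pragma omp end declare target

/* One offload from the host; both phases run on the device side. */
void process_two_phases(uint32_t *buf, size_t n, uint32_t operand)
{
    assert(buf != NULL);

    #pragma omp target map(tofrom: buf[0:n])
    {
        hwpu_run(HWPU_MODE_ADD, buf, n, operand); /* phase 1 */
        hwpu_run(HWPU_MODE_XOR, buf, n, operand); /* phase 2, new mode */
    }
}
```

The point of the design is that the mode change happens inside the offloaded region: the host pays the offload cost once, and the code running on the device side decides how the accelerator is used in each phase.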

==Design and Implementation==
Figure 84 shows an overview of the proposed '''Onboard Overlay Compute Platform (OOCP)'''.
The Onboard Overlay is based on the Parallel Ultra Low Power Platform (PULP) [11], and particularly on HERO [12], an open-source research platform based on FPGA emulation of PULP-based heterogeneous many-core systems.
HERO can be instantiated on FPGA SoCs like the Xilinx Zynq family.

HERO constitutes a convenient starting point to implement the Onboard Overlay Compute Platform: being conceived as a many-core architecture, HERO naturally complies with some of the basic requirements to build an accelerator-rich design, most notably the cluster-based design and the multibank shared memory design.
However, HERO clusters are designed for general-purpose (or, at best, signal-processing oriented) parallel execution and thus have substantial limitations in the context of FPGA hardware acceleration that we target. HERO uses the FPGA merely as a medium for emulation of projects meant for IC realization.

The proposed Onboard Overlay Compute Platform uses the FPGA as a target for acceleration.
For an overlay to be an efficient and convenient solution, it should offer: (i) System-level design capabilities; (ii) transparent accelerator integration flow; (iii) streamlined resource usage.

The Onboard Overlay Compute Platform is designed to be lightweight (in terms of resource utilization) and configurable.
The OOCP features a configurable number of clusters, and each cluster is composed of one or more RISC-V Ibex cores (RV32IMC) [13], an instruction cache, a DMA, and a multi-ported, multi-banked L1 data memory (scratchpad). Each cluster can host one or more application-specific accelerators that are interfaced to the shared L1 data memory through a ''wrapper''.
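
The benefit of a multi-banked L1 memory shared by cores and accelerators can be illustrated with a small sketch of address-to-bank mapping. The sketch assumes word-level interleaving with 32-bit words, which is a common scheme for multi-banked scratchpads but is not stated in this description; the function names and the bank count used in the usage note are likewise hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define L1_WORD_BYTES 4u /* assumed 32-bit word, matching an RV32 core */

/* Bank serving a given byte offset into L1, ASSUMING word-level
 * interleaving: consecutive words fall into consecutive banks. */
static inline uint32_t l1_bank_of(uint32_t byte_offset, uint32_t n_banks)
{
    assert(n_banks > 0);
    return (byte_offset / L1_WORD_BYTES) % n_banks;
}

/* Two accesses issued in the same cycle contend only if they map to
 * the same bank; returns 1 on contention, 0 otherwise. */
static inline int l1_conflict(uint32_t off_a, uint32_t off_b, uint32_t n_banks)
{
    return l1_bank_of(off_a, n_banks) == l1_bank_of(off_b, n_banks);
}
```

Under this assumption and with, say, 8 banks, byte offsets 0 and 4 fall into different banks and can be served in parallel, whereas offsets 0 and 32 map to the same bank and would have to be serialized.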

File:Building block wp3 22.png

2022-10-06T15:26:17Z

Unimore: Building Block diagram for WP3-22

== Summary ==
Building Block diagram for WP3-22

WP3-22

2022-10-06T15:24:24Z

Unimore: Created page with "=Onboard Overlay Compute Platform (OOCP)= {|class="wikitable" | ID|| WP3-22 |- | Contributor || UNIMORE |- | Levels || System |- | Require || Application definition and FPGA-based System-on-Chip |- | Provide || Accelerator-rich Overlay for FPGA-based System-on-Chip |- | Input || Application Specific HW Accelerators description in HDL, or HLS, or using WP3-28 Methodology |- | Output || Synthesizable Overlay for FPGA with integrated Application Specific HW..."

=Onboard Overlay Compute Platform (OOCP)=
{|class="wikitable"
| ID|| WP3-22
|-
| Contributor || UNIMORE
|-
| Levels || System
|-
| Require || Application definition and FPGA-based System-on-Chip
|-
| Provide || Accelerator-rich Overlay for FPGA-based System-on-Chip
|-
| Input || Application Specific HW Accelerators description in HDL, or HLS, or using WP3-28 Methodology
|-
| Output || Synthesizable Overlay for FPGA with integrated Application Specific HW Accelerators
|-
| C4D building block || The methodology is generic and applicable to host HW accelerators for different tasks and scenarios. With respect to C4D, it could be used to implement HW accelerators related to perception, actuation, flight-control, payload management or data management.
|-
| TRL || 4
|}

[[File:wp3-28_01.png|frame|center|Building Block diagram for WP3-22]]

==Detailed Description==

The '''Onboard Overlay Compute Platform Design Methodology''' is composed of two main contributions:
* '''Onboard Overlay Compute Platform (OOCP)'''. The OOCP is an evolution of the HERO architecture targeting specifically FPGA acceleration on FPGA-based heterogeneous systems-on-chip (HeSoCs).
* '''Onboard Overlay Development Kit (OODK)'''. The OODK contains tools and libraries for the automatic integration and programming of, and offloading to, Application Specific Hardware Accelerators.

The '''Onboard Overlay Compute Platform Design Methodology''' is generic and applicable to host accelerators for different tasks and scenarios. With respect to the C4D Drone Reference Architecture, the OOCP could potentially be used to host and integrate different HW accelerators on an FPGA-based SoC (e.g., for perception, actuation, flight control, payload management or data management).

==Specifications and contribution==
The increasing demand for autonomy in UAVs requires adequate onboard smart sensing and computing capability to support safe decision making, based on large amounts of data that are sensed, analysed and understood in real time. The capability of flexibly defining parallel, non-von-Neumann processing logic and custom memory hierarchies, all within contained power envelopes, makes FPGA-based heterogeneous systems-on-chip (HeSoCs) ideal candidates for implementing onboard compute platforms for UAVs.

Within the C4D project, we are developing an '''Onboard Overlay Compute Platform Design Methodology''' that targets FPGA-based HeSoCs and leverages ''soft-cores'' for flexible control of user-defined, ''application-specific accelerators''. Different accelerators can flexibly operate and reconfigure their operation without costly host intervention, thus avoiding significant performance degradation. Normal accelerator operation and accelerator reconfiguration can both be achieved via standard computation offloading from the host CPU to the soft-cores (e.g., OpenMP v4.x+). The user can rely on any methodology of their choice to design the accelerators (e.g., WP3-28 C4D components or Vivado HLS). Moreover, the '''Onboard Overlay Compute Platform''' includes dedicated logic (the wrapper) to provide ''plug-and-play'' HW/SW integration of such accelerators developed within C4D WP6 activities.

==Design and Implementation==
Figure 84 shows an overview of the proposed '''Onboard Overlay Compute Platform (OOCP)'''.
The Onboard Overlay is based on the Parallel Ultra Low Power Platform (PULP) [11], and particularly on HERO [12], an open-source research platform based on FPGA emulation of PULP-based heterogeneous many-core systems.
HERO can be instantiated on FPGA SoCs like the Xilinx Zynq family.

HERO constitutes a convenient starting point to implement the Onboard Overlay Compute Platform: being conceived as a many-core architecture, HERO naturally complies with some of the basic requirements to build an accelerator-rich design, most notably the cluster-based design and the multibank shared memory design.
However, HERO clusters are designed for general-purpose (or, at best, signal-processing oriented) parallel execution and thus have substantial limitations in the context of FPGA hardware acceleration that we target. HERO uses the FPGA merely as a medium for emulation of projects meant for IC realization.

The proposed Onboard Overlay Compute Platform uses the FPGA as a target for acceleration.
For an overlay to be an efficient and convenient solution, it should offer: (i) System-level design capabilities; (ii) transparent accelerator integration flow; (iii) streamlined resource usage.

The Onboard Overlay Compute Platform is designed to be lightweight (in terms of resource utilization) and configurable.
The OOCP features a configurable number of clusters, and each cluster is composed of one or more RISC-V Ibex cores (RV32IMC) [13], an instruction cache, a DMA, and a multi-ported, multi-banked L1 data memory (scratchpad). Each cluster can host one or more application-specific accelerators that are interfaced to the shared L1 data memory through a ''wrapper''.