Monopoli, M., Biondi, M., Moranti, S., Nannipieri, P., & Fanucci, L. (2025). Toward Reliable Onboard AI in Space: A Fault-Tolerant Soft GPU-Based System-on-Chip. IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

Abstract: Artificial intelligence (AI) is becoming increasingly relevant for space applications, reducing reliance on high-bandwidth downlinks. Its use, however, has primarily been demonstrated in low Earth orbit (LEO) using commercial off-the-shelf (COTS) components. Extending its use to deep-space and long-duration missions requires computing platforms capable of operating in harsh radiation environments without incurring prohibitive costs. This is particularly important for systems implemented on field-programmable gate arrays (FPGAs), which are widely used in space applications due to their scalability and support for various radiation-hardening profiles. This flexibility makes them preferable to application-specific integrated circuits, whose low production volumes often do not justify their use in space systems. This article details our efforts to develop a novel system-on-chip (SoC) featuring GPU@SAT, a soft graphic processing unit IP provided by IngeniArs S.r.l., with a focus on enhancing its robustness when implemented on the Xilinx radiation-tolerant XQRKU060 FPGA. This approach leverages FPGA reconfigurability and radiation tolerance alongside the high-performance capabilities of graphic processing units (GPUs), unlocking significant potential for accelerating compute-intensive tasks in edge-space applications. Building on this foundation, we employ a classification-based methodology to improve the fault tolerance of GPU@SAT, analyzing trade-offs on performance, power, and area. Additional IP cores, such as a watchdog timer (WDT), a dual RISC-V lockstep with rollback, and the Xilinx Soft Error Mitigation IP, are employed to further improve the SoC robustness. Fault injection on an FPGA prototype targeting essential configuration bits validates the approach, showing a 15.4× improvement in mean injections before failure over the baseline architecture, with only a 1.2× increase in hardware complexity.

Keywords: Artificial Intelligence (AI), fault tolerance, field-programmable gate arrays (FPGAs), onboard computing, soft graphic processing unit (GPU), space, system-on-chip (SoC)

DOI: https://doi.org/10.1109/TVLSI.2025.3597044