Autonomous IoT: Advances in Adaptive Fault Diagnosis and Resilience

The Evolving Landscape of Autonomous Driving

The world of autonomous driving has undergone a remarkable transformation in recent years. Advanced driver assistance systems (ADAS) have been progressively pushed further, with increasingly sophisticated algorithms such as deep neural networks assuming responsibility for critical driving functionality, including operating the vehicle at various levels of autonomy.

Elaborate obstacle detection, classification, and prediction algorithms, mostly vision-based, along with advanced trajectory planning and smooth control algorithms, have taken over what was once the domain of human drivers. Even where humans remain in the loop, for example to intervene in case of error as required by autonomy Levels 2 and 3, it remains questionable whether distracted human drivers will react appropriately given the high speeds at which vehicles drive and the complex traffic situations they must cope with.

A further pitfall is trusting the whole autonomous driving stack not to fail due to accidental causes and to be robust against cyberattacks of increasing sophistication. As the complexity of autonomous driving systems (ADSs) continues to grow, so too do the risks of vulnerabilities and potential failure points.

Addressing the Challenges of Autonomous Driving

In this experience report, we share our findings in retrofitting application-agnostic resilience mechanisms into an existing hardware-software stack for autonomous driving, known as Apollo. Our goal is ultimately to decrease the vulnerability of autonomously driving vehicles to accidental faults and attacks, allowing them to absorb and tolerate both, and to emerge from an attack at least as secure as they were before it happened.

We demonstrate replication and rejuvenation on the driving stack's Control module and indicate how this resilience can be extended both downwards to the hardware level and upwards to the prediction and planning modules. Over the years, autonomously driving vehicles (ADVs) have been progressively equipped with increasingly elaborate features to enhance driving experience and autonomy, ranging from high-resolution sensors to deep neural networks.

This increasing sophistication forms the backbone of indispensable computer vision algorithms enabling precise obstacle perception, optimized trajectory planners, and smooth control algorithms, and has steadily pushed the level of driving automation toward a higher standard. However, as in any cyber-physical system (CPS), this complexity also comes with increased vulnerability.

Vulnerabilities in Autonomous Driving Systems

As the complexity of ADSs grows, new pathways for malicious intrusions open up, and the appearance of new faults becomes more probable, consequently resulting in dangerous and sometimes fatal outcomes. Unfortunately, over the last 20 years there has been no shortage of bad outcomes involving ADVs, among them unintended accelerations (UAs) that killed 89 people.

Erroneous behaviors such as UAs that relate specifically to the components enabling the autonomy of ADVs, independently of the human driver, can have two origins: accidental, due to an internal fault or a missing fail-safe mechanism, or provoked, due to the presence of a malicious attacker.

Only by convincing the human driver of its trustworthiness can automation take over. In this regard, the resilience of ADVs plays a crucial role in enabling their adoption by the general public at scale. In other words, because faults at some level will inevitably occur, infusing ADVs with mechanisms to tolerate those faults is of utmost importance.

The Responsibility Gap in Autonomous Driving

In the presence of faults, the notion of a responsibility gap arises naturally. This gap is characterized by situations in which it is unclear who can justifiably be held responsible for an unintended catastrophic outcome. The width of this gray zone is further amplified by the over-reliance of modern-day ADV modules on artificial neural networks (NNs).

Not only are the safety and reliability of these modules rarely studied in conjunction with the whole ADV software stack, but they also inherit the connectionist drawback of non-explainability. In addition to their black-box nature, NNs are subject to the usual faults, which can originate in software or in hardware.

On the software side, NNs can be reprogrammed, evaded, and data-poisoned by malicious intruders during the inference and/or training phase. At the hardware level, transient or permanent faults such as stuck-at or bit-flip faults can alter the parameter space of the NN or corrupt the computation of hidden-layer activations.
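
To make the hardware fault model concrete, the following minimal Python sketch (ours, not from any ADV codebase) shows how flipping a single bit of a stored float32 parameter can change a weight by orders of magnitude:

```python
import struct

def flip_bit(weight: float, bit: int) -> float:
    """Emulate a hardware bit-flip fault in a float32 NN parameter."""
    packed = struct.pack(">f", weight)            # float32 -> 4 bytes
    corrupted = int.from_bytes(packed, "big") ^ (1 << bit)
    return struct.unpack(">f", corrupted.to_bytes(4, "big"))[0]

w = 0.125
print(w, "->", flip_bit(w, 30))   # flipping a high exponent bit: 0.125 -> ~4.3e37
```

A single such corruption in a convolutional kernel silently changes every downstream activation, which is precisely why these faults are hard to catch without redundancy.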

Similarly, sensors are not spared from attacks. Bad actors can fool the lane-keeping system by placing dirty road patches that ultimately cause the ADV to drift out of its lane, jam the camera modules, or conduct LIDAR spoofing attacks that inject false obstacle depth. All of these produce false sensor data and hence cause the ADV data-processing chain to compute erroneous control commands.

The Apollo Autonomous Driving Software Stack

The Apollo software stack, maintained by Baidu, is composed of a chain of interlocked modules that process information from the Perception module down to the Control module, where the planned trajectory is transformed into ECU instructions. Because of this downstream interdependence, in which the computation and safe execution of control commands are causally interlinked, the failure of an intermediary module can propagate through the information-processing chain and lead to unexpected behaviors.
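
As a rough illustration of this causal chain (hypothetical stand-in functions; Apollo's actual modules communicate over the Cyber RT middleware rather than through direct calls), a fault injected at any intermediate stage corrupts everything computed after it:

```python
def perception(frame):    return {"obstacles": list(frame["detections"])}
def prediction(percepts): return {"trajectories": percepts["obstacles"]}
def planning(predicted):  return {"path": predicted["trajectories"][:1]}
def control(plan):        return {"throttle": 0.3 if plan["path"] else 0.0, "steering": 0.0}

PIPELINE = [perception, prediction, planning, control]

def drive_cycle(sensor_frame):
    msg = sensor_frame
    for module in PIPELINE:
        msg = module(msg)   # a faulty output here propagates to every later stage
    return msg              # the ECU-level command

print(drive_cycle({"detections": ["pedestrian@12m"]}))  # {'throttle': 0.3, 'steering': 0.0}
```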

Efforts have been made to overcome the existence of single points of failure: for example, perception information is rendered redundant by gathering raw data from independent modalities (RGB cameras, LIDAR, and RADAR) and fusing them to dilute the influence of a possibly faulty device. However, redundancy implies additional computational cost, and modules that comprise GPU-hungry NNs cannot be cheaply replicated.
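
A minimal sketch of such modality-level redundancy (our own illustration, with simplified obstacle IDs): an obstacle is kept only if at least two of the three independent modalities report it, so a single faulty or spoofed sensor can neither inject nor suppress an obstacle on its own.

```python
def fuse(camera: set, lidar: set, radar: set) -> set:
    """Keep obstacles reported by at least 2 of 3 independent modalities."""
    votes: dict[str, int] = {}
    for modality in (camera, lidar, radar):
        for obstacle in modality:
            votes[obstacle] = votes.get(obstacle, 0) + 1
    return {obstacle for obstacle, count in votes.items() if count >= 2}

# A LIDAR spoofing attack injecting ghost obstacle "X" is diluted away:
print(fuse({"car1"}, {"car1", "X"}, {"car1"}))   # {'car1'}
```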

Validating Autonomous Driving Systems

Validating ADV software in a real physical environment is costly and does not scale sufficiently to cover all possible driving scenarios. Moreover, deploying ADV software directly on-road can be dangerous. Hence, interfacing physics simulators such as SVL with ADV software stacks is of fundamental importance to guarantee quality assurance in the automotive sector, as required by the evolving standard ISO 21448 (Safety of the Intended Functionality, SOTIF).
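
Concretely, SVL exposes a Python API (package lgsvl) through which a scenario can be loaded and the ego vehicle connected to Apollo's bridge. The sketch below follows that API, though scene and vehicle asset names vary across SVL releases and should be treated as placeholders:

```python
import lgsvl

sim = lgsvl.Simulator(address="127.0.0.1", port=8181)
sim.load("BorregasAve")                       # example map asset

state = lgsvl.AgentState()
state.transform = sim.get_spawn()[0]          # place the ego at the first spawn point
ego = sim.add_agent("Lincoln2017MKZ", lgsvl.AgentType.EGO, state)

# Hand simulated sensor data to the Apollo stack and receive control commands back.
ego.connect_bridge("127.0.0.1", 9090)

sim.run(time_limit=30.0)                      # drive the scenario for 30 s
```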

Relevant to the study presented herein, researchers have stress-tested the autopilot enabled by the Apollo ADV software stack inside the SVL simulation environment by generating a set of edge cases where the Perception module was unable to detect pedestrians. Similarly, other studies have created test cases in the SVL simulator to assess the safety of the Apollo ADV software stack and observed that the perception modules failed to detect pedestrians in a significant percentage of the scenarios tested.
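
Such edge cases can be reproduced programmatically. Building on the interfacing sketch above, a pedestrian NPC is spawned near the ego's path and the run is then inspected for a detection (the asset name, offset, and check are placeholders, not a prescribed test procedure):

```python
import copy
import lgsvl

# Assumes `sim` and `ego` from the previous sketch are already set up.
spawn = sim.get_spawn()[0]
ped_state = lgsvl.AgentState()
ped_state.transform = copy.deepcopy(spawn)
ped_state.transform.position.x += 20.0   # place the pedestrian ~20 m ahead (map-dependent)
sim.add_agent("Pamela", lgsvl.AgentType.PEDESTRIAN, ped_state)

sim.run(time_limit=15.0)
# Afterwards, inspect Apollo's perception channel (e.g. via cyber_monitor)
# to check whether the pedestrian was ever reported as an obstacle.
```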

Retrofitting Resilience Mechanisms into Apollo

In this study, we document our work in designing fault and intrusion tolerant (FIT) mechanisms for ADVs, demonstrating the feasibility of applying those mechanisms to the sub-components of the Apollo ADV software stack and testing them in different driving scenarios generated by the SVL simulator.

We give a qualitative description of a novel recovery scheme that enables the ADV, upon detection of a fault at the sensor level, to maintain a stable trajectory by leveraging the availability of predicted future sensor values, which, once verified, are fed back to the Prediction module while the Perception module reboots.

The recovery scheme we propose falls into the class of shallow recovery, which comprises methods that repair a compromised sub-component of a CPS with minimal or no operation on the system states. This is in contrast to deep recovery techniques, which involve full system-state restoration and can be more costly and time-consuming.
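
The control flow of such a shallow recovery might look as follows (a sketch under our own naming; `perception`, `prediction`, and `state_storage` are assumed handles to the respective modules, not Apollo APIs):

```python
import time

def shallow_recover(perception, prediction, state_storage, reboot_deadline_s=2.0):
    """Reboot the compromised Perception module in place, bridging the gap
    with verified stored predictions instead of restoring global system state."""
    perception.terminate()                       # evict a potential intruder
    perception.start()                           # reboot from a clean image
    deadline = time.monotonic() + reboot_deadline_s
    while not perception.is_ready():
        if time.monotonic() > deadline:
            raise RuntimeError("reboot deadline missed: escalate to fail-safe stop")
        stored = state_storage.next_verified()   # predicted obstacle trajectories
        if stored is None:
            raise RuntimeError("state storage depleted: escalate to fail-safe stop")
        prediction.feed(stored)                  # keep Planning supplied meanwhile
        time.sleep(0.1)                          # one planning cycle (illustrative)
```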

Embedding Apollo in a Simulated Environment

We lay out the system architecture of the popular Apollo ADV software stack and give a description of how to embed it in a simulated physics environment using SVL. We highlight some vulnerabilities of Apollo and describe the threat model we consider in our work.

We then describe the FIT mechanisms that we implemented in three Apollo modules: the Control module, the Perception module, and the CANbus communication. We showcase and validate two of these mechanisms by interfacing the Apollo ADV software stack with the SVL simulation environment.

Fault-Tolerant Control Module Replication

For the Control module, we replicate the module three times and introduce a voter module that consolidates the control outputs of the individual replicas, masking up to f = 1 faults behind an f + 1 majority of correct outputs. We simulate a worst-case instance of one faulty replica by adding white noise to its output and observe stable trajectory execution despite the faulty replica, at an overhead of 70 MiB of RAM per additional replica and 2 ms of added latency.
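
Since control outputs are continuous values, a per-field median is one natural way to realize such majority masking: with 2f + 1 = 3 replicas and at most f = 1 faulty one, the median always lies between two correct outputs. A minimal sketch (field names are representative, not Apollo's exact message schema):

```python
from statistics import median

def vote(replica_outputs: list[dict]) -> dict:
    """Consolidate the outputs of 2f+1 = 3 Control replicas by per-field median."""
    assert len(replica_outputs) == 3
    return {field: median(out[field] for out in replica_outputs)
            for field in ("throttle", "brake", "steering")}

good  = {"throttle": 0.30, "brake": 0.0, "steering": 0.05}
noisy = {"throttle": 0.91, "brake": 0.4, "steering": -0.70}   # faulty replica
print(vote([good, dict(good), noisy]))   # the white noise is masked out
```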

Perception Module Rejuvenation

For the Perception module, we design a recovery scheme that repairs the compromised module and reboots it fast enough to evict adversaries and ensure that the Planning module is supplied with timely obstacle predictions for computing a safe ego-car trajectory. During the Perception module reboot, we leverage a State Storage module that records the predicted obstacle trajectories and makes them available to the Prediction module.

We validate this mechanism in simulation scenarios with high and low predictability, demonstrating that the State Storage module could reliably supply the Prediction module with enough verified stored internal states before depletion, avoiding disruption in the ego-car planning behavior.
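
The State Storage mechanism can be pictured as a bounded buffer of verified predicted trajectories whose depth bounds how long a Perception reboot can be bridged before depletion forces a fail-safe reaction (a sketch under assumed naming, consistent with the recovery loop above):

```python
from collections import deque

class StateStorage:
    """Bounded buffer of verified predicted obstacle trajectories."""

    def __init__(self, horizon_steps: int):
        self._buf = deque(maxlen=horizon_steps)   # oldest entries roll off

    def record(self, predicted_trajectories, verified: bool) -> None:
        if verified:                              # only verified states are stored
            self._buf.append(predicted_trajectories)

    def next_verified(self):
        return self._buf.popleft() if self._buf else None   # None = depleted

    def remaining_bridge_time(self, cycle_s: float) -> float:
        return len(self._buf) * cycle_s           # how long a reboot can be covered

storage = StateStorage(horizon_steps=30)          # e.g. 3 s of cover at a 100 ms cycle
```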

Toward Resilient Autonomous Driving

In this study, we reported on the results and findings of a case study retrofitting an autonomous driving software stack with fault and intrusion tolerance mechanisms. We laid out a powerful methodology to design, test, and validate those mechanisms in interaction with the complex logic of the Apollo ADV software stack, which we embedded inside the SVL simulator.

We were able not only to study the feasibility of the developed schemes but also to showcase their efficacy by measuring relevant metrics. We hereby stress the importance of validation through simulation, which is fundamental for establishing quality assurance before deploying software in conventional on-road testing.

Our work highlights the significance of infusing resilience mechanisms into autonomous driving systems to decrease their vulnerability to accidental faults and attacks, ultimately enhancing the trust and adoption of ADVs by the general public. As the complexity of these systems continues to grow, the need for robust and adaptive fault diagnosis and recovery solutions becomes increasingly crucial.

The sensor networks community plays a vital role in driving the development of these technologies, contributing innovative approaches to improve the reliability, security, and resilience of autonomous systems. By bridging the gap between theory and practice, researchers and practitioners can work together to realize the full potential of autonomous IoT and secure the future of transportation.
