Reliability of Deep Neural Network Accelerators and Applications, a growing concern, says Dr. Pattabiraman - Department of Electrical and Computer Engineering

A case where a self-driving car mistakes a transport truck (left) for a bird (right) due to soft errors in the hardware

Self-driving cars are the way of the future, and are poised to revolutionize the transportation industry. It is estimated that self-driving cars are only a few years away from large-scale adoption. Further, self-driving cars have the potential to reduce the number of accidents and fatalities, and to lower transportation costs.

Unfortunately, there are still many unanswered questions about the safety of self-driving cars and their overall reliability. Deep neural networks (DNNs) are a core component of self-driving cars, where they are used to recognize objects on the road and classify them as potential hazards. For example, when a pedestrian crosses a road that the self-driving car is driving on, it is the DNN that recognizes the pedestrian and tells the car to either slow down or come to a stop. DNNs are therefore critical to the correct operation of the car, and need to be fast enough to keep up with the car’s operation in real time. To execute DNNs quickly, researchers have proposed specialized hardware accelerators. Self-driving cars are expected to adopt such DNN accelerators to ensure fast recognition of objects in real time.

ECE’s Dr. Pattabiraman and his team have investigated the reliability of DNN accelerators deployed in self-driving cars. A major challenge in future hardware systems is that they are susceptible to soft errors, which are caused by cosmic rays or high-energy particles striking electronic components and upsetting their operation. Such errors can lead to incorrect values being computed by the chip, and become a serious issue when the chips are used in safety-critical systems such as self-driving cars. In particular, these cars have to adhere to strict safety standards such as ISO 26262, which mandates that failures due to soft errors occur no more than 10 times per billion hours of operation (a rate of 10 FIT, where FIT stands for failures in time). Unfortunately, Dr. Pattabiraman’s research team found that current DNN accelerators do not meet this safety standard, with FIT rates that far exceed the mandated maximum. This can lead to catastrophic consequences for the self-driving car, such as mistaking a truck for a bird (see image), potentially leading to accidents. To address this, the researchers also propose two techniques to mitigate the effects of faults in DNNs, and show that their techniques incur very little overhead in practice.
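To make the two ideas above concrete, here is a minimal Python sketch. It is not the research team's fault-injection method or code. The helper names `flip_bit` and `fit_rate` are hypothetical, the sketch assumes values are stored as 32-bit floats, and the fleet numbers in the usage note are invented for illustration. The first function shows how a single flipped bit can turn a small number into an enormous one, and the second converts a failure count into a FIT rate (failures per billion device-hours).

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its float32 encoding flipped.

    Bit 0 is the least significant mantissa bit and bit 31 is the sign bit.
    This only mimics the effect of a soft error on a stored value.
    """
    if not 0 <= bit < 32:
        raise ValueError("bit must be in the range 0..31")
    # Reinterpret the float32 bytes as an unsigned 32-bit integer.
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    # XOR with a one-bit mask flips exactly that bit.
    corrupted = as_int ^ (1 << bit)
    # Reinterpret the corrupted integer as a float32 again.
    (result,) = struct.unpack("<f", struct.pack("<I", corrupted))
    return result


def fit_rate(failures: float, device_hours: float) -> float:
    """Convert a failure count into FIT (failures per 10**9 device-hours)."""
    if device_hours <= 0:
        raise ValueError("device_hours must be positive")
    return failures * 1e9 / device_hours


if __name__ == "__main__":
    # Flipping the top exponent bit of 0.5 yields 2**127 (about 1.7e38).
    print(flip_bit(0.5, 30))
    # Hypothetical fleet: 3 failures observed over 2 million device-hours.
    print(fit_rate(3, 2_000_000))
```

In this sketch, flipping bit 30 of the value 0.5 produces 2^127, which hints at why one corrupted value could change what a DNN reports. The hypothetical fleet with 3 failures in 2 million device-hours works out to 1,500 FIT, far above the 10 FIT budget mentioned above.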

This research was led by Guanpeng (Justin) Li, a PhD student in Dr. Pattabiraman’s group. Justin’s PhD thesis focuses on protecting software from hardware faults at low cost, and he became interested in the resilience of DNN accelerators while interning at Nvidia Research. According to Justin, the main challenge is identifying the unique error resilience properties and characteristics of DNN applications and accelerator structures. According to Dr. Pattabiraman, this research is only the tip of the iceberg, and can spawn many future projects on improving the error resilience of machine learning algorithms such as DNNs. He says, “DNN systems are very different from traditional computer systems, and we are only now beginning to understand their corner cases and resilience implications.”

The paper describing this work, co-authored with researchers from Nvidia, will appear in the IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis, also known as SC 2017 (https://sc17.supercomputing.org/). SC is the premier venue in the high-performance computing (HPC) area, and attracts over 10,000 researchers worldwide.

The full paper is available here