Transforming Hardware Faults in Machine Learning Systems

ECE Professor Karthik Pattabiraman and his research group, including ECE PhD student Zitao Chen, who is graduating this year, and Assistant Professor Guanpeng Li of the University of Iowa, an ECE alumnus, study machine learning (ML) applications in high-stakes environments such as autonomous driving and their resilience to transient hardware faults. Computer systems are composed of software and hardware working together. Hardware, however, can experience faults caused by cosmic rays, stress on the system, or aging, any of which can make a system malfunction. Dr. Pattabiraman’s team focuses on understanding and improving the error resilience of ML systems to such hardware faults.

The researchers’ paper “A Low-cost Fault Corrector for Deep Neural Networks through Range Restriction” was recently named an IEEE Top Pick in Test and Reliability (TPTR). The award recognizes seven papers worldwide published in the computer systems reliability area over the last six years. The paper was also a Best Paper runner-up at the DSN’21 conference, the leading venue in dependable computing research.

Computation Errors

When hardware malfunctions, the software running on it is affected, and computation errors result. If these erroneous values propagate through the computation, the software can emit a faulty output, such as causing a self-driving system to misidentify a traffic sign. Usually, when a fault corrupts a computation, the result can simply be recomputed. However, recomputation is costly and particularly challenging for time-sensitive applications, such as a self-driving car in operation.
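To make the failure mode concrete, here is a minimal sketch of how a single transient bit flip can corrupt a floating-point value; the `flip_bit` helper and the chosen bit position are illustrative, not taken from the paper:

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of a 32-bit float, mimicking a transient hardware fault."""
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    (faulty,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return faulty

# A single flip in a high-order exponent bit turns a benign value
# into an enormous one that can corrupt downstream computation.
clean = 0.5
faulty = flip_bit(clean, 30)  # bit 30 is the most significant exponent bit
print(clean, faulty)
```

A flip in a low-order mantissa bit barely changes the value, but a flip in a high exponent bit here inflates 0.5 to roughly 10^38, which is the kind of abnormally large value the team later observed to be a hallmark of output-corrupting faults.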

“Therefore, the focus of this study is, how can we effectively mitigate hardware faults to ensure the functional safety of ML systems without causing significant runtime delays?”

Dr. Pattabiraman

Process of Developing the Research

To prevent hardware faults from corrupting the program output, Dr. Pattabiraman’s team starts by analyzing why these faults can lead to output corruption. Dr. Pattabiraman explains how his team focused on their area of research, “In a previous study by our team [1], we found that a specific fraction of hardware faults are more prone to causing incorrect program output. We identified that these faults tend to result in abnormally large computation values during software computation, a characteristic influenced by the mathematical properties of ML models [1].

“Building on this insight, in this research, we introduced a technique called Ranger [2] to automatically reduce the damage caused by hardware faults, thereby preventing the program output from being corrupted. Ranger performs range restriction during software computation (hence the name), ensuring the values of the software computation fall within an accepted statistical range. If there is a value that falls outside this range, it brings this erroneous value back to a safe region.”
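The range-restriction idea can be sketched in a few lines; this is a simplified illustration of the concept, assuming activation bounds profiled from fault-free runs, with made-up values rather than Ranger's actual implementation:

```python
import numpy as np

def restrict_range(activations: np.ndarray, low: float, high: float) -> np.ndarray:
    """Clamp a layer's outputs to a profiled safe range [low, high]."""
    return np.clip(activations, low, high)

# Illustrative bounds, e.g. profiled from fault-free runs (values are made up).
SAFE_LOW, SAFE_HIGH = 0.0, 8.0

# 3.4e12 mimics a fault-corrupted activation; the rest are normal values.
layer_output = np.array([0.3, 2.1, 7.9, 3.4e12, 1.2])
corrected = restrict_range(layer_output, SAFE_LOW, SAFE_HIGH)
print(corrected)
```

Because clipping is a cheap elementwise operation, inserting it after selected layers bounds the damage a fault can do without the runtime cost of recomputation.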

In the above example, Ranger is correcting an autonomous vehicle’s faulty computer vision. On the far left, the vehicle recognizes the road correctly. In the center, a fault has occurred, affecting the vehicle’s ability to infer which way to go and directing it into traffic. In the right image, Ranger has corrected this error and the vehicle can proceed safely.

Future Impacts

“As ML applications continue to rely on high-performance hardware, we believe Ranger will have a broader impact in both academia and industry.”

Dr. Pattabiraman

Academic researchers have been building improved solutions on top of the Ranger system to combat hardware faults in emerging ML applications such as Vision Transformers. Meanwhile, the Ranger system has been adopted by Intel’s OpenVINO toolkit for practical use, and Dr. Pattabiraman and his group believe it can benefit practitioners in building dependable ML systems in wider application domains, such as large language models.

Learn more about Dr. Pattabiraman’s Research

[1] Zitao Chen, Guanpeng Li, Karthik Pattabiraman, and Nathan DeBardeleben, “BinFI: An Efficient Fault Injector for Safety-Critical Machine Learning Systems,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019.

[2] Zitao Chen, Guanpeng Li, and Karthik Pattabiraman, “A Low-cost Fault Corrector for Deep Neural Networks through Range Restriction,” in Proceedings of the 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2021.