Bo Fang, an ECE Ph.D. student, has been awarded the William C. Carter PhD Dissertation Award in Dependability for his Ph.D. thesis, titled “Approaches for Building Error Resilient Applications.” The William C. Carter PhD Dissertation Award in Dependability is the most prestigious award a Ph.D. student can receive in the dependable computing field. It has been presented annually at the DSN Conference since 1997 to one Ph.D. student worldwide. The award recognizes an individual who has made a significant contribution to the field of dependable computing through their graduate dissertation research. Fang says, “it is an honor to be awarded by the DSN community. I am humbled to be recognized by the community and my colleagues.”
The award is sponsored by the IEEE TC on Dependable Computing and Fault Tolerance (TCFT) and IFIP Working Group 10.4 on Dependable Computing and Fault Tolerance (WG 10.4) to commemorate the late William C. Carter, who was a pioneer in the formation and development of the field of dependable computing.
Fang is supervised by Dr. Karthik Pattabiraman and Dr. Matei Ripeanu. His research focuses on the effect hardware faults have on high-performance computing (HPC) systems. Fang’s research proposes an error propagation model and a crash model to identify which faults have the potential to cause silent data corruption and crashes, so that recovery can be triggered selectively. This stems from the idea that most transient hardware faults do not have a significant impact at the software layer. Ignoring faults that do not create problems allows HPC systems to be more efficient. Fang additionally proposes applying a roll-forward recovery scheme in standard checkpoint/restart systems. This trades some confidence in results for gains in performance and energy savings.
Bo Fang’s research relates to the award in that the “research focuses on designing approaches for building error-resilient applications, in the context of high-performance computing scenarios.” This is closely in line with the research performed by the dependability community. His work has been published in top-tier venues and has inspired many other researchers to write follow-up papers based on his research. Bo’s work has been adopted by two national labs, Pacific Northwest National Laboratory (PNNL) and Los Alamos National Laboratory (LANL), as well as by companies such as Nvidia and AMD. Bo is a recipient of the NSERC Postdoctoral Fellowship and was ranked number two in the computer science division. He is currently doing a post-doc at Pacific Northwest National Laboratory (PNNL).
Dr. Pattabiraman states that Fang’s work is significant because it allows “High-Performance Computing (HPC) systems [to] be much more efficient in terms of performance and energy when it comes to providing fault-tolerance. The latter is especially important as these systems consume large amounts of energy for their operation and hence Bo’s work provides significant cost-savings in these systems.”