ECE Professor Karthik Pattabiraman (second from the right) and his MASc student and co-author Xin Chen (far right)
Last month, the 30th International Symposium on Software Reliability Engineering (ISSRE 2019), held in Berlin, Germany, announced that ECE Professor Karthik Pattabiraman’s paper was chosen as one of the most influential papers in the conference’s 30-year history. Only 26 of more than 1,000 papers were selected. ISSRE is considered the top conference in the software reliability area. The paper, titled “Failure analysis of jobs in compute clouds: A Google cluster case study,” was first published at ISSRE in 2014 and was written by Pattabiraman, his MASc student Xin Chen at UBC, and Charng-da Lu, a research collaborator in the US. More details about the chosen papers can be found here.
The paper analyzes the reliability of a real Google data center using job scheduling traces. The traces come from a dataset publicly released by Google, covering one month of job scheduling data for one of its data centers. It was one of the few publicly available papers featuring failure data from real data centers, and the first to analyze this dataset for application failures and their root causes.
There are three main research findings: (1) many of the observed failures were due to a small percentage of the jobs, and these jobs consumed significant resources; (2) simple measures, such as periodic restarts and terminating jobs that are resubmitted too often, can mitigate many of the failures; and (3) it is possible to predict whether a job will complete successfully, even halfway into its execution, based on its resource consumption patterns.
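As an illustration of the second finding, the idea of terminating jobs that are resubmitted too often can be sketched in a few lines of Python. This is a minimal sketch, not the method used in the paper: the function name `jobs_to_terminate`, the job IDs, and the cap of three failed attempts are all hypothetical, since the article does not state a specific threshold or data format.

```python
from collections import Counter

# Hypothetical cap on failed attempts; the article gives no specific value.
MAX_FAILED_ATTEMPTS = 3


def jobs_to_terminate(failure_events, max_failed_attempts=MAX_FAILED_ATTEMPTS):
    """Return the sorted IDs of jobs whose failed attempts exceed the cap.

    failure_events is an iterable of job IDs, with one entry per failed
    attempt (an assumed input format, chosen only for illustration).
    """
    failures = Counter(failure_events)
    return sorted(
        job_id for job_id, count in failures.items()
        if count > max_failed_attempts
    )


if __name__ == "__main__":
    # Made-up events: job-a failed six times, job-b twice, job-c once.
    events = ["job-a"] * 6 + ["job-b"] * 2 + ["job-c"]
    print(jobs_to_terminate(events))
```

In this made-up example only `job-a` exceeds the cap, so it is the only job flagged for termination. A real scheduler would also need to weigh how many resources each job has already consumed.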
Cloud data center operators can use these results to mitigate job failures and save energy. They can also take remedial measures to improve the reliability of the overall data center, and take preventive action before a failure occurs. Together, the findings can lead to more reliable and energy-efficient cloud data centers.
The importance of the findings is twofold. First, as cloud computing becomes more prominent, the reliability of jobs running on the cloud is increasingly important. Many critical applications, such as banking and healthcare, are being hosted on the cloud, so it is vital to ensure high availability of the cloud. Second, cloud data centers consume a significant and rapidly growing amount of energy, so it is important to avoid wasting energy on jobs that are ultimately going to fail, by terminating them early.
The new element is that the study by Pattabiraman and his team is the first to consider failures of cloud applications (i.e., jobs) at scale in a real-world, production data center. Previous studies either considered small-scale deployments, and hence could not observe failures that only manifest at large scale, or considered only hardware failures of individual nodes without considering the applications running in the cloud, and hence did not observe the software manifestations of the failures.
The paper can be downloaded here.
Learn more about the ISSRE conference here.
About Karthik Pattabiraman
Karthik Pattabiraman is an Associate Professor in the Electrical and Computer Engineering Department at UBC. His research interests are in dependable and secure computer systems. He received his PhD in Computer Science from the University of Illinois at Urbana-Champaign (UIUC) and has won multiple awards for his research, such as the Killam Faculty Research Prize and the NSERC Discovery Accelerator Supplement award. His webpage can be found here.