“I believe there are still many unsolved problems in machine learning, which makes for a fascinating research area!”
Abraham Chan is a PhD student in Electrical and Computer Engineering studying machine learning; his research focuses on building software tools to assist developers in implementing, debugging, and testing ML applications. Along with co-authors Niranjhana Narayanan, Arpan Gujarati, Karthik Pattabiraman, and Sathish Gopalakrishnan, his paper “Understanding the Resilience of Neural Network Ensembles against Faulty Training Data” won the Best Paper award at QRS’21. The paper evaluates the resilience of machine learning ensembles against faulty training data, in order to understand how to build better ensembles: ones resistant to mislabelled, missing, or duplicated data.
We spoke to Abraham to learn more about the impacts of faulty data in machine learning, the process behind this paper, and how Abraham and his colleagues’ research will affect this field.
Why is it important to research ML training data?
Machine learning (ML) is being used in many domains today, including safety-critical systems like autonomous vehicles and medical diagnosis. Unlike traditional software programs, whose behaviour is specified in code, ML applications are driven entirely by training data; this means that the quality of the training data is crucial to ensuring the correct behaviour of these applications. However, we have seen that many training datasets, even those used for training autonomous vehicles, contain mislabelled or incomplete data.
What was the process of developing this paper’s research like?
After some initial experimentation, we realized that individual ML models are susceptible to faulty training data, meaning that models trained on it misclassify test inputs much more easily. Inspired by N-version programming in traditional software engineering, we explored the feasibility of using neural network ensembles to tolerate these faults. Neural network ensembles consist of multiple individual networks learning in tandem and combining their results through simple majority voting. We discovered that these ensembles are actually quite effective against training data faults, and set out to understand why, and how to build ensembles that maximize resilience. We tried many different ensemble configurations, which enabled us to identify trends and devise metrics for reasoning about ensembles. Much like how ML applications are data-driven, our paper was largely driven by experimental data!
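The majority-voting combination Abraham describes can be sketched in a few lines of Python. This is a minimal illustration, with made-up model names and labels, not code from the paper:

```python
from collections import Counter

def ensemble_predict(models, x):
    """Combine the predictions of several independently trained models
    by simple majority voting.

    `models` is a list of callables mapping an input to a class label;
    the names here are illustrative, not the paper's actual interface.
    """
    votes = [model(x) for model in models]
    # The label predicted by the most models wins; ties resolve
    # arbitrarily to one of the tied labels.
    label, _count = Counter(votes).most_common(1)[0]
    return label

# Toy demo: even if one model was misled by faulty training data,
# the majority of the ensemble can still produce the right label.
models = [lambda x: "cat", lambda x: "cat", lambda x: "dog"]
print(ensemble_predict(models, x=None))  # → cat
```

The intuition mirrors N-version programming: as long as faults affect each model differently, a majority of models is unlikely to make the same mistake on the same input.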
What are the potential uses of your research? What effect will this research have?
Most research in ML is focused on boosting the accuracy of models, with much less attention paid to the resilience of ML models against faulty training data. Hopefully, our work will be able to demonstrate that high accuracy ML models are not necessarily more resilient, and will raise awareness of the importance of quality training data when developing ML applications.
Furthermore, since we find that ensembles are an effective strategy against training data faults, our insights on ensembles can help developers improve resilience in their own applications.
What’s something unexpected about this topic?
Faulty training data, such as mislabelled, incomplete, and repeated training data, is quite prevalent in real life! For example, we encountered a study that found that 30% of a public training dataset used to train many autonomous vehicles was faulty. Unfortunately, in the age of big data and artificial intelligence, large-scale data collection processes can inherently introduce faults; it’s just not feasible to manually verify all the data. However, I can assure you that you do not want to be in an autonomous vehicle that’s been trained with poor quality data!
What draws you to this field of research?
ML is a field that is still in its infancy. Concerns like ML security and reliability are still often overlooked or brushed aside. This is both a curse and a blessing: I believe there are still many unsolved problems in this space, which makes for a fascinating research area!