Robustness Testing of AI Systems

Standard model evaluation involves measuring accuracy (or other relevant metrics) on a hold-out test set. However, performance on such a test set does not always reflect the model’s ability to perform in the real world. This is because a fundamental assumption when deploying AI models is that all future data follows a distribution similar to that of the training data. In practice, it is very common to encounter data that is statistically different from the training set, which can cause AI systems to become brittle and fail.

An AI model will always be exposed to a variety of new inputs after deployment, because the testing data is limited (i.e. a finite subset of all the data available). The aim of robustness testing is therefore to assess how the model behaves on such new inputs and to identify its limitations before deployment. One way to achieve this is to curate additional data from other sources to test the model more comprehensively, but that can be difficult in practice. An alternative approach is to introduce mutations into the test data, systematically mutating it towards new and realistic inputs similar to what the AI system will encounter in the real world. This forms the basis of many robustness testing techniques.

The research community has developed many different approaches[1] for robustness testing, which can be broadly categorised into white-box[2] and black-box testing[3]. White-box testing requires knowledge of the way the system is designed and implemented, whereas black-box testing only requires the system’s outputs in response to a certain input. These different testing techniques provide different insights about the models.

Evaluating the robustness of a Computer Vision (CV) deep learning model with NTU DeepHunter

One white-box robustness testing tool that we are exploring comes from AI Singapore’s collaborator, the NTU Cybersecurity Lab (CSL). We will briefly introduce the tool before sharing our insights from using it with a computer vision use case.

In traditional software testing, fuzzing is used to detect anomalies by randomly generating or modifying inputs and feeding them to the system[4]. A complementary concept is test coverage, which measures how much of the program has been tested and is used to quantify the rigour of the test. The goal is to maximise test coverage and uncover as many bugs as possible.

Analogously, fuzz testing can also be applied to machine learning systems. The NTU CSL group under Prof Liu Yang developed DeepHunter[5], a fuzzing framework for identifying defects (cases where the model does not behave as expected) in deep learning models. DeepHunter aims to increase the overall test coverage by applying adaptive heuristics based on run-time feedback from the model. We will attempt to give a brief overview of the tool in the next few paragraphs.

A key component of the fuzzing framework is the mechanism by which new inputs to the system are generated: metamorphic mutations. Metamorphic mutations are transformations of the input that are expected either to leave the predictive output unchanged or to change it in a known, expected way[7]. These transformed inputs are known as mutants. For example, mutations for CV tasks can include varying the brightness of the picture or performing a horizontal flip. For NLP tasks, they can include contracting words or replacing words with their synonyms. The mutation strategies should be specified by the user depending on their use case and requirements.
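To make the idea concrete, here is a minimal sketch of the two CV mutations mentioned above, together with a check of the simplest metamorphic relation (the predicted label should not change). It is an illustration under our own assumptions, not DeepHunter's API: the function names and `toy_predict` (a hypothetical stand-in for a real model) are invented for this example, and images are assumed to be NumPy arrays of shape (height, width, channels) with values in [0, 1].

```python
import numpy as np


def adjust_brightness(image: np.ndarray, delta: float) -> np.ndarray:
    """Add `delta` to every pixel and clip back to the valid range [0, 1]."""
    return np.clip(image + delta, 0.0, 1.0)


def horizontal_flip(image: np.ndarray) -> np.ndarray:
    """Mirror an (H, W, C) image left to right."""
    return image[:, ::-1, :]


def violates_relation(predict, image, mutate) -> bool:
    """Return True if the mutant gets a different label from the original."""
    return predict(image) != predict(mutate(image))


def toy_predict(image: np.ndarray) -> int:
    """Hypothetical stand-in model: label 1 if the left half is brighter."""
    half = image.shape[1] // 2
    return int(image[:, :half].mean() > image[:, half:].mean())


# A 4x4 RGB image whose left half is brighter than its right half.
image = np.full((4, 4, 3), 0.2)
image[:, :2, :] = 0.6

brightness_defect = violates_relation(
    toy_predict, image, lambda im: adjust_brightness(im, 0.1)
)
flip_defect = violates_relation(toy_predict, image, horizontal_flip)
print(brightness_defect, flip_defect)  # False True
```

The toy model is deliberately sensitive to left and right, so the flip mutant exposes a defect while the brightness mutant does not. Whether a flip should preserve the label is itself a use-case decision, which is why the mutation strategies are left to the user.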

Another component is the coverage criterion, which is computed for each mutant to determine whether it contributes to a coverage increase. There are various definitions of coverage for deep learning models[8], which are based on the behaviour of the neurons in a neural network. For example, Neuron Coverage (NC) measures the proportion of neurons activated above a predefined threshold, while Neuron Boundary Coverage (NBC) measures how much of the corner-case regions, outside the major functional range of each neuron, is exercised. Regardless of the specific criterion used, the general idea is that tests with higher coverage are expected to capture more diverse behaviours of the model and allow more defects to be identified, because the test data is perceived to be new to the model. For more details on how coverage is assessed, please refer to the literature.
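As a rough illustration of how such criteria can be computed, the sketch below evaluates NC and NBC on a small matrix of recorded activations. It follows our reading of the definitions in [8] and is not taken from the DeepHunter code base; the function names, the activation values and the per-neuron bounds `low` and `high` (which would normally be profiled from the training data) are made up for the example.

```python
import numpy as np


def neuron_coverage(activations: np.ndarray, threshold: float) -> float:
    """Fraction of neurons pushed above `threshold` by at least one input.

    activations has shape (n_inputs, n_neurons).
    """
    covered = (activations > threshold).any(axis=0)
    return float(covered.mean())


def neuron_boundary_coverage(
    activations: np.ndarray, low: np.ndarray, high: np.ndarray
) -> float:
    """Fraction of corner-case regions hit by at least one input.

    Each neuron has two corner-case regions: below `low` and above `high`.
    """
    upper_hit = (activations > high).any(axis=0)
    lower_hit = (activations < low).any(axis=0)
    return float((upper_hit.sum() + lower_hit.sum()) / (2 * activations.shape[1]))


# Two test inputs, three neurons (illustrative values only).
activations = np.array([[0.1, 0.9, 0.4],
                        [0.2, 0.3, 0.6]])
low = np.array([0.0, 0.4, 0.5])   # assumed lower bounds seen in training
high = np.array([0.5, 0.8, 0.7])  # assumed upper bounds seen in training

nc = neuron_coverage(activations, threshold=0.5)
nbc = neuron_boundary_coverage(activations, low, high)
print(round(nc, 3), nbc)  # 0.667 0.5
```

A mutant "increases coverage" when adding its activations to the matrix raises one of these numbers, which is the signal the fuzzer uses to decide whether the mutant is worth keeping.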

Figure 1. The overall workflow of DeepHunter. Image is adapted from [6].

The overall workflow of DeepHunter is illustrated in Figure 1. It starts with an initial set of ‘seeds’ (inputs to the model), which are added to a seed queue for mutation. The core of DeepHunter is a fuzzing loop that combines a seed selection strategy (heuristics to select the next seed for mutation) with the metamorphic mutation, coverage criteria, and runtime model prediction. The seed selection strategy is such that mutants which increase the coverage, or which the model fails to predict correctly, are added back to the queue for further mutation. The test cases that the model fails to predict correctly are collected for analysis, e.g. checking whether the mutant is realistic. This coverage-guided fuzzing technique was demonstrated[5] to identify a greater number of defects in the model than random testing. For more details on the methodology, please refer to the literature.
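The loop described above can be sketched in a few lines. This is a deliberately simplified illustration of the workflow as we have described it, not DeepHunter's implementation: seed selection is reduced to a plain first-in-first-out queue, and the model, mutation and coverage function are hypothetical one-dimensional toys so that the example runs on its own.

```python
import random
from collections import deque


def coverage_guided_fuzz(seeds, mutate, predict, coverage_of, max_iterations, rng):
    """Run a simplified coverage-guided fuzzing loop.

    seeds is a list of (input, expected_label) pairs. Returns the failing
    mutants and the set of coverage items reached.
    """
    queue = deque(seeds)
    covered = set()
    for x, _ in seeds:
        covered |= coverage_of(x)

    defects = []
    for _ in range(max_iterations):
        if not queue:
            break
        # Seed selection: plain FIFO here; a real tool uses smarter heuristics.
        x, label = queue.popleft()
        mutant = mutate(x, rng)

        new_coverage = coverage_of(mutant) - covered
        failed = predict(mutant) != label
        if failed:
            defects.append((mutant, label))  # collected for later analysis
        if new_coverage or failed:
            covered |= new_coverage
            queue.append((mutant, label))    # interesting mutant: mutate it again
        queue.append((x, label))             # the parent stays available
    return defects, covered


# Hypothetical toy setup: inputs are numbers in [0, 1].
def toy_predict(x):
    return int(x > 0.5)


def toy_mutate(x, rng):
    return min(1.0, max(0.0, x + rng.uniform(-0.2, 0.2)))


def toy_coverage(x):
    return {int(x * 10)}  # which of the coarse buckets the input falls in


defects, covered = coverage_guided_fuzz(
    seeds=[(0.1, 0), (0.9, 1)],
    mutate=toy_mutate,
    predict=toy_predict,
    coverage_of=toy_coverage,
    max_iterations=200,
    rng=random.Random(0),
)
print(len(defects), sorted(covered))
```

Because interesting mutants are fed back into the queue, the search drifts away from the original seeds towards regions the model has not yet been exercised on, which is what distinguishes this from purely random testing.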

Figure 2. Illustration of the deep learning model inference pipeline for activity classification.

One of the first users of the tool in AI Singapore is the CV Hub team, for their activity classification use case. A typical CV use case consists of a pre-trained object detection or pose estimation model, combined with use-case specific heuristics or models downstream. The CV Hub team was interested in learning about the robustness of the deep learning model that they developed for activity classification. As illustrated in Figure 2, the model takes in the key point coordinates of a human pose, produced by a pre-trained pose estimation model upstream, and classifies the pose into an activity.

Figure 3. Example renders of pose key points (input to model) before and after mutation. Data is from the JHMDB dataset.

To identify suitable mutation strategies for testing the model, we conducted a discovery session with the CV Hub team to understand the requirements of the use case. We identified a number of possible mutation strategies, one of which is to mirror the key points by flipping the image horizontally, as shown in Figure 3. This mutation strategy is provided to the tool, which uses it as part of its fuzzing process to generate mutants.
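A mirroring mutation of this kind might look like the sketch below. The details are our assumptions for illustration rather than the exact strategy given to the tool: key points are taken to be (x, y) pixel coordinates in an array of shape (n_joints, 2), the joint ordering is hypothetical, and whether left and right joints should swap labels after the flip is exposed as an option because it depends on how the use case treats handedness.

```python
import numpy as np

# Hypothetical joint ordering: 0 = head, 1 = left shoulder, 2 = right shoulder.
LEFT_RIGHT_PAIRS = ((1, 2),)


def mirror_keypoints(keypoints, image_width, swap_pairs=()):
    """Mirror (x, y) pixel key points about the vertical centre line.

    keypoints has shape (n_joints, 2). Pairs in `swap_pairs` exchange rows
    so that left/right joint labels stay anatomically consistent.
    """
    mirrored = np.asarray(keypoints, dtype=float).copy()
    mirrored[:, 0] = (image_width - 1) - mirrored[:, 0]  # flip x, keep y
    for left, right in swap_pairs:
        mirrored[[left, right]] = mirrored[[right, left]]
    return mirrored


pose = np.array([[50.0, 10.0],   # head
                 [30.0, 40.0],   # left shoulder
                 [80.0, 45.0]])  # right shoulder

flipped = mirror_keypoints(pose, image_width=101)
flipped_swapped = mirror_keypoints(pose, image_width=101, swap_pairs=LEFT_RIGHT_PAIRS)
print(flipped.tolist())          # [[50.0, 10.0], [70.0, 40.0], [20.0, 45.0]]
print(flipped_swapped.tolist())  # [[50.0, 10.0], [20.0, 45.0], [70.0, 40.0]]
```

Applying the mutation twice returns the original pose, which is a convenient sanity check before handing a mutation to the fuzzer.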

We ran the robustness testing process on the original model and the results are shown in Table 1. The coverage-guided fuzzing identified a large number of defects, which implies that the model was not robust to the mutations.

Model       Accuracy on test set   Number of fuzzer iterations   Number of defects
Original    65.3%                  5000                          2193
Retrained   64.3%                  5000                          94
Table 1. Test set accuracy and results of coverage-guided fuzzing for each of the models.

After analysing the results of the coverage-guided fuzzing, a strategy was developed to improve the robustness of the model by retraining it with augmented data. The results of the robustness testing on the retrained model are also shown in Table 1. The smaller number of defects identified implies that it is more robust to the mutations. (Note: In this article, we have demonstrated robustness testing using just one mutation strategy. Additional mutation strategies should be used to obtain a more complete picture of the model’s robustness.)

To compare the robustness testing with a standard model evaluation, we have also included the test set accuracy for each of the models in Table 1. Analysing model performance by this metric alone would have led us to infer that the two models perform roughly the same. However, the results from the robustness testing revealed that the models behave very differently when subjected to mutations. This testing process therefore gives us more confidence that the retrained model will be able to handle new and unseen inputs when deployed.
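The contrast can be quantified with a little arithmetic on the figures in Table 1. The snippet below is only a convenience calculation; treating the defect counts as directly comparable assumes that both runs used the same fuzzer configuration apart from the model, which holds here at least for the number of iterations.

```python
# Figures copied from Table 1.
results = {
    "Original": {"accuracy": 65.3, "iterations": 5000, "defects": 2193},
    "Retrained": {"accuracy": 64.3, "iterations": 5000, "defects": 94},
}

# Difference in test set accuracy, in percentage points.
accuracy_gap = results["Original"]["accuracy"] - results["Retrained"]["accuracy"]

# Relative reduction in the number of defects found by the fuzzer.
defect_reduction = 1 - results["Retrained"]["defects"] / results["Original"]["defects"]

print(f"Accuracy gap: {accuracy_gap:.1f} percentage points")  # 1.0
print(f"Defect reduction: {defect_reduction:.1%}")            # 95.7%
```

In other words, a one percentage point difference in test set accuracy sits alongside a reduction of roughly 96% in the number of defects found.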

In summary, we have demonstrated how robustness testing can give us additional insights about a model’s performance beyond the standard evaluation, as well as actionable insights for improvement. This gives us more confidence when using the model. In the next article, we will continue our exploration of robustness testing tools with Microsoft Checklist and its application to an NLP use case.


[1] J. Zhang, M. Harman, L. Ma and Y. Liu, “Machine Learning Testing: Survey, Landscapes and Horizons,” in IEEE Transactions on Software Engineering, doi: 10.1109/TSE.2019.2962027.

[2] Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. 2019. DeepXplore: automated whitebox testing of deep learning systems. Commun. ACM 62, 11 (November 2019), 137–145.

[3] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. 2017. Practical Black-Box Attacks against Machine Learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security (ASIA CCS ’17). Association for Computing Machinery, New York, NY, USA, 506–519.

[4] E. T. Barr, M. Harman, P. McMinn, M. Shahbaz and S. Yoo, “The Oracle Problem in Software Testing: A Survey,” in IEEE Transactions on Software Engineering, vol. 41, no. 5, pp. 507-525, 1 May 2015, doi: 10.1109/TSE.2014.2372785.

[5] Xiaofei Xie, Lei Ma, Felix Juefei-Xu, Minhui Xue, Hongxu Chen, Yang Liu, Jianjun Zhao, Bo Li, Jianxiong Yin, and Simon See. 2019. DeepHunter: a coverage-guided fuzz testing framework for deep neural networks. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2019). Association for Computing Machinery, New York, NY, USA, 146–157.

[6] X. Xie, H. Chen, Y. Li, L. Ma, Y. Liu and J. Zhao, “Coverage-Guided Fuzzing for Feedforward Neural Networks,” 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2019, pp. 1162-1165, doi: 10.1109/ASE.2019.00127.

[7] Chen, T.Y., Cheung, S.C., & Yiu, S. (2020). Metamorphic Testing: A New Approach for Generating Next Test Cases. ArXiv, abs/2002.12543.

[8] Lei Ma, Felix Juefei-Xu, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Chunyang Chen, Ting Su, Li Li, Yang Liu, Jianjun Zhao, and Yadong Wang. 2018. DeepGauge: multi-granularity testing criteria for deep learning systems. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE 2018). Association for Computing Machinery, New York, NY, USA, 120–131.