SADL Experiment Results

This page accompanies a submission to the ICSE 2019 Technical Papers track, “Guiding Deep Learning System Testing Using Surprise Adequacy”. We list all figures here, including those omitted from the submission due to space limits (titles of figures that are not in the paper are shown in red). The page also contains additional analysis undertaken as part of the author response.

RQ1: Is SADL capable of capturing the relative surprise of an input of a DL system?

Figure 2

All results were included in the paper.

Accuracy of test inputs in the MNIST and CIFAR-10 datasets, selected starting from the input with the lowest SA and increasingly including inputs with higher SA, and vice versa (i.e., starting from the input with the highest SA and increasingly including inputs with lower SA).
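The selection procedure behind this figure can be sketched as follows. This is a minimal illustration rather than the code used in the experiments: the function name `accuracy_by_sa_order`, its parameters, and the toy data are all assumptions made for the example.

```python
def accuracy_by_sa_order(sa_values, correct, ascending=True):
    """Return the accuracy after including 1, 2, ..., n inputs in SA order.

    sa_values: SA score of each test input (hypothetical toy values below).
    correct:   whether the DL system handled each input correctly.
    ascending: True starts from the lowest SA, False from the highest.
    """
    order = sorted(range(len(sa_values)),
                   key=lambda i: sa_values[i],
                   reverse=not ascending)
    accuracies = []
    hits = 0
    for count, idx in enumerate(order, start=1):
        hits += 1 if correct[idx] else 0
        accuracies.append(hits / count)  # accuracy of the first `count` inputs
    return accuracies


# Toy data, not taken from the experiments.
sa = [0.2, 1.5, 0.7, 3.1, 0.1]
ok = [True, False, True, False, True]

low_first = accuracy_by_sa_order(sa, ok, ascending=True)
high_first = accuracy_by_sa_order(sa, ok, ascending=False)
print(low_first)
print(high_first)
```

Both curves end at the same value, the accuracy over the whole set. They differ only in how quickly they approach it.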

Figure 4

We have included the DSA plots for MNIST and CIFAR-10 in the paper.

Sorted DSA values of adversarial examples for MNIST and CIFAR-10.

Figure 4’ (NOT IN THE PAPER)

In addition, here are the per-class plots that show sorted DSA values for each class in MNIST. Note that the number of adversarial examples differs between classes because each adversarial example generation algorithm has its own method of targeting a specific class.

Sorted DSA values of adversarial examples for MNIST per class.

Figure 4’’ (NOT IN THE PAPER)

The following are the per-class plots that show sorted DSA values for each class in CIFAR-10.

Sorted DSA values of adversarial examples for CIFAR-10 per class.

RQ2: Does the selection of layers of neurons used for SA computation have any impact on how accurately SA reflects the behaviour of DL systems?

Figure 5’ (NOT IN THE PAPER)

This figure contains sorted LSA values from all layers in the MNIST model. In the paper, pool1 was omitted.

Sorted LSA of 2,000 randomly selected adversarial examples for MNIST from different layers.

Figure 5’’ (NOT IN THE PAPER)

This figure contains sorted LSA values from all layers in the CIFAR-10 model. In the paper, only activation_1, activation_5, and activation_8 were presented.

Sorted LSA of 2,000 randomly selected adversarial examples for CIFAR-10 from different layers.

RQ3: Is SC correlated to existing coverage criteria for DL systems?

Figure 6’ (NOT IN THE PAPER)

This figure shows changes in various coverage criteria against increasing input diversity for each subject model. In the paper, only CIFAR-10 and Chauffeur were shown.

Changes in various coverage criteria against increasing input diversity. We add additional inputs to the original test inputs and observe the changes in coverage values.
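To make "observing changes in coverage values" concrete, here is a minimal sketch of how a surprise coverage value could change when inputs are added. It assumes a bucket-based definition of SC, in which the range (0, upper bound] of SA values is split into equal-width buckets and coverage is the fraction of buckets hit by at least one input. The function name, the upper bound, the bucket count, and the SA values are hypothetical and do not come from the experiments.

```python
import math


def surprise_coverage(sa_values, upper_bound, n_buckets):
    """Fraction of equal-width SA buckets over (0, upper_bound] that are hit.

    Assumption: bucket i (1-based) covers (width * (i - 1), width * i].
    SA values outside (0, upper_bound] cover no bucket.
    """
    width = upper_bound / n_buckets
    covered = set()
    for sa in sa_values:
        if 0 < sa <= upper_bound:
            idx = math.ceil(sa / width)
            covered.add(min(idx, n_buckets))  # guard against float round-up
    return len(covered) / n_buckets


# Hypothetical SA values: an original test set plus more surprising inputs.
original = [0.5, 1.2, 3.3]
additional = [7.9, 9.5, 12.0]

sc_before = surprise_coverage(original, upper_bound=10.0, n_buckets=5)
sc_after = surprise_coverage(original + additional, upper_bound=10.0, n_buckets=5)
print(sc_before, sc_after)
```

In this toy setting the added inputs reach buckets that the original set never touches, so coverage rises. The input with SA 12.0 lies above the upper bound and contributes nothing.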

Correlation Analysis for Figure 6’ (NOT IN THE PAPER)

In response to one of the reviewers' questions, we have calculated Spearman’s rank correlation coefficient between LSC/DSC and the other coverage criteria. While the results show strong correlations, note that the sample sizes are very small (ranging from four to six) and that some of the correlations are not statistically significant.

	LSC			DSC		
DNN	Criteria	Spearman's \(\rho\)	\(p\)-value	Criteria	Spearman's \(\rho\)	\(p\)-value
MNIST	NC	0.926	0.008	NC	0.926	0.008
	KMNC	1.000	0.000	KMNC	1.000	0.000
	NBC	1.000	0.000	NBC	1.000	0.000
	SNAC	0.971	0.001	SNAC	0.971	0.001
CIFAR-10	NC	0.941	0.005	NC	0.941	0.005
	KMNC	1.000	0.000	KMNC	1.000	0.000
	NBC	1.000	0.000	NBC	1.000	0.000
	SNAC	1.000	0.000	SNAC	1.000	0.000
Dave-2	NC	0.949	0.051	NC	N/A	N/A
	KMNC	0.949	0.051	KMNC	N/A	N/A
	NBC	0.949	0.051	NBC	N/A	N/A
	SNAC	0.949	0.051	SNAC	N/A	N/A
Chauffeur	NC	1.000	0.000	NC	N/A	N/A
	KMNC	1.000	0.000	KMNC	N/A	N/A
	NBC	1.000	0.000	NBC	N/A	N/A
	SNAC	1.000	0.000	SNAC	N/A	N/A
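
For readers unfamiliar with the statistic, the following is a minimal pure-Python sketch of Spearman's rank correlation coefficient: it ranks both samples (tied values share the average rank) and takes the Pearson correlation of the ranks. The function names and the four-point toy samples are assumptions for illustration only. They are not the data behind the table, and the sketch does not compute p-values.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def spearman_rho(xs, ys):
    """Pearson correlation of the average ranks of xs and ys."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical coverage values at four steps of increasing input diversity.
lsc = [10.0, 20.0, 30.0, 40.0]
nc_strict = [0.5, 0.6, 0.7, 0.9]  # strictly increasing along with lsc
nc_tied = [0.5, 0.6, 0.6, 0.9]    # one tie in the middle

rho_strict = spearman_rho(lsc, nc_strict)
rho_tied = spearman_rho(lsc, nc_tied)
print(round(rho_strict, 3), round(rho_tied, 3))
```

With only four points, any two strictly increasing series are perfectly rank-correlated, which illustrates why the small sample sizes noted above call for caution. In practice, a library routine such as `scipy.stats.spearmanr` returns both the coefficient and a p-value.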