SADL Experiment Result
This page accompanies a submission to the ICSE 2019 Technical Papers track, “Guiding Deep Learning System Testing Using Surprise Adequacy”. It lists all figures, including those omitted from the submission due to space limits (titles of figures that are not in the paper are shown in red). The page also contains additional analysis undertaken as part of the author response.
RQ1: Is SADL capable of capturing the relative surprise of an input of a DL system?
Figure 2
All results were included in the paper.
Figure 4
We have included the DSA plots for MNIST and CIFAR-10 in the paper.
Figure 4’ (NOT IN THE PAPER)
In addition, here are the per-class plots that show the sorted DSA values of each class in MNIST. Note that the number of adversarial examples differs between classes because each adversarial example generation algorithm has its own method of targeting a specific class.
Figure 4’’ (NOT IN THE PAPER)
The following are the per-class plots that show sorted DSA values of each class in CIFAR-10.
RQ2: Does the selection of layers of neurons used for SA computation have any impact on how accurately SA reflects the behaviour of DL systems?
Figure 5’ (NOT IN THE PAPER)
This figure contains sorted LSA values from all layers in the MNIST model. In the paper, pool1 was omitted.
Figure 5’’ (NOT IN THE PAPER)
This figure contains sorted LSA values from all layers in the CIFAR-10 model. In the paper, only activation_1, activation_5, and activation_8 were presented.
RQ3: Is SC correlated to existing coverage criteria for DL systems?
Figure 6’ (NOT IN THE PAPER)
This figure shows how the various coverage criteria change as input diversity increases, for each subject model. In the paper, only CIFAR-10 and Chauffeur were shown.
Correlation Analysis for Figure 6’ (NOT IN THE PAPER)
In response to a reviewer question, we calculated Spearman’s rank correlation coefficient between LSC/DSC and the other coverage criteria. While the results show strong correlation, note that the sample sizes are very small (ranging from four to six) and that some of the correlations are not statistically significant (for Dave-2, \(p = 0.051\)).
| DNN | Criteria | LSC: Spearman's \(\rho\) | LSC: \(p\)-value | DSC: Spearman's \(\rho\) | DSC: \(p\)-value |
|---|---|---|---|---|---|
| MNIST | NC | 0.926 | 0.008 | 0.926 | 0.008 |
| | KMNC | 1.000 | 0.000 | 1.000 | 0.000 |
| | NBC | 1.000 | 0.000 | 1.000 | 0.000 |
| | SNAC | 0.971 | 0.001 | 0.971 | 0.001 |
| CIFAR-10 | NC | 0.941 | 0.005 | 0.941 | 0.005 |
| | KMNC | 1.000 | 0.000 | 1.000 | 0.000 |
| | NBC | 1.000 | 0.000 | 1.000 | 0.000 |
| | SNAC | 1.000 | 0.000 | 1.000 | 0.000 |
| Dave-2 | NC | 0.949 | 0.051 | N/A | N/A |
| | KMNC | 0.949 | 0.051 | N/A | N/A |
| | NBC | 0.949 | 0.051 | N/A | N/A |
| | SNAC | 0.949 | 0.051 | N/A | N/A |
| Chauffeur | NC | 1.000 | 0.000 | N/A | N/A |
| | KMNC | 1.000 | 0.000 | N/A | N/A |
| | NBC | 1.000 | 0.000 | N/A | N/A |
| | SNAC | 1.000 | 0.000 | N/A | N/A |
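
The correlation analysis above, and its caveat about small sample sizes, can be illustrated with a short sketch. It computes Spearman's \(\rho\) as the Pearson correlation of average ranks and derives a two-sided \(p\)-value by enumerating all permutations, which is feasible only for very small samples such as the four to six points used here. The function names and the coverage readings are hypothetical, and the exact permutation test is an assumption made for illustration: the \(p\)-values in the table may have been computed differently (for example, with an asymptotic approximation), so the numbers are not expected to match.

```python
from itertools import permutations


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; assumes neither input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman_exact(xs, ys):
    """Spearman's rho with an exact two-sided permutation p-value."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    rho = pearson(rx, ry)
    # Enumerate every ordering of the y ranks (n! of them).
    perms = list(permutations(ry))
    extreme = sum(1 for p in perms if abs(pearson(rx, p)) >= abs(rho) - 1e-12)
    return rho, extreme / len(perms)


# Hypothetical coverage readings over five input sets of growing diversity.
lsc = [10.0, 14.0, 21.0, 25.0, 33.0]
nc = [40.0, 42.0, 47.0, 51.0, 58.0]
rho, p = spearman_exact(lsc, nc)
print(f"rho = {rho:.3f}, exact p = {p:.3f}")  # rho = 1.000, exact p = 0.017
```

Under this exact test, a perfectly monotone relationship over only four points gives a two-sided \(p\) of 2/24 ≈ 0.083, since only the identity and the fully reversed ordering are as extreme as the observed one. This is one way to see why correlations over so few points should be read with care.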