Empirical Evaluation of Fault Localisation Using Code and Change Metrics
This page includes the full results of RQ5, RQ6, and RQ7 from the paper, "Empirical Evaluation of Fault Localisation Using Code and Change Metrics".
RQ5. Existence of project-specific traits that are effective for fault localisation
The performance of trained models depends on whether they have succeeded in capturing features that are useful for fault localisation. In RQ5, we ask: does each project have unique traits that affect the effectiveness of fault localisation? To answer this question, we define three different training data configurations and compare the performance of FLUCCS under each of them.
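As a rough illustration of what such configurations look like, the sketch below assembles a per-configuration training set for a target project. It is a minimal sketch: the configuration names ("self", "cross", "mixed") and the `faults_by_project` mapping are placeholder assumptions, not the paper's actual definitions.

```python
# A minimal sketch, assuming a hypothetical mapping faults_by_project
# from project name to a list of known faults (with features/labels).
PROJECTS = ["Lang", "Math", "Time", "Closure", "Chart", "Mockito"]

def training_faults(target, configuration, faults_by_project):
    own = faults_by_project[target]
    others = [f for p in PROJECTS if p != target
                for f in faults_by_project[p]]
    if configuration == "self":   # only the target project's own faults
        return own
    if configuration == "cross":  # only faults from the other projects
        return others
    return own + others           # "mixed": both combined
```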
Lang | Math | Time | Closure | Chart | Mockito
RQ6. Gradual Cross-Project Training
While there may exist project-specific traits that improve the performance of fault localisation under the self configuration, learning only from the faults of a test subject is not always a viable option. For example, if the test subject is in an early stage of development, there may not yet be enough faults to exploit; in this case, cross-project learning may be the only feasible option. In RQ6, we ask: when should we transition from the cross-project configuration to the self configuration? To answer this question, we perform gradual cross-project learning, varying the ratio between faults from the subject project and faults from other projects in the training dataset. For this experiment, we define an additional training data configuration, others.
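The sketch below illustrates this gradual mixing with synthetic stand-in data; the ratio grid, the fixed training-set size, the Random Forest model, and the R² score used as a stand-in evaluation metric are all illustrative assumptions, not the paper's exact setup.

```python
# A minimal sketch of gradual cross-project training on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_own, y_own = rng.random((200, 40)), rng.random(200)        # target project faults
X_other, y_other = rng.random((1000, 40)), rng.random(1000)  # faults from other projects
X_test, y_test = rng.random((100, 40)), rng.random(100)      # held-out target faults

# own_ratio = 0.0 corresponds to the "others" configuration,
# own_ratio = 1.0 to the "self" configuration.
for own_ratio in np.linspace(0.0, 1.0, 5):
    n_own = int(own_ratio * len(X_own))
    n_other = len(X_own) - n_own  # keep the training-set size fixed
    idx_own = rng.choice(len(X_own), n_own, replace=False)
    idx_other = rng.choice(len(X_other), n_other, replace=False)
    X = np.vstack([X_own[idx_own], X_other[idx_other]])
    y = np.concatenate([y_own[idx_own], y_other[idx_other]])

    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X, y)
    print(f"own ratio {own_ratio:.2f}: score {model.score(X_test, y_test):.3f}")
```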
Lang | Math | Time | Closure | Chart | Mockito
RQ7. Feature Importance
FLUCCS uses 33 SBFL score metrics and seven code and change metrics. To identify which features are useful to FLUCCS, we perform feature importance analysis using Random Forest and identify the ten most effective features. For these top ten features, we directly measure their effect on fault localisation performance by randomly permuting each feature in the training dataset, one at a time. We then compare the performance of the model trained on the permuted dataset against that of the model trained on the original dataset: the larger the difference, the greater the effect of the permuted feature on fault localisation. In RQ7, we ask: which features contribute the most to the ranking models? Below are the per-project results of this experiment, including tables of the RF feature importance values of the top ten features and tables of the performance differences caused by their random permutation.
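The sketch below shows the shape of this analysis on synthetic stand-in data. It is a minimal sketch: the feature matrix, the 40-feature layout (standing in for the 33 SBFL metrics plus seven code and change metrics), and the R² score used as a placeholder for a ranking-quality metric are all assumptions, not the paper's actual data or evaluation.

```python
# A minimal sketch: rank features by RF importance, then permute each
# top feature in the training data, retrain, and measure the score drop.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train = rng.random((500, 40))   # 500 program elements x 40 features
y_train = rng.random(500)         # suspiciousness labels
X_test, y_test = rng.random((100, 40)), rng.random(100)

# 1) Rank features by Random Forest importance; keep the top ten.
baseline = RandomForestRegressor(n_estimators=100, random_state=0)
baseline.fit(X_train, y_train)
top_ten = np.argsort(baseline.feature_importances_)[::-1][:10]
base_score = baseline.score(X_test, y_test)

# 2) Permute each top-ten feature in the *training* data, retrain,
#    and compare against the baseline model's performance.
for f in top_ten:
    X_perm = X_train.copy()
    X_perm[:, f] = rng.permutation(X_perm[:, f])
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_perm, y_train)
    drop = base_score - model.score(X_test, y_test)
    print(f"feature {f}: score drop {drop:+.3f}")
```

The larger the drop for a feature, the more the model relied on it, mirroring the reasoning in the paragraph above.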
Feature Importance Analysis per Project