Clotho is accepted to FSE 2026
A nice Christmas present came in the form of an author notification from FSE 2026: the paper titled "Clotho: Measuring Task-Specific Pre-Generation Test Adequacy for LLM Inputs", written by Juyeon, Somin, Robert, and Shin, has been accepted. Congratulations!
We think this is an important breakthrough, at the risk of saying so about our own work. Existing coverage criteria defined for DNNs have been evaluated on LLMs for detecting out-of-distribution inputs such as jailbreak attempts, but distribution-based adequacy metrics such as Surprise Adequacy remained inapplicable to LLMs, simply because it is practically infeasible to measure the similarity between the neuron activations triggered by a new input and those triggered by the training data. This, in turn, is due to two reasons. First, the training corpora of pre-trained LLMs are simply too big. Second, even if we could handle such a large training corpus, the purpose of pre-training is to reduce auto-regressive loss, not to solve the specific task we are writing the prompt for, so it is not even clear what to measure.
Clotho mitigates this problem with a reference set, active learning, and a Gaussian Mixture Model (GMM). Using a reference set means that we measure the similarity between the activations triggered by a new input and those of a set of representative inputs for the task. The GMM models the distribution of these representative inputs, and active learning allows us to choose the reference set efficiently. Combined, these let us compute Surprise Adequacy as a test adequacy measure for a specific LLM task.
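To give a flavour of the idea, here is a minimal sketch of a likelihood-based surprise score. It is not Clotho's implementation: Clotho fits a GMM over reference-set activations, whereas this sketch uses a single diagonal Gaussian as a simplified stand-in, and the function names and the toy two-dimensional activations are invented for illustration.

```python
import math

def fit_diag_gaussian(reference_activations):
    """Fit a diagonal Gaussian to activations of the reference set.

    Clotho models this distribution with a Gaussian Mixture Model;
    a single Gaussian keeps the sketch self-contained.
    """
    n = len(reference_activations)
    d = len(reference_activations[0])
    mean = [sum(x[j] for x in reference_activations) / n for j in range(d)]
    # Floor the variance to avoid division by zero for constant dimensions.
    var = [
        max(sum((x[j] - mean[j]) ** 2 for x in reference_activations) / n, 1e-6)
        for j in range(d)
    ]
    return mean, var

def surprise(activation, mean, var):
    """Negative log-likelihood of a new input's activations under the
    reference distribution: the higher, the more surprising the input."""
    nll = 0.0
    for a, m, v in zip(activation, mean, var):
        nll += 0.5 * (math.log(2 * math.pi * v) + (a - m) ** 2 / v)
    return nll

# Toy usage: activations far from the reference set score as more surprising.
reference = [[0.0, 0.0], [0.1, -0.1], [-0.1, 0.1], [0.0, 0.0]]
mean, var = fit_diag_gaussian(reference)
```

Inputs can then be prioritised by sorting them in descending order of their surprise scores, so that the most out-of-distribution (and, hopefully, most failure-revealing) inputs are tested first.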
A surprising finding for us was that the adequacy scores do generalise. When we computed the adequacy scores using a local SLM and prioritised the inputs for GPT-4o mini, Gemini Flash 2.5 Lite, and Claude Haiku, the prioritisation was still very much meaningful: compared to 100 randomly selected samples, Clotho selects 126.8% more failing inputs.
This is an exciting start, as it opens so many doors to interesting analyses of the internal behaviour of LLMs. More to come!