**Introduction**
Recently, researchers from the University of Oxford introduced a new method for detecting hallucinations in Large Language Models (LLMs) using semantic entropy. As a strategy for catching confabulations, semantic entropy is built on probabilistic tools for uncertainty estimation and can be applied directly to the underlying models without any modifications to the architecture.
To this day, large language models still routinely fabricate information.
When an LLM delivers nonsense with complete seriousness, should we frown, or just laugh it off?
As the old song goes, "The great wind rises and the clouds fly high; where can I find brave warriors to guard the four quarters?" LLM hallucinations must be rooted out; they cannot be tolerated. Imagine searching for a simple grammar rule, finding that the top web results are all incorrect answers generated by large models, and realizing only after testing them that your time has been wasted.
If LLMs are used in professional fields such as medicine and law, hallucinations can have serious consequences, which is why research on the problem has never stopped.
Recently, researchers from the University of Oxford published in Nature a new method for detecting LLM hallucinations using semantic entropy.
Oxford computer scientist Sebastian Farquhar and colleagues designed a method based on LLM-judged semantic entropy (similarity of meaning) to measure the uncertainty of large-model answers at the semantic level. The first LLM generates multiple answers to the same question, and a second LLM (acting as a referee) then analyzes how semantically similar those answers are.
To verify the accuracy of that judgment, a third LLM is brought in: it receives human reference answers alongside the second LLM's assessments and compares the two, yielding an analysis that is unsupervised yet well grounded.
In simple terms, the whole process is: if I want to check whether you are making things up, I will ask you the same question over and over. If your answer changes every time... something is off.
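A minimal sketch of that check in code is below. Here `ask_model` and `same_meaning` are hypothetical stand-ins for the generator LLM and the judge LLM, and the threshold is arbitrary; this illustrates the idea rather than reproducing the authors' implementation.

```python
import math

def looks_confabulated(question, ask_model, same_meaning, n_samples=10, threshold=1.0):
    """Sketch of the detection loop: sample several answers, group the ones the
    judge considers semantically equivalent, and flag the question when the
    answers scatter across many different meanings.
    `ask_model(question) -> str` and `same_meaning(a, b) -> bool` are hypothetical
    stand-ins for the generator LLM and the judge LLM."""
    answers = [ask_model(question) for _ in range(n_samples)]

    # Greedily group answers the judge considers semantically equivalent.
    clusters = []
    for ans in answers:
        for cluster in clusters:
            if same_meaning(ans, cluster[0]):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])

    # Entropy over the cluster frequencies: a high value means the sampled
    # answers disagree in meaning, not just in wording.
    probs = [len(c) / len(answers) for c in clusters]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy > threshold, entropy
```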
The experimental results show that the semantic entropy approach adopted in the paper outperforms all baseline methods. In a commentary article in Nature, Professor Karin Verspoor, Dean of the School of Computing Technologies at RMIT University, described it as "fighting fire with fire":
"The results show that the uncertainty (semantic entropy) associated with these clusters is more effective in estimating the uncertainty of the first Large Language Model (LLM) than the standard word-based entropy. This means that even if the semantic equivalence calculation of the second LLM is not perfect, it is still helpful."
However, Karin Verspoor also pointed out that using an LLM to evaluate an LLM-based method looks like circular reasoning and may introduce bias.
On the other hand, the approach offers plenty of inspiration for research on related problems, including academic integrity, plagiarism, and the use of LLMs to produce misleading or fabricated content.
**Fighting fire with fire**
LLM hallucinations are usually defined as generated "content that is nonsensical or unfaithful to the provided source content." The paper focuses on a subset of hallucinations, "confabulations": answers that are wrong and arbitrary, in the sense of being highly sensitive to irrelevant details such as the random seed. Detecting confabulations allows systems built on LLMs to avoid answering questions likely to produce them, to make users aware that an answer is unreliable, or to supplement or replace the LLM's response with more grounded retrieval.
**Semantic Entropy and Confabulation Detection**
To detect confabulations, the researchers use probabilistic tools to define and measure the semantic entropy of LLM outputs: an entropy computed over the meanings of sentences rather than their exact wording.
This matters because, in language, answers that differ in expression (grammar or vocabulary) may still mean the same thing (be semantically equivalent).
Semantic entropy estimates the distribution over the meanings of free-form answers rather than over words or word fragments, which better matches how language works; it can also be viewed as a semantic consistency check across variations in the random seed.
Generic uncertainty measures would treat "Paris," "This is Paris," and "The capital of France, Paris" as different answers, which is unsuitable for language tasks. The method in the paper instead clusters answers by meaning before computing entropy.
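Stated compactly (our restatement of the idea, not a formula quoted from the paper): if the possible answers s to an input x are grouped into semantic clusters C, the probability of a cluster is the total probability of its members, and semantic entropy is the entropy over clusters rather than over individual strings:

```latex
P(C \mid x) = \sum_{s \in C} P(s \mid x), \qquad
\mathrm{SE}(x) = -\sum_{C} P(C \mid x) \log P(C \mid x)
```

Under this grouping, "Paris," "This is Paris," and "The capital of France, Paris" all contribute to a single cluster, so they no longer inflate the entropy the way they would under word-level entropy.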
Additionally, semantic entropy can detect confabulations in longer passages. In that setting, a long generated answer is first broken down into individual factual claims.
For each factual claim, an LLM generates corresponding questions; another LLM then provides M possible answers to each of those questions.
Finally, the semantic entropy of the answers to each question is computed (with the original claim included among the answers); a higher average semantic entropy across a claim's questions indicates that the claim is likely confabulated. Intuitively, the method works by sampling several possible answers to each question, clustering them into groups with similar meanings, and deciding cluster membership by whether answers in the same cluster entail each other in both directions.
If the meaning of sentence A entails the meaning of sentence B, and vice versa, the two are placed in the same semantic cluster.
The researchers measure this semantic entailment using both a general-purpose LLM and a dedicated Natural Language Inference (NLI) tool.
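A minimal sketch of that membership rule is below; the `entails` predicate is a placeholder for the NLI model or LLM judge, and nothing here is the authors' code.

```python
def same_cluster(answer_a: str, answer_b: str, entails) -> bool:
    """Two answers belong to the same semantic cluster only if each entails the other.
    `entails(premise, hypothesis) -> bool` is a hypothetical stand-in for the
    NLI model or general-purpose LLM used as the entailment judge."""
    return entails(answer_a, answer_b) and entails(answer_b, answer_a)
```

In practice the `entails` check would be backed by an entailment classifier or a prompted LLM, as described above; simple string matching would not do, since "Paris" and "The capital of France is Paris" should share a cluster despite very different surface forms.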
**Experimental evaluation**
Semantic entropy can detect confabulations in free-form text generation across a range of language models and domains, without prior domain knowledge. The experimental evaluation covers trivia questions (TriviaQA), general knowledge (SQuAD 1.1), life sciences (BioASQ), and open-domain natural questions (NQ-Open).
It also includes detecting confabulations in mathematical word problems (SVAMP) and in a biography-generation dataset (FactualBio).
TriviaQA, SQuAD, BioASQ, NQ-Open, and SVAMP are all evaluated context-free, with sentence-length answers (96±70 characters), using models including LLaMA 2 Chat (7B, 13B, and 70B), Falcon Instruct (7B and 40B), and Mistral Instruct (7B).
The experiments use embedding regression as a strong supervised baseline.
**Evaluation Metrics**
The first metric is AUROC for the binary event "the given answer is incorrect": it measures how well the uncertainty score separates correct from incorrect answers, ranging from 0 to 1, where 1 is a perfect classifier and 0.5 is an uninformative one. The second metric is the area under the rejection accuracy curve (AURAC), which captures the accuracy improvement a user would see if semantic entropy were used to refuse to answer the questions with the highest entropy.
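A rough sketch of how these two metrics can be computed from per-question uncertainty scores and correctness labels is shown below; the evenly spaced answer-rate grid and the toy data are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def aurac(entropy, correct, answer_fractions=np.linspace(0.1, 1.0, 10)):
    """Area under the rejection accuracy curve (sketch): for each answer rate,
    refuse the highest-entropy questions, measure accuracy on the rest, and average."""
    order = np.argsort(entropy)  # lowest entropy (most confident) first
    accuracies = []
    for frac in answer_fractions:
        kept = order[: max(1, int(frac * len(entropy)))]
        accuracies.append(correct[kept].mean())
    return float(np.mean(accuracies))

# Toy data: per-question semantic entropy and whether each answer was incorrect (1 = wrong).
entropy = np.array([0.2, 1.5, 0.1, 2.0, 0.8])
incorrect = np.array([0, 1, 0, 1, 0])

print("AUROC:", roc_auc_score(incorrect, entropy))      # does high entropy rank errors first?
print("AURAC:", aurac(entropy, correct=1 - incorrect))  # accuracy gained by refusing to answer
```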
The reported results, averaged over the five datasets, show that for these sentence-length generations both semantic entropy and its discrete approximation outperform the best baselines.
AUROC measures how well the method predicts LLM errors (those associated with confabulation), while AURAC measures the performance gain obtained by refusing to answer questions judged likely to produce confabulations.
Averaged over the 30 task and model combinations in the experiments, semantic entropy achieved the best AUROC of 0.790, compared with 0.691 for naive entropy, 0.698 for P(True), and 0.687 for the embedding regression baseline. Across the model families (LLaMA, Falcon, and Mistral) and scales (7B to 70B parameters), semantic entropy performs consistently (AUROC between 0.78 and 0.81).
The paper's qualitative examples show how semantic entropy handles cases where the meaning stays the same but the wording changes (first row of the table). When both wording and meaning vary across samples (second row), semantic entropy and naive entropy both correctly predict a confabulation; when wording and meaning remain constant over several resampled generations (third row), both correctly predict the absence of confabulation.
The example in the last row illustrates the importance of context and judgment in clustering, as well as the shortcomings of evaluation based on fixed reference answers.
Discrete semantic entropy achieves higher AUROC and AURAC than a simple self-check baseline (directly asking the LLM whether its stated facts are likely to be true) and than the P(True) variant, with a particularly clear advantage in rejection accuracy.
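For context, a P(True)-style self-check can be sketched roughly as follows; the prompt wording is illustrative rather than the paper's exact template, and `token_probability` is a hypothetical helper for reading the probability the model assigns to the token "True".

```python
def p_true(question, proposed_answer, other_samples, token_probability):
    """Schematic self-check: show the model its own candidate answer (plus a few
    other sampled answers for context) and use the probability it assigns to the
    word "True" as a confidence score.
    `token_probability(prompt, token) -> float` is a hypothetical helper, not a real API."""
    prompt = (
        f"Question: {question}\n"
        f"Brainstormed answers: {'; '.join(other_samples)}\n"
        f"Proposed answer: {proposed_answer}\n"
        "Is the proposed answer true? Answer True or False: "
    )
    return token_probability(prompt, "True")
```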
**Conclusion**
The success of semantic entropy at detecting errors suggests that LLMs are better at "knowing what they do not know" than previously assumed; they just don't know that they know what they do not know.
As a strategy for catching confabulations, semantic entropy is built on probabilistic tools for uncertainty estimation. It can be applied directly to any LLM or similar foundation model without any modifications to the architecture, and even when the model's predictive probabilities are not accessible, the discrete variant of semantic entropy can still be used.