Large language models (LLMs) such as ChatGPT have gained popularity in recent years. But when it comes to logic tasks, these artificial intelligences have so far delivered only mediocre results.
University of Bristol study
A study by Nezhurina and colleagues at the University of Bristol, published in June 2024, showed how difficult even the simplest logic tasks are for LLMs. The researchers asked models such as GPT-3.5/4, Claude, Gemini and Mistral a simple question: “Alice has N brothers and she has M sisters. How many sisters does Alice's brother have?” While most adults and children would recognize the correct solution, “M + 1” (Alice's M sisters plus Alice herself), straight away, the AIs performed below average.
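The expected answer follows from simple counting: every brother has the same sisters as Alice, plus Alice herself. Here is a minimal Python sketch of that reasoning. The function name and the example numbers are illustrative assumptions and are not taken from the study.

```python
def sisters_of_alices_brother(n_brothers: int, m_sisters: int) -> int:
    """Return how many sisters any brother of Alice has.

    n_brothers: number of brothers Alice has (N in the question)
    m_sisters:  number of sisters Alice has (M in the question)
    """
    if n_brothers < 1:
        raise ValueError("the question presupposes at least one brother")
    if m_sisters < 0:
        raise ValueError("the number of sisters cannot be negative")

    # Build the set of girls in the family: Alice plus her M sisters.
    girls = ["Alice"] + [f"sister_{i}" for i in range(m_sisters)]

    # Every girl in the family is a sister of each brother,
    # so the count is M + 1 regardless of N.
    return len(girls)


# Example with hypothetical numbers: 3 brothers, 6 sisters.
print(sisters_of_alices_brother(3, 6))  # → 7
```

Note that the number of brothers N does not influence the result; it only has to be at least one for the question to make sense.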
Sobering results
Even when the variables N and M were replaced with concrete numbers, the LLMs could not provide a correct answer. According to the study, “[in most models] there are severe breakdowns, and many are unable to give even a single correct answer.” Only OpenAI's GPT-4 and Claude 3 Opus were able to produce correct answers at least some of the time, in around 30 percent of cases. But even more worrying than the wrong answers are the supposed arguments offered for their correctness and the persistence with which they were defended.
Errors
“This breakdown can be considered dramatic not only because it happens in such a simple problem, but also because the models tend to describe their wrong solutions as correct, while they often deliver confabulations to explain the given answer, where they imitate an argumentation-like tone, but provide nonsensical arguments as support for the equally nonsensical final answers,” the researchers write in their paper. While some of the arguments kept up the appearance of a logical conclusion, others were simplistic. One justification read, for example, “this conclusion is simple and clear”.
Warning for companies
The researchers conclude that future work should investigate the origins of the reasoning deficits, i.e. the lack of logical thinking. In addition, current benchmarks should be revised to expose such weaknesses. The results of the study should “serve as a warning against excessive claims […] that are often made by commercial companies to present their models as a mature product for users”.
J. Vogel / Finanzen.net editorial team
