Last month, an AI bot that handles technical support for Cursor, an up-and-coming tool for computer programmers, alerted several customers about a change in company policy. It said they were no longer allowed to use Cursor on more than one computer.
In angry posts to internet message boards, the customers complained. Some canceled their Cursor accounts. And some grew angrier when they realized what had happened: the AI bot had announced a policy change that did not exist.
“We do not have such a policy,” the company’s chief executive, Michael Truell, wrote in a Reddit post. “Unfortunately, this is an incorrect response from a front-line AI support bot.”
More than two years after the arrival of ChatGPT, tech companies, office workers and everyday consumers are using AI bots for an ever-widening range of tasks. But there is still no way to ensure that these systems produce accurate information.
The newest and most powerful technologies, so-called reasoning systems from companies such as OpenAI, Google and the Chinese start-up DeepSeek, are generating more mistakes, not fewer. As their mathematical skills have improved, their handle on facts has become shakier. It is not entirely clear why.
Today’s AI bots are based on complex mathematical systems that learn their skills by analyzing huge amounts of digital data. They do not, and cannot, decide what is true and what is false. Sometimes they simply make things up, a phenomenon that some AI researchers call hallucinations. On one test, the hallucination rates of newer AI systems were as high as 79 percent.
These systems use mathematical probabilities to guess the best response, not a strict set of rules defined by human engineers. So they make a certain number of mistakes. “Despite our best efforts, they will always hallucinate,” said Amr Awadallah, the chief executive of Vectara, a start-up that builds AI tools for businesses, and a former Google executive. “That will never go away.”
For several years, this phenomenon has raised concerns about the reliability of these systems. Though they are useful in some situations, such as writing term papers, summarizing office documents and generating computer code, their mistakes can cause problems.
The AI bots tied to search engines such as Google and Bing sometimes generate search results that are laughably wrong. If you ask them for a good marathon on the West Coast, they might suggest a race in Philadelphia. If they tell you the number of households in Illinois, they might cite a source that does not include that information.
Those hallucinations may not be a big problem for many people, but they are a serious issue for anyone using the technology with court documents, medical information or sensitive business data.
“You spend a lot of time trying to figure out which answers are factual and which are not,” said Pratik Verma, co-founder and chief executive of Okahu, a company that helps businesses navigate the hallucination problem. “Not dealing with these errors basically eliminates the value of AI systems, which are supposed to automate tasks for you.”
Cursor and Mr. Truell did not respond to requests for comment.
For more than two years, companies such as OpenAI and Google steadily improved their AI systems and reduced the frequency of these errors. But with the use of new reasoning systems, errors are rising. The latest OpenAI systems hallucinate at a higher rate than the company’s previous system, according to OpenAI’s own tests.
The company found that o3, its most powerful system, hallucinated 33 percent of the time when running its PersonQA benchmark test, which involves answering questions about public figures. That is more than twice the hallucination rate of OpenAI’s previous reasoning system, called o1. The new o4-mini hallucinated at an even higher rate: 48 percent.
When running another test, called SimpleQA, which asks more general questions, the hallucination rates for o3 and o4-mini were 51 percent and 79 percent. The previous system, o1, hallucinated 44 percent of the time.
In a paper detailing the tests, OpenAI said more research was needed to understand the cause of these results. Because AI systems learn from more data than people can wrap their heads around, technologists struggle to determine why they behave the way they do.
“Hallucinations are not inherently more common in reasoning models, though we are actively working to reduce the higher rates of hallucination we saw in o3 and o4-mini,” said a company spokeswoman, Gaby Raila. “We will continue our research on hallucinations across all models to improve accuracy and reliability.”
Hannaneh Hajishirzi, a professor at the University of Washington and a researcher at the Allen Institute for Artificial Intelligence, is part of a team that recently devised a way to trace a system’s behavior back to the individual pieces of data it was trained on. But because systems learn from so much data, and because they can generate almost anything, this new tool cannot explain everything. “We still don’t know exactly how these models work,” she said.
Tests by independent companies and researchers indicate that hallucination rates are also rising for reasoning models from companies such as Google and DeepSeek.
Since late 2023, Mr. Awadallah’s company, Vectara, has tracked how often chatbots veer from the truth. The company asks these systems to perform a simple task that is easily verified: summarize specific news articles. Even then, chatbots persistently invent information.
Vectara’s original research estimated that in this situation chatbots made up information at least 3 percent of the time and sometimes as much as 27 percent.
Since then, companies such as OpenAI and Google have pushed those numbers down into the range of 1 or 2 percent. Others, such as the San Francisco start-up Anthropic, have hovered around 4 percent. But hallucination rates on this test have risen with reasoning systems. DeepSeek’s reasoning system, R1, hallucinated 14.3 percent of the time. OpenAI’s o3 climbed to 6.8 percent.
(The New York Times has sued OpenAI and its partner, Microsoft, accusing them of copyright infringement regarding news content related to AI systems. OpenAI and Microsoft have denied those claims.)
For years, companies such as OpenAI relied on a simple concept: the more internet data they fed into their AI systems, the better those systems would perform. But they used up just about all the English text on the internet, which meant they needed a new way to improve their chatbots.
So these companies are leaning more heavily on a technique that scientists call reinforcement learning. With this process, a system can learn behavior through trial and error. It works well in certain areas, such as math and computer programming. But it is falling short in other areas.
“The way these systems are trained, they will start focusing on one task and start forgetting about others,” said Laura Perez-Beltrachini, a researcher at the University of Edinburgh who is part of a team closely examining the hallucination problem.
Another problem is that reasoning models are designed to spend time “thinking” through complex problems before settling on an answer. As they try to tackle a problem step by step, they run the risk of hallucinating at each step. The errors can compound as they spend more time thinking.
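A rough back-of-the-envelope sketch shows why that compounding matters. The figures below are hypothetical, not drawn from any of the tests described here, and they assume each step can fail independently with the same probability:

$$P(\text{at least one flawed step}) \;=\; 1 - (1 - p)^{n} \;\approx\; 1 - (0.98)^{20} \;\approx\; 0.33 \qquad \text{for } p = 0.02,\ n = 20.$$

In other words, even a modest 2 percent chance of error per step would, under these assumptions, leave roughly a one-in-three chance that a 20-step chain of reasoning contains at least one mistake.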
The newest bots reveal each step to users, which means the users may see each error, too. Researchers have also found that in many cases, the steps displayed by a bot are unrelated to the answer it eventually delivers.
“What the system says it is thinking is not necessarily what it is thinking,” said Aryo Pradipta Gema, an AI researcher at the University of Edinburgh and a fellow at Anthropic.