OpenAI’s ChatGPT language model has been hailed as “easily the best artificial intelligence-based chatbot ever brought to the public” and as “one of the greatest achievements in computer science”. However, researchers at Stanford University and the University of California, Berkeley say they would be unwilling to rely on ChatGPT for important decisions.
Researchers Lingjiao Chen, Matei Zaharia, and James Zou echo a growing number of users who have voiced concern, reporting that ChatGPT’s performance is not consistent and that, on some tasks, it gets progressively worse.
In a paper published last week on the preprint server arXiv, the researchers report that “the performance and behavior of the GPT-3.5 and GPT-4 models vary significantly” and that the answers to certain tasks “have definitely become less accurate over time”.
These researchers noted significant changes in performance over a four-month period between March and June.
To get to the bottom of the matter, the authors focused on specific task areas, including solving mathematical problems and generating computer code.
In March 2023, GPT-4 answered questions about prime numbers with 97.6% accuracy. According to the Stanford researchers, that rate fell to 2.4% when the updated model went live in June of this year.
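The prime-number test behind these figures is straightforward in principle: the model is asked whether a given number is prime, and its yes/no answer is compared against the ground truth. A minimal sketch of that kind of scoring, assuming a hypothetical `ask_model` function that returns the chatbot’s reply as text, might look like this:

```python
from sympy import isprime  # ground-truth primality check

def primality_accuracy(numbers, ask_model):
    """Ask the model whether each number is prime and score its yes/no answers."""
    correct = 0
    for n in numbers:
        reply = ask_model(f"Is {n} a prime number? Answer yes or no.")
        model_says_prime = reply.strip().lower().startswith("yes")
        if model_says_prime == isprime(n):
            correct += 1
    # Fraction of correct answers, e.g. roughly 0.976 vs. 0.024 in the reported results
    return correct / len(numbers)
```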
In computer programming, GPT-4 likewise delivered what programmers asked for in March, producing directly executable code in just over 50% of cases. By June, that rate had dropped to 10%. GPT-3.5 saw a similarly sharp drop, from 22% in March to 2% in June.
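“Directly executable” here means the model’s raw reply can be run as-is, without someone first stripping explanations or Markdown fences around the code. A rough illustration of such a check for Python output, not the researchers’ actual harness, could be:

```python
def is_directly_executable(reply: str) -> bool:
    """Return True if the raw reply parses as Python code without any cleanup."""
    try:
        compile(reply, "<model-reply>", "exec")
        return True
    except SyntaxError:
        # Replies wrapped in ```python fences or surrounded by prose fail this strict check
        return False
```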
Interestingly, GPT-3.5’s math performance followed almost the reverse trend: in March its accuracy on the prime-number problems was only 7.4%, but by June the updated version’s success rate was 86.8%.
According to Zou, it is difficult to determine the cause of these changes, though system modifications and model updates both appear to play a role.
“We don’t fully understand what caused these changes in ChatGPT’s responses, as the models are opaque,” Zou continued. “It’s possible that adjustments made to the model to improve its performance in some areas had unexpected side effects that worsened its results on other tasks.”
Conspiracy theorists, pointing to the deteriorating results, speculate that OpenAI is experimenting with smaller, cheaper versions of its language models to save money. Others believe OpenAI is intentionally weakening GPT-4 so that frustrated users are more likely to pay for additional services and features.
The company rejects such claims. In early July, the researchers recall, Peter Welinder, VP of Product at OpenAI, said on Twitter: “We didn’t make GPT-4 stupid. Quite the opposite: we made each new version smarter than the one before it.”
According to Welinder, the explanation could be simpler: “If you use it [the tool] more intensively, you start to discover problems you didn’t see before.”
Meanwhile, the researchers say, some industry watchers are concerned about the impact of these disruptive changes and are urging OpenAI to disclose the material used to develop GPT-4, the language model underlying ChatGPT.
According to Sasha Luccioni of AI company Hugging Face, “All results based on closed-source models are neither repeatable nor verifiable, and so from a scientific point of view we are comparing apples and oranges.”
“It’s not up to scientists to constantly monitor language models,” Ms. Luccioni recently told Ars Technica. “The creators of the models must provide access to their underlying data, if only for verification purposes.”