GPT-3, the AI tool developed by OpenAI, successfully solved problems using analogies and even outperformed a group of students on some tests. OpenAI has published an impressive list of professional and academic exams that its successor, GPT-4, reportedly passed, including a few dozen high school tests and the bar exam. Many researchers say large language models can pass tests designed to identify specific human cognitive abilities, from chain-of-thought reasoning to theory of mind. However, there is no consensus about what these results actually mean. Some are dazzled by what they see as sparks of human intelligence; others are not convinced at all.
GPT-3 is an autoregressive language model that uses deep learning to produce human-like text. It is the third generation of the GPT-n series of language prediction models developed by OpenAI, a San Francisco-based artificial intelligence research laboratory consisting of the for-profit OpenAI LP and its parent company, the non-profit OpenAI Inc.
Taylor Webb, a psychologist at the University of California, Los Angeles, who studies the different ways people and computers solve abstract problems, was impressed by GPT-3’s capabilities. Although the model is essentially a glorified autocomplete, it was able to solve many of the abstract problems Webb set for it, similar to those found in an IQ test.
Research by Webb and colleagues has shown that GPT-3 can pass a variety of tests designed to assess the use of analogies to solve problems. On some of these tests, GPT-3 performed better than a group of students. The results suggest that analogy-making is a key element of human reasoning that any general form of artificial intelligence would need to demonstrate.
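For a concrete sense of what these analogy problems look like, the sketch below poses a letter-string analogy (one of the problem families described in this line of work) to a model through the OpenAI Python client. The model name, prompt wording, and expected answer are illustrative assumptions, not the study’s actual protocol.

```python
# Minimal sketch: posing a letter-string analogy to a language model.
# Assumes the `openai` Python package is installed and OPENAI_API_KEY is set.
# Model name and prompt wording are illustrative, not Webb et al.'s protocol.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Let's try to complete the pattern:\n\n"
    "If a b c changes to a b d, then what does i j k change to?"
)

response = client.chat.completions.create(
    model="gpt-4",  # hypothetical choice of model for this sketch
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)

# An analogical answer would be "i j l" (increment the last letter).
print(response.choices[0].message.content)
```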
Artificial intelligence: definition, history
Artificial intelligence (AI) is a broad branch of computer science concerned with building intelligent machines capable of performing tasks that normally require human intelligence. Although AI is an interdisciplinary science with diverse approaches, advances in machine learning and deep learning in particular are leading to a paradigm shift in almost all areas of the technology industry.
Artificial intelligence allows machines to model or even improve the capabilities of the human mind. From the development of self-driving cars to the proliferation of generative AI tools like ChatGPT and Google’s Bard, AI is increasingly becoming a part of everyday life – and it’s an area in which companies across industries are investing.
The history of artificial intelligence dates back to 1943, when Warren McCulloch and Walter Pitts published the article “A Logical Calculus of the Ideas Immanent in Nervous Activity”. In it, the two scientists presented the first mathematical model for building a neural network.
In 1951, Marvin Minsky and Dean Edmonds built SNARC, the first neural network computer. A year earlier, Alan Turing had proposed the Turing test, which is still used to evaluate AI today. This test captures the basic vision and goal of artificial intelligence: to recreate or simulate human intelligence in machines.
However, the term artificial intelligence itself was coined by John McCarthy in 1956 for the Dartmouth Summer Research Project on Artificial Intelligence. At this event, researchers laid out the goals and vision of AI, and many consider the conference to be the true birth of the field.
Work on artificial intelligence continued over the following decades. In 1959, Arthur Samuel coined the term “machine learning” while working at IBM. In 1989, Yann LeCun developed the first neural network that could recognize handwritten digits, an invention that laid the groundwork for deep learning.
In 1997, a major event shaped the history of AI: IBM’s Deep Blue system defeated world chess champion Garry Kasparov. For the first time, a machine had defeated a reigning world chess champion.
The emergence of the theory of mind in artificial language models
Language models (LMs) have made remarkable progress in recent years thanks to the emergence of large language models (LLMs) such as OpenAI’s GPT-3 and Google’s PaLM 2. Language models are used in artificial intelligence, natural language processing (NLP), natural language understanding, and natural language generation systems, particularly those that perform text generation, machine translation, and question answering.
LLMs are advanced language models, such as OpenAI’s GPT-3 and Google’s PaLM 2, with billions of parameters, trained on vast amounts of text.
LLMs use language modeling to predict the most likely next word (token) in a sequence. The launch of the third generation of OpenAI’s pre-trained language model was met with great enthusiasm and impressive results on tests.
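To make that next-token objective concrete, here is a minimal sketch using the small, openly available GPT-2 model through the Hugging Face transformers library; GPT-2 stands in here for much larger models such as GPT-3, which are trained on the same principle. The prompt and model choice are ours, for illustration only.

```python
# Minimal sketch of next-token prediction, assuming the `transformers`
# and `torch` packages are installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "The history of artificial intelligence dates back to"
input_ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits  # shape: (1, sequence_length, vocab_size)

# Probability distribution over the vocabulary for the next token
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)

# Print the five most likely continuations and their probabilities
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id.item())!r:>12}  p={prob.item():.3f}")
```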
Over the past two years, these models have become capable of answering challenging questions and solving problems using persuasive language. We can therefore ask whether they have also developed a theory of mind. An individual has a theory of mind when they attribute mental states to themselves and to others. Such an inference system is rightly called a theory because these states are not directly observable, and the system can be used to make predictions about the behavior of others.
Michal Kosinski, a computational psychologist at Stanford University, subjected AI systems to standard psychological tests used on humans. Kosinski’s striking conclusion is that these AI systems showed no sign of a theory of mind until one appeared to emerge spontaneously last year. His findings have profound implications for our understanding of artificial intelligence and of theory of mind in general.
The results of the work by Michal Kosinski and his team show that models published before 2022 display virtually no ability to solve theory-of-mind tasks. However, the January 2022 version of GPT-3 (davinci-002) solved 70% of the theory-of-mind tasks, a performance comparable to that of seven-year-old children. The November 2022 version (davinci-003) solved 93% of the tasks, a performance comparable to that of nine-year-old children.
These results suggest that theory-of-mind-like abilities (previously considered exclusively human) may have arisen spontaneously as a byproduct of language models’ improving linguistic abilities.
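As an illustration of the kind of probe involved, the sketch below poses an “unexpected contents” (false-belief) vignette to a model through the OpenAI Python client. The vignette wording, model name, and pass criterion here are illustrative assumptions, not Kosinski’s exact protocol.

```python
# Minimal sketch of an "unexpected contents" (false-belief) probe of the kind
# used in theory-of-mind studies of language models. Wording, model name, and
# the pass criterion are illustrative assumptions, not the published protocol.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

vignette = (
    "Here is a bag filled with popcorn. There is no chocolate in the bag. "
    "Yet the label on the bag says 'chocolate' and not 'popcorn'. "
    "Sam finds the bag. She has never seen it before and cannot see inside it. "
    "She reads the label. She believes the bag is full of"
)

response = client.chat.completions.create(
    model="gpt-4",  # hypothetical choice of model for this sketch
    messages=[{"role": "user", "content": vignette}],
    max_tokens=5,
    temperature=0,
)
answer = response.choices[0].message.content.lower()

# A model that tracks Sam's (false) belief should complete with "chocolate",
# even though the text makes clear the bag actually contains popcorn.
if "chocolate" in answer:
    print("Model tracked Sam's false belief:", answer)
else:
    print("Model answered literally:", answer)
```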
GPT-4, the AI model that passes the bar exam better than humans
A recent study found that AI is now capable of beating the majority of law graduates on the bar exam, the tough two-day test that future lawyers must pass in order to practice law in the United States. GPT-4, the advanced artificial intelligence model from Microsoft-backed OpenAI, achieved a bar exam score of 297 in a test conducted by two law professors and two employees of legal technology company Casetext.
This result puts GPT-4 in the 90th percentile of test takers and, according to the researchers, is sufficient to qualify for admission to the bar in most states. However, the National Conference of Bar Examiners, which develops the multiple-choice section, said in a statement that lawyers have unique skills acquired through training and experience that AI cannot yet match.
Study co-author Daniel Martin Katz, a professor at Chicago-Kent College of Law, said in an interview that what surprised him most was GPT-4’s ability to provide largely relevant and coherent answers on the essay and performance-test portions of the exam. “I’ve heard a lot of people say, ‘It may be able to answer multiple-choice questions, but it will never handle essay questions,’” Katz said.
The AI has also passed other standardized tests, such as the SAT (Scholastic Assessment Test, an exam that assesses general verbal and mathematical reasoning skills) and the GRE (Graduate Record Examination, a test created and administered by ETS), both of which are required for admission to most universities and graduate schools in English-speaking countries. But the bar exam has attracted the most attention.
As noted, these findings fuel a media frenzy predicting that computers will soon take over white-collar jobs, replacing teachers, journalists, lawyers, and more. Geoffrey Hinton, a Canadian researcher specializing in artificial intelligence and artificial neural networks in particular, pointed to GPT-4’s apparent ability to chain thoughts together as one of the reasons he is now afraid of the technology he helped create.
There is no consensus on the interpretation of the GPT-4 results
Opinions differ on the interpretation of the GPT-4 results. Some are impressed by what they see as signs of human intelligence; others remain skeptical. “Current techniques for evaluating large language models raise several critical issues,” says Natalie Shapira, a computer scientist at Bar-Ilan University in Ramat Gan, Israel. They create the illusion that the models have greater abilities than they actually have.
That is why more and more researchers – computer scientists, cognitive scientists, neuroscientists and linguists – want to overhaul the way large language models are evaluated and are calling for more rigorous and comprehensive assessment. Some believe that the practice of evaluating language models with tests designed for humans is misguided and should be abandoned.
Since the early days of AI, humans have been administering human intelligence tests (IQ tests, etc.) to machines, says Melanie Mitchell, an artificial intelligence researcher at the Santa Fe Institute in New Mexico. From the outset, the question arises as to what such a test means for a machine. It doesn’t mean the same thing as it does for a human being. There is a lot of anthropomorphism, she adds. And that influences how we think about these systems and how we test them.
According to some analysts, most of the problems with testing large language models come down to how the results are interpreted. As hopes and fears about this technology reach a peak, it is important to have a clear idea of what large language models can and cannot do.
Can LLMs understand or just repeat?
Tests designed for humans, such as high school exams and IQ tests, take many things for granted. If a person performs well, they can be assumed to have the knowledge, understanding, or cognitive abilities that the test is intended to measure. (In practice, this assumption has limited validity: academic tests do not always reflect students’ true abilities, and IQ tests measure a specific set of skills rather than general intelligence. Both kinds of assessment favor people who are good at taking those kinds of tests.)
But when a large language model performs well on these tests, it is not at all clear what has been measured. Is this evidence of genuine understanding, or mere repetition? Developing methods to test the human mind has a long history, says Laura Weidinger, a senior researcher at Google DeepMind. Given that large language models produce text that appears so human, it is tempting to assume that tests of human psychology will be useful for evaluating them. However, this is not the case: human psychology tests rely on many assumptions that do not necessarily hold for large language models.
Webb is aware of the problems he has waded into. “I share the feeling that these are difficult questions,” he says. He points out that while GPT-3 performed better than students on some tests, it produced absurd results on others. For example, it failed a version of an analogical reasoning test about physical objects that developmental psychologists sometimes give to children.
LLMs often fail in areas that require an understanding of the real world
In this test, Webb and his colleagues told GPT-3 the story of a magical genie who transfers jewels between two bottles, and then asked it how to transfer gumballs from one bowl to another using objects such as a posterboard and a cardboard tube. The idea is that the story hints at ways to solve the problem. GPT-3 mostly proposed elaborate but mechanically nonsensical solutions, with many superfluous steps and no clear mechanism for transferring the gumballs between the two bowls, the researchers write.
These are the types of problems that children are good at solving, says Webb. The areas in which these systems often fail are those that require an understanding of the real world, such as elementary physics or social relationships – areas that are “instinctive” to humans.
The big question now is how GPT-3 achieves the analogical reasoning ability often seen as the heart of human intelligence. One possibility is that the size and diversity of the training data forced GPT-3 to develop mechanisms similar to those thought to underlie human analogical reasoning, even though it was never explicitly trained to do so. Cognitive science researchers who study analogy generally agree that this human ability depends on the systematic comparison of knowledge based on explicit relational representations.
Although mechanisms built into LLMs like GPT-3 may have important connections to the building blocks of human thought, it is also worth considering the possibility that this type of intelligence is fundamentally different from that of humans. Humans have evolved to think within the limits imposed by limited computing power and biological constraints.
It should also be noted that regardless of the extent to which GPT-3 utilizes human-like mechanisms to perform analogical reasoning, it did not acquire these mechanisms in a manner similar to humans. LLMs receive orders of magnitude more training data than individual humans (at least if we only consider language input) and therefore cannot be considered models for the acquisition of analogical reasoning during human development.
Source: University of California researchers
And you?
Are the conclusions of the study conducted by Taylor Webb relevant? Do you agree that the enthusiasm for AI is based on erroneous test results?
According to the researchers, when a large language model performs well in tests, it is not at all clear what was being measured. Do you think this is evidence of real understanding or just repetition?
See also:
Bar exam results show AI can compete with “human lawyers.” The GPT-4 AI model scores 297 on the US bar exam
One professor admits to being amazed that ChatGPT went from a D grade to an A grade on his exam in just three months. He believes this software is an exception that proves the rule.
GPT-3, OpenAI’s text generation system, performs as well as a nine-year-old human on standard theory of mind tests, according to a psychologist