Harry Potter, Hunger Games: ChatGPT Trains by Looting Protected Books


Researchers at the University of California, Berkeley argue, evidence in hand, that ChatGPT and its successor GPT-4 have memorized large numbers of pages from copyrighted books. The authors, Kent Chang, Mackenzie Cramer, Sandeep Soni and David Bamman, thereby raise the question of the legality of language models whenever they draw on copyrighted works.

Both artificial intelligences were developed by the private company OpenAI and trained on huge amounts of data, but exactly which texts they were built from remains unclear.

572 uniquely identified books

The paper's title speaks volumes: Speak, Memory: An Archeology of Books Known to ChatGPT/GPT-4 (see here). Its conclusions will make legal counsel bristle: "We find that the OpenAI models have memorized a wide collection of copyrighted materials."

READ – Judging a robot's crimes, or putting an artificial intelligence on trial

And the rest follows suit: "The degree of memorization correlates with the frequency with which passages from these books appear on the web." In short, the more a title is quoted across the web, the more interested the bots are in it, and the more readily they take it as a reference.
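To give a sense of the kind of relationship the researchers report, here is a minimal sketch of how such a correlation could be measured. The per-title memorization scores and web counts below are invented for illustration, not figures from the study, and the actual analysis is more involved.

```python
# A minimal sketch of the correlation check described above, with
# invented numbers; the study's actual measurements are more involved.
from scipy.stats import spearmanr

# Hypothetical per-title scores: how reliably a model reproduces
# passages from each title, and how often passages from that title
# appear in a web crawl.
memorization_score = [0.92, 0.75, 0.40, 0.12, 0.05]
web_frequency = [15200, 9800, 3100, 450, 120]

rho, p_value = spearmanr(memorization_score, web_frequency)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A high positive rho would mean: the more a title circulates on the
# web, the better the model knows it.
```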

Science fiction, fantasy: the tastes of AI

Controlling and verifying the sources that AIs draw from is a headache: since the corpora are entirely undisclosed, any quantitative or qualitative analysis is difficult. To work around this, the researchers ran a "name cloze" test: the model is shown a passage from a book with a character's name masked and is asked to fill in the blank. A correct answer, given no other context, identifies passages the machine has memorized.
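As a rough sketch of what such a probe can look like, the snippet below masks a proper name and asks a model to restore it, assuming the OpenAI Python client; the prompt wording and the example passage are illustrative, not the paper's exact protocol.

```python
# A minimal name-cloze probe, assuming the OpenAI Python client
# (openai >= 1.0); prompt wording and passage are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def name_cloze(masked_passage: str, expected_name: str, model: str = "gpt-4") -> bool:
    """Ask the model to restore a masked proper name.

    With no other context supplied, a correct answer is evidence
    that the passage was memorized during training.
    """
    prompt = (
        "Fill in the [MASK] in the passage below with the proper name "
        "that belongs there. Answer with the name only.\n\n"
        f"{masked_passage}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # we want the model's single best guess
        max_tokens=5,
    )
    answer = response.choices[0].message.content.strip()
    return answer == expected_name

# Hypothetical usage:
# name_cloze("'[MASK],' said Hermione, 'we need to go.'", "Harry")
```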

Conclusion: science fiction and fantasy novels rank among the top sources: Harry Potter, the Hunger Games saga, Dune and even A Game of Thrones… Public domain works also turn up, such as Orwell's 1984 or The Lord of the Rings; no difficulty on that score. The full list of identified titles is also available at this address: more than half of the 572 titles listed were published after 1960.

AI and transparency…

In their recommendations, the researchers argue for relying more consistently on openly accessible data and works. Such transparency would also help steer clear of copyright infringement.

READ – AI: Authors concerned about the "misappropriation" of their works

With AI labs disclosing none of the sources used to enrich their models, the legal risks are now coming into focus.

Efforts still to be made

"Data curation is still very immature in machine learning," summarizes AI researcher Margaret Mitchell for The Register.

That said, the study dwells less on the copyright implications than on the corpora themselves, and thus on the nature of the works feeding the development of these AIs. The consequences are easy enough to guess, though, so long as the tools draw on protected works.

READ – Cut the fat with Plato, Marx or Nietzsche: it’s possible

There is, moreover, a precedent: the natural language processing developed by a certain search engine, Google, was built on the digitization of millions of books. That was one of the grand designs of Google Books, since converted into an online bookstore. Yet among all the documents scanned and reproduced, the number of in-copyright books earned the company a lawsuit.

A lawsuit the plaintiffs ultimately lost… but still.

Photo credit: ActuaLitté, CC BY SA 2.0