Auth Lab Community

Meta Admits to Using Pirated Books to Train AI and Refuses to Compensate Authors

In recent years, Large Language Models (LLMs) have advanced rapidly, but that progress has been shadowed by copyright disputes. Tech giants train LLMs on massive amounts of text, which inevitably includes copyrighted works, and this has sparked strong protests from authors and media organizations.

Recently, Meta (formerly known as Facebook) has faced a class action lawsuit from a group of authors, including comedian Sarah Silverman and writer Richard Kadrey, for training its LLaMA 1 and Llama 2 models on the “Books3” dataset, which contains a large number of pirated books. Meta admits to using the Books3 dataset but refuses to compensate the authors.

Books3 is a text dataset created by AI researcher Shawn Presser in 2020, containing 195,000 books with a total size of nearly 37GB. It was intended to provide a better data source for improving machine learning algorithms, and Meta used it to train its own LLMs. However, Books3 contains a large number of copyrighted works obtained from pirate websites such as Bibliotik, which puts Meta at legal risk.

IT House notes that Meta’s case is not an isolated one. The New York Times previously sued OpenAI and Microsoft for using its articles to train the ChatGPT chatbot. OpenAI argued that it is “almost impossible” to train AI models without using copyrighted materials and asked the court to dismiss the lawsuit. Meta similarly denies intentionally infringing copyrights, claiming that its use of the Books3 dataset falls under fair use and does not require permission, attribution, or compensation.

In addition, Meta is contesting the lawsuit’s status as a class action and refusing to provide any form of financial compensation to the authors or other parties involved in the Books3 controversy.

It is worth noting that some of the content in the Books3 dataset comes from the pirate website Bibliotik. In 2023, the Danish anti-piracy organization Rights Alliance requested that the dataset be taken down, and it currently faces a ban on digital archiving.

Source: IT House