US Companies Buying and Destroying European Books: A Controversial AI Training Method
At a time when artificial intelligence (AI) is evolving rapidly, a troubling trend has emerged: US companies are acquiring and then destroying European books. The aim is to create vast datasets for training language models like those from Anthropic and OpenAI. The practice raises significant ethical questions about cultural preservation and copyright.
The Process of Book Acquisition and Destruction
Recent legal documents from a copyright dispute shed light on how companies like Anthropic operate. They purchase millions of printed works and send them to service providers, where the books are disassembled and scanned page by page, and the physical copies are then recycled. The resulting digital files are intended for model training, but the process destroys every physical copy that passes through it.
The Importance of Books as Cultural Heritage
Books are often viewed as cultural treasures: narratives that shape societies and educate individuals. For companies in the AI sector, however, books serve primarily as high-quality training material. In their pursuit of large datasets, these companies see no value in retaining the physical copies.
Why Are Books Essential for AI Training?
AI models like those from Anthropic and Google rely on large amounts of text to learn and improve their language comprehension. High-quality written works, such as books, offer carefully edited language, complex reasoning, and intricate narrative structures that short online articles cannot match.
Distinguishing Quality from Quantity
Internally, Anthropic emphasizes that books help teach AI to write well rather than merely emulate the often haphazard language found on many internet platforms. Data quality is crucial, and books provide a robust foundation for it.
Purchasing Instead of Pirating
Initially, many AI firms resorted to “shadow libraries” such as LibGen, which provide unauthorized access to millions of digitized books. This practice led to numerous lawsuits, in which authors and publishers accused companies of infringing copyright by using their works for model training without permission. Consequently, Anthropic took a different route: it opted to buy physical books in bulk and digitize them itself.
The Source: European Book Markets
For its large-scale acquisitions, Anthropic used vendors like Better World Books and the British second-hand book dealer World of Books. These sellers hold extensive inventories of used books, allowing companies to purchase large quantities at relatively low cost. Buying in bulk is also simpler than obtaining individual licenses from publishers and authors when gathering material for digitization.
The Shocking Reality: Destruction of Books
To convert books into digital form efficiently, the spines are cut off so the loose pages can be scanned. This method captures content quickly but destroys the physical books. After scanning, the remaining paper is recycled, leaving only a digital archive.
Conclusion: The Ethical Implications
While the push for advanced AI capabilities is understandable, the practice of acquiring and destroying European books highlights a significant ethical dilemma. The intersection of technology, culture, and copyright presents complex challenges that need to be navigated carefully. As AI continues to develop, it’s essential to consider the cultural heritage that’s being sacrificed for commercial gain and technological advancement.

