
Meta’s Alleged Book Piracy Raises IP Risks Across the Generative AI Ecosystem

The story at the end of last week was about how Meta pirated books to train its AI.

A recent investigation by The Atlantic highlights the extensive use of pirated books by Meta to train its AI model, Llama 3. Court documents reveal that Meta employees expressed urgency in acquiring books, stating that they are "actually more important than web data." To expedite this process, Meta turned to Library Genesis, a massive repository of over 7.5 million books and 81 million research papers, which is known for its illegal distribution of copyrighted material. Despite the legal risks, Meta reportedly used torrenting to access this vast library. Relying on such pirated resources raises significant ethical questions about the future of knowledge sharing, as generative AI technologies increasingly draw on copyrighted works without proper licensing.

Why do we care?

This is not just about one company's ethical lapse; it is about the foundational legality of large-scale generative AI development. The strategic implication is massive: training data provenance is now a commercial liability. Enterprises cannot blindly trust foundation models for commercial or regulated use cases if there is no clear IP chain of custody.

A two-tier AI ecosystem is forming: one that plays by copyright/licensing rules (and moves slower), and one that scrapes, pirates, and ships fast.

Service providers and consultants need to evaluate the AI models they integrate or resell.

Clients in regulated sectors (finance, healthcare, education, etc.) could face legal exposure by proxy if the underlying model includes illicit training data.

Vetting model provenance—once a technical curiosity—is a due diligence requirement.

Meta has positioned Llama as the "open" alternative to OpenAI and Google. But here's the problem: if open-weight models are built on illegally sourced data, they are no longer safely reusable. Companies that value open models for flexibility or cost now have to factor in IP exposure as a hidden cost.