AI · copyright · fair use · intellectual property · AI regulation · data rights · big tech · governance

Fair use in 2026: is there even such a thing?

Glenn · April 7, 2026 · 7 min read

Meta has been claiming that torrenting is fair use for over a year now. So let's look at the rules and talk about what fair use actually is.

For anyone old enough to remember the war on piracy, you already know how absurd this situation is. In the early 2000s, the RIAA (Recording Industry Association of America) was on a warpath.

With Metallica against Napster as their poster child, the RIAA sued literal children, some of whom were fined beyond reason, for the act of downloading a few songs. Cease-and-desist letters went out by the tens of thousands, and approximately 30,000 to 40,000 people were threatened or sued. Among them was a 12-year-old girl named Brianna LaHara, whose parents were forced to pay a $2,000 settlement because their child had been downloading music on Kazaa (a peer-to-peer file-sharing platform).

The campaign escalated as DVDs became mainstream. You probably remember the unskippable warning that played before every film: "You wouldn't steal a car." The message was clear, and it was hammered into an entire generation: copyright infringement is theft. It is a serious crime with serious consequences.

This is something the creators of The Pirate Bay took into account. They did not share a single file themselves. They wrote software and hosted a service. For that, they were investigated for years, raided, and eventually charged with promoting others' copyright infringement and sent to prison.

By the 2010s, everyone knew: piracy is wrong. Copyright infringement is a serious crime. The cultural and legal consensus was total.

That is, unless you pirate the entire recorded history of human knowledge and feed it into a large language model.

Meta is claiming they are allowed to do exactly that. Their excuse is that it is for training purposes, and that the scale of what they are doing somehow changes the nature of the act. As if industrializing infringement makes it research. Every flagship model has been through some form of legal dispute over intellectual property, and the word they all reach for is the same: transformative. So what is transformative, and does it hold up?

Let's use a thought experiment here:

If I memorized every book in a library and then sold access to my recollections, I would be in serious legal trouble. But if I instead got blackout drunk, studied the linguistic patterns, word frequencies, and every author's distinct style so that I could guess the contents of those books with extreme precision, and built a service around that capability? Apparently, that is fine. I could then argue that I have no true recollection of the content, and that I have just become extremely good at guessing things.

The question now becomes whether that distinction holds up, or whether it is just a more sophisticated way of doing the same thing.

Retention vs. statistical reconstruction

Retained data preserves the original information intact. Statistical reconstruction does not. The source material is discarded, and only patterns, distributions, and summary characteristics are left in the model. The original substance is gone. What remains is a mathematical representation of it.

In machine learning, this process is called transformation. Raw data is converted into features, labels, or embeddings that a model can learn from. The model does not store the source. It stores what the source implied.
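
To make this concrete, here is a toy sketch of the principle. It is my own illustration, nowhere near how a production LLM is trained, but the mechanics are the same in spirit: a bigram model keeps only word-transition counts, discards the source text, and can still generate text that resembles it.

```python
from collections import Counter, defaultdict
import random

def train_bigram_model(corpus: str) -> dict[str, Counter]:
    """Count which word tends to follow which.
    Only co-occurrence statistics survive this step."""
    model = defaultdict(Counter)
    words = corpus.split()
    for current, following in zip(words, words[1:]):
        model[current][following] += 1
    return dict(model)  # the raw text is discarded here

def generate(model: dict[str, Counter], start: str, length: int = 10) -> str:
    """Reconstruct plausible text from the statistics alone."""
    out = [start]
    for _ in range(length):
        followers = model.get(out[-1])
        if not followers:
            break
        candidates, counts = zip(*followers.items())
        out.append(random.choices(candidates, weights=counts)[0])
    return " ".join(out)

corpus = "the cat sat on the mat and the cat slept on the mat"
model = train_bigram_model(corpus)
print(generate(model, "the"))  # e.g. "the cat slept on the mat and the cat sat"
```

The model holds no copy of the corpus, yet the better its statistics, the closer its output creeps toward the original. That is the drunk librarian from the thought experiment, in a few lines of code.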

What does the law say about fair use?

Fair use is a legal doctrine meant to balance the rights of creators with the public interest by allowing limited use of copyrighted work without permission. It exists to ensure that we can criticize, comment on, teach about, and build upon existing works. It preserves free expression, innovation, scholarship, and democratic discourse even when strict copyright would otherwise forbid it.

Courts weigh four factors: the purpose and character of the use, the nature of the work, how much of it was used, and the effect on the market for the original. No single factor is decisive. Every case is judged on its own facts. And as we are about to see, that becomes a serious problem when the defendant is a trillion dollar AI company and the facts are buried inside a black box nobody is allowed to inspect.

This brings us back to the word "transformative". If you take a copyrighted work and add something new, change its purpose, or create something with a fundamentally different function, courts are more likely to call it fair use. This is the argument Meta, Anthropic, and OpenAI have all leaned on. They are not copying your book. They are learning from it. That is their transformative claim. That sounds reasonable. It has also, depending on the court, been both accepted and rejected for exactly the same act.

The first major ruling came in Thomson Reuters v. Ross Intelligence in February 2025, and it went against the AI side. Ross had used Westlaw's proprietary legal headnotes to train a competing legal research tool. The court said that was not transformative. Ross was building a non-generative legal research tool, which courts treat differently from a general-purpose language model, and crucially, the end product competed directly with the very product it had learned from. The process was similar to what Meta and Anthropic would later argue. The headnotes were not stored verbatim. But the court did not care, because the output landed in the same market as the original. What this tells you is that the transformative argument is not really about what you did to the data. It is about what you built with it, and for whom. The process is irrelevant. The product is everything.

Then, just months later, two California courts reached the opposite conclusion. In Bartz v. Anthropic, a federal judge ruled that training Claude on copyrighted books was fair use, describing it as "quintessentially transformative". The model was not storing or reproducing the books. It was extracting patterns and converting them into numerical weights. The original expression was gone. What remained was the shape of it. The training itself was ruled transformative, though the case ultimately settled for a reported sum of up to $1.5 billion over the handling of pirated source material.

Two days later, in Kadrey v. Meta, the same logic was applied to a company that had conceded it used pirated copies of novels in its training. Meta won anyway. The judge found the purpose transformative regardless of how the data was obtained. The piracy-related claims were not fully resolved, but the core training use was ruled transformative. However, seeing that Meta is not a 12-year-old girl but a billion-dollar company, it gets a pass.

The variable that decided all of this was generality. Build a model that does everything and you are probably fine. Build one that competes with a specific creator in their own market and you lose. The law as it currently stands essentially rewards scale and indiscrimination. A general-purpose model ingests everything and produces anything, so it substitutes for nothing specifically. In other words, the bigger and more indiscriminate your data grab, the safer you are legally. Which is a strange incentive to have written into copyright law.

Then there is New York Times v. OpenAI, which is a story unto itself. The Times sued OpenAI in 2023 for training on millions of its articles without permission. OpenAI tried to have the case dismissed. The judge said no and sent it to trial. Then it emerged that logs that could have served as evidence had been deleted during the litigation. OpenAI contested the characterization, but the court was unconvinced. A preservation order was issued, forcing OpenAI to retain all user conversations indefinitely, including ones users had already deleted. OpenAI fought it at every stage and lost every time. They have now been ordered to hand over 20 million chat logs to the plaintiffs. The case is heading to trial, and if the Times can show that ChatGPT routinely reproduces their content in ways that substitute for reading the original, OpenAI's fair use defense gets very hard to sustain.

This is worth keeping in mind when anyone argues that AI companies can be trusted to self-regulate. One of the largest AI companies in the world had potential evidence deleted during active litigation, and only stopped when a court forced them to retain it.

The courts are probably doing their best. But they are making it up as they go, because the law was never written with any of this in mind. And the companies being judged have every incentive to keep it that way.

A new hope

So how do we fix this? One option is to wave the white flag in exchange for full transparency. And I mean full transparency, enough of it to globally level the playing field.

There is no realistic way to roll back, remove, or tune out data that models have already been trained on. Fighting over historical training data is a battle that will drag on in the courts for decades, and by the time we land on anything everyone agrees on, it will already be too late. Even if you believe, as I do, that creators deserve better than what they got, the fighting approach is tedious, expensive, and increasingly tilted in favor of the people with the best lawyers.

So here is the proposal. In exchange for amnesty on what has already been trained, we get a searchable and auditable record of what is in these models, who contributed to it, and how the models were weighted and tuned. Regulators, researchers, and creators get real visibility into what is happening. We as a society get a better understanding of why the models give the output they do. We get to see whether the models are deliberately leaving out data or weighting it toward an agenda. Fairness audits become possible. Future infringement becomes harder to hide.
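
What could such a record look like in practice? Purely as a hypothetical sketch, with every field name being my own invention rather than any existing or proposed standard, one entry in a public training-data manifest might carry provenance, licensing status, weighting, and opt-out state:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class TrainingSourceRecord:
    """One entry in a hypothetical public training-data manifest.
    All fields are illustrative, not an existing standard."""
    source_id: str           # stable identifier, e.g. a content hash
    title: str
    rights_holder: str
    license_status: str      # "licensed", "public_domain", "claimed_fair_use", ...
    acquired_via: str        # how the copy was obtained
    sampling_weight: float = 1.0   # how heavily it was weighted during training
    opted_out: bool = False        # honored in all future training runs
    ingested_on: date = field(default_factory=date.today)

entry = TrainingSourceRecord(
    source_id="sha256:...",        # placeholder, not a real hash
    title="Some Novel",
    rights_holder="Some Author",
    license_status="claimed_fair_use",
    acquired_via="web crawl",
)
print(entry)
```

With records like this published per model, "what did you train on, and how heavily?" stops being a question you can only answer through subpoenas.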

Creators and individuals could opt out of being included in any future training. And as a bonus, it levels the playing field in a way that ensures no single stakeholder runs away with the most powerful technology in human history.

It is not a perfect solution. The most obvious flaw is that developers could quietly delete training data before disclosing it, cycling out opted-out content without ever acknowledging it was there. So far, the most promising work on extracting what is actually inside a model has come from clever prompting techniques, but that field is still in its early days and nowhere near reliable enough to serve as a proper accountability mechanism.
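
To give a flavor of what those prompting techniques look like, here is a minimal sketch of one common idea, often called a memorization or extraction probe: feed the model the opening of a passage you suspect it was trained on, and measure how closely its continuation matches the real text. The model below is a stand-in function of my own so the sketch is self-contained; a real audit would call an actual model API and would need far more statistical care than this.

```python
import difflib

def verbatim_overlap(model_generate, prefix: str, true_continuation: str) -> float:
    """Probe for memorization: prompt with the opening of a known passage
    and compare the model's continuation to the real one.
    Returns a similarity ratio; 1.0 means verbatim reproduction."""
    generated = model_generate(prefix)
    return difflib.SequenceMatcher(None, generated, true_continuation).ratio()

def fake_model(prompt: str) -> str:
    """Toy stand-in so this sketch runs on its own."""
    return "darkness on the edge of town"

score = verbatim_overlap(fake_model, "It was a ", "dark and stormy night")
print(f"overlap: {score:.2f}")  # well below 1.0: no verbatim reproduction here
```

Run this against thousands of known passages and you get a crude map of what a model has memorized. Crude is the operative word, which is exactly why the field is not yet fit to carry an accountability regime.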

How to actually audit a model is a problem that deserves its own article. And it is a bigger problem than most people realize, because the answer right now is: we largely cannot do it in an effective manner. But that is a story for another day.

A rant on the phrase "fundamentally different function"

Fundamentally different function. This is a problematic phrase, because there is no hard line anywhere. There is no way to definitively say that on this side it serves a different function and on that side it does not. Generalist models serve both purposes at once.

Say you are a Stephen King fan. You have read all of his works, and now you are dying to get just one more story like them. Herein lies some of the value of human creativity: new work in a given style is scarce, and how often it appears is entirely up to the creator. Yet with some less-than-clever prompting, you could get a model to produce a novel that mimics Stephen King's style very well. At that point, it is serving the same function as Stephen King and his publisher. At the same time, the model is generalist, so it can do so much else. It can talk to you about sports, help you learn how to cook a perfect Sunday meal, help you decide which car is the best fit for you, and so on.

So the argument that it serves a fundamentally different function is too weak, in my opinion, because a generalist model is designed to serve all purposes. No matter who is suing, the companies claim "our model serves a fundamentally different function", even though it is serving all purposes at the same time. They just shift focus based on who is suing. An author? Oh, but our model is not designed to be a writer. A painter? Oh, our model is only designed for generic graphic design. So not only is the phrase vague, it is simply not accurate.

Further reading

RIAA / Brianna LaHara

The Pirate Bay

Bartz v. Anthropic

Kadrey v. Meta

Thomson Reuters v. Ross Intelligence

New York Times v. OpenAI