LLMs.txt: The File AIs Don’t Want You to Know About

What if the robots.txt file had a cousin dedicated to generative artificial intelligence? That’s exactly the idea behind LLMs.txt, a proposal by Jeremy Howard that could well redraw the rules for AI access to web content. No panic, no euphoria, just a development worth watching closely.

Key points:

  • LLMs.txt is a file intended to regulate generative AI access to web content.
  • It provides publishers with a way to specify which sections of their site can and cannot be viewed by AI crawlers.
  • Inspired by robots.txt, LLMs.txt is specifically aimed at data collectors used to train language models.
  • Although promising, it remains to be seen whether AI players will adopt and honor it.

LLMs.txt: A New Signpost for AI

Why is this file a game-changer?

Search engines have their rules. Since the 1990s, the robots.txt file has let websites tell crawlers which pages they may and may not visit. It’s simple, effective, and a little old-fashioned. But generative AIs like ChatGPT or Claude? They don’t necessarily obey the same codes.

The LLMs.txt file aims to bridge this gap. In short, it would offer publishers a way to say, “You can read this, but not that.” Or even, “You don’t touch anything.” It’s a kind of digital courtesy contract, tailored for AI models.

A robots.txt for the LLM era?

The comparison is tempting, but not entirely accurate. Where robots.txt is respected (more or less) by Googlebot and its ilk, LLMs.txt is aimed directly at AI crawlers, those used to train language models. Think of data collectors such as Common Crawl and LAION, or the crawlers run by OpenAI and Anthropic.
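For comparison, here is a minimal sketch of how a robots.txt rule aimed at an AI crawler can be checked with Python's standard-library `urllib.robotparser`. `GPTBot` is the user-agent token OpenAI publishes for its crawler; the rules and the `example.com` URL are made up for illustration. Keep in mind that this only reports what the file asks for. Nothing forces a crawler to run such a check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
# parse() takes the file content as a list of lines, so no network call is needed.
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/blog/post"  # illustrative URL
gptbot_allowed = parser.can_fetch("GPTBot", url)
googlebot_allowed = parser.can_fetch("Googlebot", url)

print("GPTBot allowed:", gptbot_allowed)
print("Googlebot allowed:", googlebot_allowed)
```

A well-behaved crawler would run the equivalent of `can_fetch` before requesting a page and skip anything disallowed for its own user agent.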

Concretely, what does it look like?

A simple, yet effective syntax

The LLMs.txt file would be placed at the root of a site, just like its predecessor. Inside are instructions readable by AI crawlers: general information, tips, and links to detailed Markdown files. Here’s the placeholder example given in Jeremy Howard’s documentation:

# Title

> Optional description goes here

Optional details go here

## Section name

- [Link title](https://link_url): Optional link details

## Optional

- [Link title](https://link_url)

For a real-life example, the Anthropic website publishes a file of its own.
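To make the structure concrete, here is a rough sketch of how a tool might read such a file. It assumes the layout of the placeholder example above (one `#` title, an optional `>` description, `##` sections containing hyphen-marked Markdown links). The function name `parse_llms_txt` and the output dictionary are hypothetical choices for illustration, not part of the proposal.

```python
import re

# Sample content following the placeholder structure shown above.
SAMPLE = """# Title

> Optional description goes here

Optional details go here

## Section name

- [Link title](https://link_url): Optional link details

## Optional

- [Link title](https://link_url)
"""

# Matches "- [title](url)" with an optional ": details" suffix.
LINK_RE = re.compile(
    r"^-\s*\[(?P<title>[^\]]+)\]\((?P<url>[^)]+)\)(?::\s*(?P<details>.*))?$"
)


def parse_llms_txt(text):
    """Split an llms.txt-style document into title, description, and sections."""
    result = {"title": None, "description": None, "sections": {}}
    current = None  # name of the section currently being filled, if any
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            current = line[3:].strip()
            result["sections"][current] = []
        elif line.startswith("# ") and result["title"] is None:
            result["title"] = line[2:].strip()
        elif line.startswith(">") and result["description"] is None:
            result["description"] = line[1:].strip()
        elif current is not None:
            match = LINK_RE.match(line)
            if match:
                # "details" is None when the link has no ": ..." suffix.
                result["sections"][current].append(match.groupdict())
        # Free-text lines outside a section are ignored in this sketch.
    return result


parsed = parse_llms_txt(SAMPLE)
print(parsed["title"])
print(list(parsed["sections"]))
```

The point of the design is that the file stays ordinary Markdown: a few string checks and one regular expression are enough to pull out the links a model would follow.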

It’s clear, readable, and potentially very useful. But nothing is mandatory at this stage: adoption is purely voluntary.

This is where things get murky. The file doesn’t yet have any solid legal status. It’s a standard proposed by the tech community (notably via Answer.AI, the lab Jeremy Howard co-founded), and whether it is honored will depend on the goodwill of AI stakeholders.

So yes, on paper, it’s attractive. But we’ve seen what happens with robots.txt: not everyone plays along.

Towards a new digital social contract?

Who has the right to read what?

This is the big question at the moment. Publishers are worried. Seeing their content sucked up, digested, and remixed without authorization, sometimes even without attribution, is hard to swallow. And we understand why.

With LLMs.txt, the idea would be to rebalance the power. Give creators a little more control. A minimum of consent in an often overly voracious ecosystem.

Unanswered questions (for now)

We’re still in the early stages. Who will actually respect this protocol? Will it need to be accompanied by a legal framework? Will governments follow suit? And above all: how can you verify that your content hasn’t been absorbed by a model despite your instructions?

Nothing is decided. But the initiative at least has the merit of laying the foundations.

Why you should care (a little bit anyway)

Even if you’re not a lawyer, developer, or publisher, this topic concerns you. Because it touches on a sensitive issue: the value of what we publish. On a blog, a newsletter, or an e-commerce site, your words are worth something. And these files may be the first building blocks of a form of digital respect.

Some developments to watch

  • Upcoming protocol updates
  • The positions of the web giants (Google, Meta, OpenAI, etc.)
  • How CMSs like WordPress will integrate this logic

And frankly, who wants to see their content fed to AI without even a “thank you” in return?

Last thing: things will move quickly

There’s no need to redo your entire website today. But keeping an eye on things isn’t a luxury. As is often the case with digital technology, things move forward quietly… then suddenly change.

LLMs.txt isn’t a magic wand. More like a signal. A gentle warning. And perhaps the beginning of a more balanced relationship between AI and those who power the internet every day. You, us, all those who write, share, and create.
