Markdown pages, are they a good solution?

I already wrote a post about an HTML-to-markdown converter solution. In that post I suggested that handling this at the application level would be a better solution. In the last week a Laravel markdown response package and a Symfony markdown response bundle popped up. I guess web frameworks in other languages will get similar solutions.
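The core of such application-level solutions is content negotiation: serve markdown when the client asks for it, HTML otherwise. A minimal, framework-free sketch of the idea (the function and its inputs are hypothetical, not the actual Laravel or Symfony API):

```python
# Sketch of Accept-header content negotiation. The application keeps
# both a markdown and an HTML representation of the same page and
# picks one based on what the client requested.

def negotiate(accept_header: str, markdown_src: str, html: str) -> tuple[str, str]:
    """Return a (content_type, body) pair based on the Accept header."""
    if "text/markdown" in accept_header:
        return "text/markdown", markdown_src
    return "text/html", html

# A scraper sending "Accept: text/markdown" gets the markdown source;
# a browser gets the HTML page.
ctype, body = negotiate("text/markdown", "# Hello", "<h1>Hello</h1>")
```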

I consider those solutions to be partial fixes, because they lack the tools to trim or augment the page content so an LLM gets data it can act on. If you want to provide content for LLMs, I think the best solution is a backend one instead of a frontend one. I consider fully rendered HTML to be part of the frontend.
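To make the backend approach concrete: instead of converting rendered HTML back to markdown, the backend renders markdown directly from its own data, keeping only the fields an LLM can act on and optionally adding data the HTML page never shows. A minimal sketch, with a hypothetical article record:

```python
# Sketch: build LLM-facing markdown straight from backend data,
# skipping the HTML layer entirely. The field names are assumptions.

def render_for_llm(article: dict) -> str:
    """Render only the actionable fields: no navigation, no footer,
    augmented with metadata that suits an LLM rather than a browser."""
    lines = [
        f"# {article['title']}",
        "",
        f"Published: {article['published']}",
        "",
        article["body_markdown"],  # stored as markdown, not converted from HTML
    ]
    return "\n".join(lines)

article = {
    "title": "Markdown pages",
    "published": "2024-01-01",
    "body_markdown": "Serving markdown from the backend keeps control of the content.",
}
print(render_for_llm(article))
```

The point of the sketch is that trimming and augmenting happen before any HTML exists, which is exactly what a frontend-side converter cannot do.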

The elephant in the room is that webpages are also a human construct. An AI scraper doesn't need to follow human navigation. Page links are useless if it can grep the content.

How did we get here?

The goal of search engine bots was to find all the pages of a website, put them in a search index, and rank them.

The purpose of AI bots is to scrape content from websites to use as additional knowledge for an LLM.

While search engine bots also scraped content, the content itself was not their main objective.

Search engine bots are also a minor part of the traffic, and they are part of the marketing cost, because they expose the website to a bigger audience.

AI bots are becoming a substantial part of the traffic, and they haven't proven their marketing worth or any other benefit.

It seems logical to me that people's first reaction was to block AI traffic. When people discovered that food packaging contained less food, they were not happy. When food companies started using lower-quality ingredients because their sales had reached a ceiling, people were again unhappy. The sad fact is that people keep buying the product. And I think we are at the same stage with AI: websites are allowing AI scrapers because it could be beneficial.

What is the solution?

If you want to provide data for an LLM, I think separate LLM and human websites are a better way to go. The LLM website can be nothing more than a collection of linked markdown files.
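Such an LLM website needs almost no tooling: a script that writes an index page linking every markdown file gives a scraper one entry point from which the whole site is discoverable. A minimal sketch, with a hypothetical `llm-site` directory:

```python
# Sketch: the "LLM website" as a static set of linked markdown files.
# The directory name and page names are illustrative assumptions.
from pathlib import Path

def build_index(root: Path) -> str:
    """Create index.md content that links every markdown page, so a
    scraper can discover the whole site from a single entry point."""
    lines = ["# Index", ""]
    for page in sorted(root.glob("*.md")):
        if page.name == "index.md":
            continue
        lines.append(f"- [{page.stem}]({page.name})")
    return "\n".join(lines)

root = Path("llm-site")
root.mkdir(exist_ok=True)
(root / "about.md").write_text("# About\n")
(root / "posts.md").write_text("# Posts\n")
(root / "index.md").write_text(build_index(root))
print((root / "index.md").read_text())
```

Because the output is plain files, any static file server, CDN, or edge host can serve it as-is.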

The second part of the solution is to provide a search that returns data an LLM or an agent can use. The main goal of the search is to provide specific information, or information not found on the LLM website. I don't think REST(ful) or GraphQL endpoints are good enough, because their output is not LLM-specific.
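The difference from a REST or GraphQL endpoint is the shape of the output: instead of JSON built for a frontend, the search returns a markdown fragment an LLM can consume directly. A minimal sketch, where the document store and the matching rule are illustrative assumptions:

```python
# Sketch: a search whose results are markdown for an LLM, not JSON
# for a frontend. DOCS stands in for a real index or database.

DOCS = {
    "pricing": "Plans start at 10 EUR per month; yearly billing gets two months free.",
    "support": "Support is available on weekdays via email.",
}

def search_for_llm(query: str) -> str:
    """Return matching entries as a small markdown fragment."""
    terms = query.lower().split()
    hits = [
        f"## {key}\n\n{text}"
        for key, text in DOCS.items()
        if any(term in key or term in text.lower() for term in terms)
    ]
    if not hits:
        return "No results. Try different terms."
    return "\n\n".join(hits)

print(search_for_llm("pricing"))
```

Even the empty result is a sentence the agent can act on, rather than an empty JSON array it has to interpret.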

The benefit of the LLM website is that it consists of static pages, so you could host them on edge servers when you see traffic ramping up in a certain region. The benefit of a search is that you could put a paywall in front of it, charging AI scrapers for more frequent access to the searchable content, or for extra information. The benefit of this solution is that HTML page traffic will become more human again, once the people running AI scrapers are aware of these options.