What I gathered from the post was that one of the investigations was to ask it what was on [some page url], then check the logs moments later, which showed the request arriving with a normal user agent.
You can just point it at a webserver and ask it a question like "Summarize the content at [URL]" with a sufficiently unique URL that no one else would hit, maybe one containing a UUID. This is also explored in the article itself.
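To make that concrete, here is a rough sketch of the unique-URL test in Python. It assumes your server writes the common "combined" access log format (the nginx and Apache default). The `probe-` path prefix and both function names are made up for illustration.

```python
import re
import uuid

# Assumption: "combined" access log format, e.g.
# 1.2.3.4 - - [time] "GET /path HTTP/1.1" 200 512 "referer" "user agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def make_probe_url(base_url):
    """Return (token, url) where the URL is unique enough that nobody else hits it."""
    token = uuid.uuid4().hex
    return token, f"{base_url.rstrip('/')}/probe-{token}"


def find_probe_hits(log_lines, token):
    """Return (ip, time, user_agent) for every request whose path contains the token."""
    hits = []
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match and token in match.group("path"):
            hits.append((match.group("ip"), match.group("time"), match.group("agent")))
    return hits
```

Ask the assistant to summarize the URL from `make_probe_url`, then run `find_probe_hits` over the access log a minute later. Any hit has to come from that request, and the user agent column tells you what it identified itself as.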
In my testing they're using crawlers on AWS, and they do not parse JavaScript or CSS, so it is sufficient to serve some kind of interstitial challenge page like the one on Cloudflare, or you can build your own.
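If you want to build your own, a bare-bones version looks something like the sketch below. This is not how Cloudflare does it, only the general idea: a client that never runs the script never gets the cookie, so it never sees the real page. The cookie name `js_ok` and the page bodies are placeholders, and a fixed cookie value is trivial to fake, so a real setup would want a signed, expiring token.

```python
from http.cookies import SimpleCookie
from http.server import BaseHTTPRequestHandler, HTTPServer

COOKIE_NAME = "js_ok"  # hypothetical cookie name, pick your own

# Served to any client that has not yet proven it runs JavaScript.
CHALLENGE_PAGE = """<!doctype html>
<html><body>
<noscript>This page requires JavaScript.</noscript>
<script>
  document.cookie = "js_ok=1; path=/";
  location.reload();
</script>
</body></html>"""

REAL_PAGE = "<!doctype html><html><body><p>Real content.</p></body></html>"


def choose_response(cookie_header):
    """Return the real page only if the challenge cookie is present."""
    cookies = SimpleCookie(cookie_header or "")
    passed = COOKIE_NAME in cookies and cookies[COOKIE_NAME].value == "1"
    return REAL_PAGE if passed else CHALLENGE_PAGE


class ChallengeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = choose_response(self.headers.get("Cookie")).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.send_header("Cache-Control", "no-store")  # keep caches out of the test
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Run the demo server until interrupted."""
    HTTPServer(("", port), ChallengeHandler).serve_forever()
```

Call `serve()` to try it locally. A browser sets the cookie, reloads, and gets the real page, while a fetcher that ignores scripts only ever sees the challenge.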
> Is it actually retrieving the page on the fly though?
They are able to do so.
> How do you know this?
The access logs.
> Even if it were - it’s not supposed to be able to.
There is a distinction between the data used to train a model, which is collected by the indexing bot with the custom user-agent string, and the user-query input given to that model. When you ask an AI a question, you normally type text into a form, and the text goes to the model, where the magic happens. In this scenario, instead of pasting a wall of text into a form, the text comes from a URL.
These forms of user input are equivalent in effect, yet distinct in how they arrive. It is therefore intellectually dishonest for the OP to claim the AI is indexing them when the OP is asking the AI to fetch their website to augment or add context to the question being asked.
To steel man this, even though I think the article did a fine job already, maybe the author could’ve changed the content on the page so you would know if they were serving a cached response.
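One way to do that, sketched in Python: put a random canary string on the page every time it is rendered, then check whether the answer you get back contains the canary from the render you just triggered. Both function names are hypothetical, and this assumes the assistant will repeat a short string from the page if asked to quote it.

```python
import secrets
from datetime import datetime, timezone


def render_canary_page():
    """Return (canary, html); the canary is different on every render."""
    canary = f"canary-{secrets.token_hex(8)}"
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    html = (
        "<!doctype html><html><body>"
        f"<p>Rendered at {stamp}.</p>"
        f"<p>Verification phrase: {canary}</p>"
        "</body></html>"
    )
    return canary, html


def answer_is_fresh(answer, canary):
    """True if the answer quotes the canary from the latest render."""
    return canary in answer
```

If the answer quotes an older canary, or none at all, you were likely looking at a cached or invented response and not a live fetch.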
Author here. The page I asked it to summarize was posted after I implemented all blocking on the server (and robots.txt). So they should not have had any cached data.