
> Perplexity should always respect robots.txt, even for summarization requests. If I say that I don't want Perplexity crawling my site, I mean at all

Issuing a single HTTP request is definitionally not crawling, and the robots.txt spec is specifically for crawlers, which this is not.

If you want a specific tool to exclude you from their web request feature, you have to talk to them about it. The web was designed to maximize interop between tools; it correctly doesn't have a mechanism for blacklisting specific tools from your site.



You are definitionally incorrect. From Wikipedia:

> robots.txt is the filename used for implementing the Robots Exclusion Protocol, a standard used by websites to indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit.

The original proposed specification at robotstxt.org/orig.html does talk about "recursive" behaviour, but its closing paragraph is about "which parts of their server should not be accessed".

> WWW Robots (also called wanderers or spiders) are programs that traverse many pages in the World Wide Web by recursively retrieving linked pages. For more information see the robots page.

> In 1993 and 1994 there have been occasions where robots have visited WWW servers where they weren't welcome for various reasons. Sometimes these reasons were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren't suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting).

> These incidents indicated the need for established mechanisms for WWW servers to indicate to robots which parts of their server should not be accessed. This standard addresses this need with an operational solution.

In the draft RFC at robotstxt.org/norobots-rfc.txt, the definition is a little stricter about "recursive", but it indicates that the use of heuristics and/or spacing requests out over time does not make a program any less a robot.

On robotstxt.org/faq/what.html, there is a paragraph:

> Normal Web browsers are not robots, because they are operated by a human, and don't automatically retrieve referenced documents (other than inline images).

One might argue that Perplexity's misbehaviour here happens "at the instruction" of a human, but Perplexity presents itself not as a web browser but as a data-processing entity, so it is clearly not a web browser.
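For what it's worth, the exclusion the site owner is asking for is exactly what the protocol was designed to express. A minimal sketch of such a robots.txt (assuming Perplexity's crawler identifies itself with the "PerplexityBot" user-agent it documents; the directive below is illustrative, not the site's actual file):

    User-agent: PerplexityBot
    Disallow: /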

Here's what would unequivocally be permitted, even on a site that blocks bad actors like Perplexity: a browser extension that used Perplexity's LLM to "summarize" (really, to shorten: https://ea.rna.nl/2024/05/27/when-chatgpt-summarises-it-actu...) the content of a page as you visit it, as long as that summary were not saved in Perplexity's data.


Every paragraph that you've included up there just reinforces my point.

The recursive behavior isn't incidental; it's literally part of the definition of a crawler. You can't just skip past that and pretend that the people who specifically included the word "recursive" (or the phrase "many pages") didn't really mean it.

The first paragraph of the two about access controls is the context for what "should not be accessed" means. It refers to "very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting)", which are pages that should not be indexed by search engines but for the most part shouldn't be a problem for something like perplexity. As I said in my comment, it's about search engine crawlers and indexers.

I'm glad that you at least cherry-picked a paragraph from that second page, because I was starting to worry that you weren't even reading your sources to check if they support your argument. That said, that paragraph means very little in support of your argument (it just gives one example of what isn't a robot, which doesn't imply that everything else is) and you're deliberately ignoring that that page is also very specific about the recursive nature of the robots that are being protected against.

Again, this is the definition that you just cited, which can't possibly include a single request from Perplexity's server (emphasis added):

> WWW Robots (also called wanderers or spiders) are programs that traverse *many pages* in the World Wide Web by *recursively retrieving linked pages*.

The only way you can possibly apply that definition to the behavior in TFA is if you delete most of it and just end up with "programs ... that traverse ... the WWW", at which point you've also included normal web browsers in your new definition.

It honestly just feels like you really have a lot of beef with LLM tech, which is fair, but there are much better arguments to be made against LLMs than "Perplexity's ad hoc requests are made by a crawler and should respect robots.txt". Your sources do not back up what you claim—on the contrary, they support my claim in every respect—so you should either find better sources or try a different argument.


Perplexity's ad hoc requests are still made by a crawler, whether you believe it or not. A web browser presents the content directly to the user. There may be extensions or features (reader mode) that modify the retrieved content in the browser, but Perplexity's summarization feature does not present the content directly to the user in any way.

It honestly just feels like you abandon critical thinking when it comes to LLM tech and want to pretend that an autonomous crawler that retrieves only a single page in order to process it isn't a crawler.

I have used, with the permission of the site owner, a crawler to retrieve data from a single URL on a scheduled basis. It is fully automated data retrieval not intended for direct user consumption; THAT is what makes it a crawler. If the page I was retrieving had been disallowed in `/robots.txt`, the site owner would expect an automated program not to pull the data. Recursiveness is not what makes a web robot; unattended and/or disconnected requests do.
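To make that concrete, here is a minimal sketch of what honoring robots.txt looks like for a single-URL fetcher like that one, using Python's standard urllib.robotparser (the host, path, and user-agent below are placeholders, not the actual site or tool):

    # Check robots.txt before an unattended, scheduled fetch of one URL.
    from urllib import robotparser, request

    AGENT = "ExampleFetcher/1.0"  # hypothetical user-agent string
    TARGET = "https://example.com/data/report.html"  # placeholder URL

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt

    if rp.can_fetch(AGENT, TARGET):
        # The rules allow this user-agent to access the page, so pull it.
        req = request.Request(TARGET, headers={"User-Agent": AGENT})
        with request.urlopen(req) as resp:
            body = resp.read()
    else:
        print("Disallowed by robots.txt; skipping fetch.")

No recursion anywhere, yet it is still exactly the kind of unattended access robots.txt exists to govern.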


You are inventing your own definition for a term that is widely understood and clearly and unambiguously defined in sources that you yourself cited. Since you can't engage honestly with your own sources I see no value in continuing this conversation.



