Stealthy Bots: Perplexity AI's Approach to Evading No-Crawl Directives

Cloudflare, a network security and performance company, has alleged that the AI search engine Perplexity uses undeclared crawlers and a range of other tactics to get around no-crawl instructions on websites. If the claims hold, Cloudflare says, Perplexity is flouting a web norm that crawlers have respected for roughly three decades.
Cloudflare said in a blog post that it had received complaints from customers who had tried to stop Perplexity's crawlers from scraping their sites, both by disallowing them in robots.txt files and by blocking them with web application firewall rules. Despite these measures, Perplexity reportedly continued to access the sites.
On further investigation, Cloudflare says it found that when Perplexity's declared crawlers were blocked by robots.txt files or firewall rules, undeclared crawlers took over and used several strategies to conceal their identity. Notably, these crawlers reportedly operated from IP addresses outside Perplexity's official range and rotated those addresses to sidestep restrictions. According to Cloudflare, the activity spanned more than 10,000 domains and millions of requests per day.
This technique allegedly subverts the Robots Exclusion Protocol, first proposed in 1994 and published as a standard (RFC 9309) by the Internet Engineering Task Force in 2022. The protocol tells web crawlers which parts of a site they are not welcome to access, using a plain text file named robots.txt served from the root of the site.
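To make the mechanism concrete, the following is a minimal sketch of how a compliant crawler might consult robots.txt rules before fetching a page, using Python's standard urllib.robotparser module. The crawler names ExampleBot and OtherBot, the example.com URLs, and the rules themselves are hypothetical and chosen only for illustration; they are not taken from Cloudflare's report or from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: one named crawler is barred from the
# whole site, and every other crawler is barred only from /private/.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the rules in robots_txt permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleBot", "https://example.com/articles/1"),
        ("OtherBot", "https://example.com/articles/1"),
        ("OtherBot", "https://example.com/private/notes"),
    ]:
        print(agent, url, is_allowed(agent, url))
```

The check is entirely voluntary: the rules are matched against whatever name the crawler declares for itself, so a crawler that presents a different user agent or simply skips the check is not constrained by the file. That is why site operators who want enforcement also rely on measures such as firewall rules.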
Reddit CEO Steve Huffman had previously criticized Perplexity, along with Microsoft and Anthropic, for treating content on the web as if it were freely theirs to use. Media outlets including Forbes and Wired have separately accused Perplexity of reproducing their content without authorization.
In response to these findings, Cloudflare says it is taking steps to block crawlers that flout these established norms. The company argues that web crawlers should be transparent, serve a clearly stated purpose, and, most importantly, honor website directives.
Perplexity representatives have not yet responded, but the issue remains pressing as the internet grapples with unprecedented questions about AI and content rights.