Perplexity AI Accused of Bypassing No-Crawl Directives

Perplexity, an AI-driven search engine, is using stealth tactics to bypass websites' no-crawl rules, according to network security provider Cloudflare. If accurate, the behavior would violate Internet norms that have governed web crawling for more than three decades.

The complaints came from Cloudflare customers who had tried to block Perplexity's scraping bots with robots.txt files and web application firewall rules. Despite these measures, Perplexity reportedly continued to access their content. Cloudflare's investigation found that when its declared crawlers were blocked, Perplexity switched to undeclared stealth bots to obscure its activity.

According to Cloudflare's researchers, these undeclared crawlers used IP addresses outside Perplexity's official range and rotated through them to circumvent blocks. The activity spanned more than 10,000 domains and involved millions of requests per day.
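One way a site operator might detect traffic like this is to check whether a request that looks like a known crawler actually originates from the address ranges its operator publishes. The following is a minimal sketch of that check, not Cloudflare's method. The ranges shown are reserved documentation addresses used as placeholders, and the function name is hypothetical.

```python
import ipaddress

# Placeholder ranges (reserved documentation blocks), standing in for
# the ranges a crawler operator would publish. These are assumptions.
PUBLISHED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def in_published_ranges(ip: str) -> bool:
    """Return True if the address falls inside any published range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in PUBLISHED_RANGES)


if __name__ == "__main__":
    for ip in ("192.0.2.17", "203.0.113.5"):
        print(ip, in_published_ranges(ip))
```

A check like this only flags requests from outside the declared ranges. It cannot by itself attribute that traffic to a particular operator, which is why an investigation would need additional signals.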

Such behavior contravenes the Robots Exclusion Protocol, introduced in 1994 and formalized by the Internet Engineering Task Force in 2022. The protocol gives site owners a standardized way to tell crawlers which parts of a site they are not permitted to access.
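To illustrate how the protocol is meant to work, here is a minimal sketch of the check a compliant crawler would perform before fetching a page, using Python's standard `urllib.robotparser` module. The robots.txt content, the `ExampleBot` and `OtherBot` user agents, and the `example.com` URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: bans one named crawler from the whole site
# and keeps every other crawler out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = [
        ("ExampleBot", "https://example.com/article"),
        ("OtherBot", "https://example.com/article"),
        ("OtherBot", "https://example.com/private/page"),
    ]
    for agent, url in checks:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

Compliance is voluntary: the file only states the site owner's wishes, and nothing in the protocol prevents a crawler from ignoring it or identifying itself under a different user agent.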

Cloudflare is not alone in its complaints. Steve Huffman, the CEO of Reddit, has said that blocking Perplexity, along with AI crawlers from Microsoft and Anthropic, has been a significant challenge. Other publishers, including Forbes and Wired, have made similar accusations against Perplexity, alleging plagiarism and non-compliance with robots.txt exclusions.

In response to these findings, Cloudflare is taking steps to block unauthorized crawlers, emphasizing the need for transparency and adherence to website directives in crawling activities. Perplexity has yet to respond to these allegations.
