AI Crawlers and Stealth Tactics in Web Security

Perplexity, an AI-powered search engine, is facing allegations from Cloudflare, a prominent network security service. Cloudflare claims that Perplexity is using stealth bots to bypass websites' no-crawl directives. If confirmed, this behavior would violate long-standing Internet norms.
Cloudflare's research describes cases in which Perplexity's bots kept crawling sites that had disallowed them through robots.txt files and web application firewall rules. The team tested these claims and found that when Perplexity's known crawlers were blocked, stealth bots came into play. These bots used several tactics to disguise their activity.
The alleged behavior is notable for its scale: more than 10,000 domains and millions of requests per day. The researchers observed that the undeclared crawlers used multiple IP addresses outside Perplexity's acknowledged IP range and rotated them frequently in response to blocking. Requests also appeared to originate from varying ASNs (autonomous system numbers), which added another layer of evasion.
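To make the "outside the acknowledged IP range" observation concrete, here is a minimal Python sketch of that kind of membership check. It is not Cloudflare's detection method, which the article does not describe in detail. The ranges in DECLARED_RANGES are placeholder documentation networks rather than Perplexity's real ranges, and the function names are hypothetical.

```python
import ipaddress

# Hypothetical "declared" crawler ranges. These are reserved documentation
# networks (RFC 5737), used here only as stand-ins for a published list.
DECLARED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_declared(ip: str) -> bool:
    """Return True if the address falls inside any declared range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in DECLARED_RANGES)


def undeclared_sources(request_ips):
    """Return the distinct source IPs that are outside every declared range."""
    return sorted({ip for ip in request_ips if not is_declared(ip)})
```

For example, given request logs from the addresses 192.0.2.10 and 203.0.113.5, only the second would be reported as undeclared. A membership check like this cannot attribute traffic to a particular operator on its own; it only shows which requests came from outside a published list.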
The alleged actions directly contravene the Robots Exclusion Protocol, first proposed in 1994. The protocol lets site owners tell crawlers, through a simple robots.txt file, which areas of a site they should not access. Over the decades it has become a widely followed norm, although compliance is voluntary.
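As an illustration of how a compliant crawler consults robots.txt, the following Python sketch uses the standard library's urllib.robotparser. The robots.txt content, the crawler name ExampleBot, and the example.com URLs are all made up for the example and do not come from any site mentioned here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is barred from the whole site,
# and every other crawler is barred only from /private/.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""

# Parse the rules from text instead of fetching them over the network.
parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)
```

Here is_allowed("ExampleBot", "https://example.com/page") returns False, while a crawler with a different name may fetch the same page but not anything under /private/. The check runs on the crawler's side, which is why the protocol depends on crawlers identifying themselves honestly and choosing to obey the answer.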
This isn't the first time Perplexity has faced scrutiny. Others, including Reddit, have previously accused Perplexity of failing to comply with Internet norms. Reddit’s CEO highlighted the ongoing challenge these AI engines pose, emphasizing their presumption that online content is free for them to use.
Adding to Perplexity's woes, notable publishers have accused it of plagiarizing their content. Not long ago, Forbes accused Perplexity of republishing content closely resembling one of its articles. Wired voiced similar concerns, noting that IP addresses likely linked to Perplexity were ignoring the standard exclusions in place.
In response, Cloudflare is taking steps to limit such stealth crawlers' access to websites that use its services, and it is implementing new rules to block these undeclared bots.
Perplexity’s representatives have not yet commented on the allegations. The situation underscores the ongoing debate over AI ethics and web security.