Archived posting to the Leica Users Group, 2025/07/31
Following up on a discussion of AI web crawlers: there was a segment on the podcast ATP this week about how web crawlers have severely disrupted many sites, small and large. There is a transcript available through Apple's Podcast app. CloudFlare has been putting mechanisms into place to help curtail this, since the harvesting ignores sites' robots.txt files. (Of course - because: PROFIT) I'll do a bit of quoting from that transcript:

=======

"Yep. Aaron Zinck writes: I'm a consultant who works with many large organizations and government websites. We've seen unprecedented levels of bot crawling across all of our sites. These are aggressive bots that bring our sites to their knees despite our use of caching. Any site that has dynamic functionality is particularly vulnerable, because there are nearly infinite combinations of parameters and settings that users can choose. Caching is of little to no benefit when a bot decides to go deep, exploring every possible combination. Some of the bots we've been seeing have been extremely tricky to identify. In these cases, CloudFlare has been our most effective tool by far. Their proprietary heuristics have been the only thing that has enabled us to keep our sites online."

"The real bad guy here, in my opinion, is not CloudFlare, but the aggressive AI companies who send DDoS levels of traffic, don't honor robots.txt files, and make it nearly impossible to identify and block them selectively. I want to reiterate: the recent traffic levels have been truly historic."

"Additionally, Anonymous writes: When I worked at GitHub, one of the biggest technical challenges we had was the explosive growth of AI crawlers against the site. Over the course of about six months, we saw traffic increase by about 40 percent due to AI scrapers. This regularly threatened to take the site down. Intuitively, you'd think it would be a few companies doing this sort of scraping, but it's effectively every AI startup, plus everyone who wants to sell datasets to them."

=======
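A side note on the robots.txt point: opting out of these crawlers is supposed to be as simple as a few lines in a site's robots.txt. A minimal example follows; GPTBot, ClaudeBot and CCBot are the user-agent names published by OpenAI, Anthropic and Common Crawl respectively (other crawlers use other names, which you'd have to look up for your own situation):

    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

The whole complaint in the quotes above is that many scrapers simply ignore these directives, or show up under user agents that are hard to pin down, which is why sites end up leaning on CloudFlare's heuristics instead.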
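And to make the caching point concrete, here's a quick back-of-the-envelope sketch in Python. The facet names, counts and URL are my own hypothetical numbers, not anything from the transcript. A URL-keyed cache only helps when the same URL is requested twice; even a modest faceted-search page produces thousands of distinct URLs, so a crawler that walks every combination misses the cache on essentially every request:

    from itertools import product
    from urllib.parse import urlencode

    # Hypothetical query parameters for a small faceted-search page.
    facets = {
        "category": ["lenses", "bodies", "accessories", "film"],
        "sort":     ["price", "newest", "rating"],
        "per_page": ["20", "50", "100"],
        "page":     [str(n) for n in range(1, 51)],   # 50 result pages
        "in_stock": ["0", "1"],
    }

    # Count the distinct URLs a crawler can generate from these facets alone.
    combinations = 1
    for values in facets.values():
        combinations *= len(values)
    print(f"distinct cacheable URLs: {combinations:,}")   # 4 * 3 * 3 * 50 * 2 = 3,600

    # The first few URLs such a crawler would request, one cache miss each.
    for combo in list(product(*facets.values()))[:3]:
        print("https://example.com/search?" + urlencode(dict(zip(facets, combo))))

Add a free-text search box or a date range and the URL space is effectively unbounded, which is exactly the "nearly infinite combinations" problem the consultant describes.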