Archived posting to the Leica Users Group, 2025/07/31

Subject: [Leica] On bot crawlers (techie, nerdy, above my pay grade stuff but still interesting)
From: abridge at mac.com (Adam Bridge)
Date: Thu, 31 Jul 2025 08:54:56 -0700

Following up on a discussion of AI web crawlers …

There was a discussion on the podcast ATP this week about how AI web crawlers 
have severely disrupted many sites, small and large. There is a transcript 
available through Apple's Podcasts app.

Cloudflare has been putting mechanisms in place to help curtail this, since 
the harvesting ignores robots.txt files. (Of course - because: PROFIT)
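For anyone unfamiliar: robots.txt is just a plain-text file at a site's root that asks crawlers to stay away from some or all paths - compliance is entirely voluntary, which is the whole problem being described here. A minimal sketch (GPTBot is OpenAI's published crawler name; the delay value is made up for illustration):

```
# robots.txt - advisory only; polite crawlers honor it, scrapers ignore it
User-agent: GPTBot
Disallow: /

# Ask all other crawlers to slow down (nonstandard directive, unevenly honored)
User-agent: *
Crawl-delay: 10
```

Since nothing enforces this file, blocking ultimately falls to network-level tools like the Cloudflare heuristics mentioned below.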

I'll do a bit of quoting from that transcript:

=======
"Yep. Aaron Zinck writes: I'm a consultant who works with many large 
organizations and government websites. We've seen unprecedented levels of 
bot crawling across all of our sites.

These are aggressive bots that bring our sites to their knees despite us 
using caching. Any site that has dynamic functionality is particularly 
vulnerable to the problem because there are nearly infinite combinations of 
parameters and settings that users can choose. Caching is little to no 
benefit when a bot decides to go deep exploring every possible combination.

Some of the bots we've been seeing have been extremely tricky to identify. 
In these cases, CloudFlare has been our most effective tool by far. Their 
proprietary heuristics have been the only things that have enabled us to 
keep our sites online."

"The real bad guy here, in my opinion, is not CloudFlare but the aggressive AI 
companies, who send DDoS levels of traffic, don't honor robots.txt files, and 
make it nearly impossible to identify and block them selectively. I want to 
reiterate: the recent traffic levels have been truly historic. Additionally, 
Anonymous writes: When I worked at GitHub, one of the biggest technical 
challenges we had was the explosive growth of AI crawlers against the site.

Over the course of about six months, we saw traffic increase by about 40 
percent due to AI scrapers. This regularly threatened to take the site down. 
Intuitively, you'd think it would be a few companies doing this sort of 
scraping, but it's effectively every AI startup, plus everyone who wants to 
sell datasets to them."
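The caching point quoted above is easy to see with a quick back-of-the-envelope calculation: when a page takes several independent query parameters, every combination is a distinct URL, hence a distinct cache key, so a crawler that walks every combination gets almost no cache hits. The parameter names and counts below are invented for illustration:

```python
from math import prod

# Hypothetical dynamic page with independent query parameters
# (filters, sort orders, page sizes...). Each entry is the number
# of values a crawler could choose for that parameter.
param_choices = {
    "category": 20,
    "sort": 6,
    "page_size": 4,
    "color": 12,
    "in_stock": 2,
}

# Every combination is a distinct URL, so a distinct cache entry.
distinct_urls = prod(param_choices.values())
print(distinct_urls)  # 20 * 6 * 4 * 12 * 2 = 11520
```

Five modest parameters already yield over eleven thousand distinct URLs; add free-text search or pagination and the space is effectively infinite, which is why the consultant quoted above says caching is little help once a bot "goes deep."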