
The pages getting most of the bot action are search and product details.

Search results obviously can't be cached, since each query is completely ad hoc.

Product details can't be cached either; more precisely, there are parts of each product page that can't be cached because:

* different customers have different products in the catalog

* different customers have different prices for a given product

* different products have customer-specific aliases

* there's a huge number of products (low millions) and many thousands of distinct catalogs (many customers have effectively identical catalogs, and we've already got logic that collapses those in the backend)

* prices are also based on costs from upstream suppliers, which are themselves changing dynamically.

Putting all this together, the number of times a given [product,customer] tuple will be requested within a reasonable cache TTL isn't very much greater than 1. The exception is walk-up pricing for non-contract users, and we've been talking about how we might optimize that particular case.
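
To put rough numbers on that, here's a back-of-envelope sketch in Go. The request rate and TTL below are purely illustrative assumptions; only the key-space sizes follow the "low millions of products" and "many thousands of catalogs" figures above:

    package main

    import "fmt"

    func main() {
        products := 2_000_000.0         // "low millions" of products
        catalogs := 5_000.0             // "many thousands" of distinct catalogs, post-collapse
        keySpace := products * catalogs // distinct [product,customer] cache keys: ~1e10

        reqPerSec := 1_000.0 // hypothetical page-view rate, not a real figure
        ttl := 300.0         // hypothetical 5-minute cache TTL, in seconds

        // Expected requests per key within one TTL window, assuming uniform access.
        fmt.Printf("requests per key per TTL: %g\n", reqPerSec*ttl/keySpace)
        // Prints 3e-05: even strong skew toward popular products only lifts the
        // hottest keys to around 1 request per TTL, so nearly every lookup misses.
    }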



Ahhhhh, search results make a whole lot more sense! Thank you. Search can't be cached, and the people who want to use your search functionality as a high-availability API endpoint use different IP addresses to get around rate limiting.

The low millions of products also makes some sense, I suppose, but it's hard to imagine why this doesn't simply require a login for a customer to see their products, if the catalogs are unique to each customer.

On the other hand, I suspect the price this company is paying to mitigate scrapers is akin to a drop of water in the ocean, no? As a percentage of the development budget it might seem high, and therefore loom large to the developer, but I suspect the CEO of the company doesn't even know that scrapers are scraping the site. Maybe I'm wrong.

Thanks again for the multiple explanations in any case, it opened my eyes to a way scrapers could be problematic that I hadn't thought about.


Good explanation, thank you.

I would think that artificially slowing down search results could discourage some of the bots. Humans don't care much if a search finishes in 5 seconds instead of 2, AFAIK.

Especially on backends where each request is relatively cheap operations-wise (in particular when each request is a green thread, as in Erlang/Elixir), I think you can score a win against the bots.
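
For illustration, a minimal sketch of that idea in Go, with goroutines standing in for Erlang/Elixir green threads; the route and the 3-second delay are made-up placeholders:

    package main

    import (
        "net/http"
        "time"
    )

    // withDelay parks each request's goroutine for d before answering.
    // The wait costs a few KB of goroutine stack, not CPU time.
    func withDelay(d time.Duration, next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            time.Sleep(d)
            next.ServeHTTP(w, r)
        })
    }

    func main() {
        search := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("search results...")) // stand-in for the real search handler
        })
        http.Handle("/search", withDelay(3*time.Second, search))
        http.ListenAndServe(":8080", nil)
    }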

Have you attempted something like this?


This is really interesting, but they're using a network of bots already - even if you put a spinner that makes them wait a couple of seconds, the scrapers would just make more parallel requests, no?


Yes, they absolutely will, but that's the strength of certain runtimes: green threads (i.e. HTTP request/response sessions in this case) cost almost nothing, so you can easily hold onto 5-10 million of them on a VPS with 16-32 GB RAM.
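
A quick way to sanity-check the "cost almost nothing" part, sketched in Go with goroutines as a stand-in for BEAM processes (the one-million count and the 10-second park are arbitrary):

    package main

    import (
        "fmt"
        "runtime"
        "sync"
        "time"
    )

    func main() {
        const n = 1_000_000 // a million parked "sessions"

        var wg sync.WaitGroup
        wg.Add(n)
        for i := 0; i < n; i++ {
            go func() {
                defer wg.Done()
                time.Sleep(10 * time.Second) // just wait, like a deliberately slow response
            }()
        }

        time.Sleep(time.Second) // give the goroutines a moment to start
        var m runtime.MemStats
        runtime.ReadMemStats(&m)
        fmt.Printf("goroutines: %d, memory from OS: ~%d MB\n",
            runtime.NumGoroutine(), m.Sys/1024/1024)
        wg.Wait()
        // A few KB per goroutine, i.e. single-digit GBs for a million of them;
        // that's what makes holding millions of slow connections cheap.
    }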

I haven't had to defend against extensive bot scraping operations -- only against simpler ones -- but I've used this practice in my admittedly much more limited experience, and it was actually successful. Not that the bots gave up, but their authors realized they couldn't accelerate the scraping, so they dialed down their instances, likely to save money on their own hosting bills. Win-win.

Apologies, I don't mean to lecture you; I'm just sharing a small piece of experience. Granted, it's very specific to the backend tech, but what the heck, maybe you'll find the tidbit valuable.



