
The pages getting most of the bot action are search and product details.

Search results obviously can't be cached, since each query is completely ad hoc.

Product details can't be cached either; more precisely, there are parts of each product page that can't be cached because:

* different customers have different products in the catalog

* different customers have different prices for a given product

* different products have customer-specific aliases

* there's a huge number of products (low millions) and many thousands of distinct catalogs (many customers have effectively identical catalogs, and we've already got logic that collapses those in the backend)

* prices are also based on costs from upstream suppliers, which are themselves changing dynamically.

Putting all this together, the number of times a given [product,customer] tuple will be requested within a reasonable cache TTL isn't very much greater than 1. The exception is walk-up pricing for non-contract users, and we've been talking about how we might optimize that particular case.
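
To put rough numbers on that, here's a back-of-envelope sketch in Go. The request rate and TTL below are purely illustrative assumptions; only the key-space sizes follow the "low millions of products" and "many thousands of catalogs" figures above:

    package main

    import "fmt"

    func main() {
        products := 2_000_000.0         // "low millions" of products
        catalogs := 5_000.0             // "many thousands" of distinct catalogs, post-collapse
        keySpace := products * catalogs // distinct [product,customer] cache keys: ~1e10

        reqPerSec := 1_000.0 // hypothetical page-view rate, not a real figure
        ttl := 300.0         // hypothetical 5-minute cache TTL, in seconds

        // Expected requests per key within one TTL window, assuming uniform access.
        fmt.Printf("requests per key per TTL: %g\n", reqPerSec*ttl/keySpace)
        // Prints 3e-05: even strong skew toward popular products only lifts the
        // hottest keys to around 1 request per TTL, so nearly every lookup misses.
    }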



Ahhhhh, search results make a whole lot more sense! Thank you. Search can't be cached, and the people who want to use your search functionality as a high-availability API endpoint use different IP addresses to get around rate limiting.

The low millions of products also makes some sense, I suppose, but it's hard to imagine why this doesn't simply require a login for a customer to see their products, if the catalogs are unique to each customer.

On the other hand, I suspect the price this company is paying to mitigate scrapers is akin to a drop of water in the ocean, no? As a percentage of the development budget it might seem high, and therefore loom large to the developer, but I suspect the CEO of the company doesn't even know that scrapers are scraping the site. Maybe I'm wrong.

Thanks again for the multiple explanations in any case, it opened my eyes to a way scrapers could be problematic that I hadn't thought about.


Good explanation, thank you.

I would think that artificially slowing down search results could discourage some of the bots. Humans don't care much if a search finishes in 5 seconds instead of 2, AFAIK.

Especially on backends where each request is relatively cheap operations-wise (in particular when each request is a green thread, as in Erlang/Elixir), I think you can score a win against the bots.
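
For illustration, a minimal sketch of that idea in Go, with goroutines standing in for Erlang/Elixir green threads; the route and the 3-second delay are made-up placeholders:

    package main

    import (
        "net/http"
        "time"
    )

    // withDelay parks each request's goroutine for d before answering.
    // The wait costs a few KB of goroutine stack, not CPU time.
    func withDelay(d time.Duration, next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            time.Sleep(d)
            next.ServeHTTP(w, r)
        })
    }

    func main() {
        search := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("search results...")) // stand-in for the real search handler
        })
        http.Handle("/search", withDelay(3*time.Second, search))
        http.ListenAndServe(":8080", nil)
    }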

Have you attempted something like this?


This is really interesting, but they're using a network of bots already - even if you put a spinner that makes them wait a couple of seconds, the scrapers would just make more parallel requests, no?


Yes, they absolutely will, but that's the strength of certain runtimes: green threads (i.e. HTTP request/response sessions in this case) cost almost nothing, so you can easily hold onto 5-10 million of them on a VPS with 16-32 GB RAM.
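
A quick way to sanity-check the "cost almost nothing" part, sketched in Go with goroutines as a stand-in for BEAM processes (the one-million count and the 10-second park are arbitrary):

    package main

    import (
        "fmt"
        "runtime"
        "sync"
        "time"
    )

    func main() {
        const n = 1_000_000 // a million parked "sessions"

        var wg sync.WaitGroup
        wg.Add(n)
        for i := 0; i < n; i++ {
            go func() {
                defer wg.Done()
                time.Sleep(10 * time.Second) // just wait, like a deliberately slow response
            }()
        }

        time.Sleep(time.Second) // give the goroutines a moment to start
        var m runtime.MemStats
        runtime.ReadMemStats(&m)
        fmt.Printf("goroutines: %d, memory from OS: ~%d MB\n",
            runtime.NumGoroutine(), m.Sys/1024/1024)
        wg.Wait()
        // A few KB per goroutine, i.e. single-digit GBs for a million of them;
        // that's what makes holding millions of slow connections cheap.
    }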

I haven't had to defend against extensive bot scraping operations -- only against simpler ones -- but I've used this practice in my admittedly much more limited experience, and it was actually successful. Not that the bots gave up, but their authors realized they couldn't accelerate the scraping, so they dialed down their instances, likely to save money on their own hosting bills. Win-win.

Apologies, I don't mean to lecture you; I'm just sharing a small piece of experience. Granted, it's very specific to the backend tech, but what the heck, maybe you'll find the tidbit valuable.



