
I loved the graphics!

However, I'm not sure I understand the statistical soundness of this approach. I get that every log during a given period has the same chance to be included, but doesn't that mean that logs that happen during "slow periods" are disproportionately overrepresented in overall metrics?

For example, suppose I want to optimize my code and need to know which endpoints are using the most time across the entire fleet, so I can reduce my total costs (CPU-seconds or whatever). This would be an inappropriate method to use, since endpoints that get bursty traffic would be disproportionately underrepresented compared to endpoints that get steady, constant traffic. I'd end up wasting my time on endpoints that don't actually get a lot of traffic.

Or if I'm trying to plan capacity for different services and want to know how many nodes to run for each one, services that get bursty traffic would be underrepresented as well, correct?

What are the use-cases that reservoir sampling is good for? What kind of statistical analysis can you do on the data that's returned by it?



> However, I'm not sure I understand the statistical soundness of this approach. I get that every log during a given period has the same chance to be included, but doesn't that mean that logs that happen during "slow periods" are disproportionately overrepresented in overall metrics?

Yes, of course.

You can fix this problem, however. There are (at least) two ways:

You can use an alternative interpretation and implementation of reservoir sampling: for each item, you generate and store a random priority as it comes into the system. For each interval (e.g. each second) you keep the top k items by priority. If you want to aggregate multiple intervals, you keep the top k (or fewer) items over those intervals.

This automatically treats all items the same, whether they arrived during busy or quiet periods.
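Here's a minimal sketch of that priority-based variant (the function names and the use of `heapq` are my illustration, not from the article):

    import heapq
    import random

    def sample_interval(items, k):
        """Tag each item with an i.i.d. uniform priority and keep the
        top k. Returning (priority, item) pairs lets samples from
        different intervals be merged later."""
        tagged = [(random.random(), item) for item in items]
        return heapq.nlargest(k, tagged, key=lambda pair: pair[0])

    def merge_samples(samples, k):
        """Combine per-interval samples by keeping the top k priorities
        overall. Every item got the same kind of priority, busy interval
        or not, so the result is a uniform sample of everything seen."""
        combined = [pair for sample in samples for pair in sample]
        return heapq.nlargest(k, combined, key=lambda pair: pair[0])

    # A busy second and a quiet second contribute in proportion to how
    # many items each actually saw, not one reservoir each.
    busy = sample_interval(range(100_000), k=10)
    quiet = sample_interval(range(100_000, 100_005), k=10)
    print([item for _, item in merge_samples([busy, quiet], k=10)])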

An alternative view of the same approach stores no priorities, but instead records the number of dropped items each interval. You can then do some arithmetic to work out how to combine samples from different intervals, very similar to what's in the article.
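A rough sketch of that bookkeeping, assuming each interval hands you its reservoir plus a count of what it dropped (the helper name and the with-replacement resampling are simplifications of mine):

    import random

    def merge_weighted(interval_samples, k):
        """Combine per-interval reservoirs into one sample of size k.

        interval_samples: list of (sample, dropped) pairs, where `sample`
        is one interval's reservoir and `dropped` is how many items that
        interval threw away.

        Each survivor stands in for (len(sample) + dropped) / len(sample)
        of the original items, so resampling with those weights undoes
        the per-interval bias.
        """
        items, weights = [], []
        for sample, dropped in interval_samples:
            weight = (len(sample) + dropped) / len(sample)
            for item in sample:
                items.append(item)
                weights.append(weight)
        # random.choices samples with replacement; fine as a sketch when
        # k is small relative to the underlying data.
        return random.choices(items, weights=weights, k=k)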

> What are the use-cases that reservoir sampling is good for? What kind of statistical analysis can you do on the data that's returned by it?

Anything you can do on any unbiased sample? Or are you talking about the specific variant in the article where you do reservoir sampling afresh each second?


Good question. I'm not sure how suitable what remains would be for statistical analysis. You'd likely want to try to aggregate at source, so you're considering all the data and then only sending up aggregates to save on space/bandwidth (if you were at the sort of scale that required it).

The use-case I chose in the post was more focused on protecting some centralised service while making sure that when you do throw things away, you're not doing it in a way that creates blind spots (e.g. you pick a rate limit of N per minute, your traffic is inherently bursty around the top of the minute, and you never see logs for anything in the tail end of the minute).
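To make that blind spot concrete, here's a rough sketch contrasting the two policies on per-minute buckets (the function names and bucketing are my illustration, not code from the post; events are assumed sorted by timestamp):

    import random

    def rate_limited(events, limit_per_min):
        """Naive rate limit: keep the first `limit_per_min` events of
        each minute. With bursty top-of-minute traffic, the tail of
        the minute is never sampled."""
        kept, minute, count = [], None, 0
        for ts, event in events:
            if ts // 60 != minute:
                minute, count = ts // 60, 0
            if count < limit_per_min:
                kept.append((ts, event))
                count += 1
        return kept

    def reservoir_per_minute(events, k):
        """Reservoir-sample each minute instead: every event in the
        minute, head or tail, survives with probability
        k / (events seen that minute)."""
        kept, minute, seen, reservoir = [], None, 0, []
        for ts, event in events:
            if ts // 60 != minute:
                kept.extend(reservoir)
                minute, seen, reservoir = ts // 60, 0, []
            seen += 1
            if len(reservoir) < k:
                reservoir.append((ts, event))
            else:
                j = random.randrange(seen)
                if j < k:
                    reservoir[j] = (ts, event)
        kept.extend(reservoir)
        return kept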

A fun recent use-case you might have seen was in https://onemillionchessboards.com. Nolen uses reservoir sampling to maintain a list of boards with recent activity. I believe he is in the process of doing a technical write-up that'll go into more detail.



