
Horrid advice at the end about logging every error, exception, slow request, etc., if you are sampling healthy requests.

Taking slow requests as an example, a dependency gets slower and now your log volume suddenly goes up 100x. Can your service handle that? Are you causing a cascading outage due to increased log volumes?

Recovery is easier if your service is doing the same or less work in a degraded state. Increasing logging by 20-100x when degraded is not that.



What we're doing at Cloudflare (including some of what the author works on) samples adaptively. Each log batch is bucketed based on a few fields, and if a bucket contains a lot of logs we only keep the sqrt or log of the number of input logs. It works really well... but part of why it works well is that we always have blistering rates of logs, so we can cope with spikes in event rates without the sampling system itself getting overwhelmed.
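
The gist of it, as a minimal Go sketch (the field names, bucket key, and sqrt policy here are illustrative, not our actual implementation):

    package main

    import (
        "fmt"
        "math"
        "math/rand"
    )

    // LogEntry is a hypothetical record; real pipelines carry many more fields.
    type LogEntry struct {
        Service string
        Status  int
        Message string
    }

    // bucketKey groups entries by a few fields; which fields to use is a policy choice.
    func bucketKey(e LogEntry) string {
        return fmt.Sprintf("%s/%d", e.Service, e.Status)
    }

    // sampleBatch keeps roughly sqrt(n) entries per bucket, so a 100x spike in
    // one bucket only produces ~10x more output from that bucket.
    func sampleBatch(batch []LogEntry) []LogEntry {
        buckets := make(map[string][]LogEntry)
        for _, e := range batch {
            buckets[bucketKey(e)] = append(buckets[bucketKey(e)], e)
        }

        var out []LogEntry
        for _, entries := range buckets {
            n := len(entries)
            keep := int(math.Ceil(math.Sqrt(float64(n))))
            // Pick `keep` entries uniformly; a real system would also record the
            // sampling rate so counts can be re-scaled at query time.
            rand.Shuffle(n, func(i, j int) { entries[i], entries[j] = entries[j], entries[i] })
            out = append(out, entries[:keep]...)
        }
        return out
    }

    func main() {
        var batch []LogEntry
        for i := 0; i < 10000; i++ {
            batch = append(batch, LogEntry{Service: "api", Status: 200, Message: "ok"})
        }
        for i := 0; i < 50; i++ {
            batch = append(batch, LogEntry{Service: "api", Status: 500, Message: "upstream timeout"})
        }
        fmt.Printf("kept %d of %d entries\n", len(sampleBatch(batch)), len(batch))
    }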


It’s an important architectural requirement for a production service to be able to scale out its log ingestion capacity to meet demand.

Besides, a little local on-disk buffering goes a long way, and is cheap to boot. It’s an antipattern to flush logs directly over the network.


And everything in the logging path, from the API to the network to the ingestion pipeline, needs to be best effort - configure a capacity and ruthlessly drop messages as needed, at all stages. Actually a nice case for UDP :)
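
Something like this, as a rough Go sketch of the bounded-buffer, drop-on-overflow idea (the logger type is made up, not any particular library):

    package main

    import (
        "fmt"
        "time"
    )

    // BestEffortLogger buffers up to `capacity` messages and drops anything beyond
    // that, rather than blocking the caller or growing memory without bound.
    type BestEffortLogger struct {
        ch      chan string
        dropped uint64 // incremented only from the calling goroutine here; use sync/atomic if shared
    }

    func NewBestEffortLogger(capacity int) *BestEffortLogger {
        l := &BestEffortLogger{ch: make(chan string, capacity)}
        go l.ship()
        return l
    }

    // Log never blocks: if the buffer is full, the message is dropped and counted.
    func (l *BestEffortLogger) Log(msg string) {
        select {
        case l.ch <- msg:
        default:
            l.dropped++
        }
    }

    // ship simulates forwarding to the next stage (network, relay, etc.).
    func (l *BestEffortLogger) ship() {
        for msg := range l.ch {
            _ = msg                          // send over UDP / to a relay here
            time.Sleep(1 * time.Millisecond) // pretend the downstream is slow
        }
    }

    func main() {
        l := NewBestEffortLogger(1000)
        for i := 0; i < 100000; i++ {
            l.Log(fmt.Sprintf("request %d", i))
        }
        fmt.Printf("dropped %d messages under load\n", l.dropped)
    }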


It depends. Some cases like auditing require full fidelity. Others don’t.

Plus, if you’re offering a logging service to a customer, the customer’s expectation is that once logs are successfully ingested, your service doesn’t drop them. If you’re violating that expectation, it needs to be clearly communicated to and assented to by the customer.


1. Those ingested logs are not logs for you, they are customer payload, which is business critical; 2. I've yet to see a Logging as a Service provider not have outages where data was lost or severely delayed. Also, the alternative to best effort/shedding excess load isn't 100% availability, it's catastrophic failure when capacity is reached.

Auditing has the requirement that events are mostly not lost, but most importantly that they can't be deleted by people on the host. And for the capacity side, again the design question is "what happens when incoming events exceed our current capacity - do all the collectors/relays balloon their memory and become much, much slower (effectively unresponsive), or do they immediately close incoming sockets, lower downstream timeouts, and so on?" Hopefully the audit traffic is consistent enough that you don't get spikes and can over-provision with confidence.


> those ingested logs are not logs for you, they are customer payload, which is business critical

Why does that make any difference? Keep in mind that at large enough organizations, even though the company is the same, there will often be an internal observability service team (frequently, but not always, as part of a larger platform team). At a highly-functioning org, this team is run very much like an external service provider.

> I've yet to see a Logging as a Service provider not have outages where data was lost or severely delayed.

You should take a look at CloudWatch Logs. I'm unaware of any time in its 17-year history that it has successfully ingested logs and subsequently lost or corrupted them. (Disclaimer: I work for AWS.) Also, I didn't say anything about delays, which we often accept as a tradeoff for durability.

> And for the capacity side, again the design question is "what happens when incoming events exceed our current capacity - do all the collectors/relays balloon their memory and become much, much slower (effectively unresponsive), or do they immediately close incoming sockets, lower downstream timeouts, and so on?"

This is one of the many reasons why buffering outgoing logs in memory is an anti-pattern, as I noted earlier in this thread. There should always -- always -- be some sort of non-volatile storage buffer in between a sender and remote receiver. It’s not just about resilience against backpressure; it also means you won’t lose logs if your application or machine crashes. Disk is cheap. Use it.
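
For illustration, a bare-bones Go sketch of the spool-to-disk pattern (the spool path and the forward stub are made up; real shippers like fluent-bit or vector handle rotation, checkpoints, and retries for you):

    package main

    import (
        "bufio"
        "fmt"
        "os"
    )

    // spoolPath is a stand-in; real deployments would use something like /var/spool/<app>/.
    const spoolPath = "myapp-logs.spool"

    // writeLog appends a line to the on-disk spool. Even if the process or the
    // network dies right after this returns, the entry survives on disk.
    func writeLog(line string) error {
        f, err := os.OpenFile(spoolPath, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
        if err != nil {
            return err
        }
        defer f.Close()
        if _, err := f.WriteString(line + "\n"); err != nil {
            return err
        }
        // f.Sync() here trades throughput for durability against machine crashes.
        return nil
    }

    // shipSpool is run by a separate shipper (or goroutine): it reads the spool and
    // forwards entries to the remote receiver at whatever pace the network allows.
    func shipSpool(forward func(string) error) error {
        f, err := os.Open(spoolPath)
        if err != nil {
            return err
        }
        defer f.Close()
        sc := bufio.NewScanner(f)
        for sc.Scan() {
            if err := forward(sc.Text()); err != nil {
                return err // stop and retry later; the data is still on disk
            }
        }
        return sc.Err()
    }

    func main() {
        _ = writeLog(`{"level":"error","msg":"dependency timeout"}`)
        _ = shipSpool(func(line string) error {
            fmt.Println("would send:", line) // stand-in for the network call
            return nil
        })
    }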


Yea that was my thought too. I like the idea in principle, but these magic thresholds can really bite you. It claims to be p99, probably based on some historical measurement, but that's only true if it's updated dynamically. Maybe this could periodically query the OTel provider for the real number, to at least limit the time window in which something bad can happen.
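
Something along these lines, as a purely illustrative Go sketch (it recomputes the threshold from a window of recent samples rather than using the actual OTel APIs):

    package main

    import (
        "fmt"
        "math/rand"
        "sort"
        "time"
    )

    // rollingP99 keeps the last `size` latency samples and recomputes the p99
    // threshold from them, so "slow" tracks current behavior instead of a value
    // measured once and never updated.
    type rollingP99 struct {
        samples []time.Duration
        size    int
    }

    func (r *rollingP99) observe(d time.Duration) {
        r.samples = append(r.samples, d)
        if len(r.samples) > r.size {
            r.samples = r.samples[1:] // drop the oldest; a ring buffer would avoid reallocations
        }
    }

    func (r *rollingP99) threshold() time.Duration {
        if len(r.samples) == 0 {
            return 0
        }
        sorted := append([]time.Duration(nil), r.samples...)
        sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
        idx := (len(sorted) * 99) / 100
        if idx >= len(sorted) {
            idx = len(sorted) - 1
        }
        return sorted[idx]
    }

    func main() {
        r := &rollingP99{size: 1000}
        for i := 0; i < 5000; i++ {
            // Simulated latencies; in practice these come from your request handler
            // or are fetched periodically from your metrics backend.
            r.observe(time.Duration(50+rand.Intn(100)) * time.Millisecond)
        }
        fmt.Println("current slow-request threshold:", r.threshold())
    }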


I do not see how logging could bottleneck you in a degraded state unless your logging is terribly inefficient. A properly designed logging system can record on the order of 100 million logs per second per core.

Are you actually contemplating handling 10 million requests per second per core that are failing?


Generation and publication are just the beginning (never mind the fact that resources consumed by an application to log something are no longer available to do real work). You have to consider the scalability of each component in the logging architecture from end to end. There's ingestion, parsing, transformation, aggregation, derivation, indexing, and storage. Each one of those needs to scale to meet demand.


I already accounted for consumed resources when I said 10 million instead of 100 million. I allocated 10% to logging overhead. If your service is within 10% of overload you are already in for a bad time. And frankly, what systems are you using that are handling 10 million requests per second per core (100 nanoseconds per request)? Hell, what services are you deploying that you even have 10 million requests per second per core to handle?

All of those other costs are, again, trivial with proper design. You can easily handle billions of events per second on the backend with even a modest server. This is done regularly by time-traveling debuggers, which actually need to handle these data rates. So again, what are we even deploying that has billions of events per second?


In my experience working at AWS and with customers, you don't need billions of TPS to make an end-to-end logging infrastructure keel over. It takes much less than that. As a working example, you can host your own end-to-end infra (the LGTM stack is pretty easy to deploy in a Kubernetes cluster) and see what it takes to bring it to a grinding halt with a given set of resources and TPS/volume.


I prefaced all my statements with the assumption that the chosen logging system is not poorly designed and terribly inefficient. Sounds like their logging solutions are poorly designed and terribly inefficient then.

It is, in fact, a self-fulfilling prophecy to complain that logging can be a bottleneck if you then choose logging that is 100-1000x slower than it should be. What a concept.


At the end of the day, it comes down to what sort of functionality you want out of your observability. Modest needs usually require modest resources: sure, you could just append to log files on your application hosts and ship them to a central aggregator where they're stored as-is. That's cheap and fast, but you won't get a lot of functionality out of it. If you want more, like real-time indexing, transformation, analytics, alerting, etc., it requires more resources. Ain't no such thing as a free lunch.


Surely you aren’t doing real time indexing, transformation, analytics, etc in the same service that is producing the logs.

A catastrophic increase in logging could certainly take down your log processing pipeline but it should not create cascading failures that compromise your service.


Of course not. Worst case should be backpressure, which means processing, indexing, and storage delays. Your service might be fine but your visibility will be reduced.


For sure. You can definitely tip over your logging pipeline and impact visibility.

I just wanted to make sure we weren’t still talking about “causing a cascading outage due to increased log volumes” as was mentioned above, which would indicate a significant architectural issue.


Damn that's fast! I'm gonna stick my business logic in there instead.


For high-volume services, you can still log a sample of healthy requests, e.g., trace_id mod 100 == 0. That keeps log growth under control. The higher the volume, the smaller the percentage you can use.
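
A hypothetical Go helper for that (hashing the trace ID, since it isn't always numeric):

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    // shouldLog keeps every error and slow request, and roughly 1% of healthy
    // requests, chosen deterministically from the trace ID so all logs for a
    // kept trace stay together.
    func shouldLog(traceID string, isError, isSlow bool) bool {
        if isError || isSlow {
            return true
        }
        h := fnv.New32a()
        h.Write([]byte(traceID))
        return h.Sum32()%100 == 0 // ~1%; shrink further as volume grows
    }

    func main() {
        kept := 0
        for i := 0; i < 100000; i++ {
            if shouldLog(fmt.Sprintf("trace-%d", i), false, false) {
                kept++
            }
        }
        fmt.Printf("kept %d of 100000 healthy requests\n", kept)
    }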


Just implement exponential backoff for slow-request logging, or some other heuristic, to control it. I definitely agree it is a concern though.
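
A toy Go version of that idea, logging only the 1st, 2nd, 4th, 8th, ... occurrence (a real one would reset the counter periodically so you keep getting some signal):

    package main

    import (
        "fmt"
        "sync/atomic"
    )

    // backoffLogger logs the 1st, 2nd, 4th, 8th, ... occurrence of an event, so a
    // sudden flood of slow requests produces O(log n) log lines instead of n.
    type backoffLogger struct {
        count uint64
    }

    func (b *backoffLogger) shouldLog() bool {
        n := atomic.AddUint64(&b.count, 1)
        return n&(n-1) == 0 // true exactly when n is a power of two
    }

    func main() {
        var b backoffLogger
        logged := 0
        for i := 0; i < 1000000; i++ {
            if b.shouldLog() {
                logged++
            }
        }
        fmt.Printf("logged %d of 1000000 slow requests\n", logged)
    }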


My impression was that you would apply this filter after the logs have reached your log destination, so there should be no difference for your services unless you host your own log infra, in which case there might be issues on that side. At least that's how we do it with Datadog: ingestion is cheap, but indexing and storing logs long term is the expensive part.


Good point. It also reminded me of when I was trying to optimize my app for certain scenarios, and then I realized it's better to optimize it for ALL scenarios, so it works fast and the servers can handle the load no matter what. To be more specific, I decided NOT to cache any common queries, but instead to make sure that all queries are as fast as possible.



