
Horrid advice at the end about logging every error, exception, slow request, etc., if you are sampling healthy requests.

Taking slow requests as an example, a dependency gets slower and now your log volume suddenly goes up 100x. Can your service handle that? Are you causing a cascading outage due to increased log volumes?

Recovery is easier if your service is doing the same or less work in a degraded state. Increasing logging by 20-100x when degraded is not that.



What we're doing at Cloudflare (including some of what the author works on) samples adaptively. Each log batch is bucketed based on a few fields, and if a bucket contains a lot of logs we only keep the sqrt or log of the number of input logs. It works really well... but part of why it works well is that we always have blistering rates of logs, so we can cope with spikes in event rates without the sampling system itself getting overwhelmed.
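
The gist of it, as a minimal Go sketch (the field names, bucket key, and sqrt policy here are illustrative, not our actual implementation):

    package main

    import (
        "fmt"
        "math"
        "math/rand"
    )

    // LogEntry is a hypothetical record; real pipelines carry many more fields.
    type LogEntry struct {
        Service string
        Status  int
        Message string
    }

    // bucketKey groups entries by a few fields; which fields to use is a policy choice.
    func bucketKey(e LogEntry) string {
        return fmt.Sprintf("%s/%d", e.Service, e.Status)
    }

    // sampleBatch keeps roughly sqrt(n) entries per bucket, so a 100x spike in
    // one bucket only produces ~10x more output from that bucket.
    func sampleBatch(batch []LogEntry) []LogEntry {
        buckets := make(map[string][]LogEntry)
        for _, e := range batch {
            buckets[bucketKey(e)] = append(buckets[bucketKey(e)], e)
        }

        var out []LogEntry
        for _, entries := range buckets {
            n := len(entries)
            keep := int(math.Ceil(math.Sqrt(float64(n))))
            // Pick `keep` entries uniformly; a real system would also record the
            // sampling rate so counts can be re-scaled at query time.
            rand.Shuffle(n, func(i, j int) { entries[i], entries[j] = entries[j], entries[i] })
            out = append(out, entries[:keep]...)
        }
        return out
    }

    func main() {
        var batch []LogEntry
        for i := 0; i < 10000; i++ {
            batch = append(batch, LogEntry{Service: "api", Status: 200, Message: "ok"})
        }
        for i := 0; i < 50; i++ {
            batch = append(batch, LogEntry{Service: "api", Status: 500, Message: "upstream timeout"})
        }
        fmt.Printf("kept %d of %d entries\n", len(sampleBatch(batch)), len(batch))
    }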


It’s an important architectural requirement for a production service to be able to scale out its log ingestion capacity to meet demand.

Besides, a little local on-disk buffering goes a long way, and is cheap to boot. It’s an antipattern to flush logs directly over the network.


And everything in the logging path, from the API to the network to the ingestion pipeline, needs to be best effort - configure a capacity and ruthlessly drop messages as needed, at all stages. Actually a nice case for UDP :)
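
Something like this, as a rough Go sketch of the bounded-buffer, drop-on-overflow idea (the logger type is made up, not any particular library):

    package main

    import (
        "fmt"
        "time"
    )

    // BestEffortLogger buffers up to `capacity` messages and drops anything beyond
    // that, rather than blocking the caller or growing memory without bound.
    type BestEffortLogger struct {
        ch      chan string
        dropped uint64 // incremented only from the calling goroutine here; use sync/atomic if shared
    }

    func NewBestEffortLogger(capacity int) *BestEffortLogger {
        l := &BestEffortLogger{ch: make(chan string, capacity)}
        go l.ship()
        return l
    }

    // Log never blocks: if the buffer is full, the message is dropped and counted.
    func (l *BestEffortLogger) Log(msg string) {
        select {
        case l.ch <- msg:
        default:
            l.dropped++
        }
    }

    // ship simulates forwarding to the next stage (network, relay, etc.).
    func (l *BestEffortLogger) ship() {
        for msg := range l.ch {
            _ = msg                          // send over UDP / to a relay here
            time.Sleep(1 * time.Millisecond) // pretend the downstream is slow
        }
    }

    func main() {
        l := NewBestEffortLogger(1000)
        for i := 0; i < 100000; i++ {
            l.Log(fmt.Sprintf("request %d", i))
        }
        fmt.Printf("dropped %d messages under load\n", l.dropped)
    }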


It depends. Some cases like auditing require full fidelity. Others don’t.

Plus, if you’re offering a logging service to a customer, the customer’s expectation is that once logs are successfully ingested, your service doesn’t drop them. If you’re violating that expectation, it needs to be clearly communicated to and assented to by the customer.


1. Those ingested logs are not logs for you, they are customer payload, which is business critical; 2. I've yet to see a Logging as a Service provider not have outages where data was lost or severely delayed. Also, the alternative to best effort/shedding excess load isn't 100% availability, it's catastrophic failure when capacity is reached.

Auditing has the requirement that events are mostly not lost, but most importantly that they can't be deleted by people on the host. And for the capacity side, again the design question is "what happens when incoming events exceed our current capacity - do all the collectors/relays balloon their memory and become much, much slower (effectively unresponsive), or do they immediately close incoming sockets, lower downstream timeouts, and so on?" Hopefully the audit traffic is consistent enough that you don't get spikes and can over-provision with confidence.


> those ingested logs are not logs for you, they are customer payload, which is business critical

Why does that make any difference? Keep in mind that at large enough organizations, even though the company is the same, there will often be an internal observability service team (frequently, but not always, as part of a larger platform team). At a highly-functioning org, this team is run very much like an external service provider.

> I've yet to see a Logging as a Service provider not have outages where data was lost or severely delayed.

You should take a look at CloudWatch Logs. I'm unaware of any time in its 17-year history that it has successfully ingested logs and subsequently lost or corrupted them. (Disclaimer: I work for AWS.) Also, I didn't say anything about delays, which we often accept as a tradeoff for durability.

> And for the capacity side, again the design question is "what happens when incoming events exceed our current capacity - do all the collectors/relays balloon their memory and become much, much slower (effectively unresponsive), or do they immediately close incoming sockets, lower downstream timeouts, and so on?"

This is one of the many reasons why buffering outgoing logs in memory is an anti-pattern, as I noted earlier in this thread. There should always -- always -- be some sort of non-volatile storage buffer in between a sender and remote receiver. It’s not just about resilience against backpressure; it also means you won’t lose logs if your application or machine crashes. Disk is cheap. Use it.
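
For illustration, a bare-bones Go sketch of the spool-to-disk pattern (the spool path and the forward stub are made up; real shippers like fluent-bit or vector handle rotation, checkpoints, and retries for you):

    package main

    import (
        "bufio"
        "fmt"
        "os"
    )

    // spoolPath is a stand-in; real deployments would use something like /var/spool/<app>/.
    const spoolPath = "myapp-logs.spool"

    // writeLog appends a line to the on-disk spool. Even if the process or the
    // network dies right after this returns, the entry survives on disk.
    func writeLog(line string) error {
        f, err := os.OpenFile(spoolPath, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
        if err != nil {
            return err
        }
        defer f.Close()
        if _, err := f.WriteString(line + "\n"); err != nil {
            return err
        }
        // f.Sync() here trades throughput for durability against machine crashes.
        return nil
    }

    // shipSpool is run by a separate shipper (or goroutine): it reads the spool and
    // forwards entries to the remote receiver at whatever pace the network allows.
    func shipSpool(forward func(string) error) error {
        f, err := os.Open(spoolPath)
        if err != nil {
            return err
        }
        defer f.Close()
        sc := bufio.NewScanner(f)
        for sc.Scan() {
            if err := forward(sc.Text()); err != nil {
                return err // stop and retry later; the data is still on disk
            }
        }
        return sc.Err()
    }

    func main() {
        _ = writeLog(`{"level":"error","msg":"dependency timeout"}`)
        _ = shipSpool(func(line string) error {
            fmt.Println("would send:", line) // stand-in for the network call
            return nil
        })
    }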


Yea that was my thought too. I like the idea in principle, but these magic thresholds can really bite you. It claims to be p99, probably based on some historical measurement, but that's only true if it's updated dynamically. Maybe this could periodically query the OTel provider for the real number, to at least limit the time window in which something bad can happen.
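
Something along these lines, as a purely illustrative Go sketch (it recomputes the threshold from a window of recent samples rather than using the actual OTel APIs):

    package main

    import (
        "fmt"
        "math/rand"
        "sort"
        "time"
    )

    // rollingP99 keeps the last `size` latency samples and recomputes the p99
    // threshold from them, so "slow" tracks current behavior instead of a value
    // measured once and never updated.
    type rollingP99 struct {
        samples []time.Duration
        size    int
    }

    func (r *rollingP99) observe(d time.Duration) {
        r.samples = append(r.samples, d)
        if len(r.samples) > r.size {
            r.samples = r.samples[1:] // drop the oldest; a ring buffer would avoid reallocations
        }
    }

    func (r *rollingP99) threshold() time.Duration {
        if len(r.samples) == 0 {
            return 0
        }
        sorted := append([]time.Duration(nil), r.samples...)
        sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
        idx := (len(sorted) * 99) / 100
        if idx >= len(sorted) {
            idx = len(sorted) - 1
        }
        return sorted[idx]
    }

    func main() {
        r := &rollingP99{size: 1000}
        for i := 0; i < 5000; i++ {
            // Simulated latencies; in practice these come from your request handler
            // or are fetched periodically from your metrics backend.
            r.observe(time.Duration(50+rand.Intn(100)) * time.Millisecond)
        }
        fmt.Println("current slow-request threshold:", r.threshold())
    }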


I do not see how logging could bottleneck you in a degraded state unless your logging is terribly inefficient. A properly designed logging system can record on the order of 100 million logs per second per core.

Are you actually contemplating handling 10 million requests per second per core that are failing?


Generation and publication are just the beginning (never mind the fact that resources consumed by an application to log something are no longer available to do real work). You have to consider the scalability of each component in the logging architecture from end to end. There's ingestion, parsing, transformation, aggregation, derivation, indexing, and storage. Each one of those needs to scale to meet demand.


I already accounted for consumed resources when I said 10 million instead of 100 million. I allocated 10% to logging overhead. If your service is within 10% of overload you are already in for a bad time. And frankly, what systems are you using that are handling 10 million requests per second per core (100 nanoseconds per request)? Hell, what services are you deploying that you even have 10 million requests per second per core to handle?

All of those other costs are, again, trivial with proper design. You can easily handle billions of events per second on the backend with even a modest server. This is done regularly by time-traveling debuggers, which actually need to handle these data rates. So again, what are we even deploying that has billions of events per second?


In my experience working at AWS and with customers, you don't need billions of TPS to make an end-to-end logging infrastructure keel over. It takes much less than that. As a working example, you can host your own end-to-end infra (the LGTM stack is pretty easy to deploy in a Kubernetes cluster) and see what it takes to bring it to a grinding halt with a given set of resources and TPS/volume.


I prefaced all my statements with the assumption that the chosen logging system is not poorly designed and terribly inefficient. Sounds like their logging solutions are poorly designed and terribly inefficient then.

It is, in fact, a self-fulfilling prophecy to complain that logging can be a bottleneck if you then choose logging that is 100-1000x slower than it should be. What a concept.


At the end of the day, it comes down to what sort of functionality you want out of your observability. Modest needs usually require modest resources: sure, you could just append to log files on your application hosts and ship them to a central aggregator where they're stored as-is. That's cheap and fast, but you won't get a lot of functionality out of it. If you want more, like real-time indexing, transformation, analytics, alerting, etc., it requires more resources. Ain't no such thing as a free lunch.


Surely you aren’t doing real time indexing, transformation, analytics, etc in the same service that is producing the logs.

A catastrophic increase in logging could certainly take down your log processing pipeline but it should not create cascading failures that compromise your service.


Of course not. Worst case should be backpressure, which means processing, indexing, and storage delays. Your service might be fine but your visibility will be reduced.


For sure. You can definitely tip over your logging pipeline and impact visibility.

I just wanted to make sure we weren’t still talking about “causing a cascading outage due to increased log volumes” as was mentioned above, which would indicate a significant architectural issue.


Damn that's fast! I'm gonna stick my business logic in there instead.


For high-volume services, you can still log a sample of healthy requests, e.g., trace_id mod 100 == 0. That keeps log growth under control. The higher the volume, the smaller the percentage you can use.
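
A hypothetical Go helper for that (hashing the trace ID, since it isn't always numeric):

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    // shouldLog keeps every error and slow request, and roughly 1% of healthy
    // requests, chosen deterministically from the trace ID so all logs for a
    // kept trace stay together.
    func shouldLog(traceID string, isError, isSlow bool) bool {
        if isError || isSlow {
            return true
        }
        h := fnv.New32a()
        h.Write([]byte(traceID))
        return h.Sum32()%100 == 0 // ~1%; shrink further as volume grows
    }

    func main() {
        kept := 0
        for i := 0; i < 100000; i++ {
            if shouldLog(fmt.Sprintf("trace-%d", i), false, false) {
                kept++
            }
        }
        fmt.Printf("kept %d of 100000 healthy requests\n", kept)
    }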


Just implement exponential backoff for slow-request logging, or some other heuristic, to control it. I definitely agree it is a concern though.
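
A toy Go version of that idea, logging only the 1st, 2nd, 4th, 8th, ... occurrence (a real one would reset the counter periodically so you keep getting some signal):

    package main

    import (
        "fmt"
        "sync/atomic"
    )

    // backoffLogger logs the 1st, 2nd, 4th, 8th, ... occurrence of an event, so a
    // sudden flood of slow requests produces O(log n) log lines instead of n.
    type backoffLogger struct {
        count uint64
    }

    func (b *backoffLogger) shouldLog() bool {
        n := atomic.AddUint64(&b.count, 1)
        return n&(n-1) == 0 // true exactly when n is a power of two
    }

    func main() {
        var b backoffLogger
        logged := 0
        for i := 0; i < 1000000; i++ {
            if b.shouldLog() {
                logged++
            }
        }
        fmt.Printf("logged %d of 1000000 slow requests\n", logged)
    }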


My impression was that you would apply this filter after the logs have reached your log destination, so there should be no difference for your services unless you host your own log infra, in which case there might be issues on that side. At least that's how we do it with Datadog: ingestion is cheap, but indexing and storing logs long term is the expensive part.


Good point. It also reminded me of when I was trying to optimize my app for certain scenarios, and then I realized it's better to optimize it for ALL scenarios, so it works fast and the servers can handle the load no matter what. To be more specific, I decided NOT to cache any common queries, but instead to make sure that all queries are as fast as possible.



