Mimir is just architected for a totally different order of magnitude of metrics. At that scale, yeah, Kafka is actually necessary. There are no other open-source solutions offering the same scalability, period.
That's beside the point, though: most customers will never need that level of scale. If you're not running Mimir on a dedicated Kubernetes cluster (or at least a dedicated-to-Grafana / observability cluster) then it's probably over-engineered for your use case. Just use Prometheus.
Have a look at VictoriaMetrics - I've run it at relatively high scale with much more success than any other metrics store. It's one of those things that just works. It's extremely easy to run in single-instance mode and handles much more than you'd expect. Scaling it is a breeze too.
(I'm not affiliated, but a very happy user across multiple orgs and personal projects)
The project where I looked at Mimir was a 500+ million timeseries project, with the desire to support scaling to the ten-figure level of timeseries (working for a BigCo supporting hundreds of product development teams).
All of these systems that store metrics in object storage - you have to remember that object storage is not file storage. Generally speaking (stuff like S3 Express One Zone being a relatively recent exception) you cannot append to objects. Metrics queries are resolved by combining historical blocks in object storage with a stateful service hosting the latest ~2 hours of data, until that data can be compressed and uploaded to object storage as a single immutable block.

At a certain scale, you simply need to choose which is more important: being able to answer queries or being able to ingest more timeseries. If you don't prioritize ingestion, the backlog just keeps growing - and when the inevitable sudden flood of metrics arrives (Murphy's Law guarantees it), you get multi-hour ingestion delays during which you are blind. If you do prioritize ingestion, the component simply won't respond to queries, which makes you blind anyway. Lose-lose.
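To make the trade-off concrete, here's a toy Go sketch of that two-tier read path (invented types, not Mimir's actual code): every query has to merge immutable blocks from object storage with the mutable in-memory head, so whichever of ingestion or querying the head node prioritizes, the other one starves.

    package main

    import "time"

    type sample struct {
        ts    time.Time
        value float64
    }

    type store struct {
        head   []sample   // last ~2h: mutable, memory-only until sealed
        blocks [][]sample // immutable blocks already in object storage
    }

    // A query must read both tiers. If the head node spends all its cycles
    // ingesting, this path stalls; if it prioritizes this path, ingestion
    // backs up instead.
    func (s *store) query(from, to time.Time) []sample {
        var out []sample
        scan := func(b []sample) {
            for _, p := range b {
                if !p.ts.Before(from) && !p.ts.After(to) {
                    out = append(out, p)
                }
            }
        }
        for _, b := range s.blocks {
            scan(b) // historical data: immutable, cheap, in object storage
        }
        scan(s.head) // freshest data: only available if the head can answer
        return out
    }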
Mimir built Kafka in because it's quite literally necessary at scale. You need the stateful query component (with the latest 2 hours) to prioritize queries, then pull from the Kafka topic on a lower-priority thread when there's spare time to do so. Kafka soaks up the sudden ingestion floods so they don't DoS the stateful query component.
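A minimal sketch of that consumption pattern, with a channel standing in for the Kafka topic (the load check is my own illustration, not Mimir's actual scheduling):

    package main

    import "time"

    // consumeWhenIdle drains the buffered log only when the query side has
    // spare capacity, so an ingest flood grows the Kafka backlog instead of
    // starving queries.
    func consumeWhenIdle(log <-chan []byte, queryBusy func() bool, ingest func([]byte)) {
        for rec := range log {
            for queryBusy() {
                time.Sleep(100 * time.Millisecond) // back off; queries win
            }
            ingest(rec) // apply to the in-memory head at low priority
        }
    }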
I took a quick look at VictoriaMetrics - no Kafka or Kafka-like component to soak up ingestion floods? DOA.
Again, most companies are not BigCos. If you're a startup/scaleup with one VP supervising several development teams, you likely don't need that scale; VictoriaMetrics is probably just fine, and you're not the first person I've heard recommend it. But I'd say 80% of companies are small enough to be served by a simple Prometheus (or Thanos Query over HA Prometheus) setup, 17% will get a lot of value out of VictoriaMetrics, and the last 3% really need Mimir's scalability.
I'm not sure where you saw that VictoriaMetrics uses object storage. It doesn't - it uses block storage, and it runs completely fine on HDD; you don't even need SSD/NVMe.
There are multiple ways to deal with ingestion floods. Kafka/distributed log is one of them, but it's not the only one. In cluster mode VM is a distributed set of services that scale out independently and buffer at different levels.
Resource usage for ingestion/storage is much lower than other solutions, and you get more for your money. At $PREVIOUS_JOB we migrated from a very expensive Thanos setup to a VM cluster backed by HDDs and saved a lot. Performance was much better as well. It was a while ago and I don't remember the exact number of time series, but it was meant to handle 10k+ VMs (plus a lot of other resources and multiple k8s clusters) and did so with ease - and made life easier for everybody involved.
I don't think you have really looked into VM - you might be pleasantly surprised by what you find :) Check out this benchmark against Mimir [1] (it is a few years old though), and some case studies [2]. Some of the companies in the case studies run at significantly higher volume than your requirements.
There were other problems with VictoriaMetrics: a failed migration attempt by previous engineers made it politically difficult to raise as a possibility; there was no promise of full PromQL compatibility (too many PromQL dashboards built by too many teams); and features were locked behind the Enterprise version (Mimir Enterprise had features added on top, not features locked away).
> HDD
You're right, I'm misremembering here, that particular complaint about a lack of Kafka was a Thanos issue, not VM.
That said, HDD is a hard sell to management. It's seen as "not cloud native". People carry old trauma from 100%-full disks that weren't expanded in time. And there's an organizational perception that object storage doesn't need to be backed up (because redundancy is built into the object storage system) but HDD does - and automated backups are a VM Enterprise feature, which matters even more if you're storing long-term metrics in VM.
> In cluster mode VM is a distributed set of services that scale out independently and buffer at different levels
So are Thanos and Mimir, which suffer from ingest floods causing DoS, at least until Kafka was added. vminsert is billed as stateless, same as Thanos Receiver, same as Mimir Distributor. Not convinced.
> lack of a promise of full PromQL compatibility (too many PromQL dashboards built by too many teams)
This is classic FUD. VictoriaMetrics is used as a drop-in replacement for Prometheus, Thanos and Mimir. It works perfectly across all the existing dashboards in Grafana, and across all the existing recording and alerting rules. I'm unaware of VictoriaMetrics users who hit PromQL compatibility issues during migration from Prometheus, Thanos or Mimir to VictoriaMetrics. There are a few deliberate incompatibilities aimed at improving user experience. See https://medium.com/@romanhavronenko/victoriametrics-promql-c...
> seeing features locked behind the Enterprise version (Mimir Enterprise had features added on top, not features locked away)
All the VictoriaMetrics features that are useful across the majority of practical use cases are included in the open-source version. The main Enterprise feature is high-quality technical support from VictoriaMetrics engineers. The other Enterprise features are needed only by large enterprise companies. See https://docs.victoriametrics.com/victoriametrics/enterprise/
What's your preferred solution for observability and monitoring of tiny apps?
I'm looking for something with really compact storage, really simple deployment (preferably a single statically linked binary that does everything), and compatibility with OpenTelemetry (including metrics and distributed tracing). If/when I outgrow it, I can switch to another OpenTelemetry provider (but realistically this will not happen).
I'm personally not convinced OpenTelemetry is the future. I get the desire not to be vendor-locked to a single provider, but Prometheus and Jaeger are very solid, battle-hardened, popular, well-maintained, easily self-hosted open-source projects. For small deployments you don't need to overthink things here: Grafana, Prometheus, Jaeger (with local disk storage). Logging depends on how many machines you're talking about and where they're hosted (e.g. GCP Cloud Logging is fine for GCP-hosted projects; the 50 GB free tier is a lot for a small project), but as a default Loki is also just fine and much better than Elastic/OpenSearch.
OpenTelemetry is, last I looked at it, way too immature, unstable, and resource-hungry to be such a foundational part of infrastructure.
This is a lot of infrastructure, we are talking about a tiny app here. Are you sure this is warranted?
Honestly I would prefer to have observability as a library, but that's not feasible for two reasons: a) I really want distributed tracing (no microservices - I just want to combine traces from frontend and backend), so I need a place to join them, and b) it could/would lead to loss of traces when the program crashes.
In any case, it makes sense for me to choose tracing and metrics libraries that can output either OpenTelemetry or Prometheus and Jaeger, in the event that OpenTelemetry is not enough.
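For what it's worth, the "place to join them" can be as thin as propagating the W3C traceparent header from the frontend and recording backend spans under the same trace ID. A rough Go sketch, not tied to any particular SDK:

    package main

    import (
        "log"
        "net/http"
    )

    // The frontend sends a header like
    //   traceparent: 00-<trace-id>-<parent-span-id>-01
    // and the backend records its spans under the same trace ID, so the
    // collector can stitch the frontend and backend halves together.
    func handler(w http.ResponseWriter, r *http.Request) {
        tp := r.Header.Get("traceparent")
        log.Printf("backend span joins trace %q", tp)
        w.WriteHeader(http.StatusOK)
    }

    func main() {
        http.HandleFunc("/api", handler)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }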
> Loki is also just fine and much better than Elastic/OpenSearch.
I'm scratching my head a bit at what your expectation is here. Traces and real-user monitoring are not the same thing. Distributed tracing is specifically a microservices thing. Maybe all you're looking for is to attach a UUID to each request and log it? Jaeger and Tempo aren't going to help you with frontend code.
> A lot of infrastructure
> Prometheus
You need something to tell you when your tiny app isn't running, so it can't be a library embedded into the app itself.
> Grafana
You need something with dashboards to help you understand what's going on. If your thing telling you your app has crashed is outside your app, the thing that helps visualize what happened before your app crashed also needs to be outside your app.
> Jaeger
Do you really need traces? Or just to log how long request times took, and have metrics for p50/p95/p99?
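If percentiles are all you need, a plain Prometheus histogram covers it with no tracing backend at all - e.g. with client_golang (bucket defaults and metric names here are just examples):

    package main

    import (
        "net/http"
        "time"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promauto"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    var reqDuration = promauto.NewHistogram(prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "Request latency distribution.",
        Buckets: prometheus.DefBuckets,
    })

    // timed wraps a handler and records its latency in the histogram.
    func timed(h http.HandlerFunc) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            start := time.Now()
            h(w, r)
            reqDuration.Observe(time.Since(start).Seconds())
        }
    }

    func main() {
        http.Handle("/metrics", promhttp.Handler())
        http.HandleFunc("/", timed(func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("ok"))
        }))
        http.ListenAndServe(":8080", nil)
    }

p95 is then histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) in PromQL.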
> Loki
If you're running only one instance of your app, you don't need it, just read through your logfiles with hl. If you have multiple machines, sending your logs to one place may not be necessary, but it's incredibly helpful; the alternative is basically ssh multiplexing...
Thank you for that. I absolutely love that this uses tantivy.
I was previously leaning towards VictoriaMetrics and VictoriaTraces (I will need both) but I think that OpenObserve is even simpler. Later I found Gigapipe/qryn https://github.com/metrico/gigapipe
Does OpenObserve ship something to view traces and metrics? (It appears that Gigapipe does.) Or am I supposed to just use Grafana? I want to cut down on moving pieces.
I'm looking at OpenTelemetry because of broad tooling compatibility (both Rust tracing crates, tracing and emit, support it - for logs, tracing and metrics) and it seems like something that will stick around.
Also I'm not sure I will ever need actual performance out of an observability solution; it's a tiny app after all.
AMP is an internal proprietary fork of Cortex; they're not upstreaming their changes, in large part due to the scalability limits of Cortex's design. It has the same scalability limitation I described earlier: no Kafka-like component to soak up ingest floods.
> Multi-petabyte
Sheer storage size is a meaningless point here, as longer retention simply requires more storage. There may or may not be compaction components that help speed up queries over larger windows, but either way the queries will still succeed. I have no doubt that any of the solutions on the table can store that much data.
The real scaling question is how many active timeseries the system can handle, at which resolution (per 15 seconds? per 60 seconds? worse?), and no, "we scale horizontally" doesn't mean much without more serious benchmarks.
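To put rough numbers on it, using the scale from earlier in the thread:

    500,000,000 active series / 15 s scrape interval ≈ 33M samples/s sustained
    1,000,000,000 active series / 15 s ≈ 67M samples/s sustained

That's the floor the ingest path has to absorb before any query work happens.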
> The real scaling question is how many active timeseries the system can handle
Handle? What does that mean? Be able to ingest data? Be able to query?
Ingest data: using Kafka helps only during ingestion, for absorbing the spike.
Query data: Kafka has no role to play there. Querying performantly at scale is a hard problem. I don't doubt Mimir's capability to query high volumes of data, but other systems can do it too. OpenObserve's internal benchmarks show that its querying is much faster at scale than Mimir's, and we will publish them at the right time (we don't publish benchmarks just to satisfy the curiosity of people on the internet) - but this is not about OpenObserve, so let's set that aside for a while.
As for how many active timeseries: we've built OpenObserve with a fundamentally different architecture. We don't have the "active timeseries" constraint that Prometheus-based systems do - high cardinality isn't an issue by design. It's a topic for another day though.
The primary function of a message broker is to decouple producer and consumer so writes can happen efficiently (consumers don't get bogged down by high incoming volume). Something like Kafka does that very, very well - it's one of the best systems designed for it, and it reliably handles massive ingestion volumes without dropping data. It's a beast of its own to operate, though.
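The producer side of that decoupling is tiny. For example, with the segmentio/kafka-go client (real library; the broker address and topic name are placeholders), the write returns once the log has the record, regardless of how far behind the consumers are:

    package main

    import (
        "context"

        "github.com/segmentio/kafka-go"
    )

    func main() {
        w := &kafka.Writer{
            Addr:  kafka.TCP("localhost:9092"), // assumed local broker
            Topic: "metrics-ingest",            // hypothetical topic name
        }
        defer w.Close()

        // The producer's job ends here; consumers drain at their own pace.
        if err := w.WriteMessages(context.Background(),
            kafka.Message{Value: []byte(`cpu_usage{host="a"} 0.42`)},
        ); err != nil {
            panic(err)
        }
    }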
Kafka was also built in an era when autoscaling wasn't available (it's still very relevant, and will be for a very long time). Autoscaling can, to a great degree, let you handle write spikes (it's not the same thing, but it attacks the same problem from a different angle); extreme spikes will still require a message broker. Horizontal scaling does cut it to a great degree, though.
Having architected massive systems for multiple large companies, I can argue about technology for a long time, but the only point I want to drive home is to avoid words like "period". Mimir's architecture makes sense, but it's not the only solution that works at scale, and its operational complexity has real costs. There are no absolutes in tech, as in life.