SQL queries over traces are definitely worth it. Android and Chrome have had them for a while [1]. I once wrote about quantifying UI jankiness using them [2].
The point is that it can give you (1) quantitative comparisons at scale and (2) alternative visualizations that reveal problems which aren't obvious from the default timeline view. With it, performance investigation becomes more like exploratory data analysis than torture for your eyes.
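As a toy illustration of that "exploratory data analysis" framing (this uses SQLite and an invented `frames` table, not Perfetto's actual trace-processor schema), a jank query might look like:

```python
import sqlite3

# Toy stand-in for a trace database; Perfetto's trace processor exposes
# its own schema -- this only sketches the idea of querying a trace.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE frames (frame_id INTEGER, dur_ms REAL)")
con.executemany(
    "INSERT INTO frames VALUES (?, ?)",
    [(1, 8.1), (2, 16.2), (3, 31.0), (4, 9.5), (5, 48.7)],
)

# Quantify jank: count frames over the 16.7 ms (60 fps) budget and find
# the worst offender, instead of eyeballing a timeline.
row = con.execute(
    """
    SELECT COUNT(*) AS janky, MAX(dur_ms) AS worst
    FROM frames
    WHERE dur_ms > 16.7
    """
).fetchone()
print(row)  # → (2, 48.7)
```

The same query runs unchanged over a trace from yesterday's build and today's, which is what makes comparison at scale possible.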
It's not clear if Promscale can cross-reference other types of performance metrics. If it's possible that'd be a game changer.
The Promscale team is at KubeCon right now, so I'll jump in to answer this question.
Yes, you can actually cross-analyze traces with Prometheus metrics in Promscale. That is in fact one of the key reasons we built Promscale, and it's something we can do because it is built on top of TimescaleDB.
If it's possible that'd be a game changer.
I hope it is! And if not, we're always open to product feedback.
As @akulkarni said, Promscale supports Prometheus metrics and OpenTelemetry traces natively and there are different ways to correlate both signals. I am actually delivering a talk that goes over the different ways you can correlate Prometheus and OpenTelemetry data tomorrow at Prometheus Day :)
One is adding exemplars to your Prometheus metrics that link to specific traces representative of the value of those metrics. In Promscale you can store all that information and then display it in Grafana as explained in 1. That's the way most often discussed, but it typically involves deploying one backend for metrics and one backend for traces, instead of just one as is the case with Promscale.
With Promscale you can also correlate metrics and traces using SQL joins. That opens up a whole set of new possibilities. For example, imagine you could retrieve the slowest requests happening on services running on nodes where CPU usage is high, to understand how that is impacting their performance. Or imagine you are seeing a specific OOM error often in your traces, and you could run a SQL query to look at the evolution of memory usage over the last 24 hours on the nodes where those OOM errors are most frequent, to see if you spot anything strange happening. You could even go a step further and retrieve in the same query which processes are consuming the most memory on those nodes, to pinpoint the processes that could be causing the issue.
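A minimal sketch of that metrics-traces JOIN idea, using SQLite and made-up table and column names (Promscale's real schema, e.g. its `ps_trace` views over TimescaleDB, differs):

```python
import sqlite3

# Invented tables: per-node CPU samples and request spans. In Promscale
# these would be a Prometheus metric and OpenTelemetry spans instead.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE cpu_usage (node TEXT, ts INTEGER, pct REAL);
CREATE TABLE spans (trace_id TEXT, node TEXT, ts INTEGER, dur_ms REAL);
INSERT INTO cpu_usage VALUES ('n1', 100, 95.0), ('n2', 100, 20.0);
INSERT INTO spans VALUES
  ('t1', 'n1', 100, 900.0),
  ('t2', 'n1', 100, 120.0),
  ('t3', 'n2', 100, 800.0);
""")

# Slowest requests on nodes whose CPU usage was high at the same time.
rows = con.execute("""
    SELECT s.trace_id, s.dur_ms
    FROM spans s
    JOIN cpu_usage c ON c.node = s.node AND c.ts = s.ts
    WHERE c.pct > 90
    ORDER BY s.dur_ms DESC
""").fetchall()
print(rows)  # → [('t1', 900.0), ('t2', 120.0)]
```

Note that `t3` is slower than `t2` but is filtered out, because its node was not CPU-constrained: the join narrows the investigation to the suspect nodes.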
I'm not a server-side person, but from my experience, other types of data app developers might want to join are various kinds of product-specific data, like feature flags or user-specific dimensions for each request. These aren't typically in the trace data itself.
These are different from what product-agnostic "performance engineers" tend to look into, so I understand if this is out of scope. Although I think product people should look into these numbers as well, instead of just dumping them onto the performance team's plate :-/
I should have mentioned that correlating observability data (or sometimes product metrics collected via Prometheus) with product data (or really any other data, like business data) can be super useful and is totally possible with Promscale because PostgreSQL is under the hood. So you could copy that data into the same PostgreSQL instance used by Promscale, or maybe use Foreign Data Wrappers (1). This would allow you to analyze, for example, API request latency by the product plan the customer is subscribed to, or by which feature flags are enabled for their account, etc., without having to add all those attributes as labels to all your metrics, which can be technically complex and also costly.
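A sketch of that idea using SQLite's `ATTACH` to stand in for a foreign data wrapper (PostgreSQL would use `postgres_fdw` instead; the table and column names here are invented):

```python
import os
import sqlite3
import tempfile

# Business data lives in a separate database file, playing the role of
# the foreign server that postgres_fdw would reach in the real setup.
biz_path = os.path.join(tempfile.mkdtemp(), "billing.db")
biz = sqlite3.connect(biz_path)
biz.execute("CREATE TABLE accounts (account_id TEXT, plan TEXT)")
biz.executemany("INSERT INTO accounts VALUES (?, ?)",
                [("a1", "free"), ("a2", "enterprise")])
biz.commit()
biz.close()

# Observability database: request latencies labeled only by account id,
# with no plan/feature-flag labels on the metric itself.
obs = sqlite3.connect(":memory:")
obs.execute("CREATE TABLE api_latency (account_id TEXT, latency_ms REAL)")
obs.executemany("INSERT INTO api_latency VALUES (?, ?)",
                [("a1", 120.0), ("a2", 40.0), ("a2", 60.0)])

# ATTACH plays the foreign-data-wrapper role: one query, no ETL.
obs.execute("ATTACH DATABASE ? AS biz", (biz_path,))
rows = obs.execute("""
    SELECT a.plan, AVG(l.latency_ms)
    FROM api_latency l
    JOIN biz.accounts a USING (account_id)
    GROUP BY a.plan
    ORDER BY a.plan
""").fetchall()
print(rows)  # → [('enterprise', 50.0), ('free', 120.0)]
```

The point is that the business dimension (plan) never has to be baked into the metric as a label; it is joined in at query time.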
We actually do this within Timescale Cloud, and it's amazing.
It allows us to cohort performance data across data stored in other microservices' databases (e.g., by account types, projects, billing data, etc.): JOINs across foreign data wrappers using TimescaleDB + Postgres, all within the database, with no ETL or application code needed.
So you could look at Prometheus data for your trial users vs. paying customers, for customers running more than X services, for customers that pay more than $X per month or have been customers for more than 6 months, etc.
It's super useful across operations, support, product, customer care, and more...
Maybe I'm missing something, but what are the major differences between Perfetto and an OpenTelemetry tracing / metrics approach? In other words, why would someone choose one tool over the other?
Naively, it seems like Perfetto was designed around tracing on-device behavior while OTel focuses more on distributed tracing. However, I'm not sure why that distinction would require two separate solutions.
[1] https://perfetto.dev/
[2] https://notes-dodgson-org.translate.goog/android/trace-proce...