Over the last couple of days I have been evaluating different solutions for monitoring Registry nodes and the Radicle network. I’d like to share my insights here, make a recommendation and move forward with a decision. I’ve tried to be brief but I’d be happy to share more details.
We want flexible and powerful analysis tools. Since we’re not sure yet what we need to monitor, we might need advanced capabilities to combine metrics and do calculations on them.
The solution should integrate seamlessly with Prometheus metrics. Substrate, which our nodes are built on, provides Prometheus metrics out of the box.
We want to operate as little monitoring infrastructure as possible.
Current and future team members should require as little education on the solution as possible. Solutions that are widely used and documented are preferable.
The solution should be affordable.
The self-hosted Grafana/Prometheus stack checks all the boxes for our needs except operating our own infrastructure. It’s powerful and flexible and widely used. Out of all the solutions our team (and also other Radicle teams) has the most experience with this stack. This likely holds true for future members.
The biggest drawback with this solution is that we would need to operate Grafana and Prometheus ourselves.
Grafana Cloud is a service that provides a hosted version of Grafana and Prometheus.
It offers two options for integrating with Prometheus metrics: either via their own agent running on a node, or via remote write from an existing Prometheus instance. I chose the latter because I was familiar with setting up Prometheus on K8s but not with setting up the agent. I was able to set it up quickly and get it working.
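For reference, the remote write hookup is only a few lines in the Prometheus config. This is a sketch, not our actual setup – the endpoint URL, instance ID, and key path below are placeholders:

```yaml
# prometheus.yml – sketch of shipping metrics to Grafana Cloud via remote write.
# The URL, username, and password file are placeholders, not our real values.
remote_write:
  - url: https://prometheus-us-central1.grafana.net/api/prom/push
    basic_auth:
      username: "123456"                        # Grafana Cloud instance ID
      password_file: /etc/prometheus/grafana-cloud-key
```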
Grafana Cloud charges 16$ per 1000 unique Prometheus series per month, with a minimum of 50$ per month. We’re currently using about a third of their basic plan, so we might need to scale this up over time. Their basic plan also includes 10 users, which is more than enough for us.
Datadog is an observability platform. I’ve integrated it into our stack by running their agent, which scrapes Prometheus metrics. The integration was fairly straightforward. Their feature set for analyzing metrics is comparable to Grafana/Prometheus but more limited in some cases. For example, it was not possible to calculate the rate of a counter metric over a configurable time window (e.g. block production rate over the last 10 minutes); only fixed time windows were available. On the other hand, Datadog has no stand-out feature that Grafana/Prometheus is missing.
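For comparison, the configurable-window rate that Datadog couldn’t express is a one-liner in PromQL. The metric name here is illustrative, not necessarily one we expose:

```promql
# Per-second block production rate, averaged over the last 10 minutes.
# Swap [10m] for any window; substrate_block_height is an assumed metric name.
rate(substrate_block_height[10m])
```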
Datadog charges 15$ per host per month. This is not ideal since we want to be flexible with the number of hosts. Having to factor in extra cost every time we spin up a host might be an issue, and it is also unclear how ephemeral hosts are charged.
Google Stackdriver
Stackdriver is the monitoring solution for everything running on Google Cloud Platform. It integrates well with metrics from GCP but is more limited in its capabilities than the other solutions. I experimented with dashboards for some GCP resources but did not set it up for our custom network metrics.
Based on my research I recommend we use Grafana Cloud for now. Stackdriver is not really an option since it does not satisfy our needs. Datadog offers no features beyond Grafana Cloud but has some limitations in comparison, while Grafana Cloud has no serious limitations. Grafana Cloud also has an advantage in openness and team expertise, and the pricing seems reasonable.
The choice of using the hosted version of an established stack (for better or for worse :trollface:) with a price tag < 100$ is very reasonable.
I’m still curious to hear what operational burden you’re afraid of – both Grafana and Prometheus are by design rather easy to run, and not very resource intensive.
I’m also curious to hear about how you plan to utilise logs for observability. Having built numerous logging pipelines over the years, I am mostly convinced that it should either be built on grep or The log/event processing pipeline you can’t have (which is, basically, distributed grep).
When I tried to set up Grafana myself on K8s I spent more than an hour on it, and then got to a point where I would have had to spend at least another half day writing the Terraform code to set up an ingress with SSL certificates and a domain so that we could log into it. After that, only I would have been able to fix any issues with Grafana, and the team would have depended on me. At 50$ for the service, it already pays off if we would otherwise spend more than one hour per month operating Grafana/Prometheus.
It’s definitely possible for us to run it ourselves and I don’t doubt that it would be rather easy. But at the moment it would just drain resources without benefit. Of course this evaluation might change over time and at some point it might make sense to operate the stack ourselves.
I hope that logs don’t need to play a big role in monitoring the system. Since we control the code we should expose metrics for all the things we care about. However, logs will definitely be integral to debugging and understanding the system if something goes wrong. With this in mind our needs should be very simple: We should be able to retrieve stored logs, filter them, and inspect them. In addition, as for our metrics pipeline, the system should be very easy to operate and easy to use (ideally with a lot of documentation and public knowledge).
Ya fifty bucks is alright. If it goes over 150-200, I’d reevaluate.
I took a look at what is emitted currently, and it does seem free-form. Most services or self-hosted indexers I’m aware of won’t provide much value on unstructured textual log lines – but will charge you 10x the storage price of an S3/GCS bucket.
So my question is: do you want to be able to tail the logs of all nodes in realtime, or retrieve historical ones for forensics?
I think the latter is more important in the long run. Real-time inspection will also be necessary (and is indeed something we’re doing already), but it will be used mainly for checking single instances, and the current tools (kubectl logs and Stackdriver) are enough for the moment.
I’ll give the logging discussion a bit more time to find out what we really need (definitely structured logs) and then evaluate some solutions. For now we can stick with what we have.
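To make the “structured logs plus grep” idea concrete, here is a minimal sketch. The field names and messages are made up for illustration; the point is that JSON-lines logs make plain grep (or its distributed cousins) enough for retrieving, filtering, and inspecting:

```shell
# Write a few hypothetical JSON-lines log entries (illustrative schema,
# not an agreed format for our nodes).
cat > /tmp/node.log <<'EOF'
{"ts":"2020-06-01T12:00:00Z","level":"info","target":"consensus","msg":"imported block"}
{"ts":"2020-06-01T12:00:06Z","level":"warn","target":"network","msg":"peer disconnected"}
{"ts":"2020-06-01T12:00:12Z","level":"info","target":"consensus","msg":"imported block"}
EOF

# Filter by field with plain grep: all consensus events, then only warnings.
grep '"target":"consensus"' /tmp/node.log
grep '"level":"warn"' /tmp/node.log
```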