A checklist to choose a monitoring system

Prathamesh Sonpatki
4 min read · Feb 25, 2024

Monitoring is essential to understanding a system's health. Monitoring tools range from open-source projects such as Prometheus to a long list of commercial vendors, and they form the backbone of any organization's observability strategy.

How do you choose one? Do you pick the latest open-source tool or the battle-tested one? How do you decide build vs. buy? And how do you weigh developer productivity in that decision?

Here’s a simple checklist anyone can use before settling on a monitoring system:

Scalability — High Cardinality

If you’re blessed with more customers and must add more engineers to match that growth, the need to scale monitoring becomes apparent as systems struggle to keep pace with the business. A monitoring product needs to be simple and intuitive for the engineering team while matching the growth and scale of the customers being onboarded.

So, the first step is to check how well a monitoring solution can scale. Typically, the problem manifests as managing high cardinality.
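To make "high cardinality" concrete, here is a minimal sketch of why it bites: the number of time series for one metric is the product of the distinct values each label can take, so adding a single wide label (a pod name, a user ID) multiplies storage and query cost. The metric and label names below are illustrative, not from any real system.

```python
labels = {
    "method": ["GET", "POST", "PUT", "DELETE"],
    "status": ["200", "400", "404", "500"],
    "pod": [f"pod-{i}" for i in range(50)],
}

def series_count(label_values: dict[str, list[str]]) -> int:
    """Upper bound on time series for one metric: the product of
    the number of distinct values each label can take."""
    total = 1
    for values in label_values.values():
        total *= len(values)
    return total

# 4 methods * 4 statuses * 50 pods = 800 series for a single metric.
print(series_count(labels))
```

Dropping the `pod` label here cuts the count from 800 to 16, which is exactly the kind of trade-off a scalable monitoring system should let you reason about.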

Reliability

Every monitoring system should be accountable for its uptime guarantees and overall reliability. ‘Who’s monitoring the monitoring system?’ — not enough folks ask this, especially when self-hosting open-source tools. Explicit guarantees give teams more accountability for the monitoring system's performance.

And no, Open Source does not mean it’s cheap.

After all, there’s no point in having a monitoring system if it can’t guarantee its own uptime.

Mean Time To Detect (MTTD)

The core job of a sound monitoring system is to reduce MTTD. A well-thought-out product will have simple, easy alerting, robust pattern matching and anomaly detection, and will help and guide an engineer to learn more about their system.

Among all the points mentioned, this is the one that will matter the most for an SRE on the floor who’s tasked with on-call duties and is the first responder.
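The anomaly detection mentioned above can be as simple as flagging points that deviate sharply from a recent baseline. This is a crude stand-in, under the assumption of a rolling z-score approach, for the detection features a monitoring product should ship with; the latency series is invented for illustration.

```python
import statistics

def detect_anomalies(series, window=10, threshold=3.0):
    """Flag indices whose value is more than `threshold` standard
    deviations from the mean of the preceding `window` points."""
    anomalies = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mean = statistics.fmean(baseline)
        stdev = statistics.pstdev(baseline)
        if stdev and abs(series[i] - mean) > threshold * stdev:
            anomalies.append(i)
    return anomalies

latency_ms = [20, 21, 19, 20, 22, 20, 21, 19, 20, 21, 20, 250, 21, 20]
print(detect_anomalies(latency_ms))  # flags the spike at index 11
```

The earlier such a spike is surfaced (and the less tuning the rule needs), the lower the MTTD.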

Data Exploration

Storing data is the easy part; providing avenues to explore and dissect it is the hard one. A monitoring system should accelerate queries, return results quickly, and not crash when multiple people use it concurrently.

These issues are extremely prevalent in e-commerce, ride-sharing, video streaming, and food tech companies. Most business and product teams are dissuaded from exploring their data because of how fragile the system is.

Exploration should be rated in two categories:

  1. Ability to handle high-cardinality exploration.
  2. A visual health board of alerting rules.
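On the second point, a "health board of alerting rules" ultimately boils down to surfacing which monitored services have no alerting at all, before an incident does. A minimal sketch, assuming two hypothetical inventories (services emitting metrics vs. services covered by at least one rule):

```python
# Hypothetical inventories; real ones would come from service discovery
# and the alerting rule store.
services_with_metrics = {"checkout", "payments", "search", "cart", "auth"}
services_with_alerts = {"checkout", "payments", "auth"}

# The coverage gap is a plain set difference.
uncovered = sorted(services_with_metrics - services_with_alerts)
coverage = len(services_with_alerts & services_with_metrics) / len(services_with_metrics)

print(f"Alert coverage: {coverage:.0%}")   # Alert coverage: 60%
print(f"No alerting rules: {uncovered}")   # ['cart', 'search']
```

A good exploration UI renders exactly this gap visually, per team and per environment.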

Engineering Overheads

This is one of the most overlooked points when choosing a monitoring solution. A good monitoring solution should not just reduce storage and exploration costs. It should be intuitive and simple enough to reduce engineering toil.

Most organizations don’t factor in the cost of engineering salaries to run, scale, and manage a monitoring system.

Automation Workflows

The infrastructure and services under monitoring will grow as the business grows. Any new infra or component needs to be covered by existing monitoring practices, and this has to happen automatically, given the pace at which engineering grows as customers grow. The cascading nature of microservices means dependencies are intertwined; manually monitoring and mapping infra will only hamper end-to-end monitoring.
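One way to picture this automation: whenever service discovery reports a new component, stamp out a default set of alert rules for it instead of waiting for someone to add them by hand. The rule names, expressions, and thresholds below are illustrative assumptions, not from any real system.

```python
# Default rule templates applied to every newly discovered service.
DEFAULT_RULES = [
    ("HighErrorRate", "rate({service}_errors_total[5m]) > 0.05"),
    ("HighLatencyP99", "{service}_latency_p99_seconds > 1"),
]

def rules_for(service: str) -> list[dict]:
    """Instantiate the default templates for one service."""
    return [
        {"alert": name, "expr": expr.format(service=service), "for": "10m"}
        for name, expr in DEFAULT_RULES
    ]

existing = {"checkout"}
discovered = {"checkout", "recommendations"}  # a new service appeared

for svc in sorted(discovered - existing):
    for rule in rules_for(svc):
        print(rule["alert"], "->", rule["expr"])
```

The point is the workflow, not the thresholds: new infra gets baseline coverage with zero manual steps, and teams refine from there.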

Onboarding time

A good monitoring tool has a simple and easy onboarding flow and helps migrate existing workflows regardless of how complicated or varied they are. Not only should the onboarding time be dramatically reduced to ensure business continuity, but existing workflows should not be hampered. Learning a new language, protocol, or a locked-in solution only adds more stress and time to the engineering team. Being compatible with open standards becomes very crucial in this regard, especially with OpenTelemetry.

OTel Compatibility

OTel is essential, and potentially the way out of the quagmire ‘Observability’ finds itself in. Being locked into closed vendor platforms kills innovation, hampers interoperability, and forgoes the helpful features that make life easier for an SRE.

At this juncture, this point is a no-brainer. Any monitoring solution must be OTel-compatible, support open standards such as OpenMetrics and OpenTelemetry, and integrate with open-source tools such as Prometheus, VictoriaMetrics, InfluxDB, Telegraf, and StatsD.
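A small sketch of why these open standards matter in practice: anything that renders metrics in the Prometheus/OpenMetrics text exposition format can be scraped by Prometheus, VictoriaMetrics, or an OTel Collector alike, with no vendor-specific agent in the path. The renderer below is a toy (real exporters also handle escaping and other metric types); the metric name is illustrative.

```python
def expose(name: str, help_text: str, samples: dict[tuple, float]) -> str:
    """Render a counter in the Prometheus text exposition format:
    # HELP / # TYPE comment lines, then one sample line per label set."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines)

page = expose(
    "http_requests_total",
    "Total HTTP requests.",
    {(("method", "GET"), ("status", "200")): 1027,
     (("method", "POST"), ("status", "500")): 3},
)
print(page)
```

Swapping the backend that scrapes this output requires no change to the service emitting it, which is exactly the lock-in escape hatch the section argues for.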

Customer Support

One of the pain points of using open-source solutions is that you have to rely on the community for answers. This is good, but it can be pretty bad when things go south. Large companies can’t afford to depend on a community to debug a problem.

A vendor solution should have a clear escalation matrix and a ‘time taken to respond’ framework to bring accountability. Late support means lost revenue from customers and defeats the very purpose of having a robust monitoring solution.

Tooling fatigue

A monitoring solution should be interoperable and match your organization's needs, keeping the essentials under one roof. For example, a TSDB should ship with an alerting solution so you don’t have to depend on third-party tools for alerting.

The more you keep under one roof, the easier it is to bring in accountability, reduce knowledge-transfer time, and push for the customized features your engineering teams need.

Excessive third-party integrations only accelerate tooling fatigue and eventually add to costs.

Are there other factors I may have missed in choosing a monitoring tool? Please do let me know.
