A framework for monitoring

Bharat Kalluri / 2023-07-16

I’ve written before about why I think this exercise is so important, in another post titled "Frameworks over tools".

This is an attempt to build a framework for monitoring, so that someone setting out to solve monitoring has a starting point.

Why

Monitoring exists to make sure you don’t fly blind.

Not knowing that issues are happening is the worst place to be. But knowing about every single out-of-order activity in a system is a utopian target. The idea is to define and get the right amount of visibility into the system so that decision making and execution become straightforward.

Steps

Step one: Identify stakeholders

Every system tends to have stakeholders across departments. Identify the key ones. This matters both for gathering domain knowledge and for knowing whom to notify when something goes off track.

Step two: List what stakeholders are afraid of

This is the question I start with whenever I am presented with a monitoring problem. Every stakeholder has their own definition of the disaster scenarios they work towards avoiding.

Find the disaster scenarios and make sure the monitoring setup covers them. Rinse and repeat, module by module, service by service.

Step three: Identify & set up metrics to measure what stakeholders are scared of

At inception, not everything the stakeholders are scared of will be measurable. In that case, identify the fears that are measurable and start acting on them. Queue up projects to establish measures for the ones that are not measurable yet.

Since the stakeholders have listed what they intend to keep an eye on, make sure all the corresponding metrics are piped to a platform or tool of choice.
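One way to keep this step honest is to write the fears down as data and see which ones have a metric behind them. The snippet below is only a sketch: the `Fear` record, the stakeholder names and the metric names are all made up for illustration, and a real setup would pull these from wherever the stakeholder list lives.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Fear:
    """One disaster scenario a stakeholder wants to avoid (hypothetical record)."""
    stakeholder: str
    scenario: str
    metric: Optional[str] = None  # None means "not measurable yet"


def split_by_measurability(fears: List[Fear]) -> Tuple[List[Fear], List[Fear]]:
    """Separate fears that already have a metric from those needing a project."""
    measurable = [f for f in fears if f.metric is not None]
    backlog = [f for f in fears if f.metric is None]
    return measurable, backlog


# Illustrative entries only; none of these names come from a real system.
fears = [
    Fear("finance", "payments silently fail", "payment_failure_rate"),
    Fear("support", "customers wait hours for a reply"),
    Fear("engineering", "checkout API is slow", "checkout_p95_latency_ms"),
]

measurable, backlog = split_by_measurability(fears)
for fear in backlog:
    print(f"Queue a project to measure: {fear.scenario} ({fear.stakeholder})")
```

The `measurable` list is what gets piped to the monitoring platform today, and the `backlog` list is the set of projects to queue up.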

Step four: Identify thresholds

Collaborate with stakeholders to establish clear thresholds at which alerts should trigger. Minimize false positives in the alerting structure: ideally, every threshold breach should be an event of concern.
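A common way to cut false positives is to alert only when a breach is sustained rather than on a single spike. The sketch below assumes that approach. The `Threshold` class, the `consecutive_breaches` parameter and the sample values are hypothetical, and the right limit and window should come out of the conversation with stakeholders.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Threshold:
    """A limit agreed with stakeholders (all names here are illustrative)."""
    metric: str
    limit: float
    consecutive_breaches: int = 3  # how many samples in a row must exceed the limit

    def __post_init__(self) -> None:
        if self.consecutive_breaches < 1:
            raise ValueError("consecutive_breaches must be at least 1")


def should_alert(threshold: Threshold, samples: Sequence[float]) -> bool:
    """Return True only when the most recent samples all exceed the limit."""
    n = threshold.consecutive_breaches
    if len(samples) < n:
        return False  # not enough data to call it a sustained breach
    return all(value > threshold.limit for value in samples[-n:])


failure_rate = Threshold(metric="payment_failure_rate", limit=0.05)

one_spike = [0.01, 0.09, 0.02]
sustained = [0.02, 0.06, 0.07, 0.08]

print(should_alert(failure_rate, one_spike))
print(should_alert(failure_rate, sustained))
```

With these made-up samples, the single spike does not trigger an alert while the sustained breach does.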

Step five: Set up actionable alerts

Dashboards fill people with a false sense of control & visibility. The point of monitoring is not to look at it, but to make sure action items can be inferred from it. Dashboards are good investigative tools to begin with, but by themselves they do not solve anything.

When someone asks for a dashboard, it’s because they want a lay of the land to understand what they should be alerted on. Sit with them, understand what they want alerts for, and set up the alerts first.

Eventually every dashboard’s view count converges towards zero.

Alerts are exceptions, and exceptions are exceptional, meaning they are not normal. If something is not normal, it needs looking into. If an alert has no action item, it is not an alert.

Once an alert is actionable, it should have a corresponding life cycle in which it is notified, allocated and resolved.
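The life cycle can be modelled as a tiny state machine. The sketch below is one possible shape, not a prescription: the `Alert` class, its field names and the example values are assumptions made for illustration. It refuses to create an alert without an action item, and it only allows the notified, allocated, resolved order.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AlertState(Enum):
    NOTIFIED = "notified"
    ALLOCATED = "allocated"
    RESOLVED = "resolved"


@dataclass
class Alert:
    """Hypothetical alert record that enforces an action item and a life cycle."""
    title: str
    action_item: str
    state: AlertState = AlertState.NOTIFIED
    owner: Optional[str] = None

    def __post_init__(self) -> None:
        # An alert with no action item is not an alert.
        if not self.action_item.strip():
            raise ValueError("an alert needs an action item")

    def allocate(self, owner: str) -> None:
        """Hand the alert to someone; only valid right after notification."""
        if self.state is not AlertState.NOTIFIED:
            raise ValueError(f"cannot allocate an alert in state {self.state.value}")
        self.owner = owner
        self.state = AlertState.ALLOCATED

    def resolve(self) -> None:
        """Close the alert; only valid once someone owns it."""
        if self.state is not AlertState.ALLOCATED:
            raise ValueError(f"cannot resolve an alert in state {self.state.value}")
        self.state = AlertState.RESOLVED


alert = Alert(
    title="Payment failure rate above limit",
    action_item="Check the payment gateway status and fail over if it is down",
)
alert.allocate("on-call engineer")
alert.resolve()
print(alert.state.value)
```

Skipping a step, such as resolving an alert nobody owns, raises an error, which keeps unowned alerts from quietly disappearing.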


For every new system built, this framework can be followed to set up a strong monitoring system.
