A framework for monitoring
Bharat Kalluri / 2023-07-16
I’ve written before about why I think this exercise is so important, in another post titled “Frameworks over tools”.
This is an attempt to build a framework for monitoring, so that someone starting out on a monitoring problem has a starting point.
Why
Monitoring is to make sure you don’t fly blind.
Not knowing that issues are happening is the worst place to be in. But knowing about every single out-of-order activity occurring in a system is a utopian target. The idea is to define & get the right amount of visibility into the system so that decision making and execution become straightforward.
Steps
Step one: Identify stakeholders
Every system tends to have stakeholders across departments. Identify the key stakeholders. This is important both for gathering domain knowledge and for notifying them if something goes off track.
Step two: List out what stakeholders are afraid of
This is the question I start with whenever I am presented with a monitoring problem. Every stakeholder has their own definition of the disaster scenarios they work towards avoiding.
Find the disaster scenarios and make sure the monitoring system is set up to cover them. Rinse and repeat, module by module, service by service.
Step three: Identify & set up metrics to measure what stakeholders are scared of
At inception, not everything the stakeholders are scared of may be measurable. In that case, identify the ones that are and start acting on them. Queue up projects to establish measures for the ones that are not measurable as of now.
Since the stakeholders have listed out what they intend on keeping an eye on, make sure all the corresponding metrics are piped to a platform or tool of choice.
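As a rough illustration of steps two and three, the list of stakeholder fears can be kept as plain data, with each fear marked as measurable today or not. This is only a sketch: the `Fear` class, its field names, the metric names and the example entries are all hypothetical, and nothing here is tied to a particular monitoring tool.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Fear:
    stakeholder: str        # who is worried
    scenario: str           # the disaster scenario they want to avoid
    metric: Optional[str]   # metric that measures it, or None if not measurable yet


def split_by_measurability(fears: List[Fear]) -> Tuple[List[Fear], List[Fear]]:
    """Return (fears that can be tracked today, fears that need a project first)."""
    measurable = [f for f in fears if f.metric is not None]
    backlog = [f for f in fears if f.metric is None]
    return measurable, backlog


# Hypothetical entries gathered from stakeholder conversations
fears = [
    Fear("finance", "payments fail silently", "payment_failure_rate"),
    Fear("support", "customers wait too long for a reply", None),
    Fear("engineering", "API becomes slow", "api_p95_latency_ms"),
]

measurable, backlog = split_by_measurability(fears)

# Metrics to pipe to the platform or tool of choice
metrics_to_pipe = [f.metric for f in measurable]
```

Here `metrics_to_pipe` holds what can be wired up right away, while `backlog` is the list of fears that still need a project to make them measurable.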
Step four: Identify thresholds
Collaborate with stakeholders to establish clear thresholds at which alerts need to be triggered. Make sure false positives are minimized in the alerting structure. Ideally, every threshold breach should be an event of concern.
Step five: Set up actionable alerts
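One common way to cut down on false positives is to alert only when a threshold is breached several readings in a row, rather than on a single spike. The sketch below assumes that approach. The function name `should_alert`, the `consecutive` parameter and the sample values are illustrative choices, not something prescribed by this framework.

```python
from typing import Iterable


def should_alert(values: Iterable[float], threshold: float, consecutive: int = 3) -> bool:
    """Return True if `consecutive` readings in a row exceed `threshold`."""
    streak = 0
    for value in values:
        if value > threshold:
            streak += 1
            if streak >= consecutive:
                return True
        else:
            streak = 0  # a healthy reading resets the streak
    return False


# A single spike does not trigger an alert
spike = should_alert([120, 950, 130, 110], threshold=500)

# A sustained breach does
sustained = should_alert([120, 950, 980, 990], threshold=500)
```

The right values for `threshold` and `consecutive` should come out of the conversation with stakeholders, not from the code.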
Dashboards fill people with a false sense of control & visibility. The idea of monitoring is not to look at it, but to make sure actionable items can be inferred from it. Dashboards are good investigative tools to begin with, but inherently dashboards do not solve anything.
When someone asks for a dashboard, it’s because they want a lay of the land to understand what they should get alerted on. Sit with them, understand what they want alerts for, and set up alerts first.
Eventually every dashboard’s view count converges towards zero.
Alerts are exceptions, and exceptions are exceptional, meaning they are not normal. If something is not normal, it needs looking into. If there is an alert with no action item, it is not an alert.
Once it’s actionable, there should be a corresponding life cycle for the alert, in which it is notified, allocated and resolved.
For every new system built, this framework can be followed to set up a strong monitoring system.
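The notified, allocated, resolved life cycle can be pictured as a tiny state machine. What follows is a minimal sketch under my own assumptions: the `Alert` class, the `AlertState` names and the example alert are hypothetical, and a real setup would usually lean on the life cycle its alerting or incident tool already provides.

```python
from enum import Enum
from typing import Optional


class AlertState(Enum):
    NOTIFIED = "notified"
    ALLOCATED = "allocated"
    RESOLVED = "resolved"


# Allowed forward transitions in the life cycle
NEXT_STATE = {
    AlertState.NOTIFIED: AlertState.ALLOCATED,
    AlertState.ALLOCATED: AlertState.RESOLVED,
}


class Alert:
    def __init__(self, name: str, action_item: str) -> None:
        if not action_item:
            # An alert with no action item is not an alert
            raise ValueError("an alert needs an action item")
        self.name = name
        self.action_item = action_item
        self.owner: Optional[str] = None
        self.state = AlertState.NOTIFIED

    def allocate(self, owner: str) -> None:
        self._advance(AlertState.ALLOCATED)
        self.owner = owner

    def resolve(self) -> None:
        self._advance(AlertState.RESOLVED)

    def _advance(self, target: AlertState) -> None:
        # Only allow the next step in the life cycle, no skipping
        if NEXT_STATE.get(self.state) is not target:
            raise ValueError(f"cannot go from {self.state.value} to {target.value}")
        self.state = target


# Hypothetical alert walking through its life cycle
alert = Alert("payment_failure_rate high", "check payment gateway status")
alert.allocate("on-call engineer")
alert.resolve()
```

Refusing to create an alert without an action item, and refusing to resolve one that nobody owns, keeps the code in line with the idea that every alert must lead to someone doing something.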