The ReportMagic [LogicMonitor.PercentageAvailability: ] macro
Introduction
The [LogicMonitor.PercentageAvailability: ] macro was the original reason that ReportMagic was created back in 2013. Since then, it has stood the test of time and is still at the heart of many of our Managed Service Provider reports. Its results may not match the Alert filter provided in the LogicMonitor UI, which only filters using the alert start time.
But what is it, how does it work and why must it be so complex?!
Definitions
To calculate "availability", first we must define a few terms. To do so, let's review the diagram below, showing a number of LogicMonitor alerts having been active over August, September and October.
- Reporting Period
- The reporting period in the diagram above is "September UTC". Why UTC? Well, strictly the timezone itself doesn't matter, though by default all ReportMagic date/times are calculated in UTC. The important thing is that Daylight Savings should never factor into your calculations. Should an alert occur over the spring DST changeover between midnight and 4 am, is it active for 3, 4 or 5 hours? This is very important for your SLA reporting... Daylight Savings may have its advocates, but we're certainly not among them! Fortunately, it appears that the European Union are going to move away from Daylight Savings in the near future.
- Report Execution Time
- It should not matter when the report is executed; the results should be the same (so far as possible). Note that between Report 1 and Report 2 being generated, Device 6 was removed from the LogicMonitor portal. ReportMagic has no record of this device, so it cannot be reported on. This is why you should not remove unused devices from your LogicMonitor portal until you no longer have any reporting requirements for them. Note also that reports can take minutes or even hours to generate.
- Event
- ITIL defines an Event as a measurable occurrence, as distinct from a measured occurrence. If it was unobserved, did it happen?! For example, if the CPU Usage % on Device 4 hits 100 at 13:00, the next measurement may not be made until 13:02. This means that caution should be taken over availability reporting precision. We recommend that no more than 3 significant figures be used when reporting availability percentages. Additionally, if it wasn't measured, no, we can't help you retrospectively - please stop asking us! For this reason, be careful what you claim to your customers about things you are not monitoring.
- Alert
- Once LogicMonitor measures the start of a problematic Event, it creates an "Alert". Note that it is these Alerts (as distinct from the actual Events) that ReportMagic uses to calculate availability. For any given portal, this can mean hundreds of thousands of Alerts for a given reporting period. In order to efficiently process this data, ReportMagic can cache all the alert data. We strongly recommend that any ReportMagic customer with more than 100 devices considers spending 1 minute setting up this feature in ReportMagic (or you can ask us to do so). This can bring [LogicMonitor.PercentageAvailability: ] calculations down from hours to mere seconds.
- Alert Severity
- This is a very important concept - CPU may be high, but not so high that an "outage" is declared. 80% may be fine, but 100% may be disastrous. For this reason, LogicMonitor supports multiple alert levels. It is considered good practice to note Warning alerts, but not necessarily act on them, particularly if there are different Devices with active Critical and Error alerts. For this reason, by default, ReportMagic ignores Warnings. This can be overridden. For example, consider Device 2. What availability does this Device have during October UTC? By default, ReportMagic would consider this Device to have 50% availability. If configured to consider the Warning level, that figure would be close to 90%.
- Non-availability
- When calculating Availability, ReportMagic is mainly concerned with Non-Availability, i.e. periods when alerts are active. Consider Device 1. No alerts, so no non-availability, so 100% availability. The same is true for Device 6. There is an alert, but it is outside of the reporting period.
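To make the definitions above concrete, here is a minimal Python sketch of how an alert might be reduced to the part that counts towards non-availability. It is an illustration only, not ReportMagic's implementation: the `Alert` type, the `SEVERITY_RANK` table, the `clip_to_period` function and all of the dates are hypothetical, and the default threshold of "error" simply mirrors the statement that Warnings are ignored by default.

```python
from datetime import datetime, timezone
from typing import NamedTuple, Optional, Tuple

# Hypothetical ranking of alert levels; higher means more severe.
SEVERITY_RANK = {"warning": 1, "error": 2, "critical": 3}


class Alert(NamedTuple):
    severity: str
    start: datetime
    end: Optional[datetime]  # None means the alert is still active


def clip_to_period(
    alert: Alert,
    period_start: datetime,
    period_end: datetime,
    min_severity: str = "error",
) -> Optional[Tuple[datetime, datetime]]:
    """Return the part of an alert that counts towards non-availability.

    Returns None when the alert is below min_severity or lies entirely
    outside the reporting period.
    """
    if SEVERITY_RANK[alert.severity] < SEVERITY_RANK[min_severity]:
        return None
    start = max(alert.start, period_start)
    end = period_end if alert.end is None else min(alert.end, period_end)
    return (start, end) if start < end else None


def utc(year: int, month: int, day: int) -> datetime:
    """Build a timezone-aware UTC datetime at midnight."""
    return datetime(year, month, day, tzinfo=timezone.utc)


# Illustrative reporting period: "September UTC" (the year is arbitrary).
SEPTEMBER = (utc(2023, 9, 1), utc(2023, 10, 1))

spans_august = Alert("error", utc(2023, 8, 20), utc(2023, 9, 3))
october_only = Alert("critical", utc(2023, 10, 5), utc(2023, 10, 6))
warning_only = Alert("warning", utc(2023, 9, 10), utc(2023, 9, 12))

print(clip_to_period(spans_august, *SEPTEMBER))  # clipped to start on 1 September
print(clip_to_period(october_only, *SEPTEMBER))  # None: outside the period
print(clip_to_period(warning_only, *SEPTEMBER))  # None: Warnings ignored by default
print(clip_to_period(warning_only, *SEPTEMBER, min_severity="warning"))
```

Working in timezone-aware UTC datetimes throughout keeps Daylight Savings out of the arithmetic, which is the point made under Reporting Period.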
How does it work?
Consider Device 4. For the whole of the reporting period, there were two active alerts at the Warning level and above. These alerts were simultaneously active, but that doesn't give us 200% non-availability - we only have 100% Non-Availability (and therefore 0% availability). ReportMagic uses a Coverage Engine to make these calculations.
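The reason overlapping alerts do not double-count can be shown with a generic interval-union calculation. The sketch below is not ReportMagic's Coverage Engine; the function names are hypothetical, and for simplicity times are plain seconds measured from the start of the reporting period, with every interval assumed to be already clipped to that period.

```python
DAY = 86400  # seconds in one UTC day


def covered_seconds(intervals):
    """Total length of the union of (start, end) intervals, in seconds."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Gap before this interval: bank the previous merged run.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlaps (or touches) the current run: extend it.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def percentage_availability(intervals, period_start, period_end):
    """Availability as a percentage of the reporting period."""
    period = period_end - period_start
    return 100.0 * (1 - covered_seconds(intervals) / period)


# Two alerts active for the whole of a 30-day period: 0% availability, not -100%.
print(percentage_availability([(0, 30 * DAY), (0, 30 * DAY)], 0, 30 * DAY))

# Two partially overlapping alerts cover 15 of 30 days between them.
print(percentage_availability([(0, 10 * DAY), (5 * DAY, 15 * DAY)], 0, 30 * DAY))
```

Sorting the intervals first means each one only needs to be compared with the run currently being merged, so the calculation stays cheap even with many alerts.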
Further, when considering multiple devices, you may wish to provide an aggregate view. We use the aggregation parameter to aggregate this data over all devices, at the device level, or at the instance level. If you need more information about how our aggregation works, please contact us - we are always happy to talk!