We use Uptime.com for our service reliability dashboard. The monitoring status page is available at https://status.developer.gov.bc.ca/
This is high-level monitoring that aims to give the community a sense of our service reliability. Monitored services include the DevOps OpenShift 4 Platform clusters and the shared services (also known as Next Gen Security tools). Uptime.com tracks the history of uptime and outages for each monitored service. Announcements will be posted for planned downtime and maintenance of each major service, along with updates during outages.
The monitoring combines the built-in Uptime.com monitoring functionality with more sophisticated custom metrics and checks that the Platform Services Team has added on top of it.
Here is a list of the services whose status the Platform Services Team monitors:
| Service | What is checked | Interval | Alert channels |
|---|---|---|---|
| Gold / GoldDR / Silver Cluster | readyz & Cerberus** | 1 min | RC / MSTeams / SMS / Email |
| Klab / Clab / ARO Cluster | readyz & Cerberus** | 1 min | RC |
| RocketChat | service URL | 1 min | RC / MSTeams |
| Artifactory | service API ping endpoint | 1 min | RC |
| Registry App | service API ehlo endpoint | 1 min | RC |
| DevHub | service URL | 1 min | RC |
| Vault | service health endpoint | 1 min | RC |
We use 1-minute intervals (the shortest available from Uptime.com) to ping the availability endpoint set up for each service. Occasionally, when a service is extremely busy, the response may time out and 1 minute of downtime is recorded. We consider this small error preferable to pinging less frequently (e.g. every 5 minutes) and recording a 5-minute outage window when a response is lost to network issues between Uptime.com and the BC Gov network.
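The trade-off can be made concrete with a little arithmetic. The sketch below assumes, as the paragraph describes, that one failed probe marks a whole check interval as down. The 30-day reporting window and the function name are hypothetical and are not part of Uptime.com.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # hypothetical reporting window


def uptime_after_false_failures(interval_min, false_failures,
                                window_min=MINUTES_PER_30_DAYS):
    """Uptime percentage if each falsely failed probe marks one full interval as down."""
    recorded_down = interval_min * false_failures
    if recorded_down > window_min:
        raise ValueError("recorded downtime exceeds the reporting window")
    return 100.0 * (window_min - recorded_down) / window_min


# Same number of lost responses, different check intervals.
one_min = uptime_after_false_failures(interval_min=1, false_failures=3)
five_min = uptime_after_false_failures(interval_min=5, false_failures=3)
print(f"1-minute interval: {one_min:.4f}%")
print(f"5-minute interval: {five_min:.4f}%")
```

Under these assumptions, the same number of lost responses costs five times as much recorded downtime at a 5-minute interval as at a 1-minute interval.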
To reduce the false positives that can occur with high-frequency pings, the Platform Services Team only receives alerts when a service is down for 5 consecutive attempts. In addition to Uptime.com, the team uses a suite of monitoring tools such as Sysdig and Nagios, which allow us to detect issues early and narrow a problem down to a specific service or component.
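The consecutive-failure rule can be sketched as a small debounce over recent probe results. This only illustrates the idea and is not how Uptime.com implements it. The class name is hypothetical, and the threshold of 5 comes from the paragraph above.

```python
from collections import deque


class ConsecutiveFailureAlert:
    """Signal an alert only after `threshold` failed probes in a row."""

    def __init__(self, threshold=5):
        self.threshold = threshold
        # Keep only the most recent `threshold` probe results.
        self.recent = deque(maxlen=threshold)

    def record(self, probe_ok):
        """Store one probe result; return True if an alert should be raised."""
        self.recent.append(probe_ok)
        window_full = len(self.recent) == self.threshold
        return window_full and not any(self.recent)


alert = ConsecutiveFailureAlert(threshold=5)
# A single timeout between successful probes never raises an alert.
blip = [alert.record(ok) for ok in (True, False, True, True, True)]
```

A single successful probe resets the streak, so an isolated timeout is still recorded as downtime but does not page anyone.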
** Cerberus is a Red Hat-suggested monitoring tool for the general health of an OCP cluster. For more details, please refer to the Cerberus repo.
When a service has been down for more than 5 minutes and the outage is verified by 3 monitoring locations, the pre-configured alerts are fired. Here are the major types of alerts we currently use:
1. RocketChat and MSTeams:
- a webhook is set up in a notification channel
- the service lead is tagged in the message
- MSTeams is used as a backup when RC is affected by a cluster-wide issue
- a text message is sent to the service lead for immediate response
- for cluster downtime alerts
- Note: the team should create a custom Uptime.com announcement after a cluster issue is resolved. See next section for details.
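The firing condition and the chat notification can be sketched as follows. This is a minimal illustration, not the actual Uptime.com configuration. The function names, the location names, the `@name` tagging format, and the payload shape (a JSON object with a `text` field) are all assumptions.

```python
import json


def alert_should_fire(down_minutes, failing_locations,
                      min_minutes=5, min_locations=3):
    """True when the outage lasts more than `min_minutes` and is seen
    from at least `min_locations` distinct monitoring locations."""
    distinct_locations = set(failing_locations)
    return down_minutes > min_minutes and len(distinct_locations) >= min_locations


def build_webhook_payload(service, service_lead, down_minutes):
    """Build a chat message that tags the service lead (assumed payload shape)."""
    text = f"@{service_lead} {service} has been down for {down_minutes} minutes"
    return {"text": text}


# Hypothetical monitoring locations and service lead, for illustration only.
locations = ["location-a", "location-b", "location-c"]
if alert_should_fire(down_minutes=6, failing_locations=locations):
    body = json.dumps(build_webhook_payload("Vault", "service.lead", 6))
    print(body)
```

Requiring several distinct locations filters out failures caused by a network problem at a single probe site rather than by the service itself.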
You will see several types of announcements from Uptime.com for past and future incidents:
1. Automatic Announcement:
- whenever Uptime.com detects service downtime, it automatically generates an incident with the timestamp and duration
- not much detail is included in these incidents
2. Custom Announcements:
- custom announcements are posted manually by the Platform Services Team following a service outage, explaining the root cause and impact of the disruption as well as what was done to troubleshoot and restore the service
- all the clusters and services that are impacted will be listed in the announcement
3. Maintenance Window:
- when maintenance is scheduled, details are provided in a Maintenance Window message, and the service's uptime statistics can be excluded for that period, depending on the nature of the maintenance event
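To illustrate how excluding a maintenance window changes an uptime statistic, here is a minimal sketch. It is not Uptime.com's actual calculation. It assumes time is counted in whole minutes from the start of a reporting period and that windows are half-open `(start, end)` ranges. All names are hypothetical.

```python
def uptime_percent(total_minutes, down_minutes, maintenance=(),
                   exclude_maintenance=True):
    """Uptime over a period, optionally ignoring minutes inside maintenance windows.

    down_minutes: minute indices (0 <= m < total_minutes) recorded as down.
    maintenance:  iterable of half-open (start, end) minute ranges.
    """
    excluded = set()
    if exclude_maintenance:
        for start, end in maintenance:
            # Clamp each window to the reporting period.
            excluded.update(range(max(start, 0), min(end, total_minutes)))
    counted = total_minutes - len(excluded)
    if counted <= 0:
        raise ValueError("no minutes left to measure")
    down = len(set(down_minutes) - excluded)
    return 100.0 * (counted - down) / counted


# Minutes 10 and 11 fall inside the maintenance window 10..19; minute 50 does not.
with_exclusion = uptime_percent(100, [10, 11, 50], maintenance=[(10, 20)])
without_exclusion = uptime_percent(100, [10, 11, 50], maintenance=[(10, 20)],
                                   exclude_maintenance=False)
```

With the window excluded, only the outage at minute 50 counts against the service, and it is measured against the 90 remaining minutes.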
We use the Uptime.com API to manage the monitors and alerts.
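As a rough sketch of what managing monitors through the API could look like, the snippet below builds (but does not send) a request that lists the configured checks. The base URL, the `/checks/` path, and the `Token` authorization scheme are assumptions that should be verified against the Uptime.com API documentation. The function name and token value are hypothetical.

```python
import urllib.request

# Assumed base URL; confirm against the Uptime.com API documentation.
API_BASE = "https://uptime.com/api/v1"


def build_list_checks_request(api_token):
    """Build (without sending) a GET request for the configured checks."""
    return urllib.request.Request(
        f"{API_BASE}/checks/",  # assumed endpoint path
        headers={
            "Authorization": f"Token {api_token}",  # assumed auth scheme
            "Accept": "application/json",
        },
        method="GET",
    )


request = build_list_checks_request("example-token")  # placeholder token
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it. Keeping construction separate from sending makes the request easy to inspect and test without network access or a real token.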