Lifecycle:Maturing

OpenShift 4 Platform Services Reliability Dashboard

We use Uptime.com for our service reliability dashboard. The monitoring status page can be found here: https://status.developer.gov.bc.ca/

This is high-level monitoring that aims to give the community a sense of our service reliability. Monitored services include the DevOps OpenShift 4 Platform Clusters and the shared services (also known as Next Gen Security tools). Uptime.com tracks the history of uptime and outages for each monitored service. Announcements will be posted for planned downtime and maintenance of each major service, along with updates during outages.

The monitoring relies on a combination of the built-in Uptime.com monitoring functionality as well as more sophisticated custom metrics and checks that the Platform Services Team has added on top of it.

Monitors and Alerts

Here is a list of the service status monitors that the Platform Services Team provides:

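Monitor                        | Endpoint                  | Monitoring Interval | Alerts
-------------------------------|---------------------------|---------------------|---------------------------
Gold / GoldDR / Silver Cluster | readyz & Cerberus**       | 1 min               | RC / MSTeams / SMS / Email
Klab / Clab / ARO Cluster      | readyz & Cerberus**       | 1 min               | RC
RocketChat                     | service URL               | 1 min               | RC / MSTeams
Artifactory                    | service API ping endpoint | 1 min               | RC
Registry App                   | service API ehlo endpoint | 1 min               | RC
DevHub                         | service URL               | 1 min               | RC
Vault                          | service health endpoint   | 1 min               | RC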

We use 1-minute intervals (the shortest available from Uptime.com) to ping the availability endpoints set up for each service. Occasionally, when a service is extremely busy, a response may time out and 1 minute of downtime is recorded. However, we feel this small error is better than pinging less often (e.g. every 5 minutes) and recording a 5-minute outage window when a response is lost to network issues between Uptime.com and the BC Gov network.

To address the false positives that can occur with high-frequency pings, the Platform Services Team only receives alerts when a service is down for 5 consecutive attempts. In addition to Uptime.com, the team uses a suite of monitoring tools, such as Sysdig and Nagios, which allows us to detect issues early and narrow a problem down to a specific service or component.
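The consecutive-failure rule above can be sketched as a small counter. This is only an illustration of the idea, not the team's or Uptime.com's implementation; the class and function names are hypothetical, and the threshold of 5 is taken from the paragraph above.

```python
class ConsecutiveFailureAlert:
    """Alert only after `threshold` failed checks in a row (hypothetical helper)."""

    def __init__(self, threshold=5):
        self.threshold = threshold
        self.failures = 0

    def record(self, check_ok):
        """Record one check result; return True when the alert should fire."""
        if check_ok:
            self.failures = 0  # any successful check resets the streak
            return False
        self.failures += 1
        # Fire once, on the check that reaches the threshold.
        return self.failures == self.threshold


def alerts_for(results, threshold=5):
    """Return the indexes of the checks at which an alert would fire."""
    tracker = ConsecutiveFailureAlert(threshold)
    return [i for i, ok in enumerate(results) if tracker.record(ok)]
```

With 1-minute checks, a single timeout such as `[True, False, True]` produces no alert, while a run of five or more failed checks produces exactly one.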

** Cerberus is a Red Hat-suggested monitoring tool for the general health of OCP clusters. For more details, please refer to the Cerberus Repo.

Alert Integrations

When a service is down for more than 5 minutes and the outage is verified by 3 monitoring locations, the pre-configured alerts are fired. Here are the major types of alerts we currently use:

1. RocketChat and MSTeams:

  • a webhook is set up in a notification channel
  • the service lead is tagged in the message
  • MSTeams is used as a backup when RC is affected by a cluster-wide issue

2. SMS:

  • a text message is sent to the service lead for immediate response

3. Email:

  • for cluster downtime alerts
  • Note: the team should create a custom Uptime.com announcement after the cluster issue is resolved. See the next section for details.
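As a rough sketch of how the firing condition and the per-monitor channels fit together, the snippet below combines the rule stated in this section with the Alerts column of the monitor table. The function names are hypothetical, and reading "more than 5 minutes" as a strict comparison and "verified by 3 monitoring locations" as "at least 3" are assumptions.

```python
# Alert channels per monitor, copied from the monitor table in this document.
ALERT_CHANNELS = {
    "Gold / GoldDR / Silver Cluster": ["RC", "MSTeams", "SMS", "Email"],
    "Klab / Clab / ARO Cluster": ["RC"],
    "RocketChat": ["RC", "MSTeams"],
    "Artifactory": ["RC"],
    "Registry App": ["RC"],
    "DevHub": ["RC"],
    "Vault": ["RC"],
}


def should_fire(down_minutes, failing_locations, min_minutes=5, min_locations=3):
    """True when the outage is long enough and confirmed by enough locations."""
    return down_minutes > min_minutes and failing_locations >= min_locations


def channels_to_notify(monitor, down_minutes, failing_locations):
    """Return the channels to alert for a monitor, or [] if no alert is due."""
    if not should_fire(down_minutes, failing_locations):
        return []
    return ALERT_CHANNELS.get(monitor, [])
```

For example, a 6-minute RocketChat outage seen from 3 locations would notify both RC and MSTeams, while the same outage seen from only 2 locations would notify no one.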

History & Incidents

There are different types of announcements for past and future incidents you will see from Uptime.com:

1. Automatic Announcement:

  • whenever Uptime.com detects service downtime, it automatically generates an incident with the timestamp and duration
  • not much detail is included in these

2. Custom Announcements:

  • custom announcements are posted manually by the Platform Services Team after a service outage; they explain the root cause, the impact of the disruption, and what was done to troubleshoot and restore the service
  • all the clusters and services that are impacted will be listed in the announcement

3. Maintenance Window:

  • when there is scheduled maintenance, details are provided in a Maintenance Window message, and the service's uptime statistics can be excluded for that period, depending on the nature of the maintenance event
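To illustrate what excluding a maintenance window does to the uptime statistic, here is a minimal sketch. It assumes one check result per minute and is not how Uptime.com computes its figures; the function name and data layout are hypothetical.

```python
def uptime_percent(checks, maintenance_minutes=frozenset()):
    """Uptime percentage over 1-minute checks, skipping maintenance minutes.

    checks: iterable of (minute_index, is_up) pairs, one per check.
    maintenance_minutes: minute indexes excluded from the statistic.
    Returns None when no checks remain after the exclusion.
    """
    counted = [is_up for minute, is_up in checks
               if minute not in maintenance_minutes]
    if not counted:
        return None
    # True counts as 1 and False as 0 when summed.
    return 100.0 * sum(counted) / len(counted)
```

With 100 checks of which 10 fail, the result is 90.0; if those same 10 minutes fall inside an excluded maintenance window, the result is 100.0.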

Monitoring Config as Code

We are using the Uptime.com API endpoint to manage the monitors and alerts.
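The source does not describe how the configuration is stored or which API calls are made, so the following is only a sketch of the general config-as-code idea: compare the monitors declared in a repository with the monitors that currently exist, and plan the changes. The function name and the settings fields are hypothetical, and no Uptime.com API call is made here.

```python
def plan_changes(desired, existing):
    """Compare two {monitor name: settings dict} maps and plan the changes.

    desired:  monitors declared in the config repository (assumed layout).
    existing: monitors currently configured in the monitoring service.
    """
    to_create = sorted(name for name in desired if name not in existing)
    to_delete = sorted(name for name in existing if name not in desired)
    to_update = sorted(
        name for name in desired
        if name in existing and desired[name] != existing[name]
    )
    return {"create": to_create, "update": to_update, "delete": to_delete}


# Example with hypothetical settings fields.
desired = {
    "Vault": {"interval_min": 1, "alerts": ["RC"]},
    "DevHub": {"interval_min": 1, "alerts": ["RC"]},
}
existing = {
    "Vault": {"interval_min": 5, "alerts": ["RC"]},
    "Old Service": {"interval_min": 1, "alerts": ["RC"]},
}
plan = plan_changes(desired, existing)
```

Each entry in the resulting plan would then be applied through the API, so that the repository remains the single source of truth for monitors and alerts.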
