Post-Incident Report (PIR) for Neat Pulse Service Disruptions on February 6th and 9th, 2024

Last updated March 15, 2024

Introduction

This document serves as a detailed post-incident report (PIR) pertaining to the service degradation experienced by Pulse on the 6th and 9th of February, 2024. It aims to shed light on the underlying causes, the measures taken to address the issue promptly, and future preventative strategies. This report is intended for our external stakeholders, offering transparency on the incident’s specifics and our commitment towards preventing similar events.

Objective

The main objective of this PIR is to describe the service outage, pinpoint the root causes, delineate the immediate mitigation actions undertaken, and detail the long-term solutions put in place to prevent recurrence.

Problem Description

On the 6th of February, Pulse encountered significant service degradation due to performance limitations on our central Postgres service, leading to prolonged database access times which impacted the normal operation of the service.

Initial Problem and Response

The issue was first detected by our monitoring services at 17:11 UTC, February 6, 2024. Our team swiftly responded by analysing database activity and connection counts, while also publishing a service disruption notice on the Neat support website (see notices below). We discovered that the number of database connections had escalated to its maximum, with most connections being stuck while attempting to acquire an internal Postgres lock. This unprecedented state hindered the execution of regular queries, causing service disruption.
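As an illustration of the kind of analysis described above, the following is a minimal sketch of how connection activity could be summarised by wait event. It assumes rows have already been fetched from the standard Postgres `pg_stat_activity` view; the function names and the sample event name `SomeInternalLock` are hypothetical and do not reflect the tooling or the specific lock observed during the incident.

```python
from collections import Counter

# Query against the standard pg_stat_activity view (Postgres 10 or later).
# Running it requires a database connection, which is outside this sketch.
WAIT_EVENT_QUERY = """
SELECT state, wait_event_type, wait_event
FROM pg_stat_activity
WHERE backend_type = 'client backend'
"""


def summarise_wait_events(rows):
    """Count connections per (wait_event_type, wait_event), most common first.

    `rows` is an iterable of (state, wait_event_type, wait_event) tuples,
    as returned by WAIT_EVENT_QUERY. NULL wait events are reported as "none".
    """
    counts = Counter()
    for _state, wait_event_type, wait_event in rows:
        counts[(wait_event_type or "none", wait_event or "none")] += 1
    return counts.most_common()


def fraction_waiting(rows, wait_event_type):
    """Return the share of connections waiting on the given wait event type."""
    rows = list(rows)
    if not rows:
        return 0.0
    waiting = sum(1 for _state, event_type, _event in rows
                  if event_type == wait_event_type)
    return waiting / len(rows)
```

A summary dominated by a single lock-related wait event, together with a connection count at its configured maximum, would match the state described in this section.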

It is important to note that the disruption had no impact on the ability of Neat devices to join Zoom or Microsoft Teams calls. It only affected access to the Neat Pulse website (including provisioning of new Pulse customer orgs) and the ability to enrol and administer devices via the Pulse management portal.

The service disruption notifications can be found here:

February 6, 2024: https://support.neat.no/article/neat-pulse-control-service-update-february-6-2024/
February 9, 2024: https://support.neat.no/article/neat-pulse-control-service-update-february-9-2024/

Initial Mitigation Steps

Our initial response focused on restoring the service by restarting internal services and database servers. These steps were not successful. The mitigations that did succeed were temporarily suspending requests for a selection of organisations and disabling non-essential periodic jobs to reduce the load. Following this phase, service was restored for two-thirds of our customer organisations by approximately 02:00. Over the subsequent day, we gradually restored service to the remaining third while monitoring load, so that by the end of the day service levels were fully reinstated across all customer organisations. The investigation into, and resolution of, the fundamental cause was still ongoing at that point.
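The staged restoration described above can be sketched as a simple admission check. This is an illustrative example only: the report does not state how organisations were selected, so the hash-based selection, the function name `is_org_admitted`, and the `admitted_fraction` parameter are all assumptions.

```python
import hashlib


def is_org_admitted(org_id: str, admitted_fraction: float) -> bool:
    """Decide whether requests for an organisation are currently served.

    Each organisation is mapped to a stable position in [0, 1) by hashing
    its identifier (an assumed selection method). Raising admitted_fraction
    from 0.0 towards 1.0 restores service to more organisations without
    suspending any that were already admitted.
    """
    if not 0.0 <= admitted_fraction <= 1.0:
        raise ValueError("admitted_fraction must be between 0.0 and 1.0")
    digest = hashlib.sha256(org_id.encode("utf-8")).digest()
    position = int.from_bytes(digest[:8], "big") / 2**64  # stable, in [0, 1)
    return position < admitted_fraction
```

With `admitted_fraction` set to roughly two-thirds, about two-thirds of organisations would be served, and the remainder could be admitted gradually while load is monitored.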

Cause Analysis

Further examination revealed two primary causes of the outage:

1. A routine operation to upgrade certain internal organisations generated an unexpected load on the system due to a bug.
2. Lock contention during the establishment of new database connections, exacerbated by the simultaneous initiation of multiple connections in a setup with many database roles.

Cause Description

The root cause was pinpointed as mutex contention in the Postgres backend code when new connections are established, particularly during the iteration over database roles. Because Pulse utilises Postgres roles to implement role-based access control on a per-user basis, the system has a large number of these roles. This contention was intensified by the concurrent establishment of multiple new connections, each associated with numerous database roles.
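To make the scaling behaviour concrete, the following toy model treats connection setup as work that is fully serialised behind a single mutex, with one step per role. This is a deliberate simplification for illustration, not a description of the actual Postgres internals, and the function names and figures are hypothetical.

```python
def serialised_role_visits(concurrent_connections: int, roles_per_connection: int) -> int:
    """Total role-iteration steps performed while the mutex is held.

    Assumes every new connection iterates over its roles under one shared
    mutex, so the steps of all connections add up rather than overlap.
    """
    return concurrent_connections * roles_per_connection


def worst_case_wait_steps(concurrent_connections: int, roles_per_connection: int) -> int:
    """Steps the last connection in the queue waits before it can start."""
    return max(concurrent_connections - 1, 0) * roles_per_connection
```

Under this model, 10 simultaneous connections with 1,000 roles each perform 10,000 serialised steps, and the last connection waits for 9,000 of them. Reducing either the number of concurrent connections or the number of roles shrinks the product, which is the reasoning behind the mitigations that follow.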

Further Mitigation Steps

Upon identifying the root cause, it became evident that a strategy was required to reduce the work done while establishing new connections: this involved reducing the overall connection count, limiting the number of roles, and minimising the number of concurrent connections attempted. To this end, work carried out over the weekend entailed segregating the use of roles by distributing them across 11 different databases and incorporating a secondary database server. This adjustment not only limited the connection count by utilising an additional server but also constrained the role count per database. These measures resulted in a system architecture with significantly enhanced performance headroom.
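Two of the ideas above, distributing organisations across several databases and capping concurrent connection attempts, can be sketched as follows. This is a minimal sketch under stated assumptions: the report gives the figure of 11 databases but not the assignment scheme, so the stable-hash mapping, the `ConnectionGate` class, and its limit are hypothetical.

```python
import hashlib
import threading
from contextlib import contextmanager

NUM_DATABASES = 11  # figure taken from the report


def database_index_for_org(org_id: str, num_databases: int = NUM_DATABASES) -> int:
    """Map an organisation to one database using a stable hash (assumed scheme).

    Each database then only needs the roles of the organisations assigned
    to it, which constrains the role count per database.
    """
    digest = hashlib.sha256(org_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_databases


class ConnectionGate:
    """Caps how many connection attempts may be in progress at once."""

    def __init__(self, max_concurrent_attempts: int):
        self._semaphore = threading.BoundedSemaphore(max_concurrent_attempts)

    @contextmanager
    def attempt(self):
        # Block until a slot is free, then hold it for the duration of the
        # connection attempt made inside the `with` block.
        self._semaphore.acquire()
        try:
            yield
        finally:
            self._semaphore.release()
```

A caller would wrap its connection logic in `with gate.attempt():` so that a burst of new connections queues in the application instead of contending inside the database server.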

Ongoing and Following Work

Our team has successfully replicated the incident conditions in a development environment, confirming that reducing the number of database roles prevents recurrence. We are continuing to refine the application’s use of database roles to ensure application feature correctness and the same or better security, while reducing our use of problematic database roles.

A permanent solution is anticipated for rollout in the first quarter of 2024. In the meantime, ongoing data sharding efforts are intended to prevent issues in the interim.

Conclusion and Future Steps

This incident has underscored the importance of continuous monitoring and the adoption of proactive load-testing. We are dedicated to implementing the necessary adjustments to guarantee the reliability and performance of Pulse services. Our ongoing development work and operational strategies are aimed at preventing future occurrences, demonstrating our dedication to service excellence and customer trust.

We value the patience and support of our customers and stakeholders during this period and commit to maintaining transparent communication about our improvement efforts and service enhancements.