11/10/2023 | Notion

Diagnostic Overview

While we were migrating away from OVH, and before the new Points of Presence (POPs) were activated, our infrastructure was limited to three key nodes on UpCloud. A 24-hour network disturbance in the APAC region degraded the reliability of two of these nodes, putting service continuity at risk.

Impact Assessment

The network disturbance caused intermittent service outages on the 10th and 11th, totalling nearly 2 hours of downtime. The longest single disruption lasted up to 1 hour and 30 minutes.

Immediate Actions Taken

Our initial remedial action was to rely on a single 'master' node (sg-sin-web01), which held the most current data. However, severe latency made syncing the other nodes infeasible. At 18:00, we expedited the deployment of new POPs, which caused an additional 8 minutes of intermittent outage. Despite difficulties in establishing a stable quorum, we successfully synced additional nodes, mitigating further risk.
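To illustrate why a stable quorum was hard to reach with only three nodes, here is a minimal sketch of majority-quorum arithmetic. It assumes a simple strict-majority rule; this report does not state which clustering software or quorum rule is actually in use, and the function names below are hypothetical.

```python
# Sketch only: assumes a strict-majority quorum rule, which this report
# does not confirm. Function names are hypothetical.

def majority_quorum(total_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return total_nodes // 2 + 1


def has_quorum(total_nodes: int, healthy_nodes: int) -> bool:
    """True if the healthy nodes alone can form a majority."""
    return healthy_nodes >= majority_quorum(total_nodes)


def tolerated_failures(total_nodes: int) -> int:
    """How many nodes can be lost while a majority still remains."""
    return total_nodes - majority_quorum(total_nodes)


# Three nodes with two unreliable: one healthy node is not a majority.
three_node_quorum_ok = has_quorum(total_nodes=3, healthy_nodes=1)  # False

# Seven nodes, if all were voting members (an assumption), could lose three.
seven_node_tolerance = tolerated_failures(7)  # 3
```

Under that assumption, a three-node cluster tolerates only one failure, so two unreliable nodes are enough to prevent a majority from forming.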

Future Strategy

To enhance resiliency, we've deployed three additional POPs in various regions. These are operational but will only handle web traffic once clients have updated their systems to accommodate them. A fourth POP is currently being integrated, despite challenges with the database transfer, and is expected to be operational within 24 hours.

Even though the new POPs are not serving web traffic yet, having them deployed already reduces the risk of an outage.

Closing Remarks

This year brought considerable infrastructure adjustments and the service interruptions that came with them. However, we've now reached a new operational milestone, with redundancy across three providers and seven POPs. Our objective is to move from the current 98.8% uptime toward 99.9% as we continue to refine our services and offerings. Thank you for your patience and understanding as we strive to achieve operational excellence.
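To put the two uptime figures in perspective, the sketch below converts an uptime percentage into hours of downtime per year. It assumes a 365-day year and that uptime is measured over the full calendar year; the measurement window behind the 98.8% figure is not stated here, and the function name is hypothetical.

```python
# Sketch only: assumes a 365-day year and a full-year measurement window.

HOURS_PER_YEAR = 365 * 24  # 8760 hours


def downtime_hours_per_year(uptime_percent: float) -> float:
    """Hours of downtime per year implied by a given uptime percentage."""
    return HOURS_PER_YEAR * (100.0 - uptime_percent) / 100.0


current_downtime = round(downtime_hours_per_year(98.8), 2)  # 105.12 hours
target_downtime = round(downtime_hours_per_year(99.9), 2)   # 8.76 hours
```

On those assumptions, 98.8% uptime corresponds to roughly 105 hours of downtime per year, while 99.9% allows under 9 hours.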

We’re now in a strong position to make the future more positive. Development is picking back up: as you may have noticed in our change logs, we now have multiple new people working on development (albeit still learning our frameworks). Our infrastructure is where we want it to be: fully redundant against any type of failure outside of our control. Our internal processes have been refined and documented extensively, and I look forward to bringing you consistently positive news moving forward.