Incidents and Maintenance Actions

Hi everyone,

I will be using this topic to track the individual maintenance actions that I have to perform on this website. The truth is that I manage and maintain this website myself, since Discourse hosting is too expensive. This means I have to maintain the server and database that the Forum uses, as well as keep the Forum software up to date to avoid vulnerabilities.

I will begin a maintenance action today that will result in some expected downtime: I am migrating the Forum to a better server provider with better infrastructure. The Forum will also have dedicated resources instead of sharing them with my other platforms and projects.
This is what the community deserves and I will be posting updates after the migration.

Server migration was successful. We are now on a dedicated virtual server with adequate resources, not sharing with any other apps or projects.

We are now running on a base virtual server on DigitalOcean, a premier cloud provider for Discourse hosting.

Today we had a small outage due to a forum software update that failed pretty badly. I run these updates manually and they are usually 1-click operations, but this time it took the site down. There isn’t much to learn from this incident since it was due to a buggy forum version and, after rebuilding the software from a fresh copy of the newest version (3.1.0.beta5), we’re back online.

While we were down due to the forum software failure, I updated the server’s OS to the latest security version as well.

Outage on 2024/09/11

Nine days ago I updated the operating system of the server to a new long term support version. I made sure the update went through without issues, ran a few tests and, since everything was okay, logged off. Earlier today a friend on Facebook alerted me that the website was down, and almost 7 hours later we’re back online.

What happened?

Along with the operating system update, I turned on automatic security updates that would reboot the server if necessary for critical security updates. It turns out that the operating system update was not 100% applied and the network configuration of the server was lost. One of the automatic updates rebooted the server, which then failed to load its network stack and talk to the outside world.

Why did it happen?

To make a long story short, let’s just say that LTS (Long Term Support) versions aren’t necessarily the most stable even a couple of months after release, and I probably should have waited a bit more before moving to this new version. This is, of course, ultimately my fault :slight_smile:

How was it fixed?

By painstakingly rebuilding the network configuration of the server by hand.

Will this happen again?

I’ll do a couple of things differently to make sure I don’t make the same mistake again. Namely:

  • Major server updates will be done with a different process to start fresh every time we use a new version: this reduces the chance of old configurations crashing the server
  • I will add a liveness check to the website to notify me when it’s not available
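
For anyone curious what such a liveness check could look like, here is a minimal sketch in Python. It is not the check actually running for this forum, only an illustration: the URL `https://forum.example.com`, the function names and the `notify` placeholder are all assumptions made up for this example. The `opener` parameter exists only so the check can be exercised without touching the network.

```python
import urllib.error
import urllib.request


def is_alive(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return (ok, detail) for a single HTTP GET against url."""
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx/5xx).
        return False, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # No answer at all: DNS failure, refused connection, timeout...
        return False, f"unreachable: {exc}"
    return 200 <= status < 300, f"HTTP {status}"


def notify(message):
    # Hypothetical placeholder: swap in email, SMS or a chat webhook.
    print(message)


def run_check(url, opener=urllib.request.urlopen):
    """Check url once and send a notification if it is not available."""
    ok, detail = is_alive(url, opener=opener)
    if not ok:
        notify(f"ALERT: {url} is down ({detail})")
    return ok


# Example (hypothetical URL), e.g. scheduled every few minutes:
# run_check("https://forum.example.com")
```

A check like this only helps if it runs on a different machine than the forum itself: in this outage the server lost its network connection, so a script running on that same server could not have sent an alert either.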

Summary

Although the major server update itself succeeded, the server’s network configuration became unusable after a reboot and had to be rebuilt. Since I had just updated the server and thought everything was okay, I didn’t come here for a while, so I didn’t notice the outage until I was told about it.

I want to put everyone at ease regarding what happened in terms of security: there was no attack on the forum and no information was compromised or lost.
