rosso on 05/08/2017 at 03:02PM
Our sincerest apologies for the FMA site outage this morning, which lasted from roughly 7:00 EDT until 14:00 EDT. In the interest of increased transparency about the FMA's operations, I've decided to write this brief entry describing what happened today. We have a very small staff and I wasn't able to begin rectifying the outage until about 11:30 EDT.
Certain types of requests made to the FMA's servers are logged directly in our database. These logs grew until the hard disks on our database servers were filled to capacity. When that happened, the database servers (a master and several read-only replicas) became completely unresponsive. Since the site relies entirely on our database cluster, no pages could be rendered and no API requests could be completed--end users saw a giant error message!
What was the solution?
As soon as I was able to begin working on the problem, I put the maintenance page up and began downloading a snapshot of the logs that had filled the database servers' hard disks. This took much longer than anticipated. Once I was able to retrieve the data, I truncated the tables in question (truncating a table means deleting all of the data in it--a database table is similar to a spreadsheet). After that, I waited for the read-only replicas of our master database to catch up. It's not enough to restart the site with only the master database running--the site depends on the read-only replicas as well. I waited almost an hour for the read-only replicas to catch up, but they didn't. Due to the nature of our hosting provider, it was faster to delete the read-only replicas and create new ones. That took another several minutes. Once the replicas were rebuilt, I was able to restart our front-end servers and restore the site to normal operation.
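For developers curious about the "save a copy, then truncate" step, here is a minimal sketch of the idea. It is not the procedure that was actually run on FMA's servers: it assumes Python's built-in sqlite3 module as a stand-in database, and the function name `archive_and_clear` and the table name passed to it are hypothetical.

```python
import csv
import sqlite3


def archive_and_clear(conn, table, csv_path):
    """Write every row of `table` to a CSV file, then delete the rows.

    `table` is interpolated into the SQL, so it must be a trusted name.
    Returns the number of rows archived.
    """
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [col[0] for col in cursor.description]
    count = 0
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(columns)  # header row
        for row in cursor:
            writer.writerow(row)
            count += 1
    # SQLite has no TRUNCATE statement; an unqualified DELETE empties the table.
    conn.execute(f'DELETE FROM "{table}"')
    conn.commit()
    return count
```

The copy is written out before anything is deleted, so the log data is still available if it is needed later.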
How will we prevent this from happening again?
Logging directly to a database is definitely bad practice, but it was implemented on FMA many years ago by the original development team. For now I will keep my eyes on database disk usage and will set alerts to let me know I need to do something before the disks fill up again! Longer term, I will move all logging activity to a separate service, for example plain flat log files. Unfortunately, FMA is no stranger to outages, but whenever they happen, we try to restore service as quickly as we can and take steps to prevent similar outages from recurring.
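To make the two ideas above concrete, here is a minimal sketch, assuming a small Python script run periodically on each server. None of this is FMA's actual tooling: the 80% threshold, the function names, and the log file settings are all hypothetical. It uses only the Python standard library (`shutil.disk_usage` for the disk check and `RotatingFileHandler` for size-capped flat log files).

```python
import logging
import shutil
from logging.handlers import RotatingFileHandler


def disk_usage_percent(path="/"):
    """Return the percentage of the disk holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def should_alert(percent_used, threshold=80.0):
    """True once usage reaches the (hypothetical) alert threshold."""
    return percent_used >= threshold


def make_request_logger(log_path, max_bytes=10_000_000, backups=5):
    """Log to size-capped flat files instead of a database table.

    The file is rotated at roughly `max_bytes`, and only `backups` old
    files are kept, so the logs cannot grow without bound.
    """
    logger = logging.getLogger("request_log")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers
        handler = RotatingFileHandler(
            log_path, maxBytes=max_bytes, backupCount=backups
        )
        handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    used = disk_usage_percent("/")
    if should_alert(used):
        print(f"WARNING: disk is {used:.1f}% full")
```

Because the rotating handler caps total log size at about `max_bytes * (backups + 1)`, this style of logging has a known upper bound on disk use, which a table that grows forever does not.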
Is there anything I can do to help?
Yes! FMA operates with a tiny staff (2 people) and extremely limited resources. The best way to help is to Donate! If you are a developer and have any technical suggestions, please write to me directly at email@example.com. We greatly value input from our users and the community. We're dedicated to making the FMA the biggest and best resource for Creative Commons-licensed and other royalty-free music anywhere on the Internet.
What is this song?
One of my all-time favorite FMA tracks, and an adequate description of how it feels to finally fix a major outage.