
rosso (FMA Admin)

rosso on 05/08/2017 at 03:02PM

May 8 Outage Postmortem

Our sincerest apologies for this morning's FMA site outage, which lasted from roughly 7:00 EDT until 14:00 EDT. In the interest of increased transparency about the FMA's operations, I've written this brief entry describing what happened today. We have a very small staff, and I wasn't able to start fixing the outage until about 11:30 EDT.

What happened?

Certain types of requests made to the FMA's servers are logged directly in our database. These logs grew until the hard disks on our database servers were filled to capacity. When that happened, the database servers (a master and several read-only replicas) became completely unresponsive. Since the site relies entirely on our database cluster, no pages could be rendered and no API requests could be completed: end users saw a giant error message!

What was the solution?

As soon as I was able to start working on the problem, I put up the maintenance page and began downloading a snapshot of the logs that had filled the database servers' hard disks. This took much longer than anticipated. Once I had retrieved the data, I truncated the tables in question (truncating means deleting all of the data in a table; a database table is similar to a spreadsheet). After that, I waited for the read-only replicas of our master database to catch up, because it isn't enough to restart the site with only the master database running: the site depends on the read-only replicas as well. I waited almost an hour, but the replicas still hadn't caught up. Due to the nature of our hosting provider, it was faster to delete the read-only replicas and create new ones, which took several more minutes. Once the replicas were rebuilt, I was able to restart our front-end servers and restore the site to normal operation.
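The post doesn't name the database engine or the tables involved, so what follows is only a sketch of the general "export a snapshot, then empty the table" pattern under stated assumptions: Python's built-in sqlite3 module, a hypothetical `request_log` table, and CSV as the export format. SQLite has no TRUNCATE statement, so an unqualified DELETE stands in for it; note also that SQLite only hands the freed pages back to the filesystem after a VACUUM (unless auto_vacuum is enabled). Replica handling is not covered.

```python
import csv
import io
import sqlite3


def export_and_clear(conn: sqlite3.Connection, table: str, out) -> int:
    """Write every row of `table` to `out` as CSV, then delete all rows.

    Returns the number of rows exported. `table` must be a trusted
    identifier, because table names cannot be bound as SQL parameters.
    """
    cur = conn.execute(f"SELECT * FROM {table}")
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cur.description])  # header row
    count = 0
    for row in cur:
        writer.writerow(row)
        count += 1
    # Only empty the table after the export has fully succeeded.
    # SQLite has no TRUNCATE, so DELETE without WHERE is used instead.
    with conn:  # commits on success, rolls back on error
        conn.execute(f"DELETE FROM {table}")
    return count


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema, for illustration only.
    conn.execute("CREATE TABLE request_log (id INTEGER PRIMARY KEY, path TEXT)")
    conn.executemany("INSERT INTO request_log (path) VALUES (?)",
                     [("/music/a",), ("/music/b",)])
    conn.commit()
    buf = io.StringIO()
    print(export_and_clear(conn, "request_log", buf))  # 2
    print(conn.execute("SELECT COUNT(*) FROM request_log").fetchone()[0])  # 0
```

Exporting before deleting keeps a copy of the logs if something goes wrong, which mirrors the order of operations described above.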

How will we prevent this from happening again?

Logging directly to a database is definitely bad practice, but it was implemented on FMA many years ago by the original development team. For now, I will keep a close eye on database disk usage and set up alerts so I know to act before the disks fill up again! Longer term, I will move all logging to a separate service, such as plain flat log files. Unfortunately, FMA is no stranger to outages, but whenever they happen, we try to restore service as quickly as we can and take steps to prevent similar outages in the future.
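Both the short-term and long-term fixes can be sketched in a few lines of Python. This is an illustration, not FMA's actual setup: the 80% alert threshold, the logger name, the log path, and the 10 MB x 5-file rotation policy are all hypothetical. `shutil.disk_usage` reports usage for the filesystem containing a path, and the standard library's `RotatingFileHandler` starts a new file at `maxBytes` and keeps at most `backupCount` old files, so the logs stay bounded in size instead of growing without limit like a log table.

```python
import logging
import shutil
from logging.handlers import RotatingFileHandler

# Hypothetical threshold: warn once the disk is 80% full. The post does not
# say what level FMA actually alerts at.
ALERT_FRACTION = 0.80


def disk_usage_fraction(path: str) -> float:
    """Fraction (0.0 to 1.0) of the filesystem containing `path` in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def needs_alert(path: str, threshold: float = ALERT_FRACTION) -> bool:
    """True once usage reaches the threshold. Actually delivering the alert
    (email, pager, chat) is left out of this sketch."""
    return disk_usage_fraction(path) >= threshold


def make_request_logger(log_path: str) -> logging.Logger:
    """Return a logger that writes to size-capped flat files.

    Each file rotates at maxBytes and at most backupCount old files are
    kept, so the logs take roughly 6 x 10 MB at most. The handler is only
    attached on the first call; later calls reuse it.
    """
    logger = logging.getLogger("fma_request_log")  # hypothetical name
    if not logger.handlers:
        handler = RotatingFileHandler(log_path, maxBytes=10_000_000, backupCount=5)
        handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger


if __name__ == "__main__":
    if needs_alert("/"):
        print("Disk usage is above the alert threshold")
    request_log = make_request_logger("requests.log")  # hypothetical path
    request_log.info("GET /music/example")
```

A check like `needs_alert` would typically run on a schedule (for example from cron) against the database servers' data volume.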

Is there anything I can do to help?

Yes! FMA operates with a tiny staff (2 people) and extremely limited resources. The best way to help is to Donate! If you are a developer and have any technical suggestions, please write to me directly at [email protected]. We greatly value input from our users and the community. We're dedicated to making the FMA the biggest and best resource for Creative Commons-licensed and other royalty-free music anywhere on the Internet.

What is this song?

One of my all-time favorite FMA tracks, and an adequate description of how it feels to finally fix a major outage.


Comments

01
katya-oddio on 05/08/17 at 08:39PM
Cheers to you for saving the day!
02
rosso on 05/08/17 at 09:35PM
Thanks, Katya, for being a kick-ass curator!
03
happypuppyrecords on 05/08/17 at 10:05PM
Yes, thanks Rosso for all your efforts! I was trying to upload something yesterday, but the play arrow would never appear: the uploaded track would get close to the end and just freeze. Got frustrated trying to upload the same track 10 times, so I gave up for the time being. Hope I didn't add to the problem! :P
04
rosso on 05/08/17 at 10:29PM
There are several known issues with the uploader. You didn't do anything wrong. One long-term project I'm working on is to completely replace the upload system. Sorry for any inconvenience. We do appreciate your contributions!
05
rosso on 05/08/17 at 10:30PM
I will write another post toward the end of the week describing some known issues with the FMA and some ongoing projects. We don't want to keep anyone in the dark and Cheyenne and our overlords have asked me to be more of a presence on the site. Best regards!
06
Art Of Escapism on 05/08/17 at 11:17PM
@happypuppyrecords I am guilty of doing this too. Practicing patience. :) Thanks guys for working so hard!
07
Murmure_Intemporel on 05/09/17 at 10:40AM
Thank you for your good work!

(another issue with the uploader: tracks longer than 40 min can't be uploaded)
08
happypuppyrecords on 05/11/17 at 02:45AM
@Murmure_Intemporel
The issue might be the file size. I tried uploading something over 100 MB and it wouldn't even accept it, so I re-encoded it at a lower bitrate to get it under 100 MB, and it worked.
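As a rough check on the file-size side of this discussion: for constant-bitrate (CBR) audio, file size is approximately bitrate × duration ÷ 8. The sketch below assumes CBR encoding, kilobits of 1,000 bits and decimal megabytes, and ignores tag and cover-art overhead; the 100 MB figure is the one mentioned in the comment above, not a documented FMA limit.

```python
# Rough CBR file-size arithmetic. Assumptions: constant bitrate,
# 1 kbit = 1000 bits, 1 MB = 1,000,000 bytes, tag/cover-art overhead ignored.


def estimate_size_mb(duration_s: float, bitrate_kbps: float) -> float:
    """Approximate file size in MB for a CBR stream."""
    bits = duration_s * bitrate_kbps * 1000
    return bits / 8 / 1_000_000


def max_bitrate_kbps(duration_s: float, limit_mb: float) -> float:
    """Highest CBR bitrate that keeps the file at or under limit_mb."""
    return limit_mb * 1_000_000 * 8 / duration_s / 1000


if __name__ == "__main__":
    forty_min = 40 * 60
    long_track = 51 * 60 + 55  # 51:55
    print(estimate_size_mb(forty_min, 320))   # 96.0
    print(estimate_size_mb(long_track, 320))  # about 124.6
    print(max_bitrate_kbps(long_track, 100))  # about 256.8
```

At 320 kbps, 40 minutes comes to about 96 MB, so at that bitrate a 40-minute cutoff and a cap near 100 MB would be reached at almost the same point. The 51:55 track quoted later in the thread works out to about 124.6 MB, close to its reported size, and would need 256 kbps or less to fit under 100 MB.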
09
Murmure_Intemporel on 05/11/17 at 12:53PM
@ happypuppyrecords

Could you give me an example of a track longer than 40 minutes in your catalogue?
10
rosso on 05/11/17 at 01:43PM
@ Murmure_Intemporel - Thanks for bringing this to my attention. There is no limit on track length. I just checked the servers and uploads over 100MB should be allowed, but I will need to check the actual uploader code. There might be a file size limit set there.

Also be aware that the file may end up in our backend encoding queue, and it may take several minutes to process before the play arrow shows up. As long as the upload completes, you should be OK. If possible, can you send a screenshot of what it looks like while you're waiting for the play arrow to appear? You can email me at [email protected]
11
Murmure_Intemporel on 05/11/17 at 01:44PM
@ happypuppyrecords
Here is an example from suRRism-Phonoethics :
51:55 / 320 kbps / 124.8 MB.
It's not a problem of size.

http://freemusicarchive.org/music/Zreen_Toyz/Rouille/
12
rosso on 05/11/17 at 01:44PM
One of my ongoing projects is to replace the uploader and track encoder. That's a lot of work, but I hope to have it done in June. :)
13
Murmure_Intemporel on 05/11/17 at 01:49PM
@rosso
Thank you for your explanation.
When a file is longer than 40 minutes, the uploader drops the connection.

I simply split the file into 2 or 3 parts (I ask the artist first).
(http://freemusicarchive.org/music/SRVTR/BRQ2_MX/)
14
Murmure_Intemporel on 05/11/17 at 01:52PM
@rosso
and a big THANK YOU for your good work! :)