It's entirely my fault. I screwed up in a few ways, which is what you always see in something like this.
I'm on an overnight business trip from Boston to Dallas. My Real Job wanted me to tour a new datacenter down here.
In any event, I've been running straight, without sleep since yesterday AM when I flew out on a Red Eye, so I.. I'll admit it.. I went to bed.
I was sleeping peacefully from 1AM to 4AM, cradling my old-school, screaming-loud pager, secure in the knowledge that all my little servers would alert me if they had problems.. I have Pingdom, PagerDuty, Nagios, Munin; I've got monitoring 16 different ways..
When I woke up at 4AM (3 hours is enough. Why not?) to make my flight back to Boston, I see I've received several hundred text messages. Hrmm.. What is this?
I double-checked the work servers, but those were fine, and they have other staff to help watch them and escalate. Checking email, I see Pingdom is patiently explaining that RoboHash is down.
"But WAIT?!", I can hear you exclaim!
Why didn't your pager wake you up?
Well, it turns out that old-school pagers are.. regional. Once I'm outside of New England, it's a cool-looking retro piece of trash.
The text messages arrived, but my sleeping brain was able to peacefully ignore them in bliss.
As to what actually happened? Linode migrated my machine to a new datacenter, and rebooted it.
I had a bug in my init script, so it didn't start up properly when rebooted.
Basically, I need to create a ramdisk, then copy the code over from its stable location on the HDD. This is because a bazillion hits/second would overwhelm the disk if I loaded everything from disk each time.
Anyway, Nginx was started (hence the error), but the init script was trying to start up my Python code without creating the ramdisk first, so.. RoboFail.
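For anyone curious about the ordering problem, here is a minimal Python sketch of the idea: populate the ramdisk first, check that it is complete, and only then start the app. This is not the actual init script. The function names (`populate_ramdisk`, `ramdisk_ready`, `boot`) and the `start_app` callback are hypothetical, and the sketch assumes the ramdisk is already mounted at the target directory, since mounting itself needs root.

```python
import shutil
from pathlib import Path


def populate_ramdisk(stable_dir: Path, ramdisk_dir: Path) -> None:
    """Copy the code from its stable on-disk location into the ramdisk directory."""
    ramdisk_dir.mkdir(parents=True, exist_ok=True)
    # dirs_exist_ok (Python 3.8+) lets this run again after a partial copy.
    shutil.copytree(stable_dir, ramdisk_dir, dirs_exist_ok=True)


def ramdisk_ready(stable_dir: Path, ramdisk_dir: Path) -> bool:
    """Return True if every file under stable_dir also exists under ramdisk_dir."""
    expected = [p.relative_to(stable_dir) for p in stable_dir.rglob("*") if p.is_file()]
    return all((ramdisk_dir / rel).is_file() for rel in expected)


def boot(stable_dir: Path, ramdisk_dir: Path, start_app) -> bool:
    """Populate the ramdisk first, and only then start the application."""
    populate_ramdisk(stable_dir, ramdisk_dir)
    if not ramdisk_ready(stable_dir, ramdisk_dir):
        return False  # refuse to start against a missing or incomplete ramdisk
    start_app(ramdisk_dir)  # hypothetical callback that launches the Python app
    return True
```

The point of the explicit `ramdisk_ready` check is that a reboot with a broken copy step fails loudly at startup instead of leaving Nginx up in front of an app that never came up.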
Sorry about the problem. Imagine my embarrassment.
On the plus side, I'm now in a TSA-approved boarding area, so I have a good 20 minutes to fix the init script before my flight back. ;)
Pardon the (perhaps very silly) question, but... if you have enough RAM for the ramdisk, isn't that enough RAM to cache all the relevant files anyway? So the file open()/read() would not hit HDD itself time and time again; they'd be served (by the OS) from cache?
Of course, cache warm-up would remain a possible problem source.
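As a rough illustration of what warm-up could look like, the sketch below reads every file under a directory once, which on a typical OS pulls the contents into the page cache so that later reads are more likely to be served from RAM. The function name `warm_cache` is made up for this example, and nothing here pins the data in memory: the OS remains free to evict it under memory pressure.

```python
from pathlib import Path


def warm_cache(root: Path, chunk_size: int = 1 << 20) -> int:
    """Read every file under root once and return the total number of bytes read."""
    total = 0
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        with path.open("rb") as handle:
            # Read in chunks so large files do not need to fit in one buffer.
            while True:
                chunk = handle.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
    return total
```

Running something like this at boot would trade a slower startup for fewer cold reads on the first requests.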
I believe it's usually done (I've not had to do it) so that you can force some small set of data to always be in RAM. This way, even if something manages to leak memory, the code will always be in RAM. Whether this is really needed or not, I can't say.
In this case, it's because I don't want to assume that every dataset that will be used with this code stays in RAM, and it's simpler to handle things at the FS level by adding a new PNG to the stack than to keep track of the images in RAM. This way the code can be "simpler".
Was the linode migration at your instigation or did they decide to do that all by themselves?
(Of course you should have configured the box properly and done a test reboot to see if it would all come up well, but as you already wrote, there are multiple factors at work here, and I'm wondering if Linode is in the habit of taking down machines that are working just fine without their owners' consent.)
That may point to a DNS stuffup. Running "nslookup www.robohash.org" returns an error about not being able to find the domain, but "nslookup robohash.org" returns the IP address of the server.
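The same comparison can be scripted. This is a small sketch using Python's standard `socket.getaddrinfo`; the helper name `resolves` and the injectable `resolver` parameter are just for illustration, and the result depends on whatever DNS records exist at the time it is run.

```python
import socket


def resolves(hostname: str, resolver=socket.getaddrinfo) -> bool:
    """Return True if the hostname resolves to at least one address."""
    try:
        return len(resolver(hostname, 80)) > 0
    except socket.gaierror:
        # Raised when the name cannot be resolved (e.g. no such domain).
        return False


if __name__ == "__main__":
    for name in ("robohash.org", "www.robohash.org"):
        print(name, "resolves" if resolves(name) else "does not resolve")
```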
Why would you go to www.Robohash.org? It's only one server, it doesn't need subdomains ;)
AFAIK, that URL has never done anything.. But.. Sure!
I've set up DNS for it, and added a 301.
It'll take a bit for DNS to propagate, but it should ensure it works going forward.
I know of unicornify[1], which also provides unique pictures for hashes. However, it approaches the problem differently: instead of using existing artwork, it basically generates each unicorn procedurally; the details of how it works are here[2] and there is code available too.