Monday’s LibGuides Server Issues Post-Mortem

We waited until today to post the explanation behind yesterday’s LibGuides connectivity issue because we wanted to make sure that the fixes we deployed took care of the issue.

Here’s what happened: a couple of weeks ago our hosting facility had a big infrastructure upgrade – new routers, switches, the whole nine yards. But yesterday something went wrong with the shiny new equipment, which caused one of our critical servers to be overloaded with access requests. Pings and connections were failing or being delayed, and those delayed requests would then hit the servers all at once. We designed our servers to handle large loads and spikes in usage, but when you get a rare “super-spike” things slow down. A lot.
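To picture why delayed requests are worse than a steady stream, here is a toy model in Python. It is only a sketch under made-up assumptions (one request per tick, a network that holds everything for five ticks and then releases it in one go); the function name and all numbers are hypothetical and do not describe our actual traffic or servers.

```python
def arrivals_per_tick(total_ticks, stall_start, stall_end, rate=1):
    """Requests reaching the server on each tick.

    Toy assumption: clients send `rate` requests per tick. While the
    network is stalled (stall_start <= tick < stall_end) nothing gets
    through, and everything held back is delivered on the first tick
    after the stall ends.
    """
    delivered = []
    held = 0
    for tick in range(total_ticks):
        if stall_start <= tick < stall_end:
            held += rate          # request is delayed, not dropped
            delivered.append(0)
        else:
            delivered.append(rate + held)  # normal traffic plus the backlog
            held = 0
    return delivered


load = arrivals_per_tick(total_ticks=10, stall_start=3, stall_end=8)
print(load)       # [1, 1, 1, 0, 0, 0, 0, 0, 6, 1]
print(max(load))  # 6
```

In this toy run the server sees six times its normal load on a single tick, even though the total number of requests is unchanged. A longer stall produces a proportionally bigger burst.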

“Why don’t you just add more servers and hot-swap them?” you may ask. Well, in the case of LibGuides, adding more servers instantly is not so simple. For some critical servers (like the one we had problems with yesterday), any institution with the custom domain mapping option (most of our libraries have it) would have to update its local DNS records to point to the new servers, and it takes time (hours, sometimes even days) for those changes to propagate to all DNS servers around the world.

In any case, we worked with our hosting provider to resolve the networking issue and also made some changes to our server cluster so that it can handle far more traffic (about 10x more), which should soften the impact of any future super-spikes.

Here are a couple of important takeaways:

  • Server issues happen – internet connectivity is a complex beast – and unfortunately there are no guarantees that our infrastructure won’t have other issues unrelated to this problem. You can be sure of this, though: we monitor our infrastructure round-the-clock so we are the first to know whenever there are any issues. And whenever there are issues we spring into action immediately to remedy the situation asap.
  • Whenever you have issues accessing LibGuides, please check our Twitter page first to see if there’s a known issue (http://twitter.com/springshare). When we get thousands of support requests in the span of a couple of hours (literally, no joke), it is impossible to answer all of them quickly. If it’s a known issue, our support team is already working on it and will fix it faster if they don’t also need to respond individually to thousands of emails.

    If you see an issue posted on Twitter, we are working on it and will post regular updates as well as an “all clear”.

    If you are still having issues after things are back to normal (or no issues are mentioned on Twitter) you should send your support request and we will take a look at it asap – as we always do. 🙂

Again, we apologize for the problems accessing your trusty LibGuides yesterday. It was a networking black-swan type event (and just to reiterate, it was in no way connected to the attack on GoDaddy; please see this post). Remember, we are always doing everything we can to prevent these problems from happening in the first place.

It is worth noting that even counting yesterday’s issues, our uptime has been in the 99.99% range since we started LibGuides in 2007 – meaning it’s been down for only a handful of hours in the past 5 years.
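For the curious, the arithmetic behind “a handful of hours” can be checked with a few lines of Python. This is a rough sketch that assumes a figure of exactly 99.99% and an average year of 365.25 days; the helper name is just for illustration.

```python
HOURS_PER_YEAR = 365.25 * 24  # average year, including leap days


def downtime_hours(uptime_percent, years):
    """Hours of downtime implied by an uptime percentage over a number of years."""
    return (1 - uptime_percent / 100) * years * HOURS_PER_YEAR


print(round(downtime_hours(99.99, 5), 1))  # 4.4
```

So 99.99% uptime over 5 years works out to roughly four and a half hours of total downtime.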

Thank you, and onwards and upwards. Now we need to sign up 10x more libraries, because there’s a lot of room to grow in our infrastructure. 🙂

Best,
-Slaven & the Springshare team
