You may have noticed that arXiv was mostly unavailable for several hours on March 26, ultimately leading us to postpone the mailing for that evening. First, please accept our apology for this major service disruption; we know that many of you rely on arXiv as part of your daily workflows. Since this was our second major service outage in four months, you may be wondering about arXiv’s long-term reliability. This is certainly something that keeps us up at night (firefighting notwithstanding), and we are actively pursuing options to improve our failover capabilities.
So what happened? Our service provider experienced a major failure with its shared filesystem (SFS) service, causing networked filesystems to suddenly become unavailable for arXiv and numerous other clients at Cornell University. Our service was simply not prepared to handle this type of failure scenario; years of otherwise dependable service had given us a false sense of security, and we ultimately failed to plan for it properly.
After considerable assessment of the situation and server wrangling, we were eventually able to redirect users to our mirror servers. Once our service provider resolved the problem on their end, we were given the green light to reboot our servers, which restored access to our networked filesystems. No primary or backup data was lost or corrupted, so we were able to bring the service back to its normal state very shortly after the reboots. Since the outage spanned our scheduled publish cycle, we were regrettably forced to postpone the mailing to the next day; hence no new announcements in your inbox the following morning.
Where do we go from here? In the long term, we have already made architectural moves for arXiv-NG that will prevent this kind of catastrophic outage from taking down the whole system. But we also consider resiliency to this kind of failure to be a high priority in the short term. On the day following the outage, the arXiv development team convened to brainstorm failover options and improvements to our processes, and we have identified specific steps to better handle this type of failure, which we will begin implementing over the next few days. These steps will include changes to how our existing web servers are configured, cluster-level changes to ensure the availability of public interfaces even when networked storage goes down, and the incorporation of off-site failover options using infrastructure developed for arXiv-NG.
We again apologize for this disruption in service and thank you for your continued support of arXiv!