
arXiv NG: incremental decoupling and search

Martin (our IT Team Lead) and I gave a brief update on the arXiv-NG project at the Open Repositories 2018 conference in Bozeman, MT last week. So it seems like a good time to offer an update here, as well. For a high-level refresher on what we’re up to, check out my earlier post. In this post, I’ll provide a bit more detail about how we’re migrating from a legacy code-base to a more evolvable architecture, illustrated by our recent work on our search interface.

The classic (read: legacy) arXiv platform is complex, and the fact that we are in the midst of re-architecting that system in fairly dramatic ways makes it difficult to provide a visual representation of progress. Here is my best attempt, from the perspective of how data “moves” through the arXiv platform. Each of the polygons represents a notional component of the classic system and/or a separate service or application in the NG architecture.

Notional overview of the arXiv system, depicting how data generally “moves” through the platform. This starts in the submission system with new papers, and flows through to a variety of access and discovery interfaces.


Release: search v0.2 + some notes on names

Version 0.2 of arXiv search was deployed to production this morning. This release addresses a variety of bugs identified in v0.1, with special attention to searching by author names. Detailed release notes can be found on the arXiv public wiki.

Author names in arXiv

If you used arXiv’s search interface to look for your papers during the past week, you may have had a moment of panic. While there were some real bugs that needed squashing (and have since been addressed), some of the feedback that we received pointed to deeper issues with the way that arXiv has historically handled information about authors. Our old search system partially obfuscated these issues by providing an artificial impression of precision. In this post, I lay out some of the challenges with the classic arXiv metadata schema related to authorship, how that impacts search, how arXiv author IDs and ORCID IDs factor into all of this, and what we are doing to improve the situation moving forward.
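To make the "artificial impression of precision" concrete, here is a minimal sketch of how loose name matching can hide ambiguity. It assumes, purely for illustration, that author names are available only as free-text strings. The helpers `name_key` and `same_author_maybe` are hypothetical and are not part of arXiv search.

```python
import re
import unicodedata
from typing import Tuple


def name_key(name: str) -> Tuple[str, str]:
    """Reduce a free-text name to (surname, first initial), ignoring case and accents."""
    # Strip accents so that "Müller" and "Muller" compare equal.
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    if "," in ascii_name:
        # "Surname, Forename" form.
        surname, _, forenames = ascii_name.partition(",")
    else:
        # "Forename Surname" form: treat the last token as the surname.
        parts = ascii_name.split()
        if not parts:
            return "", ""
        surname, forenames = parts[-1], " ".join(parts[:-1])
    # Keep only the first letter of the forenames, if there is one.
    initial = re.sub(r"[^a-z]", "", forenames.lower())[:1]
    return surname.strip().lower(), initial


def same_author_maybe(a: str, b: str) -> bool:
    """True when two name strings collapse to the same key (hypothetical helper)."""
    return name_key(a) == name_key(b)
```

Under this scheme "J. Smith" matches both "Smith, John" and "Jane Smith", so a result list can look exact while mixing different people. A persistent identifier attached to each author, such as an ORCID iD, avoids that collapse.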


New Release: arXiv Search v0.1

Today we launched a reimplementation of our search system. As part of our broader strategy for arXiv-NG, we are incrementally decoupling components from the classic arXiv codebase, and replacing them with more modular services developed in Python. Our goal was to replace the aging Lucene search backend, achieve feature-parity with the classic search system, and give the search interface an opportunistic face-lift. While the frontend may not look terribly different from the old search interface, we hope that you’ll notice some improvements in functionality. The most important win for us in this milestone is that the new backend lays the groundwork for more dramatic improvements to search, our APIs, and other components targeted for reimplementation in arXiv-NG.

Here’s a rundown of some of the things that changed, and where we plan to go from here.


Our recent outage

You may have noticed that arXiv was mostly unavailable for several hours on March 26, which ultimately led us to postpone the mailing for that evening. First, please accept our apology for this major service disruption; we know that many of you rely on arXiv as part of your daily workflows. Since this was our second major service outage in four months, you may be wondering about arXiv’s long-term reliability. This is certainly something that keeps us up at night (firefighting notwithstanding), and we are actively pursuing options to improve our failover capabilities.

So what happened? Our service provider experienced a major failure with its shared filesystem (SFS) service, causing networked filesystems to suddenly become unavailable for arXiv and numerous other clients at Cornell University. Our service was simply not prepared to handle this type of failure scenario; years of otherwise dependable service had given us a false sense of security, and we ultimately failed to plan for it properly.

After considerable situation assessment and server wrangling, we were eventually able to redirect users to our mirror servers. Once our service provider resolved the problem on their end, we were given the green light to reboot our servers, which restored access to our networked filesystems. No primary or backup data was lost or corrupted, so we were able to bring the service back to its normal state very shortly after the reboots. Since the outage spanned our scheduled publish cycle, we were regrettably forced to postpone the mailing to the next day, which is why there were no new announcements in your inbox the following morning.

Where do we go from here? In the long term, we have already made architectural moves for arXiv-NG that will prevent this kind of catastrophic outage from taking down the whole system. But we also consider resiliency to this kind of failure to be a high priority in the short term. On the day following the outage, the arXiv development team convened to brainstorm failover options and improvements to our processes. We identified specific steps to better handle this type of failure, and we will begin implementing them over the next few days. These include changes to how our existing web servers are configured, cluster-level changes to ensure the availability of public interfaces even when networked storage goes down, and off-site failover options using infrastructure developed for arXiv-NG.
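As a rough illustration of what "availability even when networked storage goes down" can mean at the application level, here is a minimal sketch that tries a list of storage locations in order and serves from the first one that responds. It is not our actual configuration: the function name `read_first_available` and the idea of passing an ordered list of root directories are assumptions made for this example.

```python
from pathlib import Path
from typing import Iterable, Optional


def read_first_available(relative_path: str, roots: Iterable[Path]) -> Optional[bytes]:
    """Return the file's bytes from the first root that can serve it, else None."""
    for root in roots:
        try:
            return (root / relative_path).read_bytes()
        except OSError:
            # Covers a missing file as well as an unmounted or errored filesystem.
            continue
    return None
```

A caller would list the networked mount first and a local or off-site copy second. Note that a hung network mount can block instead of raising an error, so a real deployment would also need timeouts or health checks in front of a fallback like this.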

We again apologize for this disruption in service and thank you for your continued support of arXiv!

Planning, prioritization, and getting things done in arXiv-NG

The arXiv Next Generation project is an ambitious effort to renew the software that runs arXiv for long-term evolvability. In a previous post, I described some of the technical drivers for arXiv-NG, and our high-level approach. We’ve embarked on a mission to progressively rebuild the arXiv software by isolating components from the classic system, and reimplementing them in a more modular architecture.

Implementing our vision for arXiv-NG has entailed several significant changes in the way that the arXiv development team works. The team is learning new technologies, like Flask, Docker, and Kubernetes. We’ve also become increasingly adept at coordinating work among geographically distributed members of the team. Some of the most dramatic changes, however, have been in how we plan and prioritize development effort to advance the long-term goals of arXiv-NG. The complexity of the classic system, and thus the complexity of incrementally porting its components to a new architecture, require strategic planning on multiple time-scales.

In this post, I introduce some of the processes that we are using to plan and prioritize this complex effort. I’ll also talk a bit about our testing and release process. For those of you watching arXiv-NG development proceed, this will be especially useful background as NG components go into public beta testing over the coming weeks and months.


Engaging external developers in arXiv-NG

Starting last June, the arXiv team began reaching out to researchers and developers who have created tools that leverage arXiv APIs and/or content. We identified several hundred projects on GitHub and SourceForge, as well as platforms mentioned in responses to the 2016 arXiv user survey. We conducted a targeted survey for API consumers, and contacted several dozen individuals who have used our APIs for various purposes. This provided a wealth of information about who is building tools (e.g. we learned that tech-savvy researchers likely outnumber professional developers in this area), the problems that they have encountered, and the kinds of things that would make it easier for them to build tools that add value for arXiv users.

Based on that feedback, our strategy for engaging external developers has two major planks.

  1. An arXiv API gateway, to promote the development of innovative third-party tools and platforms, draw attention to those tools that already exist, and foster dialogue with external developers.
  2. Selective direct contribution to the arXiv codebase. The move to an open-source “en plein air” development model for arXiv-NG creates new opportunities to engage trusted individuals and organizations in the development of core features and services.

I’ll elaborate on each of these planks, below.


arXiv Technical Evaluation Rubric

While we eventually decided to adopt an incremental, microservices-based approach to redeveloping arXiv (see the post arXiv NG: Classic Renewal for context), we did spend considerable time evaluating existing repository technologies. To that end, we developed a technical evaluation rubric that we applied to candidate technologies, and are pleased to share that here in case readers of this blog find value in using or adapting the rubric for their own purposes.

Visit to Astrophysics Data System

ADS (Astrophysics Data System) held its second ADS Users Group meeting on 2–3 November. In advance of the meeting, on 1 November I spent the day meeting with Alberto Accomazzi (ADS PI), Michael Kurtz (Project Scientist), Edwin Henneken (IT Specialist), and members of the ADS development team.

The ADS Digital Library was founded in the early 1990s, at about the same time as arXiv, and has become an indispensable resource for the astronomy and astrophysics research community (see Kurtz et al., 2000).

We’ve worked closely with ADS over the years. Given the recent ramp-up of development effort at ADS to support the ADS Bumblebee project, and the parallel ramp-up for arXiv NG, this is an opportune time for ADS and arXiv to coordinate and possibly collaborate on new problems of mutual interest. This post is a recap of some of the things that we discussed.

arXiv NG: Classic Renewal

To paraphrase an observation by our new Scientific Director: from the perspective of most of our users, arXiv runs on magic. With the exception of a small handful of hiccups, arXiv has just worked for over 25 years. Since I joined the arXiv IT team as lead software architect in June, I’ve been working hard to pull back the proverbial curtain and take stock of how the sausage is made, and to synthesize the team’s aspirations and expertise for the arXiv Next Generation (arXiv-NG) project. We’ve done a considerable amount of research and soul-searching, and an architecture for NG has emerged.

Over the coming weeks and months I’ll discuss the highlights of the NG architecture on this blog, and keep you up to date on development progress. This post is a brief 30,000-foot view of where we’re going over the next two years.


arXiv Developer Spotlight: Librarian, from Fermat’s Library

When I joined arXiv as lead software architect in June, one of the first things that jumped out at me was the incredible range of cool and innovative things that users are doing with arXiv content. In my opinion, that’s a sign that we’re doing things right: arXiv provides a valuable and reliable core service, and empowers the community to build on that foundation.

In the spirit of empowering the community, I’ve decided to start showcasing some of the cool arXiv-based projects that we’ve found around the internet. If you’ve found an app, service, widget, visualization, or anything else that uses arXiv content in interesting ways, please let us know about it! You can get in touch via the arXiv-API Google group, or at

Librarian, from Fermat’s Library

We sat down with the team at Fermat’s Library this week to talk about Librarian, a Chrome extension that displays BibTeX and reference links while you’re reading PDFs on arXiv. Librarian came across our radar when we started experimenting with reference extraction several weeks ago.

