
arXiv NG: incremental decoupling and search

Martin (our IT Team Lead) and I gave a brief update on the arXiv-NG project at the Open Repositories 2018 conference in Bozeman, MT last week, so it seems like a good time to offer an update here as well. For a high-level refresher on what we’re up to, check out my earlier post. In this post, I’ll provide a bit more detail about how we’re migrating from a legacy codebase to a more evolvable architecture, illustrated by our recent work on our search interface.

The classic (read: legacy) arXiv platform is complex, and the fact that we are in the midst of re-architecting that system in fairly dramatic ways makes it difficult to provide a visual representation of progress. Here is my best attempt, from the perspective of how data “moves” through the arXiv platform. Each of the polygons represents a notional component of the classic system and/or a separate service or application in the NG architecture.

Notional overview of the arXiv system, depicting how data generally “moves” through the platform. This starts in the submission system with new papers, and flows through to a variety of access and discovery interfaces.


Release: search v0.2 + some notes on names

Version 0.2 of arXiv search was deployed to production this morning. This release addresses a variety of bugs identified in v0.1, with special attention to searching by author names. Detailed release notes can be found on the arXiv public wiki.

Author names in arXiv

If you used arXiv’s search interface to look for your papers during the past week, you may have had a moment of panic. While there were some real bugs that needed squashing (and have since been addressed), some of the feedback that we received pointed to deeper issues with the way that arXiv has historically handled information about authors. Our old search system partially obfuscated these issues by providing an artificial impression of precision. In this post, I lay out some of the challenges with the classic arXiv metadata schema related to authorship, how that impacts search, how arXiv author IDs and ORCID IDs factor into all of this, and what we are doing to improve the situation moving forward.


New Release: arXiv Search v0.1

Today we launched a reimplementation of our search system. As part of our broader strategy for arXiv-NG, we are incrementally decoupling components from the classic arXiv codebase, and replacing them with more modular services developed in Python. Our goal was to replace the aging Lucene search backend, achieve feature-parity with the classic search system, and give the search interface an opportunistic face-lift. While the frontend may not look terribly different from the old search interface, we hope that you’ll notice some improvements in functionality. The most important win for us in this milestone is that the new backend lays the groundwork for more dramatic improvements to search, our APIs, and other components targeted for reimplementation in arXiv-NG.

Here’s a rundown of some of the things that changed, and where we plan to go from here.


Planning, prioritization, and getting things done in arXiv-NG

The arXiv Next Generation project is an ambitious effort to renew the software that runs arXiv for long-term evolvability. In a previous post, I described some of the technical drivers for arXiv-NG, and our high-level approach. We’ve embarked on a mission to progressively rebuild the arXiv software by isolating components from the classic system, and reimplementing them in a more modular architecture.

Implementing our vision for arXiv-NG has entailed several significant changes in the way that the arXiv development team works. The team is learning new technologies, like Flask, Docker, and Kubernetes. We’ve also become increasingly adept at coordinating work among geographically distributed members of the team. Some of the most dramatic changes, however, have been in how we plan and prioritize development effort to advance the long-term goals of arXiv-NG. The complexity of the classic system, and thus the complexity of incrementally porting its components to a new architecture, require strategic planning on multiple time-scales.

In this post, I introduce some of the processes that we are using to plan and prioritize this complex effort. I’ll also talk a bit about our testing and release process. For those of you watching arXiv-NG development proceed, this will be especially useful background as NG components go into public beta testing over the coming weeks and months.


Engaging external developers in arXiv-NG

Starting last June, the arXiv team began reaching out to researchers and developers who have created tools that leverage arXiv APIs and/or content. We identified several hundred projects on GitHub and SourceForge, as well as platforms mentioned in responses to the 2016 arXiv user survey. We conducted a targeted survey for API consumers, and contacted several dozen individuals who have used our APIs for various purposes. This provided a wealth of information about who is building tools (e.g. we learned that tech-savvy researchers likely outnumber professional developers in this area), the problems that they have encountered, and the kinds of things that would make it easier for them to build tools that add value for arXiv users.

Based on that feedback, our strategy for engaging external developers has two major planks.

  1. An arXiv API gateway, to promote the development of innovative third-party tools and platforms, draw attention to those tools that already exist, and foster dialogue with external developers.
  2. Selective direct contribution to the arXiv codebase. The move to an open-source “en plein air” development model for arXiv-NG creates new opportunities to engage trusted individuals and organizations in the development of core features and services.

I’ll elaborate on each of these planks, below.


arXiv Technical Evaluation Rubric

While we eventually decided to adopt an incremental, microservices-based approach to redeveloping arXiv (see the post arXiv NG: Classic Renewal for context), we did spend considerable time evaluating existing repository technologies. To that end, we developed a technical evaluation rubric that we applied to candidate technologies, and are pleased to share that here in case readers of this blog find value in using or adapting the rubric for their own purposes.

Visit to Astrophysics Data System

ADS (Astrophysics Data System) held its second ADS Users Group meeting on 2–3 November. In advance of the meeting, on 1 November I spent the day meeting with Alberto Accomazzi (ADS PI), Michael Kurtz (Project Scientist), Edwin Henneken (IT Specialist), and members of the ADS development team.

The ADS Digital Library was founded in the early 1990s, at about the same time as arXiv, and has become an indispensable resource for the astronomy and astrophysics research community (see Kurtz et al., 2000).

We’ve worked closely with ADS over the years. Given the recent ramp-up of development effort at ADS to support the ADS Bumblebee project, and the parallel ramp-up for arXiv NG, this is an opportune time for ADS and arXiv to coordinate and possibly collaborate on new problems of mutual interest. This post is a recap of some of the things that we discussed.

arXiv NG: Classic Renewal

To paraphrase an observation by our new Scientific Director: from the perspective of most of our users, arXiv runs on magic. With the exception of a small handful of hiccups, arXiv has just worked for over 25 years. Since I joined the arXiv IT team as lead software architect in June, I’ve been working hard to pull back the proverbial curtain and take stock of how the sausage is made, and to synthesize the team’s aspirations and expertise for the arXiv Next Generation (arXiv-NG) project. We’ve done a considerable amount of research and soul-searching, and an architecture for NG has emerged.

Over the coming weeks and months I’ll discuss the highlights of the NG architecture on this blog, and keep you up to date on development progress. This post is a brief 30,000-foot view of where we’re going over the next two years.


Development update: reference extraction & linking

Extracting cited references from arXiv papers, and providing links to those references for readers, was identified for development under the 2017 arXiv Roadmap, based on input from the arXiv user survey and other user and stakeholder feedback. We have also heard from our API consumers that access to cited references would be valuable. arXiv already detects arXiv identifiers in cited references (for LaTeX submissions), and converts those identifiers to hyperlinks to the corresponding arXiv paper in the final PDF.
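
To give a rough sense of what detecting arXiv identifiers in a cited reference involves, here is a minimal Python sketch. It is not arXiv's actual implementation: the regular expressions, the function name extract_arxiv_links, and the sample reference string are all assumptions made for illustration. It only recognizes the two public identifier formats (new-style, such as 1501.00001v2, and old-style, such as hep-th/9901001) and builds the corresponding abstract-page URLs.

```python
import re
from typing import List

# Illustrative patterns only (an assumption, not arXiv's production rules).
# New-style identifiers: YYMM.NNNN or YYMM.NNNNN, with an optional version suffix.
NEW_STYLE = r"\d{4}\.\d{4,5}(?:v\d+)?"
# Old-style identifiers: archive[.SC]/YYMMNNN, with an optional version suffix.
OLD_STYLE = r"[a-z-]+(?:\.[A-Z]{2})?/\d{7}(?:v\d+)?"

# The lookbehind stops a match from starting in the middle of a word, path, or DOI;
# the lookahead stops a match from ending in the middle of a longer number.
ARXIV_ID = re.compile(
    r"(?<![\w./-])(?:arXiv:)?(" + NEW_STYLE + "|" + OLD_STYLE + r")(?!\d)"
)


def extract_arxiv_links(reference: str) -> List[str]:
    """Return abstract-page URLs for each arXiv identifier found in a reference string."""
    return ["https://arxiv.org/abs/" + match.group(1)
            for match in ARXIV_ID.finditer(reference)]


# Hypothetical reference string, made up for this example.
sample = "J. Doe, Some title, arXiv:1501.00001v2 (2015); see also hep-th/9901001."
links = extract_arxiv_links(sample)
```

A pattern like this is deliberately conservative. Cited references that carry no arXiv identifier at all (for example, only a journal citation) need a different approach, which is part of what the exploratory work described in this post is about.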

Over the past several weeks we’ve undertaken an exploratory project focused on reference extraction and possible scenarios for reference linking. This post is a brief snapshot of what we’ve done so far, what we’re hearing from users, and some thoughts about where we go from here.


Another quick arXiv user poll (links to publications cited by articles in arXiv)

We are considering enhancements to display links that will take the reader to the papers cited in arXiv articles. For example, we already add arXiv links to PDFs when the TeX source is available and arXiv IDs are present. We would like your input on extending this further:

The survey will remain open until 23:00 UTC-4, Thursday, 21 September, 2017. Your participation in this survey is voluntary. Your responses are confidential, although we may report publicly aggregated information based on the results of the survey.

Thank you for helping us improve arXiv!
