Engaging external developers in arXiv-NG

Last June, the arXiv team began reaching out to researchers and developers who have created tools that leverage arXiv APIs, arXiv content, or both. We identified several hundred projects on GitHub and SourceForge, as well as platforms mentioned in responses to the 2016 arXiv user survey. We conducted a targeted survey of API consumers, and contacted several dozen individuals who have used our APIs for various purposes. This provided a wealth of information about who is building tools (e.g., we learned that tech-savvy researchers likely outnumber professional developers in this area), the problems that they have encountered, and the kinds of things that would make it easier for them to build tools that add value for arXiv users.

Based on that feedback, our strategy for engaging external developers has two major planks.

  1. An arXiv API gateway, to promote the development of innovative third-party tools and platforms, draw attention to those tools that already exist, and foster dialogue with external developers.
  2. Selective direct contribution to the arXiv codebase. The move to an open-source “en plein air” development model for arXiv-NG creates new opportunities to engage trusted individuals and organizations in the development of core features and services.

I’ll elaborate on each of these planks below.

arXiv API gateway

Application Programming Interfaces (APIs) allow external parties to interact with the arXiv platform programmatically. Rich, well-documented APIs are a force multiplier for innovation around technology platforms. Strategic development and support of APIs allow our small team to focus on building software to support arXiv’s core mission, while empowering others to innovate on peripheral features and services. As I mentioned earlier, many people are already building powerful tools on top of arXiv, and we want to continue to promote those kinds of projects.

The classic arXiv system has a variety of APIs: the RSS feed, the “arXiv API” (an XML-based API that supports querying arXiv papers), the SWORDv1 bulk deposit endpoint, an OAI-PMH endpoint, etc. Some of those APIs are relied upon heavily by our partners. For example, ADS relies upon some of these APIs for its daily ingest of arXiv metadata. In some cases, parts of the system that are not intended to be used as APIs have become de facto APIs. For example, some people scrape our HTML-based web pages to collect information, which is less than ideal for all parties involved.
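To make the “arXiv API” a bit more concrete, here is a minimal sketch of how a client might build a query and read titles out of the XML (Atom) response, rather than scraping HTML. The endpoint URL and the parameter names (search_query, start, max_results) reflect my understanding of the public API documentation and should be treated as assumptions here; the helper names and the sample feed are made up for illustration, and a real response carries many more fields.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Assumed endpoint for the classic "arXiv API"; check the current docs before relying on it.
BASE_URL = "http://export.arxiv.org/api/query"
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def build_query_url(search_query, start=0, max_results=10):
    """Build a query URL (hypothetical helper; parameter names are assumptions)."""
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
    }
    return BASE_URL + "?" + urlencode(params)


def parse_titles(atom_xml):
    """Return the title of each entry in an Atom feed, with whitespace collapsed."""
    root = ET.fromstring(atom_xml)
    titles = []
    for entry in root.findall("atom:entry", ATOM_NS):
        raw = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        titles.append(" ".join(raw.split()))
    return titles


# A made-up, heavily trimmed feed used only to exercise the parser offline.
SAMPLE_FEED = """
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>An example
      paper title</title>
  </entry>
  <entry>
    <title>Another example title</title>
  </entry>
</feed>
"""

if __name__ == "__main__":
    print(build_query_url("all:electron", max_results=5))
    print(parse_titles(SAMPLE_FEED))
```

To fetch live results, the URL from build_query_url could be passed to urllib.request.urlopen and the response body handed to parse_titles. The sketch parses a local string instead so that it runs without network access.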

An API gateway will draw all of those APIs together into a single point of access, with uniform documentation, request and response conventions, and authorization mechanisms that are in line with security best practices. arXiv users will be able to obtain API keys that identify their requests and allow us to provide different kinds of access, capabilities, and resources to trusted partners. Over the last two months we have been developing infrastructure that will make it easier to version our APIs, keep documentation up to date, and release new APIs as arXiv-NG progresses. The gateway approach also gives us a clearer understanding of who is using our services and how, which puts us in a much better position to prioritize our development effort to support those users.

In addition, we are in the process of designing an API project registry that makes it easier for arXiv users to find external projects. For example, a developer may add a title and summary of their project when registering for an API key, which would then be displayed publicly with useful metadata. It will be important to do this in a way that promotes development, but also does not imply endorsement of any particular project by arXiv. I look forward to sharing further details about this as the project progresses.

Direct contribution to the arXiv codebase

While it is not possible to make source code for the classic arXiv system public, we are releasing all* new code developed as part of arXiv-NG under the MIT open source license. Operating in an open source mode has had several immediate benefits for the arXiv dev team, including the ability to use modern continuous integration and deployment tools. Germane to the present discussion, however, is that it makes it possible for trusted external developers to make direct contributions to core arXiv features and services.

As a collection of software, arXiv does not necessarily fit the mold of a large distributed open-source project: we are not developing general-purpose software or frameworks that we expect to be widely reused. However, the value of arXiv as a resource for the communities that it serves, and the number of existing partners with an interest in arXiv’s technical development, do put us in a position to cautiously solicit and incorporate contributions. Facilitating volunteer contributions would let people work on projects they are actively interested in, with potentially significant reputational rewards.

There are, of course, many risks associated with opening the door to external contributions, which is why we are proceeding with a great deal of caution. Considerations include vetting of potential contributors, the legal tools required to facilitate contributions, and effective communication about development processes and quality requirements. Given the complexity and risks, we intend to start small, with a few select contributors working on specific parts of the project. Based on our experiences, we will then decide whether and how to proceed in this direction.

* With the exception of a small number of sensitive components.