Visit to Astrophysics Data System

ADS (Astrophysics Data System) held its second ADS Users Group meeting on 2–3 November. In advance of the meeting, on 1 November I spent the day meeting with Alberto Accomazzi (ADS PI), Michael Kurtz (Project Scientist), Edwin Henneken (IT Specialist), and members of the ADS development team.

The ADS Digital Library was founded in the early 1990s, at about the same time as arXiv, and has become an indispensable resource for the astronomy and astrophysics research community (see Kurtz et al., 2000).

We’ve worked closely with ADS over the years. Given the recent ramp-up of development effort at ADS to support the ADS Bumblebee project, and the parallel ramp-up for arXiv NG, this is an opportune time for ADS and arXiv to coordinate and possibly collaborate on new problems of mutual interest. This post is a recap of some of the things that we discussed.

Background

As a relative newcomer to arXiv, I found it extremely valuable to see firsthand how our close partners interact with arXiv.org and use arXiv content. ADS currently interacts with arXiv in the following ways:

  • ADS consumes arXiv publications on a daily basis via our OAI-PMH endpoint.
  • To populate the ADS Bumblebee search index, ADS also extracts plain text from arXiv PDFs, and cited references from PDFs and TeX source packages.
  • We share some statistics with ADS to support their recommendation engine for readers.
  • ADS has developed an extensive set of methods for matching arXiv papers with corresponding publications in journals and conference proceedings. ADS shares those matches with us via its API, and we use that information to populate DOI and JREF fields on arXiv papers.
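
For readers unfamiliar with OAI-PMH, the daily harvest in the first bullet can be sketched roughly as follows. This is a minimal illustration, not ADS's actual harvester: the base URL is my assumption about the public endpoint, the function names are made up for this post, and the sample response is a hand-written stand-in rather than real arXiv output. Only the request parameters (verb, metadataPrefix, from, until) come from the OAI-PMH protocol itself.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Assumption: base URL of arXiv's OAI-PMH endpoint; check the arXiv docs.
OAI_BASE_URL = "http://export.arxiv.org/oai2"
# XML namespace defined by the OAI-PMH 2.0 protocol.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def build_list_records_url(from_date, until_date,
                           metadata_prefix="oai_dc",
                           base_url=OAI_BASE_URL):
    """Build a ListRecords request for records changed in a date window.

    Note that from/until select on the record datestamp (when the record
    was last changed in the repository), not on publication date.
    """
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": from_date,
        "until": until_date,
    }
    return f"{base_url}?{urlencode(params)}"


def extract_identifiers(xml_text):
    """Return the OAI identifiers found in a ListRecords response."""
    root = ET.fromstring(xml_text)
    return [el.text
            for el in root.findall(".//oai:header/oai:identifier", OAI_NS)]


# Hand-written stand-in for a response; the identifier is a placeholder.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:arXiv.org:0000.00000</identifier>
        <datestamp>2017-11-01</datestamp>
      </header>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(build_list_records_url("2017-11-01", "2017-11-02"))
    print(extract_identifiers(SAMPLE_RESPONSE))
```

A real harvester would also need to fetch the URL, follow resumption tokens for large result sets, and respect rate limits; those parts are left out here to keep the sketch short.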

Challenges & Opportunities

ADS is in the process of migrating from their ADS Classic platform to an exciting, feature-rich platform called ADS Bumblebee. There are a few areas in which arXiv and ADS can work together to address challenges in the Bumblebee project.

  • Currently, data availability in ADS Bumblebee lags behind ADS Classic, although the ADS team is making great strides in closing that gap. A major driver of that delay is the daily update from arXiv, which involves extra processing steps such as plain text extraction.
  • ADS is working to improve its metadata models for authors, institutions, and scientific collaborations. Because arXiv currently lacks an explicit representation of authors and other entities in its metadata, ADS must parse author metadata from arXiv heuristically. Challenges include disambiguating authors, collaborations, and institutions; mapping collaborations and institutions to authors; and the fact that arXiv does not expose ORCID IDs in its metadata.
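
To give a feel for why heuristic author parsing is fragile, here is a toy sketch. It is not ADS's method; it assumes a free-text author string with optional parenthesised affiliations and an optional collaboration name, and the function name and example names are invented for illustration.

```python
import re


def parse_author_string(raw):
    """Naively split a free-text author string into names and collaborations.

    Assumptions (illustrative only): affiliations appear in parentheses,
    authors are separated by commas or the word "and", and collaboration
    names contain the word "Collaboration".
    """
    # Drop parenthesised affiliations first, since they may contain commas.
    without_affiliations = re.sub(r"\s*\([^)]*\)", "", raw)
    parts = re.split(r",\s*|\s+and\s+", without_affiliations.strip())

    authors, collaborations = [], []
    for part in parts:
        name = part.strip()
        if not name:
            continue
        if re.search(r"\bcollaboration\b", name, re.IGNORECASE):
            # Strip a leading "the" or "for the" from the collaboration name.
            name = re.sub(r"^(for\s+)?the\s+", "", name, flags=re.IGNORECASE)
            collaborations.append(name)
        else:
            authors.append(name)
    return authors, collaborations


if __name__ == "__main__":
    example = ("Jane Doe (Example University), "
               "John Q. Public and the Example Collaboration")
    print(parse_author_string(example))
    # A "Surname, Given" string is wrongly split into two authors.
    print(parse_author_string("Doe, Jane"))
```

The second call shows the weakness: a string written as "Doe, Jane" comes back as two separate authors. Explicit, structured author and collaboration fields in the metadata would make this kind of guesswork unnecessary.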

How arXiv can help

  • As arXiv moves toward a more modern, RESTful JSON API for publication metadata, ADS is in a good position to utilize it for their regular updates.
  • Make general improvements to the metadata that we expose via APIs. ADS encounters some of the same limitations described by other API consumers, e.g. difficulty limiting results by publication date.
  • Improvements to author, institution, and collaboration metadata associated with arXiv papers will make it much easier for partners like ADS to utilize arXiv content.
  • Make full text and cited references available via API. If these products were of equivalent or superior quality to those produced internally by ADS, ADS could simplify its workflow by relying on these APIs.
  • Continue to provide data to support ADS’ recommendation tools.

Longer-term possibilities

ADS and arXiv have adopted a remarkably similar set of technologies for new development, including Python/Flask, Docker, and AWS, and are adopting similar architectural patterns (e.g. microservices, API gateway). Now that arXiv and ADS are both releasing source code under open source licenses, this creates a wider opening for potential direct collaboration on projects of shared interest. Some areas for potential collaboration include author disambiguation, reference extraction, sharing expertise related to infrastructure and technology (in particular AWS), and support for automated annotation to aid classification.

I’m excited about the possibilities for closer collaboration with the ADS team throughout the Bumblebee and arXiv-NG projects.