New Release: arXiv Search v0.1

Today we launched a reimplementation of our search system. As part of our broader strategy for arXiv-NG, we are incrementally decoupling components from the classic arXiv codebase, and replacing them with more modular services developed in Python. Our goal was to replace the aging Lucene search backend, achieve feature-parity with the classic search system, and give the search interface an opportunistic face-lift. While the frontend may not look terribly different from the old search interface, we hope that you’ll notice some improvements in functionality. The most important win for us in this milestone is that the new backend lays the groundwork for more dramatic improvements to search, our APIs, and other components targeted for reimplementation in arXiv-NG.

Here’s a rundown of some of the things that changed, and where we plan to go from here.

What’s new

Elasticsearch

The most significant change in this release is that we moved away from a search index in the classic system based on a 2008 version of Lucene to an Elasticsearch cluster running in the cloud. This is part of our broader architectural strategy to increase resilience through clearer separation of concerns in both our software and our infrastructure. The move to Elasticsearch also opens up a rich toolset for powerful search functionality, such as better tools for internationalization. We’ve only scratched the surface with ES, and are excited to take advantage of additional capabilities as we build on this reimplementation in future releases.

Something looks different…

We gave the user interface a facelift. We wanted to preserve austere look-and-feel of arXiv, with a bit of modern styling to support readability and scanning. To that end, we’ve adopted a frontend framework called Bulma, a responsive CSS library without the kitchen sink. In addition to making the search interface a bit more pleasing to look at, it has also made it easier to provide a better mobile experience. As arXiv-NG progresses, you’ll notice these styling improvements spreading to other parts of the site.

Better support for author names

Searching by author name was one of the most challenging parts of this milestone. The arXiv metadata schema is dead simple: except when authors have claimed a paper, the entire list of authors is represented as a single string that follows a canonical format. While this works just fine for most non-hyphenated Anglo-European names, it’s less than ideal for everyone else. As a consequence, we had to apply some creative fuzziness to the author name search. In our first attempt, we tried to parse author names into parts (forename, surname, initials, affixes), and then provided those fields in the search interface. Our beta testers gave us quite a bit of feedback about cases in which this simply didn’t work, so we went back to the drawing board (it’s easy to hold false beliefs about names). Using many of the examples provided by our beta testers, we generated a variety of scenarios and expected behaviors for author name search, and then defined a cascading set of weighted queries against the search index that reproduced the expected behavior. We feel good about the current implementation, but are cognizant that there is ample room for improvement. Please let us know if you see any oddities while searching by author name. You can send feedback directly to the dev team by clicking on the “Feedback” button in the upper-right corner of the page.

By the way, you can search for papers by authors’ ORCID identifiers. Almost 45,000 arXiv authors have added ORCID identifiers to their profiles (and you should too). Note that this only works if the author has claimed their papers, since the core paper metadata itself does not include ORCIDs (but we’re planning to fix that). Now you may also search by arXiv author identifier.

TeXisms

We’re trying TeXism-based search in title and abstract fields! To search for a TeXism, enclose the expression in dollar signs ($). Note that this will return exact matches only. We’re eager to hear whether you find this useful.

Better support for DOI

We’ve added DOI as a search field (exact match only). In response to beta tester feedback, we also made DOIs more prominent in the search results (with links to the doi.org resolver), so that you can quickly ascertain whether there is an external version of record.

Other little things

We made a variety of opportunistic tweaks and changes. Here are just a few:

  • ACM and MSC classification codes are now handled as separate fields.
  • Better wildcard support for searches in title and abstract.
  • The tiny search box in the header now supports all of the fields that the new search interface supports.
  • Hit highlighting. The downside of supporting more fields for search is that it can be more difficult to ascertain why a result was included. We’ve added hit highlighting to indicate which fields and terms matched your query.

What’s next

Full text search

The CS department at Cornell has operated an experimental full text search for a number of years. This has always been separate from the metadata-based search that we operate. While the current release does not introduce full text search into our platform, that functionality is on the 2018 arXiv roadmap, so stay tuned.

API improvements

Much of the backend functionality that makes search possible is also relevant for our API consumers. As part of our vision for an arXiv API Gateway, later this year we plan to leverage Elasticsearch to improve the arXiv API.

Faceted search

The move to Elasticsearch opens up much better support for aggregations in search, which will make it much easier to provide a more user-friendly and interactive faceted search interface. We will likely start developing an interface that supports faceted search in mid- to late-2019.

Thanks!

Many thanks to more than 300 beta testers for taking the time to put early versions of the new search through the wringer, giving us important and valuable feedback and test cases for improvement. A number of key refinements were uncovered and addressed through the beta testing process.

Thanks also to the Sloan Foundation, Simons Foundation, our institutional members, and the many individual donors who made this work possible through their financial contributions.

You can send feedback directly to the dev team by clicking on the “Feedback” button in the upper-right corner of the page.