Release: search v0.2 + some notes on names

Version 0.2 of arXiv search was deployed to production this morning. This release addresses a variety of bugs identified in v0.1, with special attention to searching by author names. Detailed release notes can be found on the arXiv public wiki.

Author names in arXiv

If you used arXiv’s search interface to look for your papers during the past week, you may have had a moment of panic. While there were some real bugs that needed squashing (and have since been addressed), some of the feedback that we received pointed to deeper issues with the way that arXiv has historically handled information about authors. Our old search system partially obscured these issues by giving an artificial impression of precision. In this post, I lay out some of the challenges with the classic arXiv metadata schema related to authorship, how they affect search, how arXiv author IDs and ORCID IDs factor into all of this, and what we are doing to improve the situation moving forward.

The arXiv metadata schema

The metadata that we collect from submitters is dead simple:

  • Title
  • Abstract
  • Primary and secondary classification
  • Publication information (DOI, journal reference)
  • MSC + ACM classification codes
  • Author names

Information about authors is encapsulated in a single text field. We ask submitters to use a canonical format (separate author names with commas, put affiliations in parentheses), but even that format leaves quite a bit of flexibility in how individual author names are written. For example, forenames may be written out in full or abbreviated to initials: E. N. Sinskaja, Evgeniya Nicolaevna Sinskaja, and Evgeniya N. Sinskaja are all valid representations for the Russian ecologist (were she to have published on arXiv).
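To make the canonical format concrete, here is a minimal sketch of how such a field could be split into names and affiliations. This is not arXiv's actual parser: the function name and the sample affiliation are made up for illustration, and the only rules assumed are the two stated above (commas between authors, affiliations in parentheses).

```python
import re
from typing import List, Optional, Tuple


def split_author_field(field: str) -> List[Tuple[str, Optional[str]]]:
    """Split 'Name (Affiliation), Name' into (name, affiliation) pairs.

    Hypothetical helper: commas inside parentheses are treated as part
    of the affiliation, not as separators between authors.
    """
    parts, current, depth = [], [], 0
    for ch in field:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))

    authors = []
    for part in parts:
        part = part.strip()
        if not part:
            continue
        match = re.match(r"^(.*?)\s*\((.*)\)\s*$", part)
        if match:
            authors.append((match.group(1).strip(), match.group(2).strip()))
        else:
            authors.append((part, None))
    return authors


# The affiliation below is invented for the example.
example = "E. N. Sinskaja (Example Institute, Example City), Evgeniya N. Sinskaja"
for name, affiliation in split_author_field(example):
    print(name, "|", affiliation)
```

Even after a clean split, the two names above are just different strings. Nothing in the field says they refer to the same person.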

Searching with names

The trouble with this ambiguity is that search has no consistent way to ensure that a query for any one of those representations turns up papers in which the other representations were used, while reliably excluding false positives (for example, Edna B. Sinskaja).

The strategy promoted in the classic search system is to reduce queries to their lowest common denominator: author name links from a paper’s abstract page generate queries that use the last word of the name and the first initial of the first word of the name (for example, Sinskaja, E). This ensures that essentially all papers possibly written by a particular author are returned, but it also means that a link from the name “Evgeniya Sinskaja” on the abstract page of one of her papers can generate a search that returns papers written by “Edna B. Sinskaja” or even “Eustice L. Sinskaja”.
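The reduction described above can be sketched in a few lines. The function name is hypothetical, and the classic system's real implementation may well have differed in the details. The sketch only encodes the rule as stated: last word of the name, plus the first initial of the first word.

```python
def classic_author_query(name: str) -> str:
    """Reduce a written name to 'Surname, Initial' (illustrative only)."""
    words = name.split()
    if not words:
        return ""
    surname = words[-1]
    if len(words) == 1:
        # A single-word name has no forename to take an initial from.
        return surname
    return f"{surname}, {words[0][0]}"


names = [
    "Evgeniya Nicolaevna Sinskaja",
    "E. N. Sinskaja",
    "Edna B. Sinskaja",
    "Eustice L. Sinskaja",
]
for name in names:
    print(f"{name!r} -> {classic_author_query(name)!r}")
```

All four names collapse to the same query, which is exactly why recall is high and precision is poor.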

Some users have noted that the old search system supported a name format that used an underscore to delimit surname and forename parts, for example: “sinskaja_e”. This was an artifact of the old search backend, and is equivalent to searching for “sinskaja, e;” in the new search system.

arXiv author identifiers, ORCID IDs

A widely used strategy for resolving the ambiguity around author names is to use a so-called “name authority.” Many readers will be familiar with the Library of Congress Name Authority File, which provides canonical spellings of author names for use in databases and indices, for example: Sinskaíà, E. N. (Evgeniíà Nikolaevna), 1889-. In the age of the internet, those canonical spellings have been supplemented with explicit identifiers, or URIs (for example, http://id.loc.gov/authorities/names/n84802474).

arXiv has supported author identifiers (our own name authority) since 2005. An arXiv user may create an arXiv author identifier, which is associated with any papers that the user “owns” (submitted or claimed). This creates an automatically updated listing of all of the user’s papers, for example https://arxiv.org/a/warner_s_1, and the identifier is also indexed in the search system.

Note that while this identifier looks a lot like the classic system’s format for searching by author name (for example, warner_s), the two are entirely unrelated.

An even better solution is to associate your ORCID ID with your arXiv user account. On the arXiv platform, ORCID IDs function much like arXiv author IDs: they generate listings like https://arxiv.org/a/0000-0002-7970-7855.html, and are indexed in the search system. The advantage of using your ORCID ID is that it is recognized across many different publishing platforms. In the future, we plan to make it easier to automatically update your ORCID record with information about your arXiv e-prints.

What we are doing to improve things

We have to work with the data that we already have, and so searching by author name should be a similar experience to the old search system. The current version of search makes a few small improvements:

Better and more consistent support for internationalization.

If you’ve ever tried searching with author names that contain non-ASCII characters, you may have encountered some frustrating bugs. For example, searching for the name “Schönfeld” in the old search interface returned 0 results along with the cryptic message: “One or more non-indexed characters — ö — have been stripped from your query.” Huh?

Diacritics and other non-ASCII characters are now normalized in ways that you would expect, for example: ß -> ss, ö -> o, etc. A search for “Schönfeld” in the author field should now return papers by Schönfeld, Schonfeld, and Schoenfeld.
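For the curious, here is a rough sketch of this kind of normalization using Python's standard unicodedata module. It is not the code running in production. The function names are hypothetical, and the umlaut expansion (ö -> oe) is my assumption about one way a query could also reach “Schoenfeld”.

```python
import unicodedata

# Assumed expansions for German umlauts (a, o, u with diaeresis).
UMLAUT_EXPANSIONS = {"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue"}


def fold_diacritics(text: str) -> str:
    """Replace eszett with 'ss' and strip combining marks."""
    text = text.replace("\u00df", "ss")  # eszett has no decomposition
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def query_variants(name: str) -> set:
    """Return the folded spellings that a query should be matched against."""
    lowered = unicodedata.normalize("NFC", name).casefold()
    expanded = "".join(UMLAUT_EXPANSIONS.get(ch, ch) for ch in lowered)
    return {fold_diacritics(lowered), fold_diacritics(expanded)}


# Names as they might appear in the author metadata.
indexed = ["Sch\u00f6nfeld", "Schonfeld", "Schoenfeld", "Sinskaja"]
variants = query_variants("Sch\u00f6nfeld")
hits = [n for n in indexed if fold_diacritics(n.casefold()) in variants]
print(hits)
```

In this sketch, names in the index are folded once, while the query is expanded into both the stripped and the “oe” spelling. That keeps the index small and puts the extra work on the query side.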

Leveraging owner information

In addition to the canonical author metadata, we are now also leveraging information about the users who “own” each paper (in other words, a submitting author or a co-author who claimed the paper). In some cases, this will provide higher-quality results. For example, if an author is listed as “E. Sinskaja” in the author metadata, but has also claimed the paper and entered their full name “Евгения Николаевна Синская” in their profile, a search for “Синская, Евгения” should return the paper.
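The idea can be sketched as follows, using a made-up Latin-script example rather than the one above. The Paper class, the matching rule, and the names are all hypothetical. The real index is more involved, but the principle is the same: a name from an owner's profile gives the search one more spelling to match against.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Paper:
    paper_id: str
    author_names: List[str]  # as written in the author metadata
    owner_names: List[str] = field(default_factory=list)  # from owner profiles


def name_matches(query: str, name: str) -> bool:
    """Match a 'Surname, Forename' query against one written name."""
    surname, _, forename = query.partition(",")
    surname = surname.strip().casefold()
    forename = forename.strip().casefold()
    words = name.casefold().split()
    if not words or words[-1] != surname:
        return False
    # An empty forename matches every name with that surname.
    return words[0].startswith(forename)


def search(papers: List[Paper], query: str, use_owners: bool = True) -> List[str]:
    """Return IDs of papers with at least one matching name."""
    results = []
    for paper in papers:
        names = list(paper.author_names)
        if use_owners:
            names += paper.owner_names
        if any(name_matches(query, name) for name in names):
            results.append(paper.paper_id)
    return results


papers = [
    Paper("p1", ["A. Example"], owner_names=["Alice Example"]),
    Paper("p2", ["A. Example"]),  # never claimed, so no owner name
]
print(search(papers, "Example, Alice", use_owners=False))
print(search(papers, "Example, Alice"))
```

With owner names ignored, the full-forename query finds nothing, because the metadata only carries an initial. With owner names included, the claimed paper is found.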

Search by ORCID ID and arXiv author ID

Author ORCID IDs and arXiv author IDs (see above) are now indexed and searchable. Note that this only works when an author has claimed their papers, has entered their ORCID ID in their user profile, and has created an arXiv author ID. See this page about author identifiers and this page about ORCID IDs for details.
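As an illustration, a search front end could tell the two kinds of identifier apart along these lines. The ORCID check digit uses the ISO 7064 MOD 11-2 scheme that ORCID documents. The arXiv author ID pattern is only a guess inferred from the single example warner_s_1, and the function names are hypothetical.

```python
import re

ORCID_RE = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")
# Assumption: surname_initials_number, inferred from 'warner_s_1' only.
AUTHOR_ID_RE = re.compile(r"[a-z]+_[a-z]+_\d+")


def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def classify_identifier(query: str) -> str:
    """Return 'orcid', 'arxiv_author_id', or 'unknown' for a query string."""
    q = query.strip()
    if ORCID_RE.fullmatch(q.upper()):
        digits = q.upper().replace("-", "")
        if orcid_check_digit(digits[:-1]) == digits[-1]:
            return "orcid"
        return "unknown"  # right shape, wrong check digit
    if AUTHOR_ID_RE.fullmatch(q.lower()):
        return "arxiv_author_id"
    return "unknown"


for q in ["0000-0002-7970-7855", "warner_s_1", "warner_s"]:
    print(q, "->", classify_identifier(q))
```

Note that the classic name query warner_s falls through to “unknown” here, which matches the point made earlier that author IDs and classic name queries are unrelated.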

Future developments: better disambiguation

An upcoming milestone for arXiv-NG is to improve our support for author identifiers (both arXiv IDs and ORCID IDs), and provide better tools for identifying papers based on authors’ identities rather than how their names are written. Stay tuned for further details in the coming months.