By Gail Steinhart, Simeon Warner, and Oya Rieger
From April 2011 through March 2013, arXiv collaborated on a pilot project with the Data Conservancy to support remote data deposit for arXiv submissions. The Data Conservancy, initially funded by the US National Science Foundation (NSF), is a project that aims to “research, design, implement, deploy and sustain data curation infrastructure for cross-disciplinary discovery with an emphasis on observational data”. We sought to understand how researchers took advantage of the pilot, and looked at over 200 arXiv submissions with data sets in the Data Conservancy. For each submission we noted the publication’s subject area, the number of data files, the combined size of all data files, and the file extensions. In a subset of cases, we also examined publications for references to the availability of data sets, and looked for anything that could be construed as metadata in the publication or with the data sets themselves.
Amount and size of data
For the most part, authors are submitting a few small files to the Data Conservancy. Submissions included between 1 and 944 separate data files, with an average of fewer than 10 data files per submission. The largest data collection had a combined size of 819MB; the average was just under 20MB.
We counted 42 different file extensions for 1837 files in the submissions we examined. The most common classes of files were documents (.pdf or .tex), videos, and image or graphic files. The maximum number of different extensions associated with a single submission was 5 (for 19 data files), and only 6 of the submissions we inspected had more than 2 different file extensions. A cursory check of whether the most common file formats for data submitted to the Data Conservancy are listed as “preservation friendly” by either the UK Data Archive or Cornell’s institutional repository, eCommons, showed that many are not, which suggests some substantial preservation challenges in the long term.
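A tally of this kind can be sketched in a few lines of Python. This is only an illustration, not the procedure we actually used: the function name `extension_summary`, the submission identifiers, and the file names are all hypothetical, and the sketch assumes the file listing is available as a mapping from submission to file names.

```python
from collections import Counter
from pathlib import PurePosixPath


def extension_summary(submissions):
    """Tally file extensions across submissions.

    `submissions` maps a (hypothetical) submission id to a list of file names.
    Returns three things: files per extension, number of submissions in which
    each extension appears, and distinct extensions per submission.
    """
    file_counts = Counter()
    submission_counts = Counter()
    distinct_per_submission = {}
    for sub_id, names in submissions.items():
        # Lower-case the suffix so ".AVI" and ".avi" count as one extension.
        exts = [PurePosixPath(name).suffix.lower() for name in names]
        file_counts.update(exts)
        submission_counts.update(set(exts))
        distinct_per_submission[sub_id] = len(set(exts))
    return file_counts, submission_counts, distinct_per_submission


# Made-up example data, for illustration only.
example = {
    "sub-A": ["paper.pdf", "movie1.avi", "movie2.AVI"],
    "sub-B": ["figure.eps", "notes.pdf"],
}
files, subs, distinct = extension_summary(example)
```

Counting submissions separately from files matters here: an extension can account for many files in one submission and still appear in only a single submission.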
We should note that files submitted to the Data Conservancy weren’t always what we might think of as “data.” Examples of “non-data” uploaded to the Data Conservancy included copies or alternate formats of papers, documents with extra information about methods or other explanatory material, and higher resolution versions of figures from the papers.
Of the submissions we examined, a handful of arXiv subject areas had more than ten submissions with data in the Data Conservancy. They were Condensed matter (48 submissions), Astrophysics (32), Mathematics (32), Physics (30), Computer science (23), and High energy physics (12).
Metadata and references to data sets
For 54 unique submissions, we looked in the papers for references to supplementary data, and at both the papers and the data sets themselves for anything that could be considered metadata describing the data sets. Here’s what we found:
- For five submissions, the supplementary files uploaded to DC were copies of (or in one case, a translation of) the paper itself.
- 23 submissions included an explicit reference to supplementary materials. Four of these referenced supplementary material available elsewhere (on a lab website, for example); otherwise references to supplementary data or files were not usually explicit about the location of those files.
- In 26 submissions, we found no reference to supplementary materials.
- Authors who did refer to supplementary data files did so either at the end of a paper (12, in references, as an endnote, or appendix), within the text itself (13, in-line, as a footnote, or in a figure legend), or both (4).
- Only nine submissions included information that could be thought of as metadata. When present, metadata was included as a standalone document such as a readme or appendix (four cases), a brief description within or at the end of the paper (three cases), or within the supplementary files themselves (two cases).
Additional observations and conclusions
arXiv has allowed authors to upload supplementary files since 2010, and some authors may have continued to use that option rather than experimenting with the Data Conservancy pilot. With the addition of Data Conservancy as an option for data sets, authors sometimes made supplementary information available for a single submission by multiple means (Data Conservancy, arXiv, or their own website). Others made one supplementary file available by one method and a different supplementary file by another.
The results of the pilot show that even though only a small proportion of arXiv submissions (less than 1%) included data deposit to the Data Conservancy, support for online distribution of data sets and other supplementary content is a useful service to some. At present we wouldn’t consider the volume of data to be overwhelming. However, the lack of specific references between many papers and their related data sets, the general lack of metadata, and the preservation challenge presented by a wide array of file formats suggest some non-trivial challenges in providing and sustaining such a service at scale and for the long term.
File format extensions appearing in at least two submissions: .pdf, .tex, .avi, .mpeg, .eps, .txt, .mp4, .mov, .jpg, .gif, .nb, .wmv, .bbl, .dat, .fits, .png, .ps, .rar, .xls