
Developments at the HathiTrust Research Center (HTRC)

The HathiTrust Research Center has announced an important development toward making the full HathiTrust corpus available to scholars who use computational methods (“text mining”) in their research.  The press release also includes a timeline for future steps that will open the entire HathiTrust corpus, regardless of copyright status, to scholars using computational methods.

What is HathiTrust and the HathiTrust Research Center?  HathiTrust is a partnership of academic and research institutions that together continuously build a digital library (currently over 14 million volumes) of works digitized from libraries around the world.  If a book is not subject to copyright limitations, it can also be read online.  Items that are subject to viewing restrictions are still indexed in full, so that even though scholars cannot read them online, they can search the text within the covers and retrieve both the titles of books that contain a given term and the page numbers where that term appears.  This full-text indexing can be leveraged for broader computational analysis (often called “text mining”) by scholars who find these methods useful.  The HathiTrust Research Center (HTRC) is a joint effort of Indiana University and the University of Illinois, which partner with HathiTrust to provide the software, infrastructure and computational scale for this scholarly method, using the HathiTrust Digital Library as the primary source of text to be analyzed.
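For readers new to the idea, the kind of full-text index described above can be illustrated with a minimal sketch. This is not HathiTrust's actual search system; the function names (build_index, search) and the sample volumes are hypothetical, chosen only to show how a term can be mapped to titles and page numbers without displaying the page text itself.

```python
from collections import defaultdict

PUNCTUATION = '.,;:!?"()'

def build_index(volumes):
    """Map each lowercased term to {title: set of page numbers}.

    `volumes` is a dict of title -> list of page texts (hypothetical data).
    """
    index = defaultdict(lambda: defaultdict(set))
    for title, pages in volumes.items():
        for page_number, text in enumerate(pages, start=1):
            for raw in text.lower().split():
                term = raw.strip(PUNCTUATION)
                if term:  # skip tokens that were only punctuation
                    index[term][title].add(page_number)
    return index

def search(index, term):
    """Return {title: sorted page numbers} for volumes containing `term`."""
    hits = index.get(term.lower(), {})
    return {title: sorted(pages) for title, pages in hits.items()}

# Hypothetical two-volume "library" used only for illustration.
sample_volumes = {
    "Volume A": ["The whale surfaced.", "No sign of it."],
    "Volume B": ["A whale, again."],
}
sample_index = build_index(sample_volumes)
print(search(sample_index, "whale"))
```

The search result contains only titles and page numbers, which is why an index like this can be offered even for volumes whose text cannot be shown online.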

What exactly is the recent development, and who sees immediate benefit?  The HTRC has made its analytics portal available to anyone who creates an account.  In the portal, scholars can create a collection of their own and run computational algorithms against that collection.  The portal also offers the HTRC bookworm (open source software tied to a segment of the corpus), the data capsule (a virtual machine environment in which a scholar can load their own tools) and several data sets.  With all of these tools, scholars have been limited to the segment of the corpus in the public domain.  But beginning this summer, successfully funded Advanced Collaborative Support (ACS) proposals will pilot access to the full corpus (regardless of copyright status) in their projects.  In real numbers, that means that instead of about 5 million books, the ACS scholars can work with the full 14+ million currently found in HathiTrust.

What is the timeline for future steps? The details are in the announcement, but briefly, the plans for the year ahead are:

  • Immediately: Advanced Collaborative Support (ACS) grant awardees will have access to the full corpus.
  • Fall 2016: A new features data set, derived from the full collection at both volume level and page level, will be released.
  • Early 2017: The full HathiTrust corpus is anticipated to become available through the data capsule for general use.

Please consider me available for questions, concerns and guidance on getting started with the HTRC.


