HathiTrust Research Center UnCamp 2013

HathiTrust currently contains about 10.8 million volumes, approximately 32% of which are in the public domain. (HathiTrust provides a variety of contemporary snapshots of its holdings.) The HathiTrust Research Center (HTRC) is an independent but associated entity that currently gives nonprofit and educational users computational access to published works in the public domain. HTRC also plans to offer similar access, on limited terms, to in-copyright works held by HathiTrust, possibly through a virtual machine for authorized scholars.

On September 8-9, 2013, scholars, librarians, project managers and information technologists converged on the University of Illinois iHotel for HTRC UnCamp 2013. I came for a variety of reasons: to represent Cornell, to learn about future plans of the HTRC, to gather examples of projects in the Digital Humanities that use computational approaches, and to network with colleagues on a variety of side-issues related to ingest, quality metadata and all things HathiTrust. Although this is only the second year of this conference, it is easy to note many improvements over the inaugural year. Programming was tighter, making much better use of our time. The conversation appeared more open and more driven by the needs of the participants. We discussed not just the tools and what they did, but where they might be a good fit and where they might not, and, perhaps most importantly, what adjustments are needed to make the tools more useful, more usable and more transparent about what they do. Presentations by scholars of their computationally-based humanities projects abounded, both from those who used the HTRC tool set and from those who used non-HTRC tools, occupying two lightning rounds and two keynotes.

As you might imagine, this stimulating environment afforded many lessons, and it is difficult to select a mere few. I’ll try to summarize the broadly emerging themes:

  • The tools are already providing scholars the means to credibly re-test traditional assumptions of their fields. Close readers of a subject can develop an intuitive sense of trends related to their interests, but computational access can test these assertions with actual metrics. After all, who can claim to have read all of Victorian poetry? A close reader might spend a lifetime doing this. Computationally, this can be accomplished through distant reading by a small team of people with specific technical and scholarly expertise. Computational approaches to inquiry sometimes confirm traditional assumptions, but just as often seem to provocatively re-open issues for discussion, moving the conversation beyond conventional wisdom.
  • Move beyond the bag of words. Digital books consist, for the most part, of page images and associated Optical Character Recognition (OCR) text. OCR provides the text flow that is mined when computational tools are used, but that text is often not structured in any meaningful way; it doesn’t contain information about paragraphs or line length. A book is not merely a “bag of words”, however, and deep understanding of any text-based material must move beyond basic text flow. Serials and newspapers have articles, but they also have ads, pictures, charts, and graphs; poetry has stanzas, meter, feet and rhyme scheme. Structuring the text flow in various meaningful ways can help scholars move beyond the “bag of words”, opening up new possibilities in the digital humanities.
  • Collaborate, Collaborate, Collaborate. The people who are comfortable with the use and adaptation of computational tools are most often technologists and statisticians. The people with questions in various fields of the humanities are humanities scholars. Successful projects require that these people work together to ask and answer questions. The humanities have traditionally rewarded scholars who make individual contributions, but computational projects are best accomplished in a collaborative setting. Scholars in the humanities who are interested in computational approaches might want to consider working within the lab model commonly found in the sciences, where people with diverse skill sets come together to further inquiry.
  • Facilitation is crucial, especially in the present transition. As people with different skill sets come together, they need to find ways to communicate effectively. The current descriptions of the HTRC tools are very technical, and it was generally acknowledged that a “gloss-description” is needed to help humanities scholars determine what each tool offers. Similarly, as technologists and humanists work together, they may find themselves speaking different languages. Setting up facilitated conversations can help. As the culture and curriculum of the humanities evolve towards adoption of computational approaches, there may be less need for this, but at present it is vital to making digital humanities collaborations effective.
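
To make the “bag of words” theme above a little more concrete, here is a minimal Python sketch of my own (not an HTRC tool). The sample strings and the function names `bag_of_words` and `line_lengths` are made up for illustration. It shows two short texts that look identical once reduced to word counts, even though one has the two-line shape of a verse and the other does not.

```python
from collections import Counter
import re


def bag_of_words(text):
    """Count word occurrences, discarding word order and line layout."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def line_lengths(text):
    """Return the number of words on each line: one simple structural feature."""
    return [len(line.split()) for line in text.splitlines()]


# Two hypothetical snippets built from the same words.
stanza = "the rose is red\nthe sky is blue"   # two lines of four words each
prose = "the sky is red the rose is blue"     # one run of eight words

# The bags of words are identical, so word counts alone cannot tell them apart.
same_bag = bag_of_words(stanza) == bag_of_words(prose)

print(same_bag)
print(line_lengths(stanza))
print(line_lengths(prose))
```

Even a feature as simple as words per line separates the two texts, which is the kind of structural information the participants hoped to see preserved alongside the OCR text flow.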

I will post links to the full conference notes as they become available.  In the meantime, feel free to refer to the information on the conference page, and the HTRC wiki.  Your comments and questions are welcome.

Michelle Paolillo


