The HathiTrust Research Center (HTRC) held its third annual UnCamp on March 30 and 31. The HTRC continues to demonstrate its commitment to evolving tools and functionality of interest to scholars in the digital humanities, this year adding two tools to the environments, algorithms, and datasets it already offers. The main themes emerging throughout the conference, evident both in announcements of new developments and in various conversations, were the sustainability of the HTRC as a continued presence in the scholarly sphere, and the adoption of its computational environment by more scholars in their work. The latter topic was both demonstrated in presentations of scholars’ work and engaged through discussion of ways in which the HTRC could improve its offerings by lowering barriers and enhancing adoption.
The HTRC Bookworm is a tool much like the Google Ngram Viewer, but it analyzes the HTRC indices rather than the Google data set. Given the similarities between the two, it is not surprising that some of the same minds were at work in the development of both tools (Erez Lieberman Aiden and Jean-Baptiste Michel, among others). The resulting Bookworm is available as an open source project (code available). The HTRC’s implementation of Bookworm is more graphically oriented, allowing a fairly complex set of constraints to be placed on the data through an intuitive and clean visual interface. Currently, it is limited to unigrams. There was a lot of interest at the conference in allowing the HTRC Bookworm to be constrained to collections of one’s own making, so development of this feature is likely to begin in the year to come.
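Under the hood, a Bookworm-style unigram query amounts to counting a word’s occurrences per year, normalized by the total tokens for that year, under whatever constraints the user sets. A minimal sketch in Python — the toy corpus and field names here are illustrative, not the HTRC’s actual schema:

```python
from collections import defaultdict

# Toy records standing in for indexed volumes; the fields are
# illustrative, not the actual HTRC Bookworm schema.
volumes = [
    {"year": 1850, "tokens": ["whale", "ship", "whale"]},
    {"year": 1850, "tokens": ["ship", "sea"]},
    {"year": 1851, "tokens": ["whale", "whale", "sea", "whale"]},
]

def unigram_frequency(word, volumes):
    """Relative frequency of `word` per year -- the series a Bookworm chart plots."""
    counts = defaultdict(int)
    totals = defaultdict(int)
    for vol in volumes:
        for token in vol["tokens"]:
            totals[vol["year"]] += 1
            if token == word:
                counts[vol["year"]] += 1
    return {year: counts[year] / totals[year] for year in totals}

freqs = unigram_frequency("whale", volumes)
print(freqs)  # {1850: 0.4, 1851: 0.75}
```

The real service adds the metadata constraints (date ranges, genres, collections) as filters over which volumes are counted, which is exactly the facet the HTRC’s visual interface exposes.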
The Data Capsule (located in the “portal” alongside the basic algorithms) also made its debut, along with a supporting tutorial. The Data Capsule is a virtual machine environment that a scholar can customize with various tools, lock, and then use to computationally address the texts of the HTRC. The scholar can then unlock the capsule and retrieve the results of the computation. Overall, this environment supports scholars using computational methods while preventing reconstruction of the corpus or any given work in it. Currently, analysis is restricted to books in full view, but the Data Capsule is engineered specifically for the security that will allow scholars to computationally address works in limited view as well. Legal machinery is already underway to pave the way for this important evolution.
Sustaining HTRC services
The HTRC has been expending considerable effort to grow from an experimental pilot into a reliable service. Although there is open acknowledgement that this road will be a long one, there are some welcome developments in this direction. HTRC has hired Dirk Herr-Hoyman (Indiana University – Bloomington) as Operations Manager, who is charged with bringing greater rigor to the HTRC service offerings (clearer versioning, increased security, responsive user help, clear development roadmaps, regular and documented/announced refreshes of primary data, etc.). Administrative development of an MOU between the HTRC and HathiTrust is another important step, bringing greater clarity to their separateness and relationship, and to the roles and responsibilities of each in terms of data management and security.
Adoption of the HTRC services by scholars using computational methods was a central concern. Frequency of use is a measure of relevance, and programs of greater relevance make a better business case for being worthy of our efforts. Throughout the conference, discussions of what might be needed to facilitate adoption of the HTRC service offerings by scholars elicited many concrete steps that could help:
- User testing/UX of the portal. Scholars, and especially librarians, feel that the user experience could benefit from some redesign. Currently there are at least four base URLs that lead to various tools, user experiences, and documentation, and none of these share the same branding or interface design, leading to confusion on the part of scholars as to what the HTRC is and what it offers. It is equally unclear where to go for help on each tool, where to post questions, and how to ask for features or development when one’s project falls outside the current scope of a tool and requires some extension. There was some talk of rebranding the HTRC services as the Secure HathiTrust Analytical Research Commons (SHARC) and placing all service offerings under that umbrella, as well as of some immediate and longer-term steps that might move things in a positive direction.
- Discussions also revealed the need to strengthen the link between developments at Google and their effects on HathiTrust and HTRC. Google improves its books periodically, developing image corrections and improved OCR at scale. (In fact, Google has recently released a new OCR engine and is re-processing many books that use the long s, including fraktur texts. Early results look very promising.) These improved volumes are re-ingested into HathiTrust. The improvements can be better leveraged by:
- Improved communication between HathiTrust and HTRC on improvements underway. The Google Quality Working Group might be a good place to coordinate some of this information. If these updates are systematically conveyed to HTRC, it can direct precious resources into other efforts.
- Updates of the data HTRC receives from HathiTrust that are coordinated with major releases of material re-processed by Google.
- Effort should be directed to building relationships. The suggestions in this conversation were that in the short term, HTRC might supply more advising on grants and the grant process that would leverage HTRC services. In the mid to long term, HTRC might seek international partnerships and relationships, and might also leverage librarians and scholars as ambassadors to professional societies to raise awareness.
There were two keynotes that described scholarly projects and nine projects touched upon in lightning rounds.
Michelle Alexopoulos, “Off the Books: What library collections can tell us about technological innovation…” Michelle shared her perspective as an economist working with HTRC data to discover patterns in the time between a technology’s invention and its adoption, to describe the economic impacts, and to explore the ways in which a specific technology might influence other technological developments. Her project employed algorithmic selection of a large corpus based on MARC attributes, along with Bookworm/nGram data.
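Corpus selection of the kind Michelle described can be sketched as a filter over catalog records. The records, field names, and criteria below are invented for illustration; a real pipeline would parse actual MARC data (e.g. with a library such as pymarc) rather than plain dictionaries:

```python
# Toy catalog records; the keys loosely mirror MARC concepts
# (publication date, subject headings), but the data and the
# selection criteria are invented for illustration.
records = [
    {"id": "b1", "pub_year": 1910, "subjects": ["Technology", "Engines"]},
    {"id": "b2", "pub_year": 1925, "subjects": ["Poetry"]},
    {"id": "b3", "pub_year": 1932, "subjects": ["Technology", "Economics"]},
]

def select_corpus(records, subject, year_range):
    """Keep records whose subject headings and publication year both match."""
    lo, hi = year_range
    return [
        r["id"]
        for r in records
        if subject in r["subjects"] and lo <= r["pub_year"] <= hi
    ]

corpus = select_corpus(records, "Technology", (1900, 1940))
print(corpus)  # ['b1', 'b3']
```

The appeal of this approach is that the corpus definition is reproducible: the same criteria applied to the same catalog always yield the same set of volumes for downstream nGram analysis.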
Erez Lieberman Aiden, “The once and future past” Erez was on the original team of people who created the nGram viewer code and coined the term Culturomics to describe the intersection of patterns they were seeing between culture and trends revealed through algorithmic analysis of texts. He recapped the scholarly impact of the nGram viewer, its open source successor (Bookworm), and the provocative notion that we can use this data predictively as well as retrospectively.
Lightning rounds included Natalie Houston and Neal Audenaert’s “VisualPage: A Prototype Framework for Understanding Visually Constructed Meaning in Books”. Natalie also visited Cornell to present and discuss about her work on 4/16 as a part of the Conversations in the Digital Humanities series.
Michele Hamill of CUL’s fantastic Conservation Department asked me to guest blog about audiovisual preservation as part of Preservation Week 2015. Pardon the cross-posting, but I thought I’d share it here on the DSPS Press blog as well. Wishing you a happy birthday and a wonderful Charter Day, Cornell University, as well as a wonderful weekend to you all.
First of all, I’m honored to be a guest on our Library’s Conservation Department blog, as they are a great team doing magical things. As we confront audiovisual preservation and the potentially catastrophic loss of materials on magnetic media, proper conservation becomes even more important while we chart out solutions that may emerge from our campus-wide AV Preservation Initiative.
Both UNESCO’s Blue Ribbon Task Force publication (Sustainable Economics for a Digital Planet, 2010) and the Library of Congress estimate that the vast majority of materials housed on magnetic tape (cassettes, open-reel audiotape, VHS, etc.) will be lost in the next 10 years due to degradation and playback obsolescence. This includes materials ranging from field recordings of cultural events in dying languages to your own home movies of grandparents or children.
Cornell University Library’s Collection Development Executive Committee has set up a preservation fund (allocated through a grant-based system) awarded to save fragile, unique, and heavily used collections, and, due to issues with legacy AV content, much of that fund has gone to digitization of AV collections. As an example, I’m currently working on digitizing a large collection of VHS tapes for the Africana Library: unique lectures given at Cornell in the past. Last year this collection was moved to the Annex, as the tapes are the only copies in existence and are no longer in circulation.
While digitization is key to preserving older formats, preservation is incredibly challenging for born-digital content as well. Digital content, while often easier to use and access, is incredibly fragile and subject to many problems, such as bit rot and errors, proprietary and complex formats and file types, and costly storage. The world is creating digital content at a staggering pace, resulting in petabytes of possibly important, possibly disposable content. How do we deal with this in our work, or even in our personal collections of video and photos?
The Library of Congress has provided a thorough resource for individuals to get a handle on the digital content they are creating, as well as digitizing to share with family and friends across the globe. This is a rapidly growing need for people everywhere, but how do we decide what to keep, and how much? Witness.org stands out as a good example of an organization promoting a more curatorial culture for our content at large, and for a purpose. They provide a guide to archiving content from a journalism/activist perspective, from creation to preservation and access.
Working in a memory institution, I often feel like I’m helping usher content from the past into the future and that is a tremendously gratifying feeling. ‘This work will outlive us,’ is something I often hear said in libraries and archives and while that is true, there is a huge amount of effort and a lot of tough decisions that go into conservation, preservation, and access. Whether it’s a beautiful tome from the 17th century or video of one of the last known public appearances of Jimmy Hoffa, it takes detailed work, resources, and careful planning to keep these things alive. In reality, history is written by every one of us. What’s your story?
Digital Scholarship and Preservation Services had the good fortune to host Noah Hamm, Public Services Assistant at Mann Library, on a brief DSPS fellowship from 9/2014 to 2/2015. One current need that arises as we seek to support scholars is their need for tools that visualize data as maps. Given Noah’s interest in the visualization of location data, it seemed a good fit to leverage his explorations as a way to surface and broadly categorize appropriate tools. Below is a general overview of his findings.
DSPS fellowships offer a great opportunity to reach out and develop new skills, explore different aspects of the library system, meet colleagues and get involved in the inner workings of various ongoing library projects. My fellowship involved audio-visual preservation, digital collections usage statistics, and digital humanities topics like concept mapping and text mining. I also had some time to devote to an area of personal interest: online mapping and data visualization platforms. Below is a report on my basic impressions of some of the mapping tools that I explored.
The first impression of a newcomer to the landscape of geospatial mapping and data visualization software is that it really is an uncharted jungle. There are many different applications for geospatial visualization software, from complex spatial analyses to mapping the family vacation. And because there seems to be no authoritative or comprehensive accounting of all the different available platforms and their features, it can seem like a real wilderness.
There are some big-name companies that have developed geospatial mapping platforms, such as Esri’s ArcGIS.com, Microsoft’s Power Map for Excel, and Google’s Fusion Tables. If you want to access existing maps and survey data and customize the appearance of your results, ArcGIS.com is a great choice. If you need to geo-reference your own set of addresses or location data, Google Fusion Tables is a great choice. And if you want to build a slideshow of your map visualizations and you’re familiar with Excel, then Power Map is for you. A savvy user can combine the advantages of these different programs and get great results.
There are powerful open-source programs like Quantum GIS (QGIS) and GRASS for those who want maximal control over the layers of data and imagery going into their projects and greater options for customizing the final appearance, and who are willing to climb the fairly steep learning curve associated with these programs’ interfaces. They are not recommended for a single-use need, but are definitely worth the time and effort required to master for a long-term project or repeated use.
For those who have simpler data sets and don’t need or want to download GIS software, online platforms like CartoDB and Viewshare offer mapping and data visualization. Mapping US census data is one of the nicest features of SimplyMap, and nearly all of the programs I’ve mentioned offer embed code to place a visualization within a specific web environment.
The best fit for a new user will depend largely on the content and complexity of their task and their level of interest in, and facility with, manipulating data sets. My best advice is to take advantage of the wonderful GIS support and resources in the Library system. There is nothing like a face-to-face conversation with an expert guide to get you pointed in the right direction.
Thanks for the opportunity to see another side of this great Library system!
And thank you, Noah, for sharing your skills and interests with us!
This is a follow up to Oya Rieger’s announcement to CU-LIB on 2/18/2015.
In the early weeks of January 2015, the last book from Cornell’s last shipment to Google for digitization was reshelved, bringing to a conclusion a long and successful collaborative effort by Cornell and Google to digitize well over 500,000 books. Digitization activity spanned just over six years, from October 2008 through December 2014. Although our digitization with Google has ceased for the foreseeable future, our partnership continues at a lower level of effort: Michelle Paolillo will continue her participation in the Google Quality Working Group (more about this further on) and her coordination of improvements to Google-created images. Michelle will also continue her work with HathiTrust, the repository where we ultimately store our Google-digitized images, assuring that as many books as feasible can be ingested into that repository.
Google Digitization Overview
Google Digitization was organized into four phases. Each phase included material from both library units and those libraries’ holdings in the Annex. (Numbers for each phase are approximate):
- Phase 1: 244,000 items – Albert Mann Library, Entomology Library, Lee Library at the Geneva Experiment Station, Bailey Hortorium
- Phase 2: 99,000 items – Engineering Library, Mathematics Library, Edna McConnell Clark Library (Physical Sciences), Flower-Sprecher Veterinary Library
- Phase 3: 82,000 items – Martin P. Catherwood Library of Industrial and Labor Relations, Nestlé Hotel Library, Johnson Graduate School of Management Library
- Phase 4: 133,000 items – John M. Olin Library and Uris Library (humanities and social sciences), and Carl A. Kroch Library (Division of Asia Collections)
In total, about 558,000 items were sent for digitization.
Digitized books are added to Google Books. Be aware that not every book sent can be digitized, due to various factors (size, condition, publishers’ stipulations, etc.). Cornell’s overall yield has been high (93%), adding about 519,000 books to Google Books over the course of the project. The level of access there is based in part on the publication date of the title: if a book is under copyright protection, Google provides limited or snippet views of the material; if a book is in the public domain, Google allows viewing in full.
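The yield figure is straightforward arithmetic over the approximate totals reported above:

```python
sent = 558_000       # approximate items sent for digitization
digitized = 519_000  # approximate books added to Google Books

yield_rate = digitized / sent
print(f"{yield_rate:.0%}")  # roughly 93%
```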
Cornell has been depositing the digital books created through our collaboration with Google into HathiTrust. Even though digitization has concluded, the number of items Cornell has deposited into HathiTrust through the Google partnership remains somewhat fluid. This is because ingest into HathiTrust is gated by various quality metrics related to individual books. The thresholds that drive this gating can and do change over time, and digital books can also be reanalyzed and improved in quality. Often these changes in quality and gating allow ingest of books that were previously ineligible. The Google Quality Working Group, a group of Google partners focused on quality improvements to the books created through the Google Library Partnerships, has had success working through ingest-related issues in the past year, and Michelle will continue to participate both in this group and in her efforts directly with HathiTrust to maximize CUL’s deposits. The HathiTrust repository grows daily; at present it contains over 13 million items, 37% of which are in the public domain. Cornell has currently deposited almost 516,000 items into HathiTrust.
Initially, the viewability of any item in HathiTrust mirrors that of Google’s, but through systematically addressing issues in rights management, HathiTrust has begun to open up viewability of many items. Logging into HathiTrust will allow the members of the Cornell community to take advantage of benefits for member institutions, such as the ability to download PDFs of full-view items, organize personal collections, and any new services offered in the future.
The Google Digitization Project has been conducted under the sponsorship of Oya Rieger. Of course, the project would not have been possible without the work and skills of many people; rather than repeat individual names, I will draw attention to the last two paragraphs of Oya’s announcement, which are dedicated to this purpose. Truthfully, everywhere we prepared shipments, CUL staff showed exemplary hospitality as our preparation teams and equipment occupied your library spaces. I am deeply indebted to the open hearts, able assistance, and helpful advice I have experienced as this project moved about campus. Many thanks for a fruitful collaboration!
Nostalgia for vintage technology seems to be all around us — the Internet Archive just released over 600 DOS games and the Museum of Modern Art has started collecting video games, to name just a few high-profile examples of attempts to restore our technological heritage. I have this memory of being a college student in the early 2000s, getting trained to work at my university Helpdesk, learning OS 9 on a blue Apple G3 desktop. The technology I’ve been in contact with since those days has changed dramatically. In the intervening decade, I’ve had several laptops and a few smart phones; I honestly never thought I’d see OS 9 ever again until just two years ago when I had the great fortune to join DSPS as the technical lead on a grant restoring complex, interactive, new media artworks in the Goldsen Archive of New Media Art.
In 2012, the library received a grant from the National Endowment for the Humanities to develop a scalable preservation and access framework for a test bed of artworks in the Goldsen collection — approximately 300 works on CD-ROM and DVD-ROM. Looking at the system requirements for these artworks can remind you of your own personal computing history. Do you remember Windows 3.1, System 7, Windows 95, or OS 8? Or when 128 MB of RAM seemed like enormous computing power? Or when you might not actually have 40 MB of free hard drive space left on your system?
The big question facing me is: how do you bring decades-old technology to life again? You may be surprised to learn that many of the tools I use in my day-to-day work analyzing, describing, and restoring these artworks are the same ones used by computer forensics experts. In fact, I often feel like a detective, trying to understand how these artists conceived of their artworks and figuring out the best way to allow others to interact with them again. I use tools that help me uncover what filesystems are present on a disk and, in doing so, determine which hardware it once ran on. Sometimes file names and extensions can be misleading, so I use tools to verify the filetypes of the files included in an artwork. I run virtual machines and emulators, meaning that, for example, I can run an older Macintosh operating system in a window on my Ubuntu Linux machine in order to see how an artwork would have appeared to a user years ago. Sometimes artworks required older versions of QuickTime or a Shockwave plugin, and we have to track down those older versions in order to make them run again.
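The filetype-verification step usually means checking a file’s leading “magic bytes” rather than trusting its name or extension. A minimal sketch, with a tiny illustrative signature table (real forensic tools, like the Unix `file` utility, know thousands of signatures):

```python
# A small, illustrative table of file signatures ("magic bytes");
# real-world tools consult much larger databases of these.
SIGNATURES = {
    b"GIF8": "GIF image",
    b"\x89PNG": "PNG image",
    b"%PDF": "PDF document",
    b"RIFF": "RIFF container (e.g. WAV or AVI)",
}

def identify(head):
    """Match leading bytes against the known signatures."""
    for magic, name in SIGNATURES.items():
        if head.startswith(magic):
            return name
    return "unknown"

def sniff_file(path):
    """Read the first bytes of a file and identify it, ignoring its extension."""
    with open(path, "rb") as f:
        return identify(f.read(8))
```

So a file named `movie.mov` whose first bytes are `GIF8` would be reported as a GIF image, no matter what its extension claims — exactly the kind of mismatch that misleading file names can hide.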
It’s not surprising that I draw from many of the tools available to computer forensics professionals. Archives are interested in reliable and trustworthy information and need the tools to analyze digital data at a deep level. Digital artworks, like the ones in the Goldsen collection, are highly complex, and we want to ensure that we’ve fully captured this complexity in our metadata so that future scholars can access and use this material.
For more information about the NEH grant, see Interactive Digital Media Art Survey: Key Findings and Observations.
Cornell has put in place a workflow that allows for the improvement of digital books made through our partnership with Google and deposited in HathiTrust. The workflow responds to alerts from HathiTrust about specific pages needing improvement, and also engages the Single Page Insertion and Replacement workflow that Google has set up for library partners. The results so far are very positive: complete and correct digital volumes, satisfied staff at multiple institutions, and some very happy HathiTrust patrons.
The process begins most often with a patron from a HathiTrust member institution. While using a digital book, they may notice problems with the copy similar to those in the list below:
- foldouts were not unfolded during the scan process, resulting in important diagrams, charts, maps, or pictures being lost
- a particular page was skipped
- a page moved during image capture, yielding a blurry image
- an operator’s hand or book clamps were caught in the frame
HathiTrust has made a feedback link available on every page, located near the middle of the lower navigation bar. The link yields a pop-up form that captures a few quick details: a radio button, a few check boxes, an optional note, and an email address, the only required entry (and highly recommended, since the disposition of the issue will be reported back to this address). The page URL is captured automatically from the browser and is not required from the user. Thus, with minimal text entry and a few clicks, the patron makes an informative report that opens a ticket with HathiTrust. Staff at HathiTrust respond to the ticket and facilitate corrective measures. In the case of books created through the Google Library partnership (comprising most of Cornell’s deposits), staff at HathiTrust first contact Google directly to see if the problematic pages can be rectified. If not, they contact the HathiTrust member institution and let them know that the digital book needs improvement.