This is a follow-up to Oya Rieger’s announcement to CU-LIB on 2/18/2015.
In early January 2015, the last book from Cornell’s final shipment to Google for digitization was reshelved, concluding a long and successful collaboration between Cornell and Google to digitize well over 500,000 books. Digitization activity spanned a little over six years, from October 2008 through December 2014. Although our digitization with Google has ceased for the foreseeable future, our partnership continues at a lower level of effort: Michelle Paolillo will continue her participation in the Google Quality Working Group (more about this further on) and her coordination of improvements to Google-created images. Michelle will also continue her work with HathiTrust, the repository where we ultimately store our Google-digitized images, to ensure that as many books as feasible can be ingested into that repository.
Google Digitization Overview
Google digitization was organized into four phases, listed below with approximate item counts. Each phase included material from both the library units themselves and those libraries’ holdings in the Annex:
- Phase 1: 244,000 items – Albert Mann Library, Entomology Library, Lee Library at the Geneva Experiment Station, Bailey Hortorium
- Phase 2: 99,000 items – Engineering Library, Mathematics Library, Edna McConnell Clark Library (Physical Sciences), Flower-Sprecher Veterinary Library
- Phase 3: 82,000 items – Martin P. Catherwood Library of Industrial and Labor Relations, Nestlé Hotel Library, Johnson Graduate School of Management Library
- Phase 4: 133,000 items – John M. Olin Library and Uris Library (humanities and social sciences), and Carl A. Kroch Library (Division of Asia Collections)
In total, about 558,000 items were sent for digitization.
Digitized books are added to Google Books. Not every book sent can be digitized, owing to factors such as size, condition, and publishers’ stipulations. Cornell’s overall yield has been high (93%), adding about 519,000 books to Google Books over the course of the project. The level of access there depends in part on the publication date of the title. If a book is under copyright protection, Google provides limited or snippet views of the material. If a book is in the public domain, Google allows viewing in full.
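For readers who want to cross-check the figures, the phase totals and the yield can be reproduced with a few lines of arithmetic. This is only a sketch using the approximate numbers quoted in this post; the variable names are ours and do not correspond to any project system.

```python
# Approximate items sent per phase, as listed in the overview.
phase_items = {
    "Phase 1": 244_000,
    "Phase 2": 99_000,
    "Phase 3": 82_000,
    "Phase 4": 133_000,
}

# Approximate number of books added to Google Books over the project.
digitized = 519_000

total_sent = sum(phase_items.values())
yield_pct = digitized / total_sent * 100

print(f"Total sent: {total_sent:,}")
print(f"Yield: {yield_pct:.0f}%")
```

Because every input is rounded to the nearest thousand, the result should be read as approximate as well.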
Cornell has been depositing the digital books created through our collaboration with Google into HathiTrust. Even though digitization has concluded, the number of items Cornell has deposited into HathiTrust through the Google partnership is somewhat fluid, because ingest into HathiTrust is gated on various quality metrics for individual books. The thresholds that drive this gating can and do change over time, and digital books can also be reanalyzed and improved in quality. Often these changes in quality and gating allow ingest of books that were previously ineligible. The Google Quality Working Group, a group of Google partners focused on improving the quality of books created in the Google Library Partnerships, has had success in working through ingest-related issues in the past year. Michelle will continue both to participate in this group and to work directly with HathiTrust to maximize CUL’s deposits. The HathiTrust repository grows daily; at present it contains over 13 million items, 37% of which are in the public domain. Cornell has currently deposited almost 516,000 items into HathiTrust.
Initially, the viewability of any item in HathiTrust mirrors its viewability in Google Books, but by systematically addressing rights management issues, HathiTrust has begun to open up viewability of many items. Logging into HathiTrust allows members of the Cornell community to take advantage of benefits for member institutions, such as the ability to download PDFs of full-view items, organize personal collections, and use any new services offered in the future.
The Google Digitization Project has been conducted under the sponsorship of Oya Rieger. Of course, the project would not have been possible without the work and skills of many people; rather than repeat individual names, I will draw attention to the last two paragraphs of Oya’s announcement, which are dedicated to this purpose. Truly, everywhere we prepared shipments, CUL staff have shown exemplary hospitality as our preparation teams and equipment occupied your library spaces. I am deeply indebted to the open hearts, able assistance, and helpful advice I have experienced as this project moved about campus. Many thanks for a fruitful collaboration!
Nostalgia for vintage technology seems to be all around us: the Internet Archive just released over 600 DOS games and the Museum of Modern Art has started collecting video games, to name just two high-profile examples of attempts to restore our technological heritage. I remember being a college student in the early 2000s, getting trained to work at my university Helpdesk, learning OS 9 on a blue Apple G3 desktop. The technology I’ve been in contact with since those days has changed dramatically. In the intervening decade, I’ve had several laptops and a few smartphones; I honestly never thought I’d see OS 9 again until just two years ago, when I had the great fortune to join DSPS as the technical lead on a grant restoring complex, interactive, new media artworks in the Goldsen Archive of New Media Art.
In 2012, the library received a grant from the National Endowment for the Humanities to develop a scalable preservation and access framework for a test bed of artworks in the Goldsen collection: approximately 300 works on CD-ROM and DVD-ROM. Looking at the system requirements for these artworks can remind you of your own personal computing history. Do you remember Windows 3.1, System 7, Windows 95, or OS 8? Or when 128 MB of RAM seemed like enormous computing power? Or when you might not actually have 40 MB of free hard drive space left on your system?
The big question facing me is, how do you bring decades-old technology to life again? You may be surprised to find out that many of the tools I use in my day-to-day work analyzing, describing, and restoring this artwork are the same as those used by computer forensics experts. In fact, I often feel like a detective, trying to understand how these artists conceived of their artworks and figuring out the best way to allow others to interact with them again. I use tools that help me uncover what filesystems are present on a disk and, in doing so, determine which hardware it once ran on. Sometimes file names and extensions can be misleading, so I use tools to verify the filetypes of the files included in an artwork. I run virtual machines and emulators, meaning that, for example, I can run an older Macintosh operating system in a window on my Ubuntu Linux machine in order to see how an artwork would have appeared to a user years ago. Sometimes artworks require older versions of QuickTime or a Shockwave plugin, and we have to track those versions down in order to make the artworks run again.
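To give a flavor of what verifying a filetype involves, here is a minimal sketch that ignores a file’s name and extension and looks at its first few bytes instead, in the spirit of utilities such as the Unix `file` command. It is not the tooling used on the grant: the function names and the short signature table are assumptions chosen for illustration, and a real tool recognizes far more formats.

```python
# A small, illustrative table of well-known file signatures ("magic numbers").
# This list is an assumption for the example, not the project's actual tooling.
SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"PK\x03\x04", "ZIP archive"),
]


def sniff_filetype(data: bytes) -> str:
    """Guess a filetype from the leading bytes of a file."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"


def sniff_path(path: str) -> str:
    """Read the first bytes of the file at `path` and guess its type."""
    with open(path, "rb") as handle:
        return sniff_filetype(handle.read(16))
```

With this sketch, a file named `notes.txt` whose contents begin with the PNG signature would be reported as a PNG image, which is exactly the kind of mismatch between name and content worth recording in an artwork’s metadata.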
It’s not surprising that I draw from many of the tools available to computer forensics professionals. Archives are interested in reliable and trustworthy information and need the tools to analyze digital data at a deep level. Digital artworks, like the ones in the Goldsen collection, are highly complex, and we want to ensure that we’ve fully captured this complexity in our metadata so that future scholars can access and use this material.
For more information about the NEH grant, see Interactive Digital Media Art Survey: Key Findings and Observations.
Cornell has established a workflow that allows for the improvement of digital books made through our partnership with Google and deposited in HathiTrust. The workflow responds to alerts from HathiTrust regarding the need for improvement of specific pages, and also makes use of the Single Page Insertion and Replacement workflow that Google has set up for library partners. The results so far are very positive: complete and correct digital volumes, satisfied staff from multiple institutions, and some very happy HathiTrust patrons.
The process most often begins with a patron from a HathiTrust member institution. While using a digital book, they may notice problems with the copy similar to those in the list below:
- foldouts were not unfolded during the scan process, resulting in important diagrams, charts, maps or pictures being lost
- a particular page might be skipped
- the page was moved during image capture, yielding a blurry image
- an operator’s hand or book clamps might have been caught in the frame
HathiTrust has made a feedback link available on every page, located near the middle of the lower navigation bar. The link opens a pop-up form that captures a few quick details: the only text entry asked of the patron is an email address (entering it is highly recommended, since the disposition of the issue will be reported back to this address), along with a radio button, a few check boxes, and an optional note. (The page URL is captured automatically from the browser and does not need to be entered by the user.) Thus, with minimal text entry and a few clicks, the patron makes an informative report that opens a ticket with HathiTrust. Staff at HathiTrust respond to the ticket and facilitate corrective measures. In the case of books created through the Google Library partnership (which make up most of Cornell’s deposits), staff at HathiTrust first contact Google directly to see if the problematic pages can be rectified. If not, they contact the HathiTrust member institution to let them know that the digital book needs improvement.