Cornell has recently deposited its Making of America (MOA) collection into HathiTrust. Through this process, we have transitioned one of our oldest digital collections to a contemporary architecture, improved the quality of both page images and optical character recognition (OCR) textflow, and made the content available to scholars using both traditional and computational methods. Our experience illustrates how advances in technology over two decades can improve access to online content, and how migration of legacy content gives us natural points of opportunity to do just that.
About the collection: The Making of America Project was an early collaboration between Cornell and the University of Michigan that pioneered methods of digital preservation. In 1995, with funding from the Andrew W. Mellon Foundation, the two institutions developed a digital collection documenting American social history from the antebellum period through Reconstruction. At Cornell, well over one thousand volumes were selected, scanned, and made available online as a collection in the repository named “Windows on the Past”, enabled by an architecture called DLXS.
In the intervening two decades, there have been many changes in the capabilities of repository architectures and the maturity of inter-institutional efforts. As with all technology, DLXS entered sunset. Its last release was in 2010, the same year that Cornell joined HathiTrust. When considering how MOA would best persist, deposit to HathiTrust was especially attractive. HathiTrust is a TRAC-certified digital preservation repository, aligning well with Cornell’s commitment to this content. Preservation costs for open material like MOA are underwritten by all members (see the HathiTrust Digital Library’s (HTDL) cost model). Items can be searched by bibliographic details as well as in full text. Further, the full-text indices are shared with the HathiTrust Research Center, for access by scholars using computational methods. Deposit to HathiTrust enriches a publicly available common good, leverages new methods for scholarship, and gives the MOA “Windows on the Past” content a clear path into the future. But although the gains were clear, the path was not. Retrofitting legacy materials for contemporary repositories is a territory with some unique challenges, and meeting them took creativity, diverse skills, and a fair amount of persistence.
What is a volume? The problem of records. The first challenge for us was that the DLXS architecture held no association or arrangement reflecting physical volumes. HTDL requires deposits to be accompanied by item-level bibliographic metadata. When we deposit volumes, we create this metadata from our catalog, typically using a combination of title-level and item-level identifiers. However, DLXS metadata lacked any such identifiers. Additionally, the original books were disbound before scanning and discarded afterwards. New pages were printed from the scans, but they weren’t always bound in the same enumeration as the old volumes. This led to a fairly fluid relationship between the pages as seen in DLXS and the actual bound volumes on our shelves. Our mapping of metadata from our catalog, then, was a combination of automatic and manual processes, with careful troubleshooting of the lingering cases where volume enumeration didn’t match up. Along the way, we learned a tremendous amount about our catalog records, including how unevenly we recorded our digital project information, and where to find all of the documentation that enables a true harvest!
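The split between automatic and manual mapping described above can be illustrated with a minimal sketch. It is not the process we actually ran: the field names (`id`, `title`, `enumeration`, `item_id`) and the matching rule (normalized title plus enumeration) are assumptions chosen for illustration. Items with exactly one candidate catalog record are matched automatically, and everything else is queued for a person to resolve.

```python
import re


def normalize(text):
    """Lowercase and collapse punctuation/whitespace so 'V.12 (1856)' and 'v. 12 (1856)' compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def match_items(legacy_items, catalog_records):
    """Pair legacy items with catalog item records; return (matched, needs_review).

    All field names here are hypothetical, not actual DLXS or catalog fields.
    """
    # Index catalog records by (title, enumeration) after normalization.
    index = {}
    for rec in catalog_records:
        key = (normalize(rec["title"]), normalize(rec["enumeration"]))
        index.setdefault(key, []).append(rec)

    matched, needs_review = {}, []
    for item in legacy_items:
        key = (normalize(item["title"]), normalize(item["enumeration"]))
        candidates = index.get(key, [])
        if len(candidates) == 1:
            # Unambiguous: safe to map automatically.
            matched[item["id"]] = candidates[0]["item_id"]
        else:
            # Zero or several candidates: leave for manual resolution.
            needs_review.append(item["id"])
    return matched, needs_review
```

The point of the design is that ambiguous cases are never guessed at: anything short of a single clean match lands in the manual queue.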
Improving Page Images. As in any digitization project, the original process for page-by-page capture had some hiccups. Sometimes a page was missed, or the image capture itself had quality issues. Pages that could be improved upon were captured a second time. We are certain that the intention was always to integrate these into the MOA pages, but during migration we discovered that this process had not been completed. The point of migration allowed us to insert these improvements where they belonged while we were restructuring the volumes for packaging.
OCR improvement. One of the frustrations some had with the original project was that the OCR did not faithfully reflect the textflow of the original pages. Although every attempt was made in the original project to use the best OCR engines available, the simple truth is that OCR of the mid-1990s was not even close to the state of the art today. OCR quality affects search inside the book and, for the print disabled, the accessibility of the content itself. Once again, leveraging the migration process as a point of opportunity, we were able to improve the OCR to modern standards.
Structural metadata. Users of online material don’t directly see structural metadata, but they use it all the time. Structural metadata allows navigation through titles, volumes, issues, and chapters, so that readers can easily find the page they are seeking. DLXS already held good structural metadata from the original project, and we didn’t want to lose it. At the same time, this information had to be translated into markup that the HTDL could use, and divided into packages that reflected the separate physical volumes. In the end, scripting came to our rescue, allowing us to reliably repurpose the old structural metadata at scale.
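To give a sense of what dividing structural metadata into volume packages involves, here is a minimal sketch. It is not the script we used. The record shape (`seq`, `label`), the `volume_of_page` lookup, and the output layout are all assumptions for illustration, and the sketch stops short of writing the actual markup that HTDL ingests.

```python
from collections import defaultdict


def split_structure_by_volume(pages, volume_of_page):
    """Group page-level structure entries by physical volume.

    pages: list of dicts with a legacy 'seq' and an optional 'label' (hypothetical shape).
    volume_of_page: dict mapping legacy 'seq' to a volume identifier (hypothetical).
    Returns {volume_id: [entries]}, with 'seq' restarted at 1 inside each volume.
    """
    by_volume = defaultdict(list)
    # Walk pages in their legacy order so order within a volume is preserved.
    for page in sorted(pages, key=lambda p: p["seq"]):
        by_volume[volume_of_page[page["seq"]]].append(page)

    packages = {}
    for volume_id, volume_pages in by_volume.items():
        packages[volume_id] = [
            {
                "seq": new_seq,               # position within this volume
                "legacy_seq": page["seq"],    # kept so the mapping can be audited
                "label": page.get("label"),   # e.g. a chapter or section heading
            }
            for new_seq, page in enumerate(volume_pages, start=1)
        ]
    return packages
```

Keeping the legacy sequence number alongside the new one makes it possible to trace any page in a package back to where it sat in the old system.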
Package and upload. The narrative above makes our experience sound like a smooth progression in which all volumes moved through an assembly line from start to finish. In actuality, we managed the deposit in three iterative phases, each of which took a progressively larger set of volumes through the whole process. This allowed us to “stop the line” between sets, learn from any failures, and make necessary improvements in the assembly line itself. The first set was quite small, about 20 volumes. We made many mistakes, and as a result experienced a variety of failures during upload. Working through corrections gave us the opportunity to improve our processes and also to anticipate scaling up the things that were already working well. The second set was about 100 volumes. We experienced fewer failures, and made further adjustments. The last set was over 1,000 volumes, deposited without incident.
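The phased approach can be sketched in a few lines. This is an illustration of the idea, not our actual tooling: `plan_phases`, `run_phase`, and the `deposit` callable are hypothetical names, and the phase sizes simply echo the approximate numbers above.

```python
def plan_phases(volume_ids, early_sizes):
    """Split volumes into progressively larger phases; the final phase takes the remainder."""
    phases, start = [], 0
    for size in early_sizes:
        phases.append(volume_ids[start:start + size])
        start += size
    phases.append(volume_ids[start:])
    return [phase for phase in phases if phase]


def run_phase(phase, deposit):
    """Attempt every volume in one phase and collect failures instead of aborting.

    deposit: a caller-supplied function (hypothetical) that raises on failure.
    """
    failures = []
    for volume_id in phase:
        try:
            deposit(volume_id)
        except Exception as exc:
            failures.append((volume_id, str(exc)))
    return failures
```

A caller would run one phase, review the returned failures, fix the pipeline, and only then start the next phase. With `plan_phases(ids, [20, 100])`, the first two phases hold 20 and 100 volumes and the third holds everything left over.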
Gratitude. Work like this takes many hands and diverse skills. Many thanks to George Kozak, whose deep experience with DLXS and able scripting supported the work of page image improvements, structural metadata translation, and initial packaging. Mira Basara’s skills in OCR allowed us to leap decades ahead in the quality of textflow. Mira also managed much of the final packaging and handled the majority of communication with HathiTrust. Thanks are owed to Gary Branch for his help in harvesting records. A special thank you goes to Aaron Elkiss at HathiTrust for his patience with our first efforts to package at scale, and his advice on how to transform these into more successful efforts. Finally, my thanks to practitioners of digital preservation, past, present, and future; techniques and capabilities may change over the years, but our commitment does not!