UC is the second largest contributor to HathiTrust after the University of Michigan. Roughly a quarter of the 17.6 million volumes in HathiTrust (or 4.5 million books) were digitized from University of California Libraries. How did all of these books get from the shelves of UC libraries and into HathiTrust? What type of collection decisions were made and by whom? What is included in the collection, or perhaps more importantly, what has been excluded? How can we better understand UC’s HathiTrust collection so as to make decisions about how to improve it, complete it, or expand it going forward?
The answers to these questions are closely tied to the history of the Google Books Library Project, which UC joined as an early (6th) partner in 2006. The first 2 years of UC's participation saw a tremendous number of volumes digitized, largely from NRLF. By the time HathiTrust was launched in late 2008, Google had already digitized nearly 2 million UC volumes.
2006 - 2011: Google Library Project - Shelf Clearing & Massive Scanning
The majority of the digitization of UC volumes was done by Google between 2007 and 2011. Google’s scanning of UC collections started at NRLF in 2006, where it continued without pause until the COVID pandemic caused shutdowns in 2020. Google’s digitization strategy in these early years was shelf clearing: every volume on a given shelf was removed in order, checked out, and sent to the Google scan center where it was digitized. That said, not all books were suitable for digitization because of their physical condition or other factors, such as size. NRLF shelves books by size, so shelves that contained volumes that exceeded Google's size limitations were skipped.
At the height of scanning (2007-2009), NRLF was sending an average of 3,000 volumes per day to be scanned. The largest shipment occurred in June 2009, a month in which NRLF averaged 5,294 items sent per day: on June 25, 2009, NRLF sent a total of 9,701 items! During these first few years of the Google Library Project, volumes were sent regardless of copyright status.
Google scanned in-copyright volumes so that they could become part of its massive search index, making them easily searchable and discoverable. Public domain books were scanned to support both search/discovery and full online access, including download. While these were Google’s desired outcomes from the scanning effort, one of UC’s prime motivators for joining the Library Project was to have its physical library collections scanned for preservation purposes.
In 2008, Google scanning at UC expanded to include UC Santa Cruz and UC San Diego libraries. UCLA joined the Library Project in 2010, UCSF in 2012, UC Davis and SRLF in 2013, UC Berkeley in 2014, UC Riverside and UC Irvine in 2016, and UC Santa Barbara, the most recent to join, in 2019. UC San Diego ceased operations with Google at the end of 2011 but rejoined the project in 2017. Likewise, UC Davis ended scanning with Google in 2014 but rejoined the project in 2022.
There has been no overall UC collection “strategy” for deciding which books should be digitized by Google, but there has been some intentional, local collection building on the part of the participating UC libraries. While all subjects were sent from NRLF, UC Santa Cruz focused on sending humanities and social science collections from the McHenry Library, and UC San Diego initially focused on sending its East Asian, International Relations & Pacific Studies (IRPS), and Scripps Institution of Oceanography (SIO) collections.
The Open Content Alliance and Digitization with the Internet Archive
Concurrent with the Google Books digitization effort, and in fact preceding it by a month, UC co-founded and participated in the Open Content Alliance. As part of the Open Content Alliance, the Internet Archive digitized UC library collections from 2006 to 2009, although the output of this project was much smaller in scope (roughly 200K UC volumes total). During these years, Internet Archive had scanning centers set up at both NRLF and SRLF (in fact, NRLF made history as the first Internet Archive/Open Content Alliance scanning location in the United States). The Open Content Alliance focused solely on digitizing public domain (pre-1923) volumes so that the content could be made openly available online. Pre-1923 English language books were a focus for scanning at both RLFs, and pre-1923 foreign language books were a focus at SRLF. UC Davis focused on selected California documents (including California Department of Water Resources Bulletins and Bureau of Mines Bulletins); UC Berkeley focused on cookbooks and mathematics; and UCLA focused on children’s books, Italian comedies, and rare business and economic texts. Selected Bancroft collections were also scanned. The UC volumes scanned by Internet Archive as part of the Open Content Alliance were added to HathiTrust in 2010, where they are available for full view reading and downloading.
Google Publisher Opt-Outs
When UC Santa Cruz joined the Google Library Project in 2008, the library quickly noticed that Google returned many of its volumes without scanning them. Google had provided an opt-out for publishers who did not wish to have their publications scanned by the Library Project, and the returned volumes came from publishers who had opted out, perhaps motivated by the lawsuits filed against Google by the Authors Guild and the Association of American Publishers (and the surrounding publicity). At the time, UC Santa Cruz was the newest Google partner, and it was the first to run up against the opt-out issue. To avoid wasted time and effort, Google began analyzing UCSC’s catalog records to create selective candidate lists of UCSC volumes for which there were no publisher opt-outs. UCSC staff used the lists to pull books from their shelves to be scanned by Google.
Many of the publishers who opted out of the Google Library Project chose instead to partner with Google to have their publications scanned via Google’s separate Partner Program and thus included in the Google Books Digital Library. Google’s Partner Program allows publishers more control over whether and how their books are displayed in Google Books. When thinking about UC’s collection in HathiTrust (and the HathiTrust collection in general), it’s important to understand that books scanned as part of Google’s Partner Program are not included in HathiTrust.
2012 - Present: Candidate Lists and Selective Scanning
At the end of 2011, Google greatly reduced the volume and speed of its scanning project with UC, and changed its focus to public domain volumes (those published prior to 1923 as well as US federal government documents). By this time, Google had added many additional libraries around the world to the project. Shelf clearing ended, and Google began analyzing the catalog records of each partner library to create selective candidate lists. The selective lists enabled Google to exclude volumes they had previously digitized to avoid duplication. They also allowed Google to exclude volumes that had been opted out by publishers, as well as volumes likely in copyright. In addition, the lists allowed Google to target public domain volumes that were unique to a library’s collection.
UC’s strategy during the candidate list era has been to digitize as much of its public domain corpus via the Google Library Project as possible.
In 2015, Google introduced new candidate lists to include volumes beyond those determined to be public domain based on their publication date. These included selected volumes from the US Renewal Era (1923-1963) that may not have had a copyright registered or renewed; US federal government publications; US state government documents (from states where such materials are public domain); and various international government documents. Google’s decision to include targeted volumes from the Renewal Era was an astute one: a 2019 study from the New York Public Library estimates that 75% of volumes published in the US Renewal Era may be public domain due to lack of registration and/or required renewal.
Also in 2015, Google began to scan foldouts contained within the volumes it digitized; it had not scanned any foldouts up to that point. Foldout scanning requires a separate scanning process, and the resulting images must be inserted into the digital file in the correct location. Foldouts were (and continue to be) subject to size limits, and because of the additional time required to scan them separately, Google can scan only a certain number of foldouts for each shipment UC sends. This means there are still many volumes in UC’s HathiTrust collection that are missing foldouts. There is, however, a process by which libraries may scan the foldouts themselves and send them to Google to be included in the volume.
As an early Google partner, UC’s digitized corpus contains a large number of in-copyright titles compared to Google Library Project partners who joined later. The statistics from HathiTrust - where almost all Google Library Project partners deposit their digitized scans - bear this out. The percentage of full view volumes in the entire HathiTrust corpus is approximately 40%, meaning 60% of the overall collection is restricted access. Yet only 28% of the volumes UC has deposited in HathiTrust are available as full view, meaning 72% of UC’s collection is restricted access.
When the COVID pandemic shut down physical access to UC’s libraries in early 2020, however, the benefit of having a large number of in-copyright volumes in HathiTrust became clear. HathiTrust responded to physical library shutdowns by creating the Emergency Temporary Access Service (ETAS), which allowed its member affiliates carefully controlled access to in-copyright volumes. UC’s many recent in-copyright publications in HathiTrust were enormously valuable: user data shows that the more recent the publication year, the greater the use relative to the amount of ETAS content available. While HathiTrust consistently provides preservation and discovery for UC’s restricted in-copyright volumes, access to these volumes during physical library shutdowns proved especially important.
Local Campus Digitization
In addition to the mass-scale digitization of UC library collections by Google, certain UC campuses have developed local digitization pipelines through which they contribute books to HathiTrust. UC Berkeley, UCLA, and UC San Diego have been contributing locally digitized volumes to HathiTrust for a number of years. This allows local priorities and decision making to inform the selection of materials for digitization, including materials that may not appear on Google's candidate lists.
Conclusions and Looking Forward
Understanding the origin of UC’s collections in HathiTrust can help shape decision making about its future. The brief history above points to potential gaps in UC's HathiTrust collections. These gaps include volumes that exceeded Google's size limitations; foldouts for volumes digitized prior to 2015 (and some digitized later); volumes for which the publishers opted out; and volumes published in 2010 and later.
The effect of moving from shelf clearing to candidate lists and focusing on public domain materials is visible in the HathiTrust collection: only 0.5% (41K volumes) of the overall collection was published between 2010 and 2020, the lowest percentage for any decade since 1810-1820. And so far, close to 0% of the collection (fewer than 1,000 volumes!) was published in 2020 or later.
UC campuses can use local digitization and deposit to HathiTrust as one path to ensure that volumes they value will be preserved, discoverable, and available far into the future (and for more immediate emergency use should the need arise).
Recently, HathiTrust has started developing strategies to help support more intentional collection building on the part of its member libraries. Once developed and implemented, these strategies might help UC target date ranges, subjects, and titles of volumes to ameliorate any significant gaps in UC’s HathiTrust collection.