HathiTrust’s metadata management system Zephir currently holds close to 25 million MARC records, representing the scanned volumes contributed by 48 HathiTrust member institutions. Each of these institutions has its own systems for cataloging materials and exporting MARC records – and such practices often vary widely even within a single contributing institution. Compounding this complexity, HathiTrust has a policy of respecting the metadata decisions of its contributing members, neither correcting nor amending records as they come in. Wrangling these 25 million member-contributed MARC records into consistent metadata to organize and support user discovery in HathiTrust is, consequently, a massive challenge. Members of CDL’s Discovery and Delivery (D2D) team work closely with HathiTrust staff to manage and maintain Zephir, while always seeking ways to improve its available metadata.


For the past few years, software engineers on the D2D team have been exploring whether AI can be used to improve or enhance Zephir metadata. Charlie Collett and Raiden van Bronkhorst have been working on an initiative to train an AI model to successfully match duplicate Zephir records without using OCLC numbers or other numerical identifiers that are currently used to cluster records. (For those interested in how Zephir currently clusters records, Barbara Cormack, also a member of the D2D team, recently published a helpful Dear Zephir post on the topic.) Charlie and Raiden have explored whether they can train an AI model to recognize the variations common to MARC fields – including title, author, publisher, publication date, publication place, and pagination – and use this information to reliably match duplicate records.
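
As a rough illustration of the kind of field-level normalization such matching depends on (this is a hypothetical sketch, not the D2D team’s actual code; the field names and cleanup rules here are assumptions), the snippet below strips accents, punctuation, case, and extra whitespace so that differently cataloged values can be compared:

```python
import re
import unicodedata

# The six bibliographic elements the matching work focuses on.
FIELDS = ["title", "author", "publisher", "pub_date", "pub_place", "pagination"]

def normalize(value: str) -> str:
    """Crude normalization: strip accents, punctuation, case, and extra whitespace."""
    value = unicodedata.normalize("NFKD", value)
    value = "".join(c for c in value if not unicodedata.combining(c))
    value = re.sub(r"[^\w\s]", " ", value.lower())  # replace punctuation with spaces
    return re.sub(r"\s+", " ", value).strip()       # collapse runs of whitespace

def normalized_record(record: dict) -> dict:
    """Keep only the comparison fields, normalized; missing fields become ''."""
    return {f: normalize(record.get(f, "")) for f in FIELDS}

# Two differently cataloged records for the same book end up looking alike:
a = normalized_record({"title": "Moby-Dick; or, The Whale.", "author": "Melville, Herman,"})
b = normalized_record({"title": "Moby Dick, or The whale", "author": "Melville, Herman"})
print(a["title"] == b["title"], a["author"] == b["author"])  # True True
```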


Clustering records using the OCLC number is successful most of the time, but many Zephir records don’t include an OCLC number, or may include multiple correct OCLC numbers, or the OCLC number may be wrong or outdated. Again, contributing institutions each have their own unique approach to cataloging, so matching duplicate records on a common identifier, like an OCLC number, can be a tricky business. This is especially a problem for older records and for records from international contributors who may not use OCLC numbers. Clustering records on common MARC fields (like title, author, and publisher) could potentially be used to identify when two or more clusters with different OCLC numbers are actually the same item and should be grouped together. This practice could also allow the Zephir team to assess the efficacy of using the OCLC number for clustering.
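
To make that idea concrete, here is a minimal, hypothetical sketch (the cluster structure and key choice are assumptions, not Zephir’s actual clustering logic) of how a descriptive key built from common MARC fields could surface clusters that carry different OCLC numbers but appear to describe the same item:

```python
from collections import defaultdict

# Hypothetical, already-normalized cluster summaries (not Zephir's real data model).
clusters = [
    {"cluster": "c1", "oclc": "123", "title": "annual report", "author": "", "pub_date": "1952"},
    {"cluster": "c2", "oclc": "987", "title": "annual report", "author": "", "pub_date": "1952"},
    {"cluster": "c3", "oclc": "555", "title": "moby dick", "author": "melville herman", "pub_date": "1851"},
]

def field_key(c: dict) -> tuple:
    """A cheap descriptive key built from common MARC-derived fields."""
    return (c["title"], c["author"], c["pub_date"])

# Group clusters by the descriptive key; a key shared by clusters with
# *different* OCLC numbers flags a possible duplicate to review or merge.
by_key = defaultdict(list)
for c in clusters:
    by_key[field_key(c)].append(c)

for key, group in by_key.items():
    oclcs = {c["oclc"] for c in group}
    if len(oclcs) > 1:
        print("possible duplicates:", [c["cluster"] for c in group], "OCLC numbers:", sorted(oclcs))
```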


Accurate record matching in Zephir greatly improves the user experience of HathiTrust because the records for duplicate items are clustered onto a single HathiTrust catalog page (click here for an example). This reduces the number of search results users have to sort through and makes it easier to see and access duplicate copies of the same title. Accurate record matching can also potentially lower costs for HathiTrust contributing members, because fees for duplicate holdings are shared across contributing institutions, while fees for unique holdings are paid by the single contributor.


Charlie and Raiden explain their process and findings in this 20-minute presentation on “Using AI for Matching MARC records” from Code4Lib 2023. Their presentation is an excellent, user-friendly introduction to their research and describes their work to train an AI neural network to predict the probability of a match between existing records based on six bibliographic elements: title, author, publisher, publication date, publication place, and pagination. The model was trained on 50K pairs of HathiTrust records, divided into 25K matching pairs and 25K mismatching pairs, and was able to predict matches and mismatches in the training data with an accuracy rate of 98.46%.
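
For readers who want a concrete picture of what “predicting the probability of a match” can look like, here is a minimal sketch using scikit-learn; it is an illustration under assumed details, not the team’s actual model or training pipeline. Each candidate pair is reduced to one string-similarity score per bibliographic element, and a small neural network trained on labeled matching and mismatching pairs outputs a match probability:

```python
from difflib import SequenceMatcher
from sklearn.neural_network import MLPClassifier

FIELDS = ["title", "author", "publisher", "pub_date", "pub_place", "pagination"]

def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; treat a missing value as no evidence (0)."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()

def pair_features(rec_a: dict, rec_b: dict) -> list:
    """One similarity score per bibliographic element for a candidate pair."""
    return [similarity(rec_a.get(f, ""), rec_b.get(f, "")) for f in FIELDS]

# Toy labeled pairs (1 = match, 0 = mismatch); the real model was trained on
# tens of thousands of labeled HathiTrust record pairs.
pairs = [
    ({"title": "moby dick", "author": "melville herman"},
     {"title": "moby dick or the whale", "author": "melville herman"}, 1),
    ({"title": "moby dick", "author": "melville herman"},
     {"title": "annual report", "author": ""}, 0),
]
X = [pair_features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]

model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
model.fit(X, y)

# Probability that a new candidate pair is a match.
candidate = pair_features({"title": "moby dick the whale"}, {"title": "moby dick"})
print(model.predict_proba([candidate])[0][1])
```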


Raiden and Barbara have been participating in the Matching Algorithms Task Force of the Infrastructure Standing Committee of the Shared Print Partnership. The task force has compared the matching results from Charlie and Raiden’s AI model to results from other projects, like the Colorado Alliance of Research Libraries' Gold Rush and ReCAP's Shared Collection Service Bus (SCSB), and the results are encouraging. The task force hopes to release a report with their methods, findings, and recommendations soon.


While these findings are promising, they are also preliminary. The AI model works well for the English-language monographs it was trained on, but may not be equipped to handle the much more varied HathiTrust corpus as a whole. Charlie and Raiden are exploring how to extend this duplicate record discovery by training the model to use text features or title pages from the actual book scans to double-check matches for accuracy. Checking the book scans for corroborating data could be particularly useful for government documents, which sometimes (a) have long titles that are identical except for the final words, such as the name of a state, or (b) are cataloged with a brief generic title. They also want to try blocking the data (grouping it into blocks of similar content) so they don’t waste computing time comparing obviously different content types (for example, a Dr. Seuss story and a US government document); a minimal sketch of that idea follows below.
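
In this sketch (the blocking key, first title word plus publication year, is an assumption chosen purely for illustration), records are grouped by a cheap key and the expensive pairwise comparison only runs within each block:

```python
from collections import defaultdict
from itertools import combinations

def block_key(record: dict) -> tuple:
    """Cheap blocking key: first word of the normalized title plus publication year.
    Records in different blocks are never compared, so no model time is spent on
    obviously different items (a Dr. Seuss story vs. a government document)."""
    words = record.get("title", "").split()
    return (words[0] if words else "", record.get("pub_date", ""))

def candidate_pairs(records):
    """Yield only within-block pairs for the (expensive) match model to score."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)

records = [
    {"id": 1, "title": "green eggs and ham", "pub_date": "1960"},
    {"id": 2, "title": "green eggs and ham", "pub_date": "1960"},
    {"id": 3, "title": "annual report of the state engineer", "pub_date": "1960"},
]
print([(a["id"], b["id"]) for a, b in candidate_pairs(records)])  # [(1, 2)]
```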


This work is also preliminary because, so far, this has been a D2D-led exploration, rather than an active initiative undertaken with HathiTrust colleagues. Incorporating AI matching is also complicated by HathiTrust’s metadata sharing and use policy, which assigns ultimate responsibility for metadata accuracy to contributors. Therefore, any AI-suggested improvements to metadata would need to be communicated to and implemented by the contributors. Additionally, there is no simple way to incorporate an AI model into HathiTrust systems that would ensure AI matching only contributes positively, rather than introducing inaccuracies.


One possible way forward could be to use AI models to identify problematic records and notify contributing institutions to suggest they remediate and resubmit their metadata. But this approach would require a lot of work from contributing institutions, which may not have the resources to do it. And it would not provide a solution for international members who contribute records for which OCLC numbers may not exist or even be desirable. Improving Zephir clustering in this way would add to, rather than decrease, the resource requirements for HathiTrust members. To quote Charlie, “This is not the kind of AI future people are imagining”!


For now, Charlie, Raiden, and the D2D team are continuing to explore how to improve record matching, and they have plans for additional explorations into the use of AI to help with challenges posed by Zephir and other CDL-managed services. For example, they want to figure out how to normalize serial holdings statements and bring order to the chaos currently posed by enumeration and chronology metadata.
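
As a hint of what that normalization challenge looks like (the pattern and field names below are illustrative assumptions, not an actual Zephir rule), a simple rule-based parser very quickly runs into the variety found in real enumeration and chronology strings:

```python
import re

# Illustrative pattern only; real serial holdings statements are far messier,
# which is exactly the problem the team hopes AI can help with.
ENUM_CHRON = re.compile(
    r"v\.?\s*(?P<volume>\d+)"                 # volume, e.g. "v.3"
    r"(?:[,;\s]+no\.?\s*(?P<number>\d+))?"    # optional issue number, e.g. "no.2"
    r"(?:\s*\((?P<year>\d{4})\))?",           # optional chronology, e.g. "(1984)"
    re.IGNORECASE,
)

def parse_enum_chron(statement: str) -> dict:
    """Return a structured volume/number/year dict, or {} if nothing matches."""
    m = ENUM_CHRON.search(statement)
    return {k: v for k, v in m.groupdict().items() if v} if m else {}

for raw in ["v.3, no.2 (1984)", "V. 3 NO 2 1984", "Jahrg. 12, Heft 4"]:
    print(raw, "->", parse_enum_chron(raw))
# Only the first statement parses fully; the others show why normalization is hard.
```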


The AI future is promising for HathiTrust and other digital libraries, but at this stage it requires dedicated innovation, diligence, and collaboration. 


For More Information

  • For those interested in possibly using AI themselves, Charlie and Raiden gave a 15-minute presentation at Code4Lib this spring on “Lessons learned: How to get traction with AI and start building”.
  • ai4Libraries provides a platform to explore the potential benefits of AI in libraries. The free annual virtual conference aims “to bring together librarians, subject matter experts, practitioners, enthusiasts, and skeptics to exchange ideas, share experiences, and chart a path toward a future where AI plays a significant role in driving library services”. The 2024 conference is happening October 22 and 23, but recordings will be available after November 17. Recordings from the 2023 conference are available on the website.