Zephir, the HathiTrust bibliographic metadata management system, is managed by CDL’s Discovery & Delivery team. In this advice column, Barbara Cormack, the metadata analyst for Zephir, answers common questions about contributing records to Zephir. While these questions were written by fictitious authors, you are welcome to submit your own questions to Zephir (email: [email protected]).
Dear Zephir,
I keep hearing that records in HathiTrust are grouped together, or clustered, but I also keep seeing what look like multiple records for the same work or title in the catalog. Can you explain exactly how records get grouped together in the HathiTrust catalog?
– Flustered About Clusters
Dear Flustered,
Thanks for your inquiry. It’s a great question, and one that comes up fairly regularly. I hope you’re prepared to get “down in the weeds” a little so that I can explain how this works in Zephir.
Records submitted to Zephir go through two phases of processing. In phase one, the records are “prepared” for ingest. During this phase, some of the metadata in the record is validated and manipulated, the record is assigned to a cluster, and it receives a cluster identifier (or “CID”). Zephir makes the cluster assignment using one of three possible data elements, tried in this order:
First, if the incoming record has an OCLC number, Zephir will determine whether that OCLC number is already in the database. If it is, Zephir will assign the incoming record to the existing cluster that contains that number. No further matches are attempted.
If there is no OCLC number, Zephir will next try to use the incoming record’s bibliographic ID number (for example, an Alma MMS ID in the 001 field) in a similar manner. If that bib ID is found in a cluster, Zephir will assign the record to that cluster. No further matches are attempted.
If no cluster match is found using the bib ID, Zephir checks two things: whether the record has a previous system bib ID number, and whether the configuration file, which controls certain aspects of ingest, is set to use that number for matching. If both conditions are met, Zephir will search the database for matches on the previous system ID and try to assign the record to a cluster in that way.
If none of these conditions are met, the system will assign the incoming record to a new cluster.
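For readers who find it easier to follow the sequence as code, here is a minimal sketch of the matching order described above. It is an illustration only, not Zephir's actual implementation: the names `Record`, `ClusterIndex`, `assign`, and `use_previous_id` are all made up for this example, bib IDs are treated as globally unique for simplicity, and the sketch assumes that a record whose OCLC number is not yet in the database simply moves on to the next step.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Dict, List, Optional


@dataclass
class Record:
    """Hypothetical stand-in for an incoming bibliographic record."""
    bib_id: str
    oclc_numbers: List[str] = field(default_factory=list)
    previous_bib_id: Optional[str] = None


class ClusterIndex:
    """Hypothetical in-memory stand-in for the cluster database."""

    def __init__(self) -> None:
        self.by_oclc: Dict[str, int] = {}
        self.by_bib_id: Dict[str, int] = {}
        self._next_cid = count(1)

    def _match(self, record: Record, use_previous_id: bool) -> Optional[int]:
        # Step 1: an OCLC number that is already in the database wins.
        for number in record.oclc_numbers:
            if number in self.by_oclc:
                return self.by_oclc[number]
        # Step 2: otherwise try the record's own bib ID.
        if record.bib_id in self.by_bib_id:
            return self.by_bib_id[record.bib_id]
        # Step 3: try the previous system ID, but only if the (assumed)
        # configuration setting allows it and the record carries one.
        if use_previous_id and record.previous_bib_id is not None:
            return self.by_bib_id.get(record.previous_bib_id)
        return None

    def assign(self, record: Record, use_previous_id: bool = False) -> int:
        cid = self._match(record, use_previous_id)
        if cid is None:
            # No match on any element: start a new cluster.
            cid = next(self._next_cid)
        # Remember this record's identifiers for later submissions.
        for number in record.oclc_numbers:
            self.by_oclc.setdefault(number, cid)
        self.by_bib_id.setdefault(record.bib_id, cid)
        return cid
```

In this sketch, a record submitted with a new bib ID and a known OCLC number lands in the existing cluster at step 1, while a record with no matching identifiers at all receives a fresh CID.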
Now, even when the sequence of cluster assignment steps outlined above is followed, there can be complications and variations. As an example, for historical reasons there may be multiple clusters containing the same OCLC number. You might ask, how could this come about? There are several explanations, and they can be difficult to untangle. In the early days of the system many records were submitted without OCLC numbers, sometimes because they were brief “shelf records,” such as those for materials in the NRLF (UC's Northern Regional Library Facility). At the time, quantity of records was prioritized over quality of the metadata. Sometimes records were contributed from libraries that did not catalog with OCLC. Later, if updates to these records were submitted, now containing OCLC numbers, they might not cluster with their earlier versions, depending on a variety of database conditions. The result was separate clusters containing the same OCLC number. When a record is submitted with an OCLC number that appears in more than one cluster, Zephir will assign the record to the cluster with the lowest CID number.
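The tie-break for an OCLC number that appears in more than one cluster can be sketched in a few lines. Again, this is only an illustration under assumptions: `pick_cluster` and the sample CIDs and OCLC numbers are invented for the example and do not come from Zephir.

```python
from typing import Dict, Optional, Set


def pick_cluster(oclc_number: str, clusters: Dict[int, Set[str]]) -> Optional[int]:
    """Return the lowest CID whose cluster contains the OCLC number, if any."""
    candidates = [cid for cid, numbers in clusters.items() if oclc_number in numbers]
    return min(candidates) if candidates else None


# Made-up data: two clusters share OCLC number "12345".
sample_clusters = {
    1042: {"12345"},
    877: {"12345", "67890"},
    2001: {"55555"},
}

chosen = pick_cluster("12345", sample_clusters)  # 877, the lower of 877 and 1042
```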
This is just one example of the complexities involved with clustering records. In a previous “Dear Zephir” column, we discussed how different editions of the same work are considered to be separate entities and so are not clustered together. That is another reason you may see multiple “hits” for what appears to be the same record. Zephir has to be able to address many different conditions in the metadata it processes in order to cluster records. Here, we’ve touched on just a few of them!