SNAC
Prototype

Introduction

This page contains information on the background and current status of the SNAC prototype historical resource and access system. First time visitors are encouraged to read this information before visiting the prototype: it is a work in progress and some features are still in development or not yet fully functional.

SNAC's overall objective is to derive EAC-CPF descriptions from EAD-encoded finding aids using the names of creators and people referenced in archival descriptions, as well as biographical or historical information. Initially the focus is on names specifically tagged as such (<persname>, <corpname>, and <famname>), though later in the project we will also attempt to identify names found in <unittitle> and prose <bioghist>. The EAC-CPF records derived from the finding aids are being matched against one another to combine records for the same entity, and these combined records are then matched and combined with authority records supplied by the Library of Congress (Library of Congress/NACO Name Authority File, LCNAF), the Getty Vocabulary Program (Union List of Artist Names, ULAN), and OCLC Research (providing data from the Virtual International Authority File, VIAF). Finally, SNAC is developing a prototype historical resource and access system based on the unique records resulting from the first two steps of processing.

EAC-CPF record extraction

IATH is deriving EAC-CPF records from EAD-encoded archival finding aids from the Library of Congress (LoC), the Online Archive of California (OAC), the Northwest Digital Archives (NWDA), and Virginia Heritage (VH). Thus far, EAC-CPF records have been derived from the LoC, OAC, and NWDA finding aids. VH has not yet been processed. The derivation process has been iterative, that is, the finding aids are processed repeatedly using XSLT programs that are continually being adjusted and refined to increase the number of EAC-CPF records derived, and to improve the quality of the derived entries and descriptions. As of this date, we can report the following results:

LoC: 43,702 EAC-CPF records derived from 1,159 finding aids

OAC: 91,811 EAC-CPF records derived from ~15,400 finding aids

NWDA: 22,609 EAC-CPF records derived from 5,568 finding aids
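
The project's own derivation is done with XSLT programs that are not reproduced here. As a rough illustration of the first step only (collecting names that are explicitly tagged), the following Python sketch walks an EAD document and gathers the text of <persname>, <corpname>, and <famname> elements. The function names and the sample finding aid are invented for illustration and are not part of SNAC.

```python
import xml.etree.ElementTree as ET

# The three EAD elements the text names as the initial focus.
NAME_TAGS = ("persname", "corpname", "famname")


def local_name(tag):
    """Drop a namespace prefix of the form '{namespace-uri}' if present."""
    return tag.rsplit("}", 1)[-1]


def extract_tagged_names(ead_xml):
    """Return (element name, text) pairs for tagged names, in document order."""
    root = ET.fromstring(ead_xml)
    names = []
    for element in root.iter():
        tag = local_name(element.tag)
        if tag in NAME_TAGS:
            # Join nested text and collapse runs of whitespace.
            text = " ".join("".join(element.itertext()).split())
            if text:
                names.append((tag, text))
    return names


# Hypothetical, heavily abbreviated finding aid (invented data).
SAMPLE_EAD = """
<ead>
  <archdesc>
    <did>
      <origination><persname>Brown, John, 1900-1980</persname></origination>
    </did>
    <controlaccess>
      <corpname>Example Historical Society</corpname>
      <famname>Brown family</famname>
    </controlaccess>
  </archdesc>
</ead>
"""

for tag, text in extract_tagged_names(SAMPLE_EAD):
    print(tag, "|", text)
```

A real derivation would also carry over dates, biographical notes, and a pointer back to the source finding aid; the sketch shows only name collection.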

While many of the names found in finding aids have been carefully constructed, frequently in consultation with LCNAF, many other names present extraction and matching challenges. For example, many personal names are in direct rather than indirect (or catalog entry) order. Life dates, if present, sometimes appear in parentheses or brackets. Multiple names sometimes appear in the same <persname>, <corpname>, or <famname> element. Many names are incorrectly tagged, for example, a personal name tagged as a <corpname>.
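
Two of the challenges above, life dates in parentheses or brackets and names in direct order, can be illustrated with a small normalization sketch. This is a deliberately naive illustration, not the project's algorithm: the function names are invented, the date pattern handles only four-digit years at the end of the string, and the reordering rule (move the last word to the front) will fail on suffixes, multi-word surnames, and corporate names.

```python
import re

# Trailing life dates such as "(1900-1980)", "[1900-]", or ", 1900-1980".
LIFE_DATES = re.compile(
    r"""[\s,]*              # separator before the dates
        [\(\[]?             # optional opening parenthesis or bracket
        (?P<birth>\d{4})    # birth year
        \s*-\s*
        (?P<death>\d{4})?   # death year, absent for open ranges
        [\)\]]?             # optional closing parenthesis or bracket
        \s*$""",
    re.VERBOSE,
)


def split_life_dates(name):
    """Return (name without dates, birth year or None, death year or None)."""
    match = LIFE_DATES.search(name)
    if not match:
        return name.strip(), None, None
    return name[: match.start()].strip(), match.group("birth"), match.group("death")


def to_indirect_order(name):
    """Naively turn 'Forename Surname' into 'Surname, Forename'.

    Names that already contain a comma are assumed to be in indirect order.
    """
    if "," in name:
        return name
    parts = name.split()
    if len(parts) < 2:
        return name
    return f"{parts[-1]}, {' '.join(parts[:-1])}"


# Invented example: a direct-order name with parenthesized dates.
bare_name, birth, death = split_life_dates("John Brown (1900-1980)")
print(to_indirect_order(bare_name), birth, death)
```

Cases such as several names in one element or a personal name tagged as <corpname> are not addressed here and are the kind of problem the text expects to need manual editing.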

We will continue to refine the extraction and matching algorithms over the course of the project, but it is anticipated that it will only be possible to address some problems through manual editing, perhaps using "professional crowd sourcing."

Despite the challenges, both the quantity and the quality of the extracted records strongly suggest that the techniques are effective. An important byproduct of SNAC will be recommendations concerning content standards, descriptive practice and the application of EAD, as well as recommendations for revision of EAD.

EAC-CPF matching and combining with authority records

During the first four months of the project, SI/UCB indexed several million LCNAF, VIAF, and ULAN authority records in preparation for the match-merge processing. The match-merge processing began in September. While matching has been tested successfully against each authority record source type, the initial focus has been on matching EAC-CPF records against one another and against VIAF records. The VIAF records are all personal name records.

Matching presents two primary challenges, and these challenges are in tension with one another. The first challenge is matching strings against one another. This is straightforward for strings that match character for character. Other strings, though, are close but not exact matches. In general, we want to match as many strings as possible. The second challenge is the reliability of the match, that is, whether two name strings that match do in fact designate the same entity. For example, is this "Brown, John" the same person as that "Brown, John"? For personal names, when one or more life dates are present in both strings, there is a high likelihood that the names are for the same person. When a name string consists of only one or two name components, the reliability decreases. SNAC is attempting to address both of these challenges in a manner that maximizes matching with acceptable reliability. In order to maximize matching, we are experimenting with algorithms for close but not exact matches. To increase reliability for "weak strings," that is, strings with one or two name components and no life dates, we are experimenting with considering contextual data derived from the source finding aid and contextual data in the enhanced VIAF authority records.
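
The tension between the two challenges can be made concrete with a small sketch. It scores string closeness with Python's standard difflib.SequenceMatcher, which is a stand-in chosen for illustration and not the algorithm SNAC is experimenting with, and it flags "weak strings" as defined above so that a close match between them is held back for contextual checking. The function names, the 0.9 threshold, and the three result labels are all assumptions.

```python
import re
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", name.lower()).split())


def similarity(a, b):
    """Closeness of two name strings, from 0.0 (unrelated) to 1.0 (identical)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_weak(name):
    """A weak string has one or two name components and no life dates.

    Life dates are approximated here as any four-digit number.
    """
    if re.search(r"\d{4}", name):
        return False
    return len(normalize(name).split()) <= 2


def classify_match(a, b, threshold=0.9):
    """Return 'no match', 'needs context', or 'match' for two name strings."""
    if similarity(a, b) < threshold:
        return "no match"
    if is_weak(a) or is_weak(b):
        # The strings agree, but they carry too little information to be
        # confident that they designate the same entity.
        return "needs context"
    return "match"


# Invented examples.
print(classify_match("Brown, John", "Brown, John"))
print(classify_match("Brown, John, 1900-1980", "Brown, John, 1900-1980."))
```

Lowering the threshold matches more strings at the cost of reliability, which is the trade-off the paragraph above describes.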

The initial matching has focused on exact matches (character for character). EAC-CPF names are being matched against both authoritative and alternative names in the VIAF records. Algorithms for near matches are not yet deployed. Thus the early results are precise at the level of the string, but the reliability of the matches is mixed. Many entries that should ultimately match are not yet matched. There are also many false matches, that is, matches where records for two different people are combined in one record.
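
A minimal sketch of exact matching against both authoritative and alternative names follows. The record layout (a dict with "id", "authoritative", and "alternatives" keys) and the sample records are invented for illustration and do not reflect the actual VIAF data format. The example also shows how a false match can arise: one short string may be an alternative name in more than one authority record.

```python
from collections import defaultdict


def build_name_index(authority_records):
    """Map each authoritative or alternative name string to the ids of the
    authority records that contain it."""
    index = defaultdict(set)
    for record in authority_records:
        for name in [record["authoritative"], *record["alternatives"]]:
            index[name].add(record["id"])
    return index


def exact_matches(extracted_name, index):
    """Return the sorted ids of records whose names equal the extracted name
    character for character."""
    return sorted(index.get(extracted_name, set()))


# Hypothetical authority records (invented data).
AUTHORITY_RECORDS = [
    {"id": "a1", "authoritative": "Brown, John, 1900-1980",
     "alternatives": ["Brown, John"]},
    {"id": "a2", "authoritative": "Brown, John, 1850-1910",
     "alternatives": ["Brown, John"]},
]

index = build_name_index(AUTHORITY_RECORDS)
for name in ("Brown, John, 1900-1980", "Brown, John", "Brown, J."):
    print(name, "->", exact_matches(name, index))
```

When more than one id comes back for a single string, merging on that string alone would risk combining two people in one record.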

When combining records at this stage of the project, all name entries derived from both the EAD-encoded finding aids and the authority records are retained. Thus one will currently find duplicate name entries in the merged EAC-CPF records, and competing authoritative forms. At a later stage in the project, we will devise algorithms for eliminating duplicate name entries in the same record, and for reducing the competing authoritative forms to one and "demoting" the others. The obvious choice will be to prefer the LCNAF authoritative name to all others, but it becomes more challenging to choose among competing authoritative names when there is no matching LCNAF record.
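
Since these algorithms have not yet been devised, the following is only a sketch of one possible approach, not the project's method. It removes duplicate name entries, keeps a single authoritative form, and demotes the rest. Placing LCNAF first comes from the text; the ordering of the remaining sources is an arbitrary assumption made so the example can run, and the entry layout and sample data are invented.

```python
# Only "LCNAF first" is taken from the text; the rest of the order is assumed.
SOURCE_PRIORITY = ("LCNAF", "VIAF", "ULAN", "EAD")
RANK = {source: position for position, source in enumerate(SOURCE_PRIORITY)}


def entry_key(entry):
    """Sort authoritative entries first, then by source priority."""
    return (not entry["authoritative"], RANK[entry["source"]])


def combine_name_entries(entries):
    """Deduplicate name entries and keep at most one authoritative form."""
    best = {}
    for entry in entries:
        kept = best.get(entry["name"])
        # For duplicate strings, keep the entry that sorts first.
        if kept is None or entry_key(entry) < entry_key(kept):
            best[entry["name"]] = dict(entry)
    merged = sorted(best.values(), key=entry_key)
    for position, entry in enumerate(merged):
        # Only the first entry may stay authoritative; the others are demoted.
        entry["authoritative"] = entry["authoritative"] and position == 0
    return merged


# Hypothetical name entries from one merged record (invented data).
ENTRIES = [
    {"name": "Brown, John", "source": "EAD", "authoritative": True},
    {"name": "Brown, John, 1900-1980", "source": "VIAF", "authoritative": True},
    {"name": "Brown, John, 1900-1980", "source": "LCNAF", "authoritative": True},
    {"name": "John Brown", "source": "VIAF", "authoritative": False},
]

for entry in combine_name_entries(ENTRIES):
    print(entry)
```

A fixed source ranking is the simplest tie-breaker, but as noted above, choosing among competing forms when no LCNAF record matches is the harder, open question.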

Prototype development

CDL has focused on the development of the prototype historical resource and access system. SNAC will be making iterative updates and enhancements to the site over the course of the project. There are some known issues that we are working on with matching, merging, and normalizing some of the records. The EAC-CPF records that we are assembling represent a novel challenge in display, with few comparable examples from which we might draw inspiration.

The user interface is being developed using an iterative approach. This first iteration of the prototype has only the basic functionality which we believe is needed to allow researchers and archivists to imagine how they might want to interact with the merged records. Development of further iterations of the prototype, including usability assessment, will continue through Spring 2012.

Your help and feedback requested

We welcome your feedback on the prototype historical resource and access system. Please submit your impressions, suggestions, or comments to our Feedback Forum.

If you would like to help us improve our data matching, merging, and normalization processes, please report any errors that you notice with any particular EAC-CPF record using the "Note Data Issue" link in the top-right corner of the page for any given record.

Comments and questions about the SNAC project can be submitted via the Contact page.