The Prototype History Research Tool is being built using existing descriptions of archival resources. While much of the processing is successful, a significant portion of the data is inaccurate. The resource description data was not created with the current processing objectives in mind. The quality of the resource descriptions is uneven, most often because of simple human error and oversight, either in the data itself or in the encoding of the data. There are typographical errors in names and dates, and errors in data encoding (for example, labeling an organization a person, or vice versa). Sometimes the discrepancies stem from conflicting evidence, such as differences in spelling or life dates.
While refinement of the computational techniques continues, such techniques alone will always fall short. The most fundamental problem is determining whether similar names refer to the same person or to different persons. Even for human editors, identity resolution can be an exceptional challenge and sometimes cannot be reliably achieved because the evidence is insufficient or ambiguous.
Users will frequently find two or more SNAC identity descriptions for the same person or organization. In attempting to resolve identities, SNAC has intentionally erred on the side of not combining similar identities when there is uncertainty. The rationale for this bias is that it can be exceptionally difficult for users to identify when distinct identities have been incorrectly combined. SNAC thus attempts to avoid such combinations, even at the expense of not combining what to human users are seemingly obvious descriptions for the same person or organization. When two identities are similar but do not cross a certainty threshold, SNAC relates them to one another and labels them “maybeSameAs.” Nevertheless, users will discover identity descriptions that combine two or more identities.
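The conservative bias described above can be illustrated with a minimal sketch. It is not SNAC's actual matching code: the two threshold values, the function names, and the use of Python's difflib string ratio as the similarity measure are all assumptions made for illustration. Only the “maybeSameAs” label comes from the text above.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds; the values SNAC actually uses are not stated here.
MERGE_THRESHOLD = 0.97   # at or above this, treat two descriptions as one identity
MAYBE_THRESHOLD = 0.85   # between the two thresholds, relate but do not combine


def name_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio, ignoring case and surrounding whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def classify_pair(score: float) -> str:
    """Map a similarity score to a decision, erring on the side of not combining."""
    if score >= MERGE_THRESHOLD:
        return "merge"
    if score >= MAYBE_THRESHOLD:
        return "maybeSameAs"
    return "distinct"


# Example use: compare two name strings and classify the pair.
decision = classify_pair(name_similarity("Smith, John, 1810-1880", "Smith, John, 1810-1881"))
```

Setting the merge threshold high means that borderline pairs fall into the “maybeSameAs” band and remain separate descriptions, which is why users may see apparent duplicates.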
The associated resource descriptions typically distinguish between an identity that created the resources and an identity that is referenced in the resources. In some cases, the entity may be represented as the creator of linked archival resources when in fact the person, family, or organization is not the primary creator of the collection but merely the creator of some portion of it, perhaps even just one or two items out of hundreds or thousands of items.
For many of the archival resources described in WorldCat, we were unable to determine the names of the holding institutions from their identifying codes. The descriptions of these resources therefore list the holding institution as “unknown” followed by the code. In most cases, the linked WorldCat record will identify the holding institution.
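A minimal sketch of this fallback follows. The lookup table, its single entry, the function name, and the exact formatting of the fallback label are hypothetical; the text above says only that the label is “unknown” followed by the code.

```python
# Hypothetical code-to-name table; real entries would come from a registry
# of institution codes, which is not part of this sketch.
KNOWN_INSTITUTIONS = {
    "XYZ": "Example University Library",  # made-up code and name
}


def holding_institution_label(code: str) -> str:
    """Return the institution name for a code, or 'unknown' plus the code."""
    cleaned = code.strip()
    name = KNOWN_INSTITUTIONS.get(cleaned.upper())
    if name is not None:
        return name
    # Keep the unresolved code visible so a reader can look it up elsewhere.
    return f"unknown ({cleaned})"
```

Keeping the raw code in the label preserves the one piece of evidence that could later be resolved to a name.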
We plan to address this issue in the near future, with the objective of being able to display on a map the geographic location of holdings to assist researchers in planning travel.
While most of the persons appearing as related to other persons are socially (including family relations) or professionally related, some are intellectually related. That is, they were not known to one another and perhaps were not even contemporaries, but are related because of influence or, quite frequently, because a person collected items by and about someone who lived before him or her.
Because archivists and librarians described the creators of archival resources in the descriptions of the resources, the identity descriptions frequently contain multiple biographical notes for each entity, derived from the multiple source resource descriptions. Again because of encoding errors, users will find with some frequency a biography for one entity associated with a different entity. Upon closer examination, there will almost always be some relation between the two entities. Another common error arises when archivists or librarians incorrectly label abstracts of resources as biographical data. Thus some of the “biographies” are simply not biographies.