One of the things one suddenly has time for after a hectic conference is sitting down to read some of the papers one didn’t get to see presented. One such paper (brought to my attention by Eric Childress at OCLC; thanks, Eric!) was Joe Tennis’ “SKOS and the Ontogenesis of Vocabularies” [http://purl.org/dcpapers/2005/Paper33].
Now, Joe is a friend of mine and I would normally have gotten around to reading his paper in the fullness of time (next plane trip, doctor’s office visit, etc.), but as it happens the paper addressed an aspect of the issue of change and versioning that we’d been thinking about on the NSDL Registry Project. As some of you know, we’ve decided to approach our Registry activities from the controlled vocabulary end rather than the schema end. This might seem backwards at first glance (particularly since most of the other registry projects are starting with schemas), but as it happens, the NSDL community is at present far more in need of a registry of controlled vocabularies than one to register schemas. The emergence of the SKOS work also made that decision seem ‘right’ so we’re committed to taking that approach.
Clearly, when one starts thinking about the registration of controlled vocabularies, the whole question of versioning looms up like a toothy Godzilla on the horizon. We think we’ll need to explore the various aspects of this in some sort of policy statement very soon, but for the moment I’m going to walk carefully around Godzilla to talk about Joe’s paper, and see whether we can start the conversation he hoped to have about change in vocabularies.
As many of you know, Joe is the keeper of the DC conference papers, and manages the development of the thesaurus used to expose those papers to users. Over the three years that papers have been in the system, the vocabulary used to describe them topically has changed. In Joe’s words, the issue is thus:
“Because the metadata thesaurus undergoes constant revision, it is unstable and cannot provide fixed relationships between indexing terms (concepts) and the entire collection of Dublin Core Online Conference Proceedings. For example, a paper indexed in 2002 will not be re-indexed with the revised index terms with the papers for 2004. However, the purpose of a controlled vocabulary is to collocate documents on the same subject. In order to accomplish this task, a secondary mechanism is required. We need a mechanism to express relationships of similarities and dissimilarities across the different versions. This mechanism would chart the development (ontogeny) of the metadata thesaurus and in doing so provide a structure for identifying similar and dissimilar terms across all versions of the thesaurus.”
As part of his analysis, Joe points out that SKOS suggests using OWL and DC to accomplish two important functions: identifying versions of concept schemes, and identifying one-to-one changes of concepts between schemes. Missing is a method to address more complex changes, for instance when a concept is ‘refined’ or ‘lumped.’
As a former Authorities Librarian this caught my eye immediately. In that former life I attempted to maintain congruence between the changes happening in the Library of Congress Subject Headings (LCSH) and Cornell’s database of MARC records, containing millions of LCSH headings. The process we used at that point was only partially machine assisted: we received files of changed LCSH records from a vendor and ran paper reports of conflicts created in our files by those LCSH changes, which were then triaged and fixed by human hands (mostly work study student hands, about the cheapest labor we had available). We had some hopes of eventually being able to manage one-to-one changes by machine, but the changes that gave us the most headaches were the ‘splits,’ things like the change from “Nurses and nursing” to “Nurses” and “Nursing,” where each record needed to be evaluated to determine which heading applied. [Lest any of you catch yourselves rolling eyeballs at the muddy concepts represented by “Nurses and nursing,” recall that LCSH is over a century old.]
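To make the distinction between the easy and the painful cases concrete, here is a minimal Python sketch of that triage. It is not how our process at Cornell actually worked, and it is not anything proposed in Joe’s paper. The function names, the record ids, and the "Aeroplanes" example are all made up for illustration, and the choice to treat lumps (several old headings merged into one) as safe to automate is my own assumption.

```python
from collections import defaultdict


def classify_changes(changes):
    """Label each old heading as 'one-to-one', 'split', or 'lump'.

    `changes` is an iterable of (old_heading, new_heading) pairs.
    Returns (kinds, new_for_old), where kinds maps old heading -> label.
    """
    new_for_old = defaultdict(set)
    old_for_new = defaultdict(set)
    for old, new in changes:
        new_for_old[old].add(new)
        old_for_new[new].add(old)

    kinds = {}
    for old, news in new_for_old.items():
        if len(news) > 1:
            kinds[old] = "split"        # one heading became several
        elif len(old_for_new[next(iter(news))]) > 1:
            kinds[old] = "lump"         # several headings became one
        else:
            kinds[old] = "one-to-one"
    return kinds, new_for_old


def apply_changes(records, changes):
    """Rewrite headings where the replacement is unambiguous; queue the rest.

    `records` maps a (hypothetical) record id to its list of headings.
    Returns (updated_records, review_queue).
    """
    kinds, new_for_old = classify_changes(changes)
    updated, review = {}, []
    for rec_id, headings in records.items():
        out = []
        for heading in headings:
            kind = kinds.get(heading)
            if kind is None:
                out.append(heading)     # heading did not change
            elif kind == "split":
                out.append(heading)     # keep the old heading until a person decides
                review.append((rec_id, heading, sorted(new_for_old[heading])))
            else:
                # one-to-one or lump: exactly one possible replacement
                out.append(next(iter(new_for_old[heading])))
        updated[rec_id] = out
    return updated, review


# Illustrative data only; the record ids and the second change are invented.
changes = [
    ("Nurses and nursing", "Nurses"),
    ("Nurses and nursing", "Nursing"),
    ("Aeroplanes", "Airplanes"),
]
records = {
    "rec1": ["Nurses and nursing"],
    "rec2": ["Aeroplanes", "Nurses and nursing"],
}
updated, review = apply_changes(records, changes)
```

In this sketch the one-to-one change is applied by machine, while every record carrying the split heading lands in `review` along with its candidate replacements, which is roughly the pile our work study students faced.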
The machine-readable records for LCSH were encoded in the MARC Authority Format (http://loc.gov/marc/authority/ecadhome.html):
“The MARC 21 Format for Authority Data is designed to be a carrier for information concerning the authorized forms of names and subjects to be used as access points in MARC records, the forms of these names, subjects and subdivisions to be used as references to the authorized forms, and the interrelationships among these forms.”
… [and further on]
“The MARC 21 Format for Authority Data also provides authoritative information concerning the standard terms used as node labels in the systematic section of a thesaurus to indicate the logical basis on which a category has been divided. A node label is not assigned to documents as an indexing term.”
Given that MARC has been around for about four decades (though Authorities is one of the ‘younger’ formats), it might be instructive to look at how changes are managed in MARC and how that information is used by managers of the thesauri, as well as by downstream users (such as librarians managing databases) and appliers of the terms (generally catalogers). Some of the change information is also exposed to users, though this varies with the system a library uses. MARC uses a system of public and private notes, status, and dates to make changes explicit to most categories of users: a simple solution, but one not particularly well suited to changes handled primarily by machine.
Some of this notion of notes has been incorporated into SKOS, but the notes examples in the SKOS Guide seem woefully simplistic and underdeveloped, so it’s difficult to envision how they might be incorporated with a more sophisticated encoding of changes, like the one Joe suggests.
I’m very excited by the prospect of finding ways to limit the need for human grunt labor in managing thesaurus change. Joe’s suggestions strike me as sensible and well grounded in experience, but it seems clear that we’re just scratching the surface of what might be needed to manage these beasts over time. It would be useful to consider how his proposed structure might be integrated with the notes already in SKOS to ease the task of the maintainers of vocabularies (as well as those beleaguered souls at some remove, managing the data that includes vocabularies), and make both machines and humans happy with the result.