A Metadata Registry from Vocabularies Up: The NSDL Registry Project
Diane I. Hillmann, Cornell University, Tel: +1 607 387-9207, dih1@cornell.edu
Stuart A. Sutton, University of Washington, Tel: +1 206 228-6709, sasutton@u.washington.edu
Jon Phipps, Cornell University, Tel: +1 607 785-3224, jp298@cornell.edu
Ryan Laundry, University of Washington, Tel: +52 312 316 1000, rjlaundr@u.washington.edu

Abstract:
The NSDL Metadata Registry is designed to provide humans and machines with the means to discover, create, access and manage metadata schemes, schemas, application profiles, crosswalks and concept mappings. This paper describes the general goals and architecture of the NSDL Metadata Registry as well as issues encountered during the first year of the project's implementation.

Keywords: Metadata registries, schemas, schemes, Semantic Web, National Science Digital Library (NSDL).

1. Introduction

In this paper, we describe progress on the development of the National Science Digital Library (NSDL) Metadata Registry (hereafter Registry) as a fundamental piece of core technical architecture. It is not the purpose of this paper to chronicle the short history of research in the area of Web-based metadata registries. For current explications of an array of registry initiatives, see Wagner and Weibel (2005) and Kotok (2003). Needless to say, registries have been a part of the metadata discussions for a number of years, as the need for enabling infrastructure for the Semantic Web has become more critical.

The NSDL Registry will make possible: (1) the unambiguous identification of metadata schemas (attribute spaces or element/property sets) and schemes (value spaces or controlled vocabularies); (2) the machine declaration for encoding and network transmission of those schemes and schemas; and (3) the publication of those schemes and schemas to communities and applications. As part of its core services, the Registry will provide machine-addressable crosswalks and other mappings that relate member terms in the schemes and schemas it contains one to another. In addition, the project will provide well-documented means for individual NSDL projects and others to identify, declare and publish their local schemes and schemas through the Registry. Thus, the Registry will support the key goals of metadata discovery, reuse, standardization and interoperability.

The NSDL Registry work is grounded solidly in the NSDL projects facing challenges in the effective deployment of their metadata schemes and schemas. In the past few years, a community of interest within NSDL has emerged. Communication and work among this community have been supported by the proposers through NSDL Communication Portal discussion lists and an NSF/NSDL-sponsored Vocabulary Workshop. Use cases to guide Registry development have been vetted through this community of interest. The community will also assist the project through iterative evaluation during the project’s second year.

One of the goals of the NSDL Registry is to provide a stable home for schemes, schemas and application profiles used in the NSDL that lack a maintenance organization with the interest and resources for their long-term
maintenance. Another goal is to interact with registries external to NSDL that manage schemes and schemas of interest to the community. It is fundamental to the stability of knowledge organization systems and schemas that their maintenance and evolution be managed as near their source (their promulgating agency) as possible. Thus, while it is meaningful to develop a centralized NSDL Registry, it can only function effectively if it can interact with registries operated by the promulgating agencies just noted. Therefore, we will build on the Web Services currently deployed to address this critical need to provide inter-registry interactions for both humans and machines.

As a result of this need for the NSDL Registry to interoperate with other metadata registries, we define two classes of entities requiring different levels of “management.” The first class is made up of those entities hosted by the NSDL Registry. These are entities for which the canonical versions reside within the Registry. The Registry provides the promulgators of this class of entity with capabilities to upload and import into the Registry fully-formed entities or to create, edit, and version entities and their content using Registry tools.

The second class of concern to the Registry is non-hosted entities. The goal with non-hosted entities is to interact with the registries in which they reside and to expose those entities through the Registry interface. The Registry has no means to “manage” such entities or their content and will limit the functionality offered to discovery and exposure.

2. NSDL Registry Services

In essence, the Registry will manage the following hosted top-level entities and their content:
• Schemas. Entities that define elements or properties in attribute space namespaces;
• Schemes. Entities that define concepts in value space namespaces;
• Application Profiles. Entities that provide the means for selecting terms from disparate attribute and value spaces and defining their usage for a specific discourse or practice community (see Heery & Patel, 2000);
• Crosswalks. Entities that define relationships among elements or properties in disparate attribute spaces; and
• Mappings. Entities that define relationships among concepts in disparate value spaces.

The relationships among these top-level entities are illustrated in Figure 1.

Figure 1: Top-Level Entities

To date, most research implementations for the Web have approached registry research and implementation as a means for managing and promoting reuse of attribute spaces, i.e., the left-hand side of Figure 1. While the NSDL Registry will also be handling attribute spaces, the initial work has focused instead on value space issues, taking on some
of these issues at the most granular level and attempting to address the big question: “What should these registries do with knowledge organization systems (KOS) such as thesauri, taxonomies, simple term lists and ontologies and how should such registries operate in an open services environment?” Because controlled vocabularies tend to be more volatile and change is a necessary part of the management challenge, we believe that starting with value spaces will ensure that the decisions we make and the processes we design will work well for less volatile resources.

It is clear that one measure of the long-term success of the NSDL Registry will be the level of technical transparency of its underlying metadata abstract models and their associated encodings in schema languages. It asks too much of a collection holder or community wishing to develop an application profile to master a schema language in order to generate an appropriate schema. Placing tools in the hands of users that provide the means to generate schemas for submission through simple interface mechanisms, drawing on elements already in existence in the Registry, encourages the use of application profiles and makes them easier for others to discover. In addition, a simple means for extending existing schemas to include local elements is also required and will be possible through the schema generation tool.

2.1. Registry Services for Vocabulary Users

Although registries have long been regarded as one of the missing parts of web infrastructure, it does not follow that “build it and they will come” is sufficient to persuade either vocabulary owners or users to interact with a registry. Incentives in the form of easily understood value-added services are the key to bringing both owners and users into the Registry, and to keeping them coming back.

Registry services can be categorized at the most basic level by whether the initial user of the service is a human or a machine. For human users, initial services in the Registry will be resource discovery and maintenance. A typical use case for human users begins with the need to search or browse the Registry for vocabularies that might suit the needs of a project or community, most often during planning phases for projects or application profiles. Users at this stage are presumed to be looking for a rich and comprehensive result set, which can allow them to explore the range and depth of vocabularies available through the Registry.

For users who have already made a choice, or for whom a choice is determined by community requirements, the Registry will provide services that will allow for the optimal maintenance of chosen vocabularies within instance data. Because one criterion that Bruce and Hillmann (2004) put forward as a measure of metadata quality is the currency of the controlled vocabulary terms, a range of services will be offered to assist in keeping vocabularies in applications and instance data current.

Users of particular vocabularies will be able to register their usage and sign up for regular, configurable notification of changes in the vocabularies they use. Notifications can include a variety of options ranging from files that can be used directly in update routines, to human-readable change listings that staff can use to update data using established manual processes. Because the goal is to support the maintenance of metadata, Registry developers will work closely with early users to ensure that the array of services offered meets the needs of projects and data providers.

We recognize that the initial categorization of human and machine users breaks down rather quickly, as some of the service components selected by humans are intended for automated provision, but we need to be flexible about how the services are delivered, given the necessity to meet the needs of users at all stages of automated capability.

2.2. Registry Services for Vocabulary Owners

Ultimately, registry success relies much more on services to vocabulary owners than it does
to other users. If vocabulary owners can’t find a reason to continue to update their vocabularies in the Registry, users will need to find other ways outside the Registry to maintain their data or not maintain it at all. Given that reality, it is obviously critical to this category of services to make the Registry an integral part of the document/publish strategy for vocabulary owners and managers, and not just another task with little or no immediate payback.

The first interaction vocabulary owners will have with the Registry is as a user, registering an organization or individual as an agent and registering additional contacts for the agent. From there they provide basic information about the vocabularies they own and/or manage, either as an individual or on behalf of an organization, and designate contacts as maintainers of each vocabulary. This process provides the basis for a continuing relationship between the Registry and the vocabulary, and focuses on setting up properly scoped contact information that can be used for ongoing notification and interaction.

2.3. Uploading Existing KOS to the Registry

We consider it likely that in many instances, vocabulary owners will initially continue to manage and update their vocabularies using whatever processes and applications have served them in the past. Eventually, our goal is to be able to supply services within the Registry that will allow vocabulary owners to shift their maintenance activities to within the Registry, relying on easy, configurable output mechanisms to update vocabulary usage within their own applications and data processes.

In order to support migration of existing vocabularies to the hosted registry management infrastructure, the Registry will provide a flexible KOS upload and import process. This process will support the import of existing KOS from a number of different file formats, including non-XML/RDF formats where the requirements of the vocabulary allow for it. Once the vocabulary has been imported, vocabulary owners and maintainers may request export of the vocabulary in any of the input or output formats that the Registry supports, bearing in mind the potential for data loss with non-XML/RDF formats. Web services will also be provided that will support remote vocabulary maintenance and interaction.

2.4. Generating KOS within the Registry

As we noted earlier, one of the goals of the project is to provide developers and maintainers of KOS with the means to author and update those KOS within the Registry environment. While we are committed to being as open as possible in terms of encodings for existing KOS imported into the Registry, by necessity we must be more selective in the scheme authoring environment we implement. Initially we will be developing an editor and validator conforming to the Simple Knowledge Organization System (SKOS) (http://www.w3.org/TR/2005/WD-swbp-skos-core-spec-20051102/). Where possible, we will build on existing work in this area; see, for example, the W3C work on SKOS validation (http://www.w3.org/2004/02/skos/core/validation).

Framing the Registry’s built-in authoring environment on the evolving SKOS is not without its problems. Currently, there is no direct support in SKOS for handling versioning of KOS concepts. From the beginning of the project, we recognized the absolute need to manage versioning of schemes and schemas as well as their member concepts and terms. It is to these issues that we now turn. We will return to the current limitations of SKOS near the end of the paper.

3. Versioning Challenges

Tracking changes in resources is an essential task of a registry. Users need to be able to manage change either by relying on a particular version of a schema or scheme until a particular change makes reconciliation a necessity, or alternatively, by automatically updating to match each new change. The
Registry must support them in carrying out either strategy.

Controlled vocabulary versioning issues occur with both URIs and descriptions. Each can change at two levels: at the term level, where each term change may invoke a change management policy, and at the overall vocabulary level, which is intrinsically different each time a term changes. Because it's not entirely clear what end users of vocabularies will require from registered vocabularies, the Registry will make available historical changes and versions of the vocabularies and individual terms to the extent possible.

The Registry strategy for tracking change relies partially on the software model, where recognition of “diffs” or differences between one version and the next (including who made the changes) is the norm. Use of this model allows a complete history of all changes (and who made each change) to be maintained and accessed by administrators, maintainers and users.

But not all change is important in the evolution and proper usage of vocabularies and terms, and flooding users with undigested information is clearly not an acceptable solution. Based on an in-depth analysis of possible semantic changes and their implications, the Registry will track semantically significant changes to individual terms in ways that will assist users in maintaining their vocabularies and their metadata appropriately.

Because there are distinct differences in the control the Registry has over hosted and non-hosted vocabularies, the Registry policies for each will be separately addressed.

3.1. General Assumptions:

1. URIs will remain stable as long as the semantics of the concept do not change;
2. URIs of individual concepts won't contain version information;
3. The Registry must allow people/services to create dependencies on an identifiable snapshot of a particular representation of a vocabulary and its relationships;
4. An identifiable snapshot must include the version designation (either “number” or “date”);
5. Once published, individual concepts in a vocabulary may be created, updated, or deprecated, but not deleted;
6. Namespaces of vocabulary schemas won't be versioned; and
7. Schema name versioning will only change if the version change would harm backward compatibility.

3.2. URI Changes

Stability and reliability of concept URIs are critical to the Registry. Determining unambiguously when a maintainer of a hosted term intends to change its semantics will be a challenge with some forms of controlled vocabularies. If the Registry allows registration of simple term lists, without hierarchies or definitions to determine term boundaries, there is no ability to automatically signal any semantic change beyond the addition and deprecation of terms. Mappings between simple term lists and other schemes, or assertions of relationships between undefined terms, are also problematic in this context.

Most changes in description of the term, including most changes of definitions and simple additions or changes in term relationships, should not qualify as semantic changes requiring a change in a term URI. In general, non-semantically significant changes might include:

1. Additions of broader, narrower or related terms, when no change in hierarchical placement is made;
2. Changes in definition for clarification, correction of typos or grammar, etc.;
3. Addition of definition or scope note when none is present;
4. Change in term status; and
5. Addition of other information (references, etc.).
Semantic changes, requiring a change in URI, might include:

1. Some instances of term splitting or consolidation;
2. Changes in definition that change the semantics of the term; and
3. Changes in hierarchical relationships, when there is no definition and the hierarchy placement is the only semantic clue.

Enforcement of this policy is challenging, since the initial decision about whether a change requires a new URI is made by the maintainer (the exception is splits or consolidation, where machine validation is possible). It is possible that a combination of explicit questions to the maintainer before a submission and some monitoring by a Registry administrator (particularly focusing on new maintainers) might decrease chances of semantically significant changes being made without triggering a new URI. This is certainly an area where experience will be instructive.

3.3. Non-Hosted Vocabularies

Most of the “control” over externally managed vocabularies, particularly in terms of versioning, will be at a policy level, since the maintenance agency processes will be independent of the Registry. If the Registry is to make available any notion of “versioned copies” for these vocabularies, the versioning information at both the vocabulary and term levels must be exposed to the Registry. Ideally, the Registry will at some point be able to ingest vocabulary “snapshots” (if the maintaining agency makes them available) or to create, from ingestion of term changes, viable “versioned snapshots” for use by other services or organizations.

Registry services may be developed to manage agreements with agencies and ingest processes when terms change externally. The Registry should maintain sequenced copies of the concept schemes to be able to track changes over time and to show these copies to vocabulary users, and potentially use them to provide change notification services similar to those provided for hosted vocabularies.

4. The Challenge of URIs

There are at least three possible scenarios envisioned for the assignment of term URIs within the Registry:

1. A vocabulary maintainer submits already assigned URIs with the terms;
2. A vocabulary maintainer submits a domain and URI ‘template’ with the top-level vocabulary description, so that the Registry can use that information to assign URIs; and
3. A vocabulary maintainer asks the Registry to assign URIs.

In the first case, the owner-submitted URIs can be validated to ensure uniqueness, and to some extent the Registry can monitor for instances where semantic changes might require a new URI, but it should be able to assume that the vocabulary maintainer is taking responsibility for URI assignments for new terms. In the second instance, the maintainer may not already have assigned URIs, but since they are required in the Registry, a domain can be submitted, along with a decision on whether the term name or a numeric value will be used to create a unique URI, and the Registry can complete the process of assignment when the terms are added. In the last instance, the vocabulary maintainer asks the Registry to assign a URI and the Registry assigns a permanent URI constructed from a base domain (either a domain supplied by the vocabulary owner, or the Registry’s native domain), a unique token assigned by the vocabulary owner to the vocabulary itself, and a numeric value assigned to each vocabulary concept. This construct will ensure the uniqueness of each URI and provide support for the W3C Semantic Working Group’s “Best Practices Recipes for Publishing RDF Vocabularies” (http://www.w3.org/2001/sw/BestPractices/VM/http-examples/).

As part of the effort to analyze the implications of vocabulary changes on the Registry,
it became clear that using term names or labels as part of a URI (a practice common in schema registries, including the DCMI registry) in an effort to improve the “human readability” of URIs could eventually cause those URIs to degrade, particularly given the greater volatility of controlled vocabularies over attribute sets. This would tend to happen particularly in cases where a prefLabel and an altLabel for a concept might be interchanged, for instance when term usage changed over time. For this reason, the Registry will use numeric concept identifiers, as noted above, as a default, and encourage vocabularies that have not already committed to using term names as identifiers to follow suit.

5. Notifications, Outputs, and Other Interactions

Like most digital library services, the Registry is designed to operate with the least possible human intervention. For that reason, considerable effort will be devoted to designing and implementing automated notifications that can be easily understood by users, and for which there is adequate support for an appropriate response. Where possible, requests that require simple “yes/no” responses will include clickable links, similar to those now common for email confirmations when registering for discussion lists and other services. In other cases, links to logs, documentation, or specific terms or interactions will be included to assist the users in solving problems that have been the cause of the notification. Vocabulary maintainers will also be prompted to review and resolve identified problems when they log in to the Registry.

Registered users will be able to subscribe to a notification service that will let them know, via Atom/RSS/RDF feed or email, of changes to all or selected vocabularies. Additionally, vocabulary owners may request that routine notifications be sent when:
• registered maintainers have modified terms or term relationships;
• file uploads or service interactions have validation errors or require confirmation (for instance, to confirm whether a term change might qualify as a semantic change requiring a new term); and
• new terms have been added and a new term URI has been created.

Because most Registry interactions with vocabulary owners and maintainers will be in the form of automated notifications, we recognize that creating notifications that are understandable and easily actionable by a broad range of agents will be an enormous challenge. A helpdesk system to track and manage interactions arising from notifications will be essential to the project, as will a full range of supporting documentation.

As part of the enticement for vocabulary owner participation, we anticipate notifying owners when users register their intention to use their vocabularies, providing an incentive to continue maintaining via the Registry system and perhaps also encouragement to continue investing in vocabulary development. This registration of usage is integral to both vocabulary owners and users: each has a strong interest in the participation and activities of the other, and building on that interest will be more likely to contribute to the growth of the Registry than broad appeals to the “common good.” Detailed specifications for output formats and mechanisms are still incomplete, but will be an important priority as implementation progresses.

Another reason for broad notification is to prevent nefarious activity within the Registry, without the introduction of extensive security measures that complicate interaction. In instances where a person is maintaining a vocabulary on behalf of another person or organization, notifications to other contacts with interests in the vocabulary provide extra security for the Registry.

5.1. Inter-Registry Services

If the vision of distributed registries is to become reality, services between registries must be part of the planning package. Given the expected volatility of some vocabularies,
these services must be based on standardized service models and require as little human intervention as possible.

A distributed registry system should allow users to discover schemas, vocabularies and application profiles across the system, without having to “shop” individual registries for an appropriate result. Given the problems of federation-based “metasearch” solutions in the library world, it is unlikely that discovery services in the Registry world could acceptably operate with discovery required to navigate federated “silos.” Thus, the Registry will provide APIs that support the interchange of data between metadata registries. Any metadata registry or other service that supports the same APIs will be able to exchange data with the Registry.

6. SKOS Sufficiency: “Mind the Gap”

Like Dublin Core, SKOS contains little in the way of guidance or support for meta-metadata, leaving most decisions to the implementer. This is particularly an issue when the management of change and versioning is considered. As Tennis (2005) points out in a recent paper, there are basically two methods for concept scheme revision in SKOS: notes and OWL versioning. He suggests some additional extensions to address concept “lumping” or combination of terms as well as concept refinement.

Another issue that SKOS addresses only in its internal documentation is “status.” SKOS terms themselves each have a “status” (defined by a small vocabulary of status terms), but the status of terms within a vocabulary cannot be described using SKOS. To some extent this gap in attention to administrative metadata mirrors Dublin Core, which relies exclusively on external standards (like OAI-PMH) to supply the administrative “wrapper” around resource metadata. The Registry will define and support a vocabulary of status terms (registered, of course) intended to provide vocabulary users with an indication of whether a term has simply been proposed, is approved (or not), or has been deprecated.

While additional support for revision and change management is welcome, extensions that address only the “human-friendly” aspects of concept management provide only a partial solution. The Registry software will, as a default, track every change made to concepts, and presenting this history of change to users without extensive editing by humans will be necessary, if not necessarily simple. Reliance on human-created and maintained notes to present change history to users is not a scalable solution for a registry that must rely as much as possible on automated processes. Many of the maintainers of vocabularies interacting with the Registry will not be trained in vocabulary management, so expectations that they will understand SKOS or thesaurus concepts sufficiently to construct standard notes are probably misplaced.

It is also possible that some flavors of output desired by users will require distribution of the full change histories maintained by the Registry, which suggests a need for standardized methods for capture, characterization and exposure of machine-created and machine-readable concept changes. Other management information, like “status,” might also be included in some desired output.

7. Conclusion

Building a registry from the most granular pieces “up” to more general, aggregated expressions provides both important opportunities and significant potential for stumbles. Without the development of SKOS, it would clearly not be feasible, and given that there have not, at this writing, been significant SKOS implementations, there are still a few leaps of faith required. One interesting question it’s still too early to answer is: how will experience building this end of the Registry inform the other parts? Each phase implies a shift in focus, and a consolidation of lessons, but each builds significantly on the last.
This material is based upon work supported by the National Science Foundation under Grant No. DUE-0532828. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References:

Bruce, Thomas R., and Hillmann, Diane I. 2004. The Continuum of Metadata Quality: Defining, Expressing, Exploiting. In Metadata in Practice, edited by Diane I. Hillmann and Elaine L. Westbrooks. Chicago: ALA Editions.

Heery, Rachel, and Patel, Manjula. 2000. “Application profiles: mixing and matching metadata schemas.” Ariadne, Issue 25. Available at: http://www.ariadne.ac.uk/issue25/app-profiles/intro.html

Heery, Rachel, and Wagner, Harry. 2002. “A Metadata Registry for the Semantic Web.” D-Lib Magazine, Vol. 8, No. 5. Available at: http://www.dlib.org/dlib/may02/wagner/05wagner.html

Heery, Rachel, Johnston, Pete, Beckett, Dave, and Steer, Damien. 2002. “The MEG Registry and SCART: Complementary Tools for Creation, Discovery and Re-use of Metadata Schemas.” DC-2002 International Conference. Florence, Italy. Available at: http://www.bncf.net/dc2002/program/ft/paper14.pdf

Heery, Rachel, Johnston, Pete, Fulop, Csaba, and Micsik, Andras. 2003. “Metadata Schema Registries in the Partially Semantic Web: The CORES Experience.” DC-2003 International Conference. Seattle, WA. Available at: http://www.siderean.com/dc2003/102_Paper29.pdf

Kotok, Alan. 2003. 'Metadata Rules' - A Report from the Open Forum on Metadata Registries. WebServices.org 2003-02-24. Available at: http://www.webservices.org/categories/technology/standards/metadata_rules_a_report_from_the_open_forum_on_metadata_registries/(go)/Articles

Tennis, Joseph T. 2005. “SKOS and the Ontogenesis of Vocabularies.” DC-2005 International Conference, Madrid, Spain. Available at: http://purl.org/dcpapers/2005/Paper33.pdf

Wagner, Harry, and Weibel, Stuart. 2005. “The Dublin Core Metadata Registry: Requirements, Implementation, and Experience.” Journal of Digital Information, Volume 6, Issue 2, Article No. 330. Available at: http://jodi.tamu.edu/Articles/v06/i02/Wagner/DCMI-Registry-final.pdf