Lorcan Dempsey
UKOLN, University of Bath, UK
http://www.ukoln.ac.uk/~lisld/
Stuart L. Weibel
OCLC Office of Research, Dublin, Ohio, USA
http://purl.oclc.org/net/weibel
The first week of April 1996 found fifty representatives of libraries, Internet standards, text markup, and digital library projects converging at Warwick University for the OCLC/UKOLN Warwick Metadata Workshop. The conferees came from three continents, eleven countries, and numerous perspectives in an effort to apply their collective experience to the clarification of issues surrounding the effective deployment of metadata for networked information resources.
The meeting followed last year's OCLC/NCSA Metadata Workshop, which convened a similarly diverse collection of stakeholders, and resulted in consensus on a simple resource description record that has come to be known as the Dublin Core. Indeed, the consensus itself may well have been the first workshop's most important deliverable. The thirteen elements of a Dublin Core record contain few surprises, focusing largely on what might be thought of as network resource bibliography and a little bit more [WEIB95a].
The Dublin Core received considerable attention as a simple resource description record in the year since the first meeting. While the first workshop helped to focus discussion of the topic in many communities, the implementation of such a description record requires a formal syntax and deployment strategy that were beyond the scope of that first meeting.
Planning for the second workshop began with informal discussions between the UK Office for Library and Information Networking (UKOLN) and OCLC's Office of Research in the summer of 1995. The agenda for the meeting gradually crystallized around the theme of identifying and resolving impediments to deployment of a Dublin Core style record for resource description. The expectations of the organisers and participants were exceeded over the course of the meeting as conferees worked towards a number of related conclusions about the Dublin Core Metadata element set, about the need for a wider set of metadata types, and about an extensible framework for interchange of metadata of different types. A consensus about these issues emerged from the workshop and a set of concrete proposals for moving forward has been produced. The areas of consensus include:
Dublin Core
Warwick Framework
Guide to Creation and Maintenance of Metadata
This paper provides a high-level overview of the issues discussed at the workshop. It brings together descriptions of the above outcomes and places them in context. Section 2 discusses the Dublin Core and the proposals for taking it forward. Section 3 discusses the rationale for the Warwick Framework.
The Dublin Core Metadata element set is a set of thirteen metadata elements proposed by the first workshop as a core description record to facilitate discovery of document-like objects in a networked environment. To facilitate progress, a number of constraints were imposed on the discussions of the two-and-one-half-day workshop of March 1995:
The 1995 Dublin Metadata Workshop is described in greater detail in [WEIB95a] and [WEIB95b].
The reference description of the element set can be found at:
http://purl.org/metadata/dublin_core_elements
The development of the Dublin Core is motivated by several intended uses:
It is clear from early implementation experience (see 2.3 below) that projects have adopted the semantic flavor of the Dublin Core to develop simple resource description formats. The Dublin Core is intended to fill the niche between the terseness of the unstructured full-text web indexes and the structured description of more complex models such as MARC. It is intended to be sufficiently rich to support useful fielded retrieval but simple enough not to require specialist expertise or extensive manual effort to create.
Simplicity is especially important in the context of author-generated metadata. Conferees at both the 1995 and the 1996 workshops recognized the importance of embedded metadata in Web documents to be harvested by software robots. The key to success is balancing the need for well-structured metadata with the requirement that the creation of the description is manageable by authors.
Future applications will have to work with different types of metadata from different sources. The Dublin Core was positioned to provide a common set of tags that would have recognizable meaning across description models, and in that way provide a unifying semantics among many disciplines. The National Document and Information Service (described below), a joint project of the National Library of Australia and the National Library of New Zealand, is one example of such a use.
Even absent a clearly defined syntax, the Dublin Core element set attracted the interest of a number of early adopters who developed projects that built on the consensus that emerged from the Dublin Metadata Workshop. Some of these include:
Among Nordic countries, there is a special need for a shared metadata system, as it would further facilitate the already active use of Inter-Library Loan (ILL) and document delivery services within Scandinavia. The Dublin Core is one of several resource description models that are under consideration for adoption for this purpose.
A preliminary project plan for the Nordic Metadata Project has been written by Juha Hakala from the Helsinki University Library. The NORDINFO management group accepted the plan in its meeting in Spring 1996, and a full project plan will be written for the management group's September 1996 meeting [HAKA96].
The Uniform Resource Name Interoperability Project (TURNIP), initiated by the Distributed Systems Technology Centre (DSTC) in Australia, has produced a URN Resolution service that utilises the Dublin Core element set for URC metadata. The Dublin Core elements are used to describe DSTC Technical reports and are supplemented with Administrative metadata elements (e.g., URC-Type, Date-Creation, Owner). Three main issues arose from this deployment of Dublin Core: the need to group elements together, the need for a common syntax for the exchange of URCs, and the need for standards for element qualifiers.
More information on the TURNIP project can be found at:
OCLC
The OCLC Office of Research is investigating several potential applications of the Dublin Core element set, including:
The NDIS project (National Document and Information Service) is a joint development of the National Library of Australia and National Library of New Zealand aimed at providing a sophisticated search service to Australian and overseas databases, collection management services and state of the art document delivery services. The first phase of the project will implement a search and document request service across an integrated information resource of MARC-based bibliographic data and a suite of indexing, directory, and thesauri databases in a variety of encoding formats. Further information about the project can be located at: http://www.nla.gov.au/2/NDIS.
The NDIS project used the Dublin Core as a tool in determining generic metadata for bibliographic data, with extensions of the core element set or adoption of other metadata standards for non-bibliographic data. The additional metadata can be viewed either as extensions or as separate core element sets.
The Dublin Core serves as a useful model for the generic storage and access requirements in cross-database searching. Its concept of qualification offers a model for normalizing disparate data types and search precision at the individual database level via specific schema or types. The NDIS implementation utilizes many principles of the Dublin Core, such as extensibility and modifiability, but differs on optionality, as only those metadata elements that intersect across data types are core information resource elements. Metadata elements intersecting a grouping of data or item types are considered "common" metadata elements.
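The distinction between core and "common" elements can be illustrated with set intersections. The following is a minimal sketch, not the NDIS implementation: the data types and element names are hypothetical, and reading "common" as "shared by at least two, but not all, data types" is an assumption made for illustration.

```python
from itertools import combinations

# Hypothetical element sets per data type (illustrative only, not taken from NDIS).
ELEMENTS_BY_TYPE = {
    "bibliographic": {"title", "author", "subject", "date", "publisher"},
    "directory": {"title", "subject", "address"},
    "thesaurus": {"title", "subject", "date"},
}


def core_elements(elements_by_type):
    """Elements that intersect across every data type."""
    return set.intersection(*elements_by_type.values())


def common_elements(elements_by_type):
    """Elements shared by some grouping of data types, excluding the core."""
    core = core_elements(elements_by_type)
    shared = set()
    for first, second in combinations(elements_by_type.values(), 2):
        shared |= first & second
    return shared - core
```

Under these assumed element sets, "title" and "subject" are core because every data type carries them, while "date" is only common because the directory type lacks it.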
After the first OCLC/NCSA Metadata Workshop in March 1995, the Library of Congress drafted two discussion papers for review by the USMARC Advisory Group at its June 1995 meeting. DP86 was called "Mapping the Dublin Core Metadata Elements to USMARC" [GUEN95a]. The purpose was to publicize the Dublin Core, to encourage a standard mapping to USMARC, and to point out problems in mapping to the current format.
One of the biggest problems was that there was no valid place in MARC to put names from the "Author" or "OtherAgent" elements when you might not know the type of name or its relation to the object. Discussion Paper 88 [GUEN95b] and subsequently, Proposal 96-2 [LCON95], proposed a field for "generic author", which was added to USMARC in January 1996 as the Uncontrolled Name field, tag 720.
A full discussion of data interchange between MARC and Dublin Core will appear in Cataloging and Classification Quarterly, "Metadata for Internet Resources: The Dublin Core Metadata Elements Set and Its Mapping to USMARC" by Rebecca Guenther and Priscilla Caplan [CAPL96].
See Mapping the Dublin Core Metadata Elements to USMARC for further information.
The Alexandria Digital Library (ADL) project is one of six NSF/NASA/ARPA-funded Digital Library Initiatives. ADL focuses on on-line access to spatial data. Given that an estimated 90% of all spatial data is available only in hard-copy form, metadata is of prime importance. At the same time, ADL recognizes that a full cataloging record is not needed by the vast majority of general users. ADL translated Dublin Core fields into ADL fields, and added fields required specifically for spatial data and for hard-copy items. This combined set of fields is the default display set that general users see when they perform a search and display resulting metadata.
Among the factors that motivated the Warwick Framework, described later in this paper, is the certainty that a variety of resource description models will emerge from different communities. A successful architecture of network resource description must accommodate such diversity.
Examples of other simple resource description models discussed at the workshop include RFC 1807 and IAFA templates:
This RFC [http://ds.internic.net/rfc/rfc1807.txt] defines a format for bibliographic records describing technical reports. This format is used by the Cornell University Dienst protocol and the Stanford University SIFT system.
RFC 1807 [LASH95] is a bibliographic record tailored to the needs of the Networked Computer Science Technical Report Library (NCSTRL) project [http://www.ncstrl.org.] and is targeted specifically to the description of computer science technical reports. As such it has many characteristics appropriate to a resource description record for document-like objects.
ROADS [http://ukoln.bath.ac.uk/roads/] is an eLib-funded project to implement software for resource organisation and discovery in subject-based services. The aims are to develop a sharable resource discovery system and to fulfill the requirements of the eLib subject-based services. The intention is to involve information providers in resource description as this is viewed as essential to a sustainable service.
There are two subject services currently in production (OMNI and SOSIG) using a prototype version of ROADS. The choice of standards for ROADS was based on the criteria of simplicity and availability, to allow for speedy start-up of the subject services. To this end, a simple attribute-value record structure based on the IAFA/whois++ template definition was chosen. A later version of ROADS will be based upon implementation of the Common Indexing Protocol (CIP) to allow for a distributed system of shared indexing. Initial experience of deployment of the IAFA/whois++ template has generated statistical information on the frequency of use of both bibliographic and administrative attributes. It is expected that this will provide useful feedback for further development of the whois++ template structure.
Specification of a Transfer Syntax
Discussions of syntax are often difficult, burdened as they are with the biases of familiarity and competing methodologies. The earlier Dublin Workshop made progress partly because such discussions were ruled out of scope. However, consensus concerning semantics cannot be deployed without a concrete syntax (or syntaxes). In pilot implementations, the absence of a common model led to different syntax and structuring choices. Clearly, any widespread deployment of Dublin Core (or any similar description scheme) hinges on reaching consensus about a transfer syntax.
Since the Web is currently the primary medium of the Internet, it was further recognized that deployment of metadata in the Web is the primary strategic application; successful deployment of metadata in HTML is necessary, though almost certainly not sufficient.
A working group on syntax formed around this issue, and this group has elaborated a position paper describing a formal syntax for Dublin Core Metadata. A Syntax for Dublin Core Metadata (Burnard, Miller, Quin, and Sperberg-McQueen) includes:
In related developments, a convention for embedding metadata in HTML was proposed in a break-out group at the subsequent W3C Distributed Indexing and Searching Workshop, May 28-29, 1996. This break-out group included representatives of the Dublin Core/Warwick Framework Metadata meetings, representatives of several major Web search vendors (Lycos, Microsoft, WebCrawler), various other software vendors, and the W3 Consortium.
The problem is to identify a simple means of embedding metadata within HTML documents without requiring additional tags or changes to browser software, and without unnecessarily compromising current practices for robot collection of data.
While metadata is intended for display in some situations, it is judged undesirable for such embedded metadata to display on browser screens as a side effect of displaying a document. Therefore, any solution requires encoding information in tag attributes rather than as container element content.
The goal was to agree on a simple convention for encoding structured metadata information of a variety of types (which may or may not be registered with a central registry analogous to the MIME Type registry). It was judged that a registry may be a necessary feature of the metadata infrastructure as alternative schema are elaborated, but that deployment in the short-term could go forward without such a registry, especially in light of the proposed use of the LINK tag to link descriptions to a standard schema description as described below.
The solution agreed upon is to encode schema elements in META tags, one element per META tag, and as many META tags as are necessary. Grouping of schema elements is achieved by a prefix schema identifier associated with each schema element.
A convention for linking resource description tags to the reference definition of the metadata schema (or schemata) used in a document was also proposed. Doing so serves as a primitive registration mechanism for metadata schemata, and lays the foundation for a more formal, machine-readable linkage mechanism in the future [WEIB96].
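A minimal sketch of how such tags might be generated follows. It is not the text of the proposal: the exact spelling of the prefix (a schema identifier joined to the element name with a dot, as in "DC.Title") and of the LINK relation ("SCHEMA.DC") are assumptions for illustration, the function names are hypothetical, and the element values are invented. The reference URL is the one given earlier for the Dublin Core element description.

```python
from html import escape


def meta_tags(schema_prefix, elements):
    """Render one META tag per (element, value) pair.

    Each element name is prefixed with the schema identifier so that
    elements of the same schema can be grouped (assumed "PREFIX.Element" form).
    Repeated elements simply produce repeated tags.
    """
    tags = []
    for name, value in elements:
        tags.append(
            '<META NAME="{}.{}" CONTENT="{}">'.format(
                escape(schema_prefix, quote=True),
                escape(name, quote=True),
                escape(value, quote=True),
            )
        )
    return tags


def schema_link(schema_prefix, href):
    """Render a LINK tag pointing at the reference definition of the schema."""
    return '<LINK REL="SCHEMA.{}" HREF="{}">'.format(
        escape(schema_prefix, quote=True), escape(href, quote=True)
    )


# Hypothetical description of a document, using two Dublin Core elements.
example_head = meta_tags("DC", [("Title", "Metadata & Networks"), ("Author", "Doe, Jane")])
example_head.append(schema_link("DC", "http://purl.org/metadata/dublin_core_elements"))
```

Because all values travel in attributes, nothing in `example_head` would be rendered on a browser screen, and no new tags or browser changes are needed.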
The proposed conventions are described more fully in http://www.oclc.org:5046/~weibel/html-meta.html
Development of User Guides
Resource descriptions might be created by a number of different agents in the metadata chain: authors, collection administrators, and third-party catalogers. Guidelines for the creation of metadata are needed. A guide for authors themselves would be especially useful in supporting a move to document-embedded descriptions, and at least one producer of HTML authoring tools (SoftQuad, Ltd.) has committed to embedding Dublin Core resource description templates in their products when the syntax and guidelines are sufficiently stable.
A working group on user guides formed at Warwick around the task of providing such guidelines [KUNZ96]. Their efforts are evolving and are linked to the Dublin Core home page.
Extensibility -- Mixing and Matching Metadata
The Dublin Core addresses one particular niche of the metadata ecology. It is a simple resource description format that is intended to be extensible in at least two ways. As its name implies, it is intended to provide a commonly understandable core of elements that will help unify different models of resource description. Its simplicity is among its major virtues, but users may well wish to augment description of their resources with additional data.
Original concepts of extensibility for the Dublin Core assumed a mechanism for local extensions -- additional elements added at the discretion of authors or collection maintainers. Such local information may be critical to the effective use of a particular collection, though the local character of such elements may not be of general interest or usefulness.
Of perhaps greater importance is the need to link Dublin Core records to other, richer description schemes (for example, MARC). The ability to link a simple description record to a richer description model provides a means to promote one record type to a more complete description as warranted, and also affords a more continuous axis of resource description (from simple to complex) to suit a variety of user or system needs.
Additionally, Dublin Core data address only one niche of the metadata ecology (resource description for search and retrieval). Other types of description are necessary, as well: terms and conditions (who must pay what to whom, for example), archival status, administrative metadata, and others.
Finally, there are competing models of resource description that overlap the Dublin Core to one degree or another. RFC 1807 and IAFA templates discussed above are examples of such formats. Workshop discussions on extensibility merged with this recognition of the need to accommodate different description models. No single format for resource description will fill all the needs, nor could such a monolithic model be easily maintained. The consensus of the workshop converged on a need for an architecture that would accommodate the diversity of models and levels of description that characterize the heterogeneous world of electronic resources.
The proposal that emerged from these discussions is known as the Warwick Framework, discussed in detail in a companion article by Carl Lagoze in this issue of D-Lib Magazine. It is an architecture for the aggregation and interchange of discrete metadata packages. Such an architecture will afford the opportunity to mix and match metadata sets, allowing rational deployment of many existing and emergent description models. The following section summarizes the essential features of the Warwick Framework.
No single element set will satisfy all metadata requirements. Different communities of users or different application areas will require data of different elements and levels of complexity. The Workshop took as its starting point the Dublin Core, a simple scheme for what might be thought of as electronic bibliography. However, other application areas might require the fullness and structure provided by a MARC-type record, for example, or might have domain-specific descriptive requirements not addressed in the Dublin Core. At the same time, other types of data exist that are outside the scope of the Dublin Core: terms and conditions and evaluative data, for example.
Satisfying the need for competing, overlapping, and complementary metadata models requires an architecture that will accommodate a wide variety of separately maintained metadata models. It was concluded that an architecture for the interchange of metadata packages was required. A package is conceived as a metadata object specialized for a particular purpose. A Dublin Core-based record might be one package, a MARC record another, terms and conditions another, and so on. Such discrete packages might be numerous and varied in content and even source. Users or software agents would need the ability to aggregate these discrete metadata packages in a conceptual container (a metadata basket of sorts), hence the notion of a container-package architecture.
This architecture should be modular, to allow for differently typed metadata objects; extensible, to allow for new metadata types; distributed, to allow external metadata objects to be referenced; and recursive, to allow metadata objects to be treated as 'information content' and have metadata objects associated with them.
Packages are typed objects. They may be primitive (a package is one of a number of separately defined, primitive metadata formats); indirect (a package may be a reference to an external object); or a container (a container is a collection of metadata objects, which may in turn be packages or other containers).
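The three kinds of package can be sketched as a small recursive data structure. This is an illustrative model only, not one of the concrete implementations proposed at the workshop: the class names, field names, type labels, and the example URL are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class PrimitivePackage:
    """A package in one separately defined metadata format."""
    metadata_type: str  # hypothetical type label, e.g. "dublin-core" or "marc"
    content: str


@dataclass
class IndirectPackage:
    """A package that is only a reference to an external metadata object."""
    reference: str


@dataclass
class Container:
    """A collection of metadata objects: packages or other containers."""
    packages: List["Package"] = field(default_factory=list)


Package = Union[PrimitivePackage, IndirectPackage, Container]


def primitive_types(package):
    """Recursively list the type labels of all primitive packages."""
    if isinstance(package, PrimitivePackage):
        return [package.metadata_type]
    if isinstance(package, Container):
        found = []
        for inner in package.packages:
            found.extend(primitive_types(inner))
        return found
    return []  # indirect packages are not dereferenced in this sketch


# A hypothetical "metadata basket" mixing all three kinds of package.
basket = Container([
    PrimitivePackage("dublin-core", "Title: ..."),
    IndirectPackage("http://example.org/terms-and-conditions"),
    Container([PrimitivePackage("marc", "720 ...")]),
])
```

Because a container may hold other containers, the traversal is recursive; the indirect package shows how metadata maintained elsewhere can be referenced rather than copied.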
Several benefits flow from a container-package approach:
The Warwick Framework is a high-level container architecture: it makes no assumptions about the contents of the packages. Nor can it be assumed that clients (or agents) will be able to interpret all packages. Conferees agreed that packages should be strongly typed and that a registry for metadata types will probably be required, perhaps along the same lines as the IANA registry for Internet Media Types (also known as MIME types).
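The interplay between strong typing and a registry can be sketched as follows. This is a hypothetical in-memory illustration, not a proposed registry design: the class, its methods, and the type labels are assumptions. It shows a client interpreting the packages whose types it recognizes and setting aside the rest.

```python
class MetadataTypeRegistry:
    """Maps metadata type names to handlers that know how to interpret them."""

    def __init__(self):
        self._handlers = {}

    def register(self, type_name, handler):
        """Register a handler for a type; each type name may be registered once."""
        if type_name in self._handlers:
            raise ValueError("type already registered: " + type_name)
        self._handlers[type_name] = handler

    def interpret(self, packages):
        """Process (type_name, payload) pairs.

        Returns a pair: the results for recognized types, and the type
        names of packages this client could not interpret.
        """
        results, skipped = [], []
        for type_name, payload in packages:
            handler = self._handlers.get(type_name)
            if handler is None:
                skipped.append(type_name)
            else:
                results.append(handler(payload))
        return results, skipped


# A hypothetical client that understands only one package type.
registry = MetadataTypeRegistry()
registry.register("dublin-core", lambda payload: payload.upper())
results, skipped = registry.interpret(
    [("dublin-core", "title: example"), ("terms-and-conditions", "fee: none")]
)
```

The client does not fail on the package it cannot interpret; it records the unknown type and carries on, which is the behaviour a container architecture with opaque packages requires.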
Concrete Implementations
The architecture needs to be realized in one or more concrete implementations. Proposals for MIME- and SGML-based implementations have been prepared, as well as a discussion of the architecture in a distributed object environment.
Registration
A registry agency for metadata object types needs to be established. Early implementation pilot projects should not be hampered by the lack of such an agency, but as more metadata sets are elaborated by various stakeholders, a formal means for managing changes will be important.
The Warwick Framework was enthusiastically welcomed at the workshop as a practical approach to the effective integration of metadata into a global information infrastructure. The realization of such an architecture will require great effort on many fronts, in many communities. The great hope is that the consensus achieved at this meeting will have provided the foundation for coordination, and sufficient freedom in the proposed architecture to allow progress without an undue burden of close coordination.
The following working papers address aspects of the Warwick Framework more fully:
Conferees left Warwick convinced that significant progress had been made in important areas. This conviction is corroborated by the rapid appearance of a number of documents supporting key decisions and recommendations.
The consensus concerning embedding metadata in HTML reached at the W3C workshop on Distributed Indexing and Searching provides an encouraging impetus to rapid deployment of richer resource description techniques on the Web along the lines developed in the Warwick Workshop.
The recent appearance of a Dublin Core implementation based on these developments (http://archaeology.ahds.ac.uk/project/metadata/dublin.html) is a promising indicator of the need and demand for better resource description on the Internet, and of the speed with which such ideas can be promulgated when community consensus emerges.
It is hoped that the Warwick Workshop will prove to have galvanized such a consensus and provided an important signpost for the development of more effective networked resource description.
July 15, 1996
hdl://cnri.dlib/january96-weibel