XML Schema WG comments on RDF documents from C. M. Sperberg-McQueen on 2003-03-10 (www-rdf-comments@w3.org from January to March 2003)

From: C. M. Sperberg-McQueen <cmsmcq@acm.org>
Date: Mon, 10 Mar 2003 14:33:35 -0700
To: www-rdf-comments@w3.org
Cc: W3C XML Schema IG <w3c-xml-schema-ig@w3.org>
Message-Id: <5.1.0.14.1.20030310141948.025a6338@localhost>
Colleagues:
With apologies for the delay, I transmit to you herewith the comments
of the XML Schema Working Group on the various RDF documents published
in Last Call recently. We congratulate you on the progress of your
work and hope our comments are useful to you. An HTML version of our
comments may be found at
http://www.w3.org/XML/Group/2003/03/xml-schema-rdf-notes.html
and I append an ASCII-only version for those who find that more
convenient.
-C. M. Sperberg-McQueen
 Co-chair, W3C XML Schema Working Group
...........................................................
W3C XML Schema Working Group
Comments on RDF documents
ed. Charles Campbell, C. M. Sperberg-McQueen, Henry S. Thompson
10 March 2003
 _________________________________________________________________
 * 1. [1]Notes on RDF Primer
 + 1.1. [2]Design question, complexity (substantive)
 + 1.2. [3]Whitespace handling (schema-related)
 * 2. [4]Notes on RDF Concepts and Abstract Syntax
 + 2.1. [5]Mapping from lexical forms to values (schema-related,
 terminological)
 + 2.2. [6]Values without lexical forms (schema-related,
 important)
 + 2.3. [7]Lexical forms, strings, and character sequences
 (schema-related, editorial)
 + 2.4. [8]Strings for natural-language data (substantive)
 + 2.5. [9]Typos and minor editorial notes
 * 3. [10]Notes on RDF Semantics
 + 3.1. [11]The "meaning" of literals (editorial)
 + 3.2. [12]Types as lexical mappings (schema-related)
 + 3.3. [13]Miscellaneous editorial notes
 * 4. [14]Notes on RDF/XML Syntax Specification (Revised)
 + 4.1. [15]Manifest typing in the instance (policy)
 + 4.2. [16]QNames (Editorial, but important)
 + 4.3. [17]Miscellaneous editorial notes
 + 4.4. [18]Normative specification of XML grammar (policy,
 substantive)
 + 4.5. [19]On the relation between RDF and off-the-shelf XML
 tools (policy, substantive)
 _________________________________________________________________
 NOTE:
 [These notes have been considered and approved by the W3C XML Schema
 Working Group, and are transmitted to the RDF Core Working Group as
 comments on the last-call drafts of various RDF-related documents.]
 $Id: xml-schema-rdf-notes.html,v 1.11 2003/03/10 21:31:34 cmsmcq Exp $
 The XML Schema Working Group congratulates the RDF Core Working Group
 on progressing its several documents to Last Call; we apologize for
 the late submission of these comments, and hope that they prove
 helpful.
 Our comments include some which bear directly on the use of XML
 Schema's simple types by RDF, to which we believe you wished us to
 give particular attention. In the text which follows, these are
 labeled "schema-related". Some other comments, in contrast, relate to
 important and difficult technical and policy questions relating to
 language design and tool usage; these are labeled "policy". We hope
 that you will give these comments very serious consideration, but we
 do not pretend to any special standing in raising them, other than as
 representative members of the XML community at large. Finally, there
 are some other questions which are not directly related to XML Schema
 or to XML in general, and for which we therefore pretend to no
 particular expertise or standing, but which we happened to notice and
 which we call to your attention, as any technically minded reader
 might do, in the hopes that doing so may be useful to you; these are
 labeled "substantive" or "editorial" as the case might be.
1. Notes on RDF Primer
 RDF Primer, section 2.4 Typed literals
 [20]http://www.w3.org/TR/rdf-primer/#typedliterals
 [20] http://www.w3.org/TR/rdf-primer/#typedliterals
1.1. Design question, complexity (substantive)
 The introduction of pairs consisting of a lexical form and a type (or,
 strictly speaking, a lexical form and a type label) seems at first
 glance to complicate the RDF model somewhat. We have had the
 impression that in other parts of RDF, typing is handled by adding
 further arcs and nodes. If the type of a resource is identified by
 having an arc labeled rdf:type from it to (the URI of) its (RDF) type,
 and if the type of an arc is similarly identified by an arc, then
 surely a reason ought to be given for shifting to a different method
 for typing literal strings. It seems like a dramatic shift in the
 infrastructure of RDF, from "everything is a node, an arc, or a
 literal value" to "everything is a node, an arc, or a typed literal
 value". Perhaps not quite so dramatic, after all. But the question of
 design consistency remains: why not "everything is a typed node, a
 typed arc, or a typed literal"?
1.2. Whitespace handling (schema-related)
 Some members of the XML Schema WG have expressed concern that XML
 Schema's rules for whitespace handling may interfere with expected
 behavior in other contexts. This may be the appropriate place to bring
 this question up.
 In brief, XML Schema's simple types each define a whitespace facet,
 which governs the kind of whitespace pre-processing done by an XML
 Schema processor before the lexical form is checked for type validity.
 Since the point of whitespace normalization is to simplify subsequent
 processing, the lexical spaces of XML Schema's simple types are (like
 those in many programming languages) defined without reference to the
 preceding whitespace normalization. Integers, for example, are
 represented by sequences of decimal digits; sequences containing
 blanks are not legal lexical forms for integers. Indeed, strictly
 speaking it is only after the whitespace pre-processing is done that
 the XML Schema processor can be said to be working with a lexical form
 at all.
 For example, the integer type has a value of collapse for the
 whitespace facet, which means leading and trailing whitespace is
 stripped, and internal whitespace sequences are reduced to a single
 blank (x20) character. In an XML document in which the element
 exterms:age is defined as having type xs:integer, the following
 instances of exterms:age will all be type-valid:
 <exterms:age>27</exterms:age>
 <exterms:age>
 27
 </exterms:age>
 <exterms:age> 27 </exterms:age>
 <exterms:age> 2<!--* ha, ha, fooled your full-text indexer!
 *-->7 </exterms:age>
 The input information set, in each case, contains a character
 information item for "2" followed by a character information item for
 "7", with character information items for whitespace characters, and a
 comment information item, present in some of the examples. In all
 cases, the lexical form proper is the character sequence "27" (i.e.
 the sequence of characters after white space handling, and ignoring
 comments, processing instructions, entity boundaries, and other
 distractions). This is a legal lexical form for an integer, so all the
 examples are type valid.
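 The extraction step described above can be sketched in Python. This
 is a minimal illustration, not a conforming XML Schema processor: the
 namespace URI bound to the exterms prefix and the helper names
 collapse and lexical_form are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Assumed namespace URI for the exterms prefix (illustrative only).
EXTERMS = "http://www.example.org/terms/"

# The four instances discussed above, each with a namespace declaration added.
SAMPLES = [
    "<exterms:age xmlns:exterms='%s'>27</exterms:age>",
    "<exterms:age xmlns:exterms='%s'>\n27\n</exterms:age>",
    "<exterms:age xmlns:exterms='%s'> 27 </exterms:age>",
    "<exterms:age xmlns:exterms='%s'> 2<!--* ha, ha *-->7 </exterms:age>",
]


def collapse(text):
    """Apply whiteSpace=collapse: map tab, LF and CR to a blank,
    squeeze runs of blanks to one, and trim both ends."""
    return re.sub(r"[ \t\n\r]+", " ", text).strip(" ")


def lexical_form(xml_text):
    """Gather the character data of the element (comments contribute
    nothing), then apply the whitespace pre-processing."""
    element = ET.fromstring(xml_text)
    raw = "".join(element.itertext())
    return collapse(raw)


FORMS = [lexical_form(sample % EXTERMS) for sample in SAMPLES]
print(FORMS)
```

 Under these assumptions each of the four instances yields the same
 two-character lexical form.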
 Some members of the XML Schema WG have worried that it may not be
 obvious that the whitespace processing is not part of the process of
 checking lexical forms for type validity, but part of the process of
 extracting the lexical forms from the XML information set presented to
 the processor. If an RDF document contains
 <exterms:age> 27 </exterms:age>
 and a processor hands the contents of the element to a generic
 type-checker for XML Schema's simple types, saying in effect "this
 purports to be the lexical form of an integer; is that OK?", that type
 checker will be required (if it conforms to the XML Schema spec's
 definition of the simple types) to say "no, the character sequence
 ` 27 ' is not a legal lexical form for an integer."
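 A sketch of such a generic checker for the integer type follows. It
 assumes only that the lexical forms of integers are an optional sign
 followed by decimal digits; the name is_integer_lexical_form is
 hypothetical and does not come from any particular library.

```python
import re

# Optional sign followed by one or more of the digits 0-9; no whitespace.
INTEGER_LEXICAL = re.compile(r"[+-]?[0-9]+")


def is_integer_lexical_form(chars):
    """Return True only if the whole character sequence is a legal
    lexical form for an integer. No whitespace handling is done here:
    that step belongs to whoever extracts the lexical form."""
    return INTEGER_LEXICAL.fullmatch(chars) is not None


print(is_integer_lexical_form(" 27 "))  # raw element content
print(is_integer_lexical_form("27"))    # after whitespace pre-processing
```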
 It's not clear whether RDF, being type-system neutral, can directly
 address this concern (e.g. by specifying that an RDF processor should
 do the appropriate whitespace pre-processing, or by warning users that
 they should not include vagrant whitespace in typed literals), or
 whether it suffices for developers of RDF software with built-in
 support for XML Schema's simple types to deal with it, e.g. by
 performing it themselves before handing the resulting lexical form to
 a type checker.
 As noted, some members of our WG feel that you need to be alerted to
 this as a possible source of confusion and unexpected results. Other
 members of the WG feel that it verges on disrespect to assume that you
 need instruction on this point. We compromised by agreeing to point
 out the issue to you, and to leave you to draw your own conclusions.
2. Notes on RDF Concepts and Abstract Syntax
2.1. Mapping from lexical forms to values (schema-related, terminological)
 In [21]http://www.w3.org/TR/rdf-concepts/#section-Datatypes:
 [21] http://www.w3.org/TR/rdf-concepts/#section-Datatypes
 A datatype mapping is a set of pairs whose first element belongs to
 the lexical space of the datatype, and the second element belongs
 to the value space of the datatype:
 We agree that it is useful to define a term to denote such mappings;
 in the interests of inter-specification consistency, we wonder whether
 you would be willing to consider using the term lexical mapping, which
 we are introducing in our forthcoming draft of XML Schema 1.1. The
 term datatype mapping seems unlikely to be usable in the XML Schema
 specification, where it would suggest to some readers a mapping from
 one datatype to another, rather than as here a mapping from lexical
 space to value space. (XML Schema 1.0 got by without a term for this
 concept.)
2.2. Values without lexical forms (schema-related, important)
 In [22]http://www.w3.org/TR/rdf-concepts/#section-Datatypes:
 [22] http://www.w3.org/TR/rdf-concepts/#section-Datatypes
 * Each member of the value space may be paired with any number
 (including zero) of members of the lexical space (lexical
 representations for that value).
 The provision for values without corresponding lexical forms
 contradicts an assumption to which the XML Schema spec appeals from
 time to time. The lexical space of any simple datatype in XML Schema
 is the domain of the type's lexical mapping; the value space is its
 range. There are no meaningless lexical forms in the lexical space of
 the type, nor are there ineffable values in the value space. By
 eliminating values from the value space (e.g. by setting minimal and
 maximal values), the type definer may indirectly also eliminate
 lexical forms from the lexical space; conversely, by eliminating some
 items from the lexical space (e.g. by setting a pattern), the type
 definer may eliminate items from the value space.
 Are there crucial aspects of RDF which will break if the list item
 quoted above is changed to read "paired with one or more members of
 the lexical space"?
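 The distinction can be made concrete with a small Python sketch that
 models a lexical mapping as a set of pairs. The mapping shown is the
 one for the boolean type; the helper names, and the extra value
 "unknown" used to show an ineffable value, are hypothetical.

```python
# A lexical mapping modelled as a set of (lexical form, value) pairs.
BOOLEAN_MAPPING = {
    ("true", True),
    ("1", True),
    ("false", False),
    ("0", False),
}


def lexical_space(mapping):
    """Domain of the mapping: every lexical form that appears in a pair."""
    return {lexical for lexical, _ in mapping}


def mapped_values(mapping):
    """Range of the mapping: every value that some lexical form denotes."""
    return {value for _, value in mapping}


def ineffable_values(mapping, value_space):
    """Values claimed for the type that no lexical form maps to."""
    return set(value_space) - mapped_values(mapping)


# XML Schema view: the value space is exactly the range, so nothing is ineffable.
print(ineffable_values(BOOLEAN_MAPPING, {True, False}))
# A value space with a member that has zero lexical forms (hypothetical).
print(ineffable_values(BOOLEAN_MAPPING, {True, False, "unknown"}))
```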
2.3. Lexical forms, strings, and character sequences (schema-related,
editorial)
 In [23]http://www.w3.org/TR/rdf-concepts/#section-Datatypes:
 [23] http://www.w3.org/TR/rdf-concepts/#section-Datatypes
 With one exception, the datatypes used in RDF have a lexical space
 consisting of a set of strings.
 Since "string" is used as the local name for a particular simple type
 in the XML Schema namespace, we believe it will be less confusing for
 users, in the long run, if the lexical representations of
 simple-datatype values are described not as "strings" but as
 "character sequences".
 This comment also applies to other uses of the term string to denote
 the members of a lexical space.
2.4. Strings for natural-language data (substantive)
 In [24]http://www.w3.org/TR/rdf-concepts/#section-Datatypes:
 [24] http://www.w3.org/TR/rdf-concepts/#section-Datatypes
 * A plain literal is a string combined with an optional language
 identifier. This should be used for plain text in a natural
 language. As recommended in the RDF formal semantics
 [RDF-SEMANTICS], these plain literals are self-denoting.
 We do not believe that simple strings are likely to be adequate for
 the representation of arbitrary natural-language text. Even in
 English, natural-language utterances (such as this document) may need
 some degree of inline markup for clarity and adequate presentation; in
 natural-language utterances requiring bidirectional display or ruby,
 the best authorities (including the W3C I18n Working Group) recommend
 the use of markup within the natural-language utterance. We thus
 suggest that you may wish to moderate this recommendation that
 natural-language material be represented by literals.
 This is not an area in which we claim particular technical expertise;
 we merely call it to your attention in the hopes that doing so may be
 useful to you.
2.5. Typos and minor editorial notes
 In [25]http://www.w3.org/TR/rdf-concepts/#section-Literal-Value, for
 "the datatype mapping is applied to the pair form by the lexical form
 and the language identifier" read "the datatype mapping is applied to
 the pair formed by the lexical form and the language identifier".
 In the same section, for "Such a case, while in error, is not
 syntacticly ill-formed " read "Such a case, while in error, is not
 syntactically ill-formed" (et passim).
 In section [26]http://www.w3.org/TR/rdf-concepts/#section-XMLLiteral,
 for "root element tag" read "root element".
 In the same section, for "XML element content" read "XML data" (the
 term element content is used in some markup-related specs as a
 complement of mixed content to denote the content of elements which
 can contain other elements but cannot contain parsed character data).
 [25] http://www.w3.org/TR/rdf-concepts/#section-Literal-Value
 [26] http://www.w3.org/TR/rdf-concepts/#section-XMLLiteral
3. Notes on RDF Semantics
3.1. The "meaning" of literals (editorial)
 The meaning of a literal is principally determined by its character
 string: it either refers to the value mapped from the string by the
 associated datatype, or if no datatype is provided then it refers
 to the literal itself, which is either a unicode character string
 or a pair of a string with a language tag.
 Some members of the XML Schema WG are made nervous by the appeal to
 the notion of "meaning" here. [N.B. our task force read this section
 out of context, and were not aware of any foregoing elucidation. So
 this comment may be out of place.] There is also some concern about
 the apparent conflation here of the notions of meaning and reference.
 We wonder whether this discussion would be weakened by replacing
 references to meaning and reference by references to denotation; we
 are inclined to think it would be an improvement, but recognize that
 the RDF Core WG's views may differ.
3.2. Types as lexical mappings (schema-related)
 A datatype is an entity characterized by a set of character strings
 called lexical forms and a mapping from that set to a set of
 values.
 We have a couple of reservations concerning this characterization.
 * Elsewhere (e.g. in Concepts and Abstract Syntax, section 3.3,
 [27]http://www.w3.org/TR/rdf-concepts/#section-Datatypes), the RDF
 specs say that there may be values in a value space which are not
 in the range of the lexical mapping; we have suggested that if
 possible those statements should be changed, but if they are
 retained, then a datatype cannot be characterized solely by the
 lexical space and the lexical mapping, because such ineffable
 values appear in neither of these.
 * The statement describes (with the exception of the problem just
 noted) simple datatypes, but not the class of complex datatypes
 which can be defined by XML Schema, nor all the types (or
 type-like constructs) definable in various other schema languages
 for XML.
 [27] http://www.w3.org/TR/rdf-concepts/#section-Datatypes
3.3. Miscellaneous editorial notes
 In [28]http://www.w3.org/TR/rdf-mt/#dtype_interp, for "which we will
 refer to as XSD and use the Qname prefix xsd:" read "which we will
 refer to as XSD and denote using the Qname prefix xsd" (or something
 similar).
 In [29]http://www.w3.org/TR/rdf-mt/#dtype_interp:
 [28] http://www.w3.org/TR/rdf-mt/#dtype_interp
 [29] http://www.w3.org/TR/rdf-mt/#dtype_interp
 For example, XML Schema requires that the value spaces of
 xsd:string and xsd:decimal to be disjoint ...
 This sentence is not exactly wrong, but it seems slightly unusual to
 use the verb require here, instead of define or something similar. We
 suggest recasting this as "For example, XML Schema defines the value
 spaces of xsd:string and xsd:decimal as disjoint ..." (Note, for the
 record, that the value spaces of all the primitive simple datatypes of
 XML Schema 1.0 are pairwise disjoint.)
 In ,
 any literal of the form "sss"@ttt^^ddd, where ddd is not
 rdf:XMLLiteral, treated as identical to the same literal without
 the language tag, "sss"@ddd
 is "sss"@ddd a typo for "sss"^^ddd?
 In [30]http://www.w3.org/TR/rdf-mt/#dtype_entail, for "it is valid to
 add any number of leading zeros to any numeral and still be a correct
 lexical form for xsd:integer", perhaps read "it is possible to add any
 number of leading zeros to any lexical form for xsd:integer without it
 ceasing to be a correct lexical form for xsd:integer"
 [30] http://www.w3.org/TR/rdf-mt/#dtype_entail
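 The point about leading zeros can be shown with a short Python sketch
 of the lexical mapping for integers. The function name integer_value
 is hypothetical, and the sketch assumes the lexical forms are an
 optional sign followed by decimal digits.

```python
import re


def integer_value(lexical_form):
    """Map a lexical form for an integer to the value it denotes,
    rejecting character sequences outside the lexical space."""
    if re.fullmatch(r"[+-]?[0-9]+", lexical_form) is None:
        raise ValueError("not a lexical form for an integer: %r" % lexical_form)
    return int(lexical_form)


# Several distinct lexical forms, one value.
VALUES = {integer_value(form) for form in ["27", "027", "00027", "+27"]}
print(VALUES)
```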
4. Notes on RDF/XML Syntax Specification (Revised)
 RDF/XML Syntax, [31]http://www.w3.org/TR/rdf-syntax-grammar/
 [31] http://www.w3.org/TR/rdf-syntax-grammar/
4.1. Manifest typing in the instance (policy)
 RDF allows Typed Literals to be given as the object node of arcs.
 These consist of a literal string (with optional language) and a
 datatype RDF URI Reference. This is handled ... with an additional
 rdf:datatype="datatypeURI" attribute on the property element.
 We believe there are probably good reasons for using an rdf:datatype
 attribute, instead of re-using the existing xsi:type attribute which
 has (when the type is defined in a schema defined by XML Schema 1.0)
 the same semantics. In particular, rdf:datatype does not assume or
 assert the existence of the type named as a type in a schema defined
 by XML Schema, so it would be problematic to use xsi:type.
 We do fear, however, that users are likely to find this
 near-duplication of the meaning and function of xsi:type confusing. It
 is not clear to us what, if anything, can or should be done to
 minimize this danger.
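 For concreteness, the following Python sketch reads rdf:datatype from
 a property element using a generic XML parser. The sample document,
 its subject URI, and the exterms namespace URI are illustrative
 assumptions; the function typed_literals is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DATATYPE = "{%s}datatype" % RDF_NS

# Illustrative document; the URIs are assumptions made for this sketch.
DOC = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:exterms="http://www.example.org/terms/">
  <rdf:Description rdf:about="http://www.example.org/staffid/85740">
    <exterms:age
        rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">27</exterms:age>
  </rdf:Description>
</rdf:RDF>"""


def typed_literals(xml_text):
    """Return (character content, datatype URI) for every element
    that carries an rdf:datatype attribute."""
    root = ET.fromstring(xml_text)
    return [
        (element.text, element.get(DATATYPE))
        for element in root.iter()
        if element.get(DATATYPE) is not None
    ]


print(typed_literals(DOC))
```

 Note that the attribute value is a URI reference, not a QName as with
 xsi:type, so no namespace context is needed to interpret it.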
4.2. QNames (Editorial, but important)
 We were unable, on a first reading, to determine whether the default
 namespace declaration, and thus unprefixed names, were or were not
 allowed in order to encode 'RDF URI References'. Indeed the
 introductory prose about QNames (2nd para of
 [32]http://www.w3.org/TR/rdf-syntax-grammar/#section-Syntax-intro)
 does not seem to connect up with the relevant (?) production in
 [33]http://www.w3.org/TR/rdf-syntax-grammar/#section-Infoset-Grammar,
 which we take to be
 [34]http://www.w3.org/TR/rdf-syntax-grammar/#URI-reference.
 This can and should be cleared up.
 [32] http://www.w3.org/TR/rdf-syntax-grammar/#section-Syntax-intro
 [33] http://www.w3.org/TR/rdf-syntax-grammar/#section-Infoset-Grammar
 [34] http://www.w3.org/TR/rdf-syntax-grammar/#URI-reference
4.3. Miscellaneous editorial notes
 In
 [35]http://www.w3.org/TR/rdf-syntax-grammar/#section-Syntax-empty-property-elements,
 the sentence
 [35] http://www.w3.org/TR/rdf-syntax-grammar/#section-Syntax-empty-property-elements
 When an arc in an RDF Graph points to an object node which has no
 further arcs, which appears in RDF/XML as an empty node element
 sequence such as the pair <rdf:Description rdf:about="...">
 </rdf:Description>, this form can be shortened.
 seems less clear than it might be. Different readers prove to have
 different views on what is meant by "the pair <rdf:Description
 rdf:about="..."> </rdf:Description>"; perhaps it can be replaced by
 something like "the empty element <rdf:Description rdf:about="..."/>"
 without loss of precision? Perhaps the sentence could read
 When an arc in an RDF Graph points to an object node which has no
 further arcs, which appears in RDF/XML as an empty node element
 such as <rdf:Description rdf:about="..."/>, this form can be
 shortened.
4.4. Normative specification of XML grammar (policy, substantive)
 We note with admiration the excellent tutorial introduction to the
 striped syntax in Section 2
 [36]http://www.w3.org/TR/rdf-syntax-grammar/#section-Syntax. We are
 less happy with the nature of the syntax, and with the approach taken
 to its normative statement
 [37]http://www.w3.org/TR/rdf-syntax-grammar/#section-Infoset-Grammar.
 As regards the syntax itself, we would much prefer to have seen a move
 to a single canonical syntax with much less variability. With respect,
 the current design suggests that the value of XML has been
 misunderstood. The range of alternative forms of expression provided
 for in the current design make it very difficult to use the broad
 range of generic XML tools (e.g. syntax-directed editors, XSLT) which
 could give so much benefit to RDF users. (More on this below.) At the
 very least we would encourage you to specify a single canonical form,
 probably strictly striped, which could be defined by an XML Schema or
 DTD. We would be happy to work with you to develop a schema for such a
 subset.
 As regards the approach taken to defining the syntax, in our view,
 layering of specs has very high value, and so defining an XML document
 type by way of what is very nearly a character-level BNF is at best a
 missed opportunity and at worst a serious mistake. It obscures the
 important aspects of the document type behind a welter of irrelevant
 detail about e.g. whitespace and start-tag/end-tag matching. It makes
 it very difficult for the reader to actually understand what is and
 isn't actually allowed -- what an RDF/XML document actually looks
 like.
 Not only does this confuse levels and thus readers, it also runs the
 risk of inadvertently defining an XML subset. It also appears, on a
 strict reading, to rule out XML documents not derived from the parsing
 of character streams as possible RDF/XML (so that it would be
 illegitimate to regard a data structure created using a DOM interface,
 for example, as RDF/XML).
 The use of event-triggered data-model construction actions to specify
 the relationship between XML representation and corresponding data
 objects is innovative and compelling, but surely it would be
 straight-forward to associate these events with a pre-order traversal
 of an infoset independently constrained by a DTD, XML Schema schema or
 other appropriate definition of the canonical document type. If
 continued support for alternative forms is considered essential, then
 a two-step approach where the semantics of any non-canonical form is
 defined in terms of a canonical form to which it corresponds would
 still be far simpler than the current approach.
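 A rough Python sketch of the pre-order traversal suggested above
 follows. It assumes a hypothetical strictly striped form in which
 every node element is an rdf:Description with rdf:about, and every
 property element contains either character data or exactly one nested
 node element. This is not the grammar of the RDF/XML specification,
 and the names walk and expand are invented for the example.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ABOUT = "{%s}about" % RDF_NS


def expand(tag):
    """Turn ElementTree's '{namespace}local' tag into namespace + local."""
    namespace, local = tag[1:].split("}", 1)
    return namespace + local


def walk(node):
    """Pre-order traversal of one node element: emit a triple for each
    property element, then descend into any nested node element."""
    subject = node.get(ABOUT)
    for prop in node:
        predicate = expand(prop.tag)
        nested = list(prop)
        if nested:
            yield (subject, predicate, ("uri", nested[0].get(ABOUT)))
            yield from walk(nested[0])
        else:
            yield (subject, predicate, ("literal", prop.text or ""))


# Illustrative input; all example.org URIs are assumptions.
DOC = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:ex="http://example.org/terms/">
  <rdf:Description rdf:about="http://example.org/a">
    <ex:knows>
      <rdf:Description rdf:about="http://example.org/b">
        <ex:age>27</ex:age>
      </rdf:Description>
    </ex:knows>
  </rdf:Description>
</rdf:RDF>"""

TRIPLES = [triple for node in ET.fromstring(DOC) for triple in walk(node)]
for triple in TRIPLES:
    print(triple)
```

 Because the input is an already-parsed tree, the same traversal would
 apply equally to a structure built through a DOM-like interface
 rather than parsed from a character stream.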
 [36] http://www.w3.org/TR/rdf-syntax-grammar/#section-Syntax
 [37] http://www.w3.org/TR/rdf-syntax-grammar/#section-Infoset-Grammar
4.5. On the relation between RDF and off-the-shelf XML tools (policy,
substantive)
 With some diffidence, we conclude by raising what may be a sensitive
 issue.
 It does not seem to us that the XML serialization of RDF shows RDF to
 advantage. At the level of the underlying graph model, RDF information
 has a simple and regular structure, which appears in the XML
 serialization to be anything but simple and so irregular as to bring
 the words "capricious" and "arbitrary" to the lips of unprejudiced
 observers. Tastes in markup style differ, but we believe that the root
 of the problem is the high degree of variability with which the same
 underlying graph structures may be serialized, according to the rules
 given in this document.
 Owing in part to the variability itself, and in part to the specific
 forms taken by that variability, it is not feasible to write an XML
 Schema schema, or (if the comments in Appendix A.1 are accurate) a
 Relax NG schema, or an XML 1.0 DTD, which defines the set of correct
 serializations of correct RDF graphs. It is not convenient to run XSLT
 processes over arbitrary RDF serializations, nor to query or process
 arbitrary RDF data using XQuery. Arbitrary RDF data is similarly
 inconvenient for other standard XML tools to process.
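 The difficulty can be illustrated with a small Python sketch. The two
 documents below serialize the same statement, once with a property
 element and once with a property attribute; a fixed path expression
 written against the first form finds nothing in the second. The
 example.org URIs and the function ages_by_path are assumptions made
 for the example.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "ex": "http://example.org/terms/",
}

# The statement written with a property element.
FORM_A = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:ex="http://example.org/terms/">
  <rdf:Description rdf:about="http://example.org/p">
    <ex:age>27</ex:age>
  </rdf:Description>
</rdf:RDF>"""

# The same statement written with a property attribute.
FORM_B = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:ex="http://example.org/terms/">
  <rdf:Description rdf:about="http://example.org/p" ex:age="27"/>
</rdf:RDF>"""


def ages_by_path(xml_text):
    """Query written against one serialization: look for ex:age elements."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.findall(".//ex:age", NS)]


print(ages_by_path(FORM_A))
print(ages_by_path(FORM_B))
```

 A query, stylesheet, or schema that is to cover both forms must
 enumerate each permitted variation explicitly.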
 There is, as a result, something of a cleft between the RDF community
 and the set of RDF tools on the one hand, and the community of users
 and tools employing what some have called colloquial XML on the other.
 The parallel development of query languages, schema languages, object
 models, APIs,
 editors, display tools, and so on does offer relatively harmless ways
 for a large number of people to employ their time, but it does not
 seem to us to serve the larger Web community well.
 The cleft between RDF and colloquial XML does not seem to us to be
 required by the RDF data model. A graph in which nodes have certain
 properties and arcs have certain properties is not, in itself, a
 peculiarly difficult structure to render in XML or to process with
 off-the-shelf XML tools. An XML vocabulary in which nodes may appear
 as elements, or as attributes, or as attribute values, or as the
 PCDATA content of elements, and in which property names may appear as
 three of the same four constructs, on the other hand, seems a rather
 less straightforward XML representation of the underlying graph
 structure than most XML vocabularies for graphs have chosen.
 The result is that not just arbitrary RDF data, but data encoded using
 vocabularies defined in RDF terms (for which current W3C work provides
 a number of examples), will be hard to process using off-the-shelf
 tools. We believe this difficulty represents a lost opportunity, and
 we believe the opportunity could readily be seized if the XML
 serialization were modified to capture more of the regularity of the
 RDF data model.
 We are ready to work together with the Working Groups in the Semantic
 Web Activity and with other interested parties to formulate an XML
 serialization which captures the information in the RDF model and
 which is more readily amenable to processing with off-the-shelf XML
 tools.
Received on Monday, 10 March 2003 16:35:17 UTC