Caching and Preparsing Grammars

Questions

Caching Grammars
Xerces Default Grammar Caching Implementation
Preparsing Grammars
Grammar caching with Standard APIs
Examining Grammars
Alternative method for getting an XSModel

Answers

I have a set of (DTD or XML Schema) grammars that I use a lot. How can I make Xerces reuse the representations it builds for these grammars, instead of parsing them anew with every new document?

Before answering this question, it will greatly help to understand how Xerces handles grammars internally. To do this, here are some terms:

Grammar: Defined by the org.apache.xerces.xni.grammars.Grammar interface; simply differentiates objects that are Xerces grammars from other objects, and provides a means to get at the location information (XMLGrammarDescription) for the grammar represented.
XMLGrammarDescription: Defined by the org.apache.xerces.xni.grammars.XMLGrammarDescription interface; holds some basic location information common to all grammars. This can be used to distinguish one Grammar object from another, and also contains information about the type of the grammar.
Validator: A generic term used in Xerces to denote an object which compares the structure of an XML document with the expectations of a certain type of grammar. Currently, we have DTD and XML Schema validators.
XMLGrammarPool: Defined by the org.apache.xerces.xni.grammars.XMLGrammarPool interface; this object is owned by the application, and it is the means by which the application and Xerces pass complex grammars to one another.
Grammar bucket: An internal data structure owned by a Xerces validator, in which the grammars--and information related to grammars--to be used in a given validation episode are stored.
XMLGrammarLoader: Defined by the org.apache.xerces.xni.grammars.XMLGrammarLoader interface; this defines an object that "knows how" to read the XML representation of a particular kind of grammar and construct a Xerces-internal representation (a Grammar object) out of it. These objects may interact with validators during parsing of instance documents, or with external code during grammar preparsing.

Now that the terminology is out of the way, it's possible to relate all these objects together. At the commencement of a validation episode, a validator will call the retrieveInitialGrammarSet(String grammarType) method of the XMLGrammarPool instance to which it has access. It will use the Grammar objects it procures in this way to seed its grammar bucket.

When the validator determines that it needs a grammar, it will consult its grammar bucket. If it finds a matching grammar, it will attempt to use it. Otherwise, if it has access to an XMLGrammarPool instance, it will request a grammar from that object with the retrieveGrammar(XMLGrammarDescription desc) method. Only if both of these steps fail will it fall back to attempting to resolve the grammar entity and calling the appropriate XMLGrammarLoader to actually create a new Grammar object.

At the end of the validation episode, the validator will call the cacheGrammars(String grammarType, Grammar[] grammars) method of the XMLGrammarPool (if any) to which it has access. There is no guarantee that grammars the grammar pool itself supplied to the validator will be excluded from this set, so a grammar pool implementation cannot assume that only new grammars will be passed back in this situation.
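
The lookup order described above can be sketched as a toy model. This is not Xerces code: GrammarLookupSketch, its maps, and the string-valued "grammars" are hypothetical stand-ins for the validator's grammar bucket, the XMLGrammarPool, and the XMLGrammarLoader, and a plain string key stands in for an XMLGrammarDescription.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Toy model of the lookup order; every name here is hypothetical, not a Xerces class. */
public class GrammarLookupSketch {
    public final Map<String, String> pool = new HashMap<>();   // stands in for the XMLGrammarPool
    public final Map<String, String> bucket = new HashMap<>(); // stands in for the grammar bucket
    public final List<String> loaded = new ArrayList<>();      // keys that had to be parsed afresh
    private final Function<String, String> loader;             // stands in for an XMLGrammarLoader

    public GrammarLookupSketch(Function<String, String> loader) {
        this.loader = loader;
    }

    /** Start of an episode: seed the bucket from the pool (cf. retrieveInitialGrammarSet). */
    public void startEpisode() {
        bucket.clear();
        bucket.putAll(pool);
    }

    /** Bucket first, then the pool (cf. retrieveGrammar), and only then the loader. */
    public String findGrammar(String key) {
        String grammar = bucket.get(key);
        if (grammar == null) {
            grammar = pool.get(key);
        }
        if (grammar == null) {
            grammar = loader.apply(key);
            loaded.add(key);
        }
        bucket.put(key, grammar);
        return grammar;
    }

    /** End of an episode: hand everything back (cf. cacheGrammars), including pool-supplied entries. */
    public void endEpisode() {
        pool.putAll(bucket);
    }

    public static void main(String[] args) {
        GrammarLookupSketch validator = new GrammarLookupSketch(key -> "grammar for " + key);
        for (int episode = 0; episode < 2; episode++) {
            validator.startEpisode();
            validator.findGrammar("urn:example:a");
            validator.endEpisode();
        }
        System.out.println(validator.loaded); // prints [urn:example:a]: parsed once, reused once
    }
}
```

Note that endEpisode() passes back entries the pool already held, which is why the pool side must tolerate duplicates.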

At long last, it's now possible to answer the original question--how can one cache grammars? Assuming one has a reasonable XMLGrammarPool implementation--such as that provided with Xerces--there are two answers:

The "passive" approach: Don't do any preparsing, just register the grammar pool implementation with the parser, and as new grammars are requested by instance documents, simply let the validators add them to the pool. This is very unobtrusive to the application, but doesn't provide that much control over what grammars are added; even if a custom EntityResolver is registered, it's still possible that unwanted grammars will make it into the pool.
The "active" approach: Preload a grammar pool implementation with all the grammars you'll need, then lock it so that no new grammars will be added. Registering this pool on the configuration will then allow validators to make use of this set; registering a do-nothing EntityResolver will allow the application to prevent validators from using any but the "approved" grammar set. This will oblige the application to use more Xerces code, but provides far more fine-grained control over what grammars may be used.

We discuss both these approaches in a bit more detail below, complete with some (broad) examples. In the meantime, the XMLGrammarBuilder sample, found in the samples/xni directory, should provide a starting point for implementing either the active or passive approach.

Exactly how does Xerces's default implementation of things like the grammar pool work?

Before proceeding further, let there be no doubt that, by default, Xerces does not cache grammars at all. In order to trigger Xerces grammar caching, an XMLGrammarPool must be set, using the setProperty method, on a Xerces configuration that supports grammar pools. On the other hand, you could simply use the XMLGrammarCachingConfiguration as discussed briefly below.

When enabled, by default, Xerces's grammar pool implementation stores any grammar offered to it (provided it does not already have a reference matching that grammar). It also makes available all grammars it has, of a particular type, on calls to retrieveInitialGrammarSet. It will also try to retrieve a matching grammar on calls to retrieveGrammar.

Xerces uses hashing to distinguish different grammar objects, by hashing on the XMLGrammarDescription objects that those grammars contain. Thus, both of Xerces's implementations of XMLGrammarDescription--for DTD's and XML Schemas--provide implementations of hashCode(): int and equals(Object): boolean that are used by the hashing algorithm.

In XML Schemas, hashing is simply carried out on the target namespace of the schema. Thus, two grammars are considered equal (by our default implementation) if and only if their XMLGrammarDescriptions are instances of org.apache.xerces.impl.xs.XSDDescription (our schema implementation of XMLGrammarDescription) and the targetNamespace fields of those objects are identical.

The case in DTD's is much more difficult. Here is the algorithm, which describes the conditions under which two DTD grammars will be considered equal:

Both grammars must have XMLGrammarDescriptions that are instances of org.apache.xerces.impl.dtd.XMLDTDDescription.
If their publicId or expandedSystemId fields are non-null they must be identical.
If one of the descriptions has a root element defined, it must be the same as the root element defined in the other description, or be in the list of global elements stored in that description.
If neither has a root element defined, then they must share at least one global element declaration in common.
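
The DTD conditions listed above can be sketched as follows. DtdMatchSketch is a hypothetical stand-in, not the Xerces XMLDTDDescription class, and it encodes one reading of the rules: the identifier rule is taken to mean that publicId and expandedSystemId must each be equal on both sides (both null also counts as equal).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

/** Hypothetical stand-in for a DTD grammar description; not the Xerces class. */
public class DtdMatchSketch {
    private final String publicId;
    private final String expandedSystemId;
    private final String rootName;            // may be null
    private final Set<String> globalElements; // element names declared by the DTD

    public DtdMatchSketch(String publicId, String expandedSystemId,
                          String rootName, String... globalElements) {
        this.publicId = publicId;
        this.expandedSystemId = expandedSystemId;
        this.rootName = rootName;
        this.globalElements = new HashSet<>(Arrays.asList(globalElements));
    }

    /** True if this description has no root, or its root is known to the other description. */
    private boolean rootAcceptedBy(DtdMatchSketch other) {
        return rootName == null
            || rootName.equals(other.rootName)
            || other.globalElements.contains(rootName);
    }

    public boolean matches(DtdMatchSketch other) {
        // Identifier rule (assumed reading): both identifiers must agree.
        if (!Objects.equals(publicId, other.publicId)
                || !Objects.equals(expandedSystemId, other.expandedSystemId)) {
            return false;
        }
        // Neither side names a root: require a shared global element declaration.
        if (rootName == null && other.rootName == null) {
            return !Collections.disjoint(globalElements, other.globalElements);
        }
        // Otherwise every root that is defined must be acceptable to the other side.
        return rootAcceptedBy(other) && other.rootAcceptedBy(this);
    }

    public static void main(String[] args) {
        DtdMatchSketch cached = new DtdMatchSketch(null, "file:///my.dtd", null, "myDoc", "item");
        DtdMatchSketch request = new DtdMatchSketch(null, "file:///my.dtd", "myDoc");
        DtdMatchSketch unrelated = new DtdMatchSketch(null, "file:///my.dtd", "invoice");
        System.out.println(cached.matches(request));   // prints true
        System.out.println(cached.matches(unrelated)); // prints false
    }
}
```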

The DTD grammar caching also assumes that the entirety of the cached grammar will lie in an external subset. That is, in the example below, Xerces will happily cache--or use a cached version of--the DTD in "my.dtd". If the document contained an internal subset, its declarations would be ignored.

<!DOCTYPE myDoc SYSTEM "my.dtd">
<myDoc ...>...</myDoc>

Using these heuristics, Xerces's default grammar caching implementation appears to do a reasonable job at matching grammars up with appropriate instance documents. This functionality is very new, so in addition to bug reports we'd very much appreciate, especially on the DTD front, feedback on whether this form of caching is indeed useful or whether--for instance--it would be better if internal declarations were somehow incorporated into the grammar that's been cached.

I like the idea of "active" caching (or I want the grammar object for some purpose); how do I go about parsing a grammar independent of an instance document?

First, if you haven't read the first FAQ on this page and are having trouble with the terminology, the answers can hopefully be found there.

Preparsing of grammars in Xerces is accomplished with implementations of the XMLGrammarLoader interface. Each implementation needs to know how to parse a particular type of grammar and how to build a data structure representing that grammar that Xerces can efficiently make use of in validation. Since most application programs won't want to deal with Xerces implementations per se, we have provided a handy utility class to handle grammar preparsing generally: org.apache.xerces.parsers.XMLGrammarPreparser. This FAQ describes the use of this class. For a live example, check out the XMLGrammarBuilder sample in the samples/xni directory of the binary distribution.

XMLGrammarPreparser has methods for installing XNI error handlers, entity resolvers, setting the Locale, and generally doing similar things as an XNI configuration. Any object passed to XMLGrammarPreparser by any of these methods will be passed on to all XMLGrammarLoaders registered with XMLGrammarPreparser.

Before XMLGrammarPreparser can be used, its registerPreparser(String, XMLGrammarLoader): boolean method must be called. This allows a String identifying an arbitrary grammar type to be associated with a loader for that type. To make people's lives easier, if you want DTD grammar or XML Schema grammar support, you can pass null for the second parameter and XMLGrammarPreparser will try to instantiate the appropriate default grammar loader. For DTD's, for instance, just call registerPreparser like:

grammarPreparser.registerPreparser("http://www.w3.org/TR/REC-xml", null)

Schema grammars correspond to the URI "http://www.w3.org/2001/XMLSchema"; both these constants can be found in the org.apache.xerces.xni.grammars.XMLGrammarDescription interface. The method returns true if an XMLGrammarLoader was successfully associated with the given grammar String, false otherwise.

XMLGrammarPreparser also contains methods for setting features and properties on particular loaders--keyed on the same string that was used to register the loader. It also allows features and properties the application believes to be general to all loaders to be set; it transmits such features and properties to each loader that is registered. These methods also silently consume any notRecognized/notSupported exceptions that the loaders throw. Particularly useful here is registering an XMLGrammarPool implementation, such as that found in org.apache.xerces.util.XMLGrammarPoolImpl.

To actually parse a grammar, one simply calls the preparseGrammar(String grammarType, XMLInputSource source): Grammar method. As above, the String represents the type of the grammar to be parsed, and the XMLInputSource is the location of the grammar to be parsed; this will not be subjected to entity expansion.

It's worth noting that Xerces's default grammar loaders will attempt to cache the resulting grammar(s) if a grammar pool implementation is registered with them. This is particularly useful in the case of schema grammars: If a schema grammar imports another grammar, the Grammar object returned will be the schema doing the importing, not the one being imported. For caching, this means that if this grammar is cached by itself, the grammars that it imports won't be available to the grammar pool implementation. Since our Schema Loader knows about this idiosyncrasy, if a grammar pool is registered with it, it will cache all schema grammars it encounters, including the one which it was specifically called to parse. In general, it is probably advisable to register grammar pool implementations with grammar loaders for this reason; generally, one would want to cache--and make available to the grammar pool implementation--imported grammars as well as specific schema grammars, since the specific schemas cannot be used without those that they import.
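
As a rough sketch of the steps above, the following preloads a pool with schema grammars and then locks it. It assumes xercesImpl.jar is on the classpath; the class name PreparseSketch, the preload helper, and the inline example schema are hypothetical and exist only for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import org.apache.xerces.parsers.XMLGrammarPreparser;
import org.apache.xerces.util.XMLGrammarPoolImpl;
import org.apache.xerces.xni.grammars.XMLGrammarDescription;
import org.apache.xerces.xni.parser.XMLInputSource;

/** Hypothetical helper showing one way to preload and lock a grammar pool. */
public class PreparseSketch {
    static final String GRAMMAR_POOL =
        "http://apache.org/xml/properties/internal/grammar-pool";

    public static XMLGrammarPoolImpl preload(XMLInputSource... schemas) throws IOException {
        XMLGrammarPoolImpl pool = new XMLGrammarPoolImpl();
        XMLGrammarPreparser preparser = new XMLGrammarPreparser();
        // Passing null asks the preparser to instantiate the default schema loader.
        preparser.registerPreparser(XMLGrammarDescription.XML_SCHEMA, null);
        // Register the pool after the loader, so the property reaches that loader.
        preparser.setProperty(GRAMMAR_POOL, pool);
        for (XMLInputSource schema : schemas) {
            preparser.preparseGrammar(XMLGrammarDescription.XML_SCHEMA, schema);
        }
        pool.lockPool(); // "active" approach: no further grammars are accepted
        return pool;
    }

    public static void main(String[] args) throws IOException {
        // Illustrative schema only; real code would point at schema files instead.
        String xsd = "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
            + " targetNamespace='urn:example:note'>"
            + "<xs:element name='note' type='xs:string'/></xs:schema>";
        XMLInputSource source = new XMLInputSource(
            null, "urn:example:note.xsd", null, new StringReader(xsd), null);
        XMLGrammarPoolImpl pool = preload(source);
        int count = pool.retrieveInitialGrammarSet(XMLGrammarDescription.XML_SCHEMA).length;
        System.out.println(count + " schema grammar(s) in the pool");
    }
}
```

The returned pool can then be set on a parser configuration through the same grammar-pool property.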

All right, I've (somehow) got a grammar pool full of grammars. How do I use this with my application that uses standard (SAX|DOM|JAXP) parsers?

For SAX and DOM the case is simple. Just do:

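import org.apache.xerces.parsers.SAXParser;
import org.apache.xerces.parsers.XIncludeAwareParserConfiguration;
import org.apache.xerces.xni.parser.XMLParserConfiguration;

XMLParserConfiguration config = new XIncludeAwareParserConfiguration();
config.setProperty("http://apache.org/xml/properties/internal/grammar-pool",
 myFullGrammarPool);
SAXParser parser = new SAXParser(config); // for DOM: new DOMParser(config)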

Now your grammar pool instance will be used by all validators created by this parser to validate your instance documents.

If you have an application that uses pure JAXP, your task is a bit trickier. You'll need to do something like this:

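import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

System.setProperty("org.apache.xerces.xni.parser.XMLParserConfiguration",
 "org.apache.xerces.parsers.XMLGrammarCachingConfiguration");
// assumes Xerces is the JAXP implementation that the factory selects
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
// parse documents and store grammars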

Note that this only supports the "passive" caching approach discussed above. The org.apache.xerces.parsers.XMLGrammarCachingConfiguration represents experimental code; feedback on whether it is useful would be greatly appreciated.

But I don't want to "preparse" grammars for efficiency; I want to parse them in order to look at their contents using some API! Can I do this?

Yes, for grammar types for which such an API is defined. No such API exists at the current moment for DTD's. For XML Schemas, Xerces implements the XML Schema API. For details, it's best to look at the API docs for the org.apache.xerces.xs package. Assuming you have produced a Grammar object from an XML Schema document by some means, you can turn that object into an object usable in this API as follows:

Cast the Grammar object to org.apache.xerces.xni.grammars.XSGrammar.
Call the toXSModel() method on the cast object.
Use the methods in the org.apache.xerces.xs.XSModel interface to examine the new object; methods on this interface and others in the same package should allow you to access all aspects of the schema.
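
These three steps might look like the sketch below, which lists the global element declarations of a schema grammar. It assumes xercesImpl.jar is on the classpath and that the Grammar passed in really came from an XML Schema; XSModelSketch and globalElementNames are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.xerces.xni.grammars.Grammar;
import org.apache.xerces.xni.grammars.XSGrammar;
import org.apache.xerces.xs.XSConstants;
import org.apache.xerces.xs.XSModel;
import org.apache.xerces.xs.XSNamedMap;
import org.apache.xerces.xs.XSObject;

/** Hypothetical helper: inspect a schema Grammar through the XML Schema API. */
public class XSModelSketch {
    public static List<String> globalElementNames(Grammar schemaGrammar) {
        // Steps 1 and 2: cast to XSGrammar, then convert to an XSModel.
        XSModel model = ((XSGrammar) schemaGrammar).toXSModel();
        // Step 3: query the model; here, all global element declarations.
        XSNamedMap elements = model.getComponents(XSConstants.ELEMENT_DECLARATION);
        List<String> names = new ArrayList<>();
        for (int i = 0; i < elements.getLength(); i++) {
            XSObject element = elements.item(i);
            names.add("{" + element.getNamespace() + "}" + element.getName());
        }
        return names;
    }
}
```

Passing a DTD grammar would fail at the cast with a ClassCastException, so callers may want to check the grammar type first.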

Is there an alternative method for getting an XSModel?

Yes, for more information see the XML Schema FAQ.