Using Stax2 (Woodstox 3.0) Validation API, part 3
Continuing on the theme of validating XML content processed with Woodstox, using the Stax2 extension of the Stax API, let's do something more interesting: validate content as it is being written (note: the full source code for the example shown below can be found at http://woodstox.codehaus.org/DocStax2Validation).
So, here is a piece of code that demonstrates how to validate XML output as it is written (using XMLStreamWriter) with the Stax2 API extension.
    import java.io.StringReader;
    import java.io.StringWriter;
    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.XMLStreamException;
    import org.codehaus.stax2.XMLStreamWriter2;
    import org.codehaus.stax2.validation.XMLValidationSchema;
    import org.codehaus.stax2.validation.XMLValidationSchemaFactory;

    final String DTD_STR = "<!ELEMENT root (branch | leaf)*>\n"
        + "<!ELEMENT branch (leaf)+>"
        + "<!ELEMENT leaf (#PCDATA)>"
        + "<!ATTLIST leaf desc CDATA #IMPLIED>\n";
    StringWriter strw = new StringWriter();

    // First, let's parse DTD schema object
    XMLValidationSchemaFactory sf =
        XMLValidationSchemaFactory.newInstance(XMLValidationSchema.SCHEMA_ID_DTD);
    XMLValidationSchema dtd = sf.createSchema(new StringReader(DTD_STR));

    XMLOutputFactory ofact = XMLOutputFactory.newInstance();
    // The cast only works when Woodstox is the Stax implementation in use
    XMLStreamWriter2 sw = (XMLStreamWriter2) ofact.createXMLStreamWriter(strw);
    sw.validateAgainst(dtd); // this starts validation

    // Document validation is done as output is written
    try {
        sw.writeStartDocument();
        sw.writeStartElement("root");
        sw.writeStartElement("branch");
        sw.writeStartElement("leaf");
        sw.writeEndElement(); // closes leaf
        // We'll get validation exception here -- branch not allowed within branch
        sw.writeStartElement("branch");
        sw.writeEndElement(); // closes inner branch
        sw.writeEndElement(); // closes outer branch
        sw.writeEndElement(); // closes root
        sw.writeEndDocument();
        sw.close();
    } catch (XMLStreamException xse) {
        System.err.println("Failed to output the document: " + xse);
    }
You may notice some similarity with the earlier reader side example (and if not, you may want to have another look!). The pattern is quite simple: obtain a schema object from a schema factory, passing in schema content from any of the typical content sources (InputStream, Reader, javax.xml.transform.Source), and start validating content being read (using XMLStreamReader) or written (using XMLStreamWriter). How is that for simplicity? Even more advanced things, like chaining multiple validator instances or doing partial validation, just use these basic mechanisms (ok, except that partial validation also needs the method stopValidatingAgainst()...).
Now, what is the point of validating output? Since you write the output code, shouldn't you be able to get it right with normal testing? In the example above there obviously isn't much need for validation, but there are other cases where output validation makes sense. For example:
- During testing, you may want to enable strict input and output side validation, as assertions verifying correctness of code, even if you disable validation in production. And even in production, you may be able to easily re-enable validation as needed.
- When doing transformations, it is hard to cover all the possible outputs that might result: even worse, when using technologies like XSLT, there is no formal way of (statically) ensuring that the output will conform to a given schema. But you can assert validity on the output side quite simply by validating against a specific schema.
- When pipelining XML content, it may be easier (and more efficient) to plug in a processing component between the output stream writer and the actual physical output than to write the output to a temporary location and then parse it for validation.
Another question is what the specific point is of using Stax2 validation over, say, stand-alone validators or plugged-in SAX-based validators. One benefit is that validation done as part of reading/writing XML is likely to be more efficient, as input/output is only parsed/generated once. Also, diagnostics regarding the problem are likely to be more accurate when validation is synchronized with actual processing.
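For contrast, here is a rough sketch of the two-pass alternative that the comparison above refers to: the document is first written into a buffer, and then parsed a second time by a stand-alone validating parser. It uses only standard JDK classes (no Stax2), embeds the DTD as an internal subset so the parser can find it, and the class and method names (TwoPassValidation, write, validate) are made up for this illustration. Treat it as a sketch of the general approach, not as a recommended setup.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical illustration class; not part of Woodstox or Stax2
public class TwoPassValidation {
    static final String DTD =
        "<!ELEMENT root (branch | leaf)*>\n"
        + "<!ELEMENT branch (leaf)+>\n"
        + "<!ELEMENT leaf (#PCDATA)>\n"
        + "<!ATTLIST leaf desc CDATA #IMPLIED>\n";

    // Pass 1: generate the whole document into a buffer, without any validation
    static String write(boolean nestBranch) throws Exception {
        StringWriter buf = new StringWriter();
        XMLStreamWriter sw = XMLOutputFactory.newInstance().createXMLStreamWriter(buf);
        sw.writeStartDocument();
        // Internal DTD subset, so that the validating parser has declarations to use
        sw.writeDTD("<!DOCTYPE root [\n" + DTD + "]>");
        sw.writeStartElement("root");
        sw.writeStartElement("branch");
        sw.writeStartElement("leaf");
        sw.writeEndElement(); // closes leaf
        if (nestBranch) {
            // Invalid: branch may only contain leaf elements
            sw.writeStartElement("branch");
            sw.writeEndElement();
        }
        sw.writeEndElement(); // closes branch
        sw.writeEndElement(); // closes root
        sw.writeEndDocument();
        sw.close();
        return buf.toString();
    }

    // Pass 2: parse the buffered output again, collecting validity errors
    static List<String> validate(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setValidating(true); // DTD validation
        DocumentBuilder db = dbf.newDocumentBuilder();
        final List<String> problems = new ArrayList<>();
        db.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { /* ignored in this sketch */ }
            public void error(SAXParseException e) { problems.add(e.getMessage()); }
            public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        db.parse(new InputSource(new StringReader(xml)));
        return problems;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Valid document, problems: " + validate(write(false)));
        System.out.println("Invalid document, problems: " + validate(write(true)));
    }
}
```

Note that here the problem is only discovered after the complete document has been produced and re-parsed, whereas the Stax2 writer in the earlier example fails at the offending writeStartElement() call.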
As to validation schema objects, it is worth noting that they are fully reusable as well as thread-safe (the actual validators created from schemas are not; calls to validateAgainst() create validator objects behind the scenes). This means that in general you can just create validation schema objects once when the system starts up (for a static set of schemas at least), and reuse them freely afterwards.
Given that it is easy to validate XML output this way, I hope that more developers will make use of this feature. I am also interested in hearing about experiences from doing this (feedback can be sent to the stax_builders mailing list, for example).