This document describes the concepts behind preserving doctype information. For the resources associated with this sample, see https://bitbucket.org/deltaxml/preserving-doctype-information.
While the PipelinedComparator and DXP pipelines have facilities for specifying a doctype using output properties, they use fixed values or, at best, values that must be specified as parameters. If the input files themselves contain a doctype, it is often preferable to use this as the doctype for the output document without having to specify it. This sample pipeline demonstrates how to output a dynamic doctype based on the doctypes in the input documents.
Note that it may be easier to select one of the pre-configured lexical preservation modes as discussed in the Guide to Lexical Preservation, as many of them include the dynamic preservation of doctypes.
Simple API approach
A simple approach for retaining doctypes is to enable the built-in lexical preservation on our 'Core S9 API' comparators, such as the PipelinedComparatorS9, which are configured by passing them a LexicalPreservationConfig object. The following code extract illustrates how to enable just doctype preservation on a LexicalPreservationConfig object.
Having enabled the preservation, the next step is to specify how changes in the doctype and its optional internal subset should be handled. Unchanged doctypes are straightforward to handle: their input value is passed through to the output, possibly with a different whitespace layout, as this is not reported by the parser. The difficulty comes in working out how to handle inconsistent doctypes, as an output XML document cannot have more than one doctype. Here, one answer is to choose one of the input doctypes and hope that they are compatible. The 'B' input can be chosen as follows:
One problem with the above 'input selection' approach is that a doctype's internal subset can declare elements, attributes and entities that are used in the document. Removing these declarations could therefore cause the output document to become invalid. Hence, the lexical preservation scheme provides a special output mode that keeps the internal subset declarations from both inputs, except where they conflict, in which case one is chosen. Setting the doctype output mode to 'BdA' has the effect of choosing the 'B' version of all doctype information, and the 'A' version of any declaration for which there is no 'B' version.
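The 'B where present, otherwise A' rule for internal subset declarations can be sketched in a few lines of Python. This is only an illustration of the merge semantics described above, not the product's implementation or API. The function name `merge_bda` and the declaration names `onlyInA`, `inBoth` and `onlyInB` are hypothetical.

```python
def merge_bda(decls_a, decls_b):
    """Merge declarations keyed by name: 'B' wins on conflict, 'A' fills gaps."""
    merged = dict(decls_a)   # start with every declaration from the 'A' input
    merged.update(decls_b)   # 'B' replaces conflicting entries and adds its own
    return merged


# Hypothetical internal subset declarations, keyed by element name.
decls_a = {
    "onlyInA": "<!ELEMENT onlyInA (#PCDATA)>",
    "inBoth": "<!ELEMENT inBoth (#PCDATA)>",
}
decls_b = {
    "inBoth": "<!ELEMENT inBoth EMPTY>",
    "onlyInB": "<!ELEMENT onlyInB (#PCDATA)>",
}

merged = merge_bda(decls_a, decls_b)
for name, declaration in merged.items():
    print(name, declaration)
```

In this sketch the declaration found only in 'A' survives, the conflicting declaration takes its 'B' form, and the declaration found only in 'B' is kept.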
For further information on the representation of the doctype information please refer to the Explanation section of this sample.
DCP or DXP Approach
It is possible to specify the lexical preservation options in a DXP (for the Pipelined Comparator) or DCP (for the Document Comparator) configuration file using the lexicalPreservation element. For example, the following DCP file can be used to set up the lexical preservation configuration in the same manner as discussed in the Simple API approach above.
Running the sample
The sample resources and a description of how to run the sample can be found at: https://bitbucket.org/deltaxml/preserving-doctype-information.
The resources should be checked out, cloned or downloaded and unzipped into the samples directory of the XML Compare release. They should be located such that they are two levels below the top level release directory, for example
Converting a Doctype into XML
The first step is to convert the doctype into XML inside the document. As it stands, the doctype is not technically part of the XML document itself and will not pass through from the parser to the Comparator unless we intervene. To preserve this information, the Pipelined Comparator and Document Comparator use a built-in filter, which is the first filter in the input pipeline. The purpose of this filter, amongst other things, is to convert the reported doctype information into an element that is added as the first child of the root element. This element is then passed through to the Comparator and makes its way through to the output filters.
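The idea of turning parser-reported doctype information into an ordinary element can be sketched with Python's standard `xml.dom.minidom`. This is an assumption-laden illustration, not the built-in filter: the element name `doctype-info` and its attribute names are hypothetical and do not match the representation the product actually produces.

```python
from xml.dom import minidom

# A small input with an XHTML doctype; the external DTD is not fetched.
SOURCE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    "<html><body/></html>"
)


def embed_doctype(xml_text):
    """Copy the doctype details into a first child of the root element."""
    doc = minidom.parseString(xml_text)
    doctype = doc.doctype
    if doctype is None:
        return doc  # nothing to preserve
    # 'doctype-info' is a hypothetical name chosen for this sketch.
    holder = doc.createElement("doctype-info")
    holder.setAttribute("name", doctype.name)
    if doctype.publicId:
        holder.setAttribute("publicId", doctype.publicId)
    if doctype.systemId:
        holder.setAttribute("systemId", doctype.systemId)
    root = doc.documentElement
    # insertBefore appends when the root has no children yet.
    root.insertBefore(holder, root.firstChild)
    return doc


doc = embed_doctype(SOURCE)
print(doc.documentElement.toxml())
```

Once the information is held in an element like this, it travels with the document content and can be compared like any other element.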
Example 1 shows an input document with a doctype declaration, and Example 2 shows the same document after it has passed through the filter used for lexical preservation.
Example 1: input document with doctype (input1.xhtml in Bitbucket, https://bitbucket.org/deltaxml/preserving-doctype-information)
Example 2: the input document after passing through the filter used for lexical preservation
Handling Changes to the Doctype
Now that the doctype is held as XML within the document, it will be compared as part of the comparison process. This means that if the input documents have different doctypes, that change will be reflected in the result document. In order to output the doctype correctly, we must decide which version is going to be used. We can make use of the generic ignore-changes filters to process this. While this may be a little more complicated than the sample pipeline warrants, it is included as a filter that could be adapted to ignore other change types at the same time. The most important point is that the <preserve:doctype> element should be marked as unchanged by the time we reach the final filter. For more information on ignoring changes, see the ignore-changes sample.
Example 3 shows how a doctype may change. The XML shown is a snippet from the immediate output of the comparator when comparing input1.xhtml and input2.xhtml (included in Bitbucket, https://bitbucket.org/deltaxml/preserving-doctype-information).
Example 3: a modified
As can be seen above, the publicId and systemId for the doctype have changed between the inputs. In order to output a doctype in the result we need to decide which version we are going to use. An instance of the 'Core S9API' LexicalPreservationConfig class can be used to choose which version of the doctype to output in the event of a change (as discussed previously). Example 4 shows the output when the DOCTYPE output mode is set to 'BdA'; i.e. to output the B doctype where present, or the A doctype if there wasn't one in input B.
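The 'B where present, otherwise A' choice for the doctype as a whole, followed by writing the chosen identifiers back out as a declaration, can be sketched as follows. This is an illustration under stated assumptions, not the product's code: the `Doctype` type and both function names are hypothetical, and the XHTML identifiers are illustrative values rather than the contents of the sample inputs.

```python
from typing import NamedTuple, Optional


class Doctype(NamedTuple):
    """Hypothetical holder for the identifiers reported for a doctype."""
    name: str
    public_id: Optional[str] = None
    system_id: Optional[str] = None


def choose_doctype(doctype_a, doctype_b):
    """Prefer the 'B' doctype; fall back to 'A' when 'B' has none."""
    return doctype_b if doctype_b is not None else doctype_a


def render_doctype(doctype):
    """Serialise the chosen identifiers as a DOCTYPE declaration."""
    if doctype.public_id and not doctype.system_id:
        # XML requires a system identifier alongside a public identifier.
        raise ValueError("a PUBLIC doctype needs a system identifier")
    if doctype.public_id:
        return (f'<!DOCTYPE {doctype.name} PUBLIC '
                f'"{doctype.public_id}" "{doctype.system_id}">')
    if doctype.system_id:
        return f'<!DOCTYPE {doctype.name} SYSTEM "{doctype.system_id}">'
    return f"<!DOCTYPE {doctype.name}>"


# Illustrative values only.
strict = Doctype("html", "-//W3C//DTD XHTML 1.0 Strict//EN",
                 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd")
transitional = Doctype("html", "-//W3C//DTD XHTML 1.0 Transitional//EN",
                       "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd")

print(render_doctype(choose_doctype(strict, transitional)))
print(render_doctype(choose_doctype(strict, None)))
```

The first call renders the 'B' (transitional) declaration; the second falls back to the 'A' (strict) declaration because 'B' has no doctype.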
Example 4: the resulting file
Note that the old testDelete element declaration was only in the 'A' input, and the modified testConflict element declaration was in both the 'A' and 'B' inputs. Therefore, according to the 'BdA' behaviour, the deleted declaration (i.e. testDelete) and the 'B' version of the modified declaration are in the output.