The Document Comparator is a top-level component in XML Compare, first introduced in version 7.0. It provides a set of document comparison features that are customisable to meet your specific needs. Designed for integration with other systems, the Document Comparator is accessed through the
DocumentComparator class in the Java and .NET APIs for XML Compare. A
DocumentComparator class can also be created using DCP XML via the
This guide serves as an overview of the Document Comparator. It explains when you should use it, its key capabilities, how it can be configured and extended, how comparisons are run, and how the comparison result can be used.
The XML Compare User Guide provides more general information on XML Compare as a whole. It covers subjects such as comparison concepts, system configuration, and licensing. More detailed information on specific Document Comparator features and how they can be customized can be found in the respective Java API documentation and .NET API documentation. The DCP Schema Guide and DCP User Guide together describe how the Document Comparator Pipeline (DCP) XML format is used to customize the Document Comparator.
Choosing a Comparator
Besides the Document Comparator, the other main component in XML Compare API is the Pipelined Comparator, implemented through the
The Pipelined Comparator is a flexible XML comparison toolkit. It provides an XML comparator placed within a pipeline to which XSLT or Java filters can be added at any point within the input or output streams. A set of documented filters is supplied for use in the pipeline.
The Document Comparator is an extension of the Pipelined Comparator designed specifically for comparing documents. As a more specialist solution, it therefore incorporates many filters directly into the pipeline that would otherwise have to be added manually. It also allows for more sophisticated capabilities to be exposed through a simple API. Considerable extensibility is still available should it be required, using DCP or through the API, but the default settings are tailored for document use so that the user is freed from having to understand all the settings.
If the XML you're comparing is predominantly document-based then the Document Comparator would in most cases be preferred to the Pipelined Comparator. That said, the flexibility of the Pipelined Comparator may still offer benefits if you have experience and resources available from previous projects based on this component.
While XML is used for conveying information for a range of diverse purposes, the Document Comparator is optimized specifically for XML content that is predominantly narrative in nature, such as an article or book. To understand why a specialist comparison is needed for this type of XML, it's helpful to first look at the different characteristics of XML elements typically encountered in a document:
- Structure elements divide documents into a hierarchy, often with titled sections.
- Highly structured elements, such as tables, need to conform to a set of rules to be valid.
- Some elements have child elements whose order is not significant (it normally is).
- Structural elements may be uniquely identifiable, often with a dedicated id attribute.
- Elements may link to other documents or resources such as images.
- Some elements are predominantly for styling the document, either inline or at the block level.
- Whitespace differences are significant in some elements, but not all.
- Certain elements may have content with automated changes that are ignorable.
- Text within many elements is of a narrative form, with words and punctuation.
The list of document element characteristics above is grouped according to whether they're related to the structure or content of a document. A document comparison can be customized so that these characteristics are identified for the elements of a specific document type, and therefore improve the effectiveness of the result.
Performing a comparison
Invoking the comparison
Comparisons using the Document Comparator can be invoked either from the command-line, GUI, and REST API using DCP or by using the .NET or Java API via the
compare method of a
DocumentComparator object instance. This method is overloaded to cater for a wide range of use cases.
Before invoking the compare method, the
DocumentComparator instance is configured for the specific comparison required. Options can either be set on the DocumentComparator object directly using Java, or through a DCP XML file using a DCPConfiguration object.
Using Java Only
Using Java with DCP XML for configuration
Under certain conditions document comparisons can take a while to complete. In such cases, the end-user can be informed of progress using a 'listener'. A listener is an instantiated object that implements the
DocumentProgressListener interface; this can exploit data passed in call-back methods from the Document Comparator to relay progress back to the user. The listener is associated with the
DocumentComparator using its
When developing new filters to add to a pipeline, it is often useful to isolate the behaviour of each filter. To achieve this, the Document Comparator can serialize the output from every filter in the pipeline to a separate file. These files are named to match filter steps and arranged in a directory structure to reflect that of the filter chains in the pipeline.
By default, pipeline diagnosis is disabled. To enable it, the
DebugFiles property of the
DocumentComparator is set to
true. This property can also be controlled as an XML Compare configuration property using its fully qualified name
com.deltaxml.cores9api.DocumentComparator.debugFiles. See the XML Compare user guide for more information on this property.
Customizing a comparison
The Document Comparator comprises a standard comparator engine that lies at the centre of a processing pipeline that links together a series of input and output filters. The filters are simple transforms that each modify the input for a specific purpose to produce the required output. Filters can be implemented in XSLT or Java.
The document comparison pipeline
For customization, filters can either be internal and controlled via
DocumentComparator properties, or they can be external XSLT or Java resources. The external filters are managed via the Document Comparator API as
FilterChain objects that are added at named extension points in the comparison pipeline. The extension points allow external filters to fit in with the behaviour of internal filters and are illustrated in the diagram below:
Extension points allow the insertion of custom filters into the pipeline
FilterChain consists of a sequence of 1 or more
FilterStep objects. Since a
FilterChain is a type of
FilterStep, it can in turn be added to another
A filter, normally coded in XSLT, is typically dedicated to a single task such as adding an attribute for elements that match a certain pattern. More detail on writing and adding a filter to the comparison pipeline can be found at Appendix I - Custom Filters.
Types of customization
Customization improves comparisons for a range of purposes that can be broadly categorized as: Input, Readability and Output. The sections below describe each of these categories in turn.
The Document Comparator, provides 'out of the box' comparison behaviour designed for general input document characteristics; but, as outlined in the Document Comparison section, a comparison tailored to the element characteristics of a specific document type will yield significantly improved results.
Word by Word
deltaxml:word-by-word="false"attribute is found on the element. The Word By Word Text Comparison sample shows the result of an XSLT filter being used to disable the word-by-word feature. Note that word-by-word exploits word-splitting based on the
icu4jJava library, from the International Components for Unicode (ICU).
xml:space="preserve"attribute is in scope on a containing element. Text nodes that contain only whitespace characters are treated differently when they are found to be children of mixed-content elements. The Managing White Space guide covers this subject in more detail.
deltaxml:formatting="true"attribute are 'flattened' for comparison purposes and then reconstructed post-comparison. This allows changes to formatting elements to be marked differently and therefore treated specially in an output filter. The default behaviour is to only show the formatting elements from the 'B' document; this is demonstrated in the Formatting Element Changes sample. The
OutputFormatConfiguration.ModifiedFormatOutputproperty defaults to 'B', but has alternative values of 'A' or 'CHANGE'. When
ModifiedFormatOutputis set to'CHANGE', the formatting elements for both documents are represented using special DeltaV2.1 format elements.
DocumentComparatoris used to customize or disable this functionality. For HTML table elements to be processed specially, they must be in the XHTML namespace.
For CALS table processing, the
TableConfiguration object has a set of additional properties:
The standard behaviour is for the comparator to match elements based on their document order. This behaviour can be tailored to match different elements via any of the following methods.
- Use the
deltaxml:ordered="false"attribute on an element whose child element can appear in any order without affecting their meaning.
- Add an
deltaxml:key="value"attribute, where the value is a unique identifier for aligning a specific element occurring in both input documents.
- Sort the child elements of a specific element type using a custom sort key prior to comparison.
The result of orderless comparisons is controlled by the 'OrderlessPresentationMode' property of the OutputFormatConfiguration class. See the API documentation for full details on this.
The principle of using
deltaxml:key="value" attributes with ordered data and mixed ordered/orderless data is shown in the Using Keys with Ordered Data sample and Mixed Ordered and Orderless Data guide respectively.
deltaxml:ignore-changesattribute, where the attribute value determines the keep/override behaviour. A practical example of how this feature is used can be found in the Ignoring Changes and Creating a Merged Document sample.
Note 1: The sample referenced above uses the Pipelined Comparator, so although the filters are the same, they are added to the pipeline and run differently when using the Document Comparator.
Note 2: This feature is not intended for elements marked as 'formatting elements' - use the 'Ignoring Formatting Changes' feature instead.
Altering standard behaviour
The above description of standard input behaviours described the attributes or 'markers' that can be added to customize behaviour. With the exception of Ignored/Merged Changes, these are all added prior to comparison, i.e. in 'marker' input filters.
In some contexts, a non-optimized comparison result may be cluttered or hard to interpret. The Document Comparator exploits a number of features to enhance the readability of results - without affecting their correctness. These features can be fine-tuned to specific content via the
ResultReadabilityOptions class, the default property values for this are given in Appendix II. An overview of the readability options is given below:
Elements from two input documents may align mainly due to structure, with minimal shared text content. Moreover, text matches that do occur may be coincidental.
In such cases, it can cause confusion if the result is represented as a single element with many text changes, interspersed with matching text. Here the strategy is to split the modified element into two when the amount of shared text falls below a given percentage.
To preserve change information on keyed elements, element splitting is never applied to elements with a
Orphaned Word Detection
ChangeGatheringEnabledproperty can be set to false to disable this behaviour at the element level, however, word level change gathering is always enabled.
A further reason for customization is to produce an output format designed for a specific purpose or tool.
LexicalPreservationConfigproperty that is initially set to
This standard lexical preservation should meet the requirements for many cases, but customization is available for specific output formats. This is described in the Lexical Preservation guide.
Ignoring Formatting Changes
paragraphis present in the 'A' document but not in the 'B' document (i.e. the paragraph was deleted as a whole). This is initialized to
ModifiedFormatOutputproperty of the
The 'raw' result of the Document Comparator is the 'Delta'. It has the look and feel of the original input documents, but with annotations added to describe the differences. This is the standard output, but other output formats can be produced:
Tracked Change Formats
OutputFormatConfigurationobject has a
ResultFormatproperty that is used to set the tracked changes format. This object has other properties for setting author and date-time data in the tracked-changes format.
Custom Output Formats
Viewing a comparison result
The raw Delta output from the Document Comparator needs to be transformed to a suitable output format to allow the end-user to interpret the result most effectively. This format can be for viewing in a web browser, opening in an XML authoring tool, opening in a specialist reviewing tool, or perhaps ultimately printing to paper.
The Output Formats section above provides more detail on the pre-defined output formats available from the Document Comparator.
Changes can be viewed in standard XML Authoring tools using the 'Tracked Changes' format options
Appendix I - Custom Filters
Writing an XSLT filter
Most filters are coded in XSLT and will normally exploit DeltaXML's internal Saxon processor which supports XSLT 2.0 and 3.0. Here is a sample input filter (XSLT 2.0) that performs an 'Identity Transform' matching
li elements with
xml:id attributes and adding an
xsl:key attribute to hold the
Adding a filter to the pipeline
Once written, the XSLT filter needs to be added to the Document Comparator pipeline using the
DocumentComparator API, here's some sample Java code showing this:
In the above example the FilterStep is created using the
newFilterStepFromResource method to access one of the XSLT filter files included within XML Compare; the
newFilterStep method is used to access external files.
Screenshot of a side-by-side DiffReport created with the dx2-deltaxml-sbs-folding-html.xsl output filter:
Filters can also be added using a DCP pipeline definition, the following DCP uses the 'folding' DiffReport stylesheet to format the output:
Screenshot of a 'folding' DiffReport created with the dx2-deltaxml-folding-html.xsl output filter:
Appendix II - Default Configuration
The Document Comparator is pre-configured using default settings. The default settings are summarized in the table below, more details can be found in the API documentation, links to the corresponding classes can be found in the 'Class' column.
|Lexical Preservation Config||PresetPreservationMode||ROUND_TRIP||(see API docs)|
|Result Readability Options||ElementSplittingEnabled||true||boolean|
|ElementSplittingThreshold||10||1 to 100|
|OrphanedWordLengthLimit||2||1 to unlimited|
|OrphanedWordMaxPercentage||20||1 to 100|
|ModifiedWhitespaceBehaviour||AUTOMATIC||(See API docs)|
|Output Format Configuration||ResultFormat||DELTA|
|ModifiedAttributeMode||AUTOMATIC||(See API docs)|
|OrderlessPresentationMode||B_DELETES||(See API docs)|
Appendix III - Document Comparator Samples
The Samples section of XML Compare's documentation provides links to all samples and guides, these are designed for use with one or more of XML Compare's comparators. The following list highlights samples of particular relevance to the Document Comparator.