Document Comparison

Introduction

XML document types such as DITA, DocBook and XHTML share a common set of features, including inline formatting, tables, ordered and orderless lists, and linked resources. To help produce simple and accurate difference reports when comparing documents, each element supporting these features can be processed in a special way, both at comparison time and when the result is output.

Many document features can benefit from special processing.

For optimized processing of document-centric features, two approaches are recommended. The first is to use the built-in features of XML Compare's document comparator, augmented with custom XSLT filters where required. The second, more complex approach is to use the pipelined comparator with a specially configured pipeline that applies a set of custom XSLT filters.

Most of the features outlined in this section are incorporated into the document comparator. However, links to samples are included for cases where you wish to customize your own pipelined comparator. These samples also provide useful insight into how the capabilities built into the document comparator actually work. Not all features are enabled by default in the document comparator.

Text Comparison

Normally text comparisons are case-sensitive, but there are certain contexts where case should be ignored; the Case Insensitive Comparison sample shows how this can be done. Comparison of text within each element of a document can also be performed at different levels. XML Compare considers three levels, as outlined below:

Text-node Level: if the contents of a text node change, the whole node is marked as changed (note that a mixed-content element may contain more than one text node).

Word by Word: allows differences in content to be resolved down to specific words; normally differences are shown at the element level. The Word by Word Text Comparison tutorial introduces you to this concept.

Character by Character: a further refinement of Word by Word comparison, where differences within words are marked. This is described further in the Character by Character Comparison tutorial.
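The difference between the three levels can be illustrated with a minimal sketch. It uses Python's standard difflib module rather than XML Compare itself, so the function names (diff_tokens, text_node_diff, word_diff, char_diff) and the tokenization rule are assumptions made for illustration only, not the product's algorithm.

```python
import difflib
import re


def diff_tokens(old_tokens, new_tokens):
    """Return (op, old_slice, new_slice) tuples for the non-equal regions."""
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    return [
        (op, old_tokens[i1:i2], new_tokens[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


def tokenize_words(text):
    # Assumed rule: a token is a run of whitespace or a run of non-whitespace.
    return re.findall(r"\s+|\S+", text)


def text_node_diff(old, new):
    # Text-node level: the whole node is either unchanged or changed.
    return [] if old == new else [("replace", [old], [new])]


def word_diff(old, new):
    # Word level: only the words that differ are reported.
    return diff_tokens(tokenize_words(old), tokenize_words(new))


def char_diff(old, new):
    # Character level: differences inside a word are reported.
    return diff_tokens(list(old), list(new))


old_text = "The quick brown fox"
new_text = "The quick browne fox"

print(text_node_diff(old_text, new_text))
print(word_diff(old_text, new_text))
print(char_diff(old_text, new_text))
```

For this input the text-node level reports the entire string as replaced, the word level narrows the change to a single word, and the character level narrows it to a single inserted character.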

Lexical Preservation

Lexical preservation retains content that is often lost when processing XML: comments, processing instructions, CDATA section markers, DOCTYPE declarations and entity references. The features supporting this in XML Compare are described in the Lexical Preservation reference. For further help on the use of custom lexical preservation filters, see the tutorials How to Preserve Processing Instructions and Comments and How to Preserve Doctype Information.
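To see why such preservation is needed, the following sketch shows how an ordinary XML parser can silently drop lexical detail. It uses Python's standard xml.etree.ElementTree module (the opt-in flags require Python 3.8 or later) and is not related to how XML Compare implements preservation; the sample document is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical input containing a comment, a processing instruction
# and a CDATA section.
SOURCE = (
    "<doc><!-- review this --><?page-break?>"
    "<p><![CDATA[a < b]]></p></doc>"
)

# Default parsing: comments and processing instructions are discarded,
# and the CDATA section is flattened into ordinary escaped text.
lossy = ET.tostring(ET.fromstring(SOURCE), encoding="unicode")

# Opt-in preservation: keep comments and processing instructions in the tree.
# The CDATA markers are still not retained by this parser.
builder = ET.TreeBuilder(insert_comments=True, insert_pis=True)
parser = ET.XMLParser(target=builder)
kept = ET.tostring(ET.fromstring(SOURCE, parser=parser), encoding="unicode")

print(lossy)
print(kept)
```

Comparing the two printed strings shows which items survive a parse and serialize round trip only when the processing chain is explicitly told to keep them.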

Whitespace Management

Whitespace-only nodes found in an XML document should be treated differently depending on whether they are a significant part of content (as in mixed content) or simply used for formatting the XML source. The technique for this is described in the Managing White Space tutorial.
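The distinction can be sketched as follows. This is a heuristic illustration using Python's standard library, not the technique from the tutorial: it assumes that an element is mixed content whenever it directly contains non-whitespace text alongside child elements, whereas a real pipeline might rely on a DTD, a schema or xml:space instead. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_mixed(element):
    """True if the element has children and non-whitespace text directly inside it."""
    pieces = [element.text] + [child.tail for child in element]
    return len(element) > 0 and any(p and p.strip() for p in pieces)


def strip_formatting_whitespace(element):
    """Remove whitespace-only text nodes, but only in element-only content."""
    if len(element) > 0 and not is_mixed(element):
        if element.text is not None and not element.text.strip():
            element.text = None
        for child in element:
            if child.tail is not None and not child.tail.strip():
                child.tail = None
    for child in element:
        strip_formatting_whitespace(child)


doc = ET.fromstring(
    "<section>\n  <p>See <b>bold</b> <i>italic</i> text.</p>\n</section>"
)
strip_formatting_whitespace(doc)
result = ET.tostring(doc, encoding="unicode")
print(result)
```

The indentation inside section is removed because it only formats the source, while the single space between the b and i elements is kept because it is part of the paragraph's mixed content.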

Table Comparison

Complications arise when comparing tables whose structure has changed, for example when a column has been inserted or removed. The DocumentComparator class of the XML Compare API has ProcessCalsTables and ProcessHTMLTables boolean properties (with get/set methods for Java) that, when set, manage table comparison so that the result remains valid. For a detailed discussion of table processing with examples, see Comparing Document Tables.

Key-assisted Matching

Some document elements have unique content, such as id attributes, that can be highlighted for the comparator by adding a special key attribute. Keys are particularly useful for matching 'orderless' elements, but can also be of value for ordered elements, with some additional processing to handle moves (see Detecting and Handling Moves). More information can be found in the following samples and guides: Ordered Comparison, Mixed Ordered and Orderless Data, and Comparing Orderless Elements.
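The idea behind key-assisted matching of orderless elements can be sketched in a few lines. The example below is a simplified illustration in Python, not XML Compare's matching algorithm: it assumes every child carries the key attribute (here the hypothetical default "id"), that keys are unique within a parent, and that children are flat elements with text only.

```python
import xml.etree.ElementTree as ET


def signature(element):
    # Assumption: comparing tag, attributes and trimmed text is enough
    # for the flat elements used in this sketch.
    return (element.tag, dict(element.attrib), (element.text or "").strip())


def match_by_key(old_parent, new_parent, key_attr="id"):
    """Pair child elements by key attribute, ignoring document order."""
    old_items = {child.get(key_attr): child for child in old_parent}
    new_items = {child.get(key_attr): child for child in new_parent}
    report = {}
    for key in sorted(old_items.keys() | new_items.keys()):
        if key not in new_items:
            report[key] = "deleted"
        elif key not in old_items:
            report[key] = "added"
        elif signature(old_items[key]) != signature(new_items[key]):
            report[key] = "modified"
        else:
            report[key] = "unchanged"
    return report


old_doc = ET.fromstring(
    '<contacts><person id="a1">Ann</person>'
    '<person id="b2">Bob</person><person id="c3">Cy</person></contacts>'
)
new_doc = ET.fromstring(
    '<contacts><person id="c3">Cy</person>'
    '<person id="a1">Anne</person><person id="d4">Dee</person></contacts>'
)

report = match_by_key(old_doc, new_doc)
print(report)
```

Because elements are paired by key rather than by position, the reordered c3 entry is reported as unchanged instead of being treated as a deletion and an addition.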

Linked Resources

For elements whose main purpose is to link to other resources such as images or other documents, results can be improved if special processing is applied. Filters can be included in the processing pipeline to handle such cases. The Image and Binary Comparison sample shows how such link elements can be processed using an XSLT filter that exploits a Java extension function for binary file comparison. This sample could be adapted to suit cases where the link target is a text or XML resource.
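As a rough illustration of the binary comparison step, the sketch below decides whether two link targets differ by comparing SHA-256 digests of the files. This is an assumption made for the example: the sample itself uses a Java extension function called from XSLT, and the function names here (file_digest, link_target_changed) are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path):
    """SHA-256 digest of a file, read in chunks so large images are handled."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def link_target_changed(old_path, new_path):
    """True when the two link targets differ in binary content."""
    return file_digest(old_path) != file_digest(new_path)


# Demonstration with temporary files standing in for linked images.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp, "logo-v1.bin")
    renamed_copy = Path(tmp, "logo-renamed.bin")
    edited = Path(tmp, "logo-v2.bin")
    first.write_bytes(b"\x89PNG fake image data")
    renamed_copy.write_bytes(b"\x89PNG fake image data")
    edited.write_bytes(b"\x89PNG other image data")

    renamed_differs = link_target_changed(first, renamed_copy)
    edited_differs = link_target_changed(first, edited)

print(renamed_differs, edited_differs)
```

Comparing content rather than the link value means a renamed but identical resource need not be reported as a change, while an edited resource behind an unchanged link can still be detected.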

Formatting Elements

The document comparator can be configured (by modifying a simple XSLT identity transform) to recognize and process elements used predominantly for inline formatting. This allows content-based element alignment and supports overlaps in the formatted-text range between compared versions. Such formatting differences are represented using extensions introduced in version 2.1 of DeltaV2 and described in the document Two and Three Document DeltaV2 Format. Formatting differences can be rendered or styled independently from structural changes according to need.

A practical example of formatting element processing is included in the Formatting Element Changes sample.

Overlaps in formatting in different versions are detected and recorded in the DeltaV2.

MathML Elements

MathML is used to display mathematical expressions as part of an XML document. The specification is given here. XML Compare includes the ability to show changes in Presentation Markup (chapter 3 of the specification) rather than Content Markup.

For more information, see MathML Comparison.