XML Compare provides a powerful solution to identify and process the differences between any two XML files that share the same root element. Its primary use is as a toolkit for integration into other systems or applications via the comprehensive API, but it may also be run standalone from the command-line or a simple GUI.
This user guide introduces you to the XML Compare product, providing a high-level product description along with a look at the main features and concepts associated with this product. You can find more detailed information on the subjects covered here by following the links to an extensive set of tutorials, samples and papers that complement this product.
The Getting Started 'ReadMe' file provides a quick start for the product as well as a description of all dependencies. The Samples 'ReadMe' gives a summary of all the included samples. An overview of the features in the Document Comparator can be found in the Document Comparator Guide. Comprehensive technical implementation information can be found in the Java and .NET API documentation.
The XML Compare approach is unique in that:
Changes are recorded in an XML 'delta file'.
The delta file has the same look and feel as the original files.
The delta file can include changes only or changes plus unchanged data.
The delta file is easy to understand and to process because it is an XML file.
The delta file can therefore be processed with standard XML tools.
Comparison can be customized by defining/extending filter pipelines.
Two input files are used for an XML Compare comparison, referred to here as 'A' and 'B' files. Whilst it is often the case that 'B' is a modification of 'A', it is also possible that both inputs are derived independently from a common source. Using this terminology, a user-oriented set of high-level features is outlined below:
Find all the differences between any two XML files ('A' and 'B').
Apply changes to convert an 'A' XML file into the 'B' version (i.e. a diff 'patch').
Undo changes to convert a 'B' XML file into the 'A' version.
Display change information in either XML or HTML form, using a standard web browser.
Report changes only or changes+unchanged data.
Use XSLT input and output filters to pre- and post-process the XML data.
Handle large files without performance degradation.
Extensible pipeline with embedded functionality.
Text processing for differences on a word-by-word basis.
Special processing for formatting-elements.
HTML/CALS table structure aware.
Extension points for adding filter steps to the pipeline.
Ignoring changes to non-significant whitespace.
New in 9.1 - Handle moves of uniquely identifiable elements. See Detecting and Handling Moves for more details.
XML Compare runs locally on your own hardware and allows you to quickly embed XML comparison functionality into your own systems. It can be run in a variety of ways, with the range of options determined by the version of the product downloaded.
Table 1. Comparison options for XML Compare Downloads
The XML Compare Download Page provides a choice of three possible downloads: Java (UNIX & Windows), macOS, and .NET. Provided you don't want to use the .NET API, you should select the download option that matches your target operating system. Each download option includes the required Java JAR files (e.g. deltaxml.jar) and support resources. If you're developing for .NET (on Windows), you should select the .NET download option; note that this version does not include the GUI available in the other downloads. Note: A licence file is required to run XML Compare; see the Licensing User Guide for more details.
A comparison can be run programmatically, using Java or .NET APIs or, alternatively, it can be user-driven via the command-line (see the Command-Line ReadMe), an oXygen plugin (after installation of the DeltaXML oXygen Adaptor) or a simple graphical user-interface (GUI). Note that the GUI is designed to help demonstrate some of the built-in capabilities of XML Compare, but it is not intended as a standalone productivity tool.
It is also possible to invoke a further nested comparison from within an XSLT filter using the compare() XSLT extension function; this is described in the Java API documentation.
Since XML Compare uses XML to represent changes, an API and Pipeline Configuration architecture allows standard XML technologies such as XSLT to be applied. Complex information pipelines can therefore be built from a set of simple, proven components.
A set of samples is included with XML Compare; these include working code and documentation for a number of customized comparison scenarios.
When a comparison is invoked via the recommended com.deltaxml.cores9api API, you have the choice of two comparator classes: the Pipelined Comparator and the Document Comparator, described below. Note that when the GUI or command-line processor is used to start a comparison, the standard pipelined comparator class, PipelinedComparatorS9, performs the comparison.
Pipelined Comparator: implemented via the PipelinedComparatorS9 class, this provides a very flexible form of comparison, best suited to cases where the input XML is not always document-based or where you require low-level control of the processing pipeline. Except for restrictions associated with lexical preservation filters, input and output filters can be added to the processing pipeline at any point.
Document Comparator: implemented through the DocumentComparator class, this has a pipeline specially optimized for document comparison. Figure 5 shows a simplified representation of this pipeline. Explicit extension points are available on the pipeline so that new filter-steps or chains can be inserted in a managed way.
Figure 4. Filter steps or chains can be applied to specific extension points of the Document Comparator
The Pipelined Comparator allows comparisons to be optimized for particular types of data or document structure. It also allows customization of the way detected differences are represented in the output. The pipeline for a Pipelined Comparator is defined using a set of filters managed in FilterChain objects, which can be added at either comparator input ('A' or 'B') or at the comparator output.
The guide Specifying a Comparison Pipeline provides an overview of how pipelines can be defined with the Pipelined Comparator, specifically through the use of Java, C# or an XML pipeline descriptor file format called DXP.
More details on the use of DXP can be found in the document Pipeline Configuration using DXP.
The Document Comparator differs from the Pipelined Comparator in that key parts of the pipeline are pre-defined with specialist document comparison features; this pipeline is modified by adding filters at certain named 'extension points'.
As in the Pipelined Comparator, filters are managed as FilterChain objects in Java or C#. These are added to the pipeline using the DocumentComparator's setExtensionPoint method. An alternative way to configure a Document Comparator is to use a Document Comparator Pipelines configuration file (DCP).
A lower-level method for creating pipelines (now regarded as legacy but still useful for advanced users) is also available for Java developers; this exploits JAXP interfaces. JAXP Pipeline Examples introduces a set of examples available for download, and the paper Powering Pipelines with JAXP provides further details on using JAXP.
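To give a feel for what a JAXP-based filter pipeline looks like, the following is a minimal sketch that uses only the standard javax.xml.transform interfaces, not the XML Compare API. It chains XSLT filters so that each one streams its output into the next. The class name JaxpFilterChain, the identityPlus helper and the two example filters (dropping a 'revision' attribute and normalizing text whitespace) are assumptions made for illustration; they are not taken from the XML Compare samples.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class JaxpFilterChain {

    // Hypothetical helper: builds an XSLT 1.0 identity transform plus one extra template.
    static String identityPlus(String extraTemplate) {
        return "<xsl:stylesheet version='1.0'"
             + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
             + "<xsl:template match='@*|node()'>"
             + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
             + "</xsl:template>"
             + extraTemplate
             + "</xsl:stylesheet>";
    }

    // Runs the XML through the stylesheets in order; at least one stylesheet is required.
    static String run(String xml, String... stylesheets) throws Exception {
        SAXTransformerFactory stf =
            (SAXTransformerFactory) TransformerFactory.newInstance();

        // One SAX handler per filter.
        TransformerHandler[] handlers = new TransformerHandler[stylesheets.length];
        for (int i = 0; i < stylesheets.length; i++) {
            handlers[i] = stf.newTransformerHandler(
                new StreamSource(new StringReader(stylesheets[i])));
        }
        // Each filter sends its SAX events to the next filter in the chain.
        for (int i = 0; i < handlers.length - 1; i++) {
            handlers[i].setResult(new SAXResult(handlers[i + 1]));
        }
        // The last filter serializes to a string.
        StringWriter out = new StringWriter();
        TransformerHandler last = handlers[handlers.length - 1];
        last.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        last.setResult(new StreamResult(out));

        // An identity transformer parses the input and feeds the first filter.
        Transformer feeder = stf.newTransformer();
        feeder.transform(new StreamSource(new StringReader(xml)),
                         new SAXResult(handlers[0]));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String dropRevision = identityPlus("<xsl:template match='@revision'/>");
        String normalizeText = identityPlus(
            "<xsl:template match='text()' priority='1'>"
          + "<xsl:value-of select='normalize-space(.)'/></xsl:template>");
        String input = "<doc revision='7'><p>  hello   world </p></doc>";
        System.out.println(run(input, dropRevision, normalizeText));
    }
}
```

Because the handlers are linked by SAX events, no intermediate document is serialized between filters, which is the main attraction of this style for longer chains.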
When there is a need to diagnose stages in a pipeline, a debugFiles mode is available in which the inputs and outputs of each filter are written to separate files; a file naming convention is used to indicate where each 'debug file' fits into the pipeline. The debugFiles mode is set either by a setDebugFiles method call or with a Configuration Property (see Configuration Properties) in a DeltaXML Configuration file named 'deltaXMLConfig.xml'. Sample XML for setting this property is shown below:
<!DOCTYPE deltaxmlConfig SYSTEM "deltaxml-config.dtd">
<deltaxmlConfig>
  <configProperty name="com.deltaxml.cores9api.DocumentComparator.debugFiles" value="true" />
  <configProperty name="com.deltaxml.cores9api.PipelinedComparatorS9.debugFiles" value="true" />
</deltaxmlConfig>
Low-level XML Compare functionality is configured using different methods according to how the functionality is implemented. These different methods are summarized below:
Table 2. Configuration Summary
[a] DocType is affected by the LexicalPreservation configuration property.
[b] The preferred method for setting LexicalPreservation is via the API.
Configuration Properties are used to control certain properties of a comparison operation that may have a wider scope than standard features and properties. More details can be found in the Configuration Properties guide.
Features and properties are managed using the API or a DXP/DCP definition. The Features and Properties document describes the features and properties available.
Features for the Apache Xerces parser can be set either from the API or from a DXP/DCP configuration. A DXP example can be found in the sample XInclude and XML Compare.
Output properties control the serializer of XML Compare's internal Saxon processor. They are set from the API or using DXP or DCP. An example of how DocType and indentation are set using DXP can be found in the Pipeline Configuration using DXP document.
XML document types such as DITA, DocBook and XHTML share a common set of features such as: inline formatting, tables, ordered/orderless lists and linked resources. To help achieve simple and accurate difference reports when comparing documents, each element supporting these features can be processed in a special way, both at comparison time and when the result is output.
For optimized processing of document-centric features, two approaches are recommended. The first approach is to exploit built-in features in XML Compare's document comparator augmented with custom XSLT filters where required. The second, more complex approach, is to use the pipelined comparator with a specially configured pipeline exploiting a set of custom XSLT filters.
Most of the features outlined in this section are incorporated into the document comparator. However, links are included to samples for cases where you wish to customize your own pipelined comparator; these samples also provide some useful insight into how the capabilities built into the document comparator actually work. Not all features are enabled by default in the document comparator.
Normally text comparisons are case-sensitive, but there are certain contexts where case should be ignored. The Case Insensitive Comparison sample shows how this can be done. Also, comparison of text within each element of a document can be performed at different levels. Three levels are considered for XML Compare, as outlined below:
Text-node Level - if the contents of a text-node changes the whole node is marked as a change (noting that a mixed-content element may contain more than one text node).
Word by Word - allows differences in content to be resolved down to specific words, whereas differences are normally shown at the element level. The Word by Word Text Comparison tutorial introduces this concept.
Character by Character - a further refinement to Word by Word comparison, where differences within words are marked. This is described further in the Character by Character Comparison tutorial.
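The idea behind word-level comparison can be sketched with a classic longest-common-subsequence (LCS) alignment over whitespace-separated tokens. This is an illustration of the general technique only, not XML Compare's own algorithm; the class name WordDiff and the 'A', 'B' and 'A=B' labels on each output entry are assumptions chosen to echo the terminology of this guide.

```java
import java.util.ArrayList;
import java.util.List;

public class WordDiff {

    // Splits text into words on runs of whitespace; empty input gives no words.
    static String[] words(String text) {
        String t = text.trim();
        return t.isEmpty() ? new String[0] : t.split("\\s+");
    }

    // Returns one entry per word: "A=B w" (unchanged), "A w" (only in A), "B w" (only in B).
    static List<String> diff(String a, String b) {
        String[] x = words(a);
        String[] y = words(b);

        // lcs[i][j] = length of the LCS of x[i..] and y[j..]
        int[][] lcs = new int[x.length + 1][y.length + 1];
        for (int i = x.length - 1; i >= 0; i--) {
            for (int j = y.length - 1; j >= 0; j--) {
                lcs[i][j] = x[i].equals(y[j])
                    ? lcs[i + 1][j + 1] + 1
                    : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }

        // Walk the table from the start, emitting matches and one-sided words.
        List<String> out = new ArrayList<String>();
        int i = 0, j = 0;
        while (i < x.length && j < y.length) {
            if (x[i].equals(y[j])) {
                out.add("A=B " + x[i]);
                i++;
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                out.add("A " + x[i++]);
            } else {
                out.add("B " + y[j++]);
            }
        }
        while (i < x.length) out.add("A " + x[i++]);
        while (j < y.length) out.add("B " + y[j++]);
        return out;
    }

    public static void main(String[] args) {
        for (String line : diff("the quick brown fox", "the slow brown fox")) {
            System.out.println(line);
        }
    }
}
```

Applying the same alignment to the characters of a changed word pair, rather than to words, gives the character-by-character refinement.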
Lexical preservation retains content that is often lost when processing XML: XML comments, XML processing-instructions, CDATA sections, DOCTYPE declarations and entity references. The features supporting this in XML Compare are described in the Lexical Preservation reference. For further help on the use of custom lexical preservation filters, there are also the tutorials How to Preserve Processing Instructions and Comments and How to Preserve Doctype Information.
Whitespace-only nodes found in an XML document should be treated differently depending on whether they are a significant part of content (as in mixed content) or simply used for formatting the XML source. The technique for this is described in the Managing White Space tutorial.
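The distinction between significant and formatting whitespace can be illustrated with a small DOM sketch. It uses a simple heuristic, which is an assumption of this example and not the technique from the tutorial: an element counts as mixed content if any of its text children contains non-whitespace characters, and whitespace-only text nodes are removed only from elements that are not mixed. Without a DTD or schema this is a guess about the content model. The class name WhitespaceSketch is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class WhitespaceSketch {

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        return dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    // Removes whitespace-only text nodes from elements that hold no real text.
    static void stripFormattingWhitespace(Element e) {
        // Heuristic: mixed content means at least one non-blank text child.
        boolean mixed = false;
        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE
                    && !c.getNodeValue().trim().isEmpty()) {
                mixed = true;
            }
        }
        Node c = e.getFirstChild();
        while (c != null) {
            Node next = c.getNextSibling(); // remember before any removal
            if (c.getNodeType() == Node.ELEMENT_NODE) {
                stripFormattingWhitespace((Element) c);
            } else if (!mixed && c.getNodeType() == Node.TEXT_NODE) {
                // Not mixed, so every text child here is whitespace-only.
                e.removeChild(c);
            }
            c = next;
        }
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<doc>\n  <p>Hello <b>big</b> <i>world</i></p>\n</doc>");
        Element root = doc.getDocumentElement();
        stripFormattingWhitespace(root);
        System.out.println("children of doc: " + root.getChildNodes().getLength());
        System.out.println("children of p: "
            + root.getFirstChild().getChildNodes().getLength());
    }
}
```

In the sample input, the indentation around the p element is dropped, while the single space between the b and i elements is kept because p contains real text.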
Complications arise when comparing tables where the structure has changed, for example when a column has been inserted or removed. The class of the XML Compare API has ProcessHTMLTables boolean properties (with get/set methods for Java) that, when set, will manage table comparison so that the result remains valid.
Some document elements have unique content, such as id attributes, that can be highlighted for the comparator by adding a special key attribute. Keys are particularly useful for matching 'orderless' elements, but can also be of value for ordered elements, with some additional processing to handle moves (see Detecting and Handling Moves). More information can be found in the following samples and
Comparison, Mixed Ordered and Orderless
Data and Comparing Orderless
For elements whose main purpose is to link to other resources such as images or other documents, results can be improved if special processing is applied. Filters can be included in the processing pipeline to handle such cases. The Image and Binary Comparison sample shows how such link elements can be processed using an XSLT filter that exploits a Java extension function for binary file comparison. This sample could be adapted to suit cases where the link target is a text or XML resource.
The document comparator can be configured (by modifying a simple XSLT identity transform) to recognize and process elements used predominantly for inline formatting. This allows content-based element alignment and supports overlaps in the formatted-text range between compared versions. Such formatting differences are represented using extensions introduced in version 2.1 of DeltaV2 and described in the document Overlapping Hierarchies in DeltaV2. Formatting differences can be rendered or styled independently from structural changes according to need.
A practical example of formatting element processing is included in the Formatting Element Changes sample.
For more data-centric XML resources, the comparison pipeline may have a number of design considerations and priorities different from those for comparing document-centric resources (as described in the previous section). This section outlines comparison features that are more significant in this context, but of course, many features described in the Document Comparison section above may also apply.
For comparison of floating point numbers, there may be a requirement to ignore value differences within a specified tolerance. This tolerance can be implemented via output filters based on existing filter resources included in XML Compare; Numeric Tolerances is a worked example of this.
When comparing large datasets, there are some extra factors to consider. These are covered in the Comparing Large Files guide.
For cases where changes in data are expected but not deemed significant, changes can be 'ignored' in the processing pipeline. A technique for this is explained in the sample Ignoring Changes.
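The core test behind a numeric tolerance is small enough to sketch directly. The following example shows only the comparison rule, not the output filters used in the Numeric Tolerances sample. The class name NumericTolerance, the use of an absolute rather than relative tolerance, and the fallback to exact string equality for non-numeric values are all assumptions of this sketch.

```java
public class NumericTolerance {

    // True when both values are numeric and differ by no more than the tolerance.
    // Non-numeric values fall back to exact string comparison.
    static boolean withinTolerance(String a, String b, double tolerance) {
        try {
            double x = Double.parseDouble(a.trim());
            double y = Double.parseDouble(b.trim());
            return Math.abs(x - y) <= tolerance;
        } catch (NumberFormatException e) {
            return a.equals(b);
        }
    }

    public static void main(String[] args) {
        System.out.println(withinTolerance("1.000", "1.0004", 0.001));
        System.out.println(withinTolerance("1.0", "1.1", 0.001));
    }
}
```

A filter built on this rule would treat a modified value as unchanged whenever the test returns true, so that only differences outside the tolerance are reported.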
While other methods are provided (such as the command line), XML Compare is designed primarily to be controlled through its API. This runs natively on the Java 1.7 platform but there is also a .NET API wrapper for easy integration with the .NET framework.
Certain parts of the API allow for integration with an external (Saxonica Ltd) Saxon
XSLT/XQuery processor, for example overloads of the
provided by the comparator APIs take Saxon XdmNode instances as arguments. To minimize
potential version conflicts with XML Compare's internal processor, XML Compare (versions 8.2 and
later) exploits a 'compatibility layer' supporting Saxon versions 9.7 and 9.8.
XML Compare uses a custom version of the Apache Commons OASIS catalog resolver by default, although this can be changed. Further details are in the guide Using a Catalog Resolver.
Systems often need to self-monitor or provide progress feedback to an end-user for operations that have the potential to take a noticeable amount of time. The XML Compare API has provision for adding progress listeners via a ProgressListener interface, allowing a comparison to be monitored through each significant processing stage in the pipeline configuration.
The direct output from XML Compare is the 'Delta'; this is the base XML output for both the Pipelined Comparator and the Document Comparator. By default the Delta includes all content, including unchanged content, but there is also an option for a 'patch' output where only the changes are included. Other output format options are also available and are described in this section; these are essentially transforms of the original Delta.
The Delta is the XML output direct from the XML Compare comparator which uses the DeltaV2 format to mark up changes. This format is designed to be compact whilst also making code that processes it clean and efficient. Version 2.0 of the DeltaV2 format is used by default, but if the Document Comparator is used with marked up formatting elements, then version 2.1 is used. Version 2.1 is a superset of 2.0 with extensions to represent overlapping XML hierarchies.
At its simplest, the DeltaV2 format is a representation of the 'A' and 'B' documents in a single document. For this, deltaxml:deltaV2 attributes (in the DeltaXML namespace) are added to all elements where differences are found. The attribute may hold one of the following values: A, B, A=B or A!=B. Each letter identifies the document source, and the = or != separator indicates whether the matching source elements are the same or different. Extra elements in the DeltaXML namespace are used to represent modified text or attribute nodes. The DeltaV2 format is defined in full in the DeltaV2 reference; the extensions added in version 2.1 are described in more detail in the reference Overlapping Hierarchies in DeltaV2.
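Because the Delta is ordinary XML, it can be inspected with the standard Java XML APIs. The sketch below counts how many elements carry each deltaxml:deltaV2 value in a small hand-written delta. The namespace URI bound to the deltaxml prefix and the sample document are assumptions of this example and should be checked against the DeltaV2 reference and real comparator output. The class name DeltaTally is hypothetical.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DeltaTally {

    // Assumed namespace URI for the deltaxml prefix; verify against the DeltaV2 reference.
    static final String DELTAXML_NS = "http://www.deltaxml.com/ns/well-formed-delta-v1";

    // Hand-written sample, not real comparator output.
    static final String SAMPLE =
        "<doc xmlns:deltaxml='" + DELTAXML_NS + "' deltaxml:deltaV2='A!=B'>"
      + "<p deltaxml:deltaV2='A=B'>unchanged</p>"
      + "<p deltaxml:deltaV2='A'>only in A</p>"
      + "<p deltaxml:deltaV2='B'>only in B</p>"
      + "</doc>";

    // Counts elements per deltaV2 attribute value.
    static Map<String, Integer> tally(String deltaXml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // required for namespace-qualified attribute lookup
        Document doc = dbf.newDocumentBuilder()
            .parse(new InputSource(new StringReader(deltaXml)));

        Map<String, Integer> counts = new TreeMap<String, Integer>();
        NodeList all = doc.getElementsByTagNameNS("*", "*");
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (e.hasAttributeNS(DELTAXML_NS, "deltaV2")) {
                String value = e.getAttributeNS(DELTAXML_NS, "deltaV2");
                Integer old = counts.get(value);
                counts.put(value, old == null ? 1 : old + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        for (Map.Entry<String, Integer> entry : tally(SAMPLE).entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

The same loop could be narrowed to report only elements marked A, B or A!=B, which gives a quick summary of where two documents differ.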
This section describes output format filters included with the XML Compare distribution. These are used to transform the Delta output within the comparison pipeline (Pipelined Comparator or Document Comparator) immediately prior to serialization.
In the Pipelined Comparator, the HTML for this view can be generated using a built-in DXP configuration, which is invoked from the command-line or GUI using the diffreport-sbs configuration id. Alternatively, it can be generated from the XML Compare API with the dx2-side-by-side.xsl stylesheet added as the final output filter.
For the Document Comparator, the DCP equivalent configuration doc-diffreport-sbs must be used. Or, if using the API, the XSLT filter dx2-side-by-side.xsl should be added as a filter-step to the OUTPUT_FINAL extension point, as shown in the following Java code:
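// Create the comparator and a helper for building filter steps
DocumentComparator dcr = new DocumentComparator();
FilterStepHelper fsh = dcr.newFilterStepHelper();

// Build a chain containing the side-by-side stylesheet
FilterChain fChain = fsh.newFilterChain();
FilterStep fsSBS = fsh.newFilterStepFromResource(
    "xsl/side-by-side/dx2-side-by-side.xsl", "side-by-side");
fChain.addStep(fsSBS);

// Attach the chain to the final output extension point
dcr.setExtensionPoint(ExtensionPoint.OUTPUT_FINAL, fChain);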
The color of the rendered XML indicates the type of change (blue, green and red for 'modified', 'added' and 'deleted' respectively). The view of each element node may be folded or unfolded by pressing the icon immediately to the left of the start tag. A simple toolbar and differences list allow for easier navigation of changes in large documents.
With the Pipelined Comparator, the HTML for the folding view is generated using a built-in DXP configuration, which can be invoked either from the GUI or from the command-line, as with the side-by-side view, but now using the diffreport configuration id.
For the Document Comparator, the folding view can be created from the
command-line or GUI using the DCP configuration id
Alternatively, the associated XSLT stylesheet can be added as a filter-step to the final output extension point. This is illustrated in the DCPdiffReport sample.
The XML Compare comparators may be configured to output either a full context delta (the default) or a changes-only delta. When the pipelined comparator (but not the document comparator) is used, the changes-only format may be used to recreate document B from document A; this could be useful in version control systems and similar scenarios. A worked example of this is Using Deltas for XML Versioning (diff and patch).
Many XML editors support a tracked changes feature incorporated into an Author Mode with a WYSIWYG view; the output from XML Compare can be represented as tracked changes in supported tools. The main benefit is that detected changes can be more easily accepted or rejected, and further edits made, within the chosen editor. The Document Comparator API provides a setResultFormat method of the OutputFormatConfiguration object to produce output conforming to the tracked changes format for the following XML editors: