XML Compare provides a powerful solution to identify and process the differences between any two XML files that share the same root element. Its primary use is as a toolkit for integration into other systems or applications via comprehensive APIs, but it may also be run standalone from the command-line or a simple GUI.
This user guide introduces you to the XML Compare product, providing a high-level product description along with a look at the main features and concepts associated with this product. You can find more detailed information on the subjects covered here by following the links to an extensive set of tutorials, samples and papers that complement this product.
The Getting Started page provides a quick start for the product as well as a description of all dependencies. The Samples and Guides page gives a summary of all the included samples. An overview of the features in the Document Comparator can be found in the Document Comparator Guide. Comprehensive technical implementation information can be found in the Java and .NET API documentation. See also the REST documentation.
The XML Compare API provides a high level of extensibility
The XML Compare approach is unique in that:
- Changes are recorded in an XML 'delta file'.
- The delta file has the same look and feel as the original files.
- The delta file can include changes only or changes plus unchanged data.
- The delta file is easy to understand and to process because it is an XML file.
- The delta file can therefore be processed with standard XML tools.
- Comparison can be customized by defining/extending filter pipelines.
2. XML Comparison Features
Two input files are used for an XML Compare comparison, referred to here as 'A' and 'B' files. Whilst it is often the case that 'B' is a modification of 'A', it is also possible that both inputs are derived independently from a common source. Using this terminology, a user-oriented set of high-level features is outlined below:
2.1. General Features
- Find all the differences between any two XML files ('A' and 'B').
- Apply changes to convert an 'A' XML file into the 'B' version (i.e. a diff 'patch').
- Undo changes to convert a 'B' XML file into the 'A' version.
- Display change information in either XML or HTML form, using a standard web browser.
- Report changes only or changes+unchanged data.
- Use XSLT input and output filters to pre- and post-process the XML data.
- Handle large files without performance degradation.
2.2. Document Comparison Features and Benefits
- Extensible pipeline with embedded functionality.
- Text processing for differences on a word-by-word basis.
- Special processing for formatting-elements.
- HTML/CALS table structure aware.
- Extension points for adding filter steps to the pipeline.
- Ignoring changes to non-significant whitespace.
- Handle moves of uniquely identifiable elements. See Detecting and Handling Moves for more details.
2.3. Multi-Document Comparisons
3. Running a Comparison
XML Compare runs locally on your own hardware and allows you to quickly embed XML comparison functionality into your own systems. It can be run in a variety of ways, with the range of options determined by the version of the product downloaded.
3.1. Comparison options for XML Compare Downloads
The .NET release is considered deprecated as of version 10.0.0 and will be removed in a future release. Please contact DeltaXML if you need help with migrating to our Java or REST APIs.
The XML Compare Download Page provides a choice of three possible downloads: Java (UNIX & Windows), macOS, and .NET. Unless you want to use the .NET API, select the download option that matches your target operating system. Each download option includes the required Java JAR files (e.g. deltaxml-x.y.z.jar, where x.y.z is the major.minor.patch version number of your release, as in deltaxml-10.0.0.jar) and support resources. If you're developing for .NET (on Windows), select the .NET download option; note that this version does not include the GUI available in other downloads. Note: A licence file is required to run XML Compare; see the Licensing User Guide for more details.
XML Compare can be invoked using a choice of interfaces (simplified view)
A comparison can be run programmatically, using Java, REST or .NET APIs. Alternatively, it can be user-driven via the command-line (see the Command-Line page), an Oxygen plugin (after installation of the DeltaXML Oxygen Adaptor) or a simple graphical user-interface (GUI). Note that the GUI is designed to help demonstrate some of the built-in capabilities of XML Compare, but it is not intended as a standalone productivity tool.
It is also possible to invoke a further nested comparison from within an XSLT filter using a provided compare() XSLT extension function. This is described in the Java API documentation and .NET API documentation.
4. Customising a Comparison
Since XML Compare uses XML to represent changes, an API and Pipeline Configuration architecture allows standard XML technologies such as XSLT to be applied. Complex information pipelines can therefore be built from a set of simple, proven components.
Configuration of a typical custom comparison pipeline
4.1. Samples of Customized Comparisons
A set of samples is included with XML Compare; these include working code and documentation for a number of customized comparison scenarios.
4.2. Choosing the Comparator
When a comparison is invoked via the recommended com.deltaxml.cores9api API, you have the choice of two comparator classes:
When invoking a comparison through the graphical interface (GUI) or command-line interface (CLI), the comparator class used will depend on whether a DCP file ID (for DocumentComparator), or DXP file ID (for PipelinedComparatorS9) is used.
4.2.1. Pipelined Comparator
Implemented via the PipelinedComparatorS9 Java class, this provides a very flexible form of comparison, best suited to cases where the input XML is not always document based or where you require low-level control of the processing pipeline. Except for restrictions associated with lexical preservation filters, input and output filters can be added to the processing pipeline at any point.
4.2.2. Document Comparator
Implemented through the DocumentComparator Java class, this has a pipeline specially optimized for document comparison. The figure below shows a simplified representation of this pipeline. Explicit extension points are available on the pipeline so that new filter-steps or chains can be inserted in a managed way.
Filter steps or chains can be applied to specific extension points of the Document Comparator
4.3. Defining Pipelines
4.3.1. Pipelined Comparator
The Pipelined Comparator allows comparisons to be optimized for particular types of data or document structure. It also allows customisation of the way detected differences are represented in the output. The pipeline for a Pipelined Comparator is defined using a set of filters managed in FilterChain objects that can be added at both comparator inputs ('A' and 'B') or the comparator output.
The guide Specifying a Comparison Pipeline provides an overview of how pipelines can be defined with the Pipelined Comparator, specifically through the use of Java, C# or an XML pipeline descriptor file format called DXP.
More details on the use of DXP can be found in the document Pipeline Configuration using DXP.
4.3.2. Document Comparator
The Document Comparator differs from the Pipelined Comparator in that key parts of the pipeline are pre-defined with specialist document comparison features; this pipeline is modified by adding filters at certain named 'extension points'.
As in the Pipelined Comparator, filters are managed as FilterChain objects in Java or C#. These are added to the pipeline using the DocumentComparator's setExtensionPoint method. An alternative way to configure a Document Comparator is to use a Document Comparator Pipelines configuration file (DCP).
4.3.3. JAXP Pipeline Comparator (legacy)
A lower-level method for creating pipelines (now regarded as legacy but still useful for advanced users) is also available for Java developers; this exploits JAXP interfaces. JAXP Pipeline Examples introduces you to a set of examples available for download, and the paper Powering Pipelines with JAXP provides further details on using JAXP.
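As a rough illustration of the JAXP style of pipeline, the sketch below chains two XSLT filters using only the standard javax.xml.transform interfaces. It does not use any XML Compare classes and performs no comparison. The class name JaxpFilterChain, the two stylesheets and the rev attribute are hypothetical, chosen only to show how filter stages can be connected.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Result;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, not part of the XML Compare API.
final class JaxpFilterChain {
    // Identity template shared by both illustrative filters.
    private static final String IDENTITY =
        "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>";

    // Filter 1 (hypothetical): drop every 'rev' attribute.
    static final String DROP_REV = wrap("<xsl:template match='@rev'/>");

    // Filter 2 (hypothetical): normalize whitespace in text nodes.
    static final String NORMALIZE_TEXT = wrap(
        "<xsl:template match='text()' priority='1'>"
      + "<xsl:value-of select='normalize-space(.)'/></xsl:template>");

    private static String wrap(String extraTemplates) {
        return "<xsl:stylesheet version='1.0' "
             + "xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
             + IDENTITY + extraTemplates + "</xsl:stylesheet>";
    }

    /** Runs the XML through the stylesheets in order and returns the result. */
    static String run(String xml, String... stylesheets) {
        try {
            SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
            StringWriter out = new StringWriter();
            Result next = new StreamResult(out);
            // Build the chain back to front so each stage feeds the next one.
            for (int i = stylesheets.length - 1; i >= 0; i--) {
                TransformerHandler stage = factory.newTransformerHandler(
                    new StreamSource(new StringReader(stylesheets[i])));
                stage.getTransformer()
                     .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
                stage.setResult(next);
                next = new SAXResult(stage);
            }
            // An identity transformer parses the input and pushes SAX events
            // into the first stage of the chain.
            Transformer feeder = factory.newTransformer();
            feeder.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            feeder.transform(new StreamSource(new StringReader(xml)), next);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Pipeline failed", e);
        }
    }
}
```

Calling JaxpFilterChain.run(xml, JaxpFilterChain.DROP_REV, JaxpFilterChain.NORMALIZE_TEXT) applies the two filters in sequence. In a real comparison pipeline the comparator would sit between the input and output chains; that part is deliberately left out here.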
4.3.4. Pipeline Diagnostics
When there is a need to diagnose stages in a pipeline, a debugFiles mode is available in which the inputs and outputs of each filter are written to separate files; a file naming convention indicates where each 'debug file' fits into the pipeline. The debugFiles mode is set either by the setDebugFiles method call or with a Configuration Property (see Configuration Properties) in a DeltaXML Configuration file named 'deltaXMLConfig.xml'. Sample XML for setting this property is shown below:
Low-level XML Compare functionality is configured using different methods according to how the functionality is implemented. These different methods are summarized below:
4.4.1. Configuration Summary
| Config Properties | Comparator Features & Properties | Parser Features | Output Properties |
| Diagnostics Settings | DeltaV Format | Configure XInclude | Indentation |
| Catalog Settings | Matching Algorithm | JAXP/SAX Features | |
^ DocType is affected by the LexicalPreservation configuration property.
4.4.2. Configuration Properties
Configuration Properties are used to control certain properties of a comparison operation that may have a wider scope than standard features and properties. More details can be found in the Configuration Properties guide.
4.4.3. Comparator Features and Properties
Features and properties are managed using the API or a DXP/DCP definition. The Features and Properties document describes the features and properties available.
4.4.4. Parser Features
Features for the Apache Xerces parser can be set either from the API or from a DXP/DCP configuration. A DXP example can be found in the sample XInclude and XML Compare.
4.4.5. Output Properties
Output properties control the serializer of XML Compare's internal Saxon processor. They are set from the API or using DXP or DCP. An example of how DocType and indentation are set using DXP can be found in the Pipeline Configuration using DXP document.
5. Document Comparison
XML document types such as DITA, DocBook and XHTML share a common set of features such as: inline formatting, tables, ordered/orderless lists and linked resources. To help achieve simple and accurate difference reports when comparing documents, each element supporting these features can be processed in a special way, both at comparison time and when the result is output.
Many document features can benefit from special processing.
For optimized processing of document-centric features, two approaches are recommended. The first approach is to exploit built-in features in XML Compare's document comparator augmented with custom XSLT filters where required. The second, more complex approach, is to use the pipelined comparator with a specially configured pipeline exploiting a set of custom XSLT filters.
Most of the features outlined in this section are incorporated into the document comparator. However, links are included to samples for cases where you wish to customize your own pipelined comparator; these samples also provide some useful insight into how the capabilities that are built into the document comparator actually work. Not all features are enabled by default in the document comparator.
5.1. Text Comparison
Normally text comparisons are case-sensitive, but there are certain contexts where case should be ignored; the Case Insensitive Comparison sample shows how this can be done. Also, comparison of text within each element of a document can be performed at different levels. Three levels are considered for XML Compare, as outlined below:
- Text-node Level: if the contents of a text node change, the whole node is marked as a change (noting that a mixed-content element may contain more than one text node).
- Word by Word: allows differences in content to be resolved down to specific words, where normally differences are shown at the element level. The Word by Word Text Comparison tutorial introduces you to this concept.
- Character by Character: a further refinement to Word by Word comparison, where differences within words are marked. This is described further in the Character by Character Comparison tutorial.
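To make the word-by-word level more concrete, here is a minimal sketch that splits two strings into words and aligns them using a longest-common-subsequence table. This is a generic textbook technique used for illustration only; it is not taken from XML Compare and makes no claim about how the product aligns words. The WordDiff class and its output markers are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not part of the XML Compare API.
final class WordDiff {
    /**
     * Returns one entry per word: "  w" if the word is in both texts,
     * "- w" if it is only in text A, "+ w" if it is only in text B.
     * Inputs are assumed to contain at least one word each.
     */
    static List<String> diff(String textA, String textB) {
        String[] a = textA.trim().split("\\s+");
        String[] b = textB.trim().split("\\s+");

        // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
        int[][] lcs = new int[a.length + 1][b.length + 1];
        for (int i = a.length - 1; i >= 0; i--) {
            for (int j = b.length - 1; j >= 0; j--) {
                lcs[i][j] = a[i].equals(b[j])
                    ? lcs[i + 1][j + 1] + 1
                    : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }

        // Walk the table from the start, emitting matches and differences.
        List<String> result = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            if (a[i].equals(b[j])) {
                result.add("  " + a[i]);
                i++;
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                result.add("- " + a[i++]);
            } else {
                result.add("+ " + b[j++]);
            }
        }
        while (i < a.length) result.add("- " + a[i++]);
        while (j < b.length) result.add("+ " + b[j++]);
        return result;
    }
}
```

Applying the same alignment to the characters of a changed word pair, rather than to whole words, gives the idea behind the character-by-character level.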
5.2. Lexical Preservation
Lexical preservation retains content that is often lost when processing XML: XML comments, XML processing instructions, CDATA sections, DOCTYPE declarations and entity references. The features supporting this in XML Compare are described in the Lexical Preservation reference. For further help on the use of custom lexical preservation filters, there are also the tutorials: How to Preserve Processing Instructions and Comments and How to Preserve Doctype Information.
5.3. Whitespace Management
Whitespace-only nodes found in an XML document should be treated differently depending on whether they are a significant part of content (as in mixed content) or simply used for formatting the XML source. The technique for this is described in the Managing White Space tutorial.
5.4. Table Comparison
Complications arise when comparing tables where the structure has changed, for example, when a column has been inserted or removed. The DocumentComparator class of the XML Compare API has ProcessHTMLTables boolean properties (with get/set methods for Java) that, when set, will manage table comparison so that the result remains valid.
5.5. Key-assisted Matching
Some document elements have unique content, such as id attributes, that can be highlighted for the comparator by adding a special key attribute. Keys are particularly useful for matching 'orderless' elements, but can also be of value for ordered elements, with some additional processing to handle moves (see Detecting and Handling Moves). More information can be found in the following samples and guides: Ordered Comparison, Mixed Ordered and Orderless Data and Comparing Orderless Elements.
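The idea behind key-assisted matching of orderless elements can be sketched without any XML Compare classes: items are paired by key rather than by position, so reordering alone produces no differences. The KeyedMatch class below, its {key, content} input shape and its status labels are hypothetical and exist only for this illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not part of the XML Compare API.
final class KeyedMatch {
    /**
     * Each input row is {key, content}. Rows are matched by key, ignoring
     * their order, and each key is classified as "unchanged", "modified",
     * "deleted" (only in A) or "added" (only in B).
     */
    static Map<String, String> compare(String[][] itemsA, String[][] itemsB) {
        Map<String, String> a = index(itemsA);
        Map<String, String> b = index(itemsB);
        Map<String, String> status = new TreeMap<>();

        for (Map.Entry<String, String> entry : a.entrySet()) {
            String other = b.get(entry.getKey());
            if (other == null) {
                status.put(entry.getKey(), "deleted");
            } else {
                status.put(entry.getKey(),
                    other.equals(entry.getValue()) ? "unchanged" : "modified");
            }
        }
        for (String key : b.keySet()) {
            if (!a.containsKey(key)) {
                status.put(key, "added");
            }
        }
        return status;
    }

    // Builds a key -> content lookup; keys are assumed to be unique.
    private static Map<String, String> index(String[][] items) {
        Map<String, String> byKey = new HashMap<>();
        for (String[] item : items) {
            byKey.put(item[0], item[1]);
        }
        return byKey;
    }
}
```

In this sketch, an item that appears at a different position in 'B' but has the same key and content is reported as unchanged, which is the behaviour wanted for orderless data. Detecting moves in ordered data needs extra processing that is not shown here.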
5.6. Linked Resources
For elements whose main purpose is to link to other resources such as images or other documents, results can be improved if special processing is applied. Filters can be included in the processing pipeline to handle such cases. The Image and Binary Comparison sample shows how such link elements can be processed using an XSLT filter that exploits a Java extension function for binary file comparison. This sample could be adapted to suit cases where the link target is a text or XML resource.
5.7. Formatting Elements
The document comparator can be configured (by modifying a simple XSLT identity transform) to recognize and process elements used predominantly for inline formatting. This allows content-based element alignment and supports overlaps in the formatted-text range between compared versions. Such formatting differences are represented using extensions introduced in version 2.1 of DeltaV2 and described in the document Two and Three Document DeltaV2 Format. Formatting differences can be rendered or styled independently from structural changes according to need.
A practical example of formatting element processing is included in the Formatting Element Changes sample.
Overlaps in formatting in different versions are detected and recorded in the DeltaV2.
6. Data Comparison
For more data-centric XML resources, the comparison pipeline may have a number of design considerations and priorities different from those for comparing document-centric resources (as described in the previous section). This section outlines comparison features that are more significant in this context, but of course, many features described in the Document Comparison section above may also apply.
6.1. Numeric Tolerances
For comparison of floating point numbers there may be a requirement to ignore value differences within a specified tolerance. This tolerance can be implemented via output filters based on existing filter resources included in XML Compare; Numeric Tolerances is a worked example of this.
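The tolerance test itself is simple to state. The sketch below shows one possible form of it as a plain Java method, assuming an absolute tolerance; the NumericTolerance class is hypothetical and is not the filter used in the Numeric Tolerances sample, which is XSLT based.

```java
// Hypothetical illustration, not part of the XML Compare API.
final class NumericTolerance {
    /**
     * Treats two text values as equal when both parse as numbers and
     * differ by no more than the given absolute tolerance. Values that
     * are not numeric fall back to exact string comparison.
     */
    static boolean equalWithinTolerance(String a, String b, double tolerance) {
        try {
            double x = Double.parseDouble(a.trim());
            double y = Double.parseDouble(b.trim());
            return Math.abs(x - y) <= tolerance;
        } catch (NumberFormatException e) {
            return a.equals(b);
        }
    }
}
```

A relative tolerance (scaled by the magnitude of the values) may suit some datasets better; the choice depends on the data being compared.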
6.2. Comparing Large Datasets
When comparing large datasets there are some extra factors to consider; these are covered in the Comparing Large Files guide.
6.3. Ignoring Changes
For cases where changes in data are expected but not deemed significant, changes can be 'ignored' in the processing pipeline. A technique for this is explained in the sample: Ignoring Changes.
7. System Integration
7.1. Java, REST and .NET APIs
While other methods are provided (such as the command line), XML Compare is designed primarily to be controlled through its API. This runs natively on the Java 1.8 platform but there is also a .NET API wrapper for easy integration with the .NET framework. REST provides further flexibility allowing integration with many systems.
7.2. Saxon Compatibility
Certain parts of the API allow for integration with an external (Saxonica Ltd) Saxon XSLT/XQuery processor; for example, overloads of the compare function provided by the comparator APIs take Saxon XdmNode instances as arguments. To minimize potential version conflicts with XML Compare's internal processor, XML Compare (versions 10.0 and later) exploits a 'compatibility layer' supporting Saxon versions 9.8 and 9.9.
7.3. XML Catalog Resolving
XML Compare uses a custom version of the Apache commons OASIS catalog resolver by default; this can, however, be changed. Further details are in the guide: Using a Catalog Resolver.
7.4. Progress Listeners
Systems often have the need to self-monitor or provide progress feedback to an end-user for operations that have the potential to take a noticeable amount of time. The XML Compare API has provision for adding progress listeners via a ProgressListener interface, allowing a comparison to be monitored through each significant processing stage in the pipeline configuration.
8. Output Formats
8.1. Direct XML Compare Output
The direct output from XML Compare is the 'Delta'; this is the base XML output for both the Pipelined Comparator and the Document Comparator. By default the Delta includes all content, including unchanged content, but there's also an option for a 'patch' output where only the changes are included. Other output format options are also available and described in this section; these are essentially transforms of the original Delta.
8.1.1. The Delta
The Delta XML output from XML Compare uses the DeltaV2 format.
The Delta is the XML output direct from the XML Compare comparator which uses the DeltaV2 format to mark up changes. This format is designed to be compact whilst also making code that processes it clean and efficient. Version 2.0 of the DeltaV2 format is used by default, but if the Document Comparator is used with marked up formatting elements, then version 2.1 is used. Version 2.1 is a superset of 2.0 with extensions to represent overlapping XML hierarchies.
At its simplest, the DeltaV2 format is a representation of the 'A' and 'B' documents in a single document. For this, deltaxml:deltaV2 attributes (in the DeltaXML namespace) are added to all elements where differences are found. The deltaV2 attribute may hold one of the following values:
- A
- B
- A=B
- A!=B
Here A or B represents the document source, and the = or != separator indicates if the matching source elements are the same or different. Extra elements in the DeltaXML namespace are used to represent modified text or attribute nodes. The DeltaV2 format is defined in full in the DeltaV2 reference; the extensions added in version 2.1 are described in more detail in the reference: Overlapping Hierarchies in DeltaV2.
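Because the Delta is ordinary XML, it can be inspected with standard tools. The sketch below uses the JDK's DOM parser to count elements by their deltaV2 attribute value. Several things here are assumptions rather than details taken from this guide: the namespace URI bound to the deltaxml prefix, the attribute values used in the hand-written sample ("A", "B", "A=B", "A!=B"), and the DeltaSummary class itself. Check them against the DeltaV2 reference before relying on them.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration, not part of the XML Compare API.
final class DeltaSummary {
    // Assumed namespace URI for the 'deltaxml' prefix; verify against
    // the DeltaV2 reference.
    static final String DELTAXML_NS =
        "http://www.deltaxml.com/ns/well-formed-delta-v1";

    // Hand-written fragment for illustration, not real comparator output.
    static final String SAMPLE =
        "<doc xmlns:deltaxml='" + DELTAXML_NS + "' deltaxml:deltaV2='A!=B'>"
      + "<p deltaxml:deltaV2='A=B'>same</p>"
      + "<p deltaxml:deltaV2='A'>removed</p>"
      + "<p deltaxml:deltaV2='B'>added</p>"
      + "</doc>";

    /** Counts elements by the value of their deltaxml:deltaV2 attribute. */
    static Map<String, Integer> countByDeltaValue(String deltaXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(deltaXml)));

            Map<String, Integer> counts = new TreeMap<>();
            NodeList elements = doc.getElementsByTagName("*");
            for (int i = 0; i < elements.getLength(); i++) {
                Element element = (Element) elements.item(i);
                if (element.hasAttributeNS(DELTAXML_NS, "deltaV2")) {
                    String value = element.getAttributeNS(DELTAXML_NS, "deltaV2");
                    counts.merge(value, 1, Integer::sum);
                }
            }
            return counts;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read delta", e);
        }
    }
}
```

The same approach extends to XPath or XSLT: any namespace-aware XML tool can select on the deltaV2 attribute to find added, deleted or modified content.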
8.2. Supplementary Output Formats
This section describes output format filters included with the XML Compare distribution. These are used to transform the Delta output within the comparison pipeline (Pipelined Comparator or Document Comparator) immediately prior to serialization.
8.2.1. HTML Difference Reports
HTML5 Side-by-Side (diffreport-sbs)
The 'side-by-side' output format.
In the Pipelined Comparator, the HTML for this view can be generated using a built-in DXP configuration, which is invoked from the command-line or GUI using the diffreport-sbs configuration id. Alternatively, it can be generated from the XML Compare API with the dx2-deltaxml-sbs-folding-html.xsl stylesheet added as the final output filter.
For the Document Comparator, the DCP equivalent configuration doc-diffreport-sbs must be used. Or, if using the API, the XSLT filter dx2-deltaxml-sbs-folding-html.xsl should be added as a filter-step to the OUTPUT_FINAL extension point, as shown in the following Java code:
HTML Folding Report (diffreport)
The color of the rendered XML indicates the type of change (blue, green and red for 'modified', 'added' and 'deleted' respectively). The view of each element node may be folded/unfolded by pressing the icon immediately to the left of the start tag. A simple toolbar and differences list allow for easier navigation of changes in large documents.
The 'folding' output format.
With the Pipelined Comparator, the HTML for the folding view is generated using a built-in DXP configuration, which can be invoked either from the GUI or from the command-line, as with the side-by-side view, but now using the diffreport configuration id.
For the Document Comparator, the folding view can be created from the command-line or GUI using the DCP configuration id doc-diffreport. Alternatively, the associated XSLT stylesheet can be added as a filter-step to the final output extension point. This is illustrated in the DCP Folding Diff Report sample.
8.2.2. XML Diff and Patch Output
The XML Compare comparators may be configured to output either a full-context delta (the default) or a changes-only delta. When the pipelined comparator (but not the document comparator) is used, the changes-only format may be used to recreate document B from document A; this could be useful in version control systems and similar scenarios. A worked example of this is: Using Deltas for XML Versioning (diff and patch).
8.3. Document Comparator Formats
8.3.1. Tracked Changes
Many XML editors support a tracked changes feature incorporated into an Author Mode with a WYSIWYG view; the output from XML Compare can be represented as tracked changes in supported tools. The main benefit is that detected changes can be more easily accepted or rejected and further edits made within the chosen editor. The Document Comparator API provides a setResultFormat method of the OutputFormatConfiguration object to produce output conforming to the tracked changes format for the following XML editors:
- PTC ArborText
- Adobe FrameMaker
The tracked changes feature supports a number of XML Editors