This document explains the concepts behind binary comparison using a sample. The sample resources and a description of how to run it can be found at: https://bitbucket.org/deltaxml/imagecompare/src/default/.
Many document formats provide mechanisms to refer to or include images and possibly other forms of binary content in documents. A typical example is the img element in HTML where the src attribute is typically a URI which refers to an image. Many applications or publishing processes which handle documentation formats support a range of image types, examples include:
Most formats also support the use of both relative and absolute referencing. So for example it is possible to refer to an image using a relative URI:
In this case the location of the image is relative to the location of the file containing the image element and src URI.
It is also possible to use an absolute URI; these come in various forms, or 'schemes', for example:
2. Comparing Images
Looking solely at the XML files, it is possible to determine when the value of an attribute has changed. However, more can be done that would be useful to a comparison user when you can:
- access a filesystem or similar hierarchical store to determine or resolve file locations, and then
- read the files to analyze their contents
With filesystem and similar access it is possible to resolve a relative URI into an absolute URI.
Consider the following example:
file1.html is located inside the /Users/Joe/Documents directory and contains the following image reference:
file2.html is located in the same directory, but contains
If you are given just the two files without any knowledge of their location, it's only possible to say that the src attribute has changed. However, with the knowledge of where the files are located (Joe's Documents directory) it is possible to resolve the URIs and determine that both src attributes are actually referring to the same file.
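The resolution step in this example can be sketched with the standard java.net.URI class. This is only an illustration, not the sample's own code: the class name ImageRefResolver and the image path images/logo.png are hypothetical, chosen to mirror the file1.html and file2.html scenario.

```java
import java.net.URI;

class ImageRefResolver {
    /** Resolves a src attribute value against the URI of the document that contains it. */
    static URI resolve(URI documentUri, String src) {
        // An absolute src is returned as given; a relative src is merged with the document URI.
        return documentUri.resolve(src).normalize();
    }

    public static void main(String[] args) {
        // Hypothetical locations and references, for illustration only.
        URI file1 = URI.create("file:/Users/Joe/Documents/file1.html");
        URI file2 = URI.create("file:/Users/Joe/Documents/file2.html");

        URI a = resolve(file1, "images/logo.png");                            // relative reference
        URI b = resolve(file2, "file:/Users/Joe/Documents/images/logo.png");  // absolute reference

        // The attribute values differ, but both resolve to the same location.
        System.out.println(a.equals(b)); // prints true
    }
}
```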
The above example demonstrates that image attribute change does not necessarily imply image change. The converse is also true: it is possible to have an unchanged attribute value where the image does change. This can occur, for example, where the two XML input files are stored in different locations in the tree (not the same directory) and each has its associated images with local relative references.
To summarize our processing approach:
- If we don't have access to the filesystem or navigation tree we can only compare attribute values
- When we do have tree access, we resolve the references relative to the base of the two input files.
- When the references resolve to the same location we know the image is the same at that point. The comparison result can then contain either the absolute reference or one of the two input relative references, with the proviso that when relative references are used the result file must be located in the tree such that the relative references still work.
- If the absolute references resolve to different locations then the images could be identical copies or they could be different. We perform a byte-by-byte comparison of the images. If we determine that every byte is identical we can then say that the images are identical and we only need to provide one of them in the result. If they differ, or if we cannot fully compare them byte-wise we will report them as changed and provide both image elements in the result (one marked with an A or deleted delta and the other marked B or added).
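The decision steps summarized above can be sketched as a small classification function. This is a simplified illustration under stated assumptions, not the sample's implementation: the class ImageRefClassifier, its Outcome enum and the example paths are all hypothetical, and a null base URI stands in for "no access to the filesystem or navigation tree".

```java
import java.net.URI;

class ImageRefClassifier {
    enum Outcome {
        COMPARE_ATTRIBUTE_VALUES, // no base URIs available: only the raw values can be compared
        SAME_LOCATION,            // both references resolve to one location: image unchanged
        COMPARE_BYTES             // different locations: content must be compared byte by byte
    }

    static Outcome classify(URI baseA, String srcA, URI baseB, String srcB) {
        if (baseA == null || baseB == null) {
            return Outcome.COMPARE_ATTRIBUTE_VALUES;
        }
        // Resolve each reference relative to the base of its own input file.
        URI a = baseA.resolve(srcA).normalize();
        URI b = baseB.resolve(srcB).normalize();
        return a.equals(b) ? Outcome.SAME_LOCATION : Outcome.COMPARE_BYTES;
    }
}
```

For instance, classify(URI.create("file:/docs/a/in.html"), "img/x.png", URI.create("file:/docs/b/in.html"), "img/x.png") yields COMPARE_BYTES: the attribute values are identical, but the inputs live in different directories, so the images may still differ.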
We have tried to provide a conservative implementation, in that we always assume change unless we can be absolutely certain that the images or other binary content are identical. At the same time, we would like an optimal and fast implementation. Here are some implementation notes:
- If we have file system access we can ask for the sizes of the files (without reading their entire contents) and if they differ we assume that they are different without reading their content.
- If there are any failures in the process (file permissions etc.) we assume the worst: that the files differ.
- The byte comparison extension function is fail-fast: it reports not-equal as soon as the first differing byte is found. Correspondingly, it can only report equal once the last bytes of both files have been read.
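A minimal sketch of these three notes using only the Java standard library follows. It is not the extension function shipped with the sample; the class name BinaryEquality and the method identical are hypothetical, and the behaviour is an assumption modelled on the notes above.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

class BinaryEquality {
    /** Returns true only when both files could be read completely and every byte matched. */
    static boolean identical(Path a, Path b) {
        try {
            // Cheap check first: different sizes mean different content, no reading needed.
            if (Files.size(a) != Files.size(b)) {
                return false;
            }
            try (InputStream inA = new BufferedInputStream(Files.newInputStream(a));
                 InputStream inB = new BufferedInputStream(Files.newInputStream(b))) {
                int byteA;
                while ((byteA = inA.read()) != -1) {
                    // Fail fast: stop at the first byte that differs.
                    if (byteA != inB.read()) {
                        return false;
                    }
                }
                // Equal only if the second stream is exhausted as well.
                return inB.read() == -1;
            }
        } catch (IOException | SecurityException e) {
            // Conservative: any failure (missing file, permissions) is reported as "changed".
            return false;
        }
    }
}
```

Returning false from the catch block is what keeps the comparison conservative: a file that cannot be read is never reported as identical.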
One important consideration for this sample is how the xml:base attribute of the two input files is determined so that the relative references can be resolved against a base URI. The xml:base attribute is required on the root element because the baseURI property of document nodes is not preserved through each stage of the processing pipeline. Where the compare function uses java.io.File, String/URI or similar inputs the code has or can easily determine the URI or systemId of the inputs. When other forms of inputs are used there are often ways of providing a systemId (eg:
The sample pipeline sets the documentLocation lexical preservation property to 'true'; the lexical preservation processor then gets the base URI from the document node and copies it to the root element of each input document using the xml:base attribute. The output filter, which handles the image processing, exploits the xml:base attributes added at the lexical preservation input stage.
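To illustrate why an xml:base attribute on the root element is enough for a later stage to resolve image references, here is a rough sketch using the standard JAXP DOM API. It does not use or represent the XML Compare pipeline or its output filter; the class XmlBaseLookup, its method and the example document are hypothetical.

```java
import java.io.StringReader;
import java.net.URI;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class XmlBaseLookup {
    /** Resolves src against the xml:base on the root element; returns null if there is none. */
    static URI resolveAgainstXmlBase(String xml, String src) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed so xml:base is visible by namespace

        // Parsing from a string means the document node itself carries no base URI,
        // similar to a document that has passed through intermediate pipeline stages.
        Element root = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();

        String base = root.getAttributeNS(XMLConstants.XML_NS_URI, "base");
        if (base.isEmpty()) {
            return null; // no base recorded: only attribute values can be compared
        }
        return URI.create(base).resolve(src).normalize();
    }
}
```

With a root element such as <html xml:base="file:/samples/test2/a/input.xhtml">, the reference images/pic.png resolves to file:/samples/test2/a/images/pic.png even though the parsed document has no system identifier of its own.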
3. The test cases
For this XML Compare sample we needed to use a slightly more complex structure than some of our other samples. We needed to create a test case (test2) where the input files are located in different subdirectories. The test cases are written in XHTML so that the results and inputs can be easily viewed in a browser.
There is one further aspect to test2 that is worth considering. The two input files that are used in the comparison are actually identical byte-for-byte copies of one another; the differences that appear in the result come about because of differences in the associated referenced files.
4. Running the sample
Download the sample from here: https://bitbucket.org/deltaxml/imagecompare/downloads/
The README.md file gives instructions on how to run the sample.
5. Applying the sample to other formats and data
This sample for xhtml has an output filter that uses two templates to match images both with unchanged and modified source attributes:
It is a requirement to match the element containing the attribute so that it and its other attributes can be duplicated when there is an image change. We would recommend the xhtml-binary-image-compare.xsl filter be modified by changing the match statements on both major templates to include any new image related elements and attributes in a consistent way.