Using Deltas for XML Versioning (diff and patch)
Introduction
DeltaXML can be used to provide a 'diff and patch' capability for any XML document or data file. The diff would be generated by comparing two versions and producing a delta that contains only the differences, known as a changes-only delta. The patch capability is provided by the recombine operation.
This 'diff and patch' capability can be useful where updates to XML files need to be sent using minimum bandwidth, for example in mobile or satellite applications. Another application is in situations where it is necessary to keep a full audit trail of changes to an XML file using minimum storage space.
The DeltaXML DeltaV2 format comes in two main flavours: full context and changes-only.
The full context delta efficiently stores the entire contents of the input files. It can be used to create a difference report or a document with changes marked-up.
The changes-only delta format stores only what has changed and the necessary context for those changes. It is an ideal format for storing differences between different versions of XML documents.
XML Compare comes bundled with a recombiner which can recreate Document ‘B’ with Document ‘A’ and the changes-only delta. The recombiner can also be used in reverse to recreate Document ‘A’ from Document ‘B’ and the changes-only delta. This process is heavily used at DeltaXML as part of our comprehensive roundtrip test suite (to ensure the comparator is 100% accurate).
More usefully, these tools can be used when there are many versions of a document. There are various strategies that can be used to achieve this.
Using Changes-only Generated Deltas Against Version 1
The most obvious approach is storing the original version of the document, and the changes-only delta between the versions.
Any version can then easily be recreated by using the recombiner with the stored version and the relevant changes-only delta.
The big issue with this approach is that as the document gets more revised the deltas will get larger and larger, potentially to the point where the delta is effectively the entire document.
Using Changes-only Deltas Generated Against the Previous Version
Another possible approach is to store the first and last versions, and changes-only deltas between each version and the last.
Any version can then be recreated by either forward recombining from version 1, or reverse recombining form the latest version. The issue here is that the more versions you have, the more recombinations it will potentially take to reconstruct a version.
The sensible solution here would be to store the first, latest and each nth
version (what n
is depends on your system’s use/data).
Lexical Preservation
XML Compare expects to be run in an environment where either its inputs or its output may have been processed by XSLT filters, which in particular conforms to the XSLT processing model's XDM tree model (as specified in XQuery 1.0 and XPath 2.0 Data Model). This model does not contain entries that correspond to all the XML node types, such as 'DOCTYPEs', 'entities', and 'CDATA sections and ignorable whitespace', which are removed, expanded and converted to text characters respectively as part of the XSLT parsing.
Our lexical preservation filters can preserve the four mentioned XML node types, by converting them to and from XML element nodes. These preservation filters also enable processing instructions and comments, which are otherwise ignored by our comparator technology, to be retained and compared in a similar manner. For further information please refer to the How to Preserve Doctype Information sample, the How to Preserve Processing Instructions and Comments sample, and the LexicalPreservation Java API documentation.
In order for the recombiner to work with filtered documents, its inputs need to be identical to those used to generate the delta. The sample ensures this by storing the marked-up version (i.e. after filtering with the LexicalPreservation filter), see graphic below (the inputs with '(Preserved)' are marked-up with the lexical preservation).
The output filter is only run as the last stage when retrieving a version from the system, see below:
Implementation
The sample, which is available from Bitbucket, https://bitbucket.org/deltaxml/deltas-for-versioning, is an implementation of the second approach discussed in the previous section. It is made up of three classes:
DeltasForVersioning.java - this class stores a list of
Version
s and offers methods for management. Its constructor offers a quick way to disable the lexical preservation filtering (the default is for it to be enabled). The important methods are:addVersion(File)
- this adds theFile
as a version of the document.verifyDeltaForInput(File, File, File)
- validates that the changes-only deltaV2 is valid and can be used to recreate either of the inputs, this is called byaddVersion(File)
. This throws aInvalidChangesOnlyDeltaException
when the changes-only deltaV2 fails the validation.retrieveVersionForwards(int, File)
- this retrieves the requested version of the document using forward recombine, starting with the first version of the document.retrieveVersionBackwards(int, File)
- this retrieves the requested version of the document using reverse recombine, starting with the latest version.runLexicalPreservationInfilter(File input)
- this runs the lexical preservation input filter, and returns the marked-up version of the input.runLexicalPreservationOutfilter(File input, File output)
- this runs the lexical preservation output filter, and returns the non-marked-up version of the input (i.e. the expanded lexical content is back in its original form.
Version.java - instances of this class store details about a version of the document (mainly the generated changes-only delta file).
InvalidChangesOnlyDeltaException.java - an exception thrown when the generated changes-only deltaV2 is invalid.
Limitations
There are a few limitations with the sample implementation, including:
Can only be used from PipelinedComparatorS9 - DocumentComparator does not support producing changes-only deltas
The sample is implemented in Java.
The Version object is storing the full copy of each document version, this should be a trivial thing to clean up.
There is no persistence, so the version model needs to be built up each run.
The LexicalPreservation output filter requires a Saxon PE license as the DeltaXML OEM license cannot be used in sample code. If this is a problem, please contact us.
Running the sample
The sample resources and a description on how to run it can be found at: https://bitbucket.org/deltaxml/deltas-for-versioning/.
The resources should be checked-out, cloned or downloaded and unzipped into the samples directory of the XML Compare release. They should be located such that they are two levels below the top level release directory, for example DeltaXML-XML-Compare-10_0_0_j/samples/deltas-for-versioning
.