Introduction to Delta Format Version 2 (deltaV2)

This document provides an introduction to the new DeltaXML delta format (referred to as deltaV2) for representing changes between two XML documents. It is intended primarily for those familiar with the existing delta format (referred to here as deltaV1) to show how this has been improved. This document does not describe either the old or the new delta format in detail.

1. Background

The current DeltaXML delta format was designed in 2000 and has been 100% stable since then. We are pleased to introduce a new delta format which builds on this and improves it, in particular:

  • deltaV2 is simpler with fewer elements and attributes
  • deltaV2 is even easier to process
  • deltaV2 is extensible to more than two documents

There are some particular areas in which the new format will prove itself. These include:

  • changes in attributes are now much easier to process because changed attributes are represented as elements and are no longer embedded in attribute values
  • attribute namespaces are handled as namespaces rather than using prefixes embedded in attribute values
  • attribute values and text values are handled in a similar way
  • the deltaxml:exchange element is no longer needed, removing one level of structure
  • an attribute on the root element indicates whether the delta is full-context or changes-only

Additionally, deltaV2 preserves some of the unique features and benefits of the original delta format:

  • both full-context and changes-only deltas have the same basic format
  • the delta remains bi-directional, i.e. can be used to convert either document to the other
  • an unchanged, added or deleted subtree has the same format as the original documents
  • at each element in the delta you know immediately if it is added, deleted, changed or unchanged
  • deltaxml:key and deltaxml:ordered attributes are handled in the same way as before, so input filters will not need to be changed

Initially the XML Compare product will adopt this new format in a new 5.0 release. At a later date the DeltaXML Sync product will also adopt this new delta format.

We will look at some of these areas in more detail.

2. Simpler with fewer elements and attributes

DeltaV1 had six elements and three attributes:

  • deltaxml:PCDATAmodify
  • deltaxml:PCDATAold
  • deltaxml:PCDATAnew
  • deltaxml:exchange
  • deltaxml:old
  • deltaxml:new
  • @deltaxml:delta
  • @deltaxml:new-attributes
  • @deltaxml:old-attributes

DeltaV2 has four elements and one attribute, apart from two additional attributes on the root element:

  • deltaxml:attributes
  • deltaxml:attributeValue
  • deltaxml:textGroup
  • deltaxml:text
  • @deltaxml:deltaV2

This has the advantage that less code needs to be written to process delta data. Note also that since the new format caters for three or more documents as well as the basic two, there is even less that needs to be learned in order to process changes.

The delta attribute is similar, and the correspondence between the old and new formats is as follows:

deltaV1                             deltaV2                                                Meaning
deltaxml:delta='add'                deltaxml:deltaV2='B'                                   The element appears in the 'new' document or 'B' document only.
deltaxml:delta='delete'             deltaxml:deltaV2='A'                                   The element appears in the 'old' document or 'A' document only.
deltaxml:delta='unchanged'          deltaxml:deltaV2='A=B'                                 The element appears in both documents and is equal.
deltaxml:delta='WFmodify'           deltaxml:deltaV2='A!=B'                                The element appears in both documents and is different in each, i.e. not equal.
deltaxml:delta='WFmodifyUnordered'  deltaxml:deltaV2='A!=B' with deltaxml:ordered='false'  The element appears in both documents and is different in each, i.e. not equal.
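
As an illustration only, the correspondence above could be captured in a small lookup when migrating code that inspects deltaV1 values. This is a minimal sketch; the names V1_TO_V2 and v1_to_v2 are hypothetical and not part of any DeltaXML API.

```python
# Correspondence between deltaV1 and deltaV2 delta values, copied from the
# list above. The names used here are illustrative, not a DeltaXML API.
V1_TO_V2 = {
    "add": "B",
    "delete": "A",
    "unchanged": "A=B",
    "WFmodify": "A!=B",
    "WFmodifyUnordered": "A!=B",  # deltaV2 also carries deltaxml:ordered='false'
}


def v1_to_v2(delta_value):
    """Return (deltaV2 value, extra attributes) for a deltaV1 deltaxml:delta value."""
    v2_value = V1_TO_V2[delta_value]  # raises KeyError for an unknown value
    extra = {}
    if delta_value == "WFmodifyUnordered":
        extra["deltaxml:ordered"] = "false"
    return v2_value, extra


print(v1_to_v2("add"))
print(v1_to_v2("WFmodifyUnordered"))
```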

3. Attributes easier to process

One of the biggest changes is in the way attribute values are handled. DeltaV1 was compact in the way that it handled attribute values but quite difficult to process, and could not be extended to more than two documents.

In deltaV1, changed attributes were encoded within the two delta attributes @deltaxml:new-attributes and @deltaxml:old-attributes. This meant that to process the attribute values they needed to be extracted. Also, because the old and new values were separated in these two attributes, it was often necessary to do set operations to determine whether an attribute was added, deleted or modified.
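
To make the extraction and set operations concrete, here is a hedged Python sketch of the deltaV1 approach. It assumes each encoded entry has the form name='value', as in the deltaV1 example in this document, and does not handle quote escaping inside values; the function names are hypothetical.

```python
import re

# Assumption: each entry looks like name='value', as in this document's
# deltaV1 example. Escaped quotes inside values are not handled here.
_ENTRY = re.compile(r"([\w:.\-]+)='([^']*)'")


def parse_delta_attributes(encoded):
    """Turn "a3='value3' a4='value4'" into {'a3': 'value3', 'a4': 'value4'}."""
    return dict(_ENTRY.findall(encoded))


def classify_v1(old_encoded, new_encoded):
    """Classify changed attributes with set operations on the two name sets."""
    old = parse_delta_attributes(old_encoded)
    new = parse_delta_attributes(new_encoded)
    return {
        "added": sorted(new.keys() - old.keys()),
        "deleted": sorted(old.keys() - new.keys()),
        "modified": sorted(old.keys() & new.keys()),
    }


result = classify_v1("a3='value3' a4='value4'", "a2='value2' a4='value5'")
print(result)
```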

In deltaV2, attributes are handled within markup and processing is therefore very much easier. Unchanged attributes are handled as before: they remain unchanged as attributes.

Consider this small example to see how this works, where attribute a1 is unchanged, a2 is added, a3 is deleted and a4 is modified.

Document A (old):

<p a1="value1" a3="value3" a4="value4"/>

Document B (new):

<p a1="value1" a2="value2" a4="value5"/>

In deltaV1 this would be represented as:

<p deltaxml:delta="WFmodify" a1="value1" 
   deltaxml:old-attributes="a3='value3' a4='value4'" 
   deltaxml:new-attributes="a2='value2' a4='value5'" />

In deltaV2 this is represented as:

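<p deltaxml:deltaV2="A!=B" a1="value1">
  <deltaxml:attributes deltaxml:deltaV2="A!=B">
    <dxa:a2 deltaxml:deltaV2="B">
      <deltaxml:attributeValue deltaxml:deltaV2="B">value2</deltaxml:attributeValue>
    </dxa:a2>
    <dxa:a3 deltaxml:deltaV2="A">
      <deltaxml:attributeValue deltaxml:deltaV2="A">value3</deltaxml:attributeValue>
    </dxa:a3>
    <dxa:a4 deltaxml:deltaV2="A!=B">
      <deltaxml:attributeValue deltaxml:deltaV2="A">value4</deltaxml:attributeValue>
      <deltaxml:attributeValue deltaxml:deltaV2="B">value5</deltaxml:attributeValue>
    </dxa:a4>
  </deltaxml:attributes>
</p>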

The new format is much more verbose, but the code to process it is much shorter and simpler. For example, to determine which attributes have been modified, in deltaV1 it is necessary to parse deltaxml:old-attributes and deltaxml:new-attributes to extract the names of all the attributes and then do a set intersection on these to find the names of any attributes in both lists. In deltaV2, it is only necessary to find elements within deltaxml:attributes which have more than one deltaxml:attributeValue within them.
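
The deltaV2 lookup described above can be sketched in Python as follows. The namespace URIs are placeholders chosen for this sketch (hypothetical, not the real DeltaXML URIs), and modified_attributes is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs: hypothetical stand-ins for this sketch only.
DELTAXML_NS = "urn:example:deltaxml"
DXA_NS = "urn:example:dxa"

DELTA_V2 = f"""
<p xmlns:deltaxml="{DELTAXML_NS}" xmlns:dxa="{DXA_NS}"
   deltaxml:deltaV2="A!=B" a1="value1">
  <deltaxml:attributes deltaxml:deltaV2="A!=B">
    <dxa:a2 deltaxml:deltaV2="B">
      <deltaxml:attributeValue deltaxml:deltaV2="B">value2</deltaxml:attributeValue>
    </dxa:a2>
    <dxa:a3 deltaxml:deltaV2="A">
      <deltaxml:attributeValue deltaxml:deltaV2="A">value3</deltaxml:attributeValue>
    </dxa:a3>
    <dxa:a4 deltaxml:deltaV2="A!=B">
      <deltaxml:attributeValue deltaxml:deltaV2="A">value4</deltaxml:attributeValue>
      <deltaxml:attributeValue deltaxml:deltaV2="B">value5</deltaxml:attributeValue>
    </dxa:a4>
  </deltaxml:attributes>
</p>
""".strip()


def modified_attributes(element):
    """Return local names of attributes with more than one attributeValue child."""
    names = []
    container = element.find(f"{{{DELTAXML_NS}}}attributes")
    if container is None:
        return names  # no changed attributes on this element
    for attr_el in container:
        values = attr_el.findall(f"{{{DELTAXML_NS}}}attributeValue")
        if len(values) > 1:
            names.append(attr_el.tag.split("}", 1)[-1])  # drop the {uri} part
    return names


root = ET.fromstring(DELTA_V2)
print(modified_attributes(root))
```

No string parsing or set intersection is needed; the markup itself carries the structure.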

The handling of attribute namespaces is now more consistent because the attribute names become element names (for attributes where the value has changed) rather than the prefixes being embedded in the deltaxml:old-attributes and deltaxml:new-attributes values. This makes for easier handling of the namespaces.

Note also that deltaV2 can be extended to handle three or more documents, whereas deltaV1 is limited to just two.

3.1. Attribute Namespaces

Some special namespaces are used when representing attribute change in the deltaxml:attributes element. These are listed below:

Usual or recommended prefix  Namespace URI  Purpose
dxa                                         Namespace of an element used to represent an attribute which was not in a namespace in one or both input files.
dxx                                         Namespace of an element used to represent an attribute in the XML namespace (corresponding to the URI http://www.w3.org/XML/1998/namespace and always bound to the prefix xml:). Such attributes include: xml:space, xml:id, xml:base and xml:lang.

These new namespaces are used for several reasons:

  • The semantics/use of default (or 'non-prefixed') namespaces applies differently to attributes than it does to elements. We use the 'dxa' namespace so that an attribute converted into an element in a file with a default namespace does not inherit any default namespace.
  • If the same name is used for an element and an attribute in a grammar (for example the xhtml style element and attribute), there could be confusion or even mismatching when the delta is used with existing XSLT stylesheets/software. Using a new namespace should avoid such issues.
  • The XML prefix and URI are reserved for use by future standards. Converting attributes such as xml:space into xml:space elements would have contravened these guidelines/rules.
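
The first reason can be observed with any namespace-aware parser. The short Python sketch below uses an arbitrary placeholder namespace (urn:example:doc, an assumption for illustration) to show that a default namespace applies to the element name but not to an unprefixed attribute.

```python
import xml.etree.ElementTree as ET

# "urn:example:doc" is an arbitrary placeholder namespace for this illustration.
root = ET.fromstring('<p xmlns="urn:example:doc" style="bold"/>')

element_name = root.tag             # element name picks up the default namespace
attribute_names = list(root.attrib)  # unprefixed attribute stays in no namespace

print(element_name)
print(attribute_names)
```

If the style attribute were turned into an unprefixed style element, it would fall into the default namespace, which is not where the attribute was.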

4. New Root element attributes

An attribute on the root element specifies that the document is a delta document and conforms to deltaV2: deltaxml:version='2.0'

Another attribute on the root element indicates whether the delta document contains just the changes (deltaxml:content-type='changes-only') or if the data that is unchanged in all the documents is also present (deltaxml:content-type='full-context').
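
A consumer might check these two root attributes before doing any further processing. The following is a minimal sketch, assuming a placeholder namespace URI for the deltaxml prefix (the real URI is not given in this document); describe_delta is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI: a hypothetical stand-in for this sketch only.
DELTAXML_NS = "urn:example:deltaxml"


def describe_delta(root):
    """Validate the deltaV2 root attributes and return the content type."""
    version = root.get(f"{{{DELTAXML_NS}}}version")
    content_type = root.get(f"{{{DELTAXML_NS}}}content-type")
    if version != "2.0":
        raise ValueError("not a deltaV2 document")
    if content_type not in ("changes-only", "full-context"):
        raise ValueError(f"unexpected content-type: {content_type!r}")
    return content_type


root = ET.fromstring(
    f'<p xmlns:deltaxml="{DELTAXML_NS}" deltaxml:version="2.0" '
    f'deltaxml:content-type="full-context" deltaxml:deltaV2="A=B"/>'
)
print(describe_delta(root))
```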

5. Text handling

Text is handled in a similar manner to deltaV1, but there are changes to enable more than two documents to be represented. Consider the following example:

Document A (old):

<p>The quick brown fox</p>

Document B (new):

<p>The quick red fox</p>

In deltaV1 this would be represented as:

<p deltaxml:delta="WFmodify">
    <deltaxml:PCDATAold>The quick brown fox</deltaxml:PCDATAold>
    <deltaxml:PCDATAnew>The quick red fox</deltaxml:PCDATAnew>
</p>

In deltaV2 this is represented as:

<p deltaxml:deltaV2="A!=B">
  <deltaxml:textGroup deltaxml:deltaV2="A!=B">
    <deltaxml:text deltaxml:deltaV2="A">The quick brown fox</deltaxml:text>
    <deltaxml:text deltaxml:deltaV2="B">The quick red fox</deltaxml:text>
  </deltaxml:textGroup>
</p>

This could also be represented in deltaV2 more precisely as:

<p deltaxml:deltaV2="A!=B">
  The quick
  <deltaxml:textGroup deltaxml:deltaV2="A!=B">
    <deltaxml:text deltaxml:deltaV2="A">brown</deltaxml:text>
    <deltaxml:text deltaxml:deltaV2="B">red</deltaxml:text>
  </deltaxml:textGroup>
  fox
</p>
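
Because the delta is bi-directional, the text of either document can be recovered from a representation like the one above. The Python sketch below is illustrative only: the namespace URI is a placeholder (hypothetical), text_for is an invented helper, and it handles just plain text and deltaxml:textGroup children of a single element in a two-document delta. The sample is written on one line so that whitespace is unambiguous.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI: a hypothetical stand-in for this sketch only.
DELTAXML_NS = "urn:example:deltaxml"
TEXT_GROUP = f"{{{DELTAXML_NS}}}textGroup"
TEXT = f"{{{DELTAXML_NS}}}text"
DELTA_ATTR = f"{{{DELTAXML_NS}}}deltaV2"

DELTA = (
    f'<p xmlns:deltaxml="{DELTAXML_NS}" deltaxml:deltaV2="A!=B">The quick '
    '<deltaxml:textGroup deltaxml:deltaV2="A!=B">'
    '<deltaxml:text deltaxml:deltaV2="A">brown</deltaxml:text>'
    '<deltaxml:text deltaxml:deltaV2="B">red</deltaxml:text>'
    "</deltaxml:textGroup> fox</p>"
)


def text_for(element, doc):
    """Rebuild the text of `element` as it appeared in document `doc` ('A' or 'B')."""
    parts = [element.text or ""]
    for child in element:
        if child.tag == TEXT_GROUP:
            # Keep only the text variant that belongs to the requested document.
            for text_el in child.findall(TEXT):
                if text_el.get(DELTA_ATTR) == doc:
                    parts.append(text_el.text or "")
        parts.append(child.tail or "")  # unchanged text following the child
    return "".join(parts)


root = ET.fromstring(DELTA)
print(text_for(root, "A"))
print(text_for(root, "B"))
```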

There is therefore no significant difference in the way that text is handled, except that the absence of text in one document is treated in a slightly different manner:

Document A (old):

<p>The quick brown fox</p>

Document B (new):

<p/>

In deltaV1 this would be represented as:

<p deltaxml:delta="WFmodify">
    <deltaxml:PCDATAold>The quick brown fox</deltaxml:PCDATAold>
</p>

In deltaV2 this is represented as:

<p deltaxml:deltaV2="A!=B">
  <deltaxml:textGroup deltaxml:deltaV2="A!=B">
    <deltaxml:text deltaxml:deltaV2="A">The quick brown fox</deltaxml:text>
  </deltaxml:textGroup>
</p>