Orderless Comparison
Introduction
When XML Data Compare aligns XML elements for a comparison, the elements are treated as ordered by default. Ordering of elements however is often unimportant during data capture or storage. For example, entries in an electronic address book can be stored in any order. This document describes the concepts behind the comparison of orderless elements. For the resources associated with this sample, see here
Overview
Any element in an XML file can be declared as a container for orderless elements. XML Data Compare will match up and compare the child elements of such a container regardless of the order in which they appear. What counts about orderless elements is membership in a set, not relative position. Arbitrary rearrangement leaves the meaning of the set unaffected.
To summarise, for comparing orderless elements, we are concerned with two types of element:
Orderless Container
This is an element that has child elements arranged in an order that is not considered significant.
Orderless Element
This is an element whose position relative to its siblings is not regarded as significant because its parent element is an orderless container.
Note that XML Data Compare assumes that the order of data is significant unless an orderless container is defined.
In the example below the ’addressList’ element is the ‘orderless container’, whereas the ‘person’ elements within ‘addressList’ are ‘orderless elements’.
Orderless Comparison With and Without Keys
XML Data Compare uses a heuristic algorithm for matching orderless elements. This approach does not require keys to be added or any other work to be done other than specifying the Container as orderless (see above).
Automated matching does not understand the meaning of your data. Therefore, if you have attributes or leaf elements whose values effectively identify each orderless element, you may find it useful to use keying (see below) to be explicit about which orderless elements match. Using keys may also make the matching more efficient and use less resources.
Using keys is not always possible, as the sets of data which make up keys may vary even from element to element of the same name. Therefore you may specify keys for some orderless child elements but not for other sibling elements if you specify fail-if-no-key="false"
in the configuration file for the child-order element (see below and also Configuration Schema Guide).
Generally, a good approach is to start off without using keys and then add them to clarify the matching between orderless elements.
Note
The sample comparisons provided here produce the same result with keyed orderless elements or with non-keyed orderless elements.
Non-Keyed Orderless Comparison
How to declare an orderless container without keys for the contained orderless elements
Assume that we have the following entries stored as our contacts data and wish to track any modifications using XML Data Compare:
Example: a 'contacts' XML document (contactsA.xml in the sample on Bitbucket).
<contacts>
<person>
<first-Name>Alan</first-Name>
<last-Name>Alpha</last-Name>
<address type="home">
<postcode>B92 6DC</postcode>
</address>
<occupation>Train Driver</occupation>
</person>
<person>
<first-Name>Betty</first-Name>
<last-Name>Alpha</last-Name>
<address type="home">
<postcode>WR18 4GU</postcode>
</address>
<occupation>Engineer</occupation>
</person>
</contacts>
In the 'B' version of the contacts XML document, the order of the person elements is reversed so the person element for 'Betty Alpha' occurs before that for 'Alan Alpha'. There are also the following differences in the B document:
Alan Alpha's postcode is changed from 'B92 6DC' to 'CR12 4QN'
Betty Alpha's postcode is changed from 'WR18 4GU' to 'B92 6DC' (Alan Alpha's old postcode)
The first step is to use the config file to tell XML Data Compare that child elements of <contacts>
are orderless elements. Or, to put it differently: to identify <contacts>
elements as orderless containers.
Example: Specifying orderless container with no keys added to child orderless elements
<dcf:location name="My contacts" xpath="/contacts">
<dcf:child-order ignore-order="true" fail-if-no-key="false"/>
</dcf:location>
This tells XML Data Compare that the order of child elements of <contacts> can be ignored, i.e. they can appear in any order. The advanced algorithm matches the corresponding 'person' elements for Alan Alpha and Betty Alpha based on probability. This is calculated by identifying unique values for specific elements or attributes for each person with common values in both input documents, in this case the elements 'first-name' and 'occupation' are unique. The comparison with this configuration produces the following DeltaV2 Result:
DeltaV2 Result
<contacts xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
deltaxml:deltaV2="A!=B" deltaxml:version="2.0" deltaxml:content-type="full-context">
<person deltaxml:deltaV2="A!=B" deltaxml:original-position="A=2, B=1">
<first-Name deltaxml:deltaV2="A=B">Betty</first-Name>
<last-Name deltaxml:deltaV2="A=B">Alpha</last-Name>
<address deltaxml:deltaV2="A!=B" type="home">
<postcode deltaxml:deltaV2="A!=B">
<deltaxml:textGroup deltaxml:deltaV2="A!=B">
<deltaxml:text deltaxml:deltaV2="A">WR18 4GU</deltaxml:text>
<deltaxml:text deltaxml:deltaV2="B">B92 6DC</deltaxml:text>
</deltaxml:textGroup>
</postcode>
</address>
<occupation deltaxml:deltaV2="A=B">Engineer</occupation>
</person>
<person deltaxml:deltaV2="A!=B" deltaxml:original-position="A=1, B=2">
<first-Name deltaxml:deltaV2="A=B">Alan</first-Name>
<last-Name deltaxml:deltaV2="A=B">Alpha</last-Name>
<address deltaxml:deltaV2="A!=B" type="home">
<postcode deltaxml:deltaV2="A!=B">
<deltaxml:textGroup deltaxml:deltaV2="A!=B">
<deltaxml:text deltaxml:deltaV2="A">B92 6DC</deltaxml:text>
<deltaxml:text deltaxml:deltaV2="B">CR12 4QN</deltaxml:text>
</deltaxml:textGroup>
</postcode>
</address>
<occupation deltaxml:deltaV2="A=B">Train Driver</occupation>
</person>
</contacts>
Note that in the DeltaV2 result above, a deltaxml:original-position attribute (lines 3 and 16) on each of the 'person' elements shows how the position of the person elements has changed.
Side by Side DiffReport
Here's a partial screenshot of the Side by Side DiffReport HTML output (see Output format sample for details on how to configure the output) that was generated from the DeltaV2 result:
Note that XML Data Compare preserves the element order found in the 'B' document. So, in this case, 'Betty Alpha' occurs first in the result.
Keyed Orderless Comparison
How to declare an orderless container with keyed orderless elements
Assume that we have the following entries stored as our address data and wish to track any modifications using XML Data Compare:
Example: an address book XML document (addressesA.xml in the sample on Bitbucket) Ellipses are used to replace longer sections of data.
<addressList>
<person customerid="15">
<name>Joe Bloggs</name>
<email>jblogs@msn.com</email>
<address>
<line>1234 Green Lane</line>
<line>London</line>
<zip>SW1 7WJ</zip>
</address>
<age>26</age>
<notes>Joe Bloggs is a placeholder name ... thinking and writing.</notes>
<extra>The name Bloggs ... deriving from bloc, a bloke.</extra>
</person>
<person customerid="17">
<name>Spongebob Squarepants</name>
<email>spongebob@hotmail.com</email>
<address>
<line>124 Conch Street</line>
<line>Bikini Bottom</line>
<line>Pacific Ocean</line>
</address>
<age>4</age>
<notes>SpongeBob SquarePants is an American animated ... revenue for Nickelodeon.
</notes>
<extra>Many of the ideas for the series originated ...
</extra>
</person>
.
.
.
</addressList>
The first step is to use the config file to tell XML Data Compare that child elements of <addressList>
are orderless elements. Or, to put it differently: to identify <addressList>
elements as orderless containers.
Example: Specifying orderless container with keys added to child orderless elements
<dcf:location name="My address list" xpath="/addressList">
<dcf:child-order ignore-order="true" fail-if-no-key="true">
<dcf:child-alignment child-xpath="person" key-xpath="@customerid"/>
</dcf:child-order>
</dcf:location>
This tells XML Data Compare that the order of child elements of <addressList> can be ignored, i.e. they can appear in any order. Secondly the customerid
attribute can serve as a matching key. Two elements, one from each input file, will be picked out as matches when their respective keys match. The advanced algorithm will also find matching elements in the input files. When the elements match XML Data Compare scans the element in question and records any changes found within its contents.
Rules for keys in orderless comparisons
The following rules apply to the use of key attributes in the orderless case:
Orderless containers must be assigned using the
ignore-order="true"
attribute in the configuration file.XML elements are considered ordered by default.
Orderless containers may not contain parsed character data.
Any element may be designated as an orderless container, regardless of the ordering status of its parent or child elements.
Rules for orderless elements in a set where only some have keys
Using no keys at all is generally preferable to a mixed case. Nevertheless, XML Data Compare's rules for the mixed case are simple:
Elements of the same type with the same key are matched, and any changes within them are shown.
If the comparison file contains an identical element (perfect match, but neither element has a key available) XML Data Compare assumes these are actually the same. No change is recorded.
Remaining elements (with no key) are aligned according to a computed matching probability.
Anchoring position of one element in an orderless set
Orderless sets sometimes incorporate a header element, such as <setheader>
below:
Example: Orderless set with header
<MyDataItems xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
deltaxml:ordered="false">
<setheader>
<date>20 January 2003</date>
<owner>MegaCorp</owner>
</setheader>
<item> Some data </item>
<item> More data </item>
<item> Extra data </item>
</MyDataItems>
Technically <setheader>
is a member of the orderless set <MyDataItems>
. Yet unlike the <item>
members, <setheader>
will not move around. It has a special status within the orderless set. So we would like XML Data Compare to match it across revisions, not treating changes to this one element in the same way as changes to other members. Therefore we need to artificially create data that allows a key to be created for the <setheader>
element in the same way as the <item>
elements.
Ideas for designing keys
The key to an element can take many forms, including:
the name of the element itself, if it is known that this element only appears once within its parent
a child element
part of a child element
a combination of attributes
a combination of attributes and child elements
a random number such as a GUID
arbitrary data from external sources (other files, databases, user input, etc).
The choices are almost unlimited. Moreover, an element class may apply different key formulations to different instances of that class. For example, some instances could use child data while others use attribute data.
We refer to key assignment based on element data as data-driven. For example with a bare address list without any identifiers or keys a key could be used made up of first name, last name and zip code. This design is more natural than id attributes because the keys derive from actual data. The location element for this is presented below:
<dcf:location name="My address list" xpath="/addressList">
<dcf:child-order ignore-order="true" fail-if-no-key="true">
<dcf:child-alignment child-xpath="person" key-xpath="concat(name/text(), ' ', address/zip/text())"/>
</dcf:child-order>
</dcf:location>
Nesting orderless sets
Orderless containers may nest, but child elements do not inherit the orderless quality automatically. It must be declared explicitly. For example, each person in the address list might have several phone numbers, so <phoneNumbers>
needs to be configured as orderless:
Example: Nested orderless sets
<addressList xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1">
<person>
<firstName>John</firstName>
<lastName>Smith</lastName>
<address>
<line>26 High Street</line>
<line>Newtown</line>
<line>Malvern</line>
<zip>WR13 0JK</zip>
</address>
<phoneNumbers>
<phone>1 234</phone>
<phone>2 345</phone>
<phone>3 456</phone>
</phoneNumbers>
</person>
<person>
<firstName>David</firstName>
<lastName>Jones</lastName>
<address>
<line>12 New Street</line>
<line>Newtown</line>
<line>Malvern</line>
<zip>WR13 0PL</zip>
</address>
<phoneNumbers>
<phone>4 567</phone>
<phone>5 678</phone>
<phone>6 789</phone>
</phoneNumbers>
</person>
</addressList>
Comparison works the same no matter how deeply nested the elements. The rules for keys are the same across all levels.
Example: Multiple child-order locations in a config file
<dcf:location name="My address list" xpath="/addressList">
<dcf:child-order ignore-order="true" fail-if-no-key="true">
<dcf:child-alignment child-xpath="person" key-xpath="concat(firstName/text(), ' ', lastName/text(), ' ', address/zip/text())"/>
</dcf:child-order>
</dcf:location>
<dcf:location name="My phone list" xpath="//phoneNumbers">
<dcf:child-order ignore-order="true" fail-if-no-key="true">
<dcf:child-alignment child-xpath="phone" key-xpath="substring(text(),1,1)"/>
</dcf:child-order>
</dcf:location>
Should everything be treated as orderless?
The default XML behaviour is that element ordering is significant. The previous example data, for example, the order of the <line>
elements in the <address>
is significant. Print them out in the wrong order and your postman, or rather the automated post-office sorting machines, may have a few problems! Thus by default you should not try to treat XML data as orderless.
Another example, imagine a simple XML vector graphics format:
<line>
<point x="0" y="0"/>
<point x="1" y="1"/>
</line>
<closedPolygon>
<point x="0" y="0"/>
<point x="1" y="0"/>
<point x="1" y="1"/>
<point x="0.5" y="1.5"/>
<point x="0" y="1"/>
</closedPolygon>
Now in the case of a line, perhaps it would be fine to treat the <line>
element as orderless. It doesn't matter which end you start at the resulting lines are the same and so the comparator should perhaps treat them so. But the <closedPolygon>
is a different matter, reorder the points and you end up with different shapes. These should not be compared as equal.
So what approach should you take? We believe that understanding the semantics of the data is important. In some cases good 'clues' are present in DTDs, XMLSchema or RNG grammars for the data, for example the lack of a * or + repetition operator is a good indication not to treat as orderless. But there are cases where human expertise is needed to complement the grammars.
Running the sample
Download the sample from https://bitbucket.org/deltaxml/xml-data-compare-orderless-comparison
Details of how to run the sample are given in README.md. See REST User Guide for how to run the comparison using REST.