Comparing Orderless Elements
Introduction
This document describes the concepts behind the comparison of orderless elements. For the resources associated with this sample, see here.
Overview
Any element in an XML file can be declared as a container for orderless elements. XML Compare will match up and compare the child elements of such a container regardless of the order in which they appear. What counts about orderless elements is membership in a set, not relative position. Arbitrary rearrangement leaves the meaning of the set unaffected.
To summarise, for comparing orderless elements, we are concerned with two types of element:
Orderless Container
This is an element that has child elements arranged in an order that is not considered significant.
Orderless Element
This is an element whose position relative to its siblings is not regarded as significant because its parent element is an orderless container.
In the example below the ’addressList’ element is the ‘orderless container’, whereas the ‘person’ elements within ‘addressList’ are ‘orderless elements’.
Using Keys
To detect changes in sets or orderless elements, XML Compare must align corresponding elements in spite of rearrangements.Keys help XML Compare perform this correlation. Keys remove any need for sorting. (Orderless comparisons have been supported since version 2.0 of XML Compare.)
The special-purpose attributes called keys simplify comparisons in many situations. XML permits both ordered and orderless data. Keys help XML Compare track changes in both categories. Keys increase both the quality of the matches made and the runtime speed of the comparisons. For orderless data keys should be provided whenever possible to get the best results.
Key attributes impose no requirements on data formats. Extensible Stylesheet Language (XSLT) scripts can generate keys on the fly, whenever needed, and discard them in post-processing. The original XML remains unaffected by their use.
Three practical solutions are presented here, with each using a different comparator and method for customising a comparison:
Pipelined Comparator (DXP) - Uses a filter pipeline defined by an XML file called a 'DXP' to customise the comparison.
Document Comparator - Uses Java API calls to customise a pre-existing pipeline with a number of extension points.
Document Comparator (API and DCP) - Uses either Java API calls to customise a pre-existing pipeline with a number of extension points, or through a parameter in a DCP file
Removing 'ignorable' whitespace
One point that should be noted, before reading further, is that it is very important to remove whitespace only PCDATA nodes during orderless comparison (for example using the com.deltaxml.pipe.filters.NormalizeSpace
Java XML Filter). Also, note that an orderless container must not contain PCDATA as an immediate child: it will cause an exception (com.deltaxml.api.UnorderedElementContainingPCDATAException
). The reason is that while elements are separated unambiguously by tags, if two PCDATA items are put next to each other they are in effect merged into one PCDATA item, so orderless PCDATA makes no sense.
How to declare an orderless container
Ordering is often unimportant during data capture or storage. For example, entries in an electronic address book can be stored in any order. Assume that we have the following entries stored as our address data and wish to track any modifications using XML Compare:
Example 1: an address book XML document (documentA.xml in the sample on Bitbucket: https://bitbucket.org/deltaxml/orderless-comparison )
<addressList>
<person customerid="15">
<name>Joe Bloggs</name>
<email>jblogs@msn.com</email>
<address>
<line>1234 Green Lane</line>
<line>London</line>
<zip>SW1 7WJ</zip>
</address>
</person>
<person customerid="14">
<name>John Doe</name>
<email>jdoe@hotmail.com</email>
<address>
<line>1234 East Beverley Drive</line>
<line>Phoenix</line>
<zip>AZ 12345</zip>
</address>
</person>
<person customerid="62">
<name>Joe Bloggs</name>
<email>jblogs@hotmail.com</email>
<address>
<line>1234 Green Lane</line>
<line>London</line>
<zip>SW1 7WJ</zip>
</address>
</person>
</addressList>
The first step is to tell XML Compare that child elements of <addressList>
are orderless elements. Or, to put it differently: to identify <addressList>
elements as orderless containers.
Example 2: Adding the orderless attribute using an XSLT template (part of address-key.xsl in the sample on Bitbucket: https://bitbucket.org/deltaxml/orderless-comparison )
<xsl:template match="addressList">
<xsl:copy>
<xsl:attribute name="deltaxml:ordered" select="'false'"/>
<xsl:apply-templates select="@*, node()"/>
</xsl:copy>
</xsl:template>
The attribute deltaxml:ordered="false" is the signal. This tells XML Compare that the order of child elements of <addressList>
can be ignored, i.e. they can appear in any order.
The second step is to provide XML Compare with a means of matching elements across the two input files. In our example the id attribute can serve as a matching key. Two elements, one from each input file, are considered to match if, and only if, their respective keys match. When the keys match XML Compare scans the element in question and records any changes found within its contents. Note how we record the key information using an attribute that is clearly flagged as being in the deltaxml namespace:
Example 3: Adding keys using an XSLT template (part of address-key.xsl in the sample on Bitbucket: https://bitbucket.org/deltaxml/orderless-comparison )
<xsl:template match="person">
<xsl:copy>
<xsl:attribute name="deltaxml:key" select="@customerid"/>
<xsl:apply-templates select="@*, node()"/>
</xsl:copy>
</xsl:template>
While it might seem more sensible to reference the id attribute than to duplicate its value, concrete reasons exist for such a protocol. Keys can exhibit much more complexity than shown here.
The final result after applying these templates together is the following file.
Example 4: the keyed, orderless address book
<addressList xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
deltaxml:ordered="false">
<person customerid="15" deltaxml:key="15">
<name>Joe Bloggs</name>
<email>jblogs@msn.com</email>
<address>
<line>1234 Green Lane</line>
<line>London</line>
<zip>SW1 7WJ</zip>
</address>
</person>
<person customerid="14" deltaxml:key="14">
<name>John Doe</name>
<email>jdoe@hotmail.com</email>
<address>
<line>1234 East Beverley Drive</line>
<line>Phoenix</line>
<zip>AZ 12345</zip>
</address>
</person>
<person customerid="62" deltaxml:key="62">
<name>Joe Bloggs</name>
<email>jblogs@hotmail.com</email>
<address>
<line>1234 Green Lane</line>
<line>London</line>
<zip>SW1 7WJ</zip>
</address>
</person>
</addressList>
Comparing orderless elements with keys
Ignoring data shuffling
XML Compare is now equipped to perform an orderless comparison. The test will involve a simple rearrangement of the <person>
elements:
Example 5: Rearranged address book (documentB.xml in the sample on Bitbucket: https://bitbucket.org/deltaxml/orderless-comparison )
<addressList>
<person customerid="15">
<name>Joe Bloggs</name>
<email>jblogs@msn.com</email>
<address>
<line>1234 Green Lane</line>
<line>London</line>
<zip>SW1 7WJ</zip>
</address>
</person>
<person customerid="62">
<name>Joe Bloggs</name>
<email>jblogs@hotmail.com</email>
<address>
<line>1234 Green Lane</line>
<line>London</line>
<zip>SW1 7WJ</zip>
</address>
</person>
<person customerid="14">
<name>John Doe</name>
<email>jdoe@hotmail.com</email>
<address>
<line>1234 East Beverley Drive</line>
<line>Phoenix</line>
<zip>AZ 12345</zip>
</address>
</person>
</addressList>
The comparison result shows no changes, as expected:
Example 6: Delta showing no differences between original and rearranged address books
<addressList xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
deltaxml:deltaV2="A=B"
deltaxml:ordered="false"
deltaxml:version="2.0"
deltaxml:content-type="full-context">
<person deltaxml:key="15" customerid="15">
<name>Joe Bloggs</name>
<email>jblogs@msn.com</email>
<address>
<line>1234 Green Lane</line>
<line>London</line>
<zip>SW1 7WJ</zip>
</address>
</person>
<person deltaxml:key="14" customerid="14">
<name>John Doe</name>
<email>jdoe@hotmail.com</email>
<address>
<line>1234 East Beverley Drive</line>
<line>Phoenix</line>
<zip>AZ 12345</zip>
</address>
</person>
<person deltaxml:key="62" customerid="62">
<name>Joe Bloggs</name>
<email>jblogs@hotmail.com</email>
<address>
<line>1234 Green Lane</line>
<line>London</line>
<zip>SW1 7WJ</zip>
</address>
</person>
</addressList>
Detecting changes after shuffling
The preceding example showed that XML Compare correctly ignores rearrangement of orderless sets. The next file introduces actual changes to address book:
Example 7: Address book with data change (documentC.xml in the sample on Bitbucket: https://bitbucket.org/deltaxml/orderless-comparison )
<addressList>
<person customerid="14">
<name>John Doe</name>
<email>jdoe@hotmail.com</email>
<address>
<line>1234 E. Beverley Drive</line>
<line>Phoenix</line>
<zip>AZ 12345 6789</zip>
</address>
</person>
<person customerid="15">
<name>Joe Bloggs</name>
<email>jblogs@hotmail.com</email>
<address>
<line>1234 Green Lane</line>
<line>London</line>
<zip>SW1 7WJ</zip>
</address>
</person>
</addressList>
Now XML Compare has something more to report, because embedded data has changed:
Example 8: Delta showing data change detected in the orderless set
<addressList xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
deltaxml:deltaV2="A!=B"
deltaxml:ordered="false"
deltaxml:version="2.0"
deltaxml:content-type="full-context">
<person deltaxml:deltaV2="A!=B" deltaxml:key="15" customerid="15">
<name deltaxml:deltaV2="A=B">Joe Bloggs</name>
<email deltaxml:deltaV2="A!=B">
<deltaxml:textGroup deltaxml:deltaV2="A!=B">
<deltaxml:text deltaxml:deltaV2="A">jblogs@msn.com</deltaxml:text>
<deltaxml:text deltaxml:deltaV2="B">jblogs@hotmail.com</deltaxml:text>
</deltaxml:textGroup>
</email>
<address deltaxml:deltaV2="A=B">
<line>1234 Green Lane</line>
<line>London</line>
<zip>SW1 7WJ</zip>
</address>
</person>
<person deltaxml:deltaV2="A!=B" customerid="14" deltaxml:key="14">
<name deltaxml:deltaV2="A=B">John Doe</name>
<email deltaxml:deltaV2="A=B">jdoe@hotmail.com</email>
<address deltaxml:deltaV2="A!=B">
<line deltaxml:deltaV2="A!=B">
<deltaxml:textGroup deltaxml:deltaV2="A!=B">
<deltaxml:text deltaxml:deltaV2="A">1234 East Beverley Drive</deltaxml:text>
<deltaxml:text deltaxml:deltaV2="B">1234 E. Beverley Drive</deltaxml:text>
</deltaxml:textGroup>
</line>
<line deltaxml:deltaV2="A=B">Phoenix</line>
<zip deltaxml:deltaV2="A!=B">
<deltaxml:textGroup deltaxml:deltaV2="A!=B">
<deltaxml:text deltaxml:deltaV2="A">AZ 12345</deltaxml:text>
<deltaxml:text deltaxml:deltaV2="B">AZ 12345 6789</deltaxml:text>
</deltaxml:textGroup>
</zip>
</address>
</person>
<person deltaxml:deltaV2="A" deltaxml:key="62" customerid="62">
<name>Joe Bloggs</name>
<email>jblogs@hotmail.com</email>
<address>
<line>1234 Green Lane</line>
<line>London</line>
<zip>SW1 7WJ</zip>
</address>
</person>
</addressList>
In this case there were changes to <person>
elements. XML Compare accurately detected and recorded the change even though the order of elements was different.
Rules for keys in orderless comparisons
The following rules apply to the use of key attributes in the orderless case:
Orderless containers must be assigned a
deltaxml:ordered="false"
attribute.XML elements are considered ordered by default (in the absence of this attribute on their parent).
Two elements that are identical in all respects except the
deltaxml:ordered
attribute are considered distinct and will not be compared with each other. This unusual event triggers a warning in XML Compare verbose mode.Orderless containers may not contain parsed character data.
Any element may be designated as an orderless container, regardless of the ordering status of its parent or child elements.
Rules for orderless elements lacking keys
Generally, elements that are members of an orderless set of elements should all have a key attribute. Orderless sets with a mixture of keyed and unkeyed elements may yield obscure results during comparisons.
Using no keys at all is preferable to a mixed case. Nevertheless, XML Compare's rules for the mixed case are simple:
Elements of the same type with the same key are matched, and any changes within them are shown.
If the comparison file contains an identical element (perfect match, but neither element has a key attribute) XML Compare assumes these are actually the same. No change is recorded.
If no identical element exists, the keyless element is considered a delta change. It is recorded as an addition or deletion.
Anchoring position of one element in an orderless set
Orderless sets sometimes incorporate a header element, such as <setheader>
below:
Example 9: Orderless set with header
<MyDataItems xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
deltaxml:ordered="false">
<setheader>
<date>20 January 2003</date>
<owner>MegaCorp</owner>
</setheader>
<item> Some data </item>
<item> More data </item>
<item> Extra data </item>
</MyDataItems>
Technically <setheader>
is a member of the orderless set <MyDataItems>
. Yet unlike the <item>
members, <setheader>
will not move around. It has a special status within the orderless set. So we would like XML Compare to match it across revisions, not treating changes to this one element in the same way as changes to other members. Therefore we assign a deltaxml:key attribute to the <setheader> element:
Example 10: Orderless set with keyed header
<MyDataItems xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
deltaxml:ordered="false">
<setheader deltaxml:key="headerkey">
<date>20 January 2003</date>
<owner>MegaCorp</owner>
</setheader>
<item> Some data </item>
<item> More data </item>
<item> Extra data </item>
</MyDataItems>
Of course, a better design would employ keys for all elements:
Example 11: Orderless set with keyed header and keyed elements
<MyDataItems xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
deltaxml:ordered="false">
<setheader deltaxml:key="headerkey">
<date>20 January 2003</date>
<owner>MegaCorp</owner>
</setheader>
<item deltaxml:key="q43"> Some data </item>
<item deltaxml:key="342"> More data </item>
<item deltaxml:key="09r"> Extra data </item>
</MyDataItems>
Ideas for designing keys
The key to an element can take many forms, including:
the name of the element itself, if it is know that this element only appears once within its parent
a child element
part of a child element
a combination of attributes
a combination of attributes and child elements
a random number such as a GUID
arbitrary data from external sources (other files, databases, user input, etc).
The choices are almost unlimited. Moreover, an element class may apply different key formulations to different instances of that class. For example, some instances could use child data while others use attribute data.
Key generation has to be equally flexible. One possibility is to manufacture keys as part of the XML file. In this case they are stored as a permanent component of the XML data. However this technique leaves behind a footprint that may be inappropriate. It is often better to factor XML Compare-specific data out. This factoring is possible by creating DeltaXML keys on a temporary, as-needed basis. A simple XSL script can generate keys just prior to XML Compare processing, while another strips them upon completion. A separate paper on the use of XSL Filters details this technique.
We refer to key assignment based on element data as data-driven. For example, consider the bare address list without any identifiers or keys:
Example 12: Address book with no "id" attributes or keys
<addressList>
<person>
<firstName>John</firstName>
<lastName>Smith</lastName>
<address>
<line>26 High Street</line>
<line>Newtown</line>
<line>Malvern</line>
<zip>WR13 0JK</zip>
</address>
</person>
<person>
<firstName>Adam</firstName>
<lastName>Smith</lastName>
<address>
<line>26 High Street</line>
<line>Newtown</line>
<line>Malvern</line>
<zip>WR13 0JK</zip>
</address>
</person>
<person>
<firstName>David</firstName>
<lastName>Jones</lastName>
<address>
<line>12 New Street</line>
<line>Newtown</line>
<line>Malvern</line>
<zip>WR13 0PL</zip>
</address>
</person>
<person>
<firstName>Mark</firstName>
<lastName>Lane</lastName>
<address>
<line>99 Bow Street</line>
<line>Newtown</line>
<line>Malvern</line>
<zip>WR13 0LL</zip>
</address>
</person>
</addressList>
We expect the combination of firstName, lastName, and postCode to form a unique key. Such a key can be easily generated by XSL using XPath expressions. The results might look as follows:
Example 13: Data-driven keys added to address book elements
<addressList xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
deltaxml:ordered="false">
<person deltaxml:key="John Smith WR13 0JK">
<firstName>John</firstName>
<lastName>Smith</lastName>
<address>
<line>26 High Street</line>
<line>Newtown</line>
<line>Malvern</line>
<zip>WR13 0JK</zip>
</address>
</person>
<person deltaxml:key="Adam Smith WR13 0JK">
<firstName>Adam</firstName>
<lastName>Smith</lastName>
<address>
<line>26 High Street</line>
<line>Newtown</line>
<line>Malvern</line>
<zip>WR13 0JK</zip>
</address>
</person>
<person deltaxml:key="David Jones WR13 0PL">
<firstName>David</firstName>
<lastName>Jones</lastName>
<address>
<line>12 New Street</line>
<line>Newtown</line>
<line>Malvern</line>
<zip>WR13 0PL</zip>
</address>
</person>
<person deltaxml:key="Mark Lane WR13 0LL">
<firstName>Mark</firstName>
<lastName>Lane</lastName>
<address>
<line>99 Bow Street</line>
<line>Newtown</line>
<line>Malvern</line>
<zip>WR13 0LL</zip>
</address>
</person>
</addressList>
This new design is more natural than id attributes because the keys derive from actual data. Furthermore the filter supports key generation on the fly, leaving permanent data untouched by DeltaXML markup. The XSL definition for this is presented below:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
version="2.0">
<xsl:output method="xml" indent="no" />
<xsl:template match="@*, node()">
<xsl:copy>
<xsl:apply-templates select="@*,node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="person">
<xsl:copy>
<xsl:attribute name="deltaxml:key"
select="concat(firstName/text(), ' ', lastName/text(), ' ', address/zip/text())"/>
<xsl:apply-templates select="@*,node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="addressList">
<xsl:copy>
<xsl:attribute name="deltaxml:ordered" select="'false'"/>
<xsl:apply-templates select="@*,node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
In the above example, the deltaxml:ordered="false"
attribute is added to the addressList element as specified in the template's match attribute. If you wanted it to apply to a range of different elements the XSL match attribute could be extended to include further XPaths, for example
<xsl:template match="addressList | elem1 | elem2/elem3">
<xsl:copy>
<xsl:attribute name="deltaxml:ordered" select="'false'"/>
<xsl:apply-templates select="@*,node()"/>
</xsl:copy>
</xsl:template>
Here the attribute would be added to all <addressList>
elements, all <elem1>
elements, and <elem3>
elements whose immediate parents where <elem2>
.
It should be noted that we only suggest adding the deltaxml:ordered="false"
element to data where order really does not matter. If you have an element whose sub-elements are a mixture of ordered and orderless elements and wish to manage these then please see our "how to" reference on "How to configure XML Compare to handle mixed ordered and orderless data within a single element".
Nesting orderless sets
Orderless containers may nest, but child elements do not inherit the orderless quality automatically. It must be declared explicitly. For example, each person in the address list might have several phone numbers, so <phoneNumbers>
needs to be configured as orderless:
Example 14: Nested orderless sets
<addressList xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1" deltaxml:ordered="false">
<person>
<firstName>John</firstName>
<lastName>Smith</lastName>
<address>
<line>26 High Street</line>
<line>Newtown</line>
<line>Malvern</line>
<zip>WR13 0JK</zip>
</address>
<phoneNumbers deltaxml:ordered="false">
<phone>1 234</phone>
<phone>2 345</phone>
<phone>3 456</phone>
</phoneNumbers>
</person>
<person>
<firstName>David</firstName>
<lastName>Jones</lastName>
<address>
<line>12 New Street</line>
<line>Newtown</line>
<line>Malvern</line>
<zip>WR13 0PL</zip>
</address>
<phoneNumbers deltaxml:ordered="false">
<phone>4 567</phone>
<phone>5 678</phone>
<phone>6 789</phone>
</phoneNumbers>
</person>
</addressList>
Comparison works the same no matter how deeply nested the elements. The rules for keys are the same across all levels.
Should everything be treated as orderless?
The default XML behaviour is that element ordering is significant. The previous example data, for example, the order of the <line>
elements in the <address>
is significant. Print them out in the wrong order and your postman, or rather the automated post-office sorting machines, may have a few problems! Thus by default you should not try to treat XML data as orderless.
Another example, imagine a simple XML vector graphics format:
<line>
<point x="0" y="0"/>
<point x="1" y="1"/>
</line>
<closedPolygon>
<point x="0" y="0"/>
<point x="1" y="0"/>
<point x="1" y="1"/>
<point x="0.5" y="1.5"/>
<point x="0" y="1"/>
</closedPolygon>
Now in the case of a line, perhaps it would be fine to add deltaxml:ordered="false"
to the <line>
element. It doesn't matter which end you start at the resulting lines are the same and so the comparator should perhaps treat them so. But the <closedPolygon>
is a different matter, reorder the points and you end up with different shapes. These should not be compared as equal.
So what approach should you take? We believe that understanding the semantics of the data is important. In some cases good 'clues' are present in DTDs, XMLSchema or RNG grammars for the data, for example the lack of a * or + repetition operator is a good indication not to add orderless. But there are cases where human expertise is needed to complement the grammars.
Orderless Presentation
When presenting the results of an orderless comparison, we have choices about whether to favour the order of elements in the A document or the B document, and also whether 'adds' should be shown before or after 'deletes'. The OrderlessPresentation
comparator property is used to control this presentation order, this is illustrated in the following extract from the sample DXP:
<comparatorProperties>
<property name="http://deltaxml.com/api/property/orderlessPresentation"
literalValue="a_matches_deletes_adds"/>
</comparatorProperties>
The DocumentComparator allows control of the OrderlessPresentation property through the OutputFormatConfiguration
class, as shown in the following extract from the Java code sample:
DocumentComparator dc= new DocumentComparator();
...
dc.getOutputFormatConfiguration().setOrderlessPresentationMode(
OrderlessPresentationMode.B_DELETES);
Orderless Comparison - with Document Comparator
The Pipelined Comparator works well with the xml data as shown in the above examples. The Document Comparator is an extension to the Pipelined Comparator designed specifically for comparing documents. For example, the html documents having the meta sections. Sample inputs are shown below.
Document 1: HTML document
<html>
<head>
<title>Meta Tags Example</title>
<meta name="keywords" content="Keyword1, Keyword2"/>
<meta name="description" content="Description of the webpage"/>
<meta name="revised" content="3/7/2014"/>
</head>
<body>
<p>Hello</p>
</body>
</html>
Document 2: HTML document
<html>
<head>
<title>Meta Tags Example</title>
<meta name="keywords" content="Keyword1, Keyword2, Keyword3"/>
<meta name="description" content="Information of the webpage"/>
<meta name="revised" content="3/3/2015"/>
</head>
<body>
<p>Hello</p>
</body>
</html>
This sample contains the file OrderlessComparisonSample.java
or OrderlessComparisonSample.cs (depending on which XML Compare variant is in use). This contains the source code to handle the orderless comparison using the Document Comparator. It uses the input filter (meta-section-key.xsl) which adds keys to meta elements and a wrapper around meta sections.
Data-driven keys added to meta elements with a wrapper around meta sections having deltaxml:ordered attribute
<html>
<head>
<title>Meta Tags Example </title>
<meta-wrapper xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
deltaxml:ordered="false">
<meta deltaxml:key="keywords" name="keywords" content="Keyword1, Keyword2 " />
<meta deltaxml:key="description" name="description" content="Description of the webpage" />
<meta deltaxml:key="revised" name="revised" content="3/7/2014" />
</meta-wrapper>
</head>
<body>
<p>Hello</p>
</body>
</html>
Running the sample
The sample code illustrates two methods of performing orderless element comparison. The first method uses the Pipelined Comparator, and specifies input and output filters with a DXP file. The second method uses the API exposed by the Document Comparator.
Using Java
Download the sample from https://bitbucket.org/deltaxml/orderless-comparison
Details of how to run the sample are given in README.md.