XML is often used to represent engineering, scientific or financial data, where floating point numbers are common. Software that handles floating point numbers usually compares them using tolerances, and this article describes techniques for doing so in conjunction with XML Compare.
The comparator processes well-formed XML, which in turn represents numbers as text. It performs textual comparison of PCDATA and therefore will only report numbers as equal if they have the same lexical representation. If different processors or different serialization software are used to generate the XML data being compared, it is even possible that the 'same' numbers will have different lexical representations (think of '1.0' and '1.00') and therefore be reported as differing. The W3C XML Schema Datatypes, also supported as part of XSLT 2.0, provide facilities for converting, reading and writing floating point numbers. Rather than build complicated datatype facilities and associated mechanics into the comparison engine, we recommend the use of XSLT 2.0 for post-processing delta output to resolve these issues with floating point numbers and their tolerances.
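The difference between lexical and numeric equality can be seen directly in XPath 2.0. The following minimal stylesheet is a sketch for illustration only and is not part of the sample; the named template 'main' is an arbitrary choice. It compares '1.0' and '1.00' first as strings and then as xs:double values.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:output method="text"/>

  <!-- Invoke with 'main' as the initial template; no source document is needed. -->
  <xsl:template name="main">
    <!-- String comparison: the lexical forms differ, so this yields false. -->
    <xsl:value-of select="'1.0' eq '1.00'"/>
    <xsl:text>&#10;</xsl:text>
    <!-- Numeric comparison: both strings cast to the same xs:double, so this yields true. -->
    <xsl:value-of select="xs:double('1.0') eq xs:double('1.00')"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

A purely textual comparison behaves like the first expression, which is why numerically identical values can be reported as changed.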
This article will use a worked example to explain some possible techniques.
3. Example Data
The above example is designed to show numeric values used in element content and in attributes. There are some differences in handling and so we'll discuss elements and attributes separately.
4. Element tolerances
When these are compared using the comparator some of these changes are represented in deltaV2 as follows; here is part of the file corresponding to a change in the 'Salisbury' record element containing a floating point number:
The numbers represented in the deltaV2 representation of this change are fairly easy to process with XSLT 2.0. Here is a function that we will use later (defined in tolerance-checker.xsl in the sample on Bitbucket):
This function, when applied to the temperature element (the first parameter), reports whether the values are within the tolerance (the second parameter). Given this function we can then use an XPath match expression wherever we know floating point numbers will be used, for example:
We could implement a template which removed the extra change information at this point and replaced one of the values. However certain output filters which we could use to further process the result expect to deal with well-formed deltas (they assume deltaV2 attributes are accurate for any subtree). The generic ignore changes mechanism is designed to deal with these issues so it makes sense to utilize it. So all that we will do when we detect a change within tolerance is to add an ignore change attribute, using a template based on the identity template (defined in tolerance-checker.xsl in the sample on Bitbucket):
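For readers without the sample to hand, the following is a minimal sketch of how such a function and marking template could fit together. It is not the code from tolerance-checker.xsl. The function name ex:within-tolerance and its namespace are hypothetical, the value 'true' for the ignore changes attribute is an assumption, and the sketch assumes that a changed text value is held in deltaxml:textGroup/deltaxml:text children labelled 'A' and 'B'; please check these details against the deltaV2 documentation.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
    xmlns:ex="http://example.com/tolerance-sketch"
    exclude-result-prefixes="xs ex">

  <!-- Hypothetical helper: true when the A and B text values of a changed
       element are both numeric and differ by no more than $tolerance. -->
  <xsl:function name="ex:within-tolerance" as="xs:boolean">
    <xsl:param name="changed" as="element()"/>
    <xsl:param name="tolerance" as="xs:double"/>
    <xsl:variable name="a"
        select="$changed/deltaxml:textGroup/deltaxml:text[@deltaxml:deltaV2 = 'A']"/>
    <xsl:variable name="b"
        select="$changed/deltaxml:textGroup/deltaxml:text[@deltaxml:deltaV2 = 'B']"/>
    <!-- The if/then/else guards the casts, so non-numeric content yields false. -->
    <xsl:sequence select="
        if (count($a) eq 1 and count($b) eq 1
            and $a castable as xs:double and $b castable as xs:double)
        then abs(xs:double($a) - xs:double($b)) le $tolerance
        else false()"/>
  </xsl:function>

  <!-- Identity template: copy everything else unchanged. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Mark, rather than rewrite, a temperature change that is within tolerance. -->
  <xsl:template match="temperature[ex:within-tolerance(., 0.1)]">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <!-- Assumed marker value; later filters act on this attribute. -->
      <xsl:attribute name="deltaxml:ignore-changes" select="'true'"/>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

The sketch only adds an attribute and leaves the delta structure untouched, which matches the intention of deferring the actual rewriting to the ignore changes filters.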
After applying our tolerance detection filter our Salisbury temperature record becomes:
The tolerance detection filter is equivalent to the mark changes filter in our standard ignore changes process. The next filter to apply is apply-ignore-changes.xsl. This will convert the above record to:
This is almost correct; however, notice that the deltaV2 attribute on the record element incorrectly reports a change when both child elements are now unchanged. The propagate-ignore-changes.xsl filter is finally used to correct this problem:
5. Using XPaths to identify the numeric values
In the above example we used a template which matched all temperature elements, assuming they would contain a numeric value. However, more explicit XPaths could also be used, and a single template could handle multiple numeric elements. Here are some examples which partially illustrate the power of XPath:
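As an illustration of these options, the sketch below combines an explicit path with a pattern that covers several numeric elements in one template. The element names pressure and humidity, the tolerance values, the ex: namespace and the 'true' marker value are all hypothetical, and ex:within-tolerance is a compact stand-in for the tolerance function, not the sample's implementation. It assumes a changed value appears as two deltaxml:text elements inside a deltaxml:textGroup.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
    xmlns:ex="http://example.com/tolerance-sketch"
    exclude-result-prefixes="xs ex">

  <!-- Compact stand-in: number() yields NaN for missing or non-numeric text,
       and any comparison involving NaN is false. -->
  <xsl:function name="ex:within-tolerance" as="xs:boolean">
    <xsl:param name="e" as="element()"/>
    <xsl:param name="tol" as="xs:double"/>
    <xsl:variable name="t" select="$e/deltaxml:textGroup/deltaxml:text"/>
    <xsl:sequence select="count($t) eq 2
                          and abs(number($t[1]) - number($t[2])) le $tol"/>
  </xsl:function>

  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- First alternative: an explicit path to one kind of value.
       Second alternative: several numeric children handled together. -->
  <xsl:template match="
      weather/record/temperature[ex:within-tolerance(., 0.1)]
      | record/*[self::pressure or self::humidity][ex:within-tolerance(., 0.5)]">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:attribute name="deltaxml:ignore-changes" select="'true'"/>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Note that XSLT 2.0 match patterns do not allow a parenthesized union as a path step, which is why the second alternative uses a predicate on the self axis.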
6. Attribute Tolerances
The representation of attribute change in deltaV2 is more complicated than that for element content, shown above. Here is how the 'time' attribute used in the example above is represented:
In the input data the attribute would have an XPath of /weather/@time; however, when this is represented in deltaV2 the XPath becomes /weather/deltaxml:attributes/dxa:time. The reasons for this change are covered in the deltaV2 documentation, but arise from ease of XSLT processing and from differences in the XML namespace inheritance rules for attributes and elements, which require the use of various namespaces. Therefore, to identify and process the attribute change, the following template is used (defined in tolerance-checker.xsl in the sample on Bitbucket):
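A minimal sketch of such a template is given here for orientation; it is not the sample's code. It assumes that the two attribute values appear as deltaxml:attributeValue children labelled 'A' and 'B', that the dxa prefix is bound to the namespace URI shown, and that the ignore changes marker belongs on the dxa:time element with the value 'true'. All of these are assumptions to verify against the deltaV2 documentation, and the tolerance of 0.5 is arbitrary.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
    xmlns:dxa="http://www.deltaxml.com/ns/non-namespaced-attribute">

  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Match the element that stands in for the changed 'time' attribute
       when its A and B values differ by no more than 0.5.
       number() yields NaN for non-numeric values, so the test is then false. -->
  <xsl:template match="weather/deltaxml:attributes/dxa:time[
      abs(number(deltaxml:attributeValue[@deltaxml:deltaV2 = 'A'])
          - number(deltaxml:attributeValue[@deltaxml:deltaV2 = 'B'])) le 0.5]">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <!-- Assumed marker position and value. -->
      <xsl:attribute name="deltaxml:ignore-changes" select="'true'"/>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```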
7. Specifying Tolerances
There are a number of ways in which the tolerances may be specified. Here are some suggestions:
7.1. Fixed values in XSLT
When using the tolerance checking functions it is possible to specify a fixed parameter value, for example:
7.2. Parameterized XSLT
Rather than fixing the value, it can be passed into the filter using a parameter, for example:
The parameter has a default value; alternatively, a value could be passed in from the invoking code, which could include DXP parameters.
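A sketch of the parameterized form might look like the following. The parameter name tolerance, its default of 0.1 and the 'true' marker value are assumptions made for illustration, and the inline test assumes the deltaV2 textGroup layout for changed text. XSLT 2.0 permits references to global parameters inside match patterns, which is what makes this form possible.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
    exclude-result-prefixes="xs">

  <!-- Global parameter with a default; the invoking code may override it. -->
  <xsl:param name="tolerance" as="xs:double" select="0.1"/>

  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Replacing $tolerance with a literal such as 0.1 gives the fixed-value form. -->
  <xsl:template match="temperature[
      abs(number(deltaxml:textGroup/deltaxml:text[@deltaxml:deltaV2 = 'A'])
          - number(deltaxml:textGroup/deltaxml:text[@deltaxml:deltaV2 = 'B']))
      le $tolerance]">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:attribute name="deltaxml:ignore-changes" select="'true'"/>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```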
7.3. Annotated instance data
It may be possible to annotate your data with the tolerances, for example:
The match statement would then become:
It may even be possible to use the attribute to identify toleranced numeric data:
However, please remember that there are two comparator inputs, so you will need either to ensure that the tolerance attributes in both are identical or to deal with possible changes to them.
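The annotated approach can be sketched as below. The attribute name tolerance and the 'true' marker value are hypothetical, and the sketch assumes that an attribute which is identical in both inputs remains an ordinary attribute in the delta, while a changed one is moved under deltaxml:attributes. Under that assumption a mismatched tolerance attribute simply fails to match, so the change is reported rather than ignored.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1">

  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Any element carrying a tolerance attribute is treated as numeric data;
       the attribute value supplies the tolerance for that element.
       A non-numeric tolerance gives NaN, so the comparison is false. -->
  <xsl:template match="*[@tolerance][
      abs(number(deltaxml:textGroup/deltaxml:text[@deltaxml:deltaV2 = 'A'])
          - number(deltaxml:textGroup/deltaxml:text[@deltaxml:deltaV2 = 'B']))
      le number(@tolerance)]">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:attribute name="deltaxml:ignore-changes" select="'true'"/>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```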
7.4. Implicit annotation via DTDs and schemas
One way of avoiding the problem of mismatched tolerance attributes would be to include them as default and/or fixed attributes in a DTD or schema, for example:
If you wish to handle toleranced numeric data we suggest using this approach:
- use the tolerance-checker.xsl filter with the predefined xsl:functions for element/attribute tolerances
- apply these functions using XPaths to where you use numeric data
- the functions add deltaxml:ignore-changes attributes at appropriate places to the data
- the final two filter stages of the general ignore changes process then process these attributes
Also included in the sample on Bitbucket is a filter for applying tolerance checking to all numeric text items and attributes (generic-tolerance-checker.xsl). This can be adapted as necessary to suit your needs.
- The code and examples assume that the elements and attributes contain exactly one numeric value. Unfortunately DTDs are not type aware and cannot enforce such constraints (but W3C XML Schema and RelaxNG can do so). If you are unsure of the numeric values we would recommend schema checking and if that is not possible consider adding more error checking to the XSLT filter code as appropriate for the data.
- Word-by-word filtering and certain punctuation characters typically found in numeric values should not be used in conjunction with this code. If you need to apply word-by-word filtering to other parts of the data, please use the appropriate filter control attributes to ensure that numeric values are not processed.
- We have used very simple example tolerances in this article; real tolerances for floating point numbers are more complex. An ideal tolerance is not a fixed value but depends on the magnitude of the numbers involved. Discussion of this in an XML-specific context is limited; however, the following article, while aimed at Java programmers, discusses ULP (Units of Least Precision) in detail and is also applicable to XML: IBM developerWorks, Java's new math, Part 2: Floating-point numbers. If java.lang.Math.ulp() is available on your platform, we would suggest using it via an XSLT extension function as a basis for tolerance values.
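The ULP suggestion in the last note could be sketched as follows. This assumes an XSLT processor that binds Java static methods through a namespace URI of the form java:java.lang.Math (some Saxon editions support this; check your processor's documentation). The function name ex:within-ulps and its namespace are hypothetical, and the number of ULPs to allow is left to the caller.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:jmath="java:java.lang.Math"
    xmlns:ex="http://example.com/tolerance-sketch"
    exclude-result-prefixes="xs jmath ex">

  <!-- Hypothetical helper: true when $a and $b differ by at most $ulps
       units in the last place of the larger magnitude. -->
  <xsl:function name="ex:within-ulps" as="xs:boolean">
    <xsl:param name="a" as="xs:double"/>
    <xsl:param name="b" as="xs:double"/>
    <xsl:param name="ulps" as="xs:integer"/>
    <xsl:variable name="larger" as="xs:double"
        select="max((abs($a), abs($b)))"/>
    <!-- jmath:ulp is assumed to resolve to java.lang.Math.ulp(double). -->
    <xsl:sequence select="abs($a - $b) le $ulps * jmath:ulp($larger)"/>
  </xsl:function>

</xsl:stylesheet>
```

Because the allowed difference scales with the magnitude of the operands, the same number of ULPs gives a wider absolute tolerance for large values than for small ones, which is the behaviour the note above describes.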
10. Running the sample
The resources should be checked-out, cloned or downloaded and unzipped into the samples directory of the XML Compare release. They should be located such that they are two levels below the top level release directory, for example