Attribute Splitting

1. Introduction

XML Data Compare can split attribute values so that, when an attribute value differs between two XML files, the delta shows which parts of the value changed rather than only the complete old and new values. The 2.0 release introduced a configuration option for this. It can be applied to specific parts of a document using XPath expressions, or to an entire XML file. For the resources associated with this sample, see the Bitbucket sample.

2. Defining Attribute Splitting

By default, when a configuration file is specified, the attribute-splitting option is switched on, meaning that single-word changes are shown in the delta. Each part of an attribute value that differs is wrapped in a text group (deltaxml:textGroup), inside which the differing text is marked as belonging to either "A" or "B".

To switch attribute-splitting off and see the whole attribute value against either "A" or "B", change the default behaviour for the whole comparison by setting it to false, as shown in config-attributes-splitting.xml:

  <dcf:defaults>
    <dcf:attribute-splitting enabled="false"/>
  </dcf:defaults>


Alternatively, you can switch attribute-splitting off by default and then switch it on for particular locations. The following configuration fragment enables narrative-text splitting for the @description attribute of the persons element, which is selected with an XPath expression:

  <dcf:location name="attribute-splitting" xpath="/persons/@description">
    <dcf:attribute-splitting enabled="true">
      <dcf:narrative-text/>
    </dcf:attribute-splitting>
  </dcf:location>


3. Change representations

The ‘attribute-splitting’ configuration controls the granularity and method used for comparing and describing differences inside attribute values. The different modes used for splitting attributes are described below:

3.1. Whole String

In this example, attribute-splitting has been switched off for the 'company' attribute. As a result, XML Data Compare treats each attribute value as a single block, and the delta shows the whole value from each input file.

<dxa:company deltaxml:deltaV2="A!=B">
	<deltaxml:attributeValue deltaxml:deltaV2="A">ABC Ltd</deltaxml:attributeValue>
	<deltaxml:attributeValue deltaxml:deltaV2="B">XYZ Limited</deltaxml:attributeValue>
</dxa:company>


3.2. Narrative text

The default mode for splitting attribute values is ‘narrative-text’. This mode uses the ICU4J library to break plain text into words, using language-specific information on punctuation, number formats and whitespace separators.

  <dcf:location name="attribute-splitting" xpath="/persons/@description">
    <dcf:attribute-splitting enabled="true">
      <dcf:narrative-text/>
    </dcf:attribute-splitting>
  </dcf:location>

  <dxa:description deltaxml:deltaV2="A!=B">
    <deltaxml:attributeValueWords deltaxml:deltaV2="A!=B">
      data list of 
      <deltaxml:textGroup deltaxml:deltaV2="A!=B">
        <deltaxml:text deltaxml:deltaV2="A">3</deltaxml:text>
        <deltaxml:text deltaxml:deltaV2="B">4</deltaxml:text>
      </deltaxml:textGroup>
      persons
    </deltaxml:attributeValueWords>
  </dxa:description>

3.3. Token List

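This mode splits the attribute value into an ordered list of tokens. It is configured with dcf:data-list, and the tokenisation method is specified using either a separator or a regular expression (see section 4). The output fragment below is illustrative: the City token is marked "A" because it appears only in input A, unlike the example data in section 5, where the address value is the same in both inputs.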

  <dcf:location name="attribute-splitting address" xpath="/persons/@address">
    <dcf:attribute-splitting enabled="true">
      <dcf:data-list separator="," output-token-separator=","/>
    </dcf:attribute-splitting>
  </dcf:location>

<dxa:address deltaxml:deltaV2="A!=B">
	<deltaxml:attributeTokenList deltaxml:deltaV2="A!=B">
		<deltaxml:token deltaxml:deltaV2="A=B">UnitNo</deltaxml:token>
		<deltaxml:token deltaxml:deltaV2="A=B">Street</deltaxml:token>
		<deltaxml:token deltaxml:deltaV2="A=B">Address</deltaxml:token>
		<deltaxml:token deltaxml:deltaV2="A">City</deltaxml:token>
		<deltaxml:token deltaxml:deltaV2="A=B">Country</deltaxml:token>
	</deltaxml:attributeTokenList>
</dxa:address>
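
The order-sensitive comparison of a token list can be pictured with a short Python sketch. This is not XML Data Compare's algorithm: Python's difflib.SequenceMatcher is used as a stand-in for the comparator, and the function name compare_token_list is hypothetical. The B value assumes an input in which City has been removed, as in the fragment above.

```python
from difflib import SequenceMatcher


def compare_token_list(value_a, value_b, separator=","):
    """Return (token, marker) pairs for an order-sensitive token comparison.

    Illustration only: SequenceMatcher stands in for the real comparator.
    """
    tokens_a = value_a.split(separator)
    tokens_b = value_b.split(separator)
    result = []
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag == "equal":
            result += [(token, "A=B") for token in tokens_a[a1:a2]]
        else:
            # 'delete', 'insert' and 'replace' reduce to A-only and/or B-only tokens.
            result += [(token, "A") for token in tokens_a[a1:a2]]
            result += [(token, "B") for token in tokens_b[b1:b2]]
    return result


if __name__ == "__main__":
    pairs = compare_token_list(
        "UnitNo,Street,Address,City,Country",
        "UnitNo,Street,Address,Country",
    )
    for token, marker in pairs:
        print(f"{marker:4} {token}")
```

Each pair corresponds to one deltaxml:token element and its deltaxml:deltaV2 value in the output.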

3.4. Token Set

This mode splits the attribute value into a set of unique items. Each item is output inside a deltaxml:token element, and its original position in input A and in input B is stored in the deltaxml:original-position attribute. It is configured with dcf:data-set, and the tokenisation method is specified using either a separator or a regular expression (see section 4).

  <dcf:location name="attribute-splitting" xpath="/persons/@names">
    <dcf:attribute-splitting enabled="true">
       <dcf:data-set separator="," output-token-separator=","/>
    </dcf:attribute-splitting>
  </dcf:location>

<dxa:names deltaxml:deltaV2="A!=B">
	<deltaxml:attributeTokenSet deltaxml:deltaV2="A!=B">
		<deltaxml:token deltaxml:deltaV2="A=B" deltaxml:original-position="A=1, B=1">anna</deltaxml:token>
		<deltaxml:token deltaxml:deltaV2="B" deltaxml:original-position="B=2">ben</deltaxml:token>
		<deltaxml:token deltaxml:deltaV2="A=B" deltaxml:original-position="A=2, B=3">chris</deltaxml:token>
		<deltaxml:token deltaxml:deltaV2="A=B" deltaxml:original-position="A=3, B=4">david</deltaxml:token>
	</deltaxml:attributeTokenSet>
</dxa:names>
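
As a rough illustration of how the original-position values relate to the inputs, the following Python sketch compares the two names values from section 5 as sets. It is an assumption-laden sketch rather than the product's implementation: the function name compare_token_set is hypothetical, duplicates are resolved to their first occurrence, and the result is sorted only to make the output deterministic.

```python
def compare_token_set(value_a, value_b, separator=","):
    """Return (token, marker, original_position) triples for a set comparison.

    Illustration only; duplicate handling and output order are assumptions.
    """
    # Record the 1-based position of the first occurrence of each token.
    positions_a, positions_b = {}, {}
    for index, token in enumerate(value_a.split(separator), start=1):
        positions_a.setdefault(token, index)
    for index, token in enumerate(value_b.split(separator), start=1):
        positions_b.setdefault(token, index)

    result = []
    for token in sorted(set(positions_a) | set(positions_b)):
        if token in positions_a and token in positions_b:
            position = f"A={positions_a[token]}, B={positions_b[token]}"
            result.append((token, "A=B", position))
        elif token in positions_a:
            result.append((token, "A", f"A={positions_a[token]}"))
        else:
            result.append((token, "B", f"B={positions_b[token]}"))
    return result


if __name__ == "__main__":
    for row in compare_token_set("anna,chris,david", "anna,ben,chris,david"):
        print(row)
```

In this sketch a token that only moves, such as chris going from position 2 to position 3, is still reported as "A=B"; only its recorded positions differ.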

4. Tokenisation using Regex and Separator

4.1. Regex

This mode uses the regular expression specified in the 'regex' configuration attribute to tokenise the attribute value. The regular expression syntax is the same as that used by the XPath tokenize function. In this mode, if no 'output-token-separator' is specified in the configuration, the ',' (comma) character is used.
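
The effect of regex tokenisation can be approximated in Python. This is only a sketch: the function name tokenise_with_regex is hypothetical, and Python's re syntax is similar to, but not the same as, the XPath regular expression syntax that the product uses. For example, re.split also returns the text of capturing groups, which the XPath tokenize function does not, so the pattern below avoids them.

```python
import re


def tokenise_with_regex(value, regex, output_separator=","):
    """Split value wherever regex matches and re-join the tokens for output.

    Illustration only: Python's re module stands in for XPath regular expressions.
    The output_separator default of ',' mirrors the default described for this mode.
    """
    tokens = re.split(regex, value)
    return tokens, output_separator.join(tokens)


if __name__ == "__main__":
    # Commas or semicolons, with optional surrounding whitespace, act as delimiters.
    tokens, joined = tokenise_with_regex("anna ,  chris;david", r"\s*[,;]\s*")
    print(tokens)
    print(joined)
```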

4.2. Separator

This mode uses a set of one or more separator characters specified in the 'separator' configuration attribute. The attribute value is split wherever one of these characters occurs. In this mode, if no 'output-token-separator' is specified in the configuration, the first character of the 'separator' attribute is used.
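
A minimal Python sketch of separator-based tokenisation follows, including the fallback to the first separator character for output. The function name tokenise_with_separators is hypothetical, and keeping empty tokens between adjacent separators is an assumption of this sketch, not documented behaviour.

```python
def tokenise_with_separators(value, separators, output_separator=None):
    """Split value at every character found in separators.

    Illustration only. When output_separator is not given, the first
    separator character is used to re-join the tokens.
    """
    tokens, current = [], []
    for char in value:
        if char in separators:
            # A separator ends the current token (which may be empty).
            tokens.append("".join(current))
            current = []
        else:
            current.append(char)
    tokens.append("".join(current))

    if output_separator is None:
        output_separator = separators[0]
    return tokens, output_separator.join(tokens)


if __name__ == "__main__":
    # Both ',' and ';' are separators; ',' comes first, so it is used for output.
    print(tokenise_with_separators("UnitNo;Street,Address", ",;"))
```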

5. Example data

Given the following input files, which store personal data as attribute values, XML Data Compare can represent the changes in four different ways.

Input A
<persons id="id" 
         company="ABC Ltd" 
         description="data list of 3 persons" 
         names="anna,chris,david"
         address="UnitNo,Street,Address,City,Country"/>

Input B
<persons id="id" 
         company="XYZ Limited" 
         description="data list of 4 persons" 
         names="anna,ben,chris,david"
         address="UnitNo,Street,Address,City,Country"/>


The different representations are selected using the following configuration file, in which description, names and address use narrative-text, data-set and data-list, respectively. The company attribute has no location of its own, so it follows the default (attribute-splitting switched off) and is compared as a whole string. The separator attribute of data-list and data-set tells the comparator which delimiter to use when tokenising attribute values. The output-token-separator attribute specifies the string used to separate tokens in the output; if it is omitted, the defaults described in section 4 apply, which for this configuration is a ',' (comma) character.

Configuration File
<dcf:configuration 
  xmlns:dcf="com.deltaxml.data.config"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"  
  xmlns="http://schemas.dell.com/dbi/fsl/compliance/v1_0"
  xsi:schemaLocation="com.deltaxml.data.config ../../../../main/resources/xsd/config.xsd"
  xmlns:ignore="http://www.deltaxml.com/ns/ignoreForAlignment"
  version="1.0"
  id="ic2">
  
  <dcf:defaults>
    <dcf:attribute-splitting enabled="false"/>
  </dcf:defaults>
  
  <dcf:location name="attribute-splitting" xpath="/persons/@description">
    <dcf:attribute-splitting enabled="true">
      <dcf:narrative-text/>
    </dcf:attribute-splitting>
  </dcf:location>
  
  <dcf:location name="attribute-splitting names" xpath="/persons/@names">
    <dcf:attribute-splitting enabled="true">
      <dcf:data-set separator="," output-token-separator=","/>
    </dcf:attribute-splitting>
  </dcf:location>
  
  <dcf:location name="attribute-splitting address" xpath="/persons/@address">
    <dcf:attribute-splitting enabled="true">
      <dcf:data-list separator="," output-token-separator=","/>
    </dcf:attribute-splitting>
  </dcf:location>
  
</dcf:configuration>


XML Data Compare will produce the following delta:

<persons xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
    xmlns:dxx="http://www.deltaxml.com/ns/xml-namespaced-attribute"
    xmlns:pi="http://www.deltaxml.com/ns/processing-instructions"
    xmlns:ignore="http://www.deltaxml.com/ns/ignoreForAlignment"
    xmlns:preserve="http://www.deltaxml.com/ns/preserve"
    xmlns:dxa="http://www.deltaxml.com/ns/non-namespaced-attribute"
    xmlns:er="http://www.deltaxml.com/ns/entity-references" deltaxml:deltaV2="A!=B" id="id"
    deltaxml:version="2.0" deltaxml:content-type="full-context"
    address="UnitNo,Street,Address,City,Country">
    <deltaxml:attributes deltaxml:deltaV2="A!=B">
        <dxa:names deltaxml:deltaV2="A!=B">
            <deltaxml:attributeTokenSet deltaxml:deltaV2="A!=B">
                <deltaxml:token deltaxml:deltaV2="A=B" deltaxml:original-position="A=1, B=1"
                    >anna</deltaxml:token>
                <deltaxml:token deltaxml:deltaV2="B" deltaxml:original-position="B=2"
                    >ben</deltaxml:token>
                <deltaxml:token deltaxml:deltaV2="A=B" deltaxml:original-position="A=2, B=3"
                    >chris</deltaxml:token>
                <deltaxml:token deltaxml:deltaV2="A=B" deltaxml:original-position="A=3, B=4"
                    >david</deltaxml:token>
            </deltaxml:attributeTokenSet>
        </dxa:names>
        <dxa:description deltaxml:deltaV2="A!=B">
            <deltaxml:attributeValueWords deltaxml:deltaV2="A!=B">data list of <deltaxml:textGroup
                    deltaxml:deltaV2="A!=B"><deltaxml:text deltaxml:deltaV2="A"
                        >3</deltaxml:text><deltaxml:text deltaxml:deltaV2="B"
                    >4</deltaxml:text></deltaxml:textGroup> persons</deltaxml:attributeValueWords>
        </dxa:description>
        <dxa:company deltaxml:deltaV2="A!=B">
            <deltaxml:attributeValue deltaxml:deltaV2="A">ABC Ltd</deltaxml:attributeValue>
            <deltaxml:attributeValue deltaxml:deltaV2="B">XYZ Limited</deltaxml:attributeValue>
        </dxa:company>
    </deltaxml:attributes>
</persons>
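
As a cross-check on the delta above, the following Python sketch lists which attributes actually differ between Input A and Input B. It uses only the standard library and is not part of XML Data Compare; the function name changed_attributes is hypothetical. The address and id values are identical in both inputs, which appears to be why they remain ordinary attributes of persons in the delta rather than appearing under deltaxml:attributes.

```python
import xml.etree.ElementTree as ET

INPUT_A = """<persons id="id"
         company="ABC Ltd"
         description="data list of 3 persons"
         names="anna,chris,david"
         address="UnitNo,Street,Address,City,Country"/>"""

INPUT_B = """<persons id="id"
         company="XYZ Limited"
         description="data list of 4 persons"
         names="anna,ben,chris,david"
         address="UnitNo,Street,Address,City,Country"/>"""


def changed_attributes(xml_a, xml_b):
    """Return the sorted names of root-element attributes whose values differ."""
    attributes_a = ET.fromstring(xml_a).attrib
    attributes_b = ET.fromstring(xml_b).attrib
    names = sorted(set(attributes_a) | set(attributes_b))
    return [name for name in names if attributes_a.get(name) != attributes_b.get(name)]


if __name__ == "__main__":
    print(changed_attributes(INPUT_A, INPUT_B))
```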