Whitespace normalisation

1. Introduction

By default, whitespace is normalized: runs of spaces, tabs, and newlines are usually treated as a single space, and ignorable whitespace, such as the indentation used to format a document, is removed completely. Significant whitespace nodes are rare in data-centric XML, and even where blocks of text are included they are unlikely to contain formatting elements.
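As a rough illustration of what normalizing whitespace means, the following Python sketch collapses each run of XML whitespace to a single space and trims the ends, so that whitespace-only text becomes empty. It is modelled on the behaviour of XPath's normalize-space() and is an assumption for illustration only, not the comparison tool's actual implementation; the name normalize_ws is hypothetical.

```python
import re

# XML defines whitespace as space, tab, carriage return and line feed.
XML_WS = re.compile(r"[ \t\r\n]+")


def normalize_ws(text: str) -> str:
    """Collapse each run of XML whitespace to one space and trim both ends.

    Hypothetical helper: a sketch of the idea, not the tool's own code.
    """
    return XML_WS.sub(" ", text).strip(" ")


print(repr(normalize_ws("4  Privet\n\tDrive")))  # '4 Privet Drive'
print(repr(normalize_ws("   \n   ")))            # '' (whitespace-only text disappears)
```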

2. Example

The samples provided on Bitbucket (addressesA.xml and addressesB.xml) are identical except for the arrangement of whitespace. With an empty configuration file, whitespace is normalized by default, so the result shows "A" equal to "B".

Run the comparison with the other two config files to see how whitespace changes are detected. With config-nw-false.xml, which changes the default value to false, all changes in whitespace are detected.

  <dcf:defaults>
    <dcf:normalize-whitespace on="false"/>
  </dcf:defaults>

With config-nw-specific.xml, the value of normalize-whitespace is changed only for the extra element, so only whitespace differences in the text of those elements are detected.

  <dcf:location name="Switch nw off in extra elements only" xpath="/addressList/person/extra">
    <dcf:normalize-whitespace on="false"/>
  </dcf:location>

Text in the extra element on the "Sherlock Holmes" record is wrapped differently in the two input files.

Text in the notes element for "Harry Potter" has an additional space before the start of each line in the "B" file.
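The effect of the three configurations can be mimicked with a small Python sketch. It assumes two documents with identical structure and compares only the text directly inside each element, normalizing it unless the element's tag is listed in keep_ws. The inputs DOC_A and DOC_B are hypothetical miniatures, not the Bitbucket samples, and changed_elements and keep_ws are illustrative names rather than part of the tool.

```python
import re
import xml.etree.ElementTree as ET

XML_WS = re.compile(r"[ \t\r\n]+")


def text_of(elem, normalize):
    """Return the element's leading text, optionally whitespace-normalized."""
    text = elem.text or ""
    return XML_WS.sub(" ", text).strip(" ") if normalize else text


def changed_elements(xml_a, xml_b, keep_ws=()):
    """List tags whose text differs; assumes both inputs have the same structure."""
    root_a, root_b = ET.fromstring(xml_a), ET.fromstring(xml_b)
    changed = []
    for a, b in zip(root_a.iter(), root_b.iter()):
        normalize = a.tag not in keep_ws  # normalization is on unless switched off
        if text_of(a, normalize) != text_of(b, normalize):
            changed.append(a.tag)
    return changed


# Hypothetical inputs: <notes> gains a leading space per line, <extra> is rewrapped.
DOC_A = "<person><notes>line one\nline two</notes><extra>wrapped one way</extra></person>"
DOC_B = "<person><notes> line one\n line two</notes><extra>wrapped\none way</extra></person>"

print(changed_elements(DOC_A, DOC_B))                              # []
print(changed_elements(DOC_A, DOC_B, keep_ws={"extra"}))           # ['extra']
print(changed_elements(DOC_A, DOC_B, keep_ws={"notes", "extra"}))  # ['notes', 'extra']
```

The first call corresponds to the empty config file, the second to switching normalization off for extra only, and the third approximates switching it off everywhere that text occurs.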

2.1. Comparing addressesC.xml with itself

There is a third sample file on Bitbucket, addressesC.xml, that has examples of whitespace that you may want to keep. Its <zip> element contains only spaces, and its <extra> element contains formatting elements. For this example addressesC.xml can be used for both inputA and inputB.

<addressList>
  <person customerid="63">
    <name>Harry Potter</name>
    <email>hpotter@hotmail.com</email>
    <address>
      <line>4 Privet Drive</line>
      <line>Little Whinging</line>
      <line>Surrey</line>
      <zip>      </zip>
    </address>
    <age>15</age>
    .
    .
    <extra>Mr. and Mrs. Dursley of number four, Privet Drive, were proud to say that they were <b>perfectly normal</b> <i>thank you</i> very much. </extra>
  </person>
</addressList>

If you run the comparison with an empty config file, whitespace normalization is enabled, so the spaces are removed from the <zip> element in the result file. The space between the bold words and the italic words in the <extra> element is also removed. With config-nw-false.xml, normalize-whitespace is set to false, so the spaces in the <zip> element are kept and the space between the bold and italic elements is not lost.
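Both losses involve text that consists of nothing but whitespace. The Python sketch below uses the standard xml.etree.ElementTree module to list such whitespace-only text in a cut-down copy of the sample. In ElementTree, text is what follows an element's start tag and tail is what follows its end tag. The helper name is_whitespace_only and the shortened sample string are illustrative assumptions, and the sketch only locates the text that is at risk; it does not reproduce the tool's processing.

```python
import xml.etree.ElementTree as ET


def is_whitespace_only(text):
    """True for non-empty text made up solely of XML whitespace characters."""
    return bool(text) and text.strip(" \t\r\n") == ""


# Cut-down version of the addressesC.xml sample, written on one line.
SAMPLE = (
    "<person><zip>      </zip>"
    "<extra>were <b>perfectly normal</b> <i>thank you</i> very much. </extra>"
    "</person>"
)

at_risk = []
for elem in ET.fromstring(SAMPLE).iter():
    for kind, text in (("text", elem.text), ("tail", elem.tail)):
        if is_whitespace_only(text):
            at_risk.append((elem.tag, kind))

print(at_risk)  # [('zip', 'text'), ('b', 'tail')]
```

The two entries are the content of <zip> and the single space that follows </b>, which are the two places where the result changes between the empty config file and config-nw-false.xml.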

Most of the time, in data-centric XML, you will probably want to normalize whitespace. If you have particular elements where spaces are significant or where formatting elements can appear, you will probably need to switch off whitespace normalization specifically for those elements. You would use a setting in your config file like:

  <dcf:location name="Switch nw off in zip elements because they may be all spaces" xpath="//zip">
    <dcf:normalize-whitespace on="false"/>
  </dcf:location>
