By default, whitespace is normalized, so that runs of spaces, tabs, and newlines are usually treated as a single space. Ignorable whitespace, such as formatting whitespace, is removed completely. Significant whitespace nodes are rare in data-centric XML. Even when blocks of text are included, they are unlikely to contain formatting elements.
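As a rough illustration of what normalizing whitespace means, here is a minimal Python sketch. It is not the product's implementation: the function name `normalize_whitespace` and the sample strings are hypothetical, and trimming leading and trailing spaces is an assumption made for the example.

```python
import re

def normalize_whitespace(text: str) -> str:
    """Collapse each run of XML whitespace into one space, then trim the ends.

    Assumption: only space, tab, CR, and LF count as whitespace, as in XML.
    """
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" ")

# Two hypothetical text values that differ only in their whitespace.
a = "221B Baker Street,\n    London"
b = "221B   Baker Street,\tLondon"

print(a == b)                                              # False
print(normalize_whitespace(a) == normalize_whitespace(b))  # True
```

With normalization applied, the two values compare as equal; without it, every difference in spacing or line wrapping counts as a change.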
The samples provided on Bitbucket (including addressesB.xml) are identical except for the arrangement of whitespace. With an empty configuration file, whitespace is normalized by default, so the result shows "A" equal to "B".
Run the comparison with the other two config files to see how the whitespace changes are detected. When the default value is changed to false, using config-nw-false.xml, all changes in whitespace are detected. If, as in config-nw-specific.xml, the value of normalize-whitespace is changed only for the specific element extra, then only whitespace differences in the text of those elements are detected.
Text in the extra element on the "Sherlock Holmes" record is wrapped differently in the two input files.
Text in the notes element for "Harry Potter" has an additional space before the start of each line in the "B" file.
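The difference between the three configurations can be sketched in Python. This is only an illustration under stated assumptions: the record below is hypothetical and is not taken from the sample files, and `changed_elements` and its `keep_whitespace` parameter are invented names, not part of the product.

```python
import re
import xml.etree.ElementTree as ET

def normalize(text):
    """Collapse runs of XML whitespace into single spaces and trim the ends."""
    return re.sub(r"[ \t\r\n]+", " ", text or "").strip(" ")

def changed_elements(a, b, keep_whitespace=frozenset()):
    """Return the tags of child elements whose text differs between a and b.

    Text is normalized before comparing unless the tag is in keep_whitespace,
    which stands in for switching normalization off for specific elements.
    """
    changed = []
    for ea, eb in zip(a, b):
        ta, tb = ea.text or "", eb.text or ""
        if ea.tag not in keep_whitespace:
            ta, tb = normalize(ta), normalize(tb)
        if ta != tb:
            changed.append(ea.tag)
    return changed

# Hypothetical records that differ only in wrapping and leading spaces.
rec_a = ET.fromstring(
    "<record><name>Sherlock Holmes</name>"
    "<extra>Consulting detective,\nBaker Street</extra>"
    "<notes>line one\nline two</notes></record>"
)
rec_b = ET.fromstring(
    "<record><name>Sherlock Holmes</name>"
    "<extra>Consulting\ndetective, Baker Street</extra>"
    "<notes> line one\n line two</notes></record>"
)

print(changed_elements(rec_a, rec_b))                      # []
print(changed_elements(rec_a, rec_b, {"extra"}))           # ['extra']
print(changed_elements(rec_a, rec_b, {"extra", "notes"}))  # ['extra', 'notes']
```

The first call mirrors the default behavior, the second mirrors switching normalization off for one element only, and the third approximates switching it off everywhere that text differs.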
Comparing addressesC.xml with itself
There is a third sample file on Bitbucket, addressesC.xml, that has examples of whitespace that you may want to keep. It has a <zip> element that contains only spaces, and its <extra> element contains formatting. For this example, addressesC.xml can be used for both inputs.
If you run the comparison with an empty config file, whitespace normalization is enabled and the spaces are removed from the <zip> element in the result file. The space between the bold words and the italic words in the <extra> element is also removed. Using the config file config-nw-false.xml sets normalize-whitespace to false, so the spaces remain in the <zip> element and the space between the bold and italic elements is not lost.
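The effect on whitespace-only content can be shown with a small Python sketch using the standard library's ElementTree. The record here is a hypothetical stand-in for the kind of content described above, not the actual content of addressesC.xml, and `strip_whitespace_only` is an invented helper that only approximates normalization.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: a <zip> holding only spaces, and a single space
# separating two inline formatting elements inside <extra>.
SOURCE = (
    "<record><zip>     </zip>"
    "<extra><b>bold words</b> <i>italic words</i></extra></record>"
)

def strip_whitespace_only(root):
    """Drop text and tail strings that consist solely of whitespace."""
    for node in root.iter():
        if node.text is not None and node.text.strip() == "":
            node.text = None
        if node.tail is not None and node.tail.strip() == "":
            node.tail = None
    return root

kept = ET.fromstring(SOURCE)                               # whitespace preserved
normalized = strip_whitespace_only(ET.fromstring(SOURCE))  # whitespace dropped

print(repr(kept.find("zip").text))               # '     '
print(repr(normalized.find("zip").text))         # None
print("".join(kept.find("extra").itertext()))        # bold words italic words
print("".join(normalized.find("extra").itertext()))  # bold wordsitalic words
```

The last line shows why this matters for formatted text: once the whitespace-only node between the two inline elements is dropped, the words run together.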
Most of the time, in data-centric XML, you will probably want to normalize whitespace. If you have particular elements where spaces are significant or where formatting can be used, you will probably need to switch off whitespace normalization specifically for those elements. You would use a setting in your config file like: