Word by Word Text Comparison
This document explains the concepts behind word-by-word text comparison using a sample. The sample resources and a description of how to run the sample can be found at: https://bitbucket.org/deltaxml/word-by-word.
When comparing documents containing text, XML Compare treats each block of text as a single node. This can lead to large amounts of change when in fact only certain words within the text have been changed. Consider the following document and the changes made to it:
Example 1: an XML document containing text (input1.xml in Bitbucket, https://bitbucket.org/deltaxml/word-by-word)
<document> <para>A pangram uses all the letters of the alphabet. It is often used to test typewriters or computer keyboards, for example: The quick brown fox jumps over the lazy dog.</para> </document>
Example 2: a modified version of the document (input2.xml in Bitbucket, https://bitbucket.org/deltaxml/word-by-word)
<document> <para>A pangram uses all the letters of the alphabet and is often used to test typewriters or computer keyboards, for example: A quick movement of the enemy will jeopardize six gunboats.</para> </document>
If these inputs are compared as they are, we get the following result (the actual delta file has been converted to a colour-coded rendering to make it easier to read):
Example 3: the result of comparing the documents above
The sample code shows how the same comparison can be run using either the Pipelined Comparator or the Document Comparator. The description of specific filters here applies only to the Pipelined Comparator; the equivalent 'word-by-word' and 'orphaned word' features are built into the Document Comparator and are controlled via its API or a DCP XML configuration file. Because the Document Comparator has word-by-word comparison enabled by default, the sample uses a special 'disable-word-by-word.xsl' stylesheet for cases where word-by-word must be disabled. It adds a 'deltaxml:word-by-word="false"' attribute to the root element of the input files.
Comparing text word by word
While the result above is technically correct, it is not particularly useful for displaying what has actually changed. A much better approach would be to compare the text on a word by word basis.
Word by word filters
XML Compare includes Java filters that split text into individual words before the comparison and convert the split words back into larger chunks of text afterwards. This allows the comparison to show only those words that have changed. These filters are com.deltaxml.pipe.filters.dx2.wbw.WordInfilter and com.deltaxml.pipe.filters.dx2.wbw.WordOutfilter. The following result shows the effect they have on the sample data:
Example 4: the comparison result with word by word filters added
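To make the idea of splitting and rejoining concrete, here is a minimal sketch in plain Java. It is not the WordInfilter or WordOutfilter implementation: the class WordSplitSketch and its methods are hypothetical, and the splitting rule (alternating runs of whitespace and non-whitespace) is an assumption chosen for illustration. It only shows that text can be broken into small tokens for comparison and joined back without loss.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of XML Compare.
class WordSplitSketch {
    // Assumption: a token is either a run of whitespace or a run of non-whitespace.
    private static final Pattern TOKEN = Pattern.compile("\\s+|\\S+");

    // Split text into word and space tokens, keeping every character.
    static List<String> tokenise(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Rejoin tokens into a single chunk of text.
    static String join(List<String> tokens) {
        return String.join("", tokens);
    }

    public static void main(String[] args) {
        List<String> tokens = tokenise("The quick brown fox");
        System.out.println(tokens.size() + " tokens: " + tokens);
        System.out.println(join(tokens));
    }
}
```

Because each word becomes its own unit, a comparison over these tokens can match 'quick' in both versions of a sentence even when the words around it differ.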
Word by word attributes
When word by word filters are in place, the default is for the word by word feature to affect all parts of the document. However, it is possible to specify which parts of a document are affected or unaffected through the use of a deltaxml:word-by-word attribute. This attribute has permitted values of true and false and may be attached to any element in an input document.
A deltaxml:word-by-word attribute affects the element on which it occurs. Any descendant elements of this element are also affected unless they themselves have a deltaxml:word-by-word attribute, which overrides the inherited behaviour.
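The inheritance rule can be sketched with the standard Java DOM API. This is an illustration of the rule as described above, not code from XML Compare: the class WordByWordScopeSketch is hypothetical, the element names section and note are invented, and the namespace URI bound to the deltaxml prefix is an assumption that should be checked against the product documentation.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical illustration only; not part of XML Compare.
class WordByWordScopeSketch {
    // Assumption: the namespace URI bound to the deltaxml prefix.
    static final String DELTAXML_NS = "http://www.deltaxml.com/ns/well-formed-delta-v1";

    // The nearest element (self or ancestor) carrying the attribute decides.
    static boolean isWordByWord(Element element) {
        for (Node n = element; n instanceof Element; n = n.getParentNode()) {
            String value = ((Element) n).getAttributeNS(DELTAXML_NS, "word-by-word");
            if (!value.isEmpty()) {
                return Boolean.parseBoolean(value);
            }
        }
        return true; // no attribute found: word by word applies by default
    }

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<document xmlns:deltaxml='" + DELTAXML_NS + "'>"
          + "<section deltaxml:word-by-word='false'>"
          + "<para>off (inherited)</para>"
          + "<note deltaxml:word-by-word='true'>on (overridden)</note>"
          + "</section>"
          + "<para>on (default)</para>"
          + "</document>");
        Element inheritedPara = (Element) doc.getElementsByTagName("para").item(0);
        Element note = (Element) doc.getElementsByTagName("note").item(0);
        System.out.println(isWordByWord(inheritedPara));
        System.out.println(isWordByWord(note));
    }
}
```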
The next improvement to make is to the changed sentence at the end of the text. It is made up of fragmented added and deleted words, interspersed with the occasional unchanged word that happens to be common to the two sentences. This fragmentation makes the result hard to read. The OrphanedWordOutfilter is a post-processing filter that detects unchanged words in the middle of changes and replaces each of them with an added copy and a deleted copy of the same word. In the example above, the words 'quick' and 'the' would be treated in this way. Applying this filter gives the following result:
Example 5: the comparison result with orphaned words detected
This final result shows the changes in more detail while remaining readable.
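The effect of orphaned word detection can be sketched on a flat list of tagged words. This is a simplified model, not the OrphanedWordOutfilter implementation: the class OrphanedWordSketch, its Token type and the markOrphans method are hypothetical, the sketch ignores spaces and punctuation, and it applies only a run-length limit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of XML Compare.
class OrphanedWordSketch {
    enum Status { UNCHANGED, ADDED, DELETED }

    static final class Token {
        final String word;
        final Status status;
        Token(String word, Status status) { this.word = word; this.status = status; }
        @Override public String toString() { return status + ":" + word; }
    }

    // Re-tag short unchanged runs that have changes on both sides.
    static List<Token> markOrphans(List<Token> in, int lengthLimit) {
        List<Token> out = new ArrayList<>();
        int i = 0;
        while (i < in.size()) {
            if (in.get(i).status != Status.UNCHANGED) {
                out.add(in.get(i));
                i++;
                continue;
            }
            int end = i; // find the end of this maximal unchanged run
            while (end < in.size() && in.get(end).status == Status.UNCHANGED) {
                end++;
            }
            List<Token> run = in.subList(i, end);
            // The run is maximal, so any neighbouring token is a change.
            boolean surrounded = i > 0 && end < in.size();
            if (surrounded && run.size() <= lengthLimit) {
                for (Token t : run) out.add(new Token(t.word, Status.DELETED));
                for (Token t : run) out.add(new Token(t.word, Status.ADDED));
            } else {
                out.addAll(run);
            }
            i = end;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> words = Arrays.asList(
            new Token("The", Status.DELETED), new Token("A", Status.ADDED),
            new Token("quick", Status.UNCHANGED),
            new Token("brown", Status.DELETED), new Token("movement", Status.ADDED));
        System.out.println(markOrphans(words, 2));
    }
}
```

Once 'quick' is carried as both a deleted and an added word, the neighbouring deletions and additions are no longer interrupted by an unchanged word, so they can be gathered into continuous chunks.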
The order of the word by word filters is important. Because they convert each word, space and potentially punctuation character into an element that contains the word, they can dramatically increase the size of the XML document tree. Processing this large tree will then often consume much larger amounts of memory than before if the processing involves holding the entire tree in memory (as it does with XSLT filtering). For that reason, the word filters are written as Java streaming filters that do not load the tree into memory. They are also placed as close as possible on either side of the Java-implemented comparison stage.
WordInfilter is typically used as the last input filter before the comparison on either input filter chain. Punctuation needs to be defined in the inputs in a previous filter if required.
The minimal requirement for output filters is to use the WordOutfilter as the first output filter. This will highlight changes at the finer granularity and, where there are no orphaned words, will gather together the added and deleted text into a continuous chunk of text.
If orphaned word processing is required, OrphanedWordOutfilter should be used as the first output filter, followed by WordOutfilter.
All other output filters (particularly XSLT filters) should go after the word filters.
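The ordering rules above can be written down as a small checklist in Java. This sketch does not use the XML Compare API: filter chains are modelled as lists of names, the class FilterOrderSketch is hypothetical, and the stylesheet names in the example are invented. It treats "typically last" for WordInfilter as a strict rule, which is an assumption.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of XML Compare.
class FilterOrderSketch {
    static final String WORD_IN = "WordInfilter";
    static final String WORD_OUT = "WordOutfilter";
    static final String ORPHAN_OUT = "OrphanedWordOutfilter";

    // Input chain: WordInfilter is the last filter before the comparison.
    static boolean inputOrderOk(List<String> inputFilters) {
        return !inputFilters.isEmpty()
            && inputFilters.get(inputFilters.size() - 1).equals(WORD_IN);
    }

    // Output chain: word filters come first, with OrphanedWordOutfilter
    // (when used) immediately before WordOutfilter.
    static boolean outputOrderOk(List<String> outputFilters) {
        int orphan = outputFilters.indexOf(ORPHAN_OUT);
        int word = outputFilters.indexOf(WORD_OUT);
        if (orphan >= 0) {
            return orphan == 0 && word == 1;
        }
        return word == 0;
    }

    public static void main(String[] args) {
        // Stylesheet names below are invented for the example.
        List<String> input = Arrays.asList("normalise-space.xsl", WORD_IN);
        List<String> output = Arrays.asList(ORPHAN_OUT, WORD_OUT, "to-html.xsl");
        System.out.println(inputOrderOk(input) && outputOrderOk(output));
    }
}
```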
Configuring the Orphaned Words Filter
OrphanedWordOutfilter has two configuration parameters that change its behaviour: orphanedLengthLimit and orphanedThresholdPercentage.
The orphanedLengthLimit Parameter
This parameter specifies the maximum number of consecutive unchanged words that could be treated as orphaned words. Its default value is 2 and it should be kept as a fairly small number to avoid all words being treated as orphans and the advantage of word by word comparison being lost.
The orphanedThresholdPercentage Parameter
This specifies a threshold used in a calculation that assesses the size of the unchanged word section in relation to the changed words on either side of it. The default value is 20, i.e. the unchanged words can account for no more than 20 percent of the total count of changed and unchanged words. The calculated value that is compared against this percentage is:
unchanged words / (changed words before + unchanged words + changed words after) * 100
This calculated value must be less than the value of orphanedThresholdPercentage for the unchanged words to be treated as orphans.
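The calculation can be worked through in a few lines of Java. This is a sketch of the formula as stated above, not the filter's own code: the class OrphanThresholdSketch and its methods are hypothetical, and both the way the two parameters are combined and the way changed words are counted on each side are assumptions.

```java
// Hypothetical illustration only; not part of XML Compare.
class OrphanThresholdSketch {
    // unchanged / (changed before + unchanged + changed after) * 100
    static double unchangedPercentage(int changedBefore, int unchanged, int changedAfter) {
        int total = changedBefore + unchanged + changedAfter;
        return total == 0 ? 0.0 : 100.0 * unchanged / total;
    }

    // Assumption: a run is orphaned only if it passes both the length
    // limit and the (strictly less than) percentage threshold.
    static boolean isOrphaned(int changedBefore, int unchanged, int changedAfter,
                              int lengthLimit, double thresholdPercentage) {
        return unchanged > 0
            && unchanged <= lengthLimit
            && unchangedPercentage(changedBefore, unchanged, changedAfter) < thresholdPercentage;
    }

    public static void main(String[] args) {
        // One unchanged word between 3 and 4 changed words: 1 / 8 * 100 = 12.5
        System.out.println(unchangedPercentage(3, 1, 4));
        // With the default values (limit 2, threshold 20) this run is an orphan.
        System.out.println(isOrphaned(3, 1, 4, 2, 20));
        // One unchanged word between single changed words: 1 / 3 * 100, above 20.
        System.out.println(isOrphaned(1, 1, 1, 2, 20));
    }
}
```

With the default values, a single unchanged word needs more than four changed words around it in total before it is treated as an orphan, since 1 / 5 * 100 is exactly 20 and the comparison is strict.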
Running the sample
The sample resources and a description of how to run the sample can be found at: https://bitbucket.org/deltaxml/word-by-word for Java.
The resources should be checked out, cloned, or downloaded and unzipped into the samples directory of the XML Compare release. They should be located such that they are two levels below the top level release directory, for example