Word by Word Text Comparison

Introduction

This document explains the concepts behind word-by-word text comparison using a sample. The sample resources and a description of how to run the sample can be found at: https://bitbucket.org/deltaxml/word-by-word.

When comparing documents containing text, XML Compare treats each block of text as a single node. This can lead to large amounts of change when in fact only certain words within the text have been changed. Consider the following document and the changes made to it:

Example 1: an XML document containing text (input1.xml in Bitbucket, https://bitbucket.org/deltaxml/word-by-word)

XML

<document>
  <para>A pangram uses all the letters of the alphabet. It is often used to test typewriters 
    or computer keyboards, for example: The quick brown fox jumps over the lazy dog.</para>
</document>

Example 2: a modified version of the document (input2.xml in Bitbucket, https://bitbucket.org/deltaxml/word-by-word)

XML

<document>
  <para>A pangram uses all the letters of the alphabet and is often used to test typewriters
    or computer keyboards, for example: A quick movement of the enemy will jeopardize six gunboats.</para>
</document>

If these inputs are compared as they are, we get the following result (the actual delta file has been converted to a colour-coded result to make it easier to read):

Example 3: the result of comparing the documents above

The sample code shows how the same comparison can be run using either the Pipelined Comparator or the Document Comparator. The descriptions of specific filters here apply only to the Pipelined Comparator; the equivalent 'word-by-word' and 'orphaned word' features are built into the Document Comparator and are controlled via its API or a DCP XML configuration file. The Document Comparator has word-by-word comparison enabled by default, so where word-by-word must be disabled, a special 'disable-word-by-word.xsl' filter is used to add a deltaxml:word-by-word="false" attribute to the root element of the input files.

Comparing text word by word

While the result above is technically correct, it is not particularly useful for displaying what has actually changed. A much better approach would be to compare the text on a word by word basis.

Word by word filters

XML Compare includes Java filters to split text into individual words before comparing them and also to convert the split words back into larger chunks of text. This allows the comparison to show only those words that have changed. These filters are com.deltaxml.pipe.filters.dx2.wbw.WordInfilter and com.deltaxml.pipe.filters.dx2.wbw.WordOutfilter. The following result shows the effect they have on the sample data:

Example 4: the comparison result with word by word filters added
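To make the idea of splitting text before comparison concrete, the following is a minimal Java sketch that breaks a block of text into alternating word and whitespace tokens. It is an illustration only: the class WordSplitSketch and its split method are hypothetical names, it does not use or reproduce WordInfilter, and it ignores punctuation handling entirely.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; this is not the WordInfilter implementation. */
final class WordSplitSketch {

    /** Splits text into alternating runs of whitespace and non-whitespace characters. */
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            boolean space = Character.isWhitespace(text.charAt(start));
            int end = start + 1;
            // Extend the run while the characters are of the same kind.
            while (end < text.length()
                    && Character.isWhitespace(text.charAt(end)) == space) {
                end++;
            }
            tokens.add(text.substring(start, end));
            start = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Each token would be compared as its own node rather than as part of one text block.
        for (String token : split("The quick brown fox")) {
            System.out.println("[" + token + "]");
        }
    }
}
```

Because every word and every run of spaces becomes a separate unit, a comparison can align the unchanged tokens and report only the tokens that differ. Joining the tokens back together reproduces the original text, which is the role the output filter plays after the comparison.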

Word by word attributes

When word by word filters are in place, the default is for the word by word feature to affect all parts of the document. However, it is possible to specify which parts of a document are affected or unaffected through the use of a deltaxml:word-by-word attribute. This attribute has permitted values of true or false and may be attached to any element in an input document.

The deltaxml:word-by-word attribute affects the element on which it occurs. Any descendant elements are also affected unless they have a deltaxml:word-by-word attribute of their own, which overrides the inherited behaviour.
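The inheritance rule can be sketched in Java using only the standard DOM API. This is an illustration of the rule as described above, not code from XML Compare: the class WordByWordScopeSketch, the section and code element names, and the sample document are hypothetical, and the namespace URI bound to the deltaxml prefix is an assumption that should be checked against your release.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical illustration of the inheritance rule; not part of XML Compare. */
final class WordByWordScopeSketch {

    // Assumption: the namespace URI bound to the deltaxml prefix.
    static final String DELTAXML_NS = "http://www.deltaxml.com/ns/well-formed-delta-v1";

    // Hypothetical input; <section> and <code> are invented element names.
    static final String SAMPLE =
        "<document xmlns:deltaxml='" + DELTAXML_NS + "'>"
      + "<section deltaxml:word-by-word='false'>"
      + "<code>x = 1</code>"
      + "<para deltaxml:word-by-word='true'>Some prose.</para>"
      + "</section>"
      + "<para>More prose.</para>"
      + "</document>";

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Walks up from the element; the nearest deltaxml:word-by-word attribute wins. */
    static boolean isWordByWord(Element element) {
        for (Node node = element; node instanceof Element; node = node.getParentNode()) {
            Element current = (Element) node;
            if (current.hasAttributeNS(DELTAXML_NS, "word-by-word")) {
                return Boolean.parseBoolean(current.getAttributeNS(DELTAXML_NS, "word-by-word"));
            }
        }
        return true; // no attribute found: word by word applies by default
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        for (String name : new String[] {"document", "section", "code", "para"}) {
            Element element = (Element) doc.getElementsByTagName(name).item(0);
            System.out.println(name + " -> " + isWordByWord(element));
        }
    }
}
```

In this hypothetical input, the code element inherits false from its section parent, the para inside the section overrides that with its own true value, and the para outside the section falls back to the default.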

Orphaned Words

The next improvement to make is to the changed sentence at the end of the text. This is made up of fragmented added/deleted words alongside the occasional unchanged word that was common between the two sentences. This is not easy to read due to the fragmentation. The OrphanedWordOutfilter is a post-processing filter that will detect unchanged words in the middle of changes and duplicate them as an added and deleted copy of the same word. In the example above, the words 'quick' and 'the' would be treated in this way. Applying this filter gives the following result:

Example 5: the comparison result with orphaned words detected

This final result shows the changes in more detail while remaining readable.

Filter ordering

The order of the word by word filters is important. Because they convert each word, space and potentially punctuation character into an element that contains the word, they can dramatically increase the size of the XML document tree. Processing this large tree will then often consume much larger amounts of memory than before if the processing involves holding the entire tree in memory (as it does with XSLT filtering). For that reason, the word filters are written as Java streaming filters that do not load the tree into memory. They are also placed as close as possible on either side of the Java-implemented comparison stage.

Input Filters

The WordInfilter is typically used as the last input filter before the comparison, on each of the two input filter chains. If punctuation needs to be treated separately, it must be defined in the inputs by an earlier filter.

Output Filters

The minimal requirement for output filters is to use the WordOutfilter as the first output filter. This will highlight changes at the finer granularity and, where there are no orphaned words, will gather together the added and deleted text into a continuous chunk of text.

If the orphaned word processing is required, OrphanedWordOutfilter should be used as the first output filter, followed by WordOutfilter.

All other output filters (particularly XSLT filters) should go after the word filters.

Configuring the Orphaned Words Filter

The OrphanedWordOutfilter has two configuration parameters that change its behaviour: orphanedLengthLimit and orphanedThresholdPercentage.

The orphanedLengthLimit Parameter

This parameter specifies the maximum number of consecutive unchanged words that can be treated as orphaned words. Its default value is 2. It should be kept fairly small; otherwise most words end up being treated as orphans and the advantage of word by word comparison is lost.

The orphanedThresholdPercentage Parameter

This parameter specifies a threshold used in a calculation that assesses the size of the unchanged word section in relation to the changed words on either side of it. The default value is 20, i.e. the unchanged words must account for less than 20 percent of the total count of changed and unchanged words. The calculated value that is compared against this percentage is:

unchanged words / (changed words before + unchanged words + changed words after) * 100

This calculated value must be less than the value of orphanedThresholdPercentage for the unchanged words to be treated as orphans.
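The calculation can be worked through with a small Java sketch. This is not the filter's implementation: the class OrphanThresholdSketch and its method names are hypothetical, the word counts are made-up inputs, and the sketch assumes that both parameters must be satisfied together (the run of unchanged words is no longer than orphanedLengthLimit and its percentage is below orphanedThresholdPercentage). How the filter itself counts the changed words on either side is not specified here.

```java
/** Hypothetical illustration of the documented formula; not the OrphanedWordOutfilter code. */
final class OrphanThresholdSketch {

    /** unchanged words / (changed words before + unchanged words + changed words after) * 100 */
    static double unchangedPercentage(int changedBefore, int unchanged, int changedAfter) {
        return 100.0 * unchanged / (changedBefore + unchanged + changedAfter);
    }

    /**
     * Assumption: a run of unchanged words is orphaned only if it passes both
     * the length limit and the percentage threshold.
     */
    static boolean isOrphaned(int changedBefore, int unchanged, int changedAfter,
                              int orphanedLengthLimit, double orphanedThresholdPercentage) {
        if (unchanged < 1 || unchanged > orphanedLengthLimit) {
            return false;
        }
        return unchangedPercentage(changedBefore, unchanged, changedAfter)
                < orphanedThresholdPercentage;
    }

    public static void main(String[] args) {
        int limit = 2;          // default orphanedLengthLimit
        double threshold = 20;  // default orphanedThresholdPercentage

        // 1 unchanged word between 3 and 3 changed words: 1 / 7 * 100 is about 14.3
        System.out.println(isOrphaned(3, 1, 3, limit, threshold));
        // 1 unchanged word between 2 and 2 changed words: 1 / 5 * 100 is exactly 20
        System.out.println(isOrphaned(2, 1, 2, limit, threshold));
        // 3 unchanged words exceed the length limit regardless of the percentage
        System.out.println(isOrphaned(10, 3, 10, limit, threshold));
    }
}
```

With the default values, the first case falls below the threshold and would be treated as orphaned. The second sits exactly on the threshold and would not, because the calculated value must be strictly less than orphanedThresholdPercentage. The third is excluded by the length limit under the stated assumption.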

Running the sample

The sample resources for Java, and a description of how to run the sample, can be found at: https://bitbucket.org/deltaxml/word-by-word.

The resources should be checked out, cloned, or downloaded and unzipped into the samples directory of the XML Compare release. They should be located two levels below the top-level release directory, for example DeltaXML-XML-Compare-10_0_0_j/samples/word-by-word.