Character by Character Comparison

Introduction

This "how to" document discusses how you can perform character by character comparison, using the sample 'cbc-compare.xsl' post comparison output filter. Two practical solutions are presented here, with each using a different comparator and method for customising a comparison:

Pipelined Comparator (DXP) - Uses a filter pipeline defined by an XML file called a 'DXP' to customise the comparison.
Document Comparator - Uses Java API calls to customise a pre-existing pipeline with a number of extension points.

The document comprises three sections:

Background: some background to the character by character processing, including why it is implemented as a post comparison output filter and at what stage in the output filtering it should be applied;
Tutorial: a discussion on what results to expect from the character by character processing, how to make use of the code and how to run the sample; and
Design: an overview of the character by character design, focusing on where it would be appropriate to add user specific character by character result adjustment filters.

Before continuing it is worth mentioning this sample's scope. Specifically it is designed to process modified deltaxml:textGroups and produce new deltaxml:textGroups that contain the character level comparison results. Other changes to the input files are not handled by this sample. It is expected that the character by character comparison code presented here will be incorporated into a general comparison pipeline, which handles other types of change.

For the resources associated with this sample, see here.

Background

One significant problem with performing character by character comparison, is that it is relatively easy to accidentally align two completely different words or phrases which happen to share some characters. Previous experiments with adding character by character comparison as part of the main DeltaXML comparison have produced poor results. Therefore, for this sample we have taken a different approach. Here we perform the character by character comparison as a post comparison filter on regions of text that have already been aligned, i.e. any modified deltaxml:textGroup elements.

It would have been possible to choose to apply the character by character analysis to modified words, instead of text groups, but this would not enable simple word splitting and joining to be detected. Further, it would require the character by character comparison to happen before other word and block level filters had improved the alignment of the text to be compared. For example, the dx2-threshold.xsl filter separates out blocks of text that have been aligned by the main comparison but would be better viewed as separate blocks.

It is recommended that character by character comparison be performed once other filters that adjust the alignment of deltaxml:textGroups have been run. In general, this approach should reduce the number of modified text groups and increase the likelihood that these modifications contain similar text which can sensibly be character by character compared.

Tutorial

Expected results

The sample output file contains many examples of the output of character by character processing, along with an explanation of why those results were produced. For the purposes of this document, we will choose to discuss a few of these examples.

Consider the following two paragraphs:

The quick brown fox jumps over the lazy dog
The dirty brown foxes jump over the lazy hogs

The raw character by character comparison of these two paragraphs would produce:

The qu d ick rty brown foxes jumps over the lazy d h ogs

Here the words 'quick' and 'dirty' have been aligned due to their sharing an 'i'. In order to prevent this type of result, the raw comparison is filtered. We have implemented a simple 'maximum number of changes' filter, which limits the number of changes (insertions and deletions) to two. The comparison between the words 'quick' and 'dirty' produces four changes (two insertions and two deletions), therefore these words are separated as illustrated by the following filter output:

The ~~quick~~ dirty brown foxes jumps over the lazy ~~dog~~ hogs

The observant reader will have noticed that the words 'dog' and 'hogs' have also been judged to have too many changes, and thus are separated. This is because there are three changes (two insertions and one deletion), which is one more than is allowed by the 'max number of changes' filter. This is arguably inappropriate, but when further examples of character by character comparison are considered (such as those in the sample inputs/output) this measure generally produces better results than setting the tolerance to three changes. The sample output document also contains a discussion on the potential impact of replacing the max-change filter with a filter that uses the proportion of unchanged characters as a threshold to determine whether a region of text has too many changes.

Phrase comparison: The previous example illustrated how two strings that happened to be single words can be character by character compared. In general, the character by character comparison is going to be given two 'phrases' - sequences of characters that may contain several words. Therefore, it is possible to detect and represent word splitting and joining. For example:

Comparing 'in to' and 'into' results in 'in to'
Comparing 'into' and 'in to' results in 'in to'

When the phrases become longer, it is worth splitting then into 'sub-phrases' or 'text regions' that can be independently analyzed. This sample uses shared white-space characters to split the 'raw' output of the character by character comparison into 'text regions', so that the 'amount of change' filtering can be independently applied to each 'text' region, rather than the phrase as a whole. For example:

Comparing 'one green car' and 'two red cars' results in 'twone ~~green~~ red cars'

However, if you turn off the text region 'chunking' then the above phrase has five changes, which is higher than the two permitted, so the output would be filtered to:

Comparing 'one green car' and 'two red cars' results in '~~one green car~~ two red cars'

Using the pipelined comparator

The character by character comparison implemented by this sample requires several files in order to work. First it requires the 'cbc-compare.xsl' output filter, which exploits a (built-in) XSLT extension function to invoke an inner character by character comparison pipeline 'cbc-pipeline.dxp'; this in turn requires the 'dx2-red-green-outfilter.xsl' and all the filters in the 'cbc-xsl' directory.

Assuming that you have your own pipeline 'myPipeline.dxp' in a directory called 'myPipeline', then you would:

copy the 'cbc-compare.xsl', 'cbc-pipeline.dxp', and the 'dx2-red-green-outfilter.xsl' files to the 'myPipeline' directory;
recursively copy the 'cbc-xsl' directory to the directory to the 'myPipeline' directory;
add the 'cbc-compare.xsl' filter to the 'myPipeline.dxp' file in a similar manner to which it is added to the 'cbc-compare.dxp' pipeline (in this sample).

Using the document comparator

The document comparator solution for this sample uses an outer pipeline with just one filter named 'cbc-compare-dc.xsl', this invokes the character by character comparison via a custom XSLT extension function implemented here in Java. The remaining .xsl files used are for the inner pipeline and are the same as for the inner pipeline described above for the pipelined comparator.

The main difference between this solution and that for the pipelined comparator is that the document comparator uses two .java source files 'CBCComparator.java' and 'CBCCompare.java' (instead of .dxp files) to define the outer and inner pipelines respectively.

N.B. If you wish to use the document comparator sample in your own code, you must first compile the two .java files (you can use 'ant compile' to do that) and then include the compiled .class files on your classpath. Failing to do this will result in ClassNotFoundException being thrown.

Running the sample (in Java)

For the resources associated with this sample, see this Bitbucket repo.

The sample resources should be checked-out, cloned or downloaded and unzipped into the samples directory of the XML Compare release. They should be located such that they are two levels below the top level release directory for example DeltaXML-XML-Compare-10_0_0_j/samples/CharacterByCharacter

Design

The character by character sample is built around the 'cbc-compare.xsl' script, which prepares the input for comparison, calls the character by character comparison pipeline 'cbc-pipeline.dxp' or compare method from 'CBCComparator.java', and then returns the content of the comparison as the result.

From the user perspective it is the 'cbc-pipeline.dxp' pipeline or 'CBCComparator.java' that is intended to be modified to change the character by character comparison behavior. It runs the following post comparison filters:

dx2-red-green-outfilter.xsl - to normalise the raw character-by-character output.
cbc-xsl/chunking.xsl - to split the compared text into regions that are separated by shared whitespace.
cbc-xsl/merge-adjacent-changes.xsl - to produce inserted, deleted, and unchanged strings.
cbc-xsl/max-changes.xsl - to filter out changes that should not be represented at the character level.
cbc-xsl/convert-to-deltaV2.xsl - to convert the result back into deltaxml:textGroups
dx2-red-green-outfilter.xsl - to normalise the filtered character-by-character output

Note: only the chunking and max-change filters, (2 and 4 in the above list), are intended to be removed or updated.

The remainder of this section discuses removing the chunking filter, the purpose of the max-change filter, and how to adapt the max-change filter to use a percentage based threshold.

Removing the chunking scheme: In order to remove the chunking scheme the cbc-pipeline.dxp or CBCComparator.java files need to be modified, say by setting the default value of the chunking parameter to 'false'. Alternatively, both this parameter and the chunking filter can be removed from the pipeline. The remaining filters in the pipeline have been designed to work on either chunked or non-chunked comparison output.

Maximum number of changes mechanism: The cbc-xsl/max-changes.xsl filter currently determines whether the character by character comparison has produced too much change, by checking if the number of insertions and deletions within a 'region of text' (a chunk if chunking is enable, otherwise the whole phrase) is greater than two. In such cases, the text region is converted into a single deletion followed by a single insertion, as the character by character change has been deemed to be inappropriate at this point. Otherwise the text region is left unchanged.

Updating maximum number of changes filter: The decision algorithm of this filter is implemented by the following template.

CODE

<xsl:template match="*[delete or insert or unchanged]">
  <xsl:variable name="number-of-changes" select="count(*[self::insert or self::delete])" as="xs:integer"/>
  <xsl:variable name="max-changes" select="xs:integer($max-number-of-changes)" as="xs:integer"/>
  
  <xsl:choose>
    <xsl:when test="$number-of-changes gt $max-changes">
      <xsl:call-template name="convertToOneDeleteAndInsert" />
    </xsl:when>
    <xsl:otherwise>
      <xsl:apply-templates select="." mode="copy"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

In order to change the algorithm convert the first logical test to say when the algorithm has detected too much change. For example, the measure of an acceptable change could be converted to the percentage of changed text, using the formula:

CODE

100 * Sum(unchanged characters)/Sum(total characters)

In order to implement this change the previous XSL template could be adapted as follows (where the $max-number-of-changes variable is now interpreted as the threshold of change percentage).

CODE

<xsl:template match="*[delete or insert or unchanged]">
  <xsl:variable name="all-chars" select="sum(for $n in (node()) return string-length(string($n)))" as="xs:integer"/>
  <xsl:variable name="unchanged-chars" select="sum(for $n in (unchanged) return string-length(string($n)))" as="xs:integer"/>
  <xsl:variable name="threshold-percentage" select="xs:integer($max-number-of-changes)" as="xs:integer"/>
  
  <xsl:choose>
    <xsl:when test="$threshold-percentage gt (100*$unchanged-chars) idiv $all-chars">
      <xsl:call-template name="convertToOneDeleteAndInsert" />      
    </xsl:when>
    <xsl:otherwise>
      <xsl:apply-templates select="." mode="copy"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

Here the appropriate threshold (ratio between common and total characters) should be greater than 20%, otherwise the words 'bat' and 'car' will match and produce the following change 'b cat r' (100 * 1/5), but less than 33%, otherwise comparing 'use' and 'using' would result in the two words being repeated, rather than 'use ing' (100 * 2/6). However, setting the threshold below 33% would mean that comparing the words 'overstated' and 'understand' would result in 'ov understate nd' as this represents 43% (100 * 6/14) unchanged text content. Therefore this percentage measure may not provide a significant improvement over the simple counting mechanism.