Document Comparator Character by Character Comparison

Introduction

This "how to" document explains how to perform character-by-character comparison. For the resources associated with this sample, see the Bitbucket repository.

Document Comparator - Uses Java API calls to customise a pre-existing pipeline with a number of extension points.

This document comprises two sections:

Background: some background to the character-by-character processing, including why it is implemented as a post-comparison output filter and at what stage in the output filtering it should be applied;
Tutorial: a discussion of what results to expect from the character-by-character processing, how to make use of the code, and how to run the sample.

Background

The pipelined comparator character-by-character sample (which is now deprecated) faced challenges in accurately aligning text due to the possibility of inadvertently aligning unrelated words or phrases. Previous experimentation integrating this feature as part of the main DeltaXML comparison yielded unsatisfactory results. Consequently, a new approach has been devised to achieve character by character comparisons. In this updated method, character-by-character comparison serves as a post-comparison filter applied exclusively to regions of text that have already been aligned, specifically targeting modified deltaxml:textGroup elements.

Character-by-character analysis now focuses on text groups rather than solely on modified words, because a word-only analysis may not capture simple word splitting and joining. Additionally, to improve the overall accuracy of the comparison, character-by-character comparison takes place after other word and block-level filters have adjusted text alignment.

This method aims to minimise the number of modified text groups and improve the likelihood of comparing similar text accurately. Furthermore, the new @deltaxml:character-by-character attribute allows character-by-character comparison to be enabled or disabled for specific subtrees or elements, giving finer control over the scope of the comparison.
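
To make the idea of subtree-level control concrete, the following is a minimal sketch of how a filter might decide whether character-by-character processing applies to a given element. It is not taken from the sample code. It assumes that the attribute takes the values "true" and "false", that a setting on an ancestor applies to its whole subtree unless overridden lower down, and that the deltaxml prefix is bound to the usual DeltaXML delta namespace. The class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical helper, for illustration only; not part of the XML Compare API. */
final class CharByCharScopeSketch {

    // Namespace normally bound to the deltaxml prefix in delta files.
    static final String DELTAXML_NS = "http://www.deltaxml.com/ns/well-formed-delta-v1";

    /**
     * Walks from the element up through its ancestors and returns the value of
     * the nearest deltaxml:character-by-character attribute.
     * Assumption: values are "true"/"false" and are inherited by descendants.
     */
    static boolean isEnabled(Element element, boolean defaultValue) {
        for (Node n = element; n instanceof Element; n = n.getParentNode()) {
            Element e = (Element) n;
            if (e.hasAttributeNS(DELTAXML_NS, "character-by-character")) {
                return Boolean.parseBoolean(
                        e.getAttributeNS(DELTAXML_NS, "character-by-character"));
            }
        }
        // No setting found on the element or any ancestor.
        return defaultValue;
    }

    /** Parses an XML string with namespace awareness switched on. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse XML", ex);
        }
    }
}
```

Under these assumptions, a paragraph inside a section marked deltaxml:character-by-character="false" would be skipped, while a child of that section carrying the attribute with the value "true" would be processed again.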

Tutorial

Expected results

The sample output file contains many examples of the output of character by character processing, along with an explanation of why those results were produced.

This document discusses a few of these examples.

The sample file demonstrates how character-by-character processing detects small changes within words.

First, we examine pluralisation changes within a sentence.

"The quick brown foxes jumps over the lazy dogs."

The following example shows how changing a letter in a word is represented.

"The quick brown fox jumps over the lazy dhog."

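The kind of inline result shown in the examples above can be approximated with a plain longest-common-subsequence diff over the characters of two aligned strings. The sketch below is illustrative only and is not the algorithm used by XML Compare. The class name is hypothetical, and the [-x-] and {+x+} markers for deleted and added characters are borrowed from the wdiff convention rather than from the sample output.

```java
/** Illustrative only: a plain LCS character diff, not DeltaXML's implementation. */
final class CharDiffSketch {

    /**
     * Compares a and b character by character. Characters only in a are
     * wrapped as [-x-], characters only in b as {+x+}; each changed
     * character is marked individually.
     */
    static String diff(String a, String b) {
        int n = a.length();
        int m = b.length();
        // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
        int[][] lcs = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                lcs[i][j] = a.charAt(i) == b.charAt(j)
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }
        StringBuilder out = new StringBuilder();
        int i = 0;
        int j = 0;
        while (i < n || j < m) {
            if (i < n && j < m && a.charAt(i) == b.charAt(j)) {
                out.append(a.charAt(i)); // unchanged character
                i++;
                j++;
            } else if (i < n && (j == m || lcs[i + 1][j] >= lcs[i][j + 1])) {
                out.append("[-").append(a.charAt(i)).append("-]"); // deleted from a
                i++;
            } else {
                out.append("{+").append(b.charAt(j)).append("+}"); // added in b
                j++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(diff("fox", "foxes"));          // fox{+e+}{+s+}
        System.out.println(diff("dog", "hog"));            // [-d-]{+h+}og
        System.out.println(diff("note book", "notebook")); // note[- -]book
    }
}
```

The last call shows why working on a whole text group matters: the joined word is reported as a single deleted space, which a comparison restricted to individual modified words would not necessarily reveal.
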
The sample also covers the following character-by-character processing scenarios.

Correction of Spelling Mistakes

Examples include corrected misspelled words.

"It is easy to omit letters when typing."

"It is easy to swawp letters when typing."

However, the amount of change needed to correct a spelling sometimes exceeds the maximum number of allowed changes, in which case the text is shown as a whole deletion and addition instead. For example:

“~~one green car~~two red cars parked”
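
The source does not state how the limit on allowed changes is measured, so the following is only a sketch of the general idea: count the character edits needed to turn one string into the other and fall back to a whole-text replacement when the count is too high. It assumes a simple Levenshtein edit distance and a hypothetical limit passed in by the caller; neither is taken from XML Compare.

```java
/** Illustrative only: a hypothetical change limit based on Levenshtein distance. */
final class ChangeLimitSketch {

    /** Minimum number of single-character insertions, deletions and substitutions. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    /** True when the change is small enough to show character by character. */
    static boolean withinLimit(String a, String b, int maxChanges) {
        return editDistance(a, b) <= maxChanges;
    }

    public static void main(String[] args) {
        int maxChanges = 3; // hypothetical limit, chosen only for this example
        System.out.println(withinLimit("sawp", "swap", maxChanges));
        System.out.println(withinLimit("one green car", "two red cars", maxChanges));
    }
}
```

With this hypothetical limit, the swapped-letter correction stays within bounds, while the change from "one green car" to "two red cars" does not and would be reported as a whole replacement.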

Capitalisation

The following examples show changes in capitalisation.

"~~f~~Forgetting to capitalise a sentence is easy."

"It is also easy to ~~I~~inappropriately capitalise a word."

Running the sample (in Java)

For the resources associated with this sample, see the Bitbucket repository.

The sample resources should be checked out, cloned, or downloaded and unzipped into the samples directory of the XML Compare release. They should be located two levels below the top-level release directory, for example DeltaXML-XML-Compare-10_0_0_j/samples/CharacterByCharacter.