How are DITA Comparisons Performed?

DITA Compare makes use of the fact that DITA is an XML format when performing document comparison. XML documents are machine-readable documents that conform to a set of rules defined by the W3C. For more information on XML, see Resources.

The document comparison is performed by another of our products, XML Compare, with various pre-configured pre- and post-processing steps, referred to as a filter pipeline.

XML Compare works by matching together elements that have the same name, and where possible, the same or similar contents. This means that a paragraph (<p> element) can only ever be compared against another paragraph and will never be compared against a note (<note> element). Understanding this is a key part of understanding how the comparison works.
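The name-matching rule can be illustrated with a short Python sketch. This is not XML Compare's implementation; the function name candidate_pairs and the sample markup are assumptions made for illustration. It only shows that the candidates for pairing are restricted to elements with the same name.

```python
import xml.etree.ElementTree as ET

def candidate_pairs(parent_a, parent_b):
    """Return the (a, b) child pairs that are eligible for matching.

    Only children with the same element name are eligible, so a <p>
    is never paired with a <note>.
    """
    return [
        (a, b)
        for a in parent_a
        for b in parent_b
        if a.tag == b.tag
    ]

body_a = ET.fromstring(
    "<body><p>Install the tool.</p><note>Back up first.</note></body>"
)
body_b = ET.fromstring(
    "<body><note>Back up your data first.</note><p>Install the tool now.</p></body>"
)

pairs = [(a.tag, b.tag) for a, b in candidate_pairs(body_a, body_b)]
print(pairs)  # [('p', 'p'), ('note', 'note')]
```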

When deciding which elements match best, e.g. which amongst a number of possible paragraph pairings is the best match, XML Compare uses the words within an element. Elements that have the same or similar content are much more likely to be matched together than those that are quite different. Once this matching phase has taken place, XML Compare will then compare the contents of the two elements it has matched, recursing in this fashion until it reaches the bottom of the XML structure.
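As a rough illustration of choosing the best pairing by content, the sketch below scores paragraphs by how many words they share, using Python's standard difflib module. The scoring method is an assumption made for this example and is not the algorithm XML Compare uses internally; word_similarity and best_match are hypothetical names.

```python
from difflib import SequenceMatcher

def word_similarity(text_a, text_b):
    """Similarity between 0 and 1, based on the two sequences of words."""
    return SequenceMatcher(None, text_a.split(), text_b.split()).ratio()

def best_match(paragraph, candidates):
    """Pick the candidate whose words are most similar to the paragraph."""
    return max(candidates, key=lambda c: word_similarity(paragraph, c))

old_paragraph = "Press the red button to stop the engine."
new_paragraphs = [
    "Check the oil level every month.",
    "Press the red button to stop the engine immediately.",
]

chosen = best_match(old_paragraph, new_paragraphs)
print(chosen)  # the second paragraph, which shares most of its words
```

Once a pair has been chosen in this way, the same idea would be applied again to the children of the matched elements.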

Generalization

To compare two input files, DITA Compare must first ensure that the root elements are the same. This is a requirement of the underlying tool used to compare the documents. To ensure that this is the case, the inputs are generalized back to a DITA topic using the generalization mechanism in the DITA Open Toolkit.
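The idea behind generalization can be sketched in Python. DITA records an element's ancestry in its class attribute, for example "- topic/body task/taskbody ", where the first module/name pair is the most general one. The sketch below renames each element to that most general name. It is a simplified illustration and not the DITA Open Toolkit's implementation; it assumes the class attributes are present in the markup (in real documents they are often supplied as defaults by the DTD or schema), and generalize is a hypothetical function name.

```python
import xml.etree.ElementTree as ET

def generalize(root):
    """Rename every element to the topic-level name in its class attribute."""
    for el in root.iter():
        tokens = el.get("class", "").split()
        # tokens[0] is the "-" or "+" prefix; tokens[1] is the most general
        # module/name pair, for example "topic/body".
        if len(tokens) >= 2 and "/" in tokens[1]:
            el.tag = tokens[1].split("/", 1)[1]
    return root

task = ET.fromstring(
    '<task class="- topic/topic task/task " id="t1">'
    '<title class="- topic/title ">Replace the filter</title>'
    '<taskbody class="- topic/body task/taskbody "/>'
    '</task>'
)

generalize(task)
names = [el.tag for el in task.iter()]
print(names)  # ['topic', 'title', 'body']
```

The class attributes are left in place, so the original specialized names remain recoverable after comparison.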

Pre-processing

As well as generalizing the two inputs, the pre-processing stages perform many other tasks on the documents. Some of these tasks are described below, along with any parameter settings available for configuring them.

Removing tracked change markup from input documents

Some of the output formats generate tracked change markup. All supported tracked change markup is removed from the inputs before comparison. This avoids confusion between pre-existing changes and those added as part of the comparison. Note that this has the effect of accepting all the changes in both input documents before comparison begins.

Preserving comments and processing instructions

XML comments (text contained in <!-- ... --> markup) and processing instructions (special instructions marked as <?instruction_name more details ?>) in the document need to be converted into other XML markup in order to be output in the result document. This conversion is carried out before comparison. The resulting elements are then compared, and afterwards they are converted back into comments and processing instructions. Because there is no useful way of marking changes to comments and processing instructions, the result contains only those from the second input.
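The round trip described here can be sketched with Python's standard xml.dom.minidom module. The placeholder element names preserved-comment and preserved-pi are hypothetical and chosen only for this example; the markup DITA Compare actually uses for this step is not described in this section.

```python
from xml.dom import minidom, Node

# Hypothetical placeholder element names, used only in this sketch.
COMMENT_EL = "preserved-comment"
PI_EL = "preserved-pi"

def walk(node):
    """Yield a node and all of its descendants in document order."""
    yield node
    for child in node.childNodes:
        yield from walk(child)

def text_of(element):
    """Concatenate the text children of a placeholder element."""
    return "".join(child.data for child in element.childNodes)

def to_elements(doc):
    """Replace comments and processing instructions with elements."""
    for node in list(walk(doc.documentElement)):
        if node.nodeType == Node.COMMENT_NODE:
            el = doc.createElement(COMMENT_EL)
            el.appendChild(doc.createTextNode(node.data))
            node.parentNode.replaceChild(el, node)
        elif node.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
            el = doc.createElement(PI_EL)
            el.setAttribute("target", node.target)
            el.appendChild(doc.createTextNode(node.data))
            node.parentNode.replaceChild(el, node)

def from_elements(doc):
    """Turn the placeholder elements back into comments and PIs."""
    for node in list(walk(doc.documentElement)):
        if node.nodeType != Node.ELEMENT_NODE:
            continue
        if node.tagName == COMMENT_EL:
            comment = doc.createComment(text_of(node))
            node.parentNode.replaceChild(comment, node)
        elif node.tagName == PI_EL:
            pi = doc.createProcessingInstruction(
                node.getAttribute("target"), text_of(node)
            )
            node.parentNode.replaceChild(pi, node)

source = '<p>Text<!-- review this --><?editor hint="x"?></p>'
doc = minidom.parseString(source)

to_elements(doc)   # before comparison: comments and PIs become elements
placeholder_count = (
    len(doc.getElementsByTagName(COMMENT_EL))
    + len(doc.getElementsByTagName(PI_EL))
)

from_elements(doc)  # after comparison: convert them back
restored = doc.documentElement.toxml()
```

In this sketch the comparison step itself is omitted; the point is only that the content survives as ordinary elements between the two conversions.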

Whitespace preservation

For most DITA elements, whitespace is not significant (i.e. multiple spaces and newlines are effectively turned into a single space when converting to a published format such as PDF). Therefore, when using the DITA markup output format, such spaces are 'normalized' before comparison.

Tracked changes output formats are intended for use with editors, where 'roundtrip' processing is the typical behaviour. In this case we are typically not interested in whitespace changes, but want to preserve the document indentation. Therefore, we 'ignore' changes in whitespace.

The whitespace-processing-mode parameter provides a means for configuring how whitespace differences are to be handled.
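The difference between normalizing whitespace and ignoring whitespace changes can be shown with a small Python sketch. The function names are hypothetical, and the sketch does not reproduce the values accepted by the whitespace-processing-mode parameter, which are documented with that parameter.

```python
def normalize_space(text):
    """Collapse runs of spaces, tabs and newlines into single spaces."""
    return " ".join(text.split())

def differs_only_in_whitespace(text_a, text_b):
    """True when the texts differ, but only in their whitespace."""
    return text_a != text_b and normalize_space(text_a) == normalize_space(text_b)

indented = "Open the\n    valve   slowly."
flat = "Open the valve slowly."

# 'Normalize': the text itself is rewritten before comparison.
normalized = normalize_space(indented)

# 'Ignore': the text is left untouched (indentation survives), but a
# whitespace-only difference is not reported as a change.
whitespace_only = differs_only_in_whitespace(indented, flat)
```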

Word by word

When comparing text, the comparison can either treat each text block as a single chunk and compare one chunk against another, or treat it as a sequence of words and compare one word against another. The word-based comparison gives more understandable results at a much finer-grained level and is the default setting. If you wish to turn it off, set the word-by-word parameter to false. See the word-by-word parameter for more details.
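The following Python sketch contrasts the two granularities on a one-word edit. It uses the standard difflib module as a stand-in for the real text comparison, so the function names and the shape of the output are assumptions made for illustration only.

```python
from difflib import SequenceMatcher

def chunk_diff(text_a, text_b):
    """Treat each text block as one chunk: any change replaces the lot."""
    return [] if text_a == text_b else [("delete", text_a), ("insert", text_b)]

def word_diff(text_a, text_b):
    """Treat each text block as a sequence of words."""
    words_a, words_b = text_a.split(), text_b.split()
    changes = []
    matcher = SequenceMatcher(None, words_a, words_b)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op in ("delete", "replace"):
            changes.append(("delete", " ".join(words_a[i1:i2])))
        if op in ("insert", "replace"):
            changes.append(("insert", " ".join(words_b[j1:j2])))
    return changes

a = "Tighten the bolt with a wrench."
b = "Tighten the nut with a wrench."

print(chunk_diff(a, b))  # the whole sentence is deleted and re-inserted
print(word_diff(a, b))   # [('delete', 'bolt'), ('insert', 'nut')]
```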

Table processing

Table comparison is a complicated matter, and one of the requirements for processing DITA tables is that each table is a valid 'CALS table'. CALS is a separate standard that defines how tables should be constructed and is used as the definition for DITA tables. However, it is possible for a table to be valid according to the DITA language but semantically invalid according to the CALS table specification. Part of the input processing analyzes tables in the document, performs normalization and annotates them to inform later processing stages about their validity.

Table normalization involves the following:

Converting a column width (the colwidth attribute) value of * to 1*. These are semantically equivalent but would register as a difference when compared.
Explicitly outputting inferred column specifications (colspec elements). For example, if the first column defined is listed as column 2, then there is an inferred default entry for column 1. The input processing adds an explicit definition of such inferred colspecs.
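
The two normalization steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than DITA Compare's implementation: normalize_tgroup is a hypothetical name, the total number of columns is taken from the cols attribute of tgroup, a colspec without a colnum is assumed to follow on from the previous one, and the added colspec elements carry only a colnum attribute.

```python
import xml.etree.ElementTree as ET

def normalize_tgroup(tgroup):
    """Apply the two colspec normalizations to a CALS tgroup element."""
    cols = int(tgroup.get("cols"))
    defined = {}
    next_num = 1
    for colspec in tgroup.findall("colspec"):
        # A colspec without colnum describes the column after the previous one.
        num = int(colspec.get("colnum", next_num))
        # Step 1: "*" and "1*" mean the same, so use one spelling.
        if colspec.get("colwidth") == "*":
            colspec.set("colwidth", "1*")
        defined[num] = colspec
        next_num = num + 1
        tgroup.remove(colspec)
    # Step 2: add an explicit colspec for every column that was only inferred.
    for num in range(1, cols + 1):
        if num not in defined:
            defined[num] = ET.Element("colspec", {"colnum": str(num)})
    # Put the colspecs back, in column order, ahead of the other children.
    for index, num in enumerate(sorted(defined)):
        tgroup.insert(index, defined[num])
    return tgroup

tgroup = ET.fromstring(
    '<tgroup cols="3">'
    '<colspec colnum="2" colwidth="*"/>'
    '<colspec colwidth="2*"/>'
    '<tbody/>'
    '</tgroup>'
)
normalize_tgroup(tgroup)

colnums = [c.get("colnum") for c in tgroup.findall("colspec")]
colwidths = [c.get("colwidth") for c in tgroup.findall("colspec")]
```

In the sample, the first colspec is listed as column 2, so a colspec for column 1 is added in front of it, and the colwidth of * becomes 1*.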

DITA Compare also includes support for comparing DITA's simple tables (as identified by the topic/simpletable class attribute). Processing for these tables is slightly different as they are a much simpler form of table than CALS tables.

See Table Comparison in DITA Compare for more details on how tables are compared.

Post-processing

Specialization

As mentioned above, all inputs are generalized back to a DITA topic to ensure that they have the same root element (a prerequisite for comparison). Once comparison has taken place, the result document is specialized back to whatever DITA type was originally passed in. This works well if the input documents are both the same type, e.g. both DITA tasks. If the two inputs are different, e.g. a task and a concept, specialization can produce an invalid result, because the result can contain two different specialization types. If elements are deleted, their specialization type comes from the first input document; if they are added, it comes from the second input document. There may also be changes to the class attribute itself if two different elements, once generalized, match together. In these cases it is not obvious how to specialize the result file, and doing so is quite likely to lead to an invalid result.

The default behaviour of DITA Compare is to leave the result as a DITA topic, but with the class attributes available should you wish to edit it and then specialize. It is possible to force specialization to occur using the force-specialization parameter. At present, this will specialize the result to the type used for the second, or 'B', input. Future releases may allow the 'A' document to be chosen instead. Please be aware when using this option that the result file is quite likely to be invalid.
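
The mixed-type problem can be made concrete with a short Python sketch that inspects the class attributes of a generalized result and reports which structural specialization modules it contains. The function name structural_modules and the sample result are assumptions made for illustration; this is not how DITA Compare decides whether to specialize.

```python
import xml.etree.ElementTree as ET

def structural_modules(root):
    """Collect the most specialized module named in each class attribute."""
    modules = set()
    for el in root.iter():
        tokens = el.get("class", "").split()
        # Structural specializations use the "-" prefix; the last token is
        # the most specialized module/name pair, for example "task/steps".
        if len(tokens) >= 2 and tokens[0] == "-":
            modules.add(tokens[-1].split("/", 1)[0])
    return modules - {"topic"}

# A generalized result in which one list came from a task input and the
# body came from a concept input.
result = ET.fromstring(
    '<topic class="- topic/topic concept/concept ">'
    '<body class="- topic/body concept/conbody ">'
    '<ol class="- topic/ol task/steps "/>'
    '<p class="- topic/p ">Background information.</p>'
    '</body>'
    '</topic>'
)

modules = structural_modules(result)
can_specialize_safely = len(modules) <= 1
```

Here the result mixes the task and concept modules, so specializing it to either type would leave elements that do not belong to that type.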

Content choice

XML grammars often present authors with a choice of elements to use in a particular context. Sometimes this choice will be a list of elements, any of which can be used multiple times. Other choices involve mutually exclusive sets of elements. One example of this in DITA is the choice of steps or stepsunordered in a DITA task. If one input document contains steps and the other contains stepsunordered, the result would normally contain both, one marked as deleted and one marked as added. This means that the result file is invalid, as it is not permissible to use both of these elements.

Part of the post-processing performed by DITA Compare is to detect these cases and try to resolve them. At present, the only case supported is the one mentioned here, steps vs stepsunordered. The behaviour of DITA Compare can be configured using the steps-conflict-resolution parameter. Support for more cases will be added in future releases.
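
Detecting the steps vs stepsunordered case can be sketched in Python as a check for both elements under the same parent. The function name has_steps_conflict and the sample markup are hypothetical, the change markers that a real result would carry are omitted, and the sketch does not show how the conflict is resolved, since that depends on the steps-conflict-resolution setting.

```python
import xml.etree.ElementTree as ET

# The two mutually exclusive elements discussed in this section.
EXCLUSIVE = {"steps", "stepsunordered"}

def has_steps_conflict(taskbody):
    """True when both mutually exclusive elements appear under taskbody."""
    present = {child.tag for child in taskbody if child.tag in EXCLUSIVE}
    return present == EXCLUSIVE

# A comparison result containing the element from each input.
conflicting = ET.fromstring(
    "<taskbody>"
    "<steps><step><cmd>Open the cover.</cmd></step></steps>"
    "<stepsunordered><step><cmd>Open the cover.</cmd></step></stepsunordered>"
    "</taskbody>"
)
valid = ET.fromstring(
    "<taskbody><steps><step><cmd>Open the cover.</cmd></step></steps></taskbody>"
)

conflict_found = has_steps_conflict(conflicting)
no_conflict = has_steps_conflict(valid)
```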

Conflict resolution of id attributes

When comparing aligned topics it is possible to encounter conflicting 'id' definitions. Conflicting 'id' definitions can cause problems when producing red-line documents from the DITA-Markup output format. Therefore, when the DITA-Markup output format is selected, conflicting 'id' definitions are renamed to avoid such conflicts. The associated direct cross-referencing and reuse attributes, 'href' and 'conref', are also updated to be consistent with the renamed 'id' attributes.

When performing a topic-only comparison, global DITA cross-references are not processed using the conflict resolution scheme described above. Here, a global cross-reference is of the form uri#topic-id[/element-id] and a local cross-reference is of the form #topic-id[/element-id].
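
The two cross-reference forms can be told apart mechanically, as the Python sketch below shows. The regular expression and the classify_reference function are assumptions written for this example, based only on the two forms given above.

```python
import re

# uri#topic-id[/element-id], where an empty uri part means a local reference.
REFERENCE = re.compile(
    r"^(?P<uri>[^#]*)#(?P<topic>[^/#]+)(?:/(?P<element>[^/#]+))?$"
)

def classify_reference(value):
    """Return (scope, topic_id, element_id), or None if there is no fragment."""
    match = REFERENCE.match(value)
    if match is None:
        return None
    scope = "global" if match.group("uri") else "local"
    return scope, match.group("topic"), match.group("element")

print(classify_reference("install.dita#setup/step2"))  # ('global', 'setup', 'step2')
print(classify_reference("#setup/step2"))              # ('local', 'setup', 'step2')
print(classify_reference("#setup"))                    # ('local', 'setup', None)
```

Under the rule above, only references classified as local would be updated by the conflict resolution scheme during a topic-only comparison.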