Table Comparison in DITA Compare

DITA tables, which use the CALS table model, are handled slightly differently from the rest of the document because change, and structural change in particular, is harder to display in tables. For this reason, structural table changes are shown at several levels of granularity. The main aim of table processing is to produce a result in which the changes can be seen in as much detail as possible while the result document remains valid against both the DITA specification and the CALS table specification. An invalid result document can cause problems further down the publishing pipeline.

Simple Structural Change

When the column definitions of the two tables have not changed, changes to column or row spanning can be represented at the granularity of the affected rows. Rather than repeating the whole table in two tgroup elements, it is possible to repeat only individual rows, or in some cases a set of consecutive rows. In both cases the format is the same: the original document rows (marked with status="deleted") are followed by the latest document rows (marked with status="new"). The number of rows repeated depends on the type of structural change. If the change affects column spanning within a single row that does not overlap other rows and is not itself overlapped, only that single row needs to be repeated. If the column spanning change occurs on a row that overlaps other rows or is itself overlapped, all of the rows linked by the row spanning must be grouped and repeated together. The same applies to any change to row spanning.
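The grouping rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not DITA Compare's implementation: each row of one table version is reduced to the largest CALS morerows value found on any of its cells, and the function names row_span_groups and rows_to_repeat are hypothetical. A real comparison would also have to take the spans of the other table version into account.

```python
from typing import Iterable, List, Tuple


def row_span_groups(max_morerows: List[int]) -> List[Tuple[int, int]]:
    """Split rows into inclusive (start, end) index ranges that no row span crosses.

    max_morerows[i] is assumed to be the largest CALS @morerows value on any
    cell of row i (0 when no cell in that row spans downwards).
    """
    groups = []
    start = 0
    reach = 0  # last row index covered by a span that started in this group
    for i, more in enumerate(max_morerows):
        reach = max(reach, i + more)
        if reach == i:  # nothing spans past row i, so the group closes here
            groups.append((start, i))
            start = i + 1
    if start < len(max_morerows):  # a span ran past the last row (invalid input)
        groups.append((start, len(max_morerows) - 1))
    return groups


def rows_to_repeat(
    max_morerows: List[int], changed_rows: Iterable[int]
) -> List[Tuple[int, int]]:
    """Return the row ranges that would be repeated as deleted rows, then new rows."""
    changed = set(changed_rows)
    return [
        (start, end)
        for start, end in row_span_groups(max_morerows)
        if any(start <= row <= end for row in changed)
    ]


# Row 1 has a cell with morerows="2", so rows 1 to 3 are linked by a row span.
spans = [0, 2, 0, 0, 0]
print(rows_to_repeat(spans, {4}))  # [(4, 4)]
print(rows_to_repeat(spans, {2}))  # [(1, 3)]
```

A spanning change in row 4, which is not linked to any other row, repeats only that row, whereas a change in row 2 pulls in every row linked to it by the row span.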

Complex Structural Change

Some structural changes are too complex to represent in a single result table section (the tgroup element). In these cases the result document contains a table with two table sections: the first contains the table from the original document, marked with a status="deleted" attribute, and the second contains the table from the latest document, marked with a status="new" attribute. Although individual changes to rows, cells, and so on cannot be seen, both versions of the table are visible and, provided that both inputs were valid, the result document is guaranteed to be valid.

This type of result is produced when a table contains changes to row or column spanning as well as changes to the column definitions (e.g. changed column names or added/deleted columns).
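The conditions described so far can be summarised as a small decision function. This is a sketch for orientation only: the function name and the returned labels are hypothetical, and the final branch generalises from the column deletion example given under Other Changes rather than from a documented rule.

```python
def table_change_granularity(column_defs_changed: bool, spanning_changed: bool) -> str:
    """Pick the level at which a table change is reported (labels are illustrative)."""
    if spanning_changed and column_defs_changed:
        # Too complex for one tgroup: the whole table section is repeated,
        # first with status="deleted", then with status="new".
        return "repeat-tgroup"
    if spanning_changed:
        # Column definitions are unchanged: repeat only the affected rows.
        return "repeat-rows"
    # No spanning change: mark individual cells in place.
    return "mark-cells"


print(table_change_granularity(column_defs_changed=True, spanning_changed=True))
print(table_change_granularity(column_defs_changed=False, spanning_changed=True))
print(table_change_granularity(column_defs_changed=True, spanning_changed=False))
```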

Improved Table Processing in the 10.0.0 Release

Comparing tables in documents is complex: there is a trade-off between capturing the structural changes in the underlying markup and showing the content changes as the user sees them. At DeltaXML we have been listening to our customers and working on a new approach to comparing tables that emphasises the content changes as users see them. This should avoid results in which tables are irregular and should better reveal fine-grained modifications to cell content. There is also now an output presentation that preserves the spans from one version and shows content changes within them.

This 10.0.0 release includes our latest table comparison algorithms for CALS tables. The old approach was based roughly on two techniques and stemmed partly from the goal that it should be possible to reconstruct both the A and B documents from the result:

Keying cells, in sequence, according to the column from which they came.
Where there were structural changes, such as a span in one input that is not present in the other, it is not possible to show both A and B together, so the idea was to show separate A and B rows or even tables.

The new approach is based on different techniques:

Comparing the columns of a table first, based on the content that they contain.
Reconstructing a non-ragged table that shows column and row additions and deletions, but basing the final structure on that of the B input.

Some information about the A structure is therefore lost in order to show changes at a finer granularity. For example, where a B span merges what were individual cells in A, it will not always be easy to tell which A values came from which columns. A and B values will not, however, be lost.
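To give a feel for what comparing columns by their content can mean, here is a deliberately simple sketch. It is not the algorithm used in the product: the similarity measure (overlap of distinct cell texts), the threshold, the left-to-right greedy matching, and all function names are assumptions chosen for brevity, and the tables are assumed to have no spans.

```python
from typing import List, Optional, Tuple

Table = List[List[str]]  # rows of cell texts; no spans assumed


def column(table: Table, idx: int) -> List[str]:
    """Return the cell texts of one column, top to bottom."""
    return [row[idx] for row in table]


def similarity(col_a: List[str], col_b: List[str]) -> float:
    """Share of distinct cell texts that the two columns have in common."""
    a, b = set(col_a), set(col_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def align_columns(
    table_a: Table, table_b: Table, threshold: float = 0.5
) -> List[Tuple[Optional[int], Optional[int]]]:
    """Pair A columns with B columns left to right.

    (i, j) means A column i matches B column j, (i, None) means the column
    was deleted, and (None, j) means the column was added.
    """
    n_a = len(table_a[0]) if table_a else 0
    n_b = len(table_b[0]) if table_b else 0
    pairs: List[Tuple[Optional[int], Optional[int]]] = []
    next_a = 0  # first A column not yet matched or reported as deleted
    for b_idx in range(n_b):
        col_b = column(table_b, b_idx)
        match = None
        for a_idx in range(next_a, n_a):
            if similarity(column(table_a, a_idx), col_b) >= threshold:
                match = a_idx
                break
        if match is None:
            pairs.append((None, b_idx))
        else:
            pairs.extend((skipped, None) for skipped in range(next_a, match))
            pairs.append((match, b_idx))
            next_a = match + 1
    pairs.extend((a_idx, None) for a_idx in range(next_a, n_a))
    return pairs


table_a = [["Name", "Tag", "Price"], ["Alpha", "Fast", "10"], ["Beta", "Small", "20"]]
table_b = [["Name", "Price", "Stock"], ["Alpha", "10", "5"], ["Beta", "25", "7"]]
print(align_columns(table_a, table_b))
# [(0, 0), (1, None), (2, 1), (None, 2)]
```

In this example the "Tag" column is reported as deleted and the "Stock" column as added, while the "Price" column is still matched even though one of its values changed.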

For more details, please see the DITA Compare 10.0.0 Tester User Guide.

Orderless Tables

Sometimes the order of rows within a table is insignificant. For example, consider a simple product information table, where the first column of the table contains a unique product name, the second column its 'tag line', the third column its standard price, etc. The rows in this table can be reasonably ordered in a variety of ways, such as by 'name', or by 'price'. When two versions of a document are compared that use different row ordering mechanisms, a significant number of rows are likely to be added and deleted due to them moving position. If such differences are insignificant then an orderless row comparison would be useful.

Orderless row comparison support can be provided so long as there is no row spanning within the tables being compared. In such cases, the <?dxml-orderless-rows?> processing instruction can be added within the element that directly contains the rows that are to be processed in an orderless fashion. It is important to ensure that this processing instruction is added to the relevant table in both input documents.

The orderless comparison algorithm works much better when rows have unique keys. Adding a <?dxml-key id1?> processing instruction within the element that directly contains the row sets that row's key to 'id1'. It is also possible to specify the row 'cell position' that is used to define the default value of a row's key. For example, the <?dxml-orderless-rows cell-pos:2?> processing instruction specifies that the text content of the row's second cell (for example, an <entry> or <stentry> element) should be used as the row's key. Note that the cell position takes no account of 'column' data (such as the @colnum attribute); it simply counts cells.
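The effect of keying rows by cell position can be illustrated with a short Python sketch. It is not the comparison algorithm itself: the sample table bodies, the placement of the processing instruction as the first child of tbody, and the function names are assumptions made for illustration. Python's ElementTree discards processing instructions by default, so the cell position is passed to the function explicitly rather than read from the instruction.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical inputs: the same product list, ordered by price in A and by name in B.
TBODY_A = """<tbody><?dxml-orderless-rows cell-pos:2?>
  <row><entry>10.00</entry><entry>Alpha</entry></row>
  <row><entry>20.00</entry><entry>Gamma</entry></row>
  <row><entry>30.00</entry><entry>Beta</entry></row>
</tbody>"""

TBODY_B = """<tbody><?dxml-orderless-rows cell-pos:2?>
  <row><entry>10.00</entry><entry>Alpha</entry></row>
  <row><entry>35.00</entry><entry>Beta</entry></row>
  <row><entry>5.00</entry><entry>Delta</entry></row>
</tbody>"""


def keyed_rows(tbody_xml: str, cell_pos: int) -> Dict[str, List[str]]:
    """Map each row's key (text of its cell_pos-th cell, counting from 1) to its cell texts."""
    keyed: Dict[str, List[str]] = {}
    for row in ET.fromstring(tbody_xml).findall("row"):
        cells = ["".join(cell.itertext()).strip() for cell in row]
        key = cells[cell_pos - 1]  # counts cells only; @colnum is ignored
        if key in keyed:
            raise ValueError(f"duplicate row key: {key!r}")
        keyed[key] = cells
    return keyed


def compare_orderless(a_xml: str, b_xml: str, cell_pos: int) -> Dict[str, List[str]]:
    """Classify rows by key, regardless of where they appear in each table."""
    a, b = keyed_rows(a_xml, cell_pos), keyed_rows(b_xml, cell_pos)
    shared = a.keys() & b.keys()
    return {
        "deleted": sorted(a.keys() - b.keys()),
        "added": sorted(b.keys() - a.keys()),
        "changed": sorted(k for k in shared if a[k] != b[k]),
        "unchanged": sorted(k for k in shared if a[k] == b[k]),
    }


result = compare_orderless(TBODY_A, TBODY_B, cell_pos=2)
print(result)
```

Because rows are matched by key, the 'Beta' row is reported as changed (its price differs) rather than as a deletion in one position and an addition in another.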

Other Changes

Other kinds of simple structural change can be represented within a single table without needing to repeat any rows. For example, column deletion in a table that does not have any changes to column or row spanning can be represented by marking each of the deleted cells with the status="deleted" attribute.
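The column deletion case can be sketched as follows. This is an illustration only, under the assumption that the table has no row or column spanning, so that a cell's position in its row equals its column position; the function name and the sample table body are hypothetical, and colspec handling is omitted.

```python
import xml.etree.ElementTree as ET


def mark_column_deleted(tbody: ET.Element, col_pos: int) -> None:
    """Set status="deleted" on the col_pos-th entry (counting from 1) of every row.

    Assumes no spanning, so cell position and column position coincide.
    """
    for row in tbody.findall("row"):
        entries = row.findall("entry")
        entries[col_pos - 1].set("status", "deleted")


# Hypothetical original table body; the latest version dropped the second column.
tbody = ET.fromstring(
    "<tbody>"
    "<row><entry>Alpha</entry><entry>Fast</entry><entry>10.00</entry></row>"
    "<row><entry>Beta</entry><entry>Small</entry><entry>30.00</entry></row>"
    "</tbody>"
)
mark_column_deleted(tbody, 2)
print(ET.tostring(tbody, encoding="unicode"))
```

After the call, the second entry of each row carries status="deleted" and no row has been repeated.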