Configuration for Processing CALS Tables
Introduction
This page gives a detailed explanation of the settings used for processing CALS tables. DCP settings are also summarised in the DCP Schema Guide.
The use of CALS tables is explained in the page Comparing Document Tables (CALS or HTML).
Note that this page is only applicable when using the Document Comparator.
Improved table processing in 12.0.0 release
Comparing tables in documents is complex: there is a trade-off between capturing the structural changes in the underlying markup and showing the content changes as the user sees them. DeltaXML has listened to customers and developed a new approach to comparing tables that emphasises the content changes as users see them. This should avoid results where tables are irregular and should better reveal finer-grained modifications to cell content. There is also now an output presentation that preserves the spans from one version and shows content changes within them.
In this 12.0.0 release we have included our latest table comparison algorithms for CALS tables. The old approach was roughly based on two techniques and stemmed partly from the goal that it should be possible to reconstruct both the A and B documents from the result:
Keying cells according to the column from which they came, in sequence.
Where there were structure changes, such as a span in one input that is not there in the other, it is not possible to show both A and B, so the idea was to show separate A and B rows or even tables.
The new approach is based on different techniques:
Comparing the columns of a table first, based on the content that they contain.
Reconstructing a non-ragged table showing column and row adds and deletes, but basing the final structure on that of the B input. Some information about the A structure is therefore lost in order to show changes at a finer granularity. For example, where a B span merges what were individual cells in A, it will not always be easy to tell which A values came from which columns. A and B values will not, however, be lost.
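To make the idea of aligning columns by their content more concrete, here is a toy sketch. It is not DeltaXML's algorithm: the similarity measure (Jaccard overlap of cell values), the greedy pairing, the threshold of 0.5 and the function names are all assumptions chosen only for illustration.

```python
def jaccard(a, b):
    """Overlap of two columns' cell values, from 0.0 (disjoint) to 1.0 (same set)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def align_columns(cols_a, cols_b, threshold=0.5):
    """Greedily pair columns of A and B by content similarity.

    Each column is a list of cell strings. Returns (pairs, deleted, added),
    where pairs holds (index_in_a, index_in_b) tuples and the other two lists
    hold the indexes of unmatched A columns and unmatched B columns.
    """
    scores = sorted(
        ((jaccard(ca, cb), i, j)
         for i, ca in enumerate(cols_a)
         for j, cb in enumerate(cols_b)),
        key=lambda t: (-t[0], t[1], t[2]),  # best score first, then stable order
    )
    pairs, used_a, used_b = [], set(), set()
    for score, i, j in scores:
        if score < threshold:
            break  # everything after this is too dissimilar to pair
        if i not in used_a and j not in used_b:
            pairs.append((i, j))
            used_a.add(i)
            used_b.add(j)
    deleted = [i for i in range(len(cols_a)) if i not in used_a]
    added = [j for j in range(len(cols_b)) if j not in used_b]
    return sorted(pairs), deleted, added


table_a = [["id", "1", "2"], ["name", "ann", "bob"]]
table_b = [["name", "ann", "bob"], ["age", "30", "41"], ["id", "1", "2"]]
pairs, deleted, added = align_columns(table_a, table_b)
```

In this toy, columns whose cells are all empty score 0.0 against everything and are never paired, which gives some intuition for why content-based alignment struggles with empty or highly repetitive tables.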
Validation
CALS tables are validated by a Schematron file, which is published on GitHub. There are different ways to configure the messages; these are explained below. CALS tables are those that have a <tgroup> element.
Each message includes the XPath of the element concerned. This allows the identification of the erroneous element in the input file within an XML editor. Alternatively you may search for the phrase CALS Table Validation Warning in the result document.
The CALS reference refers to page https://www.oasis-open.org/specs/a502.htm. The Semantic description for the CALS table model section has a number of tables. For example, reference T5R1 refers to Table 5, Row 1.
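If you collect many warnings, it can help to split the reference code into its table and row numbers. The sketch below assumes the code format seen in the sample messages on this page (for example CALS-T10R4F or CALS-T10R11). The function name is hypothetical, and the meaning of any trailing letter is not explained here, so it is returned unchanged.

```python
import re

# Pattern assumed from the sample messages: "T", table number, "R", row number,
# then an optional trailing capital letter, with an optional "CALS-" prefix.
CALS_REF = re.compile(r"\b(?:CALS-)?T(\d+)R(\d+)([A-Z]?)\b")


def parse_cals_reference(text):
    """Return (table, row, suffix) for the first reference code in text, or None."""
    match = CALS_REF.search(text)
    if match is None:
        return None
    table, row, suffix = match.groups()
    return int(table), int(row), suffix
```

For example, parse_cals_reference("T5R1") gives (5, 1, ""), that is Table 5, Row 1 of the specification.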
Configuration
New Column Alignment Features
The new table alignment algorithm uses a different approach for aligning columns in compared tables to detect if a column has been added, deleted or moved.
‘Ordered’ and ‘Orderless’ Table Columns
By default, column alignment by the comparator is ‘ordered’. That is, in two compared tables, columns are aligned such that the order of columns in each table is regarded as significant.
In many, if not most, types of data table, column order is not significant. Column headings identify columns such that if the column position changes, the meaning of the table does not change. We describe such tables as having ‘orderless’ columns.
Column Alignment
Column Keying
It may be that you already know which columns from A and B you wish to align. In this case you can use the new @deltaxml:table-column-keys control attribute on a tgroup to give keys to columns, so that your pre-determined alignment overrides the value-based alignment that is the new default.
If you are happy that the colspec colname or the implicit column position can be used as a key, you can use the configuration parameter columnKeyingMode and the product will do this for you. In cases where the old approach gave you what you wanted, this should do the job. If there are some tables which are better without column keys, you will have to create an input filter to insert the deltaxml:table-column-keys attribute values where necessary.
If you are using DCP, you can configure this using columnKeyingMode under standardConfig/calsTableConfiguration. If you are using the Java API, use the setColumnKeyingMode method on the CalsTableConfiguration object.
The processing instruction <?dxml-column-keying-mode auto|colname|position?> can also be used inside the tgroup element to apply the column keying mode. User-defined column keys can also be applied using the processing instruction <?dxml-column-keys "One, Two, Three, Four,"?> within the tgroup element, where the comma-separated list identifies the columns.
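As an illustration of pre-processing an input file, the following sketch adds the column keying mode processing instruction to every tgroup before comparison. It uses only the Python standard library. The function name is hypothetical, and two things are assumptions: that tgroup elements are in no namespace (as in the samples on this page) and that the instruction may be placed as the first child of the tgroup.

```python
import xml.etree.ElementTree as ET

VALID_MODES = ("auto", "colname", "position")


def add_column_keying_mode(xml_text, mode="colname"):
    """Insert <?dxml-column-keying-mode MODE?> as the first child of each tgroup."""
    if mode not in VALID_MODES:
        raise ValueError("mode must be one of: " + ", ".join(VALID_MODES))
    root = ET.fromstring(xml_text)
    # Take a snapshot of the tgroups so the tree is not modified while iterating.
    for tgroup in list(root.iter("tgroup")):
        pi = ET.ProcessingInstruction("dxml-column-keying-mode", mode)
        tgroup.insert(0, pi)
    return ET.tostring(root, encoding="unicode")


sample = '<table><tgroup cols="1"><tbody><row><entry>x</entry></row></tbody></tgroup></table>'
keyed = add_column_keying_mode(sample, "colname")
```

The same pattern could be adapted to insert the dxml-column-keys instruction for tables where you want to supply your own keys.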
Orderless Columns
The order of columns after any ‘header row’ may not be important to you, and you may wish to see the differences to cell values in columns rather than the fact that the whole column has been moved. Putting a control attribute @deltaxml:table-columns-ordered="false" on a tgroup means that the columns will be compared in an orderless fashion without showing swaps or movement.
If you are using DCP, you can configure this using ignoreColumnOrder under standardConfig/calsTableConfiguration. If you are using the Java API, use the setIgnoreColumnOrder method on the CalsTableConfiguration object.
For more details about the new features and how to use them, please see the sample CALS Tables Column Ordering.
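If you prefer to set the control attribute in an input filter, a minimal sketch in Python might look like this. The namespace URI bound to the deltaxml prefix is an assumption here and should be checked against your own input files. The function name is hypothetical, and tgroup is assumed to be in no namespace.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the "deltaxml" prefix; verify before use.
DELTAXML_NS = "http://www.deltaxml.com/ns/well-formed-delta-v1"
ET.register_namespace("deltaxml", DELTAXML_NS)


def mark_columns_orderless(xml_text):
    """Set deltaxml:table-columns-ordered="false" on every tgroup."""
    root = ET.fromstring(xml_text)
    for tgroup in root.iter("tgroup"):
        tgroup.set("{%s}table-columns-ordered" % DELTAXML_NS, "false")
    return ET.tostring(root, encoding="unicode")


orderless = mark_columns_orderless('<table><tgroup cols="2"/></table>')
```

ElementTree declares the namespace on the root element of the output, so the attribute is serialised with the deltaxml prefix.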
processCalsTables
true (default)
A table has to have a <tgroup> to be seen as a CALS table.
false
If you have no CALS tables in the input documents, the comparison may be faster if you switch CALS table processing off.
Use false if you have <tgroup> elements that are not part of a CALS table.
warningReportMode
processingInstructions (default)
By default the CALS Table Validation Warning appears in a processing instruction. In a large file you could search for the string CALS Table Validation Warning.
Sample result file showing processing instruction
<tgroup deltaxml:deltaV2="B" cols="3"><?dxml_warn CALS Table Validation Warning for Input B:
In /Q{}article[1]/Q{}table[1]/Q{}tgroup[1]/Q{}tbody[1]/Q{}row[1]/Q{}entry[2] the column specified by the namest attribute (c) must be to the left of the column specified by nameend (b) CALS-T10R4F?>
<colspec colnum="1" colname="a"/>
<colspec colnum="2" colname="b"/>
<colspec colnum="3" colname="c"/>
<tbody>
<row><entry colname="a" morerows="2"/><entry namest="c" nameend="b"/></row>
<row><entry colname="b"/><entry colname="c" morerows="1"/></row>
<row><entry colname="b"/></row>
</tbody>
</tgroup>
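The warning in the sample above reports a span whose namest column lies to the right of its nameend column. The following sketch shows one way to check for that condition yourself before running a comparison. It is an illustration of the rule as worded in the message, not the Schematron implementation. The function name is hypothetical, and column positions are taken from colnum when present, otherwise from the order of the colspec elements.

```python
import xml.etree.ElementTree as ET


def reversed_spans(tgroup):
    """Return (namest, nameend) pairs where namest is to the right of nameend."""
    positions = {}
    for index, colspec in enumerate(tgroup.findall("colspec"), start=1):
        # Simplification: use colnum if given, else the colspec's position.
        positions[colspec.get("colname")] = int(colspec.get("colnum", index))
    problems = []
    for entry in tgroup.iter("entry"):
        start, end = entry.get("namest"), entry.get("nameend")
        if start in positions and end in positions and positions[start] > positions[end]:
            problems.append((start, end))
    return problems


tgroup = ET.fromstring(
    '<tgroup cols="3">'
    '<colspec colnum="1" colname="a"/>'
    '<colspec colnum="2" colname="b"/>'
    '<colspec colnum="3" colname="c"/>'
    '<tbody>'
    '<row><entry colname="a" morerows="2"/><entry namest="c" nameend="b"/></row>'
    '<row><entry namest="a" nameend="b"/></row>'
    '</tbody>'
    '</tgroup>'
)
problems = reversed_spans(tgroup)
```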
message
When using a value of message for this parameter and running from the command line, you will obtain output like this if there is an error:
Sample Command line output when warningReportMode is message
DeltaXML Command Processor, version: 2.1
Copyright (c) 2000-2015 DeltaXML Ltd. All rights reserved.
Using: XML-Compare, version: 10.1.1
CALS Table Validation Warning for Input A=B: The @morerows on /Q{}article[1]/Q{}table[4]/Q{}tgroup[1]/Q{}tbody[1]/Q{}row[1]/Q{}entry[1] needs to be a non-negative integer CALS-T10R11
comments
If you set the warningReportMode to comments then an XML comment is inserted after the <tgroup> start tag of a table that has an error.
Sample result file showing comments
<tgroup deltaxml:deltaV2="B" cols="3"><!--CALS Table Validation Warning for Input B:
In /Q{}article[1]/Q{}table[1]/Q{}tgroup[1]/Q{}tbody[1]/Q{}row[1]/Q{}entry[2] the column specified by the namest attribute (c) must be to the left of the column specified by nameend (b) CALS-T10R4F-->
<colspec colnum="1" colname="a"/>
<colspec colnum="2" colname="b"/>
<colspec colnum="3" colname="c"/>
<tbody>
<row><entry colname="a" morerows="2"/><entry namest="c" nameend="b"/></row>
<row><entry colname="b"/><entry colname="c" morerows="1"/></row>
<row><entry colname="b"/></row>
</tbody>
</tgroup>
The expression can be copied and pasted into an XPath builder in an XML editor such as oXygen and, with the focus on the file with the problem, the element concerned will be highlighted.
For both PIs and comments the message will appear in the tgroup with the problem.
Note that both the comments and processingInstructions options include the wording CALS Table Validation Warning so that searching the result file for this phrase will show the errors.
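For large result files, the search can be scripted. The sketch below collects the text of every dxml_warn processing instruction and of every comment containing the phrase, then pulls out the XPath so it can be pasted into an editor. The function names are hypothetical. The XPath pattern assumes expressions of the form shown in the samples (starting with /Q{ and containing no whitespace), and TreeBuilder's insert_pis and insert_comments options require Python 3.8 or later.

```python
import re
import xml.etree.ElementTree as ET

PHRASE = "CALS Table Validation Warning"
# Assumed shape of the reported XPath: starts with "/Q{...}" and runs to the
# next whitespace character, as in the samples on this page.
XPATH = re.compile(r"/Q\{[^}]*\}\S+")


def find_validation_warnings(xml_text):
    """Return warning texts found in dxml_warn PIs and in matching comments."""
    builder = ET.TreeBuilder(insert_pis=True, insert_comments=True)
    root = ET.fromstring(xml_text, parser=ET.XMLParser(target=builder))
    found = []
    for node in root.iter():
        if node.tag is ET.ProcessingInstruction:
            # ElementTree stores a PI's text as "target data".
            target, _, body = (node.text or "").partition(" ")
            if target == "dxml_warn":
                found.append(body.strip())
        elif node.tag is ET.Comment and PHRASE in (node.text or ""):
            found.append(node.text.strip())
    return found


def extract_xpath(message):
    """Return the first XPath-like expression in a warning message, or None."""
    match = XPATH.search(message)
    return match.group(0) if match else None
```

A typical use is to call find_validation_warnings on the text of the result file and then extract_xpath on each message.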
invalidTableBehaviour
propagateUp (default)
This propagates the error up to the next tgroup. Remember that the XML will not be considered to be a CALS table if there is no tgroup element.
Sample default result
<tgroup deltaxml:deltaV2="B" cols="3"><?dxml_warn CALS Table Validation Warning for Input B:
In /Q{}article[1]/Q{}table[1]/Q{}tgroup[1]/Q{}tbody[1]/Q{}row[1]/Q{}entry[2] the column specified by the namest attribute (c) must be to the left of the column specified by nameend (b) CALS-T10R4F?>
<colspec colnum="1" colname="a"/>
<colspec colnum="2" colname="b"/>
<colspec colnum="3" colname="c"/>
<tbody>
<row><entry colname="a" morerows="2"/><entry namest="c" nameend="b"/></row>
<row><entry colname="b"/><entry colname="c" morerows="1"/></row>
<row><entry colname="b"/></row>
</tbody>
</tgroup>
compareAsXml
This processes the table as if it is just XML and not a table. The warning is still given, defaulting to a processing instruction on the <tgroup>.
Sample result file when behaviour is set to compareAsXml
<tgroup deltaxml:deltaV2="A!=B" cols="3">
<?dxml_warn CALS Table Validation Warning for Input B:
In /Q{}article[1]/Q{}table[1]/Q{}tgroup[1]/Q{}tbody[1]/Q{}row[1]/Q{}entry[2] the column specified by the namest attribute (c) must be to the left of the column specified by nameend (b) CALS-T10R4F?>
...
<entry deltaxml:deltaV2="A!=B">
<deltaxml:attributes deltaxml:deltaV2="A!=B">
<dxa:namest deltaxml:deltaV2="A!=B">
<deltaxml:attributeValue deltaxml:deltaV2="A">b</deltaxml:attributeValue>
<deltaxml:attributeValue deltaxml:deltaV2="B">c</deltaxml:attributeValue>
</dxa:namest>
<dxa:nameend deltaxml:deltaV2="A!=B">
<deltaxml:attributeValue deltaxml:deltaV2="A">c</deltaxml:attributeValue>
<deltaxml:attributeValue deltaxml:deltaV2="B">b</deltaxml:attributeValue>
</dxa:nameend>
</deltaxml:attributes>
</entry>
...
</tgroup>
fail
This will fail with a message starting Detected invalid table(s), for example:
Sample Output when behaviour is set to fail
java -jar /usr/local/deltaxml/DeltaXML-XML-Compare-10_1_1_j/command-10.1.1.jar compare CALS-fail valid.xml invalid.xml result-fail.xml
DeltaXML Command Processor, version: 2.1
Copyright (c) 2000-2015 DeltaXML Ltd. All rights reserved.
Using: XML-Compare, version: 10.1.1
Comparison runtime fault: Exception thrown when attempting to run the 'result/0-output-part-1/4-dxml-calsTable/7-validity-reporter' step (source: 'net.sf.saxon.s9api.XsltExecutable@102dff25')
Exception Stack Trace:
com.deltaxml.cores9api.FilterProcessingSingleException: Exception thrown when attempting to run the 'result/0-output-part-1/4-dxml-calsTable/7-validity-reporter' step (source: 'net.sf.saxon.s9api.XsltExecutable@102dff25')
at com.deltaxml.cores9api.RunnableFilterChain.c_a(RunnableFilterChain.java:30)
at com.deltaxml.cores9api.RunnableFilterChain.c_a(RunnableFilterChain.java:172)
at com.deltaxml.cores9api.PipelinedComparatorS9.c_b(PipelinedComparatorS9.java:387)
at com.deltaxml.cores9api.PipelinedComparatorS9.c_a(PipelinedComparatorS9.java:707)
at com.deltaxml.cores9api.DocumentComparator.c_a(DocumentComparator.java:679)
at com.deltaxml.cores9api.AbstractComparator.compare(AbstractComparator.java:90)
at com.deltaxml.cores9api.DocumentComparator.compare(DocumentComparator.java:278)
at com.deltaxml.cmdline.PipelinedTextUI.c_b(PipelinedTextUI.java:345)
at com.deltaxml.cmdline.PipelinedTextUI.<init>(PipelinedTextUI.java:389)
at com.deltaxml.cmdline.PipelinedTextUI.main(PipelinedTextUI.java:97)
Caused by: net.sf.saxon.trans.XPathException: Detected invalid table(s):
Input B: In /Q{}article[1]/Q{}table[1]/Q{}tgroup[1]/Q{}tbody[1]/Q{}row[1]/Q{}entry[2] the column specified by the namest attribute (c) must be to the left of the column specified by nameend (b) CALS-T10R4F
calsValidationLevel
strict
relaxed (default)
If spanname is used in a thead or tfoot that defines its own colspec elements, no error is reported in relaxed mode. In strict mode you will see an error like:
Error given in strict validation mode
Use of the spanname attribute in a thead/tfoot is not allowed when local colspec elements are defined.
Relaxed validation does not give any error for the following tgroup even though the <thead> has both a colspec and a spanname.
Sample tgroup illustrating the difference between 'strict' and 'relaxed' validation
<tgroup cols="2">
<colspec colnum="1" colname="a"/>
<colspec colnum="2" colname="b"/>
<spanspec spanname="ab" namest="a" nameend="b"/>
<thead>
<colspec colnum="1" colname="a"/>
<colspec colnum="2" colname="b"/>
<row><entry spanname="ab"/></row>
</thead>
<tbody>
<row><entry spanname="ab"/></row>
</tbody>
</tgroup>
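To find tables that would trigger this strict-mode message before you change the validation level, a check along the following lines can be run over your inputs. It is a sketch of the condition as described above, not the Schematron rule itself, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def spanname_with_local_colspec(tgroup):
    """Return the thead/tfoot elements that define their own colspec and also
    contain an entry carrying a spanname attribute."""
    offenders = []
    for section in tgroup.findall("thead") + tgroup.findall("tfoot"):
        has_local_colspec = section.find("colspec") is not None
        uses_spanname = any(
            entry.get("spanname") is not None for entry in section.iter("entry")
        )
        if has_local_colspec and uses_spanname:
            offenders.append(section)
    return offenders


tgroup = ET.fromstring(
    '<tgroup cols="2">'
    '<colspec colnum="1" colname="a"/><colspec colnum="2" colname="b"/>'
    '<spanspec spanname="ab" namest="a" nameend="b"/>'
    '<thead><colspec colnum="1" colname="a"/><colspec colnum="2" colname="b"/>'
    '<row><entry spanname="ab"/></row></thead>'
    '<tbody><row><entry spanname="ab"/></row></tbody>'
    '</tgroup>'
)
offenders = spanname_with_local_colspec(tgroup)
```

For the sample tgroup, the thead is reported because it has local colspec elements and an entry with spanname; the tbody entry is not part of the check.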
There are other tests that are only reported when calsValidationLevel is strict. These are marked in the Schematron file cals.sch with the attribute role="warning" on the assert element.
Known limitations
Empty tables or tables with little significant data
One current weakness occurs when tables are empty or contain data that is heavily repeated across columns. Because the new approach analyses content when comparing versions, it has little to distinguish one column from another in these cases and the alignment can be poor.
This is a problem we are currently working on.
In these cases for now you can use the Column Keying as described above because an empty table is mostly just structure. The same approach can be used with heavily repeated data.
Small tables with many hidden columns or tables with adjacent columns whose data is much the same
Another current weakness happens when a table has many columns that are hidden by spans. Typically this is where Users are using columns to format the alignment of cells within tables. When one item of content spans several columns there is an issue about how to assign this content to the columns and this can cause column alignment issues in some situations. The same can occur when tables consist of mostly the same data.
This is a problem we are currently working on, but for now you can use the Column Keying as described above if you already ‘know’ the alignment.
spanspecs
We have not yet completed work on spanspecs. Whilst they are taken into account when reconstructing the spans in the result, we don’t yet preserve the link between the spanspec and its entry elements via the spanname attribute. The spanname is replaced in the result with equivalent namest and nameend attributes. This means any style information attached to the spanspec, such as the align attribute value, will be lost. Let us know if this causes you problems, but remember that we can only ever use one style (currently the B style) to reconstruct span styles.
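The replacement of spanname by namest and nameend can be pictured with a short sketch. This is only an illustration of the equivalence and not the product's code. The function name is hypothetical, and elements are assumed to be in no namespace.

```python
import xml.etree.ElementTree as ET


def expand_spannames(tgroup):
    """Replace spanname on each entry with the namest/nameend of its spanspec."""
    spans = {
        spec.get("spanname"): (spec.get("namest"), spec.get("nameend"))
        for spec in tgroup.findall("spanspec")
    }
    for entry in tgroup.iter("entry"):
        name = entry.get("spanname")
        if name is not None and name in spans:
            namest, nameend = spans[name]
            del entry.attrib["spanname"]
            entry.set("namest", namest)
            entry.set("nameend", nameend)
    return tgroup


tgroup = ET.fromstring(
    '<tgroup cols="2">'
    '<colspec colnum="1" colname="a"/><colspec colnum="2" colname="b"/>'
    '<spanspec spanname="ab" namest="a" nameend="b" align="center"/>'
    '<tbody><row><entry spanname="ab"/></row></tbody>'
    '</tgroup>'
)
expanded = expand_spannames(tgroup)
entry = expanded.find("tbody/row/entry")
```

After the call, the entry carries namest="a" and nameend="b" but nothing from the spanspec's align attribute, which mirrors the loss of style information described above.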
colspecs in thead and tfoot
A tgroup defines the number of columns that its tbody, thead and tfoot elements may contain. However, thead and tfoot may define different styling for those elements using their own colspec elements (though not in the CALS XML Exchange Table Model Document Type Definition). If the colspecs in the thead and tfoot have different names from the tgroup colspecs, then those names will be used when reconstructing the table. As with the tgroup colspecs mentioned above, the B colspecs will be used except where there are columns only present in A.