Configuration for Processing HTML Tables

Introduction

This page gives a detailed explanation of the settings used in a DCP file when processing HTML tables.  These are summarised in the DCP Schema Guide.

The use of HTML tables is explained on the page Comparing Document Tables (CALS or HTML).

Note that this page is only applicable when using the Document Comparator.

Validation

HTML tables are validated by a Schematron file.  There are several ways to configure how validation error messages appear.  See below for details.   

Each message includes the XPath of the element concerned.  This allows the identification of the erroneous element in the input file within an XML editor.  Alternatively you may search for the phrase HTML Table Validation Warning in the result document.

Validation is performed on HTML <table>, DocBook <informaltable> and DITA <simpletable> elements. Any validation error messages appear inside the element concerned in the result file, provided that warningReportMode is set to processingInstructions or comments.

Configuration 

processHtmlTables

true (default)

false

If you have no HTML tables in the input documents, the comparison may be faster if you switch HTML table processing off.

warningReportMode

processingInstructions (default)

By default the HTML Table Validation Warning appears in a processing instruction.  In a large file you could search for the string HTML Table Validation Warning.

Sample result file showing processing instruction
<table deltaxml:deltaV2="A">
  <?dxml_warn HTML Table Validation Warning for Input A:
                A rowspan attribute (/Q{}mydoc[1]/Q{}body[1]/Q{}table[1]/Q{}tbody[1]/Q{}tr[1]/Q{}td[4]) must be an integer value?>
  <thead>
    <tr>
      <th>No.</th>
      <th>Name</th>
      <th>Email</th>
      <th>DOB</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>Joe</td>
      <td>joe@gmail.com</td>
      <td rowspan="z">1-1-1985</td>
    </tr>
  </tbody>
</table>

message

When this parameter is set to message and the comparison is run from the command line, an error produces output like this:

Sample Command line output when warningReportMode is message
DeltaXML Command Processor, version: 2.1
Copyright (c) 2000-2015 DeltaXML Ltd. All rights reserved.
Using: XML-Compare, version: 10.2.0

HTML Table Validation Warning for Input A: 
                 A rowspan attribute (/Q{}mydoc[1]/Q{}body[1]/Q{}table[1]/Q{}tbody[1]/Q{}tr[1]/Q{}td[4]) must be an integer value
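
If the command-line output is captured by a script, the warning text can be split into its parts. The following is a minimal sketch that assumes the message always has the shape shown in the sample above (input label, description, and an XPath in parentheses). The function name parse_warning is hypothetical and is not part of any DeltaXML API.

```python
import re

# Assumed message shape, taken from the sample output above:
#   HTML Table Validation Warning for Input <label>: <text> (<xpath>) <text>
WARNING_RE = re.compile(
    r"HTML Table Validation Warning for Input (?P<input>\w+):\s*"
    r"(?P<before>.*?)\((?P<xpath>/[^)]*)\)\s*(?P<after>.*)",
    re.DOTALL,
)


def parse_warning(text):
    """Return the input label, XPath and description, or None if no warning is found."""
    match = WARNING_RE.search(text)
    if match is None:
        return None
    # Join the description parts and collapse the line breaks and indentation.
    message = " ".join((match.group("before") + match.group("after")).split())
    return {
        "input": match.group("input"),
        "xpath": match.group("xpath"),
        "message": message,
    }
```

The returned xpath value is the string that can be pasted into an XML editor to locate the element.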

comments

If you set warningReportMode to comments, an XML comment is inserted in the <table> element that has an error.

Sample result file showing comments
<table deltaxml:deltaV2="A">
  <!--HTML Table Validation Warning for Input A:
                A rowspan attribute (/Q{}mydoc[1]/Q{}body[1]/Q{}table[1]/Q{}tbody[1]/Q{}tr[1]/Q{}td[4]) must be an integer value -->
  <thead>
    <tr>
      <th>No.</th>
      <th>Name</th>
      <th>Email</th>
      <th>DOB</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>Joe</td>
      <td>joe@gmail.com</td>
      <td rowspan="z">1-1-1985</td>
    </tr>
  </tbody>
</table>


The XPath expression in a validation message can be copied and pasted into an XPath builder in an XML editor such as oXygen and, with the focus on the file with the problem, the element concerned will be highlighted.
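
Outside an XML editor, the same lookup can be approximated in a script. The sketch below assumes the reported path uses only no-namespace steps of the form Q{}name[n], as in the samples on this page, and resolves it with Python's standard ElementTree module. The function name find_by_reported_xpath is hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def find_by_reported_xpath(root, xpath):
    """Resolve a path such as /Q{}mydoc[1]/Q{}body[1]/Q{}table[1] against root.

    Only no-namespace steps (Q{}name[n]) are handled in this sketch.
    """
    # Drop the empty-namespace marker and split the path into steps.
    steps = [step for step in xpath.replace("Q{}", "").split("/") if step]
    if not steps:
        return None
    # The first step names the root element itself.
    if re.sub(r"\[\d+\]$", "", steps[0]) != root.tag:
        return None
    if len(steps) == 1:
        return root
    # ElementTree paths are relative to the element they are applied to.
    return root.find("./" + "/".join(steps[1:]))
```

For example, calling it with the root of the input document and the path from the rowspan sample returns the td element whose rowspan attribute is "z".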

For both processingInstructions and comments, the message appears inside the table element with the problem.

Note that both the comments and processingInstructions options include the wording HTML Table Validation Warning so that searching the result file for this phrase will show the errors. 
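
Searching for the phrase can also be scripted. The following is a minimal sketch, assuming Python 3.8 or later and a result document small enough to parse in memory. It collects every comment or processing instruction that contains the phrase, together with the name of the element that holds it. The function name find_warnings is hypothetical.

```python
import xml.etree.ElementTree as ET

PHRASE = "HTML Table Validation Warning"


def find_warnings(xml_text):
    """Return (parent tag, kind, text) for each warning comment or processing instruction."""
    # By default ElementTree discards comments and processing instructions;
    # this builder keeps them in the tree (Python 3.8+).
    builder = ET.TreeBuilder(insert_comments=True, insert_pis=True)
    root = ET.fromstring(xml_text, parser=ET.XMLParser(target=builder))
    hits = []
    for parent in root.iter():
        for child in parent:
            if child.tag not in (ET.Comment, ET.ProcessingInstruction):
                continue
            text = child.text or ""
            if PHRASE in text:
                kind = "comment" if child.tag is ET.Comment else "processing-instruction"
                hits.append((parent.tag, kind, " ".join(text.split())))
    return hits
```

For a processing instruction, the reported text starts with the instruction's target name (dxml_warn in the samples above).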

invalidHtmlTableBehaviour

propagateUp (default)

This propagates the error up to the containing table.  

Sample default result
<table deltaxml:deltaV2="A">
  <!--HTML Table Validation Warning for Input A:
                A colspan attribute (/Q{}mydoc[1]/Q{}body[1]/Q{}table[1]/Q{}thead[1]/Q{}tr[1]/Q{}th[1]) must be an integer value -->
  <thead>
    <tr>
      <th colspan="a">No.</th>
      <th>Name</th>
      <th>Email</th>
      <th>DOB</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>Joe</td>
      <td>joe@gmail.com</td>
      <td>1-1-1985</td>
    </tr>
  </tbody>
</table>

compareAsXml

This processes the table as if it were plain XML rather than a table.  The warning is still given, defaulting to a processing instruction on the table.

Sample result file when behaviour is set to compareAsXml
<table deltaxml:deltaV2="A!=B">
  <?dxml_warn HTML Table Validation Warning for Input A:
                A colspan attribute (/Q{}mydoc[1]/Q{}body[1]/Q{}table[1]/Q{}thead[1]/Q{}tr[1]/Q{}th[1]) must be an integer value ?>
  <thead deltaxml:deltaV2="A!=B">
    <tr deltaxml:deltaV2="A!=B">
      <th deltaxml:deltaV2="A!=B">
        <deltaxml:attributes deltaxml:deltaV2="A">
          <dxa:colspan deltaxml:deltaV2="A">
            <deltaxml:attributeValue deltaxml:deltaV2="A">a</deltaxml:attributeValue>
          </dxa:colspan>
        </deltaxml:attributes>No. </th>
      <th deltaxml:deltaV2="A=B">Name</th>
      <th deltaxml:deltaV2="A=B">Email</th>
      <th deltaxml:deltaV2="A=B">DOB</th>
    </tr>
  </thead>
  <tbody deltaxml:deltaV2="A=B">
...
  </tbody>
</table>

fail

This will fail with a message starting Detected invalid table(s), for example:

Sample Output when behaviour is set to fail
com.deltaxml.cores9api.FilterProcessingSingleException: Exception thrown when attempting to run the 'result/0-output-part-1/5-dxml-htmlTable/7-html-validity-reporter' step (source: 'net.sf.saxon.s9api.XsltExecutable@5b696b4e')
	at com.deltaxml.cores9api.RunnableFilterChain.fillInFPSEDetailsAndRethrow(RunnableFilterChain.java:314)
	at com.deltaxml.cores9api.RunnableFilterChain.runFilterChain(RunnableFilterChain.java:167)
	at com.deltaxml.cores9api.PipelinedComparatorS9.compareXdmNode(PipelinedComparatorS9.java:1485)
	at com.deltaxml.cores9api.PipelinedComparatorS9.compare(PipelinedComparatorS9.java:1165)
...
Caused by: net.sf.saxon.trans.XPathException: Detected invalid table(s):
Input A: 
                A colspan attribute (/Q{}mydoc[1]/Q{}body[1]/Q{}table[1]/Q{}thead[1]/Q{}tr[1]/Q{}th[1]) must be an integer value

htmlValidationLevel

strict

relaxed (default)

For certain errors nothing will be reported in relaxed mode.  In strict mode you will see errors like:

  • A caption element can occur in an HTML table only once
  • A caption element must be inserted immediately after the table element
  • A tr element should have one or more td or th elements inside
  • A tfoot element should appear before any tbody element

Relaxed validation does not report any error for the following table, even though it has two captions and the <tfoot> element comes after the <tbody>.

Sample table illustrating the difference between 'strict' and 'relaxed' validation
<table>
  <caption>CAPTION</caption>
  <caption>ANOTHER CAPTION</caption>
  <thead>
    <tr>
      <th>No.</th>
      <th>Name</th>
      <th>Email</th>
      <th>DOB</th>
    </tr>
  </thead>
  <tbody>
   ...
  </tbody>
  <tfoot>
    <tr>
      <td>Number</td>
      <td>Name</td>
      <td>Email Address</td>
      <td>Date of Birth</td>
    </tr>
  </tfoot>
</table>
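
To make the four strict-mode messages listed above concrete, the sketch below re-implements them as simple checks over a parsed table. It is an illustration only and is not the Schematron file the product uses. The function name strict_warnings is hypothetical, and the caption-position check assumes that "immediately after the table element" means the caption is the first child.

```python
import xml.etree.ElementTree as ET


def strict_warnings(table):
    """Return the strict-mode style messages that apply to a parsed <table> element."""
    warnings = []
    children = list(table)
    tags = [child.tag for child in children]

    if tags.count("caption") > 1:
        warnings.append("A caption element can occur in an HTML table only once")
    # Assumption: the first caption must be the first child of the table.
    if "caption" in tags and tags[0] != "caption":
        warnings.append(
            "A caption element must be inserted immediately after the table element"
        )
    # Note: iter() also visits rows of any nested tables.
    for row in table.iter("tr"):
        if row.find("td") is None and row.find("th") is None:
            warnings.append(
                "A tr element should have one or more td or th elements inside"
            )
            break
    if "tbody" in tags and "tfoot" in tags[tags.index("tbody") + 1:]:
        warnings.append("A tfoot element should appear before any tbody element")
    return warnings
```

Applied to a table shaped like the sample above (with a complete tbody), this sketch reports the duplicate caption and the misplaced tfoot.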

normalizeTables  

It is difficult to compare two tables where different conventions have been used.  By default HTML tables will be normalized before comparison takes place to ensure the underlying structure of all tables is the same.  If you do not wish this to take place you can set normalizeTables to false.  This only applies to HTML tables, simple tables and informal tables because there are different ways of expressing the same structure.  

Currently, the normalization feature is limited to the use of <colgroup>. It may be extended to cover more cases in future.

This setting is recommended when the inputs specify columns differently, e.g. if one uses just <colgroup> and another uses <col> without <colgroup>.
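
As an illustration of why such differences matter, the sketch below shows one possible way to bring a table that uses bare <col> elements into the same shape as one that wraps them in <colgroup>. This is an assumption about what a normalization step could look like, not a description of what the product does internally. The function name wrap_bare_cols is hypothetical.

```python
import xml.etree.ElementTree as ET


def wrap_bare_cols(table):
    """Move <col> elements that are direct children of the table into one <colgroup>."""
    bare = [child for child in table if child.tag == "col"]
    if not bare:
        return
    # Insert the new group where the first bare <col> used to be.
    position = list(table).index(bare[0])
    group = ET.Element("colgroup")
    for col in bare:
        table.remove(col)
        group.append(col)
    table.insert(position, group)
```

After this step, both inputs express their columns as <col> children of a <colgroup>, so the column markup can be aligned like any other element.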

Standard Features

The <tbody> element is treated as a special case for HTML table comparison. This is because the tbody element may not be present in a table, or there may be one or more instances of it in a table. The tbody elements are flattened in the input so that differences in the existence or number of tbody elements can be hidden in the comparison result. How any tbody structure differences are shown is determined by standard FormattingElement settings.
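
The idea of flattening can be sketched as follows. This is a simplified illustration under the assumption that flattening means replacing each tbody with its rows. It does not record the original tbody boundaries, which the product would need in order to show structure differences. The function name flatten_tbody is hypothetical.

```python
import xml.etree.ElementTree as ET


def flatten_tbody(table):
    """Replace each <tbody> child of the table with the rows it contains."""
    flattened = []
    for child in list(table):
        if child.tag == "tbody":
            flattened.extend(list(child))
        else:
            flattened.append(child)
    # Slice assignment swaps in the new child list in one step.
    table[:] = flattened
```

With this step applied to both inputs, a table with no tbody and a table with several tbody elements present the same sequence of rows to the comparison.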
