
Lexical Preservation in DITA Compare: Preserving Entities, DTDs, CDATA, PIs and Comments

It can be useful to preserve the lexical structure of the inputs when performing a comparison. This section discusses our support for lexical preservation and its limitations.

DITA Compare provides a selection of output formats with different intended use cases as discussed in Output Formats. Some are intended for use in a publication pipeline, whereas others are intended for onward review and editing (we refer to this as 'round trip' processing). For onward editing, it is useful to provide the user with a document that is as close to the original input documents as possible. For example, it is important not to expand entity references and CDATA sections.

The lexical preservation mode is selected with the 'preservation-mode' parameter, as discussed in DITA Compare Parameters.

Modes

Round trip preservation mode

When using a track-change output format, a user is likely to expect that accepting all the changes would result in the 'B' document, whereas rejecting all the changes would result in the 'A' document. The 'round trip' preservation mode is designed to achieve this as far as possible, within the limitations of standard XML parsing and XSLT 2.0 transformation technologies. However, as some data cannot be tracked using tracked change markup, it is necessary to choose either the 'A' or 'B' version of that data. By default the result document uses data in the 'B' document in preference to that in the 'A' document. Hence, accepting all changes is likely to be close to the 'B' document whereas rejecting all changes may not be as close to the 'A' document.
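The accept/reject expectation can be illustrated with a toy model. The sketch below assumes a hypothetical representation of a tracked-change result as a list of (kind, text) pairs. It is not the markup that DITA Compare emits, and the names RESULT, accept_all and reject_all are invented for illustration.

```python
# Toy model of a tracked-change result. The (kind, text) representation is
# hypothetical and is not the markup that DITA Compare produces.
RESULT = [
    ("same", "The capital is "),
    ("del", "Paris"),    # text present only in the 'A' document
    ("ins", "London"),   # text present only in the 'B' document
    ("same", "."),
]


def accept_all(result):
    """Keep unchanged and inserted text, which should reproduce 'B'."""
    return "".join(text for kind, text in result if kind in ("same", "ins"))


def reject_all(result):
    """Keep unchanged and deleted text, which should reproduce 'A'."""
    return "".join(text for kind, text in result if kind in ("same", "del"))
```

Data that cannot be carried in tracked-change markup has no 'del' or 'ins' entry in such a model, which is why one version of it has to be chosen for the result.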

Document preservation mode

When marking changes using attributes, such as revision flags, the user is likely to expect full content expansion. Here entity references and CDATA sections are expanded and compared, rather than kept in their original source form. This typically enables finer grained change identification and display. It can also significantly improve the aligning of the documents before the comparison is performed. This type of processing is performed when using the 'document' preservation mode.

Document and attribute preservation mode

One issue with the document preservation mode is that all attributes supplied by the DTD are retained in the output. This clutter increases the size of the output and reduces its clarity for manual review and editing. The 'document and attribute' preservation mode addresses this issue by tracking which attributes have been supplied by the DTD and removing them, so long as they have not changed.

Entity reference and nested entity reference preservation modes

These are variations on the 'round trip' mode to enable expert users to know when the underpinning definitions of an entity have changed, as explained in the Details section below.


Details

The table below shows the different preservation modes and their effect on how various items in the file are preserved.

| Preservation Mode | Preserve Comments & Processing Instructions | XML Declaration & Doctype | Preserve defaulted attributes | Preserve CDATA sections & whitespace | Preserve entity references | Preserve entity references & content | Preserve nested entity references & content |
|---|---|---|---|---|---|---|---|
| document | on | on | off | off | off | n/a | n/a |
| docAndAttrib | on | on | on | off | off | n/a | n/a |
| roundTrip | on | on | on | on | on* | off | off |
| entityRef | on | on | on | on | on* | on* | off |
| nestedEntityRef | on | on | on | on | on* | on* | on* |


*It is not feasible to preserve entity references when using the DITA Markup output format.
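For readers who want to reason about the table programmatically, it can be encoded as a small lookup structure. This is only a sketch: the mode names are taken from the table, but the column identifiers and the helper functions are hypothetical and are not part of the DITA Compare API.

```python
# Column identifiers are invented for this sketch; their order follows the
# columns of the table above. None stands for 'n/a'.
COLUMNS = (
    "comments_and_pis",
    "xml_decl_and_doctype",
    "defaulted_attributes",
    "cdata_and_whitespace",
    "entity_refs",
    "entity_refs_and_content",
    "nested_entity_refs_and_content",
)

# The entity-related 'on' values carry the table's footnote: they do not
# apply when using the DITA Markup output format.
MODES = {
    "document":        (True, True, False, False, False, None, None),
    "docAndAttrib":    (True, True, True, False, False, None, None),
    "roundTrip":       (True, True, True, True, True, False, False),
    "entityRef":       (True, True, True, True, True, True, False),
    "nestedEntityRef": (True, True, True, True, True, True, True),
}


def preserves(mode, item):
    """Return True, False, or None ('n/a') for a mode and a column."""
    return MODES[mode][COLUMNS.index(item)]


def modes_preserving(item):
    """List the modes whose entry for the given column is 'on'."""
    return [mode for mode in MODES if preserves(mode, item)]
```

For example, modes_preserving("cdata_and_whitespace") lists the three round-trip style modes.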

The effect of turning each of these preservation items 'on' or 'off' is discussed in the following list, where 'this column' in an item's description refers to the corresponding column in the table above.

Mode

Notes

Preserve Comments & Processing Instructions

Comments and Processing Instructions (PIs) in the 'B' document are preserved in the result, whereas comments and PIs in the 'A' document (that are not also in the 'B' document) do not appear in the result. The exception here is that PIs that represent oXygen tracked changes are removed prior to comparison so that they do not get confused with the changes identified by the comparator. Further, neither comments nor PIs in the internal DTD subset are currently preserved.

Preserve XML Declaration & Document Type (DTD & internal subset)

Most of the XML declaration, doctype and internal subset data is preserved (for the preservation modes that contain an 'on' in this column). A current limitation is that comments and processing instructions within an internal subset are lost. Another limitation is that the XML declaration's standalone marking is not preserved.

Preserve defaulted attributes

Default attribute values can be specified in a DTD, and the parser adds them automatically to the elements in the document. If they are preserved as defaulted attributes (i.e. an 'on' in this column), these default values do not appear explicitly in the result document.
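The way a parser adds DTD defaults can be seen with Python's standard expat binding, which is used here purely for illustration and is not the parser used by DITA Compare. The 'status' attribute and the attributes_reported helper are hypothetical. The expat 'specified_attributes' flag controls whether defaulted attributes are reported alongside those written in the document.

```python
import xml.parsers.expat

# The internal subset declares a default for 'status'; the instance only
# writes 'id' explicitly.
DOC = (
    '<!DOCTYPE topic [<!ATTLIST topic status CDATA "draft">]>'
    '<topic id="t1"/>'
)


def attributes_reported(doc, specified_only):
    """Collect the attributes that the parser reports for the document."""
    seen = {}
    parser = xml.parsers.expat.ParserCreate()
    # When true, expat reports only attributes written in the instance.
    parser.specified_attributes = specified_only
    parser.StartElementHandler = lambda name, attrs: seen.update(attrs)
    parser.Parse(doc, True)
    return seen
```

With specified_only set to False the defaulted 'status' attribute is reported as if it had been written in the document, which is the source of the clutter that this preservation item avoids.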

Preserve CDATA sections and whitespace

CDATA (character data) sections are preserved in the result (for the preservation modes that contain an 'on' in this column). Insignificant whitespace characters are treated as normal whitespace characters, and modifications in whitespace are by default ignored in the output.
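The reason CDATA sections need explicit preservation is that a parsed tree normally does not record them. The following sketch uses Python's standard library rather than DITA Compare itself: the tree-level text is the same whether the source used a CDATA section or an escaped character, and only a lower-level parser event reveals the difference. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
import xml.parsers.expat

WITH_CDATA = "<p><![CDATA[a < b]]></p>"
WITH_ESCAPE = "<p>a &lt; b</p>"


def parsed_text(doc):
    """Text content as seen after parsing; the CDATA markers are gone."""
    return ET.fromstring(doc).text


def uses_cdata(doc):
    """Detect a CDATA section through expat's lexical event handler."""
    found = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartCdataSectionHandler = lambda: found.append(True)
    parser.Parse(doc, True)
    return bool(found)
```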

Preserve entity references

General parsed entities are preserved as entities - rather than expanded (i.e. replaced by their content) - in the result document when an 'on' is in this column. This is usually what you want when you continue to edit the document. For example, consider two documents that differ in how the name of a city - London - is represented: in the first document the city is written as the string 'London', and in the second document the city is written as an entity reference '&city;' whose value is the string 'London'. In this case, modes with an 'on' in this column mark the two representations of the city London as different, because the unexpanded entity reference differs from the text, whereas modes with an 'off' in this column mark the two representations as the same, because the expanded entity reference is the same as the text.
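The London example can be reproduced in a few lines. This is a sketch using Python's standard library, not DITA Compare: the expanded view comes from an ordinary parse, while the lexical view is approximated with a regular expression over the source text. Both helpers are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

DOCTYPE = '<!DOCTYPE p [<!ENTITY city "London">]>'
DOC_TEXT = DOCTYPE + "<p>London</p>"
DOC_ENTITY = DOCTYPE + "<p>&city;</p>"


def expanded_text(doc):
    """What a mode with 'off' compares: entity references are expanded."""
    return ET.fromstring(doc).text


def lexical_text(doc):
    """What a mode with 'on' compares: the content as written in the source."""
    return re.search(r"<p>(.*?)</p>", doc).group(1)
```

The expanded views are equal while the lexical views differ, which matches the 'same' and 'different' outcomes described above.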

Preserve entity references and content

This is intended only for expert users who understand how entities work. In roundTrip mode you will not see changes in entity references in the (unusual) situation where the definition of these entities is different in the two documents. For example, consider two documents containing the entity reference '&city;' that differ only in the value of the 'city' entity, which has changed from 'London' in one document to 'Birmingham' in the other. Both of these documents use the same '&city;' entity reference, which would be marked as unmodified as it is identical from the round trip (source document) perspective. If you need to see such changes, then use a mode with an 'on' in this column. In the result document, there can only be one entity definition and this will be either from the original ('A' document) or new ('B' document). Therefore the entities are guaranteed to be the same in the result document, and so any difference is shown by adding and removing an identical element.
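One way to see why the round trip perspective misses this kind of change is to compare the entity declarations directly. The sketch below uses the entity declaration handler of Python's standard expat binding. It is an illustration of the idea only, not how DITA Compare detects the change, and the helper names are hypothetical.

```python
import xml.parsers.expat

DOC_A = '<!DOCTYPE p [<!ENTITY city "London">]><p>&city;</p>'
DOC_B = '<!DOCTYPE p [<!ENTITY city "Birmingham">]><p>&city;</p>'


def entity_definitions(doc):
    """Map each internal general entity name to its declared value."""
    definitions = {}

    def on_entity_decl(name, is_parameter, value, base,
                       system_id, public_id, notation):
        # value is None for external entities, which are skipped here.
        if not is_parameter and value is not None:
            definitions[name] = value

    parser = xml.parsers.expat.ParserCreate()
    parser.EntityDeclHandler = on_entity_decl
    parser.Parse(doc, True)
    return definitions


def changed_entities(doc_a, doc_b):
    """Names declared in both documents whose values differ."""
    a, b = entity_definitions(doc_a), entity_definitions(doc_b)
    return sorted(name for name in a.keys() & b.keys() if a[name] != b[name])
```

Both documents contain the identical reference '&city;' in their content, yet changed_entities reports 'city' because its definition differs.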

Preserve nested entity references and content

This is intended only for expert users who understand the way one entity can reference another. An 'on' in this column means that subtle changes in entity reference structure are shown. The full structure of nested entities is preserved and compared and any changes are shown. This is useful in some complex cases where the overall semantics of an entity does not change, but the way in which it is defined changes. For example, consider a document that contains a reference to the entity '<!ENTITY ent "&inner1;">', where the 'inner1' entity has the value 'val'. Let a second version of the document be the same as the first, except that the inner entity reference is renamed to '&inner2;'. In this case, both the syntactic and semantic analyses will miss this change, as the syntactic analysis compares '&ent;' against itself and the semantic analysis compares the text 'val' against itself. An 'on' in this column means the comparator will detect such changes in the internal definition of an entity, and marks them using the same scheme as above: the addition and deletion of an identical entity reference.
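The three levels of analysis in this example can be sketched with a small resolver over dictionaries of entity definitions. The dictionaries and the functions expand and reference_chain are hypothetical, the sketch assumes that definitions are not recursive, and reference_chain follows only definitions that consist of a single entity reference.

```python
import re

ENTITY_REF = re.compile(r"&(\w+);")

# Entity definitions for the two versions of the document.
DEFS_A = {"ent": "&inner1;", "inner1": "val"}
DEFS_B = {"ent": "&inner2;", "inner2": "val"}
BODY = "&ent;"  # the reference as written in both documents


def expand(text, definitions):
    """Fully expand entity references (the semantic view)."""
    previous = None
    while previous != text:
        previous = text
        text = ENTITY_REF.sub(lambda m: definitions[m.group(1)], text)
    return text


def reference_chain(name, definitions):
    """Names visited while resolving an entity (the nested structure)."""
    chain = [name]
    match = ENTITY_REF.fullmatch(definitions[name])
    while match:
        chain.append(match.group(1))
        match = ENTITY_REF.fullmatch(definitions[match.group(1)])
    return chain
```

The reference text and the full expansion are identical for both versions, and only the reference chain differs, which is the kind of change this column is intended to expose.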

Limitations

There are some fundamental limitations on what changes can be shown, which reflect the nature of a given output format and XML parsing and processing technology. These fundamental limitations include:

  1. Some output formats cannot represent changes in attributes. In these cases, it is possible to configure the resultant document to contain the 'A' version, the 'B' version, the 'A' version if it exists otherwise the 'B' version, etc; see the 'modified-attribute-mode' parameter documentation for details.

  2. Many output formats - such as DITA markup and Arbortext tracked change formats - cannot represent changes in the document type and internal subset data. In these cases, it is possible to configure the resultant document to contain the 'A' version, the 'B' version, the 'A' version if it exists otherwise the 'B' version, etc; see the 'unmarked-change-mode' parameter documentation for details.

  3. Some changes in whitespace cannot be reproduced, as whitespace outside the root element of a document is not reported by an XML parser.
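The third limitation can be observed with any standard parser. The sketch below uses Python's ElementTree as a stand-in and is not DITA Compare itself: whitespace before and after the root element is absent from the parsed tree and so cannot be written back out. The reserialized helper is hypothetical.

```python
import xml.etree.ElementTree as ET

# Blank lines surround the root element in the source text.
PADDED = "\n\n<topic><title>T</title></topic>\n\n"


def reserialized(doc):
    """Parse the document and write it back out from the tree."""
    return ET.tostring(ET.fromstring(doc), encoding="unicode")
```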
