Subtree Processing Mode
Introduction
This document introduces the Subtree Processing feature. The sample resources including how to set up this feature can be found here.
This “how to” guide introduces then discusses how different content in a set of inputs can be compared as either Data or Text in a customisable way to achieve better results.
What is Subtree Processing?
Previously customers could not effectively extend or use our Data processing features on Text within Documents or Text processing features in Data, however sometimes files are a mix of Text content and Data. In order to improve the performance some content needs to be marked as a Data section to be treated appropriately or vice-versa.
The Subtree Processing feature allows sections to be compared with their appropriate content type. This improves performance and helps XML Compare to provided better results.
What is Data vs Text?
A simple way to view Data vs Text is to imagine the main purpose. Text is designed for a human to read it (e.g. content in a book or article) vs Data being something you could read but more inclined to be storage of data (e.g. a list of contacts or technical specification).
There is no true definition for Data or Text in XML however for deciding how to use Subtree Processing best, the following can be used to help understand whether content should be marked as text or data:
The leaf content of Data is more likely to be short consisting of fewer or single ‘words’ contained within an element which describes its meaning. Other data elements are structures built up from these leaf elements, and they are often repeated. For data, attributes are often used to store name/value data pairs and are critical to the meaning
Text can be considered as content which consists of more than 5 words, contains tables, or contains mixed content (i.e. when text and elements appear adjacent to one another within the same parent element.). For text content, attributes are often used for supplementary information such as display colour or text style.
See our Comparison Report in an XML Compare Evaluation which will outline whether we believe you are working with Data or Text
See the examples in the Subtree Processing Mode Configuration, or the Bitbucket Sample, for a demonstration of Data and Text content being compared with their respective processing types, and further discussion.
Subtree Processing Mode Configuration
Text content processing aligns elements based on the text content of the elements, taking into account the order of the words within the text. Data content processing aligns elements considering the attributes as name/value pairs and the unique words within the text content of the element (but not their order).
Content Processing Type
The following sections will outline the difference between the Content Processing types and how to set the default Content Processing type, which determines the default for how content is treated within the comparison.
To set the default Content Processing type for a comparison using the Java API see the code below, or see the Bitbucket Sample to see how to set Subtree Processing using a DCP file.
SubtreeProcessingMode subtreeProcessingMode = new SubtreeProcessingMode(SubtreeType.TEXT);
DocumentComparator dc= new DocumentComparator();
dc.setSubtreeProcessingMode(subtreeProcessingMode);
dc.compare(in1, in2, result);
Text
Text Content Processing is the default Subtree Processing Mode for XML Compare. Text Content Processing is better for XML files containing Text or that can be considered as Documents as defined above in the What is Data vs Text section.
Text is the default processing in DocumentComparator, however it can be set using the Java API with the following code:
SubtreeProcessingMode subtreeProcessingMode = new SubtreeProcessingMode(SubtreeType.TEXT);
Data
Data Content Processing can be enabled as the default for a comparison using the following initialisation of the SubtreeProcessingMode, and is the default in other DeltaXML Products such as XML Data Compare. Data Content Processing can be set using the Java API with the following code:
SubtreeProcessingMode subtreeProcessingMode = new SubtreeProcessingMode(SubtreeType.DATA);
Examples
Example 1
The following example shows the difference between comparing a simple Document file with Text vs Data Content Processing.
In the following A input contain items with the phrase ‘Anna visited her grandmother on Sunday, they went to a castle', whilst the B input contains a similar phrase of ‘Anna visited her grandmother on Sunday, they went to a lake’ in addition to an item with the phrase 'on Sunday her grandmother visited Anna, they went to a castle’ which contains all the same words as the sentence in input A but reordered. We chose this example to demonstrate:
That the expected match would be between the two phrases which match most closely visually
i.e. phrase “Anna visited her grandmother on Sunday, they went to a castle” matching with “Anna visited her grandmother on Sunday, they went to a castle”
The Text Content Processing matches as described above, taking the structure of the text into account and matching on the most similar Text
The Data Content Processing matches on the rearranged phrase “on Sunday her grandmother visited Anna, they went to a castle”
This is because it has the most commonality in the content (in this case words) regardless of structure, which is all that matters for this type of Content Processing.
A
<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1">
<container>
<item>on Sunday her grandmother visited Anna, they went to a castle</item>
<item>Anna visited her grandmother on Sunday, they went to a lake</item>
<item>they had a good time</item>
</container>
</root>
B
<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1">
<container>
<item>Anna visited her grandmother on Sunday, they went to a castle</item>
<item>they had a good time</item>
</container>
</root>
Results
Text Content Processing
Data Content Processing
Example 2
In the following example A and B contain the same items, with the item in B with name Daniel and Edith are swapped in order.
Data Content Processing matches the item with name Daniel with the item with name Edith because they have common PCData nodes in the same order
Daniel, Edith, Fred, 1940, 1960, 1980
Edith, Fred, Adam, 1960, 1980, 1990
Marking the items inside the container to be of type DATA can potentially make the result more precise and meaningful. Alternatively, when talking about Documents/Text the above type of matching can still be advantageous as the markup around the text can be meta data and not what the User perceives.
A
<?xml version="1.0" encoding="UTF-8"?>
<container>
<item>
<name>Adam</name>
<spouse>Beth</spouse>
<children>
<child>Chris</child>
</children>
<date_of_birth>1950</date_of_birth>
<date_of_marriage>1970</date_of_marriage>
<member_since>1970</member_since>
</item>
<item>
<name>Daniel</name>
<spouse>Edith</spouse>
<children>
<child>Fred</child>
</children>
<date_of_birth>1940</date_of_birth>
<date_of_marriage>1960</date_of_marriage>
<member_since>1980</member_since>
</item>
<item>
<name>Edith</name>
<spouse>Fred</spouse>
<children>
<child>Adam</child>
</children>
<date_of_birth>1960</date_of_birth>
<date_of_marriage>1980</date_of_marriage>
<member_since>1990</member_since>
</item>
<item>
<name>Ottie</name>
<spouse>Phil</spouse>
<children>
<child>Rich</child>
</children>
<date_of_birth>1990</date_of_birth>
<date_of_marriage>2010</date_of_marriage>
<member_since>2011</member_since>
</item>
</container>
B
<?xml version="1.0" encoding="UTF-8"?>
<container>
<item>
<name>Adam</name>
<spouse>Beth</spouse>
<children>
<child>Chris</child>
</children>
<date_of_birth>1950</date_of_birth>
<date_of_marriage>1970</date_of_marriage>
<member_since>1970</member_since>
</item>
<item >
<name>Edith</name>
<spouse>Fred</spouse>
<children>
<child>Adam</child>
</children>
<date_of_birth>1960</date_of_birth>
<date_of_marriage>1980</date_of_marriage>
<member_since>1990</member_since>
</item>
<item>
<name>Daniel</name>
<spouse>Edith</spouse>
<children>
<child>Fred</child>
</children>
<date_of_birth>1940</date_of_birth>
<date_of_marriage>1960</date_of_marriage>
<member_since>1980</member_since>
</item>
<item>
<name>Ottie</name>
<spouse>Phil</spouse>
<children>
<child>Rich</child>
</children>
<date_of_birth>1990</date_of_birth>
<date_of_marriage>2010</date_of_marriage>
<member_since>2011</member_since>
</item>
</container>
Results
Text
Data
Setting Subtree Processing Mode on Subtrees
When working with mixed content, a set of inputs may contain both Data and Text content. This is where Subtrees come into play. A Subtree consists of an XPath and a type, i.e. an XPath to an element which will then be compared with the relevant processing type.
Descendants of the element specified by the XPath in the Subtree will be compared as the relevant type, therefore allowing Text to be treated as Text or Text containers, and Data to be treated as Data or Data containers.
To add Subtrees using the Java API see the code below. Subtrees can be added to the SubtreeProcessingMode in three ways: A Subtrees List can be set through the SubtreeProcessingMode in initialisation (see lines 1-4), a Subtree list can be set using the setSubtrees() function (see lines 6-9), or individual Subtrees can be added via the addSubtree() function (see line 14). Once all desired settings are provided to the SubtreeProcessingMode, it can be set on the Document Comparator and a comparison can be performed.
List<Subtree> subtrees = new ArrayList<>();
Subtree Subtree = new Subtree("p", SubtreeType.TEXT);
Subtrees.add(Subtree);
SubtreeProcessingMode subtreeProcessingMode = new SubtreeProcessingMode(Subtrees);
List<Subtree> subtrees2 = new ArrayList<>();
Subtree subtree2 = new Subtree("para", SubtreeType.TEXT);
subtrees2.add(subtree2);
subtreeProcessingMode.setSubtrees(subtrees2)
subtreeProcessingMode.addSubtree(Subtree)
DocumentComparator dc= new DocumentComparator();
dc.setSubtreeProcessingMode(subtreeProcessingMode);
dc.compare(in1, in2, result);
Example 3
Subtrees Processing Mode allows mixed content to be compared with the processing that is best suited for the different types of content. Consider combining the previous two examples of Text and Data into one consolidated file. Using the following Java code, the container element within the inputs can be treated appropriately as Data whilst the rest of the document is treated as Text.
List<Subtree> subtrees = new ArrayList<>();
Subtree subtree = new Subtree("root/container", SubtreeType.DATA);
subtrees.add(subtree);
SubtreeProcessingMode subtreeProcessingMode = new SubtreeProcessingMode(SubtreeType.Text, subtrees);
DocumentComparator dc= new DocumentComparator();
dc.setSubtreeProcessingMode(subtreeProcessingMode);
dc.compare(in1, in2, result);
A
<?xml version="1.0" encoding="UTF-8"?>
<root>
<item>on Sunday her grandmother visited Anna, they went to a castle</item>
<item>Anna visited her grandmother on Sunday, they went to a lake</item>
<item>they had a good time</item>
<container>
<item>
<name>Adam</name>
<spouse>Beth</spouse>
<children>
<child>Chris</child>
</children>
<date_of_birth>1950</date_of_birth>
<date_of_marriage>1970</date_of_marriage>
<member_since>1970</member_since>
</item>
<item>
<name>Daniel</name>
<spouse>Edith</spouse>
<children>
<child>Fred</child>
</children>
<date_of_birth>1940</date_of_birth>
<date_of_marriage>1960</date_of_marriage>
<member_since>1980</member_since>
</item>
<item>
<name>Edith</name>
<spouse>Fred</spouse>
<children>
<child>Adam</child>
</children>
<date_of_birth>1960</date_of_birth>
<date_of_marriage>1980</date_of_marriage>
<member_since>1990</member_since>
</item>
<item>
<name>Ottie</name>
<spouse>Phil</spouse>
<children>
<child>Rich</child>
</children>
<date_of_birth>1990</date_of_birth>
<date_of_marriage>2010</date_of_marriage>
<member_since>2011</member_since>
</item>
</container>
</root>
B
<?xml version="1.0" encoding="UTF-8"?>
<root>
<item>Anna visited her grandmother on Sunday, they went to a castle</item>
<item>they had a good time</item>
<container>
<item>
<name>Adam</name>
<spouse>Beth</spouse>
<children>
<child>Chris</child>
</children>
<date_of_birth>1950</date_of_birth>
<date_of_marriage>1970</date_of_marriage>
<member_since>1970</member_since>
</item>
<item >
<name>Edith</name>
<spouse>Fred</spouse>
<children>
<child>Adam</child>
</children>
<date_of_birth>1960</date_of_birth>
<date_of_marriage>1980</date_of_marriage>
<member_since>1990</member_since>
</item>
<item>
<name>Daniel</name>
<spouse>Edith</spouse>
<children>
<child>Fred</child>
</children>
<date_of_birth>1940</date_of_birth>
<date_of_marriage>1960</date_of_marriage>
<member_since>1980</member_since>
</item>
<item>
<name>Ottie</name>
<spouse>Phil</spouse>
<children>
<child>Rich</child>
</children>
<date_of_birth>1990</date_of_birth>
<date_of_marriage>2010</date_of_marriage>
<member_since>2011</member_since>
</item>
</container>
</root>