Comparing Large Files

Introduction

It is possible to compare very large datasets using XML Compare; example test data has demonstrated that files over 1GB in size can be loaded from disk and compared in around 7 minutes.

There are many different factors that affect performance with large files apart from the CPU type and speed, and the amount of physical memory, both of which must be adequate for the job. Some of the more important are discussed in the following sections. We also provide some typical metrics for DeltaXML on basic machines.

White space

White space nodes are generally significant in XML files. Each white space, e.g. newline or space, can be treated as a node in the XML file. This can increase the memory image size and slow the comparison process. It can also result in differences being identified that are not significant.

In many situations white space nodes are not important and can be ignored. If a file has a DTD, an XML parser can use this as the file is read in to identify whether white space is ignorable or not. If a white space node is ignorable, for example because it appears between two markup tags, DeltaXML will ignore it in the comparison process.

If there is no DTD white space nodes should be removed either using an editor or by processing using an XSL filter such as normalize-space.xsl, though using XSL can be time consuming for large files. The delta files generated by DeltaXML have no white space added to them: if you look at them in an editor you will see that new lines are added only inside tags. This may look strange at first but it is an effective way to have shorter lines without adding white space nodes to an XML file. White space inside tags will be ignored by any XML parser.

Remember also that indentation of PCDATA within a file has an effect: often white space in PCDATA and attributes should be normalized before comparison. Otherwise, again, there will be a lot of differences reported that are not important.

XML file structure

There is a performance difference in comparing 'flat' XML files, i.e. large number of records at one level, and more nested files, which tends to require less processing because there are fewer nodes at each level. Comparison of orderless data is generally slower.

Number of differences

The performance is affacted by the number of differences: it is quickest when there are no differences! The more differences there are the slower the comparison process because the software is trying to find a best match between the files. The LCS algorithm used in DeltaXML for pattern matching ordered sequences has optimal performance for small numbers of differences and slows significantly for large numbers of differences.

PCDATA

DeltaXML shares text strings, so many different text strings will result in a larger memory image and may cause the program to hit memory size limitations sooner. On the other hand, files with many identical strings will be stored very efficiently.

Changes only or Full delta

The DeltaXML API has the ability to generate a delta file with 'changes-only' or a 'full delta' that includes unchanged data as well.

The time for comparison and the memory required is independent of the type of delta being produced. However the full-context delta output is typically larger and will require more disk IO and CPU time to write to disk.

Java Heap Size

The size of the JVM heap is one of the main factors which determines the size of datasets which DeltaXML can process. The size of the heap, amount of available RAM and other JVM configuration options affects both capacity and performance (too small a heap will result in excess garbage collection, similarly not enough RAM will causes performance degradation). The following guidelines are suggested:

Using java -Xmx can be used to increase the fairly small default JVM heap size. For example invoking using the: (java -Xmx512m...) command line argument will allocate half a gigabyte of RAM to the JVM heap.
Performance is poor if there isn't enough RAM available to support the requested JVM heap size. Using disk based swapping to support the heap exhibited significant slow downs. We suggest ensuring that the heap size specified with the java -Xmx argument is available as free RAM.
The J2SE server JVM can provide much better performance than the client JVM (in some cases twice as fast), but at the expense of increased memory consumption. If enough RAM is available, adding: java -server ... is recommended for best performance.
32 bit Operating Systems and processors can limit the process virtual address space and thus the amount of memory that you can dedicate to JVM heap usage. Some Operating Systems divide the 32 bit process address space into space for system/kernel and space for user code. For example, Windows™, Linux™ (most distributions/kernels) and MacOSX™ do a 50/50 split, leaving on a 32 bit machine around 2GBytes of space available for the Java heap, even on machines which have larger amounts of RAM installed. 32 bit processes on Solaris Sparc™ (7, 8 & 9) avoid the 50/50 split and make most of the 4Gbytes available to the java heap, for example java -Xmx3900m is possible.
To exceed the 2 or 4GByte Java heap size limits, a 64 bit JVM is usually required. However, for this to work usefully and to see benefits, 64 bit processors, corresponding Operating System support (some Operating Systems available for 64 bit processors only support a 32 bit address space, for example MacOSX™ 10.3) and more than 4 GBytes of RAM will be needed.
The use of Multiple Page Size Support
java -XX:+UseMPSS ... on Solaris provided a 5% runtime improvement in testing, with no measurable memory overhead.
Using the incremental garbage collector (java -Xincgc ...) showed no benefit when tested.
It was hoped the use of the Parallel Garbage collector (java -XX:+UseParallelGC ...) would provide improved run times on multiprocessors as garbage collection could occur concurrently on a separate CPU. It actually had the opposite effect, doubling the elapsed runtime and trebling the CPU time consumed.

SAX input sources

Reading from disk based files, for example, using the command-line interpreter, is typically slower than processing SAX events produced from an existing in memory data representation. As well as the reduced disk IO a more significant speedup arises from the lack of lexical analysis/tokenization that is otherwise performed by a SAX parser. We also recommend testing different SAX parsers and comparing their performance using your data if you need to read XML files from disk.

Metrics

It is difficult to give accurate performance metrics for the reasons outlined above. But some examples may help as an indication.

There are very few large XML datasets which are publicly available so we have used the XMark benchmark generator - from http://xmlbench.sourceforge.net. This is typically used for testing XML content repositories/databases and XQuery implementations. Suggestions for alternative benchmark data and particularly documents are welcome.

Test file generation

Example files of test data were generated using the following command-lines with the xmlgen application, the intention was to generate files around 1GByte in size:

BASH

$ ./xmlgen -v
This is xmlgen, version 0.92
by Florian Waas (flw@mx4.org)
$ ./xmlgen -f 10.0 -d -o f10.xml 
$ ./xmlgen -e -o auction.dtd
$ ls -l f10.xml 
-rw-rw-r--   1 nigelw   staff  1172322571 Nov  7 16:36 f10.xml

Test file characteristics

Some characteristics of the generated file are described using the following XPaths and their result values:

XPath Expression	value
count(/site/closed_auctions/closed_auction)	97,500
count(/site/open_auctions/open_auction)	120,000
count(/site/people/person)	255,000
count(/site/catgarph/edge)	10,000
count(/site/categories/category)	10,000
count(/site/regions/*)	6

Creating changes

While the shortest runtime is from performing an identity comparison (ie comparing the same file or data with itself) we wanted a more realistic test with perhaps a small number of changes. To achieve this we deleted 7 random grand-children elements in the data, from the elements with large numbers of children.

The following XPaths describe the elements which were deleted:

/site/closed_auctions/closed_auction[1]
/site/closed_auctions/closed_auction[97500]
/site/closed_auctions/closed_auction[66592]
/site/people/person[69105]
/site/people/person[120615]
/site/people/person[255000]
/site/catgraph/edge[5274]

This 'trimmed' file was called f10t.xml and was slightly smaller than the original input.

Test Hardware and Software

Test hardware was a Sun x4100, with:

2 AMD Opteron 275 CPUs (providing 4 cores at 2.2GHz)
12 GBytes of RAM
Internal 73GByte, 10k rpm SAS disks
Solaris 10 Update 3 (11/2006)
J2SE 1.5.0_06

Test software was XML Compare 5.2

Test Command

The command-line driver was used to run and time the tests. The following command was used:

BASH

$ time java -server -d64 -Xmx6g -jar /usr/local/deltaxml/DeltaXMLCore-5_2/command.jar compare delta f10.xml f10t.xml f10x.xml 
DeltaXML Command Processor, version: 1.5 
Copyright (c) 2000-2008 DeltaXML Ltd. All rights reserved. 
Using: XML Compare, version: 5.2
real    7m28.145s
user    10m7.796s
sys     1m13.039s

Result comments

The above command represents the basic, default command-line usage. The UNIX time results show a comparison time of around 7.5 minutes for these 1GByte data files. Faster times were obtained with the following techniques:

Turning off output indentation by adding "Indent=no" to the command-line saves around 10 seconds from both the CPU and elapsed times.
Turning off enhanced match 1 by adding "Enhanced Match 1=false" to the command line reduces the times to 6m38s real/9m05s user.
Comparing the f10.xml file with itself (an identity comparison which reads the files traverses them and writes a delta result) gives a lower-bound comparison time for the data of 5m33s real/8m17s user.

Some further issues relating to performance include:

The times generated above have a larger amount of CPU or 'user' time compare to the elapsed or 'real' time of the command. This is due to the introduction of multithreading in the XML Compare 5.2 release.
A substantial proportion of these times is spent doing disk IO and therefore the performance of the IO system will matter. Using Java code it is possible to send/receive SAX events from the comparator, this will be faster as it avoids both the disk IO and also parsing times. However, such tests are harder to configure and time.
It is possible to tune certain aspects of the JVM operation such as threading and garbage collection. Such results are often specific to the JVM being used (Sun and IBM for example offer different options) and may also give different results/benefits on different hardware/OS platforms.

We welcome feedback on these results and are prepared to look at tuning and performance issues for customers through our normal support channels. Any suggestions for large XML datasets which can be used for benchmarking and performance testing would also be welcomed.

How to test DeltaXML on your own large files

We often have enquiries about handling large files, 500Mb to several Gb. Here are a few other comments and suggestions.

Download an evaluation of XML Compare to try it on your own files. The Professional Named User edition has a 1M node limit - it will tell you if you hit this. The Professional Server and Enterprise do not have this limit. If you do word-by-word comparison the node count goes up a lot because the text is split into words and a word is counted as a node in the XML tree.

You would need to have sufficient memory - exactly how much depends on the nature of the data but for 500Mb files we would initially suggest somewhere in the 4 to 8 Gb range.

DeltaXML will work fine in a 64 bit environment, provided that you:
1. use a 64 bit OS and hardware
2. use a 64 bit JVM and invoke it appropriately.

For example, use the command line access like this, replacing x.y.z with the major.minor.patch version number of your release e.g. command-10.0.0.jar

BASH

$ java -Xmx4g -jar command-x.y.z.jar compare delta file1.xml file2.xml file1-2.xml

The java -version command will often report the use of a 32 or 64 bit JVM, for example:

BASH

$ java -version 
java version "1.6.0_24" 
Java(TM) SE Runtime Environment (build 1.6.0_24-b07-334-10M3326) 
Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02-334, mixed mode)

Use the -d64 argument if your default JVM is reported as 32 bit and use the -Xmx argument to adjust the heap size should you get any Java memory exceptions. The example above was for 4Gbytes which works well on a Mac desktop with 8GB of RAM.

We spend time optimizing our products and associated XML tools for lower memory footprints - some recent work was reported here: "XML Pipeline Performance". This paper describes advanced methods for optimizing XML pipeline performance. Presented at XML Prague 2010, held March 13th and 14th, 2010, Prague, CZ. Paper: PDF. Poster: PDF. (Please note that the performance figures presented predate the Saxon 9.3 release which addresses some of the issues discussed).

Conclusion

Be sure to remove white space from large input files. Performance depends on file structure and text content so needs to be evaluated on your own data.

However, it is clear from the above that XML Compare can be used successfully with very large XML datasets.

[data-colorid=peo81dxqmb]{color:#007800} html[data-color-mode=dark] [data-colorid=peo81dxqmb]{color:#87ff87}Introduction

White space

XML file structure

Number of differences

PCDATA

Changes only or Full delta

Java Heap Size

SAX input sources

Metrics

Test file generation

Test file characteristics

Creating changes

Test Hardware and Software

Test Command

Result comments

How to test DeltaXML on your own large files

Conclusion

Introduction