Performance Comparison of Various Document Parsers

by krishna
December 19, 2008

The performance comparisons used in this article are based on parsing and working with a set of selected XML documents intended to be representative of a wide range of applications:

much_ado.xml, the Shakespeare play marked up as XML. No attributes and a fairly flat structure (202K bytes).
periodic.xml, periodic table of the elements in XML. Some attributes, also fairly flat (117K bytes).
soap1.xml, sample SOAP document taken from the specification. Heavy on namespaces and attributes (0.4K bytes, repeated 49 times in each test pass).
soap2.xml, list of values in SOAP document form. Heavy on namespaces and attributes (134K bytes).
nt.xml, the New Testament marked up as XML. No attributes and very flat structure, heavy text content (1047K bytes).
xml.xml, the XML specification, with the DTD reference removed and all entities defined inline. Text-style markup with heavy mixed content, some attributes (160K bytes).

Document Build Time
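
As a rough illustration of what a build-time measurement involves, here is a minimal sketch that parses a document several times and keeps the fastest pass. It uses the DOM parser bundled with the JDK (JAXP), not any of the specific models compared in this article, and the class name `BuildTimeSketch`, the method names, and the tiny inline document are assumptions made for the example rather than the harness used for the original tests.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper class for illustration; not the original test harness.
final class BuildTimeSketch {

    /** Parses the XML bytes once with the JDK's built-in DOM parser. */
    static Document build(byte[] xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new ByteArrayInputStream(xml));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    /** Returns the fastest of several timed parses, in nanoseconds. */
    static long minBuildNanos(byte[] xml, int passes) {
        long best = Long.MAX_VALUE;
        for (int i = 0; i < passes; i++) {
            long start = System.nanoTime();
            build(xml);
            best = Math.min(best, System.nanoTime() - start);
        }
        return best;
    }

    public static void main(String[] args) {
        // Small stand-in document; a real run would load one of the test files.
        byte[] xml = "<play><title>Much Ado</title><act n=\"1\"/></play>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println("min build time (ns): " + minBuildNanos(xml, 20));
    }
}
```

Taking the minimum over several passes reduces noise from JIT warm-up and garbage collection, which matters most for very small documents such as soap1.xml.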

Document Walk Time


Document Modify Time


Text Generation Time


Document Memory Size


Java Serialization

- Serialization output time
- Serialization input time
- Serialized document size

Conclusion

The different Java XML document models all have some areas of strength, but from the performance standpoint there are some clear winners.

XPP is the performance leader in most respects. For middleware-type applications that do not require validation, entities, processing instructions, or comments, XPP looks to be an excellent choice despite its newness. This is especially true for applications running as browser applets or in limited memory environments.
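
XPP stands for XML Pull Parser: the application asks the parser for the next piece of the document instead of receiving callbacks. XPP itself is not part of the JDK, so the sketch below uses StAX (`javax.xml.stream`), a different pull-style API that ships with Java, purely as a stand-in to show the pull pattern. The class name `PullParseSketch` and the element-counting task are assumptions for the example, and XPP's own API differs from what is shown here.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class for illustration; uses StAX, not XPP.
final class PullParseSketch {

    /** Counts start-element events by pulling parse events one at a time. */
    static int countElements(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            int count = 0;
            while (reader.hasNext()) {
                // next() advances the parser and reports the event type.
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    count++;
                }
            }
            reader.close();
            return count;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<envelope><body><value>1</value><value>2</value></body></envelope>";
        System.out.println("elements: " + countElements(xml));
    }
}
```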

dom4j doesn’t have the sheer speed of XPP, but it does provide very good performance with a much more standardized and fully functional implementation, including built-in support for SAX2, DOM, and even XPath. Xerces DOM (with deferred node creation) also does well on most performance measurements, though it suffers on small files and Java serialization. For general XML handling, both dom4j and Xerces DOM are probably good choices, with the preference between the two determined by whether you consider Java-specific features or cross-language compatibility more important.

JDOM and Crimson DOM consistently rank poorly on the performance tests. Crimson DOM may still be worth using in the case of small documents, where Xerces does poorly. JDOM doesn’t really have anything to recommend it from the performance standpoint, though the developers have said they intend to focus on performance before the official release. However, it’ll probably be difficult for JDOM to match the performance of the other models without some restructuring of the API.

EXML is very small (in jar file size) and does well in some of the performance tests. Even with the advantage of deleting isolated whitespace content, though, EXML does not match XPP performance. Unless you need one of the features EXML supports but that XPP lacks, XPP is probably a better choice in limited-memory environments.

Currently none of the models can offer good performance for Java serialization, though dom4j does the best. If you need to transfer a document representation between programs, generally your best alternative is to write the document out as text and parse it back in to reconstruct the representation. Custom serialization formats may offer a better alternative in the future.
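
The text round trip suggested above can be sketched as follows. This is a minimal example using only the JDK's JAXP classes (a DOM `Document`, an identity `Transformer` for output, and a `DocumentBuilder` for input); the class name `TextRoundTripSketch`, its method names, and the sample document are assumptions for illustration. The other document models have their own output and builder classes, which are not shown.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical class for illustration of the write-as-text, parse-back approach.
final class TextRoundTripSketch {

    /** Writes the document out as XML text using an identity transform. */
    static String toText(Document doc) {
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not write document", e);
        }
    }

    /** Parses XML text back into a DOM document. */
    static Document fromText(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        Document original = fromText("<order id=\"7\"><item>tea</item></order>");
        String wire = toText(original);    // this string is what gets transferred
        Document copy = fromText(wire);    // the receiving side rebuilds the tree
        System.out.println(copy.getDocumentElement().getAttribute("id"));
    }
}
```

The receiving program only needs an XML parser, not the sender's document model classes, which is part of why plain text is a practical transfer format.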

Reference:
Xerces4Java – http://xerces.apache.org/xerces-j/
Crimson – http://xml.apache.org/crimson/
JDOM – http://jdom.org
dom4j – http://dom4j.org
XML Pull Parser (XPP) – http://www.xmlpull.org
