Performance Comparison with various Document Parsers
by krishna
The performance comparisons used in this article are based on parsing and working with a set of selected XML documents intended to be representative of a wide range of applications:
- much_ado.xml, the Shakespeare play marked up as XML. No attributes and a fairly flat structure (202K bytes).
- periodic.xml, periodic table of the elements in XML. Some attributes, also fairly flat (117K bytes).
- soap1.xml, sample SOAP document taken from the specification. Heavy namespaces and attributes (0.4K bytes, repeated 49 times each test pass).
- soap2.xml, list of values in SOAP document form. Heavy on namespaces and attributes (134K bytes).
- nt.xml, the New Testament marked up as XML. No attributes and very flat structure, heavy text content (1047K bytes).
- xml.xml, the XML specification, with the DTD reference removed and all entities defined inline. Text-style markup with heavy mixed content, some attributes (160K bytes).
Document Build Time
Document Walk Time
Document ModifyTime
Text Generation Time
Document Memory Size
Java Serialization
- Serialization output time
- Serialization input time
- Serialized document size
Conslusion
The different Java XML document models all have some areas of strength, but from the performance standpoint there are some clear winners.
XPP is the performance leader in most respects. For middleware-type applications that do not require validation, entities, processing instructions, or comments, XPP looks to be an excellent choice despite its newness. This is especially true for applications running as browser applets or in limited memory environments.
dom4j doesn’t have the sheer speed of XPP, but it does provide very good performance with a much more standardized and fully functional implementation, including built-in support for SAX2, DOM, and even XPath. Xerces DOM (with deferred node creation) also does well on most performance measurements, though it suffers on small files and Java serialization. For general XML handling, both dom4j and Xerces DOM are probably good choices, with the preference between the two determined by whether you consider Java-specific features or cross-language compatibility more important.
JDOM and Crimson DOM consistently rank poorly on the performance tests. Crimson DOM may still be worth using in the case of small documents, where Xerces does poorly. JDOM doesn’t really have anything to recommend it from the performance standpoint, though the developers have said they intend to focus on performance before the official release. However, it’ll probably be difficult for JDOM to match the performance of the other models without some restructuring of the API.
EXML is very small (in jar file size) and does well in some of the performance tests. Even with the advantage of deleting isolated whitespace content, though, EXML does not match XPP performance. Unless you need one of the features EXML supports but that XPP lacks, XPP is probably a better choice in limited-memory environments.
Currently none of the models can offer good performance for Java serialization, though dom4j does the best. If you need to transfer a document representation between programs, generally your best alternative is to write the document out as text and parse it back in to reconstruct the representation. Custom serialization formats may offer a better alternative in the future.
Reference:
Xerces4Java – http://xerces.apache.org/xerces-j/
Crimson – http://xml.apache.org/crimson/
JDOM – http://jdom.org
dom4j – http://dom4j.org
XML Pull Parser (XPP) – http://www.xmlpull.org
The performance comparisons used in this article are based on parsing and working with a set of selected XML documents intended to be representative of a wide range of applications: much_ado.xml, the Shakespeare play marked up as XML. No attributes and a fairly flat structure (202K bytes). periodic.xml, periodic table of the elements in XML.…
Performance Comparison with various Document Parsers
by krishna
The performance comparisons used in this article are based on parsing and working with a set of selected XML documents intended to be representative of a wide range of applications:
- much_ado.xml, the Shakespeare play marked up as XML. No attributes and a fairly flat structure (202K bytes).
- periodic.xml, periodic table of the elements in XML. Some attributes, also fairly flat (117K bytes).
- soap1.xml, sample SOAP document taken from the specification. Heavy namespaces and attributes (0.4K bytes, repeated 49 times each test pass).
- soap2.xml, list of values in SOAP document form. Heavy on namespaces and attributes (134K bytes).
- nt.xml, the New Testament marked up as XML. No attributes and very flat structure, heavy text content (1047K bytes).
- xml.xml, the XML specification, with the DTD reference removed and all entities defined inline. Text-style markup with heavy mixed content, some attributes (160K bytes).
Document Build Time
The performance comparisons used in this article are based on parsing and working with a set of selected XML documents intended to be representative of a wide range of applications: much_ado.xml, the Shakespeare play marked up as XML. No attributes and a fairly flat structure (202K bytes). periodic.xml, periodic table of the elements in XML.…
Leave a Reply Cancel reply
Recent Comments
Archives
- August 2025
- July 2025
- June 2025
- May 2025
- April 2025
- March 2025
- November 2024
- October 2024
- September 2024
- August 2024
- July 2024
- June 2024
- May 2024
- April 2024
- March 2024
- February 2024
- January 2024
- December 2023
- November 2023
- February 2012
- January 2012
- December 2011
- October 2011
- August 2011
- July 2011
- May 2011
- January 2011
- November 2010
- October 2010
- September 2010
- July 2010
- April 2010
- March 2010
- February 2010
- January 2010
- December 2009
- October 2009
- September 2009
- August 2009
- July 2009
- June 2009
- May 2009
- April 2009
- March 2009
- February 2009
- January 2009
- December 2008
- November 2008
- October 2008
- August 2008
- July 2008
- June 2008
- December 2007
- April 2007
- January 2007