I wrote an XML parser in Java using Woodstox. This parser is going to parse extremely large files, potentially 5+ GB. The goal of the parser is to convert a nested XML file into a CSV. The XML file is formatted such that there is a 'rowTag' that contains the actual information the parser is interested in. Take for example this XML file:
<persons>
<person id="1">
<firstname>James</firstname>
<lastname>Smith</lastname>
<middlename></middlename>
<dob_year>1980</dob_year>
<dob_month>1</dob_month>
<gender>M</gender>
<salary currency="Euro">10000</salary>
<street>456 apple street</street>
<city>newark</city>
<state>DE</state>
</person>
<person id="2">
<firstname>Michael</firstname>
<lastname></lastname>
<middlename>Rose</middlename>
<dob_year>1990</dob_year>
<dob_month>6</dob_month>
<gender>M</gender>
<salary currency="Dollor">10000</salary>
<street>4367 orange st</street>
<city>sandiago</city>
<state>CA</state>
</person>
</persons>
The rowTag here would be "person", and the headers would be all the tags inside of <person>.
Below is the class I wrote for this. I would love to hear feedback. You can use Woodstox with either this jar file, or include it in Gradle:
implementation(["org.codehaus.woodstox:stax2-api:3.1.1"])
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import javax.xml.parsers.FactoryConfigurationError;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import org.apache.commons.lang3.StringUtils;
import org.codehaus.stax2.XMLInputFactory2;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class XmlConverter2
{
private static final Logger logger = LoggerFactory.getLogger(XmlConverter.class);
private static final String ROWTAG = "person";
public void readLargeXmlWithWoodStox(String file)
throws FactoryConfigurationError, XMLStreamException, IOException
{
long startTime = System.nanoTime();
// set up a Woodstox reader
XMLInputFactory xmlif = XMLInputFactory2.newInstance();
XMLStreamReader xmlStreamReader = xmlif.createXMLStreamReader(new FileReader(file));
boolean firstPass = true;
boolean insideRowTag = false;
Files.deleteIfExists(new File( file + ".csv").toPath());
BufferedWriter br = new BufferedWriter(new FileWriter(file + ".csv", true), 64*1024*1024);
StringBuilder firstItems = new StringBuilder();
try
{
while (xmlStreamReader.hasNext())
{
xmlStreamReader.next();
// If 4 event, meaning just some random '\n' or something, we skip.
if (xmlStreamReader.isCharacters())
{
continue;
}
// If we are at a start element, we want to check a couple of things
if (xmlStreamReader.isStartElement())
{
// If we are at our rowtag, we want to start looking at what is inside.
// We are 'continuing' because a Rowtag will not have any "elementText" in it, so we want to continue to the next tag.
if (xmlStreamReader.getLocalName().equalsIgnoreCase(ROWTAG))
{
insideRowTag = true;
continue;
}
// if we are at a tag inside a row tag, we want to extract that information (the text it contains) from it....
if (insideRowTag)
{
// ...but first, if we have not started to collect everything, we need to collect the headers!
// This makes an assumption that all the "headers" are constant. If the first record has 6 tags in it,
// but the next one has 7 tags in it, we are in trouble. We can add flexibility for that, I think.
if (firstPass)
{
// We want to write the headers first
br.write(xmlStreamReader.getLocalName() + ',');
// And collect the items inside in a stringBuilder, which we'll dump later.
firstItems.append(xmlStreamReader.getElementText()).append(',');
} else
{
// If we're not in the first pass, just write the elements directly.
br.write(xmlStreamReader.getElementText() + ',');
}
}
}
// If we are at an end element that is the rowTag, so at the end of the record, we want to do a couple of things
if (xmlStreamReader.isEndElement() && xmlStreamReader.getLocalName().equalsIgnoreCase(ROWTAG))
{
// First, if we are at the first pass, we want to send out the elements inside the first record
// that we were collecting to dump *after* we got all the headers
if (firstPass)
{
firstPass = false;
br.write('\n' + StringUtils.chop(firstItems.toString()));
}
// Then we set this off so that we no longer collect irrelevant data if it is present.
insideRowTag = false;
br.write('\n');
}
}
}
catch (Exception e)
{
logger.error("Error! " + e.toString());
}
finally
{
xmlStreamReader.close();
}
br.close();
long endTime = System.nanoTime();
long totalTime = endTime - startTime;
logger.info("Done! Time took: {}", totalTime / 1000000000);
}
}
My goal is to make this faster and/or consume less memory. Any other advice is appreciated, of course. I am executing it with the -Xms4g -Xmx4g flags. Right now, it takes around 25 seconds to run on an XML file that is approximately 1.5 GB.
4 Answers
I have some suggestions for the code that will not make the code faster, but cleaner, in my opinion.
- You can use try-with-resources to handle the closing of the stream automatically (Java 7+):
try(BufferedWriter br = new BufferedWriter(new FileWriter(file + ".csv", true), 64 * 1024 * 1024)) {
//[...]
}
- I suggest that you separate the code in more methods, to separate the different sections; preferably the section that handles the reading / parsing of the XML.
private void parseXml(XMLStreamReader xmlStreamReader, boolean firstPass, boolean insideRowTag, BufferedWriter br) throws XMLStreamException, IOException {
StringBuilder firstItems = new StringBuilder();
while (xmlStreamReader.hasNext()) {
xmlStreamReader.next();
// If 4 event, meaning just some random '\n' or something, we skip.
if (xmlStreamReader.isCharacters()) {
continue;
}
// If we are at a start element, we want to check a couple of things
if (xmlStreamReader.isStartElement()) {
// If we are at our rowtag, we want to start looking at what is inside.
// We are 'continuing' because a Rowtag will not have any "elementText" in it, so we want to continue to the next tag.
if (xmlStreamReader.getLocalName().equalsIgnoreCase(ROWTAG)) {
insideRowTag = true;
continue;
}
// if we are at a tag inside a row tag, we want to extract that information (the text it contains) from it....
if (insideRowTag) {
// ...but first, if we have not started to collect everything, we need to collect the headers!
// This makes an assumption that all the "headers" are constant. If the first record has 6 tags in it,
// but the next one has 7 tags in it, we are in trouble. We can add flexibility for that, I think.
if (firstPass) {
// We want to write the headers first
br.write(xmlStreamReader.getLocalName() + ',');
// And collect the items inside in a stringBuilder, which we'll dump later.
firstItems.append(xmlStreamReader.getElementText()).append(',');
} else {
// If we're not in the first pass, just write the elements directly.
br.write(xmlStreamReader.getElementText() + ',');
}
}
}
// If we are at an end element that is the rowTag, so at the end of the record, we want to do a couple of things
if (xmlStreamReader.isEndElement() && xmlStreamReader.getLocalName().equalsIgnoreCase(ROWTAG)) {
// First, if we are at the first pass, we want to send out the elements inside the first record
// that we were collecting to dump *after* we got all the headers
if (firstPass) {
firstPass = false;
br.write('\n' + StringUtils.chop(firstItems.toString()));
}
// Then we set this off so that we no longer collect irrelevant data if it is present.
insideRowTag = false;
br.write('\n');
}
}
}
Refactored code
public class XmlConverter2 {
private static final Logger logger = LoggerFactory.getLogger(XmlConverter2.class);
private static final String ROWTAG = "person";
public void readLargeXmlWithWoodStox(String file)
throws FactoryConfigurationError, XMLStreamException, IOException {
long startTime = System.nanoTime();
// set up a Woodstox reader
XMLInputFactory xmlif = XMLInputFactory2.newInstance();
XMLStreamReader xmlStreamReader = xmlif.createXMLStreamReader(new FileReader(file));
boolean firstPass = true;
boolean insideRowTag = false;
Files.deleteIfExists(new File(file + ".csv").toPath());
try (BufferedWriter br = new BufferedWriter(new FileWriter(file + ".csv", true), 64 * 1024 * 1024)) {
parseXml(xmlStreamReader, firstPass, insideRowTag, br);
} catch (Exception e) {
logger.error("Error! " + e.toString());
} finally {
xmlStreamReader.close();
}
long endTime = System.nanoTime();
long totalTime = endTime - startTime;
logger.info("Done! Time took: {}", totalTime / 1000000000);
}
private void parseXml(XMLStreamReader xmlStreamReader, boolean firstPass, boolean insideRowTag, BufferedWriter br) throws XMLStreamException, IOException {
StringBuilder firstItems = new StringBuilder();
while (xmlStreamReader.hasNext()) {
xmlStreamReader.next();
// If 4 event, meaning just some random '\n' or something, we skip.
if (xmlStreamReader.isCharacters()) {
continue;
}
// If we are at a start element, we want to check a couple of things
if (xmlStreamReader.isStartElement()) {
// If we are at our rowtag, we want to start looking at what is inside.
// We are 'continuing' because a Rowtag will not have any "elementText" in it, so we want to continue to the next tag.
if (xmlStreamReader.getLocalName().equalsIgnoreCase(ROWTAG)) {
insideRowTag = true;
continue;
}
// if we are at a tag inside a row tag, we want to extract that information (the text it contains) from it....
if (insideRowTag) {
// ...but first, if we have not started to collect everything, we need to collect the headers!
// This makes an assumption that all the "headers" are constant. If the first record has 6 tags in it,
// but the next one has 7 tags in it, we are in trouble. We can add flexibility for that, I think.
if (firstPass) {
// We want to write the headers first
br.write(xmlStreamReader.getLocalName() + ',');
// And collect the items inside in a stringBuilder, which we'll dump later.
firstItems.append(xmlStreamReader.getElementText()).append(',');
} else {
// If we're not in the first pass, just write the elements directly.
br.write(xmlStreamReader.getElementText() + ',');
}
}
}
// If we are at an end element that is the rowTag, so at the end of the record, we want to do a couple of things
if (xmlStreamReader.isEndElement() && xmlStreamReader.getLocalName().equalsIgnoreCase(ROWTAG)) {
// First, if we are at the first pass, we want to send out the elements inside the first record
// that we were collecting to dump *after* we got all the headers
if (firstPass) {
firstPass = false;
br.write('\n' + StringUtils.chop(firstItems.toString()));
}
// Then we set this off so that we no longer collect irrelevant data if it is present.
insideRowTag = false;
br.write('\n');
}
}
}
}
- "This definitely looks a lot cleaner! Thank you for the suggestion!" – John Lexus, Feb 1, 2020 at 10:41
I was somewhat curious whether Woodstox has improved, so I wrote a complete parser for your example data. It's in a different style than your code; complete repo: https://github.com/chhh/testing-woodstox-xml-parsing
My results with fake data records that I created:
Parsed 4,000,000 persons (1.36 GB) in 16.75 seconds (Ryzen5 3600), memory usage wasn't really significant.
First of all, there's a newer version of Woodstox on Maven Central.
Gradle dependency: implementation 'com.fasterxml.woodstox:woodstox-core:6.0.3'
They now have XMLStreamReader2 with a .configureForSpeed() option. I didn't really check what it does, but for my test it didn't do much.
I had to create fake data. You can make files of any size with FakeData.createHugeXml(Path path, int numEntries).
Just in case, here's the main parsing code, excluding the Person class (which is not very interesting and can be found here):
public class WoodstoxParser {
@FunctionalInterface
interface ConditionCallback {
boolean processXml(XMLStreamReader2 sr) throws XMLStreamException;
}
interface TagPairCallback {
void tagStart(String tagName, XMLStreamReader2 sr) throws XMLStreamException;
void tagContents(String tagName, StringBuilder sb);
}
public static void processUntilTrue(XMLStreamReader2 sr, ConditionCallback callback) throws XMLStreamException {
do {
if (callback.processXml(sr))
return;
} while (sr.hasNext() && sr.next() >= 0);
throw new IllegalStateException("xml document ended without callback returning true");
}
/** Main parsing function. **/
public static List<Person> parse(Path path) throws IOException, XMLStreamException {
XMLInputFactory2 f = (XMLInputFactory2) XMLInputFactory2.newFactory();
f.configureForSpeed();
// f.configureForLowMemUsage();
XMLStreamReader2 sr = null;
try (BufferedReader br = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
sr = (XMLStreamReader2) f.createXMLStreamReader(br);
// fast forward to beginning 'persons' tag (will throw if we don't find the tag at all)
processUntilTrue(sr, sr1 -> isTagStart(sr1, "persons"));
final List<Person> persons = new ArrayList<>(); // we've found the tag, so we can allocate storage for data
final StringBuilder sb = new StringBuilder(); // reuse a single string builder for all character aggregation
// now keep processing unless we reach closing 'persons' tag
processUntilTrue(sr, sr1 -> {
if (isTagEnd(sr1, "persons"))
return true;
if (isTagStart(sr1, "person")) {
// now we're finally reached a 'person', can start processing it
int idIndex = sr1.getAttributeInfo().findAttributeIndex("", "id");
Person p = new Person(Integer.parseInt(sr1.getAttributeValue(idIndex)));
sr1.next();
processUntilTrue(sr1, sr2 -> {
// processing the meat of a 'person' tag
// split it into a function of its own to not clutter the main loop
//return processPerson(sr2, p, sb);
if (isTagEnd(sr2, "person"))
return true; // we're done processing a 'person' only when we reach the ending 'person' tag
if (isTagStart(sr2))
processTagPair(sr2, sb, p);
return false;
});
// we've reached the end of a 'person'
if (p.isComplete()) {
persons.add(p);
} else {
throw new IllegalStateException("Whoa, a person had incomplete data");
}
}
return false;
});
return persons;
} finally {
if (sr != null)
sr.close();
}
}
public static void processTagPair(XMLStreamReader2 sr, StringBuilder sb, TagPairCallback callback) throws XMLStreamException {
final String tagName = sr.getLocalName();
callback.tagStart(tagName, sr); // let the caller do whatever they need with the tag name and attributes
sb.setLength(0); // clear our buffer, preparing to read the characters inside
processUntilTrue(sr, sr1 -> {
switch (sr1.getEventType()) {
case XMLStreamReader2.END_ELEMENT: // ending condition
callback.tagContents(tagName, sb); // let the caller do whatever they need with text contents of the tag
return true;
case XMLStreamReader2.CHARACTERS:
sb.append(sr1.getText());
break;
}
return false;
});
}
public static boolean isTagStart(XMLStreamReader2 sr, String tagName) {
return XMLStreamReader2.START_ELEMENT == sr.getEventType() && tagName.equalsIgnoreCase(sr.getLocalName());
}
public static boolean isTagStart(XMLStreamReader2 sr) {
return XMLStreamReader2.START_ELEMENT == sr.getEventType();
}
public static boolean isTagEnd(XMLStreamReader2 sr, String tagName) {
return XMLStreamReader2.END_ELEMENT == sr.getEventType() && tagName.equalsIgnoreCase(sr.getLocalName());
}
}
- "Thank you so much for the answer. I'm currently on vacation and get back in two or so weeks. Will take a closer look then - unless I get curious, which I probably will, and look at it before :)" – John Lexus, Feb 1, 2020 at 10:40
- I was in a similar position several years ago, needing to parse multi-gigabyte XML files. I tried all the standard solutions: Woodstox, Xerces, Piccolo, and so on - I can't remember all the names. I ended up using an XML parser from a library called Javolution. Its development stalled a while back, but the parser works well.
Available from Maven Central: https://search.maven.org/artifact/org.javolution/javolution-core-java/6.0.0/bundle
I got it to parse at about 1 GB/s with an SSD.
- A very old example of my usage (link to the line where the XML parser is instantiated): https://github.com/chhh/MSFTBX/blob/e53ae6be982e2de3123292be7d5297715bec70bb/MSFileToolbox/src/main/java/umich/ms/fileio/filetypes/mzml/MZMLMultiSpectraParser.java#L105
- Description of their XML package: https://github.com/javolution/javolution/blob/master/src/main/java/org/javolution/xml/package-info.java
If you're using an HDD without RAID, then you're most likely limited to 100-200 MB/s just by IO, so you likely can't do better than 1 GB in 5 seconds in that scenario.
The core thing for XML parsing speed (apart from just good IO code) is to not allocate unnecessary garbage: the parser should not be allocating Strings all the time just to do a comparison or hand you an array of a tag's attributes. Javolution does exactly that, using an internal sliding buffer and referencing into it - like a java.lang.CharSequence, called CharArray in Javolution. It's important to use CharArray#contentEquals() when comparing to Strings, to avoid extra String creation.
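To illustrate the principle (not Javolution's actual internals), here is a minimal sketch of comparing a region of a parser's char buffer against an expected tag name without materializing a String per comparison; regionEquals is a hypothetical stand-in for what CharArray#contentEquals() does against its internal buffer:

```java
public class BufferCompare {
    // Returns true if buf[off .. off+len) spells exactly the same
    // characters as s, without allocating a new String for the region.
    static boolean regionEquals(char[] buf, int off, int len, String s) {
        if (len != s.length()) return false;
        for (int i = 0; i < len; i++) {
            if (buf[off + i] != s.charAt(i)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Pretend this is the parser's internal sliding buffer;
        // the tag name "person" occupies indices 1..6.
        char[] buffer = "<person id=\"1\">".toCharArray();
        System.out.println(regionEquals(buffer, 1, 6, "person")); // true
        System.out.println(regionEquals(buffer, 1, 6, "salary")); // false
    }
}
```

A naive equivalent would be `new String(buf, off, len).equals(s)`, which allocates on every comparison - exactly the garbage this approach avoids in a hot parsing loop.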
- "According to xml.com/pub/a/2007/05/09/xml-parser-benchmarks-part-1.html (which is quite old), Javolution parsing performance is pretty much the same as Woodstox - fast, but not wildly faster than other parsers." – Michael Kay, Jan 31, 2020 at 9:50
- "@MichaelKay Just sharing my experience. I wasn't able to squeeze comparable performance out of Woodstox, but maybe I did something wrong somewhere." – Dmitry Avtonomov, Jan 31, 2020 at 19:27
Aside from the other suggestions, which make sense, one simple thing to try is the Aalto-xml parser from:
https://github.com/FasterXML/aalto-xml
It implements the Stax API (as well as the Stax2 extension and SAX), and as long as you do not need full DTD handling (which I suspect you don't) it has the feature set you need. For common read use cases I think it can be 30-40% faster; most importantly, it should be very easy to just try out.
Maven coordinates are:
- group id:
com.fasterxml
- artifact id:
aalto-xml
- version (latest): 1.2.2
The XMLInputFactory implementation is com.fasterxml.aalto.stax.InputFactoryImpl. I would recommend creating an instance directly instead of using XMLInputFactory.newInstance(), so you can be sure of the exact implementation you have (if you have multiple Stax implementations on the classpath, the choice is arbitrary).
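To make that concrete, a minimal sketch of instantiating the factory directly (assuming aalto-xml is on the classpath; the inline String input is just a stand-in for a real file):

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import com.fasterxml.aalto.stax.InputFactoryImpl;

public class AaltoDirect {
    public static void main(String[] args) throws XMLStreamException {
        // Instantiate Aalto's factory directly rather than going through
        // XMLInputFactory.newInstance(), so the implementation in use is
        // unambiguous even with several Stax providers on the classpath.
        XMLInputFactory f = new InputFactoryImpl();
        XMLStreamReader r = f.createXMLStreamReader(
                new StringReader("<person id=\"1\"/>"));
        while (r.hasNext()) {
            r.next(); // drive the reader through the document
        }
        r.close();
    }
}
```

Since Aalto implements the standard Stax interfaces, the rest of the parsing code above should work unchanged after this one-line swap.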