I wrote an XML parser in Java using Woodstox. This parser is going to parse extremely large files, potentially 5+ GB. The goal of the parser is to convert a nested XML file into a CSV. The XML file is formatted such that there is a 'rowTag' that contains the actual information the parser is interested in. Take for example this XML file:
<persons>
<person id="1">
<firstname>James</firstname>
<lastname>Smith</lastname>
<middlename></middlename>
<dob_year>1980</dob_year>
<dob_month>1</dob_month>
<gender>M</gender>
<salary currency="Euro">10000</salary>
<street>456 apple street</street>
<city>newark</city>
<state>DE</state>
</person>
<person id="2">
<firstname>Michael</firstname>
<lastname></lastname>
<middlename>Rose</middlename>
<dob_year>1990</dob_year>
<dob_month>6</dob_month>
<gender>M</gender>
<salary currency="Dollor">10000</salary>
<street>4367 orange st</street>
<city>sandiago</city>
<state>CA</state>
</person>
</persons>
The rowTag here would be "person", and the headers would be all the tags inside of <person>.
Below is the class I wrote for this. I would love to hear feedback. You can use Woodstox with either this jar file, or include it in Gradle:
implementation(["org.codehaus.woodstox:stax2-api:3.1.1"])
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import javax.xml.parsers.FactoryConfigurationError;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import org.apache.commons.lang3.StringUtils;
import org.codehaus.stax2.XMLInputFactory2;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class XmlConverter2
{
private static final Logger logger = LoggerFactory.getLogger(XmlConverter.class);
private static final String ROWTAG = "person";
public void readLargeXmlWithWoodStox(String file)
throws FactoryConfigurationError, XMLStreamException, IOException
{
long startTime = System.nanoTime();
// set up a Woodstox reader
XMLInputFactory xmlif = XMLInputFactory2.newInstance();
XMLStreamReader xmlStreamReader = xmlif.createXMLStreamReader(new FileReader(file));
boolean firstPass = true;
boolean insideRowTag = false;
Files.deleteIfExists(new File( file + ".csv").toPath());
BufferedWriter br = new BufferedWriter(new FileWriter(file + ".csv", true), 64*1024*1024);
StringBuilder firstItems = new StringBuilder();
try
{
while (xmlStreamReader.hasNext())
{
xmlStreamReader.next();
// If 4 event, meaning just some random '\n' or something, we skip.
if (xmlStreamReader.isCharacters())
{
continue;
}
// If we are at a start element, we want to check a couple of things
if (xmlStreamReader.isStartElement())
{
// If we are at our rowtag, we want to start looking at what is inside.
// We are 'continuing' because a Rowtag will not have any "elementText" in it, so we want to continue to the next tag.
if (xmlStreamReader.getLocalName().equalsIgnoreCase(ROWTAG))
{
insideRowTag = true;
continue;
}
// if we are at a tag inside a row tag, we want to extract that information (the text it contains) from it....
if (insideRowTag)
{
// ...but first, if we have not started to collect everything, we need to collect the headers!
// This makes an assumption that all the "headers" are constant. If the first record has 6 tags in it,
// but the next one has 7 tags in it, we are in trouble. We can add flexibility for that, I think.
if (firstPass)
{
// We want to write the headers first
br.write(xmlStreamReader.getLocalName() + ',');
// And collect the items inside in a stringBuilder, which we'll dump later.
firstItems.append(xmlStreamReader.getElementText()).append(',');
} else
{
// If we're not in the first pass, just write the elements directly.
br.write(xmlStreamReader.getElementText() + ',');
}
}
}
// If we are at an end element that is the rowTag, so at the end of the record, we want to do a couple of things
if (xmlStreamReader.isEndElement() && xmlStreamReader.getLocalName().equalsIgnoreCase(ROWTAG))
{
// First, if we are at the first pass, we want to send out the elements inside the first record
// that we were collecting to dump *after* we got all the headers
if (firstPass)
{
firstPass = false;
br.write('\n' + StringUtils.chop(firstItems.toString()));
}
// Then we set this off so that we no longer collect irrelevant data if it is present.
insideRowTag = false;
br.write('\n');
}
}
}
catch (Exception e)
{
logger.error("Error! " + e.toString());
}
finally
{
xmlStreamReader.close();
}
br.close();
long endTime = System.nanoTime();
long totalTime = endTime - startTime;
logger.info("Done! Time took: {}", totalTime / 1000000000);
}
}
My goal is to make this faster and/or consume less memory. Any other advice is appreciated, of course. I am executing it with the -Xms4g -Xmx4g flags. Right now, it takes around 25 seconds to run on an XML file that is approximately 1.5 GB.
4 Answers
I have some suggestions for the code that will not make the code faster, but cleaner, in my opinion.
- You can use try-with-resources to handle the closing of the stream automatically (Java 7+):
try(BufferedWriter br = new BufferedWriter(new FileWriter(file + ".csv", true), 64 * 1024 * 1024)) {
//[...]
}
- I suggest that you separate the code in more methods, to separate the different sections; preferably the section that handles the reading / parsing of the XML.
private void parseXml(XMLStreamReader xmlStreamReader, boolean firstPass, boolean insideRowTag, BufferedWriter br) throws XMLStreamException, IOException {
StringBuilder firstItems = new StringBuilder();
while (xmlStreamReader.hasNext()) {
xmlStreamReader.next();
// If 4 event, meaning just some random '\n' or something, we skip.
if (xmlStreamReader.isCharacters()) {
continue;
}
// If we are at a start element, we want to check a couple of things
if (xmlStreamReader.isStartElement()) {
// If we are at our rowtag, we want to start looking at what is inside.
// We are 'continuing' because a Rowtag will not have any "elementText" in it, so we want to continue to the next tag.
if (xmlStreamReader.getLocalName().equalsIgnoreCase(ROWTAG)) {
insideRowTag = true;
continue;
}
// if we are at a tag inside a row tag, we want to extract that information (the text it contains) from it....
if (insideRowTag) {
// ...but first, if we have not started to collect everything, we need to collect the headers!
// This makes an assumption that all the "headers" are constant. If the first record has 6 tags in it,
// but the next one has 7 tags in it, we are in trouble. We can add flexibility for that, I think.
if (firstPass) {
// We want to write the headers first
br.write(xmlStreamReader.getLocalName() + ',');
// And collect the items inside in a stringBuilder, which we'll dump later.
firstItems.append(xmlStreamReader.getElementText()).append(',');
} else {
// If we're not in the first pass, just write the elements directly.
br.write(xmlStreamReader.getElementText() + ',');
}
}
}
// If we are at an end element that is the rowTag, so at the end of the record, we want to do a couple of things
if (xmlStreamReader.isEndElement() && xmlStreamReader.getLocalName().equalsIgnoreCase(ROWTAG)) {
// First, if we are at the first pass, we want to send out the elements inside the first record
// that we were collecting to dump *after* we got all the headers
if (firstPass) {
firstPass = false;
br.write('\n' + StringUtils.chop(firstItems.toString()));
}
// Then we set this off so that we no longer collect irrelevant data if it is present.
insideRowTag = false;
br.write('\n');
}
}
}
Refactored code
public class XmlConverter2 {
private static final Logger logger = LoggerFactory.getLogger(XmlConverter2.class);
private static final String ROWTAG = "person";
public void readLargeXmlWithWoodStox(String file)
throws FactoryConfigurationError, XMLStreamException, IOException {
long startTime = System.nanoTime();
// set up a Woodstox reader
XMLInputFactory xmlif = XMLInputFactory2.newInstance();
XMLStreamReader xmlStreamReader = xmlif.createXMLStreamReader(new FileReader(file));
boolean firstPass = true;
boolean insideRowTag = false;
Files.deleteIfExists(new File(file + ".csv").toPath());
try (BufferedWriter br = new BufferedWriter(new FileWriter(file + ".csv", true), 64 * 1024 * 1024)) {
parseXml(xmlStreamReader, firstPass, insideRowTag, br);
} catch (Exception e) {
logger.error("Error! " + e.toString());
} finally {
xmlStreamReader.close();
}
long endTime = System.nanoTime();
long totalTime = endTime - startTime;
logger.info("Done! Time took: {}", totalTime / 1000000000);
}
private void parseXml(XMLStreamReader xmlStreamReader, boolean firstPass, boolean insideRowTag, BufferedWriter br) throws XMLStreamException, IOException {
StringBuilder firstItems = new StringBuilder();
while (xmlStreamReader.hasNext()) {
xmlStreamReader.next();
// If 4 event, meaning just some random '\n' or something, we skip.
if (xmlStreamReader.isCharacters()) {
continue;
}
// If we are at a start element, we want to check a couple of things
if (xmlStreamReader.isStartElement()) {
// If we are at our rowtag, we want to start looking at what is inside.
// We are 'continuing' because a Rowtag will not have any "elementText" in it, so we want to continue to the next tag.
if (xmlStreamReader.getLocalName().equalsIgnoreCase(ROWTAG)) {
insideRowTag = true;
continue;
}
// if we are at a tag inside a row tag, we want to extract that information (the text it contains) from it....
if (insideRowTag) {
// ...but first, if we have not started to collect everything, we need to collect the headers!
// This makes an assumption that all the "headers" are constant. If the first record has 6 tags in it,
// but the next one has 7 tags in it, we are in trouble. We can add flexibility for that, I think.
if (firstPass) {
// We want to write the headers first
br.write(xmlStreamReader.getLocalName() + ',');
// And collect the items inside in a stringBuilder, which we'll dump later.
firstItems.append(xmlStreamReader.getElementText()).append(',');
} else {
// If we're not in the first pass, just write the elements directly.
br.write(xmlStreamReader.getElementText() + ',');
}
}
}
// If we are at an end element that is the rowTag, so at the end of the record, we want to do a couple of things
if (xmlStreamReader.isEndElement() && xmlStreamReader.getLocalName().equalsIgnoreCase(ROWTAG)) {
// First, if we are at the first pass, we want to send out the elements inside the first record
// that we were collecting to dump *after* we got all the headers
if (firstPass) {
firstPass = false;
br.write('\n' + StringUtils.chop(firstItems.toString()));
}
// Then we set this off so that we no longer collect irrelevant data if it is present.
insideRowTag = false;
br.write('\n');
}
}
}
}
- "This definitely looks a lot cleaner! Thank you for the suggestion!" – John Lexus, Feb 1, 2020 at 10:41
I was somewhat curious whether Woodstox has improved, so I wrote a complete parser for your example data. It's in a different style than your code; complete repo: https://github.com/chhh/testing-woodstox-xml-parsing
My results with fake data records that I created:
Parsed 4,000,000 persons (1.36 GB) in 16.75 seconds (Ryzen5 3600), memory usage wasn't really significant.
First of all, there's a newer version of Woodstox on Maven Central.
Gradle dependency: implementation 'com.fasterxml.woodstox:woodstox-core:6.0.3'
They now have XMLStreamReader2 with a .configureForSpeed() option. I didn't really check what it does, but for my test it didn't do much.
I had to create fake data. You can make files of any size with FakeData.createHugeXml(Path path, int numEntries).
Just in case, here's the main parsing code, excluding the Person class (which is not very interesting and can be found here):
public class WoodstoxParser {
@FunctionalInterface
interface ConditionCallback {
boolean processXml(XMLStreamReader2 sr) throws XMLStreamException;
}
interface TagPairCallback {
void tagStart(String tagName, XMLStreamReader2 sr) throws XMLStreamException;
void tagContents(String tagName, StringBuilder sb);
}
public static void processUntilTrue(XMLStreamReader2 sr, ConditionCallback callback) throws XMLStreamException {
do {
if (callback.processXml(sr))
return;
} while (sr.hasNext() && sr.next() >= 0);
throw new IllegalStateException("xml document ended without callback returning true");
}
/** Main parsing function. **/
public static List<Person> parse(Path path) throws IOException, XMLStreamException {
XMLInputFactory2 f = (XMLInputFactory2) XMLInputFactory2.newFactory();
f.configureForSpeed();
// f.configureForLowMemUsage();
XMLStreamReader2 sr = null;
try (BufferedReader br = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
sr = (XMLStreamReader2) f.createXMLStreamReader(br);
// fast forward to beginning 'persons' tag (will throw if we don't find the tag at all)
processUntilTrue(sr, sr1 -> isTagStart(sr1, "persons"));
final List<Person> persons = new ArrayList<>(); // we've found the tag, so we can allocate storage for data
final StringBuilder sb = new StringBuilder(); // reuse a single string builder for all character aggregation
// now keep processing unless we reach closing 'persons' tag
processUntilTrue(sr, sr1 -> {
if (isTagEnd(sr1, "persons"))
return true;
if (isTagStart(sr1, "person")) {
// now we're finally reached a 'person', can start processing it
int idIndex = sr1.getAttributeInfo().findAttributeIndex("", "id");
Person p = new Person(Integer.parseInt(sr1.getAttributeValue(idIndex)));
sr1.next();
processUntilTrue(sr1, sr2 -> {
// processing the meat of a 'person' tag
// split it into a function of its own to not clutter the main loop
//return processPerson(sr2, p, sb);
if (isTagEnd(sr2, "person"))
return true; // we're done processing a 'person' only when we reach the ending 'person' tag
if (isTagStart(sr2))
processTagPair(sr2, sb, p);
return false;
});
// we've reached the end of a 'person'
if (p.isComplete()) {
persons.add(p);
} else {
throw new IllegalStateException("Whoa, a person had incomplete data");
}
}
return false;
});
return persons;
} finally {
if (sr != null)
sr.close();
}
}
public static void processTagPair(XMLStreamReader2 sr, StringBuilder sb, TagPairCallback callback) throws XMLStreamException {
final String tagName = sr.getLocalName();
callback.tagStart(tagName, sr); // let the caller do whatever they need with the tag name and attributes
sb.setLength(0); // clear our buffer, preparing to read the characters inside
processUntilTrue(sr, sr1 -> {
switch (sr1.getEventType()) {
case XMLStreamReader2.END_ELEMENT: // ending condition
callback.tagContents(tagName, sb); // let the caller do whatever they need with text contents of the tag
return true;
case XMLStreamReader2.CHARACTERS:
sb.append(sr1.getText());
break;
}
return false;
});
}
public static boolean isTagStart(XMLStreamReader2 sr, String tagName) {
return XMLStreamReader2.START_ELEMENT == sr.getEventType() && tagName.equalsIgnoreCase(sr.getLocalName());
}
public static boolean isTagStart(XMLStreamReader2 sr) {
return XMLStreamReader2.START_ELEMENT == sr.getEventType();
}
public static boolean isTagEnd(XMLStreamReader2 sr, String tagName) {
return XMLStreamReader2.END_ELEMENT == sr.getEventType() && tagName.equalsIgnoreCase(sr.getLocalName());
}
}
- "Thank you so much for the answer. I'm currently on vacation and get back in two or so weeks. Will take a closer look then - unless I get curious, which I probably will, and look at it before :)" – John Lexus, Feb 1, 2020 at 10:40
- I was in a similar position several years ago, needing to parse multi-gigabyte XML files. I tried all the standard solutions: Woodstox, Xerces, Piccolo, and so on - I can't remember all the names. I ended up using an XML parser from a library called Javolution. Its development stalled a while back, but the parser works well.
Available from Maven Central: https://search.maven.org/artifact/org.javolution/javolution-core-java/6.0.0/bundle
I got it to parse at about 1 GB/s with an SSD.
- A very old example of my usage (link to the line where the XML parser is instantiated): https://github.com/chhh/MSFTBX/blob/e53ae6be982e2de3123292be7d5297715bec70bb/MSFileToolbox/src/main/java/umich/ms/fileio/filetypes/mzml/MZMLMultiSpectraParser.java#L105
- Description of their XML package: https://github.com/javolution/javolution/blob/master/src/main/java/org/javolution/xml/package-info.java
If you're using an HDD without RAID, then you're most likely limited to 100-200 MB/s just by IO, so you likely can't do better than 1 GB in 5 seconds in that scenario.
The core thing for XML parsing speed (apart from just good IO code) is to not allocate unnecessary garbage: the parser should not be allocating Strings all the time just to do a comparison or hand you an array of a tag's attributes. Javolution does exactly that, using an internal sliding buffer and referencing into it - like a java.lang.CharSequence, called CharArray in Javolution. It's important to use CharArray#contentEquals() when comparing to Strings, to avoid extra String creation.
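To illustrate the principle (not Javolution's actual internals), here is a minimal sketch of comparing a region of a parser's char buffer against an expected tag name without materializing a String per comparison; regionEquals is a hypothetical stand-in for what CharArray#contentEquals() does against its internal buffer:

```java
public class BufferCompare {
    // Returns true if buf[off .. off+len) spells exactly the same
    // characters as s, without allocating a new String for the region.
    static boolean regionEquals(char[] buf, int off, int len, String s) {
        if (len != s.length()) return false;
        for (int i = 0; i < len; i++) {
            if (buf[off + i] != s.charAt(i)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Pretend this is the parser's internal sliding buffer;
        // the tag name "person" occupies indices 1..6.
        char[] buffer = "<person id=\"1\">".toCharArray();
        System.out.println(regionEquals(buffer, 1, 6, "person")); // true
        System.out.println(regionEquals(buffer, 1, 6, "salary")); // false
    }
}
```

A naive equivalent would be `new String(buf, off, len).equals(s)`, which allocates on every comparison - exactly the garbage this approach avoids in a hot parsing loop.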
- "According to xml.com/pub/a/2007/05/09/xml-parser-benchmarks-part-1.html (which is quite old), Javolution parsing performance is pretty much the same as Woodstox - fast, but not wildly faster than other parsers." – Michael Kay, Jan 31, 2020 at 9:50
- "@MichaelKay Just sharing my experience. I wasn't able to squeeze comparable performance out of Woodstox, but maybe I did something wrong somewhere." – Dmitry Avtonomov, Jan 31, 2020 at 19:27
Aside from the other suggestions, which make sense, one simple thing to try is the Aalto-xml parser from:
https://github.com/FasterXML/aalto-xml
It implements the Stax API (as well as the Stax2 extension and SAX), and as long as you do not need full DTD handling (which I suspect you don't) it has the feature set you need. For common read use cases I think it can be 30-40% faster; most importantly, it should be very easy to just try out.
Maven coordinates are:
- group id:
com.fasterxml
- artifact id:
aalto-xml
- version (latest): 1.2.2
The XMLInputFactory implementation is com.fasterxml.aalto.stax.InputFactoryImpl. I would recommend creating an instance directly instead of using XMLInputFactory.newInstance(), so you can be sure of the exact implementation you have (if you have multiple Stax implementations on the classpath, the choice is arbitrary).
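To make that concrete, a minimal sketch of instantiating the factory directly (assuming aalto-xml is on the classpath; the inline String input is just a stand-in for a real file):

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import com.fasterxml.aalto.stax.InputFactoryImpl;

public class AaltoDirect {
    public static void main(String[] args) throws XMLStreamException {
        // Instantiate Aalto's factory directly rather than going through
        // XMLInputFactory.newInstance(), so the implementation in use is
        // unambiguous even with several Stax providers on the classpath.
        XMLInputFactory f = new InputFactoryImpl();
        XMLStreamReader r = f.createXMLStreamReader(
                new StringReader("<person id=\"1\"/>"));
        while (r.hasNext()) {
            r.next(); // drive the reader through the document
        }
        r.close();
    }
}
```

Since Aalto implements the standard Stax interfaces, the rest of the parsing code above should work unchanged after this one-line swap.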