
I have to read a big XML file with a lot of information. Afterwards I extract the needed information (~20 points (columns) / ~80 relevant rows of data, some of them with sub-datasets) and write it out to an Excel file.

My question is how to handle the extraction (i.e. the removal of unused data) part:

  • Should I copy the whole file, delete the unused parts, and then write it to Excel?
  • Or is it a good approach to create objects for each column?
  • Should I write the whole XML to Excel and then start deleting rows in Excel?

What would be a performant and acceptable solution?

asked Sep 21, 2012 at 8:02
  • This is more like a Stack Overflow question. You could easily just export the data to CSV and do whatever you want with it afterwards. Commented Sep 21, 2012 at 8:15
  • CSV-formatted text files are an easier option to consider. Commented Sep 21, 2012 at 11:43

1 Answer


I'd say do the filtering as part of the processing. Programming in Excel is significantly more painful and limited than any server-side technology you could possibly be using.

CSV as an output format is much easier to work with than Excel proper, and virtually every programming language can output CSV without requiring any libraries (even writing your own CSV writer should be doable in about an hour or so). As long as you're only interested in the plain data, with no formulae, multi-sheet workbooks, or layout, CSV should be fine.
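For instance, the quoting rules are the only subtle part: a field needs double quotes only if it contains a comma, a quote, or a newline, and embedded quotes are doubled. A minimal sketch in Java (the class and method names are my own):

    import java.io.PrintWriter;
    import java.util.List;
    import java.util.stream.Collectors;

    public class SimpleCsvWriter {

        // Quote a field only when it contains a delimiter, quote, or newline;
        // embedded quotes are escaped by doubling them (RFC 4180 style).
        static String escape(String field) {
            if (field.contains(",") || field.contains("\"") || field.contains("\n")) {
                return "\"" + field.replace("\"", "\"\"") + "\"";
            }
            return field;
        }

        // Write one row: escape each field and join with commas.
        static void writeRow(PrintWriter out, List<String> row) {
            out.println(row.stream()
                    .map(SimpleCsvWriter::escape)
                    .collect(Collectors.joining(",")));
        }
    }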

Now, depending on the size of the XML input, I'd either:

a) Read the entire XML file into memory, parse it into a DOM tree, and use XPath to extract the information you want. If the transformation is nontrivial, consider using XSLT (you'll need a tiny bit of post-processing, because generating valid CSV with XSLT is unnecessarily complicated). Because the DOM tree needs to fit into RAM in its entirety, this is only doable for smaller XML files (say, up to the tens-of-megabytes range).
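A minimal sketch of (a) using the DOM and XPath support built into the JDK; the file name, element names, and XPath expression are all placeholders for whatever your document actually contains:

    import java.io.File;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class DomExtract {
        public static void main(String[] args) throws Exception {
            // Parse the whole file into an in-memory DOM tree.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new File("input.xml"));

            // Select only the rows you care about; "record" is a made-up element name.
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList rows = (NodeList) xpath.evaluate(
                    "/report/record", doc, XPathConstants.NODESET);

            for (int i = 0; i < rows.getLength(); i++) {
                // Pull the interesting columns out of each record ...
                String name = xpath.evaluate("name", rows.item(i));
                String amount = xpath.evaluate("amount", rows.item(i));
                // ... and emit a CSV line (escape fields as in the sketch above).
                System.out.println(name + "," + amount);
            }
        }
    }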

For larger documents:

b) Use a SAX parser to walk the document, outputting the relevant nodes as you go. This is a bit harder to write, because input and output are interleaved, but it has the advantage that memory requirements grow with the depth of the tree rather than its total size (that is, the SAX parser only keeps the path from the document root to the current node in memory, not the entire DOM).
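A sketch of (b) with the JDK's SAX parser; again, the element names (record, name, amount) are invented for illustration. Only the fields of the current record are held in memory at any point:

    import java.io.File;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class SaxExtract extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private boolean inRecord = false;
        private String name, amount;

        @Override
        public void startElement(String uri, String local, String qName, Attributes attrs) {
            text.setLength(0);                  // reset the character buffer
            if (qName.equals("record")) inRecord = true;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);     // collect the current element's text
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (!inRecord) return;
            switch (qName) {
                case "name":   name = text.toString(); break;
                case "amount": amount = text.toString(); break;
                case "record":                  // row complete: emit it and forget it
                    System.out.println(name + "," + amount);
                    inRecord = false;
                    break;
            }
        }

        public static void main(String[] args) throws Exception {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new File("input.xml"), new SaxExtract());
        }
    }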

answered Sep 21, 2012 at 8:35
