Parse XML data and update values of each node's children

Question 1

The code below iterates through the data nodes of an XML file and updates each node's output value by applying the regular expression stored in its rule child node. The nodes are selected with XPath expressions. The XML is included at the bottom.

Are there better alternatives to this approach? Would using LINQ be a good approach?

using System;
using System.Text.RegularExpressions;
using System.Xml;

namespace XMLParser
{
    class Program
    {
        static void Main()
        {
            string filename = "C:\\Users\\name\\train\\dev\\offer\\TestParsing.xml";
            XmlDocument xmlDoc = new XmlDocument();
            xmlDoc.Load(filename);
            XmlElement root = xmlDoc.DocumentElement;
            XmlNodeList nodes = root.SelectNodes("//offer/data");

            // Read in all 'data' nodes and perform functions
            foreach (XmlNode node in nodes)
            {
                // Reset per node so values from the previous node do not leak in
                string ocrString = "";
                string rule = "";
                // 0 selects the whole match unless a group is specified
                int groupInt = 0;

                // Cycle through inner nodes of main node and pull in values
                foreach (XmlNode xmlNode in node.ChildNodes)
                {
                    switch (xmlNode.Name)
                    {
                        case "ocrstring":
                            ocrString = xmlNode.InnerText;
                            break;
                        case "rule":
                            rule = xmlNode.InnerText;
                            break;
                        case "group":
                            // TryParse stores the parsed value itself and leaves 0 on failure
                            Int32.TryParse(xmlNode.InnerText, out groupInt);
                            break;
                    }
                }

                // No rule given because ocr works effectively
                if (rule.Length < 2) { continue; }

                // If ocrstring is empty try finding text in pdf
                if (String.IsNullOrWhiteSpace(ocrString))
                {
                    // TODO: Implement over full text doc <- ignore for now
                    continue; // nothing to match against yet
                }

                // Use the XML string
                Match match = Regex.Match(ocrString, rule);
                string output;
                if (match.Success && groupInt > 0 && groupInt < match.Groups.Count)
                {
                    output = match.Groups[groupInt].Value;
                }
                else
                {
                    output = match.Value.Trim();
                }

                // 'output' is a child of the current node, so no second document-wide lookup is needed
                XmlNode outputNode = node.SelectSingleNode("output");
                if (outputNode != null)
                {
                    outputNode.InnerText = String.IsNullOrEmpty(output) ? "NA" : output;
                }
            }

            xmlDoc.Save(filename); // Save once, after all nodes have been updated
            Console.WriteLine("Exiting...");
        }
    }
}

XML Data

<?xml version="1.0" encoding="utf-8"?>
<offer>
  <data id="Salary">
    <ocrstring>which is equal to $40,000.00 if working 40 hours per week</ocrstring>
    <rule>.*(([+-]?\$[0-9]{1,3}(?:,?[0-9]{3})*\.[0-9]{2}))</rule>
    <group>1</group>
    <output></output>
  </data>
  <data id="DefaultWeeklyHours">
    <ocrstring></ocrstring>
    <rule><![CDATA["(?<=working).*?(?=hours)"]]></rule>
    <output></output>
  </data>
  <data id="RelocationAttachment">
    <ocrstring>LongWindingRoad222</ocrstring>
    <rule>Regex2</rule>
    <output></output>
  </data>
</offer>

Question 2

LINQ to XML (or XLinq for short) would shift your code from imperative to more declarative. Is it a better approach? It depends. It might reduce the number of lines by being more concise, which can increase (or decrease) readability depending on the reader's skills and the complexity of the LINQ expressions. Will it be more performant? It depends. It might be faster, and it might be easier to parallelize. What are you really looking for?

Question 3

@PeterCsala Just suggestions, and to hear which approach is more performant. Your response answers my inquiry.

Question 4

If you define a model like this:

public class Data
{
    public string Id { get; set; }
    public string OCR { get; set; }
    public string Rule { get; set; }
    public string Output { get; set; }
}

then you could easily separate your ETL job's different stages.

For example the Extract phase would look like this:

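// Requires: using System.Linq; using System.Xml.Linq;
XDocument doc = XDocument.Parse(xml);
// Element names are case-sensitive: the XML uses <data>, not <Data>
var parsedData = from data in doc.Descendants("data")
                 select new Data()
                 {
                     Id = (string)data.Attribute("id"),
                     OCR = (string)data.Element("ocrstring"),
                     Rule = (string)data.Element("rule")
                 };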

In your Transform phase you could perform the regex-based transformations. The biggest gain here is that this phase is independent of any input or output format. It is just pure business logic.
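
A minimal sketch of what such a Transform step might look like. It assumes the Data model from this answer plus a hypothetical Group property (not part of the original model) to carry the optional group value from the question's XML. The names Transformer and Apply are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

// Same shape as the Data model above; Group is a hypothetical addition
// that carries the optional <group> value from the question's XML.
public class Data
{
    public string Id { get; set; }
    public string OCR { get; set; }
    public string Rule { get; set; }
    public int Group { get; set; }
    public string Output { get; set; }
}

public static class Transformer
{
    // Pure business logic: no XML and no file access.
    public static void Apply(Data item)
    {
        // Mirror the question's behaviour: skip items without a usable rule or text.
        if (item.Rule == null || item.Rule.Length < 2 || string.IsNullOrWhiteSpace(item.OCR))
        {
            return;
        }

        Match match = Regex.Match(item.OCR, item.Rule);
        // Groups[0] is the whole match; a group that did not match yields an empty string.
        string value = match.Success ? match.Groups[item.Group].Value.Trim() : "";
        item.Output = value.Length == 0 ? "NA" : value;
    }
}

public static class Program
{
    public static void Main()
    {
        var salary = new Data
        {
            Id = "Salary",
            OCR = "which is equal to $40,000.00 if working 40 hours per week",
            Rule = @".*(([+-]?\$[0-9]{1,3}(?:,?[0-9]{3})*\.[0-9]{2}))",
            Group = 1
        };
        Transformer.Apply(salary);
        Console.WriteLine(salary.Output);
    }
}
```

Because Apply only touches a Data object, it can be exercised without any XML file on disk.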

And finally, in your Load phase you could simply serialize the whole (modified) data collection. If the collection is too large for that, write logic that finds the appropriate element (based on the Id property) and overwrites only its output child element.
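
The second Load variant, overwriting only the output child of the matching element, could be sketched roughly as follows. The Loader class, the WriteOutputs method and the dictionary of results keyed by Id are assumptions made for illustration; they do not come from the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

public static class Loader
{
    // Writes each output value into the <data> element whose id attribute matches the key.
    public static void WriteOutputs(XDocument doc, IReadOnlyDictionary<string, string> outputsById)
    {
        foreach (XElement data in doc.Descendants("data"))
        {
            string id = (string)data.Attribute("id");
            if (id != null && outputsById.TryGetValue(id, out string output))
            {
                // SetElementValue updates <output> if it exists, otherwise it adds one.
                data.SetElementValue("output", output);
            }
        }
    }
}

public static class Program
{
    public static void Main()
    {
        XDocument doc = XDocument.Parse(
            "<offer><data id=\"Salary\"><output></output></data></offer>");
        var results = new Dictionary<string, string> { ["Salary"] = "$40,000.00" };

        Loader.WriteOutputs(doc, results);
        Console.WriteLine(doc);
    }
}
```

Elements whose id has no entry in the dictionary are left untouched, so a partial result set does not wipe existing output values.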

What you have gained here is a pretty nice separation of concerns.

Your read logic is not mixed with the processing logic.
Because of the separation, it is easier to spot where the application's bottleneck is (if any).
The input format can be changed without affecting the processing logic.
Pipeline-like processing can be introduced to improve performance by invoking the processing right after a Data object has been populated from the source.
Many other advantages. :)
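
The pipeline idea can be approximated with deferred execution: if both stages return lazy sequences, each Data object is transformed as soon as it has been read, without materializing the whole collection first. The sketch below is one possible shape under that assumption. The Pipeline class and its method names are hypothetical, and the transform is simplified to use the whole match only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public class Data
{
    public string Id { get; set; }
    public string OCR { get; set; }
    public string Rule { get; set; }
    public string Output { get; set; }
}

public static class Pipeline
{
    // Extract: deferred, so a Data object is only created when the consumer asks for it.
    public static IEnumerable<Data> Extract(XDocument doc) =>
        doc.Descendants("data").Select(d => new Data
        {
            Id = (string)d.Attribute("id"),
            OCR = (string)d.Element("ocrstring"),
            Rule = (string)d.Element("rule")
        });

    // Transform: also deferred; each item is processed right after it is pulled from Extract.
    public static IEnumerable<Data> Transform(IEnumerable<Data> items)
    {
        foreach (Data item in items)
        {
            if (!string.IsNullOrWhiteSpace(item.OCR) && !string.IsNullOrEmpty(item.Rule))
            {
                Match match = Regex.Match(item.OCR, item.Rule);
                item.Output = match.Success ? match.Value.Trim() : "NA";
            }
            yield return item;
        }
    }
}

public static class Program
{
    public static void Main()
    {
        XDocument doc = XDocument.Parse(
            "<offer><data id=\"Hours\"><ocrstring>working 40 hours</ocrstring><rule>[0-9]+</rule></data></offer>");

        foreach (Data item in Pipeline.Transform(Pipeline.Extract(doc)))
        {
            Console.WriteLine(item.Id + ": " + item.Output);
        }
    }
}
```

Whether this actually improves performance depends on the workload, so it is worth measuring before committing to the design.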

Question 5

I find using XDocument to be a lot simpler:

// Requires: using System.Text.RegularExpressions; using System.Xml.Linq;
var fileName = @"C:\Users\name\train\dev\offer\TestParsing.xml";
var document = XDocument.Load(fileName);
var offerData = document.Descendants("offer").Descendants("data");
foreach (var d in offerData)
{
    // Casting a missing element to string yields null, so guard before reading Length
    var rule = (string)d.Element("rule");
    if (rule == null || rule.Length < 2)
    {
        continue;
    }
    var ocrString = (string)d.Element("ocrstring");
    if (string.IsNullOrWhiteSpace(ocrString))
    {
        continue;
    }

    var match = Regex.Match(ocrString, rule);
    var result = "NA";
    if (match.Success)
    {
        // A missing <group> element casts to null, which falls back to group 0 (the whole match)
        var group = (int?)d.Element("group");
        result = match.Groups[group.GetValueOrDefault(0)].Value;
    }

    d.SetElementValue("output", result);
}
document.Save(fileName);

The logic is no longer obscured by the XML parsing and can be discerned more easily. All the parsing is done by just casting the elements to the desired type.
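
To illustrate the casting behaviour this relies on, here is a small sketch using a made-up element. XElement defines explicit conversions to string and to nullable value types that return null when the element is missing. The ElementReader helper and its ReadGroup method are hypothetical names used only for this example.

```csharp
using System;
using System.Xml.Linq;

public static class ElementReader
{
    // Returns the <group> value, or 0 (the whole match) when the element is absent.
    public static int ReadGroup(XElement data)
    {
        int? group = (int?)data.Element("group");
        return group.GetValueOrDefault(0);
    }
}

public static class Program
{
    public static void Main()
    {
        XElement withGroup = XElement.Parse(
            "<data id=\"Salary\"><rule>Regex2</rule><group>1</group></data>");
        XElement withoutGroup = XElement.Parse(
            "<data id=\"RelocationAttachment\"><rule>Regex2</rule></data>");

        // Present elements convert to their text content.
        string rule = (string)withGroup.Element("rule");
        // Missing elements convert to null instead of throwing.
        string ocr = (string)withGroup.Element("ocrstring");

        Console.WriteLine(rule);
        Console.WriteLine(ocr == null);
        Console.WriteLine(ElementReader.ReadGroup(withGroup));
        Console.WriteLine(ElementReader.ReadGroup(withoutGroup));
    }
}
```

Note that casting a missing element to a non-nullable type such as int throws instead, which is why the answer's code uses int? for the optional group element.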

Accepted answer (the Data model / ETL answer above): Peter Csala · 2020-06-23 13:33:43Z
