2
\$\begingroup\$

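The code below iterates through the `data` nodes of an XML file and, for each one, applies the regular expression in its `rule` child node to the text in `ocrstring`, then writes the result to the `output` node, which is located with an XPath expression. The XML is included at the bottom.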

Are there better alternatives to this approach? Would using LINQ be a good approach?

using System;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.XPath;
namespace XMLParser
{
    class Program
    {
        static void Main()
        {
            string ocrString = "";
            string rule = "";
            string output = "";
            string dataNodeIDValue = "";
            string dataNodeIDName = "";
            string xpathStr = "";
            Match match;
            int groupInt = 0;
            string filename = "C:\\Users\\name\\train\\dev\\offer\\TestParsing.xml";
            XmlDocument xmlDoc = new XmlDocument();
            xmlDoc.Load(filename);
            XmlElement root = xmlDoc.DocumentElement;
            XmlNodeList nodes = root.SelectNodes("//offer/data");
            XPathNavigator xnav = xmlDoc.CreateNavigator();

            // Read in all 'data' nodes and perform functions
            foreach (XmlNode node in nodes)
            {
                // Set to 0 so regex matches first match unless otherwise specified
                groupInt = 0;
                // Cycle through inner nodes of main node and pull in values
                foreach (XmlNode xmlNode in node.ChildNodes)
                {
                    switch (xmlNode.Name)
                    {
                        case "ocrstring":
                            ocrString = xmlNode.InnerText;
                            break;
                        case "rule":
                            rule = xmlNode.InnerText;
                            break;
                        case "group":
                            //groupInt = xmlNode.InnerText;
                            if (Int32.TryParse(xmlNode.InnerText, out groupInt)) { groupInt = Int32.Parse(xmlNode.InnerText); }
                            break;
                    }
                }
                // No rule given because ocr works effectively
                if (rule.Length < 2) { continue; }

                // If ocrstring is empty try finding text in pdf
                if (String.IsNullOrEmpty(ocrString) | String.IsNullOrWhiteSpace(ocrString)) // This is to iterate through pdf
                {
                    // TODO: Implement over full text doc <- ignore for now
                }
                else // This is to use XML string
                {
                    var regex = new Regex(rule);
                    match = regex.Match(ocrString);
                }
                //if (match.Groups.Count > 0) { };
                if (groupInt > 0 & match.Groups.Count > 0)
                {
                    output = match.Groups[groupInt].Value.ToString();
                }
                else
                {
                    output = match.Value.ToString().Trim();
                }
                dataNodeIDValue = node.Attributes[0].Value;
                dataNodeIDName = node.Attributes[0].Name;
                xpathStr = "//offer/data[@" + dataNodeIDName + "='" + dataNodeIDValue + "']/output";
                if (String.IsNullOrEmpty(output))
                {
                    root.SelectSingleNode(xpathStr).InnerText = "NA";
                }
                else
                {
                    root.SelectSingleNode(xpathStr).InnerText = output;
                }

                xmlDoc.Save(filename); // Save XML session back to file
            }
            Console.WriteLine("Exiting...");
        }
    }
}

XML Data

<?xml version="1.0" encoding="utf-8"?>
<offer>
    <data id="Salary">
        <ocrstring>which is equal to $40,000.00 if working 40 hours per week</ocrstring>
        <rule>.*(([+-]?\$[0-9]{1,3}(?:,?[0-9]{3})*\.[0-9]{2}))</rule>
        <group>1</group>
        <output></output>
    </data>
    <data id="DefaultWeeklyHours">
        <ocrstring></ocrstring>
        <rule><![CDATA["(?<=working).*?(?=hours)"]]></rule>
        <output></output>
    </data>
    <data id="RelocationAttachment">
        <ocrstring>LongWindingRoad222</ocrstring>
        <rule>Regex2</rule>
        <output></output>
    </data>
</offer>
Peilonrayz
asked Jun 22, 2020 at 19:58
\$\endgroup\$
2
  • 1
    \$\begingroup\$ Linq2XML (or XLINQ for short) would shift your code from imperative to more declarative. Is it a better approach? It depends. It might reduce the number of lines of code by being more concise. This can increase (or decrease) readability depending on the reader's skills and the LINQ expression's complexity. Will it be more performant? It depends. It might be faster, and it might be easier to move it into the world of parallelism. What are you really looking for? \$\endgroup\$ Commented Jun 23, 2020 at 6:16
  • \$\begingroup\$ @PeterCsala Just suggestions, and to hear which is more performant. Your response answers my inquiry. \$\endgroup\$ Commented Jun 23, 2020 at 11:32

2 Answers

4
\$\begingroup\$

If you define a model like this:

public class Data
{
    public string Id { get; set; }
    public string OCR { get; set; }
    public string Rule { get; set; }
    public string Output { get; set; }
}

then you could easily separate your ETL job's different stages.

For example, the Extract phase would look like this:

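// Requires: using System.Linq; using System.Xml.Linq;
XDocument doc = XDocument.Parse(xml);
// Element names are case-sensitive: the file uses <data>, not <Data>.
var parsedData = from data in doc.Descendants("data")
                 select new Data()
                 {
                     Id = (string)data.Attribute("id"),
                     OCR = (string)data.Element("ocrstring"),
                     Rule = (string)data.Element("rule")
                 };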

In your Transform phase you could perform the regex based transformations. The biggest gain here is that it is free from any input or output format. It is just pure business logic.
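The Transform phase could be sketched roughly as below. This is only an illustration: the `Group` property and the `Transformer.Transform` helper are assumptions that are not part of the model above, and the "NA" fallback simply mirrors what the question's code writes when nothing is found.

```csharp
using System;
using System.Text.RegularExpressions;

public class Data
{
    public string Id { get; set; }
    public string OCR { get; set; }
    public string Rule { get; set; }
    // Assumption: mirrors the optional <group> element; 0 means "whole match".
    public int Group { get; set; }
    public string Output { get; set; }
}

public static class Transformer
{
    // Hypothetical helper: applies item.Rule to item.OCR and stores the result in item.Output.
    public static Data Transform(Data item)
    {
        // Nothing to do without a rule or without text; leave Output untouched.
        if (string.IsNullOrWhiteSpace(item.Rule) || string.IsNullOrWhiteSpace(item.OCR))
        {
            return item;
        }

        Match match = Regex.Match(item.OCR, item.Rule);
        string value = match.Success && item.Group < match.Groups.Count
            ? match.Groups[item.Group].Value.Trim()
            : "";

        // An empty result is reported as "NA", as in the question's code.
        item.Output = value.Length > 0 ? value : "NA";
        return item;
    }
}
```

Because the method only sees `Data` objects, it can be unit tested without touching any XML.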

And finally in your Load phase you could simply serialize the whole (modified) data collection. Or if it is too large, then create logic to find the appropriate element (based on the Id property) and overwrite only the output child element.
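For the second option (overwriting only the `output` child), a minimal sketch might look like this. The `Loader.WriteOutputs` helper and the id-to-output dictionary are assumptions made for illustration; the results could equally be passed in as a list of `Data` objects.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class Loader
{
    // Hypothetical helper: for every <data> element whose id attribute has an
    // entry in outputsById, overwrite (or create) its <output> child.
    public static void WriteOutputs(XDocument doc, IReadOnlyDictionary<string, string> outputsById)
    {
        foreach (XElement data in doc.Descendants("data"))
        {
            var id = (string)data.Attribute("id");
            if (id != null && outputsById.TryGetValue(id, out var output))
            {
                data.SetElementValue("output", output);
            }
        }
    }
}
```

After calling it, a single `doc.Save(fileName)` writes the file once, instead of saving inside the loop.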


What you have gained here is a pretty nice separation of concerns.

  • Your read logic is not mixed with the processing logic.
  • Because of the separation, it is easier to spot where the application's bottleneck is (if any).
  • The input format can be changed without affecting the processing logic.
  • Pipeline-like processing can be introduced to improve performance by invoking the processing right after a Data object has been populated from the source.
  • Many other advantages. :)
answered Jun 23, 2020 at 13:33
\$\endgroup\$
2
\$\begingroup\$

I find using XDocument to be a lot simpler:

// Requires: using System.Text.RegularExpressions; using System.Xml.Linq;
var fileName = @"C:\Users\name\train\dev\offer\TestParsing.xml";
var document = XDocument.Load(fileName);
var offerData = document.Descendants("offer").Descendants("data");
foreach (var d in offerData)
{
    // The cast yields null when <rule> is missing, so check for that first.
    var rule = (string)d.Element("rule");
    if (rule == null || rule.Length < 2)
    {
        continue;
    }
    var ocrString = (string)d.Element("ocrstring");
    if (string.IsNullOrWhiteSpace(ocrString))
    {
        continue;
    }

    var match = Regex.Match(ocrString, rule);
    var result = "NA";
    if (match.Success)
    {
        // A missing <group> element casts to null, which falls back to group 0 (the whole match).
        var group = (int?)d.Element("group");
        result = match.Groups[group.GetValueOrDefault(0)].Value;
    }

    d.SetElementValue("output", result);
}
document.Save(fileName);

The logic is no longer obscured by the XML parsing and can be discerned more easily. All the parsing is done by just casting the elements to the desired type.
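To illustrate the casting point in isolation, here is a small sketch; `CastDemo.ReadGroup` is a made-up helper name. The explicit conversions on `XElement` accept a null element for nullable and reference types, so an absent `<group>` becomes null instead of throwing.

```csharp
using System.Xml.Linq;

public static class CastDemo
{
    // Returns the <group> value, or 0 when the element is absent.
    public static int ReadGroup(XElement data)
    {
        // Casting a missing (null) XElement to int? yields null rather than throwing.
        return ((int?)data.Element("group")).GetValueOrDefault(0);
    }
}
```

Note that the cast still throws a `FormatException` if the element is present but does not contain a valid integer.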

answered Jun 23, 2020 at 13:51
\$\endgroup\$
