Return to Answer

replaced http://codereview.stackexchange.com/ with https://codereview.stackexchange.com/

Source Link

edited Apr 13, 2017 at 12:40

Community Bot

edited Apr 13, 2017 at 12:40

Community Bot

With the IsKey() method I go with @mjolka @mjolka and also you prefer loops I will suggest to use Any() like

With the IsKey() method I go with @mjolka and also you prefer loops I will suggest to use Any() like

Source Link

answered Mar 5, 2015 at 7:42

Heslacher

answered Mar 5, 2015 at 7:42

Heslacher

50.9k
5
83
177

With the IsKey() method I go with @mjolka and also you prefer loops I will suggest to use Any() like

private static bool IsKey(HtmlNode span)
{
 var styleSheet = ExtractStyle(span);
 bool isKey = styleSheet.StyleRules[0].Declarations
 .Any(cssAttribute.Name == "font-weight" && cssAttribute.Term.ToString() == "bold");
 
 return isKey;
}

because it really is the same and it is more obvious what is meant.

In the GetIndentationFromNode() method, there should be only one cssAttribute with the name left so you should break out of the loop after you have found the value.

You should let your variables some space to breathe. Instead of e.g decimal thisIndent=0; you better should write decimal thisIndent = 0;, because this will make your code more readable.

In the ParseHtmlToObjects() method there is no need to call ToList() on the IEnumerable<> here page.Descendants().Where(x => (x.Name == "span")).ToList();.

Always code against interfaces if the implementation isn't needed.

If step isn't a key you can by adding a continue; omit the else and therefor save horizontal spacing.

By removing the if (level == currentObject.level) condition you can shorten the code.

By ectracting the analyzing of the Descendants of the current page this method can be shortened and be more readable.

By extracting the while loop to equalize the level to a meaningful method, you can omit the comment also this is only a cosmetic change.

The check if (key != null) could be omitted, because if key == null the AddProperty() method just returns the passed in DataObject.

Applying these points lead to

private static DataObject ParseHtmlToObjects(IEnumerable<HtmlNode> pages)
{
 var rootObject = new DataObject();
 var currentObject = rootObject;
 foreach (var page in pages)
 {
 var steps = page.Descendants().Where(x => (x.Name == "span")).ToList();
 currentObject = AnalyzeSpanTags(steps, currentObject);
 }
 return rootObject;
}
private static DataObject AnalyzeSpanTags(IEnumerable<HtmlNode> steps, DataObject currentObject)
{
 string key = null;
 foreach (var step in steps)
 {
 if (!IsKey(step))
 {
 // If this is not a key, the key was detected before. Use it to populate the object
 currentObject = AddProperty(currentObject, key, GetTextFromSpan(step));
 key = null;
 continue;
 }
 // Special case: Maybe we detected a new key, although the old key has not been used as property yet
 // This can happen for keys without value, so add it empty.
 currentObject = AddProperty(currentObject, key, "");
 key = GetKeyFromNode(step);
 var level = GetIndentationFromNode(step);
 if (level > currentObject.level)
 {
 // Decend to lower level: create a new child
 var child = new DataObject { level = level, Parent = currentObject };
 currentObject.Children.Add(child);
 currentObject = child;
 }
 else
 {
 currentObject = EqualizeLevel(currentObject, level);
 }
 }
 return currentObject;
}
private static DataObject EqualizeLevel(DataObject obj, decimal level)
{
 while (level < obj.level)
 {
 obj = obj.Parent;
 }
 return obj;
}

Speaking about comments. Comments should describe why something is done. You should let the code speak for itself about what is done.

So comments like

// Go through all pages
foreach (var page in pages)

are superfluous, because they don't add any value.

By introducing a GetAddedSiblingIfKeyExists() method (the name is not that optimal, but I couldn't come up with a better one) like

private static DataObject GetAddedSiblingIfKeyExists(DataObject obj, string key)
{
 if (key == null || !obj.Properties.ContainsKey(key)) { return obj; }
 var sibling = new DataObject { level = obj.level, Parent = obj.Parent };
 obj.Parent.Children.Add(sibling);
 return sibling;
}

the AddProperty() method would result in

private static DataObject AddProperty(DataObject obj, string key, string value)
{
 // Special case: <Span> which contains the page information. Skip it.
 if (key == null) return obj;
 obj.Properties.Add(key, value);
 return obj;
}

which would change the AnalyzeSpanTags() method like

private static DataObject AnalyzeSpanTags(IEnumerable<HtmlNode> steps, DataObject currentObject)
{
 string key = null;
 foreach (var step in steps)
 {
 currentObject = GetAddedSiblingIfKeyExists(currentObject, key);
 if (!IsKey(step))
 {
 .....

maybe GetAddedSiblingForExistingKey() would be a little bit more meaningful.

 /// <summary>
 /// Converts the HTML input file into JSON and writes the output file
 /// </summary>
 public void Convert()

the xml documentation clearly states that this method is doing to much.

The method is reading the htmlfile, converting the content to JSON and writing the JSON to the output file.
I wouldn't expect that a method named Convert() is writing to a file.

lang-cs