Method encodes all less than (<) characters in a HTML string, but not the HTML tags

Question 1

I am cleaning a HTML string so that each less than sign (<) in the text is encoded as the HTML entity (&lt;), while the HTML tags are left as they are. For example, "<div>Total < 500</div>" should become "<div>Total &lt; 500</div>".
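To make the requirement concrete, here is a minimal sketch of what a whole-string encoder does to the same input. It uses the framework's System.Net.WebUtility.HtmlEncode; the class and method names (WholeStringEncodingDemo, EncodeEverything) are made up for illustration. Because the encoder treats the input as plain text, the tag brackets are encoded along with the stray less than sign, which is exactly what needs to be avoided here.

```csharp
using System;
using System.Net;

// Hypothetical demo class: shows why encoding the whole string is not an option.
public static class WholeStringEncodingDemo
{
    public static string EncodeEverything(string html)
    {
        // WebUtility.HtmlEncode treats its input as plain text,
        // so the brackets of real tags are encoded too.
        return WebUtility.HtmlEncode(html);
    }

    public static void Main()
    {
        Console.WriteLine(EncodeEverything("<div>Total < 500</div>"));
        // prints: &lt;div&gt;Total &lt; 500&lt;/div&gt;
    }
}
```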

There are a number of posts addressing this, including:

https://stackoverflow.com/questions/5464686/html-entity-encode-text-only-not-html-tag
https://stackoverflow.com/questions/2245066/encode-html-in-asp-net-c-sharp-but-leave-tags-intact
https://stackoverflow.com/questions/28301206/how-to-encode-special-characters-in-html-but-exclude-tags

Each post points to the Html Agility Pack, and specifically to the HtmlEntity.Entitize(html) method. This doesn't work for me because it ignores the < and & signs: using the example above, the output was the same as the input. I may be missing the point of how to use this method, but the code was simply this:

public static string EntitizeHtml( this string html )
{
    return HtmlEntity.Entitize( html );
}

I decided to write my own method to find the less than symbols and convert them to the HTML entity equivalent. It works, but it seems very clunky (a loop inside a loop and lots of string manipulation). I avoided Regex to keep it faster and more readable. I fully appreciate that manipulating HTML strings is often fraught with problems thanks to differing HTML versions and malformed HTML, but for my purpose this method solves my problem based on a discrete set of HTML tags. The following method will convert "<div>Total < 500</div>" to "<div>Total &lt; 500</div>" as expected.

If anyone can improve the efficiency of this method I'd be very grateful.

public static string EncodeLessThanEntities( this string html )
{
    if( !html.Contains( "<" ) )
    {
        return html;
    }
    // get the full html length
    int length = html.Length;
    // set a limit on the tag string to compare to reduce the
    // string size (i.e. the longest tags are <center> and <strong>)
    int tagLength = 6;
    // gather all the included html tags to check
    string s = "div|span|p|br|ol|ul|li|center|font|strong|em|sub|sup";
    string[] tags = s.Split( '|' ).ToArray( );
    // find all the indices of the less than entity or tag
    var indices = AllIndexesOf( html, "<" );
    // initiate a list of indices to be replaced
    var replaceable = new List<int>( );
    // loop through the indices
    foreach( var index in indices )
    {
        // store the potential tag (up to the tag length)
        if( length - ( index + 1 ) < tagLength ) tagLength = length - ( index + 1 );
        string possibleTag = html.Substring( index + 1, tagLength );
        // automatically ignore any closing tags
        if( possibleTag.Substring( 0, 1 ) == "/" )
        {
            continue;
        }
        bool match = false;
        // loop through each html tag to find a match
        foreach( var tag in tags )
        {
            if( possibleTag.StartsWith( tag ) )
            {
                match = true;
                break;
            }
        }
        if( !match )
        {
            // if there is no match to a html tag, store the index
            replaceable.Add( index );
        }
    }
    if( replaceable?.Any( ) ?? false )
    {
        // loop through each index and replace the '<' with '&lt;'
        foreach( var index in Enumerable.Reverse( replaceable ) )
        {
            html = html.ReplaceAt( index, 1, "&lt;" );
        }
    }
    return html;
}

public static List<int> AllIndexesOf( this string input, string value )
{
    List<int> indexes = new List<int>( );
    if( string.IsNullOrEmpty( value ) )
    {
        return indexes;
    }
    for( int index = 0; ; index += value.Length )
    {
        index = input.IndexOf( value, index );
        if( index == -1 )
        {
            return indexes;
        }
        indexes.Add( index );
    }
}

public static string ReplaceAt( this string str, int index, int length, string replace )
{
    return str.Remove( index, Math.Min( length, str.Length - index ) ).Insert( index, replace );
}

I'm hoping there is a way to make the above method more efficient. Or maybe this is as good as it gets? I know there are a lot of very smart guys out there with a lot more experience and I don't include myself in that group!

Question 2

Welcome to CodeReview! Have you tested your implementation against other inputs as well?

Question 3

Did you try HtmlEntity.Entitize(html, true, true), e.g. on the inner HTML of the tag pair? By the way, <div>Total < 500</div> isn't valid HTML; it is probably the result of a de-entitizing problem. It's better to fix such HTML at the source first.

Question 4

What about the a tag, or img, or maybe table, thead, tbody, th, tr, td, b, i, and finally html, head, body, meta, title, script, style and other tags? Also, the solution isn't compatible with HTML5, because any single word can be a valid HTML tag, e.g. <aepot>some text</aepot> is valid HTML5.

Question 5

Hi guys, thanks for the welcome! I've tested this extensively against many inputs, and I have now fixed the source HTML from this point on. The issue is that I am dealing with a few instances of malformed HTML from historical records. The set of tags I am dealing with is discrete and comes from a WYSIWYG editor, so these are the only tags I need to evaluate.

Question 6

Thanks @aepot. It's VS 2019, .NET 4.7.2 and C# 7.3.

Answer

Every time you call Substring, Remove, Replace or any other method that "modifies" a string, you create a new string instance with a new memory allocation, because strings are immutable. All string operations are slow and expensive relative to numeric ones. That's OK as long as you keep this in mind and accept the cost.
html.Contains, AllIndexesOf, foreach (var index in indices): that's about 2.5 scans of the same string.
possibleTag.Substring(0, 1) == "/": why not possibleTag[0] == '/'? It would be faster.
if (replaceable?.Any() ?? false): 1) replaceable cannot be null, so if (replaceable.Any()) is almost fine; 2) but it only checks whether the List contains any elements, so if (replaceable.Count > 0) is better; 3) but foreach does nothing for an empty collection, and even the Enumerable.Reverse call is fine for an empty collection, so you can drop the if statement completely.
The fastest way to construct a string from data in .NET Framework is StringBuilder. (.NET Core and newer .NET versions have the Span-based method string.Create, which is faster in some cases.)
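As a rough illustration of the last three points, the sketch below applies them to small pieces of the original method in isolation. The names ReviewPointsDemo, IsClosingTag and ReplaceLessThanAt are invented for this example and are not part of the original code; the added length check in IsClosingTag is also an assumption, there to avoid indexing into an empty string.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical helper class illustrating the review points above.
public static class ReviewPointsDemo
{
    // Index the first char directly instead of allocating a one-char substring.
    public static bool IsClosingTag(string possibleTag)
    {
        return possibleTag.Length > 0 && possibleTag[0] == '/';
    }

    // No emptiness guard: foreach simply does nothing for an empty list.
    // A single StringBuilder replaces the repeated string Remove/Insert calls.
    public static string ReplaceLessThanAt(string html, List<int> replaceable)
    {
        var sb = new StringBuilder(html);
        // Work from the highest index down so earlier indices stay valid.
        foreach (var index in Enumerable.Reverse(replaceable))
        {
            sb.Remove(index, 1).Insert(index, "&lt;");
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(IsClosingTag("/div>"));                                   // True
        Console.WriteLine(ReplaceLessThanAt("a < b", new List<int>()));             // a < b
        Console.WriteLine(ReplaceLessThanAt("a < b < c", new List<int> { 2, 6 }));  // a &lt; b &lt; c
    }
}
```

This keeps the two-pass structure of the original (collect indices, then replace), so it only removes allocations; it does not reduce the number of scans.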

And finally, here's my version of the implementation:

// needs: using System.Linq; using System.Text; and an enclosing static class

// required array can be created once per application start
private static readonly string[] tags = "div span p br ol ul li center font strong em sub sup".Split();

public static string EncodeLessThanEntities(this string html)
{
    if (html.Length < 8) // the shortest valid html is <p></p>: 7 chars but there's no room for other chars
        return html;

    // spawn StringBuilder with initial capacity, this will reduce amount of memory allocations
    StringBuilder sb = new StringBuilder(html.Length);
    int i;
    for (i = 0; i < html.Length - 2; i++)
    {
        // a '<' is kept only when a known tag name follows it (after an optional '/')
        if (html[i] == '<' && !tags.Any(tag => html.StartsWithAt(i + (html[i + 1] == '/' ? 2 : 1), tag)))
            sb.Append("&lt;");
        else
            sb.Append(html[i]);
    }
    // 'i' has value 'html.Length - 2' here, append two last chars without changes
    return sb.Append(html[i]).Append(html[i + 1]).ToString();
}

// same as `String.StartsWith` but accepts a start index
public static bool StartsWithAt(this string text, int startIndex, string value)
{
    if (text.Length - startIndex < value.Length)
        return false;
    for (int i = 0; i < value.Length; i++)
    {
        if (text[startIndex + i] != value[i])
            return false;
    }
    return true;
}

I haven't tested it extensively, but you may want to.
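For anyone who wants to do that, here is a small self-contained harness as a starting point. It wraps a copy of the implementation above in a class named LessThanEncoder (a made-up name) and runs a handful of inputs chosen for illustration. Note the third case: strings shorter than 8 characters are returned unchanged by design, so a bare "a < b" is not encoded.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical wrapper class; the two methods are copied from the answer above.
public static class LessThanEncoder
{
    private static readonly string[] tags =
        "div span p br ol ul li center font strong em sub sup".Split();

    public static string EncodeLessThanEntities(this string html)
    {
        if (html.Length < 8)
            return html;
        StringBuilder sb = new StringBuilder(html.Length);
        int i;
        for (i = 0; i < html.Length - 2; i++)
        {
            if (html[i] == '<' && !tags.Any(tag => html.StartsWithAt(i + (html[i + 1] == '/' ? 2 : 1), tag)))
                sb.Append("&lt;");
            else
                sb.Append(html[i]);
        }
        return sb.Append(html[i]).Append(html[i + 1]).ToString();
    }

    public static bool StartsWithAt(this string text, int startIndex, string value)
    {
        if (text.Length - startIndex < value.Length)
            return false;
        for (int i = 0; i < value.Length; i++)
        {
            if (text[startIndex + i] != value[i])
                return false;
        }
        return true;
    }
}

public static class EncoderHarness
{
    // Prints PASS or FAIL for one input/expected pair.
    private static void Check(string input, string expected)
    {
        string actual = input.EncodeLessThanEntities();
        Console.WriteLine((actual == expected ? "PASS " : "FAIL ") + input + " -> " + actual);
    }

    public static void Main()
    {
        Check("<div>Total < 500</div>", "<div>Total &lt; 500</div>");
        Check("<em>x<y</em>", "<em>x&lt;y</em>");
        Check("a < b", "a < b"); // shorter than 8 chars: returned unchanged
        Check("<p>no stray signs</p>", "<p>no stray signs</p>");
    }
}
```

Because the tag check is a prefix match, a stray < followed directly by text that starts with a listed tag name (for example "<p" in "x <pH") would be treated as a tag; that may be worth adding as a case if such text can occur in the historical records.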

Question 8

Thanks @aepot. This is brilliant! Not only does it work well, your explanation has given me a much greater insight and appreciation for writing better code. Thank you so much for taking the time to improve my code.

aepot · Accepted Answer · 2021-03-26 15:53:58Z
