I am cleaning an HTML string to encode the less-than sign (<) as an HTML entity (&lt;) while leaving the HTML tags as they are. An example is converting "<div>Total < 500</div>" to "<div>Total &lt; 500</div>".
There are a number of posts addressing this, including:
https://stackoverflow.com/questions/5464686/html-entity-encode-text-only-not-html-tag
https://stackoverflow.com/questions/2245066/encode-html-in-asp-net-c-sharp-but-leave-tags-intact
https://stackoverflow.com/questions/28301206/how-to-encode-special-characters-in-html-but-exclude-tags
Each post points to using the Html Agility Pack, specifically the HtmlEntity.Entitize(html) method. This doesn't work for me because it ignores the < and & signs: using the above example, the output was the same as the input! I may be missing the point of how to use this method, but the code was simply this:
public static string EntitizeHtml( this string html )
{
    return HtmlEntity.Entitize( html );
}
I decided to write my own method to find the less-than symbols and convert them to the HTML entity equivalent. It works, but it seems very clunky (a loop inside a loop and lots of string manipulation). I avoided Regex to keep it faster and more readable. I fully appreciate that manipulating HTML strings is often fraught with problems thanks to differing HTML versions and malformed HTML, but for my purpose this method solves my problem based on a discrete set of HTML tags. The following method will convert "<div>Total < 500</div>" to "<div>Total &lt; 500</div>" as expected.
If anyone can improve the efficiency of this method I'd be very grateful.
public static string EncodeLessThanEntities( this string html )
{
    if( !html.Contains( "<" ) )
    {
        return html;
    }
    // get the full html length
    int length = html.Length;
    // set a limit on the tag string to compare to reduce the
    // string size (i.e. the longest tags are <center> and <strong>)
    int tagLength = 6;
    // gather all the included html tags to check
    string s = "div|span|p|br|ol|ul|li|center|font|strong|em|sub|sup";
    string[] tags = s.Split( '|' ).ToArray( );
    // find all the indices of the less than entity or tag
    var indices = AllIndexesOf( html, "<" );
    // initiate a list of indices to be replaced
    var replaceable = new List<int>( );
    // loop through the indices
    foreach( var index in indices )
    {
        // store the potential tag (up to the tag length)
        if( length - ( index + 1 ) < tagLength ) tagLength = length - ( index + 1 );
        string possibleTag = html.Substring( index + 1, tagLength );
        // automatically ignore any closing tags
        if( possibleTag.Substring( 0, 1 ) == "/" )
        {
            continue;
        }
        bool match = false;
        // loop through each html tag to find a match
        foreach( var tag in tags )
        {
            if( possibleTag.StartsWith( tag ) )
            {
                match = true;
                break;
            }
        }
        if( !match )
        {
            // if there is no match to a html tag, store the index
            replaceable.Add( index );
        }
    }
    if( replaceable?.Any( ) ?? false )
    {
        // loop through each index and replace the '<' with '&lt;'
        foreach( var index in Enumerable.Reverse( replaceable ) )
        {
            html = html.ReplaceAt( index, 1, "&lt;" );
        }
    }
    return html;
}
public static List<int> AllIndexesOf( this string input, string value )
{
    List<int> indexes = new List<int>( );
    if( string.IsNullOrEmpty( value ) )
    {
        return indexes;
    }
    for( int index = 0; ; index += value.Length )
    {
        index = input.IndexOf( value, index );
        if( index == -1 )
        {
            return indexes;
        }
        indexes.Add( index );
    }
}
public static string ReplaceAt( this string str, int index, int length, string replace )
{
    return str.Remove( index, Math.Min( length, str.Length - index ) ).Insert( index, replace );
}
I'm hoping there is a way to make the above method more efficient. Or maybe this is as good as it gets? I know there are a lot of very smart guys out there with a lot more experience and I don't include myself in that group!
1 Answer
Every time you call Substring, Remove, Replace or any other method which modifies the string, you create a new string instance with a new memory allocation, because string is immutable. All string operations are slow/expensive relative to numeric ones. But that's OK if you keep this in mind and accept it.
html.Contains, AllIndexesOf, foreach (var index in indices): that's 2.5 scans of the same string.
possibleTag.Substring(0, 1) == "/": why not possibleTag[0] == '/'? It would be faster.
if (replaceable?.Any() ?? false): 1) replaceable cannot be null, so if (replaceable.Any()) is almost fine; 2) but it just checks whether the List contains any elements, so if (replaceable.Count > 0) is better; 3) but foreach will not process the collection if it's empty, and even the Enumerable.Reverse call is fine for an empty collection. Thus you may remove the if statement completely.
The fastest way to construct a string from data in .NET Framework is StringBuilder. (.NET Core and newer .NET have the Span-based method string.Create, which is faster in some cases.)
And finally, here's my version of the implementation
// requires: using System.Linq; using System.Text;
// required array can be created once per application start
private static readonly string[] tags = "div span p br ol ul li center font strong em sub sup".Split();
public static string EncodeLessThanEntities(this string html)
{
    if (html.Length < 8) // the shortest valid html is <p></p>: 7 chars but there's no room for other chars
        return html;
    StringBuilder sb = new StringBuilder(html.Length); // spawn StringBuilder with initial capacity, this will reduce amount of memory allocations
    int i;
    for (i = 0; i < html.Length - 2; i++)
    {
        if (html[i] == '<' && !tags.Any(tag => html.StartsWithAt(i + (html[i + 1] == '/' ? 2 : 1), tag)))
            sb.Append("&lt;");
        else
            sb.Append(html[i]);
    }
    // 'i' has value 'html.Length - 2' here, append two last chars without changes
    return sb.Append(html[i]).Append(html[i + 1]).ToString();
}
// same as `String.StartsWith` but accepts a start index
public static bool StartsWithAt(this string text, int startIndex, string value)
{
    if (text.Length - startIndex < value.Length)
        return false;
    for (int i = 0; i < value.Length; i++)
    {
        if (text[startIndex + i] != value[i])
            return false;
    }
    return true;
}
I didn't test it a lot but you may.
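For anyone who does want to test it: as written, the version above returns inputs shorter than 8 characters unchanged and copies the last two characters as they are, so "a < b" or a trailing "<" would not be encoded. Below is one possible variant, not the code above, that examines every position instead. The names LessThanEncoder, EncodeLessThan and IsKnownTagAt are assumptions made for this sketch, and the tag list is the same whitelist.

```csharp
using System.Text;

// Hypothetical variant for illustration; not the answer's exact implementation.
public static class LessThanEncoder
{
    // Same tag whitelist as used in the answer.
    private static readonly string[] Tags =
        "div span p br ol ul li center font strong em sub sup".Split(' ');

    public static string EncodeLessThan(string html)
    {
        if (string.IsNullOrEmpty(html))
            return html;

        var sb = new StringBuilder(html.Length);
        for (int i = 0; i < html.Length; i++)
        {
            // Encode '<' only when no whitelisted tag name follows it.
            if (html[i] == '<' && !IsKnownTagAt(html, i + 1))
                sb.Append("&lt;");
            else
                sb.Append(html[i]);
        }
        return sb.ToString();
    }

    // True when a whitelisted tag name, optionally preceded by '/', starts at 'start'.
    private static bool IsKnownTagAt(string html, int start)
    {
        if (start < html.Length && html[start] == '/')
            start++;

        foreach (string tag in Tags)
        {
            if (StartsWithAt(html, start, tag))
                return true;
        }
        return false;
    }

    // Like String.StartsWith, but compares from a given start index.
    private static bool StartsWithAt(string text, int startIndex, string value)
    {
        if (text.Length - startIndex < value.Length)
            return false;
        for (int i = 0; i < value.Length; i++)
        {
            if (text[startIndex + i] != value[i])
                return false;
        }
        return true;
    }
}
```

Under these assumptions, EncodeLessThan("<div>Total < 500</div>") gives "<div>Total &lt; 500</div>" and EncodeLessThan("a < b") gives "a &lt; b". The prefix match is kept from the original, so any text that follows "<" and begins with a whitelisted name (for example "<price", because of "p") is still treated as a tag.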
Thanks @aepot. This is brilliant! Not only does it work well, your explanation has given me a much greater insight and appreciation for writing better code. Thank you so much for taking the time to improve my code. (Scho, Mar 27, 2021 at 15:51)
HtmlEntity.Entitize(html, true, true); e.g. for inner HTML of the tag pair? Btw, <div>Total < 500</div> isn't valid HTML, probably a de-entitizing problem. It's better to fix such HTML source first.
a tag, or img, maybe table?... thead, tbody, th, tr, td, b, i, finally html, head, body, meta, title, script, style, other tags? Also, the solution at least isn't compatible with HTML5, because any single word can be a valid HTML tag, e.g. <aepot>some text</aepot> is valid HTML5.