Method for returning valid URLs from a sitemap URL

Question 1

I need a method which fetches all of the text from a URL (generally a sitemap URL), and returns an IEnumerable of all valid URLs contained in the text returned from the initial address. What I have so far is:

public IEnumerable<Uri> GetSitemapUrls(Uri sitemapUrl)
{
 var sitemapText = GetSitemapText(sitemapUrl);
 if (string.IsNullOrWhiteSpace(sitemapText))
 yield break;
 var urls = new List<string>();
 var urlRegex = new Regex(@"\b(?:https?://|www\.)[^ \f\n\r\t\v\]]+\b", RegexOptions.Compiled | RegexOptions.IgnoreCase);
 foreach (Match m in urlRegex.Matches(sitemapText))
 urls.Add(CleanUriString(m.Value));
 foreach (var url in urls)
 {
 var cleanedUriString = CleanUriString(url);
 if (Uri.IsWellFormedUriString(cleanedUriString, UriKind.RelativeOrAbsolute))
 yield return new Uri(cleanedUriString);
 }
}
string GetSitemapText(Uri sitemapUri)
{
 var wc = new WebClient
 {
 Encoding = System.Text.Encoding.UTF8
 };
 return wc.DownloadString(sitemapUri);
}
string CleanUriString(string dirtyUriString)
{
 var legalCharacters = @"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&'()*+,;=".ToCharArray();
 var cleanedString = dirtyUriString;
 foreach (var character in dirtyUriString)
 {
 var matchIndex = dirtyUriString.IndexOf(character);
 if (!legalCharacters.Any(x => x.Equals(character)) && matchIndex > 0)
 cleanedString = dirtyUriString.Substring(0, matchIndex);
 }
 return cleanedString;
}

It seems to work as intended for actual sitemaps, null/empty responses from the URL it receives, and URLs with URL-illegal characters in them. I have a feeling I'm missing out on potential issues or letting bad URLs through anyway, not to mention I haven't thought about spidering through URLs that the initial sitemap (if it is a sitemap) returns.

Is there anything I can do to improve it?

Question 2

I have to say I cannot figure out how the CleanUriString works. It looks for a char from dirtyUriString inside the same string. This means that matchIndex will never be negative :-| Could you help me out with that? ;-) Maybe an example?

Question 3

Not a full answer, but if you're actually pointing this at an XML formatted sitemap, you should parse it as XML instead of regex.

Question 4

inside your GetSiteMapUrls Method you clean the Urls twice, and I don't see a reason for this.

foreach (Match m in urlRegex.Matches(sitemapText))
 urls.Add(CleanUriString(m.Value));
foreach (var url in urls)
{
 var cleanedUriString = CleanUriString(url);
 if (Uri.IsWellFormedUriString(cleanedUriString, UriKind.RelativeOrAbsolute))
 yield return new Uri(cleanedUriString);
}

before you add the URL to the urls list you run them through the CleanUriString() method

then you traverse the list that you just made and run the urls through the same method before returning the new uri.

This is redundant.

instead you could just use a single foreach loop and return the list of good urls from it, like this:

public IEnumerable<Uri> GetSitemapUrls(Uri sitemapUrl)
{
 var sitemapText = GetSitemapText(sitemapUrl);
 if (string.IsNullOrWhiteSpace(sitemapText))
 yield break;
 var urlRegex = new Regex(@"\b(?:https?://|www\.)[^ \f\n\r\t\v\]]+\b", RegexOptions.Compiled | RegexOptions.IgnoreCase);
 foreach (Match m in urlRegex.Matches(sitemapText))
 {
 var clean = CleanUriString(m.Value); 
 if (Uri.IsWellFormedUriString(clean, UriKind.RelativeOrAbsolute))
 yield return new Uri(clean);
 }
}

and personally I like Braces on my If's and Loops

Question 5

Just a couple of thougths...

if (string.IsNullOrWhiteSpace(sitemapText))
 yield break;

This is a very unusual handling of null/empty argument values and should be documented. A more meaniningful name like GetSitemapUrlsOrDefault would also help. Without it the expected way is to throw an ArgumentException.

foreach (Match m in urlRegex.Matches(sitemapText))
 urls.Add(CleanUriString(m.Value));
foreach (var url in urls)
{
 var cleanedUriString = CleanUriString(url);
 if (Uri.IsWellFormedUriString(cleanedUriString, UriKind.RelativeOrAbsolute))
 yield return new Uri(cleanedUriString);
}

You don't need all this. Instead you can build a nice query that does everything in one go and in a single run:

return 
 urlRegex
 .Matches(sitemapText)
 .Cast<Match>()
 .Select(m => CleanUriString(m.Value))
 .Where(cleanedUriString => Uri.IsWellFormedUriString(cleanedUriString, UriKind.RelativeOrAbsolute))
 .Select(cleanedUriString => new Uri(cleanedUriString));

or alternatively:

return
 from m in urlRegex.Matches(sitemapText).Cast<Match>()
 let cleanedUriString = CleanUriString(m.Value)
 where Uri.IsWellFormedUriString(cleanedUriString, UriKind.RelativeOrAbsolute)
 select new Uri(cleanedUriString);

You don't need the .ToCharArray(); on the legalCharacters string. A string is already a char-array.

Oh, and of course the WebClient needs to be disposed. Wrapping it with a using would be good.

Question 6

The intention of CleanUriString seems to be to return a new string containing only characters from the legalCharacters string.

If CleanUriString is passed a value containing more than one character not in legalCharacters it will return a substring of all the characters up to the last illegal character including any prior illegal characters.

Example:

CleanUriString("asdøæå") //Result = asdøæ

I would suggest you write a thorough specification of what you expect CleanUriString to do.

Should it strip all illegal characters? Cut the string off at the first illegal character? Outright reject any string with illegal characters?

When that is done rewrite the method step-by-step according to the specification.

Question 7

This also has the (bizarre) side-effect of it only finding the first value of the character that doesn't match. So CleanUriString("ABCD<ABCD>ABCD>") and CleanUriString("ABCD<ABCD>ABCD<") return completely different results, because they substring to the first instance of the illegal character, not to the current index, as does calling it a second time on CleanUriString("ABCD<ABCD>ABCD>").

Malachi Malachi 29k11 gold badges86 silver badges188 bronze badges · Accepted Answer · 2018-08-14 19:44:21Z

inside your GetSiteMapUrls Method you clean the Urls twice, and I don't see a reason for this.

foreach (Match m in urlRegex.Matches(sitemapText))
 urls.Add(CleanUriString(m.Value));
foreach (var url in urls)
{
 var cleanedUriString = CleanUriString(url);
 if (Uri.IsWellFormedUriString(cleanedUriString, UriKind.RelativeOrAbsolute))
 yield return new Uri(cleanedUriString);
}

before you add the URL to the urls list you run them through the CleanUriString() method

then you traverse the list that you just made and run the urls through the same method before returning the new uri.

This is redundant.

instead you could just use a single foreach loop and return the list of good urls from it, like this:

public IEnumerable<Uri> GetSitemapUrls(Uri sitemapUrl)
{
 var sitemapText = GetSitemapText(sitemapUrl);
 if (string.IsNullOrWhiteSpace(sitemapText))
 yield break;
 var urlRegex = new Regex(@"\b(?:https?://|www\.)[^ \f\n\r\t\v\]]+\b", RegexOptions.Compiled | RegexOptions.IgnoreCase);
 foreach (Match m in urlRegex.Matches(sitemapText))
 {
 var clean = CleanUriString(m.Value); 
 if (Uri.IsWellFormedUriString(clean, UriKind.RelativeOrAbsolute))
 yield return new Uri(clean);
 }
}

and personally I like Braces on my If's and Loops

Stack Exchange Network

Method for returning valid URLs from a sitemap URL

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Method for returning valid URLs from a sitemap URL

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions