I need a method which fetches all of the text from a URL (generally a sitemap URL), and returns an IEnumerable
of all valid URLs contained in the text returned from the initial address. What I have so far is:
public IEnumerable<Uri> GetSitemapUrls(Uri sitemapUrl)
{
var sitemapText = GetSitemapText(sitemapUrl);
if (string.IsNullOrWhiteSpace(sitemapText))
yield break;
var urls = new List<string>();
var urlRegex = new Regex(@"\b(?:https?://|www\.)[^ \f\n\r\t\v\]]+\b", RegexOptions.Compiled | RegexOptions.IgnoreCase);
foreach (Match m in urlRegex.Matches(sitemapText))
urls.Add(CleanUriString(m.Value));
foreach (var url in urls)
{
var cleanedUriString = CleanUriString(url);
if (Uri.IsWellFormedUriString(cleanedUriString, UriKind.RelativeOrAbsolute))
yield return new Uri(cleanedUriString);
}
}
string GetSitemapText(Uri sitemapUri)
{
var wc = new WebClient
{
Encoding = System.Text.Encoding.UTF8
};
return wc.DownloadString(sitemapUri);
}
string CleanUriString(string dirtyUriString)
{
var legalCharacters = @"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&'()*+,;=".ToCharArray();
var cleanedString = dirtyUriString;
foreach (var character in dirtyUriString)
{
var matchIndex = dirtyUriString.IndexOf(character);
if (!legalCharacters.Any(x => x.Equals(character)) && matchIndex > 0)
cleanedString = dirtyUriString.Substring(0, matchIndex);
}
return cleanedString;
}
It seems to work as intended for actual sitemaps, null/empty responses from the URL it receives, and URLs with URL-illegal characters in them. I have a feeling I'm missing out on potential issues or letting bad URLs through anyway, not to mention I haven't thought about spidering through URLs that the initial sitemap (if it is a sitemap) returns.
Is there anything I can do to improve it?
3 Answers 3
inside your GetSiteMapUrls
Method you clean the Urls twice, and I don't see a reason for this.
foreach (Match m in urlRegex.Matches(sitemapText)) urls.Add(CleanUriString(m.Value)); foreach (var url in urls) { var cleanedUriString = CleanUriString(url); if (Uri.IsWellFormedUriString(cleanedUriString, UriKind.RelativeOrAbsolute)) yield return new Uri(cleanedUriString); }
before you add the URL to the urls
list you run them through the CleanUriString()
method
then you traverse the list that you just made and run the urls through the same method before returning the new uri.
This is redundant.
instead you could just use a single foreach loop and return the list of good urls from it, like this:
public IEnumerable<Uri> GetSitemapUrls(Uri sitemapUrl)
{
var sitemapText = GetSitemapText(sitemapUrl);
if (string.IsNullOrWhiteSpace(sitemapText))
yield break;
var urlRegex = new Regex(@"\b(?:https?://|www\.)[^ \f\n\r\t\v\]]+\b", RegexOptions.Compiled | RegexOptions.IgnoreCase);
foreach (Match m in urlRegex.Matches(sitemapText))
{
var clean = CleanUriString(m.Value);
if (Uri.IsWellFormedUriString(clean, UriKind.RelativeOrAbsolute))
yield return new Uri(clean);
}
}
and personally I like Braces on my If's and Loops
Just a couple of thougths...
if (string.IsNullOrWhiteSpace(sitemapText)) yield break;
This is a very unusual handling of null/empty argument values and should be documented. A more meaniningful name like GetSitemapUrlsOrDefault
would also help. Without it the expected way is to throw an ArgumentException
.
foreach (Match m in urlRegex.Matches(sitemapText)) urls.Add(CleanUriString(m.Value)); foreach (var url in urls) { var cleanedUriString = CleanUriString(url); if (Uri.IsWellFormedUriString(cleanedUriString, UriKind.RelativeOrAbsolute)) yield return new Uri(cleanedUriString); }
You don't need all this. Instead you can build a nice query that does everything in one go and in a single run:
return
urlRegex
.Matches(sitemapText)
.Cast<Match>()
.Select(m => CleanUriString(m.Value))
.Where(cleanedUriString => Uri.IsWellFormedUriString(cleanedUriString, UriKind.RelativeOrAbsolute))
.Select(cleanedUriString => new Uri(cleanedUriString));
or alternatively:
return
from m in urlRegex.Matches(sitemapText).Cast<Match>()
let cleanedUriString = CleanUriString(m.Value)
where Uri.IsWellFormedUriString(cleanedUriString, UriKind.RelativeOrAbsolute)
select new Uri(cleanedUriString);
You don't need the .ToCharArray();
on the legalCharacters
string. A string is already a char-array.
Oh, and of course the WebClient
needs to be disposed. Wrapping it with a using
would be good.
The intention of CleanUriString
seems to be to return a new string containing only characters from the legalCharacters
string.
If CleanUriString
is passed a value containing more than one character not in legalCharacters
it will return a substring of all the characters up to the last illegal character including any prior illegal characters.
Example:
CleanUriString("asdøæå") //Result = asdøæ
I would suggest you write a thorough specification of what you expect CleanUriString
to do.
Should it strip all illegal characters? Cut the string off at the first illegal character? Outright reject any string with illegal characters?
When that is done rewrite the method step-by-step according to the specification.
-
1\$\begingroup\$ This also has the (bizarre) side-effect of it only finding the first value of the character that doesn't match. So
CleanUriString("ABCD<ABCD>ABCD>")
andCleanUriString("ABCD<ABCD>ABCD<")
return completely different results, because they substring to the first instance of the illegal character, not to the current index, as does calling it a second time onCleanUriString("ABCD<ABCD>ABCD>")
. \$\endgroup\$Der Kommissar– Der Kommissar2018年08月15日 14:14:27 +00:00Commented Aug 15, 2018 at 14:14
CleanUriString
works. It looks for achar
fromdirtyUriString
inside the same string. This means thatmatchIndex
will never be negative :-| Could you help me out with that? ;-) Maybe an example? \$\endgroup\$