\$\begingroup\$

This is my first time using the Task Parallel Library, and it is performing much slower than I expected. The application is a simple web crawler that returns all URLs found for a given URL.

Here is my method that, depending on the level param, will either download single page links (getSinglePageLinks()) or it will make use of my threaded method (getManyPageLinks()):

public static IEnumerable<string> getLinks(string url, bool hostMatch=true, bool validatePages=true, int level=0)
{
    string formattedUrl = urlFormatValidator(url);
    if (string.IsNullOrEmpty(formattedUrl)) return Enumerable.Empty<string>();
    //download root url's
    IEnumerable<string> rootUrls = getSinglePageLinks(formattedUrl, hostMatch, validatePages);
    //download url's for each level
    for (int i=0; i<level; i++)
    {
        rootUrls = rootUrls.Union(getManyPageLinks(rootUrls, hostMatch, validatePages));
    }
    return rootUrls;
}

getSinglePageLinks() uses the HTML Agility Pack; it simply downloads and parses the given URL:

private static IEnumerable<string> getSinglePageLinks(string formattedUrl, bool hostMatch = true, bool validatePages = true)
{
    try
    {
        HtmlDocument doc = new HtmlWeb().Load(formattedUrl);
        var linkedPages = doc.DocumentNode.Descendants("a")
            .Select(a => a.GetAttributeValue("href", null))
            .Where(u => !String.IsNullOrEmpty(u))
            .Distinct();
        //hostMatch and validatePages left out
        return linkedPages;
    }catch(...){...}
}

And my getManyPageLinks():

private static IEnumerable<string> getManyPageLinks(IEnumerable<string> rootUrls, bool hostMatch, bool validatePages)
{
    List<Task> tasks = new List<Task>();
    List<List<string>> allLinks = new List<List<string>>();
    foreach (string rootUrl in rootUrls)
    {
        string rootUrlCopy = rootUrl; //required
        var task = Task.Factory.StartNew(() =>
        {
            IEnumerable<string> taskResult = getSinglePageLinks(rootUrlCopy, hostMatch, validatePages);
            return taskResult;
        });
        tasks.Add(task);
        allLinks.Add(task.Result.ToList());
    }
    Task.WaitAll(tasks.ToArray());
    return allLinks.SelectMany(x => x).Distinct();
}

The app works OK if the level is set to 0, but if I set the level to 1 so that it gets all links for all of the root URLs, the CPU/network usage does not go above 1-3%. How can I improve performance?

dfhwze
asked Aug 27, 2016 at 14:48
\$\endgroup\$

2 Answers

\$\begingroup\$

From MSDN on Task.Wait:

If the current task has not started execution, the Wait method attempts to remove the task from the scheduler and execute it inline on the current thread.

I have a feeling that something similar might be happening with WaitAll, killing performance. I'll have a look and see if I can find documentation about it. Since you are using StartNew, some of the tasks might have started already, so they won't be inlined.

I would refactor it using async/await, so that you can be sure the work is not confined to a single, blocked thread from the pool:

// Downloads every root URL concurrently and merges the links that were found.
static async Task<IEnumerable<string>> GetAllPagesLinks(IEnumerable<string> rootUrls, bool hostMatch, bool validatePages)
{
    var result = await Task.WhenAll(rootUrls.Select(url => GetPageLinks(url, hostMatch, validatePages)));
    return result.SelectMany(x => x).Distinct();
}

// Downloads one page without blocking a thread and returns the distinct href values on it.
static async Task<IEnumerable<string>> GetPageLinks(string formattedUrl, bool hostMatch = true, bool validatePages = true)
{
    var htmlDocument = new HtmlDocument();
    try
    {
        // LoadHtml parses an HTML string; Load(string) would treat the argument as a file path.
        using (var client = new HttpClient())
            htmlDocument.LoadHtml(await client.GetStringAsync(formattedUrl));
        return htmlDocument.DocumentNode
            .Descendants("a")
            .Select(a => a.GetAttributeValue("href", null))
            .Where(u => !string.IsNullOrEmpty(u))
            .Distinct();
    }
    catch
    {
        // Any download or parse failure is treated as "no links found".
        return Enumerable.Empty<string>();
    }
}

static async Task<IEnumerable<string>> GetLinks(string url, bool hostMatch = true, bool validatePages = true, int level = 0)
{
    if (level < 0)
        throw new ArgumentOutOfRangeException(nameof(level));
    string formattedUrl = FormatAndValidateUrl(url);
    if (string.IsNullOrEmpty(formattedUrl))
        return Enumerable.Empty<string>();
    var rootUrls = await GetPageLinks(formattedUrl, hostMatch, validatePages);
    if (level == 0)
        return rootUrls;
    var links = await GetAllPagesLinks(rootUrls, hostMatch, validatePages);
    // Pass level - 1 instead of --level: decrementing the captured parameter inside the
    // lambda would give every link a different, ever smaller level.
    var tasks = await Task.WhenAll(links.Select(link => GetLinks(link, hostMatch, validatePages, level - 1)));
    return tasks.SelectMany(l => l);
}

I haven't had a chance to test it, but you should get the gist.
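
To try the Task.WhenAll fan-out without touching the network, here is a minimal sketch. FakeFetchAsync and the in-memory Pages dictionary are hypothetical stand-ins for GetPageLinks and real web pages; they are assumptions made for illustration and are not part of the code above.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

public static class WhenAllSketch
{
    // Hypothetical in-memory "web": page URL -> links found on that page.
    private static readonly Dictionary<string, string[]> Pages = new Dictionary<string, string[]>
    {
        ["a"] = new[] { "b", "c" },
        ["b"] = new[] { "c", "d" },
        ["c"] = new[] { "d" },
    };

    // Stand-in for GetPageLinks: Task.Delay simulates latency without blocking a thread.
    public static async Task<IEnumerable<string>> FakeFetchAsync(string url)
    {
        await Task.Delay(200);
        return Pages.TryGetValue(url, out var links) ? links : Enumerable.Empty<string>();
    }

    // Starts every fetch before awaiting any of them, then flattens and de-duplicates.
    public static async Task<List<string>> FetchAllAsync(IEnumerable<string> urls)
    {
        IEnumerable<string>[] results = await Task.WhenAll(urls.Select(url => FakeFetchAsync(url)));
        return results.SelectMany(x => x).Distinct().ToList();
    }

    public static async Task Main()
    {
        var watch = Stopwatch.StartNew();
        List<string> links = await FetchAllAsync(new[] { "a", "b", "c" });
        watch.Stop();
        // The three simulated downloads overlap, so the elapsed time should be
        // closer to one delay than to three.
        Console.WriteLine(string.Join(",", links) + " in " + watch.ElapsedMilliseconds + " ms");
    }
}
```

Because every call to FakeFetchAsync is started before the first await completes, no thread sits idle while a "download" is pending. In a real crawler you would probably also want to cap the number of simultaneous requests, which this sketch does not do.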

answered Aug 28, 2016 at 9:12
\$\endgroup\$
\$\begingroup\$

Blocking code in a loop

You are starting your tasks asynchronously...

Task.Factory.StartNew(()

only to block synchronously inside the loop...

allLinks.Add(task.Result.ToList());

so that each iteration starts one task and then waits for its result before the next task is started. The tasks therefore run one after another instead of in parallel.

foreach (string rootUrl in rootUrls)
{
    // .. code omitted
    // start an asynchronous task
    var task = Task.Factory.StartNew(() =>
    {
        // .. code omitted
    });
    tasks.Add(task);
    // BOTTLE-NECK:
    // synchronously await its result before starting the next task
    allLinks.Add(task.Result.ToList());
}
// super fast! but only because you have already awaited all results synchronously
Task.WaitAll(tasks.ToArray());
return allLinks.SelectMany(x => x).Distinct();

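You could solve this by removing the line allLinks.Add(task.Result.ToList()); from the loop and, once Task.WaitAll has completed, returning tasks.SelectMany(x => x.Result).Distinct();. For Result to be available, the list has to hold Task<IEnumerable<string>> rather than the non-generic Task.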

private static IEnumerable<string> getManyPageLinks(IEnumerable<string> rootUrls, bool hostMatch, bool validatePages)
{
    // Task<IEnumerable<string>> (not the non-generic Task) so that Result can be read below.
    var tasks = new List<Task<IEnumerable<string>>>();
    foreach (var rootUrl in rootUrls)
    {
        string rootUrlCaptured = rootUrl;
        var task = Task.Run(() =>
        {
            var taskResult = getSinglePageLinks(rootUrlCaptured, hostMatch, validatePages);
            return taskResult;
        });
        tasks.Add(task);
    }
    // Block only once, after every task has been started.
    Task.WaitAll(tasks.ToArray());
    return tasks.SelectMany(task => task.Result).Distinct();
}
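
The difference between reading task.Result inside the loop and waiting once after the loop can be observed without any network access. The following is a minimal sketch, assuming a hypothetical SimulatedDownload that sleeps for 200 ms in place of getSinglePageLinks; the class and method names are made up for illustration, and the timings it prints will vary by machine.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class BlockingLoopDemo
{
    // Hypothetical stand-in for getSinglePageLinks: blocks for 200 ms, returns one fake link.
    public static IEnumerable<string> SimulatedDownload(string url)
    {
        Thread.Sleep(200);
        return new[] { url + "/child" };
    }

    // Pattern from the question: reading task.Result inside the loop waits for each
    // task before the next one is started, so the downloads run one after another.
    public static List<string> BlockInsideLoop(IEnumerable<string> urls)
    {
        var all = new List<string>();
        foreach (string url in urls)
        {
            Task<IEnumerable<string>> task = Task.Run(() => SimulatedDownload(url));
            all.AddRange(task.Result);
        }
        return all;
    }

    // Suggested pattern: start every task first, then wait for all of them together.
    public static List<string> StartAllThenWait(IEnumerable<string> urls)
    {
        List<Task<IEnumerable<string>>> tasks =
            urls.Select(url => Task.Run(() => SimulatedDownload(url))).ToList();
        Task.WaitAll(tasks.ToArray());
        return tasks.SelectMany(t => t.Result).ToList();
    }

    public static void Main()
    {
        string[] urls = { "u1", "u2", "u3", "u4" };
        var watch = Stopwatch.StartNew();
        BlockInsideLoop(urls);
        Console.WriteLine("blocking inside loop: " + watch.ElapsedMilliseconds + " ms");
        watch.Restart();
        StartAllThenWait(urls);
        Console.WriteLine("start all, then wait: " + watch.ElapsedMilliseconds + " ms");
    }
}
```

Both methods return the same links; only the elapsed time differs. How much faster the second pattern is depends on how many thread-pool threads are available, since each simulated download still blocks one thread while it sleeps.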
answered Jul 27, 2019 at 6:04
\$\endgroup\$
