So I'm making a program that is kind of a web crawler. It downloads the HTML of a page, parses it for specific text using regex, and then adds each match to a list.
To achieve this, I used async HTTP requests. The GET request is sent asynchronously and the parsing is performed on the returned HTML.
My issue, and I'm not sure if it's a simple one, is that the program doesn't run smoothly. It sends a bunch of requests, pauses for a couple of seconds, then increments the count of parsed items all at once (although the counter is programmed to increment once every time an item is added), so that, for example, it jumps from 53 to 69 instead of showing 54, 55, 56, ...
Sorry for being a newb, but I taught myself all this stuff and some experienced advice would go a long way.
Thanks
-
stackoverflow.com/questions/1732348/… – SLaks, May 17, 2012 at 3:14
-
This is for a specific site where the resulting HTML is always in the same form, with only the variable values changing, so regex works fine. – blizz, May 17, 2012 at 3:57
-
But just out of curiosity, is there another, more efficient method of doing it? – blizz, May 17, 2012 at 3:58
1 Answer
That sounds correct.
The slowest part of your task is downloading the pages over the network.
Your program starts downloading a bunch of pages at once, waits for them to arrive, then parses them all almost instantly.
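To see why the counter jumps rather than ticking up one at a time, here is a rough simulation. It is only a sketch: `Task.Delay` stands in for network latency, and `BurstDemo`, `RunAsync`, and the latency value are made up for illustration, not taken from the asker's code.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class BurstDemo
{
    // Starts requestCount simulated downloads at once and counts completions.
    public static async Task<int> RunAsync(int requestCount, int latencyMs)
    {
        int parsed = 0;
        var clock = Stopwatch.StartNew();

        var tasks = Enumerable.Range(0, requestCount).Select(async i =>
        {
            // Assumption: every "download" takes roughly the same time.
            await Task.Delay(latencyMs);

            // The "parse" is effectively instant, so the counter is bumped
            // right after the simulated response arrives.
            int n = Interlocked.Increment(ref parsed);
            Console.WriteLine(n + " parsed at " + clock.ElapsedMilliseconds + " ms");
        }).ToList();

        await Task.WhenAll(tasks);
        return parsed;
    }
}
```

Calling something like `BurstDemo.RunAsync(20, 2000).Wait()` from a console app should show a quiet period followed by the completions arriving close together, since all the requests were started at about the same moment.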
-
In that case, can I give priority to the main thread somehow? That is, the thread that is queuing the async requests into the ThreadPool? I need this because the main thread also makes a request of its own each time 20 async requests have been made. What's happening is that its request gets backlogged behind all the already-queued ThreadPool requests, and the whole program blocks while it waits for the response. – blizz, May 17, 2012 at 3:42
-
@user1115071: Consider using the TPL, which is already optimized for this. – SLaks, May 17, 2012 at 11:45
-
Please forgive my ignorance, as I've never used the TPL. Should I be using it for all threads, or only for the main ones I mentioned? – blizz, May 18, 2012 at 22:20
-
Use Parallel.For* or Task or LINQ AsParallel(), and don't use threads or the threadpool directly at all. – SLaks, May 20, 2012 at 2:06
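As a minimal sketch of the Task-based option from the last comment, something along these lines might work. It assumes .NET 4.5 or later (for `HttpClient` and `async`/`await`); the class name `CrawlSketch`, the method names, the regex pattern, and the `maxConcurrent` limit are all hypothetical placeholders rather than anything from the original program.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading;
using System.Threading.Tasks;

public static class CrawlSketch
{
    // Hypothetical pattern; substitute the regex that fits the target site.
    private static readonly Regex ItemPattern =
        new Regex(@"<b>(.*?)</b>", RegexOptions.Compiled);

    // Pure parsing step: returns the first capture group of every match.
    public static List<string> ExtractMatches(string html)
    {
        return ItemPattern.Matches(html)
                          .Cast<Match>()
                          .Select(m => m.Groups[1].Value)
                          .ToList();
    }

    // Downloads each URL as a Task and parses each page as soon as it arrives.
    public static async Task<List<string>> CrawlAsync(IEnumerable<string> urls, int maxConcurrent)
    {
        var results = new ConcurrentBag<string>();
        int parsed = 0;

        using (var client = new HttpClient())
        using (var gate = new SemaphoreSlim(maxConcurrent))
        {
            var tasks = urls.Select(async url =>
            {
                // Limits how many downloads are in flight at the same time.
                await gate.WaitAsync();
                try
                {
                    string html = await client.GetStringAsync(url);
                    foreach (var item in ExtractMatches(html))
                    {
                        results.Add(item);
                        // Thread-safe counter, bumped once per item added.
                        Console.WriteLine("Parsed: " + Interlocked.Increment(ref parsed));
                    }
                }
                finally
                {
                    gate.Release();
                }
            }).ToList();

            await Task.WhenAll(tasks);
        }
        return results.ToList();
    }
}
```

No thread or ThreadPool call appears directly here; the awaits hand scheduling to the runtime. `Parallel.For*` and `AsParallel()` are generally aimed at CPU-bound work, so for downloads that mostly wait on the network, the Task-based form is probably the closer fit.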