2

So I'm making a program that is a kind of web crawler. It downloads the HTML of a page, parses it for specific text using a regex, and adds each match to a list.

To achieve this, I used async HTTP requests. Each GET request is sent asynchronously, and the parsing is performed on the returned HTML.

My issue, and I'm not sure whether it's a simple one, is that the program doesn't run smoothly. It sends a bunch of requests, pauses for a couple of seconds, then increments the parsed-item count all at once (although the counter is programmed to increment once every time an item is added), so that, for example, it jumps from 53 to 69 instead of showing 54, 55, 56, ...

Sorry for being a newb, but I taught myself all this stuff, and some experienced advice would go a long way.

Thanks.

asked May 17, 2012 at 3:07
  • stackoverflow.com/questions/1732348/… Commented May 17, 2012 at 3:14
  • This is for a specific site where the resulting HTML is always in the same form, with only the variables changing, so regex works fine. Commented May 17, 2012 at 3:57
  • But just out of curiosity, is there a more efficient way of doing it? Commented May 17, 2012 at 3:58

1 Answer

4

That sounds correct.

The slowest part of your task is downloading the pages over the network.

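Your program starts downloading a bunch of pages at once, waits for the responses to arrive, then parses them all almost instantly. Because the responses come back at about the same time, the counter moves in one jump instead of ticking up one item at a time.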
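To make the burst visible, here is a minimal sketch rather than the asker's actual code. It assumes a modern C# compiler (async/await, which the question does not use), and it replaces the real download with `Task.Delay` as a stand-in for network latency. The names `BurstDemo`, `FetchAndParseAsync`, and `RunAsync` are hypothetical. Each simulated page records the moment it would have been counted.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

static class BurstDemo
{
    // Simulates one page: a slow "download" (Task.Delay stands in for the
    // network round trip), followed by a near-instant "parse" step that
    // just records when the item would have been counted.
    static async Task<long> FetchAndParseAsync(Stopwatch clock, int latencyMs)
    {
        await Task.Delay(latencyMs);
        return clock.ElapsedMilliseconds;
    }

    // Starts all simulated requests at once and returns the time (in ms
    // since the start) at which each one finished.
    public static async Task<long[]> RunAsync(int pages, int latencyMs)
    {
        var clock = Stopwatch.StartNew();
        Task<long>[] tasks = Enumerable.Range(0, pages)
            .Select(_ => FetchAndParseAsync(clock, latencyMs))
            .ToArray(); // ToArray forces every request to start now
        return await Task.WhenAll(tasks);
    }

    static async Task Main()
    {
        long[] times = await RunAsync(pages: 16, latencyMs: 500);
        Array.Sort(times);
        for (int i = 0; i < times.Length; i++)
            Console.WriteLine("item " + (i + 1) + " counted at " + times[i] + " ms");
    }
}
```

With equal latencies, the recorded times should land close together, which is the same clustering that makes a real counter appear to skip values. Real pages have varying latencies, so the bursts would be less uniform than in this simulation.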

answered May 17, 2012 at 3:14
  • In that case, can I give priority to the main thread somehow? That is, the thread that is queuing the async requests into the ThreadPool. I need this because the main thread also makes a request of its own each time 20 async requests have been made. What's happening is that its request gets backlogged behind all the already-queued ThreadPool requests, blocking the whole program while it waits for the response. Commented May 17, 2012 at 3:42
  • @user1115071: Consider using the TPL, which is already optimized for this. Commented May 17, 2012 at 11:45
  • Please forgive my ignorance as I've never used the TPL. Should I be using it for all threads, or only for the main ones I mentioned? Commented May 18, 2012 at 22:20
  • Use Parallel.For* or Task or LINQ AsParallel() and don't use threads or the threadpool directly at all. Commented May 20, 2012 at 2:06
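The `Parallel.For*` suggestion in the last comment could look roughly like the sketch below. This is an illustration under stated assumptions, not the asker's program: the class name `CrawlSketch`, the regex pattern, the example URLs, and the degree of parallelism (8) are all hypothetical. The download is passed in as a delegate so the parsing logic can be tried without touching the network.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;
using System.Threading;
using System.Threading.Tasks;

static class CrawlSketch
{
    // Hypothetical pattern; the real one depends on the target site's HTML.
    static readonly Regex ItemPattern =
        new Regex(@"<span class=""item"">(.*?)</span>", RegexOptions.Compiled);

    // Fetches every URL via Parallel.ForEach, extracts matches, and reports
    // the running count through onProgress each time an item is added.
    public static List<string> Crawl(IEnumerable<string> urls,
                                     Func<string, string> fetch,
                                     Action<int> onProgress)
    {
        var items = new ConcurrentBag<string>(); // safe for concurrent adds
        int parsed = 0;
        // Assumed cap on concurrent downloads; tune for the target site.
        var options = new ParallelOptions { MaxDegreeOfParallelism = 8 };

        Parallel.ForEach(urls, options, url =>
        {
            string html = fetch(url); // blocks this worker until the page arrives
            foreach (Match m in ItemPattern.Matches(html))
            {
                items.Add(m.Groups[1].Value);
                // Interlocked keeps the counter correct across threads.
                onProgress(Interlocked.Increment(ref parsed));
            }
        });

        return new List<string>(items);
    }

    // Blocking download using WebClient (marked obsolete in newer .NET
    // versions, but still available).
    static string Download(string url)
    {
        using (var client = new WebClient())
            return client.DownloadString(url);
    }

    static void Main()
    {
        var urls = new[] { "http://example.com/page1", "http://example.com/page2" };
        List<string> found = Crawl(urls, Download, n => Console.WriteLine("Parsed: " + n));
        Console.WriteLine("Total items: " + found.Count);
    }
}
```

Note that this does not make the counter tick smoothly by itself: pages that arrive together are still parsed together. What it does is leave scheduling and throttling to the TPL instead of queuing work on the ThreadPool by hand.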
