2

Let's say I want to program a parallelized web crawler which has a shared FIFO queue (multi consumer/multi producer). The queue only contains URLs. How do I detect the end of the queue?

A worker process is always consumer and producer at the same time because it takes an URL from the queue, crawls it and adds any found URLs to the queue. I think there is no way to have separate processes for consumer and producer tasks in this scenario.

Since the amount of input data is unknown but not infinite it's impossible to use a 'poison pill' as sentinel in the queue, right?

Also, the queue size is not a reliable way to find out if the queue is empty (because of multiple consumers/producers).

Please enlighten me :-)

asked Jan 2, 2017 at 18:38
1
  • 1
    Queue size plus the number of non-idle workers adding up to zero should tell you when you're done. Commented Jan 3, 2017 at 4:01

1 Answer 1

2

The main issue is that you will have to deal with cycles in the URL graph to prevent infinite looping -- when you see the same URL a second time, you probably should not put it into any queue.

Given cycle detection and prevention, then the queue will eventually converge to zero nodes allowing the zero queue size test to be reliable.

You can use a sentinel, and it can be used to indicate the end of a generation.

Or you can use a new queue for each generation. Thus, the workers start reading from the current generation queue while writing to the next generation queue. When a worker realizes that their currently configured input queue is empty, they reconfigure to read from a non-empty queue while writing to the next queue after that.

The queue for each generation can be retired when workers have all transitioned to reading from another queue. (Modulo that you still need some data structure to detect cycles.)

You can also use the notion of generation count to limit the search if it goes too far; it might give more consistent results than time-based termination, or URL-count-based termination.

answered Jan 2, 2017 at 19:38

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.