I have seen some operating systems textbooks mention the producer-consumer problem in the context of synchronizing concurrent accesses to shared resources. All of them seem to assume a shared-memory architecture.
Does the producer-consumer problem also appear in communication between processes in a distributed-memory architecture? If so:
Is the output generated by a producer stored in a place shared with the consumer?
Do the same synchronization methods (e.g. locks, semaphores, monitors, ...) used in a shared-memory architecture apply to the producer-consumer problem in a distributed-memory architecture?
I have seen that "pipeline" and "stream processing" are popular terms. Do they refer to the producer-consumer problem/pattern? Do they require the same synchronization methods as in a shared-memory architecture?
Thanks.
1 Answer
You are right in thinking that the problems of races and deadlocks in producer-consumer stem from shared variables and data structures.
Any system that gives multiple accessors (processes, threads, an interrupt handler and the main program, sometimes even simple recursion or certain object patterns) raw read and write access to shared variables or shared data structures can have such problems.
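For concreteness, here is a minimal sketch (in Python, purely for illustration) of the textbook shared-memory arrangement the question refers to: a bounded buffer protected by a lock and two condition variables. The races and lost wake-ups are exactly what appear if the lock or the wait/notify discipline is removed.

```python
# Minimal sketch of the classic shared-memory producer-consumer:
# a fixed-size buffer guarded by one lock, with two condition
# variables so the consumer waits while the buffer is empty and
# the producer waits while it is full.
import threading
from collections import deque

BUFFER_CAPACITY = 8
buffer = deque()
lock = threading.Lock()
not_full = threading.Condition(lock)
not_empty = threading.Condition(lock)

def produce(item):
    with not_full:
        while len(buffer) >= BUFFER_CAPACITY:   # buffer full: wait
            not_full.wait()
        buffer.append(item)
        not_empty.notify()                      # wake a waiting consumer

def consume():
    with not_empty:
        while not buffer:                       # buffer empty: wait
            not_empty.wait()
        item = buffer.popleft()
        not_full.notify()                       # wake a waiting producer
        return item

if __name__ == "__main__":
    t = threading.Thread(target=lambda: [print("got", consume()) for _ in range(3)])
    t.start()
    for i in range(3):
        produce(i)
    t.join()
```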
For example, even a distributed system in which:
A shared database (e.g. SQL, key-value, etc.) is used to store, read, and write variables or a data structure non-atomically. If one actor reads the database and, as a result of testing some value, sends a suspend message to another actor (which can also update variables in the database), we will have the same problems as with shared memory (see the sketch after this list).
A shared file system used in the same way, storing variables that are updated by multiple actors, can suffer the same problems.
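To make the first point concrete, here is a hedged sketch of the check-then-act race; `KVStore` is just an in-memory stand-in for any shared database or shared file, and the account/credit scenario is invented for illustration. The important part is that the read, the test, and the write are three separate, non-atomic operations.

```python
# Illustrative check-then-act race against shared storage.
# KVStore stands in for any shared database or shared file; the
# only property that matters is that get and put are separate,
# non-atomic operations issued by independent actors.
class KVStore:
    def __init__(self):
        self._data = {}

    def get(self, key, default=0):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value

def take_credit(kv, account, amount):
    balance = kv.get(account)               # 1. read shared state
    if balance < amount:                     # 2. test locally
        return False
    kv.put(account, balance - amount)        # 3. write back
    return True

# Two actors interleaving steps 1-3 can both observe the same
# balance, both pass the test, and both write back: a lost update,
# the same kind of race an unsynchronized shared-memory
# producer-consumer suffers.
```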
We need to use domain-oriented abstractions that manipulate the data structures, rather than raw read and write operations performed by independent, unsynchronized actors.
When I think of distributed systems, however, I think of meaningful, high-level message passing: message-passing systems can be designed to send messages about the job data to be worked on rather than messages that manage shared variables.
To be clear, a node in a distributed system still has to buffer network packets internally, which means enqueueing and dequeueing, and that is itself a producer-consumer arrangement; so, of course, these implementations protect their internally shared buffers using locks or other primitives, but this queueing is typically handled at the system level, below regular user code.
In a distributed system we expect a consumer to naturally suspend itself when it has processed all (buffered) job packets, and to resume from suspension on the arrival of a new packet. Thus the producer doesn't necessarily have to wake the consumer at all; it just sends a job packet. This naturally takes care of the empty-buffer situation in producer-consumer. Compared with the producer-consumer approach given on Wikipedia, this breaks the cyclic nature of the wake-up graph between the actors (where the consumer wakes the producer and the producer also wakes the consumer).
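A minimal sketch of that shape, using two local Python processes and a pipe as a stand-in for a network link: the consumer blocks in recv() whenever it has no buffered jobs and is woken by the arrival of the next job message; the producer never touches a shared variable and never explicitly wakes anyone.

```python
# Sketch: the consumer suspends on an empty "mailbox" and resumes
# when a job message arrives; the producer just sends job data.
from multiprocessing import Pipe, Process

def consumer(conn):
    while True:
        job = conn.recv()          # suspends here while no jobs are queued
        if job is None:            # conventional "no more work" signal
            break
        print("processing", job)

def producer(conn, jobs):
    for job in jobs:
        conn.send(job)             # send the job data itself, not shared state
    conn.send(None)

if __name__ == "__main__":
    parent_end, child_end = Pipe()
    worker = Process(target=consumer, args=(child_end,))
    worker.start()
    producer(parent_end, ["job-1", "job-2", "job-3"])
    worker.join()
```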
A robust distributed system would also offer some back-pressure mechanism, which would have the consumer send throttling messages (suspend/resume) to the producer; otherwise a producer that produces faster than the consumer consumes would eventually overwhelm it. This corresponds to the full-buffer condition of producer-consumer, but, by comparison, it does not involve shared variables.
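One possible way to express that back pressure is credit-based flow control; the sketch below is an assumption-laden illustration (the message names and credit batch size are invented), not a description of any particular system. The consumer grants the producer a number of credits, each job costs one credit, and the producer stops sending when it runs out, resuming only when the consumer tops the credits up.

```python
# Sketch of credit-based back pressure over a message channel.
from multiprocessing import Pipe, Process

CREDIT_BATCH = 4                                # credits granted at a time

def consumer(conn):
    conn.send(("credit", CREDIT_BATCH))         # initial grant
    done = 0
    while True:
        kind, payload = conn.recv()
        if kind == "stop":
            break
        # ... process the job in `payload` ...
        done += 1
        if done % CREDIT_BATCH == 0:
            conn.send(("credit", CREDIT_BATCH)) # top up: lets the producer resume

def producer(conn, jobs):
    credits = 0
    for job in jobs:
        while credits == 0:                     # out of credit: throttled
            kind, amount = conn.recv()
            if kind == "credit":
                credits += amount
        conn.send(("job", job))
        credits -= 1
    conn.send(("stop", None))

if __name__ == "__main__":
    producer_end, consumer_end = Pipe()
    worker = Process(target=consumer, args=(consumer_end,))
    worker.start()
    producer(producer_end, [f"job-{i}" for i in range(10)])
    worker.join()
```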
As with shared-memory producer-consumer, multiple producers or consumers complicate matters in a distributed system, so there would likely be a coordinating service that manages the job queue, though not necessarily through external manipulation of shared variables. (Such a coordinating service could also facilitate retries on network and worker failures.)
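As a rough sketch of the core logic such a coordinating service might implement (the names and message kinds are invented, and the transport, whether sockets, RPC, or a message broker, is deliberately left out): the coordinator owns the job queue, hands out a job only when a worker asks for one, and re-enqueues a job whose worker fails before reporting completion.

```python
# Sketch of a coordinator that owns the job queue and retries
# jobs whose workers fail. Transport and failure detection are
# out of scope; only the bookkeeping is shown.
from collections import deque

class Coordinator:
    def __init__(self, jobs):
        self.pending = deque(jobs)      # jobs not yet handed out
        self.in_flight = {}             # worker_id -> job currently assigned

    def on_request_job(self, worker_id):
        """A worker asks for work; reply with a job, or None if idle."""
        if not self.pending:
            return None
        job = self.pending.popleft()
        self.in_flight[worker_id] = job
        return job

    def on_job_done(self, worker_id):
        """A worker reports success; drop the assignment."""
        self.in_flight.pop(worker_id, None)

    def on_worker_failed(self, worker_id):
        """A worker (or its link) failed; put its job back for retry."""
        job = self.in_flight.pop(worker_id, None)
        if job is not None:
            self.pending.append(job)

# One possible sequence of events the coordinator might see:
coord = Coordinator(["job-1", "job-2"])
assert coord.on_request_job("worker-A") == "job-1"
assert coord.on_request_job("worker-B") == "job-2"
coord.on_worker_failed("worker-A")                  # job-1 goes back in the queue
coord.on_job_done("worker-B")
assert coord.on_request_job("worker-B") == "job-1"  # job-1 retried elsewhere
```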