long-lived pools/background jobs · sourcegraph/conc · Discussion #93

bobheadxi
Feb 17, 2023
Collaborator

I think there might be a category of use cases where one might want to maintain a long-lived pool where:

tasks can be continuously submitted, with per-task error handling (ideally per-task panic handling as well, rather than aggregated panic handling) and concurrency limits
on service shutdown, offer a mechansim to signal and wait for all goroutines to stop

Somewhat related: #86 (comment), #89 (comment)

This might be something that's useful to offer separately, since I think all the conc.WaitGroup-based mechanisms should definitely not be used for long-lived goroutine pools (most notably because panics get swallowed until Wait())

Replies: 2 comments 4 replies

camdencheek
Feb 18, 2023
Maintainer

I agree that this is a category of use for goroutines that is not currently nicely handled by conc. I don't have a good vision of how we can solve this while maintaining the scoped-concurrency spirit of conc though.

The stream package currently suffers from this too -- it doesn't propagate a panic until the stream is complete. So using it for long-lived streams is usually inappropriate. For the sake of a shared starting point, I'm going to explore the design here in terms of the existing stream package.

I think the biggest design hurdle here is, in order to propagate a panic before calling s.Wait(), we need something running in the parent goroutine. Consider the following snippet:

func processStream(input <-chan int, output chan<- int) {
	s := stream.New().WithMaxGoroutines(8)
	for elem := range input {
		elem := elem
		s.Go(func() stream.Callback {
			res := fetch(elem)
			return func() { output <- res }
		})
	}
	s.Wait()
}

While blocked on a read from input, it is impossible to propagate a panic to the parent goroutine. We could probably improve on the situation slightly by making it possible to propagate a panic from s.Go(), but that will still only trigger when we have a new task to spawn.

Another important consideration is, without a cancellation mechanism, groups don't work well for long-running goroutines because we can't propagate a panic until all the spawned goroutines exit. So even if we have a way to propagate a panic, we don't have a way to tell the other goroutines to exit unless we're using something like ContextPool.

In general, I think it's an antipattern to "handle" a panic by logging it or whatever, so I don't want to make it super easy to do that by adding something like "WithPanicHandler" (this is still achievable by using recover in your spawned tasks if you really want it). However, I think it might be valuable to allow opting out of the panic propagation, falling back to the default "crash when a panic hits the top of a goroutine" behavior.

I'm not sure this actually brought us closer to a solution, but I'll keep mulling over it 🙂 I'd be curious to see design ideas here.

1 reply

@bobheadxi

bobheadxi Feb 24, 2023
Collaborator Author

Throwing a convoluted quick sketch out: maybe we can have something that handles incoming work as a queue, sends panics to a results channel, and monitors for cancellation and panics to interrupt work:

type Processor struct {
	ctx context.Context
	cancel func()
	wg sync.WaitGroup
	startDone chan struct{}
	work chan func(context.Context)
	panics chan *panics.RecoveredPanic
}
func (p *Processor) Submit(fn func(context.Context)) { p.work <- fn }
func (p *Processor) Start() {
	for {
		select {
		case <-p.ctx.Done():
			close(p.startDone)
			break
		case recovered := <-p.panics:
			p.cancel()
			p.wg.Wait()
			panic(recovered) // or send to Wait to throw?
		// TODO maybe have work handling and panic/cancel handling in separate goroutines
		case fn := <-p.work:
			// TODO something more robust like concurrency handling
			p.wg.Add(1)
			go func() {
				defer p.wg.Done()
				recovered := panics.Try(func() {
					fn(p.ctx)
				})
				if recovered != nil {
					p.panics <- recovered
				}
			}()
		}
	}
}
func (p *Processor) Cancel() {
	p.cancel()
}
func (p *Processor) Wait() {
	p.wg.Wait()
	<-p.startDone
}

iamnoah
Feb 20, 2023

long-lived goroutine pools

Hi, I'm new to conc so sorry if I'm jumping in inappropriately, but, is there really a case for long-lived pools? Thread pools are a thing in other languages due to the cost of threads, but a conc pool is not expensive to create (or put another way, if you are creating so many pools that it becomes a problem, your probably have other issues.)

The thing I like about conc is that it steers me away from patterns that are more likely to result in headaches than benefits.

3 replies

@camdencheek

camdencheek Feb 20, 2023
Maintainer

Hi @iamnoah! Good question. You're not wrong -- there are not many cases for long-lived pools. There is no real need to re-use pools over a long period of time since they are so cheap to create.

Maybe it would be more clear to phrase it as "pools with long-lived goroutines." Consider a situation like listening on two different ports. For each port, a goroutine is spawned to handle new connections. These two goroutines exist for the lifetime of the server. However, if either panics or errors, you might still want to return control to the spawning goroutine to do some cleanup tasks.

Right now, conc doesn't handle this situation well because a panic/error will not return control to the spawning goroutine until all goroutines have exited. If one listener fails and we catch its panic, the second listener will still happily continue listening for all eternity.

IMO, that's the crux of this discussion -- first, whether that's a situation conc can handle, and then whether it's a situation conc should handle. Because I agree: I do not want to introduce features in conc that make it easy to do bad things.

@iamnoah

iamnoah Feb 20, 2023

Right now, conc doesn't handle this situation well because a panic/error will not return control to the spawning goroutine until all goroutines have exited.

What is the downside of WithCancelOnError? When I have multiple long-lived goroutines, I need them all to do their own cleanup before doing any final cleanup.

@bobheadxi

bobheadxi Feb 24, 2023
Collaborator Author

Thread pools are a thing in other languages due to the cost of threads, but a conc pool is not expensive to create

There is no real need to re-use pools over a long period of time since they are so cheap to create.

I meant don't mean long-lived pools in the sense the goroutines themselves are the resource - one example is the one Camden raises, where e.g. one might have some long-lived processes, though like you mention WithCancelOnError can be used for this because if a process fails (or panics #92) the other task will be cancelled and control will be returned to the caller to handle.

However, another example is one where you want to control concurrency and cleanup over arbitrary tasks that might be submitted (as described in the original discussion post). For example, a processor that handles tasks being submitted continuously, such that it can:

only handle N at a time
handle tasks failing independently
signal that all tasks that are currently running should stop
etc

The original discussion post links a few requests regarding similar use cases that I think are along similar lines where whether conc/pool can/should be used is a bit dubious. #93 (comment) has another good example

long-lived pools/background jobs #93

Uh oh!

bobheadxi Feb 17, 2023 Collaborator

Replies: 2 comments · 4 replies

Uh oh!

camdencheek Feb 18, 2023 Maintainer

Uh oh!

Uh oh!

bobheadxi Feb 24, 2023 Collaborator Author

Uh oh!

iamnoah Feb 20, 2023

Uh oh!

camdencheek Feb 20, 2023 Maintainer

Uh oh!

iamnoah Feb 20, 2023

Uh oh!

Uh oh!

bobheadxi Feb 24, 2023 Collaborator Author

bobheadxi
Feb 17, 2023
Collaborator

Replies: 2 comments 4 replies

camdencheek
Feb 18, 2023
Maintainer

bobheadxi Feb 24, 2023
Collaborator Author

iamnoah
Feb 20, 2023

camdencheek Feb 20, 2023
Maintainer

bobheadxi Feb 24, 2023
Collaborator Author