-
Notifications
You must be signed in to change notification settings - Fork 738
Add staging events to the TraceObserverV2 #6443
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
✅ Deploy Preview for nextflow-docs-staging canceled.
|
788f51a
to
a280d0e
Compare
Signed-off-by: Robrecht Cannoodt <rcannood@gmail.com>
a280d0e
to
b6e17d2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems going into the right direction. My only connect is about the notification events, here
Likely those events should be emitter *async* in this context for two reasons:
- Make sure the target path is accessible when the event is received
- Avoid that core runtime performance could be impacted by compute/data intensive that could be implemented by third-parties plugins
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still prefer to enable this through the task events rather than adding a new event. I just haven't had time to prepare a new PR. I will try to have something ready this week or next week. I would at least like to try that before we add a whole new event
Trying my idea again #6460
b7b4221
to
c1114bc
Compare
Likely those events should be emitter async in this context for two reasons:
Thanks for the review, and I totally agree with you.
However, this PR mimics the event handling system that is in place for other events (e.g. notifyFilePublish
). If we would like to make the notification system asynchronous, this is something notifyEvent
should probably take care of, right?
Edit: I created #6471
I still prefer to enable this through the task events rather than adding a new event. I just haven't had time to prepare a new PR. I will try to have something ready this week or next week. I would at least like to try that before we add a whole new event.
Trying my idea again #6460
Alright, thank you for taking the time to take a look at this! Despite this new PR, would you agree that it might still make sense to merge this PR as well, since onFileStaged
nicely mirrors onFilePublish
?
I'm still considering the merits of having an event for file staging. I don't really see it as a mirror of file publishing. File staging is just an implementation detail, whereas file publishing is "essential" to understand the outputs of a run and their provenance.
Thanks for taking this PR into consideration!
From my perspective, the file staging is also essential to understand the inputs of a run and their provenance 😅 This PR is the only thing that is currently blocking me from implementing decent data provenance for Nextflow workflows with LaminDB using nf-lamin. For now, all of the Nextflow runs result in Lamin Runs being created with only outputs and no inputs, which goes a bit against the whole LaminDB philosophy.
Based on the proposed implementation, would you expect any significant slowdowns or breaking changes with existing functionality?
From my perspective, the file staging is also essential to understand the inputs of a run and their provenance
Of course, but file staging is tied to the task lifecycle in a way that file publishing is not. Output files can be published separately from the originating task. But remote input files are only staged in the context of a task getting ready to run, so there isn't as much of a need for a separate "file staged" event.
Based on the proposed implementation, would you expect any significant slowdowns or breaking changes with existing functionality?
I suppose there is a small cost to adding new workflow events, but it should be pretty minimal. I'm more concerned about adding an event for something that is more of an implementation detail
For example, you wouldn't really be able to use the "file staged" event on its own. You would always need to use it in concert with the task events, because not all input files need to be staged. That suggests that it shouldn't be a standalone event at all
This PR implements file staging event notifications for
TraceObserverV2
, allowing plugins to track when remote files are staged to the local work directory. This resolves #5905 by providing observers access to both the original remote file paths and their staged local paths.Background
Issue #5905 identified a need for provenance tracking plugins (specifically nf-lamin) to access the original remote paths of input files, not just their staged local paths. Two previous attempts were made to solve this:
onFileStage
toTraceObserver
- rejected in favor of Include remote path in FileHolder #5911FileHolder
to include remote paths - merged but then reverted due to concerns about cache invalidation and risk to core functionalityI met up with @bentsherman at the beginning of September. He said he wanted to have another shot at resolving this issue, and that I could reach out in a few weeks time if no progress was made.
Solution
This implementation is a simpler solution in comparison to the two previous PRs. I added this to the TraceObserverV2:
onFileStaged(FileStagingEvent event)
- Called after a file has been successfully staged from a remote locationThe
FileStagingEvent
provides:source
: The original remote path (e.g.,s3://bucket/file.txt
,https://example.com/data.csv
)target
: The staged local path in the work directory (e.g.,work/.../file.txt
)Usage Example
For plugins that need to track file provenance (like nf-lamin):
Related Issues