Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Add staging events to the TraceObserverV2 #6443

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
rcannood wants to merge 1 commit into nextflow-io:master
base: master
Choose a base branch
Loading
from rcannood:add-staging-events

Conversation

Copy link
Contributor

@rcannood rcannood commented Oct 2, 2025

This PR implements file staging event notifications for TraceObserverV2, allowing plugins to track when remote files are staged to the local work directory. This resolves #5905 by providing observers access to both the original remote file paths and their staged local paths.

Background

Issue #5905 identified a need for provenance tracking plugins (specifically nf-lamin) to access the original remote paths of input files, not just their staged local paths. Two previous attempts were made to solve this:

I met up with @bentsherman at the beginning of September. He said he wanted to have another shot at resolving this issue, and that I could reach out in a few weeks time if no progress was made.

Solution

This implementation is a simpler solution in comparison to the two previous PRs. I added this to the TraceObserverV2:

  • onFileStaged(FileStagingEvent event) - Called after a file has been successfully staged from a remote location

The FileStagingEvent provides:

  • source: The original remote path (e.g., s3://bucket/file.txt, https://example.com/data.csv)
  • target: The staged local path in the work directory (e.g., work/.../file.txt)

Usage Example

For plugins that need to track file provenance (like nf-lamin):

class ProvenanceObserver implements TraceObserverV2 {
 
 @Override
 void onFileStaged(FileStagingEvent event) {
 // event.source = original remote path
 // event.target = staged local path
 log.info "File staged: ${event.source} -> ${event.target}"
 registerFileProvenance(event.source, event.target)
 }
}

Related Issues

Zethson reacted with thumbs up emoji
Copy link

netlify bot commented Oct 2, 2025
edited
Loading

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit b6e17d2
🔍 Latest deploy log https://app.netlify.com/projects/nextflow-docs-staging/deploys/68de838084522d0007eca63e

@rcannood rcannood force-pushed the add-staging-events branch 2 times, most recently from 788f51a to a280d0e Compare October 2, 2025 13:47
Signed-off-by: Robrecht Cannoodt <rcannood@gmail.com>
Copy link
Member

@pditommaso pditommaso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems going into the right direction. My only connect is about the notification events, here

https://github.com/nextflow-io/nextflow/pull/6443/files#diff-cb4252ec3616353913180e8e6eb95c249b4d591daf4143283e543be188568685R105-R111

Likely those events should be emitter *async* in this context for two reasons:

  1. Make sure the target path is accessible when the event is received
  2. Avoid that core runtime performance could be impacted by compute/data intensive that could be implemented by third-parties plugins

Copy link
Member

@bentsherman bentsherman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still prefer to enable this through the task events rather than adding a new event. I just haven't had time to prepare a new PR. I will try to have something ready this week or next week. I would at least like to try that before we add a whole new event

Copy link
Member

Trying my idea again #6460

rcannood reacted with thumbs up emoji

Copy link
Contributor Author

rcannood commented Oct 11, 2025
edited
Loading

@pditommaso:

Likely those events should be emitter async in this context for two reasons:

Thanks for the review, and I totally agree with you.

However, this PR mimics the event handling system that is in place for other events (e.g. notifyFilePublish). If we would like to make the notification system asynchronous, this is something notifyEvent should probably take care of, right?

Edit: I created #6471


@bentsherman:

I still prefer to enable this through the task events rather than adding a new event. I just haven't had time to prepare a new PR. I will try to have something ready this week or next week. I would at least like to try that before we add a whole new event.

Trying my idea again #6460

Alright, thank you for taking the time to take a look at this! Despite this new PR, would you agree that it might still make sense to merge this PR as well, since onFileStaged nicely mirrors onFilePublish?

Copy link
Member

I'm still considering the merits of having an event for file staging. I don't really see it as a mirror of file publishing. File staging is just an implementation detail, whereas file publishing is "essential" to understand the outputs of a run and their provenance.

Copy link
Contributor Author

Thanks for taking this PR into consideration!

From my perspective, the file staging is also essential to understand the inputs of a run and their provenance 😅 This PR is the only thing that is currently blocking me from implementing decent data provenance for Nextflow workflows with LaminDB using nf-lamin. For now, all of the Nextflow runs result in Lamin Runs being created with only outputs and no inputs, which goes a bit against the whole LaminDB philosophy.

Based on the proposed implementation, would you expect any significant slowdowns or breaking changes with existing functionality?

Copy link
Member

From my perspective, the file staging is also essential to understand the inputs of a run and their provenance

Of course, but file staging is tied to the task lifecycle in a way that file publishing is not. Output files can be published separately from the originating task. But remote input files are only staged in the context of a task getting ready to run, so there isn't as much of a need for a separate "file staged" event.

Based on the proposed implementation, would you expect any significant slowdowns or breaking changes with existing functionality?

I suppose there is a small cost to adding new workflow events, but it should be pretty minimal. I'm more concerned about adding an event for something that is more of an implementation detail

For example, you wouldn't really be able to use the "file staged" event on its own. You would always need to use it in concert with the task events, because not all input files need to be staged. That suggests that it shouldn't be a standalone event at all

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Reviewers

@pditommaso pditommaso Awaiting requested review from pditommaso

@bentsherman bentsherman Awaiting requested review from bentsherman

Requested changes must be addressed to merge this pull request.

Assignees

No one assigned

Labels

None yet

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

Detect when files are being staged

AltStyle によって変換されたページ (->オリジナル) /