Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Walk file tree on directory publish #3933

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
bentsherman wants to merge 5 commits into master
base: master
Choose a base branch
Loading
from 3372-publish-directory-files

Conversation

Copy link
Member

@bentsherman bentsherman commented May 10, 2023

Close #3372

I just changed the publish dir to explicitly publish each file in a directory. I think it means that if the publish mode is symlink, each file in the directory will be symlinked instead of just the directory.

@robsyme do we just need to add a checksum comparison to S3 copies and uploads? I think I can just rip the code from this PR #3802

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@bentsherman bentsherman linked an issue May 10, 2023 that may be closed by this pull request
@pditommaso pditommaso force-pushed the master branch 2 times, most recently from 81f7cb7 to 8a43489 Compare August 20, 2023 20:13
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Copy link

netlify bot commented Feb 8, 2024
edited
Loading

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit ad3c4cb
🔍 Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/664bbe302c86c2000895f798

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@bentsherman bentsherman marked this pull request as ready for review February 8, 2024 23:12
Copy link
Member Author

This PR is ready for review. The publish dir now publishes each file in a directory instead of just the directory, which should improve detection of Reports in Seqera Platform and fix some issues with Fusion symlinks (#4725).

It does not yet solve #3372 , for that we also need to verify the checksum of published files when deciding whether a file needs to be re-published. I can implement that here or in a separate PR, up to you @pditommaso .

Copy link
Collaborator

robsyme commented Feb 9, 2024

Looks good, thanks Ben!

In the case where publish method is symlinks and we are emitting/publishing a directory, what does the published output look like?

Is it a symlinked directory full of symlinks, or a regular directory full of symlinks, or something else entirely?

Copy link
Member Author

It will be a regular directory of symlinks

Copy link
Member

@pditommaso pditommaso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure we should do this. On some storage traversing a directory path is extremely expensive e.g. Lustre, NFS, etc.

Copy link
Member Author

I share your concern, although copying a directory also involves walking the directory tree at some level, so may it's not an issue. Either way, there are other reasons we should do this.

In general, I think it is a bad practice to define a directory output. When I see a process with a directory output, I know nothing about the file structure within that directory.

So I think we should either not support directory outputs or support them correctly. Currently we just support it incorrectly:

The performance implications you mention only happen if the user decides to use directory outputs.. I think we could simply document this point and encourage users to use flat files as much as possible.

Consider also that we already walk the work directory to collect the output files:

FileHelper.visitFiles(opts, workDir, namePattern) { Path it -> files.add(it) }

which has similar performance issues when there are nested directories

Copy link
Member Author

Indeed, when we publish a directory we are already traversing it:

CopyMoveHelper.copyDirectory(source, target, options)

This PR is just doing the traversal earlier

Copy link
Member Author

@robsyme what do you think about what I said about publishing a symlink directory? It is certainly a change in behavior, but I'm wondering if it's more desirable to symlink all of the files rather than only symlink the directory.

If not, I can modify the logic to only traverse the directory when the publish mode isn't symlink, but I feel like that would make the behavior inconsistent based on the publish mode.

Copy link
Collaborator

robsyme commented Feb 19, 2024

Thanks for moving this forward Ben, and apologies for my slow response times! My feeling is that we're stuck between a rock and a hard place, but inconsistencies between publishDir modes is the worse outcome.

As for

bad practice to define a directory output

I think it's inevitable that people will want to publish results that contain directories (many tools have complex/nested outputs).

Copy link
Member Author

Indeed, even though a directory output is harder to reason about, it seems to be convenient for the first iteration of a Nextflow process for certain tools

robsyme reacted with heart emoji

Copy link
Member

I think this is solved by #5005

Copy link
Member Author

The fusion symlinks are no longer an issue, but there are other reasons to merge this:

  • when the workflow output definition uses the deep overwrite mode, which compares checksums to decide whether to overwrite a file, comparing each individual file instead of the entire directory would allow for finer-grained caching

  • walking the directory emits a publish event for each individual file, which makes them easier to capture for reports

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Copy link

R3myG commented Nov 5, 2024

@bentsherman What's the status of this fix? Encountering is systematically with the https://github.com/nf-core/spatialvi pipeline. Alternatively, any known workaround?

Copy link
Member Author

The PR works but I'm not sure if we still want to merge it or not. We might take a different approach.

I'm not sure why your issue is happening because you don't need to create directories beforehand in S3. It could be a different bug in Nextflow. Feel free to submit an issue here if you aren't able to solve it.

A workaround to publishing a directory is to have a tuple of paths for each individual file in the directory. But this gets very impractical if there are many files

R3myG reacted with thumbs up emoji

@pditommaso pditommaso force-pushed the master branch 2 times, most recently from 5a93547 to 27345a6 Compare February 10, 2025 21:46
@pditommaso pditommaso force-pushed the master branch 3 times, most recently from b4b321e to 069653d Compare June 4, 2025 18:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Reviewers

@pditommaso pditommaso pditommaso left review comments

@robsyme robsyme Awaiting requested review from robsyme

At least 1 approving review is required to merge this pull request.

Assignees

No one assigned

Labels

None yet

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

PublishDir step fails on push to S3 even though pipeline succeeds Validation of published outputs Safe guard when using publishDir directive

AltStyle によって変換されたページ (->オリジナル) /