-
Notifications
You must be signed in to change notification settings - Fork 739
Walk file tree on directory publish #3933
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
68c35f1
to
36b9e22
Compare
38c9931
to
295bc1f
Compare
e377812
to
fcdeec0
Compare
81f7cb7
to
8a43489
Compare
fd99141
to
19d2ccb
Compare
4e27468
to
dfd7d09
Compare
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
✅ Deploy Preview for nextflow-docs-staging canceled.
|
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
This PR is ready for review. The publish dir now publishes each file in a directory instead of just the directory, which should improve detection of Reports in Seqera Platform and fix some issues with Fusion symlinks (#4725).
It does not yet solve #3372 , for that we also need to verify the checksum of published files when deciding whether a file needs to be re-published. I can implement that here or in a separate PR, up to you @pditommaso .
Looks good, thanks Ben!
In the case where publish method is symlinks and we are emitting/publishing a directory, what does the published output look like?
Is it a symlinked directory full of symlinks, or a regular directory full of symlinks, or something else entirely?
It will be a regular directory of symlinks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure we should do this. On some storage traversing a directory path is extremely expensive e.g. Lustre, NFS, etc.
I share your concern, although copying a directory also involves walking the directory tree at some level, so may it's not an issue. Either way, there are other reasons we should do this.
In general, I think it is a bad practice to define a directory output. When I see a process with a directory output, I know nothing about the file structure within that directory.
So I think we should either not support directory outputs or support them correctly. Currently we just support it incorrectly:
- Tower reports in a directory output are not detected
- Fusion symlinks in a directory output are not resolved
- If we implemented Overwrite published outputs only if they are stale #4729 without this PR, the validation of published outputs would not work for directory outputs
The performance implications you mention only happen if the user decides to use directory outputs.. I think we could simply document this point and encourage users to use flat files as much as possible.
Consider also that we already walk the work directory to collect the output files:
nextflow/modules/nextflow/src/main/groovy/nextflow/processor/TaskProcessor.groovy
Line 1654 in 279c410
which has similar performance issues when there are nested directories
Indeed, when we publish a directory we are already traversing it:
This PR is just doing the traversal earlier
@robsyme what do you think about what I said about publishing a symlink directory? It is certainly a change in behavior, but I'm wondering if it's more desirable to symlink all of the files rather than only symlink the directory.
If not, I can modify the logic to only traverse the directory when the publish mode isn't symlink, but I feel like that would make the behavior inconsistent based on the publish mode.
Thanks for moving this forward Ben, and apologies for my slow response times! My feeling is that we're stuck between a rock and a hard place, but inconsistencies between publishDir modes is the worse outcome.
As for
bad practice to define a directory output
I think it's inevitable that people will want to publish results that contain directories (many tools have complex/nested outputs).
Indeed, even though a directory output is harder to reason about, it seems to be convenient for the first iteration of a Nextflow process for certain tools
I think this is solved by #5005
The fusion symlinks are no longer an issue, but there are other reasons to merge this:
-
when the workflow output definition uses the
deep
overwrite mode, which compares checksums to decide whether to overwrite a file, comparing each individual file instead of the entire directory would allow for finer-grained caching -
walking the directory emits a publish event for each individual file, which makes them easier to capture for reports
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
R3myG
commented
Nov 5, 2024
@bentsherman What's the status of this fix? Encountering is systematically with the https://github.com/nf-core/spatialvi pipeline. Alternatively, any known workaround?
The PR works but I'm not sure if we still want to merge it or not. We might take a different approach.
I'm not sure why your issue is happening because you don't need to create directories beforehand in S3. It could be a different bug in Nextflow. Feel free to submit an issue here if you aren't able to solve it.
A workaround to publishing a directory is to have a tuple of paths for each individual file in the directory. But this gets very impractical if there are many files
1f834a0
to
6454605
Compare
5a93547
to
27345a6
Compare
f6a3696
to
49b58d2
Compare
b4b321e
to
069653d
Compare
b7b4221
to
c1114bc
Compare
Close #3372
I just changed the publish dir to explicitly publish each file in a directory. I think it means that if the publish mode is symlink, each file in the directory will be symlinked instead of just the directory.
@robsyme do we just need to add a checksum comparison to S3 copies and uploads? I think I can just rip the code from this PR #3802