Separate workflow metadata from process work dirs in workDir

acknowledged

Additional Barnacle

To easily automate cleanup of (large) process work directories without losing important tower metadata files, like the nf-.. files used for Reports, it would be good to be able to store them in a different path (e.g. a different prefix on S3). Right now both are stored directly in workDir, making it impossible to use simple AWS lifecycle rules to clean up process work folders without also deleting the nf-* files and breaking the Report links from the tower front-end.
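Until the two kinds of files live under different prefixes, a cleanup script has to tell them apart by name. A minimal sketch of that split, assuming the metadata files can be recognized by a basename matching nf-* (as described above); the key names in the usage example are made up for illustration and the function name is hypothetical:

```python
from fnmatch import fnmatchcase
from posixpath import basename


def split_workdir_keys(keys, metadata_pattern="nf-*"):
    """Partition object keys into (keep, delete) lists.

    A key is kept when its basename matches metadata_pattern.
    The default pattern is an assumption based on the nf-* naming
    mentioned in this thread; adjust it to your own layout.
    """
    keep, delete = [], []
    for key in keys:
        if fnmatchcase(basename(key), metadata_pattern):
            keep.append(key)
        else:
            delete.append(key)
    return keep, delete


# Hypothetical listing of a work directory prefix.
example_keys = [
    "scratch/nf-abc123-reports.tsv",
    "scratch/1a/2b3c4d/.command.sh",
    "scratch/1a/2b3c4d/output.bam",
]
keep, delete = split_workdir_keys(example_keys)
```

This only decides which keys to delete; it does not replace a lifecycle rule, because plain prefix-based rules cannot express "everything under the prefix except nf-*".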

April 8, 2024

Rob Newman

marked this post as acknowledged

Rob Newman

Merged in a post:

Rendering of reports not being dependent on a .tsv file

Limited Mongoose

The rendering of Pipeline Run reports in the Seqera Platform is dependent on the file zz://bucket/scratch/<workflow_id>/nf-<workflow_id>-reports.tsv. We've found this can create a headache when trying to retain this file but clear up the rest of the scratch directory to save money.

It would be handy if this file could be written to the output dir/ data_path, and Seqera Platform could use it to render reports from there. It seems unintuitive that any files that might need to be kept would be written to the scratch directory, which I feel would be considered impermanent.

April 17, 2024

Ben Sherman

The way I would like to deal with this is to move the helper files (or at least the relevant information within) into the task metadata cache (i.e. the .nextflow folder) when a run completes. Then it should be safe to delete the task directories, and whatever information the Platform needs, it should be able to query the task metadata cache for it.

Additional Barnacle

Ben Sherman I don't follow exactly. I was thinking for example of the nf-*-report.tsv files that control the links on tower under the 'Reports' tab to result files in "publishDir". It was my understanding that tower looks for those at a fixed path (top-level of the compute environment workDir). 
Are you referring to something different or am I misunderstanding?
Thanks
Felix

Ben Sherman

Additional Barnacle I see what you mean, you're talking about the report helper files. I guess I see it all as part of the same problem. The platform relies on these files in the run detail / task detail views, but the work directory is supposed to be temporary. The only difference is that the report helper files are not used by Nextflow, so instead of the Nextflow cache they should be saved into some database by the platform.

Drew DiPalma

Merged in a post:

nf-*reports.tsv should be written in a published location so that the entire workdir can be deleted and the reports retained on the platform.

Saffron Porcupine

April 9, 2024

Net Ox

There might be a simple fix for this: write the *report.tsv files to the publish dir. However, this still leaves the log files that need to be kept. Having a way to keep all files required for correct rendering of the Seqera Platform UI would be crucial for reducing storage costs. I know that with Fusion you can use tags to differentially delete non-metadata files, but I notice that the *report.tsv files are not tagged.
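For context, a tag-based cleanup rule could be sketched like this. The dictionary follows the shape that the S3 lifecycle configuration API accepts (for example via boto3's put_bucket_lifecycle_configuration); the prefix, tag key, tag value, and rule ID are placeholders, not the tags Fusion actually applies, and the helper name is hypothetical:

```python
def build_tagged_expiration_rule(prefix, tag_key, tag_value, days):
    """Build an S3 lifecycle configuration that expires only objects
    under `prefix` carrying the given tag.

    All argument values are supplied by the caller; nothing here is
    taken from Fusion or the Seqera Platform.
    """
    return {
        "Rules": [
            {
                "ID": "expire-tagged-work-files",  # placeholder rule name
                "Status": "Enabled",
                # "And" combines the prefix and tag conditions.
                "Filter": {
                    "And": {
                        "Prefix": prefix,
                        "Tags": [{"Key": tag_key, "Value": tag_value}],
                    }
                },
                "Expiration": {"Days": days},
            }
        ]
    }


# Placeholder values for illustration only.
config = build_tagged_expiration_rule("scratch/", "temporary", "true", 7)
```

A rule like this leaves untagged objects alone, so untagged *report.tsv files would survive it, but they also could not be targeted or protected by any other tag-based rule.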

Brass Wildcat

Merged in a post:

The "Reports" tab in Seqera platform stops working after workdir cleanup

Medical Grouse

Our work directories (scratch space) can often be many TB in size for each Nextflow run, so we clean them up aggressively. Once they are cleaned up, the Reports tab on the Tower page stops working, even if the published data is still present.

February 22, 2024

Brass Wildcat

Hi! I believe the issue is that the Seqera platform fetches a *reports.tsv file from the workDir to get the list of reports every time you load the Run Details page.
Cleaning up the workDir deletes that file, so the platform loses the context and cannot find any reports.
I am merging this request with the existing one because they seem to describe the same problem, and this way there will be a single place for follow-up.

Yellow sunshine Firefly

To add to this, if the report file is "Published" to another S3 bucket, the report link should point to the published location.

Indigo Wildcat

This would be great and make the reports functionality more useful