"Dead man's" switch for in-flight compute jobs when the head node dies
planned
Rob Newman
planned
Flamingo pink Python
Rob Newman: hey, I'm just curious what the planned mechanism for implementing this might be. I think that with SLURM and other HPC schedulers, Nextflow can catch SIGTERM or other process signals and perform a graceful shutdown of the child compute jobs. It's not clear to me whether the same is possible in AWS Batch?
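For readers unfamiliar with the signal-based approach mentioned above, here is a minimal Python sketch of the general idea. It is not Nextflow's actual implementation: `ChildJobTracker` and the `cancel_job` callback are hypothetical names, and the callback stands in for whatever the scheduler offers (for example a wrapper around `scancel <job_id>` on SLURM).

```python
import signal
from typing import Callable, List, Set


class ChildJobTracker:
    """Tracks submitted child job IDs and cancels them on shutdown (illustrative only)."""

    def __init__(self, cancel_job: Callable[[str], None]) -> None:
        # cancel_job is a hypothetical callback supplied by the caller,
        # e.g. something that runs `scancel <job_id>` on a SLURM cluster.
        self._cancel_job = cancel_job
        self._active: Set[str] = set()

    def submitted(self, job_id: str) -> None:
        self._active.add(job_id)

    def finished(self, job_id: str) -> None:
        self._active.discard(job_id)

    def shutdown(self, signum=None, frame=None) -> List[str]:
        """Cancel every job still in flight; usable directly as a signal handler."""
        cancelled = []
        for job_id in sorted(self._active):
            self._cancel_job(job_id)
            cancelled.append(job_id)
        self._active.clear()
        return cancelled

    def install(self) -> None:
        # Only catchable signals reach this handler. SIGKILL or the loss of
        # the host itself cannot be intercepted by the process.
        signal.signal(signal.SIGTERM, self.shutdown)
        signal.signal(signal.SIGINT, self.shutdown)
```

The limitation is visible in `install`: the cleanup only runs if the head process is still alive to receive the signal. If the head node is killed outright or disappears, nothing in that process can cancel the children, which is the gap this request is about.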
Rob Newman
Flamingo pink Python: We're looking at a similar approach: Nextflow would share the specific task information with the Seqera Platform, so that the Platform itself can manage the child jobs/tasks directly. This also allows us to retrieve debug information from cloud-based tools (e.g. CloudTrail or others), helping all parties troubleshoot more effectively.
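To make the idea concrete, here is a rough Python sketch of how a platform-side "dead man's switch" could work once the task information is available outside the head node. This is an assumption about one possible design, not the planned Seqera implementation: the heartbeat mechanism, the `DeadMansSwitch` class, the timeout value, and the `terminate_job` callback are all hypothetical. On AWS Batch the callback could wrap boto3's `terminate_job(jobId=..., reason=...)`.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Set


@dataclass
class RunRecord:
    last_heartbeat: float
    task_job_ids: Set[str] = field(default_factory=set)


class DeadMansSwitch:
    """Terminates child jobs of runs whose head node stopped reporting (illustrative only)."""

    def __init__(
        self,
        terminate_job: Callable[[str], None],
        timeout_s: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        # terminate_job is a hypothetical callback; the timeout is an assumed value.
        self._terminate_job = terminate_job
        self._timeout_s = timeout_s
        self._clock = clock
        self._runs: Dict[str, RunRecord] = {}

    def heartbeat(self, run_id: str, job_ids: Iterable[str] = ()) -> None:
        """Called whenever the head node reports in, with any newly submitted job IDs."""
        now = self._clock()
        record = self._runs.setdefault(run_id, RunRecord(last_heartbeat=now))
        record.last_heartbeat = now
        record.task_job_ids.update(job_ids)

    def complete(self, run_id: str) -> None:
        """Called on normal completion so the run is no longer watched."""
        self._runs.pop(run_id, None)

    def sweep(self) -> List[str]:
        """Terminate the child jobs of every run whose heartbeat is stale."""
        now = self._clock()
        stale = [
            run_id
            for run_id, record in self._runs.items()
            if now - record.last_heartbeat > self._timeout_s
        ]
        terminated = []
        for run_id in stale:
            record = self._runs.pop(run_id)
            for job_id in sorted(record.task_job_ids):
                self._terminate_job(job_id)
                terminated.append(job_id)
        return terminated
```

The key design point is that the job IDs live outside the head node, so cleanup no longer depends on the head process surviving long enough to handle a signal. `sweep` would be called periodically by whatever service holds the registry.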
Rob Newman
acknowledged