We are seeing a lot of failed jobs due to "Failed to start a parallel pool".
A couple of things we could try:
Right now, this App uses tempname() to generate the temp path for JobStorageLocation. I believe it uses /tmp as the parent directory.
I think we should create the directory under the current working directory instead, in case the use of /tmp is somehow causing the issue.
% Use a separate profile directory per job so that concurrent jobs do not share one directory and crash
profile_dir='./profile';
mkdir(profile_dir);
c = parcluster();
c.JobStorageLocation = profile_dir;
pool = parpool(c, config.workers);
Right now, this App silently skips setting JobStorageLocation if mkdir(tmpdir) fails.
% check and set cachedir location
if OK
% set local storage for parpool
clust.JobStorageLocation = tmpdir;
end
I suggest removing this block and letting the App fail if it cannot create tmpdir (or at least adding a log message inside the block so we know when JobStorageLocation is being set).
I have seen a similar parpool startup failure / random MATLAB crash before. I've worked around it by simply rerunning the code a few times when it fails.
https://github.com/brain-life/app-dp-modelfit/blob/master/fit_model.sh#L39
It's ugly, but it's a very simple thing to try, and for the DP App it has cured the occasional hiccups.
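For reference, the rerun approach could look roughly like the sketch below. This is not a copy of the linked fit_model.sh; the `retry` helper, the `RETRY_DELAY` variable, and the command shown in the usage comment are all hypothetical names for illustration. It also assumes the MATLAB launch command exits with a non-zero code when the pool fails to start, which would need to be checked for this App.

```shell
#!/bin/bash
# Sketch only: rerun a flaky command a few times before giving up.
# `retry` and RETRY_DELAY are hypothetical names, not part of any App.
#
# Usage: retry MAX_TRIES COMMAND [ARGS...]
# Example (placeholder command, replace with the real launch line):
#   retry 3 ./run_app.sh

retry() {
    local max_tries="$1"
    local attempt=1
    local status=1
    shift
    while [ "$attempt" -le "$max_tries" ]; do
        "$@"
        status=$?
        if [ "$status" -eq 0 ]; then
            return 0
        fi
        echo "attempt $attempt of $max_tries failed (exit code $status)" >&2
        # Pause before the next attempt, but not after the last one.
        if [ "$attempt" -lt "$max_tries" ]; then
            sleep "${RETRY_DELAY:-5}"
        fi
        attempt=$((attempt + 1))
    done
    # Give up and report the exit code of the last attempt.
    return "$status"
}
```

Because the helper only looks at the exit code, it also covers random crashes of the MATLAB process itself, not just parpool startup errors. The downside is that a genuine, repeatable failure takes several attempts to surface.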