Note: the problem described here has been fixed in OpenFOAM for a long time.
This patch was motivated by a problem with parallel Foam runs on a Rocks cluster: when I modified controlDict during a run (it was reread at every timestep), I sometimes got one of these two kinds of behaviour:
- I modified deltaT: the simulation progressed and then crashed after some timesteps because the timesteps had become inconsistent across the processors.
- I modified the time at which results are written out: some processors wrote their results at different times than the others, and the reconstructPar utility couldn't rebuild these results.
This only happened when the calculation was spread over different physical nodes. The problem seemed to be (and current postings on the Rocks mailing list support this) that changes made on an NFS share took a varying amount of time to reach the different nodes, sometimes as much as half a minute. So some nodes were reading the modified version of controlDict while others were still reading the unmodified version (they saw the new version one or more timesteps later than the others).
This patch tries to fix this problem by making sure that the same version of a file is read by all processors.
1.1 Example utility to reproduce the problem
The rereadControlDict program tries to demonstrate the behaviour for a patched and an unpatched version of OpenFOAM without the need for a 'faulty' NFS. A small test case helps with this: it provides the minimum needed to run the utility.
After compiling the utility, run it on the test case:
mpirun -np 8 rereadControlDict . controlDictTest -parallel
From another terminal, modify controlDictTest/system/controlDict. Each processor waits for a different time (2 to 12 seconds) at each timestep. If deltaT is modified after some processors have finished waiting but before the others have, the runTime on the processors becomes inconsistent and the program terminates with an error message (this program detects the inconsistency, normal OpenFOAM programs do not).
Once the patch has been applied to an OpenFOAM installation, it should be impossible (or at least very hard: you would have to modify controlDict twice within one second) to crash that program by varying deltaT.
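The race described above can be illustrated with a small serial simulation. This is only a sketch under stated assumptions, not the code of rereadControlDict: the names deltaTSeenAt and stepAndCheck are hypothetical, the processors are modelled as entries of a vector instead of MPI ranks, and each wait is taken as the moment (in seconds after the start of the timestep) at which that processor rereads controlDict.

```cpp
#include <cstddef>
#include <vector>

// deltaT as seen by a processor that rereads controlDict at readTime,
// given that the file was edited at editTime (both in seconds).
double deltaTSeenAt(double readTime, double editTime,
                    double oldDeltaT, double newDeltaT)
{
    return readTime >= editTime ? newDeltaT : oldDeltaT;
}

// Advance every simulated processor by one timestep and report whether
// all of them still agree on the simulation time afterwards.
bool stepAndCheck(std::vector<double>& runTimes,
                  const std::vector<double>& waits,
                  double editTime, double oldDeltaT, double newDeltaT)
{
    for (std::size_t p = 0; p < runTimes.size(); ++p)
    {
        runTimes[p] += deltaTSeenAt(waits[p], editTime, oldDeltaT, newDeltaT);
    }

    for (std::size_t p = 1; p < runTimes.size(); ++p)
    {
        if (runTimes[p] != runTimes[0])
        {
            return false; // the processors have drifted apart
        }
    }
    return true;
}
```

With waits of 2, 7 and 12 seconds and an edit after 5 seconds, the first processor still advances with the old deltaT while the other two already use the new one, so stepAndCheck reports an inconsistency.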
2 Application of the patch
The patch can be applied to a vanilla OpenFOAM 1.3 source by doing this:
cd $FOAM_SRC/OpenFOAM
cat regIOobjectRead.C.patch | patch -p2 -b
wmake libso
(The -b option makes patch keep a backup of each original file. It is not necessary if you trust the provider of the patch, but I use it.)
This patch should not alter the usage of OpenFOAM (except for fixing the problem described above).
The patch works this way:
- Before the dictionary is read, the OS timestamp of the file is read.
- The timestamps are compared across the processors. If they are equal, the contents of the file are assumed to be in sync and execution progresses as normal (the dictionary is read).
- Execution also progresses as normal if the data is in one of the processor-specific directories processorX.
- If the timestamps differ, all processors wait for some seconds, then the timestamps are reread.
- After a fixed number of retries the program is terminated (assuming that the file will never get into sync again).
The combination of waiting time and number of retries works for my installation (where the timestamps seem to drift apart by at most one minute).
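The steps above can be sketched as follows. This is a minimal illustration, not the code of the patch: timestampsInSync, waitForSync and gatherStamps are hypothetical names, and the exchange of timestamps between processors (done via MPI in a real parallel run) is replaced by a callback that returns one timestamp per processor. The exemption for the processorX directories is left out.

```cpp
#include <algorithm>
#include <chrono>
#include <ctime>
#include <functional>
#include <stdexcept>
#include <thread>
#include <vector>

// One entry per processor: the modification time that processor sees.
using Timestamps = std::vector<std::time_t>;

// True if every processor reports the same modification time.
bool timestampsInSync(const Timestamps& stamps)
{
    return std::adjacent_find(stamps.begin(), stamps.end(),
                              std::not_equal_to<std::time_t>())
        == stamps.end();
}

// Poll until all processors agree, sleeping between attempts.
// gatherStamps is a hypothetical stand-in for "stat the file on every
// processor and exchange the results".
// Returns the number of retries that were needed and throws once
// maxRetries is exhausted.
int waitForSync(const std::function<Timestamps()>& gatherStamps,
                int maxRetries,
                std::chrono::milliseconds waitTime)
{
    for (int retry = 0; retry <= maxRetries; ++retry)
    {
        if (timestampsInSync(gatherStamps()))
        {
            return retry; // safe to read the dictionary now
        }
        if (retry < maxRetries)
        {
            std::this_thread::sleep_for(waitTime);
        }
    }
    throw std::runtime_error("file timestamps never got into sync");
}
```

The waiting time and the number of retries are parameters here because suitable values depend on how far the timestamps drift apart on a given installation.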
4 Technical discussion
Some additional comments on the patch:
- This patch might cause an issue when a dictionary is only read on some processors (I don't know an example of that right now): the processors would hang when comparing timestamps, because some processors will never reach the corresponding MPI barrier.
- Another possible solution would be to give dictionary a method that calculates a checksum of its contents, and to compare these checksums when rereading during parallel runs. But:
- that would involve modifications to core-classes
- still would not solve the problem of processor-specific data
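The checksum alternative could look roughly like this. It is a sketch under assumptions, not part of the patch or of OpenFOAM: it works on the raw text of the file instead of a method of dictionary, the names fnv1a and contentsInSync are hypothetical, FNV-1a is just one possible choice of checksum, and the per-processor contents are passed in as a vector instead of being exchanged between processors.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// 64-bit FNV-1a hash of the file contents.
std::uint64_t fnv1a(const std::string& contents)
{
    std::uint64_t hash = 0xcbf29ce484222325ULL; // FNV offset basis
    for (unsigned char c : contents)
    {
        hash ^= c;
        hash *= 0x100000001b3ULL; // FNV prime
    }
    return hash;
}

// True if every processor read contents with the same checksum.
bool contentsInSync(const std::vector<std::string>& perProcessorContents)
{
    if (perProcessorContents.empty())
    {
        return true;
    }

    const std::uint64_t reference = fnv1a(perProcessorContents.front());
    for (const std::string& contents : perProcessorContents)
    {
        if (fnv1a(contents) != reference)
        {
            return false;
        }
    }
    return true;
}
```

Comparing checksums instead of timestamps would tie the check to what was actually read, but as noted above it does not help with processor-specific data, which legitimately differs between processors.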
- 2006, Nov. 28: Initial upload
- 2006, Nov. 28, later in the day: The patch was presented on the Message Board. Henry Weller replied that the problem is already fixed and that the fix will be part of the next release.