TIFR - PORTABLE BATCHING SYSTEM
Under normal circumstances, when a system is shut down, all running PBS jobs are checkpointed. All necessary information about these jobs is saved so that they may be continued (recovered) when the system is restarted. Unfortunately, since this does not always happen (as with a system crash), users should checkpoint longer jobs. Operations that PBS will perform automatically and information on how to increase the probability of recovering a job are discussed below.
When PBS is shut down in an orderly manner (that is, any shutdown other than a system crash) it automatically checkpoints every running job, except those that were submitted with the qsub -cn option (indicating that checkpoint/recovery is not to occur). Checkpoint files are created that contain a recoverable image of a job, which includes all necessary state information to resume the processes associated with the job.
If the system checkpoint space is exceeded, any jobs not yet checkpointed are bypassed and cannot be recovered when the system restarts. These jobs must be rerun from the beginning.
When PBS restarts, all checkpoint files are processed.
If a job is deemed recoverable, the system recovers it and continues running the job from the point at which it stopped. A job is considered recoverable if:
If a job is not recoverable, it is restarted; the job will begin again at the first line of the script.
Although PBS automatically checkpoints and recovers all recoverable jobs after it is shut down and restarted, you should manually checkpoint your jobs, for the following reasons:
There are two methods to manually checkpoint jobs. The first method is to use the qsub option -c interval. The second method is to write code within your program that periodically saves the state of the program. This second method is necessary if checkpointing is required for the last four reasons cited above.
Coding Programs for Manual Checkpointing
Programs can be manually checkpointed by inserting code that incrementally saves data as the job is processed. Use this procedure to run a job that requires more than the maximum time available in the longest PBS batch queue. It is a good idea to checkpoint jobs that will take longer than 30 minutes to run. In the event of a system failure, a running job that has no checkpoint restarts from the beginning.
The three basic steps for manually checkpointing a job in this manner are:
The program shown in Program 1 is also used in Program 2, which shows how to insert code to checkpoint a job manually. Note that Program 1 has no checkpointing code.
=================================================================================
Program 1.
! Program with no recovery procedure.
REAL, DIMENSION(5000) :: ARRAY
! Set maximum number of iterations.
ITMAX = 9999
! Normal initialization procedure.
CALL INIT (ARRAY)
! Do for ITMAX number of iterations. Subroutine SUB1 does
! all the work on the array.
DO I = 1,ITMAX
   CALL SUB1 (ARRAY)
END DO
! Save the final result in a file named "final.done".
OPEN(FILE='final.done',UNIT=17,FORM='UNFORMATTED')
WRITE (17) ARRAY
CLOSE (UNIT = 17, STATUS ='KEEP')
END
Program 2:
The program below uses two saved-state files to ensure the reliability of the recovery procedure. In this example, the current state of the program is defined by the information stored in ARRAY and the current iteration number, I. In other codes, the information necessary to define the state of the calculation may be in several arrays, common blocks, and variables. If you use a second saved-state file, and the iteration counts do not match, the older saved-state file is used. This provides a much safer mechanism for recovery, especially if the saved-state file is large. Large saved-state files are more likely to become unrecoverable because the write takes longer to complete, which gives a larger window of time for a coincidental crash.
! This program has a recovery procedure with two files to save the
! state of the program. This will allow the program to recover if
! there is a system failure during the primary state save.
REAL, DIMENSION(5000) :: ARRAY
! FN holds a 13-character file name such as "statesave.001".
CHARACTER (LEN=13) FN
! Set maximum number of iterations.
ITMAX = 9999
! INTERVAL is the frequency of saving data.
INTERVAL = 20
! Open a scratch file named iter.save. (OPENCR is a subroutine that
! is included below. It will create a file if it does not exist.)
CALL OPENCR (16,'iter.save', 'FORMATTED')
! The END = option allows for an initialization before execution of
! the code, if the file iter.save does not exist.
! The variable ITERSAVE is the number of iterations performed before
! the program's status was saved.
! ITERSAVE = 0 or end of file indicates it is the beginning of a task
! and any normal initialization should be performed.
READ (16,*,END = 100) ITERSAVE
GOTO 101
100 CONTINUE
ITERSAVE = 0
101 CONTINUE
CLOSE (16,STATUS = 'KEEP')
! This procedure uses two files for saving the state of the program.
! One holds the current data and one holds the previous save state.
! This ensures that at least one of the two most recent save states
! is valid. The one to be used is determined by the count value in
! the scratch file iter.save.
IF (ITERSAVE /= 0) THEN
   ! The number 2 in the MOD function indicates there is alternation
   ! between the two save files. This could be modified to save more
   ! than two states.
   J = MOD(ITERSAVE/INTERVAL,2) + 1
   ! The following is an internal WRITE statement. This is similar to
   ! the ENCODE statement found in some versions of Fortran. The
   ! result is placed in the character variable FN instead of going
   ! to an output device. The I3.3 in the FORMAT statement forces
   ! leading zeros in the output. This produces a character string
   ! such as "statesave.001".
   WRITE (FN,221) J
221 FORMAT('statesave.',I3.3)
   CALL OPENCR (15,FN, 'UNFORMATTED')
   READ (15) ITSAV,ARRAY
   CLOSE (UNIT = 15,STATUS = 'KEEP')
   IF (ITERSAVE /= ITSAV) THEN
      ! The following will print in case of an unlikely situation,
      ! such as a file system failure or similar catastrophe.
      PRINT *,'Invalid Recovery'
      STOP
   ENDIF
ELSE
   ! Normal initialization procedure.
   CALL INIT (ARRAY)
ENDIF
! Do for ITMAX iterations.
DO I = ITERSAVE + 1,ITMAX
   CALL SUB1 (ARRAY)
   IF (MOD (I, INTERVAL) == 0) THEN
      ! Write the save state data and alternate between the two save
      ! state files. This is to ensure the state of the program is
      ! saved completely. If the system goes down before the
      ! WRITE(15) completes, then iter.save is not updated and the
      ! previous state save files can be used.
      J = MOD(I/INTERVAL,2) + 1
      WRITE (FN,221) J
      CALL OPENCR (15,FN, 'UNFORMATTED')
      WRITE (15) I,ARRAY
      CLOSE (UNIT = 15,STATUS ='KEEP')
      ! Notice that iter.save is only updated after the current save
      ! state file is completely written.
      CALL OPENCR (16,'iter.save', 'FORMATTED')
      WRITE (16,*) I
      CLOSE (UNIT = 16,STATUS ='KEEP')
   ENDIF
END DO
! Save the final result in a file named "final.done".
CALL OPENCR (17,'final.done', 'UNFORMATTED')
WRITE (17) ARRAY
CLOSE (UNIT = 17,STATUS = 'KEEP')
! Delete iter.save.
CALL OPENCR (16,'iter.save', 'FORMATTED')
CLOSE (UNIT = 16,STATUS = 'DELETE')
! Delete the state save files. These may be kept for future use,
! especially if they are used for restarts from other than system
! failures. If they are not necessary after completion of the job,
! deletion is advisable since they can use a lot of disk space.
DO I = 1,2
   WRITE (FN,221) I
   CALL OPENCR (15,FN,'UNFORMATTED')
   CLOSE (UNIT = 15,STATUS = 'DELETE')
END DO
END
SUBROUTINE OPENCR(U, FN, FM)
! Open a file and create it, if it does not already exist.
INTEGER U
CHARACTER*(*) FN, FM
! Open the file.
OPEN (FILE = FN, UNIT = U, FORM = FM)
RETURN
END
============================================================================
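The alternation between the two saved-state files in Program 2 can be traced with a short shell sketch. It reproduces only the index arithmetic of the Fortran statement J = MOD(I/INTERVAL,2) + 1 and the I3.3 format. The function names state_file and last_checkpoint are hypothetical; they are not part of PBS or of the program above.

```shell
#!/bin/sh
# Sketch only: mirrors the file-selection arithmetic of Program 2.
# state_file and last_checkpoint are illustrative names, not PBS commands.

# Print the saved-state file name used for a checkpoint at iteration $1
# with save interval $2 (integer division, as in the Fortran code).
state_file() {
    printf 'statesave.%03d\n' $(( ($1 / $2) % 2 + 1 ))
}

# Print the most recent checkpoint iteration at or before iteration $1.
last_checkpoint() {
    echo $(( ($1 / $2) * $2 ))
}

interval=20
for i in 20 40 60; do
    echo "iteration $i -> $(state_file "$i" "$interval")"
done

# Assume a failure during iteration 57: iteration 40 is the last complete
# save, so the DO loop would resume at iteration 41.
crash=57
saved=$(last_checkpoint "$crash" "$interval")
echo "resume at $(( saved + 1 )) using $(state_file "$saved" "$interval")"
```

With INTERVAL = 20, the saves at iterations 20, 40, and 60 go to statesave.002, statesave.001, and statesave.002 in turn, so the file being written is never the one holding the previous complete state.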
When PBS recovers after an orderly shutdown, it recovers all jobs whose checkpoint files contain recoverable images. These images and files are created from the jobs that were running at the time PBS was shut down. As described in PBS-Initiated Checkpointing, these jobs can be checkpointed automatically by PBS when it is shut down, or checkpointed manually by using the qsub -c interval option.
When PBS recovers after a shutdown or a system failure, it looks for checkpoint files with recoverable images; these images may have been forced by the qsub -c interval option. PBS recovers any process that has been checkpointed and has a recoverable image. You can use qsub -cn to keep the job from recovering.
Automatically Resubmitting Jobs
By reading the manually saved state of the job, you can get a PBS job to resubmit itself so that the next PBS job begins where your job ends. You may want to checkpoint manually and restart a job in order to run a very long job in stages. This approach is useful for a job that requires more time than the longest available batch queue allows. PBS policy states that you may only have a job resubmit itself at the end of the job.
The qsub script shown below assumes a job that can manually checkpoint itself.
# Resubmitting job example.
#PBS -S /bin/ksh
# Use the Korn shell.
#PBS -lcput=7000
# CPU time limit of 7000 seconds.
#PBS -lmem=20mw
# 20 megawords memory size limit.
# Change directory to where the executable is located (~/myjob).
cd ~/myjob
./myprog3 >> file.out
# The -f test checks for the existence of the file named iter.save,
# which contains the manually checkpointed data. If this file
# exists, the job is not finished and will be resubmitted.
if [ -f iter.save ]; then
    qsub resub.job
fi
In this example, the file iter.save, which the script relies on, is created at run time by the program. The script also relies on the fact that the recovery files are deleted upon successful completion of a run. The script in this example is in the file named resub.job, located in the directory ~/myjob. It is initially run by changing to the directory where the executable file resides.
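The resubmission decision can also be isolated in a small function, which makes it possible to try out the logic without submitting anything. The following is a minimal sketch, assuming the iter.save file written by Program 2. The function name resubmit_if_unfinished and the SUBMIT variable are hypothetical; SUBMIT defaults to echo so that the qsub command is only printed.

```shell
#!/bin/sh
# Sketch only: resubmit_if_unfinished and SUBMIT are illustrative names.
# SUBMIT defaults to "echo" (dry run); set SUBMIT to an empty string
# to really call qsub.
SUBMIT=${SUBMIT-echo}

# $1 = checkpoint marker file, $2 = job script to resubmit.
resubmit_if_unfinished() {
    if [ -f "$1" ]; then
        # Program 2 stores the last saved iteration number in the marker.
        read -r iter < "$1"
        echo "checkpoint at iteration $iter; resubmitting"
        $SUBMIT qsub "$2"
    else
        echo "no checkpoint marker; job finished"
    fi
}

# Call this as the last step of the job script, after the program exits.
resubmit_if_unfinished iter.save resub.job
```

Keeping the call as the final command of the script is consistent with the policy that a job may only resubmit itself at the end of the job.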