TIFR - PORTABLE BATCHING SYSTEM

 


Checkpointing PBS Jobs

PBS-initiated Checkpointing

Manual Checkpointing

Coding Programs for Manual Checkpointing

Recovering Checkpointed Jobs

Automatically Resubmitting Jobs

Checkpointing PBS Jobs

Under normal circumstances, when a system is shut down, all running PBS jobs are checkpointed. All necessary information about these jobs is saved so that they may be continued (recovered) when the system is restarted. Unfortunately, since this does not always happen (as with a system crash), users should checkpoint longer jobs. Operations that PBS will perform automatically and information on how to increase the probability of recovering a job are discussed below.


PBS-initiated Checkpointing

When PBS is shut down in an orderly manner (that is, any shutdown other than a system crash) it automatically checkpoints every running job, except those that were submitted with the qsub -cn option (indicating that checkpoint/recovery is not to occur). Checkpoint files are created that contain a recoverable image of a job, which includes all necessary state information to resume the processes associated with the job.
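As a minimal sketch of opting out, the fragment below writes a job script that carries the -c n option as a #PBS directive. The names nockpt.job and myprog are hypothetical and used only for illustration; the resource limits a real job needs are omitted.

```shell
# Write a small job script that opts out of checkpoint/recovery.
# "nockpt.job" and "myprog" are hypothetical names, not part of PBS.
cat > nockpt.job <<'EOF'
#PBS -c n
./myprog > myprog.out
EOF

# The job would then be submitted with:  qsub nockpt.job
# The same option can instead be given on the command line:  qsub -c n nockpt.job
```

A job submitted this way is skipped by the automatic checkpoint at shutdown, so it should be one that is cheap to rerun or that saves its own state.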

If the system checkpoint space is exceeded, any jobs not yet checkpointed will be bypassed and will not be recoverable when the system restarts. These jobs will have to be rerun from the beginning.

When PBS restarts, all checkpoint files are processed. If a job is deemed recoverable, the system recovers it and continues running the job from the point at which it stopped. A job is considered recoverable if:
 

If a job is not recoverable, it is restarted; the job will begin again at the first line of the script.


Manual Checkpointing

Although PBS automatically checkpoints and recovers all recoverable jobs after it is shut down and restarted, you should manually checkpoint your jobs, for the following reasons:

There are two methods to manually checkpoint jobs. The first method is to use the qsub option -c interval. The second method is to write code within your program that periodically saves the state of the program. This second method is necessary if checkpointing is required for the last four reasons cited above.


Coding Programs for Manual Checkpointing

Programs can be manually checkpointed by inserting code within your program to incrementally save data as the job is processed. Use this procedure to run a job that requires more than the maximum time available in the longest PBS batch queue. It's a good idea to checkpoint jobs that will take longer than 30 minutes to run. In the event of a system failure, all running jobs will restart from the beginning.

The three basic steps for manually checkpointing a job in this manner are:

  1. Allow the program to save its state or recovery data. This includes any variables that cannot be recomputed easily, and the iteration, or step or pass number. Other variables, such as coefficients of a computation that can be recomputed in a recovery procedure (or marked for recomputation) need not be included.
  2. Provide a way for the program to detect the presence of the recovery data and a way for the program to use this data.
  3. Provide a way for the program to test for the validity of the recovery data. You might want to allow a way for the program not only to detect bad recovery data, but to have a recourse, such as a secondary set of recovery data.
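
The three steps above can be sketched at the shell level as follows. This is only an illustration under stated assumptions: the function names save_iteration and load_iteration are hypothetical, the recovery data is reduced to a single iteration count, and the file name iter.save is borrowed from the Fortran example later in this section.

```shell
#!/bin/sh
# Hypothetical helper functions; only the iteration count is saved here.
ITER_FILE=iter.save

# Step 1: save the recovery data (the number of completed iterations).
save_iteration() {
    printf '%s\n' "$1" > "$ITER_FILE"
}

# Steps 2 and 3: detect the recovery data and check that it is usable.
# Prints the iteration to resume from, or 0 if there is no valid data.
load_iteration() {
    if [ ! -s "$ITER_FILE" ]; then
        echo 0                      # file missing or empty: start from the beginning
        return
    fi
    saved=$(head -n 1 "$ITER_FILE")
    case $saved in
        ''|*[!0-9]*) echo 0 ;;      # not a non-negative integer: treat as invalid
        *)           echo "$saved" ;;
    esac
}
```

A job would call load_iteration once at startup and save_iteration after every block of completed work. Falling back to 0 on bad data is the simplest recourse; a second set of recovery data, as suggested in step 3, would be the more robust one.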

Program 1 shows a program with no checkpointing code. Program 2 shows the same program with code inserted to checkpoint the job manually.
=================================================================================
Program 1.

! Program with no recovery procedure.

   REAL, DIMENSION(5000) :: ARRAY

! Set maximum number of iterations.

   ITMAX = 9999

! Normal initialization procedure.

   CALL INIT (ARRAY)

! Do for ITMAX number of iterations. Subroutine SUB1 does
! all the work on the array.

   DO I = 1,ITMAX
      CALL SUB1 (ARRAY)
   END DO

! Save the final result in a file named "final.done".

   OPEN(FILE='final.done',UNIT=17,FORM='UNFORMATTED')
   WRITE (17) ARRAY
   CLOSE (UNIT = 17, STATUS ='KEEP')
   END

Program 2:

The program below provides two saved-state files to ensure the reliability of the recovery procedure. In this example, the current state of the program is defined by the information stored in ARRAY and the current iteration number, I. In other codes, the information necessary to define the state of the calculation may be in several arrays, common blocks, and variables. The iteration count in iter.save is updated only after a saved-state file has been written completely, so if a save is interrupted, iter.save still points to the older saved-state file and that file is used. This provides a much safer mechanism for recovery, especially if the saved-state file is large. Large saved-state files can become unrecoverable because it takes longer for the write to complete, which provides a larger window of time for a coincidental crash.

! This program has a recovery procedure with two files to save the state of the program. This will allow the program to
! recover if there is a system failure during the primary state save.

      REAL, DIMENSION(5000) :: ARRAY
      CHARACTER (LEN=13) FN

! Set maximum number of iterations.

      ITMAX = 9999

! INTERVAL is the frequency of saving data.

      INTERVAL = 20

! Open a scratch file named iter.save. (OPENCR is a subroutine that is included below. It will create a file if it does not
! exist.)

      CALL OPENCR (16,'iter.save', 'FORMATTED')

! The END = option allows for an initialization before execution of the code, if the file iter.save does not exist.
! The variable ITERSAVE is the number of iterations performed before the program's status was saved.
! ITERSAVE = 0 or end of file indicates it is the beginning of a task and any normal initialization should be performed.

      READ (16,*,END = 100) ITERSAVE
      GOTO 101
100   CONTINUE
      ITERSAVE = 0
101   CONTINUE
      CLOSE (16,STATUS = 'KEEP')

! This procedure uses two files for saving the state of the program. One holds the current data and one holds the
! previous save state. This ensures that at least one of the two most recent save states is valid. The one to be used is
! determined by the count value in the scratch file iter.save.

      IF (ITERSAVE /= 0) THEN

! The number 2 in the MOD function indicates there is alternation between the two save files. This could be
! modified to save more than two states.

         J = MOD(ITERSAVE/INTERVAL,2) + 1

! The following is an internal WRITE statement. This is  similar to the ENCODE statement found in some versions of
! Fortran. The result is placed in the character variable FN instead of going to an output device. The I3.3 in the FORMAT
! statement forces leading zeros in the output. This produces a character string such as "statesave.001".

         WRITE (FN,221) J
221   FORMAT('statesave.',I3.3)
         CALL OPENCR (15,FN, 'UNFORMATTED')
         READ (15) ITSAV,ARRAY
         CLOSE (UNIT = 15,STATUS = 'KEEP')
         IF (ITERSAVE /= ITSAV) THEN

! The following will print in case of an unlikely situation; such as a file system failure or similar catastrophe.

            PRINT *,'Invalid Recovery'
            STOP
         ENDIF
      ELSE

! Normal initialization procedure.

         CALL INIT (ARRAY)
      ENDIF

! Do for ITMAX iterations.

      DO I = ITERSAVE + 1,ITMAX
         CALL SUB1 (ARRAY)
         IF (MOD (I, INTERVAL) == 0) THEN

! Write the save state data and alternate between the two save
! state files. This is to ensure the state of the program is
! saved completely. If the system goes down before the
! WRITE(15) completes, then iter.save is not updated and the
! previous state save file can be used.

            J = MOD(I/INTERVAL,2) + 1
            WRITE (FN,221) J
            CALL OPENCR (15,FN, 'UNFORMATTED')
            WRITE (15) I,ARRAY
            CLOSE (UNIT = 15,STATUS ='KEEP')

! Notice that iter.save is only updated after the current save
! state file is completely written.

            CALL OPENCR (16,'iter.save', 'FORMATTED')
            WRITE (16,*) I
            CLOSE (UNIT = 16,STATUS ='KEEP')
         ENDIF
      END DO

! Save the final result in a file named "final.done".

      CALL OPENCR (17,'final.done', 'UNFORMATTED')
      WRITE (17) ARRAY
      CLOSE (UNIT = 17,STATUS = 'KEEP')

! Delete iter.save.

      CALL OPENCR (16,'iter.save', 'FORMATTED')
      CLOSE (UNIT = 16,STATUS = 'DELETE')

! Delete the state save files. These may be kept for future use, especially if they are used for restarts from other
! than system failures. If they are not necessary after completion of the job, deletion is advisable since they can
! use a lot of disk space.

      DO I = 1,2
         WRITE (FN,221) I
         CALL OPENCR (15,FN,'UNFORMATTED')
         CLOSE (UNIT = 15,STATUS = 'DELETE')
      END DO
      END
      SUBROUTINE OPENCR(U, FN, FM)

! Open a file and create it, if it does not already exist.

      INTEGER U
      CHARACTER*(*) FN, FM

! Open the file.

      OPEN (FILE = FN, UNIT = U, FORM = FM)

      RETURN
      END
============================================================================
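
The alternation between the two saved-state files in Program 2 follows from J = MOD(count/INTERVAL, 2) + 1. The short shell sketch below reproduces only that arithmetic so the pattern is easy to see. The function name state_file is hypothetical, and shell is used here purely for illustration.

```shell
#!/bin/sh
# Name of the saved-state file for a given iteration count, following
# J = MOD(count/INTERVAL, 2) + 1 and the 'statesave.',I3.3 format of Program 2.
state_file() {
    count=$1
    interval=$2
    j=$(( (count / interval) % 2 + 1 ))
    printf 'statesave.%03d\n' "$j"
}

# With INTERVAL = 20, successive saves alternate between the two files.
for count in 20 40 60; do
    echo "iteration $count -> $(state_file "$count" 20)"
done
```

With INTERVAL = 20, the saves at iterations 20, 40, and 60 go to statesave.002, statesave.001, and statesave.002, so each save overwrites the older of the two files and leaves the most recent complete one untouched.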


Recovering Checkpointed Jobs

When PBS recovers after an orderly shutdown, it recovers all jobs that have checkpoint files with recoverable images. These images and files are created from the jobs that were running at the time PBS was shut down. As described in PBS-initiated Checkpointing, these jobs are checkpointed automatically by PBS when it is shut down, or they can be checkpointed manually by using the qsub -c interval option.

When PBS recovers after a shutdown or a system failure, it looks for checkpoint files with recoverable images; these images may have been forced by the qsub -c interval option. PBS recovers any job that has been checkpointed and has a recoverable image. To keep a job from being recovered, submit it with qsub -cn.


Automatically Resubmitting Jobs

By reading the manually saved state of the job, you can get a PBS job to resubmit itself so that the next PBS job begins where the current job ends. You may want to checkpoint manually and restart a job in order to run a very long job in stages. This is preferable for a job that requires more time than the longest available batch queue allows. PBS policy states that you may only have a job resubmit itself at the end of the job.

The qsub script shown below assumes a job that can manually checkpoint itself.

# Resubmitting job example.
#PBS -S /bin/ksh        # Use the Korn shell
#PBS -lcput=7000        # CPU time limit of 7000 seconds.
#PBS -lmem=20mw         # 20 megawords memory size limit.
cd ~/myjob              # Change directory to where
                        # executable is located (~/myjob)
./myprog3 >> file.out
if [ -e iter.save ]; then  # The -e checks for the existence
                        # of the file named iter.save,
                        # which contains the manually
                        # checkpointed data. If this file
                        # exists, the job is not finished
                        # and will be resubmitted.
    qsub resub.job
fi

In this example, the file iter.save, which the script needs in order to work, is created by the program at run time. The script also relies on the fact that the recovery files are deleted upon successful completion of a run, so the job is not resubmitted once it has finished. The script is in the file named resub.job, located in the directory ~/myjob. To run it the first time, change to the directory where the executable file resides and submit it from there.
