T) Explain the job recovery file (.rec) in Ab Initio.
- The Co>Op monitors and records the state of jobs so that if a job fails, it can be restarted. This state information is stored in files associated with the job and enables the Co>Op to roll back the system to its initial state, or to its state as of the most recent completed checkpoint. Generally, if the application encounters a failure, all hosts and their respective files are rolled back to their initial state or to their state as of the most recent completed checkpoint; you recover the job simply by rerunning it.
- Details: An Ab Initio job is considered completed when the mp run command returns. This means that all the processes associated with the job (excluding commands you might have added at the end of the script) have completed. These include the process on the host system that executes the script, and all processes the job has started on remote computers. If any of these processes terminate abnormally, the Co>Op terminates the entire job and cleans up as much as possible. When an Ab Initio job runs, the Co>Op creates a file in the working directory on the host system with the name jobname.rec.
- This file contains a set of pointers to the log files on the host and on every computer associated with the job. The log files enable the Co>Op to roll back the system to its initial state or to its state as of the most recent checkpoint. If the job completes successfully, the recovery files are removed (they are also removed when a single-phase graph is rolled back). If the application encounters a software failure (for example, one of the processes signals an error or the operator aborts the application), all hosts and their respective files are rolled back to their initial state, as if the application had not run at all. The files return to the state they were in at the start, all temporary files and storage are deleted, and all processes are terminated. If the program contains checkpoint commands, the state restored is that of the most recent completed checkpoint.
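- A quick way to observe this on the host is to look for the recovery file in the job's working directory. This is only an illustration; the graph name my_graph is hypothetical, and the file naming follows the description above:

    ls my_graph.rec    # present while the job is running or after a failure
    # After a successful run (or after a single-phase graph is rolled back),
    # the recovery file is removed, so the same ls reports "No such file or directory".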
- When a job has been rolled back, you recover it simply by rerunning it. Of course, the cause of the original failure might repeat itself when you rerun the failed job. You will have to determine the cause of the failure by investigation or by debugging. When a checkpointed application is rerun, the Co>Op performs a fast-forward replay of the successful phases. During this replay, no programs run and no data flows; that is, the phases are not actually repeated (although the monitoring system cannot detect the difference between the replay and an actual execution).
- When the replayed phases are completed, the Co>Op runs the failed phase again. Note that it might not always be possible for the Co>Op to restore the system to an earlier state. For example, a failure could occur because a host or its native OS crashed. In this case, it is not possible to cleanly shut down flow or file operations, nor to roll back file operations performed in the current phase. In fact, it is likely that intermediate or temporary files will be left around. To complete the cleanup and get the job running again, you must perform a manual rollback. You do this with the m_rollback command. The syntax is:
  m_rollback [-d] [-i] [-h] recoveryfile
  Running m_rollback recoveryfile rolls the job back to its initial state or to the last completed checkpoint. Using the -d option deletes the partially run job and the recovery file.
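- A minimal sketch of the manual rollback described above, using the m_rollback syntax given here (the recovery-file name my_graph.rec is hypothetical):

    m_rollback my_graph.rec       # roll the job back to its initial state or last completed checkpoint
    m_rollback -d my_graph.rec    # also delete the partially run job and the recovery file

  After the rollback completes, you rerun the job as usual; if it is checkpointed, the Co>Op fast-forwards through the phases that already completed.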