T) Explain the job recovery file (.rec) in Ab Initio?

The Co>Op monitors and records the state of each job so that a failed job can be restarted. This state information is stored in files associated with the job, and it lets the Co>Op roll the system back either to its initial state or to its state as of the most recent completed checkpoint. Generally, if the application encounters a failure, all hosts and their respective files are rolled back to one of those two states, and you recover the job simply by rerunning it.

Details: An Ab Initio job is considered complete when the mp run command returns. This means that all processes associated with the job, excluding any commands you might have added at the end of the script, have completed. These include the process on the host system that executes the script and all processes the job has started on remote computers. If any of these processes terminates abnormally, the Co>Op terminates the entire job and cleans up as much as possible....
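To make the checkpoint-and-rerun idea above concrete, here is a small Python sketch. It is not how the Co>Op is implemented and it does not use any Ab Initio API. The function `run_job`, the `fail_at` parameter, the phase names and the JSON layout of the recovery file are all made up for illustration. It only models the part where a failed job leaves its state file behind and a rerun picks up from the last completed checkpoint. Rolling back files on remote hosts is not modelled.

```python
import json
import os
import tempfile


def run_job(phases, rec_path, fail_at=None):
    """Run phases in order, recording each completed checkpoint in rec_path.

    Hypothetical helper: returns the phase names executed in this invocation.
    """
    # If a recovery file exists, resume after the last completed checkpoint;
    # otherwise start from the initial state.
    start = 0
    if os.path.exists(rec_path):
        with open(rec_path) as f:
            start = json.load(f)["completed"]

    executed = []
    for i in range(start, len(phases)):
        if fail_at == i:
            # Simulated abnormal termination: the recovery file stays on disk.
            raise RuntimeError(f"phase {phases[i]} failed")
        executed.append(phases[i])
        # Checkpoint: record how many phases have completed so far.
        with open(rec_path, "w") as f:
            json.dump({"completed": i + 1}, f)

    # The job completed normally, so the recovery file is no longer needed.
    if os.path.exists(rec_path):
        os.remove(rec_path)
    return executed


if __name__ == "__main__":
    rec = os.path.join(tempfile.mkdtemp(), "demo.rec")
    phases = ["extract", "transform", "load"]
    try:
        run_job(phases, rec, fail_at=2)  # fails before the third phase
    except RuntimeError as err:
        print("job failed:", err)
    # Rerunning the same job resumes from the recorded checkpoint.
    print("rerun executed:", run_job(phases, rec))
```

In this toy version, recovery is just "run it again": the second call reads the leftover state file and skips the phases that already completed, which mirrors the "recover the job simply by rerunning it" behaviour described above.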