Callback to back up and restore the training state.
Inherits From: Callback
tf.keras.callbacks.experimental.BackupAndRestore(
    backup_dir
)
The BackupAndRestore callback is intended to recover training from an interruption that has happened in the middle of a Model.fit execution, by backing up the training state in a temporary checkpoint file (with the help of a tf.train.CheckpointManager) at the end of each epoch. Each backup overwrites the previously written checkpoint file, so at any given time there is at most one such checkpoint file for backup/restoring purposes.
If training restarts before completion, the training state (which includes the Model weights and epoch number) is restored to the most recently saved state at the beginning of a new Model.fit run. At the completion of a Model.fit run, the temporary checkpoint file is deleted.
Note that the user is responsible for bringing jobs back after the interruption. This callback is important to the backup and restore mechanism for fault tolerance purposes, and the model restored from a previous checkpoint is expected to be the same as the one used to back it up. If the user changes arguments passed to compile or fit, the checkpoint saved for fault tolerance can become invalid.
Note:
- This callback is not compatible with disabling eager execution; it requires eager execution.
- A checkpoint is saved at the end of each epoch. After restoring, Model.fit redoes any partial work during the unfinished epoch in which the training got restarted (so the work done before the interruption doesn't affect the final model state).
- This works for both single-worker and multi-worker modes. When Model.fit is used with tf.distribute, it supports tf.distribute.MirroredStrategy, tf.distribute.MultiWorkerMirroredStrategy, tf.distribute.TPUStrategy, and tf.distribute.experimental.ParameterServerStrategy (see the sketch below).
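Because the callback is supported under tf.distribute, a minimal multi-worker sketch might look like the following. The strategy, data, model, and backup path here are illustrative assumptions rather than part of the official example; without a TF_CONFIG cluster specification, MultiWorkerMirroredStrategy falls back to a single worker.

import numpy as np
import tensorflow as tf

strategy = tf.distribute.MultiWorkerMirroredStrategy()
with strategy.scope():
  # Model variables must be created inside the strategy scope.
  model = tf.keras.models.Sequential([tf.keras.layers.Dense(10)])
  model.compile(tf.keras.optimizers.SGD(), loss='mse')

backup = tf.keras.callbacks.experimental.BackupAndRestore(
    backup_dir='/tmp/multiworker_backup')
# If a worker is interrupted and restarted, this fit call resumes from the
# last completed epoch recorded in backup_dir.
model.fit(np.arange(100).reshape(5, 20), np.zeros(5), epochs=10,
          batch_size=1, callbacks=[backup], verbose=0)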
Example:
import numpy as np
import tensorflow as tf

class InterruptingCallback(tf.keras.callbacks.Callback):
  def on_epoch_begin(self, epoch, logs=None):
    if epoch == 4:
      raise RuntimeError('Interrupting!')

callback = tf.keras.callbacks.experimental.BackupAndRestore(
    backup_dir="/tmp/backup")
model = tf.keras.models.Sequential([tf.keras.layers.Dense(10)])
model.compile(tf.keras.optimizers.SGD(), loss='mse')
try:
  model.fit(np.arange(100).reshape(5, 20), np.zeros(5), epochs=10,
            batch_size=1, callbacks=[callback, InterruptingCallback()],
            verbose=0)
except RuntimeError:
  pass
history = model.fit(np.arange(100).reshape(5, 20), np.zeros(5), epochs=10,
                    batch_size=1, callbacks=[callback], verbose=0)
# Only 6 more epochs are run: the first training got interrupted at
# zero-indexed epoch 4, so the second training continues from epoch 4 to 9.
len(history.history['loss'])
6
Methods
set_model
set_model(model)
set_params
set_params(params)
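Both set_model and set_params are inherited from the base Callback class; Keras calls them internally during Model.fit to attach the model and the training parameters to the callback, so user code does not normally need to call them directly.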