Spaces:
Runtime error
A newer version of the Gradio SDK is available:
6.2.0
This example can be used to train a small LLaMA model using NeMo2.0 and understand how some of the resiliency features like the ones listed below work.
- in-job restart
- straggler detection
- asynchronous checkpointing
Official documentation for these features can be found in the NeMo user guide.
See resiliency-in-pretraining-demo.ipynb for a demo of these features. You can deploy a Launchable hands-on tutorial which is deployed on a Crusoe GPU cloud instance accelerated by NVIDIA by following https://nvda.ws/42aNqnM.
crash_simulator.py is a crash simulator that can be used to simulate a fatal crash at a specific step and thus see the capabilities of the in-job restart resiliency feature.
preemption_simulator.py is a preemption simulator that can be used to simulate signal.SIGTERM and thus see the capabilities of the preemption feature.