This example trains a small LLaMA model with NeMo 2.0 and shows how the following resiliency features work:

  • in-job restart
  • straggler detection
  • asynchronous checkpointing

Official documentation for these features can be found in the NeMo user guide.

See resiliency-in-pretraining-demo.ipynb for a demo of these features. A hands-on Launchable tutorial, which runs on an NVIDIA-accelerated Crusoe GPU cloud instance, can be deployed from https://nvda.ws/42aNqnM.

  • crash_simulator.py simulates a fatal crash at a specific step, so you can see the in-job restart feature at work.
  • preemption_simulator.py simulates signal.SIGTERM, so you can see the preemption feature at work.
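
The idea behind the two simulators can be illustrated with a toy training loop. The sketch below is not the code of crash_simulator.py or preemption_simulator.py, and it does not use any NeMo API. Every name in it (`SimulatedCrash`, `PreemptionFlag`, `train`, `run_with_restart`) is hypothetical. It assumes Python 3.8+ for `signal.raise_signal`, and it assumes that a checkpoint exists at the last completed step when the loop restarts.

```python
import signal


class SimulatedCrash(RuntimeError):
    """Hypothetical error used to mimic a fatal failure at a chosen step."""


class PreemptionFlag:
    """Remembers that SIGTERM arrived so the loop can stop at a step boundary."""

    def __init__(self):
        self.requested = False

    def install(self):
        # Must be called from the main thread; replaces the default SIGTERM action.
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame):
        self.requested = True


def _one_step(step):
    """Stand-in for a real training step; returns the next step index."""
    return step + 1


def train(max_steps, start_step=0, crash_step=None, preempt_step=None, flag=None):
    """Toy loop. Returns (completed_steps, reason)."""
    step = start_step
    while step < max_steps:
        if crash_step is not None and step == crash_step:
            raise SimulatedCrash(f"simulated fatal crash at step {step}")
        if preempt_step is not None and step == preempt_step:
            # Send SIGTERM to this process, as a scheduler would on preemption.
            signal.raise_signal(signal.SIGTERM)
        step = _one_step(step)
        if flag is not None and flag.requested:
            # A real job would save a checkpoint here before exiting.
            return step, "preempted"
    return step, "finished"


def run_with_restart(max_steps, crash_step):
    """Restart the loop in the same process after a simulated crash."""
    restarts = 0
    start = 0
    while True:
        try:
            # Inject the crash only on the first attempt.
            inject = crash_step if restarts == 0 else None
            step, _ = train(max_steps, start_step=start, crash_step=inject)
            return step, restarts
        except SimulatedCrash:
            restarts += 1
            # Assumption: a checkpoint exists at the last completed step.
            start = crash_step


if __name__ == "__main__":
    flag = PreemptionFlag()
    flag.install()
    print(train(10, preempt_step=3, flag=flag))
    print(run_with_restart(10, crash_step=5))
```

In this sketch the signal handler only sets a flag, and the loop checks that flag after each step, so the job stops at a step boundary instead of in the middle of an update. The restart wrapper catches the simulated crash and resumes in the same process, which is the general idea behind in-job restart. The actual NeMo features are configured differently, as described in the NeMo user guide.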