Add Automatic Mixed Precision option for training and evaluation. #199
Conversation
@@ -10,6 +10,9 @@ hydra:
   name: default

device: cuda  # cpu
# `use_amp` determines whether to use Automatic Mixed Precision (AMP) for training and evaluation. With AMP,
# automatic gradient scaling is used.
use_amp: false
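The `use_amp` flag above gates PyTorch's AMP machinery during training. A minimal sketch of how such a flag is typically wired into a training step, pairing `torch.autocast` with a `GradScaler` (function and variable names here are illustrative, not LeRobot's actual code):

```python
import torch
from torch import nn


def train_step(model, batch, optimizer, scaler, device, use_amp):
    """One optimization step, optionally under Automatic Mixed Precision."""
    # autocast runs the forward pass in reduced precision when enabled.
    with torch.autocast(device_type=device.type, enabled=use_amp):
        loss = model(batch).mean()
    # GradScaler rescales the loss so fp16 gradients do not underflow;
    # when constructed with enabled=False it is a transparent no-op.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()
    return loss.item()


# Demo on CPU; in the PR's intended use, device would be "cuda" and
# the scaler would be constructed with enabled=use_amp.
device = torch.device("cpu")
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=False)
loss = train_step(model, torch.randn(8, 4), optimizer, scaler, device, use_amp=False)
```

With `enabled=False` both `autocast` and `GradScaler` reduce to no-ops, so the same code path serves both settings of the flag.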
Should we set it to true by default?
- use_amp: false
+ use_amp: true
Maybe in a second PR once all our models on the hub are fp16 checkpoints?
Yes I agree we should do it next. Good suggestion.
- pin_memory=cfg.device != "cpu",
+ pin_memory=device.type != "cpu",
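The change from comparing the raw config string to checking `device.type` matters because a config value like `"cuda:0"` would fail a direct string comparison with `"cpu"` only by accident of wording; `torch.device` normalizes the spec so the check is explicit:

```python
import torch

# torch.device parses an ordinal-qualified spec; .type drops the index,
# so the pin_memory condition behaves the same for "cuda", "cuda:0", etc.
for spec in ["cpu", "cuda", "cuda:0"]:
    device = torch.device(spec)
    pin_memory = device.type != "cpu"
    print(spec, device.type, pin_memory)
```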
Nice!
:D
What this does
How it was tested
CI now includes end-to-end testing for training and evaluation with AMP.
How to checkout & try? (for the reviewer)
Try training with use_amp = false/true. I tried this with ACT/Aloha and saw a reduction in training time and memory usage.
Try evaluating with AMP:
For me, it took 61 seconds without AMP and 48 seconds with AMP.
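The evaluation speedup (61 s without AMP vs. 48 s with) comes from running inference in reduced precision. A hedged sketch of what an AMP evaluation loop typically looks like (names are illustrative, not the PR's actual code) — note that no `GradScaler` is needed here, since evaluation computes no gradients:

```python
import torch
from torch import nn


@torch.inference_mode()
def evaluate(model, batches, device, use_amp):
    """Average the model output over batches, optionally under AMP."""
    total = 0.0
    # autocast casts eligible ops to a lower-precision dtype when enabled.
    with torch.autocast(device_type=device.type, enabled=use_amp):
        for batch in batches:
            total += model(batch).mean().item()
    return total / len(batches)


model = nn.Linear(4, 1)
batches = [torch.randn(8, 4) for _ in range(3)]
avg = evaluate(model, batches, torch.device("cpu"), use_amp=False)
```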
Finally, I trained ACT on aloha_sim_transfer_cube_human with the same recipe as https://huggingface.co/lerobot/act_aloha_sim_transfer_cube_human, but only to 25k iters. I evaluated 50 episodes and matched the success rate of the baseline at 30k iters: i.e. ~56% (wandb run here, though note I used the wrong simulation env during training: https://wandb.ai/alexander-soare/lerobot/runs/vrrckq4p?nw=nwuseralexandersoare)