
Add Automatic Mixed Precision option for training and evaluation. #199

Merged: 8 commits merged into huggingface:main from use_amp on May 20, 2024

Conversation

@alexander-soare (Collaborator) commented May 20, 2024

What this does

  • As titled.
  • Side change: some tweaks to the end-to-end test params (reduced model sizes) to compensate for the extra CI time added here.
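
For reference, a minimal sketch of what a use_amp flag typically gates in a PyTorch training step (torch.autocast for the forward pass plus gradient scaling for the backward pass). This is an illustration, not this PR's actual code; policy, batch, and optimizer are placeholders:

import torch

use_amp = True
device = torch.device("cuda")
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

def training_step(policy, batch, optimizer):
    # Forward pass runs in mixed precision when AMP is enabled.
    with torch.autocast(device_type=device.type, enabled=use_amp):
        loss = policy(batch)
    optimizer.zero_grad()
    # Scale the loss before backward to avoid fp16 gradient underflow.
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # unscales gradients, then applies the optimizer update
    scaler.update()         # adjusts the scale factor for the next step
    return loss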

How it was tested

CI += end-to-end testing for training and eval with AMP.

How to checkout & try? (for the reviewer)

Try training with use_amp = false/true. I tried this with ACT/Aloha and saw a reduction in training time and memory usage.
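
For example, a training run might be launched like this (the policy/env override names here are an assumption based on the repo's Hydra configs; adjust to your setup):

python lerobot/scripts/train.py policy=act env=aloha use_amp=true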

Try evaluating with AMP:

python lerobot/scripts/eval.py -p lerobot/diffusion_pusht eval.n_episodes=50 eval.batch_size=50 eval.use_async_envs=true env.episode_length=300 +use_amp=true

For me, it took 61 seconds without AMP and 48 seconds with AMP.
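
At a high level, AMP evaluation amounts to wrapping inference in autocast; a rough sketch (not the PR's exact code, with policy and observation as placeholders):

import torch

use_amp = True
device = torch.device("cuda")

@torch.no_grad()
def select_action(policy, observation):
    # Run the policy's forward pass in mixed precision. No gradient scaling is
    # needed at eval time since there is no backward pass.
    with torch.autocast(device_type=device.type, enabled=use_amp):
        return policy(observation)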

Finally, I trained ACT/aloha_sim_transfer_cube_human with the same recipe as https://huggingface.co/lerobot/act_aloha_sim_transfer_cube_human but only to 25k iters. I evaluated 50 episodes and matched the baseline's success rate at 30k iters, i.e. ~56% (wandb run: https://wandb.ai/alexander-soare/lerobot/runs/vrrckq4p?nw=nwuseralexandersoare ; note that I used the wrong simulation env during training).

@alexander-soare alexander-soare marked this pull request as draft May 20, 2024 12:02
@alexander-soare alexander-soare added the ⚡️ Performance Performance-related label May 20, 2024
@alexander-soare alexander-soare marked this pull request as ready for review May 20, 2024 13:42
@@ -10,6 +10,9 @@ hydra:
name: default

device: cuda # cpu
# `use_amp` determines whether to use Automatic Mixed Precision (AMP) for training and evaluation. With AMP,
# automatic gradient scaling is used.
use_amp: false
Collaborator

Should we set it to true by default?

Suggested change
use_amp: false
use_amp: true

Collaborator

Maybe in a second PR once all our models on the hub are fp16 checkpoints?

Collaborator Author

Yes I agree we should do it next. Good suggestion.

Comment on lines -392 to +416
- pin_memory=cfg.device != "cpu",
+ pin_memory=device.type != "cpu",
Collaborator

Nice!
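
A standalone sketch of the pattern in the changed line above (dataset and batch size are placeholders): the DataLoader's pin_memory is keyed off the resolved torch.device's .type rather than the raw config string.

import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dataset = TensorDataset(torch.randn(16, 3))
# Pinning host memory only makes sense when batches will be copied to an accelerator.
dataloader = DataLoader(dataset, batch_size=4, pin_memory=device.type != "cpu")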

@Cadene (Collaborator) left a comment

:D

@alexander-soare alexander-soare merged commit b6c216b into huggingface:main May 20, 2024
5 checks passed
@alexander-soare alexander-soare deleted the use_amp branch May 20, 2024 17:58
HalvardBariller pushed a commit to HalvardBariller/lerobot that referenced this pull request May 21, 2024