When training LLMs, Sifal Klioui shows that lowering AdamW's epsilon from the default 1e-8 to 1e-10 can lead to better results. He demonstrates this on a toy example where the default epsilon causes the optimizer to oscillate around a local minimum, while the proposed smaller value converges. However, this only applies when training is NOT done in half-precision.
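A minimal sketch of why epsilon matters (not the blog's code): in Adam's update, the step is `lr * m / (sqrt(v) + eps)`. When gradients are tiny, `sqrt(v)` can be comparable to `eps=1e-8`, so epsilon dominates the denominator and damps the update; a smaller epsilon restores the intended step size. Bias correction and weight decay are omitted for brevity, so this is only the core moment update, with illustrative hyperparameter values.

```python
import math

def adam_step(g, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One simplified Adam moment update and step size
    (bias correction and weight decay omitted; a sketch only)."""
    m = b1 * m + (1 - b1) * g          # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * g * g      # second moment (mean of squared gradients)
    step = lr * m / (math.sqrt(v) + eps)
    return step, m, v

# With a tiny gradient, sqrt(v) ~ 3e-9, so eps=1e-8 dominates the
# denominator and shrinks the step; eps=1e-10 does not.
g = 1e-7
step_default, _, _ = adam_step(g, 0.0, 0.0, eps=1e-8)
step_small, _, _ = adam_step(g, 0.0, 0.0, eps=1e-10)
print(step_default, step_small)  # the eps=1e-10 step is several times larger
```

Note the caveat from above: 1e-10 underflows to zero in float16 (whose smallest subnormal is about 6e-8), which is one reason the trick does not carry over to half-precision training.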

Klioui, S. (2026). The Epsilon Trap: When Adam Stops Being Adam. Sifal Klioui Blog. link