ES-SAC: A Hybrid Evolution Strategy and Reinforcement Learning Approach for Humanoid Locomotion Control

BULLETIN OF THE POLISH ACADEMY OF SCIENCES. TECHNICAL SCIENCES, vol. 74, no. 4, pp. 1-12, 2026 (Scopus)

State-of-the-art deep reinforcement learning (DRL) techniques such as Soft Actor-Critic (SAC), Twin Delayed Deep Deterministic Policy Gradient (TD3), and Deep Deterministic Policy Gradient (DDPG) demonstrate promising results in developing control strategies. In this study, we propose ES-SAC, a hybrid learning framework that integrates Evolutionary Strategy (ES) with the SAC algorithm to enhance humanoid robot locomotion control. ES-SAC leverages the global search capabilities of evolutionary algorithms and the sample efficiency and convergence properties of DRL. The performance of the ES-SAC agent was evaluated on a bipedal robot simulation and compared to other hybrid methods employing deterministic agents, including ES-TD3 and ES-DDPG. The ES-SAC agent exhibited superior average reward performance and a more stable learning process. In contrast, the ES-TD3 agent achieved faster course completion but exhibited control instabilities. This study also highlights the importance of physical and behavioral metrics – such as torque efficiency, horizontal and vertical deflection, and Q0 values – in assessing the reliability of DRL-based locomotion control. Our findings suggest that relying solely on cumulative reward for evaluation can be misleading, underscoring the need for a more comprehensive analysis in future research.
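
The abstract does not say how ES-SAC couples the evolutionary search with SAC's gradient updates, so the sketch below illustrates only the ES half, in a generic form: perturb a parameter vector with Gaussian noise, score each perturbation, and step along the fitness-weighted noise. The names `es_step` and `toy_fitness`, the hyperparameters, and the quadratic objective (a stand-in for an episode return) are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def es_step(theta, fitness_fn, rng, pop_size=50, sigma=0.1, lr=0.05):
    """One generic ES update (illustrative, not the paper's method)."""
    # Sample one Gaussian perturbation per population member.
    noise = rng.standard_normal((pop_size, theta.size))
    # Score each perturbed parameter vector.
    fitness = np.array([fitness_fn(theta + sigma * eps) for eps in noise])
    # Standardize fitness so the step does not depend on the reward scale.
    adv = (fitness - fitness.mean()) / (fitness.std() + 1e-8)
    # Fitness-weighted average of the noise; the 1/sigma factor of the
    # usual estimator is absorbed into the learning rate here.
    direction = noise.T @ adv / pop_size
    return theta + lr * direction

def toy_fitness(theta):
    # Hypothetical stand-in for an episode return; maximal at theta == 1.
    return -np.sum((theta - 1.0) ** 2)

rng = np.random.default_rng(0)
theta = np.zeros(5)
initial = toy_fitness(theta)
for _ in range(200):
    theta = es_step(theta, toy_fitness, rng)
final = toy_fitness(theta)
```

In a hybrid scheme of the kind the abstract describes, `fitness_fn` would presumably be replaced by a simulated rollout of the bipedal robot, with the population search complementing the gradient-based SAC learner; the exact exchange between the two is left unspecified here because the abstract does not describe it.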