Exploiting chaos for learning
Robot Learning with Lyapunov Rewards
- Performance of an RL agent depends heavily on a well-designed reward model.
- Reward models usually require domain knowledge and custom heuristics
- The learner becomes inefficient when the reward is generalized across all tasks and becomes sparse
- e.g., the agent only gains a reward of 1 upon completion of the task, 0 everywhere else.
- This auxiliary exploration is possible in simulation but impractical for real robot systems
- Need to achieve stabilization using minimal external domain knowledge
- Exploit the dynamics
- Use truncated Lyapunov exponents to reward the controller; this incentivizes it to explore the system's maximally sensitive states.
- Natural reward function for exploration
- Test on classic benchmarks: pendulum swing up task
- double pendulum (PILCO is good for this)
- challenging exploration of state space with chaotic dynamics
- single pendulum:
- simple dynamics
- frequently used for new methods
- cart-pole:
- more complicated than the simple pendulum
- For the reward function to be effective:
- should be dense and informative to speed up learning
- Estimate the Lyapunov exponent from samples (see the sketch after this list)
- “Ride the crest of sensitivity” to maximum instability
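A minimal sketch of how such a sensitivity reward could be estimated from samples, assuming a `step_fn(state) -> next_state` simulator of the (closed-loop) dynamics; the perturb-and-rollout estimator and all names here are illustrative, not the paper's exact method:

```python
import numpy as np

def truncated_lyapunov_reward(step_fn, state, horizon=10, eps=1e-4):
    # Perturb the state slightly and roll both copies forward under the dynamics.
    ref = np.asarray(state, dtype=float)
    pert = ref + eps * np.random.randn(*ref.shape)
    d0 = np.linalg.norm(pert - ref)
    for _ in range(horizon):
        ref = np.asarray(step_fn(ref), dtype=float)
        pert = np.asarray(step_fn(pert), dtype=float)
    d_T = np.linalg.norm(pert - ref)
    # Average exponential separation rate over the truncated horizon: high values
    # mark maximally sensitive states, which the controller is rewarded for visiting.
    return np.log(max(d_T, 1e-12) / d0) / horizon
```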
Lyapunov Spectra of Dynamical Systems
Observations
- SuPLE does not require the specification of a target state
- Guides the agent towards unstable equilibria (and stabilizes it there)
- No need to specify the unstable equilibria.
A dynamical systems framework for Reinforcement Learning Safety and Robustness verification
- Analyze the combination of the RL agent and its environment as a discrete-time autonomous dynamical system (the closed-loop system).
- Use FTLE to visualize LCS that act as a hidden “skeleton” governing the system's behavior.
- Framework provides an interpretable assessment of policy behavior
- Identify flaws in policies that appear successful based on reward alone.
- Susceptibility to adversarial perturbations.
- Small changes to an agent's sensory input can often cause catastrophic failure, leading to unsafe behavior.
- Significant barrier to deployment, as sensors are noisy and environments are unpredictable/adversarial.
- “Robustness” problem solutions:
- constrained optimization
- Lyapunov based methods
- zero sum game against adversary
- these rely on statistical assumptions or are computationally expensive
- Instead, use the system dynamics themselves to analyze the policy's performance
- Use FTLE as a measure:
- a scalar field that measures the maximal rate at which nearby trajectories separate over a finite time window.
- Regions of high FTLE form ridges called LCS
- these partition the flow into distinct coherent regions
- Framework maps the dynamic structure to the critical properties of an RL agent’s policy:
- Repelling LCS: compute FTLE in forward time
- the resulting LCS is repelling:
- divides nearby trajectories that cannot cross over (barrier)
- for a safe policy, the agent must learn to create strong repelling LCS that wall off unsafe regions of the state space
- their presence verifies that the agent has learned an effective avoidance policy (“keep-out” zones)
- Attracting LCS (verify robustness):
- Act as highways and attractors
- For robust policy, find a single dominant basin of attraction centered on the desired goal state.
- The existence of other strong attractors reveals “trap” states where the agent can become permanently stuck
- Computation will require inverting the flow map?
- Can approximate their locations by simulating a large number of trajectories from random initial states and creating a 2D histogram of their final states at $T_{int}$ (see the sketch below)
- High-density regions are attractors
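A sketch of the histogram approximation, assuming hypothetical helpers `flow_map(s0)` (integrates the closed-loop system to $T_{int}$) and `sample_state()` (draws a random initial state); only the first two state dimensions are binned:

```python
import numpy as np

def attractor_histogram(flow_map, sample_state, n_samples=5000, bins=100):
    # Simulate many trajectories from random initial states and keep the final states.
    finals = np.array([flow_map(sample_state()) for _ in range(n_samples)])
    # High-density bins of the 2D histogram approximate attracting regions.
    H, xedges, yedges = np.histogram2d(finals[:, 0], finals[:, 1], bins=bins)
    return H, xedges, yedges
```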
RL through Dynamical Systems
- When policy $\pi$ is deployed, it creates a closed-loop system with the env.
- \[s_{k+1}=f(s_k) = T(s_k, \pi(s_k))\]
- Use flow maps $\Phi$ to understand long-term system behavior (over a finite time horizon)
- \[\Phi(s_0) = f^T(s_0) = (f \circ f \circ \dots \circ f)(s_0)\]
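A minimal sketch of the closed-loop step and flow map, assuming `transition(s, a)` and a deterministic `policy(s)` are available as callables (hypothetical names, not a specific library API):

```python
def closed_loop_step(transition, policy, s):
    # One step of the autonomous closed-loop system: s_{k+1} = T(s_k, pi(s_k)).
    return transition(s, policy(s))

def flow_map(transition, policy, s0, T):
    # Flow map Phi: compose the closed-loop step T times starting from s0.
    s = s0
    for _ in range(T):
        s = closed_loop_step(transition, policy, s)
    return s
```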
Calculating FTLE
- To calculate local deformation around a state $s_0$
- Find the deformation gradient tensor, which is the Jacobian of the flow map:
- $J(s_0) = \frac{\partial \Phi(s_0)}{\partial s_0}$
- Jacobian describes both the stretching and rotation around the neighbourhood
- To isolate stretching (the main driver of trajectory separation), use the right Cauchy-Green deformation tensor $C$:
- $C(s_0) = J(s_0)^T J(s_0)$
- The largest eigenvalue is the maximal squared stretching initiated at $s_0$.
- FTLE definition: average exponential rate of maximal stretching: \(\sigma(s_0, T_{int}) = \frac{1}{\mid T \mid}\ln \sqrt{\lambda_{max}(C(s_0))}\)
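A sketch of this computation given a flow-map Jacobian `J` over horizon `T` (how `J` is obtained on a discrete state space is covered in the next section):

```python
import numpy as np

def ftle_from_jacobian(J, T):
    # Right Cauchy-Green deformation tensor isolates stretching: C = J^T J.
    C = J.T @ J
    # Its largest eigenvalue is the maximal squared stretching around s_0.
    lam_max = np.max(np.linalg.eigvalsh(C))  # eigvalsh: C is symmetric PSD
    # FTLE: average exponential rate of maximal stretching over |T| steps.
    return np.log(np.sqrt(lam_max)) / abs(T)
```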
Calculate FTLE on discrete state space
- $\Phi$ is not an analytical function, so the Jacobian cannot be computed directly.
- Approximate the Jacobian using a finite difference scheme.
- For $s_0$ and step size $h$:
- Consider immediate neighbours, define perturbed initial points along each coordinate axis $i$: $s_{0,i} = s_0 + he_i$
- $e_i$ is the unit vector in the $i$-th dimension.
- Final perturbation vector (after flow map transformation):
- $u_i = \Phi(s_{0,i}) - \Phi(s_0)$
- These vectors form the columns of an approximation of $J$
- $J \approx \frac{1}{h}\begin{bmatrix} u_1 & u_2 & \dots & u_n \end{bmatrix}$
- Can compute $C$ and $\sigma$ from $J$ (see the sketch after this list)
- FTLE complexity:
- Dominant step is flow map calc
- For each of the $\mid S \mid$ states, the algorithm simulates the policy forward for $T$ steps
- Total complexity: $O(\mid S \mid \cdot T)$
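A sketch of the finite-difference approximation, assuming the hypothetical `flow_map(s)` helper from above integrates the closed-loop policy for $T$ steps; sweeping it over every state in the grid and applying `ftle_from_jacobian` yields the FTLE field, which is where the $O(\mid S \mid \cdot T)$ cost comes from:

```python
import numpy as np

def finite_difference_jacobian(flow_map, s0, h=1e-3):
    # Column i of J is approximately (Phi(s0 + h*e_i) - Phi(s0)) / h.
    s0 = np.asarray(s0, dtype=float)
    base = np.asarray(flow_map(s0), dtype=float)
    J = np.zeros((base.size, s0.size))
    for i in range(s0.size):
        e_i = np.zeros(s0.size)
        e_i[i] = 1.0
        J[:, i] = (np.asarray(flow_map(s0 + h * e_i), dtype=float) - base) / h
    return J
```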
Experiments:
- DQN agent trained on 2D grid-worlds
- Gym environments:
- SAC, TD3, TQC used (from Hugging Face)
- For environments with 3 or more dimensions, analyze 2D slices of the dynamical landscape.
- Fix the other state variables to logical constant values (e.g., zero velocity and zero angle)
- MountainCarContinuous: requires an agent to build momentum to escape a valley
- Pendulum-v1: stabilization problem, balance at top
- LunarLanderContinuous-v2: 8-dimensional problem
Lagrangian Coherent Structures (LCS) and Finite Time Lyapunov Exponents (FTLE)
- Coherent structures arising from the mixing of fluid flows
Dynamic Potential-Based Reward Shaping
- Static potential based Reward Shaping: \(F(s, s') = \gamma \Phi(s') - \Phi(s)\)
- Guarantees convergence to the same optimal policy that would be learned without reward shaping.
- Typically implemented using domain-specific heuristic knowledge
- Need to automate the encoding of knowledge
- These methods alter the potential of states online whilst the agent is learning.
- “Dynamic Shaping”
- The widely held opinion is that, for this to work, the potential function must converge before the agent can.
- But even without this assumption, the policy converges under a changing reward transformation.
- Only one condition needed for dynamic potential-based reward shaping: \(F(s, t, s', t') = \gamma \Phi(s', t') - \Phi(s,t)\)
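A minimal sketch of the dynamic shaping term, assuming a user-supplied, possibly still-changing `potential(state, time)` function (hypothetical name); the shaping bonus is simply added to the environment reward:

```python
def shaped_reward(r, s, t, s_next, t_next, potential, gamma=0.99):
    # Dynamic potential-based shaping: F(s, t, s', t') = gamma * Phi(s', t') - Phi(s, t),
    # added on top of the environment reward r.
    F = gamma * potential(s_next, t_next) - potential(s, t)
    return r + F
```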