Exploiting chaos for learning
Robot Learning with Lyapunov Rewards
- Performance of an RL agent depends heavily on a well-designed reward model.
- Reward models usually require domain knowledge and custom heuristics
- The learner becomes inefficient when the reward is generalized across all tasks and becomes sparse
- e.g., the agent only gains a reward of 1 upon completion of the task, 0 everywhere else.
- This auxiliary exploration is possible in simulation but impractical for real robot systems
- Need to achieve stabilization using minimal external domain knowledge
- Exploit the dynamics
- Use truncated Lyapunov exponents to reward the controller; this incentivizes it to explore the system's maximally sensitive states.
- Natural reward function for exploration
- Test on classic benchmarks: pendulum swing up task
- double pendulum (PILCO is good for this)
- challenging exploration of state space with chaotic dynamics
- single pendulum:
- simple dynamics
- frequently used for new methods
- cart-pole:
- more complicated than the simple pendulum
- For the reward function to be effective:
- should be dense and informative to speed up learning
- Estimate the Lyapunov exponent from samples (see the sketch after this list)
- “Ride the crest of sensitivity” to maximum instability
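A minimal sketch of how such a sensitivity reward could be estimated from samples, assuming a `step_fn(state) -> next_state` simulator of the (closed-loop) dynamics; the perturb-and-rollout estimator and all names here are illustrative, not the paper's exact method:

```python
import numpy as np

def truncated_lyapunov_reward(step_fn, state, horizon=10, eps=1e-4):
    # Perturb the state slightly and roll both copies forward under the dynamics.
    ref = np.asarray(state, dtype=float)
    pert = ref + eps * np.random.randn(*ref.shape)
    d0 = np.linalg.norm(pert - ref)
    for _ in range(horizon):
        ref = np.asarray(step_fn(ref), dtype=float)
        pert = np.asarray(step_fn(pert), dtype=float)
    d_T = np.linalg.norm(pert - ref)
    # Average exponential separation rate over the truncated horizon: high values
    # mark maximally sensitive states, which the controller is rewarded for visiting.
    return np.log(max(d_T, 1e-12) / d0) / horizon
```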
Lyapunov Spectra of Dynamical Systems
Observations
- SuPLE does not require the specification of a target state
- Guides the agent towards unstable equilibria (and stabilizes it there)
- No need to specify the unstable equilibria.
A dynamical systems framework for Reinforcement Learning Safety and Robustness verification
- Analyze the combination of the RL agent and its environment as a discrete-time autonomous dynamical system (the closed-loop system).
- Use FTLE to visualize LCS that act as a hidden “skeleton” governing the system's behavior.
- Framework provides an interpretable assessment of policy behavior
- Identify flaws in policies that appear successful based on reward alone.
- Susceptibility to adversarial perturbations.
- Small changes to an agent's sensory input can often cause catastrophic failure, leading to unsafe behavior.
- Significant barrier to deployment, as sensors are noisy and environments are unpredictable/adversarial.
- “Robustness” problem solutions:
- constrained optimization
- Lyapunov based methods
- zero sum game against adversary
- these rely on statistical assumptions or are computationally expensive
- Instead, use the system dynamics themselves to analyze the policy's performance
- Use FTLE as a measure:
- a scalar field that measures the maximal rate at which nearby trajectories separate over a finite time window.
- Regions of high FTLE form ridges called LCS
- these partition the flow into distinct coherent regions
- Framework maps the dynamic structure to the critical properties of an RL agent’s policy:
- Repelling LCS: compute FTLE in forward time
- the resulting LCS is repelling:
- divides nearby trajectories that cannot cross over (barrier)
- for a safe policy, the agent must learn to create strong repelling LCS that wall off unsafe regions of the state space
- their presence verifies that the agent has learned an effective avoidance policy (“keep-out” zones)
- Attracting LCS (verify robustness):
- Act as highways and attractors
- For robust policy, find a single dominant basin of attraction centered on the desired goal state.
- The existence of other strong attractors reveals “trap” states where the agent can become permanently stuck
- Computation will require inverting the flow map?
- Can approximate their locations by simulating a large number of trajectories from random initial states and creating a 2D histogram of their final states at $T_{int}$ (see the sketch below)
- High-density regions are attractors
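A sketch of the histogram approximation, assuming hypothetical helpers `flow_map(s0)` (integrates the closed-loop system to $T_{int}$) and `sample_state()` (draws a random initial state); only the first two state dimensions are binned:

```python
import numpy as np

def attractor_histogram(flow_map, sample_state, n_samples=5000, bins=100):
    # Simulate many trajectories from random initial states and keep the final states.
    finals = np.array([flow_map(sample_state()) for _ in range(n_samples)])
    # High-density bins of the 2D histogram approximate attracting regions.
    H, xedges, yedges = np.histogram2d(finals[:, 0], finals[:, 1], bins=bins)
    return H, xedges, yedges
```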
RL through Dynamical Systems
- When policy $\pi$ is deployed, it creates a closed-loop system with the env.
- \[s_{k+1}=f(s_k) = T(s_k, \pi(s_k))\]
- Use flow maps $\Phi$ to understand long-term system behavior (over a finite time horizon)
- \[\Phi(s_0) = f^T(s_0) = (f \circ f \circ \dots \circ f)(s_0)\]
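A minimal sketch of the closed-loop step and flow map, assuming `transition(s, a)` and a deterministic `policy(s)` are available as callables (hypothetical names, not a specific library API):

```python
def closed_loop_step(transition, policy, s):
    # One step of the autonomous closed-loop system: s_{k+1} = T(s_k, pi(s_k)).
    return transition(s, policy(s))

def flow_map(transition, policy, s0, T):
    # Flow map Phi: compose the closed-loop step T times starting from s0.
    s = s0
    for _ in range(T):
        s = closed_loop_step(transition, policy, s)
    return s
```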
Calculating FTLE
- To calculate local deformation around a state $s_0$
- Find the deformation gradient tensor, which is the Jacobian of the flow map:
- $J(s_0) = \frac{\partial \Phi(s_0)}{\partial s_0}$
- Jacobian describes both the stretching and rotation around the neighbourhood
- To isolate stretching (the main driver of trajectory separation), use the right Cauchy-Green deformation tensor $C$:
- $C(s_0) = J(s_0)^T J(s_0)$
- The largest eigenvalue is the maximal squared stretching initiated at $s_0$.
- FTLE definition: average exponential rate of maximal stretching: \(\sigma(s_0, T_{int}) = \frac{1}{\mid T \mid}\ln \sqrt{\lambda_{max}(C(s_0))}\)
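A sketch of this computation given a flow-map Jacobian `J` over horizon `T` (how `J` is obtained on a discrete state space is covered in the next section):

```python
import numpy as np

def ftle_from_jacobian(J, T):
    # Right Cauchy-Green deformation tensor isolates stretching: C = J^T J.
    C = J.T @ J
    # Its largest eigenvalue is the maximal squared stretching around s_0.
    lam_max = np.max(np.linalg.eigvalsh(C))  # eigvalsh: C is symmetric PSD
    # FTLE: average exponential rate of maximal stretching over |T| steps.
    return np.log(np.sqrt(lam_max)) / abs(T)
```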
Calculate FTLE on discrete state space
- $\Phi$ is not an analytical function, so the Jacobian cannot be computed directly.
- Approximate the Jacobian using a finite difference scheme.
- For $s_0$ and step size $h$:
- Consider immediate neighbours, define perturbed initial points along each coordinate axis $i$: $s_{0,i} = s_0 + he_i$
- $e_i$ is the unit vector in the $i$-th dimension.
- Final perturbation vector (after flow map transformation):
- $u_i = \Phi(s_{0,i}) - \Phi(s_0)$
- These vectors form the columns of an approximation of $J$
- $J \approx \frac{1}{h}\begin{bmatrix} u_1 & u_2 & \dots & u_n \end{bmatrix}$
- Can compute $C$ and $\sigma$ from $J$ (see the sketch after this list)
- FTLE complexity:
- Dominant step is flow map calc
- For each of the $\mid S \mid$ states, the algorithm simulates the policy forward for $T$ steps
- Total complexity: $O(\mid S \mid \cdot T)$
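A sketch of the finite-difference approximation, assuming the hypothetical `flow_map(s)` helper from above integrates the closed-loop policy for $T$ steps; sweeping it over every state in the grid and applying `ftle_from_jacobian` yields the FTLE field, which is where the $O(\mid S \mid \cdot T)$ cost comes from:

```python
import numpy as np

def finite_difference_jacobian(flow_map, s0, h=1e-3):
    # Column i of J is approximately (Phi(s0 + h*e_i) - Phi(s0)) / h.
    s0 = np.asarray(s0, dtype=float)
    base = np.asarray(flow_map(s0), dtype=float)
    J = np.zeros((base.size, s0.size))
    for i in range(s0.size):
        e_i = np.zeros(s0.size)
        e_i[i] = 1.0
        J[:, i] = (np.asarray(flow_map(s0 + h * e_i), dtype=float) - base) / h
    return J
```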
Experiments:
- DQN agent trained on 2D grid-worlds
- Gym environments:
- SAC, TD3, TQC used (from Hugging Face)
- For environments with 3 or more dimensions, analyze 2D slices of the dynamical landscape.
- Fix the other state variables to logical constant values (e.g., zero velocity and zero angle)
- MountainCarContinuous: requires an agent to build momentum to escape a valley
- Pendulum-v1: stabilization problem, balance at top
- LunarLanderContinuous-v2: 8-dimensional problem
Lagrangian Coherent Structures (LCS) and Finite Time Lyapunov Exponents (FTLE)
- Coherent structures arising from the mixing of fluid flows
Dynamic Potential-Based Reward Shaping
- Static potential based Reward Shaping: \(F(s, s') = \gamma \Phi(s') - \Phi(s)\)
- Guarantees convergence to the same optimal policy that would be learned without reward shaping.
- Typically implemented using domain-specific heuristic knowledge
- Need to automate the encoding of knowledge
- These methods alter the potential of states online whilst the agent is learning.
- “Dynamic Shaping”
- The widely held opinion is that, for this to work, the potential function must converge before the agent can.
- But even without this assumption, the policy converges under a changing reward transformation.
- Only one condition needed for dynamic potential-based reward shaping: \(F(s, t, s', t') = \gamma \Phi(s', t') - \Phi(s,t)\)
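A minimal sketch of the dynamic shaping term, assuming a user-supplied, possibly still-changing `potential(state, time)` function (hypothetical name); the shaping bonus is simply added to the environment reward:

```python
def shaped_reward(r, s, t, s_next, t_next, potential, gamma=0.99):
    # Dynamic potential-based shaping: F(s, t, s', t') = gamma * Phi(s', t') - Phi(s, t),
    # added on top of the environment reward r.
    F = gamma * potential(s_next, t_next) - potential(s, t)
    return r + F
```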