Hamilton-Jacobi Reachability

Based on the following overview papers.

Introduction

HJ reachability analysis is a formal verification method for guaranteeing performance and safety properties of dynamical systems.
A reach-avoid set is computed:
- the set of states from which the system can be driven to a target while staying clear of unsafe states
Applications:
- Aircraft landing, model predictive control (MPC) of quadrotors, path planning, real-time safe motion planning.
Limitations:
- Intractable as the state-space dimension increases
  - The PDE is solved on a grid over the state space, so the cost grows exponentially with the dimension (curse of dimensionality)
Backward reachable sets for safety:
- Start from known unsafe states and compute backward reachable sets which the system should avoid.
- HJ reachability applies to nonlinear systems, handles both control and disturbance inputs, and can represent arbitrarily shaped sets.

Backward reachable set (BRS): the set of states from which trajectories can reach a given target set.
- If the target consists of failure states, then the BRS contains states that are potentially unsafe and should be avoided.
The BRS is computed as the outcome of a two-player game:
- Player 1 and Player 2 are the two control inputs
- Player 1, $a(\cdot)$, tries to steer the system away from the target
- Player 2, $b(\cdot)$, tries to steer the system toward the target
- The system dynamics are solved backward in time: $\dot{x}(s) = f(x(s), a(s), b(s)),\ s \in [t,0],\ a(s) \in \mathcal{A},\ b(s) \in \mathcal{B}$, with $t \leq 0$
- The BRS is then: $\mathcal{G}(t) = \{x : \exists \gamma \in \Gamma(t), \forall a(\cdot) \in A, \zeta(0; x, t, a(\cdot), \gamma[a](\cdot)) \in \mathcal{G}_0 \}$
  - Here $\zeta(s; x, t, a(\cdot), b(\cdot))$ is the state at time $s$ of the trajectory that starts from $x$ at time $t$ under the inputs $a(\cdot)$ and $b(\cdot)$; $\mathcal{G}_0$ is the target set and $\Gamma(t)$ is the set of Player 2's non-anticipative strategies.
- Requires solving a “differential game”
- Player 2 can only use non-anticipative strategies.
  - It cannot respond differently to two Player 1 control signals until those signals themselves differ
  - Player 2 still has the advantage of knowing Player 1’s choice of input at every instant
    - “Instantaneous informational advantage”
    - Important to safety and robust control
Reach-avoid problems: the goal is to control the agent so that it reaches a target set of states while avoiding a failure set of states.

Cost definition: $J_t(x, a(\cdot), b(\cdot)) = \int_t^0 c(x(s), a(s), b(s), s)\,ds + q(x(0))$
Lower value of the game (under the non-anticipative strategy assumption): $G(t,x) = \inf_{\gamma \in \Gamma(t)} \sup_{a(\cdot) \in A} J_t(x, a(\cdot), \gamma[a](\cdot))$
- The best value Player 1 can obtain when Player 2 plays optimally.
- Player 1 ($a$) tries to maximize the cost, while Player 2 ($b$) counters by minimizing it.
$G(t, x)$ is the viscosity solution of the HJ-Isaacs (HJI) PDE:
- \[D_t G(t, x) + H(t, x, \nabla G(t,x)) = 0, \quad G(0, x) = q(x)\]
- \[H(t, x, \lambda) = \max_a \min_b \left[ c(x,a,b,t) + \lambda \cdot f(x,a,b) \right]\]
- The Hamiltonian encodes the instantaneous contest between the two players at each point in time and space
- $\nabla G(t,x)$ is called the costate
- Optimal control for $a$ given the value function ($b$ can be found similarly): $a^*(t,x) = \arg\max_a \min_b \left[ c(x,a,b,t) + \nabla G(t,x) \cdot f(x,a,b) \right]$
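
As a concrete illustration of the Hamiltonian and the optimal-control formula above, the following sketch evaluates $\max_a \min_b [c + \lambda \cdot f]$ by brute force over sampled control values. The dynamics $\dot{x} = a + b$, the zero running cost, and the bounds `A_MAX` and `B_MAX` are assumptions chosen only for illustration; they do not come from the source papers.

```python
import numpy as np

A_MAX, B_MAX = 1.0, 0.5  # hypothetical control bounds for Player 1 and Player 2


def f(x, a, b):
    """Toy 1D dynamics (an assumption): both players push the state directly."""
    return a + b


def c(x, a, b, t):
    """Running cost; zero here, as in a pure reachability problem."""
    return 0.0


def hamiltonian(t, x, lam, n=21):
    """Return (H, a_star) with H = max_a min_b [c + lam * f], by enumeration."""
    a_grid = np.linspace(-A_MAX, A_MAX, n)
    b_grid = np.linspace(-B_MAX, B_MAX, n)
    best_val, best_a = -np.inf, None
    for a in a_grid:
        # Player 2 observes a and picks the b that is worst for Player 1.
        inner = min(c(x, a, b, t) + lam * f(x, a, b) for b in b_grid)
        if inner > best_val:
            best_val, best_a = inner, float(a)
    return best_val, best_a


H, a_star = hamiltonian(t=0.0, x=0.0, lam=2.0)
print(H, a_star)
```

For this toy system the closed form is $H = |\lambda| (a_{max} - b_{max})$, which the enumeration reproduces because the extremal controls lie on the sampled grid.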

BRS: the set of states from which the system can reach the target set exactly at time 0, that is, after a duration of exactly $|t|$.
BRT (backward reachable tube): the set of states from which the system can reach the target within a duration of $|t|$.
- More applicable to safety
- A tube captures reaching the target at any time within the horizon, not only at its end
- Definition: $\mathcal{G}(t) = \{x : \exists \gamma \in \Gamma(t), \forall a(\cdot) \in A, \exists s \in [t,0], \zeta(s; x, t, a(\cdot), \gamma[a](\cdot)) \in \mathcal{G}_0 \}$
- Can be computed by solving a final-value PDE (like the BRS)
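
The difference between a BRS and a BRT can be seen on a toy discrete-time system. The sketch below assumes a single controller (no adversary) on the integers with $x_{k+1} = x_k + u$, $u \in \{-1, +1\}$, and target $\{0\}$; these choices and the function names are illustrative and not from the source.

```python
def predecessors(states, controls=(-1, 1)):
    """States x from which some control u gives x + u in `states`."""
    return {x - u for x in states for u in controls}


def backward_reachable_set(target, steps):
    """States that can be in the target after exactly `steps` steps."""
    current = set(target)
    for _ in range(steps):
        current = predecessors(current)
    return current


def backward_reachable_tube(target, steps):
    """States that can be in the target at some step in 0..steps."""
    current = set(target)
    tube = set(current)
    for _ in range(steps):
        current = predecessors(current)
        tube |= current
    return tube


brs = backward_reachable_set({0}, 3)
brt = backward_reachable_tube({0}, 3)
print(sorted(brs))  # [-3, -1, 1, 3]
print(sorted(brt))  # [-3, -2, -1, 0, 1, 2, 3]
```

The BRS misses states such as 0 and 2: they can hit the target earlier, but because this system cannot stand still they cannot be at the target at exactly step 3. The BRT includes them, which is why tubes are the more natural object for safety.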

Grid-based approach: discretize the state space and solve with dynamic programming
- In practice limited to systems with at most 6 dimensions.
Learning-based approach: learn the HJ reachability value function together with the control policy
- Uses a Bellman formulation
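
The dimension limit for grid-based dynamic programming can be made concrete with a quick memory estimate. The sketch below assumes 100 grid points per axis and one 4-byte float per grid node; both numbers are illustrative choices, not values from the source.

```python
def grid_memory_bytes(points_per_dim, dims, bytes_per_value=4):
    """Memory needed to store one value per grid node (float32 by default)."""
    return bytes_per_value * points_per_dim ** dims


for d in range(2, 9):
    gib = grid_memory_bytes(100, d) / 2**30
    print(f"{d}D grid, 100 points per axis: {gib:.3g} GiB")
```

Each extra dimension multiplies the storage (and roughly the work per sweep) by the number of points per axis, which is the exponential growth behind the curse of dimensionality.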

This formulation assumes no disturbance.
Optimal control formulation: $V(x) := \sup_{u(\cdot)} \inf_{t \geq 0} l(\zeta^u_x(t))$
The inf term picks the smallest value of $l$ along the trajectory $\zeta^u_x$ generated by a control policy, over all time.
The sup term picks the policy whose worst-case (inf) value along its trajectory is largest.
If $V(x) \geq 0$, then there exists a $u(\cdot)$ that keeps the trajectory outside the failure set.
If $V(x) < 0$, then no $u(\cdot)$ can keep the trajectory from entering the failure set.
Here $l(x)$ is the signed distance to the failure set: negative inside it and positive outside.
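
To make the inf term concrete, the sketch below evaluates the smallest safety margin along a simulated trajectory for one fixed policy. The 1D dynamics $\dot{x} = u$, the interval failure set $[-1, 1]$, the finite horizon, and the two policies are all assumptions for illustration; a finite rollout only approximates the infimum over all $t \geq 0$.

```python
FAIL_LO, FAIL_HI = -1.0, 1.0  # hypothetical failure set: the interval [-1, 1]


def l(x):
    """Signed distance to the failure set: positive outside, negative inside."""
    return max(FAIL_LO - x, x - FAIL_HI)


def min_margin(x0, policy, dt=0.01, steps=500):
    """Smallest l along a forward-Euler rollout of dx/dt = u under `policy`."""
    x, worst = x0, l(x0)
    for _ in range(steps):
        x = x + policy(x) * dt
        worst = min(worst, l(x))
    return worst


def away(x):
    """Policy that moves away from the failure set."""
    return 1.0 if x >= 0 else -1.0


def toward(x):
    """Policy that moves toward the failure set."""
    return -1.0 if x >= 0 else 1.0


print(min_margin(2.0, away))    # 1.0: the start is the closest point
print(min_margin(2.0, toward))  # negative: the trajectory enters the failure set
```

The value $V(x)$ is the best such margin over all policies, so a single rollout with a nonnegative margin already certifies (for this horizon) that the start state can avoid failure.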
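Discrete-time approximation with no disturbances: $V(x) = \min\{ l(x), \max_{u \in U} V(x') \}$, where $x' = x + f(x,u)\Delta t$. With the opposite sign convention ($\tilde{V} = -V$, $\tilde{l} = -l$), the same equation reads $\tilde{V}(x) = \max\{ \tilde{l}(x), \min_{u \in U} \tilde{V}(x') \}$.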
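
The discrete backup $V(x) = \min\{ l(x), \max_u V(x + f(x,u)\Delta t) \}$ can be iterated to a fixed point on a grid. The sketch below is a minimal example under assumed dynamics $f(x,u) = x - 1 + u$ with $u \in \{-0.5, 0.5\}$ and failure set $\{x < 0\}$; the system, grid, and iteration count are illustrative and not from the source.

```python
import numpy as np

# Hypothetical 1D example: x' = x + (x - 1 + u) * dt with u in {-0.5, 0.5}.
# The failure set is {x < 0}, so the signed distance is simply l(x) = x.
xs = np.linspace(-1.0, 3.0, 81)
controls = (-0.5, 0.5)
dt = 0.05
l = xs.copy()


def safety_backup(V):
    """One sweep of V(x) <- min{ l(x), max_u V(x + f(x, u) dt) }."""
    best_next = np.full_like(V, -np.inf)
    for u in controls:
        x_next = xs + (xs - 1.0 + u) * dt
        # Linear interpolation on the grid; np.interp clamps at the grid ends.
        best_next = np.maximum(best_next, np.interp(x_next, xs, V))
    return np.minimum(l, best_next)


V = l.copy()  # initialize with the signed distance
for _ in range(2000):
    V = safety_backup(V)

safe = xs[V >= 0]  # grid states from which failure can be avoided
print(safe.min())
```

In this toy system the drift $x - 1 + u$ is negative for every admissible $u$ when $x < 0.5$, so those states slide into the failure set and end up with $V < 0$, while states at or above 0.5 keep $V = l$.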

Use the discounted Bellman formulation: $V(x) = (1-\gamma)\,l(x) + \gamma \min\{ l(x), \max_{u \in U} V(x') \}$
$\gamma \in [0,1)$
The discount factor can be interpreted as the probability that the episode continues; its complement $1-\gamma$ is the probability of transitioning to a terminal state.
Update rule for a Q-learning scheme: $Q_{k+1}(x, u) \leftarrow Q_k(x,u) + \alpha_k \left[ (1-\gamma)\,l(x) + \gamma \min\{ l(x), \max_{u' \in U} Q_k(x', u') \} - Q_k(x,u) \right]$, where $x'$ is the successor state reached from $x$ under $u$.
This can similarly be extended to policy optimization methods such as actor-critic.
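
The Q-learning update above can be tried on a small tabular problem. The sketch below is a minimal example on a hypothetical 10-state chain in which the controller loses authority below state 5; the dynamics, the margin `l`, the uniform sampling of state-action pairs, and the constants `GAMMA` and `ALPHA` are all assumptions for illustration. A discount close to 1 is used because the discounted value only approximates the undiscounted one.

```python
import numpy as np

N_STATES, ACTIONS = 10, (-1, +1)  # hypothetical chain of states 0..9
GAMMA, ALPHA = 0.99, 0.5


def l(s):
    """Safety margin: negative only at the failure state s = 0."""
    return s - 0.5


def step(s, u):
    """Toy dynamics (an assumption): below state 5 the system slides left
    regardless of the action; from 5 upward the action sets the direction."""
    move = u if s >= 5 else -1
    return min(max(s + move, 0), N_STATES - 1)


rng = np.random.default_rng(0)
Q = np.zeros((N_STATES, len(ACTIONS)))
for _ in range(200_000):
    s = int(rng.integers(N_STATES))      # sample a state-action pair uniformly
    a = int(rng.integers(len(ACTIONS)))
    s_next = step(s, ACTIONS[a])
    # Discounted safety Bellman target from the update rule above.
    target = (1 - GAMMA) * l(s) + GAMMA * min(l(s), Q[s_next].max())
    Q[s, a] += ALPHA * (target - Q[s, a])

V = Q.max(axis=1)                      # V(s) = max_u Q(s, u)
safe_states = np.flatnonzero(V >= 0)
print(safe_states)
```

States 1 to 4 have a positive margin `l` but a negative learned value, because every trajectory from them ends in the failure state; this is the distinction between being currently safe and being able to stay safe.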