Find the best choice of trajectory to get from A to B, by selecting inputs to the system (controls) as functions of time.
A solution to Traj Opt is feasible if it satisfies the constraints; its control sequence is called admissible.
The optimal trajectory is the feasible trajectory that minimizes the objective function.
Constraints:
System dynamics constraints: describe how the system changes over time (typically non-linear)
Path constraints: enforce restrictions along the trajectory.
Used, for example, to keep the foot of a walking robot above the ground during a step (or, more generally, to keep the robot out of some unsafe set quantified by a value function).
Boundary constraints: restrictions on the initial/final state of the system.
Constant limits on state or control variables that the trajectory cannot violate.
Limits on initial and final time (and state)
Traj Opt is a collection of techniques that find an open-loop solution to an optimal control problem.
You get a sequence of controls (as a function of time) that moves the system from a single initial state to some final state.
Dynamic Programming: finds an optimal policy. It provides optimal control for every point in the state space.
Hence, it's closed-loop and reacts to changes in state.
Traj Opt is useful for systems that are high-dimensional, have a large state space, or need to be very accurate.
Since it is open-loop, it must be combined with a stabilizing controller when applied to a real system.
Sometimes traj opt will fail to converge, or will converge to a local optimum.
Since Dyn Prog is closed-loop, no additional controller is needed.
Also, it will always find the globally optimal solution.
But it is very expensive and scales exponentially with the dimension of the problem ("curse of dimensionality").
Non-linear programming
Basically, a constrained parameter optimization problem with a non-linear objective or constraints.
Typical formulation: \(\min_z J(z)\) subject to: \(f(z) = 0\), \(g(z) \leq 0\), \(z_{low} \leq z \leq z_{upp}\)
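A minimal sketch of this NLP template using SciPy's SLSQP solver; the particular J, f, g, and bounds below are toy stand-ins, not from the notes:

```python
import numpy as np
from scipy.optimize import minimize

J = lambda z: (z[0] - 1.0) ** 2 + z[1] ** 2      # objective J(z)
f = lambda z: np.array([z[0] + z[1] - 1.0])      # equality constraint f(z) = 0
g = lambda z: np.array([z[0] - 0.8])             # inequality constraint g(z) <= 0

result = minimize(
    J,
    x0=np.zeros(2),                              # initial guess
    bounds=[(-2.0, 2.0)] * 2,                    # z_low <= z <= z_upp
    constraints=[
        {"type": "eq", "fun": f},
        {"type": "ineq", "fun": lambda z: -g(z)},  # SciPy convention: fun(z) >= 0
    ],
    method="SLSQP",
)
print(result.x, result.fun)
```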
Direct collocation method
Traj Opt can be direct or indirect
Direct methods discretize the traj opt problem, converting it into a non-linear program.
This conversion is known as “transcription” (converting from one format to another)
Direct Transcription (DT) methods discretize by approximating all the continuous functions as polynomial splines.
A spline is a function made up of a sequence of polynomial segments.
Polynomials are used because they have finitely many coefficients and are easy to integrate/differentiate.
Key insight: discretize time and enforce the state, control, and dynamics constraints at each of those times.
The summation below only requires the value of the integrand $w(t_k) = w_k$ at the collocation points $t_k$.
Apply the trapezoid rule for integration between each pair of collocation points: \(\int_{t_0}^{t_F} w(\tau, x(\tau), u(\tau))\, d\tau \approx \sum_{k=0}^{N-1} \frac{1}{2} h_k (w_k + w_{k+1})\)
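A tiny sketch of this quadrature rule; the integrand values are an illustrative toy example:

```python
import numpy as np

t = np.linspace(0.0, 1.0, 11)                  # collocation points t_0 .. t_N
w = t ** 2                                     # integrand values w_k (toy)
h = np.diff(t)                                 # step sizes h_k = t_{k+1} - t_k
integral = np.sum(0.5 * h * (w[:-1] + w[1:]))  # sum_k 1/2 h_k (w_k + w_{k+1})
print(integral)                                # ~1/3, the exact integral of t^2
```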
Discretizing system dynamics
DT represents system dynamics as a set of constraints
For trapezoidal collocation, write the dynamics in integral form and then approximate using trapezoid quadrature: \(\dot{x} = f \;\Rightarrow\; \int_{t_k}^{t_{k+1}} \dot{x}\, dt = \int_{t_k}^{t_{k+1}} f\, dt \;\Rightarrow\; x_{k+1} - x_k \approx \frac{1}{2} h_k (f_k + f_{k+1})\)
This approximation is then applied between every pair of collocation points, $k \in 0 \ldots (N-1)$.
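A hedged sketch of the whole transcription for a double integrator (a block-move problem): the defect constraints below are exactly the trapezoidal approximation above, while the problem setup, names, and the dense SciPy solve are illustrative choices (real implementations use sparse NLP solvers such as IPOPT or SNOPT):

```python
import numpy as np
from scipy.optimize import minimize

N, T = 20, 1.0                    # segments and final time
h = T / N                         # uniform step h_k
n_x, n_u = 2, 1                   # state (pos, vel), control (accel)

def unpack(z):
    x = z[: (N + 1) * n_x].reshape(N + 1, n_x)
    u = z[(N + 1) * n_x :].reshape(N + 1, n_u)
    return x, u

def dynamics(x, u):               # x_dot = f(x, u): pos_dot = vel, vel_dot = u
    return np.column_stack([x[:, 1], u[:, 0]])

def objective(z):                 # trapezoidal quadrature of the effort u^2
    _, u = unpack(z)
    w = u[:, 0] ** 2
    return np.sum(0.5 * h * (w[:-1] + w[1:]))

def defects(z):                   # x_{k+1} - x_k - 1/2 h (f_k + f_{k+1}) = 0
    x, u = unpack(z)
    f = dynamics(x, u)
    return (x[1:] - x[:-1] - 0.5 * h * (f[:-1] + f[1:])).ravel()

def boundary(z):                  # start at rest at pos 0, end at rest at pos 1
    x, _ = unpack(z)
    return np.concatenate([x[0] - [0.0, 0.0], x[-1] - [1.0, 0.0]])

z0 = np.zeros((N + 1) * (n_x + n_u))
z0[: (N + 1) * n_x : n_x] = np.linspace(0.0, 1.0, N + 1)  # straight-line guess

sol = minimize(objective, z0, method="SLSQP",
               constraints=[{"type": "eq", "fun": defects},
                            {"type": "eq", "fun": boundary}])
x_opt, u_opt = unpack(sol.x)
```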
Implementing constraints
The other constraints are enforced at their respective collocation points (path constraints at all points, boundary constraints at the first and last points).
Traj Opt problems with path constraints are much harder to solve than those without.
Creating splines and interpolating
Trapezoidal DT approximates the control trajectory and system dynamics as piecewise linear functions (linear splines).
Knot point: any point that joins two polynomial segments.
The state trajectory is represented by a quadratic spline: \(x(t) \approx x_k + f_k\tau + \frac{\tau^2}{2h_k}(f_{k+1} - f_k)\), where \(\tau = t - t_k\).
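A small sketch of evaluating these splines on one segment, with \(\tau = t - t_k\); the helper names are mine:

```python
def interp_state(tau, x_k, f_k, f_k1, h_k):
    """Quadratic state spline: x_k + f_k*tau + tau^2/(2 h_k) * (f_{k+1} - f_k)."""
    return x_k + f_k * tau + tau ** 2 / (2.0 * h_k) * (f_k1 - f_k)

def interp_control(tau, u_k, u_k1, h_k):
    """Linear control spline between the knot points."""
    return u_k + (tau / h_k) * (u_k1 - u_k)
```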
Practical Considerations
Initialization
Traj Opt requires a good initial guess to begin optimization.
A bad initialization can cause the NLP solver to fail even on a solvable problem.
The presence of constraints makes it worse: some starting points will yield no feasible solution.
NLP solvers cannot always find a solution, and when they do it is only locally optimal.
Usually initialization requires problem-specific knowledge. It is more of an art than a science, but there are some general tips:
Assume a straight line in state space between the initial and final state (see the sketch after these tips).
Solve a simpler version of the problem (simpler kinematics, unconstrained) and use that as an initial guess
Good way to find bugs in both your problem statement and code.
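A minimal sketch of the straight-line guess from the first tip, assuming states are stacked as rows and controls are initialized to zero:

```python
import numpy as np

def straight_line_guess(x_init, x_final, N, n_u):
    x_guess = np.linspace(x_init, x_final, N + 1)  # (N+1, n_x): linear in each state
    u_guess = np.zeros((N + 1, n_u))               # zero controls to start
    return x_guess, u_guess
```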
Mesh Refinement
Short time steps (dense mesh) vs longer time steps (coarse mesh)
Short time steps: high compute cost
Mesh refinement: process by which a Traj Opt problem is solved on a sequence of different collocation meshes
Proceed from coarse to finer meshes.
Segments with small error are kept (not replaced).
Segments with large error are replaced by the finer-mesh solution.
Error Analysis
Error comes from the discretization (from both the method and the grid).
Estimate it at non-collocation points by checking how well the interpolated spline satisfies the dynamics (see the sketch below).
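A hedged sketch of one such estimate (an assumption on my part, not spelled out in the notes): sample points between the knots, compare the spline's derivative against the true dynamics, and integrate the mismatch per segment:

```python
import numpy as np

def segment_errors(x_dot_spline, x_spline, u_spline, f, t_knots, n_sub=10):
    """Integrated dynamics defect per segment; large values flag segments
    that mesh refinement should subdivide. All arguments except t_knots
    are callables; names are illustrative."""
    errs = []
    for t0, t1 in zip(t_knots[:-1], t_knots[1:]):
        ts = np.linspace(t0, t1, n_sub)           # non-collocation points
        mismatch = [np.linalg.norm(x_dot_spline(t) - f(x_spline(t), u_spline(t)))
                    for t in ts]
        errs.append(np.trapz(mismatch, ts))       # integrate the defect
    return np.array(errs)
```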
Debugging Code:
A family of solutions may exist instead of a unique optimum.
Solution: add a regularization term to artificially create a unique optimum.
A non-smooth solution (control) might cause slow convergence.
Introduce a smoothing term so that the solution is numerically stiff but not discontinuous.
Or, solve using a multi-phase method.
Objective and constraint functions are consistent
Make sure the problem statement is not contradictory
Make sure you have a good initial guess
Consistent Functions:
DT solves a Traj Opt problem by converting it into an NLP.
The objective and constraint functions have to be consistent.
Consistent: performs the exact same sequence of arithmetic operations on each call
No logical branches, deterministic, and outputs that vary smoothly with inputs.
abs is not consistent: its derivative is discontinuous at the origin.
min and max are not consistent either.
The problem is the gradient: it gives no information on which direction to move.
Neat trick: many inconsistent functions can be implemented consistently by introducing extra decision variables (slack variables) and constraints to the problem.
Alternatively, use smoothing (see the sketch below).
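A small sketch of both tricks for |x|; the epsilon is an illustrative choice:

```python
import numpy as np

def smooth_abs(x, eps=1e-4):
    """Smoothing: sqrt(x^2 + eps^2) ~ |x|, with a continuous derivative
    everywhere (at the cost of a small error near the origin)."""
    return np.sqrt(x ** 2 + eps ** 2)

# Slack-variable version (exact, but adds variables and constraints):
# replace the term |x| in the objective with (s_plus + s_minus) and add
#   x - s_plus + s_minus == 0,   s_plus >= 0,   s_minus >= 0
# to the NLP; minimization drives s_plus + s_minus down to |x|.
```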
Continuous first derivatives are required by most solvers when computing gradients.
It is also good to have computable Hessians (second derivatives).
Time performance
For the cartpole it took 5 sec.
For the bipedal robot: coarse mesh 25 sec, fine mesh 60 sec.
Model-Based Reinforcement Learning via Latent-Space Collocation
Trajectory Planning in latent space of world models for long-horizon tasks
Long-horizon tasks are hard for planning algorithms that simulate the dynamics forward, as small deviations compound with each dynamics call.
This is known as the credit assignment problem:
determining which past states and actions deserve credit/blame for outcomes, especially when rewards are delayed, sparse, or noisy.
Better to optimize locally based on neighboring states in the trajectory.
Still should be dynamically feasible
Optimal Control Problem: \(\max_{s_{2:T},\, a_{1:T-1}} \sum_t r(s_t)\) such that \(s_{t+1} = f(s_t, a_t)\)
Only requires learning a dynamics model and a reward function
Collocation can be used here as plug-and-play.
Here it is applied over much higher-dimensional state vectors.
In literature:
Shooting formulations have been used for model-based RL: they suffer from local optima.
Graph-based methods have been used too: poor scaling.
HRL methods: plan over extended periods with intermediate subgoals (and then use separate model-free/model-based controllers to reach the subgoals).
Methodology
Key insight: relax the dynamics constraint to find high-reward regions first, then enforce it.
This is a consequence of primal-dual constrained optimization.
Learn Latent Dynamics
Same components (encoder, transition and decoder structure)
Markovian Property is upheld
Many environments are stochastic or partially observable, hence the need to account for uncertainty.
Shooting methods on this objective perform poorly due to the recursive application of the dynamics and hard credit assignment.
Collocation: pairwise dependencies between temporally adjacent latents, and no recursive application.
Due to stochasticity, formulate collocation as distributional constraints.
Optimize a sequence of distributions: \(\max_{q(z_{2:T}),\, a_{1:T-1}} \sum_t \mathbb{E}_{q(z_t)}[r(z_t)]\) such that: \(q(z_{t+1}) = \mathbb{E}_{q(z_t)}\, p_\phi(z_{t+1} \mid z_t, a_t)\) and \(q(z_1) = p(z_1)\)
Same as the above objective
Relaxes the dynamics constraint first to maximize reward, then enforces the constraints.
In contrast, shooting requires recursive application of the dynamics and backprop through time, which is difficult to optimize.
Can also be done deterministically:
Use a particle approach, where you take the mean (expected value) of the distribution as the dynamics constraint.
The variance is 0, since the distribution is approximated with one particle.
With Gaussian plans
Parameterize $q$ as a diagonal-covariance Gaussian: \(q(z_{1:T}) = \mathcal{N}(\mu_{1:T}, \sigma^2_{1:T})\)
Moment Matching
Equating statistical moments (mean, variance, etc.) between two distributions, typically matching a complex unknown distribution with a simpler parametric one like a Gaussian. (How are the mean and variance set? A priori?)
We want to minimize the differences between corresponding moments.
For $p$, the moments are estimated with samples (see the sketch below).
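A tiny sketch of sample-based moment matching (which also suggests an answer to the note's question: here the mean and variance come from the samples, not a priori):

```python
import numpy as np

samples = np.random.standard_t(df=3, size=(10_000, 4))  # stand-in for samples from p
mu = samples.mean(axis=0)       # match the first moment
sigma = samples.std(axis=0)     # match the second moment
# q = N(mu, diag(sigma^2)) is the moment-matched diagonal Gaussian
```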
For Visual Model-Based RL
Optimize by reformulating the problem as a saddle-point problem on the Lagrangian.
Similar to dual descent: a min-max problem.
Alternate between the max and min steps until convergence (very important; see the sketch below).
Not guaranteed to converge, but works in practice.
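A hedged sketch of the alternating scheme, assuming a learned latent reward and dynamics; the per-step multipliers, squared-violation penalty, and step sizes are illustrative choices, not necessarily the paper's exact update:

```python
import torch

def latent_collocation(reward, dynamics, z, a, steps=200, inner=10, eta=1e-2):
    """z: (T, z_dim) latent plan, a: (T-1, a_dim) actions; both must be leaf
    tensors with requires_grad=True. reward(z) -> (T,), dynamics(z, a) -> (T-1, z_dim)."""
    lam = torch.ones(z.shape[0] - 1)              # one multiplier per time step
    opt = torch.optim.Adam([z, a], lr=eta)
    for _ in range(steps):
        for _ in range(inner):                    # primal step: descend the Lagrangian
            viol = ((z[1:] - dynamics(z[:-1], a)) ** 2).sum(-1)
            loss = -reward(z).sum() + (lam * viol).sum()
            opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():                     # dual step: raise multipliers
            viol = ((z[1:] - dynamics(z[:-1], a)) ** 2).sum(-1)
            lam += eta * viol                     # violated steps get penalized more
    return z, a
```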
CNNs are used for the encoder and decoder; an RNN is used for the transition dynamics (see the skeleton below).
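A hedged PyTorch skeleton of these components; the layer sizes and exact modules are illustrative, not the paper's architecture:

```python
import torch.nn as nn

class WorldModel(nn.Module):
    def __init__(self, obs_ch=3, z_dim=32, a_dim=4, h_dim=200):
        super().__init__()
        self.encoder = nn.Sequential(                 # CNN: image -> latent stats
            nn.Conv2d(obs_ch, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(2 * z_dim))   # mean and log-variance
        self.transition = nn.GRUCell(z_dim + a_dim, h_dim)  # RNN dynamics
        self.to_latent = nn.Linear(h_dim, 2 * z_dim)
        self.decoder = nn.Sequential(                 # CNN: latent -> image (stub)
            nn.Linear(z_dim, 64 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, obs_ch, 4, stride=2))
        self.reward = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                    nn.Linear(h_dim, 1))   # reward head
        # (training and rollout code omitted; this only names the pieces)
```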