Find the best choice of trajectory to get from A to B, by selecting inputs to the system (controls) as functions of time.
A solution to Traj Opt is feasible if it satisfies the constraints; its control sequence is called admissible.
The optimal trajectory is the feasible trajectory that minimizes the objective function.
Constraints:
System dynamics constraints: describe how the system changes over time (typically non-linear)
Path constraints: enforce restrictions along the trajectory.
Used, for example, to keep the foot of a walking robot above the ground during a step (or, more generally, to keep the robot out of some unsafe set quantified by a value function).
Boundary constraints: restrictions on the initial/final state of the system.
Constant limits on state or control variables that the trajectory cannot violate.
Limits on initial and final time (and state)
Traj Opt is a collection of techniques that find an open-loop solution to an optimal control problem.
You get a sequence of controls (as a function of time) that moves the system from a single initial state to some final state.
Dynamic Programming: finds an optimal policy. It provides optimal control for every point in the state space.
Hence, it's closed-loop and reacts to changes in state.
Traj Opt is useful for systems that are high-dimensional, have a large state space, or need to be very accurate.
Since it is open-loop, it must be combined with a stabilizing controller when applied to a real system.
Sometimes traj opt will fail to converge, or will converge to a local optimum.
Since Dyn Prog is closed-loop, no additional controller is needed.
Also, it will always find the globally optimal solution.
But it is very expensive and scales exponentially with the dimension of the problem ("curse of dimensionality").
Non-linear programming
Basically, a constrained parameter optimization problem with a non-linear objective or constraints.
Typical formulation: \(\min_z J(z)\) subject to: \(f(z) = 0\), \(g(z) \leq 0\), \(z_{low} \leq z \leq z_{upp}\)
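A minimal sketch of this NLP template using SciPy's SLSQP solver; the particular J, f, g, and bounds below are toy stand-ins, not from the notes:

```python
import numpy as np
from scipy.optimize import minimize

J = lambda z: (z[0] - 1.0) ** 2 + z[1] ** 2      # objective J(z)
f = lambda z: np.array([z[0] + z[1] - 1.0])      # equality constraint f(z) = 0
g = lambda z: np.array([z[0] - 0.8])             # inequality constraint g(z) <= 0

result = minimize(
    J,
    x0=np.zeros(2),                              # initial guess
    bounds=[(-2.0, 2.0)] * 2,                    # z_low <= z <= z_upp
    constraints=[
        {"type": "eq", "fun": f},
        {"type": "ineq", "fun": lambda z: -g(z)},  # SciPy convention: fun(z) >= 0
    ],
    method="SLSQP",
)
print(result.x, result.fun)
```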
Direct collocation method
Traj Opt can be direct or indirect
Direct methods discretize the traj opt problem, converting it into a non-linear program.
This conversion is known as “transcription” (converting from one format to another)
Direct Transcription (DT) methods discretize by approximating all the continuous functions as polynomial splines.
A spline is a function made up of a sequence of polynomial segments.
Polynomials are used because they have finitely many coefficients and are easy to integrate/differentiate.
Key insight: discretize time and enforce the state, control, and dynamics constraints at each of those times.
The summation below only requires the value of the integrand $w(t_k) = w_k$ at the collocation points $t_k$.
Apply the trapezoid rule for integration between each pair of collocation points: \(\int_{t_0}^{t_F} w(\tau, x(\tau), u(\tau))\, d\tau \approx \sum_{k=0}^{N-1} \frac{1}{2} h_k (w_k + w_{k+1})\)
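A tiny sketch of this quadrature rule; the integrand values are an illustrative toy example:

```python
import numpy as np

t = np.linspace(0.0, 1.0, 11)                  # collocation points t_0 .. t_N
w = t ** 2                                     # integrand values w_k (toy)
h = np.diff(t)                                 # step sizes h_k = t_{k+1} - t_k
integral = np.sum(0.5 * h * (w[:-1] + w[1:]))  # sum_k 1/2 h_k (w_k + w_{k+1})
print(integral)                                # ~1/3, the exact integral of t^2
```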
Discretizing system dynamics
DT represents system dynamics as a set of constraints
For trapezoidal collocation, write the dynamics in integral form and then approximate using trapezoid quadrature: \(\dot{x} = f \;\Rightarrow\; \int_{t_k}^{t_{k+1}} \dot{x}\, dt = \int_{t_k}^{t_{k+1}} f\, dt \;\Rightarrow\; x_{k+1} - x_k \approx \frac{1}{2} h_k (f_k + f_{k+1})\)
This approximation is then applied between every pair of collocation points, $k \in 0 \ldots (N-1)$.
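A hedged sketch of the whole transcription for a double integrator (a block-move problem): the defect constraints below are exactly the trapezoidal approximation above, while the problem setup, names, and the dense SciPy solve are illustrative choices (real implementations use sparse NLP solvers such as IPOPT or SNOPT):

```python
import numpy as np
from scipy.optimize import minimize

N, T = 20, 1.0                    # segments and final time
h = T / N                         # uniform step h_k
n_x, n_u = 2, 1                   # state (pos, vel), control (accel)

def unpack(z):
    x = z[: (N + 1) * n_x].reshape(N + 1, n_x)
    u = z[(N + 1) * n_x :].reshape(N + 1, n_u)
    return x, u

def dynamics(x, u):               # x_dot = f(x, u): pos_dot = vel, vel_dot = u
    return np.column_stack([x[:, 1], u[:, 0]])

def objective(z):                 # trapezoidal quadrature of the effort u^2
    _, u = unpack(z)
    w = u[:, 0] ** 2
    return np.sum(0.5 * h * (w[:-1] + w[1:]))

def defects(z):                   # x_{k+1} - x_k - 1/2 h (f_k + f_{k+1}) = 0
    x, u = unpack(z)
    f = dynamics(x, u)
    return (x[1:] - x[:-1] - 0.5 * h * (f[:-1] + f[1:])).ravel()

def boundary(z):                  # start at rest at pos 0, end at rest at pos 1
    x, _ = unpack(z)
    return np.concatenate([x[0] - [0.0, 0.0], x[-1] - [1.0, 0.0]])

z0 = np.zeros((N + 1) * (n_x + n_u))
z0[: (N + 1) * n_x : n_x] = np.linspace(0.0, 1.0, N + 1)  # straight-line guess

sol = minimize(objective, z0, method="SLSQP",
               constraints=[{"type": "eq", "fun": defects},
                            {"type": "eq", "fun": boundary}])
x_opt, u_opt = unpack(sol.x)
```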
Implementing constraints
The other constraints are enforced at their respective collocation points (path constraints at all points, boundary constraints at the first and last points).
Traj Opt problems with path constraints are much harder to solve than those without.
Creating splines and interpolating
Trapezoidal DT approximates the control trajectory and system dynamics as piecewise linear functions (linear splines).
Knot point: any point that joins two polynomial segments.
The state trajectory is represented by a quadratic spline: \(x(t) \approx x_k + f_k\tau + \frac{\tau^2}{2h_k}(f_{k+1} - f_k)\), where \(\tau = t - t_k\).
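A small sketch of evaluating these splines on one segment, with \(\tau = t - t_k\); the helper names are mine:

```python
def interp_state(tau, x_k, f_k, f_k1, h_k):
    """Quadratic state spline: x_k + f_k*tau + tau^2/(2 h_k) * (f_{k+1} - f_k)."""
    return x_k + f_k * tau + tau ** 2 / (2.0 * h_k) * (f_k1 - f_k)

def interp_control(tau, u_k, u_k1, h_k):
    """Linear control spline between the knot points."""
    return u_k + (tau / h_k) * (u_k1 - u_k)
```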
Practical Considerations
Initialization
Traj Opt requires a good initial guess to begin optimization.
A bad initialization can cause the NLP solver to fail even on a solvable problem.
The presence of constraints makes it worse: some starting points will yield no feasible solution.
NLP solvers cannot always find a solution, and when they do it is only locally optimal.
Usually initialization requires problem-specific knowledge. It is more of an art than a science, but there are some general tips:
Assume a straight line in state space between the initial and final state (see the sketch after these tips).
Solve a simpler version of the problem (simpler kinematics, unconstrained) and use that as an initial guess
Good way to find bugs in both your problem statement and code.
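A minimal sketch of the straight-line guess from the first tip, assuming states are stacked as rows and controls are initialized to zero:

```python
import numpy as np

def straight_line_guess(x_init, x_final, N, n_u):
    x_guess = np.linspace(x_init, x_final, N + 1)  # (N+1, n_x): linear in each state
    u_guess = np.zeros((N + 1, n_u))               # zero controls to start
    return x_guess, u_guess
```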
Mesh Refinement
Short time steps (dense mesh) vs longer time steps (coarse mesh)
Short time steps: high compute cost
Mesh refinement: process by which a Traj Opt problem is solved on a sequence of different collocation meshes
Proceed from coarse to finer meshes.
Segments with small error are kept (not replaced).
Segments with large error are replaced by the finer-mesh solution.
Error Analysis
Error comes from the discretization (from both the method and the grid).
Estimate it at non-collocation points by checking how well the interpolated spline satisfies the dynamics (see the sketch below).
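A hedged sketch of one such estimate (an assumption on my part, not spelled out in the notes): sample points between the knots, compare the spline's derivative against the true dynamics, and integrate the mismatch per segment:

```python
import numpy as np

def segment_errors(x_dot_spline, x_spline, u_spline, f, t_knots, n_sub=10):
    """Integrated dynamics defect per segment; large values flag segments
    that mesh refinement should subdivide. All arguments except t_knots
    are callables; names are illustrative."""
    errs = []
    for t0, t1 in zip(t_knots[:-1], t_knots[1:]):
        ts = np.linspace(t0, t1, n_sub)           # non-collocation points
        mismatch = [np.linalg.norm(x_dot_spline(t) - f(x_spline(t), u_spline(t)))
                    for t in ts]
        errs.append(np.trapz(mismatch, ts))       # integrate the defect
    return np.array(errs)
```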
Debugging Code:
A family of solutions may exist instead of a unique optimum.
Solution: add a regularization term to artificially create a unique optimum.
A non-smooth solution (control) might cause slow convergence.
Introduce a smoothing term so that the solution is numerically stiff but not discontinuous.
Or, solve using a multi-phase method.
Objective and constraint functions are consistent
Make sure the problem statement is not contradictory
Make sure you have a good initial guess
Consistent Functions:
DT solves a Traj Opt problem by converting it into an NLP.
The objective and constraint functions have to be consistent.
Consistent: performs the exact same sequence of arithmetic operations on each call
No logical branches, deterministic, and outputs that vary smoothly with inputs.
abs is not consistent: its derivative is discontinuous at the origin.
min and max are not consistent either.
The problem is the gradient: it gives no information on which direction to move.
Neat trick: many inconsistent functions can be implemented consistently by introducing extra decision variables (slack variables) and constraints to the problem.
Alternatively, use smoothing (see the sketch below).
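A small sketch of both tricks for |x|; the epsilon is an illustrative choice:

```python
import numpy as np

def smooth_abs(x, eps=1e-4):
    """Smoothing: sqrt(x^2 + eps^2) ~ |x|, with a continuous derivative
    everywhere (at the cost of a small error near the origin)."""
    return np.sqrt(x ** 2 + eps ** 2)

# Slack-variable version (exact, but adds variables and constraints):
# replace the term |x| in the objective with (s_plus + s_minus) and add
#   x - s_plus + s_minus == 0,   s_plus >= 0,   s_minus >= 0
# to the NLP; minimization drives s_plus + s_minus down to |x|.
```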
Continuous first derivatives are required by most solvers when computing gradients.
It is also good to have computable Hessians (second derivatives).
Time performance
For the cartpole it took 5 sec.
For the bipedal robot: coarse mesh 25 sec, fine mesh 60 sec.
Model-Based Reinforcement Learning via Latent-Space Collocation
Trajectory Planning in latent space of world models for long-horizon tasks
Long-horizon tasks are hard for planning algorithms that simulate the dynamics forward, as small deviations compound with each dynamics call.
This is known as the credit assignment problem:
determining which past states and actions deserve credit/blame for outcomes, especially when rewards are delayed, sparse, or noisy.
Better to optimize locally based on neighboring states in the trajectory.
Still should be dynamically feasible
Optimal Control Problem: \(\max_{s_{2:T},\, a_{1:T-1}} \sum_t r(s_t)\) such that \(s_{t+1} = f(s_t, a_t)\)
Only requires learning a dynamics model and a reward function
Collocation can be used here as plug-and-play.
Here it is applied over much higher-dimensional state vectors.
In literature:
Shooting formulations have been used for model-based RL: they suffer from local optima.
Graph-based methods have been used too: poor scaling.
HRL methods: plan over extended periods with intermediate subgoals (and then use separate model-free/model-based controllers to reach the subgoals).
Methodology
Key insight: relax the dynamics constraint to find high-reward regions first, then enforce it.
This is a consequence of primal-dual constrained optimization.
Learn Latent Dynamics
Same components (encoder, transition and decoder structure)
Markovian Property is upheld
Many environments are stochastic or partially observable, hence the need to account for uncertainty.
Shooting methods on this objective perform poorly due to the recursive application of the dynamics and hard credit assignment.
Collocation: pairwise dependencies between temporally adjacent latents, and no recursive application.
Due to stochasticity, formulate collocation as distributional constraints.
Optimize a sequence of distributions: \(\max_{q(z_{2:T}),\, a_{1:T-1}} \sum_t \mathbb{E}_{q(z_t)}[r(z_t)]\) such that: \(q(z_{t+1}) = \mathbb{E}_{q(z_t)}\, p_\phi(z_{t+1} \mid z_t, a_t)\) and \(q(z_1) = p(z_1)\)
Same as the above objective
Relaxes the dynamics constraint first to maximize reward, then enforces the constraints.
In contrast, shooting requires recursive application of the dynamics and backprop through time, which is difficult to optimize.
Can also be done deterministically:
Use a particle approach, where you take the mean (expected value) of the distribution as the dynamics constraint.
The variance is 0, since the distribution is approximated with one particle.
With Gaussian plans
Parameterize $q$ as a diagonal-covariance Gaussian: \(q(z_{1:T}) = \mathcal{N}(\mu_{1:T}, \sigma^2_{1:T})\)
Moment Matching
Equating statistical moments (mean, variance, etc.) between two distributions, typically matching a complex unknown distribution with a simpler parametric one like a Gaussian. (How are the mean and variance set? A priori?)
We want to minimize the differences between corresponding moments.
For $p$, the moments are estimated with samples (see the sketch below).
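A tiny sketch of sample-based moment matching (which also suggests an answer to the note's question: here the mean and variance come from the samples, not a priori):

```python
import numpy as np

samples = np.random.standard_t(df=3, size=(10_000, 4))  # stand-in for samples from p
mu = samples.mean(axis=0)       # match the first moment
sigma = samples.std(axis=0)     # match the second moment
# q = N(mu, diag(sigma^2)) is the moment-matched diagonal Gaussian
```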
For Visual Model-Based RL
Optimize by reformulating the problem as a saddle-point problem on the Lagrangian.
Similar to dual descent: a min-max problem.
Alternate between the max and min steps until convergence (very important; see the sketch below).
Not guaranteed to converge, but works in practice.
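A hedged sketch of the alternating scheme, assuming a learned latent reward and dynamics; the per-step multipliers, squared-violation penalty, and step sizes are illustrative choices, not necessarily the paper's exact update:

```python
import torch

def latent_collocation(reward, dynamics, z, a, steps=200, inner=10, eta=1e-2):
    """z: (T, z_dim) latent plan, a: (T-1, a_dim) actions; both must be leaf
    tensors with requires_grad=True. reward(z) -> (T,), dynamics(z, a) -> (T-1, z_dim)."""
    lam = torch.ones(z.shape[0] - 1)              # one multiplier per time step
    opt = torch.optim.Adam([z, a], lr=eta)
    for _ in range(steps):
        for _ in range(inner):                    # primal step: descend the Lagrangian
            viol = ((z[1:] - dynamics(z[:-1], a)) ** 2).sum(-1)
            loss = -reward(z).sum() + (lam * viol).sum()
            opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():                     # dual step: raise multipliers
            viol = ((z[1:] - dynamics(z[:-1], a)) ** 2).sum(-1)
            lam += eta * viol                     # violated steps get penalized more
    return z, a
```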
CNNs are used for the encoder and decoder; an RNN is used for the transition dynamics (see the skeleton below).
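A hedged PyTorch skeleton of these components; the layer sizes and exact modules are illustrative, not the paper's architecture:

```python
import torch.nn as nn

class WorldModel(nn.Module):
    def __init__(self, obs_ch=3, z_dim=32, a_dim=4, h_dim=200):
        super().__init__()
        self.encoder = nn.Sequential(                 # CNN: image -> latent stats
            nn.Conv2d(obs_ch, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(2 * z_dim))   # mean and log-variance
        self.transition = nn.GRUCell(z_dim + a_dim, h_dim)  # RNN dynamics
        self.to_latent = nn.Linear(h_dim, 2 * z_dim)
        self.decoder = nn.Sequential(                 # CNN: latent -> image (stub)
            nn.Linear(z_dim, 64 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, obs_ch, 4, stride=2))
        self.reward = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                    nn.Linear(h_dim, 1))   # reward head
        # (training and rollout code omitted; this only names the pieces)
```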