Imitation Learning
- Policies are learned from expert-collected demonstrations
- Done via Behavior Cloning
- Becomes a supervised learning problem: mapping states to actions (a minimal sketch follows this list)
- Diffusion policy
- Good for multimodal action distributions
- Applied in offline RL
- Diffuser
- learns a denoising diffusion model over full trajectories, modeling both states and actions, in a model-based RL setting
- Decision Diffuser: compositionality over skills, rewards, and constraints; diffuses over states only and uses an inverse dynamics model (IDM) to extract actions from the plan
- Restricted to low-dim states
- UniPi
- Action Chunking with Transformers (ACT)
- Datasets of expert demos do not provide sufficient state-distribution coverage to effectively solve a given task
- Additional data exists in the form of action-free or suboptimal data (failed policy rollouts, play data, misc. environment interactions)
- BC cannot leverage this data as it assumes optimal actions
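A minimal sketch of behavior cloning as supervised learning, assuming a fixed dataset of expert (state, action) pairs; `PolicyNet`, `demo_states`, and `demo_actions` are illustrative names, not from any specific paper:

```python
import torch
import torch.nn as nn

# Behavior cloning sketch: treat imitation as supervised regression
# from states to the expert's actions.
class PolicyNet(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state):
        return self.net(state)

def behavior_cloning(policy, demo_states, demo_actions, epochs=100, lr=1e-3):
    """Fit the policy to expert (state, action) pairs with an MSE loss."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        pred = policy(demo_states)                      # predicted actions
        loss = nn.functional.mse_loss(pred, demo_actions)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy
```

Note that the MSE loss regresses toward the mean of the expert actions, which collapses multimodal action distributions; this is one motivation for the diffusion policies below.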
Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
- Robot action distributions are multimodal, have sequential correlation, and require high precision.
- Diffusion Policy: directly outputs an action sequence conditioned on visual observations, via $K$ denoising iterations.
- Able to express multimodal action distributions.
- Because sampling starts from (and injects) Gaussian noise, different samples can settle into different action modes.
- Handles a high-dimensional output space with stable training.
- Closed-loop action sequences: receding-horizon control (similar to MPC); continuously re-plans in a closed-loop manner (see the sketch after this list).
- Use of time-series diffusion transformer.
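A rough sketch, under assumptions, of the receding-horizon denoising loop described above. The noise-prediction network `eps_model`, the `obs_encoder`, and the DDPM-style schedule tensors `alpha`, `alpha_bar`, `sigma` are hypothetical interfaces, not the paper's actual API; the environment is assumed to follow a gym-style `reset`/`step` interface.

```python
import torch

@torch.no_grad()
def sample_action_sequence(eps_model, obs_feat, horizon, action_dim,
                           K, alpha, alpha_bar, sigma):
    """Reverse diffusion: start from Gaussian noise over a whole action
    sequence and denoise it for K iterations, conditioned on the observation."""
    a = torch.randn(1, horizon, action_dim)
    for k in reversed(range(K)):
        eps = eps_model(a, torch.tensor([k]), obs_feat)  # predicted noise
        a = (a - (1 - alpha[k]) / (1 - alpha_bar[k]).sqrt() * eps) / alpha[k].sqrt()
        if k > 0:
            a = a + sigma[k] * torch.randn_like(a)       # inject noise except at the final step
    return a.squeeze(0)

def receding_horizon_control(env, eps_model, obs_encoder, action_dim,
                             horizon=16, execute=8, **schedule):
    """Closed-loop (MPC-like) control: predict a sequence, execute a prefix, re-plan."""
    obs, done = env.reset(), False
    while not done:
        obs_feat = obs_encoder(obs)                      # visual observation features
        plan = sample_action_sequence(eps_model, obs_feat, horizon,
                                      action_dim, **schedule)
        for a in plan[:execute]:                         # execute only the first few actions
            obs, reward, done, info = env.step(a.numpy())
            if done:
                break
```

Executing only a prefix of the predicted sequence before re-planning gives the closed-loop, receding-horizon behavior, while the full horizon still enforces temporal consistency within each chunk.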
SQIL: Imitation Learning via Reinforcement Learning with Sparse Rewards
- training an agent to imitate an expert, given expert action demonstrations and the ability to interact with the environment.
- the agent does not observe a reward signal, cannot query the expert, and does not know the state-transition dynamics.
- BC uses supervised learning to imitate
- compounding errors cause the agent to drift from the demonstrated states and encounter OOD states
- the agent does not know how to return to the demonstrated states
- IRL partially solved this problem (GAIL paper)
- constructs a reward signal from demonstrations through adversarial training; difficult to implement and use in practice
- Adversarial methods encourage imitation by:
- providing an incentive to imitate the demonstrated actions in demonstrated states
- providing an incentive to take actions that lead it back to demonstrated states when it encounters new, OOD states.
- Greedy methods (e.g., BC) only provide the first incentive.
- This paper achieves the effect of the adversarial approach by using constant rewards (instead of learned rewards)
- give the agent a reward of +1 for matching the demonstrated action in a demonstrated state
- reward of 0 for all other behavior
- this method achieves both incentives:
- the first via the sparse +1 reward for following the demo
- the second by using RL (instead of supervised learning) for training, which propagates value back to states that lead toward the demonstrations
- initialize the agent with soft Q-learning
- initialize the experience replay buffer with the expert demos
- set rewards to +1 in the demo experiences, and set rewards to 0 in all new experiences the agent collects while interacting with the environment.
- Soft Q Imitation Learning (SQIL)
- Not continuous actions then?
- To implement continuous actions, switch to an actor-critic method (like SAC); any off-policy algorithm works.
- Algorithm (see the sketch after this list):
- Initialize two datasets: demo (filled with expert demonstrations) and sample (empty at start)
- Repeat until the Q network converges:
- Update the parameters $\theta$ of $Q$ using the squared soft Bellman error:
- Sample 50% of the data from demo and 50% from sample
- demo transitions have a reward of 1; sample transitions have a reward of 0
- Sample transitions with the imitation policy
- fill the sample buffer
- Do until convergence
- Regardless of how many new transitions have been collected, each update uses a 50-50 split between demo and sample
- Over time, the agent is incentivized to produce more and more demo-like state-action trajectories, effectively mimicking the expert.
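A sketch of the SQIL loop for discrete actions (the soft Q-learning case; continuous actions would swap in an off-policy actor-critic such as SAC, as noted above). The squared soft Bellman error used here is $\big(Q_\theta(s,a) - (r + \gamma \log\sum_{a'}\exp Q_\theta(s',a'))\big)^2$. Names like `q_net` and `demo_transitions`, and the gym-style `env` interface, are assumptions for illustration.

```python
import random
import numpy as np
import torch

def soft_bellman_error(q_net, batch, gamma=0.99):
    """Mean squared soft Bellman error:
    (Q(s, a) - (r + gamma * logsumexp_a' Q(s', a')))^2."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        v_next = torch.logsumexp(q_net(s_next), dim=1)   # soft value of the next state
        target = r + gamma * (1.0 - done) * v_next
    return ((q_sa - target) ** 2).mean()

def sqil_train(env, q_net, demo_transitions, steps=100_000, batch_size=256, lr=3e-4):
    # Demo buffer: expert transitions relabeled with reward +1.
    demo = [(s, a, 1.0, s2, d) for (s, a, s2, d) in demo_transitions]
    sample = []                                          # agent's own transitions, reward 0
    opt = torch.optim.Adam(q_net.parameters(), lr=lr)
    obs = env.reset()
    for _ in range(steps):
        # Act with the soft (Boltzmann) policy implied by the current Q.
        with torch.no_grad():
            logits = q_net(torch.as_tensor(obs, dtype=torch.float32))
        action = torch.multinomial(torch.softmax(logits, dim=-1), 1).item()
        next_obs, _, done, _ = env.step(action)          # environment reward is ignored
        sample.append((obs, action, 0.0, next_obs, float(done)))
        obs = env.reset() if done else next_obs

        half = batch_size // 2
        if len(sample) < half:
            continue                                     # wait until the sample buffer is usable
        # 50/50 mini-batch: half demo (+1 reward), half agent experience (0 reward).
        batch = random.sample(demo, min(half, len(demo))) + random.sample(sample, half)
        s, a, r, s2, d = (np.array(x) for x in zip(*batch))
        loss = soft_bellman_error(
            q_net,
            (torch.as_tensor(s, dtype=torch.float32),
             torch.as_tensor(a, dtype=torch.long),
             torch.as_tensor(r, dtype=torch.float32),
             torch.as_tensor(s2, dtype=torch.float32),
             torch.as_tensor(d, dtype=torch.float32)),
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
    return q_net
```

Because the 50/50 split is fixed, the density of +1 rewards in each update stays constant even as the agent's own reward-0 experience grows, which is what keeps pulling the agent back toward demonstrated states.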