Imitation Learning
- Policies are learned from expert-collected demonstrations
- Done via Behavior Cloning
- Becomes a supervised learning problem: mapping states to actions (a minimal sketch follows this list)
- Diffusion policy
- Good for multimodal action distributions
- Applied in offline RL
- Diffuser
- learns a denoising diffusion model over full trajectories, modeling both states and actions, in a model-based RL setting
- Decision Diffuser: compositionality over skills, rewards, and constraints; diffuses over states only and uses an inverse dynamics model (IDM) to extract actions from the plan
- Restricted to low-dim states
- UniPi
- Action Chunking with Transformers (ACT)
- Datasets of expert demos do not provide sufficient state-distribution coverage to effectively solve a given task
- Additional data exists in the form of action-free or suboptimal data (failed policy rollouts, play data, misc. environment interactions)
- BC cannot leverage this data as it assumes optimal actions
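A minimal sketch of behavior cloning as supervised learning, assuming a fixed dataset of expert (state, action) pairs; `PolicyNet`, `demo_states`, and `demo_actions` are illustrative names, not from any specific paper:

```python
import torch
import torch.nn as nn

# Behavior cloning sketch: treat imitation as supervised regression
# from states to the expert's actions.
class PolicyNet(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state):
        return self.net(state)

def behavior_cloning(policy, demo_states, demo_actions, epochs=100, lr=1e-3):
    """Fit the policy to expert (state, action) pairs with an MSE loss."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        pred = policy(demo_states)                      # predicted actions
        loss = nn.functional.mse_loss(pred, demo_actions)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy
```

Note that the MSE loss regresses toward the mean of the expert actions, which collapses multimodal action distributions; this is one motivation for the diffusion policies below.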
Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
- Robot action distributions are multimodal, have sequential correlation, and require high precision.
- Diffusion Policy: directly outputs an action sequence conditioned on visual observations, via $K$ denoising iterations.
- Able to express multimodal action distributions.
- Because sampling starts from (and injects) Gaussian noise, different samples can settle into different action modes.
- Handles a high-dimensional output space with stable training.
- Closed-loop action sequences: receding-horizon control (similar to MPC); continuously re-plans in a closed-loop manner (see the sketch after this list).
- Use of time-series diffusion transformer.
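A rough sketch, under assumptions, of the receding-horizon denoising loop described above. The noise-prediction network `eps_model`, the `obs_encoder`, and the DDPM-style schedule tensors `alpha`, `alpha_bar`, `sigma` are hypothetical interfaces, not the paper's actual API; the environment is assumed to follow a gym-style `reset`/`step` interface.

```python
import torch

@torch.no_grad()
def sample_action_sequence(eps_model, obs_feat, horizon, action_dim,
                           K, alpha, alpha_bar, sigma):
    """Reverse diffusion: start from Gaussian noise over a whole action
    sequence and denoise it for K iterations, conditioned on the observation."""
    a = torch.randn(1, horizon, action_dim)
    for k in reversed(range(K)):
        eps = eps_model(a, torch.tensor([k]), obs_feat)  # predicted noise
        a = (a - (1 - alpha[k]) / (1 - alpha_bar[k]).sqrt() * eps) / alpha[k].sqrt()
        if k > 0:
            a = a + sigma[k] * torch.randn_like(a)       # inject noise except at the final step
    return a.squeeze(0)

def receding_horizon_control(env, eps_model, obs_encoder, action_dim,
                             horizon=16, execute=8, **schedule):
    """Closed-loop (MPC-like) control: predict a sequence, execute a prefix, re-plan."""
    obs, done = env.reset(), False
    while not done:
        obs_feat = obs_encoder(obs)                      # visual observation features
        plan = sample_action_sequence(eps_model, obs_feat, horizon,
                                      action_dim, **schedule)
        for a in plan[:execute]:                         # execute only the first few actions
            obs, reward, done, info = env.step(a.numpy())
            if done:
                break
```

Executing only a prefix of the predicted sequence before re-planning gives the closed-loop, receding-horizon behavior, while the full horizon still enforces temporal consistency within each chunk.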
SQIL: Imitation Learning via Reinforcement Learning with Sparse Rewards
- training an agent to imitate an expert, given expert action demonstrations and the ability to interact with the environment.
- the agent does not observe a reward signal, cannot query the expert, and does not know the state-transition dynamics.
- BC uses supervised learning to imitate
- compounding errors cause the agent to drift from the demonstrated states and encounter OOD states
- the agent does not know how to return to the demonstrated states
- IRL partially solved this problem (GAIL paper)
- constructs a reward signal from demonstrations through adversarial training; difficult to implement and use in practice
- Adversarial methods encourage imitation by:
- providing an incentive to imitate the demonstrated actions in demonstrated states
- providing an incentive to take actions that lead it back to demonstrated states when it encounters new, OOD states.
- Greedy methods (e.g., BC) only provide the first incentive.
- This paper achieves the effect of the adversarial approach by using constant rewards (instead of learned rewards)
- give the agent a reward of +1 for matching the demonstrated action in a demonstrated state
- reward of 0 for all other behavior
- this method achieves both incentives:
- the first via the sparse +1 reward for following the demo
- the second by using RL (instead of supervised learning) for training, which propagates value back to states that lead toward the demonstrations
- initialize the agent with soft Q-learning
- initialize the experience replay buffer with the expert demos
- set rewards to +1 in the demo experiences, and set rewards to 0 in all new experiences the agent collects while interacting with the environment.
- Soft Q Imitation Learning (SQIL)
- Not continuous actions then?
- To implement continuous actions, switch to an actor-critic method (like SAC); any off-policy algorithm works.
- Algorithm (see the sketch after this list):
- Initialize two datasets: demo (filled with expert demonstrations) and sample (empty at start)
- Repeat until the Q network converges:
- Update the parameters $\theta$ of $Q$ using the squared soft Bellman error:
- Sample 50% of the data from demo and 50% from sample
- demo transitions have a reward of 1; sample transitions have a reward of 0
- Sample transitions with the imitation policy
- fill the sample buffer
- Do until convergence
- Regardless of how many new transitions have been collected, each update uses a 50-50 split between demo and sample
- Over time, the agent is incentivized to produce more and more demo-like state-action trajectories, effectively mimicking the expert.
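A sketch of the SQIL loop for discrete actions (the soft Q-learning case; continuous actions would swap in an off-policy actor-critic such as SAC, as noted above). The squared soft Bellman error used here is $\big(Q_\theta(s,a) - (r + \gamma \log\sum_{a'}\exp Q_\theta(s',a'))\big)^2$. Names like `q_net` and `demo_transitions`, and the gym-style `env` interface, are assumptions for illustration.

```python
import random
import numpy as np
import torch

def soft_bellman_error(q_net, batch, gamma=0.99):
    """Mean squared soft Bellman error:
    (Q(s, a) - (r + gamma * logsumexp_a' Q(s', a')))^2."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        v_next = torch.logsumexp(q_net(s_next), dim=1)   # soft value of the next state
        target = r + gamma * (1.0 - done) * v_next
    return ((q_sa - target) ** 2).mean()

def sqil_train(env, q_net, demo_transitions, steps=100_000, batch_size=256, lr=3e-4):
    # Demo buffer: expert transitions relabeled with reward +1.
    demo = [(s, a, 1.0, s2, d) for (s, a, s2, d) in demo_transitions]
    sample = []                                          # agent's own transitions, reward 0
    opt = torch.optim.Adam(q_net.parameters(), lr=lr)
    obs = env.reset()
    for _ in range(steps):
        # Act with the soft (Boltzmann) policy implied by the current Q.
        with torch.no_grad():
            logits = q_net(torch.as_tensor(obs, dtype=torch.float32))
        action = torch.multinomial(torch.softmax(logits, dim=-1), 1).item()
        next_obs, _, done, _ = env.step(action)          # environment reward is ignored
        sample.append((obs, action, 0.0, next_obs, float(done)))
        obs = env.reset() if done else next_obs

        half = batch_size // 2
        if len(sample) < half:
            continue                                     # wait until the sample buffer is usable
        # 50/50 mini-batch: half demo (+1 reward), half agent experience (0 reward).
        batch = random.sample(demo, min(half, len(demo))) + random.sample(sample, half)
        s, a, r, s2, d = (np.array(x) for x in zip(*batch))
        loss = soft_bellman_error(
            q_net,
            (torch.as_tensor(s, dtype=torch.float32),
             torch.as_tensor(a, dtype=torch.long),
             torch.as_tensor(r, dtype=torch.float32),
             torch.as_tensor(s2, dtype=torch.float32),
             torch.as_tensor(d, dtype=torch.float32)),
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
    return q_net
```

Because the 50/50 split is fixed, the density of +1 rewards in each update stays constant even as the agent's own reward-0 experience grows, which is what keeps pulling the agent back toward demonstrated states.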